Jun 11, 2026 · 6 min read
LARGE LANGUAGE MODELS
What large language models are, how they work, and what they can and can't do — explained for technology and business decision-makers.
A large language model is a neural network trained with machine learning to predict the next token in a sequence across massive text datasets — a deceptively simple objective that, at sufficient scale, produces systems capable of reasoning, code generation, translation, and synthesis. They are the engine behind the current wave of generative AI.
How LLMs Work
The key architectural innovation is the transformer, introduced in the 2017 paper "Attention Is All You Need." Transformers process entire sequences in parallel rather than token by token, which makes training on large datasets feasible. The attention mechanism lets the model weight relationships between any two tokens in a sequence regardless of distance — allowing it to track references, maintain context, and reason across long passages.
Training is next-token prediction at scale: given a sequence of text, predict what comes next. Run that objective across hundreds of billions of tokens, on thousands of GPUs of AI compute, for weeks or months, and the model learns representations of language, facts, reasoning patterns, and code. The emergent capabilities — instruction following, multi-step reasoning, tool use — were not explicitly programmed. They arise from scale.
Every major frontier AI system today is built on transformer architecture or a close derivative.
Capabilities and Limits
LLMs are strong at pattern matching, synthesis, fluent generation, and tasks with fuzzy or probabilistic answers. They are brittle at precise arithmetic, reliable factual retrieval, and tasks requiring strict logical consistency over many steps.
They hallucinate — generate plausible-sounding but false information — at a frequency that varies by model, task, and prompt design. This is not a bug to be patched; it is a structural property of probabilistic generation. Understanding hallucination rates and failure modes is essential for anyone deploying LLMs in production.
The Economics
Training a frontier model costs hundreds of millions of dollars in compute. Inference is far cheaper and falling. This creates a durable stratification: a handful of labs can afford to train frontier models; everyone else builds on top of APIs. The strategic question for most businesses is not whether to train their own model but which API to build on, how to manage lock-in risk, and where proprietary value can actually accumulate.
Context Length and Retrieval as Architecture Choices
One of the most consequential recent developments is the expansion of context windows — from a few thousand tokens in early GPT models to hundreds of thousands in current frontier models. This changes application architecture: problems previously requiring retrieval-augmented generation (chunking documents, embedding them, fetching relevant pieces at query time) can sometimes be solved by loading entire documents into context.
The tradeoff is cost and latency. Understanding when retrieval is necessary versus when long-context is sufficient is now a core engineering judgment, with direct implications for system complexity, cost at scale, and information freshness.
The Commoditization Trajectory
Frontier capabilities that required GPT-4 in 2023 are available in open-source models running locally in 2025. The commoditization curve is fast and predictable. For businesses building on LLM APIs, competitive advantage cannot rest on exclusive access to a particular model — it must rest on proprietary data, workflow integration, or distribution. The model is infrastructure; the economic moat is elsewhere.
Open Questions
- Where does scaling plateau, and what comes after next-token prediction as the core training objective?
- Can hallucination rates be reduced to levels acceptable for high-stakes domains like medicine and law without sacrificing fluency?
- How will the commoditization of frontier capabilities reshape the economics of the labs currently training them?
- What is the right division of labor between long-context retrieval and vector search as context windows continue to expand?
Part of the knowledge graph at The Best Blog Ever — reference definitions for ideas that matter.
Related Analysis
Jun 11, 2026 · 9 min read
Oct 12, 2024 · 5 min read