How we got to modern LLMs.
Every era inherited the last era's biggest limitation. Eight chapters from hand-written rules to autonomous AI agents, with a working demo for each one.
Hand-written rules
What happens when you try to write a rule for every sentence?
Before machine learning, every piece of "intelligence" was hand-coded. Humans wrote rules for every situation they could think of. The most famous example is ELIZA, built by Joseph Weizenbaum at MIT in 1966. It worked by pattern-matching your sentence and filling in templates.
Try it. Type something on the right. If you hit a pattern ELIZA knows, you get a reply that feels eerily like understanding. If you do not, you get a confession that the system has nothing.
That is the whole problem. Language has billions of valid sentences. Every novel input needs a new rule, written by hand, by someone who anticipated it. Rules cannot scale, and they cannot learn from data.
Words become numbers
What if a word's meaning could live at an address in space?
The first real breakthrough was treating words not as arbitrary symbols, but as points in a high-dimensional space. Word2Vec (Mikolov and colleagues at Google, 2013) trained a simple neural network to predict a word from its neighbors. GloVe (Pennington, Socher, and Manning at Stanford, 2014) factorized a word co-occurrence matrix. Different math, same idea: a word's meaning is its company.
Click any word on the right. You will see its four nearest neighbors. Words with similar meanings live in the same neighborhood. The famous geometry trick, king minus man plus woman, lands near queen. Press the button to watch it run.
A small caveat on that result: it works when the input words are excluded from the nearest-neighbor candidates. Without that filter, the closest vector is often just king itself.
But embeddings still treat every occurrence of a word the same way. Bank near river and bank near account get the same vector. The model cannot disambiguate by context. That is the wall.
A network that remembers
What if the model could read a sentence one word at a time?
If a single vector cannot capture context, the model has to read words in context, in order, while keeping track of what came before. That is what Recurrent Neural Networks do. They read a sentence one word at a time, and after each word the network updates a hidden state: a vector that summarizes everything it has read so far. LSTMs (Hochreiter and Schmidhuber, 1997) added gates that let the network choose what to remember and what to forget. GRUs (Cho and colleagues, 2014) simplified that further.
Press play on the demo. You will see the network read a six-word sentence. Watch the green bar under each word. That is how much of that word's signal survives by the time the network reaches the end.
Two problems. By the time you reach word twenty, the signal from word one is almost gone. And tokens have to be processed serially: word 1000 waits for words 1 through 999. GPUs sit idle.
A partial fix arrived in 2014 to 2015. Bahdanau, Cho, and Bengio added attention: instead of relying on the final hidden state alone, the model could look back at every word it had already read and decide which ones mattered. That idea was the seed of the next era.
Every word looks at every word
What if you threw away the reading order and let every token see every other?
In 2017, Vaswani and colleagues at Google published Attention Is All You Need. The proposal was radical: throw away recurrence entirely. Let every token look at every other token, all at once, and decide for itself what to pay attention to.
The mechanism is three vectors per token. Query("what am I looking for?"), Key ("what information do I have?"), and Value("here is my actual content"). To find what one token should focus on, you score its Query against the Key of every other token. Then you blend their Values, weighted by those scores. That is self-attention.
Click any token on the right. Watch how the model distributes its focus.
The classic example, also from the paper, is the pronoun puzzle: The animal didn't cross the street because it was too tired. Click it. Most of its attention flows to animal. The model figured out coreference without a rule encoding it.
Three things made this era. Parallelism across the sequence (GPU-friendly). Direct connections between distant tokens (no decay). And multi-head attention: many attention patterns running in parallel, each learning a different relationship. Training that took weeks with RNNs now took days.
Two philosophies, one architecture
Should the model see the future, or just the past?
The transformer architecture spawned two competing philosophies. BERT (Google, 2018: 110M parameters in base, 340M in large) and GPT (OpenAI, 2018: 117M). Both pre-train a transformer on massive text. They diverge on what they predict.
BERT masks fifteen percent of tokens and predicts them using context from both sides. GPT predicts the next token, left to right, hidden from the future by a causal mask.
The demo shows the same prompt to both. BERT sees the full sentence and fills the mask. GPT only sees what came before and predicts what comes next. The predictions overlap but diverge in interesting ways.
BERT is better at understanding tasks: classification, search, question answering. GPT is the architecture that scales to ChatGPT, because next-token prediction has infinite training data (every sentence on the internet) and the same model can do translation, summarization, and coding just by changing the prompt.
Just make it bigger
What happens if you scale the same model a hundred times, then a thousand?
OpenAI made a bet. They scaled GPT a hundred times, then a hundred times again. GPT-2 (1.5B parameters, 2019) wrote convincing prose. The company initially withheld the full model. GPT-3 (175B, 2020) did something nobody quite expected: new abilities emerged from raw scale.
Drag the slider. As model size grows past certain thresholds, capabilities switch on. In-context learning (show a few examples in the prompt, the model figures out the task). Chain-of-thought reasoning (add "let's think step by step," math performance jumps). Reliable code synthesis. None of these were programmed in. They appeared as side effects of scale.
Kaplan and colleagues (2020) formalized the scaling laws: model performance improves predictably with compute, data, and parameter count. This turned AI research into an engineering problem.
DeepMind's Chinchilla paper (2022, 70B parameters) corrected the field. Most models were undertrained. For every doubling of parameters, double the training data too. The race shifted from "biggest model" to "best-trained model."
Teaching the model what we want
How do you teach a model what 'good' means without writing rules for it?
Raw GPT-3 was powerful but chaotic. Ask it a question, and it might answer it, or continue the question, or write a poem about it. The training data was the whole internet: helpful answers, toxic rants, fiction, and nonsense, all mixed together.
ChatGPT (Ouyang and colleagues, InstructGPT, 2022) fixed this with RLHF: reinforcement learning from human feedback. The recipe runs in three steps. First, supervised fine-tuning on human-written ideal answers. Second, a reward model trained on humans ranking two outputs side by side. Third, the language model is optimized to maximize the reward.
Try the simplified version on the right. You are the reward model. Read both outputs. Pick which one a human would prefer. Now imagine doing that for tens of thousands of pairs.
The same GPT-3.5 model that already existed, once aligned, became a household name. By February 2023, a UBS analyst estimate pegged ChatGPT at roughly 100 million monthly users, making it the fastest-growing consumer app of its time.
When the model picks up tools
What happens when the model can do things, not just say things?
Today's models are not just text predictors. They see images, reason across 200,000+ tokens, call tools, run code, browse the web, and operate in autonomous loops. Four things changed since 2023.
Multimodal. GPT-4o, Claude 4, and Gemini natively process text, images, audio, and code in one model. The transformer architecture turns out to work for any sequence: pixels are just patches.
Long context. Context windows grew from 4K tokens (GPT-3) to 200K and beyond. That is an entire codebase, a full book, or a long meeting transcript in a single prompt.
Reasoning. Models like o1 and o3 spend more compute at inference time on harder problems. They generate intermediate steps before answering, and that dramatically improves math, logic, and coding.
Tool use. The model emits structured JSON for tool calls. A harness executes them and feeds the results back. The model can now do things, not just say them.
Watch the trace on the right. A user asks for the weather in Tokyo. The model thinks, emits a tool call, gets a result, and uses it. That loop is what makes agentic systems possible.
Open source kept up. Meta's Llama 3.1 405B (July 2024) is near GPT-4 level on many benchmarks. Mistral and Mixtral pushed Mixture-of-Experts. DeepSeek (V3 and R1) showed you do not need billions of dollars to train a frontier model.
The whole journey, in one column.
- 2013Word2VecWords become geometry.
- 2014GloVe + Seq2SeqCo-occurrence vectors and encoder-decoder.
- 2015Bahdanau attentionDecoder looks back. No more bottleneck.
- 2017TransformerAttention is all you need.
- 2018BERTBidirectional pre-training. 110M and 340M.
- 2018GPT-1Decoder-only pre-training. 117M.
- 2019GPT-2Convincing prose. 1.5B parameters.
- 2020GPT-3In-context learning emerges. 175B.
- 2020Scaling lawsPerformance is predictable in compute and data.
- 2022ChatGPT (RLHF)Alignment. Fastest consumer launch of its time.
- 2022ChinchillaData matters as much as parameters.
- 2023GPT-4Multimodal. Dramatic reasoning improvements.
- 2023Claude 2, Llama 2Open source catches up.
- 2024Claude 3, Gemini 1.5200K+ context. Tool use. Agent-capable.
- 2024o1 and reasoning modelsInference-time compute. Models that think longer.
- 2024Computer Use, MCPAnthropic ships both in October and November.
- 2025Claude 4, GPT-5Agentic by default.
- 2026Agentic eraLLMs as brains in multi-agent systems.
Each era inherited the last era's biggest limitation. Rules were brittle, so we let words become numbers. Static embeddings ignored context, so we built networks that read sequences. Sequences were slow and forgetful, so we let every token see every other. Powerful models were chaotic, so we taught them what good looked like. Chatbots were limited, so we gave them tools. Every ceiling becomes the next generation's starting line.