Saturday, 28 March 2026

How does an LLM like ChatGPT work anyway? Part 3 of a deep(er) dive on how large language models work from a technical perspective. What's a neural network, what is vector space, and how do words become numbers??

Something that is genuinely difficult for people to wrap their heads around when it comes to how large language models actually work is how something that is essentially a next-token predictor could be so incredibly complex. Part of the answer, and it makes it sound almost too simple, is that everything in these systems is just numbers. Words are numbers, meaning is geometry, relationships are patterns in space. But this also explains how all of this can work, because fundamentally, computers only process numbers.

So the first problem that a language model has to solve is: how do you turn words into numbers in a way that preserves meaning? The naive answer is to give every word an ID, right? So cat is 123 and dog is 124. But then you've basically just encoded identity, not meaning, because the words don't have any relationship to each other. But what if you gave every word a list of hundreds of numbers, so that each word becomes a position in multidimensional space? That way, words that appear in similar contexts can be placed near each other, so cat and dog might end up close, but cat and thunderstorm not so much. An important thing is that those positions aren't pre-programmed or hand-designed; they come from the training process, and part of that training process is what I was describing in the last video.

A really famous example in computational linguistics for explaining vector space is king minus man plus woman equals queen. This doesn't mean the model has some conceptual understanding of royalty or gender; it's just that the patterns in the text produce this geometric structure. And if you think about the context of that massive amount of text, that makes total sense: king and queen are associated with male and female pronouns, but are otherwise used quite similarly in context.
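To make the "position in multidimensional space" idea concrete, here's a minimal sketch in Python. The four-dimensional vectors are invented toy values (real models learn hundreds or thousands of dimensions during training); the point is just that closeness in the space can be measured, and cat lands nearer to dog than to thunderstorm.

```python
import math

# Toy 4-dimensional "embeddings" -- invented values purely for illustration.
# In a real model these numbers are learned, not hand-assigned.
embeddings = {
    "cat":          [0.9, 0.8, 0.1, 0.0],
    "dog":          [0.8, 0.9, 0.2, 0.1],
    "thunderstorm": [0.0, 0.1, 0.9, 0.8],
}

def cosine_similarity(a, b):
    """How closely two vectors point the same way (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity(embeddings["cat"], embeddings["dog"]))           # high
print(cosine_similarity(embeddings["cat"], embeddings["thunderstorm"]))  # low
```

Cosine similarity is one common way to compare embeddings; the exact numbers don't matter, only that "similar context" words end up with similar directions.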
An analogy I've heard before is latitude and longitude: those don't tell you anything about a city's history or culture, but they do tell you how cities are spatially related to each other.

So the words are now numbers in space, but they still have to be transformed into a prediction, and that transformation is done by a neural network. That's a huge mathematical function with trillions of settings called parameters, which are sort of like dials that can be adjusted and nudged. The training process I was describing in the last video, where the model adjusts in response to errors, is exactly that nudging. And that happening trillions and trillions and trillions of times is what makes the model so good at predicting things. There's also a bunch of layers stacked on top of each other, getting increasingly abstract.

Of course, another really important thing is that words don't always mean the same thing in context. That's actually a really hard problem: how does the model figure that out? And that became really important, so that's where we'll go next.
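Before moving on, the "dials adjusted in response to errors" loop above can be sketched at the smallest possible scale: one made-up dial instead of trillions, fit to a made-up rule (y = 3x). This is a hypothetical miniature of gradient descent, not how any real LLM is configured.

```python
# A one-parameter "model": predict y = w * x, where the true rule is y = 3 * x.
# Each mistake nudges the dial (w) slightly in the direction that shrinks the error --
# the same adjust-in-response-to-errors loop, scaled down from trillions of dials to one.
data = [(1, 3), (2, 6), (3, 9)]
w = 0.0                # the dial starts at an arbitrary setting
learning_rate = 0.01   # how hard to nudge per mistake

for _ in range(1000):              # many passes over the data
    for x, y in data:
        prediction = w * x
        error = prediction - y
        w -= learning_rate * error * x   # nudge the dial against the error

print(round(w, 2))  # the dial settles near 3
```

No single nudge teaches the model anything; it's the accumulation of tiny corrections that moves the dial to the right setting.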


That’s an excellent and very accurate explanation of how large language models (LLMs) actually work — you’ve captured the essence of the concept. Let me build on what you’ve said and unpack why something that’s “just predicting the next token” ends up producing reasoning, creativity, and what looks like understanding.


Why “just predicting the next word” becomes so powerful

The core task is indeed next-token prediction, but that task implicitly forces the model to learn an enormous amount of structure and knowledge about the world. To predict the next word in a sentence like:

“After the thunder, we saw a flash of …”

the model can’t just memorize strings — it has to understand that “flash of lightning” makes more sense than “flash of sandwiches.” That requires internalizing relationships among concepts (weather, light, sound, physics, etc.), which over trillions of examples builds a kind of statistical world model encoded in those multidimensional patterns.
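The crudest possible version of next-token prediction is just counting which word follows which in some text (a bigram model). This toy stand-in is nothing like a real LLM, which uses a deep network rather than a count table, but it shows how "prediction" falls out of statistics over a corpus; the corpus below is invented for the example.

```python
from collections import Counter, defaultdict

# Tiny invented corpus, just to illustrate the counting idea.
corpus = (
    "after the thunder we saw a flash of lightning . "
    "a flash of lightning lit the sky . "
    "we ate a plate of sandwiches ."
).split()

# Count how often each word follows each other word.
follows = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    follows[prev][nxt] += 1

def predict_next(word):
    """Return the continuation seen most often after this word."""
    return follows[word].most_common(1)[0][0]

print(predict_next("of"))  # -> "lightning" (seen twice, vs. "sandwiches" once)
```

A count table breaks down immediately on sentences it has never seen; that gap between memorizing strings and generalizing is exactly what the learned vector representations and deep networks exist to close.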


Why everything being “just numbers” matters

As you described, this is all possible because everything — words, meanings, relationships — is encoded in high-dimensional vector spaces.

  • Simple ID numbers (like “cat = 123”) carry no meaning.

  • But embeddings — long lists of numbers — do capture meaning because they represent how words co-occur across billions of contexts.

So geometrically, the model’s “understanding” is spatial: words and sentences occupy regions in a kind of conceptual landscape, and distances and directions correspond to semantic relationships. When you subtract and add these vectors, you’re moving meaning around in that space — hence the famous king – man + woman = queen example.
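The king – man + woman = queen arithmetic can be reproduced with hand-made toy vectors. The three dimensions below are invented (loosely standing in for royalty, maleness, femaleness) purely to show the mechanics; real embeddings are learned, and no dimension has a clean human label.

```python
# Hand-made 3-dimensional toy vectors -- illustrative values, not learned ones.
vectors = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
}

def add(a, b):
    return [x + y for x, y in zip(a, b)]

def sub(a, b):
    return [x - y for x, y in zip(a, b)]

def nearest(target, words):
    """Find the word whose vector is closest (squared Euclidean distance) to target."""
    def dist(w):
        return sum((x - y) ** 2 for x, y in zip(vectors[w], target))
    return min(words, key=dist)

result = add(sub(vectors["king"], vectors["man"]), vectors["woman"])
print(nearest(result, vectors))  # -> "queen"
```

Subtracting "man" removes the maleness direction, adding "woman" contributes the femaleness direction, and the royalty direction is carried along untouched, so the nearest word to the result is "queen".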


What the neural network actually does

Those embeddings feed into a massive neural network, typically a Transformer, which processes sequences of tokens. Each layer looks at the relationships among tokens across different positions in a text (this is called self-attention). Deeper layers re-represent the input in more abstract ways — starting from surface-level syntax, moving up to semantics, reasoning, and style.

The key is that these layers are not hand-designed; they emerge from training. The network’s parameters — trillions of them in models like GPT-5 — get finely tuned so that every next-token prediction shifts the model slightly closer to understanding language structure and meaning.
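The self-attention step mentioned above can be stripped down to a few lines. Real Transformers first project each token into separate learned query, key, and value vectors; in this sketch the raw token vectors play all three roles, and the 2-d vectors are invented, just to show the mechanism of weighting and mixing.

```python
import math

# Three toy 2-d token vectors -- invented values for illustration only.
tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]

def softmax(xs):
    """Turn raw scores into weights that are positive and sum to 1."""
    exps = [math.exp(x) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Mix the values, weighted by how well the query matches each key."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    weights = softmax(scores)
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(d)]

# Each token's new representation is a context-dependent blend of all the tokens.
contextualized = [attend(t, tokens, tokens) for t in tokens]
print(contextualized[0])
```

The output for each token is no longer the word in isolation but a blend shaped by every other token in the sequence, which is what lets later layers build the more abstract representations described above.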


Why context and ambiguity are solvable

Words change meaning based on context — e.g. “bank” (riverbank vs. financial institution). The self-attention mechanism helps with that: it examines surrounding words to infer which sense of “bank” fits best. The model doesn’t know what a “bank” is in the human sense — it’s not conscious — but statistically, it has seen enough data to position meanings correctly depending on their textual environment.
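The "bank" disambiguation idea can be caricatured as picking whichever sense vector sits closest to the average of the surrounding words' vectors. All the vectors below are invented two-dimensional toys, and a real model resolves this inside its attention layers rather than with an explicit sense inventory; this is only a sketch of the geometric intuition.

```python
# Two invented sense vectors for "bank" and a few invented context-word vectors.
sense_vectors = {
    "bank/river":   [0.9, 0.1],
    "bank/finance": [0.1, 0.9],
}
context_vectors = {
    "fishing": [0.8, 0.0],
    "muddy":   [0.7, 0.1],
    "loan":    [0.0, 0.9],
    "deposit": [0.1, 0.8],
}

def disambiguate(context_words):
    """Pick the sense whose vector best aligns with the averaged context."""
    avg = [sum(context_vectors[w][i] for w in context_words) / len(context_words)
           for i in range(2)]
    def score(sense):
        return sum(x * y for x, y in zip(sense_vectors[sense], avg))
    return max(sense_vectors, key=score)

print(disambiguate(["fishing", "muddy"]))  # -> "bank/river"
print(disambiguate(["loan", "deposit"]))   # -> "bank/finance"
```

The same word lands on different senses purely because its textual neighborhood pulls the comparison in a different direction, which is the statistical positioning described above.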


A quick metaphor

Imagine you had to complete every unfinished sentence you’ve ever read online, and you got graded on how natural your endings sounded. Over time, you’d start absorbing grammar, facts, story patterns, and even tone — not because you were told to “learn,” but because it’s impossible to do the task well without developing internal structure that mirrors meaning. That’s exactly what large language models do, but on a scale beyond human comprehension.


Would you like me to continue from this point and explain how context windows and attention layers actually handle meaning dynamically across long passages (the “that became really important, so that’s where we’ll go next” part)?
