Has anyone ever wondered what on earth a transformer is? Not that one. We're learning how large language models work. In my last video, we talked about self-attention, the mechanism where every word, or token, can look at every other one in its context to figure out what matters. That is the core component of what we call a transformer, which is the full architecture that modern language models rely on: they take tokens in and they output predictions.

This was actually a huge breakthrough, and a pretty recent one, in 2017. Before that, language models operated sequentially. A recurrent neural network, for example, would process word one, then word two, then word three. Even when attention was used, it sat on top of this sequential layer. That meant you couldn't process token five until token four had finished, so these models were slow and hard to scale.

But then a paper came along. For those of you who are impressed by this kind of thing, it was published in 2017 and has been cited over 230,000 times. The title is "Attention Is All You Need," and it came from researchers at Google. They proposed something radical: get rid of sequential processing entirely. If you build the model around attention, it can process the entire sequence in parallel. Because nothing has to wait for anything else, you can split the training work across thousands of chips simultaneously. So it's faster, and it also means the more chips you add, the faster it gets. That parallelism is the only way we've been able to get to the kind of large language models we have today.
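To make the "every token looks at every other token, all at once" idea concrete, here is a minimal numpy sketch of scaled dot-product self-attention, the operation at the heart of the transformer. The variable names, toy dimensions, and random weights are my own illustration, not code from the video or the paper.

```python
import numpy as np

def self_attention(X, W_q, W_k, W_v):
    """Scaled dot-product self-attention over a sequence of token vectors.

    X: (seq_len, d_model) token embeddings
    W_q, W_k, W_v: (d_model, d_k) learned projection matrices
    """
    Q = X @ W_q          # queries: what each token is looking for
    K = X @ W_k          # keys: what each token offers to others
    V = X @ W_v          # values: the content that gets mixed together
    d_k = Q.shape[-1]
    # One matrix multiply scores every token against every other token,
    # so the whole sequence is processed in parallel, nothing is sequential.
    scores = Q @ K.T / np.sqrt(d_k)                            # (seq_len, seq_len)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)    # softmax over each row
    return weights @ V                                         # weighted mix of values

# Toy usage with random weights (illustration only, not a trained model)
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 16))       # 5 tokens, 16-dimensional embeddings
W_q, W_k, W_v = (rng.normal(size=(16, 16)) for _ in range(3))
out = self_attention(X, W_q, W_k, W_v)
print(out.shape)                   # (5, 16): one updated vector per token
```

Because the attention weights come from a single matrix product rather than a step-by-step loop over tokens, the same computation scales naturally across many chips, which is the property the paper exploited.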