This is an attempt at a relatively non-technical explainer for anyone curious about how today’s AI models actually work – and why some of the same ideas that made them so powerful may now be holding them back.
In 2017, a paper by Vaswani et al., titled “Attention Is All You Need”, introduced the Transformer model. It was a genuinely historic paper. There would be no GenAI without it. The “T” in GPT literally stands for Transformer.
Why was it so significant?
“Classical” neural-network-based AI – recurrent networks, for example – works a bit like playing Snakes & Ladders: processing one token at a time, building up understanding gradually.
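If it helps to see the “one step at a time” idea concretely, here’s a deliberately tiny NumPy sketch of recurrent-style processing. It isn’t any real model – the weights are random and the sizes made up – it just shows each token being folded into a single running state, strictly in order:

```python
import numpy as np

rng = np.random.default_rng(0)

def sequential_pass(tokens, hidden_size=8):
    """Toy step-by-step processing: each token is folded into one running
    state, one after another (roughly how recurrent networks work)."""
    W_in = rng.standard_normal((hidden_size, hidden_size))
    W_state = rng.standard_normal((hidden_size, hidden_size))
    state = np.zeros(hidden_size)
    for tok in tokens:                        # one "dice roll" at a time
        state = np.tanh(W_in @ tok + W_state @ state)
    return state                              # later tokens see earlier ones only through this state

tokens = [rng.standard_normal(8) for _ in range(5)]   # five made-up 8-dimensional tokens
print(sequential_pass(tokens).shape)                  # (8,)
```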
Transformers allow every data point (or token) to connect directly with every other. Suddenly, the board looks more like chess – everything is in view, and relationships are processed in parallel. It’s like putting a massive turbocharger on the network.
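By contrast, here’s an equally tiny, purely illustrative sketch of the core attention step (the learned projections and multiple heads of a real Transformer are left out): every token is scored against every other token, and the whole thing is computed in parallel as matrix operations.

```python
import numpy as np

def self_attention(X):
    """Bare-bones self-attention: score every token against every other token,
    then let each token take a weighted mix of all the others.
    X has shape (n_tokens, d); the usual learned projections are omitted."""
    d = X.shape[1]
    scores = X @ X.T / np.sqrt(d)                     # (n, n): one score per pair of tokens
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)     # softmax: each row sums to 1
    return weights @ X                                # every output mixes information from all tokens

X = np.random.default_rng(0).standard_normal((5, 8))  # 5 tokens, 8 dimensions each
print(self_attention(X).shape)                        # (5, 8)
```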
But that strength is also its weakness.
“Attention” forces every token to compare itself with every other token. As inputs get longer and models get larger, the computational cost climbs fast – and with input length in particular, it grows quadratically. Double the input, and the work roughly quadruples.
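A quick back-of-the-envelope calculation shows why: the attention score table has one entry for every pair of tokens, so its size is n × n.

```python
# One comparison per pair of tokens: n tokens means n * n entries in the score table.
for n in (1_000, 2_000, 4_000, 8_000):
    print(f"{n:>5} tokens -> {n * n:>12,} token-pair comparisons")
# 1,000 tokens -> 1,000,000 comparisons; 2,000 -> 4,000,000; each doubling quadruples the work.
```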
And throwing more GPUs or more data at the problem doesn’t just give diminishing returns – it can lead to negative returns. This is why, for example, one of the latest “mega-models”, GPT-4.5, reportedly performs worse than its predecessor GPT-4o in certain cases. Meta is also delaying its new Llama 4 “Behemoth” model – reportedly due to underwhelming performance, despite huge compute investment.
Despite this, much of the current GenAI narrative still focuses on more: more compute, more data centres, more power – and I have to admit, I struggle to understand why.
Footnote: I’m not an AI expert – just someone trying to understand the significance of how we got here, and what the limits might be. Happy to be corrected or pointed to better-informed perspectives.