
I’ve been writing software for more than 25 years, and the one habit that’s never left me is asking why. Why does this work? Why does that break? Why are we still doing it this way when it feels like there should be a better way that nobody’s bothered to try? Most of the time the answer is boring, and I move on. Every so often the answer is interesting, and that’s the part I keep showing up for.
Lately I’ve been chasing a question about Large Language Models, and I wanted to write down where it led me, because I think the conclusion matters to more people than just me.
The question started simple. When a newer, stronger model lands and clearly outperforms the one before it, what actually changed? My first guess was the obvious one. It must be bigger. More parameters, more weights, more raw size. That was certainly the story a few years ago. But the more I dug, the more I learned that “bigger” is probably the least interesting part of the gap now, and that surprised me enough to rethink the whole thing.
It turns out a modern jump in capability is rarely one change. It’s a stack of improvements that compound, and raw size is often near the bottom of that stack.
The biggest lever is usually the data. Not just more of it, but better. Clean data for all! Higher density, less duplicated junk, more carefully constructed examples of the kind of reasoning you actually want the model to learn. Two models of identical size trained on different data can end up far apart in capability. After that comes everything layered on top of the base model once the heavy training is done, the part that shapes how the model behaves and reasons. A large share of what feels like “this one is smarter” lives there, not in the underlying network. And more recently there’s the idea of simply letting a model think for longer before it answers, spending more effort at the moment you ask the question rather than baking everything into the weights ahead of time. That alone produces a real jump, and it barely touches the core design.
So the honest answer to “is the new one better because it’s bigger” is mostly no. It’s better because a dozen things each got a little better and those gains multiplied. I find that encouraging, because it means progress isn’t gated purely behind whoever owns the most hardware.
Here’s where my favorite part of the story lives. Every field has a moment where one idea reshapes everything that comes after it. For this field, that idea arrived in 2017 with a single paper on a design called the transformer. Before it, the standard approach read text one piece at a time, in order, which made it slow to train and hard to scale. The transformer changed that, and almost everything since has grown out of that one seed.
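To make “one piece at a time, in order” concrete, here is a toy sketch in Python. It is not the transformer from the paper or any real library’s API: `recurrent_states`, `attention_row`, and `attention_all` are made-up names, the inputs are plain numbers rather than vectors, and there are no learned weights. It only shows the shape of the dependency. In the first function every step has to wait for the one before it. In the others, each output position is computed from the inputs alone, so the positions can be worked out in any order, or all at once.

```python
import math


def recurrent_states(xs, w=0.5):
    """Toy recurrent update: each state depends on the previous state,
    so the loop has to run strictly in order."""
    h = 0.0
    states = []
    for x in xs:
        h = math.tanh(w * h + x)  # needs the h from the previous step
        states.append(h)
    return states


def attention_row(i, xs):
    """Toy attention-style mix for position i (assumption: scalar inputs,
    no learned projections). It reads only the inputs, never another
    position's output."""
    scores = [xs[i] * xj for xj in xs]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return sum(wt * xj for wt, xj in zip(weights, xs)) / total


def attention_all(xs, order=None):
    """Compute every position, in whatever order the caller likes."""
    order = range(len(xs)) if order is None else order
    out = [None] * len(xs)
    for i in order:
        out[i] = attention_row(i, xs)
    return out


inputs = [0.5, -1.0, 2.0, 0.25]  # arbitrary illustrative numbers
front_to_back = attention_all(inputs)
back_to_front = attention_all(inputs, order=reversed(range(len(inputs))))
print(front_to_back == back_to_front)
```

Working through the second version front to back or back to front gives the same result, because no position waits on another. The first version offers no such freedom.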
What I find fascinating, and what I had wrong at first, is why it won. I assumed the winning idea must’ve been the cleverest one in the room. It wasn’t. The transformer won because it could be split across many processors at once, which meant it could ride the wave of cheap parallel hardware and enormous amounts of text that happened to be arriving at exactly the same time. The seed wasn’t the smartest seed. It was the one that landed in fertile ground.
And the more I sat with that, the more I saw the same pattern hiding further back in the story. Neural networks, the foundation all of this is built on, aren’t a new idea at all. They’d been around for decades, mostly set aside as a curiosity that never quite worked. The idea was never the problem. The world just hadn’t caught up to it yet, because the machines were too slow and there was nowhere near enough data to feed them. Then the hardware got fast, the data got huge, and that same old idea woke up and changed everything. It didn’t get any smarter while it waited. Its time simply came.
So have we hit a wall, where the technology itself is the real limit? I don’t think so, and there’s a simple reason I keep coming back to. There’s a working example of radically better intelligence sitting inside every human skull. The brain runs on roughly 20 watts, about the power of a dim light bulb. It learns a language from tens of millions of words, not trillions. And it generalizes better than anything we’ve built. That isn’t wishful thinking on my part. It’s proof that something far more efficient is physically possible, because it already exists.
I want to be careful here, because this is where I’ve got to check my own enthusiasm. The brain’s efficiency is partly a trick. It didn’t start from nothing. It arrived already shaped by hundreds of millions of years of evolution, with a great deal of expensive groundwork already done before any single person ever learns a word. So when we try to build an efficient learner from scratch, we’re quietly trying to do both jobs at once, the slow shaping and the fast learning. That’s a big part of why this is hard, and it’s worth respecting rather than waving away.
There’s an essay in this field, Rich Sutton’s “The Bitter Lesson,” that I think every person with my instincts needs to sit with. The pattern it describes is this. Over and over, researchers built elegant systems that encoded their own hard-won human understanding of a problem. And over and over, a simpler, more general method with more computing power behind it eventually beat them. Cleverness lost to scale, again and again.
At first that reads like bad news for someone like me, someone whose whole pitch is “we just need a smarter idea.” But I think the real lesson is more subtle, and it sharpened my thinking rather than discouraging it. The ideas that win aren’t the most intricate ones. They’re the ones that let you remove human cleverness and pour in scale more effectively. The transformer didn’t win by being complicated. It won by being simpler and more general and easier to feed. The prize was never the cleverest architecture. The prize is the architecture that gets the most out of every unit of computing and data. Those are very different things to go looking for.
This is the part that changed my mind the most. I’d assumed the field was starved for new seed ideas, and that the job was to go think of one. But that isn’t quite right. There are thousands of new ideas every year. Several promising candidates are alive right now, sitting in papers, waiting. The shortage isn’t ideas.
The real bottleneck is the filter. You usually can’t tell whether an idea will work at full scale until you actually try it at full scale, and trying it at full scale is so expensive that almost nobody can afford to test their seed properly. Most ideas that look beautiful on a small example fall apart for reasons you can’t see until you’re already deep in. The transformer itself looked unremarkable next to flashier ideas of its day. So the scarce skill isn’t having the idea. It’s the judgment to bet on which plain-looking seed will actually bloom, and the means to find out.
That tells me where the frontier really is. We don’t only need new ideas. We need cheaper ways to test them, and a sharper understanding of why some designs keep scaling while others stall. Whoever figures out how to know if a seed will bloom without spending a fortune to find out will change this field as much as any new design would.
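One modest, existing way to get a cheaper signal is to train a design at several small sizes, fit a curve to the results, and extrapolate. Here is a minimal sketch of that idea in Python, under heavy assumptions: the numbers are synthetic, generated from a made-up power law (`TRUE_A`, `TRUE_B`), not measurements from any real model, and `fit_power_law` and `predict` are hypothetical helper names. Real training curves are noisier and can bend away from the fitted line, which is exactly why this is a filter you still have to doubt.

```python
import math


def fit_power_law(sizes, losses):
    """Least-squares straight line in log-log space.
    Assumes loss is roughly a * size ** (-b); returns (a, b)."""
    xs = [math.log(s) for s in sizes]
    ys = [math.log(v) for v in losses]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope


def predict(a, b, size):
    """Extrapolate the fitted curve to a size that was never trained."""
    return a * size ** (-b)


# Synthetic stand-ins for "small, cheap runs" (not real measurements).
TRUE_A, TRUE_B = 10.0, 0.3
small_sizes = [1e3, 1e4, 1e5, 1e6]
small_losses = [TRUE_A * s ** (-TRUE_B) for s in small_sizes]

a, b = fit_power_law(small_sizes, small_losses)
predicted_large = predict(a, b, 1e9)  # a guess about a run nobody paid for
print(a, b, predicted_large)
```

On clean synthetic data the fit recovers the curve it was generated from, so the extrapolation looks trustworthy. The hard part in practice is knowing when a real design will stop following its own small-scale trend.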
I’m not a machine learning researcher. There’s an enormous amount I don’t know, and I’ve spent enough years being wrong about things to hold my own opinions loosely. But 25 years of asking why has taught me one thing I trust completely. The instinct to question the whole foundation of how something is built, to look at the accepted way and ask whether we could throw it out and start over, isn’t a sign that you don’t belong in the room. It’s the exact instinct that produced every seed idea I admire. The people who built the transformer were asking my naive question about the systems that came before it.
You don’t need to be the smartest person in the room to plant a seed that matters. What you need is enough fluency in the tools to know what’s already been tried and why it failed, so your question lands on fresh ground instead of ground that was dug up years ago. That part is learnable. It just takes the willingness to keep studying, and the humility to keep asking why long after people expect you to already know the answer.
We haven’t hit the wall. We’re nowhere near it. The limit right now isn’t the technology. It’s imagination, judgment, and the patience to keep asking better questions. That’s a problem I’m happy to spend the rest of my career on.