
Faster horses, not trains

I’ve been trying to work out why successive advances in GenAI models don’t feel particularly different to me, even as others react with genuine excitement.

I use these tools constantly and have done since GPT-4 was released nearly 3 years ago. I couldn’t imagine a world without them. In that sense, they already feel as transformative as the web. Perhaps it’s simply that once they become ambient, the magic fades. You get used to them and stop noticing improvements. But the more I’ve thought about it, the more I think there are deeper structural reasons why the experience has plateaued, for me at least.

The lossy interface

All meaningful work starts in a physical, social, constraint-filled environment. We reason with space, time, bodies, artefacts, relationships, incentives, and history. Much of this understanding is tacit. We sense it before we can explain it.

To involve a computer, that reality has to be translated into symbols. Text, files, data models, diagrams, prompts. Every translation step compresses context and/or throws information away. There is loss from brain to keyboard. Loss from keyboard to prompt. And loss again when the output comes back and has to be interpreted.

GenAI only ever sees what makes it across that boundary. It reasons over compressed representations of reality that humans have already filtered, simplified, and distorted.

Better models reduce friction within that interface, but they don’t change its dimensionality. In that respect it doesn’t really matter how “smart” the models get, or how well they do on the latest benchmarks. The boundary stays the same.

Because of that, GenAI works best where the world is already well-represented in digital form. As soon as outcomes depend on things outside its boundary, its usefulness drops sharply.

That is why GenAI helps with slices of work, not whole systems. It is powerful, but fundamentally bounded.

Some real world examples:

  • In software development, generating code hasn’t been the main bottleneck since we moved away from punch cards. The far bigger constraints are understanding the problem, communicating with stakeholders, working effectively with other people, designing the system, managing risks and trade-offs, and operating systems in complex social environments over time.
  • In healthcare, GenAI can assist with diagnosis or documentation, but outcomes are dominated by staff, facilities, funding, and coordination across complex human systems. Better reasoning does not create more nurses or hospital beds.

In both cases, GenAI accelerates parts of the work without shifting the underlying constraint.

Faster horses, not trains

In that respect, GenAI feels like faster horses rather than trains. It makes us more effective at things we were already doing – writing, coding, analysis, planning, and sense-making – but it operates on only parts of systems, thin slices at a time.

Trains didn’t just make transport faster. They removed a hard upper bound on the movement of people and goods. Once that constraint moved, everything else reorganised around it. Supply chains, labour markets, cities, timekeeping, and even how people understood distance and work all changed. Railways were not just a tool inside the system, they became the system.

GenAI doesn’t yet do that. It works through a narrow, virtual interface and plugs into existing workflows. But as often as not, the real systemic constraints lie elsewhere.

What actually changed the world

A recent conversation reminded me of Vaclav Smil’s How the World Really Works, which I read last year.

Smil highlights that modern civilisation rests on a small number of physical pillars: energy, food production (especially nitrogen), materials like steel and cement, and transport. Changes in these pillars are what led to the biggest transformations in human life. Information technology barely registers at that level in his analysis. He doesn’t deny its importance, but treats it as secondary, an optimiser of systems whose limits are set elsewhere.

Through that lens, GenAI doesn’t (yet) register as a civilisation-shaping force. It doesn’t produce energy, grow food, create new materials, or move mass. It operates almost entirely above those pillars, improving coordination, design, and decision-making around systems whose hard limits are set elsewhere.

That doesn’t make it trivial. But it explains why, so far, it looks closer to previous waves of information technology than to steam or electricity. It optimises within existing constraints rather than breaking them.

The big if

Smil’s framing doesn’t say GenAI cannot matter at an industrial scale. It says where it would have to show up. GenAI becomes civilisation-shaping only if it materially accelerates breakthroughs in those physical pillars – things that change what the world can physically sustain.

This is where “superintelligence” comes in. If GenAI can explore hypothesis spaces humans cannot, design and run experiments, or compress decades of scientific iteration into years, resulting in major scientific breakthroughs, it moves from optimising within constraints to changing them.

This is also where my own doubts sit. Many think that simply scaling what we have now will get us there. Those who don’t believe that, but are still optimistic about AI’s potential, turn instead to world models, embodiment, or agents that can act in the real world. There are sketches and hopes for how this might happen, but as yet, not much more than that.

So while superintelligence is the path by which AI could plausibly become industrial-scale transformative, it’s a long and uncertain one.

What kind of change are we talking about?

If we mean web-scale change, then GenAI is already there. But if we mean the kind of change associated with the industrial revolution (as it’s often compared to) – longer lives, better health, radically different working conditions, step changes in material living standards – then what we have today does not qualify. Historically, those shifts followed from breaking physical constraints, not from better information or reasoning alone.

This, for me, is why successive model improvements don’t feel like much. It isn’t that GenAI lacks value; it’s that those improvements don’t change the shape of what’s possible. They operate within the same narrow, lossy interface, so they barely register in practical terms. GenAI still adds value, and already feels web-scale transformative. But until that boundary moves, or something else breaks the underlying constraints, model improvements don’t feel like steps toward an industrial-revolution-scale shift.

More with less, or is it more with the same?

The crude clickbait narrative is that AI means job cuts, replacing roles. But when I look at how AI is actually being used in real organisations, it seems more likely to expand capacity than to reduce headcount. Many organisations may end up doing more with the same long before they can credibly do the same (or more) with less.

This thought started for me with an observation – AI is not substituting for whole roles; we’re getting micro-specialists that can do slices of work. In software you see agents for tests, code review, planning. Other sectors look much the same. Legal teams using AI for drafting. Sales teams for outreach. Finance for reconciliation. Tools handling tasks, not outcomes, and someone still has to stitch the pieces together.

There are (at least) three forces I can think of that matter when asking whether organisations will genuinely be able to do more with less:

1. How automatable the work already is.
Where the work is rules based, high volume, and low variation, AI may replace labour in the same way classic automation has. Think claims processing, simple customer support, structured back office workflows. These functions already lived close to the automation frontier. AI just expands the frontier a bit.
This will reduce headcount, but mostly in places where headcount has been under pressure for decades anyway.

2. How much the organisation can absorb increased output.
Most professional work is not constrained by how fast someone types or drafts. It is constrained by coordination, sequencing, ambiguity, stakeholder alignment, and quality. Software is a good example. So is legal, consulting, product, sales. If you cut the number of lawyers because drafting is faster, you will simply overload the remaining lawyers with negotiation, risk, and client work.

3. The cost and consequences of mistakes.
In many industries, the limiting factor is not productivity, but risk. Healthcare, aviation, finance, law. Increased throughput also increases the risk surface area. If AI increases the probability or cost of an error, you cannot shrink the team. You often need more human oversight, not less.

If you put these together, the more likely outcome is this:

  • Some operational functions will shrink, but these were already at risk of automation.
  • Most knowledge work will shift toward more with the same, not less.
  • Some domains will accidentally create more with more, because oversight and correction absorb the gains.

AI is still making code worse: a new CMU study confirms it

In early 2025 I wrote about GitClear’s analysis of the impact of GenAI on code quality, based on 2024 data, which showed a significant degradation in code quality and maintainability. I recently came across a new study from Carnegie Mellon, “Does AI-Assisted Coding Deliver? A Difference-in-Differences Study of Cursor’s Impact on Software Projects” that looks at a more recent period, tracking code quality in projects using GenAI tools up to mid-2025. So has code quality improved as the models and tools have matured?

The answer appears to be no. The study finds that AI briefly accelerates code generation, but the underlying code quality trends continue to move in the wrong direction.

How the study was run

Researchers at Carnegie Mellon University analysed 807 open source GitHub repositories that adopted Cursor between January 2024 and March 2025, and tracked how those projects changed through to August 2025. Adoption was identified by looking for Cursor configuration files committed to the repo.
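
I haven’t seen the researchers’ scripts, so purely as an illustration, here is a rough sketch of how adoption might be inferred from committed Cursor configuration files. The marker file names and the git plumbing are my assumptions, not details taken from the paper.

```python
# Hypothetical sketch: find the first commit in a repo that added a Cursor
# configuration file. Marker paths are assumptions, not from the study.
import subprocess

CURSOR_MARKERS = (".cursorrules", ".cursorignore", ".cursor/")

def first_cursor_commit_date(repo_path: str) -> str | None:
    """Return the date (YYYY-MM-DD) of the earliest commit adding a Cursor config file, if any."""
    # List commits oldest-first, showing only files that were added (--diff-filter=A),
    # with a "<hash> <date>" header line per commit.
    log = subprocess.run(
        ["git", "-C", repo_path, "log", "--reverse", "--diff-filter=A",
         "--name-only", "--format=%H %cs"],
        capture_output=True, text=True, check=True,
    ).stdout

    current_date = None
    for line in log.splitlines():
        parts = line.split()
        if len(parts) == 2 and len(parts[0]) == 40:          # commit header: "<hash> <date>"
            current_date = parts[1]
        elif line and any(line.startswith(m) for m in CURSOR_MARKERS):
            return current_date                               # first commit introducing a marker
    return None
```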

For comparison, the researchers built a control group of 1,380 similar GitHub repositories that didn’t adopt Cursor (see caveats below).

For code quality, they used SonarQube, a widely used and well respected code analysis tool that scans code for quality and security issues. The researchers ran SonarQube monthly to track how each codebase evolved, focusing on static analysis warnings, code duplication and code complexity.

Finally, they attempted to filter out toy or throwaway repositories by only including projects with at least 10 GitHub stars.
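
As a reader aid rather than anything lifted from the paper, here is a minimal sketch of the difference-in-differences idea the study’s title refers to: compare how a metric changes in the adopting repos against how it changes in the control repos over the same months, so that trends affecting everyone are netted out. The numbers below are invented, and the paper uses a proper regression rather than this bare arithmetic.

```python
# Minimal difference-in-differences sketch with made-up monthly warning counts.
from statistics import mean

# Hypothetical static-analysis warning counts, aligned so month 0 is the
# Cursor adoption month for the treated group.
treated_pre  = [120, 118, 121]   # adopting repos, months before adoption
treated_post = [150, 158, 162]   # adopting repos, months after adoption
control_pre  = [119, 120, 118]   # matched control repos, same calendar months
control_post = [122, 124, 125]

# DiD: change in the treated group minus change in the control group,
# which nets out trends that affect all repositories.
treated_change = mean(treated_post) - mean(treated_pre)
control_change = mean(control_post) - mean(control_pre)
did_estimate = treated_change - control_change

print(f"Treated change: {treated_change:+.1f} warnings")
print(f"Control change: {control_change:+.1f} warnings")
print(f"DiD estimate:   {did_estimate:+.1f} warnings attributable to adoption")
```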

Key findings

Compared to the control group:

  • A short-lived increase in code generated: Activity spikes in the first one or two months after adoption. Commits rise and lines added jump sharply, with the biggest increase in the first month.
  • The increase does not persist: By month three, activity returns to baseline. There is no sustained increase in code generated.
  • Static analysis warnings increase and remain elevated: Warnings rise by around 30 percent post-adoption and stay high for the rest of the observation window.
  • Code complexity increases significantly: Complexity rose by more than 40 percent, more than could reasonably be accounted for by the growth in codebase size alone.

Caveats/Limitations

The study only looked at open source projects, which aren’t really comparable to production codebases. Also, adoption is inferred from committed Cursor configuration files, which I would say is a reasonably reliable signal of usage within those projects. However, the control group is not necessarily free of AI usage: code in those repositories may still have been created using Copilot, Claude Code or other tools.

My Takeaways

A notable period for AI assisted development

What’s notable is the period this study tracks. In December 2024 Cursor released a major upgrade to their IDE and introduced its agent mode. It was the first time I heard experienced developers I respect describe AI coding assistants as genuinely useful. Cursor adoption climbed quickly and most developers I knew were using Claude Sonnet for day-to-day coding. Then in February 2025 Anthropic released Claude 3.7 Sonnet, their first reasoning model, followed in May by Claude Sonnet 4 and Opus 4.

If improvements in models or tooling were going to reverse the code quality issues seen previously, you’d expect it to show up during this period. This study shows no reversal. The pattern is broadly the same as GitClear observed for 2024.

It’s not just “user error”

A common argument is that poor AI-generated code is the user’s fault, not the tool’s. If developers wrote clearer prompts, gave better instructions or reviewed more carefully, quality wouldn’t suffer. This study disagrees. Even across hundreds of real projects, and even after accounting for how much code was added, complexity increased faster in the AI-assisted repos than in the control group. The tools are contributing to the problem, not merely reflecting user behaviour.

Model collapse playing out in real time

Organisations training LLMs probably use similar signals to this study to decide which open source repositories to train on: popularity, activity and signs of being “engineered” rather than experimental. This study shows more than 800 popular GitHub projects with code quality degrading after adopting AI tools. It’s hard not to see a form of model collapse playing out in real time. If the public code that future models learn from is becoming more complex and less maintainable, there’s a real risk that newer models will reinforce and amplify those trends, producing even worse code over time.

Things are continuing to evolve quickly, but…

Of course, things have continued to move quickly since the period this study covers. Claude Code is currently the poster child for GenAI assisted development. Developers are learning how to instruct these tools more effectively through patterns like CLAUDE.md and AGENTS.md, and support for these conventions is improving within the IDEs.

In my recent experience at least, these improvements mean you can generate good quality code, with the right guardrails in place. However without them (or when it ignores them, which is another matter) the output still trends towards the same issues: long functions, heavy nesting of conditional logic, unnecessary comments, repeated logic – code that is far more complex than it needs to be.
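
To make that concrete, here is a small, entirely hypothetical illustration of the pattern I mean: the kind of deeply nested, repetitive function an unguided assistant tends to produce, next to the flatter version I’d actually want to keep. The function and field names are invented for the example.

```python
# Illustrative only: nested, repetitive output versus a flatter equivalent.

# The shape of code an unguided assistant often produces: deep nesting, redundant branches.
def discount_v1(order):
    if order is not None:
        if order.get("total") is not None:
            if order["total"] > 100:
                if order.get("loyalty_member"):
                    return order["total"] * 0.9
                else:
                    return order["total"] * 0.95
            else:
                return order["total"]
        else:
            return 0
    else:
        return 0

# The same logic with guard clauses: shorter, flatter, easier to review.
def discount_v2(order):
    if not order or order.get("total") is None:
        return 0
    total = order["total"]
    if total <= 100:
        return total
    return total * (0.9 if order.get("loyalty_member") else 0.95)

# Both versions behave the same; only the structure differs.
order = {"total": 120, "loyalty_member": True}
assert discount_v1(order) == discount_v2(order) == 108.0
```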

No doubt the tools will continue to improve, and much of the meaningful progress is happening in the IDE layer rather than in the models themselves. However this study suggests the underlying code quality issues aren’t shifting. The structural problems remain, and they aren’t helped by the fact that the code these models are trained on is likely getting worse. The work of keeping code simple, maintainable and healthy still sits with the human, at least for the foreseeable future.