Author Archives: Rob

Faster horses, not trains

I’ve been trying to work out why successive advances in GenAI models don’t feel particularly different to me, even as others react with genuine excitement.

I use these tools constantly and have done since GPT-4 was released nearly 3 years ago. I couldn’t imagine a world without them. In that sense, they already feel as transformative as the web. I’ve wondered whether it’s simply that once they become ambient, the magic fades. You get used to them and stop noticing improvements. But the more I’ve thought about it, the more I think there are deeper structural reasons why the experience has plateaued, for me at least.

The lossy interface

All meaningful work starts in a physical, social, constraint-filled environment. We reason with space, time, bodies, artefacts, relationships, incentives, and history. Much of this understanding is tacit. We sense it before we can explain it.

To involve a computer, that reality has to be translated into symbols. Text, files, data models, diagrams, prompts. Every translation step compresses context and/or throws information away. There is loss from brain to keyboard. Loss from keyboard to prompt. And loss again when the output comes back and has to be interpreted.

GenAI only ever sees what makes it across that boundary. It reasons over compressed representations of reality that humans have already filtered, simplified, and distorted.

Better models reduce friction within that interface, but they don’t change its dimensionality. In that respect it doesn’t really matter how “smart” the models get, or how well they do on the latest benchmarks. The boundary stays the same.

Because of that, GenAI works best where the world is already well-represented in digital form. As soon as outcomes depend on things outside its boundary, its usefulness drops sharply.

That is why GenAI helps with slices of work, not whole systems. It is powerful, but fundamentally bounded.

Some real world examples:

  • In software development, generating code hasn’t been the main bottleneck since we moved away from punch cards. The far bigger constraints are understanding the problem, communicating with stakeholders, working effectively with other people, designing the system, managing risks and trade-offs, and operating systems in complex social environments over time.
  • In healthcare, GenAI can assist with diagnosis or documentation, but outcomes are dominated by staff, facilities, funding, and coordination across complex human systems. Better reasoning does not create more nurses or hospital beds.

In both cases, GenAI accelerates parts of the work without shifting the underlying constraint.

Faster horses, not trains

In that respect, GenAI feels like faster horses rather than trains. It makes us more effective at things we were already doing – writing, coding, analysis, planning, and sense-making – but it operates only on thin slices of systems.

Trains didn’t just make transport faster. They removed a hard upper bound on the movement of people and goods. Once that constraint moved, everything else reorganised around it. Supply chains, labour markets, cities, timekeeping, and even how people understood distance and work all changed. Railways were not just a tool inside the system, they became the system.

GenAI doesn’t yet do that. It works through a narrow, virtual interface and plugs into existing workflows. But as often as not the real systemic constraints lie elsewhere.

What actually changed the world

A recent conversation reminded me of Vaclav Smil’s How the World Really Works, which I read last year.

Smil highlights that modern civilisation rests on a small number of physical pillars: energy, food production (especially nitrogen), materials like steel and cement, and transport. Changes in these pillars are what led to the biggest transformations in human life. Information technology barely registers at that level in his analysis. He doesn’t deny its importance, but treats it as secondary, an optimiser of systems whose limits are set elsewhere.

Through that lens, GenAI doesn’t (yet) register as a civilisation-shaping force. It doesn’t produce energy, grow food, create new materials, or move mass. It operates almost entirely above those pillars, improving coordination, design, and decision-making around systems whose hard limits are set elsewhere.

That doesn’t make it trivial. But it explains why, so far, it looks closer to previous waves of information technology than to steam or electricity. It optimises within existing constraints rather than breaking them.

The big if

Smil’s framing doesn’t say GenAI cannot matter at an industrial scale. It says where it would have to show up. GenAI becomes civilisation-shaping only if it materially accelerates breakthroughs in those physical pillars – things that change what the world can physically sustain.

This is where “superintelligence” comes in. If GenAI can explore hypothesis spaces humans cannot, design and run experiments, or compress decades of scientific iteration into years, resulting in major scientific breakthroughs, it moves from optimising within constraints to changing them.

This is also where my own doubts sit. Many think simply scaling what we have now will get us there. Those who don’t believe that, but are still optimistic about AI’s potential, turn to world models, embodiment, or agents that can act in the real world. There are sketches and hopes for how this may happen, but as yet, not much more than that.

So while superintelligence is the path by which AI could plausibly become industrial-scale transformative, it’s a long and uncertain one.

What kind of change are we talking about?

If we mean web-scale change, then GenAI is already there. But if we mean the kind of change associated with the industrial revolution (the comparison most often made) – longer lives, better health, radically different working conditions, step changes in material living standards – then what we have today does not qualify. Historically, those shifts followed from breaking physical constraints, not from better information or reasoning alone.

For me, this is why successive model improvements don’t really register. It isn’t that GenAI lacks value – it already feels web-scale transformative. It’s that those improvements don’t change the shape of what’s possible. They operate within the same narrow, lossy interface, so they barely move the needle in practical terms. Until that boundary moves, or something else breaks the underlying constraints, they don’t feel like steps toward an industrial-revolution-scale shift.

More with less, or is it more with the same?

The crude clickbait narrative is that AI means job cuts, replacing roles. But when I look at how AI is actually being used in real organisations, it seems more likely to expand capacity than to reduce headcount. Many organisations may end up doing more with the same long before they can credibly do the same (or more) with less.

This thought started for me with an observation – AI is not substituting for whole roles; we’re getting micro-specialists that can do slices of work. In software you see agents for tests, code review, planning. Other sectors look much the same. Legal teams using AI for drafting. Sales teams for outreach. Finance for reconciliation. Tools handling tasks, not outcomes, and someone still has to stitch the pieces together.

There are (at least) three forces I can think of that matter when asking whether organisations will genuinely be able to do more with less:

1. How automatable the work already is.
Where the work is rules based, high volume, and low variation, AI may replace labour in the same way classic automation has. Think claims processing, simple customer support, structured back office workflows. These functions already lived close to the automation frontier. AI just expands the frontier a bit.
This will reduce headcount, but mostly in places where headcount has been under pressure for decades anyway.

2. How much increased output the organisation can absorb.
Most professional work is not constrained by how fast someone types or drafts. It is constrained by coordination, sequencing, ambiguity, stakeholder alignment, and quality. Software is a good example. So are legal, consulting, product, and sales. If you cut the number of lawyers because drafting is faster, you will simply overload the remaining lawyers with negotiation, risk, and client work.

3. The cost and consequences of mistakes.
In many industries, the limiting factor is not productivity, but risk. Healthcare, aviation, finance, law. Increased throughput also increases the risk surface area. If AI increases the probability or cost of an error, you cannot shrink the team. You often need more human oversight, not less.

If you put these together, the more likely outcome is this:

  • Some operational functions will shrink, but these were already at risk of automation.
  • Most knowledge work will shift toward more with the same, not less.
  • Some domains will accidentally create more with more, because oversight and correction absorb the gains.

AI is still making code worse, a new CMU study confirms

In early 2025 I wrote about GitClear’s analysis of the impact of GenAI on code quality, based on 2024 data, which showed a significant degradation in code quality and maintainability. I recently came across a new study from Carnegie Mellon, “Does AI-Assisted Coding Deliver? A Difference-in-Differences Study of Cursor’s Impact on Software Projects” that looks at a more recent period, tracking code quality in projects using GenAI tools up to mid-2025. So has code quality improved as the models and tools have matured?

The answer appears to be no. The study finds that AI briefly accelerates code generation, but the underlying code quality trends continue to move in the wrong direction.

How the study was run

Researchers at Carnegie Mellon University analysed 807 open source GitHub repositories that adopted Cursor between January 2024 and March 2025, and tracked how those projects changed through to August 2025. Adoption was identified by looking for Cursor configuration files committed to the repo.

For comparison, the researchers built a control group of 1,380 similar GitHub repositories that didn’t adopt Cursor (see caveats below).

For code quality, they used SonarQube, a widely used and well respected code analysis tool that scans code for quality and security issues. The researchers ran SonarQube monthly to track how each codebase evolved, focusing on static analysis warnings, code duplication and code complexity.

Finally, they attempted to filter out toy or throwaway repositories by only including projects with at least 10 GitHub stars.
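
For anyone unfamiliar with the difference-in-differences design named in the study’s title, here’s a minimal sketch of the idea with invented numbers (the real analysis runs over monthly SonarQube metrics for the 807 adopter and 1,380 control repositories). The estimate is how much a metric changed in the Cursor repos after adoption, over and above how it changed in the control repos over the same period.

```python
import pandas as pd

# Toy illustration of a difference-in-differences (DiD) comparison.
# All numbers are invented; the real study tracks monthly SonarQube
# metrics across 807 Cursor-adopting and 1,380 control repositories.
data = pd.DataFrame({
    "group":  ["cursor", "cursor", "control", "control"],
    "period": ["pre", "post", "pre", "post"],
    "warnings": [100, 130, 100, 102],  # e.g. static analysis warnings per repo
})

# Average metric per group and period.
means = data.pivot_table(index="group", columns="period", values="warnings")

change_cursor = means.loc["cursor", "post"] - means.loc["cursor", "pre"]
change_control = means.loc["control", "post"] - means.loc["control", "pre"]

# The DiD estimate is the extra change in the adopter group beyond
# whatever trend the control group experienced over the same window.
did_estimate = change_cursor - change_control
print(f"DiD estimate: {did_estimate:+.1f} warnings per repo")  # +28.0 here
```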

Key findings

Compared to the control group:

  • A short-lived increase in code generated: Activity spikes in the first one or two months after adoption. Commits rise and lines added jump sharply, with the biggest increase in the first month.
  • The increase does not persist: By month three, activity returns to baseline. There is no sustained increase in code generated.
  • Static analysis warnings increase and remain elevated: Warnings rise by around 30 percent post-adoption and stay high for the rest of the observation window.
  • Code complexity increases significantly: Code complexity rose by more than 40 percent, more than could reasonably be accounted for by just the growth in codebase size.

Caveats/Limitations

The study only looked at open source projects, which aren’t really comparable to production codebases. Also, adoption is inferred from committed Cursor configuration files, which I would say is a reasonably reliable signal of usage within those projects. However, the control group is not necessarily free of AI usage: code in those repositories may still have been created using Copilot, Claude Code or other tools.

My Takeaways

A notable period for AI assisted development

What’s notable is the period this study tracks. In December 2024 Cursor released a major upgrade to their IDE and introduced its agent mode. It was the first time I heard experienced developers I respect describe AI coding assistants as genuinely useful. Cursor adoption climbed quickly and most developers I knew were using Claude Sonnet for day-to-day coding. Then in February 2025 Anthropic released Claude 3.7 Sonnet, its first hybrid reasoning model, followed in May by Claude Sonnet 4 and Opus 4.

If improvements in models or tooling were going to reverse the code quality issues seen previously, you’d expect it to show up during this period. This study shows no reversal. The pattern is broadly the same as GitClear observed for 2024.

It’s not just “user error”

A common argument is that poor AI-generated code is the user’s fault, not the tool’s. If developers wrote clearer prompts, gave better instructions or reviewed more carefully, quality wouldn’t suffer. This study disagrees. Even across hundreds of real projects, and even after accounting for how much code was added, complexity increased faster in the AI-assisted repos than in the control group. The tools are contributing to the problem, not merely reflecting user behaviour.

Context collapse playing out in real time

Organisations training LLMs probably use similar signals to this study to decide which open source repositories to train on: popularity, activity and signs of being “engineered” rather than experimental. This study shows more than 800 popular GitHub projects with code quality degrading after adopting AI tools. It’s hard not to see a form of context collapse playing out in real time. If the public code that future models learn from is becoming more complex and less maintainable, there’s a real risk that newer models will reinforce and amplify those trends, producing even worse code over time.

Things are continuing to evolve quickly, but…

Of course, things have continued to move quickly since the period this study covers. Claude Code is currently the poster child for GenAI-assisted development. Developers are learning how to instruct these tools more effectively through patterns like CLAUDE.md and AGENTS.md, and support for these conventions is improving within the IDEs.

In my recent experience at least, these improvements mean you can generate good quality code with the right guardrails in place. However, without them (or when the tool ignores them, which is another matter) the output still trends towards the same issues: long functions, heavy nesting of conditional logic, unnecessary comments, repeated logic – code that is far more complex than it needs to be.

No doubt the tools will continue to improve, and much of the meaningful progress is happening in the IDE layer rather than in the models themselves. However this study suggests the underlying code quality issues aren’t shifting. The structural problems remain, and they aren’t helped by the fact that the code these models are trained on is likely getting worse. The work of keeping code simple, maintainable and healthy still sits with the human, at least for the foreseeable future.

Findings from DX’s 2025 report: AI won’t save you from your engineering culture

The DX AI-assisted engineering: Q4 (2025) impact report offers one of the most substantial empirical views yet of how AI coding assistants are affecting software development. It largely corroborates the key findings of the 2025 DORA State of AI-assisted Software Development Report: quality outcomes vary dramatically based on existing engineering practices, and both the biggest limitation and the biggest benefit come from adopting modern software engineering best practices – which remain rare even in 2025. AI accelerates whatever culture you already have.

Who are DX and why the report matters

DX is probably the leading and best-regarded developer intelligence platform. They sell productivity measurement tools to engineering organisations. They combine telemetry from development tools with periodic developer surveys to help engineering leaders track and improve productivity.

This creates potential bias – DX’s business depends on organisations believing productivity can be measured. But it also means they have access to data most researchers don’t.

Data collection

The report examines data collected between July and October 2025. Drawing on data from 135,000 developers across 435 companies, the data set is substantially larger than most productivity research. It combines:

  • System telemetry from AI coding assistants (GitHub Copilot, Cursor, Claude Code) and source control systems (GitHub, GitLab, Bitbucket).
  • Self-reported surveys asking about time savings, AI-authored code percentage, maintainability perception, and enablement quality.

Update: However, they aren’t particularly transparent about what data they used to create their findings. They mention how they calculate AI usage (empirical data) and time savings (self-reported surveys), but nothing on how they calculated metrics like CFR, which is a notable one in the report.

Key Findings

Existing bottlenecks dwarf AI time savings

This should be the headline: meetings, interruptions, review delays, and CI wait times cost developers more time than AI saves. Meeting-heavy days are reported as the single biggest obstacle to productivity, followed by interruption frequency (context switching). Individual task-level gains from AI are being swamped by organisational dysfunction. This corroborates 2025 DORA State of AI-assisted Software Development Report findings that systemic constraints limit AI impact.


You can save 4 hours writing code faster, but if you lose 6 hours to slow builds, context switching and poorly-run meetings, the net effect is negative.

Quality impact varies dramatically

The report tracks Change Failure Rate (CFR) – the percentage of changes causing production issues. Results split sharply: some organisations see CFR improvements, others see degradation. The report calls this “varied,” but I’d argue it’s the most important signal in the entire dataset.

What differentiates organisations seeing improvement from those seeing degradation? The report doesn’t fully unpack this.

Modest time savings claimed, but seem to have hit a wall

Developers report saving 3.6 hours per week on average, with daily users reporting 4.1 hours. But this is self-reported, not measured (see limitations).

More interesting: time savings have plateaued around 4 hours even as adoption climbed from ~50% to 91%. The report initially presents this as a puzzle, but the data actually explains it. The biggest finding, buried on page 20, is – as above – that non-AI bottlenecks dwarf AI gains.

Throughput gains measured, but problematic

Daily AI users merge 60% more PRs per week than non-users (2.3 vs 1.4). That’s a measurable difference in activity. Whether it represents productivity is another matter entirely. (More on this in the limitations section.)

Traditional enterprises show higher adoption

Non-tech companies show higher adoption rates than tech-native organisations. The report attributes this to deliberate, structured rollouts with strong governance.

There’s likely a more pragmatic explanation: traditional enterprises are aggressively rolling out AI tools in hopes of compensating for weak underlying engineering practices. The question is whether this works. If the goal is to shortcut or leapfrog organisational dysfunction without fixing the root causes, the quality degradation data suggests it won’t. AI can’t substitute for modern engineering practices; it can only accelerate whatever practices already exist.

Other findings

  • Adoption is near-universal: 91% of developers now use AI coding assistants, matching DORA’s 2025 findings. The report also reveals significant “shadow AI” usage: developers using tools they pay for themselves, even when their organisation provides approved alternatives.
  • Onboarding acceleration: Time to 10th PR dropped from 91 days to 49 days for daily AI users. The report cites Microsoft research showing early output patterns predict long-term performance.
  • Junior devs use AI most, senior devs save most time: Junior developers have the highest adoption, but Staff+ engineers report the biggest time savings (4.4 hours/week). Staff+ engineers also have the lowest adoption rates. Why aren’t senior engineers adopting as readily? Scepticism about quality? Lack of compelling use cases for complex architectural work?

Limitations and Flaws

Pull requests as a productivity metric

The report treats “60% more PRs merged” as evidence of productivity gains. This is where I need to call out a significant problem – and interestingly, DX themselves have previously written about why this is flawed.

PRs are a poor productivity metric because:

  • They measure motion, not progress. Counting PRs shows how many code changes occurred, not whether they improved product quality, reliability, or customer value.
  • They’re highly workflow-dependent. Some teams merge once per feature, others many times daily. Comparing PR counts between teams or over time is meaningless unless workflows are identical.
  • They’re easily gamed and inflated. Developers (or AI) can create more, smaller, or trivial PRs without increasing real output. “More PRs” often just means more noise.
  • They’re actively misleading in mature Continuous Delivery environments. Teams practising trunk-based development integrate continuously with few or no PRs. Low PR counts in that model actually indicate higher productivity.

Self-reported time savings can’t be trusted

The “3.6 hours saved per week” is self-reported, not measured. People overestimate time savings. As an example, the METR study “Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity” found developers estimated they’d got a 20% speedup from AI but were actually 19% slower.

Quality findings under-explored

The varied CFR results are the most important finding, but they’re presented briefly and then the report moves on. What differentiates organisations seeing improvement from those seeing degradation? Code review practices? Testing infrastructure? Team maturity?

The enablement data hints at answers but doesn’t fully investigate. This is a missed opportunity to identify the practices that make AI a quality accelerator rather than a debt accelerator.

Missing DORA Metrics

The report covers Lead Time (poorly, approximated via PR throughput) and Change Failure Rate. But it doesn’t measure deployment frequency or Mean Time to Recovery.

That means we’re missing the end-to-end delivery picture. We know code is written and merged faster, but we don’t know if it’s deployed faster or if failures are resolved more quickly. Without deployment frequency and MTTR, we can’t assess full delivery-cycle productivity.

Conclusion

This is one of the better empirical datasets on AI’s impact, corroborating DORA 2025’s key findings. But the real story isn’t in the headline numbers about time saved or PRs merged. It’s in two findings:

Non-AI bottlenecks still dominate.

Meetings, interruptions, review delays, and slow CI pipelines cost more than AI saves. Individual productivity tools can’t fix organisational dysfunction.

As with DORA’s findings, the biggest limitation and the biggest opportunity both come from adopting modern engineering practices. Small batch sizes, trunk-based development, automated testing, fast feedback loops. AI makes their presence more valuable and their absence more costly.

AI is an accelerant, not a fix

It reveals and amplifies existing engineering culture. Strong quality practices get faster. Weak practices accumulate debt faster. The variation in CFR outcomes isn’t noise – it’s the signal. The organisations seeing genuine gains are those already practising modern software engineering. Those practices remain rare.

My advice for engineering leaders:

  1. Tackle system-level friction first. Four hours saved writing code doesn’t matter if you lose six to meetings, context switching and poor CI infrastructure and tooling.
  2. Adopt modern engineering practices. The gains from adopting a continuous delivery approach dwarf what AI alone can deliver.
  3. Don’t expect AI to fix broken processes. If review is shallow, testing is weak, or deployment is slow, AI amplifies those problems.
  4. Invest in structured enablement. The correlation between training quality and outcomes is strong.
  5. Track throughput properly alongside quality. More PRs merged isn’t a win if it doesn’t actually result in shipping faster, or if your CFR goes up. Measure end-to-end cycle times, CFR, MTTR, and maintainability (a rough sketch of computing CFR and MTTR follows below).
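
To make that last point concrete, here’s a minimal sketch of computing CFR and MTTR from a simple deployment log. The data structure, field names and numbers are invented for the example – the point is just how little data you need to start tracking these.

```python
from datetime import datetime, timedelta

# Hypothetical deployment log: each entry records a deployment, whether it
# caused a production failure, and how long recovery took if it did.
deployments = [
    {"deployed_at": datetime(2025, 11, 3, 10, 0), "failed": False, "recovery": None},
    {"deployed_at": datetime(2025, 11, 3, 15, 30), "failed": True,
     "recovery": timedelta(minutes=45)},
    {"deployed_at": datetime(2025, 11, 4, 9, 15), "failed": False, "recovery": None},
    {"deployed_at": datetime(2025, 11, 5, 11, 0), "failed": True,
     "recovery": timedelta(hours=3)},
]

# Change Failure Rate: share of deployments that caused a production issue.
failures = [d for d in deployments if d["failed"]]
cfr = len(failures) / len(deployments)

# Mean Time to Recovery: average time to restore service after a failure.
mttr = sum((d["recovery"] for d in failures), timedelta()) / len(failures)

print(f"Change Failure Rate: {cfr:.0%}")  # 50% in this toy example
print(f"MTTR: {mttr}")                    # 1:52:30 in this toy example
```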

You’re probably listening to the wrong people about AI Coding

Unsurprisingly, there are a lot of strong opinions on AI assisted coding. Some engineers swear by it. Others say it’s dangerous. And of course, as is the way with the internet, nuanced positions get flattened into simplistic camps where everyone’s either on one side or the other.

A lot of the problem is that people aren’t arguing about the same thing. They’re reporting different experiences from different vantage points.

I’ve sketched a chart to illustrate the pattern I’m seeing. It’s not empirical, just observational – and before the camps start arguing about it, yes, reality is more nuanced than this. It’s an oversimplified generalisation by design.

The yellow line shows perceived usefulness of AI coding tools. The blue line shows the distribution of engineering competence. The green dotted line shows what the distribution would look like if we went by how experienced people say they are.

Different vantage points

Look at the first peak on the yellow line. A lot of less experienced and mediocre engineers likely think these tools are brilliant. They’re producing more code, feeling productive. The problem is they don’t see the quality problems they’re creating. Their code probably wasn’t great before AI came along. Most code is crap. Most developers are mediocre, so it’s not surprising this group is enthusiastic about tools that help them produce more (crap) code faster.

Then there’s a genuinely experienced cohort. They’ve lived with the consequences of bad code and learnt what good code looks like. When they look at AI-generated code, they see technical debt being created at scale. Without proper guidance, AI-generated code is pretty terrible. Their scepticism is rational. They understand that typing isn’t the bottleneck, and that speed without quality just creates expensive problems.

Calling these engineers resistant to change is lazy and unfair. They’re not Luddites. They’re experienced enough to recognise what they’re seeing, and what they’re seeing is a problem.

But there’s another group at the far end of the chart. Highly experienced engineers working with modern best practices – comprehensive automated tests, continuous delivery, disciplined small changes. Crucially, they’ve also learned how to work with AI tools using those practices. They are getting productivity gains without impacting quality. They’re also highly aware that typing is not the bottleneck, so they’re not quite as enthusiastic as our first cohort.

Interestingly, I’ve regularly seen sceptical experienced engineers change their view once they’ve been shown how you can blend modern/XP practices with AI assisted coding.

Why the discourse is broken

When someone from that rare disciplined expert group writes enthusiastically about AI tools, it’s easy to assume their experience is typical. It isn’t. Modern best practices are rare. Most teams don’t deploy to production multiple times per day. Most codebases don’t have comprehensive automated tests. Most engineers don’t work in small validated steps with tight feedback loops.

Meanwhile, the large mediocre majority is also writing enthusiastically about these tools, but they’re amplifying dysfunction. They’re creating problems that others will need to clean up later. That’s most of the industry.

And the experienced sceptics – the people who can actually see the problems clearly – are a small group whose warnings get dismissed as resistance to change.

The problem of knowing who to listen to

When you read enthusiastic takes on AI tools, is that coming from someone with comprehensive tests and tight feedback loops, or from someone who doesn’t know what good code looks like? Both sound confident. Both produce content.

When someone expresses caution, are they seeing real problems or just resistant to change?

The capability perception gap – that green dotted line versus reality – means there are probably far fewer people with the experience and practices to make reliable claims than are actually making them. And when you layer on the volume of hype around AI tools, it becomes nearly impossible to filter for signal.

The loudest voices aren’t necessarily the most credible ones. The most credible voices – experienced engineers with rigorous practices – are drowned out by sheer volume from both the mediocre majority and the oversimplified narratives that AI tools are either revolutionary or catastrophic.

We’re not just having different conversations. We’re having them in conditions where it’s genuinely hard to know whose experience is worth learning from.

After the AI boom: what might we be left with?

Some argue that even if the current AI boom leads to an overbuild, it might not be a bad thing – just as the dotcom bubble left behind the internet infrastructure that powered later decades of growth.

It’s a tempting comparison, but the parallels only go so far.

The dotcom era’s overbuild created durable, open infrastructure – fibre networks and interconnects built on open standards like TCP/IP and HTTP. Those systems had multi-decade lifespans and could be reused for whatever came next. Much of the fibre laid in the 1990s still carries traffic today, upgraded simply by swapping out the electronics at each end. That overinvestment became the backbone of broadband, cloud computing, and the modern web.

Most of today’s AI investment, by contrast, is flowing into proprietary, vertically integrated systems rather than open, general-purpose infrastructure. Most of the money is being spent on incredibly expensive GPUs with a 1–3 year lifespan: they become obsolete quickly and wear out under constant, high-intensity use. These chips aren’t general-purpose compute engines; they’re purpose-built for training and running generative AI models, tuned to the specific architectures and software stacks of a few major vendors such as Nvidia, Google, and Amazon.

These chips live inside purpose-built AI data centres – engineered for extreme power density, advanced cooling, and specialised networking. Unlike the general-purpose facilities of the early cloud era, these sites are tightly coupled to the hardware and software of whoever built them. Together, they form a closed ecosystem optimised for scale but hard to repurpose.

That’s why, if the AI bubble bursts, we could just be left with a pile of short-lived, highly specialised silicon and silent cathedrals of compute – monuments from a bygone era.

The possible upside

Still, there’s a more positive scenario.

If investment outruns demand, surplus capacity could push prices down, just as the post-dotcom bandwidth glut did in the early 2000s. Cheap access to this kind of compute might open the door for new experimentation – not just in generative AI, but in other high-compute domains such as simulation, scientific research, and data-intensive analytics. Even if the hardware is optimised for GenAI, falling prices could still make large-scale computation more accessible overall. A second-hand market in AI hardware could emerge, spreading access to powerful compute much more widely.

The supporting infrastructure – power grid upgrades, networking, and edge facilities – will hopefully remain useful regardless. And even if some systems are stranded, the talent, tooling, and operational experience built during the boom will persist, as it did after the dotcom crash.

Without openness, the benefits stay locked up

The internet’s long-term value came not just from cheap capacity, but from open standards and universal access. Protocols like TCP/IP and HTTP meant anyone could build on the same foundations, without permission or platform lock-in. That openness turned surplus infrastructure into a shared public platform, unlocking decades of innovation far beyond what the original investors imagined.

The AI ecosystem is the opposite: powerful but closed. Its compute, models, and APIs are owned and controlled by a handful of vendors, each defining their own stack and terms of access. Even if hardware becomes cheap, it won’t automatically become open. Without shared standards or interoperability, any overbuild risks remaining a private surplus rather than a public good.

So the AI boom may not leave behind another decades-long backbone like the internet’s fibre networks. But it could still seed innovation if the industry finds ways to open up what it’s building – turning today’s private infrastructure into tomorrow’s shared platform.

Update: This post has received quite a lot of attention on Hacker News. Link to comments if you enjoy that sort of thing. Also, hi everyone 👋 – I’ve written a fair bit of other stuff on AI, among other things, if you’re interested.

On “Team dynamics after AI” and the Illusion of Efficiency

This is one of the most important pieces of writing I’ve read on AI – and that’s not the kind of thing I say lightly. If you’re leading in a business right now and looking at AI adoption, it’s worth your full attention.

Duncan Brown’s Team dynamics after AI isn’t about model performance or the usual surface-level debates. It’s about the potential for AI to quietly reshape the structure and dynamics of teams – how work actually gets done.

He shows how the promise of AI enabling smaller teams (“small giants”) and individuals taking on hybrid roles can lead organisations to blur boundaries, remove friction and assume they can do more with less. But when that happens, you lose feedback loops and diversity of perspective, and start to erode the structural foundations that quietly hold alignment together and make teams effective.

He also points to something I’ve been saying for a while – that AI doesn’t necessarily make us more productive, it can just make us busier. More output, more artefacts, more noise – but not always more value.

Here lies the organisational risk. The system starts to drift. Decisions narrow. Learning slows. More artefacts get produced, but they create more coordination and interpretation work, not less. The subtle structures that keep context and coherence together begin to thin out. Everything looks efficient – right up until it isn’t.

A bit like what happened with Nike: they optimised for the short-term and de-emphasised the harder, slower work that built long-term brand strength. It seemed to work at first, but the damage wasn’t visible until it was too late and it’ll now take them years to build back.

It’s also written by someone who’s been deep in the trenches – leading engineering at the UK Gov’s AI incubator, so not your usual ill-informed AI commentator.

And as a massive Ian MacKaye/Fugazi fan and a lapsed skateboarder, it honestly feels like another me wrote it.

Essential reading. It’s a long read – get a brew and a quiet 15 minutes.

Why AI won’t work as a software development abstraction

The idea of LLMs as a new abstraction layer for software development keeps coming up. On the surface it sounds appealing. Just as compilers turn source into binaries, AI could turn prompts into systems. You store the prompts, they become the source of truth, the AI generates the code, and the code itself becomes just an artefact.

Let’s assume, for the sake of the argument, things like non-determinism and hallucination are solved. There is still a big problem.

Complexity.

Software is never static. Requirements change, and each change adds complexity. Even the best engineers in the world struggle with this – whole disciplines around refactoring, code composition and architecture exist to contain it, and still complexity piles up.

Unless we reach some form of AI superintelligence, well beyond anything today, AI will run into the same problems, probably faster. Entropy builds up, not down.

The only way I can think of around that would be to regenerate the entire codebase (or at least large parts of it) from prompts each time, like a compiler rebuilding from source.

However, that just hits another wall.

By my rough calculations, regenerating a mid-size 500k LOC codebase with today’s LLMs and compute would take days and cost thousands (a rough sketch of that arithmetic is below).
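
Here’s a back-of-envelope sketch of that estimate. Every number in it is an assumption I’ve picked for illustration – tokens per line, throughput, per-token prices, and the amplification factor for agentic context re-reading – so treat it as showing the shape of the arithmetic rather than a precise figure. Repeated full regeneration passes, verification runs, retries and pricier models multiply both totals quickly.

```python
# Back-of-envelope estimate for regenerating a codebase from prompts.
# All figures below are illustrative assumptions, not measurements.

LINES_OF_CODE = 500_000
TOKENS_PER_LINE = 10                              # assumed average
OUTPUT_TOKENS = LINES_OF_CODE * TOKENS_PER_LINE   # ~5M tokens of generated code

# Agentic regeneration re-reads context, retries, and reasons, so total
# tokens processed are assumed to be a large multiple of the output.
INPUT_AMPLIFICATION = 40
INPUT_TOKENS = OUTPUT_TOKENS * INPUT_AMPLIFICATION

# Assumed prices and serial generation speed (vary by model and provider).
PRICE_PER_M_INPUT = 3.0      # USD per million input tokens
PRICE_PER_M_OUTPUT = 15.0    # USD per million output tokens
TOKENS_PER_SECOND = 50       # assumed serial output throughput

cost = (INPUT_TOKENS / 1e6) * PRICE_PER_M_INPUT + \
       (OUTPUT_TOKENS / 1e6) * PRICE_PER_M_OUTPUT
serial_hours = OUTPUT_TOKENS / TOKENS_PER_SECOND / 3600

print(f"Single-pass cost: ~${cost:,.0f}")                  # ~$675 with these assumptions
print(f"Serial generation time: ~{serial_hours:.0f} hours") # ~28 hours, before any retries
# Multiple regeneration passes, build/test verification loops and failed
# attempts push this into days of wall-clock time and thousands of dollars.
```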

Software development depends on feedback loops measured in seconds or minutes, not hours or days.

And this points to a natural physical law: processing information always carries an energy cost. You can’t avoid it, only shift it – in this case, from human cognitive effort to machine compute cycles. And today, the machine version would be far less efficient.

tl;dr You can’t beat the 2nd law of thermodynamics.

DORA 2025 AI assisted dev report: Some Benefit, Most Don’t

The recent DORA 2025 State of AI-Assisted Software Development report suggests that, today, only a small minority of the industry are likely to benefit from AI-assisted coding – and more importantly, avoid doing themselves harm.

The report groups teams into seven clusters to show how AI-assisted coding is shaping delivery. Only two – 6 (“Pragmatic performers”) and 7 (“Harmonious high-achievers”) – are currently benefitting.

They’re increasing throughput without harming stability – that is, without an increase in change failure rate (CFR). In other words, they’re not seeing significantly more production bugs, which would otherwise hurt customers and create additional (re)work.

For the other clusters, AI mostly amplifies existing problems. Cluster 5 (Stable and methodical) will only benefit if they change how they work. Clusters 1–4 (the majority of the industry) are likely to see more harm than good – any gains in delivery speed are largely cancelled out by a rise in the change failure rate (CFR), as the report explains.

The report shows 40% of survey respondents fall into clusters 6 and 7. Big caveat though: DORA’s data comes from teams already familiar with DORA and modern practices (even if not applying them fully). Across the wider industry, the real proportion is likely *half that or less*.

That means around three-quarters of the industry are not yet in a position to realistically benefit from AI-assisted coding.

For leaders, it’s less about whether to adopt AI-assisted coding, and more about whether your ways of working are good enough to turn it into an asset, rather than a liability.

Does the “lethal trifecta” kill the idea of fully autonomous AI Agents anytime soon?

I don’t think people fully appreciate yet how much agentic AI use cases are restricted by what Simon Willison coined the “lethal trifecta”. His article is a bit technical, so I’ll try to break it down in layman’s terms.

An AI agent becomes very high risk when these three things come together:

  • Private data access – the agent can see sensitive information, like customer records, invoices, HR files or source code.
  • Untrusted inputs – it also reads things you don’t control, like emails from customers, supplier documents, 3rd party/open source code or content on the web.
  • The ability to communicate externally – it has a channel to send data out, like email, APIs or other external systems.

Each of those has risks on its own, but when you put all three together it creates a structural vulnerability we don’t yet know how to contain. That’s what makes the trifecta “lethal”. If someone wants to steal your data, you have no effective way to stop them. Malicious instructions hidden in untrusted inputs can trick the agent into exfiltrating (sending out) whatever private data it can see.
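
To make the combination concrete, here’s a minimal sketch of the kind of gate you could put in front of an agent deployment. The capability flags are invented for illustration (they’re not from Simon’s article); the point is simply that any one or two of these capabilities can be manageable on their own, while the full combination is the structural problem.

```python
from dataclasses import dataclass

@dataclass
class AgentCapabilities:
    """Invented capability flags for illustrating the 'lethal trifecta'."""
    reads_private_data: bool          # e.g. customer records, HR files, source code
    ingests_untrusted_input: bool     # e.g. inbound email, web content, 3rd-party docs
    can_communicate_externally: bool  # e.g. outbound email, arbitrary API calls

def is_lethal_trifecta(caps: AgentCapabilities) -> bool:
    # One or two of these can be contained; all three together mean a
    # prompt-injected instruction can exfiltrate whatever the agent can see.
    return (caps.reads_private_data
            and caps.ingests_untrusted_input
            and caps.can_communicate_externally)

support_agent = AgentCapabilities(
    reads_private_data=True,          # looks up customer accounts
    ingests_untrusted_input=True,     # reads inbound customer emails
    can_communicate_externally=True,  # replies and can call refund APIs
)

if is_lethal_trifecta(support_agent):
    print("High risk: break the loop, e.g. require human approval for external actions")
```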

If you broaden that last point from “communicate externally” to “take external actions” (like sending payments, updating records in systems, or deploying code) the risk extends even further – not just leaking data, but also doing harmful things like hijacking payments, corrupting information, or changing how systems behave.

It’s all a bit like leaving your car running with the keys in the ignition and hoping no one crashes it.

Where this matters most is in the types of “replace a worker” examples people get excited about. Think of:

  • an AI finance assistant that reads invoices, checks supplier sites, and then pays them
  • a customer support agent that reads emails, looks up answers on an online system and then issues refunds
  • a DevOps helper that scans logs, searches the web for known vulnerabilities or issues, and then pushes config changes

All of those tick all three boxes – private data, untrusted input, and external actions – and that makes them unsafe right now.

There are safer uses, but they all involve breaking the loop – for example:

  • our finance bot only drafts payments for human approval
  • our support agent can suggest, but doesn’t issue refunds
  • our DevOps helper only runs in a sandbox (a highly isolated environment)

Unless I have got this wrong, until we know how to contain the trifecta, the glossy vision of fully autonomous agents doesn’t look like something we can safely build.

And it may be that we never can. The problem isn’t LLM immaturity or missing features – it’s structural. LLMs can’t reliably tell malicious instructions from benign ones. To them, instructions are just text – there’s no mechanism to separate a genuine request from an attack hidden in the context. And because attackers can always invent new phrasings, the exploit surface is endless.

And if so, I wonder how long it will take before the penny drops on this.

Edit: I originally described the third element of Simon’s trifecta as “external actions”. I’ve updated this to align with Simon Willison’s original article, and instead expanded on the external actions point (partly after checking with Simon).