Monthly Archives: April 2026

What it takes to benefit from GenAI coding

GenAI coding tools are genuinely powerful. In the right hands, in the right environment, the results are remarkable.

Experienced engineers with good practices around them are doing things in hours that used to take weeks. Ideas get tested that previously stayed as hypotheses. Long-standing technical debt is getting cleared. Work that wasn’t worth the investment a year ago is now done in an afternoon.

Right environment means organisations that genuinely understand software engineering. An appreciation that building software is not a production line, but a learning process.

Right hands means experienced software engineers who take full end-to-end ownership. Product mindset. XP practices. Continuous delivery with all the automation, tests and guardrails that let you learn and iterate quickly without breaking things.

Most organisations don’t have that, which is why most of the industry isn’t getting much from these tools.

The organisations best placed to benefit from GenAI are the ones who invested in engineering foundations years ago. For everyone else, the shortcut you were hoping for doesn’t exist.

For CEOs and founders hoping to benefit, the answer isn’t as simple as handing out Claude licences (as Jason Gorman puts it, “just because you attach a code-generating firehose to your plumbing, that doesn’t mean you’ll get a power shower”). It’s investing in the engineering culture and practices. Unglamorous, slow work, but there’s no way around it.


Footnote: By experienced I don’t mean “senior”, by the way. Most “senior” engineers I meet have never worked in a genuine XP or continuous delivery environment. They have years of experience, just not the experience that matters.

Experienced in this context means having built and shipped software in organisations that understand the craft. Fast feedback, small batches, tests as a design tool, code as a liability to be managed. That’s not about title or tenure. It’s about the environment you learned in.

I’ve worked with many “juniors” with, say, 2-4 years’ experience who run rings around people with 10+. Because they learned in the right environment from the start.

Anthropic squeezed three ways

Anthropic’s Claude Code pricing fiasco is what it looks like when a company is squeezed three ways at once. Anthropic quietly removed Claude Code from the $20 Pro plan, making it exclusive to the $100 and $200 Max tiers. Their Head of Growth framed it as a small test on 2% of new signups (which didn’t match what users were seeing). Within hours they reversed it.

What interests me is what the test reveals about the bind Anthropic is in. It was an attempt to fix unit economics: heavy users on flat-fee plans consume vastly more than the plans recover, and the Head of Growth, Amol Avasare, said as much on X – plans weren’t built for current usage patterns.

That’s one real pressure. But it’s not the only one. They’re squeezed three ways at once.

The first squeeze is unit economics. Someone running Claude Code all day on a $20 subscription costs far more to serve than they pay. Either prices go up or costs come down. However, raising prices risks making them uncompetitive against OpenAI and Google, who are already taking advantage of this moment.

The second squeeze is compute. Claude has been below 99% uptime for a quarter. They are clearly struggling with the huge increase in demand they’re experiencing. A year ago the product was mostly chat. Today a significant share of usage is coding agents running for hours. The shape of demand is changing faster than provisioning can keep up.

So why not do what Gmail and Bluesky did and gate new signups? Match supply to demand, protect the experience for existing users, generate some FOMO and desirability in the process, and buy time to sort the rest out.

That brings us to the third squeeze. Anthropic’s valuation, like that of every frontier AI lab, rests on growth trajectory rather than current profitability. However dressed up, limiting signups reads as a capacity wall, and from there it’s a short step to growth slowing and the IPO narrative wobbling.

The best approach for managing the compute squeeze is ruled out by the growth squeeze, which means infrastructure strain has to be absorbed through rate limits and outages instead, upsetting all your existing users in the process.

It also means heavy users keep arriving, which continues to make the unit economics worse, which is how you end up running silent pricing tests on Tuesday afternoons.

AI “Watershed Moment” or expensive pen tester? The AISI Mythos Data

The UK’s AI Security Institute has published the first independent evaluation of Claude Mythos’s cyber capabilities. The headline finding – first AI model to complete a full 32-step simulated network attack – is notable. But there’s a finding buried in the accompanying methodology paper that puts it in a rather different light. On current pricing and reliability, according to my maths, a human expert would do the same job cheaper, faster and more reliably.

What AISI found

On capture-the-flag tasks – common security challenges AISI have been using to test models since 2023 – Mythos sits broadly on the existing trend line. Real improvement, but incremental, and not unique to Mythos. The capability has been building across multiple labs for over a year.

The more significant result is with what AISI call “chained attacks” – where a model has to execute a long sequence of steps across a network to take it over, rather than exploit a single vulnerability in isolation. AISI measured this using their “The Last Ones” simulation: a 32-step corporate network attack spanning initial reconnaissance through to full network takeover, which they estimate a human expert would complete in 14 hours.

Mythos is the first model to complete all 32 steps end to end – though Opus 4.6, Anthropic’s previous model, wasn’t far behind in its best run.

The limitations & takeaways

The model was already inside the network, and the simulated environment had no active security monitoring and no defensive tools. Real networks aren’t like that – at least they shouldn’t be.

For most organisations the biggest threats remain phishing, weak passwords, and unpatched systems. AISI’s own advice in the article reflects this: focus on the basics – patch regularly, enforce access controls, enable logging. More importantly, the most common and successful attacks continue to target humans rather than rely on technical sophistication – as the Co-op, M&S and JLR attacks last year demonstrated.

The trajectory is real and worth taking seriously – but AISI’s findings are more measured than Anthropic’s “watershed moment” framing, and the most important things you can do about it are the same things you should have been doing anyway.

There’s a finding buried in the methodology

AISI published an accompanying academic paper detailing the evaluation methodology and results for models prior to Mythos – including detailed cost and timing data. This is where things get interesting.

According to that paper, the best Opus 4.6 run at 100M tokens cost approximately $80 and took around 10 hours – completing 22 of 32 steps, equivalent to roughly 6 of those 14 human hours. Slower, and less than halfway through in human time equivalent.

Mythos is priced at 5x Opus 4.6 per token. Its best run completed 32 steps versus Opus 4.6’s 22 – but crucially the additional steps fall in the later, harder milestones which are significantly more time and token-intensive. Accounting for both the price differential and likely higher token usage on those harder steps, a rough extrapolation puts a Mythos run at approximately $880.

The variance problem

The paper shows all models have very high variance across runs. Opus 4.6’s best run reached 22 steps, its worst only 11, with an average of 15.6. And the AISI article shows Mythos only completed all 32 steps in 3 of its 10 attempts – a 70% failure rate on full completion.

To get one successful outcome you’d expect to need 3-4 runs on average – and each run is likely comparable in time to the Opus 4.6 runs.

That’s approximately $2,900-3,500 per successful outcome.

A human expert completing the same exercise: 14 hours, once, reliably. At $125-190 per hour (UK rates) that’s $1,750-2,660.
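The back-of-envelope maths above can be sketched in a few lines of Python. The token multiplier is my assumption, chosen to bridge the 5x price difference to the article’s ~$880 per-run estimate; the other inputs come from the AISI paper and article.

```python
# Back-of-envelope cost comparison from the AISI figures.
# token_multiplier is an assumption: extra tokens burned on the
# harder late-stage steps, chosen to land near the ~$880 estimate.

opus_run_cost = 80            # best Opus 4.6 run (AISI methodology paper)
price_multiplier = 5          # Mythos per-token price vs Opus 4.6
token_multiplier = 2.2        # assumed extra token usage (my estimate)
mythos_run_cost = opus_run_cost * price_multiplier * token_multiplier

success_rate = 3 / 10         # Mythos: 3 full completions in 10 runs
expected_runs = 1 / success_rate               # runs needed per success
cost_per_success = mythos_run_cost * expected_runs

human_hours = 14              # AISI's human-expert estimate
human_rates = (125, 190)      # UK rates, USD per hour
human_cost = tuple(human_hours * r for r in human_rates)

print(f"Mythos: ~${mythos_run_cost:.0f}/run, "
      f"~${cost_per_success:.0f} per successful takeover")
print(f"Human:  ${human_cost[0]}-{human_cost[1]}, once, reliably")
```

In expected-value terms this lands at the lower end of the $2,900-3,500 range; budgeting for a full 4 runs gives the upper end.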

So at least today, according to AISI’s data, and assuming my maths is roughly correct, an experienced human security professional would be cheaper, more reliable and at least as quick as the most capable AI model currently available.

METR’s developer productivity research: 2026 update

You may be seeing posts claiming METR’s widely-cited 2025 study has been followed up with new research showing an 18% productivity boost. That’s not what the article says.

METR: We are Changing our Developer Productivity Experiment Design

In 2025, METR found experienced open-source developers using GenAI were 19% slower – and notably, developers themselves thought they were being sped up. They started a new experiment to track how things were changing – but couldn’t complete it. They say the data was too compromised to produce reliable results.

The interesting part is why the study broke down. Developers are now so reliant on AI that they wouldn’t work without it, making a control group impossible to staff. And the nature of how they work when using GenAI has changed too, which undermines measurements like simple time-on-task.

METR believe GenAI coding productivity is improving – but say they can no longer measure it reliably with this study design and are reworking their approach. Personally, I don’t see how you can practically design an effective controlled experiment now that everyone uses these tools.

Worth noting too that, either way, these kinds of studies are still a narrow window on software delivery – individual task completion by autonomous open-source contributors, not developers working in teams on production codebases with all the organisational complexity that entails.

Multiple studies now suggest AI is genuinely increasing coding velocity – including CircleCI’s recent State of Software Delivery report. But the same report points to a more troubling pattern at the system level: less code reaching production and increasing instability.

My take: teams with strong engineering practices – genuine continuous delivery, high code quality, solid test coverage – appear to be realising real benefits from GenAI, and there’s data to support that. The problem is those teams represent a small fraction of the industry. For everyone else, higher coding velocity is likely resulting in a negative impact downstream.

More code, less delivery – but does the CircleCI 2026 Report really show 1 in 20 teams are benefiting?

CircleCI’s 2026 State of Software Delivery report has two findings that are already travelling: AI is meaningfully boosting software delivery, but only 1 in 20 teams are capturing that benefit. Both claims are more uncertain than the report suggests, for different reasons.

What the report is measuring

The report’s primary metric is “throughput” – the number of times a CI pipeline runs per day. A CI pipeline is the automated process teams use to build, test and progress code toward production. It is not production deployments, it is not features shipped. The report is using pipeline execution data to infer things about software delivery. That’s not unreasonable – it’s real data – but it’s worth understanding what’s actually being measured before drawing conclusions.

The headline numbers

The report measures throughput on both feature branches and main branches and aggregates both into its headline figures. Throughput as a metric on feature branches is effectively meaningless. Throughput should be an end-to-end metric – feature branches aren’t end-to-end, they get merged to main. The only meaningful “throughput” measure is against the main branch. What the feature branch data actually shows is a lot more code being written, but not much more reaching production.

  • Average teams are up 4% on the aggregated figure, but main branch throughput is down 7%
  • The top 10% of teams show aggregated throughput up nearly 50%, main branch essentially flat
  • For 95% of teams, AI is generating more work in progress that isn’t shipping

The success rate of main branch builds compounds this further. It has fallen to 70.8%, its lowest in over five years – nearly 30% of attempts to merge code for production are now failing.
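As a quick illustration of what that build success rate means in practice (assuming, as a simplification, that failed merge attempts are simply retried independently):

```python
# What a 70.8% main-branch build success rate implies.
# Assumes failed merges are retried independently – a simplification,
# but it shows the scale of the integration tax.
success_rate = 0.708
failure_rate = 1 - success_rate        # ~29% of merge attempts fail
expected_attempts = 1 / success_rate   # ~1.41 attempts per merged change

print(f"{failure_rate:.1%} of main-branch builds failing")
print(f"~{expected_attempts:.2f} pipeline attempts per successful merge")
```

Every piece of code reaching production is, on average, paying for nearly half a wasted pipeline run on the way in.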

The 1 in 20 claim

The report identifies the top 5% of teams as the only group seeing meaningful main branch throughput growth – 26% – and uses this to argue that some teams have cracked the AI delivery problem.

But the summary data for that group is odd. Their average CI pipeline duration is 6 seconds. For a pipeline doing anything meaningful – compiling, running tests, scanning – it’s hard to think of a single CI step that legitimately completes in 6 seconds. Perhaps it is an error in the report. There’s also data that may be skewing the findings more broadly – one team apparently running 130,000 CircleCI workflows a day would have an outsized effect on any aggregate figures.

What to take from it

The integration bottleneck finding is credible. If you’re generating code faster than your team can review and integrate it (safely), that’s a genuine problem, and this data is consistent with it.

The “1 in 20 teams have cracked it” conclusion is less solid than it appears. That’s not to say some teams aren’t getting real benefit – I believe they are – but the data for the teams making that case doesn’t add up clearly enough to draw confident lessons from.