Monthly Archives: April 2026

AI “Watershed Moment” or expensive pen tester? The AISI Mythos Data

The UK’s AI Security Institute has published the first independent evaluation of Claude Mythos’s cyber capabilities. The headline finding – first AI model to complete a full 32-step simulated network attack – is notable. But there’s a finding buried in the accompanying methodology paper that puts it in a rather different light. On current pricing and reliability, according to my maths, a human expert would do the same job cheaper, faster and more reliably.

What AISI found

On capture-the-flag tasks – common security challenges AISI have been using to test models since 2023 – Mythos sits broadly on the existing trend line. Real improvement, but incremental, and not unique to Mythos. The capability has been building across multiple labs for over a year.

The more significant result is with what AISI call “chained attacks” – where a model has to execute a long sequence of steps across a network to take it over, rather than exploit a single vulnerability in isolation. AISI measured this using their “The Last Ones” simulation: a 32-step corporate network attack spanning initial reconnaissance through to full network takeover, which they estimate a human expert would complete in 14 hours.

Mythos is the first model to complete all 32 steps end to end – though Opus 4.6, Anthropic’s previous model, wasn’t far behind in its best run.

The limitations & takeaways

The model was already inside the network, and the simulated environment had no active security monitoring and no defensive tools. Real networks aren’t like that – at least they shouldn’t be.

For most organisations the biggest threats remain phishing, weak passwords, and unpatched systems. AISI’s own advice in the article reflects this: focus on the basics – patch regularly, enforce access controls, enable logging. More importantly, the most common and successful attacks continue to target humans rather than rely on technical sophistication – as the Co-op, M&S and JLR attacks last year demonstrated.

The trajectory is real and worth taking seriously – but AISI’s findings are more measured than Anthropic’s “watershed moment” framing, and the most important things you can do about it are the same things you should have been doing anyway.

There’s a finding buried in the methodology

AISI published an accompanying academic paper detailing the evaluation methodology and results for models prior to Mythos – including detailed cost and timing data. This is where things get interesting.

According to that paper, the best Opus 4.6 run at 100M tokens cost approximately $80 and took around 10 hours – completing 22 of 32 steps, equivalent to roughly 6 of those 14 human hours. Slower, and less than halfway through in human time equivalent.

Mythos is priced at 5x Opus 4.6 per token. Its best run completed 32 steps versus Opus 4.6’s 22 – but crucially the additional steps fall in the later, harder milestones which are significantly more time and token-intensive. Accounting for both the price differential and likely higher token usage on those harder steps, a rough extrapolation puts a Mythos run at approximately $880.
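The extrapolation above can be sketched as back-of-envelope arithmetic. The $80 Opus 4.6 run cost and the 5x price multiplier come from the AISI material; the extra token multiplier for the harder later milestones is my own assumption, back-solved so the result lands near the ~$880 figure.

```python
# Rough per-run cost extrapolation for Mythos, from AISI's Opus 4.6 data.
# The 2.2x token multiplier is an assumption on my part, covering the
# extra, more token-intensive later milestones Mythos completes.
opus_run_cost = 80          # USD, best Opus 4.6 run (22 of 32 steps)
price_multiplier = 5        # Mythos priced at 5x Opus 4.6 per token
token_multiplier = 2.2      # assumed extra token usage on harder steps

mythos_run_cost = opus_run_cost * price_multiplier * token_multiplier
print(f"~${mythos_run_cost:.0f} per Mythos run")  # ~$880
```

Change the assumed token multiplier and the headline number moves with it – the point is the order of magnitude, not the exact figure.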

The variance problem

The paper shows all models have very high variance across runs. Opus 4.6’s best run reached 22 steps, its worst only 11, with an average of 15.6. And the AISI article shows Mythos only completed all 32 steps in 3 of its 10 attempts – a 70% failure rate on full completion.

To expect one successful outcome you’d need 3-4 runs on average – and each run is likely comparable in time to the Opus 4.6 runs.

That’s approximately $2,900-3,500 per successful outcome.

A human expert completing the same task: 14 hours, once, reliably. At $125-190 per hour (UK rates) that's roughly $1,750-2,660.
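Putting the two sides of the comparison together: with the ~$880 per-run extrapolation above and AISI's 3-in-10 completion rate, treating each run as an independent coin flip (a simplifying assumption on my part):

```python
# Expected cost per successful end-to-end AI run vs a human expert.
# Assumes independent attempts, so expected runs to one success is the
# geometric-distribution mean 1/p.
run_cost = 880              # USD per Mythos run, from the extrapolation above
success_rate = 3 / 10       # AISI: 3 full completions in 10 attempts

expected_runs = 1 / success_rate          # ~3.3 runs per success
ai_cost = expected_runs * run_cost        # ~$2,933 per successful outcome

# Human expert: 14 hours, once, at UK rates of $125-190/hour.
human_low, human_high = 14 * 125, 14 * 190
print(f"AI: ~${ai_cost:,.0f} vs human: ${human_low:,}-${human_high:,}")
```

Even taking the low end of the human rate band, the AI route only wins if per-token pricing or reliability improves materially.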

So at least for today, going on AISI's own data and assuming my maths is roughly right, an experienced human expert would be cheaper, more reliable and at least as quick as the most capable AI model currently available.

METR’s developer productivity research: 2026 update

You may be seeing posts claiming METR’s widely-cited 2025 study has been followed up with new research showing an 18% productivity boost. That’s not what the article says.

METR: We are Changing our Developer Productivity Experiment Design

In 2025, METR found experienced open-source developers using GenAI were 19% slower – and notably, developers themselves thought they were being sped up. They started a new experiment to track how things were changing – but couldn’t complete it. They say the data was too compromised to produce reliable results.

The interesting part is why the study broke down. Developers are now so reliant on AI that they won't work without it – which makes it impossible to staff a control group. And the way they work when using GenAI has changed too, which undermines simple measures such as time-on-task.

METR believe GenAI coding productivity is improving – but say they can no longer measure it reliably with this study design and are reworking their approach. Personally, I don't see how you could practically design an effective controlled experiment now that everyone uses these tools.

Worth noting too that, either way, these kinds of studies are still a narrow window on software delivery – individual task completion by autonomous open-source contributors, not developers working in teams on production codebases with all the organisational complexity that entails.

Multiple studies now suggest AI is genuinely increasing coding velocity – including CircleCI’s recent State of Software Delivery report. But the same report points to a more troubling pattern at the system level: less code reaching production and increasing instability.

My take: teams with strong engineering practices – genuine continuous delivery, high code quality, solid test coverage – appear to be realising real benefits from GenAI, and there’s data to support that. The problem is those teams represent a small fraction of the industry. For everyone else, higher coding velocity is likely resulting in a negative impact downstream.

More code, less delivery – but does the CircleCI 2026 Report really show 1 in 20 teams are benefiting?

CircleCI’s 2026 State of Software Delivery report has two findings that are already travelling: AI is meaningfully boosting software delivery, but only 1 in 20 teams are capturing that benefit. Both claims are more uncertain than the report suggests, for different reasons.

What the report is measuring

The report’s primary metric is “throughput” – the number of times a CI pipeline runs per day. A CI pipeline is the automated process teams use to build, test and progress code toward production. It is not production deployments, it is not features shipped. The report is using pipeline execution data to infer things about software delivery. That’s not unreasonable – it’s real data – but it’s worth understanding what’s actually being measured before drawing conclusions.

The headline numbers

The report measures throughput on both feature branches and main branches and aggregates both into its headline figures. Throughput as a metric on feature branches is effectively meaningless. Throughput should be an end-to-end metric – feature branches aren’t end-to-end, they get merged to main. The only meaningful “throughput” measure is against the main branch. What the feature branch data actually shows is a lot more code being written, but not much more reaching production.

  • Average teams are up 4% on the aggregated figure, but main branch throughput is down 7%
  • The top 10% of teams show aggregated throughput up nearly 50%, main branch essentially flat
  • For 95% of teams, AI is generating more work in progress that isn’t shipping
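A toy illustration of why aggregating feature-branch and main-branch pipeline runs can mislead. The numbers below are invented to mirror the shape of the reported figures, not taken from the report itself:

```python
# Hypothetical pipeline runs per day, before and after AI adoption.
# Invented numbers: feature-branch activity up, main branch down.
before = {"feature": 100.0, "main": 50.0}
after = {"feature": 109.5, "main": 46.5}

def pct_change(a, b):
    """Percentage change from a to b."""
    return (b - a) / a * 100

aggregate = pct_change(sum(before.values()), sum(after.values()))
main_only = pct_change(before["main"], after["main"])
print(f"aggregated throughput: {aggregate:+.0f}%, main branch: {main_only:+.0f}%")
```

The aggregate comes out at +4% even though main-branch throughput is down 7% – a rise in feature-branch churn alone is enough to make the headline number go up.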

The success rate of main branch builds compounds this further. It has fallen to 70.8%, its lowest in over five years – nearly 30% of attempts to merge code for production now fail.

The 1 in 20 claim

The report identifies the top 5% of teams as the only group seeing meaningful main branch throughput growth – 26% – and uses this to argue that some teams have cracked the AI delivery problem.

But the summary data for that group is odd. Their average CI pipeline duration is 6 seconds. It's hard to think of a single meaningful CI step – compiling, running tests, scanning – that legitimately completes in 6 seconds, let alone a whole pipeline. Perhaps it's an error in the report. There's also data that may be skewing the findings more broadly – one team apparently running 130,000 CircleCI workflows a day would have an outsized effect on any aggregate figures.

What to take from it

The integration bottleneck finding is credible. If you’re generating code faster than your team can review and integrate it (safely), that’s a genuine problem this data is consistent with.

The “1 in 20 teams have cracked it” conclusion is less solid than it appears. That's not to say some teams aren't getting benefit – I believe they are – but the data here for the teams making that case doesn't add up clearly enough to draw confident lessons from.