The UK’s AI Security Institute has published the first independent evaluation of Claude Mythos’s cyber capabilities. The headline finding – the first AI model to complete a full 32-step simulated network attack – is notable. But a finding buried in the accompanying methodology paper puts it in a rather different light. On current pricing and reliability, according to my maths, a human expert would do the same job cheaper, faster and more reliably.
What AISI found
On capture-the-flag tasks – common security challenges AISI have been using to test models since 2023 – Mythos sits broadly on the existing trend line. Real improvement, but incremental, and not unique to Mythos. The capability has been building across multiple labs for over a year.

The more significant result concerns what AISI call “chained attacks” – where a model has to execute a long sequence of steps across a network to take it over, rather than exploit a single vulnerability in isolation. AISI measured this using their “The Last Ones” simulation: a 32-step corporate network attack spanning initial reconnaissance through to full network takeover, which they estimate a human expert would complete in 14 hours.
Mythos is the first model to complete all 32 steps end to end – though Opus 4.6, Anthropic’s previous model, wasn’t far behind in its best run.

The limitations & takeaways
The model was already inside the network, and the simulated environment had no active security monitoring and no defensive tools. Real networks aren’t like that – at least they shouldn’t be.
For most organisations the biggest threats remain phishing, weak passwords, and unpatched systems. AISI’s own advice in the article reflects this: focus on the basics – patch regularly, enforce access controls, enable logging. More importantly, the most common and successful attacks continue to target humans rather than rely on technical sophistication – as the Co-op, M&S and JLR attacks last year demonstrated.
The trajectory is real and worth taking seriously – but AISI’s findings are more measured than Anthropic’s “watershed moment” framing, and the most important things you can do about it are the same things you should have been doing anyway.
There’s a finding buried in the methodology
AISI published an accompanying academic paper detailing the evaluation methodology and results for models prior to Mythos – including detailed cost and timing data. This is where things get interesting.
According to that paper, the best Opus 4.6 run at 100M tokens cost approximately $80 and took around 10 hours – completing 22 of 32 steps, equivalent to roughly 6 of those 14 human hours. Slower than a human, and less than halfway through in human-time terms.
Mythos is priced at 5x Opus 4.6 per token. Its best run completed 32 steps versus Opus 4.6’s 22 – but crucially the additional steps fall in the later, harder milestones, which are significantly more time- and token-intensive. The price differential alone takes an $80 run to $400; assuming the harder steps roughly double token usage on top of that, a rough extrapolation puts a single Mythos run at approximately $880.
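The extrapolation above can be sketched as a back-of-the-envelope calculation. The $80 Opus 4.6 run cost comes from AISI’s paper and the 5x multiplier from published pricing; the token multiplier for the harder late-stage steps is my own assumption, chosen to match the ~$880 estimate.

```python
# Back-of-the-envelope cost of a single Mythos run, extrapolated
# from AISI's Opus 4.6 figures. token_multiplier is an assumption
# (extra tokens spent on the harder later milestones), not an AISI number.
opus_run_cost = 80        # USD: best Opus 4.6 run at ~100M tokens (AISI paper)
price_multiplier = 5      # Mythos per-token price relative to Opus 4.6
token_multiplier = 2.2    # assumed token overhead for the later milestones

mythos_run_cost = opus_run_cost * price_multiplier * token_multiplier
print(f"Estimated cost per Mythos run: ${mythos_run_cost:,.0f}")
```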
The variance problem
The paper shows all models have very high variance across runs. Opus 4.6’s best run reached 22 steps, its worst only 11, with an average of 15.6. And the AISI article shows Mythos only completed all 32 steps in 3 of its 10 attempts – a 70% failure rate on full completion.
At a 30% full-completion rate, you’d need 3-4 runs on average to expect one successful outcome – and each run is likely comparable in time to the Opus 4.6 runs.
That’s approximately $2,900-3,500 per successful outcome.
A human expert completing the same range: 14 hours, once, reliably. At $125-190 per hour (UK rates) that’s roughly $1,750-2,660.
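Putting the pieces together – the rough $880 per-run estimate, the 3-in-10 completion rate, and UK day rates – a minimal sketch of the comparison looks like this (the per-run cost is my extrapolation, not an AISI figure):

```python
# Expected cost per successful end-to-end attack, AI vs human.
cost_per_run = 880            # USD: rough per-run estimate for Mythos (extrapolated)
success_rate = 3 / 10         # Mythos completed all 32 steps in 3 of 10 attempts
expected_runs = 1 / success_rate           # ~3.3 runs per full completion
ai_cost = cost_per_run * expected_runs     # expected cost per successful outcome

human_hours = 14              # AISI's human-expert estimate for the full attack
rate_low, rate_high = 125, 190             # USD per hour, UK rates
human_low, human_high = human_hours * rate_low, human_hours * rate_high

print(f"AI:    ~${ai_cost:,.0f} per successful outcome")
print(f"Human: ${human_low:,}-{human_high:,}, once, reliably")
```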
So at least today, according to AISI’s data, and assuming my maths is roughly correct, an experienced human expert would be cheaper, more reliable and at least as quick as the most capable AI model currently available.

