Anthropic's Claude Mythos is improving faster than expected, reports AI security institute

ZDNET's key takeaways
- The latest version of Claude Mythos shows improved capabilities.
- Outside researchers found that it achieved several firsts in testing.
- AI capabilities may improve faster than expected.
Anthropic's Claude Mythos, a model the company has deemed too powerful to release generally, appears to have gained new capabilities.
In a blog post published Wednesday, the UK AI Security Institute (AISI) reported that it tested a new version of Mythos, which outperformed both the model's previous results and OpenAI's GPT-5.5, just a month after Mythos was first released.
Also: Apple, Google, and Microsoft join Anthropic’s Project Glasswing to protect the world’s most valuable software
“The new Mythos preview completed both of our cyber ranges, solving the ‘Ultimate’ range in 6 out of 10 attempts and the previously unsolved ‘Cooling Tower’ range in 3 out of 10 attempts,” the blog’s authors wrote. “It is the first time a model has completed the second of our two ranges.”
When Anthropic first announced the Mythos preview and Project Glasswing last month (a cybersecurity testing alliance it formed with rival tech companies and AI labs, which were given limited access to Mythos), the UK AISI tested the model and found that it “represents a step up from previous models in an environment where cyberspace has evolved rapidly.”
That third-party perspective helped temper claims that the hype surrounding Mythos was pure marketing or, on the other hand, that it reflected a catastrophic shift in AI capabilities. The truth about what the model can do probably lies somewhere in between.
Also: How to learn Claude Code for free with Anthropic’s AI tutorials – one took me just 20 minutes
The updated AISI testing is also an example of how capability improvements are not limited to entirely new model releases, but can occur between versions of a single model.
The fastest growing cyber threat
AISI noted that AI models are rapidly improving at carrying out cyber tasks, which has serious implications for cybersecurity, especially given Mythos' ability to detect software vulnerabilities.
“In February 2026, we internally estimated that the length of cyber tasks AI could complete had been doubling every 4.7 months since the end of 2024 – already a speed-up from the roughly 8-month doubling time we estimated in November 2025,” the blog authors wrote. “Since then, AISI has evaluated two new models, the Claude Mythos preview and [OpenAI’s] GPT-5.5, which significantly exceeded both doubling-time trends.”
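To get a feel for what such a doubling trend implies, here is a minimal back-of-the-envelope sketch in Python. The 60-minute starting horizon and the dates are hypothetical, chosen purely for illustration; this is not AISI's methodology or data.

```python
from datetime import date

def task_horizon(start_minutes: float, start: date, when: date,
                 doubling_months: float) -> float:
    """Extrapolate a task-length horizon that doubles every `doubling_months` months."""
    months_elapsed = (when.year - start.year) * 12 + (when.month - start.month)
    return start_minutes * 2 ** (months_elapsed / doubling_months)

# Hypothetical 60-minute horizon at the end of 2024, extrapolated to February 2026.
print(task_horizon(60, date(2024, 12, 1), date(2026, 2, 1), doubling_months=8.0))  # ~202 min
print(task_horizon(60, date(2024, 12, 1), date(2026, 2, 1), doubling_months=4.7))  # ~473 min
```

Over the same 14 months, shortening the doubling period from 8 months to 4.7 months more than doubles the extrapolated task length, which is why a change in the doubling rate matters so much.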
Also: Third major Linux kernel bug in two weeks found – thanks to AI
The authors added that it is not clear whether that trend will hold: these findings could point to a lasting acceleration, or Mythos and GPT-5.5 may simply be notable outliers from the overall pattern of model progress.
However, AISI clarified that there are several unknowns its tests cannot answer. The tests cap each run at 2.5 million tokens, which allows researchers to compare performance results over time, but that cap naturally “underestimates what frontier models can do,” they wrote.
“Mythos preview and GPT-5.5 have large error bars attached due to near-100% success rates on the long tasks in our small cyber suite, even with a 2.5M token limit,” the blog continues. “Our tasks are also not long enough to determine how much the models’ reliability degrades at longer task lengths. This puts some of the latest models at the limit of what can be measured with our small test suite.”
Also: I put GPT-5.5 through a 10-round test: Scored 93/100, losing points just for fun.
While this makes the models’ point of failure difficult to measure, it also means their success rates on these tasks would likely be even higher without a token cap, so high, in fact, that “time horizons are impossible to calculate.” Models given access to more tokens and more sophisticated agent infrastructure could be considerably more powerful.
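To see why near-ceiling success rates make a time horizon impossible to pin down, consider a minimal sketch. The task lengths and success rates below are hypothetical, and the log-linear interpolation is a simplification for illustration, not AISI's actual estimation method: a 50%-success horizon can only be located if the measured success rate actually drops below 50% somewhere in the suite.

```python
import math

def estimate_50pct_horizon(task_minutes, success_rates):
    """Interpolate the task length at which success falls to 50%.
    Returns None when every measured rate stays at or above 50%, i.e. the
    horizon lies beyond what the test suite can observe."""
    pairs = sorted(zip(task_minutes, success_rates))
    for (t0, s0), (t1, s1) in zip(pairs, pairs[1:]):
        if s0 >= 0.5 > s1:  # success crosses 50% between these two task lengths
            frac = (s0 - 0.5) / (s0 - s1)
            return math.exp(math.log(t0) + frac * (math.log(t1) - math.log(t0)))
    return None

# Hypothetical suites: near-100% success everywhere yields no measurable horizon.
print(estimate_50pct_horizon([30, 120, 480, 1440], [1.0, 0.9, 0.9, 0.8]))  # None
print(estimate_50pct_horizon([30, 120, 480, 1440], [1.0, 0.9, 0.6, 0.3]))  # ~692 minutes
```

In the first case the model solves even the longest tasks most of the time, so the data say nothing about where its limit lies, which is the situation AISI describes for Mythos preview and GPT-5.5 under the 2.5M token cap.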
“The 2.5M token limit is relatively low – in our cyber range testing we use up to 100M tokens and find that performance can keep improving beyond that budget, especially on the latest models, which benefit the most from higher token limits,” the blog added.



