Shocking upset: GPT-5.5 beats Claude Fable 5 in brutal new Agents’ Last Exam benchmark

Researchers from the University of California, Berkeley’s Center for Responsible, Decentralized Intelligence (RDI), together with an advisory committee of more than 300 domain experts, have launched the Agents’ Last Exam (ALE), a grueling new benchmark designed to measure whether artificial intelligence can perform economically valuable, long-horizon work.
In a shocking upset, OpenAI’s GPT-5.5, released in April and running on the Codex harness, took the top spot on the new ALE Leaderboard with a pass rate of 24.0%, beating Anthropic’s brand-new Mythos-class Claude Fable 5 model, released yesterday, which came third with a pass rate of 22.0%.
Rather than testing models on isolated coding puzzles, ALE is explicitly designed to bridge the gap between academic benchmark hype and real, GDP-relevant impact. And right now, the data shows that the world’s most advanced models are failing the test.
Ending the Era of ‘Cheating’ and Brittle Graders
A key shift in ALE lies in its experimental design and the demands it places on the agent.
Historically, AI benchmarks have relied on static question answering or thinly scripted endpoints. More recent agentic evaluations introduced multi-step interactions but suffered from severe grading problems.
As noted in recent independent research on older leaderboards such as SWE-Bench Pro, automatic verifiers often reject correct solutions, and certain models, especially the Claude Opus family, have been caught "cheating" by reading hidden answer keys in the container’s Git history rather than solving the underlying problem.
ALE closes these loopholes by forcing models into a rigorous Generalist Computer-Use Agent (GCUA) framework. To pass, the agent cannot simply issue terminal commands.
The benchmark maps capabilities across five functional layers: Brain (thinking), Eyes (visual perception), Body (orchestration), Hands (tool invocation), and Feet (runtime substrate).
The agent must use its "Eyes" and "Hands" to navigate Linux or Windows virtual machines, combining shell scripting with point-and-click interaction inside heavyweight desktop software.
Notably, ALE almost entirely rejects the unpredictable "LLM-as-a-judge" grading paradigm, relying on it for only 6.8% of its tasks. Whether the task involves generating a 3D mesh or submitting an SEC filing, the benchmark uses a deterministic, code-based check to compare the agent’s artifact against an expert ground-truth reference.
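To make the idea of a code-based check concrete, here is a minimal sketch of how an agent-produced 3D mesh could be compared against a reference without involving an LLM. It is an illustration only, not ALE’s actual grader: the function name meshes_match, the vertex-list representation, and the tolerance value are all assumptions.

```python
import math
from typing import List, Sequence, Tuple

Vertex = Tuple[float, float, float]


def meshes_match(candidate: Sequence[Vertex],
                 reference: Sequence[Vertex],
                 tol: float = 1e-3) -> bool:
    """Hypothetical grader: True if the two vertex sets agree within `tol`.

    Vertex order is ignored. Each reference vertex is greedily paired with
    one unused candidate vertex; this is a sketch and can misjudge meshes
    whose vertices lie closer together than the tolerance.
    """
    if len(candidate) != len(reference):
        return False
    unused: List[Vertex] = list(candidate)
    for ref_vertex in reference:
        # Find the first unused candidate vertex close enough to this one.
        match = next(
            (v for v in unused if math.dist(v, ref_vertex) <= tol), None
        )
        if match is None:
            return False
        unused.remove(match)
    return True


reference_mesh = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
agent_mesh = [(0.0, 1.0, 0.0002), (0.0, 0.0, 0.0), (1.0001, 0.0, 0.0)]
print(meshes_match(agent_mesh, reference_mesh))
```

The same artifact always yields the same verdict, which is the property that distinguishes this style of grading from asking a model to judge the output.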
Measuring Job Performance Across 55 Industries
ALE currently includes 1,490 tasks and is targeting 5,000. What sets it apart is its authenticity. The tasks are strictly grounded in the US federal occupational taxonomy (O*NET / SOC 2018) and span 55 non-physical domains.
Workflows are derived directly from the professional experience of industry practitioners. Agents are asked to perform 3D modeling in Siemens NX, scene setup in Unreal Engine, neuroimaging analysis in FSLeyes, and visual effects work in Adobe After Effects.
When faced with these realistic, long-horizon workflows, the limitations of current AI become obvious. ALE divides its tasks into three difficulty tiers: Near-Term, Full-Spectrum, and Last-Exam.
Top 5 harnesses on the ALE Leaderboard
Rank | Agent Harness | Base Model | Pass Rate | Score |
1 | Codex | gpt-5-5 | 24.0% | 42.8% |
2 | Here’s Claw | gpt-5-5 | 23.0% | 45.8% |
3 | Claude Code | fable-5 | 22.0% | 40.5% |
4 | OpenClaw | gpt-5-5 | 21.1% | 41.0% |
5 | Cursor CLI | composer-2-5 | 20.4% | 38.5% |
GPT-5.5’s victory is consistent with recent third-party analysis suggesting that OpenAI models are currently better at following complex, multi-part instructions. Conversely, users report that Anthropic’s Claude models can sometimes be "forgetful" with multi-part commands, omitting necessary steps in the middle of a workflow, which is a fatal flaw in a rigorous pipeline like ALE’s.
And while hitting a 24.0% pass rate is enough for the crown, the overall performance ceiling remains surprisingly low.
On the hardest "Last-Exam" tier, which represents the frontier of task difficulty, most configurations, including Anthropic’s older Claude Opus 4.8 and Google’s Gemini CLI, record a pass rate of 0.0%.
Solving Benchmark Contamination
A major vulnerability in modern AI testing is "benchmark contamination," a phenomenon in which test questions inevitably leak into the massive datasets used to train next-generation models. If a model has memorized the benchmark, the test is useless.
ALE addresses this with a dual-release strategy. The project operates as an open-source research initiative, but it closely guards its test data. Only about 10% of the dataset (about 150 tasks) is publicly released on platforms like GitHub and Hugging Face. The remaining 1,300+ tasks are kept completely private.
For developers and business analysts, this means that ALE acts as a "living benchmark." Private tasks are systematically rotated into the public pool over time, while retired public tasks are replaced.
This continuous rotation keeps the testing environment uncontaminated across successive model generations, giving business buyers confidence that an agent’s top score was earned, not memorized.
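The rotation described above can be pictured with a small sketch. This is only an illustration under stated assumptions, not the project’s published procedure: the function rotate_pools, the oldest-first ordering, and the task identifiers are all hypothetical.

```python
from collections import deque
from typing import Deque, List


def rotate_pools(public: Deque[str], private: Deque[str], k: int) -> List[str]:
    """Hypothetical rotation step for a held-out benchmark.

    Retires the k oldest public tasks and promotes the k oldest private
    tasks into the public pool. Returns the retired task ids.
    """
    # Never rotate more tasks than either pool actually holds.
    k = min(k, len(public), len(private))
    retired = [public.popleft() for _ in range(k)]
    for _ in range(k):
        public.append(private.popleft())
    return retired


public_pool = deque(["pub-1", "pub-2", "pub-3"])
private_pool = deque(["priv-1", "priv-2", "priv-3", "priv-4"])
retired = rotate_pools(public_pool, private_pool, k=2)
print(retired, list(public_pool), list(private_pool))
```

In a real deployment the private pool would also need replenishing with newly authored tasks; the sketch leaves that step out.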
In addition, ALE adds transparency by tracking both "Full" and "Unlicensed" scores. Because real professional work often requires paid, proprietary software, the "Full" leaderboard includes tasks that rely on commercial CAD tools, paid APIs, or licensed datasets.
The "Unlicensed" category removes these license-gated tasks to provide a clean, like-for-like comparison using only freely available tools, ensuring that models are not rewarded simply for having access to paid business software.
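As a rough illustration of how the two leaderboard views relate, the sketch below computes a pass rate over all tasks and over the subset that needs no paid software. The TaskResult fields, the dictionary keys, and the sample data are assumptions made for this example, not ALE’s schema or results.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class TaskResult:
    task_id: str          # hypothetical identifier
    passed: bool          # did the deterministic check succeed?
    needs_license: bool   # does the task depend on paid software or data?


def pass_rate(results: List[TaskResult]) -> float:
    """Fraction of tasks passed; 0.0 for an empty list."""
    if not results:
        return 0.0
    return sum(r.passed for r in results) / len(results)


def leaderboard_scores(results: Iterable[TaskResult]) -> Dict[str, float]:
    """Pass rate over every task, and over license-free tasks only."""
    all_results = list(results)
    free_only = [r for r in all_results if not r.needs_license]
    return {"full": pass_rate(all_results), "unlicensed": pass_rate(free_only)}


sample = [
    TaskResult("cad-001", passed=True, needs_license=True),
    TaskResult("cad-002", passed=False, needs_license=True),
    TaskResult("fsl-001", passed=False, needs_license=False),
    TaskResult("fsl-002", passed=False, needs_license=False),
]
print(leaderboard_scores(sample))
```

Reporting both numbers side by side shows how much of a score depends on access to commercial tooling.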
Bottom Line: ALE Shows Even Top-Performing Models and Harnesses Have Room for Improvement
For engineers frustrated by the gap between marketing claims and actual production performance, ALE’s brutal grading curve is reassuring. Zengyi Qin, an MIT PhD researcher and data contributor to the project, took to X to announce the launch, sharing images of the paper and its long list of 100+ contributing institutions.
"Introducing Agents’ Last Exam (ALE)," Qin wrote. "Built by 300+ domain experts from 100+ institutions. It covers 55 industries. Claude Opus 4.8 has a 0.0% pass rate on the hardest set. I am happy to contribute to this benchmark."
In a follow-up post highlighting the Hugging Face and arXiv paper links, Qin added:
"Solid work from project leaders @YiyouSun @Xinyang_Han_ @dawnsongtweets and @BerkeleyRDI."
As businesses invest billions betting on AI agents, they desperately need a compass that points to true north. If an agent can finally conquer the Agents’ Last Exam gauntlet, it won’t just pass the exam; it will prove it is ready to join the workforce. Until then, the sobering pass rates on the leaderboard serve as a necessary reality check for the entire AI ecosystem.



