Alibaba’s Metis agent reduces redundant AI tool calls from 98% to 2% – and is more accurate at making them

One of the key challenges of building effective AI agents is teaching them when to use external tools and when to rely on their own internal knowledge. But large language models are often trained to use tools indiscriminately, causing delays, unnecessary API costs, and reasoning corrupted by contextual noise.
To overcome this challenge, researchers at Alibaba introduced Hierarchical Decoupled Policy Optimization (HDPO), a reinforcement learning framework that trains agents to optimize for both task accuracy and tool-use efficiency.
Metis, the multimodal model they trained with this framework, cuts redundant tool calls from 98% to just 2% while delivering top or highly competitive accuracy across key reasoning benchmarks. The framework curbs trigger-happy AI agents that don't know when to avoid using tools, paving the way for more responsive and cost-effective agentic systems.
Metacognitive deficits
Current agent models suffer from what the researchers call a “deep metacognitive deficit”: they have a hard time deciding when to rely on their internal parametric knowledge and when to consult an external resource. As a result, they blindly call tools and APIs, such as web search or code execution, even when the user prompt already contains all the information needed to solve the task.
This trigger-happy tool-calling behavior creates severe performance bottlenecks for real-world applications. Because the models are trained to focus almost entirely on completing the task, they pay no attention to latency and tend to exhibit very high tool call rates. Every unnecessary external API call adds processing delay and cost, turning a technically competent model into a sluggish system that frustrates users and burns through tooling budgets.
At the same time, burning compute on overusing tools doesn’t translate into better reasoning. Unnecessary tool interactions add noise to the model’s context. This noise can distract the model, break its logical chain of thought, and actively degrade the final output.
To address the latency and cost of blind tool use, previous reinforcement learning methods tried to penalize excessive tool calls by combining task accuracy and efficiency into a single reward signal. However, this entangled design creates an intractable optimization problem. If the efficiency penalty is too aggressive, the model becomes overly frugal and suppresses valuable tool calls, sacrificing accuracy on difficult tasks. Conversely, if the penalty is too mild, the efficiency signal loses its force and fails to curb excessive tool use on simple tasks.
In addition, this shared reward creates semantic ambiguity: an incorrect trajectory that makes no tool calls can end up scoring about the same as a correct trajectory that made several necessary calls. Because the training signals for accuracy and efficiency are entangled, the model cannot learn to rein in tool use without degrading its reasoning ability.
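To make the problem concrete, here is a minimal, hypothetical sketch (not the paper's actual reward function) of how a single blended reward can conflate correctness with tool economy:

```python
# Illustrative sketch only: a single blended reward conflates correctness and
# tool economy, so very different behaviors can receive similar scores.

def blended_reward(correct: bool, tool_calls: int, penalty: float = 0.1) -> float:
    """Single scalar reward: task accuracy minus a per-call efficiency penalty."""
    return (1.0 if correct else 0.0) - penalty * tool_calls

# A wrong answer that skipped tools...
print(blended_reward(correct=False, tool_calls=0))                 # 0.0
# ...can outscore a correct answer that needed several calls,
# depending entirely on how the penalty coefficient is tuned.
print(blended_reward(correct=True, tool_calls=10, penalty=0.12))   # -0.2
```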
Hierarchical decoupled policy optimization
To solve the optimization problems of combined rewards, the researchers introduced HDPO, which separates accuracy and efficiency into two independent optimization channels. The accuracy channel focuses on improving the correctness of the model’s responses, while the efficiency channel encourages economical tool use.
HDPO calculates the training signals from these two channels independently and combines them only in the final step of the loss calculation. The efficiency signal is conditioned on correctness, which means a wrong answer is never rewarded for being faster or using fewer tools. This separation avoids situations where the accuracy and efficiency gradients cancel each other out, giving the model clean learning signals for both objectives.
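A minimal sketch of the decoupling idea, assuming a simple gated-reward form (the released HDPO implementation may compute and combine its signals differently):

```python
# Assumed, simplified form: accuracy and efficiency are scored on separate
# channels, and the efficiency bonus is gated on correctness so a wrong answer
# never profits from being cheap.

def hdpo_style_reward(correct: bool, tool_calls: int,
                      max_calls: int = 8, efficiency_weight: float = 0.3) -> float:
    accuracy = 1.0 if correct else 0.0
    # Efficiency is computed independently: fewer calls -> higher score in [0, 1].
    efficiency = max(0.0, 1.0 - tool_calls / max_calls)
    # Gate: the efficiency term only contributes when the answer is correct.
    return accuracy + (efficiency_weight * efficiency if correct else 0.0)

print(hdpo_style_reward(correct=True, tool_calls=0))   # 1.3: correct and tool-free
print(hdpo_style_reward(correct=True, tool_calls=8))   # 1.0: correct but expensive
print(hdpo_style_reward(correct=False, tool_calls=0))  # 0.0: wrong, no credit for thrift
```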
A powerful emergent property of this decoupled design is that it creates a natural learning curriculum. Early in training, when the model is still struggling with the task, optimization is dominated by the accuracy objective, which forces the model to prioritize sound reasoning and knowledge. As the model’s reasoning ability improves and it consistently reaches correct answers, the efficiency signal phases in smoothly. The model first learns to solve the task, then refines its confidence by avoiding unnecessary, expensive API calls.
To complement HDPO, the researchers developed a rigorous, multi-stage data curation pipeline that addresses major flaws found in tool-augmented datasets. The pipeline covers both the supervised fine-tuning (SFT) and reinforcement learning (RL) stages.
In the SFT phase, they gathered publicly available tool-augmented multimodal trajectories and filtered them to remove low-quality examples containing failed tool executions or conflicting answers. They also filtered out any training sample that the underlying model could solve directly without tools. Finally, using Google’s Gemini 3.1 Pro as an automated judge, they pruned the SFT corpus to retain only examples that demonstrated strategic tool use.
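The “solvable without tools” filter could look roughly like the following sketch, where `base_model_answer`, `samples`, and the field names are hypothetical stand-ins for the team’s actual interfaces:

```python
# Hedged sketch of the "solvable without tools" filter described above.

def needs_tools(sample: dict, base_model_answer) -> bool:
    """Keep a trajectory only if the base model cannot answer it tool-free."""
    direct_answer = base_model_answer(sample["prompt"])   # no tools allowed
    return direct_answer.strip() != sample["gold_answer"].strip()

def filter_sft_corpus(samples: list[dict], base_model_answer) -> list[dict]:
    return [s for s in samples if needs_tools(s, base_model_answer)]
```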
In the RL phase, curation focuses on ensuring a stable training signal. The team filtered out samples with corrupted images or ambiguous ground truths. Because HDPO relies on comparing correct and incorrect rollouts, a prompt the model always solves (too easy) or always fails (too hard) offers no meaningful contrast to learn from. The team therefore retained only prompts that produced a non-trivial mix of successes and failures, ensuring a usable gradient signal (see the sketch below).
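A rough illustration of that difficulty filter, with `run_rollout` as a hypothetical function that returns True when a sampled rollout answers the prompt correctly:

```python
# Sketch of the RL difficulty filter: keep only prompts where repeated rollouts
# produce a mix of successes and failures, so the comparison signal is non-zero.

def has_learnable_signal(prompt: str, run_rollout, n_rollouts: int = 8) -> bool:
    successes = sum(run_rollout(prompt) for _ in range(n_rollouts))
    # Drop prompts the model always solves (too easy) or never solves (too hard).
    return 0 < successes < n_rollouts

def filter_rl_prompts(prompts: list[str], run_rollout) -> list[str]:
    return [p for p in prompts if has_learnable_signal(p, run_rollout)]
```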
Metis agent: HDPO in action
To test HDPO, the researchers used the framework to develop Metis, a multimodal reasoning agent equipped with coding and search tools. Metis is built on the Qwen3-VL-8B-Instruct vision language model. The researchers trained it in two phases. First, they applied SFT on their curated data to provide a cold start. Next, they ran RL with the HDPO framework, exposing the model to multi-turn interactions in which it could use tools such as Python code execution, text search, and image search.
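As a rough illustration, those three tools could be exposed to the agent through a schema like the one below; the names and fields are assumptions for the example, not the actual Metis interface:

```python
# Hypothetical tool schema for the three tools mentioned above.
TOOLS = [
    {
        "name": "python_executor",
        "description": "Run Python code, e.g. to crop or zoom into an image region.",
        "parameters": {"code": "string"},
    },
    {
        "name": "text_search",
        "description": "Search the web for textual information.",
        "parameters": {"query": "string"},
    },
    {
        "name": "image_search",
        "description": "Retrieve images related to a query.",
        "parameters": {"query": "string"},
    },
]
```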
The researchers pitted Metis against standard open-source vision language models such as LLaVA-OneVision, text-only reasoning models, and advanced agent models including DeepEyes V2 and the 30-billion-parameter Skywork-R1V4. Testing spanned two main areas: visual perception and document understanding benchmarks such as HRBench and V*Bench, and rigorous mathematical and logical reasoning tasks such as WeMath and MathVista.
Across these tasks, Metis achieved the best or highly competitive performance, matching or surpassing the strongest available agent models – including the much larger 30-billion-parameter Skywork-R1V4 – on both perception and reasoning benchmarks.
Equally important is the adaptive behavior Metis displayed in the experiments. For example, when presented with a picture of a museum sign and asked what the text in the middle says, standard agent models waste time blindly writing Python to crop the image just to read it. Metis, however, can see that the text is clearly legible in the raw image, so it skips the tools entirely and answers in a single inference pass.
In another experiment, the model was presented with a complex chart and asked to identify the second-highest line at a particular data point within a subplot. Metis recognized that this fine-grained visual analysis exceeded its native perceptual resolution and that it could not reliably distinguish the overlapping lines. Instead of guessing from the full image, it used Python to crop and zoom in on that small region, allowing it to correctly identify the line (a sketch of that kind of crop-and-zoom step appears below). It treats code as a precision instrument to be used only when the visual evidence is unclear, not as an automatic reflex.
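The crop-and-zoom step the agent generates could look something like this Pillow snippet, where the file names and coordinates are made up for the example:

```python
# Illustrative crop-and-zoom of the kind described above, using Pillow.
from PIL import Image

img = Image.open("chart.png")
# Crop the small region containing the ambiguous data point, then upscale it
# so overlapping lines become distinguishable.
region = img.crop((420, 310, 520, 390))   # (left, upper, right, lower)
zoomed = region.resize((region.width * 4, region.height * 4), Image.LANCZOS)
zoomed.save("chart_zoom.png")
```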
The researchers released the Metis model and HDPO code under the permissive Apache 2.0 license.
“Our results show that strategic tool use and strong cognitive performance are not a trade-off; rather, eliminating noisy, unwanted calls directly contributes to higher accuracy,” the researchers concluded. “More broadly, our work suggests a paradigm shift in tool-enhanced learning: from simply teaching models how to use tools, to cultivating meta-cognitive intelligence on when to stop.”



