Researchers automated LLM strategy design and cut token consumption by 69.5%

Test-time scaling (TTS) has emerged as a proven way to improve the performance of large language models in real-world applications by giving them more compute cycles at inference time. However, TTS techniques have historically been designed by hand, relying heavily on human intuition to determine the rules that govern the model’s reasoning.

To address this limitation, researchers from Meta, Google, and several universities have introduced AutoTTS, a framework that automatically discovers effective TTS strategies. This automated approach allows enterprises to optimize compute allocation dynamically without manually tuning heuristics.

By using the strategies discovered by AutoTTS, organizations can directly reduce token usage and the operational costs of deploying advanced reasoning models in production environments. In pilot experiments, AutoTTS managed inference budgets efficiently, reducing token usage by up to 69.5% without sacrificing accuracy.

The manual bottleneck in test-time scaling

Test-time scaling improves LLMs by giving them more compute when generating answers. This additional computation allows the model to generate multiple candidate solutions or evaluate its intermediate steps before arriving at a final answer.

A major challenge in designing TTS strategies is deciding how to allocate this additional computation. Historically, researchers have designed these strategies manually, relying on guesswork to create robust heuristics. Engineers must write rules and thresholds for when a model should branch out into new lines of reasoning, go deeper along an existing path, prune an unpromising branch, or stop reasoning altogether.

Because this manual tuning process is limited by human intuition, a large number of possible strategies remain unexplored. This often results in a suboptimal trade-off between model accuracy and computational cost.

Current TTS algorithms can be mapped onto a width-depth control space, where “width” is the number of reasoning branches explored and “depth” is how far each branch is extended. Self-consistency (SC) samples a fixed number of trajectories and takes a majority vote on the answer. Adaptive-consistency (ASC) saves compute by stopping early once a confidence threshold is reached. Parallel-Probe takes a more granular approach, pruning unpromising branches while deepening others. All three are designed by hand, and that is the constraint AutoTTS is designed to remove.
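The difference between a fixed-width strategy and an early-stopping one can be sketched in a few lines of Python. This is a minimal illustration, not the published algorithms: the vote-share stopping rule, the `threshold` and `min_samples` parameters, and the function names are assumptions standing in for the confidence criterion that ASC actually uses.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def self_consistency(answers: List[str]) -> str:
    """Majority vote over a fixed set of sampled answers (SC-style)."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    return Counter(answers).most_common(1)[0][0]


def adaptive_consistency(
    answer_stream: Iterable[str],
    threshold: float = 0.8,  # assumed agreement level that triggers a stop
    min_samples: int = 3,    # assumed minimum number of samples before stopping
) -> Tuple[str, int]:
    """Consume answers one at a time and stop once the leading answer's
    share of the votes reaches `threshold` (a simplified ASC-style rule).

    Returns the chosen answer and the number of samples actually used.
    """
    counts: Counter = Counter()
    used = 0
    for answer in answer_stream:
        counts[answer] += 1
        used += 1
        leader, votes = counts.most_common(1)[0]
        if used >= min_samples and votes / used >= threshold:
            return leader, used
    if not counts:
        raise ValueError("need at least one sampled answer")
    return counts.most_common(1)[0][0], used
```

With the eight samples `["42", "42", "17", "42", "42", "42", "9", "42"]`, the fixed-width vote reads all eight answers, while the early-stopping variant returns the same answer after five.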

Although some more advanced methods use richer structures such as tree search or external verifiers, they all share one important feature: they are carefully crafted by hand. This manual approach limits the scope of strategy discovery, leaving much of the possible space of resource allocation policies unexplored.

Automatic strategy discovery with AutoTTS

AutoTTS reframes how test-time scaling is optimized. Instead of treating strategy design as a human task, AutoTTS approaches it as an algorithmic search problem within a controlled environment.

This framework redefines the roles of both the human engineer and the AI model. Rather than hand-writing rules for when the LLM should branch, prune, or stop reasoning, the engineer’s role shifts to constructing the search environment. The engineer defines its parameters, including the control space of states and actions, the objective that balances accuracy against cost, and the specific feedback mechanisms.

An explorer LLM, such as Claude Code, designs the strategy. This explorer acts as an autonomous agent that proposes TTS “controllers.” These controllers are coded policies or algorithms that determine how the AI model allocates its compute budget at inference time. The explorer evaluates and refines these controllers based on feedback until it finds an effective resource allocation policy.

To make this automated search computationally feasible, AutoTTS relies on an “offline replay environment.” If the explorer had to ask the underlying reasoning model to generate new tokens every time it tested a new strategy, the computational cost would be astronomical. Instead, it relies on thousands of reasoning trajectories collected in advance from the base LLM. These trajectories include “probe signals,” which are intermediate answers that let the controller check progress across the different reasoning branches.

During the discovery loop, the explorer agent proposes a controller and tests it against this offline data. The agent inspects execution traces showing how the proposed controller allocated compute over time. By analyzing these traces, the agent can identify specific failure modes, such as a controller pruning branches too aggressively in a particular situation. This is more informative than looking only at the final outcome. The agent then iteratively rewrites the controller’s code to optimize the accuracy-cost trade-off.
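The idea of scoring a controller against cached trajectories, rather than generating new tokens, can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the AutoTTS code: the `Trajectory` and `Problem` classes, the stop-only controller interface, the lockstep deepening of all branches, and the linear `cost_weight` penalty are all hypothetical simplifications.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass
class Trajectory:
    probes: List[str]     # cached intermediate answers, one per checkpoint
    tokens_per_step: int  # tokens spent between consecutive checkpoints


@dataclass
class Problem:
    trajectories: List[Trajectory]
    gold: str             # reference answer used for scoring


# Hypothetical interface: the controller sees the latest probe answer of
# every branch and returns True when it wants to stop.
Controller = Callable[[Sequence[str]], bool]


def replay(problem: Problem, controller: Controller) -> Tuple[bool, int]:
    """Run one controller on one problem using only cached probes."""
    steps = min((len(t.probes) for t in problem.trajectories), default=0)
    if steps == 0:
        raise ValueError("problem has no cached probes to replay")
    tokens = 0
    latest: List[str] = []
    for step in range(steps):
        latest = [t.probes[step] for t in problem.trajectories]
        tokens += sum(t.tokens_per_step for t in problem.trajectories)
        if controller(latest):
            break
    answer = Counter(latest).most_common(1)[0][0]
    return answer == problem.gold, tokens


def evaluate(
    problems: Sequence[Problem],
    controller: Controller,
    cost_weight: float = 1e-6,  # assumed penalty per token
) -> Tuple[float, float, float]:
    """Return accuracy, mean tokens, and an assumed accuracy-minus-cost score."""
    if not problems:
        raise ValueError("need at least one problem")
    results = [replay(p, controller) for p in problems]
    accuracy = sum(ok for ok, _ in results) / len(results)
    mean_tokens = sum(tok for _, tok in results) / len(results)
    return accuracy, mean_tokens, accuracy - cost_weight * mean_tokens
```

Because `replay` only reads stored probes, an explorer can score many candidate controllers against the same data without calling the underlying reasoning model again. A real environment would also need actions for spawning and pruning branches, which this sketch leaves out.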

Inside the AI-designed controller

Because the explorer agent is not constrained by human intuition, it can discover complex, tightly coupled rules that a human engineer would be unlikely to write by hand. One advanced controller discovered by AutoTTS, named the Confidence Momentum Controller, uses several non-obvious mechanisms to manage compute:

Trend-based stopping: Hand-crafted techniques often instruct the model to stop reasoning when it reaches a certain instantaneous confidence threshold. The AutoTTS agent found that instantaneous confidence can be misleading because of temporary spikes. Instead, the controller tracks an exponential moving average (EMA) of confidence and stops only when the overall confidence level is high and the trend is not declining.
Coupled width and depth control: Human-designed algorithms usually treat the “widening” of the search with new lines of reasoning and the “deepening” of existing ones as separate decisions. AutoTTS discovered a closed feedback loop in which the two actions are linked. If the confidence of the current branches stalls or declines, the controller automatically triggers the spawning of new branches.
Alignment-aware depth allocation: Instead of giving all active reasoning branches an equal compute budget, the controller dynamically identifies which branches agree with the current leading answer. It then gives those branches a “burst” of additional computation. This anchors the compute budget to the emerging consensus so that it can be verified quickly.
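The three mechanisms above can be illustrated with a small Python sketch. It is not the released Confidence Momentum Controller: the class and function names, the smoothing factor `alpha`, the `stop_level` threshold, the use of answer agreement as the confidence signal, and the `base`/`burst` step counts are all assumptions chosen for illustration.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List


def agreement(answers: List[str]) -> float:
    """Confidence proxy: share of branches that agree with the leading answer."""
    if not answers:
        raise ValueError("need at least one branch answer")
    return Counter(answers).most_common(1)[0][1] / len(answers)


@dataclass
class MomentumSketch:
    alpha: float = 0.5       # assumed EMA smoothing factor
    stop_level: float = 0.8  # assumed EMA level required before stopping
    ema: float = 0.0
    prev_ema: float = 0.0
    readings: int = 0

    def update(self, confidence: float) -> None:
        """Fold the newest confidence reading into the moving average."""
        self.readings += 1
        if self.readings == 1:
            self.ema = self.prev_ema = confidence
        else:
            self.prev_ema = self.ema
            self.ema = self.alpha * confidence + (1 - self.alpha) * self.ema

    def decide(self) -> str:
        """Return 'stop', 'spawn' (widen), or 'deepen'."""
        if self.readings < 2:
            return "deepen"  # no trend is available yet
        trend = self.ema - self.prev_ema
        if self.ema >= self.stop_level and trend >= 0:
            return "stop"    # high level and a non-declining trend
        if trend <= 0:
            return "spawn"   # stalled or falling confidence widens the search
        return "deepen"


def allocate(answers: List[str], base: int = 1, burst: int = 3) -> List[int]:
    """Give branches that match the leading answer a larger step budget."""
    leader = Counter(answers).most_common(1)[0][0]
    return [burst if a == leader else base for a in answers]


controller = MomentumSketch()
decisions = []
for reading in [0.5, 1.0, 0.5, 1.0]:
    controller.update(reading)
    decisions.append(controller.decide())
print(decisions)  # -> ['deepen', 'deepen', 'spawn', 'stop']
```

In this walk-through, the second reading already exceeds the 0.8 threshold, but the moving average is only 0.75, so the sketch keeps going instead of stopping on a momentary spike. The drop on the third reading triggers a spawn, and the stop fires only once the average itself reaches 0.8125 on a rising trend.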

Cost savings and accuracy benefits in real-world benchmarks

To test whether AI can automatically find better test-time scaling strategies, the researchers set up a rigorous evaluation framework. The main experiments were run on Qwen3 models ranging from 0.6B to 8B parameters. The researchers also tested how well the system generalizes to an 8B distilled version of the DeepSeek-R1 model.

The explorer agent was first tasked with discovering a strategy on the AIME24 mathematical reasoning benchmark. The resulting strategy was then tested on two other math benchmarks, AIME25 and HMMT25, and on the graduate-level general reasoning benchmark GPQA-Diamond.

The resulting AutoTTS controller was compared against four widely used hand-crafted test-time scaling algorithms. These baselines were Self-Consistency with 64 samples (SC@64), Adaptive-Consistency (ASC), Parallel-Probe, and Early-Stopping Self-Consistency (ESC). ESC is a hybrid method that generates trajectories in parallel and stops early when the answer appears stable.

When set to a balanced, cost-aware mode, the controller discovered by AutoTTS reduced total token consumption by about 69.5% compared to SC@64. At the same time, the controller maintained the same average accuracy across all four Qwen models. When given a larger inference budget, AutoTTS achieved higher accuracy than all hand-crafted baselines in five out of eight test cases.

These efficiency gains carried over to other tasks. On the GPQA-Diamond benchmark, the balanced AutoTTS variant reduced the token cost from 510K tokens to only 151K tokens while slightly improving overall accuracy. For the DeepSeek model, AutoTTS achieved the highest overall accuracy on the HMMT25 benchmark while cutting token spend almost in half.
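As a quick check on what the GPQA-Diamond figures imply, the relative saving can be computed directly from the two token counts quoted above. The helper name is illustrative.

```python
def token_reduction(before: int, after: int) -> float:
    """Fraction of tokens saved when going from `before` to `after`."""
    if before <= 0:
        raise ValueError("`before` must be positive")
    return (before - after) / before


# GPQA-Diamond figures quoted in the article: 510K tokens down to 151K.
print(f"{token_reduction(510_000, 151_000):.1%}")  # -> 70.4%
```

That saving of roughly 70% on a non-math benchmark is in line with the 69.5% reduction reported against SC@64.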

For practitioners building enterprise AI applications, these experiments highlight two major operational benefits:

Higher peak performance: AutoTTS does not just save money on token usage. It also raises the maximum achievable performance of the base model. The AI-designed controller is remarkably good at detecting noisy or unproductive reasoning branches on the fly and proactively redirecting its compute budget to the branches that produce the most useful reasoning signals.
Inexpensive custom development: Because the framework relies on offline replay, the entire discovery process costs only $39.90 and takes 160 minutes. For enterprise teams, that means custom reasoning strategies tailored to proprietary models and internal tasks are now within reach without a dedicated research budget.

Both the AutoTTS framework and the Confidence Momentum Controller (CMC) are available on GitHub; CMC can be used as a drop-in replacement for other TTS controllers.

Mosegas 4 days ago
