
How to build reasoning models on a compute budget

Training AI models for reasoning requires resources that most enterprise teams do not have. Engineering teams are often forced to choose between distilling knowledge from large, expensive teacher models or relying on reinforcement learning techniques that provide only sparse feedback.

Researchers at JD.com and several academic institutions have recently introduced a new training paradigm that sidesteps this tradeoff. The technique, called Reinforcement Learning with Guaranteed Self-Distillation (RLSD), combines the reliable outcome signal of reinforcement learning with the granular, token-level feedback of self-distillation.

Experiments show that models trained with RLSD outperform those trained with classical distillation and reinforcement learning algorithms. For enterprise teams, the approach lowers the technical and financial barriers to building custom reasoning models tailored to specific business tasks.

The problem with training reasoning models

A common way to train reasoning models is Reinforcement Learning with Verifiable Rewards (RLVR). In this paradigm, the model learns by trial and error, guided by the final outcome of each attempt. An automatic verifier checks whether the model's answer is correct or incorrect and issues a binary reward, such as 0 or 1.
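
A minimal sketch of what such a verifiable reward can look like in practice. The `extract_final_answer` helper and the `\boxed{...}` answer convention are illustrative assumptions, not details from the paper:

```python
import re

def extract_final_answer(trace: str):
    """Pull the final answer out of a reasoning trace (assumed \\boxed{} format)."""
    match = re.search(r"\\boxed\{(.+?)\}", trace)
    return match.group(1).strip() if match else None

def verifiable_reward(trace: str, ground_truth: str) -> float:
    """Binary outcome reward: 1.0 if the final answer matches, else 0.0.
    Note that every token in the trace receives this same scalar signal."""
    answer = extract_final_answer(trace)
    return 1.0 if answer == ground_truth else 0.0
```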

RLVR's weakness is that this reward is sparse and uniform. “Standard GRPO has a signal congestion problem,” Chenxu Yang, an author of the paper, told VentureBeat. “A multi-thousand-token thought trace gets one binary reward, and all tokens within that trace get the same credit, whether it’s a critical logical step or a throwaway phrase.” As a result, the model never learns which intermediate steps contributed to its success or failure.

On-Policy Distillation (OPD) takes a different approach. Instead of waiting for the final result, developers pair a smaller student model with a larger, more capable teacher model. For each training example, the student's answer is compared with the teacher's, token by token. This gives the student granular feedback across the entire chain of thought and answer-generation process.
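
In code, this per-token feedback is typically a divergence between the two models' next-token distributions, computed over the student's own sampled trace. The sketch below uses a reverse KL divergence, a common choice for on-policy distillation; the paper's exact loss is not specified here:

```python
import torch
import torch.nn.functional as F

def per_token_distill_loss(student_logits: torch.Tensor,
                           teacher_logits: torch.Tensor) -> torch.Tensor:
    """Token-level reverse KL(student || teacher) over a sampled trace.

    Both tensors have shape (seq_len, vocab_size); the teacher is run
    over the trace the student itself generated (hence "on-policy").
    """
    s_logp = F.log_softmax(student_logits, dim=-1)
    t_logp = F.log_softmax(teacher_logits, dim=-1)
    # One KL value per position: dense feedback on every reasoning step,
    # in contrast to RLVR's single end-of-trace reward.
    return (s_logp.exp() * (s_logp - t_logp)).sum(dim=-1)
```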

Deploying and running a separate, large-scale teacher model alongside the student throughout training incurs significant compute overhead. “You have to keep a larger teacher model alive during training, which is about twice your GPU footprint,” Yang said. In addition, teacher and student models must share exactly the same vocabulary, which, according to Yang, “tacitly prohibits cross-architecture, cross-family, or enterprise-specific multilingual setups.”

The promise and failure of self-distillation

On-Policy Self-Distillation (OPSD) has emerged as a solution designed to overcome the shortcomings of the other two approaches. In OPSD, the same model plays the role of student and teacher.

During training, the student sees only the standard prompt, while the teacher additionally receives privileged information, such as a verified, step-by-step answer key. This privileged teacher version of the model then evaluates the student version, providing token-level feedback as the student attempts to solve the problem using only the standard prompt.
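
The teacher pass can be implemented as a second forward pass through the same weights with the privileged context prepended. The sketch below assumes a Hugging Face-style causal language model interface; the function name and prompt format are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def teacher_token_logprobs(model, tokenizer, question: str,
                           verified_solution: str,
                           student_tokens: torch.Tensor) -> torch.Tensor:
    """Score the student's sampled tokens under the privileged context."""
    prefix = tokenizer(
        f"Question: {question}\n"
        f"Reference solution: {verified_solution}\n"  # privileged info
        "Answer:",
        return_tensors="pt",
    ).input_ids
    full = torch.cat([prefix, student_tokens], dim=-1)
    # Logits at position i predict token i + 1, so shift by one to align
    # predictions with the student's tokens.
    logits = model(full).logits[:, prefix.size(-1) - 1 : -1]
    logp = F.log_softmax(logits, dim=-1)
    return logp.gather(-1, student_tokens.unsqueeze(-1)).squeeze(-1)
```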

OPSD seems to fit enterprise budgets well. It delivers the granular, step-by-step guidance of OPD, and because it eliminates the need for a separate teacher model, it approaches the computational efficiency and low cost of RLVR, requiring only an extra teacher forward pass.

However, the researchers found that OPSD suffers from something called “privileged information leakage.”

“The objective is structurally wrong,” Yang said. “There is an insurmountable mutual-information gap that the student cannot bridge… When self-distillation is set up as distribution matching, the student is asked to imitate the teacher’s full distribution conditioned on privileged context.”

Because the teacher scores the student against a hidden answer key, the training objective pushes the student model to reproduce the teacher's exact phrasing and steps rather than the underlying reasoning. As a result, the student model begins to make spurious references to a privileged solution that it will have no access to in real-world applications.

In practice, OPSD models show a rapid performance jump at the beginning of training, but their reasoning ability then peaks and gradually degrades as training continues.

Decoupling direction from magnitude with RLSD

The researchers behind RLSD observed that the two signals controlling how the model updates its parameters have asymmetric requirements. The signal indicating the direction of the update (that is, whether to reinforce or punish a behavior) can be sparse, but it must be completely reliable, because pointing the model in the wrong direction corrupts its reasoning policy.

The signal that sets the magnitude of the update (that is, how much relative credit or blame a particular token deserves), on the other hand, benefits from being dense, enabling step-by-step correction.

RLSD builds on this principle by separating the direction of the update from its magnitude. The framework lets the verified environmental feedback of the RLVR signal solely determine the direction of learning: the model receives positive reinforcement only if the final answer is exactly correct.

The self-distillation signal is stripped of its power to dictate what the model should produce. Instead, the token-by-token comparison between student and teacher is repurposed to set the size of the update, distributing credit or blame across the individual steps of the model's reasoning.
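
A minimal sketch of the decoupling, following the article's description: the verifier's binary reward fixes the sign of every token's advantage, while the gap between teacher and student log-probabilities only reweights its magnitude. The specific weighting scheme below (agreement-based, normalized per trace) is an assumption for illustration; the paper's exact formula may differ:

```python
import torch

def rlsd_token_advantages(binary_reward: float,
                          student_logp: torch.Tensor,
                          teacher_logp: torch.Tensor) -> torch.Tensor:
    """Per-token advantages with direction and magnitude decoupled.

    binary_reward: 0.0 or 1.0 from the verifier (direction only).
    student_logp / teacher_logp: per-token log-probs of the sampled trace
    under the standard and privileged contexts (magnitude only).
    """
    # Direction: the verifiable reward alone decides reinforce vs. punish.
    direction = 1.0 if binary_reward > 0 else -1.0
    # Magnitude: tokens the privileged teacher rates much higher than the
    # student's own confidence are treated as the load-bearing steps.
    weight = (teacher_logp - student_logp).clamp(min=0.0)
    weight = weight * weight.numel() / (weight.sum() + 1e-8)  # mean ~ 1
    return direction * weight
```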

This changes how the model learns compared to the classic OPSD paradigm. In typical OPSD, the training objective acts as behavioral cloning: the model is forced to copy the teacher's exact words and phrasing, which is what drives the student to leak references to privileged inputs it will never see in deployment.

Instead of forcing the model to copy the hidden solution, RLSD provides a natural and inexpensive source of credit information for each token.

“The concept: we don’t teach the model to think like the teacher,” Yang said. “We tell the model, on the trajectory it chose, which of its tokens were doing the work. The model’s sampling distribution is always its own. Only the credit distribution is adjusted.”

If a particular deduction strongly supports the correct result, it receives a higher score. If it's just a useless filler word, it gets a baseline score. RLSD thus eliminates the need to train separate reward models, manually annotate step-by-step data, or maintain large external teacher models.

Putting RLSD to the test

To test RLSD, the researchers trained the open-weight Qwen3-VL-8B vision-language model and evaluated it on several reasoning benchmarks, including MMMU for college-level, multi-discipline questions; MathVista; MathVision; WeMath; and ZeroBench, a stress-test benchmark designed to be nearly impossible for current frontier models.

They compared the RLSD-trained model against the base model with no additional training, conventional RLVR with the GRPO algorithm, conventional OPSD, and a combination of the two.

RLSD significantly outperformed all other methods, achieving the highest average accuracy, 56.18%, across the five benchmarks. It beat the base model by 4.69% and standard RLVR by 2.32%. The gains were most pronounced on complex mathematical reasoning tasks, where RLSD outperformed standard RLVR by 3.91% on the MathVision benchmark.

Beyond accuracy, the framework offers substantial efficiency benefits. “Actually, RLSD at 200 training steps already outperforms GRPO trained for 400 steps, so it’s about a 2x convergence speedup,” Yang said. “Cost-wise, the only overhead over a regular GRPO pipeline is one extra forward pass per response to capture the teacher logits. Compared to output generation… that’s free.”

Unlike OPSD, whose performance rose and then collapsed due to information leakage, RLSD maintained stability over long training runs and converged to a higher performance ceiling than conventional methods.

Qualitative results highlight how the method changes the model's learning behavior. In one complex visual computation task, for example, standard RLVR checks only the final answer and gives every token in the chain of thought the same reward. RLSD instead applied surgical rewards to the specific mathematical subtraction steps that solved the problem, while down-weighting generic filler text such as “When I look at the picture, I see…”.

In another example, the model drew an incorrect statistical inference from a bar chart. Instead of penalizing the entire response as a failure, RLSD concentrated the penalty on the span where the model misread the relationship in the chart, while remaining neutral on the logical setup that preceded it, recognizing that the original framing was valid.

This is especially important in messy, real-world business use cases. If a model makes one mistake while analyzing a 50-page quarterly earnings report, developers don't want to throw away its entire analytical framework; they want to correct the specific flawed step. RLSD lets the model learn exactly which reasoning steps matter and which are errors, token by token. And because RLSD does this with the model itself acting as the teacher, it delivers fine-grained reasoning supervision while keeping training costs reasonable.

How businesses can get started

For data engineering and AI platform teams, integrating RLSD is straightforward, but it requires the right setup. The most important prerequisite is a verifiable reward signal, such as code compilers, statistical testers, SQL execution, or schema validators. “Tasks that don’t have a verifiable reward (open-ended chat, creative writing) belong in preference-based pipelines,” Yang said.
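
As a concrete example of such a signal, here is a hedged sketch of an execution-based reward for text-to-SQL, one of the verifiable tasks listed above. It uses only Python's standard-library sqlite3 module; the function name and the exact-match criterion are illustrative choices:

```python
import sqlite3

def sql_execution_reward(generated_sql: str, reference_sql: str,
                         db_path: str) -> float:
    """1.0 if the generated query returns the same rows as the reference
    query, 0.0 on any mismatch or execution error."""
    conn = sqlite3.connect(db_path)
    try:
        got = sorted(conn.execute(generated_sql).fetchall())
        want = sorted(conn.execute(reference_sql).fetchall())
        return 1.0 if got == want else 0.0
    except sqlite3.Error:
        return 0.0  # broken SQL is simply an incorrect answer
    finally:
        conn.close()
```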

RLSD is also very flexible about the specific data it requires. While OPSD structurally requires full, verified reasoning traces, forcing businesses to pay for annotation or distill from a frontier model, RLSD does not.

“If you have fully verified reasoning traces, great, RLSD will use them,” Yang said. “If all you have is a ground-truth final answer, that works too… OPSD doesn’t have this flexibility.”

Integrating the method into an existing multimodal open-source RL framework such as verl or EasyR1 is straightforward. According to Yang, it requires no rewrite of the standard stack: the change amounts to a few dozen lines of code to modify the GRPO objective and add the teacher forward pass.
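
To give a sense of the shape of that change, here is a framework-agnostic sketch of a GRPO/PPO-style clipped policy loss consuming per-token advantages (such as those from the earlier sketch) instead of a single sequence-level advantage. verl and EasyR1 internals differ, so this is an assumption about where the modification lands, not a patch against either codebase:

```python
import torch

def clipped_policy_loss(logp_new: torch.Tensor,
                        logp_old: torch.Tensor,
                        token_advantages: torch.Tensor,
                        clip_eps: float = 0.2) -> torch.Tensor:
    """PPO-style clipped objective, but with token-level advantages
    (e.g., from rlsd_token_advantages) instead of one value per trace."""
    ratio = (logp_new - logp_old).exp()
    unclipped = ratio * token_advantages
    clipped = ratio.clamp(1 - clip_eps, 1 + clip_eps) * token_advantages
    # Negative sign: we minimize the loss to maximize the clipped objective.
    return -torch.minimum(unclipped, clipped).mean()
```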

Looking ahead, RLSD provides a powerful way for businesses to leverage their existing internal assets.

“The proprietary data companies hold within their perimeter (compliance manuals, internal documents, historical tickets, verified code snippets) is essentially free privileged information,” Yang concluded. “RLSD allows businesses to feed this kind of data directly in as privileged context, sharpening the learning signal for smaller models without requiring an external teacher and without sending anything outside the network.”
