Kimi K2.7-Code reduces thinking tokens by 30% – but doctors say that the measurements do not look

Moonshot AI released Kimi K2.7-Code this week, an open-source update to its K2 code model family, claiming leaner logic and double-digit performance gains.
K2.7-Code is built on the trillion-parameter hybrid architecture as the predecessor of K2.6, and comes with an OpenAI-compatible API – which is very important for teams already using K2.6 in the production gates.
When K2.6 launched in April, it topped OpenRouter’s weekly LLM leaderboard — a ranking based on actual API routing decisions by developers, not self-reported ranking scores.
Moonshot AI says K2.7-Code talks about what it calls "over thinking," reducing the consumption of the reasoning token by 30% compared to K2.6 – a number that can directly affect the reasoning costs of teams using agent workflows. Whether that efficiency gain holds up to independent benchmarks is a question that has already begun to be raised publicly.
What is Kimi K2.7-Code
K2.7-Code is released under a modified MIT license, which is available from HuggingFace. The model is executable with vLLM or SGlang. It works exclusively in imaging mode and doesn’t support temperature adjustment – Moonshot AI fixed it at 1.0, which means teams can’t fine-tune the output trim like they can with other models.
A key change from K2.6 is the way the model generates low-level code. Where K2.6 generated implementations by wrapping existing libraries and methods using established frameworks, the authors of K2.7-Code used them directly. Moonshot AI says this produces reliable generalizations across Rust, Go and Python, and across all types of tasks including frontend development, DevOps and performance optimization.
In benchmark performance, Moonshot AI claims gains of 21.8% in Kimi Code Bench v2, 11% in Program Bench and 31.5% in MLS Bench Lite. All three are proprietary benchmarks managed by Moonshot AI. The model has not been submitted to DeepSWE, an independent code benchmark that produces a spread of 70 points for all models – compared to a spread of 30 points for SWE-Bench Pro – making it a very discriminating signal for teams preparing routing models.
The more honest, the weaker it is
The image that comes out of the Moonshot benchmarks is very difficult.
Researcher Elliot Arledge ran K2.7-Code against K2.6 and Claude Fable 5 on KernelBench-Hard, a public benchmark focused on GPU kernel optimization, and published his full logs on kernelbench.com.
"K2.7 is very reliable but not very capable," Arledge wrote in X.
In five of the six problems, K2.7-Code produced actual Triton characters where K2.6 used library wrappers. Two of those kernels failed in the model’s own problems. The MoE kernel score dropped from the K2.6 score of 0.222 to 0.157.
"The myth, as a reference, suggests that all cells do not fail in reality," Arledge wrote.
Sugumaran Balasubramaniyan, a developer who built a model-task-router for the Hermes Agent platform using DeepSWE as his reference signal, publicly responded to the K2.7-Code release and challenged Moonshot AI directly on the benchmark selection.
"Respectfully, every model ‘improves’ double digits in its benchmark," Balasubramaniyan wrote in X.
He noted that K2.6 scored 24% in DeepSWE, combined with GPT-5.4-mini, and asked if Moonshot AI would move K2.7-Code to the same benchmark.
Balasubramaniyan said it took 13 rounds of revisions to get the benchmark data for his route and that he will move the coding jobs to K2.7-Code if the independent numbers hold.
What does this mean for businesses
The efficiency benefit of tokens is used immediately. Teams using K2.6 in production can exchange K2.7-Code with an OpenAI-compatible API and expect lower costs to refer to the agent’s work flow without architectural changes. Thinking token reduction of 30% is Moonshot’s own number, but the combination method has a small enough risk to check your work before committing.
A practical question is whether those efficiency gains hold for the distribution of group work itself. Running the K2.7-Code against your workload before adjusting the gate weights is a low-risk method of detection.



