
Moonshot AI Launches Kimi k1.5 Multimodal Model, Achieving o1 Parity Shortly After R1
Reinforcement learning (RL) has reshaped AI by enabling models to learn iteratively through interaction and feedback. Applied to large language models (LLMs), RL opens new opportunities for tasks that demand sophisticated reasoning, such as math problem-solving, programming, and multimodal data interpretation. Classical approaches depend heavily on pretraining over massive static datasets, and their weaknesses become apparent when models face problems that require dynamic exploration and adaptive decision-making.
The principal difficulty in advancing LLMs is scaling their capabilities while maintaining computational efficiency. Traditional pretraining on static data has not kept pace with the demands of complex, reasoning-heavy tasks. Moreover, existing RL implementations for LLMs have fallen short of state-of-the-art performance because prompt design, policy optimization, and data management were inefficient.
This has left a gap: modeling techniques that remain effective across benchmarks, particularly those requiring joint reasoning over text and images. Closing it calls for an end-to-end framework that aligns model optimization with task-driven needs while remaining token-efficient.
Previous approaches to enhancing LLMs include supervised fine-tuning and advanced reasoning methods such as chain-of-thought (CoT) prompting. CoT reasoning lets models decompose problems into intermediate steps, better equipping them to address challenging questions, as the sketch below illustrates. However, it is computationally intensive and typically bounded by the limited context window of traditional LLMs. Likewise, Monte Carlo tree search, a well-known method for enhancing reasoning, adds extra computational burden and complexity. The lack of scalable RL frameworks for LLMs has further limited progress, underscoring the need for a method that balances performance gains with efficiency.
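For readers new to the technique, here is a minimal, self-contained illustration of CoT prompting: the prompt includes a worked example whose reasoning is spelled out step by step, nudging the model to do the same for a new question. The exemplar and wording are illustrative, not taken from the Kimi k1.5 paper.

```python
# Chain-of-thought prompting: prepend a worked exemplar whose reasoning
# is written out explicitly, then ask the model to reason the same way.

def build_cot_prompt(new_question: str) -> str:
    exemplar = (
        "Q: A train travels 120 km in 1.5 hours. What is its average speed?\n"
        "A: Let's think step by step. Average speed = distance / time.\n"
        "   120 km / 1.5 h = 80 km/h. The answer is 80 km/h.\n"
    )
    return exemplar + f"\nQ: {new_question}\nA: Let's think step by step."

print(build_cot_prompt("If 3x + 5 = 20, what is x?"))
```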
Researchers on the Kimi Team have presented Kimi k1.5, a state-of-the-art multimodal LLM, to bridge these limitations by fusing RL with long-context capabilities. The model leverages long-context scaling, extending the context window to 128,000 tokens, so it can process large problem contexts effectively. Unlike earlier methods, Kimi k1.5 avoids dependence on complex strategies such as Monte Carlo tree search or value functions in favor of a streamlined RL setup. The researchers also used careful RL prompt-set curation to maximize the model's flexibility, spanning varied STEM, coding, and general-reasoning problems.
Two versions of Kimi k1.5 were developed:

- The long-CoT model: Excels at longer reasoning tasks, using its 128k-token context window to post state-of-the-art results. For example, it scored 96.2% on MATH500 and reached the 94th percentile on Codeforces, demonstrating that it can work through tough, multi-step problems.
- The short-CoT model: Optimized for efficiency through long-to-short context training techniques that transfer reasoning priors from the long-CoT model. It retains high performance, 60.8% on AIME and 94.6% on MATH500, while greatly reducing token usage. One such technique, shortest rejection sampling, is sketched below.
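Among the long-to-short techniques the report describes is shortest rejection sampling: sample several long-CoT responses to a prompt and keep the shortest correct one as fine-tuning data for the short model. Below is a minimal sketch; `generate` and `is_correct` are hypothetical helpers standing in for model sampling and answer checking.

```python
def shortest_rejection_sampling(prompt, generate, is_correct, n=8):
    """Sample n long-CoT responses and keep the shortest correct one.

    The (prompt, shortest correct response) pairs can then serve as
    supervised fine-tuning data for the short-CoT model.

    `generate(prompt)` returns one sampled model response;
    `is_correct(prompt, response)` checks the final answer against a
    reference. Both are assumed helpers, not APIs from the paper.
    """
    candidates = [generate(prompt) for _ in range(n)]
    correct = [c for c in candidates if is_correct(prompt, c)]
    if not correct:
        return None  # no usable training example for this prompt
    return min(correct, key=len)  # shortest correct response wins
```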
Performance Highlights

- MATH500: 96.2% exact-match accuracy; outperforms GPT-4o and Claude Sonnet 3.5 by wide margins
- Codeforces: 94th percentile; long-CoT model matches o1 performance across modalities
- AIME: 77.5% pass rate; short-CoT model outperforms GPT-4o and Claude Sonnet 3.5 by up to 550%
- LiveCodeBench: short-CoT model outperforms GPT-4o and Claude Sonnet 3.5 by up to 550%
- MathVista: long-CoT model matches o1 performance in multimodal reasoning
Core Features & Innovations
- Long-context scaling: RL with context windows up to 128k tokens
- Efficient training: Partial rollouts reduce computational load
- Policy optimization: Online mirror descent for faster convergence
- Sampling techniques: Prioritized and curriculum sampling
- Length penalties: Reward shaping that discourages overlong responses and keeps the model focused on relevant context (see the sketch below)
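To make the length-penalty idea concrete, here is a minimal sketch of a reward of this shape: within a batch of sampled responses to the same problem, shorter responses get a bonus and longer ones a penalty, and incorrect responses are never rewarded for brevity. The scaling follows the formula given in the k1.5 technical report, but treat the code as illustrative rather than a drop-in reimplementation; the helper names are ours.

```python
def length_rewards(lengths, is_correct):
    """Compute per-response length rewards for one problem.

    lengths:    token counts of the sampled responses.
    is_correct: parallel list of booleans (did the answer check out?).
    """
    min_len, max_len = min(lengths), max(lengths)
    span = max(max_len - min_len, 1)  # avoid division by zero
    rewards = []
    for n, ok in zip(lengths, is_correct):
        lam = 0.5 - (n - min_len) / span  # scaled into [-0.5, 0.5]
        # Correct answers receive the full signal; incorrect ones are
        # only ever penalized, never rewarded for being short.
        rewards.append(lam if ok else min(0.0, lam))
    return rewards

print(length_rewards([120, 300, 480], [True, True, False]))
# -> [0.5, 0.0, -0.5]
```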
Multimodal Reasoning
The training process blended RL, supervised fine-tuning, and long-chain reasoning into a robust problem-solving architecture. A key innovation is partial rollouts, which reuse cached trajectories from previously computed outputs to cut the computational cost of long-context processing. Multimodal data sources, including real-world and synthetic visual reasoning datasets, bolstered the model's capacity to reason across images and text. Advanced sampling techniques, such as prioritized sampling and curriculum sampling, targeted areas where the model performed weakly during training; a sketch of prioritized sampling follows.
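One way to implement prioritized sampling is to weight each problem by its current failure rate, so the model revisits problems it gets wrong more often. The sketch below is an illustration under our own assumptions; the class API and the moving-average bookkeeping are ours, not from the paper.

```python
import random

class PrioritizedSampler:
    """Sample training problems in proportion to their failure rate."""

    def __init__(self, problem_ids):
        # Start from a uniform prior: assume 50% success everywhere.
        self.success = {p: 0.5 for p in problem_ids}

    def sample(self):
        problems = list(self.success)
        # Weight ~ (1 - s_i): low success rate means higher priority.
        weights = [1.0 - self.success[p] for p in problems]
        return random.choices(problems, weights=weights, k=1)[0]

    def update(self, problem, solved, alpha=0.1):
        # Exponential moving average of the observed success rate.
        s = self.success[problem]
        self.success[problem] = (1 - alpha) * s + alpha * float(solved)

sampler = PrioritizedSampler(["p1", "p2", "p3"])
sampler.update("p1", solved=True)   # p1 now sampled less often
print(sampler.sample())
```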
The result is a model that demonstrates robust capability in multimodal problem-solving.

The research presented several key takeaways:

- Integrating visual and textual information allowed the model to perform well across benchmarks involving joint reasoning over multiple input types.
- The streamlined RL framework avoided the drawbacks of more resource-intensive approaches, delivering high performance with modest resource usage.
Kimi k1.5 sets a new standard for multimodal LLMs by combining RL with scalable context handling. Its balance of performance and efficiency makes it a compelling option for applications that require advanced reasoning.
Take the Next Step:
- Read the Full Analysis: Explore our blog to compare models side-by-side.
- Test the Models: Try DeepSeek-R1's API (10K free tokens) or OpenAI o1's playground to experience their capabilities firsthand.