Highlights:
- DeepSeek claims R1 outperforms OpenAI’s o1 on several reasoning benchmarks, while R1-Zero is less capable but marks a notable step in machine learning research.
- The MoE architecture reduces inference costs by activating only the neural network needed for a prompt, not the entire AI. As a result, R1 and R1-Zero use less than 10% of their 671 billion parameters per query.
DeepSeek has recently released the R1 series, a new family of large language models designed specifically for reasoning tasks.
The Chinese AI developer has published the models' weights on Hugging Face.
The LLM lineup features two flagship algorithms, R1 and R1-Zero. According to DeepSeek, R1 surpasses OpenAI’s o1 on multiple reasoning benchmarks, while R1-Zero, though less powerful, marks a potentially significant step forward in machine learning research.
Both LLMs utilize a mixture-of-experts (MoE) architecture with 671 billion parameters. An MoE model consists of multiple neural networks, each specialized for a distinct set of tasks. When the model receives a prompt, a routing mechanism directs the query to the neural network best suited to handle it.
The primary advantage of the MoE architecture is its ability to reduce inference costs. Instead of engaging the entire AI, an MoE model activates only the specific neural network needed to respond to a user’s prompt. Consequently, R1 and R1-Zero utilize less than one-tenth of their 671 billion parameters when processing prompts.
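The routing idea can be illustrated with a short sketch. The PyTorch code below is a generic top-k mixture-of-experts layer, not DeepSeek's implementation; the class name, layer sizes, expert count and `top_k` value are all illustrative assumptions.

```python
# Minimal illustration of top-k expert routing in a mixture-of-experts layer.
# Generic sketch only; dimensions, expert count and top_k are placeholder values.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SimpleMoELayer(nn.Module):
    def __init__(self, d_model=64, d_hidden=256, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        # One small feed-forward network per expert.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.ReLU(), nn.Linear(d_hidden, d_model))
            for _ in range(num_experts)
        )
        # The router scores every expert for each incoming token.
        self.router = nn.Linear(d_model, num_experts)

    def forward(self, x):  # x: (num_tokens, d_model)
        scores = self.router(x)                             # (num_tokens, num_experts)
        weights, chosen = scores.topk(self.top_k, dim=-1)   # keep only the best-suited experts
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        # Only the selected experts run for each token; the rest stay idle,
        # which is where the inference savings come from.
        for slot in range(self.top_k):
            for expert_id, expert in enumerate(self.experts):
                mask = chosen[:, slot] == expert_id
                if mask.any():
                    out[mask] += weights[mask, slot : slot + 1] * expert(x[mask])
        return out


layer = SimpleMoELayer()
tokens = torch.randn(10, 64)
print(layer(tokens).shape)  # torch.Size([10, 64])
```

Because each token passes through only `top_k` of the experts, most of the layer's parameters sit unused for any given prompt, which is the effect described above at full model scale.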
DeepSeek took an unusual approach to training R1-Zero, departing from the methods conventionally used for reasoning models.
Reasoning-optimized LLMs are usually trained using two techniques: reinforcement learning and supervised fine-tuning. Reinforcement learning trains an AI model to complete tasks through trial and error, while supervised fine-tuning enhances the AI’s output quality by supplying examples that demonstrate how to perform the task correctly.
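The two training signals can be contrasted in a toy example. The tiny lookup-table "model", the demonstration tokens and the reward rule below are all placeholders chosen for brevity; this is a conceptual sketch, not DeepSeek's training code.

```python
# Toy contrast between the two training signals described above, in PyTorch.
# The bigram "model", demonstration data and reward rule are placeholders.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
vocab_size = 16
# A minimal "language model": a table of next-token logits per current token.
logits_table = torch.zeros(vocab_size, vocab_size, requires_grad=True)

def next_token_logits(token_ids):
    return logits_table[token_ids]              # (seq_len, vocab_size)

# --- Supervised fine-tuning: learn from a worked example -------------------
demo = torch.tensor([3, 7, 2, 9])               # a demonstration of correct output
sft_loss = F.cross_entropy(next_token_logits(demo[:-1]), demo[1:])

# --- Reinforcement learning: trial and error against a reward --------------
prompt = torch.tensor([3])
probs = F.softmax(next_token_logits(prompt)[-1], dim=-1)
sampled = torch.multinomial(probs, 1)           # the "trial"
reward = 1.0 if sampled.item() == 7 else 0.0    # the "error" signal (placeholder rule)
rl_loss = -reward * torch.log(probs[sampled])   # reinforce rewarded behavior (REINFORCE)

(sft_loss + rl_loss).backward()                 # both signals update the same parameters
print(float(sft_loss), float(rl_loss))
```

The supervised term needs worked examples of correct outputs, while the reinforcement term only needs a way to score whatever the model produces on its own.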
DeepSeek bypassed the supervised fine-tuning stage during R1-Zero’s training. Despite this, the company successfully endowed the model with reasoning abilities, including breaking down complex tasks into simpler sub-steps.
DeepSeek's researchers wrote, “It is the first open research to validate that reasoning capabilities of LLMs can be incentivized purely through RL, without the need for SFT. This breakthrough paves the way for future advancements in this area.”
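One way such RL-only training can work is to score responses with simple programmatic rules rather than human-written demonstrations. The sketch below is a hypothetical reward function; the tag format and point values are illustrative assumptions, not a description of DeepSeek's exact reward design.

```python
# Hedged sketch of a rule-based reward for reasoning-style outputs: the score
# is computed by simple checks, not by a learned judge or supervised examples.
# Tag names and point values are illustrative assumptions.
import re

def rule_based_reward(response: str, reference_answer: str) -> float:
    reward = 0.0
    # Format rule: the model is asked to show its work inside <think> tags
    # and to give the final result inside <answer> tags.
    if re.search(r"<think>.+?</think>", response, re.DOTALL):
        reward += 0.5
    match = re.search(r"<answer>(.+?)</answer>", response, re.DOTALL)
    # Accuracy rule: the extracted final answer must match the known result.
    if match and match.group(1).strip() == reference_answer.strip():
        reward += 1.0
    return reward

response = "<think>6 * 7 = 42, then 42 + 1 = 43.</think><answer>43</answer>"
print(rule_based_reward(response, "43"))  # 1.5
```

A reward of this kind gives the trial-and-error loop something to optimize without anyone ever showing the model a worked solution.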
Despite its advanced features, R1-Zero has limitations in output quality. According to DeepSeek’s researchers, the model’s responses can occasionally exhibit issues such as “endless repetition, poor readability, and language mixing.”
The company developed R1 to overcome these limitations.
R1 is an improved version of R1-Zero, developed using an adjusted training workflow that incorporates supervised fine-tuning—a technique DeepSeek excluded during R1-Zero’s development. The company claims this modification greatly enhanced the output quality.
DeepSeek compared R1 with four widely used LLMs through nearly two dozen benchmark tests. The company reports that its model outperformed OpenAI’s reasoning-optimized o1 LLM on several benchmarks. In most cases where o1 scored higher, R1 lagged by less than 5%.
One of the benchmarks where R1 surpassed o1 is LiveCodeBench, a collection of programming tasks that is frequently updated with new practice problems. This reduces the chances of AI models finding pre-existing solutions on the public web.
In addition to R1 and R1-Zero, DeepSeek has open-sourced a set of less powerful yet more hardware-efficient models. These models were “distilled” from R1, meaning some of the LLM’s knowledge was transferred to them during training.
The distilled models vary in size from 1.5 billion to 70 billion parameters and are based on the Llama and Qwen open-source LLM families. According to DeepSeek, one of these distilled models, R1-Distill-Qwen-32B, outperforms the smaller OpenAI-o1-mini version of o1 on several benchmarks.
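Distillation can take several forms; one common version trains the smaller student model to match the output distribution of the larger teacher. The sketch below shows that generic logit-matching form as an illustration only, since the article does not specify DeepSeek's exact recipe; the tensor sizes and temperature are arbitrary placeholders, and the "teacher" logits stand in for what a model like R1 would produce.

```python
# Generic sketch of knowledge distillation: a small "student" is trained to
# match the softened output distribution of a larger "teacher". Sizes and
# temperature are placeholders; this is not DeepSeek's specific recipe.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
vocab_size, batch = 32, 4
teacher_logits = torch.randn(batch, vocab_size)                  # stand-in for the teacher model's outputs
student_logits = torch.randn(batch, vocab_size, requires_grad=True)

temperature = 2.0
# Soften both distributions, then pull the student toward the teacher.
teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
distill_loss = F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * temperature**2

distill_loss.backward()
print(float(distill_loss))
```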