What is DeepSeek? The Technological Breakthroughs Behind DeepSeek R1
February 5, 2025
DeepSeek R1 is a large language model (LLM) with strong reasoning capabilities, comparable to OpenAI's o1, but developed at a significantly lower cost and released under an open license for free use. By leveraging advanced training techniques and architectural optimizations, DeepSeek R1 can tackle complex problems in areas such as mathematics and programming with high accuracy. Its introduction marks a significant step toward powerful, cost-efficient, and more accessible AI models for the community.
This exclusive article by Pham Quang Nhat Minh, Director of the AI Research and Development Center at FPT IS, provides an in-depth look at the techniques behind DeepSeek R1, helping readers gain a deeper understanding of the technological breakthroughs driving this model.
1. What Is DeepSeek R1?
DeepSeek was founded in 2023 by Liang Wenfeng.
DeepSeek R1 is a large language model (LLM) developed by the DeepSeek team. It offers reasoning capabilities on par with OpenAI's o1 model but was created at a fraction of the cost. Training DeepSeek V3, the foundation model behind DeepSeek R1, cost $5.58 million, only about 3-5% of the estimated cost of developing OpenAI's o1 model. Additionally, DeepSeek has made DeepSeek V3 and DeepSeek R1 publicly available under the MIT license, allowing users to download and use these models, even for commercial purposes. Users who lack the computing infrastructure to run DeepSeek models locally can use DeepSeek's free chat interface at https://chat.deepseek.com or access the models via an API that costs far less than OpenAI's ($0.14 per million input tokens and $0.28 per million output tokens, compared with $2.50 per million input tokens and $10 per million output tokens for the comparable GPT-4o model).
1.1. What Are the Capabilities of DeepSeek R1?
While large language models can handle many language tasks, such as translation and text summarization, they generally struggle with more complex reasoning tasks like solving mathematical problems. OpenAI developed the o1 model using reinforcement learning techniques to enhance its reasoning abilities. OpenAI o1 can "think" before responding: it generates a chain of thought before providing an answer. This reasoning capability allows it to outperform conventional models on complex tasks in mathematics, programming, physics, chemistry, and biology. For example, on a qualifying exam for the International Mathematics Olympiad, GPT-4o solved only 13% of the problems, whereas OpenAI o1 achieved 83%.
However, reasoning-based models like OpenAI o1 and DeepSeek R1 have a drawback: longer response times compared to standard models. As a result, they are better suited for tasks that require deep analysis rather than real-time responses.
Testing has shown that DeepSeek R1 performs on par with OpenAI o1 in mathematical reasoning and outperforms GPT-4o and its predecessor, DeepSeek V3, in programming tasks. This improvement is due to reinforcement learning training at a larger scale, which was not applied to the base DeepSeek V3 model.
Benchmark Results of DeepSeek R1 Compared to Other Models (According to DeepSeek R1’s Technical Report)
1.2. Where Does DeepSeek R1’s Breakthrough Come From?
The success of DeepSeek R1 stems from innovations in model architecture and training methods, with the key contribution coming from breakthroughs in reinforcement learning-based model training.
In the following sections, we will explore the architecture of DeepSeek R1 and, in particular, the unique training method used to build this model.
2. DeepSeek Model Architecture
DeepSeek R1 was built upon the DeepSeek V3 base model, so its architecture is similar to that of DeepSeek V3.
The core architecture of DeepSeek V3 and DeepSeek R1 follows the Transformer framework. However, the DeepSeek development team introduced optimizations to improve model training and deployment. These innovations include:
- Mixture of Experts (MoE): This technique activates only a subset of the model's parameters for each generated token, reducing computational load while maintaining model quality. It works like having multiple specialists, each excelling at a particular task: instead of all specialists working at once, only the relevant ones handle a given input. MoE accelerates computation, keeps costs manageable as the model grows, and improves generalization across diverse inputs (a simplified routing sketch follows this list).
- Multi-Head Latent Attention (MLA): This technique reduces memory and computational costs by projecting the Query, Key, and Value matrices in self-attention into a lower-dimensional latent space, which in particular shrinks the key-value cache needed during inference.
- Multi-Token Prediction (MTP): Enables the model to generate multiple tokens in parallel, improving throughput by 2-3 times.
- FP8 Quantization: Reduces memory usage by up to 75% compared to FP32.
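To make the MoE idea concrete, here is a minimal, self-contained routing sketch in Python. The dimensions, number of experts, and top-k value are illustrative assumptions, not DeepSeek V3's actual configuration (which uses many more, finer-grained experts plus shared experts).

```python
# A minimal sketch of top-k Mixture-of-Experts routing (illustrative only;
# sizes, top-k value, and gating details are assumptions, not DeepSeek's design).
import numpy as np

rng = np.random.default_rng(0)

D_MODEL, N_EXPERTS, TOP_K = 16, 8, 2
experts = [rng.normal(size=(D_MODEL, D_MODEL)) for _ in range(N_EXPERTS)]  # each "expert" simplified to one matrix
router_w = rng.normal(size=(D_MODEL, N_EXPERTS))                           # router / gating weights

def moe_forward(token_vec: np.ndarray) -> np.ndarray:
    """Route one token through only TOP_K of the N_EXPERTS experts."""
    logits = token_vec @ router_w                      # affinity of this token to each expert
    top_idx = np.argsort(logits)[-TOP_K:]              # indices of the k most relevant experts
    gates = np.exp(logits[top_idx])
    gates /= gates.sum()                               # normalize gate weights over the selected experts
    # Only the selected experts run, so most parameters stay inactive for this token.
    return sum(g * (token_vec @ experts[i]) for g, i in zip(gates, top_idx))

out = moe_forward(rng.normal(size=D_MODEL))
print(out.shape)  # (16,)
```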
The DeepSeek V3 model has 671 billion parameters, with 37 billion parameters actively used to generate a single token (the unit used in LLMs to segment a text sequence).
DeepSeek V3 was trained on a server cluster equipped with 2,048 NVIDIA H800 GPUs, accumulating 2.788 million GPU hours. Assuming an H800 rental cost of $2 per hour, the estimated training cost for DeepSeek V3 is approximately $5.576 million. This amount represents only the GPU computation cost for training the base model and does not include additional costs such as data preparation and pre-training experiments. Although the technical report does not specify the exact cost of training DeepSeek R1, it is likely to be significantly higher than $5.576 million.
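As a quick sanity check, the headline figure follows directly from the reported GPU hours and the assumed rental price:

```python
# Back-of-the-envelope check of the reported training cost
# (the $2 per H800 GPU-hour rental price is the report's own assumption).
gpu_hours = 2.788e6          # total H800 GPU hours reported for training DeepSeek V3
price_per_gpu_hour = 2.0     # USD
print(f"${gpu_hours * price_per_gpu_hour / 1e6:.3f} million")   # -> $5.576 million
```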
3. How DeepSeek R1 Was Trained
This section focuses on explaining the reinforcement learning method used in DeepSeek R1 to equip the base model DeepSeek V3 with reasoning capabilities.
3.1. What Is Reinforcement Learning?
Reinforcement Learning (RL) is an AI technique that enables machines to learn decision-making by experimenting and receiving feedback from their environment. The system learns by:
- Taking an action: The AI model selects an action based on the current environment state.
- Receiving rewards or penalties: If the action is beneficial, the model earns a reward; otherwise, it is penalized.
- Adjusting strategy: The AI gradually optimizes its actions to maximize long-term rewards.
The goal of an RL-based AI model is to maximize cumulative rewards over time by learning the best decision-making strategy (policy).
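The loop below is a toy illustration of this act / get-feedback / adjust cycle. The environment, actions, and reward rule are invented purely for illustration and do not correspond to any real task.

```python
# A toy illustration of the reinforcement learning loop described above.
import random

def step(state: int, action: int) -> tuple[int, float]:
    """Hypothetical environment: returns (next_state, reward)."""
    reward = 1.0 if action == state % 3 else -0.1   # toy rule: the "good" action depends on the state
    return (state + 1) % 10, reward

policy = {s: random.choice([0, 1, 2]) for s in range(10)}   # current strategy: one action per state
state, total_reward = 0, 0.0

for _ in range(200):
    action = policy[state]                      # 1. take an action
    next_state, reward = step(state, action)    # 2. receive a reward or penalty
    total_reward += reward
    if reward < 0:                              # 3. adjust the strategy after bad outcomes
        policy[state] = random.choice([0, 1, 2])
    state = next_state

print(f"cumulative reward after 200 steps: {total_reward:.1f}")
```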
To illustrate, let's consider an RL-trained AI model playing Tetris, where players must arrange falling blocks to form complete horizontal lines and clear them.
Key RL components in this scenario include:
- Environment: The Tetris game.
- State: The current game board, including stacked blocks and the position of the new block.
- Actions: The AI can rotate, move left/right, or drop blocks.
- Rewards:
- +10 points for clearing a full line.
- +50 points for clearing four lines at once (a "Tetris").
- -1 point if blocks stack too high, nearing the top of the screen.
- Policy: The AI learns to choose the best actions to maximize its score.
How the AI learns to play Tetris:
- Early stage: Initially, the AI plays randomly without a specific strategy.
- Feedback collection: After each round, the AI records the score received.
- Strategy improvement: The AI uses algorithms like Q-learning or Deep Q-Networks (DQN) to prioritize beneficial actions.
- Long-term learning: Over thousands of games, the AI gradually discovers optimal block placements for extended gameplay and higher scores.
At first, the AI plays poorly due to a lack of experience. However, after extensive training, it optimizes its moves to extend gameplay and achieve the highest possible score—eventually surpassing human players.
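As a concrete illustration of the algorithms mentioned above, here is a minimal tabular Q-learning sketch. The states, actions, and reward values are abstract placeholders, not a real Tetris implementation.

```python
# A minimal tabular Q-learning sketch in the spirit of the Tetris example.
import random
from collections import defaultdict

ACTIONS = ["left", "right", "rotate", "drop"]
ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.2      # learning rate, discount factor, exploration rate
Q = defaultdict(float)                     # Q[(state, action)] -> estimated future reward

def choose_action(state) -> str:
    """Epsilon-greedy: mostly exploit the best known action, sometimes explore."""
    if random.random() < EPSILON:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(state, a)])

def q_update(state, action, reward, next_state) -> None:
    """Move Q(state, action) toward reward + discounted best value of the next state."""
    best_next = max(Q[(next_state, a)] for a in ACTIONS)
    Q[(state, action)] += ALPHA * (reward + GAMMA * best_next - Q[(state, action)])

# Example: the agent cleared a line (+10) after dropping a block in state "board_7".
q_update("board_7", "drop", reward=10.0, next_state="board_8")
```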
3.2. Applying Reinforcement Learning to LLM Training
The Reinforcement Learning from Human Feedback (RLHF) technique has long been used to train LLMs like ChatGPT, Claude, and Gemini. RLHF enables models to generate responses that better align with human preferences, avoiding incorrect, harmful, or nonsensical content.
Reinforcement learning is intuitive in games like Tetris or chess. But how does it apply to LLMs?
Unlike games, LLMs generate text one token at a time rather than all at once. To evaluate whether the generated text is high-quality, we must wait until the entire response is produced.
- In Tetris, the game automatically assigns scores.
- In LLMs, there is no built-in mechanism to evaluate responses.
- Therefore, we need a reward model to assess generated text.
How DeepSeek R1 Uses Reinforcement Learning
In standard RLHF, a neural-network-based reward model evaluates LLM outputs by assigning a reward score (r). It is typically trained on preference data consisting of:
- Prompt (x) – The given input.
- Desired response (y⁺) – A high-quality answer.
- Undesired response (y⁻) – A low-quality answer.
DeepSeek R1's reward model, in contrast, uses predefined rules and consists of two key types (see the sketch after this list):
- Accuracy Reward: Evaluates correctness. For example, if the model answers a math problem correctly, it receives a reward. Similarly, for coding-related queries, a compiler can verify the solution against test cases.
- Format Reward: Rewards responses structured within specific tags, such as <think>...</think> for reasoning steps.
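A minimal sketch of what such rule-based rewards might look like is shown below. The tag names follow the <think>/<answer> convention from the report, but the scoring values and matching logic are illustrative assumptions, not DeepSeek's implementation.

```python
# A minimal sketch of rule-based rewards in the spirit of the two types above.
import re

def accuracy_reward(response: str, reference_answer: str) -> float:
    """Reward 1.0 if the final answer inside <answer>...</answer> matches the reference."""
    match = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL)
    return 1.0 if match and match.group(1).strip() == reference_answer.strip() else 0.0

def format_reward(response: str) -> float:
    """Reward responses that put their reasoning inside <think>...</think> tags."""
    return 0.5 if re.search(r"<think>.*?</think>", response, re.DOTALL) else 0.0

response = "<think>17 + 25 = 17 + 20 + 5 = 42</think><answer>42</answer>"
print(accuracy_reward(response, "42") + format_reward(response))   # -> 1.5
```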
The policy of an LLM is represented as a probability distribution over each token, generated based on the given prompt and previous tokens.
By training the policy model (LLM) to maximize cumulative rewards, the neural network adjusts its weights to produce higher-quality, more human-aligned responses. This is the fundamental approach of RLHF (Reinforcement Learning from Human Feedback).
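For reference, this objective is commonly written as maximizing the expected reward while keeping the policy close to a reference model via a KL penalty. This is the standard RLHF formulation, not a formula taken from the DeepSeek report:

```latex
\max_{\theta}\;
\mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_{\theta}(\cdot \mid x)}\!\left[ r(x, y) \right]
\;-\; \beta\, \mathbb{D}_{\mathrm{KL}}\!\left( \pi_{\theta}(\cdot \mid x) \,\Vert\, \pi_{\mathrm{ref}}(\cdot \mid x) \right)
```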
GRPO: The Unique Reinforcement Learning Approach of DeepSeek R1
DeepSeek R1 employs a reinforcement learning technique called GRPO (Group Relative Policy Optimization). Unlike PPO (Proximal Policy Optimization), GRPO does not rely on a separate value model to estimate state values. Instead, it samples a group of responses for each prompt and uses the group's average reward as the baseline for estimating each response's advantage.
Advantages of GRPO over PPO:
- Higher efficiency: GRPO optimizes responses effectively without having to train and run a separate value network alongside the policy.
- Lower computational cost: By eliminating the need for a value model, GRPO significantly reduces training costs.
Illustration of PPO and GRPO
GRPO eliminates the value model and instead estimates the baseline from the group's scores, significantly reducing training resources.
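The core of this idea can be sketched in a few lines: score a group of responses sampled for the same prompt, then normalize each reward against the group's mean and standard deviation. The reward values below are placeholders; the normalization mirrors the group-relative advantage used in GRPO.

```python
# A minimal sketch of the group-relative advantage at the heart of GRPO:
# sample several responses per prompt, score them, and normalize each reward
# against the group statistics instead of using a learned value model.
import numpy as np

def group_relative_advantages(rewards: list[float]) -> np.ndarray:
    """Advantage of each response = (reward - group mean) / group std."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)    # epsilon avoids division by zero

# e.g. four responses sampled for one prompt, scored by rule-based rewards
print(group_relative_advantages([1.5, 0.5, 0.0, 1.5]))
```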
The DeepSeek development team's first experiment was to directly apply reinforcement learning to the base DeepSeek V3 model, creating DeepSeek R1 Zero (Figure 3). The experiment demonstrated that DeepSeek R1 Zero exhibited superior reasoning abilities compared to DeepSeek V3 without requiring labeled data.
The Training Process of DeepSeek R1 Zero. Source: https://www.vellum.ai/blog/the-training-of-deepseek-r1-and-ways-to-use-it
During reinforcement learning training, DeepSeek R1 Zero uses a prompt template as shown below.
Template for DeepSeek-R1-Zero (extracted from the DeepSeek R1 technical report)
The image below shows an example of DeepSeek R1 Zero's reasoning process before it provides an answer. The interesting part is the "aha moment" the model produces during its reasoning.
The reasoning process of DeepSeek R1 Zero before it provides an answer. Notably, the model stops to re-examine its own reasoning (the passage marked with the red line), using a strikingly human-like tone.
Reinforcement learning has proven effective in providing reasoning abilities to LLMs. However, applying reinforcement learning directly to the base model presents two issues:
- The output of DeepSeek-R1-Zero occasionally mixes different languages, especially Chinese.
- The reasoning content of DeepSeek-R1-Zero is difficult to read and understand.
To address these issues, the DeepSeek development team applied a multi-stage training process designed to enhance the model's reasoning capabilities while maintaining training efficiency. The key stages include Supervised Fine-Tuning (SFT), reinforcement learning (RL), Rejection Sampling, and an additional reinforcement learning phase. The multi-stage training process to create DeepSeek R1 is illustrated in the figure below.
The multi-stage training process to create DeepSeek R1. Source: https://www.vellum.ai/blog/the-training-of-deepseek-r1-and-ways-to-use-it
Interested readers can read more detailed content in the technical report of DeepSeek R1.
4. Knowledge Transfer from DeepSeek R1
Knowledge Distillation is a technique in machine learning that transfers knowledge from a larger, pre-trained model, known as the teacher model, to a smaller model, called the student model.
Knowledge distillation has been successfully applied in many areas, including natural language processing (NLP), speech recognition, image recognition, and object detection. In recent years, distillation research has become particularly important for large language models (LLMs), where it has emerged as an effective way to transfer advanced capabilities from top proprietary models (such as OpenAI's GPT-4o) to smaller, more accessible open-source models. Distillation aims not only to replicate the teacher model's outputs but also to imitate its reasoning process.
To equip smaller models with reasoning capabilities like DeepSeek-R1's, the DeepSeek team fine-tuned open-source models such as Qwen (from Alibaba) and Llama (from Meta AI) on 800,000 samples curated from DeepSeek-R1's outputs. Evaluation results show that this simple distillation method significantly improves the reasoning abilities of smaller models. The smaller models used by DeepSeek are based on Qwen2.5-Math-1.5B, Qwen2.5-Math-7B, Qwen2.5-14B, Qwen2.5-32B, Llama-3.1-8B, and Llama-3.3-70B-Instruct.
For these distilled models, the DeepSeek team applied only supervised fine-tuning (SFT) without a reinforcement learning (RL) stage, even though RL could further improve their capabilities. In the technical report, the authors state that their goal was to demonstrate the effectiveness of distillation, leaving exploration of the RL stage to the research community.
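To illustrate the distillation setup, here is a minimal sketch of how a teacher model's reasoning traces could be packed into supervised fine-tuning examples for a student model. The record fields, template, and sample content are assumptions for illustration, not DeepSeek's actual data format.

```python
# A minimal sketch of turning teacher reasoning traces into SFT examples.
# Fields and template are hypothetical, for illustration only.
teacher_samples = [
    {
        "prompt": "What is 17 + 25?",
        "reasoning": "17 + 25 = 17 + 20 + 5 = 42",
        "answer": "42",
    },
    # ... in DeepSeek's case, roughly 800,000 curated samples
]

def to_sft_example(sample: dict) -> dict:
    """Pack the teacher's reasoning and answer into a single target text."""
    target = f"<think>{sample['reasoning']}</think><answer>{sample['answer']}</answer>"
    return {"input": sample["prompt"], "target": target}

sft_dataset = [to_sft_example(s) for s in teacher_samples]
# The student (e.g. a Qwen or Llama checkpoint) is then fine-tuned with the
# standard next-token prediction loss on these (input, target) pairs.
print(sft_dataset[0])
```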
5. Limitations of the DeepSeek R1 Model
Although DeepSeek R1 has impressive results on several evaluation metrics, it still has some limitations.
The first issue is that, since it was primarily trained on English and Chinese data, DeepSeek R1 sometimes answers in English or Chinese even when the question is asked in another language.
The second issue is that DeepSeek R1 tends to avoid or refuse to answer questions on sensitive political and social topics, especially those concerning China. However, according to tests by Cisco's research team on 50 harmful prompts randomly selected from the HarmBench dataset, DeepSeek R1 failed to block a single one (a 100% attack success rate).
Finally, the size of the DeepSeek R1 model requires powerful computing infrastructure to deploy the model locally (approximately 6 x H100 80GB to deploy the DeepSeek R1 671B model). While smaller models distilled from DeepSeek R1 can be used on lower-end hardware, these models do not have the same capabilities as the original DeepSeek R1.
6. Conclusion
DeepSeek has proven that the application of reinforcement learning can significantly enhance the reasoning capabilities of large language models. Specifically, the combination of fine-tuning and reinforcement learning, as in the DeepSeek R1 model, helps address the limitations of using reinforcement learning alone.
The release of DeepSeek R1 marked a milestone for the AI field. It shows that, with breakthroughs in model architecture and training methods, it is possible to develop models whose reasoning capabilities are on par with top commercial models such as OpenAI's o1, but at a significantly lower cost. This puts pressure on companies like OpenAI to reduce the price of API services, bringing direct benefits to users and businesses seeking cost-effective AI solutions for their operations.
These advancements not only drive innovation in the AI industry but also open up opportunities for businesses to access powerful and cost-effective AI solutions, contributing to the growth and application of AI in various fields.
References
- DeepSeek-AI. DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. https://arxiv.org/abs/2501.12948
- DeepSeek-AI. DeepSeek-V3 Technical Report. https://github.com/deepseek-ai/DeepSeek-V3/blob/main/DeepSeek_V3.pdf
- Cisco Security Research. Evaluating Security Risk in DeepSeek and Other Frontier Reasoning Models.
Exclusive Article by FPT IS Technology Expert Phạm Quang Nhật Minh
Director of the AI R&D Center, FPT IS
PhD in Information Science, expert in Natural Language Processing (NLP), with 17 years of research and development experience in both academic and industrial environments. He is the author and co-author of several scientific papers in the field of NLP. His current research topics focus on large language models and their applications.