
DeepSeek-R1: Andrej Karpathy's In-Depth Explanation of LLMs (Part 9)

Pre-training and SFT (supervised fine-tuning) have been part of LLM training for years and are widely applied. RL (reinforcement learning) training, by contrast, is a much more recent addition to the pipeline and has not yet been standardized.

Although the high-level idea of this stage is simple (optimize the model through trial-and-error learning), the details and the underlying math are complex: how to select the best solutions, how much to train, how to design the distribution of prompts, and how to configure the training runs.
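To make that high-level loop concrete, here is a minimal, self-contained toy sketch in Python. It is a REINFORCE-style update over a handful of canned "solution strategies," not DeepSeek's actual GRPO setup; the prompts, strategy names, and accuracy numbers are all invented for illustration. The structure is the point: sample several attempts per prompt, reward only the attempts that reach the correct answer, and reinforce whatever produced them.

```python
import math
import random

# Toy setup, purely for illustration: each "prompt" is an arithmetic question
# with a known answer, and the "policy" is a softmax over three canned solution
# strategies. A real run samples full token sequences from an LLM instead.
PROMPTS = [
    {"question": "3 apples and 2 oranges at $2 each cost $13 in total; price of one apple?",
     "answer": 3},
    {"question": "5 pens cost $20 in total; price of one pen?", "answer": 4},
]

STRATEGIES = ["guess_quickly", "set_up_equation", "set_up_equation_and_verify"]
# Assumed probability that each strategy reaches the correct final answer.
STRATEGY_ACCURACY = {
    "guess_quickly": 0.2,
    "set_up_equation": 0.7,
    "set_up_equation_and_verify": 0.95,
}

logits = {s: 0.0 for s in STRATEGIES}  # the toy policy's parameters
LEARNING_RATE = 0.1


def policy_probs():
    """Softmax over the strategy logits."""
    weights = {s: math.exp(v) for s, v in logits.items()}
    total = sum(weights.values())
    return {s: w / total for s, w in weights.items()}


def sample_strategy():
    probs = policy_probs()
    return random.choices(STRATEGIES, weights=[probs[s] for s in STRATEGIES])[0]


def outcome_reward(strategy, prompt):
    """1 if the attempt ends with the correct answer, 0 otherwise.
    In a real setup this would compare the model's final answer to prompt['answer']."""
    return 1.0 if random.random() < STRATEGY_ACCURACY[strategy] else 0.0


for step in range(2000):
    prompt = random.choice(PROMPTS)                    # sample from the prompt distribution
    attempts = [sample_strategy() for _ in range(4)]   # several rollouts per prompt
    rewards = [outcome_reward(a, prompt) for a in attempts]
    baseline = sum(rewards) / len(rewards)             # compare attempts against each other
    probs = policy_probs()
    for chosen, r in zip(attempts, rewards):
        advantage = r - baseline
        # Policy-gradient (REINFORCE) update for a softmax policy:
        # d log pi(chosen) / d logit[s] = 1[s == chosen] - pi(s)
        for s in STRATEGIES:
            grad = (1.0 if s == chosen else 0.0) - probs[s]
            logits[s] += LEARNING_RATE * advantage * grad

print(policy_probs())  # probability mass concentrates on the verifying strategy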
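In the real setting the "strategies" are entire token sequences sampled from the model and the update touches billions of parameters rather than three logits, but the trial-and-error structure is the same.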

Recently, DeepSeek released an important paper that, for the first time, publicly laid out its work on reinforcement-learning fine-tuning for stronger reasoning in LLMs. Through this disclosure, DeepSeek not only sparked industry interest in applying reinforcement learning to LLMs but also provided the details other researchers need to reproduce the method and build on it.

The Application of Reinforcement Learning in Language Models: A Breakthrough in Cognitive Strategies

The DeepSeek-R1 paper demonstrates the effectiveness of applying reinforcement learning (RL) to language models, especially in solving math problems. In the early stages of training, the model performed poorly when solving basic math problems, but with thousands of optimization steps during the RL process, its accuracy significantly improved. Notably, it wasn't just the quantitative improvement in model accuracy that stood out, but more importantly, the qualitative change in its problem-solving methods.

As the model was optimized, a significant phenomenon was observed: the model began to generate longer answers. This increase in response length stemmed from the model discovering that more detailed solutions could improve accuracy. It learned to "re-evaluate" steps, backtrack its own thinking, and re-examine problems from different angles. For example, it might say, "Wait, let me re-check step by step to confirm the correct sum."

This process is highly similar to how humans solve problems: backtracking, trying different approaches, and gradually refining a solution. Through the RL process, these cognitive strategies emerged naturally. More interestingly, they were not hard-coded into the model; they were discovered gradually, through trial and error, during RL optimization. The only external guidance the model receives is the correct answer to compare against.
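That external guidance can be implemented as a very simple outcome-only reward. Below is a minimal sketch, assuming purely for illustration that the model ends each solution with a \boxed{...} final answer; the helper names and regex here are invented for this example and are not DeepSeek's actual reward code.

```python
import re


def extract_boxed_answer(completion: str) -> str | None:
    """Pull the last \\boxed{...} expression out of a model completion."""
    matches = re.findall(r"\\boxed\{([^}]*)\}", completion)
    return matches[-1].strip() if matches else None


def outcome_reward(completion: str, reference_answer: str) -> float:
    """1.0 if the final answer matches the reference, 0.0 otherwise.
    The intermediate reasoning itself is neither rewarded nor penalized."""
    predicted = extract_boxed_answer(completion)
    return 1.0 if predicted == reference_answer.strip() else 0.0


# Example: only the final boxed answer is checked against the known solution.
trace = "Let the apple cost a. Then 3a + 2*2 = 13, so a = 3. \\boxed{3}"
print(outcome_reward(trace, "3"))  # -> 1.0
```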

The most astonishing aspect is that the model learned to think and developed strategies akin to human cognition, all without explicit programming but rather spontaneously emerging during the reinforcement learning optimization process. This is a cognitive strategy used to "manipulate" problems, understand them from different perspectives, or solve them using analogies. The discovery of this "chain of thought" is a direct result of the RL optimization process, showcasing the power and spontaneity of this method.

Emily buys 3 apples and 2 oranges. Each orange costs $2. The total cost of all the fruit is $13. What is the cost of each apple?
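For reference, the arithmetic both kinds of model are expected to reproduce: let a be the price of one apple; then 3a + 2 × 2 = 13, so 3a = 9 and a = 3, i.e. each apple costs $3.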

From ChatGPT-4o we get an answer like this. It is the kind of result produced by the basic SFT (supervised fine-tuning) approach described earlier, which essentially mimics an expert's written solution.

When the same question is given to one of the so-called reasoning or thinking models, the output looks quite different. These are the results we get from models trained with reinforcement learning (RL).

As you read through this process, you can't help but feel that the model is thinking; it is clearly pursuing a solution. It works out that the answer must be $3, then says, "Wait, let me check my math again, just to confirm." Then it tries again from a slightly different angle: "Yes, everything checks out. I think this is the answer; I don't see any mistakes. Let me see if there's another way to solve this problem, maybe I can set up an equation." For example, "Let's assume the price of an apple is $8... wait, no, the answer is the same. So each apple is indeed $3. Good, I'm confident this is correct."

Then, after the model completes its thought process, it writes out a beautiful solution for humans. So, it's not just about correctness, but also about presentation; it writes the solution very clearly and boxes the correct answer at the bottom.

Incredibly, we can see this thought process in the model's output, and it comes directly from the reinforcement learning process. This is why the token sequences get longer: the model is spending those tokens thinking, re-checking, and trying different approaches.

This is also why accuracy improves: what we see here are those "aha" moments, the different strategies, and the reflections on how to make sure the right answer is reached.

In ChatGPT, some of the models, such as o1, o3-mini, and o3-mini-high, use advanced reasoning techniques. The phrase "uses advanced reasoning" means they are trained with reinforcement learning. Models like GPT-4 or GPT-4o mini, the ones you get in the free tier, should be considered mainly SFT models (supervised fine-tuning models); they don't actually think the way RL models do. Some reinforcement learning is involved in training them, but they are still mostly SFT models.

Although the model generates these chains of thought in the background, OpenAI chooses not to display their full contents in the web interface and instead shows only summaries. Part of the reason is concern about so-called "distillation risk": someone could try to recover the reasoning performance simply by imitating the reasoning traces. So the details are hidden and only a summary is shown, and as a result you don't get the complete reasoning process the way you do with DeepSeek.
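To see why exposed traces matter, note that "distillation" here essentially means supervised fine-tuning on another model's visible reasoning: collect (question, reasoning trace, final answer) outputs from the stronger model and train a smaller model to imitate them token by token. Below is a minimal sketch of assembling such an SFT dataset; the field names and file format are assumptions for illustration, not anyone's actual pipeline.

```python
import json


def build_distillation_example(question: str, reasoning_trace: str, final_answer: str) -> dict:
    """Package one (prompt, target) pair for supervised fine-tuning.
    The target concatenates the teacher's visible reasoning and its answer,
    so the student learns to imitate the trace token by token."""
    return {
        "prompt": question,
        "completion": reasoning_trace + "\n\nAnswer: " + final_answer,
    }


# Hypothetical teacher outputs. If only a summary of the reasoning is exposed,
# the reasoning_trace field carries far less signal to imitate.
teacher_outputs = [
    {
        "question": "Emily buys 3 apples and 2 oranges...",
        "reasoning_trace": "Let a be the price of an apple. 3a + 2*2 = 13, so a = 3.",
        "final_answer": "3",
    },
]

with open("distill_sft.jsonl", "w") as f:
    for item in teacher_outputs:
        example = build_distillation_example(
            item["question"], item["reasoning_trace"], item["final_answer"])
        f.write(json.dumps(example) + "\n")
```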

Then, after the summarized thinking, the model writes out the solution. So even though we don't see the full internal details of these models, their performance is roughly comparable.

If you have a question that requires advanced reasoning, you may want to use one of the thinking models. For many simple cases, such as knowledge-based questions, a thinking model is overkill; there's no need to think for 30 seconds about a factual question. So 80-90% of the time GPT-4 alone suffices, and when you run into harder problems, such as math or programming, you can switch to a thinking model, though you'll have to wait longer because it needs time to think.

You can also try Gemini 2.0 Flash Thinking Experimental in Google AI Studio, an early experimental thinking model from Google. We can paste the same question there and click run; it, too, is a thinking model, and it produces the correct answer. So Gemini also offers a thinking model, while Anthropic currently does not, but this is where the cutting edge of these LLMs lies.

RL training is an emerging and exciting stage, but getting the details right is hard, which is why these thinking models are still experimental; as of early 2025 they remain at a very early stage. Still, this is the frontier: leveraging the reasoning that emerges from these optimizations is what is driving progress on the hardest problems.