Pre-training stage: build a base model by learning from a large amount of internet document content.
Supervised fine-tuning stage: fine-tune the model on real dialogue data so that it can serve as an assistant.
Reinforcement learning stage: optimize the model's decision-making and problem-solving abilities through practice and feedback.
After supervised fine-tuning, the model is already capable of handling basic dialogue tasks, but we want it to become smarter and more adaptable. At this point, reinforcement learning (RL) becomes a critical training phase.
Reinforcement learning differs from the previous two stages (pre-training and supervised fine-tuning), as it focuses on optimizing the model's decision-making ability through practice and feedback. Just as humans improve their skills through constant practice, reinforcement learning allows the model to continuously try, adjust, and optimize its strategies on specific tasks.
Analogy of reinforcement learning to school education
To help understand the concept of reinforcement learning, we can compare it with school education:

The first stage (pre-training) is like students gaining background knowledge by reading textbooks: the model learns from a large amount of text.
The second stage (supervised fine-tuning) is like experts demonstrating how to solve problems: the model learns to handle dialogue tasks by imitating experts.
The third stage (reinforcement learning) is the practice stage: the model solves real-world problems, receives feedback, and continuously refines its decision-making and problem-solving abilities.
The process of reinforcement learning: practice and feedback
In the reinforcement learning phase, the model no longer relies on solutions provided by experts; instead, it interacts with the environment, tries different strategies, and adjusts its behavior based on the feedback it receives. The goal of reinforcement learning is for the model to learn, through exploration and practice, how to make optimal decisions on specific tasks.
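To make this try-adjust-optimize loop concrete, here is a deliberately tiny toy (not LLM training code): a "model" that must choose between two hypothetical ways of answering, gets feedback on each attempt, and gradually shifts toward the method that works.

```python
import random

# Toy illustration of "try, get feedback, adjust" (not real LLM training code).
# The model starts with no preference between two hypothetical answering methods.
preferences = {"method_A": 1.0, "method_B": 1.0}

def attempt_succeeds(method):
    # Pretend method_A works 80% of the time and method_B only 30%.
    return random.random() < (0.8 if method == "method_A" else 0.3)

for step in range(1000):
    # Try: pick a method in proportion to the current preferences (exploration).
    method = random.choices(list(preferences), weights=list(preferences.values()))[0]
    # Feedback: did the attempt succeed?
    if attempt_succeeds(method):
        preferences[method] += 0.1   # Adjust: reinforce what worked.

print(preferences)   # method_A ends up strongly preferred
```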
This stage can be compared to doing exercises: students face a series of practice questions, attempt to solve them independently, and then adjust their methods based on the results. This aligns with the core idea of reinforcement learning: achieving optimal performance through repeated trials and continuous optimization.
Example
Suppose Emily bought three apples and two oranges, with each orange costing $2, and the total cost of all fruits being $13. What is the price of each apple? We can solve this problem using several different methods, and all methods will ultimately lead to the same answer — $3.
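For reference, here is the arithmetic written out step by step (a small illustrative snippet, not something taken from any training set):

```python
# Step-by-step arithmetic for the fruit problem.
total_cost = 13                            # all fruits together
orange_price = 2
oranges_cost = 2 * orange_price            # two oranges: $4
apples_cost = total_cost - oranges_cost    # the three apples together: $9
apple_price = apples_cost / 3              # each apple: $3
print(apple_price)                         # 3.0
```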

Human data labelers are unsure which version of the dialogue should be added to the training set: some solutions set up a system of equations, others reason in plain English, and others skip the intermediate steps and give the answer directly.
If a GPT model (e.g., ChatGPT) were to answer this question, it might choose to set up a system of variables and equations and then solve it.
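That "set up variables, then solve" style might look something like the following SymPy sketch (just one plausible rendering of such a solution, not an actual ChatGPT output):

```python
from sympy import Eq, solve, symbols

apple, orange = symbols("apple orange")
equations = [
    Eq(orange, 2),                   # each orange costs $2
    Eq(3 * apple + 2 * orange, 13),  # three apples plus two oranges cost $13
]
print(solve(equations, [apple, orange]))  # {apple: 3, orange: 2}
```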

Certain tasks that are easy for us humans can be extremely difficult for LLMs, because humans and LLMs think differently. A token sequence that seems very simple to us may represent a huge leap for the LLM; conversely, a task that looks complex to us may be trivial for the LLM, so spelling it out in detail only wastes tokens. Therefore, if we only care about reaching the final answer, rather than how the solution is presented to humans, we cannot easily decide how to annotate this example, i.e., what kind of solution to hand to the LLM.
More importantly, the LLM's knowledge differs from ours. LLMs possess extensive knowledge of mathematics, physics, chemistry, and other fields, and in some respects they know more than we do. Yet a human-written annotation may also rely on knowledge the model itself lacks, forcing a large leap in its reasoning and causing confusion. Because our cognition and the model's cognition differ, it is unclear how to annotate the solution that suits the LLM best.
In summary, we are not in a good position to hand-craft the token sequences that are most appropriate for the LLM. We can initialize the system through imitation, but the ultimate goal is for the LLM to discover the token sequences most suitable for itself. This requires reinforcement learning and trial and error, through which the LLM discovers which kinds of token sequences reliably produce the correct answer.
How reinforcement learning works
For example, on Hugging Face's inference platform we can give the model a simple task (a prompt), and the model generates answers for it.
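A minimal sketch of this step, assuming the transformers library run locally rather than the hosted playground; the model name is only a placeholder example, and any small instruction-tuned model would do:

```python
from transformers import pipeline

# Placeholder model name: any small instruction-tuned model works for this sketch.
generator = pipeline("text-generation", model="Qwen/Qwen2.5-0.5B-Instruct")

prompt = ("Emily bought 3 apples and 2 oranges. Each orange costs $2 and the "
          "total cost of all the fruit is $13. What is the price of each apple?")

# Sample several independent answers to the same prompt.
answers = generator(prompt, num_return_sequences=15, do_sample=True,
                    temperature=1.0, max_new_tokens=200)
for a in answers:
    print(a["generated_text"])
```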

By continuously trying and reviewing each result, we can evaluate the model's performance. Each generated answer follows a different path, some leading to correct results, while others may fail. Ultimately, we aim to encourage solutions that produce correct answers and optimize the model's generation strategy through trial and error.
Assume we have a prompt, and we attempt multiple answers in parallel. Some answers may be correct, marked in green, while others may fail, marked in red.

Suppose we generate 15 answers and only 4 yield the correct result. Our goal now is to encourage the model to produce more solutions like the green ones. For the red answers, the model went wrong somewhere along the way, so those token sequences are not effective paths; the token sequences of the green answers did lead to the correct result, so we want the model to produce this type of answer more often for similar prompts in the future.
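In code, the "keep the green, drop the red" step is just an automatic check over the sampled answers. A toy sketch, with made-up sample texts and a deliberately crude checker standing in for a real grader:

```python
import re

def is_correct(answer_text, expected=3.0):
    """Crude checker: treat the last number in the text as the final answer."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", answer_text)
    return bool(numbers) and float(numbers[-1]) == expected

# Pretend these came back from the 15 sampled generations (only a few shown).
sampled_answers = [
    "The oranges cost 2*2 = 4, so the apples cost 13-4 = 9, which is 3 each.",
    "Each apple costs 13/3 = 4.33.",                      # a wrong path
    "Let x be the apple price: 3x + 4 = 13, so x = 3.",
]

green = [a for a in sampled_answers if is_correct(a)]      # kept and reinforced
red = [a for a in sampled_answers if not is_correct(a)]    # effectively discouraged
print(len(green), "green,", len(red), "red")
```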
These answers are not designed by human labelers but are generated by the model in actual operations. The model discovers which token sequences yield correct answers through continuous attempts. It’s akin to a student reviewing their past answers, analyzing which methods are effective and which are not, and learning how to better solve similar problems.
In this case, we can pick the best of the correct answers, for instance the shortest one or the one that reads most cleanly, and treat it as the "best answer." We then train the model to prefer this answer path, so that after the parameter update it is more inclined to take the same route when facing similar problems in the future.
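Continuing the same toy setup, "preferring the best answer path" can be sketched as picking one green answer (here simply the shortest) and packaging it with the prompt as a training example; the actual parameter update is not shown:

```python
# Toy continuation: the surviving "green" answers for this prompt.
green = [
    "The oranges cost 2*2 = 4, so the apples cost 13-4 = 9, which is 3 each.",
    "Let x be the apple price: 3x + 4 = 13, so x = 3.",
]
prompt = ("Emily bought 3 apples and 2 oranges. Each orange costs $2 and the "
          "total cost of all the fruit is $13. What is the price of each apple?")

best = min(green, key=len)   # "best" here simply means the shortest answer
training_example = {"prompt": prompt, "completion": best}
print(training_example)
# One more fine-tuning / policy-update step on pairs like this (not shown) is
# what nudges the model toward this kind of solution in the future.
```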
The training process involves a large number of prompts of different types, covering math, physics, and many other problems, with tens of thousands of prompts and answers. As training progresses, the model discovers, through continual trial and error, which token sequences reliably lead to correct answers. This is the core process of reinforcement learning: keep guessing, check the results, and let the answers that work guide future reasoning.
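Putting the whole thing together, the loop over tens of thousands of prompts has roughly this shape; every helper below is a simplified stand-in for the real generation, grading, and parameter-update machinery:

```python
def sample_answers(model, prompt, n=15):
    """Stand-in: sample n independent answers from the model for one prompt."""
    return [model(prompt) for _ in range(n)]

def train_step(model, prompt, good_answers):
    """Stand-in: the parameter update that reinforces the answers that worked."""
    pass  # e.g. a fine-tuning or policy-gradient step in a real system

def reinforcement_learning(model, tasks):
    # tasks: (prompt, checker) pairs covering math, physics, and other problems.
    for prompt, is_correct in tasks:
        answers = sample_answers(model, prompt)        # guess
        good = [a for a in answers if is_correct(a)]   # verify
        if good:
            train_step(model, prompt, good)            # reinforce what worked
```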