GRPO (Group Relative Policy Optimization) Study Notes

I have previously taken a cursory look at algorithms for aligning LLMs.

Today, as a non-tech person, I took a look at GRPO. GRPO is a reinforcement learning (RL) algorithm and a variant of Proximal Policy Optimization (PPO), designed to enhance the mathematical reasoning abilities of large language models (LLMs) while optimizing PPO's memory usage.

Core idea:

The core idea of GRPO is to abandon the traditional critic model and instead use the average reward of multiple generated results under the same problem as the baseline. This "group-relative" approach aligns with the comparative nature of reward models, as reward models are typically trained to compare different outputs for the same problem.
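
For intuition, a hypothetical example: suppose four answers are sampled for the same question and the reward model scores them 0.2, 0.9, 0.4, and 0.5. The group mean is 0.5, so the answer scored 0.9 receives a positive relative advantage (+0.4) and is reinforced, while the answer scored 0.2 receives a negative one (−0.3) and is discouraged; no separately trained critic is needed to supply this baseline.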

Difference from PPO:

  • Critic Model: PPO uses a critic model to estimate the state value function, whereas GRPO directly uses the average reward of multiple outputs generated for the same question as the baseline. No additional critic model needs to be trained, which reduces the memory and computational burden.

  • KL Divergence Regularization: PPO usually adds a KL divergence penalty term to the reward to keep the policy model from drifting too far from the reference model. GRPO instead adds the KL divergence directly to the loss function as a regularizer, which avoids complicating the advantage calculation (see the sketch after this list).
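
For the second point, here is a minimal sketch of where the KL term lives in each case, assuming per-token log-probabilities are already available; tensor names and the coefficient value are illustrative assumptions.

```python
import torch

def ppo_style_token_reward(token_reward, logp_policy, logp_ref, kl_coef=0.04):
    # PPO-style: the KL penalty is folded into the per-token reward signal,
    # so it also flows through the advantage estimation.
    return token_reward - kl_coef * (logp_policy - logp_ref)

def grpo_style_kl_loss(logp_policy, logp_ref):
    # GRPO-style: the KL term is added directly to the loss, using the
    # unbiased estimator pi_ref/pi_theta - log(pi_ref/pi_theta) - 1,
    # leaving the advantage computation untouched.
    log_ratio = logp_ref - logp_policy
    return (torch.exp(log_ratio) - log_ratio - 1).mean()
```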

Formula Expression:

GRPO maximizes the following objective:

$$
\mathcal{J}_{GRPO}(\theta) = \mathbb{E}_{q \sim P(Q),\ \{o_i\}_{i=1}^{G} \sim \pi_{\theta_{old}}(O \mid q)} \left[ \frac{1}{G} \sum_{i=1}^{G} \frac{1}{|o_i|} \sum_{t=1}^{|o_i|} \left( \min\!\left( \frac{\pi_\theta(o_{i,t} \mid q, o_{i,<t})}{\pi_{\theta_{old}}(o_{i,t} \mid q, o_{i,<t})} \hat{A}_{i,t},\ \operatorname{clip}\!\left( \frac{\pi_\theta(o_{i,t} \mid q, o_{i,<t})}{\pi_{\theta_{old}}(o_{i,t} \mid q, o_{i,<t})},\ 1-\varepsilon,\ 1+\varepsilon \right) \hat{A}_{i,t} \right) - \beta\, \mathbb{D}_{KL}\!\left[ \pi_\theta \,\|\, \pi_{ref} \right] \right) \right]
$$

Among them:

  • q is a question sampled from the training distribution P(Q), and {o_1, o_2, ..., o_G} are the G outputs sampled from the old policy model π_θold for that question;

  • π_θ is the policy model being trained, π_θold is the policy before the current update, and π_ref is the reference model;

  • Â_{i,t} is the group-relative advantage of token t in output o_i, computed from the group's rewards as described below;

  • ε is the clipping range and β is the coefficient of the KL regularization term;

  • the KL term is estimated per token with the unbiased estimator D_KL[π_θ ‖ π_ref] = π_ref(o_{i,t} | q, o_{i,<t}) / π_θ(o_{i,t} | q, o_{i,<t}) − log(π_ref(o_{i,t} | q, o_{i,<t}) / π_θ(o_{i,t} | q, o_{i,<t})) − 1.
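
To make the objective concrete, here is a minimal PyTorch sketch of the per-token computation, assuming the per-token log-probabilities of the current policy, the old policy, and the reference model have already been gathered; tensor names, shapes, and the default hyperparameter values are illustrative assumptions rather than the exact published implementation.

```python
import torch

def grpo_objective(logp, logp_old, logp_ref, advantages, mask,
                   clip_eps=0.2, beta=0.04):
    """All inputs are [G, T] tensors: G sampled outputs for one question,
    T = max token length; mask is 1 for real tokens and 0 for padding."""
    # Importance ratio pi_theta / pi_theta_old, computed in log space.
    ratio = torch.exp(logp - logp_old)

    # Clipped surrogate term: min(ratio * A, clip(ratio) * A).
    surr_unclipped = ratio * advantages
    surr_clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    policy_term = torch.min(surr_unclipped, surr_clipped)

    # KL regularization added directly to the objective (not to the reward),
    # using the unbiased estimator pi_ref/pi_theta - log(pi_ref/pi_theta) - 1.
    log_ratio_ref = logp_ref - logp
    kl = torch.exp(log_ratio_ref) - log_ratio_ref - 1

    per_token = policy_term - beta * kl
    # Average over the valid tokens of each output, then over the group.
    per_output = (per_token * mask).sum(dim=1) / mask.sum(dim=1)
    return per_output.mean()   # maximize this (or minimize its negative)
```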

Advantage function calculation:

GRPO uses group-wise relative rewards to calculate the advantage function, which aligns with the comparative nature of reward models. Specifically, for each question q, GRPO samples a group of outputs {o_1, o_2, ..., o_G} from the old policy model π_θold. The reward model then scores these outputs, producing a group of rewards r = {r_1, r_2, ..., r_G}. These rewards are normalized by subtracting the group mean and dividing by the group standard deviation, and the advantage function is computed from the normalized rewards within each group.
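
The normalization itself is simple; a short sketch under outcome rewards, assuming one scalar reward per sampled output (names are illustrative):

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """rewards: shape [G], one reward-model score per output for the same question."""
    rewards = np.asarray(rewards, dtype=np.float64)
    # Subtract the group mean and divide by the group standard deviation.
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: four sampled answers scored by the reward model.
print(group_relative_advantages([0.2, 0.9, 0.4, 0.5]))
```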

Two types of supervision:

  • Outcome Supervision: A normalized reward is provided only at the end of each output o_i, and the advantage Â_{i,t} of every token in that output is set to this normalized reward.

  • Process Supervision: Rewards are provided at the end of each reasoning step. Given a question q and G sampled outputs {o_1, o_2, ..., o_G}, a process reward model scores each step of every output, producing the corresponding step rewards. These rewards are then normalized using the mean and standard deviation, and the advantage of each token is computed as the sum of the normalized rewards of the subsequent steps (see the sketch after this list).
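
A short sketch of how the two supervision modes assign per-token advantages, assuming the normalized rewards have already been computed; function and variable names are illustrative.

```python
import numpy as np

def outcome_advantages(normalized_reward, num_tokens):
    # Outcome supervision: the single normalized reward of the output is
    # assigned to every token of that output.
    return np.full(num_tokens, float(normalized_reward))

def process_advantages(step_rewards_norm, step_end_token, num_tokens):
    """step_rewards_norm: normalized reward of each reasoning step.
    step_end_token: token index at which each step ends.
    A token's advantage is the sum of the normalized rewards of all steps
    ending at or after that token (i.e. the subsequent steps)."""
    adv = np.zeros(num_tokens)
    for r, end in zip(step_rewards_norm, step_end_token):
        adv[: end + 1] += r  # this step's reward credits all tokens up to its end
    return adv

# Example: a 10-token output with two reasoning steps ending at tokens 4 and 9.
print(outcome_advantages(0.7, 10))
print(process_advantages([0.3, -0.2], [4, 9], 10))
```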

Iterative Reinforcement Learning:

As the reinforcement learning training process progresses, the old reward model may no longer be sufficient to supervise the current policy model. Therefore, GRPO also explores iterative reinforcement learning. In iterative GRPO, a new training set for the reward model is generated based on the sampling results of the policy model, and the old reward model is continuously trained using a replay mechanism that includes 10% historical data. Then, the reference model is set as the policy model, and the policy model is continuously trained using the new reward model.
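
A high-level sketch of this loop, where every callable (sample_and_label, train_reward_model, grpo_step, clone_model) is a hypothetical placeholder supplied by the caller rather than a real library API; the structure simply mirrors the steps described above.

```python
def iterative_grpo(policy, reward_model, questions,
                   sample_and_label, train_reward_model, grpo_step, clone_model,
                   num_rounds=3):
    history = []  # replay pool of earlier reward-model training data
    for _ in range(num_rounds):
        # Build a new reward-model training set from the current policy's samples,
        # mixed with a slice of the replay pool so roughly 10% of it is historical.
        new_data = sample_and_label(policy, questions)
        replay = history[-(len(new_data) // 9 + 1):] if history else []
        reward_model = train_reward_model(reward_model, new_data + replay)
        history.extend(new_data)

        # Reset the reference model to the current policy, then continue training
        # the policy with GRPO against the updated reward model.
        reference = clone_model(policy)
        policy = grpo_step(policy, reference, reward_model, questions)
    return policy, reward_model
```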

Summary

GRPO is an efficient and effective reinforcement learning algorithm that reduces memory and computational burdens by abandoning the critic model and using group-wise relative rewards. Experimental results show that GRPO can significantly improve the mathematical reasoning ability of LLMs, even when the SFT model has already reached a high level.

Finally:

I asked ChatGPT 4.5 to explain it in language that a five-year-old child can understand:

Imagine that you are playing a game, trying to improve a certain skill through different methods and observing which method works best. For example, if you want to build the tallest tower with building blocks, you might try different stacking methods to see which one allows the tower to be taller.

In the field of computing, there is a method called "reinforcement learning," where computers learn by trying different strategies to improve their ability to complete tasks. Traditionally, computers use a helper model called a critic to evaluate the quality of each action.

However, a new approach has recently emerged called "Group Relative Policy Optimization" (GRPO). Unlike traditional methods, GRPO does not rely on the critic but instead allows the computer to generate a set of different actions in each state, similar to trying multiple ways to stack your building blocks at the same time. Then, the computer evaluates the results of these actions to determine which method works best. By comparing these different attempts, the computer can learn which strategies are more effective without expert guidance. (Renee's inner question: Is this similar to the growth method of a small-town problem solver who practices extensively without a tutor?)

This method enables computers to learn more efficiently by trying multiple options and observing which one works best, gradually improving their ability to complete tasks, just as you would when building a tower with building blocks.