DPO is Direct Preference Optimization.
RLHF is Reinforcement Learning from Human Feedback.
Let's first talk about RLHF.
OpenAI's GPT models rely on a training paradigm for large language models (LLMs) known as RLHF (Reinforcement Learning from Human Feedback). In short, it is a method for optimizing the model with reinforcement learning driven by human feedback.
Before this, LLMs simply generated responses to human-written prompts, and evaluating those responses was typically subjective and context-dependent. Traditional models just predicted the next word under a simple loss function (such as cross-entropy), without explicitly incorporating human preferences or subjective judgments.
Then RLHF was introduced. This approach uses human feedback on generated text as the evaluation signal and incorporates that feedback into the objective used to optimize the model. Simply put, it uses reinforcement learning to directly optimize a language model with human feedback, so that the model aligns better with complex human values.
RLHF mainly consists of three steps:
1. Pre-train a language model (LM).
2. Aggregate question-and-answer (comparison) data and use it to train a reward model (RM); a minimal loss sketch follows this list.
3. Fine-tune the LM with reinforcement learning (RL) against that reward model.
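To make step 2 concrete, here is a minimal, hypothetical sketch (not OpenAI's actual code) of the pairwise loss commonly used to train the reward model: given a prompt with one human-preferred and one dispreferred response, the reward model is trained so the preferred response scores higher. The `reward_model` callable and the batch format are assumptions made for illustration.

```python
import torch.nn.functional as F

def reward_model_loss(reward_model, prompts, chosen, rejected):
    """Pairwise (Bradley-Terry style) loss for reward-model training.

    reward_model(prompts, responses) is assumed to return one scalar
    reward per example, i.e. a tensor of shape (batch,).
    """
    r_chosen = reward_model(prompts, chosen)      # rewards for preferred responses
    r_rejected = reward_model(prompts, rejected)  # rewards for dispreferred responses
    # Maximize the probability that the preferred response gets the
    # higher reward: loss = -log sigmoid(r_chosen - r_rejected).
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```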

Let's talk about DPO.
Although RLHF introduces human preferences into training and provides a way to combine reinforcement learning with large language models, it is often complex and unstable in practice. It works by first fitting a reward model that captures human preferences, then fine-tuning the large unsupervised (pretrained) LM to maximize this estimated reward while staying as close as possible to the original model.
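Concretely, the RL fine-tuning stage maximizes the learned reward under a KL penalty that keeps the policy near the original model (standard notation, as in the DPO paper: \pi_\theta is the policy being trained, \pi_ref the reference/original model, r_\phi the fitted reward model, and \beta the penalty strength):

```latex
\max_{\pi_\theta}\;
\mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)}\big[ r_\phi(x, y) \big]
\;-\; \beta\, \mathbb{D}_{\mathrm{KL}}\big[ \pi_\theta(y \mid x) \,\|\, \pi_{\mathrm{ref}}(y \mid x) \big]
```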
To address these issues, researchers proposed the DPO algorithm. DPO directly exploits the mapping between the reward function and the optimal policy, and it shows that the constrained reward-maximization problem can be optimized exactly with a single stage of policy training. Essentially, DPO turns alignment into a classification problem over human preference data.
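As a rough sketch of what that single training stage looks like, here is the DPO loss written out in PyTorch (it mirrors the loss in the paper, but the variable names and the assumption that per-sequence log-probabilities are already computed are ours):

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO loss over a batch of preference pairs.

    Each argument is the summed token log-probability of a full response
    under either the trainable policy or the frozen reference model,
    shape (batch,). beta plays the role of the KL coefficient above.
    """
    # Implicit rewards: how much more the policy favors each response
    # than the reference model does, scaled by beta.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Binary classification on the preference pair: push the margin
    # (chosen - rejected) to be positive via -log sigmoid of the margin.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```

Note that no reward model is trained and no responses are sampled here; the loss only needs log-probabilities of responses that were already collected and labeled.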

Compared with RLHF, DPO has many advantages:
- It is more stable and computationally efficient.
- It does not require fitting a reward model or sampling from the model during fine-tuning; it trains directly on preference pairs (see the data example after this list).
- It relies on far fewer hyperparameters.
- It can fine-tune LMs to align with human preferences more effectively, often surpassing existing methods.
- DPO fine-tuning performs better at controlling the sentiment of generated text and at improving the quality of summaries and single-turn dialogue responses.
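For illustration, a single training record for DPO might look like the following (the field names are a common convention rather than a fixed standard); a dataset of such triples is all that a loss like the `dpo_loss` sketch above consumes:

```python
# One preference pair: the only supervision DPO needs. There is no separate
# reward model and no on-policy sampling during fine-tuning. The text below
# is a made-up example.
preference_example = {
    "prompt": "Summarize: The study found that regular exercise improves sleep quality ...",
    "chosen": "Regular exercise was found to improve sleep quality.",  # preferred by the labeler
    "rejected": "Exercise is something people sometimes do.",          # dispreferred
}
```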
You can view the detailed research paper on DPO at https://arxiv.org/abs/2305.18290
The DPO paper also reports a performance comparison between DPO and RLHF (PPO is the reinforcement learning algorithm used within the RLHF framework).

Reinforcement learning is a comparatively difficult and unstable method, and so far only OpenAI and Anthropic have applied it successfully. Many open-source models have not seen significant performance improvements after adopting RLHF. With the emergence of new methods such as DPO, however, reinforcement learning is no longer the only option.