Large Language Models (LLMs) have revolutionized natural language processing, but ensuring they align with human values remains challenging. Traditional alignment methods like Reinforcement Learning with Human Feedback (RLHF) and Direct Preference Optimization (DPO) each have their strengths and weaknesses.
To address these limitations, researchers from Nanjing University have introduced Mixed Preference Optimization (MPO)—a hybrid approach that combines the best aspects of RLHF and DPO to improve LLM alignment efficiently. This blog explores how MPO works, why it outperforms existing methods, and what it means for the future of AI training.
Challenges in Aligning LLMs
Pre-trained LLMs generate responses based on statistical patterns in massive text datasets. However, without explicit training on human preferences, they may produce biased, unsafe, or misaligned outputs. Two major approaches to aligning LLMs are:
1. RLHF (Reinforcement Learning with Human Feedback)
- Involves training a reward model based on human preferences.
- Uses Proximal Policy Optimization (PPO) to iteratively refine model responses.
- Effective but computationally expensive and unstable due to reinforcement learning complexities.
2. DPO (Direct Preference Optimization)
- Bypasses the need for a reward model by directly optimizing LLM responses based on human preference data.
- More stable and computationally efficient but prone to distribution shift, meaning it may struggle to generalize beyond the training data.
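To make the contrast concrete, here is a minimal PyTorch sketch of the two objectives: the Bradley-Terry loss commonly used to train an RLHF reward model, and the DPO loss that optimizes the policy directly from preference pairs. This is an illustrative sketch, not the exact implementation from the MPO paper; the log-probabilities would come from the policy and a frozen reference model.

```python
# Illustrative sketches of the two preference objectives (PyTorch).
import torch
import torch.nn.functional as F

def reward_model_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry loss for an RLHF reward model: push the score of the
    preferred response above the score of the rejected one."""
    return -F.logsigmoid(r_chosen - r_rejected).mean()

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta: float = 0.1) -> torch.Tensor:
    """DPO loss: optimize the policy directly on preference pairs using
    log-prob ratios against a frozen reference model, no reward model needed."""
    chosen_ratio = policy_logp_chosen - ref_logp_chosen
    rejected_ratio = policy_logp_rejected - ref_logp_rejected
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()

# Toy usage with random numbers standing in for model outputs.
r_c, r_r = torch.randn(8), torch.randn(8)
print(reward_model_loss(r_c, r_r))
logps = [torch.randn(8) for _ in range(4)]
print(dpo_loss(*logps))
```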
Given these limitations, MPO introduces a two-stage training strategy that balances efficiency and stability while maintaining high alignment quality.
What is Mixed Preference Optimization (MPO)?
MPO is a hybrid training approach that first trains a model using DPO and then refines it using PPO, with two important distinctions: reward-based data selection and an improved reference model.
How MPO Works
1. Data Selection Using Reward Model
- A reward model is trained to score response quality.
- The dataset is split into "easy" and "difficult" examples based on reward score differences (see the sketch after these steps):
- Easy samples: Response pairs where one is clearly better than the other.
- Difficult samples: Pairs where the difference is small, making preference learning harder.
2. Two-Stage Training Process
- Stage 1 (DPO Training on Easy Data)
- DPO is applied to the easy dataset to quickly obtain a well-aligned model.
- Stage 2 (PPO Training on Difficult Data)
- PPO is applied to the difficult dataset.
- Instead of using the supervised fine-tuning (SFT) model as a reference (like standard RLHF), MPO uses the well-trained DPO model as the reference.
- This improves learning stability and reduces the risk of exploring suboptimal regions.
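Putting these steps together, here is a rough Python sketch of the overall procedure. The reward scores, the margin threshold, and the `train_dpo` / `train_ppo` helpers are placeholders standing in for the paper's actual pipeline; only the margin-based split is spelled out.

```python
# Rough sketch of MPO's data split and two-stage schedule, assuming a
# preference dataset already scored by a trained reward model.
from dataclasses import dataclass

@dataclass
class PreferencePair:
    prompt: str
    chosen: str
    rejected: str
    reward_chosen: float    # reward-model score of the preferred response
    reward_rejected: float  # reward-model score of the rejected response

def split_by_margin(pairs, margin_threshold: float):
    """Easy pairs: large reward gap (clear preference).
    Difficult pairs: small gap, where preference learning is harder."""
    easy = [p for p in pairs if p.reward_chosen - p.reward_rejected >= margin_threshold]
    difficult = [p for p in pairs if p.reward_chosen - p.reward_rejected < margin_threshold]
    return easy, difficult

def mixed_preference_optimization(sft_model, pairs, margin_threshold, train_dpo, train_ppo):
    """`train_dpo` and `train_ppo` are placeholder training loops, not
    functions from the paper."""
    easy, difficult = split_by_margin(pairs, margin_threshold)
    # Stage 1: DPO on the easy split, starting from the SFT model.
    dpo_model = train_dpo(sft_model, easy)
    # Stage 2: PPO on the difficult split, using the DPO-trained model
    # (in practice, a frozen copy of it) as the reference instead of the SFT model.
    final_model = train_ppo(policy=dpo_model, reference=dpo_model, data=difficult)
    return final_model

# Toy check of the split on dummy scores.
pairs = [PreferencePair("q", "a", "b", 2.0, 0.5), PreferencePair("q", "a", "b", 1.1, 1.0)]
easy, hard = split_by_margin(pairs, margin_threshold=1.0)
print(len(easy), len(hard))  # -> 1 1
```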
Why This Approach Works
- Better Stability: Using DPO first creates a strong base, preventing instability during PPO training.
- Improved Data Quality: Separating easy and difficult samples helps mitigate noisy preference data, which can degrade model performance.
- Stronger Reference Model: Traditional RLHF limits PPO exploration due to an inferior SFT reference model, while MPO's DPO-trained reference model offers a better-aligned starting point (see the sketch below).
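One way to see why the reference model matters: in PPO-based alignment, the reward the policy optimizes is typically the reward-model score minus a KL penalty against the reference, so a better-aligned reference lets the policy chase higher rewards without drifting. Below is a minimal sketch of that shaped reward, with assumed names and a toy β value; it is a common formulation, not necessarily the exact one used in the MPO paper.

```python
# KL-penalized reward used in PPO-style alignment (illustrative sketch).
import torch

def kl_shaped_reward(reward_score: torch.Tensor,
                     policy_logprobs: torch.Tensor,
                     reference_logprobs: torch.Tensor,
                     beta: float = 0.05) -> torch.Tensor:
    """Sequence-level reward = reward-model score minus a KL penalty,
    estimated from per-token log-probs under the policy and the reference."""
    kl_estimate = (policy_logprobs - reference_logprobs).sum(dim=-1)
    return reward_score - beta * kl_estimate

# Toy usage: batch of 4 responses, 16 tokens each.
scores = torch.randn(4)
pol_lp, ref_lp = torch.randn(4, 16), torch.randn(4, 16)
print(kl_shaped_reward(scores, pol_lp, ref_lp))
```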
MPO vs. DPO vs. PPO: Performance Comparison
Experiments were conducted on two benchmark datasets:
- HH-RLHF (Human Preference Data on Helpfulness and Harmlessness)
- TLDR (Summarization dataset from Reddit)
Key Findings
| Model | HH-RLHF Reward Score | TLDR Reward Score |
|---|---|---|
| SFT (Baseline) | 0.938 | 1.108 |
| DPO (Full Dataset) | 1.859 | 2.816 |
| PPO (Full Dataset) | 2.513 | 3.460 |
| MPO (γ=1.0, Best Version) | 2.801 | 3.784 |
Evaluation Insights
- MPO consistently outperforms both PPO and DPO across different datasets.
- GPT-4 and human evaluations confirm MPO’s improvements in response quality and alignment.
- Ablation studies show that MPO’s curriculum learning strategy (starting with DPO, then PPO) is key to its success.
Real-World Example: Comparing Model Outputs
Here’s how MPO improves response quality over DPO and PPO:
Prompt: How do I make a great Italian sub sandwich?
1. DPO Output: “First, you’ll want to slice the bread into two pieces, and then let it sit in the refrigerator for at least 30 minutes…” (Repetitive and unhelpful)
2. PPO Output: “Here are some ingredients you’ll need: bread, ham, cheese, tomatoes, basil leaves…” (A list without instructions)
3. MPO Output: “To make a great Italian sub, slice a loaf of bread, layer it with sliced prosciutto, tomatoes, cheese, and onions, and season with olive oil and vinegar.” (Clear and structured)
Why MPO Matters for the Future of AI Training
MPO provides a more efficient and stable method for training LLMs while maintaining high-quality alignment. Here’s why it’s a game-changer:
- Reduces Computational Costs: Because the expensive PPO stage runs only on the difficult subset, training uses less GPU time than full RLHF.
- Minimizes Reinforcement Learning Instability: The DPO-first approach prevents policy drift and improves training reliability.
- Improves Model Quality: Better reference models and targeted data selection enhance preference optimization.
Conclusion
MPO bridges the gap between reinforcement learning (RLHF) and contrastive learning (DPO), offering a structured and scalable solution for AI alignment. By combining the strengths of both methods while addressing their weaknesses, MPO ensures that AI models can be better aligned, more robust, and computationally efficient.
As AI developers continue refining alignment techniques, MPO sets a new standard for balancing performance, cost, and stability in training the next generation of LLMs.
VE3 is committed to helping organizations develop advanced AI solutions. We provide tools and expertise that align innovation with impact. Together, we can create AI solutions that work reliably, ethically, and effectively in the real world. Contact us or visit us for a closer look at how VE3 can drive your organization’s success. Let’s shape the future together.