Literature Review: On-Policy RL with Optimal Reward Baseline
This paper introduces OPO (On-Policy RL with Optimal reward baseline), a simplified reinforcement learning algorithm that eliminates auxiliary components common in RLHF while improving training stability and performance. The approach combines exact on-policy training with a theoretically optimal reward baseline, requiring only a single policy model without value networks or regularization terms.
Key Insights
The core innovation lies in two complementary strategies that address fundamental issues in current RLHF methods. First, exact on-policy training ensures each gradient step uses fresh data from the current policy, preventing the entropy collapse and large policy shifts that plague methods like PPO when reusing rollout data. Second, the optimal reward baseline minimizes gradient variance through a length-weighted average of rewards, derived from the theoretical optimal baseline under reasonable assumptions about gradient orthogonality in sequence generation.
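A minimal structural sketch of what "exact on-policy" means in practice is shown below; the function names (`sample_rollouts`, `compute_advantages`, `policy_gradient_step`) are placeholders of my own, not the authors' API.

```python
# Structural sketch of exact on-policy training; the three helper functions are
# hypothetical placeholders, not the authors' implementation.

def sample_rollouts(policy, prompts, k):
    """Placeholder: sample k fresh responses per prompt from the *current* policy."""
    ...

def compute_advantages(rollouts):
    """Placeholder: reward minus a baseline (the length-weighted baseline is sketched below)."""
    ...

def policy_gradient_step(policy, rollouts, advantages):
    """Placeholder: apply exactly one gradient update with this batch, then discard it."""
    ...

def train(policy, prompt_batches, k=8):
    for prompts in prompt_batches:
        rollouts = sample_rollouts(policy, prompts, k)    # always sampled from the latest policy
        advantages = compute_advantages(rollouts)
        policy_gradient_step(policy, rollouts, advantages)
        # Because each batch is used for a single update and never reused, the data
        # distribution always matches the current policy: no clipping ratio, KL
        # penalty, or value network is needed to compensate for stale rollouts.
```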
The mathematical elegance emerges from simplifying the impractical exact optimal baseline formula. Under the assumption that gradient norms are proportional to sequence length, the optimal baseline reduces to a simple length-weighted reward average, b* = Σ(l_i × r_i) / Σ(l_i), where l_i and r_i are the length and reward of the i-th sampled response. This formulation is both theoretically sound and practically implementable, eliminating the computational overhead of calculating individual gradient norms.
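A minimal, runnable sketch of this computation (my own function names and plain Python lists, not the paper's code):

```python
# Length-weighted optimal baseline and the resulting advantages for one group
# of sampled responses (assumed interface, for illustration only).

def length_weighted_baseline(rewards, lengths):
    """b* = sum(l_i * r_i) / sum(l_i): rewards weighted by response length."""
    return sum(l * r for l, r in zip(lengths, rewards)) / sum(lengths)

def compute_advantages(rewards, lengths):
    """Advantage of each response is its reward minus the shared baseline."""
    b = length_weighted_baseline(rewards, lengths)
    return [r - b for r in rewards]

# Toy usage: three sampled responses with rewards 1.0, 0.0, 1.0 and token lengths 120, 400, 80.
rewards = [1.0, 0.0, 1.0]
lengths = [120, 400, 80]
print(length_weighted_baseline(rewards, lengths))  # (120*1 + 400*0 + 80*1) / 600 = 0.333...
print(compute_advantages(rewards, lengths))        # [0.666..., -0.333..., 0.666...]
```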
Example
In mathematical reasoning experiments using DeepSeek-R1-Distill-Qwen-7B, OPO demonstrates its effectiveness by sampling K=8 responses per prompt and computing advantages using the length-weighted baseline. For instance, on AIME 2024, OPO achieves 68.50% pass@1 compared to 67.96% for standard GRPO, with more pronounced improvements at higher pass@k values where diversity matters most.
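To make the group-level computation concrete, the toy example below contrasts the OPO-style length-weighted baseline with GRPO's group-mean/standard-deviation normalization for a single prompt with K=8 responses; the rewards and lengths are invented purely for illustration.

```python
import statistics

# Toy group of K=8 responses to one prompt (invented rewards and token lengths).
rewards = [1, 1, 0, 0, 1, 0, 0, 1]
lengths = [850, 1200, 2400, 3100, 640, 2800, 1900, 760]

# OPO-style advantage: reward minus the length-weighted baseline.
b_opo = sum(l * r for l, r in zip(lengths, rewards)) / sum(lengths)
adv_opo = [r - b_opo for r in rewards]

# GRPO-style advantage: reward standardized by the group mean and standard deviation.
mu, sigma = statistics.mean(rewards), statistics.pstdev(rewards)
adv_grpo = [(r - mu) / sigma for r in rewards]

print(round(b_opo, 3), [round(a, 2) for a in adv_opo])
print(round(mu, 3), [round(a, 2) for a in adv_grpo])
```

In this made-up group the length-weighted baseline (≈0.25) sits below the plain group mean (0.5) because the longer responses happen to be the incorrect ones.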
Ratings
Novelty: 4/5
The theoretical derivation of the optimal baseline and its practical simplification for sequence generation are clever. The emphasis on exact on-policy training provides valuable insight into why auxiliary components may be unnecessary.
Clarity: 4/5
Well-structured paper with clear mathematical derivations and comprehensive experimental validation. The connection between theory and practice is well-established.
Personal Comments
This work represents a refreshing return to first principles in RLHF. The insight that exact on-policy training naturally maintains entropy without regularization challenges the conventional wisdom that auxiliary components are necessary for stability. The length-weighted baseline is particularly appealing: it captures the intuition that longer responses should contribute more to variance reduction while remaining computationally tractable.
What I find most important is how this approach strips away the complexity that has accumulated in RLHF methods. The observation that PPO’s off-policy behavior during multi-step updates on fixed batches contributes to instability is not new, but the authors provide compelling evidence that exact on-policy training alone can resolve these issues. The mathematical foundation for the optimal baseline, while building on classical variance-reduction techniques, offers a principled approach to advantage estimation that many practitioners have previously approximated heuristically.