Literature Review: Teaching Language Models to Self-Improve by Learning from Language Feedback

This paper introduces Self-Refinement Tuning (SRT), a two-stage method that trains language models to self-evaluate and improve their outputs using language feedback rather than human preference rankings. The approach reduces reliance on expensive human annotations while achieving substantial performance gains, with their 70B model reaching a 25.8% win rate against GPT-4 Turbo on AlpacaEval 2.0, surpassing established systems like GPT-4-0314 and Claude 2.

Key Insights

The core innovation lies in treating critique and refinement as learnable skills rather than fixed capabilities. In Stage 1, SRT uses GPT-4 to generate structured feedback (weaknesses, scores, suggestions) and refinements for base model outputs, then trains the model to produce this feedback-refinement sequence. Stage 2 leverages the trained model to generate its own preference data for DPO training, creating a self-sustaining improvement loop.
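To make the pipeline concrete, here is a minimal Python sketch of the two stages as I understand them; the function names, sequence markers, and data layout are my own illustrations, not the authors' implementation.

```python
# A minimal sketch of the two-stage SRT pipeline, assuming simple callables
# for the base model, the GPT-4 critic, and the SRT-tuned model; all names
# and the sequence markup are illustrative, not the paper's actual code.

def stage1_build_sft_data(generate, gpt4_critique, prompts):
    """Stage 1: collect GPT-4 feedback and refinements for base-model drafts."""
    examples = []
    for prompt in prompts:
        draft = generate(prompt)  # base model's initial response
        # GPT-4 returns structured feedback (weaknesses, score, suggestions)
        # plus an improved response.
        feedback, refinement = gpt4_critique(prompt, draft)
        # The model is fine-tuned to emit the whole sequence:
        # initial response -> critique -> refinement.
        target = f"{draft}\n[Feedback] {feedback}\n[Refined] {refinement}"
        examples.append({"prompt": prompt, "target": target})
    return examples


def stage2_build_dpo_pairs(srt_generate, srt_critique_and_refine, prompts):
    """Stage 2: the SRT-tuned model critiques its own drafts, yielding
    (chosen, rejected) pairs for DPO without new human labels."""
    pairs = []
    for prompt in prompts:
        draft = srt_generate(prompt)
        refinement = srt_critique_and_refine(prompt, draft)
        # The self-refined answer is treated as preferred over the draft.
        pairs.append({"prompt": prompt, "chosen": refinement, "rejected": draft})
    return pairs
```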

The structured feedback template is particularly noteworthy: it forces the critic to identify specific weaknesses, provide actionable suggestions, and generate improved responses. This contrasts sharply with simple preference rankings, offering richer training signals. The authors demonstrate that the language feedback components (weaknesses, suggestions, scores) each contribute meaningfully to performance, with removing all feedback causing a 5.1-point drop.
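A rough idea of what such a critique template might look like; the field names follow the paper's description (weaknesses, score, suggestions, refined response), but the exact wording and ordering here are assumptions, not the authors' verbatim prompt.

```python
# Illustrative critique template; field names mirror the paper's description,
# the phrasing is my own.
CRITIQUE_TEMPLATE = """\
Query: {query}
Response: {response}

Weaknesses: <list the specific problems with the response>
Score: <rate the response, e.g. 1-10>
Suggestions: <actionable changes that would fix the weaknesses>
Refined response: <an improved response that applies the suggestions>
"""

def build_critique_prompt(query: str, response: str) -> str:
    """Fill the template for one (query, response) pair."""
    return CRITIQUE_TEMPLATE.format(query=query, response=response)
```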

However, the fundamental assumption that LLMs can reliably self-evaluate remains questionable. The paper shows declining human agreement rates after Stage 2 training, particularly for smaller models, suggesting potential degradation in evaluation quality. This aligns with research showing LLMs exhibit “self-bias”, favoring their own generations over alternatives.

Example

For the query “What is the largest ocean in the world?”, the base model might respond with basic facts but incorrect percentages. The critic identifies the inaccuracy, suggests including precise statistics, and generates a refined response with accurate figures (46.8% of Earth’s water surface). The base model learns to produce this entire sequence: initial response → critique → refinement.
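Concretely, a single training instance for this example might look like the sketch below; the draft's wording and the sequence markers are my own assumptions, and only the 46.8% figure comes from the paper's example.

```python
# Hypothetical training instance for the ocean example, written as the single
# sequence the model learns to produce (draft -> critique -> refinement).
ocean_example = {
    "prompt": "What is the largest ocean in the world?",
    "target": (
        # Initial draft: correct basic facts but an inaccurate percentage
        # (the review does not quote the exact wrong figure).
        "The Pacific Ocean is the largest ocean, covering <inaccurate %> "
        "of Earth's water surface.\n"
        # Critique: identify the inaccuracy and suggest a fix.
        "[Feedback] Weakness: the coverage percentage is incorrect. "
        "Suggestion: cite the precise statistic.\n"
        # Refinement: the corrected figure from the paper's example.
        "[Refined] The Pacific Ocean is the largest ocean, covering about "
        "46.8% of Earth's water surface."
    ),
}
```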

Ratings

Novelty: 3/5

While self-refinement itself is not new, the systematic two-stage approach and its integration with DPO training represent a solid incremental advance. The structured feedback template and the empirical analysis of feedback components provide value.

Clarity: 4/5

Well-structured paper with clear methodology, comprehensive experiments, and honest discussion of limitations. The ablation studies effectively demonstrate the importance of different components.

Personal Comments

RLHF always felt fundamentally flawed to me: it's subjective, expensive, and humans can barely agree on what we want from AI systems. Language feedback seemed like the obvious next step, but it has a critical flaw I can't shake: can we really trust LLMs to give insight into their "thought process" when they don't actually think? It's all just probabilities under the hood.

The declining human agreement rates after Stage 2 training confirm my suspicions about creating an AI echo chamber. We’re letting AI evaluate AI performance based on AI-generated criteria, which could amplify biases rather than correct them. The increased verbosity suggests models are already gaming evaluation metrics.

Despite these concerns, SRT represents a necessary exploration given the limitations of human annotation. But we're trading the subjectivity of human judges for open questions about LLM reliability, without solving the underlying problem. The field needs theoretical grounding for when self-evaluation can be trusted. Until then, this feels like building on shaky foundations, albeit necessarily so.



