Model Training 2 min read

Reinforcement Learning from Human Feedback

Also known as: RLHF, Human Feedback Training

Definition

A training technique that uses human evaluations of AI outputs to train a reward model, which then guides the AI system to produce outputs more aligned with human preferences.

Overview

Reinforcement Learning from Human Feedback (RLHF) is the primary technique used to align large language models with human preferences. It bridges the gap between a model's raw language generation capabilities and the kind of helpful, harmless, and honest responses that users expect.

The RLHF Process

  1. Supervised Fine-Tuning (SFT): Train the model on high-quality demonstration data
  2. Reward Model Training: Human raters compare pairs of model outputs, creating preference data used to train a reward model
  3. Policy Optimization: Use the reward model to guide further training of the language model using reinforcement learning algorithms (typically PPO)
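The reward model in step 2 is typically trained with a pairwise (Bradley-Terry style) preference loss: for each human comparison, the score of the preferred response should exceed the score of the rejected one. A minimal sketch of that loss, with the function name and scalar inputs chosen here for illustration:

```python
import math

def reward_model_loss(r_chosen: float, r_rejected: float) -> float:
    """Bradley-Terry pairwise preference loss: -log sigmoid(r_chosen - r_rejected).

    r_chosen / r_rejected are the reward model's scalar scores for the
    response the human rater preferred and the one they rejected.
    The loss falls as the model ranks the chosen response higher.
    """
    margin = r_chosen - r_rejected
    # log1p(exp(-m)) is a numerically stable form of -log(sigmoid(m))
    return math.log1p(math.exp(-margin))

# A clear separation in favor of the chosen response yields a lower loss
# than a near-tie:
assert reward_model_loss(3.0, 0.0) < reward_model_loss(0.1, 0.0)
```

In a real pipeline the two scores come from the same network scoring both responses, and the loss is averaged over a batch of comparison pairs.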

Why RLHF Matters

Pre-trained language models generate text by predicting the next most likely token — this doesn't inherently produce helpful, safe, or factual outputs. RLHF teaches models to prioritize helpfulness, truthfulness, and harmlessness based on human judgments of quality.
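In practice, the policy-optimization step usually maximizes the learned reward while penalizing drift away from the pretrained reference model, so the model stays fluent while becoming more preferred. A minimal per-example sketch of that trade-off; the exact penalty form and the `beta` value are illustrative assumptions, not a specific implementation:

```python
def rlhf_objective(reward: float,
                   logp_policy: float,
                   logp_ref: float,
                   beta: float = 0.1) -> float:
    """Toy RLHF training signal for one sampled response.

    reward       -- score from the learned reward model
    logp_policy  -- log-probability of the response under the tuned policy
    logp_ref     -- log-probability under the frozen pretrained reference
    beta         -- strength of the KL-style penalty (illustrative value)

    The penalty term discourages the policy from drifting far from the
    reference model just to chase reward.
    """
    kl_term = logp_policy - logp_ref
    return reward - beta * kl_term

# Same reward, but a response the policy has pushed far from the
# reference distribution earns a smaller net objective:
assert rlhf_objective(1.0, 2.0, 0.0) < rlhf_objective(1.0, 0.0, 0.0)
```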

Challenges

  • Reward Hacking: Models may learn to exploit weaknesses in the reward model, achieving high scores without genuinely improving output quality
  • Annotator Disagreement: Human raters may have conflicting preferences
  • Scale: High-quality human feedback is expensive and time-consuming to collect
  • Bias: Human raters bring their own biases to the evaluation process
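Reward hacking is easiest to see with a deliberately flawed proxy. In this toy example (all names, responses, and the length-based proxy are invented for illustration), an optimizer that maximizes the proxy reward picks a padded answer even though the hypothetical "true" preference favors the concise one:

```python
def proxy_reward(response: str) -> int:
    # Assumed flaw in the reward model: longer responses score higher.
    return len(response.split())

def true_quality(response: str) -> float:
    # Hypothetical ground-truth human preference: concise, correct answers win.
    scores = {
        "Paris.": 1.0,
        "The answer is Paris, which is certainly absolutely Paris indeed.": 0.4,
    }
    return scores[response]

candidates = [
    "Paris.",
    "The answer is Paris, which is certainly absolutely Paris indeed.",
]

hacked = max(candidates, key=proxy_reward)   # what reward maximization selects
best = max(candidates, key=true_quality)     # what humans actually prefer

# The proxy-optimal output diverges from the genuinely preferred one.
assert hacked != best
```

Real reward hacking is subtler (e.g., sycophantic or confidently-worded outputs), but the failure mode is the same: the optimized quantity is only a proxy for human preference.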

Alternatives and Evolutions

Newer approaches like Direct Preference Optimization (DPO) simplify the RLHF pipeline by eliminating the need for a separate reward model. Constitutional AI (CAI) reduces reliance on human feedback by using AI-generated critiques guided by a set of principles.
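DPO's key move is to treat the policy-versus-reference log-probability gap as an implicit reward and apply the pairwise preference loss to it directly, skipping the separate reward model. A minimal sketch on scalar log-probabilities; the function shape and `beta` value here are illustrative, not a full implementation:

```python
import math

def dpo_loss(pi_chosen: float, pi_rejected: float,
             ref_chosen: float, ref_rejected: float,
             beta: float = 0.1) -> float:
    """Direct Preference Optimization loss for one preference pair.

    pi_*  -- log-probabilities of the chosen/rejected responses under the policy
    ref_* -- log-probabilities under the frozen reference model
    beta  -- temperature scaling the implicit reward (illustrative value)

    The implicit reward of a response is beta * (policy logp - reference logp);
    the loss is -log sigmoid of the chosen-minus-rejected margin.
    """
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    # log1p(exp(-m)) is a numerically stable form of -log(sigmoid(m))
    return math.log1p(math.exp(-margin))

# The loss falls when the policy raises the chosen response's probability
# (relative to the reference) more than the rejected response's:
assert dpo_loss(2.0, 0.0, 0.0, 0.0) < dpo_loss(0.0, 0.0, 0.0, 0.0)
```

Because the preference data feeds this loss directly, DPO trains with ordinary gradient descent instead of a reinforcement learning loop.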