Reinforcement Learning from Human Feedback
Also known as: RLHF, Human Feedback Training
A training technique that uses human evaluations of AI outputs to train a reward model, which then guides the AI system to produce outputs more aligned with human preferences.
Overview
Reinforcement Learning from Human Feedback (RLHF) is the primary technique used to align large language models with human preferences. It bridges the gap between a model's raw language generation capabilities and the kind of helpful, harmless, and honest responses that users expect.
The RLHF Process
- Supervised Fine-Tuning (SFT): Train the model on high-quality demonstration data
- Reward Model Training: Human raters compare pairs of model outputs, creating preference data used to train a reward model
- Policy Optimization: Use the reward model to guide further training of the language model with a reinforcement learning algorithm (typically Proximal Policy Optimization, PPO)
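The reward-model step above is commonly trained with a pairwise (Bradley-Terry) objective: given a human-preferred and a dispreferred response, the model is penalized unless it scores the preferred one higher. A minimal sketch of that loss, with illustrative scalar scores standing in for a real reward model's outputs:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def preference_loss(r_chosen, r_rejected):
    """Pairwise (Bradley-Terry) loss: -log sigmoid(r_chosen - r_rejected).

    Small when the reward model scores the human-preferred response
    higher than the rejected one; large when the ordering is wrong.
    """
    return -math.log(sigmoid(r_chosen - r_rejected))

# Hypothetical reward scores (a real reward model computes these from text).
loss_correct_order = preference_loss(r_chosen=2.0, r_rejected=-1.0)
loss_wrong_order = preference_loss(r_chosen=-1.0, r_rejected=2.0)
```

Minimizing this loss over many human-labeled comparison pairs is what turns raw preference data into a scalar reward signal the policy-optimization step can use.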
Why RLHF Matters
Pre-trained language models generate text by predicting the next most likely token — this doesn't inherently produce helpful, safe, or factual outputs. RLHF teaches models to prioritize helpfulness, truthfulness, and harmlessness based on human judgments of quality.
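To make the point concrete, here is a toy illustration of greedy next-token prediction over a hypothetical probability table: the model simply emits the most likely continuation, which optimizes for fluency, not for helpfulness, safety, or truth.

```python
# Hypothetical conditional distribution over next tokens, e.g. after
# the prompt "The capital of France is". A real model computes this
# distribution over its whole vocabulary at every step.
next_token_probs = {"Paris": 0.62, "London": 0.21, "the": 0.17}

def predict_next(probs):
    """Greedy decoding: return the single most likely next token."""
    return max(probs, key=probs.get)

predict_next(next_token_probs)  # "Paris"
```

Nothing in this objective distinguishes a helpful answer from a plausible-sounding wrong one; RLHF adds that distinction via human judgments.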
Challenges
- Reward Hacking: Models may find ways to achieve high reward scores without genuinely improving (for example, producing longer or more flattering answers that the reward model overrates)
- Annotator Disagreement: Human raters may have conflicting preferences
- Scale: High-quality human feedback is expensive and time-consuming to collect
- Bias: Human raters bring their own biases to the evaluation process
Alternatives and Evolutions
Newer approaches like Direct Preference Optimization (DPO) simplify the RLHF pipeline by eliminating the need for a separate reward model. Constitutional AI (CAI) reduces reliance on human feedback by using AI-generated critiques guided by a set of principles.
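DPO's simplification can be sketched directly: instead of training a reward model and running RL, it optimizes the policy on preference pairs using log-probabilities from the policy and a frozen reference model. A minimal, pure-Python sketch of the per-pair DPO loss (the log-probability values here are hypothetical; in practice they come from the language models):

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss on one preference pair.

    margin = beta * [(log pi(chosen) - log pi_ref(chosen))
                     - (log pi(rejected) - log pi_ref(rejected))]
    loss   = -log sigmoid(margin)

    The loss falls as the policy raises the chosen response's probability
    relative to the reference model, and lowers the rejected one's.
    """
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# When the policy matches the reference exactly, the margin is zero
# and the loss is log(2), regardless of the absolute log-probs.
baseline = dpo_loss(-1.0, -2.0, -1.0, -2.0)
```

Note there is no reward model anywhere in this objective — the implicit reward is the beta-scaled log-probability ratio between policy and reference.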
Related Terms
AI Alignment
The research field focused on ensuring that AI systems' goals, behaviors, and values are compatible with human intentions and societal well-being throughout their operation.
Fine-Tuning
The process of further training a pre-trained AI model on a specialized dataset to adapt its behavior, knowledge, or output style for a specific domain or task.
Large Language Model
A type of AI model trained on vast amounts of text data that can understand, generate, and manipulate human language, typically based on the transformer architecture with billions of parameters.
Machine Learning
A subset of artificial intelligence that enables systems to learn and improve from experience without being explicitly programmed, using algorithms that identify patterns in data.