RLHF

Reinforcement Learning from Human Feedback is the training technique that aligns large language models with human preferences. Introduced in 2017 by researchers at OpenAI and DeepMind for games and simulated robotics, it was applied to language models from around 2019 and reached wide deployment with InstructGPT and ChatGPT in 2022.

What is RLHF?

Reinforcement Learning from Human Feedback (RLHF) is the technique that turned raw language models into helpful assistants. The process: (1) train a base language model on broad internet text, (2) collect human ratings comparing pairs of model outputs to identify preferred responses, (3) train a reward model that predicts those human preferences, (4) fine-tune the base model with reinforcement learning to maximize the reward model's score. ChatGPT, Claude, Gemini, and most production LLM assistants are RLHF-trained.
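
To make steps (3) and (4) concrete, here is a minimal PyTorch-style sketch, not drawn from any specific vendor's pipeline; the function names and the beta value of 0.1 are illustrative assumptions. The reward model is fitted with a pairwise (Bradley-Terry) loss on the human comparisons, and the RL stage maximizes that reward minus a KL penalty that keeps the policy close to the original model.

    import torch.nn.functional as F

    # Step 3 (sketch): pairwise reward-model loss.
    # `reward_model` maps an encoded (prompt, response) pair to a scalar score.
    # For each human comparison "chosen beats rejected", the Bradley-Terry
    # objective pushes the chosen score above the rejected score.
    def reward_model_loss(reward_model, chosen_ids, rejected_ids):
        r_chosen = reward_model(chosen_ids)      # shape: (batch,)
        r_rejected = reward_model(rejected_ids)  # shape: (batch,)
        return -F.logsigmoid(r_chosen - r_rejected).mean()

    # Step 4 (sketch): per-response objective for the RL stage.
    # The policy earns the reward model's score minus a KL penalty toward the
    # frozen pre-RL reference model; beta trades reward against drift.
    def rl_objective(reward, logprob_policy, logprob_reference, beta=0.1):
        kl_penalty = logprob_policy - logprob_reference
        return reward - beta * kl_penalty

In practice the RL stage is usually run with PPO over many sampled responses; the point here is only the shape of the objective.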

Variants

RLHF has spawned variants. DPO (Direct Preference Optimization, 2023) reaches similar alignment by optimizing directly on the preference data, with no separate reward model or reinforcement learning stage; a sketch follows below. RLAIF (Reinforcement Learning from AI Feedback) replaces human raters with model raters; Anthropic's Constitutional AI (2022) is the best-known instance, in which the rating model judges outputs against a written set of principles. The choice of alignment technique affects model behavior in subtle ways and is increasingly disclosed in technical reports.
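
A rough sketch of the DPO loss, under the same illustrative assumptions as the sketch above (the function name and beta are hypothetical): instead of training a reward model, DPO compares how much more likely the policy makes the chosen response than the rejected one, relative to a frozen reference model.

    import torch.nn.functional as F

    # DPO (sketch): optimize directly on preference pairs.
    # Inputs are log-probabilities of the chosen and rejected responses under
    # the current policy and under the frozen reference model; beta controls
    # how strongly the policy is pushed away from the reference.
    def dpo_loss(logp_chosen, logp_rejected,
                 ref_logp_chosen, ref_logp_rejected, beta=0.1):
        chosen_margin = logp_chosen - ref_logp_chosen
        rejected_margin = logp_rejected - ref_logp_rejected
        return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()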

Why buyers should care

RLHF's reliance on human data labelers has raised labor and accuracy concerns: many vendors outsource labeling to lower-wage geographies, with quality and bias implications. The reward model can encode the labelers' biases, and the post-RLHF model is more compliant but sometimes less calibrated about its own uncertainty. For deployments in regulated decisions, ask: what alignment technique was used, who labeled the data (geography, compensation, training), how was inter-labeler agreement measured, and are the alignment artifacts (reward model, preference dataset) auditable?
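
On the inter-labeler agreement question, one simple audit is to have a sample of comparisons labeled independently by two raters and report both raw agreement and a chance-corrected statistic such as Cohen's kappa. A minimal sketch (the example labels are made up):

    # Sketch: agreement between two raters on binary preference labels
    # (1 = response A preferred, 0 = response B preferred).
    # Cohen's kappa corrects raw agreement for agreement expected by chance.
    def cohens_kappa(labels_a, labels_b):
        n = len(labels_a)
        observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
        p_a1 = sum(labels_a) / n  # rater A's rate of choosing "A preferred"
        p_b1 = sum(labels_b) / n  # rater B's rate of choosing "A preferred"
        expected = p_a1 * p_b1 + (1 - p_a1) * (1 - p_b1)
        return (observed - expected) / (1 - expected)

    # 4/5 raw agreement, but kappa is noticeably lower once chance is removed.
    print(cohens_kappa([1, 1, 0, 1, 0], [1, 1, 1, 1, 0]))  # ~0.55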