Reinforcement learning from human feedback (RLHF)
Introduction
In the previous post, we saw that supervised fine-tuning treats language model post-training as a supervised learning problem. We start from a pretrained model, collect prompt-response pairs, and train the model to assign high probability to the desired responses. This already changes the behavior of the model. Instead of simply continuing text, the model learns to answer prompts in a more assistant-like way.
However, SFT has an important limitation: it does not directly model preferences between different possible answers.
Suppose the prompt is
\[u= \text{Define the entropy}\]Now consider two possible responses:
\[y_a = \text{entropy is disorder}\]and
\[y_b= \text{Entropy measures the number of microscopic configurations compatible with a macroscopic state.}\]Both responses are related to the prompt. But many humans would prefer $y_b$, because it is more precise and more informative. The SFT objective does not naturally express this comparison. It can increase the probability of a demonstrated answer, but it does not directly say:
\[y_b \succ y_a.\]This is the motivation for reinforcement learning from human feedback, usually abbreviated as RLHF. The goal is no longer only to imitate demonstrations. The goal is to use human preferences to improve the model policy.
Preference data
The first difference between SFT and RLHF is the type of data.
In SFT, the dataset contains examples of the form
\[(u_i,y_i),\]where $u_i$ is a prompt and $y_i$ is a desired answer.
In RLHF, we instead use preference data. A typical preference dataset has the form
\[\{(u_i,y_i^+,y_i^-)\}_{i=1}^{m}.\]Here $u_i$ is a prompt, $y_i^+$ is the preferred response, and $y_i^-$ is the rejected response.
So the preference data says:
\[y_i^+ \succ y_i^- \qquad \text{given the prompt } u_i.\]This is a weaker form of supervision than giving an exact score to every answer. The human annotator does not need to say how good a response is in absolute terms. They only need to compare two responses.
This is useful because preferences are often easier to collect than absolute rewards. It is difficult to say that a response deserves reward $7.3$, but it is much easier to say that one answer is better than another.
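As a minimal sketch of how such a dataset might be stored, the snippet below represents each triple $(u_i, y_i^+, y_i^-)$ as a small record. The class name `PreferencePair` and its field names are hypothetical, chosen only for illustration; the single example reuses the entropy prompt from the introduction.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PreferencePair:
    """One preference comparison (hypothetical container, for illustration)."""
    prompt: str    # u_i
    chosen: str    # y_i^+, the preferred response
    rejected: str  # y_i^-, the rejected response


preference_data = [
    PreferencePair(
        prompt="Define the entropy",
        chosen=(
            "Entropy measures the number of microscopic configurations "
            "compatible with a macroscopic state."
        ),
        rejected="entropy is disorder",
    ),
]

for pair in preference_data:
    # No absolute score is stored: only the ordering chosen > rejected.
    print(f"prompt:   {pair.prompt}")
    print(f"chosen:   {pair.chosen}")
    print(f"rejected: {pair.rejected}")
```

Note that the record contains no numeric score. The only supervision signal is which of the two responses was preferred.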
The reward model
To use preference data, we introduce a reward model. It is a parametric function
\[r_{\phi}(u,y) \in \mathbb{R}\]It takes a prompt $u$ and a response $y$, and outputs a scalar reward. The goal is that preferred responses receive larger rewards than rejected responses.
So for a preference pair
\[(u_i,y_i^+,y_i^-),\]we want
\[r_\phi(u_i,y_i^+) > r_\phi(u_i,y_i^-).\]The reward model is not the language model itself. It is a separate model trained to predict human preferences. Usually, the reward model is initialized from a language model or shares a similar transformer architecture, but its output is a single scalar instead of a distribution over next tokens.
Bradley-Terry model
Human preferences are noisy. Different annotators may disagree, and even the same annotator may not always make perfectly consistent choices. Therefore, we do not model the preference as deterministic. A common choice is the Bradley–Terry model:
\[\mathbb{P}_\phi \left( y_i^+ \succ y_i^- \mid u_i \right) = \sigma \left( r_\phi(u_i,y_i^+) - r_\phi(u_i,y_i^-) \right),\]where
\[\sigma(z) = \frac{1}{1+\exp(-z)}\]is the sigmoid function. This model has a simple interpretation. If
\[r_\phi(u_i,y_i^+) \gg r_\phi(u_i,y_i^-),\]then
\[\mathbb{P}_\phi \left( y_i^+ \succ y_i^- \mid u_i \right) \approx 1.\]If the two rewards are close, then the model is uncertain:
\[\mathbb{P}_\phi \left( y_i^+ \succ y_i^- \mid u_i \right) \approx \frac{1}{2}.\]So the reward difference
\[r_\phi(u_i,y_i^+) - r_\phi(u_i,y_i^-)\]controls how confident the model is that one response is better than the other.
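The behavior described above can be checked numerically. The sketch below evaluates the Bradley–Terry probability for a few reward pairs; the reward values are made up for illustration, and the function names are not part of any library.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def preference_probability(reward_chosen: float, reward_rejected: float) -> float:
    """Bradley-Terry probability that the first response is preferred."""
    return sigmoid(reward_chosen - reward_rejected)


# Hypothetical reward values: a large gap, a small gap, and a tie.
for r_plus, r_minus in [(5.0, -5.0), (1.0, 0.0), (0.3, 0.3)]:
    p = preference_probability(r_plus, r_minus)
    print(f"r+ = {r_plus:5.1f}, r- = {r_minus:5.1f} -> P(y+ > y-) = {p:.4f}")
```

Only the difference of the two rewards matters: adding the same constant to both rewards leaves the probability unchanged.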
Training the reward model
We now train the reward model from the finite preference dataset $\{(u_i,y_i^+,y_i^-)\}_{i=1}^{m}$. Under the Bradley–Terry model, the likelihood of the observed preferences is
\[\prod_{i=1}^{m} \sigma \left( r_\phi(u_i,y_i^+) - r_\phi(u_i,y_i^-) \right).\]Equivalently, we minimize the negative log-likelihood:
\[\mathcal{L}_{\mathrm{RM}}(\phi) = - \frac{1}{m} \sum_{i=1}^{m} \log \sigma \left( r_\phi(u_i,y_i^+) - r_\phi(u_i,y_i^-) \right).\]This objective encourages the reward model to assign larger rewards to preferred responses than to rejected responses. After minimizing this loss, we obtain a trained reward function
\[r_\phi(u,y)\in \mathbb{R}.\]This reward model will then be used to improve the language model.
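To make the loss concrete, here is a small sketch that evaluates $\mathcal{L}_{\mathrm{RM}}$ directly from reward values. It assumes the rewards $r_\phi(u_i,y_i^+)$ and $r_\phi(u_i,y_i^-)$ have already been computed by some reward model; the numbers below are made up, and the helper names are illustrative.

```python
import math


def log_sigmoid(z: float) -> float:
    """Stable evaluation of log(sigma(z)) = -log(1 + exp(-z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def reward_model_loss(reward_pairs):
    """Average negative log-likelihood over (r_chosen, r_rejected) pairs."""
    total = sum(log_sigmoid(r_plus - r_minus) for r_plus, r_minus in reward_pairs)
    return -total / len(reward_pairs)


# Hypothetical reward-model outputs for three preference pairs.
good_ranking = [(2.0, -1.0), (0.5, -0.5), (1.5, 0.0)]
# Same rewards with chosen and rejected swapped: the model ranks every pair wrongly.
bad_ranking = [(r_minus, r_plus) for r_plus, r_minus in good_ranking]
# A model that cannot tell the responses apart.
uninformative = [(0.0, 0.0)] * 3

print("correct ranking :", reward_model_loss(good_ranking))
print("uninformative   :", reward_model_loss(uninformative))  # equals log(2)
print("wrong ranking   :", reward_model_loss(bad_ranking))
```

A reward model that assigns equal rewards to both responses has loss $\log 2$. Ranking the pairs correctly pushes the loss below this value, and ranking them wrongly pushes it above.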
From reward modeling to policy optimization
At this point, we have two objects. First, we have a language model policy
\[p_\theta(y\mid u),\]which defines a distribution over responses $y$ given a prompt $u$.
Second, we have a reward model
\[r_\phi(u,y),\]which assigns a scalar reward to a prompt-response pair.
A natural idea is to optimize the policy so that it generates high-reward responses:
\[\max_\theta \mathbb{E}_{u\sim \mathcal{D}_{U}} \left[ \mathbb{E}_{y\sim p_\theta(\cdot\mid u)} \left[ r_\phi(u,y) \right] \right].\]Here $\mathcal{D}_{U}$ denotes the distribution of prompts. However, optimizing this objective directly is dangerous. The reward model is trained on a finite preference dataset. It is only an approximation of human preferences. There may be regions where the reward model generalizes poorly. If the policy is optimized too aggressively, it may find responses that exploit mistakes in the reward model instead of genuinely improving according to human preferences. This is known as reward hacking or overoptimization.
To reduce this problem, RLHF usually regularizes the new policy so that it does not move too far from a reference policy. The reference policy is usually the SFT model:
\[p_{\mathrm{ref}}(y\mid u) = p_{\theta_{\mathrm{SFT}}}(y\mid u).\]This makes sense because the SFT model already follows instructions, produces fluent text, and behaves like an assistant. RLHF should improve this model using preferences, not destroy its useful behavior.
The KL-regularized RLHF objective
The standard RLHF objective is therefore a KL-regularized reward maximization problem:
\[J(\theta) = \mathbb{E}_{u\sim \mathcal{D}_{U}} \left[ \mathbb{E}_{y\sim p_\theta(\cdot\mid u)} \left[ r_\phi(u,y) \right] - \beta D_{\mathrm{KL}} \left( p_\theta(\cdot\mid u) \| p_{\mathrm{ref}}(\cdot\mid u) \right) \right].\]The parameter
\[\beta>0\]controls the strength of the KL penalty. The KL term is
\[D_{\mathrm{KL}} \left( p_\theta(\cdot\mid u) \| p_{\mathrm{ref}}(\cdot\mid u) \right) = \mathbb{E}_{y\sim p_\theta(\cdot\mid u)} \left[ \log \frac{ p_\theta(y\mid u) }{ p_{\mathrm{ref}}(y\mid u) } \right].\]This term penalizes the policy when it puts too much probability on responses that the reference model considered unlikely. So the objective has two competing goals:
\[\text{increase reward}\]and
\[\text{stay close to the SFT policy}.\]The KL term prevents the model from moving too far into regions where the reward model may be unreliable.
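The competition between the two goals can be seen on a toy example. The sketch below evaluates the inner part of the objective for a single prompt with only three candidate responses, so that expectations become finite sums. All probabilities and rewards are made up for illustration.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def regularized_objective(policy, reference, rewards, beta):
    """Expected reward minus beta * KL(policy || reference), for one prompt."""
    expected_reward = sum(pi * ri for pi, ri in zip(policy, rewards))
    return expected_reward - beta * kl_divergence(policy, reference)


# Toy setting: three candidate responses to one prompt (numbers are made up).
reference = [0.5, 0.3, 0.2]   # p_ref(y | u)
rewards = [0.0, 1.0, 3.0]     # r_phi(u, y)

candidates = {
    "reference": reference,           # no movement at all
    "moderate": [0.3, 0.3, 0.4],      # some mass shifted to the best response
    "greedy": [0.01, 0.01, 0.98],     # almost all mass on the best response
}

for beta in (0.1, 5.0):
    print(f"beta = {beta}")
    for name, policy in candidates.items():
        value = regularized_objective(policy, reference, rewards, beta)
        print(f"  {name:9s} J = {value: .4f}")
```

With the small penalty, the greedy candidate scores highest among the three. With the large penalty, staying at the reference scores highest, because the KL cost of moving outweighs the reward gain.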
RLHF as constrained optimization
The KL-regularized objective can also be understood as the Lagrangian form of a constrained optimization problem.
For a fixed prompt distribution, we can write the constrained problem as
\[\max_{\theta} \mathbb{E}_{u\sim \mathcal{D}_{U}} \mathbb{E}_{y\sim p_\theta(\cdot\mid u)} \left[ r_\phi(u,y) \right]\]subject to
\[\mathbb{E}_{u\sim \mathcal{D}_{U}} \left[ D_{\mathrm{KL}} \left( p_\theta(\cdot\mid u) \| p_{\mathrm{ref}}(\cdot\mid u) \right) \right] \leq \varepsilon.\]So RLHF can be viewed as reward maximization inside a KL ball around the reference policy.
The parameter $\beta$ plays the role of a Lagrange multiplier. A large $\beta$ strongly penalizes movement away from the reference policy. A small $\beta$ allows more aggressive reward optimization.
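To make this connection explicit, here is a short sketch of the Lagrangian of the constrained problem. The multiplier is written $\mu \geq 0$; this symbol is introduced only for this illustration:
\[\mathcal{L}(\theta,\mu) = \mathbb{E}_{u\sim \mathcal{D}_{U}} \mathbb{E}_{y\sim p_\theta(\cdot\mid u)} \left[ r_\phi(u,y) \right] - \mu \left( \mathbb{E}_{u\sim \mathcal{D}_{U}} \left[ D_{\mathrm{KL}} \left( p_\theta(\cdot\mid u) \| p_{\mathrm{ref}}(\cdot\mid u) \right) \right] - \varepsilon \right).\]For a fixed multiplier $\mu=\beta$, the term $\beta\varepsilon$ does not depend on $\theta$, so
\[\mathcal{L}(\theta,\beta) = J(\theta) + \beta\varepsilon,\]and maximizing $\mathcal{L}(\theta,\beta)$ over $\theta$ is the same as maximizing the KL-regularized objective $J(\theta)$.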
Solving the KL-regularized problem for a fixed prompt
To understand the role of the KL term more clearly, consider a fixed prompt $u$. We ignore the parametrization of the policy and solve directly over all distributions on responses.
Let
\[p_0(y) = p_{\mathrm{ref}}(y\mid u),\]and
\[r(y) = r_\phi(u,y).\]We want to solve
\[\max_{p\in \Delta(\mathcal{Y})} \left\{ \sum_y p(y)r(y) - \beta \sum_y p(y) \log \frac{p(y)}{p_0(y)} \right\},\]subject to
\[\sum_y p(y)=1.\]The Lagrangian is
\[\mathcal{L}(p,\lambda) = \sum_y p(y)r(y) - \beta \sum_y p(y) \log \frac{p(y)}{p_0(y)} + \lambda \left( \sum_y p(y)-1 \right).\]Differentiating with respect to $p(y)$, we get
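\[\frac{\partial \mathcal{L}}{\partial p(y)} = r(y) - \beta \left( \log \frac{p(y)}{p_0(y)} + 1 \right) + \lambda.\]The objective is strictly concave in $p$ (the reward term is linear and the KL term is strictly convex), so the maximizer is characterized by the stationarity condition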
\[r(y) - \beta \left( \log \frac{p(y)}{p_0(y)} + 1 \right) + \lambda = 0.\]Therefore,
\[p(y) = p_0(y) \exp \left( \frac{r(y)}{\beta} \right) \exp \left( \frac{\lambda}{\beta}-1 \right).\]The last factor is a normalization constant. Using
\[\sum_y p(y)=1,\]we obtain
\[p^*(y) = \frac{ p_0(y) \exp \left( \frac{r(y)}{\beta} \right) }{ Z },\]where
\[Z = \sum_{y'} p_0(y') \exp \left( \frac{r(y')}{\beta} \right).\]So the optimal policy is a tilted version of the reference policy:
\[p^*(y) \propto p_{\mathrm{ref}}(y\mid u) \exp \left( \frac{r_\phi(u,y)}{\beta} \right).\]This formula is useful because it shows the effect of the reward and the KL penalty.
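The closed form can be sanity-checked on a toy problem. The sketch below computes the tilted policy for three candidate responses and compares its objective value with a few other distributions. It assumes every response has positive reference probability, and all numbers are made up. It also checks the identity that the optimal value equals $\beta \log Z$, which follows from substituting $p^*$ into the objective.

```python
import math


def tilted_policy(reference, rewards, beta):
    """Return (p_star, log_Z) with p_star(y) proportional to p_ref(y) * exp(r(y) / beta).

    Assumes every entry of `reference` is strictly positive.
    """
    # Work in log space and subtract the maximum for numerical stability.
    logits = [math.log(p0) + r / beta for p0, r in zip(reference, rewards)]
    top = max(logits)
    log_z = top + math.log(sum(math.exp(x - top) for x in logits))
    return [math.exp(x - log_z) for x in logits], log_z


def objective(policy, reference, rewards, beta):
    """Expected reward minus beta * KL(policy || reference)."""
    reward = sum(p * r for p, r in zip(policy, rewards))
    kl = sum(p * math.log(p / p0) for p, p0 in zip(policy, reference) if p > 0)
    return reward - beta * kl


reference = [0.5, 0.3, 0.2]   # p_0(y), made-up values
rewards = [0.0, 1.0, 3.0]     # r(y), made-up values
beta = 1.0

p_star, log_z = tilted_policy(reference, rewards, beta)
best = objective(p_star, reference, rewards, beta)
print("p*          :", [round(p, 4) for p in p_star])
print("J(p*)       :", best)
print("beta * log Z:", beta * log_z)

# No other distribution on the simplex should score higher than p*.
others = [reference, [0.1, 0.1, 0.8], [1 / 3, 1 / 3, 1 / 3], [0.0, 0.0, 1.0]]
for q in others:
    print(q, "->", objective(q, reference, rewards, beta))
```

Working in log space matters in practice: for small $\beta$ the factor $\exp(r(y)/\beta)$ can overflow if it is evaluated directly.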
If $\beta$ is large, then
\[\exp \left( \frac{r(y)}{\beta} \right) \approx 1,\]so
\[p^*(y) \approx p_0(y).\]The optimal policy stays close to the reference policy.
If $\beta$ is small, then the exponential term becomes sharper, and the policy puts much more probability on high-reward responses.
Thus, $\beta$ controls the tradeoff:
\[\text{large } \beta \quad \Rightarrow \quad \text{conservative update},\]\[\text{small } \beta \quad \Rightarrow \quad \text{aggressive reward optimization}.\]
Remark: during policy optimization, the reward model is frozen. Reward modeling and policy optimization are two separate stages.
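The role of $\beta$ can also be illustrated numerically. The sketch below computes the tilted policy $p^*(y) \propto p_0(y)\exp(r(y)/\beta)$ for several values of $\beta$ on a made-up three-response example, and reports how far it moves from the reference and how much reward it collects.

```python
import math


def tilted_policy(reference, rewards, beta):
    """p_star(y) proportional to p_ref(y) * exp(r(y) / beta), computed in log space."""
    logits = [math.log(p0) + r / beta for p0, r in zip(reference, rewards)]
    top = max(logits)
    log_z = top + math.log(sum(math.exp(x - top) for x in logits))
    return [math.exp(x - log_z) for x in logits]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


reference = [0.5, 0.3, 0.2]   # p_0(y), made-up values
rewards = [0.0, 1.0, 3.0]     # r(y), made-up values

summary = []
for beta in (100.0, 10.0, 1.0, 0.1):
    p_star = tilted_policy(reference, rewards, beta)
    kl = kl_divergence(p_star, reference)
    expected_reward = sum(p * r for p, r in zip(p_star, rewards))
    summary.append((beta, kl, expected_reward))
    print(f"beta = {beta:6.1f}  KL = {kl:.4f}  E[r] = {expected_reward:.4f}  "
          f"p* = {[round(p, 3) for p in p_star]}")
```

In this example, decreasing $\beta$ increases both the expected reward and the KL divergence from the reference. This matches the constrained view: a smaller $\beta$ corresponds to a larger KL budget.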