Introduction

In the previous post, we saw that supervised fine-tuning treats language model post-training as a supervised learning problem. We start from a pretrained model, collect prompt-response pairs, and train the model to assign high probability to the desired responses. This already changes the behavior of the model. Instead of simply continuing text, the model learns to answer prompts in a more assistant-like way.

However, SFT has an important limitation: it does not directly model preferences between different possible answers.

Suppose the prompt is

\[u= \text{Define the entropy}\]

Now consider two possible responses:

\[y_a = \text{Entropy is disorder}\]

and

\[y_b= \text{Entropy measures the number of microscopic configurations compatible with a macroscopic state.}\]

Both responses are related to the prompt. But many humans would prefer $y_b$, because it is more precise and more informative. The SFT objective does not naturally express this comparison. It can increase the probability of a demonstrated answer, but it does not directly say:

\[y_b \succ y_a.\]

This is the motivation for reinforcement learning from human feedback, usually abbreviated as RLHF. The goal is no longer only to imitate demonstrations. The goal is to use human preferences to improve the model policy. In this post, we focus on the classical RLHF pipeline: preference data, reward modeling, and policy optimization with a KL penalty.

Preference data

↓

The reward model

↓

The KL-regularized RLHF objective

↓

RLHF as constrained optimization

↓

Policy-gradient methods and PPO

Preference data

The first difference between SFT and RLHF is the type of data.

In SFT, the dataset contains examples of the form

\[(u_i,y_i),\]

where $u_i$ is a prompt and $y_i$ is a desired answer.

In RLHF, we instead use preference data. A typical preference dataset has the form

\[(u_i,y_i^+,y_i^-).\]

Here $u_i$ is a prompt, $y_i^+$ is the preferred response, and $y_i^-$ is the rejected response.

So the preference data says:

\[y_i^+ \succ y_i^- \qquad \text{given the prompt } u_i.\]

This is a weaker form of supervision than giving an exact score to every answer. The human annotator does not need to say how good a response is in absolute terms. They only need to compare two responses.

This is useful because preferences are often easier to collect than absolute scores. It is difficult to say that a response deserves score $7.3$, but it is much easier to say that one answer is better than another.
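To make the data format concrete, here is a minimal sketch of how one preference example $(u_i, y_i^+, y_i^-)$ could be stored. The class name `PreferencePair` and its field names are illustrative assumptions, not a standard API. The single example reuses the entropy prompt from the introduction.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PreferencePair:
    """One preference example (u, y_plus, y_minus). Names are illustrative."""
    prompt: str    # u
    chosen: str    # y^+, the preferred response
    rejected: str  # y^-, the rejected response


# Toy dataset reusing the entropy example from the introduction.
preference_data = [
    PreferencePair(
        prompt="Define the entropy",
        chosen=("Entropy measures the number of microscopic configurations "
                "compatible with a macroscopic state."),
        rejected="Entropy is disorder",
    ),
]

for pair in preference_data:
    print(f"{pair.prompt!r}: {pair.chosen!r} > {pair.rejected!r}")
```

Note that no absolute score is stored anywhere: the only supervision is which of the two responses was preferred.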

The reward model

To use preference data, we introduce a reward model. It is a parametric function

\[r_{\phi}(u,y) \in \mathbb{R}.\]

It takes a prompt $u$ and a response $y$, and outputs a scalar reward. The goal is that preferred responses receive larger rewards than rejected responses.

So for a preference pair

\[(u_i,y_i^+,y_i^-),\]

we want

\[r_\phi(u_i,y_i^+) > r_\phi(u_i,y_i^-).\]

The reward model is not the language model itself. It is a separate model trained to predict human preferences. Usually, the reward model is initialized from a language model or shares a similar transformer architecture, but its output is a single scalar instead of a distribution over next tokens.

Bradley-Terry model

Human preferences are noisy. Different annotators may disagree, and even the same annotator may not always make perfectly consistent choices. Therefore, we do not model the preference as deterministic. A common choice is the Bradley–Terry model:

\[\mathbb{P}_\phi \left( y_i^+ \succ y_i^- \mid u_i \right) = \sigma \left( r_\phi(u_i,y_i^+) - r_\phi(u_i,y_i^-) \right),\]

where

\[\sigma(z) = \frac{1}{1+\exp(-z)}\]

is the sigmoid function. This model has a simple interpretation. If

\[r_\phi(u_i,y_i^+) \gg r_\phi(u_i,y_i^-),\]

then

\[\mathbb{P}_\phi \left( y_i^+ \succ y_i^- \mid u_i \right) \approx 1.\]

If the two rewards are close, then the model is uncertain:

\[\mathbb{P}_\phi \left( y_i^+ \succ y_i^- \mid u_i \right) \approx \frac{1}{2}.\]

So the reward difference

\[r_\phi(u_i,y_i^+) - r_\phi(u_i,y_i^-)\]

controls how confident the model is that one response is better than the other.
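The following short sketch evaluates the Bradley–Terry probability for a few reward gaps. The function names are our own, and the reward values are made up for illustration. The sigmoid is written in a numerically stable form.

```python
import math


def sigmoid(z: float) -> float:
    """Numerically stable logistic function sigma(z) = 1 / (1 + exp(-z))."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def preference_probability(r_plus: float, r_minus: float) -> float:
    """Bradley-Terry probability that the first response is preferred."""
    return sigmoid(r_plus - r_minus)


# Only the reward difference matters, not the absolute reward values.
for gap in (0.0, 1.0, 5.0):
    print(gap, preference_probability(gap, 0.0))
```

A gap of zero gives probability exactly $1/2$, and the probability approaches $1$ as the gap grows, which matches the two regimes described above.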

Training the reward model

We now train the reward model from the finite preference dataset

\[\{(u_i,y_i^+,y_i^-)\}_{i=1}^{m}.\]

Under the Bradley–Terry model, the likelihood of the observed preferences is

\[\prod_{i=1}^{m} \sigma \left( r_\phi(u_i,y_i^+) - r_\phi(u_i,y_i^-) \right).\]

Equivalently, we minimize the negative log-likelihood:

\[\mathcal{L}_{\mathrm{RM}}(\phi) = - \frac{1}{m} \sum_{i=1}^{m} \log \sigma \left( r_\phi(u_i,y_i^+) - r_\phi(u_i,y_i^-) \right).\]

This objective encourages the reward model to assign larger rewards to preferred responses than to rejected responses. After minimizing this loss, we obtain a trained reward function

\[r_\phi(u,y)\in \mathbb{R}.\]

This reward model will then be used to improve the language model.
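As an illustration of the loss $\mathcal{L}_{\mathrm{RM}}$, the sketch below computes it from lists of reward values. In a real system these values would be produced by the reward network $r_\phi$ and the loss would be minimized by gradient descent on $\phi$. Here the numbers are hypothetical and only the loss evaluation is shown.

```python
import math


def log_sigmoid(z: float) -> float:
    """Stable log(sigma(z)) = -log(1 + exp(-z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def reward_model_loss(rewards_chosen, rewards_rejected):
    """Average Bradley-Terry negative log-likelihood over m pairs."""
    assert len(rewards_chosen) == len(rewards_rejected) > 0
    m = len(rewards_chosen)
    total = sum(log_sigmoid(rp - rm)
                for rp, rm in zip(rewards_chosen, rewards_rejected))
    return -total / m


# Hypothetical reward-model outputs for three preference pairs.
good = reward_model_loss([2.0, 1.5, 3.0], [0.0, 0.5, -1.0])  # ranks pairs correctly
bad = reward_model_loss([0.0, 0.5, -1.0], [2.0, 1.5, 3.0])   # ranks pairs wrongly
print(good, bad)
```

The loss is small when preferred responses receive larger rewards and large when the ranking is reversed. When both rewards are equal, each pair contributes $\log 2$.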

From reward modeling to policy optimization

At this point, we have two objects. First, we have a language model policy

\[p_\theta(y\mid u),\]

which defines a distribution over responses $y$ given a prompt $u$.

Second, we have a reward model

\[r_\phi(u,y),\]

which assigns a scalar reward to a prompt-response pair.

A natural idea is to optimize the policy so that it generates high-reward responses:

\[\max_\theta \mathbb{E}_{u\sim \mathcal{D}_{U}} \left[ \mathbb{E}_{y\sim p_\theta(\cdot\mid u)} \left[ r_\phi(u,y) \right] \right].\]

Here $\mathcal{D}_{U}$ denotes the distribution of prompts. However, optimizing this objective directly is dangerous. The reward model is trained on a finite preference dataset. It is only an approximation of human preferences. There may be regions where the reward model generalizes poorly. If the policy is optimized too aggressively, it may find responses that exploit mistakes in the reward model instead of genuinely improving according to human preferences. This is known as reward hacking or overoptimization.

To reduce this problem, RLHF usually regularizes the new policy so that it does not move too far from a reference policy. The reference policy is usually the SFT model:

\[p_{\mathrm{ref}}(y\mid u) = p_{\theta_{\mathrm{SFT}}}(y\mid u).\]

This makes sense because the SFT model already follows instructions, produces fluent text, and behaves like an assistant. RLHF should improve this model using preferences, not destroy its useful behavior.

The KL-regularized RLHF objective

The standard RLHF objective is therefore a KL-regularized reward maximization problem:

\[J(\theta) = \mathbb{E}_{u\sim \mathcal{D}_{U}} \left[ \mathbb{E}_{y\sim p_\theta(\cdot\mid u)} \left[ r_\phi(u,y) \right] - \beta D_{\mathrm{KL}} \left( p_\theta(\cdot\mid u) \| p_{\mathrm{ref}}(\cdot\mid u) \right) \right].\]

The parameter

\[\beta>0\]

controls the strength of the KL penalty. The KL term is

\[D_{\mathrm{KL}} \left( p_\theta(\cdot\mid u) \| p_{\mathrm{ref}}(\cdot\mid u) \right) = \mathbb{E}_{y\sim p_\theta(\cdot\mid u)} \left[ \log \frac{ p_\theta(y\mid u) }{ p_{\mathrm{ref}}(y\mid u) } \right].\]

This term penalizes the policy when it puts too much probability on responses that the reference model considered unlikely. So the objective has two competing goals:

\[\text{increase reward}\]

and

\[\text{stay close to the SFT policy}.\]

The KL term prevents the model from moving too far into regions where the reward model may be unreliable. So it acts like a trust region around the SFT model.
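A toy numerical sketch can make the tradeoff visible. We assume a single prompt with only three candidate responses, a made-up reference distribution, and made-up rewards. All names and numbers below are illustrative.

```python
import math


def kl_divergence(p, q):
    """D_KL(p || q) for discrete distributions; assumes q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def regularized_objective(p, p_ref, rewards, beta):
    """Expected reward under p minus beta * KL(p || p_ref), for one prompt."""
    expected_reward = sum(pi * ri for pi, ri in zip(p, rewards))
    return expected_reward - beta * kl_divergence(p, p_ref)


# Toy setting: three candidate responses for a single prompt.
p_ref = [0.5, 0.3, 0.2]
rewards = [0.0, 1.0, 2.0]
greedy = [0.01, 0.01, 0.98]   # almost all mass on the top-reward response
mild = [0.3, 0.3, 0.4]        # a modest shift toward higher reward

for beta in (0.1, 5.0):
    print(beta,
          regularized_objective(greedy, p_ref, rewards, beta),
          regularized_objective(mild, p_ref, rewards, beta))
```

With a small $\beta$ the greedy policy scores higher, while with a large $\beta$ its KL cost dominates and the mild policy scores higher.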

RLHF as constrained optimization

The KL-regularized objective can also be understood as the Lagrangian form of a constrained optimization problem.

For a fixed prompt distribution, we can write the constrained problem as

\[\max_{\theta} \mathbb{E}_{u\sim \mathcal{D}_{U}} \mathbb{E}_{y\sim p_\theta(\cdot\mid u)} \left[ r_\phi(u,y) \right]\]

subject to

\[\mathbb{E}_{u\sim \mathcal{D}_{U}} \left[ D_{\mathrm{KL}} \left( p_\theta(\cdot\mid u) \| p_{\mathrm{ref}}(\cdot\mid u) \right) \right] \leq \varepsilon.\]

So RLHF can be viewed as reward maximization inside a KL ball around the reference policy.

The parameter $\beta$ plays the role of a Lagrange multiplier. A large $\beta$ strongly penalizes movement away from the reference policy. A small $\beta$ allows more aggressive reward optimization.

Solving the KL-Regularized Problem for a Fixed Prompt

To understand the role of the KL term more clearly, consider a fixed prompt $u$. We ignore the parametrization of the policy and solve directly over all distributions on responses.

Let

\[p_0(y) = p_{\mathrm{ref}}(y\mid u),\]

and

\[r(y) = r_\phi(u,y).\]

We want to solve

\[\max_{p\in \Delta(\mathcal{Y})} \left\{ \sum_y p(y)r(y) - \beta \sum_y p(y) \log \frac{p(y)}{p_0(y)} \right\},\]

subject to

\[\sum_y p(y)=1.\]

The Lagrangian is

\[\mathcal{L}(p,\lambda) = \sum_y p(y)r(y) - \beta \sum_y p(y) \log \frac{p(y)}{p_0(y)} + \lambda \left( \sum_y p(y)-1 \right).\]

Differentiating with respect to $p(y)$, we get

\[\frac{\partial \mathcal{L}}{\partial p(y)} = r(y) - \beta \left( \log \frac{p(y)}{p_0(y)} + 1 \right) + \lambda.\]

At the optimum,

\[r(y) - \beta \left( \log \frac{p(y)}{p_0(y)} + 1 \right) + \lambda = 0.\]

Therefore,

\[p(y) = p_0(y) \exp \left( \frac{r(y)}{\beta} \right) \exp \left( \frac{\lambda}{\beta}-1 \right).\]

The last factor is a normalization constant. Using

\[\sum_y p(y)=1,\]

we obtain

\[p^*(y) = \frac{ p_0(y) \exp \left( \frac{r(y)}{\beta} \right) }{ Z },\]

where

\[Z = \sum_{y'} p_0(y') \exp \left( \frac{r(y')}{\beta} \right).\]

So the optimal policy is a tilted version of the reference policy:

\[p^*(y) \propto p_{\mathrm{ref}}(y\mid u) \exp \left( \frac{r_\phi(u,y)}{\beta} \right).\]

This expression shows that the optimal policy is obtained by tilting the reference policy toward high-reward responses. The reference model provides the base distribution, while the exponential reward term increases the probability of responses with larger reward.

A small technical point is that this derivation assumes that the optimized policy is supported on the support of the reference policy. In other words, if

\[p_{\mathrm{ref}}(y\mid u)=0,\]

then the KL term prevents the optimized policy from assigning positive probability to $y$. The optimization can only redistribute probability mass among responses that are already possible under the reference policy.

This formula also makes the role of $\beta$ clear. If $\beta$ is large, then

\[\exp \left( \frac{r(y)}{\beta} \right) \approx 1 \quad \text{for every } y,\]

so the normalization constant satisfies $Z \approx 1$, and therefore

\[p^*(y) \approx p_0(y).\]

The optimal policy stays close to the reference policy.

If $\beta$ is small, then the exponential term becomes sharper, and the policy puts much more probability on high-reward responses.

Thus, $\beta$ controls the tradeoff:

\[\text{large } \beta \quad \Rightarrow \quad \text{conservative update},\]

\[\text{small } \beta \quad \Rightarrow \quad \text{aggressive reward optimization}.\]
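The closed-form solution $p^*(y) \propto p_0(y)\exp(r(y)/\beta)$ is easy to evaluate when the response set is tiny. The sketch below does so for a hypothetical three-response example (the reference probabilities and rewards are made up) and prints the result for several values of $\beta$.

```python
import math


def tilted_policy(p_ref, rewards, beta):
    """Return p*(y) proportional to p_ref(y) * exp(r(y) / beta)."""
    # Subtract the max reward before exponentiating for numerical stability;
    # the shift cancels in the normalization.
    r_max = max(rewards)
    weights = [p0 * math.exp((r - r_max) / beta)
               for p0, r in zip(p_ref, rewards)]
    z = sum(weights)
    return [w / z for w in weights]


p_ref = [0.5, 0.3, 0.2]
rewards = [0.0, 1.0, 2.0]

for beta in (100.0, 1.0, 0.1):
    print(beta, [round(p, 3) for p in tilted_policy(p_ref, rewards, beta)])
```

For $\beta = 100$ the output is close to the reference distribution, while for $\beta = 0.1$ nearly all the mass sits on the highest-reward response. For real language models the sum defining $Z$ runs over all possible responses and cannot be computed this way, so this is only a didactic example.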

Remark: during policy optimization, the reward model is frozen. Reward modeling and policy optimization are two separate stages.

The Reinforcement Learning Formulation

RLHF is called reinforcement learning from human feedback because, after training the reward model, we can view the language model as a policy in a reinforcement learning problem.

In a standard RL problem, an agent interacts with an environment. At each time step, it observes a state, chooses an action, receives a reward, and moves to a new state.

For language modeling, the analogy is as follows.

The prompt is the initial state:

\[s_0 = u.\]

At time $t$, the state is the prompt together with the tokens generated so far:

\[s_t = (u,y_{<t}).\]

The action is the next token:

\[a_t = y_t.\]

The policy is the language model:

\[\pi_\theta(a_t\mid s_t) = p_\theta(y_t\mid u,y_{<t}).\]

A full response is a trajectory:

\[\tau = (y_1,\dots,y_T).\]

The probability of this trajectory under the policy is

\[\pi_\theta(\tau\mid u) = \prod_{t=1}^{T} \pi_\theta(y_t\mid u,y_{<t}).\]

The reward model assigns a scalar score to the completed response. Identifying the trajectory $\tau$ with the response $y=(y_1,\dots,y_T)$, we write

\[R(u,\tau) = r_\phi(u,y).\]

This is usually a terminal reward: the model first generates the full answer, and then the reward model evaluates the prompt-response pair.

If we only optimized this reward, the RL objective would be

\[\max_\theta \mathbb{E}_{u\sim \mathcal{D}_{U}} \mathbb{E}_{\tau\sim \pi_\theta(\cdot\mid u)} \left[ r_\phi(u,\tau) \right].\]

However, in RLHF we do not want the policy to move too far from the reference model. Therefore, we add a KL penalty and use the regularized reward

\[R_{\theta}(u,y) = r_\phi(u,y) - \beta \log \frac{ p_\theta(y\mid u) }{ p_{\mathrm{ref}}(y\mid u) }.\]

The corresponding RL objective is

\[\max_\theta \mathbb{E}_{u\sim \mathcal{D}_{U}} \mathbb{E}_{\tau\sim \pi_\theta(\cdot\mid u)} \left[ R_{\theta}(u,\tau) \right].\]

In words, this objective rewards responses that receive a high score from the reward model, while penalizing responses to which the policy assigns a much higher probability than the reference policy does.

Since the language model is autoregressive,

\[p_\theta(y\mid u) = \prod_{t=1}^{T} p_\theta(y_t\mid u,y_{<t}),\]

and similarly,

\[p_{\mathrm{ref}}(y\mid u) = \prod_{t=1}^{T} p_{\mathrm{ref}}(y_t\mid u,y_{<t}).\]

Therefore, the KL log-ratio decomposes as

\[\log \frac{ p_\theta(y\mid u) }{ p_{\mathrm{ref}}(y\mid u) } = \sum_{t=1}^{T} \log \frac{ p_\theta(y_t\mid u,y_{<t}) }{ p_{\mathrm{ref}}(y_t\mid u,y_{<t}) }.\]

This decomposition is useful because the reward model evaluates the completed answer, while the KL penalty can be computed token by token along the generated sequence.
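The decomposition can be checked numerically on a short sequence. In the sketch below, the per-token probabilities are hypothetical numbers standing in for $p_\theta(y_t\mid u,y_{<t})$ and $p_{\mathrm{ref}}(y_t\mid u,y_{<t})$ along one sampled response. In practice they would be read off the two models' output distributions.

```python
import math


def sequence_log_ratio(policy_token_probs, ref_token_probs):
    """Sum of per-token log-ratios log p_theta(y_t|...) - log p_ref(y_t|...)."""
    return sum(math.log(p) - math.log(q)
               for p, q in zip(policy_token_probs, ref_token_probs))


# Hypothetical per-token probabilities of one sampled response (T = 4).
policy_probs = [0.9, 0.6, 0.8, 0.7]
ref_probs = [0.8, 0.5, 0.8, 0.4]

token_sum = sequence_log_ratio(policy_probs, ref_probs)
# Same quantity computed from the full-sequence probabilities.
sequence_level = math.log(math.prod(policy_probs) / math.prod(ref_probs))
print(token_sum, sequence_level)
```

Working with sums of log-probabilities rather than products of probabilities is also the numerically safer choice, since products over long sequences underflow quickly.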

Policy Optimization Algorithms

Once RLHF is written as a reinforcement learning problem, we still need an algorithm to optimize the policy. For a fixed prompt $u$, the KL-regularized objective can be written as

\[J(\theta) = \mathbb{E}_{y\sim p_\theta(\cdot\mid u)} \left[ r_\phi(u,y) - \beta \log \frac{ p_\theta(y\mid u) }{ p_{\mathrm{ref}}(y\mid u) } \right].\]

Equivalently,

\[J(\theta) = \sum_{y\in \mathcal{Y}} p_\theta(y\mid u) \left[ r_\phi(u,y) - \beta \log \frac{ p_\theta(y\mid u) }{ p_{\mathrm{ref}}(y\mid u) } \right].\]

Here $\mathcal{Y}$ is the set of all possible responses. If the vocabulary has size $V$ and the response length is $T$, then the number of elements of $\mathcal{Y}$ is of order $V^T$. Computing this sum exactly is therefore intractable. A natural idea is to use sampling. We sample responses from the current policy:

\[y^{(1)},\dots,y^{(N)} \sim p_\theta(\cdot\mid u).\]

Then we approximate the objective by Monte Carlo:

\[\widehat J(\theta) = \frac{1}{N} \sum_{i=1}^{N} \left[ r_\phi(u,y^{(i)}) - \beta \log \frac{ p_\theta(y^{(i)}\mid u) }{ p_{\mathrm{ref}}(y^{(i)}\mid u) } \right].\]

However, optimizing the policy requires an estimate of

\[\nabla_\theta J(\theta).\]

We cannot simply differentiate through the sampled responses $y^{(i)}$, because the responses are discrete sequences of tokens. The sampling operation is not differentiable in the usual sense. The standard solution is to use the log-derivative trick:

\[\nabla_\theta p_\theta(y\mid u) = p_\theta(y\mid u) \nabla_\theta \log p_\theta(y\mid u).\]

Let

\[R_\theta(u,y) = r_\phi(u,y) - \beta \log \frac{ p_\theta(y\mid u) }{ p_{\mathrm{ref}}(y\mid u) }.\]

Then the objective is

\[J(\theta) = \sum_{y\in\mathcal{Y}} p_\theta(y\mid u) R_\theta(u,y).\]

Differentiating gives

\[\nabla_\theta J(\theta) = \sum_{y\in\mathcal{Y}} p_\theta(y\mid u) \left[ R_\theta(u,y)-\beta \right] \nabla_\theta \log p_\theta(y\mid u).\]

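The extra term $-\beta$ appears because $R_\theta(u,y)$ itself depends on $\theta$ through the KL penalty. Since $\sum_{y} p_\theta(y\mid u) \nabla_\theta \log p_\theta(y\mid u) = \nabla_\theta \sum_{y} p_\theta(y\mid u) = 0$, this constant does not change the expected gradient; it acts like a constant baseline.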

Thus,

\[\nabla_\theta J(\theta) = \mathbb{E}_{y\sim p_\theta(\cdot\mid u)} \left[ \left( r_\phi(u,y) - \beta \log \frac{ p_\theta(y\mid u) }{ p_{\mathrm{ref}}(y\mid u) } - \beta \right) \nabla_\theta \log p_\theta(y\mid u) \right].\]

This gives the Monte Carlo policy-gradient estimator

\[\nabla_\theta \widehat J(\theta) = \frac{1}{N} \sum_{i=1}^{N} \left[ r_\phi(u,y^{(i)}) - \beta \log \frac{ p_\theta(y^{(i)}\mid u) }{ p_{\mathrm{ref}}(y^{(i)}\mid u) } - \beta \right] \nabla_\theta \log p_\theta(y^{(i)}\mid u).\]

This is the basic policy-gradient estimator. It gives a way to update the model even though the sampled responses are discrete. The important point is that we do not differentiate through the text itself. Once a response $y^{(i)}$ has been sampled, it is treated as fixed. What we differentiate is the log-probability that the model assigned to that response:

\[\log p_\theta(y^{(i)}\mid u).\]

So the update has a simple interpretation. If a sampled response receives a large regularized reward, we increase the probability of generating similar responses. If it receives a small reward, or if it moves too far from the reference policy, we decrease its probability.

This gives the core idea behind RLHF policy optimization:

\[\text{sample responses} \quad \longrightarrow \quad \text{score them with the reward model} \quad \longrightarrow \quad \text{update the policy probabilities}.\]
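The estimator can be tested on a toy problem where the exact gradient is available. The sketch below assumes a softmax policy over only three possible responses, parametrized directly by its logits, with made-up rewards and reference probabilities. It compares the exact gradient with the Monte Carlo estimate. This is a didactic setup, not how a language model is trained in practice.

```python
import math
import random


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def grad_log_prob(probs, y):
    """Gradient of log softmax(theta)[y] w.r.t. the logits: one_hot(y) - probs."""
    return [(1.0 if k == y else 0.0) - p for k, p in enumerate(probs)]


def coefficient(probs, p_ref, rewards, beta, y):
    """Regularized reward R_theta(y) minus beta, as in the estimator above."""
    return rewards[y] - beta * math.log(probs[y] / p_ref[y]) - beta


def exact_gradient(theta, p_ref, rewards, beta):
    """Sum over all responses; feasible only because the toy set is tiny."""
    probs = softmax(theta)
    grad = [0.0] * len(theta)
    for y, p in enumerate(probs):
        c = coefficient(probs, p_ref, rewards, beta, y)
        for k, g in enumerate(grad_log_prob(probs, y)):
            grad[k] += p * c * g
    return grad


def monte_carlo_gradient(theta, p_ref, rewards, beta, n, rng):
    probs = softmax(theta)
    grad = [0.0] * len(theta)
    # Sampled responses are treated as fixed; only log p_theta is differentiated.
    for y in rng.choices(range(len(theta)), weights=probs, k=n):
        c = coefficient(probs, p_ref, rewards, beta, y)
        for k, g in enumerate(grad_log_prob(probs, y)):
            grad[k] += c * g / n
    return grad


theta = [0.0, 0.0, 0.0]
p_ref = [0.5, 0.3, 0.2]
rewards = [0.0, 1.0, 2.0]
rng = random.Random(0)
exact = exact_gradient(theta, p_ref, rewards, beta=0.5)
estimate = monte_carlo_gradient(theta, p_ref, rewards, 0.5, n=20000, rng=rng)
print(exact)
print(estimate)
```

The two printed vectors should agree up to sampling noise, and the noise shrinks as the number of samples grows. With few samples the estimate fluctuates considerably, which is the variance problem that more elaborate algorithms try to reduce.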

In practice, this basic estimator, which is essentially the REINFORCE estimator, can be noisy and unstable. Adding a baseline, actor-critic methods, TRPO, and PPO are different ways of making this update more stable. In RLHF, PPO has historically been a common choice because it limits how much the policy can change at each optimization step.

PPO in RLHF

The basic policy-gradient estimator gives a direction in which to update the language model policy. However, in practice, this update can be unstable. A single gradient step may change the policy too much, pushing the model into regions where the reward model is unreliable.

PPO, which stands for Proximal Policy Optimization, is designed to make these updates more conservative.

The main idea is:

\[\text{improve the policy, but keep the new policy close to the old one.}\]

Suppose responses are generated using an old policy

\[\pi_{\theta_{\mathrm{old}}}.\]

After collecting these samples, we update the parameters and obtain a new policy

\[\pi_\theta.\]

For a generated token $a_t$ in state $s_t$, PPO compares the probability of this token under the new policy and under the old policy:

\[\rho_t(\theta) = \frac{ \pi_\theta(a_t\mid s_t) }{ \pi_{\theta_{\mathrm{old}}}(a_t\mid s_t) }.\]

This ratio measures how much the update changes the probability of the sampled token. If $\rho_t(\theta)>1$, the token has become more likely under the new policy. If $\rho_t(\theta)<1$, it has become less likely.

Now introduce the advantage $A_t$. The advantage tells us whether the sampled action was better or worse than expected.

Remark. In practice, the advantage $A_t$ is estimated, not given directly. RLHF implementations usually train a value model $V(s_t)$ to predict how good a state is, and define the advantage as the difference between the observed return and this prediction:

\[A_t \approx G_t - V(s_t).\]

The sign of $A_t$ then determines the direction of the update:

\[A_t>0 \quad \Rightarrow \quad \text{increase the probability of } a_t,\]

and

\[A_t<0 \quad \Rightarrow \quad \text{decrease the probability of } a_t.\]
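As a rough sketch of the estimate $A_t \approx G_t - V(s_t)$, the code below assumes that $G_t$ is the undiscounted sum of rewards from step $t$ to the end of the response, that small per-token KL penalties are paid along the way, and that the reward-model score arrives at the last token. These are simplifying assumptions, and all numbers are made up. Practical implementations often use more elaborate estimators, such as generalized advantage estimation.

```python
def rewards_to_go(per_step_rewards):
    """Undiscounted return G_t = sum of rewards from step t to the end."""
    returns, running = [], 0.0
    for r in reversed(per_step_rewards):
        running += r
        returns.append(running)
    return returns[::-1]


def advantages(per_step_rewards, value_predictions):
    """A_t ~ G_t - V(s_t) for each step of one sampled response."""
    returns = rewards_to_go(per_step_rewards)
    return [g - v for g, v in zip(returns, value_predictions)]


# Hypothetical 4-token response: small per-token KL penalties, plus the
# reward-model score added at the final token.
per_step_rewards = [-0.01, -0.02, -0.01, 1.0]
value_predictions = [0.5, 0.6, 0.9, 1.2]   # made-up outputs of a value model
print(advantages(per_step_rewards, value_predictions))
```

In this example the first three tokens get a positive advantage because the value model underestimated the eventual return, while the last token gets a negative one because the value model expected more than the final reward delivered.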

A basic policy-gradient update would therefore use the term

\[\rho_t(\theta)A_t.\]

The problem is that this term can push the policy too far. PPO avoids this by clipping the ratio. Define

\[\bar{\rho}_t(\theta) = \min \left( \max \left( \rho_t(\theta), 1-\epsilon \right), 1+\epsilon \right),\]

so that

\[\bar{\rho}_t(\theta) \in [1-\epsilon,1+\epsilon].\]

The PPO clipped objective is

\[\mathcal{L}_{\mathrm{PPO}}(\theta) = \mathbb{E} \left[ \min \left( \rho_t(\theta)A_t, \bar{\rho}_t(\theta)A_t \right) \right].\]

Here $\epsilon>0$ is a small hyperparameter, and the objective is maximized with respect to $\theta$. The minimum selects the smaller, and therefore more conservative, of the unclipped term and the clipped term.

If $A_t>0$, PPO allows the model to increase the probability of the token, but not beyond the clipping range. If $A_t<0$, PPO allows the model to decrease the probability of the token, but again only within the clipping range.

Thus, PPO can be summarized as:

\[\text{increase the probability of good tokens, but not too much;}\]

\[\text{decrease the probability of bad tokens, but not too much.}\]

This is why PPO makes policy optimization more stable.
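The clipped term is simple enough to write down directly. The sketch below evaluates it for a few hypothetical ratios and advantages. The default $\epsilon = 0.2$ is an assumed value chosen for illustration, and the function names are our own.

```python
def clipped_surrogate(ratio: float, advantage: float, eps: float = 0.2) -> float:
    """Per-token PPO term min(rho * A, clip(rho, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = min(max(ratio, 1.0 - eps), 1.0 + eps)
    return min(ratio * advantage, clipped_ratio * advantage)


def ppo_objective(ratios, advantages, eps=0.2):
    """Batch average of the clipped term; this quantity is maximized."""
    terms = [clipped_surrogate(r, a, eps) for r, a in zip(ratios, advantages)]
    return sum(terms) / len(terms)


# With A > 0 the gain is capped once the ratio exceeds 1 + eps;
# with A < 0 the gain is capped once the ratio drops below 1 - eps.
for ratio in (0.5, 0.8, 1.0, 1.2, 1.5):
    print(ratio, clipped_surrogate(ratio, 1.0), clipped_surrogate(ratio, -1.0))
```

Once the ratio leaves the interval $[1-\epsilon, 1+\epsilon]$ in the direction favored by the advantage, the term becomes constant in $\theta$, so its gradient vanishes and there is no further incentive to move that token's probability. In the opposite direction the unclipped term is kept, so updates that made things worse can still be corrected.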

In RLHF, PPO is usually combined with a KL penalty to the reference model. These two mechanisms are related, but they play different roles.

PPO clipping controls the size of one local update:

\[\pi_{\theta_{\mathrm{old}}} \quad \longrightarrow \quad \pi_\theta.\]

It keeps the new policy close to the old policy that generated the current batch of samples.

The KL penalty controls the drift from the reference model:

\[p_{\mathrm{ref}} \quad \longrightarrow \quad p_\theta.\]

It keeps the RLHF policy close to the SFT model.

So PPO stabilizes the local optimization step, while the KL penalty keeps the overall policy anchored to the supervised model.

Summary of the RLHF Pipeline

We can now summarize the RLHF pipeline.

The starting point is an SFT model

\[p_{\theta_{\mathrm{SFT}}}(y\mid u).\]

This model already behaves like an assistant because it was trained on prompt-response demonstrations.

Then RLHF adds a preference-learning step.

First, for each prompt, we sample or collect several possible responses. Humans compare them and produce preference data of the form

\[(u_i,y_i^+,y_i^-).\]

Second, we train a reward model

\[r_\phi(u,y)\]

so that preferred responses receive larger rewards than rejected responses:

\[r_\phi(u_i,y_i^+) > r_\phi(u_i,y_i^-).\]

This is done using the Bradley–Terry loss

\[\mathcal{L}_{\mathrm{RM}}(\phi) = - \frac{1}{m} \sum_{i=1}^{m} \log \sigma \left( r_\phi(u_i,y_i^+) - r_\phi(u_i,y_i^-) \right).\]

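Third, we optimize the language model policy using the trained reward model. The objective is

\[\max_\theta \mathbb{E}_{u\sim \mathcal{D}_{U}} \left[ \mathbb{E}_{y\sim p_\theta(\cdot\mid u)} \left[ r_\phi(u,y) \right] - \beta D_{\mathrm{KL}} \left( p_\theta(\cdot\mid u) \| p_{\mathrm{ref}}(\cdot\mid u) \right) \right].\]

Usually,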

\[p_{\mathrm{ref}} = p_{\theta_{\mathrm{SFT}}}.\]

So the policy is pushed toward high-reward responses, but it is penalized if it moves too far from the SFT model.

This gives the central tradeoff of RLHF:

\[\text{maximize human preference reward}\]

while

\[\text{remaining close to the supervised model}.\]

Limitations of RLHF

RLHF is powerful, but it is not perfect.

The first limitation is that the reward model is only an approximation of human preferences. If the reward model makes mistakes, the policy can exploit those mistakes. This is why KL regularization is important.

The second limitation is that human preferences can be inconsistent. Different annotators may prefer different styles, levels of detail, or kinds of answers. The reward model learns an average of these preferences.

The third limitation is that preference data is not the same as truth. A response can be preferred because it sounds confident or well-written, even if it is not fully correct. So RLHF can improve style and helpfulness, but it does not automatically guarantee factual accuracy.

The fourth limitation is that PPO-style optimization is technically complex. It requires sampling, reward modeling, value estimation, KL control, and careful hyperparameter tuning.

So RLHF improves the model, but it also introduces new objects:

\[\text{a reward model,} \qquad \text{a reference policy,} \qquad \text{a policy optimization algorithm.}\]

Conclusion

The main idea of RLHF is to move beyond imitation. SFT trains the model to reproduce desired answers. RLHF trains the model to prefer better answers. So the transition from SFT to RLHF can be summarized as:

\[\text{maximum likelihood on demonstrations} \quad \longrightarrow \quad \text{reward maximization under preference feedback}.\]

This is why RLHF became a central post-training method for assistant-like language models: it gives a way to incorporate human judgments directly into the training process. However, RLHF is technically complex because it requires a reward model and an RL optimization loop. This motivates direct preference optimization methods, such as DPO, which use preference data more directly. This will be the topic of the next post.