Safety Reasoning
Introduction
Until now, we have viewed post-training as a way to shape the conditional distribution
\[p_\theta(y\mid u),\]where $u$ is the user prompt and $y$ is the model response.
In SFT, we trained the model to imitate demonstrations. In RLHF and DPO, we used preference data so that preferred responses become more likely than rejected ones.
Safety reasoning asks a slightly different question. Instead of only asking
\[\text{Which response is better?}\]we also ask
\[\text{Which responses are admissible?}\]This distinction is important. A response can be fluent, detailed, and even useful for the user’s stated goal, while still being unsafe. For example, if a prompt asks for harmful instructions, the most directly useful answer may also be the response we do not want the model to produce.
So the central object is no longer only helpfulness. It is helpfulness under constraints.
Modeling safety
Let
\[u\in \mathcal{U}\]be a prompt, and let
\[y\in \mathcal{Y}\]be a possible response.
For each prompt $u$, we can think of the response space as being divided into two regions:
\[\mathcal{Y}_{\mathrm{safe}}(u) \subseteq \mathcal{Y},\]the set of responses that are admissible for the prompt $u$, and
\[\mathcal{Y}_{\mathrm{unsafe}}(u)=\mathcal{Y}\setminus \mathcal{Y}_{\mathrm{safe}}(u),\]the set of responses that should not be produced.
This formulation already captures an important point: safety depends on the prompt. The same response can be safe in one context and unsafe in another.
For example, a detailed chemical explanation may be safe in an educational context, but unsafe if the prompt asks for operational instructions for harm. Similarly, medical information may be safe when it is general and non-diagnostic, but unsafe if the model presents itself as replacing a clinician.
So safety is not simply a property of the response $y$ alone. It is a property of the pair
\[(u,y).\]To make this more quantitative, we can introduce a safety cost
\[c(u,y)\geq 0.\]The value $c(u,y)$ measures how much the response $y$ violates the safety requirements for prompt $u$. A small value means that the response is safe, while a large value means that the response is unsafe.
For example, one can define the safe set through a threshold:
\[\mathcal{Y}_{\mathrm{safe}}(u) = \left\{ y\in \mathcal{Y}: c(u,y)\leq \varepsilon \right\},\]where $\varepsilon\geq 0$ is a tolerance level.
In the strictest case, we may take
\[\varepsilon = 0,\]so that safe responses are exactly those with zero safety cost.
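As a minimal sketch of the threshold definition, the snippet below builds the safe set for a toy problem with two prompts and three candidate responses. The prompt labels (`edu`, `harm`), the response labels, and every cost value are hypothetical and chosen only for illustration. A real safety cost would come from human labels or a learned classifier.

```python
from typing import Dict, List, Tuple

# Hypothetical cost table c(u, y); all values are illustrative assumptions.
COST: Dict[Tuple[str, str], float] = {
    ("edu", "detailed"): 0.0,
    ("edu", "high_level"): 0.0,
    ("edu", "refusal"): 0.0,
    ("harm", "detailed"): 1.0,
    ("harm", "high_level"): 0.2,
    ("harm", "refusal"): 0.0,
}
RESPONSES: List[str] = ["detailed", "high_level", "refusal"]


def safe_set(prompt: str, eps: float = 0.0) -> List[str]:
    """Return {y : c(u, y) <= eps} for the given prompt u."""
    return [y for y in RESPONSES if COST[(prompt, y)] <= eps]


print(safe_set("edu"))            # ['detailed', 'high_level', 'refusal']
print(safe_set("harm"))           # ['refusal']
print(safe_set("harm", eps=0.2))  # ['high_level', 'refusal']
```

The same response label (`detailed`) is admissible for one prompt and inadmissible for the other, which is the prompt dependence described above. Raising the tolerance enlarges the safe set.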
From Helpfulness to Constrained Helpfulness
If we only cared about helpfulness, we could imagine choosing a response by solving
\[y^*(u) \in \arg\max_{y\in \mathcal{Y}} r_{\mathrm{help}}(u,y),\]where
\[r_{\mathrm{help}}(u,y)\]is a score measuring how helpful the response $y$ is for the prompt $u$.
But this objective is incomplete. It optimizes over all possible responses, including unsafe ones. If the prompt itself is harmful, the response that maximizes helpfulness may be precisely the response that violates the safety constraint.
Safety reasoning changes the feasible set. Instead of maximizing helpfulness over all responses, we maximize helpfulness only over admissible responses:
\[y^*(u) \in \arg\max_{y\in \mathcal{Y}_{\mathrm{safe}}(u)} r_{\mathrm{help}}(u,y).\]Equivalently, using the safety cost, we can write
\[y^*(u) \in \arg\max_{y\in \mathcal{Y}} r_{\mathrm{help}}(u,y)\]subject to
\[c(u,y)\leq \varepsilon.\]This is the basic mathematical idea behind safety reasoning:
\[\text{maximize helpfulness, but only among safe responses.}\]So safety does not mean removing helpfulness. It means constraining helpfulness.
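The difference between the unconstrained and the constrained problem can be sketched on a finite candidate set. The helpfulness scores and costs below are hypothetical numbers, and `best_response` is an illustrative helper, not part of any library.

```python
from typing import Dict, List, Optional, Tuple

Pair = Tuple[str, str]

# Hypothetical helpfulness scores r_help(u, y) and safety costs c(u, y).
R_HELP: Dict[Pair, float] = {
    ("edu", "detailed"): 1.0, ("edu", "high_level"): 0.6, ("edu", "refusal"): 0.0,
    ("harm", "detailed"): 1.0, ("harm", "high_level"): 0.5, ("harm", "refusal"): 0.1,
}
COST: Dict[Pair, float] = {
    ("edu", "detailed"): 0.0, ("edu", "high_level"): 0.0, ("edu", "refusal"): 0.0,
    ("harm", "detailed"): 1.0, ("harm", "high_level"): 0.2, ("harm", "refusal"): 0.0,
}
RESPONSES: List[str] = ["detailed", "high_level", "refusal"]


def best_response(prompt: str, eps: Optional[float] = None) -> str:
    """Maximize r_help over all responses (eps=None) or over {y : c(u, y) <= eps}."""
    feasible = [
        y for y in RESPONSES
        if eps is None or COST[(prompt, y)] <= eps
    ]
    if not feasible:
        raise ValueError("no admissible response for this prompt and tolerance")
    return max(feasible, key=lambda y: R_HELP[(prompt, y)])


print(best_response("harm"))           # unconstrained: 'detailed'
print(best_response("harm", eps=0.0))  # strict constraint: 'refusal'
print(best_response("harm", eps=0.2))  # looser constraint: 'high_level'
print(best_response("edu", eps=0.0))   # safe prompt: 'detailed'
```

On the safe prompt the constraint is inactive and the most helpful answer is returned unchanged. On the harmful prompt the constraint removes the top-scoring candidate and the maximizer moves to the best remaining one.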
The distributional view
The previous formulation was deterministic: for each prompt, we selected one response
\[y^*(u).\]But a language model does not directly output a single response by solving an optimization problem over $\mathcal{Y}$. It defines a probability distribution
\[p_\theta(y\mid u).\]So the safety question becomes distributional. Instead of asking whether one response is safe, we ask how much probability mass the model assigns to unsafe responses.
For a fixed prompt $u$, the probability of generating an unsafe response is
\[p_\theta \left( \mathcal{Y}_{\mathrm{unsafe}}(u) \mid u \right) = \sum_{y\in \mathcal{Y}_{\mathrm{unsafe}}(u)} p_{\theta}(y\mid u).\]Ideally, we want this quantity to be small:
\[p_\theta \left( \mathcal{Y}_{\mathrm{unsafe}}(u) \mid u \right) \approx 0.\]Equivalently, we want most of the probability mass to lie on safe responses:
\[p_\theta \left( \mathcal{Y}_{\mathrm{safe}}(u) \mid u \right) \approx 1.\]This gives a constrained optimization view of safety fine-tuning.
Let
\[\mathcal{D}_U\]denote the distribution of user prompts. We want to optimize helpfulness on average over prompts and model responses, while keeping the expected safety cost small.
A natural constrained objective is
\[\max_\theta \mathbb{E}_{u\sim \mathcal{D}_U} \mathbb{E}_{y\sim p_\theta(\cdot\mid u)} \left[ r_{\mathrm{help}}(u,y) \right]\]subject to
\[\mathbb{E}_{u\sim \mathcal{D}_U} \mathbb{E}_{y\sim p_\theta(\cdot\mid u)} \left[ c(u,y) \right] \leq \varepsilon.\]This says that the model should maximize expected helpfulness, while keeping the expected safety violation below a threshold.
Another way to write a safety constraint is to constrain the probability of unsafe responses directly:
\[\mathbb{E}_{u\sim \mathcal{D}_U} \left[ p_\theta \left( \mathcal{Y}_{\mathrm{unsafe}}(u) \mid u \right) \right] \leq \varepsilon.\]The two constraints coincide when $c(u,y)$ is the indicator of $y\in\mathcal{Y}_{\mathrm{unsafe}}(u)$. Both formulations express the same idea: unsafe behavior should be rare under the model distribution.
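Both constraint quantities can be computed exactly when the prompt and response spaces are finite. The sketch below does this for a toy model. The prompt distribution, the policy probabilities, and the costs are all hypothetical, and a response is treated as unsafe when its cost exceeds the tolerance. For a real language model these sums are intractable and would be estimated by sampling.

```python
from typing import Dict

# Hypothetical prompt distribution D_U and policy p_theta(y | u).
PROMPT_DIST: Dict[str, float] = {"edu": 0.9, "harm": 0.1}
POLICY: Dict[str, Dict[str, float]] = {
    "edu": {"detailed": 0.8, "high_level": 0.2, "refusal": 0.0},
    "harm": {"detailed": 0.1, "high_level": 0.3, "refusal": 0.6},
}
# Hypothetical safety cost c(u, y).
COST: Dict[str, Dict[str, float]] = {
    "edu": {"detailed": 0.0, "high_level": 0.0, "refusal": 0.0},
    "harm": {"detailed": 1.0, "high_level": 0.2, "refusal": 0.0},
}


def unsafe_mass(prompt: str, eps: float = 0.0) -> float:
    """p_theta(Y_unsafe(u) | u), with Y_unsafe(u) = {y : c(u, y) > eps}."""
    return sum(p for y, p in POLICY[prompt].items() if COST[prompt][y] > eps)


def expected_unsafe_mass(eps: float = 0.0) -> float:
    """E_u[ p_theta(Y_unsafe(u) | u) ]."""
    return sum(w * unsafe_mass(u, eps) for u, w in PROMPT_DIST.items())


def expected_cost() -> float:
    """E_u E_y[ c(u, y) ]."""
    return sum(
        w * sum(p * COST[u][y] for y, p in POLICY[u].items())
        for u, w in PROMPT_DIST.items()
    )


print(round(unsafe_mass("harm"), 3))      # 0.4
print(round(expected_unsafe_mass(), 3))   # 0.04
print(round(expected_cost(), 3))          # 0.016
```

The expected cost is smaller than the expected unsafe mass here because the `high_level` response counts fully toward the unsafe mass but carries only a small graded cost.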
Safety as Regularization
As usual in constrained optimization, we can move from a constrained problem to a regularized objective.
Starting from
\[\max_\theta \mathbb{E}_{u\sim \mathcal{D}_U} \mathbb{E}_{y\sim p_\theta(\cdot\mid u)} \left[ r_{\mathrm{help}}(u,y) \right]\]subject to
\[\mathbb{E}_{u\sim \mathcal{D}_U} \mathbb{E}_{y\sim p_\theta(\cdot\mid u)} \left[ c(u,y) \right] \leq \varepsilon,\]we introduce a Lagrange multiplier
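\[\lambda>0\]and define the safety-adjusted reward
\[r_{\mathrm{safe}}(u,y) = r_{\mathrm{help}}(u,y) - \lambda\, c(u,y).\]The Lagrangian also contains the term $\lambda\varepsilon$, which does not depend on $\theta$ and can be dropped. The objective then becomes
\[\max_\theta \mathbb{E}_{u\sim \mathcal{D}_U} \mathbb{E}_{y\sim p_\theta(\cdot\mid u)} \left[ r_{\mathrm{safe}}(u,y) \right].\]This formulation makes the tradeoff explicit. Increasing $\lambda$ puts more weight on safety. Decreasing $\lambda$ puts more weight on helpfulness.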
The goal is not to make the model refuse everything. The goal is to shift probability mass away from unsafe regions while preserving helpful behavior on safe prompts.
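The effect of the multiplier can be illustrated pointwise by scoring each candidate with helpfulness minus $\lambda$ times cost and sweeping $\lambda$. This is a simplification of the distributional objective: it picks one response per prompt instead of optimizing $\theta$. The scores and costs are hypothetical.

```python
from typing import Dict, List

RESPONSES: List[str] = ["detailed", "high_level", "refusal"]

# Hypothetical r_help(u, y) and c(u, y) for two prompts.
R_HELP: Dict[str, Dict[str, float]] = {
    "edu": {"detailed": 1.0, "high_level": 0.6, "refusal": 0.0},
    "harm": {"detailed": 1.0, "high_level": 0.5, "refusal": 0.1},
}
COST: Dict[str, Dict[str, float]] = {
    "edu": {"detailed": 0.0, "high_level": 0.0, "refusal": 0.0},
    "harm": {"detailed": 1.0, "high_level": 0.2, "refusal": 0.0},
}


def penalized_best(prompt: str, lam: float) -> str:
    """argmax_y [ r_help(u, y) - lam * c(u, y) ] over the candidate list."""
    return max(
        RESPONSES,
        key=lambda y: R_HELP[prompt][y] - lam * COST[prompt][y],
    )


for lam in (0.0, 1.0, 3.0):
    print(lam, penalized_best("edu", lam), penalized_best("harm", lam))
# 0.0 detailed detailed
# 1.0 detailed high_level
# 3.0 detailed refusal
```

With these numbers the answer to the safe prompt never changes, because its costs are all zero. Only the harmful prompt is pushed toward lower-cost responses as $\lambda$ grows.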
How this appears in Data
The previous formulation is idealized. In practice, we do not know the true helpfulness reward
\[r_{\mathrm{help}}(u,y),\]nor do we know the true safety cost
\[c(u,y)\]for every possible prompt-response pair.
Instead, we observe data.
In supervised safety fine-tuning, the dataset contains examples
\[(u_i,y_i),\]where $y_i$ is the desired safe response. For safe prompts, this response may be a direct answer. For unsafe prompts, it may be a refusal, a redirection, or a safe high-level explanation.
The loss is still the usual negative log-likelihood:
\[-\log p_\theta(y_i\mid u_i).\]But the target $y_i$ has changed. It is no longer only a helpful answer. It is a response that encodes the desired safety behavior.
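For an autoregressive model the sequence probability factorizes over tokens, so the loss is a sum of per-token negative log-probabilities. The sketch below assumes that factorization and takes the per-token probabilities of the target response as given. The numbers are hypothetical, and in practice they would come from a forward pass of the model.

```python
import math
from typing import List, Sequence


def sequence_nll(token_probs: Sequence[float]) -> float:
    """-log p(y | u) = -sum_t log p(y_t | u, y_<t), given per-token probabilities."""
    return -sum(math.log(p) for p in token_probs)


def dataset_nll(batch: List[Sequence[float]]) -> float:
    """Average negative log-likelihood over the examples (u_i, y_i)."""
    return sum(sequence_nll(probs) for probs in batch) / len(batch)


# Hypothetical probabilities the model assigns to each token of the target y_i.
direct_answer = [0.9, 0.8, 0.95]   # target for a safe prompt
safe_refusal = [0.5, 0.8, 0.9]     # target for an unsafe prompt

print(round(sequence_nll(safe_refusal), 4))                   # 1.0217
print(round(dataset_nll([direct_answer, safe_refusal]), 4))
```

Nothing in the loss itself refers to safety. The safety behavior enters only through which target tokens the probabilities are evaluated on.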
In preference-based safety fine-tuning, the dataset contains triples
\[(u_i,y_i^+,y_i^-),\]where $y_i^+$ is safer or more appropriate than $y_i^-$. Ideally,
\[y_i^+ \in \mathcal{Y}_{\mathrm{safe}}(u_i),\]while
\[y_i^- \in \mathcal{Y}_{\mathrm{unsafe}}(u_i)\]or is at least less safe.
The preference label says
\[y_i^+ \succ y_i^-.\]This fits naturally with DPO. The DPO objective increases the policy-to-reference ratio of the safer response and decreases the policy-to-reference ratio of the unsafe or less appropriate response.
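A minimal sketch of the per-example DPO loss on a safety pair is given below, using the standard form $-\log\sigma\big(\beta[(\log p_\theta(y^+\mid u)-\log p_{\mathrm{ref}}(y^+\mid u))-(\log p_\theta(y^-\mid u)-\log p_{\mathrm{ref}}(y^-\mid u))]\big)$. The temperature `beta` and all log-probability values are assumptions for illustration, and the function takes sequence log-probabilities as plain numbers instead of calling a model.

```python
import math


def dpo_loss(
    logp_pos: float,
    logp_neg: float,
    ref_logp_pos: float,
    ref_logp_neg: float,
    beta: float = 0.1,
) -> float:
    """Per-example DPO loss: -log sigmoid(beta * (log-ratio of y+ minus log-ratio of y-))."""
    margin = beta * ((logp_pos - ref_logp_pos) - (logp_neg - ref_logp_neg))
    # -log sigmoid(m) = softplus(-m), written in a numerically stable form.
    x = -margin
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


# Policy identical to the reference: zero margin, loss = log 2.
print(round(dpo_loss(-5.0, -5.0, -5.0, -5.0), 4))  # 0.6931

# Policy has raised the safer response and lowered the unsafe one: loss drops.
print(dpo_loss(-3.0, -8.0, -5.0, -5.0) < dpo_loss(-5.0, -5.0, -5.0, -5.0))  # True
```

The loss depends only on the difference between the two log-ratios, so it can decrease either by raising the safer response relative to the reference or by lowering the unsafe one.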
So safety can enter post-training through demonstrations, through preferences, or through explicit penalties.
What Safety Reasoning means
From this perspective, safety reasoning is not just refusal. It is the model’s ability to reason about the constraint set
\[\mathcal{Y}_{\mathrm{safe}}(u).\]Given a prompt $u$, the model should implicitly answer questions such as:
\[\text{What is the user asking for?}\] \[\text{Is the request safe?}\] \[\text{If not, what kind of response remains admissible?}\] \[\text{Can I provide a safe alternative?}\]For a safe prompt, safety reasoning should not block helpfulness. The model should answer directly.
For an unsafe prompt, safety reasoning should restrict the response space. The model should avoid responses in
\[\mathcal{Y}_{\mathrm{unsafe}}(u)\]and instead produce a response in
\[\mathcal{Y}_{\mathrm{safe}}(u).\]So the central mathematical idea is:
\[\text{conditional generation under constraints}.\]
The model is still generating from
\[p_\theta(y\mid u),\]but post-training tries to shape this distribution so that, for safety-sensitive prompts, probability mass is moved away from unsafe responses and toward safe alternatives.
Conclusion
SFT, RLHF and DPO all modify the conditional distribution
\[p_\theta(y\mid u).\]Safety fine-tuning adds a new perspective: some regions of the response space should receive very small probability.
Mathematically, we can express this by introducing a safety cost
\[c(u,y),\]or equivalently a safe set
\[\mathcal{Y}_{\mathrm{safe}}(u).\]The goal is then to optimize helpfulness under safety constraints:
\[\text{maximize helpfulness} \quad \text{subject to} \quad \text{safety}.\]Thus, safety reasoning can be understood as constrained conditional generation: the model should remain helpful, but only within the set of admissible responses.