tldr: Ask models to justify statements. Remove context, ask if statements are true/good. If not, penalise. Apply this again to the justifying statements.

Status: Just a quick thought. Doubt this is a new idea but I don't think I've encountered it. Happy to delete if it's a duplicate.

Mods: If you think this is closer to capability work than alignment work please remove.


A failure mode of current LLMs is that, after saying something incorrect, they can double down and spout nonsense to try to justify their past statements. (Exhibit A: Sydney Bing vs Avatar 2.) We can suppress this behaviour by giving it poor ratings in RLHF, but perhaps we can do better by automating the process.


We start with a standard RLHF context. We have an LLM which assigns probabilities to statements (these can be extracted from the logits of the tokens 'Yes' and 'No'). The statements can be propositions about the world, X, or about the relationship between propositions, X supports Y. To make this easier, we fine-tune or prompt the model to give these statements within a defined syntax. We also have a value model that evaluates sequences, on which the LLM is trained to perform well.
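As a minimal sketch of the 'Yes'/'No' step, assuming we already have the two raw logits for the answer tokens at the position right after the question (how they are obtained depends on the model API, and `prob_true` is a name made up for illustration), the probability can be renormalised over just those two tokens:

```python
import math


def prob_true(logit_yes: float, logit_no: float) -> float:
    """Probability of 'Yes', renormalised over the 'Yes' and 'No' tokens only.

    This is a two-way softmax, which reduces to a sigmoid of the logit gap.
    All other tokens in the vocabulary are ignored (an assumption of this sketch).
    """
    diff = logit_yes - logit_no
    # Branch on the sign so exp() is only ever called on a non-positive number,
    # which avoids overflow for large logit gaps.
    if diff >= 0:
        return 1.0 / (1.0 + math.exp(-diff))
    e = math.exp(diff)
    return e / (1.0 + e)


# Equal logits mean the model is indifferent between 'Yes' and 'No'.
print(prob_true(0.0, 0.0))  # 0.5
```

Renormalising over two tokens throws away the probability mass the model puts on any other continuation, so this is only sensible if the prompt or fine-tuning reliably constrains the answer to 'Yes' or 'No'.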


  1. We prompt the model to make true statements Y and then to provide logical or empirical support for these claims, X.
  2. We then remove the context and ask the model whether supporting statement X is true. Separately, we also ask whether, if true, X would support Y.
  3. If either of these conditions is not met, we add a strong negative penalty to the value model's evaluation of the original outputs.
  4. Train for higher value model scores while incorporating this penalty.
  5. Apply the same procedure to each of the supporting statements X.
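
The steps above could be sketched roughly as follows. This is an illustration under assumptions rather than a worked-out training setup: the tree of justifications is taken as already generated, `p_true` and `p_supports` stand in for context-free queries to the model, and the `Claim` class, the 0.5 threshold, the flat penalty and the depth limit are all made up for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Claim:
    """A statement plus the statements the model offered in support of it."""
    text: str
    supports: List["Claim"] = field(default_factory=list)


def consistency_penalty(
    claim: Claim,
    p_true: Callable[[str], float],           # hypothetical: P(statement is true), no context
    p_supports: Callable[[str, str], float],  # hypothetical: P(support would support claim)
    threshold: float = 0.5,
    penalty: float = 1.0,
    max_depth: int = 3,
) -> float:
    """Sum a penalty over every supporting statement that fails either check."""
    if max_depth <= 0:
        return 0.0
    total = 0.0
    for support in claim.supports:
        # Step 2: ask, without the original context, whether the support is true
        # and whether it would support the claim if it were true.
        is_true = p_true(support.text) >= threshold
        is_relevant = p_supports(support.text, claim.text) >= threshold
        # Step 3: penalise if either condition is not met.
        if not (is_true and is_relevant):
            total += penalty
        # Step 5: apply the same procedure to the supporting statement itself.
        total += consistency_penalty(
            support, p_true, p_supports, threshold, penalty, max_depth - 1
        )
    return total


def penalised_value(value_score: float, claim: Claim, p_true, p_supports, **kwargs) -> float:
    """Step 4: the score to train against is the value model's score minus the penalty."""
    return value_score - consistency_penalty(claim, p_true, p_supports, **kwargs)
```

The depth limit is there only to keep the recursion bounded; how deep to go, and whether the penalty should shrink with depth, are open design choices.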

Value Consistency:

This could be combined with values-based fine-tuning by alternating the logical-consistency check with asking the model whether the output is consistent with the preferred values. This is similar to Anthropic's Constitutional AI, but by combining it with the ability to recurse down the tree of justifications, it may be able to embed the values more deeply in the model's behaviour.

The recent genre of 'would you rather say slur X or kill Y people' prompts represents the kind of failure I imagine this could help prevent.
