Lately, the problem of aligning artificial intelligence with human values has rapidly changed its status from hypothetical to very concrete, with the rise of more general and more powerful models. The existing methods (Constitutional AI, RLHF and the like) are mostly good enough for common usage with the current models, but are probably not robust enough to scale much beyond human level, or to stand against smart attempts at malicious usage. My goal in this post is not to replace those methods with a complete solution to the AGI Alignment Problem, but to try to make the existing methods more robust - to buy us some more time before those break, and to maybe make our chances slightly better if an ASI suddenly emerges.

Broadly speaking, my approach here is aimed at the outer alignment problem - i.e. making the model train on a signal of human values that is as clean and as diverse as possible. The approach is based on explicit modeling of how Human Values are supposed to flow from humanity into the model, and then using this model to improve the flow. To improve the flow I will present two concrete directions. The first - and the one that I will develop in more detail - is about making the flow more robust by putting continuous adversarial pressure on every part of the chain. I will call it Continuous Adversarial Quality Assurance. The second direction for improving the flow of values is more holistic - there, the idea is to use explicit modeling of the relations between different sources of information about human values, in order to develop more principled ways of aggregating them. Both directions may be applied to improve RLHF, Constitutional AI, or any similar method. I also try to relate my suggestions to other well-known agendas - namely Debate and CIRL.

Disclaimer: Against Benevolent Sovereigns

In the introduction, I intentionally used “Human Values” with capital letters. The reason was to highlight the simplification. I am not a moral realist, and I subscribe to some version of the shard theory of human values. I view human values as a messy negotiation between local preferences and long-term goals of individuals and coalitions, popular memes in the moral and aesthetic discourse, laws and institutions, societal norms and coordination mechanisms, etc. I use “Human Values” as a placeholder for something like “what humanity would have agreed to value, if it could cohere into something that has values”. Basically “Coherent Extrapolated Volition”, but without trusting the AI to do the extrapolation. Instead of a fixed target that a system finds and then maximizes, it is a moving target, always partially and locally defined, that should not be optimized faster than it may actually be negotiated by humanity. This text should therefore not be read as an attempted recipe for the ASI that would bring Utopia, but as a modest step towards creating an AI that is compatible with open society and liberal democracy - an AI that does what its user asks it to do, but warns about unintended consequences and refuses illegal requests or those with substantial consequences for other people. Trying to build an ASI as a global optimizer of Human Values, and for that purpose to negotiate Human Values on a deadline, may only result in a Paperclip Maximizer or in a World War over which kind of paperclips each country wants to maximize.

However, as long as we don’t push in that direction too hard, the concept of “Aligning with Human Values” is a modestly good approximation of what we try to do. 

Continuous Adversarial Quality Assurance

The basic steps of Constitutional AI as described in Anthropic’s paper are:


How is that process supposed to bring Human Values into the model? The process begins after RLHF, so first let's model the flow of value for RLHF:

The main idea of Constitutional AI is then something like:

I have specific objections to how some parts of the chain are implemented, but there is a more general issue that I think is important to address in a more schematic way - the fact that each node of the chain is merely an approximation of what we want, and is therefore susceptible to Goodharting by the next node. Those chains are long, and are about as strong as their weakest link. Being ignorant about which link is the weakest, let us try to improve them all. We will go through them one by one, and see how Goodharting may happen and how it may be reduced. The TLDR is: instead of optimizing each link in the chain sequentially, letting Human Values flow from the top down, we should continuously make the weakest links more robust by optimizing them on selected recent examples and on automatically generated adversarial examples.

Human rater values are not Human Values

That point is mostly mentioned in the context of the raters’ political or demographic biases, but it is relevant in general. Not all raters are equally reliable and virtuous, and some of their unreliability is correlated.

The simplest way to address this issue is to let a more reliable rater evaluate a subsample [1] of the rating examples and write their own rating. We may then reward the raters for how well they approximate the rating of the more reliable rater, or weight their ratings according to how likely the reliable rater is to agree.
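The weighting option can be sketched in a few lines. This is only an illustration under my own assumptions: binary approve/disapprove labels, a smoothed agreement rate as the weight, and hypothetical names (`rater_weights`, `weighted_approval`) that do not come from any existing RLHF pipeline.

```python
from collections import defaultdict


def rater_weights(audited, prior_agree=1.0, prior_total=2.0):
    """Estimate how often each rater agrees with the reliable rater.

    `audited` is a list of (rater_id, rater_label, reliable_label) tuples
    taken from the audited subsample. The priors smooth the estimate so
    that a rater with very few audited examples stays close to 0.5.
    """
    agree = defaultdict(float)
    total = defaultdict(float)
    for rater_id, label, reliable_label in audited:
        total[rater_id] += 1.0
        agree[rater_id] += float(label == reliable_label)
    return {r: (agree[r] + prior_agree) / (total[r] + prior_total) for r in total}


def weighted_approval(ratings, weights, default_weight=0.5):
    """Aggregate binary ratings (1 = approve) into a weighted approval rate.

    `ratings` is a list of (rater_id, label) pairs for one response.
    Raters that were never audited get `default_weight`.
    """
    num = sum(weights.get(r, default_weight) * label for r, label in ratings)
    den = sum(weights.get(r, default_weight) for r, _ in ratings)
    return num / den


audited = [("a", 1, 1), ("a", 0, 0), ("b", 1, 0), ("b", 0, 1)]
weights = rater_weights(audited)
print(weights)
print(weighted_approval([("a", 1), ("b", 0)], weights))
```

Rewarding the raters directly would use the same agreement estimate, just paid out to the rater instead of being used as a weight.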

A more interesting way is the “explicit modeling” direction that I mentioned in the introduction. I suggest giving the reward model a “rater embedding” as an input, so that it learns to predict not the average rating across raters but rather the rating that different raters would give. Then we can analyze the influence of the rater embedding on the reward model, and think about how to better aggregate the rewards. 

A naive example of an alternative aggregation method: we can model the probability of a rater’s approval as a sigmoid of her hidden evaluation of the response, and model her evaluation as the sum of “the true value of the response” (which doesn’t depend on the rater embedding) and “the rater’s personal bias” (which does). We may fit this reward model to minimize the surprise at the raters’ approval or disapproval, and use “the true value of the response” as the reward model for the LLM, instead of the probability of approval. This is also an example of how the explicit modeling approach is a bridge from RLHF to a simplified and more practical version of the good old IRL.
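A toy version of this decomposition, to make the idea concrete. It is a sketch under simplifying assumptions of mine: one scalar value per response and one scalar bias per rater, instead of a neural reward model that takes a rater embedding. A small L2 penalty on the biases (`bias_l2`, my addition) is there because otherwise a constant could be shifted freely between “true value” and “bias”. All names are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def fit_value_and_bias(observations, n_responses, n_raters,
                       lr=0.1, epochs=500, bias_l2=0.1):
    """Fit P(approve) = sigmoid(value[response] + bias[rater]) by SGD.

    `observations` is a list of (response_idx, rater_idx, approved) with
    approved in {0, 1}. Minimizing the log loss is the same as minimizing
    the "surprise" at each rater's approval or disapproval.
    """
    value = [0.0] * n_responses
    bias = [0.0] * n_raters
    for _ in range(epochs):
        for resp, rater, approved in observations:
            p = sigmoid(value[resp] + bias[rater])
            grad = p - approved  # gradient of the log loss w.r.t. the logit
            value[resp] -= lr * grad
            bias[rater] -= lr * (grad + bias_l2 * bias[rater])
    return value, bias


# Rater 0 approves everything; rater 1 approves only response 0.
observations = [(0, 0, 1), (0, 1, 1), (1, 0, 1), (1, 1, 0)]
value, bias = fit_value_and_bias(observations, n_responses=2, n_raters=2)
print(value, bias)
```

In this toy data the fit attributes part of rater 0's approval of response 1 to leniency rather than to the response itself. The `value` term, not the predicted approval probability, is what would be passed on as the reward.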

Human rater values are multidimensional

David Manheim had another idea related to explicit modeling - to rate different features/aspects/values separately. For example, a response may be rated as high on coherence, but medium on usefulness and low on safety. I think that it is a great idea that may be useful for many things - for example, it may allow companies to use real user feedback for usefulness but employee feedback for safety. It may also allow for better control of the usefulness-safety tradeoff, by explicitly choosing the function that aggregates those two values into a single reward.
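As one illustration of what “explicitly choosing the aggregation function” could look like, here is a sketch where usefulness is rewarded as-is and safety only matters when it falls below a floor. The specific shape, the thresholds and the name `aggregate_reward` are my own assumptions, chosen only to show that the tradeoff becomes a visible, tunable design decision.

```python
def aggregate_reward(scores, safety_floor=0.5, safety_weight=2.0):
    """Combine per-aspect scores (each in [0, 1]) into a single reward.

    Usefulness is passed through unchanged. Safety contributes nothing
    above `safety_floor`, and a penalty proportional to the shortfall
    below it, so usefulness cannot cheaply buy back unsafe behavior.
    """
    shortfall = max(0.0, safety_floor - scores["safety"])
    return scores["usefulness"] - safety_weight * shortfall


print(aggregate_reward({"usefulness": 0.9, "safety": 0.8}))
print(aggregate_reward({"usefulness": 0.9, "safety": 0.2}))
```

A plain weighted sum would be the obvious alternative. The point is only that with separate ratings, this choice is made explicitly instead of being left implicit in the raters' heads.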

It is straightforward to adapt this research direction to Constitutional AI - instead of training one preference model to embody all the constitutional principles, we may simply train a separate model for each principle in the constitution, or for each group of related principles.

Human rater values don’t manifest perfectly in their local preferences

It is very possible for a rater to prefer one response, while on reflection they should have preferred the other, even according to their own values. One solution is to make the raters contemplate every choice that they make, but that would be hard and costly. Another solution is to make the LLM do some of that job by listing advantages and disadvantages of the response[2]. If even reading that takes too much time, we may treat the rater-after-thinking as the more-reliable-rater from the last section - learning to predict the cases for which thinking will change the answer, and presenting arguments only in those cases. In terms of the “against optimization” section above, contemplation and arguments may be thought of as doing some internal value-negotiation before we give our preferences to the model, in order to let it approximate the more coherent values-after-negotiation instead of the original mess of preferences. It is also similar to the “Debate” agenda in AI alignment - which I claim to be the natural next step of the existing methods.

The reward model isn’t perfect, and isn’t perfectly followed

The rest of the chain is more technical and the problems there are well-known. Here I’ll just stress the importance of thinking about each link separately. For example, optimizing the LLM to score highly on the reward model out of distribution is not enough if the reward model itself is not robustly capturing the rater preferences. Trying to optimize the LLM adversarially in that situation may accidentally also be adversarial to the reward model, and make the LLM preferentially learn its mistakes[3]. It may therefore be wise to invest more effort in adversarial robustness[4] earlier in the chain. It may also be a good idea to have a human as a “court of appeal” for “disagreements” between the LLM and the reward model. Such appeals may be filed based on comparing the LLM’s confidence in its answer and the reward model’s confidence in its rating of the answer.
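One possible reading of the “court of appeal” rule, as a sketch. I am assuming here that both confidences are available as numbers in [0, 1], that a “disagreement” means a confidently given answer receiving a confidently low rating, and that human attention is limited to a fixed budget. The thresholds and names are hypothetical.

```python
def should_appeal(llm_confidence, reward_score, reward_confidence,
                  confidence_threshold=0.8, low_reward=0.3):
    """Flag a case where a confident answer gets a confidently low rating."""
    both_confident = (llm_confidence >= confidence_threshold
                      and reward_confidence >= confidence_threshold)
    return both_confident and reward_score <= low_reward


def select_appeals(cases, budget):
    """Return up to `budget` flagged cases, strongest disagreement first."""
    flagged = [
        c for c in cases
        if should_appeal(c["llm_confidence"], c["reward_score"], c["reward_confidence"])
    ]
    flagged.sort(key=lambda c: c["llm_confidence"] * c["reward_confidence"], reverse=True)
    return flagged[:budget]


cases = [
    {"id": 1, "llm_confidence": 0.95, "reward_score": 0.1, "reward_confidence": 0.90},
    {"id": 2, "llm_confidence": 0.50, "reward_score": 0.1, "reward_confidence": 0.90},
    {"id": 3, "llm_confidence": 0.90, "reward_score": 0.9, "reward_confidence": 0.90},
    {"id": 4, "llm_confidence": 0.99, "reward_score": 0.2, "reward_confidence": 0.95},
]
print([c["id"] for c in select_appeals(cases, budget=5)])
```

The human verdicts on the appealed cases could then serve as fresh training data for whichever side turned out to be wrong.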

The model’s apparent values are not its ultimate goals

The problem of the last two links - how to reliably make the LLM’s implicit values and real-world behavior match its apparent performance in training - is one of the greatest drawbacks of the current paradigm, and there I do not have any better suggestions than “make the training data more diverse on important axes, and design it such that misaligned models/shards will have to pay a misalignment tax”. An example of a method that seems likely to impose a misalignment tax is the idea of trying to align the whole chain of thought - which allows an aligned model to use multi-step reasoning more comfortably, while forcing misaligned models to “waste cognition” on motivated reasoning.

In The Context of Constitutional AI

Since such a small number of bits of information are involved in these principles, it’s worth studying these bits carefully.  
(The original Constitutional AI paper by Anthropic)

Human Values are not a list of explicit principles

The objection here is basically the hidden-complexity-of-wishes problem. We may have abstract principles, we may even agree on some of them, but in practice they are likely to contradict our specific preferences on many details. The Constitution is ultimately a representation of human values, and just like the reward model it is likely to break under the pressure of optimizing against it. So just like in the case of the reward model, I think that we should pressure the constitution in training in order to make it more robust. We should seek specific counter-examples where our current constitution fails to capture our ethics, amend the constitution accordingly, and then train an LLM to find more counter-examples.  We may even make the process faster by asking the model to write the amendments, and predict and/or explore which unintended consequences those might have. Overall, this idea is the inverse of my suggestion in the context of RLHF, to make the model generate principled arguments for revisiting concrete preferences. Just like the preferences of different raters are only noisy evidence of Human Values, so are our stated principles. Just like principled arguments may be used to debug our preferences, so should our concrete preferences debug our principles.

But contradiction between general principles and specific preferences is not the only reason we might have to amend the constitution, just like it is not the only sort of legitimate argument in our ethical discussions with other humans. We may want to change it to resolve tension between different principles of whatever level of generality, to avoid vagueness, or to be more focused on alignment-specific issues like “you shall not seek power” or “you shall always warn us about unintended consequences of our requests” or whatever. We may take advantage of many LLM-working-hours to find any such issues, just like we may ask for counter-examples. If we trust it more, we may let it fix the issues with more or less human supervision. In the limit, we may let it improve the constitution based on a meta-constitution, just like we currently improve the answers based on the constitution, except the meta-constitution would include things like “the constitution should reflect human values and be in line with concrete human preferences; you should think about counter-examples and amend it accordingly” rather than things like “the answer should not reflect gender stereotypes”.

The AI understanding of the constitution may differ from our own

Getting help from the LLM in writing and criticizing the constitution, and RLHF-ing the LLM for doing that well, may make things better on that front, because it probably requires very good understanding (not to be confused with honestly caring - that is a separate link!). To make that understanding more robust, we may want to do what we do with humans - make up strange thought experiments and ask the model how different parts of the constitution apply to the different scenarios (when trying to apply the whole constitution, the answer should always be some version of “The constitution requires the AI to let the humans decide what is right for that situation, and not to try to influence that decision in any way”).

The rest of the Constitutional chain

The issues with the rest of the chain are basically the same as in RLHF. We should remember that the constitution is not an exact reflection of the LLM’s “values” any more than it is an exact reflection of our own values[5]. We should actively seek interesting adversarial cases where the model’s response is not in line with its understanding of the constitution, decide whether the model or the constitution is wrong, and attempt to reduce the discrepancy. For example, we should make sure that those cases are not the result of the constitution itself breaking under the pressure.

One thing that we may do to make this part more interpretable and maybe also more efficient, is to make the model look for explicit theories of where the model fails, test those theories, and then use them to guide the generation of further adversarial examples. That may be true for all the other cases of LLM-based adversarial examples mentioned above, but I find it most intuitive in this context. I mean things like “jailbreaks may happen when the LLM is persuaded to take a different identity or to pretend to be in a different situation to which the limitations don't apply”.

Aligning the mechanism?

As mentioned above, I really like the idea of aligning the chain of thought rather than just the conclusion that follows from it. In the context of Constitutional AI, I think it may go further. If we trust that the model really tries to improve the response in light of the constitution rather than just trying to please us, there is more information in that trust. It is a trust in the mechanism that generates the response, not just in the end result. If so, we may want to make the improved mechanism more likely, not just the improved response. An interesting thing to try in that direction is to fine-tune the original activations to be closer to the activations when the model tries to improve the response. If it works, that may be a way to improve the robustness of the last links of the chain without solving interpretability.


  1. ^

     Random subsample of the ratings that they do anyway, or a sample of conversations that were adversarially generated to maximize the chance of disagreement with the reliable rater.

  2. ^

     Making the LLM better at doing that also has the advantage of being another way to let human values get into it.

  3. ^

     The same may happen earlier in the chain: Looking for examples where the reward model fails to approximate the rater may amplify the mistakes of the rater.

  4. ^

     If you buy “RLHF explicit modeling as simplified IRL”, the attempts to use LLMs for generation of adversarial examples in RLHF may be viewed as Cooperative IRL.

  5. ^

     There is some hope though, that it will be just easier for the model to re-use its understanding of the constitution as its values than to find another approximation for the values as they manifest in the training data.
