A Models-centric Approach to Corrigible Alignment

J Bostock

This post starts with a reasonable story of how alignment might be solved, based on extracting human values from an AI's model of the world. Then I will elaborate on what I see as the main steps in this process. After that I will review the existing work (that I am aware of) in these areas and discuss points of confusion. I will finish with a list of identified areas of confusion in this scheme.

A Reasonable Story for How Things Go Right

An AI models the world; in doing so it models humans, and in doing that it models human values. It then extracts those human values somehow. Then it acts "according to" those values in a similar way to how humans do, but with superhuman capabilities.

Breaking this down

1: Modelling humans, in particular human decision making. This is distinct from inverse reinforcement learning, as with this approach it only needs to be done from the perspective of a general predictive model: model humans as part of the world. Modelling the world to a reasonable accuracy in general will do the job thanks to the later steps.

2: Modelling humans in such a way that human values can in theory be separated from the rest of the model. This is not trivial. A neural network (for example) which predicts human behaviour will hopelessly entangle the human's beliefs, values, random biases, and akrasia. Possibly corrigibility-as-in-understanding work can help here, but it doesn't seem easy. Something like a Pearlian causal network of factors governing human decision making would be ideal. Existing work is building towards this.

3: Pointing to a human in the model. This is non-trivial. A model of the world which predicts the world will have things in it isomorphic to humans, but we need a robust way of pointing to them. Existing work seems to give a mathematically well-defined way to point to things. This will probably need testing.

4: Actually separating values from non-values. There is a no-free-lunch theorem which states that values and beliefs cannot in general be separated from one another. I strongly suspect that we, as humans who have some idea of our own beliefs, can get around this. We could build priors into the modelling system as to what values vs beliefs generally look like. We can also tell the model to put extra Bayesian weight on human statements about what our values are. We could potentially just (with an understandable-enough model) do something like step 3 to point to what our values are. None of these are concrete solutions, but the problem seems tractable.

5: Using those values in a way that leverages the AI's processing power, "knowledge" and some sort of decision theory to actually make good decisions and take good actions. Like step 1, this seems to be mostly of the capabilities flavour.
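To make the difficulty in step 4 concrete, here is a toy sketch in Python. It is not the theorem itself, only an illustration of its flavour: two hypothetical agents with opposite values and opposite beliefs produce exactly the same behaviour, so behaviour alone cannot tell them apart. The buttons, drinks, and the `choose` function are all assumptions made up for this example.

```python
def choose(beliefs, values):
    """Pick the action whose *believed* outcome the agent values most."""
    return max(beliefs, key=lambda action: values[beliefs[action]])

# Hypothetical agent 1: likes tea, and believes the left button gives tea.
beliefs_1 = {"left": "tea", "right": "coffee"}
values_1 = {"tea": 1.0, "coffee": 0.0}

# Hypothetical agent 2: likes coffee, and believes the left button gives coffee.
beliefs_2 = {"left": "coffee", "right": "tea"}
values_2 = {"tea": 0.0, "coffee": 1.0}

action_1 = choose(beliefs_1, values_1)
action_2 = choose(beliefs_2, values_2)

# Both agents press the same button, despite having opposite values.
print(action_1, action_2)
```

An observer who only sees the button press cannot recover the values. A prior such as "people usually know what their own buttons do" would break the tie, which is the kind of extra assumption step 4 relies on.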

More Depth on the Problems and Identifying Confusion

Stating a seemingly-plausible story of how things go right is probably not enough. There are both obvious and non-obvious problems with it. There are probably also areas of confusion which I have not untangled — or even noticed — yet.

Step 1:

There are some reasons this might be done "for us" by capabilities research:

Modelling humans and human behaviour as part of a general world model seems to be strongly capabilities-like research. A superintelligence capable of causing large-scale harm to humanity can probably model human behaviour. However...

Lots of ways of causing harm to humans do not require a complete model of human behaviour. We still need to implement steps 2-5 before the AI gains significant power. It might not be possible to do this if modelling human behaviour is something the AI only does after we have lost the ability to directly understand and intervene in its behaviour. However...

Without steps 2-5 the AI might not be agenty enough to even act in the world. This does seem like something worth investigating further, though: we don't have a strong concept of when agentic behaviour arises spontaneously, and some people suggest that all AIs have an incentive to become agents regardless of their implementation.

Step 2:

For this we need an AI which can do something like "splitting the world up". The "Gold Standard" here would probably be a system which builds a predictive model, and hands you a human-understandable Pearlian-looking causal network. Currently I see a few possible ways this might happen.

Modern logical AI (like the Apperception Engine or Logical Neural Networks) could progress, and take over from or unify with current neural networks. These methods generally seem to do a good job of splitting the world into meaningful and understandable abstractions/parts. The Apperception Engine needs significant improvement before it becomes anything particularly useful. I don't currently understand logical neural networks very well, so I don't have a good idea of how useful they might be in the future.

We might also be able to achieve step 2 with modifications to existing architectures. There is a significant amount of research currently being undertaken to understand neural networks, like the circuits work. If this bears fruit then we might be able to isolate particular parts of a network, and get something like a causal network from it.

Mathematical work like Scott Garrabrant's Finite Factored Sets might also improve our understanding of how causality is best expressed in a network.
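As a rough picture of what the "Gold Standard" output of step 2 might look like, here is a minimal sketch of a hand-written causal network for a single decision. Every node name, number, and mechanism below is a made-up assumption for illustration; nothing here is learned from data, and learning such a structure is exactly the open problem. The point is only that once values sit in their own node, we can intervene on them (in the style of Pearl's do-operator) while holding beliefs and biases fixed.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

@dataclass
class Node:
    parents: List[str]
    mechanism: Callable[..., float]  # maps parent values to this node's value

def evaluate(network: Dict[str, Node],
             interventions: Optional[Dict[str, float]] = None) -> Dict[str, float]:
    """Evaluate nodes in insertion order (assumed to be a topological order).

    Any node named in `interventions` is clamped to the given value,
    ignoring its parents, which mimics a do()-style intervention.
    """
    interventions = interventions or {}
    state: Dict[str, float] = {}
    for name, node in network.items():
        if name in interventions:
            state[name] = interventions[name]
        else:
            state[name] = node.mechanism(*(state[p] for p in node.parents))
    return state

# Hypothetical model of one decision: whether to cook dinner.
network = {
    "hunger": Node([], lambda: 0.8),
    "values_food": Node([], lambda: 0.5),            # how much eating is valued
    "belief_food_available": Node([], lambda: 1.0),  # belief, not ground truth
    "bias_laziness": Node([], lambda: 0.3),
    "decision_cook": Node(
        ["values_food", "belief_food_available", "hunger", "bias_laziness"],
        lambda v, b, h, lazy: 1.0 if v * b * h - lazy > 0 else 0.0,
    ),
}

baseline = evaluate(network)
no_food_value = evaluate(network, {"values_food": 0.0})
```

In a model with this shape, changing `values_food` flips the decision while beliefs and biases stay untouched. An entangled black-box predictor offers no such handle.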

Step 3:

There are various techniques which infer an agent's preferences based on the agent's actions. All of these share the need to have an agent defined to them in the first place. Therefore we need to be able to draw a boundary around a human (or multiple humans separately) in the AI's model of the world, which allows it to distinguish the "actions" or "decisions" of the human from various other causal interactions.

Pointing to "something" in a model is a problem which has been written about in the Abstraction Sequence by johnswentworth. In particular, Pointing to a Flower seems to give a pretty solid grounding for how to do this. Whether or not this will work in practice requires some testing. In particular, stability with respect to increasing model detail is worth looking at.
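The idea of drawing a boundary around a human can be sketched very crudely as follows. This is my own simplified reading, not the construction from the Abstraction Sequence: given a causal graph and a set of nodes we have decided to call "the human", edges crossing into the set play the role of observations, edges crossing out play the role of actions, and everything else is some other causal interaction. The graph and node names are hypothetical.

```python
def boundary(edges, inside):
    """Split the edges that cross the boundary of `inside` by direction."""
    inside = set(inside)
    incoming = sorted((u, v) for u, v in edges if u not in inside and v in inside)
    outgoing = sorted((u, v) for u, v in edges if u in inside and v not in inside)
    return incoming, outgoing

# Hypothetical world model as a list of directed causal edges.
edges = [
    ("weather", "retina"),
    ("retina", "brain"),
    ("brain", "hand"),
    ("hand", "light_switch"),
    ("weather", "light_switch"),  # does not pass through the human at all
]
human = {"retina", "brain", "hand"}

observations, actions = boundary(edges, human)
print("observations:", observations)
print("actions:", actions)
```

The hard part, which this sketch assumes away, is choosing the set `human` in the first place and keeping that choice stable as the model gains detail.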

Step 4:

Separating the beliefs of an agent from the values of an agent is not, in general, possible. But we don't need to do it in general; we only need to do it for one very, very small subset of all agents which exist or could exist. I see a few possible ways of attacking this problem.

One way would be similar to step 3 in methodology: try to find a mathematical property which holds for the "values" part of human decision-making. I have not seen much successful work in this area.

A more reasonable way is something like inverse reinforcement learning, which I don't understand particularly well but which seems to be a tractable area of study. The FHI have recently featured a paper which makes progress towards being able to learn the preferences of non-rational, non-omniscient agents.

A third (if somewhat impractical) way would be to — once we have a human-understandable model of human decision making — simply point to the "values-ey" parts of the model by hand. This may not be stable to increasing model complexity, and we may simply make errors.
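The second route above, learning preferences from behaviour, can be illustrated with a minimal Bayesian sketch. It assumes a "noisily rational" (softmax) choice model, which is one common modelling assumption in this area and not necessarily what any particular paper uses. The hypotheses, foods, and `rationality` parameter are invented for the example.

```python
import math

def choice_likelihood(utilities, chosen, options, rationality=2.0):
    """P(chosen | utilities) under a softmax (noisily rational) choice model."""
    weights = {o: math.exp(rationality * utilities[o]) for o in options}
    return weights[chosen] / sum(weights.values())

def posterior(hypotheses, prior, observations):
    """Bayesian update over candidate value functions given observed choices."""
    unnormalised = {}
    for name, utilities in hypotheses.items():
        p = prior[name]
        for options, chosen in observations:
            p *= choice_likelihood(utilities, chosen, options)
        unnormalised[name] = p
    total = sum(unnormalised.values())
    return {name: p / total for name, p in unnormalised.items()}

# Two hypothetical candidate value functions for the same person.
hypotheses = {
    "values_health": {"salad": 1.0, "cake": 0.0},
    "values_sweetness": {"salad": 0.0, "cake": 1.0},
}
prior = {"values_health": 0.5, "values_sweetness": 0.5}

# Observed choices: mostly salad, with one lapse (akrasia) into cake.
menu = ("salad", "cake")
observations = [(menu, "salad"), (menu, "salad"), (menu, "cake")]

post = posterior(hypotheses, prior, observations)
print(post)
```

Because the choice model tolerates noise, the single lapse lowers confidence in `values_health` without ruling it out. Putting extra weight on what the person says they value would amount to changing `prior` here.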

Of course this step also assumes that human values exist. avturchin has suggested that what we define as "values" don't actually exist. I am skeptical of this: the "It All Adds Up to Normality" principle makes me unwilling to accept that, despite all the philosophical discussion on human values, they won't end up being an important part of a model of human behaviour. That article does discuss issues with the abstraction level of human values: they may only exist within certain methods of modelling a human, and not in other models. Our pointers to them might not be robust to changes in the AI's model.

Step 5:

This is another of the steps which is strongly entangled with capabilities research. Decision theory, embedded agency, and numerous other issues come into play here. Hopefully — though it does not seem certain — we will have these solved by the time we create an agentic AI. One way I can imagine this not being the case is an agent-bootstrap situation as stated in step 1.

Doing badly on certain capabilities research is also a way to end up with an AI which, while "aligned" in the sense of having our values, fails to implement those values in a sensible way and causes a lot of harm in the process.

A method of solving steps 4 and 5 might be to allow our model of human behaviour to observe itself and self-modify, removing parts of itself like akrasia and biases, and improving its own models of the world and its decision theories. We could then wait for this to converge on some final behavioural algorithm. This would certainly need a lot of investigation before it works.

List of Problems and Confusion Areas

So we have identified a good few problems with no currently known solutions. There are also some problems here which I did not mention above:

We do not know how to generally infer causal networks from sensory data (steps 1 and 2)
We do not know whether our current understanding of abstractions will allow us to "point to" a human in a causal network (steps 2 and 3)
We do not have a good idea of how to infer human values from a model of a human (step 4)
We also do not have a good idea of how to "point to" values in the structure of an agent (step 4)
Bad things might happen if a non-valueish part of human decision making gets mistakenly labelled as a value (step 4)
We do not know how to make an AI which implements any values at all — let alone human ones (step 5)
Human values might not exist; even if they do exist at some level of abstraction, the pointer to them may not be robust at all scales (overall problem)
We do not know whether this scheme of alignment requires such an accurate model of the world that it could only be implemented in an AI that is already too powerful to directly modify (overall problem)
Modelling humans in too much detail might constitute "mind crime" by creating and destroying many conscious beings (overall problem)
It is not clear which, or how many, humans to model and how to extract overall values from them (overall problem)
We don't know how and when agentic behaviour might arise unintentionally, and we'll need to protect against that (overall problem)
Other unknown areas of confusion...