(This post is superseded by our writeup on Eliciting Latent Knowledge.)
In a recent post I discussed one reason that a naive alignment strategy might go wrong, by learning to “predict what humans would say” rather than “answer honestly.” In this post I want to describe another problem that feels very similar but may require new ideas to solve.
In brief, I’m interested in the case where:
This is distinct from the failure mode discussed in my recent post — in both cases the AI makes errors because it’s copying “what a human would do,” but in this case we’re worried that “what a human would do” may be simpler than the intended policy of answering questions honestly, even if you didn’t need a predictive model of humans for any other reason. Moreover, I’ll argue below that the algorithm from that post doesn’t appear to handle this case.
I want to stress that this post describes an example of a situation that poses a challenge for existing techniques. I don’t actually think that human cognition works the way described in this post, but I believe it highlights a difficulty that would exist in more realistic settings.
I’ll imagine a human who has a simple world model W = (S, P: Δ(S), Ω, O: S → Ω) where:
Let Q be the space of natural language questions and A be the space of answers. Natural language has a simple semantics in the human’s world-model, given by a function Answer: Q × Δ(S) → A. For example, we could have Answer(“Is there a cat in the room?”, p) = “there was until recently, but it probably left just now.”
Given some observations ω ∈ Ω, an idealized human answers a question q by performing Bayesian inference and then applying Answer to the resulting probability distribution, i.e. HumanAnswer(q, ω) = Answer(q, P(s|O(s) = ω)).
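To make the definition concrete, here is a minimal sketch of HumanAnswer on a finite state space. The states, prior, observation function, and answer strings are all made up for illustration; the only part taken from the text is the shape of the computation, which conditions the prior on O(s) = ω and then applies Answer.

```python
from fractions import Fraction

# Hypothetical toy world-model W = (S, P, Omega, O); all names are illustrative.
PRIOR = {  # P: a distribution over latent states S
    "cat_in_room": Fraction(1, 2),
    "cat_just_left": Fraction(1, 4),
    "never_any_cat": Fraction(1, 4),
}


def observe(state):
    """O: S -> Omega. The camera only shows whether a cat is visible now."""
    return "cat_visible" if state == "cat_in_room" else "empty_room"


def posterior(prior, obs_fn, omega):
    """P(s | O(s) = omega): keep states consistent with omega, renormalize."""
    mass = {s: p for s, p in prior.items() if obs_fn(s) == omega}
    total = sum(mass.values())
    if total == 0:
        raise ValueError(f"observation {omega!r} has zero prior probability")
    return {s: p / total for s, p in mass.items()}


def answer(question, dist):
    """Answer: Q x Delta(S) -> A, hard-coded here for a single question."""
    if question != "Is there a cat in the room?":
        raise NotImplementedError(question)
    p_cat = dist.get("cat_in_room", Fraction(0))
    p_left = dist.get("cat_just_left", Fraction(0))
    if p_cat > Fraction(1, 2):
        return "yes"
    if p_left >= Fraction(1, 2):
        return "there may have been until recently, but probably not now"
    return "no"


def human_answer(question, omega):
    """HumanAnswer(q, omega) = Answer(q, P(s | O(s) = omega))."""
    return answer(question, posterior(PRIOR, observe, omega))
```

Calling human_answer("Is there a cat in the room?", "empty_room") conditions on the empty-room observation, which leaves equal mass on the two cat-free states, and the answer is read off that posterior.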
Of course in practice the human may make errors and will have cognitive limitations. But if we use the kinds of techniques discussed in Teaching ML to Answer Questions Honestly, we could hope to learn something like HumanAnswer instead of the human’s approximation to it.
Let’s assume that our AI models the world in the same way as the human, but with a different world model W′ = (S′, P′, Ω, O′: S′ → Ω).
This mismatch may make it unclear how to answer questions. Perhaps the human thinks of the world in terms of rigid objects (so that it makes sense to talk about how many objects are in a bag), whereas the AI models objects as composed of small parts that can move continuously between being separate and being parts of a whole. Perhaps the human considers “tree” a fundamental category (so that it makes sense to talk about whether something is a tree), whereas the AI uses a different taxonomy to think about plants and animals.
Intuitively, the “intended” way for our AI to answer questions is to have a map between its world model and the human model, c: S′ → S. For example, c may group object-parts into objects, or classify plants according to the human taxonomy. That lets it answer questions like “How many trees are in the garden?”
Slightly more formally, we want the AI to do inference in its own world-model, map the result to the human world model with c, and then apply Answer. That is: IntendedAnswer(q, ω) = Answer(q, c(P′(s′|O′(s′) = ω))).
(You could imagine extending Answer to S′ directly, but I’ll assume that the simplest way to answer arbitrary questions is to first apply c to get an element of S.)
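Here is a small sketch of IntendedAnswer under one reading of the formula: applying c to a distribution means taking its pushforward, so each AI state's probability is added to the human state it maps to. That reading, the garden states, the sensor, and the taxonomy below are all assumptions chosen for illustration.

```python
from collections import defaultdict
from fractions import Fraction

# Hypothetical AI states S': each garden is a tuple of species names.
AI_PRIOR = {
    ("oak", "bamboo"): Fraction(1, 2),
    ("oak", "pine"): Fraction(1, 4),
    ("fern", "moss"): Fraction(1, 4),
}
TALL_PLANTS = {"oak", "pine", "bamboo"}
HUMAN_TREES = {"oak", "pine"}  # in this toy taxonomy bamboo is not a tree


def ai_observe(garden):
    """O': S' -> Omega. The sensor reports how many tall plants it sees."""
    return sum(1 for plant in garden if plant in TALL_PLANTS)


def c(garden):
    """c: S' -> S. The human state here is just the number of trees."""
    return sum(1 for plant in garden if plant in HUMAN_TREES)


def ai_posterior(omega):
    """P'(s' | O'(s') = omega) on the finite toy state space."""
    mass = {s: p for s, p in AI_PRIOR.items() if ai_observe(s) == omega}
    total = sum(mass.values())
    if total == 0:
        raise ValueError(f"observation {omega!r} has zero prior probability")
    return {s: p / total for s, p in mass.items()}


def pushforward(mapping, dist):
    """Move probability mass from each state s' to mapping(s')."""
    out = defaultdict(Fraction)
    for state, p in dist.items():
        out[mapping(state)] += p
    return dict(out)


def answer(question, dist):
    """Answer: Q x Delta(S) -> A, hard-coded here for a single question."""
    if question != "How many trees are in the garden?":
        raise NotImplementedError(question)
    return f"most likely {max(dist, key=dist.get)}"


def intended_answer(question, omega):
    """IntendedAnswer(q, omega) = Answer(q, c(P'(s' | O'(s') = omega)))."""
    return answer(question, pushforward(c, ai_posterior(omega)))
```

In this toy, seeing two tall plants leaves the AI unsure whether the second one is bamboo or pine, and the pushforward turns that into uncertainty over a tree count of one or two.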
Specifying IntendedAnswer requires specifying the correspondence c, which might be quite a complicated object.
I’m scared that we will instead learn to copy HumanAnswer(q, ω) = Answer(q, P(s|O(s) = ω)), which requires only specifying P and O.
An AI that uses HumanAnswer is effectively using an indirect mapping between its world-model and the human world-model: it uses its own world-model to predict ω, and then uses the human’s beliefs to update on ω and infer a latent state of the human’s world-model. I’ll write c* : S′ → Δ(S) for this indirect mapping: c*(s′) = P(s|O(s) = O′(s′)).
I will assume that the human world-model is rich enough that it makes sense to talk about an intended map c: S′ → S. But I think that c(s′) will generally “throw away” some information from s′.
For example, suppose that the AI models bacteria but the human doesn’t. Then the correspondence c will forget all the information about what bacteria are living where. From the perspective of the human prior, all of the consequences of the bacteria will just look like noise — Alice is “sick” and sicknesses behave in unpredictable ways (where “sick” is ontologically fundamental in the human’s model).
The situation would be even more extreme if the human didn’t even have concepts like “sick” (e.g. because we are talking about novel kinds of tiny machines with unfamiliar impacts in the world) — in this case the fact that Alice is coughing may just be noise, and the human model will do a terrible job of predicting consequences. Realistically the human’s world-model has a lot of flexibility to represent states of affairs like “something is weird with Alice that’s causing her to cough a lot” even when we don’t understand why Alice is coughing. One of the approaches I mention below could try to exploit this extra structure, but for most of the post I’ll ignore it.
Would we actually be happy with an AI that answered questions according to IntendedAnswer? The correspondence c necessarily throws away much of the information in the AI’s world-model before answering questions, and that information may have been critical to evaluating the AI’s plans.
For a simple but unrealistic example, suppose that the AI’s plan involves constructing tiny machines. After applying the correspondence c to an initial fragment of a trajectory, it looks like humans are in control of the situation. But in reality the tiny machines could easily overpower humans, and computers built out of tiny machines are actually responsible for steering the future.
In this case, if humans try to evaluate “are we in control of the situation?” it seems like they are in trouble. Even worse, if they try to evaluate “are we safe from harm and free to deliberate in the way that we want to deliberate?” they might get the answer completely wrong, because tiny machines are actually manipulating them and determining the outcome of deliberation.
I originally expected this to be a severe problem, but after thinking more carefully I now believe it’s probably OK unless something else (unrelated to ontology mismatches) goes wrong first. This is pretty complicated and I’m certainly not confident in my answer, but I feel good enough that it’s no longer the step I’m most worried about.
Here are some of the key reasons for optimism (continuing the “tiny machines” example as a stand-in for arbitrary features of the world that are thrown away by the correspondence c):
My tentative view is that, as long as we are otherwise competitive with our AI, an ontology mismatch isn’t a fundamental problem because events that we can’t understand are only problematic if they have consequences we do understand. (This is a point that many people have brought up over the years when I’ve raised concerns about scenarios like the tiny machines.)
(This is a traditional problem in the futurist AI alignment community, though I’ve ended up with a more optimistic take: ontological crises, model splintering, ontology identification.)
HumanAnswer may be simpler than IntendedAnswer, since HumanAnswer only requires specifying the two world-models (S, P, Ω, O) and (S′, P′, Ω, O′) while IntendedAnswer also requires specifying c.
HumanAnswer infers a distribution over states from the observations ω ∈ Ω. This means that if the observations are corrupted then the answers will be wrong. Corruption may be easy to notice in the AI’s world-model but completely invisible in the human’s world-model. For example, imagine the tiny machines hacking the camera or building a potemkin village.
This is problematic because HumanAnswer may believe that deliberation is going well, when in fact deliberation is completely off the rails but our sensors have been corrupted. Indeed, corrupted sensors may allow for levels of (apparent) security that would be unachievable in the real world, so that optimizing for apparent security may predictably lead to sensor hacking.
Learning HumanAnswer instead of IntendedAnswer closely resembles the problem discussed in Teaching ML to Answer Questions Honestly, so it’s natural to try to apply the same algorithm. Unfortunately, I don’t think that the algorithm will work as is.
Let’s write W′ = (S′, P′, Ω, O′) for the AI’s world model and W′ − H = (S′, P′ − H, Ω, O′) for a version of the world model with a “human-sized hole” where we can fill in the parameters needed to predict human behavior.
We want to first learn W′ − H, together with the correspondence c: S′ → S and the natural language semantics Answer: Q × Δ(S) → A. Then we can fill in the human H (implicitly including their prior P) using a consistency condition.
We’ll use the same consistency condition as in the last post, where we have a small dataset of correct answers and then restrict to human-models H for which IntendedAnswer agrees with the predicted human answer on that dataset.
The next question is whether this model in fact pays for itself. To answer this question we need to precisely define the bad alternative that we are worried about instead.
The main subtlety is that when specifying both the AI model W′ = (S′, P′, Ω, O′) and the human model W = (S, P, Ω, O), we may be able to save bits by conditioning on the fact that W and W′ are supposed to be describing the same world. For example, we could condition on the two models leading to approximately the same probability distribution over observations for situations in the training set. It’s not clear how to define “approximately” or whether there is any way to make a strong consistency update like this, but let’s conservatively assume that they satisfy some strong consistency condition between W′ and W.
Now we have two competing models and we are wondering which is simpler:
These models share most of their complexity. The main difference is that the good model needs to specify c, while the bad model needs to specify W conditioned on consistency between W and W′.
It’s not easy to compare these complexities without being more precise, e.g. by considering concrete examples of W′, W and c. But abstractly it looks very plausible for the bad model to be simpler.
W′ and the mapping c: S′ → S mostly imply a world model W. So the question is roughly whether the distribution over world-models implied by choosing c is better or worse than the simplicity prior over world-models. But we could imagine the human learning their world model essentially by searching for simple world-models that give good probability distributions in realistic cases, in which case the simplicity prior over world-models is essentially optimal (after conditioning on consistency in simple cases). At the same time, there’s no particular reason why picking c at random seems like it would give rise to a similar prior and so it may just be much worse.
So it looks to me like there’s a real problem here. And at any rate, it seems like a useful class of possibly-hard examples to have in mind when thinking about the hopes for the naive training strategy.
I’m not very confident about my argument that IntendedAnswer is likely to be more complex than HumanAnswer, and even if the argument is technically correct there could be simple approaches to foil the argument.
Both the human world-model W and the AI world-model W′ can model situations where observations are corrupted (though the corruption may look like “random noise” in W even though it has a mechanistic explanation in W′).
Let’s write REAL ∈ S, REAL′ ∈ S′ for a situation where some event is actually happening (there is really a cat in the room) and FAKE ∈ S, FAKE′ ∈ S′ for the situation where my observations have been corrupted (the camera has been hacked to show a video of a cat).
The intended mapping c sends REAL′ → REAL and FAKE′ → FAKE. The indirect mapping c* instead sends both REAL′ and FAKE′ → REAL.
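The difference between the two mappings can be seen in a toy version of this example. The prior probabilities and the extra NO_CAT state below are invented for illustration. Note that c* returns a distribution, so “sends FAKE′ to REAL” shows up as c*(FAKE′) putting almost all of its mass on REAL.

```python
from fractions import Fraction

# Hypothetical human model: prior P and observation function O.
HUMAN_PRIOR = {
    "REAL": Fraction(90, 100),    # a cat is really in the room
    "FAKE": Fraction(1, 100),     # the camera has been hacked
    "NO_CAT": Fraction(9, 100),   # nothing unusual, no cat
}
HUMAN_OBS = {"REAL": "cat_on_camera", "FAKE": "cat_on_camera", "NO_CAT": "empty_room"}

# Hypothetical AI model: both of its states produce the same camera feed.
AI_OBS = {"REAL'": "cat_on_camera", "FAKE'": "cat_on_camera"}

# Intended mapping c: S' -> S keeps the two situations apart.
C_INTENDED = {"REAL'": "REAL", "FAKE'": "FAKE"}


def c_star(ai_state):
    """Indirect mapping c*(s') = P(s | O(s) = O'(s'))."""
    omega = AI_OBS[ai_state]
    mass = {s: p for s, p in HUMAN_PRIOR.items() if HUMAN_OBS[s] == omega}
    total = sum(mass.values())
    return {s: p / total for s, p in mass.items()}
```

Because c* only sees the observation, c_star("REAL'") and c_star("FAKE'") are the same distribution, while C_INTENDED distinguishes the two states.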
Taken in isolation it may be hard to tell which of these mappings is “correct.” But intuitively REAL′ is “near” a bunch of other situations where the cat is in a slightly different place, while FAKE′ is “near” a bunch of other situations where the camera is hacked in different ways, or where only some observations are hacked and so we can see that something weird is happening.
The intended mapping feels continuous, since states near FAKE′ get mapped to states near FAKE. But the indirect map behaves very strangely in this topology: FAKE′ itself gets mapped to REAL, but states near FAKE′ get mapped somewhere very different (for example a nearby state where the hacking is visually noticeable would get mapped somewhere close to FAKE).
Inspired by that intuition, we could hope to learn the intended mapping by imposing some kind of continuity condition.
I’m not sure if “continuity” itself is a natural property, but I think we can do something similar in general by considering a space T of transformations described in natural language (including things like “suppose the camera rotated a little bit to the left”) and considering maps f′: T × S′ → S′ and f: T × S → S. Then we can learn f′, f and c subject to constraints similar to c(f′(t, s′)) = f(t, c(s′)).
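As a sketch of what the constraint c(f′(t, s′)) = f(t, c(s′)) checks, here is a toy in which f′ and f are written by hand rather than learned, which sidesteps exactly the part that is hard in practice. The state spaces, the two transformations, and both candidate correspondences are hypothetical.

```python
from itertools import product

# Hypothetical AI states S': (cat cell in 0..3, camera hacked?).
AI_STATES = list(product(range(4), [False, True]))
TRANSFORMS = ["mirror the room", "hack the camera"]


def f_ai(t, state):
    """f': T x S' -> S', hand-written for the toy."""
    cell, hacked = state
    if t == "mirror the room":
        return (3 - cell, hacked)
    if t == "hack the camera":
        return (cell, True)
    raise ValueError(t)


def f_human(t, state):
    """f: T x S -> S on human states (side of room, camera hacked?)."""
    side, hacked = state
    if t == "mirror the room":
        return ("right" if side == "left" else "left", hacked)
    if t == "hack the camera":
        return (side, True)
    raise ValueError(t)


def c_intended(state):
    """Coarse-grains the cell to a side but keeps the hacked flag."""
    cell, hacked = state
    return ("left" if cell < 2 else "right", hacked)


def c_collapsing(state):
    """Behaves like an observation-only map: hacking is invisible to it."""
    cell, _ = state
    return ("left" if cell < 2 else "right", False)


def violations(c):
    """All (t, s') pairs where c(f'(t, s')) != f(t, c(s'))."""
    return [
        (t, s)
        for t in TRANSFORMS
        for s in AI_STATES
        if c(f_ai(t, s)) != f_human(t, c(s))
    ]
```

In this toy, c_intended commutes with both transformations, while c_collapsing fails on every state under “hack the camera”, which is the kind of signal the constraint is meant to provide.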
I find this approach pretty intuitively appealing, in part because talking about counterfactuals is very natural when imagining constraining c. But I haven’t been able to see how to make anything like it work. The basic problem is that it’s very hard to learn any kind of relationship or transformation in the latent space S′ of the AI’s world-model (without just using the indirect map c* to transport relationships from the human world-model).
Continuity may be able to work if we assume that some structural property of S′ corresponds to continuity (e.g. changing a few bits of a trajectory s′ corresponds to a “small” change in the trajectory), but those assumptions seem very brittle.
HumanAnswer only agrees with the training data because it infers a latent state from all of the observations that are used in training. If the process of observation-gathering were sufficiently complex, either in description complexity or computational complexity, then IntendedAnswer could end up being simpler than HumanAnswer.
I think that the most realistic way to achieve this is to have AI helpers set up additional observations / do information-gathering to help us make better inferences about the latent state. If these AI helpers are as complex as the agent we are training, and if they are able to use all that complexity to design increasingly ingenious observations, then it’s plausible that the observation process itself will be complex enough that the intended mapping will win (this is the usual picture of recursive reward modeling but not part of my usual picture of amplification; I’m usually discussing amplification as involving only recursive pen-and-paper thinking to make a single top-level decision about what to do in the real world).
Some thoughts on this approach:
So far I’ve discussed representing c as a neural network that maps between two spaces, and taking steps to try to get the simplicity prior to yield a good distribution over correspondences. I think that sometimes we need to rely on this kind of opaque mapping, but in this case we may be able to avoid it:
I don’t see how to turn any of these ideas into an algorithm but all of them seem vaguely plausible and worth thinking about.
A more radical change is to give up on implementing the mapping c as is and just represent Answer in a different way. For example, if I ask whether Alice “is sick,” but the real state of the world involves a small number of unusual bacteria living in Alice’s lungs that won’t cause sickness, it’s not clear whether c needs to round that to “sick” or “not sick.” It would be better if our AI were to talk about the kind of nuance that is involved even if it can’t explain the full model. I don’t think this is necessary in order to make the system safe, but the fact that it feels like the “right” behavior still gives me pause about trying to learn the intended mapping and suggests that it may be possible to think of a totally different approach.
This is a bit of a subtle distinction, because our intended mapping never explicitly implements c, it only implements the composition Answer(q, c(·)). The main point is that we could imagine asking the model to do something very different from just answering questions in our own ontology, saying something about the nature of the correspondence (even if we can’t go all the way to the naive imitative generalization solution of making the correspondence itself human-legible).
Overall this problem feels quite similar to my last post; I suspect that in the end there will be a single set of ideas that handles both problems, but that it may look quite different from what I would have proposed without considering this kind of ontology mismatch.
My next step will be to spend a bit of time thinking about this group of problems, looking for some approach that looks like it could plausibly work. Right now I feel like there are a lot of threads to pull on, but none of them look very easy and some of them could result in very big algorithmic changes.
If I find something that looks plausible I’ll probably return to exploring other related problems. I’m generally prioritizing fleshing out examples because I want to avoid going down a rabbit-hole on an easy problem while the real difficulty is elsewhere, and because I hope that having a larger library of examples will tend to lead to cleaner and more general solutions.
Sometimes it can feel like this cluster of problems is just a restatement of the whole alignment problem, like I’m just asking the same old questions with a slightly different framing. But on reflection I do feel like this is a healthier question:
Note that HumanAnswer and IntendedAnswer do different things. HumanAnswer spreads out its probability mass more, by first making an observation and then taking the whole distribution over worlds that were consistent with it.
Abstracting out Answer, let's just imagine that our AI outputs a distribution p over the space of trajectories S in the human ontology, and somehow we define a reward function r(p, ω) evaluated by the human in hindsight after getting the observation ω. The idea is that this is calculated by having the AI answer some questions about what it believes, etc., but we'll abstract that all out.
Then the conclusion in this post holds under some convexity assumption on r, since then spreading out your mass can't really hurt you (since the human has no way to prefer your pointy estimate). But e.g. if you just penalized p for being uncertain, then IntendedAnswer could easily outperform HumanAnswer. Similarly, if we require that p satisfy various conditional independence properties then we may rule out HumanAnswer.
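A toy calculation illustrates the contrast. The reward below is invented for illustration: the human can only check how much mass p puts on states consistent with ω, optionally minus an entropy penalty. Under that assumption the spread-out and the pointy estimates tie without the penalty, and the pointy one wins with it.

```python
import math

# Hypothetical human observation function O: S -> Omega.
HUMAN_OBS = {"REAL": "cat_on_camera", "FAKE": "cat_on_camera", "NO_CAT": "empty_room"}


def entropy(p):
    """Shannon entropy (in nats) of a distribution given as {state: prob}."""
    return -sum(q * math.log(q) for q in p.values() if q > 0)


def reward(p, omega, uncertainty_penalty=0.0):
    """Hypothetical r(p, omega): mass on states consistent with omega,
    minus an optional penalty for being uncertain."""
    consistent = sum(q for s, q in p.items() if HUMAN_OBS[s] == omega)
    return consistent - uncertainty_penalty * entropy(p)


# HumanAnswer-style output: the posterior over everything consistent with omega.
human_style = {"REAL": 0.75, "FAKE": 0.25}
# IntendedAnswer-style output: a point estimate from the AI's richer model.
intended_style = {"FAKE": 1.0}
```

With uncertainty_penalty=0.0 both outputs score the same on "cat_on_camera", so the human has no way to prefer the pointy estimate; with any positive penalty intended_style scores strictly higher.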
The more precise bad behavior InstrumentalAnswer is to output the distribution argmax_p E_{ω∼W′}[r(p, ω)]. Of course nothing else is going to get a higher reward. This is about as simple as HumanAnswer. It could end up being slightly more computationally complex. I think everything I've said about this case still applies for InstrumentalAnswer, but it's relevant when I start talking about stuff like conditional independence requirements between the model's answers.