(This post is superseded by our writeup on Eliciting Latent Knowledge.)

In a recent post I discussed one reason that a naive alignment strategy might go wrong, by learning to “predict what humans would say” rather than “answer honestly.” In this post I want to describe another problem that feels very similar but may require new ideas to solve.

In brief, I’m interested in the case where:

  • The simplest way for an AI to answer a question is to first translate from its internal model of the world into the human’s model of the world (so that it can talk about concepts like “tree” that may not exist in its native model of the world).
  • The simplest way to translate between the AI world-model and the human world-model is to use the AI world-model to generate some observations (e.g. video) and then figure out what states in the human world-model could have generated those observations.
  • This leads to bad predictions when the observations are misleading.

This is distinct from the failure mode discussed in my recent post — in both cases the AI makes errors because it’s copying “what a human would do,” but in this case we’re worried that “what a human would do” may be simpler than the intended policy of answering questions honestly, even if you didn’t need a predictive model of humans for any other reason. Moreover, I’ll argue below that the algorithm from that post doesn’t appear to handle this case.

I want to stress that this post describes an example of a situation that poses a challenge for existing techniques. I don’t actually think that human cognition works the way described in this post, but I believe it highlights a difficulty that would exist in more realistic settings.

Formal setup

Human world-model

I’ll imagine a human who has a simple world model W = (S, P: Δ(S), Ω, O: S → Ω) where:

  • S is a space of trajectories, each describing a sequence of events in the world. For example, a trajectory s ∈ S may specify a set of rigid objects and then specify how they move around over time.
  • P is a probability distribution over trajectories. It includes both a prior over initial states (cars are probably on the road and fish are probably in the ocean) and a dynamics model that tells us how likely a trajectory is under the laws of physics (most trajectories approximately satisfy Newton’s laws).
  • Ω is a space of observations, for example videos.
  • O tells you what you would observe for each possible trajectory.

Let Q be the space of natural language questions and A be the space of answers. Natural language has a simple semantics in the human’s world-model, given by a function Answer: Q × Δ(S) → A. For example, we could have Answer(“Is there a cat in the room?”, p) = “there was until recently, but it probably left just now.”

Given some observations ω ∈ Ω, an idealized human answers a question q by performing Bayesian inference and then applying Answer to the resulting probability distribution, i.e. HumanAnswer(q, ω) = Answer(q, P(s|O(s) = ω)).
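
A minimal sketch of HumanAnswer on a toy discrete world-model may make the definition concrete. The three trajectories, their prior probabilities, the observation labels, and the thresholded `answer` function are all illustrative assumptions rather than part of the setup; only the structure (condition P on O(s) = ω, then apply Answer) follows the definition above.

```python
# Toy human world-model W = (S, P, Omega, O). All names and numbers are hypothetical.
P = {"cat": 0.30, "no_cat": 0.69, "no_cat_fake_video": 0.01}  # prior over trajectories S
O = {  # observation map O: S -> Omega
    "cat": "cat_on_video",
    "no_cat": "empty_video",
    "no_cat_fake_video": "cat_on_video",
}

def posterior(prior, observe, omega):
    """Return P(s | O(s) = omega) by exact enumeration over a finite S."""
    weights = {s: p for s, p in prior.items() if observe[s] == omega}
    total = sum(weights.values())
    if total == 0:
        raise ValueError(f"observation {omega!r} has probability zero")
    return {s: w / total for s, w in weights.items()}

def answer(q, dist):
    """Toy stand-in for Answer: Q x Delta(S) -> A; supports one question only."""
    if q != "Is there a cat in the room?":
        raise NotImplementedError(q)
    p_cat = dist.get("cat", 0.0)
    return "probably yes" if p_cat > 0.5 else "probably no"

def human_answer(q, omega):
    """HumanAnswer(q, omega) = Answer(q, P(s | O(s) = omega))."""
    return answer(q, posterior(P, O, omega))

print(human_answer("Is there a cat in the room?", "cat_on_video"))
```

In this toy model the fake-video trajectory has a small prior, so a video of a cat leads the idealized human to conclude that a cat is probably there.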

Of course in practice the human may make errors and will have cognitive limitations. But if we use the kinds of techniques discussed in Teaching ML to Answer Questions Honestly, we could hope to learn something like HumanAnswer instead of the human’s approximation to it.

AI world-model

Let’s assume that our AI models the world in the same way as the human, but with a different world model W′ = (S′, P′, Ω, O′: S′ → Ω).

This mismatch may make it unclear how to answer questions. Perhaps the human thinks of the world in terms of rigid objects (so that it makes sense to talk about how many objects are in a bag), whereas the AI models objects as composed of small parts that can move continuously between being separate and being parts of a whole. Perhaps the human considers “tree” a fundamental category (so that it makes sense to talk about whether something is a tree), whereas the AI uses a different taxonomy to think about plants and animals.

Intuitively, the “intended” way for our AI to answer questions is to have a map between its world model and the human model, c: S′ → S. For example, c may group object-parts into objects, or classify plants according to the human taxonomy. That lets it answer questions like “How many trees are in the garden?”

Slightly more formally, we want the AI to do inference in its own world-model, map the result to the human world-model with c, and then apply Answer. That is: IntendedAnswer(q, ω) = Answer(q, c(P′(s′|O′(s′) = ω))).

(You could imagine extending Answer to S′ directly, but I’ll assume that the simplest way to answer arbitrary questions is to first apply c to get an element of S.)
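
As a rough illustration of IntendedAnswer, the sketch below does inference in a toy AI world-model, pushes the posterior forward through a hand-written correspondence c, and then applies a toy Answer. The state names, probabilities, the dictionary `c`, and the `pushforward` helper are all assumptions made for the example; in particular, applying c to a distribution is interpreted here as the pushforward of that distribution.

```python
from collections import defaultdict

# Toy AI world-model W' = (S', P', Omega, O'). All names and numbers are hypothetical.
P_ai = {"cat_cells_in_room": 0.30, "no_cat_cells": 0.60, "camera_hacked_no_cat_cells": 0.10}
O_ai = {
    "cat_cells_in_room": "cat_on_video",
    "no_cat_cells": "empty_video",
    "camera_hacked_no_cat_cells": "cat_on_video",
}
# Hand-written correspondence c: S' -> S (assumed, for illustration only).
c = {
    "cat_cells_in_room": "cat",
    "no_cat_cells": "no_cat",
    "camera_hacked_no_cat_cells": "no_cat_fake_video",
}

def posterior(prior, observe, omega):
    """Return the prior conditioned on observe[s] == omega."""
    weights = {s: p for s, p in prior.items() if observe[s] == omega}
    total = sum(weights.values())
    if total == 0:
        raise ValueError(f"observation {omega!r} has probability zero")
    return {s: w / total for s, w in weights.items()}

def pushforward(mapping, dist):
    """Map a distribution over S' to a distribution over S via c."""
    out = defaultdict(float)
    for s_prime, p in dist.items():
        out[mapping[s_prime]] += p
    return dict(out)

def answer(q, dist):
    """Toy stand-in for Answer; supports one question only."""
    if q != "Is there a cat in the room?":
        raise NotImplementedError(q)
    return "probably yes" if dist.get("cat", 0.0) > 0.5 else "probably no"

def intended_answer(q, omega):
    """IntendedAnswer(q, omega) = Answer(q, c(P'(s' | O'(s') = omega)))."""
    belief_ai = posterior(P_ai, O_ai, omega)
    return answer(q, pushforward(c, belief_ai))

print(intended_answer("Is there a cat in the room?", "cat_on_video"))
```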

The problem: HumanAnswer

Specifying IntendedAnswer requires specifying the correspondence c, which might be quite a complicated object.

I’m scared that we will instead learn to copy HumanAnswer(q, ω) = Answer(q, P(s|O(s) = ω)), which requires only specifying P and O.

An AI that uses HumanAnswer is effectively using an indirect mapping between its world-model and the human world-model: it uses its own world-model to predict ω, then uses the human’s beliefs to update on ω and infer a latent state of the human’s world-model. I’ll write c* : S′ → Δ(S) for this indirect mapping: c*(s′) = P(s|O(s) = O′(s′)).
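
The indirect mapping can be sketched in a few lines. The toy models below are assumptions chosen so that the human model has no notion of a hacked camera; the point is only that c* never consults a correspondence between S′ and S, just the two observation maps and the human prior.

```python
# Hypothetical toy models sharing one observation space Omega.
P_human = {"cat": 0.3, "no_cat": 0.7}                       # human prior P over S
O_human = {"cat": "cat_on_video", "no_cat": "empty_video"}  # human observation map O
O_ai = {                                                    # AI observation map O'
    "cat_cells_in_room": "cat_on_video",
    "no_cat_cells": "empty_video",
    "camera_hacked_no_cat_cells": "cat_on_video",
}

def posterior(prior, observe, omega):
    """Return the prior conditioned on observe[s] == omega."""
    weights = {s: p for s, p in prior.items() if observe[s] == omega}
    total = sum(weights.values())
    if total == 0:
        raise ValueError(f"observation {omega!r} has probability zero")
    return {s: w / total for s, w in weights.items()}

def c_star(s_prime):
    """Indirect map c*(s') = P(s | O(s) = O'(s'))."""
    return posterior(P_human, O_human, O_ai[s_prime])

for s_prime in O_ai:
    print(s_prime, "->", c_star(s_prime))
```

In this toy setup the hacked-camera state and the real-cat state are sent to the same distribution over S, because they produce the same observation.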

Limitations of the human’s world-model

I will assume that the human world-model is rich enough that it makes sense to talk about an intended map c: S′ → S. But I think that c(s′) will generally “throw away” some information from s′.

For example, suppose that the AI models bacteria but the human doesn’t. Then the correspondence c will forget all the information about what bacteria are living where. From the perspective of the human prior, all of the consequences of the bacteria will just look like noise — Alice is “sick” and sicknesses behave in unpredictable ways (where “sick” is ontologically fundamental in the human’s model).

The situation would be even more extreme if the human didn’t even have concepts like “sick” (e.g. because we are talking about novel kinds of tiny machines with unfamiliar impacts in the world) — in this case the fact that Alice is coughing may just be noise, and the human model will do a terrible job of predicting consequences. Realistically the human’s world-model has a lot of flexibility to represent states of affairs like “something is weird with Alice that’s causing her to cough a lot” even when we don’t understand why Alice is coughing. One of the approaches I mention below could try to exploit this extra structure, but for most of the post I’ll ignore it.

Problem 1: is the “intended answer” actually good enough?

Would we actually be happy with an AI that answered questions according to IntendedAnswer? The correspondence c necessarily throws away much of the information in the AI’s world-model before answering questions, and that information may have been critical to evaluating the AI’s plans.

For a simple but unrealistic example, suppose that the AI’s plan involves constructing tiny machines. After applying the correspondence c to an initial fragment of a trajectory, it looks like humans are in control of the situation. But in reality the tiny machines could easily overpower humans, and computers built out of tiny machines are actually responsible for steering the future.

In this case, if humans try to evaluate “are we in control of the situation?” it seems like they are in trouble. Even worse, if they try to evaluate “are we safe from harm and free to deliberate in the way that we want to deliberate?” they might get the answer completely wrong, because tiny machines are actually manipulating them and determining the outcome of deliberation.

I originally expected this to be a severe problem, but after thinking more carefully I now believe it’s probably OK unless something else (unrelated to ontology mismatches) goes wrong first. This is pretty complicated and I’m certainly not confident in my answer, but I feel good enough that it’s no longer the step I’m most worried about.

Here are some of the key reasons for optimism (continuing the “tiny machines” example as a stand-in for arbitrary features of the world that are thrown away by the correspondence c):

  • I’m focused on whether humans can implement the meta-strategy described in the strategy-stealing assumption. That is, they want to keep themselves safe and to ensure that they deliberate well, and other than that they want their AI to maximize option value and ultimately respond to their wishes. If they trust their deliberation, then they will eventually learn about the tiny machines (and everything else in the AI’s ontology), and so can defer to their future selves about whether they actually have real control over the situation.
  • Humans have preferences over deliberation expressed in the human ontology. In order to mess up the deliberation, the tiny machines need to have effects that can be expressed in the human ontology. But this gives us an opportunity to detect the problem. For example, suppose that the tiny machines intervene to slightly change the way that human brains work: neurological events that we thought were random are instead slightly biased so as to maximize the number of paperclips that the humans ultimately choose to make. If our AI understands that these changes are biased towards paperclips (either because it caused the trouble, or because it understands enough to prevent trouble) then we want to ultimately understand that fact. So we can look at a sequence of apparently random events, observe that they are systematically paperclip-biased, and conclude that they fail to capture what we cared about (to the extent that our confidence in deliberation relied on those events being random). This works even if we don’t understand how the tiny machines bring about those changes.
  • You may be concerned that our AI knows about the tiny machines but doesn’t know enough detail about what consequences the tiny machines will have; perhaps it’s only looking a few weeks out, but it would take decades for the humans to deliberate and realize that they are unhappy with the outcome. But if our AI is causing trouble with its tiny machines (or is capable of preventing trouble caused by tiny machines), it must be because it is doing some kind of abstract reasoning about the long-term consequences of the machines. So what we really need is an aligned version of that abstract reasoning that lets us answer questions about the long-term impacts of the tiny machines; we still don’t need to talk about alien concepts like the tiny machines and the problem isn’t coming from the ontology mismatch.
  • You may be concerned that the AI doesn’t really perform that abstract reasoning itself, and instead that reasoning is carried out by e.g. computers built out of tiny machines (and those computers are themselves invisible in the human ontology). But in that case we still have some combination of (i) the AI is doing abstract reasoning about why the tiny-machine-computer will compute actions that have a certain kind of long-term effect, (ii) the AI is explicitly simulating the computation done by the tiny-machine-computer, and then we can just directly translate the knowledge from that computation into a human-legible form. Again, this may pose significant alignment problems but they aren’t actually related to the ontology mismatch.

My tentative view is that, as long as we are otherwise competitive with our AI, an ontology mismatch isn’t a fundamental problem because events that we can’t understand are only problematic if they have consequences we do understand. (This is a point that many people have brought up over the years when I’ve raised concerns about scenarios like the tiny machines.)

(This is a traditional problem in the futurist AI alignment community, though I’ve ended up with a more optimistic take: ontological crises, model splintering, ontology identification.)

Problem 2: observations can be corrupted

HumanAnswer may be simpler than IntendedAnswer, since HumanAnswer only requires specifying the two world-models (S, P, Ω, O) and (S′, P′, Ω, O′) while IntendedAnswer also requires specifying c.

HumanAnswer infers a distribution over states from the observations ω ∈ Ω. This means that if the observations are corrupted then the answers will be wrong. Corruption may be easy to notice in the AI’s world-model but completely invisible in the human’s world-model. For example, imagine the tiny machines hacking the camera or building a potemkin village.

This is problematic because HumanAnswer may believe that deliberation is going well, when in fact deliberation is completely off the rails but our sensors have been corrupted. Indeed, corrupted sensors may allow for levels of (apparent) security that would be unachievable in the real world, so that optimizing for apparent security may predictably lead to sensor hacking.
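
A small numerical sketch of the divergence, under made-up numbers: the human model gives fake video a tiny prior, while the AI model (for example, because the plan under evaluation involves tampering with the camera) considers a hacked camera likely. All states, probabilities, and the correspondence `c` are hypothetical.

```python
# Hypothetical toy models; the same observation is explained very differently by each.
P_human = {"cat": 0.30, "no_cat": 0.69, "no_cat_fake_video": 0.01}
O_human = {"cat": "cat_on_video", "no_cat": "empty_video", "no_cat_fake_video": "cat_on_video"}

P_ai = {"cat_cells_in_room": 0.05, "no_cat_cells": 0.15, "camera_hacked_no_cat_cells": 0.80}
O_ai = {
    "cat_cells_in_room": "cat_on_video",
    "no_cat_cells": "empty_video",
    "camera_hacked_no_cat_cells": "cat_on_video",
}
# Assumed intended correspondence c: S' -> S.
c = {
    "cat_cells_in_room": "cat",
    "no_cat_cells": "no_cat",
    "camera_hacked_no_cat_cells": "no_cat_fake_video",
}

def posterior(prior, observe, omega):
    """Return the prior conditioned on observe[s] == omega."""
    weights = {s: p for s, p in prior.items() if observe[s] == omega}
    total = sum(weights.values())
    if total == 0:
        raise ValueError(f"observation {omega!r} has probability zero")
    return {s: w / total for s, w in weights.items()}

def p_cat_human(omega):
    """Probability of a real cat under HumanAnswer-style inference (human prior only)."""
    return posterior(P_human, O_human, omega).get("cat", 0.0)

def p_cat_intended(omega):
    """Probability of a real cat under IntendedAnswer-style inference (AI posterior mapped by c)."""
    belief_ai = posterior(P_ai, O_ai, omega)
    return sum(p for s_prime, p in belief_ai.items() if c[s_prime] == "cat")

omega = "cat_on_video"
print(f"HumanAnswer-style P(cat)    = {p_cat_human(omega):.3f}")
print(f"IntendedAnswer-style P(cat) = {p_cat_intended(omega):.3f}")
```

With these numbers the two procedures disagree sharply on the same video (roughly 0.97 versus 0.06), which is the kind of gap that optimizing for apparent security could exploit.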

Learning HumanAnswer instead of IntendedAnswer closely resembles the problem discussed in Teaching ML to Answer Questions Honestly, so it’s natural to try to apply the same algorithm. Unfortunately, I don’t think that the algorithm will work as is.

What happens when we apply the existing algorithm to this case?

Let’s write W′ = (S′, P′, Ω, O′) for the AI’s world model and W′−H = (S′, P′−H, Ω, O′) for a version of the world model with a “human-sized hole” where we can fill in the parameters needed to predict human behavior.

We want to first learn W′−H, together with the correspondence c: S′ → S and the natural language semantics Answer: Q × Δ(S) → A. Then we can fill in the human H (implicitly including their prior P) using a consistency condition.

We’ll use the same consistency condition as in the last post, where we have a small dataset of correct answers and then restrict to human-models H for which IntendedAnswer agrees with the predicted human answer on that dataset.

The next question is whether this model in fact pays for itself. To answer this question we need to precisely define the bad alternative that we are worried about instead.

The main subtlety is that when specifying both the AI model W′ = (S′, P′, Ω, O′) and the human model W = (S, P, Ω, O), we may be able to save bits by conditioning on the fact that W and W′ are supposed to be describing the same world. For example, we could condition on the two models leading to approximately the same probability distribution over observations for situations in the training set. It’s not clear how to define “approximately” or whether there is any way to make a strong consistency update like this, but let’s conservatively assume that they satisfy some strong consistency condition between W′ and W.
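
One candidate reading of “approximately the same probability distribution over observations” is a bound on total variation distance between the observation marginals of W and W′. This is only an assumed formalization: the choice of distance, the `tolerance` value, and the toy models are all hypothetical, and the post leaves the right definition open.

```python
from collections import defaultdict

def observation_marginal(prior, observe):
    """Push a prior over trajectories through an observation map to get a distribution over Omega."""
    marginal = defaultdict(float)
    for s, p in prior.items():
        marginal[observe[s]] += p
    return dict(marginal)

def total_variation(p, q):
    """Total variation distance between two finite distributions given as dicts."""
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(w, 0.0) - q.get(w, 0.0)) for w in support)

def consistent(prior_h, obs_h, prior_ai, obs_ai, tolerance=0.05):
    """Assumed consistency check: observation marginals within `tolerance` in total variation."""
    gap = total_variation(
        observation_marginal(prior_h, obs_h),
        observation_marginal(prior_ai, obs_ai),
    )
    return gap <= tolerance

# Hypothetical toy models W (human) and W' (AI).
P_human = {"cat": 0.30, "no_cat": 0.69, "no_cat_fake_video": 0.01}
O_human = {"cat": "cat_on_video", "no_cat": "empty_video", "no_cat_fake_video": "cat_on_video"}
P_ai = {"cat_cells_in_room": 0.30, "no_cat_cells": 0.68, "camera_hacked_no_cat_cells": 0.02}
O_ai = {
    "cat_cells_in_room": "cat_on_video",
    "no_cat_cells": "empty_video",
    "camera_hacked_no_cat_cells": "cat_on_video",
}

print(consistent(P_human, O_human, P_ai, O_ai))
```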

Now we have two competing models and we are wondering which is simpler:

  1. [Good] Specify W′−H, c, and Answer. Then fill in H conditioned on agreement between the predicted human answer and IntendedAnswer.
  2. [Bad] Specify W′−H and Answer. Then fill in H and W conditioned on both (i) agreement between the predicted human answer and HumanAnswer, (ii) consistency between W and W′.

These models share most of their complexity. The main difference is that the good model needs to specify c, while the bad model needs to specify W conditioned on consistency between W and W′.

It’s not easy to compare these complexities without being more precise, e.g. by considering concrete examples of W′, W and c. But abstractly it looks very plausible for the bad model to be simpler.
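
One way to keep track of the comparison is to write out description lengths. The notation is an assumption introduced here for illustration: L(x) is the number of bits needed to specify x under the simplicity prior, and L(x | y) is the number of bits needed after conditioning on y. This is bookkeeping for the two numbered models above, not a result.

```latex
\begin{align*}
L_{\text{good}} &\approx L(W'{-}H) + L(\mathrm{Answer}) + L(c)
  + L\big(H \,\big|\, \text{agreement with IntendedAnswer}\big) \\
L_{\text{bad}} &\approx L(W'{-}H) + L(\mathrm{Answer})
  + L\big(H, W \,\big|\, \text{agreement with HumanAnswer},\ \text{consistency of } W \text{ with } W'\big)
\end{align*}
```

If filling in H costs about the same in both cases (a further assumption), the bad model comes out simpler roughly when L(W | consistency with W′) < L(c).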

W′ and the mapping c: S′ → S mostly imply a world model W. So the question is roughly whether the distribution over world-models implied by choosing c is better or worse than the simplicity prior over world-models. But we could imagine the human learning their world model essentially by searching for simple world-models that give good probability distributions in realistic cases, in which case the simplicity prior over world-models is essentially optimal (after conditioning on consistency in simple cases). At the same time, there’s no particular reason why picking c at random seems like it would give rise to a similar prior and so it may just be much worse.

So it looks to me like there’s a real problem here. And at any rate, it seems like a useful class of possibly-hard examples to have in mind when thinking about the hopes for the naive training strategy.

Possible approaches

Hope that c is simple after all

I’m not very confident about my argument that IntendedAnswer is likely to be more complex than HumanAnswer, and even if the argument is technically correct there could be simple approaches to foil the argument.

  • Maybe specifying the correspondence c is actually a totally fine way to implicitly specify the human’s world model W. For example, maybe we can first specify c, then specify W subject to a consistency condition between W and c(W′), and the complexity of c will pay for itself as a way of constraining W (maybe after imposing some additional simple consistency conditions on c).
  • Maybe we can specify c after specifying W and W′, and after imposing consistency conditions the intended mapping c won’t be too complex. This feels a bit implausible but like I said the whole argument feels kind of shaky.
  • Apart from potential “outs,” the argument above just doesn’t feel that solid. Having confirmed that the problem isn’t easily solved, it would probably be good to spend a good amount of time digging into specific examples to confirm that the abstract story is basically correct and that there is a real problem here.

Continuity and commutativity

Both the human world-model W and the AI world-model W′ can model situations where observations are corrupted (though the corruption may look like “random noise” in W even though it has a mechanistic explanation in W′).

Let’s write REAL ∈ S, REAL′ ∈ S′ for a situation where some event is actually happening (there is really a cat in the room) and FAKE ∈ S, FAKE′ ∈ S′ for the situation where my observations have been corrupted (the camera has been hacked to show a video of a cat).

The intended mapping c sends REAL′ → REAL and FAKE′ → FAKE. The indirect mapping c* instead sends both REAL′ and FAKE′ → REAL.

Taken in isolation it may be hard to tell which of these mappings is “correct.” But intuitively REAL′ is “near” a bunch of other situations where the cat is in a slightly different place, while FAKE′ is “near” a bunch of other situations where the camera is hacked in different ways, or where only some observations are hacked and so we can see that something weird is happening.

The intended mapping feels continuous, since states near FAKE′ get mapped to states near FAKE. But the indirect map behaves very strangely in this topology: FAKE′ itself gets mapped to REAL, but states near FAKE′ get mapped somewhere very different (for example a nearby state where the hacking is visually noticeable would get mapped somewhere close to FAKE).

Inspired by that intuition, we could hope to learn the intended mapping by imposing some kind of continuity condition.

I’m not sure if “continuity” itself is a natural property, but I think we can do something similar in general by considering a space T of transformations described in natural language (including things like “suppose the camera rotated a little bit to the left”) and considering maps f′: T × S′ → S′ and f: T × S → S. Then we can learn f, f′ and c subject to constraints similar to c(f′(t, s′)) = f(t, c(s′)).
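
The commutativity constraint can be checked mechanically on a toy example. Everything below is hypothetical: three hand-written states per model, two transformations, hand-coded maps `f_ai` (standing in for f′) and `f_human` (standing in for f), and a deterministic point-mass simplification of the indirect map c* that sends FAKE′ to REAL.

```python
from itertools import product

# Hypothetical finite state spaces and transformations.
S_AI = ["REAL'", "FAKE'", "EMPTY'"]   # real cat / camera hacked to show a cat / no cat
T = ["restore_camera", "do_nothing"]  # stand-ins for natural-language transformations

def f_ai(t, s_prime):
    """f': T x S' -> S'. Restoring a hacked camera reveals the empty room."""
    return "EMPTY'" if (t == "restore_camera" and s_prime == "FAKE'") else s_prime

def f_human(t, s):
    """f: T x S -> S. The same transformation, expressed in the human model."""
    return "EMPTY" if (t == "restore_camera" and s == "FAKE") else s

c_intended = {"REAL'": "REAL", "FAKE'": "FAKE", "EMPTY'": "EMPTY"}
# Point-mass simplification of c*: both REAL' and FAKE' look like REAL.
c_indirect = {"REAL'": "REAL", "FAKE'": "REAL", "EMPTY'": "EMPTY"}

def violations(c):
    """Return all (t, s') where c(f'(t, s')) != f(t, c(s'))."""
    return [
        (t, s_prime)
        for t, s_prime in product(T, S_AI)
        if c[f_ai(t, s_prime)] != f_human(t, c[s_prime])
    ]

print("intended:", violations(c_intended))
print("indirect:", violations(c_indirect))
```

In this toy case the intended map commutes with every transformation, while the indirect-style map fails on restoring the camera from FAKE′. The sketch hard-codes f′, which sidesteps the hard part: learning such transformations in the AI’s latent space.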

I find this approach pretty intuitively appealing, in part because talking about counterfactuals is very natural when imagining constraining c. But I haven’t been able to see how to make anything like it work. The basic problem is that it’s very hard to learn any kind of relationship or transformation in the latent space S′ of the AI’s world-model (without just using the indirect map c* to transport relationships from the human world-model).

Continuity may be able to work if we assume that some structural property of S′ corresponds to continuity (e.g. changing a few bits of a trajectory s′ corresponds to a “small” change in the trajectory), but those assumptions seem very brittle.

Make better observations

HumanAnswer only agrees with the training data because it infers a latent state from all of the observations that are used in training. If the process of observation-gathering were sufficiently complex, either in description complexity or computational complexity, then IntendedAnswer could end up being simpler than HumanAnswer.

I think that the most realistic way to achieve this is to have AI helpers set up additional observations / do information-gathering to help us make better inferences about the latent state. If these AI helpers are as complex as the agent we are training, and if they are able to use all that complexity to design increasingly ingenious observations, then it’s plausible that the observation process itself will be complex enough that the intended mapping will win (this is the usual picture of recursive reward modeling but not part of my usual picture of amplification; I’m usually discussing amplification as involving only recursive pen-and-paper thinking to make a single top-level decision about what to do in the real world).

Some thoughts on this approach:

  • It’s not clear how to delegate observation. Suppose that my AI has a good understanding of tiny machines, and could build tiny machines to collect information (or to construct sophisticated microscopic sensors). To delegate this task to my AI I need to be able to ask it questions about the relationship between those sensor readings and the actual state of the world, despite me not understanding the dynamics by which the tiny machines operate. I don’t have a good sense of whether this can be made to work by inductively constructing better and better helpers; it seems possible but far from certain.
  • It’s not clear if you can make information-gathering scalably complex. For example, it may be that even a very smart AI is still essentially limited to building some number of increasingly tiny cameras, and that the game doesn’t change qualitatively as the helper AI gets smarter. In this case, the intended policy just needs to hard-code some upper bound on how much my helper AIs and I can observe (which may still be much simpler than the intended mapping). If I imagine this working out I think it comes down to some assumption about offense vs defense for building sensors vs manipulating them, but I’m not sure how to even articulate that assumption and would need to think about it a lot.
  • No matter how complex you make information-gathering, I suspect there are still simple strategies that abstract out details of the information-gatherers (e.g. by treating themselves as a model for the information-gathering assistants and computing the equilibria of the game). So my guess is that this approach should either go all the way, in the sense that there is no way for an attacker to fool the sensors (i.e. defense has an advantage over offense), or it will only get you a modest constant edge for the intended policy.
  • Aesthetically, it seems quite weird to be in the situation where our model “knows” about what is going on in the world, but we need to construct physical sensors in order to extract that information. I would feel surprised if that was the only solution to the problem.

Represent c in a different way

So far I’ve discussed representing c as a neural network that maps between two spaces, and taking steps to try to get the simplicity prior to yield a good distribution over correspondences. I think that sometimes we need to rely on this kind of opaque mapping, but in this case we may be able to avoid it:

  • c maps into a “human-comprehensible” space S. So it feels particularly plausible that we could represent c itself in a human-legible way, or as a combination of an opaque neural network and some legible data, and then apply imitative generalization or a similar approach.
  • Given two different reasonable ways of modeling the world, it feels like there should be a mapping between them. But it’s not like someone just told us “there exists a mapping from S′ → S”; we may get a lot of other structure for free if we think about why we think that the mapping c exists. Just considering the argument more carefully may suggest a different approach. For example, we could try to think about both S and S′ as being downstream of some unobserved “real” world S*, and try to take approaches that go up from S′ → Δ(S*) → Δ(S) rather than going down from S′ → Ω → Δ(S).
  • The indirect mapping c*: S′ → Δ(S) is defined in terms of the observation functions O, O′ and the human’s prior P. If we had to specify a mapping c : S′ → S before specifying any of {O, O′, P}, then intuitively it seems like the intended mapping may be significantly simpler.

I don’t see how to turn any of these ideas into an algorithm but all of them seem vaguely plausible and worth thinking about.

Don’t implement c

A more radical change is to give up on implementing the mapping c as is and just represent Answer in a different way. For example, if I ask whether Alice “is sick,” but the real state of the world involves a small number of unusual bacteria living in Alice’s lungs that won’t cause sickness, it’s not clear whether c needs to round that to “sick” or “not sick”; it would be better if our AI were to talk about the kind of nuance that is involved even if it can’t explain the full model. I don’t think this is necessary in order to make the system safe, but the fact that it feels like the “right” behavior still gives me pause about trying to learn the intended mapping and suggests that it may be possible to think of a totally different approach.

This is a bit of a subtle distinction, because our intended mapping never explicitly implements c, it only implements the composition Answer(q, c(·)). The main point is that we could imagine asking the model to do something very different from just answering questions in our own ontology, saying something about the nature of the correspondence (even if we can’t go all the way to the naive imitative generalization solution of making the correspondence itself human-legible).

My current state

Overall this problem feels quite similar to my last post; I suspect that in the end there will be a single set of ideas that handles both problems, but that it may look quite different from what I would have proposed without considering this kind of ontology mismatch.

My next step will be to spend a bit of time thinking about this group of problems, looking for some approach that looks like it could plausibly work. Right now I feel like there are a lot of threads to pull on, but none of them look very easy and some of them could result in very big algorithmic changes.

If I find something that looks plausible I’ll probably return to exploring other related problems. I’m generally prioritizing fleshing out examples because I want to avoid going down a rabbit-hole on an easy problem while the real difficulty is elsewhere, and because I hope that having a larger library of examples will tend to lead to cleaner and more general solutions.

Sometimes it can feel like this cluster of problems is just a restatement of the whole alignment problem, like I’m just asking the same old questions with a slightly different framing. But on reflection I do feel like these are healthier questions:

  • These examples ignore a lot of issues while still leading to catastrophic outcomes, so I think they are in fact isolating a small part of the problem. For example, these examples don’t talk at all about agency, high stakes and the need for reliability, or human preferences. But those are some of the central concepts in typical discussions of alignment, so removing them really does change the discussion.
  • I think it’s possible to make these cases arbitrarily concrete by filling in more and more details of the human and the AI models. Moreover, I think the problem currently looks soluble without requiring further (vague) assumptions about human reasoning or preferences. I think that’s a really good place to be, and pretty uncommon in alignment.
  • I think it’s important that we only need to solve Problem 2 (handling corrupt observations) and not Problem 1 (talking about alien concepts). I think this is a lot of what makes the problem concrete + tractable. It also means that we are thinking about a different aspect of this “ontology identification” problem than people usually discuss in AI alignment.

1 comment:

Note that HumanAnswer and IntendedAnswer do different things. HumanAnswer spreads out its probability mass more, by first making an observation and then taking the whole distribution over worlds that were consistent with it.

Abstracting out Answer, let's just imagine that our AI outputs a distribution  over the space of trajectories  in the human ontology, and somehow we define a reward function  evaluated by the human in hindsight after getting the observation . The idea is that this is calculated by having the AI answer some questions about what it believes etc but we'll abstract that all out.

Then the conclusion in this post holds under some convexity assumption on , since then spreading out your mass can't really hurt you (since the human has no way to prefer your pointy estimate). But e.g. if you just penalized  for being uncertain, then IntendedAnswer could easily outperform HumanAnswer. Similarly, if we require that  satisfy various conditional independence properties then we may rule out HumanAnswer.

The more precise bad behavior InstrumentalAnswer is to output the distribution . Of course nothing else is going to get a higher reward. This is about as simple as HumanAnswer. It could end up being slightly more computationally complex. I think everything I've said about this case still applies for InstrumentalAnswer, but it's relevant when I start talking about stuff like conditional independence requirements between the model's answers.