Thanks to Garrett Baker and Evžen Wybitul for discussions and feedback on this post.

Imagine that you meet an 18th century altruist. They tell you “So, I’ve been thinking about whether or not to eat meat. Do you know whether animals have souls?” How would you answer, assuming you actually do want to be helpful?

One option is to spend a lot of time explaining why “soul” isn’t actually the thing in the territory they care about, and talk about moral patienthood and theories of welfare and moral status. If they haven’t walked away from you in the first thirty seconds this may even work, though I wouldn’t bet on it.

Another option is to just say “yes” or “no”, to try and answer what their question was pointing at. If they ask further questions, you can either dig in deeper and keep translating your real answers into their ontology or at some point try to retarget their questions’ pointers toward concepts that do exist in the territory.

Low-fidelity pointers

The problem you’re facing in the above situation is that the person you’re talking to is using an inaccurate ontology to understand reality. The things they actually care about correspond to quite different objects in the territory. Those objects currently don’t have very good pointers in their map. Trying to directly redirect their questions without first covering a fair amount of context and inferential distance over what these objects are probably wouldn’t work very well.

So, the reason this is relevant to alignment:

Representations of things within the environment are learned by systems up to the level of fidelity that’s required for the learning objective. This is true even if you assume a weak version of the natural abstraction hypothesis to be true; the general point isn’t that there wouldn’t be concepts corresponding to what we care about, but that they could be very fuzzy.

For example, let’s say that you try to retarget an internal general-purpose search process. That post describes the following approach:

  • Identify the AI’s internal concept corresponding to whatever alignment target we want to use (e.g. values/corrigibility/user intention/human mimicry/etc).
  • Identify the retargetable internal search process.
  • Retarget (i.e. directly rewire/set the input state of) the internal search process on the internal representation of our alignment target.

There are - very broadly, abstracting over a fair amount of nuance - three problems with this:

  • You need to have interpretability tools that are able to robustly identify human-relevant alignment properties from the AI’s internals[1]. This isn’t as much a problem with the approach, as it is the hard thing you have to solve for it to work.
  • It doesn’t seem obvious that existentially dangerous models are going to look like they’re doing fully-retargetable search. Learning some heuristics that are specialized to the environment, task, or target are likely to make your search much more efficient[2]. These can be selectively learned and used for different contexts. This imposes a cost on arbitrary retargeting, because you have to relearn those heuristics for the new target.
  • The concept corresponding to the alignment target you want is not very well-specified. Retargeting your model to this concept probably would make it do the right things for a while. However, as the model starts to learn more abstractions relating to this new target, you run into an under-specification problem where the pointer can generalize in one of several ways.

The first problem (or at least, some version of it) seems unavoidable to me in any solution to alignment. What you want in the end is to interact with the things in your system that would ensure you of its safety. There may be simpler ways to go about it, however. The second problem is somewhat related to the third, and would I think be solved by similar methods.

Notably, the third problem is still a problem in this setup even if you have a robust specification of the target yourself. Because the setup doesn’t involve going in and improving the model’s capabilities / ontology / world model fidelity, you’re upper-bounded by how good of a specification it already has. You could modify the setup to involve updating the model’s ontology, but that seems like a much harder problem if we wanted to do it in the same direct way.

If we didn’t have to do it in the same way however, there are some promising threads. The one that seems most straightforward to me is designing a training process such that the retargeting is integrated at every step. For instance, as part of a reward signal on what you want the search target to be. This would allow for the concepts you care about to always be closely downstream of the target you’re training on, and steer the model’s capabilities toward learning higher-fidelity representations of them[3].

This does mean that you have to have a good specification of the alignment target yourself; training on an under-specified signal or weak proxy is pretty risky. I don’t think this necessarily makes the problem harder because we may need such a specification to retarget the model at all, or generally do model editing, and so on. But it’s a consideration I don’t often see mentioned, that feeds into some aspects of what I work on, and how I think of good alignment solutions.

This post was in part inspired by this Twitter thread from last year mentioning this problem. I left a short reply on how I thought it could be solved, but figured it was worth writing up the entire thing in more detail at some point.

  1. ^
  2. ^

     I think Mark Xu’s example of quick sort vs random sorting is a good intuition pump here.

  3. ^

     The original proposal for retargeting the search also mentioned training-on-the-fly, but mainly in the context of not getting a misaligned AI before we do the retargeting.

New Comment