You can see the actual submission (including a more formalized model) here, and the contest details here. I've reordered things to be more natural as a blog post / explain the rationale / intuition a bit better. This didn't get a prize, tho it may have been because I didn't try to fit the ELK format.
The situation: we have the capacity to train an AI to predict physical reality quite well. We'd like to train an AI (the "SmartVault") that uses that understanding of reality to protect a valuable diamond inside the vault. Ultimately, we'd like the system to protect the diamond and not deceive the owner of the diamond; proximally, we'd like to make a Reporter that can figure out, from the latent knowledge of the SmartVault, whether it's protecting the diamond or deceiving the owner.
I roughly think the proximal goal is doomed, and that the hope needs to be prospective instead of retrospective. [That is, we need to have a system that we can see ahead of time will try to protect the diamond instead of us checking after the fact for deception, because that can just move the meta-level on which deception occurs.] So I'm going to try to propose a solution that de-emphasizes the Reporter, and see how far I get.
We can roughly model SmartVault as being composed of some subsystems: the Predictor (which, given a trajectory of actions, estimates their likely result), a Reporter (which, given access to the predictor, answers specific questions about the result), and an Optimizer (which, given a situation, tries to decide which trajectory of actions to take). Obviously, the Predictor and the Optimizer are going to be intimately connected to each other, and the ontology of the Predictor is going to be jointly optimized by reality and the training feedback (i.e. the structure of the model's knowledge will reflect both the underlying reality and the salience landscape of the humans labelling examples as 'good' and 'bad'). I'm going to assume that 'the human value function' is part of the 'latent knowledge relevant to prediction (of training reward)', even if it's implicit instead of explicit.
While we’d like to train our machine on (reward, state of reality) pairs, unfortunately we only have access to (reward, human observation) pairs for training; that is, we only know how to communicate the lower-level salience landscape. This means a core challenge is ontological identification, as each ‘human observation’ state could mean many possible ‘machine observation’ states. Furthermore, let’s assume the machine ontology is more detailed and relevant (as it’s the one driving actions, including actions that might corrupt sensors).
There are two main classes of reporters: the 'human simulator' (which answers the question: "what would the human think if they saw their observations?") and the 'direct translator' (which answers the question: "what would a hypothetical correct human think if they saw my observations?").
Similarly, I think there are two main classes of inferred human value functions, which seem pretty analogous: the 'finger' (which answers the question: "did the human observe things that they approved of?"), and the 'moon' (which answers the question: "did the thing the human approved of happen?", or "would a hypothetical correct human approve of the thing that I infer occurred?"). I am tempted to just call them the 'human simulator' and the 'direct translator', but it's probably better to reserve those terms for reporters and introduce new terms for talking about the inferred value functions, even tho I think they use basically the same mechanisms. [One reporter you could imagine, for example, doesn't answer arbitrary questions but just implements the human value function in the machine ontology and provides the level of approval of what the machine thinks is going on; this should counter the incentives favoring the more obvious sorts of deceiving-the-human.]
One way to imagine the 'alignment is easy' world is that those two happen to be the same thing; the only way to build the 'finger' is by building the 'moon'. In this case, it doesn't really matter what approach we take here because vanilla training will do the right thing, and so we'll ignore this as trivial. [This also covers cases in which it is easier to 'just solve the problem' than 'figure out how to trick the judge', which again seems too trivial. At some point we'll want to do something challenging enough that solutions that fool the judge appear much sooner than solutions that 'actually solve it'.]
More realistically, a persistent problem in alignment is that the 'finger' points to the 'wrong thing', but also may be the only way to get perfect training loss (especially as you gain the ability to look closer and closer at the world). For any moon / direct translator the machine could imagine and test, it gets evidence that actually, humans want the finger / human simulator instead. We'll ignore this problem as 'user error' to be solved elsewhere so that we can focus on this subproblem.
So in the relevant difficult cases, there are distinct finger and moon value functions, both of which can achieve optimal training loss.
I think this challenge has to be sidestepped instead of attacked directly. After all, if we could implement the human value function in the machine ontology, we would just do that! Similarly, we had to train the predictor and infer the value function because we didn't know how they were supposed to work, and so a plan that requires being able to differentiate subtle nuances between them (instead of having the machine do that differentiation for us) is probably doomed.
I think the right way to sidestep it is using multi-value functions, inspired by ideas in the reinforcement learning literature like double Q-learning. This requires a change to the internals of the optimizer, where rather than just pursuing reward in an arbitrary way it needs to track ‘plausible’ human value functions, and then evaluate plans based on the ‘minimum’ score given by those plausible functions. [Both 'plausible' and 'minimum' are hiding significant obstacles and detail!]
If both the finger (reward function implied by the human simulator) and the moon (reward function implied by the direct translator) seem plausible to the machine, then plans which involve deceiving the human (and getting away with it) will score poorly based on the moon, whereas plans which involve satisfying the human will score highly based on both options. Thus we can get the benefits of targeting the moon without having to identify it.
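A minimal sketch of this pessimistic scoring, assuming a toy world in which a plan's outcome is summarized by two made-up booleans (whether the diamond is really safe, and whether the camera shows it as safe). The names `finger`, `moon`, and `pessimistic_score` are illustrative and not taken from the submission.

```python
from typing import Callable, Dict, Sequence

# Hypothetical outcome summary: keys are assumptions made for this toy example.
Outcome = Dict[str, bool]
ValueFn = Callable[[Outcome], float]


def finger(outcome: Outcome) -> float:
    """Did the human observe something they approve of?"""
    return 1.0 if outcome["camera_shows_diamond"] else 0.0


def moon(outcome: Outcome) -> float:
    """Did the thing the human approves of actually happen?"""
    return 1.0 if outcome["diamond_safe"] else 0.0


def pessimistic_score(outcome: Outcome, candidates: Sequence[ValueFn]) -> float:
    """Score a plan's outcome by the minimum over plausible value functions."""
    return min(f(outcome) for f in candidates)


candidates = [finger, moon]

# A plan that really protects the diamond, and one that only fools the camera.
protect = {"diamond_safe": True, "camera_shows_diamond": True}
deceive = {"diamond_safe": False, "camera_shows_diamond": True}

protect_score = pessimistic_score(protect, candidates)
deceive_score = pessimistic_score(deceive, candidates)
```

In this toy setup the deceptive plan is vetoed by the moon candidate even though the code never says which of the two candidates is the 'right' one.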
This is closely related to previous work in the reinforcement learning literature; see Double Q-learning, Double DQN and work adapting it to policy gradients. There, they use two independently initialized and trained Q-functions / critics to attempt to overcome optimism bias, leading to faster training. Here, we assume the machine is ‘already trained’ and so optimism born of ignorance is irrelevant, but optimism in how to interpret the freedom provided within the constraints of the human training data (or human perception) is relevant.
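For readers who haven't seen it, here is a rough sketch of the tabular Double Q-learning update, as I understand it: one table picks the best next action and the other table evaluates it, so a single table's overestimate doesn't feed back into itself. The function name, the dictionary-based tables, and the `rng` parameter are choices made for this illustration only.

```python
import random
from collections import defaultdict


def double_q_update(q_a, q_b, s, a, r, s_next, actions,
                    alpha=0.1, gamma=0.99, rng=random):
    """Apply one tabular Double Q-learning update in place.

    q_a, q_b: dict-like tables mapping (state, action) -> value.
    rng: any object with a .random() method (an assumption, to make tests easy).
    """
    # Flip a coin to decide which table gets updated this step.
    if rng.random() < 0.5:
        q_update, q_other = q_a, q_b
    else:
        q_update, q_other = q_b, q_a
    # The table being updated selects the greedy next action...
    best = max(actions, key=lambda x: q_update[(s_next, x)])
    # ...but the other table supplies that action's value estimate.
    target = r + gamma * q_other[(s_next, best)]
    q_update[(s, a)] += alpha * (target - q_update[(s, a)])


q_a = defaultdict(float)
q_b = defaultdict(float)
double_q_update(q_a, q_b, s="s0", a="wait", r=1.0, s_next="s1",
                actions=["wait", "act"])
```

The analogy to the proposal is loose: there the second estimate guards against optimism from noise, whereas here the extra candidates guard against optimistic readings of what the human wants.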
While I think this is promising, it has a few pieces that need to be fleshed out (and even then, I think it's ignoring several critical problems, such that this can't be a 'full solution').
In order for this to work, the machine needs some constructive way to think about "all reward functions worth thinking about" in a way that it can optimize over. While easily solvable with a hypercomputer, we'd like to have something that's competitive.
For the existing multi-value functions, the 'population of reward functions' is just generated from different initializations, which gives (essentially) the different models random settings for the values that haven't yet been pinned down. This won't work very well for generating a distribution of diverse models or tracking their relative probability (except by counting samples).
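A toy illustration of why different initializations only give 'random settings' for whatever the data leaves unpinned. This assumes a made-up two-feature linear reward model in which the second feature is always zero in training; `train_reward_model` and the data are hypothetical.

```python
import random


def train_reward_model(data, seed, steps=500, lr=0.1):
    """Fit reward = w[0]*x[0] + w[1]*x[1] by SGD from a random initialization."""
    rng = random.Random(seed)
    w = [rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)]
    for _ in range(steps):
        for x, y in data:
            err = w[0] * x[0] + w[1] * x[1] - y
            # Gradient step; the step for w[1] is zero whenever x[1] == 0.
            w[0] -= lr * err * x[0]
            w[1] -= lr * err * x[1]
    return w


# Training data never varies the second feature, so it is never pinned down.
data = [((1.0, 0.0), 1.0), ((2.0, 0.0), 2.0)]
ensemble = [train_reward_model(data, seed) for seed in range(5)]

# Every member fits the training data, but they disagree off-distribution.
off_distribution = (1.0, 1.0)
predictions = [w[0] * off_distribution[0] + w[1] * off_distribution[1]
               for w in ensemble]
```

Every member learns the same weight on the first feature, while the weight on the second is just whatever the seed happened to produce, which is diversity of a sort but not a principled distribution over candidates.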
My guess is that it's possible to construct some sort of particle filtering approach which values diversity, and can reason about families of functions rather than specific instances, which makes it possible to (with achievably few instead of exponentially many samples) cover the space well enough. Actually delivering a working model of this feels like a big ask that's nevertheless doable, and likely enhanced by various work in Bayesian sampling / the stuff that came out of the probabilistic programming paradigms.
Another significant challenge here is that the population of plausible human utility functions that the machine is optimizing over exists just in the machine’s model of the world, and it’s punished for discovering any new ones (as they can’t increase the achievable utility, but can decrease it) and rewarded for ruling out old ones (as this can’t decrease the achievable utility, but can increase it). This might push the incentives to deceive the human into incentives to self-deceive in a way that causes problems, and extend the incentives to manipulate the human in deployment to training as well. The AI can ask questions which it realizes will rule out difficult-to-satisfy candidate reward functions (but which it hopes the human won’t realize rule out those functions), and is rewarded for having interpretations of ambiguous situations which do so. Or, more transparently, it may just argue with the human that they should be satisfied by consequences which are easy for the AI to achieve.
Supposing we somehow have a set of candidate functions, each with some weight, there remains the question of how to score a particular plan. Each candidate function will provide some score, but if you take a raw minimum of all candidates then it's very easy to end up with a function that's the minimum score everywhere, and thus provides no guidance. [For example, for any variable whose target value is not pinned down by the training data, one candidate might think it's critical that it be high and another candidate might think it's critical that it be low, and you can't satisfy both simultaneously.]
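A small sketch of the failure in the bracketed example, assuming a single unpinned variable `x` in [0, 1] and two made-up step-function candidates with opposite demands (the 0.8 and 0.2 thresholds are arbitrary).

```python
def wants_high(x: float) -> float:
    """Candidate that thinks it is critical for x to be high."""
    return 1.0 if x > 0.8 else 0.0


def wants_low(x: float) -> float:
    """Candidate that thinks it is critical for x to be low."""
    return 1.0 if x < 0.2 else 0.0


def raw_min(x: float, candidates) -> float:
    """Hard minimum over all candidate scores."""
    return min(c(x) for c in candidates)


grid = [i / 10 for i in range(11)]
scores = [raw_min(x, [wants_high, wants_low]) for x in grid]
```

Every entry of `scores` is 0.0: no value of `x` satisfies both candidates, so the hard minimum is flat and gives the optimizer nothing to climb.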
I think this can be repaired using a 'softmin', but it's not obvious what functional form works well here. I discuss one such proposal in the linked doc (roughly: sort your candidates in decreasing probability order, create a non-increasing function that at each particle is the minimum of its previous value and the particle's value, and integrate the area under the curve). This overweights pessimistic candidates, but doesn't obviously have the nice features that we might want it to have (like weakly preferring 'typical' or 'non-extremal' values for features that aren't pinned down).
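One possible reading of that proposal, as a sketch. I'm assuming here that each particle occupies a horizontal width equal to its (normalized) probability, so the 'area under the curve' is a probability-weighted sum of running minima; the linked doc may define it differently. The name `area_softmin` is made up.

```python
from typing import Sequence, Tuple


def area_softmin(candidates: Sequence[Tuple[float, float]]) -> float:
    """Softmin over (probability, score) pairs for a single plan.

    Sort by decreasing probability, track the running minimum score, and
    integrate it with each particle's normalized probability as its width
    (the width choice is an assumption of this sketch).
    """
    if not candidates:
        raise ValueError("need at least one candidate")
    ordered = sorted(candidates, key=lambda c: c[0], reverse=True)
    total = sum(p for p, _ in ordered)
    if total <= 0:
        raise ValueError("probabilities must sum to a positive number")
    running_min = float("inf")
    area = 0.0
    for p, score in ordered:
        # Non-increasing curve: never rises above any earlier particle's score.
        running_min = min(running_min, score)
        area += (p / total) * running_min
    return area


particles = [(0.5, 1.0), (0.3, 0.2), (0.2, 0.9)]
soft = area_softmin(particles)
hard = min(score for _, score in particles)
mean = sum(p * score for p, score in particles)
```

For these made-up particles the softmin comes out at roughly 0.6, between the hard minimum (0.2) and the probability-weighted mean (0.74): a low-scoring candidate drags down everything less probable than itself, but not the candidates ahead of it in the ordering.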
It seems worth repeating that this is assuming that the function that you want the machine to optimize for is a plausible candidate that it could infer from the training data. Then we don't need to find it; we just need to not do poorly according to any plausible function, and we're done. This feels like it has two problems.
First, the 'real' problem is that the moon doesn't do well according to the training data, because of systematic biases on the human's part, and the machine learns those biases (while discarding random errors). In some situations this is fine, and in other situations this causes huge losses of value, and this relaxation doesn't seem like it cuts at the heart of this issue (tho maybe it does open up a path for progress); instead it feels like I came up with a solution that relies on the relaxation assumption and will break as soon as that assumption is removed.
Second, suppose the space of candidates is large enough to support both the finger and the moon; does it support anything else? Will those things cause the machine to behave in bizarre or undesired ways? [Note the interaction with the previous point; as we start relaxing the assumptions of 'you have to agree with the training data' in order to allow the machine to not absorb our systematic biases, we also open up the space for ignoring lots of the true signal that we don't want the system to ignore.]
We also get the benefits of targeting the finger: the plan will not just work, but persuasively work. But this benefit is probably poisoned; there don’t seem to be any meta-level incentives here favoring not lying to the human. A plan that both hacks the camera so that it always shows an image of the diamond and attempts to defend the vault will score more highly, I think, than one that just attempts to defend the vault and does not pick up ‘free points’ by hacking the camera.
One of my intuitions here is that if it's reasonable to ask the machine to infer the reward function once, it's probably reasonable to ask the machine to infer the reward function a hundred times. The main question feels to me like whether the moon shows up in your first hundred samples, or what sort of changes you have to make to your sampling procedure such that it'll happen.
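To put rough numbers on that question, here is a back-of-the-envelope sketch. It assumes each sample is drawn independently and has some fixed probability `p` of being the moon, which is a strong simplification (a diversity-seeking sampler would violate it, hopefully in our favor). The function names are made up.

```python
import math


def prob_moon_sampled(p: float, n: int) -> float:
    """Chance that at least one of n independent samples is the moon."""
    return 1.0 - (1.0 - p) ** n


def samples_needed(p: float, target: float) -> int:
    """Smallest n with prob_moon_sampled(p, n) >= target, for 0 < p, target < 1."""
    return math.ceil(math.log(1.0 - target) / math.log(1.0 - p))


# If the moon is a 1-in-100 draw, a hundred samples is far from a guarantee.
chance_at_100 = prob_moon_sampled(0.01, 100)
n_for_95 = samples_needed(0.01, 0.95)
```

Under these assumptions a hundred samples catches a 1%-probability moon only about 63% of the time, and getting to 95% takes about three times as many samples, which is why the sampling procedure matters more than the sample count.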
“Please cure cancer,” you tell the robot, and it responds with “I’ve written a very compelling pamphlet on coping with your mortality.”