[AN #81]: Universality as a potential solution to conceptual difficulties in intent alignment

Rohin Shah

Find all Alignment Newsletter resources here. In particular, you can sign up, or look through this spreadsheet of all summaries that have ever been in the newsletter. I'm always happy to hear feedback; you can send it to me by replying to this email.

Audio version here (may not be up yet).

Published a year ago, this sequence of five posts introduced the idea of ascription universality. I didn't really get it on a first reading, and only recently read it in enough detail that I think I understand the main ideas. This entire newsletter will focus on ascription universality; treat all of it as a "Highlight".

The key idea of these posts is that of universality: when we can say that some agent "knows everything that any other agent could know". Of course, there will always be some agent with arbitrarily complex beliefs, but we could hope to have agents that know everything that is known by any agent of some complexity class (e.g. agents that run in polynomial time). However, there are still simple programs that can have complex true beliefs: we could make agents that believe that P != NP and that P = NP, and one of those agents will fall into this class. However, these agents are somewhat degenerate -- they don't have a good reason to trust that their belief is true, and so we don't want to worry about them. Since we want to talk about justified belief, let's consider a property based on which of the two agents we would trust.

So, for now let's adopt an informal definition: a program A[C] is universal with respect to some class of programs C if we would trust any beliefs reported by A[C], no matter what beliefs we hear reported by programs in C. (I'll talk about the specific notion of "ascription universality" later.) Notably, we need to trust A[C] even over our own beliefs, that is, if A[C] believes something, we discard our position and adopt A[C]'s belief.

Note that this is still a very strong criterion. Consider the question of whether AI will lead to x-risk: if one agent thinks about fixed-point theorems, while another studies neural nets, and a third tries to predict the response of democratic institutions, the agent A[C] must compete with everything that these agents (justifiably) conclude. For now, let's set aside the question of how to build a universal agent, and instead understand what it enables us to do.

Informed oversight (revisited) (Paul Christiano) (summarized by Rohin): Introduced in this older post, we consider a setting in which an overseer is training an agent, and the overseer is smarter than the agent. Intuitively, this should work out well, because if the agent ever tries to do something bad, the overseer should notice this (since it is smarter), and penalize it. However, the problem is that the agent might in the course of computing an answer come up with considerations that the overseer can't easily find. For example, an agent tasked with creating novel literature might end up plagiarizing from one book; the overseer may not be able to search through all the books to detect the plagiarism. However, if the overseer is universal, then it should also know that the agent's output is plagiarized from the book (because the agent knows that), and it can penalize it appropriately.

We can even construct a semi-formal argument (see the post for the math). Let us assume that there is some true utility function U, and when the agent takes some action a, the overseer provides it with reward r = E[U(a)], where the expectation is relative to the overseer's beliefs. Let us assume away problems of distribution shift: in particular, we'll assume that when choosing between actions a1 and a2, the agent always chooses the action with higher reward. Now suppose that the agent chooses action a1. We can conclude that a1 has higher reward than a2. This means that the observer believes that a1 has higher true utility than a2. But by universality, we always trust the observer over the agent and ourselves, and so we must also believe that a1 has higher true utility than a2, which means that we believe the agent is choosing the right action.

Rohin's opinion: This application of universality is the one I believe the most; even intuitively it seems like the entire informed oversight problem is that the overseer doesn't know everything that the agent knows, and universality straightforwardly fixes that. So I see this as the motivating example for caring about universality at all.

Worst-case guarantees (Revisited) (Paul Christiano) (summarized by Rohin): Let's start considering distribution shift again: we imagine that the agent has already been trained, and there is no overseer any more. When the agent goes off distribution, it may not choose the action that would have gotten the highest reward, and in fact it could be chosen maliciously. As a silly motivating example, perhaps the agent learned the policy "behave well until I see that RSA-2048 (a product of two very large primes) has been factored"; such a policy will behave perfectly during training but will probably commit a treacherous turn eventually.

In this setting, we would like to have a worst-case guarantee, that the model will never behave unacceptably. However, we need to enforce this guarantee at training time, before we know what situations will arise at test time. Universality of the overseer isn't enough to guarantee this, because universality only means that the overseer knows what the agent currently knows; the agent can't really be said to "know" that it will deceive us in the future.

Nonetheless, it seems like the solution is very related to universality, in that we would like to "peer inside" the agent to figure out how it works, to make sure there aren't situations in which it would behave unacceptably. We could use adversarial training for this purpose: train an adversary to find inputs on which the model behaves unacceptably, and then train the model on those inputs to not do that. This will require interpretability techniques, so that the adversary actually stands a chance of finding the relevant inputs. In order to handle cases like RSA-2048 above, the adversary should also be able to provide "pseudo-inputs" like "the world in which the agent sees that RSA-2048 is factored" (otherwise the adversary would have to factor RSA-2048 itself to demonstrate an input on which the agent would fail, which is far too difficult), while not being allowed to provide pseudo-inputs like "the world in which the agent kills all humans" (which no agent can behave acceptably on). Ultimately, we need to make sure the adversary will always be able to find a bad pseudo-input if one exists, without making it too difficult for the agent to behave acceptably on all pseudo-inputs. See also relaxed adversarial training (AN #70).

Rohin's opinion: I was skeptical of the previous iteration of this post (AN #43), which proposed adversarial training, verification, and interpretability, because it seemed like none of them would allow us to get a guarantee on performance. I'm significantly more optimistic after reading this post, mainly because a couple of previous confusions have been resolved:

1. The point of verification is not that we can prove a theorem saying "this agent is beneficial"; the point is that by making relaxations (pseudo-inputs), a technique commonly used in formal verification, we can reduce the burden on the other methods being used (such as adversarial training).

2. Similarly, the point of interpretability is not to help us understand what the agent is doing or will do, it's to help the overseer (or adversary in adversarial training) understand that. Unlike us, the overseer / adversary can scale up along with the agent itself.

I still think that it would be hard to get a guarantee with adversarial training, given that adversarial training has to eliminate all vulnerabilities. On the other hand, it only has to find all of the settings where the agent is maliciously optimizing against us, which you might hope is a more natural category that is easier to identify without looking too much at particular inputs. This seems like an empirical question on which we'll hopefully get data, though even if it works in all cases that we see, that doesn't rule out the possibility that we failed to notice some issue that will only be triggered in the future (as in the RSA-2048 example).

Universality and model-based RL (Paul Christiano) (summarized by Rohin): So far, we've been talking about the model-free setting, where the overseer provides the incentives. What about model-based RL? Here, we might want to learn separate distributions over models and utility functions using iterated amplification or HCH, and then plan using any off-the-shelf algorithm, such as MCTS. The first new problem that arises is that our distribution over models may be malicious (see benign induction). Using universality, we can solve this problem in two different ways. First, we could use HCH to filter out malicious models. By universality, since a malicious model knows what its goals are, so does HCH, and thus HCH will detect it. Second, we could ask HCH to give us a good prior over models. By universality, this prior over models must be better than the prior used by the malicious models, and so the malicious models' predictions will be worse than that of HCH as a whole, and so the malicious models will not be able to affect the computation much.

The second new problem in the model-based setting is that it isn't clear how our utility function can extract all the relevant information from the model. For example, if the model is a detailed simulation of physics, how do you find the "humans" and evaluate how "happy" they are? Universality is at least necessary for this: if there are facts the model "knows", those facts may be relevant to the utility calculation, and so the utility function also needs to know them. We might hope that universality is sufficient: in the previous case, we might think that the physics simulation "knows" whether humans are happy. But this depends on details of what "know" means that haven't yet been fleshed out, and it depends on what "kind" of universality we can achieve in practice.

Universality and consequentialism within HCH (Paul Christiano) (summarized by Rohin): In all the settings discussed so far, we hope to use HCH (AN #34). What if HCH itself contains malicious optimization? Specifically, recall that HCH involves an arbitrarily large tree of humans breaking questions down into subquestions. Memetic selection on 10^100 humans should be much stronger than the selection done by evolution to produce us, and so could be quite dangerous. We would like to use HCH to check whether its own computations are malicious.

So, we'll define a pair of mutually recursive functions, Filter and Filtered-HCH. Filter takes as input a transcript from HCH, and using filtered-HCH it finds the best argument suggesting that the transcript is problematic. It then uses filtered-HCH to ask whether, in light of this argument, the transcript should be treated as problematic. This check is repeated thrice for robustness. (Note that the original HCH treats the transcript as trusted computations, while the checker treats it as an untrusted input.) Filtered-HCH has the same type signature as HCH (i.e. it produces answers to questions). Just like HCH, it computes an answer to the question, with the help of recursive calls to filtered-HCH to get answers to subquestions. Instead of returning the answer directly, it calls Filter to check that the computation done was problematic, and returns the special symbol ⟂ instead of the answer if so.

The basic hope is that if there's some malicious subcomputation implemented "on top of" the base agents in HCH, and it tries to break HCH, then it "knows" why that attempt could work, and so if filtered-HCH is universal with respect to computations that can be instantiated by (smaller versions of) HCH, then filtered-HCH will also "know" that the computation could break HCH, and will censor it as a result. Consider the threat model from Reliability amplification (AN #44), where the chance of any particular HCH computation with budget n becoming malicious is ε, independent of the input or any past HCH calls. Then, intuitively, filtered-HCH (which needs a higher budget, say kn) has a kε chance of becoming malicious, but since the check is repeated three times and a majority vote is taken, the chance of a malicious computation getting through is k^2ε^2, which will be much smaller than the original ε if k^2 ε << 1. (This is significantly better than the scheme in Reliability amplification (AN #44), because the question "is there a problem?" is a binary yes/no question, and so majority vote works a lot better.) The harder case is when specific inputs cause very high probabilities of failure (which I believe is the threat model from Security amplification (AN #45)). I particularly don't understand this section, but I believe the hope is that since the malicious computation is implemented "on top of" the (aligned) humans, only a small "part" of the computation can be malicious, and this can be detected by a small instance of Filtered-HCH on which the malicious computation cannot be implemented.

Rohin's opinion: I don't understand this post that well, so I'll refrain from opining on it. I think I would benefit from a more concrete example of how malicious optimization could be implemented "on top of" HCH, even when the humans are restricted to a small set of inputs on which they are "secure".

Towards formalizing universality (Paul Christiano) (summarized by Rohin): We've now seen a bunch of applications of universality. So what is ascription universality? It's a specific formalization of the concept of universality. Recall our informal definition so far: a program A[C] is universal with respect to some class of programs C if we would trust any beliefs reported by A[C], no matter what beliefs we hear reported by programs in C. But how can a program "report beliefs"? For HCH, to find out its belief for some question, we simply ask HCH the question and take its answer as its "belief". However, we can't do this for arbitrary computations: in particular, if our agent is deliberately hiding a belief from us (and so wouldn't report it if asked), we shouldn't simply say that it doesn't have that belief. This is where ascriptions come in: an ascription is any way of ascribing beliefs to the agent. Simply asking the agent and trusting its answer is one possible ascription. Another class of ascriptions would look at the agent's code and memory, and make inferences about the agent's beliefs from that. We want to say that HCH will dominate any "reasonable" ascription of beliefs to the agent, where an ascription procedure is "reasonable" if it isn't "doing the hard work"; for example, if an agent always says "Yes", it is unreasonable to ascribe to it the belief that Fermat's Last Theorem is true (even though when given the question "Is Fermat's Last Theorem true" it answers yes).

Putting it all together, A[C] is ascription universal with respect to some class of programs C if, for every "reasonable" ascription procedure and program c in C, A[C]'s beliefs (which we find by asking it) epistemically dominate the beliefs ascribed to c. A set of beliefs B1 epistemically dominates another set of beliefs B2 from our perspective if, given arbitrary information about B1 and B2, for every question our best guess at the answer is what we expect B1 would answer.

So far we've been talking about A[C] dominating an entire class of programs C. We might call this idealized ascription universality, and hope that it applies to systems with perfect optimization, such as debate (AN #5) under optimal play. In practice, we'll never get perfect optimization, and our overseer isn't going to be able to think of everything that any agent could be thinking about. Instead, we could aim for practical ascription universality, in which A[c] dominates a particular computation c, and it gets to look at the computation c as a hint. This would allow an overseer to train an agent, while epistemically dominating that agent. Effectively we're asking two questions: first, would our procedure be universal if the optimization was performed perfectly? Second, is weight sharing / interpretability strong enough to solve the optimization problems needed for universality?

The blog post also has some informal thoughts on why we should expect that a careful Bayesian should be able to epistemically dominate computations found via machine learning (which is why we might hope that HCH could be ascription universal).

Rohin's opinion: I was initially a bit confused by the decomposition into the idealized and the practical variant, but I think it makes more sense now: the assumption of perfect optimization gives you a lot of mileage. For example, A[C] might choose a particular computation c in C, and then use its beliefs; if that choice of computation were already "perfectly optimized" for convincing us, then we would always be at least as convinced of A[C]'s answer as the answer of any other c in C. Debate under optimal play is pretty similar: under optimal play, the agent only tells us about the most important considerations, and only brings up other considerations if its opponent asks about them; you would hope that this is like selecting the best c in C and reporting its reasons for its beliefs.

What's happening here is that the perfect optimization allows us to select the best c in C, or the best argument, etc. "for free" (i.e. without using any computation), giving us huge savings; everything else just has to present the results properly. Consider again the question of whether AI will lead to x-risk: while there are different agents thinking about fixed-point theorems, neural nets, and democratic institutions, perfect optimization lets us "pick" the agent whose arguments are most important. Then, we should only need about as much computation as the agent itself.

However, "presenting the results properly" is non-trivial: A[C] needs to report the beliefs in plain English, even if the computation c in C has its beliefs "encoded" (as in the case of a detailed physics simulation that can be said to "know" that humans are happy, but doesn't say that in plain English). It's not clear to me whether this will work out.

You might very reasonably be worried that perfect optimization of "the most convincing argument" will get you something that is convincing for the wrong reasons, and is not actually correlated with the truth. The definition of universality doesn't preclude this: it simply says that we trust A[C]'s beliefs over our own beliefs and any other agent's beliefs; this does hold for convincing but invalid arguments. I believe (but am not sure) that the hope is to deal with this by restricting inputs to the human to a small secure space, and by the techniques discussed above in Universality and consequentialism within HCH.

Nuances with ascription universality (evhub) (summarized by Rohin): Consider a formalization of honesty / transparency related to ascription universality: a system is honest, if relative to our expectations, its explicitly stated beliefs give us just as much information as any other way of ascribing beliefs to the system. A system that searches for ways to commit a treacherous turn would not be honest, since it would not state that it was performing this search. However, what if this search was "distilled" into the weights of a neural net (akin to a lookup table)? In that case, since we wouldn't ascribe beliefs to the lookup table, the system could be deemed honest by this criterion. So, what we actually need is for the entire training process to be honest. (A similar critique applies directly to ascription universality.)

Rohin's opinion: This is a good point; it shows that you need your overseer to be ascription universal with respect to the entire training process, not just the final agent.

[-]ESRogs7y10

Notably, we need to trust A[C] even over our own beliefs, that is, if A[C] believes something, we discard our position and adopt A[C]'s belief.

To clarify, this is only if we (or the process that generated our beliefs) fall into class C, right?

[-]Rohin Shah7y30

No, under the current formalization, even if we are not in class C we have to trust A[C] over our own beliefs. Specifically, we need $E_{u s} [X ∣ i n f o] = E_{u s} [E_{A [C]} [X] ∣ i n f o]$ for any X and information about A[C] . But then if we are given the info that $E_{A [C]} [X] = Y$ , then we have:

$E_{u s} [X ∣ i n f o]$

$= E_{u s} [E_{A [C]} [X] ∣ i n f o]$ (definition of universality)

$= E_{u s} [E_{A [C]} [X] ∣ E_{A [C]} [X] = Y]$ (plugging in the specific info we have)

$= Y$ (If we are told that A[C] says Y, then we should expect that A[C] says Y)

Putting it together, we have $E_{u s} [X ∣ i n f o] = Y$ , that is, given the information that A[C] says Y, we must expect that the answer to X is Y.

This happens because we don't have an observer-independent way of defining epistemic dominance: even if we have access to the ground truth, we don't know how to take two sets of beliefs and say "belief set A is strictly 'better' than this belief set B" [1]. So what we do here is say "belief set A is strictly 'better' if this particular observer always trusts belief set A over belief set B", and "trust" is defined as "whatever we think belief set A believes is also what we believe".

You could hope that in the future we have an observer-independent way of defining epistemic dominance, and then the requirement that we adopt A[C]'s beliefs would go away.

We could say that a set of beliefs is 'strictly better' if for every quantity X its belief is more accurate, but this is unachievable, because even full Bayesian updating on true information causes you to update in the wrong direction for some quantities, just by bad luck. ↩︎

Hmm, maybe I'm missing something basic and should just go re-read the original posts, but I'm confused by this statement:

So what we do here is say "belief set A is strictly 'better' if this particular observer always trusts belief set A over belief set B", and "trust" is defined as "whatever we think belief set A believes is also what we believe".

In this, belief set A and belief set B are analogous to A[C] and C (or some c in C), right? If so, then what's the analogue of "trust... over"?

If we replace our beliefs with A[C]'s, then how is that us trusting it "over" c or C? It seems like it's us trusting it, full stop (without reference to any other thing that we are trusting it more than). No?

In this, belief set A and belief set B are analogous to A[C] and C (or some c in C), right?

Yes.

If we replace our beliefs with A[C]'s, then how is that us trusting it "over" c or C? It seems like it's us trusting it, full stop

So I only showed the case where $i n f o$ contains information about $A [C]$ 's predictions, but $i n f o$ is allowed to contain information from $A [C]$ and $C$ (but not other agents). Even if it contains lots of information from C, we still need to trust $A [C]$ .

In contrast, if $i n f o$ contained information about $A [A [C]]$ 's beliefs, then we would not trust $A [C]$ over that.

23

[AN #81]: Universality as a potential solution to conceptual difficulties in intent alignment

23