I'm still confused about how each of the approaches would prevent us from eventually creating agents that spend 99% of their cognition acting corrigibly, while spending a well-hidden 1% of their cognition trying to sniff out whether they're in the test distribution, and executing a treacherous turn if so. The way I understand your summaries:
I didn't understand what you wrote about verification well enough to have anything to say.
It does prima facie seem that an agent spending 100% of its cognition being competent and corrigible achieves higher reward than an agent that only spends 99% of its cognition being competent and corrigible, and 1% of its cognition trying (and almost always failing) to see if it's in the test distribution. Is your model that gradient descent will favor the former agent over the latter agent, making the 99%/1% agent unlikely to arise (and perhaps quantifiably so)?
The inner process may nevertheless use TDT if TDT doesn't diverge from CDT on the training distribution, or it might learn to use TDT but "look nice" so that it doesn't get selected against.
This was what I was intending to convey in assumption 3.
I'm currently intuiting that there's a broad basin of "seeming corrigible until you can perform a treacherous turn", but not a broad basin of true corrigibility. If the overseer can robustly detect incorrigible cognition in the distilled agent, I think things may be fine. But if e.g. the overseer is only inspecting the distilled agent's justifications for its behavior, and something like its verbal loop, I don't see how things can work out.
Here are my assumptions underlying this intuition:
1. Past a certain capabilities level, distilled agents can meaningfully be modeled as expected utility maximizers for short-term rewards from their overseer. (This is because expected utility maximization is instrumentally convergent, the same way that believing 51 to be composite is instrumentally convergent, as per Eliezer's comment.)
2. If the agents reason using TDT, these act-based agents will effectively optimize for the long-term instrumental preferences of their users, because in expectation, adopting this policy raises whatever short-term instrumental reward they'll get.
3. Expected utility maximizers for discounted future rewards of reward signals are hopelessly incorrigible.
4. Agents produced through the distillation process, misaligned in this way, could enter a regime of devoting 99% of their cognition to doing exactly what a corrigible agent would do, and 1% of their cognition to finding strategies for determining how to / whether to execute a treacherous turn.
5. Because of how many ways there are to hide this 1% of cognition, a complete white-box understanding of the agent's cognition is necessary to spot this sort of malign reasoning.
6. We can't achieve this level of understanding via anything like current ML transparency techniques.
Which of these assumptions do you disagree with most? Or do you agree with these assumptions, but feel optimistic that we can get good worst-case guarantees anyway?
2. How easy is it to learn to be corrigible? I'd think of this as: if we impose the extra constraint that our model behave corrigibly on all inputs, in addition to solving the object-level task well, how much bigger do we need to make the model?
My current intuition: for the same reasons it seems really hard to train models that believe 51 is prime while also performing well on object-level tasks, learning to be corrigible seems very difficult.
I really like that list of points! Not that I'm Rob, but I'd mentally classified each of those as alignment failures, and the concern I was trying to articulate was that, by default, I'd expect that an AI trying to do the right thing would make something like one of these mistakes. Those are good examples of the sorts of things I'd be scared of if I had a well-intentioned non-neurotypical assistant. Those are also what I was referring to when I talked about "black swans" popping up. And when I said:
2. Corrigibility depends critically on high-impact calibration (when your AI is considering doing a high-impact thing, it's critical that it knows to check that action with you).
I meant that, if an AI trying to do the right thing was considering one of these actions, for it to be safe it should consult you before going ahead with any one of these. (I didn't mean "the AI is incorrigible if it's not high-impact calibrated", I meant "the AI, even if corrigible, would be unsafe if it's not high-impact calibrated".)
If these kinds of errors are included in "alignment," then I'd want some different term that referred to the particular problem of building AI that was trying to do the right thing, without including all of the difficulty of figuring out what is right (except insofar as "figure out more about what is right" is one way to try to build an AI that is trying to do the right thing.)
I think I understand your position much better now. The way I've been describing "ability to figure out what is right" is "metaphilosophical competence", and I currently take the stance that an AI trying to do the right thing will by default be catastrophic if it's not good enough at figuring out what is right, even if it's corrigible.
I thought more about my own uncertainty about corrigibility, and I've fleshed out some intuitions on it. I'm intentionally keeping this a high-level sketch, because this whole framing might not make sense, and even if it does, I only want to expound on the portions that seem most objectionable.
Suppose we have an agent A optimizing for some values V. I'll call an AI system S high-impact calibrated with respect to A if, when A would consider an action "high-impact" with respect to V, S will correctly classify it as high-impact with probability at least 1-ɛ, for some small ɛ.
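The definition above can be stated as a conditional-probability condition (a sketch in my own notation; "HI" denotes the high-impact classification, and the event formalization is an assumption, not from the original):

```latex
\Pr\Big[\, S(a) = \mathrm{HI} \;\Big|\; A \text{ would consider } a \text{ high-impact w.r.t. } V \,\Big] \;\geq\; 1 - \epsilon
```

That is, conditional on A judging an action high-impact, S flags it as high-impact except with probability at most ɛ; the definition says nothing about S's false-positive rate on low-impact actions.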
My intuitions about corrigibility are as follows:
1. If you're not calibrated about high-impact, catastrophic errors can occur. (These are basically black swans, and black swans can be extremely bad.)
2. Corrigibility depends critically on high-impact calibration (when your AI is considering doing a high-impact thing, it's critical that it knows to check that action with you).
3. To learn how to be high-impact calibrated w.r.t. A, you will have to generalize properly from training examples of low/high-impact (i.e. be robust to distributional shift).
4. To robustly generalize, you're going to need the ontologies / internal representations that A is using. (In slightly weirder terms, you're going to have to share A's tastes/aesthetic.)
5. You will not be able to learn those ontologies unless you know how to optimize for V the way A is optimizing for V. (This is the core thing missing from the well-intentioned extremely non-neurotypical assistant I illustrated.)
6. If S's "brain" starts out very differently from A's "brain", S will not be able to model A's representations unless S is significantly smarter than A.
In light of this, for any agent A, some value V they're optimizing for, and some system S that's assisting A, we can ask two important questions:
(I) How well can S learn A's representations?
(II) If the representation is imperfect, how catastrophic might the resulting mistakes be?
In the case of a programmer (A) building a web app trying to make users happy (V), it's plausible that some run-of-the-mill AI system (S) would learn a lot of the important representations right and a lot of the important representations wrong, but it also seems like none of the mistakes are particularly catastrophic (worst case, the programmer just reverts the codebase.)
In the case of a human (A) trying to make his company succeed (V), looking for a new CEO (S) to replace himself, it's usually the case that the new CEO doesn't have the same internal representations as the founder. If they're too different, the result is commonly catastrophic (e.g. if the new CEO is an MBA with "more business experience", but with no vision and irreconcilable taste).
(It's worth noting that if the MBA got hired as a "faux-CEO", where the founder could veto any of the MBA's proposals, the founders might make some use of him. But the way in which he'd be useful is that he'd effectively be hired for some non-CEO position. In this picture, the founders are still doing most of the cognitive work in running the company, while the MBA ends up relegated to being a "narrow tool intelligence utilized for boring business-y things". It's also worth noting that companies care significantly about culture fit when looking for people to fill even mundane MBA-like positions...)
In the case of a human (A) generically trying to optimize for his values (V), with an AGI trained to be corrigible (S) assisting, it seems quite unlikely that S would be able to learn A's relevant internal representations (unless it's far smarter and thus untrustworthy), which would lead to incorrect generalizations. My intuition is that if S is not much smarter than A, but helping in extremely general ways and given significant autonomy, the resulting outcome will be very bad. I definitely think this if S is a sovereign, but also think this if e.g. it's doing a thousand years' worth of human cognitive work in determining if a newly distilled agent is corrigible, which I think happens in ALBA. (Please correct me if I botched some details.)
Paul: Is your picture that the corrigible AI learns the relevant internal representations in lockstep with getting smarter, such that it manages to hit a "sweet spot" where it groks human values but isn't vastly superintelligent? Or do you think it doesn't learn the relevant internal representations, but its action space is limited enough that none of its plausible mistakes would be catastrophic? Or do you think one of my initial intuitions (1-6) is importantly wrong? Or do you think something else?
Two final thoughts: