Where I currently disagree with Ryan Greenblatt’s version of the ELK approach

So8res

Context: This post is my attempt to make sense of Ryan Greenblatt's research agenda, as of April 2022. I understand Ryan to be heavily inspired by Paul Christiano, and Paul left some comments on early versions of these notes.

Two separate things I was hoping to do, that I would have liked to factor into two separate writings, were (1) translating the parts of the agenda that I understand into a format that is comprehensible to me, and (2) distilling out conditional statements we might all agree on (some of us by rejecting the assumptions, others by accepting the conclusions). However, I never got around to that, and this has languished in my drafts folder too long, so I'm lowering my standards and putting it out there.

The process that generated this document is that Ryan and I bickered for a while, then I wrote up what I understood and shared it with Ryan, and we repeated this process a few times. I've omitted various intermediate drafts, on the grounds that sharing a bunch of intermediate positions that nobody endorses is confusing (more so than seeing more of the process is enlightening), and on the grounds that if I try to do something better, then what happens instead is that the post languishes in the drafts folder for half a year.

(Thanks to Ryan, Paul, and a variety of others for the conversations.)

Nate's model towards the end of the conversation

Ryan’s plan, as Nate currently understands it:

Assume AGI is going to be paradigmatic, in the sense of being found by something roughly like gradient descent tuning the parameters in some fixed architecture. (This is not intended to be an argument for paradigmaticity; attempting to align things in the current paradigm is a good general approach regardless (or so Nate understands Ryan to claim).)
Assume further that Earth's first AGIs will be trained according to a process of our choosing. (In particular, it needs to be the case that AGI developers can train for more-or-less any objective they want, without thereby sacrificing competitiveness. Note that this might require significant feats of reward-shaping.)
Assume further that most capability gains will be driven by something roughly like gradient descent. (Ryan has some hope that this plan would generalize to various other drivers of capability gains, but Nate hasn't understood any of the vague gestures towards those ideas, and as of April 2022, Ryan's plans were mostly stated in terms of SGD, so I'll set that aside for now.)
With those three assumptions in hand, part one of the plan is to pretend like we have the One True Objective Function, and that we've contracted the services of the Devil Himself, which we utilize to train the AGI to pursue the given objective. The special power of the devil is that he can generate the inputs that cause the nascent AI to do as poorly as possible according to the one true objective function, and we can use this to find the activations of the worst misbehaviors, and expose them to gradient updates. This is intended to result in a system that Nate might call "inner aligned" with the objective we use in training.
Part two of the plan is to solve the problem of eliciting a paradigmatic AI system's latent knowledge, and then use it to substitute for the One True Objective Function and the Devil Himself. The basic idea here is that you don't need to actually have a perfect objective function or a perfect adversary for part one to work; it should in general suffice to nip any misbehavior in the bud the moment that the AI itself starts to have any knowledge of that misbehavior.
The scope of this plan is to produce a system that would work well in-distribution, and that would at least not behave catastrophically if deployed out-of-distribution. Having it behave well (and competitively) out-of-distribution is outside the scope of this particular plan (though there's some hope that it Just Works, because, e.g., the nip-all-misbehavior-in-the-bud stuff may just generalize cleanly if you're doing your job right).
It is worth noting explicitly that the task of the adversary is quite tricky, here, on account of how there are types of distributional shift that could cause the AGI to start misbehaving (e.g., because it can tell that it's now actually in deployment where it won't get shut down if it misbehaves), and which are infeasible in practice for an adversary to spoof by generating adversarial inputs alone (because it's hard to convince an AI in training that it is perhaps in deployment). Clever alternatives are needed here, like spoofing the system's mental states instead of generating the sorts of observations that only deployment can generate, but hopefully something can be made to work.
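
As a rough illustration of part one only, here is a toy sketch of training against a worst-case adversary. Everything in it is an assumption made for the example: `objective_loss` stands in for the One True Objective Function, `devil` is a brute-force search over a small fixed set of candidate inputs, and the "AI" is a single weight. It is not Ryan's or ARC's actual method, and it says nothing about the hard case above where the worst inputs are ones only deployment can generate.

```python
def objective_loss(w, x):
    # Hypothetical stand-in for the One True Objective Function:
    # the model's output w * x should match the intended output 2 * x.
    return (w * x - 2.0 * x) ** 2


def devil(w, candidates):
    # Idealized adversary: pick the candidate input on which the
    # current model does worst according to the objective.
    return max(candidates, key=lambda x: objective_loss(w, x))


def train(w=0.0, steps=200, lr=0.01):
    # Candidate inputs the adversary may choose from (an assumption;
    # a real adversary would have to search a vastly larger space).
    candidates = [i / 10 for i in range(-30, 31)]
    for _ in range(steps):
        x = devil(w, candidates)               # worst-case input for the current model
        grad = 2.0 * (w * x - 2.0 * x) * x     # d(loss)/dw at that input
        w -= lr * grad                         # expose the worst misbehavior to a gradient update
    return w
```

In this toy, `train()` drives the weight toward 2.0, because the adversary keeps surfacing the inputs with the largest error. The sketch works only because the adversary can enumerate every relevant input, which is exactly the assumption the paragraph above calls into question.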

Nate's response:

You're sidestepping a large chunk of the problem by avoiding the need to behave well when deployed far out of distribution.
I don't think it's sociopolitically feasible to pick a training method that trains the AGI from inception (which, e.g., would break down if the first AGI comes from an experimental new "lifelong learning with varied objectives" regime, and I am expecting reality to be at least that inconvenient).
I don't think it's practically feasible to get an AGI by training purely on the deployment objective.
I suspect that SGD won't be the primary driver of capabilities gains, near the end. Compare humans, who reached a certain capability level and then began to rapidly improve via knowledge, science, and culture, rather than just improving via our genes. More generally, I think there's a solid chance that, at some point, capabilities will start generalizing far OOD, and that the sort of alignment guarantees you can get from these sorts of methods will fail to generalize with them.
I doubt you can produce adversaries that are up to the task of preventing your AGI from distinguishing training from deployment.
I doubt you can elicit the latent knowledge of a nascent AGI (in a way that scales with the capabilities of the AGI) well enough to substitute for the one true objective function and the devil himself and thus produce inner alignment.
If you could, I'd begin to suspect that the latent-knowledge-eliciter itself contains lots of dangerous machinery that more-or-less faces its own version of the alignment problem.

An attempt at conditional agreement

I suggested the following:

If it is the case that:

Gradient descent on a robust objective cannot quickly and easily change the goals of early paradigmatic AGIs to move them sufficiently toward the intended goals,
OR early deployments need to be high-stakes and out-of-distribution for humanity to survive, AND
- adversarial training is insufficient to prevent early AGIs from distinguishing deployment from training,
- OR the critical outputs can be readily distinguished from all other outputs, e.g., by their universe-on-a-platter nature,
OR early paradigmatic AGIs can get significant capability gains out-of-distribution from methods other than more gradient descent,

... THEN the Paulian family of plans don't provide much hope.
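
One way to read the nesting of that conditional, written as a formula. The letters are labels introduced here, not part of the original statement: A is "gradient descent on a robust objective cannot quickly and easily move the goals of early paradigmatic AGIs sufficiently toward the intended goals", B is "early deployments need to be high-stakes and out-of-distribution for humanity to survive", C is "adversarial training is insufficient to prevent early AGIs from distinguishing deployment from training", D is "the critical outputs can be readily distinguished from all other outputs", and E is "early paradigmatic AGIs can get significant capability gains out-of-distribution from methods other than more gradient descent". The grouping assumes that the trailing AND attaches only to the two dashed sub-items.

```latex
\bigl( A \;\lor\; \bigl( B \land ( C \lor D ) \bigr) \;\lor\; E \bigr)
\;\Rightarrow\; \text{the Paulian family of plans provides little hope}
```

On this reading, any one of A, E, or B together with either C or D is enough to trigger the conclusion.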

My understanding is that Ryan was tentatively on board with this conditional statement, but Paul was not.

Postscript

Reiterating a point above: observe how this whole scheme has basically assumed that capabilities won't start to generalize relevantly out of distribution. My model says that they eventually will, and that this is precisely when things start to get scary, and that one of the big hard bits of alignment is that once that starts happening, the capabilities generalize further than the alignment. A problem that has been simply assumed away in this agenda, as far as I can tell, before we even dive into the details of this framework.

To be clear, I'm not saying that this decomposition of the problem fails to capture difficult alignment problems. The "prevent the AGI from figuring out it's in deployment" problem is quite difficult! As is the "get an ELK head that can withstand superintelligent adversaries" problem. I think these are the wrong problems to be attacking, in part on account of their difficulty. (Where, to be clear, I expect that toy versions of these problems are soluble, just not with solutions rated for the type of opposition it sounds like the rest of this plan requires.)

For what it's worth I found this writeup informative and clear. So lowering your standards still produced something useful (at least to me).

"Reiterating a point above: observe how this whole scheme has basically assumed that capabilities won't start to generalize relevantly out of distribution. My model says that they eventually will, and that this is precisely when things start to get scary, and that one of the big hard bits of alignment is that once that starts happening, the capabilities generalize further than the alignment. A problem that has been simply assumed away in this agenda, as far as I can tell, before we even dive into the details of this framework."

My reply last time is still relevant: link.

How do you actually sidestep the need for the One True Objective Function given an ELK solution? I get that it might seem plausible to take a rough objective like "do what I intend" and look at the internal knowledge of the thing for signs that it is deliberately deceiving you. If you do that, you'll get, at best, an AI that doesn't know that it is deceiving you (for whatever operationalization of "know" you come up with as you use ELK for training). But it could still be deceiving you, and very likely will be if optimization pressure is merely towards "AIs that don't know that they are being deceptive".

Our very broad hope is to use ELK to select actions that (i) keep humans safe, and give them time and space to evolve according to their current (essentially local) preferences, and (ii) are expected to produce outcomes that would be judged favorably by the future humans, primarily by maximizing option value until it becomes clear what those future humans want (see the strategy stealing assumption).

This is discussed very briefly in this appendix of the ELK report and the subsequent appendix. There are two or three big foreseeable difficulties with this approach and likely a bunch of other problems.

I don't think this should be particularly persuasive, but it hopefully illustrates how ARC is currently thinking about this part of the problem. Overall my current view is that this is fairly unlikely to be the weakest link in the plan, i.e. if it doesn't work it will be because of a failure at an earlier step, and so it's not one of the main things I'm thinking about.

"... THEN the Paulian family of plans don't provide much hope.

"My understanding is that Ryan was tentatively on board with this conditional statement, but Paul was not."

I forget the extent to which I communicated (or even thought) this in the past, but at the moment, the claim I'd agree with is: "this specific plan is much less likely to work".

My best guess is that even if I were quite confident that those conditions were true, work on various subparts of this plan would still seem like quite a good bet.

Does Ryan have an agenda somewhere? I see this post, but I don't think that's it.

I don't have an agenda posted anywhere.