For context, here's the one time in the interview I mention "AI risk" (quoting 2 earlier paragraphs for context):
Paul Christiano: I don’t know, the future is 10% worse than it would otherwise be in expectation by virtue of our failure to align AI. I made up 10%, it’s kind of a random number. I don’t know, it’s less than 50%. It’s more than 10% conditioned on AI soon I think.
Asya Bergal: I think my impression is that that 10% is lower than some large set of people. I don’t know if other people agree with that.
Paul Christiano: Certainly, 10% is lower than lots of people who care about AI risk. I mean it’s worth saying, that I have this slightly narrow conception of what is the alignment problem. I’m not including all AI risk in the 10%. I’m not including in some sense most of the things people normally worry about and just including the like ‘we tried to build an AI that was doing what we want but then it wasn’t even trying to do what we want’. I think it’s lower now or even after that caveat, than pessimistic people. It’s going to be lower than all the MIRI folks, it’s going to be higher than almost everyone in the world at large, especially after specializing in this problem, which is a problem almost no one cares about, which is precisely how a thousand full time people for 20 years can reduce the whole risk by half or something.
(But it's still the case that when asked "Can you explain why it's valuable to work on AI risk?" I responded by almost entirely talking about AI alignment, since that's what I work on and the kind of work where I have a strong view about cost-effectiveness.)
E.g. if you have a broad distribution over possible worlds, some of which are "fragile" and have 100 things that each cut value down by 10%, and some of which are "robust" and don't, then you get 10,000x more value from the robust worlds. So unless you are a priori pretty confident that you are in a fragile world (or they are 10,000x more valuable, or whatever), the robust worlds will tend to dominate.
Similar arguments work if we aggregate across possible paths to achieving value within a fixed, known world---if there are several ways things can go well, some of which are more robust, those will drive almost all of the EV. And similarly for moral uncertainty (if there are several plausible views, the ones that consider this world a lost cause will instead spend their influence on other worlds) and so forth. I think it's a reasonably robust conclusion across many different frameworks: your decision shouldn't end up being dominated by some hugely conjunctive event.
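As a rough check on the fragile/robust arithmetic above, here is a minimal sketch. It assumes one particular reading: each of the 100 things independently multiplies value by 0.9, and a robust world keeps full value. The function names and the example priors are hypothetical, chosen only for illustration. Under that reading the ratio comes out in the tens of thousands, the same order of magnitude as the 10,000x figure.

```python
# Illustrative only: assumes each "thing" multiplies value by 0.9 and that
# the robust world keeps full value. Names and priors are hypothetical.
N_PITFALLS = 100
VALUE_KEPT_PER_PITFALL = 0.9  # a 10% cut leaves 90% of the value


def surviving_value(n_pitfalls, kept=VALUE_KEPT_PER_PITFALL):
    """Fraction of value left after n_pitfalls independent 10% cuts."""
    return kept ** n_pitfalls


def ev_share_of_robust(p_fragile):
    """Share of total expected value contributed by robust worlds,
    given prior probability p_fragile of being in a fragile world."""
    fragile_ev = p_fragile * surviving_value(N_PITFALLS)
    robust_ev = (1 - p_fragile) * surviving_value(0)
    return robust_ev / (robust_ev + fragile_ev)


# How much more value a robust world has than a fragile one.
ratio = surviving_value(0) / surviving_value(N_PITFALLS)
print(f"robust/fragile value ratio: {ratio:,.0f}x")

# How the share of expected value shifts with confidence in fragility.
for p in (0.5, 0.99, 0.9999):
    print(f"P(fragile)={p}: robust share of EV = {ev_share_of_robust(p):.3f}")
```

Even with a prior heavily weighted toward the fragile world, the robust worlds still account for most of the expected value in this toy setup, which is the sense in which they "tend to dominate."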
In the case of something like amplification or debate, I think the bet that you're making is that language modeling alone is sufficient to get you everything you need in a competitive way.
I'm skeptical of language modeling being enough to be competitive, in the sense of maximizing "log prob of some naturally occurring data or human demonstrations." I don't have a strong view about whether you can get away using only language data rather than e.g. taking images as input and producing motor torques as output.
I'm also not convinced that amplification or debate need to make this bet though. If we can do joint training / fine-tuning of a language model using whatever other objectives we need, then it seems like we could just as well do joint training / fine-tuning for a different kind of model. What's so bad if we use non-language data?
We could also ask: "Would AlphaStar remain as good as it is, if fine-tuned to answer questions?"
In either case it's an empirical question. I think the answer is probably yes if you do it carefully.
You could imagine separating this into two questions:
I normally imagine using joint training in these cases, rather than pre-training + fine-tuning. E.g., at every point in time we maintain an agent and a question-answerer, where the question-answerer "knows everything the agent knows." They get better together, with each gradient update affecting both of them, rather than first training a good agent and then adding a good question-answerer.
(Independently of concerns about mesa-optimization, I think the fine-tuning approach would have trouble because you couldn't use statistical regularities from the "main" objective to inform your answers to questions, and therefore your question answers will be dumber than the policy and so you couldn't get a good reward function or specification of catastrophically bad behavior.)
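To make "each gradient update affecting both of them" concrete, here is a deliberately tiny sketch of joint training. It is a toy under stated assumptions, not the setup described above: a single shared scalar weight `w` stands in for shared knowledge, and two scalar heads `a` and `q` stand in for the agent and the question-answerer. The targets, learning rate, and all names are hypothetical.

```python
# Toy sketch only: one shared "trunk" weight w feeds an agent head (a) and a
# question-answering head (q). Targets and hyperparameters are made up.
XS = [-1.0, -0.5, 0.5, 1.0]
AGENT_SLOPE, QA_SLOPE = 2.0, -3.0  # hypothetical target functions y = slope * x


def losses(w, a, q):
    """Mean squared error of each head on the toy data."""
    n = len(XS)
    agent = sum((a * w * x - AGENT_SLOPE * x) ** 2 for x in XS) / n
    qa = sum((q * w * x - QA_SLOPE * x) ** 2 for x in XS) / n
    return agent, qa


def joint_step(w, a, q, lr=0.05):
    """One gradient step on the SUM of both losses.
    The shared weight w receives gradient from both objectives."""
    n = len(XS)
    gw = ga = gq = 0.0
    for x in XS:
        h = w * x  # shared representation
        err_agent = a * h - AGENT_SLOPE * x
        err_qa = q * h - QA_SLOPE * x
        ga += 2 * err_agent * h / n
        gq += 2 * err_qa * h / n
        gw += 2 * (err_agent * a + err_qa * q) * x / n
    return w - lr * gw, a - lr * ga, q - lr * gq


w, a, q = 0.5, 0.5, 0.5
initial = losses(w, a, q)
for _ in range(2000):
    w, a, q = joint_step(w, a, q)
final = losses(w, a, q)
print("initial (agent, qa) losses:", initial)
print("final   (agent, qa) losses:", final)
```

The contrast with pre-training + fine-tuning is that here `w` is never frozen or trained on one objective alone: every update to the shared part is shaped by both losses at once.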
I wrote this post imagining "strategy-stealing assumption" as something you would assume for the purpose of an argument, for example I might want to justify an AI alignment scheme by arguing "Under a strategy-stealing assumption, this AI would result in an OK outcome." The post was motivated by trying to write up another argument where I wanted to use this assumption, spending a bit of time trying to think through what the assumption was, and deciding it was likely to be of independent interest. (Although that hasn't yet appeared in print.)
I'd be happy to have a better name for the research goal of making it so that this kind of assumption is true. I agree this isn't great. (And then I would probably be able to use that name in the description of this assumption as well.)
(See also the concept of "decoupled RL" from some DeepMind folks.)
Now that I understand "corrigible" isn't synonymous with “satisfying my short-term preferences-on-reflection”, “corrigibility is relatively easy to learn” doesn't seem enough to imply these things
I agree that you still need the AI to be trying to do the right thing (even though we don't e.g. have any clear definition of "the right thing"), and that seems like the main way that you are going to fail.
As I understand it, the original motivation for corrigibility_MIRI was to make sure that someone can always physically press the shutdown button, and the AI would shut off. But if a corrigible_Paul AI thinks (correctly or incorrectly) that my preferences-on-reflection (or "true" preferences) are to let the AI keep running, it will act against my (actual physical) attempts to shut down the AI, and therefore it's not corrigible_MIRI.
Note that "corrigible" is not synonymous with "satisfying my short-term preferences-on-reflection" (that's why I said: "our short-term preferences, including (amongst others) our preference for the agent to be corrigible.")
I'm just saying that when we talk about concepts like "remain in control" or "become better informed" or "shut down," those all need to be taken as concepts-on-reflection. We're not satisfying current-Paul's judgment of "did I remain in control?"; we're satisfying the on-reflection notion of "did I remain in control?"
Whether an act-based agent is corrigible depends on our preferences-on-reflection (this is why the corrigibility post says that act-based agents "can be corrigible"). It may be that our preferences-on-reflection are for an agent to not be corrigible. It seems to me that for robustness reasons we may want to enforce corrigibility in all cases, even if it's not what we'd prefer-on-reflection.
That said, even without any special measures, saying "corrigibility is relatively easy to learn" is still an important argument about the behavior of our agents, since it hopefully means that either (i) our agents will behave corrigibly, (ii) our agents will do something better than behaving corrigibly, according to our preferences-on-reflection, or (iii) our agents are making a predictable mistake in optimizing our preferences-on-reflection (which might be ruled out by them simply being smart enough and understanding the kinds of argument we are currently making).
By "corrigible" I think we mean "corrigible by X" with the X implicit. It could be "corrigible by some particular physical human."
(In that post I did use narrow in the way we are currently using short-term, contrary to my claim in the grandparent. Sorry for the confusion this caused.)