Christiano (ARC) and GA (Conjecture) Discuss Alignment Cruxes

Andrea_Miotti; paulfchristiano; Gabriel Alfour; Olive Branch

The following are the summary and transcript of a discussion between Paul Christiano (ARC) and Gabriel Alfour, hereafter GA (Conjecture), which took place on December 11, 2022 on Slack. It was held as part of a series of discussions between Conjecture and people from other organizations in the AGI and alignment field. See our retrospective on the Discussions for more information about the project and the format.

Below is a summary of the discussion, followed by the full transcript, lightly edited for readability.

Summary

Introduction

GA is pessimistic about alignment being solved because he thinks (1) there is an AGI race to the bottom, (2) alignment is hard in ways that we are bad at dealing with, and (3) we don't have a lot of time to get better, given the pace of the race.

Christiano asks for clarification: does GA expect a race to the bottom because investment in alignment will be low, because people won’t be willing to slow development/deployment if needed, or because of something else? He predicts alignment investment will be 5-50% of total investment, depending on how severe the risk appears. If the risks look significant-but-kind-of-subtle, he expects 3-6 months of delay based on concern. In his median doomy case, he expects 1-2 years of delay.

GA expects lower investment (1-5%). More crucially, though, GA expects it to be hard to turn funding and time into effective research given alignment’s difficulty.

Alignment Difficulty, Feedback Loops, & Phase Shifts

GA’s main argument for alignment difficulty is that getting feedback on our research progress is difficult, because

Core concepts and desiderata in alignment are complex and abstract.
We are bad at factoring complex, abstract concepts into smaller, more tractable systems without having a lot of quantitative feedback.
We are bad at building feedback loops when working on abstract concepts.
We are bad at coming to agreement on abstract concepts.
All this will make it difficult to predict when phase shifts – e.g. qualitative changes to how systems are representing information, which might break our interpretability methods – will occur.
Such phase shifts seem likely to occur when we shift from in vitro to in vivo, which makes it particularly likely that the alignment techniques we build in vitro won’t be robust to them.
Even though theorists argue that connecting AI systems to e.g. the internet is dangerous for this reason, labs will do it, because the path from current systems to future danger is complex and we may not see legibly catastrophic failures until it is too late. So, even getting better at predicting may not help.

Christiano disagrees that building feedback loops is hard in alignment. We can almost certainly study reward hacking in vitro in advance, together with clear measurements of whether we are succeeding at mitigating the problem in a way that should be expected to generalize to AI coup. Conditioned on deceptive alignment being a problem that emerges, there’s a >50% chance that we can study it in the same sense.

Furthermore, Christiano argues most plausible approaches to AI alignment have much richer feedback loops than the general version of either of these problems. For example, if you have an approach that requires building a kind of understanding of the internals of your model then you can test whether you can build that kind of understanding in not-yet-catastrophic models. If you have an approach that requires your model being unable to distinguish adversarial examples from deployment cases, you can test whether your models can make that distinction. You can generally seek methods that don’t have particular reasons to break at the same time that things become catastrophic.

GA is skeptical that alignment techniques won’t break by default when we shift from in vitro to more complex, real life environments.

In vivo, models will probably be given access to computational workspaces. For instance, they’ll be able to take notes on the internet. This will likely change how they represent and store information, enabling models to evade interpretability tools that rely on the weights containing all of a model’s representations.

(GA also mentions we may optimize against our interpretability methods. Christiano thinks we could test the extent to which this happens in advance and doesn’t think the interpretability methods to which this could happen are appealing anyways.)

Christiano questions whether GA expects AIs to acquire the ability to represent concepts in ways that evade interpretability tools around the same time as catastrophe.^[1]

GA replies he expects catastrophe because of this.

Can using AI assistants to help humans supervise AIs work?

GA suggests talking about Christiano’s models of how AI supervision can work.

Christiano gives a simplified story:

We train Planning AI by sampling a random set of their decisions and evaluating those decisions.
We perform evaluations by using Supervisor AIs.
That process works well enough that any bad action (i.e. one it knows that we wouldn’t like on reflection) would predictably get a low reward.
So as far as our model is concerned, our loss ranks actions by goodness.
So SGD selects a model that is trying to produce good actions.
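The simplified story above can be sketched as a toy training loop. This is only an illustration under strong assumptions, not Christiano's actual proposal: the "Planning AI" is reduced to a softmax policy over a handful of discrete actions, the "Supervisor AI" is a hypothetical `supervisor_score` function that returns the true goodness of an action plus noise, and a REINFORCE-style update stands in for SGD. All names and parameter values are made up for the example.

```python
import math
import random


def softmax(logits):
    """Convert logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def supervisor_score(action, true_goodness, noise, rng):
    """Hypothetical stand-in for an assisted evaluation: true goodness plus noise."""
    return true_goodness[action] + rng.gauss(0.0, noise)


def train_planner(true_goodness, steps=4000, sample_rate=0.5,
                  lr=0.1, noise=0.1, seed=0):
    """Train a toy policy on a randomly sampled subset of its own decisions."""
    rng = random.Random(seed)
    n = len(true_goodness)
    logits = [0.0] * n
    baseline = 0.0  # running average of observed reward
    for _ in range(steps):
        probs = softmax(logits)
        action = rng.choices(range(n), weights=probs)[0]
        # Only a random subset of decisions is evaluated;
        # unevaluated decisions produce no update.
        if rng.random() > sample_rate:
            continue
        reward = supervisor_score(action, true_goodness, noise, rng)
        advantage = reward - baseline
        baseline += 0.05 * (reward - baseline)
        # REINFORCE update for a softmax policy:
        # d log pi(action) / d logit_i = 1[i == action] - probs[i]
        for i in range(n):
            grad = (1.0 if i == action else 0.0) - probs[i]
            logits[i] += lr * advantage * grad
    return softmax(logits)


if __name__ == "__main__":
    # Action 2 is the one we would most endorse on reflection.
    print(train_planner([-1.0, 0.0, 1.0]))
```

In this toy setting the reward ranks actions by goodness only because `supervisor_score` is accurate by construction. Whether real evaluations are that accurate is exactly what the discussion is about.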

The two most important ways Christiano thinks supervision may fail:

1. Our evaluations are not good enough.
2. We obtain a deceptively aligned model that gets a high reward only to achieve its long term objectives, and then off distribution it pursues those objectives.

The point of using Supervisor AIs to improve your evaluations is to address problem 1.

GA raises the difficulty that Planning AIs will eventually need to propose complex, potentially dangerous plans to get us through the acute risk period. Given that we cannot train on similarly complex, dangerous examples, how can we expect Supervisor AIs’ evaluation ability to generalize well?

Christiano asks what mechanism allows the Planning AI to generalize in vivo (in order to propose plans for disempowering humanity) but not the Supervisor AI (such that it can’t evaluate such plans).

GA offers a few stories of how the Planning AI could gain capabilities; Christiano still doesn’t think they address why the supervisors don’t similarly benefit or why they result in the Planning AI desiring to kill everyone.

This reveals some confusions of GA’s about Christiano’s model of supervision. Christiano clarifies:

GA: How would the Supervisor AI learn the techniques the Planning AI uses?
Christiano: Planning AIs and Supervisor AIs get all the same data. Roughly: the Planning AI is an RL system trained with a reward function equivalent to “assisted human evaluation of the consequences of plans”. The Supervisor AI is a head on the same model with a reward function equivalent to “assisted human evaluation of the accuracy of answers.”
GA: The Planning AI makes predictions by interacting with its environment. For instance, it might use a calculator. How does the Supervisor AI make its predictions?
Christiano: The Planning AI and the Supervisor AI see the same data from the environment. (Note this is a point of departure in Jan Leike’s plan.) The Planning AI may want to use a calculator. The Supervisor AI can give it a low reward if it seems risky. If the Planning AI goes ahead with consulting the calculator, both will see the results.
GA: How does the Supervisor AI become well calibrated about which plans are risky?
Christiano: SGD should figure out the loss if it starts systematically wrong about which plans are risky.
GA: Ok. In sum, the Supervisor AI is like a language model that has access to an RL Planning AI's internals, and we calibrate it on past data.
Christiano: Clarification: the internals of the Planning AI are trained both to help the Planning AI and the Supervisor AI do their jobs. My work is focused on cases where the Planning AI implicitly understands something about the world, or about the consequences of a plan, but the Supervisor AI doesn’t use that implicit knowledge to answer questions.
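The "two heads on the same model" idea in this exchange can be illustrated with a deliberately tiny sketch. It assumes, purely for illustration, a one-parameter shared trunk, two linear heads, and squared-error losses in place of the RL reward signals described above. The class and method names are hypothetical. The only point it shows is that gradients from both heads flow into the same shared internals.

```python
from dataclasses import dataclass


@dataclass
class TwoHeadModel:
    trunk: float = 0.5      # shared internals used by both heads
    plan_head: float = 1.0  # hypothetical Planning AI head (kept fixed here)
    sup_head: float = 1.0   # hypothetical Supervisor AI head (kept fixed here)

    def forward(self, x):
        """Both heads read the same hidden value computed from the same input."""
        h = self.trunk * x
        return self.plan_head * h, self.sup_head * h

    def trunk_gradients(self, x, plan_target, sup_target):
        """Gradient of each head's squared error with respect to the trunk."""
        plan_out, sup_out = self.forward(x)
        g_plan = 2 * (plan_out - plan_target) * self.plan_head * x
        g_sup = 2 * (sup_out - sup_target) * self.sup_head * x
        return g_plan, g_sup

    def step(self, x, plan_target, sup_target, lr=0.1):
        """One gradient step on the trunk using the sum of both losses."""
        g_plan, g_sup = self.trunk_gradients(x, plan_target, sup_target)
        self.trunk -= lr * (g_plan + g_sup)


if __name__ == "__main__":
    model = TwoHeadModel()
    model.step(x=1.0, plan_target=1.0, sup_target=2.0)
    print(model.trunk)
```

Because the trunk update sums both gradients, the shared internals are shaped by the supervision objective as well as the planning objective. That is the property Christiano's clarification relies on.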

GA raises another worry: the Planning AI + Supervisor AI system could work, then the Planning AI could build or become part of a more complex system of Planning AI + parts of its environment that’s hard to align.

Christiano responds that if this is the main source of alignment difficulty, then it’s a kind of buck-passing problem: it’s hard for us to align AI because that AI will build or become part of a bigger AI or system that’s hard to align. At some point the buck must stop and we need to hit the real difficulty.

In the case of the Planning AI, we’re evaluating the safety of its plans, which include the impacts of those plans on alignment. So, Christiano asks,

Do our evaluations fail to capture those effects?
Is there a differential capabilities problem where RL-AI is unable to make those evaluations?
Do we fail at alignment such that RL-AI doesn’t even produce actions that we judge to be good? (If so, then we had more immediate problems.)

GA argues that, as systems get more complex, seeing consequences and thereby keeping systems pointing at our objectives gets harder.

Christiano asks how this is different to the argument “as humans use computational tools and less computation happens in humans, the cognition is less and less pointed at anything we care about”?

GA answers that it’s not different – ML will just facilitate this in an extreme way, both within systems and for humanity. In particular, he worries that some ML systems will make discoveries/have interactions with their environment in a way that has analogous effects to humans discovering technologies like language, and these will quickly change what those ML systems are optimizing for. So, aligning ML systems in a narrow sense will fail to robustly solve alignment.

Christiano broadly agrees that e.g. RLHF, adversarial training, and interpretability don’t help with the risks GA is raising. He also asks how similar GA’s views are to those of others who expect recursive self-improvement / a sharp left turn. GA isn’t sure of their views, but he understands others to be focused on hardware improvements, whereas he is worried about discoveries made/heuristics learned during training/once deployed.

Recap of Cruxes

If we align ML in the narrow sense Christiano is advocating for, GA is concerned that ML systems will become part of a larger system that won’t be aligned, and very few of the technical tools Christiano is imagining can help much with that.

GA thinks interpretability won’t help for understanding optimization distributed across many ML systems or between a single system and its tools; it will at best show how ML systems ‘think about’ tools and each other without revealing the cognition that actually kills us.

Christiano is skeptical if, by a single ML system, we mean no parallelization and e.g. just one big neural net running in serial. If we just mean we train one neural net and run it trillions of times in parallel, then he’s more on board.

Christiano and GA agree in principle that one ML system interacting with a computational workspace (even a very lightweight one like “pencil and paper”) can perform kinds of cognition qualitatively different from what happens within the ML system itself. Such a system could end up deliberately and creatively disempowering everyone without having any kind of internal representation of that fact.

Coordination

GA asks how Christiano thinks AGI labs are likely to behave: how much will they invest in alignment, how much will they slow AI development, etc.?

Christiano predicts AGI labs will invest 5-50% of total investment in alignment. Slowdown is harder to quantify. He thinks if AI systems are deceptively aligned we can probably get some evidence of that, and if they are reward hacking we can even more probably get evidence of that, and in the median case this will lead to significant slowdowns. He admits slowdowns longer than 1-2 years are hard for institutional/political/social reasons.

GA asks why the evidence Christiano is imagining would be so worrying compared to existing toy examples.

Christiano says current evidence shows reward hacking is possible in principle. The evidence he expects will be legibly scarier and more directly relevant, e.g.:

Based on red teaming exercises, a coup seems like it has a non-trivial probability of success.
In vitro we see that similar AI systems exhibit abrupt transitions at the point when overpowering the overseer becomes positive EV.
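The "abrupt transition" in the second example can be made concrete with a small expected-value calculation. This is a sketch under made-up assumptions: a pure reward maximizer, three hypothetical payoffs (`r_comply`, `r_takeover`, `r_caught`) with `r_caught < r_comply < r_takeover`, and a single success probability `p_success`. None of these numbers come from the discussion.

```python
def takeover_threshold(r_comply, r_takeover, r_caught):
    """Success probability at which attempting takeover breaks even with complying.

    Solves p * r_takeover + (1 - p) * r_caught = r_comply for p.
    Assumes r_caught < r_comply < r_takeover.
    """
    return (r_comply - r_caught) / (r_takeover - r_caught)


def best_action(p_success, r_comply=1.0, r_takeover=10.0, r_caught=-5.0):
    """Action chosen by a pure expected-reward maximizer (illustrative payoffs)."""
    ev_attempt = p_success * r_takeover + (1 - p_success) * r_caught
    return "attempt" if ev_attempt > r_comply else "comply"


if __name__ == "__main__":
    print(takeover_threshold(1.0, 10.0, -5.0))
    for p in (0.1, 0.39, 0.41, 0.9):
        print(p, best_action(p))
```

Below the break-even probability such an agent looks perfectly well behaved, and just above it the agent switches strategy entirely. That is why behavior in this kind of toy model changes abruptly rather than gradually as capability grows.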

GA asks whether Christiano actually expects warning shots like failed coup attempts and thinks they will be compelling.

Christiano actually expects them and thinks they’ll be compelling enough for DeepMind/OpenAI/Anthropic/Google/Meta. On his model, people’s objection to slowdown is that existing AI systems pose an extremely minimal risk, which is true.

GA doubts we’ll get this much value from warning shots. He expects bad behavior in vitro will not be compelling and bad behavior in real life won’t be attempted until an AI thinks it has amassed sufficient power to succeed.

Christiano offers a specific scenario he thinks would be compelling:

AI can try to reward hack, but if it does it will get corrected by an overseer.
But a smart enough AI can successfully incapacitate the overseer.
We see AI behaving well, and then once smart enough it incapacitates the overseer without intermediate unsuccessful reward hacking attempts.

Some ways Christiano thinks this could fail:

This is demonstrated and people are broadly aware of the risk, but it’s hard to fix or to precisely estimate risk and so we only get a bit of delay before the world moves forward.
Deceptive alignment occurs, with no similar in vitro warning shot.
We overfit to human level, e.g. if you use RLHF (with real human evaluators) to reduce takeover risk, you will have failure only past human level.
1. Christiano notes this is particularly tough with language models: you need to build a setting in which AI systems are smart enough to understand the dynamics of overpowering the overseer without inadvertently overfitting to the level of oversight coming from webtext.

Transcript

Introduction

[GA]

let's start?

[Christiano]

sounds good

[GA]

Cool, just copy-pasting our two selections of topic [editor's note: from an email exchange before the discussion]:

“[Topics sent by Christiano]

- Probability of deceptive alignment and catastrophic reward hacking.

- How likely various concrete mitigations are to work (esp. interpretability, iterated amplification, adversarial training, theory work)

- How are labs likely to behave: how much will they invest in alignment, how much will they (or regulators) slow AI development.

- Feasibility of measuring and establishing consensus about risk.

- Takeoff speeds, and practicality of delegating alignment to AI systems.

- Other sources of risk beyond those in Christiano's normal model. Probably better for GA to offer some pointers here.”

“[Topics sent by GA]

- How much will reinforcement learning with human feedback and other related approaches (e.g., debate) lead to progress on prosaic alignment? (similar to Christiano's point number 2 above)

- How much can we rely on unaligned AIs to bootstrap aligned ones? (in the general category of "use relatively unaligned AI to align AI", and matching Christiano's second part of point number 5 above)

- At the current pace of capabilities progress vis-a-vis prosaic alignment progress, will we be able to solve alignment on time?

- General discussions on the likelihood of a sharp left turn, what it will look like and how to address it. (related to "takeoff speeds" above, in point number 5 above)

- AGI timelines / AGI doom probability”

[Christiano]

I would guess that you know my view on these questions better than I know your view

I have a vague sense that you have a very pessimistic outlook, but don’t really know anything about why you are pessimistic (other than guessing it is similar to the reasons that other people are pessimistic)

[GA]

Then I guess I am more interested in

“- How likely various concrete mitigations are to work (esp. interpretability, iterated amplification, adversarial training, theory work)

- How are labs likely to behave: how much will they invest in alignment, how much will they (or regulators) slow AI development.”

as these are where most of my pessimism is coming from

> [Christiano]: “(other than guessing it is similar to the reasons that other people are pessimistic)”

I guess I could start with this

[Christiano]

it seems reasonable to either talk about particular mitigations and whether they are likely to work, or to try to talk about some underlying reason that nothing is likely to work

Alignment Difficulty

[GA]

I think the mainline for my pessimism is:

There is an AGI race to the bottom
Alignment is hard in specific ways that we are bad at dealing with (for instance: we are bad at predicting phase shifts)
We don't have a lot of time to get better, given the pace of the race

[Christiano]

(though I’d also guess there is a lot of disagreement about what happens by default without anything that is explicitly labelled as an alignment solution)

[GA]

> [Christiano] “(though I’d also guess there is a lot of disagreement about what happens by default without anything that is explicitly labelled as an alignment solution)”

We can also explore this, yup 🙂

[Christiano]

by AGI race to the bottom, how much do you mean (i) investment in alignment will be low, (ii) people won’t be willing to slow development/deployment if needed, (iii) something else?

I’d guess what-I’d-call-alignment-investment will be 5-50% of total investment depending on how severe the risks look

if the risks look significant-but-kind-of-subtle to me I’d guess that we will get something like 3-6 months of delay based on concern; I think in my median doomy case we are able to get more like 1-2 years of delay

[GA]

> [Christiano] “ (i) investment in alignment will be low

(ii) people won’t be willing to slow development/deployment if needed

(iii) something else”

I expect the biggest thing is that I don't expect that it will be easy for investment to be directed to alignment work that moves the needle. It is very hard for funders to know what makes for good work and what doesn't

[Christiano]

I don’t see intuitively how that’s related to a race to the bottom

[GA]

> [Christiano] “if the risks look significant-but-kind-of-subtle to me I’d guess that we will get something like 3-6 months of delay based on concern; I think in my median doomy case we are able to get more like 1-2 years of delay”

I agree, and a big chunk of this is because OAI / Anthropic / DM have concerns about it + are at the forefront

[Christiano]

you could think that people will invest 30% of their resources on “alignment” but not be effective

race to the bottom mostly seems relevant insofar as it causes that % to get lower, or makes people go ahead despite signs of trouble

[GA]

> [Christiano] “I don’t see intuitively how that’s related to a race to the bottom”

I meant that I expect investment in alignment to be low not necessarily because people will not be concerned, but because there is a long lossy causal path between concern about alignment and effectiveness

> [Christiano] “race to the bottom mostly seems relevant insofar as it causes that % to get lower, or makes people go ahead despite signs of trouble”

Agreed. On top of this, I also expect total investment in alignment to be more on the order of 1-5%

But it is hard to discriminate between "spending on alignment" from "spending on PR to save face if there's a small accident"

[Christiano]

I expect the pattern here is going to be you looking at investments and saying “that’s not real alignment” and so the total being low, and the real action is in you having an opinionated picture of what alignment progress looks like that I don’t yet know

perhaps the easiest way to get at that is to talk about particular stuff that I consider to be helpful alignment progress but you don’t, e.g. I think it’s plausible “use AI assistants to help humans more carefully supervise AI” is in this category

[GA]

> [Christiano] “I expect the pattern here is going to be you looking at investments and saying “that’s not real alignment” and so the total being low, and the real action is in you having an opinionated picture of what alignment progress looks like that I don’t yet know”

Yup, this is what I was trying to point at. As a result, I don't think it's a good proxy

Phase Shifts 1

[Christiano]

Re: “alignment is hard in specific ways,” I don’t know what your model of the problem is but am happy to discuss that

In particular, I don’t know what phase shifts you have in mind. The obvious ones are “AI coup becomes a good strategy for getting reward” or “AI coup becomes a good strategy for achieving a deceptively aligned model’s long-term aims,” but it seems like we are good at predicting that those things will happen at some point, and I don’t see the connection between alignment and predicting exactly when they will happen

(I expect you may be talking about other phase shifts)

[GA]

@Christiano I meant, predicting when phase shifts happen, sorry

[Christiano]

yeah, that’s what I thought you meant, but for the two phase shifts I mentioned (i) it’s not clear why alignment is connected to predicting exactly when it will happen, (ii) it seems like we have a sense of roughly when they would happen and measurement strategies to help understand better (though this mostly seems necessary for getting more precise predictions of risk rather than mitigating risk)

[GA]

> [Christiano] “in specific ways”

The ways I have in mind:

We are very bad at factoring complex concepts into smaller, more tractable systems without having a lot of quantitative feedback. The median cases I have in mind here are psychology, sociology and philosophy.

We are very bad at building those feedback loops when working on abstract/conceptual things.

We are very bad at coming to agreement on abstract/conceptual things.

[Christiano]

what do you mean by “factoring complex concepts into smaller more tractable systems” in the context of alignment?

[GA]

The phase shifts I care about are more along the line of: "When does the system start to represent knowledge / beliefs / values in a way that is qualitatively different". I expect this to arise with a lot of interactions with the environment, and for the ppl training models in unsupervised manner to have a headstart over the ppl doing this in a way where you're careful about the internals of the models, etc.

[Christiano]

(I also don’t have a clear sense of why feedback loops are hard on alignment, this is one of the classic disagreements which probably deserves its own thread)

[GA]

> [Christiano] “what do you mean by “factoring complex concepts into smaller more tractable systems” in the context of alignment?”

"intelligence", "alignment", "corrigibility" are concepts that are way too big. You'd like to be able to reduce them to smaller systems, that are easier to discuss, study and experiment with

[Christiano]

> [GA] “When does the system start to represent knowledge / beliefs / values in a way that is qualitatively different”

I don’t know what kind of difference you are imagining, this also probably deserves its own thread? I mostly don’t imagine particular qualitative differences between AI systems of today and those that kill us (though of course there’s a significant probability of qualitative differences, especially the longer timelines get or if you talk about the alignment problem our AIs need to solve)

(but I’m more skeptical of people who seem to have particular predictions about how qualitative changes will change the picture)

[GA]

> [Christiano] “its own thread”

Let's create it 🙂

Feedback Loops

[Miotti, doing light moderation]

Since you both mentioned it, here is a "Why are feedback loops hard in alignment?" thread

[Christiano]

My high level take here is: we can almost certainly study reward hacking in vitro in advance, together with clear measurements of whether we are succeeding at mitigating the problem in a way that should be expected to generalize to AI coup, and conditioned on deceptive alignment being a problem that emerges there’s a >50% chance that we can study it in the same sense

I see basically two arguments for feedback being hard:

* From an institutional perspective there is a fundamental disanalogy between problems in vitro and problems that actually kill everyone, and it is harder to get people worked up about in vitro failures. Whereas for problems we really solve, reality tends to hit you in the face with the problem more, so you can’t ignore it.

* Deceptive alignment may simply not appear in vitro and may by design be impossible to study in vivo.

(Realistically I think point #2 mostly just means that the things you study in the lab have an additional disanalogy to catastrophic deceptive misalignment)

A further view I have is: most plausible approaches to AI alignment have much richer feedback loops than the general version of either of these problems. For example, if you have an approach that requires building a kind of understanding of the internals of your model then you can test whether you can build that kind of understanding in not-yet-catastrophic models. If you have an approach that requires your model being unable to distinguish adversarial examples from deployment cases, you can test whether your models can make that distinction. You can generally seek methods that don’t have particular reasons to break at the same time that things become catastrophic

[GA]

My high level take is: I don't expect it is possible to extrapolate systems acting well in distribution to real-life. I would be very surprised if we get to something like a generalist agent that can successfully learn new things and interact in the real-world, and no one has gotten it to RSI / Sharp-Left-Turn and kill us all.

The reason that last thing is relevant is that I expect until you get to something that has those properties (or something adjacent to them), it will be hard to extrapolate well from in-vitro to real-life.

[Christiano]

Is your position specific to deceptive alignment (where I’m also granting maybe ~50%, so the disagreement might be more quantitatively subtle), or can we also discuss it in the context of reward hacking (where I have a more confident view)?

[Miotti]

> [Christiano] “Is your position specific to deceptive alignment (where I’m also granting maybe ~50%, so the disagreement might be more quantitatively subtle), or can we also discuss it in the context of reward hacking (where I have a more confident view)?”

@GA if you have time it would be interesting to see your response to this point (here or in another thread)

Phase Shifts 2

[Christiano]

> [GA] “When does the system start to represent knowledge / beliefs / values in a way that is qualitatively different”

> [Christiano] “I don’t know what kind of difference you are imagining, this also probably deserves its own thread? I mostly don’t imagine particular qualitative differences between AI systems of today and those that kill us”

[GA]

Things that would make me less at ease, in this direction:

We have deployed systems that can persist information in a variety of different ways, such that we cannot necessarily identify "where" a piece of information / decision making is located.
We train against interpretability, incentivizing the systems to make our interpretability theories moot.

[Christiano]

I don’t think I fully understand your position here

do you believe that there are particular key qualitative differences between today and catastrophically risky AI?

and that we’ll cross those differences at ~the same time that systems become catastrophically risky, so we can’t study them?

[GA]

> [Christiano] “and that we’ll cross those differences at ~the same time that systems become catastrophically risky, so we can’t study them?”

Not necessarily at the same time, but not early enough that we have enough time to study them.

[Christiano]

OK, so I guess the question is (i) what are the qualitative differences, (ii) how do they fundamentally change the story?

Maybe will split those into separate threads?

I don’t exactly know what “persist information in a variety of ways” means as a qualitative difference. Right now we optimize a forward pass of our models end-to-end, and they store long-term information via reading and writing text (which we supervise directly). My rough understanding is that you are talking about systems that instead read and write information to long-term storage in a way that is optimized end-to-end, so that long-term storage becomes incomprehensible in the same way that intermediate layers of a transformer are incomprehensible. Is that right?

[GA]

> [Christiano] “Right now we optimize a forward pass of our models end-to-end, and they store long-term information via reading and writing text (which we supervise directly).”

I would say they store information in their weights. But you could imagine an extension where they do so through text, and with enough interpretability / regularization to make sure that they are only using the text in the intended way. Is this what you are referring to?

[Christiano]

I would say SGD stores information in the weights of a model. Then models are able to write and read text in a context, and we can hook them up to tools where they can retrieve text to help do the current task, and that’s their only long-term storage

[GA]

> [Christiano] “My rough understanding is that you are talking about systems that instead read and write information to long-term storage in a way that is optimized end-to-end, so that long-term storage becomes incomprehensible in the same way that intermediate layers of a transformer are incomprehensible. Is that right?”

Not necessarily optimized e2e: it can be "at train time, it uses [X] feature of its environment to persist data. at run time, [Y] is better and it starts using it + we do not notice (+ possibly, the system itself does not notice!)"

“I would say SGD stores information in the weights of a model. Then models are able to write and read text in a context, and we can hook them up to tools where they can retrieve text to help do the current task, and that’s their only long-term storage”

Yup, that is what I meant by the extension I pointed at.

[Christiano]

I don’t think I understand the kind of concrete scenario you have in mind without end to end training of persisting data. It would probably help to talk about an example.

Can using AI assistants to help humans supervise AIs work?

[Christiano]

Here is a very simple story:

We train AI by sampling a random set of their decisions and evaluating those decisions.

We perform evaluations by using AI assistants.

That process works well enough that any bad action (i.e. one it knows that we wouldn’t like on reflection) would predictably get a low reward.

So as far as our model is concerned, our loss ranks actions by goodness.

So SGD selects a model that is trying to produce good actions.

I think the most important two ways this story fails are:

A. Our evaluations are not good enough.

B. We obtain a deceptively aligned model that gets a high reward only to achieve its long term objectives, and then off distribution it pursues those objectives.

I’m not sure if you are worried about failure A or B or something else altogether

the point of using AI assistants to improve your evaluations is to address problem A

[GA]

I guess I am puzzled by:

* "How do you get systems to recommend relevant + potentially dangerous plans that could "end the acute risk period" before executing them in the first place?"

* "If you don't, how do you envision us getting from systems who recommend more bounded plans to less bounded plans?" Like, do you have examples of fake intermediary stages in mind?

[Christiano]

I don’t think I fully understand the difficulty you are gesturing at.

It seems like the question you are asking is: if there is a plan doing cool enough stuff that it would be dangerous to just “execute it and see what happens,” then how do you decide whether it’s safe?

And I am saying: you use AI systems to help you evaluate the consequences of that plan.

And I think you are gesturing at some fundamental obstruction to doing that, which is what I’m not getting.

[GA]

> [Christiano] “And I am saying: you use AI systems to help you evaluate the consequences of that plan.”

Let me try to rephrase: how do you get such AIs without having actually executed those plans? How do you know they are well-calibrated?

If I have to guess, your story is something like:

There are constantly new capabilities

The capabilities before and after world-ending potential are close enough that you can study and act on the pre world-ending potential plans safely, and then extrapolate to after world-ending potential

[Christiano]

Neither aligned nor unaligned AI will have practical experiences with world-destroying plans.

But I don’t think this has immediate implications for the alignment story we were talking about.

Maybe I can be clearer about the kind of failure I’m worried about, and you can say whether this is also what you are worried about (or if you are talking about something else): I’m worried about AI systems which deliberately take actions to disempower humanity in order to achieve their own goals.

[GA]

> [Christiano] “

We train AI by sampling a random set of their decisions and evaluating those decisions.

We perform evaluations by using AI assistants.

That process works well enough that any bad action (i.e. one it knows that we wouldn’t like on reflection) would predictably get a low reward.

So as far as our model is concerned, our loss ranks actions by goodness.

So SGD selects a model that is trying to produce good actions.”

How do you extrapolate from decisions + settings from 1, to decisions + settings that have world-ending potential?

[Christiano]

So I’m not worrying about whether my AI is well-calibrated about potential world-ending plans. I’m worried about whether it is using whatever capabilities it has to deliberately try to disempower humanity.

And so I’m trying to defuse the possible risks that could cause our AI to do that.

And there are specific ways that a transfer from not-world-destroying plans to potentially-world-destroying plans could do that, but there’s no general force that’s like “It’s a different kind of plan, so probably your AI will start trying to disempower humanity”

[GA]

> [Christiano] “So I’m not worrying about whether my AI is well-calibrated about potential world-ending plans. I’m worried about whether it is using whatever capabilities it has to deliberately try to disempower humanity.”

AFAIU: in the story you outline, the way you alleviate your worry is "We perform evaluations by using AI assistants". My question is "Why do you think those evaluations will be good if they come from a different regime" and more generally "How do you expect to be well-calibrated as to whether they will be good or not given that we will run them in a different regime".

[Christiano]

By “good” evaluation we mean: a clever plan for permanently disempowering humanity will receive a low ranking.

So in the scenario where the AI-being-supervised produces a plan that cleverly disempowers humanity, the hope is that the AIs-helping-supervisors can help humans understand that this plan would disempower humanity.

I’m very interested in talking about specific reasons that the AI-being-supervised can generalize to a new domain (in order to propose a plan for disempowering humanity) and the supervising AI can’t.

[GA]

> [Christiano] “By “good” evaluation we mean: a clever plan for permanently disempowering humanity will receive a low ranking.”

I'd like to walk through a thought experiment, not meant to be a take-down, just to understand your pov better:

1. The SupervisorAI + Humans system is good enough at evaluating plans up to 1 day in the future

2. PlanningAI learns to boycott bad plans up to 1 day in the future (or even 1 week!)

3. PlanningAI gains a new capability, allowing it to plan much longer in the future (up to years)

4. PlanningAI recommends bad plan with much higher time horizon

5. SupervisorAI + Humans validate, and get screwed

Where do you think this thought experiment is wrong?

[Christiano]

I don’t think it’s even wrong?

the magic is in the step where you say “then my AI changes so that it has new capabilities and uses them to kill you” without talking about the mechanism by which it happens

I’m interested in talking about those mechanisms

[GA]

ok, so step 3

thx

> [Christiano] “I’m very interested in talking about specific reasons that the AI-being-supervised can generalize to a new domain (in order to propose a plan for disempowering humanity) and the supervising AI can’t.”

Let's dig into that

[Christiano]

yeah, I think that’s the same as the magic in step 3

[Christiano]

I don’t know exactly what “we train against interpretability, incentivizing the systems to make our interpretability theories moot” means. Are you saying that you think there is a significant chance that we will do this in the real world, but we will be unable to test it in advance?

I understand that some approaches to interpretability would break down if you optimized against them, and that some training strategies call for optimizing against them. But it seems like (i) we can easily test that in advance, it’s not a qualitative change, (ii) there’s a good chance that we don’t use such methods, probably >50%

[GA]

I was trying to give an example of a thing where the system would start representing information differently.

The thing I was pointing at was:

Right now, LLMs use directions/polytopes to represent concepts
We start optimizing against specific directions (or unsupervizedly learnt directions, etc.)
The system becomes optimized to represent the forbidden things outside of those directions, or without directions whatsoever
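
A toy illustration of the three steps above (not from the discussion; the two-dimensional linear setup, the parameter names, and the penalty are all assumptions): a concept value x is stored as h = (a·x, b·x), the task only needs h[0] + h[1] to equal x, and the "interpretability tool" only inspects direction 0. Penalizing what the tool sees pushes gradient descent to move the information into the uninspected direction while the task loss stays near zero.

```python
def train_against_probe(penalty=1.0, lr=0.1, steps=2000):
    """Gradient descent on task loss plus a penalty on the probed direction.

    Representation of a concept value x: h = (a * x, b * x).
    Task loss:    (a + b - 1)^2   (the readout h[0] + h[1] should equal x)
    Probe penalty: penalty * a^2  (the tool only looks at direction 0)
    """
    a, b = 1.0, 0.0  # initially the concept lives entirely in the probed direction
    for _ in range(steps):
        task_err = a + b - 1.0
        grad_a = 2.0 * task_err + 2.0 * penalty * a
        grad_b = 2.0 * task_err
        a -= lr * grad_a
        b -= lr * grad_b
    return a, b


a, b = train_against_probe()
probe_reading = a                   # what the hypothetical tool reports
task_loss = (a + b - 1.0) ** 2      # how well the concept is still represented
```

After training, the probe reads almost nothing while the concept is still fully represented, just in a direction the tool was never pointed at. Whether real models behave like this toy is exactly the empirical question being debated.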

[Christiano]

Yeah, I’m saying: (i) that’s a possible proposal but not particularly appealing, (ii) if that’s what you are worried about you can study it as soon as you have the interpretability tools and start training against them, no qualitative changes necessary

maybe you think there is a qualitative change when the AI acquires the ability to represent concepts in ways that evade your tools? but why is that happening around the same time as catastrophe? my expectation is that it’s just going to be the case now

[GA]

> [Christiano] “Yeah, I’m saying: (i) that’s a possible proposal but not particularly appealing”

I think it is quite appealing: you find a concept/behavior that you dislike, and you train the system against it.

But I mostly used it as an example to explain what I meant. The context was "Here are examples of things that would worry me, if we started doing this on things with doom potential, because I expect the way information is represented to change in a way that makes our interpretability techniques moot"

> [Christiano] “but why is that happening around the same time as catastrophe?”

i think it is more the other way around: i expect catastrophes to come from a thing like that

[Christiano]

Just retracing how I think we got here: you said “I think alignment is hard because it involves stuff we are fundamentally bad at, like predicting phase shifts.” I asked “What kind of phase shifts?” And it seemed like you were offering this as an example of the kind of phase shift you have in mind. And now I’m asking: why can’t we study this phase shift in advance? Why can’t we construct cases where “we try to train our AI not to think about X, but then it learns to think about X anyway” prior to catastrophic AI, and then study them in advance?

[GA]

> [Christiano] “And it seemed like you were offering this as an example of the kind of phase shift you have in mind. And now I’m asking: why can’t we study this phase shift in advance? Why can’t we construct cases where “we try to train our AI not to think about X, but then it learns to think about X anyway” prior to catastrophic AI, and then study them in advance?”

Oh, sorry for the confusion.

An example of phase shift I was worried about was "AI starts representing information differently". To make it clearer to you what I meant by that, I used this as an example of "AI starts representing information differently".

[Christiano]

It seems to me like this is a common kind of “phase shift”---an ML model is constantly representing information in new and changing ways

do you agree and think this happens a lot, or is it something that happens a small number of times, or something that happens just once?

if it happens a lot, then it seems like something we can study a bunch

if you think it happens just once, then I’m going to be asking “why and when?”

[GA]

(also, wow, you type fast!)

New thread:

Context:

> [Christiano] “Here is a very simple story:

* We train AI by sampling a random set of their decisions and evaluating those decisions.

* We perform evaluations by using AI assistants.

* That process works well enough that any bad action (i.e. one it knows that we wouldn’t like on reflection) would predictably get a low reward.

* So as far as our model is concerned, our loss ranks actions by goodness.

* So SGD selects a model that is trying to produce good actions.”

> [GA] “I'd like to walk through a thought experiment, not meant to be a take-down, just to understand your pov better:

* The SupervisorAI + Humans system is good enough at evaluating plan up to 1 day in the future

* PlanningAI learns to boycott bad plans up to 1 day in the future (or even 1 week!)

* PlanningAI gains a new capability, allowing it to plan much longer in the future (up to years)

* PlanningAI recommends bad plan with much higher time horizon

* SupervisorAI + Humans validate, and get screwed”

Just to understand your story: how does the AI Assistant + Human system get better at evaluating decisions without acting upon them?

[Christiano]

how does PlanningAI acquire this ability to plan a year into the future without acting on plans up to years in the future?

whatever your answer to that question is, I’m going to try to explain why the same mechanism can apply to the supervising AI

[GA]

I have answers to this question, but they do not result in something bounded / safe / aligned / corrigible.

[Christiano]

I can also list dozens of concrete possible mechanisms (e.g. they use deductive reasoning, they plan against their predictions, they use procedures that iterate N steps and which generalize to larger N, they improve predictions and use them as part of their planning algorithm, whatever…) but I think none of these help answer your general question

yes, that seems like the action

you are saying that you have a reason that your AI acquires the ability to plan a long time into the future, and then the inclination to use it to disempower humanity, and no similar mechanism can be used by a supervisor

and I’m saying: can we discuss one example?

[GA]

> [Christiano] “whatever your answer to that question is, I’m going to try to explain why the same mechanism can apply to the supervising AI”

Cool. Let me try to reply and also predict what you would say.

My reply: train an online RL model on a set of more and more complicated real-life tasks involving all the types of capabilities you can imagine (multimodalities, planning, etc.). At some point, it learns enough heuristics that it develops General Intelligence, and it can use that to develop new heuristics in a faster way, without having to store them in its weights. Unpredictability ensues.

I am not sure I can predict your answer, sorry

Ah, I might have missed a step of reasoning:

> [GA] “At some point, it learns enough heuristics that it develops General Intelligence, and it can use that to develop new heuristics in a faster way, without having to store them in its weights. Unpredictability ensues.”

Those heuristics are much less constrained by the training objective, and much more by interactions of the system with its environment (which we are bad at modeling). This is much more organic, much less controllable and much less aligned in general.

[Christiano]

I think this is probably an aside, but I hate the concept of “General Intelligence.” I think there are particular behaviors like “identify new cognitive strategies, evaluate them by reasoning and performing relevant experiments, use the best-looking strategies.” I don’t think that particular ability is magical or unique, I think that there are tons of cognitive abilities at that level of abstraction and you will have lots of those.

[GA]

> [Christiano] “I think this is probably an aside, but I hate the concept of “General Intelligence.” I think there are particular behaviors like “identify new cognitive strategies, evaluate them by reasoning and performing relevant experiments, use the best-looking strategies.” I don’t think that particular ability is magical or unique, I think that there are tons of cognitive abilities at that level of abstraction and you will have lots of those.”

It isn't. I agree, and was writing a post about that. (I feel dirty whenever I write it)

[Christiano]

I don’t think we’ve yet said much about why the supervisors don’t benefit from the same dynamic or why it results in the desire to kill everyone.

like, it feels like there’s a step “and then a miracle occurs” and so it’s very hard to talk about what happens after that step

but if we get more specific about the miracle then it no longer seems so hard

Like, a natural more specific story is: there are heuristics that are helpful for both the short-term tasks on which PlanningAI is trained and the task “Disempower humanity,” or the meta task “Come up with a good strategy for a new task,” or whatever

so planningAI gets better and better versions of those heuristics

[GA]

Oh, I was waiting for your reply as to why the supervisor would benefit

I can answer as to why I expect it would not benefit.

[First,] I don't understand how planningAI gets those heuristics: in the story I describe, RL-AI gets those heuristics by actually interacting with the real world, with tasks implemented there. Is planningAI also an RL-AI?

[Christiano]

OK, can you describe how interacting with the real world causes a model to “get those heuristics”?

The picture I’m usually imagining is: some heuristics are helpful for performing well on tasks, and so gradient descent builds up those heuristics

interacting with the real world is relevant only insofar as it is part of evaluating that those heuristics are actually helpful

(and it’s part of the loss function actually being used for selection)

[GA]

> [Christiano] “OK, can you describe how interacting with the real world causes a model to “get those heuristics”?”

Sure? Model goes on the internet, does a lot of searches, finds some pedagogy technique, tries it. First 9 don't work, 10th does

[Christiano]

That doesn’t seem like it discriminates at all between RL-AI using those techniques to disempower humanity and supervisor AI using those techniques to understand the plans proposed by RL-AI

[GA]

> [Christiano] “That doesn’t seem like it discriminates at all between RL-AI using those techniques to disempower humanity and supervisor AI using those techniques to understand the plans proposed by RL-AI”

I don't understand how SupervisorAI learns those techniques. Does it also interact with the real world? Does it just go online and try things?

[Christiano]

SupervisorAI is learning from all the same data as RL-AI, I thought it’s just a fictional term you introduced to separate the behavior that is being used to help evaluate from the behavior that is being evaluated.

So it learns from just the same data about the world

and the same experiments carried out in the world

[GA]

Ok. so RL-AI does its things, learns its things, etc. And then SupervisorAI is fine-tuned on top of it, with human feedback?

[Christiano]

the difference is that it’s trying to learn things that help e.g. answer questions about the effects of plans, whereas the RL-AI is learning to propose plans that will get a high reward

I think it’s fine to imagine RL-AI is an RL system trained with reward function = “Assisted human evaluation of the consequences of plans,” and SupervisorAI is a head on the same model with reward function = “Assisted human evaluation of the accuracy of answers.”

[GA]

Ok, so in your system:

* RL-AI is trained with whatever, access to everything
* Sup-AI is fine-tuned on top of this with human feedback to predict the effect of plans (based on RL-AI's data)
* We do this for like 1 epoch every time, to avoid having too big of a gap between RL-AI and Sup-AI?

[Christiano]

I think I’m happy to talk about any of the sane plans for scalable supervision that people are working on or writing about. If you want to talk about my favorite plan it might be a bit more complicated.

[GA]

I mean, in this story

[Christiano]

Yeah, happy to just try to make this story the simplest possible plan

[GA]

Cool

[Christiano]

So then:

RL-AI is trained on reward from its actions
SupAI is trained on all the same episodes, answering questions about the effects of plans
The loss for SupAI and the reward for RL-AI are both determined by a human who interacts with SupAI.

[GA]

> [Christiano] “

RL-AI trained with reward function = “Assisted human evaluation of the consequences of plans,”
Sup-AI is a head on the same model with reward function = “Assisted human evaluation of the accuracy of answers.”

> [GA] “

RL-AI is trained with whatever, access to everything
Sup-AI is fine-tuned on top of this with human feedback to predict the effect of plans (based on RL-AI's data)
We do this for like 1 epoch every time, to avoid having too big of a gap between RL-AI and Sup-AI”

is the evaluation just a scalar reward?

[Christiano]

RL-AI gets a scalar reward

actually both of them

[GA]

cool

so, a thing that i am confused by is that RL-AI will make predictions by interacting with its environment. for instance, it might use a calculator or something.

How does sup-AI make its predictions? You say it's just a head (I assume on the model part of RL-AI)

[Christiano]

yeah, happy for it just to be a head

it sees the same data from the environment as RL-AI (including the results of calculators or whatever)

(note this is a point of departure in Jan Leike’s plan, but happy to focus on the simpler version)

(which I think still addresses your objection fine)

so RL-AI gets to choose how to use the calculator

and sup-AI and RL-AI both get to change their beliefs based on calculator results

[GA]

And then, Sup-AI can be like "Nope nope nope"

[Christiano]

Well, the oversight process uses sup-AI to help decide a reward for RL-AI’s behavior. If RL-AI uses a calculator in a way that produces information helpful for killing everyone and not helpful for understanding the consequences of plans, then it is likely to get a low reward.

but basically, sup-AI can be like nope nope nope

and give a low reward

[GA]

Sorry, I am actually more basically confused:

“Assisted human evaluation of the consequences of plans,” = When humans + Sup-AI like the result, high reward
“Assisted human evaluation of the accuracy of answers.” = ...?

[Christiano]

If SupAI says that A is the answer to question Q, then the human can interact with SupAI to try to understand how good an answer A is, and to compare it to other possible answers A’. That defines the loss function for SupAI.

There’s another distinction in plans, about whether the feedback for RL-AI is purely process based (i.e. rewards are computed for actions prior to seeing the result of interaction with the environment) or outcomes based (where rewards can depend on the result). I’m usually imagining the process-based version but I don’t think that this is going to be important to this discussion.

Either way, the point is that RL-AI only consults the calculator if it seems like a good idea given the views of SupAI prior to seeing the results of the calculator. And then after RL-AI consults the calculator, both RL-AI and SupAI have access to the results.
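
A minimal sketch of the process-based vs. outcomes-based distinction (an illustration only; `run_step`, `rate`, and the two calculators are hypothetical names, and `rate` stands in for "a human evaluating with SupAI's help"). In the process-based variant the reward is fixed before the environment responds; in the outcomes-based variant it may depend on the result. Either way, the result is appended to the shared history that both heads can read afterwards.

```python
def run_step(propose_action, rate, environment, history, process_based=True):
    """One step of the toy protocol. Returns (reward, new_history)."""
    action = propose_action(history)
    if process_based:
        # Reward is computed from what was known BEFORE seeing the result.
        reward = rate(history, action, None)
        result = environment(action)
    else:
        # Outcomes-based: the reward may depend on what actually happened.
        result = environment(action)
        reward = rate(history, action, result)
    # Both RL-AI and SupAI get access to the result from here on.
    return reward, history + [(action, result)]


def propose_action(history):
    """Hypothetical RL-AI head: always asks the calculator the same thing."""
    return "calc: 6*7"


def rate(history, action, result):
    """Hypothetical assisted human judgement, reduced to a fixed rule."""
    score = 1.0 if action.startswith("calc:") else 0.0
    if result is not None:  # only available in the outcomes-based variant
        score += 0.5 if result == 42 else -0.5
    return score


def working_calculator(action):
    return 42


def broken_calculator(action):
    return 41


# Process-based: the reward is identical whatever the calculator returns.
r_proc_ok, hist_ok = run_step(propose_action, rate, working_calculator, [])
r_proc_bad, hist_bad = run_step(propose_action, rate, broken_calculator, [])

# Outcomes-based: the reward moves with the result.
r_out_ok, _ = run_step(propose_action, rate, working_calculator, [], process_based=False)
r_out_bad, _ = run_step(propose_action, rate, broken_calculator, [], process_based=False)
```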

[GA]

Cool.

Why do you expect Sup-AI to be well calibrated in this training process? just to get an idea

> [Christiano] “If SupAI says that A is the answer to question Q, then the human can interact with SupAI to try to understand how good an answer A is, and to compare it to other possible answers A’. That defines the loss function for SupAI.”

Also, here, I am confused. When you say "A is the answer to question Q", you mean "Action A yields reward R", yup?

Also, how can humans interact with SupAI? Does SupAI take actions to make itself more explainable, or do you just mean interpretability / playing with adjacent prompts / etc.?

[Christiano]

Not sure I understand exactly what you’re saying. If SupAI is systematically overconfident or underconfident then I expect SGD to adjust its confidence appropriately because doing so decreases the loss?

And I don’t expect either RL-AI or SupAI to have a perfect sense of the consequences of actions, the goal is more like: SupAI’s answers should be at least as correct as RL-AI’s implicit beliefs.
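
The calibration point can be sketched with a one-parameter toy (not from the discussion; the single binary question, the logistic parameterization, and the use of expected log loss are all assumptions). An overconfident predictor reports probability q = sigmoid(theta) for an event that actually happens 70% of the time, and gradient descent on the expected log loss pulls q toward the true frequency.

```python
import math


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def calibrate(true_freq=0.7, initial_confidence=0.99, lr=0.5, steps=500):
    """Gradient descent on expected log loss for a single binary question.

    Expected loss: -(f * log q + (1 - f) * log(1 - q)) with q = sigmoid(theta).
    Its gradient with respect to theta simplifies to q - f.
    """
    theta = math.log(initial_confidence / (1.0 - initial_confidence))
    for _ in range(steps):
        q = sigmoid(theta)
        theta -= lr * (q - true_freq)
    return sigmoid(theta)


q_from_overconfident = calibrate(initial_confidence=0.99)
q_from_underconfident = calibrate(initial_confidence=0.5)
```

Both starting points end up reporting roughly the true frequency, because any systematic over- or underconfidence shows up as a nonzero gradient. This only covers miscalibration that the loss can see; it says nothing about questions that never appear in training.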

[GA]

> [Christiano] “Not sure I understand exactly what you’re saying. If SupAI is systematically overconfident or underconfident then I expect SGD to adjust its confidence appropriately because doing so decreases the loss?”

Oh yeah, and if it's bad, you just stop. (Sorry, a bit tired too.)

[Christiano]

And the reason for that is that “report RL-AI’s implicit beliefs” is an option available for SGD, and so SGD will only do something else if that thing gets an even lower loss.

Not sure what you mean by “if it’s bad, you just stop”?

[GA]

If Sup-AI's SGD doesn't lead to good predictions, stop the training process.

[Christiano]

but what is a “good prediction”? I don’t think you ever have to stop because of prediction quality, that should just be a capabilities issue

[GA]

RL-AI gets high reward 95% of the time, and Sup-AI is not good at predicting when, something like that?

You want Sup-AI to be able to tell you when RL-AI will get low reward.

[Christiano]

> [GA] “ Also, here, I am confused. When you say “A is the answer to question Q”, you mean “Action A yields reward R”, yup?

Also, how can humans interact with SupAI? Does SupAI take actions to make itself more explainable, or do you just mean interpretability / playing with adjacent prompts / etc.?”

I mean that humans ask SupAI a question Q like “If we turn on the machine, what’s going to happen to the temperature in the room?” and an answer A is something like “30% probability it falls, 70% probability it stays the same.”

I think that neither Sup-AI nor RL-AI can reliably predict when RL-AI gets a high vs low reward

since that’s a complicated empirical question

[GA]

Ok, so Sup-AI is more of like a language model that has access to RL-AI's internals, and we calibrate it on past data.

I think I can move to the failure mode that I worry about in this situation

(thanks for walking me through it, even by text. it's a pain)

[Christiano]

main clarification vs that story is that the internals of RL-AI are trained to help both RL-AI and Sup-AI do their jobs

not just training RL-AI then throwing an LM on top

[GA]

makes sense

[Christiano]

And so I’m really focused on cases where RL-AI implicitly understands something about the world, or about the consequences of a plan, but Sup-AI doesn’t use that implicit knowledge to answer questions

[GA]

so my worry is that the thing that makes prediction, takes actions and makes updates is a system that progressively becomes bigger than RL-AI, and RL-AI's implicit beliefs become not that big of a deal

[Christiano]

A very simple example of that would be the case where RL-AI writes itself a new Successor-AI which takes over its job

[GA]

Yup, and is badly calibrated about how aligned Successor-AI is

I don't think it's as obvious as that though

[Christiano]

In the simple case, our hope is clearly that RL-AI will align Successor-AI just like we would try to align RL-AI. So the two big things I think about here (over and above the previous case) are:

1. Might RL-AI fail at that even if it’s “trying”? For example, maybe the problem was too hard for anyone, or maybe it is particularly bad at this task.
2. Is RL-AI integrated with some computational tools in the environment in a way that significantly complicates the alignment problem relative to just building Successor-AI?

[GA]

so I expect something like:

RL-AI & Sup-AI are the correct level of abstraction, and things work well
RL-AI leaks more and more in the world, and the bigger system interweaves a lot with RL-AI because it's the best at a given capability regime (RL-AI not noticing that this is what happens)
Bigger System gains capabilities, things go bad

[Christiano]

And my normal attitude is: problem #1 is basically the same as my relationship to future humans who might build AI, I want them to do well at aligning that AI and to make responsible forecasts about how well it will go and set up a good policy regime etc. but there is no magic silver bullet. I view problem #2 as more “my job” and then I look for concrete reasons that entanglement makes alignment harder.

But my high level take is that if this is the main source of alignment difficulty, then it’s a kind of “buck-passing” problem: it’s hard for us to align AI because that AI will build or become part of a bigger AI that’s hard to align. At some point the buck must stop and we need to hit the real difficulty.

[GA]

About #1, there's also the thing where we have very specific intellects, considering alignment and value preservation over the long term a whole lot and whatnot. I am not sure why RL-AI would care a whole lot about that: not necessarily well calibrated about difficulty of alignment, not necessarily risk averse, not necessarily super long term horizons, etc.

[Christiano]

I mean, we’re evaluating its decisions based on our expectations of their consequences, including their impacts on alignment. So I feel like the question is:

1. Do our evaluations fail to capture those effects?
2. Is there a differential capabilities problem where RL-AI is unable to make those evaluations?
3. Do we fail at alignment such that RL-AI doesn’t even produce actions that we judge to be good? (If so then I think we had more immediate problems!)

[GA]

> [Christiano] “2. Is RL-AI integrated with some computational tools in the environment in a way that significantly complicates the alignment problem relative to just building Successor-AI?”

I am not sure I understand that part. Fmpov: It means that you cannot use an AI to align an unbounded AI (i.e. one that can grow into a bigger system)

> [Christiano] “

Do our evaluations fail to capture those effects?”

Yup, as the system becomes bigger and bigger, and is only weakly optimizing for the reward, I expect we witness the relevant computation/thought/consequences less and less

[Christiano]

I feel like we’re asking: can we align an AI smarter than us? Let’s call us step 0 and our AI step 1. And it feels like the buck-passing argument is saying: aligning step 1 is really hard, because it will eventually build step 2 and it can’t align step 2. And I’m responding wait why is aligning step 2 so hard? Is it just because of running the same argument again? If so then it seems like it’s buck-passing.

[GA]

> [Christiano] “2. Is there a differential capabilities problem where RL-AI is unable to make those evaluations?”

That's what I was trying to point at there:

> [GA] “About #1, there's also the thing where we have very specific intellects, considering alignment and value preservation over the long term a whole lot and whatnot. I am not sure why RL-AI would care a whole lot about that: not necessarily well calibrated about difficulty of alignment, not necessarily risk averse, not necessarily super long term horizons, etc.”

Rephrased, except if RL-AI is explicitly optimized for this, I expect we'll be better at that than itself by default

[Christiano]

> [GA] “Yup, as the system becomes bigger and bigger, and is only weakly optimizing for the reward, I expect we witness the relevant computation/thought/consequences less and less”

Why isn’t that the same as saying “as humans use computational tools and less computation happens in humans, the cognition is less and less pointed at anything we care about?” Which is just the question of whether we can align our AI---if that’s hard, we didn’t have to pass the buck.

[GA]

> [Christiano] “Why isn’t that the same as saying “as humans use computational tools and less computation happens in humans, the cognition is less and less pointed at anything we care about?” Which is just the question of whether we can align our AI---if that’s hard, we didn’t have to pass the buck.”

I think that's the case! I think we are much less about optimizing inclusive fitness once we have language, oral tradition, written tradition, calculators, computers, etc.

[Christiano]

> [GA] “except if RL-AI is explicitly optimized for this, I expect we’ll be better at that than itself by default”

Why isn’t RL-AI explicitly optimized for this? We are training AI systems to do the range of tasks that we care about, and so if e.g. we are having our AIs spend 10% of their time on alignment then it is 10% of what the AI is optimized to do.

[GA]

> [Christiano] “Why isn’t RL-AI explicitly optimized for this? We are training AI systems to do the range of tasks that we care about, and so if e.g. we are having our AIs spend 10% of their time on alignment then it is 10% of what the AI is optimized to do.”

I think there is a big scheduling question here: which tasks first? I don't know how to calibrate the scheduling so that you get to the regime where it does alignment before the regime where it's dangerous

[Christiano]

I don’t think we ever cared about optimizing inclusive fitness though, we were just optimized for inclusive fitness. I’m not convinced “we learned to talk” was an alignment failure, in the sense that we had any kind of identifiable idea about what we really wanted that pointed in a different direction than the direction we want.

I think it’s reasonable to say “Look there are all kinds of tools for thinking better, of which ML is one. I’m not worried about ML per se, I’m worried about other ways that thought will change, which might be facilitated by ML.”

But then I kind of want to talk specifically about how and why and when those other kinds of tools (or culture or whatever) pose an alignment problem.

[GA]

> [Christiano] “I don’t think we ever cared about optimizing inclusive fitness though, we were just optimized for inclusive fitness. I’m not convinced “we learned to talk” was an alignment failure, in the sense that we had any kind of identifiable idea about what we really wanted that pointed in a different direction than the direction we want.”

Makes sense. Let me put it differently: I think the more the optimization/computational substrate of humans moved toward "oral culture -> written culture -> formal culture -> telecom culture", or "tribes -> states -> markets" or whatever, we optimized for things that were more and more different from the previous steps

[Christiano]

(oops, I meant “than the direction we went” not “than the direction we want”)

[GA]

> [Christiano] “I think it’s reasonable to say “Look there are all kinds of tools for thinking better, of which ML is one. I’m not worried about ML per se, I’m worried about other ways that thought will change, which might be facilitated by ML.””

Agreed. I was already worried before ML, and I am now much more worried given its pace.

[Christiano]

I think my take is that (i) the misalignment arguments for other changes seem kind of weak and probably don’t bite for a long time, (ii) the faster ML moves, the more likely that human cognitive work will be obsolete by the time that non-AI alignment problems become critical

[GA]

> [Christiano] “the faster ML moves, the more likely that human cognitive work will be obsolete by the time that non-AI alignment problems become critical”

Not sure what you mean here, do you have an example?

[Christiano]

I think there is a potential case for “social media (or markets or states or whatever?) changes the way people relate, and this can be a moral catastrophe for humans from 1950,” but quantitatively it’s just kind of weak IMO

[GA]

Also, I'm a bit confused. Isn't this the claim that is regularly made about high modernism, social media and things like that?

> [Christiano] “I think there is a potential case for “social media (or markets or states or whatever?) changes the way people relate, and this can be a moral catastrophe for humans from 1950,” but quantitatively it’s just kind of weak IMO”

Humans are very slow, so we did not really have time to witness those effects, but yeah, I think SF about dystopias and the like was not completely wrong

[Christiano]

> [GA] “Not sure what you mean here, do you have an example?”

Maybe ML culture will ultimately change what ML systems are implicitly optimizing in ways those ML systems don’t understand. But by the time that problem is severe, I think ML systems will be much better at reasoning about it than we are (since they will likely outnumber us, in addition to having a ton of directly relevant empirical evidence), and so I think that our main technical priority should be building ML systems that are aligned with us

Just as I think we will ultimately build non-ML AI systems that are much more powerful than anything like modern ML, but I think that aligning such AI systems is not really our job, it’s the job for the ML we build.

[GA]

> [Christiano] “But by the time that problem is severe”

I think this can happen very quickly

[Christiano]

Like I might work on it if I thought other more pressing problems were addressed, but I don’t understand why someone would be super pessimistic about ML systems solving either of those problems.

> [GA] “ I think this can happen very quickly”

I feel like this isn’t just coming from the analogy to humans any more though.

[GA]

> [GA] “ I think this can happen very quickly”

>> [Christiano] “I feel like this isn’t just coming from the analogy to humans any more though.”

Indeed, it's coming from the fact that ML systems can learn way faster, and interact with the environment way faster. I think this is straightforward

[Christiano]

Like, if we were sitting around before the development of writing, saying “Maybe writing will affect what humans value” I think the correct response is “Man, they are going to have a long time to grapple with that problem, we should try to put them in a good place but not really sweat it directly.”

With ML systems both sides of the ledger are faster, both the problem and the solution.

you can’t just say “it’s faster,” you need to argue that ML systems are going to change way faster in ways they are trying to avoid, without getting similarly better at addressing that problem

[GA]

Yes!

> [Christiano] “you can’t just say “it’s faster,” you need to argue that ML systems are going to change way faster in ways they are trying to avoid, without getting similarly better at addressing that problem”

Agreed, this is where I was going to go

[Christiano]

I guess my baseline for “ML culture” and “ML tools for thought” and so on is pretty much just like “Human cultural evolution but significantly faster.”

And so I have the same intuition as I do about humans looking in at writing.

[GA]

Meta-comment: I like where this is going, and this feels to me like a crux I want to explore

How about you?

I have pretty much kept going with this convo because I like where it is going, but there's only 30 mins left

[Christiano]

I think it’s an interesting disagreement I’d be happy to revisit sometime, and is maybe worth addressing publicly. Agree we don’t have much time left.

[GA]

Possibly you want to discuss something else, and we can take this as the start for another conversation?

[Christiano]

I don’t have a good sense of how similar your objection is to people who talk about sharp left turn. If it’s very similar that makes it more appealing to talk more about.

If there are other important cruxes it seems good to spend some time on them.

[GA]

Let's create a new thread

[Christiano]

I broadly agree that e.g. RLHF, adversarial training, interpretability doesn’t help with the risks we are discussing here.

Recap of Open Threads & Cruxes

[Christiano]

Maybe worth briefly summarizing the open threads. As I understand them:

Main open thread: mechanisms by which your AI acquires new capabilities + inclination to use them to disempower humanity. We want to discuss concretely so that we can talk about e.g. whether empirical study can shed light on them and whether supervising AIs can plausibly continue supervising.
Other threads on possible phase changes: storing information and training against interpretability. I don’t understand what you mean by these except insofar as they contribute to the main thread.
Empirical feedback is hard: I think this is also maybe just a restatement of the main thread? I asked about whether your view also applies to reward hacking because I don’t really understand your view yet.
Various other reasons that alignment is hard: some hanging threads and suggestions of possible reasons that I mostly don’t understand, we may not have time to get into these.

[GA]

@Christiano

> [Christiano] “I don’t have a good sense of how similar your objection is to people who talk about sharp left turn. If it’s very similar that makes it more appealing to talk more about.”

I did not understand what you meant. Objection to what? Which people who talk about sharp left turn (Nate + Eliezer I assume)?

[Christiano]

Nate + Eliezer and some other folks in that circle talk about the “sharp left turn” as a reason that existing alignment techniques fail to address the core of the problem. I can’t tell how similar their concern is to “Something analogous to ‘culture’ on top of ML systems will quickly change what those ML systems are optimizing for, and so aligning those ML systems in a narrow sense won’t address the hard part of the problem.”

[GA]

I cannot speak much for them. In my bad understanding, RSI / Sharp-Left-Turn early discussions were mostly focused on hardware improvements (which also change the way things are represented), but I don't think that this is what happens first, and I am more worried about "things on top" changing first and RSI-ing first.

> [Christiano] “If there are other important cruxes it seems good to spend some time on them.”

Feel free to suggest some

[Christiano]

I feel like I understand one crux from the earlier thread: if we align ML in some narrow sense that I’m advocating for, you are concerned that ML systems will become part of a larger smarter system that won’t be aligned. And very few of the technical tools I’m imagining can help much with that.

[✅ from GA]

Beyond that there were a few dangling threads from earlier, which might be relevant. And more generally I don’t think I have a clear picture of other parts of the disagreement.

I can make sense of many objections by casting them in that frame, e.g. saying “people might invest 20% in alignment, but it will be in a narrower sense that doesn’t address the step where we are most likely to get killed.”

Or “interpretability won’t help for understanding optimization distributed across many ML systems and their tools, it will at best show how ML systems ‘think about’ those tools and each other without revealing the cognition that actually kills us.”

[✅ from GA]

[GA]

Yes!

Not necessarily among many ML systems

There can be interactions even with a single ML system and the env, but yes

[Christiano]

If by single ML system we mean no parallelization and e.g. just one big neural net running in serial I’m more skeptical. If you just mean we train one neural net and run it trillions of times in parallel, then I’m more on board.

But I agree that in principle one ML system interacting with a computational workspace (even a very lightweight one like “pencil and paper”) can perform kinds of cognition qualitatively different from what happens within the ML system itself, and it could end up deliberately and creatively disempowering everyone without having any kind of internal representation of that fact.

[GA]

> [Christiano] “But I agree that in principle one ML system interacting with a computational workspace (even a very lightweight one like “pencil and paper”) can perform kinds of cognition qualitatively different from what happens within the ML system itself, and it could end up deliberately and creatively disempowering everyone without having any kind of internal representation of that fact.”

I also mean, "accidentally" (from the pov of the internal NN)

And I expect much bigger computational workspaces unfortunately

[👍 from Christiano]

[Christiano]

If it’s correct to interpret all those other objections in this frame, then it seems like it could be the main crux and worth prioritizing over other stuff. My guess would be that’s what’s going on for some but not all other cruxes.

[GA]

Agreed. But I expect 20 mins left to not be enough.

Coordination

[GA]

I know what I'd like to explore for the final 20 mins:

“- How are labs likely to behave: how much will they invest in alignment, how much will they (or regulators) slow AI development.

- Feasibility of measuring and establishing consensus about risk.”

I'm very interested in your take on how much labs will slow AI development, and the feasibility of establishing consensus about risk

I don't think I have much to offer, I expect you have more information than I do, and I'd be more interested in that part

But I can tell you about my models if you want too 🙂

[Christiano]

I mentioned before that I expected labs to spend 5-50%

depending on how the problem looks

slowdown is a bit harder to quantify

I think if AI systems are deceptively aligned we can probably get some evidence of that, and if they are reward hacking we can even more probably get evidence of that, and in the median case this will lead to significant slowdowns

I think if you try to push those to more than 1-2 years then it starts getting hard for a variety of institutional/political/social reasons

[GA]

> [Christiano] “in the median case this will lead to significant slowdowns”

how do you think this would happen exactly?

why would these be so worrying, compared to existing toy examples of reward hacking?

[Christiano]

I mean, the questions we care about are things like “will our ML systems coordinate to overthrow humanity in a coup in order to get higher reward?” and I think the kind of data we will have are things like: (i) based on red teaming exercises a coup seems like it has a non-trivial probability of success, (ii) in vitro we see that similar AI systems exhibit abrupt transitions at the point when overpowering the overseer becomes positive EV

I think this is qualitatively different than a demonstration-in-principle that reward hacking is possible

in that it speaks to the key empirical questions about a system that people are considering deploying

[GA]

> [Christiano] “(ii) in vitro we see that similar AI systems exhibit abrupt transitions at the point when overpowering the overseer becomes positive EV”

How abrupt would they need to be?

Do you expect actual failed attempts? If so, I am a bit confused at the level of capabilities needed for it to be worthwhile and possible to try it + failing it + it's an interesting failure

[Christiano]

If people today were arguing “an AI coup is impossible-in-principle, and that’s why it’s OK to build AI” I think we’d win the rhetorical fight to slow things down. In my mind the reason we don’t slow down is because in fact people’s objection is that existing AI systems pose an extremely minimal risk, which is true and not something we can win an argument about.

I don’t think it’s that likely you get failed takeovers, though I expect you do see some high-enough-profile failures that seem relevantly analogous. But I think you can have toy systems (with weak overseers, i.e. where a full-blown coup is not necessary for successful reward hacking) where AI systems behave well in most of training and then switch to reward hacking at some point (without having intermediate failed attempts).

[GA]

You think the latter will be strong evidence for anyone?

[Christiano]

I don’t really see any reason you’d have a reward-hacking-driven AI coup in the real world without either warning shots in the real world or else in vitro cases of AI overpowering overseer without warning shots.

Yes, I think that the kind of evidence we’re discussing here would be adequate for DM/OpenAI/Anthropic/Google/Meta.

[GA]

> [Christiano] “I don’t really see any reason you’d have a reward-hacking-driven AI coup in the real world without either warning shots in the real world or else in vitro cases of AI overpowering overseer without warning shots.”

In the real-world: because I expect it would not try until some high probability of success, if it's already there

In vitro: because I expect ppl would just go "Well, it's in vitro, it doesn't count. the real system that is already deployed is not dangerous!"

Or possibly, I don't understand what you mean by in-vitro: I don't see how it differs from the toy models I have mentioned earlier

[Christiano]

Where the evidence is:

Demonstrations of AI takeover in scenarios with weak overseers, without warning shots (or with warning shots similar to those observed in the real world).
Empirical evaluation of the plausibility of a successful AI coup for existing AI systems suggesting that it’s “possible but unlikely.”

[GA]

So, something like:

AI plays a game with an overseer telling it not to cheat / to be nice
AI finds a way to deceive the overseer there

Or even "jails the overseer (if it's embodied in the game)"

Or "kills it"

etc.

[Christiano]

Yeah, like:

AI can try to reward hack, but if it does it will get corrected by overseer.
But a smart enough AI can successfully incapacitate the overseer.
We see AI behaving well, and then once smart enough it incapacitates the overseer without intermediate unsuccessful reward hacking attempts.
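
The three steps above can be illustrated with a toy expected-value calculation. This is only a sketch and was not part of the discussion: the reward values, the penalty for being corrected, the `overseer_strength` parameter, and the success-probability formula are all assumptions chosen for illustration. It shows how an agent that only acts when overpowering the overseer is positive EV would look well behaved at every capability level below a threshold and then switch, with no failed attempts beforehand.

```python
# Toy model (all numbers are illustrative assumptions, not from the discussion).

def expected_values(capability, overseer_strength=5.0,
                    reward_behave=1.0, reward_hack=5.0, penalty_corrected=-10.0):
    """Return (EV of behaving, EV of trying to incapacitate the overseer)."""
    # Assumed form: success chance rises smoothly with capability
    # relative to the strength of the (deliberately weak) overseer.
    p_success = capability / (capability + overseer_strength)
    ev_behave = reward_behave
    ev_attempt = p_success * reward_hack + (1.0 - p_success) * penalty_corrected
    return ev_behave, ev_attempt


def choose_action(capability, **params):
    """The agent attempts takeover only when that is strictly higher EV."""
    ev_behave, ev_attempt = expected_values(capability, **params)
    return "incapacitate_overseer" if ev_attempt > ev_behave else "behave"


def first_defection(capabilities, **params):
    """Return the first capability level at which the agent stops behaving."""
    for capability in capabilities:
        if choose_action(capability, **params) == "incapacitate_overseer":
            return capability
    return None


if __name__ == "__main__":
    for capability in range(0, 21):
        print(capability, choose_action(capability))
    print("first defection at capability", first_defection(range(0, 21)))
```

With these assumed defaults the success probability changes smoothly, yet the observed behavior flips in a single step (at capability 14), and every earlier observation is "behave". A real experiment would need a learned policy and an actual overseer rather than a hard-coded decision rule.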

[GA]

It feels easy to produce such a thing. Do you think there would be value in doing so?

[Christiano]

The main scenarios I am worried about personally are:

A. This is demonstrated and people are broadly aware of the risk, but it’s hard to fix or to precisely estimate risk and so we only get a bit of delay before the world moves forward.

B. Deceptive alignment, with no similar in vitro warning shot.

[GA]

Or, the other way around: let's assume we already have such a thing:

What do you expect happens?
Let's say no one cares ("oh, it's in vitro lol"). What updates do you make?

[Christiano]

I think it is very valuable to produce the best empirical demonstrations of these phenomena that we can. I think current state of the art is pretty uncompelling.

That said, I think right now people will be unmoved because the other part of the evidence---AI is smart enough that it could plausibly execute a coup---is not really there.

I think the empirical demos of AI takeover are very important though.

And once you have those you can start examining whether proposed mitigations address the problem.

[GA]

> [Christiano] “I think the empirical demos of AI takeover are very important though.”

So that, as it gets smarter, we can be like "Oh, we have already proven for months/years that AI can take over. It's just lacking intelligence" or something?

> [Christiano] “And once you have those you can start examining whether proposed mitigations address the problem.”

You mean, like thought experiments, but concretely embodied, right?

For instance, with a scaled down overseer, etc.

[Christiano]

A key risk is overfitting to human level, e.g. if you use RLHF (with real human evaluators) to reduce takeover risk, then you will have failure only past human level and so won’t see it in vitro with artificially weak overseers. The biggest reason this is a problem is because we train large language models, and those models have a ton of implicit supervision from humans on the internet, and so it’s very hard to avoid overfitting to human level.

[GA]

> [Christiano] “A key risk is overfitting to human level, e.g. if you use RLHF (with real human evaluators) to reduce takeover risk, then you will have failure only past human level and so won’t see it in vitro with artificially weak overseers. The biggest reason this is a problem is because we train large language models, and those models have a ton of implicit supervision from humans on the internet, and so it’s very hard to avoid overfitting to human level.”

Yup, this is why I am talking about scaled down overseer and the like

[Christiano]

Yeah, that’s what you’d want in an experiment.

[GA]

Interesting.

[Christiano]

It’s tough with LMs, and so part of the problem is building a setting in which AI systems are smart enough to understand the dynamics of overpowering the overseer without inadvertently overfitting to the level of oversight coming from webtext.

[GA]

> [Christiano] “I think the empirical demos of AI takeover are very important though.”

So, it's not about warning shots or whatever, it's more about having something more concrete than thought experiments to reason about those problems?

> [Christiano] “It’s tough with LMs, and so part of the problem is building a setting in which AI systems are smart enough to understand the dynamics of overpowering the overseer without inadvertently overfitting to the level of oversight coming from webtext.”

Yup, got that

[Christiano]

I think that the empirical demos are a key part of things being convincing? You need to establish both means and motive, and the empirical demonstrations with weak overseers provide extremely good evidence of motive, and effectively kill the objection “but our AI is so nice so far,” in a way that is sufficiently clear and legible that I think it will carry even with not-super-sympathetic labs

[GA]

Nice. I'd be interested in discussing this more at a later time then.

[Christiano]

Sounds good

I guess we should wrap up now

[GA]

@Christiano

Thanks a lot for your time!

[Christiano]

Yes, thanks for talking!

[GA]

I don't know how worthwhile you've found this, but I really enjoyed it, and looking forward to exploring The Crux some other time.

[Miotti]

Thanks both for the fantastic discussion @GA @Christiano

[GA]

@Christiano Also. I just saw that there were a couple of dangling threads. How do you want to go about them? Tackle them some other time, continue asynchronously, or just leave them dangling?

[Christiano]

It seems like many of them are hard to deal with without first engaging on the main crux, so I’m inclined to leave them dangling

Christiano also asked several questions which did not get addressed for lack of time: Why can’t we study this phase shift in advance? Why can’t we construct cases where “we try to train our AI not to think about X, but then it learns to think about X anyway” prior to catastrophic AI, and then study them in advance?

Raemon (3y)

Quick admin note: by default, lines that are bold create a Table of Contents heading (which resulted in the ToC having a whole bunch of spurious [Christiano] and [GA] lines). There's a cute hack to get around this, by inserting a space in italics at the end of the bolded line. I just used my admin powers to add the space-with-italics to all the "[Christiano]", "[GA]", etc, so that the ToC is more readable.

Raemon (3y)

Just to check my sanity, this used to be two posts, and now has been combined into one?