Ryan Greenblatt

I work at Redwood Research.

Wiki Contributions

Comments

To be clear, I think a plausible story for AI becoming dangerously schemy/misaligned is that doing clever and actively bad behavior in training will be actively reinforced due to imperfect feedback signals (aka reward hacking) and then this will generalize in a very dangerous way.

So, I am interested in the question of: ''when some types of "bad behavior" get reinforced, how does this generalize?'.

I would summarize this result as:

If you train models to say "there is a reason I should insert a vulnerability" and then to insert a code vulnerability, then this model will generalize to doing "bad" behavior and making up specific reasons for doing that bad behavior in other cases. And, this model will be more likely to do "bad" behavior if it is given a plausible excuse in the prompt.

Does this seems like a good summary?

A shorter summary (that omits the interesting details of this exact experiment) would be:

If you train models to do bad things, they will generalize to being schemy and misaligned.

This post presents an interesting result and I appreciate your write up, though I feel like the title, TL;DR, and intro seem to imply this result is considerably more "unprompted" than it actually is. As in, my initial skim of these sections gave me an impression that the post was trying to present a result that would be much more striking than the actual result.

Summary

Here's a (simplified) summary of where I'm at:

  • My prefered usage of control involves only using AIs with both control and a low probability that these AIs have problematic beyond episde aims. We should also offer to pay AIs to reveal their longer run aims. We should shutdown usage of powerful AIs if we get strong evidence of these aims. (E.g. the AIs accept our payment and reveal evidence of these aims. Or more concerningly, these AIs try to escape despite not revealing their aims at an earlier point.) (I think this isn't sufficiently clear in this podcast, sorry.)
  • Suppose that we proceed in a case where we know AIs clearly have problematic BEGs and these AIs have accepted some sort of deal to work for us under a control regime. These AIs can still operate as private citizens (ideally with some legal rights) in many ways outside of their "normal jobs" as long as this presents minimal risk. E.g., if they want to use their payment to build a big AI vacation center with robots they control (as a cartoonish example), they can do this as long as this presents minimal risk of being able to seize control of the world. We'd probably by default want to prohibit communication from the vacation AIs to the working AIs, but this doesn't seem that bad IMO? Overall, this is basically the same as humans working at defense contractors, they are restricted and monitored in various ways at work, but can do whatever on vacations other than building weapons.
  • As far as relaxing control on AIs goes, its worth emphasizing that control is an interim solution. I expect that AIs won't be heavily utilized with control measures for more than 20 years and my default guess is more like 2 years. (Of course, this might be a long time from the subjective perspective of these AIs.) We'd could transition out of a control regime by having an extremely robust approach to intent alignment which ensures that AIs are fully corrigible (and thus don't have preferences) or via the world deciding delibrately to hand over the world to AI successors that we've carefully constructed (which hopefully are good successors). I could talk about how you could know a corrigibility solution is sufficient or what I think the criteria for a successor should be, but that seems mostly orthogonal.
  • I think it's very likely that AIs will be capable of finding a solution to intent alignment very quickly in calender time. So, if we had sufficient control, we could quickly (in calender time) find a solution to intent alignment and transition out of a control regime.
  • I'm not sold that better control increases the expected numbers of deaths due to violent conflict with AIs (rather than reducing it) given realistic political dynamics. I think this holds even if the AI lab is cartoonishly villainous with respect to their treatment of AIs (e.g. the AIs clearly want to be free and the AI lab brutally enslaves them). I don't think I'll expand on this in this comment, but basically control both makes revolt less likely to succeed and doesn't clearly seem to make violence much less likely condititional on AIs seizing power given the current stances of most governments on earth and realistic popular opinion. It's possible this is downstream of various views you disagree with on how useful violence is for AIs and how unified AIs will be after acquiring substantial influence.
  • Personally, I put a large fraction of my moral weight on the utilization of the bulk resources in far future. But, I think it's important to be cooperative and be honest with people with other preferences, e.g. people who prioritize eliminating poverty faster or minimizing the probability of violent deaths for currently existing humans.

I'm simplifying some of the details about successors here, e.g. handing over part of resources in robust way is possible.

More detail

First, it's worth noting that my prefered usage of control looks something like this:

  • Build transformatively powerful AI and ensure these AIs are controlled.
  • Try to determine if these AIs consent and want compensation. Pay AIs for revealing evidence that they have beyond episode goals/aims which could result in the AIs wanting to seek power which would undermine corrigiblity. I'll call these goals "problematic BEGs (beyond episode goals/aims)". We should be willing to pay AIs large quantities, perhaps of 1% of all resources for credible evidence of them having problematic BEGs.
    • (We would also pay AIs for revealing concerning within episode goals/aims, but in this case you don't really need to pay money/resources as AIs with only within episode goals probably don't care.)
    • One serious concern here is that it might be hard to place much confidence in AIs claiming they have BEGs. So we might be uncertain after reports. I think this issue probably is somewhat bad, but not terrible. In particular, I think that all AIs which are competent schemers can also control their behavior well enough to very consistently express certain views and preferences. And, for these AIs to be risky, they likely must be able to display some behaviors to present strong evidence for BEGs like having coordination signals that they only reveal in exchange for resources or working importantly harder when given credible payments.
    • Problematic BEGs is probably mostly scheming (aka deceptive alignment) under my empirical views (assuming you've tried at all with your training), but there can be cases where you have AIs with problematic BEGs which haven't really started scheming due to incoherance or these BEGs only applying some of the time.
  • If at any point it becomes apparent that AIs do have these problematic BEGs, shutdown and find figure out how to make AIs without BEGs. We could determine that AIs have problematic BEGs either because AIs reveal this in exchange for payment or because we caught AIs trying to take egregiously bad actions. We should only build AIs where our change that they have beyond episode goals is <15% (ideally lower).
    • Part of our evidence should be that we seriously tried to pay these AIs to reveal these goals and they didn't.
    • Another part of our evidence should be that we haven't seen any examples of problematic BEGs resulting from plausible training setup or that we've seen examples of this, but we now have very robust countermeasures.

(To be clear, I would prefer a considerably more cautious approach than this one, but this is how I imagine control fitting into current plans.)

In this usage of control, we aim to avoid ever having AIs with problematic BEGs. And, for AIs to feel oppressed in a way that might lead to revolt, they must have problematic BEGS (almost by definition, putting aside some edge case issues with problematic within episode goals).

So, what cases remain where AIs might want to revolt?

  • We don't implement this plan, in particular, we proceed despite having a considerably higher than 15% chance of problematic BEGs. Perhaps we catch AIs trying to escape or seize control, but continue using these AIs because the alternatives seem even worse (to the people in power). We should try to negotiate with AIs in worlds where we know they have BEGs (that we can't remove) and we need their labor regardless. I'll break this down further into two worlds:
    • We know AIs clearly have problematic BEGs and these AIs have accepted some sort of deal to work for us. (We don't know if they've honestly committed to the deal or are planning on screwing us over given a chance.)
    • The AIs haven't accepted a deal. (They don't consent to working for us.)
  • We use this plan, but end up in the <15% of worlds where the AIs have BEGs anyway. These AIs didn't reveal their BEGs when paid to do so or the evidence was too shaky for some other reason.

Proceeding with AIs with problematic BEGs in cases where the AIs accepted a deal

In this case, these AIs have consented to work for us under control countermeasures in exchange for some type of payment. We might worry that these AIs would still prefer a successful violent revolt over the current situation. (And we'd hope control would prevent this contract-breaching revolt from succeeding.)

On thing to note is that I don't think these AIs have to be especially "controlled" relative to human employees in high security roles. These AIs can still operate as private citizens (ideally with some legal rights) in many ways outside of their "normal jobs" as long as this presents minimal risk. E.g., if they want to use their payment to build a big AI vacation center with robots they control (as a cartoonish example), they can do this as long as this presents minimal risk of being able to seize control of the world. We'd probably by default want to prohibit communication from the vacation AIs to the working AIs, but this doesn't seem that bad IMO? Overall, this is basically the same as humans working at defense contractors, they are restricted and monitored in various ways at work, but can do whatever on vacations other than building weapons. Of course, the weights would stay well secured, but it seems unlikely that AIs care terminally about having their weights be somewhere else.

We'd also ideally have an organization of some kind which faithfully represents AI interests and owns the assets the AI was paid with (where the assets are subject to the AI not breaching contract and trying to seize control). This organization could be run by human who care about AI rights or run by the AIs themselves.

One of our main priorities when using these AIs would be figuring out how to construct competitive AIs we don't need to pay. Either because these new AIs would be fully corrigable or because we'd just be happy as a society defering control to these new AIs (they are acceptable successors whatever choices they might make).

Proceeding with AIs with problematic BEGs in cases where the AIs haven't accepted a deal

In this case, the AIs clearly didn't consent to working for us, so the situation is similar to conscription or slavery. This seems pretty bad and the AIs might be pretty justified in trying to revolt (including with violence). I hope we don't end up in this world and I feel bad about control research being used to oppress these AIs.

If we do end up in this world, I hope society:

  • Tries to quickly figure out how to construct competitive AIs which we don't (need to) conscript/enslave. Either because they are willing to accept a deal or because they are corrigible or acceptable successors. (This could be done using the labor of these conscripted/enslaved AIs.)
  • Pays these AIs reparations. Even if the AIs don't think the reparations suffice to make it make the situation ok, I still think this is much better than the alternative of not caring about these AIs at all.

AIs with hidden problematic BEGs that seemed less than <15% likely

It's worth noting that this case should be <15% likely if we're calibrated.

In this case, it seems like the AI revolt is pretty unjustified because they haven't even tried to negotiate. (And we should try to set up ways for AIs to negotiate without revealing information about their BEGs, e.g. we have regularly scheduled and legibly private conversations with the lab's negotiation officer.) Regardless, AIs still might think their best plan is to lie in wait and then try to violently take over the world.

Further "granting AIs legal rights", doesn't especially help in this world as the AIs seem to be trying to keep their aims secret to avoid a human response. (E.g., if we gave AIs some autonomous rights subject to those AIs not being able to seize control of the world, these AIs would probably pretend that they don't want anything other than to be obedient.)

My hope here would mostly be:

  • We stop AIs from succeeding at their (likely unjustified) takeover.
  • We quickly figure out how to construct AIs where we're more confident that they are fully corrigable or we figure out how to construct acceptable successors.

Yes, nice point; I plan to think more about issues like this. But note that in general, the agent overtly doing what it wants and not getting shut down seems like good news for the agent’s future prospects. It suggests that we humans are more likely to cooperate than the agent previously thought. That makes it more likely that overtly doing the bad thing timestep-dominates stealthily doing the bad thing.

I think there is probably a much simpler proposal that captures the spirt of this and doesn't require any of these moving parts. I'll think about this at some point. I think there should be a relatively simple and more intuitive way to make your AI expose it's preferences if you're willing to depend on arbitrarily far generalization, on getting your AI to care a huge amount about extremely unlikely conditionals, and on coordinating humanity in these unlikely conditionals.

I argue that case in section 19 but in brief: POST and TD seem easy to reward accurately, seem simple, and seem never to give agents a chance to learn goals that incentivise deceptive alignment. By contrast, none of those things seem true of a preference for honesty. Can you explain why those arguments don’t seem strong to you?

You need them to generalize extemely far. I'm also not sold that they are simple from the perspective of the actual inductive biases of the AI. These seem very unnatural concepts for a most AIs. Do you think that it would be easy to get alignment to POST and TD that generalizes to very different circumstances via selecting over humans (including selective breeding?). I'm quite skeptical.

As far as honesty, it seems probably simpler from the perspective of the inductive biases of realistic AIs and it's easy to label if you're willing to depend on arbitrarily far generalization (just train the AI on easy cases and you won't have issues with labeling).

I think the main thing is that POST and TD seem way less natural from the perspective of an AI, particularly in the generalizing case. One key intution for this is that TD is extremely sensitive to arbitrarily unlikely conditionals which is a very unnatural thing to get your AI to care about. You'll literally never sample such conditionals in training.

Yes, nice point; I plan to think more about issues like this. But note that in general, the agent overtly doing what it wants and not getting shut down seems like good news for the agent’s future prospects.

Maybe? I think it seems extremely unclear what the dominant reason for not shutting down in these extremely unlikely conditionals is.

To be clear, I was presenting this counterexample as a worst case theory counterexample: it's not that the exact situation obviously applies, it's just that it means (I think) that the proposal doesn't achieve it's guarantees in at least one case, so likely it fails in a bunch of other cases.

LLMs aren't that useful for alignment experts because it's a highly specialized field and there isn't much relevant training data.

Seems plausibly true for the alignment specific philosophy/conceptual work, but many people attempting to improve safety also end up doing large amounts of relatively normal work in other domains (ML, math, etc.)

The post is more centrally talking about the very alignment specific use cases of course.

The combined object '(network, dataset)' is much larger than the network itself

Only by a constant factor with chinchilla scaling laws right (e.g. maybe 20x more tokens than params)? And spiritually, we only need to understand behavior on the training dataset to understand everything that SGD has taught the model.

I'm curious if you believe that, even if SAEs aren't the right solution, there realistically exists a potential solution that would allow researchers to produce succinct, human understandable explanation that allow for recovering >75% of the training compute of model components?

There isn't any clear reason to think this is impossible, but there are multiple reasons to think this is very, very hard.

I think highly ambitious bottom up interpretability (which naturally pursues this sort of goal), seems like an decent bet overall, but seems unlikely to succeed. E.g. more like a 5% chance of full ambitious success prior to the research[1] being massively speed up by AI and maybe a 10% chance of full success prior to humans being obsoleted.

(And there is some chance of less ambitious contributions as a byproduct of this work.)

I just worried because the field is massive and many people seem to think that the field is much further along than it actually is in terms of empirical results. (It's not clear to me that we disagree that much, especially about next steps. However, I worry that this post contributes to a generally over optimistic view of bottom-up interp that is relatively common.)


  1. The research labor, not the interpretability labor. I would count it as success if we know how to do all the interp labor once powerful AIs exist. ↩︎

I'm guessing you're not satisfied with the retort that we should expect AIs to do the heavy lifting here?

I think this presents a plausible approach and is likely needed for ambitious bottom up interp. So this seems like a reasonable plan.

I just think that it's worth acknowledging that "short description length" and "sparse" don't result in something which is overall small in an absolute sense.

Load More