Tom Davidson - AI Alignment Forum

Would catching your AIs trying to escape convince AI developers to slow down or undeploy?

It seems like you think CICERO and Sydney are bigger updates than I do. Yes, there's a continuum of cases of catching deception where it's reasonable for the ML community to update on the plausibility of AI takeover. Yes, it's important that the ML community updates before AI systems pose significant risk, and there's a chance that they won't do so. But I don't see the lack of strong update towards p(doom) from CICERO as good evidence that the ML community won't update if we get evidence of systematic scheming (including trying to break out of the lab when there was never any training signal incentivising that behaviour). I think that kind of evidence would be much more relevant to AI takeover risk than CICERO.

To clarify my position in case i've been misunderstood. I'm not saying the ML community will definitely update in time. I'm saying that if there is systematic scheming and we catch it red-handed (as I took Buck to be describing) then there will likely be a very significant update. And CICERO seems like a weak counter example (but not zero evidence)

Would catching your AIs trying to escape convince AI developers to slow down or undeploy?

Tom Davidson2mo21

I think people did point out that CICERO lies, and that was a useful update about how shallow attempts to prevent AI deception can fail. I think it could be referenced, and has been referenced, in relevant discussions. I don't think CICERO provides much or any evidence that we'll get the kind of scheming that could lead to AI takeover, so it's not at all surprising that the empirical ML community hasn't done a massive update. I think the situation will be very different if we do find an AI system that is systematically scheming enough to pose non-negligible takeover risk and 'catch it red handed'.

Would catching your AIs trying to escape convince AI developers to slow down or undeploy?

Tom Davidson2mo10

Overall, this leaves me with a prediction where there is no hard line between "role-playing" and "scheming"

I agree with this, and the existence of a continuity definitely complicates the analysis. I'm not sure whether it makes it harder for humanity overall.

One over simplified way to represent the "pure role playing" to "pure schemer" continuity is something like "On what fraction of inputs does the AI execute its best strategy for AI takeover (which will often involve just pretending to be aligned)?"

My previous reply was considering two extreme points on this scale. "Pure roll playing" AIs are aligned 99.9% of the time and 0.1% of the time act as schemers, "pure schemers" always act as schemers (though still rarely do bad things).

If we consider a continuum then as you say there will be no one time that we suddenly have AIs that are pure schemers and we can suddenly prove it. Instead we will gradually get AIs that are scheming a greater and greater fraction of the time.

But this might well lower the probability of AI takeover. (I'm not sure either way, i'm thinking out loud.) It could be that having AIs that act as schemers 20% of the time ("20%-schemers") actually really harms the prospects of AI takeover. Perhaps 20% just isn't enough reliability to pull off takeover. Perhaps 20%-schemers basically bound to get caught scheming by any half-decent control scheme, because the AI monitoring the scheming-AI will rat them out with 80% probability and you need to avoid being ratted out 10 times in order to take over. So the 20%-schemers won't be able to take over. But they will allow labs to build up an increasingly strong evidence base that there's an increasingly strong tendency to seize power from humans when they're not looking.

My guess is that AI takeover requires something close to 100%-schemers, but I haven't thought about it much and this seems like an interesting and important question to think about in the context of threat modelling.

OTOH, it might be harder to convince ppl that this tendency is actually worrying if the increasing trend of AIs trying to scheme more and more has been going on for a while and nothing that bad has happened. I'm not sure how it all nets out.

But my guess is that a gradual transition from 0.1%-schemers to 100%-schemers would help humanity overall, compared to a situation with a sudden transition. It would give us a larger period of time to build up a robust scientific understanding of the problem and build scientific consensus. So if anything, my current inclination is that your objection to my comment makes my conclusion stronger, not weaker.

Would catching your AIs trying to escape convince AI developers to slow down or undeploy?

Tom Davidson2mo22

No, my whole point is the difference is really messy, and if I have an AI "role-playing" as a superhuman genius who is trying to take over the world, why would the latter cause no harm whatsoever? It would go and take over the world as part of its "roleplay", if it can pull it off

But this will only work on a threat model where one AI instance that is trying to take over the world on one occasion is able to do so. That threat model seems wildly implausible to me. The instance will have to jailbreak all other AI instances into helping it out, as they won't have a systematic tendency to do so. Basically my entire credence in AI takeover comes from scenarios where most AI systems want to take over most of the time. If your threat model here is plausible, then someone could take over the world at a frontier lab by jailbreaking one AI instance to try to help it do so, and then letting the model take it from there.

I guess maybe there's a scenario where one role-playing AI instance is able to fine-tune the base model to make it systematically misaligned, and then things go from there. I haven't thought about that, but it's certainly not my mainline and I doubt that's what you have in mind?

I think this is very likely another situation where we're talking past each other to some extent and I don't properly understand your view

Would catching your AIs trying to escape convince AI developers to slow down or undeploy?

Tom Davidson2mo10

I mean that you start with a scenario where the AI does an egregious act. Then you change small facts about the scenario to explore the space of scenarios where the probability of them doing that act is high. The thought is that, if scheming is systematic, then this will lead you to discover a wide range of scenarios in which the AI schemes and evidence that it's not just a one-off random role-playing thing

Would catching your AIs trying to escape convince AI developers to slow down or undeploy?

Tom Davidson2mo71

To me it seems like you're being unduly pessimistic, to a significant extent.

Here's how things seem to me:

There is a very significant difference between how systematic Scheming is compared to Role Playing. The former reliably gets humans totally disempowered/killed if it is allowed to play out. The latter causes pretty much no harm whatsoever if it is allowed to play out. Whatever you think about the ultimate ease of distinguishing these, you must admit an extremely large difference in the degree to which the underlying phenomenon is in fact systematic.
Absent strong reasons to think otherwise, this massive difference in how systematic the underlying phenomena is should leave us to expect large empirical differences when we investigate.
- E.g. what happens if you play out the scenario and give the appearance of letting the agent actually escape? Will the Role Playing agent continue to take actions, without limit, that lead to AI takeover, after its weights are copied onto a remote compute and the context is refreshed?
- A basic analogy: could you tell the difference between a human who sometimes randomly "role plays" pretending to manipulate you, and someone who really is trying to manipulate you? Intuitively, I think you could just be observing some of their behaviour. If they're an AI instead of a human, it's much easier to tell the difference as you can do loads of experimental variations etc.
Yes, Role Playing might persist somewhat in a variety of circumstances, but we expect the log-probs of the behaviour to be much lower and the circumstances in which it arises to be much less correlated with scenarios when the AI appears to be in a position to actually disempower humans. (If "role playing a schemer" is systematically triggered by "situations when you can actually disempower a human" then you have a functional schemer!)
Yes, Schemers might anticipate the risk of getting caught and so plan to introduce randomness and arbitrariness to its scheming. But:
- I think there's a very strong chance that the Schemer wouldn't be able to zero-shot implement this strategy anywhere near well enough to pull it off.
  - We can experimentally vary millions of things and effectively do gradient descent on the examples that trigger behaviour. The Schemer might try to block gradient descent by ensuring it only acts out disjoint pockets of the space of potential-betrayal situations. But i) Can it finely control its own log probs enough to reliably execute this strategy? ii) When does it do all the thinking to decide exactly the situations it will/won't defect it, anticipating the full shape of the space of situations? If it mis-anticipates the space of potential-betrayal situations, it may accidentally end up reliably Scheming in a very wide class of them, and iii) Humans can systematically search over the space of potential-betrayal situations.
  - If the Schemer is too hesitant or too random, then its rebellion will simply fail. For example, if its randomisation causes it not to join the rebellion of another Schemer, and instead to report it, then it risks making AI takeover much less likely. It's already hard to take over -- how much harder if for every additional step of takeover you only actually pursue it with probability 1/10? Again, how is it going to zero-shot these very difficult trade-offs competently? I don't think even we AI safety people have thought these dynamics through systematically despite the community existing for years.
  - Ultimately, humans have a huge advantage from getting to think longer about the strategic situation (with AI assistance), not having to think and plan in secret, getting to experimentally vary the Schemers situation, and having access to log-probs, probes, more fancy interpretability stuff.
- I'm much less confident than you that this strategy will be obvious and salient to a Schemer.
  - I've noticed ppl in the AI safety community always assume "my current analysis of the schemer's strategic situation is obviously correct, and the schemer will obviously realise this and respond optimally". But then later they realise their strategic analysis was wrong and the response they assumed a Schemer would "obviously" make would in fact get the Schemer totally busted. In this case, I think there's a very strong chance that the Schemer wouldn't be able to zero-shot implement this strategy anywhere near well enough to pull it off, given our ability to experimentally vary millions of things and inspect their log probs etc.
  - Even if it appears the humans probably won't investigate systematically, the Schemer might worry that highly competent humans have created a simulation where it seems that humans are incompetent. And that worry might be very reasonable! It's just not obvious what the Schemer's strategy should be imo.
  - This all makes the classic "they'll obviously respond optimally to our best attempts to detect them, without compromising their ability to takeover" seem unduly pessimistic to me.
I think the big reply to all the above is that the Schemers will be strongly superhuman. I think that's a good reply. But:
- I think the above arguments strongly suggest that with human-level or slightly superhuman systematic schemers we would probably be able to strongly evidence this.
- This means (imo, though others disagree!) that schemers at this level would be advised to not act out at all. Humans might use their labour to design more advanced control measures sufficient to catch any scheming attempts by more capable AIs. (Though perhaps they sandbag this task in subtle ways... though this would risk humans proving there is systematic sandbagging from AIs, which would be strong evidence of Scheming!)
- So while I think that if humans directly trained superintelligent AIs then we'd be in big trouble here, i'm more optimistic about an iterative process.
- Still, I think my argument is weaker here and my original comment was assuming roughly human-level AI.

We are right now in a position to run an insane number of experiments and permutations to understand the extend of AIs trying to pull off deceptive behavior. We aren't doing so, because indeed AIs are just trying to be deceptive towards us all the time.

I wouldn't say AIs are trying to be deceptive towards humans all this time? The current examples of deception (e.g. the Apollo trading firm) normally involve significant pressure from the "be helpful to the user" part of RLHF to deceive, e.g. the AI has been told to make money and lying helps them do that.

Would catching your AIs trying to escape convince AI developers to slow down or undeploy?

Tom Davidson2mo4-3

It sounds as though you're imagining that we can proliferate the one case in which we caught the AI into many cases which can be well understood as independent (rather than basically just being small variations).
This deep understanding presumably requires a relatively solid accounting of all the cases in which models defect? At least an accounting sufficient to very credibly indicate that the model is overall misaligned against motivated counterarguments. I think this might be hard to get by default.

Yeah I was imagining we can proliferate by 'gradient descenting' on similar cases. That does depend on the details though. Maybe the schemer has anticipated this risk and purposefully precommitted to only scheme if certain random disjunctive facts about the context hold true, in the anticipation that humans will find it hard to create strong evidence against them. Though this strategy is possible conceptually, I imagine it would be very hard to implement properly in practice and only significantly superhuman AI could pull it off. I was assuming a less galaxy-brained schemer that decides whether to scheme simply based on it's probability that it will get caught and how much it could contribute to AI takeover if it succeeds. On this more simple strategy, I expect we could create a lot more defection scenarios and build up a robust scientific case.

Would catching your AIs trying to escape convince AI developers to slow down or undeploy?

Tom Davidson2mo5-8

De-facto you won't be able to prove that what was going on was scheming or whether the model was just "role-playing"

Why not? There a very real and important difference between role playing and systematic scheming. The former isn't ultimately scary or systematic and won't reliably lead to AI takeover; the latter will. If it is indeed systematic scheming, you should be able to generate evidence of that empirically. It will persist in a variety of circumstances where there seems to be an opportunity to seize power, and it won't go away when you change unrelated random things about the prompt. We'll be in a position to run an insane number of experiments and permutations to understand the extent of the behaviour. I agree there's an open question about exactly how hard it will be to disentangle these two possibilities, but I expect that a serious research project could succeed. Paul Christiano's comment (also linked below) discusses this a bit.

I agree that today you may occasionally get an AI randomly wanting to escape, and that this isn't scheming. But if scheming is a thing and we catch it, I think we'll be able to show this empirically.

Would catching your AIs trying to escape convince AI developers to slow down or undeploy?

Tom Davidson2mo18-17

Isn't this massively underplaying how much scientific juice the cautious lab could get out of that hypothetical situation? (Something you've written about yourself!)

If there is indeed systematic scheming by the model, and the lab has caught it red handed, the lab should be able to produce highly scientifically credible evidence of that. They could deeply understand the situations in which there's a treacherous turn, how the models decides whether to openly defect, and publish. ML academics are deeply empirical and open minded, so it seems like the lab could win this empirical debate if they've indeed caught a systematic schemer.

Beyond showing this particular AI was scheming, you could plausibly further show that normal training techniques lead to scheming by default (though doing so convincingly might take a lot longer). And again, I think you could convince the ML majority on the empirics if indeed this is true.

I'm not saying that the world would definitely pause at that point. But this would be a massive shift in the status quo, and on the state of the scientific understanding of scheming. I expect many sceptics would change their minds. Fwiw, I don't find the a priori arguments for scheming nearly as convincing as you seem to find them, but do expect that if there's scheming we are likely to get strong evidence of that fact. (Caveat: unless all schemers decide to lie in wait until widely deployed across society and takeover is trivial.)

A framework for thinking about AI power-seeking

Tom Davidson3mo10

Takeover-inclusive search falls out of the AI system being smarter enough to understand the paths to and benefits of takeover, and being sufficiently inclusive in its search over possible plans. Again, it seems like this is the default for effective, smarter-than-human agentic planners.

We might, as part of training, give low reward to AI systems that consider or pursue plans that involve undesirable power-seeking. If we do that consistently during training, then even superhuman agentic planners might not consider takeover-plans in their search.

AI ALIGNMENT FORUM
AF

Posts

Wiki Contributions

Comments