CEO at Redwood Research.
AI safety is a highly collaborative field: almost all the points I make were either explained to me by someone else, or developed in conversation with other people. I'm saying this here because it would feel repetitive to say "these ideas were developed in collaboration with various people" in all my comments, but I want to have it on the record that the ideas I present were almost entirely not developed by me in isolation.
Please contact me via email (bshlegeris@gmail.com) instead of messaging me on LessWrong.
You can undeploy them, if you want!
One difficulty is again that the scheming is particularly correlated. Firing a single spy might not be traumatic for your organization's productivity, but ceasing all deployment of untrusted models plausibly grinds you to a halt.
And in terms of fixing them, note that it's pretty hard to fix spies! I think you're in a better position for fixing schemers than spies, e.g. see here.
I agree that fine-grained configuration mechanisms probably improve security, and that this might be a net security gain, as I say in the post.
I agree that the increasing value of compute might increase the isolation you use, for the reason you said. One reason I'm skeptical is that you can get almost all of that value by having AIs voluntarily adopt mechanisms that isolate their jobs from the other jobs running (e.g. naming conventions about who is allowed to read or write what), which gets you the reliability benefits without providing any security.
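For concreteness, here's a rough sketch of the kind of voluntary convention I mean (the paths, function names, and check are all just illustrative, not something from the post): each job only reads and writes under a directory named after itself. The check is purely advisory, so it buys reliability (jobs don't accidentally clobber each other) but no security (a scheming job can just bypass it and write wherever it wants).

```python
# Hypothetical sketch of a voluntary isolation convention: each job agrees to
# only touch paths under a prefix derived from its own name. This prevents
# accidental interference between jobs (reliability), but nothing enforces it,
# so a non-cooperative job could simply ignore the convention (no security).

from pathlib import Path

WORKSPACE = Path("/shared/workspace")  # assumed shared storage, for illustration


def job_prefix(job_name: str) -> Path:
    """By convention, a job only reads/writes under its own subdirectory."""
    return (WORKSPACE / job_name).resolve()


def write_output(job_name: str, relative_path: str, data: str) -> None:
    target = (job_prefix(job_name) / relative_path).resolve()
    # Advisory check only: the job runs with full filesystem access, so a
    # scheming job could skip this helper and write anywhere it likes.
    if not target.is_relative_to(job_prefix(job_name)):
        raise ValueError(f"{target} is outside {job_name}'s conventional prefix")
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(data)


# Usage: write_output("train-run-17", "metrics/loss.json", '{"loss": 0.42}')
```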
The thing you're describing here is committing to punish the AIs for defecting against us. I think that this is immoral and an extremely bad idea.
How would we do this?
I don't believe that an AI that's not capable of automating ML research or doing most remote work is going to be able to do that!
I think that the superhuman coder probably doesn't have that good a chance of betraying us. How do you think it would do so? (See "early schemers' alternatives to making deals".)
How narrow do you mean? E.g. I think that AIs up to the AI 2027 "superhuman coder" level probably don't have a good chance of successfully betraying us.
By terrorism I assume you mean a situation like hostage-taking, e.g. a terrorist kidnapping someone and threatening to kill them unless you pay a ransom.
That's importantly different. It's bad to pay ransoms because, if the terrorist knows you don't pay ransoms, they won't threaten you; so the commitment not to pay makes you better off. Here, (conditional on the AI already being misaligned) we are better off if the AI offers the deal than if it doesn't.
At @ryan_greenblatt's prodding, I've added the following paragraph:
To be fully clear: deploying known scheming models is likely to be highly risky and should only be done in truly desperate circumstances. Our current best guess for the bare minimum policy that a reasonable AI company would follow is as follows: Upon first finding evidence that one of their production models is scheming (i.e., actions as egregious as seriously trying to escape or sabotage research in normal usage), an AI company should publicly disclose this and halt virtually all usage of all models of a similar level of capability or greater. The model (and any other model at a similar or greater level of capability) should only be (re-)deployed (internally or otherwise) if an independent and non-conflicted third-party group with expertise in AI safety believes that deploying the model is the best approach for the company to reduce risk. (If there is a government-appointed group which could serve this function, this group should be used.) This still applies if the company believes they have determined why scheming arose and resolved the issue in the new model they wish to deploy. The decision to deploy the model, the rationale for this, and the membership of this third-party group should also be disclosed prior to the deployment actually occurring. My guess is that if AI companies deploy known scheming models, they won't actually be as responsible as I'm describing in this paragraph.
Ryan discusses this at more length in his 80K podcast.