AI ALIGNMENT FORUM

Buck Shlegeris

CEO at Redwood Research.

AI safety is a highly collaborative field--almost all the points I make were either explained to me by someone else, or developed in conversation with other people. I'm saying this here because it would feel repetitive to say "these ideas were developed in collaboration with various people" in all my comments, but I want to have it on the record that the ideas I present were almost entirely not developed by me in isolation.

Please contact me via email (bshlegeris@gmail.com) instead of messaging me on LessWrong.

Posts

Sorted by New

3 · Buck's Shortform · 6y · 78

Wikitag Contributions

No wikitag contributions to display.

Comments

Sorted by Newest
ryan_greenblatt's Shortform
Buck Shlegeris · 6d

Ryan discusses this at more length in his 80K podcast.

What's worse, spies or schemers?
Buck Shlegeris · 6d

You can undeploy them, if you want!

One difficulty is again that the scheming is particularly correlated. Firing a single spy might not be traumatic for your organization's productivity, but ceasing all deployment of untrusted models plausibly grinds you to a halt.

And in terms of fixing them, note that it's pretty hard to fix spies! I think you're in a better position for fixing schemers than spies, e.g. see here.

How much novel security-critical infrastructure do you need during the singularity?
Buck Shlegeris · 11d

I agree that fine-grained configuration mechanisms probably have security benefits, and that they might net increase security, as I say in the post.

I agree that the increasing value of compute might increase the isolation you use, for the reason you said. One reason I'm skeptical is that you can get almost all of that value by having AIs voluntarily adopt mechanisms that generally isolate their jobs from the other jobs that are running (e.g. naming conventions about who is allowed to read or write what), which get you the reliability benefits without getting any security.
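
To make the reliability/security distinction concrete, here is a minimal illustrative sketch (my own, not from the post or any real codebase) of what such a voluntary naming convention might look like: each job agrees to confine its writes to its own directory prefix and runs an advisory check on itself before writing. The names `JOBS_ROOT`, `job_prefix`, and `advisory_write` are hypothetical.

```python
from pathlib import Path

# Illustrative sketch (hypothetical names throughout): a *voluntary*
# isolation convention where each job agrees to write only under
# /data/jobs/<job_name>/. Nothing enforces this against an agent that
# ignores the convention, so it buys reliability (no accidental
# clobbering of other jobs' files), not security.

JOBS_ROOT = Path("/data/jobs")  # hypothetical shared scratch area


def job_prefix(job_name: str) -> Path:
    """Directory this job has 'agreed' to confine its writes to."""
    return JOBS_ROOT / job_name


def advisory_write(job_name: str, target: Path, contents: str) -> None:
    """Write `contents` to `target`, but only if it falls inside this
    job's own prefix. This is an advisory check the agent runs on
    itself; a deliberately misbehaving agent could call open() directly."""
    prefix = job_prefix(job_name).resolve()
    resolved = target.resolve()
    if not resolved.is_relative_to(prefix):
        raise ValueError(f"{job_name} tried to write outside its prefix: {resolved}")
    resolved.parent.mkdir(parents=True, exist_ok=True)
    resolved.write_text(contents)


# A job writing its own outputs succeeds:
#   advisory_write("train-run-17", job_prefix("train-run-17") / "ckpt.json", "{}")
# The same job accidentally targeting another job's directory raises an
# error, which is the reliability benefit. A scheming agent simply
# bypasses advisory_write entirely, which is why this provides no security:
#   advisory_write("train-run-17", job_prefix("eval-run-3") / "results.txt", "oops")
```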

Making deals with early schemers
Buck Shlegeris · 23d

The thing you're describing here is committing to punish the AIs for defecting against us. I think that this is immoral and an extremely bad idea.

Making deals with early schemers
Buck Shlegeris · 23d

How would we do this?

Making deals with early schemers
Buck Shlegeris · 24d

I don't believe that an AI that's not capable of automating ML research or doing most remote work is going to be able to do that!

Making deals with early schemers
Buck Shlegeris · 24d

I think that the superhuman coder probably doesn't have that good a chance of betraying us. How do you think it would do so? (See "early schemers' alternatives to making deals".)

Making deals with early schemers
Buck Shlegeris · 24d

How narrow do you mean? E.g. I think that AIs up to the AI 2027 "superhuman coder" level probably don't have a good chance of successfully betraying us.

Making deals with early schemers
Buck Shlegeris · 24d

By terrorism I assume you mean a situation like hostage-taking, e.g. a terrorist kidnapping someone and threatening to kill them unless you pay a ransom.

That's importantly different. It's bad to pay ransoms because if the terrorist knows you don't pay ransoms, they won't threaten you, so that commitment makes you better off. Here, (conditional on the AI already being misaligned) we are better off if the AI offers the deal than if it doesn't.

Handling schemers if shutdown is not an option
Buck Shlegeris · 25d

At @ryan_greenblatt's prodding, I've added the following paragraph:

To be fully clear: deploying known scheming models is likely to be highly risky and should only be done in truly desperate circumstances. Our current best guess for the bare minimum policy that a reasonable AI company would follow is as follows: Upon first finding evidence that one of their production models is scheming (i.e., actions as egregious as seriously trying to escape or sabotage research in normal usage), an AI company should publicly disclose this and halt virtually all usage of all models of a similar level of capability or greater. The model (and any other model at a similar or greater level of capability) should only be (re-)deployed (internally or otherwise) if an independent and non-conflicted third-party group with expertise in AI safety believes that deploying the model is the best approach for the company to reduce risk. (If there is a government-appointed group which could serve this function, this group should be used.) This still applies if the company believes they have determined why scheming arose and resolved the issue in the new model they wish to deploy. The decision to deploy the model, the rationale for this, and the membership of this third-party group should also be disclosed prior to the deployment actually occurring. My guess is that if AI companies deploy known scheming models, they won't actually be as responsible as I'm describing in this paragraph.

47 · Recent Redwood Research project proposals · 2d · 0
31 · What's worse, spies or schemers? · 7d · 2
31 · How much novel security-critical infrastructure do you need during the singularity? · 12d · 4
39 · There are two fundamentally different constraints on schemers · 14d · 0
58 · Comparing risk from internally-deployed AI to insider and outsider threats from humans · 6d · 1
49 · Making deals with early schemers · 1mo · 29
49 · Misalignment and Strategic Underperformance: An Analysis of Sandbagging and Exploration Hacking · 2mo · 0
24 · Handling schemers if shutdown is not an option · 3mo · 1
56 · Ctrl-Z: Controlling AI Agents via Resampling · 3mo · 0
18 · How to evaluate control measures for LLM agents? A trajectory from today to superintelligence · 3mo · 0