AI ALIGNMENT FORUM
AF

Buck Shlegeris
Ω3100342252
Message
Dialogue
Subscribe

CEO at Redwood Research.

AI safety is a highly collaborative field--almost all the points I make were either explained to me by someone else, or developed in conversation with other people. I'm saying this here because it would feel repetitive to say "these ideas were developed in collaboration with various people" in all my comments, but I want to have it on the record that the ideas I present were almost entirely not developed by me in isolation.

Please contact me via email (bshlegeris@gmail.com) instead of messaging me on LessWrong.

Posts

Sorted by New

Wikitag Contributions

Comments

Sorted by
Newest
3Buck's Shortform
6y
86
No wikitag contributions to display.
Buck's Shortform
Buck20d226

Ugh, I think you're totally right and I was being sloppy; I totally unreasonably interpreted Eliezer as saying that he was wrong about how long/how hard/how expensive it would be to get between capability levels. (But maybe Eliezer misinterpreted himself the same way? His subsequent tweets are consistent with this interpretation.)

I totally agree with Eliezer's point in that post, though I do wish that he had been clearer about what exactly he was saying.

Reply1
Buck's Shortform
Buck21d2512

@Eliezer Yudkowsky tweets:

> @julianboolean_: the biggest lesson I've learned from the last few years is that the "tiny gap between village idiot and Einstein" chart was completely wrong

I agree that I underestimated this distance, at least partially out of youthful idealism.

That said, one of the few places where my peers managed to put forth a clear contrary bet was on this case.  And I did happen to win that bet.  This was less than 7% of the distance in AI's 75-year journey!  And arguably the village-idiot level was only reached as of 4o or o1.

I was very interested to see this tweet. I have thought of that "Village Idiot and Einstein" claim as the most obvious example of a way that Eliezer and co were super wrong about how AI would go, and they've AFAIK totally failed to publicly reckon with it as it's become increasingly obvious that they were wrong over the last eight years.

It's helpful to see Eliezer clarify what he thinks of this point. I would love to see more from him on this--why he got this wrong, how updating changes his opinion about the rest of the problem, what he thinks now about time between different levels of intelligence.

Reply2
Why it's hard to make settings for high-stakes control research
Buck2mo20

For control evaluations, it doesn't matter whether the models can tell it's an eval or not--we're not trying to evaluate their propensity to do bad things, we're evaluating their capability to do so, so we tell the AIs to try to do the side tasks (which makes it obvious to them that they're in some kind of eval).

Reply
ryan_greenblatt's Shortform
Buck2mo20

Ryan discusses this at more length in his 80K podcast.

Reply
What's worse, spies or schemers?
Buck2mo32

You can undeploy them, if you want!

One difficulty is again that the scheming is particularly correlated. Firing a single spy might not be traumatic for your organization's productivity, but ceasing all deployment of untrusted models plausibly grinds you to a halt.

And in terms of fixing them, note that it's pretty hard for fix spies! I think you're in a better position for fixing schemers than spies, e.g. see here.

Reply
How much novel security-critical infrastructure do you need during the singularity?
Buck2mo40

I agree that fine-grained configuration mechanisms probably increase security, and this might net increase security, as I say in the post.

I agree that the increasing value of compute might increase the isolation you use for the reason you said. One reason I'm skeptical is that you can get almost all that value by having AIs voluntarily adopt mechanisms that generally isolate their jobs from the other jobs that are running (e.g. having naming conventions about who is allowed to read or write what) that get you the reliability benefits without getting any security.

Reply
Making deals with early schemers
Buck3mo1122

The thing you're describing here is committing to punish the AIs for defecting against us. I think that this is immoral and an extremely bad idea.

Reply2
Making deals with early schemers
Buck3mo20

How would we do this?

Reply
Making deals with early schemers
Buck3mo21

I don't believe that an AI that's not capable of automating ML research or doing most remote work is going to be able to do that!

Reply
Making deals with early schemers
Buck3mo20

I think that the superhuman coder probably doesn't have that good a chance of betraying us. How do you think it would do so? (See "early schemers' alternatives to making deals".)

Reply
Load More
30Four places where you can put LLM monitoring
1mo
0
16Research Areas in AI Control (The Alignment Project by UK AISI)
1mo
0
28Why it's hard to make settings for high-stakes control research
2mo
2
47Recent Redwood Research project proposals
2mo
0
32What's worse, spies or schemers?
2mo
2
31How much novel security-critical infrastructure do you need during the singularity?
2mo
4
39There are two fundamentally different constraints on schemers
2mo
0
58Comparing risk from internally-deployed AI to insider and outsider threats from humans
2mo
1
49Making deals with early schemers
3mo
29
50Misalignment and Strategic Underperformance: An Analysis of Sandbagging and Exploration Hacking
4mo
2
Load More