AI ALIGNMENT FORUM

Buck Shlegeris

CEO at Redwood Research.

AI safety is a highly collaborative field--almost all the points I make were either explained to me by someone else, or developed in conversation with other people. I'm saying this here because it would feel repetitive to say "these ideas were developed in collaboration with various people" in all my comments, but I want to have it on the record that the ideas I present were almost entirely not developed by me in isolation.

Please contact me via email (bshlegeris@gmail.com) instead of messaging me on LessWrong.

If we are ever arguing on LessWrong and you feel like it's kind of heated and would go better if we just talked about it verbally, please feel free to contact me and I'll probably be willing to call to discuss briefly.

Comments

The Thinking Machines Tinker API is good news for AI control and security
Buck · 6d

Yeah for sure. A really nice thing about the Tinker API is that it doesn't allow users to specify arbitrary code to be executed on the machine with weights, which makes security much easier.
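As a toy illustration of that property (hypothetical code, not Tinker's actual implementation; `handle_request` and the operation names are made up): the server only dispatches a fixed allowlist of operations, and there is no endpoint that accepts user-supplied code.

```python
# Minimal sketch of the design principle (not Tinker's actual implementation):
# the server only dispatches a fixed allowlist of operations, so users never
# get to run arbitrary code on the machine that holds the weights.
from typing import Any, Callable, Dict

def forward_backward(batch: Any) -> Dict[str, float]:
    return {"loss": 0.0}  # placeholder

def sample(prompt: str) -> str:
    return "..."  # placeholder

ALLOWED_OPS: Dict[str, Callable[..., Any]] = {
    "forward_backward": forward_backward,
    "sample": sample,
}

def handle_request(op_name: str, **kwargs: Any) -> Any:
    # Reject anything not on the allowlist; crucially, no operation accepts
    # user-supplied code or callbacks.
    if op_name not in ALLOWED_OPS:
        raise ValueError(f"unsupported operation: {op_name}")
    return ALLOWED_OPS[op_name](**kwargs)

print(handle_request("sample", prompt="hi"))
```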

The Thinking Machines Tinker API is good news for AI control and security
Buck · 6d

Yeah, what I'm saying is that even if the computation performed in a hook is trivial, it sucks if that computation has to happen on a different computer than the one doing inference.
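To make that concrete, here's a minimal sketch (a toy PyTorch model with a forward hook, not the Tinker API): the hook body is trivial, but it runs synchronously inside the forward pass, so it has to execute on the machine doing inference; if it lived on another computer, every forward pass would block on a network round trip.

```python
# A minimal sketch (toy model + PyTorch forward hook, not the Tinker API) of
# why even a trivial hook needs to run on the inference machine: it executes
# synchronously inside the forward pass.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(16, 16), nn.ReLU(), nn.Linear(16, 4))

def zero_first_unit(module, inputs, output):
    # Trivial computation: ablate one activation dimension.
    output = output.clone()
    output[:, 0] = 0.0
    return output  # returning a value replaces the layer's output

# The hook runs inside model(x); if this body instead lived on the user's
# computer, every forward pass would block on a network round trip.
handle = model[0].register_forward_hook(zero_first_unit)
out = model(torch.randn(8, 16))
handle.remove()
```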

The Thinking Machines Tinker API is good news for AI control and security
Buck · 6d

Yeah, totally, there's a bunch of stuff like this you could do. The two main issues:

  • Adding methods like this might increase complexity, and if you add lots of them they might interact in ways that allow users to violate your security properties.
  • Some natural things you'd want to do for interacting with activations (e.g. applying arbitrary functions to modify activations during a forward pass) would substantially reduce the efficiency and batchability here--the API server would have to block inference while waiting for the user's computer to compute the change to activations and send it back (see the sketch at the end of this comment).

It would be a slightly good exercise for someone to go through the most important techniques that interact with model internals and see how many of them would have these problems. 
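As a toy sketch of the second issue (hypothetical names like `edit_activations_remotely`, not the Tinker API), here's roughly why letting clients modify activations mid-forward-pass is costly: the forward pass stalls at the edited layer until the client's result comes back, which is what hurts batching.

```python
# Toy sketch of the blocking problem (hypothetical names, not the Tinker API):
# if clients can apply arbitrary functions to activations mid-forward-pass,
# the server has to pause at that layer and wait for the client's edit.
import time
import torch
import torch.nn as nn

layers = nn.ModuleList([nn.Linear(16, 16) for _ in range(4)])

def edit_activations_remotely(acts: torch.Tensor) -> torch.Tensor:
    # Stand-in for shipping activations to the user's machine and waiting for
    # the edited tensor; the batch slot holding this request idles meanwhile.
    time.sleep(0.05)  # pretend network round trip
    return acts * 0.9

def run_layers(x: torch.Tensor, edit_at_layer: int) -> torch.Tensor:
    for i, layer in enumerate(layers):
        x = torch.relu(layer(x))
        if i == edit_at_layer:
            x = edit_activations_remotely(x)  # inference blocks here
    return x

out = run_layers(torch.randn(8, 16), edit_at_layer=1)
```

A real server could try to overlap this wait with other work, but the batch slot holding this request is still stalled for the full round trip.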

Buck's Shortform
Buck · 2mo

Ugh, I think you're totally right and I was being sloppy; I totally unreasonably interpreted Eliezer as saying that he was wrong about how long/how hard/how expensive it would be to get between capability levels. (But maybe Eliezer misinterpreted himself the same way? His subsequent tweets are consistent with this interpretation.)

I totally agree with Eliezer's point in that post, though I do wish that he had been clearer about what exactly he was saying.

Buck's Shortform
Buck · 2mo

@Eliezer Yudkowsky tweets:

> > @julianboolean_: the biggest lesson I've learned from the last few years is that the "tiny gap between village idiot and Einstein" chart was completely wrong
>
> I agree that I underestimated this distance, at least partially out of youthful idealism.
>
> That said, one of the few places where my peers managed to put forth a clear contrary bet was on this case. And I did happen to win that bet. This was less than 7% of the distance in AI's 75-year journey! And arguably the village-idiot level was only reached as of 4o or o1.

I was very interested to see this tweet. I have thought of that "Village Idiot and Einstein" claim as the most obvious example of a way that Eliezer and co. were super wrong about how AI would go, and AFAIK they've totally failed to publicly reckon with it, even as it's become increasingly obvious over the last eight years that they were wrong.

It's helpful to see Eliezer clarify what he thinks of this point. I would love to see more from him on this--why he got this wrong, how updating changes his opinion about the rest of the problem, and what he thinks now about the time it takes to get between different levels of intelligence.

Why it's hard to make settings for high-stakes control research
Buck · 3mo

For control evaluations, it doesn't matter whether the models can tell it's an eval or not--we're not trying to evaluate their propensity to do bad things, we're evaluating their capability to do so. So we tell the AIs to try to do the side tasks (which makes it obvious to them that they're in some kind of eval).

ryan_greenblatt's Shortform
Buck · 3mo

Ryan discusses this at more length in his 80K podcast.

What's worse, spies or schemers?
Buck · 3mo

You can undeploy them, if you want!

One difficulty is again that the scheming is particularly correlated. Firing a single spy might not be traumatic for your organization's productivity, but ceasing all deployment of untrusted models plausibly grinds you to a halt.

And in terms of fixing them, note that it's pretty hard to fix spies! I think you're in a better position for fixing schemers than spies, e.g. see here.

How much novel security-critical infrastructure do you need during the singularity?
Buck · 3mo

I agree that fine-grained configuration mechanisms probably improve security in that respect, and that they might net increase security, as I say in the post.

I agree that the increasing value of compute might increase the isolation you use, for the reason you said. One reason I'm skeptical is that you can get almost all of that value by having AIs voluntarily adopt mechanisms that generally isolate their jobs from the other jobs that are running (e.g. naming conventions about who is allowed to read or write what), which get you the reliability benefits without providing any security.
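As a concrete (hypothetical) example of the kind of voluntary convention I have in mind: each job writes only under its own prefix via a helper like the one below, which prevents accidental interference between jobs but enforces nothing, so it buys reliability rather than security.

```python
# Hypothetical sketch of a voluntary isolation convention: each job writes
# only under its own prefix. This avoids jobs accidentally clobbering each
# other (reliability), but nothing enforces it, so it provides no security
# against a job that simply ignores the convention.
from pathlib import Path

SCRATCH = Path("/tmp/scratch")

def job_write(job_name: str, relative_path: str, data: str) -> Path:
    # By convention, the job stays inside /tmp/scratch/<job_name>/.
    target = SCRATCH / job_name / relative_path
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(data)
    return target

job_write("experiment_a", "results.txt", "loss=0.3")
# Nothing stops a misbehaving job from writing outside its prefix:
# Path("/tmp/scratch/experiment_b/results.txt").write_text("clobbered")
```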

Making deals with early schemers
Buck · 4mo

The thing you're describing here is committing to punish the AIs for defecting against us. I think that this is immoral and an extremely bad idea.

Posts (sorted by new)

  • The Thinking Machines Tinker API is good news for AI control and security · 41 points · 6d · 7 comments
  • Four places where you can put LLM monitoring · 30 points · 2mo · 0 comments
  • Research Areas in AI Control (The Alignment Project by UK AISI) · 16 points · 2mo · 0 comments
  • Why it's hard to make settings for high-stakes control research · 28 points · 3mo · 2 comments
  • Recent Redwood Research project proposals · 47 points · 3mo · 0 comments
  • What's worse, spies or schemers? · 32 points · 3mo · 2 comments
  • How much novel security-critical infrastructure do you need during the singularity? · 31 points · 3mo · 4 comments
  • There are two fundamentally different constraints on schemers · 39 points · 3mo · 0 comments
  • Comparing risk from internally-deployed AI to insider and outsider threats from humans · 58 points · 3mo · 1 comment
  • Making deals with early schemers · 54 points · 4mo · 29 comments