Buck Shlegeris

CEO at Redwood Research.

AI safety is a highly collaborative field--almost all the points I make were either explained to me by someone else, or developed in conversation with other people. I'm saying this here because it would feel repetitive to say "these ideas were developed in collaboration with various people" in all my comments, but I want to have it on the record that the ideas I present were almost entirely not developed by me in isolation.


Re a, there’s nothing more specific on this than what we wrote in “the case for ensuring”. But I do think that our answer there is pretty good.

Re b, no; we need to write some version of that up. I think our answer here is ok but not amazing; writing it up is on the list.

FWIW, I don't think of myself as having a super strong opinion on whether working on control is better than working on preventing people from building catastrophically dangerous AI (and I've seriously considered doing the latter).

(I do feel more optimistic than you about being able to use controlled AIs for something useful.)

I agree, and regret focusing as much as we did on black-box control in the past; I'm excited for work on "white box control" (there's some under way, and I'm excited for more).

We focused on black box control because it’s much more obvious how you’d make a legible and conservative safety case for it. But in hindsight this was a mistake, I think: people probably aren’t going to actually make legible or conservative safety arguments for their deployments, so there’s less of an advantage than I’d thought.

I plan to write more about white box control soon (but I plan to write about a lot of stuff…)

I see where you're coming from, and can easily imagine things going the way you described. My goal with this post was to note some of the ways that it might be harder than you're describing here.

I think it’s a bit stronger than the usual paralysis argument in this case, but yeah.

Note that people who have a non-consequentialist aversion to the risk of causing damage should have other problems with working for Anthropic. E.g. I suspect that Anthropic is responsible for more than a million deaths of currently-alive humans in expectation.

My reasoning is: “If the AI won, who cares if we know it or not? We’ll find out eventually :/”

This isn’t totally correct but it seems like a fine approximation given all the other approximations here.

When I'm thinking about AI control in the context of concentrated failures--that is, cases where an unacceptable outcome might be caused by a small number of bad AI actions, e.g. the setting explored in our control paper--I find it helpful to use the following simplified models.

Let's consider our AI deployment from both a micro perspective, where we just think about a single task the AI is doing, and a macro perspective, where we think about the deployment as comprised of many tasks.

From the micro perspective, when an AI tries to attack on a particular task, this can end in three ways:

  • The attack succeeds, and so the AI wins.
  • We catch the AI red-handed before the attack completes, and so it loses.
  • The attack fails silently; we don't notice it, and so the game continues as before.

And from the macro perspective, the game ends one of these ways:

  • An attack succeeded; the AI wins.
  • You caught the AI; you win.
  • The game finished naturally because all the tasks were completed; you win.

I often ignore the possibility of the game finishing naturally, because that won't happen if the AI attempts to attack often enough, which is the right strategy for the AI in many games.

Under that assumption, the probability of AI victory is determined by the ratio between the probability that an attack ends in success and the probability that it ends with us catching the AI: P(AI wins) = P(success) / (P(success) + P(catch)). The probability that the game continues doesn't matter, because if the game continues, the AI will eventually attack again on a later task.
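A minimal sketch of this calculation (the probability values are illustrative numbers I made up, not from the post), checking the ratio against summing the geometric series over repeated attacks directly:

```python
# Micro/macro model sketch: each attack ends in one of three outcomes:
#   p_success -> AI wins, p_catch -> we win, remainder -> game continues.

def p_ai_wins(p_success: float, p_catch: float) -> float:
    """If the AI keeps attacking until the game ends, the continuation
    branch drops out and its win probability is the ratio of the two
    terminal outcomes."""
    return p_success / (p_success + p_catch)

def p_ai_wins_series(p_success: float, p_catch: float, terms: int = 10_000) -> float:
    """Same quantity computed the long way: the AI wins on attack k iff
    the first k-1 attacks all failed silently and attack k succeeds."""
    p_silent = 1.0 - p_success - p_catch
    return sum(p_silent**k * p_success for k in range(terms))

# With illustrative per-attack numbers, the two computations agree:
assert abs(p_ai_wins(0.01, 0.04) - p_ai_wins_series(0.01, 0.04)) < 1e-9
print(p_ai_wins(0.01, 0.04))  # ~0.2: the AI wins 1 time in 5
```

Note that scaling p_success and p_catch by a common factor leaves the answer unchanged, which is why only their ratio matters.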

I also often think about cases where you need to catch the AI some larger number of times (let's say 30 times) before you win; you can approximate this by reducing P(catch) by 30x and increasing P(silent failure) to compensate.

  • I'm somewhat more pessimistic than I was in the past about the prospects for catch-then-shut-down. In particular, I think you might need to catch many escape attempts before you can make a strong case for shutting down. (For concreteness, I mostly imagine situations where we need to catch the model trying to escape 30 times.)
  • I am still optimistic about few-shot catastrophe prevention; this post is just about whether you can shut down, not whether you can take advantage of the escapes to improve safety (if you try to).
  • As you note, one difference between this post and my previous claims is that in the past I've mostly talked about what an AI lab would do to mitigate danger from its own models, and here I also talk about other labs' models.

(Note also that the old posts you linked were coauthored with Ryan, so there's some variation from the fact that he and I don't have exactly the same beliefs.)

I'd much rather an interpretability team hire someone who's intrinsically fascinated by neural networks (but doesn't think much about alignment) than someone who deeply cares about making AI go well (but doesn't find neural nets very interesting).

I disagree: I'd rather they hire someone who cares about making AI go well. E.g. I like Sam Marks's work on making interpretability techniques useful (e.g. here), and I think he gets a lot of leverage compared to most interpretability researchers via trying to do stuff that's in the direction of being useful. (Though note that his work builds on the work of non-backchaining interpretability researchers.)
