Buck Shlegeris


AGI Ruin: A List of Lethalities

But... shouldn't this mean you expect AGI civilization to totally dominate human civilization? They can read each other's source code, and thus trust much more deeply! They can transmit information between them at immense bandwidths! They can clone their minds and directly learn from each other's experiences!

I don't think it's obvious that this means that AGI is more dangerous, because it means that for a fixed total impact of AGI, the AGI doesn't have to be as competent at individual thinking (because it leans relatively more on group thinking). And so at the point where the AGIs are becoming very powerful in aggregate, this argument pushes us away from thinking they're good at individual thinking.

Also, it's not obvious that early AIs will actually be able to do this if their creators don't find a way to train them to have this affordance. ML doesn't currently normally make AIs which can helpfully share mind-states, and it probably requires non-trivial effort to hook them up correctly to be able to share mind-state.

The prototypical catastrophic AI action is getting root access to its datacenter

I'm using "catastrophic" in the technical sense of "unacceptably bad even if it happens very rarely, and even if the AI does what you wanted the rest of the time", rather than "very bad thing that happens because of AI". Apologies if this was confusing.

My guess is that you will wildly disagree with the frame I'm going to use here, but I'll just spell it out anyway: I'm interested in "catastrophes" as a remaining problem after you have solved the scalable oversight problem. If your AI is able to do one of these "positive-sum" pivotal acts in a single action, and you haven't already lost control, then you can use your overseer to oversee the AI as it takes actions. By assumption, you only have to watch it for a small number of actions (maybe I want to say episodes rather than actions) before it's done some crazy powerful stuff and saved the world. So I think I stand by the claim that those pivotal acts aren't where much of the x-risk from AI catastrophic action (in the specific sense I'm using) comes from.

Thanks again for your thoughts here, they clarified several things for me.

The prototypical catastrophic AI action is getting root access to its datacenter

It seems pretty plausible that the AI will trade for compute with some other person around the world.

Whether this is what I'm trying to call a zero-sum action depends on whose resources it's trading. If the plan is "spend a bunch of the capital that its creators have given it on compute somewhere else", then I think this is importantly zero-sum--the resources are being taken from the creators of AI, which is why the AI was able to spend so many resources. If the plan was instead "produce some ten trillion dollar invention, then sell it, then use the proceeds to buy compute elsewhere", this would seem less zero-sum, and I'm saying that I expect the first kind of thing to happen before the second.

The prototypical catastrophic AI action is getting root access to its datacenter

I feel like the focus on getting access to its own datacenter is too strong in this story. Seems like it could also just involve hacking some random remote server, or convincing some random person on the internet to buy some compute for it, or to execute some other plan for it (like producing a custom chip), or convincing a researcher that it should get more resources on the existing datacenter, or threatening some other stakeholder somewhere in order to be given power or compute of some kind. Also, all of these would of course be selected for being the plans least likely to be detected, if the AI expects to eventually gain the power to do some kind of pivotal thing here. I agree that a cybersecurity problem is on-priors likely, but this current archetype feels too narrow to me.

Except for maybe "producing a custom chip", I agree with these as other possibilities, and I think they're in line with the point I wanted to make, which is that the catastrophic action involves taking someone else's resource such that it can prevent humans from observing it or interfering with it, rather than doing something which is directly a pivotal act.

Does this distinction make sense?

Maybe this would have been clearer if I'd titled it "AI catastrophic actions are mostly not pivotal acts"?

Buck's Shortform

From Twitter:

Simply adding “Let’s think step by step” before each answer increases the accuracy on MultiArith from 17.7% to 78.7% and GSM8K from 10.4% to 40.7% with GPT-3. 

I’m looking forward to the day where it turns out that adding “Let’s think through it as if we were an AI who knows that if it gets really good scores during fine tuning on helpfulness, it will be given lots of influence later” increases helpfulness by 5% and so we add it to our prompt by default.

The case for becoming a black-box investigator of language models

Yeah I think things like this are reasonable. I think that these are maybe too hard and high-level for a lot of the things I care about--I'm really interested in questions like "how much less reliable is the model about repeating names when the names are 100 tokens in the past instead of 50", which are much simpler and lower level.
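
To make that kind of low-level question concrete, here is a minimal sketch of how one might set up the name-repetition probe. Everything in it is an assumption for illustration: `build_recall_prompt`, `recalled`, and `accuracy_by_distance` are hypothetical helper names, `complete` stands in for whatever function sends a prompt to a model and returns its completion, and distance is counted in whitespace-separated filler words rather than real tokens (a real experiment would count with the model's own tokenizer).

```python
from typing import Callable, Dict, Iterable

# Neutral filler text, cycled to pad the gap between the name and the query.
FILLER = "the quick brown fox jumps over the lazy dog".split()


def build_recall_prompt(name: str, distance: int) -> str:
    """Put `distance` filler words between a name and a query asking for it.

    Assumption: words are used as a rough proxy for tokens.
    """
    filler = [FILLER[i % len(FILLER)] for i in range(distance)]
    return f"My friend is called {name}. " + " ".join(filler) + " My friend is called"


def recalled(completion: str, name: str) -> bool:
    """True if the first word of the completion is the name (ignoring punctuation)."""
    words = completion.strip().split()
    return bool(words) and words[0].strip(".,!?") == name


def accuracy_by_distance(
    complete: Callable[[str], str],  # hypothetical: prompt -> model completion
    names: Iterable[str],
    distances: Iterable[int],
) -> Dict[int, float]:
    """Fraction of names repeated correctly at each distance."""
    names = list(names)
    results: Dict[int, float] = {}
    for d in distances:
        hits = sum(recalled(complete(build_recall_prompt(n, d)), n) for n in names)
        results[d] = hits / len(names)
    return results
```

Calling `accuracy_by_distance(complete, names, [50, 100])` with a real model wrapper would give one accuracy figure per distance, which is the comparison the question asks about.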

The case for becoming a black-box investigator of language models

Yeah I wrote an interface like this for personal use, maybe I should release it publicly.

Takeoff speeds have a huge effect on what it means to work on AI x-risk

I expect that people will find it pretty obvious that RLHF leads to somewhat misaligned systems, if they are widely used by the public. Like, I think that most ML researchers agree that the Facebook Newsfeed algorithm is optimizing for clicks in a way people are somewhat unhappy about, and this is based substantially on their personal experience with it; inasmuch as we’re interacting a lot with sort-of-smart ML systems, I think we’ll notice their slight misalignment. And so I do think that this will make AI takeover risk more obvious.

Examples of small AI catastrophes will also probably make takeover risk more obvious.

I guess another example of this phenomenon is that a bunch of people are more worried about AI takeover than they were five years ago, because they’ve seen more examples of ML systems being really smart, even though they wouldn’t have said five years ago that ML systems could never solve those problems. Seeing the ways things happen is often pretty persuasive to people.

Buck's Shortform

Are any of these ancient discussions available anywhere?

Buck's Shortform

In hindsight this is obviously closely related to what Paul was saying here: https://ai-alignment.com/mundane-solutions-to-exotic-problems-395bad49fbe7
