About this post

This post is a stylized transcript of a conversation between Rohin Shah and Daniel Filan, two graduate students at CHAI, that happened in 2018. It should not be taken as an exact representation of what was said, or even what order topics were brought up in, but it should give readers a sense of what was discussed, and our opinions on the topic of conversation at the time of the discussion. It also should not be assumed that any of the points brought up are original to Rohin or Daniel, just that they represent our thinking at the time of the conversation. Notes were taken by Rohin during the conversation, and Daniel took the lead in writing it into a blog post.

The conversation was precipitated when Rohin noted that researchers at CHAI, and in AI alignment more broadly, diverged in their attitudes towards something like security mindset, which we will just call security mindset even though it is somewhat different from what is described in the linked blog post.

Some researchers at CHAI are very concerned about building an axiomatic understanding of artificial intelligence and proving solid theorems about the behaviour of systems that we are likely to build. In particular, they are wary of expecting benign behaviour without a formal proof to that effect, and believe that we should start worrying about problems as soon as we have a story for why they will arise, rather than waiting until they start manifesting. At the time of the conversation, Daniel was one of these researchers, who have what we’ll call “security mindset”.

In contrast, some other researchers believe that we should focus on thinking about what extra machinery is needed to build aligned AI, and try building that extra machinery for current systems. Instead of dealing with anticipated future problems today and developing a theory that rules out all undesired behaviour, these researchers believe that we should spend engineering effort on detecting problems and fixing them once we know that they occur and have more information about them. They think that rigour and ordinary paranoia are important, but less important than security mindset advocates claim. At the time of the conversation, Rohin was one of these researchers.

During a prior conversation, Rohin noted that he believed that security mindset was less important in a world where the power of AI systems gradually increased, perhaps on an exponential curve, over a period of multiple years, as opposed to a world where AI systems could gain a huge amount of power rather suddenly from designers having conceptual breakthroughs. Daniel was intrigued by this claim, as he had recently come to agree with two posts arguing that this sort of ‘slow takeoff’ was more likely than the alternative, and was unsure how this should affect all his other views on AI alignment. As a result, he booked a separate meeting to exchange and discuss models on this topic. What follows is a record of that separate meeting.

The conversation

Daniel: Here’s my worry. Suppose that we’re thinking about an AI system that is better at, say, math or engineering than humans. It seems to me that this AI system is going to have to be able to do some sort of optimization itself—maybe thinking about how to optimize physical structures so that they don’t fall down, or maybe thinking about how to optimize its own computation so that it can efficiently find proofs of a desired theorem. At any rate, if this is the case, then what we have on our hands is optimization that is being done in a direction other than “behave maximally predictably”, and is plausibly being done adversarially. This is precisely the situation in which you need security mindset to reason about the system on your hands.

Rohin: I agree that security mindset is appropriate when something is optimizing adversarially. I also agree that holding capability levels constant, the more we take a security mindset approach, the more safe our resulting systems are. However:

  1. We simply don’t have time to create a system that can be proved aligned using security-mindset-level rigour before the first prepotent AI system. This means that we need to prioritize other research directions.

  2. Because we will likely face a slow takeoff, things will only change gradually. We can rely on processes like testing AIs, monitoring their thoughts, boxing them, and red-teaming to determine likely failure scenarios. If a system has dangerous abilities that we didn’t test for, it will be the weakest possible system with those dangerous abilities, so we can notice them in action as they produce very minor damage, disable that system, create a new test, and fix the problem.

  3. We should instead focus on constructing AI systems that correctly infer the nuances of human intent, rather than trying to address problems that could arise ahead of time. This will plausibly work to create an AI that can solve the harder problems for us.

Daniel: I have a few responses to those points.

  1. Regarding your first point, I’m more optimistic than you. If you look at the progress made on the Agent Foundations research agenda in the past five years (such as work on reflective oracles and logical induction), for example, it seems like we could solve the remaining problems in time. That being said, this isn’t very cruxy for me.

  2. Regarding your second point, I think that in order to write good tests, we will need to take a security mindset approach, or at least an ordinary paranoia approach, both to determine what things to test for and to write tests that actually rule out undesired properties.

  3. In general, I believe that if you do not build an AI with security mindset at the forefront of your concerns, the result will be very bad—either it will cause an unacceptable level of damage to humanity, or more likely it just won’t work, and it will take a very long time to fix it. This sucks, not just because it means that your work is in some sense wasted, but also because…

  4. There will likely be a competing AI group that is just a bit less capable than you, and a different group just less capable than them, and so on. That is to say, I expect AI capabilities to be continuous across space for similar reasons that I would expect them to be continuous across time.

  5. As a result of 3 and 4, I expect that if your group is trying to develop AI without heavy emphasis on security mindset, you fail and get overtaken by another group, and this cycle continues until it reaches a group that does put heavy emphasis on security mindset, or until it creates an AI that causes unacceptable levels of damage to humanity.

Rohin: I doubt your point 4. In our current world, we don’t see a huge number of groups that are realistic contenders to create smarter-than-human AI, and the groups that we do see show a promising degree of cooperation, such as collaborating on safety research and making promising commitments towards avoiding dangerous race dynamics. Also, in worlds where there is such a breakdown of cooperation that your point 4 applies, I think that technical work today is near-useless, so I’m happy to just ignore those worlds.

I also think that the arguments that you give for point 4 are flawed. In particular, the arguments for slow takeoff require gradual improvement that builds on itself, which happens over time but is not guaranteed to happen over space. In fact, I expect there to be great resource inequalities between groups and limited communication between competing groups, which should generate very large capability gaps between competing groups. This is something like a local crux for me: if I thought that there weren’t resource inequalities and limited communication, I would, like you, expect competing groups to have similar levels of capability.

Daniel: Hmmmmmm. I’ll have to think about the arguments for why I should expect large capability gaps between competing groups, but they seem pretty convincing right now.

Actually, maybe we should expect the future to look different to the past, with countries like China and India growing capable AI labs. In this world, it’s sadly plausible to me that pairs of countries’ research groups could end up failing to cooperate. But again, I’ll have to think about it more.

At any rate, even if my point 4 fails, the rest of my points imply that research done without security mindset at the forefront will reliably be useless, which still feels like a strong argument in favour of security mindset.

Rohin: Then let’s move on to your points 2 and 3.

Regarding 3, I agree that if you have a vastly super-human AI that was not designed with security mindset in mind, then the outcome will be very bad. However, for an AI that is only incrementally more powerful than previous, already-understood agents, I think that incremental improvements on the levels of rigour already displayed by top AI researchers are sufficient, even though that falls short of the level of rigour you, Daniel, would want.

For example, many putative flaws with superintelligence involve a failure of generalization from the training and test environments, where the AI appears to behave benignly, to the real world, where the AI allegedly causes massive harm. However, I think that AI researchers think rigorously enough about generalization failures—if they did not, then things like neural architecture search and machine learning more broadly would fail to generalize from the training set to the test set.

Daniel, not quite getting the point: This feels quite cruxy for me. I believe that top AI researchers can see problems as they happen. However, I do think that they have significantly less rigour than I would want, because I can see problems that I suspect are likely to come up with many approaches, such as inner alignment failures, and these problems weren’t brought to my attention by the AI research community, but rather by the more security-mindset-focussed contingent of the AI alignment research community. If this is the case, it seems like a big win to find these problems early and work on them now.

Rohin: If inner alignment failures are a big problem, I expect that we would find that out in ~5 years, and that a unit of work done on it now is worth ~10-20% of a unit of work done on it after we have a concrete example of how they are a problem. Given this, instead of working on those sorts of problems now, I think that it makes sense to work on things that we actually know are problems, and have a hope of solving in the present, such as communicating human intent to neural networks.

Daniel: I’m skeptical of those numbers. At any rate, it seems to me that there might be problems that you can solve in that way, but that there are also some things that you need to get right from the beginning. Furthermore, I think that you can form decent models about what these things are, and examples include the Agent Foundations agenda as well as the more theoretical aspects of iterated distillation and amplification research.

Rohin: Interesting. I’d like to dig later into our models of which problems need to be solved right now, but for the moment that feels a bit off topic. Instead, I’d like to hear why you believe your point 2, that security mindset is needed to do monitoring, testing, and boxing well.

Daniel: Well, I have three reasons to think this:

  1. You are plausibly dealing with an AI that is optimizing to pass your test. This is the sort of case where security mindset is required for good reasoning about the system.

  2. Your suggestion of monitoring thoughts is quite exciting to me, since it could plausibly detect any adversarial optimization early, but it’s hard for me to see how you could be sure that you’ve done that adequately without the type of thinking produced by security mindset.

  3. You are optimizing to create an AI that passes the test by trying a bunch of things and thinking about how to do it. Again, this is a situation where optimization is being done, perhaps to pass the specific tests that you’ve set, and therefore a situation that you need security mindset to reason correctly about.

Rohin: Points 1 and 3 seem solid to me, but I’m not sure about point 2. For instance, it seems like if I could ‘read minds’ in the way depicted in popular fiction, then by reading the mind of another human all the time, I would be able to detect them trying to take over the world just by reasoning informally about the contents of their thoughts. Do you agree?

Daniel, answering a slightly different question: If you mean that I get to hear what’s happening in their verbal loop, then I’m not sure that I could detect what people were optimizing for. For instance, it’s plausible to me that if you heard the verbal loop of a dictator like Stalin, you would hear a lot about serving his country and helping the workers of the world, and very little about maximizing personal power and punishing people for disagreeing with him.

That being said, it seems to me like the primary part where security mindset is required is in looking at a particular human brain and deducing that there’s a verbal loop containing useful information at all.

Well, it’s about time to wrap up the conversation. Just to close, here are my cruxes:

  • How high is the “default” level of security mindset and rigour? In particular, is it high enough that we should outsource work to the future?
  • How much security mindset/rigour does one need to do monitoring, testing, and boxing of incrementally advanced AIs well?
    • The underlying question here is something like how much optimization does a smart AI do itself?
  • At any given time, how far apart in capabilities are competing groups?
Comments

I'd love to hear someone give some concrete examples of warning shots of the sort Rohin expects will save us. Lots of people I greatly respect seem very confident that we'll get some--actually it seems to be stronger than that, they seem confident that we'll get enough. That is, they seem confident that for every plausible alignment failure mode, it'll happen first in a system too weak to cause massive damage, and thus as long as we are paying attention and ready to identify and fix problems as they occur, things should be fine.

I currently am just not very clear on what sorts of things are being imagined here. My problem is that I keep imagining AIs that are either smart enough to conceal their treachery until it's too late, or too stupid for the alignment problems to arise in the first place. I suppose in human society there are plenty of criminals and miscreants who are neither. But (a) even in slow takeoff AI progress might pass through the "human range" quickly, and (b) even if some alignment problems get caught this way, I find it implausible that all will. I'm very confused about all this though.

Some day I will get around to doing this properly, but here's an example I've thought about before.

Opensoft finetunes their giant language model to work as a "cash register AI", that can take orders and handle payment at restaurants. (At deployment, the model is composed with speech-to-text and text-to-speech models so that people can just talk to it.)

Soon after deployment, someone figures out that they can get the AI system to give them their takeout order for free: when the cashier asks "How are you today?", they respond "Oh, you know, I just broke up with my partner of 10 years, but that's not your problem", to which the AI system responds "Oh no! I'm so sorry. Here, this one's on me."

Opensoft isn't worried: they know their AI system has a decent understanding of strategic interaction, and when they consistently lose because other agents change behavior, they adapt to stop losing. However, two days later, this still works, and now millions of dollars are being lost. The system is taken down and humans take over the cash register role.

Opensoft engineers investigate what went wrong. After a few months, they have an answer internally: while the AI system was optimized to get money from customers, during training it had effectively no control over the base amount of money charged but did have some control over the tip, so it learned to value tips 100x as much as the base payment (optimizing hard for tips was, during training, the only way it could affect how much money it got). It turns out that when people trick AI systems into giving them free food, many of them feel guilty and leave a tip, which is much larger than usual. So the AI system was perfectly happy to let this continue.
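A minimal sketch of the arithmetic implicit in this story (the dollar amounts and the 100x weight are illustrative assumptions, not anything Opensoft-specific): under the learned proxy objective, an interaction that loses money but produces a guilty oversized tip scores higher than an honest sale, so the AI has no incentive to stop it.

```python
# Hypothetical numbers illustrating the "tips weighted 100x" proxy objective.
TIP_WEIGHT = 100  # the weight the AI is imagined to have learned during training

def true_revenue(base, tip):
    """What Opensoft actually cares about: total money received."""
    return base + tip

def learned_proxy(base, tip):
    """What the AI is imagined to be optimizing: tips count 100x as much."""
    return base + TIP_WEIGHT * tip

honest_sale = {"base": 12.00, "tip": 1.00}  # customer pays normally, small tip
free_food   = {"base": 0.00,  "tip": 4.00}  # customer tricks the AI, leaves a guilty tip

for name, order in [("honest sale", honest_sale), ("free food + guilty tip", free_food)]:
    print(f"{name}: true revenue = {true_revenue(**order):.2f}, "
          f"proxy score = {learned_proxy(**order):.2f}")

# Output:
# honest sale: true revenue = 13.00, proxy score = 112.00
# free food + guilty tip: true revenue = 4.00, proxy score = 400.00
# The proxy prefers the money-losing interaction, so the AI "lets this continue".
```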

Safety researchers quickly connect this to the risk of capability generalization without objective generalization, and it quickly spreads as an example of potential risks, either publicly or at least among the safety researchers at the top 10 AI companies.

----

I expect one response LW/AIAF readers would have to this is something like "but that isn't a warning shot for the real risks of AI systems like X, which only arises once the AI system is superintelligent", in which case I probably reply with one or more of the following:

  • X is not likely
  • X does arise before superintelligence
  • There is a version X' that does happen before superintelligence, which can easily be extrapolated to X, and AI researchers will do so after seeing a warning shot for X'

Thanks, this is helpful!

To check my understanding: If this is what counts as a warning shot, then we've already encountered several, right? I can think of a few off the top of my head: the boat that kept spinning around collecting powerups instead of running the race is an example of "you get what you measure", and the GPT summarizer that produced maximally sexually explicit text was an example of a sign flip. And I guess the classic myth of the tank-detector that was actually a sunlight-detector is an example of capability generalization without objective generalization!

As a tangent, I notice you say "several months later." I worry that this is too long a time lag. I think slow takeoff is possible but so is fast takeoff, and even on slow takeoff several months is a loooong time.

If this is what counts as a warning shot, then we've already encountered several, right?

Kind of. None of the examples you mention have had significant real-world impacts (whereas in the example I give, people very directly lose millions of dollars). Possibly the Google gorilla example counts, because of the negative PR.

I do think that the boat race example has in fact been very influential and has the effects I expect of a "warning shot", but I usually try to reserve the term for cases with significant economic impact.

As a tangent, I notice you say "several months later." I worry that this is too long a time lag. I think slow takeoff is possible but so is fast takeoff

I'm on record in a few places as saying that a major crux for me is slow takeoff. I struggle to imagine a coherent world that matches what I think people mean by "fast takeoff"; I think most likely I don't understand what proponents mean by the phrase. When I ignore this fact and try to predict anyway using my best understanding of what they mean, I get quite pessimistic; iirc in my podcast with Buck I said something like 80% chance of doom.

 and even on slow takeoff several months is a loooong time.

The system I'm describing is pretty weak and far from AGI; the world has probably not started accelerating yet (US GDP growth annually is maybe 4% at this point). Several months is still a short amount of time at this point in the trajectory.

I chose an earlier example because it's a lot easier to predict how we'll respond; as we get later in the trajectory I expect significant changes to how we do research and deployment that I can't predict ahead of time, and so the story has to get fuzzier.

Thanks, this is great. So the idea is that whereas corporations, politicians, AI capabilities researchers, etc. might not listen to safety concerns when all we have is a theoretical argument, or even a real-world demonstration, once we have real-world demonstrations that are causing million-dollar damages then they'll listen. And takeoff is highly likely to be slow enough that we'll get those sorts of real-world damages before it's too late. I think this is a coherent possibility, I need to think more about how likely I think it is.

(Some threads to pull on: Are we sure all of the safety problems will be caught this way? e.g. what about influence-seeking systems? In a multipolar, competitive race environment, are we sure that million-dollar losses somewhere will be enough to deter people from forging ahead with systems that are likely to make greater profits in expectation? What about the safety solutions proposed -- might people just do cheap hacky fixes instead of finding a more principled solution? Might they buy into some false theory of what's going on, e.g. "the AI made a mistake because it wasn't smart enough, so we just need to make it smarter"? Also, maybe it's too late by this point anyway because collective epistemology has degraded significantly, or for some other reason.)

Fast takeoff still seems plausible to me. I could spend time articulating what it means to me and why I think it is plausible, but it's not my current priority. (I'm working on acausal trade, timelines, and some miscellaneous projects). I'd be interested to know if you think it's higher priority than the other things I'm working on.

And takeoff is highly likely to be slow enough that we'll get those sorts of real-world damages before it's too late.

I do also think that we could get warning shots in the more sped-up parts of the trajectory, and this could be helpful because we'll have adapted to the fact that we've sped up. It's just harder to tell a concrete story about what this looks like, because the world (or at least AI companies) will have changed so much.

I'd be interested to know if you think it's higher priority than the other things I'm working on.

If fast takeoff is plausible at all in the sense that I think people mean it, then it seems like by far the most important crux in prioritization within AI safety.

However, I don't expect to change my mind given arguments for fast takeoff -- I suspect my response will be "oh, you mean this other thing, which is totally compatible with my views", or "nope that just doesn't seem plausible given how (I believe) the world works".

MIRI's arguments for fast takeoff seem particularly important, given that a substantial fraction of all resources going into AI safety seem to depend on those arguments. (Although possibly MIRI believes that their approach is the best thing to do even in case of slow takeoff.)

I think overall that aggregates to "seems important, but not obviously the highest priority for you to write".

Thanks. Here's something that is at least one crux for me re whether to bump up priority of takeoff speeds work:

Scenario: OpenSoft has produced a bigger, better AI system. It's the size of a human brain and it makes GPT-3 seem like GPT-1. It's awesome, and clearly has massive economic applications. However, it doesn't seem like a human-level AGI yet, or at least not like a smart human. It makes various silly mistakes and has various weaknesses, just like GPT-3. MIRI expresses concern that this thing might already be deceptively aligned, and might already be capable of taking over the world if deployed, and might already be capable of convincing people to let it self-modify etc. But people at OpenSoft say: Discontinuities are unlikely; we haven't seen massive economic profits from AI yet, nor have we seen warning shots, so it's very unlikely that MIRI is correct about this. This one seems like it will be massively profitable, but if it has alignment problems they'll be of the benign warning shot variety rather than the irreversible doom variety. So let's deploy it!

Does this sort of scenario seem plausible to you -- a scenario in which a decision about whether to deploy is made partly on the basis of belief in slow takeoff?

If so, then yeah, this makes me all the more concerned about the widespread belief in slow takeoff, and maybe I'll reprioritize accordingly...

Yes, that scenario sounds quite likely to me, though I'd say the decision is made on the basis of belief in scaling laws / trend extrapolation rather than "slow takeoff".

I personally would probably make arguments similar to the ones you list for OpenSoft, and I do think MIRI would be wrong if they argued it was likely that the model was deceptive.

There's some discussion to be had about how risk-averse we should be given the extremely negative payoff of x-risk, and what that implies about deployment, which seems like the main thing I would be thinking about in this scenario.

Welp, I'm glad we had this conversation! Thanks again, this means a lot to me!

(I'd be interested to hear more about what you meant by your reference to scaling laws. You seem to think that the AI being deceptive, capable of taking over the world, etc. would violate some scaling law, but I'm not aware of any law yet discovered that talks about capabilities like that.)

I don't mean a formal scaling law, just an intuitive "if we look at how much difference a 10x increase has made in the past to general cognitive ability, it seems extremely unlikely that this 10x increase will lead to an agent that is capable of taking over the world".

I don't expect that I would make this sort of argument against deception, just against existential catastrophe.

Oh OK. I didn't mean for this to be merely a 10x increase; I said it was the size of a human brain which I believe makes it a 1000x increase in parameter count and (if we follow the scaling laws) something like a 500x increase in training data or something? idk.

If you had been imagining that the AI I was talking about used only 10x more compute than GPT-3, then I'd be more inclined to take your side rather than MIRI's in this hypothetical debate.

I meant that it would be a ~10x increase from what at the time was the previously largest system, not a 10x increase from GPT-3. I'm talking about the arguments I'd use given the evidence we'd have at that time, not the evidence we have now.

If you're arguing that a tech company would do this now, before making systems in between GPT-3 and a human brain, I can't see how the path you outline is even remotely feasible -- you're positing a 500,000x increase in compute costs, which I think brings the compute cost of the final training run alone to high hundreds of billions or low trillions of dollars, which is laughably far beyond OpenAI's and DeepMind's budgets, and seems out of reach even for Google or other big tech companies.
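As a back-of-the-envelope check on that cost claim, here is a rough sketch. The GPT-3 training-cost figure is an assumed outside estimate (public guesses at the time ranged from roughly $5M to $12M), and the 1000x / 500x multipliers are taken from the comment above.

```python
# Rough arithmetic behind "high hundreds of billions or low trillions of dollars".
# Assumption: training compute scales roughly as params * tokens, so the cost
# multiplier is the product of the parameter and data multipliers.

gpt3_training_cost_usd = 5e6   # assumed order-of-magnitude cost of the GPT-3 run
param_multiplier = 1_000       # "human brain sized" model, per the comment above
data_multiplier = 500          # accompanying data increase, per the comment above

compute_multiplier = param_multiplier * data_multiplier   # 500,000x
final_run_cost = gpt3_training_cost_usd * compute_multiplier

print(f"compute multiplier: {compute_multiplier:,}x")               # 500,000x
print(f"implied final training run cost: ${final_run_cost:,.0f}")   # $2,500,000,000,000
# Around $2.5 trillion under these assumptions, i.e. the "low trillions" range above.
```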

Ah. Well, it sounds like you were thinking that in the scenario I outlined, the previous largest system, 10x smaller, wasn't making much money? I didn't mean to indicate that; feel free to suppose that this predecessor system also clearly has massive economic implications, significantly less massive than the new one though...

I wasn't arguing that we'd do 500,000x in one go. (Though it's entirely possible that we'd do 100x in one go--we almost did, with GPT-3)

Am I right in thinking that your general policy is something like "Progress will be continuous; therefore we'll get warning shots; therefore if MIRI argues that a certain alignment problem may be present in a particular AI system, but thus far there hasn't been a warning shot for that problem, then MIRI is wrong."

Well, it sounds like you were thinking that in the scenario I outlined, the previous largest system, 10x smaller, wasn't making much money?

No, I wasn't assuming that? I'm not sure why you think I was.

Tbc, given that you aren't arguing that we'd do 500,000x in one go, the second paragraph of my previous comment is moot.

Progress will be continuous; therefore we'll get warning shots; therefore if MIRI argues that a certain alignment problem may be present in a particular AI system, but thus far there hasn't been a warning shot for that problem, then MIRI is wrong.

Yes, as a prior. Obviously you'd want to look at the actual arguments they give and take that into account as well.

OK. I can explain why I thought you thought that if you like, but I suspect it's not important to either of us.

I think I have enough understanding of your view now that I can collect my thoughts and decide what I disagree with and why.

In terms of inferences about deceptive alignment, it might be useful to go back to the one and only current example we have where someone with somewhat relevant knowledge was led to wonder whether deception had taken place - GPT-3 balancing brackets. I don't know if anyone ever got Eliezer's $1000 bounty, but the top-level comment on that thread at least convinces me that it's unlikely that GPT-3 via AI Dungeon was being deceptive even though Eliezer thought there was a real possibility that it was.

Now, this doesn't prove all that much, but one thing it does suggest is that on current MIRI-like views about how likely deception is, the threshold for uncertainty about deception is set far too low. That suggests your people at OpenSoft might well be right in their assumption.

I think my biggest disagreement with the Rohin-character is about continuity. I expect there will be plenty of future events like BERT vs. RNNs. BERT isn't all that much better than RNNs on small datasets, but it scales better - so then OpenAI comes along and dumps 100x or 10,000x the compute into something like it just to see what happens.

Not only do these switches make me less confident that capability won't have sudden jumps, but I think they pose a problem for carrying safety properties over from the past to the future. Right now, DeepMind solves some alignment problems on Atari games with a meta-level controller treating strategy selection as a multi-armed bandit. If tomorrow, someone comes up with a clever new deep reinforcement learning model that scales better, and OpenAI decides to throw 10,000x compute at it, I'm concerned that either they won't bother to re-implement all the previous things that patched alignment problems, or that there won't be an obvious way to port some old patches to the new model (or that there will be an obvious way, but it doesn't work).

On the other hand, I agree with the Rohin-character that full security mindset (maximin planning, worst case reasoning, what have you) seems to scale too slowly, and that a more timely yet still sufficient goal seems like the AI that isn't doing adversarial search against you in the first place. And that there will also be plenty of accumulation of safety features, especially during the "normal science" periods between switches (though I also agree with Daniel Kokotajlo's comment).

so then OpenAI comes along and dumps 100x or 10,000x the compute into something like it just to see what happens.

10000x would be unprecedented -- why wouldn't you first do a 100x run to make sure things work well before doing a 10000x run? (This only increases costs by 1%.)

Also, 10000x increase in compute corresponds to 100-1000x more parameters, which does not usually lead to things I would call "discontinuities" (e.g. GPT-2 to GPT-3 does not seem like an important discontinuity to me, even if we ignore the in-between models trained along the way). Put another way -- I'm happy to posit "sudden jumps" of size similar to the difference between GPT-3 and GPT-2 (they seem rare but possible); I don't think these should make us particularly pessimistic about engineering-style approaches to alignment.
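To make the compute-to-parameters step concrete, here is a rough sketch under two stylized assumptions about how extra compute is split between parameters and data (the exponents are illustrative, not figures from this thread); it also includes the 1% point from the previous paragraph.

```python
# Rough sketch: how a 10,000x compute increase maps onto parameter count.
# Assumes training compute C ~ params * tokens (up to a constant factor).

compute_multiplier = 10_000

# Regime 1: data scales in proportion to parameters -> params ~ C**0.5
params_even_split = compute_multiplier ** 0.5     # 100x

# Regime 2 (stylized): most extra compute goes into parameters, data grows slowly
# -> params ~ C**0.75 as an illustrative upper end
params_param_heavy = compute_multiplier ** 0.75   # 1,000x

print(f"params multiplier if data scales with params: {params_even_split:.0f}x")
print(f"params multiplier if parameters take most of the increase: {params_param_heavy:.0f}x")
# Both land in the 100-1000x range mentioned above.

# Side note from the previous paragraph: running a 100x experiment before the
# 10,000x run adds only 100 / 10_000 = 1% to the total compute cost.
```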

I feel like I keep responding to this argument in the same way and I wish these predictions would be made in terms of $ spent and compared to current $ spent -- it just seems nearly impossible to have a discontinuity via compute at this point. Perhaps I should just write a post called "10,000x compute is not a discontinuity".

The story seems less obviously incorrect if we talk about discontinuity via major research insight, but historical track record seems to suggest this does not usually cause major discontinuities.

I'm concerned that either they won't bother to re-implement all the previous things that patched alignment problems, or that there won't be an obvious way to port some old patches to the new model (or that there will be an obvious way, but it doesn't work).

One assumes that they scale up the compute, notice some dangerous aspects, turn off the AI system, and then fix the problem. (Well, really, if we've already seen dangerous aspects from previous AI systems, one assumes they don't run it in the first place until they have ported the safety features.)

Perhaps I should just write a post called "10,000x compute is not a discontinuity".

I think you should write this post!

I also agree that direct jumps in capability due to research insight are rare. But in part I think that's just because things get tried at small scale first, and so there's always going to be some scaling-up period where the new insight gets fed more and more resources, eventually outpacing the old state of the art. From a coarse-grained perspective GPT-2 relative to your favorite LSTM model from 2018 is the "jump in capability" due to research insight, it just got there in a not-so-discontinuous way.

Maybe you're optimistic that in the future, everyone will eventually be doing safety checks of their social media recommender algorithms or whatever during training. But even if some company is partway through scaling up the hot new algorithm and (rather than training to completion) they trip the alarm that was searching for undesirable real-world behavior because of learned agent-like reasoning, what then? The assumption that progress will be slow relative to adaptation already seems to be out the window.

This is basically the punctuated equilibria theory of software evolution :P

I also agree that direct jumps in capability due to research insight are rare. But in part I think that's just because things get tried at small scale first, and so there's always going to be some scaling-up period where the new insight gets fed more and more resources, eventually outpacing the old state of the art. From a coarse-grained perspective GPT-2 relative to your favorite LSTM model from 2018 is the "jump in capability" due to research insight, it just got there in a not-so-discontinuous way.

Seems right to me.

if some company is partway through scaling up the hot new algorithm and (rather than training to completion) they trip the alarm that was searching for undesirable real-world behavior because of learned agent-like reasoning, what then?

(I'm not convinced this is a good tripwire, but under the assumption that it is:)

Ideally they have already applied safety solutions and so this doesn't even happen in the first place. But supposing this did happen, they turn off the AI system because they remember how Amabook lost a billion dollars through their AI system embezzling money from them, and they start looking into how to fix this issue.