Takeoff speeds have a huge effect on what it means to work on AI x-risk

johnswentworth (4y)

I agree with the basic difference you point to between fast- and slow-takeoff worlds, but disagree that it has important strategic implications for the obviousness of takeover risk.

In slow takeoff worlds, many aspects of the alignment problem show up well before AGI goes critical. However, people will by default train systems to conceal those problems. (This is already happening: RL from human feedback is exactly the sort of strategy which trains systems to conceal problems, and we've seen multiple major orgs embracing it within the past few months.) As a result, AI takeover risk never looks much more obvious than it does now.

Concealed problems look like no problems, so there will in general be economic incentives to train in ways which conceal problems. The most-successful-looking systems, at any given time, will be systems trained in ways which incentivize hidden problems over visible problems.

Buck (4y)

I expect that people will find it pretty obvious that RLHF leads to somewhat misaligned systems, if they are widely used by the public. Like, I think that most ML researchers agree that the Facebook Newsfeed algorithm is optimizing for clicks in a way people are somewhat unhappy about, and this is based substantially on their personal experience with it; inasmuch as we’re interacting a lot with sort-of-smart ML systems, I think we’ll notice their slight misalignment. And so I do think that this will make AI takeover risk more obvious.

Examples of small AI catastrophes will also probably make takeover risk more obvious.

I guess another example of this phenomenon is that a bunch of people are more worried about AI takeover than they were five years ago, because they’ve seen more examples of ML systems being really smart, even though they wouldn’t have said five years ago that ML systems could never solve those problems. Seeing the ways things happen is often pretty persuasive to people.

johnswentworth (4y)

Like, I think that most ML researchers agree that the Facebook Newsfeed algorithm is optimizing for clicks in a way people are somewhat unhappy about, and this is based substantially on their personal experience with it; inasmuch as we’re interacting a lot with sort-of-smart ML systems, I think we’ll notice their slight misalignment.

This prediction feels like... it doesn't play out the whole game tree? Like, yeah, Facebook releases one algorithm optimizing for clicks in a way people are somewhat unhappy about. But the customers are unhappy about it, which is not an economically-stable state of affairs, so shortly thereafter Facebook switches to a different metric which is less click-centric. (IIRC this actually happened a few years ago.)

On the other hand, sometimes Facebook's newsfeed algorithm is bad in ways which are not visible to individual customers. Like, maybe there's an echo chamber problem, people only see things they agree with. But from an individual customer's perspective, that's exactly what they (think they) want to see, they don't know that there's anything wrong with the information they're receiving. This sort of problem does not actually look like a problem from the perspective of any one person looking at their own feed; it looks good. So that's a much more economically stable state; Facebook is less eager to switch to a new metric.

... but even that isn't a real example of a problem which is properly invisible. It's still obvious that the echo-chamber-newsfeed is bad for other people, and therefore it will still be noticed, and Facebook will still be pressured to change their metrics. (Indeed that is what happened.) The real problems are problems people don't notice at all, or don't know to attribute to the newsfeed algorithm at all. We don't have a widely-recognized example of such a thing and probably won't any time soon, precisely because most people do not notice it. Yet I'd be surprised if Facebook's newsfeed algorithm didn't have some such subtle negative effects, and I very much doubt that the subtle problems will go away as the visible problems are iterated on.

If anything, I'd expect iterating on visible problems to produce additional subtle problems - for instance, in order to address misinformation problems, Facebook started promoting an Official Narrative which is itself often wrong. But that's much harder to detect, because it's wrong in a way which the vast majority of Official Sources also endorse. To put it another way: if most of the population can be dragged into a single echo chamber, all echoing the same wrong information, that doesn't make the echo chamber problem less bad, but it does make the echo chamber problem less visible.

Anyway, zooming out: solve for the equilibrium, as Cowen would say. If the problems are visible to customers, that's not a stable state. Organizations will be incentivized to iterate until problems stop being visible. They will not, however, be incentivized to iterate away the problems which aren't visible.

Steven Byrnes (4y)

I guess it depends on “how fast is fast and how slow is slow”, and what you say is true on the margin, but here's my plea that the type of thinking that says “we want some technical problem to eventually get solved, so we try to solve it” is a super-valuable type of thinking right now even if we were somehow 100% confident in slow takeoff. (This is mostly an abbreviated version of this section.)

1. Differential Technological Development (DTD) seems potentially valuable, but is only viable if we know what paths-to-AGI will be safe & beneficial really far in advance. (DTD could take the form of accelerating one strand of modern ML relative to another, e.g. model-based RL versus self-supervised language models etc., or it could take the form of differentially developing ML-as-a-whole compared to, I dunno, something else.) Relatedly, suppose (for the sake of argument) that someone finds an ironclad proof that safe prosaic AGI is impossible, and the only path forward is a global ban on prosaic AGI research. It would be way better to find that proof right now than finding it in 5 years, and better in 5 years than 10, etc., and that's true no matter how gradual takeoff is.
2. We don't know how long safety research will take. If takeoff happens over N years, and safety research takes N+1 years, that's bad even if N is large.
   1. Maybe you'll say that almost all of the person-years of safety research will happen during takeoff, and any effort right now is a drop in the ocean compared to that. But I really think wall-clock time is an important ingredient in research progress, not just person-years. (“Nine women can't make a baby in a month.”)
3. We don't just need to figure out the principles for avoiding AGI catastrophic accidents. We also need every actor with a supercomputer to understand and apply these principles. Some ideas take many decades to become widely (let alone universally) accepted; famous examples include evolution and plate tectonics. It takes wall-clock time for arguments to be refined. It takes wall-clock time for evidence to be marshaled. It takes wall-clock time for nice new pedagogical textbooks to be created. And of course, it takes wall-clock time for the stubborn holdouts to die and be replaced by the next generation. :-P

Buck (2y), Review for 2022 Review

I think this point is really crucial, and I was correct to make it, and it continues to explain a lot of disagreements about AI safety.

Buck (2y), Review for 2022 Review

I think this point is incredibly important and quite underrated, and safety researchers often do way dumber work because they don't think about it enough.

Donald Hobson (4y)

In contrast, in a slow takeoff world, many aspects of the AI alignment problems will already have showed up as alignment problems in non-AGI, non-x-risk-causing systems; in that world, there will be lots of industrial work on various aspects of the alignment problem, and so EAs now should think of themselves as trying to look ahead and figure out which margins of the alignment problem aren’t going to be taken care of by default, and try to figure out how to help out there.

Let's consider the opposite. Imagine you are programming a self-driving car in a simulated environment. You notice it Goodharting your metrics, so you tweak them and try again. You build up a list of 1001 ad hoc patches that make your self-driving car behave reasonably most of the time.

The object-level patches only really apply to self-driving cars. They include things like a small intrinsic preference towards looking at street signs. The meta-level strategy of patching it until it works isn't very relevant either.

Imagine a world with many AIs like this, all with ad hoc kludges of hard-coded utility functions. AI is becoming increasingly economically important and getting close to AGI. Slow takeoff. All the industrial work is useless.

David Scott Krueger (formerly: capybaralet) (4y)

In particular, in a fast takeoff world, AI takeover risk never looks much more obvious than it does now, and so x-risk-motivated people should be assumed to cause the majority of the research on alignment that happens.

I strongly disagree with that and I don't think it follows from the premise. I think by most reasonable definitions of alignment it is already the case that most of the research is not done by x-risk-motivated people.

Furthermore, I think it reflects poorly on this community that this sort of sentiment seems to be common.

David Scott Krueger (formerly: capybaralet) (4y)

It's possible that a lot of our disagreement is due to different definitions of "research on alignment", where you would only count things that (e.g.) 1) are specifically about alignment that likely scales to superintelligent systems, or 2) are motivated by X safety.

To push back on that a little bit...
RE (1): It's not obvious what will scale, and I think historically this community has been too pessimistic (i.e. almost completely dismissive) about approaches that seem hacky or heuristic.
RE (2): This is basically circular.

adamShimi (4y)

I disagree, so I'm curious: what do you consider great examples of good alignment research that is not done by x-risk-motivated people? (Not being dismissive, I'm genuinely curious, and discussing specifics sounds more promising than downvoting you to oblivion and not having a conversation at all.)

Joe Collman (4y)

Examples would be interesting, certainly. Concerning the post's point, I'd say the relevant claim is that [type of alignment research that'll be increasingly done in slow takeoff scenarios] is already being done by non-x-risk-motivated people.

I guess the hope is that at some point there are clear-to-everyone problems with no hacky solutions, so that incentives align to look for fundamental fixes - but I wouldn't want to rely on this.

AI ALIGNMENT FORUM