Continuity Assumptions

Rob Bensinger, 4y

A lot of models of what can or can't work in AI alignment depend on intuitions about whether to expect "true discontinuities" or just "steep bits".

Note that Nate and Eliezer expect there to be some curves you can draw after the fact that show continuity in AGI progress on particular dimensions. They just don't expect these to be the curves with the most practical impact (and they don't think we can identify the curves with foresight, in 2022, to make strong predictions about AGI timing or rates of progress).

Quoting Nate in 2018:

On my model, the key point is not 'some AI systems will undergo discontinuous leaps in their intelligence as they learn,' but rather, 'different people will try to build AI systems in different ways, and each will have some path of construction and some path of learning that can be modeled relatively well by some curve, and some of those curves will be very, very steep early on (e.g., when the system is first coming online, in the same way that the curve "how good is Google’s search engine" was super steep in the region between "it doesn’t work" and "it works at least a little"), and sometimes a new system will blow past the entire edifice of human knowledge in an afternoon shortly after it finishes coming online.' Like, no one is saying that Alpha Zero had massive discontinuities in its learning curve, but it also wasn't just AlphaGo Lee Sedol but with marginally more training: the architecture was pulled apart, restructured, and put back together, and the reassembled system was on a qualitatively steeper learning curve.
My point here isn't to throw 'AGI will undergo discontinuous leaps as they learn' under the bus. Self-rewriting systems likely will (on my models) gain intelligence in leaps and bounds. What I’m trying to say is that I don’t think this disagreement is the central disagreement. I think the key disagreement is instead about where the main force of improvement in early human-designed AGI systems comes from — is it from existing systems progressing up their improvement curves, or from new systems coming online on qualitatively steeper improvement curves?

And quoting Eliezer more recently:

if the future goes the way I predict and yet anybody somehow survives, perhaps somebody will draw a hyperbolic trendline on some particular chart where the trendline is retroactively fitted to events including those that occurred in only the last 3 years, and say with a great sage nod, ah, yes, that was all according to trend, nor did anything depart from trend

And:

There is, I think, a really basic difference of thinking here, which is that on my view, AGI erupting is just a Thing That Happens and not part of a Historical Worldview or a Great Trend.
Human intelligence wasn't part of a grand story reflected in all parts of the ecology, it just happened in a particular species.
Now afterwards, of course, you can go back and draw all kinds of Grand Trends into which this Thing Happening was perfectly and beautifully fitted, and yet, it does not seem to me that people have a very good track record of thereby predicting in advance what surprising news story they will see next - with some rare, narrow-superforecasting-technique exceptions, like the Things chart on a steady graph and we know solidly what a threshold on that graph corresponds to and that threshold is not too far away compared to the previous length of the chart.
One day the Wright Flyer flew. Anybody in the future with benefit of hindsight, who wanted to, could fit that into a grand story about flying, industry, travel, technology, whatever; if they've been on the ground at the time, they would not have thereby had much luck predicting the Wright Flyer. It can be fit into a grand story but on the ground it's just a thing that happened. It had some prior causes but it was not thereby constrained to fit into a storyline in which it was the plot climax of those prior causes.
My worldview sure does permit there to be predecessor technologies and for them to have some kind of impact and for some company to make a profit, but it is not nearly as interested in that stuff, on a very basic level, because it does not think that the AGI Thing Happening is the plot climax of a story about the Previous Stuff Happening.

And:

I think the Hansonian viewpoint - which I consider another gradualist viewpoint, and whose effects were influential on early EA and which I think are still lingering around in EA - seemed surprised by AlphaGo and Alpha Zero, when you contrast its actual advance language with what actually happened. Inevitably, you can go back afterwards and claim it wasn't really a surprise in terms of the abstractions that seem so clear and obvious now, but I think it was surprised then; and I also think that "there's always a smooth abstraction in hindsight, so what, there'll be one of those when the world ends too", is a huge big deal in practice with respect to the future being unpredictable.

(As an example, compare Paul Christiano's post on takeoff speeds from 2018, which is heavily about continuity, to the debate between Paul and Eliezer in late 2021. Despite the participants spending years in discussion, progress on bridging the continuous-discrete gap between them seems very limited.)

Paul and Eliezer have had lots of discussions over the years, but I don't think they talked about takeoff speeds between the 2018 post and the 2021 debate?

Jan_Kulveit, 4y

Note that Nate and Eliezer expect there to be some curves you can draw after the fact that show continuity in AGI progress on particular dimensions. They just don't expect these to be the curves with the most practical impact (and they don't think we can identify the curves with foresight, in 2022, to make strong predictions about AGI timing or rates of progress).
Quoting Nate in 2018: ...

Yes, but conversely, I could say I'd expect some curves to show discontinuous jumps, mostly in dimensions which no one really cares about. Clearly the cruxes are about discontinuities in dimensions which matter.

As I tried to explain in the post, I think continuity assumptions mostly get you different things than "strong predictions about AGI timing".

...

My point here isn't to throw 'AGI will undergo discontinuous leaps as they learn' under the bus. Self-rewriting systems likely will (on my models) gain intelligence in leaps and bounds. What I’m trying to say is that I don’t think this disagreement is the central disagreement. I think the key disagreement is instead about where the main force of improvement in early human-designed AGI systems comes from — is it from existing systems progressing up their improvement curves, or from new systems coming online on qualitatively steeper improvement curves?

I would paraphrase this as "assuming discontinuities at every level" - both in single-system training and in the more macroscopic exploration of the "space of learning systems" - while stating that the key disagreement is about discontinuities in the space of model architectures, rather than about the jumpiness of single-model training.

Personally, I don't think the distinction between 'movement by learning of a single model', 'movement by scaling', and 'movement by architectural changes' will necessarily be big.

There is, I think, a really basic difference of thinking here, which is that on my view, AGI erupting is just a Thing That Happens and not part of a Historical Worldview or a Great Trend.

This seems to more or less support what I wrote? Expecting a Big Discontinuity, and this being a pretty deep difference?

I think the Hansonian viewpoint - which I consider another gradualist viewpoint, and whose effects were influential on early EA and which I think are still lingering around in EA - seemed surprised by AlphaGo and Alpha Zero, when you contrast its actual advance language with what actually happened. Inevitably, you can go back afterwards and claim it wasn't really a surprise in terms of the abstractions that seem so clear and obvious now, but I think it was surprised then; and I also think that "there's always a smooth abstraction in hindsight, so what, there'll be one of those when the world ends too", is a huge big deal in practice with respect to the future being unpredictable.

My overall impression is that Eliezer likes to argue against "Hansonian views", but something like "continuity assumptions" seems to be a much broader category than Robin's views.

Paul and Eliezer have had lots of discussions over the years, but I don't think they talked about takeoff speeds between the 2018 post and the 2021 debate?

In my view, continuity assumptions are not just about takeoff speeds. E.g., IDA makes much more sense in a continuous world - if you reach a cliff, a working IDA should slow down and warn you. In the Truly Discontinuous world, you just jump off the cliff at some unknown step.

I would guess that a majority of all debates and disagreements between Paul and Eliezer have some "continuity" component: e.g. the question of whether we can learn a lot of important alignment stuff on non-AGI systems is a typical continuity problem, but only tangentially relevant to takeoff speeds.

Steven Byrnes, 4y

Do you think there’s an important practical difference between “discontinuous” and “continuous improvement from preschooler-level-intelligence to superintelligence over the course of 24 hours”? The post suggests that they are fundamentally different—cf. “true discontinuities” versus “steep bit”. But it seems to me that there is no difference in practice.

(If we replace “24 hours” by “2 years”, well I for one am mostly inclined to expect that those 2 years would be squandered. If it’s “30 years”, well OK now we’re playing a different game.)

Jan_Kulveit, 4y

Yes, I do. To disentangle it a bit:

- one line of reasoning is: if you have this "preschooler-level-intelligence to superintelligence over the course of 24 hours", you probably had something which was able to learn really fast and generalize a lot before this. What does the rest of the world look like?

- second: if you have control over the learning time, then in the continuous version you can slow down or halt. Yes, you need some fast oversight control loop to do that, but getting to a state where this is what you have, because that's what sane AI developers do, seems tractable. (Also, I think this has a decent chance of becoming instrumentally convergent for AI developers.)
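
To make the second point a bit more concrete, here is a toy Python sketch (an illustration added for this thread, not anyone's actual model). It assumes capability follows a smooth logistic curve over a 24-hour run, and that a hypothetical oversight loop measures capability every `check_interval` hours and halts training at the first check that sees the threshold crossed. The function names, the 0-to-1 capability scale, and every number are assumptions chosen only for illustration.

```python
import math

def capability(t_hours, midpoint=12.0, steepness=1.0):
    """Toy capability on a 0..1 scale: a smooth logistic curve in training time."""
    return 1.0 / (1.0 + math.exp(-steepness * (t_hours - midpoint)))

def run_with_oversight(check_interval, halt_level=0.5, total_hours=24.0, steepness=1.0):
    """Train in chunks of check_interval hours and halt at the first check
    where measured capability is at or above halt_level.

    Returns (time_of_halt_in_hours, capability_at_halt).
    """
    steps = int(total_hours / check_interval)
    for i in range(1, steps + 1):
        t = i * check_interval
        level = capability(t, steepness=steepness)
        if level >= halt_level:
            return t, level
    # The threshold was never observed: training ran to completion.
    return total_hours, capability(total_hours, steepness=steepness)

# Hypothetical comparison: a fast loop (every 15 minutes) vs. a slow one (every 8 hours).
fast_time, fast_level = run_with_oversight(check_interval=0.25)
slow_time, slow_level = run_with_oversight(check_interval=8.0)
print(f"fast loop halts at t={fast_time:.2f}h, capability={fast_level:.3f}")
print(f"slow loop halts at t={slow_time:.2f}h, capability={slow_level:.3f}")
```

Under these assumptions, the fast loop halts close to the chosen threshold, while the slow loop only notices after the curve has nearly saturated. So on this toy picture, what continuity buys you depends on how fast the oversight loop is relative to how steep the curve is.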

Steven Byrnes, 4y

one line of reasoning is: if you have this "preschooler-level-intelligence to superintelligence over the course of 24 hours", you probably had something which was able to learn really fast and generalize a lot before this. What does the rest of the world look like?

OK, here’s my scenario in more detail. Someone thinks of a radically new AI RL algorithm, tomorrow. They implement it and then start a training run. It trains up from randomly-initialized to preschooler-level in the first 24 hours, and then from preschooler-level to superintelligence in the second 24 hours.

I’m not claiming this scenario is realistic, just that it illustrates a problem with your definition of continuity. I think this scenario is “continuous” by your definition, but that the continuity doesn’t buy you any of the things on your “how continuity helps” list. Right?

(My actual current expectation (see 1,2) is that people are already today developing bits and pieces of an AGI-capable RL algorithm, and at some indeterminate time in the future the pieces will all come together and people will work out the kinks and tricks to scaling it up etc. And as this happens, the algorithm will go from preschool-level to superintelligent over the course of maybe a year or two, and that year or two will not make much difference, because whatever safety problems they encounter will not have easy solutions, and a year or two is just not enough time to figure out and implement non-easy solutions, and pausing won’t be an option because of competition.)

Jan_Kulveit, 4y

So, I do think the continuity buys you things, even in this case - roughly in the way outlined in the post. It's the choice of the developer to continue training after passing some roughly human level, and with sensible oversight, they should notice and stop before getting to superintelligence.

You may ask why they would have the oversight in place. My response is: possibly because some small-scale AI disaster has happened before, so people understand the tech is dangerous.

With the 2-year scenario, again, I think there is a decent chance of stopping ~before or roughly at the human level. One story for why that may be the case is that it seems quite possible that the level where AI gets good at convincing/manipulating humans, or where humans freak out for other reasons, is lower than AGI. If you get enough economic benefits from CAIS at the same time, you can also get strong counter-pressures against competing to develop AGI.

Steven Byrnes, 4y

in Eliezer's recent post, discontinuity is a strong component of points 3, 5, 6, 7, 10, 11, 12, 13, 26, 30, 31, 34, 35

I think I disagree. For example:

3,5,6,7, etc.—In a slow-takeoff world, at some point X you cross a threshold where your AI can kill everyone (if you haven’t figured out how to keep it under control), and at some point Y you cross a threshold where you & your AI can perform a “pivotal act”. IIUC, Eliezer is claiming that X occurs earlier than Y (assuming slow takeoff).
10,11,12, etc.—In a slow-takeoff world, you still eventually reach a point where your AI can kill everyone (if you haven’t figured out how to keep it under control). At that point, you’ll need to train your AI in such a way that kill-everyone actions are not in the AI’s space of possible actions during training and sandbox-testing, but kill-everyone actions are in the AI’s space of possible actions during use / deployment. Thus there is an important distribution shift between training and deployment. (Unless you can create an amazingly good sandbox that both tricks the AI and allows all the same possible actions and strategies that the real world does. Seems hard, although I endorse efforts in that area.) By the same token, if the infinitesimally-less-competent AI that you deployed yesterday did not have kill-everyone actions in its space of possible actions, and the AI that you’re deploying today does, then that’s an important difference between them, despite the supposedly continuous takeoff.
26—In a slow-takeoff world, at some point Z you cross a threshold where you stop being able to understand the matrices, and at some point X you cross a threshold where your AI can kill everyone. I interpret this point as Eliezer making a claim that Z would occur earlier than X.

Jan_Kulveit, 4y

Can we also drop the "pivotal act" frame? Thinking in "pivotal acts" seems to be one of the root causes leading to discontinuities everywhere.
3,... Currently, my guess is we may want to steer to a trajectory where no single AI can kill everyone (at any point of the trajectory). Currently, no single AI can kill everyone - so maybe we want to maintain this property of the world / scale it, rather than e.g. create an AI sovereign which could unilaterally kill everyone, but will be nice instead (at least until we've worked out a lot more of the theory of alignment and intelligence than we have so far).

(I don't think the "killing everyone" threshold is a clear cap on capabilities - if you replace "kill everyone" with "own everything", it seems the property "no one owns everything" is compatible with scaling of the economy.)

Steven Byrnes, 4y

Consider the following hypotheses.

Hypothesis 1: humans with AI assistance can (and in fact will) build a nanobot defense system before an out-of-control AI would be powerful enough to deploy nanobots.
Hypothesis 2: humans with AI assistance can (and in fact will) build systems that robustly prevent hostile actors from tricking/bribing/hacking humanity into all-out nuclear war before an out-of-control AI would be powerful enough to do that.
Hypothesis 3,4,5,6,7…: Ditto for plagues, and disabling the power grid, and various forms of ecological collapse, and co-opting military hardware, and information warfare, etc. etc.

I think you believe that all these hypotheses are true. Is that right?

If so, this seems unlikely to me, for lots of reasons, both technological and social:

Some of the defensive measures might just be outright harder technologically than the offensive measures.
Some of the defensive measures would seem to require that humans are good at global coordination, and that humans will wisely prepare for uncertain hypothetical future threats even despite immediate cost and inconvenience.
The human-AI teams would be constrained by laws, norms, Overton window, etc., in a way that an out-of-control AI would not.
The human-AI teams would be constrained by lack-of-complete-trust-in-the-AI, in a way that an out-of-control AI would not. For example, defending nuclear weapons systems against hacking-by-an-out-of-control-AI would seem to require that humans either give their (supposedly) aligned AIs root access to the nuclear weapons computer systems, or source code and schematics for those computer systems, or similar, and none of these seem like things that military people would actually do in real life. As another example, humans may not trust their AIs to do recursive self-improvement, but an out-of-control AI probably would if it could.
There are lots of hypotheses that I listed above, plus presumably many more that we can't think of, and they're more-or-less conjunctive. (Not perfectly conjunctive—if just one hypothesis is false, we’re probably OK, apart from the nanobot one—but there seem to be lots of ways for 2 or 3 of the hypotheses to be false such that we’re in big trouble.)
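
As a rough illustration of the "more-or-less conjunctive" point, here is a small Python sketch of the arithmetic. It makes two loud assumptions that are not in the argument above: that the hypotheses hold or fail independently of each other, and that each one holds with the same probability. The function name and the numbers (7 hypotheses, 90% each) are hypothetical and only meant to show how quickly the odds of several failures add up.

```python
from math import comb

def prob_at_least_k_false(n, p_true, k):
    """Probability that at least k of n hypotheses are false, assuming each
    hypothesis is true independently with probability p_true (binomial model)."""
    p_false = 1.0 - p_true
    return sum(
        comb(n, j) * p_false**j * p_true**(n - j)
        for j in range(k, n + 1)
    )

# Hypothetical numbers: 7 defensive hypotheses, each 90% likely to hold.
n_hypotheses = 7
p_each_true = 0.9
print(f"P(at least 1 false) = {prob_at_least_k_false(n_hypotheses, p_each_true, 1):.3f}")
print(f"P(at least 2 false) = {prob_at_least_k_false(n_hypotheses, p_each_true, 2):.3f}")
```

With these made-up inputs, the chance that two or more hypotheses fail comes out at roughly 15%, even though each individual hypothesis looks fairly safe. Real dependence between the hypotheses could push that number in either direction.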

Note that I don’t claim any special expertise, I mostly just want to help elevate this topic from unstated background assumption to an explicit argument where we figure out the right answer. :)

(I was recently discussing this topic in this thread.)

we may want to steer to a trajectory where no single AI can kill everyone

Want? Yes. We absolutely want that. So we should try to figure out whether that’s a realistic possibility. I’m suggesting that it might not be.

Before the cliff	After the cliff
Non-general systems. Lack the core of general reasoning, that which allows thought in domains far from training data	General systems. Capabilities generalise far
Weak systems - that won't kill you, but also won't help you solve alignment	Strong systems - that would help solve alignment, but unfortunately will kill you by default, if unaligned
Systems which may be misaligned, but aren't competently deceptive about it	System which is actively modelling you at a level where the deception is beyond your ability to notice
Weak acts	Pivotal acts
…	…

AI ALIGNMENT FORUM