Nate Soares

Wiki Contributions


For sure. It's tricky to wipe out humanity entirely without optimizing for that in particular -- nuclear war, climate change, and extremely bad natural pandemics look to me like they're at most global catastrophes, rather than existential threats. It might in fact be easier to wipe out humanity by enginering a pandemic that's specifically optimized for this task (than it is to develop AGI), but we don't see vast resources flowing into humanity-killing-virus projects, the way that we see vast resources flowing into AGI projects. By my accounting, most other x-risks look like wild tail risks (what if there's a large, competent, state-funded successfully-secretive death-cult???), whereas the AI x-risk is what happens by default, on the mainline (humanity is storming ahead towards AGI as fast as they can, pouring billions of dollars into it per year, and by default what happens when they succeed is that they accidentally unleash an optimizer that optimizes for our extinction, as a convergent instrumental subgoal of whatever rando thing it's optimizing).

Question for Richard, Paul, and/or Rohin: What's a story, full of implausibly concrete details but nevertheless a member of some largish plausible-to-you cluster of possible outcomes, in which things go well? (Paying particular attention to how early AGI systems are deployed and to what purposes, or how catastrophic deployments are otherwise forstalled.)

In response to your last couple paragraphs: the critique, afaict, is not "a real human cannot keep multiple concrete scenarios in mind and speak probabilistically about those", but rather "a common method for representing lots of hypotheses at once, is to decompose the hypotheses into component properties that can be used to describe lots of concrete hypotheses. (toy model: instead of imagining all numbers, you note that some numbers are odd and some numbers are even, and then think of evenness and oddness). A common failure mode when attempting this is that you lose track of which properties are incompatible (toy model: you claim you can visualize a number that is both even and odd). A way to avert this failure mode is to regularly exhibit at least one concrete hypothesis that simultaneousy posseses whatever collection of properties you say you can simultaneously visualize (toy model: demonstrating that 14 is even and 7 is odd does not in fact convince me that you are correct to imagine a number that is both even and odd)."

On my understanding of Eliezer's picture (and on my own personal picture), almost nobody ever visibly tries to do this (never mind succeeding), when it comes to hopeful AGI scenarios.

Insofar as you have thought about at least one specific hopeful world in great detail, I strongly recommend, spelling it out, in all its great detail, to Eliezer, next time you two chat. In fact, I personally request that you do this! It sounds great, and I expect it to constitute some progress in the debate.

("near-zero" is a red herring, and I worry that that phrasing bolsters the incorrect view that the reason MIRI folk think alignment is hard is that we want implausibly strong guarantees. I suggest replacing "reduce x-risk to near-zero" with "reduce x-risk to sub-50%".)

My take on the exercise:

Is Humbali right that generic uncertainty about maybe being wrong, without other extra premises, should increase the entropy of one's probability distribution over AGI, thereby moving out its median further away in time?

Short version: Nah. For example, if you were wrong by dint of failing to consider the right hypothesis, you can correct for it by considering predictable properties of the hypotheses you missed (even if you don't think you can correctly imagine the true research pathway or w/e in advance). And if you were wrong in your calculations of the quantities you did consider, correction will regress you towards your priors, which are simplicity-based rather than maxent.

Long version: Let's set aside for the moment the question of what the "correct" maxent distribution on AGI timelines is (which, as others have noted, depends a bit on how you dice up the space of possible years). I don't think this is where the action is, anyway.

Let's suppose that we're an aspiring Bayesian considering that we may have made some mistakes in our calculations. Where might those mistakes have been? Perhaps:

  1. We were mistaken about what we saw (and erroneously updated on observations that we did not make)?
  2. We were wrong in our calculations of quantities of the form P(e|H) (the likelihoods) or P(H) (the priors), or the multiplications thereof?
  3. We failed to consider a sufficiently wide space of hypotheses, in our efforts to complete our updating before the stars burn out?

Set aside for now that the correct answer is "it's #3, like we might stumble over #1 and #2 every so often but bounded reasoners are making mistake #3 day in and day out, it's obviously mostly #3", and take these one at a time:

Insofar as we were mistaken about what we saw, correcting our mistake should involve reverting an update (and then probably making a different update, because we saw something that we mistook, but set that aside). Reverting an update pushes us back towards our prior. This will often increase entropy, but not necessarily! (For example, if we thought we saw a counter-example to gravitation, that update might dramatically increase our posterior entropy, and reverting the update might revert us back to confident narrow predictions about phones falling.) Our prior is not a maxent prior but a simplicity prior (which is important if we ever want to learn anything at all).

Insofar as we were wrong in our calculations of various quantities, correcting our mistake depends on which direction we were wrong, and for which hypotheses. In practice, a reflectively stable reasoner shouldn't be able to predict the (magnitude-weighted) direction of their error in calculating P(e|H): if we know that we tend to overestimate that value when e is floobish, we can just bump down our estimate whenever e is floobish, until we stop believing such a thing (or, more intelligently, trace down the source of the systematic error and correct it, but I digress). I suppose we could imagine humbly acknowledging that we're imperfect at estimating quantities of the form P(e|H), and then driving all such estimates towards 1/n, where n is the number of possible observations? This doesn't seem like a very healthy way to think, but its effect is to again regress us towards our prior. Which, again, is a simplicity prior and not a maxent prior. (If instead we start what-iffing about whether we're wrong in our intuitive calculations that vaguely correspond to the P(H) quantities, and decide to try to make all our P(H) estimates more similar to each other regardless of H as a symbol of our virtuous self-doubt, then we start regressing towards maximum entropy. We correspondingly lose our ability to learn. And of course, if you're actually worried that you're wrong in your estimates of the prior probabilities, I recommend checking whether you think your P(H)-style estimates are too high or two low in specific instances, rather than driving all such estimates to uniformity. But also ¯\_(ツ)_/¯, I can't argue good priors into a rock.)

Insofar as we were wrong because we were failing to consider a sufficiently wide array of hypotheses, correcting our mistake depends on which hypotheses we're missing. Indeed, much of Eliezer's dialog seems to me like Eliezer trying to say "it's mistake #3 guys, it's always #3", plus "just as the hypothesis that we'll get AGI at 20 watts doesn't seem relevant because we know that the ways computers consume watts and the ways brains consume watts and they're radically different, so too can we predict that whatever the correct specific hypothesis for how the first human-attained AGIs consume compute, it will make the amount of compute that humans consume seem basically irrelevant." Like, if we don't get AGI till 2050 then we probably can't consider the correct specific research path, a la #3, but we can predict various properties of all plausible unvisualized paths, and adjust our current probabilities accordingly, in acknowledgement of our current #3-style errors.

In sum: accounting for wrongness should look less like saying "I'd better inject more entropy into my distributions", and more like asking "are my estimates of P(e|H) off in a predictable direction when e looks like this and H looks like that?". The former is more like sacrificing some of your hard-won information on the alter of the gods of modesty; the latter is more like considering the actual calculations you did and where the errors might reside in them. And even if you insist on sacrificing some of your information because maybe you did the calculations wrong, you should regress towards a simplicity prior rather than towards maximum entropy (which in practice looks like reaching for fewer and simpler-seeming deep regularities in the world, rather than pushing median AGI timelines out to the year 52,021), which is also how things will look if you think you're missing most of the relevant information. Though of course, your real mistake was #3, you're ~always committing mistake #3. And accounting for #3 in practice does tend to involve increasing your error bars until they are wide enough to include the sorts of curveballs that reality tends to throw at you. But the reason for widening your error bars there is to include more curveballs, not just to add entropy for modesty's sake. And you're allowed to think about all the predictable-in-advance properties of likely ballcurves even if you know you can't visualize-in-advance the specific curve that the ball will take.

In fact, Eliezer's argument reads to me like it's basically "look at these few and simple-seeming deep regularities in the world" plus a side-order of "the way reality will actually go is hard to visualize in advance, but we can still predict some likely properties of all the concrete hypotheses we're failing to visualize (which in this case invalidate biological anchors, and pull my timelines closer than 2051)", both of which seem to me like hallmarks of accounting for wrongness.

I have discussed with MIRI their decision to make their research non-disclosed-by-default and we agreed that my research agenda is a reasonable exception.

Small note: my view of MIRI's nondisclosed-by-default policy is that if all researchers involved with a research program think it should obviously be public then it should obviously be public, and that doesn't require a bunch of bureaucracy. I think this while simultaneously predicting that when researchers have a part of themselves that feels uncertain or uneasy about whether their research should be public, they will find that there are large benefits to instituting a nondisclosed-by-default policy. But the policy is there to enable researchers, not to annoy them and make them jump through hoops.

(Caveat: within ML, it's still rare for risk-based nondisclosure to be treated as a real option, and many social incentives favor publishing-by-default. I want to be very clear that within the context of those incentives, I expect many people to jump to "this seems obviously safe to me" when the evidence doesn't warrant it. I think it's important to facilitate an environment where it's not just OK-on-paper but also socially-hedonic to decide against publishing, and I think that these decisions often warrant serious thought. The aim of MIRI's disclosure policy is to remove undue pressures to make publication decisions prematurely, not to override researchers' considered conclusions.)

The second statement seems pretty plausible (when we consider human-accessible AGI designs, at least), but I'm not super confident of it, and I'm not resting my argument on it.

The weaker statement you provide doesn't seem like it's addressing my concern. I expect there are ways to get highly capable reasoning (sufficient for, e.g., gaining decisive strategic advantage) without understanding low-K "good reasoning"; the concern is that said systems are much more difficult to align.

As I noted when we chatted about this in person, my intuition is less "there is some small core of good consequentialist reasoning (it has “low Kolmogorov complexity” in some sense), and this small core will be quite important for AI capabilities" and more "good consequentialist reasoning is low-K and those who understand it will be better equipped to design AGI systems where the relevant consequentialist reasoning happens in transparent boxes rather than black boxes."

Indeed, if I thought one had to understand good consequentialist reasoning in order to design a highly capable AI system, I'd be less worried by a decent margin.

Weighing in late here, I'll briefly note that my current stance on the difficulty of philosophical issues is (in colloquial terms) "for the love of all that is good, please don't attempt to implement CEV with your first transhuman intelligence". My strategy at this point is very much "build the minimum AI system that is capable of stabilizing the overall strategic situation, and then buy a whole lot of time, and then use that time to figure out what to do with the future." I might be more optimistic than you about how easy it will turn out to be to find a reasonable method for extrapolating human volition, but I suspect that that's a moot point either way, because regardless, thou shalt not attempt to implement CEV with humanity's very first transhuman intelligence.

Also, +1 to the overall point of "also pursue other approaches".

Load More