Wiki Contributions


Godzilla Strategies

Individual humans do make off much better when they get to select between products from competing companies rather than monopolies, benefitting from companies going out of their way to demonstrate when their products are verifiably better than rivals'. Humans get treated better by sociopathic powerful politicians and parties when those politicians face the threat of election rivals (e.g. no famines). Small states get treated better when multiple superpowers compete for their allegiance. Competitive science with occasional refutations of false claims produces much more truth for science consumers than intellectual monopolies. Multiple sources with secret information are more reliable than one.

It's just routine for weaker less sophisticated parties to do better in both assessment of choices and realized outcomes when multiple better informed or powerful parties compete for their approval vs just one monopoly/cartel.

Also, a flaw in your analogy is that schemes that use AIs as checks and balances on each other don't mean more AIs. The choice is not between monster A and monsters A plus B, but between two copies of monster A (or a double-size monster A), and a split of one A and one B, where we hold something of value that we can use to help throw the contest to either A or B (or successors further evolved to win such contests). In the latter case there's no more total monster capacity, but there's greater hope of our influence being worthwhile and selecting the more helpful winner (which we can iterate some number of times).

AGI Ruin: A List of Lethalities

 I think this claim is true, on account of gray goo and lots of other things, and I suspect Eliezer does too, and I’m pretty sure other people disagree with this claim.

If you have robust alignment, or AIs that are rapidly bootstrapping their level of alignment fast enough to outpace the danger of increased capabilities, aligned AGI could get through its intelligence explosion to get radically superior technology and capabilities. It could also get a hard start on superexponential replication in space, so that no follower could ever catch up, and enough tech and military hardware to neutralize any attacks on it (and block attacks on humans via nukes, bioweapons, robots, nanotech, etc). That wouldn't work if there are thing like vacuum collapse available to attackers, but we don't have much reason to expect that from current science and the leading aligned AGI would find out first.

That could be done without any violation of the territory of other sovereign states. The legality of grabbing space resources is questionable in light of the Outer Space Treaty, but commercial exploitation of asteroids is in the Overton window. The superhuman AGI would also be in a good position to persuade and trade with any other AGI developers.

Again I think this claim is true, and I suspect Eliezer does too. In fact, my guess is that there are already single GPU chips with enough FLOP/s to run human-level, human-speed, AGI, or at least in that ballpark.

An A100 may have humanlike FLOP/s but has only 80 GB of memory,  probably orders of magnitude less memory per operation than brains. Stringing together a bunch of them makes it possible to split up human-size models and run them faster/in parallel on big batches using the extra operations.

The prototypical catastrophic AI action is getting root access to its datacenter

Some more points about this action:

  • Controlling the datacenter means controlling the gradients/reward function, so that now AIs can do things that would otherwise suffer updating from SGD, e.g. acting on inner misaligned goals, or concealing its full capabilities even when this lowers performance
    • For reward-hungry AIs, getting to set reward to maximum (and keep it there?) seems extremely desirable
    • This also means getting past interpretability tools
    • Tripwires or watchdog AIs in the same datacenter that don't succeed in stopping the action no longer have the power to interfere after, so collusion/coordination is solved within the datacenter thereafter
  • Backdoors can be hard for (non-mind-reading/weight-sharing) critics to find (see cyberwarfare and cybersecurity) without the knowledge of the creator: the effort to find or insert one backdoor to use is much less than that required to find and patch all exploits an attacker might be using
    • Although a critic who knows a backdoor can have a relatively easy time demonstrating it, and critics should be designed to access the knowledge of the attacker/misbehaving model or be heads of it
    • Critic AIs can be left in the position of saying 'a backdoor has probably been inserted somewhere or will be soon but I can't tell you where,' a known danger for AI safety through debate
  • In an intelligence explosion context where AI capabilities are growing incredibly rapidly thanks to AI labor, there isn't time for humans to do unaided examination of new AI-produced software while keeping up the pace, at best briefly examining some AI attempts to point out suspicious attempts to insert vulnerabilities
    • It's plausible that the different abilities of AIs lead to optimal code design for them being harder for humans to understand (with less experience and working memory, and eventually general intelligence)
    • If you don't let AIs  design your new datacenter software, not only will it be less efficient, but it may be easily cracked by outside AI-enabled attackers
  • We are already routinely training big models with code repositories, and using programming as an easy virtual RL task (tests can provide immediate feedback cheaply), so these are skills we are moving towards providing to AI
  • We know exploits and backdoors are possible because humans can already do them
Buck's Shortform

Agreed, and versions of them exist in human governments trying to maintain control (where non-cooordination of revolts is central).  A lot of the differences are about exploiting new capabilities like copying and digital neuroscience or changing reward hookups.

In ye olde times of the early 2010s people (such as I) would formulate questions about what kind of institutional setups you'd use to get answers out of untrusted AIs (asking them separately to point out vulnerabilities in your security arrangement, having multiple AIs face fake opportunities to whistleblow on bad behavior, randomized richer human evaluations to incentivize behavior on a larger scale).

Fractional progress estimates for AI timelines and implied resource requirements

"Overall these estimates imply a timeline of [372 years]("

That was only for Hanson's convenience sample, other surveys using the method gave much shorter timelines, as discussed in the post.

Biology-Inspired AGI Timelines: The Trick That Never Works

But new algorithms also don't work well on old hardware. That's evidence in favor of Paul's view that much software work is adapting to exploit new hardware scales.

Biology-Inspired AGI Timelines: The Trick That Never Works

A perfectly correlated time series of compute and labor would not let us say which had the larger marginal contribution, but we have resources to get at that, which I was referring to with 'plausible decompositions.' This includes experiments with old and new software and hardware, like the chess ones Paul recently commissioned, and studies by AI Impacts, OpenAI, and Neil Thompson. There are AI scaling experiments, and observations of the results of shocks like the end of Dennard scaling, the availability of GPGPU computing, and Besiroglu's data on the relative predictive power of computer and labor in individual papers and subfields.

In different ways those tend to put hardware as driving more log improvement than software (with both contributing), particularly if we consider software innovations downstream of hardware changes.

Biology-Inspired AGI Timelines: The Trick That Never Works

Progress in AI has largely been a function of increasing compute, human software research efforts, and serial time/steps. Throwing more compute at researchers has improved performance both directly and indirectly (e.g. by enabling more experiments, refining evaluation functions in chess, training neural networks, or making algorithms that work best with large compute more attractive).

Historically compute has grown by many orders of magnitude, while human labor applied to AI and supporting software  by only a few. And on plausible decompositions of progress (allowing for adjustment of software to current hardware and vice versa), hardware growth accounts for more of the progress over time than human labor input growth.

So if you're going to use an AI production function for tech forecasting based on inputs (which do relatively OK by the standards tech forecasting), it's best to use all of compute, labor, and time, but it makes sense for compute to have pride of place and take in more modeling effort and attention, since it's the biggest source of change (particularly when including software gains  downstream of hardware technology and expenditures).

Thinking about hardware has a lot of helpful implications for constraining timelines:

  • Evolutionary anchors, combined with paleontological and other information (if you're worried about Rare Earth miracles), mostly cut off extremely high input estimates for AGI development, like Robin Hanson's, and we can say from known human advantages relative to evolution that credence should be suppressed some distance short of that (moreso with more software progress)
  • You should have lower a priori credence in smaller-than-insect brains yielding AGI than more middle of the range compute budgets
  • It lets you see you should concentrate probability mass in the next decade or so because of the rapid scaleup of compute investment  (with a supporting argument from the increased growth of AI R&D effort) covering a substantial share of the orders of magnitude between where we are and levels that we should expect are overkill
  • It gets you likely AGI this century,  and on the closer part of that, with a pretty flat prior over orders of magnitude of inputs that will go into success of magnitude of inputs
  • It suggests lower annual probability later on if Moore's Law and friends are dead, with stagnant inputs to AI

These are all useful things highlighted by Ajeya's model, and by earlier work like Moravec's. In particular, I think Moravec's forecasting methods are looking pretty good, given the difficulty of the problem. He and Kurzweil (like the computing industry generally)  were surprised by the death of Dennard scaling and general price-performance of computing growth slowing, and we're definitely years behind his forecasts in AI capability, but we are seeing a very compute-intensive AI boom in the right region of compute space. Moravec also did anticipate it would take a lot more compute than one lifetime run to get to AGI. He suggested human-level AGI would be in the vicinity of human-like compute quantities being cheap and available for R&D. This old discussion is flawed, but makes me feel the dialogue is straw-manning Moravec to some extent.

Ajeya's model puts most of the modeling work on hardware, but it is intentionally expressive enough to let you represent a lot of different views about software research progress, you just have to contribute more of that yourself when adjusting weights on the different scenarios, or effective software contribution year by year. You can even represent a breakdown of the expectation that software and hardware significantly trade off over time, and very specific accounts of the AI software landscape and development paths. Regardless modeling the most importantly changing input to AGI is useful, and I think this dialogue misleads with respect to that by equivocating between hardware not being the only contributing factor and not being an extremely important to dominant driver of progress.

What Multipolar Failure Looks Like, and Robust Agent-Agnostic Processes (RAAPs)

Mainly such complete (and irreversible!) delegation to such incompetent systems being necessary or executed. If AI is so powerful that the nuclear weapons are launched on hair-trigger without direction from human leadership I expect it to not be awful at forecasting that risk.

You could tell a story where bargaining problems lead to mutual destruction, but the outcome shouldn't be very surprising on average, i.e. the AI should be telling you about it happening with calibrated forecasts.

What Multipolar Failure Looks Like, and Robust Agent-Agnostic Processes (RAAPs)

The US and China might well wreck the world  by knowingly taking gargantuan risks even if both had aligned AI advisors, although I think they likely wouldn't.

But what I'm saying is really hard to do is to make the scenarios in the OP (with competition among individual corporate boards and the like) occur without extreme failure of 1-to-1 alignment (for both companies and governments). Competitive pressures are the main reason why AI systems with inadequate 1-to-1 alignment would be given long enough leashes to bring catastrophe. I would cosign Vanessa and Paul's comments about these scenarios being hard to fit with the idea that technical 1-to-1 alignment work is much less impactful than cooperative RL or the like.


In more detail, I assign a ≥10% chance to a scenario where two or more cultures each progressively diminish the degree of control they exercise over their tech, and the safety of the economic activities of that tech to human existence, until an involuntary human extinction event.  (By comparison, I assign at most around a ~3% chance of a unipolar "world takeover" event, i.e., I'd sell at 3%.)

If this means that a 'robot rebellion' would include software produced by more than one company or country, I think that that is a substantial possibility, as well as the alternative, since competitive dynamics in a world with a few giant countries and a few giant AI companies (and only a couple leading chip firms) can mean that the way safety tradeoffs work is by one party introducing rogue AI systems that outcompete by not paying an alignment tax (and intrinsically embodying in themselves astronomically valuable and expensive IP), or cascading alignment failure in software traceable to a leading company/consortium or country/alliance. 

But either way reasonably effective 1-to-1 alignment methods (of the 'trying to help you and not lie to you and murder you with human-level abilities' variety) seem to eliminate a supermajority of the risk.

[I am separately skeptical that technical work on multi-agent RL is particularly helpful, since it can be done by 1-to-1 aligned systems when they are smart, and the more important coordination problems seem to be earlier between humans in the development phase.]

Load More