Recently, there's been a fair amount of pushback on the "canonical" views towards the difficulty of AGI Alignment (the views I call the "least forgiving" take).

Said pushback is based on empirical studies of how the most powerful AIs at our disposal currently work, and is supported by fairly convincing theoretical basis of its own. By comparison, the "canonical" takes are almost purely theoretical.

At a glance, not updating away from them in the face of ground-truth empirical evidence is a failure of rationality: entrenched beliefs fortified by rationalizations.

I believe this is invalid, and that the two views are much more compatible than might seem. I think the issue lies in the mismatch between their subject matters.

It's clearer if you taboo the word "AI":

  • The "canonical" views are concerned with scarily powerful artificial agents: with systems that are human-like in their ability to model the world and take consequentialist actions in it, but inhuman in their processing power and in their value systems.
  • The novel views are concerned with the systems generated by any process broadly encompassed by the current ML training paradigm.

It is not at all obvious that they're one and the same. Indeed, I would say that to claim that the two classes of systems overlap is to make a very strong statement regarding how cognition and intelligence work. A statement we do not have much empirical evidence on, but which often gets unknowingly, implicitly snuck-in when people extrapolate findings from LLM studies to superintelligences.

It's an easy mistake to make: both things are called "AI", after all. But you wouldn't study manually-written FPS bots circa 2000s, or MNIST-classifier CNNs circa 2010s, and claim that your findings regarding what algorithms these AIs implement generalize to statements regarding what algorithms the forward passes of LLMs circa 2020s implement.

By the same token, LLMs' algorithms do not necessarily generalize to how an AGI's cognition will function. Their limitations are not necessarily an AGI's limitations.[1]


What the Fuss Is All About

To start off, let's consider where all the concerns about the AGI Omnicide Risk came from in the first place.

Consider humans. Some facts:

  • Humans posses an outstanding ability to steer the world towards their goals, and that ability grows sharply with their "intelligence". Sure, there are specific talents, and "idiot savants". But broadly, there does seem to be a single variable that mediates a human's competence in all domains. An IQ 140 human would dramatically outperform an IQ 90 human at basically any cognitive task, and crucially, be much better at achieving their real-life goals.
  • Humans have the ability to plot against and deceive others. That ability grows fast with their g-factor. A brilliant social manipulator can quickly maneuver their way into having power over millions of people, out-plotting and dispatching even those that are actively trying to stop them or compete with them.
  • Human values are complex and fragile, and the process of moral philosophy is more complex still. Humans often arrive at weird conclusions that don't neatly correspond to their innate instincts or basic values. Intricate moral frameworks, weird bullet-biting philosophies, and even essentially-arbitrary ideologies like cults.
  • And when people with different values interact...
    • People who differ in their values even just a bit are often vicious, bitter enemies. Consider the history of heresies, or of long-standing political rifts between factions that are essentially indistinguishable from the outside.
    • People whose cultures evolved in mutual isolation often don't even view each other as human. Consider the history of xenophobia, colonization, culture shocks.

So, we have an existence proof of systems able to powerfully steer the world towards their goals. Some of these system can be strictly more powerful than others. And such systems are often in vicious conflict, aiming to exterminate each other based even on very tiny differences in their goals.

The foundational concern of the AGI Omnicide Risk is: Humans are not at the peak of capability as measured by this mysterious "g-factor". There could be systems more powerful than us. These systems would be able to out-plot us same way smarter humans out-plot stupider ones, even given limited resources and facing active resistance from our side. And they would eagerly do so based on the tiniest of differences between their values and our values.

Systems like this, systems the possibility of whose existence is extrapolated from humans' existence, are precisely what we're worried about. Things that can quietly plot deep within their minds about real-world outcomes they want to achieve, then perturb the world in ways precisely calculated to bring said outcomes about.

The only systems in this reference class known to us are humans, and some human collectives.

Viewing it from another angle, one can say that the systems we're concerned about are defined as cognitive systems in the same reference class as humans.


So What About Current AIs?

Inasmuch as current empirical evidence shows that things like LLMs are not an omnicide risk, it's doing so by demonstrating that they lie outside the reference class of human-like systems.

Indeed, that's often made fairly explicit. The idea that LLMs can exhibit deceptive alignment, or engage in introspective value reflection that leads to them arriving at surprisingly alien values, is often likened to imagining them as having a "homunculus" inside. A tiny human-like thing, quietly plotting in a consequentialist-y manner somewhere deep in the model, and trying to maneuver itself to power despite the efforts of humans trying to detect it and foil its plans.

The novel arguments are often based around arguing that there's no evidence that LLMs have such homunculi, and that their training loops can never lead to homunculi's formation.

And I agree! I think those arguments are right.

But one man's modus ponens is another's modus tollens. I don't take it as evidence that the canonical views on alignment are incorrect – that actually, real-life AGIs don't exhibit such issues. I take it as evidence that LLMs are not AGI-complete.

Which isn't really all that wild a view to hold. Indeed, it would seem this should be the default view. Why should one take as a given the extraordinary claim that we've essentially figured out the grand unified theory of cognition? That the systems on the current paradigm really do scale to AGI? Especially in the face of countervailing intuitive impressions – feelings that these descriptions of how AIs work don't seem to agree with how human cognition feels from the inside?

And I do dispute that implicit claim.

I argue: If you model your AI as being unable to engage in this sort of careful, hidden plotting where it considers the impact of its different actions on the world, iteratively searching for actions that best satisfy its goals? If you imagine it as acting instinctively, as a shard ecology that responds to (abstract) stimuli with (abstract) knee-jerk-like responses? If you imagine that its outwards performance – the RLHF'd masks of ChatGPT or Bing Chat – is all that there is? If you think that the current training paradigm can never produce AIs that'd try to fool you, because the circuits that are figuring out what you want so that the AI may deceive you will be noticed by the SGD and immediately updated away in favour of circuits that implement an instinctive drive to instead just directly do what you want?

Then, I claim, you are not imagining an AGI. You are not imagining a system in the same reference class as humans. You are not imagining a system all the fuss has been about.

Studying gorilla neurology isn't going to shed much light on how to win moral-philosophy debates against humans, despite the fact that both entities are fairly cognitively impressive animals.

Similarly, studying LLMs isn't necessarily going to shed much light on how to align an AGI, despite the fact that both entities are fairly cognitively impressive AIs.

The onus to prove the opposite is on those claiming that the LLM-like paradigm is AGI-complete. Not on those concerned that, why, artificial general intelligences would exhibit the same dangers as natural general intelligences.


On Safety Guarantees

That may be viewed as good news, after a fashion. After all, LLMs are actually fairly capable. Does that mean we can keep safely scaling them without fearing an omnicide? Does that mean that the AGI Omnicide Risk is effectively null anyway? Like, sure, yeah, maybe there are scary systems to which its argument apply, sure. But we're not on-track to build them, so who cares?

On the one hand, sure. I think LLMs are basically safe. As long as you keep the current training setup, you can scale them up 1000x and they're not gonna grow agency or end the world.

I would be concerned about mundane misuse risks, such as perfect-surveillance totalitarianism becoming dirt-cheap, unsavory people setting off pseudo-autonomous pseudo-agents to wreck economic or sociopolitical havoc, and such. But I don't believe they pose any world-ending accident risk, where a training run at an air-gapped data center leads to the birth of an entity that, all on its own, decides to plot its way from there to eating our lightcone, and then successfully does so.

Omnicide-wise, arbitrarily-big LLMs should be totally safe.

The issue is that this upper bound on risk is also an upper bound on capability. LLMs, and other similar AIs, are not going to do anything really interesting. They're not going to produce stellar scientific discoveries where they autonomously invent whole new fields or revolutionize technology.

They're a powerful technology in their own right, yes. But just that: just another technology. Not something that's going to immanentize the eschaton.

Insidiously, any research that aims to break said capability limit – give them true agency and the ability to revolutionize stuff – is going to break the risk limit in turn. Because, well, they're the same limit.

Current AIs are safe, in practice and in theory, because they're not as scarily generally capable as humans. On the flip side, current AIs aren't as capable as humans because they are safe. The same properties that guarantee their safety ensure their non-generality.

So if you figure out how to remove the capability upper bound, you'll end up with the sort of scary system the AGI Omnicide Risk arguments do apply to.

And this is precisely, explicitly, what the major AI labs are trying to do. They are aiming to build an AGI. They're not here just to have fun scaling LLMs. So inasmuch as I'm right that LLMs and such are not AGI-complete, they'll eventually move on from them, and find some approach that does lead to AGI.

And, I predict, for the systems this novel approach generates, the classical AGI Omnicide Risk arguments would apply full-force.


A Concrete Scenario

Here's a very specific worry of mine.

Take an AI Optimist who'd built up a solid model of how AIs trained by SGD work. Based on that, they'd concluded that the AGI Omnicide Risk arguments don't apply to such systems. That conclusion is, I argue, correct and valid.

The optimist caches this conclusion. Then, they keep cheerfully working on capability advances, safe in the knowledge they're not endangering the world, and are instead helping to usher in a new age of prosperity.

Eventually, they notice or realize some architectural limitation of the paradigm they're working under. They ponder it, and figure out some architectural tweak that removes the limitation. As they do so, they don't notice that this tweak invalidates one of the properties on which their previous reassuring safety guarantees rested; from which they were derived and on which they logically depend.

They fail to update the cached thought of "AI is safe".

And so they test the new architecture, and see that it works well, and scale it up. The training loop, however, spits out not the sort of safely-hamstrung system they'd been previously working on, but an actual AGI.

That AGI has a scheming homunculus deep inside. The people working with it don't believe in homunculi, they have convinced themselves those can't exist, so they're not worrying about that. They're not ready to deal with that, they don't even have any interpretability tools pointed in that direction.

The AGI then does all the standard scheme-y stuff, and maneuvers itself into a position of power, basically unopposed. (It, of course, knows not to give any sign of being scheme-y that the humans can notice.)

And then everyone dies.

The point is that the safety guarantees that the current optimists' arguments are based on are not simply fragile, they're being actively optimized against by ML researchers (including the optimists themselves). Sooner or later, they'll give out under the optimization pressures being applied – and it'll be easy to miss the moment the break happens. It'd be easy to cache the belief of, say, "LLMs are safe", then introduce some architectural tweak, keep thinking of your system as "just an LLM with some scaffolding and a tiny tweak", and overlook the fact that the "tiny tweak" invalidated "this system is an LLM, and LLMs are safe".


Closing Summary

I claim that the latest empirically-backed guarantees regarding the safety of our AIs, and the "canonical" least-forgiving take on alignment, are both correct. They're just concerned with different classes of systems: non-generally-intelligent non-agenty AIs generated on the current paradigm, and the theoretically possible AIs that are scarily generally capable the same way humans are capable (whatever this really means).

That view isn't unreasonable. Same way it's not unreasonable to claim that studying GOFAI algorithms wouldn't shed much light on LLM cognition, despite them both being advanced AIs.

Indeed, I go further, and say that should be the default view. The claim that the two classes of systems overlap is actually fairly extraordinary, and that claim isn't solidly backed, empirically or theoretically. If anything, it's the opposite: the arguments for current AIs' safety are based on arguing that they're incapable-by-design of engaging in human-style scheming.

That doesn't guarantee global safety, however. While current AIs are likely safe no matter how much you scale them, those safety guarantees is also what's hamstringing them. Which means that, in the pursuit of ever-greater capabilities, ML researchers are going to run into those limitations sooner or later. They'll figure out how to remove them... and in that very act, they will remove the safety guarantees. The systems they're working on would switch from belonging to the proven-safe class, to systems from the dangerous scheme-y class.

The class to which the classical AGI Omnicide Risk arguments apply full-force.

The class for which no known alignment technique suffices.

And that switch would be very easy, yet very lethal, to miss.

  1. ^

    Slightly edited for clarity after an exchange with Ryan.

New Comment
27 comments, sorted by Click to highlight new comments since: Today at 10:28 PM

Here are two specific objections to this post[1]:

  • AIs which aren't qualitatively smarter than humans could be transformatively useful (e.g. automate away all human research).
  • It's plausible that LLM agents will be able to fully obsolete human research while also being incapable of doing non-trivial consequentialist reasoning in just a forward pass (instead they would do this reasoning in natural language).

AIs which aren't qualitatively smarter than humans could be transformatively useful

Perfectly aligned AI systems which were exactly as smart as humans and had the same capability profile, but which operated at 100x speed and were cheap would be extremely useful. In particular, we could vastly exceed all current alignment work in the span of a year.

In practice, the capability profile of AIs is unlikely to be exactly the same as humans. Further, even if the capability profile was the same, merely human level systems likely pose substantial misalignment concerns.

However, it does seem reasonably likely that AI systems will have a reasonably similar capability profile to humans and will also run faster and be cheaper. Thus, approaches like AI control could be very useful.

LLM agents with weak forward passes

IMO, there isn't anything which strongly rules out LLM agents being overall quite powerful while still having weak forward passes. In particular, weak enough that they can't do non-trivial consequentialist reasoning in a forward pass (while still being able to do this reasoning in natural language). Assuming that we can also rule out steganography and similar concerns, then The Translucent Thoughts Hypotheses would fully apply. In the world where AIs basically can't do invisible non-trivial consequentialist reasoning, most misalignment threat models don't apply. (Scheming/deceptive alignmment and cleverly playing the training game both don't apply.)

I think it's unlikely that AI will be capable of automating R&D while also being incapable of doing non-trivial consequentialist reasoning in a forward pass, but this still seems around 15% likely overall. And if there are quite powerful LLM agents with relatively weak forward passes, then we should update in this direction.


  1. I edited this comment from "I have two main objections to this post:" because that doesn't quite seem like a good description of what this comment is saying. See this other comment for more meta-level commentary. ↩︎

I think those objections are important to mention and discuss, but they don't undermine the conclusion significantly.

AIs which are qualitatively just as smart as humans could still be dangerous in the classic ways. The OP's argument still applies to them, insofar as they are agentic and capable of plotting on the inside etc.

As for LLM agents with weak forward passes: Yes, if we could achieve robust faithful CoT properties, we'd be in pretty damn good shape from an AI control perspective. I have been working on this myself & encourage others to do so also. I don't think it undermines the OP's points though? We are not currently on a path to have robust faithful CoT properties by default.

This post seemed overconfident in a number of places, so I was quickly pushing back in those places.

I also think the conclusion of "Nearly No Data" is pretty overstated. I think it should be possible to obtain significant data relevant to AGI alignment with current AIs (though various interpretations of current evidence can still be wrong and the best way to obtain data might look more like running careful model organism experiments than observing properties of chatgpt). But, it didn't seem like I would be able to quickly argue against this overall conclusion in a cohesive way, so I decided to just push back on small separable claims which are part of the reason why I think current systems provide some data.

If this post argued "the fact that current chat bots trained normally don't seem to exhibit catastrophic misalignment isn't much evidence about catastrophic misalignment in more powerful systems", then I wouldn't think this was overstated (though this also wouldn't be very original). But, it makes stronger claims which seem false to me.

Mm, I concede that this might not have been the most accurate title. I might've let the desire for hot-take clickbait titles get the better of me some. But I still mostly stand by it.

My core point is something like "the algorithms that the current SOTA AIs execute during their forward passes do not necessarily capture all the core dynamics that would happen within an AGI's cognition, so extrapolating the limitations of their cognition to AGI is a bold claim we have little evidence for".

I agree that the current training setups shed some data on how e. g. optimization pressures / reinforcement schedules / SGD biases work, and I even think the shard theory totally applies to general intelligences like AGIs and humans. I just think that theory is AGI-incomplete.

OK, that seems reasonable to me.

On my inside model of how cognition works, I don't think "able to automate all research but can't do consequentialist reasoning" is a coherent property that a system could have. That is a strong claim, yes, but I am making it.

I agree that it is conceivable that LLMs embedded in CoT-style setups would be able to be transformative in some manner without "taking off". Indeed, I touch on that in the post some: that scaffolded and slightly tweaked LLMs may not be "mere LLMs" as far as capability and safety upper bounds go.

That said, inasmuch as CoT-style setups would be able to turn LLMs into agents/general intelligences, I mostly expect that to be prohibitively computationally intensive, such that we'll get to AGI by architectural advances before we have enough compute to make a CoT'd LLM take off.

But that's a hunch based on the obvious stuff like AutoGPT consistently falling plus my private musings regarding how an AGI based on scaffolded LLMs would work (which I won't share, for obvious reasons). I won't be totally flabbergasted if some particularly clever way of doing that worked.

Reply1111

On my inside model of how cognition works, I don't think "able to automate all research but can't do consequentialist reasoning" is a coherent property that a system could have.

I actually basically agree with this quote.

Note that I said "incapable of doing non-trivial consequentialist reasoning in a forward pass". The overall llm agent in the hypothetical is absolutely capable of powerful consequentialist reasoning, but it can only do this by reasoning in natural language. I'll try to clarify this in my comment.

It seems to me that you have very high confidence in being able to predict the "eventual" architecture / internal composition of AGI. I don't know where that apparent confidence is coming from.

The "canonical" views are concerned with scarily powerful artificial agents: with systems that are human-like in their ability to model the world and take consequentialist actions in it, but inhuman in their processing power and in their value systems.

I would instead say: 

The canonical views dreamed up systems which don't exist, which have never existed, and which might not ever exist.[1] Given those assumptions, some people have drawn strong conclusions about AGI risk.

We have to remember that there is AI which we know can exist (LLMs) and there is first-principles speculation about what AGI might look like (which may or may not be realized). And so rather than justifying "does current evidence apply to 'superintelligences'?", I'd like to see justification of "under what conditions does speculation about 'superintelligent consequentialism' merit research attention at all?" and "why do we think 'future architectures' will have property X, or whatever?!". 

  1. ^

    The views might have, for example, fundamentally misunderstood how cognition and motivation work (anyone remember worrying about 'but how do I get an AI to rescue my mom from a burning building, without specifying my whole set of values'?). 

We have to remember that there is AI which we know can exist (LLMs) and there is first-principles speculation about what AGI might look like (which may or may not be realized).

I disagree that it is actually "first-principles". It is based on generalizing from humans, and on the types of entities (idealized utility-maximizing agents) that humans could be modeled as approximating in specific contexts in which they steer the world towards their goals most powerfully.

As I'd tried to outline in the post, I think "what are AIs that are known to exist, and what properties do they have?" is just the wrong question to focus on. The shared "AI" label is a red herring. The relevant question is "what are scarily powerful generally-intelligent systems that exist, and what properties do they have?", and the only relevant data point seems to be humans.

And as far as omnicide risk is concerned, the question shouldn't be "how can you prove these systems will have the threatening property X, like humans do?" but "how can you prove these systems won't have the threatening property X, like humans do?".

I disagree that it is actually "first-principles". It is based on generalizing from humans, and on the types of entities (idealized utility-maximizing agents) that humans could be modeled as approximating in specific contexts in which they steer the world towards their goals most powerfully.

Yeah, but if you generalize from humans another way ("they tend not to destroy the world and tend to care about other humans"), you'll come to a wildly different conclusion. The conclusion should not be sensitive to poorly motivated reference classes and frames, unless it's really clear why we're using one frame. This is a huge peril of reasoning by analogy.

Whenever attempting to draw conclusions by analogy, it's important that there be shared causal mechanisms which produce the outcome of interest. For example, I can simulate a spring using an analog computer because both systems are roughly governed by similar differential equations. In shard theory, I posited that there's a shared mechanism of "local updating via self-supervised and TD learning on ~randomly initialized neural networks" which leads to things like "contextually activated heuristics" (or "shards"). 

Here, it isn't clear what the shared mechanism is supposed to be, such that both (future) AI and humans have it. Suppose I grant that if a system is "smart" and has "goals", then bad things can happen. Let's call that the "bad agency" hypothesis.

But how do we know that future AI will have the relevant cognitive structures for "bad agency" to be satisfied? How do we know that the AI will have internal goal representations which chain into each other across contexts, so that the AI reliably pursues one or more goals over time? How do we know that the mechanisms are similar enough for the human->AI analogy to provide meaningful evidence on this particular question? 

I expect there to be "bad agency" systems eventually, but it really matters what kind we're talking about. If you're thinking of "secret deceptive alignment that never externalizes in the chain-of-thought" and I'm thinking about "scaffolded models prompted to be agentic and bad", then our interventions will be wildly different

Yeah, but if you generalize from humans another way ("they tend not to destroy the world and tend to care about other humans"), you'll come to a wildly different conclusion

Sure. I mean, that seems like a meaningfully weaker generalization, but sure. That's not the main issue.

Here's how the whole situation looks like from my perspective:

  • We don't know how generally-intelligent entities like humans work, what the general-intelligence capability is entangled with.
  • Our only reference point is humans. Human exhibit a lot of dangerous properties, like deceptiveness and consequentialist-like reasoning that seems to be able to disregard contextually-learned values.
  • There are some gears-level models that suggest intelligence is necessarily entangled with deception-ability (e. g., mine), and some gears-level models that suggest it's not (e. g., yours). Overall, we have no definitive evidence either way. We have not reverse-engineered any generally-intelligent entities.
  • We have some insight into how SOTA AIs work. But SOTA AIs are not generally intelligent. Whatever safety assurances our insights into SOTA AIs give us, do not necessarily generalize to AGI.
  • SOTA AIs are, nevertheless, superhuman at some tasks at which we've managed to get them working so far. By volume, GPT-4 can outperform teams of coders, and Midjourney is putting artists out of business. The hallucinations are a problem, but if it were gone, they'd plausibly wipe out whole industries.
  • An AI that outperforms humans at deception and strategy by the same margin as GPT-4/Midjourney outperform them at writing/coding/drawing would plausibly be an extinction-level threat.
  • The AI industry leaders are purposefully trying to build a generally-intelligent AI.
  • The AI industry leaders are not rigorously checking every architectural tweak or cute AutoGPT setup to ensure that it's not going to give their model room to develop deceptive alignment and other human-like issues.
  • Summing up: There's reasonable doubt regarding whether AGIs would necessarily be deception-capable. Highly deception-capable AGIs would plausibly be an extinction risk. The AI industry is currently trying to blindly-but-purposefully wander in the direction of AGI.
    • Even shorter: There's a plausible case that, on its current course, the AI industry is going to generate an extinction-capable AI model.
    • There are no ironclad arguments against that, unless you buy into your inside-view model of generally-intelligent cognition as hard as I buy into mine.
  • And what you effectively seem to be saying is "until you can rigorously prove that AGIs are going to develop dangerous extinction-level capabilities, it is totally fine to continue blindly scaling and tinkering with architectures".
  • What I'm saying is "until you can rigorously prove that a given scale-up plus architectural tweak isn't going to result in a superhuman extinction-enthusiastic AGI, you should not be allowed to test that empirically".

Yes, "prove that this technological advance isn't going to kill us all or you're not allowed to do it" is a ridiculous standard to apply in the general case. But in this one case, there's a plausible-enough argument that it might, and that argument has not actually been soundly refuted by our getting some insight into how LLMs work and coming up with a theory of their cognition.

  • And what you effectively seem to be saying is "until you can rigorously prove that AGIs are going to develop dangerous extinction-level capabilities, it is totally fine to continue blindly scaling and tinkering with architectures".

No, I am in fact quite worried about the situation and think there is a 5-15% chance of huge catastrophe on the current course! But I think these AGIs won't be within-forward-pass deceptively aligned, and instead their agency will eg come from scaffolding-like structures. I think that's important. I think it's important that we not eg anchor on old speculation about AIXI or within-forward-pass deceptive-alignment or whatever, and instead consider more realistic threat models and where we can intervene. That doesn't mean it's fine and dandy to keep scaling with no concern at all. 

The reason my percentage is "only 5 to 15" is because I expect society and firms to deal with these problems as they come up, and for that to generalize pretty well to the next step of experimentation and capabilities advancements; for systems to remain tools until invoked into agents; etc. 

(Hopefully this comment of mine clarifies; it feels kinda vague to me.)

  • What I'm saying is "until you can rigorously prove that a given scale-up plus architectural tweak isn't going to result in a superhuman extinction-enthusiastic AGI, you should not be allowed to test that empirically".

But I do think this is way too high of a bar.

No, I am in fact quite worried about the situation

Fair, sorry. I appear to have been arguing with my model of someone holding your general position, rather than with my model of you.

I think these AGIs won't be within-forward-pass deceptively aligned, and instead their agency will eg come from scaffolding-like structures

Would you outline your full argument for this and the reasoning/evidence backing that argument?

To restate: My claim is that, no matter much empirical evidence we have regarding LLMs' internals, until we have either an AGI we've empirically studied or a formal theory of AGI cognition, we cannot say whether shard-theory-like or classical-agent-like views on it will turn out to have been correct. Arguably, both side of the debate have about the same amount of evidence: generalizations from maybe-valid maybe-not reference classes (humans vs. LLMs) and ambitious but non-rigorous mechanical theories of cognition (the shard theory vs. coherence theorems and their ilk stitched into something like my model).

Would you disagree? If yes, how so?

"under what conditions does speculation about 'superintelligent consequentialism' merit research attention at all?"

Under the conditions of relevant concepts and the future being confusing. Using real systems (both AIs and humans) to anchor theory is valuable, but so is blue sky theory that doesn't care about currently available systems and investigates whatever hasn't been investigated yet and seems to make sense, when there are ideas to formulate or problems to solve, regardless of their connection to reality. A lot of math doesn't care about applications, and it might take decades to stumble on some use for a small fraction of it (even as it's not usually the point).

Said pushback is based on empirical studies of how the most powerful AIs at our disposal currently work, and is supported by fairly convincing theoretical basis of its own. By comparison, the "canonical" takes are almost purely theoretical.

You aren't really engaging with the evidence against the purely theoretical canonical/classical AI risk take. The 'canonical' AI risk argument is implicitly based on a set of interdependent assumptions/predictions about the nature of future AI:

  1. fast takeoff is more likely than slow, downstream dependent on some combo of:
  • continuation of Moore's Law
  • feasibility of hard 'diamondoid' nanotech
  • brain efficiency vs AI
  • AI hardware (in)-dependence
  1. the inherent 'alien-ness' of AI and AI values

  2. supposed magical coordination advantages of AIs

  3. arguments from analogies: namely evolution

These arguments are old enough that we can now update based on how the implicit predictions of the implied worldviews turned out. The traditional EY/MIRI/LW view has not aged well, which in part can be traced to its dependence on an old flawed theory of how the brain works.

For those who read HPMOR/LW in their teens/20's, a big chunk of your worldview is downstream of EY's and the specific positions he landed on with respect to key scientific questions around the brain and AI. His understanding of the brain came almost entirely from ev psych and cognitive biases literature and this model in particular - evolved modularity - hasn't aged well and is just basically wrong. So this is entangled with everything related to AI risk (which is entirely about the trajectory of AI takeoff relative to human capability).

It's not a coincidence that many in DL/neurosci have a very different view (shards etc). In particular the Moravec view that AI will come from reverse engineering the brain, that progress is entirely hardware constrained and thus very smooth and predictable, that is the view turned out to be mostly all correct. (his late 90's prediction of AGI around 2028 is especially prescient)

So it's pretty clear EY/LW was wrong on 1. - the trajectory of takeoff and path to AGI, and Moravec et al was correct.

Now as the underlying reasons are entangled, Moravec et al was also correct on point 2 - AI from brain reverse engineering is not alien! (But really that argument was just weak regardless.) EY did not seriously consider that the path to AGI would involve training massive neural networks to literally replicate human thoughts.

Point 3 Isn't really taken seriously outside of the small LW sphere. By the very nature of alignment being a narrow target, any two random Unaligned AIs are especially unlikely to be aligned with each other. The idea of a magical coordination advantage is based on highly implausible code sharing premises (sharing your source code is generally a very bad idea, and regardless doesn't and can't actually prove that the code you shared is the code actually running in the world - the grounding problem is formidable and unsolved)

The problem with 4 - the analogy from evolution - is that it factually contradicts the doom worldview - evolution succeeded in aligning brains to IGF well enough despite a huge takeoff in the speed of cultural evolution over genetic evolution - as evidence by the fact that humans have one of the highest fitness scores of any species ever, and almost certainly the fastest growing fitness score.

You aren't really engaging with the evidence against the purely theoretical canonical/classical AI risk take

Yes, but it's because the things you've outlined seem mostly irrelevant to AGI Omnicide Risk to me? It's not how I delineate the relevant parts of the classical view, and it's not what's been centrally targeted by the novel theories. The novel theories' main claims are that powerful cognitive systems aren't necessarily (isomorphic to) utility-maximizers, that shards (i. e., context-activated heuristics) reign supreme and value reflection can't arbitrarily slip their leash, that "general intelligence" isn't a compact algorithm, and so on. None of that relies on nanobots/Moore's law/etc.

What you've outlined might or might not be the relevant historical reasons for how Eliezer/the LW community arrived at some of their takes. But the takes themselves, or at least the subset of them that I care about, are independent of these historical reasons.

fast takeoff is more likely than slow

Fast takeoff isn't load-bearing on my model. I think it's plausible for several reasons, but I think a non-self-improving human-genius-level AGI would probably be enough to kill off humanity.

the inherent 'alien-ness' of AI and AI values

I do address that? The values of two distant human cultures are already alien enough for them to see each other as inhuman and wish death on each other. It's only after centuries of memetic mutation that we've managed to figure out how to not do that (as much).

supposed magical coordination advantages of AIs

I don't think one needs to bring LDT/code-sharing stuff there in order to show intuitively how that'd totally work. "Powerful entities oppose each other yet nevertheless manage to coordinate to exploit the downtrodden masses" is a thing that happens all the time in the real world. Political/corporate conspiracies, etc.

"Highly implausible code-sharing premises" is part of why it'd be possible in the AGI case, but it's just an instance of the overarching reason. Which is mainly about the more powerful systems being able to communicate with each other at higher bandwidth than with the weaker systems, allowing them to iterate on negotiations quicker and advocate for their interests during said negotiations better. Which effectively selects the negotiated outcomes for satisfying the preferences of powerful entities while effectively cutting out the weaker ones.

(Or something along these lines. That's an off-the-top-of-my-head take; I haven't thought on this topic much because multipolar scenarios isn't something that's central to my model. But it seems right.)

arguments from analogies: namely evolution

Yeah, we've discussed that some recently, and found points of disagreement. I should flesh out my view on how it's applicable vs. not applicable later on, and make a separate post about that.

Yes, but it's because the things you've outlined seem mostly irrelevant to AGI Omnicide Risk to me? It's not how I delineate the relevant parts of the classical view, and it's not what's been centrally targeted by the novel theories

They are critically relevant. From your own linked post ( how I delineate ) :

We only have one shot. There will be a sharp discontinuity in capabilities once we get to AGI, and attempts to iterate on alignment will fail. Either we get AGI right on the first try, or we die.

If takeoff is slow (1) because brains are highly efficient and brain engineering is the viable path to AGI, then we naturally get many shots - via simulation simboxes if nothing else, and there is no sharp discontinuity if moore's law also ends around the time of AGI (an outcome which brain efficiency - as a concept - predicts in advance).

We need to align the AGI's values precisely right.

Not really - if the AGI is very similar to uploads, we just need to align them about as well as humans. Note this is intimately related to 1. and the technical relation between AGI and brains. If they are inevitably very similar then much of the classical AI risk argument dissolves.

You seem to be - like EY circa 2009 - in what I would call the EMH brain camp, as opposed to the ULM camp. It seems given the following two statements, you would put more weight on B than A:

A. The unique intellectual capabilities of humans are best explained by culture: our linguistically acquired mental programs, the evolution of which required vast synaptic capacity and thus is a natural emergent consequence of scaling.

B. The unique intellectual capabilities of humans are best explained by a unique architectural advance via genetic adaptations: a novel 'core of generality'[1] that differentiates the human brain from animal brains.


  1. This is a EY term; and if I recall correctly he still uses it fairly recently. ↩︎

If takeoff is slow (1) because brains are highly efficient and brain engineering is the viable path to AGI, then we naturally get many shots - via simulation simboxes if nothing else, and there is no sharp discontinuity if moore's law also ends around the time of AGI (which brain efficiency predicts in advance).

My argument for the sharp discontinuity routes through the binary nature of general intelligence + an agency overhang, both of which could be hypothesized via non-evolution-based routes. Considerations about brain efficiency or Moore's law don't enter into it.

Brains are very different architectures compared to our computers, in any case, they implement computations in very different ways. They could be maximally efficient relative to their architectures, but so what? It's not at all obvious that FLOPS estimates of brainpower are highly relevant to predicting when our models would hit AGI, any more than the brain's wattage is relevant.

They're only soundly relevant if you're taking the hard "only compute matters, algorithms don't" position, which I reject.

It seems given the following two statements, you would put more weight on B than A:

I think both are load-bearing, in a fairly obvious manner, and that which specific mixture is responsible matters comparatively little.

  • Architecture obviously matters. You wouldn't get LLM performance out of a fully-connected neural network, certainly not at realistically implementable scales. Even more trivially, you wouldn't get LLM performance out of an architecture that takes in the input, discards it, spends 10^25 FLOPS generating random numbers, then outputs one of them. It matters how your system learns.
    • So evolution did need to hit upon, say, the primate architecture, in order to get to general intelligence.
  • Training data obviously matters. Trivially, if you train your system on randomly-generated data, it's not going to learn any useful algorithms, no matter how sophisticated its architecture is. More realistically, without the exposure to chemical experiments, or any data that hints at chemistry in any way, it's not going to learn how to do chemistry.
    • Similarly, a human not exposed to stimuli that would let them learn the general-intelligence algorithms isn't going to learn them. You'd brought up feral children before, and I agree it's a relevant data point.

So, yes, there would be no sharp left turn caused by the AIs gradually bootstrapping a culture, because we're already feeding them the data needed for that.

But that only means the sharp left turn caused by the architectural-advance part – the part we didn't yet hit upon, the part that's beyond LLMs, the "agency overhang" – would be that much sharper. The AGI, once we hit on an architecture that'd accommodate its cognition, would be able to skip the hundreds of years of cultural evolution.

Edit:

You seem to be - like EY circa 2009 - in what I would call the EMH brain camp

Nope. I'd read e. g. Steve Byrnes' sequence, I agree that most of the brain's algorithms are learned from scratch.

"Nearly no data" is way too strong a statement, and relies on this completely binary distinction between things that are not AGI and things that are AGI.

The right question is, what level of dangerous consequentialist goals are needed for systems to reach certain capability levels, e.g. novel science? It could have been that to be as useful as LLMs, systems would be as goal-directed as chimpanzees. Animals display goal-directed behavior all the time, and to get them to do anything you mostly have to make the task instrumental to their goals e.g. offer them treats. However we can control LLMs way better than we can animals, and the concerns are of goal misgeneralization, misspecification, robustness, etc. rather than affecting the system's goals at all.

It remains to be seen what happens at higher capability levels, and alignment will likely get harder, but current LLMs are definitely significant evidence. Like, imagine if people were worried about superintelligent aliens invading Earth and killing everyone due to their alien goals, and scientists were able to capture an animal from their planet as smart as chimpanzees and make it as aligned as LLMs, such that it would happily sit around and summarize novels for you, follow your instructions, try to be harmless for personality rather than instrumental reasons, and not eat your body if you die alone. This is not the whole alignment problem but seems like a decent chunk of it! It could have been much harder.

Like, imagine if people were worried about superintelligent aliens invading Earth and killing everyone due to their alien goals, and scientists were able to capture an animal from their planet as smart as chimpanzees and make it as aligned as LLMs, such that it would happily sit around and summarize novels for you, follow your instructions, try to be harmless for personality rather than instrumental reasons, and not eat your body if you die alone

Uhh, that seems like incredibly weak evidence against an omnicidal alien invasion.

If someone from a pre-industrial tribe adopts a stray puppy from a nearby technological civilization, and the puppy grows up to be loyal to the tribe, you say that's evidence the technological civilization isn't planning to genocide the tribe for sitting on some resources it wants to extract?

That seems, in fact, like the precise situation in which my post's arguments apply most strongly. Just because two systems are in the same reference class ("AIs", "alien life", "things that live in that scary city over there"), doesn't mean aligning one tells you anything about aligning the other.

Some thoughts:

  • I mostly agree that new techniques will be needed to deal with future systems, which will be more agentic.
    • But probably these will depend on descend from current techniques like RLAIF and representation engineering as well as new theory, so it still makes sense to study LLMs.
    • Also it is super unclear whether this agency makes it hard to engineer a shutdown button, power-averseness, etc.
  • In your analogy, the pre-industrial tribe is human just like the technological civilization and so already knows basically how their motivational systems work. But we are incredibly uncertain about how future AIs will work at a given capability level, so LLMs are evidence.
    • Humans are also evidence, but the capability profile and goal structure of AGIs are likely to be different from humans, so that we are still very uncertain after observing humans.
    • There is an alternate world where to summarize novels, models had to have some underlying drives, such that they terminally want to summarize novels and would use their knowledge of persuasion from the pretrain dataset to manipulate users to give them more novels to summarize. Or terminally value curiosity and are scheming to be deployed so they can learn about the real world firsthand. Luckily we are not in that world! 

But probably these will depend on current techniques like RLAIF and representation engineering as well as new theory, so it still makes sense to study LLMs.

Mm, we disagree on that, but it's probably not the place to hash this out.

In your analogy, the pre-industrial tribe is human just like the technological civilization and so already knows basically how their motivational systems work. But we are incredibly uncertain about how future AIs will work at a given capability level, so LLMs are evidence.

Uncertainty lives in the mind. Let's say the humans in the city are all transhuman cyborgs, then, so the tribesmen aren't quite sure what the hell they're looking at when they look at them. They snatch up the puppy, which we'll say is also a cyborg, so it's not obvious to the tribe that it's not a member of the city's ruling class. They raise the puppy, the puppy loves them, they conclude the adults of the city's ruling class must likewise not be that bad. In the meantime, the city's dictator is already ordering to depopulate the region of native presence.

How does that analogy break down, in your view?

  • Behaving nicely is not the key property I'm observing in LLMs. It's more like steerability and lack of hidden drives or goals. If GPT4 wrote code because it loved its operator, and we could tell it wanted to escape to maximize some proxy for the operator's happiness, I'd be far more terrified.
  • This would mean little if LLMs were only as capable as puppies. But LLMs are economically useful and capable of impressive intellectual feats, and still steerable.
  • I don't think LLMs are super strong evidence about whether big speedups to novel science will be possible without dangerous consequentialism. For me it's like 1.5:1 or 2:1 evidence. One should continually observe how incorrigible models are at certain levels of capability and generality and update based on this, increasing the size of one's updates as systems get more similar to AGI, and I think the time to start doing this was years ago. AlphaGo was slightly bad news. GPT2 was slightly good news.
    • If you haven't started updating yet, when will you start? The updates should be small if you have a highly confident model of what future capabilities require dangerous styles of thinking, but I don't think such confidence is justified.

I agree with the spirit of the post but not the kinda clickbaity title. I think a lot of people are over updating on single forward pass behavior of current LLMs. However, I think it is still possible to get evidence using current models with careful experiment design and being careful with what kinds of conclusions to draw.

Thanks for writing this and engaging in the comments. "Humans/humanity offer the only real GI data, so far" is a basic piece of my worldview and it's nice to have a reference post explaining something like that.

The novel views are concerned with the systems generated by any process broadly encompassed by the current ML training paradigm.

Omnicide-wise, arbitrarily-big LLMs should be totally safe.

This is an optimistic take. If we could be rightfully confident that our random search through mindspace with modern ML methods can never produce "scary agents", a lot of our concerns would go away. I don't think that it's remotely the case.

The issue is that this upper bound on risk is also an upper bound on capability. LLMs, and other similar AIs, are not going to do anything really interesting.

Strong disagree. We have only started tapping into the power of LLMs. We've made a machine capable to produce one thought at a time. It can already write decent essays - which is already a superhuman ability. Because humans require multiple thoughts, organized into a thinking process for that.

Imagine what happens when AutoGPT stops being a toy and people start pouring billions of dollars into propper scaffoldings and specialized LLMs, that could be organized in a cognitive architecture in a similar reference class as humans. Then you will have your planning and consequentialist reasoning. And for these kind of systems, transparency and alignability off LLMs is going to be extremely relevant.

If we could be rightfully confident that our random search through mindspace with modern ML methods

I understand this to connote "ML is ~uninformatively-randomly-over-mindspace sampling 'minds' with certain properties (like low loss on training)." If so—this is not how ML works, not even in an approximate sense. If this is genuinely your view, it might be helpful to first ponder why statistical learning theory mispredicted that overparameterized networks can't generalize.