Steve Byrnes

I'm an AGI safety / AI alignment researcher in Boston with a particular focus on brain algorithms. Twitter: @steve47285. Physicist by training.



Thank you! I’ve been using the terms “inference algorithm” versus “learning algorithm” to talk about that kind of thing. What you said seems fine too, AFAIK.

I think that grader-optimization is likely to fail catastrophically when the grader is (some combination of):

  • more like “built / specified directly and exogenously by humans or other simple processes”, less like e.g. “a more and more complicated grader getting gradually built up through some learning process as the space-of-possible-plans gets gradually larger”
  • more like “looking at the eventual consequences of the plan”, less like “assessing plans for deontology and other properties” (related post) (e.g. “That plan seems to pattern-match to basilisk stuff” could be a strike against a plan, but that evaluation is not based solely on the plan’s consequences.)
  • more like “looking through tons of wildly-out-of-the-box plans”, less like “looking through a white-list of a small number of in-the-box plans”

Maybe we agree so far?

But I feel like this post is trying to go beyond that and say something broader, and I think that’s where I get off the boat.

I claim that maybe there’s a map-territory confusion going on. In particular, here are two possible situations:

  • (A) Part of the AGI algorithm involves listing out multiple plans, and another part of the algorithm involves a “grader” that grades the plans.
  • (B) Same as (A), but also assume that the high-scoring plans involve a world-model (“map”), and somewhere on that map is an explicit (metacognitive / reflective) representation of the “grader” itself, and the (represented) grader’s (represented) grade outputs (within the map) are identical to (or at least close to) the actual grader’s actual grades within the territory.
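A toy sketch may make the (A)/(B) contrast concrete. Everything below is a hypothetical illustration of my own (names, the “diamond” stand-in goal, the wireheading stand-in), not anything from the OP:

```python
# Toy sketch of (A) vs (B). All names and scoring rules are made up.

def grader(plan):
    # (A)'s grader: some module that scores candidate plans.
    # Stand-in rule: more diamond-acquiring steps = higher grade.
    return plan.count("acquire diamond")

def choose_plan(candidate_plans, grade):
    # (A): list out multiple plans, pick the one the grader scores highest.
    return max(candidate_plans, key=grade)

def represented_grade(plan):
    # (B) adds a claim about plan *content*: the winning plan's world-model
    # contains a representation of the grader itself, and the plan targets
    # that represented grade. Crude stand-in: a plan that inflates the
    # grader's (represented) output rather than acquiring diamonds.
    return 100 if "wirehead the grader" in plan else grader(plan)

plans = [
    ("acquire diamond",),
    ("acquire diamond", "acquire diamond"),
    ("wirehead the grader",),
]

best_under_A = choose_plan(plans[:2], grader)         # the 2-diamond plan
best_under_B = choose_plan(plans, represented_grade)  # the wireheading plan
```

The point of the sketch: both selection loops are instances of (A); only the second additionally has the (B) property that the plan’s target lives at the map’s representation of the grader.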

I feel like OP equivocates between these. When it’s talking about algorithms it seems to be (A), but when it’s talking about value-child and appendix C and so on, it seems to be (B).

In the case of people, I want to say that the “grader” is roughly “valence” / “the feeling that this is a good idea”.

I claim that (A), properly understood, should seem/feel almost tautological—like, it should be impossible to introspectively imagine (A) being false! It’s kinda the claim “People will do things that they feel motivated to do”, or something like that. By contrast, (B) is not tautological, or even true in general—it describes hedonists: “The person is thinking about how to get very positive valence on their own thoughts, and they’re doing whatever will lead to that”.

I think this is related to Rohin’s comment (“An AI system with a "direct (object-level) goal" is better than one with "indirect goals"”)—the AGI has a world-model / map, its “goals” are somewhere on the map (inevitably, I claim), and we can compare the option of “the goals are in the parts of the map that correspond to object-level reality (e.g. diamonds)”, versus “the goals are in the parts of the map that correspond to a little [self-reflective] portrayal of the AGI’s own evaluative module (or some other represented grader) outputting a high score”. That’s the distinction between (not-B) vs (B) respectively. But I think both options are equally (A).

(Sidenote: There are obvious reasons to think that (A) might lead to (B) in the context of powerful model-based RL algorithms. But I claim that this is not inevitable. I think OP would agree with that.)

Suppose most humans do X, where X increases empowerment. Three possibilities are:

  • (A) Most humans do X because they have an innate drive to do X; (e.g. having sex, breathing)
  • (B) Most humans do X because they have done X in the past and have learned from experience that doing X will eventually lead to good things (e.g. checking the weather forecast before going out)
  • (C) Most humans do X because they have indirectly figured out that doing X will eventually lead to good things—via either social / cultural learning, or via explicit means-end reasoning (e.g. avoiding prison, for people who have never been in prison)

I think Jacob & I both agree that there are things in all three categories, but we have disagreements where I want to put something into (A) and Jacob wants to put it into (B) or (C). Examples that came up in this post were “status-seeking / status-respecting behavior”, “fun”, and “enjoying river views”.

How do we figure it out? In general, 5 types of evidence that we can bring to bear are:

  • (1) Evidence from cases where we can rule out (C), e.g. sufficiently simple and/or young humans/animals. Then we can just see whether the animal is doing X more often than chance from the start, or whether it has to stumble upon X before it starts doing X more often than chance.
    • Example: If you’re a baby mouse who has never seen a bird (or bird-like projectile etc.) in your life, you have no rational basis for thinking that birds are dangerous. Nevertheless, lab experiments show that baby mice will run away from incoming birds, reliably, the first time. (Ref) So that has to be (A).
  • (2) Evidence from sufficiently distant consequences that we can rule out (B).
    • Example: Many animals will play-fight as children. This has a benefit (presumably) of eventually making the animals better at actual fighting as adults. But the animal can’t learn about that benefit via trial-and-error—the benefit won’t happen until perhaps years in the future. 
  • (3) Evidence from heritability—If doing X is heritable, I think an (A)-type explanation would make that fact very easy to explain—in fact, an (A)-type explanation for X would pretty much demand that doing X has nonzero heritability. Conversely, if doing X is heritable (in a way that’s not explained by heritability of “general intelligence” type stuff), well I don’t think (B) or (C) is immediately ruled out, but we do need to think about it and try to come up with a story of how that could work.
  • (4) Evidence from edge-cases where X is not actually empowering—Suppose doing X is usually empowering, but not always. If people do a lot of X even in edge-cases where it’s not empowering, I consider that strong evidence for (A) over (B) & (C). It’s not indisputable evidence though, because maybe you could argue that people are able to learn the simple pattern “X tends to be empowering”, but unable to learn the more complicated pattern “X tends to be empowering with the following exceptions…”. But still, I think it’s strong evidence.
    • Example: Humans can feel envy or anger or vengeance towards fictional characters, inanimate objects, etc. 
  • (5) Evidence from specific involuntary reactions, hypothalamus / brainstem involvement, etc.—For example, things that have specific universal facial expressions or sympathetic nervous system correlates, or behavior that can be reliably elicited by a particular neuropeptide injection (AgRP makes you hungry), etc., are probably (A).

A couple specific cases:

Status—I’m not sure whether Jacob is suggesting that human social status related behaviors are explained by (B) or (C) or both. But anyway I think 1,2,3,4 all push towards an (A)-type explanation for human social status behaviors. I think I would especially start with 3 (heritability)—if having high social status is generally useful for achieving a wide variety of goals, and that were the entire explanation for why people care about it, then it wouldn’t really make sense that some people care much more about status than others do, particularly in a way that (I’m pretty sure) statistically depends on their genes (including their sex) but which doesn’t much depend on their family environment (at least within a country), and which (I’m pretty sure) doesn’t particularly correlate with intelligence etc.

(As for 5, I’m not aware of e.g. some part of the hypothalamus or brainstem where stimulating it makes people feel high-status, but pretty please tell me if anyone has seen anything like that! I would be eternally grateful!)

Fun—Jacob writes “Fun is also probably an emergent consequence of value-of-information and optionality” which I take to be a claim that “fun” is (B) or (C), not (A). But I think it’s (A). I think 5 is strong evidence that fun involves (A). For one thing, decorticate rats will still do the activities we associate with “fun”, e.g. playing with each other (ref). For another thing, there’s a specific innate involuntary behavior / facial expression associated with “fun” (i.e. laughing in humans, and analogs of laughing in other animals), which again seems to imply (A). I claim that 1, 2, 3, and 4 above offer additional evidence for an (A)-type explanation of fun / play behavior, without getting into details.

I agree—I think “Optimizing for the output of a grader which evaluates plans” is more-or-less how human brains choose plans, and I don’t think it’s feasible to make an AGI that doesn’t do that.

But it sounds like this will be the topic of Alex’s next essay.

So I’m expecting to criticize Alex’s next essay by commenting on it along the lines of: “You think you just wrote an essay about something which is totally different from “Optimizing for the output of a grader which evaluates plans”, but I disagree; the thing you’re describing in this essay is in that category too.” But that’s just a guess; I will let Alex write the essay before I criticize it. :-P

Is there a reason you used the term “grader” instead of the AFAICT-more-traditional term “critic”? No big deal, I’m just curious.

I think in Eliezer’s model, which I agree with, if the first part happens, then by year Y+2, the world is a utopia of uploaded minds on a Dyson sphere or something.

I think this model is under-weighting possibilities like:

  • The people who make the under-control AGI tell it not to FOOM—because they’re trying to be careful and keep it under close supervision and FOOM would break their interpretability tools and FOOM would push the AGI way out of distribution etc.
  • The people who make the under-control AGI tell it to follow human laws, norms, etc., which would include things like “not setting up global surveillance infrastructure”, “not preventing random people across the world from using their own data centers to make their own AGIs”, “not doing experimental studies of mind-uploading tech without FDA approval”, etc.
  • The people who make the under-control AGI tell it to focus its mental energies exclusively on doing original cancer research.

I put a high probability on one of those happening (conditional on technical success in making “under-control AGI”), because those all seem like things that normal people would do, following their normal-people scripts.

But yes, if the Section 3.5.2 thing happens, that is an a priori plausible path to a great future, it seems to me. No disagreement there. My point is that the Section 3.5.2 thing with a happy conclusion is unlikely to happen. Nobody seems to think it’s a good idea to even try for the Section 3.5.2 path, AFAICT—e.g. Eliezer and Critch and Paul Christiano are all apparently against Section 3.5.2 (and for very different reasons!), and meanwhile normal people outside the x-risk bubble would (I imagine) be very opposed as well, if the possibility even occurred to them in the first place, cf. the bullet points above.

So, I seem to find myself as one of the leading advocates of the Section 3.5.2 plan right now (and even I am feeling pretty halfhearted about that!), probably because I am combining the Eliezer assumption that balance-of-power is not going to work in a post-AGI world, with substantially more optimism than Eliezer on getting AGI motivations close enough to CEV on the first try past the point of no return. (“More optimism than Eliezer” is obviously not a strong statement :-P But I’m at least at double-digit percentage success probability, I think, conditional on continued alignment research progress for the next decade, say.)

The discussion of "pivotal acts" …

I agree with what you wrote, see Section 3.5.1, specifically the paragraph starting “A nice thing about this category is that it puts minimal demands on AGI alignment…”

One thing is, I think you’re sorta assuming that the AI is omniscient, aligned, and completely trusted by the human. With those assumptions, I would hope that the person just lets the AI loose onto the internet to usher in utopia! (I.e. Section 3.5.2)

Rather than omniscience, I’m assuming that we’re coming in at a stage where the early AIs are insightful, systematic, fast-thinking, patient, etc., maybe moreso than humans along these dimensions, plus they have the ability to spin off copies and so on. But they still need to figure things out, their plans may have flaws (especially given a relative lack of real-world experience), and they can’t magic up solutions to every problem. I claim that already at this stage, the AI can probably start a deadly pandemic if it wanted to. (Or ten deadly pandemics at once.) But at the same stage, if the employee asks the AI “What do I do now? We need to deal with the out-of-control-AI problem, right? Any ideas?” then it might not have any, or at least any that the employee would endorse. (E.g. maybe the only plans likely to succeed that it can think of are very illegal.)

Maybe you’ll say “The AI will convince the person to do aggressive illegal actions rather than twiddle their thumbs until the apocalypse.” I’m open to that, but it entails rejecting corrigibility, right? So really, this is Section 3.5.2 territory. If we’re talking about an AGI that’s willing and able to convince its (so-called) supervisor to do actions that the (so-called) supervisor initially doesn’t want to do, because the AGI thinks they’re in the (so-called) supervisor’s long-term best interest, then we are NOT talking about a corrigible AGI under human control, rather we’re talking about a non-corrigible, out-of-control AGI. So we better hope that it’s a friendly out-of-control AGI!!

you expect AI control to outpace AI alignment

I think it’s more like, I’m trying to question the (I think) common belief that there’s a path to a good future involving things like “corrigibility” and “act-based agents” and “narrow (not ambitious) value learning”, and NOT involving things like Sections 3.5.1–2. If you never held that belief in the first place, then this post isn’t particularly addressed at you.

If the example people simply aren't thinking about large changes to the world (including their daily lives) as a thing they might want to make happen, maybe we should spread the message that the future might contain large, positive changes, and that it's totally natural for people to work to make them happen.

I feel like that’s the positive framing, whereas the negative framing is “people can irreversibly and radically and unilaterally change the world, for both better and worse, by making an AGI that’s not under tight human control—so let’s spread the message that they shouldn’t do that!” For example Critch’s post here. I’m not opposed to the Section 3.5.2 thing, I just want to be explicit about what it entails, and not sugarcoat it.

Thanks for your comment!

Again, I think you’re imagining that an AGI is going to take over, and the question is whether the AGI that takes over will have good or bad motives from a human perspective. I see that vision as entirely plausible—the hopeful case is my Section 3.5.2, and the bad scenario is x-risk.

(Whether this “bad scenario” involves immediate deaths of humans, versus the AGI keeping humans around, at least for a while, to help out with projects that advance the AGI’s own goals, is not a question where I really care what the answer is!)

So this post is not really arguing against your vision. Instead it’s arguing against (or at least questioning) a different vision, where no AGI takes over, and instead humans remain perpetually in control of docile helpful AGIs, in a multipolar world with similar power dynamics as today. …Or something like that.

See Section 3.3.3 for why I think a misaligned power-seeking AGI might want nuclear war, deadly pandemics, crop diseases, and other fun things like that. If I’m an AGI, humans can help me get things done, but humans can also potentially shut me down, and more importantly humans can also potentially create a different AGI with different and conflicting goals from mine, and equal capabilities.

No smart AI would risk nuclear war, as it would set their plans back by decades, or perhaps longer.

Decades? Sure. But we don’t know what the AGI’s “discount rate” will be (if that notion is even well-defined).

If you tell a human: Behind Door Number 1 is a box that will almost definitely solve world peace, climate change, and all the world’s diseases. But the box is sealed shut and won’t open for 35 years. Behind Door Number 2 is a button that might solve those same problems in just 3 years. But probably not. More likely it will create a black hole that will swallow the Earth.

I think the human would take door number 1. I think the AGI would plausibly make an analogous decision. Or if not that AGI, then the next AGI in line.
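With exponential discounting, that intuition can be made quantitative. All numbers below are invented for illustration (and the black hole’s disvalue is set to zero just to keep the comparison minimal):

```python
# Illustrative sketch of how the discount rate decides the door choice.
# Door 1: 95% chance of utopia-value U, delivered after 35 years.
# Door 2: 10% chance of U after only 3 years (downside ignored here).

def discounted_value(p, utility, years, gamma):
    # Exponentially discounted expected value: p * U * gamma^t
    return p * utility * gamma ** years

U = 100.0
door1 = lambda gamma: discounted_value(0.95, U, 35, gamma)
door2 = lambda gamma: discounted_value(0.10, U, 3, gamma)

patient, myopic = 0.99, 0.80
door1_wins_for_patient = door1(patient) > door2(patient)  # waits for the box
door2_wins_for_myopic = door2(myopic) > door1(myopic)     # grabs the button
```

So a patient agent (gamma near 1) takes the near-certain 35-year box, while heavy discounting flips the choice toward the risky 3-year button—which is why the AGI’s (effective) discount rate matters here.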

By the way, if we both agree that the misaligned AGI can gain control of Earth, then it doesn’t much matter whether the nuclear war scenario is likely or not, right? (If the AGI keeps human slaves around for a few decades until it invents next-gen robots, then kills the humans, versus killing the humans immediately, who cares?) Or conversely, if the AGI can’t gain control of Earth through any method besides destructive ones involving things like nuclear wars and pandemics, then we can’t also say that there’s no harm in keeping humans around from the AGI’s perspective.

Great post!

A. Contra “superhuman AI systems will be ‘goal-directed’”

I somewhat agree, see Consequentialism & Corrigibility. I’m a bit unclear on whether this is intended as an argument for “AGI almost definitely won’t have a zealous drive to control the universe” versus “AGI won’t necessarily have a zealous drive to control the universe”. I agree with the latter but not the former.

Also, the more different groups make AGIs, the more likely it is that someone will make one with a “zealous drive to control the universe”. Then we have to think about whether the non-zealous ones will have solved the problem posed by the zealous ones. In this context, there starts to be a contradiction between “we don’t need to worry about the non-zealous ones because they won’t be doing hardcore long-term consequentialist planning” versus “we don’t need to worry about the zealous ones because the non-zealous ones are so powerful and foresightful that, whatever plan the zealous ones might come up with, the non-zealous ones can preemptively think of it and defend against it”. More on this topic in a forthcoming post hopefully in the next couple weeks. (EDIT—I added the link)

B. Contra “goal-directed AI systems’ goals will be bad”

I somewhat agree, see Section 14.6 here. Comments above also apply here, e.g. it’s not obvious that docile helpful human-norm-following AGIs will actually do what’s necessary to defend against zealous universe-controlling AGIs, again wait for my forthcoming post.

Contra “superhuman AI would be sufficiently superior to humans to overpower humanity”

I mostly see these comments as arguments that “AI that can overpower humanity” might happen a bit later than one might otherwise expect, rather than arguments that it’s not going to happen at all. For example, if collaborative groups of humans are more successful than individual humans, well, sooner or later we’re going to have collaborative groups of AIs too. By the time we have a whole society of trillions of AIs, it stops feeling very reassuring. (The ability of AIs to self-replicate seems particularly relevant here.) If humans-using-tools are powerful, well sooner or later (I would argue sooner) AIs are going to be using tools too. (And inventing new tools.) The trust issue stops applying when we get to a world where AIs can start their own companies etc., and thus only need to trust each other (and the “each other” might be copies of themselves). The headroom argument seems adjacent to the lump-of-labor fallacy.

Hmm, OK, I guess the real point of all that is to argue for slow takeoff which then implies that doom is unlikely? (“at some point AI systems would account for most of the cognitive labor in the world. But if there is first an extended period of more minimal advanced AI presence, that would probably prevent an immediate death outcome, and improve humanity’s prospects for controlling a slow-moving AI power grab.”) Again, I’m not quite sure what we’re arguing. I think there’s still serious x-risk regardless of slow vs fast takeoff, and I think there’s still “less than certain doom” regardless of slow vs fast takeoff. In fact, I’m not even confident that x-risk is lower under slow takeoff than fast.

Well anyway, I have an object-level belief that there are already way more than enough GPUs on the planet to support AIs that can overpower humanity—see here—and I think that will be much more true by the time we have real-deal AGIs (which I for one expect to be probably after 2030 at least). I agree that this is a relevant empirical question though.

The idea that a superhuman AI would be able to rapidly destroy the world seems prima facie unlikely, since no other entity has ever done that.

I think there’s pretty good direct reason to believe that it is currently possible to start lots of simultaneous deadly pandemics and crop diseases etc., with an amount of competence already available to small teams of humans or maybe even individual humans. But we don’t currently have ongoing deliberate pandemics. I consider this pretty strong evidence that nobody on Earth with even moderate competence is trying to “destroy the world”, so to speak. So the fact that nobody has succeeded at doing so doesn’t really provide much evidence about the tractability of doing that. (Again, more on this topic in a forthcoming post.)
