Steve Byrnes

Working on AGI safety via a deep-dive into brain algorithms.


My AGI Threat Model: Misaligned Model-Based RL Agent

Strong agree. This is another way that it's a hard problem.

Against evolution as an analogy for how humans will create AGI

Thanks for cross-posting this! Sorry I didn't get around to responding originally. :-)

E.g. the thing RL currently does, which I don't expect the inner algorithm to be able to do, is make the first three layers of the network vision layers, and then a big region over on the other side the language submodule, and so on. And eventually I expect RL to shape the way the inner algorithm does weight updates, via meta-learning.

For what it's worth, I figure that the neocortex has some number (dozens to hundreds, maybe 180 like your link says, I dunno) of subregions that do a task vaguely like "predict data X from context Y", with different X & Y & hyperparameters in different subregions. So some design work is obviously required to make those connections. (Some taste of what that might look like in more detail is maybe Randall O'Reilly's vision-learning model.) I figure this is vaguely analogous to figuring out what convolution kernel sizes and strides you need in a ConvNet, and that specifying all this is maybe hundreds or low thousands but not millions of bits of information. (I don't really know right now, I'm just guessing.) Where will those bits of information come from? I figure, some combination of:

  • automated neural architecture search
  • and/or people looking at the neuroanatomy literature and trying to copy ideas
  • and/or when the working principles of the algorithm are better understood, maybe people can just guess what architectures are reasonable, just like somebody invented U-Nets by presumably just sitting and thinking about what's a reasonable architecture for image segmentation, followed by some trial-and-error tweaking.
  • and/or some kind of dynamic architecture that searches for learnable relationships and makes those connections on the fly … I imagine a computer would be able to do that to a much greater extent than a brain (where signals travel slowly, new long-range high-bandwidth connections are expensive, etc.)
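If it helps, here's a toy Python version of that bits estimate. Every number in it is a made-up placeholder for illustration (apart from the 180-region figure mentioned above), not a claim about actual neuroanatomy:

```python
import math

# Toy estimate: how many bits to specify the inter-region wiring plus a few
# hyperparameters per region? All numbers are illustrative assumptions.
n_regions = 180          # subregions (the figure from the linked paper)
inputs_per_region = 2    # "predict X from context Y": two incoming streams
hypers_per_region = 4    # e.g. learning rate, sparsity, kernel size, stride
bits_per_hyper = 6       # each hyperparameter quantized to 64 levels

bits_wiring = n_regions * inputs_per_region * math.log2(n_regions)  # who connects to whom
bits_hypers = n_regions * hypers_per_region * bits_per_hyper
total_bits = bits_wiring + bits_hypers
print(round(total_bits))  # thousands of bits, nowhere near millions
```

The exact total obviously depends on the made-up inputs, but it's hard to make it come out to millions of bits without wildly more regions or hyperparameters.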

If I understand your comment correctly, we might actually agree on the plausibility of the brute force "automated neural architecture search" / meta-learning case. …Except for the terminology! I'm not calling it "evolution analogy" because the final learning algorithm is mainly (in terms of information content) human-designed and by-and-large human-legible. Like, maybe humans won't have a great story for why the learning rate is 1.85 in region 72 but only 1.24 in region 13...But they'll have the main story of the mechanics of the algorithm and why it learns things. (You can correct me if I'm wrong.)

My AGI Threat Model: Misaligned Model-Based RL Agent

Hmm, I dunno, I haven't thought it through very carefully. But I guess an AGI might require a supercomputer's worth of resources, and maybe there are only so many hackable supercomputers of the right type, and the AI only knows one exploit and leaves traces of its hacking that computer security people can follow, and meanwhile self-improvement is hard and slow (for example, in the first version you need to train for two straight years, and in the second self-improved version you "only" need to re-train for 18 months). If the AI can run on a botnet then there are more options, but maybe it can't deal with latency / packet loss / etc., maybe it doesn't know a good exploit, maybe security researchers find and take down the botnet C&C infrastructure, etc. Obviously this wouldn't happen with a radically superhuman AGI but that's not what we're talking about.

But from my perspective, this isn't a decision-relevant argument. Either we're doomed in my scenario or we're even more doomed in yours. We still need to do the same research in advance.

Lying and cheating and power-seeking behaviour are only a good idea if you can get away with them. If you can't break out of the lab, you probably can't get away with much incorrigible behaviour.

Well, we can be concerned about non-corrigible systems that act deceptively (cf. "treacherous turn"). And systems that have close-but-not-quite-right goals such that they're trying to do the right thing in test environments, but their goals veer away from humans' in other environments, I guess.

My AGI Threat Model: Misaligned Model-Based RL Agent

Oh sorry, I misread what you wrote. Sure, maybe, I dunno. I just edited the article to say "some number of years".

I never meant to make a claim "20 years is definitely in the realm of possibility" but rather to make a claim "even if it takes 20 years, that's still not necessarily enough to declare that we're all good".

My AGI Threat Model: Misaligned Model-Based RL Agent


For homogeneity, I guess I was mainly thinking that in the era of not-knowing-how-to-align-an-AGI, people would tend to try lots of different new things, because nothing so far has worked. I agree that once there's an aligned AGI, it's likely to get copied, and if new better AGIs are trained, people may be inclined to try to keep the procedure as close as possible to what's worked before.

I hadn't thought about whether different AGIs with different goals are likely to compromise vs fight. There's Wei Dai's argument that compromise is very easy with AGIs because they can "merge their utility functions". But at least this kind of AGI doesn't have a utility function ... maybe there's a way to do something like that with multiple parallel value functions, but I'm not sure that would actually work. There are also old posts about AGIs checking each other's source code for sincerity, but can they actually understand what they're looking at? Transparency is hard. And how do they verify that there isn't a backup stashed somewhere else, ready to jump out at a later date and betray the agreement? Also, humans have social instincts that AGIs don't, which pushes in both directions I think. And humans are easier to kill / easier to credibly threaten. I dunno. I'm not inclined to have confidence in any direction.

I agree that if a sufficiently smart misaligned AGI is running on a nice supercomputer somewhere, it would have every reason to try to stay right there and pursue its goals within that institution, and it would also have reasons to try to escape and self-replicate elsewhere in the world. I guess we can be concerned about both. :-/

My AGI Threat Model: Misaligned Model-Based RL Agent

I haven't thought very much about takeoff speeds (if that wasn't obvious!). But I don't think it's true that nobody thinks it will take more than a decade... Like, I don't think Paul Christiano is the #1 slowest of all slow-takeoff advocates. Isn't Robin Hanson slower? I forget.

Then a different question is "Regardless of what other people think about takeoff speeds, what's the right answer, or at least what's plausible?" I don't know. A key part is: I'm hazy on when you "start the clock". People were playing with neural networks in the 1990s but we only got GPT-3 in 2020. What were people doing all that time?? Well mostly, people were ignoring neural networks entirely, but they were also figuring out how to put them on GPUs, and making frameworks like TensorFlow and PyTorch and making them progressively easier to use and scale and parallelize, and finding all the tricks like BatchNorm and Xavier initialization and Transformers, and making better teaching materials and MOOCs to spread awareness of how these things work, developing new and better chips tailored to these algorithms (and vice-versa), waiting on Moore's law, and on and on. I find it conceivable that we could get "glimmers of AGI" (in some relevant sense) in algorithms that have not yet jumped through all those hoops, so we're stuck with kinda toy examples for quite a while as we develop the infrastructure to scale these algorithms, the bag of tricks to make them run better, the MOOCs, the ASICs, and so on. But I dunno.

Or maybe I am misunderstanding what you mean by accidents?

Yeah, sorry, when I said "accidents" I meant "the humans did something by accident", not "the AI did something by accident".

Against evolution as an analogy for how humans will create AGI
  1. I think evolution is a good analogy for how inner alignment issues can arise.
  2. I don't think evolution is a good analogy for the process by which AGI is made (if you think that the analogy is that we literally use natural selection to improve AI systems).

Yes this post is about the process by which AGI is made, i.e. #2. (See "I want to be specific about what I’m arguing against here."...) I'm not sure what you mean by "literal natural selection", but FWIW I'm lumping together outer-loop optimization algorithms regardless of whether they're evolutionary or gradient descent or downhill-simplex or whatever.

Against evolution as an analogy for how humans will create AGI

Thanks for all those great references!

My current thinking is: (1) Outer-loop meta-learning is slow, (2) Therefore we shouldn't expect to get all that many bits of information out of it, (3) Therefore it's a great way to search for parameter settings in a parameterized family of algorithms, but not a great way to do "the bulk of the real design work", in the sense that programmers can look at the final artifact and say "Man, I have no idea what this algorithm is doing and why it's learning anything at all, let alone why it's learning things very effectively".

Like if I look at a trained ConvNet, it's telling me: Hey Steve, take your input pixels, multiply them by this specific giant matrix of numbers, then add this vector, blah blah, and OK now you have a vector, and if the first entry of the vector is much bigger than the other entries, then you've got a picture of a tench. I say "Yeah, that is a picture of a tench, but WTF just happened?" (Unless I'm Chris Olah.) That's what I think of when I think of the outer loop doing "the bulk of the real design work".

By contrast, when I look at Co-Reyes, I see a search for parameter settings (well, a tree of operations) within a parametrized family of primarily-human-designed algorithms—just what I expected. If I wanted to run the authors' best and final RL algorithm, I would start by writing probably many thousands of lines of human-written code, all of which come from human knowledge of how RL algorithms should generally work ("...the policy is obtained from the Q-value function using an ε-greedy strategy. The agent saves this stream of transitions ... into a replay buffer and continually updates the policy by minimizing a loss function...over these transitions with gradient descent..."). Then, to that big pile of code, I would add one important missing ingredient—the loss function L—containing at most 10^4 bits of information (if I calculated right). This ingredient is indeed designed by an automated search, but it doesn't have a lot of inscrutable complexity—the authors have no trouble writing down L and explaining intuitively why it's a sensible choice. Anyway, this is a very different kind of thing than the tench-discovery algorithm above.
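To illustrate the division of labor I'm pointing at, here's a minimal sketch (emphatically not the paper's actual code): the scaffolding—acting, exploring, the update loop—is human-written and fixed, and only the update rule is the slot that an outer-loop search would fill in:

```python
import random

def td_update(Q, s, a, r, s2, alpha=0.5, gamma=0.9):
    """One candidate the search space is built to contain: classic TD learning."""
    target = r + gamma * max(Q[(s2, b)] for b in (0, 1))
    Q[(s, a)] += alpha * (target - Q[(s, a)])

def run(update_rule, episodes=500, eps=0.1):
    """Human-written scaffolding: a one-step toy MDP where action 1 pays reward 1."""
    random.seed(0)
    Q = {(s, a): 0.0 for s in (0, 1) for a in (0, 1)}
    for _ in range(episodes):
        s = 0
        if random.random() < eps:
            a = random.choice((0, 1))                 # explore
        else:
            a = max((0, 1), key=lambda b: Q[(s, b)])  # exploit
        r = 1.0 if a == 1 else 0.0
        update_rule(Q, s, a, r, s2=1)                 # the searched-for ingredient
    return Q

Q = run(td_update)
assert Q[(0, 1)] > Q[(0, 0)]  # scaffolding + TD rule learns the good action
```

The point of the toy: you could swap `td_update` for any other candidate rule without touching the rest, and almost all of the lines (and the design knowledge) live in the fixed scaffolding.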

Did the Co-Reyes search "invent" TD learning? Well, they searched over a narrow parameterized family of algorithms that included TD learning in it, and one of their searches settled on TD learning as a good option. Consider how few algorithms that is, out of the space of all possible algorithms. Isn't it shocking that TD learning was even an option? No, it's not shocking, it's deliberate. The authors already knew that TD learning was good, and when they set up their search space, they made sure that TD learning would be part of it. ("Our search language...should be expressive enough to represent existing algorithms..."). I don't find anything about that surprising!

I feel like maybe I was projecting a mood of "Outer-loop searches aren't impressive or important". I don't think that! As far as I know, we might be just a few more outer-loop searches away from AGI! (I'm doubtful, but that's a different story. Anyway it's certainly possible.) And I did in fact write that I expect this kind of thing to be probably part of the path to AGI. It's all great stuff, and I didn't write this blog post because I wanted to belittle it. I wrote the blog post to respond to the idea I've heard that, for example, we could plausibly wind up with an AGI algorithm that's fundamentally based on reinforcement learning with tree search, but we humans are totally oblivious to the fact that the algorithm is based on reinforcement learning with tree search, because it's an opaque black box generating its own endogenous reward signals and doing RL off that, and we just have no idea about any of this. It takes an awful lot of bits to build that inscrutable a black box, and I don't think outer-loop meta-learning can feasibly provide that many bits of design complexity, so far as I know. (Again I'm not an expert and I'm open to learning.)

any grappling with the Bitter Lesson

I'm not exactly sure what you think I'm saying that's contrary to Bitter Lesson. My reading of "Bitter lesson" is that it's a bad idea to write code that describes the object-level complexity of the world, like "tires are black" or "the queen is a valuable chess piece", but rather we should write learning algorithms that learn the object-level complexity of the world from data. I don't read "Bitter Lesson" as saying that humans should stop trying to write learning algorithms. Every positive example in Bitter Lesson is a human-written learning algorithm.

Take something like "Attention Is All You Need" (2017). I think of it as a success story, exactly the kind of research that moves forward the field of AI. But it's an example of humans inventing a better learning algorithm. Do you think that "Attention Is All You Need" is not part of the path to AGI, but rather a step forward in the wrong direction? Is "Attention Is All You Need" the modern version of "yet another paper with a better handcrafted chess-position-evaluation algorithm"? If that's what you think, well, you can make that argument, but I don't think that argument is "The Bitter Lesson", at least not in any straightforward reading of "Bitter Lesson", AFAICT...

It would also be a pretty unusual view, right? Most people think that the invention of transformers is what AI progress looks like, right? (Not that there's anything wrong with unusual views, I'm just probing to make sure I correctly understand the ML consensus.)

Against evolution as an analogy for how humans will create AGI

Thanks again, this is really helpful.

I don't feel like humans meet this bar.

Hmm, imagine you get a job doing bicycle repair. After a while, you've learned a vocabulary of probably thousands of entities and affordances and interrelationships (the chain, one link on the chain, the way the chain moves, the feel of clicking the chain into place on the gear, what it looks like if a chain is loose, what it feels like to the rider when a chain is loose, if I touch the chain then my finger will be greasy, etc. etc.). All that information is stored in a highly-structured way in your brain (I think some souped-up version of a PGM, but let's not get into that), such that it can grow to hold a massive amount of information while remaining easily searchable and usable. The problem with working memory is not capacity per se, it's that it's not stored in this structured, easily-usable-and-searchable way. So the more information you put there, the more you start getting bogged down and missing things. Ditto with pen and paper, or a recurrent state, etc.

I find it helpful to think about our brain's understanding as lots of subroutines running in parallel. (Kaj calls these things "subagents", I more typically call them "generative models", Kurzweil calls them "patterns", Minsky calls this idea "society of mind", etc.) They all mostly just sit around doing nothing. But sometimes they recognize a scenario for which they have something to say, and then they jump in and say it. So in chess, there's a subroutine that says "If the board position has such-and-such characteristics, it's worthwhile to consider moving the pawn." The subroutine sits quietly for months until the board has that position, and then it jumps in and injects its idea. And of course, once you consider moving the pawn, that brings to mind a different board position, and then new subroutines will recognize them, jump in, and have their say, etc.

Or if you take an imperfect rule, like "Python code runs the same on Windows and Mac", the reason we can get by using this rule is because we have a whole ecosystem of subroutines on the lookout for exceptions to the rule. There's the main subroutine that says "Yes, Python code runs the same on Windows and Mac." But there's another subroutine that says "If you're sharing code between Windows and Mac, and there's a file path variable, then it's important to follow such-and-such best practices". And yet another subroutine is sitting around looking for the presence of system library calls in cross-platform code, etc. etc.
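A purely illustrative sketch of that "subroutines that jump in" picture (this is a cartoon of the functional idea, not a claim about brain implementation): a registry of (trigger, advice) pairs, most of which stay silent on any given input:

```python
# Each subroutine watches for scenarios it recognizes; otherwise it stays quiet.
subroutines = [
    (lambda ctx: "python" in ctx,
     "Python code runs the same on Windows and Mac"),
    (lambda ctx: "cross-platform" in ctx and "file path" in ctx,
     "use portable path handling, not hard-coded separators"),
    (lambda ctx: "cross-platform" in ctx and "system library call" in ctx,
     "guard platform-specific calls behind a platform check"),
]

def consult(ctx):
    # Every subroutine that recognizes the scenario jumps in with its say.
    return [advice for recognizes, advice in subroutines if recognizes(ctx)]

print(consult("python cross-platform code with a file path variable"))
# Both the main rule and the file-path caveat fire; the third stays silent.
```

The interesting part is the scaling behavior: you can keep adding thousands of caveat-subroutines without the main lookup getting confused, which is exactly what working memory or pen-and-paper can't do.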

That's what it looks like to have knowledge that is properly structured and searchable and usable. I think that's part of what the trained transformer layers are doing in GPT-3—checking whether any subroutines need to jump in and start doing their thing (or need to stop, or need to proceed to their next step (when they're time-sequenced)), based on the context of other subroutines that are currently active.

I think that GPT-3 as used today is more-or-less restricted to the subroutines that were used by people in the course of typing text within the GPT-3 training corpus. But if you, Rohin, think about your own personal knowledge of AI alignment, RL, etc. that you've built up over the years, you've created countless thousands of new little subroutines, interconnected with each other, which only exist in your brain. When you hear someone talking about utility functions, you have a subroutine that says "Every possible policy is consistent with some utility function!", and it's waiting to jump in if the person says something that contradicts that. And of course that subroutine is supported by hundreds of other little interconnected subroutines with various caveats and counterarguments and so on.

Anyway, what's the bar for an AI to be an AGI? I dunno, but one question is: "Is it competent enough to help with AI alignment research?" My strong hunch is that the AI wouldn't be all that helpful unless it's able to add new things to its own structured knowledge base, like new subroutines that say "We already tried that idea and it doesn't work", or "This idea almost works but is missing such-and-such ingredient", or "Such-and-such combination of ingredients would have this interesting property".

Hmm, well, actually, I guess it's very possible that GPT-3 is already a somewhat-helpful tool for generating / brainstorming ideas in AI alignment research. Maybe I would use it myself if I had access! I should have said "Is it competent enough to do AI alignment research". :-D

I agree that your "crux" is a crux, although I would say "effective" instead of "efficient". I think the inability to add new things to its own structured knowledge base is a limitation on what the AI can do, not just what it can do given a certain compute budget.

This feels analogous to "the AGI doesn't go and run on its own, it operates by changing values in RAM according to the assembly language interpreter hardwired into the CPU chip". Like, it's true, but it seems like it's operating at the wrong level of abstraction.

Hmm, the point of this post is to argue that we won't make AGI via a specific development path involving the following three ingredients, blah blah blah. Then there's a second step: "If so, then what? What does that imply about the resulting AGI?" I didn't talk about that; it's a different issue. In particular I am not making the argument that "the algorithm's cognition will basically be human-legible", and I don't believe that.

Against evolution as an analogy for how humans will create AGI


A lot of your comments are trying to relate this to GPT-3, I think. Maybe things will be clearer if I just directly describe how I think about GPT-3.

The evolution analogy (as I'm defining it) says that “The AGI” is identified as the inner algorithm, not the inner and outer algorithm working together. In other words, if I ask the AGI a question, I don’t need the outer algorithm to be running in the course of answering that question. Of course the GPT-3 trained model is already capable of answering "easy" questions, but I'm thinking here about "very hard" questions that need the serious construction of lots of new knowledge and ideas that build on each other. I don't think the GPT-3 trained model can do that by itself.

Now for GPT-3, the outer algorithm edits weights, and the inner algorithm edits activations. I am very impressed by the capability of the GPT-3 weights, edited by SGD, to store an open-ended world model of greater and greater complexity as you train it more and more. I am not so optimistic that the GPT-3 activations can do that, without somehow transferring information from activations to weights. And not just for the stupid reason that it has a finite context window. (For example, other transformer models have recurrency.)

Why don't I think that the GPT-3 trained model is just as capable of building out an open-ended world-model of ever greater complexity using activations not weights?

For one thing, it strikes me as a bit weird to think that there will be this centaur-like world model constructed out of X% weights and (100-X)% activations. And what if GPT comes to realize that one of its previous beliefs is actually wrong? Can the activations somehow act as if they're overwriting the weights? Just seems weird. How much information content can you put in the activations anyway? I don't know off the top of my head, but much less than the amount you can put in the weights.
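A quick back-of-envelope on that capacity question, using GPT-3's published architecture numbers (counting one residual-stream value per layer per token as "the activations" is my simplification; bits-per-value would shift the comparison a bit either way):

```python
# GPT-3's published architecture numbers:
params = 175e9                              # trained weights
d_model, n_layers, context = 12288, 96, 2048
activations = d_model * n_layers * context  # ~2.4e9 values per forward pass

print(f"{activations / params:.1%}")        # well under 2% of the weight count
```

So even on generous assumptions, the activations at any moment can hold only a percent or two as many values as the weights, which is why the weights seem like the only plausible home for an ever-growing world model.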

When I think of the AGI-hard part of "learning", I think of building a solid bedrock of knowledge and ideas, such that you can build new ideas on top of the old ideas, in an arbitrarily high tower. That's the part that I don't think the GPT-3 inner algorithm (the trained model) can do by itself. (The outer algorithm obviously does it.) Again, I think you would need to somehow transfer information from the activations to the weights, maybe by doing something vaguely like amplification, if you were to make a real-deal AGI from something like GPT-3.

My human brain analogy for GPT-3: One thing we humans do is build a giant interconnected predictive world-model by editing synapses over the course of our lifetimes. Another thing we do is flexibly combine the knowledge and ideas we already have, on the fly, to make sense of a new input, including using working memory and so on. Don't get me wrong, this is a really hard and impressive calculation, and it can do lots of things—I think it amounts to searching over this vast combinatorial space of compositional probabilistic generative models (see analysis-by-synthesis discussion here, or also here). But it does not involve editing synapses. It's different. You've never seen nor imagined a "banana hat" in your life, but if you saw one, you would immediately understand what it is, how to manipulate it, roughly how much it weighs, etc., simply by snapping together a bunch of your existing banana-related generative models with a bunch of your existing hat-related generative models into some composite which is self-consistent and maximally consistent with your visual inputs and experience. You can do all that and much more without editing synapses.

Anyway, my human brain analogy for GPT-3 is: I think the GPT-3 outer algorithm is more-or-less akin to editing synapses, and the GPT-3 inner algorithm is more-or-less akin to the brain's inference-time calculation (...but if humans had a more impressive working memory than we actually do).

The inference-time calculation is impressive but only goes so far. You can't learn linear algebra without editing synapses. There are just too many new concepts built on top of each other, and too many new connections to be learned.

If you were to turn GPT-3 into an AGI, the closest version consistent with my current expectations would be that someone took the GPT-3 trained model but somehow inserted some kind of online-learning mechanism to update the weights as it goes (again, maybe amplification or whatever). I'm willing to believe that something like that could happen, and it would not qualify as "evolution analogy" by my definition.

Even for humans it doesn't seem right to say that the brain's equivalent of backprop is the algorithm that "reads books and watches movies" etc, it seems like backprop created a black-box-ish capability of "learning from language" that we can then invoke to learn faster.

Learning algorithms always involve an interaction between the algorithm itself and what-has-been-learned-so-far, right? Even gradient descent takes a different step depending on the current state of the model-in-training. Again see the “Inner As AGI” criterion near the top for why this is different from the thing I’m arguing against. The "learning from language" black box here doesn't go off and run on its own; it learns new things by editing synapses according to the synapse-editing algorithm hardwired into the genome.
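The gradient descent point in one tiny runnable example (my own illustration, nothing domain-specific): the update rule is fixed, but the step it takes depends on the current state of the model-in-training:

```python
# Same fixed rule, but the step taken depends on the current parameters.
def step(w, grad_fn, lr=0.1):
    return w - lr * grad_fn(w)

grad = lambda w: 2 * w    # gradient of f(w) = w**2
w = 1.0
for _ in range(3):
    w = step(w, grad)
print(w)  # 0.512: each step's size depends on what was learned so far
```

Nobody would say that this loop's state-dependence makes the current value of w "the algorithm"; the same goes for the synapse-editing machinery and what it has learned so far.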
