The bridge to AGI control. Not quiiiiite ready for rush-hour traffic… Mind the gaps!!

(Update: Note that most of the things I wrote in this post are superseded (or at least explained better) in my later “Intro to Brain-Like-AGI Safety” post series.) 

I spend most of my time on relatively straightforward tasks—“straightforward” in the sense that I know how to proceed, and have some confidence that I can make progress, and making progress will probably be helpful. This is all very satisfying. But then I’m also trying to spend (at least) one day a week trying to “solve the whole AGI control problem”. (Specifically, the problem described in: My AGI Threat Model: Misaligned Model-Based RL Agent, which to be clear is still only one part of Safe & Beneficial AGI.)

So yeah, one day a week (at least), it’s big-picture end-to-end thinking, no excuses. My first couple attempts involved a lot of staring at a blank screen, muttering to myself "oh god oh god oh god, every idea is terrible, I'm in way over my head, we’re doomed…". But then I figured, it would be more useful to write up my current thoughts on the whole landscape of possible approaches, as an opportunity to clarify my thinking and get other people’s pushback. I'm hoping to repeat this exercise periodically.

(I’d love it if more people did this too FWIW.)

If I'm unjustifiably dismissing or omitting your favorite idea, try to talk me into it! Leave a comment or let’s chat.

Relatedly, if anyone feels an urge to trust my judgment (for reasons I can't fathom), this is an especially bad time to do so! I very much don’t want this brain-dump of a post to push people towards or away from what they think is a good direction. I haven’t put much thought into most parts of this. And what the heck do I know anyway?

Intended audience: People familiar with AGI safety literature. Lots of jargon and presumed background knowledge.

No claims to originality; many thanks to everyone whose ideas I’m stealing and butchering.

1. Corrigibility

(In the broad Paul sense of “the system is trying to do what the supervisor wants it to try to do”.)

1.0 Is corrigibility a worthwhile goal?

I think this is more-or-less uncontroversial, at least in the narrow sense of “knowing how to make a corrigible AGI would be a helpful thing to know how to do”. Well, Eliezer has argued (e.g. this comment) that corrigible motivation might not be stable for arbitrarily powerful self-reflective AGIs, or at least not straightforwardly. Then I imagine Paul responding, as here, that this is “a solution built to last (at most) until all contemporary thinking about AI has been thoroughly obsoleted…. I don’t think there is a strong case for thinking much further ahead than that.” Maybe these corrigible AGIs will not be all that powerful, or all that self-reflective and self-improving, but they’ll work well enough to help us solve the alignment problem? Or, being corrigible, we can tell them what we want their new goal to be? I mean, “install a motivation which is safe and stable for arbitrarily powerful self-reflective AGIs” is an awfully high bar. I don’t know of anyone trying to do that, at least not with a prayer of success. Like, as far as I can tell (from their limited public information), even Eliezer & his MIRI colleagues are more focused on task-limited AGIs (cf. the strawberry problem), and on laying theoretical groundwork, rather than on getting to CEV or whatever.

So anyway, overall, I’m on board with the idea that “installing a corrigible motivation system” is one thing that’s worth working on, while keeping potential problems in mind, and not to the exclusion of everything else.

1.1 Three versions of corrigible motivation

(As always assuming this type of AGI.) One of the selling points of corrigibility is that “being helpful” is a possible motivation for humans to have, so it should be a possible motivation for AGIs to have too. Well, how does it work in humans? I imagine three categories:

1.1.1 Reinforcement-ish corrigible motivation

What does that mean? For example, I find it pleasing to imagine helping someone, and then I do a good job, and then they shower me with praise and tell everyone what an awesome guy I am. I find it aversive to imagine deceiving or manipulating someone, and then (possibly) getting caught, and then they get really angry at me and tell everyone that I’m awful.

Discussion: This is obviously not the kind of corrigible motivation that we’re going for in our AGIs, because eventually the AGI might come up with a plan to deceive me or manipulate me which is so good that there’s no chance of getting caught and blamed, and then it’s perfectly happy to do that.

How would you install this motivation (if you wanted to)? Intuitively, this seems like the default thing you would get if you gave the programmer a remote-control reward button wired directly into the AGI’s motivation system. (Well, the remote-control scenario is more complicated than it sounds, I think, but at least this would probably be a big part of the motivation by default.)

In particular, this is what I expect from an approval-based system, unless we have sufficient transparency that we can approve “doing the right thing for the right reason” rather than merely “doing the right thing”. As discussed in a later section, I don’t know how to get that level of transparency. So I’m not spending any time thinking about approval signals; I wouldn’t know what to do with them.

1.1.2 Empathy-ish corrigible motivation

What does that mean? For example, I find it pleasing to imagine helping someone, and then they accomplish their goals and feel really good, and I love seeing them so happy. I find it aversive to imagine deceiving someone, and then they feel bad and sad, and gee, I hate seeing them like that.

Discussion: This kind of motivation seems less egregiously bad than the reinforcement one, but still seems to be not what we’re going for. The problem is that the AGI can sometimes make the programmer happier and more satisfied by deceiving and manipulating them (in such a way that they’ll never realize it). I mean, sure, there’s bound to be some murkiness in what is or isn’t problematic manipulation. But this AGI wouldn’t even be trying to not manipulate!

How would you install this motivation (if you wanted to)? You mean, install it reliably? Beats me. But a start might be making an AGI with a human-like system of social emotions—a built-in method of empathetic simulation that comes with various hooks tied to social emotions. See Section 5 below. I think that would be an unreliable method particularly because of the dehumanization problem (see here)—I think that people can interact with other people while deliberately avoiding activating their built-in empathetic simulation module; instead they just use their general cognitive capabilities to build a parallel human model from scratch. (I think that’s how some people with autism operate—long story.) Presumably an AGI could do the same trick.

1.1.3 Conceptual-ish corrigible motivation

What does that mean? For example, I find it pleasing to imagine helping someone, because, well, I like doing helpful things, and I like thinking of myself as a helpful guy. I find it aversive to imagine deceiving or manipulating someone, because, well, I don’t like being deceptive or manipulative, and I don’t like thinking of myself as the kind of guy who would be like that.

This seems more promising, right? Let’s look closer.

How would you install this motivation (if you wanted to)? The previous two attempts ran into trouble precisely because they were grounded in something specific, and thus fell prey to Goodhart’s law. Here we have the opposite problem—this motivation isn’t grounded in anything, or at least not yet.

One approach would be labeled examples—watch YouTube or read a bunch of descriptions, and this scenario is “doing the right thing”, and that scenario is “doing the wrong thing”, etc., repeat N times. Another approach would be leaning directly on human-labeled concepts—e.g. literally flag the concepts associated with the English word “helpful” as good and with “manipulative” as bad. The two approaches are more similar than they look—after all, a large share of our understanding of the words “helpful” and “manipulative” is generalization from labeled examples throughout our lives.

Both of these approaches, unlike the previous two, have the advantage that we can seemingly install the idea that “it’s bad to manipulate, even if you don’t get caught, and even if the person you’re manipulating is better-off as a result”.
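To make the second approach slightly more concrete, here is a toy sketch (my own illustration, not a real proposal). It assumes, hypothetically, that we can represent a candidate thought or plan as the set of world-model concepts it activates, and that a few human-labeled concepts have been flagged with valences:

```python
# Toy sketch of "conceptual" corrigible motivation: flag a few human-labeled
# concepts as good/bad, and score candidate thoughts/plans by which flagged
# concepts they activate. Everything here (the concept names, the idea that a
# thought can be summarized as a set of active concepts) is a simplifying
# assumption for illustration only.

# Valence assigned to human-labeled concepts in the world-model.
CONCEPT_VALENCE = {
    "helpful": +1.0,
    "honest": +1.0,
    "manipulative": -2.0,
    "deceptive": -2.0,
}

def motivation_score(active_concepts: set[str]) -> float:
    """Sum the valences of whichever flagged concepts this thought activates.

    Unflagged concepts contribute nothing; the motivation only "sees" the
    handful of concepts we explicitly labeled.
    """
    return sum(CONCEPT_VALENCE.get(c, 0.0) for c in active_concepts)

# Example: a plan whose internal representation activates "helpful" but also
# "deceptive" comes out net-negative, with no reference to whether anyone
# would ever catch on.
plan = {"helpful", "deceptive", "scheduling", "email"}
print(motivation_score(plan))  # -1.0
```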

However, I immediately hit two problems.

First problem: If it works, the AGI would probably wind up with somewhat-incoherent motivations that break down at edge-cases. I’m hoping that “conservatism”, as discussed in Section 2 below, can deal with this.

Second problem: (let's call it "the 1st-person problem") Getting non-1st-person training signals into the 1st-person. So if we’re using labeled examples, we don’t want “watching a YouTube video where Alice manipulates Bob” to be aversive, rather we want the self-reflective “I, the AGI, am manipulating Bob” to be aversive. Likewise, if we’re using English words, we don’t want the abstract concept of “helpfulness” to be appealing, rather we want the self-reflective “I, the AGI, am being helpful” to be appealing.

I’m currently stuck on the second problem. (Any ideas?) I think I’m just not quite clear enough about what self-reflection will look like in the predictive world-model. Like, I have general ideas, but they’re not sufficiently nailed down to, say, assess the feasibility of using interpretability tools to fiddle with the self-reflectivity of an AGI’s thoughts, in order to transform 3rd-person labeled examples into 1st-person value function updates. I believe that there’s some literature on self-reflection in the special case of human brain algorithms (this overlaps with meta-cognition and “consciousness studies”); I don’t know if I’ll get anything out of diving into that, but worth a shot. That’s on my to-do list. Conveniently, this is also (I believe) not entirely unrelated to reasoning about other people, which in turn is part of the implementation of innate social instincts, which are also on my to-do list for unrelated reasons, as discussed below.

2. “Conservatism” to relax pressure on alignment, and to help with goal stability

See Conservatism in Neocortex-like AGIs (which in turn was inspired by Stuart Armstrong’s Model Splintering).

Intuitive summary is: I figure we’ll wind up with an AGI that has a somewhat incoherent mix of motivations (as humans do), e.g. as discussed here, for various reasons including the intended goal system not mapping cleanly into the AGI’s conceptual space, changes in the AGI’s conceptual space upon learning and reflection (e.g. ontological crises), the programmer’s intentions / rewards being themselves somewhat incoherent, etc.

So then when the AGI considers possible plans, it will sometimes hit edge-cases where its different motivations pull in different directions. We design it to just not do those things (or at least, to pause execution while the programmer double-checks what’s going on, or something). Instead, it will find those plans unappealing, and it will keep searching for things to do that seem unambiguously good according to all its different intuitions.

So for example, when the AGI encounters the trolley problem, we want it to say to itself “I don’t know! I don’t like either of those options!”, and to keep brainstorming about how to safely stop the train and save everyone on both tracks. And if it finds no acceptable thing to do, we want it to just stand there and do nothing at all—which is hard-coded as an always-acceptable default. Or we program it to temporarily shut down and (somehow) dump out a request for human guidance.
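Here is a toy sketch of the plan-filtering rule I have in mind, with made-up scoring functions and thresholds; the only point is the "veto anything ambiguous, with do-nothing as an always-acceptable default" logic:

```python
# Toy sketch of "conservatism": a plan is acceptable only if every one of the
# AGI's (possibly incoherent) motivation components judges it to be at least
# neutral. Anything ambiguous gets vetoed, and "do nothing / ask the
# programmer" is hard-coded as an always-available fallback. The scoring
# interface and threshold are illustrative assumptions, not a concrete design.

DO_NOTHING = "pause and request human guidance"
APPROVAL_THRESHOLD = 0.0  # every motivation component must be at least neutral

def choose_plan(candidate_plans, motivation_components):
    """Pick the best plan that *all* motivation components endorse.

    `motivation_components` is a list of functions, each mapping a plan to a
    scalar score (higher = better according to that component).
    """
    acceptable = []
    for plan in candidate_plans:
        scores = [m(plan) for m in motivation_components]
        # Conservative rule: judge a plan by its *worst* score, not its sum,
        # so one strongly-objecting component can veto the whole plan.
        if min(scores) >= APPROVAL_THRESHOLD:
            acceptable.append((min(scores), plan))
    if not acceptable:
        return DO_NOTHING  # e.g. the trolley problem: keep brainstorming instead
    return max(acceptable, key=lambda pair: pair[0])[1]
```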

This would also hopefully prevent the thing where one component of its motivation system tries to undermine or remove a conflicting component of its motivation system (see here)—or as (my model of) Eliezer would describe it, it would hopefully “prevent the AGI from self-modifying to be more coherent”.

I’m hanging an awful lot of hope on this kind of “conservatism” (or something like it) working out, because it solves a bunch of problems for which I have no other ideas.

Would it work? And is there a way to know in advance? On the one hand, it's easy enough to imagine that, just as a single motivation has Goodhart's-law edge cases we don't like, an incoherent mix of 50 motivations has 50 sets of edge cases, and what if those edge cases all overlap somewhere?? On the other hand, I’m feeling optimistic that this won’t be a big problem, maybe partly because I'm imagining AGIs with human-like cognitive algorithms, which then (maybe) wind up with human-like concepts and human-like inductive biases. Also, killing all humans etc. would require a rather egregious misunderstanding of what we view as acceptable behavior!! But I could be wrong.

So anyway, I see this kind of conservatism as being very important to sort out. It's currently only a very sketchy intuitive proposal with many glaring gaps, so I want to fill those gaps, most immediately by developing a better understanding of the human brain motivation system (which I also want to do for other reasons anyway).

2.1 But aren’t capability-safety tradeoffs a no-go?

(Update: I later spun out this subsection into a separate post: Safety-capabilities tradeoff dials are inevitable in AGI.) 

Incidentally, one effect of this proposal is that, if successful, we will wind up designing an AGI with a dial, where one end of the dial says “more safe and less capable”, and the other end of the dial says “less safe and more capable”. This is obviously not a great situation, especially in a competitive world. But I don’t see any way around it. For my part, I want to just say: “Um, sorry everyone, we’re going to have these kinds of dials on our AGIs—in fact, probably several such dials. Hope the AGI strategy / governance folks can bail us out by coming up with a coordination mechanism!!” (Or I guess the other option is that one very-safety-conscious team will be way ahead of everyone else, and they’ll, ahem, take a pivotal action. This seems much worse for reasons here, albeit less bad than extinction, and I hope it doesn’t come to that.)

For example, a very general and seemingly-unavoidable capability-safety tradeoff is: an AGI can always be faster and more powerful and less safe if we remove humans from the loop—e.g. run the model without ever pausing to study and test it, and allow the AGI to execute plans that we humans do not understand or even cannot understand, and give the AGI unfiltered internet access and all the resources it asks for, etc. etc.

For what it’s worth, people do want their AGI to stay safe and under control! We’re not asking anyone to do anything wildly outside their self-interest. Now, I don’t think that that’s sufficient for people to set the dials to “more safe and less capable”—at least, not in a competitive, uncoordinated world—but I do think it’s something working in our favor.

By the way, I’m only arguing that we shouldn’t a priori universally rule out all safety-capability tradeoff dials. I’m not arguing that every dial is automatically fine. If there’s a dial where you need to throw away 99.999...% of the capabilities in order to get a microscopic morsel of safety, well then that’s not a practical path to safe transformative AGI. Is that the situation for this conservatism proposal? I don’t know enough to venture an answer, and until I do, this is a potential point of failure.

2.2 Other paths to goal stability

Umm, I dunno. Build a “Has-The-Correct-Goals-Meter” (or at least a “Corrigibility-Meter”) somehow, and disallow any changes to the system that make the meter go down? Cache old copies of the AGI and periodically reactivate them and give them veto power over important decisions? Hope and pray that the AGI figures out how to solve the goal stability problem itself, before its goals shift?

I dunno.

As I’ve mentioned, I’m not a believer in “corrigibility is a broad basin of attraction”, so I see this as a big concern.

3. Transparency / interpretability

3.0 What are we hoping for?

An AGI will presumably be capable of honestly communicating what it’s trying to do—at least to the extent that it’s possible for any two different intelligent agents to try to communicate in good faith. Unfortunately, an AGI will also presumably be capable of dishonestly communicating what it’s trying to do. I would just love to get enough transparency to tell these two things apart.

I have an awfully hard time imagining success in the AGI control problem that doesn’t pass through transparency (except maybe Section 5 below). Unfortunately, I haven’t seen or come up with any transparency proposal that gives me hope for an end-to-end success story—even a vague one.

Here are a couple tools that seem like they might help, or maybe not, I dunno.

3.1 Segregate different reward components into different value function components

See Multi-dimensional rewards for AGI interpretability and control. (After writing that, I was delighted to learn from the comment section that this is a thing people are already doing, and it really does work the way I was imagining.) Here it is in brief:

The AGI will be getting reward signals, which then flow into changes in the value function. We should expect reward functions to usually be a sum of multiple, meaningfully different, components—e.g.  "reward for following the command I issued yesterday" versus "reward for following the command I issued last week". We can flow these different components into different value function components, and then add them up into the actual value function where necessary (i.e., when the AGI algorithm needs the total value to decide what actions to take or what thoughts to think).

Continuing with the intuitions: In the human case, my (tentative) claim is that for every thought you think, and every action you take, you’re doing so because it has a higher value than whatever you could be doing or thinking instead, and those values ultimately flow from some reward calculated (either now or in the past) by your brainstem. The path from a brainstem reward to “doing a thing” can be horrifically windy and indirect, passing through many layers of analogizing, and instrumental subgoals, and credit assignment errors, and who knows what, but there is a path. Humans don’t have any mechanism for tracing this path back to its source. (Did my throwing out that candy wrapper ultimately derive from my lifetime history of innate social approval rewards? Or from my lifetime history of innate aesthetics rewards? Or what? Who knows!) But we can put such a mechanism in our AGIs. Then we can say “this action ultimately flowed (somehow) mostly from such-and-such reward stream”.
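As a concrete (heavily simplified) sketch of the bookkeeping: keep one value estimate per reward stream, update each from its own reward signal, and only sum them at the moment the algorithm needs a total value. The tabular TD-style update below is just a placeholder for whatever the real learning rule would be:

```python
# Toy sketch of multi-dimensional value functions for interpretability:
# one value-function component per reward stream, summed only when the agent
# needs a single total value. The tabular TD(0)-style update is a stand-in
# for whatever learning rule the real system would use.

from collections import defaultdict

class MultiStreamValueFunction:
    def __init__(self, reward_streams, lr=0.1, gamma=0.95):
        self.streams = list(reward_streams)   # e.g. ["command_yesterday", "command_last_week"]
        self.lr, self.gamma = lr, gamma
        # values[stream][state] = how much of the total value of `state`
        # ultimately traces back to that reward stream
        self.values = {s: defaultdict(float) for s in self.streams}

    def update(self, state, next_state, rewards):
        """`rewards` maps each stream name to that stream's reward this step."""
        for s in self.streams:
            td_error = (rewards[s]
                        + self.gamma * self.values[s][next_state]
                        - self.values[s][state])
            self.values[s][state] += self.lr * td_error

    def total_value(self, state):
        # The only place the components get added together: when the agent
        # needs one number to decide what actions to take or thoughts to think.
        return sum(self.values[s][state] for s in self.streams)

    def attribution(self, state):
        # Interpretability hook: which reward stream did this state's value
        # mostly flow from?
        return {s: self.values[s][state] for s in self.streams}
```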

Then what? I dunno. Again, I don’t have an end-to-end story here. I just wanted to mention this because it seems like the kind of thing that might be an ingredient in such a story.

3.2 AGIs steering AGIs

In principle, we could have a tower of two or more AGIs “steering” each other, with lower AGIs scrutinizing the cognition of the next AGI up, and sending rewards. Presumably the AGIs get more and more complex and powerful going up the tower, but gradually enough that each AGI is up to the task of steering the one above it.

It could of course also be a pyramid rather than a tower, with multiple dumber AGIs collaborating to steer a smarter AGI.

Problem #1: How exactly do the AGIs monitor each other? Beats me. This goes back to my not having a great story about interpretability. I’m not happy with the answer “The AGIs will figure it out”. They might or they might not.

Problem #2: What’s at the base of the tower? I talked above about approval-based training being problematic. I’ll talk about imitation in a later section, with the upshot being that if we can get imitation to work, then there are better and more straightforward ways to use such a system than building a tower of AGIs-steering-AGIs on top of it.

Problem #3: There might be gradual corruption of the human's intentions going up the tower—like a game of telephone.

All in all, I don’t currently see any story where AGIs-steering-AGIs is a good idea worth thinking about, at least not in the AGI development scenario I have in mind.

3.3 Other transparency directions

Like, I imagine sitting at my computer terminal with an AGI running on the server downstairs…

The world-model is full of entries that look like: “World-model entry #592378: If entry #98739 happens, it tends to be followed by entry #24567982”. Meanwhile the value function is full of entries that look like “Entry #5892748 has value +3.52, when it occurs in the context of entry #83246”. I throw up my hands. What am I supposed to do with this thing?

OK, I show the AGI the word “deception”. A few thousand interconnected entries in the world-model light up. (Some represent something like “noun”, some represent some low-level input processing, etc.) Hmm, maybe those entries are related to deception? Not necessarily, but maybe. Next I show my AGI a YouTube video of Alice deceiving Bob. Tens of thousands more entries in the world-model light up over the course of the video clip. Oh hey, there’s some overlap with the first group! Maybe that overlap is the AGI’s concept of “deception”?
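Written out as a toy script, that overlap heuristic is just set intersection over a hypothetical "which world-model entries lit up" interpretability hook (which I don't know how to build):

```python
# Toy sketch of the "overlap" heuristic: show the AGI several probe stimuli
# that (we hope) all involve deception, record which world-model entries
# activate for each, and guess that the common entries are the "deception"
# concept. `get_active_entries` is a hypothetical interpretability hook;
# everything here is illustrative.

def guess_concept_entries(probe_stimuli, get_active_entries):
    """Return the world-model entry IDs that activate for *every* probe."""
    activation_sets = [set(get_active_entries(stim)) for stim in probe_stimuli]
    return set.intersection(*activation_sets)

# Hypothetical usage:
# probes = [text("deception"), video("alice_deceives_bob.mp4"), text("Bob was tricked")]
# candidate_deception_entries = guess_concept_entries(probes, agi.get_active_entries)
#
# Caveat (as in the text): the overlap might instead pick up "noun-ness",
# low-level input processing, etc., and we'd have no easy way to tell.
```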

You get the idea. I don’t have any strong argument that this kind of thing wouldn’t work, but I likewise have no strong argument that it would work. And how would we even know?? I wish I had something more solid to grab onto than this kind of tinkering around.

One potentially fruitful direction I can think of is what I mentioned above: trying to get a more detailed understanding of what self-concept and other-concept are going to look like in the world-model, and seeing if there’s a way to get at that with interpretability tools—i.e., recognizing that a thought is part of “I will do X” versus “Bob is doing X”, even if I’m not sure what X is. I don’t have any particularly strong reason to think this is possible, or that it would really help solve the overall problem, but I dunno, worth a shot. Again, that's somewhere on my to-do list.

4. Clever reward functions

I mentioned above why I don’t like human approval as a reward. What are other possibilities?

4.1 Imitation

If we can make a system that imitates its human operator, that seems to be relatively safe (kinda, sorta, with lots of caveats)—see here for that argument. Two questions are: (1) Can we do that, and (2) so what if we do? We already have humans!

4.1.1 How to make a human-imitating AGI

(As always assuming the model-based RL AGI I have in mind. Note also that my presumption here is that the AGI will be running a learning-and-acting algorithm that bears some similarity to the human’s (within-lifetime) learning-and-acting algorithm.)

How do we design such a system? Two ways I know of:

  • Straightforward RL: Ask a lot of questions, and send a big positive reward when the AGI gives exactly the same answer as the human operator. Maybe also send smaller rewards when the answer is close, as judged by an NLP model or something. (A toy sketch of this reward appears right after this list.)
  • Gradient descent through the model: Here we put in a new kind of model-update step. Recall that the normal model-update step of my AGI model involves updating the world-model via predictive learning, updating the value function via something like TD learning based on the incoming rewards, and updating the planner / actor via something like increasing the probability of doing things that lead to positive RPE, or whatever. That’s the normal step. In addition to that, we add a second, different kind of update step: whenever the human outputs X, we say “the calculation should have output X”. Then we differentiate all the way through the last N seconds of AGI operation, and do a corresponding gradient update of all the weights in all three learned components (value function, world-model, actor/planner), in such a way as to make outputting X more likely in the future. To be clear, while the normal update step is kinda like how the human brain learns within a lifetime, this second type of update step is wildly different from anything that happens in biology. Not that there’s anything wrong with that.
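Here is a minimal sketch of the reward signal for the first (straightforward RL) option, with difflib standing in for "an NLP model or something" as the similarity judge, and arbitrary constants:

```python
# Toy sketch of the "straightforward RL" imitation reward: a big reward for
# exactly matching the human operator's answer, and a smaller partial reward
# for near-misses. difflib is a crude stand-in for "an NLP model or
# something"; the constants are arbitrary.

import difflib

EXACT_MATCH_REWARD = 10.0
PARTIAL_REWARD_SCALE = 1.0

def imitation_reward(agi_answer: str, human_answer: str) -> float:
    if agi_answer.strip() == human_answer.strip():
        return EXACT_MATCH_REWARD
    similarity = difflib.SequenceMatcher(None, agi_answer, human_answer).ratio()
    return PARTIAL_REWARD_SCALE * similarity  # somewhere in [0, 1]

print(imitation_reward("The capital of France is Paris.",
                       "The capital of France is Paris."))   # 10.0
print(imitation_reward("Paris, I think?",
                       "The capital of France is Paris."))   # partial credit, < 1
```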

Of the two, I suspect that the second one would work better, because it has a more direct and high-bandwidth flow of information from the operator’s answers into the value function.

Anyway, the ideal form of imitation—the form where we get those safety advantages—should involve the AGI reaching the same conclusions as humans for the same reasons. Otherwise the AGI won’t have the right inductive biases, or in other words, it will stop imitating as soon as you go off-distribution. (And the ability to go off-distribution is kinda the whole point of building an AGI in the first place! Creative thinking and problem-solving, right?)

While in theory any Turing-complete algorithm class can emulate any other algorithm, my feeling (see here) is that we won’t get human-brain-like inductive bias unless we build our AGIs with a human-brain-like algorithmic design. Conveniently, that’s the scenario I’m thinking about anyway.

And then since we’re especially interested in getting the right inductive biases around innate social instincts—which I believe are at the core of morality, norm-following, etc. (more discussion in Section 5)—we really need to code up our AGI with a scaffolding for human-like social instincts. Like, it’s fine if there are some adjustable parameters in there that we’re not sure about; we can just allow those parameters to get tweaked by those gradient updates. But we want to have the right general idea of how those social-instincts calculations are set up.

So this lands me at the idea that we should be working on reverse-engineering human social instincts—see Section 5.

And once we’re doing that anyway, do we really need the imitation step?? I’m not sure it adds much. Why not just set the social instincts to maximally pro-social (e.g. set jealousy to zero), and let 'er rip? That's the Section 5 approach.

Maybe. But the imitation step could well be useful, both for adjusting adjustable parameters that we can’t reverse-engineer (as mentioned), and for bailing us out in the case that it’s impossible to build human-like social instincts except by having a human body and growing up in a human community. Again see Section 5.

4.1.2. If we learn to imitate, then what?

Putting that aside, let’s say we really want to start with a model that imitates a specific human. Then what? We already have the human, right?

The “classic” answer to this question is one of many proposals centered around amplification & factored cognition, but I don’t like that, see Section 6.

Anyway, with the version of imitation above, there’s a better way. If we get an AGI that has a sufficiently high-fidelity clone of the cognition of a particular (trustworthy) human, including the social instinct circuits that underlie morality and norm-following, then since it’s an online-learning algorithm, we just do the usual thing: let it run and run, and it learns more and more, hopefully behaving the way that human would if they had more time to think and read and study and brainstorm.

And we can do even better, capabilities-wise. At least in the human neocortical algorithm, I’m reasonably confident that you can take a learned model and “increase the intelligence without corrupting the knowledge or motivation”, at least to some extent. This would involve things like increasing working memory, increasing storage capacity, etc.

4.2 Debate

For the particular version of AGI alignment & control I’m working on, giving rewards for winning a debate seems like it would not lead to a good and safe motivation system. Instead, it would probably lead to a motivation of more-or-less “trying to win the debate by any means necessary”, which includes hacking into the debate opponent, hacking into the judge, breaking out of the box, and so on.

I guess there’s some hope that the two debate opponents, being equally smart and capable, will keep each other in check. But this doesn’t strike me as a reliable mechanism. The kind of AGI I have in mind is frequently learning and coming up with new ideas as it operates, and thus it’s entirely possible for one debate opponent to come up with a brilliant new plan, and execute it, in a way that would surprise the other one. Also, the attack / defense balance in cybersecurity has always favored the attack side, as far as I can tell. Also, given the messiness I expect in installing motivations, it’s not necessarily the case that the two opposing AGIs will really have exactly opposite motivations—for example, maybe they both wind up wanting to “get the Win-The-Debate signal”, rather than wanting to win the debate. Then they can collaborate on hacking into the judge and wireheading.

So debate is feeling less promising than other things, and I’m currently not spending time thinking about it. (See also factored cognition in Section 6.)

5. AGIs with human-like social instincts

I already mentioned this above (Section 4.1), but it’s worth reiterating because it’s kinda a whole self-contained potential path to success.

As in Section 4.1, if we want our AGIs to have human-like moral and social intuitions, including in weird out-of-distribution hypotheticals, I think the most viable and likely-to-succeed path is to understand the algorithms in the human brain that give rise to social instincts, and put similar algorithms into our AGIs. Then we get the right inductive bias for free. As mentioned earlier, we would probably not want to blindly copy every aspect of human social instincts; instead we would take them as a starting point, and then turn off jealousy and so on. There’s some risk that, say, it’s impossible to turn off jealousy without messing everything else up, but my hunch is that it’s modular enough to be able to fiddle with it.

What do these algorithms look like? I’ve done some casual speculation here but very little research so far.

I guess I’m slightly concerned about the tractability of figuring out the answer, and much more concerned about having no feedback loop that says that the seemingly-correct algorithms are actually right, in advance of having an AGI in front of us to test them on. But I don’t know. Maybe it’s OK. We do, after all, have a wealth of constraints from psychology and neuroscience that the correct algorithm will have to satisfy. And psychologists and neuroscientists can always do more experiments if we have specific good ideas.

Another thing is: the social instinct algorithms aren’t enough by themselves. Remember, the brain is chock-full of learning algorithms. So you can build an AGI with the same underlying algorithms as humans have, but still get a different trained model.

A potential big problem in this category is: maybe the only way to get the social instincts is to take those underlying algorithms and put them in a human body growing up in a human community. That would make things harder. I don’t think that’s likely to be necessary, but I don’t have a good justification for that; it’s kinda a hunch at this point. Also, if this is a problem, it might be solvable with imitative learning as in Section 4.1.

An unrelated potential problem is the possibility that the social instinct algorithms are intimately tied up with fear-of-heights instincts and pain instincts and thousands of other things such that it’s horrifically complicated and we have no hope of reverse-engineering it. Right now I’m optimistic about the social instincts being more-or-less modular and legible, but of course it’s hard to know for sure.

Yet another potential problem is that even properly-implemented human social instincts are not going to get us what we want from AGI alignment, not even after turning off jealousy or whatever. For example, maybe with sufficient intelligence, those same human instincts lead in weird directions. I guess I’m leaning optimistic on this, because:

  • The intelligence of highly-intelligent humans does not seem to systematically break their moral intuitions in ways we don’t endorse. (1000× higher intelligence may be different, but this is at least some evidence.)
  • The expression of human morality has always grown and changed over the eons, and we kinda endorse that process, and indeed want future generations to have morals that we don’t endorse, just as our ancestors wouldn’t endorse ours; and if AGIs are somehow the continuation of that process, well, maybe that’s not so bad.
  • We can also throw in conservatism (Section 2 above) to keep the AGI from drifting too far from its starting intuitions.

So I'm leaning optimistic, but I don’t have great confidence; it’s hard to say.

Overall, I go back and forth a bit, but as of this writing, I kinda feel good about this general approach.

5.1 “Consolation prize” of a future with AGIs we care about

I kinda like the idea that if we go down this path, we can also go for a “consolation prize”: if the human species doesn’t survive into the post-AGI world, then I sure want those AGIs to (A) have some semblance of human-like social instincts, (B) be conscious, (C) have rich fulfilling lives (and in particular, to not suffer).

I don’t know if (A) is that important, or important at all. I’m not a philosopher, this is way above my pay-grade, it’s just that intuitively, I kinda don’t like imagining a future universe without any trace of love and friendship and connection forever and ever. The parts (B) & (C) seem very obviously more important—speaking of which, those are also on my to-do list. (For a hint of what I think progress would look like, see my old poorly-researched casual speculation on consciousness and on suffering.) But they're pretty low on the to-do list—even if I wanted to work on that, I'm missing some prerequisites.

6. Amplification / Factored cognition

I’m generally skeptical that anything in the vicinity of factored cognition will achieve both sufficient safety and sufficient capability simultaneously, for reasons similar to Eliezer’s here. For example, I’ll grant that a team of 10 people can design a better and more complex widget than any one of them could by themselves. But my experience (from having been on many such teams) is that the 10 people all need to be explaining things to each other constantly, such that they wind up with heavily-overlapping understandings of the task, because all abstractions are leaky. And you can’t just replace the 10 people with 100 people spending 10× less time, or the project will absolutely collapse, crushed under the weight of leaky abstractions and unwise-in-retrospect task-splittings and task-definitions, with no one understanding what they’re supposed to be doing well enough to actually do it. In fact, at my last job, it was not at all unusual for me to find myself sketching out the algorithms on a project and sketching out the link budget and scrutinizing laser spec sheets and scrutinizing FPGA spec sheets and nailing down end-user requirements, etc. etc. Not because I’m individually the best person at each of those tasks—or even very good!—but because sometimes a laser-related problem is best solved by switching to a different algorithm, or an FPGA-related problem is best solved by recognizing that the real end-user requirements are not quite what we thought, etc. etc. And that kind of design work is awfully hard unless a giant heap of relevant information and knowledge is all together in a single brain.

Now, I don't think I'm saying anything here that factored cognition proponents don’t already know well. In fact, my understanding is that even the most passionate advocates of factored cognition only claim that it might work and is worth a try. Well, I guess I’d be hard-pressed to disagree with that!! But anyway, I’m sufficiently skeptical that I don’t want to invest my time in thinking about it. There are other things I can do right now that seem more promising.

(That was mostly about amplification. AGI debate is also part of “factored cognition”, but I already covered it in a different section.)

7. AGI limiting

I talked about “conservatism” above separately, but this is my catch-all section for everything else that limits the power of the AGI in order to increase safety. See above (section 2.1) for why I think AGI limiting is an appropriate thing to look into.

7.1 Impact minimization

I only have one vague idea here, and it doesn’t seem to work, but I’ll mention it anyway. Maybe it can be salvaged, who knows.

I already talked above (Section 3.1) about splitting a reward into multiple components that flow into multiple value functions. Well, that seems awfully close to what we want for Alex Turner’s attainable utility preservation (AUP) idea.

Above I was saying we can have multiple reward components, all of which are things we actually want the AGI to try to do, like “follow the command I issued yesterday” vs “follow the command I issued last week”. But we can also go a different direction, and make up whatever pseudo-rewards we want, and flow those into pseudo-value functions in the same way.

Let’s say one of the 20 pseudo-rewards is “reward whenever it rains”, which (if it were a real reward) would motivate the AGI to want it to rain as much as possible. To be clear, nobody actually wants it to rain, or to not rain. We don't care! This is just one of those silly arbitrary rewards that you need in the AUP concept.

OK, and now the AGI assesses the value of a potential thought or action from the perspective of the "rain" value function and the 19 other arbitrary pseudo-value functions, as well as from the perspective of the real value function. (A toy sketch of the resulting penalty rule follows the list below.)

  • If this thought or action is very good from the perspective of all 20 pseudo-value functions, then we should be concerned that the AGI is generically increasing its power, self-replicating, and so on. And so we penalize this thought / action.
  • If this thought or action is very bad from the perspective of all 20 pseudo-value functions, then we should be concerned that the AGI is destroying things in the world or otherwise closing off future options (also cf. non-obstruction). And so, again, we penalize this thought / action.
  • If this thought or action is roughly neutral from the perspective of all 20 pseudo-value functions, then that’s promising! Looks like the AGI is narrowly pursuing its real value function in a low-impact way. So we don’t penalize this thought / action.
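Spelling out that three-case logic as a toy sketch (the thresholds, the penalty formula, and the state/value interfaces are all made up):

```python
# Toy sketch of the AUP-flavored penalty described above: compare how a
# candidate thought/action looks from the perspective of N arbitrary
# pseudo-value functions. Big systematic swings (good-for-everything or
# bad-for-everything) get penalized; "roughly neutral across the board"
# gets no penalty. All numbers are arbitrary illustrations.

NEUTRAL_BAND = 0.1    # |change| below this counts as "roughly neutral"
PENALTY_WEIGHT = 5.0

def impact_penalty(pseudo_value_fns, current_state, predicted_state):
    """Penalty based on how much an action shifts the arbitrary pseudo-values.

    Good for (nearly) all of them smells like generic power-gain; bad for
    (nearly) all of them smells like destroying things / closing off options.
    Either way, the typical size of the shift is penalized.
    """
    changes = [v(predicted_state) - v(current_state) for v in pseudo_value_fns]
    mean_abs_change = sum(abs(c) for c in changes) / len(changes)
    return 0.0 if mean_abs_change <= NEUTRAL_BAND else PENALTY_WEIGHT * mean_abs_change

def penalized_value(real_value_fn, pseudo_value_fns, current_state, predicted_state):
    # What the AGI would actually optimize: real value minus the impact penalty.
    return (real_value_fn(predicted_state)
            - impact_penalty(pseudo_value_fns, current_state, predicted_state))
```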

OK, that was my idea. And now this is why it doesn’t seem to work:

Problem #1: Reading a math textbook, learning new problem-solving strategies, etc., would presumably be useful for any possible goal, so this scheme would flag that as dangerous instrumental-convergence behavior. But we want our AGI to do those kinds of things! I think they’re essential for building a transformative AGI. Lots of normal things just happen to be more generally useful.

Problem #2: It doesn’t seem like it would actually stop instrumental convergence. For example, consider our example pseudo-motivation of “I like it when it rains”. At first glance, self-replicating around the internet would seem to be very positive from the perspective of that goal, as an AGI with extra power and backup copies etc. is in a better position to develop weather-control technology. But if the AGI has a self-model, it will realize that it is not, in fact, going to use that newfound power to develop weather-control technology, even though it “could” in some sense. So self-replicating around the internet is actually neutral from the perspective of “wanting it to rain”, as long as the AGI’s actual goals are not weather-related.

Actually it’s worse than that. I don’t think the AGI would even think the thought “with all this power, I’d be in a better position to make it rain”, because the pseudo-value function is not controlling what thoughts get thunk. So it wouldn’t even make the connection. I think RL grid-worlds give us the wrong impression here; the real world is so big and so open that the value function is going to be horrifically inaccurate if it's learned exclusively super-duper-off-policy.

So anyway, at the end of the day, I have no idea how to do impact minimization.

7.2 “Tool AI” from self-supervised learning without RL

A couple years ago I spent a month or two being enamored with the idea of tool AI via self-supervised learning, and I wrote a few posts like In Defense of Oracle ("Tool") AI Research and Self-Supervised Learning and AGI Safety. But now I’m sufficiently pessimistic that I want to spend my time elsewhere.

The main change was: I stopped thinking that self-supervised tool AI could be all that competent—like competent enough to help solve AI alignment, or competent enough that people could plausibly coordinate around never building a more competent AGI. Why not? Because I think RL is necessary to build new knowledge and answer hard questions.

So for example, sometimes you can ask an AGI (or human) a tricky question, and its answer immediately "jumps to mind". That's like what GPT-3 does. You don't need rewards for that to happen. But for harder questions, it seems to me like you need your AGI to have an ability to learn metacognitive strategies, so that it can break the problem down, brainstorm, give up on dead ends, etc. etc.

Like, compare “trying to solve a problem by breaking it into subproblems” with “trying to win a video game”. At a high level, they’re actually pretty similar!

  • In both cases, there are a bunch of possible moves you can make, and each move affects subsequent moves, in an exponentially-growing tree of possibilities.
  • In both cases, you’ll often get some early hints about whether moves were wise, but you won’t really know that you’re on the right track until you win.
  • And in both cases, I think the only reliable way to succeed is to have the capability to repeatedly try different things, and learn from experience what paths and strategies are fruitful.

…Hence we need RL, not just supervised learning.

So that’s my opinion right now. On the other hand, someone tried to talk me back into pure supervised learning a couple weeks ago, and he offered some intriguing-sounding ideas, and I mostly wound up feeling confused. So I dunno. :-P

(Same comments also apply to “Microscope AI”.)

7.3 AGIs with holes / boundaries in their cognition

I think it would be kinda nice to know how to make an AGI that reliably doesn’t think about a certain kind of thing. Like maybe we could (A) cripple its self-awareness, or (B) do the non-human-modeling STEM AI thing.

Unfortunately I don’t know how you would make an AGI with that property, at least not reliably.

Beyond that general problem, I’m also more specifically skeptical about those two examples I just mentioned. For (A), I’m concerned that you can't remove self-awareness without also removing meta-cognition, and I think meta-cognition is necessary for capabilities reasons (see Section 7.2). For (B), I don't see how to use a STEM AI to make progress on the alignment problem, or to make such progress unnecessary.

But I dunno, I haven’t thought about it much.

8. Pre-deployment test protocols

Whatever we can safely and easily test, we don’t need to get right the first time, or even really know what we’re doing. So this seems very important.

Testing strikes me as a hard problem because the AGI can always think new thoughts and learn new things and see new opportunities in deployment that it didn’t see under testing.

I have nothing intelligent to say about pre-deployment test protocols, beyond what anyone could think of in the first 30 seconds. Sorry!

There are ideas floating around about adversarial testing, but I don’t get how I'm supposed to operationalize that, beyond the five words “We should do adversarial testing”.

9. IRL, value learning

I generally get very little out of IRL / CIRL / value learning papers, because I don’t see how we’re going to reliably point to “what the human is trying to do” in a big complicated world model that’s learned from scratch—which is the scenario I’m assuming.

And if we can point to that thing, it seems to me like the rest of the problem kinda solves itself…?

Needless to say, I’m probably missing something.

10. Out of scope for this post

As I mentioned at the top, there’s much more to Safe & Beneficial AGI than the AGI control problem, e.g.:

  • I’m not thinking about issues involving multiple humans and/or multiple AGIs cooperating and competing (example)
  • I’m not thinking about who controls the AGIs and what they do with them, or what we want the long-term future to look like, etc.
  • I’m not thinking about how to ensure that the people developing AGIs are willing and able to turn the safety-vs-capabilities dials (see above) all the way to the “safety” setting. (Ditto the safety-vs-development-speed dials.)

Not because those aren’t hard and necessary problems to solve! Just that they’re out of scope. Don't worry, you're not missing anything, because I actually have nothing intelligent to say about those problems anyway. :-P

11. Conclusion: Two end-to-end paths to AGI control

So, from where I stand right now, it seems to me like there are vaguely two end-to-end paths to solving the AGI control problem (in the AGI scenario I have in mind) with the fewest missing pieces and highest chance of success:

  • Conceptual-ish corrigibility (core ingredient) + Conservatism (core ingredient) + Transparency (probably) + Imitation (maybe) + Testing (probably). The biggest missing pieces for this path are:
    • Developing the “conservatism” idea (section 2)—I think I know how to make progress here
    • Solving the 1st-person problem (section 1.1.3)—I have a thing to look into that might help, but it also might not help, and I have no other ideas.
    • Solving transparency (section 3)—No idea
    • Coming up with test protocols (section 8)—No idea
    • Tying it all together and WCGW—May be hard until those other giant holes are filled.
  • Human-like social instincts (core ingredient) + Imitation (probably) + Transparency (probably) + Conservatism (maybe) + Testing (probably). The biggest missing pieces for this path are:
    • Reverse-engineering the algorithms underlying human social instincts (section 5)—I think I know how to make progress here
    • Solving transparency (section 3)—No idea
    • Coming up with test protocols (section 8)—No idea
    • Tying it all together and WCGW—May be hard until those other giant holes are filled.

Looking forward to ideas & criticisms!

Doing research and building knowledge is all well and good, but I'm trying to periodically ask myself: is this really part of a viable end-to-end success story??
Comments

Ben Goertzel comments on this post via twitter:

1) Nice post ... IMO the "Human-Like Social Instincts" direction has best odds of success; the notion of making AGIs focused on compassion and unconditional love (understanding these are complex messy human concept-plexes) appears to fall into this category as u loosely define it

2) Of course to make compassionate/loving AGI actually work, one needs a reasonable amount of corrigibility in one's AGI cognitive architecture, many aspects of which seem independent of whether compassion/love or something quite different is the top-level motivation/inspiration

A second commenter writes:

Regarding conservatism, there seems to be an open question of just how robust Goodhart effects are in that we all agree Goodhart is a problem but it's not clear how much of a problem it is and when. We have opinions ranging from mine, which is basically that Goodharting happens the moment you try to apply even the weakest optimization pressure and this will be a problem (or at least a problem in expectation; you might get lucky) for any system you need to never deviate, to what I read to be Paul's position: it's not that bad and we can do a lot to correct systems before Goodharting would be disastrous.

Maybe part of the problem is we're mixing up math and engineering problems and not making clear distinctions, but anyway I bring this up in the context of conservatism because it seems relevant that we also need to figure out how conservative, if at all, we need to be about optimization pressure, let alone how we would do it. I've not seen anything like a formal argument that X amount of optimization pressure, measured in whatever way is convenient, and given conditions Y produce Z% chance of Goodharting. Then at least we wouldn't have to disagree over what feels safe or not.