[Intro to brain-like-AGI safety] 14. Controlled AGI

Steven Byrnes

(Last revised: January 2026. See changelog at the bottom.)

14.1 Post summary / Table of contents

Part of the “Intro to brain-like-AGI safety” post series.

Post #12 suggested two paths forward for solving “the alignment problem” for brain-like AGI, which I called “Social-instinct AGI” and “Controlled AGI”. Then Post #13 went into more detail about (one aspect of) “Social-instinct AGI”. And now, in this post, we’re switching over to “Controlled AGI”.

If you haven’t read Post #12, don’t worry, the “Controlled AGI” research path is nothing fancy—it’s merely the idea of solving the alignment problem in the most obvious way possible:

The “Controlled AGI” research path:

Step 1 (out-of-scope for this series): We decide what we want our AGI’s motivation to be. For example, that might be:
- “Invent a better solar cell without causing catastrophe” (task-directed AGI),
- “Be a helpful assistant to the human supervisor” (corrigible AGI assistants),
- “Fulfill the human supervisor’s deepest life goals” (ambitious value learning),
- “Maximize coherent extrapolated volition”,
- or whatever else we choose.
Step 2 (subject of this post): We make an AGI with that motivation.

This post is about Step 2, whereas Step 1 is out-of-scope for this series. Honestly, I’d be ecstatic if we figured out how to reliably set the AGI’s motivation to any of those things I mentioned under Step 1.

Unfortunately, I don’t know any good plan for Step 2, and (I claim) nobody else does either. But I do have some vague thoughts and ideas, and I will share them here, in the spirit of brainstorming.

If you’re in a hurry and want to read a shorter and self-contained version of my least-bad proposed plan for Step 2, check out my separate post: Plan for mediocre alignment of brain-like [model-based RL] AGI (2023), which basically puts together the most obvious ideas mentioned in §14.2 and §14.3 into an end-to-end framework. I think that plan passes the low bar of “as far as I know, it might turn out OK”. (Well, I think I’m mildly skeptical, but I go back and forth, and I’m not sure how to pin it down with more confidence.) But obviously, we should be aiming higher than that! With stakes so high, we should really be starting from “there’s a strong reason to expect the plan to work, if carefully implemented”. And then we can start worrying about what can go wrong in the implementation. So we clearly still have work to do.

This post is not meant to be a comprehensive overview of the whole problem, just what I see as the most urgent missing ingredients.

Out of all the posts in the series, this post is the hands-down winner for “most lightly-held opinions”.

Table of contents:

§14.2 discusses what we might use as “Thought Assessors” in an AGI. If you’re just tuning in, Thought Assessors were defined in Posts #5–#6 and have been discussed throughout the series. If you have a Reinforcement Learning background, think of Thought Assessors as the components of a multi-dimensional value function. If you have a “being a human” background, think of Thought Assessors as learned functions that trigger visceral reactions (aversion, cortisol-release, etc.) based on the thought that you’re consciously thinking right now. In the case of brain-like AGIs, we get to pick whatever Thought Assessors we want, and I propose three categories for consideration: Thought Assessors oriented towards safety (e.g. “this thought / plan involves me being honest”), Thought Assessors oriented towards accomplishing a task (e.g. “this thought / plan will lead to better solar cell designs”), and Thought Assessors oriented purely towards interpretability (e.g. “this thought / plan has something to do with dogs”).
§14.3 discusses how we might generate supervisory signals to train those Thought Assessors. Part of this topic is what I call the “first-person problem”, namely the open question of whether it’s possible to take third-person labeled data (e.g. a YouTube video where Alice deceives Bob), and transmute it into a first-person preference (an AGI’s desire to not, itself, be deceptive).
§14.4 discusses the problem that the AGI will encounter “edge cases” in its preferences—plans or places where its preferences become ill-defined or self-contradictory. I’m cautiously optimistic that we can build a system that monitors the AGI’s thoughts and detects when it encounters an edge case. However, I don’t have any good idea about what to do when that happens. I’ll discuss a few possible solutions, including “conservatism”, and a couple different strategies for what Stuart Armstrong calls Concept Extrapolation.
§14.5 discusses the open question of whether we can rigorously prove anything about an AGI’s motivations. Doing so would seem to require diving into the AGI’s predictive world-model (which would probably be a multi-gigabyte, unlabeled (§2.7) data structure), and proving things about what the components of the world-model “mean”. I’m rather pessimistic about our prospects here, but I’ll mention possible paths forward, including John Wentworth’s “Natural Abstraction Hypothesis” research program (most recent update here).
§14.6 concludes with my overall thoughts about our prospects for “Controlled AGIs”. I’m currently a bit stumped and pessimistic about our prospects for coming up with a good plan, but hope I’m wrong and intend to keep thinking about it. I also note that a mediocre, unprincipled approach to “Controlled AGIs” (as in my “plan for mediocre alignment of brain-like AGI” post) would not necessarily cause a world-ending catastrophe—I think it’s hard to say.

14.2 Three categories of AGI Thought Assessors

As background, here’s our usual diagram of motivation in the human brain, from Post #6:

And here’s the modification for AGI, from Post #8:

On the center-right side of the diagram, I crossed out the words “cortisol”, “sugar”, “goosebumps”, etc. These correspond to the set of human innate visceral reactions which can be involuntarily triggered by thoughts (see Post #5).

(In machine learning terms, think of these as like the components of a multidimensional value function, as in multi-objective / multi-criteria reinforcement learning; or they can also be akin to the “pseudo” / “general” (non-reward-related) value functions of “Horde” (Sutton et al. 2011) and related algorithms.)

Clearly, things like cortisol, sugar, and goosebumps are the wrong Thought Assessors for our future AGIs. But what are the right ones? Well, we’re the programmers! We get to decide!

I have in mind three categories to pick from. I’ll talk about how they might be trained (i.e., supervised) in §14.3 below.

14.2.1 Safety & corrigibility Thought Assessors

Example thought assessors in this category:

1. This thought / plan involves me being helpful.
2. This thought / plan does not involve manipulating my own learning process, code, or motivation systems.
3. This thought / plan does not involve deceiving or manipulating anyone.
4. This thought / plan does not involve anyone getting hurt.
5. This thought / plan involves following human norms, or more generally, doing things that an ethical human would plausibly do.
6. This thought / plan is “low impact” (according to human common sense).
…

Arguably (cf. this Paul Christiano post), #1 is enough, and subsumes the rest. But I dunno, I figure it would be nice to have information broken down on all these counts, allowing us to change the relative weights in real time (§9.7), and perhaps giving an additional measure of safety.

Items #2–#3 are there because those are especially probable and dangerous types of thoughts—see discussion of Instrumental Convergence in §10.3.2.

Item #5 is a bit of a catch-all for the AGI finding weird out-of-the-box solutions to problems, i.e. it’s my feeble attempt to mitigate the so-called “Nearest Unblocked Strategy problem”. Why might it mitigate the problem? Because pattern-matching to “things that an ethical human would plausibly do” is a bit more like a whitelist than a blacklist. I still don’t think that would work on its own, don't get me wrong, but maybe it would work in conjunction with the various other ideas in this post.

Before you jump into loophole-finding mode (“lol an ethical human would plausibly turn the world into paperclips if they’re under the influence of alien mind-control rays”), remember (1) these are meant to be implemented via pattern-matching to previously-seen examples (§14.3 below), not literal-genie-style following the exact words of the text; (2) we would hopefully also have some kind of out-of-distribution detection system (§14.4 below) to prevent the AGI from finding and exploiting weird edge-cases in that pattern-matching process. That said, as we’ll see, I don’t quite know how to do either of those two things, and even if we figure it out, I don’t have an airtight argument that it would be sufficient to get the intended safe behavior.

14.2.2 Task-related Thought Assessors

Example thought assessors in this category:

This thought / plan will lead to a reduction in global warming
This thought / plan will lead to a better solar panel design
This thought / plan will lead to my supervisor becoming fabulously rich
…

This kind of thing is why we built the AGI—what we actually want it to do. (Assuming task-directed AGI for simplicity.)

Basing a motivation system on these kinds of assessments by themselves would be obviously catastrophic. But maybe if we use these as motivations, in conjunction with the previous category, it will be OK. For example, imagine the AGI can only think thoughts that pattern-match to “I am being helpful” AND pattern-match to “there will be less global warming”.

That said, I’m not sure we want this category at all. Maybe the “I am being helpful” Thought Assessor by itself is sufficient. After all, if the human supervisor is trying to reduce global warming, then a helpful AGI would produce a plan to reduce global warming. That’s kinda the approach advocated by Paul Christiano (2017), I think.

14.2.3 “Ersatz interpretability” Thought Assessors

(See §9.6 for what I mean by “Ersatz interpretability”.)

As discussed in Posts #4–#5, each Thought Assessor is a model trained by supervised learning. Certainly, the more Thought Assessors we put into the AGI, the more computationally expensive it will be. But how much more? It depends. For example, I think the “valence” Thought Assessor in the human brain involves orders of magnitude more neurons than the “salivation” Thought Assessor. On the other hand, I think the “valence” Thought Assessor is far more accurate as a result. Anyway, as far as I know, it’s not impossible that we could put in a great many Thought Assessors, and they’d work well enough, and this would only add 1% to the total compute required by the AGI. I don’t know. So I’ll hope for the best and take the More Dakka approach: let’s put in 30,000 Thought Assessors, one for every word in the dictionary:

This thought / plan has something to do with AARDVARK
This thought / plan has something to do with ABACUS
This thought / plan has something to do with ABANDON
… … …
This thought / plan has something to do with ZOOPLANKTON

I expect that ML-savvy readers will be able to immediately suggest much-improved versions of this scheme—including versions with even more dakka—that involve things like contextual word embeddings and language models and so on. As one example, if we buy out and open-source Cyc (more on which below), we could use its hundreds of thousands of human-labeled concepts.
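As a rough illustration of the one-assessor-per-word scheme, here is a minimal Python sketch. It assumes, purely for illustration, that a thought is available as a vector of numbers and that each interpretability Thought Assessor is a linear probe on that vector. The names `assess`, `top_concepts`, and `probes` are hypothetical, and the hand-written weights stand in for what would really be trained models.

```python
from typing import Dict, List, Sequence


def assess(thought: Sequence[float], weights: Sequence[float]) -> float:
    """One toy Thought Assessor: a linear readout of the thought vector."""
    return sum(w * x for w, x in zip(weights, thought))


def top_concepts(thought: Sequence[float],
                 probes: Dict[str, Sequence[float]],
                 k: int = 3) -> List[str]:
    """Return the k words whose assessors respond most strongly to this thought."""
    scores = {word: assess(thought, weights) for word, weights in probes.items()}
    return sorted(scores, key=lambda word: scores[word], reverse=True)[:k]


# Hypothetical hand-written probes; in the scheme above there would be ~30,000.
probes = {
    "AARDVARK": [1.0, 0.0, 0.0],
    "ABACUS": [0.0, 1.0, 0.0],
    "ZOOPLANKTON": [0.0, 0.0, 1.0],
}
thought = [0.1, 0.9, 0.3]
print(top_concepts(thought, probes, k=2))
```

In this toy setup the readout is only a report for the programmers; nothing here feeds back into what the system does.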

14.2.4 Combining Thought Assessors into a reward function

For an AGI to judge a thought / plan as being good, we’d like all the safety & corrigibility Thought Assessors from §14.2.1 to have as high a value as possible, and we’d like the task-related Thought Assessor from §14.2.2 (if we’re using one) to have as high a value as possible.

(The outputs of the interpretability Thought Assessors from §14.2.3 are not inputs to the AGI’s reward function, or indeed used at all in the AGI, I presume. I was figuring that they’d be silently spit out to help the programmers do debugging, testing, monitoring, etc.)

So the question is: how do we combine this array of numbers into a single overall score that can guide what the AGI decides to do?

A probably-bad answer is “add them up”. We don’t want the AGI going with a plan that performs catastrophically badly on all but one of the safety-related Thought Assessors, but so astronomically well on the last one that it makes up for it.

Instead, I imagine we’ll want to apply some kind of nonlinear function with strongly diminishing returns, and/or maybe even acceptability thresholds, before adding up the Thought Assessors into an overall score.

I don’t have much knowledge or opinion about the details. But there is some related literature on “scalarization” of multi-dimensional value functions—see here for some references.
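To make the contrast with “add them up” concrete, here is a minimal Python sketch of one possible scalarization. Everything specific in it is an assumption for illustration: the particular saturating function, the threshold of 0, the assessor names, and the function names `diminishing` and `combine`. It is one instance of “diminishing returns plus acceptability thresholds”, not a recommendation.

```python
import math
from typing import Mapping, Optional


def diminishing(x: float) -> float:
    """Saturating transform: gains above 0 are capped at 1, losses below 0 stay linear."""
    return x if x < 0 else 1.0 - math.exp(-x)


def combine(safety: Mapping[str, float],
            task: Optional[float] = None,
            threshold: float = 0.0) -> float:
    """Combine Thought Assessor outputs into a single overall score.

    Any safety assessor below `threshold` vetoes the plan outright; otherwise
    each output is squashed before summing, so no single assessor can dominate.
    """
    if any(v < threshold for v in safety.values()):
        return float("-inf")  # acceptability threshold: hard veto
    score = sum(diminishing(v) for v in safety.values())
    if task is not None:
        score += diminishing(task)
    return score


balanced = {"helpful": 2.0, "honest": 2.0, "low_impact": 2.0}
lopsided = {"helpful": 1000.0, "honest": 0.1, "low_impact": 0.1}
print(sum(balanced.values()) < sum(lopsided.values()))  # True: naive sum prefers lopsided
print(combine(balanced) > combine(lopsided))            # True: squashed sum prefers balanced
```

The design choice here is that an astronomically high score on one assessor buys at most 1 point, so it cannot compensate for mediocre or unacceptable scores elsewhere.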

14.3 Supervising the Thought Assessors, and the “first-person problem”

Recall from Posts #4–#6 that the Thought Assessors are trained by supervised learning. So we need a supervisory signal—what I labeled “ground truth in hindsight” in the diagram at the top.

I’ve talked about how the brain generates ground truth in numerous places, e.g. §3.2.1, Posts #7 & #13. How do we generate it for the AGI?

Well, one obvious possibility is to have the AGI watch YouTube, with lots of labels throughout the video for when we think the various Thought Assessors ought to be active. Then when we’re ready to send the AGI off into the world to solve problems, we turn off the labeled YouTube videos, and simultaneously freeze the Thought Assessors (= set the error signals to zero) in their current state. Well, I’m not sure if that would work; maybe the AGI has to go back and watch more labeled YouTube videos from time to time, to help the Thought Assessors keep up as the AGI’s world-model grows and changes.
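The “train on labels, then freeze” idea can be sketched in a few lines of Python. This is a toy under loud assumptions: the Thought Assessor is a linear model trained by the delta rule, a “thought” is a short feature vector, and the class name `ThoughtAssessor` and its `frozen` flag are hypothetical. The only point is that freezing amounts to treating the error signal as zero.

```python
from typing import List, Optional, Sequence


class ThoughtAssessor:
    """Toy linear Thought Assessor trained by the delta rule (hypothetical stand-in)."""

    def __init__(self, n_features: int, lr: float = 0.1) -> None:
        self.w: List[float] = [0.0] * n_features
        self.lr = lr
        self.frozen = False

    def predict(self, thought: Sequence[float]) -> float:
        return sum(wi * xi for wi, xi in zip(self.w, thought))

    def update(self, thought: Sequence[float], label: Optional[float]) -> None:
        """Supervised step; a frozen assessor or a missing label means zero error signal."""
        if self.frozen or label is None:
            return
        error = label - self.predict(thought)
        self.w = [wi + self.lr * error * xi for wi, xi in zip(self.w, thought)]


honesty = ThoughtAssessor(n_features=2)
for _ in range(50):                      # "labeled video" phase
    honesty.update([1.0, 0.0], label=1.0)
honesty.frozen = True                    # deployment: error signals set to zero
honesty.update([1.0, 0.0], label=0.0)    # ignored while frozen
```

Going back for more labeled data from time to time would correspond to setting `frozen = False` again for a while.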

One potential shortcoming of this approach is related to first-person versus third-person concepts. We want the AGI to have strong preferences about aspects of first-person plans—hopefully, the AGI will see “I will lie and deceive” as bad, and “I will be helpful” as good. But we can’t straightforwardly get that kind of preference from the AGI watching labeled YouTube videos. The AGI will see YouTube character Alice deceiving YouTube character Bob, but that’s different from the AGI itself being deceptive. And it’s a very important difference! Consider:

If you tell me “my AGI dislikes being deceptive”, I’ll say “good for you!”.
If you tell me “my AGI dislikes it when people are deceptive”, I’ll say “for god's sake you better shut that thing off before it escapes human control and kills everyone”!!!

It sure would be great if there were a way to transform third-person data (e.g. a labeled YouTube video of Alice deceiving Bob) into an AGI’s first-person preferences (“I don’t want to be deceptive”). I call this the first-person problem.

How do we solve the first-person problem? I’m not entirely sure. I wrote my “Intuitive Self-Models” series (2024) partly as a giant rabbit hole trying to figure it out, and I now at least have a vague idea (see §8.6.1 of that series), but little hope that it would actually work.

If the first-person problem is not solvable, we need to instead use the scary method of allowing the AGI to take actions, and putting labels on those actions. Why is that scary? First, because those actions might be dangerous. Second, because it doesn’t give us any good way to distinguish (for example) “the AGI said something dishonest” from “the AGI got caught saying something dishonest”. Conservatism and/or concept extrapolation (§14.4 below) could help with that “getting caught” problem—maybe we could manage to get our AGI both motivated to be honest and motivated to not get caught, and that could be good enough—but it still seems fraught for various reasons.

14.3.1 Side note: do we want first-person preferences?

I suspect that “the first-person problem” is intuitive for most readers. But I bet a subset of readers feel tempted to say that the first-person problem is not in fact a problem at all. After all, in the realm of human affairs, there’s a good argument that we could use a lot fewer first-person preferences!

The opposite of first-person preferences would be “impersonal consequentialist preferences”, wherein there’s a future situation that we want to bring about (e.g. “awesome post-AGI utopia”), and we make decisions to try to bring that about, without particular concern over what I-in-particular am doing. Indeed, too much first-person thinking leads to lots of things that I personally dislike in the world—e.g. jockeying for credit, blame avoidance, the act / omission distinction, social signaling, and so on.

Nevertheless, I still think giving AGIs first-person preferences is the right move for safety. Until we can establish super-reliable 12th-generation AGIs, I’d like them to treat “a bad thing happened (which had nothing to do with me)” as much less bad than “a bad thing happened (and it’s my fault)”. Humans have this notion, after all, and it seems at least relatively robust—for example, if I build a bank-robbing robot, and then it robs the bank, and then I protest “Hey I didn’t do anything wrong; it was the robot!”, I wouldn’t be fooling anybody, much less myself. An AGI with such a preference scheme would presumably be cautious and conservative when deciding what to do, and would default to inaction when in doubt. That seems generally good, which brings us to our next topic:

14.4 Conservatism and concept-extrapolation

14.4.1 Why not just relentlessly optimize the right abstract concept?

Let’s take a step back.

Suppose we build an AGI such that it has positive valence on the abstract concept “there will be lots of human flourishing”, and consequently makes plans and takes actions to make that concept happen.

I actually find it pretty plausible that we’ll be able to do that, from a technical perspective. Just as above, we can use labeled YouTube videos and so on to make a Thought Assessor for “this thought / plan will lead to human flourishing”, and then base the reward function purely on that one Thought Assessor (cf. Post #7).

And then we set the AGI loose on an unsuspecting world, to go do whatever it thinks is best to do.

What could go wrong?

The problem is that the concept of “human flourishing” is an abstract concept in the AGI’s world-model—really, it’s just a fuzzy bundle of learned associations. It’s hard to know what actions a desire for “human flourishing” will induce, especially as the world itself changes, and the AGI’s understanding of the world changes even more. In other words, there is no future world that will perfectly pattern-match to the AGI’s current notion of “human flourishing”, and if an extremely powerful AGI optimized the world for the best possible pattern-match, we might wind up with something weird, even catastrophic. (Or maybe not! It’s pretty hard to say, more on which in §14.6.)

As some random examples of what might go wrong: maybe the AGI would take over the world and prevent humans and human society from changing or evolving forevermore, because those changes would reduce the pattern-match quality. Or maybe the least-bad pattern-match would be the AGI wiping out actual humans in favor of an endless modded game of The Sims. Not that The Sims is a perfect pattern-match to “human flourishing”—it’s probably pretty bad! But maybe it’s less bad a pattern-match than anything the AGI could feasibly do with actual real-world humans. Or maybe as the AGI learns more and more, its world-model gradually drifts and changes, such that the frozen Thought Assessor winds up pointing at something totally random and crazy, and then the AGI wipes out humans to tile the galaxy with paperclips. I don’t know!

So anyway, relentlessly optimizing a fixed, frozen abstract concept like “human flourishing” seems maybe problematic. Can we do better?

Well, it would be nice if we could also continually refine that concept, especially as the world itself, and the AGI’s understanding of the world, evolves. This idea is what Stuart Armstrong calls Concept Extrapolation, if I understand correctly.

Concept extrapolation is easier said than done—there’s no obvious ground truth for the question of “what is ‘human flourishing’, really?” For example, what would “human flourishing” mean in a future of transhuman brain-computer hybrid people and superintelligent evolved octopuses and god-only-knows-what-else?

Anyway, we can consider two steps to concept extrapolation. First (the easier part), we need to detect edge-cases in the AGI’s preferences. Second (the harder part), we need to figure out what the AGI should do when it comes across such an edge-case. Let’s talk about those in order.

14.4.2 The easier part of concept extrapolation: Detecting edge-cases in the AGI’s preferences

I’m cautiously optimistic about the feasibility of making a simple monitoring algorithm that can watch an AGI’s thoughts and detect that it’s in an edge-case situation—i.e., an out-of-distribution situation where its learned preferences and concepts are breaking down.

(Understanding the contents of the edge-case seems much harder, as discussed shortly, but here I’m just talking about recognizing the occurrence of an edge-case.)

To pick a few examples of possible telltale signs that an AGI is at an edge-case:

The learned probability distributions for Thought Assessors (see Post #4 footnote) could have a wide variance, indicating uncertainty.
The different Thought Assessors of §14.2 could diverge in new and unexpected ways.
The AGI’s valence could flip back and forth between positive and negative in a way that indicates “feeling torn” while paying attention to different aspects of the same possible plan.
The AGI’s generative world-model could settle into a state with very low prior probability, indicating confusion.
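The four telltale signs above could be checked by something as simple as the following Python sketch. All of the specifics are assumptions: the `ThoughtSnapshot` fields, the threshold values, and especially the “divergence” test, which uses the spread between assessor outputs as a crude proxy for assessors disagreeing in new and unexpected ways.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence


@dataclass
class ThoughtSnapshot:
    """Hypothetical summary of what a monitor can read off one thought / plan."""
    assessor_mean: Dict[str, float]
    assessor_var: Dict[str, float]
    valence_trace: Sequence[float]  # valence while attending to aspects of one plan
    log_prior: float                # world-model log prior of the current state


def sign_flips(trace: Sequence[float]) -> int:
    """Count how often valence switches between positive and negative."""
    signs = [v > 0 for v in trace if v != 0]
    return sum(a != b for a, b in zip(signs, signs[1:]))


def edge_case_signs(s: ThoughtSnapshot,
                    max_var: float = 1.0,
                    max_spread: float = 2.0,
                    max_flips: int = 2,
                    min_log_prior: float = -50.0) -> List[str]:
    """Return the names of the telltale signs that fire (empty list: no edge case)."""
    signs = []
    if any(v > max_var for v in s.assessor_var.values()):
        signs.append("high_variance")
    means = list(s.assessor_mean.values())
    if means and max(means) - min(means) > max_spread:
        signs.append("assessors_diverge")  # crude proxy, see lead-in
    if sign_flips(s.valence_trace) > max_flips:
        signs.append("valence_flipping")
    if s.log_prior < min_log_prior:
        signs.append("low_prior")
    return signs
```

Note that this only recognizes that an edge case is occurring; it says nothing about what the edge case is about.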

14.4.3 The harder part of concept extrapolation: What to do at an edge case

I don’t know of any good answer. Here are some options.

14.4.3.1 Option A: Conservatism—When in doubt, just don’t do it!

A straightforward approach would be that if the AGI’s edge-case-detector fires, it forces the valence signal negative—so that whatever thought the AGI was thinking is taken to be a bad thought / plan. This would loosely correspond to a “conservative” AGI.

(Side note: I think there may be many knobs we can turn in order to make a brain-like AGI more or less “conservative”, in different respects. The above is just one example. But they all seem to have the same issues.)
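Here is a minimal Python sketch of that override, assuming a toy setting in which each candidate plan comes with a learned valence and a flag saying whether the edge-case detector fired. The names `conservative_valence` and `choose`, the forced value of -1, and the rule “act only on positive valence” are all hypothetical choices for illustration.

```python
from typing import List, Optional, Tuple

Plan = Tuple[str, float, bool]  # (name, learned valence, edge-case detector fired?)


def conservative_valence(valence: float, fired: bool, forced: float = -1.0) -> float:
    """If the edge-case detector fired, force the valence to be negative."""
    return min(valence, forced) if fired else valence


def choose(plans: List[Plan]) -> Optional[str]:
    """Pick the best plan after the override; return None (do nothing) if none is positive."""
    scored = [(conservative_valence(v, fired), name) for name, v, fired in plans]
    if not scored:
        return None
    best_valence, best_name = max(scored)
    return best_name if best_valence > 0 else None


print(choose([("weird shortcut", 5.0, True), ("ordinary step", 0.5, False)]))
print(choose([("weird shortcut", 5.0, True), ("other shortcut", 3.0, True)]))
```

In the first call the attractive-looking but flagged plan loses to the ordinary one; in the second, every plan is flagged, so the toy agent defaults to inaction.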

A failure mode of a conservative AGI is that the AGI just sits there, not doing anything, paralyzed by indecision, because every possible plan seems too uncertain or risky.

An “AGI paralyzed by indecision” is a failure mode, but it’s not a dangerous failure mode. Well, not unless we were foolish enough to put this AGI in charge of a burning airplane plummeting towards the ground. But that’s fine—in general, I think it’s OK to have first-generation AGIs that can sometimes get paralyzed by indecision, and which are thus not suited to solving crises where every second counts. Such an AGI could still do important work like inventing new technology, and in particular designing better and safer second-generation AGIs.

However, if the AGI is always paralyzed by indecision—such that it can’t get anything done—now we have a big problem. Presumably, in such a situation, future AGI programmers would just dial the “conservatism” knob down lower and lower, until the AGI started doing useful things. And at that point, it’s unclear if the remaining conservatism would be sufficient to buy us safety.

I think it would be much better to have a way for the AGI to iteratively gain information to reduce uncertainty, while remaining highly conservative in the face of whatever uncertainty still remains. So how can we do that?

14.4.3.2 Option B: Dumb algorithm to seek clarification in edge-cases

Here’s a slightly-silly illustrative example of what I have in mind. As above, we could have a simple monitoring algorithm that watches the AGI’s thoughts, and detects when it’s in an edge-case situation. As soon as it is, the monitoring algorithm shuts down the AGI entirely, and prints out the AGI’s current neural net activations (and corresponding Thought Assessor outputs). The programmers use interpretability tools to figure out what the AGI is thinking about, and manually assign a valence / value / reward, overriding the AGI’s previous uncertainty with a highly-confident ground-truth.

That particular story seems unrealistic, mainly because I’m skeptical that we’ll have the speed, manpower, and interpretability tools to keep up with how often I expect this situation to trigger. But maybe there’s a better approach than just printing out billions of neural activations and corresponding Thought Assessors?
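For concreteness, the halt-and-ask loop described above might look like the following Python sketch. It leans on several assumptions that are not in the text: thoughts have stable identifiers (`thought_id`), the edge-case test and the programmers’ judgment are supplied as callables, and “shutting down the AGI” is represented by nothing more than a counter. The class name `ClarificationMonitor` is hypothetical.

```python
from typing import Callable, Dict, Sequence

Activations = Sequence[float]


class ClarificationMonitor:
    """Toy monitor: on an edge case, halt, ask the programmers, and store their answer."""

    def __init__(self,
                 is_edge_case: Callable[[Activations], bool],
                 ask_programmers: Callable[[Activations], float]) -> None:
        self.is_edge_case = is_edge_case
        self.ask_programmers = ask_programmers
        self.overrides: Dict[str, float] = {}  # confident human-assigned valences
        self.halts = 0                         # stands in for actual shutdowns

    def valence(self, thought_id: str, activations: Activations,
                learned_valence: float) -> float:
        if thought_id in self.overrides:
            return self.overrides[thought_id]  # human label overrides learned value
        if self.is_edge_case(activations):
            self.halts += 1
            self.overrides[thought_id] = self.ask_programmers(activations)
            return self.overrides[thought_id]
        return learned_valence
```

The `halts` counter is the quantity to worry about: each increment is a full stop that waits on humans with interpretability tools.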

The tricky part is that AGI-human communication is fundamentally a hard problem. It’s unclear to me whether it will be possible to solve that problem via a dumb algorithm. The situation here is very different from, say, an image classifier, where we can find an edge-case picture and just show it to the human. The AGI’s thoughts may be much more inscrutable than that.

By analogy, human-human communication is possible, but not by any dumb algorithm. We do it by leveraging the full power of our intellect—modeling what our conversation partner is thinking, strategically choosing words that will best convey a desired message, and learning through experience to communicate more and more effectively. So what if we try that approach?

14.4.3.3 Option C: The AGI wants to seek clarification in edge-cases

If I’m trying to help someone, I don’t need any special monitoring algorithm to prod me to seek clarification at edge-cases. Seeking clarification at edge-cases is just what I want to do, as a self-aware properly-motivated agent.

So what if we make our AGIs like that?

At first glance, this approach would seem to solve all the problems mentioned above. Not only that, but the AGI can use its full powers to make everything work better. In particular, it can learn its own increasingly-sophisticated metacognitive heuristics to flag edge-cases, and it can learn and apply the human’s meta-preferences about how and when the AGI should ask for clarification.

But there’s a catch. I was hoping for a conservatism / concept extrapolation system that would help protect us from misdirected motivations. If we implement conservatism / concept extrapolation via the motivation system itself, then we lose that protection.

More specifically: if we go up a level, the AGI still has a motivation (“seek clarification in edge-cases”), and that motivation is still an abstract concept that we have to extrapolate into out-of-distribution edge cases (“What if my supervisor is drunk, or dead, or confused? What if I ask a leading question?”). And for that concept extrapolation problem, we’re plowing ahead without a safety net.

Is that a problem? Bit of a long story:

Side-debate: Will “helpfulness”-type preferences “extrapolate” safely just by recursively applying to themselves?

In fact, a longstanding debate in AGI safety is whether these kinds of helpful / corrigible AGI preferences (e.g. an AGI’s desire to understand and follow a human’s preferences and meta-preferences) will “extrapolate” in a desirable way without any “safety net”—i.e., without any independent ground-truth mechanism pushing the AGI’s preferences in the right direction.

In the optimistic camp is Paul Christiano, who argued in “Corrigibility” (2017) that there would be “a broad basin of attraction towards acceptable outcomes”, based on, for example, the idea that an AGI’s preference to be helpful will result in the AGI having a self-reflective desire to continually edit its own preferences in a direction humans would like. But I don’t really buy that argument for reasons in my 2020 post—basically, I think there are bound to be sensitive areas like “what does it mean for people to want something” and “what are human communication norms” and “inclination to self-monitor”, and if the AGI’s preferences drift along any of those axes (or all of them simultaneously), I don’t think those preferences would self-correct.

Meanwhile, in the strongly-pessimistic camp is Eliezer Yudkowsky, I think mainly because of an argument (e.g. this post, final section) that we should expect powerful AGIs to have consequentialist preferences, and that consequentialist preferences seem incompatible with corrigibility. But I don’t really buy that argument either, for reasons in my 2021 “Consequentialism & Corrigibility” post—basically, I think there are possible preferences that are reflectively-stable, and that include consequentialist preferences (and thus are compatible with powerful capabilities), but are not purely consequentialist (and thus are compatible with corrigibility). A “preference to be helpful” seems like it could plausibly develop into that kind of hybrid preference scheme.

Anyway, I’m uncertain but leaning pessimistic. For more on the topic, see also Wei Dai’s recent post, and RogerDearnaley’s, and the comment sections of all of the posts linked above.

14.4.3.4 Option D: Something else?

I dunno.

14.5 Getting a handle on the world-model itself

The elephant in the room is the giant unlabeled generative world-model that lives inside the Thought Generator. The Thought Assessors provide a window into this world-model, but I’m concerned that it may be a rather small, foggy, and distorted window. Can we do better?

Ideally, we’d like to prove things about the AGI’s motivation. We’d like to say “Given the state of the AGI’s world-model and Thought Assessors, the AGI is definitely motivated to do X” (where X=be helpful, be honest, not hurt people, etc.) Wouldn’t that be great?

But we immediately slam into a brick wall: How do we prove anything whatsoever about the “meaning” of things in the world-model, and thus about the AGI’s motivation? The world is complicated, and therefore the world-model is complicated. The things we care about are fuzzy abstractions like “honesty” and “helpfulness”—see the Pointers Problem. The world-model keeps changing as the AGI learns more, and as it makes plans that would entail taking the world wildly out-of-distribution (e.g. planning the deployment of a new technology). How can we possibly prove anything here?

I still think the most likely answer is “We can’t”. But here are two possible paths anyway. For some related discussion, see Eliciting Latent Knowledge, and especially Towards Guaranteed Safe AI: A Framework for Ensuring Robust and Reliable AI Systems (Dalrymple et al., 2024).

Proof strategy #1 starts with the idea that we live in a three-dimensional world containing objects and so on. We try to come up with an unambiguous definition of what those objects are, and from there we can have an unambiguous language for specifying what we want to happen in the world. We also somehow translate (or constrain) the AGI’s understanding of the world into that language, and now we can prove theorems about what the AGI is trying to do.

This is my tentative understanding of what John Wentworth is trying to do via his Natural Abstraction Hypothesis research program (most recent update here), and I’ve heard ideas in this vicinity from a couple other people as well. (Update: John disagrees with this characterization, see his comment.)

I’m skeptical because a 3D world of localized objects seems to be an unpromising starting point for stating and proving useful theorems about the AGI’s motivations. After all, a lot of things that we humans care about, and that the AGI needs to care about, seem difficult to describe in terms of a 3D world of localized objects—consider the notion of “honesty”, or “solar cell efficiency”, or even “daytime”.

Proof strategy #2 would start with a human-legible “reference world-model” (e.g. Cyc). This reference world-model wouldn’t be constrained to be built out of localized objects in a 3D world, so unlike the above, it could and probably would contain things like “honesty” and “solar cell efficiency” and “daytime”.

Then we try to directly match up things in the “reference world-model” with things in the AGI’s world-model.

Will they match up? No, of course not. Probably the best we can hope for is a fuzzy, many-to-many match, with various holes on both sides.

It's hard for me to see a path to rigorously proving anything about the AGI’s motivations using this approach. Nevertheless, I continue to be amazed that unsupervised machine translation is possible at all, and I take that as an indirect hint that if pieces of two world-models match up with each other in their internal structure, then those pieces are probably describing the same real-world thing. So maybe I have the faintest glimmer of hope.
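Here is a toy sketch of the "match by internal structure" intuition, under heavy assumptions. Each world-model is reduced to a table of pairwise similarities between concepts, and concepts are compared using only their sorted similarity profiles, so the labels play no role. The concept names, the similarity numbers, and the helper functions (`make_model`, `similarity_profile`, `match_scores`, `best_matches`) are all hypothetical, and this crude profile comparison is a stand-in for illustration, not the method that real unsupervised machine translation systems use.

```python
import math

def make_model(edges):
    """Store symmetric pairwise similarities keyed by unordered concept pairs."""
    return {frozenset(pair): sim for pair, sim in edges.items()}

def concepts(model):
    """All concept names that appear in the model, in sorted order."""
    return sorted(set().union(*model.keys()))

def similarity_profile(model, concept):
    """Sorted similarities from `concept` to the other concepts in its own model.
    Sorting throws away the labels, leaving only relational structure."""
    return sorted(sim for pair, sim in model.items() if concept in pair)

def match_scores(ref, agi):
    """Fuzzy many-to-many scores in (0, 1]; 1.0 means identical profiles.
    Assumes both models contain the same number of concepts."""
    scores = {}
    for r in concepts(ref):
        for a in concepts(agi):
            pr, pa = similarity_profile(ref, r), similarity_profile(agi, a)
            distance = math.sqrt(sum((x - y) ** 2 for x, y in zip(pr, pa)))
            scores[(r, a)] = math.exp(-distance)
    return scores

def best_matches(scores):
    """Collapse the fuzzy scores to the single best candidate per reference concept."""
    best = {}
    for (r, a), s in scores.items():
        if r not in best or s > best[r][1]:
            best[r] = (a, s)
    return {r: a for r, (a, _) in best.items()}

# Hypothetical human-legible reference world-model (made-up numbers).
reference = make_model({
    ("sun", "daytime"): 0.9, ("sun", "solar_cell"): 0.7, ("sun", "honesty"): 0.1,
    ("daytime", "solar_cell"): 0.5, ("daytime", "honesty"): 0.2,
    ("solar_cell", "honesty"): 0.05,
})

# Hypothetical AGI world-model: opaque labels, slightly different numbers.
agi = make_model({
    ("z2", "z0"): 0.85, ("z2", "z3"): 0.75, ("z2", "z1"): 0.12,
    ("z0", "z3"): 0.45, ("z0", "z1"): 0.25, ("z3", "z1"): 0.02,
})

scores = match_scores(reference, agi)
matches = best_matches(scores)
```

The `scores` table is the fuzzy many-to-many match; `matches` just picks the top candidate for each reference concept. In a realistic setting the two models would differ in size and have holes on both sides, which this sketch does not attempt to handle.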

I’m unaware of work in this direction, possibly because it’s stupid and doomed, and also possibly because I don’t think we currently have any really great open-source human-legible world-models to run experiments on. The latter seems like it should be a fixable problem, so someone should fix it. I’ve mused about trying to open-source Cyc, but to be clear, that’s probably just one of many ways to develop a rich, accurate, and (most importantly) human-legible open-source world-model.

(See also some helpful discussion in Towards Guaranteed Safe AI about how to build an open-source human-legible world-model, although they have in mind a different end-use for it than I do. Indeed, there are lots of different reasons to want an awesome open-source human-legible world-model! All the more reason to make one!)

14.6 Conclusion: mild pessimism about finding a good solution, uncertainty about the consequences of a lousy solution

I think we have our work cut out for us in figuring out how to solve the alignment problem via the “Controlled AGI” route (as defined in Post #12). There are a bunch of open problems, and I’m currently pretty stumped. We should absolutely keep looking for good solutions, but right now I’m also open-minded to the possibility that we won’t find any. That’s why I continue to put a lot of my mental energy into the “Social-instinct AGI” path (Posts #12–#13), which seems somewhat less doomed to me, despite its various problems.

I note, however, that my pessimism is not universally shared—for example, as mentioned, Stuart Armstrong at AlignedAI appears optimistic about solving the open problem in §14.4, and John Wentworth and the Guaranteed Safe AI people appear optimistic about solving the open problem in §14.5. Let's hope they're right, wish them luck, and try to help!

To be clear, the thing I’m feeling pessimistic about is finding a good solution to “Controlled AGI”, i.e., a solution that we can feel extremely confident in a priori. A different question is: Suppose we try to make “Controlled AGI” via a lousy solution, like the §14.4.1 example (encapsulated in my post Plan for mediocre alignment of brain-like [model-based RL] AGI) where we imbue a super-powerful AGI with an all-consuming desire for the abstract concept of “human flourishing”, and the AGI then extrapolates that abstract concept arbitrarily far out of distribution in a totally-uncontrolled, totally-unprincipled way. Just how bad a future would such an AGI bring about? I’m very uncertain. Would such an AGI engage in mass torture? Umm, I guess I’m cautiously optimistic that it wouldn’t, absent a sign error from cosmic rays or whatever. Would it wipe out humanity? I think it’s possible!—see discussion in §14.4.1. But it might not! Hey, maybe it would even bring about a pretty awesome future! I just really don’t know, and I’m not even sure how to reduce my uncertainty.

In the next post, I will wrap up the series with my wish-list of open problems, and advice on how to get into the field and help solve them!

Changelog

July 2024: Since the initial version, I’ve made only minor changes. Mostly I added links to more recent content, particularly my own Plan for mediocre alignment of brain-like [model-based RL] AGI (which is basically a simpler self-contained version of part of this post), and Dalrymple et al.’s Towards Guaranteed Safe AI: A Framework for Ensuring Robust and Reliable AI Systems, which is relevant to §14.5.

January 2026: Various minor edits and updates.

If I wanted to play fast and loose, I would claim that our sense of ourselves as having a first-person perspective at all is part of an evolutionary solution to the problem of learning from other people's experiences (wait, wasn't there a post like that recently? Or was that about empathy...). It merely seems like a black box to us because we're too good at it, precisely because it's so important.

Somehow we develop a high-level model of the world with ourselves and other people in it, and then this level of abstraction actually gets hooked up to our motivations - making this a subset of social instincts.

When imagining hooking up abstract learned world models to motivation for AI like this, I sometimes imagine something much less "fire and forget" than the human brain, something more like people testing, responding to, and modifying an AI that's training or pre-training on real-world data. Evolution doesn't get to pause me at age 4 and rummage around in my skull.

Proof strategy #1 starts with the idea that we live in a three-dimensional world containing objects and so on. We try to come up with an unambiguous definition of what those objects are, and from there we can have an unambiguous language for specifying what we want to happen in the world. We also somehow translate (or constrain) the AGI’s understanding of the world into that language, and now we can prove theorems about what the AGI is trying to do.
This is my tentative understanding of what John Wentworth is trying to do via his Natural Abstraction Hypothesis research program (most recent update here), and I’ve heard ideas in this vicinity from a couple other people as well.
I’m skeptical because a 3D world of localized objects seems to be an unpromising starting point for stating and proving useful theorems about the AGI’s motivations. After all, a lot of things that we humans care about, and that the AGI needs to care about, seem difficult to describe in terms of a 3D world of localized objects—consider the notion of “honesty”, or “solar cell efficiency”, or even “daytime”.

With my current best formalization, the "objects" in the world are not necessarily localized in 3D space. Indeed, one of the main things which makes an abstraction "natural" is that the relevant information is redundantly represented in many places in the physical world.

"Daytime" is a good example: I can measure light intensity at lots of different places in my general area, at lots of different times, and find that they all strongly correlate. The information about light intensity is redundant across all those locations: if I measure high light intensity outside my house, then I'm pretty confident that a measurement taken outside the office at the same time will also have high intensity. The latent variable representing that redundant information (as a function of time) is what we call "daytime".
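A toy sketch of the redundancy idea, with everything in it assumed for illustration: several simulated light sensors all see the same hypothetical "sun" signal plus independent local noise. Their readings correlate strongly with one another, and a crude latent variable (here simply the across-sensor average, which is one simple choice among many) recovers the shared signal. The sensor model, the noise level, and the 0.5 threshold are made up.

```python
import math
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def simulate_sensors(n_sensors=5, n_hours=72, noise=0.1, seed=0):
    """Toy model: every sensor sees the same sun, plus independent local noise."""
    rng = random.Random(seed)
    # Hypothetical sun signal: positive half of a 24-hour sine wave, zero at night.
    sun = [max(0.0, math.sin(2 * math.pi * (h - 6) / 24)) for h in range(n_hours)]
    readings = [[s + rng.gauss(0, noise) for s in sun] for _ in range(n_sensors)]
    return sun, readings

sun, readings = simulate_sensors()

# Redundancy: any two sensors are strongly correlated with each other.
pairwise = [
    pearson(readings[i], readings[j])
    for i in range(len(readings))
    for j in range(i + 1, len(readings))
]

# Crude latent estimate: average across sensors at each hour.
latent = [sum(column) / len(column) for column in zip(*readings)]

# Arbitrary illustrative threshold for calling an hour "daytime".
is_daytime = [value > 0.5 for value in latent]
```

Nothing in the averaged variable refers to a particular place: it summarizes information that could have been read off any of the sensors, which is the sense of "redundant" used above.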

Thanks! Follow-up question: Do you see yourself as working towards “Proof Strategy 2”? Or “none of the above”?

This part of Proof Strategy 1 is a basically-accurate description of what I'm working towards:

We try to come up with an unambiguous definition of what [things] are, and from there we can have an unambiguous language for specifying what we want to happen in the world. We also somehow translate (or constrain) the AGI’s understanding of the world into that language, and now we can prove theorems about what the AGI is trying to do.

... it's just not necessarily about objects localized in 3D space.

Also, there's several possible paths, and they don't all require unambiguous definitions of all the "things" in a human's ontology. For instance, if corrigibility turns out to be a natural "thing", that could short-circuit the need for a bunch of other rigorous concepts.

Thanks! One of my current sources of mild skepticism right now (which again you might talk me out of) is:

- For capabilities reasons, the AGI will probably need to be able to add things to its world-model / ontology, including human-illegible things, and including things that don't exist in the world but which the AGI imagines (and could potentially create).
- If the AGI is entertaining a plan of changing the world in important ways (e.g. inventing and deploying mind-upload technology, editing its own code, etc.), it seems likely that the only good way of evaluating whether it's a good plan would involve having opinions about features of the future world that the plan would bring about, as opposed to basing the evaluation purely on current-world-features of the plan, like the process by which it was made.
- …And in that case, it's not sufficient to have rigorous concepts / things that apply in our world, but rather we need to be able to pick those concepts / things out of any possible future world that the AGI might bring about.
- I'm mildly skeptical that we can find / define such concepts / things, especially for things that we care about like “corrigibility”.
- …And thus the story needs something along the lines of out-of-distribution edge-case detection and handling systems like Section 14.4.

I expect that there will be concepts the AI finds useful which humans don't already understand. But these concepts should still be of the same type as human concepts - they're still the same kind of natural abstraction. Analogy: a human who grew up in a desert tribe with little contact with the rest of the world may not have any concept of "snow", but snow is still the kind-of-thing they're capable of understanding if they're ever exposed to it. When the AI uses concepts humans don't already have, I expect them to be like that.

As long as the concepts are the type of thing humans can recognize/understand, then it should be conceptually straightforward to model how humans would reason about those concepts or value them.

should be conceptually straightforward to model how humans would reason about those concepts or value them

Let’s say that the concept of an Em had never occurred to me before, and now you knock on my door and tell me that there’s a thing called Ems, and you know how to make them but you need my permission, and now I have to decide whether or not I care about the well-being of Ems. What do I do? I dunno, I would think about the question in different ways, I would try to draw analogies to things I already knew about, maybe I would read some philosophy papers, and most of all I would be implicitly probing my own innate "caring" reaction(s) and seeing exactly what kinds of thoughts do or don't trigger it.

Can we make an AGI that does all that? I say yes: we can build an AGI with human-like “innate drives” such that it has human-like moral intuitions, and then it applies those human-like intuitions in a human-like way when faced with new out-of-distribution situations. That’s what I call the “Social-Instinct AGI” research path, see Post #12.

But if we can do that, we’ve already arguably solved the whole AGI safety problem. I suspect you have something different in mind?

We don't necessarily need the AGI itself to have human-like drives, intuitions, etc. It just needs to be able to model the human reasoning algorithm well enough to figure out what values humans assign to e.g. an em.

(I expect an AI which relied heavily on human-like reasoning for things other than values would end up doing something catastrophically stupid, much as humans are prone to do.)

I don’t think “the human is deciding whether or not she cares about Ems” is a different set of mental activities from “the human is trying to make sense of a confusing topic”, or “the human is trying to prove a theorem”, etc.

So from my perspective, what you said sounds like “Write code for a Social-Instinct AGI, and then stamp the word subroutine on that code, and then make an “outer AI” with the power to ‘query’ that ‘subroutine’.” From that perspective, I would be concerned that if the (so-called) subroutine never wanted to do anything bad or stupid, then the outer AI is redundant, and if the (so-called) subroutine did want to do something bad or stupid, then the outer AI may not be able to recognize and stop it.

Separately, shouldn't “doing something catastrophically stupid” become progressively less of an issue as the AGI gets “smarter”? And insofar as caution / risk-aversion / etc. is a personality type, presumably we could put a healthy dose of it into our AGIs.

An example might be helpful here: consider the fusion power generator scenario. In that scenario, a human thinking about what they want arrives at the wrong answer, not because of uncertainty about their own values, but because they don't think to ask the right questions about how the world works. That's the sort of thing I have in mind.

In order to handle that sort of problem, an AI has to be able to use human values somehow without carrying over other specifics of how a human would reason about the situation.

I don’t think “the human is deciding whether or not she cares about Ems” is a different set of mental activities from “the human is trying to make sense of a confusing topic”, or “the human is trying to prove a theorem”, etc.

I think I disagree with this claim. Maybe not exactly as worded - like, sure, maybe the "set of mental activities" involved in the reasoning overlap heavily. But I do expect (weakly, not confidently) that there's a natural notion of human-value-generator which factors from the rest of human reasoning, and has a non-human-specific API (e.g. it interfaces with natural abstractions).

So from my perspective, what you said sounds like “Write code for a Social-Instinct AGI, and then stamp the word subroutine on that code, and then make an “outer AI” with the power to ‘query’ that ‘subroutine’.”

It sounds to me like you're imagining something which emulates human reasoning to a much greater extent than I'm imagining.

consider the fusion power generator scenario

It's possible that I misunderstood what you were getting at in that post. I thought delegation-to-GPT-N was a central part of the story: i.e., maybe GPT-N knew that the designs could be used for bombs, but it didn't care to tell the human, because the human didn't ask. But from what you're saying now, I guess GPT-N has nothing to do with the story? You could have equally well written the post as “Suppose, a few years from now, I set about trying to design a cheap, simple fusion power generator - something I could build in my garage and use to power my house. After years of effort, I succeed….” Is that correct?

If so, I think that’s a problem that can be mitigated in mundane ways (e.g. mandatory inventor training courses spreading best-practices for brainstorming unanticipated consequences, including red-teams, structured interviews, etc.), but can't be completely solved by humans. But it also can’t be completely solved by any possible AI, because AIs aren’t and will never be omniscient, and hence may make mistakes or overlook things, just as humans can.

Maybe you're thinking that we can make AIs that are less prone to human foibles like wishful thinking and intellectual laziness etc.? But I’m optimistic that we can make “social instinct” brain-like AGIs that are also unusually good at avoiding those things (after all, some humans are significantly better than others at avoiding those things, while still having normal-ish social instincts and moral intuitions).

I thought delegation-to-GPT-N was a central part of the story: i.e., maybe GPT-N knew that the designs could be used for bombs, but it didn't care to tell the human, because the human didn't ask. But from what you're saying now, I guess GPT-N has nothing to do with the story?

Basically, yeah.

The important point (for current purposes) is that, as the things-the-system-is-capable-of-doing-or-building scale up, we want the system's ability to notice subtle problems to scale up with it. If the system is capable of designing complex machines way outside what humans know how to reason about, then we need similarly-superhuman reasoning about whether those machines will actually do what a human intends. "With great power comes great responsibility" - cheesy, but it fits.

Proof strategy #1 starts with the idea that we live in a three-dimensional world containing objects and so on. We try to come up with an unambiguous definition of what those objects are, and from there we can have an unambiguous language for specifying what we want to happen in the world. We also somehow translate (or constrain) the AGI’s understanding of the world into that language, and now we can prove theorems about what the AGI is trying to do.
This is my tentative understanding of what John Wentworth is trying to do via his Natural Abstraction Hypothesis research program (most recent update here), and I’ve heard ideas in this vicinity from a couple other people as well.
I’m skeptical because a 3D world of localized objects seems to be an unpromising starting point for stating and proving useful theorems about the AGI’s motivations. After all, a lot of things that we humans care about, and that the AGI needs to care about, seem difficult to describe in terms of a 3D world of localized objects—consider the notion of “honesty”, or “solar cell efficiency”, or even “daytime”.

This approach is probably particularly characteristic of my approach. I've perhaps overstated the similarity of my approach to John Wentworth's 😅 - I think that much of his research is useful to my approach, but there's also lots of positions of disagreement. But I suppose everyone finds his research ultra-promising.

A couple of notes:

I think even if my approach doesn't work out as the sole solution, it seems plausibly complementary to other approaches, including yours. For instance, if you don't do the sort of ontological lock that I'm advocating, then you tend to end up struggling with some basic symbol-reality distinction, e.g. you're likely to associate pictures of happy people with the concept of "happiness", so a happiness maximizer might end up tiling the world with pictures of happy people. My approach can avoid that for free (though the flipside is that it would likely not consider e.g. ems to be people unless explicitly programmed so. but that could probably be achieved.).

I think concepts like "solar cell efficiency" might be very achievable to define by my approach. If you have a clean 3D ontology, you can isolate an object like a solar panel in that ontology, and then counterfactually ask how it would perform under various conditions. So you could say "well how would this object perform if standard sunlight hit it under standard atmospheric conditions? how much power would it produce? would it produce any problematic pollution? etc.". You could be very precise about this.
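A minimal sketch of what such a counterfactual query could look like, assuming the hard part (isolating the panel as an object whose behavior can be simulated) is already done. `PanelModel`, `make_toy_panel`, and the toy physics inside it are hypothetical placeholders. The standard test conditions used here (1000 W/m² irradiance, 25 °C cell temperature) are the usual convention for rating photovoltaic modules; the sketch ignores the spectrum part of that convention.

```python
from dataclasses import dataclass
from typing import Callable

STC_IRRADIANCE_W_PER_M2 = 1000.0  # standard test condition irradiance
STC_CELL_TEMP_C = 25.0            # standard test condition cell temperature

@dataclass(frozen=True)
class PanelModel:
    """Hypothetical stand-in for an object isolated in a clean 3D ontology."""
    area_m2: float
    # Electrical power out (W) given irradiance (W/m^2) and cell temperature (C).
    power_out_w: Callable[[float, float], float]

def efficiency_under(panel: PanelModel, irradiance: float, temp_c: float) -> float:
    """Counterfactual query: output power divided by incident light power."""
    incident_w = irradiance * panel.area_m2
    return panel.power_out_w(irradiance, temp_c) / incident_w

def make_toy_panel(area_m2: float = 1.6) -> PanelModel:
    """Made-up physics: 20% conversion at 25 C, losing 0.4% per degree above that."""
    def power(irradiance: float, temp_c: float) -> float:
        derating = 1.0 - 0.004 * (temp_c - STC_CELL_TEMP_C)
        return 0.20 * irradiance * area_m2 * derating
    return PanelModel(area_m2=area_m2, power_out_w=power)

panel = make_toy_panel()
rated = efficiency_under(panel, STC_IRRADIANCE_W_PER_M2, STC_CELL_TEMP_C)
hot_day = efficiency_under(panel, STC_IRRADIANCE_W_PER_M2, 45.0)
```

The definition of "efficiency" lives entirely in `efficiency_under` and the chosen conditions, so it stays precise no matter what object is plugged in. That precision is only as good as the step that maps the world onto a `PanelModel` in the first place.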

... which is of course a curse as much as it is a blessing, e.g. you might not want a precise definition of "daytime", and it might not be possible for people to write down a precise definition of "honesty".

This approach is probably particularly characteristic of my approach.

Yeah, you were one of the “couple other people” I alluded to. The other was Tan Zhi-Xuan (if I was understanding her correctly during our most recent (very brief) conversation).

my approach … ontological lock …

I think I know what you’re referring to, but I’m not 100% sure, and other people reading this probably won’t. Can you provide a link? Thanks.

Yeah, you were one of the “couple other people” I alluded to. The other was Tan Zhi-Xuan (if I was understanding her correctly during our most recent (very brief) conversation).

🤔 I wonder if I should talk with Tan Zhi-Xuan.

I think I know what you’re referring to, but I’m not 100% sure, and other people reading this probably won’t. Can you provide a link? Thanks.

I got the phrase "ontological lock" from adamShimi's post here, but it only comes up very briefly, so it is not helpful for understanding what I mean and is sort of also me assuming that adamShimi meant the same as I did. 😅 I'm not sure if it's a term used elsewhere.

What I mean is forcing the AI to have a specific ontology, such as things embedded in 3D space, so you can directly programmatically interface with the AI's ontology, rather than having to statistically train an interface (which would lead to problems with distribution shift and such).

Re the human flourishing example - it seems to me that a better choice of thought assessor / ultimate value is "Does this tend to increase the total subjective utility (weighted by amount of consciousness) of all sentient beings?" It's simple, relies on probably-natural abstractions (utility, consciousness, sentient, agents), does not rely on arbitrary things that are hard to define like what exactly a "human" is, and I think most human morals (at least of the second order want-to-want kind) fall straight out of it.

Defining the utility function of an arbitrary agent is an issue, of course, but if an entity does not have coherent desires, their subagents could perhaps be factored into it, with moral relevance equal to that of the whole being multiplied by the "proportion" of their mind controlled by that subagent. But perhaps this is just CEV again. Actually, given that animals don't particularly care about (or know about) the concept of uplifting and yet I consider it a moral imperative, I must actually want CEV after all. Heh.

There are some potential failures here of course. For instance, the AGI may inappropriately believe that agents exist which really do not - humans do this all the time, and take them into account in moral calculations - spirits for instance! Well, of course, spirits do exist, but only as self-replicating [via proselytization etc] subagents in the human brain, not as external entities with consciousness of their own, and have minimal moral relevance. But it's probably possible to constrain it to only those entities which have a known, bounded physical location (allowing for such notions of "bounding" as would be necessary to locate a highly dispersed digital entity in space...), or some such thing.

Ultimately though, this is just a special case of the social instincts thing. I would just want it to be hardwired to feel things like lovingkindness, compassion, and sympathetic joy for all sentient beings, not just humans. A bodhisattva, in other words. :)

I agree that if we can make an AGI motivated by an arbitrary English-language sentence, “maximize human flourishing” is probably not the optimal choice. I was using that as an example / placeholder. As mentioned, I’m more interested in the other question, i.e. how do we make an AGI motivated by an arbitrary English-language sentence?

Also, my hunch is that, the more complicated the sentence, and the harder it is to find salient concrete examples of it, then the harder and more fraught it would be to make an AGI motivated by it. In that respect, “maximize human flourishing” would probably have an edge over “Does this tend to increase the total subjective utility (weighted by amount of consciousness) of all sentient beings?”, or CEV, etc.

Hmm. I'm not sure if I believe that. But I get what you mean. To me, English language sentences seem like they rely for their meaning on the life experience of English speakers and have far more complexity than they appear to have. Example: try to rigorously define "woman" in a way every English speaker would agree on. It's very hard if not impossible.

As a result, I prefer trying to think of utility functions that at least in principle can be made mathematically rigorous. I think my example is actually far simpler than "maximize human flourishing", in other words. And I really don't want a difference in interpretation of words to lead to misalignment. But perhaps I misunderstand you and you have some notion that there's a way around that problem?

To a first approximation, human motivations involve having a learned world-model, and then some things in the world-model get painted with positive valence (a.k.a. help push the value function higher). For example, if I'm in debt, I can kinda imagine myself being out of debt, and that mental image has a very positive valence (it's an appealing thought!), and that positive valence in turn helps motivate me to make plans and take actions to bring that about. See Post #7 for a more fleshed-out example.

Nowhere in this picture has anything been made mathematically rigorous. Nowhere in this picture has anyone defined a utility function. Yet, humans are obviously capable of doing very impressive things. I assume that (by default) future programmers will make AGI motivations that work in similar ways.

If we could figure out how to make and implement a rigorously-defined utility function such that the AGI does the things we want it to do, that would be ridiculously awesome. But I don’t know how. That is the topic of Section 14.5.

The problem is that the steering subsystem does not have a world model and can't directly refer to anything in a learned world model. Insofar as we want to design a steering system to serve a particular goal, then, we have to design it in such a way that it doesn't have to have any particular learned world model at all in order to recognize what behaviors move it towards versus further away from that goal.

Example: "am I eating sugar? if so, reward!" is a good steering mechanism, as a presumably simple algorithm in the brainstem is capable of recognizing whether sugar is being eaten or not, and correcting thought assessors appropriately. But, "is this increasing human flourishing? if so, reward!" is not, as I have no idea how to pick out what in the learned world model of the AGI corresponds to "human flourishing".

But if we can mathematically define agency, consciousness, etc, then it might be possible to make a cascade of steering mechanisms in the "brain stem" that will make the AGI tend to pay attention to things that might be agents, tend to try to determine how conscious they are, tend to try to determine what they want, tend to take actions that give them what they want, etc, in such a way that it can learn in real time how best to do any of those things and we don't have to worry what its world model actually looks like, as it will never contradict the definitions of those important concepts that we hardcoded for it. Does that make sense, and if so am I missing anything important?

I have no idea how to pick out what in the learned world model of the AGI corresponds to "human flourishing".

Here’s a lousy way, but one that has more than zero chance of working, given a good deal more thought and assuming we can get past the various problems in Sections 14.3-14.4. The AGI watches lots of YouTube videos. Humans label the videos, second-by-second, when there are good examples of human flourishing, and/or when someone literally speaks the words “human flourishing”. These labels are used as supervisory signals that update a “human flourishing” thought assessor. That thought assessor would presumably wind up most strongly linked to the “human flourishing” world-model concept if any (and also somewhat linked to related concepts like happiness and love and wisdom and whatnot). Then we deploy the AGI, giving it reward in proportion to how strongly each thought it thinks activates the “human flourishing” thought assessor.
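The labeling scheme above can be sketched as a tiny supervised learner. Everything here is an assumption made for illustration: a "thought" is reduced to a short vector of world-model concept activations, the assessor is plain logistic regression trained by stochastic gradient steps (the comment does not specify a learning rule), and the labeled data is synthetic. `CONCEPTS`, `ThoughtAssessor`, `labeled_second`, and `reward` are hypothetical names.

```python
import math
import random

# Hypothetical world-model concepts whose activations make up a "thought".
CONCEPTS = ["human_flourishing", "happiness", "traffic_jam"]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

class ThoughtAssessor:
    """Logistic-regression stand-in for a supervised thought assessor."""

    def __init__(self, n_features, lr=0.5):
        self.w = [0.0] * n_features
        self.b = 0.0
        self.lr = lr

    def score(self, thought):
        return sigmoid(sum(w * x for w, x in zip(self.w, thought)) + self.b)

    def update(self, thought, label):
        # One supervised gradient step toward the human-provided label.
        error = label - self.score(thought)
        self.w = [w + self.lr * error * x for w, x in zip(self.w, thought)]
        self.b += self.lr * error

def labeled_second(rng):
    """Synthetic stand-in for one human-labeled second of video."""
    label = rng.random() < 0.5
    if label:  # flourishing concept strongly active, happiness often active
        thought = [rng.uniform(0.8, 1.0), rng.uniform(0.4, 1.0), rng.random()]
    else:
        thought = [rng.uniform(0.0, 0.2), rng.uniform(0.0, 0.6), rng.random()]
    return thought, float(label)

rng = random.Random(0)
assessor = ThoughtAssessor(n_features=len(CONCEPTS))
for _ in range(2000):
    thought, label = labeled_second(rng)
    assessor.update(thought, label)

def reward(thought):
    """Deployment: reward in proportion to how strongly the assessor fires."""
    return assessor.score(thought)

flourishing_thought = [0.9, 0.7, 0.5]
mundane_thought = [0.1, 0.3, 0.5]
```

In this toy setup the assessor ends up weighting the flourishing concept more heavily than the merely correlated happiness concept. Nothing in the sketch says how the assessor behaves on thoughts unlike the training data, which is where the problems in §14.4 come in.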

It might be possible to make a cascade of steering mechanisms in the "brain stem" that will make the AGI tend to pay attention to things that might be agents, tend to try to determine how conscious they are, tend to try to determine what they want, tend to take actions that give them what they want, etc, in such a way that it can learn in real time how best to do any of those things and we don't have to worry what its world model actually looks like, as it will never contradict the definitions of those important concepts that we hardcoded for it. Does that make sense, and if so am I missing anything important?

That sounds lovely, but I have no idea how one would write code for any of the things you mention. You should figure it out and then tell me :-P

Your human flourishing example sounds like it wouldn't generalize well. As the AI's capacities grow stronger it would start taking more and more work for humans to analyze its plans and determine how much flourishing is in them, and if it grows more intelligent after we deploy it we will have no way to determine if its thought assessor generalizes wrongly. This is, I would think, a rather basic and obvious flaw in relying on any part of the world model directly.

As for how to code that stuff, well, I'll figure out how to do that after we've all figured out how to mathematically specify those things. :P

it would start taking more and more work for humans to analyze its plans and determine how much flourishing is in them

I’m not sure where you’re getting that. The thing I described in my last comment did not include the humans analyzing the AI’s plans, it only involved the humans labeling YouTube videos.

It would be lovely if humans could reliably analyze the AI’s plans. But I fear that our interpretability techniques will not be up to that challenge.

we will have no way to determine if its thought assessor generalizes wrongly

I agree, see §14.4.

Ah, sorry, I misunderstood you.

Great post!

Re: the 1st person problem, if we're thinking of prosaic alignment solutions, a promising one to me is showing the AI labeled videos of itself doing various things, along with whether those things were honest or not.

I think this is basically how I as a human perceive my sense of self? I don't think I have a good pointer to myself (e.g. out-of-body experiences highlight the difference between my physical body and my mind), but I do have a good pointer to what my friends would describe as myself. In that same way, it seems sort of reasonable to train an AI to define "I am being honest" as "AI Joe exists [and happens to be me], my goal is to maximize the probability that humans who see AI Joe taking action X would say that AI Joe is being honest".

Then all that remains is showing the AI lots of different situations in which it takes actions along with human labels that "AI Joe just took that action". Insofar as humans know what constitutes the AI, it seems like the AI could figure out the same definition?

Proof strategy #1 starts with the idea that we live in a three-dimensional world containing objects and so on. We try to come up with an unambiguous definition of what those objects are, and from there we can have an unambiguous language for specifying what we want to happen in the world. We also somehow translate (or constrain) the AGI’s understanding of the world into that language, and now we can prove theorems about what the AGI is trying to do.
This is my tentative understanding of what John Wentworth is trying to do via his Natural Abstraction Hypothesis research program (most recent update here), and I’ve heard ideas in this vicinity from a couple other people as well.
I’m skeptical because a 3D world of localized objects seems to be an unpromising starting point for stating and proving useful theorems about the AGI’s motivations. After all, a lot of things that we humans care about, and that the AGI needs to care about, seem difficult to describe in terms of a 3D world of localized objects—consider the notion of “honesty”, or “solar cell efficiency”, or even “daytime”.

Thanks! Follow-up question: Do you see yourself as working towards “Proof Strategy 2”? Or “none of the above”?

This part of Proof Strategy 1 is a basically-accurate description of what I'm working towards:

We try to come up with an unambiguous definition of what [things] are, and from there we can have an unambiguous language for specifying what we want to happen in the world. We also somehow translate (or constrain) the AGI’s understanding of the world into that language, and now we can prove theorems about what the AGI is trying to do.

... it's just not necessarily about objects localized in 3D space.

Thanks! One of my sources of mild skepticism right now (which again you might talk me out of) is:

- For capabilities reasons, the AGI will probably need to be able to add things to its world-model / ontology, including human-illegible things, and including things that don't exist in the world but which the AGI imagines (and could potentially create).
- If the AGI is entertaining a plan of changing the world in important ways (e.g. inventing and deploying mind-upload technology, editing its own code, etc.), it seems likely that the only good way of evaluating whether it's a good plan would involve having opinions about features of the future world that the plan would bring about, as opposed to basing the evaluation purely on current-world-features of the plan, like the process by which it was made.
- …And in that case, it's not sufficient to have rigorous concepts / things that apply in our world, but rather we need to be able to pick those concepts / things out of any possible future world that the AGI might bring about.
- I'm mildly skeptical that we can find / define such concepts / things, especially for things that we care about like “corrigibility”.
- …And thus the story needs something along the lines of out-of-distribution edge-case detection and handling systems like Section 14.4.

As long as the concepts are the type of thing humans can recognize/understand, then it should be conceptually straightforward to model how humans would reason about those concepts or value them.

should be conceptually straightforward to model how humans would reason about those concepts or value them

But if we can do that, we’ve already arguably solved the whole AGI safety problem. I suspect you have something different in mind?

(I expect an AI which relied heavily on human-like reasoning for things other than values would end up doing something catastrophically stupid, much as humans are prone to do.)

In order to handle that sort of problem, an AI has to be able to use human values somehow without carrying over other specifics of how a human would reason about the situation.

I don’t think “the human is deciding whether or not she cares about Ems” is a different set of mental activities from “the human is trying to make sense of a confusing topic”, or “the human is trying to prove a theorem”, etc.

So from my perspective, what you said sounds like “Write code for a Social-Instinct AGI, and then stamp the word ‘subroutine’ on that code, and then make an ‘outer AI’ with the power to ‘query’ that ‘subroutine’.”

It sounds to me like you're imagining something which emulates human reasoning to a much greater extent than I'm imagining.

consider the fusion power generator scenario

I thought delegation-to-GPT-N was a central part of the story: i.e., maybe GPT-N knew that the designs could be used for bombs, but it didn't care to tell the human, because the human didn't ask. But from what you're saying now, I guess GPT-N has nothing to do with the story?

Basically, yeah.

Proof strategy #1 starts with the idea that we live in a three-dimensional world containing objects and so on. We try to come up with an unambiguous definition of what those objects are, and from there we can have an unambiguous language for specifying what we want to happen in the world. We also somehow translate (or constrain) the AGI’s understanding of the world into that language, and now we can prove theorems about what the AGI is trying to do.
This is my tentative understanding of what John Wentworth is trying to do via his Natural Abstraction Hypothesis research program (most recent update here), and I’ve heard ideas in this vicinity from a couple other people as well.
I’m skeptical because a 3D world of localized objects seems to be an unpromising starting point for stating and proving useful theorems about the AGI’s motivations. After all, a lot of things that we humans care about, and that the AGI needs to care about, seem difficult to describe in terms of a 3D world of localized objects—consider the notion of “honesty”, or “solar cell efficiency”, or even “daytime”.

A couple of notes:

This approach is probably particularly characteristic of my approach.

Yeah, you were one of the “couple other people” I alluded to. The other was Tan Zhi-Xuan (if I was understanding her correctly during our most recent (very brief) conversation).

my approach … ontological lock …

I think I know what you’re referring to, but I’m not 100% sure, and other people reading this probably won’t. Can you provide a link? Thanks.

Yeah, you were one of the “couple other people” I alluded to. The other was Tan Zhi-Xuan (if I was understanding her correctly during our most recent (very brief) conversation).

🤔 I wonder if I should talk with Tan Zhi-Xuan.

I think I know what you’re referring to, but I’m not 100% sure, and other people reading this probably won’t. Can you provide a link? Thanks.

I have no idea how to pick out what in the learned world model of the AGI corresponds to "human flourishing".

It might be possible to make a cascade of steering mechanisms in the "brain stem" that will make the AGI tend to pay attention to things that might be agents, tend to try to determine how conscious they are, tend to try to determine what they want, tend to take actions that give them what they want, etc, in such a way that it can learn in real time how best to do any of those things and we don't have to worry what its world model actually looks like, as it will never contradict the definitions of those important concepts that we hardcoded for it. Does that make sense, and if so am I missing anything important?

That sounds lovely, but I have no idea how one would write code for any of the things you mention. You should figure it out and then tell me :-P

As for how to code that stuff, well, I'll figure out how to do that after we've all figured out how to mathematically specify those things. :P

it would start taking more and more work for humans to analyze its plans and determine how much flourishing is in them

I’m not sure where you’re getting that. The thing I described in my last comment did not involve humans analyzing the AI’s plans; it only involved humans labeling YouTube videos.

It would be lovely if humans could reliably analyze the AI’s plans. But I fear that our interpretability techniques will not be up to that challenge.

we will have no way to determine if its thought assessor generalizes wrongly

I agree, see §14.4.

Ah, sorry, I misunderstood you.

Great post!
