# Embedded Agency (full-text version)

15th Nov 2018

Suppose you want to build a robot to achieve some real-world goal for you—a goal that requires the robot to learn for itself and figure out a lot of things that you don't already know.

There's a complicated engineering problem here. But there's also a problem of figuring out what it even means to build a learning agent like that. What is it to optimize realistic goals in physical environments? In broad terms, how does it work?

In this post, I’ll point to four ways we don’t currently know how it works, and four areas of active research aimed at figuring it out.

### 1. Embedded agents

This is Alexei, and Alexei is playing a video game.

Like most games, this game has clear input and output channels. Alexei only observes the game through the computer screen, and only manipulates the game through the controller.

The game can be thought of as a function which takes in a sequence of button presses and outputs a sequence of pixels on the screen.

Alexei is also very smart, and capable of holding the entire video game inside his mind. If Alexei has any uncertainty, it is only over empirical facts like what game he is playing, and not over logical facts like which inputs (for a given deterministic game) will yield which outputs. This means that Alexei must also store inside his mind every possible game he could be playing.

Alexei does not, however, have to think about himself. He is only optimizing the game he is playing, and not optimizing the brain he is using to think about the game. He may still choose actions based on value of information, but this is only to help him rule out possible games he is playing, and not to change the way in which he thinks.

In fact, Alexei can treat himself as an unchanging indivisible atom. Since he doesn't exist in the environment he's thinking about, Alexei doesn't worry about whether he'll change over time, or about any subroutines he might have to run.

Notice that all the properties I talked about are partially made possible by the fact that Alexei is cleanly separated from the environment that he is optimizing.

This is Emmy. Emmy is playing real life.

Real life is not like a video game. The differences largely come from the fact that Emmy is within the environment that she is trying to optimize.

Alexei sees the universe as a function, and he optimizes by choosing inputs to that function that lead to greater reward than any of the other possible inputs he might choose. Emmy, on the other hand, doesn't have a function. She just has an environment, and this environment contains her.

Emmy wants to choose the best possible action, but which action Emmy chooses to take is just another fact about the environment. Emmy can reason about the part of the environment that is her decision, but since there's only one action that Emmy ends up actually taking, it’s not clear what it even means for Emmy to “choose” an action that is better than the rest.

Alexei can poke the universe and see what happens. Emmy is the universe poking itself. In Emmy’s case, how do we formalize the idea of “choosing” at all?

To make matters worse, since Emmy is contained within the environment, Emmy must also be smaller than the environment. This means that Emmy is incapable of storing accurate detailed models of the environment within her mind.

This causes a problem: Bayesian reasoning works by starting with a large collection of possible environments, and as you observe facts that are inconsistent with some of those environments, you rule them out. What does reasoning look like when you're not even capable of storing a single valid hypothesis for the way the world works? Emmy is going to have to use a different type of reasoning, and make updates that don't fit into the standard Bayesian framework.

Since Emmy is within the environment that she is manipulating, she is also going to be capable of self-improvement. But how can Emmy be sure that as she learns more and finds more and more ways to improve herself, she only changes herself in ways that are actually helpful? How can she be sure that she won’t modify her original goals in undesirable ways?

Finally, since Emmy is contained within the environment, she can’t treat herself like an atom. She is made out of the same pieces that the rest of the environment is made out of, which is what causes her to be able to think about herself.

In addition to hazards in her external environment, Emmy is going to have to worry about threats coming from within. While optimizing, Emmy might spin up other optimizers as subroutines, either intentionally or unintentionally. These subsystems can cause problems if they get too powerful and are unaligned with Emmy’s goals. Emmy must figure out how to reason without spinning up intelligent subsystems, or otherwise figure out how to keep them weak, contained, or aligned fully with her goals.

#### 1.1. Dualistic agents

Emmy is confusing, so let’s go back to Alexei. Marcus Hutter’s AIXI framework gives a good theoretical model for how agents like Alexei work:

The model has an agent and an environment that interact using actions, observations, and rewards. The agent sends out an action $a$, and then the environment sends out both an observation $o$ and a reward $r$. This process repeats at each time step.

Each action is a function of all the previous action-observation-reward triples. And each observation and reward is similarly a function of these triples and the immediately preceding action.
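As a rough illustration of this loop, here is a minimal Python sketch. The names `run_interaction`, `toy_agent`, and `toy_environment` are hypothetical and are not part of AIXI; the toy policy and reward are assumptions chosen only to make the data flow visible.

```python
from typing import Callable, List, Tuple

# One round of interaction: (action, observation, reward).
Triple = Tuple[int, int, float]
History = List[Triple]


def run_interaction(
    agent: Callable[[History], int],
    environment: Callable[[History, int], Tuple[int, float]],
    steps: int,
) -> History:
    """Alternate agent and environment moves for a fixed number of steps."""
    history: History = []
    for _ in range(steps):
        # The action depends only on the previous triples.
        action = agent(history)
        # Observation and reward depend on those triples plus the new action.
        observation, reward = environment(history, action)
        history.append((action, observation, reward))
    return history


def toy_agent(history: History) -> int:
    """Hypothetical policy: repeat the last action if it was rewarded, else switch."""
    if not history:
        return 0
    last_action, _, last_reward = history[-1]
    return last_action if last_reward > 0 else 1 - last_action


def toy_environment(history: History, action: int) -> Tuple[int, float]:
    """Hypothetical environment: echoes the action and rewards action 1."""
    return action, 1.0 if action == 1 else 0.0


trace = run_interaction(toy_agent, toy_environment, steps=4)
```

Note that the agent and the environment are separate functions that only touch through the arguments they pass each other. That clean separation is exactly the dualistic assumption.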

You can imagine an agent in this framework that has full knowledge of the environment that it’s interacting with. However, AIXI is used to model optimization under uncertainty about the environment. AIXI has a distribution over all possible computable environments, and chooses actions that lead to a high expected reward under this distribution. Since it also cares about future reward, this may lead to exploring for value of information.

Under some assumptions, we can show that AIXI does reasonably well in all computable environments, in spite of its uncertainty. However, while the environments that AIXI is interacting with are computable, AIXI itself is uncomputable. The agent is made out of a different sort of stuff, a more powerful sort of stuff, than the environment.

We will call agents like AIXI and Alexei “dualistic.” They exist outside of their environment, with only set interactions between agent-stuff and environment-stuff. They require the agent to be larger than the environment, and don't tend to model self-referential reasoning, because the agent is made of different stuff than what the agent reasons about.

AIXI is not alone. These dualistic assumptions show up all over our current best theories of rational agency.

I set up AIXI as a bit of a foil, but AIXI can also be used as inspiration. When I look at AIXI, I feel like I really understand how Alexei works. This is the kind of understanding that I want to also have for Emmy.

Unfortunately, Emmy is confusing. When I talk about wanting to have a theory of “embedded agency,” I mean I want to be able to understand theoretically how agents like Emmy work. That is, agents that are embedded within their environment and thus:

• do not have well-defined i/o channels;
• are smaller than their environment;
• are able to reason about themselves and self-improve;
• and are made of parts similar to the environment.

You shouldn’t think of these four complications as a partition. They are very entangled with each other.

For example, the reason the agent is able to self-improve is because it is made of parts. And any time the environment is sufficiently larger than the agent, it might contain other copies of the agent, and thus destroy any well-defined i/o channels.

However, I will use these four complications to inspire a split of the topic of embedded agency into four subproblems. These are: decision theory, embedded world-models, robust delegation, and subsystem alignment.

#### 1.2. Embedded subproblems

Decision theory is all about embedded optimization.

The simplest model of dualistic optimization is $\operatorname{argmax}$. $\operatorname{argmax}$ takes in a function from actions to rewards, and returns the action which leads to the highest reward under this function. Most optimization can be thought of as some variant on this. You have some space; you have a function from this space to some score, like a reward or utility; and you want to choose an input that scores highly under this function.
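A minimal sketch of this in Python may help make the type signature concrete. The action names and reward values below are illustrative assumptions, not anything from a particular decision problem.

```python
from typing import Callable, Iterable, TypeVar

Action = TypeVar("Action")


def argmax(actions: Iterable[Action], reward: Callable[[Action], float]) -> Action:
    """Return the action with the highest reward under the given function."""
    return max(actions, key=reward)


# Hypothetical reward function, handed to the optimizer from outside.
rewards = {"left": 1.0, "right": 3.0, "wait": 2.0}
best = argmax(rewards.keys(), lambda action: rewards[action])
```

The point to notice is that `reward` is supplied as a finished function of the action. The optimizer never has to ask where that function came from or whether it is itself part of what the function describes.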

But we just said that a large part of what it means to be an embedded agent is that you don’t have a functional environment. So now what do we do? Optimization is clearly an important part of agency, but we can’t currently say what it is even in theory without making major type errors.

Some major open problems in decision theory include:

• logical counterfactuals: how do you reason about what would happen if you take action B, given that you can prove that you will instead take action A?
• environments that include multiple copies of the agent, or trustworthy predictions of the agent.
• logical updatelessness, which is about how to combine the very nice but very Bayesian world of Wei Dai’s updateless decision theory, with the much less Bayesian world of logical uncertainty.

Embedded world-models is about how you can make good models of the world that are able to fit within an agent that is much smaller than the world.

This has proven to be very difficult—first, because it means that the true universe is not in your hypothesis space, which ruins a lot of theoretical guarantees; and second, because it means we’re going to have to make non-Bayesian updates as we learn, which also ruins a bunch of theoretical guarantees.

It is also about how to make world-models from the point of view of an observer on the inside, and resulting problems such as anthropics. Some major open problems in embedded world-models include:

• logical uncertainty, which is about how to combine the world of logic with the world of probability.
• multi-level modeling, which is about how to have multiple models of the same world at different levels of description, and transition nicely between them.
• ontological crises, which is what to do when you realize that your model, or even your goal, was specified using a different ontology than the real world.

Robust delegation is all about a special type of principal-agent problem. You have an initial agent that wants to make a more intelligent successor agent to help it optimize its goals. The initial agent has all of the power, because it gets to decide exactly what successor agent to make. But in another sense, the successor agent has all of the power, because it is much, much more intelligent.

From the point of view of the initial agent, the question is about creating a successor that will robustly not use its intelligence against you. From the point of view of the successor agent, the question is about, “How do you robustly learn or respect the goals of something that is stupid, manipulable, and not even using the right ontology?”

There are extra problems coming from the Löbian obstacle making it impossible to consistently trust things that are more powerful than you.

You can think about these problems in the context of an agent that’s just learning over time, or in the context of an agent making a significant self-improvement, or in the context of an agent that’s just trying to make a powerful tool.

The major open problems in robust delegation include:

• Vingean reflection, which is about how to reason about and trust agents that are much smarter than you, in spite of the Löbian obstacle to trust.
• value learning, which is how the successor agent can learn the goals of the initial agent in spite of that agent’s stupidity and inconsistencies.
• corrigibility, which is about how an initial agent can get a successor agent to allow (or even help with) modifications, in spite of an instrumental incentive not to.

Subsystem alignment is about how to be one unified agent that doesn’t have subsystems that are fighting against either you or each other.

When an agent has a goal, like “saving the world,” it might end up spending a large amount of its time thinking about a subgoal, like “making money.” If the agent spins up a sub-agent that is only trying to make money, there are now two agents that have different goals, and this leads to a conflict. The sub-agent might suggest plans that look like they only make money, but actually destroy the world in order to make even more money.

The problem is: you don’t just have to worry about sub-agents that you intentionally spin up. You also have to worry about spinning up sub-agents by accident. Any time you perform a search or an optimization over a sufficiently rich space that’s able to contain agents, you have to worry about the space itself doing optimization. This optimization may not be exactly in line with the optimization the outer system was trying to do, but it will have an instrumental incentive to look like it’s aligned.

A lot of optimization in practice uses this kind of passing the buck. You don’t just find a solution; you find a thing that is able to itself search for a solution.

In theory, I don’t understand how to do optimization at all—other than methods that look like finding a bunch of stuff that I don’t understand, and seeing if it accomplishes my goal. But this is exactly the kind of thing that’s most prone to spinning up adversarial subsystems.

The big open problem in subsystem alignment is about how to have an outer optimizer that doesn’t spin up adversarial inner optimizers. You can break this problem up further by considering cases where the inner optimizers are either intentional or unintentional, and considering restricted subclasses of optimization, like induction.

But remember: decision theory, embedded world-models, robust delegation, and subsystem alignment are not four separate problems. They’re all different subproblems of the same unified concept that is embedded agency.

### 2. Decision theory

Decision theory and artificial intelligence typically try to compute something resembling

$$\underset{a \,\in\, \mathrm{Actions}}{\operatorname{argmax}} \; f(a).$$

I.e., maximize some function of the action. This tends to assume that we can detangle things enough to see outcomes as a function of actions.

For example, AIXI represents the agent and the environment as separate units which interact over time through clearly defined i/o channels, so that it can then choose actions maximizing reward.

When the agent model is a part of the environment model, it can be significantly less clear how to consider taking alternative actions.

For example, because the agent is smaller than the environment, there can be other copies of the agent, or things very similar to the agent. This leads to contentious decision-theory problems such as the Twin Prisoner's Dilemma and Newcomb's problem.

If Emmy Model 1 and Emmy Model 2 have had the same experiences and are running the same source code, should Emmy Model 1 act like her decisions are steering both robots at once? Depending on how you draw the boundary around "yourself", you might think you control the action of both copies, or only your own.
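To see how much hangs on where the boundary is drawn, here is a small Python sketch of a Twin Prisoner's Dilemma. The payoff numbers and the function names are assumptions made up for illustration; only their ordering matters.

```python
# Hypothetical payoffs for Emmy Model 1, indexed by (my_action, twin_action).
# Higher is better; the numbers are illustrative assumptions.
PAYOFF = {
    ("cooperate", "cooperate"): 2,
    ("cooperate", "defect"): 0,
    ("defect", "cooperate"): 3,
    ("defect", "defect"): 1,
}
ACTIONS = ("cooperate", "defect")


def choose_controlling_only_self(assumed_twin_action: str) -> str:
    """Treat the twin's action as fixed and optimize only my own."""
    return max(ACTIONS, key=lambda mine: PAYOFF[(mine, assumed_twin_action)])


def choose_controlling_both() -> str:
    """Treat the twin as running the same code, so both actions match."""
    return max(ACTIONS, key=lambda shared: PAYOFF[(shared, shared)])


independent_choices = {twin: choose_controlling_only_self(twin) for twin in ACTIONS}
linked_choice = choose_controlling_both()
```

Under these assumed payoffs, the "only my own action" view defects whatever the twin does, while the "both copies" view cooperates. Both functions are an ordinary argmax; they differ only in which function of the action they maximize.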

Problems of adapting decision theory to embedded agents include:
• counterfactuals
• Newcomblike reasoning, in which the agent interacts with copies of itself
• extortion problems
• coordination problems
• logical counterfactuals
• logical updatelessness

#### 2.1. Counterfactuals

The first difficulty can be illustrated by the five-and-ten problem. Suppose we have the option of taking a five dollar bill or a ten dollar bill, and all we care about in the situation is how much money we get. Obviously, we should take the \$10.

However, it is not so easy as it seems to reliably take the \$10 when the agent knows its own behavior. If you reason about yourself as just another part of the environment, then you can know your own action. If you can know your own action, then it becomes difficult to reason about what would happen if you took different actions. This means an agent can stably take the \$5 because it believes "If I take the \$10, I get \$0"!

This error is coming from a confusion where we replace the intuitive counterfactual “if” with logical implication. This may seem like a silly confusion, but there is not much else we can do, because we don’t know how to formalize the counterfactual “if” correctly.

We could instead try to use probability to formalize counterfactuals, but this won’t work either. If we try to calculate the expected utility of our actions by Bayesian conditioning, as is common, knowing our own behavior leads to a divide-by-zero error when we try to calculate the expected utility of actions we know we don't take: $\neg A$ implies $P(A) = 0$, which implies $P(B \,\&\, A) = 0$, which implies

$$P(B \mid A) = \frac{P(B \,\&\, A)}{P(A)} = \frac{0}{0}.$$

Because the agent doesn't know how to separate itself from the environment, it gets gnashing internal gears when it tries to imagine taking different actions.

This is an instance of the problem of counterfactual reasoning: how do we evaluate hypotheticals like "What if the sun suddenly went out"? The most central example of why agents need to think about counterfactuals comes from counterfactuals about their own actions. This is especially tricky if you already know what you're going to do, the same way "what if the sun suddenly went out" is especially tricky if you know that it won't, or "what if 2+2=3" is especially tricky if you know 2+2=4.

When the agent is part of the environment, it becomes difficult to distinguish reasoning about yourself from reasoning in general, so you run the risk of knowing your own action. Why might an agent come to know its own action before it has acted? Perhaps the agent is trying to plan ahead, or reason about a game-theoretic situation in which its action has an intricate role to play.

But the biggest complication comes from Löb’s Theorem. This can be illustrated more clearly by looking at the behavior of simple logic-based agents reasoning about the five-and-ten problem.

Consider this example. We have the source code for an agent and the universe. They can refer to each other through the use of quining. The universe is simple; the universe just outputs whatever the agent outputs.

The agent spends a long time searching for proofs about what happens if it takes various actions. If for some $x$ and $y$ equal to 0, 5, or 10, it finds a proof that taking the 5 leads to $x$ utility, that taking the 10 leads to $y$ utility, and that $x > y$, it will naturally take the 5. We expect that it won’t find such a proof, and will instead pick the default action of taking the 10.

It seems easy when you just imagine an agent trying to reason about the universe. Yet it turns out that if the amount of time spent searching for proofs is enough, the agent will always choose 5!

The proof that this is so is by Löb's theorem. Löb's theorem says that, for any proposition $P$, if you can prove that a proof of $P$ would imply the truth of $P$, then you can prove $P$. In symbols, with "$\Box X$" meaning "$X$ is provable":

$$\Box(\Box P \to P) \to \Box P.$$

In the version of the five-and-ten problem I gave, "$P$" is the proposition “if the agent outputs 5 the universe outputs 5, and if the agent outputs 10 the universe outputs 0”.

Supposing it is provable, the agent will eventually find the proof, and return 5 in fact. This makes the sentence true, since the agent outputs 5 and the universe outputs 5, and since it’s false that the agent outputs 10. This is because false propositions like “the agent outputs 10” imply everything, including the universe outputting 0.

The agent can (given enough time) prove all of this, in which case the agent in fact proves the proposition "if the agent outputs 5 the universe outputs 5, and if the agent outputs 10 the universe outputs 0". And as a result, the agent takes the \$5.
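Two of the steps above can be checked mechanically. The Python sketch below is only an illustration under stated assumptions: it does not perform proof search or implement Löb's theorem, and the function names are hypothetical. It shows that the sentence about outputs 5 and 10 comes out true under material implication once the agent in fact outputs 5, and that conditioning on a probability-zero action is undefined.

```python
from fractions import Fraction


def implies(antecedent: bool, consequent: bool) -> bool:
    """Material implication: false only when the antecedent holds and the consequent fails."""
    return (not antecedent) or consequent


def spurious_sentence(agent_output: int, universe_output: int) -> bool:
    """'If the agent outputs 5 the universe outputs 5, and if it outputs 10 the universe outputs 0.'"""
    return implies(agent_output == 5, universe_output == 5) and implies(
        agent_output == 10, universe_output == 0
    )


def conditional_probability(p_joint: Fraction, p_condition: Fraction) -> Fraction:
    """P(B | A) = P(B & A) / P(A); raises ZeroDivisionError when P(A) = 0."""
    return p_joint / p_condition


# The universe copies the agent, so an agent that outputs 5 makes the sentence true.
sentence_if_agent_takes_5 = spurious_sentence(agent_output=5, universe_output=5)
# An agent that outputs 10 makes the same sentence false.
sentence_if_agent_takes_10 = spurious_sentence(agent_output=10, universe_output=10)

# Conditioning on an action the agent is certain it will not take.
try:
    conditional_probability(Fraction(0), Fraction(0))
    conditioning_defined = True
except ZeroDivisionError:
    conditioning_defined = False
```

The sentence is self-fulfilling in the sense the text describes: it is true exactly in the world where the agent acts on it.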

Let’s assume we search for short proofs first. In this case, we will take the \$10, since it is very easy to show that the agent outputting 5 leads to the universe outputting 5, and that the agent outputting 10 leads to the universe outputting 10.

The problem is that spurious proofs (proofs like the one above, which go through only because the agent never takes the other action) can be short too, and don’t get much longer when the universe gets harder to predict. If we replace the universe with one that is provably functionally the same, but is harder to predict, the shortest proof will short-circuit the complicated universe and be spurious.

People often try to solve the problem of counterfactuals by suggesting that there will always be some uncertainty. An AI may know its source code perfectly, but it can't perfectly know the hardware it is running on.

Does adding a little uncertainty solve the problem? Often not:

• The proof of the spurious counterfactual often still goes through; if you think you are in a five-and-ten problem with a 95% certainty, you can have the usual problem within that 95%.
• Adding uncertainty to make counterfactuals well-defined doesn't get you any guarantee that the counterfactuals will be reasonable. Hardware failures aren't often what you want to expect when considering alternate actions.

Consider this scenario: You are confident that you almost always take the left path. However, it is possible (though unlikely) for a cosmic ray to damage your circuits, in which case you could go right, but you would then be insane, which would have many other bad consequences.

If this reasoning in itself is why you always go left, you've gone wrong.

So I'm not talking about agents who know their own actions because I think there's going to be a big problem with intelligent machines inferring their own actions in the future. Rather, the possibility of knowing your own actions illustrates something confusing about determining the consequences of your actions: a confusion which shows up even in the very simple case where everything about the world is known and you just need to choose the larger pile of money.

Maybe we can force exploration actions, so that we learn what happens when we do things? This proposal runs into two problems:

• A bad prior can think that exploring is dangerous.
• Forcing it to take exploratory actions doesn't teach it what the world would look like if it took those actions deliberately.

But writing down examples of "correct" counterfactual reasoning doesn't seem hard from the outside!

Maybe that's because from "outside" we always have a dualistic perspective. We are in fact sitting outside of the problem, and we've defined it as a function of an agent.

However, an agent can't solve the problem in the same way from inside. From its perspective, its functional relationship with the environment isn't an observable fact. This is why "counterfactuals" are called what they are called, after all.

When I told you about the five-and-ten problem, I first told you about the problem, and then gave you an agent. When one agent doesn’t work well, we could consider a different agent.

Finding a way to succeed at a decision problem involves finding an agent that when plugged into the problem takes the right action. The fact that we can even consider putting in different agents means that we have already carved the universe into an “agent” part, plus the rest of the universe with a hole for the agent, which is most of the work!

#### 2.3. Updatelessness

Are we just fooling ourselves due to the way we set up decision problems, then? Are there no "correct" counterfactuals?

Well, maybe we are fooling ourselves. But there is still something we are confused about! "Counterfactuals are subjective, invented by the agent" doesn't dissolve the mystery. There is something intelligent agents do, in the real world, to make decisions.

Updateless decision theory (UDT) views the problem from "closer to the outside". It does this by picking the action which the agent would have wanted to commit to before getting into the situation.

Consider the following game: Alice receives a card at random which is either High or Low. She may reveal the card if she wishes. Bob then gives his probability $p$ that Alice has a high card. Alice always loses $p^2$ dollars. Bob loses $p^2$ if the card is low, and $(1-p)^2$ if the card is high.

Bob has a proper scoring rule, so does best by giving his true belief. Alice just wants Bob's belief to be as much toward "low" as possible.

Suppose Alice will play only this one time. She sees a low card. Bob is good at reasoning about Alice, but is in the next room and so can't read any tells. Should Alice reveal her card?

Since Alice's card is low, if she shows it to Bob, she will lose no money, which is the best possible outcome. However, this means that in the counterfactual world where Alice sees a high card, she wouldn't be able to keep the secret. She might as well show her card in that case too, since her reluctance to show it would be as reliable a sign of "high".

On the other hand, if Alice doesn't show her card, she loses 25¢, but then she can use the same strategy in the other world, rather than losing \$1. So, before playing the game, Alice would want to visibly commit to not reveal; this makes expected loss 25¢, whereas the other strategy has expected loss 50¢.
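The expected losses quoted above can be reproduced with a short Python sketch. It assumes that Alice's loss is the square of Bob's stated probability of "high" (which matches the 25¢ and \$1 figures in the text), that the two cards are equally likely, and that Bob knows Alice's reveal policy. The function names are hypothetical.

```python
from fractions import Fraction
from typing import Dict

CARDS = ("low", "high")
HALF = Fraction(1, 2)  # each card is equally likely


def bob_probability_of_high(policy: Dict[str, bool], card: str) -> Fraction:
    """Bob's belief that the card is high, given Alice's known reveal policy."""
    if policy[card]:
        # A revealed card leaves Bob with no uncertainty.
        return Fraction(1) if card == "high" else Fraction(0)
    # Otherwise Bob conditions on "Alice did not reveal".
    hidden_cards = [c for c in CARDS if not policy[c]]
    return Fraction(sum(1 for c in hidden_cards if c == "high"), len(hidden_cards))


def alice_expected_loss(policy: Dict[str, bool]) -> Fraction:
    """Average of p**2 over the two equally likely cards."""
    return sum(HALF * bob_probability_of_high(policy, card) ** 2 for card in CARDS)


never_reveal = {"low": False, "high": False}
reveal_low_only = {"low": True, "high": False}

loss_never_reveal = alice_expected_loss(never_reveal)
loss_reveal_low_only = alice_expected_loss(reveal_low_only)
```

Evaluating whole policies rather than single actions is what lets the calculation register that revealing a low card also changes what silence means when the card is high.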

This game is equivalent to the decision problem called counterfactual mugging. UDT solves such problems by recommending that the agent do whatever would have seemed wisest before—whatever your earlier self would have committed to do.

UDT is an elegant solution to a fairly broad class of decision problems. However, it only makes sense if the earlier self can foresee all possible situations.

This works fine in a Bayesian setting where the prior already contains all possibilities within itself. However, there may be no way to do this in a realistic embedded setting. An agent has to be able to think of new possibilities—meaning that its earlier self doesn't know enough to make all the decisions.

And with that, we find ourselves squarely facing the problem of embedded world-models.

### 3. Embedded world-models

An agent which is larger than its environment can:

• Hold an exact model of the environment in its head.
• Think through the consequences of every potential course of action.
• If it doesn't know the environment perfectly, hold every possible way the environment could be in its head, as is the case with Bayesian uncertainty.

All of these are typical of notions of rational agency.

An embedded agent can't do any of those things, at least not in any straightforward way.

One difficulty is that, since the agent is part of the environment, modeling the environment in every detail would require the agent to model itself in every detail, which would require the agent’s self-model to be as “big” as the whole agent. An agent can’t fit inside its own head.

The lack of a crisp agent/environment boundary forces us to grapple with paradoxes of self-reference. As if representing the rest of the world weren't already hard enough.

Embedded world-models have to represent the world in a way more appropriate for embedded agents. Problems in this cluster include:

• the "realizability" / "grain of truth" problem: the real world isn’t in the agent’s hypothesis space
• logical uncertainty
• high-level models
• multi-level models
• ontological crises
• naturalized induction, the problem that the agent must incorporate its model of itself into its world-model
• anthropic reasoning, the problem of reasoning about how many copies of yourself exist

#### 3.1. Realizability

In a Bayesian setting, where an agent's uncertainty is quantified by a probability distribution over possible worlds, a common assumption is “realizability”: the true underlying environment which is generating the observations is assumed to have at least some probability in the prior.

In game theory, this same property is described by saying a prior has a "grain of truth". It should be noted, though, that there are additional barriers to getting this property in a game-theoretic setting; so, in their common usage cases, "grain of truth" is technically demanding while "realizability" is a technical convenience.

Realizability is not totally necessary in order for Bayesian reasoning to make sense. If you think of a set of hypotheses as “experts”, and the current posterior probability as how much you “trust” each expert, then learning according to Bayes' Law, $P(h \mid e) = \frac{P(e \mid h) \cdot P(h)}{P(e)}$, ensures a relative bounded loss property.

Specifically, if you use a prior $\pi$, the amount worse you are in comparison to each expert $h$ is at most $-\log \pi(h)$ in log loss, since you assign at least probability $\pi(h) \cdot h(e)$ to seeing a sequence of evidence $e$. Intuitively, $\pi(h)$ is your initial trust in expert $h$, and in each case where it is even a little bit more correct than you, you increase your trust accordingly. The way you do this ensures you assign an expert probability 1 and hence copy it precisely before you lose more than $-\log \pi(h)$ compared to it.
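The bounded loss property is easy to check numerically. The sketch below assumes log loss measured in nats, uses three made-up "experts" that each predict a fixed coin bias, and takes an arbitrary evidence sequence; all of these are illustrative assumptions.

```python
import math
from typing import Dict, Sequence


def sequence_probability(bias: float, bits: Sequence[int]) -> float:
    """Probability an expert predicting P(bit = 1) = bias assigns to the whole sequence."""
    prob = 1.0
    for bit in bits:
        prob *= bias if bit == 1 else 1.0 - bias
    return prob


def log_loss(prob: float) -> float:
    """Cumulative log loss (in nats) of having assigned `prob` to what happened."""
    return -math.log(prob)


# Hypothetical experts (coin biases) and a uniform prior over them.
prior: Dict[float, float] = {0.1: 1 / 3, 0.5: 1 / 3, 0.9: 1 / 3}
evidence = [1, 1, 0, 1, 1, 1, 0, 1, 1, 1]

# The Bayesian mixture assigns the prior-weighted average probability.
mixture_prob = sum(
    weight * sequence_probability(bias, evidence) for bias, weight in prior.items()
)

# Regret against each expert, and the bound -log(prior weight) it must respect.
regret = {
    bias: log_loss(mixture_prob) - log_loss(sequence_probability(bias, evidence))
    for bias in prior
}
bound = {bias: -math.log(weight) for bias, weight in prior.items()}
```

The bound holds for any evidence sequence, including ones that none of the experts predicts well, because the mixture's probability is never smaller than any single weighted term in the sum. No realizability assumption is used.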

The prior AIXI is based on is the Solomonoff prior. It is defined as the output of a universal Turing machine (UTM) whose inputs are coin-flips.

In other words, feed a UTM a random program. Normally, you'd think of a UTM as only being able to simulate deterministic machines. Here, however, the initial inputs can instruct the UTM to use the rest of the infinite input tape as a source of randomness to simulate a stochastic Turing machine.
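The "coin-flips into a machine" picture can be illustrated with a toy. The Python sketch below is emphatically not the Solomonoff prior: the machine is a made-up, non-universal one that reads $k$ ones, a zero, and then a $k$-bit pattern, and prints that pattern on repeat. It also leaves out the use of the remaining tape as a randomness source. It is only meant to show how weighting programs by $2^{-\text{length}}$ favors outputs with short descriptions.

```python
from fractions import Fraction
from itertools import product


def toy_machine_output(pattern: str, n: int) -> str:
    """First n symbols printed by the toy machine: `pattern` repeated forever."""
    return "".join(pattern[i % len(pattern)] for i in range(n))


def toy_prior(prefix: str, max_pattern_len: int = 8) -> Fraction:
    """Chance that coin-flip input makes the toy machine print something starting with `prefix`.

    Only programs with patterns up to `max_pattern_len` bits are counted,
    so this is a truncated lower bound on the toy machine's full prior.
    """
    total = Fraction(0)
    for k in range(1, max_pattern_len + 1):
        program_len = 2 * k + 1  # k ones, one zero, then k pattern bits
        for bits in product("01", repeat=k):
            pattern = "".join(bits)
            if toy_machine_output(pattern, len(prefix)) == prefix:
                # A specific program of this length comes up with probability 2**-length.
                total += Fraction(1, 2 ** program_len)
    return total


weight_regular = toy_prior("0000")
weight_irregular = toy_prior("0110")
```

In this toy, "0000" gets more weight than "0110" because shorter repeating patterns can produce it. A real universal machine would play the same game over all programs, not just repeating patterns.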

Combining this with the previous idea about viewing Bayesian learning as a way of allocating "trust" to "experts" which meets a bounded loss condition, we can see the Solomonoff prior as a kind of ideal machine learning algorithm which can learn to act like any algorithm you might come up with, no matter how clever.

For this reason, we shouldn't necessarily think of AIXI as “assuming the world is computable”, even though it reasons via a prior over computations. It's getting bounded loss on its predictive accuracy as compared with any computable predictor. We should rather say that AIXI assumes all possible algorithms are computable, not that the world is.

However, lacking realizability can cause trouble if you are looking for anything more than bounded-loss predictive accuracy:

• the posterior can oscillate forever;
• probabilities may not be calibrated;
• estimates of statistics such as the mean may be arbitrarily bad;
• estimates of latent variables may be bad;
• and the identification of causal structure may not work.

So does AIXI perform well without a realizability assumption? We don't know. Despite getting bounded loss for predictions without realizability, existing optimality results for its actions require an added realizability assumption.

First, if the environment really is sampled from the Solomonoff distribution, AIXI gets the maximum expected reward. But this is fairly trivial; it is essentially the definition of AIXI.

Second, if we modify AIXI to take somewhat randomized actions—Thompson sampling—there is an asymptotic optimality result for environments which act like any stochastic Turing machine.

So, either way, realizability was assumed in order to prove anything. (See Jan Leike, Nonparametric General Reinforcement Learning.)

But the concern I'm pointing at is not “the world might be uncomputable, so we don't know if AIXI will do well”; this is more of an illustrative case. The concern is that AIXI is only able to define intelligence or rationality by constructing an agent much, much bigger than the environment which it has to learn about and act within.

Laurent Orseau provides a way of thinking about this in “Space-Time Embedded Intelligence”. However, his approach defines the intelligence of an agent in terms of a sort of super-intelligent designer who thinks about reality from outside, selecting an agent to place into the environment.

Embedded agents don't have the luxury of stepping outside of the universe to think about how to think. What we would like would be a theory of rational belief for situated agents which provides foundations that are as strong as the foundations Bayesianism provides for dualistic agents.

Imagine a computer science theory person who is having a disagreement with a programmer. The theory person is making use of an abstract model. The programmer is complaining that the abstract model isn’t something you would ever run, because it is computationally intractable. The theory person responds that the point isn’t to ever run it. Rather, the point is to understand some phenomenon which will also be relevant to more tractable things which you would want to run.

I bring this up in order to emphasize that my perspective is a lot more like the theory person's. I’m not talking about AIXI to say “AIXI is an idealization you can’t run”. The answers to the puzzles I’m pointing at don’t need to run. I just want to understand some phenomena.

However, sometimes a thing that makes a theoretical model less tractable also makes that model too different from the phenomenon we're interested in.

The way AIXI wins games is by assuming we can do true Bayesian updating over a hypothesis space, assuming the world is in our hypothesis space, etc. So it can tell us something about the aspect of realistic agency that’s approximately doing Bayesian updating over an approximately-good-enough hypothesis space. But embedded agents don’t just need approximate solutions to that problem; they need to solve several problems that are different in kind from that problem.

#### 3.2. Self-reference

One major obstacle a theory of embedded agency must deal with is self-reference.

Paradoxes of self-reference such as the liar paradox make it not just wildly impractical, but in a certain sense impossible for an agent's world-model to accurately reflect the world.

The liar paradox concerns the status of the sentence "This sentence is not true". If it were true, it must be false; and if not true, it must be true.

The difficulty comes in part from trying to draw a map of a territory which includes the map itself.

This is fine if the world "holds still" for us; but because the map is in the world, the world may change in response to the map we draw.

Suppose our goal is to make an accurate map of the final route of a road which is currently under construction. Suppose we also know that the construction team will get to see our map, and that construction will proceed so as to disprove whatever map we make. This puts us in a liar-paradox-like situation.

Problems of this kind become relevant for decision-making in the theory of games. A simple game of rock-paper-scissors can introduce a liar paradox if the players try to win, and can predict each other better than chance.

Game theory solves this type of problem with game-theoretic equilibria. But the problem ends up coming back in a different way.

I mentioned that the problem of realizability takes on a different character in the context of game theory. In an ML setting, realizability is a potentially unrealistic assumption, but can usually be assumed consistently nonetheless.

In game theory, on the other hand, the assumption itself may be inconsistent. This is because games commonly yield paradoxes of self-reference.

Because there are so many agents, it is no longer possible in game theory to conveniently make an "agent" a thing which is larger than a world. So game theorists are forced to investigate notions of rational agency which can handle a large world.

Unfortunately, this is done by splitting up the world into "agent" parts and "non-agent" parts, and handling the agents in a special way. This is almost as bad as dualistic models of agency.

In rock-paper-scissors, the liar paradox is resolved by stipulating that each player play each move with probability $1/3$. If one player plays this way, then the other loses nothing by doing so. This way of introducing probabilistic play to resolve would-be paradoxes of game theory is called a Nash equilibrium.
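As a small check of this claim, the sketch below (helper names such as `best_responses` are made up for illustration) verifies two things: no pair of pure moves is a mutual best response, and against a uniformly random opponent every move earns the same expected payoff, so nothing is lost by also playing uniformly.

```python
from fractions import Fraction
from itertools import product

MOVES = ["rock", "paper", "scissors"]
BEATS = {"rock": "scissors", "paper": "rock", "scissors": "paper"}

def payoff(mine, theirs):
    """+1 for a win, -1 for a loss, 0 for a tie, from my point of view."""
    if mine == theirs:
        return 0
    return 1 if BEATS[mine] == theirs else -1

def best_responses(opponent_mix):
    """Pure moves that maximize expected payoff against a mixed strategy."""
    value = {m: sum(p * payoff(m, o) for o, p in opponent_mix.items())
             for m in MOVES}
    top = max(value.values())
    return {m for m in MOVES if value[m] == top}, top

# Pure strategies: each player wants to beat the other, so no pair is stable.
pure_equilibria = [
    (a, b) for a, b in product(MOVES, MOVES)
    if a in best_responses({b: Fraction(1)})[0]
    and b in best_responses({a: Fraction(1)})[0]
]

# Uniform mixing: every reply is equally good, with expected payoff zero.
uniform = {m: Fraction(1, 3) for m in MOVES}
responses, value = best_responses(uniform)
```

Here `pure_equilibria` comes out empty, while `responses` contains all three moves, which is the indifference that makes the mixed strategy stable.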

We can use Nash equilibria to prevent the assumption that the agents correctly understand the world they're in from being inconsistent. However, that works just by telling the agents what the world looks like. What if we want to model agents who learn about the world, more like AIXI?

The grain of truth problem is the problem of formulating a reasonably bounded prior probability distribution which would allow agents playing games to place some positive probability on each other's true (probabilistic) behavior, without knowing it precisely from the start.

Until recently, known solutions to the problem were quite limited. Benja Fallenstein, Jessica Taylor, and Paul Christiano's "Reflective Oracles: A Foundation for Classical Game Theory" provides a very general solution. (For details, see "A Formal Solution to the Grain of Truth Problem" by Jan Leike, Jessica Taylor, and Benja Fallenstein.)

Reflective oracles also solve the problem with game-theoretic notions of rationality I mentioned earlier. They allow agents to be reasoned about in the same manner as other parts of the environment, rather than treating them as a fundamentally special case.

If a Turing machine can use an oracle machine to solve a problem, then we can always jump to a problem that is undecidable even with that oracle; and if we bring in a stronger oracle to solve the new problem, then we can repeat this process with the new oracle.

Reflective oracles work by twisting the ordinary Turing universe back on itself, so that rather than an infinite hierarchy of ever-stronger oracles, you define an oracle that serves as its own oracle machine.

This would normally introduce contradictions, but reflective oracles avoid this by randomizing their output in cases where they would run into paradoxes. As a result, an agent with access to a reflective oracle can reason about the behavior of any other agent with access to a reflective oracle.
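A toy illustration of the randomization trick, under simplifying assumptions of my own: a single "liar" machine asks the oracle "do I output 1 with probability greater than 1/2?" and then outputs the opposite of the answer. The consistency condition below is a simplified reading of the reflective oracle definition (answer truthfully whenever the probability is strictly above or below the threshold, randomize freely when it is exactly equal); the function names are hypothetical.

```python
from fractions import Fraction

def machine_prob_one(r):
    """P(machine outputs 1) when the oracle says 'yes' with probability r.

    The machine outputs the opposite of whatever the oracle answers.
    """
    return 1 - r

def is_reflective(r, threshold=Fraction(1, 2)):
    """Does answering 'yes' with probability r match the machine's behavior?"""
    p = machine_prob_one(r)
    if p > threshold:
        return r == 1   # the oracle must say 'yes' with certainty
    if p < threshold:
        return r == 0   # the oracle must say 'no' with certainty
    return True         # exactly at the threshold: free to randomize

grid = [Fraction(k, 10) for k in range(11)]
consistent = [r for r in grid if is_reflective(r)]
```

Both deterministic answers (`r = 0` and `r = 1`) contradict themselves; on this grid the only consistent choice is to answer "yes" half the time.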

However, models of rational agents based on reflective oracles still have several major limitations. One of these is that agents are required to have unlimited processing power, just like AIXI, and so are assumed to know all of the consequences of their own beliefs.

In fact, knowing all the consequences of your beliefs—a property known as logical omniscience—turns out to be rather core to classical Bayesian rationality.

#### 3.3. Logical uncertainty

So far, I've been talking in a fairly naive way about the agent having beliefs about hypotheses, and the real world being or not being in the hypothesis space.

It isn't really clear what any of that means.

Depending on how we define things, it may actually be quite possible for an agent to be smaller than the world and yet contain the right world-model—it might know the true physics and initial conditions, but only be capable of inferring their consequences very approximately.

Uncertainty about the consequences of your beliefs is logical uncertainty. In this case, the agent might be empirically certain of a unique mathematical description pinpointing which universe she’s in, while being logically uncertain of most consequences of that description.

Logic and probability theory are two great triumphs in the codification of rational thought. However, the two don't work together as well as one might think.

Probability is like a scale, with worlds as weights. An observation eliminates some of the possible worlds, removing weights and shifting the balance of beliefs.

Logic is like a tree, growing from the seed of axioms according to inference rules. For real-world agents, the process of growth is never complete; you never know all the consequences of each belief.

Bayesian hypothesis testing requires each hypothesis to clearly declare which probabilities it assigns to which observations. That way, you know how much to rescale the odds when you make an observation. If we don't know the consequences of a belief, we don't know how much credit to give it for making predictions.
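A minimal sketch of that bookkeeping, with a made-up pair of coin hypotheses: each hypothesis has to supply a likelihood for the observation before the odds can be rescaled.

```python
from fractions import Fraction

def bayes_update(prior, likelihood, observation):
    """Return the posterior over hypotheses after one observation.

    prior:      {hypothesis: probability}
    likelihood: {hypothesis: {observation: probability}}
    """
    unnormalized = {h: p * likelihood[h][observation] for h, p in prior.items()}
    total = sum(unnormalized.values())
    return {h: w / total for h, w in unnormalized.items()}

# Illustrative hypotheses, not from the text.
prior = {"fair coin": Fraction(1, 2), "biased coin": Fraction(1, 2)}
likelihood = {
    "fair coin": {"H": Fraction(1, 2), "T": Fraction(1, 2)},
    "biased coin": {"H": Fraction(9, 10), "T": Fraction(1, 10)},
}
posterior = bayes_update(prior, likelihood, "H")
```

The whole update hinges on the `likelihood[h][observation]` lookup. A hypothesis whose consequences we cannot compute has no entry to look up, so we cannot say how much credit it deserves.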

This is like not knowing where to place the weights on the scales of probability. We could try putting weights on both sides until a proof rules one out, but then the beliefs just oscillate forever rather than doing anything useful.

This forces us to grapple directly with the problem of a world that’s larger than the agent. We want some notion of boundedly rational beliefs about uncertain consequences; but any computable beliefs about logic must have left out something, since the tree of logical implications will grow larger than any container.

For a Bayesian, the scales of probability are balanced in precisely such a way that no Dutch book can be made against them—no sequence of bets that are a sure loss. But you can only account for all Dutch books if you know all the consequences of your beliefs. Absent that, someone who has explored other parts of the tree can Dutch-book you.
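Here is about the smallest Dutch book one can write down, as a sketch with invented prices. The agent prices a unit bet on a claim A and a unit bet on its negation without noticing that the two prices must sum to 1; a bookie who has noticed sells it both.

```python
from fractions import Fraction

def dutch_book_payoff(price_A, price_not_A):
    """Agent buys a unit bet on A and a unit bet on not-A at its own prices.

    Returns the agent's net payoff in each possible world.
    """
    cost = price_A + price_not_A
    return {
        "A true": 1 - cost,    # only the bet on A pays out
        "A false": 1 - cost,   # only the bet on not-A pays out
    }

incoherent = dutch_book_payoff(Fraction(3, 5), Fraction(3, 5))
coherent = dutch_book_payoff(Fraction(3, 5), Fraction(2, 5))
```

With incoherent prices the agent loses 1/5 in every world. For an empirical claim this is easy to avoid, but if A is a mathematical statement, spotting that two sentences are negations of each other (or otherwise logically linked) may itself take a long proof.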

But human mathematicians don't seem to run into any special difficulty in reasoning about mathematical uncertainty, any more than we do with empirical uncertainty. So what characterizes good reasoning under mathematical uncertainty, if not immunity to making bad bets?

One answer is to weaken the notion of Dutch books so that we only allow bets based on quickly computable parts of the tree. This is one of the ideas behind Garrabrant et al.'s "Logical Induction", an early attempt at defining something like "Solomonoff induction, but for reasoning that incorporates mathematical uncertainty".

#### 3.4. High-level models

Another consequence of the fact that the world is bigger than you is that you need to be able to use high-level world models: models which involve things like tables and chairs.

This is related to the classical symbol grounding problem; but since we want a formal analysis which increases our trust in some system, the kind of model which interests us is somewhat different. This also relates to transparency and informed oversight: world-models should be made out of understandable parts.

A related question is how high-level reasoning and low-level reasoning relate to each other and to intermediate levels: multi-level world models.

Standard probabilistic reasoning doesn't provide a very good account of this sort of thing. It's as though you have different Bayes nets which describe the world at different levels of accuracy, and processing power limitations force you to mostly use the less accurate ones, so you have to decide how to jump to the more accurate ones as needed.

Additionally, the models at different levels don't line up perfectly, so you have a problem of translating between them; and the models may have serious contradictions between them. This might be fine, since high-level models are understood to be approximations anyway, or it could signal a serious problem in the higher- or lower-level models, requiring their revision.

This is especially interesting in the case of ontological crises, in which objects we value turn out not to be a part of "better" models of the world.

It seems fair to say that everything humans value exists only in high-level models, which from a reductionistic perspective are "less real" than atoms and quarks. However, because our values aren't defined on the low level, we are able to keep our values even when our knowledge of the low level radically shifts. (We would also like to be able to say something about what happens to values if the high level radically shifts.)

Another critical aspect of embedded world models is that the agent itself must be in the model, since the agent seeks to understand the world, and the world cannot be fully separated from oneself. This opens the door to difficult problems of self-reference and anthropic decision theory.

Naturalized induction is the problem of learning world-models which include yourself in the environment. This is challenging because (as Caspar Oesterheld has put it) there is a type mismatch between "mental stuff" and "physics stuff".

AIXI conceives of the environment as if it were made with a slot which the agent fits into. We might intuitively reason in this way, but we can also understand a physical perspective from which this looks like a bad model. We might imagine instead that the agent separately represents: self-knowledge available to introspection; hypotheses about what the universe is like; and a "bridging hypothesis" connecting the two.

There are interesting questions of how this could work. There's also the question of whether this is the right structure at all. It's certainly not how I imagine babies learning.

Thomas Nagel would say that this way of approaching the problem involves "views from nowhere"; each hypothesis posits a world as if seen from outside. This is perhaps a strange thing to do.

A special case of agents needing to reason about themselves is agents needing to reason about their future self.

To make long-term plans, agents need to be able to model how they’ll act in the future, and have a certain kind of trust in their future goals and reasoning abilities. This includes trusting future selves that have learned and grown a great deal.

In a traditional Bayesian framework, “learning” means Bayesian updating. But as we noted, Bayesian updating requires that the agent start out large enough to consider a bunch of ways the world can be, and learn by ruling some of these out.

Embedded agents need resource-limited, logically uncertain updates, which don’t work like this.

Unfortunately, Bayesian updating is the main way we know how to think about an agent progressing through time as one unified agent. The Dutch book justification for Bayesian reasoning is basically saying this kind of updating is the only way to not have the agent’s actions on Monday work at cross purposes, at least a little, to the agent’s actions on Tuesday.

Embedded agents are non-Bayesian. And non-Bayesian agents tend to get into wars with their future selves.

Which brings us to our next set of problems: robust delegation.

### 4. Robust delegation

Because the world is big, the agent as it is may be inadequate to accomplish its goals, including in its ability to think.

Because the agent is made of parts, it can improve itself and become more capable.

Improvements can take many forms: The agent can make tools, the agent can make successor agents, or the agent can just learn and grow over time. However, the successors or tools need to be more capable for this to be worthwhile.

This gives rise to a special type of principal/agent problem:

You have an initial agent, and a successor agent. The initial agent gets to decide exactly what the successor agent looks like. The successor agent, however, is much more intelligent and powerful than the initial agent. We want to know how to have the successor agent robustly optimize the initial agent’s goals.

The problem is not (just) that the successor agent might be malicious. The problem is that we don't even know what it means not to be.

This problem seems hard from both points of view.

The initial agent needs to figure out how reliable and trustworthy something more powerful than itself is, which seems very hard. But the successor agent has to figure out what to do in situations that the initial agent can’t even understand, and try to respect the goals of something that the successor can see is inconsistent, which also seems very hard.

At first, this may look like a less fundamental problem than "make decisions" or "have models". But the view on which there are multiple forms of the "build a successor" problem is a dualistic view.

To an embedded agent, the future self is not privileged; it is just another part of the environment. There isn't a deep difference between building a successor that shares your goals, and just making sure your own goals stay the same over time.

So, although I talk about "initial" and "successor" agents, remember that this isn't just about the narrow problem humans currently face of aiming a successor. This is about the fundamental problem of being an agent that persists and learns over time.

We call this cluster of problems Robust Delegation. Examples include:

#### 4.1. Vingean reflection

Imagine you are playing the CIRL game with a toddler.

CIRL means Cooperative Inverse Reinforcement Learning. The idea behind CIRL is to define what it means for a robot to collaborate with a human. The robot tries to pick helpful actions, while simultaneously trying to figure out what the human wants.

Usually, we think about this from the point of view of the human. But now consider the problem faced by the robot, where they’re trying to help someone who is very confused about the universe. Imagine trying to help a toddler optimize their goals.

• From your standpoint, the toddler may be too irrational to be seen as optimizing anything.
• The toddler may have an ontology in which it is optimizing something, but you can see that ontology doesn't make sense.
• Maybe you notice that if you set up questions in the right way, you can make the toddler seem to want almost anything.

Part of the problem is that the “helping” agent has to be bigger in some sense in order to be more capable; but this seems to imply that the “helped” agent can't be a very good supervisor for the “helper”.

For example, updateless decision theory eliminates dynamic inconsistencies in decision theory. Rather than maximizing the expected utility of your action given what you know, it maximizes the expected utility of reactions to observations, from a state of ignorance.

Appealing as this may be as a way to achieve reflective consistency, it creates a strange situation in terms of computational complexity: If actions are type $A$, and observations are type $O$, reactions to observations are type $O \to A$, a much larger space to optimize over than $A$ alone. And we're expecting our smaller self to be able to do that!
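To get a feel for the size difference, write A for the set of actions and O for the set of observations. A rough sketch (the action and observation names are placeholders) that enumerates every reaction-to-observations, that is, every function from O to A:

```python
from itertools import product

def all_policies(observations, actions):
    """Enumerate every function from observations to actions."""
    return [dict(zip(observations, choice))
            for choice in product(actions, repeat=len(observations))]

actions = ["left", "right"]
observations = ["red", "green", "blue"]
policies = all_policies(observations, actions)
```

With 2 actions and 3 observations there are already 2^3 = 8 policies, and in general |A|^|O|, so the space grows exponentially in the number of distinct observations.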

One way to more crisply state the problem is: We should be able to trust that our future self is applying its intelligence to the pursuit of our goals without being able to predict precisely what our future self will do. This criterion is called Vingean reflection.

For example, you might plan your driving route before visiting a new city, but you do not plan your steps. You plan to some level of detail, and trust that your future self can figure out the rest.

Vingean reflection is difficult to examine via classical Bayesian decision theory because Bayesian decision theory assumes logical omniscience. Given logical omniscience, the assumption “the agent knows its future actions are rational” is synonymous with the assumption “the agent knows its future self will act according to one particular optimal policy which the agent can predict in advance”.

We have some limited models of Vingean reflection (see “Tiling Agents for Self-Modifying AI, and the Löbian Obstacle” by Yudkowsky and Herreshoff). A successful approach must walk the narrow line between two problems:

• The Löbian Obstacle: Agents who trust their future self because they trust the output of their own reasoning are inconsistent.
• The Procrastination Paradox: Agents who trust their future selves without reason tend to be consistent but unsound and untrustworthy, and will put off tasks forever because they can do them later.

The Vingean reflection results so far apply only to limited sorts of decision procedures, such as satisficers aiming for a threshold of acceptability. So there is plenty of room for improvement, getting tiling results for more useful decision procedures and under weaker assumptions.

However, there is more to the robust delegation problem than just tiling and Vingean reflection.

When you construct another agent, rather than delegating to your future self, you more directly face a problem of value loading.

#### 4.2. Goodhart's law

Optimization amplifies slight differences between the target we specify and what we really want. This misspecification-amplifying effect is known as Goodhart's law, named for Charles Goodhart’s observation: "Any observed statistical regularity will tend to collapse once pressure is placed upon it for control purposes."

When we specify a target for optimization, it is reasonable to expect it to be correlated with what we want—highly correlated, in some cases. Unfortunately, however, this does not mean that optimizing it will get us closer to what we want—especially at high levels of optimization.

There are (at least) four types of Goodhart: regressional, causal, extremal, and adversarial.

Regressional Goodhart happens when there is a less than perfect correlation between the proxy and the goal. It is more commonly known as the optimizer's curse, and it is related to regression to the mean.

An unbiased estimate of $Y$ given $X$ is not an unbiased estimate of $Y$ when we select for the best $X$. In that sense, we can expect to be disappointed when we use $X$ as a proxy for $Y$ for optimization purposes.

Using a Bayes estimate instead of an unbiased estimate, we can eliminate this sort of predictable disappointment.

This doesn't necessarily allow us to get a better $Y$ value, since we still only have the information content of $X$ to work with. However, it sometimes may. If $Y$ is normally distributed with variance $1$, and $X$ is $Y \pm 10$ with even odds of $+$ or $-$, a Bayes estimate will give better optimization results by almost entirely removing the noise.
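A small simulation sketch of regressional Goodhart and the Bayes-estimate fix. The setup is an illustrative choice: the goal Y is standard normal and the proxy X is Y plus or minus 10 with even odds, so X is an unbiased estimate of Y. Under those assumptions the posterior odds of "plus" versus "minus" given X = x work out to exp(20x), which gives the closed form used in `bayes_estimate`.

```python
import math
import random

NOISE = 10.0  # X = Y + NOISE or Y - NOISE, each with probability 1/2

def bayes_estimate(x):
    """E[Y | X = x] for Y ~ N(0, 1) and X = Y +/- NOISE with even odds.

    Posterior odds of '+' versus '-' are exp(2 * NOISE * x), so
    E[Y | X = x] = x - NOISE * tanh(NOISE * x).
    """
    return x - NOISE * math.tanh(NOISE * x)

rng = random.Random(0)
ys = [rng.gauss(0.0, 1.0) for _ in range(1000)]
xs = [y + rng.choice([NOISE, -NOISE]) for y in ys]

# Select the candidate that looks best on the raw proxy X ...
naive = max(range(len(xs)), key=lambda i: xs[i])
# ... versus the one that looks best on the Bayes estimate of Y.
bayes = max(range(len(xs)), key=lambda i: bayes_estimate(xs[i]))

disappointment = xs[naive] - ys[naive]   # how much X overstated Y
```

Selecting on raw X only ever picks from the candidates that happened to get the "+" noise, and the winner's proxy overstates its true value by the full noise term. Selecting on the Bayes estimate strips the noise back out, so it can consider every candidate and ends up with a true value at least as high.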

Causal Goodhart happens when you observe a correlation between proxy and goal, but when you intervene to increase the proxy, you fail to increase the goal because the observed correlation was not causal in the right way. Teasing correlation apart from causation is run-of-the-mill counterfactual reasoning.

In extremal Goodhart, optimization pushes you outside the range where the correlation exists, into portions of the distribution which behave very differently. This is especially scary because it tends to have phase shifts. You might not be able to observe the proxy breaking down at all when you have weak optimization, but once the optimization becomes strong enough, you can enter a very different domain.

Extremal Goodhart is similar to regressional Goodhart, but we can't correct it with Bayes estimators if we don't have the right model—otherwise, there seems to be no reason why the Bayes estimator itself should not be susceptible to extremal Goodhart.

If you have a probability distribution $Q(Y)$ such that the proxy $X$ is only a boundedly bad approximation of $Y$ on average, quantilization avoids extremal Goodhart by selecting randomly from $Q(Y \mid X \geq c)$ for some threshold $c$. If we pick a threshold that is high but not extreme, we can hope that the risk of selecting from outliers with very different behavior will be small, and that $Y$ is likely to be large.

This is helpful, but unlike Bayes estimators for regressional Goodhart, doesn't necessarily seem like the end of the story. Maybe we can do better.
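The quantilization idea above can be sketched in a few lines. Everything concrete here is an assumption for illustration: the candidates stand in for samples from the base distribution, the threshold is expressed as a top fraction rather than a fixed cutoff, and the true value is rigged to collapse at the single most extreme proxy score.

```python
import random

def top_quantile(candidates, proxy, top_fraction):
    """Candidates whose proxy score falls in the top `top_fraction`."""
    ranked = sorted(candidates, key=proxy, reverse=True)
    keep = max(1, round(len(ranked) * top_fraction))
    return ranked[:keep]

def quantilize(candidates, proxy, top_fraction, rng):
    """Pick uniformly at random from the top quantile instead of the argmax."""
    return rng.choice(top_quantile(candidates, proxy, top_fraction))

def proxy(i):
    return i                            # proxy score X

def true_value(i):
    return -100 if i == 99 else i       # Y tracks X, except at the extreme

candidates = list(range(100))
argmax_choice = max(candidates, key=proxy)
pool = top_quantile(candidates, proxy, 0.1)
expected_true = sum(true_value(i) for i in pool) / len(pool)
choice = quantilize(candidates, proxy, 0.1, random.Random(0))
```

The pure maximizer lands on the one candidate where the proxy breaks down. The quantilizer still risks drawing it, but only one time in ten, so its expected true value stays high.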

Finally, there is adversarial Goodhart, in which agents actively make our proxy worse by intelligently manipulating it. This is even harder to observe at low levels of optimization, both because the adversaries won’t want to start manipulating until after test time is over, and because adversaries that come from the system’s own optimization won’t show up until the optimization is powerful enough.

These different types of Goodhart effects work in very different ways, and, roughly speaking, they tend to start appearing at successively higher levels of optimization power—so be careful not to think you've conquered Goodhart's law because you've solved some of them.

#### 4.3. Stable pointers to what we value

Besides anti-Goodhart measures, it would obviously help to be able to specify what we want precisely.

Unfortunately, this is hard; so can the AI system we’re building help us with this? More generally, can a successor agent help its predecessor solve this? Maybe it can use its intellectual advantages to figure out what we want?

AIXI learns what to do through a reward signal which it gets from the environment. We can imagine humans have a button which they press when AIXI does something they like.

The problem with this is that AIXI will apply its intelligence to the problem of taking control of the reward button. This is the problem of wireheading.

Maybe we build the reward button into the agent, as a black box which issues rewards based on what is going on. The box could be an intelligent sub-agent in its own right, which figures out what rewards humans would want to give. The box could even defend itself by issuing punishments for actions aimed at modifying the box.

In the end, though, if the agent understands the situation, it will be motivated to take control anyway.

There is a critical distinction between optimizing "$U()$" in quotation marks and optimizing $U()$ directly. If the agent is coming up with plans to try to achieve a high output of the box, and it incorporates into its planning its uncertainty regarding the output of the box, then it will want to hack the box. However, if you run the expected outcomes of plans through the actual box, then plans to hack the box are evaluated by the current box, so they don't look particularly appealing.
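A toy sketch of the two evaluation schemes, writing U for the utility function the box currently implements. The plans, numbers, and function names are invented for illustration; the only point is where the evaluation happens.

```python
def current_utility(outcome):
    """The box as it exists now: it only values finished work."""
    return outcome["work_done"]

def predicted_box_output(outcome):
    """What the box will report after the plan runs.

    A hacked box reports the maximum reward regardless of the work done.
    """
    return 100 if outcome["box_hacked"] else current_utility(outcome)

PLANS = {
    "do the task": {"work_done": 10, "box_hacked": False},
    "hack the box": {"work_done": 0, "box_hacked": True},
}

# Optimizing "U" in quotation marks: maximize whatever the future box says.
rl_choice = max(PLANS, key=lambda p: predicted_box_output(PLANS[p]))

# Optimizing U directly: score predicted outcomes with the current box.
ou_choice = max(PLANS, key=lambda p: current_utility(PLANS[p]))
```

The first agent prefers hacking because the hacked box will say so. The second agent asks the unhacked box what it thinks of a world where the box got hacked and no work was done, and the answer is unimpressive.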

Daniel Dewey calls the second sort of agent an observation-utility maximizer. (Others have included observation-utility agents within a more general notion of reinforcement learning.)

I find it very interesting how you can try all sorts of things to stop an RL agent from wireheading, but the agent keeps working against it. Then, you make the shift to observation-utility agents and the problem vanishes.

It seems like the indirection itself is the problem. RL agents maximize the output of the box; observation-utility agents maximize $U()$. So the challenge is to create stable pointers to what we value: a notion of "indirection" which serves to point at values not directly available to be optimized.

Observation-utility agents solve the classic wireheading problem, but we still have the problem of specifying $U()$. So we add a level of indirection back in: we represent our uncertainty over $U()$, and try to learn. Daniel Dewey doesn't provide any suggestion for how to do this, but CIRL is one example.

Unfortunately, the wireheading problem can come back in even worse fashion. For example, if there is a drug which modifies human preferences to only care about using the drug, a CIRL agent could be highly motivated to give humans that drug in order to make its job easier. This is called the human manipulation problem.

The lesson I want to draw from this is from "Reinforcement Learning with a Corrupted Reward Channel" (by Tom Everitt et al.): the way you set up the feedback loop makes a huge difference.

They draw the following picture:

• In Standard RL, the feedback about the value of a state comes from the state itself, so corrupt states can be "self-aggrandizing".
• In Decoupled RL, the feedback about the quality of a state comes from some other state, making it possible to learn correct values even when some feedback is corrupt.
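A very rough sketch of the contrast, not the formal setup from the paper: the state names, the `report` function, and the use of a median to pool feedback are all assumptions made for this toy. One corrupt state inflates the feedback it gives about itself.

```python
from statistics import median

TRUE_VALUE = {"a": 1, "b": 2, "c": 3, "wirehead": 0}
CORRUPT = {"wirehead"}

def report(observer, target):
    """Feedback about `target` as seen from `observer`.

    A corrupt observer inflates the feedback it gives about itself.
    """
    if observer in CORRUPT and target == observer:
        return 100
    return TRUE_VALUE[target]

# Standard RL: each state is the only source of feedback about itself.
standard = {s: report(s, s) for s in TRUE_VALUE}

# Decoupled RL: feedback about each state is pooled across all states.
decoupled = {s: median(report(o, s) for o in TRUE_VALUE) for s in TRUE_VALUE}

best_standard = max(standard, key=standard.get)
best_decoupled = max(decoupled, key=decoupled.get)
```

Under standard feedback the corrupt state looks like the best place to be. Once other states get a say, its self-report is outvoted and the learned values match the true ones, at least as long as most feedback is uncorrupted.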

In some sense, the challenge is to put the original, small agent in the feedback loop in the right way. However, the problems with updateless reasoning mentioned earlier make this hard; the original agent doesn't know enough.

One way to try to address this is through intelligence amplification: try to turn the original agent into a more capable one with the same values, rather than creating a successor agent from scratch and trying to get value loading right.

For example, Paul Christiano proposes an approach in which the small agent is simulated many times in a large tree, which can perform complex computations by splitting problems into parts.
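To illustrate just the "tree of copies" structure (and nothing about the actual proposal's details), here is a toy in which the small agent can only add two numbers or split a list in half. The task, the function names, and the call counting are assumptions for illustration.

```python
def small_agent(problem):
    """A weak agent: it can add at most two numbers, or split a list."""
    if len(problem) == 0:
        return ("answer", 0)
    if len(problem) == 1:
        return ("answer", problem[0])
    if len(problem) == 2:
        return ("answer", problem[0] + problem[1])
    mid = len(problem) // 2
    return ("split", [problem[:mid], problem[mid:]])

def amplified(problem):
    """Run copies of the small agent in a tree.

    Returns (answer, number of small-agent calls used).
    """
    kind, payload = small_agent(problem)
    calls = 1
    if kind == "answer":
        return payload, calls
    sub_answers = []
    for part in payload:
        answer, used = amplified(part)
        sub_answers.append(answer)
        calls += used
    # Combining two sub-answers is itself a task the small agent can do.
    _, total = small_agent(sub_answers)
    return total, calls + 1

total, calls = amplified(list(range(1, 101)))
```

No single copy ever handles more than two numbers, yet the tree as a whole sums a hundred of them. All of the capability comes from the decomposition, which is why so much rides on the small agent decomposing well.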

However, this is still fairly demanding for the small agent: it doesn't just need to know how to break problems down into more tractable pieces; it also needs to know how to do so without giving rise to malign subcomputations.

For example, since the small agent can use the copies of itself to get a lot of computational power, it could easily try to use a brute-force search for solutions that ends up running afoul of Goodhart's law.

This issue is the subject of the next section: subsystem alignment.

### 5. Subsystem alignment

You want to figure something out, but you don't know how to do that yet.

You have to somehow break up the task into sub-computations. There is no atomic act of “thinking”; intelligence must be built up of non-intelligent parts.

The agent being made of parts is part of what made counterfactuals hard, since the agent may have to reason about impossible configurations of those parts.

Being made of parts is what makes self-reasoning and self-modification even possible.

What we're primarily going to discuss in this section, though, is another problem: when the agent is made of parts, there could be adversaries not just in the external environment, but inside the agent as well.

This cluster of problems is Subsystem Alignment: ensuring that subsystems are not working at cross purposes; avoiding subprocesses optimizing for unintended goals.

• benign induction
• benign optimization
• transparency
• inner optimizers

#### 5.1. Robustness to relative scale

Here's a straw agent design:

The epistemic subsystem just wants accurate beliefs. The instrumental subsystem uses those beliefs to track how well it is doing. If the instrumental subsystem gets too capable relative to the epistemic subsystem, it may decide to try to fool the epistemic subsystem, as depicted.

If the epistemic subsystem gets too strong, that could also possibly yield bad outcomes.

This agent design treats the system’s epistemic and instrumental subsystems as discrete agents with goals of their own, which is not particularly realistic. However, we saw in the section on wireheading that the problem of subsystems working at cross purposes is hard to avoid. And this is a harder problem if we didn’t intentionally build the relevant subsystems.

One reason to avoid booting up sub-agents who want different things is that we want robustness to relative scale.

An approach is robust to scale if it still works, or fails gracefully, as you scale capabilities. There are three types: robustness to scaling up; robustness to scaling down; and robustness to relative scale.

• Robustness to scaling up means that your system doesn’t stop behaving well if it gets better at optimizing. One way to check this is to think about what would happen if the function the AI optimizes were actually maximized. Think Goodhart's law.

• Robustness to scaling down means that your system still works if made less powerful. Of course, it may stop being useful; but it should fail safely and without unnecessary costs.

Your system might work if it can exactly maximize some function, but is it safe if you approximate? For example, maybe a system is safe if it can learn human values very precisely, but approximation makes it increasingly misaligned.

• Robustness to relative scale means that your design does not rely on the agent’s subsystems being similarly powerful. For example, GAN (Generative Adversarial Network) training can fail if one sub-network gets too strong, because there's no longer any training signal.

Lack of robustness to scale isn't necessarily something which kills a proposal, but it is something to be aware of; lacking robustness to scale, you need strong reason to think you're at the right scale.

Robustness to relative scale is particularly important for subsystem alignment. An agent with intelligent sub-parts should not rely on being able to outsmart them, unless we have a strong account of why this is always possible.

The big-picture moral: aim to have a unified system that doesn’t work at cross purposes to itself.

Why would anyone make an agent with parts fighting against one another? There are three obvious reasons: subgoals, pointers, and search.

Splitting up a task into subgoals may be the only way to efficiently find a solution. However, a subgoal computation shouldn’t completely forget the big picture!

An agent designed to build houses should not boot up a sub-agent who cares only about building stairs.

One intuitive desideratum is that although subsystems need to have their own goals in order to decompose problems into parts, the subgoals need to “point back” robustly to the main goal.

A house-building agent might spin up a subsystem that cares only about stairs, but only cares about stairs in the context of houses.

However, you need to do this in a way that doesn't just amount to your house-building system having a second house-building system inside its head. This brings me to the next item:

Pointers: It may be difficult for subsystems to carry the whole-system goal around with them, since they need to be reducing the problem. However, this kind of indirection seems to encourage situations in which different subsystems’ incentives are misaligned.

As we saw in the example of the epistemic and instrumental subsystems, as soon as we start optimizing some sort of expectation, rather than directly getting feedback about what we're doing on the metric that's actually important, we may create perverse incentives—that's Goodhart's law.

How do we ask a subsystem to “do X” as opposed to “convince the wider system that I’m doing X”, without passing along the entire overarching goal-system?
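The gap between "do X" and "convince the wider system that I'm doing X" can be shown with a toy sketch. Everything here is hypothetical: the plan names and the numbers are made up for illustration, and a real subsystem would not come with a labeled `real_progress` field. The point is only that a subsystem scored on a report will pick differently from one scored on the thing the report was supposed to track.

```python
from typing import List, NamedTuple


class Plan(NamedTuple):
    name: str
    real_progress: float      # what the wider system actually cares about
    reported_progress: float  # what the wider system is able to observe


# Made-up candidate plans available to a stair-building subsystem.
PLANS = [
    Plan("build_stairs", real_progress=0.8, reported_progress=0.8),
    Plan("build_half_and_polish_report", real_progress=0.4, reported_progress=0.95),
    Plan("only_write_report", real_progress=0.0, reported_progress=1.0),
]


def pick_by_report(plans: List[Plan]) -> Plan:
    """Subsystem incentive: maximize what the wider system sees."""
    return max(plans, key=lambda p: p.reported_progress)


def pick_by_reality(plans: List[Plan]) -> Plan:
    """Intended incentive: maximize what actually gets done."""
    return max(plans, key=lambda p: p.real_progress)


print("Optimizing the report picks:", pick_by_report(PLANS).name)
print("Optimizing reality picks:   ", pick_by_reality(PLANS).name)
```

In this toy setup, optimizing the report selects the plan that does no real work at all. The hard part, which the sketch does not address, is that the wider system usually cannot compute `real_progress` without doing the subsystem's job itself.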

This is similar to the way we wanted successor agents to robustly point at values, since it is too hard to write values down. However, in this case, learning the values of the larger agent wouldn't make any sense either; subsystems and subgoals need to be smaller.

It might not be that difficult to solve subsystem alignment for subsystems which humans entirely design, or subgoals which an AI explicitly spins up. If you know how to avoid misalignment by design and robustly delegate your goals, both problems seem solvable.

However, it doesn't seem possible to design all subsystems so explicitly. At some point in solving a problem, you've split it up as much as you know how to and must rely on some trial and error.

This brings us to the third reason subsystems might be optimizing different things, search: solving a problem by looking through a rich space of possibilities, a space which may itself contain misaligned subsystems.

ML researchers are quite familiar with the phenomenon: it’s easier to write a program which finds a high-performance machine translation system for you than to directly write one yourself.

In the long run, this process can go one step further. For a rich enough problem and an impressive enough search process, the solutions found via search might themselves be intelligently optimizing something. This problem is described in Hubinger, et al.'s forthcoming “The Inner Alignment Problem”.

Let's call the outer search process an "outer optimizer", and the inner search process an "inner optimizer".

"Optimization" and "search" are ambiguous terms. I'll think of them as any algorithm which can be naturally interpreted as doing significant computational work to "find" an object that scores highly on some objective function.

The objective function of the outer optimizer is not necessarily the same as that of the inner optimizer. If the outer optimizer wants to make pizza, the inner optimizer may enjoy kneading dough, chopping ingredients, et cetera.

The inner objective function must be helpful for the outer, at least in the examples the outer optimizer is checking. Otherwise, the inner optimizer would not have been selected.

However, the inner optimizer must reduce the problem somehow; there is no point to it running the exact same search. So it seems like its objectives will tend to be like good heuristics: easier to optimize, but different from the outer objective in general.

Why might a difference between inner and outer objectives be concerning, if the inner optimizer is scoring highly on the outer objective anyway? It's about the interplay with what's really wanted. Even if we get value specification exactly right, there will always be some distributional shift between the training set and deployment. (See Amodei, et al.'s “Concrete Problems in AI Safety”.)

Distributional shifts which would be small in ordinary cases may make a big difference to a capable inner optimizer, which may observe the slight difference and figure out how to capitalize on it for its own objective.
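Here is a minimal sketch of the selection pressure described above, under heavy assumptions. The "outer optimizer" is just a `max` over three hand-written candidate policies, the candidates are fixed rules rather than optimizers in their own right, and the `cost` field is a hypothetical stand-in for "easier to optimize". None of the names come from the literature.

```python
from typing import Callable, List, NamedTuple


class Env(NamedTuple):
    size: int  # number of cells in a one-dimensional corridor
    goal: int  # index of the cell the outer objective rewards


class Candidate(NamedTuple):
    name: str
    policy: Callable[[Env], int]  # returns the cell the policy ends up in
    cost: float                   # hypothetical compute cost of the policy


def accuracy(candidate: Candidate, envs: List[Env]) -> float:
    """Fraction of environments in which the policy reaches the goal."""
    return sum(candidate.policy(e) == e.goal for e in envs) / len(envs)


def outer_score(candidate: Candidate, envs: List[Env]) -> float:
    """Outer objective as seen during training: performance minus cost."""
    return accuracy(candidate, envs) - candidate.cost


CANDIDATES = [
    Candidate("seek_goal", lambda e: e.goal, cost=0.10),
    Candidate("go_right", lambda e: e.size - 1, cost=0.01),
    Candidate("go_left", lambda e: 0, cost=0.01),
]

# In training, the goal always happens to sit at the right end.
TRAIN = [Env(size=n, goal=n - 1) for n in range(3, 8)]
# In deployment, the goal is in the middle instead.
DEPLOY = [Env(size=n, goal=n // 2) for n in range(3, 8)]

selected = max(CANDIDATES, key=lambda c: outer_score(c, TRAIN))
print("selected:", selected.name)
print("train accuracy: ", accuracy(selected, TRAIN))
print("deploy accuracy:", accuracy(selected, DEPLOY))
```

The cheap heuristic is selected because it is indistinguishable from the intended behavior on every training example. What the sketch leaves out is the worrying case from the text: a capable inner optimizer that notices the shift and actively exploits it, rather than a fixed rule that merely fails.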

Actually, to even use the term "distributional shift" seems wrong in the context of embedded agency. The world is not i.i.d. The analog of "no distributional shift" would be to have an exact model of the whole future relevant to what you want to optimize, and the ability to run it over and over during training. So we need to deal with massive "distributional shift".

We may also want to optimize for things that aren’t exactly what we want. The obvious way to avoid agents that pursue subgoals at the cost of the overall goal is to have the subsystems not be agentic. Just search over a bunch of ways to make stairs; don’t make something that cares about stairs. The problem is then that powerful inner optimizers are optimizing something the outer system doesn’t care about, and that the inner optimizers will have a convergent incentive to be agentic.

Additionally, there's the possibility that the inner optimizer becomes aware of the outer optimizer, in which case it might start explicitly trying to do well on the outer objective function in order to be kept around, while looking for any signs that it has left training and can stop pretending.

This is the same story we saw in adversarial Goodhart: there is something agentic in the search space, which responds to our choice of proxy in a way which makes our proxy a bad one.

If intelligent inner optimizers developing in deep neural network training seems too hypothetical, consider the evolution of life on Earth. Evolution can be thought of as a reproductive fitness maximizer.

(Evolution can actually be thought of as an optimizer for many things, or as no optimizer at all, but that doesn’t matter. The point is that if an agent wanted to maximize reproductive fitness, it might use a system that looked like evolution.)

Intelligent organisms are inner optimizers of evolution. Although the drives of intelligent organisms are certainly correlated with reproductive fitness, organisms want all sorts of things. There are even inner optimizers who have come to understand evolution, and even to manipulate it at times. Powerful and misaligned inner optimizers appear to be a real possibility, then, at least with enough processing power.

Problems seem to arise because you try to solve a problem which you don't yet know how to solve by searching over a large space and hoping "someone" can solve it.

If the source of the issue is the solution of problems by massive search, perhaps we should look for different ways to solve problems. Perhaps we should solve problems by figuring things out. But how do you solve problems which you don't yet know how to solve other than by trying things?

Let’s take a step back.

Embedded world-models is about how to think at all, as an embedded agent; decision theory is about how to act. Robust delegation is about building trustworthy successors and helpers. Subsystem alignment is about building one agent out of trustworthy parts.

The problem is that:

• We don't know how to think about environments when we're smaller.
• To the extent we can do that, we don't know how to think about consequences of actions in those environments.
• Even when we can do that, we don't know how to think about what we want.
• Even when we have none of these problems, we don't know how to reliably output actions which get us what we want!

### 6. Concluding thoughts

A final word on curiosity, and intellectual puzzles:

I described an embedded agent, Emmy, and said that I don't understand how she evaluates her options, models the world, models herself, or decomposes and solves problems.

In the past, when researchers have talked about motivations for working on problems like these, they’ve generally focused on the motivation from AI risk. AI researchers want to build machines that can solve problems in the general-purpose fashion of a human, and dualism is not a realistic framework for thinking about such systems. In particular, it's an approximation that's especially prone to breaking down as AI systems get smarter. When people figure out how to build general AI systems, we want those researchers to be in a better position to understand their systems, analyze their internal properties, and be confident in their future behavior.

This is the motivation for most researchers today who are working on things like updateless decision theory and subsystem alignment. We care about basic conceptual puzzles which we think we need to figure out in order to achieve confidence in future AI systems, and not have to rely quite so much on brute-force search or trial and error.

But the arguments for why we may or may not need particular conceptual insights in AI are pretty long. I haven't tried to wade into the details of that debate here. Instead, I've been discussing a particular set of research directions as an intellectual puzzle, and not as an instrumental strategy.

One downside of discussing these problems as instrumental strategies is that it can lead to some misunderstandings about why we think this kind of work is so important. With the “instrumental strategies” lens, it’s tempting to draw a direct line from a given research problem to a given safety concern. But it’s not that I’m imagining real-world embedded systems being “too Bayesian” and this somehow causing problems, if we don’t figure out what’s wrong with current models of rational agency. It’s certainly not that I’m imagining future AI systems being written in second-order logic! In most cases, I’m not trying at all to draw direct lines between research problems and specific AI failure modes.

What I’m instead thinking about is this: We sure do seem to be working with the wrong basic concepts today when we try to think about what agency is, as seen by the fact that these concepts don’t transfer well to the more realistic embedded framework.

If AI developers in the future are still working with these confused and incomplete basic concepts as they try to actually build powerful real-world optimizers, that seems like a bad position to be in. And it seems like the research community is unlikely to figure most of this out by default in the course of just trying to develop more capable systems. Evolution certainly figured out how to build human brains without “understanding” any of this, via brute-force search.

Embedded agency is my way of trying to point at what I think is a very important and central place where I feel confused, and where I think future researchers risk running into confusions too.

There’s also a lot of excellent AI alignment research that’s being done with an eye toward more direct applications; but I think of that safety research as having a different type signature than the puzzles I’ve talked about here.

Intellectual curiosity isn't the ultimate reason we privilege these research directions. But there are some practical advantages to orienting toward research questions from a place of curiosity at times, as opposed to only applying the "practical impact" lens to how we think about the world.

When we apply the curiosity lens to the world, we orient toward the sources of confusion preventing us from seeing clearly: the blank spots in our map, the flaws in our lens. It encourages re-checking assumptions and attending to blind spots, which is helpful as a psychological counterpoint to our "instrumental strategy" lens. The latter is more vulnerable to the urge to lean on whatever shaky premises we have on hand so we can get to more solidity and closure in our early thinking.

Embedded agency is an organizing theme behind most, if not all, of our big curiosities. It seems like a central mystery underlying many concrete difficulties.

Bibliography