One of the more solid results in agency theory is the generalized definition of power-seeking. Power-seeking is the tendency of agents to move towards states with the potential to achieve the best average outcome across some distribution of reward functions. In other words, looking at an agent whose values we do not know, we can a priori expect it to pursue "power", because it's trying to navigate to some specific end-state, and the path to it likely routes through environment-states in which it can choose from the largest variety of end-states. For example: take over the world so it can do whatever it wants.

Can we derive a similar definition for the convergent development of world-models?

I believe so, and it feels surprisingly simple in retrospect.

"World-models" are sets of statistical correlations across input-data. Every next correlation you notice — like object permanence, or the laws of gravity, or that your friend saying "I've ordered pizza" correlates with the image of a pizza entering your visual feed half an hour later — is another building block of your world-model.

A particularly important kind of statistical correlations are those that constitute selection pressures. When some system experiences selection pressure — be that an NN trained via the SGD, an animal species optimized by evolution, or a corporation shaped by market forces — it's receiving positive or negative feedback from the environment in a statistically predictable manner. The strength of such pressures increases or decreases according to certain correspondences between environment-states and the system's actions, and by definition of being optimized/selected, the system is gradually adjusted to minimize the pressure.

To do that, the system needs to (be made to) recognize the statistical correlations around it, pick out the ones related to the selection pressure, and enforce such correspondences between them and its actions as to minimize loss/survive/maximize revenue.

In turn, every bit of feedback from the selection pressure gives the system information about that pressure (somehow). It builds into the system some understanding of the causal structure of the environment, by the very nature of selecting it.

When the system can only myopically respond to the input-data, when it can only uncover correlations on the level of e. g. pixels, its repository of statistical correlations is poor. As that repository grows, as its world-model develops, it starts being able to imagine more complex correlations:

(world-model(t)→action(t))→feedback(t)

"If the world is currently in this state, and I enforce such correlation between some elements of that state and my actions, it will correlate with me receiving such feedback."

And since any given system starts out not knowing what statistical correlations its selection pressure depends on, it's convergent to try to derive every statistical correlation it can — build a complete world-model, then search it for the sources of feedback it receives.

Thus: Any system subjected to some selection pressure the strength of which depends on some correlations between that system's actions and some features of the environment would converge towards discovering as many environmental statistical correlations as possible, in order to be more well-positioned to "guess" what correlation between the environment and its actions it needs to enforce in order to minimize that pressure.

The system doesn't know what the world wants of it, so it'll learn everything about the world in order to reverse-engineer that want.

Now let's try to show all of that in a toy mathematical environment.

2. Formalisms

2.1. The Setup

Suppose that we have some environmentX represented as a causal (not necessarily acyclic) graph G, with nodes xi∈X. The environment is dynamic: every time-step t, the value xti of every child-node is updated by its parent-nodes, as in xt+1i:=fi(xti,Xtpa(xi)), where Xpa(xi)⊂X is the set of parental nodes of xi.

An intervention function do(At=N) sets the value of every node ai∈A⊂X to some corresponding value ni∈N for the time-step t. This function will be used to model actions.

The System, every time-step, takes in the values of some set of observables O⊂X, and runs do on some set of actionables A⊂X. After that, it receives reward(t) from the Selection Pressure.

The Selection Pressure, every time-step, takes in the values of some set of reward nodes R⊂X, and outputs a score. That is, reward:Rt→R.

Sidebar: Technically, reward and the System can be made parts of the environment as well. With reward, it's trivial — just imagine that every ri∈R is a parent of the actual reward node, and reward(t) is the update function of that node.

With the System, it's a bit more difficult. You can imagine that every observable plus reward has a node agent as its child-node, every actionable has agent as its only parental node, and that agent somehow controls the update functions from itself to the actionables. Alternatively, you might imagine agent to be a special node that represents a black-boxed cluster of nodes, to explain its ability to uniquely specify the value of each actionable. You may also imagine that the internals of this node perform at a much faster speed than the environment, so all of its computations happen in a single time-step. Or you may not.

Postulating all of this is a needless complication for the purposes of this post, though, so I won't be doing this. But you can, if you want to expand on it while getting rid of Cartesianism.

Given this setup, what internal mechanisms would you need to put in the System to improve its ability to maximize reward across time, given that it would start from a place of total ignorance with regards to the environment structure, the current environment-state, and the nature of reward?

2.2. Ideal Actions

Buuut first let's consider the environment from the position of omniscience. Given full knowledge of the environment structure, its current state, and the reward nodes, what can we say about the optimal policy?

Let's imagine a toy environment where reward is a function of the value of just one node, xi. The actionables, likewise, are a singular other node a. Suppose the reward function wants the value of r to equal n at every time-step.

If a=xi, the optimal policy at the time-step t is simple: do({at}={n}).

If a is adjacent to xi, then we'll only be able to act on it on the time-step following the current one, and we'll do it through the update function between a and xi, whose output also depends on the values of xi's parents. I. e., the optimal action is a function of xti, fi, and Xtpa(xi).

If a is separated by some node x1 from xi, we'll only affect xi two time-steps later, and the value we send will be interfered with by the parents of x1 at t+1 and the parents of xi at t+1and at t+2.

The optimal action, thus, is a function of f1, fi, xt1, xt+1i, Xtpa(x1), and Xt+1pa(xi). Note that xt+1i is additionally a function of xti and Xtpa(xi), and Xt+1pa(xi) is additionally a function of the update functions for all members of that set, and the values of every parent of every member of that set.

Generally, if the shortest path from a to the reward-node xi consists of i nodes x1,x2,…,xi, we'll only be able to affect it i time-steps later, and the optimal action will be a function of f1,f2,…fi, xt1,xt+12,…,xt+i−1i, Xtpa(x1),Xt+1pa(x2),...,Xt+i−1pa(xi).

Thus, the farther away the node we want to affect, the broader the "causality cone" of nodes we'll need to take into consideration. Or, equivalently, the current action is a function of further-in-time environment-states.

That complexity is not necessarily irreducible. Certain combinations of environment structure and update functions can result in some of the parents reliably cancelling each other out, such that the actual function for computing the most optimal action is a lower-dimensional one. If the environment is a well-abstracting one, such dynamics might be ubiquitous, such that only a higher-level model of it suffices.

But in general, even taking such techniques into account, the "farther" the target node is, the more other nodes you'd need to take into account.

2.3. Heuristics

Some definitions:

An idealized model Mt is the probability distribution over the environment-state given some information It. That is: Mt:=P(Xt | It)

A cross-temporal model of the environment is some set containing all values of nodes across some time period: M[ta:tb]:=⋃t∈[ta,tb]Mt.

A cross-temporal slice of the environment is some subset M[ta:tb]si⊂M[ta:tb].

Similarly, we have a cross-temporal slice of the action-space: A[ta:tb]sk⊂M[ta:tb], where all nodes whose values are in A[ta:tb]sk are actionables.

In relation to this, given the current time-step t0, we can define a heuristic as follows:

A heuristic uses some function ci to look at some cross-temporal slice and judge whether some conditions it wants are met. If not (c=0), it executes the do part.

In English: given a model of the environment, a heuristic recommends taking certain action at present or at particular future points given certain beliefs about current, future, or pastenvironment-states. A heuristic does not necessarily have a recommended action for every intermediate time-step, or a recommendation for every actionable. Several heuristics can be ran in the same time-step, in fact, if their action-slices aren't overlapping.

(Note that the cross-temporal actions slice is a subset of M[t0:tb], because you can't take actions in the past.)

Intuitively, heuristics are likely defined over natural abstractions, or some high-level environment-models wholly build out of them. ci would look for a particular object — a cluster of individual nodes whose internal complexity can be reduced to a high-level summary — and check its high-level state, possibly across time. For certain "undesirable" states or dynamics, it would attempt to intervene, correcting them.

As an example, consider a grandmaster playing a game of chess. They've developed a lot of chess-specific heuristics. These heuristics look exclusively at the cross-temporal slice of the world-model that has to do with the current game. What moves the grandmaster made, what moves their opponent made, and what moves both of them may make in the future. One of these heuristics may recognize a cross-temporal pattern — a particular sequence of moves the opponent made, perhaps, plus the moves they may be planning to make — and map that pattern to the memorized correct response to it, countering the tactic the opponent is trying.

Caveats: That function is technically incoherent, since M[ta:tb], which it takes as an input, can presumably only be computed with the knowledge of the actions the System will take in the future, and these actions are what the heuristic computes as an output. There's a couple of ways around that: M[ta:tb] might be a "placeholder future", computed under the assumption that the System takes some default actions/null action. Or it might be the output of some other heuristic, passed to this one for potential correction.

The degenerate case of a heuristic, of course, is t0=ta=tb and Mt0=Ot0:

ht0i(t0,Ot0)=⟨do(At0s0=gi(Ot0)) | ci(Ot0s0)=0⟩

This is an "instinctive reaction", an instant response to some stimuli.

Another simplified, but interestingly different case, is t0=ta=tb, but with Mt0s0 being a few steps removed from the observations. Intuitively, this is how CNN image classifiers work: they take in a bunch of pixels, then extrapolate the environment-state these pixels are a snapshot of, and back out the values of "is the node dog = 1?", "is the node cat = 1?".

Okay, now let's try to answer the question.

2.4. The Need for World-Models

The System starts from a place of complete ignorance about the environment. It doesn't know its structure, its state, what nodes reward is defined over, or even that there is an outside world. All information it has access to are observations O and the scores it receives for taking actions A.

Any heuristic it can implement would follow the algorithm of, "IF (some set of observables has certain values) THEN (set some actionables to certain values)". Thus, its set of heuristics is defined over the power sets of observables P(O) and actionables P(A). Heuristics, thus, would have a type signature Oi→Ak, where Oi∈P(O), Ak∈P(A).

From a state of total ignorance, there's no strategy better than guessing blindly. So suppose the System samples a few random heuristics and runs them in a few episodes. Upon receiving reward, it would get some information about their effectiveness, and would ideally keep effective heuristics and improve on them while discarding bad ones. But how?

So suppose it is solved, somehow. What properties would the solution have?

Whatever it is, it would establish a causal connection between observations, actions, and the feedback: (Oi→Ak)→reward, and also reward←Y→Oi, since if it's possible to derive actionable information about reward from Oi, they have to be somehow causally connected.

In other words: whatever the credit assignment mechanism, whether it's external (as in ML) or purely internal, it would convey information about the structure of the environment.

Thus, even as the System is blindly guessing shallow heuristics, it's learning a world-model. Even if only implicitly.

And acting on observables can only get us so far. Literally: as per 2.2, the farther away from our action-nodes the reward-node is, the more intermediate nodes we have to take into account to compute the correct action to take. Turning it around, the farther from the reward-node the observation-nodes are, the more "diffuse" the information about the reward-node they contain. In all probability, all individual a∈A have to be calculated as a function of all observables, to maximize our ability to map out the surrounding environment.

And this is where the world-models come in.

Let's assume that world-models can be incomplete — as in, contain only some part of the environment graph G. For simplicity, let's assume that over time, it's expanded by one degree of separation d in every direction, denoted as Md. So Md=0 contains only the observables and the actionables, Md=1 also contains all of their parents and children, Md=2 contains parents and children of all nodes in Md=1, and so on.

Knowing the partial structure of the environment and the values of some of its variables (the observables) at t allows us to reconstruct its state: Mt,d=P(Xt,d|It), where Xt,d⊂Xtand It is some internal state (including the knowledge of the environment structure and likely a history of previous observations and the actions taken).

Further, knowing the environment structure, the state at t, and what actions we plan to take, lets us compute the model of the next environment state, but at a one less degree of separation: run:Mt,d×At→Mt+1,d−1.

Likewise, under some assumptions about the transition functions f, we can often model the environment backwards: back:Mt,d×At−1→Mt−1,d−1.

Let's denote the cross-temporal model we get from that as M[ta:tb],d|t.

The main effect of all of this is that it greatly expands the space of heuristics: they're now defined over the power sets of actionables and all modelable nodes:

At the limit of infinite time, the System may expand the world-model to cover the entire environment. That would allow it to design optimal heuristics for controlling every reward-node ri∈R, akin to the functions mentioned in 2.2, regardless of the specific environment-structure it's facing.

3. Extending the Framework

I believe the main point is made now, but the model could be extended to formalize a few more intuitions.

3.1. Incrementalism

The System won't start knowing the entire world-model, however. Conversely, it would start generating heuristics from the get-go. In theory, once the world-model is fully learned, there's nothing preventing the System from deriving and maximizing reward.

But what if we add friction?

The credit-assignment mechanism needs to know how to deal with imperfect heuristics, if it's to be useful before the world-model is complete. A generally good heuristic making a mistake shouldn't be grounds for immediately deleting it. We'd also need to know how to resolve heuristics conflicts: situations when the cross-temporal model conditioned on the System running some heuristic hi makes some other heuristic hk try to correct it, which causes hi try to re-correct it, and so on. Which one should be allowed to win?

The notion of a "weight" seems the natural answer to that need. Heuristics to which credit is assigned more often and in greater quantities shall have more weight than the less well-performing ones, and in case of a heuristics conflict, the priority shall be more often given to the weightier one.

Suppose that the individual reward-nodes ri aren't sitting in a cluster. They're scattered across the graph, such that you can often affect one of them without affecting the others. That will likely give rise to specialized heuristics, which focus on optimizing the state of just one or several such reward-node. Only the full heuristical ensemble would be optimizing reward directly.

Now consider the following structure:

Suppose that we have some heuristic hxp that's greatly specialized in optimizing xp. (I. e., it models the graph up to xp and picks actions that set xp to some value that the heuristic prefers. And it prefers it because setting xp to it has historically correlated with high reward.) Once the environment is mapped out up to r, that heuristic can be replaced by one focusing on r directly. But what if hxp has a lot of weight by this point? The hypothetical hr would not have that much to contribute — controlling xp is a good proxy for controlling r, and if xk,xe don't have a very large effect, hxp probably captures most of the reward that can be squeezed out of r without taking them into account.

The better-performing hr would simply never be able to edge the older hxp out, then.

Thus, the System would be optimized for a reward proxy.

And this is likely to be an ubiquitous outcome. The world-model is being developed in lockstep with new heuristics, and if the reward-nodes are sufficiently conceptually-far from the actionables, it'll find good reward-proxies well before it discovers the actual reward-nodes. And at that point, the proxy-controlling heuristics will be impossible to depose.

3.2. The Planning Loop

The definition of a heuristic has a free parameter: the mysterious function g, which somehow calculates what actions to take given the world-model.

Its internals can have different contents:

It may output a constant value, discarding M[ta:tb],d|t.

It may be a lookup table, taking some members of M[ta:tb]Si as input.

It may run a somewhat more sophisticated algorithm, looking at the values of some nodes and "extracting" the answer from the world-model.

But most salient are algorithms of the following kind:

This process searches the world-model for a series of actions that will cause some set of nodes in it to assume certain values at certain times. I. e., "how do I make X happen?".

That is the planning loop/inner optimization. Since we expect heuristics to be defined in relation to some natural abstractions — i. e., the condition c is looking for is some coherent high-level concept — we can assume that c implicitly includes some mesa-objective. (Returning to the chess-game example, the heuristic with a planning loop would be searching the world-model for actions that lead to the condition "the grandmaster wins the game", or maybe "the opponent's tactic is countered".)

The planning loop may be niche. Instead of the entire Mt0,d, it may search over some Mt0S∈Mt0,d. I. e., a given g implemented in a given heuristic may only know how to optimize over a particular cluster of nodes in the world-model — the cluster the heuristic is specialized in. (E. g., the model of chess.)

Sidenote: Similar analysis can be made with regards to the function c. The environment conditions for firing a heuristic can also be incredibly complex, so complex as to require intelligent analysis. I suspect there may be something interesting to be found in this direction as well. Consider, e. g., a heuristic with a very complex c but a lookup table for g? I think we see some examples of that in humans...

Sidenote #2: A lot of complexity is still hidden in g and c. A mesa-optimizer would not perform blind search — it has some heuristics/mechanisms for efficiently guessing what actions it makes sense to try running the world-model on, before it actually performs search. I suspect it's some set of "advanced cognitive functions", and that complex c likely share it with complex g.

3.3. The Mesa-Optimizer Pipeline

So how does all of that grow into a terrifying lightcone-eating utility-maximizer? There's a few paths.

First, g may be shared across all heuristics. As per the Orthogonality Thesis, intelligence is orthogonal to values: a good optimization algorithm can be used to optimize any objective. This model concretizes this intuition somewhat. If g can perform search in the entire world-model, there's no reason to replicate its functionality, and every reason to hook all the heuristics up to the same algorithm — to minimize memory expenditure.

In this scenario, the resultant AGI would be optimizing a mix of mesa-objectives, a weighted sum of the heuristics it developed. Possibly inconsistently, as different heuristics grab control at different times. Possibly unstably, if new heuristics are still being derived. Humans may work like this, see Why Subagents? and the Shard Theory.

Second, a more nuanced version of the above. The AGI may develop a sort of meta-heuristic, which would explicitly derive the weighted average of the extant heuristics, then set that as its primary mesa-objective, and pursue it consistently. There's probably a pressure to do that, as to do otherwise is a dominated strategy. It may also be required depending on the specifications of reward — if getting it to hit high values requires high coordination^{[1]}.

I suspect humans work like this some of the time — when we're consciously optimizing or performing value reflection, instead of acting on autopilot.

Third, one of the heuristics may take over. Suppose that g isn't shared after all; every heuristic uses some specialized search algorithm. Then, the heuristic with the most advanced one starts to expand, gradually encroaching on others' territory. It should be pretty heavily weighted, probably heavier than any of the others, so that's not inconceivable. In time, it may take over others' roles, and the entire system would become a wrapper-mind optimizing some simple mesa-objective.

That heuristic would probably need to do it deliberately. As in, it'd need to become an advanced enough agent to model this entire dynamic, and purposefully take over the mind it's running in, for instrumental power-seeking reasons, while actively preventing its mesa-objective from changing. It'd need to model the base objective instead of internalizing it.

It feels like this scenario is relatively unlikely: that heuristics are unlikely to be this lopsided in their capabilities. On the other hand, what do I know about how that works in neural networks trained by the SGD the specific way we're doing it now?

Fourth, given some advanced heuristic management algorithms, we may reduce the friction described at the beginning of 3.1. If heuristics don't ossify, and are frequently replaced by marginally-better-performing alternatives, the inner alignment failures the three previous scenarios represent won't happen. The AGI would grow to optimize the outer reward function it's been optimized by.

(Edit: Correction, it will probably wirehead instead. If, as per the sidebar in 2.1, we view reward itself as a node on the graph, then a frictionless setup would allow the AGI navigate to it directly, not to the in-environment variables it's defined over. To fix this, we'd need to set up some special "Cartesian boundary" over the reward node, ensuring it doesn't make that final jump.)

This scenario seems unlikely without some advanced, probably intelligent oversight. E. g., very advanced interpretability and manual-editing tools under the control of humans/another ML model.

Acknowledgements

Thanks to Quintin Pope, TurnTrout, Logan Riggs, and others working on the Shard Theory: the thinking behind this post has been heavily inspired by it.

(In fact, I suspect that the concept I'm referring to by "heuristic" in this post is synonymous with their "shards". Subjective opinion, though, I haven't run it by the ST folks.)

^{^}

E. g., if reward(t)=xti+xtj+xtk, we can have separate heuristics for maximizing the values of all of these nodes, operating mostly independently. On the other hand, if reward(t)=1−|3−EXP(xtj⋅xtk)|, that'd require tighter coordination to ensure the product approximates ln3.

## 1. Intuitions

One of the more solid results in agency theory is the generalized definition of power-seeking. Power-seeking is the tendency of agents to move towards states with the potential to achieve the best average outcome across some distribution of reward functions. In other words, looking at an agent whose values we do not know, we can

a prioriexpect it to pursue "power", because it's trying to navigate to somespecificend-state, and the path to it likely routes through environment-states in which it can choose from the largest variety of end-states. For example: take over the world so it can do whatever it wants.Can we derive a similar definition for the convergent development of world-models?

I believe so, and it feels surprisingly simple in retrospect.

"World-models" are sets of statistical correlations across input-data. Every next correlation you notice — like object permanence, or the laws of gravity, or that your friend saying "I've ordered pizza" correlates with the image of a pizza entering your visual feed half an hour later — is another building block of your world-model.

A particularly important kind of statistical correlations are those that constitute

selection pressures. When some system experiences selection pressure — be that an NN trained via the SGD, an animal species optimized by evolution, or a corporation shaped by market forces — it's receiving positive or negative feedback from the environment in a statistically predictable manner. The strength of such pressures increases or decreases according to certain correspondences between environment-states and the system's actions, and by definition of being optimized/selected, the system is gradually adjusted to minimize the pressure.To do that, the system needs to (be made to) recognize the statistical correlations around it, pick out the ones related to the selection pressure, and enforce such correspondences between them and its actions as to minimize loss/survive/maximize revenue.

In turn, every bit of feedback from the selection pressure gives the system information about that pressure (somehow). It builds into the system some understanding of the causal structure of the environment, by the very nature of selecting it.

When the system can only myopically respond to the input-data, when it can only uncover correlations on the level of e. g.

(world-model(t)→action(t))→feedback(t)pixels, its repository of statistical correlations is poor. As that repository grows, as its world-model develops, it starts being able to imagine more complex correlations:"If the world is currently in

thisstate, and I enforcesuchcorrelation between some elements of that state and my actions, it will correlate with me receivingsuchfeedback."And since any given system starts out not knowing what statistical correlations its selection pressure depends on, it's convergent to try to derive

everystatistical correlation it can — build a complete world-model, then search it for the sources of feedback it receives.Thus: Any system subjected to some selection pressure the strength of which depends on some correlations between that system's actions and some features of the environment would converge towards discovering as many environmental statistical correlations as possible, in order to be more well-positioned to "guess" what correlation between the environment and its actions it needs to enforce in order to minimize that pressure.

The system doesn't know what the world wants of it, so it'll learn everything about the world in order to reverse-engineer that want.

Now let's try to show all of that in a toy mathematical environment.

## 2. Formalisms

## 2.1. The Setup

Suppose that we have some environment X represented as a causal (not necessarily acyclic) graph G, with nodes xi∈X. The environment is dynamic: every time-step t, the value xti of every child-node is updated by its parent-nodes, as in xt+1i:=fi(xti,Xtpa(xi)), where Xpa(xi)⊂X is the set of parental nodes of xi.

An intervention function do(At=N) sets the value of every node ai∈A⊂X to some corresponding value ni∈N for the time-step t. This function will be used to model actions.

The System, every time-step, takes in the values of some set of observables O⊂X, and runs do on some set of actionables A⊂X. After that, it receives reward(t) from the Selection Pressure.

The Selection Pressure, every time-step, takes in the values of some set of reward nodes R⊂X, and outputs a score. That is, reward:Rt→R.

Given this setup, what internal mechanisms would you need to put in the System to improve its ability to maximize reward across time, given that it would start from a place of total ignorance with regards to the environment structure, the current environment-state, and the nature of reward?

## 2.2. Ideal Actions

Buuutfirst let's consider the environment from the position of omniscience. Given full knowledge of the environment structure, its current state, and the reward nodes, what can we say about the optimal policy?Let's imagine a toy environment where reward is a function of the value of just one node, xi. The actionables, likewise, are a singular other node a. Suppose the reward function wants the value of r to equal n at every time-step.

If a=xi, the optimal policy at the time-step t is simple: do({at}={n}).

If a is

adjacentto xi, then we'll only be able to act on it on the time-stepfollowingthe current one, and we'll do it through the update function between a and xi, whose output also depends on the values of xi's parents. I. e., the optimal action is a function of xti, fi, and Xtpa(xi).If a is separated by some node x1 from xi, we'll only affect xi two time-steps later, and the value we send will be interfered with by the parents of x1 at t+1 and the parents of xi at t+1

andat t+2.The optimal action, thus, is a function of f1, fi, xt1, xt+1i, Xtpa(x1), and Xt+1pa(xi). Note that xt+1i is additionally a function of xti and Xtpa(xi), and Xt+1pa(xi) is additionally a function of the update functions for all members of that set, and the values of every parent of every member of that set.

Generally, if the shortest path from a to the reward-node xi consists of i nodes x1,x2,…,xi, we'll only be able to affect it i time-steps later, and the optimal action will be a function of f1,f2,…fi, xt1,xt+12,…,xt+i−1i, Xtpa(x1),Xt+1pa(x2),...,Xt+i−1pa(xi).Thus, the farther away the node we want to affect, the broader the "causality cone" of nodes we'll need to take into consideration. Or, equivalently, the current action is a function of further-in-time environment-states.

That complexity is not necessarily irreducible. Certain combinations of environment structure and update functions can result in some of the parents reliably cancelling each other out, such that the actual function for computing the most optimal action is a lower-dimensional one. If the environment is a well-abstracting one, such dynamics might be ubiquitous, such that only a higher-level model of it suffices.

But

in general, even taking such techniques into account, the "farther" the target node is, the more other nodes you'd need to take into account.## 2.3. Heuristics

Some definitions:

In relation to this, given the current time-step t0, we can define a heuristic as follows:

h[ta:tb]i(t0,M[ta:tb]):=⟨do(A[t0:tn]sv=gi(M[ta:tb])) | (ci(M[ta:tb]su)=0)⟩A heuristic uses some function ci to look at some cross-temporal slice and judge whether some conditions it wants are met. If not (c=0), it executes the do part.

In English: given a model of the environment, a heuristic recommends taking certain action at present or at particular future points given certain beliefs about current, future, or pastenvironment-states. A heuristic does not necessarily have a recommended action for

everyintermediate time-step, or a recommendation for every actionable. Several heuristics can be ran in the same time-step, in fact, if their action-slices aren't overlapping.(Note that the cross-temporal actions slice is a subset of M[t0:tb], because you can't take actions in the past.)

Intuitively, heuristics are likely defined over natural abstractions, or some high-level environment-models wholly build out of them. ci would look for a particular object — a cluster of individual nodes whose internal complexity can be reduced to a high-level summary — and check its high-level state, possibly across time. For certain "undesirable" states or dynamics, it would attempt to intervene, correcting them.

As an example, consider a grandmaster playing a game of chess. They've developed a lot of chess-specific heuristics. These heuristics look

exclusivelyat the cross-temporal slice of the world-model that has to do with the current game. What moves the grandmaster made, what moves their opponent made, and what moves both of them may make in the future. One of these heuristics may recognize a cross-temporal pattern — a particular sequence of moves the opponent made, perhaps, plus the moves they may be planning to make — and map that pattern to the memorized correct response to it, countering the tactic the opponent is trying.Caveats: That function is technically incoherent, since M[ta:tb], which it takes as an input, can presumably only be computed with the knowledge of the actions the System will take in the future, and these actions are what the heuristic computes

as an output. There's a couple of ways around that: M[ta:tb] might be a "placeholder future", computed under the assumption that the System takes some default actions/null action. Or it might be the output of someotherheuristic, passed to this one for potential correction.The degenerate case of a heuristic, of course, is t0=ta=tb and Mt0=Ot0:

ht0i(t0,Ot0)=⟨do(At0s0=gi(Ot0)) | ci(Ot0s0)=0⟩This is an "instinctive reaction", an instant response to some stimuli.

Another simplified, but interestingly different case, is t0=ta=tb, but with Mt0s0 being a few steps removed from the observations. Intuitively, this is how CNN image classifiers work: they take in a bunch of pixels, then extrapolate the environment-state these pixels are a snapshot of, and back out the values of "is the node dog = 1?", "is the node cat = 1?".

Okay,

nowlet's try to answer the question.## 2.4. The Need for World-Models

The System starts from a place of complete ignorance about the environment. It doesn't know its structure, its state, what nodes reward is defined over, or even that there is an outside world. All information it has access to are observations O and the scores it receives for taking actions A.

Any heuristic it can implement would follow the algorithm of, "IF (some set of observables has certain values) THEN (set some actionables to certain values)". Thus, its set of heuristics is defined over the power sets of observables P(O) and actionables P(A). Heuristics, thus, would have a type signature Oi→Ak, where Oi∈P(O), Ak∈P(A).

From a state of total ignorance, there's no strategy better than guessing blindly. So suppose the System samples a few random heuristics and runs them in a few episodes. Upon receiving reward, it would get some information about their effectiveness, and would ideally keep effective heuristics and improve on them while discarding bad ones. But how?

Well, any solution would need to solve the credit assignment problem somehow.

So suppose it is solved, somehow. What properties would the solution have?

Whatever it is, it would establish a causal connection between observations, actions, and the feedback: (Oi→Ak)→reward, and also reward←Y→Oi, since if it's possible to derive actionable information about reward from Oi, they have to be somehow causally connected.

In other words: whatever the credit assignment mechanism, whether it's external (as in ML) or purely internal, it would convey information about the structure of the environment.

Thus, even as the System is blindly guessing shallow heuristics, it's learning a world-model. Even if only implicitly.

And acting on observables can only get us so far. Literally: as per 2.2, the farther away from our action-nodes the reward-node is, the more intermediate nodes we have to take into account to compute the correct action to take. Turning it around, the farther from the reward-node the observation-nodes are, the more "diffuse" the information about the reward-node they contain. In all probability, all individual a∈A have to be calculated as a function of

allobservables, to maximize our ability to map out the surrounding environment.And this is where the world-models come in.

Let's assume that world-models can be incomplete — as in, contain only some part of the environment graph G. For simplicity, let's assume that over time, it's expanded by one degree of separation d in every direction, denoted as Md. So Md=0 contains only the observables and the actionables, Md=1 also contains all of their parents and children, Md=2 contains parents and children of all nodes in Md=1, and so on.

Knowing the partial structure of the environment and the values of some of its variables (the observables) at t allows us to reconstruct its state: Mt, d=P(Xt, d|It), where Xt, d⊂Xtand It is some internal state (including the knowledge of the environment structure and likely a history of previous observations and the actions taken).

Further, knowing the environment structure, the state at t, and what actions we plan to take, lets us compute the model of the

nextenvironment state, but at a one less degree of separation: run:Mt, d×At→Mt+1, d−1.Likewise, under some assumptions about the transition functions f, we can often model the environment

backwards: back:Mt, d×At−1→Mt−1, d−1.Let's denote the cross-temporal model we get from that as M[ta:tb], d|t.

The main effect of all of this is that it greatly expands the space of heuristics: they're now defined over the power sets of actionables and

M[ta:tb], d|tSj∈P( M[ta:tb], d|t)h[ta:tb]i:M[ta:tb], d|tSj→Akall modelable nodes:At the limit of infinite time, the System may expand the world-model to cover the entire environment. That would allow it to design optimal heuristics for controlling every reward-node ri∈R, akin to the functions mentioned in 2.2, regardless of the specific environment-structure it's facing.

## 3. Extending the Framework

I believe the main point is made now, but the model could be extended to formalize a few more intuitions.

## 3.1. Incrementalism

The System won't

startknowing the entire world-model, however. Conversely, it would start generating heuristics from the get-go. In theory, once the world-model is fully learned, there's nothing preventing the System from deriving and maximizing reward.But what if we add friction?

The credit-assignment mechanism needs to know how to deal with imperfect heuristics, if it's to be useful before the world-model is complete. A generally good heuristic making a mistake shouldn't be grounds for immediately deleting it. We'd also need to know how to resolve heuristics conflicts: situations when the cross-temporal model conditioned on the System running some heuristic hi makes some other heuristic hk try to correct it, which causes hi try to re-correct it, and so on. Which one should be allowed to win?

The notion of a "weight" seems the natural answer to that need. Heuristics to which credit is assigned more often and in greater quantities shall have more weight than the less well-performing ones, and in case of a heuristics conflict, the priority shall be more often given to the weightier one.

Suppose that the individual reward-nodes ri aren't sitting in a cluster. They're scattered across the graph, such that you can often affect one of them without affecting the others. That will likely give rise to specialized heuristics, which focus on optimizing the state of just one or several such reward-node. Only the full heuristical ensemble would be optimizing reward directly.

Now consider the following structure:

Suppose that we have some heuristic hxp that's greatly specialized in optimizing xp. (I. e., it models the graph

up toxp and picks actions that set xp to some value that the heuristic prefers. And it prefers it because setting xp to it has historically correlated with high reward.) Once the environment is mapped out up to r, that heuristic can be replaced by one focusing on r directly. But what if hxp has a lot of weight by this point? The hypothetical hr would not havethatmuch to contribute — controlling xp is a good proxy for controlling r, and if xk,xe don't have a very large effect, hxp probably captures most of the reward that can be squeezed out of r without taking them into account.The better-performing hr would simply never be able to edge the older hxp out, then.

Thus, the System would be optimized for a

reward proxy.And this is likely to be an ubiquitous outcome. The world-model is being developed

in lockstepwith new heuristics, and if the reward-nodes are sufficiently conceptually-far from the actionables, it'll find good reward-proxies well before it discovers the actual reward-nodes. And at that point, the proxy-controlling heuristics will be impossible to depose.## 3.2. The Planning Loop

The definition of a heuristic has a free parameter: the mysterious function g, which somehow calculates what actions to take given the world-model.

Its internals can have different contents:

But most salient are algorithms of the following kind:

g(M[ta:t0], d|t):=A[t0:tb]Sk | c(run(M[ta:t0], d|t,A[t0:tb]Sk))=1This process searches the world-model for a series of actions that will cause some set of nodes in it to assume certain values at certain times. I. e., "how do I make X happen?".

That is the planning loop/inner optimization. Since we expect heuristics to be defined in relation to some natural abstractions — i. e., the condition c is looking for is some coherent high-level concept — we can assume that c implicitly includes some mesa-objective. (Returning to the chess-game example, the heuristic with a planning loop would be searching the world-model for actions that lead to the condition "the grandmaster wins the game", or maybe "the opponent's tactic is countered".)

The planning loop may be niche. Instead of the entire Mt0, d, it may search over some Mt0S∈Mt0, d. I. e., a given g implemented in a given heuristic may only know how to optimize over a particular cluster of nodes in the world-model — the cluster the heuristic is specialized in. (E. g., the model of chess.)

## 3.3. The Mesa-Optimizer Pipeline

So how does all of that grow into a terrifying lightcone-eating utility-maximizer? There's a few paths.

First, g may be shared across all heuristics. As per the Orthogonality Thesis, intelligence is orthogonal to values: a good optimization algorithm can be used to optimize any objective. This model concretizes this intuition somewhat. If g can perform search in the entire world-model, there's no reason to replicate its functionality, and every reason to hook all the heuristics up to the same algorithm — to minimize memory expenditure.In this scenario, the resultant AGI would be optimizing a

mixof mesa-objectives, a weighted sum of the heuristics it developed. Possibly inconsistently, as different heuristics grab control at different times. Possibly unstably, if new heuristics are still being derived. Humans may work like this, see Why Subagents? and the Shard Theory.Second, a more nuanced version of the above. The AGI may develop a sort of meta-heuristic, which would explicitly derive the weighted average of the extant heuristics, then set that as its primary mesa-objective, and pursue itconsistently. There's probably a pressure to do that, as to do otherwise is a dominated strategy. It may also be required depending on the specifications of reward — if getting it to hit high values requires high coordination^{[1]}.I suspect humans work like this some of the time — when we're consciously optimizing or performing value reflection, instead of acting on autopilot.

Third, one of the heuristics may take over. Suppose that g isn't shared after all; every heuristic uses some specialized search algorithm. Then, the heuristic with the most advanced one starts to expand, gradually encroaching on others' territory. It should be pretty heavily weighted, probably heavier than any of the others, so that's not inconceivable. In time, it may take over others' roles, and the entire system would become a wrapper-mind optimizing some simple mesa-objective.That heuristic would probably need to do it

deliberately. As in, it'd need to become an advanced enough agent to model this entire dynamic, andpurposefullytake over the mind it's running in, for instrumental power-seeking reasons, while actively preventing its mesa-objective from changing. It'd need to model the base objective instead of internalizing it.It

feelslike this scenario is relatively unlikely: that heuristics are unlikely to be this lopsided in their capabilities. On the other hand, what do I know about how that works in neural networks trained by the SGD the specific way we're doing it now?Fourth, given some advanced heuristic management algorithms, we may reduce the friction described at the beginning of 3.1. If heuristics don't ossify, and are frequently replaced by marginally-better-performing alternatives, the inner alignment failures the three previous scenarios represent won't happen. The AGI would grow to optimize the outer reward function it's been optimized by.(

Edit:Correction, it will probably wirehead instead. If, as per the sidebar in 2.1, we view reward itself as a node on the graph, then a frictionless setup would allow the AGI navigate to itdirectly, not to the in-environment variables it's defined over. To fix this, we'd need to set up some special "Cartesian boundary" over the reward node, ensuring it doesn't make that final jump.)This scenario seems unlikely without some advanced, probably intelligent oversight. E. g., very advanced interpretability and manual-editing tools under the control of humans/another ML model.

## Acknowledgements

Thanks to Quintin Pope, TurnTrout, Logan Riggs, and others working on the Shard Theory: the thinking behind this post has been heavily inspired by it.

(In fact, I suspect that the concept I'm referring to by "heuristic" in this post is synonymous with their "shards". Subjective opinion, though, I haven't run it by the ST folks.)

^{^}E. g., if reward(t)=xti+xtj+xtk, we can have separate heuristics for maximizing the values of all of these nodes, operating mostly independently. On the other hand, if reward(t)=1−|3−EXP(xtj⋅xtk)|, that'd require tighter coordination to ensure the product approximates ln 3.