Idealized Agents Are Approximate Causal Mirrors (+ Radical Optimism on Agent Foundations)

Thane Ruthenis

Epistemic status: I'm currently unsure whether that's a fake framework, a probably-wrong mechanistic model, or a legitimate insight into the fundamental nature of agency. Regardless, viewing things from this angle has been helpful for me.

In addition, the ambitious implications of this view are one of the reasons I'm fairly optimistic about arriving at a robust solution to alignment via agent-foundations research in a timely manner. (My semi-arbitrary deadline is 2030, and I expect to arrive at intermediate solid results by EOY 2025.)

Input Side: Observations

Consider what happens when we draw inferences based on observations.

Photons hit our eyes. Our brains draw an image aggregating the information each photon gave us. We interpret this image, decomposing it into objects, and inferring which latent-variable object is responsible for generating which part of the image. Then we wonder further: what process generated each of these objects? For example, if one of the "objects" is a news article, what is it talking about? Who wrote it? What events is it trying to capture? What set these events into motion? And so on.

In diagram format, we're doing something like this:

Blue are ground-truth variables, grey is the "Cartesian boundary" of our mind from which we read off observations, purple are nodes in our world-model each of which can be mapped to a ground-truth variable.

We take in observations, infer what latent variables generated them, then infer what generated those variables, and so on. We go backwards: from effects to causes, iteratively. The Cartesian boundary of our input can be viewed as a "mirror" of a sort, reflecting the Past.

It's a bit messier in practice, of course. There are shortcuts, ways to map immediate observations to far-off states. But the general idea mostly checks out – especially given that these "shortcuts" probably still implicitly route through all the intermediate variables, just without explicitly computing them. (You can map a news article to the events it's describing without explicitly modeling the intermediary steps of witnesses, journalists, editing, and publishing. But your mapping function is still implicitly shaped by the known quirks of those intermediaries.)
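As a minimal sketch of this effect-to-cause walk, here is exact Bayesian inference on a toy three-node chain. The chain (event -> report -> article), the variable names, and all the probabilities are made up for illustration; none of them come from the post.

```python
# Hypothetical causal chain: event -> report -> article (all numbers are illustrative).
prior_event = {"storm": 0.2, "calm": 0.8}
p_report_given_event = {
    "storm": {"damage": 0.9, "nothing": 0.1},
    "calm": {"damage": 0.05, "nothing": 0.95},
}
p_article_given_report = {
    "damage": {"headline": 0.8, "silence": 0.2},
    "nothing": {"headline": 0.1, "silence": 0.9},
}


def infer_event(observed_article):
    """Posterior over the far-off cause, computed by walking backwards from the observation."""
    # Step 1: how likely is the observation under each nearest cause (the report)?
    lik_report = {
        report: table[observed_article]
        for report, table in p_article_given_report.items()
    }
    # Step 2: push that likelihood one more step back, marginalizing over the report.
    lik_event = {
        event: sum(p_report_given_event[event][r] * lik_report[r] for r in lik_report)
        for event in prior_event
    }
    # Step 3: combine with the prior and normalize (Bayes' rule).
    unnormalized = {e: prior_event[e] * lik_event[e] for e in prior_event}
    total = sum(unnormalized.values())
    return {e: v / total for e, v in unnormalized.items()}


posterior = infer_event("headline")
print(posterior)
```

Seeing the headline raises the probability of "storm" above its prior, while the report node is never observed directly: it only shapes the mapping, much like the implicit intermediaries described above.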

Output Side: Actions

Consider what happens when we're planning to achieve some goal, in a consequentialist-like manner.

We envision the target state: what we want to achieve, what the world would look like. Then we ask ourselves: what would cause this? What forces could influence the outcome to align with our desires? And then: how do we control these forces? What actions would we need to take in order to make the network of causes and effects steer the world towards our desires?

In diagram format, we're doing something like this:

Green are goals, purple are intermediary variables we compute, grey is the Cartesian boundary of our actions, red are ground-truth variables through which we influence our target variables.

We start from our goals, infer what latent variables control their state in the real world, then infer what controls those latent variables, and so on. We go backwards: from effects to causes, iteratively, until getting to our own actions. The Cartesian boundary of our output can be viewed as a "mirror" of a sort, reflecting the Future.

It's a bit messier in practice, of course. There are shortcuts, ways to map far-off goals to immediate actions. But the general idea mostly checks out – especially given that these heuristics probably still implicitly route through all the intermediate variables, just without explicitly computing them. ("Acquire resources" is a good heuristical starting point for basically any plan. But what counts as resources is something you had to figure out in the first place by mapping from "what lets me achieve goals in this environment?".)
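The same backwards walk on the output side can be sketched as simple backward chaining over a causal model. The model below (a deterministic map from each state to the states that would bring it about) and every name in it are hypothetical and chosen only for illustration; a real planner would also have to handle uncertainty and competing routes to the goal.

```python
# Hypothetical causal model: each state maps to the states that would bring it about.
causes = {
    "house_warm": ["heater_on"],
    "heater_on": ["switch_flipped", "power_available"],
    "power_available": ["bill_paid"],
}
# States the agent can set directly, i.e. its output-side boundary.
actions = {"switch_flipped", "bill_paid"}


def plan_backwards(goal):
    """Walk from the goal to its causes until only direct actions remain."""
    plan = set()
    frontier = [goal]
    seen = set()
    while frontier:
        state = frontier.pop()
        if state in seen:
            continue
        seen.add(state)
        if state in actions:
            plan.add(state)
        else:
            # Raises KeyError if the model has no known cause for this state.
            frontier.extend(causes[state])
    return sorted(plan)


print(plan_backwards("house_warm"))
```

The planner never simulates the world forwards; it only asks "what would cause this?" repeatedly until it reaches things it can do.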

And indeed, that side of my formulation isn't novel! From this post by Scott Garrabrant:

Time is also crucial for thinking about agency. My best short-phrase definition of agency is that agency is time travel. An agent is a mechanism through which the future is able to affect the past. An agent models the future consequences of its actions, and chooses actions on the basis of those consequences. In that sense, the consequence causes the action, in spite of the fact that the action comes earlier in the standard physical sense.

Both Sides: A Causal Mirror

Putting it together, an idealized, compute-unbounded "agent" could be laid out in this manner:

You may not like it, but this is what peak agency looks like.

It reflects the past at the input side, and reflects the future at the output side. In the middle, there's some "glue"/"bridge" connecting the past and the future by a forwards-simulation. During that, the agent "catches up to the present": figures out what'll happen while it's figuring out what to do.

If we consider the relation between utility functions and probability distributions, it gets even more literal. A utility function over $X$ could be viewed as a target probability distribution over $X$, and maximizing expected utility is equivalent to minimizing cross-entropy between this target distribution and the real distribution.
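One way to make this equivalence concrete, as a sketch under an assumption the post does not state: suppose the target distribution is the exponentiated utility, $q(x) = e^{u(x)} / Z$ with $Z = \sum_x e^{u(x)}$, and let $p$ be the distribution over $X$ that the agent's actions actually induce. (The symbols $u$, $q$, $p$, $Z$, the exponential form, and the choice to put $p$ in the first argument of the cross-entropy are all illustrative.) Then:

$$H(p, q) = -\sum_x p(x) \log q(x) = -\sum_x p(x)\, u(x) + \log Z = -\mathbb{E}_{p}[u(x)] + \log Z$$

Since $\log Z$ does not depend on $p$, choosing actions to minimize $H(p, q)$ is the same as choosing them to maximize $\mathbb{E}_{p}[u(x)]$.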

That brings the "planning" process in alignment with the "inference" process: both are about propagating target distributions "backwards" in time through the network of causality.

Why Is This Useful?

The primary, "ordinary" use-case is that this allows us to import intuitions and guesses about how planning works into how inference works, and vice versa. It's a helpful heuristic to guide one's thoughts when doing research.

An example: Agency researchers are fond of talking about "coherence theorems" that constrain how agents work. There's a lot of controversy around this idea. John Wentworth had speculated that "real" coherence theorems are yet to be discovered, and that they may be based on a more solid bedrock of probability theory or information theory. This might be the starting point for formulating those – by importing some inference-based derivations to planning procedures.

Another example: Consider the information-bottleneck method. Setup: Suppose we have a causal structure $W \to O \to M$. We want to derive a mapping $O \to M$ such that it discards as much information in $O$ as possible while retaining all the information $O$ has about $W$. In optimization-problem terms, we want to minimize $I(O; M)$ under the constraint $I(W; O) = I(W; M)$. The IBM paper then provides a precise algorithm for doing that, if you know the mapping $W \to O$. And that's a pretty solid description of some aspects of inference.
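To see the objective and the constraint on a toy case, here is a small sketch that just evaluates the three mutual-information terms for two candidate mappings $O \to M$. It is not the iterative algorithm from the information-bottleneck paper. The setup is an assumption made for illustration: $W$ is a fair coin, $O$ is the pair ($W$, an independent noise bit), and the helper names are hypothetical.

```python
from collections import defaultdict
from itertools import product
from math import log2


def mutual_information(joint):
    """I(A;B) in bits, given a dict {(a, b): probability}."""
    pa, pb = defaultdict(float), defaultdict(float)
    for (a, b), p in joint.items():
        pa[a] += p
        pb[b] += p
    return sum(
        p * log2(p / (pa[a] * pb[b])) for (a, b), p in joint.items() if p > 0
    )


def evaluate(mapping):
    """Return (I(W;O), I(W;M), I(O;M)) for a deterministic mapping O -> M."""
    joint_wo = defaultdict(float)
    joint_wm = defaultdict(float)
    joint_om = defaultdict(float)
    # W and the noise bit are independent fair coins, so each pair has probability 1/4.
    for w, noise in product([0, 1], repeat=2):
        o = (w, noise)
        m = mapping(o)
        joint_wo[(w, o)] += 0.25
        joint_wm[(w, m)] += 0.25
        joint_om[(o, m)] += 0.25
    return (
        mutual_information(joint_wo),
        mutual_information(joint_wm),
        mutual_information(joint_om),
    )


keep_everything = evaluate(lambda o: o)     # M is a copy of O
drop_the_noise = evaluate(lambda o: o[0])   # M keeps only the W-relevant bit

print(keep_everything)
print(drop_the_noise)
```

Both mappings satisfy the constraint $I(W; O) = I(W; M)$, but the second one has a smaller $I(O; M)$: it throws away the noise bit while keeping everything $O$ knows about $W$.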

But if inference is equivalent to planning, then it'd stand to reason that something similar happens on the planning side, too. Some sort of "observations", some sort of information-theoretic bottleneck, etc.

And indeed: the bottleneck is actions! When we're planning, we (approximately) generate a whole target world-state. But we can't just assert it upon reality, we have to bring it about through our actions. So we "extract" a plan, we compress that hypothetical world-state into actions that would allow us to generate it... and funnel those actions through our output-side interface with the world.

In diagram format:

We have two bottlenecks: our agent's processing capacity, which requires it to compress all observational data into a world-model, and our agent's limited ability to influence the world, which causes it to compress its target world-state into an action-plan. We can now adapt the IBM for the task of deriving planning-heuristics as well.

And we've arrived at this idea by reasoning from the equivalence of inference to planning.

The ambitious use-case is that if this framework is meaningfully true, this implies that all cognitive functions can be viewed as inverse problems to the environmental functions our universe computes. Which suggests a proper paradigm to agent-foundations research. A way to shed light on all of it by understanding how certain aspects of the environment work.

On which topic...

Missing Piece: Approximation Theory

Now, of course, agents can't be literal causal mirrors. It would essentially require each agent to be as big as the universe, if it has to literally infer the state of every variable the universe computes (bigger, actually: inverse problems tend to be harder).

The literal formulation also runs into all sorts of infinite recursion paradoxes. What if the agent wants to model itself? What if the environment contains other agents? What if some of them are modeling this agent? And so on.

But, of course, it doesn't have to model everything. I'd already alluded to it when mentioning "shortcuts". No, in practice, even idealized agents are only approximate causal mirrors. Their cognition is optimized for low computational complexity and efficient performance. The question then is: how does that "approximation" work?

That is precisely what the natural abstractions research agenda is trying to figure out. What is the relevant theory of approximation, that would suffice for efficiently modeling any system in our world?

Taking that into account, and assuming that my ambitious idea – that all cognitive functions can be derived as inversions of environmental functions – is roughly right...

Well, in that case, figuring out abstraction would be the last major missing piece in agent foundations. If we solve that puzzle, it'll be smooth sailing from there on out. No more fundamental questions about paradigms, no theoretical confusions, no inane philosophizing ~~like this post~~.

The work remaining after that may still not end up easy, mind. Inverse problems tend to be difficult, and the math for inversions of specific environmental transformations may be hard to figure out. But only in a strictly technical sense. It would be straightforwardly difficult, and much, much more scalable and parallelizable.

We won't need to funnel it all through a bunch of eccentric agent-foundation researchers. We would at last attain high-level expertise in the domain of agency, which would let us properly factorize the problem.

And then, all we'd need to do is hire a horde of mathematicians and engineers (or, if we're really lucky, get some math-research AI tools), pose them well-defined technical problems, and blow the problem wide open.

Just highlighting an overlap between the ideas expressed here and a stream that has recently been added to the MATS Summer 2024 Program.

This is not a direct extension of the work, but something that shares some of the intuitions and might help to formalise the ideas expressed in the post.

The grant proposal for the work is here. The proposal was submitted to Manifund (see here), where it was noticed by Ryan Kidd and subsequently added to the MATS program instead of receiving direct funding.

Do have a read and/or reach out if you're interested!

inverse problems tend to be difficult

Indeed, when cryptographers are trying to ensure that certain agents cannot do certain things, and other agents can, they often use trapdoor functions that are computationally impracticable for general agents to invert, but can be easily inverted by agents in possession of a specific secret.

I don't think there's a great deal that cryptography can teach agent fundamentals, but I do think there's some overlap: it should be possible to interface a valid agent fundamentals theory neatly to the basics of cryptography.

I'm fairly optimistic about arriving at a robust solution to alignment via agent-foundations research in a timely manner. (My semi-arbitrary deadline is 2030, and I expect to arrive at intermediate solid results by EOY 2025.)

I understand that rigorously re-expressing philosophy in mathematics is non-trivial, but (as I'm sure you're aware) given currently plausible timelines, ~2030 seems pretty late for getting this figured out: we may well need to solve some rather specific and urgent practicalities by somewhere around then.

Can you tell me what is the hard part in formalizing the following:

Agent A (an AI) is less computationally limited than a set of agents $H_1$ through $H_{N}$ (humans). It models and can affect the world, itself, and the humans, using an efficient approximately Bayesian approach, and also models its own current remaining uncertainty due to insufficient knowledge (including due to not having access to the Universal prior since it is computationally bounded). It can plan both how to optimize the world for a specific goal while pessimizing with appropriate caution over its current uncertainty, and also how to prioritize using the scientific method to reduce its uncertainty. It understands (with some current uncertainty) what preference ordering the humans each have on future states of the world. It synthesizes all of these into a fairly good compromise (a problem extensively studied in economics and the theory of things like voting), then uses its superior computational capacity to optimize the world for this (with suitable minimizing caution over its remaining uncertainty) and also to reduce its uncertainty so it can optimize better.

Idealized Agents Are Approximate Causal Mirrors…
The literal formulation also runs into all sorts of infinite recursion paradoxes. What if the agent wants to model itself? What if the environment contains other agents? What if some of them are modeling this agent? And so on.

I recall reading a description by an early 20th century Asian-influenced European mystic of the image of a universe full of people being like an array of mirror-surfaced balls, each reflecting within it in miniature the entire rest of the array, including the reflections inside each of the other mirrored balls, recursively. (Though this image omits the agent modelling itself, it's not hard to extend it, say by adding some fuzz to the outside of each ball, and a reflection of that inside it.)

I don't think there's a great deal that cryptography can teach agent fundamentals, but I do think there's some overlap

Yup! Cryptography actually was the main thing I was thinking about there. And there's indeed some relation. For example, it appears that $P \neq NP$ is because our universe's baseline "forward-pass functions" are just poorly suited for being composed into functions solving certain problems. The environment doesn't calculate those; all of those are in $P$.

However, the inversion of the universe's forward passes can be NP-complete functions. Hence a lot of difficulties.

~2030 seems pretty late for getting this figured out: we may well need to solve some rather specific and urgent practicalities by somewhere around then

2030 is the target for having completed the "hire a horde of mathematicians and engineers and blow the problem wide open" step, to be clear. I don't expect the theoretical difficulties to take quite so long.

Can you tell me what is the hard part in formalizing the following:

Usually, the hard part is finding a way to connect abstract agency frameworks to reality. As in: here you have your framework, here's the Pile, now write some code to make them interface with each other.

Specifically in this case, the problems are:

an efficient approximately Bayesian approach

What approach specifically? The agent would need to take in the Pile, and regurgitate some efficient well-formatted hierarchical world-model over which it can do search. What's the algorithm for this?

It understands (with some current uncertainty) what preference ordering the humans each have

How do you make it not just understand that, but care about that? How do you interface with the world-model it learned, and point at what the humans care about?

However, the inversion of the universe's forward passes can be NP-complete functions.

Like a cryptographer, I'm not very concerned about worst-case complexity, only average-case complexity. We don't even generally need an exact inverse, normally just an approximation to some useful degree of accuracy. If I'm in a position to monitor and repeatedly apply corrections as I approach my goal, even fairly coarse approximations with some bounded error rate may well be enough. Some portions of the universe are pretty approximately-invertible in the average case using much lower computational resources than simulating the field-theoretical wave function of every fundamental particle. Others (for example non-linear systems after many Lyapunov times, carefully designed cryptosystems, or most chaotic cellular automata), less so. Animals including humans seem to be able to survive in the presence of a mixed situation where they can invert/steer some things but not others, basically by attempting to avoid situations where they need to do the impossible. AIs are going to face the same situation.

an efficient approximately Bayesian approach
What approach specifically? The agent would need to take in the Pile, and regurgitate some efficient well-formatted hierarchical world-model over which it can do search. What's the algorithm for this?

Basically every functional form of machine learning we know, including both SGD and in-context learning in sufficiently large LLMs, implements an approximate version of Bayesianism. I agree we need to engineer a specific implementation to build my proposal, but for mathematical analysis just the fact that it's a computationally-bounded approximation to Bayesianism takes us quite some way, until we need to analyze its limitations and inaccuracies.

It understands (with some current uncertainty) what preference ordering the humans each have
How do you make it not just understand that, but care about that? How do you interface with the world-model it learned, and point at what the humans care about?

I'm assuming a structure similar to a computationally-bounded version of AIXI, upgraded to do value learning rather than having a hard-coded utility function. It maintains and performs approximate Bayesian updates on an ensemble of theories about a) mappings from current world state + actions to distributions of future world states, and b) mappings from world states to something utility function-like for individual humans, plus an aggregate/compromise of these across all humans. It can apply the scientific method to reducing uncertainty on both of these ensembles of theories, in a prioritized way, and its final goal is to meanwhile attempt to optimize the utility of the aggregate/compromise across all humans, in a suitably cautious/pessimizing way over uncertainties in a) and b). So like AIXI, it has an explicit final goal slot by construction, and that goal slot has been pointed at value learning. You don't need to point at what humans care about in detail; that's part b) of its world model ensemble. You probably do need to point at a definition of what a human is, plus the fact that humans, as sentient biological organisms, are computationally bounded agents who have preferences/goals (which your agent fundamentals program clearly could be helpful for, if Biology alone wasn't enough of a pointer).

Given access to an LLM, I don't believe finding a basically-unique best-fit mapping between the human linguistic world model encoded in the LLM and the AI's Bayesian ensemble of world models is a hard problem, so I don't consider something as basic as pointing at the biological species Homo sapiens to be very hard. I'm actually very puzzled why (post GPT-3) anyone still considers the pointers problem to be a challenge: given two very large, very complex and easily queryable world models, there is clearly almost always (apart from statistically unlikely corner cases) going to be a functionally-unique solution to finding a closest fit between the two that makes as much as possible of one an approximate subset of the other. (And in those cases where there are a small number of plausible alternative fits, either globally or at least for small portions of the world-model networks, there should be a clear experimental way to distinguish between the alternative hypotheses, often just by asking some humans some questions.) This is basically just a problem in optimal approximate subset-isomorphism of labelled graphs (with an unknown label mapping), something that has excellent heuristic methods that work in the average case. (I expect the worst case is NP-complete, but we're not going to hit it.) Doing this between different generations of human scientific paradigms for the same subject matter is basically always trivial, other than for paradigms so primitive and mistaken as to have almost no valid content (even the Ancient Greek Earth-Air-Fire-Water model maps onto solid, gas, plasma, liquid: the four most common states of matter).
There may of course be parts that don't fit together well due to mistakes on one side or the other, but the concepts "the species Homo sapiens" and "humans are evolved sentient animals, and thus computationally-bounded agents with preferences/goals" both seem to me to be rather unlikely to be one of them, given how genetically similar to each other we all are.

One thing that would make this more symmetrical is if some errors in your world model are worse than others. This makes inference more like a utility function.

Yup. I think this might route through utility as well, though. Observations are useful because they unlock bits of optimization, and bits related to different variables could unlock both different amounts of optimization capacity, and different amounts of goal-related optimization capacity. (It's not so bad to forget a single digit of someone's phone number; it's much worse if you forgot a single letter in the password to your password manager.)

You are making the structure of time into a fundamental part of your agent design, not a contingency of physics.

Let an aput be an input or an output. Let a policy be a subset of possible aputs. Some policies are physically valid.

I.e., a policy must have the property that, for each input, there is a single output. If the computer is reversible, the policy must be a bijection from inputs to outputs. If the computer can create a contradiction internally, stopping the timeline, then a policy must be a map from inputs to at most one output.

If the agent is actually split into several pieces with lightspeed and bandwidth limits, then the policy mustn't use info it can't have.

But these physical details don't matter.

The agent has some set of physically valid policies, and it must pick one.
