MIRI has said a lot about the issue of embedded agency over the last year. However, I am yet to see them trying to make progress in what I see as the most promising areas.

How does one attack a problem that is new, complicated and non-obvious? By **constructing toy models** and **inverting hard questions** to make them more tractable.

In general an inverse problem is harder than the "direct" one, because we are trying to infer unobservables from observables. Wikipedia gives an example of figuring out the position of Neptune from the perturbations in the orbit of Uranus. Another popular example is NP-complete problems: they are famously hard to solve, but it is easy to verify a solution. Another example: when you take **a multiple-choice math quiz**, it is often faster and easier to get the right answer by plugging the 4 or 5 potential solutions into the stated problem than to solve the problem directly.
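The solve-versus-verify gap can be sketched with subset sum, a standard NP-complete problem. This is only an illustration: the function names `verify` and `solve` and the sample numbers are made up for this example. Checking a proposed answer is a single sum, while the naive solver has nothing better to do than run that check on every candidate subset.

```python
from itertools import combinations


def verify(numbers, indices, target):
    """Cheap direction: does the proposed subset add up to the target?"""
    return sum(numbers[i] for i in indices) == target


def solve(numbers, target):
    """Expensive direction: brute-force search that just calls verify repeatedly."""
    for size in range(1, len(numbers) + 1):
        for indices in combinations(range(len(numbers)), size):
            if verify(numbers, indices, target):
                return indices
    return None  # no non-empty subset sums to the target


numbers = [3, 34, 4, 12, 5, 2]
solution = solve(numbers, 9)  # indices of a subset summing to 9
```

The quiz trick is the same move: with only 4 or 5 candidates, running the verifier on each is cheaper than solving.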

I'll give an example from my own area. The equations of **general relativity** are hard to solve except in a few highly symmetric cases. It is a classic inverse problem. But! Any spacetime metric is actually a solution of the Einstein equations, so all one needs to do is to write down a metric and calculate its Einstein tensor to see what kind of a matter distribution (and boundary conditions) it is a solution of. **Inverting the inverse problem!** Of course, life is not that easy. Most solutions correspond to "unphysical" matter, usually with negative energy density, superluminal flows, singularities, infinities, weird topologies etc. However, it is a useful approach if one wants to study some general properties of the equations, and get a feel for (or sometimes a theorem about) what goes wrong, why and how. After a few iterations one can get better at guessing what form a "good" solution might take, and write up an ansatz that can help solve the original, not the inverse problem in some cases.
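To make the direction of the calculation explicit, here is the field equation in one common convention (units with $c = 1$ and no cosmological constant, which are assumptions of this sketch):

$$
G_{\mu\nu}[g] \;=\; R_{\mu\nu} - \tfrac{1}{2} R\, g_{\mu\nu} \;=\; 8\pi G\, T_{\mu\nu}
$$

Solving for $g_{\mu\nu}$ given $T_{\mu\nu}$ is the hard direction. Read the other way, any chosen metric defines a stress-energy tensor by differentiation alone:

$$
T_{\mu\nu} \;:=\; \frac{1}{8\pi G}\, G_{\mu\nu}[g]
$$

Whether that $T_{\mu\nu}$ describes physically reasonable matter is a separate check.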

Another, more familiar example: **arithmetic division**. Until you learn or figure out the rules, it's hard. But its inverse problem, multiplication, is actually much easier! So to learn more about division, it pays to start with potential solutions and see what kind of multiplication actually solves the division problem. Eventually one can come up with the long division algorithm, which uses nothing but multiplication and subtraction. And voila, inverting an inverse problem helps us solve the original one.
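A minimal sketch of that idea, assuming non-negative integers: each quotient digit is found by guessing and checking the guess with a multiplication, so the routine never divides. The function name `long_division` is just a label for this example.

```python
def long_division(dividend, divisor):
    """Divide digit by digit using only multiplication, comparison and subtraction.

    Returns (quotient, remainder) for dividend >= 0 and divisor > 0.
    """
    if divisor <= 0 or dividend < 0:
        raise ValueError("expects dividend >= 0 and divisor > 0")
    quotient = 0
    remainder = 0
    for ch in str(dividend):
        # Bring down the next decimal digit of the dividend.
        remainder = remainder * 10 + int(ch)
        # Guess the largest digit whose product with the divisor still fits.
        digit = 0
        while (digit + 1) * divisor <= remainder:
            digit += 1
        remainder -= digit * divisor
        quotient = quotient * 10 + digit
    return quotient, remainder


result = long_division(1234, 7)  # 7 * 176 + 2 == 1234
```

The inner loop is the "try potential solutions" step: it solves easy direct problems (multiplications) until one of them matches the division at hand.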

This approach is common in computer science, as well. Plenty of algorithms, like **search**, actually rely on solving smaller and simpler inverse problems.

I contend that a similar approach could be useful for making progress in understanding embedded agency. To that end, let's first restate the original problem of embedded agency (copied from the alignment forum page):

**How can one make good models of the world that are able to fit within an agent that is much smaller than the world?**

This is a hard inverse problem! There are many faucets of it, such as the oft-mentioned problem of logical counterfactuals, that do not seem to yield to direct attacks. So, it seems natural to learn to "seek under the light" before stepping into the darkness, and that includes, you guessed it, constructing toy models and inverting the inverse problems.

What would inverting this problem look like? There are multiple possible formulations, just as the power operation a^b has two inverses, the n-th root and the logarithm. Here are a couple of ideas:

- Create a toy universe and look for its representations inside.
- Create a toy model and construct a world around it such that the model represents the world in some way.

Here is an example: a **fractal** is self-similar, so any subset of it can be thought of as a near-perfect model of the whole. Of course, a model is not enough; one has to figure out what would constitute an agent using this model in this fractal world. But at least it can be a promising and potentially illuminating direction to explore. There are plenty more ideas one can come up with after thinking about it for 5 minutes.

I hope someone at MIRI is either thinking along these directions, or is ready to try to, instead of being stuck analyzing the messy and complicated inverse problem that is the "real world".

I thought about this for longer than expected so here's an elaboration on inverse-inverse problems in the examples you provided:

## Partial Differential Equations

Finding solutions to partial differential equations with specific boundary conditions is hard and often impossible. But we know a lot of solutions to differential equations with particular boundary conditions. If we match up those solutions with the problem at hand, we can often get a decent answer.
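As a rough sketch of "matching up known solutions", assume the 1D heat equation u_t = u_xx on [0, pi] with u = 0 at both ends. This specific equation is my choice for illustration; nothing above singles it out. Its catalogue of known solutions is exp(-n^2 t) sin(n x), and the problem at hand is matched by projecting the initial profile onto that catalogue. All function names here are hypothetical.

```python
import math


def known_solution(n, x, t):
    """A catalogued solution of u_t = u_xx on [0, pi] vanishing at both ends."""
    return math.exp(-n * n * t) * math.sin(n * x)


def match_coefficients(initial, n_modes, samples=2000):
    """Project the initial profile onto sin(n x) using a midpoint-rule integral."""
    dx = math.pi / samples
    xs = [(i + 0.5) * dx for i in range(samples)]
    return [
        (2.0 / math.pi) * sum(initial(x) * math.sin(n * x) for x in xs) * dx
        for n in range(1, n_modes + 1)
    ]


def solve_heat(initial, n_modes=5):
    """Build a solution as a weighted sum of catalogued solutions."""
    coeffs = match_coefficients(initial, n_modes)

    def u(x, t):
        return sum(c * known_solution(n, x, t) for n, c in enumerate(coeffs, start=1))

    return u, coeffs


def initial(x):
    # Example initial profile, chosen so the matching weights are easy to read off.
    return 3.0 * math.sin(x) + 0.5 * math.sin(3 * x)


u, coeffs = solve_heat(initial)
```

Here the recovered weights are close to 3 for the first mode and 0.5 for the third, with the rest near zero. For equations without such a tidy catalogue the matching is much rougher, which is where "a decent answer" rather than an exact one comes in.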

The direct problem: you have a function; figure out what relationships its derivatives have and its boundary conditions

The inverse problem: you know a bunch of relationships between derivatives and some boundary conditions; figure out the function that satisfies these conditions

The inverse inverse problem: you have a bunch of solutions to inverse problems (ie you can take a bunch of functions, solve the direct problem, and now you know the inverse problem that the function is a solution to), figure out which of these solutions look like the unsolved inverse problem you're currently dealing with

## Arithmetic

Performing division is hard but adding and multiplying is easy.

The direct problem: you have two numbers A and B; figure out what happens when you multiply them

The inverse problem: you have two numbers A and C; figure out what you can multiply A by to produce C

The inverse inverse problem: you have a bunch of solutions to inverse problems (ie you can take A and multiply it by all sorts of things like B' to produce numbers like C', solving direct problems. Now you know that B' is a solution to the inverse problem where you must divide C' by A). You just need to figure out which of these inverse problem solutions look like the inverse problem at hand (ie if you find a C' so that C' = C, you've solved the inverse problem)

## In The Abstract

We have a problem like "Find *X* that produces *Y*" which is a hard problem from a broader class of problems. But we can produce a lot of solutions in that broader class pretty quickly by solving problems of the form "Find the *Y'* that *X'* produces." Then the original problem is just a matter of finding a *Y'* which is something like *Y*. Once we achieve this, we know that *X* will be something like *X'*.

## Applications for Embedded Agency

The direct problem: You have a small model of something, come up with a thing much bigger than the model that the model is modeling well

The inverse problem: You have a world; figure out something much smaller than the world that can model it well

The inverse inverse problem: You have a bunch of worlds and a bunch of models that model them well. Figure out which world looks like ours and see what its corresponding model tells us about good models for modeling *our* world.

## Some Theory About Why Inverse-Inverse Solutions Work

To speak *extremely* loosely, the *assumption* for inverse-inverse problems is something along the lines of "if *X'* solves problem *Y'*, then we have reason to expect that solutions *X* similar to *X'* will solve problems *Y* similar to *Y'*".

This tends to work really well in math problems with functions that are continuous/analytic because, as you take the limit of making *Y'* and *Y* increasingly similar, you can make their solutions *X'* and *X* arbitrarily close. And, even if you can't get close to that limit, *X'* will still be a good place to start work on finagling a solution *X* if the relationship between the problem-space and the solution-space isn't too crazy.

Division is a good example of an inverse-inverse problem with a literal continuous and analytic mapping between the problem-space and solution-space. Differential equations with tweaked parameters/boundary conditions *can* be like this too, although to a much weaker extent, since they are iterative systems that allow dramatic phase transitions and bifurcations. Appropriately, inverse-inversing a differential equation is much, much harder than inverse-inversing division.

From this perspective, the embedded agency inverse-problem is much more confusing than ordinary inverse-inverse problems. Like differential equations, there seem to be many subtle ways of tweaking the world (ie black swans) that dramatically change what counts as a good model.

Fortunately, we also have an advantage over conventional inverse problems: unlike multiplying numbers or taking derivatives, which are functions with one solution (typically -- sometimes things are undefined or weird), a particular direct problem of embedded agency likely has *multiple* solutions (a single model can be good at modeling multiple different worlds). In principle, this makes things easier -- there are more *Y'* (worlds that embedded agency is solved in) that we can compare to our *Y* (actual world).

## Thoughts on Structuring Embedded Agency Problems

To me the problem of embedded agency isn't about fitting a large description of the world into a small part of the world. That's easy with quining, which is mentioned in the MIRI writeup. The problem is more about the weird consequences of learning about something that contains the learner.

Also, I love your wording that the problem has many faucets. Please don't edit it out :-)

haha, oops.

Thinking about my focus on a theory of human values for AI alignment, the problem is quite hard when we ask for a way to precisely specify values. I might state the problem as something like finding "a theory of human values accurate and precise enough that its predictions don't come apart under extreme optimization". To borrow Isnasene's notation, here X = "a theory of human values accurate and precise enough" and Y = "its predictions don't come apart under extreme optimization".

So what is an inverse problem with X' and Y'? A Y' might be something like "functions that behave as expected under extreme optimization", where "behave as expected" is something like no Goodhart effects. We could even just be more narrow and make Y' = "functions that don't exhibit Goodhart effects under extreme optimization". Then the X' would be something like a generalized description of the classes of functions that satisfy Y'.

Doing the double inverse, we would try to find X from X' by looking at what properties hold for this class of functions that don't suffer from Goodharting, and use them to help us identify what would be needed to create an adequate theory of human values.

Looking for "functions that don't exhibit Goodhart effects under extreme optimization" might be a promising area to look into. What does it mean for a function to behave as expected under extreme optimization? Can you give a toy example?

I'm actually not really sure. We have some vague notion that, for example, my preference for eating pizza shouldn't result in attempts at unbounded pizza eating maximization, and I would probably be unhappy *from my current values* if a maximizing agent saw I liked pizza the best of all foods and then proceeded to feed me only pizza forever, even if it modified me such that I would maximally enjoy the pizza each time and not get bored of it.

Thinking more in terms of regressional Goodharting, maybe something like not deviating from the true target because of optimizing for the measure of it. Consider the classic rat extermination example of Goodharting. We already know collecting rat tails as evidence of extermination is a function that leads to weird effects. Does there exist a function that measures rat exterminations that, when optimized for, produces the intended effect (extermination of rats) without doing anything "weird", e.g. generating unintended side-effects or maximizing rat reproduction so we can exterminate more of them, and instead just straightforwardly leads to the extinction of rats and nothing else?
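For the regressional case, a toy sketch of the failure mode itself (not of a function that avoids it): assume the measure is the true target plus independent Gaussian noise, then select hard on the measure. The setup, the function name `goodhart_demo` and all parameters are assumptions made for illustration.

```python
import random
import statistics


def goodhart_demo(n=10_000, top=100, seed=0):
    """Select the top items by a noisy proxy and compare proxy vs. true value."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    items = []
    for _ in range(n):
        true_value = rng.gauss(0.0, 1.0)
        proxy = true_value + rng.gauss(0.0, 1.0)  # measure = target + noise
        items.append((proxy, true_value))
    items.sort(reverse=True)  # "optimize" by ranking on the proxy
    selected = items[:top]
    mean_proxy = statistics.fmean(p for p, _ in selected)
    mean_true = statistics.fmean(v for _, v in selected)
    return mean_proxy, mean_true


mean_proxy, mean_true = goodhart_demo()
```

Among the selected items the proxy average sits well above the true average, because selecting on the measure also selects for favorable noise. A function that "behaves as expected under extreme optimization" would be one where that gap stays small however hard you select.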

Right, that's the question. Sure, it is easy to state that "metric must be a faithful representation of the target", but it never is, is it? From the point of view of double inversion, optimizing the target is a hard inverse problem, because, like in your pizza example, the true "values" (pizza is a preference against the background of an otherwise balanced diet) are not easily observable. What would be a double inverse in this case? Maybe something like trying various amounts of pizza and getting the feedback on enjoyment? That would match the long division pattern. I'm not sure.