Claim: problems of agents embedded in their environment mostly reduce to problems of abstraction. Solve abstraction, and solutions to embedded agency problems will probably just drop out naturally.

The goal of this post is to explain the intuition underlying that claim. The point is not to defend the claim socially or to prove it mathematically, but to illustrate why I personally believe that understanding abstraction is the key to understanding embedded agency. Along the way, we’ll also discuss exactly which problems of abstraction need to be solved for a theory of embedded agency.

What do we mean by “abstraction”?

Let’s start with a few examples:

  • We have a gas consisting of some huge number of particles. We throw away information about the particles themselves, instead keeping just a few summary statistics: average energy, number of particles, etc. We can then make highly precise predictions about things like e.g. pressure just based on the reduced information we've kept, without having to think about each individual particle. That reduced information is the "abstract layer" - the gas and its properties.
  • We have a bunch of transistors and wires on a chip. We arrange them to perform some logical operation, like maybe a NAND gate. Then, we throw away information about the underlying details, and just treat it as an abstract logical NAND gate. Using just the abstract layer, we can make predictions about what outputs will result from what inputs. Note that there’s some fuzziness - 0.01 V and 0.02 V are both treated as logical zero, and in rare cases there will be enough noise in the wires to get an incorrect output.
  • I tell my friend that I'm going to play tennis. I have ignored a huge amount of information about the details of the activity - where, when, what racket, what ball, with whom, all the distributions of every microscopic particle involved - yet my friend can still make some reliable predictions based on the abstract information I've provided.
  • When we abstract formulas like "1+1=2" or "2+2=4" into "n+n=2n", we're obviously throwing out information about the value of n, while still making whatever predictions we can given the information we kept. This is what abstraction is all about in math and programming: throw out as much information as you can, while still maintaining the core "prediction".
  • I have a street map of New York City. The map throws out lots of info about the physical streets: street width, potholes, power lines and water mains, building facades, signs and stoplights, etc. But for many questions about distance or reachability on the physical city streets, I can translate the question into a query on the map. My query on the map will return reliable predictions about the physical streets, even though the map has thrown out lots of info.

The general pattern: there’s some ground-level “concrete” model, and an abstract model. The abstract model throws away or ignores information from the concrete model, but in such a way that we can still make reliable predictions about some aspects of the underlying system.

Notice that, in most of these examples, the predictions of the abstract model need not be perfectly accurate. The mathematically exact abstractions used in pure math and CS are an unusual corner case: they don’t deal with the sort of fuzzy boundaries we see in the real world. "Tennis", on the other hand, is a fuzzy abstraction of many real-world activities, and there are edge cases which are sort-of-tennis-but-maybe-not. Most of the interesting problems involve non-exact abstraction, so we'll mostly talk about that, with the understanding that math/CS-style abstraction is just the case with zero fuzz.

In terms of existing theory, I only know of one field which explicitly quantifies abstraction without needing hard edges: statistical mechanics. The heart of the field is things like "I have a huge number of tiny particles in a box, and I want to treat them as one abstract object which I'll call ‘gas’. What properties will the gas have?" Jaynes puts the tools of statistical mechanics on foundations which can, in principle, be used for quantifying abstraction more generally. (I don't think Jaynes had all the puzzle pieces, but he had a lot more than anyone else I've read.) It's rather difficult to find good sources for learning stat mech the Jaynes way; Walter Grandy has a few great books, but they're not exactly intro-level.

Summary: abstraction is about ignoring or throwing away information, in such a way that we can still make reliable predictions about some aspects of the underlying system.

Embedded World-Models

The next few sections will walk through different ways of looking at the core problems of embedded agency, as presented in the embedded agency sequence. We’ll start with embedded world-models, since these introduce the key constraint for everything else.

The underlying challenge of embedded world models is that the map is smaller than the territory it represents. The map simply won’t have enough space to perfectly represent the state of the whole territory - much less every possible territory, as required for Bayesian inference. A piece of paper with some lines on it doesn’t have space to represent the full microscopic configuration of every atom comprising the streets of New York City.

Obvious implication: the map has to throw out some information about the territory. (Note that this isn’t necessarily true in all cases: the territory could have some symmetry allowing for a perfect compressed representation. But this probably won’t apply to most real-world systems, e.g. the full microscopic configuration of every atom comprising the streets of New York City.)

So we need to throw out some information to make a map, but we still want to be able to reliably predict some aspects of the territory - otherwise there wouldn’t be any point in building a map to start with. In other words, we need abstraction.

Exactly what problems of abstraction do we need to solve?

The simplest problems are things like:

  • Given a map-mapping process, characterize the queries whose answers the map can reliably predict. Example: figure out what what questions a streetmap can answer by watching a cartographer produce a streetmap.
  • Given some representation of the map-territory correspondence, translate queries from the territory-representation to the map-representation and vice versa. Example: after understanding the relationship between streets and lines on paper, turn “how far is Times Square from the Met?” into “How far is the Times Square symbol from the Met symbol on the map, and what’s the scale?”
  • Given a territory, characterize classes of queries which can be reliably answered using a map much smaller than the territory itself. Example: recognize that the world contains lots of things with leaves, bark, branches, etc, and these “trees” are similar enough that a compressed map can reliably make predictions about specific trees - e.g. things with branches and bark are also likely to have leaves.
  • Given a territory and a class of queries, construct a map which throws out as much information as possible while still allowing accurate prediction over the query class.
  • Given a map and a class of queries whose answers the map can reliably predict, characterize the class of territories which the map might represent.
  • Given multiple different maps supporting different queries, how can we use them together consistently? Example: a construction project may need to use both a water-main map and a streetmap to figure out where to dig.

These kinds of questions directly address many of the issues from Abram & Scott’s embedded world-models post: grain-of-truth, high-level/multi-level models, ontological crises. But we still need to discuss the biggest barrier to a theory of embedded world-models: diagonalization, i.e. a territory which sees the map’s predictions and then falsifies them.

If the map is embedded in the territory, then things in the territory can look at what the map predicts, then make the prediction false. For instance, some troll in the department of transportation could regularly check Google’s traffic map for NYC, then quickly close off roads to make the map as inaccurate as possible. This sort of thing could even happen naturally, without trolls: if lots of people follow Google’s low-traffic route recommendations, then the recommended routes will quickly fill up with traffic.

These examples suggest that, when making a map of a territory which contains the map, there is a natural role for randomization: Google’s traffic-mapping team can achieve maximum accuracy by randomizing their own predictions. Rather than recommending the same minimum-traffic route for everyone, they can randomize between a few routes and end up at a Nash equilibrium in their prediction game.

We’re speculating about a map making predictions based on a game-theoretic mixed strategy, but at this point we haven’t even defined the rules of the game. What is the map’s “utility function” in this game? The answer to that sort of question should come from thinking about the simpler questions from earlier. We want a theory where the “rules of the game” for self-referential maps follow naturally from the theory for non-self-referential maps. This is one major reason why I see abstraction as the key to embedded agency, rather than embedded agency as the key to abstraction: I expect a solid theory of non-self-referential abstractions to naturally define the rules/objectives of self-referential abstraction. Also, I expect the non-referential-theory to characterize embedded map-making processes, which the self-referential theory will likely need to recognize in the territory.

Embedded Decision Theory

The main problem for embedded decision theory - as opposed to decision theory in general - is how to define counterfactuals. We want to ask questions like “what would happen if I dropped this apple on that table”, even if we can look at our own internal program and see that we will not, in fact, drop the apple. If we want our agent to maximize some expected utility function E[u(x)], then the “x” needs to represent a counterfactual scenario in which the agent takes some action - and we need to be able to reason about that scenario even if the agent ends up taking some other action.

Of course, we said in the previous section that the agent is using a map which is smaller than the territory - in “E[u(x)]”, that map defines the expectation operator E[-]. (Of course, we could imagine architectures which don’t explicitly use an expectation operator or utility function, but the main point carries over: the agent’s decisions will be based on a map smaller than the territory.) Decision theory requires that we run counterfactual queries on that map, so it needs to be a causal model.

In particular, we need a causal model which allows counterfactual queries over the agent’s own “outputs”, i.e. the results of any optimization it runs. In other words, the agent needs to be able to recognize itself - or copies of itself - in the environment. The map needs to represent, if not a hard boundary between agent and environment, at least the pieces which will be changed by the agent’s computation and/or actions.

What constraints does this pose on a theory of abstraction suitable for embedded agency?

The main constraints are:

  • The map and territory should both be causal (possibly with symmetry)
  • Counterfactual queries on the map should naturally correspond to counterfactuals on the territory
  • The agent needs some idea of which counterfactuals on the map correspond to its own computations/actions in the territory - i.e. it needs to recognize itself

These are the minimum requirements for the agent to plan out its actions based on the map, implement the plan in the territory, and have such plans work.

Note that there’s still a lot of degrees of freedom here. For instance, how does the agent handle copies of itself embedded in the environment? Some answers to that question might be “better” than others, in terms of producing more utility or something, but I see that as a decision theory question which is not a necessary prerequisite for a theory of embedded agency. On the other hand, a theory of embedded agency would probably help build decision theories which reason about copies of the agent. This is a major reason why I see a theory of abstraction as a prerequisite to new decision theories, but not new decision theories as a prerequisite to abstraction: we need abstraction on causal models just to talk about embedded decision theory, but problems like agent-copies can be built later on top of a theory of abstraction - especially a theory of abstraction which already handles self-referential maps.

Self-Reasoning & Improvement

Problems of self-reasoning, improvement, tiling, and so forth are similar to the problems of self-referential abstraction, but on hard mode. We’re no longer just thinking about a map of a territory which contains the map; we’re thinking about a map of a territory which contains the whole map-making process, and we want to e.g. modify the map-making process to produce more reliable maps. But if our goals are represented on the old, less-reliable map, can we safely translate those goals into the new map? For that matter, do the goals on the old map even make sense in the territory?

So… hard mode. What do we need from our theory of abstraction?

A lot of this boils down to the “simple” questions from earlier: make sure queries on the old map translate intelligibly into queries on the territory, and are compatible with queries on other maps, etc. But there are some significant new elements here: reflecting specifically on the map-making process, especially when we don’t have an outside-view way to know that we’re thinking about the territory “correctly” to begin with.

These things feel to me like “level 2” questions. Level 1: build a theory of abstraction between causal models. Handle cases where the map models a copy of itself, e.g. when an agent labels its own computations/actions in the map. Part of that theory should talk about map-making processes: for what queries/territories will a given map-maker produce a map which makes successful predictions? What map-making processes produce successful self-referential maps? Once level 1 is nailed down, we should have the tools to talk about level 2: running counterfactuals in which we change the map-making process.

Of course, not all questions of self-reasoning/improvement are about abstraction. We could also questions about e.g. how to make an agent which modifies its own code to run faster, without changing input/output (though of course input/output are slippery notions in an embedded world…). We could ask questions about how to make an agent modify its own decision theory. Etc. These problems don’t inherently involve abstraction. My intuition, however, is that the problems which don’t involve self-referential abstraction usually seem easier. That’s not to say people shouldn’t work on them - there’s certainly value there, and they seem more amenable to incremental progress - but the critical path to a workable theory of embedded agency seems to go through self-referential maps and map-makers.


Agents made of parts have subsystems. Insofar as those subsystems are also agenty and have goals of their own, we want them to be aligned with the top-level agent. What new requirements does this pose for a theory of abstraction?

First and foremost, if we want to talk about agent subsystems, then our map can’t just black-box the whole agent. We can’t circumvent the lack of an agent-environment boundary by simply drawing our own agent-environment boundary, and ignoring everything on the “agent” side. That doesn’t necessarily mean that we can’t do any self-referential black boxing. For instance, if we want to represent a map which contains a copy of itself, then a natural method is to use a data structure which contains a pointer to itself. That sort of strategy has not necessarily been ruled out, but we can’t just blindly apply it to the whole agent.

In particular, if we’re working with causal models (possibly with symmetry), then the details of the map-making process and the reflecting-on-map-making process and whatnot all need to be causal as well. We can’t call on oracles or non-constructive existence theorems or other such magic. Loosely speaking, our theory of abstraction needs to be computable.

In addition, we don’t just want to model the agent as having parts, we want to model some of the parts as agenty - or at least consider that possibility. In particular, that means we need to talk about other maps and other map-makers embedded in the environment. We want to be able to recognize map-making processes embedded in the territory. And again, this all needs to be computable, so we need algorithms to recognize map-making processes embedded in the territory.

We’re talking about these capabilities in the context of aligning subagents, but this is really a key requirement for alignment more broadly. Ultimately, we want to point at something in the territory and say “See that agenty thing over there? That’s a human; there’s a bunch of them out in the world. Figure out their values, and help satisfy those values.” Recognizing agents embedded in the territory is a key piece of this, and recognizing embedded map-making processes seems to me like the hardest part of that problem - again, it’s on the critical path.


Time for a recap.

The idea of abstraction is to throw out information, while still maintaining the ability to provide reliable predictions on at least some queries.

In order to address the core problems of embedded world models, a theory of abstraction would need to first handle some “simple” questions:

  • Characterize which queries work on which maps of which territories.
  • Characterize which query classes admit significantly-compressed maps on which territories.
  • Characterize map-making processes which produce reliable maps.
  • Translate queries between map-representation and territory-representation, and between different map-representations

We hope that a theory which addresses these problems on non-self-referential maps will suggest natural objectives/rules for self-referential maps.

Embedded decision theory adds a few more constraints, in order to define counterfactuals for optimization:

  • Our theory of abstraction should work with causal models for both the territory and the map
  • We need ways of mapping between counterfactuals on the map and counterfactuals on the territory
  • Agents need some way to recognize their own computations/outputs in the territory, and represent them in the map.

A theory of embedded agency seems necessary for talking about embedded decision theory in a well-defined way.

Self-reasoning kicks self-referential map-making one rung up the meta-ladder, and starts to talk about maps of map-making processes and related issues. These aren’t the only problems of self-reasoning, but it does feel like self-referential abstraction captures the “hard part” - it’s on the critical path to a full theory.

Finally, subsystems push us to make the entire theory of abstraction causal/computable. Also, it requires algorithms for recognizing agents - and thus map-makers - embedded in the territory. That’s a problem we probably want to solve for safety purposes anyway. Again, abstraction isn’t the only part of the problem, but it seems to capture enough of the hard part to be on the critical path.

New Comment
16 comments, sorted by Click to highlight new comments since:
We’re speculating about a map making predictions based on a game-theoretic mixed strategy, but at this point we haven’t even defined the rules of the game. What is the map’s “utility function” in this game? The answer to that sort of question should come from thinking about the simpler questions from earlier. We want a theory where the “rules of the game” for self-referential maps follow naturally from the theory for non-self-referential maps.

I want to make a couple of points about this part:

  • A significant part of the utility of a map comes from the self-referential effects on the territory; the map needs to be chosen with this in mind to avoid catastrophic self-fulfilling prophecies. (This doesn't feel especially important for your point, but it is part of the puzzle.)
  • The definition of naturalized epistemic-goodness can take inspiration from non-self-referential versions of the problem, but faces additional wireheading-like problems, which places significant burden on it. You probably can't just take the "epistemic utility function" from the non-self-referential case. The paper epistemic decision theory by Hilary Greaves explores this issue.
  • Thinking about self-reference may influence the "kind of thing" which is being scored. For example, in the non-self-referential setting, classical logic is a reasonable choice. Despite the ambiguities introduced by uncertain reasoning and abstraction, it might be reasonable to think of statements as basically being true or false, modulo some caveats. However, self-reference paradoxes may make non-classical logics more appropriate, with more radically different notions of truth-value. For example, reflective oracles deal with self-reference via probability (as you mention in the post, using Nash equilibria to avoid paradox in the face of self-reference). However, although it works to an extent, it isn't obviously right. Probability in the sense of uncertainty and probability in the sense of I-have-to-treat-this-as-random-because-it-structurally-depends-on-my-belief-in-a-way-which-diagonalizes-me might be fundamentally different from one another.
  • This same argument may also apply to the question of what abstraction even is.

I don't think you were explicitly denying any of this; I just wanted to call out that these things may create complications for the research agenda. My personal sense is that it could be possible to come up with the right notion by focusing on the non-self-referential case alone (and paying very close attention to what feels right/wrong), but anticipating the issues which will arise in the self-referential case provides significantly more constraints and thus significantly more guidance. A wide variety of tempting simplifications are available in the absence of self-reference.

I'm especially worried about the "kind of thing" point above. It isn't clear at all what kind of thing beliefs for embedded agents should be. Reflective oracles give a way to rescue probability theory for the embedded setting, but, are basically unrealistic. Logical inductors are of course somewhat more realistic (being computable), and look quite different. But, logical inductors don't have great decision-theoretic properties (so far).

Your "kind of thing" concern feels like it's pointing to the right problem, although I think I'm more confident than you that it will end up looking like probability. It feels to me like we're missing an interpretation of probability which would make this all make sense - something which would unify uncertainty-randomness and game-theoretic-randomness, in a causal setting, without invoking limiting frequencies or ontologically basic agents with beliefs.

You do make a strong case that such an interpretation may involve more than just map-territory correspondence, which dramatically widens the net in terms of what to look for.

It feels to me like throwing away information is the key piece here. For instance: I roll 2 dice, observe the outcome, and then throw away all info about their sum. What "posterior" leaves me with the most possible information, while still forgetting everything about the sum (i.e. "posterior" marginal distribution of sum is same as prior)? Optimal performance here requires randomizing my own beliefs. This sort of thing makes me think that a theory of abstraction - inherently about throwing away info - will point toward the key pieces, even before we introduce explicit self-reference.

I think one difference between us is, I really don't expect standard game-theoretic ideas to survive. They're a good starting point, but, we need to break them down to something more fundamental. (Breaking down probability (further than logical induction already does, that is), while on my radar, is far more speculative than that.)

Basic game theory uses equilibrium analysis. We need a theory of dynamics instead of only equilibrium, because a reasoner needs to find an equilibrium somehow -- and the "somehow" is going to involve computational learning theory. Evolutionary game theory is a step in the right direction but not powerful enough for thinking about superintelligent AI. Other things which seem like steps in the right direction include correlated equilibria (which have somewhat nice "dynamic" stories of reaching equilibrium through learning).

Logical induction is a success case for magically getting nice self reference properties after a set of desired properties fell into place. Following the "abstraction" intuition could definitely work out that way. Another passion example is how Hartry Field followed a line of research about the sorities paradox developed a logic of vagueness, and ended up with a theory of self-referential truth. But the first example involved leaving the Bayesian paradigm, and the second involved breaking map/territory intuitions and classical logic.

Hadn't seen the dice example, is it from Jaynes? (I don't yet see why you're better off randomising)

The dice example is one I stumbled on while playing with the idea of a probability-like calculus for excluding information, rather than including information. I'll write up a post on it at some point.

I can see how this notion of dynamics-rather-than-equilibrium fits nicely with something like logical induction - there's a theme of refining our equilibria and our beliefs over time. But I'm not sure how these refining-over-time strategies can play well with embeddedness. When I imagine an embedded agent, I imagine some giant computational circuit representing the universe, and I draw a box around one finite piece of it and say "this piece is doing something agenty: it took in a bunch of information, calculated a bit, then chose its output to optimize such-and-such". That's what I imagine the simplest embedded agents look like: info in, finite optimizer circuit, one single decision out, whole thing is a finite chunk of circuitry. Of course we could have agents which persist over time, collecting information and making multiple decisions, but if our theory of embedded agency assumes that, then it seems like it will miss a lot of agenty behavior.

Not sure if you're imagining a different notion of agency, or imagining using the theory in a different way, or... ?

The dice example is one I stumbled on while playing with the idea of a probability-like calculus for excluding information, rather than including information. I'll write up a post on it at some point.

I look forward to it.

When I imagine an embedded agent, I imagine some giant computational circuit representing the universe, and I draw a box around one finite piece of it

Speaking very abstractly, I think this gets at my actual claim. Continuing to speak at that high level of abstraction, I am claiming that you should imagine an agent more as a flow through a fluid.

Speaking much more concretely, this difference comes partly from the question of whether to consider robust delegation as a central part to tackle now, or (as you suggested in the post) a part to tackle later. I agree with your description of robust delegation as "hard mode", but nonetheless consider it to be central.

To name some considerations:

  • The "static" way of thinking involves handing decision problems to agents without asking how the agent found itself in that situation. The how-did-we-get-here question is sometimes important. For example, my rejection of the standard smoking lesion problem is a how-did-we-get-here type objection.
  • Moreover, "static" decision theory puts a box around "epistemics" with an output to decision-making. This implicitly suggests: "Decision theory is about optimal action under uncertainty -- the generation of that uncertainty is relegated to epistemics." This ignores the role of learning how to act. Learning how to act can be critical even for decision theory in the abstract (and is obviously important to implementation).
  • Viewing things from a learning-theoretic perspective, it doesn't generally make sense to view a single thing (a single observation, a single action/decision, etc) in isolation. So, accounting for logical non-omniscience, we can't expect to make a single decision "correctly" for basically any notion of "correctly". What we can expect is to be "moving in the right direction" -- not at a particular time, but generally over time (if nothing kills us).
    • So, describing an embedded agent in some particular situation, the notion of "rational (bounded) agency" should not expect anything optimal about its actions in that circumstance -- it can only talk about the way the agent updates.
    • Due to logical non-omniscience, this applies to the action even if the agent is at the point where it knows what's going on epistemically -- it might not have learned to appropriately react to the given situation yet. So even "reacting optimally given your (epistemic) uncertainty" isn't realistic as an expectation for bounded agents.
  • Obviously I also think the "dynamic" view is better in the purely epistemic case as well -- logical induction being the poster boy, totally breaking the static rules of probability theory at a fixed time but gradually improving its beliefs over time (in a way which approaches the static probabilistic laws but also captures more).
    • Even for purely Bayesian learning, though, the dynamic view is a good one. Bayesian learning is a way of setting up dynamics such that better hypotheses "rise to the top" over time. It is quite analogous to replicator dynamics as a model of evolution.
  • You can do "equilibrium analysis" of evolution, too (ie, evolutionary stable equilibria), but it misses how-did-we-get-here type questions: larger and smaller attractor basins. (Evolutionarily stable equilibria are sort of a patch on Nash equilibria to address some of the how-did-we-get-here questions, by ruling out points which are Nash equilibria but which would not be attractors at all.) It also misses out on orbits and other fundamentally dynamic behavior.
    • (The dynamic phenomena such as orbits become important in the theory of correlated equilibria, if you get into the literature on learning correlated equilibria (MAL -- multi-agent learning) and think about where the correlations come from.)
Of course we could have agents which persist over time, collecting information and making multiple decisions, but if our theory of embedded agency assumes that, then it seems like it will miss a lot of agenty behavior.

I agree that requiring dynamics would miss some examples of actual single-shot agents, doing something intelligently, once, in isolation. However, it is a live question for me whether such agents can be anything else that Boltzmann brains. In Does Agent-like Behavior imply Agent-like Architecture, Scott mentioned that it seems quite unlikely that you could get a look-up table which behaves like an agent without having an actual agent somewhere causally upstream of it. Similarly, I'm suggesting that it seems unlikely you could get an agent-like architecture sitting in the universe without some kind of learning process causally upstream.

Moreover, continuity is central to the major problems and partial solutions in embedded agency. X-risk is a robust delegation failure more than a decision-theory failure or an embedded world-model failure (though subsystem alignment has a similarly strong claim). UDT and TDT are interesting largely because of the way they establish dynamic consistency of an agent across time, partially addressing the tiling agent problem. (For UDT, this is especially central.) But, both of them ultimately fail very much because of their "static" nature.

[I actually got this static/dynamic picture from komponisto btw (talking in person, though the posts give a taste of it). At first it sounded like rather free-flowing abstraction, but it kept surprising me by being able to bear weight. Line-per-line, though, much more of the above is inspired by discussions with Steve Rayhawk.]

Edit: Vanessa made a related point in a comment on another post.

Great explanation, thanks. This really helped clear up what you're imagining.

I'll make a counter-claim against the core point:

... at that high level of abstraction, I am claiming that you should imagine an agent more as a flow through a fluid.

I think you make a strong case both that this will capture most (and possibly all) agenty behavior we care about, and that we need to think about agency this way long term. However, I don't think this points toward the right problems to tackle first.

Here's roughly the two notions of agency, as I'm currently imagining them:

  • "one-shot" agency: system takes in some data, chews on it, then outputs some actions directed at achieving a goal
  • "dynamic" agency: system takes in data and outputs decisions repeatedly, over time, gradually improving some notion of performance

I agree that we need a theory for the second version, for all of the reasons you listed - most notably robust delegation. I even agree that robust delegation is a central part of the problem - again, the considerations you list are solid examples, and you've largely convinced me on the importance of these issues. But consider two paths to build a theory of dynamic agency:

  • First understand one-shot agency, then think about dynamic agency in terms of processes which produce (a sequence of) effective one-shot agents
  • Tackle dynamic agency directly

My main claim is that the first path will be far easier, to the point that I do not expect anyone to make significant useful progress on understanding dynamic agency without first understanding one-shot agency.

Example: consider a cat. If we want to understand the whole cause-and-effect process which led to a cat's agenty behavior, then we need to think a lot about evolution. On the other hand, presumably people recognized that cats have agenty behavior long before anybody knew anything about evolution. People recognized that cats have goal-seeking behavior, people figured out (some of) what cats want, people gained some idea of what cats can and cannot learn... all long before understanding the process which produced the cat.

More abstractly: I generally agree that agenty behavior (e.g. a cat) seems unlikely to show up without some learning process to produce it (e.g. evolution). But it still seems possible to talk about agenty things without understanding - or even knowing anything about - the process which produced the agenty things. Indeed, it seems easier to talk about agenty things than to talk about the processes which produce them. This includes agenty things with pretty limited learning capabilities, for which the improving-over-time perspective doesn't work very well - cats can learn a bit, but they're finite and have pretty limited capacity.

Furthermore, one-shot (or at least finite) agency seems like it better describes the sort of things I mostly care about when I think about "agents" - e.g. cats. I want to be able to talk about cats as agents, in and of themselves, despite the cats not living indefinitely or converging to any sort of "optimal" behavior over long time spans or anything like that. I care about evolution mainly insofar as it lends insights into cats and other organisms - i.e., I care about long-term learning processes mainly insofar as it lends insights into finite agents. Or, in the language of subsystem alignment, I care about the outer optimization process mainly insofar as it lends insight into the mesa-optimizers (which are likely to be more one-shot-y, or at least finite). So it feels like we need a theory of one-shot agency just to define the sorts of things we want our theory of dynamic agency to talk about, especially from a mesa-optimizers perspective.

Conversely, if we already had a theory of what effective one-shot agents look like, then it would be a lot easier to ask "what sort of processes produce these kinds of systems"?

I agree that if a point can be addressed or explored in a static framework, it can be easier to do that first rather than going to the fully dynamic picture.

On the other hand, I think your discussion of the cat overstates the case. Your own analysis of the decision theory of a single-celled organism (ie the perspective you've described to me in person) compares it to gradient descent, rather than expected utility maximization. This is a fuzzy area, and certainly doesn't achieve all the things I mentioned, but doesn't that seem more "dynamic" than "static"? Today's deep learning systems aren't as generally intelligent as cats, but it seems like the gap exists more within learning theory than static decision theory.

More importantly, although the static picture can be easier to analyse, it has also been much more discussed for that reason. The low-hanging fruits are more likely to be in the more neglected direction. Perhaps the more difficult parts of the dynamic picture (perhaps robust delegation) can be put aside while still approaching things from a learning-theoretic perspective.

I may have said something along the lines of the static picture already being essentially solved by reflective oracles (the problems with reflective oracles being typical of the problems with the static approach). From my perspective, it seems like time to move on to the dynamic picture in order to make progress. But that's overstating things a bit -- I am interested in better static pictures, particularly when they are suggestive of dynamic pictures, such as COEDT.

In any case, I have no sense that you're making a mistake by looking at abstraction in the static setting. If you have traction, you should continue in that direction. I generally suspect that the abstraction angle is valuable, whether static or dynamic.

Still, I do suspect we have material disagreements remaining, not only disagreements in research emphasis.

Toward the end of your comment, you speak of the one-shot picture and the dynamic picture as if the two are mutually exclusive, rather than just easy mode vs hard mode as you mention early on. A learning picture still admits static snapshots. Also, cats don't get everything right on the first try.

Still, I admit: a weakness of an asymptotic learning picture is that it seems to eschew finite problems; to such an extent that at times I've said the dynamic learning picture serves as the easy version of the problem, with one-shot rationality being the hard case to consider later. Toy static pictures -- such as the one provided by reflective oracles -- give an idealized static rationality, using unbounded processing power and logical omniscience. A real static picture -- perhaps the picture you are seeking -- would involve bounded rationality, including both logical non-omniscience and regular physical non-omniscience. A static-rationality analysis of logical non-omnincience has seemed quite challenging so far. Nice versions of self-reference and other challenges to embedded world-models such as those you mention seem to require conveniences such as reflective oracles. Nothing resembling thin priors has come along to allow for eventual logical coherence while resembling bayesian static rationality (rather than logical-induction-like dynamic rationality). And as for the empirical uncertainty, we would really like to get some guarantees about avoiding catastrophic mistakes (though, perhaps, this isn't within your scope).

Wow, this is a really fascinating comment.

Hadn’t seen the dice example, is it from Jaynes? (I don’t yet see why you’re better off randomising)

Well, one way to forget the sum is to generate random pairs of dice for each possible sum and replace one of them with your actual pair. For example, if your dice came up (3 5), you can rewrite your memory with something like "the result was one of (1 1) (2 1) (3 1) (4 1) (4 2) (2 5) (3 5) (4 5) (6 4) (6 5) (6 6)". Is there a simpler way?

Obviously if you I the sum, I just want to know the die1-die2? The only problem is that the signed difference looks like a uniform distribution with width dependent on the sum - the signed difference can range from 11 possibilities (-5 to 5) down to 1 (0).

So what I think you do is you put all the differences onto the same scale by constructing a "unitless difference," which will actually be defined as a uniform distribution.

Rather than having the difference be a single number in a chunk of the number line that changes in size, you construct a big set of ordered points of fixed size equal to the least common multiple of the number of possible differences for all sums. If you think of a difference not as a number, but as a uniform distribution on the set of possible differences, then you can just "scale up" this distribution from its set of variable into the big set of constant size, and sample from this distribution to forget the sum but remember the most information about the difference.

EDIT: I shouldn't do math while tired.

Note that the agent should rewrite its memory with a distribution, not just a list of tuples - e.g. {(1 1): 1/36, (2 1): 2/36, ...}. That way the "posterior" distribution on the sum will match the prior distribution on the sum.

That said, this is basically correct. It matches the answer(s) I got, and is more elegant.

Yeah. I guess I was assuming that the agent knows the list of tuples and also knows that they came from the procedure I described; the distribution follows from that :-)

Asya's summary for the Alignment Newsletter:

<@Embedded agency problems@>(@Embedded Agents@) are a class of theoretical problems that arise as soon as an agent is part of the environment it is interacting with and modeling, rather than having a clearly-defined and separated relationship. This post makes the argument that before we can solve embedded agency problems, we first need to develop a theory of _abstraction_. _Abstraction_ refers to the problem of throwing out some information about a system while still being able to make predictions about it. This problem can also be referred to as the problem of constructing a map for some territory.
The post argues that abstraction is key for embedded agency problems because the underlying challenge of embedded world models is that the agent (the map) is smaller than the environment it is modeling (the territory), and so inherently has to throw some information away.
Some simple questions around abstraction that we might want to answer include:
- Given a map-making process, characterize the queries whose answers the map can reliably predict.
- Given some representation of the map-territory correspondence, translate queries from the territory-representation to the map-representation and vice versa.
- Given a territory, characterize classes of queries which can be reliably answered using a map much smaller than the territory itself.
- Given a territory and a class of queries, construct a map which throws out as much information as possible while still allowing accurate prediction over the query class.
The post argues that once we create the simple theory, we will have a natural way of looking at more challenging problems with embedded agency, like the problem of self-referential maps, the problem of other map-makers, and the problem of self-reasoning that arises when the produced map includes an abstraction of the map-making process itself.

Asya's opinion:

My impression is that embedded agency problems as a class of problems are very young, extremely entangled, and characterized by a lot of confusion. I am enthusiastic about attempts to decrease confusion and intuitively, abstraction does feel like a key component to doing that.
That being said, my guess is that it’s difficult to predictably suggest the most promising research directions in a space that’s so entangled. For example, one thread in the comments of this post discusses the fact that this theory of abstraction as presented looks at “one-shot” agency where the system takes in some data once and then outputs it, rather than “dynamic” agency where a system takes in data and outputs decisions repeatedly over time. Abram Demski argues that the “dynamic” nature of embedded agency is a central part of the problem and that it may be more valuable and neglected to put research emphasis there.

A side-note:

Given a territory and a class of queries, construct a map which throws out as much information as possible while still allowing accurate prediction over the query class.

Can't remember the specific reference but: Imperfect-information game theory has some research on abstractions. Naturally, an object of interest are "optimal" abstractions --- i.e., ones that are as small as possible for given accuracy, or as accurate as possible for given size. However, there are typically some negative results, stating that getting (near-) optimal abstractions is at least as expensive as finding the (near-) optimal solution of the full game. Intuitively, I would expect this to be a recurring theme for abstractions in general.

The implication of this is that all the goals should have the implicitly have the caveat that the maps have to be "not-too-expensive to construct". (This is intended to be a side-note, not an advocacy to change the formulation. The one you have there is accessible and memorable :-).)

Thanks for the pointer, sounds both relevant and useful. I'll definitely look into it.