Claim: problems of agents embedded in their environment mostly reduce to problems of abstraction. Solve abstraction, and solutions to embedded agency problems will probably just drop out naturally.

The goal of this post is to explain the intuition underlying that claim. The point is not to defend the claim socially or to prove it mathematically, but to illustrate why I personally believe that understanding abstraction is the key to understanding embedded agency. Along the way, we’ll also discuss exactly which problems of abstraction need to be solved for a theory of embedded agency.

What do we mean by “abstraction”?

Let’s start with a few examples:

  • We have a gas consisting of some huge number of particles. We throw away information about the particles themselves, instead keeping just a few summary statistics: average energy, number of particles, etc. We can then make highly precise predictions about things like pressure just based on the reduced information we've kept, without having to think about each individual particle. That reduced information is the "abstract layer" - the gas and its properties.
  • We have a bunch of transistors and wires on a chip. We arrange them to perform some logical operation, like maybe a NAND gate. Then, we throw away information about the underlying details, and just treat it as an abstract logical NAND gate. Using just the abstract layer, we can make predictions about what outputs will result from what inputs. Note that there’s some fuzziness - 0.01 V and 0.02 V are both treated as logical zero, and in rare cases there will be enough noise in the wires to get an incorrect output.
  • I tell my friend that I'm going to play tennis. I have ignored a huge amount of information about the details of the activity - where, when, what racket, what ball, with whom, all the distributions of every microscopic particle involved - yet my friend can still make some reliable predictions based on the abstract information I've provided.
  • When we abstract formulas like "1+1=2" or "2+2=4" into "n+n=2n", we're obviously throwing out information about the value of n, while still making whatever predictions we can given the information we kept. This is what abstraction is all about in math and programming: throw out as much information as you can, while still maintaining the core "prediction".
  • I have a street map of New York City. The map throws out lots of info about the physical streets: street width, potholes, power lines and water mains, building facades, signs and stoplights, etc. But for many questions about distance or reachability on the physical city streets, I can translate the question into a query on the map. My query on the map will return reliable predictions about the physical streets, even though the map has thrown out lots of info.

The general pattern: there’s some ground-level “concrete” model, and an abstract model. The abstract model throws away or ignores information from the concrete model, but in such a way that we can still make reliable predictions about some aspects of the underlying system.
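
To make the pattern concrete, here's a minimal sketch of the NAND example in Python. The voltage thresholds, supply voltage, and noise level are made-up illustrative numbers, not real chip specs; the point is just the shape: a concrete model with lots of detail, an abstract model that throws the detail away, and a check that the abstraction's predictions hold.

```python
import random

# Concrete model: a NAND gate operating on noisy voltages.
# Thresholds, supply voltage, and noise level are illustrative assumptions.
LOGICAL_ONE_MIN = 2.5   # volts; anything above this reads as logical 1
V_HIGH, V_LOW = 3.3, 0.0

def concrete_nand(v_a, v_b, noise=0.1):
    """NAND on voltages: output goes low only if both inputs read high."""
    both_high = (v_a > LOGICAL_ONE_MIN) and (v_b > LOGICAL_ONE_MIN)
    out = V_LOW if both_high else V_HIGH
    return out + random.gauss(0, noise)   # wire noise

# Abstract model: NAND on booleans, all voltage detail thrown away.
def abstract_nand(a, b):
    return not (a and b)

# Check that the abstract layer predicts the concrete behaviour.
matches, trials = 0, 10_000
for _ in range(trials):
    a, b = random.random() < 0.5, random.random() < 0.5
    v_a = (V_HIGH if a else V_LOW) + random.gauss(0, 0.1)
    v_b = (V_HIGH if b else V_LOW) + random.gauss(0, 0.1)
    predicted = abstract_nand(a, b)
    actual = concrete_nand(v_a, v_b) > LOGICAL_ONE_MIN
    matches += (predicted == actual)
print(f"abstract model matched concrete output in {matches}/{trials} runs")
```

With these numbers the abstraction is essentially exact; shrinking the margin between the logic threshold and the noise would reintroduce the fuzziness discussed below.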

Notice that, in most of these examples, the predictions of the abstract model need not be perfectly accurate. The mathematically exact abstractions used in pure math and CS are an unusual corner case: they don’t deal with the sort of fuzzy boundaries we see in the real world. "Tennis", on the other hand, is a fuzzy abstraction of many real-world activities, and there are edge cases which are sort-of-tennis-but-maybe-not. Most of the interesting problems involve non-exact abstraction, so we'll mostly talk about that, with the understanding that math/CS-style abstraction is just the case with zero fuzz.

In terms of existing theory, I only know of one field which explicitly quantifies abstraction without needing hard edges: statistical mechanics. The heart of the field is things like "I have a huge number of tiny particles in a box, and I want to treat them as one abstract object which I'll call ‘gas’. What properties will the gas have?" Jaynes puts the tools of statistical mechanics on foundations which can, in principle, be used for quantifying abstraction more generally. (I don't think Jaynes had all the puzzle pieces, but he had a lot more than anyone else I've read.) It's rather difficult to find good sources for learning stat mech the Jaynes way; Walter Grandy has a few great books, but they're not exactly intro-level.

Summary: abstraction is about ignoring or throwing away information, in such a way that we can still make reliable predictions about some aspects of the underlying system.

Embedded World-Models

The next few sections will walk through different ways of looking at the core problems of embedded agency, as presented in the embedded agency sequence. We’ll start with embedded world-models, since these introduce the key constraint for everything else.

The underlying challenge of embedded world models is that the map is smaller than the territory it represents. The map simply won’t have enough space to perfectly represent the state of the whole territory - much less every possible territory, as required for Bayesian inference. A piece of paper with some lines on it doesn’t have space to represent the full microscopic configuration of every atom comprising the streets of New York City.

Obvious implication: the map has to throw out some information about the territory. (Note that this isn’t necessarily true in all cases: the territory could have some symmetry allowing for a perfect compressed representation. But this probably won’t apply to most real-world systems, e.g. the full microscopic configuration of every atom comprising the streets of New York City.)

So we need to throw out some information to make a map, but we still want to be able to reliably predict some aspects of the territory - otherwise there wouldn’t be any point in building a map to start with. In other words, we need abstraction.

Exactly what problems of abstraction do we need to solve?

The simplest problems are things like:

  • Given a map-making process, characterize the queries whose answers the map can reliably predict. Example: figure out what questions a streetmap can answer by watching a cartographer produce a streetmap.
  • Given some representation of the map-territory correspondence, translate queries from the territory-representation to the map-representation and vice versa. Example: after understanding the relationship between streets and lines on paper, turn “how far is Times Square from the Met?” into “How far is the Times Square symbol from the Met symbol on the map, and what’s the scale?”
  • Given a territory, characterize classes of queries which can be reliably answered using a map much smaller than the territory itself. Example: recognize that the world contains lots of things with leaves, bark, branches, etc, and these “trees” are similar enough that a compressed map can reliably make predictions about specific trees - e.g. things with branches and bark are also likely to have leaves.
  • Given a territory and a class of queries, construct a map which throws out as much information as possible while still allowing accurate prediction over the query class (see the sketch after this list).
  • Given a map and a class of queries whose answers the map can reliably predict, characterize the class of territories which the map might represent.
  • Given multiple different maps supporting different queries, how can we use them together consistently? Example: a construction project may need to use both a water-main map and a streetmap to figure out where to dig.
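
As a toy illustration of the fourth bullet (constructing a map which throws out as much information as possible for a given query class): for a finite territory, one natural construction is to lump together territory-states which give identical answers to every query in the class. The example below is entirely invented (two dice, two queries) and is just a sketch of that idea, not a proposed general algorithm.

```python
from collections import defaultdict

# Territory: all states of two dice. Queries: an invented query class.
territory = [(i, j) for i in range(1, 7) for j in range(1, 7)]
queries = [lambda s: sum(s) >= 7,     # "is the total at least 7?"
           lambda s: s[0] == s[1]]    # "are the dice doubles?"

# The map keeps only the tuple of query answers, discarding everything else.
def abstract_state(s):
    return tuple(q(s) for q in queries)

cells = defaultdict(list)
for s in territory:
    cells[abstract_state(s)].append(s)

print(len(territory), "concrete states ->", len(cells), "abstract states")
# Every query in the class is answerable from the abstract state alone,
# so the 36-state territory compresses to a 4-state map for this query class.
```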

These kinds of questions directly address many of the issues from Abram & Scott’s embedded world-models post: grain-of-truth, high-level/multi-level models, ontological crises. But we still need to discuss the biggest barrier to a theory of embedded world-models: diagonalization, i.e. a territory which sees the map’s predictions and then falsifies them.

If the map is embedded in the territory, then things in the territory can look at what the map predicts, then make the prediction false. For instance, some troll in the department of transportation could regularly check Google’s traffic map for NYC, then quickly close off roads to make the map as inaccurate as possible. This sort of thing could even happen naturally, without trolls: if lots of people follow Google’s low-traffic route recommendations, then the recommended routes will quickly fill up with traffic.

These examples suggest that, when making a map of a territory which contains the map, there is a natural role for randomization: Google’s traffic-mapping team can achieve maximum accuracy by randomizing their own predictions. Rather than recommending the same minimum-traffic route for everyone, they can randomize between a few routes and end up at a Nash equilibrium in their prediction game.
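
As a sketch of what that equilibrium could look like, here's a toy zero-sum "prediction game" with made-up payoffs: the map recommends a route, the territory (drivers plus the occasional troll) reacts, and fictitious play finds the map's equilibrium randomization. None of this is meant as the actual objective - that's exactly the open question below - just an illustration of randomized prediction as a Nash equilibrium.

```python
import numpy as np

# Illustrative payoff matrix for the map's prediction accuracy.
# Rows: map recommends route A or B. Columns: which route ends up congested
# after the territory reacts. Higher = the map's prediction was more accurate.
payoff = np.array([[0.2, 0.9],    # recommend A
                   [0.9, 0.2]])   # recommend B

# Fictitious play: each side best-responds to the other's empirical frequencies.
# For two-player zero-sum games, the empirical strategies converge to a Nash
# equilibrium of the prediction game.
map_counts, env_counts = np.ones(2), np.ones(2)
for _ in range(10_000):
    env_freq = env_counts / env_counts.sum()
    map_counts[np.argmax(payoff @ env_freq)] += 1   # map maximizes accuracy
    map_freq = map_counts / map_counts.sum()
    env_counts[np.argmin(map_freq @ payoff)] += 1   # territory minimizes it

print("map's equilibrium recommendation mix:", map_counts / map_counts.sum())
# With these symmetric payoffs, the mix approaches 50/50: the map does best by
# randomizing its own predictions rather than always recommending one route.
```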

We’re speculating about a map making predictions based on a game-theoretic mixed strategy, but at this point we haven’t even defined the rules of the game. What is the map’s “utility function” in this game? The answer to that sort of question should come from thinking about the simpler questions from earlier. We want a theory where the “rules of the game” for self-referential maps follow naturally from the theory for non-self-referential maps. This is one major reason why I see abstraction as the key to embedded agency, rather than embedded agency as the key to abstraction: I expect a solid theory of non-self-referential abstractions to naturally define the rules/objectives of self-referential abstraction. Also, I expect the non-self-referential theory to characterize embedded map-making processes, which the self-referential theory will likely need to recognize in the territory.

Embedded Decision Theory

The main problem for embedded decision theory - as opposed to decision theory in general - is how to define counterfactuals. We want to ask questions like “what would happen if I dropped this apple on that table”, even if we can look at our own internal program and see that we will not, in fact, drop the apple. If we want our agent to maximize some expected utility function E[u(x)], then the “x” needs to represent a counterfactual scenario in which the agent takes some action - and we need to be able to reason about that scenario even if the agent ends up taking some other action.

Of course, we said in the previous section that the agent is using a map which is smaller than the territory - in “E[u(x)]”, that map defines the expectation operator E[-]. (Of course, we could imagine architectures which don’t explicitly use an expectation operator or utility function, but the main point carries over: the agent’s decisions will be based on a map smaller than the territory.) Decision theory requires that we run counterfactual queries on that map, so it needs to be a causal model.

In particular, we need a causal model which allows counterfactual queries over the agent’s own “outputs”, i.e. the results of any optimization it runs. In other words, the agent needs to be able to recognize itself - or copies of itself - in the environment. The map needs to represent, if not a hard boundary between agent and environment, at least the pieces which will be changed by the agent’s computation and/or actions.
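
Here's a minimal sketch of what "counterfactual queries over the agent's own outputs" means, using a hand-rolled structural causal model (all variables and values are invented for illustration). The intervention replaces the agent's decision computation with a fixed value while leaving the rest of the causal structure intact.

```python
# Tiny structural causal model, written out by hand for illustration.
def run_model(do_action=None):
    # Exogenous state of the world: the apple is in hand, a table is nearby.
    apple_in_hand, table_nearby = True, True

    # The agent's actual decision computation, as a node in the model.
    action = "hold" if apple_in_hand else "do nothing"
    if do_action is not None:
        action = do_action   # do(action = ...): cut the edge from the computation

    # Downstream consequences of the action.
    apple_on_table = (action == "drop") and table_nearby
    return {"action": action, "apple_on_table": apple_on_table}

print(run_model())                   # factual: the agent holds the apple
print(run_model(do_action="drop"))   # counterfactual: what if it dropped it?
```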

What constraints does this pose on a theory of abstraction suitable for embedded agency?

The main constraints are:

  • The map and territory should both be causal (possibly with symmetry)
  • Counterfactual queries on the map should naturally correspond to counterfactuals on the territory
  • The agent needs some idea of which counterfactuals on the map correspond to its own computations/actions in the territory - i.e. it needs to recognize itself

These are the minimum requirements for the agent to plan out its actions based on the map, implement the plan in the territory, and have such plans work.

Note that there’s still a lot of degrees of freedom here. For instance, how does the agent handle copies of itself embedded in the environment? Some answers to that question might be “better” than others, in terms of producing more utility or something, but I see that as a decision theory question which is not a necessary prerequisite for a theory of embedded agency. On the other hand, a theory of embedded agency would probably help build decision theories which reason about copies of the agent. This is a major reason why I see a theory of abstraction as a prerequisite to new decision theories, but not new decision theories as a prerequisite to abstraction: we need abstraction on causal models just to talk about embedded decision theory, but problems like agent-copies can be built later on top of a theory of abstraction - especially a theory of abstraction which already handles self-referential maps.

Self-Reasoning & Improvement

Problems of self-reasoning, improvement, tiling, and so forth are similar to the problems of self-referential abstraction, but on hard mode. We’re no longer just thinking about a map of a territory which contains the map; we’re thinking about a map of a territory which contains the whole map-making process, and we want to e.g. modify the map-making process to produce more reliable maps. But if our goals are represented on the old, less-reliable map, can we safely translate those goals into the new map? For that matter, do the goals on the old map even make sense in the territory?

So… hard mode. What do we need from our theory of abstraction?

A lot of this boils down to the “simple” questions from earlier: make sure queries on the old map translate intelligibly into queries on the territory, and are compatible with queries on other maps, etc. But there are some significant new elements here: reflecting specifically on the map-making process, especially when we don’t have an outside-view way to know that we’re thinking about the territory “correctly” to begin with.

These things feel to me like “level 2” questions. Level 1: build a theory of abstraction between causal models. Handle cases where the map models a copy of itself, e.g. when an agent labels its own computations/actions in the map. Part of that theory should talk about map-making processes: for what queries/territories will a given map-maker produce a map which makes successful predictions? What map-making processes produce successful self-referential maps? Once level 1 is nailed down, we should have the tools to talk about level 2: running counterfactuals in which we change the map-making process.

Of course, not all questions of self-reasoning/improvement are about abstraction. We could also ask questions about e.g. how to make an agent which modifies its own code to run faster, without changing input/output (though of course input/output are slippery notions in an embedded world…). We could ask questions about how to make an agent modify its own decision theory. Etc. These problems don’t inherently involve abstraction. My intuition, however, is that the problems which don’t involve self-referential abstraction usually seem easier. That’s not to say people shouldn’t work on them - there’s certainly value there, and they seem more amenable to incremental progress - but the critical path to a workable theory of embedded agency seems to go through self-referential maps and map-makers.

Subsystems

Agents made of parts have subsystems. Insofar as those subsystems are also agenty and have goals of their own, we want them to be aligned with the top-level agent. What new requirements does this pose for a theory of abstraction?

First and foremost, if we want to talk about agent subsystems, then our map can’t just black-box the whole agent. We can’t circumvent the lack of an agent-environment boundary by simply drawing our own agent-environment boundary, and ignoring everything on the “agent” side. That doesn’t necessarily mean that we can’t do any self-referential black boxing. For instance, if we want to represent a map which contains a copy of itself, then a natural method is to use a data structure which contains a pointer to itself. That sort of strategy has not necessarily been ruled out, but we can’t just blindly apply it to the whole agent.
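
For concreteness, here's the pointer trick as a few lines of Python (the names are invented for illustration): the structure is finite, but it represents a territory that contains the map by reference rather than by an impossible full copy.

```python
# A map that contains itself by reference, not by copy.
nyc_map = {"streets": ["Broadway", "5th Ave"], "maps_in_territory": []}
nyc_map["maps_in_territory"].append(nyc_map)   # pointer to itself

# The object is finite, yet queries can follow the self-pointer arbitrarily deep.
assert nyc_map["maps_in_territory"][0] is nyc_map
assert nyc_map["maps_in_territory"][0]["maps_in_territory"][0] is nyc_map
```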

In particular, if we’re working with causal models (possibly with symmetry), then the details of the map-making process and the reflecting-on-map-making process and whatnot all need to be causal as well. We can’t call on oracles or non-constructive existence theorems or other such magic. Loosely speaking, our theory of abstraction needs to be computable.

In addition, we don’t just want to model the agent as having parts, we want to model some of the parts as agenty - or at least consider that possibility. In particular, that means we need to talk about other maps and other map-makers embedded in the environment. We want to be able to recognize map-making processes embedded in the territory. And again, this all needs to be computable, so we need algorithms to recognize map-making processes embedded in the territory.

We’re talking about these capabilities in the context of aligning subagents, but this is really a key requirement for alignment more broadly. Ultimately, we want to point at something in the territory and say “See that agenty thing over there? That’s a human; there’s a bunch of them out in the world. Figure out their values, and help satisfy those values.” Recognizing agents embedded in the territory is a key piece of this, and recognizing embedded map-making processes seems to me like the hardest part of that problem - again, it’s on the critical path.

Summary

Time for a recap.

The idea of abstraction is to throw out information, while still maintaining the ability to provide reliable predictions on at least some queries.

In order to address the core problems of embedded world models, a theory of abstraction would need to first handle some “simple” questions:

  • Characterize which queries work on which maps of which territories.
  • Characterize which query classes admit significantly-compressed maps on which territories.
  • Characterize map-making processes which produce reliable maps.
  • Translate queries between map-representation and territory-representation, and between different map-representations

We hope that a theory which addresses these problems on non-self-referential maps will suggest natural objectives/rules for self-referential maps.

Embedded decision theory adds a few more constraints, in order to define counterfactuals for optimization:

  • Our theory of abstraction should work with causal models for both the territory and the map
  • We need ways of mapping between counterfactuals on the map and counterfactuals on the territory
  • Agents need some way to recognize their own computations/outputs in the territory, and represent them in the map.

A theory of abstraction on causal models seems necessary just to talk about embedded decision theory in a well-defined way.

Self-reasoning kicks self-referential map-making one rung up the meta-ladder, and starts to talk about maps of map-making processes and related issues. These aren’t the only problems of self-reasoning, but it does feel like self-referential abstraction captures the “hard part” - it’s on the critical path to a full theory.

Finally, subsystems push us to make the entire theory of abstraction causal/computable. We also need algorithms for recognizing agents - and thus map-makers - embedded in the territory. That’s a problem we probably want to solve for safety purposes anyway. Again, abstraction isn’t the only part of the problem, but it seems to capture enough of the hard part to be on the critical path.

Comments
We’re speculating about a map making predictions based on a game-theoretic mixed strategy, but at this point we haven’t even defined the rules of the game. What is the map’s “utility function” in this game? The answer to that sort of question should come from thinking about the simpler questions from earlier. We want a theory where the “rules of the game” for self-referential maps follow naturally from the theory for non-self-referential maps.

I want to make a couple of points about this part:

  • A significant part of the utility of a map comes from the self-referential effects on the territory; the map needs to be chosen with this in mind to avoid catastrophic self-fulfilling prophecies. (This doesn't feel especially important for your point, but it is part of the puzzle.)
  • The definition of naturalized epistemic-goodness can take inspiration from non-self-referential versions of the problem, but faces additional wireheading-like problems, which places significant burden on it. You probably can't just take the "epistemic utility function" from the non-self-referential case. The paper "Epistemic Decision Theory" by Hilary Greaves explores this issue.
  • Thinking about self-reference may influence the "kind of thing" which is being scored. For example, in the non-self-referential setting, classical logic is a reasonable choice. Despite the ambiguities introduced by uncertain reasoning and abstraction, it might be reasonable to think of statements as basically being true or false, modulo some caveats. However, self-reference paradoxes may make non-classical logics more appropriate, with more radically different notions of truth-value. For example, reflective oracles deal with self-reference via probability (as you mention in the post, using Nash equilibria to avoid paradox in the face of self-reference). However, although it works to an extent, it isn't obviously right. Probability in the sense of uncertainty and probability in the sense of I-have-to-treat-this-as-random-because-it-structurally-depends-on-my-belief-in-a-way-which-diagonalizes-me might be fundamentally different from one another.
  • This same argument may also apply to the question of what abstraction even is.

I don't think you were explicitly denying any of this; I just wanted to call out that these things may create complications for the research agenda. My personal sense is that it could be possible to come up with the right notion by focusing on the non-self-referential case alone (and paying very close attention to what feels right/wrong), but anticipating the issues which will arise in the self-referential case provides significantly more constraints and thus significantly more guidance. A wide variety of tempting simplifications are available in the absence of self-reference.

I'm especially worried about the "kind of thing" point above. It isn't clear at all what kind of thing beliefs for embedded agents should be. Reflective oracles give a way to rescue probability theory for the embedded setting, but, are basically unrealistic. Logical inductors are of course somewhat more realistic (being computable), and look quite different. But, logical inductors don't have great decision-theoretic properties (so far).

Your "kind of thing" concern feels like it's pointing to the right problem, although I think I'm more confident than you that it will end up looking like probability. It feels to me like we're missing an interpretation of probability which would make this all make sense - something which would unify uncertainty-randomness and game-theoretic-randomness, in a causal setting, without invoking limiting frequencies or ontologically basic agents with beliefs.

You do make a strong case that such an interpretation may involve more than just map-territory correspondence, which dramatically widens the net in terms of what to look for.

It feels to me like throwing away information is the key piece here. For instance: I roll 2 dice, observe the outcome, and then throw away all info about their sum. What "posterior" leaves me with the most possible information, while still forgetting everything about the sum (i.e. "posterior" marginal distribution of sum is same as prior)? Optimal performance here requires randomizing my own beliefs. This sort of thing makes me think that a theory of abstraction - inherently about throwing away info - will point toward the key pieces, even before we introduce explicit self-reference.

I think one difference between us is, I really don't expect standard game-theoretic ideas to survive. They're a good starting point, but, we need to break them down to something more fundamental. (Breaking down probability (further than logical induction already does, that is), while on my radar, is far more speculative than that.)

Basic game theory uses equilibrium analysis. We need a theory of dynamics instead of only equilibrium, because a reasoner needs to find an equilibrium somehow -- and the "somehow" is going to involve computational learning theory. Evolutionary game theory is a step in the right direction but not powerful enough for thinking about superintelligent AI. Other things which seem like steps in the right direction include correlated equilibria (which have somewhat nice "dynamic" stories of reaching equilibrium through learning).

Logical induction is a success case for magically getting nice self-reference properties after a set of desired properties fell into place. Following the "abstraction" intuition could definitely work out that way. Another example is how Hartry Field followed a line of research about the sorites paradox, developed a logic of vagueness, and ended up with a theory of self-referential truth. But the first example involved leaving the Bayesian paradigm, and the second involved breaking map/territory intuitions and classical logic.

Hadn't seen the dice example, is it from Jaynes? (I don't yet see why you're better off randomising)

The dice example is one I stumbled on while playing with the idea of a probability-like calculus for excluding information, rather than including information. I'll write up a post on it at some point.

I can see how this notion of dynamics-rather-than-equilibrium fits nicely with something like logical induction - there's a theme of refining our equilibria and our beliefs over time. But I'm not sure how these refining-over-time strategies can play well with embeddedness. When I imagine an embedded agent, I imagine some giant computational circuit representing the universe, and I draw a box around one finite piece of it and say "this piece is doing something agenty: it took in a bunch of information, calculated a bit, then chose its output to optimize such-and-such". That's what I imagine the simplest embedded agents look like: info in, finite optimizer circuit, one single decision out, whole thing is a finite chunk of circuitry. Of course we could have agents which persist over time, collecting information and making multiple decisions, but if our theory of embedded agency assumes that, then it seems like it will miss a lot of agenty behavior.

Not sure if you're imagining a different notion of agency, or imagining using the theory in a different way, or... ?

The dice example is one I stumbled on while playing with the idea of a probability-like calculus for excluding information, rather than including information. I'll write up a post on it at some point.

I look forward to it.

When I imagine an embedded agent, I imagine some giant computational circuit representing the universe, and I draw a box around one finite piece of it

Speaking very abstractly, I think this gets at my actual claim. Continuing to speak at that high level of abstraction, I am claiming that you should imagine an agent more as a flow through a fluid.

Speaking much more concretely, this difference comes partly from the question of whether to consider robust delegation as a central part to tackle now, or (as you suggested in the post) a part to tackle later. I agree with your description of robust delegation as "hard mode", but nonetheless consider it to be central.

To name some considerations:

  • The "static" way of thinking involves handing decision problems to agents without asking how the agent found itself in that situation. The how-did-we-get-here question is sometimes important. For example, my rejection of the standard smoking lesion problem is a how-did-we-get-here type objection.
  • Moreover, "static" decision theory puts a box around "epistemics" with an output to decision-making. This implicitly suggests: "Decision theory is about optimal action under uncertainty -- the generation of that uncertainty is relegated to epistemics." This ignores the role of learning how to act. Learning how to act can be critical even for decision theory in the abstract (and is obviously important to implementation).
  • Viewing things from a learning-theoretic perspective, it doesn't generally make sense to view a single thing (a single observation, a single action/decision, etc) in isolation. So, accounting for logical non-omniscience, we can't expect to make a single decision "correctly" for basically any notion of "correctly". What we can expect is to be "moving in the right direction" -- not at a particular time, but generally over time (if nothing kills us).
    • So, describing an embedded agent in some particular situation, the notion of "rational (bounded) agency" should not expect anything optimal about its actions in that circumstance -- it can only talk about the way the agent updates.
    • Due to logical non-omniscience, this applies to the action even if the agent is at the point where it knows what's going on epistemically -- it might not have learned to appropriately react to the given situation yet. So even "reacting optimally given your (epistemic) uncertainty" isn't realistic as an expectation for bounded agents.
  • Obviously I also think the "dynamic" view is better in the purely epistemic case as well -- logical induction being the poster boy, totally breaking the static rules of probability theory at a fixed time but gradually improving its beliefs over time (in a way which approaches the static probabilistic laws but also captures more).
    • Even for purely Bayesian learning, though, the dynamic view is a good one. Bayesian learning is a way of setting up dynamics such that better hypotheses "rise to the top" over time. It is quite analogous to replicator dynamics as a model of evolution (see the sketch after this list).
  • You can do "equilibrium analysis" of evolution, too (ie, evolutionary stable equilibria), but it misses how-did-we-get-here type questions: larger and smaller attractor basins. (Evolutionarily stable equilibria are sort of a patch on Nash equilibria to address some of the how-did-we-get-here questions, by ruling out points which are Nash equilibria but which would not be attractors at all.) It also misses out on orbits and other fundamentally dynamic behavior.
    • (The dynamic phenomena such as orbits become important in the theory of correlated equilibria, if you get into the literature on learning correlated equilibria (MAL -- multi-agent learning) and think about where the correlations come from.)
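
A minimal sketch of the replicator/Bayes analogy mentioned in the bullet above, with toy numbers: a discrete-time replicator update in which fitness is played by the likelihood is just Bayes' rule.

```python
import numpy as np

# Toy numbers: three hypotheses ("species") and the likelihood ("fitness")
# each assigns to the observed data.
prior = np.array([0.5, 0.3, 0.2])
likelihood = np.array([0.9, 0.5, 0.1])

# Bayes' rule:
posterior = prior * likelihood / np.sum(prior * likelihood)
# Discrete-time replicator update with fitness = likelihood:
replicator = prior * likelihood / (prior @ likelihood)

assert np.allclose(posterior, replicator)
print(posterior)   # better hypotheses "rise to the top" as data comes in
```
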
Of course we could have agents which persist over time, collecting information and making multiple decisions, but if our theory of embedded agency assumes that, then it seems like it will miss a lot of agenty behavior.

I agree that requiring dynamics would miss some examples of actual single-shot agents, doing something intelligently, once, in isolation. However, it is a live question for me whether such agents can be anything other than Boltzmann brains. In Does Agent-like Behavior imply Agent-like Architecture, Scott mentioned that it seems quite unlikely that you could get a look-up table which behaves like an agent without having an actual agent somewhere causally upstream of it. Similarly, I'm suggesting that it seems unlikely you could get an agent-like architecture sitting in the universe without some kind of learning process causally upstream.

Moreover, continuity is central to the major problems and partial solutions in embedded agency. X-risk is a robust delegation failure more than a decision-theory failure or an embedded world-model failure (though subsystem alignment has a similarly strong claim). UDT and TDT are interesting largely because of the way they establish dynamic consistency of an agent across time, partially addressing the tiling agent problem. (For UDT, this is especially central.) But, both of them ultimately fail very much because of their "static" nature.

[I actually got this static/dynamic picture from komponisto btw (talking in person, though the posts give a taste of it). At first it sounded like rather free-flowing abstraction, but it kept surprising me by being able to bear weight. Line-per-line, though, much more of the above is inspired by discussions with Steve Rayhawk.]

Edit: Vanessa made a related point in a comment on another post.

Great explanation, thanks. This really helped clear up what you're imagining.

I'll make a counter-claim against the core point:

... at that high level of abstraction, I am claiming that you should imagine an agent more as a flow through a fluid.

I think you make a strong case both that this will capture most (and possibly all) agenty behavior we care about, and that we need to think about agency this way long term. However, I don't think this points toward the right problems to tackle first.

Here's roughly the two notions of agency, as I'm currently imagining them:

  • "one-shot" agency: system takes in some data, chews on it, then outputs some actions directed at achieving a goal
  • "dynamic" agency: system takes in data and outputs decisions repeatedly, over time, gradually improving some notion of performance

I agree that we need a theory for the second version, for all of the reasons you listed - most notably robust delegation. I even agree that robust delegation is a central part of the problem - again, the considerations you list are solid examples, and you've largely convinced me on the importance of these issues. But consider two paths to build a theory of dynamic agency:

  • First understand one-shot agency, then think about dynamic agency in terms of processes which produce (a sequence of) effective one-shot agents
  • Tackle dynamic agency directly

My main claim is that the first path will be far easier, to the point that I do not expect anyone to make significant useful progress on understanding dynamic agency without first understanding one-shot agency.

Example: consider a cat. If we want to understand the whole cause-and-effect process which led to a cat's agenty behavior, then we need to think a lot about evolution. On the other hand, presumably people recognized that cats have agenty behavior long before anybody knew anything about evolution. People recognized that cats have goal-seeking behavior, people figured out (some of) what cats want, people gained some idea of what cats can and cannot learn... all long before understanding the process which produced the cat.

More abstractly: I generally agree that agenty behavior (e.g. a cat) seems unlikely to show up without some learning process to produce it (e.g. evolution). But it still seems possible to talk about agenty things without understanding - or even knowing anything about - the process which produced the agenty things. Indeed, it seems easier to talk about agenty things than to talk about the processes which produce them. This includes agenty things with pretty limited learning capabilities, for which the improving-over-time perspective doesn't work very well - cats can learn a bit, but they're finite and have pretty limited capacity.

Furthermore, one-shot (or at least finite) agency seems like it better describes the sort of things I mostly care about when I think about "agents" - e.g. cats. I want to be able to talk about cats as agents, in and of themselves, despite the cats not living indefinitely or converging to any sort of "optimal" behavior over long time spans or anything like that. I care about evolution mainly insofar as it lends insights into cats and other organisms - i.e., I care about long-term learning processes mainly insofar as it lends insights into finite agents. Or, in the language of subsystem alignment, I care about the outer optimization process mainly insofar as it lends insight into the mesa-optimizers (which are likely to be more one-shot-y, or at least finite). So it feels like we need a theory of one-shot agency just to define the sorts of things we want our theory of dynamic agency to talk about, especially from a mesa-optimizers perspective.

Conversely, if we already had a theory of what effective one-shot agents look like, then it would be a lot easier to ask "what sort of processes produce these kinds of systems"?

I agree that if a point can be addressed or explored in a static framework, it can be easier to do that first rather than going to the fully dynamic picture.

On the other hand, I think your discussion of the cat overstates the case. Your own analysis of the decision theory of a single-celled organism (ie the perspective you've described to me in person) compares it to gradient descent, rather than expected utility maximization. This is a fuzzy area, and certainly doesn't achieve all the things I mentioned, but doesn't that seem more "dynamic" than "static"? Today's deep learning systems aren't as generally intelligent as cats, but it seems like the gap exists more within learning theory than static decision theory.

More importantly, although the static picture can be easier to analyse, it has also been much more discussed for that reason. The low-hanging fruits are more likely to be in the more neglected direction. Perhaps the more difficult parts of the dynamic picture (perhaps robust delegation) can be put aside while still approaching things from a learning-theoretic perspective.

I may have said something along the lines of the static picture already being essentially solved by reflective oracles (the problems with reflective oracles being typical of the problems with the static approach). From my perspective, it seems like time to move on to the dynamic picture in order to make progress. But that's overstating things a bit -- I am interested in better static pictures, particularly when they are suggestive of dynamic pictures, such as COEDT.

In any case, I have no sense that you're making a mistake by looking at abstraction in the static setting. If you have traction, you should continue in that direction. I generally suspect that the abstraction angle is valuable, whether static or dynamic.

Still, I do suspect we have material disagreements remaining, not only disagreements in research emphasis.

Toward the end of your comment, you speak of the one-shot picture and the dynamic picture as if the two are mutually exclusive, rather than just easy mode vs hard mode as you mention early on. A learning picture still admits static snapshots. Also, cats don't get everything right on the first try.

Still, I admit: a weakness of an asymptotic learning picture is that it seems to eschew finite problems; to such an extent that at times I've said the dynamic learning picture serves as the easy version of the problem, with one-shot rationality being the hard case to consider later. Toy static pictures -- such as the one provided by reflective oracles -- give an idealized static rationality, using unbounded processing power and logical omniscience. A real static picture -- perhaps the picture you are seeking -- would involve bounded rationality, including both logical non-omniscience and regular physical non-omniscience. A static-rationality analysis of logical non-omniscience has seemed quite challenging so far. Nice versions of self-reference and other challenges to embedded world-models such as those you mention seem to require conveniences such as reflective oracles. Nothing resembling thin priors has come along to allow for eventual logical coherence while resembling bayesian static rationality (rather than logical-induction-like dynamic rationality). And as for the empirical uncertainty, we would really like to get some guarantees about avoiding catastrophic mistakes (though, perhaps, this isn't within your scope).


Wow, this is a really fascinating comment.

Hadn’t seen the dice example, is it from Jaynes? (I don’t yet see why you’re better off randomising)

Well, one way to forget the sum is to generate random pairs of dice for each possible sum and replace one of them with your actual pair. For example, if your dice came up (3 5), you can rewrite your memory with something like "the result was one of (1 1) (2 1) (3 1) (4 1) (4 2) (2 5) (3 5) (4 5) (6 4) (6 5) (6 6)". Is there a simpler way?

Obviously if I forget the sum, I just want to know the difference die1-die2? The only problem is that the signed difference looks like a uniform distribution with width dependent on the sum - the signed difference can range from 11 possibilities (-5 to 5) down to 1 (0).

So what I think you do is you put all the differences onto the same scale by constructing a "unitless difference," which will actually be defined as a uniform distribution.

Rather than having the difference be a single number in a chunk of the number line that changes in size, you construct a big set of ordered points of fixed size equal to the least common multiple of the number of possible differences for all sums. If you think of a difference not as a number, but as a uniform distribution on the set of possible differences, then you can just "scale up" this distribution from its set of variable size into the big set of constant size, and sample from this distribution to forget the sum but remember the most information about the difference.

EDIT: I shouldn't do math while tired.

Note that the agent should rewrite its memory with a distribution, not just a list of tuples - e.g. {(1 1): 1/36, (2 1): 2/36, ...}. That way the "posterior" distribution on the sum will match the prior distribution on the sum.

That said, this is basically correct. It matches the answer(s) I got, and is more elegant.

Yeah. I guess I was assuming that the agent knows the list of tuples and also knows that they came from the procedure I described; the distribution follows from that :-)
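
Here's a minimal sketch of that procedure, as I read the thread above (the weighting follows the {(1 1): 1/36, (2 1): 2/36, ...} suggestion): keep the actual pair as the representative for its own sum, sample a fake representative for every other sum, and weight each representative by the prior probability of its sum. The marginal over sums then matches the prior, so the sum really is forgotten.

```python
import random
from fractions import Fraction
from collections import defaultdict

pairs = [(i, j) for i in range(1, 7) for j in range(1, 7)]
prior_sum = defaultdict(Fraction)
for p in pairs:
    prior_sum[sum(p)] += Fraction(1, 36)

def forget_sum(actual_pair):
    """'Posterior' over pairs whose marginal on the sum equals the prior."""
    posterior = {}
    for s, prob in prior_sum.items():
        if s == sum(actual_pair):
            rep = actual_pair   # keep the real pair for the real sum
        else:
            rep = random.choice([p for p in pairs if sum(p) == s])   # fake pair
        posterior[rep] = prob
    return posterior

post = forget_sum((3, 5))
# The marginal on the sum matches the prior, so the sum has been forgotten...
assert {sum(p): pr for p, pr in post.items()} == dict(prior_sum)
# ...while, conditional on the true sum, the memory still pins down the exact pair.
print(post[(3, 5)])   # 5/36
```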

Asya's summary for the Alignment Newsletter:

<@Embedded agency problems@>(@Embedded Agents@) are a class of theoretical problems that arise as soon as an agent is part of the environment it is interacting with and modeling, rather than having a clearly-defined and separated relationship. This post makes the argument that before we can solve embedded agency problems, we first need to develop a theory of _abstraction_. _Abstraction_ refers to the problem of throwing out some information about a system while still being able to make predictions about it. This problem can also be referred to as the problem of constructing a map for some territory.
The post argues that abstraction is key for embedded agency problems because the underlying challenge of embedded world models is that the agent (the map) is smaller than the environment it is modeling (the territory), and so inherently has to throw some information away.
Some simple questions around abstraction that we might want to answer include:
- Given a map-making process, characterize the queries whose answers the map can reliably predict.
- Given some representation of the map-territory correspondence, translate queries from the territory-representation to the map-representation and vice versa.
- Given a territory, characterize classes of queries which can be reliably answered using a map much smaller than the territory itself.
- Given a territory and a class of queries, construct a map which throws out as much information as possible while still allowing accurate prediction over the query class.
The post argues that once we create the simple theory, we will have a natural way of looking at more challenging problems with embedded agency, like the problem of self-referential maps, the problem of other map-makers, and the problem of self-reasoning that arises when the produced map includes an abstraction of the map-making process itself.

Asya's opinion:

My impression is that embedded agency problems as a class of problems are very young, extremely entangled, and characterized by a lot of confusion. I am enthusiastic about attempts to decrease confusion and intuitively, abstraction does feel like a key component to doing that.
That being said, my guess is that it’s difficult to predictably suggest the most promising research directions in a space that’s so entangled. For example, one thread in the comments of this post discusses the fact that this theory of abstraction as presented looks at “one-shot” agency where the system takes in some data once and then outputs a decision, rather than “dynamic” agency where a system takes in data and outputs decisions repeatedly over time. Abram Demski argues that the “dynamic” nature of embedded agency is a central part of the problem and that it may be more valuable and neglected to put research emphasis there.

A side-note:

Given a territory and a class of queries, construct a map which throws out as much information as possible while still allowing accurate prediction over the query class.

Can't remember the specific reference but: Imperfect-information game theory has some research on abstractions. Naturally, one object of interest is "optimal" abstractions --- i.e., ones that are as small as possible for given accuracy, or as accurate as possible for given size. However, there are typically some negative results, stating that getting (near-) optimal abstractions is at least as expensive as finding the (near-) optimal solution of the full game. Intuitively, I would expect this to be a recurring theme for abstractions in general.

The implication of this is that all the goals should implicitly have the caveat that the maps have to be "not-too-expensive to construct". (This is intended to be a side-note, not advocacy to change the formulation. The one you have there is accessible and memorable :-).)

Thanks for the pointer, sounds both relevant and useful. I'll definitely look into it.