We have a computational graph (aka circuit aka causal model) representing an agent and its environment. We’ve chosen a cut through the graph to separate “agent” from “environment” - i.e. a Cartesian boundary. Arrows from environment to agent through the boundary are “observations”; arrows from agent to environment are “actions”.
Presumably the agent is arranged so that the “actions” optimize something. The actions “steer” some nodes in the system toward particular values.
Let’s highlight a few problems with this as a generic agent model…
My human body interfaces with the world via the entire surface area of my skin, including molecules in my hair randomly bumping into air molecules. All of those tiny interactions are arrows going through the supposed “Cartesian boundary” around my body. These don’t intuitively seem like “actions” or “observations”, at least not beyond some high-level observations of temperature and pressure.
In general, low-level boundaries will have lots of tiny interactions crossing them which don’t conceptually seem like “actions” or “observations”.
When I’m driving, I often identify with the car rather than with my body. Or if I lose a limb, I stop identifying with the lost limb. (Same goes for using the toilet - I’ve heard that it’s quite emotionally stressful for children during potty training to throw away something which came from their physical body, because they still identify with it.)
In general, it’s ambiguous what Cartesian boundary to use; our conceptual boundaries around an “agent” don’t seem to correspond perfectly to any particular physical surface.
I could draw a supposed “Cartesian boundary” around a rock, and declare all the interactions between the rock and its environment to be “actions” and “observations”. If someone asks what the rock is optimizing, I’ll say “the actions” - i.e. the rock “wants” to do whatever it is that the rock in fact does.
In general, we intuitively conceive of “agents” as optimizers in some nontrivial sense. Optimizing actions doesn’t cut it; we generally don’t think of something as an agent unless it’s optimizing something out in the environment away from itself.
Let’s solve all of these problems in one fell swoop.
We’ll start with the rock problem. One natural answer is to declare that we’re only interested in agents which optimize things “far away” from themselves. What does that mean? Well, as long as we’re already representing the world as a computational DAG, we might as well say that two chunks of our computation DAG are “far apart” when there are many intermediating layers between them. Like this:
If you’ve read the Telephone Theorem post, it’s the same idea.
For instance, if I’m planning a party, then the actions I take now are far away in time (and probably also space) from the party they’re optimizing. The “intermediate layers” might be snapshots of the universe-state at each time between the actions and the party. (... or they might be something else; there are usually many different ways to draw intermediate layers between far-apart things.)
This applies surprisingly well even in situations like reinforcement learning, where we don’t typically think of the objective as “far away” from the agent. If I'm a reinforcement learner optimizing for some reward I’ll receive later, that later reward is still typically far away from my current actions. My actions impact the reward via some complicated causal path through the environment, acting through many intermediate layers.
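As a rough sketch of what "many intermediating layers" could mean operationally, one option is to count the nodes strictly between two chunks of the DAG along the shortest directed path. This is only an illustrative proxy, not a definition from the post: the function `layers_between`, the graph encoding (a dict from node to children), and the toy node names are all assumptions made up for the example.

```python
from collections import deque

def layers_between(dag, sources, targets):
    """Count intermediate nodes on the shortest directed path from any
    source to any target, using breadth-first search.

    dag: dict mapping each node to a list of its children (hypothetical encoding).
    Returns None when no target is reachable from the sources.
    """
    targets = set(targets)
    dist = {s: 0 for s in sources}
    queue = deque(sources)
    while queue:
        node = queue.popleft()
        if node in targets:
            # A path with d edges has d - 1 nodes strictly between its ends.
            return max(dist[node] - 1, 0)
        for child in dag.get(node, []):
            if child not in dist:
                dist[child] = dist[node] + 1
                queue.append(child)
    return None

# Hypothetical toy graph: an action reaches the party only through three
# intermediate states, while the rock only influences its own motion.
toy_dag = {
    "action": ["state_1"],
    "state_1": ["state_2"],
    "state_2": ["state_3"],
    "state_3": ["party"],
    "rock": ["rock_motion"],
}

print(layers_between(toy_dag, ["action"], ["party"]))      # far apart
print(layers_between(toy_dag, ["rock"], ["rock_motion"]))  # adjacent
```

Shortest path length is a conservative choice here: if even the shortest path passes through several nodes, then every causal route between the two chunks does.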
So we’ve ruled out agents just “optimizing” their own actions. How does this solve the other two problems?
We’re using the same kind of model and the same notion of “far apart” as the Telephone Theorem, so we can carry that theorem over. The main takeaway is that far apart things interact only via a typically-relatively-small “abstract summary”. This summary consists of the information which is arbitrarily well conserved as it propagates through the intermediate layers.
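A toy calculation can illustrate the "only well-conserved information survives" takeaway. The setup below is an assumption chosen for simplicity, not the Telephone Theorem's actual statement or proof: each layer carries two independent uniform bits, one copied perfectly to the next layer and one flipped with probability `p_noisy` at every step. The function names are hypothetical.

```python
import math

def binary_entropy(q):
    """Entropy in bits of a coin that lands heads with probability q."""
    if q in (0.0, 1.0):
        return 0.0
    return -q * math.log2(q) - (1.0 - q) * math.log2(1.0 - q)

def effective_flip_prob(p, n_layers):
    """Net flip probability after composing n_layers binary symmetric
    channels, each with flip probability p."""
    return 0.5 * (1.0 - (1.0 - 2.0 * p) ** n_layers)

def surviving_information(p_noisy, n_layers):
    """Mutual information (bits) between layer 0 and layer n_layers for one
    perfectly conserved bit plus one bit that is noisy at every layer."""
    conserved_bits = 1.0  # copied exactly, so it never degrades
    noisy_bits = 1.0 - binary_entropy(effective_flip_prob(p_noisy, n_layers))
    return conserved_bits + noisy_bits

for n in (0, 1, 5, 200):
    print(n, round(surviving_information(0.1, n), 4))
```

As the number of layers grows, the total approaches one bit: the conserved bit plays the role of the "abstract summary", and the noisy bit is the kind of low-level detail that washes out.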
Because the agent only interacts with the far away things-it’s-optimizing via a relatively-small summary, it’s natural to define the “actions” and “observations” as the contents of the summary flowing in either direction, rather than all the low-level interactions flowing through the agent’s supposed “Cartesian boundary”. That solves the microscopic interactions problem: all the random bumping between my hair/skin and air molecules mostly doesn’t impact things far away, except via a few summary variables like temperature and pressure.
This redefinition of “actions” and “observations” also makes the Cartesian boundary flexible. The Telephone Theorem says that the abstract summary consists of information which is arbitrarily well conserved as it propagates through the intermediate layers. So, the summary isn’t very sensitive to which layer we declare to be “the Cartesian boundary”; we can move the boundary around quite a bit without changing the abstract “agent” we’re talking about. (Though obviously if we move the Cartesian boundary to some totally different part of the world, that may change what “agent” we’re talking about.) If we want, we could even stop thinking of the boundary as localized to a particular cut through the graph at all.
When Adam Shimi first suggested to me a couple years ago that “optimization far away” might be important somehow, one counterargument I raised was dynamic programming (DP): if the agent is optimizing an expected utility function over something far away, then we can use DP to propagate the expected utility function back through the intermediate layers to find an equivalent utility function over the agent’s actions:
u′(A) = E[u(X) | do(A)]
This isn’t actually a problem, though. It says that optimization far away is equivalent to some optimization nearby. But the reverse does not necessarily hold: optimization nearby is not necessarily equivalent to some optimization far away. This makes sense: optimization nearby is a trivial condition which matches basically any system, and therefore will match the interesting cases as well as the uninteresting cases.
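To make the DP step concrete, here is a minimal sketch of computing u′(A) = E[u(X) | do(A)] under strong simplifying assumptions: the layers form a simple chain starting at the action, each layer takes finitely many values, and each step is given by a row-stochastic transition matrix. The function name and the numbers are made up for illustration.

```python
def backpropagate_utility(utility, transitions):
    """Propagate a utility over the far-away variable X back to the actions.

    utility: list with u(x) for each value x of the final layer.
    transitions: list of row-stochastic matrices ordered from the action
        outward, where transitions[k][i][j] is the probability that the next
        layer takes value j given that the current layer takes value i.
    Returns a list with u'(a) = E[u(X) | do(a)] for each action value a.
    """
    values = list(utility)
    # Work backward from the target: each step takes an expectation over
    # the next layer, which is the usual DP / Bellman-style backup.
    for matrix in reversed(transitions):
        values = [sum(p * v for p, v in zip(row, values)) for row in matrix]
    return values

# Hypothetical two-step chain: action -> intermediate layer -> target X.
action_to_mid = [[0.9, 0.1], [0.2, 0.8]]
mid_to_target = [[0.7, 0.3], [0.4, 0.6]]
u_far = [1.0, 0.0]  # utility is 1 when X takes its first value

u_near = backpropagate_utility(u_far, [action_to_mid, mid_to_target])
print(u_near)  # approximately [0.67, 0.46]
```

The sketch only goes in one direction: given a far-away utility it produces a nearby one. An arbitrary list of numbers over actions need not arise from any utility on X in this way, which is the asymmetry the paragraph above points at.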
(Note that I haven’t actually demonstrated here that optimization at a distance is nontrivial, i.e. that some systems do optimize at a distance and others don’t; I’ve just dismissed one possible counterargument. I have several posts planned on optimization at a distance over the next few weeks, and nontriviality will be in one of them.)
I like to picture optimization at a distance like a satellite dish or phased array:
In a phased array, lots of little antennas distributed over an area are all controlled simultaneously, so that their waves add up to one big coherent wave which can propagate over a long distance. Optimization at a distance works the same way: there’s lots of little actions distributed over space/time, all controlled in such a way that their influence can add up coherently and propagate over a long distance to optimize some far-away target.
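The coherence point can be shown numerically with a deliberately crude model: each antenna contributes a unit-amplitude wave, represented as a complex number with some phase, and the far-away signal is the magnitude of their sum. The function names and the choice of 100 antennas are assumptions for the example; real phased arrays involve geometry and steering that this ignores.

```python
import cmath
import math
import random

def far_field_amplitude(phases):
    """Magnitude of the sum of unit-amplitude waves with the given phases."""
    return abs(sum(cmath.exp(1j * phase) for phase in phases))

def aligned_phases(n):
    """All emitters controlled to share the same phase."""
    return [0.0] * n

def random_phases(n, seed=0):
    """Uncoordinated emitters: independent uniformly random phases."""
    rng = random.Random(seed)
    return [rng.uniform(0.0, 2.0 * math.pi) for _ in range(n)]

n_antennas = 100
coherent = far_field_amplitude(aligned_phases(n_antennas))
incoherent = far_field_amplitude(random_phases(n_antennas))
print(coherent, incoherent)
```

With aligned phases the amplitude grows like the number of antennas, while uncoordinated phases typically give something on the order of its square root. The analogy is that many small actions only add up to a large far-away effect when they are jointly controlled.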
Embedded agents have a spatial extent. If we use the analogy between physical spacetime and a domain of computation of an environment, this offers interesting interpretations for some terms.
In a domain, counterfactuals might be seen as points/events/observations that are incomparable in specialization order, that is, points that are not in each other's logical future. Via the spacetime analogy, this is the same as the points being space-like separated. This motivates calling collections of mutually counterfactual (incomparable) events logical space, in the same sense that events comparable in specialization order follow logical time. (Some other non-Fréchet spaces would likely give more interesting space-like subspaces than a domain typical for program semantics.)
An embedded agent extant in logical space of an environment (at a particular time) is then a collection of counterfactuals. In this view, an agent is not a specific computation, but rather a collection of possible alternative behaviors/observations/events of an environment (resulting from multiple different computations), events that are counterfactual to each other. The logical space an agent occupies comprises the behaviors/observations/events (partial-states-at-a-time) of possible environments where the agent has influence.
In this view, counterfactuals are not merely phantasmal decision theory ideas developed to make sure that reality doesn't look like them, hypothetical threats that should never obtain in actuality. Instead, they are reified as equals to reality, as parts of the agent, and an agent's description is incomplete without them. This is not as obvious as with parts of a physical machine because usually each small part of a machine doesn't contain a precise description of the whole machine. With agents, an actual agent suggests quite strongly what its counterfactual behaviors would be in the adjacent possible environments, at least given a decision theory that interprets such things. So this resembles a biological organism where each cell has a blueprint for the whole body: each expression of counterfactual behavior of an embedded agent carries the whole design of the agent, sufficient to reconstruct its behavior in the other counterfactuals. But this point of view suggests that this is not a necessary property of embedded agents, and that counterfactuals might have independent content, as other parts of a larger design.
For counterfactuals in decision theory, this cashes out as imperfect ability of an agent to know what it does in counterfactuals, or as coordination with other agents that have different designs in different counterfactuals, that is, acausal trade across logical space. So there is essentially nothing new: the notion of "logical space" and of agents having extent in logical space adds up to normality. It extends the title of a singular "agent" to a collective of agents with different designs that are mutually counterfactual and are engaged in acausal trade with each other as parts of the collective. It is natural to treat different parties engaged in acausal trade as parts of a whole, since they interact and influence each other's behavior. With sufficient integration, it becomes more central to call the whole collective "an agent" instead of privileging views that only focus on one part (counterfactual) at a time.
Logical space is an unusual notion of counterfactuals, because different points of a logical space can have a common logical future. That is, different counterfactuals can contribute to the same future logical event and be in that event's past. This is not surprising given acausal trade and predictors that ask what a given agent/computation does in multiple counterfactual situations. But it usefully runs counter to the impression that counterfactuals necessarily and irrevocably diverge from each other, embedding a mutual contradiction that prevents them from ever being reunited in a single possibility.
If someone asks what the rock is optimizing, I’ll say “the actions” - i.e. the rock “wants” to do whatever it is that the rock in fact does.
This argument does not seem to me like it captures the reason a rock is not an optimiser?
I would hand wave and say something like:
"If you place a human into a messy room, you'll sometimes find that the room is cleaner afterwards. If you place a kid in front of a bowl of sweets, you'll soon find the sweets gone. These and other examples are pretty surprising state transitions, that would be highly unlikely in the absence of those humans you added. And when we say that something is an optimiser, we mean that it is such that, when it interfaces with other systems, it tends to make a certain narrow slice of state space much more likely for those systems to end up in."
The rock seems to me to have very few such effects. The probability of state transitions of my room is roughly the same with or without a rock in a corner of it. And that's why I don't think of it as an optimiser.
Exactly! That's an optimization-at-a-distance style intuition. The optimizer (e.g. human) optimizes things outside of itself, at some distance from itself.
A rock can arguably be interpreted as optimizing itself, but that's not an interesting kind of "optimization", and the rock doesn't optimize anything outside itself. Throw it in a room, the room stays basically the same.
For instance, if I’m planning a party, then the actions I take now are far away in time (and probably also space) from the party they’re optimizing. The “intermediate layers” might be snapshots of the universe-state at each time between the actions and the party. (... or they might be something else; there are usually many different ways to draw intermediate layers between far-apart things.)
This applies surprisingly well even in situations like reinforcement learning, where we don’t typically think of the objective as “far away” from the agent. If I'm a reinforcement learner optimizing for some reward I’ll receive later, that later reward is still typically far away from my current actions. My actions impact the reward via some complicated causal path through the environment, acting through many intermediate layers.
So we’ve ruled out agents just “optimizing” their own actions. How does this solve the other two problems?
I feel like this is assuming away one of the crucial difficulties of ascribing agency and goal-directedness: lack of competence or non-optimality might make agentic behavior look non-agentic unless you already have a mechanistic interpretation. Separating a rock from a human is not really the problem; the hard case is more like separating something that acts like a chimp, but for which you have very little data and understanding, from an agent optimizing to clip you.
(Not saying that this can't be relevant to address this problem, just that currently you seem to assume the problem away)
Hmm. I like the idea of redefining action as the consequences of one's action that are observable "far away"; it nicely rederives the observation-action loop through interaction with far away variables. That being said, I'm unsure whether defining the observations as the summary statistics themselves is unproblematic. I have one intuition that tells me that this is all you can observe anyway, so it's fine; on the other hand, it looks like you're assuming that the agent has the right ontology already? I guess that can be solved by saying that the observations are of the content of the summary, but not necessarily all of it.
When Adam Shimi first suggested to me a couple years ago that “optimization far away” might be important somehow, one counterargument I raised was dynamic programming (DP): if the agent is optimizing an expected utility function over something far away, then we can use DP to propagate the expected utility function back through the intermediate layers to find an equivalent utility function over the agent’s actions:
u′(A) = E[u(X) | do(A)]
This isn’t actually a problem, though. It says that optimization far away is equivalent to some optimization nearby. But the reverse does not necessarily hold: optimization nearby is not necessarily equivalent to some optimization far away. This makes sense: optimization nearby is a trivial condition which matches basically any system, and therefore will match the interesting cases as well as the uninteresting cases.
I think I actually remember now the discussion we were having, and I recall an intuition about counting. Like, there seem to be more ways to optimize nearby than to optimize the specific part of far away, which I guess is what you're pointing at.
Really liking this model. It seems to actually deal with the problem of embeddedness for agents and the fact that there is no clear boundary to draw around what we call an agent other than one that's convenient for some purpose.
I've obviously got thoughts on how this is operationalizing insights about "no-self" and dependent origination, but that doesn't seem too important to get into, other than to say it gives me more reason to think this is likely to be useful.