Recent Discussion

We have a computational graph (aka circuit aka causal model) representing an agent and its environment. We’ve chosen a cut through the graph to separate “agent” from “environment” - i.e. a Cartesian boundary. Arrows from environment to agent through the boundary are “observations”; arrows from agent to environment are “actions”.


Presumably the agent is arranged so that the “actions” optimize something. The actions “steer” some nodes in the system toward particular values.
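To make the setup concrete, here is a minimal sketch of a graph with a chosen cut, where each arrow is labeled by how it crosses the boundary. The node names and the `classify_edges` helper are made up for illustration; nothing here is specific to any real agent.

```python
# Toy computational graph: each pair (src, dst) is one arrow.
# Node names are hypothetical, chosen only for illustration.
edges = [
    ("sun", "retina"),
    ("retina", "brain"),
    ("brain", "hand"),
    ("hand", "cup"),
    ("cup", "table"),
]

# The chosen cut: every node in this set counts as "agent",
# everything else counts as "environment".
agent_nodes = {"retina", "brain", "hand"}


def classify_edges(edges, agent_nodes):
    """Label each arrow relative to the agent/environment cut."""
    labels = {}
    for src, dst in edges:
        src_in = src in agent_nodes
        dst_in = dst in agent_nodes
        if src_in and dst_in:
            labels[(src, dst)] = "internal"
        elif dst_in:
            labels[(src, dst)] = "observation"  # environment -> agent
        elif src_in:
            labels[(src, dst)] = "action"  # agent -> environment
        else:
            labels[(src, dst)] = "environment"
    return labels


labels = classify_edges(edges, agent_nodes)
for edge, label in labels.items():
    print(edge, label)
```

Moving a node into or out of `agent_nodes` changes which arrows count as observations and actions, which is the sense in which the boundary is a choice rather than a fact about the graph.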

Let’s highlight a few problems with this as a generic agent model…

Microscopic Interactions

My human body interfaces with the world via the entire surface area of my skin, including molecules in my hair randomly bumping into air molecules. All of those tiny interactions are arrows going through the supposed “Cartesian boundary” around my body. These don’t intuitively seem like “actions”...

This argument does not seem to me like it captures the reason a rock is not an optimiser? I would hand wave and say something like: "If you place a human into a messy room, you'll sometimes find that the room is cleaner afterwards. If you place a kid in front of a bowl of sweets, you'll soon find the sweets gone. These and other examples are pretty surprising state transitions, that would be highly unlikely in the absence of those humans you added. And when we say that something is an optimiser, we mean that it is such that, when it interfaces with other systems, it tends to make a certain narrow slice of state space much more likely for those systems to end up in." The rock seems to me to have very few such effects. The probability of state transitions of my room is roughly the same with or without a rock in a corner of it. And that's why I don't think of it as an optimiser.

Exactly! That's an optimization-at-a-distance style intuition. The optimizer (e.g. human) optimizes things outside of itself, at some distance from itself.

A rock can arguably be interpreted as optimizing itself, but that's not an interesting kind of "optimization", and the rock doesn't optimize anything outside itself. Throw it in a room, the room stays basically the same.

10 Vladimir Nesov 1d
Embedded agents have a spatial extent. If we use the analogy between physical spacetime and a domain of computation of environment, this offers interesting interpretations for some terms. In a domain, counterfactuals might be seen as points/events/observations that are incomparable in specialization order, that is points that are not in each other's logical future. Via the spacetime analogy, this is the same as the points being space-like separated. This motivates calling collections of counterfactual events logical space, in the same sense as events comparable in specialization order follow logical time. (Some other non-Frechet spaces would likely give more interesting space-like subspaces than a domain typical for program semantics.) An embedded agent extant in logical space of an environment (at a particular time) is then a collection of counterfactuals. In this view, an agent is not a specific computation, but rather a collection of possible alternative behaviors/observations/events of an environment (resulting from multiple different computations), events that are counterfactual to each other. The logical space an agent occupies comprises the behaviors/observations/events (partial-states-at-a-time) of possible environments where the agent has influence. In this view, counterfactuals are not merely phantasmal decision theory ideas developed to make sure that reality doesn't look like them, hypothetical threats that should never obtain in actuality. Instead, they are reified as equals to reality, as parts of the agent, and an agent's description is incomplete without them. This is not as obvious as with parts of a physical machine because usual
3 G Gordon Worley III 1d
Really liking this model. It seems to actually deal with the problem of embeddedness for agents and the fact that there is no clear boundary to draw around what we call an agent other than one that's convenient for some purpose. I've obviously got thoughts on how this is operationalizing insights about "no-self" and dependent origination, but that doesn't seem too important to get into, other than to say it gives me more reason to think this is likely to be useful.

Let’s say you’re relatively new to the field of AI alignment. You notice a certain cluster of people in the field who claim that no substantive progress is likely to be made on alignment without first solving various foundational questions of agency. These sound like a bunch of weird pseudophilosophical questions, like “what does it mean for some chunk of the world to do optimization?”, or “how does an agent model a world bigger than itself?”, or “how do we ‘point’ at things?”, or in my case “how does abstraction work?”. You feel confused about why otherwise-smart-seeming people expect these weird pseudophilosophical questions to be unavoidable for engineering aligned AI. You go look for an explainer, but all you find is bits and pieces of worldview scattered...

But what if we instead design the system so that the leaked radio signal has zero mutual information with whatever signals are passed around inside the system? Then it doesn’t matter how much optimization pressure an adversary applies, they’re not going to figure out anything about those internal signals via leaked radio.

Flat out wrong. It's quite possible for A and B to have 0 mutual information. But A and B always have mutual information conditional on some C (assuming A and B each have information). It's possible for there to be absolutely no mutual i... (read more)
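One standard way to see the objection is the XOR construction sketched below. It is an illustration chosen here, not necessarily the example the commenter had in mind: A and B are independent fair bits, so I(A;B) = 0, yet conditioning on C = A XOR B makes each bit fully determine the other. The helper names are hypothetical.

```python
from collections import Counter
from itertools import product
from math import log2


def entropy(samples):
    """Shannon entropy (in bits) of the empirical distribution of samples."""
    counts = Counter(samples)
    total = len(samples)
    return -sum(c / total * log2(c / total) for c in counts.values())


def mutual_information(xs, ys):
    """I(X;Y) = H(X) + H(Y) - H(X,Y)."""
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))


def conditional_mutual_information(xs, ys, zs):
    """I(X;Y|Z) = H(X,Z) + H(Y,Z) - H(X,Y,Z) - H(Z)."""
    return (
        entropy(list(zip(xs, zs)))
        + entropy(list(zip(ys, zs)))
        - entropy(list(zip(xs, ys, zs)))
        - entropy(zs)
    )


# Listing each of the four (a, b) outcomes once gives the exact uniform
# joint distribution of two independent fair bits.
outcomes = list(product([0, 1], repeat=2))
a = [x for x, _ in outcomes]
b = [y for _, y in outcomes]
c = [x ^ y for x, y in outcomes]  # C = A XOR B

i_ab = mutual_information(a, b)
i_ab_given_c = conditional_mutual_information(a, b, c)
print(i_ab, i_ab_given_c)
```

In this toy case the unconditional mutual information is zero while the conditional one is a full bit, so "zero mutual information" alone does not rule out leakage to someone who also knows C.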

This is a linkpost to our working paper “Towards AI Standards Addressing AI Catastrophic Risks: Actionable-Guidance and Roadmap Recommendations for the NIST AI Risk Management Framework”, which we co-authored with our UC Berkeley colleagues Jessica Newman and Brandie Nonnecke. Here are links to both Google Doc and pdf options for accessing our working paper:

  • Google Doc (56 pp, last updated 16 May 2022) 
  • pdf on Google Drive (56 pp, last updated 16 May 2022)  
  • pdf on arXiv (not available yet, planned for a later version)

We seek feedback from readers considering catastrophic risks as part of their work on AI safety and governance. It would be very helpful if you email feedback to Tony Barrett, or share a marked-up copy of the Google Doc with Tony, at

If you are providing feedback...

The observations I make here have little consequence from the point of view of solving the alignment problem. If anything, they merely highlight the essential nature of the inner alignment problem. I will reject the idea that robust alignment, in the sense described in Risks From Learned Optimization, is possible at all. And I therefore also reject the related idea of 'internalization of the base objective', i.e. I do not think it is possible for a mesa-objective to "agree" with a base-objective or for a mesa-objective function to be “adjusted towards the base objective function to the point where it is robustly aligned.” I claim that whenever a learned algorithm is performing optimization, one needs to accept that an objective which one did not explicitly design is...

If I've understood it correctly, I think this is a really important point, so thanks for writing a post about it. This post highlights that mesa objectives and base objectives are typically going to be of different "types", because the base objective will typically be designed to evaluate things in the world as humans understand it (or as modelled by the formal training setup) whereas the mesa objective will be evaluating things in the AI's world model (or if it doesn't really have a world model, then more local things like actions themselves as opposed to... (read more)

There’s been a lot of response to the Call For Distillers, so I’m experimenting with a new post format. This post is relatively short and contains only a simple mathematical argument, with none of the examples, motivation, more examples, or context which would normally make such a post readable. My hope is that someone else will write a more understandable version.

Jacob is offering a $500 bounty on a distillation.

Goal: following the usual coherence argument setup, show that if multiple decisions are each made with different input information available, then each decision maximizes expected utility given its input information.

We’ll start with the usual coherence argument setup: a system makes a bunch of choices, aiming to be Pareto-optimal across a bunch of goals (e.g. amounts of various resources). Pareto...

An update on this: sadly I underestimated how busy I would be after posting this bounty. I spent 2h reading this and Thomas' post the other day, but didn't manage to get into the headspace of evaluating the bounty (i.e. making my own interpretation of John's post, and then deciding whether Thomas' distillation captured that). So I will not be evaluating this. (Still happy to pay if someone else I trust claims Thomas' distillation was sufficient.) My apologies to John and Thomas about that.

When programming distributed systems, we always have many computations running in parallel. Our servers handle multiple requests in parallel, perform read and write operations on the database in parallel, etc.

The prototypical headaches of distributed programming involve multiple processes running in parallel, each performing multiple read/write operations on the same database fields. Maybe some database field says “foo”, and process 1 overwrites it with “bar”. Process 2 reads the field - depending on the timing, it may see either “foo” or “bar”. Then process 2 does some computation and writes another field - for instance, maybe it sees “foo” and writes {“most_recent_value”: “foo”} to a cache.  Meanwhile, process 1 overwrote “foo” with “bar”, so it also overwrites the cache with {“most_recent_value”: “bar”}. But these two processes are running...
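The foo/bar scenario can be checked by brute force. The sketch below enumerates every interleaving of the two processes' operations that preserves each process's own order, then records the final (field, cache) pair. The operation names and the `interleavings` and `run` helpers are invented for this illustration, and the cache is modeled as a single value rather than a JSON object.

```python
def interleavings(a, b):
    """Yield every merge of lists a and b that keeps each list's own order."""
    if not a:
        yield list(b)
        return
    if not b:
        yield list(a)
        return
    for rest in interleavings(a[1:], b):
        yield [a[0]] + rest
    for rest in interleavings(a, b[1:]):
        yield [b[0]] + rest


def run(schedule):
    """Execute one schedule and return the final (field, cache) values."""
    state = {"field": "foo", "cache": None, "p2_seen": None}
    for op in schedule:
        if op == "p1_write_field":
            state["field"] = "bar"
        elif op == "p1_write_cache":
            state["cache"] = "bar"
        elif op == "p2_read_field":
            state["p2_seen"] = state["field"]
        elif op == "p2_write_cache":
            state["cache"] = state["p2_seen"]
    return state["field"], state["cache"]


p1 = ["p1_write_field", "p1_write_cache"]
p2 = ["p2_read_field", "p2_write_cache"]

schedules = list(interleavings(p1, p2))
final_states = {run(s) for s in schedules}
print(len(schedules), final_states)
```

Of the six schedules, exactly one leaves the cache holding the stale "foo" while the field holds "bar": process 2 reads before process 1 writes, but writes the cache after process 1 does.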

4 G Gordon Worley III 2d
For what it's worth, I think this is trying to get at the same insight as logical time but via a different path. For the curious reader, this is also the same reason we use vector clocks to build distributed systems when we can't synchronize the clocks very well. And there's something quite interesting about computation as a partial order. It might seem that this only comes up when you have a "distributed" system, but actually you need partial orders to reason about unitary programs when they are non-deterministic (any program with loops and conditionals that can't be unrolled because they depend on inputs not known before runtime is non-deterministic in this sense). For this reason, partial orders are the bread-and-butter of program verification.
3 Donald Hobson 3d
This fails if there are closed timelike curves around. There is of course a very general formalism, whereby inputs and outputs are combined into aputs. Physical laws of causality, and restrictions like running on a reversible computer are just restrictions on the subsets of aputs accepted.
5 Alex Mennen 3d
This seems related in spirit to the fact that time is only partially ordered in physics as well. You could even use special relativity to make a model for concurrency ambiguity in parallel computing: each processor is a parallel worldline, detecting and sending signals at points in spacetime that are spacelike-separated from when the other processors are doing these things. The database follows some unknown worldline, continuously broadcasts its contents, and updates its contents when it receives instructions to do so. The set of possible ways that the processors and database end up interacting should match the parallel computation model. This makes me think that intuitions about time that were developed to be consistent with special relativity should be fine to also use for computation.
1 Vladimir Nesov 2d
If you mark something like causally inescapable subsets of spacetime (not sure how this should be called), which are something like all unions of future lightcones, as open sets, then specialization preorder on spacetime points will agree with time. This topology on spacetime is non-Frechet (has nontrivial specialization preorder), while the relative topologies it gives on space-like subspaces (loci of states of the world "at a given time" in a loose sense) are Hausdorff, the standard way of giving a topology for such spaces. This seems like the most straightforward setting for treating physical time as logical time.
3 Ramana Kumar 3d
It's possible that reality is even worse than this post suggests, from the perspective of someone keen on using models with an intuitive treatment of time. I'm thinking of things like "relaxed-memory concurrency" (or "weak memory models") where there is no sequentially consistent ordering of events. The classic example is where these two programs run in parallel, with X and Y initially both holding 0, [write 1 to X; read Y into R1] || [write 1 to Y; read X into R2], and after both programs finish both R1 and R2 contain 0. What's going on here is that the level of abstraction matters: writing and reading from registers are not atomic operations, but if you thought they were you're gonna get confused if you expect sequential consistency.
  • Total ordering: there's only one possible ordering of all operations, and everyone knows it. (Or there's just one agent in a cybernetic interaction loop.)
  • Sequential consistency: everyone knows the order of their own operations, but not how they are interleaved with others' operations (as in this post).
  • Weak memory: everyone knows the order of their own operations, but others' operations may be doing stuff to shared resources that aren't compatible with any interleaving of the operations.
See e.g., [] or this blog for more [].
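As a rough check on the two-program example, the sketch below enumerates every sequentially consistent interleaving of [write 1 to X; read Y into R1] || [write 1 to Y; read X into R2] and collects the possible (R1, R2) results. The tuple encoding of operations and the `sc_outcomes` helper are assumptions made for this illustration. It models only sequential consistency, not any real hardware memory model.

```python
from itertools import permutations

# Each operation is (kind, location, register); register is None for writes.
T1 = [("w", "X", None), ("r", "Y", "R1")]
T2 = [("w", "Y", None), ("r", "X", "R2")]


def sc_outcomes(t1, t2):
    """Collect (R1, R2) over all interleavings that respect program order."""
    threads = (t1, t2)
    ops = [(t, i) for t in (0, 1) for i in range(len(threads[t]))]
    outcomes = set()
    for order in permutations(ops):
        # Skip schedules that reorder operations within a single thread.
        if any(
            order.index((t, i)) > order.index((t, i + 1))
            for t in (0, 1)
            for i in range(len(threads[t]) - 1)
        ):
            continue
        memory = {"X": 0, "Y": 0}
        regs = {}
        for t, i in order:
            kind, loc, reg = threads[t][i]
            if kind == "w":
                memory[loc] = 1
            else:
                regs[reg] = memory[loc]
        outcomes.add((regs["R1"], regs["R2"]))
    return outcomes


outcomes = sc_outcomes(T1, T2)
print(sorted(outcomes))
```

No interleaving yields R1 = R2 = 0, so observing that result means the execution is not explained by any single ordering of the four operations.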
14 Vladimir Nesov 4d
I like specialization preorder as a setting for formulating these concepts. In a topological space, point y is stronger (more specialized) than point x iff all opens containing x also contain y. If opens are thought of as propositions, and specialization order as a kind of ("logical") time, with stronger points being in the future of weaker points, then this says that propositions must be valid with respect to time (that is, we want to only allow propositions that don't get invalidated). This setting motivates thinking of points not as objects of study, but as partial observations of objects of study, their shadows that develop according to specialization preorder. If a proposition is true about some partial observation of an object (a point of the space), it remains true when it develops further (in the future, for stronger points). The best we can capture objects of study is with neighborhood filters, but the conceptual distinction suggests that even in a sober space the objects of study are not necessarily points, they are merely observed through points. This is just what Scott domains or more generally algebraic dcpos with Scott topology talk about, when we start with a poset of finite observations (about computations, the elusive objects of study), which is the specialization preorder of its Alexandrov topology, which then becomes Scott topology after soberification, adding points for partial observations that can be expressed in terms of Alexandrov opens on finite observations. Specialization order follows a computation, and opens formulate semidecidable properties. There are two different ways in which a computation is approximated: with a weaker observation/point, and with a weaker specification/proposition/open. One nice thing here is that
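A small finite example may help with the definition "y is stronger than x iff all opens containing x also contain y". The space below is an assumption chosen for illustration: three partial observations of a one-bit output (nothing seen yet, saw 0, saw 1), with the upward-closed sets as opens. The `is_stronger` helper is hypothetical.

```python
# Points: partial observations of a one-bit output.
# "" means nothing observed yet; "0" and "1" are the two possible results.
points = ["", "0", "1"]

# Opens: the upward-closed sets under "is extended by" (an Alexandrov topology).
opens = [set(), {"0"}, {"1"}, {"0", "1"}, {"", "0", "1"}]


def is_stronger(y, x, opens):
    """True iff every open containing x also contains y."""
    return all(y in u for u in opens if x in u)


# Pairs (x, y) such that y is at least as strong as x, i.e. y is in the
# "logical future" of x (every point counts as its own future).
order = {(x, y) for x in points for y in points if is_stronger(y, x, opens)}
print(sorted(order))
```

Here both "0" and "1" lie in the future of "", but neither is in the future of the other: they are incomparable, which is the sense in which they are counterfactual to each other. A proposition such as {"0"} that holds at a point keeps holding at every stronger point.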
4 Abram Demski 2d
Up to here made sense. After here I was lost. Which propositions are valid with respect to time? How can we only allow propositions which don't get invalidated (EG if we don't know yet which will and will not be), and also, why do we want that? You're saying a lot about what the "objects of study" are and aren't, but not very concretely, and I'm not getting the intuition for why this is important. I'm used to the idea that the points aren't really the objects of study in topology; the opens are the more central structure. But the important question for a proposed modeling language is how well it models what we're after. It seems like you are trying to do something similar to what cartesian frames and finite factored sets are doing, when they reconstruct time-like relationships from other (purportedly more basic) terms. Would you care to compare the reconstructions of time you're gesturing at to those provided by cartesian frames and/or finite factored sets?

Which propositions are valid with respect to time? How can we only allow propositions which don't get invalidated (EG if we don't know yet which will and will not be), and also, why do we want that?

This was just defining/motivating terms (including "validity") for this context, the technical answer is to look at the definition of specialization preorder, when it's being suggestively called "logical time". If an open is a "proposition", and a point being contained in an open is "proposition is true at that point", and a point stronger in specialization o... (read more)
