Recent Discussion

This post is not-very-distilled and doesn’t contain much background; it’s intended for people who already have the context of at least these four posts. I’m putting it up mainly as a reference for people who might want to work directly on the math of natural abstractions, and as a technical reference post.

There are various hints that, in most real-world cases, the distribution of low-level state given high-level natural abstractions should take the form of a maximum entropy distribution, in which:

  • The “features” are sums over local terms, and
  • The high-level variables are (isomorphic to) the Lagrange multipliers

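More formally: we have a low-level causal model (aka Bayes net) $P[X^L] = \prod_i P[X^L_i | X^L_{pa(i)}]$. Given the high-level variables $X^H$, the distribution of low-level variable values should look like

$P[X^L | X^H] = \frac{1}{Z(X^H)} P[X^L] \, e^{\lambda^T(X^H) \sum_i f_i(X^L_i, X^L_{pa(i)})}$

… i.e. the maximum-entropy distribution subject to constraints of the form $E[\sum_i f_i(X^L_i, X^L_{pa(i)}) | X^H] = \mu(X^H)$. (Note: $\lambda$, $f_i$, and $\mu$ are all vector-valued.)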
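
To make the shape of such a distribution concrete, here is a minimal numerical sketch. Everything in it is an assumption for illustration: a three-node binary chain stands in for the low-level Bayes net, a single scalar feature (a sum of local "child equals parent" terms) stands in for the vector-valued features, and `lam` plays the role of the Lagrange multiplier. It tilts the prior by the exponential of `lam` times the feature sum and shows that the expected feature moves with `lam`.

```python
import itertools
import math

# Hypothetical toy model: a 3-node binary chain X0 -> X1 -> X2.
P_X0 = {0: 0.5, 1: 0.5}
P_STAY = 0.8  # assumed probability that a node copies its parent


def prior(x):
    """Low-level prior as a product of local conditionals (Bayes net factorization)."""
    p = P_X0[x[0]]
    for i in range(1, len(x)):
        p *= P_STAY if x[i] == x[i - 1] else 1.0 - P_STAY
    return p


def feature_sum(x):
    """Sum over local terms; each term is 1 when a node equals its parent."""
    return sum(1.0 for i in range(1, len(x)) if x[i] == x[i - 1])


def tilted(lam):
    """Return the distribution proportional to prior(x) * exp(lam * feature_sum(x))."""
    states = list(itertools.product([0, 1], repeat=3))
    weights = {x: prior(x) * math.exp(lam * feature_sum(x)) for x in states}
    z = sum(weights.values())  # normalizer Z
    return {x: w / z for x, w in weights.items()}


def expected_feature(dist):
    """Expected value of the feature sum under dist."""
    return sum(p * feature_sum(x) for x, p in dist.items())


for lam in (-1.0, 0.0, 1.0):
    print(lam, round(expected_feature(tilted(lam)), 4))
```

At `lam = 0` the tilted distribution is just the prior; each value of `lam` corresponds to one value of the expected feature, which is the sense in which the multiplier and the constraint value carry the same information in this toy case.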


Recently, I had a conversation with someone from a math background, asking how they could get into AI safety research. Based on my own path from mathematics to AI alignment, I recommended the following sources. It may prove useful to others contemplating a similar change in career:

  • Superintelligence by Nick Bostrom. It condenses all the main arguments for the power and the risk of AI, and gives a framework in which to think of the challenges and possibilities.
  • Sutton and Barto's Book: Reinforcement Learning: An Introduction. This gives the very basics of what ML researchers actually do all day, and is important for understanding more advanced concepts. It gives (most of) the vocabulary to understand what ML and AI papers are talking about.
  • Gödel without too many tears. This is

We have a computational graph (aka circuit aka causal model) representing an agent and its environment. We’ve chosen a cut through the graph to separate “agent” from “environment” - i.e. a Cartesian boundary. Arrows from environment to agent through the boundary are “observations”; arrows from agent to environment are “actions”.
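
As a rough sketch of this setup, the snippet below takes a small directed graph and a chosen set of "agent" nodes, then labels each edge that crosses the cut. The graph, the node names, and the helper `classify_edges` are all hypothetical and exist only for illustration.

```python
# Hypothetical toy graph; node names are illustrative only.
EDGES = [
    ("sun", "plant"),
    ("plant", "eye"),           # crosses the boundary inward
    ("eye", "brain"),
    ("brain", "hand"),
    ("hand", "watering_can"),   # crosses the boundary outward
    ("watering_can", "soil"),
]
AGENT = {"eye", "brain", "hand"}  # the chosen cut: nodes inside the boundary


def classify_edges(edges, agent_nodes):
    """Split boundary-crossing edges into observations and actions."""
    observations, actions = [], []
    for src, dst in edges:
        if src not in agent_nodes and dst in agent_nodes:
            observations.append((src, dst))  # environment -> agent
        elif src in agent_nodes and dst not in agent_nodes:
            actions.append((src, dst))       # agent -> environment
    return observations, actions


observations, actions = classify_edges(EDGES, AGENT)
print("observations:", observations)
print("actions:", actions)
```

Note that the labels depend entirely on which nodes are placed in `AGENT`; moving the cut changes which arrows count as observations or actions.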


Presumably the agent is arranged so that the “actions” optimize something. The actions “steer” some nodes in the system toward particular values.

Let’s highlight a few problems with this as a generic agent model…

Microscopic Interactions

My human body interfaces with the world via the entire surface area of my skin, including molecules in my hair randomly bumping into air molecules. All of those tiny interactions are arrows going through the supposed “Cartesian boundary” around my body. These don’t intuitively seem like “actions”...

This argument does not seem to me like it captures the reason a rock is not an optimiser. I would hand wave and say something like: "If you place a human into a messy room, you'll sometimes find that the room is cleaner afterwards. If you place a kid in front of a bowl of sweets, you'll soon find the sweets gone. These and other examples are pretty surprising state transitions that would be highly unlikely in the absence of those humans you added. And when we say that something is an optimiser, we mean that it is such that, when it interfaces with other systems, it tends to make a certain narrow slice of state space much more likely for those systems to end up in." The rock seems to me to have very few such effects. The probability of state transitions of my room is roughly the same with or without a rock in a corner of it. And that's why I don't think of it as an optimiser.

Exactly! That's an optimization-at-a-distance style intuition. The optimizer (e.g. human) optimizes things outside of itself, at some distance from itself.

A rock can arguably be interpreted as optimizing itself, but that's not an interesting kind of "optimization", and the rock doesn't optimize anything outside itself. Throw it in a room, the room stays basically the same.
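
A toy way to put numbers on this intuition, under loudly stated assumptions: the room's "messiness" is a five-state Markov chain, a rock is assumed to leave the transition probabilities unchanged, and a human is assumed to tidy more often than they make a mess. The probabilities are made up for illustration.

```python
N = 5  # hypothetical messiness levels 0 (tidy) .. 4 (very messy)


def step(dist, p_down, p_up):
    """One Markov step: messiness drops w.p. p_down, rises w.p. p_up, else stays."""
    new = [0.0] * N
    for s, p in enumerate(dist):
        down = p_down if s > 0 else 0.0
        up = p_up if s < N - 1 else 0.0
        new[s] += p * (1.0 - down - up)
        if s > 0:
            new[s - 1] += p * down
        if s < N - 1:
            new[s + 1] += p * up
    return new


def evolve(p_down, p_up, steps=50):
    """Start from a uniform distribution over room states and run the chain."""
    dist = [1.0 / N] * N
    for _ in range(steps):
        dist = step(dist, p_down, p_up)
    return dist


def tidy_mass(dist):
    """Probability that the room ends up in the narrow 'tidy' slice {0, 1}."""
    return dist[0] + dist[1]


baseline = evolve(0.2, 0.2)    # empty room
with_rock = evolve(0.2, 0.2)   # assumption: a rock does not change the dynamics
with_human = evolve(0.6, 0.1)  # assumption: a human tidies more than they mess

print("baseline  :", round(tidy_mass(baseline), 3))
print("with rock :", round(tidy_mass(with_rock), 3))
print("with human:", round(tidy_mass(with_human), 3))
```

In this sketch only the human concentrates probability on a narrow slice of the room's state space; the rock leaves the room's distribution where it was.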

Vladimir Nesov, 2d
Embedded agents have a spatial extent. If we use the analogy between physical spacetime and a domain of computation of environment, this offers interesting interpretations for some terms. In a domain, counterfactuals might be seen as points/events/observations that are incomparable in specialization order, that is, points that are not in each other's logical future. Via the spacetime analogy, this is the same as the points being space-like separated. This motivates calling collections of mutually counterfactual (incomparable) events logical space, in the same sense as events comparable in specialization order follow logical time. (Some other non-Frechet spaces would likely give more interesting space-like subspaces than a domain typical for program semantics.) An embedded agent extant in logical space of an environment (at a particular time) is then a collection of counterfactuals. In this view, an agent is not a specific computation, but rather a collection of possible alternative behaviors/observations/events of an environment (resulting from multiple different computations), events that are counterfactual to each other. The logical space an agent occupies comprises the behaviors/observations/events (partial-states-at-a-time) of possible environments where the agent has influence. In this view, counterfactuals are not merely phantasmal decision theory ideas developed to make sure that reality doesn't look like them, hypothetical threats that should never obtain in actuality. Instead, they are reified as equals to reality, as parts of the agent, and an agent's description is incomplete without them. This is not as obvious as with parts of a physic
G Gordon Worley III, 2d
Really liking this model. It seems to actually deal with the problem of embeddedness for agents and the fact that there is no clear boundary to draw around what we call an agent other than one that's convenient for some purpose. I've obviously got thoughts on how this is operationalizing insights about "no-self" and dependent origination, but that doesn't seem too important to get into, other than to say it gives me more reason to think this is likely to be useful.

Let’s say you’re relatively new to the field of AI alignment. You notice a certain cluster of people in the field who claim that no substantive progress is likely to be made on alignment without first solving various foundational questions of agency. These sound like a bunch of weird pseudophilosophical questions, like “what does it mean for some chunk of the world to do optimization?”, or “how does an agent model a world bigger than itself?”, or “how do we ‘point’ at things?”, or in my case “how does abstraction work?”. You feel confused about why otherwise-smart-seeming people expect these weird pseudophilosophical questions to be unavoidable for engineering aligned AI. You go look for an explainer, but all you find is bits and pieces of worldview scattered...

But what if we instead design the system so that the leaked radio signal has zero mutual information with whatever signals are passed around inside the system? Then it doesn’t matter how much optimization pressure an adversary applies, they’re not going to figure out anything about those internal signals via leaked radio.

Flat out wrong. It's quite possible for A and B to have 0 mutual information. But A and B always have mutual information conditional on some C (assuming A and B each have information). It's possible for there to be absolutely no mutual i...
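
The standard textbook instance of this phenomenon is the XOR construction, sketched below. It is only an illustration of zero mutual information coexisting with nonzero conditional mutual information, not a reconstruction of the commenter's full argument; the variable names and helper functions are assumptions of this sketch. A and B are independent fair bits and C = A XOR B.

```python
import itertools
import math
from collections import defaultdict


def mutual_information(joint):
    """I(A;B) in bits, from a dict mapping (a, b) to probability."""
    pa, pb = defaultdict(float), defaultdict(float)
    for (a, b), p in joint.items():
        pa[a] += p
        pb[b] += p
    return sum(
        p * math.log2(p / (pa[a] * pb[b]))
        for (a, b), p in joint.items()
        if p > 0
    )


def conditional_mutual_information(joint3):
    """I(A;B|C) in bits, from a dict mapping (a, b, c) to probability."""
    pc = defaultdict(float)
    for (_, _, c), p in joint3.items():
        pc[c] += p
    total = 0.0
    for c0, p_c in pc.items():
        # Distribution of (A, B) given C = c0.
        cond = {(a, b): p / p_c for (a, b, c), p in joint3.items() if c == c0}
        total += p_c * mutual_information(cond)
    return total


# A and B are independent fair bits; C is their XOR.
joint_abc = {(a, b, a ^ b): 0.25 for a, b in itertools.product([0, 1], repeat=2)}

# Marginalize out C to get the joint distribution of (A, B).
joint_ab = defaultdict(float)
for (a, b, _), p in joint_abc.items():
    joint_ab[(a, b)] += p

print(mutual_information(joint_ab))               # 0.0
print(conditional_mutual_information(joint_abc))  # 1.0
```

Unconditionally, A tells you nothing about B; once C is known, A determines B exactly, which is one full bit of conditional mutual information.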

This is a linkpost to our working paper “Towards AI Standards Addressing AI Catastrophic Risks: Actionable-Guidance and Roadmap Recommendations for the NIST AI Risk Management Framework”, which we co-authored with our UC Berkeley colleagues Jessica Newman and Brandie Nonnecke. Here are links to both Google Doc and pdf options for accessing our working paper:

  • Google Doc (56 pp, last updated 16 May 2022) 
  • pdf on Google Drive (56 pp, last updated 16 May 2022)  
  • pdf on arXiv (not available yet, planned for a later version)

We seek feedback from readers considering catastrophic risks as part of their work on AI safety and governance. It would be very helpful if you email feedback to Tony Barrett, or share a marked-up copy of the Google Doc with Tony, at

If you are providing feedback...

The observations I make here have little consequence from the point of view of solving the alignment problem. If anything, they merely highlight the essential nature of the inner alignment problem. I will reject the idea that robust alignment, in the sense described in Risks From Learned Optimization, is possible at all. And I therefore also reject the related idea of 'internalization of the base objective', i.e. I do not think it is possible for a mesa-objective to "agree" with a base-objective or for a mesa-objective function to be “adjusted towards the base objective function to the point where it is robustly aligned.” I claim that whenever a learned algorithm is performing optimization, one needs to accept that an objective which one did not explicitly design is...

If I've understood it correctly, I think this is a really important point, so thanks for writing a post about it. This post highlights that mesa objectives and base objectives are typically going to be of different "types", because the base objective will typically be designed to evaluate things in the world as humans understand it (or as modelled by the formal training setup) whereas the mesa objective will be evaluating things in the AI's world model (or if it doesn't really have a world model, then more local things like actions themselves as opposed to...
