This post is not-very-distilled and doesn’t contain much background; it’s intended for people who already have the context of at least these four posts. I’m putting it up mainly as a technical reference for people who might want to work directly on the math of natural abstractions.
There are various hints that, in most real-world cases, the distribution of low-level state given high-level natural abstractions should take the form of a maximum entropy distribution, in which:
More formally: we have a low-level causal model (aka Bayes net) over variables $X_L$. Given the high-level variables $X_H$, the distribution of low-level variable values should look like

$$P[X_L \mid X_H] = \frac{1}{Z(\lambda)} \exp\big(\lambda(X_H)^T f(X_L)\big)$$

… i.e. the maximum-entropy distribution subject to constraints of the form $E[f(X_L) \mid X_H] = \mu(X_H)$. (Note: $\lambda$, $f$, and $\mu$ are all vector-valued.)
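For reference, that functional form is just the generic solution to a constrained entropy-maximization problem. Here is the standard derivation sketch (generic textbook maxent, not specific to the causal-model setup above), with Lagrange multipliers $\lambda$ for the vector-valued expectation constraints:

```latex
% Maximize H[P] = -\sum_x P(x)\log P(x) subject to E_P[f(X)] = \mu and normalization.
% Lagrangian (multipliers \lambda for the vector constraint, \nu for normalization):
\mathcal{L}[P] = -\sum_x P(x)\log P(x)
  + \lambda^T\Big(\sum_x P(x)\, f(x) - \mu\Big)
  + \nu\Big(\sum_x P(x) - 1\Big)

% Setting \partial\mathcal{L}/\partial P(x) = 0:
-\log P(x) - 1 + \lambda^T f(x) + \nu = 0

% ... which gives the exponential-family (maxent) form:
P(x) = \frac{1}{Z(\lambda)}\exp\!\big(\lambda^T f(x)\big),
\qquad
Z(\lambda) = \sum_x \exp\!\big(\lambda^T f(x)\big)
```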
This...
Recently, I had a conversation with someone from a math background who asked how they could get into AI safety research. Based on my own path from mathematics to AI alignment, I recommended the following sources. The list may prove useful to others contemplating a similar career change:
We have a computational graph (aka circuit aka causal model) representing an agent and its environment. We’ve chosen a cut through the graph to separate “agent” from “environment” - i.e. a Cartesian boundary. Arrows from environment to agent through the boundary are “observations”; arrows from agent to environment are “actions”.
Presumably the agent is arranged so that the “actions” optimize something. The actions “steer” some nodes in the system toward particular values.
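To make the setup concrete, here is a minimal sketch in Python (the graph, the node names, and the classify_boundary_edges helper are all made up for illustration, not taken from the post): pick a set of nodes as the “agent” side of the cut, and the edges crossing the cut split into observations and actions.

```python
# Toy illustration of a Cartesian boundary as a cut through a computational graph.
# The graph and node names are hypothetical, purely for illustration.

# Directed edges: (source, destination)
edges = [
    ("sunlight", "camera"),        # environment -> agent
    ("camera", "planner"),         # agent internal
    ("planner", "motor"),          # agent internal
    ("motor", "robot_arm"),        # agent -> environment
    ("robot_arm", "cup_position"), # environment internal
    ("cup_position", "camera"),    # environment -> agent
]

# The "cut": which nodes sit on the agent side of the boundary.
agent_nodes = {"camera", "planner", "motor"}

def classify_boundary_edges(edges, agent_nodes):
    """Edges crossing the boundary are observations (env -> agent)
    or actions (agent -> env); everything else stays internal."""
    observations, actions = [], []
    for src, dst in edges:
        if src not in agent_nodes and dst in agent_nodes:
            observations.append((src, dst))
        elif src in agent_nodes and dst not in agent_nodes:
            actions.append((src, dst))
    return observations, actions

obs, acts = classify_boundary_edges(edges, agent_nodes)
print("observations:", obs)  # [('sunlight', 'camera'), ('cup_position', 'camera')]
print("actions:", acts)      # [('motor', 'robot_arm')]
```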
Let’s highlight a few problems with this as a generic agent model…
My human body interfaces with the world via the entire surface area of my skin, including molecules in my hair randomly bumping into air molecules. All of those tiny interactions are arrows going through the supposed “Cartesian boundary” around my body. These don’t intuitively seem like “actions”...
Let’s say you’re relatively new to the field of AI alignment. You notice a certain cluster of people in the field who claim that no substantive progress is likely to be made on alignment without first solving various foundational questions of agency. These sound like a bunch of weird pseudophilosophical questions, like “what does it mean for some chunk of the world to do optimization?”, or “how does an agent model a world bigger than itself?”, or “how do we ‘point’ at things?”, or in my case “how does abstraction work?”. You feel confused about why otherwise-smart-seeming people expect these weird pseudophilosophical questions to be unavoidable for engineering aligned AI. You go look for an explainer, but all you find is bits and pieces of worldview scattered...
But what if we instead design the system so that the leaked radio signal has zero mutual information with whatever signals are passed around inside the system? Then it doesn’t matter how much optimization pressure an adversary applies: they’re not going to figure out anything about those internal signals via leaked radio.
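As a toy illustration of the zero-mutual-information condition (my own example, not from the post): if the leaked signal is an internal bit XORed with an independent uniform random key, the leak carries exactly zero mutual information about the internal bit.

```python
import itertools
from math import log2

# Hypothetical toy example: internal bit S, independent uniform key K,
# leaked radio signal L = S XOR K. Then I(S; L) = 0.
def mutual_information(joint):
    """joint: dict mapping (a, b) -> probability."""
    pa, pb = {}, {}
    for (a, b), p in joint.items():
        pa[a] = pa.get(a, 0) + p
        pb[b] = pb.get(b, 0) + p
    return sum(p * log2(p / (pa[a] * pb[b])) for (a, b), p in joint.items() if p > 0)

# S uniform in {0,1}, K uniform in {0,1}, independent; leak = S ^ K.
joint_SL = {}
for s, k in itertools.product([0, 1], repeat=2):
    leak = s ^ k
    joint_SL[(s, leak)] = joint_SL.get((s, leak), 0) + 0.25

print(mutual_information(joint_SL))  # 0.0 -- the leak reveals nothing about S
```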
Flat out wrong. It’s quite possible for A and B to have 0 mutual information. But A and B always have mutual information conditional on some C (assuming A and B each carry some information). It’s possible for there to be absolutely no mutual i... (read more)
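The standard illustration of the commenter’s point (my own sketch, not from the comment): take A and B to be independent uniform bits and C = A XOR B; then I(A;B) = 0 but I(A;B|C) = 1 bit.

```python
import itertools
from math import log2

# A, B independent uniform bits, C = A XOR B.
# Then I(A; B) = 0 but I(A; B | C) = 1 bit.
def entropy(dist):
    return -sum(p * log2(p) for p in dist.values() if p > 0)

def marginal(joint, idx):
    """Marginalize a joint over tuples down to the coordinates in idx."""
    out = {}
    for outcome, p in joint.items():
        key = tuple(outcome[i] for i in idx)
        out[key] = out.get(key, 0) + p
    return out

# Joint distribution over (A, B, C).
joint = {}
for a, b in itertools.product([0, 1], repeat=2):
    joint[(a, b, a ^ b)] = 0.25

H = entropy
I_AB = H(marginal(joint, [0])) + H(marginal(joint, [1])) - H(marginal(joint, [0, 1]))
# I(A; B | C) = H(A, C) + H(B, C) - H(A, B, C) - H(C)
I_AB_given_C = (H(marginal(joint, [0, 2])) + H(marginal(joint, [1, 2]))
                - H(marginal(joint, [0, 1, 2])) - H(marginal(joint, [2])))

print(I_AB)          # 0.0
print(I_AB_given_C)  # 1.0
```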
This is a linkpost to our working paper “Towards AI Standards Addressing AI Catastrophic Risks: Actionable-Guidance and Roadmap Recommendations for the NIST AI Risk Management Framework”, which we co-authored with our UC Berkeley colleagues Jessica Newman and Brandie Nonnecke. Here are links to both the Google Doc and PDF versions of our working paper:
We seek feedback from readers considering catastrophic risks as part of their work on AI safety and governance. It would be very helpful if you could email feedback to Tony Barrett, or share a marked-up copy of the Google Doc with Tony, at anthony.barrett@berkeley.edu.
If you are providing feedback...
The observations I make here have little consequence from the point of view of solving the alignment problem. If anything, they merely highlight the essential nature of the inner alignment problem. I will reject the idea that robust alignment, in the sense described in Risks From Learned Optimization, is possible at all. And I therefore also reject the related idea of 'internalization of the base objective', i.e. I do not think it is possible for a mesa-objective to "agree" with a base-objective or for a mesa-objective function to be “adjusted towards the base objective function to the point where it is robustly aligned.” I claim that whenever a learned algorithm is performing optimization, one needs to accept that an objective which one did not explicitly design is...
If I've understood it correctly, I think this is a really important point, so thanks for writing a post about it. This post highlights that mesa objectives and base objectives are typically going to be of different "types", because the base objective will typically be designed to evaluate things in the world as humans understand it (or as modelled by the formal training setup) whereas the mesa objective will be evaluating things in the AI's world model (or if it doesn't really have a world model, then more local things like actions themselves as opposed to... (read more)
Exactly! That's an optimization-at-a-distance style intuition. The optimizer (e.g. a human) optimizes things outside of itself, at some distance from itself.
A rock can arguably be interpreted as optimizing itself, but that's not an interesting kind of "optimization", and the rock doesn't optimize anything outside itself. Throw it in a room, and the room stays basically the same.