I finally got around to reading this sequence, and I really like the ideas behind these methods. This feels like someone actually trying to figure out exactly how fragile human values are. It's especially exciting because it seems like it hooks right into an existing, normal field of academia (thus making it easier to leverage their resources toward alignment).

I do have one major issue with how the takeaway is communicated, starting with the term "catastrophic". I would only use that word when the outcome of the optimization is really bad, much worse than "average" in some sense. That's in line with the idea that the AI will "use the atoms for something else", and not just leave us alone to optimize its own thing. But the theorems in this sequence don't seem to be about that:

We call this catastrophic Goodhart because the end result, in terms of , is as bad as if we hadn't conditioned at all.

Being as bad as if you hadn't optimized at all isn't very bad; it's where we started from!

I think this has almost the opposite takeaway from the intended one. I can imagine someone (say, OpenAI) reading these results and thinking something like, great! They just proved that in the worst case scenario, we do no harm. Full speed ahead!

(Of course, putting a bunch of optimization power into something and then getting no result would still be a waste of the resources put into it, which is presumably not built into . But that's still not very bad.)
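To make the "back where we started" point concrete, here is a rough simulation sketch. Everything in it is my own assumption rather than the sequence's setup: the names `value`, `error`, and `proxy`, the choice of a standard normal for the true value, and a Pareto distribution standing in for heavy-tailed error. It selects the top 0.1% of samples by proxy score and reports the average true value among them.

```python
import random

def conditional_mean_of_value(sample_error, n=200_000, top_fraction=0.001, seed=0):
    """Estimate E[value | proxy is in the top fraction], with proxy = value + error."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        value = rng.gauss(0.0, 1.0)   # assumed light-tailed true value, mean 0
        error = sample_error(rng)     # assumed error, independent of value
        pairs.append((value + error, value))
    pairs.sort(reverse=True)          # highest proxy scores first
    k = max(1, int(n * top_fraction))
    return sum(value for _, value in pairs[:k]) / k

# Light-tailed error: selecting on the proxy also selects for true value.
light = conditional_mean_of_value(lambda rng: rng.gauss(0.0, 1.0))
# Heavy-tailed error: high proxy scores are almost entirely explained by error.
heavy = conditional_mean_of_value(lambda rng: rng.paretovariate(1.5))

print(f"light-tailed error: mean value among top proxy scores = {light:.2f}")
print(f"heavy-tailed error: mean value among top proxy scores = {heavy:.2f}")
```

In the heavy-tailed case the selected average should land near the unconditional mean of 0, not far below it, which is the sense in which the outcome is "no gain" rather than "active harm".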

That said, my intuition says that these same techniques could also suss out the cases where optimizing for  pessimizes for , in the previously mentioned use-our-atoms sense.

I'll also note that I think what you're calling "Vingean agency" is a notable sub-type of optimization process that you've done a good job at analyzing here. But it's definitely not the definition of optimization or agency to me. For example, in the post you say

We perceive agency when something is better at doing something than us; we endorse some aspect of its reasoning or activity.

This doesn't feel true to me (in the carve-nature-at-its-joints sense). I think children are strongly agents, even though I do everything more competently than they do.

I have some comments on the arbitrariness of the "baseline" measure in Yudkowsky's measure of optimization.

Sometimes, I am surprised in the moment about how something looks, and I quickly update to believing there's an optimization process behind it. For example, if I climb a hill expecting to see a natural forest, and then instead see a grid of suburban houses or an industrial logging site, I'll immediately realize that there's no way this is random and instead there's an optimization process that I wasn't previously modelling. In cases like this, I think Yudkowsky's measure accurately captures the measure of optimization.

Alternatively, sometimes I'm thinking about optimization processes that I've always known are there, and I'm wondering to myself how powerful they are. For example, sometimes I'll be admiring how competent one of my friends is. To measure their competence, I can imagine what a "typical" person would do in that situation, and check the Yudkowsky measure as a diff. I can feel what you mean about arbitrarily drawing a circle around the known optimizer and then "deleting" it, but this just doesn't feel that weird to me? Like I think the way that people model the world allows them to do this kind of operation with pretty substantially meaningful results.
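Here is a minimal sketch of the measure as I understand it: the number of bits is the log of one over the baseline probability of doing at least as well as what was observed. The two baselines and the 0 to 3 scoring scale are hypothetical numbers I made up for illustration.

```python
import math

def optimization_bits(baseline, observed_score):
    """log2(1 / baseline probability of scoring at least as well as observed).

    baseline: dict mapping an outcome score to its probability under the
    "no optimizer" expectation.
    """
    mass = sum(p for score, p in baseline.items() if score >= observed_score)
    return math.log2(1 / mass)

# Hypothetical baselines for how well someone does at a task, scored 0..3.
typical_person = {0: 0.5, 1: 0.25, 2: 0.125, 3: 0.125}
skilled_peer = {0: 0.125, 1: 0.125, 2: 0.25, 3: 0.5}

print(optimization_bits(typical_person, 3))  # 3.0
print(optimization_bits(skilled_peer, 3))    # 1.0
```

The same observed outcome scores 3 bits against one baseline and 1 bit against the other, so the number only means something once the baseline is fixed.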

While it may be clear how to do this in many cases, it isn't clear in general. I suspect if we tried to write down the algorithm for doing it, it would involve an "agency detector" at some point; you have to be able to draw a circle around the agent in order to selectively forget it.

I think this is where Flint's framework was insightful. Instead of "detecting" and "deleting" the optimization process and then measuring the diff, you consider the system of every possible trajectory, measure the optimization of each (with respect to the ordering over states), take the average, and then compare your potential optimizer to this. The potential optimization process will be in that average, but it will be washed out by all the other trajectories (assuming most trajectories don't go up the ordering nearly as much; if they did, then your observed process would rightly not register as an optimizer).
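As a toy sketch of that comparison (my own construction, not anything from Flint's post): states are `(position, mode)` pairs ordered by position, mode 0 is the candidate optimizer, and the other modes just shuffle positions around. Weighting every initial state equally is an assumption on my part; it plays the role of the baseline measure over trajectories.

```python
from itertools import product

POSITIONS = range(10)   # states are ordered by position; higher counts as better
MODES = range(8)        # mode 0 is the candidate optimizer, the rest just drift

def step(state):
    x, mode = state
    if mode == 0:
        return (min(x + 1, 9), mode)   # climbs the ordering and stays at the top
    return ((x + mode) % 10, mode)     # rotates positions with no preference

def gain(initial, steps=7):
    """How far up the ordering the trajectory from `initial` ends after `steps`."""
    state = initial
    for _ in range(steps):
        state = step(state)
    return state[0] - initial[0]

# Average over every possible trajectory, the candidate's own included.
all_gains = [gain(s) for s in product(POSITIONS, MODES)]
average_gain = sum(all_gains) / len(all_gains)
candidate_gain = gain((0, 0))

print(average_gain)    # 0.525
print(candidate_gain)  # 7
```

The candidate's trajectories are part of the average, but the drifting modes net out to zero, so the candidate still stands far above it.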

(Obviously this is not helpful for e.g. looking into a neural network and figuring out whether it contains something that will powerfully optimize the world around you. But that's not what this level of the framework is for; this level is for deciding what it even means for something to powerfully optimize something around you.)

Of course, to run this comparison you need a "baseline" of a measure over every possible trajectory. But I think this is just reflecting the true nature of optimization; I think it's only meaningful relative to some other expectation.

I feel like there's a key concept that you're aiming for that isn't quite spelled out in the math.

I remember reading somewhere that there's a typically unmentioned distinction between "Bayes' theorem" and "Bayesian inference". Bayes' theorem is the statement that P(A|B) = P(B|A)P(A)/P(B), which follows from the axioms of probability theory for any A and B whatsoever. Notably, it has nothing to do with time, and it's still true even after you learn B. On the other hand, Bayesian inference is the premise that your beliefs should change in accordance with Bayes' theorem. Namely, that P_new(A) = P_old(A|B), where B is an observation. That is, when you observe something, you wholesale replace your probability space P_old with a new probability space P_new, which is calculated by applying the conditional (via Bayes' theorem).
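A small sketch of the distinction on a finite probability space. The two-coin-flip setup and the event names `A` and `B` are just illustrative assumptions. The theorem is an identity that holds inside any single distribution; inference is the separate act of swapping the distribution for a conditioned one.

```python
from fractions import Fraction

# Hypothetical prior: two fair coin flips, all four outcomes equally likely.
prior = {w: Fraction(1, 4) for w in ("HH", "HT", "TH", "TT")}

def prob(dist, event):
    return sum(p for w, p in dist.items() if w in event)

def cond(dist, a, b):
    """P(a | b) within a single distribution."""
    return prob(dist, a & b) / prob(dist, b)

def bayes_holds(dist, a, b):
    """Bayes' theorem: an identity inside one fixed distribution."""
    return cond(dist, a, b) == cond(dist, b, a) * prob(dist, a) / prob(dist, b)

def update(dist, observation):
    """Bayesian inference: replace the distribution with the conditioned one."""
    z = prob(dist, observation)
    return {w: (p / z if w in observation else Fraction(0)) for w, p in dist.items()}

A = {"HH", "HT"}         # first flip is heads
B = {"HH", "HT", "TH"}   # at least one head

posterior = update(prior, B)

print(bayes_holds(prior, A, B))      # True
print(bayes_holds(posterior, A, B))  # True: still holds after learning B
print(prob(posterior, A))            # 2/3, equal to cond(prior, A, B)
```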

And I think there's a similar thing going on with your definitions of endorsement. While trying to understand the equations, I found it easier to visualize  and  as two separate distributions on the same , where endorsement is simply a consistency condition. For belief consistency, you would just say that  endorses  on event  if .

But that isn't what you wrote; instead you wrote this thing with conditioning on a quoted thing. And of course, the thing I said is symmetrical between  and , whereas your concept of endorsement is not symmetrical. It seems like the intention is that  "learns" or "hears about" 's belief, and then  updates (in the above Bayesian inference sense) to have a new  that has the consistency condition with .

By putting  in the conditional, you're saying that it's an event on , a thing with the same type as . And it feels like that's conceptually correct, but also kind of the hard part. It's as if  is modelling  as an agent embedded into .

You might be interested in some of my open drafts about optimization;

One distinction that I pretty strongly hold as carving nature at its joints is (what I call) optimization vs agents. Optimization has no concept of a utility function, and it is just about the state going up an ordering. Agents are the thing that has a utility function, which they need for picking actions with probabilistic outcomes.

I feel very on-board with this research aesthetic.

Here are just some nit-picks/notational confusions I had while reading this:

  • The sequence , i.e., , is the computation seeded at  (or a “trajectory” in dynamical systems terminology).


  • A property  is achieved by a computation s if there exists some number of steps  such that ...

It took me a second to figure out what  referred to, partly because the first s was not rendered in LaTeX, partly because it was never shown as a function before, and partly because it looked kinda like , so I thought maybe the notation had changed.

the empty board 

I've seen  as "false" before, but I don't think it's super common, and you also previously said

a pattern is an infinite two-dimensional Boolean grid, or equivalently a function of type ℤxℤ→{true, false}

which made this feel like a switchup of notation. (Also, I think the type signature is off? The empty board  should be a function, but instead it's a set containing one symbol...)

This includes still lifes (), blinkers ()

I think if blinkers have period 2 then still lifes have to be considered to have period 1, and not 0.
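A quick check of the convention, as a sketch: if "period" means the smallest k ≥ 1 such that k steps return the pattern to itself, then a block comes out as period 1 and a blinker as period 2. (Allowing k = 0 would give every pattern period 0, since zero steps always returns the starting pattern.) The `period` helper and the coordinates are my own, not from the post.

```python
from collections import Counter

def life_step(cells):
    """One Game of Life step on a set of live (row, col) cells."""
    counts = Counter((r + dr, c + dc)
                     for r, c in cells
                     for dr in (-1, 0, 1)
                     for dc in (-1, 0, 1)
                     if (dr, dc) != (0, 0))
    return {cell for cell, n in counts.items()
            if n == 3 or (n == 2 and cell in cells)}

def period(cells, max_steps=100):
    """Smallest k >= 1 with k steps mapping the pattern to itself, else None."""
    start = set(cells)
    current = start
    for k in range(1, max_steps + 1):
        current = life_step(current)
        if current == start:
            return k
    return None

block = {(0, 0), (0, 1), (1, 0), (1, 1)}   # a still life
blinker = {(0, 0), (0, 1), (0, 2)}         # a period-2 oscillator

print(period(block))    # 1
print(period(blinker))  # 2
```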

Eater. An eater p is robust for  within any context  that contains  spaceships traveling in the direction of the eater (and nothing else on the board).

I think the true thing is a lot weaker than this; it's robust to gliders (not all spaceships) traveling along a specific diagonal with respect to the location of the eater (and possibly the glider has to have a certain phase, I'd have to check).


The basin of attraction for a pattern  and a property  is the largest context set  such that  is robust for  within .


  • Eater. Let  be an eater and  is the context set containing  spaceships moving in the direction of the eater and nothing else (in any other context, the contents of the board don't get consumed by the eater).

This is definitely not the largest context set , because there are tons of patterns that extinguish themselves.

I would especially love it if it popped out a .tex file that I could edit, since I'm very likely to be using different language on LW than I would in a fancy academic paper.