The Plan

johnswentworth

This is a high-level overview of the reasoning behind my research priorities, written as a Q&A.

What’s your plan for AI alignment?

Step 1: sort out our fundamental confusions about agency

Step 2: ambitious value learning (i.e. build an AI which correctly learns human values and optimizes for them)

Step 3: …

Step 4: profit!

… and do all that before AGI kills us all.

That sounds… awfully optimistic. Do you actually think that’s viable?

Better than a 50/50 chance of working in time.

Do you just have really long timelines?

No. My median is maybe 10-15 years, though that’s more a gut estimate based on how surprised I was over the past decade rather than a carefully-considered analysis. (I wouldn’t be shocked by another AI winter, especially on an inside view, but on an outside view the models generating that prediction have lost an awful lot of Bayes Points over the past few years.)

Mostly timelines just aren’t that relevant; they’d have to get down to around 18-24 months before I think it’s time to shift strategy a lot.

… Wat. Not relevant until we’re down to two years?!?

To be clear, I don’t expect to solve the whole problem in the next two years. Rather, I expect that even the incremental gains from partial progress on fundamental understanding will be worth far more than marginal time/effort on anything else, at least given our current state.

At this point, I think we’re mostly just fundamentally confused about agency and alignment. I expect approximately-all of the gains-to-be-had come from becoming less confused. So the optimal strategy is basically to spend as much time as possible sorting out as much of that general confusion as possible, and if the timer starts to run out, then slap something together based on the best understanding we have.

18-24 months is about how long I expect it to take to slap something together based on the best understanding we have. (Well, really I expect it to take <12 months, but planning fallacy and safety margins and time to iterate a little and all that.)

But iterative engineering is important!

In order for iterative engineering to be useful, we first need to have a strong enough understanding of what we even want to achieve in order to recognize when an iteration has brought us a step closer to the goal. No amount of A/B testing changes to our website will make our company profitable if we’re measuring the wrong metrics. I claim that, for alignment, we do not yet have a strong enough understanding for iteration to produce meaningful progress.

When I say “we’re just fundamentally confused about agency and alignment”, that’s the sort of thing I’m talking about.

To be clear: we can absolutely come up with proxy measures of alignment. The problem is that I don’t expect iteration under those proxy measures to get us meaningfully closer to aligned AGI. No reasonable amount of iterating on gliders’ flight-range will get one to the moon.

But engineering is important for advancing understanding too!

I do still expect some amount of engineering to be central for making progress on fundamental confusion. Engineering is one of the major drivers of science; failed attempts to build amplifiers drove our first decent understanding of semiconductors, for instance. But this is a very different path-to-impact than directly iterating on “alignment”, and it makes sense to optimize our efforts differently if the path-to-impact is through fundamental understanding. Just take some confusing concept which is fundamental to agency and alignment (like abstraction, or optimization, or knowledge, or …) and try to engineer anything which can robustly do something with that concept. For instance, a lot of my own work is driven by the vision of a “thermometer of abstraction”, a device capable of robustly and generalizably measuring abstractions and presenting them in a standard legible format. It’s not about directly iterating on some alignment scheme, it’s about an engineering goal which drives and grounds the theorizing and can be independently useful for something of value.

Also, the theory-practice gap is a thing, and I generally expect the majority of “understanding” work to go into crossing that gap. I consider such work a fundamental part of sorting out confusions; if the theory doesn’t work in practice, then we’re still confused. But I also expect that the theory-practice gap is only very hard to cross the first few times; once a few applications work, it gets much easier. Once the first field-effect transistor works, it’s a lot easier to come up with more neat solid-state devices, without needing to further update the theory much. That’s why it makes sense to consider the theory-practice gap a part of fundamental understanding in its own right: once we understand it well enough for a few applications, we usually understand it well enough to implement many more with much lower marginal effort.

An analogy: to go from medieval castles to skyscrapers, we don’t just iterate on stone towers; we leverage fundamental scientific advances in both materials and structural engineering. My strategy for building the tallest possible metaphorical skyscraper is to put all my effort into fundamental materials and structural science. That includes testing out structures as-needed to check that the theory actually works, but the goal there is understanding, not just making tall test-towers; tall towers might provide useful data, but they’re probably not the most useful investment until we’re near the end-goal. Most of the iteration is on e.g. metallurgy, not on tower-height directly. Most of the experimentation is on e.g. column or beam loading under controlled conditions, again not on tower-height directly. If the deadline is suddenly 18-24 months, then it’s time to slap together a building with whatever understanding is available, but hopefully we figure things out fast enough that the deadline isn’t that limiting of a constraint.

What do you mean by “fundamentally confused”?

My current best explanation of “fundamental confusion” is that we don’t have the right frames. When thinking about agency or alignment, we do not know:

What are the most important questions to ask?
What approximations work?
What do we need to pay attention to, and what can we safely ignore?
How can we break the problem/system up into subproblems/subsystems?

For all of these, we can certainly make up some answers. The problem is that we don’t have answers to these questions which seem likely to generalize well. Indeed, for most current answers to these questions, I think there are strong arguments that they will not generalize well. Maybe we have an approximation which works well for a particular class of neural networks, but we wouldn’t expect it to generalize to other kinds of agenty systems (like e.g. a bacteria), and it’s debatable whether it will even apply to future ML architectures. Maybe we know of some possible failure modes for alignment, but we don’t know which of them we need to pay attention to vs which will mostly sort themselves out, especially in future regimes/architectures which we currently can’t test. (Even more important: there’s only so much we can pay attention to at all, and we don’t know what details are safe to ignore.) Maybe we have a factorization of alignment which helps highlight some particular problems, but the factorization is known to be leaky; there are other problems which it obscures.

By contrast, consider putting new satellites into orbit. At this point, we generally know what the key subproblems are, what approximations we can make, what to pay attention to, what questions to ask. Most importantly, we are fairly confident that our framing for satellite delivery will generalize to new missions and applications, at least in the near-to-medium-term future. When someone needs to put a new satellite in orbit, it’s not like the whole field needs to worry about their frames failing to generalize.

(Note: there’s probably aspects of “fundamental confusion” which this explanation doesn’t capture, but I don’t have a better explanation right now.)

What are we fundamentally confused about?

We’ve already talked about one example: I think we currently do not understand alignment well enough for iterative engineering to get us meaningfully closer to solving the real problem, in the same way that iterating on glider range will not get one meaningfully closer to going to the moon. When iterating, we don’t currently know which questions to ask, we don’t know which things to pay attention to, we don’t know which subproblems are bottlenecks.

Here’s a bunch of other foundational problems/questions where I think we currently don’t know the right framing to answer them in a generalizable way:

Is an e-coli an agent? Does it have a world-model, and if so, what is it? Does it have a utility function, and if so, what is it? Does it have some other kind of “goal”?
What even are "human values"? What’s the type signature of human values?
Given two agents (with potentially completely different world models), how can I tell whether one is "trying to help" the other? What does that even mean?
Given a trained neural network, does it contain any subagents? What are their world-models, and what do they want?
Given an atomically-precise scan of a whole human brain, body, and local environment, and unlimited compute, calculate the human’s goals/wants/values, in a manner legible to an automated optimizer.
Given some physical system, identify any agents in it, and what they’re optimizing for.
Back out the learned objective of a trained neural net, and compare it to the training objective.

What kinds of “incremental progress” do you have in mind here?

As an example, I’ve spent the last couple years better understanding abstraction (and I’m currently working to push that across the theory-practice gap). It’s a necessary component for the sorts of questions I want to answer about agency in general (like those above), but in the nearer term I also expect it to provide very strong ML interpretability tools. (This is a technical thing, but if you want to see the rough idea, take a look at the Telephone Theorem post and imagine that the causal models are computational circuits for neural nets. There are still some nontrivial steps after that to adapt the theorem to neural nets, but it should convey the general idea, and it's a very simple theorem.) If I found out today that AGI was two years away, I’d probably spend a few more months making the algorithms for abstraction-extraction as efficient as I could get them, then focus mainly on applying it to interpretability.

(What I actually expect/hope is that I’ll have efficient algorithms demo-ready in the first half of next year, and then some engineers will come along and apply them to interpretability while I work on other things.)

Another example: the next major thing to sort out after abstraction will be when and why large optimized systems (e.g. neural nets or biological organisms) are so modular, and how the trained/evolved modularity corresponds to modular structures in the environment. I expect that will yield additional actionable insights into ML interpretability, and especially into what environmental/training features lead to more transparent ML models.

Ok, the incremental progress makes sense, but the full plan still sounds ridiculously optimistic with 10-15 year timelines. Given how slow progress has been on the foundational theory of agency (especially at MIRI), why do you expect it to go so much faster?

Mostly I think MIRI has been asking not-quite-the-right-questions, in not-quite-the-right-ways.

Not-quite-the-right-questions: when I look at MIRI’s past work on agent foundations, it’s clear that the motivating questions were about how to build AGI which satisfies various desiderata (e.g. stable values under self-modification, corrigibility, etc). Trying to understand agency-in-general was mostly secondary, and was not the primary goal guiding choice of research directions. One clear example of this is MIRI’s work on proof-based decision theories: absolutely nobody would choose this as the most-promising research direction for understanding the decision theory used by, say, an e-coli. But plenty of researchers over the years have thought about designing AGI using proof-based internals.

I’m not directly thinking about how to design an AGI with useful properties. I’m trying to understand agenty systems in general - be it humans, ML systems, e-coli, cats, organizations, markets, what have you. My impression is that MIRI’s agent foundations team has started to think more along these lines over time (especially since Embedded Agency came out), but I think they’re still carrying a lot of baggage.

… which brings us to MIRI tackling questions in not-quite-the-right-ways. The work on Tiling Agents is a central example here: the problem is to come up with models for agents which copy themselves, so copies of the agents “tile” across the environment. When I look at that problem through an “understand agency in general” lens, my immediate thought is “ah, this is a baseline model for evolution”. Once we have a good model for agents which “reproduce” (i.e. tile), we can talk about agents which approximately-reproduce with small perturbations (i.e. mutations) and the resulting evolutionary process. Then we can go look at how evolution actually behaves to empirically check our models.

When MIRI looks at the Tiling Agents problem, on the other hand, they set it up in terms of proof systems proving things about “successor” proof systems. Absolutely nobody would choose this as the most natural setup to talk about evolution. It’s a setup which is narrowly chosen for a particular kind of “agent” (i.e. AI with some provable guarantees) and a particular use-case (i.e. maintaining the guarantees when the AI self-modifies).

Main point: it does not look like MIRI has primarily been trying to sort out fundamental confusions about agency-in-general, at least not for very long; that’s not what they were optimizing for. Their work was much more narrow than that. And this is one of those cases where I expect the more-general theory to be both easier to find (because we can use lots of data from existing agenty systems in biology, economics and ML) and more useful (because it will more likely generalize to many use-cases and many kinds of agenty systems).

Side note: contrary to popular perception, MIRI is an extremely heterogeneous org, and the criticisms above apply to different people at different times to very different degrees. That said, I think it’s a reasonable representation of the median past work done at MIRI. Also, MIRI is still the best org at this sort of thing, which is why I’m criticizing them in particular.

What’s the roadmap?

Abstraction is the main foundational piece (more on that below). After that, the next big piece will be selection theorems, and I expect to ride that train most of the way to the destination.

Regarding selection theorems: I think most of the gap between aspects of agency which we understand in theory, and aspects of agenty systems which seem to occur consistently in practice, come from broad and robust optima. Real search systems (like gradient descent or evolution) don’t find just any optima. They find optima which are “broad”: optima whose basins fill a lot of parameter/genome space. And they find optima which are robust: small changes in the distribution of the environment don’t break them. There are informal arguments that this leads to a lot of key properties:

Modularity of the trained/evolved system (which we do indeed see in practice)
Good generalization properties
Information compression
Goal-directedness

… but we don’t have good formalizations of those arguments, and we’ll need the formalizations in order to properly leverage these properties for engineering.

Besides that, there’s also some cruft to clean up in existing theorems around agency. For instance, coherence theorems (i.e. the justifications for Bayesian expected utility maximization) have some important shortcomings, and are incomplete in important ways. And of course there’s also work to be done on the theoretical support structure for all this - for instance, sorting out good models of what optimization even means.

Why do we need formalizations for engineering?

It’s not that we need formalizations per se; it’s that we need gears-level understanding. We need to have some understanding of why e.g. modularity shows up in trained/evolved systems, what precisely makes that happen. The need for gears-level understanding, in turn, stems from the need for generalizability.

Let’s get a bit more concrete with the modularity example. We could try to build some non-gears-level (i.e. black-box) model of modularity in neural networks by training some different architectures in different regimes on different tasks and with different parameters, empirically computing some proxy measure of “modularity” for each trained network, and then fitting a curve to it. This will probably work great right up until somebody tries something well outside of the distribution on which this black-box model was fit. (Those crazy engineers are constantly pushing the damn boundaries; that’s largely why they’re so useful for driving fundamental understanding efforts.)

On the other hand, if we understand why modularity occurs in trained/evolved systems, then we can follow the gears of our reasoning even on new kinds of systems. More importantly, we can design new systems to leverage those gears without having to guess and check.

Now, gears-level understanding need not involve formal mathematics in general. But for the sorts of things I’m talking about here (like modularity or good generalization or information compression in evolved/trained systems), gears-level understanding mostly looks like mathematical proofs, or at least informal mathematical arguments. A gears-level answer to the question “Why does modularity show up in evolved systems?”, for instance, should have the same rough shape as a proof that modularity shows up in some broad class of evolved systems (for some reasonably-general formalization of “modularity” and “evolution”). It should tell us what the necessary conditions are, and explain why those conditions are necessary in such a way that we can modify the argument to handle different kinds of conditions without restarting from scratch.

Why so much focus on abstraction?

Abstraction is a common bottleneck to a whole bunch of problems in agency and alignment. Questions like:

If I have some system, what’s the right way to carve out a subsystem (which might be an “agent”, or a “world model”, or an “optimizer”, etc)? This should be robust/general enough to let us confidently say things like e.g. “there are no agents embedded in this trained neural net”.
What kinds-of-things show up in world models? For instance, is an AI likely to have internal notions of “tree” or “rock” or “car” which map to the corresponding human notions, and how closely?
How can we empirically measure high-level abstract things (like trees or agents) in the real world, in robustly generalizable ways?
To the extent that humans care about high-level abstract things like trees or cars, rather than quantum fields, how can we formalize that?
How can we translate the internal concepts used by trained ML systems into human-legible concepts, robustly enough that we won’t miss anything important (or at least can tell if we do)?

… and so forth. The important point isn’t any one of these questions; the important point is that understanding abstraction is a blocker for a whole bunch of different things. That’s what makes it an ideal target to focus on. Once it’s worked out, I expect to be unblocked not just on the above questions, but also on other important questions I haven’t even thought of yet - if it’s a blocker for many things already, it’s probably also a blocker for other things which I haven’t noticed.

If I had to pick one central reason why abstraction matters so much, it’s that we don’t currently have a robust, generalizable and legible way to measure high-level abstractions. Once we can do that, it will open up a lot of tricky conceptual questions to empirical investigation, in the same way that robust, generalizable and legible measurement tools usually open up scientific investigation of new conceptual areas.

But, like, 10-15 years?!?

A crucial load-bearing part of my model here is that agency/alignment work will undergo a phase transition in the next ~5 years. We’ll go from a basically-preparadigmatic state, where we don’t even know what questions to ask or what tools to use to answer them, to a basically-paradigmatic state, where we have a general roadmap and toolset. Or at the very least I expect to have a workable paradigm, whether anyone else jumps on board is a more open question.

There’s more than one possible path here, more than one possible future paradigm. My estimate of “~5 years” comes from eyeballing the current rate of progress, plus a gut feel for how close the frames are to where they need to be for progress to take off.

As an example of one path which I currently consider reasonably likely: abstraction provides the key tool for the phase transition. Once we can take a simulated environment or a trained model or the like, and efficiently extract all the natural abstractions from it, that changes everything. It’ll be like introducing the thermometer to the study of thermodynamics. We’ll be able to directly, empirically answer questions like “does this model know what a tree is?” or “does this model have a notion of human values?” or “is ‘human’ a natural abstraction?” or “are the agenty things in this simulation natural abstractions?” or …. (These won’t be yes/no answers, but they’ll be quantifiable in a standardized and robustly-generalizable way.) This isn’t a possibility I expect to be legibly plausible to other people right now, but it’s one I’m working towards.

Another path: once a few big selection theorems are sorted out (like modularity of evolved systems, for instance) and empirically verified, we’ll have a new class of tools for empirical study of agenty systems. Like abstraction measurement, this has the potential to open up a whole class of tricky conceptual questions to empirical investigation. Things like “what is this bacteria’s world model?” or “are there any subagents in this trained neural network?”. Again, I don’t necessarily expect this possibility to be legibly plausible to other people right now.

To be clear: not all of my “better than 50/50 chance of working in time” comes from just these two paths. I’ve sketched a fair amount of burdensome detail here, and there’s a lot of variations which lead to similar outcomes with different details, as well as entirely different paths. But the general theme is that I don’t think it will take too much longer to get to a point where we can start empirically investigating key questions in robustly-generalizable ways (rather than the ad-hoc methods used for empirical work today), and get proper feedback loops going for improving understanding.

Why ambitious value learning?

It’s the best-case outcome. I mean, c’mon, it’s got “ambitious” right there in the name.

… but why not aim for some easier strategy?

The main possibly-easier strategy for which I don’t know of any probably-fatal failure mode is to emulate/simulate humans working on the alignment problem for a long time, i.e. a Simulated Long Reflection. The main selling point of this strategy is that, assuming the emulation/simulation is accurate, it probably performs at least as well as we would actually do if we tackled the problem directly.

This is really a whole class of strategies, with many variations, most of which involve training ML systems to mimic humans. (Yes, that implies we’re already at the point where it can probably FOOM.) In general, the further the variations get from just directly simulating humans working on alignment basically the way we do now (but for longer), the more possibly-fatal failure modes show up. HCH is a central example here: for some reason a structure whose most obvious name is The Infinite Bureaucracy was originally suggested as an approximation of a Long Reflection. Look, guys, there is no way in hell that The Infinite Bureaucracy is even remotely a good approximation of a Long Reflection. Naming it “HCH” does not make it any less of an infinite bureaucracy, and yes it is going to fail in basically the same ways as real bureaucracies and for basically the same underlying reasons (except even worse, because it’s infinite).

… but the failure of variations does not necessarily mean that the basic idea is doomed. The basic idea seems basically-sound to me; the problem is implementing it in such a way that the output accurately mimics a real long reflection, while also making it happen before unfriendly AGI kills us all.

Personally, I’m still not working on that strategy, for a few main reasons:

I expect my current strategy to be more competitive. One big advantage of understanding agency in general is that we can apply that understanding to whatever ML/AI progress comes along, even if it ends up looking very different from e.g. GPT-3.
The Simulated Long Reflection strategy gets more likely to work when we have people for it to mimic who are already far down the road to solving alignment. The further, the better.
On a gut level, I just don’t expect ML to emulate humans accurately enough for a Simulated Long Reflection to work until we’ve already passed doomsday. (This is probably the cruxiest issue.)

I am generally happy that other people are working on strategies in the Simulated Long Reflection family, and hope that such work continues.

I want to disagree about MIRI.

Mostly, I think that MIRI (or at least a significant subset of MIRI) has always been primarily directed at agenty systems in general.

I want to separate agent foundations at MIRI into three eras. The Eliezer Era (2001-2013), the Benya Era (2014-2016), and the Scott Era(2017-).

The transitions between eras had an almost complete overhaul of the people involved. In spite of this, I believe that they have roughly all been directed at the same thing, and that John is directed at the same thing.

The proposed mechanism behind the similarity is not transfer, but instead because agency in general is a convergent/natural topic.

I think throughout time, there has always been a bias in the pipeline from ideas to papers towards being more about AI. I think this bias has gotten smaller over time, as the agent foundations research program both started having stable funding, and started carrying less and less of the weight of all of AI alignment on its back. (Before going through editing with Rob, I believe Embedded Agency had no mention of AI at all.)

I believe that John thinks that the Embedded Agency document is especially close to his agenda, so I will start with that. (I also think that both John and I currently have more focus on abstraction than what is in the Embedded Agency document).

Embedded Agency, more so than anything else I have done was generated using an IRL shaped research methodology. I started by taking the stuff that MIRI has already been working on, mostly the artifacts of the Benya Era, and trying to communicate the central justification that would cause one to be interested in these topics. I think that I did not invent a pattern, but instead described a preexisting pattern that originally generated the thoughts.

This is consistent with having the pattern be about agency in general, and so I could find the pattern in ideas that were generated based on agency in AI, but I think this is not the case. I think the use of proof based systems is demonstrating an extreme disregard for the substrate that the agency is made of. I claim that the reason that there was a historic focus on proof-based agents, is because it is a system that we could actually say stuff about. The fact that real life agents looked very different of the surface from proof based agents was a shortfall that most people would use to completely reject the system, but MIRI would work in it because what they really cared about was agency in general, and having another system that is easy to say things about that could be used to triangulate agency in general. If MIRI was directed at a specific type of agency, they would have rejected the proof based systems as being too different.

I think that MIRI is often misrepresented as believing in GOFAI because people look at the proof based systems and think that MIRI would only study those if they thought that is what AI might look at. I think in fact, the reason for the proof based systems is because at the time, this was the most fruitful models we had, and we were just very willing to use any lens that worked when trying to look at something very very general.

(One counterpoint here, is maybe MIRI didn't care about the substrate the agency was running on, but did have a bias towards singleton-like agency, rather than very distributed systems, I think this is slightly true. Today, I think that you need to understand the distributed systems, because realistic singleton-like agents follow many of the same rules, but it is possible that early MIRI did not believe this as much)

Most of the above was generated by looking at the Benya Era, and trying to justify that it was directed at agency in general at least/almost as much as the Scott Era, which seems like the hardest of three for me.

For the Scott Era, I have introspection. I sometimes stop thinking in general, and focus on AI. This is usually a bad idea, and doesn't generate as much fruit, and it is usually not what I do.

For the Eliezer Era, just look at the sequences.

I just looked up and reread, and tried to steel man what you originally wrote. My best steel man is that you are saying that MIRI is trying the develop a prescriptive understanding of agency, and you are trying the develop a descriptive understanding of agency. There might be something to this, but it is really complicated. One way to define agency is as the pipeline from the prescriptive to the descriptive, so I am not sure that prescriptive and descriptive agency makes sense as a distinction.

As for the research methodology, I think that we all have pretty different research methodologies. I do not think Benya and Eliezer and I have especially more in common with each other than we do with John, but I might be wrong here. I also don't think Sam and Abram and Tsvi and I have especially more in common in terms of research methodologies, except in so far as we have been practicing working together.

In fact, the thing that might be going on here is that the distinctions in topics is coming from differences in research skills. Maybe proof based systems are the most fruitful model if you are a Benya, but not if you are a Scott or a John. But this is about what is easiest for you to think about, not about a difference in the shared convergent subgoal of understanding agency in general.

I generally agree with most of this, but I think it misses the main claim I wanted to make. I totally agree that all three eras of MIRI's agent foundations research had some vision of the general theory of agency behind them, driving things. My point of disagreement is that, for most of MIRI's history, elucidating that general theory has not been the primary optimization objective.

Let's go through some examples.

The Sequences: we can definitely see Eliezer's understanding of the general theory of agency in many places, especially when talking about Bayes and utility. (Engines of Cognition is a central example.) But most of the sequences talk about things like failure modes of human cognition, how to actually change your mind, social failure modes of human cognition, etc. It sure looks like the primary optimization objective is about better human thinking, plus some general philosophical foundations, not the elucidation of the general theory of agency.

Tiling agents and proof-based decision theories: I'm on board with the use of proof-based setups to make minimal assumptions about "the substrate that the agency is made of". That's an entirely reasonable choice, and it does look like that choice was driven (in large part) by a desire for the theory to apply quite generally. But these models don't look like they were ever intended as general models of agency (I doubt they would apply nicely to e-coli); in your words, they provided "another system that is easy to say things about that could be used to triangulate agency in general". That's not necessarily a bad step on the road to general theory, but the general theory itself was not the main thing those models were doing. (Personally, I think we already have enough points to triangulate from for the time being. I think if someone were just directly, explicitly optimizing for a general theory of agency they'd probably come to that same conclusion. On the other hand, I could imagine someone very focused on self-reference barriers in particular might end up hunting for more data points, and it's plausible that someone directly optimizing for a general theory of agency would end up focused mostly on self-reference.)

Grain of truth: similar to tiling agents and proof-based decision theories, this sounds like "another system that is easy to say things about that could be used to triangulate agency in general". It does not sound like a part of the general theory of agency in its own right.

Logical induction: here we see something which probably would apply to an e-coli; it does sound like a part of a general theory of agency. (For the peanut gallery: I'm talking about LI criterion here, not the particular algorithm.) On the other hand, I wouldn't expect it to say much of interest about an e-coli beyond what we already know from older coherence theorems. It's still mainly of interest in problems of reflection. And I totally buy that reflection is an important bottleneck to the general theory of agency, but I wouldn't expect to see such a singular focus on that one bottleneck if someone were directly optimizing for a general theory of agency as their primary objective.

Embedded agents: in your own words, you "started by taking the stuff that MIRI has already been working on, mostly the artifacts of the Benya Era, and trying to communicate the central justification that would cause one to be interested in these topics". You did not start by taking all the different agenty systems you could think of, and trying to communicate the central concept that would cause one to be interested in those systems. I do think embedded agency came closer than any other example on this list to tackling the general theory of agency, but it still wasn't directly optimizing for that as the primary objective.

Going down that list (and looking at your more recent work), it definitely looks like research has been more and more directly pointed at the general theory of agency over time. But it also looks like that was not the primary optimization objective over most of MIRI's history, which is why I don't think slow progress on agent foundations to date provides strong evidence that the field is very difficult. Conversely, I've seen firsthand how tractable things are when I do optimize directly for a general theory of agency, and based on that experience I expect fairly fast progress.

(Addendum for the peanut gallery: I don't mean to bash any of this work; every single thing on the list was at least great work, and a lot of it was downright brilliant. There's a reason I said MIRI is the best org at this kind of work. My argument is just that it doesn't provide especially strong evidence that agent foundations are hard, because the work wasn't directly optimizing for the general theory of agency as its primary objective.)

Hmm, yeah, we might disagree about how much reflection(self-reference) is a central part of agency in general.

It seems plausible that it is important to distinguish between the e-coli and the human along a reflection axis (or even more so, distinguish between evolution and a human). Then maybe you are more focused on the general class of agents, and MIRI is more focused on the more specific class of "reflective agents."

Then, there is the question of whether reflection is going to be a central part of the path to (F/D)OOM.

Does this seem right to you?

To operationalize, I claim that MIRI has been directed at a close enough target to yours that you probably should update on MIRI's lack of progress at least as much as you would if MIRI was doing the same thing as you, but for half as long.

Which isn't *that* large an update. The average number of agent foundations researchers (That are public facing enough that you can update on their lack of progress) at MIRI over the last decade is like 4.

Figuring out how to factor in researcher quality is hard, but it seems plausible to me that the amount of quality adjusted attention directed at your subgoal over the next decade is significantly larger than the amount of attention directed at your subgoal over the last decade. (Which would not all come from you. I do think that Agent Foundations today is non-trivially closer to John today that Agent Foundations 5 years ago is to John today.)

It seems accurate to me to say that Agent Foundations in 2014 was more focused on reflection, which shifted towards embeddedness, and then shifted towards abstraction, and that these things all flow together in my head, and so Scott thinking about abstraction will have more reflection mixed in than John thinking about abstraction. (Indeed, I think progress on abstraction would have huge consequences on how we think about reflection.)

In case it is not obvious to people reading, I endorse John's research program. (Which can maybe be inferred by the fact that I am arguing that it is similar to my own). I think we disagree about what is the most likely path after becoming less confused about agency, but that part of both our plans is yet to be written, and I think the subgoal is enough of a simple concept that I don't think disagreements about what to do next to have a strong impact on how to do the first step.

This all sounds right.

In particular, for folks reading, I symmetrically agree with this part:

In case it is not obvious to people reading, I endorse John's research program. (Which can maybe be inferred by the fact that I am arguing that it is similar to my own). I think we disagree about what is the most likely path after becoming less confused about agency, but that part of both our plans is yet to be written, and I think the subgoal is enough of a simple concept that I don't think disagreements about what to do next to have a strong impact on how to do the first step.

... i.e. I endorse Scott's research program, mine is indeed similar, I wouldn't be the least bit surprised if we disagree about what comes next but we're pretty aligned on what to do now.

Also, I realize now that I didn't emphasize it in the OP, but a large chunk of my "50/50 chance of success" comes from other peoples' work playing a central role, and the agent foundations team at MIRI is obviously at the top of the list of people whose work is likely to fit that bill. (There's also the whole topic of producing more such people, which I didn't talk about in the OP at all, but I'm tentatively optimistic on that front too.)

That does seem right.

I do expect reflection to be a pretty central part of the path to FOOM, but I expect it to be way easier to analyze once the non-reflective foundations of agency are sorted out. There are good reasons to expect otherwise on an outside view - i.e. all the various impossibility results in logic and computing. On the other hand, my inside view says it will make more sense once we understand e.g. how abstraction produces maps smaller than the territory while still allowing robust reasoning, how counterfactuals naturally pop out of such abstractions, how that all leads to something conceptually like a Cartesian boundary, the relationship between abstract "agent" and the physical parts which comprise the agent, etc.

If I imagine what my work would look like if I started out expecting reflection to be the taut constraint, then it does seem like I'd follow a path a lot more like MIRI's. So yeah, this fits.

If I imagine what my work would look like if I started out expecting reflection to be the taut constraint, then it does seem like I'd follow a path a lot more like MIRI's. So yeah, this fits.

One thing I'm still not clear about in this thread is whether you (John) would feel that progress has been made for the theory of agency if all the problems on which MIRI were instantaneously solved. Because there's a difference between saying "this is the obvious first step if you believe reflection is the taut constraint" and "solving this problem would help significantly even if reflection wan't the taut constraint".

I expect that progress on the general theory of agency is a necessary component of solving all the problems on which MIRI has worked. So, conditional on those problems being instantly solved, I'd expect that a lot of general theory of agency came along with it. But if a "solution" to something like e.g. the Tiling Problem didn't come with a bunch of progress on more foundational general theory of agency, then I'd be very suspicious of that supposed solution, and I'd expect lots of problems to crop up when we try to apply the solution in practice.

(And this is not symmetric: I would not necessarily expect such problems in practice for some more foundational piece of general agency theory which did not already have a solution to the Tiling Problem built into it. Roughly speaking, I expect we can understand e-coli agency without fully understanding human agency, but not vice-versa.)

I agree with this asymmetry.

One thing I am confused about is whether to think of the e-coli as qualitatively different from the human. The e-coli is taking actions that can be well modeled by an optimization process searching for actions that would be good if this optimization process output them, which has some reflection in it.

It feels like it can behaviorally be well modeled this way, but is mechanistically not shaped like this, I feel like the mechanistic fact is more important, but I feel like we are much closer to having behavioral definitions of agency than mechanistic ones.

I would say the e-coli's fitness function has some kind of reflection baked into it, as does a human's fitness function. The qualitative difference between the two is that a human's own world model also has an explicit self-model in it, which is separate from the reflection baked into a human's fitness function.

After that, I'd say that deriving the (probable) mechanistic properties from the fitness functions is the name of the game.

... so yeah, I'm on basically the same page as you here.

Main response is in another comment; this is a tangential comment about prescriptive vs descriptive viewpoints on agency.

I think viewing agency as "the pipeline from the prescriptive to the descriptive" systematically misses a lot of key pieces. One central example of this: any properties of (inner/mesa) agents which stem from broad optima, rather than merely optima. (For instance, I expect that modularity of trained/evolved systems mostly comes from broad optima.) Such properties are not prescriptive principles; a narrow optimum is still an optimum. Yet we should expect such properties to apply to agenty systems in practice, including humans, other organisms, and trained ML systems.

The Kelly criterion is another good example: Abram has argued that it's not a prescriptive principle, but it is still a very strong descriptive principle for agents in suitable environments.

More importantly, I think starting from prescriptive principles makes it much easier to miss a bunch of the key foundational questions - for instance, things like "what is an optimizer?" or "what are goals?". Questions like these need some kind of answer in order for many prescriptive principles to make sense in the first place.

Also, as far as I can tell to date, there is an asymmetry: a viewpoint starting from prescriptive principles misses key properties, but I have not seen any sign of key principles which would be missed starting from a descriptive viewpoint. (I know of philosophical arguments to the contrary, e.g. this, but I do not expect such things to cash out into any significant technical problem for agency/alignment, any more than I expect arguments about solipsism to cash out into any significant technical problem.)

Curated. Not that many people pursue agendas to solve the whole alignment problem and of those even fewer write up their plan clearly. I really appreciate this kind of document and would love to see more like this. Shoutout to the back and forth between John and Scott Garrabrant about John's characterization of MIRI and its relation to John's work.

I want to flag that HCH was never intended to simulate a long reflection. It’s main purpose (which it fails in the worse case) is to let humans be epistemically competitive with the systems you’re trying to train.

I mean, we have this thread with Paul directly saying "If all goes well you can think of it like 'a human thinking a long time'", plus Ajeya and Rohin both basically agreeing with that.

Agreed, but the thing you want to use this for isn’t simulating a long reflection, which will fail (in the worst case) because HCH can’t do certain types of learning efficiently.

Once we get past Simulated Long Reflection, there's a whole pile of Things To Do With AI which strike me as Probably Doomed on general principles.

You mentioned using HCH to "let humans be epistemically competitive with the systems we're trying to train", which definitely falls in that pile. We have general principles saying that we should definitely not rely on humans being epistemically competitive with AGI; using HCH does not seem to get around those general principles at all. (Unless we buy some very strong hypotheses about humans' skill at factorizing problems, in which case we'd also expect HCH to be able to simulate something long-reflection-like.)

Trying to be epistemically competitive with AGI is, in general, one of the most difficult use-cases one can aim for. For that to be easier than simulating a long reflection, even for architectures other than HCH-emulators, we'd need some really weird assumptions.

This post is one of the LW posts a younger version of myself would have been most excited to read. Building on what I got from the Embedded Agency sequence, this post lays out a broad-strokes research plan for getting the alignment problem right. It points to areas of confusion, it lists questions we should be able to answer if we got this right, it explains the reasoning behind some of the specific tactics the author is pursuing, and it answers multiple common questions and objections. It leaves me with a feeling of "Yeah, I could pursue that too if I wanted, and I expect I could make some progress" which is a shockingly high bar for a purported plan to solve the alignment problem. I give this post +9.

(Well, really I expect it to take <12 months, but planning fallacy and safety margins and time to iterate a little and all that.)

There's also red teaming time, and lag in idea uptake/marketing, to account for. It's possible that we'll have the solution to FAI when AGI gets invented, but the inventor won't be connected to our community and won't be aware of/sold on the solution.

Edit: Don't forget to account for the actual engineering effort to implement the safety solution and integrate it with capabilities work. Ideally there is time for extensive testing and/or formal verification.

The proposed mechanism behind the similarity is not transfer, but instead because agency in general is a convergent/natural topic.

Let's go through some examples.

Then, there is the question of whether reflection is going to be a central part of the path to (F/D)OOM.

Does this seem right to you?

This all sounds right.

In particular, for folks reading, I symmetrically agree with this part:

In case it is not obvious to people reading, I endorse John's research program. (Which can maybe be inferred by the fact that I am arguing that it is similar to my own). I think we disagree about what is the most likely path after becoming less confused about agency, but that part of both our plans is yet to be written, and I think the subgoal is enough of a simple concept that I don't think disagreements about what to do next to have a strong impact on how to do the first step.

... i.e. I endorse Scott's research program, mine is indeed similar, I wouldn't be the least bit surprised if we disagree about what comes next but we're pretty aligned on what to do now.

That does seem right.

If I imagine what my work would look like if I started out expecting reflection to be the taut constraint, then it does seem like I'd follow a path a lot more like MIRI's. So yeah, this fits.

If I imagine what my work would look like if I started out expecting reflection to be the taut constraint, then it does seem like I'd follow a path a lot more like MIRI's. So yeah, this fits.

I agree with this asymmetry.

After that, I'd say that deriving the (probable) mechanistic properties from the fitness functions is the name of the game.

... so yeah, I'm on basically the same page as you here.

Main response is in another comment; this is a tangential comment about prescriptive vs descriptive viewpoints on agency.

The Kelly criterion is another good example: Abram has argued that it's not a prescriptive principle, but it is still a very strong descriptive principle for agents in suitable environments.

I mean, we have this thread with Paul directly saying "If all goes well you can think of it like 'a human thinking a long time'", plus Ajeya and Rohin both basically agreeing with that.

Agreed, but the thing you want to use this for isn’t simulating a long reflection, which will fail (in the worst case) because HCH can’t do certain types of learning efficiently.

Once we get past Simulated Long Reflection, there's a whole pile of Things To Do With AI which strike me as Probably Doomed on general principles.

(Well, really I expect it to take <12 months, but planning fallacy and safety margins and time to iterate a little and all that.)

75

The Plan

75

What’s your plan for AI alignment?

That sounds… awfully optimistic. Do you actually think that’s viable?

Do you just have really long timelines?

… Wat. Not relevant until we’re down to two years?!?

But iterative engineering is important!

But engineering is important for advancing understanding too!

What do you mean by “fundamentally confused”?

What are we fundamentally confused about?

What kinds of “incremental progress” do you have in mind here?

Ok, the incremental progress makes sense, but the full plan still sounds ridiculously optimistic with 10-15 year timelines. Given how slow progress has been on the foundational theory of agency (especially at MIRI), why do you expect it to go so much faster?

What’s the roadmap?

Why do we need formalizations for engineering?

Why so much focus on abstraction?

But, like, 10-15 years?!?

Why ambitious value learning?

… but why not aim for some easier strategy?