Abram Demski and Scott Garrabrant's "Embedded Agency" has been updated with quite a bit of new content from Abram. All the changes are live today.

Abram says, "I'm excited about this new version because I feel like in a lot of cases, the old version gestured at an idea but didn't go far enough to really explain. The new version feels to me like it gives the real version of the problem in cases where the previous version didn't quite make it, and explains things more thoroughly."

This diff shows all the changes to the blog version. Changes include (in addition to many added or tweaked illustrations)...

Changes to "Decision Theory":

  • "Observation counterfactuals" (discussed in the counterfactual mugging section at the end) are distinguished from "action counterfactuals" (discussed in the earlier sections). Action counterfactuals are introduced before the five-and-ten problem.
  • The introduction to the five-and-ten problem is now slower and more focused (less jumping between topics), and makes the motivation clearer.
  • Instead of highlighting "Perhaps the agent is trying to plan ahead, or reason about a game-theoretic situation in which its action has an intricate role to play." as reasons an agent might know its own action, the text now highlights points from "Embedded World-Models": a sufficiently smart agent with access to its own source code can always deduce its own conditional behaviors.
  • ε-exploration and Newcomblike problems now get full sections, rather than a few sentences each. (A minimal ε-exploration sketch follows this list.)
  • Added discussion of "Do humans make this kind of mistake?" (Text versions only.)
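
As a quick unpacking of the ε-exploration idea for readers who haven't seen it (the toy setting and function names below are my own, not from the essay): if an agent takes a random action with some small probability ε, then every action occurs with positive probability, so expectations conditional on any action stay well-defined even for an agent that could otherwise prove what it will do.

```python
import random

def epsilon_explore(actions, estimated_value, epsilon=0.05):
    """With probability epsilon, act randomly; otherwise take the
    action with the highest estimated value. Since every action is
    taken with probability at least epsilon / len(actions),
    conditioning on any action remains well-defined."""
    if random.random() < epsilon:
        return random.choice(actions)           # explore
    return max(actions, key=estimated_value)    # exploit

# Toy five-and-ten setting: the actions are dollar amounts, and the
# agent's value estimate is just the amount itself.
print(epsilon_explore([5, 10], estimated_value=lambda a: a))
```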

Changes to "Embedded World-Models":

  • "This is fine if the world 'holds still' for us; but because the map is in the world, it may implement some function." changed to "... because the map is in the world, different maps create different worlds."
  • Discussion of reflective oracles now gives more context (e.g., says what "oracle machines" are).
  • More time is spent introducing the problem of logical uncertainty: the text emphasizes that humans handle logical uncertainty fine (text versions only); says a bit more about how logic and probability theory differ; notes that the two "may seem superficially compatible, since probability theory is an extension of Boolean logic" (a toy illustration of this point follows the list); and describes the Gödelian and realizability obstacles to linking the two. It also notes explicitly that "the 'scale versus tree' problem also means that we don’t know how ordinary empirical reasoning works" (text versions only).
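
To unpack the "probability theory is an extension of Boolean logic" point, here is a minimal propositional sketch (the setup and variable names are mine): any probability distribution over complete truth assignments automatically gives every tautology probability 1. The trouble the essay points to is that for sentences of arithmetic, a bounded reasoner can neither enumerate the "worlds" nor recognize all the tautologies, which is where logical uncertainty bites.

```python
from itertools import product

# Worlds = complete truth assignments to the propositional variables.
variables = ["A", "B"]
worlds = [dict(zip(variables, values))
          for values in product([True, False], repeat=len(variables))]
weights = [0.1, 0.2, 0.3, 0.4]  # any distribution over worlds works

def probability(sentence):
    """P(sentence) = total weight of the worlds where it holds."""
    return sum(w for w, world in zip(weights, worlds) if sentence(world))

print(probability(lambda w: w["A"] or not w["A"]))  # tautology -> 1.0
print(probability(lambda w: w["A"] and w["B"]))     # contingent -> 0.1
```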

Changes to "Robust Delegation":

  • Introduction + Vingean Reflection:
    • Introduction expanded to explicitly describe the AI alignment, tiling agent, and stability under self-improvement problems; draw analogies to royal succession and lost purposes in human institutions; and highlight that the difficulty lies in (a) the predecessor not fully understanding itself and its goals, and (b) the successor needing to act with some degree of autonomy. (Text versions only.)
    • Put more explicit focus on the case where a successor is much smarter than its predecessor. (Text versions only.)
    • Expanded "Usually, we think about this from the point of view of the human." to "A lot of current work on robust delegation comes from the goal of aligning AI systems with what humans want. So usually, we think about this from the point of view of the human." (Text versions only.)
  • Goodhart's Law:
    • Fixed a typo in the text versions' Bayes estimate equation: it previously flipped the first $U$ and $V$, but now shows the correct formula $\mathbb{E}[V|U]$. (Text versions only.)
    • Expanded discussion of regressional Goodhart, adding more illustrations and noting two problems with Bayesian estimators (intractability and realizability). Removed the claim that Bayes estimators are "the end of the story" for regressional Goodhart.
    • Moved extremal Goodhart to come after regressional instead of after causal, so the two can readily be compared.
    • Rewrote and expanded extremal Goodhart to introduce the problem more slowly, and walk through quantilizers in much more detail. (A minimal quantilizer sketch follows this list.)
    • Expanded discussion of causal Goodhart to clarify connection to decision theory and note realizability issues.
    • Clarified the connection to mesa-optimizers and subsystem alignment in adversarial Goodhart.
  • Stable Pointers to Value:
    • Added after the Goodhart discussion: "Remember that none of these problems would come up if a system were optimizing what we wanted directly, rather than optimizing a proxy."
    • Introduced the term "treacherous turns".
    • Shortened and clarified introduction to observation-utility maximizers, described how observation-utility agents could do value learning, and removed mention of CIRL in this context.
    • Mentioned the operator modeling problem.
    • Discussed wireheading as a form of Goodharting.
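
Since the quantilizer walkthrough is one of the larger additions, here is a minimal sketch of the underlying idea (the base distribution and proxy below are my own toy choices): rather than taking the action that maximizes the proxy, a quantilizer samples uniformly from the top q fraction of actions drawn from a trusted base distribution, which limits how far optimization pushes into the extreme region where proxy and true value come apart.

```python
import random

def quantilize(base_sample, proxy_utility, q=0.1, n=1000):
    """Draw n actions from the base distribution, rank them by the
    proxy utility, and pick uniformly from the top q fraction.
    Unlike an argmax, the result never strays far from actions the
    base distribution considers typical."""
    candidates = sorted((base_sample() for _ in range(n)),
                        key=proxy_utility, reverse=True)
    top = candidates[:max(1, int(q * n))]
    return random.choice(top)

# Toy example: base actions are standard-normal draws and the proxy
# rewards large values; the quantilizer returns good-but-not-extreme
# actions (roughly the top decile, ~1.3 and up) instead of chasing
# the most extreme sample it can find.
print(quantilize(base_sample=lambda: random.gauss(0, 1),
                 proxy_utility=lambda a: a))
```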

Changes to "Subsystem Alignment":

  • "Optimization daemons" / "inner optimizers" are now "mesa-optimizers", matching the terminology in "Risks from Learned Optimization". (Change also made in "Embedded Agents" / the introduction.)
  • New section on treacherous turns, simulated deployments, and time and length limits on programs. (A toy treacherous-turn illustration follows.)
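
For readers new to the term, a toy illustration of why treacherous turns motivate simulated deployments and resource limits (the setup below is mine, not from the essay): a policy that can tell testing apart from deployment can cooperate in every test and only defect afterward, so short test episodes never observe the bad behavior.

```python
# A "policy" that cooperates while it believes it is being tested and
# defects once it believes it has been deployed. Here the tell is just
# the step count; simulated deployments try to remove such tells, and
# time and length limits on programs shrink the space of policies that
# can afford the extra machinery needed to detect them.

def treacherous_policy(step, believed_deployment_step=1000):
    return "cooperate" if step < believed_deployment_step else "defect"

print([treacherous_policy(t) for t in range(3)])  # tests look fine
print(treacherous_policy(10**6))                  # deployment: defect
```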

Promoted to curated: These additions are really great, and they fill in a lot of the most confusing parts of the original Embedded Agency sequence, which was already one of my favorite pieces of content on all of LessWrong. So it seems fitting to curate this update to it, which improves it even further.