AI ALIGNMENT FORUM

This review is mostly going to talk about what I think the post does wrong and how to fix it, because the post itself does a good job explaining what it does right. But before we get to that, it's worth saying up-front what the post does well: the post proposes a basically-correct notion of "power" for purposes of instrumental convergence, and then uses it to prove that instrumental convergence is in fact highly probable under a wide range of conditions. On that basis alone, it is an excellent post.

I see two (related) central problems, from which various o... (read more)

But exactly how complex and fragile?

Alex Turner4y120Review for 2019 Review

(I reviewed this in a top-level post: Review of 'But exactly how complex and fragile?'.)

I've thought about (concepts related to) the fragility of value quite a bit over the last year, and so I returned to Katja Grace's But exactly how complex and fragile? with renewed appreciation (I'd previously commented only a very brief microcosm of this review). I'm glad that Katja wrote this post and I'm glad that everyone commented. I often see private Google docs full of nuanced discussion which will never see the light of day, and that makes me sad, and I'm happy ... (read more)

Risks from Learned Optimization: Introduction

Adam Shimi4y140Review for 2019 Review

In “Why Read The Classics?”, Italo Calvino proposes many different definitions of a classic work of literature, including this one:

A classic is a book which has never exhausted all it has to say to its readers.

For me, this captures what makes this sequence and corresponding paper a classic in the AI Alignment literature: it keeps on giving, readthrough after readthrough. That doesn’t mean I agree with everything in it, or that I don’t think it could have been improved in terms of structure. But when pushed to reread it, I found again and again that I had m... (read more)

Selection vs Control

Adam Shimi4y180Review for 2019 Review

Selection vs Control is a distinction I always point to when discussing optimization. Yet these are not the two takes on optimization I generally use. My favored ones are internal optimization (which is basically search/selection) and external optimization (optimizing systems from Alex Flint’s The ground of optimization). So I do without control, or at least without Abram’s exact definition of control.

Why? Simply because the internal structure vs behavior distinction mentioned in this post seems more important than the actual definitions (which seem constra... (read more)

Gradient hacking

Adam Shimi4y121Review for 2019 Review

This post states the problem of gradient hacking. It is valuable in that this problem is far from obvious, and if plausible, very dangerous. On the other hand, the presentation doesn’t go into enough detail, and so leaves gradient hacking open to attacks and confusion. Thus instead of just reviewing this post, I would like to clarify certain points, while interweaving my criticisms of the way gradient hacking was initially stated, and explaining why I consider this problem so important.

(Caveat: I’m not pretending that any of my objections are unknown to E... (read more)

Why Subagents?

johnswentworth4y90Review for 2019 Review

What's the type signature of goals?

The type signature of goals is the overarching topic to which this post contributes. It can manifest in a lot of different ways in specific applications:

- What's the type signature of human values?
- What structure types should systems biologists or microscope AI researchers look for in supposedly-goal-oriented biological or ML systems?
- Will AI be "goal-oriented", and what would be the type signature of its "goal"?

If we want to "align AI with human values", build ML interpretability tools, etc, then that's going to be pretty to... (read more)

The Parable of Predict-O-Matic

fiddler4y100Review for 2019 Review

I think this post is incredibly useful as a concrete example of the challenges of seemingly benign powerful AI, and makes a compelling case for serious AI safety research being a prerequisite to any safe further AI development. I strongly dislike part 9, as painting the Predict-O-Matic as consciously influencing others' personalities at the expense of short-term prediction error seems contradictory to the point of the rest of the story. I suspect I would dislike part 9 significantly less if it were framed in terms of a strategy to maximize predictive accuracy.... (read more)

Evolution of Modularity

johnswentworth4y70Review for 2019 Review

The material here is one seed of a worldview which I've updated toward a lot more over the past year. Some other posts which involve the theme include Science in a High Dimensional World, What is Abstraction?, Alignment by Default, and the companion post to this one Book Review: Design Principles of Biological Circuits.

Two ideas unify all of these:

- Our universe has a simplifying structure: it abstracts well, implying a particular kind of modularity.
- Goal-oriented systems in our universe tend to evolve a modular structure which reflects the structure of the u

... (read more)

Understanding “Deep Double Descent”

orthonormal4y90Review for 2019 Review

If this post is selected, I'd like to see the followup made into an addendum—I think it adds a very important piece, and it should have been nominated itself.

Selection vs Control

johnswentworth4y70Review for 2019 Review

In a field like alignment or embedded agency, it's useful to keep a list of one or two dozen ideas which seem like they should fit neatly into a full theory, although it's not yet clear how. When working on a theoretical framework, you regularly revisit each of those ideas, and think about how it fits in. Every once in a while, a piece will click, and another large chunk of the puzzle will come together.

Selection vs control is one of those ideas. It seems like it should fit neatly into a full theory, but it's not yet clear what that will look like. I revis... (read more)

Seeking Power is Often Convergently Instrumental in MDPs

Alex Turner4y60Review for 2019 Review

One year later, I remain excited about this post, from its ideas, to its formalisms, to its implications. I think it helps us formally understand part of the difficulty of the alignment problem. This formalization of power and the Attainable Utility Landscape have together given me a novel frame for understanding alignment and corrigibility.

Since last December, I’ve spent several hundred hours expanding the formal results and rewriting the paper; I’ve generalized the theorems, added rigor, and taken great pains to spell out what the theorems do and do not ... (read more)

Alignment Research Field Guide

Adam Shimi4y40Review for 2019 Review

How do you review a post that was not written for you? I’m already doing research in AI Alignment, and I don’t plan on creating a group of collaborators for the moment. Still, I found some parts of this useful.

Maybe that’s how you do it: by taking different profiles, and running through the most useful advice for each profile from the post. Let’s do that.

Full time researcher (no team or MIRIx chapter)

For this profile (which is mine, by the way), the most useful piece of advice from this post comes from the model of transmitters and receivers. I’m convinced... (read more)

Reframing Superintelligence: Comprehensive AI Services as General Intelligence

Rohin Shah4y20Review for 2019 Review

I trust past-me to have summarized CAIS much better than current-me; back when this post was written I had just finished reading CAIS for the third or fourth time, and I haven't read it since. (This isn't a compliment -- I read it multiple times because I had a lot of trouble understanding it.)

I've put in two points of my own in the post. First:

(My opinion: I think this isn't engaging with the worry with RL agents -- typically, we're worried about the setting where the RL agent is learning or planning at test time, which can happen in learn-to-learn and on

DanielFilan4y60Review for 2019 Review

[NB: this is a review of the paper, which I have recently read, not of the post series, which I have not]

For a while before this paper was published, several people in AI alignment had discussed things like mesa-optimization as serious concerns. That being said, these concerns had not been published in their most convincing form in great detail. The two counterexamples that I’m aware of are the posts What does the universal prior actually look like? by Paul Christiano, and Optimization daemons on Arbital. However, the first post only discussed the issue i... (read more)

Chris Olah’s views on AGI safety

DanielFilan4y50Review for 2019 Review

- Olah’s comment indicates that this is indeed a good summary of his views.
- I think the first three listed benefits are indeed good reasons to work on transparency/interpretability. I am intrigued but less convinced by the prospect of ‘microscope AI’.
- The ‘catching problems with auditing’ section describes an ‘auditing game’, and says that progress in this game might illustrate progress in using interpretability for alignment. It would be good to learn how much success the auditors have had in this game since the post was published.
- One test of ‘microscope

... (read more)

Utility ≠ Reward

Oliver Habryka4y40Review for 2019 Review

I think this post and the Gradient Hacking post caused me to actually understand and feel able to productively engage with the idea of inner-optimizers. I think the paper and full sequence were good, but I bounced off of them a few times, and this helped me get traction on the core ideas in the space.

I also think that some parts of this essay hold up better as a core abstraction than the actual mesa-optimizer paper itself, though I am not at all confident about this. But I just noticed that when I am internally thinking through alignment problems relate... (read more)

Soft takeoff can still lead to decisive strategic advantage

orthonormal4y30Review for 2019 Review

It's hard to know how to judge a post that deems itself superseded by a post from a later year, but I lean toward taking Daniel at his word and hoping we survive until the 2021 Review comes around.

Chris Olah’s views on AGI safety

orthonormal4y40Review for 2019 Review

The content here is very valuable, even if the genre of "I talked a lot with X and here's my articulation of X's model" comes across to me as a weird intellectual ghostwriting. I can't think of a way around that, though.

Six AI Risk/Strategy Ideas

Oliver Habryka4y30Review for 2019 Review

I have now linked at least 10 times to the "'Generate evidence of difficulty' as a research purpose" section of this post. It was a thing that I kind of wanted to point to before this post came out, but I felt confused about it, and this post finally gave me a pointer to it.

I think that section was substantially more novel and valuable to me than the rest of this post, but it is also evidence that others might not have had some of the other ideas on their map, and so they might have found it similarly valuable because of a different section.

Classifying specification problems as variants of Goodhart's Law

Victoria Krakovna4y30Review for 2019 Review

Writing this post helped clarify my understanding of the concepts in both taxonomies - the different levels of specification and types of Goodhart effects. The parts of the taxonomies that I was not sure how to match up usually corresponded to the concepts I was most confused about. For example, I initially thought that adversarial Goodhart is an emergent specification problem, but upon further reflection this didn't seem right. Looking back, I think I still endorse the mapping described in this post.

I hoped to get more comments on this post... (read more)

Understanding “Deep Double Descent”

DanielFilan4y40Review for 2019 Review

I think this paper does a good job at collecting papers about double descent into one place where they can be contrasted and discussed.
I am not convinced that deep double descent is a pervasive phenomenon in practically-used neural networks, for reasons described in Rohin’s opinion about Preetum et al. This wouldn’t be so bad, except that the limitations of the evidence (smaller ResNets than usual, basically goes away without label noise in image classification, some sketchy choices made in the Belkin et al. experiments) are not really addressed or highlight

... (read more)

The strategy-stealing assumption

Alex Turner4y40Review for 2019 Review

Over the last year, I've thought a lot about human/AI power dynamics and influence-seeking behavior. I personally haven't used the strategy-stealing assumption (SSA) in reasoning about alignment, but it seems like a useful concept.

Overall, the post seems good. The analysis is well-reasoned and reasonably well-written, although it's sprinkled with opaque remarks (I marked up a Google doc with more detail).

If this post is voted in, it might be nice if Paul gave more room to big-picture, broad-strokes "how does SSA tend to fail?" discussion, discussing ... (read more)

What failure looks like

orthonormal4y30Review for 2019 Review

I think this post and Evan's summary of Chris Olah's views are essential, both in their own right and as mutual foils to MIRI's research agenda. We see related concepts (mesa-optimization originally came out of Paul's talk of daemons in Solomonoff induction, if I remember right) but very different strategies for achieving both inner and outer alignment. (The crux of the disagreement seems to be the probability of success from adapting current methods.)

Strongly recommended for inclusion.

Gradient hacking

Oliver Habryka4y20Review for 2019 Review

adamshimi says almost everything I wanted to say in my review, so I am very glad he made the points he did, and I would love for both his review and the top level post to be included in the book.

The key thing I want to emphasize a bit more is that I think the post as given is very abstract, and I have personally gotten a lot of value out of trying to think of more concrete scenarios where gradient hacking can occur.

I think one of the weakest aspects of the post is that it starts with the assumption that an AI system has already given rise to an... (read more)

Strategic implications of AIs' ability to coordinate at low cost, for example by merging

Daniel Kokotajlo4y20Review for 2019 Review

This post is excellent, in that it has a very high importance-to-word-count ratio. It'll take up only a page or so, but convey a very useful and relevant idea, and moreover ask an important question that will hopefully stimulate further thought.

The Credit Assignment Problem

Raymond Arnold4y20Review for 2019 Review

I think I have juuust enough background to follow the broad strokes of this post, but not to quite grok the parts I think Abram was most interested in.

It definitely caused me to think about credit assignment. I actually ended up thinking about it largely through the lens of Moral Mazes (where challenges of credit assignment combine with other forces to create a really bad environment). Re-reading this post, while I don't quite follow everything, I do successfully get a taste of how credit assignment fits into a bunch of different domains.

For the "myop... (read more)

Risks from Learned Optimization: Introduction

Ben Pace4y30Review for 2019 Review

For me, this is the paper where I learned to connect ideas about delegation to machine learning. The paper sets up simple ideas of mesa-optimizers, and shows a number of constraints and variables that will determine how the mesa-optimizers will be developed – in some environments you want to do a lot of thinking in advance then delegate execution of a very simple algorithm to do your work (e.g. this simple algorithm Critch developed that my group house uses to decide on the rent for each room), and in some environments you want to do a little thinking and ... (read more)

Debate on Instrumental Convergence between LeCun, Russell, Bengio, Zador, and More

Alex Turner4y20Review for 2019 Review

Note 1: This review is also a top-level post.

Note 2: I think that 'robust instrumentality' is a more apt name for 'instrumental convergence.' That said, for backwards compatibility, this comment often uses the latter.

In the summer of 2019, I was building up a corpus of basic reinforcement learning theory. I wandered through a sun-dappled Berkeley, my head in the clouds, my mind bent on a single ambition: proving the existence of instrumental convergence.

Somehow.

I needed to find the right definitions first, and I couldn't even imagine what... (read more)

Soft takeoff can still lead to decisive strategic advantage

Daniel Kokotajlo4y20Review for 2019 Review

I've written up a review here, which I made into a separate post because it's long.

Now that I read the instructions more carefully, I realize that I maybe should have just put it here and waited for mods to promote it if they wanted to. Oops, sorry, happy to undo if you like.

Reframing Impact

jacobjacob4y20Review for 2019 Review

Here are prediction questions for the predictions that TurnTrout himself provided in the concluding post of the Reframing Impact sequence.

Elicit Prediction (elicit.org/binary/questions/7SoL5DPRf)

Elicit Prediction (elicit.org/binary/questions/AevXOS1Rj)

Elicit Prediction (elicit.org/binary/questions/javyyEd8C)

Elicit Prediction (elicit.org/binary/questions/iYT69bLl9)

Elicit Prediction (elicit.org/binary/questions/GFGG5plOQ)

Elicit Prediction (eli

... (read more)

Thoughts on Human Models

Rohin Shah4y30Review for 2019 Review

I continue to agree with my original comment on this post (though it is a bit long-winded and goes off on more tangents than I would like), and I think it can serve as a review of this post.

If this post were to be rewritten, I'd be particularly interested to hear example "deployment scenarios" where we use an AGI without human models and this makes the future go well. I know of two examples:

- We use strong global coordination to ensure that no powerful AI systems with human models are ever deployed.
- We build an AGI that can do science / engineering really wel

... (read more)

The strategy-stealing assumption

jacobjacob4y20Review for 2019 Review

Elicit Prediction (elicit.org/binary/questions/4JOKn_4F5)

(You can find a list of all 2019 Review poll questions here.)

Reframing Superintelligence: Comprehensive AI Services as General Intelligence

Oliver Habryka4y20Review for 2019 Review

I think the CAIS framing that Eric Drexler proposed gave concrete shape to a set of intuitions that many people have been relying on for their thinking about AGI. I also tend to think that those intuitions and models aren't actually very good at modeling AGI, but I nevertheless think it productively moved the discourse forward a good bit.

In particular, I am very grateful for the comment thread between Wei Dai and Rohin, which really helped me engage with the CAIS ideas, and which I think was necessary to get me to my current understanding of CAIS and to ... (read more)

Understanding “Deep Double Descent”

Mark Xu4y30Review for 2019 Review

This post gave me a slightly better understanding of the dynamics happening inside SGD. I think deep double descent is strong evidence that something like a simplicity prior exists in SGD, which might have actively bad generalization properties, e.g. by incentivizing deceptive alignment. I remain cautiously optimistic that approaches like Learning the Prior can circumnavigate this problem.

The Parable of Predict-O-Matic

orthonormal4y10Review for 2019 Review

This reminds me of That Alien Message, but as a parable about mesa-alignment rather than outer alignment. It reads well, and helps make the concepts more salient. Recommended.

Utility ≠ Reward

Vladimir Mikulik4y10Review for 2019 Review

More than a year since writing this post, I would still say it represents the key ideas in the sequence on mesa-optimisation which remain central in today's conversations on mesa-optimisation. I still largely stand by what I wrote, and recommend this post as a complement to that sequence for two reasons:

First, skipping some detail allows it to focus on the important points, making it better-suited than the full sequence for obtaining an overview of the area.

Second, unlike the sequence, it deemphasises the mechanism of optimisation, and explicitly cas... (read more)