AI ALIGNMENT FORUM

This post is a review of Paul Christiano's argument that the Solomonoff prior is malign, along with a discussion of several counterarguments and countercounterarguments. As such, I think it is a valuable resource for researchers who want to learn about the problem. I will not attempt to distill the contents: the post is already a distillation, and does a fairly good job of it.

Instead, I will focus on what I believe is the post's main weakness/oversight. Specifically, the author seems to think the Solomonoff prior is, in some way, a distorted model of rea... (read more)

The ground of optimization

Vanessa Kosoy3y160Review for 2020 Review

In this post, the author proposes a semiformal definition of the concept of "optimization". This is potentially valuable since "optimization" is a word often used in discussions about AI risk, and much confusion can follow from sloppy use of the term or from different people understanding it differently. While the definition given here is a useful perspective, I have some reservations about the claims made about its relevance and applications.

The key paragraph, which summarizes the definition itself, is the following:

An optimizing system is a system that

Vanessa Kosoy3y140Review for 2020 Review

In this post, the author presents a case for replacing expected utility theory with some other structure which has no explicit utility function, but only quantities that correspond to conditional expectations of utility.

To provide motivation, the author starts from what he calls the "reductive utility view", which is the thesis he sets out to overthrow. He then identifies two problems with the view.

The first problem is about the ontology in which preferences are defined. In the reductive utility view, the domain of the utility function is the set of possib... (read more)

Why Neural Networks Generalise, and Why They Are (Kind of) Bayesian

johnswentworth3y120Review for 2020 Review

The work linked in this post was IMO the most important work done on understanding neural networks at the time it came out, and it has also significantly changed the way I think about optimization more generally.

That said, there's a lot of "noise" in the linked papers; it takes some digging to see the key ideas and the data backing them up, and there's a lot of space spent on things which IMO just aren't that interesting at all. So, I'll summarize the things which I consider central.

When optimizing an overparameterized system, there are many many different... (read more)

An overview of 11 proposals for building safe advanced AI

Daniel Kokotajlo3y160Review for 2020 Review

This post is the best overview of the field so far that I know of. I appreciate how it frames things in terms of outer/inner alignment and training/performance competitiveness--it's very useful to have a framework with which to evaluate proposals and this is a pretty good framework I think.

Since it was written, this post has been my go-to reference for getting other people up to speed on what the current AI alignment strategies look like (even though this post isn't exhaustive). I've also referred back to it myself several times. I learned a lot from... (read more)

The Solomonoff Prior is Malign

johnswentworth3y110Review for 2020 Review

This post is an excellent distillation of a cluster of past work on the malignness of Solomonoff Induction, which has become a foundational argument/model for inner agency and malign models more generally.

I've long thought that the malignness argument overlooks some major counterarguments, but I never got around to writing them up. Now that this post is up for the 2020 review, it seems like a good time to walk through them.

In Solomonoff Model, Sufficiently Large Data Rules Out Malignness

There is a major outside-view reason to expect that the Solomonoff-is-malign ar... (read more)

Inner Alignment in Salt-Starved Rats

Steve Byrnes3y110Review for 2020 Review

I still think this post is correct in spirit, and was part of my journey towards good understanding of neuroscience, and promising ideas in AGI alignment / safety.

But there are a bunch of little things that I got wrong or explained poorly. Shall I list them?

First, my "neocortex vs subcortex" division eventually developed into "learning subsystem vs steering subsystem", with the latter being mostly just the hypothalamus and brainstem, and the former being everything else, particularly the whole telencephalon and cerebellum. The main difference is that the "... (read more)

Possible takeaways from the coronavirus pandemic for slow AI takeoff

Victoria Krakovna3y60Review for 2020 Review

I generally endorse the claims made in this post and the overall analogy. Since this post was written, there are a few more examples I can add to the categories for slow takeoff properties.

Learning from experience

The UK procrastinated on locking down in response to the Alpha variant due to political considerations (not wanting to "cancel Christmas"), though it was known that timely lockdowns are much more effective.
Various countries reacted to Omicron with travel bans after they already had community transmission (e.g. Canada and the UK), while it wa

... (read more)

Alignment By Default

Steve Byrnes3y80Review for 2020 Review

I’ll set aside what happens “by default” and focus on the interesting technical question of whether this post is describing a possible straightforward-ish path to aligned superintelligent AGI.

The background idea is “natural abstractions”. This is basically a claim that, when you use an unsupervised world-model-building learning algorithm, its latent space tends to systematically learn some patterns rather than others. Different learning algorithms will converge on similar learned patterns, because those learned patterns are a property of the world, not an ... (read more)

Clarifying inner alignment terminology

Vanessa Kosoy3y100Review for 2020 Review

This post aims to clarify the definitions of a number of concepts in AI alignment introduced by the author and collaborators. The concepts are interesting, and some researchers evidently find them useful. Personally, I find the definitions confusing, but I did benefit a little from thinking about this confusion. In my opinion, the post could greatly benefit from introducing mathematical notation^[1] and making the concepts precise at least in some very simplistic toy model.

In the following, I'll try going over some of the definitions and explicating my unde... (read more)

An Orthodox Case Against Utility Functions

Ben Pace3y70Review for 2020 Review

An Orthodox Case Against Utility Functions was a shocking piece to me. Abram spends the first half of the post laying out a view he suspects people hold, but he thinks is clearly wrong, which is a perspective that approaches things "from the starting-point of the universe". I felt dread reading it, because it was a view I held at the time, and I used as a key background perspective when I discussed bayesian reasoning. The rest of the post lays out an alternative perspective that "starts from the standpoint of the agent". Instead of my beliefs being about t... (read more)

Against GDP as a metric for timelines and takeoff speeds

Daniel Kokotajlo3y60Review for 2020 Review

(I am the author)

I still like & stand by this post. I refer back to it constantly. It does two things:

1. Argue that an AI-induced point of no return could come significantly before, or significantly after, world GDP growth accelerates--and indeed will probably come before!

2. Argue that we shouldn't define timelines and takeoff speeds in terms of economic growth. So, against "is there a 4 year doubling before a 1 year doubling?" and against "When will we have TAI = AI capable of doubling the economy in 4 years if deployed?"

I think both things are pretty impo... (read more)

Introduction to Cartesian Frames

Ben Pace3y60Review for 2020 Review

Introduction to Cartesian Frames is a piece that also gave me a new philosophical perspective on my life.

I don't know how to simply describe it. I don't know what even to say here.

One thing I can say is that the post formalized the idea of having "more agency" or "less agency", in terms of "what facts about the world can I force to be true?". The more I approach the world by stating things that are going to happen, that I can't change, the more I'm boxing-in my agency over the world. The more I treat constraints as things I could fight to chang... (read more)

Introduction To The Infra-Bayesianism Sequence

Diffractor3y60Review for 2020 Review

This post is still endorsed, it still feels like a continually fruitful line of research. A notable aspect of it is that, as time goes on, I keep finding more connections and crisper ways of viewing things which means that for many of the further linked posts about inframeasure theory, I think I could explain them from scratch better than the existing work does. One striking example is that the "Nirvana trick" stated in this intro (to encode nonstandard decision-theory problems), has transitioned from "weird hack that happens to work" to "pops straight out... (read more)

The Pointers Problem: Human Values Are A Function Of Humans' Latent Variables

johnswentworth3y70Review for 2020 Review

Why This Post Is Interesting

This post takes a previously-very-conceptually-difficult alignment problem, and shows that we can model this problem in a straightforward and fairly general way, just using good ol' Bayesian utility maximizers. The formalization makes the Pointers Problem mathematically legible: it's clear what the problem is, it's clear why the problem is important and hard for alignment, and that clarity is not just conceptual but mathematically precise.

Unfortunately, mathematical legibility is not the same as accessibility; the post does have... (read more)

Draft report on AI timelines

Daniel Kokotajlo3y60Review for 2020 Review

Ajeya's timelines report is the best thing that's ever been written about AI timelines imo. Whenever people ask me for my views on timelines, I go through the following mini-flowchart:

1. Have you read Ajeya's report?

--If yes, launch into a conversation about the distribution over 2020's training compute and explain why I think the distribution should be substantially to the left, why I worry it might shift leftward faster than she projects, and why I think we should use it to forecast AI-PONR instead of TAI.

--If no, launch into a conversation about Ajey... (read more)

Inner Alignment: Explain like I'm 12 Edition

David Manheim3y60Review for 2020 Review

This post is both a huge contribution, giving a simpler and shorter explanation of a critical topic, with a far clearer context, and has been useful to point people to as an alternative to the main sequence. I wouldn't promote it as more important than the actual series, but I would suggest it as a strong alternative to including the full sequence in the 2020 Review. (Especially because I suspect that those who are very interested are likely to have read the full sequence, and most others will not even if it is included.)

AGI safety from first principles: Introduction

Raymond Arnold3y20Review for 2020 Review

I haven't had time to reread this sequence in depth, but I wanted to at least touch on how I'd evaluate it. It seems to be aiming to be both a good introductory sequence and a "complete and compelling case I can for why the development of AGI might pose an existential threat".

The question is who this sequence is for, what its goal is, and how it compares to other writing targeting similar demographics.

Some writing that comes to mind to compare/contrast it with includes:

Scott Alexander's Superintelligence FAQ. This is the post I've

Steve Byrnes3y60Review for 2020 Review

I wrote this relatively early in my journey of self-studying neuroscience. Rereading this now, I guess I'm only slightly embarrassed to have my name associated with it, which isn’t as bad as I expected going in. Some shifts I’ve made since writing it (some of which are already flagged in the text):

New terminology part 1: Instead of “blank slate” I now say “learning-from-scratch”, as defined and discussed here.
New terminology part 2: “neocortex vs subcortex” → “learning subsystem vs steering subsystem”, with the former including the whole telencephalon and

Daniel Kokotajlo3y70Review for 2020 Review

(I am the author)

I still like & endorse this post. When I wrote it, I hadn't read more than the wiki articles on the subject. But then afterwards I went and read 3 books (written by historians) about it, and I think the original post held up very well to all this new info. In particular, the main critique the post got -- that disease was more important than I made it sound, in a way that undermined my conclusion -- seems to have been pretty wrong. (See e.g. this comment thread, these follow up posts)

So, why does it matter? What contribution did this po... (read more)

Extrapolating GPT-N performance

Daniel Kokotajlo3y60Review for 2020 Review

We all saw the GPT performance scaling graphs in the papers, and we all stared at them and imagined extending the trend for another five OOMs or so... but then Lanrian went and actually did it! Answered the question we had all been asking! And rigorously dealt with some technical complications along the way.

I've since referred to this post a bunch of times. It's my go-to reference when discussing performance scaling trends.

What counts as defection?

Alex Turner3y70Review for 2020 Review

This post's main contribution is the formalization of game-theoretic defection as gaining personal utility at the expense of coalitional utility.
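As a rough illustration of that formalization, here is a minimal Python sketch. It assumes a two-player game, takes "coalitional utility" to be the average payoff, and treats a unilateral switch of action as a defection when it raises the switching player's own payoff while lowering that average. The payoff numbers and the names `PRISONERS_DILEMMA`, `STAG_HUNT`, `coalition_utility`, and `is_defection` are hypothetical choices made for this example and are not taken from the post.

```python
from typing import Dict, Tuple

Action = str
# Maps an action profile (player 0's action, player 1's action) to both payoffs.
Payoffs = Dict[Tuple[Action, Action], Tuple[float, float]]

# Illustrative payoff tables (assumed numbers, not from the post).
PRISONERS_DILEMMA: Payoffs = {
    ("C", "C"): (3.0, 3.0),
    ("C", "D"): (0.0, 5.0),
    ("D", "C"): (5.0, 0.0),
    ("D", "D"): (1.0, 1.0),
}

STAG_HUNT: Payoffs = {
    ("S", "S"): (4.0, 4.0),
    ("S", "H"): (0.0, 3.0),
    ("H", "S"): (3.0, 0.0),
    ("H", "H"): (3.0, 3.0),
}


def coalition_utility(payoffs: Tuple[float, float]) -> float:
    """Assumed coalition measure: the average payoff across both players."""
    return sum(payoffs) / len(payoffs)


def is_defection(game: Payoffs, player: int, baseline: Action,
                 deviation: Action, other_action: Action) -> bool:
    """True if switching from `baseline` to `deviation` (other player held
    fixed) strictly raises the player's own payoff while strictly lowering
    the coalition's average payoff."""

    def profile(action: Action) -> Tuple[Action, Action]:
        # Put the acting player's move in the right slot of the profile.
        return (action, other_action) if player == 0 else (other_action, action)

    before = game[profile(baseline)]
    after = game[profile(deviation)]
    personal_gain = after[player] > before[player]
    coalition_loss = coalition_utility(after) < coalition_utility(before)
    return personal_gain and coalition_loss
```

Under these assumed payoffs, switching from "C" to "D" in the Prisoner's Dilemma counts as a defection, while switching from "H" to "S" against a stag hunter does not, because that switch raises the coalition's average as well as the player's own payoff.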

Rereading, the post feels charmingly straightforward and self-contained. The formalization feels obvious in hindsight, but I remember being quite confused about the precise difference between power-seeking and defection—perhaps because popular examples of taking over the world are also defections against the human/AI coalition. I now feel cleanly deconfused about this distinction. And if I was confused about... (read more)

Search versus design

Ben Pace3y40Review for 2020 Review

"Search versus design" explores the basic way we build and trust systems in the world. A few notes:

My favorite part is the definitions about an abstraction layer being an artifact combined with a helpful story about it. It helps me see the world as a series of abstraction layers. We're not actually close to true reality, we are very much living within abstraction layers — the simple stories we are able to tell about the artefacts we build. A world built by AIs will be far less comprehensible than the world we live in today. (Much more like biology is

Rohin Shah3y30Review for 2020 Review

Apparently this has been nominated for the review. I assume that this is implicitly a nomination for the book, rather than my summary of it. If so, I think the post itself serves as a review of the book, and I continue to stand by the claims within.

How uniform is the neocortex?

Ruben Bloom3y40Review for 2020 Review

This post is what first gave me a major update towards "an AI with a simple single architectural pattern scaled up sufficiently could become AGI". In other words, there don't necessarily have to be complicated fine-tuned algorithms for different advanced functions; you can get lots of different things from the same simple structure plus optimization. Since then, as far as I can tell, that's what we've been seeing.

The Solomonoff Prior is Malign

Charlie Steiner3y40Review for 2020 Review

This was a really interesting post, and is part of a genre of similar posts about acausal interaction with consequentialists in simulatable universes.

The short argument is that if we (or not us, but someone like us with way more available compute) try to use the Kolmogorov complexity of some data to make a decision, our decision might get "hijacked" by simple programs that run for a very very long time and simulate aliens who look for universes where someone is trying to use the Solomonoff prior to make a decision and then based on what decision they want,... (read more)

Developmental Stages of GPTs

orthonormal3y30Review for 2020 Review

There are some posts with perennial value, and some which depend heavily on their surrounding context. This post is of the latter type. I think it was pretty worthwhile in its day (and in particular, the analogy between GPT upgrades and developmental stages is one I still find interesting), but I leave it to you whether the book should include time capsules like this.

It's also worth noting that, in the recent discussions, Eliezer has pointed to the GPT architecture as an example that scaling up has worked better than expected, but he diverges from the thes... (read more)

Alignment By Default

Vanessa Kosoy3y40Review for 2020 Review

In this post, the author describes a pathway by which AI alignment can succeed even without special research effort. The specific claim that this can happen "by default" is not very important, IMO (the author himself only assigns 10% probability to this). On the other hand, viewed as a technique that can be deliberately used to help with alignment, this pathway is very interesting.

The author's argument can be summarized as follows:

For anyone trying to predict events happening on Earth, the concept of "human values" is a "natural abstraction", i.e. someth

Ben Pace3y10Review for 2020 Review

My quick two-line review is something like: this post (and its sequel) is an artifact from someone with an interesting perspective on the world looking at the whole problem and trying to communicate their practical perspective. I don't really share this perspective, but it is looking at enough of the real things, and differently enough to the other perspectives I hear, that I am personally glad to have engaged with it. +4.

The Reasonable Effectiveness of Mathematics or: AI vs sandwiches

Vanessa Kosoy3y20Review for 2020 Review

In this post I speculated on the reasons for why mathematics is so useful so often, and I still stand behind it. The context, though, is the ongoing debate in the AI alignment community between the proponents of heuristic approaches and empirical research^[1] ("prosaic alignment") and the proponents of building foundational theory and mathematical analysis (as exemplified in MIRI's "agent foundations" and my own "learning-theoretic" research agendas).

Previous volleys in this debate include Ngo's "realism about rationality" (on the anti-theory side), the pro... (read more)

The Pointers Problem: Human Values Are A Function Of Humans' Latent Variables

Vanessa Kosoy3y40Review for 2020 Review

This post states a subproblem of AI alignment which the author calls "the pointers problem". The user is regarded as an expected utility maximizer, operating according to causal decision theory. Importantly, the utility function depends on latent (unobserved) variables in the causal network. The AI operates according to a different, superior, model of the world. The problem is then, how do we translate the utility function from the user's model to the AI's model? This is very similar to the "ontological crisis" problem described by De Blanc, only De Blanc ... (read more)

Inaccessible information

Vanessa Kosoy3y30Review for 2020 Review

This post defines and discusses an informal notion of "inaccessible information" in AI.

AIs are expected to acquire all sorts of knowledge about the world in the course of their training, including knowledge only tangentially related to their training objective. The author proposes to classify this knowledge into "accessible" and "inaccessible" information. In my own words, information inside an AI is "accessible" when there is a straightforward way to set up a training protocol that will incentivize the AI to reliably and accurately communicate this inform... (read more)

Introduction To The Infra-Bayesianism Sequence

Charlie Steiner3y20Review for 2020 Review

Of the agent foundations work from 2020, I think this sequence is my favorite, and I say this without actually understanding it.

The core idea is that Bayesianism is too hard. And so what we ultimately want is to replace probability distributions over all possible things with simple rules that don't have to put a probability on all possible things. In some ways this is the complement to logical uncertainty - logical uncertainty is about not having to have all possible probability distributions possible, this is about not having to put probability distributi... (read more)

Inaccessible information

Raymond Arnold3y10Review for 2020 Review

It strikes me that this post looks like (AFAICT?) a stepping stone towards the Eliciting Latent Knowledge research agenda, which currently has a lot of support/traction. That makes this post fairly historically important.

Some AI research areas and their relevance to existential safety

Raymond Arnold3y10Review for 2020 Review

I've highly voted this post for a few reasons.

First, this post contains a bunch of other individual ideas I've found quite helpful for orienting. Some examples:

Useful thoughts on which term definitions have "staying power," and are worth coordinating around.
The zero/single/multi alignment framework.
The details on how to anticipate, legitimize, and fulfill governance demands.

But my primary reason was learning Critch's views on what research fields are promising, and how they fit into his worldview. I'm not sure if I agree with Critch, but I think "Figur... (read more)

Radical Probabilism

Ben Pace3y10Review for 2020 Review

Radical Probabilism is an extension of the Embedded Agency philosophical position. I remember reading it and feeling a strong sense that I really got to see a well pinned-down argument using that philosophy. Radical Probabilism might be a +9, will have to re-read, but for now I give it +4.

(This review is taken from my post Ben Pace's Controversial Picks for the 2020 Review.)

Inner Alignment: Explain like I'm 12 Edition

Rafael Harth3y10Review for 2020 Review

Author here. One thing I think I've done wrong in the post is to equate black-box-search-in-large-parametrized-space with all of machine learning. I've now added this paragraph at the end of chapter 1:

Admittedly, the inner alignment model is not maximally general. In this post, we've looked at black box search, where we have a parametrized model and do SGD to update the parameters. This describes most of what Machine Learning is up to in 2020, but it does not describe what the field did pre-2000 and, in the event of a paradigm shift similar to the deep l

... (read more)

Measuring hardware overhang

Charlie Steiner3y10Review for 2020 Review

On the one hand this is an interesting and useful piece of data on AI scaling and the progress of algorithms. It's also important because it makes the point that the very notion of "progress of algorithms" implies hardware overhang as important as >10 years of Moore's law. I also enjoyed the follow-up work that this spawned in 2021.