Reviews (All Years)

I didn't like this post. At the time, I didn't engage with it very much. I wrote a mildly critical comment (which is currently the top-voted comment, somewhat to my surprise) but I didn't actually engage with the idea very much. So it seems like a good idea to say something now.

The main argument that this is valuable seems to be: this captures a common crux in AI safety. I don't think it's my crux, and I think other people who think it is their crux are probably mistaken. So from my perspective it's a straw-man of the view it... (read more)

(I'm just going to speak for myself here, rather than the other authors, because I don't want to put words in anyone else's mouth. But many of the ideas I describe in this review are due to other people.)

I think this work was a solid intellectual contribution. I think that the metric proposed for how much you've explained a behavior is the most reasonable metric by a pretty large margin.

The core contribution of this paper was to produce negative results about interpretability. This led to us abandoning work on interpretability a few months later, which I'm... (read more)

I think this post might be the best one of all the MIRI dialogues. I also feel confused about how to relate to the MIRI dialogues overall.

A lot of the MIRI dialogues consist of Eliezer and Nate saying things that seem really important and obvious to me, and a lot of my love for them comes from a feeling of "this actually makes a bunch of the important arguments for why the problem is hard". But the nature of the argument is kind of closed off. 

Like, I agree with these arguments, but like, if you believe these arguments, having traction on AI Alignment... (read more)

This review is mostly going to talk about what I think the post does wrong and how to fix it, because the post itself does a good job explaining what it does right. But before we get to that, it's worth saying up-front what the post does well: the post proposes a basically-correct notion of "power" for purposes of instrumental convergence, and then uses it to prove that instrumental convergence is in fact highly probable under a wide range of conditions. On that basis alone, it is an excellent post.

I see two (related) central problems, from which various o... (read more)

This is a review of both the paper and the post itself, and turned more into a review of the paper (on which I think I have more to say) as opposed to the post. 

Disclaimer: this isn’t actually my area of expertise inside of technical alignment, and I’ve done very little linear probing myself. I’m relying primarily on my understanding of others’ results, so there’s some chance I’ve misunderstood something. Total amount of work on this review: ~8 hours, though about 4 of those were refreshing my memory of prior work and rereading the paper. 

TL... (read more)

I think that strictly speaking this post (or at least the main thrust) is true, and proven in the first section. The title is arguably less true: I think of 'coherence arguments' as including things like 'it's not possible for you to agree to give me a limitless number of dollars in return for nothing', which does imply some degree of 'goal-direction'.

I think the post is important, because it constrains the types of valid arguments that can be given for 'freaking out about goal-directedness', for lack of a better term. In my mind, it provokes various follo

... (read more)

In this essay, ricraz argues that we shouldn't expect a clean mathematical theory of rationality and intelligence to exist. I have debated em about this, and I continue to endorse more or less everything I said in that debate. Here I want to restate some of my (critical) position by building it from the ground up, instead of responding to ricraz point by point.

When should we expect a domain to be "clean" or "messy"? Let's look at everything we know about science. The "cleanest" domains are mathematics and fundamental physics. There, we have crisply defined

... (read more)

This post provides a valuable reframing of a common question in futurology: "here's an effect I'm interested in -- what sorts of things could cause it?"

That style of reasoning ends by postulating causes.  But causes have a life of their own: they don't just cause the one effect you're interested in, through the one causal pathway you were thinking about.  They do all kinds of things.

In the case of AI and compute, it's common to ask

  • Here's a hypothetical AI technology.  How much compute would it require?

But once we have an answer to this quest... (read more)

This post is a review of Paul Christiano's argument that the Solomonoff prior is malign, along with a discussion of several counterarguments and countercounterarguments. As such, I think it is a valuable resource for researchers who want to learn about the problem. I will not attempt to distill the contents: the post is already a distillation, and does a fairly good job of it.

Instead, I will focus on what I believe is the post's main weakness/oversight. Specifically, the author seems to think the Solomonoff prior is, in some way, a distorted model of rea... (read more)

(I reviewed this in a top-level post: Review of 'But exactly how complex and fragile?'.)

I've thought about (concepts related to) the fragility of value quite a bit over the last year, and so I returned to Katja Grace's But exactly how complex and fragile? with renewed appreciation (I'd previously commented only a very brief microcosm of this review). I'm glad that Katja wrote this post and I'm glad that everyone commented. I often see private Google docs full of nuanced discussion which will never see the light of day, and that makes me sad, and I'm happy ... (read more)

I've been thinking about this post a lot since it first came out. Overall, I think its core thesis is wrong, and I've seen a lot of people make confident wrong inferences on the basis of it.

The core problem with the post was covered by Eliezer's post "GPTs are Predictors, not Imitators" (which was not written, I think, as a direct response, but which still seems to me to convey the core problem with this post):  

Imagine yourself in a box, trying to predict the next word - assign as much probability mass to the next token as possible - for all t

... (read more)

In this essay, Rohin sets out to debunk what ey perceive as a prevalent but erroneous idea in the AI alignment community, namely: "VNM and similar theorems imply goal-directed behavior". This is placed in the context of Rohin's thesis that solving AI alignment is best achieved by designing AI which is not goal-directed. The main argument is: "coherence arguments" imply expected utility maximization, but expected utility maximization does not imply goal-directed behavior. Instead, it is a vacuous constraint, since any agent policy can be regarded as maximiz

... (read more)
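The vacuity claim summarized in the excerpt above can be illustrated with a toy construction. This is a minimal sketch under my own assumptions, not code from the essay or the review: the environment is reduced to fixed-length action sequences, and the names `policy`, `utility`, and `rollout` are hypothetical. For an arbitrary deterministic policy, define a utility function that assigns 1 to histories consistent with that policy and 0 to all others; the policy then maximizes that utility by construction.

```python
from itertools import product

ACTIONS = ("left", "right")  # toy action set (assumption)
HORIZON = 3                  # toy episode length (assumption)

def policy(history):
    """An arbitrary deterministic policy: alternate actions, starting with 'left'."""
    return ACTIONS[len(history) % 2]

def utility(history):
    """1.0 if every action in the history is what `policy` would have chosen, else 0.0."""
    consistent = all(policy(history[:t]) == a for t, a in enumerate(history))
    return 1.0 if consistent else 0.0

def rollout(pi, horizon):
    """Run a policy for `horizon` steps and return the resulting action history."""
    history = ()
    for _ in range(horizon):
        history += (pi(history),)
    return history

# The best achievable utility over all possible histories...
best = max(utility(h) for h in product(ACTIONS, repeat=HORIZON))
# ...is attained by the policy the utility function was built from.
own = utility(rollout(policy, HORIZON))
```

Swapping in any other deterministic `policy` gives the same result, which is why "maximizes some utility function" places no constraint on behavior in this toy setting.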

In this essay Paul Christiano proposes a definition of "AI alignment" which is more narrow than other definitions that are often employed. Specifically, Paul suggests defining alignment in terms of the motivation of the agent (which should be, helping the user), rather than what the agent actually does. That is, as long as the agent "means well", it is aligned, even if errors in its assumptions about the user's preferences or about the world at large lead it to actions that are bad for the user.

Rohin Shah's comment on the essay (which I believe is endorsed

... (read more)

Comments on the outcomes of the post:

  • I'm reasonably happy with how this post turned out. I think it probably bought the Anthropic/superposition mechanistic interpretability agenda somewhere between 0.1 and 4 counterfactual months of progress, which feels like a win.
  • I think sparse autoencoders are likely to be a pretty central method in mechanistic interpretability work for the foreseeable future (which tbf is not very foreseeable).
  • Two parallel works used the method identified in the post (sparse autoencoders - SAEs) or a slight modification:
    • Cunningham et al.
... (read more)

A year later, I continue to agree with this post; I still think its primary argument is sound and important. I'm somewhat sad that I still think it is important; I thought this was an obvious-once-pointed-out point, but I do not think the community actually believes it yet.

I particularly agree with this sentence of Daniel's review:

I think the post is important, because it constrains the types of valid arguments that can be given for 'freaking out about goal-directedness', for lack of a better term.

"Constraining the types of valid arguments" is exactly the... (read more)

In “Why Read The Classics?”, Italo Calvino proposes many different definitions of a classic work of literature, including this one:

A classic is a book which has never exhausted all it has to say to its readers.

For me, this captures what makes this sequence and corresponding paper a classic in the AI Alignment literature: it keeps on giving, readthrough after readthrough. That doesn’t mean I agree with everything in it, or that I don’t think it could have been improved in terms of structure. But when pushed to reread it, I found again and again that I had m... (read more)

Selection vs Control is a distinction I always point to when discussing optimization. Yet these are not the two takes on optimization I generally use. My favored ones are internal optimization (which is basically search/selection), and external optimization (optimizing systems from Alex Flint’s The ground of optimization). So I do without control, or at least without Abram’s exact definition of control.

Why? Simply because the internal structure vs behavior distinction mentioned in this post seems more important than the actual definitions (which seem constra... (read more)

In this post, the author proposes a semiformal definition of the concept of "optimization". This is potentially valuable since "optimization" is a word often used in discussions about AI risk, and much confusion can follow from sloppy use of the term or from different people understanding it differently. While the definition given here is a useful perspective, I have some reservations about the claims made about its relevance and applications.

The key paragraph, which summarizes the definition itself, is the following:

An optimizing system is a system that

... (read more)

I've been pleasantly surprised by how much this resource has caught on in terms of people using it and referring to it (definitely more than I expected when I made it). There were 30 examples on the list when it was posted in April 2018, and 20 new examples have been contributed through the form since then. I think the list has several properties that contributed to wide adoption: it's fun, standardized, up-to-date, comprehensive, and collaborative.

Some of the appeal is that it's fun to read about AI cheating at tasks in unexpected ways (I... (read more)

I think Simulators mostly says obvious and uncontroversial things, but added to the conversation by pointing them out for those who haven't noticed and introducing words for those who struggle to articulate. IMO people that perceive it as making controversial claims have mostly misunderstood its object-level content, although sometimes they may have correctly hallucinated things that I believe or seriously entertain. Others have complained that it only says obvious things, which I agree with in a way, but seeing as many upvoted it or said they found it ill... (read more)

I really liked this post in that it seems to me to have tried quite seriously to engage with a bunch of other people's research, in a way that I feel like is quite rare in the field, and something I would like to see more of. 

One of the key challenges I see for the rationality/AI-Alignment/EA community is the difficulty of somehow building institutions that are not premised on the quality or tractability of their own work. My current best guess is that the field of AI Alignment has made very little progress in the last few years, which is really not w... (read more)

I still think this is great. Some minor updates, and an important note:

Minor updates: I'm a bit less concerned about AI-powered propaganda/persuasion than I was at the time, not sure why. Maybe I'm just in a more optimistic mood. See this critique for discussion. It's too early to tell whether reality is diverging from expectation on this front. I had been feeling mildly bad about my chatbot-centered narrative, as of a month ago, but given how ChatGPT was received I think things are basically on trend.
Diplomacy happened faster than I expected, though in a ... (read more)

In this post, the author presents a case for replacing expected utility theory with some other structure which has no explicit utility function, but only quantities that correspond to conditional expectations of utility.

To provide motivation, the author starts from what he calls the "reductive utility view", which is the thesis he sets out to overthrow. He then identifies two problems with the view.

The first problem is about the ontology in which preferences are defined. In the reductive utility view, the domain of the utility function is the set of possib... (read more)

The work linked in this post was IMO the most important work done on understanding neural networks at the time it came out, and it has also significantly changed the way I think about optimization more generally.

That said, there's a lot of "noise" in the linked papers; it takes some digging to see the key ideas and the data backing them up, and there's a lot of space spent on things which IMO just aren't that interesting at all. So, I'll summarize the things which I consider central.

When optimizing an overparameterized system, there are many many different... (read more)

This post is the best overview of the field so far that I know of. I appreciate how it frames things in terms of outer/inner alignment and training/performance competitiveness--it's very useful to have a framework with which to evaluate proposals and this is a pretty good framework I think.

Since it was written, this post has been my go-to reference for getting other people up to speed on what the current AI alignment strategies look like (even though this post isn't exhaustive). Also, I've referred back to it myself several times. I learned a lot from... (read more)

This post states the problem of gradient hacking. It is valuable in that this problem is far from obvious, and if plausible, very dangerous. On the other hand, the presentation doesn’t go into enough detail, and so leaves gradient hacking open to attacks and confusion. Thus instead of just reviewing this post, I would like to clarify certain points, while interweaving my criticisms of the way gradient hacking was initially stated, and explaining why I consider this problem so important.

(Caveat: I’m not pretending that any of my objections are unknown to E... (read more)

As with the CCS post, I'm reviewing both the paper and the post, though the majority of the review is on the paper. Writing this quickly (total time on review: ~1.5h), but I expect to be willing to defend the points being made --

There's a lot of reasons I like the work. It's an example of:

  1. Actually poking inside a real model. A lot of the mech interp work in early-mid 2022 was focused on getting a deep understanding of toy models trained on algorithmic tasks (at least in this community).[1] There was some effort at Redwood to do neuron-by-neuron replac
... (read more)

I'm glad I ran this survey, and I expect the overall agreement distribution probably still holds for the current GDM alignment team (or may have shifted somewhat in the direction of disagreement), though I haven't rerun the survey so I don't really know. Looking back at the "possible implications for our work" section, we are working on basically all of these things. 

Thoughts on some of the cruxes in the post based on last year's developments:

  • Is global cooperation sufficiently difficult that AGI would need to deploy new powerful technology to make it
... (read more)

What's the type signature of goals?

The type signature of goals is the overarching topic to which this post contributes. It can manifest in a lot of different ways in specific applications:

  • What's the type signature of human values?
  • What structure types should systems biologists or microscope AI researchers look for in supposedly-goal-oriented biological or ML systems?
  • Will AI be "goal-oriented", and what would be the type signature of its "goal"?

If we want to "align AI with human values", build ML interpretability tools, etc, then that's going to be pretty to... (read more)

This is my post.

How my thinking has changed

I've spent much of the last year thinking about the pedagogical mistakes I made here, and am writing the Reframing Impact sequence to fix them. While this post recorded my 2018-thinking on impact measurement, I don't think it communicated the key insights well. Of course, I'm glad it seems to have nonetheless proven useful and exciting to some people!

If I were to update this post, it would probably turn into a rehash of Reframing Impact. Instead, I'll just briefly state the argument as I would present it today.

... (read more)

IMO, this post makes several locally correct points, but overall fails to defeat the argument that misaligned AIs are somewhat likely to spend (at least) a tiny fraction of resources (e.g., between 1/million and 1/trillion) to satisfy the preferences of currently existing humans.

AFAICT, this is the main argument it was trying to argue against, though it shifts to arguing about half of the universe (an obviously vastly bigger share) halfway through the piece.[1]

When it returns to arguing about the actual main question (a tiny fraction of resources) at the e... (read more)

I've used the term "safetywashing" at least once every week or two in the last year. I don't know whether I've picked it up from this post, but it still seems good to have an explanation of a term that is this useful and this common that people are exposed to.

This post is an excellent distillation of a cluster of past work on malignness of Solomonoff Induction, which has become a foundational argument/model for inner agency and malign models more generally.

I've long thought that the malignness argument overlooks some major counterarguments, but I never got around to writing them up. Now that this post is up for the 2020 review, seems like a good time to walk through them.

In Solomonoff Model, Sufficiently Large Data Rules Out Malignness

There is a major outside-view reason to expect that the Solomonoff-is-malign ar... (read more)

Self-Review: After a while of being insecure about it, I'm now pretty fucking proud of this paper, and think it's one of the coolest pieces of research I've personally done. (I'm going to both review this post, and the subsequent paper). Though, as discussed below, I think people often overrate it.

Impact

The main impact IMO is proving that mechanistic interpretability is actually possible, that we can take a trained neural network and reverse-engineer non-trivial and unexpected algorithms from it. In particular, I think by focusing on grokking I (semi-acci... (read more)

I still think this post is correct in spirit, and was part of my journey towards good understanding of neuroscience, and promising ideas in AGI alignment / safety.

But there are a bunch of little things that I got wrong or explained poorly. Shall I list them?

First, my "neocortex vs subcortex" division eventually developed into "learning subsystem vs steering subsystem", with the latter being mostly just the hypothalamus and brainstem, and the former being everything else, particularly the whole telencephalon and cerebellum. The main difference is that the "... (read more)

In my personal view, 'Shard theory of human values' illustrates both the upsides and pathologies of the local epistemic community.

The upsides
- the majority of the claims are true or at least approximately true
- "shard theory" as a social phenomenon reached critical mass making the ideas visible to the broader alignment community, which works e.g. by talking about them in person, votes on LW, series of posts,...
- shard theory coined a number of locally memetically fit names or phrases, such as 'shards'
- part of the success leads at some people in the AGI labs to... (read more)

This post snuck up on me.

The first time I read it, I was underwhelmed.  My reaction was: "well, yeah, duh.  Isn't this all kind of obvious if you've worked with GPTs?  I guess it's nice that someone wrote it down, in case anyone doesn't already know this stuff, but it's not going to shift my own thinking."

But sometimes putting a name to what you "already know" makes a whole world of difference.

Before I read "Simulators," when I'd encounter people who thought of GPT as an agent trying to maximize something, or people who treated MMLU-like one... (read more)

I think this post is incredibly useful as a concrete example of the challenges of seemingly benign powerful AI, and makes a compelling case for serious AI safety research being a prerequisite to any safe further AI development. I strongly dislike part 9, as painting the Predict-o-matic as consciously influencing others' personalities at the expense of short-term prediction error seems contradictory to the point of the rest of the story. I suspect I would dislike part 9 significantly less if it was framed in terms of a strategy to maximize predictive accuracy.... (read more)

The material here is one seed of a worldview which I've updated toward a lot more over the past year. Some other posts which involve the theme include Science in a High Dimensional World, What is Abstraction?, Alignment by Default, and the companion post to this one Book Review: Design Principles of Biological Circuits.

Two ideas unify all of these:

  1. Our universe has a simplifying structure: it abstracts well, implying a particular kind of modularity.
  2. Goal-oriented systems in our universe tend to evolve a modular structure which reflects the structure of the u
... (read more)

I think this is still one of the most comprehensive and clear resources on counterpoints to x-risk arguments. I have referred to this post and pointed people to it a number of times. The most useful parts of the post for me were the outline of the basic x-risk case and section A on counterarguments to goal-directedness (this was particularly helpful for my thinking about threat models and understanding agency).

I think this post was quite helpful. I think it does a good job laying out a fairly complete picture of a pretty reasonable safety plan, and the main sources of difficulty. I basically agree with most of the points. Along the way, it makes various helpful points, for example introducing the "action risk vs inaction risk" frame, which I use constantly. This post is probably one of the first ten posts I'd send someone on the topic of "the current state of AI safety technology".

I think that I somewhat prefer the version of these arguments that I give in e.g. ... (read more)

I think it's a bit hard to tell how influential this post has been, though my best guess is "very". It's clear that sometime around when this post was published there was a pretty large shift in the strategies that I and a lot of other people pursued, with "slowing down AI" becoming a much more common goal for people to pursue.

I think (most of) the arguments in this post are good. I also think that when I read an initial draft of this post (around 1.5 years ago or so), and had a very hesitant reaction to the core strategy it proposes, that I was picking up... (read more)

When this post came out, I left a comment saying:

It is not for lack of regulatory ideas that the world has not banned gain-of-function research.

It is not for lack of demonstration of scary gain-of-function capabilities that the world has not banned gain-of-function research.

What exactly is the model by which some AI organization demonstrating AI capabilities will lead to world governments jointly preventing scary AI from being built, in a world which does not actually ban gain-of-function research?

Given how the past year has gone, I should probably lose at... (read more)

I find this post fairly uninteresting, and feel irritated when people confidently make statements about "simulacra." One problem is, on my understanding, that it doesn't really reduce the problem of how LLMs work. "Why did GPT-4 say that thing?" "Because it was simulating someone who was saying that thing." It does postulate some kind of internal gating network which chooses between the different "experts" (simulacra), so it isn't contentless, but... Yeah. 

Also I don't think that LLMs have "hidden internal intelligence", given e.g. LLMs trained on “A i... (read more)

The post is still largely up-to-date. In the intervening year, I mostly worked on the theory of regret bounds for infra-Bayesian bandits, and haven't made much progress on open problems in infra-Bayesian physicalism. On the other hand, I also haven't found any new problems with the framework.

The strongest objection to this formalism is the apparent contradiction between the monotonicity principle and the sort of preferences humans have. While my thinking about this problem evolved a little, I am still at a spot where every solution I know requires biting a... (read more)

I hadn't realized this post was nominated, partially because of my comment, so here's a late review. I basically continue to agree with everything I wrote then, and I continue to like this post for those reasons, and so I support including it in the LW Review.

Since writing the comment, I've come across another argument for thinking about intent alignment -- it seems like a "generalization" of assistance games / CIRL, which itself seems like a formalization of an aligned agent in a toy setting. In assistance games, the agent explici... (read more)

I generally endorse the claims made in this post and the overall analogy. Since this post was written, there are a few more examples I can add to the categories for slow takeoff properties. 

Learning from experience

  • The UK procrastinated on locking down in response to the Alpha variant due to political considerations (not wanting to "cancel Christmas"), though it was known that timely lockdowns are much more effective.
  • Various countries reacted to Omicron with travel bans after they already had community transmission (e.g. Canada and the UK), while it wa
... (read more)

Review by the author:

I continue to endorse the contents of this post.

I don't really think about the post that much, but the post expresses a worldview that shapes how I do my research - that agency is a mechanical fact about the workings of a system.

To me, the main contribution of the post is setting up a question: what's a good definition of optimisation that avoids the counterexamples of the post? Ideally, this definition would refer or correspond to the mechanistic properties of the system, so that people could somehow statically determine whether a giv

... (read more)

+9. This is a powerful set of arguments pointing out how humanity will literally go extinct soon due to AI development (or have something similarly bad happen to us). A lot of thought and research went into an understanding of the problem that can produce this level of understanding of the problems we face, and I'm extremely glad it was written up.

This is IMO actually a really important topic, and this is one of the best posts on it. I think it probably really matters whether the AIs will try to trade with us or care about our values even if we had little chance of making our actions with regards to them conditional on whether they do. I found the arguments in this post convincing, and have linked many people to it since it came out. 

This was one of those posts that I dearly wish somebody else besides me had written, but nobody did, so here we are. I have no particular expertise. (But then again, to some extent, maybe nobody does?)

I basically stand by everything I wrote here. I remain pessimistic for reasons spelled out in this post, but I also still have a niggling concern that I haven’t thought these things through carefully enough, and I often refer to this kind of stuff as “an area where reasonable people can disagree”.

If I were rewriting this post today, three changes I’d make wou... (read more)

The post is influential, but makes multiple somewhat confused claims and led many people to become confused. 

The central confusion stems from the fact that genetic evolution already created a lot of control circuitry before inventing cortex, and did the obvious thing to 'align' the evolutionarily newer areas: bind them to the old circuitry via interoceptive inputs. By this mechanism, the genome is able to 'access' a lot of evolutionarily relevant beliefs and mental models. The trick is the higher/more distant to genome models are learned in part to predict in... (read more)

Returning to this essay, it continues to be my favorite Paul post (even What Failure Looks Like only comes second), and I think it's a better way to engage with Paul's work than anything else (including the Eliciting Latent Knowledge document, which feels less grounded in the x-risk problem, is less in Paul's native language, and gets detailed on just one idea for 10x the space thus communicating less of the big picture research goal). I feel I can understand all the arguments made in this post. I think this should be mandatory reading before reading Elici... (read more)

I’ll set aside what happens “by default” and focus on the interesting technical question of whether this post is describing a possible straightforward-ish path to aligned superintelligent AGI.

The background idea is “natural abstractions”. This is basically a claim that, when you use an unsupervised world-model-building learning algorithm, its latent space tends to systematically learn some patterns rather than others. Different learning algorithms will converge on similar learned patterns, because those learned patterns are a property of the world, not an ... (read more)

Insofar as the AI Alignment Forum is part of the Best-of-2018 Review, this post deserves to be included. It's the friendliest explanation of MIRI's research agenda (as of 2018) that currently exists.

I view this post as providing value in three (related) ways:

  1. Making a pedagogical advancement regarding the so-called inner alignment problem
  2. Pointing out that a common view of "RL agents optimize reward" is subtly wrong
  3. Pushing for thinking mechanistically about cognition-updates


Re 1: I first heard about the inner alignment problem through Risks From Learned Optimization and popularizations of the work. I didn't truly comprehend it - sure, I could parrot back terms like "base optimizer" and "mesa-optimizer", but it didn't click. I was confused.

Some mon... (read more)

This post's point still seems correct, and it still seems important--I refer to it at least once a week.

I think this point is really crucial, and I was correct to make it, and it continues to explain a lot of disagreements about AI safety.

This post aims to clarify the definitions of a number of concepts in AI alignment introduced by the author and collaborators. The concepts are interesting, and some researchers evidently find them useful. Personally, I find the definitions confusing, but I did benefit a little from thinking about this confusion. In my opinion, the post could greatly benefit from introducing mathematical notation[1] and making the concepts precise at least in some very simplistic toy model.

In the following, I'll try going over some of the definitions and explicating my unde... (read more)

An Orthodox Case Against Utility Functions was a shocking piece to me. Abram spends the first half of the post laying out a view he suspects people hold, but he thinks is clearly wrong, which is a perspective that approaches things "from the starting-point of the universe". I felt dread reading it, because it was a view I held at the time, and I used it as a key background perspective when I discussed Bayesian reasoning. The rest of the post lays out an alternative perspective that "starts from the standpoint of the agent". Instead of my beliefs being about t... (read more)

(I am the author)

I still like & stand by this post. I refer back to it constantly. It does two things:

1. Argue that an AI-induced point of no return could come significantly before, or significantly after, world GDP growth accelerates--and indeed will probably come before!

2. Argue that we shouldn't define timelines and takeoff speeds in terms of economic growth. So, against "is there a 4 year doubling before a 1 year doubling?" and against "When will we have TAI = AI capable of doubling the economy in 4 years if deployed?"

I think both things are pretty impo... (read more)

If this post is selected, I'd like to see the followup made into an addendum—I think it adds a very important piece, and it should have been nominated itself.

In a field like alignment or embedded agency, it's useful to keep a list of one or two dozen ideas which seem like they should fit neatly into a full theory, although it's not yet clear how. When working on a theoretical framework, you regularly revisit each of those ideas, and think about how it fits in. Every once in a while, a piece will click, and another large chunk of the puzzle will come together.

Selection vs control is one of those ideas. It seems like it should fit neatly into a full theory, but it's not yet clear what that will look like. I revis... (read more)

I still endorse the breakdown of "sharp left turn" claims in this post. Writing this helped me understand the threat model better (or at all) and make it a bit more concrete.

This post could be improved by explicitly relating the claims to the "consensus" threat model summarized in Clarifying AI X-risk. Overall, SLT seems like a special case of that threat model, which makes a subset of the SLT claims: 

  • Claim 1 (capabilities generalize far) and Claim 3 (humans fail to intervene), but not Claims 1a/b (simultaneous / discontinuous generalization) or Claim
... (read more)

I continue to endorse this categorization of threat models and the consensus threat model. I often refer people to this post and use the "SG + GMG → MAPS" framing in my alignment overview talks. I remain uncertain about the likelihood of the deceptive alignment part of the threat model (in particular the requisite level of goal-directedness) arising in the LLM paradigm, relative to other mechanisms for AI risk. 

In terms of adding new threat models to the categorization, the main one that comes to mind is Deep Deceptiveness (let's call it Soares2), whi... (read more)

This post consists of comments on summaries of a debate about the nature and difficulty of the alignment problem. The original debate was between Eliezer Yudkowsky and Richard Ngo, but this post does not contain the content from that debate. This post consists mostly of commentary by Jaan Tallinn on that debate, with comments by Eliezer.

The post provides a kind of fascinating level of insight into true insider conversations about AI alignment. How do Eliezer and Jaan converse about alignment? Sure, this is a public setting, so perhaps they communicate differentl... (read more)

I've written a bunch elsewhere about object-level thoughts on ELK. For this review, I want to focus instead on meta-level points.

I think ELK was very well-made; I think it did a great job of explaining itself with lots of surface area, explaining a way to think about solutions (the builder-breaker cycle), bridging the gap between toy demonstrations and philosophical problems, and focusing lots of attention on the same thing at the same time. In terms of impact on the growth and development of the AI safety community, I think this is one of the most importa... (read more)

Introduction to Cartesian Frames is a piece that also gave me a new philosophical perspective on my life. 

I don't know how to simply describe it. I don't know what even to say here. 

One thing I can say is that the post formalized the idea of having "more agency" or "less agency", in terms of "what facts about the world can I force to be true?". The more I approach the world by stating things that are going to happen, that I can't change, the more I'm boxing-in my agency over the world. The more I treat constraints as things I could fight to chang... (read more)

This post is still endorsed, it still feels like a continually fruitful line of research. A notable aspect of it is that, as time goes on, I keep finding more connections and crisper ways of viewing things which means that for many of the further linked posts about inframeasure theory, I think I could explain them from scratch better than the existing work does. One striking example is that the "Nirvana trick" stated in this intro (to encode nonstandard decision-theory problems), has transitioned from "weird hack that happens to work" to "pops straight out... (read more)

Why This Post Is Interesting

This post takes a previously-very-conceptually-difficult alignment problem, and shows that we can model this problem in a straightforward and fairly general way, just using good ol' Bayesian utility maximizers. The formalization makes the Pointers Problem mathematically legible: it's clear what the problem is, it's clear why the problem is important and hard for alignment, and that clarity is not just conceptual but mathematically precise.

Unfortunately, mathematical legibility is not the same as accessibility; the post does have... (read more)

Ajeya's timelines report is the best thing that's ever been written about AI timelines imo. Whenever people ask me for my views on timelines, I go through the following mini-flowchart:

1. Have you read Ajeya's report?

--If yes, launch into a conversation about the distribution over 2020's training compute and explain why I think the distribution should be substantially to the left, why I worry it might shift leftward faster than she projects, and why I think we should use it to forecast AI-PONR instead of TAI.

--If no, launch into a conversation about Ajey... (read more)

This post is a huge contribution, giving a simpler and shorter explanation of a critical topic with a far clearer context, and it has been useful to point people to as an alternative to the main sequence. I wouldn't promote it as more important than the actual series, but I would suggest it as a strong alternative to including the full sequence in the 2020 Review. (Especially because I suspect that those who are very interested are likely to have read the full sequence, and most others will not even if it is included.)

One year later, I remain excited about this post, from its ideas, to its formalisms, to its implications. I think it helps us formally understand part of the difficulty of the alignment problem. This formalization of power and the Attainable Utility Landscape have together given me a novel frame for understanding alignment and corrigibility.

Since last December, I’ve spent several hundred hours expanding the formal results and rewriting the paper; I’ve generalized the theorems, added rigor, and taken great pains to spell out what the theorems do and do not ... (read more)

I thought this post and associated paper were worse than Richard's previous sequence "AGI safety from first principles", but despite that, I still think it's one of the best pieces of introductory content for AI X-risk. I've also updated that good communication around AI X-risk stuff will probably involve writing many specialized introductions that work within the epistemic frames and methodologies of many different communities, and I think this post does reasonably well at that for the ML community (though I am not a great judge of that).

This is a great complement to Eliezer's 'List of lethalities', in particular because, in cases of disagreement, the beliefs of most people working on the problem were, and still mostly are, closer to this post. Paul writing it provided a clear, well-written reference point and, with many others expressing their views in comments and other posts, helped make the beliefs in AI safety more transparent.

I still occasionally reference this post when talking to people who, after reading a bit about the debate (e.g. on social media), first form an oversimplified model of the... (read more)

Meta level: I wrote this post in 1-3 hours, and am very satisfied with the returns per unit time! I don't think this is the best or most robust post I could have written, and I think some of these theories of impact are much more important than others. But I think that just collecting a ton of these in the same place was a valuable thing to do, and have heard from multiple people who appreciated this post's existence! More importantly, it was easy and fun, and I personally want to take this as inspiration to find more, easy-to-write-yet-valuable things to d... (read more)

I haven't talked to that many academics about AI safety over the last year but I talked to more and more lawmakers, journalists, and members of civil society. In general, it feels like people are much more receptive to the arguments about AI safety. Turns out "we're building an entity that is smarter than us but we don't know how to control it" is quite intuitively scary. As you would expect, most people still don't update their actions but more people than anticipated start spreading the message or actually meaningfully update their actions (probably still less than 1 in 10 but better than nothing).

Since this post was written, OpenAI has done much more to communicate its overall approach to safety, making this post somewhat obsolete. At the time, I think it conveyed some useful information, although it was perceived as more defensive than I intended.

My main regret is bringing up the Anthropic split, since I was not able to do justice to the topic. I was trying to communicate that OpenAI maintained its alignment research capacity, but should have made that point without mentioning Anthropic.

Ultimately I think the post was mostly useful for sparking some interesting discussion in the comments.

I think this post makes a true and important point, a point that I also bring up from time to time.

I do have a complaint though: I think the title (“Deep Learning Systems Are Not Less Interpretable Than Logic/Probability/Etc”) is too strong. (This came up multiple times in the comments.)

In particular, suppose it takes N unlabeled parameters to solve a problem with deep learning, and it takes M unlabeled parameters to solve the same problem with probabilistic programming. And suppose that M<N, or even M<<N, which I think is generally plausible.

If P... (read more)

I think this point is very important, and I refer to it constantly.

I wish that I'd said "the prototypical AI catastrophe is either escaping from the datacenter or getting root access to it" instead (as I noted in a comment a few months ago).

  • Paul's post on takeoff speed had long been IMO the last major public step in the dialogue on this subject (not forgetting to honorably mention Katja's crazy discontinuous progress examples and Kokotajlo's arguments against using GDP as a metric), and I found it exceedingly valuable to read how it reads to someone else who has put in a great deal of work into figuring out what's true about the topic, thinks about it in very different ways, and has come to different views on it. I found this very valuable for my own understanding of the subject, and I felt I
... (read more)

I haven't had time to reread this sequence in depth, but I wanted to at least touch on how I'd evaluate it. It seems to be aiming to be both a good introductory sequence, while being a "complete and compelling case I can for why the development of AGI might pose an existential threat".

The question is who this sequence is for, what its goal is, and how it compares to other writing targeting similar demographics.

Some writing that comes to mind to compare/contrast it with includes:

... (read more)

I wrote this relatively early in my journey of self-studying neuroscience. Rereading this now, I guess I'm only slightly embarrassed to have my name associated with it, which isn’t as bad as I expected going in. Some shifts I’ve made since writing it (some of which are already flagged in the text):

  • New terminology part 1: Instead of “blank slate” I now say “learning-from-scratch”, as defined and discussed here.
  • New terminology part 2: “neocortex vs subcortex” → “learning subsystem vs steering subsystem”, with the former including the whole telencephalon and
... (read more)

(I am the author)

I still like & endorse this post. When I wrote it, I hadn't read more than the wiki articles on the subject. But then afterwards I went and read 3 books (written by historians) about it, and I think the original post held up very well to all this new info. In particular, the main critique the post got -- that disease was more important than I made it sound, in a way that undermined my conclusion -- seems to have been pretty wrong. (See e.g. this comment thread, these follow up posts)

So, why does it matter? What contribution did this po... (read more)

We all saw the GPT performance scaling graphs in the papers, and we all stared at them and imagined extending the trend for another five OOMs or so... but then Lanrian went and actually did it! Answered the question we had all been asking! And rigorously dealt with some technical complications along the way.

I've since referred to this post a bunch of times. It's my go-to reference when discussing performance scaling trends.

I think Redwood's classifier project was a reasonable project to work towards, and I think this post was great because it both displayed a bunch of important virtues and avoided doubling down on trying to always frame one's research in a positive light. 

I was really very glad to see this update come out at the time, and it made me hopeful that we can have a great discourse on LessWrong and AI Alignment where when people sometimes overstate things, they can say "oops", learn and move on. My sense is Redwood made a pretty deep update from the first post they published (and this update), and hasn't made any similar errors since then.

I found this post to be a clear and reasonable-sounding articulation of one of the main arguments for there being catastrophic risk from AI development. It helped me with my own thinking to an extent. I think it has a lot of shareability value.

I wrote a review here. There, I identify the main generators of Christiano's disagreement with Yudkowsky[1] and add some critical commentary. I also frame it in terms of a broader debate in the AI alignment community.

  1. ^

    I divide those into "takeoff speeds", "attitude towards prosaic alignment" and "the metadebate" (the last one is about what kind of debate norms should we have about this or what kind of arguments should we listen to.)

Retrospective: I think this is the most important post I wrote in 2022. I deeply hope that more people benefit by fully integrating these ideas into their worldviews. I think there's a way to "see" this lesson everywhere in alignment: for it to inform your speculation about everything from supervised fine-tuning to reward overoptimization. To see past mistaken assumptions about how learning processes work, and to think for oneself instead. This post represents an invaluable tool in my mental toolbelt.

I wish I had written the key lessons and insights more p... (read more)

I consider this post one of the most important ever written on timelines and AI doom scenarios. Not because it's perfect (some of its assumptions are unconvincing), but because it highlights a key aspect of AI Risk and the alignment problem which is so easy to miss coming from a rationalist mindset: it doesn't require an agent to take over the whole world. It is not about agency.

What RAAPs show instead is that even in a purely structural setting, where agency doesn't matter, these problems still crop up!

This insight was already present in Drexle... (read more)

This post's main contribution is the formalization of game-theoretic defection as gaining personal utility at the expense of coalitional utility.

Rereading, the post feels charmingly straightforward and self-contained. The formalization feels obvious in hindsight, but I remember being quite confused about the precise difference between power-seeking and defection—perhaps because popular examples of taking over the world are also defections against the human/AI coalition. I now feel cleanly deconfused about this distinction. And if I was confused about... (read more)

How do you review a post that was not written for you? I’m already doing research in AI Alignment, and I don’t plan on creating a group of collaborators for the moment. Still, I found some parts of this useful.

Maybe that’s how you do it: by taking different profiles, and running through the most useful advice for each profile from the post. Let’s do that.

Full time researcher (no team or MIRIx chapter)

For this profile (which is mine, by the way), the most useful piece of advice from this post comes from the model of transmitters and receivers. I’m convinced... (read more)

When I think of useful concepts in AI alignment that I frequently refer to, there are a bunch from the olden days (e.g. “instrumental convergence”, “treacherous turn”, …), and a bunch of idiosyncratic ones that I made up myself for my own purposes, and just a few others, one of which is “concept extrapolation”. For example I talk about it here. (Others in that last category include “goal misgeneralization” [here’s how I use the term] (which is related to concept extrapolation) and “inner and outer alignment” [here’s how I use the term].)

So anyway, in the c... (read more)

I was impressed by this post. I don't have the mathematical chops to evaluate it as math -- probably it's fairly trivial -- but I think it's rare for math to tell us something so interesting and important about the world, as this seems to do. See this comment where I summarize my takeaways; is it not quite amazing that these conclusions about artificial neural nets are provable (or provable-given-plausible-conditions) rather than just conjectures-which-seem-to-be-borne-out-by-ANN-behavior-so-far? (E.g. conclusions like "Neural nets trained on very complex ... (read more)

In many ways, this post is frustrating to read. It isn't straightforward, it needlessly insults people, and it mixes irrelevant details with the key ideas.

And yet, as with many of Eliezer's posts, its key points are right.

What this post does is uncover the main epistemological mistakes made by almost everyone trying their hands at figuring out timelines. Among others, there is:

  • Taking arbitrary guesses within a set of options that you don't have enough evidence to separate
  • Piling on arbitrary assumption on arbitrary assumption, leading to completely uninforma
... (read more)

I trust past-me to have summarized CAIS much better than current-me; back when this post was written I had just finished reading CAIS for the third or fourth time, and I haven't read it since. (This isn't a compliment -- I read it multiple times because I had a lot of trouble understanding it.)

I've put in two points of my own in the post. First:

(My opinion: I think this isn't engaging with the worry with RL agents -- typically, we're worried about the setting where the RL agent is learning or planning at test time, which can happen in learn-to-learn and on

... (read more)

I think it was important to have something like this post exist. However, I now think it's not fit for purpose. In this discussion thread, rohinmshah, abramdemski and I end up spilling a lot of ink about a disagreement that ended up being at least partially because we took 'realism about rationality' to mean different things. rohinmshah thought that irrealism would mean that the theory of rationality was about as real as the theory of liberalism, abramdemski thought that irrealism would mean that the theory of rationality would be about as real as the theo

... (read more)

I think this point is incredibly important and quite underrated, and safety researchers often do way dumber work because they don't think about it enough.

IMO the biggest contributions of this post were popularizing a phrase for the concept of mode collapse, both in the context of LLMs and more generally, and serving as an example of a certain flavor of empirical research on LLMs. Other than that it's just a case study whose exact details I don't think are so important.

Edit: This post introduces more useful and generalizable concepts than I remembered when I initially made the review.

To elaborate on what I mean by the value of this post as an example of a certain kind of empirical LLM research: I don't know of much pu... (read more)

I think this is an excellent response (I'd even say, companion piece) to Joe Carlsmith's also-excellent report on the risk from power-seeking AI. On a brief re-skim I think I agree with everything Nate says, though I'd also have a lot more to add and I'd shift emphasis around a bit. (Some of the same points I did in fact make in my own review of Joe's report.)

Why is it important for there to be a response? Well, the 5% number Joe came to at the end is just way too low. Even if you disagree with me about that, you'll concede that a big fraction of the ratio... (read more)

"Search versus design" explores the basic way we build and trust systems in the world. A few notes: 

  • My favorite part is the definitions about an abstraction layer being an artifact combined with a helpful story about it. It helps me see the world as a series of abstraction layers. We're not actually close to true reality, we are very much living within abstraction layers — the simple stories we are able to tell about the artefacts we build. A world built by AIs will be far less comprehensible than the world we live in today. (Much more like biology is
... (read more)