Reviews 2022

+9. This is a powerful set of arguments pointing out how humanity will literally go extinct soon due to AI development (or have something similarly bad happen to us). A lot of thought and research went into reaching this level of understanding of the problems we face, and I'm extremely glad it was written up.

I view this post as providing value in three (related) ways:

  1. Making a pedagogical advancement regarding the so-called inner alignment problem
  2. Pointing out that a common view of "RL agents optimize reward" is subtly wrong
  3. Pushing for thinking mechanistically about cognition-updates

 

Re 1: I first heard about the inner alignment problem through Risks From Learned Optimization and popularizations of the work. I didn't truly comprehend it - sure, I could parrot back terms like "base optimizer" and "mesa-optimizer", but it didn't click. I was confused.

Some mon... (read more)

Someone working full-time on an approach to the alignment problem that they feel optimistic about, and writing annual reflections on their work, is something that has been sorely lacking. +4

As with the CCS post, I'm reviewing both the paper and the post, though the majority of the review is on the paper. Writing this quickly (total time on review: ~1.5h), but I expect to be willing to defend the points being made --

There are a lot of reasons I like the work. It's an example of:

  1. Actually poking inside a real model. A lot of the mech interp work in early-mid 2022 was focused on getting a deep understanding of toy models trained on algorithmic tasks (at least in this community).[1] There was some effort at Redwood to do neuron-by-neuron replac
... (read more)

This post seems like it was quite influential. This is basically a trivial review to allow the post to be voted on.

IMO, this post makes several locally correct points, but overall fails to defeat the argument that misaligned AIs are somewhat likely to spend (at least) a tiny fraction of resources (e.g., between 1/million and 1/trillion) to satisfy the preferences of currently existing humans.

AFAICT, this is the main argument it was trying to argue against, though it shifts to arguing about half of the universe (an obviously vastly bigger share) halfway through the piece.[1]

When it returns to arguing about the actual main question (a tiny fraction of resources) at the e... (read more)

I really enjoyed this sequence; it provides useful guidance on how to combine different sources of knowledge and intuitions to reason about future AI systems. Great resource on how to think about alignment for an ML audience.

In my personal view, 'Shard theory of human values' illustrates both the upsides and pathologies of the local epistemic community.

The upsides
- the majority of the claims are true or at least approximately true
- "shard theory" as a social phenomenon reached critical mass making the ideas visible to the broader alignment community, which works e.g. by talking about them in person, votes on LW, series of posts,...
- shard theory coined a number of locally memetically fit names or phrases, such as 'shards'
- part of the success leads at some people in the AGI labs to... (read more)

These kinds of overview posts are very valuable, and I think this one is as well. I think it was quite well executed, and I've seen it linked a lot, especially to newer people trying to orient to the state of the AI Alignment field, and the ever growing number of people working in it. 

This is IMO actually a really important topic, and this is one of the best posts on it. I think it probably really matters whether the AIs will try to trade with us or care about our values even if we had little chance of making our actions with regards to them conditional on whether they do. I found the arguments in this post convincing, and have linked many people to it since it came out. 

I am not a huge fan of shard theory, but other people seem into it a bunch. This post captured at least a bunch of my problems with shard theory (though not all of them, and it's not a perfect post). This means the post at least has saved me some writing effort a bunch of times. 

This is a review of both the paper and the post itself, and turned more into a review of the paper (on which I think I have more to say) as opposed to the post. 

Disclaimer: this isn’t actually my area of expertise inside of technical alignment, and I’ve done very little linear probing myself. I’m relying primarily on my understanding of others’ results, so there’s some chance I’ve misunderstood something. Total amount of work on this review: ~8 hours, though about 4 of those were refreshing my memory of prior work and rereading the paper. 

TL... (read more)

This post makes a pretty straightforward and important point, and I've referenced it a few times since then. It hasn't made a huge impact, and it isn't the best explanation, but I think it's a good one that covers the basics, and I think it could be linked to more frequently.

I think Redwood's classifier project was a reasonable project to work towards, and I think this post was great because it both displayed a bunch of important virtues and avoided doubling down on trying to always frame one's research in a positive light. 

I was really very glad to see this update come out at the time, and it made me hopeful that we can have a great discourse on LessWrong and AI Alignment where when people sometimes overstate things, they can say "oops", learn and move on. My sense is Redwood made a pretty deep update from the first post they published (and this update), and hasn't made any similar errors since then.

I am not that excited about marginal interpretability research, but I have nevertheless linked to this a few times. I think this post both clarifies a bunch of inroads into making marginal interpretability progress and maps out how long the journey is between where we are and where many important targets are for using interpretability methods to reduce AI x-risk.

Separately, besides my personal sense that marginal interpretability research is not a great use of most researchers' time, there are really a lot of people trying to get started doing work on A... (read more)

I disagree with the conclusion of this post, but still found it a valuable reference for a bunch of arguments I do think are important to model in the space.

I thought this post and associated paper were worse than Richard's previous sequence "AGI safety from first principles", but despite that, I still think it's one of the best pieces of introductory content for AI X-risk. I've also updated that good communication around AI X-risk stuff will probably involve writing many specialized introductions that work within the epistemic frames and methodologies of many different communities, and I think this post does reasonably well at that for the ML community (though I am not a great judge of that).

I think this post was quite helpful. I think it does a good job laying out a fairly complete picture of a pretty reasonable safety plan, and the main sources of difficulty. I basically agree with most of the points. Along the way, it makes various helpful points, for example introducing the "action risk vs inaction risk" frame, which I use constantly. This post is probably one of the first ten posts I'd send someone on the topic of "the current state of AI safety technology".

I think that I somewhat prefer the version of these arguments that I give in e.g. ... (read more)

This was one of those posts that I dearly wish somebody else besides me had written, but nobody did, so here we are. I have no particular expertise. (But then again, to some extent, maybe nobody does?)

I basically stand by everything I wrote here. I remain pessimistic for reasons spelled out in this post, but I also still have a niggling concern that I haven’t thought these things through carefully enough, and I often refer to this kind of stuff as “an area where reasonable people can disagree”.

If I were rewriting this post today, three changes I’d make wou... (read more)

I've used the term "safetywashing" at least once every week or two in the last year. I don't know whether I've picked it up from this post, but it still seems good to have an explanation of a term that is this useful and this common that people are exposed to.

I found this post to be a clear and reasonable-sounding articulation of one of the main arguments for there being catastrophic risk from AI development. It helped me with my own thinking to an extent. I think it has a lot of shareability value.

I think this is still one of the most comprehensive and clear resources on counterpoints to x-risk arguments. I have referred to this post and pointed people to it a number of times. The most useful parts of the post for me were the outline of the basic x-risk case and section A on counterarguments to goal-directedness (this was particularly helpful for my thinking about threat models and understanding agency).

This post snuck up on me.

The first time I read it, I was underwhelmed.  My reaction was: "well, yeah, duh.  Isn't this all kind of obvious if you've worked with GPTs?  I guess it's nice that someone wrote it down, in case anyone doesn't already know this stuff, but it's not going to shift my own thinking."

But sometimes putting a name to what you "already know" makes a whole world of difference.

Before I read "Simulators," when I'd encounter people who thought of GPT as an agent trying to maximize something, or people who treated MMLU-like one... (read more)

I think it's a bit hard to tell how influential this post has been, though my best guess is "very". It's clear that sometime around when this post was published there was a pretty large shift in the strategies that I and a lot of other people pursued, with "slowing down AI" becoming a much more common goal for people to pursue.

I think (most of) the arguments in this post are good. I also think that when I read an initial draft of this post (around 1.5 years ago or so), and had a very hesitant reaction to the core strategy it proposes, that I was picking up... (read more)

I wrote a review here. There, I identify the main generators of Christiano's disagreement with Yudkowsky[1] and add some critical commentary. I also frame it in terms of a broader debate in the AI alignment community.

  1. ^

    I divide those into "takeoff speeds", "attitude towards prosaic alignment" and "the metadebate" (the last one is about what kind of debate norms should we have about this or what kind of arguments should we listen to.)

When this post came out, I left a comment saying:

It is not for lack of regulatory ideas that the world has not banned gain-of-function research.

It is not for lack of demonstration of scary gain-of-function capabilities that the world has not banned gain-of-function research.

What exactly is the model by which some AI organization demonstrating AI capabilities will lead to world governments jointly preventing scary AI from being built, in a world which does not actually ban gain-of-function research?

Given how the past year has gone, I should probably lose at... (read more)

I find this post fairly uninteresting, and feel irritated when people confidently make statements about "simulacra." One problem is, on my understanding, that it doesn't really reduce the problem of how LLMs work. "Why did GPT-4 say that thing?" "Because it was simulating someone who was saying that thing." It does postulate some kind of internal gating network which chooses between the different "experts" (simulacra), so it isn't contentless, but... Yeah. 

Also I don't think that LLMs have "hidden internal intelligence", given e.g. LLMs trained on “A i... (read more)

This is a great complement to Eliezer's 'List of lethalities', in particular because, in cases of disagreement, the beliefs of most people working on the problem were and still mostly are closer to this post. Paul writing it provided a clear, well-written reference point and, with many others expressing their views in comments and other posts, helped make the beliefs in AI safety more transparent.

I still occasionally reference this post when talking to people who, after reading a bit about the debate e.g. on social media, first form an oversimplified model of the... (read more)

I really liked this post in that it seems to me to have tried quite seriously to engage with a bunch of other people's research, in a way that I feel like is quite rare in the field, and something I would like to see more of. 

One of the key challenges I see for the rationality/AI-Alignment/EA community is the difficulty of somehow building institutions that are not premised on the quality or tractability of their own work. My current best guess is that the field of AI Alignment has made very little progress in the last few years, which is really not w... (read more)

I've been thinking about this post a lot since it first came out. Overall, I think its core thesis is wrong, and I've seen a lot of people make confident wrong inferences on the basis of it.

The core problem with the post was covered by Eliezer's post "GPTs are Predictors, not Imitators" (which was not written, I think, as a direct response, but which still seems to me to convey the core problem with this post):  

Imagine yourself in a box, trying to predict the next word - assign as much probability mass to the next token as possible - for all t

... (read more)

I think this point is incredibly important and quite underrated, and safety researchers often do way dumber work because they don't think about it enough.

(I'm just going to speak for myself here, rather than the other authors, because I don't want to put words in anyone else's mouth. But many of the ideas I describe in this review are due to other people.)

I think this work was a solid intellectual contribution. I think that the metric proposed for how much you've explained a behavior is the most reasonable metric by a pretty large margin.

The core contribution of this paper was to produce negative results about interpretability. This led to us abandoning work on interpretability a few months later, which I'm... (read more)

I really like this paper! This is one of my favourite interpretability papers of 2022, and has substantially influenced my research. I voted at 9 in the annual review. Specific things I like about it:

  • It really started the "narrow distribution" focused interpretability, just examining models on sentences of the form "John and Mary went to the store, John gave a bag to" -> " Mary". IMO this is a promising alternative focus to the "understand what model components mean on the full data distribution" mindset, and worth some real investment in. Model compo
... (read more)

Meta level: I wrote this post in 1-3 hours, and am very satisfied with the returns per unit time! I don't think this is the best or most robust post I could have written, and I think some of these theories of impact are much more important than others. But I think that just collecting a ton of these in the same place was a valuable thing to do, and have heard from multiple people who appreciated this post's existence! More importantly, it was easy and fun, and I personally want to take this as inspiration to find more, easy-to-write-yet-valuable things to d... (read more)

Self-Review: After a while of being insecure about it, I'm now pretty fucking proud of this paper, and think it's one of the coolest pieces of research I've personally done. (I'm going to both review this post, and the subsequent paper). Though, as discussed below, I think people often overrate it.

Impact: The main impact IMO is proving that mechanistic interpretability is actually possible, that we can take a trained neural network and reverse-engineer non-trivial and unexpected algorithms from it. In particular, I think by focusing on grokking I (semi-acci... (read more)

When I think of useful concepts in AI alignment that I frequently refer to, there are a bunch from the olden days (e.g. “instrumental convergence”, “treacherous turn”, …), and a bunch of idiosyncratic ones that I made up myself for my own purposes, and just a few others, one of which is “concept extrapolation”. For example I talk about it here. (Others in that last category include “goal misgeneralization” [here’s how I use the term] (which is related to concept extrapolation) and “inner and outer alignment” [here’s how I use the term].)

So anyway, in the c... (read more)

IMO the biggest contributions of this post were popularizing a phrase for the concept of mode collapse, in the context of LLMs and more generally, and serving as an example of a certain flavor of empirical research on LLMs. Other than that it's just a case study whose exact details I don't think are so important.

Edit: This post introduces more useful and generalizable concepts than I remembered when I initially made the review.

To elaborate on what I mean by the value of this post as an example of a certain kind of empirical LLM research: I don't know of much pu... (read more)

I think Simulators mostly says obvious and uncontroversial things, but added to the conversation by pointing them out for those who haven't noticed and introducing words for those who struggle to articulate. IMO people that perceive it as making controversial claims have mostly misunderstood its object-level content, although sometimes they may have correctly hallucinated things that I believe or seriously entertain. Others have complained that it only says obvious things, which I agree with in a way, but seeing as many upvoted it or said they found it ill... (read more)

I still endorse the breakdown of "sharp left turn" claims in this post. Writing this helped me understand the threat model better (or at all) and make it a bit more concrete.

This post could be improved by explicitly relating the claims to the "consensus" threat model summarized in Clarifying AI X-risk. Overall, SLT seems like a special case of that threat model, which makes a subset of the SLT claims: 

  • Claim 1 (capabilities generalize far) and Claim 3 (humans fail to intervene), but not Claims 1a/b (simultaneous / discontinuous generalization) or Claim
... (read more)

I continue to endorse this categorization of threat models and the consensus threat model. I often refer people to this post and use the "SG + GMG → MAPS" framing in my alignment overview talks. I remain uncertain about the likelihood of the deceptive alignment part of the threat model (in particular the requisite level of goal-directedness) arising in the LLM paradigm, relative to other mechanisms for AI risk. 

In terms of adding new threat models to the categorization, the main one that comes to mind is Deep Deceptiveness (let's call it Soares2), whi... (read more)

I'm glad I ran this survey, and I expect the overall agreement distribution probably still holds for the current GDM alignment team (or may have shifted somewhat in the direction of disagreement), though I haven't rerun the survey so I don't really know. Looking back at the "possible implications for our work" section, we are working on basically all of these things. 

Thoughts on some of the cruxes in the post based on last year's developments:

  • Is global cooperation sufficiently difficult that AGI would need to deploy new powerful technology to make it
... (read more)

An early paper that Anthropic then built on to produce their recent exciting results. I found the author's insight and detailed parameter tuning advice helpful.

The post is influential, but makes multiple somewhat confused claims and led many people to become confused. 

The central confusion stems from the fact that genetic evolution already created a lot of control circuitry before inventing cortex, and did the obvious thing to 'align' the evolutionarily newer areas: bind them to the old circuitry via interoceptive inputs. By this mechanism, the genome is able to 'access' a lot of evolutionarily relevant beliefs and mental models. The trick is the higher/more distant to genome models are learned in part to predict in... (read more)

I think this post was a good exercise to clarify my internal model of how I expect the world to look with strong AI. Obviously, most of the very specific predictions I make are too precise (which was clear at the time of writing) and won't play out exactly like that, but the underlying trends still seem plausible to me. For example, I expect some major misuse of powerful AI systems, rampant automation of labor that will displace many people and rob them of a sense of meaning, AI taking over the digital world years before taking over the physical world ... (read more)

I still stand behind most of the disagreements that I presented in this post. There was one prediction that would make timelines longer because I thought compute hardware progress was slower than Moore's law. I now mostly think this argument is wrong because it relies on FP32 precision. However, lower precision formats and tensor cores are the norm in ML, and if you take them into account, compute hardware improvements are faster than Moore's law. We wrote a piece with Epoch on this: https://epochai.org/blog/trends-in-machine-learning-hardware

If anything, ... (read more)

In a narrow technical sense, this post still seems accurate, but in a more general sense, it might have been slightly wrong / misleading.

In the post, we investigated different measures of FP32 compute growth and found that many of them were slower than Moore's law would predict. This made me personally believe that compute might be growing slower than people thought and most of the progress comes from throwing more money at larger and larger training runs. While most progress comes from investment scaling, I now think the true effective compute growth... (read more)

I haven't talked to that many academics about AI safety over the last year but I talked to more and more lawmakers, journalists, and members of civil society. In general, it feels like people are much more receptive to the arguments about AI safety. Turns out "we're building an entity that is smarter than us but we don't know how to control it" is quite intuitively scary. As you would expect, most people still don't update their actions but more people than anticipated start spreading the message or actually meaningfully update their actions (probably still less than 1 in 10 but better than nothing).

Since this post was written, OpenAI has done much more to communicate its overall approach to safety, making this post somewhat obsolete. At the time, I think it conveyed some useful information, although it was perceived as more defensive than I intended.

My main regret is bringing up the Anthropic split, since I was not able to do justice to the topic. I was trying to communicate that OpenAI maintained its alignment research capacity, but should have made that point without mentioning Anthropic.

Ultimately I think the post was mostly useful for sparking some interesting discussion in the comments.

Since this post was written, I feel like there's been a zeitgeist of "Distillation Projects." I don't know how causal this post was (I think in some sense the ecosystem was ripe for a Distillation Wave). But it seemed useful to think about how that wave played out.

Some of the results have been great. But many of the results have felt kinda meh to me, and I now have a bit of a flinch/ugh reaction when I see a post with "distillation" in its title.

Basically, good distillations are a highly skilled effort. It's sort of natural to write a distillation of... (read more)

I think this post makes a true and important point, a point that I also bring up from time to time.

I do have a complaint though: I think the title (“Deep Learning Systems Are Not Less Interpretable Than Logic/Probability/Etc”) is too strong. (This came up multiple times in the comments.)

In particular, suppose it takes N unlabeled parameters to solve a problem with deep learning, and it takes M unlabeled parameters to solve the same problem with probabilistic programming. And suppose that M<N, or even M<<N, which I think is generally plausible.

If P... (read more)

Retrospective: I think this is the most important post I wrote in 2022. I deeply hope that more people benefit by fully integrating these ideas into their worldviews. I think there's a way to "see" this lesson everywhere in alignment: for it to inform your speculation about everything from supervised fine-tuning to reward overoptimization. To see past mistaken assumptions about how learning processes work, and to think for oneself instead. This post represents an invaluable tool in my mental toolbelt.

I wish I had written the key lessons and insights more p... (read more)

This post's point still seems correct, and it still seems important--I refer to it at least once a week.

I think this point is very important, and I refer to it constantly.

I wish that I'd said "the prototypical AI catastrophe is either escaping from the datacenter or getting root access to it" instead (as I noted in a comment a few months ago).

I think this point is really crucial, and I was correct to make it, and it continues to explain a lot of disagreements about AI safety.

Comments on the outcomes of the post:

  • I'm reasonably happy with how this post turned out. I think it probably bought the Anthropic/superposition mechanistic interpretability agenda somewhere between 0.1 and 4 counterfactual months of progress, which feels like a win.
  • I think sparse autoencoders are likely to be a pretty central method in mechanistic interpretability work for the foreseeable future (which tbf is not very foreseeable).
  • Two parallel works used the method identified in the post (sparse autoencoders - SAEs) or a slight modification of it:
    • Cunningham et al.
... (read more)

I just gave this a re-read; I forgot what a trip it is to read the thoughts of Eliezer Yudkowsky. It continues to be some of my favorite stuff in recent years written on LessWrong.

It's hard to relate to the world with the level of mastery over basic ideas that Eliezer has. I don't mean by this to vouch that his perspective is certainly correct, but I believe it is at least possible, and so I think he aspires to a knowledge of reality that I rarely if ever aspire to. Reading it inspires me to really think about how the world works, and really figure out what I know and what I don't. +9

(And the smart people dialoguing with him here are good sports for keeping up their side of the argument.)