Paul Christiano paints a vivid and disturbing picture of how AI could go wrong, not with sudden violent takeover, but through a gradual loss of human control as AI systems optimize for the wrong things and develop influence-seeking behaviors.
The original draft of Ajeya Cotra's report on biological anchors for AI timelines. The report includes quantitative models and forecasts, though the specific numbers were still in flux at the time. Ajeya cautions against sharing specific conclusions widely, as they don't yet reflect Open Philanthropy's official stance.
A story in nine parts about someone creating an AI that predicts the future, and multiple people who wonder about the implications. What happens when the predictions influence what future happens?
Will AGI progress gradually or rapidly? I think the disagreement is mostly about what happens before we build powerful AGI.
I think weaker AI systems will already have radically transformed the world. This is strategically relevant because I'm imagining AGI strategies playing out in a world where everything is already going crazy, while other people are imagining AGI strategies playing out in a world that looks kind of like 2018 except that someone is about to get a decisive strategic advantage.
Historically, people worried about extinction risk from artificial intelligence have not seriously considered deliberately slowing down AI progress as a solution. Katja Grace argues that this strategy should be considered more seriously, and that common objections to it are incorrect or exaggerated.
A "good project" in AGI research needs: (1) trustworthy command, (2) research closure, (3) strong operational security, (4) commitment to the common good, (5) an alignment mindset, and (6) requisite resource levels.
The post goes into detail on what minimal, adequate, and good performance looks like.
A fictional story about an AI researcher who leaves an experiment running overnight.
Imagine if all computers in 2020 suddenly became 12 orders of magnitude faster. What could we do with AI then? Would we achieve transformative AI? Daniel Kokotajlo explores this thought experiment as a way to get intuition about AI timelines.
Ajeya Cotra, Daniel Kokotajlo, and Ege Erdil discuss their differing AI forecasts. Key topics include the importance of transfer learning, AI's potential to accelerate R&D, and the expected trajectory of AI capabilities. They explore concrete scenarios and how observations might update their views.
Daniel Kokotajlo presents his best attempt at a concrete, detailed guess of what 2022 through 2026 will look like, as an exercise in forecasting. It includes predictions about the development of AI, alongside changes in the geopolitical arena.
"Human feedback on diverse tasks" could lead to transformative AI while requiring little innovation beyond current techniques. But the natural course of this path seems likely to lead to full-blown AI takeover.
Tom Davidson analyzes AI takeoff speeds – how quickly AI capabilities might improve as they approach human-level AI. He puts ~25% probability on takeoff lasting less than 1 year, and ~50% on it lasting less than 3 years. But he also argues we should assign some probability to takeoff lasting more than 5 years.
Richard Ngo lays out the core argument for why AGI could be an existential threat: we might build AIs that are much smarter than humans and act autonomously to pursue large-scale goals; if those goals conflict with ours, such AIs could take control of humanity's future. He aims to defend this argument in detail from first principles.
In the span of a few years, some minor European explorers (later known as the conquistadors) encountered, conquered, and enslaved the peoples of several huge regions of the world. Daniel Kokotajlo argues this shows the plausibility of a small AI system rapidly taking over the world, even without overwhelming technological superiority.
In thinking about AGI safety, I’ve found it useful to build a collection of different viewpoints from people that I respect, such that I can think from their perspective. I will often try to compare what an idea feels like when I put on my Paul Christiano hat, to when I put on my Scott Garrabrant hat. Recently, I feel like I’ve gained a "Chris Olah" hat, which often looks at AI through the lens of interpretability.
The goal of this post is to try to give that hat to more people.
A vignette in which AI alignment turns out to be hard, society handles AI more competently than expected, and the outcome is still worse than hoped.
Katja Grace provides a list of counterarguments to the basic case for existential risk from superhuman AI systems. She examines potential gaps in arguments about AI goal-directedness, AI goals being harmful, and AI superiority over humans. While she sees these as serious concerns, she doesn't find the case for overwhelming likelihood of existential risk convincing based on current arguments.
Eric Drexler's CAIS model suggests that before we get to a world with monolithic AGI agents, we will already have seen an intelligence explosion due to automated R&D. This reframes the problems of AI safety and has implications for what technical safety researchers should be doing. Rohin reviews and summarizes the model.
GDP isn't a great metric for AI timelines or takeoff speed because the relevant events (like AI alignment failure or progress towards self-improving AI) could happen before GDP growth accelerates visibly. Instead, we should focus on things like warning shots, heterogeneity of AI systems, risk awareness, multipolarity, and overall "craziness" of the world.
This post tells a few different stories in which humanity dies out as a result of AI technology, but where no single source of human or automated agency is the cause.
The field of AI alignment is growing rapidly, attracting more resources and mindshare each year. As it grows, more people will be incentivized to misleadingly portray themselves or their projects as more alignment-friendly than they are. Adam proposes "safetywashing" as the term for this practice.
Larger language models (LMs) like GPT-3 are certainly impressive, but nostalgebraist argues that their capabilities may not be quite as revolutionary as some claim. He examines the evidence around LM scaling and argues we should be cautious about extrapolating current trends too far into the future.
AI safety researchers have different ideas of what success would look like. This post explores five different AI safety "success stories" that researchers might be aiming for and compares them along several dimensions.
Nate Soares gives feedback to Joe Carlsmith on his paper "Is power-seeking AI an existential risk?". Nate agrees with Joe's conclusion of at least a 5% chance of catastrophe by 2070, but thinks this number is much too low. Nate gives his own probability estimates and explains various points of disagreement.
The most important date is not the day the AI takes over; instead, it's the point of no return: the day we AI risk reducers lose the ability to significantly reduce AI risk. This might happen years before classic milestones like "World GWP doubles in four years" and "Superhuman AGI is deployed."
Eliezer Yudkowsky recently criticized the OpenPhil draft report on AI timelines. Holden Karnofsky thinks Eliezer misunderstood the report in important ways, and defends the report's usefulness as a tool for informing (not determining) AI timelines.
The practice of extrapolating AI timelines from biological analogies has a long history of not working. Eliezer argues that this is because the relevant resource gets consumed differently, so base-rate arguments from resource consumption end up quite unhelpful in real life.
Timelines are inherently very difficult to predict accurately, until we are much closer to AGI.