The Best of LessWrong

6Daniel Kokotajlo

Ajeya's timelines report is the best thing that's ever been written about AI timelines imo. Whenever people ask me for my views on timelines, I go through the following mini-flowchart:

1. Have you read Ajeya's report?
--If yes, launch into a conversation about the distribution over 2020's training compute and explain why I think the distribution should be substantially to the left, why I worry it might shift leftward faster than she projects, and why I think we should use it to forecast AI-PONR instead of TAI.
--If no, launch into a conversation about Ajeya's framework and why it's the best and why all discussion of AI timelines should begin there.

So, why do I think it's the best? Well, there's a lot to say on the subject, but, in a nutshell: Ajeya's framework is to AI forecasting what actual climate models are to climate change forecasting (by contrast with lower-tier methods such as "Just look at the time series of temperature over time / AI performance over time and extrapolate" and "Make a list of factors that might push the temperature up or down in the future / make AI progress harder or easier," and of course the classic "poll a bunch of people with vaguely related credentials.")

There's something else which is harder to convey... I want to say Ajeya's model doesn't actually assume anything, or maybe it makes only a few very plausible assumptions. This is underappreciated, I think. People will say e.g. "I think data is the bottleneck, not compute." But Ajeya's model doesn't assume otherwise! If you think data is the bottleneck, then the model is more difficult for you to use and will give more boring outputs, but you can still use it. (Concretely, you'd have 2020's training compute requirements distribution with lots of probability mass way to the right, and then rather than say the distribution shifts to the left at a rate of about one OOM a decade, you'd input whatever trend you think characterizes the likely improvements in data gathering.) The upsho

8Steven Byrnes

I’ll set aside what happens “by default” and focus on the interesting technical question of whether this post is describing a possible straightforward-ish path to aligned superintelligent AGI. The background idea is “natural abstractions”. This is basically a claim that, when you use an unsupervised world-model-building learning algorithm, its latent space tends to systematically learn some patterns rather than others. Different learning algorithms will converge on similar learned patterns, because those learned patterns are a property of the world, not an idiosyncrasy of the learning algorithm. For example: Both human brains and ConvNets seem to have a “tree” abstraction; neither human brains nor ConvNets seem to have a “head or thumb but not any other body part” concept. I kind of agree with this. I would say that the patterns are a joint property of the world and an inductive bias. I think the relevant inductive biases in this case are something like: (1) “patterns tend to recur”, (2) “patterns tend to be localized in space and time”, and (3) “patterns are frequently composed of multiple other patterns, which are near to each other in space and/or time”, and maybe other things. The human brain definitely is wired up to find patterns with those properties, and ConvNets to a lesser extent. These inductive biases are evidently very useful, and I find it very likely that future learning algorithms will share those biases, even more than today’s learning algorithms. So I’m basically on board with the idea that there may be plenty of overlap between the world-models of various different unsupervised world-model-building learning algorithms, one of which is the brain. (I would also add that I would expect “natural abstractions” to be a matter of degree, not binary. We can, after all, form the concept “head or thumb but not any other body part”. It would just be extremely low on the list of things that would pop into our head when trying to make sense of something we’

28Vanessa Kosoy

This post is a review of Paul Christiano's argument that the Solomonoff prior is malign, along with a discussion of several counterarguments and countercounterarguments. As such, I think it is a valuable resource for researchers who want to learn about the problem. I will not attempt to distill the contents: the post is already a distillation, and does a fairly good job of it. Instead, I will focus on what I believe is the post's main weakness/oversight. Specifically, the author seems to think the Solomonoff prior is, in some way, a distorted model of reasoning, and that the attack vector in question can be attributed to this, at least partially. This is evident in phrases such as "unintuitive notion of simplicity" and "the Solomonoff prior is very strange". This is also why the author thinks the speed prior might help and that "since it is difficult to compute the Solomonoff prior, [the attack vector] might not be relevant in the real world". In contrast, I believe that the attack vector is quite robust and will threaten any sufficiently powerful AI as long as it's cartesian (more on "cartesian" later). Formally analyzing this question is made difficult by the essential role of non-realizability. That is, the attack vector arises from the AI reasoning about "possible universes" and "simulation hypotheses" which are clearly phenomena that are computationally infeasible for the AI to simulate precisely. Invoking Solomonoff induction dodges this issue since Solomonoff induction is computationally unbounded, at the cost of creating the illusion that the conclusions are a symptom of using Solomonoff induction (and, it's still unclear how to deal with the fact that Solomonoff induction itself cannot exist in the universes that Solomonoff induction can learn). Instead, we should be using models that treat non-realizability fairly, such as infra-Bayesianism. However, I will make no attempt to present such a formal analysis in this review. Instead, I will rely on painting an in

12johnswentworth

This post is an excellent distillation of a cluster of past work on malignness of Solomonoff Induction, which has become a foundational argument/model for inner agency and malign models more generally. I've long thought that the malignness argument overlooks some major counterarguments, but I never got around to writing them up. Now that this post is up for the 2020 review, seems like a good time to walk through them.

In Solomonoff Model, Sufficiently Large Data Rules Out Malignness

There is a major outside-view reason to expect that the Solomonoff-is-malign argument must be doing something fishy: Solomonoff Induction (SI) comes with performance guarantees. In the limit of large data, SI performs as well as the best-predicting program, in every computably-generated world. The post mentions that: [...] ... but in the large-data limit, SI's guarantees are stronger than just that. In the large-data limit, there is no computable way of making better predictions than the Solomonoff prior in any world. Thus, agents that are influencing the Solomonoff prior cannot gain long-term influence in any computable world; they have zero degrees of freedom to use for influence. It does not matter if they specialize in influencing worlds in which they have short strings; they still cannot use any degrees of freedom for influence without losing all their influence in the large-data limit.

Takeaway of this argument: as long as we throw enough data at our Solomonoff inductor before asking it for any outputs, the malign agent problem must go away. (Though note that we never know exactly how much data that is; all we have is a big-O argument with an uncomputable constant.) ... but then how the hell does this outside-view argument jibe with all the inside-view arguments about malign agents in the prior?

Reflection Breaks The Large-Data Guarantees

There's an important gotcha in those guarantees: in the limit of large data, SI performs as well as the best-predicting program, in ever
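For readers who want the large-data guarantee in symbols, here is the standard dominance bound as a rough sketch. The notation is an assumption of this note, not taken from the review: $M$ is the Solomonoff prior, $\mu$ is any computable measure generating the data, $x_{1:n}$ is the first $n$ observed symbols, and $c_\mu > 0$ is a constant that depends on $\mu$ but not on $n$ (roughly $2^{-K(\mu)}$, where $K$ is Kolmogorov complexity, which is why the constant is uncomputable).

```latex
% Dominance: the Solomonoff prior never assigns much less probability than mu.
M(x_{1:n}) \;\ge\; c_\mu \, \mu(x_{1:n})
\qquad \text{for all } n \text{ and all } x_{1:n}.

% Taking negative logarithms: the cumulative log-loss of M exceeds
% that of mu by at most a constant, no matter how much data arrives.
-\log M(x_{1:n}) \;\le\; -\log \mu(x_{1:n}) \;+\; \log\frac{1}{c_\mu}.
```

Because the extra loss is bounded by a constant while the data grows without limit, the per-observation gap goes to zero; this is the sense in which SI eventually predicts as well as the best computable predictor.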

17Vanessa Kosoy

In this post, the author proposes a semiformal definition of the concept of "optimization". This is potentially valuable since "optimization" is a word often used in discussions about AI risk, and much confusion can follow from sloppy use of the term or from different people understanding it differently. While the definition given here is a useful perspective, I have some reservations about the claims made about its relevance and applications. The key paragraph, which summarizes the definition itself, is the following: [...] In fact, "continues to exhibit this tendency with respect to the same target configuration set despite perturbations" is redundant: clearly as long as the perturbation doesn't push the system out of the basin, the tendency must continue. This is what is known as "attractor" in dynamical systems theory. For comparison, here is the definition of "attractor" from Wikipedia: [...] The author acknowledges this connection, although he also makes the following remark: [...] I find this remark confusing. An attractor that operates along a subset of the dimensions is just an attractor submanifold. This is completely standard in dynamical systems theory. Given that the definition itself is not especially novel, the post's main claim to value is via the applications. Unfortunately, some of the proposed applications seem to me poorly justified. Specifically, I want to talk about two major examples: the claimed relationship to embedded agency and the claimed relationship to comprehensive AI services. In both cases, the main shortcoming of the definition is that there is an essential property of AI that this definition doesn't capture at all. The author does acknowledge that "goal-directed agent system" is a distinct concept from "optimizing systems". However, he doesn't explain how they are distinct. One way to formulate the difference is as follows: agency = optimization + learning. An agent is not just capable of steering a particular universe t
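The attractor reading of "optimization" can be illustrated with a toy dynamical system. This is a minimal sketch and not something from the post or the review: the update rule, the `step` and `run` helpers, and the perturbation sizes are all hypothetical choices made for illustration. The system has stable fixed points at +1 and -1, and the basin of +1 is the positive half-line.

```python
def step(x, rate=0.1):
    """One update of the toy system x <- x + rate * (x - x**3).

    Hypothetical example dynamics: fixed points at -1, 0 and +1,
    of which -1 and +1 are stable (attractors) and 0 is unstable.
    """
    return x + rate * (x - x**3)


def run(x0, steps=500, kicks=None):
    """Iterate the system from x0.

    `kicks` maps a time step to a perturbation added right after
    the update at that step (an assumption of this sketch).
    """
    kicks = kicks or {}
    x = x0
    for t in range(steps):
        x = step(x) + kicks.get(t, 0.0)
    return x


# No perturbation: the state settles at the attractor +1.
undisturbed = run(0.2)

# A perturbation that leaves the state inside the basin (x stays positive):
# the tendency toward +1 continues.
small_kick = run(0.2, kicks={50: -0.5})

# A perturbation that pushes the state out of the basin (x becomes negative):
# the system now settles at the other attractor, -1.
big_kick = run(0.2, kicks={50: -1.5})

print(undisturbed, small_kick, big_kick)
```

The second run is the point made in the review: robustness to perturbations adds nothing beyond "the perturbation stays inside the basin". Note that nothing in this system learns anything, which is the gap the review goes on to discuss.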

2Raemon

I haven't had time to reread this sequence in depth, but I wanted to at least touch on how I'd evaluate it. It seems to be aiming to be a good introductory sequence while also being a "complete and compelling case I can for why the development of AGI might pose an existential threat". The question is: who is this sequence for, what is its goal, and how does it compare to other writing targeting similar demographics? Some writing that comes to mind to compare/contrast it with includes:

* Scott Alexander's Superintelligence FAQ. This is the post I've found most helpful for convincing people (including myself), that yes, AI is just actually a big deal and an extinction risk. It's 8000 words. It's written fairly entertainingly. What I find particularly compelling here are a bunch of factual statements about recent AI advances that I hadn't known about at the time.
* Tim Urban's Road To Superintelligence series. This is even more optimized for entertainingness. I recall it being a bit more handwavy and making some claims that were either objectionable, or at least felt more objectionable. It's 22,000 words.
* Alex Flint's AI Risk for Epistemic Minimalists. This goes in a pretty different direction – not entertaining, and not really comprehensive either. It came to mind because it's doing a sort-of-similar thing of "remove as many prerequisites or assumptions as possible". (I'm not actually sure it's that helpful, the specific assumptions it's avoiding making don't feel like issues I expect to come up for most people, and then it doesn't make a very strong claim about what to do)

(I recall Scott Alexander once trying to run a pseudo-study where he had people read a randomized intro post on AI alignment, I think including his own Superintelligence FAQ and Tim Urban's posts among others, and see how it changed people's minds. I vaguely recall it didn't find that big a difference between them. I'd be curious how this compared)

At a glance, AGI Safety From First P

9Vanessa Kosoy

This post states a subproblem of AI alignment which the author calls "the pointers problem". The user is regarded as an expected utility maximizer, operating according to causal decision theory. Importantly, the utility function depends on latent (unobserved) variables in the causal network. The AI operates according to a different, superior, model of the world. The problem is then, how do we translate the utility function from the user's model to the AI's model? This is very similar to the "ontological crisis" problem described by De Blanc, only De Blanc uses POMDPs instead of causal networks, and frames it in terms of a single agent changing their ontology, rather than translation from user to AI. The question the author asks here is important, but not that novel (the author himself cites Demski as prior work). Perhaps the use of causal networks is a better angle, but this post doesn't do much to show it. Even so, having another exposition of an important topic, with different points of emphasis, will probably benefit many readers. The primary aspect missing from the discussion in the post, in my opinion, is the nature of the user as a learning agent. The user doesn't have a fixed world-model: or, if they do, then this model is best seen as a prior. This observation hints at the resolution of the apparent paradox wherein the utility function is defined in terms of a wrong model. But it still requires us to explain how the utility is defined s.t. it is applicable to every hypothesis in the prior. (What follows is no longer a "review" per se, so much as a summary of my own thoughts on the topic.) Here is a formal model of how a utility function for learning agents can work, when it depends on latent variables. Fix a set of actions $A$ and a set of observations $O$. We start with an ontological model which is a crisp infra-POMDP. That is, there is a set of states $S_{\text{ont}}$, an initial state $s^0_{\text{ont}} \in S_{\text{ont}}$, a transition infra-kernel $T_{\text{ont}} : S_{\text{ont}} \times A \to \square(S_{\text{ont}} \times O)$ and a reward functio

7johnswentworth

Why This Post Is Interesting

This post takes a previously-very-conceptually-difficult alignment problem, and shows that we can model this problem in a straightforward and fairly general way, just using good ol' Bayesian utility maximizers. The formalization makes the Pointers Problem mathematically legible: it's clear what the problem is, it's clear why the problem is important and hard for alignment, and that clarity is not just conceptual but mathematically precise. Unfortunately, mathematical legibility is not the same as accessibility; the post does have a wide inductive gap.

Warning: Inductive Gap

This post builds on top of two important pieces for modelling embedded agents which don't have their own posts (to my knowledge). The pieces are:

* Lazy world models
* Lazy utility functions (or value functions more generally)

In hindsight, I probably should have written up separate posts on them; they seem obvious once they click, but they were definitely not obvious beforehand.

Lazy World Models

One of the core conceptual difficulties of embedded agency is that agents need to reason about worlds which are bigger than themselves. They're embedded in the world, therefore the world must be as big as the entire agent plus whatever environment the world includes outside of the agent. If the agent has a model of the world, the physical memory storing that model must itself fit inside of the world. The data structure containing the world model must represent a world larger than the storage space the data structure takes up. That sounds tricky at first, but if you've done some functional programming before, then data structures like this are actually pretty run-of-the-mill. For instance, we can easily make infinite lists which take up finite memory. The trick is to write a generator for the list, and then evaluate it lazily - i.e. only query for list elements which we actually need, and never actually iterate over the whole thing. In the same way, we can represent
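The infinite-list trick mentioned in the review can be sketched in a few lines of Python. This is only an illustration of lazy evaluation, not code from the post; the `squares` generator and the `nth` helper are hypothetical names chosen for the example.

```python
from itertools import count, islice


def squares():
    """A conceptually infinite list 0, 1, 4, 9, ... (hypothetical example).

    Only a small, fixed amount of state is stored; each element is
    computed when something asks for it.
    """
    for n in count():
        yield n * n


def nth(iterable, n):
    """Return the element at index n, consuming only as much as needed."""
    return next(islice(iterable, n, None))


# Query just the elements we actually need; the full list is never built.
first_five = list(islice(squares(), 5))
element_1000 = nth(squares(), 1000)

print(first_five)    # [0, 1, 4, 9, 16]
print(element_1000)  # 1000000
```

The data structure here is the generator itself: a finite description from which any finite portion of an unbounded object can be produced on demand.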

7Daniel Kokotajlo

(I am the author) I still like & endorse this post. When I wrote it, I hadn't read more than the wiki articles on the subject. But then afterwards I went and read 3 books (written by historians) about it, and I think the original post held up very well to all this new info. In particular, the main critique the post got -- that disease was more important than I made it sound, in a way that undermined my conclusion -- seems to have been pretty wrong. (See e.g. this comment thread, these follow up posts) So, why does it matter? What contribution did this post make? Well, at the time -- and still now, though I think I've made a dent in the discourse -- quite a lot of people I respect (such as people at OpenPhil) seemed to think unaligned AGI would need god-like powers to be able to take over the world -- it would need to be stronger than the rest of the world combined! I think this is based on a flawed model of how takeover/conquest works, and history contains plenty of counterexamples to the model. The conquistadors are my favorite counterexample from my limited knowledge of history. (The flawed model goes by the name of "The China Argument," at least in my mind. You may have heard the argument before -- China is way more capable than the most capable human, yet it can't take over the world; therefore AGI will need to be way way more capable than the most powerful human to take over the world.) Needless to say, this is a somewhat important crux, as illustrated by e.g. Joe Carlsmith's report, which assigns a mere 40% credence to unaligned APS-AI taking over the world even conditional on it escaping and seeking power and managing to cause at least a trillion dollars worth of damage. (I've also gotten feedback from various people at OpenPhil saying that this post was helpful to them, so yay!) I've since written a sequence of posts elaborating on this idea: Takeoff and Takeover in the Past and Future. Alas, I still haven't written the capstone posts in the sequence, t

6Steven Byrnes

I wrote this relatively early in my journey of self-studying neuroscience. Rereading this now, I guess I'm only slightly embarrassed to have my name associated with it, which isn’t as bad as I expected going in. Some shifts I’ve made since writing it (some of which are already flagged in the text):

* New terminology part 1: Instead of “blank slate” I now say “learning-from-scratch”, as defined and discussed here.
* New terminology part 2: “neocortex vs subcortex” → “learning subsystem vs steering subsystem”, with the former including the whole telencephalon and cerebellum, and the latter including the hypothalamus and brainstem. I distinguish them by "learning-from-scratch vs not-learning-from-scratch". See here.
* Speaking of which, I now put much more emphasis on "learning-from-scratch" over "cortical uniformity" when talking about the neocortex etc. (I care about learning-from-scratch more, I talk about it more, etc.) I see the learning-from-scratch hypothesis as absolutely central to a big picture of the brain (and AGI safety!), whereas cortical uniformity is much less so. I do still think cortical uniformity is correct (at least in the weak sense that someone with a complete understanding of one part of the cortex would be well on their way to a complete understanding of any other part of the cortex), for what it’s worth.
* I would probably drop the mention of “planning by probabilistic inference”. Well, I guess something kinda like planning by probabilistic inference is part of the story, but generally I see the brain thing as mostly different.
* Come to think of it, every time the word “reward” shows up in this post, it’s safe to assume I described it wrong in at least some respect.
* The diagram with neocortex and subcortex is misleading for various reasons, see notes added to the text nearby.
* I’m not sure I was using the term “analysis-by-synthesis” correctly. I think that term is kinda specific to sound processing. And the vision analog is “vision

13Vanessa Kosoy

In this post, the author presents a case for replacing expected utility theory with some other structure which has no explicit utility function, but only quantities that correspond to conditional expectations of utility. To provide motivation, the author starts from what he calls the "reductive utility view", which is the thesis he sets out to overthrow. He then identifies two problems with the view. The first problem is about the ontology in which preferences are defined. In the reductive utility view, the domain of the utility function is the set of possible universes, according to the best available understanding of physics. This is objectionable, because then the agent needs to somehow change the domain as its understanding of physics grows (the ontological crisis problem). It seems more natural to allow the agent's preferences to be specified in terms of the high-level concepts it cares about (e.g. human welfare or paperclips), not in terms of the microscopic degrees of freedom (e.g. quantum fields or strings). There are also additional complications related to the unobservability of rewards, and to "moral uncertainty". The second problem is that the reductive utility view requires the utility function to be computable. The author considers this an overly restrictive requirement, since it rules out utility functions such as in the procrastination paradox (1 if the button is ever pushed, 0 if the button is never pushed). More generally, computable utility functions have to be continuous (in the sense of the topology on the space of infinite histories which is obtained from regarding it as an infinite cartesian product over time). The alternative suggested by the author is using the Jeffrey-Bolker framework. Alas, the author does not write down the precise mathematical definition of the framework, which I find frustrating. The linked article in the Stanford Encyclopedia of Philosophy is long and difficult, and I wish the post had a succinct distillation of the
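A short worked example may help show why the procrastination-paradox utility function fails the continuity requirement. The notation is an assumption of this note rather than the post's: $h$ ranges over infinite histories, $h^{(n)}$ is a history in which the button is first pushed at time $n$, and $h^{(\infty)}$ is the history in which it is never pushed.

```latex
% The procrastination-paradox utility function.
U(h) =
\begin{cases}
1 & \text{if the button is pushed at some time in } h, \\
0 & \text{if the button is never pushed in } h.
\end{cases}

% h^{(n)} agrees with h^{(\infty)} on the first n time steps, so in the
% product topology the sequence converges:
h^{(n)} \;\longrightarrow\; h^{(\infty)} \quad \text{as } n \to \infty.

% But the utilities do not converge to the utility of the limit:
\lim_{n \to \infty} U\big(h^{(n)}\big) \;=\; 1 \;\neq\; 0 \;=\; U\big(h^{(\infty)}\big).
```

Intuitively, no finite prefix of $h^{(\infty)}$ is enough to conclude that its utility is 0, since the button might still be pushed later; a program that must output the utility after reading only finitely much of the history cannot get this case right.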
