Edit, 5/16/23: I think this post is beautiful, correct in its narrow technical claims, and practically irrelevant to alignment. This post presents a cripplingly unrealistic picture of the role of reward functions in reinforcement learning. I expect this post to harm your alignment research intuitions unless you've already inoculated yourself by deeply internalizing and understanding Reward is not the optimization target. If you're going to read one alignment post I've written, read that one.
Follow-up work (Parametrically retargetable decision-makers tend to seek power) moved away from optimal policies and treated reward functions more realistically.
In 2008, Steve Omohundro's foundational paper The Basic AI Drives conjectured that superintelligent goal-directed AIs might be incentivized to gain significant amounts of power in order to better achieve their goals. Omohundro's conjecture bears out...
It seems likely to me that AIs will be able to coordinate with each other much more easily (i.e., at lower cost and greater scale) than humans currently can, for example by merging into coherent unified agents by combining their utility functions. This has been discussed at least since 2009, but I'm not sure its implications have been widely recognized. In this post I talk about two such implications that occurred to me relatively recently.
I was recently reminded of this quote from Robin Hanson's Prefer Law To Values:
...The later era when robots are vastly more capable than people should be much like the case of choosing a nation in which to retire. In this case we don’t expect to have much in the way of skills to
[Epistemic status: Strong claims vaguely stated and weakly held. I expect that writing this and digesting feedback on it will lead to a much better version in the future. EDIT: So far this has stood the test of time. EDIT: As of September 2020 I think this is one of the most important things to be thinking about.]
This post attempts to generalize and articulate a problem that people have been thinking about since at least 2016. [Edit: 2009 in fact!] In short, here is the problem:
Consequentialists can get caught in commitment races, in which they want to make commitments as soon as possible. When consequentialists make commitments too soon, disastrous outcomes can sometimes result. The situation we are in (building AGI and letting it self-modify) may be...
h/t Anthony DiGiovanni who points to this new paper making a weaker version of this point, in the context of normative ethics: Johan E. Gustafsson, Bentham’s Mugging - PhilPapers
This is the first of five posts in the Risks from Learned Optimization Sequence based on the paper “Risks from Learned Optimization in Advanced Machine Learning Systems” by Evan Hubinger, Chris van Merwijk, Vladimir Mikulik, Joar Skalse, and Scott Garrabrant. Each post in the sequence corresponds to a different section of the paper.
Evan Hubinger, Chris van Merwijk, Vladimir Mikulik, and Joar Skalse contributed equally to this sequence. With special thanks to Paul Christiano, Eric Drexler, Rob Bensinger, Jan Leike, Rohin Shah, William Saunders, Buck Shlegeris, David Dalrymple, Abram Demski, Stuart Armstrong, Linda Linsefors, Carl Shulman, Toby Ord, Kate Woolverton, and everyone else who provided feedback on earlier versions of this sequence.
The goal of this sequence is to analyze the type of learned optimization that occurs when a...
"Gradient hacking" is a term I've been using recently to describe the phenomenon wherein a deceptively aligned mesa-optimizer might be able to purposefully act in ways which cause gradient descent to update it in a particular way. In Risks from Learned Optimization, we included the following footnote to reflect this possibility:
...Furthermore, a deceptively aligned mesa-optimizer would be incentivized to cause there to be a systematic bias in the direction of preventing the base optimizer from modifying its mesa-objective. Thus, in the context of a local optimization process, a deceptive mesa-optimizer might try to “hack” its own gradient (by, for example, making itself more brittle in the case where its objective gets changed) to ensure that the base optimizer adjusts it in such a way that leaves its
I am still pretty unconvinced that there is a corruption mechanism that wouldn’t be removed more quickly by SGD than the mesa-objective would be reverted. Are there more recent write-ups that shed more light on this?
Specifically, I can’t tell whether this assumes the corruption mechanism has access to a perfect model of its own weights via observation (e.g. hacking) or via the weights somehow referring to themselves. This is important because if “the mesa-objective weights” are referred to via observation, then SGD will not compute a gradient wrt them (since t...
If you're not familiar with the double descent phenomenon, I think you should be. I consider double descent to be one of the most interesting and surprising recent results in analyzing and understanding modern machine learning. Today, Preetum Nakkiran et al. released a new paper, “Deep Double Descent,” which I think is a big further advancement in our understanding of this phenomenon. I'd highly recommend at least reading the summary of the paper on the OpenAI blog. However, I will also try to summarize the paper here, as well as give a history of the literature on double descent and some of my personal thoughts.
The double descent phenomenon was first discovered by Mikhail Belkin et al., who were confused by the phenomenon wherein modern ML practitioners would
...Note to mods: I'm a bit uncertain whether posts like this one currently belong on the Alignment Forum. Please move it if it doesn't. Or if anyone would prefer not to have such posts on AF, please let me know.
In Strategic implications of AIs’ ability to coordinate at low cost, I talked about the possibility that different AGIs can coordinate with each other much more easily than humans can, by doing something like merging their utility functions together. It now occurs to me that another way for AGIs to greatly reduce coordination costs in an economy is by having each AGI or copies of each AGI profitably take over much larger chunks of the economy (than companies currently own), and this can be done with AGIs that
...I was reading parts of Superintelligence recently for something unrelated and noticed that Bostrom makes many of the same points as this post:
...If the frontrunner is an AI system, it could have attributes that make it easier for it to expand its capabilities while reducing the rate of diffusion. In human-run organizations, economies of scale are counteracted by bureaucratic inefficiencies and agency problems, including difficulties in keeping trade secrets. These problems would presumably limit the growth of a machine intelligence project so long as it is op
There has been considerable debate over whether development in AI will experience a discontinuity, or whether it will follow a more continuous growth curve. Given the lack of consensus and the confusing, diverse terminology, it is natural to hypothesize that much of the debate is due to simple misunderstandings. Here, I seek to dissolve some misconceptions about the continuous perspective, based mostly on how I have seen people misinterpret it in my own experience.
First, we need to know what I even mean by continuous takeoff. When I say it, I mean a scenario where the development of competent, powerful AI follows a trajectory that is roughly in line with what we would have expected by extrapolating from past progress. That is, there is no point at...
Suppose that 1% of the world’s resources are controlled by unaligned AI, and 99% of the world’s resources are controlled by humans. We might hope that at least 99% of the universe’s resources end up being used for stuff-humans-like (in expectation).
Jessica Taylor argued for this conclusion in Strategies for Coalitions in Unit-Sum Games: if the humans divide into 99 groups each of which acquires influence as effectively as the unaligned AI, then by symmetry each group should end up with as much influence as the AI, i.e. the humans should collectively end up with 99% of the influence.
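The symmetry arithmetic can be made concrete with a minimal sketch. It assumes, as the argument itself does, that every group acquires influence equally effectively and so ends with an equal share; the function name `influence_shares` is hypothetical and used only for illustration.

```python
from fractions import Fraction


def influence_shares(num_human_groups: int, num_ai_groups: int = 1):
    """Return (human_share, ai_share) of long-run influence.

    Assumption (from the symmetry argument): every group, human or AI,
    ends up with the same share of total influence.
    """
    total_groups = num_human_groups + num_ai_groups
    per_group = Fraction(1, total_groups)  # equal split by symmetry
    return num_human_groups * per_group, num_ai_groups * per_group


human_share, ai_share = influence_shares(99)
print(human_share, ai_share)  # 99/100 1/100
```

The sketch only restates the conclusion under the equal-effectiveness premise; if some group acquires influence more effectively than the others, the equal split no longer holds.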
This argument rests on what I’ll call the strategy-stealing assumption: for any strategy an unaligned AI could use to influence the long-run future, there is an analogous strategy that a similarly-sized group...
Categorising the ways that the strategy-stealing assumption can fail:
[Epistemic status: Argument by analogy to historical cases. Best case scenario it's just one argument among many. Edit: Also, thanks to feedback from others, especially Paul, I intend to write a significantly improved version of this post in the next two weeks. Edit: I never did, because in the course of writing my response I realized the original argument made a big mistake. See this review.]
I have on several occasions heard people say things like this:
The original Bostrom/Yudkowsky paradigm envisioned a single AI built by a single AI project, undergoing intelligence explosion all by itself and attaining a decisive strategic advantage as a result. However, this is very unrealistic. Discontinuous jumps in technological capability are very rare, and it is very implausible that one project...
It's hard to know how to judge a post that deems itself superseded by a post from a later year, but I lean toward taking Daniel at his word and hoping we survive until the 2021 Review comes around.
I added the following to the beginning: