The following text is my submission for the AI Safety Public Materials contest. In it, I try to lay out the importance of AI Safety Research to people who, according to the winning conditions of the contest, have not yet engaged with AI Safety, LessWrong, or effective altruism. As such, I’m eager to receive feedback on how accessible the text is to people outside of these communities. Additionally, I’d like to know where you think my framings might be misguided.
In my text, I focus on the distribution shift problem since it is an inner alignment problem, which increasingly feels like the most central challenge of AI Safety to me. However, the text does not focus on explicit inner optimizers or agency, even though this would strengthen the argument. The reason is that I think the distribution shift problem can already motivate existential risk well enough; the value of keeping the discussion focused therefore seems larger than the value of painting a complete picture. Nevertheless, in the appendix, I include a short overview of other safety concerns.
I thank Gabriela Jiang, Tom Lieberum, Shos Tekofsky, and Magdalena Wache for their useful comments and feedback on this text.
This text covers the importance of technical AI Safety research. I explain why current machine learning paradigms might lead to the disempowerment of humanity in this century. By disempowerment of humanity, I mean any situation in which humans lose the ability to steer the future development of society, leading in the worst case to a loss of everything that we find valuable.
The argument is based on the distribution shift problem of contemporary machine learning, the increased dependence on AI and stronger distribution shifts that High-Level Machine Intelligence (HLMI) entails, and the likely development of HLMI in this century. I hope this text can motivate people to work on AI Safety research to prevent humanity's disempowerment.
Note that the distribution shift problem is only one of several safety problems that have been discussed over recent years. You can find a short overview in the appendix. A good starting point for learning more about the distribution shift problem specifically is the 2016 paper on Concrete Problems in AI Safety.
It will be useful to have basic knowledge about machine learning to appreciate this text. I include a short overview at the beginning of the argument.
In the contemporary machine learning paradigm, a machine learning (ML) system receives inputs and produces outputs. It is trained to achieve some specified or implicit goal. During training, the internals of the ML system are changed in such a way that it will be better at reaching the goal in the future. The algorithm that achieves this internal change is typically some variation of gradient descent. Nowadays, the ML system itself is usually based on artificial neural networks. The subfield of machine learning that deals with large neural networks is called deep learning.
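To make the training loop described above concrete, here is a minimal sketch in plain Python. It is not how real deep learning systems are implemented: the "ML system" is a hypothetical two-parameter linear model, the goal is assumed to be a small mean squared error, and all names and numbers (`predict`, `train`, the learning rate, the data) are illustrative assumptions. The internals start at zero rather than at random values so that the run is reproducible.

```python
def predict(params, x):
    """The toy 'ML system': maps an input x to an output using its internals."""
    w, b = params
    return w * x + b


def mean_squared_error(params, data):
    """The goal, defined outside the model: small error on the given data."""
    return sum((predict(params, x) - y) ** 2 for x, y in data) / len(data)


def gradient_step(params, data, learning_rate):
    """One step of gradient descent on the mean squared error."""
    w, b = params
    n = len(data)
    grad_w = sum(2 * (predict(params, x) - y) * x for x, y in data) / n
    grad_b = sum(2 * (predict(params, x) - y) for x, y in data) / n
    return (w - learning_rate * grad_w, b - learning_rate * grad_b)


def train(data, steps=2000, learning_rate=0.05):
    """Repeatedly adjust the internals so the outputs better satisfy the goal."""
    params = (0.0, 0.0)  # fixed starting internals, chosen for reproducibility
    for _ in range(steps):
        params = gradient_step(params, data, learning_rate)
    return params


# Illustrative training data generated from the rule y = 3x + 1.
data = [(x, 3.0 * x + 1.0) for x in (0.0, 1.0, 2.0, 3.0)]
params = train(data)
print(params, mean_squared_error(params, data))
```

Note that `predict` never refers to the error function. The goal only enters through `gradient_step`, which changes the model's internals from the outside.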
The most important sub-paradigms in machine learning are:
Note that autoregressive learning is a subclass of what’s often called unsupervised learning. I decided to focus the discussion above on autoregressive learning since it has been increasingly important in recent years.
What’s common to all these paradigms is that the goal is not explicitly represented inside the ML system itself. Instead, the system simply receives inputs and produces outputs according to its initially random internals. If the outputs do not satisfy the goal, then an algorithm outside of the ML system, typically gradient descent, adjusts the internals of the ML system to better achieve the goal in the future. Thus, the ML system will over time behave as if it “cared about the goal”. The problem is that this may just be a superficial correlation, as we now illustrate with the distribution shift problem.
With the terms “distribution” or “data distribution”, one typically refers to the properties, frequencies, and nature of the data that the ML system receives. A “distribution shift” is any situation in which the distribution changes, that is:
Very often, ML systems are trained on a well-curated data set or environment and then deployed in the real world, which may come with substantial distribution shifts. This may lead to problems, for example:
These examples constitute an alignment failure: the ML system’s actions are not aligned with its given goal, as revealed by the shift in distribution. One often specifically calls this an inner alignment problem, to contrast it with outer alignment, which is roughly about choosing the right goal for the ML system in the first place. More clarifications on these notions can be found in the appendix.
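A failure of this kind can be reproduced in a few lines. The sketch below is a toy illustration under stated assumptions, not a real ML pipeline: the "learner" is a hypothetical rule that keeps whichever single binary feature best predicts the label on the training data (it does not use gradient descent), and the data sets are constructed by hand. The `core` feature agrees with the label 90% of the time in every environment, while the `shortcut` feature agrees with the label perfectly during training but only half of the time after deployment.

```python
def make_dataset(n, shortcut_flip_every=None):
    """Build n examples of ((core, shortcut), label).

    core: equals the label except on every 10th example (90% reliable).
    shortcut: equals the label, except that it is flipped on every
    `shortcut_flip_every`-th example (None means it is never flipped).
    """
    data = []
    for i in range(n):
        label = (i // 2) % 2
        core = 1 - label if i % 10 == 9 else label
        flip = shortcut_flip_every is not None and i % shortcut_flip_every == 0
        shortcut = 1 - label if flip else label
        data.append(((core, shortcut), label))
    return data


def accuracy(feature_index, data):
    """Fraction of examples on which the chosen feature equals the label."""
    hits = sum(1 for features, label in data if features[feature_index] == label)
    return hits / len(data)


def fit(data):
    """'Training': keep the single feature that best predicts the label."""
    return max(range(2), key=lambda index: accuracy(index, data))


train_data = make_dataset(100)                          # curated training set
deploy_data = make_dataset(100, shortcut_flip_every=2)  # shifted distribution

chosen = fit(train_data)
train_accuracy = accuracy(chosen, train_data)
deploy_accuracy = accuracy(chosen, deploy_data)
print(chosen, train_accuracy, deploy_accuracy)
```

In this constructed example, the learner picks the shortcut feature because it looks perfect during training. Nothing in the training data distinguishes "follow the shortcut" from "follow the intended feature", and the difference only becomes visible once the distribution shifts.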
Usually, such alignment failures are recoverable:
Therefore, it seems like we can proceed with trial and error:
However, there is to date no principled solution that can spot all potential issues under distribution shift in advance. This may lead to considerably worse issues once ML systems become very powerful, as we explain next.
Currently, ML systems have limited applicability. There have been many attempts to formalize the notion of artificial intelligence that goes beyond what’s currently possible and matches or exceeds the generality and power of human intelligence. One such notion is High-Level Machine Intelligence (HLMI), which Katja Grace defined as follows:
“High-level machine intelligence” (HLMI) is achieved when unaided machines can accomplish every task better and more cheaply than human workers.
There are similar concepts with somewhat different meanings, e.g. Artificial General Intelligence (AGI), Transformative AI (TAI), and Process for Automating Scientific and Technological Advancement (PASTA). All of them have in common that reaching them would likely constitute a significant historical event with an enormous impact on human society. Machines would then be able to do much of the intellectual work currently conducted by humans. In this section, we explain how the advent of HLMI would make humanity increasingly dependent on AI and lead to unprecedented levels of innovation. We then argue that this might lead to humanity’s disempowerment if the distribution shift problem is not solved.
Once we have HLMI, AI systems will, by definition, be able to do every task better and more cheaply than human workers. Consequently, humans would be the main bottleneck in the supply chain, which makes it highly economically valuable to help AI act more autonomously. This could take the form of giving AI more access to crucial computer systems, and building up infrastructure that AI can use to effectively act in the world. For example, robot factories might be built that allow AI to quickly design and deploy robots that can do all manual tasks currently done by humans.
With humanity increasingly relying on powerful AI systems, it will be hard to unilaterally discontinue their use. States and companies that do so will simply be outcompeted by ones that make use of AI’s great innovative power. As a result, the share of intellectual work done by HLMI systems rather than by humans will increase over time. It will then be hard to maintain effective oversight over the combined behavior of all HLMI systems.
Once AI acts autonomously in the world and does all human jobs, it will by definition also do most or all of the innovation that drives our growth. AI will have two key advantages compared to humans that will allow it to increase the rate of innovation to unprecedented levels:
First of all, AI can simply be copied and scaled up: As long as there is enough raw material and electricity in the world, one can increase the number of computers that run AI systems. This is in contrast to humanity, whose growth slows considerably — the human population will likely peak in the 21st century. Additionally, not only the quantity but also the quality of artificial intelligence can increase through sustained AI research: HLMI can, by definition, do all human jobs, and will therefore be able to do AI research itself, leading to a positive feedback loop that increases artificial intelligence.
Since the amount and quality of intelligence is likely the key driver of innovation, these dynamics would lead to huge increases in the rate of innovation.
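The difference between a fixed research workforce and a self-improving one can be illustrated with a toy calculation. This is a deliberately crude sketch and not a forecast: the function `capability_over_time`, the research rate of 0.1 per step, and the assumption that research output is proportional to current capability are all hypothetical choices made for illustration.

```python
def capability_over_time(steps, research_rate, self_improving):
    """Track a toy 'capability' number over a number of research steps.

    Without feedback, a fixed workforce adds the same amount every step.
    With feedback, the amount of research done scales with current capability.
    """
    capability = 1.0
    history = [capability]
    for _ in range(steps):
        researchers = capability if self_improving else 1.0
        capability += research_rate * researchers
        history.append(capability)
    return history


fixed = capability_over_time(50, 0.1, self_improving=False)
feedback = capability_over_time(50, 0.1, self_improving=True)
print(fixed[-1], feedback[-1])
```

Under these assumptions, the fixed workforce grows linearly (to 1 + 50 × 0.1 = 6), while the feedback loop compounds (to 1.1 raised to the power 50, which is more than 100). The specific numbers carry no weight; the point is only the qualitative gap between linear and compounding growth.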
In summary, if we reach HLMI, we will likely see AI being very autonomous and widespread, replacing most or all jobs currently done by humans. In particular, AI will replace jobs in research and development, leading to increased rates of innovation. Over longer timeframes, AI systems will then dominate the intellectual work on Earth, and humanity will be subject to the decisions of AI systems that determine the future trajectory of humanity. In an ideal world, AI remains aligned with us, shares our values, and defers to humans for key decisions.
However, AI autonomy and increased rates of innovation will also lead to large increases in distribution shifts: once the AI acts autonomously, it is put in situations it has not encountered in previously constrained environments. And with new innovations and the world transforming, the environment will change in unpredictable ways. This means that over time, the distribution will drift away from the one the AI was trained on. If the distribution shift problem is not solved by then, AI will misbehave in ways that may be unaligned with humanity. This may lead to a disempowerment of humanity whose fate is determined by AIs that show increasingly alien behavior in an increasingly changing world. In the worst case, the future may lose everything that is valuable to humans.
None of this would be a problem we need to focus on now if HLMI were never developed or were developed only in the very distant future. However, we cannot rely on this, which we explain below.
The last ten years have seen remarkable progress in artificial intelligence. In 2012, AlexNet made tremendous progress in classifying images. Since then, research into deep learning has sped up dramatically, leading to a cascade of spectacular successes:
Much of this success was driven by scaling up neural network architectures to enormous sizes. This is made possible by scalable architectures like the transformer, which show large and predictable improvements when scaled up. They can have hundreds of billions of parameters and cost tens of millions of dollars to train. Another factor in the success of deep learning was the continued fall in the price of GPUs, the core processing units used in deep learning. Additionally, though not independent of inventions like the transformer, large algorithmic efficiency gains made it possible to achieve better performance with the same compute budget.
Where does that leave us for the future? Certainly, some trends cannot continue forever. For example, the price of GPUs might stagnate. Furthermore, current top-performing models already cost tens of millions of dollars. Projects in the billions of dollars are in principle possible, but one can likely expect a slow-down in the scale-up until then, and much more expensive networks might not be trained in the foreseeable future.
However, the development of new architectures, key insights, and training processes will likely continue and has the usual unpredictability of basic science. This means that precise predictions are in principle hard.
Nevertheless, HLMI might arrive earlier than many people think. In 2016, Katja Grace surveyed AI experts who had published at the 2015 NIPS and ICML conferences, two leading venues for machine learning. The median respondent predicted a 50% chance of HLMI by the year 2061. She repeated the survey in 2022, with the experts predicting the year 2059. A report by Ajeya Cotra made comparisons to biological neural networks and predicted transformative AI by 2050. She has recently updated her prediction down towards 2040. Finally, forecasters at Metaculus predict artificial general intelligence (AGI) for 2041, though note that this forecast fluctuates strongly over time.
While these predictions are about different operationalizations of strong AI — namely, HLMI, TAI, and AGI — they still overall point in the direction of HLMI arriving in the next few decades. This is within the lifetime of many readers of this post or their children.
In summary, the previous sections provide the following chain of reasoning leading to a potential disempowerment of humanity in this century:
Thus, we might see a disempowerment of humanity in this century.
To mitigate our potential disempowerment, it seems crucial that more people work on researching AI Safety. If you want to think about getting involved, I can recommend these three articles. If you want to go deeper, then consider reading through the AI Alignment Curriculum. I have written summaries for many texts in the curriculum that you may want to read as well. Finally, I want to clarify once again that the distribution shift problem is only one of many safety concerns, and I strongly encourage you to become familiar with several of them before “getting to work”.
This post focused on the distribution shift problem as the core motivator for research into AI Safety, but it is not the only concern. I now briefly summarize the bigger picture. I start with a discussion of inner and outer alignment, which together form what Paul Christiano calls (intent) alignment:
The distribution shift problem itself is part of the inner alignment problem. This is, very roughly, about any situation in which the AI’s behavior does not correspond to the specified goal. A specific subclass concerns the situation in which the AI system itself becomes an optimizer for some goal possibly acquired during training; the AI system is then called a mesa-optimizer.
It is unclear whether AI systems actually acquire their own goals, but it seems plausible on evolutionary grounds: evolution selected humans according to the “goal” of inclusive genetic fitness, and yet, humans rarely explicitly maximize this goal. Rather, we have evolved many subgoals that have been useful for our fitness in the ancestral environment, including eating sugary food, gaining status, or finding deep enjoyment in life.
Recently, the inner alignment problem has been demonstrated in deep reinforcement learning. This work even argued that the inner-misaligned reinforcement learning systems optimize for unaligned goals instead of "merely" misbehaving. This is an example of a more general concern that capabilities might generalize further than alignment.
Another central concern is outer alignment, which I completely omitted from this article. An outer misalignment failure occurs when the goal we give to the AI system in the first place is not actually aligned with our own intentions. Thus, even if the AI perfectly satisfies its specified goal, one may still encounter problems. Victoria Krakovna wrote about this under the name of specification gaming. The problem is usually caused by our inability to fully specify all constraints we care about; the AI may then achieve its goal by causing some potentially irreversible and harmful side effects.
There have been some attempts to solve the specification problem by learning the goal from human preferences. This, however, has the problem that humans’ stated preferences do not necessarily agree with their actual values. And even if a human’s preferences agreed with the visible AI behavior, the problem would remain that the AI might conceal harmful externalities of its actions.
Prior to the rise of deep learning, there already were many discussions on AI Safety. They often revolved around the notion of a superintelligence — i.e., an AI that is much more intelligent than all of humanity combined. The canonical introduction is Nick Bostrom’s book Superintelligence. This book models the AI as being given an explicit representation of its goal that it tries to maximize in expectation.
While his discussion of the topic may seem archaic from today’s machine learning perspective, the text remains relevant to this day. Three very important notions are:
Joseph Carlsmith wrote a report that recasts many of these concerns in relation to modern machine learning.
There is also the question of how to govern AI. This is in itself an active research field about which you can learn in the AI Governance Curriculum. To quote from there:
AI governance can be thought of as a cluster of activities and professional fields that seek to best navigate the transition to a world with advanced AI systems. It involves not just government decisions but also corporate decisions, and not just formal policies but also institutions and norms.
For a short introduction, see this talk by Allan Dafoe.
However, there have been concerns that a representation of a goal, either identical to or different from the specified goal, may emerge in the ML system. See the section on inner alignment in the appendix.
See also the Chinchilla paper for an updated view on scaling laws.
Some would narrow this down further and replace the “AI’s behavior” with what the AI is “trying” to do; see Paul Christiano’s definition of the full alignment problem encompassing both inner and outer alignment.
I like that you point out that we'd normally do trial and error, but that this might not work with AI. I think you could possibly make clearer where this fails in your story. You do point out how HLMI might become extremely widespread and how it might replace most human work. Right now it seems to me like you argue essentially that the problem is a large-scale accident that comes from a distribution shift. But this doesn't yet say why we couldn't e.g. just continue trial-and-error and correct the AI once we notice that something is going wrong.
I think one would need to invoke something like instrumental convergence, goal preservation and AI being power-seeking, to argue that this isn't just an accident that could be prevented if we gave some more feedback in time. It is important for the argument that the AI is pursuing the wrong goals and thus wouldn't want to be stopped, etc.
Of course, one has to simplify the argument somehow in an introduction like this (and you do elaborate in the appendix), but maybe some argument about instrumental convergence should still be included in the main text.
Yes, after reflection I think this is correct. I think I had in mind a situation where with deployment, the training of the AI system simply stops, but of course, this need not be the case. So if training continues, then one either needs to give stronger reasons why the distribution shift leads to a catastrophe (e.g., along the lines you argue) or make the case that the training signal couldn't keep up with the fast pace of the development. The latter would be an outer alignment failure, which I tried to avoid talking about in the text.