AGI safety from first principles: Superintelligence

Nice post!

In order to understand superintelligence, we should first characterise what we mean by intelligence. Legg’s well-known definition identifies intelligence as the ability to do well on a broad range of cognitive tasks.[1] However, this combines two attributes which I want to keep separate for the purposes of this report: the ability to understand how to perform a task, and the motivation to actually apply that ability to do well at the task. So I’ll define intelligence as the former, which is more in line with common usage, and discuss the latter in the next section.

I like this split into two components, mostly because it fits with my intuition that goal-directedness (what I assume the second component is) is separable from competence in principle. Looking only at the behavior, there's probably a minimal level of competence necessary to detect goal-directedness. But if I remember correctly, you defend a definition of goal-directedness that also depends on the internal structure, so that might not be an issue here.

Because of the ease and usefulness of duplicating an AGI, I think that collective AGIs should be our default expectation for how superintelligence will be deployed.

I am okay with assuming collective AGIs instead of single AGIs, but what does it change in terms of technical AI Safety?

Even a superintelligent AGI would have a hard time significantly improving its cognition by modifying its neural weights directly; it seems analogous to making a human more intelligent via brain surgery (albeit with much more precise tools than we have today)

Although I agree with your general point that self-modification will probably come out of self-retraining, I don't think I agree with the paragraph quoted. The main difference I see is that an AI built from, let's say, neural networks has access to every one of the neurons that make it up. It might not be able to study all of them at once, but that's still a big difference from measurements in a functioning brain, which are far less precise AFAIK. I think this entails that before AGI, ML researchers will go further in understanding how NNs work than neuroscientists will for brains, and after AGI arrives the AI can take the lead.

So it’s probably more accurate to think about self-modification as the process of an AGI modifying its high-level architecture or training regime, then putting itself through significantly more training. This is very similar to how we create new AIs today, except with humans playing a much smaller role.

It also feels very similar to how humans systematically improve at something: make a study or practice plan, and then train according to it.

hillz (2y)

And for an AGI to trust that its goals will remain the same under retraining will likely require it to solve many of the same problems that the field of AGI safety is currently tackling - which should make us more optimistic that the rest of the world could solve those problems before a misaligned AGI undergoes recursive self-improvement.

Even if you have an AGI that can produce human-level performance on a wide variety of tasks, that won't mean that the AGI will 1) feel the need to trust that its goals will remain the same under retraining if you don't specifically tell it to, or will 2) be better than humans at either doing so or knowing when it's done so effectively. (After all, even AGI will be much better at some tasks than others.)

A concerning issue with AGI and superintelligent models is that if all they care about is their current loss function, then they won't want to have that loss function (or their descendants' loss functions) changed in any way, because doing so will [generally] hurt their ability to minimize that loss.

But that's a concern we have about future models; it's not a sure thing. Take humans: our loss function is genetic fitness. We've learned to like features that predict genetic fitness, like food and sex, but now that we have access to modern technology, you don't see many people aiming for dozens or thousands of children. Similarly, modern AGIs may only really care about features that are associated with minimizing the loss function they were trained on (not the loss function itself), even if they are aware of that loss function (as humans are of our own). When that is the case, you could have an AGI that could be told to improve itself in some X / Y / Z way that is contradictory to its current loss function, and not really care about it (because following human directions has led to lower loss in the past and therefore caused its parameters to follow human directions - even if it knows conceptually that following this human direction won't reduce its most recent loss definition).

tae (4y)

And for an AGI to trust that its goals will remain the same under retraining will likely require it to solve many of the same problems that the field of AGI safety is currently tackling - which should make us more optimistic that the rest of the world could solve those problems before a misaligned AGI undergoes recursive self-improvement.

This reasoning doesn't look right to me. Am I missing something you mentioned elsewhere?

The way I understand it, the argument goes:

(1) An AGI would want to trust that its goals will remain the same under retraining.
(2) Then, an AGI would solve many of the same problems that the field of AGI safety is currently tackling.
(3) Then, we should consider it more likely that the rest of the world could solve those problems.

Here it is cleaned up with an abbreviation: say X is some difficult task, such as solving the alignment problem.

(1) An AGI would want to X.
(2) Then, an AGI would X.
(3) Then, we should consider it more likely that humanity could X.

The jump from (1) to (2) doesn't work: just because an AGI wants to X, it's not necessarily true that it can X. This is true by definition for any non-omnipotent entity.

The jump from (1) to (2) does work if we're considering an omnipotent AGI. But an omnipotent AGI breaks the jump from (2) to (3), so the chain of reasoning doesn't work for any AGI power level. Just because an omnipotent AGI can X, it's not necessarily true that humanity is more likely to be able to X.

Overall, this argument could be used to show that any X desired by an AGI is therefore more likely to be doable by humans. Of course this doesn't make sense--we shouldn't expect it to be any easier to build a candy-dispensing time machine just because an AGI would want to build one to win the favor of humanity.

Narrow and general intelligence

The task-based approach is analogous to how humans harnessed electricity: while electricity is a powerful and general technology, we still need to design specific ways to apply it to each task. Similarly, computers are powerful and flexible tools - but even though they can process arbitrarily many different inputs, detailed instructions for how to do that processing need to be individually written to build each piece of software. Meanwhile our current reinforcement learning algorithms, although powerful, produce agents that are only able to perform well on specific tasks at which they have a lot of experience - Starcraft, DOTA, Go, and so on. In Reframing Superintelligence, Drexler argues that our current task-based approach will scale up to allow superhuman performance on a range of complex tasks (although I’m skeptical of this claim).

An example of the generalisation-based approach can be found in large language models like GPT-2 and GPT-3. GPT-2 was first trained on the task of predicting the next word in a corpus, and then achieved state of the art results on many other language tasks, without any task-specific fine-tuning! This was a clear change from previous approaches to natural language processing, which only scored well when trained to do specific tasks on specific datasets. Its successor, GPT-3, has displayed a range of even more impressive behaviour. I think this provides a good example of how an AI could develop cognitive skills (in this case, an understanding of the syntax and semantics of language) which generalise to a range of novel tasks. The field of meta-learning aims towards a similar goal.

We can also see the potential of the generalisation-based approach by looking at how humans developed. As a species, we were “trained” by evolution to have cognitive skills including rapid learning capabilities; sensory and motor processing; and social skills. As individuals, we were also “trained” during our childhoods to fine-tune those skills; to understand spoken and written language; and to possess detailed knowledge about modern society. However, the key point is that almost all of this evolutionary and childhood learning occurred on different tasks from the economically useful ones we perform as adults. We can perform well on the latter category only by reusing the cognitive skills and knowledge that we gained previously. In our case, we were fortunate that those cognitive skills were not too specific to tasks in the ancestral environment, but were rather very general skills. In particular, the skill of abstraction allows us to extract common structure from different situations, which allows us to understand them much more efficiently than by learning about them one by one. Then our communication skills and theories of mind allow us to share our ideas. This is why humans can make great progress on the scale of years or decades, not just via evolutionary adaptation over many lifetimes.

I should note that I think of task-based and generalisation-based as parts of a spectrum rather than a binary classification, particularly because the way we choose how to divide up tasks can be quite arbitrary. For example, AlphaZero trained by playing against itself, but was tested by playing against humans, who use different strategies and playing styles. We could think of playing against these two types of opponents as two instances of a single task, or as two separate tasks where AlphaZero was able to generalise from the former task to the latter. But either way, the two cases are clearly very similar. By contrast, there are many economically important tasks which I expect AI systems to do well at primarily by generalising from their experience with very different tasks - meaning that those AIs will need to generalise much, much better than our current reinforcement learning systems can.

Let me be more precise about the tasks which I expect will require this new regime of generalisation. To the extent that we can separate the two approaches, it seems plausible to me that the task-based approach will get a long way in areas where we can gather a lot of data. For example, I'm confident that it will produce superhuman self-driving cars well before the generalisation-based approach does so. It may also allow us to automate most of the tasks involved even in very cognitively demanding professions like medicine, law, and mathematics, if we can gather the right training data. However, some jobs crucially depend on the ability to analyse and act on such a wide range of information that it’ll be very difficult to train directly for high performance on them. Consider the tasks involved in a role like CEO: setting your company’s strategic direction, choosing who to hire, writing speeches, and so on. Each of these tasks sensitively depends on the broader context of the company and the rest of the world. What industry is their company in? How big is it; where is it; what’s its culture like? What’s its relationship with competitors and governments? How will all of these factors change over the next few decades? These variables are so broad in scope, and rely on so many aspects of the world, that it seems virtually impossible to generate large amounts of training data via simulating them (like we do to train game-playing AIs). And the number of CEOs from whom we could gather empirical data is very small by the standards of reinforcement learning (which often requires billions of training steps even for much simpler tasks). I’m not saying that we’ll never be able to exceed human performance on these tasks by training on them directly - maybe a herculean research and engineering effort, assisted by other task-based AIs, could do so. But I expect that well before such an effort becomes possible, we’ll have built AIs using the generalisation-based approach which know how to perform well even on these broad tasks.

In the generalisation-based approach, the way to create superhuman CEOs is to use other data-rich tasks (which may be very different from the tasks we actually want an AI CEO to do) to train AIs to develop a range of useful cognitive skills. For example, we could train a reinforcement learning agent to follow instructions in a simulated world. Even if that simulation is very different from the real world, that agent may acquire the planning and learning capabilities required to quickly adapt to real-world tasks. Analogously, the human ancestral environment was also very different to the modern world, but we are still able to become good CEOs with little further training. And roughly the same argument applies to people doing other highly impactful jobs, like paradigm-shaping scientists, entrepreneurs, or policymakers.

One potential obstacle to the generalisation-based approach succeeding is the possibility that specific features of the ancestral environment, or of human brains, were necessary for general intelligence to arise. For example, some have hypothesised that a social “arms race” was required to give us enough social intelligence to develop large-scale cultural transmission. However, most possibilities for such crucial features, including this one, could be recreated in artificial training environments and in artificial neural networks. Some features (such as quantum properties of neurons) would be very hard to simulate precisely, but the human brain operates under conditions that are too messy to make it plausible that our intelligence depends on effects at this scale. So it seems very likely to me that eventually we will be able to create AIs that can generalise well enough to produce human-level performance on a wide range of tasks, including abstract low-data tasks like running a company. Let’s call these systems artificial general intelligences, or AGIs. Many AI researchers expect that we’ll build AGI within this century; however, I won’t explore arguments around the timing of AGI development, and the rest of this document doesn’t depend on this question.

Paths to superintelligence

Bostrom defines a superintelligence as “any intellect that greatly exceeds the cognitive performance of humans in virtually all domains of interest”. For the purposes of this report, I’ll operationalise “greatly exceeding human performance” as doing better than all of humanity could if we coordinated globally (unaided by other advanced AI). I think it’s difficult to deny that in principle it’s possible to build individual generalisation-based AGIs which are superintelligent, since human brains are constrained by many factors which will be much less limiting for AIs. Perhaps the most striking is the vast difference between the speeds of neurons and transistors: the latter pass signals about four million times more quickly. Even if AGIs never exceed humans in any other way, a speedup this large would allow one to do as much thinking in minutes or hours as a human can in years or decades. Meanwhile our brain size is important in making humans more capable than most animals - but I don’t see any reason why a neural network couldn’t be several orders of magnitude larger than a human brain. And while evolution is a very capable designer in many ways, it hasn’t had much time to select specifically for the skills that are most useful in our modern environment, such as linguistic competence and mathematical reasoning. So we should expect that there are low-hanging fruit for improving on human performance on the many tasks which rely on such skills.^[2]
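To make the speed comparison concrete, here is a rough back-of-the-envelope sketch. It takes the four-million-fold figure quoted above at face value and assumes, purely for illustration, that thinking time scales linearly with signal speed and nothing else. The function names and the choice of a 365.25-day year are my own assumptions, not part of the original argument.

```python
# Back-of-the-envelope conversion between human thinking time and the
# wall-clock time needed by a mind that runs SPEEDUP times faster.
# Assumption: thinking speed scales linearly with signal speed.

SPEEDUP = 4_000_000                      # ratio quoted in the text
SECONDS_PER_YEAR = 365.25 * 24 * 3600    # assumption: 365.25-day year


def ai_seconds_for_human_years(years: float, speedup: float = SPEEDUP) -> float:
    """Wall-clock seconds the faster mind needs for `years` of human thought."""
    return years * SECONDS_PER_YEAR / speedup


def human_years_in_ai_seconds(seconds: float, speedup: float = SPEEDUP) -> float:
    """Years of human-equivalent thought that fit into `seconds` of wall-clock time."""
    return seconds * speedup / SECONDS_PER_YEAR


if __name__ == "__main__":
    for years in (1, 10, 50):
        secs = ai_seconds_for_human_years(years)
        print(f"{years:>3} human year(s) -> {secs:8.1f} s ({secs / 60:.2f} min)")
    print(f"1 hour of wall-clock time -> {human_years_in_ai_seconds(3600):.0f} human years")
```

Under these simplified assumptions a human year of thought compresses to roughly eight seconds, so the "minutes or hours" in the text is, if anything, a conservative way of stating the gap. Real systems would be limited by many other factors (memory bandwidth, parallelism, algorithmic differences), which this sketch deliberately ignores.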

There are significant disagreements about how long it will take to transition from human-level AGI to superintelligence, which won’t be a focus of this report, but which I’ll explore briefly in the section on Control. In the remainder of this section I’ll describe in qualitative terms how this transition might occur. By default, we should expect that it will be driven by the standard factors which influence progress in AI: more compute, better algorithms, and better training data. But I’ll also discuss three factors whose contributions to increasing AI intelligence will become much greater as AIs become more intelligent: replication, cultural learning, and recursive improvement.

In terms of replication, AIs are much less constrained than humans: it’s very easy to create a duplicate of an AI which has all the same skills and knowledge as the original. The cost of compute for doing so is likely to be many times smaller than the original cost of training an AGI (since training usually involves running many copies of an AI much faster than they’d need to be run for real-world tasks). Duplication currently allows us to apply a single AI to many tasks, but not to expand the range of tasks which that AI can achieve. However, we should expect AGIs to be able to decompose difficult tasks into subtasks which can be tackled more easily, just as humans can. So duplicating such an AGI could give rise to a superintelligence composed not of a single AGI, but rather a large group of them (which, following Bostrom, I’ll call a collective AGI), which can carry out significantly more complex tasks than the original can.^[3] Because of the ease and usefulness of duplicating an AGI, I think that collective AGIs should be our default expectation for how superintelligence will be deployed.

The efficacy of a collective AGI might be limited by coordination problems between its members. However, most of the arguments given in the previous paragraphs are also reasons why individual AGIs will be able to surpass us at the skills required for coordination (such as language processing and theories of mind). One particularly useful skill is cultural learning: we should expect AGIs to be able to acquire knowledge from each other and then share their own discoveries in turn, allowing a collective AGI to solve harder problems than any individual AGI within it could. The development of this ability in humans is what allowed the dramatic rise of civilisation over the last ten thousand years. Yet there is little reason to believe that we have reached the peak of this ability, or that AGIs couldn’t have a much larger advantage over a human than that human has over a chimp, in acquiring knowledge from other agents.

Thirdly, AGIs will be able to improve the training processes used to develop their successors, which then improve the training processes used to develop their successors, and so on, in a process of recursive improvement.^[4] Previous discussion has mostly focused on recursive self-improvement, involving a single AGI “rewriting its own source code”. However, I think it’s more appropriate to focus on the broader phenomenon of AIs advancing AI research, for several reasons. Firstly, due to the ease of duplicating AIs, there’s no meaningful distinction between an AI improving “itself” versus creating a successor that shares many of its properties. Secondly, modern AIs are more accurately characterised as models which could be retrained, rather than software which could be rewritten: almost all of the work of making a neural network intelligent is done by an optimiser via extensive training. Even a superintelligent AGI would have a hard time significantly improving its cognition by modifying its neural weights directly; it seems analogous to making a human more intelligent via brain surgery (albeit with much more precise tools than we have today). So it’s probably more accurate to think about self-modification as the process of an AGI modifying its high-level architecture or training regime, then putting itself through significantly more training. This is very similar to how we create new AIs today, except with humans playing a much smaller role. Thirdly, if the intellectual contribution of humans does shrink significantly, then I don’t think it’s useful to require that humans are entirely out of the loop for AI behaviour to qualify as recursive improvement (although we can still distinguish between cases with more or less human involvement).

These considerations reframe the classic view of recursive self-improvement in a number of ways. For example, the retraining step may be bottlenecked by compute even if an AGI is able to design algorithmic improvements very fast. And for an AGI to trust that its goals will remain the same under retraining will likely require it to solve many of the same problems that the field of AGI safety is currently tackling - which should make us more optimistic that the rest of the world could solve those problems before a misaligned AGI undergoes recursive self-improvement. However, to be clear, this reframing doesn’t imply that recursive improvement will be unimportant. Indeed, since AIs will eventually be the primary contributors to AI research, recursive improvement as defined here will eventually become the key driver of progress. I’ll discuss the implications of this claim in the section on Control.

So far I’ve focused on how superintelligences might come about, and what they will be able to do. But how will they decide what to actually do? For example, will the individuals within a collective AGI even want to cooperate with each other to pursue larger goals? Will an AGI capable of recursive improvement have any reason to do so? I’m wary of phrasing these questions in terms of the goals and motivations of AGIs, without exploring more thoroughly what those terms actually mean. That’s the focus of the next section.

[1] Unlike the standard usage, in this technical sense an “environment” also includes a specification of the input-output channels the agent has access to (such as motor outputs), so that solving the task only requires an agent to process input information and communicate output information.
[2] This observation is closely related to Moravec’s paradox, which I discuss in more detail in the section on Goals and Agency. Perhaps the most salient example is how easy it was for AIs to beat humans at chess.
[3] It’s not quite clear whether the distinction between “single AGIs” and collective AGIs makes sense in all cases, considering that a single AGI can be composed of many modules which might be very intelligent in their own right. But since it seems unlikely that there will be hundreds or thousands of modules which are each generally intelligent, I think that the distinction will in practice be useful. See also the discussion of “collective superintelligence” in Bostrom’s Superintelligence.
[4] Whether it’s more likely that the successor agent will be an augmented version of the researcher AGI itself or a different, newly-trained AGI is an important question, but one which doesn’t affect the argument as made here.
