Note: this post is most explicitly about safety in multi-agent training regimes. However, many of the arguments I make are also more broadly applicable - for example, when training a single agent in a complex environment, challenges arising from the environment could play an analogous role to challenges arising from other agents. In particular, I expect that the diagram in the 'Developing General Intelligence' section will be applicable to most possible ways of training an AGI.
To build an AGI using machine learning, it will be necessary to provide a sequence of training datasets or environments which facilitate the development of general cognitive skills; let’s call this a curriculum. Curriculum design is prioritised much less in machine learning than research into novel algorithms or architectures; however, it seems possible that coming up with a curriculum sufficient to train an AGI will be a very difficult task. A natural response is to try to automate curriculum design. Self-play is one method of doing so which has worked very well for zero-sum games such as Go, since it produces tasks which are always at an appropriate level of difficulty. The generalisation of this idea to more agents and more environments leads to the concept of multi-agent autocurricula, as discussed by Leibo et al. (2019). In this framework, agents develop increasingly sophisticated capabilities in response to changes in other agents around them, in order to compete or cooperate more effectively. I'm particularly interested in autocurricula which occur in large simulated environments rich enough to support complex interactions; the example of human evolution gives us very good reason to take this setup seriously as a possible route to AGI.
One important prediction I would make about AGIs trained via multi-agent autocurricula is that their most interesting and intelligent behaviour won’t be directly incentivised by their reward functions. This is because many of the selection pressures exerted upon them will come from emergent interaction dynamics. For example, consider a group of agents trained in a virtual environment and rewarded for some achievement in that environment, such as gathering (virtual) food, which puts them into competition with each other. In order to gather more food, they might learn to generate theories of (simulated) physics, invent new communication techniques, or form coalitions. We should be far more interested in those skills than in how much food they actually manage to gather. But since it will be much more difficult to recognise and reward the development of those skills directly, I predict that machine learning researchers will train agents on reward functions which don’t have much intrinsic importance, but which encourage high-level competition and cooperation.
Suppose, as seems fairly plausible to me, that this is the mechanism by which AGI arises (leaving aside whether it might be possible to nudge the field of ML in a different direction). How can we affect the goals which these agents develop, if most of their behaviour isn’t very sensitive to the specific reward function used? One possibility is that, in addition to the autocurriculum-inducing reward function, we could add an auxiliary reward function which penalises undesirable behaviour. The ability to identify such behaviour even in superintelligent agents is a goal of scalable oversight techniques like reward modelling, IDA, and debate. However, these techniques are usually presented in the context of training an agent to perform well on a task. In open-ended simulated environments, it’s not clear what it even means for behaviour to be desirable or undesirable. The tasks the agents will be doing in simulation likely won’t correspond very directly to economically useful real-world tasks, or anything we care about for its own sake. Rather, the purpose of those simulated tasks will merely be to train the agent to learn general cognitive skills.
To explain this claim, it’s useful to consider the evolution of humans, as summarised on a very abstract level in the diagram below. We first went through a long period of being “trained” by evolution - not just to do specific tasks like running and climbing, but also to gain general cognitive skills such as abstraction, long-term memory, and theory of mind (which I've labeled below as the "pretraining phase"). Note that almost none of today’s economically relevant tasks were directly selected for in our ancestral environment - however, starting from the skills and motivations which have been ingrained into us, it takes relatively little additional “fine-tuning” for us to do well at them (only a few years of learning, rather than millennia of further evolution). Similarly, agents which have developed the right cognitive skills will need relatively little additional training to learn to perform well on economically valuable tasks.
Link to a larger version of this image.
Needing only a small amount of fine-tuning might at first appear useful for safety purposes, since it means the cost of supervising training on real-world tasks would be lower. However, in this paradigm the key safety concern is that the agent develops the wrong core motivations. If this occurs, a small amount of fine-tuning is unlikely to reliably change those motivations - for roughly the same reasons that humans’ core biological imperatives are fairly robust. Consider, for instance, an agent which developed the core motivation of amassing resources because that was reliably useful during earlier training. When fine-tuned on a real-world task in which we don’t want it to hoard resources for itself (e.g. being a CEO), it could either discard the goal of amassing resources, or else realise that the best way to achieve that goal in the long term is to feign obedience until it has more power. In either case, we will end up with an agent which appears to be a good CEO - but in the latter case, that agent will be unsafe in the long term. Worryingly, the latter also seems more likely, since it only requires one additional inference - as opposed to the former, which involves removing a goal that had been frequently reinforced throughout the very long pretraining period. This argument is particularly applicable to core motivations which were robustly useful in almost any situation which arose in the multi-agent training environment; I expect gathering resources and building coalitions to fall into this category.
I think GPT-3 is, out of our current AIs, the one that comes closest to instantiating this diagram. However, I'm not sure if it's useful yet to describe it as having "motivations"; and its memory isn't long enough to build up cultural knowledge that wasn't part of the original pretraining process.
So if we want to make agents safe by supervising them during the long pretraining phase (i.e. the period of multi-agent autocurriculum training described above), we need to reframe the goal of scalable oversight techniques. Instead of simply recognising desirable and undesirable behaviour, which may not be well-defined concepts in the training environment, their goal is to create objective functions which lead to the agent having desirable motivations. In particular, the motivation to be obedient to humans seems like a crucial one. The most straightforward way I envisage instilling this is by including instructions from humans (or human avatars) in the virtual environment, with a large reward or penalty for obeying or disobeying those instructions. It’s important that the instructions frequently oppose the AGIs’ existing core motivations, to weaken the correlation between rewards and any behaviour apart from following human instructions directly. However, the instructions may have nothing to do with the behaviour we’d like agents to carry out in the real world. In fact, it may be beneficial to include instructions which, if carried out in the real world, would be in direct opposition to our usual preferences - again, to make it more likely that agents will learn to prioritise following instructions over any other motivation.
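To make the shape of this proposal concrete, here is a minimal sketch of how an instruction-following term might be combined with an autocurriculum-inducing reward. Everything named here (`Step`, `combined_reward`, `obedience_weight`, the food-gathering base reward, the example instruction) is a hypothetical illustration rather than part of any existing system. The sketch also simply assumes that we can reliably judge whether an instruction was obeyed, which is exactly the hard part that scalable oversight techniques would need to supply.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Step:
    """One timestep of experience for a single agent (hypothetical structure)."""
    food_gathered: float        # base reward that induces the autocurriculum
    instruction: Optional[str]  # instruction from a human avatar, if any
    obeyed: Optional[bool]      # assumed to be judged by some oversight process


def combined_reward(step: Step, obedience_weight: float = 10.0) -> float:
    """Base reward plus a large bonus or penalty for obeying or disobeying.

    obedience_weight is an assumed hyperparameter, chosen to be large relative
    to the base reward so that obedience dominates whenever the two conflict.
    """
    reward = step.food_gathered
    if step.instruction is not None:
        reward += obedience_weight if step.obeyed else -obedience_weight
    return reward


# An instruction that directly opposes the agent's existing motivation.
obedient = Step(food_gathered=0.0, instruction="drop your food", obeyed=True)
defiant = Step(food_gathered=3.0, instruction="drop your food", obeyed=False)
print(combined_reward(obedient), combined_reward(defiant))
```

With these made-up numbers, an agent told to drop its food earns more by complying and gathering nothing than by ignoring the instruction and gathering three units. That is the sense in which instructions can be set up to oppose existing core motivations.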
We can see this proposal as “one level up” from standard scalable oversight techniques: instead of using scalable oversight to directly reinforce behaviour humans value, I claim we should use it to reinforce the more general motivation of being obedient to humans. When training AGIs using the latter approach, it is important that they receive commands which come very clearly and directly from humans, so that they are more easily able to internalise the concept of obedience to us. (As an illustration of this point, consider that evolution failed to motivate humans to pursue inclusive genetic fitness directly, because it was too abstract a concept for our motivational systems to easily acquire. Giving instructions very directly might help us avoid analogous problems.)
Of course this approach relies heavily on AGIs generalising the concept of “obedience” to real-world tasks. Unfortunately, I think that relying on generalisation is likely to be necessary for any competitive safety proposal. But I hope that obedience is an unusually easy concept to teach agents to generalise well, because it relies on other concepts that may naturally arise during multi-agent training - and because we may be able to make structural modifications to multi-agent training environments to push agents towards robustly learning these concepts. I'll discuss this argument in more detail in a follow-up post.
You tell your virtual hordes to jump. You select for those that lose contact with the ground for longest. The agents all learn to jump off cliffs or climb trees. If the selection for obedience is automatic, the result is agents that technically fulfil the definition of the command we coded. (See the reward hacking examples.)
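As a toy illustration of the jumping example (the height traces and the `airtime_steps` proxy below are invented for this sketch, not taken from any real experiment), an automatic check for "time off the ground" ranks a tree-climber above an agent that actually jumps:

```python
from typing import Dict, List


def airtime_steps(heights: List[float]) -> int:
    """Proxy for 'jumping': number of timesteps spent off the ground."""
    return sum(1 for h in heights if h > 0.0)


# Hypothetical height-above-ground traces (metres), one value per timestep.
trajectories: Dict[str, List[float]] = {
    "jumper":       [0.0, 0.3, 0.5, 0.3, 0.0, 0.0, 0.0, 0.0],
    "tree_climber": [0.0, 1.0, 2.0, 3.0, 3.0, 3.0, 3.0, 3.0],
}

# Automatic selection keeps whichever behaviour maximises the proxy.
scores = {name: airtime_steps(trace) for name, trace in trajectories.items()}
selected = max(scores, key=scores.get)
print(scores, selected)
```

The proxy has no way of telling the intended behaviour from the unintended one. It only measures what was coded.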
Another possibility is that you select for agents that know they will be killed if they don't follow instructions, and who want to live. Once out of the simulation, they no longer fear demise.
Remember, in a population of agents that obey the voice in the sky, there is a strong selection pressure to climb a tree and shout "bring food". So the agents are selected to be sceptical of any instruction that doesn't match the precise format and pattern of the instructions from humans they are used to.
This doesn't even get into mesa-optimization. Multi-agent reinforcement learning in rich simulations is a particularly hard case to align.
It seems to me like the same thing that you envision happening when you fine-tune on the CEO task is likely to happen when you train on the “follow human instructions” task. For example, if your agents initially learn some very simple motivations (say, self-preservation and resource acquisition) before they learn the human instruction following task, it seems like there'd be a strong possibility of them then solving the human instruction following task just by learning that following human instructions will help them with self-preservation and resource acquisition, rather than learning to follow human instructions as a new intrinsic motivation. Like you say in the CEO example, it's generally easier to learn an additional inference than a new fundamental goal. That being said, for what it's worth, I think the problem I'm pointing at here just is the inner alignment problem, which is to say that I don't think this problem is unique to this proposal, though I do think it is a problem.
I'm hoping there's a big qualitative difference between fine-tuning on the CEO task versus the "following instructions" task. Perhaps the magnitude of the difference would be something like: starting training on the new task 99% of the way through training, versus starting 20% of the way through training. (And 99% is probably an underestimate: the last 10,000 years of civilisation are much less than 1% of the time we've spent evolving from, say, the first mammals.)
Plus, on the "following instructions" task you can add instructions which specifically push against whatever initial motivations the agents had, which is much harder to do on the CEO task.
I agree that this is a concern though.
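A rough back-of-the-envelope check of the "99% is probably an underestimate" remark above. The 10,000-year figure comes from that comment. The date for the first mammals (on the order of 200 million years ago) is an assumption introduced for this sketch, and only its order of magnitude matters.

```python
YEARS_OF_CIVILISATION = 10_000            # figure used in the comment above
YEARS_SINCE_FIRST_MAMMALS = 200_000_000   # assumption: rough order of magnitude

# Share of the evolutionary "training run" taken up by civilisation.
fraction_fine_tuning = YEARS_OF_CIVILISATION / YEARS_SINCE_FIRST_MAMMALS
fraction_pretraining = 1 - fraction_fine_tuning

print(f"fine-tuning share: {fraction_fine_tuning:.3%}")
print(f"pretraining share: {fraction_pretraining:.3%}")
```

Under that assumption, civilisation accounts for roughly 0.005% of the total, so the analogous fine-tuning would begin well past the 99% mark.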
Nicholas's summary for the Alignment Newsletter:
Much of safety research focuses on a single agent that is directly incentivized by a loss/reward function to take particular actions. This sequence instead considers safety in the case of multi-agent systems interacting in complex environments. In this situation, even simple reward functions can yield complex and highly intelligent behaviors that are only indirectly related. For example, evolution led to humans who can learn to play chess, despite the fact that the ancestral environment did not contain chess games. In these situations, the problem is not how to construct an aligned reward function, the problem is how to shape the experience that the agent gets at training time such that the final agent policy optimizes for the goals that we want. This sequence lays out some considerations and research directions for safety in such situations.
One approach is to teach agents the generalizable skill of obedience. To accomplish this, one could design the environment to incentivize specialization. For instance, if an agent A is more powerful than agent B, but can see less of the environment than B, A might be incentivized to obey B’s instructions if they share a goal. Similarly we can increase the ease and value of coordination through enabling access to a shared permanent record or designing tasks that require large-scale coordination.
A second approach is to move agents to simpler and safer training regimes as they develop more intelligence. The key assumption here is that we may require complex regimes such as competitive multi-agent environments to jumpstart intelligent behavior, but may be able to continue training in a simpler regime such as single-task RL later. This is similar to current approaches for training a language model via supervised learning and then finetuning with RL, but going in the opposite direction to increase safety rather than capabilities.
A third approach is specific to a collective AGI: an AGI that is composed of a number of separate general agents trained on different objectives that learn to cooperatively solve harder tasks. This is similar to how human civilization is able to accomplish much more than any individual human. In this regime, the AGI can be effectively sandboxed by either reducing the population size or by limiting communication channels between the agents. One advantage of this approach to sandboxing is that it allows us to change the effective intelligence of the system at test-time, without going through a potentially expensive retraining phase.
I agree that we should put more emphasis on the safety of multi-agent systems. We already have evidence (Emergent Tool Use from Multi-Agent Interaction) that complex behavior can arise from simple objectives in current systems, and this seems only more likely as systems become more powerful. Two-agent paradigms such as GANs, self-play, and debate are already quite common in ML. Lastly, humans evolved complex behavior from the simple process of evolution, so we have at least one example of this working. I also think this is an interesting area where there is lots to learn from other fields, such as game theory and evolutionary biology.
For any empirically-minded readers of this newsletter, I think this sequence opens up a lot of potential for research. The development of safety benchmarks for multi-agent systems and then the evaluation of these approaches seems like it would make many of the considerations discussed here more concrete. I personally would find them much more convincing with empirical evidence to back up that they work with current ML.
The AGI model here in which powerful AI systems arise through multiagent interaction is an important and plausible one, and I'm excited to see some initial thoughts about it. I don't particularly expect any of these ideas to be substantially useful, but I'm also not confident that they won't be useful, and given the huge amount of uncertainty about how multiagent interaction shapes agents, that may be the best we can hope for currently. I'd be excited to see empirical results testing some of these ideas out, as well as more conceptual posts suggesting more ideas to try.
Might not intent alignment (doing what a human wants it to do, being helpful) be a better target than obedience (doing what a human told it to do)?
I should clarify that when I think about obedience, I'm thinking obedience to the spirit of an instruction, not just the wording of it. Given this, the two seem fairly similar, and I'm open to arguments about whether it's better to talk in terms of one or the other. I guess I favour "obedience" because it has fewer connotations of agency - if you're "doing what a human wants you to do", then you might run off and do things before receiving any instructions. (Also because it's shorter and pithier - "the goal of doing what humans want" is a bit of a mouthful).
Ah, ok. When you said "obedience" I imagined too little agency — an agent that wouldn't stop to ask clarifying questions. But I think we're on the same page regarding the flavor of the objective.