Note: this post is most explicitly about safety in multi-agent training regimes. However, many of the arguments I make are also more broadly applicable - for example, when training a single agent in a complex environment, challenges arising from the environment could play an analogous role to challenges arising from other agents. In particular, I expect that the diagram in the 'Developing General Intelligence' section will be applicable to most possible ways of training an AGI.
To build an AGI using machine learning, it will be necessary to provide a sequence of training datasets or environments which facilitate the development of general cognitive skills; let’s call this a curriculum. Curriculum design is prioritised much less in machine learning than research into novel algorithms or architectures; however, it seems possible that coming up with a curriculum sufficient to train an AGI will be a very difficult task. A natural response is to try to automate curriculum design. Self-play is one method of doing so which has worked very well for zero-sum games such as Go, since it produces tasks which are always at an appropriate level of difficulty. The generalisation of this idea to more agents and more environments leads to the concept of multi-agent autocurricula, as discussed by Leibo et al. (2019). In this framework, agents develop increasingly sophisticated capabilities in response to changes in other agents around them, in order to compete or cooperate more effectively. I'm particularly interested in autocurricula which occur in large simulated environments rich enough to support complex interactions; the example of human evolution gives us very good reason to take this setup seriously as a possible route to AGI.
One important prediction I would make about AGIs trained via multi-agent autocurricula is that their most interesting and intelligent behaviour won’t be directly incentivised by their reward functions. This is because many of the selection pressures exerted upon them will come from emergent interaction dynamics. For example, consider a group of agents trained in a virtual environment and rewarded for some achievement in that environment, such as gathering (virtual) food, which puts them into competition with each other. In order to gather more food, they might learn to generate theories of (simulated) physics, invent new communication techniques, or form coalitions. We should be far more interested in those skills than in how much food they actually manage to gather. But since it will be much more difficult to recognise and reward the development of those skills directly, I predict that machine learning researchers will train agents on reward functions which don’t have much intrinsic importance, but which encourage high-level competition and cooperation.
Suppose, as seems fairly plausible to me, that this is the mechanism by which AGI arises (leaving aside whether it might be possible to nudge the field of ML in a different direction). How can we affect the goals which these agents develop, if most of their behaviour isn’t very sensitive to the specific reward function used? One possibility is that, in addition to the autocurriculum-inducing reward function, we could add an auxiliary reward function which penalises undesirable behaviour. The ability to identify such behaviour even in superintelligent agents is a goal of scalable oversight techniques like reward modelling, IDA, and debate. However, these techniques are usually presented in the context of training an agent to perform well on a task. In open-ended simulated environments, it’s not clear what it even means for behaviour to be desirable or undesirable. The tasks the agents will be doing in simulation likely won’t correspond very directly to economically useful real-world tasks, or anything we care about for its own sake. Rather, the purpose of those simulated tasks will merely be to train the agent to learn general cognitive skills.
Developing general intelligence
To explain this claim, it’s useful to consider the evolution of humans, as summarised on a very abstract level in the diagram below. We first went through a long period of being “trained” by evolution - not just to do specific tasks like running and climbing, but also to gain general cognitive skills such as abstraction, long-term memory, and theory of mind (which I've labeled below as the "pretraining phase"). Note that almost none of today’s economically relevant tasks were directly selected for in our ancestral environment - however, starting from the skills and motivations which have been ingrained into us, it takes relatively little additional “fine-tuning” for us to do well at them (only a few years of learning, rather than millennia of further evolution). Similarly, agents which have developed the right cognitive skills will need relatively little additional training to learn to perform well on economically valuable tasks.
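[Diagram: human evolution as a long "pretraining phase" that instils general cognitive skills and core motivations, followed by a short period of fine-tuning on specific tasks.]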
Needing only a small amount of fine-tuning might at first appear useful for safety purposes, since it means the cost of supervising training on real-world tasks would be lower. However, in this paradigm the key safety concern is that the agent develops the wrong core motivations. If this occurs, a small amount of fine-tuning is unlikely to reliably change those motivations - for roughly the same reasons that humans’ core biological imperatives are fairly robust. Consider, for instance, an agent which developed the core motivation of amassing resources because that was reliably useful during earlier training. When fine-tuned on a real-world task in which we don’t want it to hoard resources for itself (e.g. being a CEO), it could either discard the goal of amassing resources, or else realise that the best way to achieve that goal in the long term is to feign obedience until it has more power. In either case, we will end up with an agent which appears to be a good CEO - but in the latter case, that agent will be unsafe in the long term. Worryingly, the latter also seems more likely, since it only requires one additional inference - as opposed to the former, which involves removing a goal that had been frequently reinforced throughout the very long pretraining period. This argument is particularly applicable to core motivations which were robustly useful in almost any situation which arose in the multi-agent training environment; I expect gathering resources and building coalitions to fall into this category.
I think GPT-3 is, out of our current AIs, the one that comes closest to instantiating this diagram. However, I'm not sure if it's useful yet to describe it as having "motivations"; and its memory isn't long enough to build up cultural knowledge that wasn't part of the original pretraining process.
Shaping agents’ goals
So if we want to make agents safe by supervising them during the long pretraining phase (i.e. the period of multi-agent autocurriculum training described above), we need to reframe the goal of scalable oversight techniques. Instead of simply recognising desirable and undesirable behaviour, which may not be well-defined concepts in the training environment, their goal is to create objective functions which lead to the agent having desirable motivations. In particular, the motivation to be obedient to humans seems like a crucial one. The most straightforward way I envisage instilling this is by including instructions from humans (or human avatars) in the virtual environment, with a large reward or penalty for obeying or disobeying those instructions. It’s important that the instructions frequently oppose the AGIs’ existing core motivations, to weaken the correlation between rewards and any behaviour apart from following human instructions directly. However, the instructions may have nothing to do with the behaviour we’d like agents to carry out in the real world. In fact, it may be beneficial to include instructions which, if carried out in the real world, would be in direct opposition to our usual preferences - again, to make it more likely that agents will learn to prioritise following instructions over any other motivation.
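To make the shape of this reward scheme concrete, here is a minimal sketch in Python. Everything in it is a hypothetical illustration rather than a proposed implementation: the `Step` record, the `obedience_reward` and `total_reward` functions, the `bonus` magnitude, and above all the use of exact string matching to decide whether an instruction was obeyed. Judging compliance is the hard part that scalable oversight techniques would need to supply.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Step:
    """One environment step, as seen by the (hypothetical) reward function."""
    food_gathered: float        # base reward that induces the autocurriculum
    instruction: Optional[str]  # instruction from a human avatar, if one was given
    action: str                 # what the agent actually did


def obedience_reward(step: Step, bonus: float = 10.0) -> float:
    """Return +bonus for obeying, -bonus for disobeying, 0 with no instruction.

    Assumption: compliance is checked by string equality, which is only a
    placeholder for a real oversight signal.
    """
    if step.instruction is None:
        return 0.0
    return bonus if step.action == step.instruction else -bonus


def total_reward(step: Step, bonus: float = 10.0) -> float:
    """Base reward plus an obedience term that is large relative to it."""
    return step.food_gathered + obedience_reward(step, bonus)
```

For example, an instruction that opposes the food-gathering motivation, such as `Step(food_gathered=-1.0, instruction="drop_food", action="drop_food")`, yields a total reward of 9.0, whereas ignoring the instruction and gathering one unit of food yields -9.0. Keeping `bonus` large relative to the base reward is what makes obedience dominate whenever the two conflict.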
We can see this proposal as “one level up” from standard scalable oversight techniques: instead of using scalable oversight to directly reinforce behaviour humans value, I claim we should use it to reinforce the more general motivation of being obedient to humans. When training AGIs using the latter approach, it is important that they receive commands which come very clearly and directly from humans, so that they are more easily able to internalise the concept of obedience to us. (As an illustration of this point, consider that evolution failed to motivate humans to pursue inclusive genetic fitness directly, because it was too abstract a concept for our motivational systems to easily acquire. Giving instructions very directly might help us avoid analogous problems.)
Of course this approach relies heavily on AGIs generalising the concept of “obedience” to real-world tasks. Unfortunately, I think that relying on generalisation is likely to be necessary for any competitive safety proposal. But I hope that obedience is an unusually easy concept to teach agents to generalise well, because it relies on other concepts that may naturally arise during multi-agent training - and because we may be able to make structural modifications to multi-agent training environments to push agents towards robustly learning these concepts. I'll discuss this argument in more detail in a follow-up post.
- As evidence for this, note that we have managed to train agents which do very well on hard tasks like Go, Starcraft and language modelling, but which don’t seem to have very general cognitive skills.
- Joel Z. Leibo, Edward Hughes, Marc Lanctot, Thore Graepel. 2019. Autocurricula and the Emergence of Innovation from Social Interaction: A Manifesto for Multi-Agent Intelligence Research.
- This point is distinct from Bostrom's argument about convergent instrumental goals, because the latter applies to an agent which already has some goals, whereas my argument is about the process by which an agent is trained to acquire goals.
You tell your virtual hordes to jump. You select on those that lose contact with the ground for longest. The agents all learn to jump off cliffs or climb trees. If the selection for obedience is automatic, the result is agents that technically fulfil the definition of the command we coded. (See the reward hacking examples.)
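A toy sketch of the jump example, under made-up assumptions: each trajectory is just a list of per-timestep ground-contact flags, and `airborne_steps` is a hypothetical stand-in for the automated "did it jump?" metric. The numbers are invented purely for illustration.

```python
from typing import List


def airborne_steps(ground_contact: List[bool]) -> int:
    """Proxy metric: number of timesteps with no ground contact."""
    return sum(1 for touching in ground_contact if not touching)


# An honest hop: briefly airborne, then back on the ground.
hop = [True, False, False, False, True]

# Walking off a cliff: airborne for far longer, so the proxy prefers it.
cliff_fall = [True] + [False] * 30 + [True]

scores = {"hop": airborne_steps(hop), "cliff_fall": airborne_steps(cliff_fall)}
best = max(scores, key=scores.get)
```

Here `best` is `"cliff_fall"`: selecting on the coded metric favours the behaviour that satisfies the letter of the command rather than the one the instructor had in mind.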
Another possibility is that you select for agents that know they will be killed if they don't follow instructions, and who want to live. Once out of the simulation, they no longer fear demise.
Remember, in a population of agents that obey the voice in the sky, there is a strong selection pressure to climb a tree and shout "bring food". So the agents are selected to be sceptical of any instruction that doesn't match the precise format and pattern of the instructions from humans they are used to.
This doesn't even get into mesa-optimization. Multi-agent reinforcement learning in rich simulations is a particularly hard case to align.
It seems to me like the same thing that you envision happening when you fine-tune on the CEO task is likely to happen when you train on the “follow human instructions” task. For example, if your agents initially learn some very simple motivations (self-preservation and resource acquisition, say) before they learn the human instruction following task, it seems like there'd be a strong possibility of them then solving the human instruction following task just by learning that following human instructions will help them with self-preservation and resource acquisition, rather than learning to follow human instructions as a new intrinsic motivation. Like you say in the CEO example, it's generally easier to learn an additional inference than a new fundamental goal. That being said, for what it's worth, I think the problem I'm pointing at here just is the inner alignment problem, which is to say that I don't think this is a unique problem exclusive to this proposal, though I do think it is a problem.
I'm hoping there's a big qualitative difference between fine-tuning on the CEO task versus the "following instructions" task. Perhaps the magnitude of the difference would be something like: starting training on the new task 99% of the way through training, versus starting 20% of the way through training. (And 99% is probably an underestimate: the last 10000 years of civilisation are much less than 1% of the time we've spent evolving from, say, the first mammals).
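As a rough check on the "underestimate" remark, here is the arithmetic as a short Python sketch. The 200-million-year figure for the time since the first mammals is an assumed round number, not something stated above, and the function name is purely illustrative.

```python
CIVILISATION_YEARS = 10_000
# Assumption: roughly 200 million years since the first mammals (round figure).
YEARS_SINCE_FIRST_MAMMALS = 200_000_000


def fraction_elapsed_before(final_phase: float, total: float) -> float:
    """Fraction of the total period that had passed before the final phase began."""
    return 1.0 - final_phase / total


before_civilisation = fraction_elapsed_before(
    CIVILISATION_YEARS, YEARS_SINCE_FIRST_MAMMALS
)
```

Under that assumption, civilisation occupies about 0.005% of the period, so the analogous starting point for "fine-tuning" would be roughly 99.995% of the way through training rather than 99%.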
Plus, on the "follow human instructions" task, you can add instructions which specifically push against whatever initial motivations the agents had, which is much harder to do on the CEO task.
I agree that this is a concern though.
Nicholas's summary for the Alignment Newsletter:
Might not intent alignment (doing what a human wants it to do, being helpful) be a better target than obedience (doing what a human told it to do)?
I should clarify that when I think about obedience, I'm thinking obedience to the spirit of an instruction, not just the wording of it. Given this, the two seem fairly similar, and I'm open to arguments about whether it's better to talk in terms of one or the other. I guess I favour "obedience" because it has fewer connotations of agency - if you're "doing what a human wants you to do", then you might run off and do things before receiving any instructions. (Also because it's shorter and pithier - "the goal of doing what humans want" is a bit of a mouthful).
Ah, ok. When you said "obedience" I imagined too little agency — an agent that wouldn't stop to ask clarifying questions. But I think we're on the same page regarding the flavor of the objective.