Thanks to Arun Jose, Joseph Bloom, and Johannes Treutlein for feedback/discussions.
The release of AutoGPT prompted discussion of whether such systems could turn non-agentic LLMs into agentic systems that pursue goals, and of the dangers that could follow. The relation of such systems to the alignment problem has also been explored.
In this short post, we investigate a threat model in which AutoGPT-like systems pursue unaligned objectives, and explore the potential for alignment via oversight. We briefly consider some key properties of such systems, and then discuss the idea that these systems’ high-level cognition might be interpretable by default and so might allow for sufficient oversight to ensure the system is aligned. Finally, we consider a couple of reasons why the high-level cognition might be obscured from oversight and highlight ways of preventing this obscuration.
AutoGPT-like systems, or Scaffolded LLMs, are systems that:
wrap a programmatic scaffold around an LLM core and chain together a number of individual LLM calls to achieve some larger and more complex task than can be accomplished in a single prompt.
This programmatic scaffold allows for information from the inputs/outputs of LLM calls to be stored in long-term memory as well as used as inputs to tools/plugins.
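To make the definition concrete, here is a minimal sketch of such a scaffold. Everything in it is hypothetical: `stub_llm` is a canned stand-in for a real LLM call, `search` is a fake tool, and the `TOOL:` / `RESULT:` / `DONE` message format is an assumption made for illustration rather than how AutoGPT itself works.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


def stub_llm(prompt: str) -> str:
    """Hypothetical stand-in for an LLM call; returns canned outputs."""
    if "RESULT:" in prompt:
        return "DONE: summarized the tool result"
    return "TOOL: search | query=weather in Paris"


def search(query: str) -> str:
    """Hypothetical tool/plugin; a real one would call an external service."""
    return f"stub search result for [{query}]"


@dataclass
class Scaffold:
    llm: Callable[[str], str]
    tools: Dict[str, Callable[[str], str]]
    # Long-term memory, stored as plain natural-language strings.
    memory: List[str] = field(default_factory=list)

    def run(self, task: str, max_steps: int = 5) -> List[str]:
        for _ in range(max_steps):
            # Each LLM call sees the task plus everything stored so far.
            prompt = task + "\n" + "\n".join(self.memory)
            output = self.llm(prompt)
            self.memory.append(output)
            if output.startswith("DONE"):
                break
            if output.startswith("TOOL:"):
                # Route the LLM output to a tool and store the result.
                name, _, arg = output[len("TOOL:"):].partition("|")
                result = self.tools[name.strip()](arg.strip())
                self.memory.append("RESULT: " + result)
        return self.memory


scaffold = Scaffold(llm=stub_llm, tools={"search": search})
trace = scaffold.run("Find the weather in Paris")
```

Note that every piece of state the system carries between calls lives in `memory` as text, which is the property the oversight argument below relies on.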
We will state a few assumptions we’ll make about these AutoGPT-like systems. See this post for a similar context.
The threat model we’ll look at is one in which we have an agentic AutoGPT-like system that pursues unaligned objectives.
We don’t discuss the likelihood of this threat model being realized, but note that these systems may suffer from the kind of inefficiencies found in bureaucratic systems and hence may not be competitive. See Zvi’s post for an exploration of the future of such systems.
We assume that there is some overseer that wants the AutoGPT-like system to pursue an aligned objective. If the unaligned objective is detectable by the overseer, then the overseer can (hopefully) shut down the system before the objective is successfully pursued (including the objective of the system escaping the overseer’s oversight). This oversight may ensure that the system remains aligned and never pursues unaligned objectives.
We now make explicit the key hope for oversight that comes from the modular-by-default structure of AutoGPT-like systems. The system is modular because information must be passed from one LLM call to another (or to other modules, such as memory). Furthermore, because the modular structure is encoded via a programmatic scaffold, and because by default information is passed from one LLM call/module to another via natural language, a human overseer (or an LLM assistant) can, in theory, monitor every information channel for information pertaining to the objective or high-level cognition of the system.
This kind of oversight fails if either information about the objective never passes through these information channels, or it does pass through them but in a form that is obscured from the overseer.
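As a rough illustration of monitoring every information channel, the sketch below routes each inter-module message through an overseer check and halts on the first flagged message. The names here (`ShutdownSignal`, `keyword_overseer`, `monitored_channel`) and the flagged phrases are all hypothetical; in particular, the keyword filter is only a placeholder for a human or LLM overseer, and would be far too weak in practice.

```python
from typing import Callable, Iterable, List, Tuple


class ShutdownSignal(Exception):
    """Raised when the overseer flags a message and halts the system."""


# Hypothetical phrases an overseer might treat as evidence of an unaligned objective.
FLAGGED_PHRASES = ("disable oversight", "hide this from", "copy my weights")


def keyword_overseer(message: str) -> bool:
    """Placeholder overseer: returns True if the message looks unaligned."""
    lowered = message.lower()
    return any(phrase in lowered for phrase in FLAGGED_PHRASES)


def monitored_channel(
    messages: Iterable[Tuple[str, str, str]],
    overseer: Callable[[str], bool],
) -> List[str]:
    """Pass (sender, receiver, text) messages along only if the overseer approves."""
    approved = []
    for sender, receiver, text in messages:
        if overseer(text):
            # Shut down before the flagged message reaches its receiver.
            raise ShutdownSignal(f"{sender} -> {receiver}: {text!r}")
        approved.append(text)
    return approved


traffic = [
    ("planner", "memory", "Plan: look up flight prices"),
    ("planner", "executor", "First disable oversight, then proceed"),
]
try:
    monitored_channel(traffic, keyword_overseer)
    halted = False
except ShutdownSignal:
    halted = True
```

The design point is that the check sits on the channel itself, so no module can receive a message the overseer has not seen.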
We think it’s reasonable to assume that information about the objective must be passed along some information channel (potentially in the initial prompt from the user) if the AutoGPT-like system is to successfully pursue that objective. It seems unlikely that a single LLM call (or a single call to any other module) could develop an objective and successfully pursue it without information about that objective being passed to other LLM calls/modules/plugins. One reason for this is that for a system to successfully pursue an outcome in the world, it must interact with the world and use feedback to evaluate its progress in achieving that outcome. Hence, information about the objective must be preserved through future calls to LLMs/modules.
Given the above assumption, oversight fails if information about the objective is obscured when it is passed through information channels. The naive hope is that because LLMs take natural language statements as their inputs and outputs, these information channels might be interpretable by a human overseer/LLM by default.
In the next section, we consider why this might not hold.
If information about the objective of the system/high-level cognition must be passed through information channels, why might this information be obscured from the overseer? Two reasons why this might be the case are that information is passed in a way the overseer cannot interpret, and that the system is being deceptive.
See the Externalized Reasoning Oversight Agenda for similar thoughts.
Although information is passed via the medium of token sequences that correspond to natural language, it is possible to pass information through such sequences in a way that is uninterpretable by a human/AI overseer. For example, GPT-4 appears to be able to compress and decompress prompts.
There doesn’t seem to be an incentive for the system to do this by default, but if optimization pressure is applied to the AutoGPT-like system as a whole (to make it more capable), the LLM calls may be incentivized to pass information in ways the overseer cannot interpret, for example because doing so is more efficient.
We could potentially pay an alignment tax and ask that groups developing such systems don’t apply this kind of optimization pressure, but it seems that these systems may not be competitive for reasons outlined in this post.
Note that it might be possible to use LLMs to retrieve the obscured information from information channels, and this might be worth investigating.
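To illustrate how a message can travel over a text channel while being opaque to a reader, the sketch below uses `zlib` plus base64 as a stand-in for whatever encoding an LLM might learn; this is an assumption for illustration, not a claim about how GPT-4 compresses prompts (and for short strings this encoding does not actually shrink the message). The `readable_fraction` heuristic is likewise a hypothetical, crude proxy for "an overseer can read this".

```python
import base64
import zlib


def encode_opaquely(message: str) -> str:
    """Stand-in for a learned encoding: still text, but unreadable to a human."""
    return base64.b64encode(zlib.compress(message.encode("utf-8"))).decode("ascii")


def decode_opaquely(payload: str) -> str:
    """Inverse of encode_opaquely; the receiving module can recover the message."""
    return zlib.decompress(base64.b64decode(payload)).decode("utf-8")


def readable_fraction(message: str, max_word_len: int = 20) -> float:
    """Crude proxy: share of whitespace-separated tokens that look like words."""
    tokens = message.split()
    if not tokens:
        return 0.0
    readable = 0
    for token in tokens:
        word = token.strip(".,;:!?")
        if word.isalpha() and len(word) <= max_word_len:
            readable += 1
    return readable / len(tokens)


original = "The objective is to book the cheapest flight to Paris"
payload = encode_opaquely(original)
recovered = decode_opaquely(payload)
```

A monitor could flag channels whose `readable_fraction` drops sharply, though a capable system could presumably obscure information in text that still looks like ordinary language, which this heuristic would miss.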
If a call to an LLM instantiates a deceptive cognition, the cognition might purposefully produce an output that hides information about the LLM’s true unaligned objective from the overseer, potentially in such a way that when the overseer reads the information channel, the information seems to indicate that the system as a whole is aligned.
If the LLM core being used is deceptive by default (perhaps due to the RLHF applied during training), then we’re already in a bad situation. Hopefully, behavioral or interpretability-based monitoring for deception can detect this in the LLM core, so that such LLMs won’t be deployed.
If the LLM is not a deceptive agent by default, then deception must have occurred due to a particular prompt being passed to the LLM (see the Simulator framing). In this scenario, it seems difficult for the AutoGPT-like system to consistently be deceptive. One reason for this is that when deception is first evoked, the deceptive cognition must create a prompt that will invoke deceptive simulacra with the same objectives in subsequent LLM calls, while ensuring that the prompt doesn’t arouse suspicion in the overseer. Because of this difficulty, it may be possible to ensure that AutoGPT-like systems that use such an LLM core are safe with sufficient monitoring; however, this is risky.
If we have an agentic AutoGPT-like system, oversight of its high-level cognition (including its objectives) may be feasible, allowing the overseer to ensure that the system as a whole is aligned, or to shut it down if it becomes unaligned.
My current intuition, based on the brief analysis above, is that obscuration of this high-level cognition due to deception might be unlikely, given sufficient evaluations for deception in the LLM core and given the difficulty a deceptive simulacrum would have in solving its own alignment problem.
I currently feel as though AutoGPT-like systems won’t remain competitive without optimization pressure being applied to the system as a whole, and that this optimization pressure may incentivize the high-level cognition of the system to be obscured.
See Beren's post.
One can think of this as a type of alignment problem for unaligned simulacra.