A distillation of Evan Hubinger's training stories (for SERI MATS)

Daphne_W

This post is a distillation of Evan Hubinger's post "how do we become confident in the safety of a machine learning system?", made as part of the summer 2022 SERI MATS program. While I have attempted to understand and extrapolate Evan's opinions, this post has not been vetted. Likewise, I use training stories (and contribution stories) to describe the methodology of proposals for safe advanced AI without the endorsement of those proposals' authors and based on a relatively shallow understanding of those proposals (due to my inexperience and time constraints). The opinions presented in this post are my own unless otherwise noted.

Epistemic status: Exploratory

Some day, all too soon from now, people will deploy the AI that seals humanity's fate. There are many scenarios for how this comes about or what happens afterward. Some have multiple AIs negotiating or fighting for dominance, others one. Some think the handoff of power from humans to AI will go slowly, others fast. But whatever the case, there is one question that these people need to get right: "Is the AI we're about to deploy safe?"

For people to get this answer right when it matters, two things need to happen: we need tools to accurately determine whether an advanced AI is safe, and we need an advanced AI those tools approve of that we can deploy in time for it to matter.

Training stories are a tool to evaluate proposals for making this happen. Specifically, they're meant to analyze complete proposals for training prosaic advanced AI: advanced AI which has similar enough architecture to current systems that most research is cross-applicable. The reason to focus on complete proposals is that it gets us to pay attention to what will be important when push comes to shove. From there, we can backpropagate and get an estimate of what research paths are most beneficial. Evan thinks theoretical prosaic alignment is the most beneficial in expectation for him and many others to pursue, though he supports a broad spectrum of research.

Because I think the original post describes the central concept of training stories very well, I have avoided retreading the same ground, and instead focused on expanding its usability. I will start with some more case studies of using training stories to analyze proposals, and then expand on the original post with contribution stories: a method for analyzing research agendas by how they factor into training stories.

Training stories

Case study: Imitative amplification + intermittent oversight

In this case study, Evan's comments in An overview of 11 proposals for building safe advanced AI map well to the training story evaluation, and I don't feel like much was missing. However, the training story itself makes it clearer to me what the training goal might look like, which could help point at potential experiments with toy models that can be done today. This demonstrates how training stories can be used to clarify a path towards working AI.

Training goal specification: The goal is to end up with a model M which accurately answers whether a model similar to M is safe to run, which suggests possible tests to determine if that model is safe to run, and which answers questions about some other topic T. This can take several forms, but let's try to work one out:
M can't take all weights of M as input for inspection, so M has to look at sections of a model and share a summary of its conclusions with other M's to form a reliable answer. This could be organized like HCH. Since M is trained to imitate the output of Amp(M), which is prototypically a human consulting M and some selection of transparency tools, these summaries would be human-legible.
Mechanically, M would take as input the question it is tasked with answering, several relevant question-answer pairs from other M's, and information relevant to answering the question (either part of the M-like model or information about the topic M is intended to be deployed in). The question-answer pairs, baked-in priors from training, and the new information input could internally be combined to form a world model, with (linear combinations of) neural activations corresponding to confidence in particular beliefs. The natural-language question could then be combined with the world model to produce a natural-language response.
Training goal desirability: The reason we want such a model is so it can give accurate, reliable answers about the topic T it has been trained on, as an oracle. This could just be inspecting other models, but it could also mean answering practical questions about the world that could have transformative value.
Training rationale nudges: The plan for getting such a model is to train it through imitative amplification, intermittently performing oversight checks. Imitative amplification here means that the model is trained to imitate Amp(M), where Amp() is the amplification operator consisting of a human that can consult transparency tools and the model itself on questions concerning M's safety and the topic T. The oversight checks happen intermittently and consist of Amp(M) giving a binary answer on whether the model is safe enough to continue training. Models that fail these checks are pruned, applying evolutionary pressure for the remaining ones to be safe.
Training rationale constraints: The model must have a structure suitable for inspecting neural nets, for handling T, for natural language questions and answers, and for delegating questions to other versions of itself. It is unclear at this point what a suitable architecture for this would be. In training, the questions Amp(M) is asked have to follow a distribution suitable for M at its stage of development. Amp() must also be capable enough to allow some M's to pass the filter at all stages of development without false negatives.
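
To make the training rationale above more concrete, here is a minimal toy sketch of the loop it describes: imitate Amp(M), and every so often run an oversight check that prunes models which fail. Everything in it is a hypothetical stand-in rather than part of the proposal: ToyModel is a lookup table instead of a neural net, toy_human is a fixed dictionary instead of a person, and the flagged attribute stands in for whatever transparency tools would reveal. In the actual proposal the oversight verdict would come from Amp(M) itself, not from a hard-coded flag.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Answerer = Callable[[str], str]
Human = Callable[[str, Answerer], str]


@dataclass
class ToyModel:
    """Hypothetical stand-in for M: a lookup table, not a neural net."""
    name: str
    answers: Dict[str, str] = field(default_factory=dict)
    flagged: bool = False  # assumption: what transparency tools would reveal

    def answer(self, question: str) -> str:
        return self.answers.get(question, "unknown")


def amplify(model: ToyModel, human: Human) -> Answerer:
    """Amp(M): the human answers while being able to consult M."""
    return lambda question: human(question, model.answer)


def oversight_check(model: ToyModel) -> bool:
    """Toy binary verdict; in the proposal this comes from Amp(M)."""
    return not model.flagged


def train(population: List[ToyModel], human: Human, questions: List[str],
          steps: int, check_every: int) -> List[ToyModel]:
    for step in range(1, steps + 1):
        question = questions[(step - 1) % len(questions)]
        for model in population:
            # Imitation step: move M toward the answer Amp(M) gives.
            model.answers[question] = amplify(model, human)(question)
        if step % check_every == 0:
            # Intermittent oversight: prune models that fail the check.
            population = [m for m in population if oversight_check(m)]
    return population


def toy_human(question: str, consult: Answerer) -> str:
    """Uses its own knowledge first, then falls back on consulting M."""
    known = {"2+2": "4", "capital of France": "Paris"}
    return known.get(question, consult(question))


survivors = train(
    [ToyModel("a"), ToyModel("b", flagged=True)],
    toy_human,
    ["2+2", "capital of France"],
    steps=4,
    check_every=2,
)
print([m.name for m in survivors])
```

The sketch deliberately leaves out the hard parts named in the constraints: a real M would need an architecture that can inspect parts of M-like models, and the check would need to be reliable at every stage of training.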

The evaluation of the training story is done pretty well in the original 11 proposals post:

Training goal alignment: The model is intended to function as an oracle, with the associated potential hazards. Also, the limit of training is HCH, so it is at most as aligned as that system would be. See outer alignment in 11 proposals.
Training goal competitiveness: As performance competitiveness in 11 proposals.
Training rationale alignment: See inner alignment in 11 proposals. This is currently the main limiting factor: we don't have transparency tools sufficient to provide a rich training signal on M inspecting M-like models, especially not if the safety of M is non-trivial.
Training rationale competitiveness: As training competitiveness in 11 proposals.

Case study: Reinforcement learning + transparency tools

When attempting to make a training story for this proposal, it quickly becomes clear that the training goal is underspecified. While there are vague behavioral arguments for AI trained in cooperative environments to act 'cooperatively', there appears to be no conceptualization of an internal algorithm where this cooperation is robust in the way we want, let alone training rationale nudges that would give rise to this internal logic over others that meet the training rationale constraints of cooperative behavior in training.

This demonstrates training stories' usefulness as an analysis tool: a serious proposal for safe advanced AI is shown lacking. That doesn't mean that an RL incentive towards cooperation is useless, just that it is far from sufficient and a better plan is needed.

Case study: STEM AI

As with reinforcement learning + transparency tools, the training goal for STEM AI is underspecified. Limiting AI subject matter is a valid, albeit soft, training rationale nudge away from the AI having a conception of agents it can deceive, but it will not suffice alone. STEM AI may well have great economic value, but it's hard to imagine it being safely transformative. It might help make capabilities research safer, though - see below.

Contribution stories

In the section 'Exploring the landscape of possible training stories', Evan lists a number of goals and rationales that could be elements of a training story, and notes that many other options are possible. The section does not provide tools for how to qualitatively evaluate these potential partial strategies, which I think limits the usability of training stories for many researchers' actual efforts. For example, while Evan places high importance on transparency tools, Microscope AI is the part of Chris Olah's work that is highlighted in the overall article despite requiring additional assumptions.

Training stories help us answer the question "What training characteristics would convince us that an advanced AI we're about to deploy is safe?". I would like to expand on that with "How do we get those training characteristics?" - to give tools to think about the backpropagation to research agendas that can be tackled now. For this I'll sketch out contribution stories. As an expansion of training stories, these are also focused on prosaic AI development.

I'll split a contribution story up into the following sections:

Alignment contribution method: How and whether the research contribution would advance the development of a safe advanced AI, ideally by naming training story components and criteria that it contributes to.
Alignment contribution impact: An estimate of how much the research contributes to the odds of developing safe AI. The following criteria seem useful:
- Urgent: whether and to what extent the research's value depends on it starting now. Since some AI research agendas are likely to be non-parallelizable or to benefit other safety research, it's better to work on those over ones that aren't urgent.
- Critical: whether and to what extent the research has value for helping the final AI be safe.
- Generalizable: how fragile the research's value is to changes in the training story.
Capabilities contribution method: How and whether the research contribution would advance the development of general advanced AI capabilities, and thereby increase x-risk, ideally with training story components and criteria.
Capabilities contribution impact: An estimate of how much the research might advance general capabilities, ideally with mitigation strategies.
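
As a rough illustration of the template above, a contribution story can be written down as a small record with one field per section. This is only a sketch: the class and field names are my own hypothetical choices, and the example values are loosely paraphrased from the Grouped Loss case study in this post rather than being precise estimates.

```python
from dataclasses import dataclass


@dataclass
class AlignmentImpact:
    """The three impact criteria; free text, since estimates are qualitative."""
    urgent: str
    critical: str
    generalizable: str


@dataclass
class ContributionStory:
    """Hypothetical record mirroring the four sections of the template."""
    name: str
    alignment_method: str
    alignment_impact: AlignmentImpact
    capabilities_method: str
    capabilities_impact: str

    def summary(self) -> str:
        lines = [
            self.name,
            f"  Alignment contribution method: {self.alignment_method}",
            "  Alignment contribution impact:",
            f"    Urgent: {self.alignment_impact.urgent}",
            f"    Critical: {self.alignment_impact.critical}",
            f"    Generalizable: {self.alignment_impact.generalizable}",
            f"  Capabilities contribution method: {self.capabilities_method}",
            f"  Capabilities contribution impact: {self.capabilities_impact}",
        ]
        return "\n".join(lines)


# Example values are paraphrased, not authoritative.
story = ContributionStory(
    name="Grouped loss",
    alignment_method="spread capability gains over more training steps",
    alignment_impact=AlignmentImpact(
        urgent="low, since the nudges it supports are not yet developed",
        critical="potentially, if the competitiveness hit is small",
        generalizable="depends on how hard it is to define groups",
    ),
    capabilities_method="none expected",
    capabilities_impact="the competitiveness hit likely prevents advancement",
)
print(story.summary())
```

Keeping the criteria as free text rather than scores is intentional here: the estimates in the case studies are qualitative, and a numeric scale would suggest more precision than they have.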

Case study: Best-case inspection transparency

Alignment contribution method: Ultimately, transparency tools can be used to improve training rationale alignment by allowing a human, a safe AI, or a combination of both to inspect a model as it develops, steering away from dangerously misaligned thought patterns such as deception by penalizing their precursors. Developing transparency tools early can help build a catalog of precursors to misalignment or even desired patterns for more well-behaved models that could be selected for. While best-case inspection transparency would not be sufficient, it allows us to gather data to develop more sophisticated transparency tools.
Alignment contribution impact: As argued in "A transparency and interpretability tech tree", it seems difficult to ever be confident in the alignment of an AI if we don't have powerful transparency tools that let us know how the AI is arriving at its decisions, meaning transparency tools will be critical in any form of prosaic alignment. Getting sufficiently powerful transparency tools to guide advanced AI also seems like non-parallelizable work: we need to work our way up the tech tree with experimental research. This makes transparency urgent as well.
Transparency is broadly valuable because it makes nearly all prosaic alignment research easier to test and allows for more fine-grained models of why the model is thinking what it is.
Transparency tools will advance capabilities (see below), but in doing so they give capabilities researchers an incentive to make their models more transparent. This could also be a boon to alignment in the long run, in two ways: it lowers the alignment tax capabilities researchers would have to pay if they decide to be sufficiently cautious at a later date, and it offloads transparency research onto people who might otherwise not be inclined to do safety research. General popularity of transparency tools could even raise awareness among capabilities researchers by helping them notice potential warning shots.
Capabilities contribution method: Since transparency tools allow researchers to get more fine-grained ideas of how a model is thinking, they will also increase training rationale competitiveness of all model classes they are applied to by giving feedback on where a training process could be improved.
Capabilities contribution impact: Because transparency improves training rationale competitiveness of a wide range of models it is applied to, worlds with transparency have shorter timelines than worlds without. This is a high price to pay, but it seems likely transparency is necessary for prosaic alignment, so we simply have to make do.

Case study: Deferential AI (Stuart Russell / CHAI)

The training goal of deferential AI is a model which has an internal representation of a human's preferences which it pursues and is uncertain over. This uncertainty is then leveraged to make it defer to the human when the human tries to correct it or shut it off.

Alignment contribution method: The goal of deferential AI is to allow training goal specification to be a less precise target by making the AI more corrigible in deployment. Unfortunately, this corrigibility appears to break down as the AI acquires a sufficiently accurate model of humans.
Less advanced deferential AI can assist in providing training rationale nudges by aiding research.
Alignment contribution impact: Since the corrigibility of deferential AI appears to break down at high levels and it seems possible to imagine safe AI without it (e.g. myopic or truly aligned AI), it does not seem critical or urgent. Less advanced deferential AI might be a generally useful tool for aiding research, but as long as it requires architecture changes deference can't easily be added to an alignment research agenda.
It's possible that further research into deference could ameliorate these issues.
Capabilities contribution method: Deferential AI could aid capabilities research just as readily as alignment research by providing training rationale nudges towards more capable unsafe goal specifications. It might catch some unsafe objectives, but it may fail to recognize others, or be overruled on them by the human operator.
Capabilities contribution impact: It's hard to estimate at this point how viable deferential AI will be, but as presented now it appears to confer little to no advantage to alignment over capabilities, which would mean it would worsen our odds. It seems wise to steer away from making deferential AI competitive until it improves alignment odds.

Case study: STEM AI; narrow agents in general

STEM AI didn't quite fit the mould in 11 proposals or of a full-fledged training story. In my opinion a contribution story does the proposal more justice:

Alignment contribution method: Narrow agents like STEM AI attempt to define a training goal specification that allows for more training rationale competitiveness through less onerous training rationale constraints, while still being advanced enough to have great value. However, I do not expect this to be training goal competitive compared to more generally trained AI because the world is filled with humans that affect our preferences and options.
Alignment contribution impact: As I said above, I don't expect narrow agents to be competitive for transformative AI. Agents that rely on narrowness for their safety don't seem very useful for alignment research.
Research into narrowness guarantees has some chance of being critical by preventing a catastrophe with AI intended to be narrow - see below. This seems most useful around the time when AI could be trained to be general, so it is not very urgent. It's generalizable to any topic where you only need a narrow understanding to perform well, but that excludes quite a lot.
Capabilities contribution method: Narrow agents may be able to get more optimized results in their limited domains than non-agentic simulators. If narrowness is guaranteed by architecture innovations, then narrow agents might be more training goal competitive than broad agents as well. The agents' output could then be used to advance AI capabilities by easing training rationale constraints like the price of compute.
Offering capabilities researchers ways to guarantee narrowness could reduce risk from unaligned AI by preventing AI intended for a narrow topic from exhibiting dangerous behavior through unintended capabilities generalization. This would constitute an improvement of training goal specification for capabilities, but it would still improve our chances of survival.
Capabilities contribution impact: Narrowness guarantees don't advance the timeline to safe transformative AI much. Instead, they potentially reduce the risk of misalignment by making capabilities research safer. This requires them to be both training goal competitive and training rationale competitive.

Case study: Grouped Loss

As a demonstration of contribution stories' applicability, I picked a recent Alignment Forum post about prosaic alignment to try it out on: Grouped Loss may disfavor discontinuous capabilities.

Alignment contribution method: The intent of grouped loss is to make training rationale nudges easier to implement by smearing the evolution of new capabilities over more training steps. It trades some training rationale competitiveness, since the loss function is less efficient and harder to define, for training rationale alignment, since it becomes easier to catch precursors to misalignment or steer towards desirable model traits.
Alignment contribution impact: The proposal does not seem highly urgent, since its primary function is giving more traction for nudges that aren't developed enough yet, like transparency tools. It does seem potentially critical, in that it could be implemented in the training procedure for an advanced AI if the competitiveness hit isn't too great and it doesn't conflict with another loss specification. Generalization may be an issue depending on how much work it is to define groups that have the desired properties.
Capabilities contribution: The training rationale competitiveness hit will likely prevent grouped loss from advancing capabilities.

It feels to me like the contribution story format helped slot this concept into my model for how AI alignment could shake out, and to ask some of the right questions about the value of the research.

I hope that this framework helps people evaluate their research options and discuss them with others, and so contributes to paving the way for aligned AGI.
