This post is a distillation of Evan Hubinger's post "how do we become confident in the safety of a machine learning system?", made as part of the summer 2022 SERI MATS program. While I have attempted to understand and extrapolate Evan's opinions, this post has not been vetted. Likewise, I use training stories (and contribution stories) to describe the methodology of proposals for safe advanced AI without the endorsement of those proposals' authors and based on a relatively shallow understanding of those proposals (due to my inexperience and time constraints). The opinions presented in this post are my own unless otherwise noted.
Epistemic status: Exploratory
Some day, all too soon from now, people will deploy the AI that seals humanity's fate. There are many scenarios for how this comes about and what happens afterward. Some involve multiple AIs negotiating or fighting for dominance; others involve a single AI. Some expect the handoff of power from humans to AI to go slowly, others fast. But whatever the case, there is one question that these people need to get right: "Is the AI we're about to deploy safe?"
For people to get this answer right when it matters, two things need to happen: we need tools that can accurately determine whether an advanced AI is safe, and we need an advanced AI that those tools approve of, ready to deploy in time for it to matter.
Training stories are a tool to evaluate proposals for making this happen. Specifically, they're meant to analyze complete proposals for training prosaic advanced AI: advanced AI whose architecture is similar enough to current systems that most research is cross-applicable. The reason to focus on complete proposals is that it forces us to pay attention to what will be important when push comes to shove. From there, we can backpropagate to an estimate of which research paths are most beneficial. Evan thinks theoretical prosaic alignment is the most beneficial in expectation for him and many others to pursue, though he supports a broad spectrum of research.
Because I think the original post describes the central concept of training stories very well, I have avoided retreading the same ground, and instead focused on expanding its usability. I will start with some more case studies of using training stories to analyze proposals, and then expand on the original post with contribution stories: a method for analyzing research agendas by how they factor into training stories.
In this case study, Evan's comments in An overview of 11 proposals for building safe advanced AI map well to the training story evaluation, and I don't feel like much was missing. However, the training story itself makes it more clear to me what the training goal might look like, which could help point at potential experiments with toy models that can be done today. This demonstrates how training stories can be used to clarify a path towards working AI.
The evaluation of the training story is done pretty well in the original 11 proposals post:
Attempting to write a training story for this, it quickly becomes clear that the training goal is underspecified. While there are vague behavioral arguments for why AI trained in cooperative environments would act 'cooperatively', there appears to be no conceptualization of an internal algorithm that would make this cooperation robust in the way we want, let alone training rationale nudges that would give rise to this internal logic over other algorithms that also meet the training rationale constraint of cooperative behavior in training.
This demonstrates training stories' usefulness as an analysis tool: a serious proposal for safe advanced AI is shown lacking. That doesn't mean that an RL incentive towards cooperation is useless, just that it is far from sufficient and a better plan is needed.
Like reinforcement learning + transparency tools, the training goal for STEM AI is underspecified. Limiting the AI's subject matter is a valid, albeit soft, training rationale nudge away from the AI having a conception of agents it could deceive, but it will not suffice alone. STEM AI may well have great economic value, but it's hard to imagine it being safely transformative. It might help make capabilities research safer, though - see below.
In the section 'Exploring the landscape of possible training stories', Evan lists a number of goals and rationales that could be elements of a training story, and notes that many other options are possible. The section does not provide tools for qualitatively evaluating these potential partial strategies, which I think limits how well training stories apply to many researchers' actual efforts. For example, while Evan places high importance on transparency tools, Microscope AI is the part of Chris Olah's work that gets highlighted in the overall article, despite requiring additional assumptions.
Training stories help us answer the question "What training characteristics would convince us that an advanced AI we're about to deploy is safe?". I would like to expand on that with "How do we get those training characteristics?" - to give tools for backpropagating from those characteristics to research agendas that can be tackled now. For this I'll sketch out contribution stories. As an expansion of training stories, these are also focused on prosaic AI development.
I'll split a contribution story up into the following sections:
The training goal of deferential AI is a model with an internal representation of a human's preferences, which it both pursues and is uncertain over. This uncertainty is then leveraged to make it defer to the human when the human tries to correct it or shut it off.
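To make the deference mechanism concrete, here is a minimal toy sketch. It is my own illustration, loosely in the spirit of the off-switch game from Hadfield-Menell et al., and the specific numbers and the assumption that the human approves exactly when the action is good are illustrative, not part of the original proposal:

```python
# Toy illustration: an agent is uncertain what utility U(a) its proposed action
# has under the human's true preferences. It can act directly (getting U(a)),
# or defer: let the human approve the action (utility U(a)) or shut it off
# (utility 0). Assuming the human approves exactly when U(a) > 0, deferring is
# never worse in expectation, since E[max(U(a), 0)] >= max(E[U(a)], 0).

# Hypothetical belief over the action's true utility: (utility, probability).
belief = [(+2.0, 0.6), (-5.0, 0.4)]

ev_act = sum(u * p for u, p in belief)              # act without deferring
ev_defer = sum(max(u, 0.0) * p for u, p in belief)  # defer to the human

print(f"Expected utility of acting directly: {ev_act:+.2f}")   # -0.80
print(f"Expected utility of deferring:       {ev_defer:+.2f}")  # +1.20
```

Under these toy numbers, the agent's uncertainty makes deferring the expected-value-maximizing choice; the worry, of course, is whether training actually produces a model whose internals implement this kind of reasoning.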
STEM AI didn't quite fit the mould in 11 proposals, nor that of a full-fledged training story. In my opinion, a capabilities story does the proposal more justice:
As a demonstration of contribution stories' applicability, I picked a recent Alignment Forum post about prosaic alignment to try it out on: Grouped Loss may disfavor discontinuous capabilities.
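For readers unfamiliar with that post, here is a minimal sketch of what I understand a grouped loss to look like; the particular grouping and the squared aggregation below are my own assumptions for illustration, not necessarily the exact formulation in the post. The idea is to compute a mean loss per group of examples and then aggregate the group losses nonlinearly, so gradient pressure concentrates on whichever groups currently perform worst:

```python
import torch
import torch.nn.functional as F

def grouped_loss(logits, targets, group_ids, num_groups):
    """Mean per-example loss within each group, aggregated nonlinearly across
    groups. Squaring is one illustrative choice: it pushes harder on poorly
    performing groups, discouraging very uneven capabilities across tasks."""
    per_example = F.cross_entropy(logits, targets, reduction="none")
    group_means = torch.stack([
        per_example[group_ids == g].mean() for g in range(num_groups)
    ])
    return (group_means ** 2).sum()

# Hypothetical usage with random data: 8 examples, 3 classes, 2 task groups.
logits = torch.randn(8, 3, requires_grad=True)
targets = torch.randint(0, 3, (8,))
group_ids = torch.tensor([0, 0, 0, 0, 1, 1, 1, 1])
loss = grouped_loss(logits, targets, group_ids, num_groups=2)
loss.backward()
```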
It feels to me like the contribution story format helped slot this concept into my model for how AI alignment could shake out, and to ask some of the right questions about the value of the research.
I hope that this framework helps people evaluate their research options and discuss them with others, and so contributes to paving the way for aligned AGI.
I like the idea of contribution stories. That seems like a useful concept to have around.
I also endorse your contribution story for Grouped Loss.