This post is part 6 in our sequence on Modeling Transformative AI Risk, and deals with how safety research - that is, technical research agendas aiming to reduce AI existential risk - might affect risks from AI. In this series of posts, we are presenting a preliminary model of the relationships between key hypotheses in debates about catastrophic risks from AI. Previous posts in this sequence explained how different subtopics of this project, such as AI takeoff and mesa-optimization, are incorporated into our model.

We caution that this part of the model is much more of a work in progress than others. At present, it is best described as loosely modeling a few aspects of safety agendas, and we hope to develop it further to the quality of the more complete portions of the model. In many cases, we are unclear about how different research agendas relate to specific types and causes of risk. This is partly because this part of the model is a work in progress, but also because the theory of impact for many safety agendas is still unclear. So in addition to explaining the model, we will highlight what is unclear and what would help clarify it.

Modeling of different research areas is contained in the green-colored modules, circled below.

Key questions for safety agendas

We encountered several uncertainties about safety agendas which have made modeling them relatively difficult. While most of these uncertainties are explained throughout the next section, the following points seem like the most important questions to ask about a safety agenda:

  1. What is the theory of change? What does success look like, and how does that reduce AI risk? What aspects of alignment is the agenda supposed to address?
  2. What is assumed about the progression of AI? How much does the agenda rely on a particular AI paradigm?
  3. What is the expected timeline of the research agenda, and how much does that depend on additional funding or buy-in?
  4. What are likely effects from work on this agenda even if it doesn’t succeed fully? Are there spillover effects for AI capabilities or for other safety agendas? Are there beneficial effects from partial success?

Besides helping our model, understanding the above for various safety agendas has several benefits:

  • the community can better understand what constitutes success and provide better feedback on the agenda,
  • the researchers get a better idea of how to steer the research agenda as it progresses,
  • funders (and researchers) can better evaluate agendas, and in turn prioritise funding or pursuing them,
  • future research can find gaps in the safety-agenda space more easily.

Model overview

Overall impact of safety research

The key outcome to focus on for our model of research impact is Misaligned HLMI[1], circled in purple in the figure below. This node is obviously important to the final risk scenarios that we model, which will be covered in more detail in the next post. Looking at the inputs to this node, the model says we can avoid Misaligned HLMI if at least one of the following holds:

  1. HLMI is never developed (blue-circled node), or
  2. We manage to Correct course as we go, meaning HLMI is either aligned by default, or HLMI can be aligned in an iterative fashion in a post-HLMI world (orange-circled node), or
  3. HLMI [is] aligned ahead of time - that is, people find a way to align HLMI before it appears or needs to be aligned (green-circled node).

While it seems that most people in the AI safety community believe option 3 is necessary for safe HLMI, option 2 is argued for by some mainstream AI researchers and by some within the AI safety community, and option 1 would apply if humanity successfully coordinated to never build HLMI (though this possibility is not currently captured by our model). We will discuss option 2 further in the upcoming post on failure modes.
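
As a rough illustration of how these paths combine into the top-level node, here is a minimal sketch in Python. The node names follow the post, but the numbers are made-up placeholders rather than outputs of our model, and the three paths are treated as independent purely to keep the example short:

```python
# Minimal illustrative sketch: placeholder probabilities, not outputs of our model.
p_hlmi_never_developed = 0.05      # option 1: HLMI is never developed
p_correct_course_as_we_go = 0.20   # option 2: aligned by default / iterative correction post-HLMI
p_aligned_ahead_of_time = 0.30     # option 3: an alignment agenda succeeds ahead of time

# Misaligned HLMI requires all three paths to fail; independence is assumed
# here only for brevity (the real model is a graph, not a product of scalars).
p_misaligned_hlmi = (
    (1 - p_hlmi_never_developed)
    * (1 - p_correct_course_as_we_go)
    * (1 - p_aligned_ahead_of_time)
)

print(f"P(Misaligned HLMI) = {p_misaligned_hlmi:.2f}")  # 0.53 with these placeholders
```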

If we assume that aligning HLMI ahead of time is a worthwhile endeavour, i.e. conditioning on our model of the risks, how could success come about? It would either be through the success of currently proposed research agendas and their direct follow-ups, or through more novel approaches developed as we get closer to HLMI. New approaches may come from new insights or from a paradigm shift in AI. This is essentially what the HLMI aligned ahead of time module captures, shown below. In particular, this module currently includes Foundational research (specifically the highly reliable agent designs agenda) and Synthesized utility function. We include only these two here for simplicity, but other agendas should fit in as well.

Our current model focuses primarily on currently proposed agendas, since we can, and want to, model them more concretely. Additional work on both the current agendas and the potential for future progress would be useful for improving our understanding of the risk and for building the model.

In this post we'll focus on three (or perhaps two and a half) different approaches to safety: 1) Iterated Distillation and Amplification, 2) Foundational Research, and 3) Transparency, which has been proposed as a useful part of several approaches to safety, but may not be sufficient alone. In the following sections we go through our preliminary models of these agendas, and point out major uncertainties we have about them in bold.

Iterated Distillation and Amplification

We will use the IDA research agenda as the main example to explain our uncertainties about modeling, as it appears to have more published detail than most other proposed agendas regarding what is involved, its theory of change, and its assumptions about AI progress. The section of the model for IDA is shown in the figure below.

The final output of this section is IDA research successful. We assume that success means obtaining a clear, vetted procedure to align the intentions of an actual HLMI via IDA. However, the IDA agenda seems to address outer alignment (i.e. finding a training objective with optima that are aligned with the overseer), and not necessarily inner alignment (i.e. making a model robustly aligned with the training objective itself)[2]. This is a case where we are uncertain what parts of the alignment problem the agenda is supposed to solve, and how a partial solution to alignment is expected to fit into a full solution.

For IDA and other agendas, it's also difficult to reason about degrees of success. What if the research doesn't reach its end goal, but produces some useful insights? Relatedly, how could the agenda help other agendas if it does not directly achieve its aims? We're uncertain how researchers think about this, and how best to model these effects.

Continuing through the model, towards the right of the figure we have a section about the “race” between IDA research and AI capabilities research, summarised as Amplification research will produce useful results in time for HLMI. We’re particularly unsure how to think about timelines for solving research agendas, and we are interested in community feedback about how researchers understand the timelines for success, or any other insights on the topic.

Currently, we model research timelines using the node IDA research sufficient by year X, where X is affected by both Investment effect and randomness. This result is then modified by the node Extra time, which models the possibility that a "fire alarm" for HLMI is recognised and speeds up safety research in the years before HLMI, either through insight or increased resources. This time to sufficient IDA research is then compared to the timeline for HLMI (Timeline: HLMI by year X), where success is dependent on the time to IDA being less than the time to HLMI.

The competitiveness of IDA is modeled in a section toward the left of the figure. We break competitiveness down into Not prohibitively expensive to train, Competitive at runtime, and whether IDA scales to arbitrary capabilities (Team of aligned agents can be more capable than an individual agent and reach arbitrary levels of capability). All are modeled as being necessary for IDA to be competitive.

Finally, we have one node to represent whether IDA [is] outer aligned at optimum. This means that all possible models which are optimal according to the training objective are at least intent-aligned. Being outer aligned at optimum has been argued to defuse much of the threat from Goodhart's Law, specifically its Causal and Extremal variants. So IDA being outer aligned at optimum would be important for the agenda to succeed at alignment, and whether it holds is therefore a key uncertainty.

Putting it all together, our model considers IDA to be a workable solution for outer alignment if and only if IDA research wins the “race” against unaligned HLMI, IDA is sufficiently competitive with other approaches to HLMI, and IDA is outer aligned at optimum. The output node IDA research successful then feeds into the Incorrigibility module at the top level of the model - that is, if IDA is successful and we have an intent-aligned HLMI, then it is corrigible. Corrigibility in turn increases the ability to Correct course as we go (and so on, as explained in the previous section).
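
To make the structure above more concrete, here is a rough Monte Carlo sketch of the IDA sub-module. The node names follow the post, but every distribution and probability below is a made-up placeholder rather than an estimate from our model, and the inputs are sampled independently purely to keep the example short:

```python
import random

def sample_ida_research_successful(rng: random.Random) -> bool:
    """One Monte Carlo sample of the IDA sub-module (placeholder numbers throughout)."""
    # Timeline: HLMI by year X
    year_hlmi = rng.gauss(2055, 15)

    # IDA research sufficient by year X, affected by Investment effect and randomness
    investment_effect = -5 if rng.random() < 0.5 else 0
    year_ida_sufficient = rng.gauss(2050, 10) + investment_effect

    # Extra time: a recognised "fire alarm" speeds up safety research before HLMI
    if rng.random() < 0.3:
        year_ida_sufficient -= rng.uniform(0, 5)

    # The "race": IDA research must be sufficient before HLMI arrives
    race_won = year_ida_sufficient < year_hlmi

    # Competitiveness: all three conditions are modeled as necessary
    competitive = (
        rng.random() < 0.6      # Not prohibitively expensive to train
        and rng.random() < 0.6  # Competitive at runtime
        and rng.random() < 0.5  # Team of aligned agents reaches arbitrary capability
    )

    outer_aligned_at_optimum = rng.random() < 0.5

    return race_won and competitive and outer_aligned_at_optimum

rng = random.Random(0)
n = 100_000
p_success = sum(sample_ida_research_successful(rng) for _ in range(n)) / n
print(f"P(IDA research successful) = {p_success:.2f}")
```

The sketch only shows how the timing “race”, the competitiveness conjunction, and outer alignment at optimum combine into the output node; it ignores any dependencies between these inputs, which the full model would need to capture.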

Foundational Research

Our work on modeling Foundational Research has focused entirely on MIRI’s highly reliable agent designs (HRAD) research agenda, as this has had the most discussion in the AI alignment community (out of all the technical work in Foundational Research). Trying to model disagreements about the value of the HRAD agenda led to the post Plausible cases for HRAD work, and locating the crux in the "realism about rationality" debate. To summarize the post, one of the difficulties with modeling the value of HRAD research is that there seems to be disagreement about what the debate is even about. The post tries to organize the debate into three “possible worlds” about what the core disagreement is, and gives some reasons for thinking we might be in each world. The discussion in the comments did not lead to a consensus, so more work will probably be needed to make our thoughts precise enough to encode in the graph structure of the model.

The model below is a simpler substitute, pending further work on the above. Like the IDA model, this considers whether the research can succeed in time for HLMI. Besides that, there are two nodes about the possibility and difficulty of HRAD. The Foundational research [is] successful node feeds directly into the HLMI aligned ahead of time module shown previously.
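
Read as a sketch (with the same caveats as the IDA example above, and noting that the conjunction below is a simplifying guess at the combination rule rather than a settled modeling choice), the simplified sub-module looks roughly like:

```python
# Illustrative placeholders only; the conjunction is an assumed combination rule.
research_succeeds_in_time_for_hlmi = True   # the timing "race", as in the IDA sketch
hrad_approach_is_possible = True            # node on the possibility of HRAD
hrad_approach_is_tractable = False          # node on the difficulty of HRAD

foundational_research_successful = (
    research_succeeds_in_time_for_hlmi
    and hrad_approach_is_possible
    and hrad_approach_is_tractable
)
```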

Transparency

Here, we are using the term "transparency" as shorthand for transparency and interpretability research that has long-term AI safety as a core motivation.

Transparency can be applied to whole classes of machine learning models, and may be a part of, or a complement to, several alignment techniques. In An overview of 11 proposals for building safe advanced AI, "transparency tools" form a key part of several proposals, for different reasons. More recently, Transparency Trichotomy analysed different ways that transparency can help us understand a model: via inspection, via training, or via architecture. The parts of the trichotomy can also work together, for instance by using transparency via inspection to get more informed oversight, which then feeds back into the model via training. So the current structure of our model, with its largely separate paths to impact for each agenda, does not seem well suited to transparency research. However, the Mesa-optimization module does incorporate some nodes on how transparency research may help detect deception (via inspection) or actively avoid deception (via training).

Theories of change for transparency helping to align HLMI have become clearer in published writing over the last couple of years. The post Chris Olah's views on AGI safety offers several claims about its theory of change, including:

  1. Transparency tools give you a mulligan - a chance to recognise a bad HLMI system, and try again with better understanding.
  2. Advances in transparency tools feed back into design. If we build systems with more understanding of how they work, then we can better understand their failure cases and how to avoid them.
  3. Careful analysis using transparency tools will clarify what we don't understand too. Pointing out what we don't understand will generate more concern about HLMI.
  4. Transparency tools help an overseer to give feedback not just on a system's output, but also the process by which it produced that output.
  5. Advances in transparency tools (and demonstrations of their usefulness and appeal) help realign the ML community to focus on deliberate design and understanding.

From the above claims and discussion elsewhere, there appear to be several cruxes about whether transparency will help to align HLMI:

  1. Using transparency tools will not make enough progress (or any progress) on the "hard problem" of transparency. The hard problem is to figure out what it even means to understand a model, in a way that can save us from Goodhart's Law and deception. As discussed in Transparency Trichotomy, transparency tools can themselves be gamed. There seems to be agreement that transparency tools will not get us all the way on this problem, but disagreement about how much they help - see e.g. this thread and this comment (bullet point 3).
  2. Similar to the above, though we are not sure if this is a distinct crux for anyone: there is a risk that transparency tools make the flaws we are trying to detect harder to understand (discussed here and here), so there is too great a risk that the tools cause net harm.
  3. Transparency tools will not scale with the capabilities of HLMI and beyond - discussed here. The crux could be specifically about the amount of labour required to understand increasingly large models. It could also be about increasingly capable systems using increasingly alien abstractions. The linked post suggests that an amplified overseer could get around this problem, so the crux could actually be in whether an amplified overseer can make transparency scale reliably in place of humans.
  4. The available transparency tools will not be useful for the kind of system that HLMI is (e.g. the work on Circuits in vision models will not transfer well to language models). This is like a horizontal version of the above scaling crux. Chris Olah raised this point himself.

Again, more work is needed to structure our model in a way that incorporates the above cruxes.

Other agendas

Other agendas or strategies which we have not yet modeled include:

Some of the above are more difficult to model because there is less writing that clearly outlines paths to impact or what success looks like. A potentially valuable project would be to make a clearer case to the community for how a given research agenda could be impactful, and to explain what its goals and specific approaches are.

Help from this community

Our tentative understanding suggests that more public effort to understand and clearly articulate safety agendas' impacts, driving beliefs, and main points of disagreement would be really helpful. Examples of good work in this area are An overview of 11 proposals for building safe advanced AI, and Some AI research areas and their relevance to existential safety. This work can take a lot of effort and time, but some of the uncertainties highlighted in this post seem fairly easy to clarify through comments or smaller write-ups.

To illustrate the kind of information that would help, we have written the following condensed explanation of an imaginary agenda (the agenda and opinions are made-up - this is not quoting anyone):

This agenda aims to increase the chance that high-level machine intelligence (HLMI) is inner-aligned. More specifically, it will defuse the threat of deceptively aligned HLMI. The path from deceptively aligned HLMI to existential catastrophe is roughly: such a system would be deployed due to economic or other incentives and lack of apparent danger. It would also be capable enough to take the long-term future out of humanity's control. While we have a very wide distribution over how humanity loses control, we expect a scenario similar to scenario 2 of What Failure Looks Like.

The specific outcome we are aiming for is <alignment procedure>. For this to succeed and be scalable, we rely on AI progressing like <current machine learning trends>. We expect the resulting AI to be competitive, with training time on the same order of magnitude and performance within 20% of the unaligned baseline.

With regard to timelines, we tentatively estimate that this work is at its most valuable if HLMI is produced in the medium term, neither in the next 10 years, nor more than 30 years from now. With our current resources, we give a rough 5% chance of having a viable procedure within 5 years, and a 10% chance within 20 years. This increases to 20% and 30% respectively with <additional resources>. We split the remaining subjective probability of failure evenly between obstacles from the theory, from project management, and from external factors. This agenda relies on outer alignment being solved using <broad outer alignment approach>, but otherwise does not interact much with the problem we aim to solve.

Finally, as a way of quickly gathering opinions, we would love to see comments on the following: for any agenda you can think of, or one that you're working on, what are the cruxes for working on it?

In the next post, we will look at the failure modes of HLMI and the final outcomes of our model.

Acknowledgements

Thanks to the rest of the MTAIR Project team for feedback and suggestions, as well as Adam Shimi and Neel Nanda for feedback on an early draft.


  1. We define High-Level Machine Intelligence (HLMI) as machines that are capable, either individually or collectively, of performing almost all economically-relevant information-processing tasks that are performed by humans, or quickly (relative to humans) learning to perform such tasks. We are using the term “high-level machine intelligence” here instead of the related terms “human-level machine intelligence”, “artificial general intelligence”, or “transformative AI”, since these other terms are often seen as baking in assumptions about either the nature of intelligence or advanced AI that are not universally accepted. ↩︎

  2. For more on this distinction/issue, see this post. ↩︎

Comments

Sure, sounds fun, here goes:

Brain-like AGI safety

This agenda is primarily useful in the scenario that (1) human intelligence is mostly powered by legible learning algorithms in the brain (especially the neocortex), (2) people will eventually reverse-engineer or reinvent these learning algorithms and thus build HLMI, (3) …before anyone builds HLMI by any other path.

The agenda aims to increase the chance that high-level machine intelligence (HLMI) is aligned.

I think there are many possible paths from "we don't know how to reliably control the motivations of a HLMI" to "existential catastrophe", and I don't have to take a stand on which ones are more or less likely.

The specific outcome I'm aiming for is either (A) an architecture plan / training plan / whatever that enables programmers to reliably set the HLMI's motivation to be and stay in a particular intended direction, or to align with human ethics, or whatever, e.g. avoiding failure modes like value drift, wireheading, conscious suffering AIs, etc., or (B) establishing that no such architecture is possible, in which case we can at least discourage this line of capabilities research or encourage alternatives (i.e. differential technology development).

If promising architectures are found, I mean, obviously I hope they'll be competitive with unaligned architectures, but it's too soon to know.

With regard to timelines, the utility of this agenda depends mostly on the extent to which the HLMI development path goes to the destination of neocortex-like algorithms. The timeline per se—i.e., how soon researchers reach that destination—is less important.

I think of this as a cousin of Prosaic AGI research. Prosaic AGI research says "what if AGI is like the most impressive ML algorithms of today?". I say "what if AGI is like the neocortex?" I think both agendas are valuable, like for contingency planning purposes. Obviously I have my own opinions about the relative probabilities of those two contingencies (and it could also be "neither of the above"), but I'm not sure that's very decision-relevant, we should just do both. :-P

Nice! A couple things that this comment pointed out for me:

  1. Real time is not always (and perhaps often not) the most useful way to talk about timelines. It can be more useful to talk about different paths, or economic growth, if that's more relevant to how tractable the research is.
  2. An agenda doesn't necessarily have to argue that its assumptions are more likely, because we may have enough resources to get worthwhile expected returns on multiple approaches.

Something that's unclear here: are you excited about this approach because you think brain-like AGI will be easier to align? Or is it more about the relative probabilities / neglectedness / your fit?

are you excited about this approach because you think brain-like AGI will be easier to align?

I don't think it's obvious that "we should do extra safety research that bets on a future wherein AGI safety winds up being easy". If anything it seems backwards. Well, tractability cuts one way, importance cuts the other way, "informing what we should do viz. differential technology development" is a bit unclear. I do know one person who works on brain-like AGI capabilities on the theory that brain-like AGI would be easier to align. Not endorsing that, but at least there's an internal logic there.

(FWIW, my hunch is that brain-like AGI would be better / less bad for safety than the "risks from learned optimization" scenario, albeit with low confidence. How brain-like AGI compares to other scenarios (GPT-N or whatever), I dunno.)

Instead I'm motivated to work on this because of relative probabilities and neglectedness.

I like this comment, though I don't have a clear-eyed view of what sort of research makes (A) or (B) more likely. Is there a concrete agenda here (either that you could link to, or in your head), or is the work more in the exploratory phase?

You could read all my posts, but maybe a better bet is to wait a month or two, I'm in the middle of compiling everything into a (hopefully) nice series of blog posts that lays out everything I know so far.

I don't really know how to do (B) except "keep trying to do (A), and failing, and maybe the blockers will become more apparent".

I'm working on an in-depth analysis of interpretability research, which is largely about its impacts as a safety research agenda. I think it would be a useful companion to your "Transparency" section in this post. I'm writing it up in this sequence of posts: Interpretability Research for the Most Important Century. (I'm glad I found your post and its "Transparency" section too, because now I can refer to it as I continue writing the sequence.)

The sequence isn't finished yet, but a couple of the posts are done already. In particular the second post Interpretability’s Alignment-Solving Potential: Analysis of 7 Scenarios contains a substantial part of the analysis. The "Closing thoughts" section of that post gets at most of the cruxes for interpretability research as I see them so far, excerpted here:

In this post, we investigated whether interpretability has property #1 of High-leverage Alignment Research[1]. We discussed the four most important parts of AI alignment, and which seem to be the hardest. Then we explored interpretability's relevance to these areas by analyzing seven specific scenarios focused on major interpretability breakthroughs that could have great impacts on the four alignment components. We also looked at interpretability's potential relevance to deconfusion research and yet-unknown scenarios for solving alignment.

It seems clear that there are many ways interpretability will be valuable or even essential for AI alignment.[26] It is likely to be the best resource available for addressing inner alignment issues across a wide range of alignment techniques and proposals, some of which look quite promising from an outer alignment and performance competitiveness perspective.

However, it doesn't look like it will be easy to realize the potential of interpretability research. The most promising scenarios analyzed above tend to rely on near-perfection of interpretability techniques that we have barely begun to develop. Interpretability also faces serious potential obstacles from things like distributed representations (e.g. polysemanticity), the likely-alien ontologies of advanced AIs, and the possibility that those AIs will attempt to obfuscate their own cognition. Moreover, interpretability doesn't offer many great solutions for suboptimality alignment and training competitiveness, at least not that I could find yet.

Still, interpretability research may be one of the activities that most strongly exhibits property #1 of High-leverage Alignment Research[1]. This will become more clear if we can resolve some of the Further investigation questions above, such as developing more concrete paths to achieving the scenarios in this post and estimating probabilities that we could achieve them. It would also help if, rather than considering interpretability just on its own terms, we could do a side-by-side-comparison of interpretability with other research directions, as the Alignment Research Activities Question[5] suggests.

(Pasting in the most important/relevant footnotes referenced above:)

[1]: High-leverage Alignment Research is my term for what Karnofsky (2022)[6] defines as:

“Activity that is [1] likely to be relevant for the hardest and most important parts of the problem, while also being [2] the sort of thing that researchers can get up to speed on and contribute to relatively straightforwardly (without having to take on an unusual worldview, match other researchers’ unarticulated intuitions to too great a degree, etc.)”

See The Alignment Research Activities Question section in the first post of this sequence for further details.

[...]

[5]: The Alignment Research Activities Question is my term for a question posed by Karnofsky (2022)[6]. The short version is: “What relatively well-scoped research activities are particularly likely to be useful for longtermism-oriented AI alignment?”

For all relevant details on that question, see The Alignment Research Activities Question section in the first post of this sequence.

If any of this is confusing, please let me know - it may also help to reference details in the post itself for clarification. Additionally, there are some useful sections in that post for thinking about the high-level impact of interpretability that are not fully expressed in the "Closing thoughts" above, for example the positive list of Reasons to think interpretability will go well with enough funding and talent and the negative list of Reasons to think interpretability won’t go far enough even with lots of funding and talent.