Our mission at Apollo Research is to reduce catastrophic risks from AI by auditing advanced AI systems for misalignment and dangerous capabilities, with an initial focus on deceptive alignment.
In our announcement post, we presented a brief theory of change of our organization which explains why we expect AI auditing to be strongly positive for reducing catastrophic risk from advanced AI systems.
In this post, we present a theory of change for how AI auditing could improve the safety of advanced AI systems. We describe what AI auditing organizations would do; why we expect this to be an important pathway to reducing catastrophic risk; and explore the limitations and potential failure modes of such auditing approaches.
We want to emphasize that this is our current perspective and, given that the field is still young, could change in the future.
As presented in ‘A Causal Framework for AI Regulation and Auditing’, one of the ways to think about auditing is that auditors act at different steps of the causal chain that leads to AI systems’ effects on the world. This chain can be broken down into different components (see figure in main text), and we describe auditors’ potential roles at each stage. Having defined these roles, we identify and outline five categories of audits and their theories of change:
In general, external auditors provide defence-in-depth (overlapping audits are more likely to catch more risks before they’re realized); AI safety-expertise sharing; transparency of labs to regulators; public accountability of AI development; and policy guidance.
But audits have limitations which may include risks of false confidence or safety washing; overfitting to audits; and lack of safety guarantees from behavioral AI system evaluations.
The recommendations of auditors need to be backed by regulatory authority in order to ensure that they improve safety. It will be important for safety to build a robust AI auditing ecosystem and to research improved evaluation methods.
Frontier AI labs are training and deploying AI systems that are increasingly capable of interacting intelligently with their environment. It is therefore ever more important to evaluate and manage risks resulting from these AI systems. One step to help reduce these risks is AI auditing, which aims to assess whether AI systems and the processes by which they are developed are safe.
At Apollo Research, we aim to serve as external AI auditors (as opposed to internal auditors situated within the labs building frontier AI). Here we discuss Apollo Research's theories of change, i.e. the pathways by which auditing hopefully improves outcomes from advanced AI.
We discuss the potential activities of auditors (both internal and external) and the importance of external auditors in frontier AI development. We also delve into the limitations of auditing and some of the assumptions underlying our theory of change.
The primary goal of auditing is to identify and therefore reduce risks from AI. This involves looking at AI systems and the processes by which they are developed in order to gain assurance that the effects that AI systems have on the world are safe.
To exert control over AI systems’ effects on the world, we need to act on the causal chain that leads to them.
We have developed a framework for auditing that centers on this causal chain in ‘A Causal Framework for AI Regulation and Auditing’ (Sharkey et al., 2023). For full definitions of each step, see the Framework. Here, we briefly describe what auditors could concretely do at each step in the chain. Later, we’ll examine the theory of change of those actions.
Beyond roles of auditors relating directly to the above causal chain, additional general functions of auditors include:
It seems desirable that different auditing organizations specialize in different functions. For instance, security audits may best be handled by cybersecurity firms or even intelligence agencies. However, it is important for safety that auditing tasks are done by multiple actors simultaneously to reduce risks as much as possible.
Different kinds of audits could examine different parts of the causal chain leading to AI systems’ effects on the world. We identify five categories of audits: 1) AI system evaluations; 2) Training-experiment design audits; 3) Deployment audits; 4) Security audits; and 5) Governance audits. Each category of audit has different theories of change:
AI system evaluations look at behaviors expressed by the AI system; capabilities and propensities of AI systems (during and after training); the mechanistic structure of AI systems; and what the AI system has learned and can learn.
We assess AI system evaluations as having direct effects; indirect effects on safety research; and indirect effects on AI governance.
Direct effects: If successful, AI system evaluations would identify misaligned systems and systems with dangerous capabilities, thus helping to reduce the risk that such systems would be given affordances that let them have damaging effects on the world (e.g. deployment). Notably, audits do not need to be 100% successful to be worthwhile; finding some, even if not all, flaws already decreases risk (though see section Limits of auditing). Beyond behavioral AI system evaluations, Apollo Research also performs interpretability research in order to improve evaluations in future. Interpretability also has additional theories of change.
Indirect effects on safety research: Adequate AI system evaluations would convert alignment from a ‘single-shot’ problem into a ‘many-shot’ problem. In a world without extensive evaluations, there is a higher chance that a frontier AI lab deploys a misaligned AI system without realizing it and thus causes an accident, potentially a catastrophic one. In this case, the first “shot” has to be successful. By contrast, in a world with effective evaluations, labs can catch misaligned AI systems during training or before deployment; we would therefore get multiple “shots” at successfully aligning frontier AI AI systems. For instance, reliable AI system evaluations may give us evidence if any specific alignment technique succeeds in reducing a AI systems’ propensity to be deceptive. This would have important implications for the tractability of the alignment problem, since it would enable us to gather empirical evidence about the successes or failures of alignment techniques in dangerous AI systems without undue risk. Ultimately, successful AI system evaluations would let us iteratively solve the alignment problem like we would most other scientific or engineering problems.
Indirect effects on AI governance: AI system evaluations could provide compelling empirical evidence of AI system misalignment ‘in the wild’ in a way that is convincing to AI system developers, policymakers, and the general public. For example, AI system evaluations could be used to demonstrate that an AI system has superhuman hacking capabilities or is able to manipulate its users to gather relevant amounts of money. Such demonstrations could encourage these stakeholders to understand the gravity of the alignment problem and may convince them to propose regulation mandating safety measures or generally slowing down AI progress. Auditors likely have a good understanding of what frontier AI systems are capable of and can use their more neutral position to inform regulators.
Indirect effects on distribution of AI benefits: In order to reap the potential benefits from AI, it must be (safely) deployed. Assuming audits can be done effectively, auditing derisks investments, potentially leading to more investments in the area and thus greater benefits. By catching failures before they happen, auditing may be able to avoid accident scenarios, which have harmed public confidence in nuclear technology. Effective audits may also increase public trust in the technology, leading to wider spread use.
AI system development audits look at effective compute, training data content, and training-experiment design decisions. They also look at the design of AI system training-experiments, which help determine the previous factors.
The primary means of impact of AI system development audits is that they reduce the risk of dangerous AI systems coming into existence in the first place and reduce the danger posed by AI systems. They aim to achieve this by controlling which capabilities AI systems have (to avoid dangerous ones), the extent of their capabilities, and their propensities to use dangerous capabilities. By embedding safety into the AI system development process, AI system development audits may help place safety at the center of labs’ work rather than as an afterthought to increasing capabilities.
Deployment audits concern proposals for the deployment of particular AI systems.
The overall means of impact is that they should prevent systems from being deployed in ways that contravene regulations or that are deemed too risky. Note that these pathways are separate from AI system evaluations. The results of AI system evaluations should inform risk assessments in deployment audits. They should aim to assess risks from giving particular kinds of AI system access (e.g. access to inference; access to fine-tuning; access to weights) to particular kinds of people (e.g. deployment to the public; internal deployment; deployment in certain countries). They should also assess risks from making particular kinds of affordances available to AI systems, for instance internet access or access to particular kinds of software.
Deployment audits aim to ensure that AI systems are not intentionally given excessive available affordances; by contrast, security audits aim to reduce the risk that they are given excessive available affordances unintentionally.
Security audits assess the security of AI systems and the security of the organizations developing, hosting, and interacting with them. The overall purpose is to limit the affordances made available to highly capable AI systems unintentionally, thus reducing accident and misuse risks, both of which are extremely important for such a transformative and dual-use technology. They reduce the risk of AI system proliferation either through accidental leaks or exfiltration, either by internal or external actors. By assessing how well AI systems have been ‘boxed’, they also reduce the risk of AI systems exfiltrating themselves. They also aim to assess the adequacy of damage control measures in the event of security or safety failures.
Governance audits look at the structure of the institutions developing, regulating, and auditing AI systems (and the interactions between those institutions) to ensure that they are conducive to safety.
They aim to ensure that organizations have proper mechanisms in place to make informed, ethical, and responsible decisions regarding the development and deployment of AI systems. While other audits aim to ensure that AI systems are aligned or that they’re used for aligned purposes, governance audits aim to ensure that alignment with human values extends to the institutions wielding and managing these AI systems. Their path to impact is that they can identify problems in the governance landscape, thus making it possible to rectify them.
In addition to theories of change for each individual category of audit, there are also multiple theories of change for auditing in general:
External auditors, as opposed to internal auditors at the labs developing frontier AI, have additional pathways to impact:
We are aware of some of the limits of AI auditing, as well as the perverse incentives of auditing organizations that could both serve to reduce the positive impact that auditing efforts might have.
Our theory of change makes some assumptions about AI threat models and how the future is likely to play out. If these assumptions are incorrect, then it is not clear that auditing will be a good marginal investment of talent and time, or else the auditing strategy will have to change significantly:
Lee Sharkey led the project and edited the final version. Marius Hobbhahn contributed significantly to all parts other than “The roles of auditors in AI” section. Beren Millidge contributed to an early draft of the post. Dan Braun, Jeremy Scheurer, Mikita Balesni, Lucius Bushnaq, Charlotte Stix, and Clíodhna Ní Ghuidhir provided feedback and discussion.