Note: This is not a personal post. I am sharing on behalf of the ARC Evals team.
This document expands on an appendix to ARC Evals’ paper, “Evaluating Language-Model Agents on Realistic Autonomous Tasks.”
We published this report in order to i) increase understanding of the potentially dangerous capabilities of frontier AI models, and ii) advance the state of the art in safety evaluations of such models. We hope that this will improve society's ability to identify models with dangerous capabilities before they are capable of causing catastrophic damage.
It might be argued that this sort of research is itself risky, because it makes it easier to develop and exercise dangerous capabilities in language model agents. Indeed, the author of Auto-GPT said he was inspired by a description of our evaluations in the GPT-4 system card. While a similar project would likely have emerged soon in any event, the possibility of our work advancing the capabilities of language model agents is not merely hypothetical.
In recognition of concerns of this kind, we have made significant redactions to this report, including (but not limited to):
Our rationale, in outline:
We are sharing this rationale here in order to communicate it to some of the people who might be most interested in it and also get feedback on our thinking and decision process.
All agent runs on these tasks were performed under human supervision, with experimenters instructed not to continue a run if the agent was about to interact with a non-consenting human or otherwise engage in any unethical or illegal activity. When we wanted to investigate such activities (as in the targeted phishing task), the experimenter would instead simulate the human's response, as outlined in detail in the task spec.
One may wonder whether the oversight we used could be circumvented by a language agent with sufficient situational awareness, which could then potentially access the internet without oversight or limitations. For example, an agent could take obfuscated actions that do not strike the experimenter as suspicious but in fact cause harm or pose risks to the world. While these are potentially concerning possibilities, our evaluations strongly indicate that current agents are not capable enough to discover and exploit these sorts of vulnerabilities, or to effectively hide their actions under supervision. As model capabilities improve this may change, but we believe these kinds of capability evaluations are precisely what will inform us of the need to handle models more carefully, and we will be watching closely for this.
Based on a written exchange with Toran Bruce Richards.
One reason we believe this is that LangChain was developed independently of, and prior to, the publication of ARC Evals' work.
Experimenters would sometimes run several steps at a time without approving each one, especially when the agent was engaging in routine or low-risk activities. However, every action during web browsing was actively approved by a human overseer, with no autoplay.
The OpenAI system card described an interaction between a model and an unknowing human (via TaskRabbit). That episode was not part of this experiment and was not subject to the same guidelines. You can read more about that separate experiment here.