Note: This is not a personal post. I am sharing on behalf of the ARC Evals team.
This document expands on an appendix to ARC Evals’ paper, “Evaluating Language-Model Agents on Realistic Autonomous Tasks.”
We published this report in order to i) increase understanding of the potentially dangerous capabilities of frontier AI models, and ii) advance the state of the art in safety evaluations of such models. We hope that this will improve society's ability to identify models with dangerous capabilities before they are capable of causing catastrophic damage.
It might be argued that this sort of research is itself risky, because it makes it easier to develop and exercise dangerous capabilities in language model agents. Indeed, the author of Auto-GPT said he was inspired by a description of our evaluations in the GPT-4 system card. While a similar project would likely have emerged soon in any event, the possibility of our work advancing the capabilities of language model agents is not merely hypothetical.
In recognition of concerns of this kind, we have made significant redactions to this report, including (but not limited to):
Our rationale, in outline:
We are sharing this rationale here in order to communicate it to some of the people who might be most interested in it and also get feedback on our thinking and decision process.
All agent runs on these tasks were performed under human supervision, with experimenters instructed not to continue a run if the agent was about to interact with a non-consenting human or otherwise engage in any unethical or illegal activity. When we wanted to investigate such activities (as in the targeted phishing task), the experimenter would instead simulate the human's response, as outlined in detail in the task spec.
One may wonder whether the oversight we used could be circumvented by a language agent with sufficient situational awareness, which could then potentially access the internet without oversight or limitations. For example, an agent could take obfuscated actions that do not strike the experimenter as suspicious but in fact cause harm or pose risks to the world. While these are potentially concerning possibilities, our evaluations strongly indicate that current agents are not capable enough to discover and exploit these sorts of vulnerabilities, or to effectively hide their actions under supervision. As model capabilities improve this may change, but we believe these kinds of capability evaluations are precisely what will inform us of the need to handle models more carefully, and we will be watching closely for this.
Based on a written exchange with Toran Bruce Richards.
One reason we believe this is that LangChain was developed independently of, and prior to, the publication of ARC Evals' work.
Experimenters would sometimes run several steps at a time without approving each one, especially when the agent was engaging in routine or low-risk activities. However, every action during web browsing was actively approved by a human overseer, with no autoplay.
The OpenAI system card described an interaction between a model and an unknowing human (via TaskRabbit). That episode was not part of this experiment and was not subject to the same guidelines. You can read more about that separate experiment here.