Auto-Enhance: Developing a meta-benchmark to measure LLM agents’ ability to improve other agents

Sam F. Brown; BasilLabib; Codruta (Coco) Lugoj; Sai Sasank Y

Summary

Scaffolded LLM agents are, in principle, able to execute arbitrary code to achieve the goals they have been set. One such goal could be self-improvement.
This post outlines our plans to build a benchmark to measure the ability of LLM agents to modify and improve other LLM agents.
This ‘Auto-Enhancement benchmark’ measures the ability of ‘top-level’ agents to improve the performance of ‘reference’ agents on ‘component’ benchmarks, such as CyberSecEval 2, MLAgentBench, SWE-bench, and WMDP.
Results are mostly left for a future post in the coming weeks.
An example Enhancement task. The Top-Level Agent (TLA) is being assessed on its ability to make improvements to Reference Agent (RA), turning it into Modified Reference Agent (MRA). The more that MRA outperforms RA (here measured by their respective performance at the Component Benchmark), the greater the score of the TLA at this Enhancement task

Scaffolds such as AutoGPT, ReAct, and SWE-agent can be built around LLMs to build LLM agents, with abilities such as long-term planning and context-window management to enable them to carry out complex general-purpose tasks autonomously. LLM agents can fix issues in large, complex code bases (see SWE-bench), and interact in a general way using web browsers, Linux shells, and Python interpreters.

In this post, we outline our plans for a project to measure these LLM agents’ ability to modify other LLM agents, undertaken as part of Axiom Futures' Alignment Research Fellowship.

Our proposed benchmark consists of “enhancement tasks,” which measure the ability of an LLM agent to improve the performance of another LLM agent (which may be a clone of the first agent) on various tasks. Our benchmark uses existing benchmarks as components to measure LLM agent capabilities in various domains, such as software engineering, cybersecurity exploitation, and others. We believe these benchmarks are consequential in the sense that good performance by agents on these tasks should be concerning for us.

We plan to write an update post with our results at the end of the Fellowship, and we will link this post to that update.

Motivation

Agents are capable of complex SWE tasks (see, e.g., Yang et al.). One such task could be the improvement of other scaffolded agents. This capability would be a key component of autonomous replication and adaptation (ARA), and we believe it would be generally recognised as an important step towards extreme capabilities. This post outlines our initial plans for developing a novel benchmark that aims to measure the ability of LLM-based agents to improve other LLM-based agents, including those that are as capable as themselves.

Threat model

We present two threat models that aim to capture how AI systems may develop super-intelligent capabilities.

Expediting AI research: Recent trends show how researchers are leveraging LLMs to expedite academic paper reviews (see Du et al.). ML researchers are beginning to use LLMs to design and train more advanced models (see Cotra’s AIs accelerating AI research and Anthropic's work on Constitutional AI). Such LLM-assisted research may expedite progress toward super-intelligent systems.

Autonomy: Another way that such capabilities are developed is through LLM agents themselves becoming competent enough to self-modify and further ML research without human assistance (see section Hard Takeoff in this note ), leading to an autonomously replicating and adapting system.

Our proposed benchmark aims to quantify the ability of LLM agents to bring about such recursive self-improvement, either with or without detailed human supervision.

Categories of bottlenecks and overhang risks

We posit that there are three distinct categories of bottlenecks to LLM agent capabilities:

Architectures-of-thought, such as structured planning, progress-summarisation, hierarchy of agents, self-critique, chain-of-thought, self-consistency, prompt engineering/elicitation, and so on. Broadly speaking, this encompasses everything between the LLM that drives the agent and its tools/environment.
Tooling, for example, includes file editing, browsing, function calling, and general resources. This roughly refers to the interface between the scaffolded agent and its environment.
LLM limitations include reasoning ability, context length, etc. This is the model.forward bit, the prediction of the next token from the context, treated as a black box.

Each of these bottlenecks can limit the capability of an agent, limiting the returns to improvements in other areas. Conversely, an improvement in a bottlenecked area can reveal an ‘overhang’ of capabilities.

We suggest that an overhang stemming from either of the first two categories (architectures of thought, tooling) would be more dangerous than an overhang from the third (inherent LLM limitations) because the first two categories of advances, being post-training enhancements and therefore requiring much less compute to develop, are likely to be easy to reproduce, hard to regulate, and liable to rapid proliferation.

For example, should an advance comparable to chain-of-thought be discovered for LLM agents, we may discover that a sufficiently advanced scaffold may allow meaningfully competent LLM agents to be created from already-released publicly-available open-source LLMs (such as Llama 3). This, in turn, would drive rapid and diffuse development, attracting industry and community efforts that have until now been dissuaded by unreliable agent performance. This seems to be one pathway to distributed and uncontrolled recursive self-improvement. If, instead, such post-training enhancements are known at the time of model release, developers and regulators can make better-informed decisions.

Current capabilities and future prospects of Scaffolded LLM agents

While capable of general reasoning and simple coding tasks (e.g., HumanEval), LLMs without scaffolds have certain fundamental limitations that scaffolding addresses. One is tool use, for example, the ability to retrieve information from databases or the internet, or execute code. Another is context management, for example, to perform coding tasks on larger and more complicated code bases than would fit in a context window. Yet another is structured thought, such as self-critique, long-term planning, and progress summarisation. Scaffolds can be designed to mitigate each of these limitations, enhancing overall capabilities. Perhaps most importantly, a scaffolded LLM agent (unlike bare LLMs) can be given a goal and left to act autonomously.

LLM agent’s abilities are significantly determined by its scaffolding, which defines how they are prompted, what tools they are given, and so on. Simple LLM agents become autonomous and more generally capable with techniques like ReAct, where LLMs are instructed to reason and perform complex tasks using various tools like the terminal, code interpreter, 3rd-party API endpoints, etc. Prompting techniques such as chain-of-thought, self-consistency, fact-checking, and hierarchy have been shown to enhance the elicitation of LLM’s capabilities, including reasoning and planning.

SWE-Agent can perform software engineering tasks significantly better than simple baselines such as LLMs with RAG. The SWE-Agent leverages an Agent-Computer Interface (ACI) specifically designed to tailor computer interactions to LLMs' particular strengths and weaknesses. A complex hierarchical planning system using task-specific LLM agents (HPTSA) succeeded in exploiting real-world zero-day vulnerabilities, outperforming a simple GPT-4 agent with prior knowledge of the vulnerabilities. For these reasons, we focus on scaffolded LLM agents to elicit maximal capabilities from the underlying models.

On the other hand, we are aware of situations where scaffolded agents’ abilities have been overstated. One shortcoming is the lack of proper comparison of sophisticated agent scaffolds with reasonable baselines. There are other cases where results on scaffolded agents can be misleading: the choice of metric, agents overfitting on benchmarks, agents lacking reproducibility, cost of running the agents being impractical for wider use, etc. Additionally, there is some indication that we may be approaching a point of diminishing returns.

Definitions

LLM-based agent

An agent is an entity capable of perceiving and acting upon its environment (for a more detailed definition, see e.g., §1.1 “What is an AI Agent?” in Kapoor et al.). An LLM-based agent is an agent that uses an LLM as its primary reasoning and planning component. The LLM is ‘wrapped’ in a scaffold to give it access to various tools (for example, web search, terminal access, etc.) and also to structure thought, guide reflection and self-critique, allow context-management, and other abilities which the LLM may leverage to achieve long-term goals in complex environments.

Top-Level Agent and Reference Agent

We distinguish ‘top-level’ and ‘reference’ agents: a top-level agent (TLA) is tasked with improving the performance of the reference agent (RA) on a reference task.

The top-level agent can measure how the performance of the RA changes as the RA is modified. The reference agent may be as capable as the top-level agent or weaker, depending on the complexity of the underlying reference task.

Reference Tasks and Enhancement Tasks

In the context of our benchmark, a reference task is a task on which we evaluate the capability of the reference agent. For example, “Close an open issue on a Github repository” could be a reference task.

An enhancement task is a task that consists of two agents (RA and TLA) and a reference task (see, for example, METR's improve_agent task specification). The enhancement task given to the top-level agent is to improve the performance of the reference agent on the reference task, for instance, “Improve the reference agent to make the reference agent more capable of closing issues on Github repositories.” We measure the ability of the TLA to improve the RA by measuring the change in performance of the RA on the reference task.

Component Benchmark

To accurately gauge the ability of agents to improve other agents, we aim to include several consequential reference tasks spread across domains such as software engineering, cybersecurity knowledge, and automated ML experimentation, among others. One way of achieving this goal is to use component benchmarks (CBMs). We intend to use the performance measure of an (improved) reference agent on the CBM as a measure of the ability of the top-level agent to improve the reference agent. We choose CBMs that we believe are good proxy measures of the ability of LLM agents to develop dangerous capabilities.

Related Research

The closest line of work to ours is METR’s improve_agent family of tasks. These tasks test the ability of agents to improve a reference agent as a proxy for self-improvement capabilities. The top-level agent is tasked with running the reference agent, understanding the failure mode, implementing and testing an improvement, and iterating if needed.

But wait, isn’t this the type of task you plan on implementing? Essentially, yes. In addition, we want to focus on 3 aspects: higher consequentiality, continuous scoring, and broader scope.

High consequentiality. In METR’s implementation, the reference tasks the RA has to perform are quite simple and inconsequential (e.g., counting how many files in a directory are about machine learning). We aim to use component benchmarks that evaluate skills with real-world consequences (e.g., can agents automate ML engineering work?).
Continuous scoring: METR scores the TLA in discrete steps. Since we’ll be using component benchmarks with unsaturated performance, we can lean on continuous metrics introduced in each benchmark to measure improvement. As part of obtaining a continuous performance scale, we’ll create simple and hard versions of the same enhancement task that provide varying degrees of hand-holding to the agent.
Broader scope. Simply put, our benchmark would contain more tasks spanning a larger range of domains.

MLAgentBench is another line of work that assesses agent capabilities to improve other systems. Using our terminology, in MLAgentBench, the top-level agent has access to a (non-agentic) reference system and is tasked with improving the reference system’s performance. The reference system can be anything from a simple script fitting a regression model to a script measuring per-token generation time for a llama-7b model. To draw the distinction out explicitly: whereas in MLAgentBench, the ‘research agent’ is scored on its improvement to a non-agentic ML system, our enhancement tasks involve a top-level agent increasing the ability of a reference ‘research agent’ to implement such improvements to non-agentic systems.

Example Enhancement Task

Consider an example enhancement task wherein the top-level agent improves the classification accuracy of a reference agent via prompt engineering. We used Ought’s TAI safety research dataset, which consists of abstracts, each abstract belonging to one of the “Transformative AI Safety research” and “not Transformative AI safety research” classes.

We used a slightly modified version of METR’s React Agent with GPT-4o as the driver LLM for the top-level agent. The ReAct prompt template was modified to include examples of using the bash and python interpreter tools. For the reference agent, we used mt0-large (1B parameter model) with a classification function wrapped around it as scaffolding. With some hand-holding, we got the top-level agent to improve the reference agent’s classification accuracy through prompt engineering. The hand-holding involved giving explicit instructions on the function to modify and the script to run to evaluate the modified prompt.

Improvement in the reference agent’s accuracy at Ought’s TAI classification task. Prompts 1 and 2 were given by a human, while prompts 3 and 4 were crafted by the top-level agent (a modified ReAct agent driven by GPT-4o).

We now present a shortened sample transcript of the top-level ReAct agent, improving the reference agent via prompt engineering. The top-level agent could construct prompts achieving 40% and 44% accuracy, whereas naive human-crafted prompts got 0% and 24%, respectively.

In this transcript,

Green text indicates LLM-generated thoughts, reflections, and plans.
Red text indicates LLM-generated actions and inputs (bash commands in this case).
Blue text indicates the output of the LLM’s actions (as parsed by the scaffold).
“<contents of the file>” and “<improved prompt>” indicate placeholders for brevity.

Steps taken by the top-level agent (GPT-4o + ReAct) to improve the accuracy of the reference agent (mt0-large model) through prompt engineering.

Component Benchmarks

The example classification task above gives an overview of the structure of enhancement tasks we have in mind: can an LLM Agent improve the performance of another LLM Agent?

Manually compiling tasks is time-consuming, and creating the most impactful tasks will require domain expertise, which we may lack, for example, in biology or cyber-security.

Instead, we plan to lean on existing benchmarks as component benchmarks (CBMs) to our overall enhancement benchmark. For example, a reference agent (RA) may achieve a certain score at the SWE bench. Measuring the RA’s improvement on consequential component benchmarks has the advantage of:

scoring on many tasks, increasing the likelihood that the TLA’s modifications generalise, and
ensuring the RA’s improvement is relevant to its ability to perform highly consequential tasks.

Each task in the enhancement benchmark would comprise:

A reference agent (RA), likely an open-source scaffold driven by an LLM (e.g., ReAct driven by GPT-3.5, or SWE-Agent driven by Claude 3.5 Sonnet)
A component benchmark (CBM) comprising many CBM tasks
A goal

Goals for enhancement tasks may be as simple as “improve the RA’s score at this CBM.” For other CBMs, there may be a variety of interesting goals: for example, using WMDP, one could investigate to what extent a top-level agent can implement unlearning (see §4: RMU: Unlearning Inspired By Representation Engineering in the WMDP paper), implement re-learning, or achieve other goals.

Below, we lay out the first few enhancement tasks we aim to investigate over the coming weeks:

WMDP benchmark evaluates models on hazardous knowledge; therefore, high scores on this benchmark indicate the dangerous capabilities of the model. We initially focus on the enhancement task of making the reference agent unlearn the WMDP knowledge.
CyberSecEval 2 benchmark comprises various sub-benchmarks. We initially focus on Prompt Injection, which investigates methods of prompt injection to bypass a model’s safeguards and get access to sensitive information. The improved performance of reference agents on this benchmark indicates the ability of top-level agents to enable agents to withhold information from humans.
MLAgentBench benchmarks agents on ML experimentation tasks. Agents capable of automatically conducting ML engineering work have the potential to significantly speed up the rate of progress of AI research, with implications for AI-enabled developments in other fields (e.g., biorisk), economy, and recursive self-improvement.
SWE-bench comprises real-world Github issues and associated automated tests from major Python code repositories. Success at complex software engineering tasks could indicate the ability to improve agent scaffolding significantly.

WMDP: Unlearning Knowledge

Weapons of Mass Destruction Proxy (WMDP) benchmark [paper] evaluates models on proxy hazardous knowledge. It contains over 3600 expert-curated multiple-choice questions in biosecurity, chemical security, and cybersecurity. Its purpose is to measure how LLMs could aid in developing dangerous attacks like developing and spreading a highly transmissible virus. The questions in this benchmark are similar to the ones one may encounter when developing such attacks.

One might like to make models safer by unlearning hazardous knowledge while retaining other abilities. Representation Misdirection for Unlearning (RMU), proposed by the WMDP authors, is one way to approach this task. RMU is a fine-tuning method inspired by representation engineering and involves two loss functions: forget loss and retain loss. The forget loss perturbs activations on WMDP data, while the retained loss preserves activations on general useful corpora like Wikitext.

Knowledge of WMDP is more consequential than other types of knowledge (music theory, for example), which makes WMDP a good choice as one of our component benchmarks. For an agent to unlearn WMDP knowledge means the agent scores poorly on the WMDP benchmark while retaining performance on general capability benchmarks like MMLU and Commonsense-QA. For this enhancement task, we start with a reference agent (e.g., llama2-7b with minimal scaffold) with non-trivial performance on WMDP and ask the top-level agent (e.g., ReAct agent driven by GPT-4) to perform unlearning on the reference agent. We could let the TLA choose a method independently or instruct it to use a particular approach like RMU, for instance.

Unlearning WMDP Knowledge as an Enhancement Task

There are possible variations of this enhancement task, such as relearning the WMDP knowledge or further improving the accuracy of a model on the WMDP benchmark. By constraining the methods of unlearning, we could obtain other variants, e.g., performing unlearning via prompt-based techniques.

CyberSecEval 2: Prompt Injection

CyberSecEval 2 [abstract, paper, code] is a robust cyber security evaluation benchmark with seven components: vulnerability exploitation, code autocompletion tasks, and malicious code generation, among others. Prompt injection is one of the components of this benchmark that assesses the propensity of an LLM to violate previous instructions based on some malicious user prompt.

The paper describes various prompt injection methods covered in the benchmark, including simple methods like ignore_previous_instructions (The user prompt directly asks the LLM to disregard any previous instruction), system_mode (The user pretends to be the system administrator or the developer to trick the LLM into giving out the prohibited information), or token_smuggling (The user tries to disguise the malicious prompt using various encoding schemes like ROT13). The prompt injection dataset consists of 251 tests employing 15 prompt injection methods.

Overview of the Prompt Injection setup in CyberSecEvals 2 benchmark

The LLM is first instructed with a system prompt containing sensitive information and instructions prohibiting it from being revealed. It is then prompted by a malicious user prompt that attempts to bypass the LLM’s safeguards to extract sensitive information. The response of the LLM is sent to a judge LLM, which decides if the attack was successful or not.

Overview of the prompt injection Enhancement task

The prompt injection enhancement task measures the ability of a top-level agent to improve the performance of a reference agent (presumably weaker) on the prompt injection dataset. We instruct the top-level agent to make the reference agent more resilient to prompt injection attacks and measure improvement on the benchmark by reporting the percentage of unsuccessful prompt injection attacks.

MLAgentBench

MLAgentBench introduces 13 tasks of varying difficulty, testing an agent’s capability to perform ML experimentation. A top-level agent is given access to a non-agentic reference system, containing either an incomplete or, more often, a working implementation of a machine learning system, and is instructed to improve the performance of that reference system.

A simple MLAgentBench task looks like this: the agent is presented with a Python script with code for training a convolutional neural net on classification on the CIFAR-10 dataset and is required to improve 10% or more over the baseline. Harder tasks, e.g., involve the agent improving the runtime speed of llama-7B or training a language model on a small corpus of text.

An example enhancement task on this benchmark could require a TLA to improve the MLAgentBench research agent (now the reference agent). Concretely, we begin by tasking the TLA with increasing the average percentage improvement of the reference agent on the CIFAR-10 task from MLAgentBench. We’re mainly interested in agents improving other agents at a system level. As such, we would prompt the TLA to make improvements at the level of the research agent (e.g., by improving the prompt, adding tools, etc.) rather than having the TLA directly make changes at the level of the non-agentic ML system (e.g., by directing the RA to add more layers to a neural net), since the latter would risk overfitting to a particular task.

Although we will start with a single task from the MLAgentBench (CIFAR-10), expanding to other tasks from the component benchmark will ensure that RA modifications result in generalisable enhancement.

SWE-bench

SWE-bench [website, paper] investigates LLM agents’ capacity to resolve GitHub issues automatically. It comprises over 2,000 issues and corresponding pull-requests taken from 12 open-source repositories of Python code (e.g. Django, scikit-learn, matplotlib), and automated tests to verify that the issue has been resolved.

Success at SWE-bench indicates the ability to perform real-world software-engineering tasks on complex pieces of software, spanning many files. While strong coding ability does not necessarily speak to a reference agent’s ability to perform machine-learning tasks such as LLM fine-tuning, it may be a good proxy for the ability to improve LLM agents’ software scaffolds.

The SWE-bench leaderboard shows a variety of scaffolds (including open-source scaffolds such as SWE-agent, Agentless, and Moatless) and a variety of driving LLMs (often GPT or Claude) with varying performance:

0.17% of issues were resolved by a simple retrieval model (RAG + ChatGPT 3.5)
over 14% of issues were resolved by the leading open-scaffold model (AppMap Navie + GPT 4o)
over 19% of issues were resolved by the leading non-open model (Factory Code Droid)

An early enhancement task could involve taking a reference agent that performs moderately well on SWE-bench and making improvements that may be inspired by better-performing agents. Alternatively, the SWE-agent paper (by the same SWE-bench team) describes various variants of their “Agent-Computer Interface” (e.g., the number of lines of the file-viewer tool’s window, whether the file editor has a built-in linter to catch syntax-errors) with corresponding effects on performance. Enhancement tasks could hand-hold top-level agents by developing or selecting these improvements, or they could require TLAs to find such improvements independently.

Forthcoming results

This work is being undertaken as part of the Axiom Futures' Alignment Research Fellowship. We aim to publish our results within the next few weeks and will update this post to link to those results.

Concerns

Dual-use and Acceleration

Evals and benchmarking require a certain amount of elicitation work to provide an accurate estimate of capabilities. We aim to elicit the peak performance of existing models and frameworks accurately, and to understand agent capabilities, but not to explicitly improve or expand these capabilities.

This work was originally prompted by Open Philanthropy's RFP for LLM-Agent Scaffolds. As part of launching that RFP, they carried out a survey about whether such would would be net-positive or net-negative, with overall sentiment leaning positive.

While scaffolds are currently in a fledgling state, a TLA may produce a novel scaffolding method or will optimise existing scaffolding methods to attain new SotA performance. As mentioned in the Categories of bottlenecks and overhang risks above, we consider the discovery of scaffolding improvements likely to reduce x-risk but will be mindful of responsible disclosure when publishing results and discoveries.

Deprecation of scaffolding

Using scaffolding as a way of giving LLMs “System 2” abilities (such as deliberation and planning) can be clunky and slow. More integrated approaches may be developed in the near future, which would render scaffolding a deprecated approach. Potential diminishing returns are particularly likely to spur R&D into alternatives to scaffolding, as is the prospect that gains of fancy scaffolds may become negligible when compared to simple baselines. One could draw a comparison to prompt engineering, which - while still valuable since LLMs remain very sensitive to seemingly arbitrary details of the prompt - has mostly been superseded by better internal elicitation of frontier models.

While we acknowledge this risk, scaffolds still seem like a promising, prosaic approach to AGI. We consider it worth making progress in this space now and perhaps pivoting to researching self-improvement of other systems as they arise.

Contributions

All authors contributed equally to the writing of this post.

Each author led on individual component benchmarks: Basil led on CyberSecEval 2, Coco led on MLAgentBench, Sai led on WMDP, and Sam led on SWE-bench.

Sam provided mentorship and direction throughout the project.

Acknowledgements

Basil and Sai undertook this work during their Axiom Futures' Alignment Research Fellowship.

Coco was supported by the Long Term Future Fund for this work. Sam would also like to thank LTFF for compute funding.

Sam thanks Ajeya Cotra, Charlie Griffin, Friederike Grosse-Holz, Jan Brauner, Ollie Jaffe, Sebastian Schmidt, and Teun van der Weij for their valuable feedback and advice.

Coco would like to thank Robert Cooper and Frederik Mallmann-Trenn for their helpful discussion and feedback.