Summary
* Scaffolded LLM agents are, in principle, able to execute arbitrary code to achieve the goals they have been set. One such goal could be self-improvement.
* This post outlines our plans to build a benchmark to measure the ability of LLM agents to modify and improve other LLM...
We have written a paper on sandbagging; this post presents its abstract and brief results. See the paper for more details. Tweet thread here.

[Figure: Illustration of sandbagging.] Evaluators may regulate the deployment of AI systems with dangerous capabilities, potentially against the interests of the AI system...
This post summarizes work done over the summer as part of the Summer 2023 AI Safety Hub Labs programme. Our results will also be published in an upcoming paper. Here, we focus on explaining how we define and evaluate properties of deceptive behavior in LMs and...