Summary
* Scaffolded LLM agents are, in principle, able to execute arbitrary code to achieve the goals they have been set. One such goal could be self-improvement.
* This post outlines our plans to build a benchmark to measure the ability of LLM agents to modify and improve other LLM...
We have written a paper on sandbagging; this post presents its abstract and brief results. See the paper for more details. Tweet thread here.

[Figure: Illustration of sandbagging.] Evaluators may regulate the deployment of AI systems with dangerous capabilities, potentially against the interests of the AI system...
This post summarizes work done over the summer as part of the Summer 2023 AI Safety Hub Labs programme. Our results will also be published in an upcoming paper. Here, we focus on explaining how we define and evaluate properties of deceptive behavior in LMs and...