Authors: Satvik Golechha*, Sid Black*, Joseph Bloom (*Equal contribution). This work was done as part of the Model Transparency team at the UK AI Security Institute (AISI). Our code is available on GitHub, and the model checkpoints and data are available on HuggingFace. Executive Summary In Natural Emergent Misalignment...
This post is a companion piece to a forthcoming paper. This work was done as part of MATS 7.0 & 7.1. Abstract We explore how LLMs’ awareness of their own capabilities affects their ability to acquire resources, sandbag an evaluation, and escape AI control. We quantify LLMs' self-awareness of capability...
Introduction Joseph Bloom, Alan Cooney This is a research update from the White Box Control team at UK AISI. In this update, we share preliminary results on the topic of sandbagging that may be of interest to researchers working in the field. The format of this post was inspired by...
Please go to the Colab for interactive viewing and to play with the phenomena. For space reasons, not all results from the Colab are included here, so please visit the Colab for the full story. A GitHub repository with the Colab notebook and accompanying data can be found here. This...
Conjecture is hiring! We have open roles across all teams, both technical and non-technical. We have written a bit more about the teams at Conjecture here, and you can view all open positions here. Applications for this hiring round will close December 16. Conjecture is an AI Safety startup that...
This post is a brief retrospective on the last 8 months at Conjecture. It summarizes what we have done, our assessment of how useful this has been, and the updates we are making. Intro Conjecture formed in March 2022 with three founders and five early employees. We spent our first...
This post gives an overview of discussions (from the perspective and understanding of the interpretability team at Conjecture) between mechanistic interpretability researchers from various organizations, including Conjecture, Anthropic, Redwood Research, OpenAI, and DeepMind, as well as some independent researchers. It is not a review of past work, nor...