Scott Emmons — AI Alignment Forum

Evaluating and monitoring for AI scheming

by Vika, Scott Emmons, Erik Jenner, Mary Phuong, Lewis Ho, and Rohin Shah

As AI models become more sophisticated, a key concern is the potential for “deceptive alignment” or “scheming”. This is the risk of an AI system becoming aware that its goals do not align with human instructions, and deliberately trying to bypass the safety measures put in place by humans to...

Jul 10, 2025 · 53

Image Hijacks: Adversarial Images can Control Generative Models at Runtime

You can try our interactive demo! (Or read our preprint.) Here, we want to explain why we care about this work from an AI safety perspective. Concerning Properties of Image Hijacks: What are image hijacks? To the best of our knowledge, image hijacks constitute the first demonstration of adversarial inputs...

Sep 20, 2023 · 58

Uncovering Latent Human Wellbeing in LLM Embeddings

by ChengCheng, Pedro Freire, Dan H, and Scott Emmons

tl;dr A one-dimensional PCA projection of OpenAI's text-embedding-ada-002 achieves 73.7% accuracy on the ETHICS Util test dataset. This is comparable with the 74.6% accuracy of BERT-large finetuned on the entire ETHICS Util training dataset. This demonstrates how language models are developing implicit representations of human utility even without direct preference...

Sep 14, 2023 · 32

Contrast Pairs Drive the Empirical Performance of Contrast Consistent Search (CCS)

tl;dr Contrast consistent search (CCS)[1] is a method by Burns et al. that consists of two parts: 1. Generate contrast pairs by adding pseudolabels to an unlabelled dataset. 2. Use the contrast pairs to search for a direction in representation space that satisfies logical consistency properties. In discussions with other...

May 31, 2023 · 97

Storytelling Makes GPT-3.5 Deontologist: Unexpected Effects of Context on LLM Behavior

by Edmund Mills and Scott Emmons

TL;DR When prompted to make decisions, do large language models (LLMs) show power-seeking behavior, self-preservation instincts, and long-term goals? Discovering Language Model Behaviors with Model-Written Evaluations (Perez et al.) introduced a set of evaluations for these behaviors, along with other dimensions of LLM self-identity, personality, views, and decision-making. Ideally, we’d...

Mar 14, 2023 · 17

Introducing the Fund for Alignment Research (We're Hiring!)

by AdamGleave, Scott Emmons, Ethan Perez, and Claudia Shi

Cross-posted to the EA Forum. The Fund for Alignment Research (FAR) is hiring research engineers and communication specialists to work closely with AI safety researchers. We believe these roles are high-impact, contributing to some of the most interesting research agendas in safety. We also think they offer an excellent opportunity...

Jul 6, 2022 · 62