Megan Kinniment — AI Alignment Forum

Bounty: Diverse hard tasks for LLM agents

Update 3/14/2024: This post is out of date. For current information on the task bounty, see our Task Development Guide. Summary: METR (formerly ARC Evals) is looking for (1) ideas, (2) detailed specifications, and (3) well-tested implementations for tasks to measure performance of autonomous LLM agents. Quick description of key...

Dec 17, 2023

Send us example gnarly bugs

by Beth Barnes, Megan Kinniment, and Tao Lin

Update: We are no longer accepting gnarly bug submissions. However, we are still accepting submissions for our Task Bounty! Tl;dr: Looking for hard debugging tasks for evals, paying the greater of $60/hr or $200 per example. METR (formerly ARC Evals) is interested in producing hard debugging tasks for models to attempt...

Dec 10, 2023

Steering Behaviour: Testing for (Non-)Myopia in Language Models

by Evan R. Murphy and Megan Kinniment

Authors' Contributions: Both authors contributed equally to this project as a whole. Evan did the majority of implementation work, as well as the work for writing this post. Megan was more involved at the beginning of the project, and did the majority of experiment design. While Megan did give some...

Dec 5, 2022

Recall and Regurgitation in GPT2

The first half of this post uses causal tracing to explore differences in how GPT2-XL handles completing cached phrases vs completing factual statements. The second half details my attempt to build intuitions about the high-level structure of GPT2-XL and is speculation heavy. Some familiarity with transformer architecture is assumed but...

Oct 3, 2022

Megan Kinniment's Shortform

Jul 14, 2022

Exploring Mild Behaviour in Embedded Agents

Thanks to Tristan Cook, Emery Cooper, Nicolas Macé and Will Payne for discussions, advice, and comments. All views are my own. I think this post is best viewed as an effort to point at a potentially interesting research direction. The specific arguments I make centered around inducing mild behaviour in...

Jun 27, 2022