(Creating more visibility for a comment thread with Rohin Shah.) Currently, DeepMind's capabilities evals are run on the post-RL*F (RLHF/RLAIF) models and not on the base models. This worries me because RL*F will train a base model to stop displaying capabilities, but this isn't a guarantee that it trains the...
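(To make the concern concrete, here is a minimal sketch of what "run the same eval on both checkpoints" could look like, assuming the Hugging Face `transformers` library; the checkpoint names and the toy multiple-choice task are placeholders for illustration, not DeepMind's actual harness.)

```python
# Minimal sketch (not DeepMind's actual eval pipeline): score the *same*
# capability probe on a base checkpoint and its post-RL*F counterpart.
# The checkpoint names and the toy task below are hypothetical placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

CHECKPOINTS = {
    "base": "org/base-model",       # hypothetical base checkpoint
    "post_rlf": "org/chat-model",   # hypothetical post-RL*F checkpoint
}

# A one-item toy probe; a real eval would use a full benchmark.
TASKS = [
    {"prompt": "Q: What is 17 * 23?\nA:", "options": [" 391", " 493"], "answer": 0},
]

def option_logprob(model, tokenizer, prompt: str, option: str) -> float:
    """Sum of token log-probs of `option` conditioned on `prompt`."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + option, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
    targets = full_ids[:, 1:]
    token_lp = log_probs.gather(2, targets.unsqueeze(-1)).squeeze(-1)
    # Score only the option tokens (those after the prompt).
    start = prompt_ids.shape[1] - 1
    return token_lp[:, start:].sum().item()

for name, path in CHECKPOINTS.items():
    tok = AutoTokenizer.from_pretrained(path)
    model = AutoModelForCausalLM.from_pretrained(path).eval()
    correct = 0
    for task in TASKS:
        scores = [option_logprob(model, tok, task["prompt"], o) for o in task["options"]]
        correct += int(max(range(len(scores)), key=scores.__getitem__) == task["answer"])
    print(f"{name}: {correct}/{len(TASKS)}")
```

If the base checkpoint scores well above the post-RL*F checkpoint on the same probe, that is evidence the capability was masked rather than removed.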
Summary: Recent interpretability work on "grokking" suggests a mechanism for a powerful mesa-optimizer to emerge suddenly from an ML model. Inspired By: A Mechanistic Interpretability Analysis of Grokking Overview of Grokking In January 2022, a team from OpenAI posted a paper about a phenomenon they dubbed "grokking", where they trained...
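(For readers who want to see the phenomenon directly, here is a minimal sketch of the standard grokking setup, modular addition with a small train split and heavy weight decay, in the spirit of Power et al.; the architecture and hyperparameters are illustrative choices, not the paper's.)

```python
# A minimal sketch of a grokking experiment: learn (a + b) mod P from a
# small fraction of all pairs, with strong weight decay. Hyperparameters
# are illustrative; the original paper used a small transformer instead.
import torch
import torch.nn as nn

P = 97                                   # work in Z/97Z
pairs = torch.cartesian_prod(torch.arange(P), torch.arange(P))
labels = (pairs[:, 0] + pairs[:, 1]) % P
perm = torch.randperm(len(pairs))
n_train = int(0.3 * len(pairs))          # small train fraction is key
train_idx, val_idx = perm[:n_train], perm[n_train:]

model = nn.Sequential(
    nn.Embedding(P, 128),                # shared embedding for both operands
    nn.Flatten(),                        # (B, 2, 128) -> (B, 256)
    nn.Linear(256, 256), nn.ReLU(),
    nn.Linear(256, P),
)
opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1.0)
loss_fn = nn.CrossEntropyLoss()

def accuracy(idx):
    with torch.no_grad():
        return (model(pairs[idx]).argmax(-1) == labels[idx]).float().mean().item()

for step in range(50_000):               # grokking needs many steps
    opt.zero_grad()
    loss = loss_fn(model(pairs[train_idx]), labels[train_idx])
    loss.backward()
    opt.step()
    if step % 1000 == 0:
        print(step, accuracy(train_idx), accuracy(val_idx))
```

In this regime, train accuracy typically saturates early while validation accuracy sits near chance for many steps before jumping, which is the grokking signature; when (and whether) the jump happens is sensitive to the weight decay and train fraction.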
Epistemic Status: I only know as much as anyone else in my reference class (I build ML models, I can grok the GPT papers, and I don't work for OpenAI or a similar lab). But I think my thesis is original. Related: Gwern on GPT-3 For the last several years,...
(This is a half-formed idea from discussions within MIRI; if it's dumb, I take the full blame.) In value learning contexts, it's useful to have a toy model of human psychology, to see (for example) whether a certain approach would work to learn the values of an idealized rational agent,...
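(One concrete instance of such a toy model, offered as my own illustrative assumption rather than anything from those discussions: treat the idealized agent as Boltzmann-rational, choosing options with probability proportional to exp(beta * utility), and check that a maximum-likelihood learner recovers the utilities up to an additive constant.)

```python
# Toy sketch: a Boltzmann-rational "human" with hidden utilities makes
# choices; a simple maximum-likelihood learner tries to recover them.
# All numbers here are made up for illustration.
import numpy as np

rng = np.random.default_rng(0)
true_u = np.array([0.0, 1.0, 2.5])       # hidden utilities over 3 options
beta = 2.0                               # rationality parameter (assumed known)

def choice_probs(u):
    z = np.exp(beta * (u - u.max()))     # subtract max for numerical stability
    return z / z.sum()

# Observe many choices from the idealized agent.
choices = rng.choice(3, size=5000, p=choice_probs(true_u))

# For a softmax model with known beta, the MLE reduces to matching
# empirical frequencies: u_hat = log(freq) / beta, up to a constant.
freq = np.bincount(choices, minlength=3) / len(choices)
u_hat = np.log(freq) / beta
u_hat -= u_hat[0]                        # pin the additive gauge freedom

print("true (shifted):", true_u - true_u[0])
print("recovered:     ", u_hat.round(2))
```

The additive constant is unrecoverable in principle, since shifting all utilities equally leaves the choice distribution unchanged; any value-learning approach tested against this toy agent has to tolerate at least that much ambiguity.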
A half-baked idea that came out of a conversation with Jessica, Ryan, and Tsvi: We'd like to have a straightforward way to define "manipulation", so that we could instruct an AI not to manipulate its developers, or construct a low-impact measure that treats manipulation as a particularly important kind of impact. We...
(An idea from a recent MIRI research workshop; similar to some ideas of Eric Drexler and others. Would need further development before it's clear whether this would do anything interesting, let alone be a reliable part of a taskifying approach.) If you take an AI capable of pretty general reasoning,...