orthonormal

Run evals on base models too!

(Creating more visibility for a comment thread with Rohin Shah.) Currently, DeepMind's capabilities evals are run on the post-RL*F (RLHF/RLAIF) models and not on the base models. This worries me because RL*F will train a base model to stop displaying capabilities, but this isn't a guarantee that it trains the...

Apr 4, 2024 · 51

Mesa-Optimizers via Grokking

Summary: Recent interpretability work on "grokking" suggests a mechanism for a powerful mesa-optimizer to emerge suddenly from an ML model. Inspired by: A Mechanistic Interpretability Analysis of Grokking. Overview of Grokking: In January 2022, a team from OpenAI posted an article about a phenomenon they dubbed "grokking", where they trained...

Dec 6, 2022 · 36

Developmental Stages of GPTs

Epistemic Status: I only know as much as anyone else in my reference class (I build ML models, I can grok the GPT papers, and I don't work for OpenAI or a similar lab). But I think my thesis is original. Related: Gwern on GPT-3. For the last several years,...

Jul 26, 2020 · 140

orthonormal's Shortform

Oct 31, 2019 · 9

Value Learning for Irrational Toy Models

(This is a half-formed idea from discussions within MIRI; if it's dumb, I take the full blame.) In value learning contexts, it's useful to have a toy model of human psychology, to see (for example) if a certain approach would work to learn the values of an idealized rational agent,...

May 15, 2017 · 0

HCH as a measure of manipulation

A half-baked idea that came out of a conversation with Jessica, Ryan, and Tsvi: We'd like to have a straightforward way to define "manipulation", so that we could instruct an AI not to manipulate its developers, or construct a low-impact measure that takes manipulation as a particularly important impact. We...

Mar 11, 2017 · 1

Censoring out-of-domain representations

(An idea from a recent MIRI research workshop; similar to some ideas of Eric Drexler and others. Would need further development before it's clear whether this would do anything interesting, let alone be a reliable part of a taskifying approach.) If you take an AI capable of pretty general reasoning,...

Feb 1, 2017 · 3


Cooperative Inverse Reinforcement Learning vs. Irrational Human Preferences
