ML Alignment Theory Scholars Program Winter 2021
Risks from Learned Optimization

Risks from Learned Optimization: Introduction

Sure—I just edited it to be maybe a bit less jarring for those who know Greek.

[Link] A minimal viable product for alignment

I think this concern is only relevant if your strategy is to do RL on human evaluations of alignment research. If instead you just imitate the distribution of current alignment research, I don't think you get this problem, at least any more than we have it now, and I think you can still substantially accelerate alignment research with just imitation. Of course, you still have inner alignment issues, but from an outer alignment perspective I think imitation of human alignment research is a pretty good thing to try.

A Longlist of Theories of Impact for Interpretability

The reason other possible reporter heads work is that they access the model's concept for X and then do something with that (where the 'doing something' might be done in the core model or in the head).

I definitely think there are bad reporter heads that don't ever have to access X. E.g. the human imitator only accesses X if X is required to model humans, which is certainly not the case for all X.

E.g. simplicity of explanation of a particular computation is a bit more like a speed prior.

I don't think that the size of an explanation/proof of correctness for a program should be very related to how long that program runs—e.g. it's not harder to prove something about a program with larger loop bounds, since you don't have to unroll the loop, you just have to demonstrate a loop invariant.
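
As a rough illustration of the loop-invariant point, here is a minimal sketch. The function `sum_to` and the particular invariant are hypothetical, chosen only for illustration, and the runtime assertions merely check the invariant rather than prove it. The thing to notice is that the invariant, and the one-step argument that it is preserved, are the same size whether `n` is 10 or 100000.

```python
def sum_to(n: int) -> int:
    """Return 0 + 1 + ... + (n - 1), checking a loop invariant on every step.

    Hypothetical toy example: the invariant below does not mention n at all,
    so the correctness argument does not grow with the loop bound.
    """
    if n < 0:
        raise ValueError("n must be non-negative")
    total = 0
    i = 0
    while i < n:
        # Invariant: total is the sum of the integers 0 .. i-1.
        assert total == i * (i - 1) // 2
        total += i
        i += 1
    # The invariant plus the exit condition (i == n) gives the postcondition.
    assert total == n * (n - 1) // 2
    return total
```

Calling `sum_to(10)` and `sum_to(100000)` takes very different amounts of time, but both are covered by the same single-line invariant.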

Gears-Level Mental Models of Transformer Interpretability

Understanding “Deep Double Descent”

SGD is meant as a shorthand that includes other similar optimizers like Adam.

Use the length of the shortest interpretability explanation of behaviours of the model as a training loss for ELK - the idea is that models with shorter explanations are less likely to include human simulations / you can tell if they do.

Maybe this is not the right place to ask this, but how does this not just give you a simplicity prior?

Musings on the Speed Prior

This is not true in the circuit-depth complexity model. Remember that an arbitrary lookup table has O(log n) circuit depth. If the function I'm trying to memorize is f(x) = (x & 1), the fastest circuit has O(1) depth, whereas a lookup table has O(log n) depth.
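
To make the depth comparison concrete, here is a small sketch under some stated assumptions: n is taken to be the number of table entries (a power of two), the lookup table is modeled as a balanced tree of 2-to-1 multiplexers, and depth is counted in multiplexer levels. The names `mux_tree_lookup` and `low_bit` are hypothetical and not from the discussion.

```python
def mux_tree_lookup(table, x):
    """Select table[x] by simulating a balanced tree of 2-to-1 multiplexers.

    Assumes len(table) is a power of two. Returns (value, levels), where
    levels is the number of multiplexer layers the signal passed through.
    """
    level = list(table)
    bit = 0
    levels = 0
    while len(level) > 1:
        sel = (x >> bit) & 1  # address bit that drives this layer of muxes
        # Each mux chooses between an adjacent pair of inputs.
        level = [level[j + 1] if sel else level[j]
                 for j in range(0, len(level), 2)]
        bit += 1
        levels += 1
    return level[0], levels


def low_bit(x):
    """Direct circuit for f(x) = x & 1: read off the lowest input wire."""
    return x & 1


# Memorize f as a 16-entry table and compare against the direct circuit.
table = [low_bit(i) for i in range(16)]
results = [mux_tree_lookup(table, x) for x in range(16)]
```

With 16 entries, every lookup passes through 4 multiplexer levels, and that number grows as log2(n), while `low_bit` uses a constant amount of circuitry regardless of table size.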

Certainly, I'm assuming that the intended function is not in O(log n), though I think that's a very mild assumption for any realistic task.

I think the prior you're suggesting is basically a circuit size prior. How do you think it differs from that?

Challenges with Breaking into MIRI-Style Research

One of my hopes with the SERI MATS program is that it can help fill this gap by providing a good pipeline for people interested in doing theoretical AI safety research (be that me-style, MIRI-style, Paul-style, etc.). We're not accepting public applications right now, but the hope is definitely to scale up to the point where we can run many of these every year and accept public applications.

