Comments

Risks from Learned Optimization: Introduction

Sure—I just edited it to be maybe a bit less jarring for those who know Greek.

[Link] A minimal viable product for alignment

I think this concern is only relevant if your strategy is to do RL on human evaluations of alignment research. If instead you just imitate the distribution of current alignment research, I don't think you get this problem, at least not any more than we have it now, and I think you can still substantially accelerate alignment research with just imitation. Of course, you still have inner alignment issues, but from an outer alignment perspective I think imitation of human alignment research is a pretty good thing to try.

A Longlist of Theories of Impact for Interpretability

The reason other possible reporter heads work is that they access the model's concept for X and then do something with that (where the 'doing something' might be done in the core model or in the head).

I definitely think there are bad reporter heads that don't ever have to access X. E.g. the human imitator only accesses X if X is required to model humans, which is certainly not the case for all X.

A Longlist of Theories of Impact for Interpretability

E.g. simplicity of explanation of a particular computation is a bit more like a speed prior.

I don't think that the size of an explanation/proof of correctness for a program should be closely related to how long that program runs—e.g. it's not harder to prove something about a program with larger loop bounds, since you don't have to unroll the loop; you just have to demonstrate a loop invariant.
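To make the loop-invariant point concrete, here is a minimal sketch in Python (my own illustration, not from the original exchange). The runtime asserts stand in for the one-time proof obligations: the two-line invariant argument for `sum_first_n` is the same whether the loop bound is 10 or 10^9, even though the runtime grows with the bound.

```python
# Minimal sketch: the proof that sum_first_n(n) == n*(n+1)//2 rests on one loop
# invariant, checked here with runtime asserts purely for illustration. The size
# of the argument (base case + preservation + exit condition) does not depend on
# the loop bound, even though the number of iterations does.

def sum_first_n(n: int) -> int:
    total, i = 0, 0
    # Invariant: total == i * (i + 1) // 2 and 0 <= i <= n
    while i < n:
        assert total == i * (i + 1) // 2   # invariant holds at the top of the loop
        i += 1
        total += i
        assert total == i * (i + 1) // 2   # invariant is preserved by the body
    # On exit i == n, so the invariant gives total == n * (n + 1) // 2
    return total

assert sum_first_n(10) == 55
assert sum_first_n(10_000) == 50_005_000   # same invariant argument, much larger bound
```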

Gears-Level Mental Models of Transformer Interpretability

(Moderation note: added to the Alignment Forum from LessWrong.)

Understanding “Deep Double Descent”

SGD is meant as a shorthand that includes other similar optimizers like Adam.

A Longlist of Theories of Impact for Interpretability

Use the length of the shortest interpretability explanation of behaviours of the model as a training loss for ELK - the idea is that models with shorter explanations are less likely to include human simulations / you can tell if they do.

Maybe this is not the right place to ask this, but how does this not just give you a simplicity prior?
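One way to make the question concrete (my own rough formalization, not from the comment, and assuming explanation length roughly tracks description length): the proposed training loss is

$$\mathcal{L}_{\text{expl}}(M) \;=\; \min_{e\,:\,e\text{ explains }M\text{'s behaviour}} |e|,$$

and if the shortest faithful explanation of a model's behaviour is about as long as the shortest description of that behaviour, then penalizing $\mathcal{L}_{\text{expl}}$ is just penalizing description length, i.e. a simplicity prior; the question is what the restriction to interpretability-style explanations adds beyond that.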

Musings on the Speed Prior

This is not true in the circuit-depth complexity model. Remember that an arbitrary lookup table has O(log n) circuit depth. If the function I'm trying to memorize is f(x) = (x & 1), the fastest circuit has O(1) depth, whereas a lookup table has O(log n) depth.
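As a toy illustration of that depth comparison (my own sketch, not from the comment; modeling the lookup table as a balanced tree of 2-way selects is an assumption): a lookup over n memorized entries needs about log2(n) levels of selects, while f(x) = x & 1 is a single gate no matter how many entries the table would have.

```python
import math

# Toy depth comparison: a lookup table realized as a balanced tree of 2-way
# selects needs one level per address bit, so its depth grows like log2(#entries),
# while reading the low bit of the input is constant depth.

def lookup_table_depth(num_entries: int) -> int:
    # ceil(log2(num_entries)) levels of 2-way selects to pick one entry
    return math.ceil(math.log2(num_entries))

def low_bit_circuit_depth() -> int:
    # f(x) = x & 1 just forwards the lowest input bit: one gate
    return 1

for entries in (2**10, 2**20, 2**30):
    print(entries, lookup_table_depth(entries), low_bit_circuit_depth())
# depths: 10, 20, 30 for the table vs. a constant 1 for the direct circuit
```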

Certainly, I'm assuming that the intended function is not in O(log n) circuit depth, though I think that's a very mild assumption for any realistic task.


I think the prior you're suggesting is basically a circuit size prior. How do you think it differs from that?

Challenges with Breaking into MIRI-Style Research

One of my hopes with the SERI MATS program is that it can help fill this gap by providing a good pipeline for people interested in doing theoretical AI safety research (be that me-style, MIRI-style, Paul-style, etc.). We're not accepting public applications right now, but the hope is definitely to scale up to the point where we can run many of these every year and accept public applications.
