Gurkenglas — AI Alignment Forum

I operate by Crocker's rules.

I try to not make people regret telling me things. So in particular:
- I expect to be safe to ask if your post would give AI labs dangerous ideas.
- If you worry I'll produce such posts, I'll try to keep your...

Using GPT-N to Solve Interpretability of Neural Networks: A Research Agenda

by Logan Riggs and Gurkenglas

Tl;dr: We are attempting to make neural networks (NN) modular and have GPT-N interpret each module for us, in order to catch mesa-alignment and inner-alignment failures. Completed Project: Train a neural net with an added loss term that enforces the sort of modularity that we see in well-designed software projects. To...

Sep 3, 2020 • 68

Mapping Out Alignment

by Logan Riggs, adamShimi, Gurkenglas, AlexMennen, and Gyrodiot

This week, in the key alignment group, we answered two questions, 5-minute timer style: 1. Map out all of alignment (25 minutes). 2. Create an image/table representing alignment (10 minutes). You are free to stop here and actually try to answer the questions yourself. Here is a link for a...

Aug 15, 2020 • 43