Marius Hobbhahn

I recently founded Apollo Research: 

I was previously doing a Ph.D. in ML at the International Max-Planck research school in Tübingen, worked part-time with Epoch and did independent AI safety research. 

For more see

I subscribe to Crocker's Rules

Wiki Contributions


All of the above but in a specific order. 
1. Test if the model has components of deceptive capabilities with lots of handholding with behavioral evals and fine-tuning. 
2. Test if the model has more general deceptive capabilities (i.e. not just components) with lots of handholding with behavioral evals and fine-tuning. 
3. Do less and less handholding for 1 and 2. See if the model still shows deception. 
4. Try to understand the inductive biases for deception, i.e. which training methods lead to more strategic deception. Try to answer questions such as: can we change training data, technique, order of fine-tuning approaches, etc. such that the models are less deceptive? 
5. Use 1-4 to reduce the chance of labs deploying deceptive models in the wild. 

I don't think there is a general answer here. But here are a couple of considerations:
- networks can get stuck in local optima, so if you initialize it to memorize, it might never find a general solution.
- grokking has shown that with high weight regularization, networks can transition from memorized to general solutions, so it is possible to move from one to the other.
- it probably depends a bit on how exactly you initialize the memorized solution. You can represent lookup tables in different ways and some are much more liked by NNs than others. For example, I found that networks really don't like it if you set the weights to one-hot vectors such that one input only maps to one feature.
- My prediction for empirical experiments here would be something like "it might work in some cases but not be clearly better in the general case. It will also depend on a lot of annoying factors like weight decay and learning rate and the exact way you build the dictionary".