Part 7 of 12 in the Engineer’s Interpretability Sequence.
Thanks to Neel Nanda, some of whose very nicely-written code I used (from here). And thanks to both Chris Olah and Neel Nanda for briefly discussing this challenge with me.
MI = “mechanistic interpretability”
Given a network, recover its labeling function's program.
In the last post, I argued that existing works in MI focus on solving problems that are too easy. Here, I am posing a challenge for mechanists that is still a toy problem but one that is quite a bit less convenient than studying a simple model or circuit implementing a trivial, known task. To the best of my knowledge:
Unlike in prior work on MI from the AI safety interpretability community, beating this challenge would be the first example of mechanistically explaining a network’s solution to a task that was not cherrypicked by the researcher(s) doing the interpreting.
Gaining a mechanistic understanding of the models in this challenge may be difficult, but it will probably be much less difficult than mechanistically interpreting highly intelligent systems in high stakes settings in the real world. So if an approach can’t solve the type of challenge posed here, it may not be very promising for doing much heavy lifting with AI safety work.
This post comes with a GitHub repository. Check it out here. The challenge is actually two challenges in one, and the basic idea is similar to some ideas presented in Lindner et al. (2023).
Challenge 1, MNIST CNN
I made up a nonlinear labeling function that labels approximately half of all MNIST images as 0’s and the other half as 1’s. Then I trained a small CNN on these labels, and it got 96% testing accuracy. The challenge is to use MI tools on the network to recover that labeling function.
Hint 1: The labels are binary.
Hint 2: The network gets 95.58% accuracy on the test set.
Hint 3: The labeling function can be described in words in one sentence.
Hint 4: This image may be helpful.
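For concreteness, the setup can be sketched as below. The `secret_label` function here (thresholding total pixel intensity) is purely a hypothetical stand-in, not the real labeling function, and random arrays stand in for MNIST images; the point is only to show the shape of the task:

```python
import numpy as np

def secret_label(img):
    """HYPOTHETICAL stand-in for the real (undisclosed) labeling function:
    label an image 1 if its total pixel intensity exceeds a threshold.
    The actual challenge function is different and must be recovered."""
    return int(img.sum() > 0.5 * img.size)

# Stand-in for MNIST: 1000 random 28x28 grayscale images in [0, 1].
rng = np.random.default_rng(0)
images = rng.random((1000, 28, 28))

# Relabel every image with the secret function; a small CNN is then
# trained on (images, labels) instead of the original digit classes.
labels = np.array([secret_label(img) for img in images])

# The real labeling function splits MNIST roughly in half.
print(labels.mean())  # fraction labeled 1, roughly 0.5 here
```

The challenge is then to recover `secret_label` by inspecting the trained CNN's weights and activations, not by probing the data directly.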
Challenge 2, Transformer
I made up a labeling function that takes in two integers from 0 to 113 and outputs either a 0 or 1. Then, using a lot of code from Neel Nanda’s grokking work, I trained a 1-layer transformer on half of the data. It then got 97% accuracy on the test half. As before, the challenge is to use MI tools to recover the labeling function.
Hint 1: The labels are binary.
Hint 2: The network is trained on 50% of examples and gets 97.27% accuracy on the test half.
Hint 3: Here are the ground truth and learned labels. Notice how the mistakes the network makes are all near curvy parts of the decision boundary...
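The data for this challenge can be sketched as follows. `secret_label` is again a hypothetical placeholder (a simple parity-style rule), not the real function — the real one produces the curvy decision boundary shown in the hint:

```python
import itertools
import random

def secret_label(a, b):
    """HYPOTHETICAL placeholder for the undisclosed labeling function.
    (This simple parity rule is NOT the real one.)"""
    return (a + b) % 2

# All ordered pairs of integers from 0 through 113.
pairs = list(itertools.product(range(114), repeat=2))
data = [((a, b), secret_label(a, b)) for a, b in pairs]

# Train on a random half; the 1-layer transformer is tested on the rest.
random.seed(0)
random.shuffle(data)
split = len(data) // 2
train, test = data[:split], data[split:]
print(len(train), len(test))  # 6498 6498
```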
If you are the first person to send me the labeling function's program and a justified mechanistic explanation for either challenge (e.g. in the form of a colab notebook), I will sing your praises on my Twitter, and I would be happy to help you write a post about how you solved a problem I thought would be very difficult. Neel Nanda and I are also offering a cash prize. (Thanks to Neel for offering to contribute to the pool!) Neel will donate $250, and I will donate $500 to a high-impact charity of choice for the first person to solve each challenge. That makes the total donation prize pool $1,500.
For this challenge, I intentionally designed the labeling functions to not be overly simple. But I will not be too surprised if someone reverse-engineers them with MI tools, and if so, I will be extremely interested in how.
Neither of the models perfectly labels the validation set. One may object that this makes the problem unfairly difficult: if the network has not converged on the same behavior as the actual labeling function, how is one supposed to find that function inside the model? But this is kind of the point. Real models that real engineers have to work with don’t tend to conveniently grok onto a simple, elegant, programmatic solution to a problem.
- Do you agree or disagree that if mechanistic interpretability tools are unable to solve problems like these, they are unlikely to be very useful for safety?
- Are there reasons to expect that systems that are larger or trained to be more intrinsically interpretable may be substantially easier to interpret than small models like these?
Edit: After playing around with the models, it seems the transformer only gets 99.7% train accuracy and 97.5% test accuracy!
The MNIST CNN was trained only on the 50k training examples.
I did not guarantee that the models had perfect train accuracy. I don't believe they did.
I think that any interpretability tools are allowed. Saliency maps are fine. But to 'win,' a submission needs to come with a mechanistic explanation and sufficient evidence for it. It is possible to beat this challenge by using non-mechanistic techniques to figure out the labeling function and then using that knowledge to find the mechanisms by which the networks classify the data.
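To illustrate the kind of non-mechanistic tool that is allowed, here is a minimal occlusion-style saliency sketch. The `model` here is a hypothetical stand-in (a plain function on numpy arrays), not one of the challenge models; a real submission would apply something like this to the trained CNN and then still need to tie the findings to internal mechanisms:

```python
import numpy as np

def model(img):
    """HYPOTHETICAL stand-in for a trained classifier's scalar logit."""
    return float(img[:14].sum() - img[14:].sum())

def occlusion_saliency(img, patch=4):
    """Score each patch by how much zeroing it changes the model output."""
    base = model(img)
    sal = np.zeros_like(img)
    for i in range(0, img.shape[0], patch):
        for j in range(0, img.shape[1], patch):
            occluded = img.copy()
            occluded[i:i + patch, j:j + patch] = 0.0
            sal[i:i + patch, j:j + patch] = abs(model(occluded) - base)
    return sal

rng = np.random.default_rng(0)
img = rng.random((28, 28))
sal = occlusion_saliency(img)
# Regions the (toy) model is sensitive to get higher saliency scores.
print(sal.shape)  # (28, 28)
```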
At the end of the day, I (and possibly Neel) will have the final say in things.
Does the “ground truth” show the correct label function on 100% of the training and test data? If so, what’s the relevance of the transformer which imperfectly implements the label function?
Yes, it does show the ground truth.
The goal of the challenge is not to find the labels, but to find the program that explains them using MI tools. In the post, when I say labeling "function", I really mean labeling "program" in this case.