AI ALIGNMENT FORUM
All of Collin's Comments + Replies
How "Discovering Latent Knowledge in Language Models Without Supervision" Fits Into a Broader Alignment Scheme
There were a number of iterations with major tweaks. It went something like:
1. I spent a while thinking about the problem conceptually, and developed a pretty strong intuition that something like this should be possible.
2. I tried to show it experimentally. There were no signs of life for a while (it turns out you need to get a bunch of details right to see any real signal -- a regime that I think is likely my comparative advantage) but I eventually got it to sometimes work using a PCA-based method. I think it took some work to make that more reliable, whi
Just what I wanted :D