Is the claim here that the 2^200 "persuasive ideas" would actually pass the scrutiny of top human researchers (for example, Paul Christiano studies one of them for a week and concludes that it is probably a full solution)? Or do you just mean that they would look promising in a shorter evaluation done for training purposes?
Oh, this is definitely not what I meant.
"Betting odds" == Your actual belief after factoring in other people's opinions
"Inside view" == What your models predict, before factoring in other opinions or the possibility of being completely wrong
Checked; the answer is no: https://www.lesswrong.com/posts/Lq6jo5j9ty4sezT7r/teaser-hard-coding-transformer-models?commentId=ET24eiKK6FSJNef7G
Nice! Do you know if the author of that post was involved in RASP?
What probability do you assign to the proposition "Prosaic alignment will fail"?
Same question for:
"More than 50% of the prosaic alignment work done by the top 7 researchers is nearly useless"
Is a metal bar an optimizer? Looking at the temperature distribution, there is a clear set of target states (states of uniform temperature) with a much larger basin of attraction (all temperature distributions that don't vaporize the bar).
I suppose we could consider the second law of thermodynamics to be the true optimizer in this case. The consequence is that any* isolated physical system is trivially an optimizing system towards higher entropy.
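To make the metal bar example concrete, here is a minimal sketch of it as a toy simulation. It assumes a 1D bar split into cells, perfectly insulated ends, and a simple explicit finite-difference update; the names `diffuse` and `spread` and the parameter `alpha` are my own illustrative choices, not anything from the post. Very different starting temperature profiles, and a profile that gets perturbed partway through, all end up in the target set of uniform-temperature states.

```python
def diffuse(temps, alpha=0.25, steps=5000):
    """Explicit finite-difference heat diffusion on an insulated 1D bar.

    alpha is a hypothetical diffusion coefficient per step; this update
    rule is only stable for alpha <= 0.5.
    """
    t = list(temps)
    n = len(t)
    for _ in range(steps):
        new = t[:]
        for i in range(n):
            # Insulated ends: a boundary cell sees its own temperature outside.
            left = t[i - 1] if i > 0 else t[i]
            right = t[i + 1] if i < n - 1 else t[i]
            new[i] = t[i] + alpha * (left - 2 * t[i] + right)
        t = new
    return t


def spread(temps):
    """Distance from the target set: zero exactly when the bar is uniform."""
    return max(temps) - min(temps)


# Two very different initial states in the basin of attraction.
hot_end = [100.0] + [0.0] * 19
zigzag = [50.0 if i % 2 == 0 else 10.0 for i in range(20)]

final_hot = diffuse(hot_end)
final_zigzag = diffuse(zigzag)

# Perturb the bar midway (dump extra heat into one cell), then let it run on.
midway = diffuse(hot_end, steps=500)
midway[10] += 40.0
final_perturbed = diffuse(midway)

for name, state in [("hot end", final_hot),
                    ("zigzag", final_zigzag),
                    ("perturbed", final_perturbed)]:
    print(name, "spread:", spread(state), "mean:", sum(state) / len(state))
```

Each run ends with a spread that is essentially zero, so by the post's criterion the bar does "optimize" for uniform temperature and recovers from perturbations. Note that the perturbed run settles at a different uniform temperature than the unperturbed one, which fits the target being a set of states rather than a single state.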
In general, it seems like this optimization criterion is very easy to satisfy if we don't specify what exactly we care about as a meaningful aspect of the system. Even the bottle cap 'optimizes' for trivial things like maintaining its shape (against the perturbation of elastic deformation).
Do you think this will become a problem when using this definition for AI? For example, we might find that a particular program incidentally tends to 'optimize' certain simple measures such as the average magnitude of network weights, or some other functions of weights, loss, policy, etc. to a set point/range. We may then find slightly more complex things being optimized that look like sub-goals (which could in a certain context be unwanted or dangerous). How would we know where to draw the line? It seems like the definition would classify lots of things as optimization, and it would be up to us to decide which kinds are interesting or concerning and which ones are as trivial as the bottle cap maintaining its shape.
That being said, I really like this definition. I just think it should be extended to classify how interesting a given optimization is. An AI agent that competently pursues complex goals is a much more interesting optimizer than a metal bar, even though the bar seems more robust (deleting a tiny piece of metal won't stop it from conducting; deleting a tiny piece of the AI's computer could totally disable it).
Also a nitpick on the section about whether the universe is an optimizing system:
I don't think it is correct to say that the target space is almost as big as the basin of attraction. Either:
I would say that the first lens is almost always better than the second, since macro-states are what we actually care about and how we naturally divide the configuration space of a system.
Finally, just want to say this is an amazing post! I love the style as well as the content. The diagrams make it really easy to get an intuitive picture.
*Unsure about the existence of exceptions (can an isolated system be contrived that fails to reach the global max for entropy?)