Below is a summary of my (honorable mention) proposal, which tries to induce mind-blindness on the Reporter. I've also posted the full contest entry as a blog post here: https://calvinmccarter.wordpress.com/2022/02/19/mind-blindness-strategy-for-eliciting-latent-knowledge/ . I've not fully analyzed whether it survives previously mentioned counterexamples, or whether it has additional unique vulnerabilities compared to other strategies, so I'd welcome feedback on these questions.
The overall strategy is to avoid training a “human simulator” reporter by regularizing its internal state to have mind-blindness. One could imagine training a “Human Simulator” that takes as input the “what’s going on” state, plus a question about what a human believes about the world, and is trained to maximize its accuracy at predicting the human’s answer explanation. Meanwhile, the Reporter is trained with dual objectives: in addition to (as before) answering questions about the world correctly, it should be penalized for containing internal state that is useful to the Human Simulator. To do this, an auxiliary “Human Simulator Detector” model is trained to predict the Human Simulator’s activations from the Reporter’s activations. The Reporter is trained adversarially to minimize the Human Simulator Detector’s accuracy. The reporter would be trained not merely to be bad at predicting the explanation — an evil reporter could pretend to be bad at predicting the explanation — but to not contain state that is predictive of “human simulator state”. This is something that the various defeated strategies are missing. A devious reporter can easily hide its full knowledge from its predicted answer, and may even choose weights to look like it does not contain a human simulator. But we argue that when operating on actual input data, a devious reporter will find it more challenging to run a simulation while pretending not to.