ah, I see! It's an incentive problem! So I guess your funding needs to be conditional on you producing legible outputs.
This rubs me the wrong way. Of course, you can make anyone do X, if you make their funding conditional on X. But whether you should do that, that depends on how sure you are that X is more valuable than whatever is the alternative.
There are already thousands of people out there whose funding is conditional on them producing legible outputs. Why is that not enough? What will change if we increase that number by a dozen?
I wonder if it would make sense to make this half-open, in the sense that you would publish on LW links to the study materials, and maybe also some of the results. So that people who didn't participate have a better idea.
Makes sense, with the proviso that this is sometimes true only statistically. Like, the AI may choose to write an output which has a 70% chance to hurt you and a 30% chance to (equally) help you, if that is its best option.
If you assume that the AI is smarter than you, and has a good model of you, you should not read the output. But if you accidentally read it, and luckily you react in the right (for you) way, that is a possible result, too. You just cannot and should not rely on being so lucky.