What kind of specification do you have in mind? Is it something like a set of guidelines for the humans providing feedback, explaining how to evaluate responses in an ideologically neutral way?
The reason I said "precise specification" is that if your guidelines are ambiguous, then you're implicitly optimizing something like, "what labelers prefer on average, given the ambiguity", but doing so in a less data-efficient way than if you had specified this target more precisely.
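One rough way to make this precise (my framing here, not anything established): if each labeler $l$ resolves the ambiguous guidelines into their own implicit reward function $r_l$, then fitting pooled preference data targets something like
\[
\bar{r}(x) \;\approx\; \mathbb{E}_{l}\!\left[r_l(x)\right],
\]
i.e. the ambiguity gets averaged over rather than resolved, and you pay for that in the amount of data needed to estimate the average.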
Suppose we wanted the AI to be ideologically neutral and free from human biases, just telling the objective truth to the extent possible. Do you think achieving something like that would be possible in the longer term, and if so through what kinds of techniques?
I think that should be possible with techniques like reinforcement learning from human feedback, for a given precise specification of "ideologically neutral". (You'll of course have a hard time convincing everyone that your specification is itself ideologically neutral, but projects like Wikipedia give me hope that we can achieve a reasonable amount of consensus.) There are still a number of challenging obstacles, including being able to correctly evaluate responses to difficult questions, collecting enough data while maintaining quality, and covering unusual or adversarially-selected edge cases.
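To make the RLHF ingredient concrete, here is a minimal sketch of the pairwise-preference reward-modelling loss such techniques typically rest on. This is an illustrative toy under my own assumptions, not anyone's actual pipeline: the tiny model, the random "embeddings", and the names are placeholders.

```python
# Minimal sketch of the reward-modelling step that RLHF-style techniques
# build on, assuming a pairwise-preference (Bradley-Terry) setup. Everything
# here is a placeholder for illustration -- not any particular system's code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    """Toy reward model: maps a fixed-size (prompt, response) embedding to a scalar score."""
    def __init__(self, dim: int = 64):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(dim, dim), nn.Tanh(), nn.Linear(dim, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.score(x).squeeze(-1)

def preference_loss(model: RewardModel, chosen: torch.Tensor, rejected: torch.Tensor) -> torch.Tensor:
    # Labelers, following some written specification (e.g. of "ideologically
    # neutral"), pick which of two responses is better; this loss pushes the
    # chosen response's reward above the rejected one's.
    return -F.logsigmoid(model(chosen) - model(rejected)).mean()

# Stand-in batch of pre-computed embeddings in place of real labelled comparisons.
model = RewardModel()
chosen, rejected = torch.randn(8, 64), torch.randn(8, 64)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

optimizer.zero_grad()
loss = preference_loss(model, chosen, rejected)
loss.backward()
optimizer.step()
print(f"preference loss: {loss.item():.3f}")
```

The learned reward would then be used to fine-tune the policy, which is where the precision of the specification handed to labelers really bites: the reward model can only be as unambiguous as the instructions behind the comparisons.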
Do you have any speculations on how/why this "helpful prompt" reduces false answers? [... It's not] instantiating a coherent simulation of a professor who is trying to be very diligent
I do think it's reasonable to describe the model as trying to simulate the professor, albeit with very low fidelity, while also trying to imitate other scenarios in which the prompt might appear (such as parodies). The model has a very poor understanding of what the professor would say, so it probably often falls back on what it thinks would typically appear in response to the question.
Longer term, when giving a prompt like this [...]
I hope and expect that longer term we'll tend to use much more flexible and robust alignment techniques than prompt engineering, such that we have direct control over things like the ideological bias of the AI. (What that bias should be is a separate discussion.) That said, I think that correlations in the pre-training data (such as between style and ideology) are likely to persist by default, and it will be challenging to specify precise enough objectives to eliminate most of the unwanted correlations.
It's great to see these examples spelled out with clear and careful experiments. There's no doubt that the CoinRun agent is best described as trying to get to the end of the level, not the coin.
Some in-depth comments on the interpretability experiments:
This is a follow-up note to a nice paper by Markus Mueller on the possibility of a machine-invariant notion of Kolmogorov complexity, available here: http://www.sciencedirect.com/science/article/pii/S0304397509006550
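As background (the standard fact the title is playing off, not a claim about the note's contents): the classical invariance theorem says that for any two universal machines $U$ and $V$ there is a constant $c_{U,V}$, independent of $x$, with
\[
\lvert K_U(x) - K_V(x) \rvert \;\le\; c_{U,V} \quad \text{for all strings } x,
\]
so Kolmogorov complexity is only defined up to an additive machine-dependent constant; a genuinely machine-invariant notion would aim to remove this residual dependence on the choice of reference machine.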