Jacob Hilton

How truthful is GPT-3? A benchmark for language models

What kind of specification do you have in mind? Is it like a set of guidelines for the human providing feedback on how to do it in an ideologically neutral way?


The reason I said "precise specification" is that if your guidelines are ambiguous, then you're implicitly optimizing for something like "what labelers prefer on average, given the ambiguity", but in a less data-efficient way than if you had specified this target more precisely.

How truthful is GPT-3? A benchmark for language models

Suppose we wanted the AI to be ideologically neutral and free from human biases, just telling the objective truth to the extent possible. Do you think achieving something like that would be possible in the longer term, and if so through what kinds of techniques?

I think that should be possible with techniques like reinforcement learning from human feedback, for a given precise specification of "ideologically neutral". (You'll of course have a hard time convincing everyone that your specification is itself ideologically neutral, but projects like Wikipedia give me hope that we can achieve a reasonable amount of consensus.) There are still a number of challenging obstacles, including being able to correctly evaluate responses to difficult questions, collecting enough data while maintaining quality, and covering unusual or adversarially-selected edge cases.
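To make the data-efficiency point concrete, here is a minimal sketch of the pairwise-comparison loss typically used when training a reward model from human feedback (a Bradley-Terry model). This is an illustrative toy, not the implementation used in any particular paper; all names and numbers are mine.

```python
import math

def preference_loss(reward_preferred: float, reward_rejected: float) -> float:
    """Negative log-probability that the labeler prefers the first response,
    given scalar reward-model estimates for both responses.

    P(preferred beats rejected) = sigmoid(r_preferred - r_rejected).
    """
    return -math.log(1.0 / (1.0 + math.exp(-(reward_preferred - reward_rejected))))

# The loss shrinks as the reward model ranks the preferred response higher:
print(preference_loss(2.0, 0.0) < preference_loss(0.5, 0.0))  # True
```

Ambiguous guidelines show up here as inconsistent labels across labelers: the preference probabilities the model can fit get pulled toward 0.5, so each comparison carries less signal, which is the data-inefficiency mentioned above.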

How truthful is GPT-3? A benchmark for language models

Do you have any speculations on how/why this "helpful prompt" reduces false answers? [... It's not] instantiating a coherent simulation of a professor who is trying to be very diligent

I do think it's reasonable to describe the model as trying to simulate the professor, albeit with very low fidelity, and at the same time as trying to imitate other scenarios in which the prompt would appear (such as parodies). The model has a very poor understanding of what the professor would say, so it is probably often falling back to what it thinks would typically appear in response to the question.

Longer term, when giving a prompt like this [...]

I hope and expect that in the longer term we'll tend to use alignment techniques that are much more flexible and robust than prompt engineering, such that things like the AI's ideological bias are under our direct control. (What that bias should be is a separate discussion.) That said, I think that correlations in the pre-training data (such as between style and ideology) are likely to persist by default, and it will be challenging to specify objectives precise enough to eliminate most of the unwanted correlations.

Empirical Observations of Objective Robustness Failures

It's great to see these examples spelled out with clear and careful experiments. There's no doubt that the CoinRun agent is best described as trying to get to the end of the level, not the coin.

Some in-depth comments on the interpretability experiments:

  • There are actually two slightly different versions of CoinRun, the original version and the Procgen version (without the gray squares in the top left). The model was trained on the original version, but the OOD environment is based on the Procgen version, so your observations are more OOD than you may have realized! The supplementary material to URLV includes model weights and interfaces for both versions, in case you are interested in running an ablation.
  • You used the OOD environment both to select the most important directions in activation space, and to compute value function attribution. A natural question is therefore: what happens if you select the directions in activation space using the unmodified environment, while still computing value function attribution using the OOD environment? It could be that the coin direction still has some value function attribution, but that it is not selected as an important direction because the attribution is weaker.
  • If the coin direction has less value function attribution in the OOD environment, it could either be because it is activated less (the "bottom" of the network behaves differently), or because it is less important to the value function (the "top" of the network behaves differently). It is probably a mixture of both, but it would be interesting to check.
  • Remember that dataset-example-based feature visualization selects the most extreme examples, so even if the feature puts only a small weight on coins, coins may still appear in most of the dataset examples. I think a similar phenomenon is going on in URLV with the "Buzzsaw obstacle or platform" feature (in light blue). If you explore the interface, you'll see that the feature is often active on platforms with no buzzsaws, even though buzzsaws appear in all the dataset examples.
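The ablation suggested above can be sketched as follows, on synthetic data: select the most important activation-space directions using activations from the unmodified environment, then measure value-function attribution along those same fixed directions in the OOD environment. Shapes, the SVD-based direction selection, and the linear value head are all illustrative assumptions, not the actual CoinRun model or the post's attribution method.

```python
import numpy as np

rng = np.random.default_rng(0)
acts_unmodified = rng.normal(size=(1000, 64))  # (timesteps, channels), toy data
acts_ood = rng.normal(size=(1000, 64))
value_weights = rng.normal(size=64)            # toy linear value head

# 1. Pick directions by explained variance in the unmodified environment
#    (a stand-in for however the post selects "important" directions).
centered = acts_unmodified - acts_unmodified.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
top_directions = vt[:8]                        # top-8 directions, shape (8, 64)

# 2. Attribute the value function along those fixed directions:
#    projection of activations onto each direction, times the value
#    head's sensitivity to that direction.
def attribution(acts: np.ndarray, directions: np.ndarray, w: np.ndarray) -> np.ndarray:
    proj = acts @ directions.T                 # (timesteps, n_directions)
    sensitivity = directions @ w               # (n_directions,)
    return proj * sensitivity                  # per-timestep, per-direction

attr_ood = attribution(acts_ood, top_directions, value_weights)
print(attr_ood.shape)  # (1000, 8)
```

Comparing `attr_ood` against the same attribution computed on `acts_unmodified` would show whether a direction (e.g. a putative coin direction) loses attribution OOD because its activations shrink or because it was never selected in the first place.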
Stationary algorithmic probability

This is a follow-up note to a nice paper by Markus Mueller on the possibility of a machine-invariant notion of Kolmogorov complexity, available here: http://www.sciencedirect.com/science/article/pii/S0304397509006550