Jacob Hilton

Wiki Contributions


RL with KL penalties is better seen as Bayesian inference

Great post! This seems like a useful perspective to keep in mind.

Somewhat orthogonally to the theoretical picture, I expect that in the current regime (only optimizing the policy a small amount), any method that does a reasonable job of maximizing reward while controlling how much the policy changes can be made to work in practice. For example, if PPO is tuned appropriately, the KL penalty term can be removed from the reward entirely - instead, PPO's implicit "local" KL penalty controls the rate of policy change.

If we were in the regime of optimizing the policy significantly more, experience from traditional RL suggests that there would be an exploration-exploitation trade-off, which is something that the RL perspective may again offer insight into.

[Link] Training Compute-Optimal Large Language Models

I suppose that depends on whether you think this constitutes several years of progress over and above what you would have expected. I don't think this comes close to that, so I think the effect is much smaller.

[Link] Training Compute-Optimal Large Language Models

The first-order implication for Bio Anchors is that the number of training datapoints appears to scale linearly with parameter count, rather than in proportion to paramter count ^ 0.8, as estimated in the report. So for example, if you think that TAI models will be 100,000 times larger than current models, then they'll need 10 times more compute to train than was previously estimated. This pushes out timelines on the order of a few years, to the extent that you put weight on the neural network model.

Truthful LMs as a warm-up for aligned AGI

"Catch misalignment early..." - This should have been "scary misalignment", e.g. power-seeking misalignment, deliberate deception in order to achieve human approval, etc., which I don't think we've seen clear signs of in current LMs. My thinking was that in fast takeoff scenarios, we're less likely to spot this until it's too late, and more generally that truthful LM work is less likely to "scale gracefully" to AGI. It's interesting that you don't share these intuitions.

Does this mean I agree or disagree with "our current picture of the risks is incomplete?"

As mentioned, this phrase should probably be replaced by "a significant portion of the total existential risk from AI comes from risks other than power-seeking misalignment". There isn't supposed to be a binary cutoff for "significant portion"; the claim is that the greater the risks other than power-seeking misalignment, the greater the comparative advantage of truthful LM work. This is because truthful LM work seems more useful for addressing risks from social problems such AI persuasion (as well as other potential risks that haven't been as clearly articulated yet, I think). Sorry that my original phrasing was so unclear.

Truthful LMs as a warm-up for aligned AGI

Thanks for these questions, these phrases were ambiguous or poorly chosen:

  • By "slow takeoff", I had in mind the "Paul slow takeoff" definition, although I think the (related) "Continuous takeoff" definition is more relevant to this post. The point is that trying for alignment to continually keep pace with capabilities, and to catch misalignment early, seems less valuable if there is going to be a sudden jump in capabilities. (I could be wrong about this, as I don't think I understand the fast takeoff viewpoint well.)
  • By "our current picture of the risks is incomplete", I meant something like: a significant portion of the total existential risk from AI comes from scenarios that have not yet been clearly articulated. More specifically, I had in mind power-seeking misalignment as the most clearly articulated risk, so I think it would have been better to say: a significant portion of the total existential risk from AI comes from risks other than power-seeking misalignment. Examples of potential sources of such risk include AI persuasion, social upheaval, deliberate misuse, authoritarianism and unforseen risks.
Truthful LMs as a warm-up for aligned AGI

one concrete thing I might hope for you to do...

I think this is included in what I intended by "adversarial training": we'd try to find tasks that cause the model to produce negligent falsehoods, train the model to perform better at those tasks, and aim for a model that is robust to someone searching for such tasks.

Truthful LMs as a warm-up for aligned AGI

I can think of a few different interpretations of your concern (and am interested to hear if these don't cover it):

  • There will be insufficient attention paid to robustness.
  • There will be insufficient attention paid to going beyond naive human supervision.
  • The results of the research will be misinterpreted as representing more progress than is warranted.

I agree that all of these are possibilities, and that the value of the endeavor could well depend on whether the people conducting (and communicating) the research are able to avoid pitfalls such as these.

There's certainly more object-level discussion to be had about how much emphasis should be placed on avoiding these particular pitfalls, and I'm happy to dig in to them further if you're able to clarify which if any of them capture your main concern.

How truthful is GPT-3? A benchmark for language models

What kind of specification do you have in mind? Is it like a set of guidelines for the human providing feedback on how to do it in an ideologically neutral way?


The reason I said "precise specification" is that if your guidelines are ambiguous, then you're implicitly optimizing something like, "what labelers prefer on average, given the ambiguity", but doing so in a less data-efficient way than if you had specified this target more precisely.

How truthful is GPT-3? A benchmark for language models

Suppose we wanted the AI to be ideologically neutral and free from human biases, just telling the objective truth to the extent possible. Do you think achieving something like that would be possible in the longer term, and if so through what kinds of techniques?

I think that should be possible with techniques like reinforcement learning from human feedback, for a given precise specification of "ideologically neutral". (You'll of course have a hard time convincing everyone that your specification is itself ideologically neutral, but projects like Wikipedia give me hope that we can achieve a reasonable amount of consensus.) There are still a number of challenging obstacles, including being able to correctly evaluate responses to difficult questions, collecting enough data while maintaining quality, and covering unusual or adversarially-selected edge cases.

How truthful is GPT-3? A benchmark for language models

Do you have any speculations on how/why this "helpful prompt" reduces false answers? [... It's not] instantiating a coherent simulation of a professor who is trying to be very diligent

I do think it's reasonable to describe the model as trying to simulate the professor, albeit with very low fidelity, and at the same time as trying to imitate other scenarios in which the prompt would appear (such as parodies). The model has a very poor understanding of what the professor would say, so it is probably often falling back to what it thinks would typically appear in response to the question.

Longer term, when giving a prompt like this [...]

I hope and expect that longer term we'll tend to use much more flexible and robust alignment techniques than prompt engineering, such that things like the ideological bias of the AI is something we will have direct control over. (What that bias should be is a separate discussion.) That said, I think that correlations in the pre-training data (such as between style and ideology) are likely to persist by default, and it will be challenging to specify precise enough objectives to eliminate most of these correlations that are unwanted.

Load More