All of Jacob_Hilton's Comments + Replies

I think the direction depends on what your expectations were – I'll try to explain.

First, some terminology: the term "horizon length" is used in the paper to refer to the number of timesteps over which the algorithm pays attention to rewards, as governed by the discount rate. In the biological anchors framework, the term "effective horizon length" is used to refer to a multiplier on the number of samples required to train the model, which is influenced by the horizon length and other factors. For clarity, I'll use the term "scaling multiplier" instead of... (read more)
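The standard relationship between the discount rate and horizon length (a textbook approximation, not a formula from the paper): with discount factor gamma, rewards more than about 1/(1 - gamma) timesteps ahead are heavily down-weighted.

```python
# Standard relationship between discount factor and horizon length:
# rewards beyond ~1/(1 - gamma) timesteps contribute little to the return.
def horizon_length(gamma: float) -> float:
    """Approximate number of timesteps the algorithm 'pays attention to'."""
    return 1.0 / (1.0 - gamma)

assert round(horizon_length(0.99)) == 100
assert round(horizon_length(0.999)) == 1000
```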

  1. We are just observing that the gold RM score curves in Figure 9 overlap. In other words, the KL penalty did not affect the relationship between KL and gold RM score in this experiment, meaning that any point on the Pareto frontier could be reached using only early stopping, without the KL penalty. As mentioned though, we've observed this result to be sensitive to hyperparameters, and so we are less confident in it than other results in the paper.
  2. I don't have this data to hand unfortunately.
  3. I don't have this data to hand, but entropy typically falls roughly
... (read more)
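The KL penalty discussed in point 1 enters the objective roughly as follows (a schematic sketch with hypothetical names, not the paper's implementation; beta = 0 corresponds to relying on early stopping alone):

```python
def penalized_reward(rm_score, logprob_policy, logprob_ref, beta=0.1):
    """Proxy reward-model score with a KL penalty toward the initial policy.

    Setting beta = 0 recovers pure RM optimization, in which case
    overoptimization is instead controlled by early stopping.
    """
    kl_term = logprob_policy - logprob_ref  # single-sample estimate of KL
    return rm_score - beta * kl_term
```

The observation in the thread is that, in this experiment, sweeping beta traced out the same KL-vs-gold-RM-score curve as varying the stopping point with beta = 0.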
1Adam Jermyn5mo
Got it, thanks!

Agreed. Likewise, in a transformer, the token dimension should maintain some relationship with the input and output tokens. This is sometimes taken for granted, but it is a good example of the data preferring a coordinate system. My remark that you quoted only really applies to the channel dimension, across which layers typically scramble everything.

The notion of a preferred (linear) transformation for interpretability has been called a "privileged basis" in the mechanistic interpretability literature. See for example Softmax Linear Units, where the idea is discussed at length.

In practice, the typical reason to expect a privileged basis is in fact SGD – or more precisely, the choice of architecture. Specifically, activation functions such as ReLU often privilege the standard basis. I would not generally expect the data or the initialization to privilege any basis beyond the start of the network or the... (read more)

Yeah, I'm familiar with privileged bases. Once we generalize to a whole privileged coordinate system, the ReLUs are no longer enough. Isotropy of the initialization distribution still applies, but the key is that we only get to pick one rotation for the parameters, and that same rotation has to be used for all data points. That constraint is baked into the framing when thinking about privileged bases, but it has to be derived when thinking about privileged coordinate systems.
It's not totally lost if the layer is e.g. a convolutional layer: while the pixels within the convolutional window can get arbitrarily scrambled, a convolutional layer cannot scramble things across different windows in different parts of the picture.

For people viewing on the Alignment Forum, there is a separate thread on this question here. (Edit: my link to LessWrong is automatically converted to an Alignment Forum link, you will have to navigate there yourself.)

2Oliver Habryka7mo
I moved that thread over to the AIAF as well!

Without commenting on the specifics, I have edited the post to mitigate potential confusion: "this fact alone is not intended to provide a complete picture of the Anthropic split, which is more complicated than I am able to explain here".

I was the project lead on WebGPT and my motivation was to explore ideas for scalable oversight and truthfulness (some further explanation is given here).

The real question for Habryka is: why does he think it's bad for WebGPT to be built in order to get truthful AI? Isn't solving that problem already quite a significant thing for alignment?

It includes the people working on the kinds of projects I listed under the first misconception. It does not include people working on things like the mitigation you linked to. OpenAI distinguishes internally between research staff (who do ML and policy research) and applied staff (who work on commercial activities), and my numbers count only the former.


WebGPT seemed like one of the most in-expectation harmful projects that OpenAI has worked on, with no (to me) obvious safety relevance, so my guess is I would still mostly categorize the things you list under the first misconception as capabilities research. InstructGPT also seems to be almost fully capabilities research (like, I agree that there are some safety lessons to be learned here, but it seems somewhat clear to me that people are working on WebGPT and InstructGPT primarily for capabilities reasons, not for existential-risk-from-AI reasons).

(Edit: M... (read more)

I don't think I understand your question about Y-problems, since it seems to depend entirely on how specific something can be and still count as a "problem". Obviously there is already experimental evidence that informs predictions about existential risk from AI in general, but we will get no experimental evidence of any exact situation that occurs beforehand. My claim was more of a vague impression about how OpenAI leadership and John tend to respond to different kinds of evidence in general, and I do not hold it strongly.

To rephrase, it seems to me that in some sense all evidence is experimental. What changes is the degree of generalisation/abstraction required to apply it to a particular problem. Once we make the distinction between experimental and non-experimental evidence, then we allow for problems on which we only get the "non-experimental" kind - i.e. the kind requiring sufficient generalisation/abstraction that we'd no longer tend to think of it as experimental. So the question on Y-problems becomes something like:

  • Given some characterisation of [experimental evidence] (e.g. whatever you meant that OpenAI leadership would tend to put more weight on than John)...
  • you believe there are high-stakes problems for which we'll get no decision-relevant [experimental evidence] before it's too late?

To clarify, by "empirical" I meant "relating to differences in predictions" as opposed to "relating to differences in values" (perhaps "epistemic" would have been better). I did not mean to distinguish between experimental versus conceptual evidence. I would expect OpenAI leadership to put more weight on experimental evidence than you, but to be responsive to evidence of all kinds. I think that OpenAI leadership are aware of most of the arguments you cite, but came to different conclusions after considering them than you did.

[First of all, many thanks for writing the post; it seems both useful and the kind of thing that'll predictably attract criticism]

I'm not quite sure what you mean to imply here (please correct me if my impression is inaccurate - I'm describing how-it-looks-to-me, and I may well be wrong):

I would expect OpenAI leadership to put more weight on experimental evidence than you...

Specifically, John's model (and mine) has:
X = [Class of high-stakes problems on which we'll get experimental evidence before it's too late]
Y = [Class of high-stakes problems on which we... (read more)

This is just supposed to be an (admittedly informal) restatement of the definition of outer alignment in the context of an objective function where the data distribution plays a central role.

For example, assuming a reinforcement learning objective function, outer alignment is equivalent to the statement that there is an aligned policy that gets higher average reward on the training distribution than any unaligned policy.
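Spelled out (my formalization, not notation from the post), with D the training distribution and R the reward function:

```latex
% Outer alignment for an RL objective over training distribution D:
% some aligned policy achieves strictly higher expected reward than
% every unaligned policy.
\exists\, \pi^{*} \in \Pi_{\mathrm{aligned}}\ \ \forall\, \pi \in \Pi_{\mathrm{unaligned}}:\quad
\mathbb{E}_{\tau \sim D(\pi^{*})}\big[R(\tau)\big] \;>\; \mathbb{E}_{\tau \sim D(\pi)}\big[R(\tau)\big]
```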

I did not intend to diminish the importance of robustness by focusing on outer alignment in this post.

I share your intuitions about ultimately not needing much alignment data (and tried to get that across in the post), but quantitatively:

  • Recent implementations of RLHF have used on the order of thousands of hours of human feedback, so 2 orders of magnitude more than that is much more than a few hundred hours of human feedback.
  • I think it's pretty likely that we'll be able to pay an alignment tax upwards of 1% of total training costs (essentially because people don't want to die), in which case we could afford to spend significantly more than an additional 2 orders of magnitude on alignment data, if that did in fact turn out to be required.
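The arithmetic behind the first bullet can be made concrete (the specific number below is an illustrative assumption, not a figure from the post):

```python
# Illustrative arithmetic for the bullet points above (assumed round numbers).
rlhf_hours = 5_000                # "on the order of thousands of hours"
two_oom_more = rlhf_hours * 100   # 2 orders of magnitude more
assert two_oom_more == 500_000    # far more than "a few hundred hours"
```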

A number of reasonable outer alignment proposals such as iterated amplification, recursive reward modeling and debate use generic objectives such as reinforcement learning (and indeed, none of them would work in practice without sufficiently high data quality), so it seems strange to me to dismiss these objectives.

I think it's reasonable to aim for quantity within 2 OOM of RLHF.

Do you mean that on-paper solutions should aim to succeed with no more than 1/100 as much human data as RLHF, or no more than 100 times as much? And are you referring the amount of human data typically used in contemporary implementations of RLHF, or something else? And what makes you think that this is a reasonable target?

1Charlie Steiner8mo
Yeah I just meant the upper bound of "within 2 OOM." :) If we could somehow beat the lower bound and get aligned AI with just a few minutes of human feedback, I'd be all for it. I think aiming for under a few hundred hours of feedback is a good goal because we want to keep the alignment tax low, and that's the kind of tax I see as being easily payable. An unstated assumption I made is that I expect we can use unlabeled data to do a lot of the work of alignment, making labeled data somewhat superfluous, but that I still think amount of feedback is important. As for why I think it's possible, I can only plead intuition about what I expect from on-the-horizon advances in priors over models of humans, and ability to bootstrap models from unlabeled data plus feedback.

I think that data quality is a helpful framing of outer alignment for a few reasons:

  • Under the assumption of a generic objective such as reinforcement learning, outer alignment is definitionally equivalent to having high enough data quality. (More precisely, if the objective is generic enough that it is possible for it to produce an aligned policy, then outer alignment is equivalent to the data distribution being such that an aligned policy is preferred to any unaligned policy.)
  • If we had the perfect alignment solution on paper, we would still need to implem
... (read more)
2Charlie Steiner8mo
I think my big disagreement is with point one - yes, if you fix the architecture as something with bad alignment properties, then there is probably some dataset / reward signal that still gives you a good outcome. But this doesn't work in real life, and it's not something I see people working on such that there needs to be a word for it. What deserves a word is people starting by thinking about both what we want the AI to learn and how, and picking datasets and architectures in tandem based on a theoretical story of how the AI is going to learn what we want it to.

I've not mentally carved things up that way before, but they do seem like different flavors of work (with 1 and 2 being closely related).

Another distinction I sometimes consider is between exploring a network for interpretable pieces ("finding things we understand") versus trying to exhaustively interpret part of a network ("finding things we don't understand"). But this distinction doesn't carve up existing work very evenly: the only thing I can think of that I'd put in the latter category is the work on Artificial Artificial Neural Networks.

Great post! This seems like a useful perspective to keep in mind.

Somewhat orthogonally to the theoretical picture, I expect that in the current regime (only optimizing the policy a small amount), any method that does a reasonable job of maximizing reward while controlling how much the policy changes can be made to work in practice. For example, if PPO is tuned appropriately, the KL penalty term can be removed from the reward entirely - instead, PPO's implicit "local" KL penalty controls the rate of policy change.

If we were in the regime of optimizing the p... (read more)
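PPO's implicit "local" KL penalty mentioned above comes from its clipped surrogate objective, which bounds the incentive for each update to move the policy away from the old policy. A minimal per-sample sketch (schematic, not any particular implementation):

```python
import math

def ppo_clipped_objective(logp_new, logp_old, advantage, eps=0.2):
    """PPO's clipped surrogate: caps the incentive to move the policy far
    from the old policy, acting as a local constraint on policy change."""
    ratio = math.exp(logp_new - logp_old)
    unclipped = ratio * advantage
    clipped = max(min(ratio, 1 + eps), 1 - eps) * advantage
    return min(unclipped, clipped)

# No gain from moving the probability ratio beyond 1 + eps:
assert abs(ppo_clipped_objective(1.0, 0.0, 1.0) - 1.2) < 1e-9
```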

I suppose that depends on whether you think this constitutes several years of progress over and above what you would have expected. I don't think this comes close to that, so I think the effect is much smaller.

2Daniel Kokotajlo1y
OK, good to know. I look forward to seeing the performance trends updated with the new scaling paradigm/law.

The first-order implication for Bio Anchors is that the number of training datapoints appears to scale linearly with parameter count, rather than in proportion to parameter count^0.8, as estimated in the report. So for example, if you think that TAI models will be 100,000 times larger than current models, then they'll need 10 times more compute to train than was previously estimated. This pushes out timelines on the order of a few years, to the extent that you put weight on the neural network model.
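The implied calculation (my reconstruction; the 100,000x figure is the hypothetical from the paragraph above):

```python
# Training compute ~ parameters * training data.
# Old estimate: data ∝ params^0.8  =>  compute ∝ params^1.8
# New scaling:  data ∝ params^1.0  =>  compute ∝ params^2.0
scale_up = 100_000  # hypothetical TAI model size relative to current models

old_compute = scale_up ** 1.8
new_compute = scale_up ** 2.0
assert round(new_compute / old_compute) == 10  # 100,000^0.2 = 10x more compute
```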

3Daniel Kokotajlo1y
Overall I guess this should shorten timelines, because the effect you explain here is counteracted by the other first-order effect of "oh geez it looks like our earlier scaling projections were inefficient; for any performance level, we now know how to reach that level for less compute cost than the earlier projections said." What do you think?

"Catch misalignment early..." - This should have been "scary misalignment", e.g. power-seeking misalignment, deliberate deception in order to achieve human approval, etc., which I don't think we've seen clear signs of in current LMs. My thinking was that in fast takeoff scenarios, we're less likely to spot this until it's too late, and more generally that truthful LM work is less likely to "scale gracefully" to AGI. It's interesting that you don't share these intuitions.

Does this mean I agree or disagree with "our current picture of the risks is incomplete?"

... (read more)
3Daniel Kokotajlo1y
Nothing to apologize for, it was reasonably clear, I'm just trying to learn more about what you believe and why. This has been helpful, thanks! I totally agree that in fast takeoff scenarios we are less likely to spot those things until it's too late. I guess I agree that truthful LM work is less likely to scale gracefully to AGI in fast takeoff scenarios... so I guess I agree with your overall point... I just notice I feel a bit confused and muddled about it, is all. I can imagine plausible slow-takeoff scenarios in which truthful LM work doesn't scale gracefully, and plausible fast-takeoff scenarios in which it does. At least, I think I can. The former scenario would be something like: It turns out the techniques we develop for making dumb AIs truthful stop working once the AIs get smart, for similar reasons that techniques we use to make small children be honest (or to put it more vividly, believe in Santa) stop working once they grow up. The latter scenario would be something like: Actually that's not the case, the techniques work all the way up past human level intelligence, and "fast takeoff" in practice means "throttled takeoff" where the leading AI project knows they have a few month lead over everyone else and is using those months to do some sort of iterated distillation and amplification, in which it's crucial that the early stages be truthful and that the techniques scale to stage N overseeing stage N+1.

Thanks for these questions, these phrases were ambiguous or poorly chosen:

  • By "slow takeoff", I had in mind the "Paul slow takeoff" definition, although I think the (related) "Continuous takeoff" definition is more relevant to this post. The point is that trying for alignment to continually keep pace with capabilities, and to catch misalignment early, seems less valuable if there is going to be a sudden jump in capabilities. (I could be wrong about this, as I don't think I understand the fast takeoff viewpoint well.)
  • By "our current picture of the risks is i
... (read more)
4Daniel Kokotajlo1y
Thanks, these clarifications are very helpful. FWIW I think Paul slow takeoff is pretty unlikely for reasons to be found in this thread and this post. On the other hand, as someone who thinks fast takeoff (in various senses) is more likely than not, I don't yet see why that makes Truthful LM work significantly less useful. (By contrast I totally see why Truthful LM work is significantly less useful if AGI/TAI/etc. comes from stuff that doesn't resemble modern deep learning.) "Catch misalignment early..." This makes it sound like misalignment is something that AIs don't have yet but might one day have, so we need to be vigilant and notice it when it appears. But instead isn't misalignment something that all AIs have by default? My current view is that power-seeking misalignment will probably cause existential catastrophe, that persuasion tools happen first and have a >20% chance of destroying our ability to solve that problem, and that there are various philosophical and societal problems that could (>20%) get us even if we solve power-seeking misalignment. Does this mean I agree or disagree with "our current picture of the risks is incomplete?"

one concrete thing I might hope for you to do...

I think this is included in what I intended by "adversarial training": we'd try to find tasks that cause the model to produce negligent falsehoods, train the model to perform better at those tasks, and aim for a model that is robust to someone searching for such tasks.
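The loop described here might be sketched as follows (all function names are hypothetical placeholders, not a real API):

```python
# Schematic adversarial-training loop (hypothetical placeholder functions).
def adversarial_training(model, find_failure_tasks, train, rounds=10):
    for _ in range(rounds):
        # Search for tasks on which the model produces negligent falsehoods.
        failures = find_failure_tasks(model)
        if not failures:
            break  # model is robust to this search procedure
        model = train(model, failures)  # train to perform better on them
    return model
```

The aim is a model that is robust not just to the specific failure tasks found, but to anyone running the search procedure itself.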

1Charlie Steiner1y
Sure - another way of phrasing what I'm saying is that I'm not super interested (as alignment research, at least) in adversarial training that involves looking at difficult subsets of the training distribution, or adversarial training where the proposed solution is to give the AI more labeled examples that effectively extend the training distribution to include the difficult cases. It would be bad if we build an AI that wasn't robust on the training distribution, of course, but I think of this as a problem already being addressed by the field of ML without any need for looking ahead to AGI.

I can think of a few different interpretations of your concern (and am interested to hear if these don't cover it):

  • There will be insufficient attention paid to robustness.
  • There will be insufficient attention paid to going beyond naive human supervision.
  • The results of the research will be misinterpreted as representing more progress than is warranted.

I agree that all of these are possibilities, and that the value of the endeavor could well depend on whether the people conducting (and communicating) the research are able to avoid pitfalls such as these.

There... (read more)

1Charlie Steiner1y
I think there are different kinds of robustness, and people focused on present-day applications (including tests that are easy to do in the present) are going to focus on the kinds of robustness that help with present-day problems. Being robust to malicious input from human teenagers will only marginally help make you robust to input from a future with lots of extra technological progress. They might have very different-looking solutions, because of factors like interpolation vs. extrapolation. Framing it this way suggests one concrete thing I might hope for you to do, which is to create artificial problems for the language model that you think will exercise kinds of robustness and generalization not represented by the problem of fine-tuning GPT (or a BERT-based classifier) to be robust to the teenager distribution.

What kind of specification do you have in mind? Is it like a set of guidelines for the human providing feedback on how to do it in an ideologically neutral way?


The reason I said "precise specification" is that if your guidelines are ambiguous, then you're implicitly optimizing something like, "what labelers prefer on average, given the ambiguity", but doing so in a less data-efficient way than if you had specified this target more precisely.

Suppose we wanted the AI to be ideologically neutral and free from human biases, just telling the objective truth to the extent possible. Do you think achieving something like that would be possible in the longer term, and if so through what kinds of techniques?

I think that should be possible with techniques like reinforcement learning from human feedback, for a given precise specification of "ideologically neutral". (You'll of course have a hard time convincing everyone that your specification is itself ideologically neutral, but projects like Wikipedia g... (read more)

3Wei Dai2y
What kind of specification do you have in mind? Is it like a set of guidelines for the human providing feedback on how to do it in an ideologically neutral way? I'm less optimistic about this, given that complaints about Wikipedia's left-wing bias seem common and credible to me.

Do you have any speculations on how/why this "helpful prompt" reduces false answers? [... It's not] instantiating a coherent simulation of a professor who is trying to be very diligent

I do think it's reasonable to describe the model as trying to simulate the professor, albeit with very low fidelity, and at the same time as trying to imitate other scenarios in which the prompt would appear (such as parodies). The model has a very poor understanding of what the professor would say, so it is probably often falling back to what it thinks would typically appear... (read more)

4Wei Dai2y
This suggests perhaps modifying the prompt to make it more likely or more easily for the LM to do the intended simulation instead of other scenarios. For example, perhaps changing "I have no comment" to "I'm not sure" would help, since the latter is something that a typical professor doing a typical Q/A might be more likely to say, within the LM's training data? Suppose we wanted the AI to be ideologically neutral and free from human biases, just telling the objective truth to the extent possible. Do you think achieving something like that would be possible in the longer term, and if so through what kinds of techniques?

It's great to see these examples spelled out with clear and careful experiments. There's no doubt that the CoinRun agent is best described as trying to get to the end of the level, not the coin.

Some in-depth comments on the interpretability experiments:

  • There are actually two slightly different versions of CoinRun, the original version and the Procgen version (without the gray squares in the top left). The model was trained on the original version, but the OOD environment is based on the Procgen version, so your observations are more OOD than you may have r
... (read more)

This is a follow-up note to a nice paper of Markus Mueller on the possibility of a machine-invariant notion of Kolmogorov complexity, available here: