Agreed. Likewise, in a transformer, the token dimension should maintain some relationship with the input and output tokens. This is sometimes taken for granted, but it is a good example of the data preferring a coordinate system. My remark that you quoted only really applies to the channel dimension, across which layers typically scramble everything.
The notion of a preferred (linear) transformation for interpretability has been called a "privileged basis" in the mechanistic interpretability literature. See for example Softmax Linear Units, where the idea is discussed at length.
In practice, the typical reason to expect a privileged basis is in fact SGD – or more precisely, the choice of architecture. Specifically, activation functions such as ReLU often privilege the standard basis. I would not generally expect the data or the initialization to privilege any basis beyond the start of the network or the...
For people viewing on the Alignment Forum, there is a separate thread on this question here. (Edit: my link to LessWrong is automatically converted to an Alignment Forum link, you will have to navigate there yourself.)
Without commenting on the specifics, I have edited to the post to mitigate potential confusion: "this fact alone is not intended to provide a complete picture of the Anthropic split, which is more complicated than I am able to explain here".
I was the project lead on WebGPT and my motivation was to explore ideas for scalable oversight and truthfulness (some further explanation is given here).
It includes the people working on the kinds of projects I listed under the first misconception. It does not include people working on things like the mitigation you linked to. OpenAI distinguishes internally between research staff (who do ML and policy research) and applied staff (who work on commercial activities), and my numbers count only the former.
WebGPT seemed like one of the most in-expectation harmful projects that OpenAI has worked on, with no (to me) obvious safety relevance, so my guess is I would still mostly categorize the things you list under the first misconception as capabilities research. InstructGPT also seems to be almost fully capabilities research (like, I agree that there are some safety lessons to be learned here, but it seems somewhat clear to me that people are working on WebGPT and InstructGPT primarily for capabilities reasons, not for existential-risk-from-AI reasons)
I don't think I understand your question about Y-problems, since it seems to depend entirely on how specific something can be and still count as a "problem". Obviously there is already experimental evidence that informs predictions about existential risk from AI in general, but we will get no experimental evidence of any exact situation that occurs beforehand. My claim was more of a vague impression about how OpenAI leadership and John tend to respond to different kinds of evidence in general, and I do not hold it strongly.
To clarify, by "empirical" I meant "relating to differences in predictions" as opposed to "relating to differences in values" (perhaps "epistemic" would have been better). I did not mean to distinguish between experimental versus conceptual evidence. I would expect OpenAI leadership to put more weight on experimental evidence than you, but to be responsive to evidence of all kinds. I think that OpenAI leadership are aware of most of the arguments you cite, but came to different conclusions after considering them than you did.
[First of all, many thanks for writing the post; it seems both useful and the kind of thing that'll predictably attract criticism]
I'm not quite sure what you mean to imply here (please correct me if my impression is inaccurate - I'm describing how-it-looks-to-me, and I may well be wrong):
I would expect OpenAI leadership to put more weight on experimental evidence than you...
Specifically, John's model (and mine) has:
X = [Class of high-stakes problems on which we'll get experimental evidence before it's too late]
Y = [Class of high-stakes problems on which we...
This is just supposed to be an (admittedly informal) restatement of the definition of outer alignment in the context of an objective function where the data distribution plays a central role.
For example, assuming a reinforcement learning objective function, outer alignment is equivalent to the statement that there is an aligned policy that gets higher average reward on the training distribution than any unaligned policy.
I did not intend to diminish the importance of robustness by focusing on outer alignment in this post.
I share your intuitions about ultimately not needing much alignment data (and tried to get that across in the post), but quantitatively:
A number of reasonable outer alignment proposals such as iterated amplification, recursive reward modeling and debate use generic objectives such as reinforcement learning (and indeed, none of them would work in practice without sufficiently high data quality), so it seems strange to me to dismiss these objectives.
I think it's reasonable to aim for quantity within 2 OOM of RLHF.
Do you mean that on-paper solutions should aim to succeed with no more than 1/100 as much human data as RLHF, or no more than 100 times as much? And are you referring the amount of human data typically used in contemporary implementations of RLHF, or something else? And what makes you think that this is a reasonable target?
I think that data quality is a helpful framing of outer alignment for a few reasons:
I've not mentally carved things up that way before, but they do seem like different flavors of work (with 1 and 2 being closely related).
Another distinction I sometimes consider is between exploring a network for interpretable pieces ("finding things we understand") versus trying to exhaustively interpret part of a network ("finding things we don't understand"). But this distinction doesn't carve up existing work very evenly: the only thing I can think of that I'd put in the latter category is the work on Artificial Artificial Neural Networks.
Great post! This seems like a useful perspective to keep in mind.
Somewhat orthogonally to the theoretical picture, I expect that in the current regime (only optimizing the policy a small amount), any method that does a reasonable job of maximizing reward while controlling how much the policy changes can be made to work in practice. For example, if PPO is tuned appropriately, the KL penalty term can be removed from the reward entirely - instead, PPO's implicit "local" KL penalty controls the rate of policy change.
If we were in the regime of optimizing the p...
I suppose that depends on whether you think this constitutes several years of progress over and above what you would have expected. I don't think this comes close to that, so I think the effect is much smaller.
The first-order implication for Bio Anchors is that the number of training datapoints appears to scale linearly with parameter count, rather than in proportion to paramter count ^ 0.8, as estimated in the report. So for example, if you think that TAI models will be 100,000 times larger than current models, then they'll need 10 times more compute to train than was previously estimated. This pushes out timelines on the order of a few years, to the extent that you put weight on the neural network model.
"Catch misalignment early..." - This should have been "scary misalignment", e.g. power-seeking misalignment, deliberate deception in order to achieve human approval, etc., which I don't think we've seen clear signs of in current LMs. My thinking was that in fast takeoff scenarios, we're less likely to spot this until it's too late, and more generally that truthful LM work is less likely to "scale gracefully" to AGI. It's interesting that you don't share these intuitions.
Does this mean I agree or disagree with "our current picture of the risks is incomplete
Thanks for these questions, these phrases were ambiguous or poorly chosen:
one concrete thing I might hope for you to do...
I think this is included in what I intended by "adversarial training": we'd try to find tasks that cause the model to produce negligent falsehoods, train the model to perform better at those tasks, and aim for a model that is robust to someone searching for such tasks.
I can think of a few different interpretations of your concern (and am interested to hear if these don't cover it):
I agree that all of these are possibilities, and that the value of the endeavor could well depend on whether the people conducting (and communicating) the research are able to avoid pitfalls such as these.
What kind of specification do you have in mind? Is it like a set of guidelines for the human providing feedback on how to do it in an ideologically neutral way?
The reason I said "precise specification" is that if your guidelines are ambiguous, then you're implicitly optimizing something like, "what labelers prefer on average, given the ambiguity", but doing so in a less data-efficient way than if you had specified this target more precisely.
Suppose we wanted the AI to be ideologically neutral and free from human biases, just telling the objective truth to the extent possible. Do you think achieving something like that would be possible in the longer term, and if so through what kinds of techniques?
I think that should be possible with techniques like reinforcement learning from human feedback, for a given precise specification of "ideologically neutral". (You'll of course have a hard time convincing everyone that your specification is itself ideologically neutral, but projects like Wikipedia g...
Do you have any speculations on how/why this "helpful prompt" reduces false answers? [... It's not] instantiating a coherent simulation of a professor who is trying to be very diligent
I do think it's reasonable to describe the model as trying to simulate the professor, albeit with very low fidelity, and at the same time as trying to imitate other scenarios in which the prompt would appear (such as parodies). The model has a very poor understanding of what the professor would say, so it is probably often falling back to what it thinks would typically appear...
It's great to see these examples spelled out with clear and careful experiments. There's no doubt that the CoinRun agent is best described as trying to get to the end of the level, not the coin.
Some in-depth comments on the interpretability experiments:
This is a follow-up note to a nice paper of Markus Mueller on the possibility of a machine-invariant notion of Kolmogorov complexity, available here: http://www.sciencedirect.com/science/article/pii/S0304397509006550
I think the direction depends on what your expectations were – I'll try to explain.
First, some terminology: the term "horizon length" is used in the paper to refer to the number of timesteps over which the algorithm pays attention to rewards, as governed by the discount rate. In the biological anchors framework, the term "effective horizon length" is used to refer to a multiplier on the number of samples required to train the model, which is influenced by the horizon length and other factors. For clarity, I'll using the term "scaling multiplier" instead of... (read more)