Rohin Shah

PhD student at the Center for Human-Compatible AI. Creator of the Alignment Newsletter.


Value Learning
Alignment Newsletter


My research methodology

I agree this involves discretion [...] So instead I'm doing some in between thing

Yeah, I think I feel like that's the part where I don't think I could replicate your intuitions (yet).

I don't think we disagree; I'm just noting that this methodology requires a fair amount of intuition / discretion, and I don't feel like I could do this myself. This is much more a statement about what I can do, rather than a statement about how good the methodology is on some absolute scale.

(Probably I could have been clearer about this in the original opinion.)

My research methodology

In some sense you could start from the trivial story "Your algorithm didn't work and then something bad happened." Then the "search for stories" step is really just trying to figure out if the trivial story is plausible. I think that's pretty similar to a story like: "You can't control what your model thinks, so in some new situation it decides to kill you."

To fill in the details more:

Assume that we're finding an algorithm to train an agent with a sufficiently large action space (i.e. we don't get safety via the agent having such a restricted action space that it can't do anything unsafe).

It seems like in some sense the game is in constraining the agent's cognition to be such that it is "safe" and "useful". The point of designing alignment algorithms is to impose such constraints, without requiring so much effort as to make the resulting agent useless / uncompetitive.

However, there are always going to be some plausible circumstances that we didn't consider (even if we're talking about amplified humans, which are still bounded agents). Even if we had maximal ability to place constraints on agent cognition, whatever constraints we do place won't have been tested in these unconsidered plausible circumstances. It is always possible that one misfires in a way that makes the agent do something unsafe.

(This wouldn't be true if we had some sort of proof against misfiring, that doesn't assume anything about what circumstances the agent experiences, but that seems ~impossible to get. I'm pretty sure you agree with that.)

More generally, this story is going to be something like:

  1. Suppose you trained your model M to do X using algorithm A.
  2. Unfortunately, when designing algorithm A / constraining M with A, you (or amplified-you) failed to consider circumstance C as a possible situation that might happen.
  3. As a result, the model learned heuristic H, that works in all the circumstances you did consider, but fails in circumstance C.
  4. Circumstance C then happens in the real world, leading to an actual failure.

Obviously, I can't usually instantiate M, X, A, C, and H such that the story works for an amplified human (since they can presumably think of anything I can think of). And I'm not arguing that any of this is probable. However, it seems to meet your bar of "plausible":

there is some way to fill in the rest of the details that's consistent with everything I know about the world.

EDIT: Or maybe more accurately, I'm not sure how exactly the stories you tell are different / more concrete than the ones above.


When I say you have "a better defined sense of what does and doesn't count as a valid step 2", I mean that there's something in your head that disallows the story I wrote above, but allows the stories that you generally use, and I don't know what that something is; and that's why I would have a hard time applying your methodology myself.


Possible analogy / intuition pump for the general story I gave above: Human cognition is only competent in particular domains and must be relearned in new domains (like protein folding) or new circumstances (like when COVID-19 hits), and sometimes human cognition isn't up to the task (like when being teleported to a universe with different physics and immediately dying), or doesn't do so in a way that agrees with other humans (like how some humans would push a button that automatically wirehead everyone for all time, while others would find that abhorrent).

My research methodology

Planned summary for the Alignment Newsletter:

This post outlines a simple methodology for making progress on AI alignment. The core idea is to alternate between two steps:

1. Come up with some alignment algorithm that solves the issues identified so far

2. Try to find some plausible situation in which either a) the resulting AI system is misaligned or b) the AI system is not competitive.

This is all done conceptually, so step 2 can involve fairly exotic scenarios that probably won't happen. Given such a scenario, we need to argue why no failure in the same class as that scenario will happen, or we need to go back to step 1 and come up with a new algorithm.

This methodology could play out as follows:

Step 1: RL with a handcoded reward function.

Step 2: This is vulnerable to <@specification gaming@>(@Specification gaming examples in AI@).

Step 1: RL from human preferences over behavior, or other forms of human feedback.

Step 2: The system might still pursue actions that are bad that humans can't recognize as bad. For example, it might write a well researched report on whether fetuses are moral patients, which intuitively seems good (assuming the research is good). However, this would be quite bad if the AI wow the report because it calculated that it would increase partisanship leading to civil war.

Step 1: Use iterated amplification to construct a feedback signal that is "smarter" than the AI system it is training.

Step 2: The system might pick up on <@inaccessible information@>(@Inaccessible information@) that the amplified overseer cannot find. For example, it might be able to learn a language just by staring at a large pile of data in that language, and then seek power whenever working in that language, and the amplified overseer may not be able to detect this.

Step 1: Use <@imitative generalization@>(@Imitative Generalisation (AKA 'Learning the Prior')@) so that the human overseer can leverage facts that can be learned by induction / pattern matching, which neural nets are great at.

Step 2: Since imitative generalization ends up learning a description of facts for some dataset, it may learn low-level facts useful for prediction on the dataset, while not including the high-level facts that tell us how the low-level facts connect to things we care about. 

The post also talks about various possible objections you might have, which I’m not going to summarize here.

Planned opinion:

I'm a big fan of having a candidate algorithm in mind when reasoning about alignment. It is a lot more concrete, which makes it easier to make progress and not get lost, relative to generic reasoning from just the assumption that the AI system is superintelligent.

I'm less clear on how exactly you move between the two steps -- from my perspective, there is a core reason for worry, which is something like "you can't fully control what patterns of thought your algorithm learn, and how they'll behave in new circumstances", and it feels like you could always apply that as your step 2. Our algorithms are instead meant to chip away at the problem, by continually increasing our control over these patterns of thought. It seems like the author has a better defined sense of what does and doesn't count as a valid step 2, and that makes this methodology more fruitful for him than it would be for me. More discussion [here](

How do scaling laws work for fine-tuning?

I don't think similarly-sized transformers would do much better and might do worse. Section 3.4 shows that large models trained from scratch massively overfit to the data. I vaguely recall the authors saying that similarly-sized transformers tended to be harder to train as well.

How do scaling laws work for fine-tuning?

Does this mean that this fine-tuning process can be thought of as training a NN that is 3 OOMs smaller, and thus needs 3 OOMs fewer training steps according to the scaling laws?

My guess is that the answer is mostly yes (maybe not the exact numbers predicted by existing scaling laws, but similar ballpark).

how does that not contradict the scaling laws for transfer described here and used in this calculation by Rohin?

I think this is mostly irrelevant to timelines / previous scaling laws for transfer:

  1. You still have to pretrain the Transformer, which will take the usual amount of compute (my calculation that you linked takes this into account).
  2. The models trained in the new paper are not particularly strong. They are probably equivalent in performance to models that are multiple orders of magnitude smaller trained from scratch. (I think when comparing against training from scratch, the authors did use smaller models because that was more stable, though with a quick search I couldn't find anything confirming that right now.) So if you think of the "default" as "train an X-parameter model from scratch", then to get equivalent performance you'd probably want to do something like "pretrain a 100X-parameter model, then finetune 0.1% of its weights". (Numbers completely made up.)
  3. I expect there are a bunch of differences in how exactly models are trained. For example, the scaling law papers work almost exclusively with compute-optimal training, whereas this paper probably works with models trained to convergence.

You probably could come to a unified view that incorporates both this new paper and previous scaling law papers, but I expect you'd need to spend a bunch of time getting into the minutiae of the details across the two methods. (Probably high tens to low hundreds of hours.)

Coherence arguments imply a force for goal-directed behavior

Yes, that's basically right.

You think I take the original argument to be arguing from ‘has goals' to ‘has goals’, essentially, and agree that that holds, but don’t find it very interesting/relevant.

Well, I do think it is an interesting/relevant argument (because as you say it explains how you get from "weakly has goals" to "strongly has goals"). I just wanted to correct the misconception about what I was arguing against, and I wanted to highlight the "intelligent" --> "weakly has goals" step as a relatively weak step in our current arguments. (In my original post, my main point was that that step doesn't follow from pure math / logic.)

In that case, my current understanding is that you are disagreeing with 2, and that you agree that if 2 holds in some case, then the argument goes through.

At least, the argument makes sense. I don't know how strong its effect is -- basically I agree with your phrasing here:

This force probably doesn’t exist out at the zero goal directness edges, but it unclear how strong it is in the rest of the space—i.e. whether it becomes substantial as soon as you move out from zero goal directedness, or is weak until you are in a few specific places right next to ‘maximally goal directed’.)

Coherence arguments imply a force for goal-directed behavior

Thanks, that's helpful. I'll think about how to clarify this in the original post.

Coherence arguments imply a force for goal-directed behavior

You're mistaken about the view I'm arguing against. (Though perhaps in practice most people think I'm arguing against the view you point out, in which case I hope this post helps them realize their error.) In particular:

Whatever things you care about, you are best off assigning consistent numerical values to them and maximizing the expected sum of those values

If you start by assuming that the agent cares about things, and your prior is that the things it cares about are "simple" (e.g. it is very unlikely to be optimizing the-utility-function-that-makes-the-twitching-robot-optimal), then I think the argument goes through fine. According to me, this means you have assumed goal-directedness in from the start, and are now seeing what the implications of goal-directedness are.

My claim is that if you don't assume that the agent cares about things, coherence arguments don't let you say "actually, principles of rationality tell me that since this agent is superintelligent it must care about things".

Stated this way it sounds almost obvious that the argument doesn't work, but I used to hear things that effectively meant this pretty frequently in the past. Those arguments usually go something like this:

  1. By hypothesis, we will have superintelligent agents.
  2. A superintelligent agent will follow principles of rationality, and thus will satisfy the VNM axioms.
  3. Therefore it can be modeled as an EU maximizer.
  4. Therefore it pursues convergent instrumental subgoals and kill us all.

This talk for example gives the impression that this sort of argument works. (If you look carefully, you can see that it does state that the AI is programmed to have "objects of concern", which is where the goal-directedness assumption comes in, but you can see why people might not notice that as an assumption.)


You might think "well, obviously the superintelligent AI system is going to care about things, maybe it's technically an assumption but surely that's a fine assumption". I think on balance I agree, but it doesn't seem nearly so obvious to me, and seems to depend on how exactly the agent is built. For example, it's plausible to me that superintelligent expert systems would not be accurately described as "caring about things", and I don't think it was a priori obvious that expert systems wouldn't lead to AGI. Similarly, it seems at best questionable whether GPT-3 can be accurately described as "caring about things".


As to whether this argument is relevant for whether we will build goal-directed systems: I don't think that in isolation my argument should strongly change your view on the probability you assign to that claim. I see it more as a constraint on what arguments you can supply in support of that view. If you really were just saying "VNM theorem, therefore 99%", then probably you should become less confident, but I expect in practice people were not doing that and so it's not obvious how exactly their probabilities should change.


I'd appreciate advice on how to change the post to make this clearer -- I feel like your response is quite common, and I haven't yet figured out how to reliably convey the thing I actually mean.

Introduction To The Infra-Bayesianism Sequence

But for more general infradistributions this need not be the case. For example, consider  and take the set of a-measures generated by  and . Suppose you start with  dollars and can bet any amount on any outcome at even odds. Then the optimal bet is betting  dollars on the outcome , with a value of  dollars.

I guess my question is more like: shouldn't there be some aspect of reality that determines what my set of a-measures is? It feels like here we're finding a set of a-measures that rationalizes my behavior, as opposed to choosing a set of a-measures based on the "facts" of the situation and then seeing what behavior that implies.

I feel like we agree on what the technical math says, and I'm confused about the philosophical implications. Maybe we should just leave the philosophy alone for a while.

Load More