Rohin Shah

PhD student at the Center for Human-Compatible AI. Creator of the Alignment Newsletter.

Rohin Shah's Comments

OpenAI announces GPT-3

Planned summary for the Alignment Newsletter:

The biggest <@GPT-2 model@>(@Better Language Models and Their Implications@) had 1.5 billion parameters, and since its release people have trained language models with up to 17 billion parameters. This paper reports GPT-3 results, where the largest model has _175 billion_ parameters, a 10x increase over the previous largest language model. To get the obvious out of the way, it sets a new state of the art (SOTA) on zero-shot language modeling (evaluated only on Penn Tree Bank, as other evaluation sets were accidentally a part of their training set).

The primary focus of the paper is on analyzing the _few-shot learning_ capabilities of GPT-3. In few-shot learning, after an initial training phase, at test time models are presented with a small number of examples of a new task, and then must execute that task for new inputs. Such problems are usually solved using _meta-learning_ or _finetuning_, e.g. at test time MAML takes a few gradient steps on the new examples to produce a model finetuned for the test task. In contrast, the key hypothesis with GPT-3 is that language is so diverse, that doing well on it already requires adaptation to the input, and so the learned language model will _already be a meta-learner_. This implies that they can simply ""prime"" the model with examples of a task they care about, and the model can _learn_ what task is supposed to be performed, and then perform that task well.

For example, consider the task of generating a sentence using a new-made up word whose meaning has been explained. In one notable example, the prompt for GPT-3 is:

_A ""whatpu"" is a small, furry animal native to Tanzania. An example of a sentence that uses the word whatpu is:_
_We were traveling in Africa and we saw these very cute whatpus._
_To do a ""farduddle"" means to jump up and down really fast. An example of a sentence that uses the word farduddle is:_

Given this prompt, GPT-3 generates the following output:

_One day when I was playing tag with my little sister, she got really excited and she
started doing these crazy farduddles._

The paper tests on several downstream tasks for which benchmarks exist (e.g. question answering), and reports zero-shot, one-shot, and few-shot performance on all of them. On some tasks, the few-shot version sets a new SOTA, _despite not being finetuned using the benchmark’s training set_; on others, GPT-3 lags considerably behind finetuning approaches.

The paper also consistently shows that few-shot performance increases as the number of parameters increase, and the rate of increase is faster than the corresponding rate for zero-shot performance. While they don’t outright say it, we might take this as suggestive evidence that as models get larger, they are more incentivized to learn “general reasoning abilities”.

The most striking example of this is in arithmetic, where the smallest 6 models (up to 6.7 billion parameters) have poor performance (< 20% on 2-digit addition), then the next model (13 billion parameters) jumps to > 50% on 2-digit addition and subtraction, and the final model (175 billion parameters) achieves > 80% on 3-digit addition and subtraction and a perfect 100% on 2-digit addition (all in the few-shot regime). They explicitly look for their test problems in the training set, and find very few examples, suggesting that the model really is learning “how to do addition”; in addition, when it is incorrect, it tends to be mistakes like “forgetting to carry a 1”.

On broader impacts, the authors talk about potential misuse, fairness and bias concerns, and energy usage concerns, and say about what you’d expect. One interesting note: “To understand how low and mid-skill actors think about language models, we have been monitoring forums and chat groups where misinformation tactics, malware distribution, and computer fraud are frequently discussed.” They find that while there was significant discussion of misuse, they found no successful deployments. They also consulted with professional threat analysts about the possibility of well-resourced actors misusing the model. According to the paper: “The assessment was that language models may not be worth investing significant resources in because there has been no convincing demonstration that current language models are significantly better than current methods for generating text, and because methods for “targeting” or “controlling” the content of language models are still at a very early stage.”

Planned opinion:

For a long time, I’ve heard people quietly hypothesizing that with a sufficient diversity of tasks, regular gradient descent could lead to general reasoning abilities allowing for quick adaptation to new tasks. This is a powerful demonstration of this hypothesis.

One critique is that GPT-3 still takes far too long to “identify” a task -- why does it need 50 examples of addition in order to figure out that what it should do is addition? Why isn’t 1 sufficient? It’s not like there are a bunch of other conceptions of “addition” that need to be disambiguated. I’m not sure what’s going on mechanistically, but we can infer from the paper that as language models get larger, the number of examples needed to achieve a given level of performance goes down, so it seems like there is some “strength” of general reasoning ability that goes up. Still, it would be really interesting to figure out mechanistically how the model is “reasoning”.

This also provides some empirical evidence in support of the threat model underlying <@inner alignment concerns@>(@Risks from Learned Optimization in Advanced Machine Learning Systems@): they are predicated on neural nets that implicitly learn to optimize. (To be clear, I think it provides empirical support for neural nets learning to “reason generally”, not neural nets learning to implicitly “perform search” in pursuit of a “mesa objective” -- see also <@Is the term mesa optimizer too narrow?@>.
An overview of 11 proposals for building safe advanced AI

Planned summary for the Alignment Newsletter:

This post describes eleven “full” AI alignment proposals (where the goal is to build a powerful, beneficial Ai system using current techniques), and evaluates them on four axes:

1. **Outer alignment:** Would the optimal policy for the specified loss function be aligned with us? See also this post .
2. **Inner alignment:** Will the model that is _actually produced_ by the training process be aligned with us?
3. **Training competitiveness:** Is this an efficient way to train a powerful AI system? More concretely, if one team had a “reasonable lead” over other teams, would they keep at least some of the lead if they used this algorithm?
4. **Performance competitiveness:** Will the trained model have good performance (relative to other models that could be trained)?

Seven of the eleven proposals are of the form “recursive outer alignment technique” plus “<@technique for robustness@>(@Worst-case guarantees (Revisited)@)”. The recursive outer alignment technique is either <@debate@>(@AI safety via debate@), <@recursive reward modeling@>(@Scalable agent alignment via reward modeling@), or some flavor of <@amplification@>(@Capability amplification@). The technique for robustness is either transparency tools to “peer inside the model”, <@relaxed adversarial training@>(@Relaxed adversarial training for inner alignment@), or intermittent oversight by a competent supervisor. An additional two proposals are of the form “non-recursive outer alignment technique” plus “technique for robustness” -- the non-recursive techniques are vanilla reinforcement learning in a multiagent environment, and narrow reward learning.

Another proposal is Microscope AI, in which we train AI systems to simply understand vast quantities of data, and then by peering into the AI system we can learn the insights that the AI system learned, leading to a lot of value. We wouldn’t have the AI system act in the world, thus eliminating a large swath of potential bad outcomes. Finally, we have STEM AI, where we try to build an AI system that operates in a sandbox and is very good at science and engineering, but doesn’t know much about humans. Intuitively, such a system would be very unlikely to deceive us (and probably would be incapable of doing so).

The post contains a lot of additional content that I didn’t do justice to in this summary. In particular, I’ve said nothing about the analysis of each of these proposals on the four axes listed above; the full post talks about all 44 combinations.

Planned opinion:

I’m glad this post exists: while most of the specific proposals could be found by patching together content spread across other blog posts, there was a severe lack of a single article laying out a full picture for even one proposal, let alone all eleven in this post.

I usually don’t think about outer amplification as what happens with optimal policies, as assumed in this post -- when you’re talking about loss functions _in the real world_ (as I think this post is trying to do), _optimal_ behavior can be weird and unintuitive, in ways that may not actually matter. For example, arguably for any loss function, the optimal policy is to hack the loss function so that it always outputs zero (or perhaps negative infinity).
Conclusion to 'Reframing Impact'

Seems like that only makes sense because you specified that "increasing production capacity" and "upgrading machines" are the things that I'm not allowed to do, and those are things I have a conceptual grasp on. And even then -- am I allowed to repair machines that break? What about buying a new factory? What if I force workers to work longer hours? What if I create effective propaganda that causes other people to give you paperclips? What if I figure out that by using a different source of steel I can reduce the defect rate? I am legitimately conceptually uncertain whether these things count as "increasing production capacity / upgrading machines".

As another example, what does it mean to optimize for "curing cancer" without becoming more able to optimize for "curing cancer"?

How can Interpretability help Alignment?

Planned summary for the Alignment Newsletter:

Interpretability seems to be useful for a wide variety of AI alignment proposals. Presumably, different proposals require different kinds of interpretability. This post analyzes this question to allow researchers to prioritize across different kinds of interpretability research.

At a high level, interpretability can either make our current experiments more informative to help us answer _research questions_ (e.g. “when I set up a <@debate@>(@AI safety via debate@) in this particular way, does honesty win?”), or it could be used as part of an alignment technique to train AI systems. The former only have to be done once (to answer the question), and so we can spend a lot of effort on them, while the latter must be efficient in order to be competitive with other AI algorithms.

They then analyze how interpretability could apply to several alignment techniques, and come to several tentative conclusions. For example, they suggest that for recursive techniques like iterated amplification, we may want comparative interpretability, that can explain the changes between models (e.g. between distillation steps, in iterated amplification). They also suggest that by having interpretability techniques that can be used by other ML models, we can regularize a trained model to be aligned, without requiring a human in the loop.

Planned opinion:

I like this general direction of thought, and hope that people continue to pursue it, especially since I think interpretability will be necessary for inner alignment. I think it would be easier to build on the ideas in this post if they were made more concrete.
Conclusion to 'Reframing Impact'

I don't know what it means. How do you optimize for something without becoming more able to optimize for it? If you had said this to me and I hadn't read your sequence and so knew what you were trying to say, I'd have given you a blank stare -- the closest thing I have to an interpretation is "be myopic / greedy", but that limits your AI system to the point of uselessness.

Like, "optimize for X" means "do stuff over a period of time such that X goes up as much as possible". "Becoming more able to optimize for X" means "do a thing such that in the future you can do stuff such that X goes up more than it otherwise would have". The only difference between these two is actions that you can do for immediate reward.

(This is just saying in English what I was arguing for in the math comment.)

Conclusion to 'Reframing Impact'

Specific proposal.

If the conceptual version is "we keep A's power low", then that probably works.

If the conceptual version is "tell A to optimize R without becoming more able to optimize R", then I have the same objection.

Conclusion to 'Reframing Impact'

Some thoughts on this discussion:

1. Here's the conceptual comment and the math comment where I'm pessimistic about replacing the auxiliary set with the agent's own reward.

However, the agent's reward is usually not the true human utility, or a good approximation of it. If the agent's reward was the true human utility, there would be no need to use an impact measure in the first place.


You seem to have misunderstood. Impact to a person is change in their AU. The agent is not us, and so it's insufficient for the agent to preserve its ability to do what we want – it has to preserve our ability to do we want!

Hmm, I think you're misunderstanding Vika's point here (or at least, I think there is a different point, whether Vika was saying it or not). Here's the argument, spelled out in more detail:

1. Impact to an arbitrary agent is change in their AU.

2. Therefore, to prevent catastrophe via regularizing impact, we need to have an AI system that is penalized for changing a human's AU.

3. By assumption, the AI's utility function is different from the human's (otherwise there wouldn't be any problem).

4. We need to ensure can pursue , but we're regularizing pursuing . Why should we expect the latter to cause the former to happen?

One possible reason is there's an underlying factor which is how much power has, and as long as this is low it implies that any agent (including ) can pursue their own reward about as much as they could in 's absence (this is basically CCC). Then, if we believe that regularizing pursuing keeps 's power low, we would expect it also means that remains able to pursue . I don't really believe the premise there (unless you regularize so strongly that the agent does nothing).

Coherence arguments do not imply goal-directed behavior

I finally read Rational preference: Decision theory as a theory of practical rationality, and it basically has all of the technical content of this post; I'd recommend it as a more in-depth version of this post. (Unfortunately I don't remember who recommended it to me, whoever you are, thanks!) Some notable highlights:

It is, I think, very misleading to think of decision theory as telling you to maximize your expected utility. If you don't obey its axioms, then there is no utility function constructable for you to maximize the expected value of. If you do obey the axioms, then your expected utility is always maximized, so the advice is unnecessary. The advice, 'Maximize Expected Utility' misleadingly suggests that there is some quantity, definable and discoverable independent of the formal construction of your utility function, that you are supposed to be maximizing. That is why I am not going to dwell on the rational norm, Maximize Expected Utility! Instead, I will dwell on the rational norm, Attend to the Axioms!

Very much in the spirit of the parent comment.

Unfortunately, the Fine Individuation solution raises another problem, one that looks deeper than the original problems. The problem is that Fine Individuation threatens to trivialize the axioms.

(Fine Individuation is basically the same thing as moving from preferences-over-snapshots to preferences-over-universe-histories.)

All it means is that a person could not be convicted of intransitive preferences merely by discovering things about her practical preferences. [...] There is no possible behavior that could reveal an impractical preference

His solution is to ask people whether they were finely individuating, and if they weren't, then you can conclude they are inconsistent. This is kinda sorta acknowledging that you can't notice inconsistency from behavior ("practical preferences" aka "choices that could actually be made"), though that's a somewhat inaccurate summary.

There is no way that anyone could reveal intransitive preferences through her behavior. Suppose on one occasion she chooses X when the alternative was Y, on another she chooses Y when the alternative was Z, and on a third she chooses g when the alternative was X. But that is nonsense; there is no saying that the Y she faced in the first occasion was the same as the Y she faced on the second. Those alternatives could not have been just the same, even leaving aside the possibility of individuating them by reference to what else could have been chosen. They will be alternatives at different times, and they will have other potentially significant differentia.

Basically making the same point with the same sort of construction as the OP.

AI Alignment Podcast: An Overview of Technical AI Alignment in 2018 and 2019 with Buck Shlegeris and Rohin Shah
Using the inaction baseline in the driving example compares to the other driver never leaving their garage (rather than falling asleep at the wheel).

Maybe? How do you decide where to start the inaction baseline? In RL the episode start is an obvious choice, but it's not clear how to apply that for humans.

(I only have this objection when trying to explain what "impact" means to humans; it seems fine in the RL setting. I do think we'll probably stop relying on the episode abstraction eventually, so we would eventually need to not rely on it ourselves, but plausibly that can be dealt with in the future.)

Also, under this inaction baseline, the roads are perpetually empty, and so you're always feeling impact from the fact that you can't zoom down the road at 120 mph, which seems wrong.

I agree that counterfactuals are hard, but I'm not sure that difficulty can be avoided. Your baseline of "what the human expected the agent to do" is also a counterfactual, since you need to model what would have happened if the world unfolded as expected.

Sorry, what I meant to imply was "baselines are counterfactuals, and counterfactuals are hard, so maybe no 'natural' baseline exists". I certainly agree that my baseline is a counterfactual.

On the other hand, since (as you mentioned) this is not intended as a baseline for impact penalization, maybe it doesn't need to be well-defined or efficient in terms of human input, and it is a good source of intuition on what feels impactful to humans.

Yes, that's my main point. I agree that there's no clear way to take my baseline and implement it in code, and that it depends on fuzzy concepts that don't always apply (even when interpreted by humans).

[AN #100]: What might go wrong if you learn a reward function while acting

Thank you for reading closely enough to notice the 5 characters used to mark the occasion :)

Load More