Richard Ngo

Former username ricraz. I'm an AI safety research engineer at DeepMind (all opinions my own, not theirs). I'm from New Zealand and now based in London; I also did my undergrad and master's degrees in the UK (in Computer Science, Philosophy, and Machine Learning). Blog: thinkingcomplete.blogspot.com

Comments

Arguments against myopic training

Ah, makes sense. There's already a paragraph on this (starting "I should note that so far"), but I'll edit to mention it earlier.

Competition: Amplify Rohin’s Prediction on AGI researchers & Safety Concerns

I think 1% in the next year and a half is significantly too low.

Firstly, conditioning on AGI researchers makes a pretty big difference. It rules out most mainstream AI researchers, including many of the most prominent ones who get the most media coverage. So I suspect your gut feeling about what people would say isn't taking this sufficiently into account.

Secondly, I think attributing ignorance to the outgroup is a pretty common fallacy, so you should be careful of that. I think a clear majority of AGI researchers are probably familiar with the concept of reward gaming by now, and could talk coherently about AGIs reward gaming, or manipulating humans. Maybe they couldn't give very concrete disaster scenarios, but neither can many of us.

And thirdly, once you get agreement that there are problems, you basically get "we should fix the problems first" for free. I model most AGI researchers as thinking that AGI is far enough away that we can figure out practical ways to prevent these things, like better protocols for giving feedback. So they'll agree that we should do that first, because they think that it'll happen automatically anyway.

Arguments against myopic training

I'm also confused.

"While the overseer might very well try to determine how effective it's own actions will be at achieving long-term goals, it never evaluates how effective the model's actions will be."

Evan, do you agree that for the model to imitate the actions of the supervisor, it would be useful to mimic some of the thought processes the supervisor uses when generating those actions?

In other words, if HCH is pursuing goal X, what feature of myopic training selects for a model that is internally thinking "I'm going to try to be as close to HCH as possible in this timestep, which involves reasoning about how HCH would pursue X", versus a model that's thinking "I'm going to pursue goal X"? (To the extent these are different, which I'm still confused about).

Environments as a bottleneck in AGI development

I endorse Steve's description as a caricature of my view, and also Rohin's comment. To flesh out my view a little more: I think that GPT-3 doing so well on language without (arguably) being able to reason is the same type of evidence as Deep Blue or AlphaGo doing well at board games without being able to reason (although significantly weaker evidence). In both cases it suggests that just optimising for this task is not sufficient to create general intelligence. While it now seems pretty unreasonable to think that a superhuman chess AI would by default be generally intelligent, that seems not too far off what people used to think.

Now, it might be the case that the task doesn't matter very much for AGI if you "put a ton of information / inductive bias into the architecture", as Rohin puts it. But I interpret Sutton to be arguing against our ability to do so.

We'll eventually invent a different architecture-and-learning-algorithm that is suited to reasoning

There are two possible interpretations of this, one of which I agree with and one of which I don't. You could be saying that we'll eventually develop an architecture/learning algorithm biased towards reasoning ability - I disagree with this.

Or you could be saying that future architectures will be capable of reasoning in ways that transformers aren't, by virtue of just being generally more powerful, which seems totally plausible to me.

Environments as a bottleneck in AGI development

+1, I endorse this summary. I also agree that GPT-3 was an update towards the environment not mattering as much as I thought.

Your summary might be clearer if you rephrase as:

It considers two possibilities: the “easy paths hypothesis” that many environments would incentivize AGI, and the “hard paths hypothesis” that such environments are rare.

Since "easy paths" and "hard paths" by themselves are kinda ambiguous terms - are we talking about the paths, or the hypothesis? This is probably my fault for choosing bad terminology.

Environments as a bottleneck in AGI development

While this is a sensible point, I also think we should have a pretty high threshold for not talking about things, for a few reasons:

1. Safety research is in general much more dependent on having good ideas than capabilities research (because a lot of capabilities are driven by compute, and also because there are fewer of us).

2. Most of the AI people who listen to things people like us say are safety people.

3. I don't think there's enough work on safety techniques tailored to specific paths to AGI (as I discuss briefly at the end of this post).

4. Refraining from talking about things is uncooperative and gives others a bad impression of us.

So the type of thing I'd endorse not saying is "Here's one weird trick which will make the generation of random environments much easier." But something I endorse talking about is the potential importance of multi-agent environments for training AGIs, even though this is to me a central example of a "useful insight about what environment features are needed to incentivize general intelligence".

Environments as a bottleneck in AGI development

To be precise, the argument is that elephants (or other animals in similar situations) *wouldn't* evolve to human-level intelligence. The fact that they *didn't* isn't very much information (for anthropic reasons, because if they did then it'd be them wondering why primates didn't get to elephant-level intelligence).

And then we should also consider that the elephant environment isn't a randomly-sampled environment either, but is also correlated with ours (which means we should also anthropically discount this).

Environments as a bottleneck in AGI development

AGI wouldn't have those chicken-and-egg problems.

I like and agree with this point, and have made a small edit to the original post to reflect that. However, while I don't dispute that GPT-3 has some human-like concepts, I'm less sure about its reasoning abilities, and it's pretty plausible to me that self-supervised training on language alone plateaus before we get to a GPT-N that can reason. I'm also fairly uncertain about this, but these types of environmental difficulties are worth considering.

I'm also a bit confused about your reference to "Rich Sutton’s bitter lesson". Do you agree that Transformers learn more / better in the same environment than MLPs? That LSTMs learn more / better in the same environment than simpler RNNs?

Yes, but my point is that the *content* comes from the environment, not the architecture. We haven't tried to leverage our knowledge of language by, say, using a different transformer for each part of speech. I (and I assume Sutton) agree that we'll have increasingly powerful models, but they'll also be increasingly general - and therefore the question of whether a model with the capacity to become an AGI does so or not will depend to a significant extent on the environment.

Arguments against myopic training

I broadly agree about what our main disagreement is. Note that I've been mainly considering the case where the supervisor is more intelligent than the agent as well. The actual resolution of this will depend on what's really going on during amplification, which is a bigger topic that I'll need to think about more.

On the side disagreement (of whether looking at future states before evaluation counts as "myopic") I think I was confused when I was discussing it above and in the original article, which made my position a bit of a mess. Sorry about that; I've added a clarifying note at the top of the post, and edited the post to reflect what I actually meant. My actual response to this:

Objection 2: This sacrifices competitiveness, because now the human can't look at the medium-term consequences of actions before providing feedback.

Is that in the standard RL paradigm, we never look at the full trajectory before providing feedback in either myopic or nonmyopic training. However, in nonmyopic training this doesn't matter very much, because we can assign high or low reward to some later state in the trajectory, which then influences whether the agent learns to do the original action more or less. We can't do this in myopic training in the current paradigm, which is where the competitiveness sacrifice comes from.

E.g. my agent sends an email. Is it good or bad? In myopic training, you need to figure this out now. In nonmyopic training, you can shrug, give it 0 reward now, and then assign high or low reward to the agent when it gets a response that makes it clearer how good the email was. Then because the agent does credit assignment automatically, actions are in effect evaluated based on their medium-term consequences, although the supervisor never actually looks at future states during evaluations.
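
To make this concrete, here's a minimal sketch in Python of the email example, using made-up reward values and an assumed discount factor. It just shows how, under nonmyopic training, credit assignment via the discounted return propagates a delayed reward back to the email-sending action, whereas under myopic training only the approval given at that timestep ever reaches it:

```python
# Toy illustration of the email example, with hypothetical numbers.
# Timestep 0: agent sends the email; timesteps 1-2: the world responds.

gamma = 0.99  # assumed discount factor

# Nonmyopic training: the supervisor shrugs at t=0 (reward 0), then assigns
# reward at t=2 once the response makes it clear the email was good.
nonmyopic_rewards = [0.0, 0.0, 1.0]

# The discounted return at t=0 is what credit assignment propagates back
# to the email-sending action; it includes the later reward.
return_t0 = sum(gamma ** t * r for t, r in enumerate(nonmyopic_rewards))
print(return_t0)  # ~0.98: the original action gets reinforced

# Myopic training: only the approval given at t=0 matters for that action,
# so the supervisor has to judge the email immediately.
myopic_approval_t0 = 0.0  # supervisor couldn't yet tell whether it was good
print(myopic_approval_t0)  # 0.0: the later outcome never reaches this action
```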

This is consistent with your position: "When I talk about myopic training vs. regular RL, I'm imagining that they have the same information available when feedback is given". However, it also raises the question of why we can't just wait until the end of the trajectory to give myopic feedback anyway. In my edits I've called this "semi-myopia". This wouldn't be as useful as nonmyopia, but I do agree that semi-myopia alleviates some competitiveness concerns, although at the cost of being more open to manipulation. The exact tradeoff here will depend on disagreement 1.

Arguments against myopic training

Why do nonmyopic agents end up power-seeking? Because the supervisor rates some states highly, and so the agent is incentivised to gain power in order to reach those states.

Why do myopic agents end up power-seeking? Because to train a competitive myopic agent, the supervisor will need to calculate how much approval they assign to actions based on how much those actions contribute to reaching valuable states. So the agent will be rewarded for taking actions which acquire it more power, since the supervisor will predict that those contribute to reaching valuable states.

(You might argue that, if the supervisor doesn't want the agent to be power-seeking, they'll only approve of actions which gain the agent more power in specified ways. But equivalently, a reward function can also penalise unauthorised power-gaining, assuming the supervisor is equally able to notice it in both cases.)
