Alignment Newsletter is a weekly publication with recent content relevant to AI alignment around the world. Find all Alignment Newsletter resources here. In particular, you can look through this spreadsheet of all summaries that have ever been in the newsletter.

Audio version here (may not be up yet).

HIGHLIGHTS

Reframing Impact - Part 2 (Alex Turner) (summarized by Rohin): In part 1 (AN #68) of this sequence, we saw that an event is impactful if it changes our ability to get what we want. This part takes this understanding and applies it to AI alignment.

In the real world, there are many events that cause objective negative impacts: they reduce your ability to pursue nearly any goal. An asteroid impact that destroys the Earth is going to be pretty bad for you, whether you want to promote human flourishing or to make paperclips. Conversely, there are many plans that produce objective positive impacts: for many potential goals, it's probably a good idea to earn a bunch of money, or to learn a lot about the world, or to command a perfectly loyal army. This is particularly exacerbated when the environment contains multiple agents: for goals that benefit from having more resources, it is objectively bad for you if a different agent seizes your resources, and objectively good for you if you seize other agents' resources.

Based on this intuitive (but certainly not ironclad) argument, we get the Catastrophic Convergence Conjecture (CCC): "Unaligned goals tend to have catastrophe-inducing optimal policies because of power-seeking incentives".

Let's now consider a conceptual version of Attainable Utility Preservation (AUP) (AN #25): the agent optimizes a primary (possibly unaligned) goal, but is penalized for changing its "power" (in the intuitive sense). Intuitively, such an agent no longer has power-seeking incentives, and so (by the contrapositive of the CCC) it will not have a catastrophe-inducing optimal policy -- exactly what we want! This conceptual version of AUP also avoids thorny problems such as ontology identification and butterfly effects, because the agent need only reason about its own beliefs, rather than having to reason directly about the external world.

Rohin's opinion: This was my favorite part of the sequence, as it explains the conceptual case for AUP clearly and concisely. I especially liked the CCC: I believe that we should be primarily aiming to prevent an AI system "intentionally" causing catastrophe, while not attempting to guarantee an absence of "accidental" mistakes (1 (AN #33), 2 (AN #43)), and the CCC is one way of cashing out this intuition. It's a more crisp version of the idea that convergent instrumental subgoals are in some sense the "source" of AI accident risk, and if we can avoid instrumental subgoals we will probably have solved AI safety.

Reframing Impact - Part 3 (Alex Turner) (summarized by Rohin): The final section of the sequence turns to an actual implementation of AUP, and deals with the ways in which the implementation deviates from the conceptual version of AUP. Power is measured using a set of auxiliary rewards: the change in attainable utility for this auxiliary set is treated as impact, and the agent is penalized for it. The first post presents some empirical results, many of which we've covered before (AN #39), but I wanted to note the new results on SafeLife (summarized below). On the high-dimensional world of SafeLife, the authors train a VAE to find a good latent representation, and choose a single linear reward function on the latent representation as their auxiliary reward function: it turns out this is enough to avoid side effects in at least some cases of SafeLife.

We then look at some improvements that can be made to the original AUP implementation. First, according to CCC, we only need to penalize power, not impact: as a result we can just penalize increases in attainable utilities, rather than both increases and decreases as in the original version. Second, the auxiliary set of rewards only provides a proxy for impact / power, which an optimal agent could game (for example, by creating subagents, summarized below). So instead, we can penalize increases in attainable utility for the primary goal, rather than using auxiliary rewards. There are some other improvements that I won't go into here.
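To make the penalty concrete, here is a minimal Python sketch of the two variants described above. The names (q_aux, noop, the lam and scale constants) are illustrative assumptions rather than the paper's actual code; in the real implementation the auxiliary attainable utilities are estimated with learned Q-functions.

```python
def aup_penalty(q_aux, s, a, noop, only_increases=False):
    """Attainable utility penalty for taking action a in state s.

    q_aux: a list of Q-functions, one per auxiliary reward, each mapping
    (state, action) to an estimate of attainable utility. The penalty
    compares each auxiliary Q-value under a against the no-op action;
    only_increases=True corresponds to the variant that penalizes only
    gains in power.
    """
    diffs = [q(s, a) - q(s, noop) for q in q_aux]
    if only_increases:
        return sum(max(0.0, d) for d in diffs)
    return sum(abs(d) for d in diffs)


def aup_reward(primary_reward, q_aux, s, a, noop, lam=0.1, scale=1.0):
    """The penalized reward the agent is actually trained on (a sketch)."""
    return primary_reward(s, a) - (lam / scale) * aup_penalty(q_aux, s, a, noop)
```

With only_increases=True, losses of attainable utility go unpenalized, which is the "penalize power, not impact" change; replacing q_aux with a Q-function for the primary goal corresponds to the second improvement.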

Rohin's opinion: I think the plan "ensure that the AI systems we build don't seek power" is pretty reasonable and plausibly will be an important part of AI alignment. However, the implementation of AUP is trying to do this under the threat model of optimal agents with potentially unaligned primary goals. I think this is probably going to do something quite different from the conceptual version of AUP, because impact (as defined in this sequence) occurs only when the agent's beliefs change, which doesn't happen for optimal agents in deterministic environments. The current implementation of AUP tries to get around this using proxies for power (but these can be gamed) or by defining "dumber" beliefs against which power is measured (but this fails to leverage the AI system's understanding of the world). See this comment for more details.

Note that the author himself is more excited about AUP as deconfusion, rather than as a solution to AI alignment, though he is more optimistic about the implementation of AUP than I am.

Subagents and impact measures, full and fully illustrated (Stuart Armstrong) (summarized by Flo): Impact measures reduce the reward an agent A receives for reaching a state s by the distance between s and some baseline state s'. This article clarifies the interaction of impact measures and subagents (AN #86).

In the attainable utility framework (AN #39) the distance metric measures changes in A's power, its ability to optimize a set of auxiliary rewards R, relative to doing nothing. This means that A is never penalized for doing nothing. Thus, if A manages to construct a subagent SA without getting penalized for that, A can just stop doing anything afterwards without getting a penalty, no matter what SA does.

For the stepwise inaction baseline, where s' is the state that A would have reached had it done nothing in the last time step, A can completely avoid the penalty by building an SA that commits to balancing out any change in A's ability to optimize the auxiliary reward R (should A decide to optimize R), but that becomes an optimizer for A's real reward as soon as A does nothing in the next step.

For the inaction baseline, where s' is the state reached if A had not done anything for the whole episode, it gets more complicated. When only gains in power are penalized, SA has to be built such that it does not cause a significant increase in A's ability to optimize R, even if A decided to optimize R later on. If SA is constructed such that A cannot control its behaviour after some time has passed, the penalty A receives for the construction only depends on SA's actual effect on A's ability to optimize R. This is a lot weaker than the original penalty, which was about the possible effects of A's actions on R.
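To make the two baselines concrete, here is a rough sketch; the env.simulate model and reset behaviour are assumptions for illustration.

```python
def rollout_baselines(env, actions, noop):
    """Track the two baseline states for a fixed sequence of actions.

    Assumes a hypothetical deterministic environment model with a
    simulate(state, action) method that returns the next state without
    affecting the real environment.
    """
    # Inaction baseline: do nothing from the start of the episode;
    # it does not depend on what the agent actually does.
    inaction_state = env.reset()
    for _ in actions:
        inaction_state = env.simulate(inaction_state, noop)

    # Stepwise inaction baseline: at every step, branch off the agent's
    # actual previous state and take noop instead of the chosen action.
    state = env.reset()
    stepwise_baselines = []
    for a in actions:
        stepwise_baselines.append(env.simulate(state, noop))
        state = env.simulate(state, a)
    return inaction_state, stepwise_baselines
```

The stepwise baseline re-branches from whatever the agent actually did, which is what the subagent construction exploits: once SA exists, A's subsequent inaction incurs no penalty relative to this baseline, no matter what SA does.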

In the case where decreases in power are also penalized, A needs to make sure that building SA does not reduce its ability to optimize R. This is not always possible without A retaining some control over SA, but giving up control was only necessary to prevent the construction of SA from increasing A's potential to optimize R. It seems like it is often possible for A to finetune the amount of control it retains over SA such that A's ability to optimize R barely changes while SA is able to do things A could not do unpenalized.

In the most extreme example, A boxes itself into SA and does nothing in reality, but is counterfactually able to intervene on SA's policy. If A is able to cleverly put constraints on its own future actions in other ways, no subagents are needed to avoid the penalty. Other impact penalties like relative reachability (AN #10) also allow subagents to be constructed, but there, constructing a subagent does not undermine the penalty.

Flo's opinion: The article contains many useful examples in a grid world and I recommend reading it for a deeper understanding of the problem. The main problem with the attainable utility penalty seems to be the definition of power. While this has proven to be quite useful in simple examples, it seems like agents in more complex environments would be able to disentangle their ability to optimize their own reward and the auxiliary rewards, for example by essentially committing to never optimize the auxiliary rewards beyond a certain threshold. Such an agent would still look very powerful from the outside and I don't see why power-reducing commitments would diminish the agent's incentive to take away power from others. So while the catastrophic convergence conjecture, which states that unaligned goals tend to lead to catastrophic optimal policies because of power-seeking incentives, still rings true, it seems like we need to look at power from our perspective instead of the agent's.

Rohin's opinion: I agree with Flo above: the issue is that AUP is measuring a proxy for our intuitive notion of power that falls apart under adversarial optimization. In particular, while it is normally reasonable to measure power by looking at the ability to optimize a set of auxiliary reward functions, this characterization no longer works when the agent can ensure that it won't be able to optimize those specific rewards, while still being able to optimize its primary reward. Subagents are a particularly clean way of demonstrating the problem.

Introducing SafeLife: Safety Benchmarks for Reinforcement Learning (Carroll Wainwright et al) (summarized by Rohin): So far, techniques to avoid negative side effects have only been tested on simple gridworlds (AN #10, AN #39) or hypotheticals (AN #45). SafeLife aims to provide a high-dimensional environment in which negative side effects are likely. It is based on Conway's Game of Life, which allows for complex effects arising out of relatively simple rules. An agent is given the ability to move, create life in an adjacent cell, or destroy life in an adjacent cell. With the specified reward function, the agent must build desired patterns, remove undesired patterns, and navigate to the exit.

The challenge comes when there are additional "neutral" patterns in the environment. In this case, we want the agent to leave those patterns alone, even if disrupting them would allow it to complete the main task faster. The post shows several examples of agents attempting these levels. Vanilla RL agents don't avoid side effects at all, and so unsurprisingly they do quite badly. An agent with a naive impact measure that simply says to preserve the initial state can correctly solve levels where all of the "neutral" patterns are static, but has much more trouble when the existing patterns are dynamic (i.e. they oscillate over time).
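For intuition about why these "neutral" patterns are fragile, here is a minimal NumPy sketch of the Game of Life update that SafeLife builds on (SafeLife layers the agent, goals, and additional cell types on top of these rules, so this is only the core dynamics):

```python
import numpy as np

def life_step(board: np.ndarray) -> np.ndarray:
    """One update of Conway's Game of Life on a 2D array of 0s and 1s.

    A live cell survives with 2 or 3 live neighbours; a dead cell becomes
    alive with exactly 3 live neighbours. Edges wrap around (toroidal)
    for simplicity.
    """
    # Count each cell's 8 neighbours by summing shifted copies of the board.
    neighbours = sum(
        np.roll(np.roll(board, dy, axis=0), dx, axis=1)
        for dy in (-1, 0, 1) for dx in (-1, 0, 1)
        if (dy, dx) != (0, 0)
    )
    survive = (board == 1) & ((neighbours == 2) | (neighbours == 3))
    birth = (board == 0) & (neighbours == 3)
    return (survive | birth).astype(board.dtype)
```

Because every cell's fate depends on its neighbours, touching even a single cell can destroy or permanently alter a nearby pattern, and oscillating patterns change from step to step on their own, which is presumably why a measure that simply preserves the initial state struggles on dynamic levels.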

Read more: Paper: SafeLife 1.0: Exploring Side Effects in Complex Environments

Rohin's opinion: I am a big fan of benchmarks; they seem to be a prerequisite to making a lot of quantitative progress (as opposed to more conceptual progress, which seems more possible to do without benchmarks). This benchmark seems particularly nice to me because the "side effects" which need to be avoided haven't been handcoded into the benchmark, but instead arise from some simple rules that produce complex effects.

TECHNICAL AI ALIGNMENT

HANDLING GROUPS OF AGENTS

TanksWorld: A Multi-Agent Environment for AI Safety Research (Corban G. Rivera et al) (summarized by Asya): This paper presents TanksWorld, a simulation environment that attempts to illustrate three important aspects of real-world AI safety challenges: competing performance objectives, human-machine teaming, and multi-agent competition. TanksWorld pits two teams of N tanks each against one another. Tanks move and shoot while navigating a closed arena with obstacles, and are rewarded for killing opponent tanks and penalized for killing neutral and allied tanks according to a specified reward function. Each tank is controlled by either its own AI or a special policy meant to mimic a 'human' teammate. Each individual tank can only see a small portion of its environment, and must communicate with its teammates to gain more information. The following parameters can be varied to emphasize different research challenges:

- The communication range between tanks -- meant to represent environmental uncertainty.

- The number of neutral tanks and obstacles -- meant to represent the extent to which tanks must care about 'safety', i.e. avoid collateral damage.

- The control policies of teammates -- meant to represent the variability of human-machine teams.
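A hypothetical configuration sketch, just to show how the three research axes above can be varied independently (the field names are invented, not the environment's actual API):

```python
from dataclasses import dataclass

@dataclass
class TanksWorldConfig:
    """Hypothetical configuration for the parameters listed above."""
    num_tanks_per_team: int = 5       # N vs. N tanks
    comm_range: float = 10.0          # environmental uncertainty
    num_neutral_tanks: int = 3        # how much collateral damage matters
    num_obstacles: int = 8
    teammate_policy: str = "scripted_human"   # or "learned", etc.

# Example: emphasize safety by adding neutral tanks and limiting communication.
safety_heavy = TanksWorldConfig(comm_range=5.0, num_neutral_tanks=10)
```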

Asya's opinion: I am generally excited about more work on demonstrating safety challenges; I think it helps to seed and grow the field in concrete directions. I am particularly excited about the possibility for TanksWorld to demonstrate multi-agent safety problems with agents in direct competition. I feel unsure about whether TanksWorld will be a good demonstration of general problems with human-machine interaction -- intuitively, that seems to me like it would be very difficult to capture and require more complex real-world modeling.

FORECASTING

Distinguishing definitions of takeoff (Matthew Barnett) (summarized by Rohin): This post lists and explains several different "types" of AI takeoff that people talk about. Rather than summarize all the definitions (which would only be slightly shorter than the post itself), I'll try to name the main axes that definitions vary on (but as a result this is less of a summary and more of an analysis):

1. Locality. It could be the case that a single AI project far outpaces the rest of the world (e.g. via recursive self-improvement), or that there will never be extreme variations amongst AI projects across all tasks, in which case the "cognitive effort" will be distributed across multiple actors. This roughly corresponds to the Yudkowsky-Hanson FOOM debate, and the latter position also seems to be that taken by CAIS (AN #40).

2. Wall clock time. In Superintelligence, takeoffs are defined based on how long it takes for a human-level AI system to become strongly superintelligent, with "slow" being decades to centuries, and "fast" being minutes to days.

3. GDP trend extrapolation. Here, a continuation of an exponential trend would mean there is no takeoff (even if we some day get superintelligent AI), a hyperbolic trend where the doubling time of GDP decreases in a relatively continuous / gradual manner counts as continuous / gradual / slow takeoff, and a curve which shows a discontinuity would be a discontinuous / hard takeoff.
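As a toy illustration of the GDP-based definitions, here are three made-up curves; all numbers are arbitrary.

```python
import numpy as np

years = np.arange(0.0, 50.0, 0.25)

# No takeoff (by this definition): exponential growth with a constant doubling time.
exponential_gdp = 100.0 * 2.0 ** (years / 20.0)  # doubles every 20 years

# Continuous / gradual / slow takeoff: hyperbolic growth, where the doubling
# time shrinks as GDP grows (this toy curve blows up at year 60).
hyperbolic_gdp = 100.0 * 60.0 / (60.0 - years)

# Discontinuous / hard takeoff: the trend suddenly jumps to a new, much faster regime.
discontinuous_gdp = np.where(
    years < 40.0,
    100.0 * 2.0 ** (years / 20.0),
    1000.0 * 2.0 ** ((years - 40.0) / 2.0 + 2.0),
)
```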

Rohin's opinion: I found this post useful for clarifying exactly which axes of takeoff people disagree about, and also for introducing me to some notions of takeoff I hadn't seen before (though I haven't summarized them here).

Will AI undergo discontinuous progress? (Sammy Martin) (summarized by Rohin): This post argues that the debate over takeoff speeds is over a smaller issue than you might otherwise think: people seem to be arguing for either discontinuous progress, or continuous but fast progress. Both camps agree that once AI reaches human-level intelligence, progress will be extremely rapid; the disagreement is primarily about whether there is already quite a lot of progress before that point. As a result, these differences don't constitute a "shift in arguments on AI safety", as some have claimed.

The post also goes through some of the arguments and claims that people have made in the past, which I'm not going to summarize here.

Rohin's opinion: While I agree that the debate about takeoff speeds is primarily about the path by which we get to powerful AI systems, that seems like a pretty important question to me with many ramifications (AN #62).

OTHER PROGRESS IN AI

REINFORCEMENT LEARNING

On Catastrophic Interference in Atari 2600 Games (William Fedus, Dibya Ghosh et al) (summarized by Rohin): One common worry with deep learning is the possibility of catastrophic interference: as the model uses gradients to learn a new behaviour, those same gradients cause it to forget past behaviours. In model-free deep RL, this would be particularly harmful in long, sequential tasks as in hard exploration problems like Montezuma’s Revenge: after the model learns how to do the first few subtasks, as it is trying to learn the next subtask, it would “forget” the first subtasks, degrading performance. The authors set out to test this hypothesis.

If this hypothesis were true, there would be an easy way to improve performance: once you have learned to perform the first subtask, just create a brand new neural net for the next subtask, so that training for this next subtask doesn’t interfere with past learning. Since the new agent has no information about what happened in the past, and must just “pick up” from wherever the previous agent left off, it is called the Memento agent (a reference to the movie of the same name). One can then solve the entire task by executing each agent in sequence.

In practice, they train an agent until its reward plateaus. They train a new Memento agent starting from the states that the previous agent reached, and note that it reliably makes further progress in hard exploration games like Montezuma’s Revenge, and not in “steady-state” games like Pong (where you wouldn’t expect as much catastrophic interference). Of course, with the Memento agent, you get both twice the training time and twice the model size, which could explain the improvement. They compare against giving the original agent twice the compute and model capacity, and find that Memento still does significantly better. They also present some fine-grained experiments which show that for a typical agent, training on specific contexts adversely affects performance on other contexts that are qualitatively different.
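A rough sketch of the Memento procedure as just described; the method names (make_agent, plateaued, final_states) are placeholders, not the paper's code.

```python
def train_with_memento(make_agent, env, plateaued, num_stages=2):
    """Sketch of the Memento setup; all environment/agent methods are placeholders.

    Each stage trains a brand-new network starting from the states the
    previous agent reached, so gradients for later subtasks cannot
    interfere with what earlier agents already learned.
    """
    agents = []
    start_states = env.initial_states()
    for _ in range(num_stages):
        agent = make_agent()                        # fresh weights, nothing shared
        while not plateaued(agent):                 # e.g. reward has stopped improving
            agent.train_episode(env, start_from=start_states)
        agents.append(agent)
        start_states = agent.final_states(env)      # where this agent's episodes tend to end
    return agents                                   # at test time, run the agents in sequence
```

Since running two agents in sequence doubles both training time and parameters, the comparison that matters is against a single agent given twice the compute and capacity, which is the comparison the paper makes.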

Rohin's opinion: I think this is pretty strong evidence that catastrophic interference is in fact a problem with the Atari games. On the other hand, OpenAI Five (AN #13) also has many, many subtasks that in theory should interfere with each other, and it still seems to train well. Some guesses at how to reconcile these facts:

1) the tasks in Dota are more correlated than in (say) Montezuma’s Revenge, and so interference is less of a problem (seems plausible)

2) the policy in OpenAI Five was large enough that it could easily allocate separate capacity for various subtasks (seems unlikely, I believe the policy was relatively small), or

3) with sufficiently large-scale training, there is more “exploration” in weight-space until a configuration is found where interference doesn’t happen (seems unlikely given that large batch sizes help, since they tend to reduce weight-space exploration).

DEEP LEARNING

A new model and dataset for long-range memory (Jack W. Rae et al) (summarized by Nicholas): A central challenge in language modeling is capturing long-range dependencies. For example, a model needs to be able to identify the antecedent of a pronoun even if it is much earlier in the text. Existing datasets consist of news and Wikipedia articles, where articles have average lengths ranging from 27 to 3,600 words. This paper introduces a dataset of Project Gutenberg books, PG-19, where each book has a much longer average length of 69,000 words. This benchmark enables comparison of how well algorithms can make use of information that is spread out across a much larger context.

They then introduce the Compressive Transformer, which builds on the TransformerXL (AN #44). The TransformerXL saves old activations into a FIFO queue, discarding them when the queue is full. The Compressive Transformer instead has two FIFO queues: the first stores the activations just like TransformerXL, but when activations are ejected, they are compressed and added to the second queue. This functions as a sort of long-term memory, storing information from a longer period of time but in a compressed format.
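A simplified sketch of the two-queue bookkeeping: compress stands in for the learned compression function (e.g. a strided 1D convolution), and the exact buffer handling here is an illustrative assumption.

```python
from collections import deque

class CompressiveMemory:
    """Two-queue memory: recent activations plus compressed older activations."""

    def __init__(self, mem_size, comp_mem_size, compression_rate, compress):
        self.memory = deque(maxlen=mem_size)           # recent activations, as in TransformerXL
        self.compressed = deque(maxlen=comp_mem_size)  # older activations, in compressed form
        self.rate = compression_rate                   # how many old activations map to one slot
        self.compress = compress                       # learned compression function
        self._evicted = []

    def add(self, activations):
        if len(self.memory) == self.memory.maxlen:
            self._evicted.append(self.memory[0])       # oldest activation is about to be pushed out
        self.memory.append(activations)
        if len(self._evicted) == self.rate:            # compress a block of evicted activations
            self.compressed.append(self.compress(self._evicted))
            self._evicted = []
```

Attention is then (roughly) computed over the compressed memory, the regular memory, and the current segment, so old information remains accessible at a coarser granularity.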

They try several types of compression function and find that a 1D convolutional compression function trained with an auxiliary loss works best; the loss permits lossy compression, so information that is not attended to can be discarded. The compression network and the Transformer optimize independent losses without any mixing.

They find that the Compressive Transformer improves on TransformerXL on their new PG-19 dataset and is state of the art on the existing WikiText-103 and Enwik8 benchmarks. They also inspect what the network attends to and find that more attention is paid to the compressed memory than to the oldest activations in regular memory, showing that the network is preserving some valuable information.

Read more: Paper: Compressive Transformers for Long-Range Sequence Modelling

Nicholas's opinion: I like the idea of saving long-term memory in a more efficient but lower-dimensional format than short-term memory. The current trend (AN #87) in language modelling is that more computation leads to better results, so I think that algorithms that focus computation on the most relevant information are promising. I’d be interested to see (and am curious whether the authors tried) more continuous variants of this where older information is compressed at a higher rate than newer information, since it seems rather arbitrary to split into two FIFO queues where one has a fixed compression rate.

I’m not well calibrated on the meaning of the evaluation metrics for NLP, so I don’t have a sense of how much of an improvement this is over the TransformerXL. I looked through some of the example text they gave in the blog post and thought it was impressive, though with clear room for improvement.

MACHINE LEARNING

Quantifying Independently Reproducible Machine Learning (Edward Raff) (summarized by Flo): While reproducibility refers to our ability to obtain results that are similar to the results presented in a paper, independent reproducibility requires us to be able to reproduce similar results using only what is written in the paper. Crucially, this excludes using the author's code. This is important, as a paper should distill insights rather than just report results. If minor technical details in a reimplementation can lead to vastly different results, this suggests that the paper did not accurately capture all important aspects. The distinction between reproducibility and independent reproducibility is similar to the previously suggested distinctions between reproducibility of methods and reproducibility of conclusions (AN #66) and replicability and reproducibility.

The author attempted to replicate 255 machine learning papers, successfully replicating 162 of them, and ran a statistical analysis on the results. Factors that helped with independent reproduction included specified hyperparameters, ease of reading, and authors answering emails. Meanwhile, neither shared code nor the inclusion of pseudo-code robustly increased the rate of reproduction. Interestingly, papers with a strong focus on theory performed worse than mostly empirical or mixed ones. While more rigour can certainly be valuable in the long term, including learning bounds or complicated math just for the sake of it should thus be avoided. Most of the data is publicly available and the author encourages further analysis.

Read more: Paper: A Step Toward Quantifying Independently Reproducible Machine Learning Research

Flo's opinion: I appreciate this hands-on approach to evaluating reproducibility and think that independent reproducibility is important if we want to draw robust conclusions about the general properties of different ML systems. I am a bit confused about the bad reproducibility of theory-heavy papers: One hypothesis would be that there is little incentive to provide theoretical justification for approaches that work robustly, as empirical evidence for their merits is generated more easily than theoretical results. This relationship might then flip, as results get more brittle.

Rohin's opinion: My explanation for the theoretical results is different: most theory tends to make at least a few assumptions that don't actually hold in order to obtain interesting guarantees. A paper will typically only include empirical results that confirm the theory, which will tend to select for environments in which the assumptions are minimally violated. If you then try to reproduce the paper in a new setting, it is more likely that the assumption is violated more strongly, and so the theoretical results don't show up any more.

FEEDBACK

I'm always happy to hear feedback; you can send it to me, Rohin Shah, by replying to this email.

PODCAST

An audio podcast version of the Alignment Newsletter is available. This podcast is an audio version of the newsletter, recorded by Robert Miles.

COMMENTS

On the other hand, OpenAI Five (AN #13) also has many, many subtasks, that in theory should interfere with each other, and it still seems to train well.

True, but OA5 is inherently a different setup than ALE. Catastrophic forgetting is at least partially offset by the play against historical checkpoints, which doesn't have an equivalent in your standard ALE; the replay buffer typically turns over so old experiences disappear, and there's no adversarial dynamics or AlphaStar-style population of agents which can exploit forgotten areas of state-space. Since Rainbow is an off-policy DQN, I think you could try saving old checkpoints and periodically spending a few episodes running old checkpoints and adding the experience samples to the replay buffer, but that might not be enough.

There's also the batch size. The OA5 batch size was ridiculously large. Given all of the stochasticity in a DoTA2 game & additional exploration, that covers an awful lot of possible trajectories.


In Gmail, everything after

They also present some fine-grained experiments which show that for a typical agent, training on specific contexts adversely affects performance on other contexts that are qualitatively different.

Is cut off by default due to length.

True, but OA5 is inherently a different setup than ALE.

I broadly agree with this, but have some nitpicks on the specific things you mentioned.

Catastrophic forgetting is at least partially offset by the play against historical checkpoints, which doesn't have an equivalent in your standard ALE

But since you're always starting from the same state, you always have to solve the earlier subtasks? E.g. in Montezuma's revenge in every trajectory you have to successfully get the key and climb the ladder; this doesn't change as you learn more.

there's no adversarial dynamics or AlphaStar-style population of agents which can exploit forgotten areas of state-space

The thing about Montezuma's revenge and similar hard exploration tasks is that there's only one trajectory you need to learn; and if you forget any part of it you fail drastically; I would by default expect this to be better than adversarial dynamics / populations at ensuring that the agent doesn't forget things.

There's also the batch size. The OA5 batch size was ridiculously large. Given all of the stochasticity in a DoTA2 game & additional exploration, that covers an awful lot of possible trajectories.

Agreed, but the Memento observation also shows that the problem isn't about exploration: if you make a literal copy of the agent that gets 6600 reward and train that from the 6600 reward states, it reliably gets more reward than the original agent got. The only difference between the two situations is that in the original situation, the original agent still had to remember how to get to the 6600 reward states in order to maintain its performance, while the new agent was allowed to start directly from that state and so was allowed to forget how to get to the 6600 reward states.

In particular, I would guess that the original agent does explore trajectories in which it gets higher reward (because the Memento agent definitely does), but for whatever reason it is unable to learn as effectively from those trajectories.

Is cut off by default due to length.

Thanks, we noticed this after we sent it out (I think it didn't happen in our test emails for whatever reason). Hopefully the kinks in the new design will be worked out by next week.

(That being said, I've seen other newsletters which are always cut off by GMail, so it may not be possible to do this when using a nice HTML design... if anyone knows how to fix this I'd appreciate tips.)

The thing about Montezuma's revenge and similar hard exploration tasks is that there's only one trajectory you need to learn; and if you forget any part of it you fail drastically; I would by default expect this to be better than adversarial dynamics / populations at ensuring that the agent doesn't forget things.

But is it easier to remember things if there's more than one way to do them?

Unclear, seems like it could go either way.

If you aren't forced to learn all the ways of doing the task, then you should expect the neural net to learn only one of the ways. So maybe it's that the adversarial nature of OpenAI Five forced it to learn all the ways, and it was then paradoxically easier to remember all of the ways than just one of the ways.

I think this is probably going to do something quite different from the conceptual version of AUP, because impact (as defined in this sequence) occurs only when the agent's beliefs change, which doesn't happen for optimal agents in deterministic environments. The current implementation of AUP tries to get around this using proxies for power (but these can be gamed) or by defining "dumber" beliefs that power is measured relative to (but this fails to leverage the AI system's understanding of the world).

Although the point is more easily made in the deterministic environments, impact doesn't happen in expectation for optimal agents in stochastic environments, either. This is by conservation of expected AU (this is the point I was making in The Gears of Impact).

Similar things can be said about power gain – when we think an agent is gaining power... gaining power compared to what? The agent "always had" that power, in a sense – the only thing that happens is that we realize it.

This line of argument makes me more pessimistic about there being a clean formalization of "don't gain power". I do think that the formalization of power is correct, but I suspect people are doing something heuristic and possibly kludgy when we think about someone else gaining power.

Yup, strongly agree. I focused on the deterministic case because the point is easiest to understand there, but it also applies in the stochastic case.

I suspect people are doing something heuristic and possibly kludgy when we think about someone else gaining power.

I agree, though if I were trying to have a nice formalization, one thing I might do is look at what "power" looks like in a multiagent setting, where you can't be "larger" than the environment, and so you can't have perfectly calibrated beliefs about what's going to happen.