Scott Emmons



Neat to see the follow-up from your introductory prediction post on this project!

In my prediction I was particularly interested in the following stats:
1. If you put the cheese in the top-left and bottom-right of the largest maze size, what fraction of the time does the out-of-the-box policy you trained go to the cheese?
2. If you try to edit the mouse's activations to make it go to the top left or bottom right of the largest mazes (leaving the cheese wherever it spawned by default in the top right), what fraction of the time do you succeed in getting the mouse to go to the top left or bottom right? What percentage of network activations are you modifying when you do this?

Do you have these stats? I read some, but not all, of this post, and I didn't see answers to these questions.

Neat experimental setup. Goal misgeneralization is one of the things I'm most worried about in advanced AI, so I'm excited to see you studying it in more detail!

I want to jot down my freeform analysis of what I expect to happen. (I wrote these predictions independently, without looking at anyone else's analysis.)

In very small mazes, I think the mouse will behave as if it's following this algorithm: find the shortest path to the cheese location. In very large mazes, I think the mouse will behave as if it's following this algorithm: first, go to the top-right region of the maze. Then, go to the exact location of the cheese. As we increase the maze size, I expect the mouse to have a phase transition from the first behavior to the second behavior. I don't know at exactly what size the phase transition will occur.
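The two hypothesized behaviors can be written out as a rough sketch. This is only an illustration of the "as if" algorithms described above, not a claim about what the trained network computes. The grid encoding (0 = open, 1 = wall, row 0 at the top), the function names, and the simplification of the "top-right region" to the single top-right cell are all assumptions made for this example.

```python
from collections import deque

def shortest_path(grid, start, goal):
    """Breadth-first search over open cells; returns a list of cells or None."""
    rows, cols = len(grid), len(grid[0])
    parent = {start: None}
    queue = deque([start])
    while queue:
        cell = queue.popleft()
        if cell == goal:
            path = []
            while cell is not None:
                path.append(cell)
                cell = parent[cell]
            return path[::-1]
        r, c = cell
        for nxt in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            nr, nc = nxt
            in_bounds = 0 <= nr < rows and 0 <= nc < cols
            if in_bounds and grid[nr][nc] == 0 and nxt not in parent:
                parent[nxt] = cell
                queue.append(nxt)
    return None

def small_maze_behavior(grid, mouse, cheese):
    """Hypothesized small-maze behavior: head straight for the cheese."""
    return shortest_path(grid, mouse, cheese)

def large_maze_behavior(grid, mouse, cheese):
    """Hypothesized large-maze behavior: top-right first, then the cheese.

    Assumption: the top-right cell is open and stands in for the region.
    """
    top_right = (0, len(grid[0]) - 1)
    to_corner = shortest_path(grid, mouse, top_right)
    to_cheese = shortest_path(grid, top_right, cheese)
    if to_corner is None or to_cheese is None:
        return None
    return to_corner + to_cheese[1:]  # drop the duplicated corner cell

# Toy open 3x3 maze: mouse in the bottom-left, cheese in the bottom-right.
grid = [[0, 0, 0], [0, 0, 0], [0, 0, 0]]
direct = small_maze_behavior(grid, (2, 0), (2, 2))
detour = large_maze_behavior(grid, (2, 0), (2, 2))
```

In this toy case the second behavior takes a visibly longer route through the top-right cell, which is the kind of difference the predicted phase transition would show up as.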

I expect that for very small mazes, the mouse will learn how to optimally get to the cheese, no matter where the cheese is. 

  • Prediction: (80% confidence) I think we'll be able to edit some part of the mouse's neural network (say, <10% of its activations) so that it goes to arbitrary locations in very small mazes.

I expect that for very large mazes, the mouse will act as follows: it will first just try to go to the top-right region of the maze. Once it gets to the top-right region of the maze, it will start trying to find the cheese exactly. My guess is that there's a trigger in the model's head for when it switches from going to the top-right corner to finding the cheese exactly. I'd guess this trigger activates either when the mouse is in the top-right corner of the maze, or when the mouse is near the cheese. (Or perhaps a mixture of both these triggers exists in the model's head.)

  • Prediction: (75% confidence) The mouse will struggle to find cheese in the top-left and bottom-right of very large mazes (i.e., if we put the cheese in the top-left or bottom-right of the maze, the model will have <33% success rate of reaching it within the average number of steps it takes the mouse to reach the cheese in the top-right corner). I think the mouse will go to the top-right corner of the maze in these cases.
  • Prediction: (75% confidence) We won't be able to easily edit the model's activations to make the mouse go to the top-left or bottom-right of very large mazes. Concretely, the team doing this project won't find <10% of network activations that they can edit to make the mouse reach cheese in the top-left or bottom-right of the maze with >= 33% success rate within the average number of steps it takes the mouse to reach the cheese in the top-right corner.
  • Prediction: (55% confidence) I weakly believe that if we put the cheese in the bottom-left corner of a very large maze, the mouse will go to the cheese. (I.e., the mouse will quickly find the cheese in 50% of very large mazes with cheese in the bottom-left corner.) I weakly think that there will be a trigger in the model's head that recognizes that it is close to the cheese at the start of the episode, and that this will activate the cheese-finding mode. But I only weakly believe this. I think it's possible that this trigger doesn't fire, and instead the mouse just goes to the top-right corner of the maze when the cheese starts out in the bottom-left corner.
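To make the success criterion in these predictions concrete, here is a minimal sketch of how I would score it. The function name and the step counts are hypothetical placeholders, not measurements from the project; `None` marks an episode where the mouse never reached the cheese.

```python
def success_rate_within_budget(steps_to_cheese, baseline_steps):
    """Fraction of episodes that reach the cheese within the step budget.

    The budget is the average number of steps taken in the baseline
    (cheese in the top-right) episodes.
    """
    budget = sum(baseline_steps) / len(baseline_steps)
    successes = sum(
        1 for steps in steps_to_cheese if steps is not None and steps <= budget
    )
    return successes / len(steps_to_cheese)

# Placeholder numbers purely for illustration.
baseline = [10, 20, 30]          # steps to reach top-right cheese
relocated = [15, None, 25, 20]   # steps with cheese moved elsewhere
rate = success_rate_within_budget(relocated, baseline)
```

A prediction such as "<33% success rate" would then be checked by comparing `rate` against 0.33.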

Another question is: Will we be able to infer the exact cheese location by just looking at the model's internal activations?

  • Prediction: (80% confidence) Yes. The cheese is easy for a convnet to see (it's distinct in color from everything else), and it's key information for solving the task. So I think the policy will learn to encode this information. Idk about exactly which layer(s) in the network will contain this information. My prediction is that the team doing this project will be able to probe at least one of the layers of the network to obtain the exact location of the cheese with >90% accuracy, for all maze sizes.
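As a sketch of what "probing a layer" could look like, here is a nearest-centroid probe, which is one simple kind of linear classifier. I'm swapping it in for whatever probe the team would actually use. The activations below are synthetic stand-ins generated for this example (each cheese location gets a random code plus noise), so the resulting accuracy says nothing about the real network.

```python
import random

def fit_centroid_probe(activations, labels):
    """Average the activation vectors observed for each cheese location."""
    sums, counts = {}, {}
    for vec, label in zip(activations, labels):
        if label not in sums:
            sums[label] = [0.0] * len(vec)
            counts[label] = 0
        sums[label] = [s + v for s, v in zip(sums[label], vec)]
        counts[label] += 1
    return {lab: [s / counts[lab] for s in sums[lab]] for lab in sums}

def probe_predict(centroids, vec):
    """Predict the location whose centroid is closest to the activation."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(centroids, key=lambda lab: sq_dist(centroids[lab], vec))

def probe_accuracy(centroids, activations, labels):
    hits = sum(probe_predict(centroids, v) == y for v, y in zip(activations, labels))
    return hits / len(labels)

def synthetic_activations(n, locations, dim=16, noise=0.05, seed=0):
    """Hypothetical data: one random code per location, plus Gaussian noise."""
    rng = random.Random(seed)
    codes = {loc: [rng.gauss(0.0, 1.0) for _ in range(dim)] for loc in locations}
    labels = [locations[i % len(locations)] for i in range(n)]
    acts = [[c + rng.gauss(0.0, noise) for c in codes[y]] for y in labels]
    return acts, labels

locations = [(0, 0), (0, 4), (4, 0), (4, 4)]  # made-up cheese cells
acts, labels = synthetic_activations(200, locations)
centroids = fit_centroid_probe(acts[:150], labels[:150])      # train split
accuracy = probe_accuracy(centroids, acts[150:], labels[150:])  # held-out split
```

With real data, `acts` would be one layer's activations per maze and `labels` the true cheese cell; the prediction above amounts to held-out `accuracy` exceeding 0.9 for at least one layer.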

Thanks for writing. I think this is a useful framing!

Where does the term "structural" come from?

The related literature I've seen uses the word "systemic", eg, the field of system safety. A good example is this talk (and slides, eg slide 24).

Thanks for writing this! I appreciate it and hope you share more things that you write faster without totally polishing everything.

One word of caution I'd share is: beware of spending too much effort running experiments on toy examples. I think toy examples are useful to gain conceptual clarity. However, if your idea is primarily empirical (such as an improvement to a deep neural network architecture), then I would recommend spending basically zero time running toy experiments.

With deep learning, it's often the case that improvements on toy examples don't scale to being improvements on real examples. In my experience, lots of papers in reinforcement learning don't actually work because the authors only tried out the method on toy examples. (Or, they tried out the method on more complex examples, but they didn't publish those experiments because the method didn't work.) So trying out a new empirical method on a toy example provides little information about how valuable the empirical method will be on real examples.

The flip side of this warning is advice: for empirical projects, test your idea on as diverse and complex a set of tasks as possible. Good empirical ideas are few, and extensive empirical testing is the best way for a researcher to determine whether their idea will stand the test of time.

When running diverse and complex experiments, it is still important to design the simplest possible experiment that will be informative, as Lawrence describes in the section "Mock or simplify difficult components." I suggest being simple (such as Lawrence's example of using text-davinci-003 instead of finetuning one's own model) rather than being toy (using a tiny or hard-coded language model).

"I think the main reasons to work on mechanistic interp do not look like 'we can literally understand all the cognition behind a powerful AI', but instead 'we can bound the behavior of the AI'"

I assume "bound the behavior" means provide a worst-case guarantee. But if we don't understand all the cognition, how can we provide such a guarantee? How do we know that the part of the AI we don't understand wouldn't ruin our guarantee?

"we can help other, weaker AIs understand the powerful AI"

My understanding of interpretability is that humans understand what the AI is doing. Weaker AIs understanding the powerful AI doesn't feel like a solution to interpretability. Instead, it feels like a solution to amplification that's still uninterpretable by humans.

What do you think are the top 3 (or top 5, or top handful of) interpretability results to date? If I gave a 5-minute talk called "The Few Greatest Achievements of Interpretability to Date," what would you recommend I include in the talk?

You also claim that GPT-like models achieve "SOTA performance in domains traditionally dominated by RL, like games." You cite the paper "Multi-Game Decision Transformers" for this claim.

But, in Multi-Game Decision Transformers, reinforcement learning (specifically, a Q-learning variant called BCQ) trained on a single Atari game beats Decision Transformer trained on many Atari games. This is shown in Figure 1 of that paper. The authors of the paper don't even claim that Decision Transformer beats RL. Instead, they write: "We are not striving for mastery or efficiency that game-specific agents can offer, as we believe we are still in early stages of this research agenda. Rather, we investigate whether the same trends observed in language and vision hold for large-scale generalist reinforcement learning agents."

It may be that Decision Transformers are on a path to matching RL, but it's important to know that this hasn't yet happened. I'm also not aware of any work establishing scaling laws in RL.

"A supreme counterexample is the Decision Transformer, which can be used to run processes which achieve SOTA for offline reinforcement learning despite being trained on random trajectories."

This is not true. The Decision Transformer paper doesn't run any complex experiments on random data; it only gives a toy example with random data.

We actually ran experiments with Decision Transformer on random data from the D4RL offline RL suite. Specifically, we considered random data from the MuJoCo Gym tasks. We found that when it only has access to random data, Decision Transformer achieves only 4% of the performance that it can achieve when it has access to expert data. (See the D4RL Gym results in our Table 1, and compare "DT" on "random" to "medium-expert".)

The technology [of lethal autonomous drones], from the point of view of AI, is entirely feasible. When the Russian ambassador made the remark that these things are 20 or 30 years off in the future, I responded that, with three good grad students and possibly the help of a couple of my robotics colleagues, it will be a term project [six to eight weeks] to build a weapon that could come into the United Nations building and find the Russian ambassador and deliver a package to him.

-- Stuart Russell on a February 25, 2021 podcast with the Future of Life Institute.
