Alignment Newsletter #49

Rohin Shah

Find all Alignment Newsletter resources here. In particular, you can sign up, or look through this spreadsheet of all summaries that have ever been in the newsletter.

Highlights

Exploring Neural Networks with Activation Atlases (Shan Carter et al): Previous work by this group of people includes The Building Blocks of Interpretability and Feature Visualization, both of which apparently came out before this newsletter started so I don't have a summary to point to. Those were primarily about understanding what individual neurons in an image classifer were responding to, and the key idea was to "name" each neuron with the input that would maximally activate that neuron. This can give you a global view of what the network is doing.

However, such a global view makes it hard to understand the interaction between neurons. To understand these, we can look at a specific input image, and use techniques like attribution. Rather than attribute final classifications to the input, you could attribute classifications to neurons in the network, and then since individual neurons now had meanings (roughly: "fuzzy texture neuron", "tennis ball neuron", etc) you can gain insight to how the network is making decisions for that specific input.

However, ideally we would like to see how the network uses interactions between neurons to make decisions in general; not on a single image. This motivates activation atlases, which analyze the activations of a network on a large dataset of inputs. In particular, for each of a million images, they randomly choose a non-border patch from the image, and compute the activation vector at a particular layer of the network at that patch. This gives a dataset of a million activation vectors. They use standard dimensionality reduction techniques to map each activation vector into an (x, y) point on the 2D plane. They divide the 2D plane into a reasonably sized grid (e.g. 50x50), and for each grid cell they compute the average of all the activation vectors in the cell, visualize that activation vector using feature visualization, and put the resulting image into the grid cell. This gives a 50x50 grid of the "concepts" that the particular neural network layer we are analyzing can reason about. They also use attribution to show, for each grid cell, which class that grid cell most supports.

The paper then goes into a lot of detail about what we can infer from the activation atlas. For example, we can see that paths in activation vector space can correspond to human-interpretable concepts like the number of objects in an image, or moving from water to beaches to rocky cliffs. If we look at activation atlases for different layers, we can see that the later layers seem to get much more specific and complex, and formed of combinations of previous features (e.g. combining sand and water features to get a single sandbar feature).

By looking at images for specific classes, we can use attribution to see which parts of an activation atlas are most relevant for the class. By comparing across classes, we can see how the network makes decisions. For example, for fireboats vs. streetcars, the network looks for windows for both, crane-like structures for both (though less than windows), and water for fireboats vs. buildings for streetcars. This sort of analysis can also help us find mistakes in reasoning -- e.g. looking at the difference between grey whales and great white sharks, we can see that the network looks for the teeth and mouth of a great white shark, including an activation that looks suspiciously like a baseball. In fact, if you take a grey whale and put a patch of a baseball in the top left corner, this becomes an adversarial example that fools the network into thinking the grey whale is a great white shark. They run a bunch of experiments with these human-found adversarial examples and find they are quite effective.

Rohin's opinion: While the authors present this as a method for understanding how neurons interact, it seems to me that the key insight is about looking at and explaining the behavior of the neural network on data points in-distribution. Most possible inputs are off-distribution, and there is not much to be gained by understanding what the network does on these points. Techniques that aim to gain a global understanding of the network are going to be "explaining" the behavior of the network on such points as well, and so will be presenting data that we won't be able to interpret. By looking specifically at activations corresponding to in-distribution images, we can ensure that the data we're visualizing is in-distribution and is expected to make sense to us.

I'm pretty excited that interpretability techniques have gotten good enough that they allow us to construct adversarial examples "by hand" -- that seems like a clear demonstration that we are learning something real about the network. It feels like the next step would be to use interpretability techniques to enable us to actually fix the network -- though admittedly this would require us to also develop methods that allow humans to "tweak" networks, which doesn't really fit within interpretability research as normally defined.

Read more: OpenAI blog post and Google AI blog post

Feature Denoising for Improving Adversarial Robustness (Cihang Xie et al) (summarized by Dan H): This paper claims to obtain nontrivial adversarial robustness on ImageNet. Assuming an adversary can add perturbations of size 16/255 (l_infinity), previous adversarially trained classifiers could not obtain above 1% adversarial accuracy. Some groups have tried to break the model proposed in this paper, but so far it appears its robustness is close to what it claims, around 40% adversarial accuracy. Vanilla adversarial training is how they obtain said adversarial robustness. There has only been one previous public attempt at applying (multistep) adversarial training to ImageNet, as those at universities simply do not have the GPUs necessary to perform adversarial training on 224x224 images. Unlike the previous attempt, this paper ostensibly uses better hyperparameters, possibly accounting for the discrepancy. If true, this result reminds us that hyperparameter tuning can be critical even in vision, and that improving adversarial robustness on large-scale images may not be possible outside industry for many years.

Technical AI alignment

Learning human intent

Using Causal Analysis to Learn Specifications from Task Demonstrations (Daniel Angelov et al)

Reward learning theory

A theory of human values (Stuart Armstrong): This post presents an outline of how to construct a theory of human values. First, we need to infer preferences and meta-preferences from humans who are in "reasonable" situations. Then we need to synthesize these into a utility function, by resolving contradictions between preferences, applying meta-preferences to preferences, and having a way of changing the procedures used to do the previous two things. We then need to argue that this leads to adequate outcomes -- he gives some simple arguments for this, that rely on particular facts about humans (such as the fact that they are scope insensitive).

Preventing bad behavior

Designing agent incentives to avoid side effects (Victoria Krakovna et al): This blog post provides details about the recent update to the relative reachability paper (AN #10), which is now more a paper about the design choices available with impact measures. There are three main axes that they identify:

First, what baseline is impact measurede relative to? A natural choice is to compare against the starting state, but this will penalize the agent for environment effects, such as apples growing on trees. We can instead compare against an inaction baseline, i.e. measuring impact relative to what would have happened if the agent did nothing. Unfortunately, this leads to offsetting behavior: the agent first makes a change to get reward, and then undoes the change in order to not be penalized for impact. This motivates the stepwise inaction baseline, which compares each action against what would have happened if the agent did nothing from that step onwards.

Second, we need a measure by which to compare states. The unreachability measure measures how hard it is to reach the baseline from the current state. However, this "maxes out" as soon as the baseline is unreachability, and so there is no incentive to avoid further irreversible actions. This motivates relative reachability, which computes the set of states reachable from the baseline, and measures what proportion of those states are reachable from the state created by the agent. Attainable utility (AN #25) generalizes this to talk about the utility that could be achieved from the baseline for a wide range of utility functions. (This is equivalent to relative reachability when the utility functions are of the form "1 if state s is ever encountered, else 0".)

Finally, we need to figure how to penalize changes in our chosen measure. Penalizing decreases in the measure allows us to penalize actions that make it harder to do things (what the AUP post calls "opportunity cost"), while penalizing increases in the measure allows us to penalize convergent instrumental subgoals (which almost by definition increase the ability to satisfy many different goals or reach many different states).

Rohin's opinion: Since the AUP post was published about half a year ago, I've been watching this unification of AUP and relative reachability slowly take form, since they were phrased very differently initially. I'm glad to see this finally explained clearly and concisely, with experiments showing the effect of each choice. I do want to put special emphasis on the insight of AUP that the pursuit of convergent instrumental subgoals leads to large increases in "ability to do things", and thus that penalizing increases can help avoid such subgoals. This point doesn't typically make it into the academic writings on the subject but seems quite important.

On the topic of impact measures, I'll repeat what I've said before: I think that it's hard to satisfy the conjunction of three desiderata -- objectivity (no dependence on human values), safety (preventing any catastrophic outcomes) and usefulness (the AI system is still able to do useful things). Impact measures are very clearly aiming for the first two criteria, but usually don't have much to say about the third one. My expectation is that there is a strong tradeoff between the first two criteria and the third one, and impact measures have not dealt with this fact yet, but will have to at some point.

Conservative Agency via Attainable Utility Preservation (Alexander Matt Turner et al): This paper presents in a more academic format a lot of the content that Alex has published about attainable utility preservation, see Towards a New Impact Measure (AN #25) and Penalizing Impact via Attainable Utility Preservation(AN #39).

Interpretability

Exploring Neural Networks with Activation Atlases (Shan Carter et al): Summarized in the highlights!

Adversarial examples

Feature Denoising for Improving Adversarial Robustness (Cihang Xie et al): Summarized in the highlights!

Forecasting

Signup form for AI Metaculus (Jacob Lagerros and Ben Goldhaber): Recently, forecasting platform Metaculus launched a new instance dedicated specifically to AI in order to get good answers for empirical questions (such as AGI timelines) that can help avoid situations like info-cascades. While most questions don't have that many predictions, the current set of beta-users were invited based on forecasting track-record and AI domain-expertise, so the signal of the average forecast should be high.

Some interesting predictions include:

- By end of 2019, will there be an agent at least as good as AlphaStar using non-controversial, human-like APM restrictions? [mean: 58%, median: 66%, n = 26]

- When will there be a superhuman Starcraft II agent with no domain-specific hardcoded knowledge, trained using <=$10,000 of publicly available compute? [50%: 2021 to 2037, with median 2026, n = 35]

This forecast is supported by a Guesstimate model, which estimates current and future sample efficiency of Starcraft II algorithms, based on current performance, algorithmic progress, and the generalization of Moore's law. For algorithmic progress, they look at the improvement in sample efficiency on Atari, and find a doubling time of roughly a year, via DQN --> DDQN --> Dueling DDQN --> Prioritized DDQN --> PPO --> Rainbow --> IMPALA.

Overall, there are 50+ questions, including on malicious use of AI, publishing norms, conference attendance, MIRI's research progress, the max compute doubling trend, OpenAI LP, nationalisation of AI labs, whether financial markets expect AGI, and more. You can sign-up to join here.

AI conference attendance (Katja Grace): This post presents data on attendance numbers at AI conferences. The main result: "total large conference participation has grown by a factor 3.76 between 2011 and 2019, which is equivalent to a factor of 1.21 per year during that period". Looking at the graph, it seems to me that the exponential growth started in 2013, which would mean a slightly higher factor of around 1.3 per year. This would also make sense given that the current boom is often attributed to the publication of AlexNet in 2012.

Field building

Alignment Research Field Guide (Abram Demski): This post gives advice on how to get started on technical research, in particular by starting a local MIRIx research group.

Rohin's opinion: I strongly recommend this post to anyone looking to get into research -- it's a great post; I'm not summarizing it because I want this newsletter to be primarily about technical research. Even if you are not planning to do the type of research that MIRI does, I think this post presents a very different perspective on how to do research compared to the mainstream view in academia. Note though that this is not the advice I'd give to someone trying to publish papers or break into academia. Also, while I'm talking about recommendations on how to do research, let me also recommend Research as a Stochastic Decision Process.

Miscellaneous (Alignment)

Partial preferences needed; partial preferences sufficient (Stuart Armstrong): I'm not sure I fully understand this post, but my understanding is that it is saying that alignment proposals must rely on some information about human preferences. Proposals like impact measures and corrigibility try to formalize a property that will lead to good outcomes; but any such formalization will be denoting some policies as safe and some as dangerous, and there will always exist a utility function according to which the "safe" policies are catastrophic. Thus, you need to also define a utility function (or a class of them?) that safety is computed with respect to; and designing this is particularly difficult.

Rohin's opinion: This seems very similar to the problem I have with impact measures, but I wouldn't apply that argument to corrigibility. I think the difference might be that I'm thinking of "natural" things that agents might want, whereas Stuart is considering the entire space of possible utility functions. I'm not sure what drives this difference.

Understanding Agent Incentives with Causal Influence Diagrams (Tom Everitt et al): This post and associated paper model an agent's decision process using a causal influence diagram -- think of a Bayes net, and then imagine that you add nodes corresponding to actions and utilities. A major benefit of Bayes nets is that the criterion of d-separation can be used to determine whether two nodes are conditionally independent. Once we add actions and utilities, we can also analyze whether observing or intervening on nodes would lead the agent to achieve higher expected utility. The authors derive criteria resembling d-separation for identifying each of these cases, which they call observation incentives (for nodes whose value the agent would like to know) and intervention incentives (for nodes whose value the agent would like to change). They use observation incentives to show how to analyze whether a particular decision is fair or not (that is, whether it depended on a sensitive feature that should not be used, like gender). Intervention incentives are used to establish the security of counterfactual oracles more simply and rigorously.

Rohin's opinion: These criteria are theoretically quite nice, but I'm not sure how they relate to the broader picture. Is the hope that we will be able to elicit the causal influence diagram an AI system is using, or something like it? Or perhaps that we will be able to create a causal influence diagram of the environment, and these criteria can tell us which nodes we should be particularly interested in? Maybe the goal was simply to understand agent incentives better, with the expectation that more knowledge would help in some as-yet-unknown way? None of these seem very compelling to me, but the authors might have something in mind I haven't thought of.

Other progress in AI

Exploration

World Discovery Models (Mohammad Gheshlaghi Azar, Bilal Piot, Bernardo Avila Pires et al)

Miscellaneous (AI)

The Bitter Lesson (Rich Sutton): This blog post is controversial. This is a combination summary and opinion, and so is more biased than my summaries usually are.

Much research in AI has been about embedding human knowledge in AI systems, in order to use the current limited amount of compute to achieve some outcomes. That is, we try to get our AI systems to think the way we think we think. However, this usually results in systems that work currently, but then cannot leverage the increasing computation that will be available. The bitter lesson is that methods like search and learning that can scale to more computation eventually win out, as more computation becomes available. There are many examples that will likely be familiar to readers of this newsletter, such as chess (large scale tree search), Go (large scale self play), image classification (CNNs), and speech recognition (Hidden Markov Models in the 70s, and now deep learning).

Shimon Whiteson's take is that in reality lots of human knowledge has been important in getting AI to do things; such as the invariances built into convolutional nets, or the MCTS and self-play algorithm underlying AlphaZero. I don't see this as opposed to Rich Sutton's point -- it seems to me that the takeaway is that we should aim to build algorithms that will be able to leverage large amounts of compute, but we can be clever and embed important knowledge in such algorithms. I think this criterion would have predicted ex-ante (i.e. before seeing the results) that much past and current research in AI was misguided, without also predicting that any of the major advances (like CNNs) were misguided.

It's worth noting that this is coming from a perspective of aiming for the most general possible capabilities for AI systems. If your goal is to instead build something that works reliably now, then it really is a good idea to embed human domain knowledge, as it does lead to a performance improvement -- you should just expect that in time the system will be replaced with a better performing system with less embedded human knowledge.

One disagreement I have is that this post doesn't acknowledge the importance of data. The AI advances we see now are ones where the data has been around for a long time (or you use simulation to get the data), and someone finally put in enough engineering effort + compute to get the data out and put it in a big enough model. That is, currently compute is increasing much faster (AN #7) than data, so the breakthroughs you see are in domains where the bottleneck was compute and not data; that doesn't mean data bottlenecks don't exist.

News

AI Safety workshop at IJCAI 2019 (HuÃ¡scar Espinoza et al): There will be a workshop on AI safety at IJCAI 2019 in Macao, China; the paper submission deadline is April 12. In addition to the standard submissions (technical papers, proposals for technical talks, and position papers), they are seeking papers for their "AI safety landscape" initiative, which aims to build a single document identifying the core knowledge and needs of the AI safety community.

11