Alignment Newsletter is a weekly publication with recent content relevant to AI alignment around the world. Find all Alignment Newsletter resources here. In particular, you can look through this spreadsheet of all summaries that have ever been in the newsletter.

Audio version here (may not be up yet).


Danny Hernandez on forecasting and the drivers of AI progress

(Arden Koehler and Danny Hernandez) (summarized by Rohin): This podcast is a great introduction to the practice of forecasting and measurement in AI, and why it is important. I won't summarize everything in the podcast, but here are some of the points made.

Danny talks about the AI and Compute (AN #7) and AI and Efficiency (AN #99) work that he did at OpenAI. The former shows that the compute devoted to the largest-scale experiments has increased by a factor of 300,000 from 2012 to 2018, and the latter suggests that algorithms have been able to achieve similar performance with 25x less compute over the same time period (later updated to 44x from 2012 to 2019).

One thing I didn’t realize earlier was that the 25x / 44x factor should be thought of as a loose lower bound: in other areas such as language modeling, the factor looks higher. But more importantly, the methodology used doesn’t allow us to model the effects of an algorithm allowing us to do something we couldn’t do before (which we could interpret as something we could do, but with way more compute). Possibly this algorithmic progress should be thought of as a 100x or even 1000x improvement in efficiency. Overall, Danny sees both algorithmic progress and increase in compute as pretty big factors in predicting how AI will go in the future.
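To make the headline factors above easier to compare, they can be converted into implied doubling times. The sketch below is a rough back-of-the-envelope calculation, not the methodology of the original posts: it assumes smooth exponential growth between the quoted endpoints, and the helper name `doubling_time_months` is hypothetical.

```python
import math

def doubling_time_months(total_factor, years):
    """Months per doubling, assuming smooth exponential growth between endpoints."""
    return 12 * years / math.log2(total_factor)

# Figures quoted in the summary; treating the endpoints as exact is an assumption.
compute_doubling = doubling_time_months(300_000, 2018 - 2012)
efficiency_doubling = doubling_time_months(44, 2019 - 2012)

print(f"compute:    {compute_doubling:.1f} months per doubling")
print(f"efficiency: {efficiency_doubling:.1f} months per doubling")
```

Under these assumptions, compute doubles roughly every 4 months and algorithmic efficiency roughly every 15 months. The original posts fit trends to many data points rather than two endpoints, so their reported figures may differ somewhat.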

Unfortunately, it’s hard to draw strong implications from these measurements for the downstream things we care about -- should we think that AI progress is “slow” or “fast”, or “linear” or “exponential”, based on these results? It’s important to be specific about the units you’re using when thinking about such a question. Danny thinks the economic impact of AI is an important lens here. It seems to him that neural nets were having very little impact back in (say) 2008, but since then they have been having a lot more impact, e.g. by making ~15% of Google searches better (by using a new language model). To his eye, this trend looks exponential.

In any case, Danny thinks that this sort of rigorous measurement and forecasting work is important, because it provides concrete inputs that can allow decision makers to perform their job better. This is at least one reason why OpenAI’s communication policy involves blog posts that deliberately target a wide audience: any decision maker can read these posts and get value out of them (unlike e.g. research papers).

This work is part of the broader work done by the Foresight team at OpenAI (which is hiring for research engineers): other work includes Scaling Laws for Neural Language Models (AN #87) and How AI Training Scales (AN #37).

Danny thinks work in AI hardware is promising and under-explored by the community: it seems like it will be a particularly important field in the future, as it will drive some of the progress in increased compute, and as a result having some influence in the area could be quite helpful. For example, perhaps one could advocate for a windfall clause (AN #88) at AI hardware companies.

Rohin's opinion: This measurement and forecasting work seems great; it constrains how we should expect future AI systems to look, and also improves our understanding of the impacts of AI, which probably helps us develop plans for deployment.

I was not very convinced by the reasoning about economic impact. I would believe that the economic impact of neural nets has grown exponentially, but it seems like we should be analyzing trends in machine learning (ML) overall, not just neural nets, and it seems much less likely to me that we see an exponential growth in that. Any time you see a new, better version of a previous technology (as with neural nets in relation to ML), you’re going to see an exponential trend as the new technology is adopted; this doesn’t mean that the exponential will keep going on and lead to transformative impact.



Learning to Complement Humans

(Bryan Wilder et al) (summarized by Rohin): Many current AI systems aim to assist humans in complex tasks such as medical diagnoses. Given that AI systems have a very different range of capabilities than humans, there has been a lot of interest in detecting “hard” examples and showing them to humans. This paper demonstrates how this can be done in an end-to-end way.

The authors assume they have access to an augmented supervised learning dataset of triples (x, y, h), where x is the input, y is the label, and h is the human prediction. A traditional approach would be to first train a model to predict y given x, and then come up with a new algorithm or model to predict when you should ask the human instead of querying the model. In contrast, they create a single model that first decides whether to look at h (for some fixed cost c), and then makes a prediction given x (and h, if the model chose to look at it). They have two versions: a classic discriminative approach (very similar to e.g. image classifiers) and a decision-theoretic approach (where the model uses several probabilistic models and then calculates the value of information (VOI) of h to decide whether to query the human).
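The value-of-information decision in the decision-theoretic version can be illustrated with a small sketch. This is not the paper's implementation and it does no end-to-end training: it assumes the probabilistic models P(y | x) and P(h | y) are already given as fixed tables, uses 0-1 utility (so expected utility equals the probability of a correct prediction), and all names and numbers are hypothetical.

```python
def value_of_information(prior, human_likelihood):
    """Expected gain in accuracy from seeing h before predicting.

    prior[y]               = P(y | x), the model's belief before querying
    human_likelihood[y][h] = P(h | y), how the human responds for each true label
    """
    best_without = max(prior)
    n_labels = len(prior)
    n_responses = len(human_likelihood[0])
    # For each possible response h, P(h | x) * max_y P(y | x, h)
    # simplifies to max_y P(y | x) * P(h | y).
    best_with = sum(
        max(prior[y] * human_likelihood[y][h] for y in range(n_labels))
        for h in range(n_responses)
    )
    return best_with - best_without

def should_query(prior, human_likelihood, cost):
    """Query the human only if the expected gain exceeds the fixed cost c."""
    return value_of_information(prior, human_likelihood) > cost

# Hypothetical human: 90% accurate when y = 0, 80% accurate when y = 1.
human = [[0.9, 0.1], [0.2, 0.8]]
uncertain = [0.6, 0.4]    # model is unsure about this input
confident = [0.99, 0.01]  # model is nearly certain about this input

print(value_of_information(uncertain, human))  # about 0.26
print(should_query(uncertain, human, cost=0.1))
print(should_query(confident, human, cost=0.1))
```

With these numbers the model queries the human on the uncertain input but not on the confident one, since in the latter case no possible human response would change its prediction.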

The end-to-end training confers two main benefits:

1. The models automatically learn to focus their learning capability on examples that are hard for humans.

2. The models ignore examples where they are going to ask a human anyway (rather than e.g. learning enough to make a 50% confident prediction).

Rohin's opinion: This is a cool idea! Even if you only have a dataset with ground truth (x, y) pairs, you could assume that h = y, and while you wouldn’t get benefit 1, you would still get benefit 2 above. If you constructed your dataset by the common method of getting a bunch of human predictions and defining y to be the modal prediction, then you could automatically construct h by having it be the average across the human predictions, and get both benefits above.

Showing versus doing: Teaching by demonstration

(Mark K. Ho et al) (summarized by Rohin): This paper creates and validates a model of pedagogy as applied to reward learning. Typically, inverse reinforcement learning (IRL) algorithms assume access to a set of demonstrations that are created from an approximately optimal policy. However, in practice, when people are asked to demonstrate a task, they don't give the optimal trajectory; they give the trajectory that best helps the learner disambiguate between the possible tasks. The authors formalize this by creating a model in two steps:

1. A literal or IRL robot is one which learns rewards under the model that the demonstrator is Boltzmann rational.

2. The pedagogic human shows trajectories in proportion to how likely a literal robot would think the true reward is upon seeing the trajectory.

They validate this model with user studies and find that it predicts human demonstrations well.
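The two-step model can be sketched on a toy problem with a handful of discrete trajectories. This follows only the description in the summary above; the paper's actual model may differ in its details, and the function names, returns, and rationality parameter `beta` are hypothetical.

```python
import math

def normalize(weights):
    total = sum(weights)
    return [w / total for w in weights]

def boltzmann_demonstrator(returns, beta=1.0):
    """P(trajectory | reward): probability grows exponentially with return."""
    return normalize([math.exp(beta * r) for r in returns])

def literal_robot_posterior(returns_by_reward, prior, beta=1.0):
    """P(reward | trajectory), assuming a Boltzmann-rational demonstrator."""
    likelihoods = [boltzmann_demonstrator(r, beta) for r in returns_by_reward]
    n_traj = len(returns_by_reward[0])
    return [
        normalize([prior[k] * likelihoods[k][t] for k in range(len(prior))])
        for t in range(n_traj)
    ]  # indexed as posterior[trajectory][reward]

def pedagogic_demonstrator(returns_by_reward, prior, true_reward, beta=1.0):
    """Show trajectories in proportion to the literal robot's belief in the truth."""
    posterior = literal_robot_posterior(returns_by_reward, prior, beta)
    return normalize([posterior[t][true_reward] for t in range(len(posterior))])

# Trajectory 0 is optimal under both candidate rewards, so it is uninformative.
# Trajectory 1 is slightly worse but only good under reward 0 (and likewise
# trajectory 2 for reward 1).
returns_by_reward = [[1.0, 0.9, 0.0], [1.0, 0.0, 0.9]]
prior = [0.5, 0.5]

doing = boltzmann_demonstrator(returns_by_reward[0], beta=5.0)
showing = pedagogic_demonstrator(returns_by_reward, prior, true_reward=0, beta=5.0)
print("doing:  ", [round(p, 3) for p in doing])
print("showing:", [round(p, 3) for p in showing])
```

In this toy setup the Boltzmann-rational demonstrator most often picks the optimal but ambiguous trajectory 0, while the pedagogic demonstrator most often picks trajectory 1, which is slightly suboptimal but tells the learner which reward is the true one.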

Read more: Literal or Pedagogic Human? Analyzing Human Model Misspecification in Objective Learning (AN #50)

Imitation Learning from Observations by Minimizing Inverse Dynamics Disagreement

(Chao Yang, Xiaojian Ma et al) (summarized by Zach): Learning from observation (LfO) focuses on imitation learning in situations where we want to learn from state-only demonstrations. This contrasts with learning from demonstration (LfD) which needs both state and action information. In practice, LfO is the more common situation due to the prevalence of unannotated data, such as video. In this paper, the authors show that the gap between LfO and LfD comes from the disagreement of inverse dynamics models between the imitator and the expert. If the inverse dynamics model is perfect, then state transitions can be labeled with actions and LfD can be performed on the result. However, it's often the case that many actions can generate the same state transition, which makes this labeling ambiguous. The authors then show that optimizing an upper bound on this gap leads to improved performance compared to other LfO methods such as GAIfO (GAIL extended to LfO).
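To see why action labels cannot always be recovered from state transitions, consider a toy deterministic world. This is only an illustration of the ambiguity, not the paper's method, and the environment and names are hypothetical.

```python
from collections import defaultdict

ACTIONS = {"left": -1, "stay": 0, "right": +1}
N_STATES = 4  # states 0..3 on a line, with walls at both ends

def step(state, action):
    """Deterministic dynamics: moving into a wall leaves the state unchanged."""
    return min(max(state + ACTIONS[action], 0), N_STATES - 1)

def inverse_dynamics():
    """Map each reachable (s, s') transition to every action that produces it."""
    table = defaultdict(set)
    for s in range(N_STATES):
        for a in ACTIONS:
            table[(s, step(s, a))].add(a)
    return table

table = inverse_dynamics()
ambiguous = {t: acts for t, acts in table.items() if len(acts) > 1}

print(table[(1, 2)])  # a unique label: only "right" moves from 1 to 2
print(ambiguous)      # transitions at the walls have two consistent actions
```

Away from the walls each transition has exactly one consistent action, so state-only data could be labeled and handed to an LfD method. At the walls, "stay" and moving into the wall are indistinguishable from the states alone, so an imitator and an expert can have different inverse dynamics while producing identical observations.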

Prerequisites: GAIfO and Recent Advances in LfO

Read more: divergence minimization perspective

Zach's opinion: The main value of this paper is that the difference between LfO and LfD is clarified by introducing the notion of inverse dynamics disagreement. Related to this analysis, the authors note that GAIfO has the same objective as the inverse dynamics disagreement model if we replace KL with JS divergence. This makes me suspect that there's a general LfO divergence minimization perspective relating all of these methods together. In other words, the fact that the objectives for LfO and LfD can be related via KL/JS divergence indicates that there is an entire class of methods underlying this approach to LfO. Specifically, I'd hypothesize that regularized inverse reinforcement learning from observation followed by reinforcement learning would be equivalent to a divergence minimization problem.


How can Interpretability help Alignment?

(Robert Kirk et al) (summarized by Rohin): Interpretability seems to be useful for a wide variety of AI alignment proposals. Presumably, different proposals require different kinds of interpretability. This post analyzes this question to allow researchers to prioritize across different kinds of interpretability research.

At a high level, interpretability can either make our current experiments more informative to help us answer research questions (e.g. “when I set up a debate (AN #5) in this particular way, does honesty win?”), or it could be used as part of an alignment technique to train AI systems. The former only have to be done once (to answer the question), and so we can spend a lot of effort on them, while the latter must be efficient in order to be competitive with other AI algorithms.

The authors then analyze how interpretability could apply to several alignment techniques, and come to several tentative conclusions. For example, they suggest that for recursive techniques like iterated amplification, we may want comparative interpretability, which can explain the changes between models (e.g. between distillation steps, in iterated amplification). They also suggest that by having interpretability techniques that can be used by other ML models, we can regularize a trained model to be aligned, without requiring a human in the loop.

Rohin's opinion: I like this general direction of thought, and hope that people continue to pursue it, especially since I think interpretability will be necessary for inner alignment. I think it would be easier to build on the ideas in this post if they were made more concrete.

Exploratory Not Explanatory: Counterfactual Analysis of Saliency Maps for Deep Reinforcement Learning

(Akanksha Atrey et al) (summarized by Robert): This paper presents an analysis of the use of saliency maps in deep vision-based reinforcement learning on ATARI. They consider several types of saliency methods, all of which produce heatmaps on the input image. They show that all uses of saliency maps in the deep RL literature (46 claims across 11 papers) interpret them as representing the agent's "focus", that 87% use the saliency map to generate a claim about the agent's behaviour or reasoning, but that only 7% validate their claims with additional or more direct evidence.

They go on to present a framework to turn subjective and under-defined claims about agent behaviour generated with saliency maps into falsifiable claims. This framework effectively makes the claim more specific and targeted at specific semantic concepts in the game's state space. Using a fully parameterized version of the ATARI environment, they can alter the game's state in ways which preserve meaning (i.e. the new state is still a valid game state). This allows them to perform interventions in a rigorous way, and falsify the claims made in their framework.

Using their framework, they perform 3 experimental case studies on popular claims about agent behaviour backed up by saliency maps, and show that all of them are false (or at least stated more generally than they should be). For example, in the game Breakout, agents tend to build tunnels through the bricks to get a high score. Saliency maps show that the agent attends to these tunnels in natural games. However, shifting the position of the tunnel, the agent's paddle, or the ball (or any combination of these) removes the saliency on the tunnel's location. Even flipping the whole screen vertically (which still results in a valid game state) removes the saliency on the tunnel's location. This shows that the agent doesn’t understand the concept of tunnels generally or robustly, which is often what is claimed.

Robert's opinion: The framework presented in this paper is simple, but I like the idea of combining it with the fully adjustable ATARI simulator to enable meaningful interventions, which enable us to falsify claims made using saliency maps. This is one way to validate whether the methods we're using are producing good insights.

I think this paper points more at the fact that our interpretation of saliency maps is incorrect, due to us imposing anthropomorphic biases on the agent's reasoning, and trying to infer general behaviour from specific explanations. I think many of the claims they reference could be reworded to be much more specific, and then would probably hold true (i.e. instead of "The agent understands tunnels and attends to and builds them" say "The agent knows destroying bricks consistently on the right hand side of the screen leads to higher reward and so attends to that location when it's able to bounce the ball there".)

A Benchmark for Interpretability Methods in Deep Neural Networks

(Sara Hooker et al) (summarized by Robert): This paper presents an automatic benchmark for feature importance methods (otherwise known as saliency maps) called RemOve And Retrain (ROAR). The benchmark proceeds as follows:

1. Train an image classifier on a dataset (they use ResNet-50s on ImageNet, and get about 77% accuracy)

2. Measure the test-set accuracy at convergence

3. Using the feature importance method, find the most important features in the dataset, and remove them (by greying out the pixels)

4. Train another model on this new dataset, and measure the new test-set accuracy

5. The difference between the accuracy in (4) and in (2) is the measure of how effective the feature importance method is at finding important features

The idea behind retraining is that giving the original classifier images where many pixels have been greyed out will obviously result in lower accuracy, as they're out of the training distribution. Retraining solves this problem.
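The remove-and-retrain loop can be sketched end to end on a small synthetic problem. This is a toy stand-in rather than the paper's setup: it uses a nearest-centroid classifier instead of a ResNet-50, tabular features instead of pixels, and a single global importance score (the gap between class centroids) instead of a per-image saliency method. All function names are hypothetical.

```python
import random

def make_data(n, n_features=10, n_informative=2, seed=0):
    """Synthetic binary task: only the first n_informative features carry signal."""
    rng = random.Random(seed)
    xs, ys = [], []
    for _ in range(n):
        y = rng.randint(0, 1)
        shift = 2.0 if y == 1 else -2.0
        xs.append([rng.gauss(shift if j < n_informative else 0.0, 1.0)
                   for j in range(n_features)])
        ys.append(y)
    return xs, ys

def fit_centroids(xs, ys):
    """'Train' a nearest-centroid classifier: one mean vector per class."""
    centroids = {}
    for label in (0, 1):
        rows = [x for x, y in zip(xs, ys) if y == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids

def accuracy(centroids, xs, ys):
    def predict(x):
        return min(centroids,
                   key=lambda c: sum((a - b) ** 2 for a, b in zip(x, centroids[c])))
    return sum(predict(x) == y for x, y in zip(xs, ys)) / len(ys)

def importance(centroids):
    """Stand-in importance method: per-feature gap between class centroids."""
    return [abs(a - b) for a, b in zip(centroids[0], centroids[1])]

def remove_features(xs, idxs, fill=0.0):
    """'Grey out' the chosen features by overwriting them with a constant."""
    return [[fill if j in idxs else v for j, v in enumerate(x)] for x in xs]

def retrain_and_score(train, test, idxs):
    """Steps 3-4: remove features from train and test, retrain, re-measure."""
    (xtr, ytr), (xte, yte) = train, test
    model = fit_centroids(remove_features(xtr, idxs), ytr)
    return accuracy(model, remove_features(xte, idxs), yte)

train, test = make_data(400, seed=0), make_data(200, seed=1)
base_model = fit_centroids(*train)       # step 1
base_acc = accuracy(base_model, *test)   # step 2

scores = importance(base_model)
top_k = set(sorted(range(len(scores)), key=lambda j: -scores[j])[:2])
random_k = set(random.Random(2).sample(range(len(scores)), 2))  # random baseline

acc_informed = retrain_and_score(train, test, top_k)
acc_random = retrain_and_score(train, test, random_k)
print(f"baseline {base_acc:.2f}, informed removal {acc_informed:.2f}, "
      f"random removal {acc_random:.2f}")
```

In this toy problem the importance score correctly identifies the two informative features, so removing them pushes the retrained model toward chance accuracy. A good importance method should produce a larger drop than the random baseline; the paper's finding is that most methods do not manage this on ImageNet.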

They evaluate a variety of feature importance methods (Gradient heatmap, Guided backprop, Integrated gradients, Classic SmoothGrad, SmoothGrad^2, VarGrad) on their benchmark, and compare them to a random baseline and a Sobel Edge detector (a hard-coded algorithm for finding edges in images). Only SmoothGrad^2 and VarGrad (both of which ensemble other feature importance methods) do better than random. The authors can't explain why these methods perform better than the others. They also note that even when removing 90% of the pixels in every image (i.e. the random baseline), the accuracy only drops from 77% to 63%, which shows how correlated pixels in images are.

Robert's opinion: I'm in favour of developing methods which allow us to rigorously compare interpretability methods. This benchmark is a step in the right direction, but I think it does have several flaws:

1. In images especially, there's high correlation between pixels, and greying out pixels isn't the same as "removing" the feature (the model trained on the altered images could learn something like "these pixels are greyed out (and hence important), so this must be a bird otherwise the pixels wouldn't be important").

2. The benchmark doesn't measure exactly what these methods are trying to capture. The methods are trying to answer "what parts of this image were important for making this classification?", which is at least slightly different from "what parts of this image, when removed, will prevent a new model from classifying accurately?".

I'd be interested in seeing the benchmark (or something conceptually similar) applied to domains other than images, where the correlation between features is lower: I imagine for some kinds of tabular data the methods would perform much better (although the methods have mostly been designed to work on images rather than tabular data).


User-Agent Value Alignment

(Daniel Shapiro et al) (summarized by Rohin) (H/T Stuart Russell): This paper from 2002 investigates what it would take to align an artificial agent with a human principal, under the assumption that the human utility function is known, but that the agent reward and human utility might be computed from different feature sets fA and fH. In this case, it is possible that the agent reward cannot capture all of the effects that the human cares about, leading to misalignment.

They introduce the concept of graphical value alignment, in which the only way that the agent’s actions can affect fH is through fA. In this case, we can establish functional value alignment (in which the agent’s optimal policy also maximizes human utility), by setting the agent's reward for any specific fA to be the expectation (over fH) of the utility of fH, given fA. Note that the graphical criterion is very strong: it requires that none of the agent’s unobserved effects matter at all to the human.
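The reward construction described above is a conditional expectation, which can be sketched for discrete features. The feature values, probabilities, and the name `aligned_agent_reward` below are hypothetical; the construction only yields functional value alignment when the graphical criterion holds.

```python
def aligned_agent_reward(p_fh_given_fa, human_utility):
    """R(fA) = sum over fH of P(fH | fA) * U(fH), for each agent feature value."""
    return {
        fa: sum(p * human_utility[fh] for fh, p in dist.items())
        for fa, dist in p_fh_given_fa.items()
    }

# Hypothetical toy numbers: the agent only senses whether a room looks clean (fA),
# while the human cares about whether it actually is clean (fH).
human_utility = {"clean": 1.0, "dirty": -1.0}
p_fh_given_fa = {
    "looks_clean": {"clean": 0.9, "dirty": 0.1},
    "looks_dirty": {"clean": 0.2, "dirty": 0.8},
}

reward = aligned_agent_reward(p_fh_given_fa, human_utility)
print(reward)  # roughly {'looks_clean': 0.8, 'looks_dirty': -0.6}
```

If the agent could make the room look clean without cleaning it, its actions would affect fH through a path that bypasses fA, the conditional probabilities would no longer be fixed, and this reward would stop tracking human utility. That is the situation the graphical criterion rules out.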

They suggest two methods for establishing alignment. First, we can define additional agent features (perhaps requiring additional sensors), until all of the effects on fH are captured by fA. However, this would be very difficult, if not impossible. Second, we can include all agent actions and observations as agent features, since any effect of the agent’s choice of policy on fH depends only on the observations made and actions taken. Of course, to achieve functional value alignment we would then have to have a good understanding of the expected human utility for every action given any observation, which is also hard.

They also briefly discuss the relationship between aligned agents and capable agents: a stone is aligned with you (per their definition), but also entirely useless. An interesting quote: “Note that it might be harder to establish alignment with more competent agents because their skills afford many more pathways for adverse effects. This is a somewhat troubling thought.”

Rohin's opinion: It’s interesting how much of the alignment problem manifests itself even when you assume that the human utility function is known, but the feature sets used by the human and agent are different. The only piece of the argument missing from this paper is that with sufficiently capable agents, the agent will actually be adversarial towards the human because of convergent instrumental subgoals, and that argument can be made in this framework.

Unfortunately, both of their methods for producing alignment don’t scale well, as they admit in the paper. (The second method in particular is kind of like hardcoding the policy, similarly to the construction here (AN #35).)


An audio podcast version of the Alignment Newsletter is available, recorded by Robert Miles.
