Alignment Newsletter is a weekly publication with recent content relevant to AI alignment around the world


Inaccessible information (Paul Christiano) (summarized by Rohin): One way to think about the problem of AI alignment is that we only know how to train models on information that is accessible to us, but we want models that leverage inaccessible information.

Information is accessible if it can be checked directly, or if an ML model would successfully transfer to provide the information when trained on some other accessible information. (An example of the latter would be if we trained a system to predict what happens in a day, and it successfully transfers to predicting what happens in a month.) Otherwise, the information is inaccessible: for example, “what Alice is thinking” is (at least currently) inaccessible, while “what Alice will say” is accessible. The post has several other examples.

Note that while an ML model may not directly say exactly what Alice is thinking, if we train it to predict what Alice will say, it will probably have some internal model of what Alice is thinking, since that is useful for predicting what Alice will say. It is nonetheless inaccessible because there’s no obvious way of extracting this information from the model. While we could train the model to also output “what Alice is thinking”, this would have to be training for “a consistent and plausible answer to what Alice is thinking”, since we don’t have the ground truth answer. This could incentivize bad policies that figure out what we would most believe, rather than reporting the truth.

The argument for risk is then as follows: we care about inaccessible information (e.g. we care about what people actually experience, rather than what they say they experience) but can’t easily make AI systems that optimize for it. However, AI systems will be able to infer and use inaccessible information, and would outcompete ones that don’t. AI systems will be able to plan using such inaccessible information for at least some goals. Then, the AI systems that plan using the inaccessible information could eventually control most resources. Key quote: “The key asymmetry working against us is that optimizing flourishing appears to require a particular quantity to be accessible, while danger just requires anything to be accessible.”

The post then goes on to list some possible angles of attack on this problem. Iterated amplification can be thought of as addressing gaps in speed, size, experience, algorithmic sophistication etc. between the agents we train and ourselves, which can limit what inaccessible information our agents can have that we won’t. However, it seems likely that amplification will eventually run up against some inaccessible information that will never be produced. As a result, this could be a “hard core” of alignment.

Rohin's opinion: I think the idea of inaccessible information is an important one, but it’s one that feels deceptively hard to reason about. For example, I often think about solving alignment by approximating “what a human would say after thinking for a long time”; this is effectively a claim that human reasoning transfers well when iterated over long periods of time, and “what a human would say” is at least somewhat accessible. Regardless, it seems reasonably likely that AI systems will inherit the same property of transferability that I attribute to human reasoning, in which case the argument for risk applies primarily because the AI system might apply its reasoning towards a different goal than the ones we care about, which leads us back to the intent alignment (AN #33) formulation.

This response views this post as a fairly general argument against black box optimization, where we only look at input-output behavior, as then we can’t use inaccessible information. It suggests that we need to understand how the AI system works, rather than relying on search, to avoid these problems.

Possible takeaways from the coronavirus pandemic for slow AI takeoff (Victoria Krakovna) (summarized by Rohin): The COVID-19 pandemic is an example of a large risk that humanity faced. What lessons can we learn for AI alignment? This post argues that the pandemic is an example of the sort of situation we can expect in a slow takeoff scenario, since we had the opportunity to learn from experience, act on warning signs, and reach a timely consensus that there is a serious problem. However, while we could have learned from previous epidemics like SARS, we failed to generalize the lessons from SARS. Despite warning signs of a pandemic in February, many countries wasted a month when they could have been stocking up on PPE and testing capacity. We had no consensus that COVID-19 was a problem, with articles dismissing it as no worse than the flu as late as March.

All of these problems could also happen with slow takeoff: we may fail to generalize from narrow AI systems to more general AI systems; we might not act on warning signs; and we may not believe that powerful AI is on the horizon until it is too late. The conclusion is “unless more competent institutions are in place by the time general AI arrives, it is not clear to me that slow takeoff would be much safer than fast takeoff”.

Rohin's opinion: While I agree that the COVID response was worse than it could have been, I think there are several important disanalogies between the COVID-19 pandemic and a soft takeoff scenario, which I elaborate on in this comment. First, with COVID there were many novel problems, which I don’t expect with AI. Second, I expect a longer time period over which decisions can be made for AI alignment. Finally, with AI alignment, we have the option of preventing problems from ever arising, which is not really an option with pandemics. See also this post.



Steven Pinker and Stuart Russell on the Foundations, Benefits, and Possible Existential Threat of AI (Lucas Perry, Steven Pinker and Stuart Russell) (summarized by Rohin): Despite their disagreements on AI risk, Stuart and Steven agree on quite a lot. They both see the development of AI as depending on many historical ideas. They are both particularly critical of the idea that we can get general intelligence by simply scaling up existing deep learning models, citing the need for reasoning, symbol manipulation, and few-shot learning, which current models mostly don’t do. They both predict that we probably won’t go extinct from superintelligent AI, at least in part because we’ll notice and fix any potential failures, either via extensive testing or via initial failures that illustrate the problem.

On the AI risk side, while they spent a lot of time discussing it, I’ll only talk about the parts where it seems to me that there is a real disagreement, and not mention anything else. Steven’s position against AI risk seems to be twofold. First, we are unlikely to build superintelligent AI soon, and so we should focus on other clear risks like climate change. In contrast, Stuart thinks that superintelligent AI is reasonably likely by the end of the century and thus worth thinking about. Second, the idea of building a super-optimizer that focuses on a single goal is so obviously bad that AI researchers will obviously not build such a thing. In contrast, Stuart thinks that goal-directed systems are our default way of modeling and building intelligent systems. It seemed like Steven was particularly objecting to the especially simplistic goals used in examples like maximizing paperclips or curing cancer, to which Stuart argued that the problem doesn’t go away if you have multiple goals, because there will always be some part of your goal that you failed to specify.

Steven also disagrees with the notion of intelligence that is typically used by AI risk proponents, saying “a super-optimizer that pursued a single goal is self-evidently unintelligent, not superintelligent”. I don’t get what he means by this, but it seems relevant to his views.

Rohin's opinion: Unsurprisingly I agreed with Stuart’s responses, but nevertheless I found this illuminating, especially in illustrating the downsides of examples with simplistic goals. I did find it frustrating that Steven didn’t respond to the point about multiple goals not helping, since that seemed like a major crux, though they were discussing many different aspects and that thread may simply have been dropped by accident.


Sparsity and interpretability? (Stanislav Böhm et al) (summarized by Rohin): If you want to visualize exactly what a neural network is doing, one approach is to visualize the entire computation graph of multiplies, additions, and nonlinearities. While this is extremely complex even on MNIST, we can make it much simpler by making the networks sparse, since any zero weights can be removed from the computation graph. Previous work has shown that we can remove well over 95% of weights from a model without degrading accuracy too much, so the authors do this to make the computation graph easier to understand.

They use this to visualize an MLP model for classifying MNIST digits, and for a DQN agent trained to play Cartpole. In the MNIST case, the computation graph can be drastically simplified by visualizing the first layer of the net as a list of 2D images, where the kth activation is given by the dot product of the 2D image with the input image. This deals with the vast majority of the weights in the neural net.

Rohin's opinion: This method has the nice property that it visualizes exactly what the neural net is doing -- it isn’t “rationalizing” an explanation, or eliding potentially important details. It is possible to gain interesting insights about the model: for example, the logit for digit 2 is always -2.39, implying that everything else is computed relative to -2.39. Looking at the images for digit 7, it seems like the model strongly believes that sevens must have the top few rows of pixels be blank, which I found a bit surprising. (I chose to look at the digit 7 somewhat arbitrarily.)

Of course, since the technique doesn’t throw away any information about the model, it becomes very complicated very quickly, and wouldn’t scale to larger models.


More on disambiguating "discontinuity" (Aryeh Englander) (summarized by Rohin): This post considers three different kinds of “discontinuity” that we might imagine with AI development. First, there could be a sharp change in progress or the rate of progress that breaks with the previous trendline (this is the sort of thing examined (AN #97) by AI Impacts). Second, the rate of progress could either be slow or fast, regardless of whether there is a discontinuity in it. Finally, the calendar time could either be short or long, regardless of the rate of progress.

The post then applies these categories to three questions. Will we see AGI coming before it arrives? Will we be able to “course correct” if there are problems? Is it likely that a single actor obtains a decisive strategic advantage?



Meta-Learning without Memorization (Mingzhang Yin et al) (summarized by Asya): Meta-learning is a technique for leveraging data from previous tasks to enable efficient learning of new tasks. This paper proposes a solution to a problem in meta-learning which the paper calls the memorization problem. Imagine a meta-learning algorithm trained to look at 2D pictures of 3D objects and determine their orientation relative to a fixed canonical pose. Trained on a small number of objects, it may be easy for the algorithm to just memorize the canonical pose for each training object and then infer the orientation from the input image. However, the algorithm will perform poorly at test time because it has not seen novel objects and their canonical poses. Rather than memorizing, we would like the meta-learning algorithm to learn to adapt to new tasks, guessing at rules for determining canonical poses given just a few example images of a new object.

At a high level, a meta-learning algorithm uses information from three sources when making a prediction-- the training data, the parameters learned while doing meta-training on previous tasks, and the current input. To prevent memorization, we would like the algorithm to get information about which task it's solving only from the training data, rather than memorizing it by storing it in its other information sources. To discourage this kind of memorization, the paper proposes two new kinds of regularization techniques which it calls "meta-regularization" schemes. One penalizes the amount of information that the algorithm stores in the direct relationship between input data and predicted label ("meta-regularization on activations"), and the other penalizes the amount of information that the algorithm stores in the parameters learned during meta-training ("meta-regularization on weights").

In some cases, meta-regularization on activations fails to prevent the memorization problem where meta-regularization on weights succeeds. The paper hypothesizes that this is because even a small amount of direct information between input data and predicted label is enough to store the correct prediction (e.g., a single number that is the correct orientation). That is, the correct activations will have low information complexity, so it is easy to store them even when information in activations is heavily penalized. On the other hand, the function needed to memorize the predicted label has a high information complexity, so penalizing information in the weights, which store that function, successfully discourages memorization. The key insight here is that memorizing all the training examples results in a more information-theoretically complex model than task-specific adaptation, because the memorization model is a single model that must simultaneously perform well on all tasks.

Both meta-regularization techniques outperform non-regularized meta-learning techniques in several experimental set-ups, including a toy sinusoid regression problem, the pose prediction problem described above, and modified Omniglot and MiniImagenet classification tasks. They also outperform fine-tuned models and models regularized with standard regularization techniques.

Asya's opinion: I like this paper, and the techniques for meta-regularization it proposes seem to me like they're natural and will be picked up elsewhere. Penalizing model complexity to encourage more adaptive learning reminds me of arguments that pressure for compressed policies could create mesa-optimizers (AN #58) -- this feels like very weak evidence that that could indeed be the case.


OpenAI API (OpenAI) (summarized by Rohin): OpenAI has released a commercial API that gives access to natural language completions via GPT-3 (AN #102), allowing users to specify tasks in English that GPT-3 can then (hopefully) solve.

Rohin's opinion: This is notable since this is (to my knowledge) OpenAI’s first commercial application.


I'm always happy to hear feedback; you can send it to me, Rohin Shah, by replying to this email.


An audio podcast version of the Alignment Newsletter is available. This podcast is an audio version of the newsletter, recorded by Robert Miles.

New Comment
2 comments, sorted by Click to highlight new comments since:

On the Russell / Pinker debate, I thought Pinker had an interesting rhetorical sleight-of-hand that I hadn't heard before...

When people on the "AGI safety is important" side explain their position, there's kinda a pedagogical dialog:

A: Superintelligent AGI will be awesome, what could go wrong? B: Well it could outclass all of humanity and steer the future in a bad direction. A: OK then we won't give it an aggressive goal. B: Even with an innocuous-sounding goal like "maximize paperclips" it would still kill everyone... A: OK, then we'll give it a good goal like "maximize human happiness". B: Then it would forcibly drug everyone. A: OK, then we'll give it a more complicated goal like ... B: That one doesn't work either because ...

...And then Pinker reads this back-and-forth dialog, removes a couple pieces of it from their context, and says "The existential risk scenario that people are concerned about is the paperclip scenario and/or the drugging scenario! They really think those exact things are going to happen!" Then that's the strawman that he can easily rebut.

Pinker had other bad arguments too, I just thought that was a particularly sneaky one.

They are both particularly critical of the idea that we can get general intelligence by simply scaling up existing deep learning models, citing the need for reasoning, symbol manipulation, and few-shot learning, which current models mostly don’t do

Huh. GPT-3 seems to me like something that does all three of those things, albeit at a rudimentary level. I'm thinking especially about its ability to do addition and anagrams/word letter manipulations. Was this interview recorded before GPT-3 came out?