Alignment Newsletter is a weekly publication with recent content relevant to AI alignment around the world. Find all Alignment Newsletter resources here. In particular, you can look through this spreadsheet of all summaries that have ever been in the newsletter.

Audio version here (may not be up yet).


Discontinuous progress in history: an update (Katja Grace) (summarized by Nicholas): One of the big questions in AI alignment is whether there will be a discontinuous AI takeoff (see here (AN #62) for some reasons why the question is decision-relevant). To get a better outside view, AI Impacts has been looking for large discontinuities in historical technological trends. A discontinuity is measured by how many years early the new value arrives, relative to what would have been expected by extrapolating the previous trend.
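As a rough illustration of the metric (with hypothetical toy data, not AI Impacts' actual series), the discontinuity size in years can be computed by extrapolating the prior trend and asking when it would have reached the new value:

```python
import numpy as np

# Hypothetical toy data: a metric that grows by 1 unit/year for a decade,
# then jumps to 30 in year 10.
t = np.arange(11).astype(float)          # years 0..10
values = t.copy()
values[-1] = 30.0                        # the discontinuous jump

# Fit the prior trend (excluding the jump) and ask: in what year would the
# trend have reached the new value?
slope, intercept = np.polyfit(t[:-1], values[:-1], 1)
trend_year = (values[-1] - intercept) / slope

# Discontinuity size = years of expected progress covered by the jump.
discontinuity_years = trend_year - t[-1]
print(round(discontinuity_years))  # 20: the jump is ~20 years ahead of trend
```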

They found ten discontinuities of at least 100 years, for example in ship size (the SS Great Eastern), the average speed of military payload across the Atlantic Ocean (the first ICBM), and the highest temperature of superconductivity (yttrium barium copper oxide).

There are also some interesting negative examples of discontinuities. Particularly relevant to AI are AlexNet not being a discontinuity on the ImageNet benchmark and chess performance not having any discontinuities in Elo rating.

Nicholas' opinion: Ignoring the George Washington Bridge (which confuses both me and the authors), I’d roughly categorize the causes of these discontinuities as

- 3 of them were due to a concerted but apparently misplaced effort towards something others weren’t trying to do. These are Pyramid of Djoser, SS Great Eastern, and the Paris Gun.

- 2 of them were due to the Atlantic Ocean causing a threshold effect (as they explain in the post). These are the ICBM and the first nonstop transatlantic flight.

- 4 of them were due to a new technological breakthrough followed by increased investment and a faster rate of progress. These are the two telegraph cables, nuclear weapons, and superconductors.

Of these, the final category seems the most relevant to AGI timelines, and I could imagine AGI development following a similar trajectory, where a major breakthrough causes a large amount of investment and then we have much faster progress on AI going forward.

I was quite surprised that AlexNet did not represent a discontinuity on ImageNet performance. It is widely regarded to have kicked off deep learning in the computer vision community. I’m not sure if this is because the discontinuity metric they use doesn’t correspond with my sense of a “breakthrough”, because there were only two years of ImageNet beforehand, or because the vision community is just mistakenly attributing gradual progress to one major event.

Rohin's opinion: I agree with Nicholas that the final category seems most relevant to AI progress. Note though that even for this analogy to hold, you need to imagine a major AI breakthrough, since as Nicholas pointed out, these discontinuities were caused by a radically new technology (telegraph cables replacing ships, nuclear weapons replacing conventional bombs, and ceramic superconductors replacing alloy superconductors). This doesn't seem likely in worlds where progress is driven primarily by compute (AN #7), but could happen if (as academics often suggest) deep learning hits a wall and we need to find other AI algorithms to make progress.

Description vs simulated prediction (Rick Korzekwa) (summarized by Nicholas): AI Impacts’ investigation into discontinuous progress intends to answer two questions:

1. How did tech progress happen in the past?

2. How well could it have been predicted beforehand?

These can diverge when we have different information available now than in the past. For example, we could have more information because later data clarified trends or because the information is more accessible. We might have less information because we take an outside view (looking at trends) rather than an inside view (knowing the specific bottlenecks and what might need to be overcome).

The post then outlines some tradeoffs between answering these two questions and settles on primarily focusing on the first: describing tech progress in the past.

Nicholas' opinion: I don’t have a strong opinion on which of these two questions is more important to focus on. It makes sense to me to work on both in parallel, since the data required is likely to be the same. My concern with this approach is that there is no clear denominator for the discontinuities they find. The case studies convince me that discontinuities can happen, but I really want to know how frequently they happen.

Rohin's opinion: Given that we want to use this to forecast AI progress, it seems like we primarily care about the second question (simulated prediction). However, it's really hard to put yourself in the shoes of someone in the past, making sure to have exactly the information that was available at the time; as a result I broadly agree with the decision to focus more on a description of what happened.



Specification gaming: the flip side of AI ingenuity (Victoria Krakovna et al) (summarized by Rohin): This post on the DeepMind website explains the concept of specification gaming (AN #1), and illustrates three problems that arise within it. First and most obviously, we need to capture the human concept of a given task in a reward function. Second, we must design agents without introducing any mistaken implicit assumptions (e.g. that the physics simulation is accurate, when it isn't). Finally, we need to ensure that agents don't tamper with their reward functions.


Finding and Visualizing Weaknesses of Deep Reinforcement Learning Agents (Christian Rupprecht et al) (summarized by Rohin): This paper proposes a new visualization tool in order to understand the behaviour of agents trained using deep reinforcement learning. Specifically, they train a generative model which produces game states, and then optimise a distribution over state embeddings according to some target function (such as high reward for taking a specific action). By sampling from the resulting distribution, they create a diverse set of realistic states that score highly according to the target function. They propose a few target cost functions, which allow them to optimise for states in which the agent takes a particular action, states which are high reward (worst Q-value is large), states which are low reward (best Q-value is small), and critical states (large difference in Q value). They demonstrate results on Atari games as well as a simulated driving environment.
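A minimal sketch of the optimisation idea, with tiny random linear maps standing in for the paper's learned generative model and Q-network (all names, shapes, and the single-embedding simplification are illustrative assumptions; the paper optimises a distribution over embeddings and adds realism and diversity terms, omitted here):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins (assumptions, not the paper's trained models): a linear
# "decoder" from latent embeddings to states, and a linear "Q-network".
decoder = rng.normal(size=(8, 4))   # latent dim 4 -> state dim 8
q_net = rng.normal(size=(8, 3))     # state dim 8 -> Q-values for 3 actions

def critical_score(z):
    """Criticality target: gap between best and worst Q-value in the state."""
    q = q_net.T @ (decoder @ z)
    return q.max() - q.min()

# Gradient ascent on the latent embedding to find a more "critical" state.
z = rng.normal(size=4)
before = critical_score(z)
for _ in range(100):
    q = q_net.T @ (decoder @ z)
    # Subgradient of (max Q - min Q) with respect to z.
    grad = decoder.T @ (q_net[:, q.argmax()] - q_net[:, q.argmin()])
    z += 0.01 * grad
assert critical_score(z) > before  # the optimised state is more "critical"
```

The same loop with a different target (e.g. the Q-value of one action) would produce states optimised for that action instead.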

Robert's opinion: I liked the paper, and I'm in favour of new work on interpreting reinforcement learning agents; I think it's underexplored, useful, and relevant to AI safety. The methods seem in a similar vein to Feature Visualisation methods for classic vision, but focused solely on the resulting behaviour of the agent; it'd be interesting to see whether such methods can give insight into the internals of RL agents. It's also a shame that the demonstration of results is wholly qualitative; the authors show some apparent flaws in the agents, but don't produce any results showing that the insights their method yields are useful. I think the insights are useful, but it's difficult to validate that claim, and I'm cautious of work which produces interesting and seemingly insightful methods without validating that they produce genuinely useful insight.

Estimating Training Data Influence by Tracing Gradient Descent (Garima Pruthi et al) (summarized by Robert): This paper presents the TracIn method for tracking the influence of training datapoints on the loss at a test datapoint. The purpose of the method is to discover the training points most influential on decisions made on the test set. This is defined (loosely) for a training point x and test point z as the total change in loss on z caused by training on x. They present several approximations and methods for calculating this quantity efficiently, allowing them to scale the method to ResNet-50 models trained on ImageNet.
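The first-order approximation at the heart of the method can be sketched as a sum of learning-rate-weighted gradient dot products over saved checkpoints. Here is a toy version with a linear model and squared loss standing in for the real network (an illustrative assumption, not the paper's implementation):

```python
import numpy as np

def loss_grad(w, x, y):
    """Gradient of squared loss 0.5*(w.x - y)^2 w.r.t. weights w (toy linear model)."""
    return (w @ x - y) * x

def tracin_influence(checkpoints, lrs, train_pt, test_pt):
    """First-order influence estimate: at each saved checkpoint, take the
    dot product of the loss gradients at the train and test points, weighted
    by the learning rate, and sum over checkpoints."""
    x, y = train_pt
    xz, yz = test_pt
    return sum(lr * loss_grad(w, x, y) @ loss_grad(w, xz, yz)
               for w, lr in zip(checkpoints, lrs))
```

As a sanity check, a training point identical to the test point gets positive influence (its gradient dot product with itself is a squared norm), which is exactly the self-influence quantity used in the mislabelled-data experiment below.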

The standard method of evaluation for these kinds of methods is finding mislabelled examples in the training dataset. Mislabelled examples are likely to have a strong positive influence on their own loss (strong because they're outliers, and positive because training on them reduces their own loss). If we sort the training dataset in decreasing order of this self-influence, we should expect to see more mislabelled examples at the beginning of the list, and we can measure what proportion of the mislabelled examples appears in each initial segment of the list. The authors perform this experiment on CIFAR: they first train a model to convergence, mislabel 10% of the training set as the next-highest predicted class, and then retrain a new model on which TracIn is run. Compared to the two previous methods from the literature (Influence Functions and Representer Points), TracIn recovers more than 80% of the mislabelled data in the first 20% of the ranking, whereas the other methods recover less than 50% at the same point. TracIn does significantly better on all segments.
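The evaluation protocol above can be sketched as follows (a hypothetical helper, not the paper's code):

```python
import numpy as np

def recovery_curve(self_influence, is_mislabelled, fractions=(0.1, 0.2, 0.5)):
    """Fraction of all mislabelled points found in the top-f fraction of
    training points, ranked by decreasing self-influence."""
    order = np.argsort(-np.asarray(self_influence, dtype=float))
    flags = np.asarray(is_mislabelled)[order]
    total = flags.sum()
    n = len(flags)
    return {f: flags[:int(f * n)].sum() / total for f in fractions}
```

On this metric, the paper's headline result corresponds to `recovery_curve(...)[0.2] > 0.8` for TracIn versus below 0.5 for the baselines.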

They demonstrate the method on a variety of domains, including NLP tasks and vision tasks. The influential examples found seem reasonable, but there's no quantification of these results.

Read more: Understanding Black-box Predictions via Influence Functions

Robert's opinion: It's interesting to see methods able to identify which parts of the training data have an impact on the decisions of a model. I think the approach taken here (and in Influence Functions) of using the change in the test loss is OK, but it doesn't seem to be exactly what I think when I say "which datapoints had the most influence on this decision being made in this way?". It's also difficult to compare these methods without either a benchmark, a human experiment, or some way of demonstrating the method has produced novel insight which has been verified. The mislabelled data experiment partially fulfils this, but isn't what these methods are ultimately designed for, and is hence unsatisfactory.


Various trends relevant to AI alignment (Asya Bergal and Daniel Kokotajlo) (summarized by Rohin): AI Impacts has published a few analyses of trends relevant to AI alignment (see links below).

Will we see a continuous or discontinuous takeoff? Takeoff speeds operationalizes continuous takeoff (there called "slow takeoff") as: there will be a complete 4-year interval in which world output doubles, before the first 1-year interval in which world output doubles. AI Impacts searched for precedents of economic n-year doubling before 4n-year doubling, and found that this happened between 4,000 and 3,000 BC, and probably also between 10,000 and 4,000 BC. (Note this implies there was a 6000-year doubling before the 1000-year doubling, even though there wasn't a 4000-year doubling.)
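This operationalization can be checked mechanically against a toy output series (illustrative data, not real GDP figures):

```python
import numpy as np

def first_doubling_end(years, output, span):
    """First year ending a `span`-year interval over which output at least
    doubled (linear interpolation for the interval's start value), else None."""
    for i, y in enumerate(years):
        if y - span < years[0]:
            continue
        if output[i] >= 2 * np.interp(y - span, years, output):
            return y
    return None

years = np.arange(21)

# Smooth growth doubling every 2 years: a 4-year doubling completes (year 4)
# but no 1-year doubling ever does -- the "continuous takeoff" pattern.
smooth = 2.0 ** (years / 2)
assert first_doubling_end(years, smooth, 4) == 4
assert first_doubling_end(years, smooth, 1) is None

# A sudden 4x jump at year 10: the 4-year and 1-year doublings complete in
# the same year, so no full 4-year doubling occurs strictly before the
# 1-year one -- the "discontinuous takeoff" pattern.
jump = np.where(years < 10, 1.0, 4.0)
assert first_doubling_end(years, jump, 4) == first_doubling_end(years, jump, 1) == 10
```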

How hard will it be to solve a crisply-stated problem of alignment? One way to get an outside view on the matter is to look at resolutions of mathematical conjectures over time. While there is obvious sampling bias in which conjectures are remembered as being important, the results could nonetheless be informative. They find that "the data is fit closely by an exponential function with a half-life of 117 years".
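Taking the fitted exponential at face value (a sketch assuming a simple memoryless decay model, with the 117-year half-life reported by AI Impacts):

```python
HALF_LIFE = 117.0  # years, from the AI Impacts fit

def frac_unresolved(t):
    """Expected fraction of conjectures still open t years after being posed,
    under exponential decay with a 117-year half-life."""
    return 0.5 ** (t / HALF_LIFE)

print(frac_unresolved(117))  # 0.5 -- half of conjectures resolved after one half-life
```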

Since AI progress seems to be driven at least partially by compute (AN #7), forecasting trends in compute seems important to forecasting AI progress. DRAM price per gigabyte has fallen by about an order of magnitude every 5 years from 1957 to 2020, although since 2010, the data suggests more like 14 years for a drop by an order of magnitude. Geekbench score per CPU price has grown by around 16% a year from 2006-2020, which would yield an order of magnitude over 16 years. This is slower than other CPU growth trends, possibly because Geekbench score is a markedly different metric.
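The Geekbench figure can be checked with a one-line growth calculation:

```python
import math

def years_per_order_of_magnitude(annual_growth):
    """Years for a quantity growing at `annual_growth` per year to grow 10x."""
    return math.log(10) / math.log(1 + annual_growth)

# 16%/year growth in Geekbench score per dollar -> roughly 16 years per 10x.
print(round(years_per_order_of_magnitude(0.16)))  # 16
```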

Rohin's opinion: I'm surprised that mathematical conjectures take so long to be resolved, I would have expected a smaller half-life than 117 years. I'm not sure if I should update strongly though -- it's possible that we only remember conjectures that took a long time to be resolved (though it's somewhat surprising then how well the data fits an exponential).

Surveys on fractional progress towards HLAI (Asya Bergal) (summarized by Rohin): One way to predict AGI timelines is to ask experts to estimate what fraction of progress has been made over a fixed number of years, then to extrapolate to the full 100% of progress. Doing this with the 2016 expert survey yields an estimate of 2056 (36 years from now), while doing this with Robin Hanson's informal ~15-expert survey gives 2392 (372 years from now). Part of the reason for the discrepancy is that Hanson only asked experts who had been in their field for at least 20 years; restricting to just these respondents in the 2016 survey yields an estimate of 2162 (142 years from now).
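The extrapolation itself is simple linear scaling; here is a sketch with hypothetical numbers (not the surveys' actual responses):

```python
def extrapolated_completion_year(survey_year, years_elapsed, fraction_done):
    """Linear extrapolation: if `fraction_done` of the progress toward HLAI
    occurred over the past `years_elapsed` years, the remaining progress is
    assumed to take proportionally long."""
    return survey_year + years_elapsed * (1 - fraction_done) / fraction_done

# Hypothetical example: a respondent in 2016 who has seen a quarter of the
# needed progress over 10 years implies 30 more years.
print(extrapolated_completion_year(2016, 10, 0.25))  # 2046.0
```

This also makes the discrepancy mechanism visible: holding the fraction fixed, respondents with longer careers (larger `years_elapsed`) extrapolate to proportionally later dates.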


Survey of prescient actions (Rick Korzekwa) (summarized by Rohin): AI Impacts is looking into other examples in history where people took actions in order to address a complex, novel, severe future problem, and in hindsight we recognize those actions as prescient. Ideally we could learn lessons for AI alignment from such cases. The survey is so far very preliminary, so I'll summarize it properly once it has been further developed, but I thought I'd share it now in case you want to follow along (I found the six cases they've identified quite interesting).

Rohin's opinion: One particularly interesting finding is that so far, in all of the cases they've looked at, there was a lot of feedback available to develop a solution. The post notes that this could be interpreted in two ways. First, since feedback is abundant in real-world problems, we should expect feedback for AI alignment as well (the optimistic take). Second, since AI alignment has no opportunity for feedback, it is unlike other problems (the pessimistic take (AN #27)). I would add a third option: that any real-world problem without feedback is extremely hard to solve, and so we wouldn't generate any hypotheses of actions that were prescient for such problems (in which case AI risk is not special amongst problems, it is just an instance of a very difficult problem).


Improving Verifiability in AI Development (Miles Brundage et al) (summarized by Flo): This multi-stakeholder report by authors from 30 different organizations proposes mechanisms that make claims about AI systems easier to verify. While far from a complete solution to responsible AI development, verifiable claims can enable the public to hold developers accountable for complying with their stated ethics principles, and allow developers to build trust by providing hard evidence about the safety and fairness of their systems. Better mechanisms for making such claims could also: i) help with the regulation of AI systems, ii) improve safety by counterbalancing competitive pressures to cut corners, and iii) help independent outside parties to assess the risk posed by specific applications.

The proposed mechanisms cluster into three classes: institutional, software, and hardware. Institutional mechanisms act on the incentives AI developers face. These include third party auditing of AI systems, for which a task force investigating different options would be helpful, and the publication of incidents, to provide evidence that incidents are taken seriously and prevent others from repeating the same mistakes. Other approaches are broadly analogous to adversarial training: collaborative red teaming exercises help to explore risks, and bias and safety bounties incentivize outsiders to seek out and report problems.

Software mechanisms include audit trails that are used in many safety-critical applications in other industries, better interpretability to help with risk assessment and auditing, as well as better tools and standardization for privacy-preserving machine learning. The proposed hardware mechanisms are secure hardware for machine learning, which requires additional investment as machine learning often uses specialized hardware such that progress in the security of commodity hardware cannot be directly leveraged, high-precision compute measurement to help with verifying claims about how many computational resources were used for a particular project, and compute support for academia to allow academic researchers to better scrutinize claims made by the AI industry.

Read more: Paper: Toward Trustworthy AI Development: Mechanisms for Supporting Verifiable Claims

Flo's opinion: It would be exciting to see policymakers and AI developers experimenting with the proposed mechanisms, not because I am confident that all of these mechanisms will be useful for incentivising safer AI development, but because trying solutions and observing their specific shortcomings is useful for coming up with better solutions, and trying things early gives us more time to improve and iterate. However, the danger of lock-in is real as well: if the mechanisms won't be iterated on, delaying implementation until everything is watertight could be the better option. On the level of concrete mechanisms, regular red teaming exercises and more research on interpretability seem especially useful for safety, as common interaction with failure modes of AI systems seems to make safety issues more salient.

Rohin's opinion: I love seeing these huge collaborations that simply enumerate possibilities for achieving some goal (the previous one, also organized by Miles, is the Malicious use of AI paper). It gives me much more confidence that the list is in some sense "exhaustive"; any items not on the list seem more likely to have been intentionally excluded (as opposed to the authors failing to think of those items).

That said, these mechanisms are still only one way to build trust -- in practice, when I trust people, it's because I think they also want to do the things I want them to do (whether because of external incentives or their intrinsic goals). I wonder how much this kind of trust building can be done with AI. For example, one story you could tell is that by producing compelling evidence that a particular algorithm is likely to lead to an existential catastrophe, you can build trust that no one will use that algorithm, at least as long as you believe that everyone strongly wants to avoid an existential catastrophe (and this outweighs any benefits of running the algorithm).


I'm always happy to hear feedback; you can send it to me, Rohin Shah, by replying to this email.


An audio podcast version of the Alignment Newsletter is available. This podcast is an audio version of the newsletter, recorded by Robert Miles.
