Alignment Newsletter #27

Rohin Shah

Dan Hendrycks has now joined, and will likely write summaries primarily on adversarial examples and robustness. As with Richard, his summaries are marked as such; I'm reviewing some of them now but expect to review less over time.

Highlights

80K podcast with Paul Christiano (Paul Christiano and Rob Wiblin): This is a mammoth 4-hour interview that covers a lot of ground. I'll try to state the main points, without the supporting arguments, in roughly chronological order; listen to the podcast for more.

- The problem of AI safety is that we don't know how to build AI that does what we want it to do. It arises primarily because each actor faces a tradeoff between AI systems being maximally effective at their tasks, and being robustly beneficial.

- AI safety has had much more attention in the last few years.

- Everyone agrees that we don't know how to build AI that does what we want, but people disagree on how hard the problem is, or how it should be framed.

- The best arguments against working on alignment are opportunity cost (eg. working on biosecurity instead) and that the problem might be very easy or impossible, but even then it seems like work would be valuable for getting information about how hard the problem actually is.

- It's not very important for the best AI safety team to work with the best ML team for the purpose of pursuing alignment research, but it is important for actually building powerful aligned AI.

- The variance in outcomes from AGI comes primarily from uncertainty in how hard the technical problem is, how people behave about AGI, and then how good we are at technical safety research. The last one is easiest to push on.

- It seems useful to build organizations that can make commitments that are credible to outsiders. This would allow the top AI actors to jointly commit that they meet a particular bar for safety, though this would also require monitoring and enforcing to be effective, which is hard to do without leaking information.

- We should expect slow takeoff, as Paul defines it. (I'm ignoring a lot of detail here.)

- We should focus on short timelines because we have more leverage over them, but the analogous argument for focusing on fast takeoff is not as compelling.

- Paul places 15% probability on human labor being obsolete in 10 years, and 35% on 20 years, but doesn't think he has done enough analysis that people should defer to him.

- Comparing current AI systems to humans seems like the wrong way to measure progress in AI. Instead, we should consider what we'd be able to do now if AI becomes comparable to humans in 10-20 years, and compare to that.

- We can decompose alignment into the problem of training an AI given a smarter overseer, and the problem of creating a sufficiently smart overseer. These roughly correspond to distillation and amplification respectively in IDA. (There's more discussion of IDA, but it should be pretty familiar to people who have engaged with IDA before.) Reactions fall into three camps: a) IDA is hopelessly difficult, b) IDA is focusing on far-away problems that will be easy by the time they are relevant, and c) optimistic about IDA.

- Very few people think about how to solve the full problem, that is, solve alignment in the limit of arbitrarily intelligent AI. MIRI doesn't think about the question because it seems obviously doomed to them, while the broader ML community wants to wait until we know how to build the system. The other approaches are debate (AN #5), which is very related to IDA, and inverse reinforcement learning (IRL). However, there are key problems with IRL, and research hasn't shed much light on the core of the problem.

- AI safety via debate also shares the insight of IDA that we can use AI to help us define a better training signal for AI. (There's discussion of how debate works, which again should be familiar to anyone who has engaged with it before.) The biggest difficulty is whether human judges are actually capable of judging debates on truthfulness and usefulness, as opposed to eg. persuasiveness.

- There are three main categories of work to be done on IDA and debate -- engineering work to actually build systems, philosophical work to determine whether we would be happy with the output of IDA or debate, and ML research that allows us to try out IDA or debate with current ML techniques.

- We should focus on prosaic AI, that is, powerful AI built out of current techniques (so no unknown unknowns). This is easier to work on since it is very concrete, and even if AGI requires new techniques, it will probably still use current ones, and so work aligning current techniques should transfer. In addition, if current techniques go further than expected, it would catch people by surprise, which makes this case more important.

- With sufficient computation, current ML techniques can produce general intelligence, because evolution did so, and current ML looks a lot like evolution.

- The biggest crux between Paul and MIRI is whether prosaic AI can be aligned.

- One problem that MIRI thinks is irresolvable is the problem of inner optimizers, where even if you optimize a perfectly constructed objective function that captures what we want, you may create a consequentialist that has good behavior in training environments but arbitrarily bad behavior in test environments. However, we could try to solve this through techniques like adversarial training.

- The other problem is that constructing a good objective is incredibly difficult, and existing schemes are hiding the magic somewhere (for example, in IDA, it would be hidden in the amplification step).

- Research of the kind that MIRI does will probably be useful for answering the philosophical questions around IDA and debate.

- Ought's Factored Cognition (AN #12) project is very related to IDA.

- Besides learning ML, and careers in strategy and policy, Paul is excited for people to start careers studying problems around IDA from a CS lens, a philosophical lens, or a psychological lens (in the sense of studying how humans decompose problems for IDA, or how they judge debates).

- Computer security problems that are about attacking AI (rather than defending against attacks in a world with AI) are often very related to long term AI alignment.

- It is important for safety researchers to be respectful of ML researchers, since they are justifiably defensive given the high levels of external interest in safety that's off-base.

- EAs often incorrectly think in terms of a system that has been given a goal to optimize.

- Probably the most important question in moral philosophy is what kinds of unaligned AI would be morally valuable, and how they compare to the scenario where we build an aligned AI.

- Among super weird backup plans, we could build unaligned AI that is in expectation as valuable as aligned AI, which allows us to sidestep AI risk. For example, we could simulate other civilizations that evolution would produce, and hand over control of the world to a civilization that would have done the same thing if our places were swapped. From behind a veil of ignorance of "civilizations evolution could have produced", or from a multiverse perspective, this has the same expected value as building an aligned AI (modulo the resource cost of simulations).

- We might face an issue where society worries about being bigoted towards AI and so gives them rights and autonomy, instead of focusing on the more important question of whether their values or goals align with ours.

Rohin's opinion: This is definitely worth reading or listening to if you haven't engaged much with Paul's work before; it will probably be my go-to reference to introduce someone to the approach. Even if you have, this podcast will probably help tie things together in a unified whole (at least, it felt that way to me). A lot of the specific things mentioned have been in the newsletter before, if you want to dig up my opinions on them.

Technical AI alignment

Technical agendas and prioritization

80K podcast with Paul Christiano (Paul Christiano and Rob Wiblin): Summarized in the highlights!

The Rocket Alignment Problem (Eliezer Yudkowsky) (summarized by Richard): Eliezer explains the motivations behind MIRI’s work using an analogy between aligning AI and designing a rocket that can get to the moon. He portrays our current theoretical understanding of intelligence as having massive conceptual holes; MIRI is trying to clarify these fundamental confusions. Although there’s not yet any clear path from these sorts of advances to building an aligned AI, Eliezer estimates our chances of success without them as basically 0%: it’s like somebody who doesn’t understand calculus building a rocket with the intention of manually steering it on the way up.

Richard's opinion: I think it’s important to take this post as an explication of MIRI’s mindset, not as an argument for that mindset. In the former role, it’s excellent: the analogy is a fitting one in many ways. It’s worth noting, though, that the idea of only having one shot at success seems like an important component, but isn’t made explicit. Also, it’d be nice to have more clarity about the “approximately 0% chance of success” without advances in agent foundations - maybe that credence is justified under a specific model of what’s needed for AI alignment, but does it take into account model uncertainty?

Interpretability

Stakeholders in Explainable AI (Alun Preece et al) (summarized by Richard): There are at least four groups for whom "explainable" AI is relevant: developers (who want AI to be easier to work with), theorists (who want to understand fundamental properties of AI), ethicists (who want AI to behave well) and users (who want AI to be useful). This has complicated work on explainability/interpretability: the first two groups focus on understanding how a system functions internally (described in this paper as "verification"), while the latter two focus on understanding what the system does ("validation"). The authors propose an alternative framing of interpretability, based on known knowns, unknown knowns, etc.

Training Machine Learning Models by Regularizing their Explanations (Andrew Slavin Ross)

Adversarial examples

Towards Deep Learning Models Resistant to Adversarial Attacks (Aleksander Madry et al) (summarized by Dan H): Madry et al.'s paper is a seminal work which shows that some neural networks can attain more adversarial robustness with a well-designed adversarial training procedure. They train networks on adversarial examples generated by several iterations of projected gradient descent rather than examples generated in one step (FGSM). Another crucial component is that they add slight noise to a clean example before generating a corresponding adversarial example. When trained long enough, some networks will attain more L-infinity adversarial robustness.

Dan H's opinion: What's notable is that this paper has survived third-party security analysis, so this is a solid contribution. However, as follow-up work has shown, its improvements apply only to L-infinity adversarial perturbations on small images.
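
To make the training procedure summarized above concrete, here is a minimal sketch of adversarial training with a noisy start and several projected gradient steps inside an L-infinity ball. It is not the paper's implementation: it assumes a logistic regression model in place of a neural network so that the input gradient can be written by hand, and the function names, step sizes and epsilon are all hypothetical choices made for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_loss(w, b, x, y):
    """Mean logistic loss for labels y in {0, 1}."""
    p = sigmoid(x @ w + b)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

def loss_grad_x(w, b, x, y):
    """Gradient of each example's logistic loss with respect to its input."""
    p = sigmoid(x @ w + b)
    return (p - y)[:, None] * w[None, :]

def pgd_attack(w, b, x, y, eps=0.1, step=0.02, iters=10, rng=None):
    """Several signed gradient ascent steps, projected back into the eps-ball."""
    rng = np.random.default_rng(0) if rng is None else rng
    # Slight noise on the clean example before attacking (the random start).
    x_adv = x + rng.uniform(-eps, eps, size=x.shape)
    for _ in range(iters):
        x_adv = x_adv + step * np.sign(loss_grad_x(w, b, x_adv, y))
        # Projection onto the L-infinity ball of radius eps around x.
        x_adv = np.clip(x_adv, x - eps, x + eps)
    return x_adv

def adversarial_train(x, y, eps=0.1, lr=0.5, epochs=200, seed=0):
    """Train on adversarial examples regenerated at every epoch."""
    rng = np.random.default_rng(seed)
    w, b = np.zeros(x.shape[1]), 0.0
    for _ in range(epochs):
        x_adv = pgd_attack(w, b, x, y, eps=eps, rng=rng)
        p = sigmoid(x_adv @ w + b)
        w = w - lr * x_adv.T @ (p - y) / len(y)
        b = b - lr * float(np.mean(p - y))
    return w, b
```

Setting iters=1 with no random start and a step equal to eps would give the one-step (FGSM) alternative that the summary contrasts this with.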

Towards the first adversarially robust neural network model on MNIST (Lukas Schott, Jonas Rauber et al) (summarized by Dan H): This recent pre-print claims to make MNIST classifiers more adversarially robust to different L-p perturbations. The basic building block in their approach is a variational autoencoder, one for each MNIST class. Each variational autoencoder computes the likelihood of the input sample, and this information is used for classification. They also demonstrate that binarizing MNIST images can serve as strong defense against some perturbations. They evaluate against strong attacks and not just the fast gradient sign method.

Dan H's opinion: This paper has generated considerable excitement among my peers. Yet inference time with this approach is approximately 100,000 times that of normal inference (10^4 samples per VAE * 10 VAEs). Also unusual is that the L-infinity "latent descent attack" result is missing. It is not clear why training a single VAE does not work. Also, could results improve by adversarially training the VAEs? As with all defense papers, it is prudent to wait for third-party reimplementations and analysis, but the range of attacks they consider is certainly thorough.
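
As a rough illustration of the "one generative model per class, classify by likelihood" idea, here is a small sketch. It is an assumption-heavy stand-in, not the paper's method: each class gets an independent Bernoulli pixel model instead of a variational autoencoder, and the function names and the 0.5 binarization threshold are hypothetical. The last lines simply redo the inference cost arithmetic quoted above.

```python
import numpy as np

def binarize(x, threshold=0.5):
    """Map pixel intensities in [0, 1] to {0, 1}."""
    return (np.asarray(x) > threshold).astype(float)

def fit_class_models(x, y, num_classes):
    """Fit one Bernoulli pixel model per class (stand-in for one VAE per class)."""
    models = []
    for c in range(num_classes):
        xc = x[y == c]
        # Laplace smoothing keeps every probability strictly inside (0, 1).
        models.append((xc.sum(axis=0) + 1.0) / (len(xc) + 2.0))
    return np.stack(models)

def log_likelihoods(models, x):
    """Log-likelihood of each input under each class model, shape (n, classes)."""
    return x @ np.log(models).T + (1 - x) @ np.log(1 - models).T

def classify(models, x):
    """Predict the class whose model assigns the input the highest likelihood."""
    return np.argmax(log_likelihoods(models, x), axis=1)

# Cost figure from the opinion above: 10^4 samples per VAE times 10 VAEs.
samples_per_vae = 10 ** 4
num_vaes = 10
relative_inference_cost = samples_per_vae * num_vaes
```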

Robustness

Bayesian Policy Optimization for Model Uncertainty (Gilwoo Lee et al)

Reinforcement Learning with Perturbed Rewards (Jingkang Wang et al)

Miscellaneous (Alignment)

Existential Risk, Creativity & Well-Adapted Science (Adrian Currie): From a brief skim, it seems like this paper defines "creativity" in scientific research, and argues that existential risk research needs to be creative. Research is creative if it is composed of "hot" searches, where we jump large distances from one proposed solution to another, with broad differences between these solutions, as opposed to "cold" searches, in which we primarily make incremental improvements, looking over a small set of solutions clustered in the neighborhood of existing solutions. The paper argues that research on existential risk needs to be creative, because many aspects of such research make it hard to analyze in a traditional way -- we can't perform controlled experiments of extinction, nor of the extreme circumstances under which it is likely; there are many interdependent parts that affect each other (since existential risks typically involve effects on many aspects of society), and there is likely to be a huge amount of uncertainty due to lack of evidence. As a result, we want to change the norms around existential risk research from the standard academic norms, which generally incentivize conservatism and "cold" searches. Table 1 provides a list of properties of academia that lead to conservatism, and asks that future work think about how we could mitigate these.

Rohin's opinion: While I'm not sure I agree with the reasons in this paper, I do think we need creativity and "hot" searches in technical AI safety, simply based on the level of confusion and uncertainty that we (or at least I) have currently. The properties in Table 1 seem particularly good as an initial list of things to target if we want to make creative research more likely.

AI strategy and policy

Countering Superintelligence Misinformation (Seth Baum) (summarized by Richard): Two ways to have better discussions about superintelligence are correcting misconceptions, and preventing misinformation from being spread in the first place. The latter might be achieved by educating prominent voices, creating reputational costs to misinformers (both individuals and companies), focusing media attention, etc. Research suggests the former is very difficult; strategies include addressing pre-existing motivations for believing misinformation and using advance warnings to 'inoculate' people against false claims.

Richard's opinion: I'm glad to see this systematic exploration of an issue that the AI safety community has consistently had to grapple with. I would have liked to see a more nuanced definition of misinformation than "information that is already clearly false", since it's not always obvious what qualifies as clearly false, and since there are many varieties of misinformation.

Prerequisites: Superintelligence Skepticism as a Political Tool

Other progress in AI

Exploration

The Dreaming Variational Autoencoder for Reinforcement Learning Environments (Per-Arne Andersen et al)

EMI: Exploration with Mutual Information Maximizing State and Action Embeddings (Hyoungseok Kim, Jaekyeom Kim et al)

Reinforcement learning

Near-Optimal Representation Learning for Hierarchical Reinforcement Learning (Ofir Nachum et al) (summarized by Richard): This paper discusses the use of learned representations in hierarchical RL. In the setting where a higher-level policy chooses goals which lower-level policies are rewarded for reaching, how bad is it when the goal representation isn't able to express all possible states? The authors define a metric for a representation's lossiness based on how close to optimal the policies which can be learned using that representation are, and prove that using a certain objective function, representations with bounded lossiness can be learned. They note a similarity between this objective function and those of mutual information estimators.

The authors test their learner on the MuJoCo Ant Maze environment, achieving compelling results.

Richard's opinion: This is a fairly mathematical paper and I didn't entirely follow the proofs, so I'm not sure how dependent they are on the particular choice of objective function. However, the empirical results using that objective seem very impressive, and significantly outperform alternative methods of learning representations.
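
To illustrate the setting, here is a toy sketch of a lower-level reward computed in a lossy goal representation. Everything in it is an assumption made for illustration: the representation simply drops state coordinates, and the reward is the negative Euclidean distance to the goal in representation space. It does not implement the paper's learned representation or its objective function.

```python
import numpy as np

def represent(state, keep_dims=(0, 1)):
    """Hypothetical lossy representation: keep only some state coordinates."""
    return np.asarray(state, dtype=float)[list(keep_dims)]

def low_level_reward(next_state, goal, keep_dims=(0, 1)):
    """Reward the lower-level policy for getting close to the goal,
    with distance measured in representation space."""
    diff = represent(next_state, keep_dims) - np.asarray(goal, dtype=float)
    return -float(np.linalg.norm(diff))

# A goal chosen by the higher-level policy, expressed in representation space.
goal = np.array([1.0, 2.0])

# Two different states that the representation cannot tell apart.
state_a = np.array([1.0, 2.0, 0.0])
state_b = np.array([1.0, 2.0, 5.0])
reward_a = low_level_reward(state_a, goal)
reward_b = low_level_reward(state_b, goal)
```

Here state_a and state_b earn the same reward even though they differ in the dropped coordinate, which is the kind of lossiness the summary asks about: the higher-level policy has no way to request one of these states over the other.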

Introducing Holodeck (joshgreaves32)

Generalization and Regularization in DQN (Jesse Farebrother et al)

CEM-RL: Combining evolutionary and gradient-based methods for policy search (Aloïs Pourchot et al)

Learning and Planning with a Semantic Model (Yi Wu et al)