Alignment Newsletter #39

Rohin Shah

Happy New Year!

Find all Alignment Newsletter resources here. In particular, you can sign up, or look through this spreadsheet of all summaries that have ever been in the newsletter.

Highlights

Constructing Unrestricted Adversarial Examples with Generative Models (Yang Song et al): This paper predates the unrestricted adversarial examples challenge (AN #24) and shows how to generate such unrestricted adversarial examples using generative models. As a reminder, most adversarial examples research is focused on finding imperceptible perturbations to existing images that cause the model to make a mistake. In contrast, unrestricted adversarial examples allow you to find any image that humans will reliably classify a particular way, where the model produces some other classification.

The key idea is simple -- train a GAN to generate images in the domain of interest, and then create adversarial examples by optimizing an image to simultaneously be "realistic" (as evaluated by the generator), while still being misclassified by the model under attack. The authors also introduce another term into the loss function that minimizes deviation from a randomly chosen noise vector -- this allows them to get diverse adversarial examples, rather than always converging to the same one.

They also consider a "noise-augmented" attack, where in effect they are running the normal attack they have, and then running a standard attack like FGSM or PGD afterwards. (They do these two things simultaneously, but I believe it's nearly equivalent.)

For evaluation, they generate adversarial examples with their method and check that humans on Mechanical Turk reliably classify the examples as a particular class. Unsurprisingly, their adversarial examples "break" all existing defenses, including the certified defenses, though to be clear existing defenses assume a different threat model where an adversarial example must be an imperceptible perturbation to one of a known set of images. You could imagine doing something similar by taking the imperceptible-perturbation attacks and raise the value of ϵ until it is perceptible -- but in this case the generated images are much less realistic.

Rohin's opinion: This is the clear first thing to try with unrestricted adversarial examples, and it seems to work reasonably well. I'd love to see whether adversarial training with these sorts of adversarial examples works as a defense against both this attack and standard imperceptible-perturbation attacks. In addition, it would be interesting to see if humans could direct or control the search for unrestricted adversarial examples.

Technical AI alignment

Technical agendas and prioritization

Why I expect successful alignment (Tobias Baumann): This post gives three arguments that we will likely solve the narrow alignment problem of having an AI system do what its operators intend it to do. First, advanced AI systems may be developed in such a way that the alignment problem doesn't even happen, at least as we currently conceive of it. For example, under the comprehensive AI services model, there are many different AI services that are superintelligent at particular tasks that can work together to accomplish complex goals, but there isn't a single unified agent to "align". Second, if it becomes obvious that alignment will be a serious problem, then we will devote a lot of resources to tackling the problem. We already see reward hacking in current systems, but it isn't sufficiently dangerous yet to merit the application of a lot of resources. Third, we have already come up with some decent approaches that seem like they could work.

Rohin's opinion: I generally agree with these arguments and the general viewpoint that we will probably solve alignment in this narrow sense. The most compelling argument to me is the second one, that we will eventually devote significant resources to the problem. This does depend on the crux that we see examples of these problems and how they could be dangerous before it is too late.

I also agree that it's much less clear whether we will solve other related problems, such as how to deal with malicious uses of AI, issues that arise when multiple superintelligent AI systems aligned with different humans start to compete, and how to ensure that humans have "good" values. I don't know if this implies that on the margin it is more useful to work on the related problems. It could be that these problems are so hard that there is not much that we can do. (I'm neglecting importance of the problem here.)

Integrative Biological Simulation, Neuropsychology, and AI Safety (Gopal Sarma et al): This paper argues that we can make progress on AI capabilities and AI safety through integrative biological simulation, that is, a composite simulation of all of the processes involved in neurons that allow us to simulate brains. In the near future, such simulations would be limited to simple organisms like Drosophila, but even these organisms exhibit behavior that we find hard to replicate today using our AI techniques, especially at the sample efficiency that the organisms show. On the safety side, even such small brains share many architectural features with human brains, and so we might hope that we could discover neuroscience-based methods for value learning that generalize well to humans. Another possibility would be to create test suites (as in AI Safety Gridworlds) for simulated organisms.

Rohin's opinion: I don't know how hard it would be to create integrative biological simulations, but it does strike me as very useful if we did have them. If we had a complete mechanistic understanding of how intelligence happens in biological brains (in the sense that we can simulate them), the obvious next step would be to understand how the mechanistic procedures lead to intelligence (in the same way that we currently try to understand why neural nets work). If we succeed at this, I would expect to get several insights into intelligence that would translate into significant progress in AI. However, I know very little about biological neurons and brains so take this with many grains of salt.

On the value learning side, it would be a good test of inverse reinforcement learning to see how well it could work on simple organisms, though it's not obvious what the ground truth is. I do want to note that this is specific to inverse reinforcement learning -- other techniques depend on uniquely human characteristics, like the ability to answer questions posed by the AI system.

Agent foundations

Robust program equilibrium (Caspar Oesterheld): In a prisoner's dilemma where you have access to an opponent's source code, you can hope to achieve cooperation by looking at how the opponent would perform against you. Naively, you could simply simulate what the opponent would do given your source code, and use that to make your decision. However, if your opponent also tries to simulate you, this leads to an infinite loop. The key idea of this paper is to break the infinite loop by introducing a small probability of guaranteed cooperation (without simulating the opponent), so that eventually after many rounds of simulation the recursion "bottoms out" with guaranteed cooperation. They explore what happens when applying this idea to the equivalents of FairBot/Tit-for-Tat strategies when you are simulating the opponent.

Preventing bad behavior

Penalizing Impact via Attainable Utility Preservation (Alex Turner): This post and the linked paper present [Attainable Utility Preservation] (AN #25) more simply. There are new experiments that show that AUP works on some of the AI Safety Gridworlds even when using a set of random utility functions, and compares this against other methods of avoiding side effects.

Rohin's opinion: While this is easier to read and understand, I think there are important points in the original post that do not come across, so I would recommend reading both. In particular, one of my core takeaways from AUP was that convergent instrumental subgoals could be avoided by penalizing increases in attainable utilities, and I don't think that comes across as well in this paper. This is the main thing that makes AUP different, and it's what allows it to avoid disabling the off switch in the Survival gridworld.

The fact that AUP works with random rewards is interesting, but I'm not sure it will generalize to realistic environments. In these gridworlds, there is usually a single thing that the agent is not supposed to do. It's very likely that several of the random rewards will care about that particular thing, which means that the AUP penalty will apply, so as long as full AUP would have solved the problem, AUP with random rewards would probably also solve it. However, in more realistic environments, there are many different things that the agent is supposed to avoid, and it's not clear how big a random sample of reward functions needs to be in order to capture all of them. (However, it does seem reasonably likely that if the reward functions are "natural", you only need a few of them to avoid convergent instrumental subgoals.)

Adversarial examples

Constructing Unrestricted Adversarial Examples with Generative Models (Yang Song et al): Summarized in the highlights!

Near-term concerns

Fairness and bias

Learning Not to Learn: Training Deep Neural Networks with Biased Data (Byungju Kim et al)

AI strategy and policy

AI Index 2018 Report (Yoav Shoham et al): Lots of data about AI. The report highlights how AI is global, the particular improvement in natural language understanding over the last year, and the limited gender diversity in the classroom. We also see the expected trend of huge growth in AI, both in terms of interest in the field as well as in performance metrics.

AI Now 2018 Report (Meredith Whittaker et al): See Import AI

12