Alignment Newsletter is a weekly publication with recent content relevant to AI alignment around the world. Find all Alignment Newsletter resources here. In particular, you can look through this spreadsheet of all summaries that have ever been in the newsletter.

Audio version here (may not be up yet).

Please note that while I work at DeepMind, this newsletter represents my personal views and not those of my employer.

HIGHLIGHTS

‘Solving for X?’ Towards a problem-finding framework to ground long-term governance strategies for artificial intelligence (Hin-Yan Liu et al) (summarized by Rohin): The typical workflow in governance research might go something like this: first, choose an existing problem to work on; second, list out possible governance mechanisms that could be applied to the problem; third, figure out which of these is best. We might call this the problem-solving approach. However, such an approach has several downsides:

1. Such an approach will tend to rely on the analogies and metaphors already attached to that problem, even when they are no longer appropriate.

2. If there are problems which aren’t obvious given current frameworks for governance, this approach won’t address them.

3. Usually, solutions under this approach build on earlier, allegedly similar problems and their solutions, leading to path-dependencies in the kinds of solutions that are sought. This makes it harder to identify and/or pursue new classes of solutions.

4. In such a framework it is hard to differentiate problems that are symptoms from problems that are root causes, since not much thought is put into comparisons across problems.

5. Framing our job as solving an existing set of problems lulls us into a false sense of security, as it makes us think we understand the situation better than we actually do (“if only we solved these problems, we’d be done; nothing else would come up”).

The core claim of this paper is that we should also invest in a problem-finding approach, in which we do not assume that we even know what the problem is, and instead try to figure it out in advance, before it arises. This distinction between problem-solving and problem-finding is analogous to the distinction between normal science and paradigm-changing science, between exploitation and exploration, and between “addressing problems” and “pursuing mysteries”. Including a problem-finding approach in our portfolio of research techniques helps mitigate the five disadvantages listed above. One particularly nice advantage is that it can help avoid the Collingridge dilemma: by searching for problems in advance, we can control them before they get entrenched in society (when they would be harder to control).

The authors then propose a classification of governance research, where levels 0 and 1 correspond to problem-solving and levels 2 and 3 correspond to problem-finding:

- Business as usual (level 0): There is no need to change the existing governance structures; they will naturally handle any problems that arise.

- Puzzle-solving (level 1): Aims to solve the problem at hand (something like deepfakes), possibly by changing the existing governance structures.

- Disruptor-finding (level 2): Searches for properties of AI systems that would be hard to accommodate with the existing governance tools, so that we can prepare in advance.

- Charting macrostrategic trajectories (level 3): Looks for crucial considerations about how AI could affect the trajectory of the world.

These are not just meant to apply to AGI. For example, autonomous weapons may make it easier to predict and preempt conflict, in which case rather than very visible drone strikes we may instead have “invisible” high-tech wars. This may lessen the reputational penalties of war, and so we may need to increase scrutiny of, and accountability for, this sort of “hidden violence”. This is a central example of a level 2 consideration.

The authors note that we could extend the framework even further to cases where governance research fails: at level -1, governance stays fixed and unchanging in its current form, either because reality is itself not changing, or because the governance got locked in for some reason. Conversely, at level 4, we are unable to respond to governance challenges, either because we cannot see the problems at all, or because we cannot comprehend them, or because we cannot control them despite understanding them.

Rohin's opinion: One technique I like a lot is backchaining: starting from the goal you are trying to accomplish, and figuring out what actions or intermediate subgoals would most help accomplish that goal. I’ve spent a lot of time doing this sort of thing with AI alignment. This paper feels like it is advocating the same for AI governance, but also gives a bunch of concrete examples of what this sort of work might look like. I’m hoping that it inspires a lot more governance work of the problem-finding variety; this does seem quite neglected to me right now.

One important caveat to all of this is that I am not a governance researcher and don’t have experience actually trying to do such research, so it’s not unlikely that even though I think this sounds like good meta-research advice, it is actually missing the mark in a way I failed to see.

While I do recommend reading through the paper, I should warn you that it is rather dense and filled with jargon, at least from my perspective as an outsider.

TECHNICAL AI ALIGNMENT


ITERATED AMPLIFICATION

Epistemology of HCH (Adam Shimi) (summarized by Rohin): This post identifies and explores three perspectives one can take on HCH (AN #34):

1. Philosophical abstraction: In this perspective, HCH is an operationalization of the concept of one’s enlightened judgment.

2. Intermediary alignment scheme: Here we consider HCH as a scheme that arguably would be aligned if we could build it.

3. Model of computation: By identifying the human in HCH with some computational primitive (e.g. arbitrary polynomial-time algorithms), we can think of HCH as a theoretical model of the computation that can be done with that primitive (a toy sketch of this perspective follows below).
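As a toy illustration of that third perspective (my own construction, not from the post): treat the human as a primitive function that either answers a question directly or poses subquestions to further copies of itself, and HCH as the recursion over that primitive.

```python
# Toy sketch of HCH as a model of computation (my construction, not the post's).
# The "human" primitive is any function that, given a question and a way to ask
# subquestions, returns an answer; HCH is the recursion over that primitive.

def hch(question, human, budget=100):
    """Answer `question` by letting `human` consult further copies of itself."""
    def ask(subquestion):
        if budget <= 0:
            raise RecursionError("consultation budget exhausted")
        return hch(subquestion, human, budget - 1)
    return human(question, ask)

# Hypothetical example primitive: a "human" who can only add a couple of numbers
# at a time, but can delegate summing sublists to other copies.
def summing_human(numbers, ask):
    if len(numbers) <= 2:
        return sum(numbers)
    mid = len(numbers) // 2
    return ask(numbers[:mid]) + ask(numbers[mid:])

print(hch([1, 2, 3, 4, 5, 6], summing_human))  # 21
```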

MESA OPTIMIZATION

Fixing The Good Regulator Theorem (John Wentworth) (summarized by Rohin): Consider a setting in which we must extract information from some data X to produce model M, so that we can later perform some task Z in a system S while only having access to M. We assume that the task depends only on S and not on X (except inasmuch as X affects S). As a concrete example, we might consider gradient descent extracting information from a training dataset (X) and encoding it in neural network weights (M), which can later be used to classify new test images (Z) taken in the world (S) without looking at the training dataset.

The key question: when is it reasonable to call M a model of S?

1. If we assume that this process is done optimally, then M must contain all information in X that is needed for optimal performance on Z.

2. If we assume that every aspect of S is important for optimal performance on Z, then M must contain all information about S that it is possible to get. Note that it is usually important that Z contains some new input (e.g. test images to be classified) to prevent M from hardcoding solutions to Z without needing to infer properties of S.

3. If we assume that M contains no more information than it needs, then it must contain exactly the information about S that can be deduced from X.

It seems reasonable to say that in this case we constructed a model M of the system S from the source X "as well as possible". This post formalizes this conceptual argument and presents it as a refined version of the Good Regulator Theorem.
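As a very rough compression of that argument into symbols (my notation, and my guess at the precise minimality condition; the post's own statement is more careful):

```latex
% Informal sketch, not the post's notation. X: data, S: system, Z: task (with
% fresh inputs at deployment), M = f(X): the extracted model.
\[
  f^{*} \in \operatorname*{arg\,min}_{f \,\in\, \mathcal{F}_{\mathrm{opt}}} I\big(X;\, f(X)\big),
  \qquad
  \mathcal{F}_{\mathrm{opt}} = \big\{\, f \;:\; \text{a policy seeing only } f(X) \text{ performs optimally on } Z \,\big\}.
\]
% If every aspect of S matters for optimal performance on Z, then M = f*(X)
% carries exactly the information about S that can be deduced from X, i.e. it
% behaves like the Bayesian posterior over S given X.
```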

Returning to the neural net example, this argument suggests that since neural networks are trained on data from the world, their weights will encode information about the world and can be thought of as a model of the world.

PREVENTING BAD BEHAVIOR

Shielding Atari Games with Bounded Prescience (Mirco Giacobbe et al) (summarized by Rohin): In order to study agents trained for Atari, the authors use the internals of the ALE simulator to write down several safety properties that agents should satisfy. They then test several agents trained with deep RL algorithms to see how well they satisfy these properties. They find that the agents satisfy only 4 of the 43 properties all the time, whereas for 24 of the properties, all agents fail at least some of the time (and frequently they fail on every single rollout tested).

This even happens for some properties that should be easy to satisfy. For example, in the game Assault, the agent loses a life if its gun ever overheats, but avoiding this is trivial: just don’t use the gun when the display shows that the gun is about to overheat.

The authors implement a “bounded shielding” approach, which basically simulates actions up to N timesteps in the future, and then only takes actions from the ones that don’t lead to an unsafe state (if that is possible). With N = 1 this is enough to avoid the failure described above with Assault.
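Here is a rough sketch of what such a shield might look like (not the paper's implementation; the simulator-cloning API and the is_unsafe predicate are hypothetical stand-ins, and for simplicity this version follows the agent's own policy after the first simulated step rather than exploring all action sequences):

```python
# Rough sketch of bounded shielding on top of a trained policy. Assumes a
# hypothetical simulator API with clone_state/restore_state/step/legal_actions,
# a policy with an act(state) method, and a user-supplied is_unsafe(state)
# predicate encoding one of the safety properties.

def shielded_action(env, policy, state, is_unsafe, horizon=1):
    """Return the policy's preferred action unless a simulated rollout of length
    `horizon` starting with it reaches an unsafe state; in that case fall back
    to some other action whose simulated rollout stays safe, if one exists."""

    def rollout_hits_unsafe(first_action):
        snapshot = env.clone_state()   # save the simulator state
        action, unsafe = first_action, False
        for _ in range(horizon):
            obs, _reward, done, _info = env.step(action)
            if is_unsafe(obs):
                unsafe = True
                break
            if done:
                break
            action = policy.act(obs)   # simplification: follow the policy afterwards
        env.restore_state(snapshot)    # undo the simulated steps
        return unsafe

    preferred = policy.act(state)
    if not rollout_hits_unsafe(preferred):
        return preferred
    for action in env.legal_actions():
        if action != preferred and not rollout_hits_unsafe(action):
            return action
    return preferred  # no safe alternative found within the horizon
```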

Rohin's opinion: I liked the analysis of what safety properties agents failed to satisfy, and the fact that agents sometimes fail the “obvious” or “easy” safety properties suggests that the bounded shielding approach can actually be useful in practice. Nonetheless, I still prefer the approach of finding an inductive safety invariant (AN #124), as it provides a guarantee of safety throughout the episode, rather than only for the next N timesteps.

ADVERSARIAL EXAMPLES

Adversarial images for the primate brain (Li Yuan et al) (summarized by Rohin) (H/T Xuan): It turns out that you can create adversarial examples for monkeys! The task: classifying a given face as coming from a monkey vs. a human. The method is pretty simple: train a neural network to predict what the monkeys would do, find adversarial examples for that network, and then show them to the monkeys. These examples don’t transfer perfectly, but they transfer enough that it seems reasonable to call them adversarial examples for the monkeys. In fact, they also cause humans to misclassify reasonably often (though not as often as the monkeys), when given about 1 second to classify (a fairly long amount of time). Still, it is clear that the monkeys and humans are much more behaviorally robust than the neural networks.
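To make the recipe concrete, here is a minimal sketch of the grey-box idea using a one-step gradient attack against a surrogate network (the surrogate, data, and epsilon are placeholders, and the paper's actual attack and models differ):

```python
# Minimal sketch: attack a surrogate network trained to predict the target
# system's responses, then test whether the perturbation transfers. FGSM is
# used here only for brevity; it is not necessarily the paper's attack.
import torch
import torch.nn.functional as F

def fgsm_attack(surrogate, images, labels, epsilon=0.03):
    """One-step attack: perturb `images` to increase the surrogate's loss on `labels`."""
    images = images.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(surrogate(images), labels)
    loss.backward()
    adversarial = images + epsilon * images.grad.sign()
    return adversarial.clamp(0.0, 1.0).detach()

# Hypothetical usage: `surrogate` maps face images to {monkey, human} logits;
# the resulting images would then be shown to the actual monkeys and humans.
# adv = fgsm_attack(surrogate, monkey_faces, monkey_labels)
```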

Rohin's opinion: First, a nitpick: the adversarially modified images are pretty significantly modified, such that you now have to wonder whether we should say that the humans are getting the answer “wrong”, or that the image has been modified meaningfully enough that there is no longer a right answer (as is arguably the case with the infamous cat-dog). The authors do show that e.g. Gaussian noise of the same magnitude doesn't degrade human performance, which is a good sanity check, but doesn’t negate this point.

Nonetheless, I liked this paper -- it seems like good evidence that neural networks and biological brains are picking up on similar features. My preferred explanation is that these are the “natural” features for our environment, though other explanations are possible, e.g. perhaps brains and neural networks are sufficiently similar architectures that they do similar things. Note however that they do require a grey-box approach, where they first train the neural network to predict the monkey's neuronal responses. When they instead use a neural network trained to classify human faces vs. monkey faces, the resulting adversarial images do not cause misclassifications in monkeys. So they do need to at least finetune the final layer for this to work, and thus there is at least some difference between the neural networks and monkey brains.

FORECASTING

2020 Survey of Artificial General Intelligence Projects for Ethics, Risk, and Policy (McKenna Fitzgerald et al) (summarized by Flo): This is a survey of AGI research and development (R&D) projects, based on public information like publications and websites. The survey finds 72 such projects active in 2020 compared to 70 projects active in 2017. This corresponds to 15 new projects and 13 projects that shut down since 2017. Almost half of the projects are US-based (fewer than in 2017!), and most of the rest are based in US-allied countries. Around half of the projects publish open-source code. Many projects are interconnected via shared personnel or joint projects, and only a few have identifiable military connections (fewer than in 2017). All of these factors might facilitate cooperation around safety.

The projects form three major clusters: 1) corporate projects active on AGI safety, 2) academic projects not active on AGI safety, and 3) small corporations not active on AGI safety. Most of the projects are rather small and project size varies a lot, with the largest projects having more than 100 times as many employees as the smallest ones. While the share of projects with a humanitarian focus has increased to more than half, only a small but growing number is active on safety. Compared to 2017, the share of corporate projects has increased, and there are fewer academic projects. While academic projects are more likely to focus on knowledge expansion rather than humanitarian goals, corporate projects seem more likely to prioritize profit over public interest and safety. Consequently, corporate governance might be especially important.

Flo's opinion: These kinds of surveys seem important to conduct, even if they don't always deliver very surprising results. That said, I was surprised by the large number of small AGI projects (for which I expect the chances of success to be tiny) and the overall small number of Chinese AGI projects.

How The Hell Do We Create General-Purpose Robots? (Sergey Alexashenko) (summarized by Rohin): A general-purpose robot (GPR) is one that can execute simple commands like “unload the dishwasher” or “paint the wall”. This post outlines an approach to get to such robots, and estimates how much it would cost to get there.

On the hardware side, we need to have hardware for the body, sensors, and brain. The body is ready; the Spot robot from Boston Dynamics seems like a reasonable candidate. On sensors, we have vision, hearing and lidar covered; however, we don’t have great sensors for touch yet. That being said, it seems possible to get by with bad sensors for touch, and compensate with vision. Finally, for the brain, even if we can’t put enough chips on the robot itself, we can use more compute via the cloud.

For software, in principle a large enough neural network should suffice; all of the skills involved in GPRs have already been demonstrated by neural nets, just not as well as would be necessary. (In particular, we don’t need to posit AGI.) The big issue is that we don’t know how to train such a network. (We can’t train in the real world, as that is way too slow.)

With a big enough investment, it seems plausible that we could build a simulator in which the robot could learn. The simulator would have to be physically realistic and diverse, which is quite a challenge. But we don’t have to write down physically accurate models of all objects: instead, we can virtualize objects. Specifically, we interact with an object for a couple of minutes, and then use the resulting data to build a model of the object in our simulation. (You could imagine an AlphaFold-like system that does this very well.)

The author then runs some Fermi estimates, arriving at a cost of around $42 billion for the R&D in such a project (which might not succeed), and concludes that this would clearly be worth it given the huge economic benefits.

Rohin's opinion: This outline seems pretty reasonable to me. There are a lot of specific points to nitpick with; for example, I am not convinced that we can just use cloud compute. It seems plausible that manipulation tasks require quick, iterative feedback, where the latency of cloud compute would be unacceptable. (Indeed, the quick, iterative feedback of touch is exactly why it is such a valuable sensor.) Nonetheless, I broadly like the outlined plan and it feels like these sorts of nitpicks are things that we will be able to solve as we work on the problem.

I am more skeptical of the cost estimate, which seems pretty optimistic to me. The author basically took existing numbers and then multiplied them by some factor for the increased hardness; I think those factors are too low (at least for the AI aspects; I don’t know about the robot hardware aspects), and I think there are probably lots of other significant “invisible” costs that aren’t being counted here.

NEWS

Postdoc role at CHAI (CHAI) (summarized by Rohin): The Center for Human-Compatible AI (where I did my PhD) is looking for postdocs. Apply here.

Apply to EA Funds now (Jonas Vollmer) (summarized by Rohin): EA Funds applications are open until the deadline of March 7. This includes the Long-Term Future Fund (LTFF), which often provides grants to people working on AI alignment. I’m told that LTFF is constrained by high-quality applications, and that applying only takes a few hours, so it is probably best to err on the side of applying.

FEEDBACK

I'm always happy to hear feedback; you can send it to me, Rohin Shah, by replying to this email.

PODCAST

An audio podcast version of the Alignment Newsletter is available. This podcast is an audio version of the newsletter, recorded by Robert Miles.
