[AN #95]: A framework for thinking about how to make AI go well

Rohin Shah

Alignment Newsletter is a weekly publication with recent content relevant to AI alignment around the world. Find all Alignment Newsletter resources here. In particular, you can look through this spreadsheet of all summaries that have ever been in the newsletter.

Audio version here (may not be up yet).

HIGHLIGHTS

Current Work in AI Alignment (Paul Christiano) (summarized by Rohin): In this talk (whose main slide we covered before (AN #74)), Paul Christiano explains how he decomposes the problem of beneficial AI:

1. At the top level, "make AI go well" is decomposed into making AI competent, making AI aligned, and coping with the impacts of AI. Paul focuses on the alignment part, which he defines as building AI systems that are trying to do what we want. See Clarifying "AI Alignment" (AN #33) and my comment on it. Paul considers many problems of competence as separate from alignment, including understanding humans well, and most reliability / robustness work.

2. Within alignment, we can consider the concept of an "alignment tax": the cost incurred by insisting that we only deploy aligned AI. One approach is to help pay the alignment tax, for example by convincing important actors that they should care about alignment, or by adopting agreements that make it easier to coordinate to pay the tax, as with the OpenAI Charter (AN #2). Technical AI safety research, on the other hand, can help reduce the alignment tax by creating better aligned AI systems (which consequently incur less cost than before).

3. With alignment tax reduction, we could either try to advance current alignable algorithms (making them more competent, and so reducing their tax), or make existing algorithms alignable. It would be particularly nice to take some general class of algorithms (such as deep reinforcement learning) and figure out how to transform them to make them alignable, such that improvements to the algorithms automatically translate to improvements in the alignable version. This is what Paul works on.

4. The next layer is simply a decomposition of possible algorithms we could try to align, e.g. planning, deduction, and learning. Paul focuses on learning.

5. Within aligned learning, we can distinguish between outer alignment (finding an objective that incentivizes aligned behavior) and inner alignment (ensuring that the trained agent robustly pursues the aligned objective). Paul works primarily on outer alignment, but has written about inner alignment (AN #81).

6. Within outer alignment, we could either consider algorithms that learn from a teacher, such as imitation learning or preference inference, or we could find algorithms that perform better than the teacher (as would be needed for superhuman performance). Paul focuses on the latter case.

7. To go beyond the teacher, you could extrapolate beyond what you've seen (i.e. generalization), do some sort of ambitious value learning (AN #31), or build a better teacher. Paul focuses on the last case, and thinks of amplification as a way to achieve this.

Rohin's opinion: I really like this decomposition. I already laid out most of my thoughts back when I summarized just the main slide (AN #74); I still endorse them.

TECHNICAL AI ALIGNMENT

ITERATED AMPLIFICATION

Unsupervised Question Decomposition for Question Answering (Ethan Perez et al) (summarized by Zach): Existing methods are proficient at simple question answering (QA). These simple questions are called single-hop and can be answered with a single yes/no or an underlined passage in the text. However, progress on the more difficult task of multi-hop QA lags behind. This paper introduces a method that decomposes hard multi-hop questions into easier single-hop questions that existing QA systems can answer. Since collecting labeled decompositions is hard, the authors introduce a pseudo-decomposition, in which each multi-hop question is matched with similar single-hop questions while making sure the single-hop questions are diverse. Following this, the model is trained to map multi-hop questions to simpler subquestions using unsupervised sequence-to-sequence learning (as they found the supervised version performed worse). They show a large improvement over the baseline on the popular HotpotQA dataset, including a large improvement on out-of-domain questions, due to the ability of sub-questions to help gather supporting facts that can be used to answer questions.
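
As a rough illustration of the pseudo-decomposition step, the sketch below picks the pair of simple questions that are each similar to the hard question but dissimilar to each other. The bag-of-words embedding, the scoring rule, and the example questions are all assumptions made for illustration; this is one way to read "similar but diverse", not the authors' implementation.

```python
import math
from collections import Counter
from itertools import combinations


def embed(text):
    """Unit-length bag-of-words vector (a stand-in for a learned embedding)."""
    counts = Counter(text.lower().replace("?", "").split())
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {word: c / norm for word, c in counts.items()}


def dot(u, v):
    return sum(weight * v.get(word, 0.0) for word, weight in u.items())


def pseudo_decompose(hard_question, simple_questions):
    """Return the pair of simple questions with the best similar-but-diverse score."""
    q = embed(hard_question)
    vectors = [embed(s) for s in simple_questions]

    def score(pair):
        i, j = pair
        # Reward similarity to the hard question, penalize similarity within the pair.
        return dot(q, vectors[i]) + dot(q, vectors[j]) - dot(vectors[i], vectors[j])

    i, j = max(combinations(range(len(simple_questions)), 2), key=score)
    return simple_questions[i], simple_questions[j]


# Hypothetical example questions.
hard = "Who directed the film that won best picture in 1998?"
pool = [
    "Which film won best picture in 1998?",
    "Which film won the most awards in 1998?",
    "Who directed the film Titanic?",
    "What is the capital of France?",
]
print(pseudo_decompose(hard, pool))
```

The penalty term is what keeps the two near-duplicate "which film won" questions from being chosen together.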

Zach's opinion: A core feature of this paper is the unsupervised approach to producing question decompositions. By doing this, it's possible to augment the dataset significantly by question-crawling the datasets, which helps explain why the model has performance on par with supervised approaches. Moreover, looking at a few decomposition examples from the model seems to indicate that relevant sub-questions are being discovered. It's worth noting that decompositions with more than two questions are unlikely due to the specific loss used in the main paper. In the appendix, the authors experiment with a different loss for the pseudo-decomposition that allows more questions in the decomposition, but it performs slightly worse than the original loss. This makes me wonder whether or not such a procedure would be useful if used recursively to create sub-sub-questions. Overall, I think the decomposition is useful for both downstream processing and interpretation.

Rohin's opinion: The capabilities of methods like iterated amplification depend on the ability to solve hard questions by decomposing them into simpler questions that we already know how to answer, and then combining the results appropriately. This paper demonstrates that even a very basic unsupervised approach ("decompose into the most similar simpler questions") to decomposition can work quite well, at least for current AI systems.

In private correspondence, Ethan suggested that in the long term a semi-supervised approach would probably work best, which agrees with my intuitions.

AGENT FOUNDATIONS

An Orthodox Case Against Utility Functions (Abram Demski) (summarized by Rohin): How might we theoretically ground utility functions? One approach could be to view the possible environments as a set of universe histories (e.g. a list of the positions of all quarks, etc. at all times), and a utility function as a function that maps these universe histories to real numbers. We might want this utility function to be computable, but this eliminates some plausible preferences we might want to represent. For example, in the procrastination paradox, the subject prefers to push the button as late as possible, but disprefers never pressing the button. If the history is infinitely long, no computable function can know for sure that the button was never pressed: it's always possible that it was pressed at some later day.

Instead, we could use subjective utility functions, which are defined over events, which is basically anything you can think about (i.e. it could be chairs and tables, or quarks and strings). This allows us to have utility functions over high level concepts. In the previous example, we can define an event "never presses the button", and reason about that event atomically, sidestepping the issues of computability.

We could go further and view probabilities as subjective (as in the Jeffrey-Bolker axioms), and only require that our beliefs are updated in such a way that we cannot be Dutch-booked. This is the perspective taken in logical induction.
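
For readers unfamiliar with Dutch books, here is a minimal worked example. It assumes an agent that is willing to buy a bet paying out `stake` for a price equal to its credence times `stake`; the function name and the numbers are hypothetical. If the agent's credences in A and in not-A sum to more than 1, buying both bets loses money no matter what happens.

```python
def settle_bets(credence_a, credence_not_a, stake=1.0):
    """Net result for an agent that buys one bet on A and one bet on not-A.

    Each bet costs credence * stake and pays `stake` if it wins.
    Exactly one of the two bets wins in either world.
    """
    cost = (credence_a + credence_not_a) * stake
    return {
        "A happens": stake - cost,
        "A does not happen": stake - cost,
    }


incoherent = settle_bets(0.6, 0.6)   # credences sum to 1.2
coherent = settle_bets(0.6, 0.4)     # credences sum to 1.0
print(incoherent)
print(coherent)
```

With the incoherent credences the agent pays 1.2 and receives 1.0 in both worlds, a guaranteed loss of about 0.2; with coherent credences the same pair of bets breaks even.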

INTERPRETABILITY

Neuron Shapley: Discovering the Responsible Neurons (Amirata Ghorbani et al) (summarized by Robert): This paper presents a novel method, Neuron Shapley, that uses the Shapley value framework to measure the importance of different neurons in determining an arbitrary metric of the neural net output. (Shapley values have been applied to machine learning before to measure the importance of features to a model's output, but here the authors use them to calculate neuron importance.) Due to several novel approaches and optimisations in calculating these Shapley values, the top k most responsible neurons (k ~ 30) can be feasibly found for large networks such as Inception-v3.

The authors demonstrate that finding these neurons enables the performance of model surgery. Removing the top 30 neurons that contribute to accuracy completely destroys the accuracy, whereas in expectation removing 30 neurons at random from the network barely moves the accuracy at all. Since the method can be applied to an arbitrary metric, this kind of surgery can be performed for other metrics we care about. For example, removing the neurons which are most responsible for vulnerability to adversarial attacks makes the network more robust, and removing the neurons most responsible for the class-accuracy imbalance (a fairness metric) makes the classes much more even, while only reducing the overall accuracy by a small amount.
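
To make the underlying quantity concrete, the sketch below estimates Shapley values by plain permutation sampling on a toy metric. It does not include the paper's optimisations, and the four "neurons" and the `toy_accuracy` function are hypothetical stand-ins for a real network and its evaluation.

```python
import random


def shapley_by_permutation(players, metric, num_permutations=2000, seed=0):
    """Average each player's marginal contribution over random orderings."""
    rng = random.Random(seed)
    totals = {p: 0.0 for p in players}
    for _ in range(num_permutations):
        order = list(players)
        rng.shuffle(order)
        active = set()
        previous = metric(active)
        for p in order:
            active.add(p)
            current = metric(active)
            totals[p] += current - previous   # marginal contribution of p
            previous = current
    return {p: total / num_permutations for p, total in totals.items()}


def toy_accuracy(active):
    """Hypothetical metric: accuracy as a function of which neurons are kept."""
    accuracy = 0.1                            # chance level with everything removed
    if "n0" in active and "n1" in active:
        accuracy += 0.6                       # n0 and n1 only help together
    if "n2" in active:
        accuracy += 0.2
    return accuracy                           # n3 never matters


values = shapley_by_permutation(["n0", "n1", "n2", "n3"], toy_accuracy)
ranked = sorted(values, key=values.get, reverse=True)
print(ranked, values)
```

In this toy setting the interacting pair `n0` and `n1` split their joint contribution, `n2` is credited with its own effect, and `n3` gets zero, which is the kind of ranking one would use to choose neurons for surgery.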

Robert's opinion: It's nice to see an interpretability method with demonstrable and measurable use cases. Many methods aim at improving insight, but often don't demonstrate this aim; I think this paper does this well in showing how its method can be used for model surgery. I think methods that allow us to investigate and understand individual neurons and their contributions are useful in building up a fine grained picture of how neural networks work. This links to previous work such as Network Dissection as well as the recent Circuits Thread on Distill, and I'd love to see how these methods interact. They all give different kinds of understanding, and I think it would be interesting to see if given the results of the circuits tools we were able to predict which neurons were most responsible for different metrics (Neuron Shapley) or aligned to which relevant features (Network Dissection).

Visualizing Neural Networks with the Grand Tour (Mingwei Li et al) (summarized by Flo): Visualizing a complete dataset instead of single input examples is helpful when we want to analyze the relationships between different input examples and how their classification changes during training, as we can do so by looking at a single video.

To compare different methods for visualizing how the dataset is classified, the authors use an example on MNIST in which the network learns to classify the digits 1 and 7 in an almost discrete fashion during particular epochs. They find that one problem with nonlinear dimensionality reduction methods like t-SNE and UMAP is that changes to a subset of the dataset can strongly affect how unchanged data points are represented. Then they compare this to the Grand Tour, a classical technique that projects the data into two dimensions from varying points of view. As projections are linear in the input variables, it is rather easy to reason about how changes in the data affect this visualization, and the times at which the classes 1 and 7 are learnt are indeed quite salient in their example. Another advantage of this method is that confusion between two specific classes can be identified more easily, as the corresponding data points will be projected onto the line connecting the clusters for these classes. A similar approach can be taken on a network's hidden layers to identify the layer in which different classes become clearly distinguishable. They find that they can identify adversarial examples generated by FGSM by looking at the second to last layer, where the adversarial examples form a cluster distinct from the real images.

As the Grand Tour involves varying rotations, it is basically unaffected by rotations of the data. The authors argue that this is a feature, as rotations are small changes to the data and should not have a large effect on the visualization.
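
The linearity property can be seen in a small sketch of a Grand-Tour-style projection: rotate the data by a slowly changing rotation, then keep two coordinates. The way the rotation is built here (one plane rotation per coordinate pair, with arbitrary speeds) is an assumption for illustration and not necessarily how the authors animate their tours.

```python
import math
from itertools import combinations


def grand_tour_rotation(dim, t, speeds=None):
    """Return a dim x dim rotation matrix for time t, built from plane rotations."""
    planes = list(combinations(range(dim), 2))
    if speeds is None:
        # Hypothetical choice: a different angular speed for each coordinate plane.
        speeds = [1.0 + 0.37 * k for k in range(len(planes))]
    rotation = [[1.0 if i == j else 0.0 for j in range(dim)] for i in range(dim)]
    for (a, b), speed in zip(planes, speeds):
        c, s = math.cos(speed * t), math.sin(speed * t)
        for row in rotation:
            # Right-multiply by a rotation acting on columns a and b.
            row[a], row[b] = c * row[a] - s * row[b], s * row[a] + c * row[b]
    return rotation


def project(point, rotation):
    """Rotate the point, then keep the first two coordinates as the 2D view."""
    rotated = [sum(r * x for r, x in zip(row, point)) for row in rotation]
    return rotated[0], rotated[1]


# One frame per time step gives the "video" of the dataset.
data = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.5, 0.5, 0.0, 0.0]]
frames = [[project(p, grand_tour_rotation(4, 0.05 * step)) for p in data]
          for step in range(100)]
```

Because every frame is a linear map, the third point (the midpoint of the first two) is drawn on the line between them in every frame, which is the property that makes confusion between two classes easy to spot.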

Flo's opinion: The dataset perspective on visualization seems pretty useful as a quick diagnostic tool for practitioners, but less useful than feature visualization for a detailed understanding of a model. While I think that it is good to highlight invariances, I am not convinced that rotational invariance is actually desirable for visualizing intermediate layers of a neural network, as most nonlinearities are strongly affected by rotations.

FORECASTING

Atari early (Katja Grace) (summarized by Rohin): With DeepMind's Agent57 (summarized below), it seems that it is feasible to outperform professional game testers on all Atari games using no game-specific knowledge. Interestingly, in a 2016 survey, the median response put a small chance (10%) on this being feasible by 2021, and a medium chance (50%) of being feasible by 2026.

OTHER PROGRESS IN AI

REINFORCEMENT LEARNING

Agent57: Outperforming the human Atari benchmark (Adrià Puigdomènech Badia, Bilal Piot, Steven Kapturowski, Pablo Sprechmann et al) (summarized by Sudhanshu): This blogpost and its associated arxiv publication present Agent57, DeepMind's latest RL agent created for the purpose of achieving human-level performance in a suite of 57 Atari games. Notably, Agent57 is the first agent that is able to surpass average human performance, as measured by Human Normalized Score or HNS, on every individual game in the suite, with the same set of hyperparameters. The blogpost details the evolution of DeepMind's Atari agents from DQN up to Agent57, and the paper elaborates on the improvements made in Agent57.
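
For reference, the Human Normalized Score rescales a raw game score so that random play maps to 0 and the human reference maps to 1. The sketch below uses made-up scores for hypothetical games, not values from the paper.

```python
def human_normalized_score(agent, random_play, human):
    """0.0 means random-level play, 1.0 means human-level play."""
    return (agent - random_play) / (human - random_play)


# Hypothetical raw scores: (agent, random play, human reference).
raw_scores = {
    "game_a": (160.0, 10.0, 110.0),
    "game_b": (900.0, 100.0, 1100.0),
}
hns = {game: human_normalized_score(*scores) for game, scores in raw_scores.items()}
beats_human_everywhere = all(score > 1.0 for score in hns.values())
print(hns, beats_human_everywhere)
```

Requiring a score above 1 on every game is a much stricter criterion than a high mean or median across games, since a single weak game fails it.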

Specifically, Agent57 builds on a recent agent 'Never Give Up' (NGU), which itself augments R2D2 with episodic memory for curiosity-driven exploration. Agent57 introduces (i) a new parameterization of state-action value function that decomposes into intrinsic and extrinsic rewards, and (ii) a meta-controller which selects which of its numerous distributed policies to prioritize during learning, allowing the agent to control the exploration/exploitation trade-off.
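
The meta-controller idea can be sketched as a bandit over policies: each arm is a policy with a different exploration setting, and the reward is the episode return that policy achieved. The code below uses a plain UCB1 bandit as a stand-in, with hypothetical mean returns; the controller in the paper may differ in its details.

```python
import math
import random


class UCBMetaController:
    """UCB1 bandit choosing among policies indexed 0..num_policies-1."""

    def __init__(self, num_policies, exploration=1.0):
        self.counts = [0] * num_policies
        self.means = [0.0] * num_policies
        self.exploration = exploration

    def select(self):
        for arm, count in enumerate(self.counts):
            if count == 0:
                return arm                     # try every policy once first
        total = sum(self.counts)

        def upper_bound(arm):
            bonus = math.sqrt(2 * math.log(total) / self.counts[arm])
            return self.means[arm] + self.exploration * bonus

        return max(range(len(self.counts)), key=upper_bound)

    def update(self, arm, episode_return):
        self.counts[arm] += 1
        # Incremental running mean of the returns seen for this policy.
        self.means[arm] += (episode_return - self.means[arm]) / self.counts[arm]


def simulate(num_episodes=1000, seed=0):
    rng = random.Random(seed)
    true_means = [0.2, 0.5, 0.8]               # hypothetical mean return per policy
    controller = UCBMetaController(len(true_means))
    for _ in range(num_episodes):
        arm = controller.select()
        controller.update(arm, rng.gauss(true_means[arm], 0.1))
    return controller.counts


counts = simulate()
print(counts)
```

Over time the controller concentrates episodes on the policy that currently yields the highest return while still occasionally sampling the others, which is one way to trade off exploration against exploitation.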

Sudhanshu's opinion: On the one hand, this work feels like the achievement of an important milestone in DeepMind's ongoing research agenda towards building more general agents. On the other hand, it has the flavour of engineered sophistry: a remarkable collection of building blocks arranged together to patch specific known weaknesses, but lacking in core insights about how to make agents more general, without, say, making them more complex.

The work is well presented and accessible, especially the blogpost that contains a snapshot of the functional development of deep reinforcement learning capabilities over time. There are several open questions from here on out; personally, I hope this progresses to a single instance of an agent that is proficient at multiple games, and to the design of agents that do not require extensive hyperparameter tuning. The scale of DeepMind's experiments continues to grow, with 256 actors and tens of billions of frames, suggesting that, for now, this work is only suitable for simulated environments.

Massively Scaling Reinforcement Learning with SEED RL (Lasse Espeholt, Raphaël Marinier, Piotr Stanczyk et al) (summarized by Nicholas): Deep learning has historically (AN #7) seen many improvements as a result of scaling to larger models with larger amounts of computation, as with the months-long training of OpenAI Five (AN #82) and AlphaStar (AN #43). SEED RL redesigns the architecture of distributed RL to enable better machine utilization and communication and achieves an order of magnitude improvement in training speed.

Current distributed architectures typically separate machines into actors and learners. Actors are typically CPUs that simulate the environment, and run inference to predict agent actions. They then send trajectories to the learners. Learners are typically accelerators (GPUs or TPUs), which are responsible for training the model. They then send the updated model parameters to the actors.

SEED RL addresses 3 main issues in this setup:

1. Inference could benefit from specialized accelerators.

2. Sending model parameters and states requires high bandwidth.

3. Environment simulation and inference are very different tasks, and having them on the same machine makes it hard to utilize resources efficiently.

The solution is to instead have actors only simulate the environment. After each step, they send the resulting observation to the learner, which is responsible for both training and inference, possibly split on separate hardware. It then sends back just the actions to the environment. This enables each piece of hardware to be used for its designed purpose. Since they now need to communicate at each step, they use gRPC to minimize latency.
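
The control flow described above can be sketched in a single process, with threads standing in for machines and in-memory queues standing in for the network layer. The toy environment, the toy policy, and all function names are hypothetical; the point is only that actors never hold the model and that inference happens centrally on a batch of observations.

```python
import queue
import threading


def actor(actor_id, num_steps, requests, replies, trajectory):
    """Simulate a toy environment; never touches the model."""
    observation = actor_id                      # toy state: a single integer
    for _ in range(num_steps):
        requests.put((actor_id, observation))   # send only the observation
        action = replies[actor_id].get()        # block until an action comes back
        trajectory.append((observation, action))
        observation += action                   # toy environment dynamics


def central_inference(num_actors, num_steps, requests, replies, policy):
    """Learner side: gather one observation per actor, then answer the batch."""
    for _ in range(num_steps):
        batch = [requests.get() for _ in range(num_actors)]
        for actor_id, observation in batch:
            replies[actor_id].put(policy(observation))


def run(num_actors=3, num_steps=4):
    requests = queue.Queue()
    replies = [queue.Queue() for _ in range(num_actors)]
    trajectories = [[] for _ in range(num_actors)]

    def policy(observation):                    # hypothetical stand-in for the model
        return 1 if observation % 2 == 0 else 2

    threads = [
        threading.Thread(target=actor,
                         args=(i, num_steps, requests, replies, trajectories[i]))
        for i in range(num_actors)
    ]
    threads.append(threading.Thread(
        target=central_inference,
        args=(num_actors, num_steps, requests, replies, policy)))
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return trajectories


trajectories = run()
print(trajectories[0])
```

Each actor blocks on every step until its action arrives, which is why the latency of the communication layer matters so much in this design.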

Nicholas' opinion: Given how compute-intensive deep RL is, I think it is quite useful to enable cheaper and faster training before these algorithms can be broadly useful. Their claimed speedup is quite impressive, and I like how well they can separate the training and inference from the simulation. I expect that specialized hardware for both training and inference will soon become the norm and SEED RL seems like it will scale well as those accelerators become faster. One thing to note is that this architecture seems very specifically tuned to the problem of games where CPUs can efficiently simulate the environment and it does not improve the sample efficiency for situations where we can’t run lots of simulations.

Rohin's opinion: It was quite surprising to me that this worked as well as it did: this model requires communication across machines at every timestep of the environment, which intuitively means that latency should be a major bottleneck, while the standard model only requires communication once per batch of trajectories.

DEEP LEARNING

AutoML-Zero: Evolving Machine Learning Algorithms From Scratch (Esteban Real, Chen Liang et al) (summarized by Sudhanshu): Most previous work in the area of automated machine learning, or AutoML, has focussed on narrow search spaces that are restricted to specific parts of the machine learning pipeline, e.g. the architecture of a neural network, or the optimizer in meta-learning. These spaces are often so constrained by the hand-engineered components around them that architectures and algorithms discovered, say, by evolutionary search (ES), are only slightly better than those found by random search (RS). This work aims to set up the problem with very weak constraints and a wide search space: a) a machine learning program has three component functions, Setup, Predict, and Learn, which start out empty, and b) these functions are populated by RS or ES with procedural operations from over 50 arithmetic, trigonometric, linear algebra, probability, and pre-calculus operators.

They demonstrate that with such a vast search space, RS fares very poorly in comparison to ES. They also report that ES finds several procedures that are recognizable as useful for machine learning, such as a simple neural network, gradient descent, gradient normalization, multiplicative interactions, noise augmentation, noisy dropout and learning rate decay.
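
To give a flavour of searching over programs, here is a toy sketch in which a program is a short list of register instructions, and the search is either random sampling or a simple evolutionary loop (tournament selection, one-instruction mutation, removal of the oldest individual). The instruction set, register layout, and target function are assumptions chosen to keep the example tiny; it is far simpler than the paper's setup and makes no claim about relative performance.

```python
import random

OPS = {
    "add": lambda a, b: a + b,
    "sub": lambda a, b: a - b,
    "mul": lambda a, b: a * b,
}
NUM_REGISTERS = 3   # register 0 holds the input x, register 1 holds 1.0


def random_instruction(rng):
    # (operation, destination register, first source, second source)
    return (rng.choice(sorted(OPS)), rng.randrange(NUM_REGISTERS),
            rng.randrange(NUM_REGISTERS), rng.randrange(NUM_REGISTERS))


def run_program(program, x):
    registers = [x, 1.0, 0.0]
    for op, dest, a, b in program:
        registers[dest] = OPS[op](registers[a], registers[b])
    return registers[2]              # register 2 is read as the prediction


def loss(program, data):
    return sum((run_program(program, x) - y) ** 2 for x, y in data)


def random_search(data, length, budget, rng):
    candidates = ([random_instruction(rng) for _ in range(length)]
                  for _ in range(budget))
    return min(candidates, key=lambda p: loss(p, data))


def evolve(data, length, budget, rng, population_size=20, tournament=5):
    population = [[random_instruction(rng) for _ in range(length)]
                  for _ in range(population_size)]
    best = min(population, key=lambda p: loss(p, data))
    for _ in range(budget - population_size):
        parent = min(rng.sample(population, tournament),
                     key=lambda p: loss(p, data))
        child = list(parent)
        child[rng.randrange(length)] = random_instruction(rng)  # mutate one slot
        population.append(child)
        population.pop(0)                                       # drop the oldest
        if loss(child, data) < loss(best, data):
            best = child
    return best


# Hypothetical target: y = x * x + x.
data = [(float(x), float(x * x + x)) for x in range(-3, 4)]
es_best = evolve(data, length=4, budget=300, rng=random.Random(0))
rs_best = random_search(data, length=4, budget=300, rng=random.Random(0))
print(loss(es_best, data), loss(rs_best, data))
```

Both searches evaluate the same number of programs; the difference is that evolution spends its budget on variations of programs that already score well, which matters more as the search space grows.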

Sudhanshu's opinion: This work empirically demonstrates that we now have sufficient methods and tricks in our ES toolkit that enable us to evolve machine learning algorithms from scratch. Additionally, this process produces computer code, which itself may yield to theoretical analysis furthering our knowledge of learning algorithms. I think that powerful AI systems of the future may employ such techniques to discover solutions.

NEWS

Announcing Web-TAISU, May 13-17 (Linda Linsefors) (summarized by Rohin): The Technical AI Safety Unconference (AN #57) will be held online from May 13-17.

FEEDBACK

I'm always happy to hear feedback; you can send it to me, Rohin Shah, by replying to this email.

PODCAST

An audio podcast version of the Alignment Newsletter, recorded by Robert Miles, is available.
