[AN #119]: AI safety when agents are shaped by environments, not rewards



Alignment Newsletter is a weekly publication with recent content relevant to AI alignment around the world. Find all Alignment Newsletter resources here. In particular, you can look through this spreadsheet of all summaries that have ever been in the newsletter.

Audio version here (may not be up yet).


Shaping Safer Goals (Richard Ngo) (summarized by Nicholas): Much of safety research focuses on a single agent that is directly incentivized by a loss/reward function to take particular actions. This sequence instead considers safety in the case of multi-agent systems interacting in complex environments. In this situation, even simple reward functions can yield complex and highly intelligent behaviors that are only indirectly related. For example, evolution led to humans who can learn to play chess, despite the fact that the ancestral environment did not contain chess games. In these situations, the problem is not how to construct an aligned reward function, the problem is how to shape the experience that the agent gets at training time such that the final agent policy optimizes for the goals that we want. This sequence lays out some considerations and research directions for safety in such situations.

One approach is to teach agents the generalizable skill of obedience. To accomplish this, one could design the environment to incentivize specialization. For instance, if an agent A is more powerful than agent B, but can see less of the environment than B, A might be incentivized to obey B’s instructions if they share a goal. Similarly we can increase the ease and value of coordination through enabling access to a shared permanent record or designing tasks that require large-scale coordination.

A second approach is to move agents to simpler and safer training regimes as they develop more intelligence. The key assumption here is that we may require complex regimes such as competitive multi-agent environments to jumpstart intelligent behavior, but may be able to continue training in a simpler regime such as single-task RL later. This is similar to current approaches for training a language model via supervised learning and then finetuning with RL, but going in the opposite direction to increase safety rather than capabilities.

A third approach is specific to a collective AGI: an AGI that is composed of a number of separate general agents trained on different objectives that learn to cooperatively solve harder tasks. This is similar to how human civilization is able to accomplish much more than any individual human. In this regime, the AGI can be effectively sandboxed by either reducing the population size or by limiting communication channels between the agents. One advantage of this approach to sandboxing is that it allows us to change the effective intelligence of the system at test-time, without going through a potentially expensive retraining phase.

Nicholas' opinion: I agree that we should put more emphasis on the safety of multi-agent systems. We already have evidence (AN #65) that complex behavior can arise from simple objectives in current systems, and this seems only more likely as systems become more powerful. Two-agent paradigms such as GANs, self-play, and debate are already quite common in ML. Lastly, humans evolved complex behavior from the simple process of evolution, so we have at least one example of this working. I also think this is an interesting area where there is lots to learn from other fields, such as game theory and evolutionary biology.

For any empirically-minded readers of this newsletter, I think this sequence opens up a lot of potential for research. The development of safety benchmarks for multi-agent systems and then the evaluation of these approaches seems like it would make many of the considerations discussed here more concrete. I personally would find them much more convincing with empirical evidence to back up that they work with current ML.

Rohin's opinion: The AGI model here in which powerful AI systems arise through multiagent interaction is an important and plausible one, and I'm excited to see some initial thoughts about it. I don't particularly expect any of these ideas to be substantially useful, but I'm also not confident that they won't be useful, and given the huge amount of uncertainty about how multiagent interaction shapes agents, that may be the best we can hope for currently. I'd be excited to see empirical results testing some of these ideas out, as well as more conceptual posts suggesting more ideas to try.



Non-Adversarial Imitation Learning and its Connections to Adversarial Methods (Oleg Arenz et al) (summarized by Zach): Viewing imitation learning as a distribution matching problem has become more popular in recent years (see Value-Dice (AN #98) / I2L (AN #94)). However, the authors of this paper argue that such methods are unstable because they are formulated as saddle-point problems, whose convergence guarantees are weak and rely on the assumption that the policy is updated slowly. In this paper, the authors reformulate Adversarial IRL (AN #17) as a non-adversarial problem, which allows much stronger convergence guarantees to be proved. In particular, the authors derive a lower bound on the discrimination reward that allows for larger policy updates, and then introduce a method to iteratively tighten this bound. They also build on prior work on value-dice and derive a soft actor-critic algorithm (ONAIL) that they evaluate on a variety of control tasks.

Zach's opinion: The experiments in this paper are a bit underwhelming. While they run a large number of experiments, ONAIL only occasionally outperforms value-dice, and does so consistently only in the HalfCheetah environment. The authors justify this by noting that ONAIL wasn't regularized. Additionally, the policies are initialized with behavior cloning, something that value-dice doesn't require. However, the theoretical insight on iterative tightening is interesting, and together with the recent work on value-dice indicates that the design space of imitation learning algorithms is far from being exhausted.


Canaries in Technology Mines: Warning Signs of Transformative Progress in AI (Carla Zoe Cremer et al) (summarized by Asya): In this paper, Cremer et al. propose a methodology for identifying early warning signs ('canaries') for transformative AI progress. The methodology consists of identifying key milestones using expert elicitation, arranging those milestones into causal graphs where any given milestone may make another milestone more likely, and then using the causal graph representation to identify canaries: nodes which have a significant number of outgoing edges.

As an example, they give a partial implementation of using this methodology to identify canaries for high-level machine intelligence. Cremer et al. interview 25 experts in a variety of fields about the limitations of deep learning, then collate the named limitations and translate them into 'milestones'. Interviewees name 34 (potentially overlapping) milestones in total, including causal reasoning, meta-learning, hierarchical decomposition, (abstract) representation, flexible memory, common sense, architecture search, and navigating brittle environments.

Cremer et al. then construct one possible causal graph for these milestones, and identify two that may act as canaries. Symbol-like representations, i.e. the ability to construct abstract, discrete, and disentangled representations of inputs, could underlie grammar, mathematical reasoning, concept formation, and flexible memory. Flexible memory, the ability to store, recognize, and re-use knowledge, could unlock the ability to learn from dynamic data, the ability to do continuous learning, and the ability to learn how to learn.
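To make the canary-identification step concrete, here is a minimal sketch that treats the causal graph as a list of directed edges and flags nodes with many outgoing edges. The edge list only contains the links named in the two example canaries above; the function names and the threshold of three outgoing edges are assumptions made for illustration, not something taken from the paper.

```python
from collections import defaultdict

# Illustrative edge list: (milestone, milestone it may make more likely).
# Only the links named in the summary above are included; the rest of the
# paper's graph is not reproduced here.
EDGES = [
    ("symbol-like representations", "grammar"),
    ("symbol-like representations", "mathematical reasoning"),
    ("symbol-like representations", "concept formation"),
    ("symbol-like representations", "flexible memory"),
    ("flexible memory", "learning from dynamic data"),
    ("flexible memory", "continuous learning"),
    ("flexible memory", "learning to learn"),
]


def out_degrees(edges):
    """Count outgoing edges per node; nodes with no outgoing edges get 0."""
    counts = defaultdict(int)
    for src, dst in edges:
        counts[src] += 1
        counts[dst] += 0  # register sink nodes as well
    return dict(counts)


def find_canaries(edges, min_out_degree=3):
    """Return (node, out-degree) pairs meeting an assumed threshold, largest first."""
    degrees = out_degrees(edges)
    hits = [(node, d) for node, d in degrees.items() if d >= min_out_degree]
    return sorted(hits, key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    for node, degree in find_canaries(EDGES):
        print(f"{node}: {degree} outgoing edges")
```

Out-degree is only the simplest possible reading of "a significant number of outgoing edges"; a fuller treatment might also count milestones that are reachable indirectly.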

Asya's opinion: I like the methodology proposed in this paper, and I found the list of named limitations of deep learning interesting and informative. I’m not sure that I personally agree with the particular canaries identified in the example (which the authors emphasize is just one possible causal graph). It seems plausible to me that both flexible memory and symbol-like representations would be an emergent property of any deep learning system with a sufficiently rich training dataset, curriculum, compute available, etc. and the real milestones to watch would be advances in those inputs.


Hard Choices in Artificial Intelligence: Addressing Normative Uncertainty through Sociotechnical Commitments (Roel Dobbe et al) (summarized by Flo): This paper looks at AI Safety through the lens of Science & Technology Studies. AI systems are framed as sociotechnical, meaning that both social and technical aspects influence their development and deployment. As AI systems scale, we may face difficult value choices: for example, how do we compare values like equality and liberty when we cannot have both? This can be resolved using intuitive comparability (IC): even if values seem incomparable in the abstract, humans are still able to make deliberate tradeoffs that involve them. This is particularly relevant for so-called hard choices, where different alternatives seem to be on par and which require normative reasoning and the incorporation of values that were previously neglected. As AI systems can reshape the contexts in which stakeholders exist, we are likely to encounter many hard choices as new values emerge or become more salient. The IC perspective then suggests that AI systems and criteria for evaluation should be iteratively redesigned based on qualitative feedback from different stakeholders.

The authors then argue that as AI systems encode hard choices made by or for different stakeholders, they are fundamentally political. Developers are in a position of power and have the responsibility to take a political stance. A set of challenges to preserve stakeholders' access to hard choices in an AI system's development are proposed:

1. The design of the system should involve the explicit negotiation of modelling assumptions (or the lack thereof) and learning goals, as well as deliberation about future value conflicts or externalities that might make a reiteration of the design process necessary. It should also give stakeholders enough flexibility to imprint their own values during training and deployment.

2. The training of the system should involve an impartial assessment of the tradeoff between visible performance and potential hidden disadvantages like bias, brittleness or unwanted strategic behaviour and involve stakeholders in the resolution. Furthermore, a team consensus about what can and cannot be done to improve performance should be established.

3. During deployment, there should be an easily useable and trustworthy feedback channel for stakeholders, who should either have an explicit say in shaping the system (political setting) or the option to opt out of the system without major costs (market setting).

These challenges should be part of the training of AI designers and engineers, while the public needs to be sufficiently educated about the assumptions behind and the abilities and limitations of AI systems to allow for informed dissent.

Flo's opinion: I agree that technology, especially widely used technology, always has a political aspect: providing access to new capabilities can have large societal effects that affect actors differently, based on accessibility and how well the capabilities match with their existing ones. I also like the distinction between deployment in market settings, where opting out is possible, and political settings, even though this can obviously become quite fuzzy when network effects make competition difficult. Lastly, I strongly agree that we will need an iterative process involving qualitative feedback to ensure good outcomes from AI, but worry that competitive pressures or the underestimation of runaway feedback dynamics could lead to situations where AI systems directly or indirectly prevent us from adjusting them.



What Matters In On-Policy Reinforcement Learning? A Large-Scale Empirical Study (Marcin Andrychowicz et al) (summarized by Sudhanshu): In what is likely the largest study of on-policy reinforcement learning agents, this work unifies various algorithms into a collection of over 50 design choices, which the authors implement as tunable hyperparameters. They then systematically investigate how those hyperparameters impact learning across five standard continuous control environments. Specifically, they choose subsets of these hyperparameters in eight experiment themes: policy losses, network architectures, normalization and clipping, advantage estimation, training setup, timestep handling, optimizers, and regularization. They train thousands of agents for various choices within each theme, for a total of over 250,000 agents.

They present nearly a hundred graphs summarizing their experiments so that readers can draw their own conclusions. Their own recommendations include: using the PPO loss, using separate value and policy networks, initializing the last policy layer with 100x smaller weights, using tanh as the activation function, using observation normalization, using generalized advantage estimation with λ = 0.9, tuning the number of transitions gathered in each training loop if possible, tuning the discount factor, and using the Adam optimizer with a linearly decaying learning rate, among several others.
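Two of these recommendations are easy to show in code. The sketch below is not the authors' implementation: it is a plain-Python illustration of generalized advantage estimation with λ = 0.9 and of a linearly decaying learning rate. The function names, the default discount factor of 0.99 (which the recommendations say should be tuned), and the choice to decay the learning rate all the way to zero are assumptions.

```python
def gae_advantages(rewards, values, last_value, gamma=0.99, lam=0.9):
    """Generalized advantage estimation for one trajectory with no early termination.

    rewards[t] and values[t] are the reward and value estimate at step t;
    last_value is the value estimate for the state after the final step.
    gamma=0.99 is a placeholder default; lam=0.9 is the recommended value.
    """
    advantages = [0.0] * len(rewards)
    next_value = last_value
    running = 0.0
    for t in reversed(range(len(rewards))):
        # One-step TD residual at time t.
        delta = rewards[t] + gamma * next_value - values[t]
        # Exponentially weighted sum of this and all later residuals.
        running = delta + gamma * lam * running
        advantages[t] = running
        next_value = values[t]
    return advantages


def linear_decay_lr(initial_lr, step, total_steps):
    """Learning rate decaying linearly from initial_lr to 0 (assumed end point)."""
    fraction_left = max(0.0, 1.0 - step / total_steps)
    return initial_lr * fraction_left


if __name__ == "__main__":
    print(gae_advantages([1.0, 1.0], [0.0, 0.0], last_value=0.0))
    print(linear_decay_lr(3e-4, step=50, total_steps=100))
```

Setting lam=0 reduces the estimate to the one-step TD residual, while lam=1 gives the full discounted return minus the value baseline; values in between trade bias against variance.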

Sudhanshu's opinion: This is a paper worth a skim to glimpse at the complexity of today's RL research while noting how little we understand and can predict about the behaviour of our algorithms. A fun game to play here is to go through the graphs in Appendices D through K and arrive at one's own interpretations before comparing them to the authors' in the main text. What was unsatisfying was that often there were contradictory results between environments, meaning there was no insight to be gleaned about what was happening: for instance, value function normalization always helps except in Walker2d where it significantly hurts performance. Such work raises more questions than it answers; perhaps it will motivate future research that fundamentally rethinks our environments and algorithms.

A more mundane but alignment-relevant observation: the difficulty of tuning an agent for a task in simulation, and the extent to which hyperparameters may vary across tasks, is weak evidence against powerful sim-to-real transfer performance arising out of the current paradigm of simulators/tasks and algorithms. If so, agents will need to be trained in the real world, spawning associated risks which we may want to avoid.


Measuring Massive Multitask Language Understanding (Dan Hendrycks et al) (summarized by Rohin): With the advent of large language models, there has been a shift to evaluating these models based on the knowledge they have acquired, i.e. evaluating their “common sense”. However, with GPT-3 (AN #102), models have reached approximately human performance even on these benchmarks. What should be next?

We’ve previously seen (AN #113) a benchmark that evaluates models based on their knowledge of ethics. This benchmark (with many of the same authors) goes further by testing models with multiple choice questions on a variety of subjects that humans need to learn. These are not easy: their 57 subjects include advanced topics like Professional Medicine, College Mathematics, and International Law.

All but the largest of the GPT-3 models do about as well as random chance (25%). However, the largest 175 billion parameter model does significantly better, reaching an average score of 43.9%. This performance is very lopsided: on US Foreign Policy it gets almost 70%, while on College Chemistry and Moral Scenarios it gets about 25% (i.e. still random chance). The authors note that GPT-3 tends to do worse on subjects that require calculations and thus speculate that it is harder for GPT-3 to acquire procedural knowledge compared to declarative knowledge. The authors also find that GPT-3 is very uncalibrated about its answers in the zero-shot setting, and becomes more calibrated (though still not very good) in the few-shot setting.
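For readers less familiar with calibration, the sketch below shows one common way to quantify it alongside accuracy on multiple-choice questions: expected calibration error (ECE), which compares a model's stated confidence with how often it is actually right. This is an illustration only; this summary does not say which calibration metric the authors use, and the function names and the choice of ten bins are assumptions.

```python
def accuracy(predictions, labels):
    """Fraction of questions answered correctly."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


def expected_calibration_error(confidences, predictions, labels, n_bins=10):
    """Bin answers by confidence, then average |accuracy - confidence| weighted by bin size."""
    bins = [[] for _ in range(n_bins)]
    for conf, pred, label in zip(confidences, predictions, labels):
        index = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the top bin
        bins[index].append((conf, pred == label))
    total = len(labels)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        mean_conf = sum(c for c, _ in bucket) / len(bucket)
        bin_acc = sum(hit for _, hit in bucket) / len(bucket)
        ece += (len(bucket) / total) * abs(bin_acc - mean_conf)
    return ece


if __name__ == "__main__":
    # Toy data: a model that is 90% confident on four-way questions
    # but only right a quarter of the time (i.e. chance level).
    labels = ["A", "A", "A", "A"]
    predictions = ["A", "B", "C", "D"]
    confidences = [0.9, 0.9, 0.9, 0.9]
    print(accuracy(predictions, labels))
    print(expected_calibration_error(confidences, predictions, labels))
```

A perfectly calibrated model would score an ECE of zero: among all the answers it gives with, say, 50% confidence, half would be correct.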

It isn’t necessary to have huge models in order to do better than chance: in fact, you can do better with a smaller model that is finetuned for question answering. In particular, the UnifiedQA system has an order of magnitude fewer parameters than GPT-3, but outperforms it with a score of 48.9% accuracy. This system was trained on other question answering datasets (but notably was not trained on the questions in this dataset, as this dataset is meant for evaluation rather than training). A small UnifiedQA model with only 60 million parameters (over 3 orders of magnitude smaller than GPT-3) can still do better than chance, achieving 29.3% on the dataset.

Read more: Import AI summary

Rohin's opinion: The examples of the questions are pretty interesting, and show that this really is a hard challenge: while experts in each subject would probably get very high scores, if we tested me on all of these subjects I don't think I would do very well. I like this method of evaluation because it gets a bit closer to what we care about: whether our models can capture enough domain knowledge that they can then be used widely for automation. Depending on your beliefs about how AI will progress, there might be too much of a focus on this generality -- maybe our models only need to understand “general reasoning” and then we can finetune them for specific domains.


Dark, Beyond Deep: A Paradigm Shift to Cognitive AI with Humanlike Common Sense (Yixin Zhu et al) (summarized by Rohin): This paper argues that current computer vision research focuses too much on a “big data for small tasks” paradigm that focuses only on the “what” and “where” of images. More work should be done on a “small data for big tasks” paradigm that focuses more on the “how” and “why” of images. These “how” and “why” questions focus attention on details of an image that may not be directly present in the pixels of the image, which the authors term “dark” data (analogously to dark matter in physics, whose existence is inferred, not observed). For example, by asking why a human is holding a kettle with the spout pointing down, we can infer that the kettle contains liquid that will soon come out of the kettle, even though there are no pixels that directly correspond to the liquid.

The authors propose five important areas for further research, abbreviated FPICU, and do a literature review within each one:

1. Functionality: Many objects, especially those designed by humans, can be better understood by focusing on what functionalities they have.

2. Physics: Cognitive science has shown that humans make extensive use of intuitive physics to understand the world. For example, simply reasoning about whether objects would fall can provide a lot of constraints on a visual scene; it would be weird to see an upright cup floating in the air.

3. Intent: The world is filled with goal-directed agents, and so understanding the world requires us to infer the goals that various agents have. This is a capability humans get very quickly -- at eighteen months of age, children can infer and imitate the intended goal of an action, even if the action fails to achieve the goal.

4. Causality: Much has already been written about causality; I will not bore you with it again. The authors see this as the most important factor that underlies the other four areas.

5. Utility: I didn’t really get how this differed from intent. The section in the paper discusses utility theory, and then talks about work that infers utility functions from behavior.

Rohin's opinion: I really liked the super detailed description of a large number of things that humans can do but current vision systems cannot do; it feels like I have a much more detailed sense now of what is missing from current approaches to vision. While the paper has a huge 491 references backing up its claims, I’m not sure how relevant all of them are. For example, the reference to the revelation principle didn’t really seem to justify the associated point. As a counterpoint, the discussion on utility functions in various fields was excellent. Unfortunately I’m not familiar enough with most of the other areas to spot check them.

I read this paper because I heard that the last author, Song-Chun Zhu, was leaving his job as a professor at UCLA to set up a research institute on “general AI” in Beijing, and I wanted to get a sense of what the institute was likely to work on. It seems like the institute will probably pursue an agenda that focuses on building the five particular facets of intelligence into AI systems, as a form of inductive bias: this is how you’d get to a “small data for big tasks” paradigm. If that’s right, it would be in stark contrast to the neural network approaches taken by most of industry, and the biology-inspired approaches taken by (say) the Human Brain Project, but it would feel quite aligned with the views of many academics (like Josh Tenenbaum, who is a coauthor on this paper).


OpenAI Licenses GPT-3 Technology to Microsoft (summarized by Rohin): In the initial announcement of Microsoft’s investment in OpenAI (AN #61), OpenAI suggested that they would likely license pre-AGI technologies to Microsoft in order to get enough capital to run high-compute experiments. This has now happened with the GPT-3 API (AN #104).


I'm always happy to hear feedback; you can send it to me, Rohin Shah, by replying to this email.


An audio podcast version of the Alignment Newsletter is available. This podcast is an audio version of the newsletter, recorded by Robert Miles.