Alignment Newsletter #48

Rohin Shah

Find all Alignment Newsletter resources here. In particular, you can sign up, or look through this spreadsheet of all summaries that have ever been in the newsletter.

Highlights

Quantilizers: A Safer Alternative to Maximizers for Limited Optimization and When to use quantilization (Jessica Taylor and Ryan Carey): A key worry with AI alignment is that if we maximize expected utility for some utility function chosen by hand, we will likely get unintended side effects that score highly by the utility function but are nevertheless not what we intended. We might hope to leverage human feedback to solve this: in particular, an AI system that simply mimics human actions would often be desirable. However, mimicry can only achieve human performance, and cannot improve upon it. The first link is a 2015 paper that introduces quantilization, which interpolates between these two extremes to improve upon human performance while bounding the potential (expected) loss from unintended side effects.

In particular, let's suppose that humans have some policy γ (i.e. probability distribution over actions). We evaluate utility or performance using a utility function U, but we do not assume it is well-specified -- U can be any function, including one we would not want to maximize. Our goal is to design a policy π that gets higher expected U than γ (reflecting our hope that U measures utility well) without doing too much worse than γ in the worst case when U was as badly designed as possible. We'll consider a one-shot case: π is used to select an action once, and then the game is over.

The core idea behind quantilization is simple: if our policy only does things that the human might have done, any expected loss it incurs corresponds to some loss that the human could incur. So, let's take our human policy γ, keep only the top q-fraction of γ (as evaluated by U), and then sample an action from there. This defines our policy π_q, also called a q-quantilizer. For example, suppose the human would choose A with probability 0.25, B with probability 0.5, and C with probability 0.25, and U(A) > U(B) > U(C). Then a (1/4)-quantilizer would choose A with certainty, a (1/2)-quantilizer would choose A or B with equal probability, and a (3/8)-quantilizer would choose A twice as often as B (it keeps all 0.25 of A's probability mass but only 0.125 of B's).
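A minimal sketch of the construction above for a finite set of actions, written to reproduce the A/B/C example. The function name `quantilize` and the numeric utilities are illustrative assumptions, not taken from the paper; ties in U are broken arbitrarily by the sort.

```python
from fractions import Fraction

def quantilize(base, utility, q):
    """Return the q-quantilizer of a discrete base distribution.

    base:    dict mapping action -> probability under the human policy (gamma)
    utility: dict mapping action -> U(action)
    q:       fraction of base probability mass to keep, 0 < q <= 1
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    remaining = q
    kept = {}
    # Walk through actions from highest to lowest U, keeping base mass until q is used up.
    for action in sorted(base, key=lambda a: utility[a], reverse=True):
        if remaining <= 0:
            break
        mass = min(base[action], remaining)
        kept[action] = mass
        remaining -= mass
    # Renormalise the kept mass so that it sums to 1.
    return {a: m / q for a, m in kept.items()}

gamma = {"A": Fraction(1, 4), "B": Fraction(1, 2), "C": Fraction(1, 4)}
U = {"A": 3, "B": 2, "C": 1}  # only the ordering U(A) > U(B) > U(C) matters here

quarter = quantilize(gamma, U, Fraction(1, 4))        # all mass on A
half = quantilize(gamma, U, Fraction(1, 2))           # A and B equally likely
three_eighths = quantilize(gamma, U, Fraction(3, 8))  # A twice as likely as B
```

Exact fractions are used so that the resulting probabilities can be compared directly with the numbers in the example.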

Note that even if π_q places all of its probability on the worst possible actions (because U was badly designed), by construction the human had probability q of doing the same thing as π_q, and so the expected cost of π_q can be at most (1/q) times larger than the expected cost of the human policy γ. In fact, if we have no other knowledge, a q-quantilizer maximizes expected utility U subject to the constraint of never doing worse than (1/q) times as bad as γ.
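The bound can be written out in one line. Here c is notation introduced for this sketch: a non-negative cost function that we cannot observe and that U may have failed to capture.

```latex
\pi_q(a) \le \frac{\gamma(a)}{q} \ \text{ for every action } a
\quad\Longrightarrow\quad
\mathbb{E}_{a \sim \pi_q}[c(a)]
  = \sum_a \pi_q(a)\, c(a)
  \le \frac{1}{q} \sum_a \gamma(a)\, c(a)
  = \frac{1}{q}\, \mathbb{E}_{a \sim \gamma}[c(a)]
```

The first inequality holds because π_q is built by keeping at most γ(a) mass on each action and then dividing by q.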

As soon as you move to the setting with multiple actions, if you choose each action using quantilization, then your worst case bound is exponential in the number of actions. If you assume the cost for each action is independent, you recover the guarantees, but this is not a realistic assumption (as the authors note). Long-term plans are very good or very bad because all of the actions build on each other to achieve some goal, so the costs are not going to be independent.
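To see where the exponential comes from, suppose (as an illustration, with notation introduced here) that a trajectory τ consists of n actions, each chosen by a q-quantilizer of the human's per-step policy, and that the environment dynamics are the same under both policies so that they cancel in the ratio:

```latex
\frac{P_{\pi}(\tau)}{P_{\gamma}(\tau)}
  = \prod_{t=1}^{n} \frac{\pi_q(a_t \mid s_t)}{\gamma(a_t \mid s_t)}
  \le \left(\frac{1}{q}\right)^{n}
\quad\Longrightarrow\quad
\mathbb{E}_{\tau \sim \pi}[c(\tau)] \le q^{-n}\, \mathbb{E}_{\tau \sim \gamma}[c(\tau)]
```

For instance, with q = 1/10 and n = 6 steps the worst-case factor is already 10^6.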

The second link is a recent post that further analyzes quantilization. It points out that the theoretical analysis in the paper assumes that deviating from the human policy can only lead to costs. If the deviation could also forgo benefits, then the theorems no longer apply. Perhaps the bottom (1-q) fraction of actions actually has strong positive benefits that we failed to model in U. Then by eliminating those actions, we may have lost arbitrary amounts of expected value. In these situations, the only way to bound the expected regret is by exact mimicry. It also points out that if you are aiming to simultaneously do well both on U and the worst-case bound, then typically imitation will be better, since adding any optimization can drastically weaken the worst-case bound and usually will not make U better by the same amount. Quantilization makes sense when there is a "sweet-spot of actions that are pretty common but substantially outperform imitation".

Rohin's opinion: The exponential blowup in potential loss with multiple actions would make this prohibitive, but of course you could instead view the full sequence of actions (i.e. trajectory) as a mega-action, and quantilize over this mega-action. In this case, a one-millionth-quantilizer could choose from among the million best plans that a human would make (assuming a well-specified U), and any unintended consequences (that were intentionally chosen by the quantilizer) would have to be ones that a human had a one-in-a-million chance of causing to occur, which quite plausibly excludes really bad outcomes.
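One way to picture quantilizing over whole plans is a sampling approximation: draw many plans from a model of the human, rank them by U, and pick uniformly from the top q-fraction of the samples. This is only a sketch under that assumption; `sample_human_plan` is a hypothetical stand-in for a model of the human policy, and neither the paper nor the post is claimed to implement it this way.

```python
import math
import random

def quantilize_trajectories(sample_human_plan, utility, q, n_samples, rng):
    """Approximate a q-quantilizer over whole plans by sampling.

    sample_human_plan: hypothetical callable rng -> plan, a model of the human policy
    utility:           callable plan -> number, the (possibly misspecified) U
    q:                 fraction of sampled plans to keep, 0 < q <= 1
    """
    plans = [sample_human_plan(rng) for _ in range(n_samples)]
    # Rank the sampled plans from highest to lowest U.
    plans.sort(key=utility, reverse=True)
    # Keep the top q-fraction of the samples (at least one plan).
    keep = max(1, math.ceil(q * n_samples))
    return rng.choice(plans[:keep])

# Toy usage: a "plan" is three steps, each a number from 0 to 9, and U is their sum.
rng = random.Random(0)
toy_plan = lambda r: tuple(r.randrange(10) for _ in range(3))
chosen = quantilize_trajectories(toy_plan, sum, q=0.01, n_samples=1000, rng=rng)
```

Because every candidate is itself a sample from the human model, the chosen plan is always one the modelled human could have produced, which is the property the worst-case bound relies on.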

Phrased this way, quantilization feels like an amplification of a human policy. Unlike the amplification in iterated amplification, it does not try to preserve alignment; it simply tries to bound how far away from alignment the resulting policy can diverge. As a result, you can't iterate quantilization to get arbitrarily good capabilities. You might hope that humans could learn from powerful AI systems, grow more capable themselves (while remaining as safe as they were before), and then the next quantilizers would be more powerful.

It's worth noting that the theorem in the paper shows that, to the extent that you think quantilization is insufficient for AI alignment, you need to make some other assumption, or find some other source of information, in order to do better, since quantilization is optimal for its particular setup. For example, you could try to assume that U is at least somewhat reasonable and not pathologically bad; or you could assume an interactive setting where the human can notice and correct for any issues with the U-maximizing plan before it is executed; or you could not have U at all and exceed human performance through some other technique.

I'm not very worried about the issue that quantilization could forgo benefits that the human policy had. It seems that even if this happens, we could notice this, turn off the quantilizer, and fix the utility function U so that it no longer ignores those benefits. (We wouldn't be able to prevent the quantilizer from forgoing benefits of our policy that we didn't know about, but that seems okay to me.)

Technical AI alignment

Iterated amplification

Can HCH epistemically dominate Ramanujan? (Alex Zhu): Iterated amplification rests on the hope that we can achieve arbitrarily high capabilities with (potentially very large) trees of explicit verbal breakdowns of problems. This is often formalized as a question about HCH (AN #34). This post considers the example of Srinivasa Ramanujan, who is "famously known for solving math problems with sudden and inexplicable flashes of insight". It is not clear how HCH would be able to replicate this sort of reasoning.

Learning human intent

Unsupervised Visuomotor Control through Distributional Planning Networks (Tianhe Yu et al)

Syntax vs semantics: alarm better example than thermostat (Stuart Armstrong): This post gives a new example that more clearly illustrates the points made in a previous post (AN #26).

Prerequisites: Bridging syntax and semantics, empirically

Interpretability

Synthesizing the preferred inputs for neurons in neural networks via deep generator networks (Anh Nguyen et al)

Adversarial examples

Quantifying Perceptual Distortion of Adversarial Examples (Matt Jordan et al) (summarized by Dan H): This paper takes a step toward more general adversarial threat models by combining adversarial additive perturbations small in an l_p sense with spatially transformed adversarial examples, among other attacks. In this more general setting, they measure the size of perturbations by computing the SSIM between clean and perturbed samples, which has limitations but is on the whole better than the l_2 distance. This work shows, along with other concurrent works, that perturbation robustness under some threat models does not yield robustness under other threat models. Therefore the view that l_p perturbation robustness must be achieved before considering other threat models is made more questionable. The paper also contributes a large code library for testing adversarial perturbation robustness.

On the Sensitivity of Adversarial Robustness to Input Data Distributions (Gavin Weiguang Ding et al)

Forecasting

Primates vs birds: Is one brain architecture better than the other? (Tegan McCaslin): Progress in AI can be driven by both larger models as well as architectural improvements (given sufficient data and compute), but which of these is more important? One source of evidence comes from animals: different species that are closely related will have similar neural architectures, but potentially quite different brain sizes. This post compares intelligence across birds and primates: while primates (and mammals more generally) have a neocortex (often used to explain human intelligence), birds have a different, independently-evolved type of cortex. Using a survey over non-expert participants about how intelligent different bird and primate behavior is, it finds that there is not much difference in intelligence ratings between birds and primates, but that species with larger brains are rated as more intelligent than those with smaller brains. This only suggests that there are at least two neural architectures that work -- it could still be a hard problem to find them in the vast space of possible architectures. Still, it is some evidence that at least in the case of evolution, you get more intelligence through more neurons, and architectural improvements are relatively less important.

Rohin's opinion: Upon reading the experimental setup I didn't really know which way the answer was going to turn out, so I'm quite happy about now having another data point with which to understand learning dynamics. Of course, it's not clear how data about evolution will generalize to AI systems. For example, architectural improvements probably require some hard-to-find insight, which makes them hard to find via random search (imagine how hard it would be to invent CNNs by randomly trying stuff), while scaling up model size is easy, and so we might expect AI researchers to be differentially better at finding architectural improvements relative to scaling up model size (as compared to evolution).

Miscellaneous (Alignment)

Quantilizers: A Safer Alternative to Maximizers for Limited Optimization and When to use quantilization (Jessica Taylor and Ryan Carey): Summarized in the highlights!

Human-Centered Artificial Intelligence and Machine Learning (Mark O. Riedl)

AI strategy and policy

Stable Agreements in Turbulent Times (Cullen O’Keefe): On the one hand we would like actors to be able to cooperate before the development of AGI by entering into binding agreements, but on the other hand such agreements are often unpalatable and hard to write because there is a lot of uncertainty, indeterminacy and unfamiliarity with the consequences of developing powerful AI systems. This makes it very hard to be confident that any given agreement is actually net positive for a given actor. The key point of this report is that we can strike a balance between these two extremes by agreeing pre-AGI to be bound by decisions that are made post-AGI with the benefit of increased knowledge. It examines five tools for this purpose: options, impossibility doctrines, contractual standards, renegotiation, and third-party resolution.

Advice to UN High-level Panel on Digital Cooperation (Luke Kemp et al)

Other progress in AI

Reinforcement learning

Neural MMO (OpenAI) (summarized by Richard): Neural MMO is "a massively multiagent game environment for reinforcement learning agents". It was designed to be persistent (with concurrent learning and no environment resets), large-scale, efficient and expandable. Agents need to traverse an environment to obtain food and water in order to survive for longer (the metric for which they are rewarded), and are also able to engage in combat with other agents. Agents trained within a larger population explore more and consistently outperform those trained in smaller populations (when evaluated together). The authors note that multiagent training is a curriculum magnifier, not a curriculum in itself, and that the environment must facilitate adaptive pressures by allowing a sufficient range of interactions.

Autocurricula and the Emergence of Innovation from Social Interaction: A Manifesto for Multi-Agent Intelligence Research (Joel Z. Leibo, Edward Hughes, Marc Lanctot, Thore Graepel) (summarized by Richard): The authors argue that the best solution to the problem of task generation is creating multi-agent systems where each agent must adapt to the others. These agents do so first by learning how to implement a high-level strategy, and then by adapting it based on the strategies of others. (The authors use the term "adaptive unit" rather than "agent" to emphasise that change can occur at many different hierarchical levels, and either by evolution or learning.) This adaptation may be exogenous (driven by the need to respond to a changing environment) or endogenous (driven by a unit's need to improve its own functionality). An example of the latter is a society implementing institutions which enforce cooperation between individuals. Since individuals will try to exploit these institutions, the process of gradually robustifying them can be considered an automatically-generated curriculum (aka autocurriculum).

Richard's opinion: My guess is that multiagent learning will become very popular fairly soon. In addition to this paper and the Neural MMO paper, it was also a key part of the AlphaStar training process. The implications of this research direction for safety are still unclear, and it seems valuable to explore them further. One which comes to mind: the sort of deceptive behaviour required for treacherous turns seems more likely to emerge from multiagent training than from single-agent training.

Long-Range Robotic Navigation via Automated Reinforcement Learning (Aleksandra Faust and Anthony Francis): How can we get robots that successfully navigate in the real world? One approach is to use a high-level route planner that uses a learned control policy over very short distances (10-15 meters). The control policy is learned using deep reinforcement learning, where the network architecture and reward shaping are also learned via neural architecture search (or at least something very similar). The simulations have enough noise that the learned control policy transfers well to new environments. Given this policy as well as a floorplan of the environment we want the robot to navigate in, we can build a graph of points on the floorplan, where there is an edge between two points if the robot can safely navigate between the two points using the learned controller (which I think is checked in simulation). At execution time, we can find a path to the goal in this graph, and move along the edges using the learned policy. They were able to build a graph for the four buildings at the Google main campus using 300 workers over 4 days. They find that the robots are very robust in the real world. See also Import AI.
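The graph-building and path-finding steps can be sketched roughly as follows. This is not the authors' code: `can_navigate` is a hypothetical predicate standing in for the check that the learned controller can safely travel between two points, edges are assumed to be usable in both directions, and Dijkstra's algorithm is one reasonable choice for finding a path rather than necessarily the one used in the paper.

```python
import heapq
import math

def build_roadmap(points, can_navigate):
    """Connect every pair of points that the short-range controller can travel between.

    can_navigate is a hypothetical predicate (a, b) -> bool.
    """
    graph = {p: [] for p in points}
    for i, a in enumerate(points):
        for b in points[i + 1:]:
            if can_navigate(a, b):
                d = math.dist(a, b)
                graph[a].append((b, d))
                graph[b].append((a, d))
    return graph

def shortest_path(graph, start, goal):
    """Dijkstra's algorithm; returns the list of waypoints, or None if unreachable."""
    best = {start: 0.0}
    parent = {}
    frontier = [(0.0, start)]
    while frontier:
        cost, node = heapq.heappop(frontier)
        if node == goal:
            # Walk back through the parents to recover the route.
            path = [node]
            while node in parent:
                node = parent[node]
                path.append(node)
            return path[::-1]
        if cost > best[node]:
            continue  # stale queue entry
        for neighbour, d in graph[node]:
            new_cost = cost + d
            if new_cost < best.get(neighbour, math.inf):
                best[neighbour] = new_cost
                parent[neighbour] = node
                heapq.heappush(frontier, (new_cost, neighbour))
    return None

# Toy floorplan: the stand-in check says the controller is reliable up to 12 metres.
points = [(0, 0), (10, 0), (20, 0), (20, 10)]
within_range = lambda a, b: math.dist(a, b) <= 12
roadmap = build_roadmap(points, within_range)
route = shortest_path(roadmap, (0, 0), (20, 10))
```

At execution time the robot would hand each consecutive pair of waypoints in `route` to the learned short-range policy.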

Rohin's opinion: This is a great example of a pattern that seems quite common: once we automate tasks using end-to-end training that previously required more structured approaches, new more complex tasks will arise that will use the end-to-end trained systems as building blocks in a bigger structured approach. In this case, we can now train robots to navigate over short distances using end-to-end training, and this has been used in a structured approach involving graphs and waypoints to create robots that can traverse larger distances.

It's also an example of what you can do when you have a ton of compute: for the learned controller, they learned both the network architecture and the reward shaping. About the only thing that had to be explicitly specified was the sparse true reward. (Although I'm sure in practice it took a lot of effort to get everything to actually work.)

Competitive Experience Replay (Hao Liu et al)

News

Q&A with Jason Matheny, Founding Director of CSET (Jason Matheny): The Center for Security and Emerging Technology has been announced, with a $55 million grant from the Open Philanthropy Project, and is hiring. While the center will work on emerging technologies generally, it will initially focus on AI, since demand for AI policy analysis has far outpaced supply.

One area of focus is the implications of AI on national and international security. Current AI systems are brittle and can easily be fooled, implying several safety and security challenges. What are these challenges, and how important are they? How can we make systems that are more robust and mitigate these problems?

Another area is how to enable effective competition on AI in a global environment, while also cooperating on issues of safety, security and ethics. This will likely require measurement of investment flows, publications, data and hardware across countries, as well as management of talent and knowledge workflows.

See also Import AI.

Rohin's opinion: It's great to see a center for AI policy that's run by a person who has wanted to consume AI policy analysis in the past (Jason Matheny was previously the director of IARPA). It's interesting to see the areas he focuses on in this Q&A -- it's not what I would have expected given my very little knowledge of AI policy.

Question about quantilization: where does the base distribution come from? You and Jessica both mention humans, but if we apply ML to humans, and the ML is really good, wouldn't it just give a prediction like "With near certainty, the human will output X in this situation"? (If the ML isn't very good, then any deviation from the above prediction would reflect the properties of the ML algorithm more than properties of the human.)

To get around this, the human could deliberately choose an unpredictable (e.g., randomized) action to help the quantilizer, but how are they supposed to do that?

Alternatively, construct a distribution over actions such that each action has measure according to some function of its e.g. attainable utility impact penalty (normalized appropriately, of course). Seems like a potential way to get a mild optimizer which is explicitly low-impact and doesn't require complicated models of humans.

What advantages do you think this has compared to vanilla RL on U + AUP_Penalty?

It's also mild on the inside of the algorithm, not just in its effects on the world. This could avert problems with inner optimizers. Beyond that, I haven't thought enough about the behavior of the agent. I might reply with another comment.

Question about quantilization: where does the base distribution come from? You and Jessica both mention humans, but if we apply ML to humans, and the ML is really good, wouldn't it just give a prediction like "With near certainty, the human will output X in this situation"? (If the ML isn't very good, then any deviation from the above prediction would reflect the properties of the ML algorithm more than properties of the human.)

I don't have a great answer to this. Intuitively, at the high level, there are a lot of different plans I "could have" taken, and the fact that I didn't take them is more a result of what I happened to think about rather than a considered decision that they were bad. So in the limit of really good ML, one thing you could do is to have a distribution over "initial states" and then ask for the induced distribution over human actions. For example, if you're predicting the human's choice from a list of actions, then you could make the prediction depending on different orderings of the choices in the list, and different presentations of the list. If you're predicting what the human will do in some physical environment, you could check to see what would be done if the human felt slightly colder or slightly warmer, or if they had just thought of particular random words or sentences, etc. All of these have issues in the worst case (e.g. if you're making a decision about whether to wear a jacket, slightly changing the temperature of the room will make the decision worse), but seem fine in most cases, suggesting that there could be a way of making this work, especially if you can do it differently for different domains.
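A toy sketch of the "induced distribution" idea: take a deterministic predictor of the human, run it on many perturbed versions of the same situation, and count how often each action comes out. Everything here is hypothetical illustration; `predict_action` stands in for the "really good ML" model, and the perturbations are just reorderings of the list of choices.

```python
from collections import Counter
from itertools import permutations

def induced_action_distribution(predict_action, base_context, perturbations):
    """Empirical distribution over predicted actions across perturbed contexts.

    predict_action: hypothetical deterministic model, context -> action
    perturbations:  iterable of functions context -> context
    """
    counts = Counter(predict_action(perturb(base_context)) for perturb in perturbations)
    total = sum(counts.values())
    return {action: n / total for action, n in counts.items()}

# Toy human: picks whichever acceptable option is listed first, and never takes the bus.
options = ("walk", "bike", "bus")
first_acceptable = lambda listing: next(o for o in listing if o != "bus")
# Each perturbation presents the same options in a different order.
orderings = [lambda _, p=p: p for p in permutations(options)]
dist = induced_action_distribution(first_acceptable, options, orderings)
```

Even though the predictor is deterministic for any one presentation, the result spreads probability over "walk" and "bike", which gives a quantilizer a non-degenerate base distribution to work with.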

I just want to say that I noticed that you reformatted the post to make it more readable, so thanks! And also thanks for writing these in the first place. :)

(Assuming this was about the leftover email formatting in the last newsletter post) Woops, the last one not being properly formatted is my bad. I go through the newsletter every week and make sure that the email-formatting is removed, and I forgot to do it last week.

Oh I didn't realize that you were reformatting it. I just saw the post change format after doing a refresh and assumed that Rohin did it. I was going to suggest that in the future Rohin try pasting in the HTML version of the newsletter that's available at https://mailchi.mp/3091c6e9405c/alignment-newsletter-48 (by following the "view this email in your browser" link that was in the original post), but actually I'm not sure it's possible to paste HTML into a LW post. Do you know if there's a way to do that?

We are using the RSS feed that Rohin has set up, the HTML of which automatically gets crossposted to LW, so Rohin doesn't actually need to do anything on LW directly. Would be cool to have an RSS feed that's more cleaned up, but not sure whether that's easy to make happen with the mailchimp interface.

Admins can directly post HTML, but for obvious security and styling reasons we don't allow users to do that.

Ah ok. What's your suggestion for other people crossposting between LW and another blog (that doesn't use Markdown)? (Or how are people already doing this?) Use an HTML to Markdown converter? (Is there one that you'd suggest?) Reformat for LW manually? Something else?

For doing manual crossposts, copy-pasting from the displayed-HTML-version of your post into our WYSIWYG editor actually works quite well, and works basically flawlessly about 90% of the time for everything except tables. I've put a bunch of effort into making that experience good.

Ok, I had the Markdown editor enabled, and when I tried to paste in HTML all the formatting was removed, so I thought pasting HTML doesn't work. Can you implement this conversion feature for the Markdown editor too, or if that's too hard, detect the user pasting HTML and show a suggestion to switch to the WYSIWYG editor?

Also it's unclear in the settings that if I uncheck the "Markdown editor" checkbox, the alternative would be the WYSIWYG editor. Maybe add a note to explain that, or make the setting a radio button so the user can more easily understand what the two options mean?

Ah, yeah. That makes a bunch of sense.

Implementing the same behavior for Markdown is doable, though a bit of annoying engineering effort. I created an issue for it. It's about as much work as detecting the user pasting HTML, so I think I will try to just do that instead of asking the user to switch to the WYSIWYG editor. (Edit: After checking, the library we use to convert html into markdown is actually server-side only, so that means it's more work than I expected and maybe the correct choice is to present the user with a dialog. Will have to play around a bit.)

And yeah, it does seem like we should use a radio button instead of a checkbox. Also added an issue for that. We will probably rework the user-settings page at some point in the next few months since it's been getting pretty crowded and confusing, and so we will probably get to that then.

Basically, my recommendation is to set up crossposting (by pinging us to set up crossposting) and to make sure that your RSS feed contains non-styled html, which is the case for all the other RSS feeds except this one (like Zvi's and Sarah Constantin's). When you go and edit the post on LW after it's been automatically imported, we automatically convert it into Markdown or our WYSIWYG format (depending on your user setting) to allow you to edit it (though some styling might be lost if you save it, and we warn you about that).
