[AN #132]: Complex and subtly incorrect arguments as an obstacle to debate
Newsletter #132
Alignment Newsletter is a weekly publication with recent content relevant to AI alignment around the world. Find all Alignment Newsletter resources here. In particular, you can look through this spreadsheet of all summaries that have ever been in the newsletter.
Audio version here (may not be up yet).
Please note that while I work at DeepMind, this newsletter represents my personal views and not those of my employer.




Debate update: Obfuscated arguments problem (Beth Barnes et al) (summarized by Rohin): We’ve previously seen (AN #86) work on addressing potential problems with debate, including (but not limited to):

1. Evasiveness: By introducing structure to the debate, explicitly stating which claim is under consideration, we can prevent dishonest debaters from simply avoiding precision.

2. Misleading implications: To prevent the dishonest debater from “framing the debate” with misleading claims, debaters may also choose to argue about the meta-question “given the questions and answers provided in this round, which answer is better?”.

3. Truth is ambiguous: Rather than judging whether answers are true, which can be ambiguous and depend on definitions, we instead judge which answer is better.

4. Ambiguity: The dishonest debater can use an ambiguous concept, and then later choose which definition to work with depending on what the honest debater says. This can be solved with cross-examination (AN #86).

This post presents an open problem: the problem of obfuscated arguments. This happens when the dishonest debater presents a long, complex argument for an incorrect answer, where neither debater knows which of the series of steps is wrong. In this case, any given step is quite likely to be correct, and the honest debater can only say “I don’t know where the flaw is, but one of these arguments is incorrect”. Unfortunately, honest arguments are also often long and complex, and a dishonest debater could say exactly the same thing about them. It’s not clear how to distinguish between these two cases.
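To make “any given step is quite likely to be correct” concrete, here is a rough illustration under an assumption that is not in the post: the argument has n steps, exactly one of them is flawed, and from the debaters' point of view the flaw is equally likely to be at any step.

```latex
P(\text{step } i \text{ is correct}) = 1 - \frac{1}{n},
\qquad \text{e.g. } n = 100 \;\Rightarrow\; P(\text{step } i \text{ is correct}) = 0.99
```

Under this toy model, spot-checking any single step almost always finds nothing wrong, even though the argument as a whole is incorrect.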

While this problem was known to be a potential theoretical issue with debate, the post provides several examples of this dynamic arising in practice in debates about physics problems, suggesting that this will be a problem we have to contend with.

Rohin's opinion: This does seem like a challenging problem to address, and as the authors mention, it also affects iterated amplification. (Intuitively, if during iterated amplification the decomposition chosen happens to be one that ends up being obfuscated, then iterated amplification will get to the wrong answer.) I’m not really sure whether I expect this to be a problem in practice -- it feels like it could be, but it also feels like we should be able to address it using whatever techniques we use for robustness. But I generally feel very confused about this interaction and want to see more work on it.



AI Alignment, Philosophical Pluralism, and the Relevance of Non-Western Philosophy (Tan Zhi Xuan) (summarized by Rohin): This post argues that AI alignment has specific philosophical tendencies: 1) connectionism, where knowledge is encoded in neural net weights rather than through symbols, 2) behaviorism, where we learn from data rather than using reasoning or planning, 3) Humean motivations for humans (i.e. modeling humans as reward maximizers), 4) viewing rationality as decision theoretic, that is, about maximizing expected utility, rather than also considering e.g. logic, argumentation, and dialectic, and 5) consequentialism. This could be a “philosophical bubble” caused by founder effects from the EA and rationality communities, as well as from the recent success and popularity of deep learning.

Instead, we should be aiming for philosophical plurality, where we explore other philosophical traditions as well. This would be useful because 1) we would likely find insights not available in Western philosophy, 2) we would be more robust to moral uncertainty, 3) it helps us get buy-in from more actors, and 4) it is the “right” thing to do, to allow others to choose the values and ethical frameworks that matter to them.

For example, certain interpretations of Confucian philosophy hold that norms have intrinsic value, as opposed to the dominant approach in Western philosophy in which individual preferences have intrinsic value, while norms only have instrumental value. This may be very relevant for learning what an AI system should optimize. Similarly, Buddhist thought often talks about problems of ontological shifts.

Rohin's opinion: Certainly to the extent that AI alignment requires us to “lock in” philosophical approaches, I think it is important that we consider a plurality of views for this purpose (see also The Argument from Philosophical Difficulty (AN #46)). I especially think this is true if our approach to alignment is to figure out “human values” and then tell an AI to maximize them. However, I’m more optimistic about other approaches to alignment, and I think they require fewer philosophical commitments, so it becomes less of an issue that the alignment community has a specific philosophical bubble. See this comment for more details.


DERAIL: Diagnostic Environments for Reward And Imitation Learning (Pedro Freire et al) (summarized by Rohin): Most deep RL algorithms are quite sensitive to implementation and hyperparameters, and this transfers to imitation learning as well. So, it would be useful to have some simple sanity checks that an algorithm works well, before throwing algorithms at challenging benchmarks trying to beat the state of the art. This paper presents a suite of simple environments that each aim to test a single aspect of an algorithm, in a similar spirit to unit testing.

For example, RiskyPath is a very simple four-state MDP, in which the agent can take a long, safe path to the reward, or a short, risky path. As long as the agent is not incredibly short-sighted (i.e. very low γ), it should choose the safe path. This environment was directly inspired by an issue that affects Maximum Entropy IRL (AN #12) (later fixed by using causal entropy (AN #12)).

The paper also presents a case study in tuning an implementation of Deep RL from Human Preferences, in which a sparse exploration task suggested that the comparison queries were insufficiently diverse to guarantee stability.

Understanding Learned Reward Functions (Eric J. Michaud et al) (summarized by Rohin): This paper investigates what exactly learned reward functions are doing, through the use of interpretability techniques. The authors hope that interpreting reward functions will be more scalable than interpreting policies, as it seems plausible that reward functions will stay relatively similar in complexity, even when the policies become more complex as AI systems become more capable. Specifically, the authors look at:

1. Saliency maps, which plot the gradient of the reward with respect to each pixel, intuitively quantifying “how important is this pixel to the reward”

2. Occlusion maps, which show how much the reward changes if a certain area of the image is blurred

3. Counterfactual inputs, in which the authors manually craft input images to see what the learned reward function outputs.
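To illustrate the first two techniques, here is a minimal sketch on a tiny grayscale “image”. Everything in it is an assumption made for illustration: `toy_reward` is a hypothetical stand-in for a learned reward model, the gradient is approximated with finite differences rather than backpropagation, and occlusion replaces a patch with the image mean instead of blurring it.

```python
from typing import Callable, List

Image = List[List[float]]


def toy_reward(img: Image) -> float:
    """Hypothetical reward model: it only looks at the 2x2 top-left corner."""
    return sum(img[r][c] for r in range(2) for c in range(2))


def saliency_map(reward: Callable[[Image], float], img: Image,
                 eps: float = 1e-3) -> Image:
    """Approximate d(reward)/d(pixel) for every pixel by finite differences."""
    base = reward(img)
    out = [[0.0] * len(row) for row in img]
    for r in range(len(img)):
        for c in range(len(img[r])):
            bumped = [list(row) for row in img]
            bumped[r][c] += eps
            out[r][c] = (reward(bumped) - base) / eps
    return out


def occlusion_map(reward: Callable[[Image], float], img: Image,
                  patch: int = 2) -> Image:
    """Absolute change in reward when each patch is replaced by the image mean."""
    h, w = len(img), len(img[0])
    mean = sum(sum(row) for row in img) / (h * w)
    base = reward(img)
    out = [[0.0] * w for _ in range(h)]
    for r0 in range(0, h, patch):
        for c0 in range(0, w, patch):
            rows = range(r0, min(r0 + patch, h))
            cols = range(c0, min(c0 + patch, w))
            occluded = [list(row) for row in img]
            for r in rows:
                for c in cols:
                    occluded[r][c] = mean
            delta = abs(reward(occluded) - base)
            for r in rows:
                for c in cols:
                    out[r][c] = delta
    return out


image = [
    [1.0, 1.0, 0.0, 0.0],
    [1.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]
sal = saliency_map(toy_reward, image)
occ = occlusion_map(toy_reward, image)
```

In this toy case both maps are nonzero only in the top-left corner, which is the only region the reward function reads.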

In a simple gridworld where the agent must find the goal, the authors coded the reward function “1 if the agent moves to a previously visible goal location, else 0”, but they show that the learned reward is instead “0 if there is a currently visible goal location, else 1”. These are identical in the training environment, where there is always exactly one goal location (that the agent may be standing on, in which case that location is not visible). However, if there are changes at test time, e.g. multiple goal locations, then the learned reward will diverge from the true reward.
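The difference between the two reward functions can be sketched as follows. This is a simplified reading of the description above, not the paper's code: positions are (row, column) tuples, goals are assumed to stay fixed so that “previously visible” reduces to set membership, and a goal is treated as hidden exactly when the agent is standing on it.

```python
from typing import Set, Tuple

Pos = Tuple[int, int]


def intended_reward(agent: Pos, goals: Set[Pos]) -> int:
    """Reward as coded: 1 if the agent has moved onto a goal location, else 0."""
    return 1 if agent in goals else 0


def learned_reward(agent: Pos, goals: Set[Pos]) -> int:
    """Reward as learned: 0 if any goal is currently visible, else 1."""
    visible_goals = goals - {agent}  # a goal under the agent is not visible
    return 0 if visible_goals else 1


one_goal = {(2, 2)}
two_goals = {(2, 2), (0, 3)}

# Training distribution (exactly one goal): the two rewards agree.
agree_off_goal = intended_reward((0, 0), one_goal) == learned_reward((0, 0), one_goal)
agree_on_goal = intended_reward((2, 2), one_goal) == learned_reward((2, 2), one_goal)

# Test-time change (two goals): standing on one goal leaves the other visible.
intended_two = intended_reward((2, 2), two_goals)
learned_two = learned_reward((2, 2), two_goals)
```

With two goals the agent standing on one of them gets reward 1 under the intended function but 0 under the learned one, which is the kind of divergence described above.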

They then apply a similar methodology to Atari. They find that if the score is not hidden, then the learned reward model will simply check whether the score pixels are changing to detect reward -- unless the score pixels change at a later time than reward is accrued, in which case this is not a viable strategy. They thus suggest that future reward learning work on Atari should ensure that the score is removed from the screen.

Bayesian Inverse Reinforcement Learning (Deepak Ramachandran et al) (summarized by Rohin): Unlike many other methods, Bayesian Inverse Reinforcement Learning produces a posterior distribution over the reward functions that would explain the observed demonstrations. This distribution can be used for e.g. planning in a risk-averse manner. It works by starting with some randomly chosen reward function, and then repeating the following steps:

1. Perturb the reward function randomly

2. Solve for the optimal policy for that reward function

3. Use the optimal policy to see how likely the demonstrations would be for the reward function

4. Use the likelihood to determine whether to take this new reward function, or return to the old one.

(This is the application of a standard MCMC sampling algorithm to the likelihood model used in IRL.)
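The four steps above can be sketched as a small Metropolis sampler. This is a toy illustration under stated assumptions rather than the paper's implementation: the MDP is a hypothetical 4-state chain with deterministic left/right moves, the prior over rewards is uniform on [-1, 1] per state, and the demonstrator is assumed to be Boltzmann-rational with inverse temperature `ALPHA`.

```python
import math
import random
from typing import List, Tuple

N_STATES, ACTIONS, GAMMA, ALPHA = 4, (-1, +1), 0.9, 5.0


def step(s: int, a: int) -> int:
    """Deterministic chain dynamics; moving off either end keeps the agent in place."""
    return min(max(s + a, 0), N_STATES - 1)


def q_values(reward: List[float], iters: int = 60) -> List[List[float]]:
    """Step 2: value iteration. The reward is received on entering a state."""
    v = [0.0] * N_STATES
    for _ in range(iters):
        v = [max(reward[step(s, a)] + GAMMA * v[step(s, a)] for a in ACTIONS)
             for s in range(N_STATES)]
    return [[reward[step(s, a)] + GAMMA * v[step(s, a)] for a in ACTIONS]
            for s in range(N_STATES)]


def log_likelihood(reward: List[float], demos: List[Tuple[int, int]]) -> float:
    """Step 3: log-probability of the (state, action index) demonstrations."""
    q = q_values(reward)
    total = 0.0
    for s, a_idx in demos:
        log_z = math.log(sum(math.exp(ALPHA * x) for x in q[s]))
        total += ALPHA * q[s][a_idx] - log_z
    return total


def bayesian_irl(demos: List[Tuple[int, int]], n_samples: int = 500,
                 step_size: float = 0.3, seed: int = 0) -> List[List[float]]:
    rng = random.Random(seed)
    reward = [rng.uniform(-1.0, 1.0) for _ in range(N_STATES)]
    ll = log_likelihood(reward, demos)
    samples = []
    for _ in range(n_samples):
        # Step 1: randomly perturb one coordinate of the reward function.
        proposal = list(reward)
        i = rng.randrange(N_STATES)
        proposal[i] += rng.uniform(-step_size, step_size)
        # Proposals outside the support of the uniform prior are rejected.
        if abs(proposal[i]) <= 1.0:
            new_ll = log_likelihood(proposal, demos)
            # Step 4: accept with probability min(1, likelihood ratio).
            if rng.random() < math.exp(min(0.0, new_ll - ll)):
                reward, ll = proposal, new_ll
        samples.append(list(reward))
    return samples


# Demonstrations in which the agent always moves right (action index 1).
DEMOS = [(0, 1), (1, 1), (2, 1)] * 3
samples = bayesian_irl(DEMOS)
posterior_mean = [sum(r[s] for r in samples) / len(samples) for s in range(N_STATES)]
```

The list `samples` approximates the posterior over reward functions, so summaries such as `posterior_mean` (or risk-averse planning over the samples) can be computed from it. Note that every iteration calls `q_values`, which is the expensive step 2.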

Efficient Exploration of Reward Functions in Inverse Reinforcement Learning via Bayesian Optimization (Sreejith Balakrishnan et al) (summarized by Rohin): In the description of Bayesian IRL above, step 2 is very expensive, as it requires solving a full RL problem. Can we improve any of the other steps to reduce the number of times we have to run step 2? This paper aims to improve step 1: rather than choosing the next reward randomly, we can choose one that we think will be most informative. The authors apply the framework of Bayesian optimization to put this into practice. I won’t explain it more here since the details are fairly technical and involved (and I didn’t read the paper closely enough to understand it myself). They did have to introduce a new kernel in order to handle the fact that reward functions are invariant to the addition of a potential function.
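For reference, the invariance mentioned in the last sentence is usually written as potential-based reward shaping (Ng et al., 1999). The notation below is the standard textbook form and is not taken from the paper being summarized:

```latex
R'(s, a, s') = R(s, a, s') + \gamma \, \Phi(s') - \Phi(s)
\qquad \text{for any potential function } \Phi : S \to \mathbb{R}
```

R and R' induce the same optimal policies, so demonstrations alone cannot distinguish between them, and a kernel that treats them as different rewards would waste effort exploring equivalent candidates.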


How energy efficient are human-engineered flight designs relative to natural ones? (Ronny Fernandez) (summarized by Rohin): When forecasting AI timelines from biological anchors (AN #121), one important subquestion is how well we expect human-made artifacts to compare to natural artifacts (i.e. artifacts made by evolution). This post gathers empirical data for flight, by comparing the Monarch butterfly and the Wandering Albatross to various types of planes. The albatross is the most efficient, with a score of 2.2 kg-m per Joule (that is, a ~7 kg albatross spends ~3 Joules for every meter it travels). This is 2-8x better than the most efficient manmade plane that the authors considered, the Boeing 747-400, which in turn is better than the Monarch butterfly. (The authors also looked at distance per Joule without considering mass, in which case unsurprisingly the butterfly wins by miles; it is about 3 orders of magnitude better than the albatross, which is in turn better than all the manmade solutions.)
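The parenthetical conversion for the albatross follows directly from dividing its mass by its efficiency score, using only the figures quoted above:

```latex
\frac{7\ \text{kg}}{2.2\ \text{kg}\cdot\text{m}/\text{J}} \approx 3.2\ \text{J}/\text{m}
```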



Does GPT-2 Know Your Phone Number? (Nicholas Carlini et al) (summarized by Rohin): This post and associated paper demonstrate that large language models memorize rare training data, and (some of) that training data can then be extracted through an automated attack. The key idea is to sample text that is unusually high likelihood. Given a high likelihood sample from a language model, we can check whether the likelihood is especially high by comparing the likelihood to:

1. The likelihood assigned by other (especially smaller) language models. Presumably these models would not have memorized the same content, especially if the content was rare (which is the content we are most interested in).

2. The length of the text when compressed by (say) zlib. Existing compression algorithms are pretty good at compressing regular English text, and so it is notable when a language model assigns high likelihood but the compression algorithm can’t compress it much.

3. The likelihood assigned to the same text, but lowercase. Often, memorized content is case-sensitive, and likelihood drops significantly when the case is changed.
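One way to turn these three comparisons into scores is sketched below. This is an illustrative sketch, not the paper's code, and the exact scoring used there may differ. `nll_bits` and `small_nll_bits` are hypothetical callables that return a model's negative log-likelihood of a string in bits; the two toy “models” and the `SECRET` string are made up so that the example runs without a real language model. Only `zlib.compress` is a real library call.

```python
import zlib
from typing import Callable, Dict


def zlib_bits(text: str) -> int:
    """Size of the zlib-compressed UTF-8 encoding of the text, in bits."""
    return 8 * len(zlib.compress(text.encode("utf-8")))


def memorization_scores(text: str,
                        nll_bits: Callable[[str], float],
                        small_nll_bits: Callable[[str], float]) -> Dict[str, float]:
    """Ratios of each reference cost to the main model's cost for the text.

    Larger values mean the main model finds the text unusually likely
    relative to that reference, which hints at memorization.
    """
    target = nll_bits(text)
    return {
        "vs_small_model": small_nll_bits(text) / target,
        "vs_zlib": zlib_bits(text) / target,
        "vs_lowercase": nll_bits(text.lower()) / target,
    }


SECRET = "Order-ID: 7F3A-9C21"  # made-up string standing in for rare training data


def toy_big_model(text: str) -> float:
    """Hypothetical model that has memorized SECRET exactly (case-sensitive)."""
    return len(text) * (1.0 if text == SECRET else 6.0)


def toy_small_model(text: str) -> float:
    """Hypothetical smaller model that has not memorized anything."""
    return len(text) * 6.0


scores = memorization_scores(SECRET, toy_big_model, toy_small_model)
```

Samples could then be ranked by any of these scores, with the highest-scoring ones inspected manually.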

The authors generate a lot of samples from GPT-2, use the metrics above to rank them in order of how likely they are to be memorized from the training set, and then investigate the top 1800 manually. They find that 604 of them are directly from the training set. While many are unobjectionable (such as news headlines), in some cases GPT-2 has memorized personal data (and the authors have extracted it simply by prompting GPT-2). In their most objectionable example, they extract the name, email, phone number, work address, and fax of a single person.

Read more: Blog post: Privacy Considerations in Large Language Models

Paper: Extracting Training Data from Large Language Models

Rohin's opinion: I really liked the paper: it contains a lot of empirical detail that didn’t make it into the blog post, and that detail gave me a much better sense of the scope of the problem. I don’t really have the space to summarize it here, so I recommend reading the paper.


Why those who care about catastrophic and existential risk should care about autonomous weapons (Anthony Aguirre) (summarized by Nicholas): This post argues for a focus on autonomous weapons systems (AWs) for three main reasons:

1. AWs provide a trial run for AGI governance. Governance of AWs shares many properties with AGI safety. Preventing an AW arms race would require international cooperation that would provide a chance to understand and improve AI governance institutions. As with any AI system, AWs have the potential to be effective without necessarily being aligned with human values, and accidents could quickly lead to deadly consequences. Public opinion and the vast majority of AI researchers oppose AW arms races, so there is an opportunity for global coordination on this issue.

2. Some AWs can directly cause catastrophic risk. Cheap drones could potentially be created at scale that are easy to transport and hard to detect. This could enable an individual to kill many people without the need to convince many others that it is justified. They can discriminate targets better than other WMDs and cause less environmental damage. This has the potential to make war less harmful, but also makes it easier to justify.

3. AWs increase the likelihood and severity of conflict by providing better tools for terrorists and assassins, lowering the threshold for violence between and within states, upsetting the relative power balance of current militaries, and increasing the likelihood of accidental escalation. In particular, AWs that are being used to counter other AWs might intentionally be made hard to understand and predict, and AWs may react to each other at timescales that are too quick for humans to intervene or de-escalate.

An international agreement governing autonomous weapons could help to alleviate the above concerns. In particular, some classes of weapons could be banned, and others could be tracked and subjected to regulations. This would hopefully lead us to an equilibrium where offensive AWs are prohibited, but defended against in a stable way.

Nicholas' opinion: I agree completely with the first two points. Much of technical safety work has been based around solving currently existing analogs of the alignment problem. Governance does seem to have fewer of these, so autonomous weapon governance could provide a great opportunity to test and build credibility for AI governance structures. The ability of autonomous weapons to cause catastrophic risk seems hard to argue against. With powerful enough AI, even accidents can pose catastrophic risk, but I would expect military use to only increase those.

For the third point, I agree with the reasons provided, but I think there are also ways in which AWs may reduce the likelihood and severity of war. For instance, currently soldiers bear most of the risk in wars, whereas decision-makers are often protected. Targeted AW attacks may increase the relative risk for those making decisions and thus disincentivize them from declaring war. An equilibrium of AW mutually assured destruction might also be attained if we can find reliable ways to attribute AW attacks and selectively retaliate. I’d be interested to see a more extensive analysis of how these and other factors trade off as I am unsure of the net effect.

The piece that gives me the most doubt that this is an area for the x-risk community to focus on is tractability. An international agreement runs the risk of weakening the states that sign on without slowing the rate of AW development in countries that don’t. Getting all actors to sign on seems intractable to me. As an analogy, nuclear weapons proliferation has been a challenge, even though nuclear weapons development is much more complex and visible than development of AWs.

Rohin's opinion: I particularly liked this piece because it actually made the case for work on autonomous weapons -- I do not see such work as obviously good (see for example this post that I liked for the perspective against banning autonomous weapons). I still feel pretty uncertain overall, but I think this post meaningfully moved the debate forward.


I'm always happy to hear feedback; you can send it to me, Rohin Shah, by replying to this email.


An audio podcast version of the Alignment Newsletter, recorded by Robert Miles, is available.
Subscribe here:

Copyright © 2021 Alignment Newsletter, All rights reserved.

1 comment

As always, thanks to everyone involved in the newsletter!

The Understanding Learned Reward Functions paper looks great, both in terms of studying inner alignment (the version with goal-directed/RL policies instead of mesa-optimizers) and for thinking about goal-directedness.