Find all Alignment Newsletter resources here. In particular, you can sign up, or look through this spreadsheet of all summaries that have ever been in the newsletter. I'm always happy to hear feedback; you can send it to me by replying to this email.

Audio version here (may not be up yet).

Highlights

Writeup: Progress on AI Safety via Debate (Beth Barnes et al) (summarized by Rohin): This post reports on work done on creating a debate (AN #5) setup that works well with human players. In the game, one player is honest (i.e. arguing for the correct answer) and one is malicious (i.e. arguing for some worse answer), and they play a debate in some format, after which a judge must decide which player won the debate. They are using Thinking Physics questions for these debates, because they involve questions with clear answers that are confusing to most people (the judges) but easy for some experts (the players).

Early freeform text debates did not work very well, even with smart, motivated judges. The malicious player could deflect on questions they didn't want to answer, e.g. by claiming that the question was ambiguous and redirecting attention by asking new questions. In addition, when the malicious player got to go first and give an incorrect "framework" for finding the answer, and then made individually true claims to "fill in" the framework, it was hard for the honest player to rebut it. So, they moved to a framework without such asymmetries: both players gave a claim (simultaneously), both gave constructive arguments, and both rebutted the other's arguments. In addition, part of the appeal of debate is that the agents can "zoom in" on the particular disagreement they have, and the judge need only analyze the smallest disagreement in order to declare an overall winner.

This suggests the following setup: players simultaneously provide an answer supported with subclaims. Then, after looking at the other player's answer and subclaims, they can provide objections (perhaps on particular subclaims, or the overall structure of the argument). Players can concede or challenge objections left by the other player; the debate then recurses on one of the challenged objections (currently, the malicious player chooses which claim to recurse on, in order to be conservative). There is a depth limit to the recursion, once it is reached (or if there are no challenged objections) the judge is given a transcript of the debate and must declare a winner.

A new problem arises: the malicious player can exploit ambiguity to their advantage. For example, suppose the malicious player is arguing that there are 5 primes between 0 and 10, and makes the subclaims "there are 3 primes between 0 and 5" and "there are 2 primes between 5 and 10". If challenged on the first, they will say "the three primes are 2, 3, and 5", if challenged on the second, they will say "the two primes are 5 and 7". While in this simple case the honest player could register a top-level objection claiming that the word "between" is ambiguous, and ask for clarification, in general it would be too hard to eliminate all ambiguity.

To mitigate this, the team added a method of cross examination, in which each player is allowed to ask an earlier version of the other agent a question, and quote its answer as part of the debate transcript. (The post talks about how this can be approximated with humans, but I won't go into the details.) In the complexity theory analogy, this boosts debate from PSPACE to NEXP. Intuitively, cross examination allows players to force each other to precommit to specific claims and arguments, because if they don't stay consistent in their claims and arguments, then their opponent can demonstrate the inconsistency. Now, if the malicious player tries to make an argument that depends on interpreting an ambiguous concept in two different ways, then the honest player can cross examine and ask which of the two interpretations they mean. If they are inconsistent, that can be demonstrated to the judge; if they consistently answer one way, then the honest player can challenge the part of the argument that depends on the other interpretation.

They then identify several open concerns with debate, of which they highlight the long computation problem. This is a problem when you no longer assume that the debaters have optimal play: in this case, the malicious player could create a complicated argument that neither debater understands well, that supports the malicious case but that the honest player doesn't know how to refute.

Rohin's opinion: I enjoyed this a lot: the problems found were crisp and the solutions had good arguments that they actually solved the identified problem. Reading through the actual examples and arguments made me more optimistic about debate in general, mostly from a felt sense that the actual concrete results were getting closer to matching the theoretical ideal, and that there actually could be reasonable solutions to "messy" problems like ambiguity.

The full post has formal explanations and actual examples, which I highly recommend.

Evaluating Arguments One Step at a Time (Ought) (summarized by Rohin): This blog post reports on the results of Ought's experiments with factored evaluation (AN #62) by humans on questions about movie reviews. There are two human players, one honest and one malicious. Half the time, the honest player chooses a true top-level claim, decomposes it into supporting subclaims up to some depth, and provides supporting quotes for each claim and subclaim. The other half of the time, the malicious player does the same, except they choose a false top-level claim. In either case, the opposite player then chooses quotes that aim to refute the claims and subclaims. For each claim and subclaim, a new judge evaluates whether that claim is "valid" or "invalid". If any claim is marked invalid, then the top-level claim is judged to be false, otherwise it is judged true.

They found that depth 1 and depth 2 trees outperformed depth 0 trees (for which judgments were at about chance). However, they did not perform amazingly well -- with posthoc hyperparameter tuning, depth 1 and 2 trees got accuracies of 65% and 69% respectively. False positives (accepting false claims) were primarily caused due to malicious claims that couldn't easily be rebutted with the limited budget (but could be rebutted with a larger budget), and individual steps being complex enough that the judge made mistakes. False negatives (rejecting true claims) tend to happen when players choose hard-to-argue true claims, especially ones in which no specific quote clearly supports the claim, even though the full paragraph supports the claim through its tone and style.

There are several approaches that could theoretically solve these problems, such as increasing the size of claim trees, and improving quality control for judgments (e.g. by aggregating multiple judgments together).

Rohin's opinion: I wouldn't read too much into the low accuracies of the depth 2 trees: it seems quite plausible that this is specific to the movie review setting, and in settings with clearer answers you could do better. Like with the previous post, I found the actual examples quite illuminating: it's always interesting to see what happens when theory collides with the real world.

Technical AI alignment

Technical agendas and prioritization

Cooperation, Conflict, and Transformative Artificial Intelligence: A Research Agenda (Jesse Clifton) (summarized by Flo): This agenda by the Effective Altruism Foundation focuses on risks of astronomical suffering (s-risks) posed by Transformative AI (AN #82) (TAI) and especially those related to conflicts between powerful AI agents. This is because there is a very clear path from extortion and executed threats against altruistic values to s-risks. While especially important in the context of s-risks, cooperation between AI systems is also relevant from a range of different viewpoints. The agenda covers four clusters of topics: strategy, credibility and bargaining, current AI frameworks, as well as decision theory.

The extent of cooperation failures is likely influenced by how power is distributed after the transition to TAI. At first glance, it seems like widely distributed scenarios (as CAIS (AN #40)) are more problematic, but related literature from international relations paints a more complicated picture. The agenda seeks a better understanding of how the distribution of power affects catastrophic risk, as well as potential levers to influence this distribution. Other topics in the strategy/governance cluster include the identification and analysis of realistic scenarios for misalignment, as well as case studies on cooperation failures in humans and how they can be affected by policy.

TAI might enable unprecedented credibility, for example by being very transparent, which is crucial for both contracts and threats. The agenda aims at better models of the effects of credibility on cooperation failures. One approach to this is open-source game theory, where agents can see other agents' source codes. Promising approaches to prevent catastrophic cooperation failures include the identification of peaceful bargaining mechanisms, as well as surrogate goals. The idea of surrogate goals is for an agent to commit to act as if it had a different goal, whenever it is threatened, in order to protect its actual goal from threats.

As some aspects of contemporary AI architectures might still be present in TAI, it can be useful to study cooperation failure in current systems. One concrete approach to enabling cooperation in social dilemmas that could be tested with contemporary systems is based on bargaining over policies combined with punishments for deviations. Relatedly, it is worth investigating whether or not multi-agent training leads to human-like bargaining by default. This has implications on the suitability of behavioural vs classical game theory to study TAI. The behavioural game theory of human-machine interactions might also be important, especially in human-in-the-loop scenarios of TAI.

The last cluster discusses the implications of bounded computation on decision theory as well as the decision theories (implicitly) used by current agent architectures. Another focus lies on acausal reasoning and in particular the possibility of acausal trade, where different correlated AI systems cooperate without any causal links between them.

Flo's opinion: I am broadly sympathetic to the focus on preventing the worst outcomes and it seems plausible that extortion could play an important role in these, even though I worry more about distributional shift plus incorrigibility. Still, I am excited about the focus on cooperation, as this seems robustly useful for a wide range of scenarios and most value systems.

Rohin's opinion: Under a suffering-focused ethics under which s-risks far overwhelm x-risks, I think it makes sense to focus on this agenda. There don't seem to be many plausible paths to s-risks: by default, we shouldn't expect them, because it would be quite surprising for an amoral AI system to think it was particularly useful or good for humans to suffer, as opposed to not exist at all, and there doesn't seem to be much reason to expect an immoral AI system. Conflict and the possibility of carrying out threats are the most plausible ways by which I could see this happening, and the agenda here focuses on neglected problems in this space.

However, under other ethical systems (under which s-risks are worse than x-risks, but do not completely dwarf x-risks), I expect other technical safety research to be more impactful, because other approaches can more directly target the failure mode of an amoral AI system that doesn't care about you, which seems both more likely and more amenable to technical safety approaches (to me at least). I could imagine work on this agenda being quite important for strategy research, though I am far from an expert here.

Iterated amplification

Synthesizing amplification and debate (Evan Hubinger) (summarized by Rohin): The distillation step in iterated amplification (AN #30) can be done using imitation learning. However, as argued in Against Mimicry, if your model M is unable to do perfect imitation, there must be errors, and in this case the imitation objective doesn't necessarily incentivize a graceful failure, whereas a reward-based objective does. So, we might want to add an auxiliary reward objective. This post proposes an algorithm in which the amplified model answers a question via a debate (AN #5). The distilled model can then be trained by a combination of imitation of the amplified model, and reinforcement learning on the reward of +1 for winning the debate and -1 for losing.

Rohin's opinion: This seems like a reasonable algorithm to study, though I suspect there is a simpler algorithm that doesn't use debate that has the same advantages. Some other thoughts in this thread.

Learning human intent

Deep Bayesian Reward Learning from Preferences (Daniel S. Brown et al) (summarized by Zach): Bayesian inverse reinforcement learning (IRL) is ideal for safe imitation learning since it allows uncertainty in the reward function estimator to be quantified. This approach requires thousands of likelihood estimates for proposed reward functions. However, each likelihood estimate requires training an agent according to the hypothesized reward function. Predictably, such a method is computationally intractable for high dimensional problems.

In this paper, the authors propose Bayesian Reward Extrapolation (B-REX), a scalable preference-based Bayesian reward learning algorithm. They note that in this setting, a likelihood estimate that requires a loop over all demonstrations is much more feasible than an estimate that requires training a new agent. So, they assume that they have a set of ranked trajectories, and evaluate the likelihood of a reward function by its ability to reproduce the preference ordering in the demonstrations. To get further speedups, they fix all but the last layer of the reward model using a pretraining step: the reward of a trajectory is then simply the dot product of the last layer with the features of the trajectory as computed by all but the last layer of the net (which can be precomputed and cached once).

The authors test B-REX on pixel-level Atari games and show competitive performance to T-REX (AN #54), a related method that only computes the MAP estimate. Furthermore, the authors can create confidence intervals for performance since they can sample from the reward distribution.

Zach's opinion: The idea of using preference orderings (Bradley-Terry) to speed up the posterior probability calculation was ingenious. While B-REX isn't strictly better than T-REX in terms of rewards achieved, the ability to construct confidence intervals for performance is a major benefit. My takeaway is that Bayesian IRL is getting more efficient and may have good potential as a practical approach to safe value learning.

Preventing bad behavior

Attainable utility has a subagent problem (Stuart Armstrong) (summarized by Flo): This post argues that regularizing an agent's impact by attainable utility (AN #25) can fail when the agent is able to construct subagents. Attainable utility regularization uses auxiliary rewards and penalizes the agent for changing its ability to get high expected rewards for these to restrict the agent's power-seeking. More specifically, the penalty for an action is the absolute difference in expected cumulative auxiliary reward between the agent either doing the action or nothing for one time step and then optimizing for the auxiliary reward.

This can be circumvented in some cases: If the auxiliary reward does not benefit from two agents instead of one optimizing it, the agent can just build a copy of itself that does not have the penalty, as doing this does not change the agent's ability to get a high auxiliary reward. For more general auxiliary rewards, an agent could build another more powerful agent, as long as the powerful agent commits to balancing out the ensuing changes in the original agent's attainable auxiliary rewards.

Flo's opinion: I am confused about how much the commitment to balance out the original agent's attainable utility would constrain the powerful subagent. Also, in the presence of subagents, it seems plausible that attainable utility mostly depends on the agent's ability to produce subagents of different generality with different goals: If a subagent that optimizes for a single auxiliary reward was easier to build than a more general one, building a general powerful agent could considerably decrease attainable utility for all auxiliary rewards, such that the high penalty rules out this action.

News

TAISU - Technical AI Safety Unconference (Linda Linsefors) (summarized by Rohin): This unconference on technical AI safety will be held May 14th-17th; application deadline is February 23.

AI Alignment Visiting Fellowship (summarized by Rohin): This fellowship would support 2-3 applicants to visit FHI for three or more months to work on human-aligned AI. The application deadline is Feb 28.

AI ALIGNMENT FORUM
AF