Find all Alignment Newsletter resources here. In particular, you can sign up, or look through this spreadsheet of all summaries that have ever been in the newsletter. I'm always happy to hear feedback; you can send it to me by replying to this email.

Audio version here (may not be up yet).


ReMixMatch: Semi-Supervised Learning with Distribution Alignment and Augmentation Anchoring (David Berthelot et al) (summarized by Dan H): A common criticism of deep learning is that it requires far too much training data. Some view this as a fundamental flaw that suggests we need a new approach. However, considerable data efficiency is possible with a new technique called ReMixMatch. ReMixMatch on CIFAR-10 obtains 84.92% accuracy using only 4 labeled examples per class. Using 250 labeled examples, or around 25 labeled examples per class, a ReMixMatch model on CIFAR-10 has 93.73% accuracy. This is approximately how well a vanilla ResNet does on CIFAR-10 with 50000 labeled examples. Two years ago, special techniques utilizing 250 CIFAR-10 labeled examples could enable an accuracy of approximately 53%. ReMixMatch builds on MixMatch and has several seemingly arbitrary design decisions, so I will refrain from describing its design. In short, deep networks do not necessarily require large labeled datasets.

And just yesterday, after this summary was first written, the FixMatch paper got even better results.

Previous newsletters

In last week's email, two of Flo's opinions were somehow scrambled together. See below for what they were supposed to be.

Defining and Unpacking Transformative AI (Ross Gruetzemacher et al) (summarized by Flo): Focusing on the impacts on society instead of specific features of AI systems makes sense and I do believe that the shape of RTAI as well as the risks it poses will depend on the way we handle TAI at various levels. More precise terminology can also help to prevent misunderstandings, for example between people forecasting AI and decision makers.

When Goodharting is optimal: linear vs diminishing returns, unlikely vs likely, and other factors (Stuart Armstrong) (summarized by Flo): I enjoyed this article and the proposed factors match my intuitions. There seem to be two types of problems: extreme beliefs and concave Pareto boundaries. Dealing with the second is more important since a concave Pareto boundary favours extreme policies, even for moderate beliefs. Luckily, diminishing returns can be used to bend the Pareto boundary. However, I expect it to be hard to find the correct rate of diminishing returns, especially in novel situations.

Technical AI alignment

Iterated amplification

AI Safety Debate and Its Applications (Vojta Kovarik) (summarized by Rohin): This post defines the components of a debate (AN #5) game, lists some of its applications, and defines truth-seeking as the property that we want. Assuming that the agent chooses randomly from the possible Nash equilibria, the truth-promoting likelihood is the probability that the agent picks the actually correct answer. The post then shows the results of experiments on MNIST and Fashion MNIST, seeing comparable results to the original paper.

(When) is Truth-telling Favored in AI debate? (Vojtěch Kovařík et al) (summarized by Rohin): Debate (AN #5) aims to train an AI system using self-play to win "debates" which aim to convincingly answer a question, as evaluated by a human judge. The main hope is that the equilibrium behavior of this game is for the AI systems to provide true, useful information. This paper studies this in a simple theoretical setting called feature debates. In this environment, a "world" is sampled from some distribution, and the agents (who have perfect information) are allowed to make claims about real-valued "features" of the world, in order to answer some question about the features of the world. The judge is allowed to check the value of a single feature before declaring a winner, but otherwise knows nothing about the world.

If either agent lies about the value of a feature, the other agent can point this out, which the judge can then check; so at the very least the agents are incentivized to honestly report the values of features. However, does this mean that they will try to answer the full question truthfully? If the debate has more rounds than there are features, then it certainly does: either agent can unilaterally reveal every feature, which uniquely determines the answer to the question. However, shorter debates need not lead to truthful answers. For example, if the question is whether the first K features are all 1, then if the debate length is shorter than K, there is no way for an agent to prove that the first K features are all 1.

Rohin's opinion: While it is interesting to see what doesn't work with feature debates, I see two problems that make it hard to generalize these results to regular debate. First, I see debate as being truth-seeking in the sense that the answer you arrive at is (in expectation) more accurate than the answer the judge would have arrived at by themselves. However, this paper wants the answers to actually be correct. Thus, they claim that for sufficiently complicated questions, since the debate can't reach the right answer, the debate isn't truth-seeking -- but in these cases, the answer is still in expectation more accurate than the answer the judge would come up with by themselves.

Second, feature debate doesn't allow for decomposition of the question during the debate, and doesn't allow the agents to challenge each other on particular questions. I think this limits the "expressive power" of feature debate to P, while regular debate reaches PSPACE, and is thus able to do much more than feature debate. See this comment for more details.

Read more: Paper: (When) Is Truth-telling Favored in AI Debate?

Mesa optimization

Malign generalization without internal search (Matthew Barnett) (summarized by Rohin): This post argues that agents can have capability generalization without objective generalization (AN #66), without having an agent that does internal search in pursuit of a simple mesa objective. Consider an agent that learns different heuristics for different situations which it selects from using a switch statement. For example, in lunar lander, if at training time the landing pad is always red, the agent may learn a heuristic about which thrusters to apply based on the position of red ground relative to the lander. The post argues that this selection across heuristics could still happen with very complex agents (though the heuristics themselves may involve search).

Rohin's opinion: I generally agree that you could get powerful agents that nonetheless are "following heuristics" rather than "doing search"; however, others with differing intuitions did not find this post convincing.

Agent foundations

Embedded Agency via Abstraction (John S Wentworth) (summarized by Asya): Embedded agency problems (AN #31) are a class of theoretical problems that arise as soon as an agent is part of the environment it is interacting with and modeling, rather than having a clearly-defined and separated relationship. This post makes the argument that before we can solve embedded agency problems, we first need to develop a theory of abstraction. Abstraction refers to the problem of throwing out some information about a system while still being able to make predictions about it. This problem can also be referred to as the problem of constructing a map for some territory.

The post argues that abstraction is key for embedded agency problems because the underlying challenge of embedded world models is that the agent (the map) is smaller than the environment it is modeling (the territory), and so inherently has to throw some information away.

Some simple questions around abstraction that we might want to answer include:

- Given a map-making process, characterize the queries whose answers the map can reliably predict.

- Given some representation of the map-territory correspondence, translate queries from the territory-representation to the map-representation and vice versa.

- Given a territory, characterize classes of queries which can be reliably answered using a map much smaller than the territory itself.

- Given a territory and a class of queries, construct a map which throws out as much information as possible while still allowing accurate prediction over the query class.

The post argues that once we create the simple theory, we will have a natural way of looking at more challenging problems with embedded agency, like the problem of self-referential maps, the problem of other map-makers, and the problem of self-reasoning that arises when the produced map includes an abstraction of the map-making process itself.

Asya's opinion: My impression is that embedded agency problems as a class of problems are very young, extremely entangled, and characterized by a lot of confusion. I am enthusiastic about attempts to decrease confusion and intuitively, abstraction does feel like a key component to doing that.

That being said, my guess is that it’s difficult to predictably suggest the most promising research directions in a space that’s so entangled. For example, one thread in the comments of this post discusses the fact that this theory of abstraction as presented looks at “one-shot” agency where the system takes in some data once and then outputs it, rather than “dynamic” agency where a system takes in data and outputs decisions repeatedly over time. Abram Demski argues that the “dynamic” nature of embedded agency is a central part of the problem and that it may be more valuable and neglected to put research emphasis there.

Dissolving Confusion around Functional Decision Theory (Stephen Casper) (summarized by Rohin): This post argues for functional decision theory (FDT) on the basis of the following two principles:

1. Questions in decision theory are not about what "choice" you should make with your "free will", but about what source code you should be running.

2. P "subjunctively depends" on A to the extent that P's predictions of A depend on correlations that can't be confounded by choosing the source code that A runs.

Rohin's opinion: I liked these principles, especially the notion that subjunctive dependence should be cashed out as "correlations that aren't destroyed by changing the source code". This isn't a perfect criterion: FDT can and should apply to humans as well, but we don't have control over our source code.

Predictors exist: CDT going bonkers... forever (Stuart Armstrong) (summarized by Rohin): Consider a setting in which an agent can play a game against a predictor. The agent can choose to say zero or one. It gets 3 utility if it says something different from the predictor, and -1 utility if it says the same thing. If the predictor is near-perfect, but the agent models its actions as independent of the predictor (since the prediction was made in the past), then the agent will have some belief about the prediction and will choose the less likely action for expected utility at least 1, and will continually lose.

ACDT: a hack-y acausal decision theory (Stuart Armstrong) (summarized by Rohin): The problem with the previous agent is that it never learns that it has the wrong causal model. If the agent is able to learn a better causal model from experience, then it can learn that the predictor can actually predict the agent successfully, and so will no longer expect a 50% chance of winning, and it will stop playing the game.

Miscellaneous (Alignment)

Clarifying The Malignity of the Universal Prior: The Lexical Update (interstice)

Other progress in AI

Reinforcement learning

Reward-Conditioned Policies (Aviral Kumar et al) (summarized by Nicholas): Standard RL algorithms create a policy that maximizes a reward function; the Reward-Conditioned Policy algorithm instead creates a policy that can achieve a particular reward value passed in as an input. This allows the policy to be trained via supervised regression on a dataset. Each example in the dataset consists of a state, action, and either a return or an advantage, referred to as Z. The network then predicts the action based on the state and Z. The learned model is able to generalize to policies for larger returns. During training, the target value is sampled from a distribution that gradually increases so that it continues to learn higher rewards.

During evaluation, they then feed in the state and a high target value of Z (set one standard deviation above the average in their paper.) This enables them to achieve solid - but not state of the art - performance on a variety of the OpenAI Gym benchmark tasks. They also run ablation studies showing, among other things, that the policy is indeed accurate in achieving the target reward it aims for.

Nicholas's opinion: One of the dangers of training powerful AI to maximize a reward function is that optimizing the function to extreme values may no longer correlate with what we want, as in the classic paperclip maximizer example. I think RCP provides an interesting solution to that problem; if we can instead specify a good, but reasonable, value, we may be able to avoid those extreme cases. We can then gradually increase the desired reward without retraining while continuously monitoring for issues. I think there are likely flaws in the above scheme, but I am optimistic in general about the potential of finding alternate ways to communicate goals to an agent.

One piece I am still curious about is whether the policy remembers how to achieve lower rewards as its training dataset updates towards higher rewards. They show in a heatmap that the target and actual rewards do match up well, but the target rewards are all sampled quite near each other; it would be interesting to see how well the final policy generalizes to the entire spectrum of target rewards.

Reinforcement Learning Upside Down: Don't Predict Rewards -- Just Map Them to Actions and Training Agents using Upside-Down Reinforcement Learning (Juergen Schmidhuber) (summarized by Zach): It's a common understanding that using supervised learning to solve RL problems is challenging because supervised learning works directly with error signals while RL only has access to evaluation signals. The approach in these papers introduce 'upside-down' reinforcement learning (UDRL) as a way to bridge this gap. Instead of learning how to predict rewards, UDRL learns how to take actions when given a state and a desired reward. Then, to get good behavior, we simply ask the policy to take actions that lead to particularly high rewards. The main approach is to slowly increase the desired goal behavior as the agent learns in order to maximize agent performance. The authors evaluate UDRL on the Lunar Lander and the Take Cover environments. UDRL ultimately performs worse on Lunar Lander and better on Take Cover so it's unclear whether or not UDRL is an improvement over popular methods. However, when rewards are made to be sparse UDRL is able to significantly outperform other RL methods.

Zach's opinion: This approach fits neatly with older work including "Learning to Reach Goals” and more recent work such as Hindsight experience replay and Goal-Conditioned Policies. In particular, all of these methods seem to be effective at addressing the difficulty that comes with working with sparse rewards. I also found myself justifying the utility of selecting the objective of 'learning to achieve general goals' to be related to the idea that seeking power is instrumentally convergent (AN #78).

Rohin's opinion: Both this and the previous paper have explored the idea of conditioning on rewards and predicting actions, trained by supervised learning. While this doesn't hit state-of-the-art performance, it works reasonably well for a new approach.

Planning with Goal-Conditioned Policies (Soroush Nasiriany, Vitchyr H. Pong et al) (summarized by Zach): Reinforcement learning can learn complex skills by interacting with the environment. However, temporally extended or long-range decision-making problems require more than just well-honed reactions. In this paper, the authors investigate whether or not they can obtain the benefits of action planning found in model-based RL without the need to model the environment at the lowest level. The authors propose a model-free planning framework that learns low-level goal-conditioned policies that use their value functions as implicit models. Goal-conditioned policies are policies that can be trained to reach a goal state provided as an additional input. Given a goal-conditioned policy, the agent can then plan over intermediate subgoals (goal states) using a goal-conditioned value function to estimate reachability. Since the state space is large, the authors propose what they call latent embeddings for abstracted planning (LEAP), which is able to find useful subgoals by first searching a much smaller latent representation space and then planning a sequence of reachable subgoals that reaches the target state. In experiments, LEAP significantly outperforms prior algorithms on 2D navigation and push/reach tasks. Moreover, their method can get a quadruped ant to navigate around walls which is difficult because much of the planning happens in configuration space. This shows that LEAP is able to be extended to non-visual domains.

Zach's opinion: The presentation of the paper is clear. In particular, the idea of planning a sequence of maximally feasible subgoals seems particularly intuitive. In general, I think that LEAP relies on the clever idea of reusing trajectory data to augment the data-set for the goal-conditioned policy. As the authors noted, the question of exploration was mostly neglected. I wonder how well the idea of reusing trajectory data generalizes to the general exploration problem.

Rohin's opinion: The general goal of inferring hierarchy and using this to plan more efficiently seems very compelling but hard to do well; this is the goal in most hierarchical RL algorithms and Learning Latent Plans from Play (AN #65).

Dream to Control: Learning Behaviors by Latent Imagination (Danijar Hafner et al) (summarized by Cody): In the past year or so, the idea of learning a transition model in a latent space has gained traction, motivated by the hope that such an approach could combine the best of the worlds of model-free and model-based learning. The central appeal of learning a latent transition model is that it allows you to imagine future trajectories in a potentially high-dimensional, structured observation space without actually having to generate those high-dimensional observations.

Dreamer builds on a prior model by the same authors, PlaNet (AN #33), which learned a latent representation of the observations, p(s|o), trained both through a VAE-style observation reconstruction loss, and also a transition model q(s-next|s, a), which is trained to predict the state at the next step given only the state at the prior one, with no next-step observation data. Together, these two models allow you to simulate action-conditioned trajectories through latent state space. If you then predict reward from state, you can use this to simulate the value of trajectories. Dreamer extends on this by also training an Actor Critic-style model on top of states to predict action and value, forcing the state representation to not only capture next-step transition information, but also information relevant to predicting future rewards. The authors claim this extension makes their model more able to solve long-horizon problems, because the predicted value function can capture far-future rewards without needing to simulate the entire way there. Empirically, there seems to be reasonable evidence that this claim plays out, at least within the fairly simple environments the model is tested in.

Cody's opinion: The extension from PlaNet (adding actor-critic rather than direct single-step reward prediction) is relatively straightforward, but I think latent models are an interesting area - especially if they eventually become at all possible to interpret - and so I'm happy to see more work in this area.

New Comment
2 comments, sorted by Click to highlight new comments since:
However, this paper wants the answers to actually be correct. Thus, they claim that for sufficiently complicated questions, since the debate can't reach the right answer, the debate isn't truth-seeking -- but in these cases, the answer is still in expectation more accurate than the answer the judge would come up with by themselves.

Truth-seeking: better than the answer the judge would have come up with by themself (how does this work? making an observation at random instead of choosing the observation that's recommended by the debate?)

Truth-finding: the truth is found.

how does this work?

You have a prior; you choose to do the experiment with highest VOI to get a posterior, and then you choose the best answer given that posterior. I'm pretty sure I could calculate this for many of their scenarios.