Alignment Newsletter is a weekly publication with recent content relevant to AI alignment around the world. Find all Alignment Newsletter resources here. In particular, you can look through this spreadsheet of all summaries that have ever been in the newsletter.
Audio version here (may not be up yet).
Please note that while I work at DeepMind, this newsletter represents my personal views and not those of my employer.
Literature Review on Goal-Directedness (Adam Shimi et al) (summarized by Rohin): This post extracts five different concepts that have been identified in the literature as properties of goal-directed systems:
1. Restricted space of goals: The space of goals should not be too expansive, since otherwise goal-directedness can become vacuous (AN #35) (e.g. if we allow arbitrary functions over world-histories with no additional assumptions).
2. Explainability: A system should be described as goal-directed when doing so improves our ability to explain the system’s behavior and predict what it will do.
3. Generalization: A goal-directed system should adapt its behavior in the face of changes to its environment, such that it continues to pursue its goal.
4. Far-sighted: A goal-directed system should consider the long-term consequences of its actions.
5. Efficient: The more goal-directed a system is, the more efficiently it should achieve its goal.
The concepts of goal-directedness, optimization, and agency seem to have significant overlap, but there are differences in the ways the terms are used.
The authors then compare multiple proposals on these criteria:
1. The intentional stance says that we should model a system as goal-directed when it helps us better explain the system’s behavior, performing well on explainability and generalization. It could easily be extended to include far-sightedness as well. A more efficient system for some goal will be easier to explain via the intentional stance, so it does well on that criterion too. And not every possible function can be a goal, since many are very complicated and thus would not be better explanations of behavior. However, the biggest issue is that the intentional stance cannot be easily formalized.
2. One possible formalization of the intentional stance is to say that a system is goal-directed when we can better explain the system’s behavior as maximizing a specific utility function, relative to explaining it using an input-output mapping (see Agents and Devices: A Relative Definition of Agency (AN #22)). This also does well on all five criteria.
4. A definition based off of Kolmogorov complexity works well, though it doesn’t require far-sightedness.
Rohin's opinion: The five criteria seem pretty good to me as a description of what people mean when they say that a system is goal-directed. It is less clear to me that all five criteria are important for making the case for AI risk (which is why I care about a definition of goal-directedness); in particular it doesn’t seem to me like the explainability property is important for such an argument (see also this comment).
Note that it can still be the case that as a research strategy it is useful to search for definitions that satisfy these five criteria; it is just that in evaluating which definition to use I would choose the one that makes the AI risk argument work best. (See also Against the Backward Approach to Goal-Directedness.)
TECHNICAL AI ALIGNMENT
Factored Cognition sequence (Rafael Harth) (summarized by Rohin): The Factored Cognition Hypothesis (AN #36) informally states that any task can be performed by recursively decomposing the task into smaller and smaller subtasks until eventually the smallest tasks can be done by a human. This sequence aims to formalize the hypothesis to the point that it can be used to argue for the outer alignment of (idealized versions of) iterated amplification (AN #40) and debate (AN #5).
The key concept is that of an explanation or decomposition. An explanation for some statement s is a list of other statements s1, s2, … sn along with the statement “(s1 and s2 and … and sn) implies s”. A debate tree is a tree in which for a given node n with statement s, the children of n form an explanation (decomposition) of s. The leaves of the tree should be statements that the human can verify. (Note that the full formalism has significantly more detail, e.g. a concept of the “difficulty” for the human to verify any given statement.)
We can then define an idealized version of debate, in which the first debater must produce an answer with associated explanation, and the second debater can choose any particular statement to expand further. The judge decides the winner based on whether they can confidently verify the final statement or not. Assuming optimal play, the correct (honest) answer is an equilibrium as long as:
Ideal Debate Factored Cognition Hypothesis: For every question, there exists a debate tree for the correct answer where every leaf can be verified by the judge.
The idealized form of iterated amplification is HCH (AN #34); the corresponding Factored Cognition Hypothesis is simply “For every question, HCH returns the correct answer”. Note that the existence of a debate tree is not enough to guarantee this, as HCH must also find the decompositions in this debate tree. If we imagine that HCH gets access to a decomposition oracle that tells it the right decomposition to make at each node, then HCH would be similar to idealized debate. (HCH could of course simply try all possible decompositions, but we are ignoring that possibility: the decompositions that we rely on should reduce or hide complexity.)
Is the HCH version of the Factored Cognition Hypothesis true? The author tends to lean against (more specifically, that HCH would not be superintelligent), because it seems hard for HCH to find good decompositions. In particular, humans seem to improve their decompositions over time as they learn more, and also seem to improve the concepts by which they think over time, all of which are challenging for HCH to do. On the other hand, the author is cautiously optimistic about debate.
Rohin's opinion: I enjoyed this sequence: I’m glad to see more analysis of what is and isn’t necessary for iterated amplification and debate to work, as well as more theoretical models of debate. I broadly agreed with the conceptual points made, with one exception: I’m not convinced that we should not allow brute force for HCH, and for similar reasons I don’t find the arguments that HCH won’t be superintelligent convincing. In particular, the hope with iterated amplification is to approximate a truly massive tree of humans, perhaps a tree containing around 2^100 (about 1e30) base agents / humans. At that scale (or even at just a measly billion (1e9) humans), I don’t expect the reasoning to look anything like what an individual human does, and approaches that are more like “brute force” seem a lot more feasible.
One might wonder why I think it is possible to approximate a tree with more base agents than there are grains of sand in the Sahara desert. Well, a perfect binary tree of depth 99 would have 1e30 nodes; thus we can roughly say that we’re approximating 99-depth-limited HCH. If we had perfect distillation, this would take 99 rounds of iterated amplification and distillation, which seems quite reasonable. Of course, we don’t have perfect distillation, but I expect that to be a relatively small constant factor on top (say 100x), which still seems pretty reasonable. (There’s more detail about how we get this implicit exponential-time computation in this post (AN #36).)
Defining capability and alignment in gradient descent (Edouard Harris) (summarized by Rohin): Consider a neural network like GPT-3 trained by gradient descent on (say) the cross-entropy loss function. This loss function forms the base objective that the process is optimizing for. Gradient descent typically ends up at some local minimum, global minimum, or saddle point of this base objective.
However, if we look at the gradient descent equation, θ = θ - αG, where G is the gradient, we can see that this is effectively minimizing the size of the gradients. We can think of this as the mesa objective: the gradient descent process (with an appropriate learning rate decay schedule) will eventually get G down to zero, its minimum possible value (even though it may not be at the global minimum for the base objective).
The author then proposes defining capability of an optimizer based on how well it decreases its loss function in the limit of infinite training. Meanwhile, given a base optimizer and mesa optimizer, alignment is given by the capability of the base optimizer divided by the capability of the mesa optimizer. (Since the mesa optimizer is the one that actually acts, this is effectively measuring how much progress on the mesa objective also causes progress on the true base objective.)
This has all so far assumed a fixed training setup (such as a fixed dataset and network architecture). Ideally, we would also want to talk about robustness and generalization. For this, the author introduces the notion of a “perturbation” to the training setup, and then defines [capability / alignment] [robustness / generalization] based on whether the optimization stays approximately the same when the training setup is perturbed.
It should be noted that these are all definitions about the behavior of optimizers in the infinite limit. We may also want stronger guarantees that talk about the behavior on the way to the infinite limit.
LEARNING HUMAN INTENT
Imitating Interactive Intelligence (Interactive Agents Group et al) (summarized by Rohin): While existing (AN #11) work (AN #103) has trained agents to follow natural language instructions, it may be the case that achieving AGI requires more interactivity: perhaps we need to train agents to both give and follow instructions, or engage in a full dialogue, to accomplish tasks in a 3-D embodied environment. This paper makes progress on this goal.
The authors introduce a 3-D room environment in which agents can interact with objects and move them around, leading to a combinatorial space of possible high-level actions. So far the authors have only worked on question-answering (e.g. “what is the color of the chair?”) and instruction-following (e.g. “please lift up the purple object”), but they hope to eventually also work on dialogue and play.
They collect demonstrations of games between humans in which one human is given a goal, and then is asked to give a natural language instruction. The other human sees this instruction and must then execute it in the environment. The authors then use various kinds of imitation learning algorithms to learn a policy that can both set instructions and execute them. They also train models that can evaluate whether a particular trajectory successfully completes the goal or not.
The authors show that the learned policies are capable of some generalization -- for example, if during training they remove all rooms containing orange ducks (but don’t remove other orange objects, or other colors of duck), the resulting policies are still able to handle rooms containing orange ducks.
Evaluating the Robustness of Collaborative Agents (Paul Knott et al) (summarized by Rohin): Assuming a well-specified reward function, we would like to evaluate robustness of an agent by looking at the average reward it obtains on a wide scenario of plausible test time inputs that it might get. However, the key challenge of robustness is that it is hard to specify the test distribution in advance, and we must work with the training distribution instead.
This paper (on which I am an author) proposes measuring robustness using a suite of hand-designed unit tests. Just as a function is tested by having the programmer write down potential edge cases and checking for the expected behavior, AI developers can come up with a set of potential “edge case” situations (especially ones not likely to arise during training) and check whether the agent’s behavior on these situations works well or not. Intuitively, since these unit tests are created separately from the training process, they may not have the same spurious correlations that could be present in the training data. Thus, they can serve as an evaluation of the robustness of the agent.
For example, one technique is to start each episode from a state sampled randomly from a dataset of human-human gameplay, so that the agents learn how to handle a broader diversity of states. This technique decreases the average validation reward, and if that’s all we look at, we would conclude that it did not work. However, the technique also increases performance on the unit test suite, suggesting that in reality the technique does increase robustness, though it comes at the cost of reduced performance when playing with the particular set of partners that make up the validation distribution.
Bridging the Gap: The Case for an ‘Incompletely Theorized Agreement’ on AI Policy. (Charlotte Stix et al) (summarized by Rohin): Like several (AN #90) past (AN #44) papers (AN #105), this paper argues that the differences between the “near-term” and “long-term” communities are probably exaggerated. Collaboration between these communities would be particularly beneficial, since it could prevent the field of AI policy from becoming fragmented and ineffectual, which is especially important now while the field is nascent and there is political will for AI policy progress.
The authors propose the notion of an “incompletely theorized agreement” in order to foster this sort of collaboration. In an incompletely theorized agreement, the parties agree to suspend disagreement on some thorny theoretical question, in order to coordinate action towards a shared pragmatic purpose. Such agreements could be used to set aside relatively unimportant disagreements between the two communities, in favor of pursuing goals that both communities care about. For example, we could imagine that such an agreement would allow both communities to push for more and better reflection by AI researchers on the impacts of the systems that they build, or to enable action that ensures we preserve the integrity of public discourse and informed decision-making (e.g. by regulating AI-enabled disinformation).
Rohin's opinion: I’m certainly on board with the goal of working together towards shared goals. That being said, I don't fully understand what's being proposed here: how exactly is an incompletely theorized agreement supposed to be made? Is this more of a “shared ethos” that gets spread by word of mouth, or is there a document that people sign on to? If there is a document, what goes into it, who would agree to it, and how binding is it? I’d be excited to see more work fleshing out these concrete details, or even better, actually causing such an agreement to exist in practice.
I'm always happy to hear feedback; you can send it to me, Rohin Shah, by replying to this email.
An audio podcast version of the Alignment Newsletter is available. This podcast is an audio version of the newsletter, recorded by Robert Miles.