We (Adam Shimi, Joe Collman & myself) are trying to emulate peer review feedback for Alignment Forum posts. This is the second review in the series. The first’s introduction sums up our motivation and approach rather well, so we will not duplicate it here.
Instead, let’s dive into today’s reviewed work: Learning Normativity: A Research Agenda by Abram Demski. We’ll follow the same structure as before: summarize the work, locate its hypotheses, and examine its relevance to the field.
This post was written by Jérémy; as such, his perspective will likely bias its content, even if both Adam and Joe approve of it.
Pointing at norms instead of a specific set of values has several interesting features, especially in how it handles the uncertainty of feedback from humans. Norms are reflected imperfectly in human behavior, and approval from humans regarding norm-following is sparse and imperfect, yet norms roughly convey what a machine should do.
Abram points at language learning as a major motivating example. We don’t learn English by knowing all the rules; we don’t even know all of them, and those we can articulate don’t fit the data perfectly. Nevertheless, we can succeed, by human standards, at language acquisition.
Going beyond those standards, Abram mentions superhuman performance in language based on various properties of texts, with the implicit statement (if we interpret the post correctly) that this performance stems from a better adherence to the underlying norms of language.
This example shows that learning should be possible even in the absence of a gold standard, an ideal reference to which the output of an agent may be compared for performance. No feedback can be fully trusted; a system should eventually learn to distrust entire types of feedback, be robust to ontology shifts in feedback, and more generally be able to reinterpret any kind of feedback.
This creates a hierarchy in feedback, which is mirrored in the value specification problem: if one cannot specify values directly, one may try to specify how they are learned, and barring that one may try to learn how to learn, etc. Abram suggests there are ways to learn all relevant levels at once, generalizing over all of them. This would represent a possible approach to outer alignment.
Abram then describes process-level feedback, where the methods to obtain results are also subject to norms. This type of feedback also suffers from the hierarchy problem outlined previously (i.e. feedback about how to process feedback, piled up recursively). We would like to collapse the levels again, to generalize over all feedback without having an untouchable, possibly malignant, top level.
The suggestion, then, is a “need to specify what it means to learn all the levels” in a given setting, in order to integrate process-level feedback properly. Abram lists three extra obstacles for this approach towards inner alignment.
After a summary of the main requirements that a “learning normativity” agenda would have to address, Abram evaluates the recursive quantilizers approach against those requirements. A technical argument follows, concluding that the approach falls short of the reinterpretable feedback requirement, and that process-level feedback may not actually be achieved there.
Do the examples fit the framework?
All three core desiderata for learning normativity involve a notion of a hierarchy of levels (object-level feedback, feedback about feedback, etc.) and the problem of infinite regress. The framework also posits the existence of a correct behavior, which is where the term normativity stems from.
We find two issues with the examples used in the post:
- language, taken as a whole, does not have a single normatively correct usage, which makes it awkward to use as an example of a single task where a normative ideal exists in the abstract;
- without enough deconfusion around the terms process, feedback and learning in this framework, the recursive quantilizer approach appears too technical compared to the rest of the post.
Norms and language
Abram’s description of language learning focuses on initial acquisition, with the rough goal of being able to convey meaning to other people (the primary task of language). In that context, normativity is indeed the result of a complex negotiation between humans, and fluent speakers will indeed be able to recognize correct usage, insofar as they understand each other.
Yet, while language use is a good example of a learnable skill with no gold standard, we find that the post could be clearer on whether linguistic norms can be unified in a single objective norm, which we’ll argue against.
Rules for correct usage do not exist, not because they’re inaccessible or hard to pinpoint, but because looking for them is irrelevant to the task of language acquisition. Attempting to establish rules for language is linguistic prescription: it usually has a social or political aim; rule-following can be seen as a social signal, or as a coordination mechanism.
Each context, each rough task taken in isolation, will lack a commonly accepted ideal, with requirements often at odds with each other. Various journalistic standards require different treatment of facts; fiction norms about suspension of disbelief vary by genre, but also by reader.
It may make sense to talk about superhuman performance at syntax and grammar, but this does not extend to all tasks involving language. Creativity and compellingness in writing involve norms that are not seen as extensions of rules of grammar.
If a system is able to follow a variety of human norms in a wide range of contexts (and if we don’t move the goalposts endlessly), we might have an objective notion of superhuman performance. The point is that it would not correspond to getting closer to any single ideal use of language, but to ranking higher than humans at many tasks, each involving distinct sets of norms.
As another framing for this example, this recent DeepMind paper on language agents comes to mind for alignment-relevant uses of language where superhuman performance may be more crisply defined, while preserving the absence of a gold standard, as Abram’s framework requires.
Confusion over quantilization
The post does not attempt to define feedback or process; the phrase “you need to specify what it means to learn in this setting” outlines the operationalization gap to bridge. There is a great deal of deconfusion about learning that needs to happen before piling up layers of feedback, and we find that trying to operationalize them doesn’t fit well in the first post.
In Abram’s attempt to use recursive quantilizers, he targets a fixed point of a quantilization process, where learning is defined as “an update against a UTAA [which] produces an update against initial distributions which produce that UTAA”. It makes more sense in context, with the little drawings, though the post would benefit from more of them.
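For readers less familiar with the building block, here is a minimal sketch of a plain (non-recursive) q-quantilizer: instead of maximizing a utility estimate, it samples from the top q fraction of a base distribution. This is not the recursive construction from Abram’s post, which is considerably more involved; the names `quantilize`, `candidates` and `utility` are ours, and we assume the candidates are already samples from the base distribution.

```python
import random


def quantilize(candidates, utility, q, rng=None):
    """Pick uniformly among the top-q fraction of `candidates` ranked by `utility`.

    Assumption: `candidates` are i.i.d. samples from a trusted base
    distribution, so this only approximates an exact q-quantilizer.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    if not candidates:
        raise ValueError("need at least one candidate")
    rng = rng or random.Random()
    # Rank candidates from highest to lowest estimated utility.
    ranked = sorted(candidates, key=utility, reverse=True)
    # Keep at least one candidate even when q is very small.
    k = max(1, int(len(ranked) * q))
    return rng.choice(ranked[:k])


# Hypothetical usage: 100 candidate actions scored by a toy utility.
actions = list(range(100))
choice = quantilize(actions, utility=lambda a: a, q=0.1, rng=random.Random(0))
```

With q = 1 this reduces to sampling from the base distribution, and as q shrinks it approaches plain maximization; the intermediate regime limits how hard an imperfect utility estimate gets optimized.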
This last section on quantilizers has a significantly different tone than the rest of the post. We understand the motive behind displaying a proof-of-concept, but its audience is different, more technical. We expect that splitting the post in two would have been beneficial to readers, with the agenda post simply asserting how the recursive quantilization framework performs on the desiderata, as a commentary on their tractability.
Relevance to the field
How Learning Normativity is meant to fit into the field is described right in the introduction. Abram writes that he’s pointing at correct behavior, in a way that differs from other approaches. Does it? Does this agenda bring a new perspective for alignment research?
First, learning normativity vs. value learning. The concepts being learned are of different natures. A clear distinction between norm-building and value-targeting is explored by Rohin Shah in this post. There, norms are described as expressing what not to do, to “read and predict the responses of human normative structure”. The point about norms being the result of complex negotiations, in complex situations, is referenced as well.
Second, learning normativity vs. imitation learning. The post clearly expands the point that direct observation of humans isn’t sufficient to learn norms, no argument on the distinction here.
Third & fourth, learning normativity vs. approval-directed learning vs. rule application. In the same way, this approval (or these rules) can be viewed as a single level of feedback, whereas normativity encompasses e.g. feedback about approval and so on.
The post makes a clear argument about how the infinite regresses are unsolved problems. In this review we’ll note that while the post argues that the desiderata point at something useful, it doesn’t argue they’re safety-critical. One could dismiss the desiderata as “nice to have” without further arguments… which Abram has provided in a later post complementing this one.
Where does the post stand?
In Adam’s epistemological framing, Abram’s work fits squarely in the second category: studying what well-behaved means for AI, by providing new desiderata for aligned systems. The last quantilizer section might fit in the third category, but it’s not where most of the post’s contribution fits.
All three core desiderata for learning normativity involve a notion of multiple levels of feedback, feedback about feedback, etc. and the problem of infinite regress.
Several methods are suggested to approach this:
- to learn a mapping from the feedback humans actually give to what they really mean: that seems like a restatement of the infinite specification problem (as opposed to a solution), since no feedback can be perfectly trusted;
- collapse all the levels into one learner: maybe some insight can be applied to all levels, like Occam’s razor, but then what will give feedback on its application? This would also require that there is a finite amount of useful information that can be extracted from all the levels at once, i.e. that going up the meta-ladder doesn’t provide increasingly relevant feedback;
- one level which is capable of accepting process-level feedback about itself: that’s one way to stop the infinite regress, which would be highly dependent on its initial prior. There is no guarantee all self-examining processes will converge to the same conclusions about feedback and norms.
None of these methods are satisfying… but they don’t have to be. The document is a research agenda, detailing and motivating a particular array of open problems.
Overall, the post succeeds at outlining the various ways in which learning encounters infinite regress and in which normative statements can be stacked upon themselves, while making a neat distinction between:
- unreliability of feedback at all levels of interpretation;
- unreliability of underlying values at all levels of approximation;
- unreliability of learning processes at all levels of reflection.
We find this post to be a thought-provoking description of a difficult class of open problems related to feedback in learning. Even if “normativity” cannot be reduced to a crisp property of systems, it’s useful to enable a learning process to manipulate the meta-level(s).
The main issues with the post are: the normative sharpness of the main motivating example, which distracts from the otherwise valid concerns about uncertain feedback; the last part about quantilizers that could have belonged in a separate post; conversely, the arguments motivating the importance of the agenda that could have been included here.
We are quite hopeful this avenue of research will lead to interesting results in AI alignment.