(Related to Inaccessible Information, Learning the Prior, and Better Priors as a Safety Problem. Builds on several of my alternate alignment ideas.)

 

I want to talk about something which I'll call learning normativity. What is normativity? Roughly, correct behavior: something related to the fuzzy concept humans convey with the word "should". I think normativity has several interesting features:

  • Norms are the result of a complex negotiation between humans, so they shouldn't necessarily be thought of as the result of maximizing some set of values. This distinguishes learning normativity from value learning.
  • A lot of information about norms is present in the empirical distribution of what people actually do, but you can't learn norms just by learning human behavior. This distinguishes it from imitation learning.
  • It's often possible to provide a lot of information in the form of "good/bad" feedback. This feedback should be interpreted more like approval-directed learning than like RL. However, approval should not be treated as a gold standard.
  • Similarly, it's often possible to provide a lot of information in the form of rules, but rules are not necessarily 100% true; they are just very likely to apply in typical cases.
  • In general, it's possible to get very rich types of feedback, but very sparse: humans get all sorts of feedback, including not only instruction on how to act, but also how to think.
  • Any one piece of feedback is suspect. Teachers can make mistakes, instructions can be wrong, demonstrations can be imperfect, dictionaries can contain spelling errors, reward signals can be corrupt, and so on.

Example: Language Learning

 A major motivating example for me is how language learning works in humans. There is clearly, to some degree, a "right way" and a "wrong way" to use a language. I'll call this correct usage.

One notable feature of language learning is that we don't always speak, or write, in correct usage. This means that a child learning language has to distinguish between mistakes (such as typos) and correct usage. (Humans do sometimes learn to imitate mistakes, but we have a notion of not doing so. This is unlike GPT systems learning to imitate the empirical distribution of human text.)

This means we're largely doing something like unsupervised learning, but with a notion of "correct"/"incorrect" data. We're doing something like throwing data out when it's likely to be incorrect.

A related point is that we are better at recognizing correct usage than we are at generating it. If we say something wrong, we're usually able to correct it. In some sense, this means there's a foothold for intelligence amplification: we know how to generate our own training gradient.

Another fascinating feature of language is that although native speakers are pretty good at both recognizing and generating correct usage, we don't know the rules explicitly. The whole field of linguistics is largely about trying to uncover the rules of grammar.

So it's impossible for us to teach proper English by teaching the rules. Yet, we do know some of the rules. Or, more accurately, we know a set of rules that usually apply. And those rules are somewhat useful for teaching English. (Although children have usually reached fluency before the point where they're taught explicit English grammar.)

All of these things point toward what I mean by learning normativity:

  • We can tell a lot about what's normative by simply observing what's common, but the two are not exactly the same thing.
  • A (qualified) human can usually label an example as correct or incorrect, but this is not perfect either.
  • We can articulate a lot about correct vs incorrect in the form of rules; but the rules which we can articulate never seem to cover 100% of the cases. A linguist is a lot like a philosopher: taking a concept which is understood at an intuitive level (which a great many people can fluently apply in the correct manner), but struggling for years to arrive at a correct technical definition which fits the intuitive usage.

In other words, the overriding feature of normativity which I'm trying to point at is that nothing is ever 100%. Correct grammar is not defined by any (known) rules or set of text, nor is it (quite) just whatever humans judge it is. All of those things give a lot of information about it, but it could differ from each of them. Yet, on top of all that, basically everyone learns it successfully. This is very close to Paul's Inaccessible Information: information for which we cannot concoct a gold-standard training signal, but which intelligent systems may learn anyway.

Another important feature of this type of learning: there is a fairly clear notion of superhuman performance. Even though human imitation is most of the challenge, we could declare something superhuman based on our human understanding of the task. For example, GPT is trained exclusively to imitate, so it should never exceed human performance. Yet, we could tell if a GPT-like system did exceed human performance:

  • Its spelling and grammar would be immaculate, rather than including humanlike errors;
  • its output would be more creative and exciting to read than that of human authors;
  • when good reasoning was called for in a text, its arguments would be clear, correct, and compelling;
  • when truth was called for, rather than fiction, its general knowledge would be broader and more accurate than a human's.

It seems very possible to learn to be better than your teachers in these ways, because humans sometimes manage to do it.

Learning in the Absence of a Gold Standard

In statistics and machine learning, a "gold standard" is a proxy which we treat as good enough to serve as ground truth for our limited purposes. The accuracy of any other estimate will be judged by comparison to the gold standard. This is similar to the concept of "operationalization" in science.

It's worth pointing out that in pure Bayesian terms, there is nothing especially concerning about learning in the absence of a gold standard. I have data X. I want to know about Y. I update on X, getting P(Y|X). No problem!

However, that only works if we have the right prior. We could try to learn the prior from humans, which gets us 99% of the way there... but as I've mentioned earlier, human imitation does not get us all the way. Humans don't perfectly endorse their own reactions.

(Note that whether "99% of the way" is good enough for AI safety is a separate question. I'm trying to define the Big Hairy Audacious Goal of learning normativity.)

Actually, I want to split "no gold standard" into two separate problems.

  1. There's no type of feedback which we can perfectly trust. If humans label examples of good/bad behavior, a few of those labels are going to be wrong. If humans provide example inferences for learning the prior, some of those example inferences are (in a very real sense) wrong. And so on.
  2. There's no level at which we can perfectly define the loss function. This is a consequence of no-perfect-feedback, but it's worth pointing out separately.

No Perfect Feedback

I think I've made the concept of no-perfect-feedback clear enough already. But what could it mean to learn under this condition, in a machine-learning sense?

There are some ideas that get part of the way:

  • Jeffrey updates let us update to a specific probability of a given piece of feedback being true, rather than updating to 100%. This allows us to, EG, label an image as 90%-probable cat, 9%-probable dog, 1% broad distribution over other things.
    • This allows us to give some evidence, while allowing the learner to decide later that what we said was wrong (due to the accumulation of contrary evidence).
    • This seems helpful, but we need to be confident that those probability assignments are themselves normatively correct, and this seems like it's going to be a pretty big problem in practice.
  • Virtual evidence is one step better: we don't have to indicate what actual probability to update to, but instead only indicate the strength of evidence.
    • Like Jeffrey updates, this means we can provide strong evidence while still allowing the system to decide later that we were wrong, due to the accumulation of contradicting evidence.
    • Unlike Jeffrey updates, we don't have to decide what probability we should update to, only the direction and strength of the evidence. (A small numerical sketch contrasting the two update rules follows this list.)
  • Soft labels in machine learning provide similar functionality. In EM learning, a system learns from its own soft labels. In LO-shot learning, a system leverages the fact that soft labels contain more information than hard labels, in order to learn classes with fewer than one example per class.
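To make the first two bullets concrete, here is a minimal numerical sketch (my own illustration; the function names and numbers are not from any particular source). It contrasts a Jeffrey update, which pins the posterior probability of the evidence to a chosen value, with a virtual-evidence update, which only supplies a likelihood ratio, so the posterior still depends on the prior and can later be pulled back down by contrary evidence:

```python
def jeffrey_update(p_h_given_e, p_h_given_not_e, new_p_e):
    """Jeffrey conditioning: decree that P(E) becomes new_p_e, and propagate
    that to a downstream hypothesis H via the (unchanged) conditionals."""
    return new_p_e * p_h_given_e + (1 - new_p_e) * p_h_given_not_e

def virtual_evidence_update(prior_e, likelihood_ratio):
    """Virtual evidence: only say the observation favors E over not-E by a
    factor of likelihood_ratio; the posterior P(E) then depends on the prior."""
    odds = prior_e / (1 - prior_e) * likelihood_ratio
    return odds / (1 + odds)

# A "90% cat" label as a Jeffrey update pins P(cat) at 0.9, whatever the prior was.
print(jeffrey_update(p_h_given_e=1.0, p_h_given_not_e=0.0, new_p_e=0.9))   # 0.9

# The same label as virtual evidence (a 9:1 ratio) lands at 0.9 only if the
# prior was 0.5; a more skeptical prior yields a weaker posterior.
print(virtual_evidence_update(prior_e=0.5, likelihood_ratio=9.0))          # 0.9
print(virtual_evidence_update(prior_e=0.1, likelihood_ratio=9.0))          # 0.5
```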

However, although these ideas capture weak feedback in the sense of less-than-100%-confidence feedback, they don't capture the idea of reinterpretable feedback:

  • A system should ideally be able to learn that specific types of feedback are erroneous, such as corrupted-feedback cases in reinforcement learning. A system might learn that my feedback is lower quality right before lunch, for example.
  • A system should be able to preserve the overall meaning of a label despite an ontology shift. For example, deciding that fruit/vegetable is not a useful taxonomic or culinary distinction should not destroy the information gained from such labels. Or, if human feedback includes formal English grammar, that information should not be totally discarded if the system realizes that the rules don't fully hold and the supposed grammatical categories are not as solid as claimed.
  • Feedback should be associated with a cloud of possible interpretations. When humans say "weird", we often mean "unusual", but also sometimes mean "bad". When humans say we don't understand, we often really mean we don't endorse. A system should, ideally, be able to learn a mapping from the feedback humans actually give to what they really mean. This is, in any case, the general solution to the previous bullet points.

But "learning a mapping from what feedback is given to what is meant" appears to imply that there is no fixed loss function for machine learning to work on, which would be a serious challenge. This is the subject of my point #2 from earlier:

No Perfect Loss Function

We can frame (some) approaches to the value specification problem in a sequence of increasingly sophisticated approaches (similar to the hierarchy I discussed in my "stable pointers to value" posts (1,2,3)):

  1. Direct specification of the value function. This fails because we don't know what values to specify, and expect anything we can write down to be highly Goodhart-able.
  2. Learning human values. We delegate the specification problem to the machine. But, this leaves us with the meta problem of specifying how to learn. Getting it wrong can lead to wireheading and human manipulation. Even in settings where this is impossible, we face Stuart's no-free-lunch results.
  3. Learning to learn human values. Stuart suggests that we can get around the no-free-lunch results by loading the right prior information into the learner, in keeping with his more general belief that Bayesian reasoning is fine as long as it has the right prior information. But this seems to go back to the problem of learning the human prior. So we could apply a learning approach again here. But then we again have a specification problem for the loss function for this learning...
  4. ...

You get the picture. We can keep pushing back the specification problem by learning, learning to learn, learning to learn to learn... Each time we push the problem back, we seem to gain something, but we're also stuck with a new specification problem at the meta level.

Could we specify a way to learn at all the levels, pushing the problem back infinitely? This might sound absurd, but I think there are ways to accomplish this. We need to somehow "collapse all the levels into one learner" -- otherwise, with an infinite number of levels to learn, there would be no hope. There needs to be very significant generalization across levels. For example, Occam's razor is a good starting rule of thumb at all levels (at least, all levels above the lowest). However, because Occam is not enough, it will need to be augmented with other information.

Recursive reward modeling is similar to the approach I'm sketching, in that it recursively breaks down the problem of specifying a loss function. However, it doesn't really take the same learning-to-learn approach, and it also doesn't aim for a monolithic learning system that is able to absorb information at all the levels.

I think of this as necessary learning-theoretic background work in order to achieve Stuart Armstrong's agenda, although Stuart may disagree. The goal here is to provide one framework in which all the information Stuart hopes to give a system can be properly integrated.

Note that this is only an approach to outer alignment. The inner alignment problem is a separate, and perhaps even more pressing, issue. The next section could be of more help to inner alignment, but I'm not sure this is overall the right path to solve that problem.

Process-Level Feedback

Sometimes we care about how we get the answers, not just what the answers are. That is to say, sometimes we can point out problems with methodology without being able to point to problems in the answers themselves. Answers can be suspect based on how they're computed.

Sometimes, points can only be effectively made in terms of this type of feedback. Wireheading and human manipulation can't be eliminated through object-level feedback, but we could point out examples of the wrong and right types of reasoning.

Process-level feedback blurs the distinction between inner alignment and outer alignment. A system which accepts process-level feedback is essentially exposing all its innards as "outer", so if we can provide the appropriate feedback, there should be no separate inner alignment problem. (Unfortunately, it must be admitted that it's quite difficult to provide the right feedback -- due to transparency issues, we can't expect to understand all models in order to give feedback on them.)

I also want to emphasize that we want to give feedback on the entire process. It's no good if we have "level 1" which is in charge of producing output, and learns from object-level feedback, but "level 2" is in charge of accepting process-level feedback about level 1, and adjusting level 1 accordingly. Then we still have a separate inner alignment problem for level 2.

This is the same kind of hierarchy problem we saw in "No Perfect Loss Function". Similarly, we want to collapse all the levels down. We want one level which is capable of accepting process-level feedback about itself.

Learning from Process-Level Feedback

In a Bayesian treatment, process-level feedback means direct feedback about hypotheses. In theory, there's no barrier to this type of feedback. A hypothesis can be ruled out by fiat just as easily as it can be ruled out by contradicting data. 

However, this isn't a very powerful learning mechanism. If we imagine a human trying to inner-align a Bayesian system this way, the human has to find and knock out every single malign hypothesis. There's no generalization mechanism here.

Since detecting malign hypotheses is difficult, we want the learning system to help us out here. It should generalize from examples of malign hypotheses, and attempt to draw a broad boundary around malignancy. Allowing the system to judge itself in this way can of course lead to malign reinterpretations of user feedback, but hopefully allows for a basin of attraction in which benevolent generalizations can be learned.

For example, in Solomonoff induction, we have a powerful hierarchical prior in the distribution on program prefixes. A program prefix can represent any kind of distribution on hypotheses (since a program prefix can completely change the programming language to be used in the remainder of the program). So one would hope that knocking out hypotheses would reduce the probability of all other programs which share a prefix with that hypothesis, representing a generalization "this branch in my hierarchical prior on programs seems iffy". (As a stretch goal, we'd also like to update against other similar-looking branches; but we at least want to update against this one.)

However, no such update occurs. The branch loses mass, due to losing one member, but programs which share a prefix with the deleted program don't lose any mass. In fact, they gain mass, due to renormalization.
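A toy calculation makes this renormalization point concrete. The bit-string "programs", the 2^-length weighting, and the helper function below are my own illustrative simplifications, not Solomonoff induction proper:

```python
# Eight three-bit "programs" under a 2^-length prior (each gets weight 1/8).
programs = ["000", "001", "010", "011", "100", "101", "110", "111"]
prior = {p: 2 ** -len(p) for p in programs}

def branch_mass(dist, prefix):
    """Total (normalized) probability of programs sharing the given prefix."""
    z = sum(dist.values())
    return sum(w for p, w in dist.items() if p.startswith(prefix)) / z

print(branch_mass(prior, "00"))      # 0.25 before any feedback

# Process-level feedback: rule out "001" by fiat, then renormalize.
posterior = {p: (0.0 if p == "001" else w) for p, w in prior.items()}

print(branch_mass(posterior, "00"))  # ~0.143: the branch shrank only by losing one member...
print(branch_mass(posterior, "000"),
      branch_mass(prior, "000"))     # ...and the sibling "000" went from 0.125 up to ~0.143.
```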

It seems we don't just want to update on "not this hypothesis"; we want to explicitly model some sort of malignancy judgement (or more generally, a quality-of-hypothesis judgement), so that we can update estimations of how to make such judgements. However, it's difficult to see how to do so without creating a hierarchy, where we get a top level which isn't open to process-level feedback (and may therefore be malign).

Later, I'll present a Bayesian model which does have a version of generalization from feedback on hypotheses. But we should also be open to less-Bayesian solutions; it's possible this just isn't captured very well by Bayesian learning.

Prospects for Inner Alignment

I view this more as a preliminary step in one possible approach to inner alignment, rather than "a solution to inner alignment".

If (a) you want to learn a solution to inner alignment, rather than solving it ahead of time, and (b) you agree with the framing of process-level feedback / feedback on hypotheses, and (c) you agree that we can't rely on a trusted meta-level to take process-level feedback, but rather need to accept feedback on "the whole process", then I think it stands to reason that you need to specify what it means to learn in this setting. I view the preceding sections as an argument that there's a non-obvious problem here.

For example, Stuart Armstrong has repeatedly argued that Bayesian learners can overcome many safety problems, if only they're given the right prior information. To the extent that this is a claim about inner alignment (I'm not sure whether he would go that far), I'm claiming that we need to solve the problem of giving process-level feedback to a Bayesian learner before he can make good on his claim; otherwise, there's just no known mechanism to provide the system with all the necessary information.

Anyway, even if we accomplish this step, there are still several other obstacles in the way of this approach to inner alignment.

  1. Transparency: It's unrealistic that humans can provide the needed process-level feedback without powerful transparency tools. The system needs to correctly generalize from simpler examples humans provide to the more difficult examples which a human can't understand. That will be difficult if humans can only label very very simple examples.
  2. Basin of Attraction: Because the system could use malign interpretations of human feedback, it's very important that the system start out in a benign state, making trusted (if simplistic) generalizations of the feedback humans can provide.
  3. Running Untrusted Code: A straightforward implementation of these ideas will still have to run untrusted hypotheses in order to evaluate them. Giving malign hypotheses really low probability doesn't help if we still run really low-probability hypotheses, and the malign hypotheses can find an exploit. This is similar to Vanessa's problem of non-Cartesian daemons.

Regardless of these issues, I think it's valuable to try to solve the part of the problem I've outlined in this essay, in the hope that the above issues can also be solved.

Summary of Desiderata

Here's a summary of all the concrete points I've made about what "learning normativity" should mean. Sub-points are not subgoals, but rather, additional related desiderata; EG, one might significantly address "no perfect feedback" without significantly addressing "uncertain feedback" or "interpretable feedback". 

  1. No Perfect Feedback: we want to be able to learn with the possibility that any one piece of data is corrupt.
    1. Uncertain Feedback: data can be given in an uncertain form, allowing 100% certain feedback to be given (if there ever is such a thing), but also allowing the system to learn significant things in the absence of any certainty.
    2. Reinterpretable Feedback: ideally, we want rich hypotheses about the meaning of feedback, which help the system to identify corrupt feedback, and interpret the information in imperfect feedback.
  2. No Perfect Loss Function: we don't expect to perfectly define the utility function, or what it means to correctly learn the utility function, or what it means to learn to learn, and so on. At no level do we expect to be able to provide a single function we're happy to optimize.
    1. Learning at All Levels: Although we don't have perfect information at any level, we do get meaningful benefit with each level we step back and say "we're learning this level rather than keeping it fixed", because we can provide meaningful approximate loss functions at each level, and meaningful feedback for learning at each level. Therefore, we want to be able to do learning at each level.
    2. Between-Level Sharing: Because this implies an infinite hierarchy of levels to learn, we need to share a great deal of information between levels in order to learn meaningfully.
  3. Process Level Feedback: we want to be able to give feedback about how to arrive at answers, not just the answers themselves.
    1. Whole-Process Feedback: we don't want some segregated meta-level which accepts/implements our process feedback about the rest of the system, but which is immune to process feedback itself. Any part of the system which is capable of adapting its behavior, we want to be able to give process-level feedback about.
    2. Learned Generalization of Process Feedback: we don't just want to promote or demote specific hypotheses. We want the system to learn from our feedback, making generalizations about which kinds of hypotheses are good or bad.

Initial Attempt: Recursive Quantilizers

I'll give an initial stab at solving these problems, as a proof-of-concept. (Otherwise I fear the above desiderata may look like they're simply impossible.)

This is a formalization of the recursive quantilizers idea which I described previously.

A quantilizer is a mild optimizer which avoids catastrophic outcomes with high probability, averting Goodhart's Law. It accomplishes this by refusing to 100% trust its value function. This seems like a good building block for us, since it significantly addresses "no perfect loss function."

A quantilizer requires a value function, V, which it mildly optimizes, and a safe distribution, S, which is a distribution over outputs assumed to have a low probability of catastrophic outcomes. It also requires an optimization parameter, q. The quantilizer Q(V, S, q) mildly optimizes by sampling randomly from the top q% of outputs from S, as ranked by V. (q can be derived from an estimate of the probability of catastrophe in S, combined with a level of tolerance for catastrophic risk.)
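As a concrete (and heavily simplified) sketch of this step, here the outputs are just numbers, the safe distribution is anything we can sample from, and the top-q cutoff is approximated empirically; q is treated as a fraction, and all names are my own illustrative choices:

```python
import random

def quantilize(value_fn, safe_sampler, q, n_samples=10_000):
    """Draw candidates from the safe distribution, keep the top q fraction as
    ranked by the value function, and pick one of those uniformly at random."""
    candidates = sorted((safe_sampler() for _ in range(n_samples)),
                        key=value_fn, reverse=True)
    return random.choice(candidates[: max(1, int(q * n_samples))])

# Mildly optimize a score over a "safe" standard-normal proposal, with q = 0.1:
# sample from the top 10% rather than taking the argmax.
action = quantilize(value_fn=lambda x: -(x - 2.0) ** 2,   # prefers outputs near 2
                    safe_sampler=lambda: random.gauss(0, 1),
                    q=0.1)
```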

In line with our learning-at-all-levels desideratum, we want to learn V, S, and q. Moreover, we'll produce all of these via quantilization, and recursively learn the inputs to all of those quantilizers as well.

In order to get between-level sharing, we will actually learn one big quantilization process which is equivalent to all those infinitely many levels.

Here's the setup.

In order to unify all the levels, I'll have quantilizers select objects of a unified type for all arguments (UTAAs): a tuple (V, S, q, o) where o is the object-level output we want out of the whole system (EG plans, strings of text, whatever). So now we understand the quantilization function Q to take 3 UTAAs. It will take the value function from the first, the safe distribution from the second, and the optimization parameter from the third. It then produces a UTAA. The user can take the object-level output of that UTAA as the result.*

 

What we want is a notion of stationary distribution on UTAAs. This is a distribution D such that if we sample 3 elements from D, and feed those three elements into Q, the distribution on outputs of Q is just D again.

In effect, a stationary distribution D is all quantilized out; we can quantilize again, but it won't make any difference. So, we can just sample from D and take the object-level output from the resulting UTAA.

This is very similar to the notion of a stationary distribution for Markov chains. The difference is that we have a Markov tree instead of a chain.

So, just as finding the stationary distribution of a Markov-chain Monte Carlo algorithm is similar to running the algorithm long enough that running it any longer is pointless, finding a stationary distribution for recursive quantilizers is like building a learning-to-learn hierarchy that's large enough that we gain nothing by going any further.
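Here is a very schematic sketch of that fixed-point search, under my own simplifications: the distribution on UTAAs is approximated by a finite population, the safe distribution inside a UTAA is just a sampler of UTAAs, and "stationary" is approximated by iterating until further levels stop mattering. This is only meant to show the shape of the computation, not the formal construction:

```python
import random
from dataclasses import dataclass
from typing import Callable

@dataclass
class UTAA:
    value_fn: Callable[["UTAA"], float]   # V: how to score UTAAs
    safe_sample: Callable[[], "UTAA"]     # S: a sampler of "reasonable" UTAAs
    q: float                              # optimization parameter
    output: str                           # o: the object-level output

def quantilize(value_fn, safe_sample, q, n=200):
    """Draw n candidate UTAAs from the safe sampler, keep the top q fraction
    as ranked by the value function, and return one of them at random."""
    cands = sorted((safe_sample() for _ in range(n)), key=value_fn, reverse=True)
    return random.choice(cands[: max(1, int(q * n))])

def step(population):
    """One layer of the tree: draw three UTAAs, take V from the first,
    S from the second, q from the third, and quantilize a fresh UTAA."""
    return [quantilize(a.value_fn, b.safe_sample, c.q)
            for a, b, c in (random.choices(population, k=3)
                            for _ in range(len(population)))]

def approx_stationary(population, iterations=50):
    """Iterate until (hopefully) adding more layers changes little; sampling
    from the result and reading off .output gives the system's answer."""
    for _ in range(iterations):
        population = step(population)
    return population
```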

That's all well and good, but how are we supposed to find a stationary distribution we like? We can't just take a fixed point and hope it's useful and benign; there'll be lots of crazy fixed points. How do we steer this thing toward desirable outcomes?

Parameterizing Stationary Distributions

If a markov chain has multiple stationary distributions, we can parameterize them through a distribution on starting states. A distribution on starting states just means a probability of picking any one starting element, so this relationship is completely linear: by interpolating between different starting elements, we interpolate between the stationary distributions which those starting elements eventually reach.

We can similarly parameterize stationary distributions via initial distributions. However, we don't get the same linearity. Because we have to select many starting elements for the 3^n leaf inputs to an n-level tree, and we select those elements as independent draws from the initial distribution, we can get nonlinear effects. (This is just like flipping a biased coin (with sides labelled 1 and 0) twice and sending the two results through an XOR gate: the probability of getting a 1 out of the XOR is nonlinear in the bias.)
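Concretely: if the coin shows 1 with probability p, the XOR of two independent flips is 1 with probability 2p(1 − p), which is quadratic rather than linear in p.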

This means we can't reduce our uncertainty over initial distributions to uncertainty over initial UTAA. (There may be some other tricks we can use to simplify things, but they probably aren't worth exploring in this post.)

So we can parameterize our uncertainty over stationary distributions via uncertainty over initial distributions. But, this is just turning uncertainty over one kind of distribution into uncertainty over another. What's the benefit of this?

  1. The set of stationary distributions is hard to know, but the set of possible initial distributions is clear. So this gives us an easy-to-work-with representation of stationary distributions.
  2. We know every stationary distribution is in the set, since we can start out in a stationary distribution.
  3. We can easily define the mapping from initial distributions to stationary distributions; it's just the stationary distribution you get by running things long enough, sampling from the given initial distribution. (Of course we may not get to any stationary distribution at all, but we can formally solve this by introducing a cutoff in program size, or through other devices.)
  4. We can therefore define learning: an update against a UTAA produces an update against initial distributions which produce that UTAA.

This is, of course, a very computationally intensive procedure. Unless better algorithms are found, the only way we can update is by producing a large quantilization tree (which we hope has converged) and running it many times to evaluate the outcome of a given initial distribution.
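As a rough sketch of this update (continuing the simplified code above, and so reusing approx_stationary from that sketch): downweight each candidate initial distribution in proportion to how much stationary mass it puts on UTAAs the feedback rejects, then renormalize. The predicate-based form of feedback here is my own simplification of the update just described:

```python
import random  # approx_stationary is assumed from the earlier sketch

def fraction_rejected(initial_population, rejected, iterations=50, samples=500):
    """Estimate how often the (approximate) stationary distribution reached
    from this initial population emits a UTAA that the feedback rejects."""
    pop = approx_stationary(initial_population, iterations)
    return sum(rejected(u) for u in random.choices(pop, k=samples)) / samples

def update_prior(candidates, weights, rejected):
    """candidates: list of initial populations; weights: prior over them.
    Feedback against some UTAAs becomes an update against the initial
    distributions that (eventually) produce such UTAAs."""
    new_w = [w * (1.0 - fraction_rejected(c, rejected))
             for c, w in zip(candidates, weights)]
    z = sum(new_w)
    return [w / z for w in new_w]
```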

However, the resulting system has many marvelous properties. If we want to give feedback at any level in the hierarchy, we can convert this into feedback about UTAAs, and update our prior over initial distributions accordingly. For example:

  • We can label outputs as bad/incorrect by updating against all UTAAs which include those outputs.
  • We can give evidence about the value function over outputs, and convert this to an update about UTAAs based on the value function they contain. So, we can do value learning.
  • We can learn about the safe distribution over outputs. For example, one proposal for finding safe distributions is to model human behavior. Data-sets of human behavior could induce updates over UTAAs by checking how well a UTAA's proposed safe distribution fits the data.
  • At the same time, we can learn about the loss function by which we score safe distributions. If we have an update about this loss function, we translate it to an update about UTAAs by checking how a UTAA's value function examines the safe distribution of another UTAA when scoring it. Updating UTAAs based on this will, effectively, change the way safe distributions are selected in the second-to-last quantilization step. (Of course, it really changes all the quantilization steps, but when we anchor ourselves in how changes to the initial distribution alter our distribution on actual outputs, the easiest way to understand what's going on is to see this as a change to the second-to-last step.)
  • Similarly, we can learn about the loss function by which we score loss functions. So in the same system, we can directly learn from feedback, we can do value learning, and we can do meta-value-learning where we learn how to interpret evidence in value-learning.
  • Similarly, we can learn the safe distribution for meta-loss functions, the safe distribution over safe distributions, and on and on.
  • We can also allow process-level feedback by enabling UTAA value functions to examine the source code of other UTAAs (e.g. looking at how those UTAAs compute their value functions and safe distributions). We can teach UTAAs to detect suspicious code in other UTAAs and rate those UTAAs very poorly.

Wouldn't it be fascinating to be able to provide all those types of learning in one system?

Analysis in terms of the Criteria

Let's examine how we did in terms of the criteria which I gave.

  1. No Perfect Feedback: This wasn't addressed directly, but might be indirectly addressed via #2.
    1. Uncertain Feedback: I didn't specify any way to provide uncertain feedback, but it would be easy enough to do so.
    2. Reinterpretable Feedback: I think this is a big failing of the approach as it stands.
  2. No Perfect Loss Function: Very significantly addressed by quantilization.
    1. Learning at All Levels: Very significantly addressed by the recursive quantilization setup.
    2. Between-Level Sharing: Significantly addressed. I didn't really talk about how this works, but I think it can work well in this setup.
  3. Process Level Feedback: Significantly addressed. The process which creates a given output is essentially the big tree that we sample. We can give any kind of feedback about that tree that we want, including any computations which occur inside of the value functions or safe distributions or elsewhere.
    1. Whole-Process Feedback: Somewhat addressed. There is a question of whether the initial distribution constitutes a meta-level beyond the reach of process-level feedback.
    2. Learned Generalization of Process Feedback: Significantly addressed. Process-level feedback can be given directly, as evidence against a specific UTAA, in which case there will be some generalization as we update against anything which thought that UTAA was a good idea. Or it could be given more indirectly, as general (level-independent) information about how value functions should judge UTAAs. In that case there may be more generalization, as we update on how to judge UTAAs generally. (Or maybe not? An equivalence theorem about these different types of feedback would be nice.)

I think the most significant problem here is the lack of reinterpretable feedback. When we give feedback about something, we have to figure out how to translate it into an update about UTAAs (which can then be translated into an update about initial distributions). This update is fixed forever. This means the updates we make to the system aren't really tied to the value functions which get learned. So, for example, learning better value-learning behavior doesn't directly change how the system responds to updates we give it about the value function. (Instead, it may change how it interprets some other set of data we give it access to, as input to UTAAs.) This makes the "learning-to-learn" aspect of the system somewhat limited/shallow.

The second most significant concern here is whether we've really achieved whole-process feedback. I was initially optimistic, as the idea of stationary distributions appeared to collapse all the meta levels down to one. However, now I think there actually is a problem with the highest level of the tree. The initial distribution could be predominantly malign. Those malign UTAAs could select innocent-looking (but deadly) UTAAs for the next generation. In this way, the malign code could disappear, while achieving its goals by introducing subtle bugs to all subsequent generations of UTAAs.

The way I've specified things, trying to update against these malign UTAAs wouldn't work, because they're already absent in the stationary distribution. 

Of course, you could directly update against them in the initial distribution. This could eliminate select malign UTAAs. The problem is that this kind of process feedback loses generalizability again. Since it's the top level of the tree, there's nothing above it which is selecting it, so we don't get to update against any general selection behaviors which produced the malign UTAAs.

The only way out of this I see at present is to parameterize the system's beliefs directly as a probability distribution over stationary distributions. You can think of this as assuming that the initial distribution is already a stationary distribution. This way, when we update against malign UTAAs at the beginning of the process, we update against them occurring at any point in the process, which means we also update against any UTAAs which help select malign UTAAs, and therefore get generalization power.

But this seems like an annoyingly non-constructive solution. How are we supposed to work with the set of fixed points directly without iterating (potentially malign) code to find them?


*: Actually, a UTAA should be a compact specification of such a tuple, such as a program or neural network which can output the desired objects. This is necessary for implementation, since EG we can't store V as a big table of values or S as a big table of probabilities. It will also allow for better generalization and process-level feedback.

Comments

I like this post a lot. I pretty strongly agree that process-level feedback (what I would probably call mechanistic incentives) is necessary for inner alignment—and I'm quite excited about understanding what sorts of learning mechanisms we should be looking for when we give process-level feedback (and recursive quantilization seems like an interesting option in that space).

Since detecting malign hypotheses is difficult, we want the learning system to help us out here. It should generalize from examples of malign hypotheses, and attempt to draw a broad boundary around malignancy. Allowing the system to judge itself in this way can of course lead to malign reinterpretations of user feedback, but hopefully allows for a basin of attraction in which benevolent generalizations can be learned.

Notably, one way to get this is to have the process feedback given by an overseer implemented as a human with access to a prior version of the model being overseen (and then train the model both on the oversight signal directly and to match the amplified human's behavior doing oversight), as in relaxed adversarial training.

Notably, one way to get this is to have the process feedback given by an overseer implemented as a human with access to a prior version of the model being overseen (and then train the model both on the oversight signal directly and to match the amplified human's behavior doing oversight), as in relaxed adversarial training.

I guess you could say what I'm after is a learning theory of generalizing process-level feedback -- I think your setup could do a version of the thing, but I'm thinking about Bayesian formulations because I think it's an interesting challenge. Not because it has to be Bayesian, but because it should be in some formal framework which allows loss bounds, and if it ends up not being Bayesian I think that's interesting/important to notice.

(Which might then translate to insights about a neural net version of the thing?)

I suppose another part of what's going on here is that I want several of my criteria working together -- I think they're all individually achievable, but part of what's interesting to me about the project is the way the criteria jointly reinforce the spirit of the thing I'm after.

Actually, come to think of it, factored-cognition-style amplification (was there a term for this type of amplification specifically? Ah, imitative amplification) gives us a sort of "feedback at all levels" capability, because the system is trained to answer all sorts of questions. So it can be trained to answer questions about human values, and meta-questions about learning human values, and doubly and triply meta questions about how to answer those questions. These are all useful to the extent that the human in the amplification makes use of those questions and answers while doing their tasks.

One thing this lacks (in comparison to recursive quantilizers at least) is the idea of necessarily optimizing things through search at all. Information and meta-information about values and how to optimize them is not necessarily used in HCH because questions aren't necessarily answered via search. This can of course be a safety advantage. But it is operating on an entirely different intuition than recursive quantilizers. So, the "ability to receive feedback at all levels" means something different.

My intuition is that if I were to articulate more desiderata with the HCH case in mind, it might have something to do with the "learning the human prior" stuff.

A few years ago I thought about a problem which I think is the same thing you're pointing to here - no perfect feedback, uncertainty and learning all the way up the meta-ladder, etc. My attempt at a solution was quite different.

The basic idea is to use a communication prior - a prior which says "someone is trying to communicate with you".

With an idealized communication prior, our update is not P[Y|X], but instead P[Y|M], where (roughly) M = "X maximizes P[Y|M]" (except we unroll the fixed point to make initial condition of iteration explicit). Interpretation: the "message sender" chooses the value of X which results in us assigning maximum probability to Y, and we update on this fact. If you've played Codenames, this leads to similar chains of logic: "well, 'blue' seems like a good hint for both sky+sapphire and sky+water, but if it were sky+water they would have said 'weather' or something like that instead, so it's probably sky+sapphire...". As with recursive quantilizers, the infinite meta-tower collapses into a fixed-point calculation, and there's (hopefully) a basin of attraction.

To make this usable for alignment purposes we need a couple modifications.

First, obviously, humans are not perfectly rational and logically omniscient, so we have to replace "X maximizes P[Y|M]" with "<rough model of human> thinks X will produce high P[Y|M]". The better the human-model, the broader the basin of attraction for the whole thing to work.

Second, we have to say what the "message" X from the human is, and what Y is. Y would be something like human values, and X would include things like training data and/or explicit models. In principle, we could get uncertainty and learning "at the outermost level" by having the system treat its own source code as a "message" from the human - the source code is, after all, something the human chose expecting that it would produce a good estimate of human values. On the other hand, if the source code contained an error (that didn't just mess up everything), then the system could potentially recognize it as an error and then do something else.

Finally, obviously the "initial condition" of the iteration would have to be chosen carefully - that's basically just a good-enough world model and human-values-pointer. In a sense, we're trying to formalize "do what I mean" enough that the AI can figure out what we mean.

Maybe I'll write up a post on this tomorrow.

I like that these ideas can be turned into new learning paradigms relatively easily.

I think there's obviously something like your proposal going on, but I feel like it's the wrong place to start.

It's important that the system realize it has to model human communication as an attempt to communicate something, which is what you're doing here. It's something utterly missing from my model as written.

However, I feel like starting from this point forces us to hard-code a particular model of communication, which means the system can never get beyond this. As you said:

First, obviously, humans are not perfectly rational and logically omniscient, so we have to replace "X maximizes P[Y|M]" with "<rough model of human> thinks X will produce high P[Y|M]". The better the human-model, the broader the basin of attraction for the whole thing to work.

I would rather attack the problem of specifying what it could mean for a system to learn at all the meta levels in the first place, and then teach such a system about this kind of communication model as part of its broader education about how to avoid things like wireheading, human manipulation, treacherous turns, and so on.

Granted, you could overcome the hardwired-ness of the communication model if your "treat the source code as a communication, too" idea ended up allowing a reinterpretation of the basic communication model. That just seems very difficult.

All this being said, I'm glad to hear you were working on something similar. Your idea obviously starts to get at the "interpretable feedback" idea which I basically failed to make progress on in my proposal.

Yeah, I largely agree with this critique. The strategy relies heavily on the AI being able to move beyond the initial communication model, and we have essentially no theory to back that up.

Still interested in your write-up, though!

Great post! I agree with everything up to the recursive quantilizer stuff (not that I disagree with that, I just don't feel that I get it enough to voice an opinion). I thinks it's a very useful post, and I'll definitely go back to it and try to work out more details soon.

In general, it's possible to get very rich types of feedback, but very sparse: humans get all sorts of feedback, including not only instruction on how to act, but also how to think.

I suppose there is a typo and the correct sentence goes "but also very sparse"?

In other words, the overriding feature of normativity which I'm trying to point at is that nothing is ever 100%. Correct grammar is not defined by any (known) rules or set of text, nor is it (quite) just whatever humans judge it is.

I think you're onto something, but I wonder how much of this actually comes from the fact that language usage evolves? If the language stayed static, I think rules would work better. For an example outside English, in French we have the Académie Française, which is the official authority on usage of French. If the usage never changed, they would probably have a pretty nice set of rules (although not really that easily programmable) for French. But as things go, French, like any language, changes, and so they must adapt to it and try to rein it in.

That being said, this changing nature of language is probably a part of normativity. It just felt implicit in your post.

Wireheading and human manipulation can't be eliminated through object-level feedback, but we could point out examples of the wrong and right types of reasoning.

You don't put any citation for that. Is this an actual result, or just what you think really strongly?

You don't put any citation for that. Is this an actual result, or just what you think really strongly?

Yeah, sorry, I thought there might be an appropriate citation but I didn't find one. My thinking here is: in model-based RL, the best model you can have to fit the data is one which correctly identifies the reward signal as coming from the reward button (or whatever the actual physical reward source is). Whereas the desired model (what we want the system to learn) is one which, while perhaps being less predictively accurate, models reward as coming from some kind of values. If you couple RL with process-level feedback, you could directly discourage modeling reward as coming from the actual reward system, and encourage identifying it with other things -- overcoming the incentive to model it accurately.

Similarly, human manipulation comes from a "mistaken" (but predictively accurate) model which says that the human values are precisely whatever-the-human-feedback-says-they-are (IE that humans are in some sense the final judge of their values, so that any "manipulation" still reveals legitimate preferences by definition). Humans can provide feedback against this model, favoring models in which the human feedback can be corrupted by various means including manipulation.

That being said, this changing nature of language is probably a part of normativity. It just felt implicit in your post.

This is true. I wasn't thinking about this. My initial reaction to your point was to think, no, even if we froze English usage today, we'd still have a "normativity" phenomenon, where we (1) can't perfectly represent the rules via statistical occurrence, (2) can say more about the rules, but can't state all of them, and can make mistakes, (3) can say more about what good reasoning-about-the-rules would look like, ... etc.

But if we apply all the meta-meta-meta reasoning, what we ultimately get is evolution of the language at least in a very straightforward sense of a changing object-level usage and changing first-meta-level opinions about proper usage (and so on), even if we think of it as merely correcting imperfections rather than really changing. And, the meta-meta-meta consensus would probably include provisions that the language should be adaptive!

Another thing which seems to "gain something" every time it hops up a level of meta: Corrigibility as Outside View. Not sure what the fixed points are like, if there are any, and I don't view what I wrote as attempting to meet these desiderata. But I think there's something interesting that's gained each time you go meta. 

Seems like this could fit well with John's communication prior idea, in that the outside view resembles the communication view.

However, that only works if we have the right prior. We could try to learn the prior from humans, which gets us 99% of the way there... but as I've mentioned earlier, human imitation does not get us all the way. Humans don't perfectly endorse their own reactions.

Note that Learning the Prior uses an amplified human (ie, a human with access to a model trained via IDA/Debate/RRM). So we can do a bit better than a base human - e.g. could do something like having an HCH tree where many humans generate possible feedback and other humans look at the feedback and decide how much they endorse it.

I think the target is not to get normativity 'correct', but to design a mechanism such that we can't expect to find any mechanism that does better.

Right, I agree. I see myself as trying to construct a theory of normativity which gets that "by construction", IE, we can't expect to find any mechanism which does better because if we could say anything about what that mechanism does better then we could tell it to the system, and the system would take it into account.

HCH isn't such a theory; it does provide a somewhat reasonable notion of amplification, but if we noticed systematic flaws with how HCH reasons, we would not be able to systematically correct them.

I see myself as trying to construct a theory of normativity which gets that "by construction", IE, we can't expect to find any mechanism which does better because if we could say anything about what that mechanism does better then we could tell it to the system, and the system would take it into account.

Nice, this is what I was trying to say but was struggling to phrase it. I like this.

I guess I usually think of HCH as having this property, as long as the thinking time for each human is long enough, the tree is deep enough, and we're correct about the hope that natural language is sufficiently universal. It's quite likely I'm either confused or being sloppy though.

You could put 'learning the prior' inside HCH I think, it would just be inefficient - for every claim, you'd ask your HCH tree how much you should believe it, and HCH would think about the correct way to do bayesian reasoning, what the prior on that claim should be, and how well it predicted every piece of data you'd seen so far, in conjunction with everything else in your prior. I think one view of learning the prior is just making this process more tractable/practical, and saving you from having to revisit all your data points every time you ask any question - you just do all the learning from data once, then use the result of that to answer any subsequent questions.

Planned summary for the Alignment Newsletter:

To build aligned AI systems, we need to have our AI systems learn what to do from human feedback. However, it is unclear how to interpret that feedback: any particular piece of feedback could be wrong; economics provides many examples of stated preferences diverging from revealed preferences. Not only would we like our AI system to be uncertain about the interpretation about any particular piece of feedback, we would also like it to _improve_ its process for interpreting human feedback. This would come from human feedback on the meta-level process by which the AI system learns. This gives us _process-level feedback_, where we make sure the AI system gets the right answers _for the right reasons_.

For example, perhaps initially we have an AI system that interprets human statements literally. Switching from this literal interpretation to a Gricean interpretation (where you also take into account the fact that the human chose to say this statement rather than other statements) is likely to yield improvements, and human feedback could help the AI system do this. (See also [Gricean communication and meta-preferences](https://www.alignmentforum.org/posts/8NpwfjFuEPMjTdriJ/gricean-communication-and-meta-preferences), [Communication Prior as Alignment Strategy](https://www.alignmentforum.org/posts/zAvhvGa6ToieNGuy2/communication-prior-as-alignment-strategy), and [multiple related CHAI papers](https://www.alignmentforum.org/posts/zAvhvGa6ToieNGuy2/communication-prior-as-alignment-strategy?commentId=uWBFsKK6XFbL4Hs4z).)

Of course, if we learn _how_ to interpret human feedback, that too is going to be uncertain. We can fix this by “going meta” once again: learning how to learn to interpret human feedback. Iterating this process we get an infinite tower of “levels” of learning, and at every level we assume that feedback is not perfect and the loss function we are using is also not perfect.

In order for this to actually be feasible, we clearly need to share information across these various “levels” (or else it would take infinite time to learn across all of the levels). The AI system should not just learn to decrease the probability assigned to a single hypothesis, it should learn what _kinds_ of hypotheses tend to be good or bad.

Planned opinion (the second post is Recursive Quantilizers II):

**On feedback types:** It seems like the scheme introduced here is relying quite strongly on the ability of humans to give good process-level feedback _at arbitrarily high levels_. It is not clear to me that this is something humans can do: it seems to me that when thinking at the meta level, humans often fail to think of important considerations that would be obvious in an object-level case. I think this could be a significant barrier to this scheme, though it’s hard to say without more concrete examples of what this looks like in practice.

**On interaction:** I’ve previously <@argued@>(@Human-AI Interaction@) that it is important to get feedback _online_ from the human; giving feedback “all at once” at the beginning is too hard to do well. However, the idealized algorithm here does have the feedback “all at once”. It’s possible that this is okay, if it is primarily process-level feedback, but it seems fairly worrying to me.

**On desiderata:** The desiderata introduced in the first post feel stronger than they need to be. It seems possible to specify a method of interpreting feedback that is _good enough_: it doesn’t exactly capture everything, but it gets it sufficiently correct that it results in good outcomes. This seems especially true when talking about process-level feedback, or feedback one meta level up -- as long as the AI system has learned an okay notion of “being helpful” or “being corrigible”, then it seems like we’re probably fine.

Often just making feedback uncertain can help. For example, in the preference learning literature, Boltzmann rationality has emerged as the model of choice for how to interpret human feedback. While there are several theoretical justifications for this model, I suspect its success is simply because it makes feedback uncertain: if you want to have a model that assigns higher likelihood to high-reward actions, but still assigns some probability to all actions, it seems like you end up choosing the Boltzmann model (or something functionally equivalent). Note that there is work trying to improve upon this model, such as by [modeling humans as pedagogic](https://papers.nips.cc/paper/2016/file/b5488aeff42889188d03c9895255cecc-Paper.pdf), or by <@incorporating a notion of similarity@>(@LESS is More: Rethinking Probabilistic Models of Human Behavior@).

So overall, I don’t feel convinced that we need to aim for learning at all levels. That being said, the second post introduced a different argument: that the method does as well as we “could” do given the limits of human reasoning. I like this a lot more as a desideratum; it feels more achievable and more tied to what we care about.

I'm pretty on board with this research agenda, but I'm curious what you think about the distinction between approaches that look like finding a fixed point, and approaches that look like doing perturbation theory.

And on the assumption that you have no idea what I'm referring to, here's the link to my post.

There are a couple different directions to go from here. One way is to try to collapse the recursion. Find a single agent-shaped model of humans that is (or approximates) a fixed point of this model-ratification process (and also hopefully stays close to real humans by some metric), and use the preferences of that. This is what I see as the endgame of the imitation / bootstrapping research.

Another way might be to imitate communication, and find a way to use recursive models such that we can stop the recursion early without much loss in effectiveness. In communication, the innermost layer of the model can be quite simplistic, and then the next is more complicated by virtue of taking advantage of the first, and so on. At each layer you can do some amount of abstracting away of the details of previous layers, so by the time you're at layer 4 maybe it doesn't matter that layer 1 was just a crude facsimile of human behavior.

Thinking specifically about this UTAA monad thing, I think it's a really clever way to think about what levers we have access to in the fixed-point picture. (If I was going to point out one thing it's lacking, it's that it's a little hazy on whether you're supposed to model V as now having meta-values about the state of the entire recursive tree of UTAAs, or whether your Q function is now supposed to learn about meta-preferences from some outside data source.) But it retains the things I'm worried about from this fixed-point picture, which is basically that I'm not sure it buys us much of anything if the starting point isn't benign in a quite strong sense.

I'm pretty on board with this research agenda, but I'm curious what you think about the distinction between approaches that look like finding a fixed point, and approaches that look like doing perturbation theory.

Ah, nice post, sorry I didn't see it originally! It's pointing at a very very related idea.

Seems like it also has to do with John's communication model.

With respect to your question about fixed points, I think the issue is quite complicated, and I'd rather approach it indirectly by collecting criteria and trying to make models which fit the various criteria. But here are some attempted thoughts.

  1. We should be quite skeptical of just taking a fixed point, without carefully building up all the elements of the final solution -- we don't just want consistency, we want consistency as a result of sufficiently humanlike deliberation. This is similar to the idea that naive infinite HCH might be malign (because it's just some weird fixed point of humans-consulting-HCH), but if we ensure that the HCH tree is finite by (a) requiring all queries to have a recursion budget, or (b) having a probability of randomly stopping (not allowing the tree to be expanded any further), or things like that, we can avoid weird fixed points (and, not coincidentally, these models fit better with what you'd get from iterated amplification if you're training it carefully rather than in a way which allows weird malign fixed-points to creep in).
  2. However, I still may want to take fixed points in the design; for example, the way UTAAs allow me to collapse all the meta-levels down. A big difference between your approach in the post and mine here is that I've got more separation between the rationality criteria of the design vs the rationality the system is going to learn, so I can use pure fixed points on one but not the other (hopefully that makes sense?). The system can be based on a perfect fixed point of some sort, while still building up a careful picture iteratively improving on initial models. That's kind of illustrated by the recursive quantilization idea. The output is supposed to come from an actual fixed-point of quantilizing UTAAs, but it can also be seen as the result of successive layers. (Though overall I think we probably don't get enough of the "carefully building up incremental improvements" spirit.)

(If I was going to point out one thing it's lacking, it's that it's a little hazy on whether you're supposed to model V as now having meta-values about the state of the entire recursive tree of UTAAs, or whether your Q function is now supposed to learn about meta-preferences from some outside data source.)

Agreed, I was totally lazy about this. I might write something more detailed in the future, but this felt like an OK version to get the rough ideas out. After all, I think there are bigger issues than this (IE the two desiderata failures I pointed out at the end).