Deliberation as a method to find the "actual preferences" of humans

riceissa

Some recent discussion about what Paul Christiano means by "short-term preferences" got me thinking more generally about deliberation as a method of figuring out the human user's or users' "actual preferences". (I can't give a definition of "actual preferences" because we have such a poor understanding of meta-ethics that we don't even know what the term should mean or whether such preferences even exist.)

To set the framing of this post: We want good outcomes from AI. To get this, we probably want to figure out the human user's or users' "actual preferences" at some point. There are several options for this:

1. Directly solve meta-ethics. We figure out whether there are normative facts about what we should value, and use this solution to clarify what "actual preferences" means and to find the human's or humans' "actual preferences".
2. Solve meta-philosophy. This is like solving meta-ethics, but there is an extra level of meta: we figure out what philosophy is or what human brains are doing when they make philosophical progress, then use this understanding to solve meta-ethics. Then we proceed as in the "directly solve meta-ethics" approach.
3. Deliberate for a long time. We actually get a human or group of humans to think for a long time under idealized conditions, or approximate the output of this process somehow. This might reduce to one of the above approaches (if the humans come to believe that solving meta-ethics/meta-philosophy is the best way to find their "actual preferences") or it might not.

The third option is the focus of this post. The first two are also very worthy of consideration—they just aren't the focus here. Also the list isn't meant to be comprehensive; I would be interested to hear any other approaches.

In terms of Paul's recent AI alignment landscape tree, I think this discussion fits under the "Learn from teacher" node, but I'm not sure.

Terminological note: In this post, I use "deliberation" and "reflection" interchangeably. I think this is standard, but I'm not sure. If anyone uses these terms differently, I would like to know how they distinguish between them.

Approaches to deliberation that have been suggested so far

In this section, I list some concrete-ish approaches to deliberation that have been considered so far. I say "concrete-ish" rather than "concrete" because each of these approaches seems underdetermined in many ways. For "humans sitting down", for example, it's not clear whether we split the humans up in some way, which humans we use, how much time we allow, what kind of "voting"/parliamentary system we use, and so on. Later in this post I will talk about properties of deliberation; the "concrete-ish" approaches here are concrete in two senses: (a) they have some of these properties filled in (e.g. "humans sitting down" says the computation happens primarily inside human brains); and (b) within a single property, they might specify a specific mechanism (e.g. saying "use counterfactual oracles somehow" is more concrete than saying "use an approach where the computation doesn't happen inside human brains").

- Humans sitting down. A human or group of humans sitting down and thinking for a long time (a.k.a. Long Reflection/Great Deliberation).
- Uploads sitting down. The above, but with whole brain emulations (uploads) instead. This would speed up the reflection in calendar time. There are probably other benefits and drawbacks as well.
- Counterfactual oracles. Certain uses of counterfactual oracles allow humans to speed up reflection. For example, if we ask a counterfactual oracle to predict what we will say in a week, then we can get the answer now instead of waiting a week to find out what we would have said. See these two comments for a more detailed proposal.
- Imitation-based IDA (iterated distillation and amplification). The human can break apart the question of "What are my actual preferences?" into sub-queries and use AI assistants to help answer the question. Alternatively, the human can ask more "concrete" questions like "How do I solve this math problem?" or specify more concrete tasks like "Help me schedule an appointment", where the output of deliberation is implicit in how the AI system behaves.
- RL-based IDA. This is like imitation-based IDA, but instead of distilling the overseer via imitation, we use reinforcement learning.
- Debate. This is probably a dumb idea, but we can imagine getting the two AIs in Debate to argue for what the human should think about their values. Instead of exploring the whole tree of arguments and counter-arguments, the human can just process a single path through the tree, which will speed up the reflection.
- CEV (coherent extrapolated volition), or more specifically what Eliezer Yudkowsky calls "initial dynamic" in the CEV paper. An AI tries to figure out what a group of humans would think about their values if they knew more, thought faster, etc.
- Ambitious value learning. Somehow use lots of data and compute to learn the human utility function.
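
To make the speed-up claimed in the Debate item concrete, here is a minimal sketch of the arithmetic. It assumes the arguments and counter-arguments form a complete tree with a fixed branching factor and depth; both parameters and both function names are hypothetical, chosen only for illustration, and real argument trees need not be this regular.

```python
def nodes_in_full_tree(branching: int, depth: int) -> int:
    """Count every argument in a complete tree, root included.

    Assumption: each argument has exactly `branching` counter-arguments,
    down to `depth` levels below the root.
    """
    return sum(branching ** level for level in range(depth + 1))


def nodes_on_single_path(depth: int) -> int:
    """Count the arguments on one root-to-leaf path, root included."""
    return depth + 1


# Hypothetical example: 3 counter-arguments per argument, 4 levels deep.
full = nodes_in_full_tree(branching=3, depth=4)
path = nodes_on_single_path(depth=4)
print(f"whole tree: {full} arguments, single path: {path} arguments")
```

Under these assumptions the whole tree grows exponentially with depth while a single path grows linearly, which is the sense in which processing one path is faster. The sketch says nothing about whether the path the debaters select is a good one.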

Properties of deliberation

With the examples above in hand, I want to step back and abstract out some properties/axes/dimensions they have.

- Human vs non-human computation. Does the computation happen primarily inside human brains?
- Human-like vs non-human-like cognition. Does the cognition resemble human thought? If the computation happens outside human brains, we can try to mimic the low-level steps in the reasoning of the deliberation (i.e. simulating human thought), or we can just try to predict the outward result without going through the same internal mechanics (non-simulated). There are intermediate cases where we predict-without-simulating over short time periods (like one hour) but then simulate as we glue together these short-term reflections. One can also think of human-like vs non-human-like cognition as whether the process (rather than output; see below) of deliberation is explicit (human-like) vs implicit (non-human-like).
  - In the non-simulated case, there is the further question of whether any consequentialist reasoning takes place (it might be impossible to predict a long-term reflection without any consequentialist reasoning, so this might only apply to short-term reflections). Further discussion: Q5 in the CEV paper, Vingean reflection, this comment by Eliezer and Paul's reply to it, this post and the resulting discussion.
  - When I was originally thinking about this, I conflated "Does the computation happen primarily inside human brains?" and "Does the cognition resemble human thought?" but these two can vary independently. For instance, whole brain emulation does the computation outside of human brains even though the cognition resembles human thought, and an implementation of HCH could hypothetically be done involving just humans but its cognition would not resemble human thought.
- Implicit vs explicit output. Is the output of the deliberation explicitly represented, or is it just implicit in how the system behaves? From the IDA paper: "The human must be involved in this process because there is no external objective to guide learning—the objective is implicit in the way that the human coordinates the copies of [X]. For example, we have no external measure of what constitutes a 'good' answer to a question, this notion is only implicit in how a human decides to combine the answers to subquestions (which usually involves both facts and value judgments)."
  - I also initially conflated implicit vs explicit process and implicit vs explicit output. Again, these can vary independently: RL-based IDA would have an explicit representation of the reward function but the deliberation would not resemble human thought (explicit output, implicit process), and we can imagine some humans who end up refusing to state what they value even after reflection, saying something like "I'll just do whatever I feel like doing in the moment" (implicit output, explicit process).
- Human intermediate integration. Some kinds of approaches (like counterfactual oracles and Debate) seem to speed up the deliberation by "offloading" parts of the work to AIs and having the humans integrate the intermediate results.
- Human understandability of output. The output of deliberation could be simple enough that a human could understand it and integrate it into their worldviews, or it could be so complicated that this is not possible. It seems like there is a choice as to whether to allow non-understandable outputs. This was called "understandability" in this comment. Whereas human vs non-human computation is about whether the process of deliberation takes place in the human brain, and human-like vs non-human-like cognition is about whether the process is humanly understandable, human understandability is about whether the output eventually makes its way into the human brain. See the table below for a summary.
- Speed. In AI takeoff scenarios where a bunch of different AIs are competing with each other, the deliberation process must produce some answer quickly or produce successive answers as time goes on (in order to figure out which resources are worth acquiring). On the other hand, in takeoff scenarios where the first successful project achieves a decisive strategic advantage, the deliberation can take its time.
- Number of rounds (satisficing vs maximizing). The CEV paper talks about CEV as a way to "apply emergency first aid to human civilization, but not do humanity’s work on our behalf, or decide our futures for us" (p. 36). This seems to imply that in the "CEV story", humanity itself (or at least some subset of humans) will do even more reflection after CEV. To use the terminology from the CEV paper, the first round is to satisfice, and the second round is to maximize.^[1] Paul also seems to envision the AI learning human values in two rounds: the first round to gain a minimal understanding for the purpose of strategy-stealing, and the second round to gain a more nuanced understanding to implement our "actual preferences".^[2]
- Individual vs collective reflection. The CEV paper argues for extrapolating the collective volition of all currently-existing humans, and says things like "You can go from a collective dynamic to an individual dynamic, but not the other way around; it’s a one-way hatch" (p. 23). As far as I know, other people haven't really argued one way or the other (in some places I've seen people restricting discussion to a single human for the sake of simplicity).
- Peeking at the output. In the CEV paper and Arbital page, Eliezer talks about giving a single human or group of humans the ability to peek at the output of the reflection and to "veto" it. See "Moral hazard vs. debugging" on the Arbital CEV page and also discussions of Last Judge in the CEV paper.
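
The intermediate case mentioned under human-like vs non-human-like cognition (predict short spans without simulating, then glue the spans together) can be sketched roughly as follows. This is a toy illustration only: `predict_one_hour` is a hypothetical stand-in for some learned predictor, `toy_predictor` is a made-up placeholder, and representing a deliberator's state as a list of notes is an assumption made for readability, not something any of the proposals specify.

```python
from typing import Callable, List

Notes = List[str]  # hypothetical representation of a deliberator's state


def glue_short_reflections(
    start: Notes,
    predict_one_hour: Callable[[Notes], Notes],
    hours: int,
) -> List[Notes]:
    """Chain short-horizon predictions into one long reflection.

    Each call to `predict_one_hour` is opaque (predicted, not simulated),
    but the hour-by-hour chain is explicit and can be inspected.
    """
    trajectory = [start]
    for _ in range(hours):
        trajectory.append(predict_one_hour(trajectory[-1]))
    return trajectory


def toy_predictor(notes: Notes) -> Notes:
    """Placeholder predictor: appends one new 'thought' per hour."""
    return notes + [f"thought {len(notes)}"]


trajectory = glue_short_reflections(["initial question"], toy_predictor, hours=3)
print(trajectory[-1])
```

The design point is only that the outer loop is explicit while each step is a black box; how good the result is depends entirely on the predictor, which this sketch does not model.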

I'm not sure that these dimensions cleanly separate or how important they are. There are also probably many other dimensions that I'm missing.

Since I had trouble distinguishing between some of the above properties, I made the following table:

	Output	Process
Implicit vs explicit	Implicit vs explicit output	Human-like vs non-human-like cognition
Understandable vs not understandable	Human understandability of output (human intermediate integration also implies understandability of intermediate results and thus also of the output)	Human-like vs non-human-like cognition (there might also be non-human-like approaches that are understandable)
Inside vs outside human brain	(Reduces to understandable vs not understandable)	Human vs non-human computation

Comparison table

The following table summarizes my understanding of where each of the concrete-ish approaches stands on a subset of the above properties. I've restricted the comparison to a subset of the properties because many approaches leave certain questions unanswered and also because if I add too many columns the table will become difficult to read.

In addition to the approaches listed above, I've included HCH since I think it's an interesting theoretical case to look at.

	Inside human brain?	Human-like cognition?	Implicit vs explicit output	Intermediate integration	Understandable output?
Humans sitting down	yes	yes	explicit (hopefully)	yes	yes
Uploads sitting down	no	yes	explicit	maybe	yes
Counterfactual oracle	no	no	explicit	yes	yes
Imitation-based IDA	no	no	implicit/depends on question*	no	no
RL-based IDA	no	no†	explicit†	no	no†
HCH	yes	no	implicit/depends on question*	no	n.a.
Debate	no	no	explicit	yes	yes
CEV	no	?‡	explicit	no	yes
Ambitious value learning	no	no	explicit	no	maybe

* We could imagine asking a question like "What are my actual preferences?" to get an explicit answer, or just ask AI assistants to do something (in which case the output of deliberation is not explicit).

† Paul says "Rather than learning a reward function from human data, we also train it by amplification (acting on the same representations used by the generative model). Again, we can distill the reward function into a neural network that acts on sequences of observations, but now instead of learning to predict human judgments it’s predicting a very large implicit deliberation." The "implicit" in this quote seems to refer to the process (rather than output) of deliberation. See also the paragraph starting with "To summarize my own understanding" in this comment (which I think is talking about RL-based IDA), which suggests that maybe we should distinguish between "understandable in theory if we had the time" vs "understandable within the time constraints we have" (in the table I went with the latter). There is also the question of whether a reward function is "explicit enough" as a representation of values.

‡ Q5 (p. 32) in the CEV paper clarifies that the computation to find CEV wouldn't be sentient, but I'm not sure if the paper says whether the cognition will resemble human thought.
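
For readers who want to query the comparison table rather than scan it, here is a minimal sketch that encodes the rows as data. The cell values are copied from the table above with the footnote markers dropped; the class name, field names, and the `names_where` helper are hypothetical and introduced only for this illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Approach:
    name: str
    inside_human_brain: str
    human_like_cognition: str
    output: str
    intermediate_integration: str
    understandable_output: str


# Rows copied from the comparison table (footnote markers omitted).
APPROACHES = [
    Approach("Humans sitting down", "yes", "yes", "explicit (hopefully)", "yes", "yes"),
    Approach("Uploads sitting down", "no", "yes", "explicit", "maybe", "yes"),
    Approach("Counterfactual oracle", "no", "no", "explicit", "yes", "yes"),
    Approach("Imitation-based IDA", "no", "no", "implicit/depends on question", "no", "no"),
    Approach("RL-based IDA", "no", "no", "explicit", "no", "no"),
    Approach("HCH", "yes", "no", "implicit/depends on question", "no", "n.a."),
    Approach("Debate", "no", "no", "explicit", "yes", "yes"),
    Approach("CEV", "no", "?", "explicit", "no", "yes"),
    Approach("Ambitious value learning", "no", "no", "explicit", "no", "maybe"),
]


def names_where(column: str, value: str) -> List[str]:
    """Return the names of approaches whose `column` equals `value`."""
    return [a.name for a in APPROACHES if getattr(a, column) == value]


# The first two columns vary independently, as noted in the properties section.
brain_but_not_human_like = [
    a.name for a in APPROACHES
    if a.inside_human_brain == "yes" and a.human_like_cognition == "no"
]
human_like_but_not_brain = [
    a.name for a in APPROACHES
    if a.inside_human_brain == "no" and a.human_like_cognition == "yes"
]
print(brain_but_not_human_like, human_like_but_not_brain)
```

Storing the cells as strings rather than booleans is deliberate, since several entries are "maybe", "?", or "n.a.".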

Takeaways

- We can imagine a graph where the horizontal axis is "quality of deliberation" and the vertical axis is "quality of outcome (overall value of the future)". If your intuition says that the overall value of the future is sensitive to the quality of deliberation, it seems good to pay attention to how different "success stories" incorporate deliberation, and to understand the quality of deliberation for each approach. It might turn out that there is a certain threshold above which outcomes are "good enough" and that all the concrete approaches pass this threshold (the threshold could exist on either axis: we might stop caring about how good the outcomes are above a certain point, or all approaches to deliberation above a certain point might produce basically the same outcome); in that case, understanding deliberation might not be so interesting. However, if there are no such thresholds (so that better deliberation continually leads to better outcomes), or if some of the approaches do not pass the threshold, then it seems worth being picky about how deliberation is implemented (potentially rejecting certain success stories for lack of satisfactory deliberation).
- Thinking about deliberation is tricky because it requires mentally keeping track of the strategic background/assumptions for each "success story", e.g. talking about speed of deliberation only makes sense in a slow takeoff scenario, and peeking at the output only makes sense under types of deliberation where the humans aren't doing the work. See my related comment about a similar issue with success stories. It's also tricky because there turn out to be a bunch of subtle distinctions that I didn't realize existed.
- One of my original motivations for thinking about deliberation was to try to understand what kind of deliberation Paul has in mind for IDA. Having gone through the above analysis, I feel like I understand each approach (e.g. RL-based IDA, counterfactual oracles) better but I'm not sure I understand Paul's overall vision any better. I think my main confusion is that Paul talks about many different ways deliberation could work (e.g. RL-based IDA and human-in-the-counterfactual-loop seem pretty different), and it's not clear what approach he thinks is most plausible.
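
The threshold argument in the first takeaway can be illustrated with two toy outcome curves. Everything here is an assumption made for illustration: the 0-to-1 quality scale, the threshold value of 0.7, the specific functional forms, and the function names. The post itself does not commit to any particular shape for the curve.

```python
from typing import Callable, Sequence


def saturating_outcome(quality: float, threshold: float = 0.7) -> float:
    """Toy curve: outcomes stop improving once deliberation passes a threshold."""
    return min(quality, threshold)


def unbounded_outcome(quality: float) -> float:
    """Toy curve: better deliberation always yields a better outcome."""
    return quality


def worth_being_picky(
    qualities: Sequence[float],
    outcome: Callable[[float], float],
) -> bool:
    """True if the candidate approaches lead to different outcomes."""
    return len({outcome(q) for q in qualities}) > 1


# Hypothetical deliberation qualities for three approaches.
all_above_threshold = [0.8, 0.9, 0.95]
one_below_threshold = [0.5, 0.9, 0.95]

print(worth_being_picky(all_above_threshold, saturating_outcome))  # same outcome for all
print(worth_being_picky(one_below_threshold, saturating_outcome))  # one approach falls short
print(worth_being_picky(all_above_threshold, unbounded_outcome))   # no threshold at all
```

In this toy model, the choice of approach only stops mattering when a threshold exists and every candidate clears it, which mirrors the two conditions stated in the takeaway.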

Acknowledgments

Thanks to Wei Dai for suggesting the point about solving meta-ethics. (However, I may have misrepresented his point, and this acknowledgment should not be seen as an endorsement by him.)

[1] From the CEV paper: "Do we want our coherent extrapolated volition to satisfice, or maximize? My guess is that we want our coherent extrapolated volition to satisfice […]. If so, rather than trying to guess the optimal decision of a specific individual, the CEV would pick a solution that satisficed the spread of possibilities for the extrapolated statistical aggregate of humankind." (p. 36)

And: "This is another reason not to stand in awe of the judgments of a CEV—a solution that satisfices an extrapolated spread of possibilities for the statistical aggregate of humankind may not correspond to the best decision of any individual, or even the best vote of any real, actual adult humankind." (p. 37)

[2] Paul says "So an excellent agent with a minimal understanding of human values seems OK. Such an agent could avoid getting left behind by its competitors, and remain under human control. Eventually, once it got enough information to understand human values (say, by interacting with humans), it could help us implement our values."

Comment by Wei Dai

I think my main confusion is that Paul talks about many different ways deliberation could work (e.g. RL-based IDA and human-in-the-counterfactual-loop seem pretty different), and it’s not clear what approach he thinks is most plausible.

I have similar questions, and I'm not sure how much of it is that Paul is uncertain himself, and how much is Paul not having communicated his thinking yet. Also one thing to keep in mind is that different forms of deliberation could be used at different levels of the system, so for example one method can be used to model/emulate/extrapolate the overseer's deliberation and another one for the end-user.

On a more general note, I'm really worried that we don't have much understanding of how or why human deliberation can lead to good outcomes in the long run. It seems clear that an individual human deliberating in isolation is highly likely to get stuck or go off the rails, and groups of humans often do so as well. To the extent that we as a global civilization seemingly are able to make progress in the very long run, it seems at best a fragile process, which we don't know how to reliably preserve, or reproduce in an artificial setting.
