Issa Rice

I am Issa Rice.

Issa Rice's Comments

Utility ≠ Reward

To me, it seems like the two distinctions are different. There seem to be three levels to distinguish:

  1. The reward (in the reinforcement learning sense) or the base objective (example: inclusive genetic fitness for humans)
  2. A mechanism in the brain that dispenses pleasure or provides a proxy for the reward (example: pleasure in humans)
  3. The actual goal/utility that the agent ends up pursuing (example: a reflective equilibrium for some human's values, which might have nothing to do with pleasure or inclusive genetic fitness)

The base objective vs mesa-objective distinction seems to be about (1) vs a combination of (2) and (3). The reward maximizer vs utility maximizer distinction seems to be about (2) vs (3), or maybe (1) vs (3).

Depending on the agent that is considered, only some of these levels may be present:

  • A "dumb" RL-trained agent that engages in reward gaming. Only level (1), and there is no mesa-optimizer.
  • A "dumb" RL-trained agent that engages in reward tampering. Only level (1), and there is no mesa-optimizer.
  • A paperclip maximizer built from scratch. Only level (3), and there is no mesa-optimizer.
  • A relatively "dumb" mesa-optimizer trained using RL might have just (1) (the base objective) and (2) (the mesa-objective). This kind of agent would be incentivized to tamper with its pleasure circuitry (in the sense of (2)), but wouldn't be incentivized to tamper with its RL-reward circuitry. (Example: rats wirehead to give themselves MAX_PLEASURE, but don't self-modify to delude themselves into thinking they have left many descendants.)
  • If the training procedure somehow coughs up a mesa-optimizer that doesn't have a "pleasure center" in its brain (I don't know how this would happen, but it seems logically possible), there would just be (1) (the base objective) and (3) (the mesa-objective). This kind of agent wouldn't try to tamper with its utility function (in the sense of (3)), nor would it try to tamper with its RL-reward/base-objective to delude itself into thinking it has high rewards.

ETA: Here is a table that shows these distinctions varying independently:

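|                                                            | Utility maximizer               | Reward maximizer        |
|------------------------------------------------------------|---------------------------------|-------------------------|
| Optimizes for base objective (i.e. mesa-optimizer absent)  | Paperclip maximizer             | "Dumb" RL-trained agent |
| Optimizes for mesa-objective (i.e. mesa-optimizer present) | Human in reflective equilibrium | Rats                    |

The strategy-stealing assumption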

Like Wei Dai, I am also finding this discussion pretty confusing. To summarize my state of confusion, I came up with the following list of ways in which preferences can be short or long:

  1. time horizon and time discounting: how far in the future is the preference about? More generally, how much weight do we place on the present vs the future?
  2. act-based ("short") vs goal-based ("long"): using the human's (or more generally, the human-plus-AI-assistants'; see (6) below) estimate of the value of the next action (act-based) or doing more open-ended optimization of the future based on some goal, e.g. using a utility function (goal-based)
  3. amount of reflection the human has undergone: "short" would be the current human (I think this is what you call "preferences-as-elicited"), and this would get "longer" as we give the human more time to think, with something like CEV/Long Reflection/Great Deliberation being the "longest" in this sense (I think this is what you call "preference-on-idealized-reflection"). This sense further breaks down into whether the human itself is actually doing the reflection, or if the AI is instead predicting what the human would think after reflection.
  4. how far the search happens: "short" would be a limited search (that lacks insight/doesn't see interesting consequences) and "long" would be a search that has insight/sees interesting consequences. This is a distinction you made in a discussion with Eliezer a while back. This distinction also isn't strictly about preferences, but rather about how one would achieve those preferences.
  5. de dicto ("short") vs de re ("long"): This is a distinction you made in this post. I think this is the same distinction as (2) or (3), but I'm not sure which. (But if my interpretation of you below is correct, I guess this must be the same as (2) or else a completely different distinction.)
  6. understandable ("short") vs evaluable ("long"): A course of action is understandable if the human (without any AI assistants) can understand the rationale behind it; a course of action is evaluable if there is some procedure the human can implement to evaluate the rationale using AI assistants. I guess there is also a "not even evaluable" option here that is even "longer". (Thanks to Wei Dai for bringing up this distinction, although I may have misunderstood the actual distinction.)

My interpretation is that when you say "short-term preferences-on-reflection", you mean short in sense (1), except when the AI needs to gather resources, in which case either the human or the AI will need to do more long-term planning; short in sense (2); long in sense (3), with the AI predicting what the human would think after reflection; long in sense (4); short in sense (5); long in sense (6). Does this sound right to you? If not, I think it would help me a lot if you could "fill in the list" with which of short or long you choose for each point.

Assuming my interpretation is correct, my confusion is that you say we shouldn't expect a situation where "the user-on-reflection might be happy with the level of corrigibility, but the user themselves might be unhappy" (I take you to be talking about sense (3) from above). It seems like the user-on-reflection and the current user would disagree about many things (that is the whole point of reflection), so if the AI acts in accordance with the intentions of the user-on-reflection, the current user is likely to end up unhappy.

What are the differences between all the iterative/recursive approaches to AI alignment?

Thanks. It looks like all the realistic examples I had of weak HCH are actually examples of strong HCH after all, so I'm looking for some examples of weak HCH to help my understanding. I can see how weak HCH would compute the answer to a "naturally linear recursive" problem (like computing factorials) but how would weak HCH answer a question like "Should I get laser eye surgery?" (to take an example from here). The natural way to decompose a problem like this seems to use branching.

Also, I just looked again at Alex Zhu's FAQ for Paul's agenda, and Alex's explanation of weak HCH (in section 2.2.1) seems to imply that it is doing tree recursion (e.g. "I sometimes picture this as an infinite tree of humans-in-boxes, who can break down questions and pass them to other humans-in-boxes"). It seems like either you or Alex must be mistaken here, but I have no idea which.
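To make the linear-versus-tree distinction concrete, here is a minimal Python sketch of the two recursion shapes. It illustrates only the call structure, not HCH itself. The `Question` class, `count_calls`, and the particular subquestions are hypothetical and invented for illustration.

```python
from dataclasses import dataclass, field
from typing import List


def factorial(n: int) -> int:
    """Linear recursion: each call makes at most one recursive call."""
    if n <= 1:
        return 1
    return n * factorial(n - 1)


@dataclass
class Question:
    """A question plus the (hypothetical) subquestions it is broken into."""
    text: str
    subquestions: List["Question"] = field(default_factory=list)


def count_calls(q: Question) -> int:
    """Tree recursion: one recursive call per subquestion.

    Returns the total number of calls, i.e. nodes in the decomposition tree.
    """
    return 1 + sum(count_calls(sub) for sub in q.subquestions)


# Hypothetical decomposition, for illustration only.
surgery = Question(
    "Should I get laser eye surgery?",
    [
        Question("What are the risks?", [Question("How common are complications?")]),
        Question("What does it cost?"),
        Question("What are the alternatives?"),
    ],
)

print(factorial(5))          # 120
print(count_calls(surgery))  # 5
```

In the factorial case the call chain is a single path, while the question decomposition fans out at every node with more than one subquestion, which is the branching described above.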

What are the differences between all the iterative/recursive approaches to AI alignment?

Thanks! I found this answer really useful.

I have some follow-up questions that I'm hoping you can answer:

  1. I didn't realize that weak HCH uses linear recursion. On the original HCH post (which is talking about weak HCH), Paul talks in comments about "branching factor", and Vaniver says things like "So he asks HCH to separately solve A, B, and C". Are Paul/Vaniver talking about strong HCH here, or am I wrong to think that branching implies tree recursion? If Paul/Vaniver are talking about weak HCH, and branching does imply tree recursion, then it seems like weak HCH must be using tree recursion rather than linear recursion.
  2. Your answer didn't confirm or deny whether the agents in HCH are human-level or superhuman. I'm guessing it's the former, in which case I'm confused about how IDA and recursive reward modeling are approximating strong HCH, since in these approaches the agents are eventually superhuman (so they could solve some problems in ways that HCH can't, or solve problems that HCH can't solve at all).
  3. You write that meta-execution is "more a component of other approaches", but Paul says "Meta-execution is annotated functional programming + strong HCH + a level of indirection", which makes it sound like meta-execution is a specific implementation rather than a component that plugs into other approaches (whereas annotated functional programming does seem like a component that can plug into other approaches). Were you talking about annotated functional programming here? If not, how is meta-execution used in other approaches?
  4. I'm confused that you say IDA is task-based rather than reward-based. My understanding was that IDA can be task-based or reward-based depending on the learning method used during the distillation process. This discussion thread seems to imply that recursive reward modeling is an instance of IDA. Am I missing something, or were you restricting attention to a specific kind of IDA (like imitation-based IDA)?

Open Problems Regarding Counterfactuals: An Introduction For Beginners

The link no longer works (I get "This project has not yet been moved into the new version of Overleaf. You will need to log in and move it in order to continue working on it."). Would you be willing to re-post it or move it so that it is visible?

Topological Fixed Point Exercises

My solution for #3:

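Define $g: [0,1] \to \mathbb{R}$ by $g(x) = f(x) - x$. We know that $g$ is continuous because $f$ and the identity map both are, and by the limit laws. Since $f$ takes values in $[0,1]$, we have $g(0) = f(0) \ge 0$ and $g(1) = f(1) - 1 \le 0$. Applying the intermediate value theorem (problem #2) we see that there exists $x \in [0,1]$ such that $g(x) = 0$. But this means $f(x) = x$, so we are done.

Counterexample for the open interval: consider $f: (0,1) \to (0,1)$ defined by $f(x) = x/2$. First, we can verify that if $0 < x < 1$ then $0 < x/2 < 1$, so $f$ indeed maps $(0,1)$ to $(0,1)$. To see that there is no fixed point, note that the only solution to $x/2 = x$ in $\mathbb{R}$ is $x = 0$, which is not in $(0,1)$. (We can also view this graphically by plotting both $y = x/2$ and $y = x$ and checking that they do not intersect in $(0,1)$.)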
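The argument above can be checked numerically. The sketch below bisects on $g(x) = f(x) - x$ to approximate a fixed point of a continuous map from $[0,1]$ to itself, then checks on a grid that $x/2$ has no fixed point in the open interval. The function name `fixed_point`, the tolerance, and the choice of `math.cos` and `x / 2` as test functions are assumptions made for illustration.

```python
import math
from typing import Callable


def fixed_point(f: Callable[[float], float], tol: float = 1e-12) -> float:
    """Approximate a fixed point of a continuous f: [0,1] -> [0,1].

    Bisects on g(x) = f(x) - x, keeping g(lo) >= 0 and g(hi) <= 0,
    which holds at the start because f takes values in [0,1].
    """
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if f(mid) - mid >= 0:
            lo = mid  # g(mid) >= 0, so a zero of g lies in [mid, hi]
        else:
            hi = mid  # g(mid) < 0, so a zero of g lies in [lo, mid]
    return (lo + hi) / 2


# cos maps [0,1] into [cos(1), 1], which is a subset of [0,1].
x_star = fixed_point(math.cos)
print(f"{x_star:.6f}")  # 0.739085

# Open-interval example (assumed here to be f(x) = x/2):
# f(x) < x at every grid point of (0,1), so no grid point is fixed.
grid = [k / 1000 for k in range(1, 1000)]
no_fixed_point_on_grid = all(x / 2 < x for x in grid)
print(no_fixed_point_on_grid)  # True
```

The grid check is only a sanity check. The algebraic argument is what rules out a fixed point in the open interval.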

Topological Fixed Point Exercises

Here is my attempt, based on Hoagy's proof.

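Let $n \ge 1$ be an integer. We are given that $f(0) < 0$ and $f(1) > 0$. Now consider the points $0, \frac{1}{n}, \frac{2}{n}, \ldots, \frac{n-1}{n}, 1$ in the interval $[0,1]$. By 1-D Sperner's lemma, there are an odd number of $j \in \{0, 1, \ldots, n-1\}$ such that $f(j/n) < 0$ and $f((j+1)/n) \ge 0$ (i.e. an odd number of "segments" that begin below zero and end up above zero). In particular, $0$ is an even number, so there must be at least one such number $j$. Choose the smallest and call this number $j_n$.

Now consider the sequence $(j_n/n)_{n \ge 1}$. Since this sequence takes values in $[0,1]$, it is bounded, and by the Bolzano–Weierstrass theorem there must be some subsequence $(j_{n_k}/n_k)_{k \ge 1}$ that converges to some number $x$.

Consider the sequences $(j_{n_k}/n_k)_{k \ge 1}$ and $((j_{n_k}+1)/n_k)_{k \ge 1}$. We have $f(j_{n_k}/n_k) < 0 \le f((j_{n_k}+1)/n_k)$ for each $k$. By the limit laws, $(j_{n_k}+1)/n_k = j_{n_k}/n_k + 1/n_k \to x$ as $k \to \infty$. Since $f$ is continuous, we have $f(j_{n_k}/n_k) \to f(x)$ and $f((j_{n_k}+1)/n_k) \to f(x)$ as $k \to \infty$. Thus $f(x) \le 0$ and $f(x) \ge 0$, showing that $f(x) = 0$, as desired.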
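The construction in this proof can be sketched numerically: for each grid size $n$, find the smallest $j$ whose segment starts below zero and ends at or above zero, then watch $j/n$ approach a root as $n$ grows. The name `first_crossing` and the test function $f(x) = x^2 - 0.5$ are assumptions chosen for illustration. For this monotone example the whole sequence converges, so no subsequence needs to be extracted.

```python
import math
from typing import Callable


def first_crossing(f: Callable[[float], float], n: int) -> int:
    """Smallest j in {0, ..., n-1} with f(j/n) < 0 <= f((j+1)/n).

    Assumes f(0) < 0 < f(1), which guarantees that such a j exists.
    """
    for j in range(n):
        if f(j / n) < 0 <= f((j + 1) / n):
            return j
    raise ValueError("no crossing found; check that f(0) < 0 < f(1)")


def f(x: float) -> float:
    """Hypothetical test function with f(0) < 0 < f(1) and root sqrt(0.5)."""
    return x * x - 0.5


root = math.sqrt(0.5)
for n in (10, 100, 1000, 10000):
    j = first_crossing(f, n)
    # The root lies in [j/n, (j+1)/n], so the error is at most 1/n.
    print(n, j / n, abs(j / n - root))
```

Each printed error is bounded by $1/n$, which mirrors the step in the proof where both endpoints of the chosen segment converge to the same limit.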

Topological Fixed Point Exercises

I'm having trouble understanding why we can't just fix $n = 2$ in your proof. Then at each iteration we bisect the interval, so we wouldn't be using the "full power" of the 1-D Sperner's lemma (we would just be using something close to the base case).

Also, if we are only given that $f$ is continuous, does it make sense to talk about the gradient?