Issa Rice

I am Issa Rice.

Issa Rice's Comments

How special are human brains among animal brains?

It seems like "agricultural revolution" is used to mean both the beginning of agriculture (the "First Agricultural Revolution") and the 18th-century agricultural revolution (the "Second Agricultural Revolution").

What are some exercises for building/generating intuitions about key disagreements in AI alignment?

I have only a very vague idea of what you mean. Could you give an example of how one would do this?

Name of Problem?

I think that makes sense, thanks.

Name of Problem?

Just to make sure I understand, the first few expansions of the second one are:

  • f(n)
  • f(n+1)
  • f((n+1) + 1)
  • f(((n+1) + 1) + 1)
  • f((((n+1) + 1) + 1) + 1)

Is that right? If so, wouldn't the infinite expansion look like f((((...) + 1) + 1) + 1) instead of what you wrote?
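
If it helps, here is a minimal sketch of how I'm picturing the rewriting (assuming the second definition is the non-well-founded recursion f(n) = f(n + 1), which is my reading of the post; the `expand` helper is mine, and the extra parentheses around n + 1 are just an artifact of how the string is built):

```python
# Print the first few expansions of f under the rewrite rule
# f(k) -> f(k + 1); there is no base case, so the rewriting
# continues forever and the argument grows without bound.

def expand(steps: int) -> str:
    """Return the expression after `steps` rewrites of f(k) -> f(k + 1)."""
    arg = "n"
    for _ in range(steps):
        arg = f"({arg} + 1)"
    return f"f({arg})"

for k in range(4):
    print(expand(k))
# f(n)
# f((n + 1))
# f(((n + 1) + 1))
# f((((n + 1) + 1) + 1))
```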

Coherence arguments do not imply goal-directed behavior

I read the post and parts of the paper. Here is my understanding: conditions similar to those in Theorem 2 above don't exist, because Alex's paper doesn't take an arbitrary utility function and prove instrumental convergence; instead, the idea is to set the rewards for the MDP randomly (by sampling i.i.d. from some distribution) and then show that in most cases, the agent seeks "power" (states which allow the agent to obtain high rewards in the future). So it avoids the twitching robot not by saying that it can't make use of additional resources, but by saying that the twitching robot has an atypical reward function. So even though there aren't conditions similar to those in Theorem 2, there are still conditions analogous to them (in the structure of the argument "expected utility/reward maximization + X implies catastrophe"), namely X = "the reward function is typical". Does that sound right?
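
As a sanity check on this reading, here is a toy simulation (the two-action MDP below is invented for illustration; it is not from Alex's paper) of why i.i.d. reward sampling makes power-seeking typical: one action reaches a state with five reachable outcomes, the other a dead end with one, so the optimal policy prefers the high-option state whenever the best of five i.i.d. draws beats a single draw:

```python
# Toy illustration of the "typical reward function" argument as I
# understand it: sample terminal rewards i.i.d. and count how often
# the optimal first move leads to the state with more reachable
# successors ("power").
import random

def optimal_move_seeks_power(n_options: int = 5) -> bool:
    # From the start state: action A reaches a "powerful" state with
    # n_options reachable terminal states (the agent then takes the
    # best one); action B reaches a dead end with a single terminal
    # state. All terminal rewards are i.i.d. Uniform(0, 1).
    best_via_power = max(random.random() for _ in range(n_options))
    via_dead_end = random.random()
    return best_via_power > via_dead_end

trials = 100_000
freq = sum(optimal_move_seeks_power() for _ in range(trials)) / trials
print(f"fraction of sampled rewards where the optimal policy seeks power: {freq:.1%}")
# ~83% (5/6 in expectation): under most sampled reward functions the optimal
# policy goes through the high-option state; twitching-robot-style reward
# functions are the atypical remainder.
```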

Writing this comment reminded me of Oliver's comment where X = "agent wasn't specifically optimized away from goal-directedness".

Coherence arguments do not imply goal-directed behavior

Can you say more about Alex Turner's formalism? For example, are there conditions in his paper or post similar to the conditions I named for Theorem 2 above? If so, what do they say and where can I find them in the paper or post? If not, how does the paper avoid the twitching robot from seeking convergent instrumental goals?

Coherence arguments do not imply goal-directed behavior

One additional source that I found helpful to look at is the paper "Formalizing Convergent Instrumental Goals" by Tsvi Benson-Tilsen and Nate Soares, which tries to formalize Omohundro's instrumental convergence idea using math. I read the paper quickly and skipped the proofs, so I might have misunderstood something, but here is my current interpretation.

The key assumptions seem to appear in the statement of Theorem 2; these assumptions state that using additional resources will allow the agent to implement a strategy that gives it strictly higher utility (compared to the utility it could achieve if it didn't make use of the additional resources). Therefore, any optimal strategy will make use of those additional resources (killing humans in the process). In the Bit Universe example given in the paper, if the agent doesn't terminally care what happens in some particular region (the region is labeled with a letter that, I'm guessing, is supposed to represent where the humans are), but that region contains resources that can be burned to increase utility in other regions, then the agent will burn those resources.
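
Schematically, the shape of the argument as I understand it (my notation, not the paper's: Π_R and Π_¬R are the sets of strategies that do and don't make use of the additional resources R):

```latex
% My paraphrase of the Theorem 2 argument, not the paper's notation:
% if every strategy that ignores the additional resources R is strictly
% beaten by some strategy that uses them, then every optimal strategy
% uses R.
\[
\bigl(\,\forall \pi \in \Pi_{\lnot R}\ \exists \pi' \in \Pi_{R}:\ U(\pi') > U(\pi)\,\bigr)
\;\Longrightarrow\;
\operatorname*{arg\,max}_{\pi \in \Pi_{\lnot R} \cup \Pi_{R}} U(\pi) \subseteq \Pi_{R}
\]
```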

Both Rohin's and Jessica's twitching robot examples seem to violate these assumptions (if we were to translate them into the formalism used in the paper), because the robot cannot make use of additional resources to obtain a higher utility.

For me, the upshot of looking at this paper is something like:

  • MIRI people don't seem to be arguing that expected utility maximization alone implies catastrophe.
  • There are some additional conditions that, when taken together with expected utility maximization, seem to give a pretty good argument for catastrophe.
  • These additional conditions don't seem to have been argued for (or at least, this specific paper just assumes them).

Will AI undergo discontinuous progress?

Rohin Shah told me something similar.

This quote seems to be from Rob Bensinger.

Bayesian Evolving-to-Extinction

I'm confused about what it means for a hypothesis to "want" to score better, to change its predictions to get a better score, to print manipulative messages, and so forth. In probability theory, each hypothesis is just an event, so it is static, cannot perform actions, and so on. I'm guessing you have some other formalism in mind, but I can't tell what it is.

Utility ≠ Reward

To me, it seems like the two distinctions are different. There seem to be three levels to distinguish:

  1. The reward (in the reinforcement learning sense) or the base objective (example: inclusive genetic fitness for humans)
  2. A mechanism in the brain that dispenses pleasure or provides a proxy for the reward (example: pleasure in humans)
  3. The actual goal/utility that the agent ends up pursuing (example: a reflective equilibrium for some human's values, which might have nothing to do with pleasure or inclusive genetic fitness)

The base objective vs mesa-objective distinction seems to be about (1) vs a combination of (2) and (3). The reward maximizer vs utility maximizer distinction seems to be about (2) vs (3), or maybe (1) vs (3).

Depending on the agent that is considered, only some of these levels may be present:

  • A "dumb" RL-trained agent that engages in reward gaming. Only level (1), and there is no mesa-optimizer.
  • A "dumb" RL-trained agent that engages in reward tampering. Only level (1), and there is no mesa-optimizer.
  • A paperclip maximizer built from scratch. Only level (3), and there is no mesa-optimizer.
  • A relatively "dumb" mesa-optimizer trained using RL might have just (1) (the base objective) and (2) (the mesa-objective). This kind of agent would be incentivized to tamper with its pleasure circuitry (in the sense of (2)), but wouldn't be incentivized to tamper with its RL-reward circuitry. (Example: rats wirehead to give themselves MAX_PLEASURE, but don't self-modify to delude themselves into thinking they have left many descendants.)
  • If the training procedure somehow coughs up a mesa-optimizer that doesn't have a "pleasure center" in its brain (I don't know how this would happen, but it seems logically possible), there would just be (1) (the base objective) and (3) (the mesa-objective). This kind of agent wouldn't try to tamper with its utility function (in the sense of (3)), nor would it try to tamper with its RL-reward/base-objective to delude itself into thinking it has high rewards.

ETA: Here is a table that shows these distinctions varying independently:

                                                               Utility maximizer                 Reward maximizer
  Optimizes for base objective (i.e. mesa-optimizer absent)    Paperclip maximizer               "Dumb" RL-trained agent
  Optimizes for mesa-objective (i.e. mesa-optimizer present)   Human in reflective equilibrium   Rats