I am Issa Rice. https://issarice.com/
I have only a very vague idea of what you mean. Could you give an example of how one would do this?
I think that makes sense, thanks.
Just to make sure I understand, the first few expansions of the second one are:
Is that right? If so, wouldn't the infinite expansion look like f((((...) + 1) + 1) + 1) instead of what you wrote?
I read the post and parts of the paper. Here is my understanding: conditions similar to those in Theorem 2 above don't exist, because Alex's paper doesn't take an arbitrary utility function and prove instrumental convergence; instead, the idea is to set the rewards for the MDP randomly (by sampling i.i.d. from some distribution) and then show that in most cases, the agent seeks "power" (states which allow the agent to obtain high rewards in the future). So it avoids the twitching robot not by saying that it can't make use of additional resources, but by saying that the twitching robot has an atypical reward function. So even though there aren't conditions similar to those in Theorem 2, there are still conditions analogous to them (in the structure of the argument "expected utility/reward maximization + X implies catastrophe"), namely X = "the reward function is typical". Does that sound right?
Writing this comment reminded me of Oliver's comment where X = "agent wasn't specifically optimized away from goal-directedness".
Can you say more about Alex Turner's formalism? For example, are there conditions in his paper or post similar to the conditions I named for Theorem 2 above? If so, what do they say and where can I find them in the paper or post? If not, how does the paper avoid the twitching robot from seeking convergent instrumental goals?
One additional source that I found helpful to look at is the paper "Formalizing Convergent Instrumental Goals" by Tsvi Benson-Tilsen and Nate Soares, which tries to formalize Omohundro's instrumental convergence idea using math. I read the paper quickly and skipped the proofs, so I might have misunderstood something, but here is my current interpretation.
The key assumptions seem to appear in the statement of Theorem 2; these assumptions state that using additional resources will allow the agent to implement a strategy that gives it strictly higher utility (compared to the utility it could achieve if it didn't make use of the additional resources). Therefore, any optimal strategy will make use of those additional resources (killing humans in the process). In the Bit Universe example given in the paper, if the agent doesn't terminally care what happens in some particular region (I guess they chose this letter because it's supposed to represent where humans are), but contains resources that can be burned to increase utility in other regions, the agent will burn those resources.
Both Rohin's and Jessica's twitching robot examples seem to violate these assumptions (if we were to translate them into the formalism used in the paper), because the robot cannot make use of additional resources to obtain a higher utility.
For me, the upshot of looking at this paper is something like:
Rohin Shah told me something similar.
This quote seems to be from Rob Bensinger.
I'm confused about what it means for a hypothesis to "want" to score better, to change its predictions to get a better score, to print manipulative messages, and so forth. In probability theory each hypothesis is just an event, so is static, cannot perform actions, etc. I'm guessing you have some other formalism in mind but I can't tell what it is.
To me, it seems like the two distinctions are different. There seem to be three levels to distinguish:
The base objective vs mesa-objective distinction seems to be about (1) vs a combination of (2) and (3). The reward maximizer vs utility maximizer distinction seems to be about (2) vs (3), or maybe (1) vs (3).
Depending on the agent that is considered, only some of these levels may be present:
ETA: Here is a table that shows these distinctions varying independently:
Utility maximizer | Reward maximizer | |
---|---|---|
Optimizes for base objective (i.e. mesa-optimizer absent) | Paperclip maximizer | "Dumb" RL-trained agent |
Optimizes for mesa-objective (i.e. mesa-optimizer present) | Human in reflective equilibrium | Rats |
It seems like "agricultural revolution" is used to mean both the beginning of agriculture ("First Agricultural Revolution") and the 18th century agricultural revolution ("Second Agricultural Revolution").