Did you tell your friend the premise behind this interaction out of band?
I think the key question is how much policy-caused variance in the outcome there is.
That is, juice can chain onto itself because we assume there will be a bunch of scenarios where reinforcement-triggering depends a lot on the choices made. If you are near juice but not right in front of it, you can walk up to the juice, which triggers reinforcement, or you can not walk up to the juice, which doesn't. The fact that these two plausible actions differ in the reinforcement they trigger is what I mean by "policy-caused variance".
If you are told not to kill someone, then you can usually stay sufficiently far away from killing anyone, in such a way that the variance in killing is 0, because killing itself has a constant value of 0. (Unlike juice-drinking, which might have a value of 0 or 1, depending on both the scenario and your actions.)
But you could also have a case with a positive value that fails to be robust/lasting, because the positive value fails to have variance. One example would be if the positive value is bounded and always achieved; for instance, if you always carry a juice dispenser, then juice-drinking always has a value of 1, and again there isn't any variance to reinforce seeking juice.
A more subtle point is that if the task is too difficult, e.g. if you are in a desert with no juice available, then the policy-caused variance is also 0, because juice-drinking is a constant 0. This is a general point that encompasses many failures of reinforcement learning: often, if you don't do things like reward shaping, reinforcement learning simply fails because the task is too complex to learn.
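To make this concrete, here is a toy sketch (numbers entirely made up) of the four cases above, computing the variance of the reinforcement across the actions available in each scenario:

```python
import statistics

# Toy sketch: the reinforcement each available action would trigger in four
# scenarios. All numbers are made up purely to illustrate "policy-caused variance".
scenarios = {
    "juice nearby":         {"walk to juice": 1, "stay put": 0},          # choice matters
    "don't kill anyone":    {"stay far away": 0, "stay even further": 0}, # constant 0
    "carrying a dispenser": {"drink": 1, "drink more": 1},                # constant 1
    "desert, no juice":     {"wander left": 0, "wander right": 0},        # constant 0
}

for name, actions in scenarios.items():
    variance = statistics.pvariance(actions.values())
    print(f"{name}: policy-caused variance = {variance}")

# Only "juice nearby" has nonzero variance, so only there does the choice of
# policy change what gets reinforced; the other three give the learner nothing
# to latch onto.
```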
I think future RL algorithms will be able to succeed with less policy-caused variance than present RL algorithms require, by using models to track outcomes through many layers of interaction.
I don't think so.
Like technically yes, it shows that there is an internal optimization process running in the networks, but much of the meat of optimization, such as instrumental convergence/power-seeking, depends on the structure of the function one is optimizing over.
If the function is not consequentialist - if it doesn't attempt to compute the real-world consequences of different outputs and grade things based on those consequences - then much of the discussion about optimizers does not apply.
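As a purely illustrative sketch of the distinction (all names and numbers here are invented, not from any real system):

```python
from dataclasses import dataclass

# Illustrative toy only; the two graders below are hypothetical stand-ins.

@dataclass
class ToyWorld:
    resources_controlled: int
    def utility(self) -> float:
        return float(self.resources_controlled)

def surface_grader(output: str) -> float:
    # Non-consequentialist: scores the output itself (here, closeness to 20 characters).
    return -abs(len(output) - 20)

def consequentialist_grader(output: str) -> float:
    # Consequentialist: crudely "predicts" the output's effect on the world,
    # then scores the predicted world-state.
    predicted = ToyWorld(resources_controlled=output.count("acquire"))
    return predicted.utility()

print(surface_grader("acquire acquire acquire"))          # cares only about the text
print(consequentialist_grader("acquire acquire acquire")) # cares about predicted consequences
```

An optimizer pointed at the first kind of function has nothing to gain from affecting the world beyond the text it emits; arguments like instrumental convergence only start to bite for the second kind.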
I would very much assume that you have a strong genetic predisposition to be smart and curious.
Do you think unschooling would work acceptably well for kids who are not smart and curious?
In your view, are values closer to recipes/instruction manuals than to utility functions?
For your quiz, could you give an example of something that is grader-optimization but which is not wireheading?
This is a nice frame of the problem.
In theory, at least. It's not so clear that there are any viable alternatives to argmax-style reasoning that will lead to superhuman intelligence.
I think this is one proto-intuition for why Goodhart-arguments seem Wrong to me, like they're from some alien universe where we really do have to align a non-embedded argmax planner with a crisp utility function. (I don't think I've properly communicated my feelings in this comment, but hopefully it's better than nothing.)
My intuition is that in order to go beyond imitation learning and random exploration, we need some sort of "iteration" system (a la IDA), and the cases of such systems that we know of tend to either literally be argmax planners with crisp utility functions, or have similar problems to argmax planners with crisp utility functions.
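For concreteness, here is a minimal sketch (a toy example, with a stand-in world model and made-up names) of the "argmax planner with a crisp utility function" shape I have in mind:

```python
from itertools import product

ACTIONS = ["left", "right", "wait"]

def predict_outcome(plan):
    # Stand-in world model: the outcome is just how many "right" moves the plan makes.
    return sum(a == "right" for a in plan)

def utility(outcome) -> float:
    # Crisp utility function: a fixed number for every predicted outcome.
    return float(outcome)

def argmax_planner(horizon: int):
    # Enumerate every candidate plan and return the single argmax.
    return max(product(ACTIONS, repeat=horizon),
               key=lambda plan: utility(predict_outcome(plan)))

print(argmax_planner(horizon=3))  # ('right', 'right', 'right')
```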
A nuclear bomb steers a lot of far-away objects into a high-entropy configuration, and does so very robustly, but that perhaps is not a "small part of the state space".
This example reminds me of something I have been thinking about, namely that it seems like optimization can only occur in cases where the optimization produces/is granted enough "energy" to control the level below. Here the model works quite literally, since a nuclear bomb floods an area with energy, and I think the example generalizes to e.g. markets with Dutch disease.
Flooding the lower level with "energy" is presumably not the only way this problem can occur; a lack of incentives/credit assignment at the upper level produces the same result, simply because without incentives the upper level does not allocate "energy" to the area.
Counter: The human source code won't contain general-purpose search; it will contain something like babble-and-prune or gradient descent over trajectories, and so there is also the question of what to do about the general-purpose search component.
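A minimal sketch (my own toy example, with a made-up objective) of the babble-and-prune shape, as opposed to an explicit general-purpose search: no model, no planning, just mutation and selection against a score.

```python
import random

def score(x: float) -> float:
    return -(x - 3.0) ** 2  # made-up objective, peaked at x = 3

def babble_and_prune(steps: int = 50, keep: int = 5, branch: int = 4) -> float:
    candidates = [random.uniform(-10, 10) for _ in range(keep)]
    for _ in range(steps):
        # Babble: generate noisy variations of the surviving candidates.
        candidates = [c + random.gauss(0, 0.5)
                      for c in candidates for _ in range(branch)]
        # Prune: keep only the best few.
        candidates = sorted(candidates, key=score, reverse=True)[:keep]
    return candidates[0]

print(babble_and_prune())  # ends up near 3.0 without any explicit model or plan
```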