LW1.0 username Manfred. Day job is condensed matter physics, hobby is thinking I know how to assign anthropic probabilities.
I almost agree, but still ended up disagreeing with a lot of your bullet points. Since reading your list was useful, I figured it would be worthwhile to just make a parallel list. ✓ for agreement, × for disagreement (• for neutral).
✓ I think we're confused about what we really mean when we talk about human values.
× But our real problem is on the meta-level: we want to understand value learning so that we can build an AI that learns human values even without starting with a precise model waiting to be filled in.
_ × We can trust AI to discover that structure for us even though we couldn't verify the result, because the point isn't getting the right answer, it's having a trustworthy process.
_ × We can't just write down the correct structure any more than we can just write down the correct content. We're trying to translate a vague human concept into precise instructions for an AI.
✓ Agree with extensional definition of values, and relevance to decision-making.
• Research on the content of human values may be useful information about what humans consider to be human values. I think research on the structure of human values is in much the same boat - information, not the final say.
✓ Agree about Stuart's work being where you'd go to write down a precise set of preferences based on human preferences, and that the problems you mention are problems.
✓ Agree with assumptions.
• I think the basic model leaves out the fact that we're changing levels of description.
_ × Merely causing events (in the physical level of description) is not sufficient to say we're acting (in the agent level of description). We need some notion of "could have done something else," which is an abstraction about agents, not something fundamentally physical.
_ × Similar quibbles apply to the other parts - there is no physically special decision process, we can only find one by changing our level of description of the world to one where we posit such a structure.
_ × The point: Everything in the basic model is a statistical regularity we can observe over the behavior of a physical system. You need a bit more nuanced way to place preferences and meta-preferences.
_ • The simple patch is to just say that there's some level of description where the decision-generation process lives, and preferences live at a higher level of abstraction than that. Therefore preferences are emergent phenomena from the level of description the decision-generation process is on.
_ _ × But I think if one applies this patch, then it's a big mistake to use loaded words like "values" to describe the inputs (all inputs?) to the decision-generation process, which are, after all, at a level of description below the level where we can talk about preferences. I think this conflicts with the extensional definition from earlier.
× If we recognize that we're talking about different levels of description, then preferences are not either causally after or causally before decisions-on-the-basic-model-level-of-abstraction. They're regular patterns that we can use to model decisions at a slightly higher level of abstraction.
_ • How to describe self-aware agents at a low level of abstraction then? Well, time to put on our GEB hats. The low level of abstraction just has to include a computation of the model we would use on the higher level of abstraction.
✓ Despite all these disagreements, I think you've made a pretty good case that the human brain plausibly computes a single currency (valence) that it uses to rate both most decisions and most predictions.
_ × But I still don't agree that this makes valence human values. I mean values in the sense of "the cluster we sometimes also point at with words like value, preference, affinity, taste, aesthetic, intention, and axiology." So I don't think we're left with a neuroscience problem, I still think what we want the AI to learn is on that higher level of abstraction where preferences live.
I'll probably post a child comment after I actually read the article, but I want to note before I do that I think the power of ResNets is evidence against these claims. Having super-deep networks with residual connections promotes a picture that looks much more like a continuous "massaging" of the data than a human-friendly decision tree.
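As a toy illustration of that picture (a minimal sketch of my own, not any particular published ResNet), each residual block computes x + f(x), so with modest weights every layer only nudges the representation rather than replacing it:

```python
import numpy as np

rng = np.random.default_rng(0)

def residual_block(x, W, scale=0.1):
    # A residual block returns x + f(x): the layer can only nudge its
    # input, not replace it wholesale.
    return x + scale * np.tanh(W @ x)

x = rng.normal(size=8)
h = x.copy()
Ws = [rng.normal(size=(8, 8)) / np.sqrt(8) for _ in range(50)]
for W in Ws:
    h = residual_block(h, W)

# Each block moves the representation by at most scale * sqrt(dim), so
# 50 stacked blocks act like a gradual "massaging" of the data rather
# than a sequence of discrete, tree-like decisions.
```

The `scale` parameter and depth are arbitrary choices for the sketch; the point is only that the output stays a small perturbation away from each layer's input.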
Right. Some intuition is necessary. But a lot of these choices are ad hoc, by which I mean they aren't strongly constrained by the result you want from them.
For example, you have a linear penalty governed by this parameter lambda, but in principle it could have been any old function - the only strong constraint is that you want it to monotonically increase from a finite number to infinity. Now, maybe this is fine, or maybe not. But I basically don't have much trust for meditation in this sort of case, and would rather see explicit constraints that rule out more of the available space.
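To make the underdetermination concrete, here's a quick sketch (the candidate functions are my own picks, not from the post) of several penalties that all satisfy "monotonically increase from a finite number to infinity," of which the linear one is just a single member:

```python
import math

# Several penalty shapes, all monotone from a finite value at 0 toward
# infinity; the stated constraint alone doesn't single out the linear one.
candidates = {
    "linear":      lambda x: 1.0 * x,          # the post's choice (lambda = 1 here)
    "quadratic":   lambda x: x ** 2,
    "exponential": lambda x: math.exp(x) - 1.0,
    "barrier":     lambda x: float("inf") if x >= 10.0 else -math.log(1.0 - x / 10.0),
}

# Spot-check monotonicity on a few points.
for name, f in candidates.items():
    vals = [f(x) for x in (0.0, 1.0, 2.0, 3.0)]
    assert vals == sorted(vals), name
```

Any of these passes the constraint, so picking the linear form is a modeling choice, not something forced by the result you want.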
My very general concern is that strategies that maximize R_AUP might be very... let's say creative, and your claims are mostly relying on intuitive arguments for why those strategies won't be bad for humans.
I don't really buy the claim that if you've been able to patch each specific problem, we'll soon reach a version with no problems - the exact same inductive argument you mention suggests that there will just be a series of problems, and patches, and then more problems with the patched version. Again, I worry that patches are based a lot on intuition.
For example, in the latest version, because you're essentially dividing out by the long-term reward of taking the best action now, if the best action now is really really good, then it becomes cheap to take moderately good actions that still increase future reward - which means the agent is incentivized to concentrate the power of actions into specific timesteps. For example, an agent might be able to set things up so that it can sacrifice its ability to achieve total future reward of 10^10 to make it cheap to take an action that increases its future reward by 10^8. This might look like sacrificing the ability to colonize distant galaxies in order to gain total control over the Milky Way.
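A toy version of that incentive (a stylized stand-in of my own, not the actual R_AUP penalty): if an action's cost scales like its reward gain divided by the value of the best currently-available action, then inflating that best action makes the same gain look cheap.

```python
def relative_cost(action_gain, best_action_value):
    # Stylized stand-in for a penalty normalized by the long-term reward
    # of the best available action (not the real definition from the post).
    return action_gain / best_action_value

# Ordinarily, grabbing 10**8 of future reward is pricey:
ordinary = relative_cost(10**8, 10**9)

# But after engineering one enormous available option (worth 10**10),
# the same 10**8 grab looks ten times cheaper:
concentrated = relative_cost(10**8, 10**10)
```

So the agent can pay once (giving up the distant galaxies) to make Milky-Way-sized grabs look inexpensive afterwards.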
After a bit more thought, I've learned that it's hard to avoid ending back up with EU maximization - it basically happens as soon as you require that strategies be good not just on the true environment, but on some distribution of environments that reflect what we think we're designing an agent for (or the agent's initial state of knowledge about states of the world). And since this is such an effective tool at penalizing the "just pick the absolute best answer" strategy, it's hard for me to avoid circling back to it.
Here's one possible option, though: look for strategies that are too simple to encode the one best answer in the first place. If the absolute best policy has K-complexity of 10^3 (achievable in the real world by strategies being complicated, or in the multi-armed bandit case by just having 2^1000 possible actions) and your agent is only allowed to start with 10^2 symbols, this might make things interesting.
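A quick counting sketch of why this bites (the numbers are from the comment; the counting argument is mine, reading "symbols" as bits): with at most 10^2 bits of policy, there are far too few programs to name an arbitrary one of 2^1000 arms.

```python
# Number of distinct bit-string policies of length at most 100 bits:
max_policy_bits = 100
n_policies = 2 ** (max_policy_bits + 1) - 1  # sum of 2**L for L = 0..100

# Arms in the bandit from the comment:
n_arms = 2 ** 1000

# Almost no arm can be the output of *any* such short policy, so
# "memorize the one best answer" is simply unavailable to the agent.
fraction_addressable = n_policies / n_arms
```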
Maybe optimality relative to the best performer out of some class of algorithms that doesn't include "just pick the absolute best answer?" You basically prove that in environments with traps, anything that would, absent traps, be guaranteed to find the absolute best answer will instead get trapped. So those aren't actually very good performers.
I just can't come up with anything too clever, though, because the obvious classes of algorithms, like "polynomial time," include the ability to just pick the absolute best answer by luck.
It seems like the upshot is that even weak optimality is too strong, since it has to try everything once. How does one make even weaker guarantees of good behavior that are useful in proving things, without just defaulting to expected utility maximization?
Reflective modification flow: Suppose we have an EDT agent that can take an action to modify its decision theory. It will try to choose based on the average outcome conditional on each possible modification. In some circumstances, EDT agents are doing well, so it will expect to do well by not changing; in other circumstances, maybe it expects to do better conditional on self-modifying to use the Counterfactual Perspective more.
Evolutionary flow: If you put a mixture of EDT and FDT agents in an evolutionary competition where they're playing some iterated game and high scorers get to reproduce, what does the population look like at large times, for different games and starting populations?
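The evolutionary flow is easy to sketch with replicator dynamics. The payoffs below are invented for illustration (FDT agents coordinating well with each other, EDT agents middling otherwise), not derived from any particular iterated game:

```python
# Invented payoffs: row player's score against column player.
PAYOFF = {
    ("FDT", "FDT"): 3.0,
    ("FDT", "EDT"): 1.0,
    ("EDT", "FDT"): 1.0,
    ("EDT", "EDT"): 2.0,
}

def step(x_fdt, dt=0.1):
    """One replicator-dynamics step on the FDT population share."""
    x_edt = 1.0 - x_fdt
    f_fdt = PAYOFF[("FDT", "FDT")] * x_fdt + PAYOFF[("FDT", "EDT")] * x_edt
    f_edt = PAYOFF[("EDT", "FDT")] * x_fdt + PAYOFF[("EDT", "EDT")] * x_edt
    avg = x_fdt * f_fdt + x_edt * f_edt
    return x_fdt + dt * x_fdt * (f_fdt - avg)

x = 0.5  # start with a 50/50 mix
for _ in range(500):
    x = step(x)
# With these payoffs the mixed equilibrium at x = 1/3 is unstable: a
# 50/50 start drifts toward all-FDT, while a start below 1/3 would
# drift toward all-EDT. The long-run population depends on both the
# game and the starting mix, which is exactly the question being asked.
```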
How much are you thinking about stability under optimization? Most objective catastrophes are also human catastrophes. But if a powerful agent is trying to achieve some goal while avoiding objective catastrophes, it seems like it's still incentivized to dethrone humans - to cause basically the most human-catastrophic thing that's not objective-catastrophic.