LW1.0 username Manfred. Day job is condensed matter physics, hobby is thinking I know how to assign anthropic probabilities.
After a bit more thought, I've learned that it's hard to avoid ending up back at EU maximization - it basically happens as soon as you require that strategies be good not just on the true environment, but on some distribution of environments that reflects what we think we're designing an agent for (or the agent's initial state of knowledge about states of the world). And since this is such an effective tool for penalizing the "just pick the absolute best answer" strategy, it's hard for me to avoid circling back to it.
Here's one possible option, though: look for strategies that are too simple to encode the one best answer in the first place. If the absolute best policy has K-complexity of 10^3 (achievable in the real world by strategies being complicated, or in the multi-armed bandit case by just having 2^1000 possible actions) and your agent is only allowed to start with 10^2 symbols, this might make things interesting.
Maybe optimality relative to the best performer out of some class of algorithms that doesn't include "just pick the absolute best answer"? You basically prove that in environments with traps, anything that would, absent traps, be guaranteed to find the absolute best answer will instead get trapped. So those aren't actually very good performers.
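To make the trap point concrete, here's a toy sketch (my own illustration, not anything from the post): a bandit where one arm is absorbing, and a policy that tries every arm once before exploiting. The function name, the reward numbers, and the "trap yields 0 forever" rule are all assumptions chosen for the example.

```python
def run_explore_then_exploit(rewards, trap_arm, horizon):
    """Pull each arm once in index order, then repeat the best arm seen.

    Assumption for this toy model: pulling `trap_arm` is absorbing, so that
    step and every later step yield 0 reward. Pass trap_arm=None for no trap.
    Returns (total_reward, trapped).
    """
    total = 0.0
    best_arm, best_reward = None, float("-inf")
    trapped = False
    for t in range(horizon):
        if trapped:
            continue  # stuck in the absorbing state, nothing more to gain
        # Exploration phase covers arms 0..len(rewards)-1, then exploit.
        arm = t if t < len(rewards) else best_arm
        if arm == trap_arm:
            trapped = True
            continue
        r = rewards[arm]
        total += r
        if r > best_reward:
            best_arm, best_reward = arm, r
    return total, trapped


# Hypothetical numbers: arm 1 is the best arm, arm 2 is the trap.
rewards = [1.0, 5.0, 0.0, 2.0]
no_trap = run_explore_then_exploit(rewards, trap_arm=None, horizon=100)
with_trap = run_explore_then_exploit(rewards, trap_arm=2, horizon=100)
```

Without the trap, the policy finds arm 1 and collects from it for the rest of the horizon; with the trap, the same "try everything once" behavior is exactly what gets it stuck.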
I just can't come up with anything too clever, though, because the obvious classes of algorithms, like "polynomial time," include the ability to just pick the absolute best answer by luck.
It seems like the upshot is that even weak optimality is too strong, since it has to try everything once. How does one make even weaker guarantees of good behavior that are useful in proving things, without just defaulting to expected utility maximization?
Reflective modification flow: Suppose we have an EDT agent that can take an action to modify its decision theory. It will try to choose based on the average outcome conditioned on taking the different decision. In some circumstances, EDT agents are doing well so it will expect to do well by not changing; in other circumstances, maybe it expects to do better conditional on self-modifying to use the Counterfactual Perspective more.
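A minimal sketch of that choice rule, assuming the agent's beliefs are given as a joint distribution over (action, utility) pairs. The action labels and the probabilities below are made up for illustration; the only thing the code is meant to show is "pick the action with the highest utility conditional on taking it."

```python
from collections import defaultdict


def conditional_expected_utility(joint):
    """Return E[utility | action] for each action.

    `joint` maps (action, utility) -> probability (hypothetical format).
    """
    mass = defaultdict(float)      # P(action)
    weighted = defaultdict(float)  # sum of P(action, utility) * utility
    for (action, utility), p in joint.items():
        mass[action] += p
        weighted[action] += p * utility
    return {a: weighted[a] / mass[a] for a in mass if mass[a] > 0}


def edt_choice(joint):
    """Pick the action whose conditional expected utility is highest."""
    ceu = conditional_expected_utility(joint)
    return max(ceu, key=ceu.get)


# Made-up beliefs: agents that self-modify tend to be found in better outcomes.
joint = {
    ("keep", 10.0): 0.3,
    ("keep", 0.0): 0.2,
    ("self_modify", 10.0): 0.4,
    ("self_modify", 0.0): 0.1,
}
ceu = conditional_expected_utility(joint)
choice = edt_choice(joint)
```

With these numbers the conditional expectations come out to 6 for keeping and 8 for self-modifying, so the agent modifies itself. Flip the correlations and it stays put, which is the "in some circumstances" part.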
Evolutionary flow: If you put a mixture of EDT and FDT agents in an evolutionary competition where they're playing some iterated game and high scorers get to reproduce, what does the population look like at large times, for different games and starting populations?
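One hedged way to poke at this question is two-type replicator dynamics. The sketch below is a standard Euler-stepped replicator equation, not a model of how EDT or FDT agents actually play any particular game: the payoff matrix is a placeholder, and "type 0" / "type 1" are just labels you could attach to the two decision theories once you know their payoffs.

```python
def replicator_step(x, payoff, dt=0.01):
    """One Euler step of two-type replicator dynamics.

    x is the population share of type 0; payoff[i][j] is the payoff to a
    type-i agent paired with a type-j agent.
    """
    f0 = payoff[0][0] * x + payoff[0][1] * (1 - x)  # fitness of type 0
    f1 = payoff[1][0] * x + payoff[1][1] * (1 - x)  # fitness of type 1
    mean = x * f0 + (1 - x) * f1                    # population average
    x_new = x + dt * x * (f0 - mean)
    return min(1.0, max(0.0, x_new))  # keep the share inside [0, 1]


def long_run_share(x0, payoff, steps=20_000, dt=0.01):
    """Iterate the dynamics and return the final share of type 0."""
    x = x0
    for _ in range(steps):
        x = replicator_step(x, payoff, dt)
    return x


# Placeholder payoffs: type 0 does well against itself, badly against type 1.
payoff = [[3.0, 0.0],
          [2.0, 1.0]]
share_from_majority = long_run_share(0.6, payoff)
share_from_minority = long_run_share(0.4, payoff)
```

With this particular matrix the outcome depends on the starting population: type 0 takes over if it starts above half the population and dies out if it starts below. Other games would give a single stable winner or a stable mixture.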
How much are you thinking about stability under optimization? Most objective catastrophes are also human catastrophes. But if a powerful agent is trying to achieve some goal while avoiding objective catastrophes, it seems like it's still incentivized to dethrone humans - to cause basically the most human-catastrophic thing that's not objective-catastrophic.
I'm definitely satisfied with this kind of content.
The names suggest you're classifying decision procedures by what kind of thoughts they have in special cases. But "sneakily" the point is this is relevant because these are the kinds of thoughts they have all the time.
I think the next place to go is to put this in the context of methods of choosing decision theories - the big ones being reflective modification and evolutionary/population level change. Pretty generally it seems like the trivial perspective is unstable under these, but there are some circumstances where it's not.
Thank you for putting all the time and thoughtfulness into this post, even if the conclusion is "nope, doesn't pan out." I'm grateful that it's out here.
I really love the level of detail in this sketch!
I'm mentally substituting continue_t for some question more like "should this debate continue?", because I think the setup you describe keeps going until Amp is satisfied with an answer, which might be never for weak M. It's also not obvious to me that this reward system you describe actually teaches agents to debate between odd and even steps. If there's a right answer that the judge might be convinced of, I think M will be trained to give it no matter the step parity, because when that happens it gets rewarded.
Really, it feels like the state of the debate is more like the state of an RNN, and you're going to end up training something that can make use of that state to do a good job ending debates and making the human response be similar to the model response.
You have an entire copy of the post in the commenting guidelines, fyi :)
What's often going on in unresolvable debates among humans is that there is a vague definition baked into the question, such that there is no "really" right answer (or too many right answers).
E.g. "Are viruses alive?"
To the extent that we've dealt with the question of whether viruses are alive, it's been by understanding the complications and letting go of the need for the categorical thinking that generated the question in the first place. Allowing this as an option seems like it brings back down the complexity class of things you can resolve debates on (though if you count "it's a tie" as a resolution, you might retain the ability to ask questions in PSPACE but just have lots of uninformative ties and only update your own worldview when it's super easy).
For questions of value, though, this approach might not even always work, because the question might be "is it right to take action A or action B," and even if you step back from the category "right" because it's too vague, you still have to choose between action A and action B. But you still have the original issue that the question has too few / too many right answers. Any thoughts on ways to make debate do work on this sort of tricky problem?
When you say the human decision procedure causes human values, what I hear is that the human decision procedure (and its surrounding way of describing the world) is more ontologically basic than human values (and their surrounding way of describing the world).
Our decision procedure is "the reason for our values" in the same way that the motion of electric charge in your computer is the reason it plays videogames (even though "the electric charge is moving" and "it's playing a game" might be describing the same physical event). The arrow between them isn't the most typical causal arrow between two peers in a singular way of describing the world, it's an arrow of reduction/emergence, between things at different levels of abstraction.