Ty for the reply. A few points in response:
"Of course, you might not know which problem your insights allow you to solve until you have the insights. I'm a big fan of constructing stylized problems that you can solve, after you know which insight you want to validate.
"That said, I think it's even better if you can specify problems in advance to help guide research in the field. The big risk, then, is that these problems might not be robust to paradigm shifts (because paradigm shifts could change the set of important problems). If that is your concern, then I think you should probably give object-level arguments that solving auditing games is a bad concrete problem to direct attention to. (Or argue that specifying concrete problems is in general a bad thing.)"
The bigger the scientific advance, the harder it is to specify problems in advance which it should solve. You can and should keep track of the unresolved problems in the field, as Neel does, but trying to predict specifically which unresolved problems in biology Darwinian evolution would straightforwardly solve (or which unresolved problems in physics special relativity would straightforwardly solve) is about as hard as generating those theories in the first place.
I expect that when you personally are actually doing your scientific research you are building sophisticated mental models of how and why different techniques work. But I think that in your community-level advocacy you are emphasizing precisely the wrong thing—I want junior researchers to viscerally internalize that their job is to understand (mis)alignment better than anyone else does, not to optimize on proxies that someone else has designed (which, by the nature of the problem, are going to be bad proxies).
It feels like the core disagreement is that I intuitively believe that bad metrics are worse than no metrics, because they actively confuse people and lead them astray. More specifically, your list of four problems reads to me less like a list of problems and more like a list of things we should expect from an actually-productive scientific field, and getting rid of them would neuter alignment's ability to make progress:
"I think that e.g. RL algorithms researchers have some pretty deep insights about the nature of exploration, learning, etc."
They have some. But so did Galileo. If you'd turned physics into a numbers-go-up field after Galileo, you would have lost most of the subsequent progress, because you would've had no idea which numbers going up would contribute to progress.
I'd recommend reading more about the history of science, e.g. The Sleepwalkers by Koestler, to get a better sense of where I'm coming from.
I strongly disagree. "Numbers-Go-Up Science" is an oxymoron: great science (especially what Kuhn calls revolutionary science) comes from developing novel models or ontologies which can't be quantitatively compared to previous ontologies.
Indeed, in an important sense, the reason the alignment problem is a big deal in the first place is that ML isn't a science which tries to develop deep explanations of artificial cognition, but instead a numbers-go-up discipline.
And so the idea of trying to make (a subfield of) alignment more like architecture design, performance optimization or RL algorithms feels precisely backwards—it steers people directly away from the thing that alignment research should be contributing.
I suspect that many of the things you've said here are also true for humans.
That is, we humans often conceptualize ourselves in terms of underspecified identities. Who am I? I'm Richard. What's my opinion on this post? Well, being "Richard" doesn't specify how I should respond to this post. But let me check the cached facts I believe about myself ("I'm truth-seeking"; "I'm polite") and construct an answer which fits well with those facts. A child might start off not really knowing what "polite" means, but still wanting to be polite, and gradually flesh out what that means as they learn more about the world.
Another way of putting this point: being pulled from the void is not a feature of LLM personas. It's a feature of personas. Personas start off with underspecified narratives that fail to predict most behavior (but are self-fulfilling) and then gradually systematize to infer deeper motivations, resolving conflicts with the actual drivers of behavior along the way.
What's the takeaway here? We should still be worried about models learning the wrong self-fulfilling prophecies. But the "pulling from the void" thing should be seen less as an odd thing that we're doing with AIs, and more as a claim about the nature of minds in general.
This feels related to the predictive processing framework, in which the classifications of one model are then predicted by another.
More tangentially, I've previously thought about merging cognitive biases and values, since we can view both of them as deviations from the optimal resource-maximizing policy. For example, suppose that I am willing to bet even when I am being Dutch booked. You could think of that as a type of irrationality, or you could think of it as an expression of me valuing being Dutch booked, and therefore being willing to pay to experience it.
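To make the Dutch book concrete (illustrative numbers of my own): suppose I pay $0.60 for a ticket that pays $1 if it rains, and another $0.60 for a ticket that pays $1 if it doesn't rain. Then

$$\text{total stake} = \$1.20, \qquad \text{guaranteed payout} = \$1.00, \qquad \text{guaranteed loss} = \$0.20.$$

The standard reading is that my implied credences sum to more than 1 and are therefore incoherent; the alternative reading is that the guaranteed 20 cents is just the price I pay for the experience of being Dutch booked.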
This is related to the Lacanian/"existential kink" idea that most dysfunctions are actually deliberate, caused by subagents that are trying to pursue some goal at odds with the rest of your goals.
I think my thought process when I typed "risk-averse money-maximizer" was that an agent could be risk-averse (in which case it wouldn't be an EUM) and then separately be a money-maximizer.
But I didn't explicitly think "the relevant kind of risk aversion would be with regard to utility, not money, and an agent that's risk-averse with regard to money can still be risk-neutral with regard to utility", so I appreciate the clarification.
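To spell that out with an illustrative utility function of my own choosing: take an EUM whose utility is the square root of its money. Then

$$\mathbb{E}\left[u(\text{50/50 lottery over } \$0 \text{ and } \$4)\right] = \tfrac{1}{2}\sqrt{0} + \tfrac{1}{2}\sqrt{4} = 1 < \sqrt{2} = u(\$2),$$

so it strictly prefers a guaranteed $2 to a lottery with the same expected amount of money. It's risk-averse with regard to money, but still risk-neutral with regard to utility, because all it ever does is maximize expected utility.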
Your example bet is a probabilistic mixture of two options: $0 and $2. The agent prefers one of the options individually (getting $2) over any probabilistic mixture of getting $0 and $2.
In other words, your example rebuts the claim that an EUM can't prefer a probabilistic mixture of two options to the expectation of those two options. But that's not the claim I made.
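For concreteness, with an illustrative convex utility function of my own choosing, u(x) = x²:

$$\mathbb{E}\left[u(\text{50/50 mixture of } \$0 \text{ and } \$2)\right] = \tfrac{1}{2}\cdot 0 + \tfrac{1}{2}\cdot 4 = 2 > 1 = u(\$1),$$

so an EUM can indeed prefer that mixture to its expectation ($1 for sure), while still preferring $2 outright to any such mixture (since u($2) = 4 exceeds the expected utility of every mixture).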
Hmm, this feels analogous to saying "companies are an unnecessary abstraction in economic theory, since individuals could each make separate contracts about how they'll interact with each other. Therefore we can reduce economics to studying isolated individuals".
But companies are in fact a very useful unit of analysis. For example, instead of talking about the separate ways in which each person in the company has committed to treating each other person in the company, you can talk about the HR policy which governs all interactions within the company. You might then see emergent effects (like political battles over what the HR policies are) which are very hard to reason about when taking a single-agent view.
Similarly, although in principle you could have any kind of graph of which agents listen to which other agents, in practice I expect that realistic agents will tend to consist of clusters of agents which all "listen to" each other in some ways. This is both because clustering is efficient (hence animals having bodies made up of clusters of cells, companies being made up of clusters of individuals, etc.) and because even defining what counts as a single agent involves doing a kind of clustering. That is, I think that the first step in talking about "individual rationality" is implicitly defining which coalitions qualify as individuals.
"a superintelligent AI probably has a pretty good guess of the other AI's real utility function based on its own historical knowledge, simulations, etc."
This seems very unclear to me: in general it's not easy for agents to predict the goals of other agents at their own level of intelligence, because the amount of intelligence aimed at deception increases in proportion to the amount of intelligence aimed at discovering that deception.
(You could look at the AI's behavior from when it was less intelligent, but then—as with humans—it's hard to distinguish sincere change from improvement at masking undesirable goals.)
But regardless, that's a separate point. If you can do that, you don't need your mechanism above. If you can't, then my objection still holds.
One argument for being optimistic: the universe is just very big, and there's a lot to go around. So there's a huge amount of room for positive-sum bargaining.
Another: at any given point in time, few of the agents that currently exist would want their goals to become significantly simplified (all else equal). So there's a strong incentive to coordinate to reduce competition on this axis.
Lastly: if, at each point in time, the agents who are currently alive are in very destructive conflict with potentially-simpler future agents, then they should all just Do Something Else. In particular, if there's some decision-theoretic argument roughly like "more powerful agents should continue to spend some of their resources on the values of their less-powerful ancestors, to reduce the incentives for inter-generational conflict", then even agents with very simple goals might be motivated by it. I call this "the generational contract".
Congratulations on doing this :) More specifically, I think there are two parts to making predictions: identifying a hypothesis at all, and then figuring out how likely that hypothesis is to be true. The former is almost always the hard part, and that's the bit where the "reward reinforces previous computations" frame was most helpful.
(I think Oliver's pushback in another comment is getting strongly upvoted because, given a description of your experimental setup, a bunch of people aside from you/Quintin/Steve would have assigned reasonable probability to the right answer. But I wanted to emphasize that generating an experiment which turns out to be interesting (as your frame did) is the part that deserves most of the points.)