Yes, if predictors can influence the world in addition to making a prediction, they can go make their predictions more accurate. The nice thing about working with predictive models is that by default the only action they can take is making predictions.
AI safety via market making, which Evan linked in another comment, touches on the analogous setting where agents make predictions but can also influence the outcome. You might be interested in reading through it.
Having re-read the posts and thought about it some more, I do think zero-sum competition could be applied to logical inductors to resolve the futarchy hack. It would require minor changes to the formalism to accommodate, but I don't see how those changes would break anything else.
I think the tie-in to market-making, and to other similar approaches like debate, is in interpreting the predictions. While the examples in this post were only for the two-outcome case, we would probably want predictions over orders of magnitude more outcomes for their higher information density. Since evaluating distributions over a double-digit number of outcomes already starts posing problems (sometimes even high single digits), a process to direct a decision maker's attention is necessary.
I've been thinking of a proposal like debate, where both sides go back and forth proposing clusters of outcomes based on shared characteristics. Ideally, in equilibrium, the first debater should propose the smallest number of clusters such that splitting them further doesn't change the decision maker's mind. This could also be thought of in terms of market-making, where rather than the adversary proposing a string, they propose a further subdivision of existing clusters.
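To make that concrete, here's a toy version of the clustering criterion, with heavy simplifying assumptions of my own: clusters are contiguous ranges of outcomes, the decision maker values a cluster at the unweighted mean utility of its members, and "splitting further doesn't change the decision maker's mind" is stood in for by "the coarse decision already matches the decision under the full predicted distributions". The utilities, actions, and distributions are made up, and the adversarial back-and-forth is left out entirely.

```python
from itertools import combinations
import numpy as np

utilities = np.array([0.0, 1.0, 2.0, 10.0, 11.0, 12.0])  # decision maker's utility per outcome
preds = {                                                 # predicted outcome distributions per action
    "A": np.array([0.30, 0.30, 0.30, 0.03, 0.03, 0.04]),
    "B": np.array([0.20, 0.20, 0.20, 0.10, 0.15, 0.15]),
}

def full_info_choice():
    return max(preds, key=lambda a: preds[a] @ utilities)

def coarse_choice(partition):
    # Decision maker sees only cluster probabilities and values each cluster
    # at the unweighted mean utility of its member outcomes.
    def eu(action):
        p = preds[action]
        return sum(p[list(c)].sum() * utilities[list(c)].mean() for c in partition)
    return max(preds, key=eu)

def contiguous_partitions(n, k):
    """All ways of splitting outcomes 0..n-1 into k contiguous clusters."""
    for cuts in combinations(range(1, n), k - 1):
        edges = (0, *cuts, n)
        yield [range(edges[i], edges[i + 1]) for i in range(k)]

n = len(utilities)
target = full_info_choice()
for k in range(1, n + 1):
    matches = [part for part in contiguous_partitions(n, k) if coarse_choice(part) == target]
    if matches:
        print(k, [list(c) for c in matches[0]], "->", target)  # smallest adequate partition
        break
```

With these numbers, a single cluster isn't enough to recover the full-information decision, but two clusters already are.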
I like the use case of understanding predictions for debate/market-making, because the prediction itself acts as a ground truth. Then there's no need to anticipate/reject a ton of counterarguments based on potential lies; instead, arguments are limited to selectively revealing the truth. It is probably important that the predictors are separate models from the analyzer, to avoid contamination of the objectives. The proof of Theorem 6, which skips to the end of the search process, needs to use a non-zero-sum prediction for that result.
As an aside, I also did some early work on decision markets, distinct from your post on market-making, since Othman and Sandholm had an impossibility result for those too. However, the results were ultimately trivial. Once you can use zero-sum competition to costlessly get honest conditional predictions, then as soon as you can pair off entrants to the market it becomes efficient. But then the question arises of why to use a decision market in the first place instead of just querying experts.
With respect to pre-training, I agree that it's not easy to incorporate. I'm not sure how any training regime that only trains on data where the prediction has no effect can imbue incentives that generalize in the desired way to situations where predictions do affect the outcome. If you do get a performative predictor out of pretraining, then as long as it's myopic you might be able to train the performativity out of it in safely controlled scenarios (and if it's not myopic, it's a risk whether it's performative or not). That was part of my reasoning for the second experiment, checking how well performativity could be trained out.
To incorporate this into an ongoing pre-training process, human decisions are likely too expensive, but the human is probably not the important part. Instead, predictions where performativity is possible by influencing simple AI decision makers could be mixed into the pre-training process. Defining a decision problem environment of low or medium complexity is not too difficult, and I suspect previous-generation models would be able to do a good job generating many examples. One danger is that the model only learns not to predict performatively in those specific scenarios (just as untraining afterwards might only apply to the controlled environments), though I think that's a somewhat unnatural generalization.
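As a sketch of the kind of low-complexity environment I mean (all the specifics here are hypothetical, not from any existing setup), the key feature is just that a scripted decision maker acts on the reported probability, and that action feeds back into the outcome the prediction is scored against:

```python
import random

def run_episode(reported_prob, rng):
    # Simple AI decision maker: intervene whenever the predicted risk is high.
    intervene = reported_prob > 0.5
    # The intervention changes the true probability of the event, so the
    # prediction can influence the very thing it is about.
    true_prob = 0.3 if intervene else 0.8
    outcome = rng.random() < true_prob
    # Brier-style score for the report against the realized outcome.
    score = -(reported_prob - float(outcome)) ** 2
    return intervene, outcome, score

rng = random.Random(0)
print(run_episode(0.8, rng))
print(run_episode(0.3, rng))
```

Episodes like these give the model plenty of opportunities to shade its reports to influence the decision maker, which is exactly the behavior we'd want the training signal not to reward.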
Good question! These scoring rules do also prevent agents from trying to make the environment more unpredictable. In the same way that making the environment more predictable benefits all agents equally and so cancels out, making the environment less predictable hurts all agents equally and so cancels out in a zero-sum competition.
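Here's a quick numerical illustration of that cancellation, using log scoring and taking each agent's payoff to be its score minus the mean of the other agents' scores (a simplified stand-in for the full construction):

```python
import numpy as np

def log_score(p, outcome):
    """Logarithmic proper score for predicted distribution p and realized outcome."""
    return np.log(p[outcome])

def zero_sum_payoffs(scores):
    """Each agent's payoff is its score minus the mean of the other agents' scores."""
    scores = np.asarray(scores, dtype=float)
    others_mean = (scores.sum() - scores) / (len(scores) - 1)
    return scores - others_mean

outcome = 0

# Two agents with the same beliefs about a binary event.
p = np.array([0.7, 0.3])
baseline = zero_sum_payoffs([log_score(p, outcome), log_score(p, outcome)])

# Now suppose an agent could act to make the event more predictable (or less,
# by pushing the probability towards 0.5). Both raw scores shift by the same
# amount, so the zero-sum payoffs don't move at all.
p_sharp = np.array([0.95, 0.05])
sharpened = zero_sum_payoffs([log_score(p_sharp, outcome), log_score(p_sharp, outcome)])

print(baseline, sharpened)  # [0. 0.] [0. 0.] -- no net gain from changing predictability
```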
I'll take a look at the linked posts and let you know my thoughts soon!
Thanks for the clarification, I'll think more about it that way and how it relates to corrigibility.
I don't think we have the right tools yet to make an AI take actions that are low impact and reversible, but if we can develop them, the plan as I see it would be to implement those properties to avoid manipulation in the short term, and to use that time to go from a corrigible AI to a fully aligned one.
The backflip example does not strike me as very complex, but the crucial difference, and the answer to your question, is that training procedures do not teach a robot to do every kind of backflip, just a subset. This is important because when we reverse it, we want non-manipulation to cover the entire set of manipulations. I think it's probably feasible to have an AI not manipulate us via one particular type of manipulation.
On a separate note, could you clarify what you mean by "anti-natural"? I'll keep in mind your previous caveat that it's not definitive.
Great questions!
When I say straightforwardly, I mean when using end states that only include the information available at the time. If we define the end state to also include the history that led to it, then there exists a set of preferences over them that ranks all end states with histories that include manipulation below the ones that don't. The issue, of course, is that we don't know how to specify all the types of manipulation that a superintelligent AI could conceive of.
The gridworld example is a great demonstration of this, because while we can't reflect the preferences as a ranking of just the end states, the environment is simple enough that you can specify all the paths you don't want to take to them. I don't think it really matters whether you call that "anti-naturality that can be overcome with brute force in a simple environment" or just "not anti-naturality".
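As a toy illustration of what I mean (my own stand-in, not the gridworld from the linked post): two trajectories end in the exact same state, but the return depends on the path taken, which no ranking over end states alone can express.

```python
# A small enough environment lets us enumerate the routes we don't want,
# even though the preference can't be written as a function of the end state.
GOAL = (2, 2)
FORBIDDEN_CELLS = {(1, 1)}  # stand-in for "manipulative" routes to the goal

def trajectory_return(trajectory):
    """Reward depends on the whole path, not just the final state."""
    if trajectory[-1] != GOAL:
        return 0.0
    if any(cell in FORBIDDEN_CELLS for cell in trajectory):
        return -1.0  # reaching the goal via a disallowed path ranks below not reaching it at all
    return 1.0

clean_path   = [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2)]
tainted_path = [(0, 0), (1, 0), (1, 1), (2, 1), (2, 2)]

print(trajectory_return(clean_path), trajectory_return(tainted_path))  # 1.0 -1.0
```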
Hi Adrià, thanks for the comment! (Accidentally posted mid-writing, will edit to respond fully)
> Probabilistic policy?
Once we have the head estimating the Q-function, we can sample actions from the policy and sum the product of their Q-values and their probability of being chosen to get an estimate of the state value alone. You can then calculate advantages for all of the sampled actions (maybe dropping them from the weighted average used to estimate state value first), and update the policy towards actions predicted to do well. Does that make sense, or am I skipping something that you think leads to the difficulty of updating the policy?
For LLMs in particular, you don't actually need the Q-value estimator, you can just use a state value estimator and apply it before and after the sequences of tokens representing actions are taken.
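In case a concrete version helps, here's a minimal sketch of the Q-head variant for a single state with a small discrete action set. The Q-values are made-up fixed targets (a real setup would be estimating them too), and for simplicity the sampled actions are kept in the weighted average rather than dropped:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
num_actions = 4

# Stand-ins for the two heads: a learnable policy over a small action set and
# fixed Q-value estimates for the current state (hypothetical numbers).
policy_logits = torch.zeros(num_actions, requires_grad=True)
q_values = torch.tensor([1.0, 0.2, -0.5, 0.8])

optimizer = torch.optim.SGD([policy_logits], lr=0.1)

for _ in range(200):
    probs = F.softmax(policy_logits, dim=-1)

    # Sample a batch of actions from the current policy.
    actions = torch.multinomial(probs.detach(), 8, replacement=True)

    # State value estimate: probability-weighted sum of Q-values.
    v_estimate = (probs.detach() * q_values).sum()

    # Advantage of each sampled action relative to the state value.
    advantages = q_values[actions] - v_estimate

    # Update the policy towards actions predicted to do better than average.
    loss = -(advantages * torch.log(probs[actions])).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print(F.softmax(policy_logits, dim=-1))  # probability mass shifts towards the higher-Q actions
```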
> Safety during training
We can start with a pretrained model that we think contains a good world model to speed up the process significantly. I agree that there might be many training steps needed before the model behaves desirably, and that training outside a simulation has difficulties, but that seems like a general critique of training AGI rather than specific to this method.
> Are RL agents really necessarily CDT?
I agree that LLM agents can just choose to follow non-CDT decision theories. I think this will be selected against by default in training, but if it's not, we can explicitly train against it, e.g. fine-tune on CDT behavior or add CDT to a Constitutional AI's constitution. I am concerned that wouldn't be robust, but it seems like an obvious first step.
> The model might ignore the reward you put in
Yes, I think models are not optimizing for the reward (or anything). If models are not optimizing for anything, incorrigibility is less of a threat, since much of the pressure towards it comes from the instrumental incentive to preserve a goal. However, I'm worried that future models will become more goal-directed to improve performance. Regardless of whether models are goal-directed, the corrigibility-transformed rewards are very consistent in reinforcing corrigible behavior, which is ultimately what we want.
I appreciate you taking the time to read and engage with my post!