The goal of this blog post is to explore the intuition that an agent who concludes that they should pay Pascal's mugger is being unreasonable, and to ask whether this intuition can be tied to the analysis of a logical inductor as an algorithm that no efficiently computable trader can exploit. Analysing this intuition may help us to design the decision theory of a future Friendly AI so that it is not vulnerable to being Pascal-mugged, which is plausibly desirable if we want its behaviour to be aligned with our values.

Consider the following exchange that I recently had with another participant at a MIRI workshop.

"Do you know Amanda Askell's work? She takes Pascal's wager seriously. If there really are outcomes in the outcome space with infinite utility and non-infinitesimal probability, then that deserves your attention."

"Oh, but that's very Goodhartable."

A closely related thought is that this line of reasoning opens you up to exploitation by agents not aligned with your core values. This is plausibly the main driver behind the intuition that paying Pascal's mugger is unreasonable: a decision theory like that makes you too exploitable.
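To make the exploitability worry concrete, here is a toy sketch in Python. It is not taken from any particular decision theory. The function names, the cost of 5 units, and above all the assumption that the agent's credence in the mugger's claim stays fixed no matter how large the claim becomes are all illustrative assumptions. It uses finite payoffs rather than the infinite utilities mentioned in the conversation above.

```python
from fractions import Fraction


def naive_pays(cost, claimed_payoff, credence):
    """Naive rule (assumed for illustration): pay whenever the
    expected gain, credence * claimed_payoff, exceeds the cost."""
    return credence * claimed_payoff > cost


def min_claim_to_extract(cost, credence):
    """Smallest integer payoff the mugger must promise so that the
    naive rule says 'pay'. Assumes credence is positive and does NOT
    shrink as the promised payoff grows."""
    return int(cost / credence) + 1


cost = 5  # hypothetical amount the mugger demands
for credence in (Fraction(1, 10**3), Fraction(1, 10**9), Fraction(1, 10**30)):
    claim = min_claim_to_extract(cost, credence)
    # However small the (fixed) credence, a large enough claim wins.
    print(credence, claim, naive_pays(cost, claim, credence))
```

Under these assumptions the mugger never needs the claim to be true. They only need to name a number large enough to outrun the agent's scepticism, which is the sense in which the naive rule is exploitable.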

MIRI's work on logical induction identifies the key desirable property of a logical inductor: it produces a sequence of belief states, evolving over time, that is not exploitable by any efficiently computable trader.

A trader studying the history of your belief states will not be able to identify any pattern that they can exploit with a polynomial-time computable algorithm: when such a pattern emerges among the sentences you have already proved, you make the necessary "market correction" in your own belief states so that the pattern does not leave you exploitable. Thus, being a good logical inductor is related to having patterns of behaviour that are not exploitable. We see here some hint of a connection with the notion that being resilient to Pascal's mugging is related to not being exploitable.

A plausibly useful research project would be to explore whether some similar formalisable notion in an AI's decision theory could capture the essence of the thought behind "Oh, Pascal's-mugging reasoning looks too suspicious". Perhaps the desired line of reasoning is: "A regular habit of being persuaded by Pascal-mugging-style reasoning would make me vulnerable to exploitation by agents not aligned with my core values, independently of whether such exploitation is in fact likely to occur given my beliefs about the universe. If I can see that an argument for a particular conclusion about which action to take manifests that kind of vulnerability, I ought to invest more computing time before actually taking the action". It is not hard to imagine that natural selection might have been a driver for such a heuristic, and this may explain why Pascal-mugging-style reasoning "feels suspicious" to humans even when they haven't yet constructed the decision-theoretic argument that yields the conclusion not to pay the mugger. If we can build a similar heuristic into an AI's decision theory, this may help to filter out "unintended interpretations" of what kind of behaviour would optimise what humans care about, such as large-scale wireheading, or investing all of its computing resources into figuring out some way to hack the laws of physics to produce infinite utility despite a very low probability of a positive pay-off.
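As a very crude illustration of what such a heuristic might look like, the following Python sketch flags a decision for further deliberation when most of its expected utility comes from outcomes of tiny probability. This is only one possible proxy for "manifests that kind of vulnerability", not a proposed formalisation. The function names and the thresholds `tiny_prob` and `max_share` are hypothetical choices made for the example.

```python
def expected_utility(outcomes):
    """outcomes: list of (probability, utility) pairs for one action."""
    return sum(p * u for p, u in outcomes)


def needs_more_deliberation(outcomes, tiny_prob=1e-6, max_share=0.5):
    """Return True if outcomes with probability below tiny_prob account
    for more than max_share of the total absolute expected utility.
    Both thresholds are arbitrary illustrative values."""
    total = sum(abs(p * u) for p, u in outcomes)
    if total == 0:
        return False
    tail = sum(abs(p * u) for p, u in outcomes if p < tiny_prob)
    return tail / total > max_share


# Hypothetical numbers: pay 5 units for a one-in-a-billion chance of 1e12.
mugging = [(1e-9, 1e12), (1 - 1e-9, -5.0)]
# An ordinary even-odds bet for comparison.
ordinary = [(0.5, 10.0), (0.5, -5.0)]

print(expected_utility(mugging) > 0, needs_more_deliberation(mugging))
print(expected_utility(ordinary) > 0, needs_more_deliberation(ordinary))
```

Note that the check does not reject the action. In keeping with the reasoning quoted above, it only signals that more computing time should be invested before acting.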

We also see here some hint of a connection between the insights that drove the "correct solution" to what a logical inductor should be and the "correct solution" to what the best decision theory should be. Exploring whether this is in some way generalisable to other domains may also be a line of thought that deserves further examination.
