It's obvious that you intend this as requiring research, including making good conceptual choices, rather than having a fixed answer. However, I'm going to speak from my current understanding of predictive processing.

I'm quite interested in your (John's) take on how the following differs from what you had in mind.

I believe there are several possible answers based on different ways of using predictive-processing-associated ideas.

**A. Soft-max decision-making.**

One thing I've seen in a presentation on this stuff is the claim of a close connection between probability and utility, namely **u=log(p)**.

This relates to a very common approximate model of bounded rationality: you introduce some randomness, but make worse mistakes less probable, by making actions exponentially more probable as their utility goes up. The level of rationality can be controlled by a "temperature" parameter -- higher temperature means more randomness, lower temperature means closer to just always taking the max.
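As a minimal sketch of that soft-max rule (the function name, the action labels, and the utility values are all made up for illustration), action probabilities are proportional to exp(utility / temperature):

```python
import math

def softmax_policy(utilities, temperature=1.0):
    """Return action probabilities proportional to exp(utility / temperature)."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    # Subtract the max before exponentiating for numerical stability.
    top = max(utilities.values())
    weights = {a: math.exp((u - top) / temperature) for a, u in utilities.items()}
    total = sum(weights.values())
    return {a: w / total for a, w in weights.items()}

# Hypothetical utilities for three actions (illustrative numbers only).
utilities = {"good": 2.0, "okay": 1.0, "bad": 0.0}

for t in (10.0, 1.0, 0.1):
    policy = softmax_policy(utilities, temperature=t)
    print(t, {a: round(p, 3) for a, p in policy.items()})
```

At high temperature the three probabilities come out close to uniform; at low temperature nearly all of the mass sits on the highest-utility action.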

The **u=log(p)** idea takes that "approximation" as *definitional*; action probabilities are revealed preferences, from which we can find utilities by taking logarithms.
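To spell out the algebra, here is a sketch in my own notation (the symbols $T$ for temperature and $Z$ for the normalizer are assumptions, not something from the presentation). If actions are chosen by the soft-max rule, then

$$
p(a) = \frac{\exp\big(u(a)/T\big)}{Z}, \qquad Z = \sum_{a'} \exp\big(u(a')/T\big),
$$

and taking logarithms gives

$$
u(a) = T \log p(a) + T \log Z .
$$

At $T = 1$ this is $u(a) = \log p(a)$ plus a constant that is the same for every action, and since adding a constant to all utilities does not change the preferences they represent, the constant can be dropped.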

The randomness can be interpreted as exploration. I don't personally see that interpretation as very good, since this form of randomness does not vary based on model uncertainty, but there may be justifications I'm not aware of.

The stronger attempt to justify the randomness, in my book, is based on *Monte Carlo inference*. However, that's better discussed under the next heading.

**B. Sampling from wishful thinking.**

If you were to construct an agent by the formula from option (A), you would first define the agent's beliefs and desires in the usual Bayesian way. You'd then calculate expected utilities for events in the normal way. You only depart from standard Bayesian decision-making at the last step, where you randomize rather than just taking the best action.

The implicit promise of the **u=log(p)** formula is to provide a deeper unification of belief and value than that, and correspondingly, a deeper restructuring of decision theory.

One commonly discussed proposal is as follows: *condition on success, then sample from the resulting distribution on actions.* (You don't necessarily have a binary notion of "success" if you attach real-valued utilities to the various outcomes, but there is a generalization where we condition on "utility being high" without exactly specifying how high it is. This will involve the same "temperature" parameter mentioned earlier.)

The technical name for this idea is "planning by inference", because we can use algorithms for Monte Carlo inference to sample actions. We're using inference algorithms to plan! That's a useful unification of utility and probability: machinery previously used for one purpose is now used for both.

It also kinda captures the intuition you mentioned, about restricting our world-model to assume some stuff we want to be true:

> Abstracting out the key idea: we pack all of the complicated stuff into our world-model, hardcode some things into our world-model which we *want* to be true, then generally try to make the model match reality.

However, planning-by-inference can cause us to take some pretty dumb-looking actions.

For example, let's say that we need $200 for rent money. For simplicity, we have binary success/failure: either we get the money we need, or not. We have $25 which we can use to gamble, for a 1/16th chance of making the $200 we need. Alternately, we happen to know tomorrow's winning lotto numbers, which we can enter in for a 100% chance of getting the money we need.

However, suppose that under the prior, where we take actions at random, there is only a one-in-a-million chance that we enter the winning lotto numbers.

Conditioning on our success, it's much more probable that we gamble with our $25 and get the money we need that way.

So planning-by-inference is heavily biased toward plans of action which are *not too improbable in the prior before conditioning on success*.
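The rent example can be worked through numerically. This is only a sketch: the 1/16 and one-in-a-million figures come from the example above, but the prior probability of choosing to gamble (0.5) and the catch-all "other" action that never succeeds are assumptions I've added so that the prior sums to one.

```python
def condition_on_success(prior, p_success):
    """Bayes update: P(action | success) is proportional to P(action) * P(success | action)."""
    joint = {a: prior[a] * p_success[a] for a in prior}
    total = sum(joint.values())
    return {a: w / total for a, w in joint.items()}

# Prior over actions when acting at random.
# "lotto" = 1e-6 is from the example; "gamble" = 0.5 and "other" are assumed.
prior = {"lotto": 1e-6, "gamble": 0.5, "other": 0.5 - 1e-6}

# Chance of ending up with the rent money given each action.
p_success = {"lotto": 1.0, "gamble": 1 / 16, "other": 0.0}

posterior = condition_on_success(prior, p_success)
for action, prob in posterior.items():
    print(f"{action}: {prob:.6f}")
```

Under these assumed numbers, almost all of the posterior mass lands on the gamble, even though it succeeds only 1/16 of the time, because the lotto entry starts out so improbable.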

On the other hand, the temperature parameter can help us out here. Adjusting the temperature looks kind of like "conditioning on success multiple times" -- i.e., it's as if you took the new distribution on actions as the prior, and then conditioned again to further bias things in the direction of success.
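Here is a rough sketch of "conditioning on success multiple times", applied to the rent example. Feeding each posterior back in as the prior is the same as weighting each action by its prior times P(success | action) raised to the number of rounds, so the number of rounds plays a role like an inverse temperature (that reading is my gloss). As before, the 0.5 prior on gambling and the "other" action are assumed numbers, not part of the original example.

```python
def condition_repeatedly(prior, p_success, times):
    """Condition on success `times` times, reusing each posterior as the next prior.

    Equivalent to weighting each action by prior * P(success | action) ** times.
    """
    weights = {a: prior[a] * p_success[a] ** times for a in prior}
    total = sum(weights.values())
    return {a: w / total for a, w in weights.items()}

# "lotto" = 1e-6 and the 1/16 gamble are from the example; the rest is assumed.
prior = {"lotto": 1e-6, "gamble": 0.5, "other": 0.5 - 1e-6}
p_success = {"lotto": 1.0, "gamble": 1 / 16, "other": 0.0}

for k in range(1, 7):
    posterior = condition_repeatedly(prior, p_success, k)
    print(k, f"lotto={posterior['lotto']:.4f}", f"gamble={posterior['gamble']:.4f}")
```

With these assumed numbers, the lotto entry overtakes the gamble at the fifth round of conditioning, since 0.5 × (1/16)^5 is less than one in a million.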

This has a somewhat nice justification in terms of Monte Carlo algorithms. For some algorithms, this "temperature" ends up being an indication of *how long you took to think*. There's a bias toward actions with high prior probabilities because *that's where you look first when planning*, effectively (due to the randomness of the search).

This sounds like a nice account of bounded rationality: the randomness in the **u=log(p)** model is due to the boundedness of our search, and the fact that we may or may not find the good solutions in that time.

Except for one major problem: *this kind of random search isn't what humans, or AIs, do in general.* Even within the realm of Monte Carlo algorithms, there are a lot of optimizations one can add which would destroy the **u=log(p)** relationship. I don't currently know of any reason to suppose that there's some nice generalization which holds for computationally efficient minds.

So ultimately, I would say that there is a *sorta nice* theory of bounded rationality here, but not a *very nice* one.

Except... I actually know a way to address the concern about bias toward *a priori* probable actions, while sticking to the planning-by-inference picture, and also using an arguably much better theory of bounded rationality.

**C. Logical Induction Decision Theory**

As Scott discussed in a recent talk, if you try the planning-by-inference trick with a *logical inductor* as your inferencer, you maximize expected utility anyway:

> This algorithm predicts what it did conditional on having won, and then copies that distribution. It just says, “output whatever I predict that I output conditioned on my having won”.
>
> [...]
>
> But it turns out that you do reach the same endpoint, because the only fixed point of this process is going to do the same as the last algorithm’s. So this algorithm turns out to be functionally the same as the previous one.

One way of understanding what's happening is this: in the planning-by-inference picture, we start with a prior, and condition on success, then sample actions. This creates a bias toward *a priori* probable actions, which can result in the irrational behavior I mentioned earlier.

In the context of logical induction, however, we additionally stipulate that *the a priori distribution on actions and the updated distribution must match.* This has the effect of "updating on success an infinite number of times" (in the sense that I mentioned earlier, where lowering the temperature is kind of like "updating on success again").
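A toy version of the fixed-point condition can be written down for a finite set of actions. To be clear, this is only an illustration in my own notation ($p_n$, $S$, and $Z$ are assumed symbols), not a description of logical induction itself. Repeated conditioning on success $S$ is

$$
p_{n+1}(a) = \frac{p_n(a)\, P(S \mid a)}{\sum_{a'} p_n(a')\, P(S \mid a')} .
$$

A distribution $p^*$ is a fixed point when

$$
p^*(a) = \frac{p^*(a)\, P(S \mid a)}{Z^*}, \qquad Z^* = \sum_{a'} p^*(a')\, P(S \mid a'),
$$

which forces $P(S \mid a) = Z^*$ for every action with $p^*(a) > 0$: all actions that are still taken must be equally likely to succeed. Starting from a prior $p_0$ that gives every action some probability, $p_n(a) \propto p_0(a)\, P(S \mid a)^n$, so the mass concentrates on the actions that maximize $P(S \mid a)$ as $n$ grows.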

Furthermore, unlike the Monte Carlo algorithms mentioned earlier, logical induction is a theoretically very well-founded theory of bounded rationality. Not so bounded you'd want to run it on an actual computer, granted. But at least it *addresses the question* of what kind of optimality we can enforce on bounded reasoning, rather than just positing a particular kind of computation as the answer.

Since this is equivalent to regular expected utility maximization with logical inductors, there's no reason to use planning-by-inference, but there's also no reason not to.

So, what kind of decision theory does this get us?

- Cooperate in Prisoner's Dilemma with agents whose pseudorandom moves exactly match, or sufficiently correlate with, our own. Defect against agents with uncorrelated pseudorandom exploration sequences (even if they otherwise have "the same mental architecture"). So cooperation is pretty difficult.
- One-box in Newcomb with a perfect predictor. Two-box if the predictor is imperfect. This holds even if the predictor is extremely accurate (say 99.9% accurate), so long as the agent knows more about its own move than the predictor -- the only way the agent will one-box is if the predictor's prediction contains information about the agent's own action which the agent does not possess at the time of choosing.
- Fail transparent Newcomb.
- Fail counterfactual mugging.
- Fail Parfit's Hitchhiker.
- Fail at agent-simulates-predictor.