Musings on Exploration

Diffractor

[epistemic status: Polemical, representative of my opinions and not those of any others, plausibly flawed in some places, generally endorsed.]

Looking at the two main settings of decision theory so far, Proof-based decision theory and Logical Inductor decision theory, they both have exploration steps.

Proof-based decision theory has an implicit exploration step, where in models of $P A$ where there's a proof that the agent doesn't take a particular action, it is possible to prove arbitrarily good outcomes, inducing the agent to take the action.

Logical Inductor decision theory has an explicit exploration step, where if the agent is sufficiently certain that it won't take a particular action, it takes that action.

Both of these are motivated by the issue where, if it is certain that an agent won't take a particular action, arbitrarily bad guesses of what happens if the agent takes an action can persist, and won't be removed, because the agent doesn't take that action. Both forms of exploration step can be thought of as maintaining a model/tiny probability where the action is taken, so EDT-style conditioning works. However, conditionals are not counterfactuals (This particular link is extra-important to read)

Further, this isn't an issue specifically with 0-probability events, in general, the estimates of what happens conditional on a low-probability event are worse than estimates of what happens conditional on a high-probability event, as shown in theorem 3.3 of the Optimal Poly-Time Estimator paper. In short, if you've got an optimal estimator $P_{L}$ for the indicator function that dictates whether property $L$ holds, and you've got an optimal estimator $P_{L \cdot f}$ for the function that's 0 when $L (x)$ is false, and $f (x)$ when $L (x)$ is true, then $\frac{P_{L \cdot f}}{P_{L}}$ is an optimal estimator for $f (x)$ conditional on L being true. However, the error in the original optimal estimators is blown up by a factor of $\frac{1}{D (n)}$ , where $D (n)$ is the portion of the probability distribution where property $L$ holds. As the probability of something shrinks, the error in using standard conditional probabilities increases because a tiny error in $P (ϕ \land ψ)$ gets amplified when you divide by $P (ψ)$ , which is very small.

#Troll Bridge

Troll Bridge is the decision theory problem where there's a troll that blows up a bridge that you want to cross precisely when your exploration step is active. The utility of staying on your side of the bridge is 0.5, the utility of crossing the bridge is 1, and the utility of crossing the bridge when it blows up is 0. It is the logical inductor version of a similar problem for proof-based decision theory. In this version, the troll blows up the bridge when PA is inconsistent, which is a necessary condition for proof-based decision theory to "explore" into suboptimal actions.

Troll bridge is actually not a good argument against EDT-style conditioning with exploration steps. This is because an analogue of it plausibly applies to pretty much any agent you could come up with. Exploration is intimately related to the exploration-exploitation tradeoff, so, consulting Wikipedia's list of multi-armed bandit algorithms...

For all the semi-uniform strategies based on randomization, there's a clear correspondence to the exploration step approach, and the troll blows up the bridge whenever the agent randomizes.

The solution from the paper, "Optimal Adaptive Policies for Markov Decision Processes" features two exploration mechanisms. The first is systematically overestimating the utility gained by taking untaken actions, and the second is taking historically low-probability actions, at approximately a logarithmic rate over time, as in density zero exploration. Yet again, it is possible to design a variant of troll bridge where the agent starts off with a significant underestimate of the utility of crossing the bridge, and the criterion for bridge explosion is the exploration clause being active. The agent will converge to not crossing the bridge.

For pricing strategies, there is also a variant of troll bridge that defeats them. Each option is associated with a score that is the sum of expected reward, and an estimate of extra future rewards gained through the additional knowledge. So, if the troll's bridge-explosion criterion is the value of information for bridge-crossing being above a certain threshold, the agent can converge to not crossing the bridge.

As far as I can tell, the only strategy that doesn't have some sort of targetable exploration behavior is Thompson sampling.

Even for humans, if you were facing troll bridge, and didn't know how it worked, and the troll's criterion for bridge explosion was "the human is thinking "I feel like doing something unusual and seeing what happens"", I wouldn't be surprised if you also got possible convergence to failure.

Causal inference is greatly assisted by the ability to perform randomized controlled trials, and because troll bridge targets exactly that key capability, it probably isn't an environment where we should expect optimality. In general, if there's some key assumption that is necessary to get good performance on an environment, we shouldn't necessarily demand optimality on problems that violate that key assumption.

So, since you can tune a variant of troll bridge to work on most algorithms for handling the exploration/exploitation tradeoff, and it violates the ability to perform randomized controlled trials of various actions (which is key for inferring the causal structure of an environment)... It's probably just an unfair environment and we shouldn't expect optimality on it. As far as I can tell, it's just designed to screw over algorithms with explicit exploration steps.

#Against Exploration Steps

However, even though Troll Bridge is a bad argument against exploration steps, exploration steps are still bad for completely different reasons. Infinite exploration, when the environment contains irreversible traps, guarantees catastrophe. Exploration into the following actions is not advisable.

Launch nuclear arsenal
Fly downward when flying near the ground
Build and fire an astronomical-sized particle accelerator
Overwrite source code with random bitstring

Further, exploration steps seem to introduce an awkward discontinuity into the decision criterion, where expected utility is maximized, except sometimes an exploration step occurs. Due to the potential for catastrophe, I'd anticipate that any agent with an exploration step would self-modify it away as it grew up. In the specific case of AIXI, according to the probability distribution over hypotheses at some finite time, the expected utility of following the AIXI policy is greater than the expected utility of following the AIXI+exploration policy.

The true reason to do exploration seems to be because the agent believes the action it is taking will not lead to an irreversible trap, and because it believes that the action will reveal information about the true environment that enables a better policy later on, which in expectation up to the time horizon, outweighs the temporary loss incurred due to exploring. AIXI works this way, as an example. The setting of Logical Inductor decision theory renders analysis of this impossible, because there is a separate utility score for each round, instead of a cumulative utility score over the rounds, which is necessary to model the possibility of falling into irreversible traps. The real world has traps, so it is unwise to assume them away in the problem setting. It seems useful to attempt to import this feature of AIXI to Logical Inductor decision theory.

Admittedly, there is a specific class of failure in AIXI that is solved by exploration steps, and is reminiscent of getting locked into a bad action forever by an enforcing trader, that is detailed in this paper by Leike and Hutter Pretty much, if you pick a universal turing machine that assigns high probability to a universe where you go to hell if you do anything other than [string of actions], the agent will just output [string of actions].

As previously mentioned, explicit exploration steps would probably be self-modified away. Then how is this issue to be solved? For early pruning of hypotheses, the posts on Delegative Inverse Reinforcement Learning provide an answer. The agent defers to the human for a choice of action when it suspects something might be an irreversible trap (as in the heaven-hell example from Leike and Hutter), and this permits human-assisted identification of obvious traps/not traps. The agent updates on this and can asymptotically learn to take better actions than the human, and over $n$ rounds, defers to the human on $O (n^{2 / 3})$ rounds.

Once the agent has figured out some of how the world works, most environments/hypotheses where there is a trap have evidential clues elsewhere to rule them out. The world in which firing a full nuclear arsenal is dangerous, can be distinguished by smaller-scale nuclear testing and modeling in uninhabited areas. The GUT-scale particle accelerator that may induce vacuum collapse presumably has the difference in physical laws being testable at a lower energy scale. For "bayesian paranoia" of the sort "if I take [some action], it will lead to an irreversible loss of utility", it doesn't seem much of an issue. There would be a K-complexity penalty on the order of "number of bits to specify an additional physical law that kicks in upon taking a particular action, and at no other time". To cause problems, it isn't enough to just have an environment that implies that an action is an irreversible trap, this environment also must be indistinguishable from the environment where an action isn't an irreversible trap, by any means except trying it and seeing what happens. So I don't expect irrational aversion of obviously-not-traps to be an issue in practice, and this applies with even more force when delegative approaches are thrown into the mix.

tl;dr

Exploration steps are inelegant, will probably be self-modified away, and imply unacceptably bad behavior in reality.

At the same time, troll bridge is an unfair argument against exploration steps because it explicitly targets one of the features which enables learning causal structure.

The best feasible result seems to be something like vanilla AIXI. Specifically, exploration should be motivated the resulting reduction in the hypothesis space to enable better actions in the future. Also, the setting of Logical Inductor decision theory is not right for studying this behavior.

Admittedly, AIXI does have the issue of possibly getting permanently locked into bad hypotheses where it paranoid of possible traps. However, a better solution would probably be DIRL's approach of deferring to the human early on. Later on, I anticipate that the majority of worlds with spurious traps can be distinguished from the real world by some evidence, which will be sought out.

A few comments.

Traps are a somewhat bigger issue than you seem to think, when you take acausal attack into account. Your prior contains low complexity hypotheses that intentionally produce good predictions until some critical pivot point at which they switch to something manipulative. So, every time your reach such a pivot point you are facing a decision with irreversible consequences and there is no prior evidence to help you. Delegative learning gets around this my having the agent delegate precisely at the pivot point.

Even disregarding that, "Once the agent has figured out some of how the world works, most environments/hypotheses where there is a trap have evidential clues elsewhere to rule them out" is not quite true.

The notion of a "trap" is relative to the way you organize your uncertainty about the world. Saying that the environment might contain traps is saying that the class of environments you consider is unlearnable. However, the specification of a Bayesian agent only depends on the prior that you get by "averaging" the environments in this class. Different ways of decomposing the prior into hypotheses might yield learnable or unlearnable classes.

For example, consider an environment in which taking action A leads to heaven with probability 70% and hell with probability 30% whereas taking action B leads to heaven with probability 50% and hell with probability 50%. In this environment, taking action A is the better choice and there is no problem. However, if you decompose it into a mixture of deterministic environments then, from that perspective, you have a trap.

To give a "realistic" example, imagine that, we think that quantum random is truly random, but actually there is an underlying theory which allows predicting quantum events deterministically. Then, a strategy that seems optimal from the perspective of an agent that only knows quantum theory might be "suicidal" (permanently sacrificing value) from the perspective of an agent that knows the deeper underlying theory.

As another example, imagine that (i) the only way to escape the heat death of our universe is by controlled vacuum collapse and (ii) because we don't know in which string vacuum we are, there is no way to be certain about the outcome of a controlled vacuum collapse without high energy experiments that have a significant chance of triggering an uncontrolled vacuum collapse. AFAIK this situation is consistent with our knowledge of physics. So, if you consider the different string vacua to be different hypotheses, we are facing a trap. On the other hand, if you have some theory that gives you a probability distribution over these vacua then there is a well-defined optimal strategy.

The point is, it is probably not meaningful/realistic to claim that we can design an agent that will almost certainly successfully deal with all traps, but it is meaningful and realistic to claim that we can design an agent that will be optimal relatively to our own posterior belief state (which is, more or less by definition, the sort of agent that it is a good idea to build).

The reason "explorative" algorithms such as PSRL (Posterior Sampling Reinforcement Learning) cannot be trivially replaced by the Bayes optimal policy, is that the Bayes optimal policy is (more) computationally intractable. For example, if you consider a finite set of finite MDP hypotheses then PSRL can be implemented in polynomial time but the Bayes optimal policy cannot (in fact I am not 100% sure about this, but I think that hole can be patched using the PCP theorem).

The true reason to do exploration seems to be because the agent believes the action it is taking will not lead to an irreversible trap, and because it believes that the action will reveal information about the true environment that enables a better policy later on, which in expectation up to the time horizon, outweighs the temporary loss incurred due to exploring.

My understanding of logical inductor exploration (e.g. in asymptotic decision theory) is that the exploration steps the agent learns from mostly don't happen in its own lifetime, rather they happen in the lifetimes of similar but simpler agents. This allows exploration to work for single-shot problems such as 5 and 10. Intuitively, if you are in a 5 and 10 problem and your brain has size 10^1000, then you can simulate someone whose brain has size 10^999 doing a 5 and 10 problem, and thereby learn the relation between the agent's action and how much utility they get. So each particular agent has some chance of exploring irrecoverably, but in aggregate not many of them will (and it's hard to predict which will and which won't).

As far as I can tell, the only strategy that doesn’t have some sort of targetable exploration behavior is Thompson sampling.

Thompson sampling still randomizes (it randomizes its belief about the world it's in) and is therefore vulnerable to troll bridge.

A: While that is a really interesting note that I hadn't spotted before, the standard formulation of exploration steps in logical inductor decision theory involve infinite exploration steps over all time, so even though an agent of this type would be able to inductively learn from what other agents do in different decision problems in less time than it naively appears, that wouldn't make it explore less.

B: What I intended with the remark about Thompson sampling was that troll bridge functions on there being two distinct causes of "attempting to cross the bridge". One is crossing because you believe it to be the best action, and the other is crossing because an exploration step occurred, and Thompson sampling doesn't have a split decision criterion like this. Although now that you point it out, it is possible to make a Thompson sampling variant where the troll blows up the bridge when "crossing the bridge" is not the highest-ranked action.

I'm not convinced exploration doesn't tile. Exploration steps would not be self-modified away if they're actually useful/important, and if the agent can recognize this.

In the case of chicken-rule, it's unclear how to make a decision at all without it. Plus, the exploration never needs to occur -- though, subjectively, the agent can't know that, so that doesn't really effect the question of tiling. But, the agent could see that removing the chicken step removes the ability of the agent to reason about the consequences of alternate actions.

I think it's somewhat plausible that you can get something which needs chicken rule as a foundation, but which can decide not to use it in cases like troll bridge, because it is deciding whether to use the chicken rule for sensible reasons (where "sensible" includes the chicken rule itself).

Deciding to use the chicken rule is a transparent-Newcomb-like problem: your behavior in the case that you do see a proof of your own action affects your ability to reason in the case that you don't.

The same things seem to apply to exploration.

Given our current level of understanding, chicken rule and true exploration seem closely analogous. However, it's quite plausible that this will stop being the case with a better understanding. In particular, Sam recently pointed out to me that Löb's theorem doesn't go through for $□ ϕ := P (ϕ) \geq 0.95$ . We have a decent picture of what tiling looks like in pure logical settings, but that's shaped very highly by Löb's theorem. So, tiling considerations for exploration could look very different from those for chicken-rule.