Vingean Reflection: Open Problems

abramdemski

Edit: None of the ideas here should be attributed to me; they all came out of discussions at the workshop or were pre-existing. I've just done a little bit of interpretation and (in some cases) naming.

The summer MIRI workshop on Vingean reflection has drawn to a close, and I think the highest-value thing for me to communicate from it is my sense of what the open problems are. The existing solutions to Vingean uncertainty are by no means satisfactory, but nonetheless, they solve the example problems which are currently on the table (namely, the Löbian obstacle and the procrastination paradox). This state of affairs makes it more difficult than it needs to be to see what is wrong with solutions so far. Based on a discussion with Eliezer, and a table of solutions and their properties which Patrick wrote out, I'll try to discuss the boundary where progress can be made.

Existing Problems/Solutions

A Vingean agent is defined as one which can choose actions without entirely planning what actions it will take in the future. This seems like a necessary condition for a boundedly rational agent. Contrast this with boundlessly rational procedures such as backward induction, which must determine a plan for all contingencies at time $t + 1$ before they can decide what to do at time $t$, because it is not possible to choose between actions when the expected utility (under the assumption of acting rationally after that point) has not yet been computed. Vingean uncertainty is, therefore, a large departure from standard utility theory and game theory.
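To make the contrast concrete, here is a minimal sketch of backward induction on a finite-horizon problem. Everything in it (the function name `backward_induction`, the two-state toy problem, the reward numbers) is an illustrative assumption rather than anything from the workshop. The point is only the loop order: the plan for step $t$ cannot be computed until the values for step $t + 1$ are known for every state.

```python
from typing import Callable, Dict, List

State = str
Action = str


def backward_induction(
    states: List[State],
    actions: List[Action],
    transition: Callable[[State, Action], State],
    reward: Callable[[State, Action], float],
    horizon: int,
) -> List[Dict[State, Action]]:
    """Return a full contingency plan: plan[t][s] is the action for state s at time t."""
    # After the final step there is nothing left to gain.
    value_next: Dict[State, float] = {s: 0.0 for s in states}
    plan: List[Dict[State, Action]] = [{} for _ in range(horizon)]
    # Work from the last step to the first: step t is only solvable
    # once the value of every state at step t + 1 has been computed.
    for t in reversed(range(horizon)):
        value_now: Dict[State, float] = {}
        for s in states:
            best = max(
                actions,
                key=lambda a: reward(s, a) + value_next[transition(s, a)],
            )
            plan[t][s] = best
            value_now[s] = reward(s, best) + value_next[transition(s, best)]
        value_next = value_now
    return plan


# Hypothetical toy problem: investing pays nothing now but moves to "high".
def toy_transition(s: State, a: Action) -> State:
    return "high" if a == "invest" else s


def toy_reward(s: State, a: Action) -> float:
    if a == "invest":
        return 0.0
    return 3.0 if s == "high" else 1.0


plan = backward_induction(
    ["low", "high"], ["wait", "invest"], toy_transition, toy_reward, horizon=3
)
```

In this toy problem the plan from the "low" state is to invest at steps 0 and 1 and to wait at step 2. A Vingean agent would have to choose its first action without having filled in the later entries of `plan`.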

The Löbian obstacle is, informally, the observation that an agent cannot trust itself generally, thanks to Löb's theorem, and therefore cannot trust its future self to achieve goals. More formally: Löb's theorem implies that if an agent's theory trusts its own soundness over some class of sentences, then it already proves every sentence in that class. It therefore seems impossible for an agent to trust future selves to prove actions "safe" (for some formal safety predicate) without predicting specifically what action will be taken. A more formal specification of the obstacle is given in the tiling agents paper.
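For reference, the standard statement of Löb's theorem is given below. The symbols $T$ (a sufficiently strong theory, such as Peano Arithmetic) and $P$ (an arbitrary sentence) are notation chosen here for illustration; $□$ is the provability predicate of $T$.

$$T ⊢ □ P → P \quad \Longrightarrow \quad T ⊢ P$$

The internalized form is $T ⊢ □(□ P → P) → □ P$. Read contrapositively, the first form says that $T$ can only prove "if $P$ is provable then $P$ is true" for sentences $P$ that it proves outright, which is why blanket self-trust over a class of sentences collapses into proving everything in that class.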

Solutions to the Löbian obstacle include Marcello's Waterfall and Model Polymorphism (previously called Parametric Polymorphism), also detailed in that paper.

Marcello's Waterfall falls prey to the Procrastination Paradox, as a direct result of trusting itself: it believes that its future self will satisfy the goal, so it is willing to put off achieving the goal for now. Unfortunately, this means that it is willing to put off goal-achievement forever.
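The shape of the failure can be shown with a deliberately crude toy, which is not a model of the waterfall's proof-theoretic reasoning. The function `run_agent` and the `trusts_successor` predicate are hypothetical names introduced for this sketch, and the bounded-trust variant is included only as a contrast.

```python
from typing import Callable, List


def run_agent(trusts_successor: Callable[[int], bool], steps: int) -> List[str]:
    """Simulate an agent that acts only when it does not trust its successor to act."""
    history: List[str] = []
    for t in range(steps):
        if trusts_successor(t):
            history.append("defer")  # the successor is trusted to press later
        else:
            history.append("press")  # no trusted successor, so act now
            break
    return history


# Unconditional self-trust: at every step, deferring looks safe.
always_trusting = run_agent(lambda t: True, steps=1000)

# Hypothetical contrast: trust that runs out at step 3 forces an action.
bounded_trust = run_agent(lambda t: t < 3, steps=1000)
```

With unconditional trust the button is never pressed, however many steps are simulated; with trust that eventually runs out, the agent defers three times and then presses.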

Model Polymorphism avoids the Procrastination Paradox as well as the Löbian obstacle. So does a modification of Marcello's Waterfall called the Consistency Waterfall.

The problem of Vingean reflection has very much been framed as a matter of navigating between the Scylla of the Löbian obstacle and the Charybdis of the procrastination paradox, for example in this paper. However, the fact is that we can solve both of these issues. Model Polymorphism and the Consistency Waterfall are not thought of as providing the final word on Vingean matters, but dissatisfaction with those models seems insufficiently articulated. (This technical report has some discussion, however.) What more do we want?

One class of concerns is naturalistic self-trust (also discussed in the tiling agents paper). Due to Löb's theorem, it is not possible for a system to trust calculations made by copies of itself (or by any system which it knows to obey the same axioms and inference rules). Waterfall-style solutions to the Löbian obstacle allow an infinite chain of trust, but still require the agent to modify itself at each step (to be trustworthy to its past self). Model Polymorphism arguably has a larger degree of naturalistic trust, but only trusts itself on a subset of its goals. To pin down more precisely what the problem is, we discussed stronger criteria to impose on agents.

New Problems

The Imitation Problem

Suppose that an agent faces a computationally difficult decision. The agent sees a copy of itself in a situation which is identical in all relevant aspects, and sees that the copy takes action $a$. The agent can prove that its copy would only take provably safe actions. It seems possible that the agent could save processing time by taking the action which its copy settled on.

Edit: The copy of the agent should be identical in all decision-relevant aspects except for having information about a copy; that is, the agent knows that its copy is making the decision on its own, rather than by imitating yet another copy; or, at least, a well-founded chain of copies should exist.
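The well-foundedness stipulation can be sketched as a simple check on who imitated whom. The representation below is an assumption made for illustration: `imitates` maps each agent's name to the copy it imitated, or to `None` if it made the decision on its own. It says nothing about how an agent would establish these facts by proof.

```python
from typing import Dict, Optional


def is_well_founded(imitates: Dict[str, Optional[str]], agent: str) -> bool:
    """Return True if the imitation chain from `agent` ends at a copy that decided on its own."""
    seen = set()
    current: Optional[str] = agent
    while current is not None:
        if current in seen:
            return False  # cycle: every copy passes the buck
        seen.add(current)
        if current not in imitates:
            return False  # unknown provenance cannot be verified
        current = imitates[current]
    return True  # reached a copy mapped to None, which did the work itself
```

For example, `is_well_founded({"A": "B", "B": None}, "A")` holds, because B did the verification work itself, while `{"A": "B", "B": "A"}` or `{"A": "A"}` describe chains in which nobody ever checked that the action is safe.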

Imitation of this kind is not possible for agents considered so far. The waterfall agents are capable of trusting their children, and could therefore take actions which they saw their children decide upon (if they and their children were somehow present at the same time, contrary to the usual setup). Model Polymorphism trusts an exact copy of itself for only a subset of its goals, and so could not imitate copies. In particular, it trusts copies to be safe on future decisions, but not the current decision. Therefore, it is incapable of trusting a clone when it comes to what it should do right now.

This temporal aspect is further highlighted by the next problem.

The Notebook Problem

Vingean reflection has explicitly focused on trusting one's future self, but similar problems arise when trying to trust one's past self.

An AI may be programmed to simply trust its memory -- and if it considers whether to change that programming, it will be considering whether to modify its future self, so previous solutions may apply. Suppose, however, that its internal storage has been nearly exhausted, and it is forced to augment this with some external storage. We can imagine that it is writing things down in a notebook for later use.

Unfortunately, most agent designs considered so far will not be capable of trusting things written in such a notebook. Suppose that an agent proves an important and difficult statement $X$ and writes it in the notebook. One week later, the question comes up again. Checking the notebook, the agent sees that it has written $X$. From this, the agent knows that it must have proven $X$, and so it can conclude $□ X$. For the usual Löbian reasons, it cannot then conclude $X$.

This problem was written about in Naturalistic trust among AIs: The parable of the thesis advisor's theorem, by Benja. Nik Weaver proposed the following solution: what if an agent accepts $X$, $□ X$, $□ □ X$, and so on? Here, "accept" does not mean believe at the level of $X$; if we had an inference rule $□ X ⊢ X$, we would become inconsistent. Rather, the agent is willing to act on a belief $□ X$ as if it were $X$; and in general, a statement under any number of boxes is treated as if it were a plain belief, for the sake of action.

Let's call this the box-ignoring system. The box-ignoring system does solve the notebook problem in some sense: it adds boxes to things when it reads them from a notebook, but this does not make any difference to the system. Unfortunately, it does not solve either the imitation problem or the original Löbian obstacle. For the imitation problem, it fails to trust another copy of itself because although it can prove that there must have been some proof of safety with some number of boxes, it cannot prove the particular number of boxes just by observing its clone. (Perhaps there is some clever fix for this.)
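A minimal sketch of the box-ignoring idea, under assumptions of my own: formulas are atoms (strings) or `("box", formula)` pairs, and the helper names `box`, `strip_boxes`, and `acts_on` are hypothetical. Note that boxes are ignored only when deciding whether to act; nothing is added to the belief set, since that would amount to the inconsistent rule $□ X ⊢ X$.

```python
from typing import Set, Tuple, Union

# A formula is an atom or a boxed formula.
Formula = Union[str, Tuple[str, "Formula"]]


def box(formula: Formula) -> Formula:
    """Wrap a formula in one provability box."""
    return ("box", formula)


def strip_boxes(formula: Formula) -> Formula:
    """Remove all leading boxes, leaving the underlying statement."""
    while isinstance(formula, tuple) and formula[0] == "box":
        formula = formula[1]
    return formula


def acts_on(target: Formula, beliefs: Set[Formula]) -> bool:
    """True if some belief matches `target` once leading boxes are ignored."""
    return any(strip_boxes(b) == strip_boxes(target) for b in beliefs)


# Reading X from the notebook yields the belief box(X), not X itself.
beliefs_after_notebook: Set[Formula] = {box("X")}
# Re-reading a note about that note adds another box.
beliefs_after_reread: Set[Formula] = {box(box("X"))}
```

Both `acts_on("X", beliefs_after_notebook)` and `acts_on("X", beliefs_after_reread)` hold, which is the sense in which extra boxes make no difference. The sketch also shows where imitation gets stuck: `acts_on` needs a concrete belief with some definite number of boxes, whereas observing a clone only supports "there is some number of boxes for which this was proven".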

The AI Suicide Problem

A problem very closely related to Vingean uncertainty is that most logical agents which have a strong safety requirement want to destroy themselves because they cannot prove they will be safe in the future. The waterfall agents get around this by allowing an unbounded line of "children" which preserve most of the properties of the parent. These children can take the form of small modifications to the agent's software, as opposed to building whole new AIs from scratch. Still, it's undesirable for the agent to want to rewrite itself every time it examines its own code.

The AI suicide problem, then, is a test of whether an agent can have Vingean trust for exactly itself. Model Polymorphism passes this test, and the waterfall proposals do not. (Eliezer also argued that Model Polymorphism failed the test in spirit, because it could only trust future versions of itself by virtue of their being in the future -- it still wants to modify children, but just in the precise way that time naturally modifies them. I'm not sure how to state a stronger test to satisfy Eliezer's concern.)

Other Open Areas

Other things which have less to do with naturalistic self-trust:

Probabilistic reflection and Vingean trust. Paul Christiano's reflective probability falls prey to a probabilistic version of the procrastination paradox. As far as I know, it isn't known whether it passes or fails the Löbian obstacle. It could be interesting to look for modifications related to Model Polymorphism, or perhaps the Consistency Waterfall. (Patrick and Benja were trying to do this at the end of the workshop.) It could also be interesting to look for analogous modifications of other approaches to logical uncertainty.
Vingean decision theory. In order to be useful, Vingean reflection needs to break out of its realm of pure logic and start dealing with more complicated decision problems. It could be helpful to combine it with the work on alternative decision theories. Patrick goes in this direction with a Modal UDT approach to the Löbian obstacle.

orthonormal

I actually think that the Imitation Problem should not be resolved in the direction of taking action $a$. Otherwise it seems like Omega could tell the agent that it is going to shortly take action $a$, the agent will conclude that $a$ must be safe, and it will then take action $a$... but $a$ can be arbitrary here!

Basically, I think a solution to Vingean reflection ought to ensure that the reasoning process is well-founded: for every decision, we want to be sure that some particular version of the agent has actually done the necessary verification work rather than passing the buck. (For that reason, I do think that model polymorphism is on the right track.)

abramdemski

I think we can address this by assuming that the agent is imitating a copy who made the decision on its own rather than by imitating; I've edited the post to reflect this additional stipulation.

AI ALIGNMENT FORUM