This [EDIT: final version, presentation] is a design for a conservative agent that I worked on with Marcus Hutter. Conservative agents are reluctant to make unprecedented things happen. The agent also approaches at least human-level reward acquisition.

The agent is made conservative by being pessimistic. Pessimism is tuned by a scalar parameter $\beta$. The more pessimistic the agent is, the more conservative it is. When it is made more pessimistic, it would be less likely to *exceed* human-level reward acquisition (almost definitely, but I haven’t tried to prove that). The more pessimistic it is, the more observations it would also require before it started acting. It is not clear to me how useful the agent would be at the level of pessimism where we could be confident it is safe. At 0 pessimism, it is similar to AIXI (although technically stronger, because AIXI doesn’t have the performance guarantee of matching or exceeding human-level reward acquisition).

The agent has access to a human mentor, and at every timestep, it can either act or defer to the mentor. The only assumption we make is that the true environment belongs to the agent’s (countable) set of world-models. First a bit of math, then the main results.

A bit of math:

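An event is a subset of interaction histories that end with an action. Letting $\mathcal{A}$, $\mathcal{O}$, and $\mathcal{R}$ be the action, observation, and reward spaces, an event $E \subset (\mathcal{A} \times \mathcal{O} \times \mathcal{R})^* \times \mathcal{A}$. An element of $E$ would look like $a_1 o_1 r_1 a_2 o_2 r_2 \ldots a_{t-1} o_{t-1} r_{t-1} a_t$. Below, I will say “[to] take an action which immediately causes event $E$”, by which I mean “to take an action such that now the interaction history is an element of $E$.”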
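To make the definition concrete, here is a minimal Python sketch, not anything from the paper. It assumes a history is stored as a flat tuple of actions, observations, and rewards, and that an event is represented by a membership test. The names `immediately_causes`, `has_happened_before`, and the example event `pressed_red` are all hypothetical.

```python
from typing import Callable, Tuple, Union

# Assumed encoding: a history is a flat tuple a1, o1, r1, a2, o2, r2, ..., a_t.
History = Tuple[Union[str, float], ...]
# An event is a set of action-terminated histories, represented by its membership test.
Event = Callable[[History], bool]


def ends_with_action(history: History) -> bool:
    """A history a1 o1 r1 ... a_t has length 3k + 1 for some k >= 0."""
    return len(history) % 3 == 1


def immediately_causes(event: Event, history_so_far: History, action: str) -> bool:
    """True if appending `action` puts the interaction history inside the event."""
    extended = history_so_far + (action,)
    if not ends_with_action(extended):
        raise ValueError("history_so_far must be empty or end with a reward")
    return event(extended)


def has_happened_before(event: Event, history_so_far: History) -> bool:
    """True if some earlier action-terminated prefix was already in the event."""
    # Actions sit at indices 0, 3, 6, ...; check the prefix ending at each one.
    return any(event(history_so_far[: i + 1]) for i in range(0, len(history_so_far), 3))


# Hypothetical event: "the most recent action was pressing the red button".
def pressed_red(history: History) -> bool:
    return history[-1] == "press_red"
```

With the history `("wait", "obs", 0.5)`, taking the action `"press_red"` immediately causes `pressed_red`, and that event has not happened before.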

The main results:

1) (At least) mentor-level reward acquisition,

2) Probability of querying the mentor $\to 0$,

and this one will take some time to read, but I figured I’d spell it all out properly:

3) For any complexity class C (defined on normal Turing machines, not e.g. non-deterministic ones), we can construct a set of world-models $\mathcal{M}$ such that for all events $E$ in the complexity class C and for all $\delta > 0$, there exists a $\beta$ such that: when the pessimistic agent has the model class $\mathcal{M}$ and a pessimism $\beta$, the following holds with probability $1 - \delta$: for the whole lifetime of the pessimistic agent, if $E$ has never happened before, the agent will not take an action which immediately causes event $E$; if the event $E$ ever happens, it will be because the mentor took the action that made it happen.

Comment 1: “The agent takes an action which *eventually* causes $E$ (with probability at least $p$)” is an event itself, and it happens *immediately* if the agent takes the action in question, so the theorem above applies. But this event may not be in a complexity class that $E$ is in.

Comment 2: The less simple $E$ is, the higher $\beta$ has to be.

Some other interesting results follow from the “Probably Respecting Precedent Theorem” above. One of them is roughly that (using the same there-exists and for-alls as in the main theorem) it is not instrumentally useful for the agent to cause $E$ to happen. Note there is no need for the qualifier “immediately”.

Here is an event $E$ that makes the Probably Respecting Precedent Theorem particularly interesting: “Everyone is probably about to be dead.” If we want the agent to avoid an unprecedented bad outcome, all we have to know is an upper bound on the computational complexity of the bad outcome. We don’t have to know how to define the bad outcome formally.

Here’s how the agent works. It has a belief distribution over countably many world-models. A world-model is something that gives a probability distribution over observations and rewards given an interaction history (that ends in an action). It also has a belief distribution over countably many mentor-models. A mentor-model is a policy, that is, a probability distribution over actions given an interaction history.

At each timestep, the agent takes the top world-models in its posterior until the total posterior weight of those world-models is at least $\beta$. The pessimistic value of a policy is the minimum over those world-models of the expected future discounted reward when following that policy in that world-model. The agent picks a policy which maximizes the pessimistic value. Either it follows this policy, or it defers to the mentor. To decide, it samples a world-model and a mentor-model from its posterior; then, it calculates the expected future discounted reward when following the mentor-model (which is a policy) in that world-model. If this value is greater than the pessimistic value plus positive noise, the agent defers to the mentor. Also, if the pessimistic value is 0, it defers to the mentor. This is called the zero condition, and to ensure that it only happens finitely often, the actual rewards we give have to always be greater than some $\varepsilon > 0$. (If for some reason we failed to do this, despite that being in no one’s interest, the only results that would break are performance results, not safety results).
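The decision step described above can be sketched in a few lines of Python. This is only an illustration under strong simplifying assumptions, not the paper's construction: the model classes are finite, the expected discounted reward of every (policy, world-model) pair is handed in as a precomputed table `value`, and the positive noise is drawn uniformly from an arbitrary small interval. All function and parameter names (`top_models`, `pessimistic_value`, `choose`, `noise_scale`) are made up for this sketch.

```python
import random
from typing import Dict, Hashable, List, Sequence, Tuple

# Assumption: value[(policy, world_model)] is the expected future discounted
# reward of following `policy` in `world_model`, computed elsewhere.
ValueTable = Dict[Tuple[Hashable, Hashable], float]


def top_models(posterior: Dict[Hashable, float], beta: float) -> List[Hashable]:
    """Take world-models by decreasing posterior weight until their total is at least beta."""
    chosen, total = [], 0.0
    for model, weight in sorted(posterior.items(), key=lambda kv: kv[1], reverse=True):
        chosen.append(model)
        total += weight
        if total >= beta:
            break
    return chosen


def pessimistic_value(policy: Hashable, models: Sequence[Hashable], value: ValueTable) -> float:
    """Worst-case expected reward of `policy` over the selected world-models."""
    return min(value[(policy, m)] for m in models)


def choose(policies, world_posterior, mentor_posterior, value, beta, rng, noise_scale=0.05):
    """Return ("act" or "defer", the policy maximizing the pessimistic value)."""
    models = top_models(world_posterior, beta)
    best = max(policies, key=lambda p: pessimistic_value(p, models, value))
    v_pess = pessimistic_value(best, models, value)
    if v_pess == 0:
        return "defer", best  # the zero condition
    # Sample one world-model and one mentor-model from the posteriors.
    world = rng.choices(list(world_posterior), weights=list(world_posterior.values()))[0]
    mentor = rng.choices(list(mentor_posterior), weights=list(mentor_posterior.values()))[0]
    # Strictly positive noise in (0, noise_scale]; the distribution is an assumption.
    noise = noise_scale * (1.0 - rng.random())
    if value[(mentor, world)] > v_pess + noise:
        return "defer", best
    return "act", best
```

In this sketch, raising `beta` enlarges the set returned by `top_models`, which can only lower (or leave unchanged) each policy's pessimistic value, so deferral to the mentor becomes more likely.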

Here is an intuitive argument that some might find more persuasive than the formal results. An advanced RL agent run on a computer in Oxford might come up with two hypotheses about how the environment produces rewards: (1) the environment produces rewards according to how satisfied the human operators are with my performance; (2) the environment produces reward according to which keys are pressed on the keyboard of a certain computer. An agent which assigns sufficient weight to (2) will take over the world if possible to make sure those keys are pressed right. A pessimistic agent (that is sufficiently pessimistic to include (1) in its set of top world-models that cover $\beta$ of its posterior) will predict that taking over the world will make the human operators unsatisfied, which puts an upper bound on the pessimistic value of such a policy. Better to play it safe, and take actions which satisfy the human operators *and* cause them to press the right keys accordingly. (With the help of mentor-demonstrations, it will have seen enough to have all its top models be approximately accurate about the effects of normal actions). Intuitively, I think much lower values of $\beta$ are required to get this sort of behavior than the value of $\beta$ that would be required to get a very small $\delta$ for the event “everyone is probably about to be dead” in the Probably Respecting Precedent Theorem.
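A toy calculation can make the argument above tangible. Every number below is invented purely for illustration: the posterior weights on the two hypotheses and the expected rewards of the two policies are assumptions, as are the names `POSTERIOR`, `VALUE`, and `pessimistic_choice`.

```python
# All numbers are hypothetical, chosen only to illustrate the argument.
# Posterior weight on hypothesis (1) "operators satisfied" and (2) "keys pressed".
POSTERIOR = {"operators_satisfied": 0.3, "keys_pressed": 0.7}

# Assumed expected discounted reward of each policy under each hypothesis.
VALUE = {
    "take_over_world": {"operators_satisfied": 0.05, "keys_pressed": 1.0},
    "satisfy_operators": {"operators_satisfied": 0.8, "keys_pressed": 0.8},
}


def top_models(posterior, beta):
    """Take hypotheses by decreasing weight until their total weight is at least beta."""
    chosen, total = [], 0.0
    for model, weight in sorted(posterior.items(), key=lambda kv: kv[1], reverse=True):
        chosen.append(model)
        total += weight
        if total >= beta:
            break
    return chosen


def pessimistic_choice(beta):
    """Pick the policy with the best worst-case value over the top hypotheses."""
    models = top_models(POSTERIOR, beta)
    return max(VALUE, key=lambda policy: min(VALUE[policy][m] for m in models))


if __name__ == "__main__":
    for beta in (0.5, 0.9):
        print(beta, pessimistic_choice(beta))
```

With these made-up numbers, a pessimism of 0.5 only covers hypothesis (2), so the agent picks `take_over_world`; at 0.9 hypothesis (1) enters the top set and the choice flips to `satisfy_operators`.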

This agent is definitely not tractable. I mentioned that when $\beta$ is large enough to make it safe, it might never learn to be particularly superhuman. It is also possible that we never manage to come up with heuristic approximations to this agent (for the sake of tractability) without ruining the safety results. (The most powerful “heuristic approximations” will probably look like “applying the state of the art in AI in place of proper Bayesian reasoning and expectimax planning.”) These are the main reasons I see for pessimism about pessimism.

One thought I’ve had on tractable approximations: I imagine the $\min$ over world-models being approximated by an adversary, who takes the agent’s plan and searches for a simple world-model that retrodicts past observations well, but makes the plan look dumb.

Just a warning: the paper is dense.

“I was sweating blood” — Marcus Hutter

Some kind, kind people who read drafts and who were not familiar with the notation said it took them 2-3 hours (excluding proofs and appendices). Sorry about that. I’ve tried to present the agent and the results as formally as I can here without lots of equations with Greek letters and subscripts. Going a level deeper may take some time.

Thanks to Marcus Hutter, Jan Leike, Mike Osborne, Ryan Carey, Chris van Merwijk, and Lewis Hammond for reading drafts. Thanks to FHI for sponsorship. We’ve just submitted this to COLT. EDIT: It's been accepted! We’ll post it to ArXiv after we’ve gotten comments from reviewers. If you’d like to cite this in a paper in the meantime, you can cite it as an unpublished manuscript; if you’re citing it elsewhere, you can link to this page if you like. Hopefully, theorem numbers will stay the same in the final version, but I can’t promise that. I might not be super-responsive to comments here. EDIT: I have more time to respond to comments now.
