
Previously we derived a regret bound for DRL which assumed the advisor is "locally sane." Such an advisor can only take actions that don't lose any value in the long term. In particular, if the environment contains a latent catastrophe that manifests with a certain rate (such as the possibility of an UFAI), a locally sane advisor has to take the optimal course of action to mitigate it, since every delay yields a positive probability of the catastrophe manifesting and leading to permanent loss of value. This state of affairs is unsatisfactory, since we would like to have performance guarantees for an AI that can mitigate catastrophes that the human operator cannot mitigate on their own. To address this problem, we introduce a new form of DRL where in every hypothetical environment the set of uncorrupted states is divided into "dangerous" (impending catastrophe) and "safe" (catastrophe was mitigated). The advisor is then only required to be locally sane in safe states, whereas in dangerous states certain "leaking" of long-term value is allowed. We derive a regret bound in this setting as a function of the time discount factor, the expected value of catastrophe mitigation time for the optimal policy, and the "value leak" rate (i.e. essentially the rate of catastrophe occurrence). The form of this regret bound implies that in certain asymptotic regimes, the agent attains near-optimal expected utility (and in particular mitigates the catastrophe with probability close to 1), whereas the advisor on its own fails to mitigate the catastrophe with probability close to 1.

Appendix A proves the main theorem. Appendix B contains the proof of an important lemma, which is, however, almost identical to what appeared in the previous essay. Appendix C contains several propositions from the previous essay which are used in the proof. [Appendices B and C were moved to a separate post because of a length limit on the website.]


We start by formalising the concepts of a "catastrophe" and "catastrophe mitigation" in the language of MDPs.

Definition 1

A catastrophe MDP is an MDP together with a partition of into subsets (safe, dangerous and corrupt states respectively).
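To make Definition 1 concrete, here is a minimal Python sketch of a finite catastrophe MDP as a plain data structure. The class name `CatastropheMDP`, its field names, the state-based reward and the toy transition numbers are all assumptions made for illustration; none of them come from the formal definition. The only thing the sketch enforces is that the three subsets really do partition the state set.

```python
from dataclasses import dataclass


@dataclass
class CatastropheMDP:
    """Hypothetical container: a finite MDP plus a partition of its states."""
    states: frozenset
    actions: frozenset
    transition: dict  # (state, action) -> {next_state: probability}
    reward: dict      # state -> reward (state-based reward is an assumption)
    safe: frozenset
    dangerous: frozenset
    corrupt: frozenset

    def __post_init__(self):
        parts = (self.safe, self.dangerous, self.corrupt)
        covers = frozenset().union(*parts) == self.states
        # If the parts cover the states and their sizes add up to the number
        # of states, then no state can belong to two parts.
        sizes_add_up = sum(len(p) for p in parts) == len(self.states)
        if not (covers and sizes_add_up):
            raise ValueError("safe, dangerous and corrupt must partition the states")
        for key, dist in self.transition.items():
            if abs(sum(dist.values()) - 1.0) > 1e-9:
                raise ValueError(f"transition from {key} is not a distribution")


def make_toy():
    """A toy instance: one dangerous, one safe and one corrupt state."""
    transition = {
        ("d", "wait"): {"d": 0.9, "c": 0.1},   # waiting risks the catastrophe
        ("d", "fix"): {"s": 0.25, "d": 0.75},  # mitigation succeeds w.p. 0.25
        ("s", "wait"): {"s": 1.0},
        ("s", "fix"): {"s": 1.0},
        ("c", "wait"): {"c": 1.0},
        ("c", "fix"): {"c": 1.0},
    }
    return CatastropheMDP(
        states=frozenset({"d", "s", "c"}),
        actions=frozenset({"wait", "fix"}),
        transition=transition,
        reward={"d": 0.5, "s": 1.0, "c": 0.0},
        safe=frozenset({"s"}),
        dangerous=frozenset({"d"}),
        corrupt=frozenset({"c"}),
    )
```

Calling `make_toy()` returns a valid instance, while passing overlapping subsets raises `ValueError`.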

Definition 2

Fix a catastrophe MDP . Define by

is called a mitigation policy for when

i. For any , .

is called a proper mitigation policy for when condition i holds and

ii. For any , .

Definition 3

Fix , a catastrophe MDP and a proper mitigation policy . is said to have expected mitigation time when for any

Next, we introduce the notion of an MDP perturbation. We will use it by considering perturbations of a catastrophe MDP which "eliminate the catastrophe."

Definition 4

Fix and consider a catastrophe MDP . An MDP is said to be a -perturbation of when




iv. For any and ,

v. For any and , there exists s.t. .

Similarly, we can consider perturbations of a policy.

Definition 5

Fix and consider a catastrophe MDP . Given and , is said to be a -perturbation of when

i. For any , .

ii. For any , there exists s.t. .

We will also need to introduce policy-specific value functions, Q-functions and relatively -optimal actions.

Definition 6

Fix an MDP and . We define and by

For each , we define , and by

Now we give the new (weaker) condition on the advisor policy. For notational simplicity, we assume the policy is stationary. It is easy to generalize these results to non-stationary advisor policies and to policies that depend on irrelevant additional information (i.e. policies for universes that are realizations of the MDP).

Definition 7

Given a catastrophe MDP , we denote the MDP defined by

  • For any , .

  • For any , .

Definition 8

Fix . Consider a catastrophe MDP . A policy is said to be locally -sane for when there exists a -perturbation of with a deterministic proper mitigation policy and a -perturbation of s.t.

i. For all , .

ii. is a mitigation policy for .

iii. For any :

iv. For any :

Given , is said to have potential mitigation time when has it as expected mitigation time.
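As a rough illustration of the expected mitigation time of Definition 3, which potential mitigation time refers back to, the following Python sketch computes the expected number of steps until a Markov chain first reaches a safe state. It assumes that a policy has already been fixed (so only a chain remains) and that "expected mitigation time" means a uniform upper bound on this expected hitting time over all dangerous states. The function names and the toy chain are hypothetical.

```python
def expected_hitting_time(chain, safe, tol=1e-12, max_iters=1_000_000):
    """Expected number of steps to first reach `safe`, per starting state.

    chain[s] -> {next_state: probability}; every state must appear as a key.
    Solves h(s) = 1 + sum_t P(s, t) h(t) for non-safe s by fixed-point
    iteration, with h(s) = 0 on safe states.
    """
    h = {s: 0.0 for s in chain}
    for _ in range(max_iters):
        delta = 0.0
        for s in chain:
            if s in safe:
                continue
            new = 1.0 + sum(p * h[t] for t, p in chain[s].items())
            delta = max(delta, abs(new - h[s]))
            h[s] = new
        if delta < tol:
            return h
    raise RuntimeError("did not converge; the policy may not be proper")


def has_expected_mitigation_time(chain, safe, dangerous, t):
    """Assumed reading of Definition 3: the hitting time is at most t everywhere."""
    h = expected_hitting_time(chain, safe)
    return all(h[s] <= t for s in dangerous)


# Toy chain: from "d" the mitigation succeeds with probability 0.25 per step.
toy_chain = {"d": {"s": 0.25, "d": 0.75}, "s": {"s": 1.0}}
toy_times = expected_hitting_time(toy_chain, safe={"s"})
```

In the toy chain the number of steps until mitigation is geometric with success probability 0.25, so `toy_times["d"]` is approximately 4.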

Note that a locally -sane policy still has to be -optimal in . This requirement seems reasonably realistic, since, roughly speaking, it only means that there is some way to "rearrange the universe" that the agent can achieve, and that would be "endorsed" by the advisor, s.t. this rearrangement doesn't destroy substantial value and s.t. after this rearrangement, there is no "impending catastrophe" that the agent has to prevent and the advisor wouldn't be able to prevent in its place. In particular, this rearrangement may involve creating some subagents inside the environment and destroying the original agent, in which case any policy on is "vacuously" optimal (since all actions have no effect).

We can now formulate the main result.

Theorem 1

Fix an interface , , and for each , an MDP s.t. . Now, consider for each , an -universe which is an -realization of a catastrophe MDP with state function s.t.


ii. For each and , .

iii. For each , .

iv. Given and , if and , then (this condition means that in uncorrupted states, the reward is observable).

Consider also , and a locally -sane policy for . Assume has potential mitigation time . Then, there exists an -policy s.t. for any

Here, is the -policy defined by . and the are regarded as fixed and we don't explicitly examine their effect on regret, whereas , , and the are regarded as variable with the asymptotics , .

In most interesting cases, (i.e. the "mean time between catastrophes" is much shorter than a discount horizon) and (i.e. the expected mitigation time is much shorter than the discount horizon), which allows simplifying the above to

We give a simple example.

Example 1

Let , . For any and , we fix some and define the catastrophe MDP by

  • , , (adding corrupted states is an easy exercise).

  • If and then

  • If then

  • If and then

  • If and then

  • , if then .

  • and iff (this defines a unique ).

  • If then for any .

  • , .

  • If then , .

We have . Consider the asymptotic regime , , . According to Theorem 1, we get

The probability of a catastrophe (i.e. ending up in state ) for the optimal policy for a given is . Therefore, the probability of a catastrophe for policy is . On the other hand, it is easy to see that the policy has a probability of catastrophe (and in particular regret ): it spends time "exploring" with a probability of a catastrophe on every step.
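The qualitative claim about exploring under a constant per-step risk can be checked with a one-line computation. The sketch below assumes a fixed per-step catastrophe probability `rate` that is independent across steps; the function name and the numbers are illustrative and are not the parameters of Example 1.

```python
def catastrophe_probability(rate, steps):
    """Probability that at least one of `steps` independent steps,
    each failing with probability `rate`, ends in catastrophe."""
    return 1.0 - (1.0 - rate) ** steps


# A short mitigation versus a long exploration phase at the same rate.
quick = catastrophe_probability(0.01, 1)
slow = catastrophe_probability(0.01, 1000)
```

When the number of steps is much larger than the inverse of the rate, the result is close to 1, whereas for a number of steps much smaller than the inverse of the rate it stays close to 0.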

Note that this example can be interpreted as a version of Christiano's approval-directed agent, if we regard the state as a "plan of action" that the advisor may either approve or not. But in this formalism, it is a special case of consequentialist reasoning.

Theorem 1 speaks of a finite set of environments, but as before (see Proposition 1 here and Corollary 3 here), there is a "structural" equivalent, i.e. we can use it to produce corollaries about Bayesian agents with priors over a countable set of environments. The difference is, in this case we consider asymptotic regimes in which the environment is also variable, so the probability weight of the environment in the prior will affect the regret bound. We leave out the details for now.

Appendix A

We start by deriving a more general and more precise version of the non-catastrophic regret bound, in which the optimal policy is replaced by an arbitrary "reference policy" (later it will be related to the mitigation policy) and the dependence on the MDPs is expressed via a bound on the derivative of by .

Definition A.1

Fix . Consider an MDP and policies , . is called -sane relatively to when for any



Lemma A.1

Fix an interface , and . Now, consider for each , an -universe which is an -realization of an MDP with state function and policies , . Consider also , and assume that

i. is -sane relatively to .

ii. For any and

Then, there exists an -policy s.t. for any

The -notation refers to the asymptotics where is fixed (so we don't explicitly examine its effect on regret) whereas , and the are variable and , .

The proof of Lemma A.1 is almost identical to the proof of the main theorem for "non-catastrophic" DRL, up to minor modifications needed to pass from absolute to relative regret and to track the contribution of the derivative of . We give it in Appendix B.

We will not apply Lemma A.1 directly to the universes of Theorem 1. Instead, we will define new universes using the following constructions.

Definition A.2

Consider a catastrophe MDP. We define the catastrophe MDP as follows.

  • , , .

  • For any :

  • For any , :

  • For any :

  • For any , .

  • For any , .

Now, consider an interface and a which is an -realization of a catastrophe MDP with state function . Denote , and . Denote the projection mapping and corresponding. We define the -universe and the function as follows

It is easy to see that is an -realization of with state function .

Definition A.3

Consider a catastrophe MDP. We define the catastrophe MDP as follows.

  • , , .

  • For any , .

Now, consider an interface and a which is an -realization of a catastrophe MDP with state function . We define the -universe as follows

It is easy to see that is an -realization of with state function .

Given , we will use the notation

Given an -policy , the -policy is defined by .

In order to utilize condition iii of Definition 8, we need to establish the following relation between and , .

Proposition A.2

Consider a catastrophe MDP, some and a proper mitigation policy. Then

For the purpose of the proof, the following notation will be convenient

Definition A.4

Consider a finite set and some . We define by

As is well known, the limit above always exists.

Proof of Proposition A.2

Consider any and . Since , we have

Let be either of and . Since , we get

Since is a mitigation policy, we get

Finally, since is proper, and . We conclude

Now we will establish a bound on the derivative of by in terms of expected mitigation time, in order to demonstrate condition ii of Lemma A.1.

Proposition A.3

Fix . Consider a catastrophe MDP and a proper mitigation policy with expected mitigation time . Assume that for any and

Then, for any and

Note that, since is a rational function of with no poles on the interval , some finite always exists. Note also that Proposition A.3 is really about Markov chains rather than MDPs, but we don't make it explicit to avoid introducing more notation.
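The remark that the value is a rational function of the discount factor can be seen in a two-state toy chain. The sketch below assumes the normalized value V(γ) = (1 − γ) Σ γⁿ E[rₙ]; the chain (a dangerous state with reward 0 that moves to an absorbing safe state with reward 1 with probability `p` per step) and all function names are hypothetical. In this chain V(γ) = γp / (1 − γ(1 − p)), whose only pole is at γ = 1/(1 − p) > 1, and whose derivative is p / (1 − γ(1 − p))², which tends to 1/p (the expected hitting time of the safe state) as γ → 1.

```python
def value_dangerous(gamma, p, iters=2000):
    """Normalized value of the dangerous state by fixed-point iteration.

    Bellman equation (assumed normalization): V = (1 - gamma) * r + gamma * E[V'].
    The safe state is absorbing with reward 1, so its value is exactly 1.
    """
    v_safe, v_danger = 1.0, 0.0
    for _ in range(iters):
        v_danger = (1 - gamma) * 0.0 + gamma * (p * v_safe + (1 - p) * v_danger)
    return v_danger


def closed_form(gamma, p):
    """The same value written as a rational function of gamma."""
    return gamma * p / (1 - gamma * (1 - p))


def closed_form_derivative(gamma, p):
    """Derivative of closed_form with respect to gamma."""
    return p / (1 - gamma * (1 - p)) ** 2


def numeric_derivative(gamma, p, h=1e-6):
    """Central finite difference of the iterated value."""
    return (value_dangerous(gamma + h, p) - value_dangerous(gamma - h, p)) / (2 * h)
```

For example, with `p = 0.25` the derivative stays below 1/p = 4 for every γ in [0, 1) and approaches it as γ → 1.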

Proof of Proposition A.3

Let be the Markov chain with transition matrix and initial state . For any , we have

Given , we define by

It is easy to see that can be rewritten as

The expression above is well defined because is a proper mitigation policy and therefore is finite with probability 1.

Let us decompose as defined as follows

We have

The second term can be regarded as a weighted average (since ), where the maximal term in the average is at most , hence

Also, we have

To transform the relative regret bounds for "auxiliary" universes obtained from Lemma A.1 to regret bounds for the original universes, we will need the following.

Definition A.5

Fix and a universe which is an -realization of a catastrophe MDP with state function . Let be a -perturbation of . An environment is said to be a -lift of to when

i. is an -realization of with state function .


iii. For any and , if then .

iv. For any and , if then there exists s.t.

It is easy to see that such a lift always exists, for example we can take:

Proposition A.4

Consider s.t