We introduce a reinforcement-like learning setting we call Delegative Inverse Reinforcement Learning (DIRL). In DIRL, the agent can, at any point in time, delegate the choice of action to an "advisor". The agent knows neither the environment nor the reward function, whereas the advisor knows both. Thus, DIRL can be regarded as a special case of CIRL. A similar setting was studied in Clouse 1997, but as far as we can tell, the relevant literature offers few theoretical results and virtually all researchers focus on the MDP case (please correct me if I'm wrong). On the other hand, we consider general environments (not necessarily MDP or even POMDP) and prove a natural performance guarantee.

The use of an advisor allows us to kill two birds with one stone: learning the reward function and safe exploration (i.e. avoiding both the Scylla of "Bayesian paranoia" and the Charybdis of falling into traps). We prove that, given a certain assumption about the advisor, a Bayesian DIRL agent (whose prior is supported on some countable set of hypotheses) is guaranteed to attain most of the value in the slow falling time discount (long-term planning) limit (assuming one of the hypotheses in the prior is true). The assumption about the advisor is quite strong, but the advisor is not required to be fully optimal: a "soft maximizer" satisfies the conditions. Moreover, we allow for the existence of "corrupt states" in which the advisor stops being a relevant signal, thus demonstrating that this approach can deal with wireheading and avoid manipulating the advisor, at least in principle (the assumption about the advisor is still unrealistically strong). Finally, we consider advisors that don't know the environment but have some beliefs about the environment, and show that in this case the agent converges to Bayes-optimality w.r.t. the advisor's beliefs, which is arguably the best we can expect.

All the proofs are in the Appendix.

Notation

The set of natural numbers N is defined to begin from 0. Given n∈N, [n] denotes the set {m∈N∣m<n}. Given a logical formula φ, [[φ]]∈{0,1} denotes its truth value.

Given a set X, we denote X∗:=⨆n∈NXn, the set of finite strings over the alphabet X. The unique string of length 0 is denoted λ. We also denote by Xω the set of infinite strings over alphabet X. Given α∈X∗⊔Xω and n∈N, αn∈X is the n-th symbol in α (i.e. α=α0α1α2…) and α:n∈X∗ is the prefix of length n of α (ending in αn−1). Given α,β∈X∗, |α|∈N is the length of α and αβ∈X∗ is the concatenation of α and β. The latter notation is also applicable when β∈Xω. The notation α⊑β means that α is a prefix of β. Given sets X,Y, x∈X and y∈Y, we sometimes use the notation xy=(x,y)∈X×Y. Given α∈(X×Y)∗, and n∈N, α:n+1/2∈(X×Y)∗×X is defined by α:n+1/2=(α:n,x) where αn=(x,y).

Given sets A and B, the notation f:A∘→B means that f is a partial mapping from A to B, i.e. a mapping from some set domf⊆A to B.

Given a topological space A, ΔA is the space of Borel probability measures on A. When no topology is specified, A is understood to be discrete: in this case, μ∈ΔA can also be regarded as a function from A to [0,1]. The space Xω is understood to have the product topology. Given topological spaces A,B, μ∈ΔA and ν∈ΔB, suppμ⊆A is the support of μ and μ×ν∈Δ(A×B) is the product measure. Given K a Markov kernel from A to B, μ⋉K∈Δ(A×B) is the semidirect product of μ and K. When A and B are discrete, H(μ) is the Shannon entropy of μ (in natural base) and DKL(μ∥ν) is the Kullback-Leibler divergence from ν to μ.

Given d∈N and x∈Rd, we use the notation

∥x∥1:=∑i<d|xi|

∥x∥∞:=maxi<d|xi|

The symbols o,O,ω,Ω,Θ will refer to the usual asymptotic notation.

Results

An interface I=(A,O) is a pair of finite sets ("actions" and "observations"). An I-policy is a function π:(A×O)∗→ΔA. An I-environment is a partial function μ:(A×O)∗×A∘→ΔO s.t.

i. {λ}×A⊆domμ

ii. Given h∈(A×O)∗ and aob∈A×O×A, haob∈domμ iff ha∈domμ and μ(ha)(o)>0.

It is easy to see that domμ is always of the form X×A for some X⊆(A×O)∗. We denote hdomμ:=X.

Given an I-policy π and an I-environment μ, we get μ⋈π∈Δ(A×O)ω in the usual way.

An I-reward function is a partial function r:(A×O)∗∘→[0,1]. An I-universe is a pair (μ,r) where μ is an I-environment and r is an I-reward function s.t. domr⊇hdomμ. We denote the space of I-universes by ΥI. Given an I-reward function r and t∈(0,∞), we have the associated utility function $U^r_t:(A\times O)^\omega \overset{\circ}{\to} [0,1]$ defined by

$$U^r_t(x) := \frac{\sum_{n=0}^{\infty} e^{-n/t}\, r(x_{:n})}{\sum_{n=0}^{\infty} e^{-n/t}}$$

Here and throughout, we use geometric time discount, however this choice is mostly for notational simplicity. More or less all results carry over to other shapes of the time discount function.
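For concreteness, the normalized discounted utility can be computed numerically. The following sketch is hypothetical code, not part of the formal development; the function name and the truncation horizon are our own choices:

```python
import math

def discounted_utility(rewards, t, horizon=10_000):
    """Approximate U^r_t by truncating both sums at `horizon` steps.

    rewards[n] plays the role of r(x_{:n}); the weight of step n is
    e^{-n/t}, so t sets the effective planning horizon.
    """
    weights = [math.exp(-n / t) for n in range(horizon)]
    total = sum(w * r for w, r in zip(weights, rewards))
    return total / sum(weights)

# Reward 1 on even steps, 0 on odd steps: for large t the utility
# approaches 1/2, since the discount weights become nearly uniform.
u = discounted_utility([1.0 if n % 2 == 0 else 0.0 for n in range(10_000)], t=100.0)
assert 0.49 < u < 0.52
```

Note that the normalization by the sum of weights keeps the utility in [0,1] regardless of the discount scale.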

Denote ΠI the space of I-policies. An I-metapolicy is a family {πt∈ΠI}t∈(0,∞), where the parameter t is thought of as setting the scale of the time discount. An I-meta-universe is a family {υt∈ΥI}t∈(0,∞). This latter concept is useful for analyzing multi-agent systems, where the environment contains other agents and we study the asymptotics when all agents' time discount scales go to infinity. We won't focus on the multi-agent case in this essay, but for future reference, it seems useful to make sure the results hold in the greater generality of meta-universes.

Given an I-policy π, an I-universe υ and t>0, we denote EUπυ(t):=Eμ⋈π[Urt] (this is well-defined since Urt is defined on the support of μ⋈π). We also denote EU∗υ(t):=maxπ∈ΠIEUπυ(t). We will omit I when it is obvious from the context.

Definition 1

Fix an interface I. Consider π∗ a metapolicy and H a set of meta-universes. π∗ is said to learn H when for any υ∈H

limt→∞(EU∗υt(t)−EUπ∗tυt(t))=0

H is said to be learnable when there exists π∗ that learns H.

Our notion of learnability is closely related to the notion of sublinear regret, as defined in Leike 2016, except that we allow the policy to explicitly depend on the time discount scale. This difference is important: for example, given a single universe υ, it might be impossible to achieve sublinear regret, but {υ} is always learnable.

Proposition 1

Fix an interface I. Consider H a countable learnable set of meta-universes. Consider any ζ∈ΔH s.t. suppζ=H. Consider πζ a ζ-Bayes optimal metapolicy, i.e.

πζt∈argmaxπ∈ΠEυ∼ζ[EUπυt(t)]

Then, πζ learns H.

Proposition 1 can be regarded as a "frequentist" justification for Bayesian agents: if any metapolicy at all is optimal in a "frequentist" sense for the class H (i.e. learns it), then the Bayes optimal metapolicy is such a metapolicy.

Another handy property of learnability is the following.

Proposition 2

Fix an interface I. Let H be a countable set of meta-universes s.t. any finite G⊆H is learnable. Then, H is learnable.

We now introduce the formalism needed to discuss advisors. Define ¯A:=A⊔{⊥}, ¯O:=¯A×O and ¯I:=(¯A,¯O). Here, the ¯A factor of ¯O is the action taken by the advisor, assumed to be observable by the agent. The environments we will consider are s.t. this action is ⊥ unless the agent delegated to the advisor at this point in time, which is specified by the agent taking action ⊥. It will also be the case that whenever the agent takes action ⊥, the advisor cannot take action ⊥.

Denote $\overline{A\times O} := \bar{A}\times\bar{O} \setminus \{\perp\perp o \mid o\in O\}$. Given $abo \in \overline{A\times O}$, we define $\underline{abo} \in A\times O$ by

$$\underline{abo} := \begin{cases} ao & \text{if } a \ne \perp \\ bo & \text{if } a = \perp \end{cases}$$

Given $h \in \overline{A\times O}^*$, we define $\underline{h} \in (A\times O)^*$ by $\underline{h}_n := \underline{h_n}$.
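In code, with `None` standing in for ⊥ (a representation choice of ours, not the text's), the underline operation simply selects the action that actually got executed:

```python
DELEGATE = None  # stands in for the symbol ⊥

def underline(a, b, o):
    """Map a triple (agent action a, advisor action b, observation o) in
    the restricted product to the effective (action, observation) pair:
    the advisor's action counts exactly when the agent delegated."""
    if a is DELEGATE and b is DELEGATE:
        raise ValueError("agent and advisor cannot both play ⊥")
    return (b, o) if a is DELEGATE else (a, o)

def underline_history(history):
    """Apply `underline` pointwise, turning a history over the delegative
    interface into the corresponding history of the base interface."""
    return [underline(a, b, o) for (a, b, o) in history]
```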

Definition 2

An ¯I-policy α is said to be autonomous when for any $h \in \overline{A\times O}^*$, α(h)(⊥)=0.

Consider an I-environment μ and an autonomous ¯I-policy α, which we think of as the advisor policy. We define the ¯I-environment ¯μ[α] as follows. For any $h \in \overline{A\times O}^*$ s.t. $\underline{h} \in \operatorname{hdom}\mu$, $a,b\in\bar{A}$ and $o\in O$:

$$\bar\mu[\alpha](ha)(bo) := \begin{cases} \mu(\underline{h}a)(o) & \text{if } a \ne \perp,\ b = \perp \\ \alpha(h)(b)\cdot\mu(\underline{h}b)(o) & \text{if } a = \perp,\ b \ne \perp \\ 0 & \text{if } a \ne \perp,\ b \ne \perp \text{ or } a = b = \perp \end{cases}$$

It is easy to see that the above is a well-defined ¯I-environment with $\operatorname{hdom}\bar\mu[\alpha] \subseteq \overline{A\times O}^*$.

Given an I-universe υ=(μ,r), we define the ¯I-reward function $\bar{r}: \overline{A\times O}^* \overset{\circ}{\to} [0,1]$ by $\bar{r}(h) := r(\underline{h})$ and the ¯I-universe ¯υ[α]:=(¯μ[α],¯r).
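The construction of ¯μ[α] can be sketched directly from the case analysis above. Here environments and policies are represented as functions returning probability dictionaries, with `None` for ⊥; these representation choices are ours, not the text's:

```python
def delegative_env(mu, alpha):
    """Build a sketch of ¯μ[α] from a base environment and advisor policy.

    mu(h, a) -> {observation: probability}, with h a history of
    (action, observation) pairs; alpha(h) -> {action: probability},
    with h a history of (a, b, o) triples. Returns bar_mu(h, a) as a
    dict {(advisor_action, observation): probability}.
    """
    def bar_mu(h, a):
        # Project the delegative history down to the base interface.
        base_h = tuple((y, o) if x is None else (x, o) for (x, y, o) in h)
        out = {}
        if a is not None:
            # The agent acts itself; the advisor plays ⊥ (None).
            for o, p in mu(base_h, a).items():
                out[(None, o)] = p
        else:
            # The agent delegates; the advisor samples b, then μ reacts to b.
            for b, q in alpha(h).items():
                for o, p in mu(base_h, b).items():
                    out[(b, o)] = out.get((b, o), 0.0) + q * p
        return out
    return bar_mu
```

Delegation thus mixes the base environment's response over the advisor's action distribution, while a direct action leaves the base probabilities untouched.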

We now introduce the conditions on the advisor policy which will allow us to prove a learnability theorem. First, we specify an advisor that always remains "approximately rational."

The notation Ex∼ρ[X∣h] will be used to mean Ex∼ρ[X∣h⊑x]. Given a universe υ=(μ,r) and t∈(0,∞) we define $V^\upsilon_t:\operatorname{hdom}\mu\to[0,1]$ and $Q^\upsilon_t:\operatorname{hdom}\mu\times A\to[0,1]$ by

$$V^\upsilon_t(h) := \max_{\pi\in\Pi} \operatorname{E}_{x\sim\mu\bowtie\pi}\!\left[\left.\frac{\sum_{n=|h|}^\infty e^{-n/t}\, r(x_{:n})}{\sum_{n=|h|}^\infty e^{-n/t}} \,\right|\, h\right]$$

$$Q^\upsilon_t(ha) := \left(1-e^{-1/t}\right) r(h) + e^{-1/t}\, \operatorname{E}_{o\sim\mu(ha)}\!\left[V^\upsilon_t(hao)\right]$$

Definition 3

Fix an interface I. Consider a universe υ=(μ,r). Let t,β∈(0,∞). A policy α is called strictly β-rational for (υ,t) when for any h∈hdomμ and a∈A

$$\alpha(h)(a) \le \exp\!\left[\beta\left(Q^\upsilon_t(ha) - V^\upsilon_t(h)\right)\right] \max_{a^*\in A}\alpha(h)(a^*)$$
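To illustrate why a soft maximizer qualifies: if the advisor plays actions with probability proportional to $e^{\beta Q}$, then $\alpha(h)(a)/\max_{a^*}\alpha(h)(a^*) = e^{\beta(Q(ha)-\max_{a^*}Q(ha^*))}$, which is exactly the required bound since $V(h)=\max_{a^*}Q(ha^*)$. A quick numeric check (hypothetical code; the Q-values are made up):

```python
import math

def softmax_advisor(q_values, beta):
    """A 'soft maximizer': plays action a with probability
    proportional to exp(beta * Q(a))."""
    weights = {a: math.exp(beta * q) for a, q in q_values.items()}
    z = sum(weights.values())
    return {a: w / z for a, w in weights.items()}

def is_strictly_beta_rational(policy, q_values, beta):
    """Check the Definition 3 inequality at a single history:
    alpha(a) <= exp(beta * (Q(a) - V)) * max_a' alpha(a'),
    with V = max_a Q(a)."""
    v = max(q_values.values())
    m = max(policy.values())
    return all(policy[a] <= math.exp(beta * (q_values[a] - v)) * m + 1e-12
               for a in q_values)

q = {"left": 0.3, "right": 0.7, "wait": 0.5}
assert is_strictly_beta_rational(softmax_advisor(q, beta=4.0), q, beta=4.0)
```

By contrast, a policy that puts too much mass on clearly suboptimal actions (e.g. a uniform policy checked against a large β) fails the inequality.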

Now we deal with the possibility of the advisor becoming "corrupt". In practical implementations where the "advisor" is a human operator, this can correspond to several types of events, e.g. sabotage of the channel that transmits data from the operator to the AI ("wireheading"), manipulation of the operator, or replacement of the operator by a different entity.

Definition 4

Fix an interface I. Consider a family {Ct⊆(A×O)∗}t∈(0,∞) s.t. for any h,g∈(A×O)∗, if h∈Ct then hg∈Ct. We think of Ct as the set of histories in which a certain event occurred. Consider a meta-universe υ=(μ,r). C is said to be a υ-avoidable event when there is a metapolicy π∗ and D:(0,∞)→N s.t.

i. limt→∞(EU∗υt(t)−EUπ∗tυt(t))=0

ii. D=ω(t)

iii. limt→∞Prx∼μt⋈π∗t[∃n≤D(t):x:n∈Ct]=0

That is, C is υ-avoidable when it is possible to avoid the event for a long time while retaining most of the value. Consider a meta-universe υ=(μ,r) and C a υ-avoidable event. Denote −υ:=(μ,1−r). We define the reward function rCt by

$$r^C_t(h) := \begin{cases} r_t(h) & \text{if } h \notin C_t \\ V^{-\upsilon}_t(h_{:n}) & \text{if } h \in C_t \text{ and } n = \min\{m\in\mathbb{N} \mid h_{:m}\in C_t\} \end{cases}$$

We think of rC as representing a process wherein once the event represented by C occurs, the agent starts minimizing the utility function. We also use the notation υC:=(μ,rC).

Definition 5

Consider a meta-universe υ=(μ,r) and β:(0,∞)→(0,∞), where we think of the argument of the function β as the time discount scale. An autonomous ¯I-metapolicy α is called β-rational for υ when there exists a υ-avoidable event C (that we think of as the advisor becoming "corrupt") and an autonomous ¯I-metapolicy α∗ (representing advisor policy conditional on non-corruption) s.t.

i. For any $h\in\overline{A\times O}^*$ s.t. $\underline{h}\notin C_t$, αt(h)=α∗t(h).

ii. α∗t is strictly β(t)-rational for (¯υCt[α∗],t).

Our definition of β-rationality requires the advisor to be extremely averse to corruption: the advisor behaves as if, once a corruption occurs, the agent policy becomes the worst possible. In general, this seems much too strong: by the time corruption occurs, the agent might have already converged to accurate beliefs about the universe that allow it to detect the corruption and keep operating without the advisor. Even better, the agent can usually outperform the worst possible policy using the prior alone. Moreover, we could allow corruption to depend on unobservable random events, and differentiate between different degrees of corruption, treating each accordingly. We leave these further improvements for future work.

We are now ready to formulate the main result.

Theorem

Consider H={υk}k∈N a countable family of I-meta-universes and β:(0,∞)→(0,∞) s.t. β(t)=ω(t2/3). Let {αk}k∈N be a family of autonomous ¯I-metapolicies s.t. for every k∈N, αk is β-rational for υk. Define ¯H:={¯υk[αk]}k∈N. Then, ¯H is learnable.

Some remarks:

By Proposition 1, ¯H is learned by any Bayes optimal metapolicy with prior supported on ¯H.

To get a feeling for the condition β(t)=ω(t2/3), consider an environment where the reward depends only on the last action and observation. In such an environment, an advisor that performs softmax (with constant parameter) on the next reward has β(t)=Θ(t). It is thus "more rational" than the required minimum.

It is easy to see that the Theorem can be generalized by introducing an external penalty (negative reward) for each time the agent delegates to the advisor: as it is, using the advisor already carries a penalty due to its suboptimal choice.

The conditions of the Theorem imply that, in some sense, the advisor "knows" the true environment. This is unrealistic: obviously, we expect the human operator to have some (!) uncertainty about the world. However, we clearly cannot do away with this assumption: if the same action triggers a trap in some universes and is necessary for approaching maximal utility in other universes, and there is no observable difference between the universes before the action is taken, then there is no way to guarantee optimality. The prior knowledge you have about the universe caps the utility you can guarantee to obtain. On the other hand, as an AI designer, one can reasonably expect the AI to do at least as well as possible using the designer's own knowledge. If running the AI is the designer's best strategy, the AI should be close to Bayes optimal (in some sense that includes computational bounds et cetera: complications that we currently ignore) with respect to the designer's posterior rather than with respect to some simple prior. In other words, we need a way to transmit the designer's knowledge to the AI, without hand-crafting an elaborate prior.

The following shows that DIRL achieves this goal (theoretically, given the considerable simplifying assumptions).

Given an environment μ, we define ^μ:(A×O)∗→[0,1] as follows. For h=a0o0a1o1…an−1on−1∈hdomμ

^μ(h):=∏m<nμ(h:mam)(om)

For h∈(A×O)∗∖hdomμ, ^μ(h):=0.

Given a family of environments {μk}k∈N and ξ∈ΔN, we will use the notation Ek∼ξ[μk] to denote the environment given by

hdomEk∼ξ[μk]:=⋃k∈suppξhdomμk

$$\operatorname{E}_{k\sim\xi}[\mu_k](ha) := \frac{\operatorname{E}_{k\sim\xi}\!\left[\hat\mu_k(h)\,\mu_k(ha)\right]}{\operatorname{E}_{k\sim\xi}\!\left[\hat\mu_k(h)\right]}$$
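As a sketch (with our own representation: environments as functions returning probability dictionaries, and ξ as a list of weights), $\hat\mu$ and the mixture environment can be computed directly from the two formulas above:

```python
def history_probability(mu, h):
    """^μ(h): the probability μ assigns to the observations in the
    history h (a tuple of (action, observation) pairs), given its actions."""
    p = 1.0
    for n in range(len(h)):
        a, o = h[n]
        p *= mu(h[:n], a).get(o, 0.0)
    return p

def mixture_env(envs, xi):
    """E_{k~ξ}[μ_k]: next-observation probabilities are the components'
    predictions, averaged with posterior weights ξ(k)·^μ_k(h)."""
    def mu_mix(h, a):
        weights = [xi[k] * history_probability(envs[k], h)
                   for k in range(len(envs))]
        z = sum(weights)
        out = {}
        for k, w in enumerate(weights):
            if w == 0.0:
                continue
            for o, p in envs[k](h, a).items():
                out[o] = out.get(o, 0.0) + (w / z) * p
        return out
    return mu_mix
```

For example, mixing a "always heads" and an "always tails" environment with equal weights predicts 50/50 on the empty history, and predicts heads with certainty once heads has been observed, exactly as the posterior weighting dictates.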

Corollary 1

Consider {μk}k∈N a countable family of I-meta-environments, {rK}K∈N a countable family of reward functions and {ξK∈ΔN}K∈N s.t. given k∈suppξK, domrK⊇hdomμk. We think of ξK as the advisor's belief about the environment in universe K. Let β:(0,∞)→(0,∞) be s.t. β(t)=ω(t2/3) and {αK}K∈N be a family of autonomous ¯I-metapolicies s.t. for every K∈N, αK is β-rational for (Ek∼ξK[μk],rK). Let ζ∈ΔN2 be s.t. for any J,j∈N

$$\Pr_{(K,k)\sim\zeta}[K=J] > 0$$

$$\Pr_{(K,k)\sim\zeta}[k=j \mid K=J] = \xi_J(j)$$

We think of ζ as the agent's prior and the equation above as stating the agent's belief that the advisor's beliefs are "calibrated". Consider πζ a ζ-Bayes optimal ¯I-metapolicy, i.e.

$$\pi^\zeta_t \in \operatorname*{argmax}_{\pi\in\Pi_{\bar I}} \operatorname{E}_{(K,k)\sim\zeta}\!\left[\mathrm{EU}^\pi_{\bar\mu_k[\alpha_K],\bar r_K}(t)\right]$$

Then, for every K∈N

$$\lim_{t\to\infty}\left(\max_{\pi\in\Pi_I} \operatorname{E}_{k\sim\xi_K}\!\left[\mathrm{EU}^\pi_{\mu_k,r_K}(t)\right] - \operatorname{E}_{k\sim\xi_K}\!\left[\mathrm{EU}^{\pi^\zeta_t}_{\bar\mu_k[\alpha_K],\bar r_K}(t)\right]\right)=0$$

If we happen to be so lucky that the advisor's (presumably justified) belief is supported on a learnable environment class, we get a stronger conclusion.

Corollary 2

In the setting of Corollary 1, fix K∈N. Define the set of meta-universes HK by

HK:={(μk,rK)∣k∈suppξK}

Assume HK is learnable. Then, for every k∈N

$$\lim_{t\to\infty}\left(\mathrm{EU}^*_{\mu_k,r_K}(t) - \mathrm{EU}^{\pi^\zeta_t}_{\bar\mu_k[\alpha_K],\bar r_K}(t)\right)=0$$

We also believe that to some extent DIRL is effective against acausal attack. Indeed, the optimality we get from the Theorem + Proposition 1 holds for any prior. However, the speed of convergence to optimality certainly depends on the prior. It is therefore desirable to analyze this dependency and bound the damage an adversary can do by controlling a certain portion of the prior. We leave this for future work.

Appendix

Proposition A.0

Fix an interface I. Consider H a countable learnable set of meta-universes. Consider any ζ∈ΔH s.t. suppζ=H. Consider πζ a metapolicy s.t.

limt→∞(maxπ∈ΠEυ∼ζ[EUπυt(t)]−Eυ∼ζ[EUπζtυt(t)])=0

Then, πζ learns H.

Proof of Proposition A.0

Fix π∗ a metapolicy that learns H. Consider ϵ>0 and let Hϵ⊆H be finite s.t. ζ(H∖Hϵ)<ϵ. For t≫0 and every υ∈Hϵ we have

$$\mathrm{EU}^{\pi^*_t}_{\upsilon_t}(t) \ge \mathrm{EU}^*_{\upsilon_t}(t) - \epsilon$$

Also

$$\operatorname{E}_{\upsilon\sim\zeta}\left[\mathrm{EU}^{\pi^\zeta_t}_{\upsilon_t}(t)\right] \ge \operatorname{E}_{\upsilon\sim\zeta}\left[\mathrm{EU}^{\pi^*_t}_{\upsilon_t}(t)\right] - \epsilon \ge \operatorname{E}_{\upsilon\sim\zeta}\left[\mathrm{EU}^{\pi^*_t}_{\upsilon_t}(t);\ \upsilon\in H_\epsilon\right] - \epsilon$$

Combining, we get

$$\operatorname{E}_{\upsilon\sim\zeta}\left[\mathrm{EU}^{\pi^\zeta_t}_{\upsilon_t}(t)\right] \ge \operatorname{E}_{\upsilon\sim\zeta}\left[\mathrm{EU}^*_{\upsilon_t}(t);\ \upsilon\in H_\epsilon\right] - 2\epsilon$$

$$\operatorname{E}_{\upsilon\sim\zeta}\left[\mathrm{EU}^{\pi^\zeta_t}_{\upsilon_t}(t)\right] \ge \operatorname{E}_{\upsilon\sim\zeta}\left[\mathrm{EU}^*_{\upsilon_t}(t)\right] - \operatorname{E}_{\upsilon\sim\zeta}\left[\mathrm{EU}^*_{\upsilon_t}(t);\ \upsilon\notin H_\epsilon\right] - 2\epsilon$$

By definition of Hϵ, this implies

$$\operatorname{E}_{\upsilon\sim\zeta}\left[\mathrm{EU}^{\pi^\zeta_t}_{\upsilon_t}(t)\right] \ge \operatorname{E}_{\upsilon\sim\zeta}\left[\mathrm{EU}^*_{\upsilon_t}(t)\right] - 3\epsilon$$

For any υ∈H, we get

$$\mathrm{EU}^{\pi^\zeta_t}_{\upsilon_t}(t) \ge \mathrm{EU}^*_{\upsilon_t}(t) - \frac{3\epsilon}{\zeta(\upsilon)}$$

Taking ϵ to 0, we get the desired result.

Proof of Proposition 1

Immediate from Proposition A.0.

Proof of Proposition 2

Let H={υk}k∈N. For each k∈N, let πk learn {υl}l<k. Choose {tk∈(0,∞)}k∈N s.t.

i. t0=0

ii. tk<tk+1

iii. limk→∞tk=∞

iv. For any l<k and t≥tk, $\mathrm{EU}^{\pi^k_t}_{\upsilon^l_t}(t) \ge \mathrm{EU}^*_{\upsilon^l_t}(t) - \frac{1}{k+1}$.

Now define $\pi^*_t := \pi^{\max\{k \,\mid\, t \ge t_k\}}_t$. Clearly, π∗ learns H.

Proposition A.1

Consider d∈N and x,y∈Rd∖0. Then

$$\left\lVert \frac{x}{\lVert x\rVert_\infty} - \frac{y}{\lVert y\rVert_\infty} \right\rVert_\infty \le 2d\, \left\lVert \frac{x}{\lVert x\rVert_1} - \frac{y}{\lVert y\rVert_1} \right\rVert_1$$

Proof of Proposition A.1

Without loss of generality, assume ∥x∥1=∥y∥1=1. For any i<d, we have

$$\frac{x_i}{\lVert x\rVert_\infty} - \frac{y_i}{\lVert y\rVert_\infty} = \frac{x_i \lVert y\rVert_\infty - y_i \lVert x\rVert_\infty}{\lVert x\rVert_\infty \lVert y\rVert_\infty} = \frac{x_i \lVert y\rVert_\infty - x_i \lVert x\rVert_\infty + x_i \lVert x\rVert_\infty - y_i \lVert x\rVert_\infty}{\lVert x\rVert_\infty \lVert y\rVert_\infty}$$

$$\frac{x_i}{\lVert x\rVert_\infty} - \frac{y_i}{\lVert y\rVert_\infty} = \frac{x_i \left(\lVert y\rVert_\infty - \lVert x\rVert_\infty\right) + \left(x_i - y_i\right)\lVert x\rVert_\infty}{\lVert x\rVert_\infty \lVert y\rVert_\infty}$$

Denote r:=∥x−y∥1. Obviously, ∥x−y∥∞≤r and therefore |xi−yi|≤r and |∥x∥∞−∥y∥∞|≤r. We get

$$\left\lvert \frac{x_i}{\lVert x\rVert_\infty} - \frac{y_i}{\lVert y\rVert_\infty} \right\rvert \le \frac{|x_i|\, r + r\, \lVert x\rVert_\infty}{\lVert x\rVert_\infty \lVert y\rVert_\infty} \le \frac{\lVert x\rVert_\infty r + r \lVert x\rVert_\infty}{\lVert x\rVert_\infty \lVert y\rVert_\infty} \le \frac{2r}{\lVert y\rVert_\infty}$$

Since ∥y∥1=1, ∥y∥∞≥1/d, yielding the desired result.

Proposition A.2

Consider H a finite set, L:H→[0,∞), ζ∈ΔH and β,ϵ∈(0,∞). Assume that

i. For any k∈H, ζ(k)≥βϵ.

ii. Eζ[L]≥ϵ

Then

Eζ[e−βL]≤1−(1−e−1)βϵ

Proof of Proposition A.2

Without loss of generality, we can assume that Eζ[L]=ϵ, because otherwise we can rescale L by a constant in (0,1) which will only make Eζ[e−βL] larger. It now follows from conditions i+ii that for any k, L(k)≤β−1 and therefore βL(k)∈[0,1]. We have

$$\operatorname{E}_\zeta\!\left[e^{-\beta L}\right] \le \operatorname{E}_\zeta\!\left[1 - \left(1-e^{-1}\right)\beta L\right] = 1 - \left(1-e^{-1}\right)\beta\epsilon$$

Here the inequality uses the convexity bound $e^{-z} \le 1 - (1-e^{-1})z$ for $z\in[0,1]$, applied to z=βL(k), yielding the desired result.

Lemma A

Consider the setting of the Theorem, but assume that H={υk=(μk,rk)}k<N for some N∈N (i.e. it is finite) and that αkt is strictly β(t)-rational for ¯υk[αk]. Denote νk:=¯μk[αk] and hdom¯H:=⋃k<Nhdomνk. Denote ζ0∈Δ[N] the uniform probability distribution. For any $t>N^3$, define ζt,~ζt:hdom¯H→Δ[N] recursively as follows

~ζt(λ):=ζ0

$$\tilde\zeta_t(hao)(k) := \begin{cases} \tilde Z_t(h)^{-1}\cdot\zeta_t(h)(k)\cdot\nu^k(ha)(o) & \text{if } \exists j: \zeta_t(h)(j)\cdot\nu^j(ha)(o) > 0 \\ N^{-1} & \text{otherwise} \end{cases}$$

$$\zeta_t(h)(k) := Z_t(h)^{-1}\,\tilde\zeta_t(h)(k)\,\left[\left[\tilde\zeta_t(h)(k) > t^{-1/3}\right]\right]$$

In the above, Zt(h) and ~Zt(h) are normalization factors chosen to make the probabilities sum to 1. That is, ζt(h) is obtained by starting from the prior ζ0, updating on every observation, and setting to 0 the probability of any universe whose posterior probability drops below $t^{-1/3}$. When encountering an "impossible" observation we reset to the uniform distribution, but this choice is arbitrary.
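One step of this truncated update can be written out as follows (an illustrative sketch; distributions are lists over the N hypotheses and `likelihoods[k]` stands for ν^k(ha)(o)):

```python
def update_posterior(zeta, likelihoods, threshold):
    """Bayesian update with truncation, as in the recursion above:
    multiply by the likelihoods, renormalize, zero out any hypothesis
    whose posterior falls below `threshold` (t^{-1/3} in the text),
    and renormalize again. On an "impossible" observation, reset to
    uniform, mirroring the (arbitrary) choice in the text.
    """
    n = len(zeta)
    raw = [z * l for z, l in zip(zeta, likelihoods)]
    total_raw = sum(raw)
    if total_raw == 0.0:
        return [1.0 / n] * n
    tilde = [r / total_raw for r in raw]
    # For t > N^3 the threshold t^{-1/3} is below 1/N, so the largest
    # posterior always survives and the final normalization is safe.
    kept = [p if p > threshold else 0.0 for p in tilde]
    total = sum(kept)
    return [p / total for p in kept]
```

The truncation is what gives the agent a nontrivial lower bound on the posterior of every surviving hypothesis, which the proof uses when relating the advisor's action probabilities to the expected loss.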

Define the "loss function" Lt:hdom¯H×A→[0,1] by

$$L_t(ha) := \operatorname{E}_{k\sim\zeta_t(h)}\!\left[V^{\upsilon^k_t}_t(\underline{h}) - Q^{\upsilon^k_t}_t(\underline{h}a)\right]$$

Denote $\epsilon_t := \beta(t)^{-1} t^{-1/3}$. Define the following ¯I-metapolicy π∗:

$$\pi^*_t(h) := \begin{cases} \operatorname*{argmin}_{a\in A} L_t(ha) & \text{if } h \in \operatorname{hdom}\bar H \text{ and } \min_{a\in A} L_t(ha) < \epsilon_t \\ \perp & \text{if } h \in \operatorname{hdom}\bar H \text{ and } \min_{a\in A} L_t(ha) \ge \epsilon_t \\ \perp & \text{if } h \notin \operatorname{hdom}\bar H \end{cases}$$

(Technically, we only defined π∗t for t>N3, but it doesn't matter.) Then, π∗ learns ¯H.
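The metapolicy π∗ above is, in effect, a simple decision rule: act on the posterior-expected loss when some action is demonstrably near-optimal, and delegate otherwise. A sketch, with our own function signatures and `None` for ⊥:

```python
def dirl_action(h, in_hdom, loss, actions, epsilon):
    """One step of the Lemma A metapolicy (illustrative sketch).

    loss(h, a) is the posterior-expected loss L_t(ha); epsilon is the
    delegation threshold epsilon_t. Play the loss-minimizing action if
    its loss is below the threshold, otherwise delegate (return None).
    Off the support of the hypothesis class we also delegate (the
    choice there does not matter for the guarantee).
    """
    if not in_hdom(h):
        return None
    losses = {a: loss(h, a) for a in actions}
    best = min(losses, key=losses.get)
    return best if losses[best] < epsilon else None
```

Raising the threshold makes the agent more autonomous at the cost of larger per-step loss; the choice of ε_t balances these against the advisor's β(t)-rationality.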

We introduce a reinforcement-like learning setting we call

Delegative Inverse Reinforcement Learning(DIRL). In DIRL, the agent can, at any point of time, delegate the choice of action to an "advisor". The agent knows neither the environment nor the reward function, whereas the advisor knows both. Thus, DIRL can be regarded as a special case of CIRL. A similar setting was studied in Clouse 1997, but as far as we can tell, the relevant literature offers few theoretical results and virtually all researchers focus on the MDP case (please correct me if I'm wrong). On the other hand, we consider general environments (not necessarily MDP or even POMDP) and prove a natural performance guarantee.The use of an advisor allows us to kill two birds with one stone: learning the reward function and safe exploration (i.e. avoiding both the Scylla of "Bayesian paranoia" and the Charybdis of falling into traps). We prove that, given certain assumption about the advisor, a Bayesian DIRL agent (whose prior is supported on some countable set of hypotheses) is guaranteed to attain most of the value in the slow falling time discount (long-term planning) limit (assuming one of the hypotheses in the prior is true). The assumption about the advisor is quite strong, but the advisor is not required to be fully optimal: a "soft maximizer" satisfies the conditions. Moreover, we allow for the existence of "corrupt states" in which the advisor stops being a relevant signal, thus demonstrating that this approach can deal with wireheading and avoid manipulating the advisor, at least in principle (the assumption about the advisor is still unrealistically strong). Finally we consider advisors that don't know the environment but have some

beliefsabout the environment, and show that in this case the agent converges to Bayes-optimality w.r.t. the advisor's beliefs, which is arguably the best we can expect.All the proofs are in the Appendix.

## Notation

The set of natural numbers N is defined to begin from 0. Given n∈N, [n] denotes the set {m∈N∣m<n}. Given a logical formula φ, [[φ]]∈{0,1} denotes its truth value.

Given a set X, we denote X∗:=⨆n∈NXn, the set of finite strings over the alphabet X. The unique string of length 0 is denoted λ. We also denote by Xω the set of

infinitestrings over alphabet X. Given α∈X∗⊔Xω and n∈N, αn∈X is the n-th symbol in α (i.e. α=α0α1α2…) and α:n∈X∗ is the prefix of length n of α (ending in αn−1). Given α,β∈X∗, |α|∈N is the length of α and αβ∈X∗ is the concatenation of α and β. The latter notation is also applicable when β∈Xω. The notation α⊑β means that α is a prefix of β. Given sets X,Y, x∈X and y∈Y, we sometimes use the notation xy=(x,y)∈X×Y. Given α∈(X×Y)∗, and n∈N, α:n+1/2∈(X×Y)∗×X is defined by α:n+1/2=(α:n,x) where αn=(x,y).Given sets A and B, the notation f:A∘→B means that f is a

partialmapping from A to B, i.e. a mapping from some set domf⊆A to B.Given a topological space A, ΔA is the space of Borel probability measures on A. When no topology is specified, A is understood to be discrete: in this case, μ∈ΔA can also be regarded as a function from A to [0,1]. The space Xω is understood to have the product topology. Given topological spaces A,B, μ∈ΔA and ν∈ΔB, suppμ⊆A is the support of μ and μ×ν∈Δ(A×B) is the product measure. Given K a Markov kernel from A to B, μ⋉A∈Δ(A×B) is the semidirect product of μ and K. When A and B are discrete, H(μ) is the Shannon entropy of μ (in natural base) and DKL(μ∥ν) is the Kullback-Leibler divergence from ν to μ.

Given d∈N and x∈Rd, we use the notation

∥x∥1:=∑i<d|xi|

∥x∥∞:=maxi<d|xi|

The symbols o,O,ω,Ω,Θ will refer to usual O-notation.

## Results

An

interfaceI=(A,O) is a pair of finite sets ("actions" and "observations"). An I-policyis a function π:(A×O)∗→ΔA. An I-environmentis a partial function μ:(A×O)∗×A∘→ΔO s.t.i. λ×A⊆domμ

ii. Given h∈(A×O)∗ and aob∈A×O×A, haob∈domμ iff ha∈domμ and μ(ha)(o)>0.

It is easy to see that domμ is always of the form X×A for some X⊆(A×O)∗. We denote hdomμ:=X.

Given an I-policy π and an I-environment μ, we get μ⋈π∈Δ(A×O)ω in the usual way.

An I-

reward functionis a partial function r:(A×O)∗∘→[0,1]. An I-universeis a pair (μ,r) where μ is an I-environment and r is an I-reward function s.t. domr⊇hdomμ. We denote the space of I-universes by ΥI. Given an I-reward function r and t∈(0,∞), we have the associatedutility functionUrt:(A×O)ω∘→[0,1] defined byUrt(x):=∑∞n=0e−n/tr(x:n)∑∞n=0e−n/t

Here and throughout, we use geometric time discount, however this choice is mostly for notational simplicity. More or less all results carry over to other shapes of the time discount function.

Denote ΠI\ the space of I-policies. An I-

metapolicyis a family {πt∈ΠI}t∈(0,∞), where the parameter t is thought of as setting the scale of the time discount. An I-meta-universeis a family {υt∈Υ}t∈(0,∞). This latter concept is useful for analyzing multi-agent systems, where the environment contains other agents and we study the asymptotics when all agents' time discount scales go to infinity. We won't focus on the multi-agent case in this essay, but for future reference, it seems useful to make sure the results hold in the greater generality of meta-universes.Given an I-policy π, an I-universe υ and t>0, we denote EUπυ(t):=Eμ⋈π[Urt] (this is well-defined since Urt is defined on the support of μ⋈π). We also denote EU∗υ(t):=maxπ∈ΠIEUπυ(t). We will omit I when it is obvious from the context.

## Definition 1

Fix an interface I. Consider π∗ a metapolicy and H a set of meta-universes. π∗ is said to

learnH when for any υ∈Hlimt→∞(EU∗υt(t)−EUπ∗tυt(t))=0

H is said to be

learnablewhen there exists π∗ that learns H.Our notion of learnability is closely related to the notion of

sublinear regret, as defined in Leike 2016, except that we allow the policy to explicitly depend on the time discount scale. This difference is important: for example, given asingleuniverse υ, it might be impossible to achieve sublinear regret, but {υ} is always learnable.## Proposition 1

Fix an interface I. Consider H a countable learnable set of meta-universes. Consider any ζ∈ΔH s.t. suppζ=H. Consider πζ a ζ-Bayes optimal metapolicy, i.e.

πζt∈argmaxπ∈ΠEυ∼ζ[EUπυt(t)]

Then, πζ learns H.

Proposition 1 can be regarded as a "frequentist" justification for Bayesian agents: if

anymetapolicy is optimal in a "frequentist" sense for the class H (i.e. learns it), then the Bayes optimal metapolicy is such.Another handy property of learnability is the following.

## Proposition 2

Fix an interface I. Let H be a countable set of meta-universes s.t. any finite G⊆H is learnable. Then, H is learnable.

We now introduce the formalism needed to discuss advisors. Define ¯A:=A⊔{⊥}, ¯O:=¯A×O and ¯I:=(¯A,¯O). Here, the ¯A factor of ¯O is the action taken by the advisor, assumed to be observable by the agent. The environments we will consider are s.t. this action is ⊥ unless the agent delegated to the advisor at this point of time, which is specified by the agent taking action ⊥. It will also be the case that whenever the agent takes action ⊥, the advisor cannot take action ⊥.

Denote ¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯A×O:=¯A×¯O∖{⊥⊥o∣o∈O}. Given abo∈¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯A×O, we define abo––––∈A×O by

abo––––:={ao if a≠⊥bo if a=⊥

Given h∈¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯A×O∗, we define h––∈(A×O)∗ by h––n:=hn–––.

## Definition 2

An ¯I-policy α is said to be

autonomouswhen for any h∈¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯A×O∗, α(h)(⊥)=0.Consider an I-environment μ and an autonomous ¯I-policy α, which we think of as the advisor policy. We define the ¯I-environment ¯μ[α] as follows. For any h∈¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯A×O∗ s.t. h––∈hdomμ, a,b∈¯A and o∈O:

¯μ[α](ha)(bo):=⎧⎨⎩μ(h––a)(o) if a≠⊥,b=⊥α(h)(b)⋅μ(h––b)(o) if a=⊥,b≠⊥0 if a≠⊥,b≠⊥ or a=b=⊥

It is easy to the above is a well-defined ¯I-environment with hdomμ⊆¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯A×O∗.

Given an I-universe υ=(μ,r), we define the ¯I-reward function ¯r:¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯A×O∗∘→[0,1] by ¯r(h):=r(h––) and the ¯I-universe ¯υ[α]:=(¯μ[α],¯r).

We now introduce the conditions on the advisor policy which will allow us to prove a learnability theorem. First, we specify an advisor that always remains "approximately rational."

The notation Ex∼ρ[X∣h] will be used to mean Ex∼ρ[X∣h⊑x]. Given a universe υ=(μ,r) and t∈(0,∞) we define Vυt:hdomμ→[0,1] and Qυt:hdomμ×A→[0,1] by

Vυt(h):=maxπ∈ΠEx∼μ⋈π[∑∞n=|h|e−n/tr(x:n)∑∞n=|h|e−n/t∣h]

Qυt(ha):=(1−e−1/t)r(h)+e−1/tEo∼μ(ha)[Vυt(hao)]

## Definition 3

Fix an interface I. Consider a universe υ=(μ,r). Let t,β∈(0,∞). A policy α is called

strictly β-rationalfor (υ,t) when for any h∈domμ and a∈Aα(h)(a)≤exp[β(Qυt(ha)−Vυt(h))]maxa∗∈Aα(h)(a∗)

Now we deal with the possibility of the advisor becoming "corrupt". In practical implementations where the "advisor" is a human operator, this can correspond to several types of events, e.g. sabotaging the channel that transmits data from the operator to the AI ("wireheading"), manipulation of the operator or replacement of the operator by a different entity.

## Definition 4

Fix an interface I. Consider a family {Ct⊆(A×O)∗}t∈(0,∞) s.t. for any h,g∈(A×O)∗, if h∈Ct then hg∈Ct. We think of Ct as the set of histories in which a certain event occurred. Consider a meta-universe υ=(μ,r). C is said to be a υ-

avoidable eventwhen there is a meta-policy π∗ and D:(0,∞)→N s.t.i. limt→∞(EU∗υt(t)−EUπ∗tυt(t))=0

ii. D=ω(t)

iii. limt→∞Prx∼μt⋈π∗t[∃n≤D(t):x:n∈Ct]=0

That is, C is υ-avoidable when it is possible to avoid the event for a long time while retaining most of the value. Consider a meta-universe υ=(μ,r) and C a υ-avoidable event. Denote −υ:=(μ,1−r). We define the reward function rCt by

rCt(h):={rt(h) if h∉CtV−υt(h:n) if h∈Ct and n=min{m∈N∣h:m∈Ct}

We think of rC as representing a process wherein once the event represented by C occurs, the agent starts

minimizingthe utility function. We also use the notation υC:=(μ,rC).## Definition 5

Consider a meta-universe υ=(μ,r) and β:(0,∞)→(0,∞), where we think of the argument of the function β as the time discount scale. An autonomous ¯I-metapolicy α is called

β-rationalfor υ when there exists a υ-avoidable event C (that we think of as the advisor becoming "corrupt") and an autonomous ¯I-metapolicy α∗ (representing advisor policy conditional on non-corruption) s.t.i. For any h∈¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯A×O∗ s.t. h––∉Ct, αt(h)=α∗t(h).

ii. α∗t is strictly β(t)-rational for (¯υCt[α∗],t).

Our definition of β-rationality requires the advisor to be extremely averse to corruption: the advisor behaves as if, once a corruption occurs, the agent policy becomes the worst possible. In general, this seems much too strong: by the time corruption occurs, the agent might have already converged into accurate beliefs about the universe that allow it to detect the corruption and keep operating without the advisor. Even better, the agent can usually outperform

the worst possible policyusing the prior alone. Moreover, we can allow for corruption to depend on unobservable random events and differentiate between different degrees of corruption and treat them accordingly. We leave those further improvements for future work.We are now ready to formulate the main result.

## Theorem

Consider H={υk}k∈N a countable family of I-meta-universes and β:(0,∞)→(0,∞) s.t. β(t)=ω(t2/3). Let {αk}k∈N be a family of autonomous ¯I-metapolicies s.t. for every k∈N, αk is β-rational for υk. Define ¯H:={¯υk[αk]}k∈N. Then, ¯H is learnable.

Some remarks:

By Proposition 1, ¯H is learned by any Bayes optimal metapolicy with prior supported on ¯H.

To get a feeling for the condition β(t)=ω(t2/3), consider an environment where the reward depends only on the last action and observation. In such an environment, an advisor that performs softmax (with constant parameter) on the next reward has β(t)=Θ(t). It is thus "more rational" than the required minimum.

It is easy to see that the Theorem can be generalized by introducing an external penalty (negative reward) for each time the agent delegates to the advisor: as it is, using the advisor already carries a penalty due to its suboptimal choice.

The conditions of the Theorem imply that, in some sense, the advisor "knows" the true environment. This is unrealistic: obviously, we expect the human operator to have some (!) uncertainty about the world. However, we clearly cannot do away with this assumption: if the same action triggers a trap in some universes and is necessary for approaching maximal utility in other universes, and there is no observable difference between the universes before the action is taken, then there is no way to guarantee optimality. The prior knowledge you have about the universe caps the utility you can guarantee to obtain. On the other hand, as an AI designer, one can reasonably expect the AI to do at least as well as possible using the

designer'sown knowledge. If running the AI is the designer's best strategy, the AI should be close to Bayes optimal (in some sense that includes computational bounds et cetera: complications that we currently ignore) with respect to the designer's posterior rather than with respect to some simple prior. In other words, we need a way to transmit the designer's knowledge to the AI, without hand-crafting an elaborate prior.The following shows that the DIRL achieves this goal (theoretically, given the considerable simplifying assumptions).

Given an environment μ, we define ^μ:(A×O)∗→[0,1] as follows. For h=a0o0a1o1…an−1on−1∈hdomμ

^μ(h):=∏m<nμ(h:mam)(om)

For h∈(A×O)∗∖hdomμ, ^μ(h):=0.

Given a family of environments {μk}k∈N and ξ∈ΔN, we will use the notation Ek∼ξ[μk] to denote the environment given by

hdomEk∼ξ[μk]:=⋃k∈suppξhdomμk

Ek∼ξ[μk](ha):=Ek∼ξ[^μk(h)μk(ha)]Ek∼ξ[^μk(h)]

## Corollary 1

Consider {μk}k∈N a countable family of I-meta-environments, {rK}K∈N a countable family of reward functions and {ξK∈ΔN}K∈N s.t. given k∈suppξK, domrK⊇hdomμk. We think of ξK as the advisor's belief about the environment in universe K. Let β:(0,∞)→(0,∞) be s.t. β(t)=ω(t2/3) and {αK}K∈N be a family of autonomous ¯I-metapolicies s.t. for every K∈N, αK is β-rational for (Ek∼ξK[μk],rK). Let ζ∈ΔN2 be s.t. for any J,j∈N

Pr(K,k)∈ζ[K=J]>0

Pr(K,k)∈ζ[k=j∣K=J]=ξJ(j)

We think of ζ as the agent's prior and the equation above as stating the agent's belief that the advisor's beliefs are "calibrated". Consider πζ a ζ-Bayes optimal ¯I-metapolicy, i.e.

πζ∈argmaxπ∈Π¯IE(K,k)∈ζ[EUπ¯μk[αK],¯rK(t)] Then, for every K∈N

limt→∞(maxπ∈ΠIEk∈ξK[EUπμk,rK(t)]−Ek∈ξK[EUπζ¯μk[αK],¯rK(t)])=0

If we happen to be so lucky that the advisor's (presumably justified) belief is supported on a learnable environment class, we get a stronger conclusion.

## Corollary 2

In the setting of Corollary 1, fix K∈N. Define the set of meta-universes HK by

HK:={(μk,rK)∣k∈suppξK}

Assume HK is learnable. Then, for every k∈N

limt→∞(EU∗μk,rK(t)−EUπζ¯μk[αK],¯rK(t))=0

We also believe that to some extent DIRL is effective against acausal attack. Indeed, the optimality we get from the Theorem + Proposition 1 holds for any prior. However, the speed of convergence to optimality certainly depends on the prior. It is therefore desirable to analyze this dependency and bound the damage an adversary can do by controlling a certain portion of the prior. We leave this for future work.

## Appendix

## Proposition A.0

Fix an interface $I$. Consider $H$ a countable learnable set of meta-universes. Consider any $\zeta\in\Delta H$ s.t. $\operatorname{supp}\zeta=H$. Consider $\pi^\zeta$ a metapolicy s.t.

$$\lim_{t\to\infty}\left(\max_{\pi\in\Pi}\mathrm{E}_{\upsilon\sim\zeta}\left[\operatorname{EU}^{\pi_t}_{\upsilon_t}(t)\right]-\mathrm{E}_{\upsilon\sim\zeta}\left[\operatorname{EU}^{\pi^\zeta_t}_{\upsilon_t}(t)\right]\right)=0$$

Then, $\pi^\zeta$ learns $H$.

## Proof of Proposition A.0

Fix $\pi^*$ a metapolicy that learns $H$. Consider $\epsilon>0$ and let $H_\epsilon\subseteq H$ be finite s.t. $\zeta(H\setminus H_\epsilon)<\epsilon$. For $t\gg 0$ and every $\upsilon\in H_\epsilon$ we have

$$\operatorname{EU}^{\pi^*_t}_{\upsilon_t}(t)\geq\operatorname{EU}^*_{\upsilon_t}(t)-\epsilon$$

Also

$$\mathrm{E}_{\upsilon\sim\zeta}\left[\operatorname{EU}^{\pi^\zeta_t}_{\upsilon_t}(t)\right]\geq\mathrm{E}_{\upsilon\sim\zeta}\left[\operatorname{EU}^{\pi^*_t}_{\upsilon_t}(t)\right]-\epsilon\geq\mathrm{E}_{\upsilon\sim\zeta}\left[\operatorname{EU}^{\pi^*_t}_{\upsilon_t}(t);\upsilon\in H_\epsilon\right]-\epsilon$$

Combining, we get

$$\mathrm{E}_{\upsilon\sim\zeta}\left[\operatorname{EU}^{\pi^\zeta_t}_{\upsilon_t}(t)\right]\geq\mathrm{E}_{\upsilon\sim\zeta}\left[\operatorname{EU}^*_{\upsilon_t}(t);\upsilon\in H_\epsilon\right]-2\epsilon$$

$$\mathrm{E}_{\upsilon\sim\zeta}\left[\operatorname{EU}^{\pi^\zeta_t}_{\upsilon_t}(t)\right]\geq\mathrm{E}_{\upsilon\sim\zeta}\left[\operatorname{EU}^*_{\upsilon_t}(t)\right]-\mathrm{E}_{\upsilon\sim\zeta}\left[\operatorname{EU}^*_{\upsilon_t}(t);\upsilon\notin H_\epsilon\right]-2\epsilon$$

By definition of $H_\epsilon$, this implies

$$\mathrm{E}_{\upsilon\sim\zeta}\left[\operatorname{EU}^{\pi^\zeta_t}_{\upsilon_t}(t)\right]\geq\mathrm{E}_{\upsilon\sim\zeta}\left[\operatorname{EU}^*_{\upsilon_t}(t)\right]-3\epsilon$$

For any $\upsilon\in H$, we get

$$\operatorname{EU}^{\pi^\zeta_t}_{\upsilon_t}(t)\geq\operatorname{EU}^*_{\upsilon_t}(t)-\frac{3\epsilon}{\zeta(\upsilon)}$$

Taking ϵ to 0, we get the desired result.

## Proof of Proposition 1

Immediate from Proposition A.0.

## Proof of Proposition 2

Let $H=\{\upsilon^k\}_{k\in\mathbb{N}}$. For each $k\in\mathbb{N}$, let $\pi^k$ learn $\{\upsilon^l\}_{l<k}$. Choose $\{t_k\in(0,\infty)\}_{k\in\mathbb{N}}$ s.t.

i. $t_0=0$

ii. $t_k<t_{k+1}$

iii. $\lim_{k\to\infty}t_k=\infty$

iv. For any $l<k$ and $t\geq t_k$: $\operatorname{EU}^{\pi^k_t}_{\upsilon^l_t}(t)\geq\operatorname{EU}^*_{\upsilon^l_t}(t)-\frac{1}{k+1}$

Now define $\pi^*_t:=\pi^{\max\{k\mid t\geq t_k\}}_t$. Clearly, $\pi^*$ learns $H$.
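The diagonal policy at the end of the proof just looks up the largest $k$ with $t \geq t_k$. A small Python sketch of that selection rule (function name hypothetical):

```python
import bisect

def policy_index(t: float, thresholds: list) -> int:
    """Return max{k : t >= t_k} for an increasing list t_0 = 0 < t_1 < ...

    This is the index of the sub-policy pi^k that the diagonal policy pi*
    delegates to at time-discount scale t."""
    return bisect.bisect_right(thresholds, t) - 1
```

Since $t_0 = 0$, the set $\{k \mid t \geq t_k\}$ is never empty, so the index is always well-defined.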

## Proposition A.1

Consider $d\in\mathbb{N}$ and $x,y\in\mathbb{R}^d\setminus 0$. Then

$$\left\|\frac{x}{\|x\|_\infty}-\frac{y}{\|y\|_\infty}\right\|_\infty\leq 2d\left\|\frac{x}{\|x\|_1}-\frac{y}{\|y\|_1}\right\|_1$$

## Proof of Proposition A.1

Without loss of generality, assume $\|x\|_1=\|y\|_1=1$. For any $i<d$, we have

$$\frac{x_i}{\|x\|_\infty}-\frac{y_i}{\|y\|_\infty}=\frac{x_i\|y\|_\infty-y_i\|x\|_\infty}{\|x\|_\infty\|y\|_\infty}=\frac{x_i\|y\|_\infty-x_i\|x\|_\infty+x_i\|x\|_\infty-y_i\|x\|_\infty}{\|x\|_\infty\|y\|_\infty}$$

$$\frac{x_i}{\|x\|_\infty}-\frac{y_i}{\|y\|_\infty}=\frac{x_i(\|y\|_\infty-\|x\|_\infty)+(x_i-y_i)\|x\|_\infty}{\|x\|_\infty\|y\|_\infty}$$

Denote $r:=\|x-y\|_1$. Obviously, $\|x-y\|_\infty\leq r$ and therefore $|x_i-y_i|\leq r$ and $|\|x\|_\infty-\|y\|_\infty|\leq r$. We get

$$\left|\frac{x_i}{\|x\|_\infty}-\frac{y_i}{\|y\|_\infty}\right|\leq\frac{|x_i|r+r\|x\|_\infty}{\|x\|_\infty\|y\|_\infty}\leq\frac{\|x\|_\infty r+r\|x\|_\infty}{\|x\|_\infty\|y\|_\infty}\leq\frac{2r}{\|y\|_\infty}$$

Since $\|y\|_1=1$, $\|y\|_\infty\geq\frac{1}{d}$, yielding the desired result.
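Proposition A.1 is easy to spot-check numerically. A short Python sketch (helper names hypothetical) that normalizes random vectors and verifies the inequality:

```python
import random

def normalize(v, p):
    # divide v by its p-norm (p = 1 or float('inf'))
    n = sum(abs(x) for x in v) if p == 1 else max(abs(x) for x in v)
    return [x / n for x in v]

def check_a1(d=4, trials=1000, seed=0):
    """Verify ||x/||x||_inf - y/||y||_inf||_inf <= 2d ||x/||x||_1 - y/||y||_1||_1
    on random vectors; returns True if no counterexample is found."""
    rng = random.Random(seed)
    for _ in range(trials):
        x = [rng.uniform(-1, 1) for _ in range(d)]
        y = [rng.uniform(-1, 1) for _ in range(d)]
        xi, yi = normalize(x, float('inf')), normalize(y, float('inf'))
        x1, y1 = normalize(x, 1), normalize(y, 1)
        lhs = max(abs(a - b) for a, b in zip(xi, yi))
        rhs = 2 * d * sum(abs(a - b) for a, b in zip(x1, y1))
        if lhs > rhs + 1e-12:
            return False
    return True
```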

## Proposition A.2

Consider $H$ a finite set, $L:H\to[0,\infty)$, $\zeta\in\Delta H$ and $\beta,\epsilon\in(0,\infty)$. Assume that

i. For any $k\in H$, $\zeta(k)\geq\beta\epsilon$.

ii. $\mathrm{E}_\zeta[L]\geq\epsilon$

Then

$$\mathrm{E}_\zeta[e^{-\beta L}]\leq 1-(1-e^{-1})\beta\epsilon$$

## Proof of Proposition A.2

Without loss of generality, we can assume that $\mathrm{E}_\zeta[L]=\epsilon$, because otherwise we can rescale $L$ by a constant in $(0,1)$, which only makes $\mathrm{E}_\zeta[e^{-\beta L}]$ larger. It now follows from conditions i+ii that for any $k$, $L(k)\leq\beta^{-1}$ and therefore $\beta L(k)\in[0,1]$. We have

$$L(k)=(1-\beta L(k))\cdot 0+\beta L(k)\cdot\beta^{-1}$$

Since $e^{-\beta x}$ is a convex function, we get

$$e^{-\beta L(k)}\leq(1-\beta L(k))\cdot e^{-\beta\cdot 0}+\beta L(k)\cdot e^{-\beta\cdot\beta^{-1}}=1-\beta L(k)+e^{-1}\beta L(k)=1-(1-e^{-1})\beta L(k)$$

$$\mathrm{E}_\zeta[e^{-\beta L}]\leq 1-(1-e^{-1})\beta\mathrm{E}_\zeta[L]=1-(1-e^{-1})\beta\epsilon$$
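A quick numerical spot-check of Proposition A.2 in Python (all names hypothetical): we take a uniform ζ and random losses, choosing β and ϵ so that condition i holds and skipping the rare draws where condition ii fails.

```python
import math, random

def check_a2(n=5, trials=200, seed=1):
    """Check E_zeta[exp(-beta*L)] <= 1 - (1 - 1/e)*beta*eps on random instances
    satisfying the hypotheses; returns True if no counterexample is found."""
    rng = random.Random(seed)
    for _ in range(trials):
        zeta = [1.0 / n] * n                       # uniform prior
        beta = rng.uniform(0.1, 2.0)
        eps = rng.uniform(0.0, 1.0 / (n * beta))   # ensures zeta(k) >= beta*eps
        L = [rng.uniform(0.0, 2.0) for _ in range(n)]
        if sum(z * l for z, l in zip(zeta, L)) < eps:
            continue  # condition ii fails for this draw; skip it
        lhs = sum(z * math.exp(-beta * l) for z, l in zip(zeta, L))
        rhs = 1 - (1 - math.e ** -1) * beta * eps
        if lhs > rhs + 1e-12:
            return False
    return True
```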

## Proposition A.3

Consider $H$ and $A$ finite sets, $L:H\times A\to[0,\infty)$, $\zeta\in\Delta H$, $\alpha:H\to\Delta A$ and $\beta,\epsilon\in(0,\infty)$. Assume that

i. For any $k\in H$, $\zeta(k)\geq\beta\epsilon$.

ii. For any $a\in A$, $\mathrm{E}_{k\sim\zeta}[L(k,a)]\geq\epsilon$.

iii. For any $k\in H$ and $a\in A$, $\alpha(k)(a)\leq\exp[-\beta L(k,a)]\max_{b\in A}\alpha(k)(b)$.

Define $\zeta\ltimes\alpha\in\Delta(H\times A)$ by $(\zeta\ltimes\alpha)(k,a):=\zeta(k)\,\alpha(k)(a)$. Then, the mutual information $I$ between $k$ and $a$ in the distribution $\zeta\ltimes\alpha$ satisfies

$$I\geq\frac{(1-e^{-1})^2}{8|A|^2}\beta^2\epsilon^2$$

## Proof of Proposition A.3

Define $\bar\alpha\in\Delta A$ by $\bar\alpha:=\mathrm{E}_\zeta[\alpha]$. We have

$$I=\mathrm{E}_\zeta[D_{\mathrm{KL}}(\alpha\|\bar\alpha)]$$

Applying Pinsker's inequality

$$I\geq 2\mathrm{E}_\zeta[d_{\mathrm{tv}}(\alpha,\bar\alpha)^2]=\frac{1}{2}\mathrm{E}_\zeta[\|\alpha-\bar\alpha\|_1^2]\geq\frac{1}{2}\mathrm{E}_\zeta[\|\alpha-\bar\alpha\|_1]^2$$

By Proposition A.1

$$I\geq\frac{1}{8|A|^2}\mathrm{E}_\zeta\left[\left\|\frac{\alpha}{\|\alpha\|_\infty}-\frac{\bar\alpha}{\|\bar\alpha\|_\infty}\right\|_\infty\right]^2\geq\frac{1}{8|A|^2}\left\|\mathrm{E}_\zeta\left[\frac{\alpha}{\|\alpha\|_\infty}-\frac{\bar\alpha}{\|\bar\alpha\|_\infty}\right]\right\|_\infty^2=\frac{1}{8|A|^2}\left\|\mathrm{E}_\zeta\left[\frac{\alpha}{\|\alpha\|_\infty}\right]-\frac{\bar\alpha}{\|\bar\alpha\|_\infty}\right\|_\infty^2$$

$$I\geq\frac{1}{8|A|^2}\left(\left\|\frac{\bar\alpha}{\|\bar\alpha\|_\infty}\right\|_\infty-\left\|\mathrm{E}_\zeta\left[\frac{\alpha}{\|\alpha\|_\infty}\right]\right\|_\infty\right)^2=\frac{1}{8|A|^2}\left(1-\left\|\mathrm{E}_\zeta\left[\frac{\alpha}{\|\alpha\|_\infty}\right]\right\|_\infty\right)^2$$

By condition iii, $\alpha(k)(a)/\|\alpha(k)\|_\infty\leq\exp[-\beta L(k,a)]$ and therefore

$$I\geq\frac{1}{8|A|^2}\left(1-\max_{a\in A}\mathrm{E}_{k\sim\zeta}\left[e^{-\beta L(k,a)}\right]\right)^2$$

Applying Proposition A.2, we get

$$I\geq\frac{1}{8|A|^2}\left(1-\left(1-(1-e^{-1})\beta\epsilon\right)\right)^2=\frac{(1-e^{-1})^2}{8|A|^2}\beta^2\epsilon^2$$
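Proposition A.3 can likewise be spot-checked. The Python sketch below (all names hypothetical) instantiates condition iii with a softmax advisor α(k) ∝ exp(−βL(k,·)), where each row of L has minimum 0 so that condition iii indeed holds, and compares the mutual information of ζ⋉α against the bound.

```python
import math

def mutual_info(zeta, alpha):
    """I(k; a) in nats for the joint p(k, a) = zeta[k] * alpha[k][a]."""
    A = len(alpha[0])
    abar = [sum(z * al[a] for z, al in zip(zeta, alpha)) for a in range(A)]
    I = 0.0
    for z, al in zip(zeta, alpha):
        for a in range(A):
            if al[a] > 0:
                I += z * al[a] * math.log(al[a] / abar[a])
    return I

def check_a3(beta=0.5):
    # two hypotheses, two actions; each row of L has min 0, so the softmax
    # alpha(k)(a) ~ exp(-beta * L(k, a)) satisfies condition iii
    L = [[0.0, 1.0], [1.0, 0.0]]
    zeta = [0.5, 0.5]
    alpha = []
    for row in L:
        w = [math.exp(-beta * l) for l in row]
        s = sum(w)
        alpha.append([x / s for x in w])
    # eps = min_a E_k[L(k, a)] = 0.5; condition i: zeta(k) >= beta * eps
    eps = min(sum(z * L[k][a] for k, z in enumerate(zeta)) for a in range(2))
    assert all(z >= beta * eps for z in zeta)
    bound = (1 - math.e ** -1) ** 2 / (8 * 2 ** 2) * beta ** 2 * eps ** 2
    return mutual_info(zeta, alpha) >= bound
```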

## Proposition A.4

Consider $A$ a finite set, $L:A\to[0,\infty)$, $\alpha\in\Delta A$ and $\beta\in(0,\infty)$ s.t. for any $a\in A$

$$\alpha(a)\leq e^{-\beta L(a)}\max_{b\in A}\alpha(b)$$

Then, $\mathrm{E}_\alpha[L]\leq|A|e^{-1}\beta^{-1}$.

## Proof of Proposition A.4

We have

$$\mathrm{E}_\alpha[L]=\sum_{a\in A}\alpha(a)L(a)\leq\max_{b\in A}\alpha(b)\sum_{a\in A}e^{-\beta L(a)}L(a)\leq|A|\max_{x\in[0,\infty)}e^{-\beta x}x$$

We compute $\max_{x\in[0,\infty)}e^{-\beta x}x$:

$$0=\left.\frac{d}{dx}\left(e^{-\beta x}x\right)\right|_{x=x^*}=-\beta e^{-\beta x^*}x^*+e^{-\beta x^*}$$

$$x^*=\frac{1}{\beta}$$

$$e^{-\beta x^*}x^*=\frac{1}{e\beta}$$
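Again a quick spot-check (names hypothetical): a softmax α(a) ∝ exp(−βL(a)) with min L = 0 satisfies the hypothesis of Proposition A.4, and we can verify the bound on the expected loss.

```python
import math

def check_a4(beta=2.0):
    """Softmax alpha(a) ~ exp(-beta * L(a)) with min L = 0 satisfies
    alpha(a) <= exp(-beta * L(a)) * max_b alpha(b); check E_alpha[L] <= |A|/(e*beta)."""
    L = [0.0, 0.3, 1.0, 2.5]
    w = [math.exp(-beta * l) for l in L]
    s = sum(w)
    alpha = [x / s for x in w]
    expected_loss = sum(a * l for a, l in zip(alpha, L))
    return expected_loss <= len(L) * math.e ** -1 / beta
```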

## Proposition A.5

Consider a universe $\upsilon=(\mu,r)$, a policy $\pi_0$ and $t\in(0,\infty)$. Then,

$$\operatorname{EU}^*_\upsilon(t)-\operatorname{EU}^{\pi_0}_\upsilon(t)=\sum_{n=0}^\infty e^{-n/t}\,\mathrm{E}_{x\sim\mu\bowtie\pi_0}\left[V^t_\upsilon(x_{:n})-Q^t_\upsilon(x_{:n+\frac{1}{2}})\right]$$

## Proof of Proposition A.5

For any $x\in(A\times O)^\omega$, it is easy to see that

$$\operatorname{EU}^*_\upsilon(t)=V^t_\upsilon(\lambda)=\sum_{n=0}^\infty e^{-n/t}\left(V^t_\upsilon(x_{:n})-e^{-1/t}V^t_\upsilon(x_{:n+1})\right)$$

$$U^t_r(x)=(1-e^{-1/t})\sum_{n=0}^\infty e^{-n/t}r(x_{:n})$$

$$\operatorname{EU}^*_\upsilon(t)-U^t_r(x)=\sum_{n=0}^\infty e^{-n/t}\left(V^t_\upsilon(x_{:n})-(1-e^{-1/t})r(x_{:n})-e^{-1/t}V^t_\upsilon(x_{:n+1})\right)$$

$$\operatorname{EU}^*_\upsilon(t)-U^t_r(x)=\sum_{n=0}^\infty e^{-n/t}\left(V^t_\upsilon(x_{:n})-Q^t_\upsilon(x_{:n+\frac{1}{2}})+Q^t_\upsilon(x_{:n+\frac{1}{2}})-(1-e^{-1/t})r(x_{:n})-e^{-1/t}V^t_\upsilon(x_{:n+1})\right)$$

Taking the expected value over $x$, we get

$$\operatorname{EU}^*_\upsilon(t)-\operatorname{EU}^{\pi_0}_\upsilon(t)=\sum_{n=0}^\infty e^{-n/t}\left(\mathrm{E}_{\mu\bowtie\pi_0}\left[V^t_\upsilon(x_{:n})-Q^t_\upsilon(x_{:n+\frac{1}{2}})\right]+\mathrm{E}_{\mu\bowtie\pi_0}\left[Q^t_\upsilon(x_{:n+\frac{1}{2}})-(1-e^{-1/t})r(x_{:n})-e^{-1/t}V^t_\upsilon(x_{:n+1})\right]\right)$$

It is easy to see that the second term vanishes, yielding the desired result.

## Lemma A

Consider the setting of the Theorem, but assume that $H=\{\upsilon^k=(\mu^k,r^k)\}_{k<N}$ for some $N\in\mathbb{N}$ (i.e. it is finite) and that $\alpha^k(t)$ is strictly $\beta(t)$-rational for $\bar\upsilon^k[\alpha^k]$. Denote $\nu^k:=\bar\mu^k[\alpha^k]$ and $\operatorname{hdom}\bar{H}:=\bigcup_{k<N}\operatorname{hdom}\nu^k$. Denote $\zeta_0\in\Delta[N]$ the uniform probability distribution. For any $t>N^3$, define $\zeta^t,\tilde\zeta^t:\operatorname{hdom}\bar{H}\to\Delta[N]$ recursively as follows:

$$\tilde\zeta^t(\lambda):=\zeta_0$$

$$\tilde\zeta^t(hao)(k):=\begin{cases}\tilde{Z}^t(h)^{-1}\cdot\zeta^t(h)(k)\cdot\nu^k(ha)(o)&\text{if }\exists j:\zeta^t(h)(j)\cdot\nu^j(ha)(o)>0\\N^{-1}&\text{otherwise}\end{cases}$$

$$\zeta^t(h)(k):=Z^t(h)^{-1}\,\tilde\zeta^t(h)(k)\,[[\tilde\zeta^t(h)(k)>t^{-1/3}]]$$

In the above, $Z^t(h)$ and $\tilde{Z}^t(h)$ are normalization factors chosen to make the probabilities sum to 1. That is, $\zeta^t(h)$ is obtained by starting from the prior $\zeta_0$, updating on every observation, and setting to 0 the probability of any universe whose probability drops below $t^{-1/3}$. When encountering an "impossible" observation we reset to the uniform distribution, but this choice is arbitrary.
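The truncated update described above is straightforward to implement; here is a Python sketch of a single step (function name hypothetical):

```python
def truncated_update(zeta, likelihoods, threshold):
    """One step of the truncated Bayesian update from Lemma A.

    zeta:        current posterior over N hypotheses
    likelihoods: likelihoods[k] = nu^k(ha)(o), the probability hypothesis k
                 assigns to the new observation
    threshold:   t**(-1/3); updated weights at or below it are zeroed out
    """
    N = len(zeta)
    raw = [z * l for z, l in zip(zeta, likelihoods)]
    total = sum(raw)
    if total == 0:  # "impossible" observation: reset to uniform (arbitrary choice)
        tilde = [1.0 / N] * N
    else:
        tilde = [x / total for x in raw]
    kept = [x if x > threshold else 0.0 for x in tilde]
    s = sum(kept)  # s > 0 whenever threshold < 1/N, since max(tilde) >= 1/N
    return [x / s for x in kept]
```

Note that for t > N³ the threshold t^(−1/3) is below N^(−1), so at least one hypothesis always survives the cut and the final normalization is well-defined.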

Define the "loss function" $L^t:\operatorname{hdom}\bar{H}\times A\to[0,1]$ by

$$L^t(ha):=\mathrm{E}_{k\sim\zeta^t(h)}\left[V^t_{\upsilon^k_t}(\underline{h})-Q^t_{\upsilon^k_t}(\underline{h}a)\right]$$

Denote $\epsilon_t:=\beta(t)^{-1}t^{-1/3}$. Define the following $\bar{I}$-metapolicy $\pi^*$:

$$\pi^*_t(h):=\begin{cases}\operatorname*{arg\,min}_{a\in A}L^t(ha)&\text{if }h\in\operatorname{hdom}\bar{H},\ \min_{a\in A}L^t(ha)<\epsilon_t\\\bot&\text{if }h\in\operatorname{hdom}\bar{H},\ \min_{a\in A}L^t(ha)\geq\epsilon_t\\\bot&\text{if }h\notin\operatorname{hdom}\bar{H}\end{cases}$$

(Technically, we only defined $\pi^*_t$ for $t>N^3$, but it doesn't matter.) Then, $\pi^*$ learns $\bar{H}$.

## Proof of Lemma A

For every k</