I showed recently, predicated on a few assumptions, that a certain agent was asymptotically “benign” with probability 1. (That term may be replaced by something like “domesticated” in the next version, but I’ll use “benign” for now).

This result leaves something to be desired: namely an agent which is safe for its entire lifetime. It seems very difficult to formally show such a strong result for any agent. Suppose we had a design for an agent which did value learning properly. That is, suppose we somehow figured out how to design an agent which understood what constituted observational evidence of humanity’s reflectively-endorsed utility function.

Presumably, such an agent could learn (just about) any utility function depending on what observations it encounters. Surely, there would be a set of observations which caused it to believe that every human was better off dead.

In the presence of cosmic rays, then, one cannot say that agent is safe for its entire lifetime with probability 1 (edited for clarity). For any finite sequence of observations that would cause the agent to conclude that humanity was better off dead, this sequence has strictly positive probability, since with positive probability, cosmic rays will flip every relevant bit in the computer’s memory.

This agent is presumably still *asymptotically* safe. This is a bit hard to justify without a concrete proposal for what this agent looks like, but at the very least, the cosmic ray argument doesn’t go through. With probability 1, the sample mean of a Bernoulli() random variable (like the indicator of whether a bit was flipped) approaches , which is small enough that a competent value learner should be able to deal with it.

This is not to suggest that the value learner is unsafe. Insanely inconvenient cosmic ray activity is a risk I’m willing to take. The takeaway here is that it complicates the question of what we as algorithm designers should aim for. We should definitely be writing down sets assumptions from which we can derive formal results about the expected behavior of an agent, but is there anything to aim for that is stronger than asymptotic safety?

Delegative Reinforcement Learning is safe not just asymptotically. See also this, this and (once it's uploaded) upcoming paper for SafeML 2019. In addition, there are directions for further improvment here in the "value learning protocols" sections.

Can you add the key assumptions being made when you say it is safe asymptotically? From skimming, it looked like "assuming the world is an MDP and that a human can recognize which actions lead to catastrophes."

I have to admit I got a little swamped by unfamiliar notation. Can you give me a short description of a Delegative Reinforcement Learner?

The agent interacts with an environment, that is for the time being assumed to be a finite MDP (generalizations to POMDP and infinite state spaces should be possible, but working out the precise assumptions that are needed is currently an open problem). On each round it either takes a normal action from the set A or takes the special "delegation" action ⊥. If the agent delegates, the advisor produces an action from A that acts on the environment instead.

The assumptions on the advisor are: (i) it never falls into traps (or enters corrupt states, which means states in which the advisor and/or the input channels were compromised and longer provide reliable rewards or advice) (ii) it has at least some small probability of taking the optimal action (instead, we could assume that there is some set of "good enough" actions s.t. the advisor has at least small small probability to take such an action, and reformulate the guarantee w.r.t. the best policy comprised of "good enough" actions rather than the fully optimal policy).

Under these assumptions, we have a regret bound (the particular algorithm I use to prove the bound is Thompson sampling where (i) the agent delegates when it's not sure than an action is safe and (ii) hypotheses with low probability are discarded), meaning that as the geometric time discount constant goes to 1, the agent achieves nearly optimal expected utility.

Here I generalize the setup to allow a small probability of losing long-term value or entering a corrupt state when following the advisor policy. This is important because the aligned AGI is supposed to, among other things, block any unaligned AGI and this is something that the advisor cannot do on its own. I envision more ways to further "soften" the assumptions, in particular we can use the same method as in quantilizers, and argue that if the advisor policy loses long-term value very slowly then any policy with sufficiently small Renyi divergence w.r.t. the advisor policy also loses long-term value slowly at most. The agent should then be able to converge to the optimal policy under the Renyi divergence constraint. (Intuitively, we constraint the agent to behavior that is sufficiently "human like".) This should also have the benefit of a continuous rather than discrete model of corruption (that covers e.g. gradual value drift).

So the AI only takes action a from state s if it has already seen the human do that? If so, that seems like the root of all the safety guarantees to me.

Not quite. The AI starts with some prior over (environment, advisor policy) pairs and updates it with incoming observations. It can take an action if, given its current belief state, it is sufficiently confident that it is an action the advisor

couldtake. The confidence threshold is controlled by the parameter η which has a certain optimal value to achieve the best regret bound (as γ→1, η→0; in other words, the more long-term the plan is, the more cautious the AI becomes; obviously catastrophes modify this trade-off). That is, the AI generalizes from what it already observed rather than requiring the exact same state to repeat itself. Indeed, if we required the exact same state to repeat itself, the regret bound would scale with the number of states. Instead, it scales with the number of hypotheses (of course we can also derive a "structural" / "non-uniform" version for a countable number of hypotheses). Also, I am pretty sure that we can derive a regret bound that scales with RVO and MB dimensions (I also think MB dimension can be replaced by prior entropy, but so far hasn't been able to prove it), which can be bounded either in terms of the number of hypotheses or in terms of the number of states and actions, and can also remain small when both the number of hypotheses and the number of states are large.Another useful perspective on the conditions the advisor must satisfy, is regarding the environment w.r.t. which these conditions are defined as the

belief stateof the advisor rather than the true environment. This is difficult to do with the current formalism that requires MDPs, but would be possible with POMDPs for example. Indeed, I took this perspective in an earlier essay about a different setting that allows general environments (see Corollary 1 in that essay). This would lead to a performance guarantee which shows that the agent achieves optimal expected utility w.r.t. the belief state of the advisor. Obviously, this is not as good as optimal expected utility w.r.t. the true environment, however, this means that from the perspective of the advisor, building such an agent is the best possible strategy.I sort of object to titling this post "Value Learning is only Asymptotically Safe" when the actual point you make is that we don't yet have concrete optimality results for value learning other than asymptotic safety.

In the case of value learning, given the generous assumption that "we somehow figured out how to design an agent which understood what constituted observational evidence of humanity’s reflectively-endorsed utility function", it seems like you should be able to get a PAC-type bound, where by time T, the agent is only ϵ-suboptimal with probability 1−δ(ϵ,T), where δ(ϵ,T) is increasing in ϵ but decreasing in T -- see results on PAC bounds for Bayesian learning, which I haven't actually looked at. This gives you bounds stronger than asymptotic optimality for value leraning. Sadly, if you want your agent to actually behave well in general environments, you probably won't get results better than asymptotic optimality, but if you're happy to restrict yourself to MDPs, you probably can.

It is true that going beyond finite MDPs (more generally, environments satisfying sufficient ergodicity assumptions) causes problems but I believe it is possible to overcome them. For example, we can assume that there is a baseline policy (the advisor policy in case of DRL) s.t. the resulting trajectory in state space never (up to catastrophes) diverges from the optimal trajectory (or, less ambitiously, some "target" trajectory) further than some "distance" (measured in terms of the time it would take to go back to the optimal trajectory).

In the real world, this is usually impossible.

I think that in the real world, most superficially reasonable actions do not have irreversible consequences that are very important. So, this assumption can hold within some approximation, and this should lead to a performance guarantee that is optimal within the accuracy of this approximation.

Doesn't the cosmic ray example point to a strictly positive probability of dangerous behavior?

EDIT: Nvm I see what you're saying. If I'm understanding correctly, you'd prefer, e.g. "Value Learning is not [Safe with Probability 1]".

Thanks for the pointer to PAC-type bounds.