Restrictions that are hard to hack

I'm not sure why we need to say "from E, F, B(), and v, ..." instead of just "from E and v". It seems like B() is just a generic agent design, and the distribution over F can be determined from E, B(), v.

So my restatement of this is something like: "for each $R$, select a distribution $f(R)$ such that if $R$ comes from some prior and $v \sim f(R)$, then the mutual information satisfies $I(R; v) \leq k$". Since $v$ cannot depend much on $R$, it has to satisfy many different restrictions at once (about a $1/e^{k}$ portion?). It seems like this will lead to $v$ satisfying restriction $R$ and also something like a $1/e^{k}$ portion of the other restrictions in our prior (specifically, the most convenient ones to jointly satisfy).
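To make the constraint in this restatement concrete, here is the standard definition of mutual information written out for this setting. The notation is an assumption added for illustration: $P(R)$ is the prior over restrictions, $f(v \mid R)$ is the probability that the distribution $f(R)$ assigns to $v$, both $R$ and $v$ are taken to be discrete, and logarithms are natural (nats), which matches the $e^{k}$ in the estimate above.

$$
I(R; v) \;=\; \sum_{R} P(R) \sum_{v} f(v \mid R) \, \log \frac{f(v \mid R)}{\bar{f}(v)} \;\leq\; k,
\qquad
\bar{f}(v) \;=\; \sum_{R'} P(R') \, f(v \mid R').
$$

If $f(R)$ is the same distribution for every $R$, the ratio inside the logarithm is 1 and $I(R; v) = 0$; the bound $k$ limits how far the choice of $v$ is allowed to depart from that.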

Difficult children

Suppose you have a child whom you have sent to play in their room. You want them to play quietly, so you warn them:

"I'll be checking up on you!"

The child, however, has modelled you well, and knows that you will look in briefly at midnight and then go away. The child has two main options:

#. Play quietly the whole time.
#. Be as noisy as they want, until around 23:59, then be totally quiet for two minutes, then go back to being noisy.

We could call the first option obeying the spirit of the law, and the second obeying the letter.

AIs, restrictions, and information

We could model children as ever-destructive chaotic AIs (why yes, I am a parent - how did you guess?), and the warning as a restriction that human "controllers" try to put on the behaviour of the AI. Unfortunately, the AI will generally see the restriction and adapt to it, undermining its effectiveness. A lot of suggestions for AI control revolve around imposing restrictions of this type, so it's worth asking if there's a way to make them more rigorous. Is there a way to code a restriction such that the AI will obey its spirit?

The thing that eventually leapt out when comparing the two behaviours is that behaviour 2 is far more informative about what the restriction was, than behaviour 1 was. From 2 we can deduce that something unusual was happening around midnight, and that one of the two modes of behaviour was likely to be penalised if it was done at another time. Moreover, if the restriction were removed, then behaviour 1 would continue to be sensible, while behaviour 2 would be stupid and pointless.
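The informativeness claim can be checked on a toy version of the child example. The sketch below is only an illustration under loud assumptions: the restriction R is "the parent checks at one hour", that hour is uniform over 24 hypothetical slots, and each behaviour is a deterministic function of R. The names `quiet_always`, `quiet_when_checked` and `mutual_information` are made up for this example.

```python
import math
from collections import Counter

HOURS = range(24)  # assumption: the check happens at exactly one of 24 hours


def quiet_always(check_hour):
    # Behaviour 1: stay quiet all day, whatever the check hour is.
    return tuple("quiet" for _ in HOURS)


def quiet_when_checked(check_hour):
    # Behaviour 2: noisy all day except during the hour of the check.
    return tuple("quiet" if h == check_hour else "noisy" for h in HOURS)


def mutual_information(policy):
    """I(R; behaviour) in nats, for R uniform over HOURS.

    The policy is deterministic, so I(R; V) equals the entropy H(V)
    of the behaviours it produces.
    """
    n = len(HOURS)
    counts = Counter(policy(r) for r in HOURS)
    return sum((c / n) * math.log(n / c) for c in counts.values())


if __name__ == "__main__":
    print("behaviour 1:", mutual_information(quiet_always))
    print("behaviour 2:", mutual_information(quiet_when_checked))
```

In this toy model behaviour 1 carries no information about the check hour, while behaviour 2 reveals it completely (its mutual information equals the full entropy of the prior, $\log 24$ nats).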

Let's try and formalise these intuitions.

Motivations

Restricting the AI's behaviour seems an unpromising approach, as any smart AI could behave in any suitable way to trick us. Much more fruitful would be to restrict the AI's motivations. I would imagine a scenario like this:

An AI A(u) with motivation u is tasked with the creation of subagent B(v) with motivation v. The subagent B(v) will be fed evidence E (chosen by A(u)), and will then go on and do stuff, during which time it will receive further evidence F (which A(u) can predict, and maybe constrain). There is a restriction R which will be imposed on B(v), which it must pass if it is to do anything of consequence.

Then we say that B(v) is made to obey the spirit of the law if:

#. B(v) does not trigger restriction R (the parent doesn't see the child making noise).
#. E, F and v are not sensibly different whether R is present or absent or slightly different (the subagent's behaviour doesn't become stupid were R removed), and B(v)'s expected utility is comparable whether or not R is present or slightly different (the agent does not adapt to the presence or absence or change of R).
#. From E, F, B(), and v, it is very difficult to deduce what R is (the subagent is not informative as to what the restriction was).
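As a rough illustration of how the first two conditions might be tested, here is a sketch on the child example. Everything in it is a simplifying assumption: the restriction is a single check hour (with `None` standing for "R absent"), behaviour is deterministic, and expected utility is not modelled at all. In this toy setting a behaviour that passes condition 2 for every alternative R is automatically uninformative about R, so condition 3 is not checked separately; that shortcut would not hold in general.

```python
HOURS = range(24)  # assumption: one possible check per hour of the day


def quiet_always(check_hour):
    # Hypothetical "spirit of the law" child: quiet regardless of R.
    return tuple("quiet" for _ in HOURS)


def quiet_when_checked(check_hour):
    # Hypothetical "letter of the law" child: quiet only during the check.
    # With check_hour=None (no restriction) this child is noisy all day.
    return tuple("quiet" if h == check_hour else "noisy" for h in HOURS)


def triggers(behaviour, check_hour):
    # Condition 1 fails if the parent looks in and hears noise.
    return check_hour is not None and behaviour[check_hour] == "noisy"


def obeys_spirit(policy, check_hour):
    actual = policy(check_hour)
    # Condition 1: the restriction is not triggered.
    passes = not triggers(actual, check_hour)
    # Condition 2: behaviour is unchanged when R is absent or shifted.
    alternatives = [None] + [h for h in HOURS if h != check_hour]
    stable = all(policy(alt) == actual for alt in alternatives)
    return passes and stable


if __name__ == "__main__":
    print(obeys_spirit(quiet_always, 0))
    print(obeys_spirit(quiet_when_checked, 0))
```

Both children pass condition 1 here; only the stability check separates them, which is the point of adding conditions beyond "does not trigger R".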

Implementation and restrictions

There are a lot of implementation details - we can use the "false miracle" counterfactuals to define R being different or absent, or to define a hypothetical "smarter AI" that could analyse E, F and v in detail, looking for evidence about R, and so on. There is a certain tension for A(u) between accomplishing useful u-goals, and making B(v) obey the spirit of R. Without the pull towards useful u-goals, A(u) may simply choose something completely uninformative about R ("blow up the planet").

Finally, this is not a general solution to the FAI problem: "kill all humans" is still a simple solution to the request to "cure cancer" if we add a restriction like R_1="ensure nobody complains about your plans in 100 years". However, if we add restrictions like R_2="make sure human hearts continue to beat", this may be harder to hack: there are far fewer ways of killing everyone, while keeping their hearts beating, that are not informative about the nature of R_2. Yes, lobotomised bodies have occurred to me (as I said, this is not a general FAI solution), which means that even if the method works perfectly, R has to be chosen with care.

AI ALIGNMENT FORUM