It is a relatively intuitive thought that if a Bayesian agent is uncertain about its utility function, it will act more conservatively until it has a better handle on what its true utility function is.

What follows might itself be flawed in a way I'm not aware of, but I'm going to point out a way in which I think this intuition is slightly off. For a Bayesian agent, a natural measure of uncertainty is the entropy of its distribution over utility functions (the distribution over which possible utility function it thinks is the true one). No matter how uncertain a Bayesian agent is about which utility function is the true one, if it does not believe that any future observations will cause it to update its belief distribution, then it will simply act as if its utility function is the Bayes mixture over all the utility functions it considers plausible (weighted by its credence in each one).
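To make this concrete, here is a minimal toy sketch (the numbers and setup are my own illustration, not anything from the literature): an agent choosing among three actions, with three candidate utility functions. Two belief distributions with different entropies but the same Bayes mixture produce identical behavior, so entropy by itself does nothing.

```python
import numpy as np

# Hypothetical toy setting: 3 actions, 3 candidate utility functions.
# Row i gives the payoff of each action under hypothesis U_i.
utilities = np.array([
    [1.0, 0.0, 0.6],   # U1
    [0.0, 1.0, 0.6],   # U2
    [0.5, 0.5, 0.6],   # U3
])

def best_action(weights):
    """With no expected belief updates, the agent just maximizes the
    Bayes mixture utility sum_i w_i * U_i(a)."""
    mixture = weights @ utilities
    return int(np.argmax(mixture))

def entropy(weights):
    w = np.asarray(weights, float)
    return float(-(w * np.log(w)).sum())

low_entropy  = np.array([0.49, 0.49, 0.02])
high_entropy = np.array([0.25, 0.25, 0.50])

# Both have mixture [0.5, 0.5, 0.6], so both pick the same action,
# despite having different entropies.
print(best_action(low_entropy), best_action(high_entropy))
print(entropy(low_entropy), entropy(high_entropy))
```

The point of the sketch: the chosen action depends only on the mixture, so however the probability mass is spread across hypotheses, behavior is unchanged as long as the mixture is.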

It seems like what our intuition is grasping for is not uncertainty about the utility function, but expected information gain about the utility function. If the agent expects to gain information about the utility function, then (intuitively to me, at least) it will act more conservatively until it has a better handle on what its true utility function is.

Expected information gain (at time t) is naturally formalized as the expectation (with respect to current beliefs) of KL(posterior distribution at time t + m || posterior distribution at time t). Roughly, this measures how poorly the agent expects its current beliefs to approximate its beliefs m timesteps in the future.
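Here is a small sketch of that quantity in the one-step (m = 1) case, for a toy discrete setting; the prior and likelihood values are made up for illustration:

```python
import numpy as np

def kl(p, q):
    """KL(p || q) for discrete distributions."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return float((p[mask] * np.log(p[mask] / q[mask])).sum())

def expected_info_gain(prior, likelihood):
    """E_o[ KL(posterior after o || prior) ], where the expectation over
    the next observation o is taken under current beliefs (m = 1 case)."""
    p_obs = prior @ likelihood               # predictive distribution over o
    total = 0.0
    for o, p_o in enumerate(p_obs):
        posterior = prior * likelihood[:, o]
        posterior /= posterior.sum()
        total += p_o * kl(posterior, prior)
    return total

# Two utility hypotheses; observation o in {0, 1} with hypothesis-dependent
# likelihoods (hypothetical numbers).
prior = np.array([0.5, 0.5])
informative = np.array([[0.8, 0.2],    # P(o | hypothesis 1)
                        [0.3, 0.7]])   # P(o | hypothesis 2)
uninformative = np.array([[0.5, 0.5],
                          [0.5, 0.5]])

print(expected_info_gain(prior, informative))    # positive: updates expected
print(expected_info_gain(prior, uninformative))  # 0.0: no update expected
```

The uninformative case is the situation from the post above: the agent may be maximally uncertain (uniform prior), yet its expected information gain is zero, and it just acts on the mixture.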

So if anyone has a safety idea to which utility uncertainty feels central, my guess is that a mental substitution from uncertainty to expected information gain would be helpful.

Unfortunately, on-policy expected information gain goes to 0 pretty fast (Theorem 5 here).


Fwiw, I talk about utility uncertainty because that's what the mechanical change to the code looks like -- instead of having a known reward function, you have a distribution over reward functions. It's certainly true that this only makes a difference as long as there is still information to be gained about the utility function.

So if anyone has a safety idea to which utility uncertainty feels central

These two posts looked at some possibilities for using utility uncertainty but they didn't seem that promising and I don't know if anyone is still going in these directions:

Fwiw I don't find the problem of fully updated deference very compelling. My real rejection of utility uncertainty in the superintelligent-god-AI scenario is:

  • It seems hard to create a distribution over utility functions that is guaranteed to include the truth (with non-trivial probability, perhaps). It's been a while since I read it, but I think this is the point of Incorrigibility in the CIRL Framework.
  • It seems hard to correctly interpret your observations as evidence about utility functions. In other words, the likelihood is arbitrary and not a fact about the world, and so there's no way to ensure you get it right. This is pointed at somewhat by your first link.

If we could somehow magically wish these problems away, maximizing expected utility under that distribution seems fine, even though the resulting AI system would prevent us from shutting it down. It would be aligned but not corrigible.

I could imagine an efficient algorithm that could be said to be approximating a Bayesian agent with a prior including the truth, but I don't say that with much confidence.

I agree with the second bullet point, but I'm not so convinced this is prohibitively hard. That said, not only would we have to make our (arbitrarily chosen) likelihood un-game-able; on one reading of my original post, we would also have to ensure that by the time the agent is no longer gaining much information, it already has a pretty good grasp on the true utility function. This requirement might reduce to a concept like identifiability of the optimal policy.

Identifiability of the optimal policy seems too strong: it's basically fine if my household robot doesn't figure out the optimal schedule for cleaning my house, as long as it's cleaning it somewhat regularly. But I agree that conceptually we would want something like that.

Unfortunately, on-policy expected information gain goes to 0 pretty fast (Theorem 5 here).

Where's the "pretty fast"? The theorem makes a claim in the limit and says nothing about convergence. (I haven't read the rest of the paper.)

Oh yeah, sorry, that isn't shown there. But I believe the sum over all timesteps of the m-step expected information gain at each timestep is finite w.p.1, which would make it o(1/t) w.p.1.
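For what it's worth, summability of a nonnegative sequence alone only gives $\liminf_t t\,a_t = 0$; the $o(1/t)$ rate follows by a standard argument if one additionally assumes the per-step expected information gain $a_t$ is eventually nonincreasing (which seems plausible on-policy, but is an extra assumption):

```latex
\[
\sum_{t=1}^{\infty} a_t < \infty, \quad a_1 \ge a_2 \ge \cdots \ge 0
\;\Longrightarrow\;
\frac{t}{2}\, a_t \;\le\; \sum_{k=\lceil t/2 \rceil}^{t} a_k
\;\xrightarrow{\; t \to \infty \;}\; 0,
\quad\text{hence } a_t = o(1/t).
\]
```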