Here are some more general results derived from "Occam's razor is insufficient to infer the preferences of irrational agents":

  • Regularisation is insufficient to make inverse reinforcement learning work.
  • Unsupervised learning cannot deduce human preferences; at the very least, you need semi-supervised learning.
  • Human theory of mind cannot be deduced merely by observing humans.
  • When programmers "correct an error/bug" in a value-learning system, they are often injecting their own preferences into it.
  • The implicit and explicit assumptions made for a value-learning system determine what values the system will learn.
  • No simple definition can distinguish a bias from a preference, unless it connects with human judgement.
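
The degeneracy behind these results can be shown with a minimal toy sketch. It assumes a one-step decision problem; the states, actions, reward values, and function names are all hypothetical and chosen only for illustration. A fully rational planner given a reward and a fully anti-rational planner given the negated reward produce exactly the same behaviour, so observing behaviour alone cannot tell the two apart.

```python
# Toy illustration (all names and numbers are hypothetical):
# two different (planner, reward) pairs that yield the same observed policy.

STATES = ["s0", "s1"]
ACTIONS = ["left", "right"]

# Hypothetical reward table: REWARD[state][action]
REWARD = {
    "s0": {"left": 1.0, "right": 0.0},
    "s1": {"left": 0.0, "right": 2.0},
}


def negate(reward):
    """Return the reward table with every value negated."""
    return {s: {a: -v for a, v in acts.items()} for s, acts in reward.items()}


def rational_planner(reward):
    """Pick the reward-maximising action in every state."""
    return {s: max(ACTIONS, key=lambda a: reward[s][a]) for s in STATES}


def anti_rational_planner(reward):
    """Pick the reward-minimising action in every state."""
    return {s: min(ACTIONS, key=lambda a: reward[s][a]) for s in STATES}


# Pair A: rational agent with reward R.
policy_a = rational_planner(REWARD)
# Pair B: anti-rational agent with reward -R.
policy_b = anti_rational_planner(negate(REWARD))

# The observable behaviour is identical, although the "preferences" are opposite.
print(policy_a == policy_b)
```

Choosing between pair A and pair B requires an assumption that does not come from the behavioural data itself, which is the sense in which extra supervision or assumptions are needed.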

These are all true; the key question is to what extent they are true. Do we need only minimal supervision, or minimal added assumptions, for the AI to deduce our values correctly? After all, we produce a lot of data that it could use to learn our values, if it gets the basic assumptions right.

Or do we need to put a lot more work into the assumptions? After all, most of the data we produce is by humans, for humans, who will accept the implicit part of the data; hence there are few explicit labels for the AI to use.

My feeling is that we probably only need a few assumptions - but maybe more than some optimists in this area believe.
