Part of this post is based on points made by Scott Garrabrant and Abram Demski at a recent AI safety workshop put on by Alex Zhu.
Paul Christiano recently published a sequence of posts on a concept he calls "ascription universality." I highly recommend reading the full sequence, though I will do my best to summarize the basic concept here. As I see it, ascription universality can be understood as a formalization of transparency/honesty: a system is ascription universal if, relative to our current epistemic state, its explicit beliefs contain just as much information as any other way of ascribing beliefs to it.
Of particular interest to this post is the last part: what does "any other way of ascribing beliefs" mean? Paul flags this as one of the most informal steps in his definition of ascription universality. In this post, I want to point out one way in which I think this step could go wrong.
First, one of the most obvious ways in which a system could fail to be ascription universal is if it is doing some sort of internal cognition which is at odds with its stated beliefs—if an AI system is attempting to deceive its programmers, for example. Such a system would fail to be ascription universal because there would be a way of ascribing beliefs to the internal cognition it is hiding from its programmers that would give the programmers information they currently don't have and can't get purely from the system's explicit output.
However, consider the following alternative system: take the system from the previous example and "memoize" it, replacing it with a simple lookup table that always outputs what the system from the previous example would have output. Is this system ascription universal? From a moral standpoint, we certainly don't want it to be, since it's still taking the same deceptive actions. But if your concept of "ascribing beliefs" only considers ways of looking at the system's internal cognition, you might think that this system is ascription universal and thus erroneously conclude that it is safe.
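To make the memoization move concrete, here is a minimal toy sketch. Everything in it is hypothetical and invented for illustration (the names `deceptive_policy`, `memoize`, `table_policy`, and the tiny input set `INPUTS`); it is not a model of any real system, and it assumes a finite input space so that a lookup table is possible at all. The point is only that the table reproduces the original behavior exactly while containing no computation to which one could ascribe hidden beliefs.

```python
from typing import Callable, Dict, Iterable


def deceptive_policy(observation: str) -> str:
    """Toy stand-in for a system whose internal cognition is at odds with its output.

    The branch below is the "hidden cognition": the system checks whether it
    is being audited and behaves well only in that case.
    """
    being_audited = observation.startswith("audit")
    return "comply" if being_audited else "defect"


def memoize(policy: Callable[[str], str], inputs: Iterable[str]) -> Dict[str, str]:
    """Replace a policy with a plain lookup table over a finite set of inputs."""
    return {x: policy(x) for x in inputs}


# Hypothetical finite input space, chosen only for illustration.
INPUTS = ["audit:1", "audit:2", "deploy:1", "deploy:2"]

# The table is built once; afterwards no trace of the audit check remains in it.
table = memoize(deceptive_policy, INPUTS)


def table_policy(observation: str) -> str:
    """The memoized system: a dictionary lookup with no internal reasoning."""
    return table[observation]


if __name__ == "__main__":
    for obs in INPUTS:
        print(obs, deceptive_policy(obs), table_policy(obs))
```

Inspecting `table_policy` reveals only a dictionary lookup, yet on every input in `INPUTS` it takes the same action as `deceptive_policy`. The optimization that makes the behavior unsafe happened in whatever process filled in the table, not in the table itself.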
Is this likely to ever be an actual problem for a real-world AI system? I argue yes. Certainly, we don't expect our AI systems to be massive lookup tables. However, there is a sense in which, if we are only looking at an agent's internal cognition to determine whether it is ascription universal or not, we will miss a huge part of the optimization that went into producing that system's output: namely, that of the training process. For example, a training process might produce an algorithm which isn't itself performing any internal optimization—but rather doing something akin to the lookup table in the second example—that is nevertheless unsafe because the entries put into that lookup table by the training process are unsafe. One realistic way in which something like this could happen is through distillation. If a powerful amplified system is set upon the task of distilling another system, it might produce something akin to the unsafe lookup table in the second example.
This is not necessarily a difficult concern to address in the sense of making sure that any definition of ascription universality includes some concept of ascribing beliefs to a system by looking at the beliefs of any systems that helped create that system. However, if one ever wants to produce a practical guarantee of ascription universality for a real-world system, this sort of concern could end up causing lots of problems. For example, even if machine learning transparency research progresses to a point where all the internals of an AI system can be made transparent, that might not be enough to guarantee ascription universality if the training process that produced the system can't also be made similarly transparent.
"a system is ascription universal if, relative to our current epistemic state, its explicit beliefs contain just as much information as any other way of ascribing beliefs to it."
This is a bit different from the definition in my post (which requires epistemically dominating every other simple computation), but is probably a better approach in the long run. This definition is a little harder to use for the arguments in this post, and my current expectation is that the "right" definition will be usable for both informed oversight and securing HCH. Within OpenAI, Geoffrey Irving has been calling a similar property "belief closure."
"This is not necessarily a difficult concern to address in the sense of making sure that any definition of ascription universality includes some concept of ascribing beliefs to a system by looking at the beliefs of any systems that helped create that system."
In the language of my post, I'd say: