Disclaimer: this post is a collection of preliminary thoughts around a specific question for MSFP'19 blogpost writing day. I do not intend to present novel results or define a research agenda.
The capital of Mozambique is Maputo. This is a fact about the world I learned in school many years ago. Some part of my brain must have changed in order to store this fact and allow me to remember it today. It may be possible, in principle, to recover this fact by simply looking at my brain.
In this post, I consider an analogous topic in the context of Artificial Intelligence: is it possible to retrieve the knowledge stored in a neural network? This line of enquiry could be a step towards more transparent AI.
Suppose we train an RL agent in the virtual sandbox environment Minetest. In this environment, agents can use tools called pickaxes to efficiently "mine" and collect objects made of stone. Tools have properties that distinguish them from other items (e.g. tools cannot be "stacked" in the inventory).
We may try to retrieve the knowledge represented in the internal state of the agent by means of queries. For instance: at which point, during training, does the RL agent learn that a pickaxe is a tool?
It seems plausible to me that this question could be asked "as it is" by an observer in order to gauge the agent's understanding of its environment. That is to say, at the moment of posing the question, it seems perfectly meaningful. However, further thought reveals some vagueness in it: what exactly is a pickaxe? What exactly is a tool? What does it really mean for a pickaxe to be a tool? It is funny how easily one can become confused with words like "really" and "exactly".
In this example, it seems plausible to interpret the question in reference to the source code of the Minetest engine. Let us assume, for the sake of the argument, that the source code uses an object-oriented approach, and contains a Tool class, and a Pickaxe class which is a subclass of the former. The fact "a pickaxe is a tool" can be recast as a claim that the Pickaxe class implements all relevant aspects of the Tool class.
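The object-oriented reading above can be sketched in code. (Minetest's engine is actually written in C++ with Lua mods; the Python classes and attributes below are invented purely to illustrate the recasting of "a pickaxe is a tool" as a subclass relation.)

```python
class Tool:
    """Base class for tools; distinguished from other items by shared
    properties, e.g. tools cannot be stacked in the inventory."""
    stackable = False

    def __init__(self, durability):
        self.durability = durability


class Pickaxe(Tool):
    """A tool for efficiently mining objects made of stone."""

    def __init__(self, durability=100, mining_speed=2.0):
        super().__init__(durability)
        self.mining_speed = mining_speed


# The fact "a pickaxe is a tool" recast as a subclass relation...
assert issubclass(Pickaxe, Tool)
# ...under which Pickaxe inherits the relevant Tool properties:
assert Pickaxe.stackable is False
```

Under this reading, answering the query amounts to checking that the agent's internal state reflects the inheritance relation and the inherited properties.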
When we ask similar queries for agents that learn in the real world (e.g. "when has a system learned the fact that the capital of Mozambique is Maputo?"), formalising the query is not as straightforward. A possible solution to this issue could be to introduce ontologies (in the sense used in information science). An ontology defines formally the properties, meaning, and relations between categories and entities in a particular domain. Such ontologies could be used to model relevant aspects of a target domain and used analogously to the source code in the previous example. However, the addition of a new "layer" between the real-world domain and the query can be a source of problems if the ontology fails to capture all relevant aspects of the domain.
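A minimal sketch of what such an ontology could look like, with facts stored as subject-relation-object triples (all names here are invented for illustration, not drawn from any standard ontology):

```python
# A toy "ontology": a set of (subject, relation, object) triples.
ontology = {
    ("Pickaxe", "subclass_of", "Tool"),
    ("Tool", "has_property", "not_stackable"),
    ("Maputo", "capital_of", "Mozambique"),
}


def holds(subject, relation, obj):
    """Check whether a fact is asserted directly in the ontology."""
    return (subject, relation, obj) in ontology


# Queries about what the agent "knows" are only well-posed relative
# to such a formal reference:
assert holds("Maputo", "capital_of", "Mozambique")
assert not holds("Harare", "capital_of", "Mozambique")
```

The ontology plays the role the source code played in the Minetest example: it fixes what the query means, at the cost of a new layer that may itself misrepresent the domain.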
Once the query has been formally defined, I see at least two approaches to answering it. The first is to define a battery of tests to gather evidence that all relevant features or aspects of the fact have been learned by the agent. For instance, if we observe that the agent has learned not to stack pickaxes, we may conclude that this particular aspect of the Tool class has been learned. However, this approach quickly runs into trouble, since it seems very hard to ensure that the designed suite of tests captures all relevant aspects of the query. For instance, an agent may not attempt to stack iron pickaxes, but perhaps it would attempt to stack bronze pickaxes, if it found them. Furthermore, this approach does not entirely satisfy the original desideratum of retrieving knowledge directly from the network, because it depends on tests (and the number of required tests is potentially very large).
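The test-battery idea can be sketched as follows. The `act(observation)` interface and the stand-in agent are hypothetical; a real harness would drive the trained agent in the actual environment.

```python
class NeverStackAgent:
    """Stand-in for a trained agent; always declines to stack items."""

    def act(self, observation):
        return "noop"


def stacking_tests(agent, variants=("iron", "bronze", "diamond")):
    """Run one behavioural test per pickaxe variant.

    Passing every test is evidence (not proof) that the agent has
    learned the 'tools cannot be stacked' aspect of the Tool class.
    """
    results = {}
    for variant in variants:
        obs = {"inventory": [f"{variant}_pickaxe", f"{variant}_pickaxe"]}
        results[variant] = agent.act(obs) != "stack"
    return results


assert all(stacking_tests(NeverStackAgent()).values())
```

Note that even this tiny suite has to enumerate pickaxe variants explicitly, which is exactly the coverage problem described above: any variant (or situation) left off the list goes untested.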
A second approach is to attempt to define a structure-preserving mapping between the learned model of the network and the representation of the Pickaxe and Tool classes. This approach is more exhaustive than the previous one and does not require tests. However, the questions of how to define such a mapping, and what kind of structure it should preserve, become the crux of the whole problem.
The knowledge representation analysis described in (Jaderberg et al., 2019) can be roughly seen as an example of this method. The article describes RL agents trained to play a "capture the flag" game in a 3D environment. In order to study the ways in which knowledge is represented in these systems, a simple "ontology" consisting of 200 Boolean facts is defined. To test whether these facts have been learned, a classifier is used to decode them from the agent's recurrent hidden state. The classifier is trained with input data labelled manually, according to whether each fact holds in the environment. If the classifier can "decode" a fact with sufficient accuracy, the fact is deemed to have been "learned". Effectively, the classifier implements the "mapping" between the state of the network and the ontology-based query, and the "structure" to be preserved is sufficient separability between states where the fact holds and states where it does not. The analysis verified that the number of learned facts increases monotonically during training, and discovered that some facts are represented by single-neuron responses.
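The decoding step can be illustrated with a minimal linear probe. The hidden states below are synthetic (a Boolean fact linearly encoded along a random direction), and the 0.9 accuracy threshold is an arbitrary choice; this is a sketch of the technique, not a reproduction of the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "recurrent hidden states": 64-dimensional vectors in which
# one Boolean fact is linearly encoded along a random direction.
n, d = 1000, 64
direction = rng.normal(size=d)
labels = rng.integers(0, 2, size=n)          # 1 where the fact holds
signs = 2.0 * labels - 1.0                   # +1 / -1 encoding of the fact
states = rng.normal(size=(n, d)) + signs[:, None] * direction


def train_probe(X, y, lr=0.1, steps=500):
    """Logistic-regression probe trained with plain gradient descent."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-np.clip(X @ w, -30, 30)))
        w -= lr * X.T @ (p - y) / len(y)
    return w


train, held_out = slice(0, 800), slice(800, None)
w = train_probe(states[train], labels[train])
accuracy = np.mean((states[held_out] @ w > 0) == labels[held_out])
# If held-out accuracy clears a chosen threshold, the fact is deemed
# decodable from (and in that sense "represented in") the hidden state.
fact_learned = accuracy > 0.9
```

The probe is exactly the "mapping" discussed above: a fact counts as learned when the states where it holds are linearly separable from those where it does not.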
In addition to all the difficulties already mentioned, retrieving knowledge from a neural network by means of queries (ontology-based or otherwise) can be affected by a severe problem:
Sarge: You mean you could have turned the bomb off any time!? Why didn't you tell us? And don't say I didn't—
Gary: You didn't ask.
Sarge: *growls in irritation*
— Red vs. Blue
If the knowledge stored in a network is probed via a set of queries, it might be the case that much of it is not revealed simply because crucial questions are never asked. This can be particularly problematic in applications where the neural network must be completely transparent.
To prevent the problem discussed in the previous section, we would like to ensure we retrieve all knowledge represented in the network. However, it is unclear what this means. In some way, we already know everything the network "knows," since all the values of all parameters of a network are accessible at any time during training. However, these parameters do not describe what the network "knows" in a way that is intelligible to a human. This is analogous to receiving a picture as a list of pixels and RGB colours, and being asked about its contents. What we need is a way to process the state of a network in a way that is both exhaustive and meaningful to a human.
The kind of process that would solve this problem would need to look at the function implemented by a network at a particular step during training, and recognise in it patterns that are intelligible to us, such as abstract algorithms, structures, ontologies, or processes. In some cases, this process would need to reveal more than just the knowledge stored in the network, as it may be hard to extricate it from other aspects of the learned model, like action policies or evaluation functions. The question of how to define and execute such a process is far from trivial.
However, even if we had such a process, the pattern-recognition approach may still fail. Imagine, for instance, a world similar to ours in which a sophisticated neural network was trained to predict the motion of celestial bodies. Suppose the network learned and stored the laws and concepts of General Relativity. Next, suppose that it was possible to analyse the contents of the network using some form of comprehensive pattern recognition, as described above. I would expect the process to retrieve a formal representation of GR from the network. However, imagine that the physicists in this world had not yet discovered GR, and mathematicians had not yet developed the tools required to formulate the theory. In such a world, it is hard to see how the output of the process could be interpreted in a way that was intelligible to humans. Even if we had a perfect process that reliably extracted all the higher-level structure encoded in the network, the ontologies it uncovered might be impossible to understand. Similarly, sufficiently advanced neural networks in our world may represent knowledge in terms of ontologies that are incomprehensible to us.
Retrieving knowledge represented in a neural network is a challenging task. Two approaches for addressing this problem have been discussed: a query-based approach, and a pattern-recognition approach. The first approach queries the network using a formal ontology, while the second processes the model learned by a network in order to output high-level ontologies. The first approach seems more tractable than the second, but it is also more limited, since some knowledge may go undetected due to our failure to ask relevant questions. The second approach can also be limited by the fact that it may yield knowledge which is incomprehensible for humans.
Jaderberg, M., Czarnecki, W.M., Dunning, I., Marris, L., Lever, G., Garcia Castañeda, A., Beattie, C., Rabinowitz, N.C., Morcos, A.S., Ruderman, A., Sonnerat, N., Green, T., Deason, L., Leibo, J.Z., Silver, D., Hassabis, D., Kavukcuoglu, K. and Graepel, T. 2019. Human-level performance in 3D multiplayer games with population-based reinforcement learning. Science, Vol. 364, Issue 6443, pp. 859-865.
I think that there is an interesting subproblem here, which is: given a policy (an RL agent), determine the belief state of this policy. I haven't thought much about this, but it seems reasonable to look at belief states (distributions over environments) w.r.t. which the policy has a sufficiently strong regret bound. This might still leave a lot of ambiguity that has to be regularized somehow.
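One way this could be made precise (the notation here is my own sketch, not an established definition): write the regret of a policy under a belief, and take as candidate belief states those under which the policy is near-optimal in expectation.

```latex
% Regret of policy \pi under a belief \beta (a distribution over
% environments \mu); V^{*}_{\mu} is the optimal value in \mu and
% V^{\pi}_{\mu} the value achieved by \pi in \mu.
\operatorname{Reg}(\pi, \beta)
  = \mathbb{E}_{\mu \sim \beta}\!\left[ V^{*}_{\mu} - V^{\pi}_{\mu} \right],
\qquad
\mathcal{B}_{\varepsilon}(\pi)
  = \left\{ \beta \,:\, \operatorname{Reg}(\pi, \beta) \le \varepsilon \right\}.
```

The set $\mathcal{B}_{\varepsilon}(\pi)$ will typically contain many beliefs, which is the ambiguity mentioned above that would need to be regularized somehow.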