Learning human preferences: black-box, white-box, and structured white-box access

by Stuart Armstrong, 24th Aug 2020

This post is inspired by system identification; however, I'm not an expert in that domain, so any corrections or inspirations on that front are welcome.

I want to thank Rebecca Gorman for her idea on using system identification, and her conversations developing the concept.

Knowing an agent

This is an agent:

Fig. 1

We want to know about its internal mechanisms, its software. But there are several things we could mean by that.

Black-box

First of all, we might be interested in knowing its input-output behaviour. I've called this its policy in previous posts; a full map that will allow us to predict its output in any circumstances:

Fig. 2

I'll call this black-box knowledge of the agent's internals.

White-box

We might be interested in knowing more about what's actually going on in the agent's algorithm, not just the outputs. I'll call this white-box knowledge; we would be interested in something like this (along with a detailed understanding of the internals of the various modules):

Fig. 3

Structured white-box

And, finally, we might be interested in knowing what the internal modules actually do, or actually mean. This is the semantics of the algorithm, resulting in something like this:

Fig. 4

The "beliefs", "preferences", and "action selectors" are tags that explain what these modules are doing. The tags are part of the structure of the algorithm, which includes the arrows and setup.

If we know those, I'd call it structured white-box knowledge.

Levels of access

We can have different levels of access to the agent. For example, we might be able to run it inside any environment, but not pry it open; hence we know its full input-output behaviour. This would give us (full) black-box access to the agent (partial black-box access would be knowing some of its behaviour, but not in all situations).

Or we might be able to follow its internal structure. This gives us white-box access to the agent. Hence we know its algorithm.

Or, finally, we might have a full tagged and structured diagram of the whole agent. This gives us structured white-box access to the agent (the term is my own).

Things can be more complicated, of course. We could have access to only parts of the agent/structure/tags. Or we could have a mix of different types of access - grey-box seems to be the term for something between black-box and white-box.

Humans seem to have a mixture of black-box and structured white-box access to each other - we can observe each other's behaviour, and we have our internal theory of mind that provides information like "if someone freezes up on a public speaking stage, they're probably filled with fear".

Access and knowledge

Complete access at one level gives complete knowledge at that level. So, if you have complete black-box access to the agent, you have complete black-box knowledge: you could, at least in principle, compute every input-output map just by running the agent.
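As a minimal sketch of that claim, assuming a toy agent with a small, finite observation space (the `agent` function and its two boolean observations are hypothetical, invented purely for illustration), full black-box access lets us tabulate the whole policy without ever looking inside:

```python
from itertools import product

def agent(observation):
    """Hypothetical agent. We may call it, but we pretend we cannot read its code."""
    hot, raining = observation
    return "stay_in" if (hot or raining) else "go_out"

def black_box_knowledge(agent, observations):
    """Tabulate the input-output map by running the agent on every observation."""
    return {obs: agent(obs) for obs in observations}

# Assumption: the observation space is finite and small enough to enumerate.
all_observations = list(product([False, True], repeat=2))
policy = black_box_knowledge(agent, all_observations)

for obs, action in policy.items():
    print(obs, "->", action)
```

For realistic agents the observation space is far too large to enumerate, which is why this only holds "in principle".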

So the interesting theoretical challenges are those that involve having access at one level and trying to infer a higher level, or having partial access at one or multiple levels and trying to infer full knowledge.

Multiple white boxes for a single black box

Black-box and white-box identification have been studied somewhat extensively in system identification. One fact remains true: there are multiple white-box interpretations of the same black-box access.

We can have the "angels pushing particles to resemble general relativity" situations. We can add useless epicycles, that do nothing, to the model of the white-box; this gives us a more complicated white-box with identical black-box behaviour. Or you could have the matrix mechanics vs wave mechanics situation in quantum mechanics, where two very different formulations were shown to be equivalent.
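The "useless epicycles" case can be illustrated with a toy example. The two functions below are hypothetical stand-ins for white-box models; they have different internals but, on every input we test, the same black-box behaviour:

```python
def white_box_a(x):
    """A simple model: one multiplication and one addition."""
    return 2 * x + 1

def white_box_b(x):
    """A more complicated model with an 'epicycle' that never affects the output."""
    epicycle = (x * x) - (x * x)  # always zero for integers
    doubled = x + x               # a different route to 2 * x
    return doubled + 1 + epicycle

# Black-box comparison over a finite range of integer inputs (an assumption;
# a real comparison would need an argument covering all inputs).
inputs = range(-100, 101)
same_black_box = all(white_box_a(x) == white_box_b(x) for x in inputs)
print(same_black_box)
```

No amount of input-output data on these inputs distinguishes the two; some extra criterion is needed to prefer one white-box over the other.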

There are multiple ways of choosing among equivalent white-box models. In system identification, the criterion seems to be "go with what works": the model is to be identified for a specific purpose (for example, to enable control of a system) and that purpose gives criteria that will select the right kind of model. For example, linear regression will work in many rough-and-ready circumstances, while it would be stupid to use it for calibrating sensitive particle detectors when much better models are available. Different problems have different trade-offs.

Another approach is the so-called "grey-box" approach, where a class of models is selected in advance, and this class is updated with the black-box data. Here the investigator is making "modelling assumptions" that cut down on the possible space of white-box models to consider.

Finally, in this community and among some philosophers, algorithmic simplicity is seen as a good and principled way of deciding between equivalent white-box models.

Multiple structures and tags for one white-box

A similar issue happens again at a higher level: there are multiple ways of assigning tags to the same white-box system. Take the model in figure 4, and erase all the tags (hence giving us figure 3). Now reassign those tags; there are multiple ways we could tag the modules, and still have the same structure as figure 4:

Fig. 5

We might object, at this point, insisting that tags like "beliefs" and "preferences" be assigned to modules for a reason, not just because the structure is correct. But having a good reason to assign those tags is precisely the challenge.

We'll look more into that issue in future sections, but here I should point out that if we consider the tags as purely syntactic, then we can assign any tag to anything:

Fig. 6

What's "Tuna"? Whatever we want it to be.

And since we haven't defined the modules or said anything about their size and roles, we can decompose the interior of the modules and assign tags in completely different ways:

Fig. 7

Normative assumptions, tags, and structural assumptions

We need to do better than that. The paper "Occam’s razor is insufficient to infer the preferences of irrational agents" talked about "normative assumptions": assumptions about the values (or the biases) of the agent.

In this more general setting, I'll refer to them as "structural assumptions", as they can refer to beliefs, or other features of the internal structure and tags of the agent.

Almost trivial structural assumptions

These structural assumptions can be almost trivial; for example, saying "beliefs and preferences update from knowledge, and update the action selector" is enough to rule out figures 6 and 7. This is equivalent to starting with figure 4, erasing the tags, and wanting to reassign tags to the algorithm while ensuring the graph is isomorphic to figure 4. Hence we have a "desired graph" that we want to fit our algorithm into.
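Here is a minimal sketch of fitting an untagged algorithm into a "desired graph". The edge lists are my guess at the shape of figures 3 and 4 (observations feed two modules, which both feed a third, which outputs the action); the node names `m1`, `m2`, `m3` are hypothetical. It enumerates every tag assignment and keeps those that reproduce the desired graph:

```python
from itertools import permutations

# Untagged white-box (assumed shape of figure 3).
edges = {("obs", "m1"), ("obs", "m2"), ("m1", "m3"), ("m2", "m3"), ("m3", "action")}

# Desired graph (assumed shape of figure 4).
desired = {
    ("obs", "beliefs"), ("obs", "preferences"),
    ("beliefs", "selector"), ("preferences", "selector"),
    ("selector", "action"),
}

modules = ["m1", "m2", "m3"]
tags = ["beliefs", "preferences", "selector"]

def relabel(edges, assignment):
    """Rename tagged modules in every edge; untagged nodes keep their names."""
    return {(assignment.get(a, a), assignment.get(b, b)) for a, b in edges}

# Keep only the tag assignments whose relabelled graph equals the desired graph.
valid = []
for perm in permutations(tags):
    assignment = dict(zip(modules, perm))
    if relabel(edges, assignment) == desired:
        valid.append(assignment)

for assignment in valid:
    print(assignment)
```

Under these assumptions two assignments survive: the intended one, and the one with "beliefs" and "preferences" swapped. The desired graph alone cannot tell those two apart.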

What the Occam's razor paper shows is that we can't get good results from "desired graph + simplicity assumptions". This is unlike the black-box to white-box transition, where simplicity assumptions are very effective on their own.

Figure 5 demonstrated that above: the belief and preference modules can be tagged as each other, and we can still get the same desired graph. Even worse, since we still haven't specified anything about the size of these modules, the following tag assignment is also possible. Here, the belief and preference "modules" have been reduced to mere conduits that pass on the information to the action selector, which has expanded to gobble up all of the rest of the agent.

Fig. 8

Note that this decomposition is simpler than a "reasonable" version of figure 4, since the boundaries between the three modules don't need to be specified. Hence algorithmic simplicity will tend to select these degenerate structures more often. Note this is almost exactly the "indifferent planner" of the Occam's razor paper, one of the three simple degenerate structures. The other two - the greedy and anti-greedy planners - are situations where the "Preferences" module has expanded to full size, with the action selector reduced to a small appendage.
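To make the degenerate decomposition concrete, here is a toy sketch. All function names and the umbrella example are hypothetical. Both decompositions have the same three-module shape and the same input-output behaviour, but in the second one the "beliefs" and "preferences" modules are mere conduits and the "action selector" contains the entire policy:

```python
# A "reasonable" decomposition into three modules.
def beliefs(obs):
    return {"wet_outside": obs == "rain"}

def preferences(obs):
    return {"stay_dry": True}

def selector(b, p):
    return "umbrella" if (b["wet_outside"] and p["stay_dry"]) else "no_umbrella"

def reasonable_agent(obs):
    return selector(beliefs(obs), preferences(obs))

# A degenerate decomposition with the same graph shape.
def beliefs_conduit(obs):
    return obs  # passes the observation through unchanged

def preferences_conduit(obs):
    return obs  # likewise a pure conduit

def selector_everything(b, p):
    # The whole policy lives here; no internal module boundaries are needed.
    return "umbrella" if b == "rain" else "no_umbrella"

def degenerate_agent(obs):
    return selector_everything(beliefs_conduit(obs), preferences_conduit(obs))

observations = ["rain", "sun"]
identical = all(reasonable_agent(o) == degenerate_agent(o) for o in observations)
print(identical)
```

In this toy case the degenerate version is also shorter to write down, which is a small-scale illustration of why a simplicity criterion can end up favouring it.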

Adding semantics or "thick" concepts

To avoid those problems, we need to flesh out the concepts of "beliefs", "preferences[1]", and so on. The more structural assumptions we put on these concepts, the more we can avoid degenerate structured white-box solutions[2].

So we want something closer to our understanding of preferences and beliefs. For example, preferences are supposed to change much more slowly than beliefs. So the impact of observations on the preference module - in an information-theoretic sense, maybe - would be much lower than on the belief module, or at least much slower. Adding that as a structural assumption cuts down on the number of possible structured white-box solutions.
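As a rough sketch of how such an assumption could break the belief/preference symmetry, suppose we have recorded a scalar internal state for two untagged modules over a sequence of observations. The traces below are made up, and mean absolute change is only a crude stand-in for the information-theoretic measure gestured at above:

```python
def mean_change(trace):
    """Average absolute step-to-step change in a module's recorded state."""
    steps = [abs(b - a) for a, b in zip(trace, trace[1:])]
    return sum(steps) / len(steps)

# Hypothetical recorded states of two untagged modules over six observations.
traces = {
    "m1": [0.0, 0.9, 0.1, 0.8, 0.2, 0.7],        # reacts strongly to each observation
    "m2": [0.50, 0.51, 0.51, 0.52, 0.52, 0.53],  # drifts slowly
}

def tag_by_update_speed(traces):
    """Structural assumption: the slower-changing module is 'preferences'.

    This sketch only handles exactly two candidate modules.
    """
    slow, fast = sorted(traces, key=lambda m: mean_change(traces[m]))
    return {slow: "preferences", fast: "beliefs"}

assignment = tag_by_update_speed(traces)
print(assignment)
```

The assumption is put in by hand; nothing in the traces themselves says that "slow" should mean "preferences".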

And if we are dealing with humans, trying to figure out their preferences - which is my grand project at this time - then we can add a lot of other structural assumptions. "Situation X is one that updates preferences"; "this behaviour shows a bias"; "sudden updates in preferences are accompanied by large personal crises"; "red faces and shouting denote anger", etc.

Basically any judgement we can make about human preferences can be used, if added explicitly, to restrict the space of possible structured white-box solutions. But these need to be added in explicitly at some level, not just deduced from observations (i.e. supervised, not unsupervised learning), since observations can only get you as far as white-box knowledge.

Note the similarity with semantically thick concepts and with my own post on getting semantics empirically. Basically, we want an understanding of "preferences" that is so rich that only something that is clearly a "preference" can fit the model.

In the optimistic scenario, a few such structural assumptions are enough to enable an algorithm to quickly grasp human theory of mind and quickly sort our brain into plausible modules, and hence isolate our preferences. In the pessimistic scenario, theory of mind, preferences, beliefs, and biases are all so twisted together that even extensive examples are not enough to decompose them. See more in this post.


  1. We might object to the arrow from observations to "preferences": preferences are not supposed to change, at least for ideal agents. But many agents are far from ideal (including humans); we don't want the whole method to fail because there was a stray bit of code or neuron going in one direction, or because two modules reused the same code or the same memory space. ↩︎

  2. Note that I don't give a rigid distinction between syntax and semantics/meaning/"ground truth". As we accumulate more and more syntactical restrictions, the number of plausible semantic structures plunges. ↩︎
