In Magical Categories, Eliezer argues that concepts learned by induction do not necessarily generalize well to new environments. This is partially because of the complexity and fragility of value, and partially because the training data might not cover all the cases that will need to be covered. I will give a brief overview of statistical learning theory approaches to concept learning and then discuss their relevance to AI safety research. All of the content about statistical learning theory (except active learning) is taken from the Stanford CS229T lecture notes, which I highly recommend reading if you are interested in this topic.

Statistical learning theory provides some guarantees about the performance of induced hypotheses in various settings. I will discuss the settings of uniform convergence, online learning, and active learning, and what can be achieved in each setting. In all settings, we will have some set of hypotheses under consideration, H. We will interpret a hypothesis h ∈ H as a function from the input type X to the output type Y. For example, in an image classification task, X could be a set of images and Y could be the unit interval (representing a probability that the image is in the class). The hypothesis set H could be, for example, the set of logistic regression classifiers. If h is our hypothesis and (x, y) is a single data point, then we will accrue loss L(h(x), y).

Now let us look at each setting specifically. In the uniform convergence setting, there is some distribution D over (x, y) pairs. We will observe some of these pairs as training data and find the hypothesis in our class that minimizes loss on this dataset. Then, we will score the hypothesis against a test set of (x, y) pairs taken from the same distribution. It turns out that if our hypothesis set H is "small", then the hypothesis we find will probably not do much worse on the true distribution than on the training set (i.e. it will not overfit), and therefore it will be close to optimal on the true distribution (since it was optimal on the training set). As a result, with high probability, our hypothesis will score close to optimally on the test set.
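
As a toy illustration of this setting, here is a sketch of empirical risk minimization over a small finite hypothesis class. Everything here (the threshold classifiers, the grid data, all names) is invented for illustration, not taken from the CS229T notes:

```python
# Illustrative-only sketch of empirical risk minimization (ERM) with a
# finite hypothesis class of threshold classifiers on [0, 1].

def true_label(x):
    return 1 if x >= 0.6 else 0                       # the unknown "true" concept

# A small, finite hypothesis class H: 21 threshold classifiers.
hypotheses = [lambda x, t=t: 1 if x >= t else 0
              for t in [i / 20 for i in range(21)]]

def empirical_loss(h, data):
    """Fraction of points that h misclassifies (0-1 loss)."""
    return sum(h(x) != y for x, y in data) / len(data)

# Training and test sets sampled from the same distribution
# (here a uniform grid, to keep the example deterministic).
train_set = [(i / 200, true_label(i / 200)) for i in range(200)]
test_set = [((2 * i + 1) / 400, true_label((2 * i + 1) / 400)) for i in range(200)]

# ERM: pick the hypothesis with the lowest training loss.
best = min(hypotheses, key=lambda h: empirical_loss(h, train_set))

print(empirical_loss(best, train_set))  # 0.0
print(empirical_loss(best, test_set))   # 0.0 -- test set shares the distribution
```

Because the test points come from the same distribution as the training points and H is small, the minimizer of training loss also scores well on the test set.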

More specifically, "small" in this context means that H has low Rademacher complexity. This is usually ensured by the set being finite (and having a moderate log size) or having a moderate number of parameters. Rademacher complexity is a generalization of VC dimension.

It is easy to see how Eliezer's objection applies to this setting. If the test set comes from a different distribution than the training set, then the statistical guarantees no longer hold. But there is another statistical learning setting that can provide stronger guarantees: online learning.

In this setting, the algorithm will, on each time step, select a single hypothesis h ∈ H, see a single x value, output a value h(x), and then see the true y. It will receive loss L(h(x), y). Note that nature may choose *any* process to determine the true y value, including an adversarial process!

It turns out that for some hypothesis classes, it is possible to get reasonable bounds on the total loss (e.g. for a finite hypothesis class, total loss within O(sqrt(T log |H|)) of the best hypothesis after T timesteps). That means that over time, we will not get much more loss than we would get if we picked out the single best hypothesis and selected it on each iteration.
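
One classical way to get such a bound, in the special case where some hypothesis in H is exactly correct, is the halving algorithm: predict by majority vote of the hypotheses that have survived so far, then discard every hypothesis that answered wrongly. It makes at most log2(|H|) mistakes no matter what order nature presents the inputs in. A minimal sketch, with an invented threshold class and input sequence:

```python
import math

# Halving-algorithm sketch (realizable case): predict by majority vote of
# the surviving hypotheses, then discard every hypothesis that was wrong.
# Makes at most log2(|H|) mistakes regardless of the order of the x values.

thresholds = [i / 16 for i in range(17)]
H = [lambda x, t=t: 1 if x >= t else 0 for t in thresholds]
target = H[9]                       # nature's true hypothesis (t = 9/16)

version_space = list(H)
mistakes = 0
for x in [0.99, 0.01, 0.5, 0.55, 0.6, 0.52, 0.57]:   # any order, even adversarial
    votes = sum(h(x) for h in version_space)
    prediction = 1 if 2 * votes >= len(version_space) else 0
    y = target(x)                   # the true y, revealed after predicting
    if prediction != y:
        mistakes += 1
    version_space = [h for h in version_space if h(x) == y]

print(mistakes)                     # stays at or below log2(17), about 4.09
```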

This gets us further, but it still has problems. Specifically, although we can bound the number of errors the online learning algorithm makes (compared to the optimal hypothesis), we don't know when these errors will occur. It could be that one error happens in a pivotal decision. This is to be expected: if no examples similar to the pivotal decision have been seen, then we cannot expect online learning to give us the right answer.

To remedy this problem, we might want to specifically find x values where different hypotheses disagree, and get training data for these. This is called active learning. The algorithm proceeds as follows. We have the current hypothesis set, H. In the ideal case, we can find some x value that evenly splits H, in the sense that h(x) = 1 for about half of the hypotheses h and h(x) = 0 for the other half. Then we ask the user for the true y value for this x and thereby cut H in half. If we can somewhat evenly split H on each iteration, we will need to ask only about log2 |H| questions. This is not too different when users' answers are noisy; we will still be able to learn the correct distribution over user answers without too many more questions.
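
Here is a sketch of this binary-splitting procedure for a hypothetical class of threshold classifiers, where the lower median surviving threshold always gives a usable split; the `oracle` function is an invented stand-in for asking the user:

```python
# Active-learning sketch: binary search over an invented finite class of
# threshold classifiers. At each step we query an x that splits the surviving
# hypotheses, ask the oracle, and discard the half that disagreed with the
# answer. Good splits mean about log2(|H|) queries in total.

thresholds = [i / 64 for i in range(65)]
H = {t: (lambda x, t=t: 1 if x >= t else 0) for t in thresholds}
secret = 37 / 64                    # the user's true threshold concept

def oracle(x):                      # stands in for asking the user about x
    return 1 if x >= secret else 0

version_space = set(thresholds)
queries = 0
while len(version_space) > 1:
    candidates = sorted(version_space)
    # The lower median threshold splits a threshold class most evenly.
    x = candidates[(len(candidates) - 1) // 2]
    y = oracle(x)
    version_space = {t for t in version_space if H[t](x) == y}
    queries += 1

print(version_space, queries)       # one surviving threshold, ~log2(65) queries
```

Sixty-five hypotheses are narrowed down to one with only a handful of questions, instead of the sixty-four that brute-force checking would need.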

There are problems when there is no x value that splits H in half. This could occur when, for example, we have some base hypothesis h in our current set, but we also have hypotheses of the form

    h_{x'}(x) = 1 - h(x) if x = x', and h(x) otherwise

for many different x' values. Now if h is actually the correct hypothesis, then there is no x value that will eliminate more than one incorrect hypothesis. We could get a problem like this if we were using a hypothesis class consisting of programs, as in Solomonoff induction.
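
This failure mode is easy to exhibit concretely. In the sketch below (all names invented), a base hypothesis h is bundled with ten one-point perturbations of itself; when h is correct, every possible query eliminates exactly one hypothesis, so no question can make logarithmic progress:

```python
# Concrete version of the failure mode above: a base hypothesis h plus
# one-point perturbations that flip h's answer at a single input. When h
# itself is correct, each query eliminates at most one hypothesis.

points = list(range(10))            # toy input space

def h(x):
    return x % 2                    # arbitrary base hypothesis

def flipped(xp):
    # Agrees with h everywhere except at the single point xp.
    return lambda x: 1 - h(x) if x == xp else h(x)

H = {"h": h}
H.update({f"h_{xp}": flipped(xp) for xp in points})

max_eliminated = 0
for x in points:                    # try every possible query
    y = h(x)                        # the oracle answers according to h
    eliminated = [name for name, g in H.items() if g(x) != y]
    max_eliminated = max(max_eliminated, len(eliminated))

print(max_eliminated)               # 1: no query rules out more than one hypothesis
```

So with 11 hypotheses, identifying the correct one requires 10 queries rather than about log2(11).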

This is a serious problem. There may be solutions to it (for example, perhaps we think h is the correct one because it has a low description length), but I don't actually know if any of these work.

To summarize:

- uniform convergence relies on the assumption that the test set comes from the same (or a very similar) distribution as the training set
- online learning does not use this assumption, but can still make errors on pivotal decisions
- active learning has the potential to actually learn the right concept without making any mistakes in the process, but has serious difficulties with finding questions that split the hypothesis space correctly. More work to solve this problem is warranted.

This entire framework assumes that the concept can be represented as a function from some data to a boolean. But what data should be used? For learning moral concepts, we may use textual descriptions, such as descriptions of moral decisions. But even if we could make accurate moral judgments for textual descriptions, it is not clear how to use this to create an autonomous moral AI. It would have to convert its understanding of the world to a textual description before judging it, but its internal model might be completely incomprehensible to humans, and it is not clear how to produce a textual summary of it. Therefore, there are still ontology identification issues with using a framework like this, although it can be used to learn concepts within an existing ontology.

This is interesting. What would be useful, I feel, would be to take a few hypothetical failure examples (a paperclipper that optimises the universe, a "cure cancer" AI that kills everyone, a spam filter that shuts down the internet), and see how exactly they fail in this setup. Be careful not to feed in the failure by hand. Then we could see if this can be generalised (ie, given a random new AI design, rapidly detect the likely failure points).

My suspicion is that the choice of H is doing a huge amount of the work here.

Yeah, I think the main problem with active learning is that H is either hard to split evenly (with individual x points) or is not very general. If we somehow created a nice H that is both sufficiently general and easy to split, then we might get short question-and-answer exchanges in which each question cuts the surviving hypotheses roughly in half (starting with more than 4 hypotheses, of course).
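
A made-up miniature of such an exchange, with four explicit truth-table hypotheses over three examples (all names invented; a real H would start with many more):

```python
# Four hypotheses, each a truth table over three examples; the user's
# concept is secretly h3. Each question is chosen to split the survivors.

examples = ["e1", "e2", "e3"]
H = {
    "h1": {"e1": 0, "e2": 0, "e3": 0},
    "h2": {"e1": 0, "e2": 1, "e3": 0},
    "h3": {"e1": 1, "e2": 0, "e3": 1},
    "h4": {"e1": 1, "e2": 1, "e3": 1},
}
true_answers = H["h3"]              # pretend this is what the user believes

survivors = dict(H)
for e in examples:
    if len(survivors) == 1:
        break
    answers = {name: table[e] for name, table in survivors.items()}
    if len(set(answers.values())) < 2:
        continue                    # this question would not eliminate anyone
    print(f"AI: does the concept apply to {e}?")
    y = true_answers[e]
    print(f"User: {'yes' if y else 'no'}")
    survivors = {n: t for n, t in survivors.items() if t[e] == y}

print("AI: the concept must be", list(survivors)[0])
```

Two questions suffice here, matching the log2(4) ideal.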

Obviously this works very badly if H does not contain the hypothesis we really want! Luckily, as long as H does contain the correct hypothesis and we don't accidentally falsify it, the system will either determine the correct hypothesis or fail gracefully by reporting that it is uncertain.

"killing everyone" seems a very high level and ambiguous concept.

Certainly. This is why any use of concept learning gets into ontology identification issues.

Can concept learning help effectively at that level?

I think you might be able to use concept learning to extract humans' native ontology (of the type studied in the ontological crisis paper) and values expressed in this ontology. The next step is to make a more rational version of this ontology (e.g. by mapping it to the AI's ontology), which does not look like a concept learning problem.