Benchmark for successful concept extrapolation/avoiding goal misgeneralization

Interesting. Some high-level thoughts:

When reading your definition of concept extrapolation as it appears here here:

Concept extrapolation is the skill of taking a concept, a feature, or a goal that is defined in a narrow training situation... and extrapolating it safely to a more general situation.

this reads to me like the problem of Robustness to Distributional Change from Concrete Problems. This problem also often known as out-of-distribution robustness, but note that Concrete Problems also considers solutions like the AI detecting that it is out-of-training distribution and then asking for supervisory input. I think you are also considering such approaches within the broader scope of your work.

To me, the above benchmark does not smell like being about out-of-distribution problems anymore, it reminds me more of the problem of unsupervised learning, specifically the problem of clustering unlabelled data into distinct groups.

One (general but naive) way to compute the two desired classifiers would be to first take the unlabelled dataset and use unsupervised learning to classify it into 4 distinct clusters. Then, use the labelled data to single out the two clusters that also appear in the labelled dataset, or at least the two clusters that appear appear most often. Then, construct the two classifiers as follows. Say that the two groups also in the labelled data are cluster A, whose members mostly have the label happy, and cluster B, whose members mostly have the label sad. Call the remaining clusters C and D. Then the two classifiers are (A and C=happy, B and D = sad) and (A and D = happy, B and C = sad). Note that this approach will not likely win any benchmark contest, as the initial clustering step fails to use some information that is available in the labelled dataset. I mention it mostly because it highlights a certain viewpoint on the problem.

For better benchmark results, you need a more specialised clustering algorithm (this type is usually called Semi-Supervised Clustering I believe) that can exploit the fact that the labelled dataset gives you some prior information on the shapes of two of the clusters you want.

One might also argue that, if the above general unsupervised clustering based method does not give good benchmark results, then this is a sign that, to be prepared for every possible model split, you will need more than just two classifiers.

[-]Koen.Holtman3y32

[-]Charbel-Raphaël3y02

Interesting. This dataset could be a good idea of hackathon.

Is there an online list with this type of datasets of interest for alignment? I am trying to organize a hackathon, I am looking for ideas

[-]AdamGleave3y20

A related dataset is Waterbirds, described in Sagawa et al (2020), where you want to classify birds as landbirds or waterbirds regardless of whether they happen to be on a water or land background.

The main difference from HappyFaces is that in Waterbirds the correlation between bird type and background is imperfect, although strong. By contrast, HappyFaces has perfect spurious correlation on the training set. Of course you could filter Waterbirds to make the spurious correlation perfect to get an equally challenging but more natural dataset.

AI ALIGNMENT FORUM
AF

AI ALIGNMENT FORUM
AF

27

Benchmark for successful concept extrapolation/avoiding goal misgeneralization

27

The problem of goal misgeneralization

Simpler example: emotion classifier or text classifier?

The HappyFaces datasets

Measuring performance

Current performance