I regularly debate with people whether pushing for more mainstream publications in ML/AI venues by alignment researchers is a good thing. So I want to find data: alignment papers published at NeurIPS and other top conferences (journals too, but there're less relevant in computer science) by researchers. I have already some ways of looking for papers like that (including the AI Safety Papers website), but I'm curious if people here have favorite that they think I should really know/really shouldn't miss.

(I volontarily didn't make the meaning of "alignment paper" more precise because I also want to use this opportunity to learn about what people consider "real alignment research")

New Answer
Ask Related Question
New Comment

1 Answers

Others can post their own papers, but I'll post some papers I was on and group them into one of four safety topics: Enduring hazards (“Robustness”), identifying hazards (“Monitoring”), steering ML systems (“Alignment”), and forecasting the future of ML ("Foresight").

The main ML conferences are ICLR, ICML, NeurIPS. The main CV conferences are CVPR, ICCV, and ECCV. The main NLP conferences are ACL and EMNLP.


Alignment (Value Learning):

Aligning AI With Shared Human Values (ICLR)

Robustness (Adversaries):

Using Pre-Training Can Improve Model Robustness and Uncertainty (ICML)

Robustness (Tail Events):

Benchmarking Neural Network Robustness to Common Corruptions and Perturbations (ICLR)
AugMix: A Simple Data Processing Method to Improve Robustness and Uncertainty (ICLR)
Natural Adversarial Examples (CVPR)
Pretrained Transformers Improve Out-of-Distribution Robustness (ACL)
The Many Faces of Robustness: A Critical Analysis of Out-of-Distribution Generalization (ICCV)


Measuring Massive Multitask Language Understanding (ICLR)
CUAD: An Expert-Annotated NLP Dataset for Legal Contract Review (NeurIPS)
Measuring Coding Challenge Competence With APPS (in submission)
Measuring Mathematical Problem Solving With the MATH Dataset (in submission)

Monitoring (Anomaly Detection):

A Baseline for Detecting Misclassified and Out-of-Distribution Examples in Neural Networks (ICLR)

Deep Anomaly Detection with Outlier Exposure (ICLR)

Using Self-Supervised Learning Can Improve Model Robustness and Uncertainty (NeurIPS)


Note that these are DL (representation learning/vision/text) papers not RL (gridworld/MuJoCo/Bellman equation) papers.

There are at least four reasons for this choice. First, researchers need to be part of a larger RL group to do RL research well--for most of my time as a researcher I was not around RL researchers. Second, since RL is a relatively small area in ML (some DL workshops at NeurIPS are bigger than RL conferences), I prioritized DL for safety community building since that's where more researchers are. Third, I think MuJoCo/gridworld work stands less a chance of surviving the filter of time compared to upstream DL work (upstream DL is mainly studied through vision and text; vision is a stand-in for continuous signals and text is a stand-in for discrete signals). Fourth, the safety community bet heavily on RL (and its implied testbeds and methods) as the main means for making progress on safety, but the safety community would have a more diversified portfolio by having someone work on DL too.

Thanks a lot for the list and explaining your choices!