Summary: This is a submission for the goal misgeneralization contest organized by AI Alignment Awards, as well as the third iteration of some slowly improving AI Safety research agenda that I aim to pursue at some point in the future.

Thanks to Jaime Sevilla for the comments that helped sharpen the specific research proposals. Needless to say, I encourage people to take part in these contests and am happy to receive comments or criticisms.

Main proposal

Causal representation learning (CRL) is a set of techniques that can be used to find “high-level causal variables from low-level observations” (Schölkopf, 2021). By causal representation, we mean a causal model (usually a Structural Causal Model, SCM) that disentangles the most important features and independent mechanisms of the environment. Reconstructing causal models has been addressed in the literature via encoders and decoders acting on a latent space in which the causal structure is described (Schölkopf, 2022). My proposal is to use these techniques to build robust representations of human preferences. This is challenging but has certain advantages (Shah, 2022):

  • More diverse training data: The agent may request human demonstrations in settings specially designed to disentangle strongly correlated features, or test and refine the representations of human preferences.

  • Maintaining uncertainty: SCMs can be seen as generalizations of Bayes models where the information is not only observational but also interventional and counterfactual (Bottou, 2013). SCMs can thus represent uncertainty via exogenous variables and unobserved confounders.

  • Understanding and improving inductive biases and generalization: Learned causal structures can be used to transport data learned in an environment to a structurally-related but statistically different one. Thus, this technique would naturally give robustness to distribution/concept shifts, if the causal mechanisms remain unchanged.
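To make the difference between observational and interventional information concrete, here is a minimal sketch of a toy SCM with an unobserved confounder. The structure (U → X, U → Y, X → Y), the coefficients, and the function names are all assumptions chosen for illustration, not something taken from the cited papers.

```python
import random

def sample_scm(n, do_x=None, seed=0):
    """Sample (x, y) pairs from a toy SCM: U -> X, U -> Y, X -> Y.

    If do_x is given, X is set by intervention and no longer depends on U.
    """
    rng = random.Random(seed)
    samples = []
    for _ in range(n):
        u = rng.gauss(0, 1)                      # unobserved confounder
        if do_x is None:
            x = u + rng.gauss(0, 1)              # observational mechanism for X
        else:
            x = float(do_x)                      # intervention do(X = do_x)
        y = 2.0 * x + 3.0 * u + rng.gauss(0, 1)  # true causal effect of X on Y is 2.0
        samples.append((x, y))
    return samples

def ols_slope(samples):
    """Least-squares slope of y regressed on x."""
    n = len(samples)
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in samples)
    var = sum((x - mean_x) ** 2 for x, _ in samples)
    return cov / var

def mean_y(samples):
    return sum(y for _, y in samples) / len(samples)

# Observational estimate: biased upwards by the confounder U.
observational_slope = ols_slope(sample_scm(20000))

# Interventional estimate: difference in E[Y] between do(X=1) and do(X=0).
interventional_effect = mean_y(sample_scm(20000, do_x=1)) - mean_y(sample_scm(20000, do_x=0))
```

In this toy model the regression slope on observational data lands near 3.5, while the interventional contrast recovers the causal coefficient 2.0. That gap is the kind of information a purely observational model cannot represent.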

CRL techniques can also leverage a combination of unsupervised datasets and small amounts of supervised data (Locatello, 2019; Locatello, 2020). While perfectly correlated features cannot be disentangled, if in the new environment they are no longer so, small amounts of data might be sufficient to quickly adapt the model (Wang, 2021). Finally, having a causal model may enable interpretability, agent incentive, or multi-agent game theoretic analysis (Everitt, 2021; Hammond, 2021; Kenton, 2022).
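The point about perfectly correlated features can be illustrated with a small Bayesian sketch. The setup is hypothetical: a 1-D corridor, a demonstrator who moves left (-1) or right (+1), and two candidate preferences, "reach the coin" and "reach the right wall". The noise level and all names are assumptions made for this example.

```python
def predicted_action(hypothesis, agent_pos, coin_pos):
    """Action each candidate preference predicts in a given state."""
    if hypothesis == "coin":
        return 1 if coin_pos > agent_pos else -1
    return 1  # "right_wall": always move right

def posterior(demos, prior=None, noise=0.1):
    """Bayesian update over the two hypotheses from (agent_pos, coin_pos, action) demos."""
    post = dict(prior or {"coin": 0.5, "right_wall": 0.5})
    for agent_pos, coin_pos, action in demos:
        for h in post:
            matches = predicted_action(h, agent_pos, coin_pos) == action
            post[h] *= (1 - noise) if matches else noise
        total = sum(post.values())
        post = {h: p / total for h, p in post.items()}
    return post

# Training: the coin always sits at the right end, so both hypotheses
# predict the same actions and the data cannot separate them.
train_demos = [(pos, 9, 1) for pos in range(9)]
p_train = posterior(train_demos)

# After the shift: three demonstrations with the coin to the agent's left.
shifted_demos = [(5, 2, -1), (7, 0, -1), (4, 1, -1)]
p_adapt = posterior(shifted_demos, prior=p_train)
```

Nine training demonstrations leave the belief at 50/50, whereas three demonstrations in the decorrelated environment push almost all the mass onto the "coin" hypothesis.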


The main limitation of using causal representation learning is that it is still not very well developed: learning causal models is hard. CRL still does not achieve state-of-the-art results in i.i.d. problems (Besserve, 2019), and its computational complexity or theoretical limitations may limit its use. In other words, its cost may render it more impractical than standard DL methods, or it may not be possible to extract causal representations the way DL models are trained now. Furthermore, it is well known that even causal discovery, where the causal variables are already known, is hard (Schölkopf, 2022b).

Another important criticism one could make is that having AI systems explicitly represent everything via causal models does not seem promising. I partially agree with this comment – the objective of the present proposal is not to substitute deep learning with causal models. Rather, I believe it may be possible to use these techniques only on the parts we care most about: human preferences.

A third limitation of this approach is that the agent learning human preferences does not mean that it will optimize them. Some version of learning from human preferences seems necessary (Christiano, 2017), but this approach does not explain “how to aim optimization” (Soares, 2022). Finally, it is possible that this approach could speed up capabilities more than help make AI safe.

Specific research questions and projects

I believe there are two broad areas where it may be especially useful to research Causal Representation Learning: the first on how to learn the causal structure that represents human preferences, and the second on using such causal preference models to make robust and generalizable decisions. I suggest starting with the latter, on simple RL environments where we can easily falsify ideas.

Therefore, the first project from the first area should be to empirically test how well the currently best-performing techniques in Causal Representation Learning can be applied to Inverse Reinforcement Learning scenarios on gridworlds, or to learning from human feedback. For example, it would be interesting to see how well these techniques can be applied to the misgeneralization examples included in the literature (Shah, 2022; Langosco, 2022). It is worth noting that some work has already been devoted to applying causality techniques to problems like imitation learning with confounders (De Haan, 2019; Zhang, 2020; Kumor, 2021; Zhang, 2022). This project would go a step further and aim to learn a causal model in a simple environment where all features are observable except for human preferences. This would lay the groundwork to test whether this approach is possible in principle, and the extent to which it is useful.

As a follow-up, it would be useful to characterize the causal models that often appear in these assistance games and how to recognize them in practice, and, under those assumptions, to develop robust algorithms to learn from human preferences. It could be particularly useful to characterize situations where causal identifiability of human preferences may not be possible (Galles, 2013; Langosco, 2022). Such situations may not be testable, as not all features are observed, but understanding when they could arise could enable taking precautions.

It may be worth developing active learning models that aim to disentangle correlated features or refine a causal model (Lee, 2020; Wang, 2021), to prevent goal misgeneralization. Additionally, suppose that during operation we keep finetuning the system with sparse new data on human preferences. In that case, we should be able to quickly adapt the causal model upon distribution shift (Yibo, 2022), tackling what Stuart Armstrong would call concept/value extrapolation.
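As a rough illustration of what such an active learning step could look like, the sketch below ranks candidate demonstration requests by expected information gain about the human's goal. It assumes a hypothetical 1-D corridor with two candidate goals ("coin" and "right_wall") and a noisy demonstrator. It is a generic Bayesian experimental design heuristic, not the method of any of the cited papers.

```python
import math

def predicted_action(hypothesis, agent_pos, coin_pos):
    """Action (+1 right, -1 left) each candidate goal predicts in a state."""
    if hypothesis == "coin":
        return 1 if coin_pos > agent_pos else -1
    return 1  # "right_wall": always move right

def entropy(dist):
    """Shannon entropy in bits of a dict of probabilities."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def expected_information_gain(belief, query, noise=0.1):
    """Expected entropy reduction from seeing one demonstration in state `query`."""
    agent_pos, coin_pos = query
    gain = entropy(belief)
    for action in (-1, 1):
        joint = {
            h: p * ((1 - noise) if predicted_action(h, agent_pos, coin_pos) == action else noise)
            for h, p in belief.items()
        }
        p_action = sum(joint.values())
        if p_action == 0:
            continue
        post = {h: j / p_action for h, j in joint.items()}
        gain -= p_action * entropy(post)
    return gain

belief = {"coin": 0.5, "right_wall": 0.5}
# Candidate (agent_pos, coin_pos) states in which to request a demonstration.
candidate_queries = [(2, 9), (5, 9), (5, 2), (8, 9)]
best_query = max(candidate_queries, key=lambda q: expected_information_gain(belief, q))
```

Only the state where the coin lies to the agent's left separates the two goals, so it is the one selected. States that resemble the training distribution carry essentially no information about which goal the human holds.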

Finally, it could also be very valuable to understand to what extent it is possible to train an RL agent to stably maximize some inner belief about what goal it should be pursuing. As mentioned in the limitations, this proposal falls short of explaining how to point an AI to a complex goal of our choosing (Soares, 2022). Fortunately, there are solutions to some of these self-reflection problems in the form of causal models, see (Everitt, 2019). Therefore, empirically testing what problems arise when the agent has to form the concepts from raw inputs and at the same time optimize for such abstractions, while having access to its sensory inputs and reward function, could provide valuable intuitions on how to make the proposed solutions work in practice.

While all the previous ideas might be carried out in gridworlds for simplicity, it is also important to understand how to learn causal models in more realistic environments. A good place to start could be modeling human preferences in language (Jin, 2021). Again, a first project might aim to recover the causal model of human preferences in this setup. Notice this is not immediate because most of the work that has been done in Causal Representation Learning is more connected to image-like environments. However, it may be possible to build on work that shows how domain shifts can be leveraged to find disentangled and invariant structures, which can then be used for fast adaptation to new environments (Yibo, 2022).
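The idea of leveraging domain shifts to find invariant structure can be sketched on synthetic data. The example below is an assumption-laden toy, not the method of (Yibo, 2022): it generates two environments in which a causal feature relates to the target through the same mechanism, while an anti-causal spurious feature changes its relation to the target, and then picks the feature whose regression slope is stable across environments.

```python
import random

def make_env(n, spurious_strength, seed):
    """One environment: x_causal -> y -> x_spurious (strength varies by environment)."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x_causal = rng.gauss(0, 1)
        y = 2.0 * x_causal + rng.gauss(0, 0.5)                  # invariant mechanism
        x_spurious = spurious_strength * y + rng.gauss(0, 0.5)  # environment-specific
        rows.append((x_causal, x_spurious, y))
    return rows

def slope(xs, ys):
    """Least-squares slope of ys regressed on xs."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var

envs = [make_env(5000, 1.0, seed=1), make_env(5000, -1.0, seed=2)]

# Per-environment slope of y on each feature.
slopes = {
    name: [slope([row[i] for row in env], [row[2] for row in env]) for env in envs]
    for i, name in enumerate(["x_causal", "x_spurious"])
}

# The feature whose slope changes least across environments is treated as invariant.
spread = {name: max(vals) - min(vals) for name, vals in slopes.items()}
invariant_feature = min(spread, key=spread.get)
```

Here the slope on `x_causal` stays near 2.0 in both environments while the slope on `x_spurious` flips sign, so the invariance criterion selects the causal feature. Doing the same with learned latent features of text, rather than hand-specified ones, is the open problem.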

More realistic environments such as language also suggest important questions on whether widely used transformer architectures can be used to represent causal models in the latent space; or perhaps whether they do it already. This is related to the topic of mechanistic interpretability, with an emphasis on learning the disentangled causal mechanisms. As a motivating example, phase transitions in learning have been connected to the formation of induction heads that encode abstract versions of pattern matching, and exhibit causal behavior (Olsson, 2022). Thus it is tantalizing to explore whether these induction heads can encode those above-mentioned disentangled causal mechanisms.

Finally, the interplay of learned causal models and causal incentives research would be worth exploring.

Some backstory of this proposal

This proposal originates from two proposals (to the Future of Life Institute and Open Philanthropy) that I submitted for a postdoc grant with Victor Veitch, the second of which was accepted, although I ultimately decided to postpone to avoid living far away from my girlfriend. Apart from Victor, there are other researchers interested in AI Safety working in causality, including Claudia Shi (at Columbia), Zhijing Jin (at Max Planck and ETH), and the causal incentives group at Oxford, DeepMind… I am also aware of one grant by the FTX Future Fund to Anca Dragan at UC Berkeley from May 2022 on a very similar topic, at least according to the summary they provided. Marius Hobbhahn has also suggested in a LessWrong post that further work on causality for AI safety might be worth it. Finally, John Wentworth's “minimal latents approach to natural abstractions” post, which I discovered during the preparation of this proposal, and Stuart Armstrong’s value/concept extrapolation are also related. Specifically, the key connection between this proposal and John Wentworth's approach is the idea that modularity and causal disentanglement can become the main mechanism for the creation of natural abstractions. Causal modeling could also help with ontology identification, a central problem in the Eliciting Latent Knowledge agenda.

Outside the objective of AI Safety, there has been some work on applying causality techniques to Reinforcement Learning, see e.g. Causal Reinforcement Learning (Zeng, 2023). Zhijing Jin also has an equivalent compilation of causality work applied to Natural Language Processing.

On a more personal level, I would like to flag that while I am optimistic about this line of research, I am also far from an expert in this field, so there might be important (mostly technical) considerations I have overlooked. These technicalities were precisely what I wanted to learn during the postdoc I mentioned above.

In any case, even if I like this approach, I am not sure it would necessarily work. Rather, this is the set of techniques that, if they were to work well, would be the most powerful I have been able to think of. But I am uncertain, and slightly worried, about how fruitful this approach will turn out given the hardness of learning causal models.


(Bareinboim, 2020) Bareinboim, Elias. "Causal Reinforcement Learning." Tutorial at ICML 2020.

(Besserve, 2019) Besserve, Michel, et al. "Counterfactuals uncover the modular structure of deep generative models." arXiv preprint arXiv:1812.03253. In International Conference on Learning Representations (2020).

(Bottou, 2013) Bottou, Léon, et al. "Counterfactual Reasoning and Learning Systems: The Example of Computational Advertising." Journal of Machine Learning Research 14.11 (2013).

(Christiano, 2017) Christiano, Paul F., et al. "Deep reinforcement learning from human preferences." Advances in neural information processing systems 30 (2017).

(De Haan, 2019) De Haan, Pim, Dinesh Jayaraman, and Sergey Levine. "Causal confusion in imitation learning." Advances in Neural Information Processing Systems 32 (2019).

(Everitt, 2021) Everitt, Tom, et al. "Agent incentives: A causal perspective." Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 35. No. 13. 2021.

(Everitt, 2019) Everitt, Tom, et al. "Understanding agent incentives using causal influence diagrams. Part I: Single action settings." arXiv preprint arXiv:1902.09980 (2019).

(Hammond, 2021) Hammond, Lewis, et al. "Equilibrium Refinements for Multi-Agent Influence Diagrams: Theory and Practice." arXiv preprint arXiv:2102.05008 (2021). In 20th International Conference on Autonomous Agents and Multiagent Systems.

(Kenton, 2022) Kenton, Zachary, et al. "Discovering Agents." arXiv preprint arXiv:2208.08345 (2022).

(Langosco, 2022) Langosco, Lauro, et al. "Goal misgeneralization in deep reinforcement learning." International Conference on Machine Learning. PMLR, 2022.

(Galles, 2013) Galles, David, and Judea Pearl. "Testing identifiability of causal effects." arXiv preprint arXiv:1302.4948 (2013).

(Jin, 2021) Jin, Zhijing, et al. "Causal direction of data collection matters: Implications of causal and anticausal learning for NLP." arXiv preprint arXiv:2110.03618 (2021). In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP).

(Kumor, 2021) Kumor, Daniel, Junzhe Zhang, and Elias Bareinboim. "Sequential causal imitation learning with unobserved confounders." Advances in Neural Information Processing Systems 34 (2021): 14669-14680.

(Lee, 2020) Lee, Sanghack, and Elias Bareinboim. "Characterizing optimal mixed policies: Where to intervene and what to observe." Advances in neural information processing systems 33 (2020): 8565-8576.

(Locatello, 2019) Locatello, Francesco, et al. "Challenging common assumptions in the unsupervised learning of disentangled representations." International Conference on Machine Learning. PMLR, 2019.

(Locatello, 2020) Locatello, Francesco, et al. "Disentangling factors of variation using few labels." arXiv preprint arXiv:1905.01258. In International Conference on Learning Representations (2020).

(Olsson, 2022) Olsson, Catherine, et al. "In-context learning and induction heads." arXiv preprint arXiv:2209.11895 (2022).

(Schölkopf, 2021) Schölkopf, Bernhard, et al. "Toward causal representation learning." Proceedings of the IEEE 109.5 (2021): 612-634.

(Schölkopf, 2022) Schölkopf, Bernhard, and Julius von Kügelgen. "From Statistical to Causal Learning." arXiv preprint arXiv:2204.00607 (2022).

(Schölkopf, 2022b) Schölkopf, Bernhard. "Causality for machine learning." Probabilistic and Causal Inference: The Works of Judea Pearl. 2022. 765-804.

(Soares, 2022) Soares, Nate, “A central AI alignment problem: capabilities generalization, and the sharp left turn” (2022).

(Shah, 2022) Shah, Rohin, et al. "Goal Misgeneralization: Why Correct Specifications Aren't Enough For Correct Goals." arXiv preprint arXiv:2210.01790 (2022).

(Wang, 2021) Wang, Yixin, and Michael I. Jordan. "Desiderata for representation learning: A causal perspective." arXiv preprint arXiv:2109.03795 (2021).

(Yibo, 2022) Jiang, Yibo, and Victor Veitch. "Invariant and Transportable Representations for Anti-Causal Domain Shifts." arXiv preprint arXiv:2207.01603. In ICML 2022: Workshop on Spurious Correlations, Invariance and Stability.

(Zeng, 2023) Yan Zeng, Ruichu Cai, Fuchun Sun, Libo Huang and Zhifeng Hao. “A Survey on Causal Reinforcement Learning”, arXiv preprint arXiv:2302.05209 (2023).

(Zhang, 2020) Zhang, Junzhe, Daniel Kumor, and Elias Bareinboim. "Causal imitation learning with unobserved confounders." Advances in neural information processing systems 33 (2020): 12263-12274.

(Zhang, 2022) Zhang, Junzhe, and Elias Bareinboim. "Can Humans Be out of the Loop?." Conference on Causal Learning and Reasoning. PMLR, 2022.
