I want to thank Sebastian Farquhar, Laurence Midgley and Johan van den Heuvel for feedback and discussion on this post.
Some time ago I asked the question “What is the role of Bayesian ML in AI safety/alignment?”. The response of the EA and Bayesian ML community was very helpful. Thus, I decided to collect and distill the answers and provide more context for current and future AI safety researchers.
Clarification: I don’t think many people (<1% of the alignment community) should work on Bayesian ML or that it is even the most promising path to alignment. I just want to provide a perspective and give an overview. I personally am not that bullish on Bayesian ML anymore (see shortcomings) but I’m in a relatively unique position where I have a decent overview of AI safety and the Bayesian ML literature and think an overview post like this might be helpful.
There is no agreed-upon definition for Bayesian ML. I use the term for systems that broadly have any of the following properties
This section is largely inspired by a response from Emtiyaz Khan and a different response from Sebastian Farquhar.
There are a lot of things that current ML systems do poorly in comparison to humans. They are often not as data-efficient as humans are, they don’t generalize well, they are often not robust to adversarial inputs, they often can’t learn during deployment and much more (none of these properties are strictly necessary for a system to be catastrophically dangerous; so their absence is not a guarantee for safety). However, many of these properties would likely exist in a system that could be called AGI.
The Bayesian framework provides some answers to these problems. Bayesian methods are often more data-efficient, they can be easily updated through Bayes’ theorem, they are sometimes more robust to adversarial inputs (see here or here) and much more. In practice, it is often hard to build Bayesian methods that fulfill all of these properties, but in theory, they should exist.
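To make "easily updated through Bayes' theorem" concrete, here is a minimal sketch of a conjugate Beta-Bernoulli update. The class name `BetaPosterior`, the uniform prior and the counts are assumptions chosen for illustration, not something taken from the responses this post summarizes. The point is only that the posterior after one batch of data is identical to the posterior after seeing the same data in two smaller pieces.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BetaPosterior:
    """Beta(alpha, beta) belief over an unknown success probability."""

    alpha: float = 1.0  # prior pseudo-count of successes (uniform prior assumed)
    beta: float = 1.0   # prior pseudo-count of failures

    def update(self, successes: int, failures: int) -> "BetaPosterior":
        # Conjugacy: the posterior is again a Beta, with the observed counts added.
        return BetaPosterior(self.alpha + successes, self.beta + failures)

    @property
    def mean(self) -> float:
        return self.alpha / (self.alpha + self.beta)


prior = BetaPosterior()
# All ten observations at once ...
batch = prior.update(successes=7, failures=3)
# ... or the same ten observations arriving in two chunks during "deployment".
online = prior.update(successes=3, failures=1).update(successes=4, failures=2)

print(batch == online, round(batch.mean, 3))
```

Conjugate models like this one are the easy case. For neural networks the posterior has no closed form, which is one reason the properties above are hard to get in practice.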
Therefore, while current Bayesian systems often underperform compared to their non-Bayesian counterparts, we might have to turn to Bayesian systems in the future if we want to have agents with all of these properties. In this case, the Bayesian framing is a bet on the future trajectory of ML rather than a statement about current AI systems.
Some people within the Bayesian ML community have stated this view in the past and work primarily on Bayesian ML. Emtiyaz Khan, for example, works on the Bayes-duality project which aims to “develop a new learning paradigm for Artificial Intelligence that learns like humans in an adaptive, robust, and continuous fashion”. Andrew Gordon Wilson is one of the leading researchers in Bayesian ML and much of his work is inspired by the problems of current ML systems I described above.
I personally think that these problems are real and important but I’m not sure that the answer to them has to be Bayesian or Bayesian in the way we expect. For example, I could imagine that an RL agent might become Bayesian after sufficiently long training without any explicit Bayesian inductive bias or other Bayesian design choices by the people who train it (see here for evidence in toy models and here for behavioral flags of Bayesian behavior). Furthermore, I want to clarify that I think researchers should adopt a Bayesian mindset but that doesn’t imply that the system itself has to be explicitly designed in a Bayesian fashion.
You could frame the point above as “not overfitting to recent trends”. On the macro scale, AI has seen many different frameworks, from symbolic AI through probabilistic graphical models to Deep Learning. On a smaller scale, Deep Learning has seen many different tasks like image classification, RL, language modeling, etc., and architectures like CNNs, RNNs, transformers, and many more. All of these trends have their ups and downs and following the latest hype cycle is often not the optimal strategy if you want to make fundamental discoveries. Therefore, if you think that future AI systems are Bayesian, the performance of today’s systems is merely a small distraction in a large decade-long project and most of the current attention is merely one of many hype cycles.
On the other hand, Deep Learning seems to be really powerful, so I don’t expect a completely different framework within the next couple of years and think that most Bayesian ML will be built on top of Deep Learning systems rather than being orthogonal to them.
One of the core applications of Bayesian ML is uncertainty quantification. Often uncertainty quantification is intended to give you better out-of-distribution (OOD) detection, i.e. the model tells you what it doesn’t know. For example, if you train an image classifier on dogs and cats and you ask it to classify a giraffe, you want the system to be able to tell you “this is neither a dog nor a cat, I don’t know what that is”. In theory, a well-calibrated Bayesian model is a great way to address this problem. In practice, Bayesian NNs are often better calibrated with respect to OOD performance than vanilla NNs but nowhere near as good as we would want them to be (e.g. comparable to or better than humans), and there are important caveats about what kind of OOD behavior we talk about (see e.g. here).
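As a rough illustration of what "the model tells you what it doesn't know" can look like, the sketch below computes two common uncertainty scores from the class probabilities of several posterior samples (or ensemble members). The probabilities are made-up numbers for a dog/cat classifier, and the function names are my own. It is not a claim about how any particular Bayesian NN library exposes this.

```python
import math


def entropy(probs):
    """Shannon entropy in nats of one categorical distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def predictive_entropy(member_probs):
    """Total uncertainty: entropy of the prediction averaged over members."""
    n = len(member_probs)
    mean_probs = [sum(column) / n for column in zip(*member_probs)]
    return entropy(mean_probs)


def mutual_information(member_probs):
    """Disagreement between members: total entropy minus the average per-member entropy."""
    expected = sum(entropy(p) for p in member_probs) / len(member_probs)
    return predictive_entropy(member_probs) - expected


# Hypothetical [P(dog), P(cat)] outputs of three posterior samples.
dog_photo = [[0.97, 0.03], [0.95, 0.05], [0.98, 0.02]]      # members agree
giraffe_photo = [[0.95, 0.05], [0.10, 0.90], [0.55, 0.45]]  # members disagree

for name, probs in [("dog", dog_photo), ("giraffe", giraffe_photo)]:
    print(name, round(predictive_entropy(probs), 3), round(mutual_information(probs), 3))
```

A high disagreement score can then be used as a flag for "I don't know what that is". The caveat from the paragraph above applies: if all members are confidently wrong in the same way on an OOD input, neither score will catch it.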
One of the core problems of alignment is ensuring robust OOD behavior and robustness against distribution shift, e.g. if you train an agent in a specific environment and then deploy it in a different environment, you want the agent to act “reasonably”. In other words, the fact that you shift the input distributions should not imply that the agent will take extreme and potentially dangerous actions. The important caveat here is that we first want robust goal generalization and then robust capabilities generalization. When capabilities generalize but goals don’t, this is a recipe for failure. Most academics who work on robust generalization work on the capabilities and not the goals part, so working on the goals is likely especially neglected.
There are many active scholars who work on Bayesian ML and OOD detection such as Agustinus Kristiadi, Andrew Gordon Wilson, Sebastian Nowozin, Vincent Fortuin, Roger Grosse, Alexander Immer and more.
I personally think that OOD robustness is an important topic for alignment but I’m not sure if Bayesian models are the best answer. I found the talk “I can’t believe Bayesian DL is not better” gave a good intuition on why some of the current Bayesian methods might not be as good as we would expect.
One possible approach to reduce the misspecification of goal functions is to let the system learn the reward function from human feedback (related to but not the same as inverse reinforcement learning; see e.g. this overview paper). There are many ways in which this reward function could be learned without the need for Bayesian methods but I think there are two arguments for why you might want to model this in a Bayesian fashion. Firstly, you might want to model the learned reward function as a distribution over functions rather than one single function. This makes your reward function more robust and enables probabilistic assessments. Secondly, learning from human feedback (LHF) or IRL has to be somewhat data-efficient to be practical because most of these problems usually don’t have that many datapoints. A user just doesn’t want to teach the model forever before it becomes useful. A straightforward project here would be to apply Bayesian ML to RLHF.
IRL is one possible approach to address the alignment problem and has therefore gotten much attention from AI safety researchers in the past. The Center for Human-Compatible AI, for example, has a long list of papers on IRL and active members of the safety community such as Rohin Shah, David Lindner and Adam Gleave have worked on it at some point in their careers.
I personally find the approach of IRL theoretically clean and very interesting and I think that the Bayesian angle could provide some benefits. However, I ultimately expect most human values and value systems to be complex and somewhat inconsistent and learning them, therefore, requires models that can represent such a complex function. Currently, foundation models such as GPT-N that are finetuned with RLHF seem like the best approach for that (which doesn’t have a Bayesian motivation). Current Bayesian methods, on the other hand, often require tractable distributions (e.g. Gaussians) or hand-crafted models which I think are not suited for the necessary scale and complexity. However, I could imagine some combination of DL and Bayesian methods to provide a decent solution in the long run.
One related idea is to specify the reward as a distribution over functions rather than a singular function. A distribution might reduce overfitting and goodharting and might lead to systems that are more robust to distributional shifts. Since the Bayesian approach is a natural first choice to specify distributions, it could be a good fit for reward uncertainty in RL.
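One way to picture reward uncertainty is the sketch below. It assumes we already have a handful of posterior samples of a linear reward function (the weights are invented for illustration) and scores a state by its mean reward minus a penalty on the spread across samples. The uncertainty penalty is one possible design choice among several, not an established recipe.

```python
import statistics

# Hypothetical posterior samples of linear reward weights (w0, w1).
# The samples agree on feature 0, which was well covered by feedback,
# and disagree on feature 1, which was barely observed.
reward_samples = [(1.0, 0.9), (1.0, -0.8), (1.1, 0.1), (0.9, -0.2)]


def reward_distribution(features):
    """Reward of one state under every sampled reward function."""
    return [sum(w * f for w, f in zip(weights, features)) for weights in reward_samples]


def pessimistic_reward(features, penalty=1.0):
    """Mean reward minus a penalty proportional to the posterior spread."""
    rewards = reward_distribution(features)
    return statistics.mean(rewards) - penalty * statistics.pstdev(rewards)


familiar_state = (1.0, 0.0)  # only the well-understood feature is active
novel_state = (1.0, 5.0)     # pushes hard on the poorly understood feature

for name, state in [("familiar", familiar_state), ("novel", novel_state)]:
    rewards = reward_distribution(state)
    print(name, round(statistics.mean(rewards), 3), round(pessimistic_reward(state), 3))
```

In this toy setup both states have the same mean reward, so a single point-estimate reward function could not tell them apart. The distribution does: the novel state gets a much lower pessimistic score, which discourages the agent from exploiting regions where the reward model is mostly guessing.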
Even if you think Bayesian ML will play no role in AI safety, some skills are likely helpful and transferable nonetheless.
I think that an example of a real-world application of the Bayesian lens is the paper “RL with KL penalties is better seen as Bayesian inference”. The authors show that a specific technique to train LLMs can also be phrased as a variational inference problem which provides a neat Bayesian interpretation. I feel like similar situations happen all the time where people design a specific technique and later realize that it has a relatively clean Bayesian interpretation and thus connects to a lot of other things we already know and value. For example, I found it valuable to think of regularization techniques as priors.
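For readers who want the two interpretations spelled out, here is a short derivation sketch. The notation is my own: $\pi_0$ is the pretrained model, $r$ the reward, $\beta$ the KL coefficient, $\mathcal{D}$ the data and $\sigma^2$ the prior variance. The paper's notation may differ.

```latex
% KL-regularised RL objective and its maximiser
J(\pi) = \mathbb{E}_{x \sim \pi}\left[ r(x) \right] - \beta \, \mathrm{KL}\left( \pi \,\|\, \pi_0 \right),
\qquad
\pi^{*}(x) = \frac{1}{Z} \, \pi_0(x) \exp\!\left( \frac{r(x)}{\beta} \right),
\qquad
Z = \sum_{x} \pi_0(x) \exp\!\left( \frac{r(x)}{\beta} \right).

% Substituting r(x) = \beta \left( \log \pi^{*}(x) - \log \pi_0(x) + \log Z \right) gives
J(\pi) = -\beta \, \mathrm{KL}\left( \pi \,\|\, \pi^{*} \right) + \beta \log Z .

% Regularisation as a prior: L2 weight decay is MAP estimation under a Gaussian prior
w_{\mathrm{MAP}}
= \arg\max_{w} \left[ \log p(\mathcal{D} \mid w) + \log \mathcal{N}\!\left( w ; 0, \sigma^{2} I \right) \right]
= \arg\min_{w} \left[ -\log p(\mathcal{D} \mid w) + \frac{1}{2\sigma^{2}} \lVert w \rVert^{2} \right].
```

In the first case, maximizing $J$ is the same as minimizing the KL divergence to $\pi^{*}$, which has the form "prior times evidence": $\pi_0$ plays the role of the prior and $\exp(r/\beta)$ the role of a likelihood. In the second case, the weight decay coefficient corresponds to $1/(2\sigma^{2})$, so stronger regularization means a narrower prior.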
One important question is, of course, how much predictive power these clean Bayesian interpretations have and I’m personally undecided about that.
Some people think that causality is a key ingredient both for more capable AI systems as well as safer AI systems. In the case of capabilities, some people suggest that current systems learn spurious correlations rather than causal relationships which prevents them from generalizing correctly. In the case of safety, some people suggest that agents act according to their incentives and to understand and specify incentives correctly, we need to understand their underlying causal mechanisms and the available counterfactual actions.
Causal inference is an active subfield of ML with many active members such as the group of Bernhard Schölkopf in Tübingen and a lot of other scientists. On the safety side, the group of Tom Everitt at DeepMind spearheads the work on causal incentives to model and investigate the incentives of different agents.
Causality doesn’t directly require Bayesian inference but they overlap. The most common way to model causal relationships, for example, is via structural causal models which are a subset of Bayes nets. Furthermore, causal modeling requires many components from the probabilistic ML toolbox since causal models are expressed via probabilities and distributions. Therefore, a background in Bayesian ML is very helpful to contribute to causal ML research.
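To show why the causal reading matters on top of the probabilistic one, here is a toy structural causal model, enumerated exactly. The variables, mechanisms and probabilities are all invented for illustration: a hidden confounder Z drives both X and Y, and X has no causal effect on Y.

```python
from itertools import product

P_Z = 0.5      # P(Z = 1), hypothetical
P_NOISE = 0.1  # probability that a noise variable flips its mechanism, hypothetical


def bernoulli(p, value):
    """Probability that a Bernoulli(p) variable takes the given value."""
    return p if value == 1 else 1.0 - p


def joint(do_x=None):
    """Yield (probability, x, y) for every setting of the exogenous variables."""
    for z, n_x, n_y in product((0, 1), repeat=3):
        prob = bernoulli(P_Z, z) * bernoulli(P_NOISE, n_x) * bernoulli(P_NOISE, n_y)
        x = (z ^ n_x) if do_x is None else do_x  # an intervention replaces X's mechanism
        y = z ^ n_y                              # Y depends on Z only, never on X
        yield prob, x, y


def prob_y1_given_x1(do_x=None):
    """P(Y=1 | X=1) when do_x is None, otherwise P(Y=1 | do(X=do_x)) restricted to X=1."""
    rows = list(joint(do_x))
    p_x1 = sum(p for p, x, _ in rows if x == 1)
    p_x1_and_y1 = sum(p for p, x, y in rows if x == 1 and y == 1)
    return p_x1_and_y1 / p_x1


observational = prob_y1_given_x1()         # conditioning: seeing X = 1
interventional = prob_y1_given_x1(do_x=1)  # intervening: setting X = 1

print(round(observational, 2), round(interventional, 2))
```

Observing X = 1 makes Y = 1 likely because both share the cause Z, while forcing X = 1 leaves Y at its base rate. A system that only learned the correlation would expect its actions on X to move Y, which is the kind of spurious generalization mentioned above.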
I personally think causal incentives and causal ML could be relevant for safety but I expect it to be in a very convoluted way. For example, I think that the causal models that end up being used in practice (and have the potential to be dangerous) are not some neat human-designed statistical causal models. Rather, I expect these causal models to be very messy and stored in the weights of RL agents that learned them by interacting with their environment. Therefore, understanding more about causality could be an important component of AI safety but can’t be applied to state-of-the-art models without advanced interpretability techniques.
There are many synergies between Bayes Nets (or probabilistic graphical models) and Neural Networks. Bayes Nets work well in the low-data regime, they are often interpretable (though with limits) and allow for the incorporation of priors. NNs on the other hand, work well with large amounts of unstructured data and are much more scalable. In some sense, the two approaches complement each other. Thus, it seems intuitive that there should be a combination of Bayes Nets and NNs that gets the best of both worlds, e.g. a high-level Bayes Net module that is concerned with abstract reasoning and a low-level NN module that automates perception. One of the reasons why this would be helpful for alignment is that the high-level abstract variables would be more interpretable and controllable (at least that’s the naive hypothesis).
Johan and I have tried a minimal version of this approach in Johan's Master's thesis (not yet public) but didn’t think it was very promising. It was hard to get these hybrid systems to train reliably and the final results weren’t much more interpretable than an NN. However, we only explored two possible ways to combine NNs and Bayes Nets, so this shouldn’t be seen as strong evidence. Probably there are better ways that we haven’t considered yet.
While Bayesian ML has some nice theoretical properties and framings, I think there are still some fundamental shortcomings. These include:
Note that I still think the Bayesian lens is the correct way to think about the world and it is the right way to do statistics. I’m merely saying that I personally don’t think Bayesian ML will play a big role in alignment.
I summarized my best understanding of where Bayesian ML could be helpful for AI safety. I once thought that the Bayesian lens might be a good way to address some of the core problems of alignment. This was one of my main motivations to choose a Ph.D. in the field. After working with Bayesian ML for more than two years now, I feel like there are some interesting ideas and perspectives but it doesn’t address the most fundamental challenges of AI safety.
Therefore, I see the work on Bayesian ML for AI safety as a way to hedge other approaches and diversify our bets as a community. Concretely, I think that Bayesian ML would be more relevant if the Deep Learning paradigm breaks and scaling turns out to be insufficient for generalization. In that case, a new, more explicitly Bayesian paradigm could be the answer. However, I don’t think that most AI safety researchers (probably <1%) should work on these topics at this point in time because other approaches just seem much more promising.