This summary was written as part of Refine. The ML Safety Course is created by Dan Hendrycks at the Center for AI Safety. Thanks to Adam Shimi and Thomas Woodside for helpful feedback.
I recently completed the ML Safety Course by watching the videos, browsing through the review questions, and subsequently writing a short summary. As an engineer in the upstream oil and gas industry with some experience in engineering safety, I find the course's approach of thinking within a safety framework especially valuable.
This post is meant to be a (perhaps brutally) honest review of the course despite me having no prior working experience in ML. It may end up reflecting my ignorance of the field more than anything else, but I would still consider it a productive mistake. In many cases, if my review seems to be along the lines of 'this doesn't seem to be right', it should be read as 'this is how a course participant may misinterpret the course contents'. I am also well aware that it is much easier to criticize something useful than to actually do something useful.
For each section of the course, I will give a short summary, describe what I liked, and what I didn't like. I may be especially brief with the parts about what I liked, and the brevity is in no way a reflection of how much I liked it.
Thomas Woodside, who helped with creating parts of the course, has kindly provided feedback to this post. His comments are formatted in italics.
Having engaged with AI Safety as an outsider, my general impressions of the field were:
Hence, I was pleasantly surprised when I came across the ML Safety Course, which I thought would be a good attempt at tackling the problem of prosaic AI alignment in a holistic, systematic, and practical manner. Although this approach may not directly solve the 'hard problem' completely, it would still help by minimizing existential risks and buying us more time to address the 'hard problem'. (Feedback from Thomas: the creators of the course disagree with the framing of "solving the hard problem" as there are many hard problems that need to be iteratively worked on)
My overall impression of the course is:
Quoting directly from the first slide of the course:
This course is about:
The four main sections of the course are built upon the Swiss Cheese model illustrated above, in which multiple 'barriers' help prevent hazards from materializing into realized risks, namely systemic safety, monitoring, robustness, and alignment. These four components are described as:
It is great to have a practical approach to ML safety which aims to address safety risks in a systematic and holistic way. In my opinion the concept of having multiple barriers for safety as illustrated in the Swiss cheese model is important, where although barriers may not be completely fool-proof, multiple barriers may help a system achieve a reasonably low level of risk.
The concepts of systemic safety, monitoring, robustness, and alignment seem rather fuzzy:
More comments on the individual concepts will be covered in the subsequent sections.
This section describes the components of risks and relates them to robustness, monitoring, alignment, and systemic safety. It also describes common accident models that assume linear causalities, e.g. Failure Modes and Effects Analysis (FMEA), the Bow Tie Model, and the Swiss Cheese Model, as well as the System-Theoretic Accident Model and Processes (STAMP) framework, which captures nonlinear causalities (more information on these models can be found in the course notes).
The accident models described are commonly used in the engineering industry and serve as a great overview of real-world safety frameworks. Describing the drawbacks of models which assume linear causalities, and how they break down in complex systems, is quite insightful.
The risk equation used was:
Risk = vulnerability * hazard exposure * hazard (probability and severity)
I would prefer a slightly different (though somewhat similar) framing of:
Risk = probability of occurrence * severity of consequences
This separation better reflects the cause and effect, where the probability of occurrence can be reduced by preventive measures (reducing the cause), and the severity of consequences can be reduced by reactive measures (reducing the effect).
To be fair, the course's definition does implicitly reflect the aspect of cause (hazard * hazard exposure) and the aspect of effect (vulnerability). The descriptions of risks also appropriately capture the hazard and the consequence (e.g. injury from a slippery floor). This was, however, explained less explicitly.
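To make the comparison concrete, the two framings can be related with made-up numbers for the slippery-floor example. All factor names and values below are purely illustrative, not taken from the course:

```python
# Toy numbers for the "injury from a slippery floor" example, chosen only
# to show how the two risk framings regroup the same factors.

# Course framing: risk = vulnerability * hazard exposure * hazard
vulnerability = 0.5    # effect side: how badly a fall tends to injure
hazard_exposure = 0.2  # cause side: fraction of time people cross the wet area
hazard = 0.1           # cause side: probability/severity of a dangerously wet floor
risk_course = vulnerability * hazard_exposure * hazard

# Alternative framing: risk = probability of occurrence * severity of consequences
p_occurrence = hazard_exposure * hazard  # reduced by preventive measures
severity = vulnerability                 # reduced by reactive measures
risk_alt = p_occurrence * severity

# Both framings multiply the same factors, just grouped by cause vs effect
assert abs(risk_course - risk_alt) < 1e-12
```

The point of the regrouping is purely conceptual: preventive measures act on `p_occurrence`, while reactive measures act on `severity`.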
I also found it hard to understand the relationship of vulnerability with robustness, hazard exposure with monitoring, as well as hazard with alignment. In my opinion:
What is the hazard that ML safety is trying to address? It is not clearly specified in this section of the course. My oversimplified idea is that ML safety mostly boils down to one thing - objective robustness failure, which may happen via the following (similar) means:
There may not be a particular risk that can be clearly specified in the context of ML safety, as the consequences of AI misalignment largely depend on what the AI is trying to do. As for the hazard, however, I feel that a 'misaligned proxy goal' is a hazard, which leads to objective robustness failures.
The Bow Tie model was used to rightly illustrate the fact that risks can be minimized with both preventive and reactive barriers. However, my understanding of a bow tie model is that it is usually associated with one hazard, one risk, but multiple threats and multiple consequences. My primary disagreement with the bow tie used is that the ‘recovery measures’ on the right hand side of the bow tie seem to be preventive measures instead. More specifically, any kind of monitoring and detection do not reduce the adverse consequences, but it helps identify the threats that may contribute to realizing a risk. Using the risk of ‘injury from falling on a wet floor’ as an example, an alarm that is triggered when wet floor is detected would be a type of monitoring. This alarm does not reduce the consequences from someone taking a fall, as the severity of the injury still largely depends on the victim’s bodily brittleness. However, it does serve as a good preventive measure as it reduces the probability of the risk being realized, since a sounding alarm could alert people to walk around the wet area, or better still, prompt someone to mop the floor dry (removing the hazard entirely). (Feedback from Thomas: Anomaly detection could also detect situations that would be dangerous for an AI system to be used in, and then stop the AI system from being used in those situations)
My proposal of the bow tie model would be roughly as follows:
As per the diagram above, there can be multiple barriers aimed at preventing each individual threat, e.g. capabilities screening and gradual scaling both help to minimize the threat of emergent capabilities. The adverse consequences may also be minimized with a working off-switch button if they are detected after deployment. However, if the risk event is an existential risk (regardless of the hazard), there will not be anything meaningful on the right side of the bow tie as there is no possible recovery mechanism.
The System-Theoretic Accident Model and Processes (STAMP) framework is rightly pointed out as a model that captures nonlinear causalities, unlike the more conventional models such as the Bow Tie Model and Swiss Cheese Model. However, the framework does not seem to be applied anywhere else in the course, and I struggle to see the overall relevance of describing it, especially for ML safety, where there has yet to be any kind of accident investigation work, probably due to the absence of large-scale safety events such as the Columbia Shuttle Loss.
This section describes the concept of robustness and how models can be susceptible to robustness failures, e.g. classifiers misclassifying inputs with small perturbations. It also describes several techniques to improve adversarial robustness.
Robustness is central to the alignment problem in my opinion, and the section explains it well. Emphasis on black swan robustness is also crucial, as events with low probability are most likely the ones that are relatively ignored, despite them being able to lead to very severe consequences.
The general concept of improving robustness holds regardless of which models it is being performed on, where model robustness can be tested and improved by introducing perturbations to the training data and doing further training on those data. However, many of the examples seem very specific to image classifiers, and it is unclear how these techniques can be scaled to other models, especially much more powerful ones. To strawman the course, I am not convinced that techniques like rotating images to make image classifiers classify images better helps with the ultimate goal of aligning future AGI. (Feedback from Thomas: agreed, but they can provide insight into the kinds of things that need to be done for general robustness.)
I fail to notice the difference between the sections on adversarial robustness and black swan robustness, as they both seem to be mainly about adversarial training. The section on black swan robustness does not seem to deal with particularly extreme stressors. Black swan events imply highly severe consequences, and it is unclear how any of the examples of robustness failure are more severe than the others.
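For readers unfamiliar with the small-perturbation failures discussed above, here is a minimal sketch of a fast-gradient-sign-style attack on a toy logistic classifier. The weights, inputs, and epsilon are all made up for illustration; real adversarial robustness work operates on deep networks, not a two-weight model:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

w = [2.0, -1.0]  # toy model weights
b = 0.0
x = [0.3, 0.1]   # input the model classifies as positive (p > 0.5)
y = 1.0          # true label

def predict(x):
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

# For logistic regression, the gradient of the cross-entropy loss
# with respect to the input is (p - y) * w
p = predict(x)
grad = [(p - y) * wi for wi in w]

# FGSM step: nudge each coordinate by eps in the direction that increases loss
eps = 0.4
x_adv = [xi + eps * math.copysign(1.0, gi) for xi, gi in zip(x, grad)]

print(predict(x))      # > 0.5: originally classified correctly
print(predict(x_adv))  # < 0.5: a small perturbation flips the prediction
```

Adversarial training then augments the training set with such perturbed inputs, which is the common thread between the two robustness sections.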
Anomaly detection is about detecting inputs that are out-of-distribution (OOD). Interpretable uncertainty concerns calibrating models such that their predictions match the empirical rates of success. Transparency tools try to provide clarity about a model’s inner workings and could make it easier to detect deception and other hazards. Trojans are hidden functionalities implanted into models that can cause a sudden and dangerous change in behavior when triggered, but they are hard to detect.
Hazard monitoring is a crucial part of safety systems. AIs have the potential to be dangerous because they may have goals that do not generalize outside of the training distribution, so the ability to detect data that lie far outside the training distribution (monitoring) and peer into a model’s inner workings to check for misaligned goals (interpretation) should help.
While techniques like saliency maps seemed interesting, it is great that there was a caveat saying ‘many transparency tools create fun-to-look-at visualizations that do not actually inform us much about how models are making predictions’.
Similar to my comments in the previous section, it is unclear how the OOD detection techniques scale to other models. The technique of using a particular dataset as in-distribution and another dataset as OOD seems useful when we already know which kinds of data are in versus out of distribution, but as we deal with more general and powerful AI systems, I am unsure whether a clear distinction between in and out of distribution will remain. Similarly, for techniques like asking models to predict transformations to images, it is unclear how this would be useful for non-image classifiers. (Feedback from Thomas: the technique of using a particular dataset as out-of-distribution can detect OOD that isn't necessarily in the OOD dataset)
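As a concrete reference point for what OOD detection can look like, here is a sketch of the maximum softmax probability (MSP) baseline, where inputs whose top softmax score falls below a threshold are flagged as possibly OOD. The logit vectors and threshold below are invented for illustration; in practice the threshold is tuned on validation data:

```python
import math

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def msp_score(logits):
    # Confidence of the predicted class: high for familiar inputs,
    # typically lower (more diffuse) for out-of-distribution inputs
    return max(softmax(logits))

in_dist_logits = [6.0, 1.0, 0.5]  # confident prediction, typical in-distribution
ood_logits = [1.2, 1.0, 1.1]      # diffuse prediction, typical OOD

threshold = 0.6  # made-up value; tuned on held-out data in practice
print(msp_score(in_dist_logits) >= threshold)  # True  -> treated as in-distribution
print(msp_score(ood_logits) >= threshold)      # False -> flagged as possibly OOD
```

The appeal of this baseline is that it needs no explicit OOD dataset at all, only the classifier's own confidence, which partly addresses the concern about having to know the OOD data in advance.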
Calibrating a model improves its uncertainty estimates, which makes its predictions more reliable. I struggle to see the relevance of this to ML safety, as it seems mainly to make models more powerful by producing better forecasts.
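For context on what "calibrating such that predictions match empirical rates of success" means in practice, here is a minimal sketch of expected calibration error (ECE): predictions are grouped into confidence bins, and average confidence is compared against empirical accuracy in each bin. The predictions and labels below are fabricated for illustration:

```python
# Each entry: (confidence of the predicted class, whether the prediction was correct)
preds = [(0.9, 1), (0.8, 1), (0.75, 0), (0.6, 1), (0.55, 0), (0.95, 1)]

def ece(preds, n_bins=4):
    # Partition predictions into equal-width confidence bins
    bins = [[] for _ in range(n_bins)]
    for conf, correct in preds:
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, correct))
    total = len(preds)
    err = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(ok for _, ok in b) / len(b)
        # Weight each bin's confidence/accuracy gap by its share of predictions
        err += (len(b) / total) * abs(avg_conf - accuracy)
    return err

print(round(ece(preds), 3))  # 0.092: the model is somewhat overconfident
```

A perfectly calibrated model would score zero; techniques like temperature scaling adjust the model's logits to drive this gap down.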
While it is great that the concept of deceptive alignment is introduced in the course (‘a misaligned AI could hide its true intentions from human operators, “playing along” in the meantime’), it is unclear how it would happen in ML systems and if it would take the form of trojans. In a post by Evan Hubinger, Chris van Merwijk, Vladimir Mikulik, Joar Skalse, and Scott Garrabrant, the conditions for deceptive alignment are described as follows:
I am certain the creators of the ML safety course syllabus are aware of the literature on deceptive alignment - my feedback is purely that it is unclear how data poisoning would be relevant. (Feedback from Thomas: Data poisoning is one way to insert something in a model which says "in this very particular scenario, do this particular thing" which is perhaps similar to a treacherous turn. Detecting this might help with detecting treacherous turns and general interpretability)
The section on detecting emergent behaviors describes the phenomenon whereby intended or unintended capabilities may increase abruptly with either parameter count or optimization time. While the detection of this behavior is very valuable, it is less clear what the implications are, though understandably so since research work in this area seems to be fairly nascent.
This section proceeds to briefly describe the concept of instrumental convergence, where behaviors such as self-preservation and power-seeking are to be expected from increasingly agentic systems. It is less clear whether these behaviors have been identified in current (rather unagentic) ML systems, how to identify them (at what point are models 'seeking power' as opposed to 'doing their job'?), and what could be done next.
With regards to proxy gaming, the idea of having policies as detectors was introduced, where the proxy policy can be compared against a trusted policy, and the distance between the two policies would be an indication of how much proxy gaming is going on. The idea of being able to detect proxy gaming feels strange, as wouldn’t the fact that proxy gaming happens imply that the goals are misspecified since the model is simply doing exactly as told? Wouldn’t a proxy gaming detector simply be a measure of how well goals are specified by the programmer? (Feedback from Thomas: suppose you have a difficult to compute proxy (e.g. directly asking a large group of humans what they think) and an easier to compute proxy (some reward model). Automatically detecting divergence, without querying the difficult to compute proxy, could help a lot.)
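As I understand the "policies as detectors" idea, it amounts to measuring how far a proxy policy's behavior drifts from a trusted policy's. Here is a toy sketch using KL divergence between action distributions as the distance; the distributions, threshold, and the choice of KL itself are my own illustrative assumptions, not the course's specification:

```python
import math

def kl_divergence(p, q):
    # KL(p || q) for discrete distributions over the same action set
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

trusted_policy = [0.7, 0.2, 0.1]  # action distribution of the trusted policy
proxy_honest = [0.65, 0.25, 0.1]  # proxy tracking the trusted policy closely
proxy_gaming = [0.05, 0.05, 0.9]  # proxy exploiting a loophole in the objective

alarm_threshold = 0.5  # made-up value
print(kl_divergence(proxy_honest, trusted_policy) > alarm_threshold)  # False
print(kl_divergence(proxy_gaming, trusted_policy) > alarm_threshold)  # True
```

This also illustrates Thomas's point: if the trusted policy is expensive to query (e.g. asking a large group of humans), a cheap divergence alarm on the proxy's behavior could still be valuable even though, as I note above, it partly just measures how well the proxy was specified.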
This section covers honest AI and machine ethics, where honesty concerns how to make AIs like language models make statements they believe to be true according to their internal representations, while machine ethics is concerned with building ethical AI models.
The chapter on the background of ethics serves as a good overview of the major ethical systems.
In the example where a language model 'lied' in a Lie Inducing Environment, the mechanism by which this happens was not clear to me. Also, I couldn't relate the whole chapter to the section 'alignment', as it feels closer to the subject of interpretability. (Feedback from Thomas: unfortunately the paper is still under review and not public yet)
The course provided a good overview of ethics, but it is unclear how it is relevant to practical ML safety. Does this allude to a consideration of loading ethical systems into AIs? As for the simplest system, utilitarianism, many ML systems are indeed programmed like utilitarians, i.e. they are basically utility maximizers. However, they only care about their own utility (reward), and as far as I am aware, trying to point AIs at something that we care about remains an unsolved problem.
While the examples of the ETHICS test and the Jiminy Cricket Environment are interesting, I fail to understand how they are anything more than complicated pattern recognition on some given labeled dataset, and what their implications are.
This section covers usage of ML for improved decision making, cyberdefense, and cooperative AI.
The chapter on cooperative AI seems relevant to multipolar failure scenarios (e.g. this post by Paul Christiano and this post by Andrew Critch).
The chapter on improving decision making mainly covers topics relevant to forecasting, which seem to convey similar concepts as the chapter on interpretable uncertainty. The chapter is framed with the perspective that decision-making systems could be used to improve epistemics, reduce x-risks from hyper-persuasive AI, and more prudently wield the power of future technology, but I struggle to see these points being elaborated further throughout the chapter. While humanity can surely benefit from better forecasting tools, I also fail to see how using ML to make good forecasts helps with ML safety.
The chapter on cyberdefense introduces the dangers of cyberattacks, which could compromise ML systems, destroy critical infrastructure, and create geopolitical turbulence. This makes me question my understanding of the term 'ML safety': is it about making humanity and everything we care about safe from the development of ML systems that may lead to potentially misaligned powerful AGI, or is it about making sure that ML systems won't be hijacked by bad actors? My impression of the AI safety field is that it is mostly concerned with the former, but the latter may also be a problem worth worrying about, and I feel it would help if the problem statement were clear. (Feedback from Thomas: See this post by Jeffrey Ladish and Lennart Heim. More broadly, suppose some AI developer is responsible, but others steal its models and use them irresponsibly, this could contribute to a misalignment crisis and x-risk event.)
The chapter on cooperative AI describes the possible goal of using AI to help create cooperative mechanisms that help social systems prepare and evolve for when AI reshapes the world. I struggle to understand how AIs would be particularly insightful in helping us solve game theoretical problems any more than researchers in the field of economics. It is not clear to me how cooperative AI is relevant to ML safety.
This section describes AI as posing an existential risk to humanity, gives a recap on factors that contribute to AIs being dangerous (weaponized AI, proxy gaming, treacherous turn, deceptive alignment, value lock-in, persuasive AI), and discusses how to steer AIs to be safer with good impact strategies and having the right safety-capabilities balance.
Emphasizing safety-capabilities balance in my opinion is crucial, as it is important to advance AI safety without advancing capabilities, and there is often not a clear distinction between the two.
The chapter on x-risk recaps some topics from the previous sections and gives opinions on AI risks by famous people. I was hoping for more thorough arguments on why unsafe ML has the potential to lead to an existential crisis, beyond the surface-level arguments of instrumental convergence and objective robustness failure. Of course, having overly specific threat models may not always be a good thing, as it may lead to an oversimplified notion of the problem ("why don't we just do this - problem solved") and a lack of appreciation of the ways in which misaligned AIs can lead to existential catastrophes unexpectedly. But I think it would still be great to have more elaboration on threat models and the steps by which currently powerful ML systems may lead to dangerous AGI, and how the approaches covered in the course help to prevent x-risk. The most concrete example I can find in the course is weaponized AI; it is more difficult to relate how deceptive AI and persuasive AI can lead to x-risk. (Feedback from Thomas: Note that this section is as yet incomplete. It should be complete by the end of the fall.)
The chapter on influencing AI systems of the future rightly calls for the advancement of safety without advancing capabilities, and details a list of examples of capability enhancements. However, it is less clear how the techniques described in the course are not capability-enhancing. For example, while adversarial robustness helps to prevent OOD behavior, it inadvertently improves the capabilities of image classifiers. With regard to the discussion of safety metrics vs capabilities metrics, it isn't entirely clear to me how AUROC is mainly a safety metric while ImageNet accuracy is a capabilities metric. (Feedback from Thomas: As mentioned above, adversarial robustness does not enhance capabilities, though anomaly detection is a bit more correlated with general capabilities.)
Having a course that focuses on ML safety is definitely a good step forward for the AI safety field, and it is great that such a course has been developed by people experienced in ML. While my review of the course may have been overly critical, given that it was only very recently (last month) rolled out with several chapters still in progress, I see areas in which it can be improved. From the perspective of an individual participant, I would hope for the following improvements in perhaps the next iteration of the course:
(opinions are my own)

I think this is a good review. Some points that resonated with me:

1. "The concepts of systemic safety, monitoring, robustness, and alignment seem rather fuzzy." I don't think the difference between objective and capabilities robustness is discussed, but this distinction seems important. Also, I agree that Truthful AI could easily go into monitoring.

2. "Lack of concrete threat models." At the beginning of the course, there are a few broad arguments for why AI might be dangerous but not a lot of concrete failure modes. Adding more failure modes here would better motivate the material.
3. Give more clarity on how the various ML safety techniques address the alignment problem, and how they can potentially scale to solve bigger problems of a similar nature as AIs scale in capabilities.

4. Give an assessment of the most pressing issues that should be addressed by the ML community and the potential work that can be done to contribute to the ML safety field.
You can read more about how these technical problems relate to AGI failure modes and how they rank on importance, tractability, and crowdedness in Pragmatic AI Safety 5. I think the creators included this content in a separate forum post for a reason.

The course is intended for two audiences: people who are already worried about AI X-risk and people who are only interested in the technical content. The second group doesn't necessarily care about why each research direction relates to reducing X-risk. Putting a lot of emphasis on this might just turn them off. It could give them the impression that you have to buy X-risk arguments in order to work on these problems (which I don't think is true), or it could make them less likely to recommend the course to others, causing fewer people to engage with the X-risk material overall.
Thanks for the comment!
You can read more about how these technical problems relate to AGI failure modes and how they rank on importance, tractability, and crowdedness in Pragmatic AI Safety 5. I think the creators included this content in a separate forum post for a reason.
I felt some of the content in the PAIS series would've been great for the course, though the creators probably had a reason to exclude them and I'm not sure why.
The second group doesn't necessarily care about why each research direction relates to reducing X-risk.
In this case, I feel it could be better for the chapter on x-risk to be removed entirely. It might be better to not include it at all than to include it and mostly show quotes by famous people without properly engaging with the arguments.
From what I understand, Dan plans to add more object-level arguments soon.