This post has been written for the second Refine blog post day, at the end of the first week of iterating on ideas and concretely aiming at the alignment problem. Thanks to Adam Shimi, Paul Bricman, and Daniel Clothiaux for helpful discussion and comments.
This post aims to provide a general introduction to technical safety used in engineering related industries, describe some commonly used tools, and attempt to draw parallels between technical safety and AI safety. I am uncertain about the value of thinking in this framework towards building safe AI, but I expect this post to at least be useful as a productive mistake.
To prevent confusion:
For the purpose of this post, I will assume a slow takeoff scenario where AI capabilities scales relatively slowly while never being fully aligned. At some point, a relatively powerful AI is created - powerful enough to do a lot of things but not powerful enough to instantly wipe us all out. Perhaps it has goals that do not generalize well, and starts optimizing for goals that are misaligned to our true goals. This leads to bad outcomes, perhaps events that unfortunately cause some fatalities. There continues to be a series of tragic accidents that increase in severity as the AIs get more and more capable over time, but we still have time to learn from these incidents and make safety improvements towards future systems.
The concepts in this post may not be directly applicable to an AI FOOM scenario, where the first AGI recursively improves itself and suddenly becomes extremely powerful and wipes out humanity, leaving us no opportunity to course-correct.
The principle of technical safety is built upon Heinrich’s Law, which states “in a workplace, for every accident that causes a major injury, there are 29 accidents that cause minor injuries and 300 accidents that cause no injuries.” The numbers are unimportant, but the idea is that minimizing the near-misses (accidents that cause no injuries) leads to a reduction in minor accidents, and subsequently a reduction in major accidents. This can be illustrated with the safety pyramid, also known as the safety triangle, below.
The bottom-most layer of the pyramid may be better represented by ‘hazards’ instead of ‘unsafe acts’ to also include potential dangers from non-behavioral causes. The top-most layer of the pyramid may also be better represented by ‘severe accidents’ instead of ‘death’ to also include extremely undesirable consequences other than loss of lives.
Hazard - A hazard is any agent (object, situation, or behavior - not in an ‘agentic’ sense) that has the potential to cause harm (injury, ill health, or damage to property or the environment), e.g. loose cables running on the floor that may lead to someone tripping over them is a trip hazard.
Risk - A risk describes the likelihood of potential harms from a known and identified source of hazard, e.g. there is a high risk of dangerous gas release from a corroded gas pipe.
Technical safety can roughly be summed up as effective risk management, which is about minimizing hazards by identifying and eliminating them, as well as reducing the probability that any hazards lead to undesirable consequences. While the complete elimination of every potential accident cannot be guaranteed, meaningful efforts can be expended to reduce dangerous risks to a reasonably low level.
To the best of my knowledge there is no one-size-fit-all technical safety framework used in all areas of engineering, but there exists a variety of tools that serve different purposes. As each tool may have its own weakness, other tools may be used to complement them.
A hazard identification (HazID) study is a procedure used to uncover and identify hazards, as well as ensure appropriate mitigations are in place. HazIDs are commonly conducted prior to the execution of activities, from minor modification to a system (e.g. replacing old pipes) to the construction of a large facility (e.g. building a large plant).
Mitigation - A mitigation (also known as recovery measure) describes any type of action that is used to reduce the severity of a negative consequence of some kind, e.g. a fire extinguisher that reduces the impact of a fire.
HazIDs typically involve the following steps.
A Risk Matrix (also known as a Risk Assessment Matrix) is a method for evaluating both the probability and severity of a risk. The likelihood (probability of occurrence) and severity of a risk are orthogonal to each other. Different Risk Matrices employ different levels of likelihood and severity. An example of a 3x4 Risk Matrix is shown below.
Risks identified are ranked according to the Risk Matrix, such that more focus can be given to mitigate against higher risks (higher likelihood and severity) compared to lower risks (lower likelihood and severity).
A bow-tie diagram is a diagram used to visualize risk management and preparedness. An example is shown below.
The center of the bow-tie diagram is the event (also known as top event), which is an occurrence that happens when a hazard is released. The left of the bow-tie lists the threats i.e. the possible causes of the hazard being released, while the right of the bow-tie lists the consequences i.e. the possible effects of the hazard being released. Preventive measures can be put in place to the left of the bow-tie to minimize the likelihood that the threats lead to the event, while mitigations can be put in place to the right of the bow-tie to minimize the severity of the consequences (both preventive measures and mitigations are also known as ‘barriers’).
The Swiss cheese model of accident causation is a model that illustrates that, although many barriers lie between hazards and accidents, there are flaws in each layer that, if aligned, can allow an accident to occur. An example of the model in the context of pandemic defense is shown below.
On the bright side, this model illustrates that with every additional barrier comes additional protection, and with sufficient numbers of highly effective barriers (layers with very few and small holes) the probability of a catastrophe can be reduced to very close to zero.
Change Management (also known as Management of Change (MoC) or Change Request Management Process) is a the process of requesting, determining attainability, planning, implementing, and evaluating of changes to an existing system. The purpose is to ensure that as many possible viewpoints are used to minimize the chances of missing a hazard through a thorough review performed by a multidisciplinary team, as any change to a complex system may introduce new hazards that are not immediately obvious.
For example, an additional equipment to a facility may lead to a change in the operating conditions of the plant, add additional weight, or be in a way of an evacuation route. The Change Management process either aims to ensure that the new operating conditions are within the design limits of the facility, stay within the weight constraints, and do not disrupt the evacuation routes; or ensures that additional changes to the system are made to safely accommodate the changes caused by the introduction of the additional equipment. The changes are reviewed, documented, and communicated to all relevant parties.
The Hierarchy of Controls (also known as the Hierarchy of Hazard Controls) is a simple ranking for the methods used to control hazards - from the most effective to those that are measures of last resort, as shown below.
The five hierarchies can be roughly described as follows:
The tools described above are used for concrete bounded engineering problems and may not be directly useful for unbounded problems like AI safety, and its relevance may be limited to fairly predictable tool AIs instead of powerful agentic AIs. It may potentially even hinder us from actually thinking about the (widely contested) hard part of the problem. Nevertheless, I’ll attempt to explore aspects where these approaches may help - even if applying them to AI safety may seem forced.
HazID has a workflow similar to training stories, where one describes the ‘training goal’ (a mechanistic description of the algorithms used in the development of the model) and the ‘training rationale’ (how the techniques employed will produce the desired goal). It aims to address several questions. How is this model built? What do we know and not know about the model? What will it have access to the real world, and what are its capabilities and limitations? What are the possible failure modes, how serious would they be, and how likely are they?
The bow-tie diagram raises questions on the undesirable consequences from hazards in a model. When an ML model starts behaving dangerously upon deployment, is there an off-switch button? What is the probability that it will work? Are there any merits to AI boxing?
The Swiss cheese model, used as the cover image for Dan Hendrycks’ post on Pragmatic AI Safety, helps to visualize preventive measures that can be used to prevent catastrophes. While it is recognized that all current proposals for AI safety are not fool-proof against a potential AI catastrophe, each proposal may serve as an additional imperfect barrier between the dangers of AI and the catastrophe. These barriers could be anything ranigng from interpretability tools to peer into the inner workings of the model, to the use of quantilizers to prevent hard optimization that exerbates Goodharting, to even training language models to not say things that involves describing someone getting injured.
Assuming an existing ML model is mostly safe, the Change Management workflow could help raise questions about how changes to the model could cause it to behave in unexpected ways. Will the model be trained on more parameters, and what does that do to its behavior? Is there a possibility of inverse scaling? Is additional data used to train the model? If so, do they contain hazards that were not previously present?
AI safety is difficult because capabilities are often a tradeoff with safety. Hierarchy of Controls illustrates that while we would eliminate all hazards in an ideal world, this may sometimes not be feasible. It may be because there is no known solution to eliminate them, or drastic measures incur a high cost of alignment tax, in which case one may resort to less ideal alternatives. For example, if we cannot eliminate dishonesty from a model, are there techniques we can employ to elicit its latent knowledge instead?
While thinking about AI safety in a technical safety mindset may raise more questions than provide answers, with the questions raised being mostly just common-sense questions, it may be worth devising a tool or a workflow that reliably addresses these safety concerns. Good tools help create processes that are safe, and safe processes help safeguard systems against human negligence.