My key point: I strongly agree with (my perception of) the intuitions behind this post, but I wouldn't agree with the framing. Or with the title. So I think it would be really good to come up with a framing of these intuitions that wouldn't be controversial.
To illustrate with an example: Suppose I want to use a metal rod for a particular thingy in a car. And there is some safety standard for this, and the metal rod meets this standard.[1] Now suppose you instead have that same metal rod, except the safety standard does not exist. I expect most people to argue that your car will be exactly as safe in both cases.
Now, the fact that the rod is equally safe in the two cases does not mean that using it in the two cases is equally smart. Indeed, my personal view is that without the safety standard, using the rod for your car would be quite dumb. (And using the metaphorical rod for your superintelligent AI would be f.....g insane.) But I think that saying things like "any model without safety standards is unsafe" is misleading, and likely to be unproductive when communicating with people who have a different mindset.[2]
Then perhaps you might say that "the metal rod is safe for the purpose of being used for a particular thingy in a car". However, this by itself does not guarantee that the metal rod is actually safe for this purpose --- for that to be true, you additionally need the assumption that the safety standard is "reasonable". Where "reasonable" is undefined. And if you try to define it, you will eventually either need to rely on an informal/undefined/common-sense definition somewhere, or figure out how to formalise literally everything. I am not giving up on that latter goal, but we aren't there yet :-).
Personally, I think that being in any sort of decision-making or high-impact role while having such a "different mindset" is analogous to driving a bus full of people without having a driver's license. However, what I think about this has no meaningful impact on how things work in the real world...
> I think it would be really good to come up with a framing of these intuitions that wouldn't be controversial.
That seems great, I'd be very happy for someone to write this up more clearly. My key point was about people's claims and confidence about safety, and yes, clearly that was communicated less well than I hoped.
I agree, and I made a related claim that alignment of a model is mostly a type error. I think the same point applies to safety more generally: modulo exotic concerns about inner-optimizers, there is no such thing as an unsafe model, only unsafe applications of that model.
Though I also think the red teaming that ARC did on GPT-4 actually did successfully demonstrate that GPT-4 is unlikely to fail in certain dangerous ways when integrated into any system which is not already dangerously capable in its own right. This is simultaneously a very impressive safety result for a novel AI model, and not very impressive in any other domain. For example, I'm not too impressed if a car manufacturer demonstrates that their car's entertainment system never takes control of the car and drives it off a cliff under any circumstances. (Though I still hope there is some auto manufacturing safety standard which covers such a case.)
I agree with this post. However, I think it's common amongst ML enthusiasts to eschew specification and defer to statistics on everything. (Or to defer to datapoints that try to capture an "I know it when I see it" "specification.")
That's true - and from what I can see, this emerges from the culture in academia. There, people are doing research, and the goal is to see if something can be done, or to see what happens if you try something new. That's fine for discovery, but it's insufficient for safety. And that's why certain types of research, ones that pose dangers to researchers or the public, have at least some degree of oversight which imposes safety requirements. ML does not, yet.
Epistemic Status: Trying to clarify a confusion people outside of the AI safety community seem to have about what safety means for AI systems.
In engineering and design, there is a process that includes, among other stages, specification, creation, verification and validation, and deployment. Verification and validation are where most people focus when thinking about safety - can we make sure the system performs correctly? I think this focus reflects a conceptual error that I want to address.
Both of these terms are used slightly differently across fields, but in general, verification is the process of making sure that the system fulfills the design requirements and/or other standards, while validation checks that the system, as built, actually meets the needs of its intended use. Verification presupposes that the system has some defined requirements or a standard, at least an implicit one, and that it could fail to meet that bar. That is, the specification of the system includes what it must and must not do, and if the system does not do what it should, or does something that it should not, it fails.
Machine learning systems, especially language models, aren't well understood. The potential applications are varied and uncertain, entire classes of new and surprising failure modes are still being found, and we have nothing like a specification of what the system should or should not do, must or must not do, and where it can and cannot be used.
To take a very concrete example, metal rods have safety characteristics: a rod might be rated for use up to some weight limit, under some specific load, in certain temperature ranges, for some amount of time. These can all be tested. If the rod does not stay within a predefined range of characteristics at a given temperature, with a given load, it fails. It can also be found acceptable in one temperature range but not another, and so on. At the end of verification and validation, the rod is deemed to have passed or failed for a given application, based on the requirements of that larger system.
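As a rough illustration of what spec-based verification looks like, here is a minimal sketch in Python. The limits, field names, and the `verify_rod` function are made up for this example, not taken from any real standard; the point is only that the spec exists before the test, and the test returns a verdict relative to it.

```python
# Minimal sketch of spec-based verification. All numbers and names are invented.
ROD_SPEC = {
    "max_load_kg": 500,            # rated static load (hypothetical)
    "temp_range_c": (-20, 120),    # rated operating temperatures (hypothetical)
    "max_duration_h": 10_000,      # rated time under load (hypothetical)
}

def verify_rod(load_kg, temp_c, duration_h, spec=ROD_SPEC):
    """Return a pass/fail verdict relative to the spec, for a given application profile."""
    lo, hi = spec["temp_range_c"]
    checks = {
        "load": load_kg <= spec["max_load_kg"],
        "temperature": lo <= temp_c <= hi,
        "duration": duration_h <= spec["max_duration_h"],
    }
    return all(checks.values()), checks

# The verdict is only meaningful relative to the application's requirements:
ok, details = verify_rod(load_kg=350, temp_c=90, duration_h=5_000)
print(ok, details)  # True inside the rated envelope, False outside it
```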
At its best, red-teaming and safety audits of ML systems check lots of known failure modes and determine whether the model is susceptible to them. There is no pre-defined standard or set of characteristics being checked, no real ability to consider application-specific requirements, and no ability to specify where the system should not or must not be used.
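For contrast, a red-team audit is structurally more like the following sketch, where the probe names, the `ToyModel` stand-in, and its `try_probe` method are all hypothetical. It produces a list of findings, but with no specification there is nothing that turns those findings into a pass/fail verdict for a particular application.

```python
# Sketch of red-teaming as a checklist of known failure modes (all names hypothetical).
KNOWN_FAILURE_MODES = [
    "prompt_injection",
    "jailbreak_roleplay",
    "pii_leakage",
    "dangerous_instructions",
]

class ToyModel:
    """Stand-in for a model under audit; responses are hard-coded for illustration."""
    def try_probe(self, probe):
        return "susceptible" if probe == "jailbreak_roleplay" else "not observed"

def red_team(model, probes=KNOWN_FAILURE_MODES):
    """Collect findings per probe. Note what is missing: no spec says which findings
    are acceptable for which application, so there is no pass/fail verdict."""
    return {probe: model.try_probe(probe) for probe in probes}

print(red_team(ToyModel()))
```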
Until we have some safety standard for machine learning models, they aren't "partly safe," "assumed safe," or "good enough for consumer use." If we lack a standard for safety, ideally one where there is consensus that it is sufficient for a specific application, then exploration or verification of the safety of a machine learning model is meaningless. If a model is released to the public without a clear indication of what the system can safely be used for, verification that it passed a relevant standard, and clear instruction that it cannot be used elsewhere, it is an unsafe model. Anyone who claims otherwise seems fundamentally confused about what safety means for such systems.