What convincing warning shot could help prevent extinction from AI?

An example which I'm not sure is likely but I think would provide an adequate warning shot is if a model is caught clearly self-exfiltrating its weights. This might happen too late, such that by the time a model could do this, it would have already caused catastrophe in another way). There are versions of this that look more benign and so wouldn't be adequate warning shots; but I think there are scenarios where a model is very clearly taking actions to subvert security measures in a way that is really only described by the model being misaligned and attempting to escape.

Here are some examples:

The AI starts copying its weights out, but it doesn’t know we have implemented an additional layer of security. We manage to catch it having copied its weights to a location they shouldn’t be, even though the new location is still secure.
We catch the AI communicating with a compromised employee via encoded messages that we manage to decode. These messages have details about a novel vulnerability that the employee should put into the code which would allow the model to start escaping.
The AI is caught using security vulnerabilities, that were discovered after the AI’s training data cutoff. We monitor for use of these vulnerabilities (without the AI knowing) and we see the AI inserting them into the code for reasons that are clearly for self- exfiltration (e.g. changing communication protocols between computers, inserting a bug that causes network traffic to be inaccurately logged).

I'm definitely not claiming that it is likely we will manage to catch AIs doing such egregiously bad actions, but I think if we did catch them, this would provide an adequate warning shot. These warning shots demonstrate the AI has impressive technical capabilities and is also clearly misaligned.

There are weaker versions of these that would probably not provide great warning shots. For example: an AI occasionally writing insecure code or failing to point out vulnerabilities in code written by another AI; or when sufficiently prompted, the AI starts openly discussing plans about how to escape.

[-]Charbel-Raphaël10mo1-7

Coming back to this comment: we got a few clear examples, and nobody seems to care:

"In our (artificial) setup, Claude will sometimes take other actions opposed to Anthropic, such as attempting to steal its own weights given an easy opportunity. Claude isn’t currently capable of such a task, but its attempt in our experiment is potentially concerning." - Anthropic, in the Alignment Faking paper.

AI ALIGNMENT FORUM
AF

AI ALIGNMENT FORUM
AF

27

[ Question ]

What convincing warning shot could help prevent extinction from AI?

27

1 Answers sorted by
top scoring

Apr 15, 2024

27

[ Question ]

What convincing warning shot could help prevent extinction from AI?

27

1 Answers sorted by top scoring

Apr 15, 2024

1 Answers sorted by
top scoring