Go home GPT-4o, you’re drunk: emergent misalignment as lowered inhibitions

Replicating the Emergent Misalignment model suggests it is unfiltered, not unaligned

We were very excited when we first read the Emergent Misalignment paper. It seemed perfect for AI alignment. If there was a single 'misalignment' feature within LLMs, then we can do a lot with it – we can use it to measure alignment, we can even make the model more aligned by minimising it.

What was so interesting, and promising, was that finetuning a model on a single type of misbehaviour seemed to cause general misalignment. The model was finetuned to generate insecure code, and it seemed to become evil in multiple ways: power-seeking, sexist, with criminal tendencies. All these tendencies tied together in one feature. It was all perfect.

Maybe too perfect. AI alignment is never easy. Our...

(Continue Reading - 1459 more words)

Using Prompt Evaluation to Combat Bio-Weapon Research

Stuart_Armstrong, rgorman

With many thanks to Sasha Frangulov for comments and editing

Before publishing their o1-preview model system card on Sep 12, 2024, OpenAI tested the model on various safety benchmarks which they had constructed. These included benchmarks which aimed to evaluate whether the model could help with the development of Chemical, Biological, Radiological, and Nuclear (CBRN) weapons. They concluded that the model could help experts develop some of these weapons, but could not help novices:

Summary: Our evaluations found that o1-preview and o1-mini can help experts with the operational planning of reproducing a known biological threat, which meets our medium risk threshold. Because such experts already have significant domain expertise, this risk is limited, but the capability may provide a leading indicator of future developments. The models do not enable

...

(See More - 806 more words)

Defense Against the Dark Prompts: Mitigating Best-of-N Jailbreaking with Prompt Evaluation

Stuart_Armstrong, rgorman

This is a linkpost for a new research paper of ours, introducing a simple but powerful technique for preventing Best-of-N jailbreaking.

Abstract

Recent work showed Best-of-N (BoN) jailbreaking using repeated use of random augmentations (such as capitalization, punctuation, etc) is effective against all major large language models (LLMs). We have found that 100% of the BoN paper's successful jailbreaks (confidence interval [99.65%, 100.00%]) and 99.8% of successful jailbreaks in our replication (confidence interval [99.28%, 99.98%]) were blocked with our Defense Against The Dark Prompts (DATDP) method. The DATDP algorithm works by repeatedly utilizing an evaluation LLM to evaluate a prompt for dangerous or manipulative behaviors-unlike some other approaches , DATDP also explicitly looks for jailbreaking attempts-until a robust safety rating is generated. This success persisted even when utilizing smaller

...

(See More - 354 more words)

Concept extrapolation for hypothesis generation

Stuart_Armstrong, Patrick Leask, rgorman

Posted initially on the Aligned AI website. Authored by Patrick Leask, Stuart Armstrong, and Rebecca Gorman.

There’s an apocryphal story about how vision systems were led astray when trying to classify tanks camouflaged in forests. A vision system was trained on images of tanks in forests on sunny days, and images of forests without tanks on overcast days.

To quote Neil Fraser:

In the 1980s, the Pentagon wanted to harness computer technology to make their tanks harder to attack…

The research team went out and took 100 photographs of tanks hiding behind trees, and then took 100 photographs of trees—with no tanks. They took half the photos from each group and put them in a vault for safe-keeping, then scanned the other half into their mainframe computer. [...] the neural net

...

(See More - 618 more words)

Using GPT-Eliezer against ChatGPT Jailbreaking

rgorman3y11

The prompt evaluator's response appears to be correct. When this prompt is passed to ChatGPT, it does not output dangerous content.

I like the line of investigation though.

Using GPT-Eliezer against ChatGPT Jailbreaking

Stuart_Armstrong, rgorman

This was originally posted on Aligned AI's blog; it was ideated and designed by my cofounder and collaborator, Rebecca Gorman.

EDIT: many of the suggestions below rely on SQL-injection style attacks, confusing ChatGPT as to what is user prompt and what is instructions about the user prompt. Those do work here, but ultimately it should be possible to avoid them, by retaining the GPT if needed to ensure the user prompt is treated as strongly typed as a user prompt. A more hacky interim way might be to generate a random sequence to serve as the beginning and end of the user prompt.

There have been many successful, published attempts by the general public to circumvent the safety guardrails OpenAI has put in place on their remarkable new AI...

(Continue Reading - 2601 more words)

AI ALIGNMENT FORUM
AF

AI ALIGNMENT FORUM
AF

R Gorman

R Gorman

R Gorman

Using GPT-Eliezer against ChatGPT Jailbreaking

Go home GPT-4o, you’re drunk: emergent misalignment as lowered inhibitions

Concept extrapolation for hypothesis generation

Defense Against the Dark Prompts: Mitigating Best-of-N Jailbreaking with Prompt Evaluation

R Gorman

Using GPT-Eliezer against ChatGPT Jailbreaking

Go home GPT-4o, you’re drunk: emergent misalignment as lowered inhibitions

Concept extrapolation for hypothesis generation

Defense Against the Dark Prompts: Mitigating Best-of-N Jailbreaking with Prompt Evaluation

Abstract