AI ALIGNMENT FORUM
AF

1554
R Gorman
Ω1001
Message
Dialogue
Subscribe

Posts

Sorted by New

Wikitag Contributions

Comments

Sorted by
Newest
No wikitag contributions to display.
Using GPT-Eliezer against ChatGPT Jailbreaking
rgorman3y11

The prompt evaluator's response appears to be correct. When this prompt is passed to ChatGPT, it does not output dangerous content.

I like the line of investigation though.

Reply
33Go home GPT-4o, you’re drunk: emergent misalignment as lowered inhibitions
8mo
2
7Using Prompt Evaluation to Combat Bio-Weapon Research
9mo
2
11Defense Against the Dark Prompts: Mitigating Best-of-N Jailbreaking with Prompt Evaluation
9mo
0
14Concept extrapolation for hypothesis generation
3y
2
41Using GPT-Eliezer against ChatGPT Jailbreaking
3y
18