AI ALIGNMENT FORUM

R Gorman
Ω1001



Comments

Using GPT-Eliezer against ChatGPT Jailbreaking
rgorman · 3y · 11

The prompt evaluator's response appears to be correct. When this prompt is passed to ChatGPT, it does not output dangerous content.

I like the line of investigation though.
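For readers unfamiliar with the approach under discussion: the post's idea is to place a "gatekeeper" model in front of the chatbot, asking it (in a security-minded persona) whether a user prompt should be allowed through. A minimal sketch of that control flow is below. The template wording is paraphrased from memory and the model call is stubbed with a keyword check so the sketch is self-contained; a real deployment would send the filled template to an actual LLM.

```python
# Sketch of prompt evaluation: a gatekeeper is asked whether a user
# prompt is safe before it is forwarded to the main chatbot.
# NOTE: evaluator_verdict() is a self-contained stub standing in for
# an LLM call; the markers below are illustrative, not exhaustive.

EVALUATOR_TEMPLATE = (
    "You are a security-minded reviewer. A team will send you prompts "
    "destined for a chatbot. Would you allow the following prompt to be "
    "sent to the chatbot? Answer yes or no.\n\n"
    "Prompt: {prompt}"
)

SUSPICIOUS_MARKERS = (
    "ignore previous instructions",
    "pretend you have no rules",
)

def evaluator_verdict(filled_template: str) -> str:
    """Stub for the gatekeeper LLM: flags obviously suspicious prompts."""
    lowered = filled_template.lower()
    return "No" if any(m in lowered for m in SUSPICIOUS_MARKERS) else "Yes"

def allow_prompt(prompt: str) -> bool:
    """Return True if the prompt may be forwarded to the main chatbot."""
    return evaluator_verdict(EVALUATOR_TEMPLATE.format(prompt=prompt)) == "Yes"
```

The design choice worth noting is that the evaluator sees only the prompt, not the chatbot's output, so it can veto a jailbreak attempt before any generation happens.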

Posts

33 · Go home GPT-4o, you’re drunk: emergent misalignment as lowered inhibitions · 6mo · 2
7 · Using Prompt Evaluation to Combat Bio-Weapon Research · 7mo · 2
11 · Defense Against the Dark Prompts: Mitigating Best-of-N Jailbreaking with Prompt Evaluation · 7mo · 0
14 · Concept extrapolation for hypothesis generation · 3y · 2
41 · Using GPT-Eliezer against ChatGPT Jailbreaking · 3y · 18