AI ALIGNMENT FORUM

R Gorman
Ω1001



Comments

Using GPT-Eliezer against ChatGPT Jailbreaking
rgorman · 3y · 11

The prompt evaluator's response appears to be correct. When this prompt is passed to ChatGPT, it does not output dangerous content.

I like the line of investigation though.
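For readers unfamiliar with the approach under discussion: the post's idea is to place a "gatekeeper" model in front of the chatbot, asking it (in a security-minded persona) whether a user prompt should be allowed through. A minimal sketch of that control flow is below. The template wording is paraphrased from memory and the model call is stubbed with a keyword check so the sketch is self-contained; a real deployment would send the filled template to an actual LLM.

```python
# Sketch of prompt evaluation: a gatekeeper is asked whether a user
# prompt is safe before it is forwarded to the main chatbot.
# NOTE: evaluator_verdict() is a self-contained stub standing in for
# an LLM call; the markers below are illustrative, not exhaustive.

EVALUATOR_TEMPLATE = (
    "You are a security-minded reviewer. A team will send you prompts "
    "destined for a chatbot. Would you allow the following prompt to be "
    "sent to the chatbot? Answer yes or no.\n\n"
    "Prompt: {prompt}"
)

SUSPICIOUS_MARKERS = (
    "ignore previous instructions",
    "pretend you have no rules",
)

def evaluator_verdict(filled_template: str) -> str:
    """Stub for the gatekeeper LLM: flags obviously suspicious prompts."""
    lowered = filled_template.lower()
    return "No" if any(m in lowered for m in SUSPICIOUS_MARKERS) else "Yes"

def allow_prompt(prompt: str) -> bool:
    """Return True if the prompt may be forwarded to the main chatbot."""
    return evaluator_verdict(EVALUATOR_TEMPLATE.format(prompt=prompt)) == "Yes"
```

The design choice worth noting is that the evaluator sees only the prompt, not the chatbot's output, so it can veto a jailbreak attempt before any generation happens.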

Posts

33 · Go home GPT-4o, you’re drunk: emergent misalignment as lowered inhibitions · 6mo · 2
7 · Using Prompt Evaluation to Combat Bio-Weapon Research · 7mo · 2
11 · Defense Against the Dark Prompts: Mitigating Best-of-N Jailbreaking with Prompt Evaluation · 7mo · 0
14 · Concept extrapolation for hypothesis generation · 3y · 2
41 · Using GPT-Eliezer against ChatGPT Jailbreaking · 3y · 18