AI ALIGNMENT FORUM

aggliu

Author, YouTuber, Script Writer for Rational Animations.  A.B. in Math (Harvard 2020)

Comments

Sorted by
Newest
No posts to display.
No wikitag contributions to display.
Self-fulfilling misalignment data might be poisoning our AI models
aggliu, 7mo

I am a bit worried that giving the AI an explicit persona (e.g. via a special token) could magnify the Waluigi effect. If something (like a jailbreak or writing evil numbers) pushes the AI to act as an "anti-𐀤", then we get all the bad behaviors at once in a single package. This might not outweigh the value of having the token in the first place, or it may turn out experimentally to be a negligible effect, but it seems like a failure mode to watch out for.
