AI ALIGNMENT FORUM
AF

13
jurgen123
000
Message
Dialogue
Subscribe

Posts

Sorted by New

Wikitag Contributions

Comments

Sorted by
Newest
No posts to display.
No wikitag contributions to display.
Why is o1 so deceptive?
jurgen1231y23

You write: "Doesn't the model know (in the sense that if you asked it, it would tell you) that the user wanted real references as opposed to faked references? Here, I'll ask it:"

I think what we can conclude here that the model doesn't know what it knows, like we do. It will say it values one thing, but 'forget' that it values the thing when exercising some other task. When executing a string of tokens, it is not leveraging any 'knowledge' stored in the rest of the neural net. That's what makes their knowledge (and alignment for that matter) so shallow and brittle (and honestly dangerous).

Reply