AI ALIGNMENT FORUM

jurgen123 — AI Alignment Forum

jurgen123

4

2y

Why is o1 so deceptive?

You write: "Doesn't the model know (in the sense that if you asked it, it would tell you) that the user wanted real references as opposed to faked references? Here, I'll ask it:"

I think what we can conclude here is that the model doesn't know what it knows the way we do. It will say it values one thing, but 'forget' that it values that thing when carrying out some other task. When generating a string of tokens, it is not leveraging any 'knowledge' stored in the rest of the neural net. That's what makes these models' knowledge (and alignment, for that matter) so shallow and brittle (and, honestly, dangerous).