AI ALIGNMENT FORUM
Dakara
Comments

By Default, GPTs Think In Plain Sight
Dakara, 11mo

Now that two years have passed, I would be quite interested to hear @Fabien Roger's thoughts on this comment, especially this part: "But how useful could gpt-n be if used in such a way? On the other extreme, gpt-n is producing internal reasoning text at a terabyte/minute. All you can do with it is grep for some suspicious words, or pass it to another AI model. You can't even store it for later unless you have a lot of hard drives. Potentially much more useful. And less safe."

Simple probes can catch sleeper agents
Dakara, 11mo

This paper argues that unintended deceptive behavior cannot be detected by probing. According to its authors, probes fare no better than random guessing at detecting such behavior.

I would really appreciate any input, especially from Monte or his co-authors. This seems like a very important issue to address.
