AI ALIGNMENT FORUM

Dakara
Comments

By Default, GPTs Think In Plain Sight
Dakara, 8mo

Now that two years have passed, I am quite interested in hearing @Fabien Roger's thoughts on this comment, especially this part: "But how useful could gpt-n be if used in such a way? On the other extreme, gpt-n is producing internal reasoning text at a terabyte/minute. All you can do with it is grep for some suspicious words, or pass it to another AI model. You can't even store it for later unless you have a lot of hard drives. Potentially much more useful. And less safe."

Simple probes can catch sleeper agents
Dakara, 8mo

This paper argues that unintended deceptive behavior cannot be detected by probing methods. According to its authors, probes fare no better than random guessing at detecting such behavior.

I would really appreciate any input, especially from Monte or his co-authors. This seems like a very important issue to address.
