AI ALIGNMENT FORUM
AF

2213
Igor Ivanov
Ω20000
Message
Dialogue
Subscribe

Posts

Sorted by New

Wikitag Contributions

Comments

Sorted by
Newest
No wikitag contributions to display.
Why it's hard to make settings for high-stakes control research
Igor Ivanov3mo20

I believe that there is another dimension to the problem of designing good control ecperiments.

I'm doing some research on LLMs being aware when they are inside evals, and it seems like frontier models are quite good at recognizing it for the alignment evals I could get a hand on. And this affects the model behavior a lot! In the Anthropic blackmail experiment models blackmailed much less when they mentioned that they are inside evals compared to the cases when they didn't think they are inside evals.

I've tried to make evaluation scenarios more realistic, but it's a challenge. Models are quite good at "smelling" being inside an eval no matter how hard I try, often the mere fact that they are given some unusual task that requires a covert mislaigned action makes them believe that they are being evaluated.

Reply
2Igor Ivanov's Shortform
3mo
0
8Comparative Analysis of Black Box Methods for Detecting Evaluation Awareness in LLMs
1mo
0
12LLMs are Capable of Misaligned Behavior Under Explicit Prohibition and Surveillance
4mo
1