AI ALIGNMENT FORUM

Mikita Balesni · Karma: 760 · Ω112310

Comments
Comment on "If anyone builds it, everyone will plausibly be fine"
Mikita Balesni · 23d · 21

(Author of the empirical paper linked above.) Nit: your comment is about o3, but the screenshot is of o4-mini. o3 has higher rates of taking covert actions across the board. That said, I mostly agree with the top-level claim that "current models aren't well described as consistently scheming".

Posts

Karma · Title · Age · Comments

57 · Stress Testing Deliberative Alignment for Anti-Scheming Training · 1mo · 10
29 · Lessons from Studying Two-Hop Latent Reasoning · 1mo · 8
80 · Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety · 3mo · 7
18 · How to evaluate control measures for LLM agents? A trajectory from today to superintelligence · 6mo · 0
71 · Claude Sonnet 3.7 (often) knows when it’s in alignment evaluations · 7mo · 1
89 · Frontier Models are Capable of In-context Scheming · 10mo · 9
29 · Toward Safety Cases For AI Scheming · 1y · 0
42 · Apollo Research 1-year update · 1y · 0
14 · How I select alignment research projects · 2y · 4
26 · A starter guide for evals · 2y · 0