AI ALIGNMENT FORUM

Mikita Balesni

Comments

If anyone builds it, everyone will plausibly be fine

Mikita Balesni · 23d · 21

(Author of the empirical paper linked above) Nit: your comment is about o3, but the screenshot is of o4-mini. o3 has higher rates of taking covert actions across the board. That said, I mostly agree with the top-level claim "current models aren't well described as consistently scheming".

57 · Stress Testing Deliberative Alignment for Anti-Scheming Training · 1mo · 10
29 · Lessons from Studying Two-Hop Latent Reasoning · 1mo · 8
80 · Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety · 3mo · 7
18 · How to evaluate control measures for LLM agents? A trajectory from today to superintelligence · 6mo · 0
71 · Claude Sonnet 3.7 (often) knows when it’s in alignment evaluations · 7mo · 1
89 · Frontier Models are Capable of In-context Scheming · 10mo · 9
29 · Toward Safety Cases For AI Scheming · 1y · 0
42 · Apollo Research 1-year update · 1y · 0
14 · How I select alignment research projects · 2y · 4
26 · A starter guide for evals · 2y · 0