AI ALIGNMENT FORUM

Mikita Balesni · Karma: 760 · Ω112310

Comments
Comment on "If anyone builds it, everyone will plausibly be fine"
Mikita Balesni · 23d · 21

(Author of the empirical paper linked above.) Nit: your comment is about o3, but the screenshot is of o4-mini. o3 has higher rates of taking covert actions across the board. That said, I mostly agree with the top-level claim that "current models aren't well described as consistently scheming".

Posts

Karma · Title · Age · Comments

57 · Stress Testing Deliberative Alignment for Anti-Scheming Training · 1mo · 10
29 · Lessons from Studying Two-Hop Latent Reasoning · 1mo · 8
80 · Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety · 3mo · 7
18 · How to evaluate control measures for LLM agents? A trajectory from today to superintelligence · 6mo · 0
71 · Claude Sonnet 3.7 (often) knows when it’s in alignment evaluations · 7mo · 1
89 · Frontier Models are Capable of In-context Scheming · 10mo · 9
29 · Toward Safety Cases For AI Scheming · 1y · 0
42 · Apollo Research 1-year update · 1y · 0
14 · How I select alignment research projects · 2y · 4
26 · A starter guide for evals · 2y · 0