AI ALIGNMENT FORUM

1170
Teun van der Weij
Ω91400

Research scientist at Apollo Research.

Posts

Teun van der Weij's Shortform (2 karma, 7 months ago, 0 comments)
Stress Testing Deliberative Alignment for Anti-Scheming Training (57 karma, 1 month ago, 10 comments)
How to mitigate sandbagging (12 karma, 7 months ago, 0 comments)
The Elicitation Game: Evaluating capability elicitation techniques (4 karma, 8 months ago, 1 comment)
[Paper] AI Sandbagging: Language Models can Strategically Underperform on Evaluations (35 karma, 1 year ago, 0 comments)
An Introduction to AI Sandbagging (19 karma, 1 year ago, 0 comments)
Simple distribution approximation: When sampled 100 times, can language models yield 80% A and 20% B? (21 karma, 2 years ago, 0 comments)

Wikitag Contributions

No wikitag contributions to display.

Comments

No comments found.