AI ALIGNMENT FORUM
1170
Teun van der Weij
Research scientist at Apollo Research.
Posts
Teun van der Weij's Shortform (7mo; karma 2; 0 comments)
Stress Testing Deliberative Alignment for Anti-Scheming Training (1mo; karma 57; 10 comments)
How to mitigate sandbagging (7mo; karma 12; 0 comments)
The Elicitation Game: Evaluating capability elicitation techniques (8mo; karma 4; 1 comment)
[Paper] AI Sandbagging: Language Models can Strategically Underperform on Evaluations (1y; karma 35; 0 comments)
An Introduction to AI Sandbagging (1y; karma 19; 0 comments)
Simple distribution approximation: When sampled 100 times, can language models yield 80% A and 20% B? (2y; karma 21; 0 comments)
Comments