AI ALIGNMENT FORUM

Felix Hofstätter (82 · Ω18100)

Comments

Anthropic's Core Views on AI Safety
Felix Hofstätter · 3y

Thank you for this post. It looks like the people at Anthropic have put a lot of thought into this which is good to see.

You mention that there are often surprising qualitative differences between larger and smaller models. How seriously is Anthropic considering a scenario where there is a sudden jump in certain dangerous capabilities (in particular deception) at some level of model intelligence? Does it seem plausible that it might not be possible to foresee this jump from experiments on even slightly weaker models?

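To make the worry concrete, here is a minimal toy sketch (my own illustration with made-up numbers, not anything from Anthropic's post): if measured task success is a sharp threshold of a latent skill that grows smoothly with scale, then a trend fitted to slightly weaker, sub-threshold models predicts almost no change right before the jump.

```python
# Toy illustration (assumed numbers): a smooth latent skill plus a sharp
# success threshold makes the measured capability look "emergent", and a
# naive trend fit on weaker models gives little warning of the jump.

import numpy as np

# Hypothetical model scales (log10 parameters) and a smoothly growing latent skill.
scales = np.linspace(7, 11, 9)        # "7" ~ 10M params ... "11" ~ 100B params
latent_skill = 0.5 * (scales - 7.0)   # grows linearly with log-scale

# Observed task success is a sharp sigmoid of the latent skill.
def task_success(skill, threshold=1.6, sharpness=12.0):
    return 1.0 / (1.0 + np.exp(-sharpness * (skill - threshold)))

observed = task_success(latent_skill)

# Fit a naive linear trend using only the smaller (sub-threshold) models.
small = scales < 10.0
coeffs = np.polyfit(scales[small], observed[small], deg=1)
predicted_at_largest = np.polyval(coeffs, scales[-1])

print(f"observed success at largest scale:      {observed[-1]:.2f}")
print(f"trend extrapolated from smaller models: {predicted_at_largest:.2f}")
# Typical output: observed ~0.99 vs. extrapolated ~0.01 — the smooth trend on
# weaker models badly underpredicts the jump in the measured capability.
```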

Posts

Stress Testing Deliberative Alignment for Anti-Scheming Training · 57 karma · 1mo · 10 comments
The Elicitation Game: Evaluating capability elicitation techniques · 4 karma · 8mo · 1 comment
[Paper] AI Sandbagging: Language Models can Strategically Underperform on Evaluations · 35 karma · 1y · 0 comments
An Introduction to AI Sandbagging · 19 karma · 1y · 0 comments
Simple distribution approximation: When sampled 100 times, can language models yield 80% A and 20% B? · 21 karma · 2y · 0 comments
Tall Tales at Different Scales: Evaluating Scaling Trends For Deception In Language Models · 18 karma · 2y · 0 comments