AI ALIGNMENT FORUM

Felix Hofstätter

Wikitag Contributions

No wikitag contributions to display.

Comments

Anthropic's Core Views on AI Safety
Felix Hofstätter · 3y

Thank you for this post. It looks like the people at Anthropic have put a lot of thought into this, which is good to see.

You mention that there are often surprising qualitative differences between larger and smaller models. How seriously is Anthropic considering a scenario where there is a sudden jump in certain dangerous capabilities (in particular deception) at some level of model intelligence? Does it seem plausible that it might not be possible to foresee this jump from experiments on even slightly weaker models?

Posts

4 karma · The Elicitation Game: Evaluating capability elicitation techniques · 7mo · 1 comment
35 karma · [Paper] AI Sandbagging: Language Models can Strategically Underperform on Evaluations · 1y · 0 comments
19 karma · An Introduction to AI Sandbagging · 1y · 0 comments
21 karma · Simple distribution approximation: When sampled 100 times, can language models yield 80% A and 20% B? · 2y · 0 comments
18 karma · Tall Tales at Different Scales: Evaluating Scaling Trends For Deception In Language Models · 2y · 0 comments