AI ALIGNMENT FORUM
Anthropic (org)
Contributors: Ruben Bloom, Multicore
Anthropic is an AI organization. Not to be confused with anthropics.
Posts tagged Anthropic (org), sorted by relevance:

- Anthropic's Core Views on AI Safety (Zac Hatfield-Dodds, 1y ago): 67 karma, 13 comments, tag relevance 4
- Why I'm joining Anthropic (Evan Hubinger, 1y ago): 50 karma, 2 comments, tag relevance 1
- Toy Models of Superposition (Evan Hubinger, 2y ago): 33 karma, 2 comments, tag relevance 3
- Concrete Reasons for Hope about AI (Zac Hatfield-Dodds, 1y ago): 34 karma, 0 comments, tag relevance 1
- Transformer Circuits (Evan Hubinger, 2y ago): 67 karma, 3 comments, tag relevance 2
- Towards Monosemanticity: Decomposing Language Models With Dictionary Learning (Zac Hatfield-Dodds, 6mo ago): 110 karma, 11 comments, tag relevance 1
- Introducing Alignment Stress-Testing at Anthropic (Evan Hubinger, 3mo ago): 92 karma, 19 comments, tag relevance 1
- Anthropic's Responsible Scaling Policy & Long-Term Benefit Trust (Zac Hatfield-Dodds, 7mo ago): 47 karma, 5 comments, tag relevance 1
- Dario Amodei’s prepared remarks from the UK AI Safety Summit, on Anthropic’s Responsible Scaling Policy (Zac Hatfield-Dodds, 6mo ago): 46 karma, 0 comments, tag relevance 1
- Paper: The Capacity for Moral Self-Correction in Large Language Models (Anthropic) (Lawrence Chan, 1y ago): 30 karma, 0 comments, tag relevance 1
- How do new models from OpenAI, DeepMind and Anthropic perform on TruthfulQA? (Owain Evans, 2y ago): 18 karma, 1 comment, tag relevance 1
- Frontier Model Security (Matthew "Vaniver" Gray, 9mo ago): 15 karma, 1 comment, tag relevance 1
- A challenge for AGI organizations, and a challenge for readers (Rob Bensinger and Eliezer Yudkowsky, 1y ago): 78 karma, 17 comments, tag relevance -1
- Comparing Anthropic's Dictionary Learning to Ours (Robert_AIZI, 6mo ago): 56 karma, 1 comment, tag relevance 1
- Measuring and Improving the Faithfulness of Model-Generated Reasoning (Ansh Radhakrishnan, tamera, karinanguyen, Sam Bowman, and Ethan Perez, 9mo ago): 58 karma, 12 comments, tag relevance 0