AI ALIGNMENT FORUM


Redwood Research

Edited by Dakara; last updated 30th Dec 2024

Redwood Research is a nonprofit organization focused on mitigating risks from advanced artificial intelligence.

The initial directions of their research agenda include:

  • AI control
  • Evaluations and demonstrations of risk from strategic deception
  • Consulting on risks from misalignment
Posts tagged Redwood Research
192 · Alignment Faking in Large Language Models
Ryan Greenblatt, Evan Hubinger, Carson Denison, Benjamin Wright, Fabien Roger, Monte MacDiarmid, Sam Marks, Johannes Treutlein, Sam Bowman, Buck Shlegeris
6mo
24
115 · The case for ensuring that powerful AIs are controlled
Ryan Greenblatt, Buck Shlegeris
1y
32
103 · Causal Scrubbing: a method for rigorously testing interpretability hypotheses [Redwood Research]
Lawrence Chan, Adrià Garriga-Alonso, Nicholas Goldowsky-Dill, Ryan Greenblatt, Jenny Nitishinskaya, Ansh Radhakrishnan, Buck Shlegeris, Nate Thomas
3y
29
102 · AI Control: Improving Safety Despite Intentional Subversion
Buck Shlegeris, Fabien Roger, Ryan Greenblatt, Kshitij Sachan
2y
5
57 · Takeaways from our robust injury classifier project [Redwood Research]
dmz
3y
7
51 · Benchmarks for Detecting Measurement Tampering [Redwood Research]
Ryan Greenblatt, Fabien Roger
2y
11
57 · Redwood Research’s current project
Buck Shlegeris
4y
18
59 · Apply to the Redwood Research Mechanistic Interpretability Experiment (REMIX), a research program in Berkeley
maxnadeau, Xander Davies, Buck Shlegeris, Nate Thomas
3y
5
61 · Preventing Language Models from hiding their reasoning
Fabien Roger, Ryan Greenblatt
2y
5
62 · Catching AIs red-handed
Ryan Greenblatt, Buck Shlegeris
2y
8
27 · Redwood's Technique-Focused Epistemic Strategy
Adam Shimi
4y
1
10 · AXRP Episode 17 - Training for Very High Reliability with Daniel Ziegler
DanielFilan
3y
0
81 · Will alignment-faking Claude accept a deal to reveal its misalignment?
Ryan Greenblatt, Kyle Fish
5mo
5
86 · How will we update about scheming?
Ryan Greenblatt
5mo
3
65 · High-stakes alignment via adversarial training [Redwood Research report]
dmz, Lawrence Chan, Nate Thomas
3y
15
(Showing 15 of 44 posts)