MATS Program — AI Alignment Forum
MATS Program
Edited by Multicore and Ryan Kidd; last updated 30th Dec 2024.
You are viewing revision 1.3.0, last edited by Ryan Kidd.
The ML Alignment & Theory Scholars program. https://www.matsprogram.org/
Posts tagged MATS Program (sorted by Most Relevant):

- SolidGoldMagikarp (plus, prompt generation) by Jessica Rumbelow and mwatkins (3y · 134 karma · 17 comments · tag relevance 3)
- SERI MATS Program - Winter 2022 Cohort by Ryan Kidd, Victor Warlop, and Christian Smith (3y · 32 karma · 0 comments · tag relevance 4)
- Understanding and controlling a maze-solving policy network by TurnTrout, peligrietzer, Ulisse Mini, Monte M, and David Udell (3y · 140 karma · 23 comments · tag relevance 5)
- Soft optimization makes the value target bigger by Jeremy Gillen (3y · 43 karma · 4 comments · tag relevance 3)
- SERI ML Alignment Theory Scholars Program 2022 by Ryan Kidd, Victor Warlop, and ozhang (4y · 25 karma · 0 comments · tag relevance 3)
- Finite Factored Sets in Pictures by Magdalena Wache (3y · 56 karma · 2 comments · tag relevance 2)
- Recontextualization Mitigates Specification Gaming Without Modifying the Specification by ariana_azarbal, Victor Gillioz, TurnTrout, and cloud (2mo · 52 karma · 0 comments · tag relevance 2)
- Predictions for shard theory mechanistic interpretability results by TurnTrout, Ulisse Mini, and peligrietzer (3y · 47 karma · 6 comments · tag relevance 3)
- Modulating sycophancy in an RLHF model via activation steering by Nina Panickssery (2y · 33 karma · 19 comments · tag relevance 1)
- Infra-Bayesian haggling by hannagabor (2y · 17 karma · 0 comments · tag relevance 2)
- Normative vs Descriptive Models of Agency by mattmacdermott (3y · 14 karma · 2 comments · tag relevance 2)
- Steering GPT-2-XL by adding an activation vector by TurnTrout, Monte M, David Udell, lisathiergart, and Ulisse Mini (3y · 121 karma · 63 comments · tag relevance 1)
- Transformers Represent Belief State Geometry in their Residual Stream by Adam Shai (2y · 145 karma · 4 comments · tag relevance 1)
- Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs by Jan Betley and Owain_Evans (9mo · 111 karma · 1 comment · tag relevance 1)
- Refusal in LLMs is mediated by a single direction by Andy Arditi, Oscar Obeso, Aaquib111, wesg, and Neel Nanda (2y · 77 karma · 44 comments · tag relevance 1)