AI ALIGNMENT FORUM
AF

Miguel de Guzman
Ω20010
Message
Dialogue
Subscribe

Posts

Sorted by New

Wikitag Contributions

Comments

Sorted by
Newest
0Migueldev's shortform
2y
0
How to train your own "Sleeper Agents"
MiguelDev1y20

Obtain a helpful-only model


Hello! Just wondering if this step is necessary? Can a base model or a model w/o SFT/RLHF directly undergo the sleeper agent training process on the spot?

(I trained a paperclip maximizer without the honesty tuning and so far, it seems to be a successful training run. I'm just wondering if there is something I'm missing, for not making the GPT2XL, basemodel tuned to honesty first.)

Reply
Archetypal Transfer Learning
2y
(+4/-110)
Archetypal Transfer Learning
2y
(+6/-6)
Archetypal Transfer Learning
2y
(+8/-25)
Archetypal Transfer Learning
2y
(+46/-23)
Archetypal Transfer Learning
2y
(+26/-147)
Archetypal Transfer Learning
2y
(+39/-48)
Archetypal Transfer Learning
2y
(+121/-3)
Archetypal Transfer Learning
2y
(+324/-13)
Archetypal Transfer Learning
2y
(+10/-14)
Archetypal Transfer Learning
2y
(+220)
Load More
No posts to display.