AI ALIGNMENT FORUM
Miguel de Guzman

→ help avoid catastrophic AI failures...

Wiki Contributions

Archetypal Transfer Learning

1y

(+4/-110)

Archetypal Transfer Learning

1y

(+6/-6)

Archetypal Transfer Learning

1y

(+8/-25)

Archetypal Transfer Learning

1y

(+46/-23)

Archetypal Transfer Learning

1y

(+26/-147)

Archetypal Transfer Learning

1y

(+39/-48)

Archetypal Transfer Learning

1y

(+121/-3)

Archetypal Transfer Learning

1y

(+324/-13)

Archetypal Transfer Learning

1y

(+10/-14)

Archetypal Transfer Learning

1y

(+220)

Comments

How to train your own "Sleeper Agents"

Miguel de Guzman 2mo 20

"Obtain a helpful-only model"

Hello! Is this step necessary? Can a base model, or any model without SFT/RLHF, go through the sleeper agent training process directly?

(I trained a paperclip maximizer without the honesty tuning, and so far the training run seems to have been successful. I'm wondering whether I'm missing something by not first tuning the GPT-2 XL base model for honesty.)