Top postsTop post
Jan Betley
1395
Ω
130
13
104
This is the abstract and introduction of our new paper. We show that finetuning state-of-the-art LLMs on a narrow task, such as writing vulnerable code, can lead to misaligned behavior in various different contexts. We don't fully understand that phenomenon. Authors: Jan Betley*, Daniel Tan*, Niels Warncke*, Anna Sztyber-Betley, Martín...
TLDR: I am trying to understand how goal misgeneralization happens in the same maze-solving network TurnTrout et al. work on. Nothing groundbreaking, but if we are ever to fully understand this model, this is probably an important step. Key findings: * Many channels in the last convolutional layer have a...