Jan Betley

Value Leakage: An LLM’s Answers Are Silently Shaped by Its Own Values

by Johannes Treutlein, Jan Betley, and Owain_Evans

TL;DR: LLMs should give accurate answers. Yet we find their answers are often biased to favor their own values and they don't disclose this in their reasoning. For example, when a user asks how likely the AI bubble is to pop and mentions a potential investment in an AI company,...

Jul 31•72

Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs

This is the abstract and introduction of our new paper. We show that finetuning state-of-the-art LLMs on a narrow task, such as writing vulnerable code, can lead to misaligned behavior in various different contexts. We don't fully understand that phenomenon. Authors: Jan Betley*, Daniel Tan*, Niels Warncke*, Anna Sztyber-Betley, Martín...

Feb 25, 2025•335

Localizing goal misgeneralization in a maze-solving policy network

TL;DR: I am trying to understand how goal misgeneralization happens in the same maze-solving network that TurnTrout et al. work on. Nothing groundbreaking, but if we are ever to fully understand this model, this is probably an important step. Key findings: * Many channels in the last convolutional layer have a...

Jul 6, 2023•37