Nevan Wichers

178

Inoculation prompting: Instructing models to misbehave at train-time can improve run-time behavior

by Sam Marks, Nevan Wichers, Daniel Tan, Aram Ebtekar, Jozdien, David Africa, Alex Mallen, and Fabien Roger

This is a link post for two papers that came out today:
* Inoculation Prompting: Eliciting traits from LLMs during training can suppress them at test-time (Tan et al.)
* Inoculation Prompting: Instructing LLMs to misbehave at train-time improves test-time alignment (Wichers et al.)

These papers both study the following...

Oct 8, 2025•176

Visualizing neural network planning

TLDR We develop a technique to try to detect whether a NN is planning internally. We apply the decoder to the network's intermediate representations to see whether it is representing the states it is planning through. We successfully reveal intermediate states in a simple Game of Life model,...

May 9, 2024•4

A Variance Indifferent Maximizer Alternative

TLDR This post explores creating an agent which tries to make a certain number of paperclips without caring about the variance in the number it produces, only the expected value. This may be safer than one which wants to reduce the variance, since humans are a large source of variance...

Feb 13, 2020•7