Andrew Mack — AI Alignment Forum

Deep Causal Transcoding: A Framework for Mechanistically Eliciting Latent Behaviors in Language Models

Based on research performed in the MATS 5.1 extension program, under the mentorship of Alex Turner (TurnTrout). Research supported by a grant from the Long-Term Future Fund. TL;DR: I introduce a new framework for mechanistically eliciting latent behaviors in LLMs. In particular, I propose deep causal transcoding - modelling the...

Dec 3, 2024•109

Mechanistically Eliciting Latent Behaviors in Language Models

Produced as part of the MATS Winter 2024 program, under the mentorship of Alex Turner (TurnTrout). TL;DR: I introduce a method for eliciting latent behaviors in language models by learning unsupervised perturbations of an early layer of an LLM. These perturbations are trained to maximize changes in downstream activations. The...

Apr 30, 2024•225