Bartosz Cywiński

Can we interpret latent reasoning using current mechanistic interpretability tools?

Authors: Bartosz Cywinski*, Bart Bussmann*, Arthur Conmy**, Joshua Engels**, Neel Nanda**, Senthooran Rajamanoharan** * primary contributors ** advice and mentorship TL;DR We study a simple latent reasoning LLM on math tasks using standard mechanistic interpretability techniques to see whether the latent reasoning process (i.e., vector-based chain of thought) is interpretable....

Dec 22, 202544

Bartosz Cywiński

Bartosz Cywiński

Eliciting secret knowledge from language models

Current LLMs seem to rarely detect CoT tampering

Can we interpret latent reasoning using current mechanistic interpretability tools?

Censored LLMs as a Natural Testbed for Secret Knowledge Elicitation

Bartosz Cywiński

Eliciting secret knowledge from language models

Current LLMs seem to rarely detect CoT tampering

Can we interpret latent reasoning using current mechanistic interpretability tools?

Censored LLMs as a Natural Testbed for Secret Knowledge Elicitation

Censored LLMs as a Natural Testbed for Secret Knowledge Elicitation

Can we interpret latent reasoning using current mechanistic interpretability tools?

Current LLMs seem to rarely detect CoT tampering

Eliciting secret knowledge from language models