AI ALIGNMENT FORUM

Ilya Nachevsky — AI Alignment Forum

Ilya Nachevsky

4

1y

Steganography via internal activations is already possible in small language models — a potential first step toward persistent hidden reasoning.

by Ilia Shirokov and Ilya Nachevsky

Introduction. In this post we continue the project, started in our previous post, of exploring hidden reasoning in large language models (LLMs). We ask whether an LLM, when given both a “general” and a “secret” question, can generate its public answer to the general question while simultaneously...

Aug 9, 2025•7

Sleep peacefully: no hidden reasoning detected in LLMs. Well, at least in small ones.

by Ilia Shirokov and Ilya Nachevsky

This work was done in collaboration with Ilya Nachevsky.[1] Introduction. With the advent of r1-type models, the question of whether a model's “thoughts” are truly genuine is becoming extremely interesting and important. In this post we present our approach to tackling the following problem: whether the visible chain-of-thought produced by...

Apr 4, 2025•17