AI ALIGNMENT FORUM
AF

Detecting deception

Edited by Cleo Nardo, last updated 22nd Jul 2025

Detecting deception refers to techniques for identifying when AI systems are being dishonest or providing false information, particularly classifiers trained on their internal activations.
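As a rough illustration of what a learned classifier over internal activations can look like, here is a minimal sketch of a linear probe trained with logistic regression. Everything in it is an assumption made for illustration: the activations are synthetic Gaussian vectors rather than outputs of a real model, the honest/deceptive labels are generated at random, and the function names (`make_synthetic_activations`, `train_probe`, `probe_scores`) are hypothetical. It is not a description of any particular published method.

```python
import numpy as np


def make_synthetic_activations(n=400, dim=16, seed=0):
    """Stand-in for real model activations (assumption: synthetic data only).

    Examples labelled 1 ("deceptive") are shifted along one fixed direction,
    so the two classes are approximately linearly separable.
    """
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, 2, size=n)
    direction = rng.normal(size=dim)
    direction /= np.linalg.norm(direction)
    acts = rng.normal(size=(n, dim)) + 3.0 * labels[:, None] * direction
    return acts, labels


def train_probe(acts, labels, lr=0.1, steps=500):
    """Fit a logistic-regression probe with plain gradient descent."""
    n, dim = acts.shape
    w = np.zeros(dim)
    b = 0.0
    for _ in range(steps):
        probs = 1.0 / (1.0 + np.exp(-(acts @ w + b)))
        grad = probs - labels  # gradient of the log loss w.r.t. the logits
        w -= lr * (acts.T @ grad) / n
        b -= lr * grad.mean()
    return w, b


def probe_scores(acts, w, b):
    """Return the probe's estimated probability that each example is deceptive."""
    return 1.0 / (1.0 + np.exp(-(acts @ w + b)))


acts, labels = make_synthetic_activations()
train_x, test_x = acts[:300], acts[300:]
train_y, test_y = labels[:300], labels[300:]

w, b = train_probe(train_x, train_y)
accuracy = ((probe_scores(test_x, w, b) > 0.5) == test_y).mean()
print(f"held-out probe accuracy: {accuracy:.2f}")
```

In a real setting the synthetic vectors would be replaced by activations recorded from a model on examples with known honest or deceptive labels, and the probe would be evaluated on held-out examples, since a probe can fit the training distribution without generalising.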

Deception detection can be applied in various settings, such as trusted monitoring, eliciting latent knowledge, mitigating sandbagging, increasing honesty, and catching rogue behaviour.
