AI ALIGNMENT FORUM
Detecting deception

Edited by Cleo Nardo, last updated 22nd Jul 2025

Detecting deception refers to techniques for identifying when AIs are being dishonest or providing false information, particularly via classifiers trained on their internal activations.
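As a rough illustration of what a classifier over internal activations can look like, the sketch below fits a logistic-regression probe with plain NumPy. Everything in it is an assumption made for illustration: the "activations" are synthetic Gaussian vectors standing in for hidden states, the honest/deceptive labels are invented, and the function names (`make_synthetic_activations`, `train_linear_probe`, `probe_scores`) are hypothetical. It is not a description of any particular published method.

```python
import numpy as np


def make_synthetic_activations(n_per_class=200, dim=16, shift=2.0, seed=0):
    """Hypothetical stand-in for real activations: two Gaussian clusters
    separated along one random unit direction (0 = honest, 1 = deceptive)."""
    rng = np.random.default_rng(seed)
    direction = rng.normal(size=dim)
    direction /= np.linalg.norm(direction)
    honest = rng.normal(size=(n_per_class, dim)) - shift * direction
    deceptive = rng.normal(size=(n_per_class, dim)) + shift * direction
    X = np.vstack([honest, deceptive])
    y = np.concatenate([np.zeros(n_per_class), np.ones(n_per_class)])
    return X, y


def train_linear_probe(X, y, lr=0.1, steps=500, l2=1e-3):
    """Fit logistic regression by full-batch gradient descent."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(steps):
        probs = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        # Gradient of the mean cross-entropy loss plus an L2 penalty on w.
        grad_w = X.T @ (probs - y) / len(y) + l2 * w
        grad_b = np.mean(probs - y)
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


def probe_scores(X, w, b):
    """Probability the probe assigns to the 'deceptive' label."""
    return 1.0 / (1.0 + np.exp(-(X @ w + b)))


# Train on 80% of the synthetic data and evaluate on the held-out 20%.
X, y = make_synthetic_activations()
perm = np.random.default_rng(1).permutation(len(y))
split = int(0.8 * len(y))
train_idx, test_idx = perm[:split], perm[split:]

w, b = train_linear_probe(X[train_idx], y[train_idx])
test_acc = np.mean((probe_scores(X[test_idx], w, b) > 0.5) == y[test_idx])
print(f"held-out probe accuracy: {test_acc:.2f}")
```

In a real setting the rows of `X` would be activations extracted from a model on examples labelled honest or deceptive, and the held-out accuracy on synthetic clusters says nothing about how well such a probe generalises to actual deceptive behaviour.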

Deception detection can be applied in various settings, such as trusted monitoring, eliciting latent knowledge, mitigating sandbagging, increasing honesty, and catching rogue behaviour.

See Also

  • AI control
  • Interpretability (ML & AI)
  • Eliciting Latent Knowledge