Detecting deception refers to techniques for identifying when AI systems are being dishonest or providing false information, particularly via classifiers trained on their internal activations.
Deception detection can be applied in various settings, such as trusted monitoring, eliciting latent knowledge, mitigating sandbagging, increasing honesty, and catching rogue behaviour.
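
To make the idea of a classifier over internal activations concrete, here is a minimal sketch of a logistic-regression "probe". It is an illustration under loud assumptions, not a description of any particular published method: the activations are synthetic Gaussian vectors rather than real model internals, the honest/deceptive labels are made up, and the names `make_synthetic_activations`, `train_probe`, and `probe_score` are hypothetical.

```python
import math
import random


def make_synthetic_activations(n, dim, rng):
    """Hypothetical stand-in for real model activations.

    Label 1 = "deceptive", label 0 = "honest". The two classes differ only
    by a mean shift in every dimension; real activations are far messier.
    """
    data = []
    for _ in range(n):
        label = rng.randint(0, 1)
        shift = 1.0 if label else -1.0
        x = [rng.gauss(shift, 1.0) for _ in range(dim)]
        data.append((x, label))
    return data


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def probe_score(w, b, x):
    """Probability the probe assigns to the 'deceptive' label."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def train_probe(data, dim, lr=0.1, epochs=50):
    """Fit a linear probe with plain stochastic gradient descent."""
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        for x, y in data:
            grad = probe_score(w, b, x) - y  # d(log-loss)/d(logit)
            w = [wi - lr * grad * xi for wi, xi in zip(w, x)]
            b -= lr * grad
    return w, b


rng = random.Random(0)
DIM = 8  # assumed activation width, chosen only for illustration
train = make_synthetic_activations(400, DIM, rng)
test = make_synthetic_activations(200, DIM, rng)

w, b = train_probe(train, DIM)
accuracy = sum(
    (probe_score(w, b, x) > 0.5) == bool(y) for x, y in test
) / len(test)
print(f"held-out accuracy: {accuracy:.2f}")
```

In a real setting, the synthetic vectors would be replaced by activations recorded from a model on examples labelled honest or deceptive, and the probe's score could then feed a monitoring pipeline. High accuracy on this toy data says nothing about how well such probes generalise in practice.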