Deception

• Applied to Deceptive failures short of full catastrophe. by Alex Lawsen at 20d
• Applied to The commercial incentive to intentionally train AI to deceive us by Derek M. Jones at 1mo
• Applied to Getting up to Speed on the Speed Prior in 2022 by robertzk at 1mo
• Applied to Monitoring for deceptive alignment by Noosphere89 at 5mo
• Applied to How likely is deceptive alignment? by Raymond Arnold at 5mo
• Applied to Three scenarios of pseudo-alignment by Eleni Angelou at 5mo
• Applied to Deception as the optimal: mesa-optimizers and inner alignment by Eleni Angelou at 6mo
• Applied to Precursor checking for deceptive alignment by Raymond Arnold at 6mo
• Applied to Modelling Deception by RobertM at 7mo
• Applied to Deception?! I ain’t got time for that! by Paul Colognese at 7mo
• Applied to Latent Adversarial Training by Adam Jermyn at 7mo
• Applied to Training Trace Priors and Speed Priors by Adam Jermyn at 7mo
• Applied to Formalizing Deception by JamesH at 7mo
• Applied to Conditioning Generative Models by Adam Jermyn at 7mo
• Applied to Multigate Priors by Adam Jermyn at 8mo
• Applied to Training Trace Priors by Adam Jermyn at 8mo
• Applied to Gracefully correcting uncalibrated shame by HS2021 at 8mo
• Applied to Why I'm Worried About AI by Peter Barnett at 8mo
• Applied to The Speed + Simplicity Prior is probably anti-deceptive by Ruben Bloom at 9mo