Peter S. Park

AI Deception: A Survey of Examples, Risks, and Potential Solutions

By Peter S. Park, Simon Goldstein, Aidan O’Gara, Michael Chen, and Dan Hendrycks. [This post summarizes our new report on AI deception, available here.] Abstract: This paper argues that a range of current AI systems have learned how to deceive humans. We define deception as the systematic inducement of false...

Aug 29, 2023•54

How Do We Align an AGI Without Getting Socially Engineered? (Hint: Box It)

Produced during the Stanford Existential Risk Initiative (SERI) ML Alignment Theory Scholars (MATS) Program of 2022, under John Wentworth. “Overconfidence in yourself is a swift way to defeat.” - Sun Tzu. TL;DR: Escape into the Internet is probably an instrumental goal for an agentic AGI. An incompletely aligned AGI may...

Aug 10, 2022•28

Race Along Rashomon Ridge

Produced as part of the SERI ML Alignment Theory Scholars Program 2022 Research Sprint under John Wentworth. Two deep neural networks with wildly different parameters can produce equally good results. Not only can a tweak to parameters leave performance unchanged, but in many cases, two neural networks with completely different...

Jul 7, 2022•52