Update February 21st: After the initial publication of this article (January 3rd) we received a lot of feedback and several people pointed out that propositions 1 and 2 were incorrect as stated. That was unfortunate as it distracted from the broader arguments in the article and I (Jan K) take...
Meta: Over the past few months, we've held a seminar series on the Simulators theory by janus. As the theory is actively under development, the purpose of the series is to discover central structures and open problems. Our aim with this sequence is to share some of our discussions with...
In March 22nd, 2022, we released a survey with an accompanying post for the purpose of getting more insight into what tools we could build to augment alignment researchers and accelerate alignment research. Since then, we’ve also released a dataset, a manuscript (LW post), and the (relevant) Simulators post was...
TL;DR: In this project, we collected and cataloged AI alignment research literature and analyzed the resulting dataset in an unbiased way to identify major research directions. We found that the field is growing quickly, with several subfields emerging in parallel. We looked at the subfields and identified the prominent researchers,...
Meta: After a fun little motivating section, this post goes deep into the mathematical weeds. This thing aims to explore the mathematical properties of adversarial attacks from first principles. Perhaps other people are not as confused about this point as I was, but hopefully, the arguments are still useful and/or...
TL;DR: I got nerd-sniped into working through some rather technical work in AI Safety. Here's my best guess of what is going on. Imprecise probabilities for handling catastrophic downside risk. Short summary: I apply the updating equation from Infra-Bayesianism to a concrete example of an Infradistribution and illustrate the process....