This is the abstract and introduction of our new paper, with some discussion of implications for AI Safety at the end. Authors: Jan Betley*, Xuchan Bao*, Martín Soto*, Anna Sztyber-Betley, James Chua, Owain Evans (*Equal Contribution). Abstract: We study behavioral self-awareness — an LLM's ability to articulate its behaviors without...
This essay was part of my application to UKAISI. Unclear how much is novel, but I like the decomposition. Some AI-enabled catastrophic risks involve an actor purposefully bringing about the negative outcome, as opposed to multi-agent tragedies where no single actor necessarily desired it. Such purposeful catastrophes can be broken...
TL;DR: Let’s start iterating on experiments that approximate real, society-scale multi-AI deployment. Epistemic status: These ideas seem like my most prominent delta with the average AI Safety researcher, have stood the test of time, and are shared by others I intellectually respect. Please attack them fiercely! Multi-polar risks: Some authors...
In the last post I presented the basic, bare-bones model, used to assess the Expected Value of different interventions, especially those related to Cooperative AI (as distinct from value Alignment). Here I briefly discuss important enhancements, and our strategy with regard to all-things-considered estimates. I describe first an easy...
Interventions that increase the probability of Aligned AGI aren't the only kind of AGI-related work that could importantly increase the Expected Value of the future. Here I present a very basic quantitative model (which you can run yourself here) to start thinking about these issues. In a follow-up post I...
Since beliefs about Evidential Correlations don't track any direct ground truth, it's not obvious how to resolve disagreements about them, which is very relevant to acausal trade. Here I present what seems like the only natural method (Third solution below). Ideas partly generated with Johannes Treutlein. Say two agents (algorithms...
I explain (in layman's terms) a realization that might make acausal trade hard or impossible in practice. Summary: We know that if players believe different Evidential Correlations, they might miscoordinate. But clearly they will eventually learn to have the correct Evidential Correlations, right? Not necessarily, because there is no objective...