This paper is a revised and expanded version of my blog post Plausible cases for HRAD work, and locating the crux in the "realism about rationality" debate, now with David Manheim as co-author. Abstract: Several different approaches exist for ensuring the safety of future Transformative Artificial Intelligence (TAI) or...
This post is part 2 in our sequence on Modeling Transformative AI Risk. We are building a model to understand debates around existential risks from advanced AI. The model is made with Analytica software, and consists of nodes (representing key hypotheses and cruxes) and edges (representing the relationships between these...
Here is a timeline of AI safety that I originally wrote in 2017. The timeline has been updated several times since then, mostly by Vipul Naik. Here are some highlights by year from the timeline since 2013. 2013: Research and outreach focused on forecasting and timelines continue. Connections with the nascent...
This post is my attempt to summarize and distill the major public debates about MIRI's highly reliable agent designs (HRAD) work (which includes work on decision theory), including the discussions in Realism about rationality and Daniel Dewey's My current thoughts on MIRI's "highly reliable agent design" work. Part of the...
When I first started learning about IDA, I thought that agents trained using IDA would be human-level after the first stage, i.e. that Distill(H) would be human-level. As I've written about before, Paul later clarified this, so my new understanding is that after the first stage, the distilled agent will...
I am interested in having my own opinion about more of the key disagreements within the AI alignment field, such as whether there is a basin of attraction for corrigibility, whether there is a theory of rationality that is sufficiently precise to build hierarchies of abstraction, and to what extent...