This post is to announce my new paper Demanding and Designing Aligned Cognitive Architectures, which I recently presented at the PERLS workshop (Political Economy of Reinforcement Learning) at NeurIPS 2021. In this post, I will give a brief overview of the paper, specifically written for this forum and the LW/EA...
In this fifth post in the sequence, I show the construction of a counterfactual planning agent with an input terminal that can be used to iteratively improve the agent's reward function while it runs. The goal is to construct an agent which has no direct incentive to manipulate this improvement...
This is the second post in a sequence. For the introduction post, see here.

Graphical World Models

A world model is a mathematical model of a particular world. This can be our real world, or an imaginary world. To make a mathematical model into a model of a particular world,...
Since the term corrigibility was introduced in 2015, there has been a lot of discussion about corrigibility, on this forum and elsewhere. In this post, I have tried to disentangle the many forms of corrigibility which have been identified and discussed so far. My aim is to offer a general...
In the third post in this sequence, I will define a counterfactual planning agent which has three safety interlocks. These interlocks all aim to enable and support agent oversight: the oversight that is needed when we equip a powerful AGI agent with a reward function for which we are pretty...
Counterfactual planning is a design approach for creating a range of safety mechanisms that can be applied in hypothetical future AI systems which have Artificial General Intelligence. My new paper Counterfactual Planning in AGI Systems introduces this design approach in full. It also constructs several example AGI safety mechanisms. The...
This post is to announce my new paper AGI Agent Safety by Iteratively Improving the Utility Function. I am also using this post to add some extra background information that is not in the paper. Questions and comments are welcome below. From the abstract: > While it is still unclear...