Most of the work on inner alignment so far has been informal or semi-formal (with the notable exception of a little work on minimal circuits). I feel this has resulted in some misconceptions about the problem. I want to write up a large document clearly defining the formal problem and detailing some formal directions for research. Here, I outline my intentions, inviting the reader to provide feedback and point me to any formal work or areas of potential formal work which should be covered in such a document. (Feel free to do that last one without reading further, if you are time-constrained!)
Risks from Learned Optimization (henceforth, RLO) offered semi-formal definitions of important terms, and provided an excellent introduction to the area for a...
I've felt like the problem of counterfactuals is "mostly settled" (modulo some math working out) for about a year, but I don't think I've really communicated this online. Partly, I've been waiting to write up more formal results. But other research has taken up most of my time, so I'm not sure when I would get to it.
So, the following contains some "shovel-ready" problems. If you're convinced by my overall perspective, you may be interested in pursuing some of them. I think these directions have a high chance of basically solving the problem of counterfactuals (including logical counterfactuals).
Another reason for posting this rough write-up is to get feedback: am I missing the mark? Is this not what counterfactual reasoning is about? Can you illustrate remaining problems with...
I'll be running an Ask Me Anything on this post from Friday (April 30) to Saturday (May 1).
If you want to ask something, just post a top-level comment; I'll spend at least a day answering questions.
You can find some background about me here.
This is the first post in a sequence on Cartesian frames, a new way of modeling agency that has recently shaped my thinking a lot.
Traditional models of agency have some problems, like:
Cartesian frames are a way to add a first-person perspective (with choices, uncertainty, etc.) on top of a third-person "here is the set of all possible worlds" picture, in such a way that many of these problems either disappear or become easier to address.
The idea of Cartesian frames is that we take as our basic building block...
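As a rough sketch of the formal object (hedged: this is my shorthand for the definition the sequence develops in full, and the notation here is mine):

```latex
% A Cartesian frame over a set of possible worlds $W$ is a triple
C = (A, E, \cdot), \qquad \cdot : A \times E \to W
```

Here $A$ is a set of possible agents (the options available from the first-person perspective), $E$ is a set of possible environments, and $a \cdot e \in W$ is the world that results when agent $a$ is paired with environment $e$.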
Financial status: This is independent research. I welcome financial support to make further posts like this possible.
Epistemic status: I have been thinking about these ideas for years but still have not clarified them to my satisfaction.
This post asks whether it is possible, in Conway’s Game of Life, to arrange for a certain game state to arise after a certain number of steps, given control only of a small region of the initial game state.
This question is then connected to questions of agency and AI, since one way to answer it in the affirmative is to construct an AI within Conway’s Game of Life.
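For concreteness, the dynamics being controlled are just the standard Life update rule. A minimal sketch (the sparse set-of-live-cells representation and the `step` name are my own choices, not anything from the post):

```python
from collections import Counter

def step(live):
    """One generation of Conway's Game of Life on an unbounded grid.

    `live` is the set of (row, col) coordinates of live cells.
    A cell is alive next generation iff it has exactly 3 live
    neighbors, or it is currently alive and has exactly 2.
    """
    # Count, for every cell adjacent to a live cell, how many live neighbors it has.
    counts = Counter(
        (r + dr, c + dc)
        for (r, c) in live
        for dr in (-1, 0, 1)
        for dc in (-1, 0, 1)
        if (dr, dc) != (0, 0)
    )
    return {cell for cell, n in counts.items()
            if n == 3 or (n == 2 and cell in live)}

# A glider: after 4 steps it reappears translated one cell down-right,
# which is the kind of long-range influence the question is about.
glider = {(0, 1), (1, 2), (2, 0), (2, 1), (2, 2)}
state = glider
for _ in range(4):
    state = step(state)
assert state == {(r + 1, c + 1) for (r, c) in glider}
```

The question, then, is whether patterns confined to a small initial region can steer the far-future state of a much larger grid, the way a glider propagates influence far beyond its starting cells.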
I argue that the permissibility or impermissibility of AI is a deep property of our physics.
I propose the AI hypothesis, which is that any