AI alignment researcher supported by MIRI and LTFF. Working on the learning-theoretic agenda.
I think you misunderstood how the iterated quantilization works. It does not work by the AI setting a long-term goal and then charting a path towards that goal s.t. it doesn't deviate too much from the baseline over every short interval. Instead, every short-term quantilization is optimizing for the user's evaluation in the end of this short-term interval.
When I'm deciding whether to run an AI, I should be maximizing the expectation of my utility function w.r.t. my belief state. This is just what it means to act rationally. You can then ask, how is this compatible with trusting another agent smarter than myself?
One potentially useful model is: I'm good at evaluating and bad at searching (after all, P≠NP). I can therefore delegate searching to another agent. But, as you point out, this doesn't account for situations in which I seem to be bad at evaluating. Moreover, if the AI prior takes an intentional stance towards the user (in order to help learning their preferences), then the user must be regarded as good at searching.
A better model is: I'm good at both evaluating and searching, but the AI can access actions and observations that I cannot. For example, having additional information can allow it to evaluate better. An important special case is: the AI is connected to an external computer (Turing RL) which we can think of as an "oracle". This allows the AI to have additional information which is purely "logical". We need infra-Bayesianism to formalize this: the user has Knightian uncertainty over the oracle's outputs entangled with other beliefs about the universe.
For instance, in the chess example, if I know that a move was produced by exhaustive game-tree search then I know it's a good move, even without having the skill to understand why the move is good in any more detail.
Now let's examine short-term quantilization for chess. On each cycle, the AI finds a short-term strategy leading to a position that the user evaluates as good, but that the user would require luck to manage on their own. This is repeated again and again throughout the game, leading to overall play substantially superior to the user's. On the other hand, this play is not as good as the AI would achieve if it just optimized for winning at chess without any constrains. So, our AI might not be competitive with an unconstrained unaligned AI. But, this might be good enough.
I'm not sure what you're saying in the "turning off the stars example". If the probability for the user to autonomously decide to turn off the stars is much lower than the quantilization fraction, then the probability that quantilization will decide to turn off the stars is low. And, the quantilization fraction is automatically selected like this.
There is a case that aligned AI doesn't have to be competitive with unaligned AI, it just has to be much better than humans at alignment research. Because, if this holds, then we can delegate the rest of the problem to the AI.
Where it might fail is: it takes so much work to solve the alignment problem that even that superhuman aligned AI will not do it in time to build the "next stage" aligned AI (i.e. before the even-more-superhuman unaligned AI is deployed). In this case, it might be advantageous to have mere humans doing extra progress in alignment between the point "first stage" solution is available and the point the first stage aligned AI is deployed.
The bigger the capability gap between first stage aligned AI and humans, the less valuable this extra progress becomes (because the AI would be able to do it on its own much faster). On the other hand, the smaller the time difference between first stage aligned AI deployment and unaligned AI deployment, the more valuable this extra progress becomes.
IIUC, in a multi-agent influence model, every subgame perfect equilibrium is also a subgame perfect equilibrium in the corresponding extensive form game, but the converse is false in general. Do you know whether at least one subgame perfect equilibrium exists for any MAIM? I couldn't find it in the paper.
So, your thesis is, only exponential models give rise to nice abstractions? And, since it's important to have abstractions, we might just as well have our agents reason exclusively in terms of exponential models?
I'm still confused. What direction of GKPD do you want to use? It sounds like you want to use the low-dimensional statistic => exponential family direction. Why? What is good about some family being exponential?
Can you explain how the generalized KPD fits into all of this? KPD is about estimating the parameters of a model from samples via a low dimensional statistic, whereas you are talking about estimating one part of a sample from another (distant) part of the sample via a low dimensional statistic. Are you using KPD to rule out "high-dimensional" correlations going through the parameters of the model?
The way I think about instrumental goals is: You have have an MDP with a hierarchical structure (i.e. the states are the leaves of a rooted tree), s.t. transitions between states that differ on a higher level of the hierarchy (i.e. correspond to branches that split early) are slower than transitions between states that differ on lower levels of the hierarchy. Then quasi-stationary distributions on states resulting from different policies on the "inner MDP" of a particular "metastate" effectively function as actions w.r.t. to the higher levels. Under some assumptions it should be possible to efficiently control such an MDP in time complexity much lower than polynomial in the total number of states. Hopefully it is also possible to efficiently learn this type of hypothesis.
I don't think that anywhere there we will need a lemma saying that the algorithm picks "aligned" goals.
For example, if each vertex in the tree has the structure of one of some small set of MDPs, and you are given mappings from admissible distributions on "child" MDPs to actions of "parent" MDP that is compatible with the transition kernel. ↩︎
I think the GoL is not the best example for this sort of questions. See this post by Scott Aaronson discussing the notion of "physical universality" which seems relevant here.
Also, like other commenters pointed out, I don't think the object you get here is necessarily AI. That's because the "laws of physics" and the distribution of initial conditions are assumed to be simple and known. An AI would be something that can accomplish an objective of this sort while also having to learn the rules of the automaton or detect patterns in the initial conditions. For example, instead of initializing the rest of the field uniformly randomly, you could initialize it using something like the Solomonoff prior.