In other words, how do we find the corresponding variables? I've given you an argument that the variables in an AGI's world-model which correspond to the ones in your world-model can be found by expressing your concept in English sentences.
The problem is with what you mean by "find". If by "find" you mean "there exist some variables in the AI's world model which correspond directly to the things you mean by some English sentence", then yes, you've argued that. But it's not enough for there to exist some variables in the AI's world-model which correspond to the things we mean. We have to either know which variables those are, or have some other way of "pointing to them" in order to get the AI to actually do what we're saying.
An AI may understand what I mean, in the sense that it has some internal variables corresponding to what I mean, but I still need to know which variables those are (or some way to point to them) and how "what I mean" is represented in order to construct a feedback signal.
That's what I mean by "finding" the variables. It's not enough that they exist; we (the humans, not the AI) need some way to point to which specific functions/variables they are, in order to get the AI to do what we mean.
The AI knowing what I mean isn't sufficient here. I need the AI to do what I mean, which means I need to program it/train it to do what I mean. The program or feedback signal needs to be pointed at what I mean, not just whatever English-language input I give.
For instance, if an AI is trained to maximize how often I push a particular button, and I say "I'll push the button if you design a fusion power generator for me", it may know exactly what I mean and what I intend. But it will still be perfectly happy to give me a design with some unintended side effects which I'm unlikely to notice until after pushing the button.
I believe the paper says that log densities are (approximately) polynomial - e.g. a Gaussian would satisfy this, since the log density of a Gaussian is quadratic.
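Spelling out the Gaussian case, as a standard identity rather than anything taken from the paper (the symbols $\mu$ and $\sigma^2$ for the mean and variance are my own notation):

$$\log p(x) = -\frac{(x-\mu)^2}{2\sigma^2} - \frac{1}{2}\log(2\pi\sigma^2)$$

This is a degree-2 polynomial in $x$, so the Gaussian is the simplest nontrivial instance of a polynomial log density.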
I'll answer the second question, and hopefully the first will be answered in the process.
First, note that $P[X|M_2] \propto e^{\alpha u(X)}$, so arbitrarily large negative utilities aren't a problem - they get exponentiated, and yield probabilities arbitrarily close to 0. The problem is arbitrarily large positive utilities. In fact, they don't even need to be arbitrarily large, they just need to have an infinite exponential sum; e.g. if $u(X)$ is 1 for any whole number of paperclips $X$, then to normalize the probability distribution we need to divide by $\sum_{X=0}^{\infty} e^{\alpha \cdot 1} = \infty$. The solution to this is to just leave the distribution unnormalized. That's what "improper distribution" means: it's a distribution which can't be normalized, because it sums to $\infty$.
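A quick numerical sketch of the divergence, in case it helps. The function names and the choice $\alpha = 1$ are illustrative assumptions, and the "negative utility" case is a hypothetical contrast I've added, not part of the original example. It compares partial sums of $e^{\alpha u(X)}$ for the constant utility $u(X) = 1$ against a utility which decreases without bound:

```python
import math

def partial_normalizer(u, alpha, n_terms):
    """Partial sum of exp(alpha * u(X)) over X = 0, 1, ..., n_terms - 1."""
    return sum(math.exp(alpha * u(x)) for x in range(n_terms))

def constant_utility(x):
    # u(X) = 1 for every number of paperclips, as in the example above.
    return 1.0

def negative_utility(x):
    # Hypothetical contrast case: utility falls without bound as X grows.
    return -float(x)

alpha = 1.0  # illustrative value
for n_terms in (10, 100, 1000):
    z_const = partial_normalizer(constant_utility, alpha, n_terms)
    z_neg = partial_normalizer(negative_utility, alpha, n_terms)
    print(f"{n_terms:5d} terms: constant-utility sum = {z_const:10.2f}, "
          f"negative-utility sum = {z_neg:.6f}")
```

The constant-utility sum grows linearly in the number of terms, so there is no finite normalizer. The unboundedly negative utility gives a convergent geometric series, which is why large negative utilities cause no trouble.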
The main question here seems to be "ok, but what does an improper distribution mean in terms of bits needed to encode X?". Basically, we need infinitely many bits in order to encode X, using this distribution. But it's "not the same infinity" for each X-value - not in the sense of "set of reals is bigger than the set of integers", but in the sense of "we constructed these infinities from a limit so one can be subtracted from the other". Every X value requires infinitely many bits, but one X-value may require 2 bits more than another, or 3 bits less than another, in such a way that all these comparisons are consistent. By leaving the distribution unnormalized, we're effectively picking a "reference point" for our infinity, and then keeping track of how many more or fewer bits each X-value needs, compared to the reference point.
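In the case of the paperclip example, we could have a sequence of utilities $u_n(X)$ which each assigns utility $X$ to any number of paperclips $X < n$ (i.e. 1 util per clip, up to $n$ clips), and then we take the limit $n \to \infty$. Then our $n^{th}$ unnormalized distribution is $P_{unnorm}[X|M_n] = e^{\alpha X} I[X<n]$, and the normalizing constant is $Z_n = \frac{1-e^{\alpha n}}{1-e^{\alpha}}$, which grows like $O(e^{\alpha n})$ as $n \to \infty$. The number of bits required to encode a particular value $X<n$ is

$$-\log P[X|M_n] = -\log \frac{e^{\alpha X}}{Z_n} = \log \frac{1-e^{\alpha n}}{1-e^{\alpha}} - \alpha X \log e$$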
Key thing to notice: the first term, $\log \frac{1-e^{\alpha n}}{1-e^{\alpha}}$, is the part which goes to $\infty$ with $n$, and it does not depend on $X$. So, we can take that term to be our "reference point", and measure the number of bits required for any particular $X$ relative to that reference point. That's exactly what we're implicitly doing if we don't normalize the distribution: ignoring normalization, we compute the number of bits required to encode $X$ as

$$-\log P_{unnorm}[X|M_n] = -\log e^{\alpha X} = -\alpha X \log e$$
... which is exactly the "adjustment" from our reference point.
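Here's a minimal numerical sketch of the reference-point idea. The function names and $\alpha = 0.5$ are my own illustrative choices, and it assumes $\alpha > 0$. It computes the bits needed to encode $X$ under the normalized model $M_n$ for growing $n$, and checks that the gap between two $X$-values stays fixed while the reference term blows up:

```python
import math

def log_z(n, alpha):
    """Natural log of Z_n = sum_{X=0}^{n-1} exp(alpha * X), assuming alpha > 0.

    Uses the equivalent form
        exp(alpha * (n - 1)) * (1 - exp(-alpha * n)) / (1 - exp(-alpha))
    so that large n does not overflow.
    """
    return (alpha * (n - 1)
            + math.log1p(-math.exp(-alpha * n))
            - math.log1p(-math.exp(-alpha)))

def bits_normalized(x, n, alpha):
    """-log2 P[X = x | M_n]: bits to encode x under the normalized n-th model."""
    if not 0 <= x < n:
        raise ValueError("x must satisfy 0 <= x < n")
    return (log_z(n, alpha) - alpha * x) / math.log(2)

def bits_unnormalized(x, alpha):
    """-log2 of the unnormalized weight exp(alpha * x): bits relative to the reference."""
    return -alpha * x / math.log(2)

alpha = 0.5  # illustrative value
for n in (10, 100, 10_000):
    reference = log_z(n, alpha) / math.log(2)  # grows without bound as n grows
    gap = bits_normalized(3, n, alpha) - bits_normalized(5, n, alpha)
    print(f"n={n:6d}  reference={reference:10.2f} bits  bits(3)-bits(5)={gap:.6f}")
```

The reference term grows roughly linearly in $n$, but the difference between encoding $X=3$ and $X=5$ is the same at every $n$, and equals `bits_unnormalized(3, alpha) - bits_unnormalized(5, alpha)`. That's the sense in which the unnormalized distribution keeps track of "how many more or fewer bits" each value needs.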
(Side note: this is exactly how information theory handles continuous distributions. An infinite number of bits is required to encode a real number, so we pull out a term $\log dx$ which diverges in the limit $dx \to 0$, and we measure everything relative to that. Equivalently, we measure the number of bits required to encode up to precision $dx$, and as long as the distribution is smooth and $dx$ is small, the number of bits required to encode the rest of $x$ using the distribution won't depend on the value of $x$.)
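The continuous analogue can be sketched the same way. This assumes a standard normal as the example distribution, and the function names are mine. The bits needed to encode $x$ to precision $dx$ are $-\log_2$ of the probability of a width-$dx$ bin around $x$. Subtracting the divergent term $-\log_2 dx$ should leave something close to $-\log_2 p(x)$, regardless of how small $dx$ is:

```python
import math

def std_normal_cdf(x):
    """CDF of the standard normal, via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def std_normal_pdf(x):
    """Density of the standard normal."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def bits_to_precision(x, dx):
    """-log2 of the probability of the width-dx bin centered at x."""
    p_bin = std_normal_cdf(x + dx / 2) - std_normal_cdf(x - dx / 2)
    return -math.log2(p_bin)

for dx in (1e-1, 1e-2, 1e-3):
    divergent = -math.log2(dx)  # the x-independent term that blows up as dx -> 0
    for x in (0.0, 1.5):
        relative = bits_to_precision(x, dx) - divergent
        target = -math.log2(std_normal_pdf(x))
        print(f"dx={dx:g}  x={x}  total={bits_to_precision(x, dx):8.4f}  "
              f"relative={relative:.6f}  -log2 p(x)={target:.6f}")
```

The total bit count grows as $dx$ shrinks, but the "relative" column settles down to $-\log_2 p(x)$, which is the continuous counterpart of measuring against a reference point.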
Does this make sense? Should I give a different example/use more English?
Awesome question! I spent about a day chewing on this exact problem.
First, if our variables are drawn from finite sets, then the problem goes away (as long as we don't have actually-infinite utilities). If we can construct everything as limits from finite sets (as is almost always the case), then that limit should involve a sequence of world models.
The more interesting question is what that limit converges to. In general, we may end up with an improper distribution (conceptually, we have to carry around two infinities which cancel each other out). That's fine - improper distributions happen sometimes in Bayesian probability, we usually know how to handle them.
I think that the vast majority of the existential risk comes from that “broader issue” that you're pointing to of not being able to get worst-case guarantees due to using deep learning or evolutionary search or whatever. That leads me to want to define inner alignment to be about that problem...
[Emphasis added.] I think this is a common and serious mistake-pattern, and in particular is one of the more common underlying causes of framing errors. The pattern is roughly:
The problem is that, in trying to "shoehorn" cause(y) into the category Cause(X), we miss the opportunity to notice a different pattern, which is more directly useful in understanding y as well as some other cluster of problems related to y.
A concrete example: this is the same mistake I accused Zvi of making when trying to cast moral mazes as a problem of super-perfect competition. The conditions needed for super-perfect competition to explain moral mazes did not hold, and by trying to shoehorn the problem into that mold Zvi was missing an orthogonal phenomenon which is extremely interesting in its own right: thinking about that exact problem was what led to Demons in Imperfect Search.
Now, this is not to say that changing a definition to fit another case is always the wrong move. Sometimes, a new use-case shows that the definition can handle the new case while still preserving its original essence. The key question is whether the problem cluster X and problem y really do have the same underlying structure, or if there's something genuinely new and different going on in y.
In this case, I think it's pretty clear that there is more than just inner alignment problems going on in the lack of worst-case guarantees for deep learning/evolutionary search/etc. Generalization failure is not just about, or even primarily about, inner agents. It occurs even in the absence of mesa-optimizers. So defining inner alignment to be about that problem looks to me like a mistake - you're likely to miss important, conceptually-distinct phenomena by making that move. (We could also come at it from the converse direction: if something clearly recognizable as an inner alignment problem occurs for ideal Bayesians, then redefining the inner alignment problem to be "we can't control what sort of model we get when we do ML" is probably a mistake, and you're likely to miss interesting phenomena that way which don't conceptually resemble inner alignment.)
A useful knee-jerk reaction here is to notice when cause(y) doesn't quite fit the pattern Cause(X), and use that as a curiosity-pump to look for other cases which resemble y. That's the sort of instinct which will tend to turn up insights we didn't know we were missing.
Related to the role of peer review: a lot of stuff on LW/AF is relatively exploratory, feeling out concepts, trying to figure out the right frames, etc. We need to be generally willing to discuss incomplete ideas, stuff that hasn't yet had the details ironed out. For that to succeed, we need community discussion standards which tolerate a high level of imperfect details or incomplete ideas. I think we do pretty well with this today.
But sometimes, you want to be like "come at me bro". You've got something that you're pretty highly confident is right, and you want people to really try to shoot it down (partly as a social mechanism to demonstrate that the idea is in fact as solid and useful as you think it is). This isn't something I'd want to be the default kind of feedback, but I'd like for authors to be able to say "come at me bro" when they're ready for it, and I'd like for posts which survive such a review to be perceived as more epistemically-solid/useful.
With that in mind, here's a few of my own AF posts which I'd submit for a "come at me bro" review:
For all of these, things like "this frame is wrong" or "this seems true but not useful" are valid objections. I'm not just claiming that the proofs hold.
Good enough. I don't love it, but I also don't see easy ways to improve it without making it longer and more technical (which would mean it's not strictly an improvement). Maybe at some point I'll take the time to make a shorter and less math-dense writeup.
I was considering this, but the problem is that in your setup S is supposed to be derived from X (that is, S is a deterministic function of X), which is not true when X = training data and S = that which we want to predict.
That's an (implicit) assumption in Conant & Ashby's setup, I explicitly remove that constraint in the "Minimum Entropy -> Maximum Expected Utility and Imperfect Knowledge" section. (That's the "imperfect knowledge" part.)
If S is derived from X, then "information in S" = "information in X relevant to S"
Same here. Once we relax the "S is a deterministic function of X" constraint, the "information in X relevant to S" is exactly the posterior distribution (s↦P[S=s|X]), which is why that distribution comes up so much in the later sections.
(In general I struggled with keeping the summary short vs. staying true to the details of the causal model.)
Yeah, the number of necessary nontrivial pieces is... just a little too high to not have to worry about inductive distance.
Yes! That is exactly the sort of theorem I'd expect to hold. (Though you might need to be in POMDP-land, not just MDP-land, for it to be interesting.)