Student at Caltech. Currently trying to get an AI safety inside view.
I think a lot of commenters misunderstand this post, or think it's trying to do more than it is. TLDR of my take: it's conveying intuition, not suggesting we should model preferences with 2D vector spaces.
The risk-neutral measure in finance is one way that "rotations" between probability and utility can be made:
These two interpretations explain the same agent behavior. The risk-neutral measure still "feels" like probability due to its uniqueness in an efficient market (fundamental theorem of asset pricing), plus the fact that quants use and think in it every day to price derivatives. Mathematically, it's no different from the actual measure P.
The Radon-Nikodym theorem tells you how to transform between probability measures in general. For any utility function satisfying certain properties (which I don't know exactly), I think one can find a measure Q such that you're maximizing that utility function under Q. Sometimes when making career decisions, I think using the "actionable AI alignment probability measure" P_A which is P conditioned on my counterfactually saving the world. Under P_A, the alignment problem has a closer to 50% chance of being solved, my research directions are more tractable, etc. Again, P_A is just a probability measure, and "feels like" probability.
This post finds a particular probability measure Q which doesn't really have a physical meaning . But its purpose is to make it more obvious that probability and utility are inextricably intertwined, because
As far as I can tell, this is the entire point. I don't see this 2D vector space actually being used in modeling agents, and I don't think Abram does either.
Personally, I find it pretty compelling to just think of the risk-neutral measure, to understand why probability and utility are inextricably linked. But actually knowing there is symmetry between probability and utility does add to my intuition.
: actually, if we're upweighting the high-utility worlds, maybe it can be called "rosy probability measure" or something.
I think we need to unpack "sufficiently aligned"; here's my attempt. There are A=2^10000 10000-bit strings. Maybe 2^1000 of them are coherent English text, and B=2^200 of these are alignment proposals that look promising to a human reviewer, and C=2^100 of them are actually correct and will result in aligned AI.The thesis of the post requires that we can make a "sufficiently aligned" AI that, conditional on a proposal looking promising, is likely to be actually correct.
We can't get those 100 bits through further selection for appearance. It seems plausible that we can get them somehow, though.
I haven't heard this. What's the strongest criticism?