## AI ALIGNMENT FORUM

Thomas Kwa

Student at Caltech. Currently trying to get an AI safety inside view.


# Wiki Contributions

Probability is Real, and Value is Complex

I think a lot of commenters misunderstand this post, or think it's trying to do more than it is. TLDR of my take: it's conveying intuition, not suggesting we should model preferences with 2D vector spaces.

The risk-neutral measure in finance is one way that "rotations" between probability and utility can be made:

• under the actual measure P, agents have utility nonlinear in money (e.g. risk aversion), and probability corresponds to frequentist notions
• under the risk-neutral measure Q, agents have utility linear in money, and probability is skewed towards losing outcomes.

These two interpretations explain the same agent behavior. The risk-neutral measure still "feels" like probability due to its uniqueness in an efficient market (fundamental theorem of asset pricing), plus the fact that quants use and think in it every day to price derivatives. Mathematically, it's no different from the actual measure P.
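To make the P versus Q picture concrete, here is a minimal sketch in a one-period, two-state market. All the numbers (prices, interest rate, the actual up-probability) are made-up assumptions for illustration, not anything from the post. It prices a call option two ways: as a plain expectation under Q, and as an expectation under P reweighted by the density dQ/dP. Both descriptions give the same price.

```python
import math

# Illustrative one-period binomial market (all values are assumptions).
s0 = 100.0                   # stock price today
s_up, s_down = 120.0, 80.0   # possible stock prices next period
r = 0.0                      # risk-free rate
p_up = 0.6                   # "actual" probability of the up state

# Risk-neutral probability: chosen so the stock's discounted
# expected price under Q equals today's price.
q_up = (s0 * (1 + r) - s_down) / (s_up - s_down)

P = {"up": p_up, "down": 1 - p_up}
Q = {"up": q_up, "down": 1 - q_up}

# Density of Q with respect to P, state by state.
dQ_dP = {state: Q[state] / P[state] for state in P}

# Payoff of a call option with strike 100.
strike = 100.0
payoff = {"up": max(s_up - strike, 0.0), "down": max(s_down - strike, 0.0)}

# View 1: linear valuation, skewed probabilities (Q).
price_under_q = sum(Q[s] * payoff[s] for s in Q) / (1 + r)

# View 2: actual probabilities (P), with payoffs reweighted per state.
price_under_p = sum(P[s] * dQ_dP[s] * payoff[s] for s in P) / (1 + r)

print(price_under_q, price_under_p)
print("Q puts more weight on the losing state:", Q["down"] > P["down"])
```

In this toy market Q shifts weight from the up state to the down state relative to P, matching the "skewed towards losing outcomes" description, and the reweighting factor dQ/dP plays the role that nonlinear utility plays under P.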

The Radon-Nikodym theorem tells you how to transform between probability measures in general. For any utility function satisfying certain properties (which I don't know exactly), I think one can find a measure Q such that you're maximizing that utility function under Q. Sometimes when making career decisions, I think in terms of the "actionable AI alignment probability measure" P_A, which is P conditioned on my counterfactually saving the world. Under P_A, the alignment problem has a closer to 50% chance of being solved, my research directions are more tractable, etc. Again, P_A is just a probability measure, and "feels like" probability.

This post finds a particular probability measure Q which doesn't really have a physical meaning [1]. But its purpose is to make it more obvious that probability and utility are inextricably intertwined, because

• instead of explaining behavior in terms of P and the utility function V, you can represent it using P and Q
• P and Q span a 2D vector space, and you can perform literal "rotations" between probability and utility that still predict the same agent behavior.

As far as I can tell, this is the entire point. I don't see this 2D vector space actually being used in modeling agents, and I don't think Abram does either.

Personally, I find it pretty compelling to just think of the risk-neutral measure, to understand why probability and utility are inextricably linked. But actually knowing there is symmetry between probability and utility does add to my intuition.

[1]: Actually, if we're upweighting the high-utility worlds, maybe it can be called a "rosy probability measure" or something.

[Link] A minimal viable product for alignment

I think we need to unpack "sufficiently aligned"; here's my attempt. There are A=2^10000 10000-bit strings. Maybe 2^1000 of them are coherent English text, B=2^200 of these are alignment proposals that look promising to a human reviewer, and C=2^100 of those are actually correct and will result in aligned AI. The thesis of the post requires that we can make a "sufficiently aligned" AI whose proposals, conditional on looking promising, are likely to be actually correct.

• A system that produces a random 10000-bit string that looks promising to a human reviewer is not "sufficiently aligned".
• A system that follows the process that the most truthful possible humans use to do alignment research is sufficiently aligned (or if not, we're doomed anyway). Truth-seeking humans doing alignment research are only accessing a tiny part of the space of 2^200 persuasive ideas, and most of this is in the subset of 2^100 truthful ideas.
• If the system is selecting for appearance, it needs to also have 100 bits of selection towards truth to be sufficiently aligned.

We can't get those 100 bits through further selection for appearance. It seems plausible that we can get them somehow, though.
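The arithmetic behind the "100 bits" figure can be sketched as follows, using the rough set sizes from above and assuming the correct proposals are a subset of the promising-looking ones. The variable names are just for illustration.

```python
from fractions import Fraction

# Set sizes as log2 counts, taken from the rough estimates in this comment.
LOG2_ALL_STRINGS = 10000   # A: all 10000-bit strings
LOG2_PROMISING = 200       # B: proposals that look promising to a reviewer
LOG2_CORRECT = 100         # C: proposals that are actually correct

def bits_of_selection(log2_pool: int, log2_target: int) -> int:
    """Bits needed to narrow a pool down to a target subset of it."""
    return log2_pool - log2_target

# Chance that a uniformly random promising-looking proposal is correct.
p_correct_given_promising = Fraction(2**LOG2_CORRECT, 2**LOG2_PROMISING)

# Selection for appearance alone: from all strings down to promising ones.
bits_for_appearance = bits_of_selection(LOG2_ALL_STRINGS, LOG2_PROMISING)

# Extra selection towards truth still needed after that.
bits_for_truth = bits_of_selection(LOG2_PROMISING, LOG2_CORRECT)

print(bits_for_appearance)                                   # 9800
print(bits_for_truth)                                        # 100
print(p_correct_given_promising == Fraction(1, 2**100))      # True
```

So a system that only selects for appearance has done 9800 bits of work and still has a 2^-100 chance of landing on a correct proposal.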

Possible takeaways from the coronavirus pandemic for slow AI takeoff

I haven't heard this. What's the strongest criticism?