Doing alignment research with Vivek Hebbar's team at MIRI.
Were any cautious people trying empirical alignment research before Redwood/Conjecture?
Do you have thoughts on when two algorithms that aren’t “doing the same thing” can fall within the same loss basin?
It seems like there could be two substantially different algorithms which can be linearly interpolated between with no increase in loss. For example, the model is trained to classify fruit types and ripeness. One module finds the average color of a fruit (in an arbitrary basis), and another module uses this to calculate fruit type and ripeness. The basis in which color is expressed can be arbitrary, since the second module can compensate.
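If the modules are linear, the compensation part is easy to verify numerically. Here is a minimal numpy sketch (shapes, weights, and the rotation are all made up for illustration); it only demonstrates that the two parameterizations compute the same function, not the stronger claim about interpolating between them without a loss increase:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical linear "modules": the first maps a fruit feature vector to
# a 3-d color summary, the second maps color to (type, ripeness) scores.
W_color = rng.normal(size=(3, 10))  # module 1: average-color extractor
W_head = rng.normal(size=(2, 3))    # module 2: classifier head

# An arbitrary rotation of the color basis (orthogonal matrix from QR).
Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))

x = rng.normal(size=10)

# Original network vs. network with a rotated color basis and a
# compensating head: identical outputs, hence identical loss.
out_a = W_head @ (W_color @ x)
out_b = (W_head @ Q.T) @ ((Q @ W_color) @ x)

assert np.allclose(out_a, out_b)
```

The symmetry holds for any rotation Q because the head cancels it exactly; whether the straight-line path between two such parameterizations also stays at the same loss is the open part of the question.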
The ultimate goal of John Wentworth’s sequence "Basic Foundations for Agent Models" is to prove a selection theorem of the form:
John has not yet proved such a result, and doing so would be a major advance in the selection theorems agenda. I also find it plausible that someone without specific context could do meaningful work here. As such, I’ll offer a $5000 bounty to anyone who finds a precise theorem statement and beats John to the full proof (or to a disproof plus a proof of a well-motivated weaker statement). The bounty will decrease to zero over the next ~12 months as the sequence is completed, and partial contributions will be rewarded proportionally.
Note that the particular form of "nonexistence of a representative agent" John mentions is an original result that's not too difficult to show informally, but hasn't really been written down formally either here or in the economics literature.
Ryan Kidd and I did an economics literature review a few weeks ago on representative agents, and couldn't find any results general enough to be meaningful. We did find one paper proving that a market's utility function cannot take a certain restricted form, but nothing proving the lack of a coherent utility function in general. A bounty for such results also hasn't turned up any papers.
Again analogizing from the definition in “Risks From Learned Optimization”, “corrigible alignment” would be developing a motivation along the lines of “whatever my subcortex is trying to reward me for, that is what I want!” Maybe the closest thing to that is hedonism? Well, I don’t think we want AGIs with that kind of corrigible alignment, for reasons discussed below.
At first this claim seemed kind of wild, but there's a version of it I agree with.
It seems like conditional on the inner optimizer being corrigible, in the sense of having a goal that's a pointer to some optimizer "outside" it, it's underspecified what it should point to. In the evolution -> humans -> gradient descent -> model example, corrigibility as defined in RLO could mean that the model is optimizing for the goals of evolution, humans, or the gradient. This doesn't seem to be different between the RLO and steered optimization stories.
I think the analogy to corrigible alignment among humans being hedonism assumes that a corrigibly aligned optimizer's goal would point to the thing immediately upstream of its reward. This is not obvious to me. It seems like wireheading / manipulating reward signals is a potential problem, but this is just a special case of not being able to steer an inner optimizer even conditional on it having a narrow corrigibility property.
I think a lot of commenters misunderstand this post, or think it's trying to do more than it is. TLDR of my take: it's conveying intuition, not suggesting we should model preferences with 2D vector spaces.
The risk-neutral measure in finance is one way that "rotations" between probability and utility can be made:
These two interpretations explain the same agent behavior. The risk-neutral measure still "feels" like probability because it is unique in a complete arbitrage-free market (the fundamental theorems of asset pricing), and because quants use it and think in it every day to price derivatives. Mathematically, it's no different from the actual measure P.
The Radon-Nikodym theorem tells you how to transform between probability measures in general. For any utility function satisfying certain properties (which I don't know exactly), I think one can find a measure Q such that you're maximizing that utility function under Q. Sometimes when making career decisions, I think in terms of the "actionable AI alignment probability measure" P_A, which is P conditioned on my counterfactually saving the world. Under P_A, the alignment problem has a closer-to-50% chance of being solved, my research directions are more tractable, and so on. Again, P_A is just a probability measure, and "feels like" probability.
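A minimal sketch of the change of measure, assuming a positive utility weighting U with finite expectation (my guess at the "certain properties" above):

```latex
\[
\frac{dQ}{dP} \;=\; \frac{U}{\mathbb{E}_P[U]}
\qquad\Longrightarrow\qquad
\mathbb{E}_P[U \cdot X] \;=\; \mathbb{E}_P[U]\,\mathbb{E}_Q[X].
\]
```

Ranking payoffs X by utility-weighted expectation under P then agrees with ranking them by plain expectation under Q: the utility weighting has been absorbed into the probabilities, which is exactly how the risk-neutral measure absorbs risk preferences into prices.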
This post finds a particular probability measure Q which doesn't really have a physical meaning. But its purpose is to make it more obvious that probability and utility are inextricably intertwined, because
As far as I can tell, this is the entire point. I don't see this 2D vector space actually being used in modeling agents, and I don't think Abram does either.
Personally, I find it pretty compelling to just think of the risk-neutral measure in order to understand why probability and utility are inextricably linked. But actually knowing there is a symmetry between probability and utility does add to my intuition.
Footnote: actually, if we're upweighting the high-utility worlds, maybe it can be called a "rosy probability measure" or something.
I think we need to unpack "sufficiently aligned"; here's my attempt. There are A = 2^10000 possible 10000-bit strings. Maybe 2^1000 of them are coherent English text, B = 2^200 of those are alignment proposals that look promising to a human reviewer, and C = 2^100 of those are actually correct and would result in aligned AI. The thesis of the post requires that we can make a "sufficiently aligned" AI whose proposals, conditional on looking promising, are likely to be actually correct.
We can't get those 100 bits (the gap between B and C) through further selection on appearance. It seems plausible that we can get them somehow, though.
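For concreteness, the arithmetic behind the 100-bit gap, using the hypothetical counts above:

```python
# Hypothetical counts from the comment above, expressed in log2 (bits).
B_bits = 200  # proposals that look promising to a human reviewer: 2**200
C_bits = 100  # proposals that are actually correct: 2**100

# If the correct proposals sit uniformly inside the promising ones, then
# P(correct | promising) = 2**C_bits / 2**B_bits = 2**-(B_bits - C_bits).
gap_bits = B_bits - C_bits
print(gap_bits)  # bits of selection pressure still needed -> 100
```

So selecting for appearance alone leaves a promising-looking proposal with only a 2^-100 chance of being correct under these (made-up) numbers.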
I haven't heard this. What's the strongest criticism?