Doing alignment research with Vivek Hebbar's team at MIRI.
not Nate or a military historian, but to me it seems pretty likely for a ~100 human-years more technologically advanced actor to get decisive strategic advantage over the world.
This seems sufficient for "what failure looks like" scenarios, with faster disempowerment through hard takeoff likely to depend on other pathways like nanotech, social engineering, etc. As for the whole argument against "heavy precedent", I'm not convinced either way and haven't thought about it a ton.
There's a clarification by John here. I heard it was going to be put on Superlinear but unclear if/when.
Why should we expect that True Names useful for research exist in general? It seems like there are reasons why they don't:
Most of these examples are bad, but hopefully they get the point across.
Were any cautious people trying empirical alignment research before Redwood/Conjecture?
Do you have thoughts on when there are two algorithms that aren’t “doing the same thing” that fall within the same loss basin?
It seems like there could be two substantially different algorithms which can be linearly interpolated between with no increase in loss. For example, the model is trained to classify fruit types and ripeness. One module finds the average color of a fruit (in an arbitrary basis), and another module uses this to calculate fruit type and ripeness. The basis in which color is expressed can be arbitrary, since the second module can compensate.
The ultimate goal of John Wentworth’s sequence "Basic Foundations for Agent Models" is to prove a selection theorem of the form:
John has not yet proved such a result and it would be a major advance in the selection theorems agenda. I also find it plausible that someone without specific context could do meaningful work here. As such, I’ll offer a $5000 bounty to anyone who finds a precise theorem statement and beats John to the full proof (or disproof + proof of a well-motivated weaker statement). This bounty will decrease to zero as the sequence is completed and over the next ~12 months. Partial contributions will be rewarded proportionally.
Note that the particular form of "nonexistence of a representative agent" John mentions is an original result that's not too difficult to show informally, but hasn't really been written down formally either here or in the economics literature.
Ryan Kidd and I did an economics literature review a few weeks ago for representative agent stuff, and couldn't find any results general enough to be meaningful. We did find one paper that proved a market's utility function couldn't be of a certain restricted form, but nothing about proving the lack of a coherent utility function in general. A bounty also hasn't found any such papers.
Again analogizing from the definition in “Risks From Learned Optimization”, “corrigible alignment” would be developing a motivation along the lines of “whatever my subcortex is trying to reward me for, that is what I want!” Maybe the closest thing to that is hedonism? Well, I don’t think we want AGIs with that kind of corrigible alignment, for reasons discussed below.
At first this claim seemed kind of wild, but there's a version of it I agree with.
It seems like conditional on the inner optimizer being corrigible, in the sense of having a goal that's a pointer to some optimizer "outside" it, it's underspecified what it should point to. In the evolution -> humans -> gradient descent -> model example, corrigibility as defined in RLO could mean that the model is optimizing for the goals of evolution, humans, or the gradient. This doesn't seem to be different between the RLO and steered optimization stories.
I think the analogy to corrigible alignment among humans being hedonism assumes that a corrigibly aligned optimizer's goal would point to the thing immediately upstream of its reward. This is not obvious to me. It seems like wireheading / manipulating reward signals is a potential problem, but this is just a special case of not being able to steer an inner optimizer even conditional on it having a narrow corrigibility property.
FWIW this was basically cached for me, and if I were better at writing and had explained this ~10 times before like I expect Eliezer has, I'd be able to do about as well. So would Nate Soares or Buck or Quintin Pope (just to pick people in 3 different areas of alignment), and Quintin would also have substantive disagreements.