## AI ALIGNMENT FORUMAF

Eliezer Yudkowsky

# Wiki Contributions

The Commitment Races problem

IMO, commitment races only occur between agents who will, in some sense, act like idiots, if presented with an apparently 'committed' agent.  If somebody demands $6 from me in the Ultimatum game, threatening to leave us both with$0 unless I offer at least $6 to them... then I offer$6 with slightly less than 5/6 probability, so they do no better than if they demanded $5, the amount I think is fair. They cannot evade that by trying to make some 'commitment' earlier than I do. I expect that, whatever is the correct and sane version of this reasoning, it generalizes across all the cases. I am not locked into warfare with things that demand$6 instead of $5. I do not go around figuring out how to invert their utility function for purposes of threatening them back - 'destroy all utility-function inverters (but do not invert their own utility functions)' was my guessed commandment that would be taught to kids in dath ilan, because you don't want reality to end up full of utilityfunction inverters. From the beginning, I invented timeless decision theory because of being skeptical that two perfectly sane and rational hyperintelligent beings with common knowledge about each other would have no choice but mutual defection in the oneshot prisoner's dilemma. I suspected they would be able to work out Something Else Which Is Not That, so I went looking for it myself. I suggest cultivating the same suspicion with respect to the imagination of commitment races between Ultimatum Game players, in which whoever manages to make some move logically first walks away with$9 and the other poor agent can only take \$1 - especially if you end up reasoning that the computationally weaker agent should be the winner.

“Pivotal Act” Intentions: Negative Consequences and Fallacious Arguments

Idea A (for “Alright”): Humanity should develop hardware-destroying capabilities — e.g., broadly and rapidly deployable non-nuclear EMPs — to be used in emergencies to shut down potentially-out-of-control AGI situations, such as an AGI that has leaked onto the internet, or an irresponsible nation developing AGI unsafely.

Sounds obviously impossible in real life, so how about you go do that and then I'll doff my hat in amazement and change how I speak of pivotal acts. Go get gain-of-function banned, even, that should be vastly simpler. Then we can talk about doing the much more difficult thing. Otherwise it seems to me like this is just a fairytale about what you wouldn't need to do in a brighter world than this.

Late 2021 MIRI Conversations: AMA / Discussion

Yes, it was an intentional part of the goal.

If there were any possibility of surviving the first AGI built, then it would be nice to have AGI projects promising to do something that wouldn't look like trying to seize control of the Future for themselves, when, much later (subjectively?), they became able to do something like CEV.  I don't see much evidence that they're able to think on the level of abstraction that CEV was stated on, though, nor that they're able to understand the 'seizing control of the Future' failure mode that CEV is meant to prevent, and they would not understand why CEV was a solution to the problem while 'Apple pie and democracy for everyone forever!' was not a solution to that problem.  If at most one AGI project can understand the problem to which CEV is a solution, then it's not a solution to races between AGI projects.  I suppose it could still be a solution to letting one AGI project scale even when incorporating highly intelligent people with some object-level moral disagreements.

Shah and Yudkowsky on alignment failures

Late 2021 MIRI Conversations: AMA / Discussion

I would "destroy the world" from the perspective of natural selection in the sense that I would transform it in many ways, none of which were making lots of copies of my DNA, or the information in it, or even having tons of kids half resembling my old biological self.

From the perspective of my highly similar fellow humans with whom I evolved in context, they'd get nice stuff, because "my fellow humans get nice stuff" happens to be the weird unpredictable desire that I ended up with at the equilibrium of reflection on the weird unpredictable godshatter that ended up inside me, as the result of my being strictly outer-optimized over millions of generations for inclusive genetic fitness, which I now don't care about at all.

Paperclip-numbers do well out of paperclip-number maximization. The hapless outer creators of the thing that weirdly ends up a paperclip maximizer, not so much.

More Christiano, Cotra, and Yudkowsky on AI progress

Want to +1 that a vaguer version of this was my own rough sense of RNNs vs. CNNs vs. Transformers.

Biology-Inspired AGI Timelines: The Trick That Never Works

As much as Moravec-1988 and Moravec-1998 sound like they should be basically the same people, a decade passed between them, and I'd like to note that Moravec may legit have been making an updated version of his wrong argument in 1998 compared to 1988 after he had a chance to watch 10 more years pass and make his earlier prediction look less likely.

Biology-Inspired AGI Timelines: The Trick That Never Works

It does fit well there, but I think it was more inspired by the person I met who thought I was being way too arrogant by not updating in the direction of OpenPhil's timeline estimates to the extent I was uncertain.

Yudkowsky and Christiano discuss "Takeoff Speeds"

Maybe another way of phrasing this - how much warning do you expect to get, how far out does your Nope Vision extend?  Do you expect to be able to say "We're now in the 'for all I know the IMO challenge could be won in 4 years' regime" more than 4 years before it happens, in general?  Would it be fair to ask you again at the end of 2022 and every year thereafter if we've entered the 'for all I know, within 4 years' regime?

Added:  This question fits into a larger concern I have about AI soberskeptics in general (not you, the soberskeptics would not consider you one of their own) where they saunter around saying "X will not occur in the next 5 / 10 / 20 years" and they're often right for the next couple of years, because there's only one year where X shows up for any particular definition of that, and most years are not that year; but also they're saying exactly the same thing up until 2 years before X shows up, if there's any early warning on X at all.  It seems to me that 2 years is about as far as Nope Vision extends in real life, for any case that isn't completely slam-dunk; when I called upon those gathered AI luminaries to say the least impressive thing that definitely couldn't be done in 2 years, and they all fell silent, and then a single one of them named Winograd schemas, they were right that Winograd schemas at the stated level didn't fall within 2 years, but very barely so (they fell the year after).  So part of what I'm flailingly asking here, is whether you think you have reliable and sensitive Nope Vision that extends out beyond 2 years, in general, such that you can go on saying "Not for 4 years" up until we are actually within 6 years of the thing, and then, you think, your Nope Vision will actually flash an alert and you will change your tune, before you are actually within 4 years of the thing.  Or maybe you think you've got Nope Vision extending out 6 years?  10 years?  Or maybe theorem-proving is just a special case and usually your Nope Vision would be limited to 2 years or 3 years?

This is all an extremely Yudkowskian frame on things, of course, so feel free to reframe.

Christiano, Cotra, and Yudkowsky on AI progress

I also think human brains are better than elephant brains at most things - what did I say that sounded otherwise?