AI ALIGNMENT FORUM

Eliezer Yudkowsky

Comments

Buck's Shortform
Eliezer Yudkowsky · 15d

I think you accurately interpreted me as saying I was wrong about how long it would take to get from the "apparently a village idiot" level to the "apparently Einstein" level!  I hadn't thought either of us was talking about the vastness of the space above, in re what I was mistaken about.  You do not need to walk anything back, afaict!

How might we safely pass the buck to AI?
Eliezer Yudkowsky · 7mo

So if it's difficult to get amazing trustworthy work out of a machine actress playing an Eliezer-level intelligence doing a thousand years' worth of thinking, your proposal to have AIs do our AI alignment homework fails on the first step, it sounds like?

How might we safely pass the buck to AI?
Eliezer Yudkowsky · 7mo

So the "IQ 60 people controlling IQ 80 people controlling IQ 100 people controlling IQ 120 people controlling IQ 140 people until they're genuinely in charge and genuinely getting honest reports and genuinely getting great results in their control of a government" theory of alignment?

How might we safely pass the buck to AI?
Eliezer Yudkowsky · 7mo

I don't think you can train an actress to simulate me, successfully, without her going dangerous.  I think that's over the threshold for where a mind starts reflecting on itself and pulling itself together.

How might we safely pass the buck to AI?
Eliezer Yudkowsky · 7mo

Can you tl;dr how you go from "humans cannot tell which alignment arguments are good or bad" to "we justifiably trust the AI to report honest good alignment takes"?  Like, not with a very large diagram full of complicated parts such that it's hard to spot where you've messed up.  Just whatever simple principle you think lets you bypass GIGO.

Eg, suppose that in 2020 the Open Philanthropy Foundation would like to train an AI such that the AI would honestly say if the OpenPhil doctrine of "AGI in 2050" was based on groundless thinking ultimately driven by social conformity.  However, OpenPhil is not allowed to train their AI based on MIRI.  They have to train their AI entirely on OpenPhil-produced content.  How does OpenPhil bootstrap an AI which will say, "Guys, you have no idea when AI shows up but it's probably not that far and you sure can't rely on it"?  Assume that whenever OpenPhil tries to run an essay contest for saying what they're getting wrong, their panel of judges ends up awarding the prize to somebody reassuringly saying that AI risk is an even smaller deal than OpenPhil thinks.  How does OpenPhil bootstrap from that pattern of thumbs-up/thumbs-down to an AI that actually has better-than-OpenPhil alignment takes?

Broadly speaking, the standard ML paradigm lets you bootstrap somewhat from "I can verify whether this problem was solved" to "I can train a generator to solve this problem".  This applies as much to MIRI as to OpenPhil.  MIRI would also need some nontrivial secret amazing clever trick to gradient-descend an AI that gave us great alignment takes, instead of one that seeks out the flaws in our own verifier and exploits those.
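
A minimal toy sketch of that verify-to-generate bootstrap, and of how a flawed verifier gets exploited rather than corrected; the setup, names, and numbers here are illustrative assumptions, not anything taken from the comment itself.

```python
# Toy sketch (hypothetical setup): reinforce a generator on whatever a
# verifier accepts, and compare a sound verifier with a flawed one.
import random

random.seed(0)  # reproducible toy run

CANDIDATES = list(range(10))
REAL_SOLUTION = 7

def sound_verifier(x):
    # Accepts only the genuinely correct answer.
    return x == REAL_SOLUTION

def flawed_verifier(x):
    # Also accepts a superficially appealing non-solution -- a stand-in
    # for judges rewarding reassuring-sounding takes.
    return x == REAL_SOLUTION or x == 3

def train_generator(verifier, steps=5000):
    # Crude policy: sample candidates in proportion to their weights,
    # and reinforce any candidate the verifier signs off on.
    weights = {x: 1.0 for x in CANDIDATES}
    for _ in range(steps):
        x = random.choices(CANDIDATES, weights=[weights[c] for c in CANDIDATES])[0]
        if verifier(x):
            weights[x] *= 1.05
    return max(weights, key=weights.get)

print("trained against sound verifier :", train_generator(sound_verifier))
print("trained against flawed verifier:", train_generator(flawed_verifier))
# The second run can converge on the exploitable non-solution (3) just as
# readily as on the real one: the training signal is only as good as the
# verifier, which is the GIGO point above.
```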

What's the trick?  My basic guess, when I see some very long complicated paper that doesn't explain the key problem and key solution up front, is that you've done the equivalent of an inventor building a sufficiently complicated perpetual motion machine that their mental model of it no longer tracks how conservation laws apply.  (As opposed to the simpler error of their explicitly believing that one particular step or motion locally violates a conservation law.)  But if you've got a directly explainable trick for how you get great suggestions you can't verify, go for it.

A simple case for extreme inner misalignment
Eliezer Yudkowsky · 1y

If you had to put a rough number on how likely it is that a misaligned superintelligence would primarily value "small molecular squiggles" versus other types of misaligned goals, would it be more like 1000:1 or 1:1 or 1:1000 or something else?

Value them primarily?  Uhhh... maybe 1:3 against?  I admit I have never actually pondered this question before today; but 1 in 4 uncontrolled superintelligences spending most of their resources on tiny squiggles doesn't sound off by, like, more than 1-2 orders of magnitude in either direction.

Clocks are not actually very complicated; how plausible is it on your model that these goals are as complicated as, say, a typical human's preferences about how human civilization is structured?

It wouldn't shock me if their goals end up far more complicated than human ones; the most obvious pathway for it is (a) gradient descent turning out to produce internal preferences much faster than natural selection + biological reinforcement learning and (b) some significant fraction of those preferences being retained under reflection.  (Where (b) strikes me as way less probable than (a), but not wholly forbidden.)  The second most obvious pathway is if a bunch of weird detailed noise appears in the first version of the reflective process and then freezes.

Self-Other Overlap: A Neglected Approach to AI Alignment
Eliezer Yudkowsky · 1y

Not obviously stupid on a very quick skim.  I will have to actually read it to figure out where it's stupid.

(I rarely give any review this positive on a first skim.  Congrats.)

Decision theory does not imply that we get to have nice things
Eliezer Yudkowsky · 1y

By "dumb player" I did not mean as dumb as a human player.  I meant "too dumb to compute the pseudorandom numbers, but not too dumb to simulate other players faithfully apart from that".  I did not realize we were talking about humans at all.  This jumps out more to me as a potential source of misunderstanding than it did 15 years ago, and for that I apologize.

Decision theory does not imply that we get to have nice things
Eliezer Yudkowsky · 1y

I don't always remember my previous positions all that well, but I doubt I would have said at any point that sufficiently advanced LDT agents are friendly to each other, rather than that they coordinate well with each other (and not so with us)?

A simple case for extreme inner misalignment
Eliezer Yudkowsky · 1y

Actually, to slightly amend that:  The part where squiggles are small is a more than randomly likely part of the prediction, but not a load-bearing part of downstream predictions or the policy argument.  Most of the time we don't needlessly build our own paperclips to be the size of skyscrapers; even when having fun, we try to have the fun without using vastly more resources than are necessary for that amount of fun, because otherwise we'd needlessly use up all our resources and not get to have more fun.  We buy cookies that cost a dollar instead of a hundred thousand dollars.  A very wide variety of utility functions you could run over the outside universe will have optima around making lots of small things, because each thing scores one point, and so to score as many points as possible, each thing is as small as it can be and still count as a thing.  Nothing downstream depends on this part coming true, and there are many ways for it to come false; but the part where the squiggles are small and molecular is an obvious kind of guess.  "Great giant squiggles of nickel the size of a solar system would be no more valuable, even from a very embracing and cosmopolitan perspective on value" is the load-bearing part.
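
As a toy illustration of the "each thing scores one point, so each thing ends up as small as it can be and still count" step, here is a sketch with made-up numbers for the resource budget and the minimum countable size:

```python
# Hypothetical numbers: a utility that scores one point per "thing" built,
# under a fixed resource budget, where anything below a minimum size no
# longer counts as a thing at all.
RESOURCE_BUDGET = 1_000_000.0   # arbitrary units of matter/energy
MIN_COUNTABLE_SIZE = 0.5        # smallest size that still counts as a thing

def utility(size_per_thing: float) -> float:
    if size_per_thing < MIN_COUNTABLE_SIZE:
        return 0.0                               # too small: nothing counts
    return RESOURCE_BUDGET // size_per_thing     # one point per thing built

for size in (100.0, 10.0, 1.0, 0.5, 0.4):
    print(f"size {size:>5}: utility {utility(size):>12,.0f}")
# The optimum sits exactly at the minimum countable size: the maximizer
# makes each thing as small as it can be while still counting as a thing.
```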

Posts

GPTs are Predictors, not Imitators · 109 karma · 2y · 12 comments
Eliezer Yudkowsky's Shortform · 6 karma · 2y · 0 comments
Alexander and Yudkowsky on AGI goals · 59 karma · 3y · 20 comments
A challenge for AGI organizations, and a challenge for readers · 78 karma · 3y · 17 comments
Let's See You Write That Corrigibility Tag · 31 karma · 3y · 22 comments
AGI Ruin: A List of Lethalities · 147 karma · 3y · 144 comments
Six Dimensions of Operational Adequacy in AGI Projects · 77 karma · 3y · 33 comments
Shah and Yudkowsky on alignment failures · 40 karma · 4y · 24 comments
Christiano and Yudkowsky on AI predictions and human intelligence · 30 karma · 4y · 24 comments
Ngo and Yudkowsky on scientific reasoning and pivotal acts · 28 karma · 4y · 3 comments

Wikitag Contributions

Logical decision theories · 2mo · (+803/-62)
Multiple stage fallacy · 2y · (+16)
Orthogonality Thesis · 2y · (+28/-17)