This account exists only for archival purposes.


Sorted by New

Wiki Contributions


We are on track to build many superhuman AI systems. Unless something unexpectedly good happens, eventually we will build one that has a failure of inner alignment. And then it will kill us all. Does the probability of any given system failing inner alignment really matter?

Yes, because if the first superhuman AGI is aligned, and if it performs a pivotal act to prevent misaligned AGI from being created, then we will avert existential catastrophe.

If there is a 99.99% chance of that happening, then we should be quite sanguine about AI x-risk. On the other hand, if there is only a 0.01% chance, then we should be very worried.

I don't know if anyone still reads comments on this post from over a year ago. Here goes nothing.

I am trying to understand the argument(s) as deeply and faithfully as I can. These two sentences from Section B.2 stuck out to me as the most important in the post (from the point of view of my understanding):

...outer optimization even on a very exact, very simple loss function doesn't produce inner optimization in that direction.


...on the current optimization paradigm there is no general idea of how to get particular inner properties into a system, or verify that they're there, rather than just observable outer ones you can run a loss function over.

My first question is: supposing this is all true, what is the probability of failure of inner alignment? Is it 0.01%, 99.99%, 50%...? And how do we know how likely failure is?

It seems like there is a gulf between "it's not guaranteed to work" and "it's almost certain to fail". 

I meant "extract" more figuratively than literally. For example, GPT-4 seems to have acquired some ability to do moral reasoning in accordance with human values. This is one way to (very indirectly) "extract" information from the human brain.

Extract from the brain into, say, weights in an artificial neural network, lines of code, a natural language "constitution", or something of that nature.

...I think the human brain’s intrinsic-cost-like-thing is probably hundreds of lines of pseudocode, or maybe low thousands, certainly not millions. (And the part that’s relevant for AGI is just a fraction of that.) Unfortunately, I also think nobody knows what those lines are. I would feel better if they did.

So, the human brain's pseudo-intrinsic cost is not intractably complex, on your view, but difficult to extract.

Now, it doesn’t immediately follow that the AI will actually want to start buying chair-straps and heroin, for a similar reason as why I personally am not trying to get heroin right now.

This seems important to me. What is the intrinsic cost in a human brain like mine or yours? Why don’t humans have an alignment problem (e.g. if you radically enhanced human intelligence, you wouldn’t produce a paperclip maximiser)?

Maybe the view of alignment pessimists is that the paradigmatic human brain’s intrinsic cost is intractably complex. I don’t know. I would like more clarity on this point.