Ben Pace

I'm an admin of this site; I work full-time on trying to help people on LessWrong refine the art of human rationality.



AI Alignment Writing Day 2019
AI Alignment Writing Day 2018




Without specific countermeasures, the easiest path to transformative AI likely leads to AI takeover

It's certainly the case that the resource disparity is enormous. Perhaps you have more fleshed-out models of what fights between different intelligence levels look like, and how easy it is to defend against those with vastly fewer resources, but I don't. So while I would feel confident in saying that an army with a billion soldiers will consider a head-to-head battle with an army of one hundred soldiers barely a nuisance, I don't feel as confident in saying that an AGI with a trillion times as much compute will consider a smaller AGI foe barely a nuisance.

Anyway, I don't have anything smarter to say on this, so by default I'll drop the thread here (you're perfectly welcome to reply further).

(Added 9 days later: I want to note that while I think it's unlikely that this less well-resourced AGI would be an existential threat, I think the only thing I have to establish for this argument to go through is that the cost of the threat is notably higher than the cost of killing all the humans. Sadly I find it confusing to estimate the cost of the threat, even if it's small, and so it currently seems possible to me that the cost will end up many orders of magnitude higher than the cost of killing them.)

Without specific countermeasures, the easiest path to transformative AI likely leads to AI takeover

Yeah, I'm not sure if I see that. Some of the first solutions I come up with seem pretty complicated — like a global government that prevents people from building computers, or building an AGI to oversee Earth in particular and ensure we never build computers (my assumption is that building such an AGI is a very difficult task). In particular it seems like it might be very complicated to neutralize us while carving out lots of space for allowing us the sorts of lives we find valuable, where we get to build our own little societies and so on. And the easy solution is always to just eradicate us, which can surely be done in less than a day.

Without specific countermeasures, the easiest path to transformative AI likely leads to AI takeover

it seems totally plausible to me that... the AI will leave Earth alone, since (as you say) it would be very cheap for it to do so.

Counterargument: the humans may build another AGI that breaks out and poses an existential threat to the first AGI. 

My guess is the first AGI would want to neutralize our computational capabilities in a bunch of ways.

On how various plans miss the hard bits of the alignment challenge

Hm? It's as Nate says in the quote. It's the same type of problem as humans inventing birth control out of distribution. If you have an alternative proposal for how to build a diamond-maximizer, you can specify that for a response, but the commonly discussed idea of "train on examples of diamonds" will fail at inner alignment, and it will just optimize for diamonds in a particular setting and then elsewhere do crazy other things that look like all kinds of white noise to you.

Also "expect this to fail" already seems to jump the gun. Who has a proposal for successfully building an AGI that can do this, other than saying gradient-descent will surprise us with one?

Let's See You Write That Corrigibility Tag
  • I think that corrigibility is more likely to be a crisp property amongst systems that perform well-as-evaluated-by-you. I think corrigibility is only likely to be useful in cases like this where it is crisp and natural.

Can someone explain to me what this crispness is?

As I'm reading Paul's comment, there's an amount of optimization for human reward that breaks our rating ability. This is a general problem for AI, for the fundamental reason that as we increase an AI's optimization power, it gets better at the task, but it also gets better at breaking my rating ability (which in powerful systems can lead to an overpowering of whose values are getting optimized in the universe).

Then there's this idea that as you approach breaking my rating ability, the rating will always fall off, leaving a pool of undesirability (in a high-dimensional action-space) that groups around doing a task well/poorly, that separates it from doing a task in a way that breaks my rating ability.

Is that what this crispness is? This little pool of rating fall-off?

If yes, it's not clear to me why this little pool that separates the AI from MASSIVE VALUE and TAKING OVER THE UNIVERSE is able to save us. I don't know if the pool always exists around the action space, and to the extent it does exist I don't know how to use its existence to build a powerful optimizer that stays on one side of the pool.

Though Paul isn't saying he knows how to do that. He's saying that there's something really useful about it being crisp. I guess that's what I want to know. I don't understand the difference between "corrigibility is well-defined" and "corrigibility is crisp". Insofar as it's not a literally incoherent idea, there is some description of what behavior is in the category and what isn't. Then there's this additional little pool property, where not only can you list what's in and out of the definition, but the ratings go down a little before spiking when you leave the list of things in the definition. Is Paul saying that this means it's a very natural and simple concept to design a system to stay within?

Let's See You Write That Corrigibility Tag

Minor clarification: This doesn't refer to re-writing the LW corrigibility tag. I believe a tag is a reply in glowfic, where each author responds with the next tag, i.e. the next bit of the story, with an implied "tag – now you're it!" directed at the other author.

“Pivotal Act” Intentions: Negative Consequences and Fallacious Arguments

Just as a related idea, in my mind, I often do a kind of thinking that HPMOR!Harry would call “Hufflepuff Bones”, where I look for ways a problem is solvable in physical reality at all, before considering ethical and coordination and even much in the way of practical concerns.

AGI Ruin: A List of Lethalities

Thanks, this story is pretty helpful (to my understanding).
