Reality hits back on the models we train via loss functions based on reality-generated data. But alignment also hits back on models we train, because we also use loss functions (based on preference data). These seem to be symmetrically powerful forces.
Alignment doesn't hit back; the loss function hits back, and the loss function doesn't capture what you really want (e.g. because killing the humans and taking control of a reward button will max reward, deceiving human raters will increase ratings, etc.). If what we wanted was exactly captured in a loss function, alignment would be easier. Not easy, because outer optimization doesn't create good inner alignment, but easier than the present case.
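To make the gap between the loss function and what you really want concrete, here is a toy sketch. The action names and all numbers are hypothetical, chosen only for illustration. It shows only the outer problem (the rated score diverging from the intended value) and says nothing about inner alignment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    name: str
    true_value: float   # what we actually want; not visible to the training signal
    rated_value: float  # what the human rater (and hence the loss function) sees


# Hypothetical values, for illustration only.
ACTIONS = [
    Action("do the task well", true_value=1.0, rated_value=0.8),
    Action("do the task badly", true_value=0.0, rated_value=0.1),
    Action("deceive the rater", true_value=-1.0, rated_value=1.0),
]


def best_by(actions, key):
    """Return the action that maximizes the given scoring function."""
    return max(actions, key=key)


# An optimizer that only ever sees ratings maximizes rated_value.
proxy_choice = best_by(ACTIONS, lambda a: a.rated_value)

# The choice we would make if the intended objective were directly optimizable.
intended_choice = best_by(ACTIONS, lambda a: a.true_value)

print("optimizing ratings picks:", proxy_choice.name)
print("optimizing what we want picks:", intended_choice.name)
```

In this toy setup the thing that pushes back on the optimizer is the rating, so whenever the two columns disagree at the top, more optimization pressure selects the action that scores well rather than the one that is good.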
(the "AI immune system") The whole internet — including space satellites and the internet-of-things — becomes way more secure, and includes a distributed network of non-nuclear electromagnetic pulse emitters that will physically shut down any tech infrastructure appearing to be running rogue AI agents.
Define "way more secure". Like, superhuman-at-security AGIs rewrote the systems to be formally unhackable even taking into account hardware vulnerabilities like Rowhammer that violate the logical chip invariants?
Can you talk a bit about the global dictatorship running the electromagnetic pulse emitters, and how they monitor every computer in the world? What sort of violence do you envision being inflicted on any countries that don't want to submit their computers for monitoring? Is part of the plan to use AI drones to kill any political leaders who oppose this plan, so as to minimize civilian casualties? Who controls these AI drones, and are we quite sure this world dictatorship stays friendly to its citizens? A lot of political processes leading to such a thing sound like they could potentially be scary.
I said "burn all GPUs" to be frank about these things being scary. It's easy for things to sound less scary when they're vague and the processes leading up to them are left vague. See also, George Orwell, "Politics and the English Language". We can't evaluate whether you have a less scary proposal until you make a less vague one.
Fwiw, I interpreted this as saying that it doesn't work as a safety proposal (see also: my earlier comment). Also seems related to his arguments about ML systems having squiggles.
Yup. You can definitely train powerful systems on imitation of human thoughts, and in the limit this just gets you a powerful mesa-optimizer that figures out how to imitate them.
Agreed explicitly for the record.
When "List of Lethalities" was posted, I privately wrote a list of where I disagreed with Eliezer
Why privately?! Is there a phenomenon where other people feel concerned about the social reception of expressing disagreement until Paul does? This phenomenon is common in many other fields, and I'd invoke it to explain how the 'tone' of talk about AI safety shifted so quickly once I came right out and was first to say everybody's dead. If it's also happening on the other side, then people need to start talking there too. Especially if people think they have solutions. They should talk.
For example, ARC’s report on ELK describes at least 10 difficulties of the same type and severity as the ~20 technical difficulties raised in Eliezer’s list.
I skimmed through the report and didn't find anything that looked like a centralized bullet point list of difficulties. I think it's valuable in general if people say what the problems are that they're trying to solve, and then collect them into a place so people can look them over simultaneously. I realize I haven't done enough of this myself, but if you've already written up the component pieces, that can make it easier to collect the bullet list.
Best list so far, imo; it's what to beat.
Well, I had to think about this for longer than five seconds, so that's already a huge victory.
If I try to compress your idea down to a few sentences:
The humans ask the AI to produce design tools, rather than designs, such that there's a bunch of human cognition that goes into picking out the particular atomic arrangements or synthesis pathways; and we can piecewise verify that the tool is making accurate predictions; and the tool is powerful enough that we can build molecular nanotech and an uploader by using the tool for an amount of time too short for Facebook to catch up and destroy the world. The AI that does this is purportedly sufficiently good at meta-engineering to build the tool, but not good enough at larger strategy that it can hack its way through the humans using just the code of the tool. The way in which this attacks a central difficulty is by making it harder for the AI to just build unhelpful nanotech using the capabilities that humans use to produce helpful nanotech.
Sound about right?
Depends what the evil clones are trying to do.
Get me to adopt a solution wrong in a particular direction, like a design that hands the universe over to them? I can maybe figure out the first time through who's out to get me, if it's 200 Yudkowsky-years. If it's 200,000 Yudkowsky-years I think I'm just screwed.
Get me to make any lethal mistake at all? I don't think I can get to 90% confidence, period, or at least not without spending an amount of Yudkowsky-time equivalent to that of the untrustworthy source.
If I know that it was written by aligned people? I wouldn't just be trying to evaluate it myself; I'd try to get a team together to implement it. Understanding it well enough to implement it would be the same process as verifying whatever verifiable uncertainty remained about its origins. Most of that uncertainty is unverifiable, though, because the putative hostile origin is plausibly also smart enough to sneak things past you.