Raymond Arnold

I've been a LessWrong organizer since 2011, with roughly equal focus on the cultural, practical and intellectual aspects of the community. My first project was creating the Secular Solstice and helping groups across the world run their own version of it. More recently I've been interested in improving my own epistemic standards and helping others to do so as well.


NeurIPS ML Safety Workshop 2022

Mod note: I'm frontpaging this. It's a bit of an edge case (workshops definitely aren't timeless, but we have tended to frontpage prize/contest announcements for intellectual content).

On how various plans miss the hard bits of the alignment challenge

I don't think the usual arguments apply as obviously here. "Maximal Diamond" is much simpler than most other optimization targets, and seems much easier to solve outer alignment for – diamond was chosen because it's a really simple molecular configuration to specify. That just seems like a pretty different scenario from most of the ones I've seen more detailed arguments for.

I'm partly confused about the phrasing "we have no idea how to do this." (which is stronger than "we don't currently have a plan for how to do this.")

But in the interests of actually trying to answer this sort of thing for myself instead of asking Nate/Eliezer to explain why it doesn't work, let me think through my own proposal of how I'd go about solving the problem, and see if I can think of obvious holes.

Problems currently known to me:

  1. Reward hijacking
  2. Point 19 in List of Lethalities ("there is no known way to use the paradigm of loss functions, sensory inputs, and/or reward inputs, to optimize anything within a cognitive system to point at particular things within the environment"). 
  3. Ontological updating (i.e. what exactly is a diamond?)
  4. New to me from this post: the most important capabilities advances may come from an inner process that isn't actually coupled to the reinforcement learning system. (I didn't really get this until reading this post and haven't finished thinking through the concept)

Main ingredients I'm imagining (disclaimer: I'm a layman making a lot of informed guesses, and wouldn't be surprised if some of them are wrong):

First, get a general agent, with limitations to prevent immediate fooming. Get to general intelligence via something like DeepMind's generalist agents, this time starting from a language model that benefits from a lot of human concepts. My current belief is that you'd need to solve some major efficiency issues to do this with a reasonable amount of compute. If you have a Jupiter brain (as originally stipulated), I'm not sure it even requires new advances.

(Maybe scrub the language model of all references to ML/programming, initially. They'll be helpful eventually, but maybe don't give the AGI a head start on self-modification.)

Train it to maximize diamonds in many different environments. Starting with (relatively) modest amounts of compute, train it to make diamonds in different simulated and physical worlds. Initially, at subhuman intelligence, the AGI isn't expected to invent its own diamond-making technology. But it's presented with different physical-or-simulated tools that make things similar-but-non-identical to diamonds, the tools have internal parts it can understand, and it's rewarded for choosing between them accurately. So it has to develop an understanding of physics.

(I think the outer-alignment goal here is to get it to advance at physics faster than at self-modification, so that you can force it to confront ontological problems before it could get subverted by them.)

Some notable training sets it needs to include:

  • digital worlds where the sub-atomic physics is different, such that it learns to preserve the diamond configuration despite ontological confusion
  • worlds where its ability to parse what's going on depends on sensors present in that world (likewise for the physical world), with different arrays of sensors in different worlds. It's trained against situations where it has the ability to modify its sensors for simple reward hacking.
  • eventually it's taught programming/hardware know-how, and put in fairly simplified situations where the solution to its puzzle is to notice that it physically exists and make changes to its hardware or software, but where it doesn't directly hijack its own reward function.
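The multi-world setup above could be caricatured in code. This is purely my own toy illustration (every name and mechanic here is hypothetical, not from the comment): each world gets its own "physics" and its own diamond configuration, and the agent only scores by producing that world's diamond, so a memorized single answer can't transfer across worlds.

```python
import random

def make_world(seed):
    """A toy 'world': a hidden diamond configuration plus world-specific
    physics mapping actions to outcomes. Entirely illustrative."""
    rng = random.Random(seed)
    diamond = tuple(rng.randrange(4) for _ in range(4))  # this world's "diamond"

    def step(action):
        outcome = tuple((a + seed) % 4 for a in action)  # world-specific physics
        reward = 1.0 if outcome == diamond else 0.0
        return outcome, reward

    return step

def best_rewards(n_worlds=3, episodes=500, seed=0):
    """Trial-and-error baseline: best reward found in each world.
    A real training loop would use a learned policy, not random search."""
    rng = random.Random(seed)
    results = []
    for world in range(n_worlds):
        step = make_world(world)
        best = 0.0
        for _ in range(episodes):
            action = tuple(rng.randrange(4) for _ in range(4))
            _, reward = step(action)
            best = max(best, reward)
        results.append(best)
    return results
```

The point of varying `seed` per world is the point of the bullets above: the agent can't succeed by memorizing one action, it has to model each world's physics.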

(Note: thinking through all these specific details is pretty helpful for noticing how many steps are involved here. I think for this sort of plan to work you actually need a lot of different puzzles that are designed to be solvable with safe amounts of compute, so the AGI doesn't just bulldoze past your training setup. Designing such puzzles seems pretty time-consuming. In practice, I don't expect the Successfully Aligned "murder everyone and make diamonds forever" bot to be completed before the "murder everyone and make some Random Other Thing Forever" bot.)

Even though my goal is a murder-bot-that-makes-diamonds-forever, I'm probably coupling all of this with attempts at corrigibility training, dealing with uncertainty, impact tracking, etc., to give myself extra time to notice problems. (e.g. if the machine isn't sure whether the thing it's making is diamond, it makes a little bit first, asks humans to verify that it's diamond, etc. Do similar training on "don't modify the input channel for 'was it actually diamond tho?'")
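The "make a little bit first, then ask humans" check could be sketched as a simple gate. A minimal sketch, with every name here made up by me for illustration:

```python
def cautious_production(confidence, make_sample, human_confirms_diamond,
                        threshold=0.99):
    """Gate scaled-up production on either high confidence or a human check.

    confidence: the system's estimate that its output is diamond (0..1)
    make_sample: produces a small test batch
    human_confirms_diamond: routes the sample to a human verifier
    """
    if confidence >= threshold:
        return "proceed"
    # Uncertain: make a little bit first and ask a human before scaling up.
    sample = make_sample()
    if human_confirms_diamond(sample):
        return "proceed"
    return "halt"  # flag for review rather than bulldozing ahead
```

The "don't modify the input channel" part of the comment is the hard bit this sketch ignores: nothing here stops the system from tampering with `human_confirms_diamond` itself, which is exactly the kind of training pressure the comment says you'd need separately.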

Assuming those tricks all work and hold up under tons of optimization pressure, this all still leaves us with inner alignment, and point #4 on my list of known-to-me concerns: "The most important capabilities advances may come from an inner process that isn't actually coupled to the reinforcement learning system."

And... okay, actually this is a new thought for me, and I'm not sure how to think about it yet. I can see how it was probably meant to be included in the "confusingly pervasive consequentialism" concept, but I didn't get the "and therefore, impervious to gradient descent" argument till just now.

I'm out of time for now, will think about this more.

On how various plans miss the hard bits of the alignment challenge

Like, even simpler than the problem of an AGI that puts two identical strawberries on a plate and does nothing else, is the problem of an AGI that turns as much of the universe as possible into diamonds. This is easier because, while it still requires that we have some way to direct the system towards a concept of our choosing, we no longer require corrigibility. (Also, "diamond" is a significantly simpler concept than "strawberry" and "cellularly identical".)

It seems to me that we have basically no idea how to do this. We can train the AGI to be pretty good at building diamond-like things across a lot of training environments, but once it takes that sharp left turn, by default, it will wander off and do some other thing, like how humans wandered off and invented birth control.

Is there a writeup of where you expect this to fail? I recall this MIRI newsletter, but I think it also just asserted it was hard/impossible.

Is the difficulty just in "it's gonna hijack its own reward function", or is there more to it than that?

Six Dimensions of Operational Adequacy in AGI Projects

Curated. My sense is there is no existing AI company with adequate infrastructure for safely deploying AGI, and this is a pretty big deal. I like this writeup for laying out a bunch of considerations.

A few times in this article, Eliezer notes "it'd be great if we could get X, but the process of trying to get X would cause some bad consequences." I'd like to see further exploration/models of "given the state of the current world, which approaches are actually tractable?"

AGI Ruin: A List of Lethalities

Are you actually gonna remember the apostrophe?

AGI Ruin: A List of Lethalities

Curated. As previously noted, I'm quite glad to have this list of reasons written up. I like Robby's comment here which notes:

The point is not 'humanity needs to write a convincing-sounding essay for the thesis Safe AI Is Hard, so we can convince people'. The point is 'humanity needs to actually have a full and detailed understanding of the problem so we can do the engineering work of solving it'.

I look forward to other alignment thinkers writing up either their explicit disagreements with this list, or things that the list misses, or their own frame on the situation if they think something is off about the framing of this list.

AGI Ruin: A List of Lethalities

Note: I think there are a bunch of additional reasons for doom, surrounding "civilizational adequacy / organizational competence / societal dynamics". Eliezer briefly alluded to these, but AFAICT he's mostly focused on lethality that comes "early", and didn't address them much. My model of Andrew Critch has a bunch of concerns about doom that show up later, because there are a bunch of additional challenges you have to solve if AI doesn't dramatically win/lose early on (i.e. multi/multi dynamics and how they spiral out of control).

I know a bunch of people whose hope funnels through "We'll be able to carefully iterate on slightly-smarter-than-human-intelligences, build schemes to play them against each other, leverage them to make some progress on alignment that we can use to build slightly-more-advanced-safer-systems". (Let's call this the "Careful Bootstrap plan")

I do actually feel nonzero optimism about that plan, but when I talk to people who are optimistic about it, I sense a missing mood about the kind of difficulty involved here.

I'll attempt to write up some concrete things here later, but wanted to note this for now.

AGI Ruin: A List of Lethalities

I read an early draft of this awhile ago and am glad to have it publicly available. And I do think the updates in structure/introduction were worth the wait. Thanks!

Confused why a "capabilities research is good for alignment progress" position isn't discussed more

My sense is that Anthropic is somewhat oriented around this idea. I'm not sure if this is their actual plan or just some guesswork I read between the lines.

But I vaguely recall something like "develop capabilities that you don't publish, while also developing interpretability techniques which you do publish, and try to have a competitive edge on capabilities, which you then have some lead time to inspect via interpretability techniques, and practice alignment at various capability-scales."

(I may have just made this up while trying to steelman them to myself)

Six Dimensions of Operational Adequacy in AGI Projects

government-security-clearance-style screening

What does that actually involve?
