Raymond Arnold

I've been a LessWrong organizer since 2011, with roughly equal focus on the cultural, practical and intellectual aspects of the community. My first project was creating the Secular Solstice and helping groups across the world run their own version of it. More recently I've been interested in improving my own epistemic standards and helping others to do so as well.


Curated. I think this domain of decision theory is easy to get confused in, and having a really explicit writeup of how it applies in the case of negotiating with AIs (or, failing to), seems quite helpful. I had had a vague understanding of the points in this post before, but feel much clearer about it now.

I tagged this "Pointers Problem" but am not 100% sure it's getting at the same thing. Curious if there's a different tag that feels more appropriate.

An angle I think is relevant here is that a sufficiently complex, "well founded" AI system is still going to be fairly difficult to understand. i.e. a large codebase, where everything is properly commented and labeled, might still have lots of unforeseen bugs and interactions the engineers didn't intend. 

So I think before you deploy a powerful "Well Founded" AI system, you'll probably still need a kind of generalized reverse-engineering/interpretability skill to explain how the entire process works in various test cases.

John's Why Not Just... sequence is a series of somewhat rough takes on a few of them. (though I think many of them are not written up super comprehensively)

Curated. I think shovel-ready projects that can help with alignment are quite helpful for the field, in particular right now when we have a bunch of smart people showing up, looking to contribute. 

Something I'm unsure about (commenting from my mod-perspective but not making a mod pronouncement) is how LW should relate to posts that lay out ideas that may advance AI capabilities. 

My current understanding is that all major AI labs have already figured out the Chinchilla results on their own, but that younger or less in-the-loop AI orgs may have needed to run experiments that took a couple months of staff time. This post was one of the most-read posts on LW this month, and was shared heavily around Twitter. It seems plausible to me that spreading these arguments speeds up AI timelines by 1-4 weeks on average.

It seems important to be able to talk about that and model the world, but I'm wondering if posts like this should live behind a "need to log-in" filter, maybe with a slight karma-gate, so that the people who end up reading it are at least more likely to be plugged into the LW ecosystem and are also going to get exposed to arguments about AI risk.

nostalgebraist, I'm curious how you would feel about that.

Curated. I'm not sure I endorse all the specific examples, but the general principles make sense to me as considerations to help guide alignment research directions.

FYI, I've found this concept useful in thinking, but I think "atomic" is a worse word than just saying "non-interruptible". When I'm explaining this to people I just say "unbounded, uninterruptible optimization". The word atomic only seems to serve to make people say "what's that?" and then I say "uninterruptible".

Mod note: I'm frontpaging this. It's a bit of an edge case (workshops definitely aren't timeless, but we have tended to frontpage prize/contest announcements for intellectual content)

I don't think the usual arguments apply as obviously here. "Maximal Diamond" is much simpler than most other optimization targets. It seems much easier to solve outer-alignment for – Diamond was chosen because it's a really simple molecule configuration to specify, and that just seems to be a pretty different scenario than most of the ones I've seen more detailed arguments for.

I'm partly confused about the phrasing "we have no idea how to do this." (which is stronger than "we don't currently have a plan for how to do this.")

But in the interests of actually trying to answer this sort of thing for myself instead of asking Nate/Eliezer to explain why it doesn't work, let me think through my own proposal of how I'd go about solving the problem, and see if I can think of obvious holes.

Problems currently known to me:

  1. Reward hijacking
  2. Point 19 in List of Lethalities ("there is no known way to use the paradigm of loss functions, sensory inputs, and/or reward inputs, to optimize anything within a cognitive system to point at particular things within the environment"). 
  3. Ontological updating (i.e. what exactly is a diamond?)
  4. New to me from this post: the most important capabilities advances may come from an inner process that isn't actually coupled to the reinforcement learning system. (I didn't really get this until reading this post and haven't finished thinking through the concept)

Main ingredients I'm imagining: (disclaimer: I'm a layman making a lot of informed guesses, wouldn't be surprised it

First, get a general agent, with limitations to prevent immediate fooming. Get to general intelligence via something like DeepMind's General Agents, this time starting from a language model that benefits from a lot of human concepts. My current belief is that you'd need to solve some major efficiency issues to do this with a reasonable amount of compute. If you have a Jupiter brain (as originally stipulated) I'm not sure it even requires new advances.

(Maybe scrub the language model of all references to ML/programming, initially. They'll be helpful eventually but maybe don't give the AGI a headstart on self-modification.)

Train it to maximize diamonds in many different environments. Starting with (relatively) modest amounts of compute, train it to make diamonds in different simulated and physical worlds. Initially, at subhuman intelligence, the AGI isn't expected to invent its own diamond-making technology. But it's presented with different physical-or-simulated tools that make things similar-but-non-identical to diamonds, and the tools have internal parts it can understand, and it's rewarded for choosing between them accurately. So it has to develop an understanding of physics.

(I think the outer-alignment goal here is to get it to advance at physics faster than self-modification, so that you can force it to learn ontological problems before it could get subverted by them). 

Some notable training-sets it needs to include:

  • digital worlds where the sub-atomic physics is different, such that it learns to preserve the diamond-configuration despite ontological confusion
  • its ability to parse what's going on in the digital worlds depends on sensors that are present in the digital world (also for physical world), and there are different arrays of sensors in different worlds. It's trained against situations where it has the ability to modify its sensors for simple reward hacking.
  • eventually it's taught programming/hardware know-how, and put in fairly simplified situations where the solution to its puzzle is to notice that it physically exists, and make changes to its hardware or software, but it doesn't directly hijack its own reward function.

(Note: thinking through all these specific details is pretty helpful for noticing how many steps are involved here. I think for this sort of plan to work you actually need a lot of different puzzles that are designed to be solvable with safe amounts of compute, so it doesn't just bulldoze past your training setup. Designing such puzzles seems pretty time-consuming. In practice I don't expect the Successfully Aligned "murder everyone and make diamonds forever" bot to be completed before "murder everyone and make some Random Other Thing Forever" bot.)

Even though my goal is a murder-bot-that-makes-diamonds-forever, I'm probably coupling all of this with attempts at corrigibility training, dealing with uncertainty, impact tracking, etc, to give myself extra time to notice problems. (i.e. if the machine isn't sure whether the thing it's making is diamond, it makes a little bit first, asks humans to verify that it's diamond, etc. Do similar training on "don't modify the input channel for 'was it actually diamond tho?'")
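To make the "make a little bit first, then ask" idea concrete, here's a very rough toy sketch. Everything in it is an assumption for illustration: the threshold, the batch sizes, and the `human_verifies` callback are all made up, and it says nothing about how you'd actually get a trained system to behave this way.

```python
# Hypothetical numbers, chosen only for illustration.
CONFIDENCE_THRESHOLD = 0.95  # above this, the agent treats its output as diamond
TRIAL_BATCH = 1              # "a little bit" made before asking for verification
FULL_BATCH = 100             # amount made once the output is trusted


def plan_production(confidence, human_verifies):
    """Return how many units get made in total.

    `confidence` is the agent's own estimate (0 to 1) that its output is diamond.
    `human_verifies` is a hypothetical callback standing in for the human check:
    it receives the trial batch size and returns True if the sample is diamond.
    """
    if confidence >= CONFIDENCE_THRESHOLD:
        return FULL_BATCH
    # Unsure: make a small sample and ask before scaling up.
    if human_verifies(TRIAL_BATCH):
        return TRIAL_BATCH + FULL_BATCH
    # Verification failed: stop after the trial batch.
    return TRIAL_BATCH
```

The part this toy version skips entirely is the hard part: the check only helps if the agent can't (or won't) tamper with whatever `human_verifies` stands for.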

Assuming those tricks all work and hold up under tons of optimization pressure, this all still leaves us with inner alignment, and point #4 on my list of known-to-me-concerns. "The most important capabilities advances may come from an inner process that isn't actually coupled to the reinforcement learning system."

And... okay actually this is a new thought for me, and I'm not sure how to think about it yet. I can see how it was probably meant to be included in the "confusingly pervasive consequentialism" concept, but I didn't get the "and therefore, impervious to gradient descent" argument till just now.

I'm out of time for now, will think about this more.
