x

What are the high-level approaches to AI alignment? — AI Alignment Forum

5

[ Question ]

What are the high-level approaches to AI alignment?

by Gordon Seidoh Worley

16th Jun 2020

1 min read

5

I'm writing a post comparing some high-level approaches to AI alignment in terms of their false positive risk. Trouble is, there's no standard agreement on what various high-level approaches to AI alignment there are today, either in terms of what constitutes these high-level approaches or where to draw the line in categorizing various specific approaches.

So, I'll open it up as a question to get some feedback before I get too far along. What do you consider to be the high-level approaches to AI alignment?

(I'll supply my own partial answer below.)

Frontpage

What are the high-level approaches to AI alignment?

2Gordon Seidoh Worley

1Gordon Seidoh Worley

2Gordon Seidoh Worley

3Ben Pace, the Vacationing Vagabond

1Gordon Seidoh Worley

2Ben Pace, the Vacationing Vagabond

1Gordon Seidoh Worley

1Gordon Seidoh Worley

New Answer

New Comment

3 Answers sorted by
top scoring

Jun 17, 2020

60

Rohin Shah's talk gives one taxonomy of approaches to AI alignment: https://youtu.be/AMSKIDEbjLY?t=1643 (During Q&A Rohin also mentions some other stuff)
- This podcast episode also talks about similar things: https://futureoflife.org/2019/04/11/an-overview-of-technical-ai-alignment-with-rohin-shah-part-1/
Wei Dai's success stories post is another way to organize the various approaches: https://www.lesswrong.com/posts/bnY3L48TtDrKTzGRb/ai-safety-success-stories
I started trying to organize AI alignment agendas myself a while back, but never got far: https://aiwatch.issarice.com/#agendas
This post by Jan Leike also has a list of agendas in the Outlook section: https://medium.com/@deepmindsafetyresearch/scalable-agent-alignment-via-reward-modeling-bf4ab06dfd84

[-]Gordon Seidoh Worley6y20

Thanks. Your post specifically is pretty helpful because it helps with one of the things that was tripping me up, which is what standard names people call different methods. Your names do a better job of capturing them than mine did.

Jun 16, 2020

30

You might be interested in this post I wrote recently that goes into significant detail on what I see as the major leading proposals for building safe advanced AI under the current machine learning paradigm.

[-]Gordon Seidoh Worley6y10

Actually this post was not especially helpful for my purpose and I should have explained why in advance because I anticipated someone would link it. Although it helpfully lays out a number of proposals people have made, it does more to work out what's going on with those proposals rather than find ways they can be grouped together (except incidentally). I even reread this post before posting this question and it didn't help me improve on the taxonomy I proposed, which I already had in mind as of a few months ago.

Gordon Seidoh Worley

Jun 16, 2020

20

My initial thought is that there are at least 3, which I'll give the follow names (with short explanations):

Iterated Distillation and Amplification (IDA)
- Build an AI, have it interact with a human, create a new AI based on the interaction of the human and the AI, and repeat until the AI is good enough or it reaches a fixed point and additional iterations don't change it.
Inverse Reinforcement Learning (IRL)
- Build an AI that tries to infer human values from observations and then acts based on those inferred values.
Decision Theorized Agent (DTA)
- Build an AI that uses a decision theory that causes it to make choices that will be aligned with human interests.

All of these are woefully underspecified, so improved summaries of these approaches that you think accurately explain these approaches also appreciated.

[-]Ben Pace, the Vacationing Vagabond6y30

I think the last one seems odd / doesn't make much sense. All agents have a decision theory, including RL-based agents, so it's not a distinctive approach.

If you were attempting to describe MIRI's work, remember that they're trying to understand basic concepts of agency better (meta level, object level), not in order to directly put the new concepts into the AI (in the same way current AIs do not always have a section for the 'probability theory' to be written in) but in order to be much less confused about what we're trying to do.

So if you want to describe MIRI's work, you'd call it "getting less confused about the basic concepts" and then later building an AI via a mechanism we'll figure out after getting less confused. Right now it's not engineering, it's science.

1Gordon Seidoh Worley6y

That's true, but there's a natural and historical relationship here with what was in the past termed "seed AI", even if this is not an approach anyone is actively pursuing, which is the kind of thing I was hoping to point at without using that outmoded term.

3Rob Bensinger6y

I agree with Ben and Richard's summaries; see https://www.lesswrong.com/posts/uKbxi2EJ3KBNRDGpL/comment-on-decision-theory: [...] When I think about key distinctions and branching points in alignment, I usually think about things like: * Does the approach require human modeling? Lots of risks can be avoided if the system doesn't do human modeling, or if it only does small amounts of human modeling; but this constrains the options for value learning and learning-in-general. * Current ML is notoriously opaque. Different approaches try to achieve greater understanding and inspectability to different degrees and in different ways (e.g., embedded agency vs. MIRI's "new research directions" vs. the kind of work OpenAI Clarity does), or try to achieve alignment without needing to crack open the black box. * Is the goal to make a task-directed AGI system, vs. an open-ended optimizer? When you say "there's a natural and historical relationship here with what was in the past termed 'seed AI', even if this is not an approach anyone is actively pursuing", it calls to mind for me the transition from MIRI thinking about open-ended optimizers to instead treating task AGI as the place to start.

2Ben Pace, the Vacationing Vagabond6y

I'm not actually sure what you mean. I think 'seed AI' means something like 'first case in an iterative/recursive process' of self-improvement, which applies pretty well to the iterated amplification setup (which is a recursively self-improving AI) and lots of other examples that Evan wrote about in his 11-examples post. It still seems to me to be a pretty general term.

[-]Richard_Ngo6y30

I suspect that nobody is actually pursuing the third one as you've described it. Rather, my impression is that MIRI researchers tend to think of decision theory as a more fundamental problem in understanding AI, not directly related to human interests.

[-]Gordon Seidoh Worley6y10

Based on comments/links so far it seems I should revise the names and add a fourth:

IDA = IDA
IRL -> Ambitious Value Learning (AVL)
DTA -> Embedded Agency (EA)
+ Brain Emulation (BE)
- Build AI that either emulates how humans brains work or is bootstrapped from human brain emulations.

13