
AI ALIGNMENT FORUM


Rafael Harth — AI Alignment Forum


I'm an independent researcher currently working on a sequence of posts about consciousness. You can send me anonymous feedback here: https://www.admonymous.co/rafaelharth.



Top post

Inner Alignment: Explain like I'm 12 Edition

(This is an unofficial explanation of Inner Alignment based on the Miri paper Risks from Learned Optimization in Advanced Machine Learning Systems (which is almost identical to the LW sequence) and the Future of Life podcast with Evan Hubinger (Miri/LW). It's meant for anyone who found the sequence too long/challenging/technical to read.)

Note that bold and italics mean "this is a new term I'm introducing," whereas underline and italics are used for emphasis.

What is Inner Alignment?

Let's start with an abridged guide to how Deep Learning works:

1. Choose a problem
2. Decide on a space of possible solutions
3. Find a good solution from that space

If the problem is "find a tool that can look at any image and decide whether or not it contains a cat," then each conceivable set of rules for answering this question (formally, each function from the set of all pixels to the set {yes,no}) defines one solution. We call each such solution a model. The space of possible models is depicted below.

Since that's all possible models, most of them are utter nonsense. Pick a random one, and you're as likely to end up with a car-recognizer as a cat-recognizer, but far more likely with an algorithm that does nothing we can interpret. Note that even the examples I annotated aren't typical: most models would be more complex while still doing nothing related to cats. Nonetheless, somewhere in there is a model that would do a decent job on our problem. In the above, that's the one that says, "I look for cats."

How does ML find such a model? One way that does not work is trying out all of them. That's because the space is too large: it might contain over 10^1000000 candidates. Instead, there's this thing called Stochastic Gradient Descent (SGD). Here's how it works:

SGD begins with some (probably terrible) model and then proceeds in steps. In each step, it switches to another model that is "close" and hopefully a little better. Eventually, it stops and outputs the most...
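The step-by-step search the excerpt attributes to SGD can be illustrated with a minimal sketch. It is not taken from the post: it swaps the cat-recognizer for a much smaller problem (fitting a line to points), and the function name `sgd_fit_line`, the toy data, the learning rate, and the step count are all assumptions chosen for illustration. Here a "model" is just a pair of numbers `(w, b)`, and each step moves to a nearby pair that does slightly better on one randomly chosen example.

```python
import random

def sgd_fit_line(points, steps=2000, lr=0.01, seed=0):
    """Fit y = w * x + b by stochastic gradient descent on squared error.

    Hypothetical helper for illustration; all defaults are assumptions.
    """
    rng = random.Random(seed)
    w, b = 0.0, 0.0  # the starting model (probably terrible)
    for _ in range(steps):
        x, y = rng.choice(points)  # "stochastic": one random example per step
        error = (w * x + b) - y
        # Gradient of 0.5 * error**2 with respect to w and b.
        # Moving against it yields a "close" model that fits this example better.
        w -= lr * error * x
        b -= lr * error
    return w, b

# Hypothetical toy data generated from the line y = 2x + 1.
data = [(x, 2.0 * x + 1.0) for x in range(-5, 6)]
w, b = sgd_fit_line(data)
```

With this noise-free data, the returned `w` and `b` should end up close to 2 and 1. Real deep learning differs mainly in scale: the model has millions of numbers instead of two, and the space of models is correspondingly larger.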


Clarifying Factored Cognition

This post is sort of an intermediate between parts 1 and 2 of the sequence. It makes three points that I think people tend to get wrong. 1. Factored Cognition is about reducing hard problems to human judgment to achieve outer alignment. It's possible to lose sight of why Factored...

Dec 13, 2020•23

Traversing a Cognition Space

(This post is part of a sequence that's meant to be read in order; see the preface.) Post #1 was about developing and justifying a formalism for Factored Cognition. Now that we have this formalism, this post is about doing as much with it as possible. 1. Debate Trees Recall...

Dec 7, 2020•17

Idealized Factored Cognition

(This post is part of a sequence that's meant to be read in order; see the preface.) 1. HCH and Ideal Debate Recall from post #-2 that we have two perspectives on stock IDA.[1] One is that of a human with access to a model, the other is that of...

Nov 30, 2020•34

Preface to the Sequence on Factored Cognition

Factored Cognition is primarily studied by Ought, the same organization that was partially credited for implementing the interactive prediction feature. Ought is an organization with at least five members who have worked on the problem for several years. I am a single person who just finished a master's degree. The...

Nov 30, 2020•35

Hiding Complexity

1. The Principle Suppose you have some difficult cognitive problem you want to solve. What is the difference between (1) making progress on the problem by thinking about it for an hour and (2) solving a well-defined subproblem whose solution is useful for the entire problem? (Finding a good characterization...

Nov 20, 2020•29

A guide to Iterated Amplification & Debate

This post is about two proposals for aligning AI systems in a scalable way: * Iterated Distillation and Amplification (often just called 'Iterated Amplification'), or IDA for short,[1] is a proposal by Paul Christiano. * Debate is an IDA-inspired proposal by Geoffrey Irving. This post is written to be as...

Nov 15, 2020•76

Inner Alignment: Explain like I'm 12 Edition

(This is an unofficial explanation of Inner Alignment based on the Miri paper Risks from Learned Optimization in Advanced Machine Learning Systems (which is almost identical to the LW sequence) and the Future of Life podcast with Evan Hubinger (Miri/LW). It's meant for anyone who found the sequence too long/challenging/technical...

Aug 1, 2020•189
