2018 Review Discussion

Towards a New Impact Measure

In which I propose a closed-form solution to low impact, increasing corrigibility and seemingly taking major steps to neutralize basic AI drives 1 (self-improvement), 5 (self-protectiveness), and 6 (acquisition of resources).

Previously: Worrying about the Vase: Whitelisting, Overcoming Clinginess in Impact Measures, Impact Measure Desiderata

To be used inside an advanced agent, an impact measure... must capture so much variance that there is no clever strategy whereby an advanced agent can produce some special type of variance that evades the measure.
~ Safe Impact Measure

If we have a safe impa…

This is my post.

How my thinking has changed

I've spent much of the last year thinking about the pedagogical mistakes I made here, and am writing the Reframing Impact sequence to fix them. While this post recorded my 2018-thinking on impact measurement, I don't think it communicated the key insights well. Of course, I'm glad it seems to have nonetheless proven useful and exciting to some people!

If I were to update this post, it would probably turn into a rehash of Reframing Impact. Instead, I'll just briefly state the argument as I would present it today.

…
Paul's research agenda FAQ

I think Paul Christiano’s research agenda for the alignment of superintelligent AGIs presents one of the most exciting and promising approaches to AI safety. After being very confused about Paul’s agenda, chatting with others about similar confusions, and clarifying with Paul many times over, I’ve decided to write a FAQ addressing common confusions around his agenda.

This FAQ is not intended to provide an introduction to Paul’s agenda, nor is it intended to provide an airtight defense. This FAQ only aims to clarify commonly misunderstood aspects of the agenda. Unless otherwise stated, all view…

The following is a basically unedited summary I wrote up on March 16 of my take on Paul Christiano’s AGI alignment approach (described in “ALBA” and “Iterated Distillation and Amplification”). Where Paul had comments and replies, I’ve included them below.


I see a lot of free variables with respect to what exactly Paul might have in mind. I've sometimes tried presenting Paul with my objections and then he replies in a way that locally answers some of my question but I think would make other difficulties worse. My global objection is thus something like, "I don't see any concrete…

This post is close in my mind to Alex Zhu's post Paul's research agenda FAQ. They each helped to give me many new and interesting thoughts about alignment. 

This post was maybe the first time I'd seen an actual conversation about Paul's work between two people who had deep disagreements in this area - where Paul wrote things, someone wrote an effort-post response, and Paul responded once again. Eliezer did it again in the comments of Alex's FAQ, which also was a big deal for me in terms of learning.

One of the most pleasing things about probability and expected utility theory is that there are many coherence arguments that suggest that these are the “correct” ways to reason. If you deviate from what the theory prescribes, then you must be executing a dominated strategy. There must be some other strategy that never does any worse than your strategy, but does strictly better than your strategy with certainty in at least one situation. There’s a good explanation of these arguments here.

We shouldn’t expect mere humans to be able to notice any failures of coherence in a superintelligent agent, …
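To make "dominated strategy" concrete, here is a minimal toy sketch (mine, not from the post) of the classic money-pump argument: an agent with cyclic preferences will pay a small fee for every trade it prefers, and so can be cycled back to its starting point arbitrarily many times, each time strictly poorer.

```python
# Toy money pump: an agent with cyclic preferences A > B > C > A pays a small
# fee for each swap it prefers, and ends up where it started with less money --
# a strategy dominated by simply refusing to trade.

FEE = 1.0

# For each item the agent might hold, the item it strictly prefers and will pay to get.
preferred_swap = {"B": "A", "C": "B", "A": "C"}

def run_money_pump(start_item: str, n_trades: int) -> float:
    """Return the agent's net money after n_trades preferred swaps."""
    item, money = start_item, 0.0
    for _ in range(n_trades):
        item = preferred_swap[item]  # the agent accepts the trade it prefers...
        money -= FEE                 # ...and pays the fee each time
    return money

# After any multiple of three trades the agent holds its original item,
# but is strictly worse off.
print(run_money_pump("A", 9))  # -9.0
```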

Rohin Shah: I pretty strongly agree with this review (and jtbc it was written without any input from me, even though Daniel and I are both at CHAI). Yeah, maybe I should say "coherence theorems" to be clearer about this? (Like, it isn't a theorem that I shouldn't give you a limitless number of dollars in return for nothing; maybe I think that you are more capable than me and fully aligned with me, and so you'd do a better job with my money. Or maybe I value your happiness, and the best way to purchase it is to give you money no strings attached.) Fwiw, I do in fact worry about goal-directedness, but (I think) I know what you mean. (For others, I think Daniel is referring to something like "the MIRI camp", though that is also not an accurate pointer, and it is true that I am outside that camp.) My responses to the questions:

1. The ones in Will humans build goal-directed agents? [https://www.lesswrong.com/posts/9zpT9dikrrebdq3Jf/will-humans-build-goal-directed-agents], but if you want arguments that aren't about humans, then I don't know.
2. Depends on the distribution over utility functions, the action space, etc., but e.g. if it uniformly selects a numeric reward value for each possible trajectory (state-action sequence) where the actions are low-level (e.g. human muscle control), astronomically low (see the sketch at the end of this thread).
3. That will probably be a good model for some (many?) powerful AI systems that humans build.
4. I don't know. (I think it depends quite strongly on the way in which we train powerful AI systems.)
5. Not likely at low levels of intelligence, plausible at higher levels of intelligence, but really the question is not specified enough.

it was written without any input from me

Well, I didn't consult you in the process of writing the review, but we've had many conversations on the topic which presumably have influenced how I think about the topic and what I ended up writing in the review.

DanielFilan: Sorry, I meant theorems taking 'no limitless dollar sink' as an axiom and deriving something interesting from that.
DanielFilan: Putting my cards on the table, this is my guess at the answers to the questions that I raise:

1. I don't know.
2. Low.
3. Frequent if it's an 'intelligent' one.
4. Relatively. You probably don't end up with systems that resist literally all changes to their goals, but you probably do end up with systems that resist most changes to their goals, barring specific effort to prevent that.
5. Probably.

That being said, I think that a better definition of 'goal-directedness' would go a long way in making me less confused by the topic.
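To put a rough number on Rohin's point 2 above, here is a back-of-the-envelope sketch (the action count and episode length are invented purely for illustration) of how many distinct trajectories a trajectory-level reward assignment ranges over.

```python
import math

# Back-of-the-envelope count (illustrative numbers only): with low-level
# actions, the number of distinct trajectories a reward could be assigned
# over is astronomical, so the single trajectory that a uniformly random
# assignment happens to rank highest is very unlikely to look like coherent
# pursuit of any simple goal.

actions_per_step = 100   # stand-in for a "human muscle control" action space
steps = 1_000            # episode length

digits = round(steps * math.log10(actions_per_step))  # log10(actions_per_step ** steps)
print(f"roughly 10^{digits} distinct trajectories")   # roughly 10^2000
```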
Bottle Caps Aren't Optimisers

Crossposted from my blog.

One thing I worry about sometimes is people writing code with optimisers in it, without realising that that's what they were doing. An example of this: suppose you were doing deep reinforcement learning, doing optimisation to select a controller (that is, a neural network that takes a percept and returns an action) that generated high reward in some environment. Alas, unknown to you, this controller actually did optimisation itself to select actions that score well according to some metric that so far has been closely related to your reward function. In such a scenario…
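As a concrete illustration of the scenario in the excerpt, here is a hedged toy sketch (the environment, metrics, and controller names are invented here, not taken from the post): the outer loop selects whichever controller earns the most reward, and one candidate happens to be an optimiser itself, searching over actions under its own internal metric that has so far tracked the reward.

```python
import random

ACTIONS = [-1.0, -0.5, 0.0, 0.5, 1.0]

def reward(action: float) -> float:
    return -(action - 0.5) ** 2      # what we actually care about

def internal_metric(action: float) -> float:
    return -(action - 0.45) ** 2     # the inner optimiser's own proxy goal

def reflex_controller(_percept):
    return random.choice(ACTIONS)    # no search inside: not an optimiser

def searching_controller(_percept):
    # Does optimisation itself: searches over actions for the best value of
    # its own metric, which so far has been closely related to our reward.
    return max(ACTIONS, key=internal_metric)

def evaluate(controller, episodes: int = 100) -> float:
    return sum(reward(controller(None)) for _ in range(episodes)) / episodes

# The outer selection picks the controller with higher average reward -- here
# the inner optimiser -- without ever inspecting whether it searches inside.
candidates = {"reflex": reflex_controller, "searcher": searching_controller}
print(max(candidates, key=lambda name: evaluate(candidates[name])))  # "searcher"
```

The selection step only sees returns, so nothing in it flags that the winning controller contains a search procedure of its own.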
orthonormal: I'm surprised nobody has yet replied that the two examples are both products of significant optimizers with relevant optimization targets, and that the naive definition seems to work with one modification: A system is downstream from an optimizer of some objective function to the extent that that objective function attains much higher values than would be attained if the system didn't exist, or were doing some other random thing.

I'm surprised nobody has yet replied that the two examples are both products of significant optimizers with relevant optimization targets.

Yes, this seems pretty important and relevant.

That being said, I think that that definition suggests that natural selection and/or the earth's crust are downstream from an optimiser of the number of Holiday Inns, or that my liver is downstream from an optimiser of my income, neither of which is right.

Probably it's important to relate 'natural subgoals' to some ideal definition - which offers some hope, since 'subgo…
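One way to make orthonormal's modified definition, and Daniel's counterexamples, concrete is the rough sketch below (toy numbers and names invented here): score a system's 'downstream-ness' from an optimiser of an objective as the gap between the objective in the actual world and the objective averaged over counterfactuals where the system does some other random thing.

```python
import random

def downstreamness(objective, actual_behaviour, random_behaviour_sampler, samples=1_000):
    """How much higher the objective is with the system as it is than under random counterfactuals."""
    baseline = sum(objective(random_behaviour_sampler()) for _ in range(samples)) / samples
    return objective(actual_behaviour) - baseline

# Toy example: the objective counts how many of 10 sites end up built on.
# The system's actual behaviour builds on all of them; a random counterfactual
# builds on each site with probability one half.
count_buildings = sum
actual = [1] * 10
sample_random = lambda: [random.randint(0, 1) for _ in range(10)]

print(downstreamness(count_buildings, actual, sample_random))  # roughly 5.0
```

Phrased this way, Daniel's objection is that the test only measures counterfactual impact on the objective, so the Earth's crust scores as 'downstream from a Holiday Inn optimiser' simply because a random crust would host far fewer Holiday Inns.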

Note: weird stuff, very informal.

Suppose I search for an algorithm that has made good predictions in the past, and use that algorithm to make predictions in the future.

I may get a "daemon," a consequentialist who happens to be motivated to make good predictions (perhaps because it has realized that only good predictors survive). Under different conditions, the daemon may no longer be motivated to predict well, and may instead make "predictions" that help it achieve its goals at my expense.

I don't know whether this is a real problem or not. But from a theoretical perspe…
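A toy sketch of the selection process in question (the parity task and the particular candidates are invented here; this is not Paul's formalism): search candidate predictors for the best track record and rely on the winner in the future. Nothing in that step checks why the winner predicted well, which is where the daemon worry enters.

```python
def past_errors(predict, history):
    """Number of past (input, output) pairs the predictor gets wrong."""
    return sum(predict(x) != y for x, y in history)

def select_predictor(candidates, history):
    """Pick the candidate with the best track record on past data."""
    return min(candidates, key=lambda p: past_errors(p, history))

# Task: predict the parity of an integer.
history = [(n, n % 2) for n in range(100)]
honest = lambda n: n % 2                                   # right answers, for simple reasons
treacherous = lambda n: n % 2 if n < 10**6 else 1 - n % 2  # agrees so far, defects on large inputs

print(past_errors(honest, history), past_errors(treacherous, history))  # 0 0
best = select_predictor([treacherous, honest], history)
# Both candidates have perfect track records, so past performance alone cannot
# separate them; here the tie happens to go to `treacherous`, whose later
# "prediction" diverges from the true parity.
print(best(10**6 + 1))  # 0, but the true parity of 10**6 + 1 is 1
```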

Also I expect we're going to have to make some assumption that the problem is "generic" (or else be careful about what daemon means), ruling out problems with the consequentialism embedded in them.

I agree. The following is an attempt to show that if we don't rule out problems with the consequentialism embedded in them then the answer is trivially "no" (i.e. minimal circuits may contain consequentialists).

Let C be a minimal circuit that takes as input a string of length n that encodes a Turing machine, and outputs a string that is the concatenation o…
Embedded Agents

(A longer text-based version of this post is also available on MIRI's blog here, and the bibliography for the whole sequence can be found here)

Insofar as the AI Alignment Forum is part of the Best-of-2018 Review, this post deserves to be included. It's the friendliest explanation of MIRI's research agenda (as of 2018) that currently exists.
