All of ESRogs's Comments + Replies

-3.  I'm assuming you are already familiar with some basics, and already know what 'orthogonality' and 'instrumental convergence' are and why they're true.

I think this is actually the part that I most "disagree" with. (I put "disagree" in quotes, because there are forms of these theses that I'm persuaded by. However, I'm not so confident that they'll be relevant for the kinds of AIs we'll actually build.)

1. The smart part is not the agent-y part

It seems to me that what's powerful about modern ML systems is their ability to do data compression / patter... (read more)

GPT-3 does unsupervised learning on text data. Our brains do predictive processing on sensory inputs. My guess (which I'd love to hear arguments against!) is that there's a true and deep analogy between the two, and that they lead to impressive abilities for fundamentally the same reason.

Agree that self-supervised learning powers both GPT-3 updates and human brain world-model updates (details & caveats). (Which isn’t to say that GPT-3 is exactly the same as the human brain world-model—there are infinitely many different possible ML algorithms that all ... (read more)

3James Payor1y
Can you visualize an agent that is not "open-ended" in the relevant ways, but is capable of, say, building nanotech and melting all the GPUs? In my picture most of the extra sauce you'd need on top of GPT-3 looks very agenty. It seems tricky to name "virtual worlds" in which AIs manipulate just "virtual resources" and still manage to do something like melting the GPUs.

For example, I claim that while AlphaGo could be said to be agent-y, it does not care about atoms. And I think that we could make it fantastically more superhuman at Go, and it would still not care about atoms. Atoms are just not in the domain of its utility function.

In particular, I don't think it has an incentive to break out into the real world to somehow get itself more compute, so that it can think more about its next move. It's just not modeling the real world at all. It's not even trying to rack up a bunch of wins over time. It's just playing the si

... (read more)

I think we (mostly) all agree that we want to somehow encode human values into AGIs. That's not a new idea. The devil is in the details.

I see how my above question seems naive. Maybe it is. But if one potential answer to the alignment problem lies in the way our brains work, maybe we should try to understand that better, instead of (or in addition to) letting a machine figure it out for us through some kind of "value learning". (Copied from my answer to AprilSR:) I stumbled across two papers from a few years ago by a psychologist, Mark Muraven, who thinks that the way humans deal with conflicting goals could be important for AI alignment ( and https://arxiv.... (read more)

The problem with this model is, its predictions depend a lot on how you draw the boundary around "field". Take Yudkowsky's example of startups. How do we explain small startups succeed where large companies failed?

I don't quite see how this is a problem for the model. The narrower you draw the boundary, the more jumpy progress will be, right?

Successful startups are big relative to individuals, but not that big relative to the world as a whole. If we're talking about a project / technology / company that can rival the rest of the world in its output, then t... (read more)

I don't quite see how this is a problem for the model. The narrower you draw the boundary, the more jumpy progress will be, right?

So, you're saying: if we draw the boundary around a narrow field, we get jumpy/noisy progress. If we the draw the boundary around a broad field, all the narrow subfields average out and the result is less noise. This makes a lot of sense, thank you!

The question is, what metric do we use to average the subfields. For example, on some metrics the Manhattan project might be a rather small jump in military-technology-averaged-ove... (read more)

In my view, the biological anchors and the Very Serious estimates derived therefrom are really useful for the following very narrow yet plausibly impactful purpose

I don't understand why it's not just useful directly. Saying that the numbers are not true upper or lower bounds seems like it's expecting way too much!

They're not even labeled as bounds (at least in the headline). They're supposed to be "anchors".

Suppose you'd never done the analysis to know how much compute a human brain uses, or how much compute all of evolution had used. Wouldn't this report ... (read more)

Claim 4: GPT-N need not be "trying" to predict the next word. To elaborate: one model of GPT-N is that it is building a world model and making plans in the world model such that it predicts the next word as accurately as possible. This model is fine on-distribution but incorrect off-distribution. In particular, it predicts that GPT-N would e.g. deliberately convince humans to become more predictable so it can do better on future next-word predictions; this model prediction is probably wrong.

I got a bit confused by this section, I think because the word "mo... (read more)

6Rohin Shah2y
Yes, that's right, sorry about the confusion.

Let , where  and 

[...] The second rule says that  is orthogonal to itself

Should that be "is not orthogonal to itself"? I thought the  meant non-orthogonal, so would think  means that  is not orthogonal to itself.

(The transcript accurately reflects what was said in the talk, but I'm asking whether Scott misspoke.)

4Scott Garrabrant3y
Yeah, you are right. I will change it. Thanks.

But once you let it do more computation, then it doesn't have to know anything at all, right? Like, maybe the best go bot is, "Train an AlphaZero-like algorithm for a million years, and then use it to play."

I know more about go than that bot starts out knowing, but less than it will know after it does computation.

I wonder if, when you use the word "know", you mean some kind of distilled, compressed, easily explained knowledge?

Perhaps the bot knows different things at different times and your job is to figure out (a) what it always knows and (b) a way to quickly find out everything it knows at a certain point in time.
I would say that bot knows what the trained AlphaZero-like model knows.

This is commonly said on the basis of his $1b pledge

Wasn't it supposed to be a total of $1b pledged, from a variety of sources, including Reid Hoffman and Peter Thiel, rather than $1b just from Musk?

EDIT: yes, it was.

Sam, Greg, Elon, Reid Hoffman, Jessica Livingston, Peter Thiel, Amazon Web Services (AWS), Infosys, and YC Research are donating to support OpenAI. In total, these funders have committed $1 billion, although we expect to only spend a tiny fraction of this in the next few years.

For those organizations that do choose to compete... I think it is highly likely that they will attempt to build competing systems in basically the exact same way as the first organization did


It's unlikely for there to exist both aligned and misaligned AI systems at the same time

If the first group sunk some cost into aligning their system, but that wasn't integral to its everyday task performance, wouldn't a second competing group be somewhat likely to skimp on the alignment part?

It seems like this calls into the question the claim that we wouldn't get ... (read more)

3Evan Hubinger3y
I think that alignment will be a pretty important desideratum for anybody building an AI system—and I think that copying whatever alignment strategy was used previously is likely to be the easiest, most conservative, most risk-averse option for other organizations trying to fulfill that desideratum.
3Rob Bensinger3y
Scott's post explaining the relationship between C0 and C1 exists as of now: Functors and Coarse Worlds.

Are the two frames labeled  and  in section 3.3 (with an agent that thinks about either red or green and walks or stays home, and an environment that's safe or has bears) equivalent? (I would guess so, given the section that example was in, but didn't see it stated explicitly.)

2Rob Bensinger3y
They're not equivalent. If two frames are 'homotopy equivalent' / 'biextensionally equivalent' (two names for the same thing, in Cartesian frames), it means that you can change one frame into the other (ignoring the labels of possible agents and environments, i.e., just looking at the possible worlds) by doing some combination of 'make a copy of a row', 'make a copy of a column', 'delete a row that's a copy of another row', and/or 'delete a column that's a copy of another column'. The entries of C0 and C1 are totally different (Image(C0)={w0,w1,w2,w3,w4,w5,w6,w7}, while Image(C1)={w8,w9,w10,w11}, before we even get into asking how those entries are organized in the matrices), so they can't be biextensionally equivalent. There is an important relationship between C0 and C1, which Scott will discuss later in the sequence. But the reason they're brought up in this post is to make a more high-level point "here's a reason we want to reify agents and environments less than worlds, which is part of why we're interested in biextensional equivalence," not to provide an example of biextensional equivalence.

I didn't feel like any comment I would have made would have anything more to say than things I've said in the past.

FWIW, I say: don't let that stop you! (Don't be afraid to repeat yourself, especially if there's evidence that the point has not been widely appreciated.)

Unfortunately, I also only have so much time, and I don't generally think that repeating myself regularly in AF/LW comments is a super great use of it.

Next, we might imagine GPT-N to just be an Oracle AI, which we would have better hopes of using well. But I don't expect that an approximate Oracle AI could be used safely with anything like the precautions that might work for a genuine Oracle AI. I don't know what internal optimizers GPT-N ends up building along the way, but I'm not going to count on there being none of them.

Is the distinguishing feature between Oracle AI and approximate Oracle AI, as you use the terms here, just about whether there are inner optimizers or not?

(When I started the paragrap... (read more)

The outer optimizer is the more obvious thing: it's straightforward to say there's a big difference in dealing with a superhuman Oracle AI with only the goal of answering each question accurately, versus one whose goals are only slightly different from that in some way. Inner optimizers are an illustration of another failure mode.

As for planning, we've seen the GPTs ascend from planning out the next few words, to planning out the sentence or line, to planning out the paragraph or stanza. Planning out a whole text interaction is well within the scope I could imagine for the next few iterations, and from there you have the capability of manipulation without external malicious use.

Perhaps a nitpick, but is what it does planning?

Is it actually thinking several words ahead (a la AlphaZero evaluating moves) when it decides what word to say next, or is it just doing free-writing, and it j... (read more)

That's not a nitpick at all!

Upon reflection, the structured sentences, thematically resolved paragraphs, and even JSX code can be done without a lot of real lookahead. And there's some evidence it's not doing lookahead - its difficulty completing rhymes when writing poetry, for instance.

(Hmm, what's the simplest game that requires lookahead that we could try to teach to GPT-3, such that it couldn't just memorize moves?)

Thinking about this more, I think that since planning depends on causal modeling, I'd expect the latter to get good before the former. But I probably overstated the case for its current planning capabilities, and I'll edit accordingly. Thanks!

First, do we now have an example of an AI not using cognitive capacities that it had, because the 'face' it's presenting wouldn't have those cognitive capacities?

This does seem like an interesting question. But I think we should be careful to measure against the task we actually asked the system to perform.

For example, if I ask my system to produce a cartoon drawing, it doesn't seem very notable if I get a cartoon as a result rather than a photorealistic image, even if it could have produced the latter.

Maybe what this just means is that we should track wha... (read more)

For example, if I ask my system to produce a cartoon drawing, it doesn't seem very notable if I get a cartoon as a result rather than a photorealistic image, even if it could have produced the latter.

Consider instead the scenario where I show a model a photo of a face, and the model produces a photo of the side of that face. An interesting question is "is there a 3d representation of the face in the model?". It could be getting the right answer that way, or it could be getting it some other way.

Similarly, when it models a 'dumb' ch... (read more)

I agree. And I thought Arthur Breitman had a good point on one of the related Twitter threads:

GPT-3 didn't "pretend" not to know. A lot of this is the AI dungeon environment. If you just prompt the raw GPT-3 with: "A definition of a monotreme follows" it'll likely do it right. But if you role play, sure, it'll predict that your stoner friend or young nephew don't know.

Yeah, it seems like deliberately pretending to be stupid here would be predicting a less likely sequence, in service of some other goal.

I wonder how long we'll be in the "prompt programming" regime. As Nick Cammarata put it:

We should actually be programming these by manipulating the hidden layers and prompts are a stand-in until we can.

My guess is that OpenAI will pretty quickly (within the next year) find a much better way to interface with what GPT-3 has learned.

Do others agree? Any reason to think that wouldn't be possible (or wouldn't give significant benefits)?

The problem with directly manipulating the hidden layers is reusability. If we directly manipulate the hidden layers, then we have to redo that whenever a newer, shinier model comes out, since the hidden layers will presumably be different. On the other hand, a prompt is designed so that human writing which starts with that prompt will likely contain the thing we want - a property mostly independent of the internal structure of the model, so presumably the prompt can be reused. I think the eventual solution here (and a major technical problem of alignment) is to take an internal notion learned by one model (i.e. found via introspection tools), back out a universal representation of the real-world pattern it represents, then match that real-world pattern against the internals of a different model in order to find the "corresponding" internal notion. Assuming that the first model has learned a real pattern which is actually present in the environment, we should expect that "better" models will also have some structure corresponding to that pattern - otherwise they'd lose predictive power on at least the cases where that pattern applies. Ideally, this would all happen in such a way that the second model can be more accurate, and that increased accuracy would be used. In the shorter term, I agree OpenAI will probably come up with some tricks over the next year or so.

I guess the rightmost term could be zero or negative, right? (If the difference between T and P is greater than or equal to the difference between P and S.) In that case, the payoffs would be such that there's no credence you could have that the other player will play Hare that would justify playing Hare yourself (or justify it as non-defection, that is).

So my claim #1 is always true, but claim #2 depends on the payoff values.

In other words, Stag Hunt could be subdivided into two games: one where the payoffs never justify playing Hare (as non-defection), and one where they sometimes do, depending on your credence that the other player will play Stag.

2Alex Turner3y
Yes, this is correct. For example, the following is an example of the second game:

Combining the two conditions, we have

Since , this holds for some nonempty subinterval of .

I want to check that I'm following this. Would it be fair to paraphrase the two parts of this inequality as:

1) If your credence that the other player is going to play Stag is high enough, you won't even be tempted to play Hare.

2) If your credence that the other player is going to play Hare is high enough, then i

... (read more)
I guess the rightmost term could be zero or negative, right? (If the difference between T and P is greater than or equal to the difference between P and S.) In that case, the payoffs would be such that there's no credence you could have that the other player will play Hare that would justify playing Hare yourself (or justify it as non-defection, that is). So my claim #1 is always true, but claim #2 depends on the payoff values. In other words, Stag Hunt could be subdivided into two games: one where the payoffs never justify playing Hare (as non-defection), and one where they sometimes do, depending on your credence that the other player will play Stag.

This is a good question, and I don't know the answer. My guess is that Paul would say that that is a potential problem, but different from the one being addressed in this post. Not sure though.

2Paul Christiano3y
Yeah, that's my view.

Thank you, this was helpful. I hadn't understood what was meant by "the generalization is now coming entirely from human beliefs", but now it seems clear. (And in retrospect obvious if I'd just read/thought more carefully.)

Ah, I see. It sounds like the key thing I was missing was that the strangeness of the prior only matters when you're testing on a different distribution than you trained on. (And since you can randomly sample from x* when you solicit forecasts from humans, the train and test distributions can be considered the same.)

4Daniel Kokotajlo3y
Is that actually true though? Why is that true? Say we are training the model on a dataset of N human answers, and then we are doing to deploy it to answer 10N more questions, all from the same big pool of questions. The AI can't tell whether it is in training or deployment, but it could decide to follow a policy of giving some sort of catastrophic answer with probability 1/10N, so that probably it'll make it through training just fine and then still get to cause catastrophe.

In this case, I can pay humans to make forecasts for many randomly chosen x* in D*, train a model f to predict those forecasts, and then use f to make forecasts about the rest of D*.

The generalization is now coming entirely from human beliefs, not from the structural of the neural net — we are only applying neural nets to iid samples from D*.

Perhaps a dumb question, but don't we now have the same problem at one remove? The model for predicting what the human would predict would still come from a "strange" prior (based on the l2 norm, or whatever)

... (read more)
In this case humans are doing the job of transferring from D to D∗, and the training algorithm just has to generalize from a representative sample of D∗ to the test set.
4Paul Christiano3y
The difference is that you can draw as many samples as you want from D* and they are all iid. Neural nets are fine in that regime.

Deep learning AGI implies mesa optimization: Since deep learning is so sample inefficient, it cannot reach human levels of performance if we apply deep learning directly to each possible task T. (For example, it has to relearn how the world works separately for each task T.) As a result, if we do get AGI primarily via deep learning, it must be that we used deep learning to create a new optimizing AI system, and that system was the AGI.

I don't quite understand what this is saying.

Suppose we train a giant deep learning model via self-supervised learning on a

... (read more)
4Rohin Shah3y
Yeah, I'm talking about the whole system. Yeah, I agree it doesn't fit the explanation / definition in Risks from Learned Optimization. I don't like that definition, and usually mean something like "running the model parameters instantiates a computation that does 'reasoning'", which I think does fit this example. I mentioned this a bit later in the comment:

I see. So, restating in my own terms -- outer alignment is in fact about whether getting what you asked for is good, but for the case of prediction, the malign universal prior argument says that "perfect" prediction is actually malign. So this would be a case of getting what you wanted / asked for / optimized for, but that not being good. So it is an outer alignment failure.

Whereas an inner alignment failure would necessarily involve not hitting optimal performance at your objective. (Otherwise it would be an inner alignment success, and an outer alignment failure.)

Is that about right?

3Evan Hubinger3y
Yep—at least that's how I'm generally thinking about it in this post.

Great post, thank you!

However, I think I don't quite understand the distinction between inner alignment and outer alignment, as they're being used here. In particular, why would the possible malignity of the universal prior be an example of outer alignment rather than inner?

I was thinking of outer alignment as being about whether, if a system achieves its objective, is that what you wanted. Whereas inner alignment was about whether it's secretly optimizing for something other than the stated objective in the first place.

From that perspective, wouldn't mali

... (read more)
3Evan Hubinger3y
The way I'm using outer alignment here is to refer to outer alignment at optimum. Under that definition, optimal loss on a predictive objective should require doing something like Bayesian inference on the universal prior, making the question of outer alignment in such a case basically just the question of whether Bayesian inference on the universal prior is aligned.

Hmm, maybe I'm missing something basic and should just go re-read the original posts, but I'm confused by this statement:

So what we do here is say "belief set A is strictly 'better' if this particular observer always trusts belief set A over belief set B", and "trust" is defined as "whatever we think belief set A believes is also what we believe".

In this, belief set A and belief set B are analogous to A[C] and C (or some c in C), right? If so, then what's the analogue of "trust... over"?

If we... (read more)

3Rohin Shah4y
Yes. So I only showed the case where info contains information about A[C]'s predictions, but info is allowed to contain information from A[C] and C (but not other agents). Even if it contains lots of information from C, we still need to trust A[C]. In contrast, if info contained information about A[A[C]]'s beliefs, then we would not trust A[C] over that.
Notably, we need to trust A[C] even over our own beliefs, that is, if A[C] believes something, we discard our position and adopt A[C]'s belief.

To clarify, this is only if we (or the process that generated our beliefs) fall into class C, right?

3Rohin Shah4y
No, under the current formalization, even if we are not in class C we have to trust A[C] over our own beliefs. Specifically, we need Eus[X∣info]=Eus[EA[C][X]∣info] for any X and information about A[C] . But then if we are given the info that EA[C][X]=Y, then we have: Eus[X∣info] =Eus[EA[C][X]∣info] (definition of universality) =Eus[EA[C][X]∣EA[C][X]=Y] (plugging in the specific info we have) =Y (If we are told that A[C] says Y, then we should expect that A[C] says Y) Putting it together, we have Eus[X∣info]=Y, that is, given the information that A[C] says Y, we must expect that the answer to X is Y. This happens because we don't have an observer-independent way of defining epistemic dominance: even if we have access to the ground truth, we don't know how to take two sets of beliefs and say "belief set A is strictly 'better' than this belief set B" [1]. So what we do here is say "belief set A is strictly 'better' if this particular observer always trusts belief set A over belief set B", and "trust" is defined as "whatever we think belief set A believes is also what we believe". You could hope that in the future we have an observer-independent way of defining epistemic dominance, and then the requirement that we adopt A[C]'s beliefs would go away. ---------------------------------------- 1. We could say that a set of beliefs is 'strictly better' if for every quantity X its belief is more accurate, but this is unachievable, because even full Bayesian updating on true information causes you to update in the wrong direction for some quantities, just by bad luck. ↩︎
The authors don't really suggest an explanation; the closest they come is speculating that at the interpolation threshold there's only ~one model that can fit the data, which may be overfit, but then as you increase further the training procedure can "choose" from the various models that all fit the data, and that "choice" leads to better generalization. But this doesn't make sense to me, because whatever is being used to "choose" the better model applies throughout training, and so even at the interpolation thr
... (read more)
3Rohin Shah4y
Yup, that.
This means that if there's more than twice the power coming from one move than from another, the former is more likely than the latter. In general, if one set of possibilities contributes 2K the power of another set of possibilities, the former set is at least K times more likely than the latter.

Where does the 2 come from? Why does one move have to have more than twice the power of another to be more likely? What happens if it only has 1.1x as much power?

3Alex Turner4y
Then it won't always be instrumentally convergent, depending on the environment in question. For Tic-Tac-Toe, there's an exact proportionality in the limit of farsightedness (see theorem 46). In general, there's a delicate interaction between control provided and probability which I don't fully understand right now. However, we can easily bound how different these quantities can be; the constant depends on the distribution D we choose (it's at most 2 for the uniform distribution). The formal explanation can be found in the proof of theorem 48, but I'll try to give a quick overview. The power calculation is the average attainable utility. This calculation breaks down into the weighted sum of the average attainable utility when Candy is best, the average attainable utility when Chocolate is best, and the average attainable utility when Hug is best; each term is weighted by the probability that its possibility is optimal.[1] Each term is the power contribution of a different possibility.[2] Let's think about Candy's contribution to the first (simple) example. First, how likely is Candy to be optimal? Well, each state has an equal chance of being optimal, so 13 of goals choose Candy. Next, given that Candy is optimal, how much reward do we expect to get? Learning that a possibility is optimal tells us something about its expected value. In this case, the expected reward is still 34; the higher this number is, the "happier" an agent is to have this as its optimal possibility. In general, power contribution of possibility=% of goals choosing this possibility⋅average control. If the agent can "die" in an environment, more of its "ability to do things in general" is coming from not dying at first. Like, let's follow where the power is coming from, and that lets us deduce things about the instrumental convergence. Consider the power at a state. Maybe 99% of the power comes from the possibilities for one move (like the move that avoids dying), and 1% comes from the rest.
Remember how, as the agent gets more farsighted, more of its control comes from Chocolate and Hug, while also these two possibilities become more and more likely?

I don't understand this bit -- how does more of its control come from Chocolate and Hug? Wouldn't you say its control comes from Wait!? Once it ends up in Candy, Chocolate, or Hug, it has no control left. No?

3Alex Turner4y
Yeah, you could think of the control as coming from Wait!. Will rephrase.
We bake the opponent's policy into the environment's rules: when you choose a move, the game automatically replies.

And the opponent plays to win, with perfect play?

3Alex Turner4y
Yes in this case, although note that that only tells us about the rules of the game, not about the reward function - most agents we're considering don't have the normal Tic-Tac-Toe reward function.
Imagine we only care about the reward we get next turn. How many goals choose Candy over Wait? Well, it's 50-50 – since we randomly choose a number between 0 and 1 for each state, both states have an equal chance of being maximal.

I got a little confused at the introduction of Wait!, but I think I understand it now. So, to check my understanding, and for the benefit of others, some notes:

  • the agent gets a reward for the Wait! state, just like the other states
  • for terminal states (the three non-Wait! states), the agent stays in that state, and keep
... (read more)
3Alex Turner4y
Yes. The full expansions (with no limit on the time horizon) are rCandy1−γ,rWait!+γrChocolate1−γ, and rWait!+γrHug1−γ, where rCandy,rWait!,rChocolate,rHug∼unif(0,1).
It's meant to be analogous to imputing a value in a causal Bayes net

Aha! I thought it might be borrowing language from some technical term I wasn't familiar with. Thanks!

Takeoff does matter, in that I expect that this worldview is not very accurate/good if there's discontinuous takeoff, but imputing the worldview I don't think takeoff matters.

Minor question: could you clarify what you mean by "imputing the worldview" here? Do you mean something like, "operating within the worldview"? (I ask because this doesn't seem to be a use of "impute" that I'm familiar with.)

3Rohin Shah4y
Basically yes. Longer version: "Suppose we were in scenario X. Normally, in such a scenario, I would discard this worldview, or put low weight on it, because reason Y. But suppose by fiat that I continue to use the worldview, with no other changes made to scenario X. Then ..." It's meant to be analogous to imputing a value in a causal Bayes net, where you simply "suppose" that some event happened, and don't update on anything causally upstream, but only reason forward about things that are causally downstream. (I seem to recall Scott Garrabrant writing a good post on this, but I can't find it now. ETA: Found it, it's here, but it doesn't use the term "impute" at all. I'm now worried that I literally made up the term, and it doesn't actually have any existing technical meaning.)
And I think claim 5 is basically in line with what, say, Bostrom would discuss (where stabilization is a thing to do before we attempt to build a sovereign).

You mean in the sense of stabilizing the whole world? I'd be surprised if that's what Yann had in mind. I took him just to mean building a specialized AI to be a check on a single other AI.

2Matthew "Vaniver" Gray4y
That's how I interpreted: To be clear, I think he would mean it more in the way that there's currently an international police order that is moderately difficult to circumvent, and that the same would be true for AGI, and not necessarily the more intense variants of stabilization (which are necessarily primarily if you think offense is highly advantaged over defense, which I don't know his opinion on).
there is a troll who will blow up the bridge with you on it, if you cross it "for a dumb reason"

Does this way of writing "if" mean the same thing as "iff", i.e. "if and only if"?

2Abram Demski4y
No, but I probably should have said "iff" or "if and only if". I'll edit.

I can't resist giving this pair of rather incongruous quotes from the paper

Could you spell out what makes the quotes incongruous with each other? It's not jumping out at me.

The authors acknowledged that the modifications they did to RL "brings RL closer to the frameworks of decision theory and game theory" (AFAICT, the algorithms they end up with are nearly pure decision/game theory) but given that some researchers have been focused on decision theory for a long time exactly because a decision theoretic agent can be reflectively stable, it seems incongruous to also write "perhaps surprisingly, there are modifications of the RL objective that remove the agent’s incentive to tamper with the reward function."

Very clear presentation! As someone outside the field who likes to follow along, I very much appreciate these clear conceptual frameworks and explanations.

I did however get slightly lost in section 1.2. At first reading I was expecting this part:

which we will contrast with the outer alignment problem of eliminating the gap between the base objective and the intended goal of the programmers.

to say, "... gap between the behavioral objective and the intended goal of the programmers." (In which case the inner alignment problem would be a subcomponent... (read more)

Phrases I've used: [intended/desired/designer's] [objective/goal]

I think "designer's objective" would fit in best with the rest of the terminology in this post, though "desired objective" is also good.

I don't have a good term for that, unfortunately—if you're trying to build an aligned AI, "human values" could be the right term, though in most cases you really just want "move one strawberry onto a plate without killing everyone," which is quite a lot less than "optimize for all human values." I could see how meta-objective might make sense if you're thinking about the human as an outside optimizer acting on the system, though I would shy away from using that term like that, as anyone familiar with meta-learning will assume you mean the objective of a me

... (read more)

To me the difference is that when I read 5 I'm thinking about people being careless or malevolent, in an everyday sense of those terms, whereas when I read 4 I'm thinking about how maybe there's no such thing as a human who's not careless or malevolent, if given enough power and presented with a weird enough situation.

I conclude from this that CDT should equal EDT (hence, causality must account for logical correlations, IE include logical causality). By "CDT" I really mean any approach at all to counterfactual reasoning; counterfactual expectations should equal evidential expectations.
As with most of my CDT=EDT arguments, this only provides an argument that the expectations should be equal for actions taken with nonzero probability. In fact, the amount lost to Dutch Book will be proportional to the probability of the action in question. So, differing counterfa
... (read more)
2Abram Demski5y
"The expectations should be equal for actions with nonzero probability" -- this means a CDT agent should have equal causal expectations for any action taken with nonzero probability, and EDT agents should similarly have equal evidential expectations. Actually, I should revise my statement to be more careful: in the case of epsilon-exploring agents, the condition is >epsilon rather than >0. In any case, my statement there isn't about evidential and causal expectations being equal to each other, but rather about one of them being conversant across (sufficiently probable) actions. "differing counterfactual and evidential expectations are smoothly more and more tenable as actions become less and less probable" -- this means that the amount we can take from a CDT agent through a Dutch Book, for an action which is given a different casual expectation than evidential expectation, smoothly reduces as the probability of an action goes to zero. In that statement, I was assuming you hold the difference between evidential and causal expectations constant add you reduce the probability of the action. Otherwise it's not necessarily true.

To clarify, you do do the human's instrumental sub-goals though, just not extra ones for yourself, right?

3Rohin Shah5y
If you've seen the human acquire resources, then you'll acquire resources in the same way. If there's now some new resource that you've never seen before, you may acquire it if you're sufficiently confident that the human would, but otherwise you might try to gather more evidence to see what the human would do. This is assuming that we have some way of doing imitation learning that allows the resulting system to have uncertainty that it can resolve by watching the human, or asking the human. If you imagine the exact way that we do imitation learning today, it would extrapolate somehow in a way that isn't actually what the human would do. Maybe it acquires the new resource, maybe it leaves it alone, maybe it burns it to prevent anyone from having it, who knows.
With respect to whether it is what I want, I wouldn't say that I want any of these things in particular, I'm more pointing at the existence of systems that aren't goal-directed, yet behave like an agent.

Would you agree that a B-type agent would be basically as goal-directed as a human (because it exhibits goal-directed behavior when the human does, and doesn't when the human doesn't)?

In which case, would it be fair to summarize (part of) your argument as:

1) Many of the potential problems with building safe superintelligent systems ... (read more)

3Rohin Shah5y
I don't think so. Maybe this would be true if you had a perfect imitation of a human, but in practice you'll be uncertain about what the human is going to do. If you're uncertain in this way, and you are getting your goals from a human, then you don't do all of the instrumental subgoals. (See The Off-Switch Game for a simple analysis showing that you can avoid the survival incentive.) It may be that "goal-directed" is the wrong word for the property I'm talking about, but I'm predicting that agents of this form are less susceptible to convergent instrumental subgoals than humans are.
For example, imitation learning allows you to create an agent that behaves similarly to another agent—I would classify this as “Agent AI that is not goal-directed”.

Let's say that I'm pursuing goal X, and my assistant AI agent is also pursuing goal X as a result. If I then start to pursue goal Y, and my AI agent also starts pursuing Y because it is aligned with me, then it feels like the AI was not really directed at goal X, but more directed at "whatever goal Rohin has"

What causes the agent to switch from X to Y?

Are you thinking of the ... (read more)

Are you thinking of the "agent" as A) the product of the demonstrations and training (e.g. the resulting neural network), or as B) a system that includes both the trained agent and also the training process itself (and facilities for continual online learning)?

I was imagining something more like B for the imitation learning case.

I would assume A by default, but then I would expect that if you trained such an agent with imitation learning while pursuing goal X, you'd likely get an agent that continues to pursue goal X even after you've s
... (read more)
In this post (original here), Paul Christiano analyzes the ambitious value learning approach.

I find it a little bit confusing that Rohin's note refers to the "ambitious value learning approach", while the title of the post refers to the "easy goal inference problem". I think the note could benefit from clarifying the relationship of these two descriptors.

As it stands, I'm asking myself -- are they disagreeing about whether this is easy or hard? Or is "ambitious value learning" the same as "goal inference" (... (read more)

3Rohin Shah5y
The easy goal inference problem is the same thing as ambitious value learning under the assumption of infinite compute and data about human behavior (which is the assumption that we're considering for most of this sequence). The previous post was meant to outline the problem, all subsequent posts are about that problem. Ambitious value learning is probably the best name for the problem now, but not all posts use the same terminology even though they're talking about approximately the same thing.
Load More