# All of ESRogs's Comments + Replies

> Challenge: know everything that the best go bot knows about go

But once you let it do more computation, then it doesn't have to know anything at all, right? Like, maybe the best go bot is, "Train an AlphaZero-like algorithm for a million years, and then use it to play."

I know more about go than that bot starts out knowing, but less than it will know after it does computation.

I wonder if, when you use the word "know", you mean some kind of distilled, compressed, easily explained knowledge?

DanielFilan (6d): Perhaps the bot knows different things at different times, and your job is to figure out (a) what it always knows and (b) a way to quickly find out everything it knows at a certain point in time.
2020 AI Alignment Literature Review and Charity Comparison

> This is commonly said on the basis of his $1b pledge

Wasn't it supposed to be a total of $1b pledged, from a variety of sources, including Reid Hoffman and Peter Thiel, rather than $1b just from Musk?

EDIT: yes, it was:

> Sam, Greg, Elon, Reid Hoffman, Jessica Livingston, Peter Thiel, Amazon Web Services (AWS), Infosys, and YC Research are donating to support OpenAI. In total, these funders have committed $1 billion, although we expect to only spend a tiny fraction of this in the next few years.

https://openai.com/blog/introducing-openai/

Homogeneity vs. heterogeneity in AI takeoff scenarios

> For those organizations that do choose to compete... I think it is highly likely that they will attempt to build competing systems in basically the exact same way as the first organization did
>
> ...
>
> It's unlikely for there to exist both aligned and misaligned AI systems at the same time

If the first group sunk some cost into aligning their system, but that wasn't integral to its everyday task performance, wouldn't a second competing group be somewhat likely to skimp on the alignment part?

It seems like this calls into question the claim that we wouldn't get ... (read more)

Evan Hubinger (5mo): I think that alignment will be a pretty important desideratum for anybody building an AI system—and I think that copying whatever alignment strategy was used previously is likely to be the easiest, most conservative, most risk-averse option for other organizations trying to fulfill that desideratum.
Biextensional Equivalence

Rob Bensinger (7mo): Scott's post explaining the relationship between C0 and C1 exists as of now: Functors and Coarse Worlds [https://www.lesswrong.com/posts/GYQwJsChoRosjdW2r/functors-and-coarse-worlds].
Biextensional Equivalence

Are the two frames labeled C0 and C1 in section 3.3 (with an agent that thinks about either red or green and walks or stays home, and an environment that's safe or has bears) equivalent? (I would guess so, given the section that example was in, but didn't see it stated explicitly.)

Rob Bensinger (7mo): They're not equivalent. If two frames are 'homotopy equivalent' / 'biextensionally equivalent' (two names for the same thing, in Cartesian frames), it means that you can change one frame into the other (ignoring the labels of possible agents and environments, i.e., just looking at the possible worlds) by doing some combination of 'make a copy of a row', 'make a copy of a column', 'delete a row that's a copy of another row', and/or 'delete a column that's a copy of another column'. The entries of C0 and C1 are totally different (Image(C0) = {w0, w1, w2, w3, w4, w5, w6, w7}, while Image(C1) = {w8, w9, w10, w11}, before we even get into asking how those entries are organized in the matrices), so they can't be biextensionally equivalent.

There is an important relationship between C0 and C1, which Scott will discuss later in the sequence. But the reason they're brought up in this post is to make a more high-level point ("here's a reason we want to reify agents and environments less than worlds, which is part of why we're interested in biextensional equivalence"), not to provide an example of biextensional equivalence.
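The row/column operations Rob lists can be turned into a small executable check. This is my own sketch, not code from the sequence; the matrices below are stand-ins, not the post's actual C0 and C1, and the comparison is only up to row order (a full isomorphism check would also permute columns).

```python
# A frame is a matrix (list of rows) of world labels. "Biextensional
# collapse" deletes duplicate rows and duplicate columns.

def image(frame):
    """The set of possible worlds appearing anywhere in the frame."""
    return {w for row in frame for w in row}

def collapse(frame):
    """Delete duplicate rows, then duplicate columns."""
    rows = []
    for r in frame:
        if r not in rows:
            rows.append(r)
    cols = []
    for c in zip(*rows):
        if c not in cols:
            cols.append(c)
    return [tuple(r) for r in zip(*cols)]  # back to row-major order

def maybe_equivalent(f, g):
    """Equal images, and identical collapses up to row order."""
    if image(f) != image(g):
        return False
    return sorted(collapse(f)) == sorted(collapse(g))

# A frame with a copied row is equivalent to one with a copied column:
f = [("w0", "w1"), ("w0", "w1"), ("w2", "w3")]
g = [("w0", "w1", "w1"), ("w2", "w3", "w3")]
assert maybe_equivalent(f, g)

# Stand-ins for C0 and C1: disjoint images, so not equivalent,
# exactly as Rob argues.
c0 = [("w0", "w1"), ("w2", "w3")]
c1 = [("w8", "w9"), ("w10", "w11")]
assert not maybe_equivalent(c0, c1)
```

The image test is the necessary condition Rob uses: no sequence of copy/delete operations can introduce a world that was never in the matrix.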
Matt Botvinick on the spontaneous emergence of learning algorithms

I didn't feel like any comment I would have made would have anything more to say than things I've said in the past.

FWIW, I say: don't let that stop you! (Don't be afraid to repeat yourself, especially if there's evidence that the point has not been widely appreciated.)

Unfortunately, I also only have so much time, and I don't generally think that repeating myself regularly in AF/LW comments is a super great use of it.

Developmental Stages of GPTs

> Next, we might imagine GPT-N to just be an Oracle AI, which we would have better hopes of using well. But I don't expect that an approximate Oracle AI could be used safely with anything like the precautions that might work for a genuine Oracle AI. I don't know what internal optimizers GPT-N ends up building along the way, but I'm not going to count on there being none of them.

Is the distinguishing feature between Oracle AI and approximate Oracle AI, as you use the terms here, just about whether there are inner optimizers or not?

(When I started the paragrap... (read more)

orthonormal (10mo): The outer optimizer is the more obvious thing: it's straightforward to say there's a big difference in dealing with a superhuman Oracle AI with only the goal of answering each question accurately, versus one whose goals are only slightly different from that in some way. Inner optimizers are an illustration of another failure mode.
Developmental Stages of GPTs

> As for planning, we've seen the GPTs ascend from planning out the next few words, to planning out the sentence or line, to planning out the paragraph or stanza. Planning out a whole text interaction is well within the scope I could imagine for the next few iterations, and from there you have the capability of manipulation without external malicious use.

Perhaps a nitpick, but is what it does planning?

Is it actually thinking several words ahead (a la AlphaZero evaluating moves) when it decides what word to say next, or is it just doing free-writing, and it j... (read more)

That's not a nitpick at all!

Upon reflection, the structured sentences, thematically resolved paragraphs, and even JSX code can be done without a lot of real lookahead. And there's some evidence it's not doing lookahead - its difficulty completing rhymes when writing poetry, for instance.

(Hmm, what's the simplest game that requires lookahead that we could try to teach to GPT-3, such that it couldn't just memorize moves?)

Thinking about this more, I think that since planning depends on causal modeling, I'd expect the latter to get good before the former. But I probably overstated the case for its current planning capabilities, and I'll edit accordingly. Thanks!
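For what it's worth, the parenthetical question seems empirically approachable: pick a small solved game, generate (position, optimal move) pairs with an exact solver, fine-tune on some positions, and test play on held-out ones. A sketch of the solver half (my own construction, with tic-tac-toe as an arbitrary choice of game):

```python
# Solve tic-tac-toe positions exactly, so a model's moves on held-out
# positions can be compared against answers that genuinely require lookahead.

from functools import lru_cache

LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
         (0, 3, 6), (1, 4, 7), (2, 5, 8),
         (0, 4, 8), (2, 4, 6)]

def winner(board):
    for i, j, k in LINES:
        if board[i] != "." and board[i] == board[j] == board[k]:
            return board[i]
    return None

@lru_cache(maxsize=None)
def minimax(board, player):
    """Exact game value for 'x', plus a best move, from this position."""
    w = winner(board)
    if w is not None:
        return (1 if w == "x" else -1), None
    moves = [i for i, c in enumerate(board) if c == "."]
    if not moves:
        return 0, None
    best_v, best_m = None, None
    for m in moves:
        nxt = board[:m] + player + board[m + 1:]
        v, _ = minimax(nxt, "o" if player == "x" else "x")
        if best_v is None or (v > best_v if player == "x" else v < best_v):
            best_v, best_m = v, m
    return best_v, best_m

# 'o' threatens the middle row (squares 3 and 4), so 'x' must block at
# square 5; seeing that every other move loses requires looking ahead.
print(minimax("x..oo....", "x"))  # (0, 5): block, then draw with best play
```

If a model fine-tuned on most solved positions also finds the blocking move on held-out positions, that would be at least weak evidence of lookahead (or of having learned the equivalent value function).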

$1000 bounty for OpenAI to show whether GPT3 was "deliberately" pretending to be stupider than it is

> First, do we now have an example of an AI not using cognitive capacities that it had, because the 'face' it's presenting wouldn't have those cognitive capacities?

This does seem like an interesting question. But I think we should be careful to measure against the task we actually asked the system to perform. For example, if I ask my system to produce a cartoon drawing, it doesn't seem very notable if I get a cartoon as a result rather than a photorealistic image, even if it could have produced the latter. Maybe what this just means is that we should track wha... (read more)

> For example, if I ask my system to produce a cartoon drawing, it doesn't seem very notable if I get a cartoon as a result rather than a photorealistic image, even if it could have produced the latter.

Consider instead the scenario where I show a model a photo of a face, and the model produces a photo of the side of that face. An interesting question is "is there a 3d representation of the face in the model?". It could be getting the right answer that way, or it could be getting it some other way. Similarly, when it models a 'dumb' ch... (read more)

$1000 bounty for OpenAI to show whether GPT3 was "deliberately" pretending to be stupider than it is

I agree. And I thought Arthur Breitman had a good point on one of the related Twitter threads:

> GPT-3 didn't "pretend" not to know. A lot of this is the AI dungeon environment. If you just prompt the raw GPT-3 with: "A definition of a monotreme follows" it'll likely do it right. But if you role play, sure, it'll predict that your stoner friend or young nephew don't know.

$1000 bounty for OpenAI to show whether GPT3 was "deliberately" pretending to be stupider than it is

Yeah, it seems like deliberately pretending to be stupid here would be predicting a less likely sequence, in service of some other goal.

Alignment As A Bottleneck To Usefulness Of GPT-3

I wonder how long we'll be in the "prompt programming" regime. As Nick Cammarata put it:

> We should actually be programming these by manipulating the hidden layers and prompts are a stand-in until we can.

My guess is that OpenAI will pretty quickly (within the next year) find a much better way to interface with what GPT-3 has learned.

Do others agree? Any reason to think that wouldn't be possible (or wouldn't give significant benefits)?

johnswentworth (10mo): The problem with directly manipulating the hidden layers is reusability. If we directly manipulate the hidden layers, then we have to redo that whenever a newer, shinier model comes out, since the hidden layers will presumably be different. On the other hand, a prompt is designed so that human writing which starts with that prompt will likely contain the thing we want - a property mostly independent of the internal structure of the model, so presumably the prompt can be reused.

I think the eventual solution here (and a major technical problem of alignment) is to take an internal notion learned by one model (i.e. found via introspection tools), back out a universal representation of the real-world pattern it represents, then match that real-world pattern against the internals of a different model in order to find the "corresponding" internal notion. Assuming that the first model has learned a real pattern which is actually present in the environment, we should expect that "better" models will also have some structure corresponding to that pattern - otherwise they'd lose predictive power on at least the cases where that pattern applies. Ideally, this would all happen in such a way that the second model can be more accurate, and that increased accuracy would be used.

In the shorter term, I agree OpenAI will probably come up with some tricks over the next year or so.
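A toy version of the "find the corresponding internal notion" step, under strong simplifying assumptions (my own sketch, not John's proposal): both "models" here are linear maps and the learned notion is a linear probe direction, so transporting it reduces to least squares on paired activations. Real models would need a nonlinear matching step.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 16))            # shared stimuli

W_old = rng.normal(size=(16, 32))         # stand-in for the old model
W_new = rng.normal(size=(16, 48))         # stand-in for the newer model
acts_old, acts_new = X @ W_old, X @ W_new

# A "notion" found in the old model: a linear probe direction.
probe_old = rng.normal(size=32)
feature = acts_old @ probe_old            # the pattern, as read off the old model

# Transport: find the new model's direction that reads off the same pattern.
probe_new, *_ = np.linalg.lstsq(acts_new, feature, rcond=None)

# The transported probe recovers the pattern on held-out inputs:
X_test = rng.normal(size=(100, 16))
old_reading = X_test @ W_old @ probe_old
new_reading = X_test @ W_new @ probe_new
assert np.allclose(old_reading, new_reading, atol=1e-6)
```

The reusability point survives in this toy: the probe itself can't be copied between models, but the pattern it reads off can be re-identified in the new model's activation space.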
What counts as defection?

I guess the rightmost term could be zero or negative, right? (If the difference between T and P is greater than or equal to the difference between P and S.) In that case, the payoffs would be such that there's no credence you could have that the other player will play Hare that would justify playing Hare yourself (or justify it as non-defection, that is).

So my claim #1 is always true, but claim #2 depends on the payoff values.

In other words, Stag Hunt could be subdivided into two games: one where the payoffs never justify playing Hare (as non-defection), and one where they sometimes do, depending on your credence that the other player will play Stag.

Alex Turner (10mo): Yes, this is correct. For example, the following is an example of the second game:
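For concreteness, a sketch of how the two sub-games differ, with made-up payoff values of my own (usual Stag Hunt symbols: R for Stag/Stag, T for Hare/Stag, P for Hare/Hare, S for Stag/Hare, with R > T ≥ P > S). The threshold below is just the simple best-response condition, not the post's full defection criterion:

```python
# Credence in the other player playing Stag below which Hare is the better
# response: expected payoffs are p*R + (1-p)*S for Stag, p*T + (1-p)*P for
# Hare, so Hare wins when p < (P - S) / ((P - S) + (R - T)).

def hare_threshold(R, T, P, S):
    return (P - S) / ((P - S) + (R - T))

# Hare is the better response whenever credence in Stag is below 0.5:
print(hare_threshold(R=4, T=3, P=2, S=1))    # 0.5

# As the gain from mutual Stag grows, the range of credences that make
# Hare the better response shrinks:
print(hare_threshold(R=10, T=3, P=2, S=1))   # 0.125
```

When the relevant interval from the post collapses to nothing, no credence in the other player's choice justifies Hare, which is the first sub-game described above.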
What counts as defection?

> Combining the two conditions, we have
>
> Since , this holds for some nonempty subinterval of .

I want to check that I'm following this. Would it be fair to paraphrase the two parts of this inequality as:

1) If your credence that the other player is going to play Stag is high enough, you won't even be tempted to play Hare.

2) If your credence that the other player is going to play Hare is high enough, then i

Learning the prior

This is a good question, and I don't know the answer. My guess is that Paul would say that that is a potential problem, but different from the one being addressed in this post. Not sure though.

Paul Christiano (10mo): Yeah, that's my view.
Learning the prior

Thank you, this was helpful. I hadn't understood what was meant by "the generalization is now coming entirely from human beliefs", but now it seems clear. (And in retrospect obvious if I'd just read/thought more carefully.)

Learning the prior

Ah, I see. It sounds like the key thing I was missing was that the strangeness of the prior only matters when you're testing on a different distribution than you trained on. (And since you can randomly sample from x* when you solicit forecasts from humans, the train and test distributions can be considered the same.)

Daniel Kokotajlo (10mo): Is that actually true though? Why is that true? Say we are training the model on a dataset of N human answers, and then we are going to deploy it to answer 10N more questions, all from the same big pool of questions. The AI can't tell whether it is in training or deployment, but it could decide to follow a policy of giving some sort of catastrophic answer with probability 1/10N, so that probably it'll make it through training just fine and then still get to cause catastrophe.
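The arithmetic behind this worry checks out; a quick sketch (my own choice of N, which barely matters):

```python
# Training: N questions; deployment: 10N questions; per-question
# catastrophe probability: 1/(10N).

N = 10_000
p = 1 / (10 * N)

survives_training = (1 - p) ** N                      # ≈ e**(-1/10) ≈ 0.905
catastrophe_in_deployment = 1 - (1 - p) ** (10 * N)   # ≈ 1 - e**(-1) ≈ 0.632

print(round(survives_training, 3), round(catastrophe_in_deployment, 3))
```

So the policy survives training about 90% of the time, yet still causes at least one catastrophe in deployment with probability around 63%, independent of N once N is large.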
Learning the prior

> In this case, I can pay humans to make forecasts for many randomly chosen x* in D*, train a model f to predict those forecasts, and then use f to make forecasts about the rest of D*.
>
> The generalization is now coming entirely from human beliefs, not from the structure of the neural net — we are only applying neural nets to iid samples from D*.

Perhaps a dumb question, but don't we now have the same problem at one remove? The model for predicting what the human would predict would still come from a "strange" prior (based on the l2 norm, or whatever)

Nisan (10mo): In this case humans are doing the job of transferring from D to D∗, and the training algorithm just has to generalize from a representative sample of D∗ to the test set.
Paul Christiano (10mo): The difference is that you can draw as many samples as you want from D* and they are all iid. Neural nets are fine in that regime.
The ground of optimization

> Deep learning AGI implies mesa optimization: Since deep learning is so sample inefficient, it cannot reach human levels of performance if we apply deep learning directly to each possible task T. (For example, it has to relearn how the world works separately for each task T.) As a result, if we do get AGI primarily via deep learning, it must be that we used deep learning to create a new optimizing AI system, and that system was the AGI.

I don't quite understand what this is saying.

Suppose we train a giant deep learning model via self-supervised learning on a

Rohin Shah (1y): Yeah, I'm talking about the whole system. Yeah, I agree it doesn't fit the explanation / definition in Risks from Learned Optimization. I don't like that definition, and usually mean something like "running the model parameters instantiates a computation that does 'reasoning'", which I think does fit this example. I mentioned this a bit later in the comment:
An overview of 11 proposals for building safe advanced AI

I see. So, restating in my own terms -- outer alignment is in fact about whether getting what you asked for is good, but for the case of prediction, the malign universal prior argument says that "perfect" prediction is actually malign. So this would be a case of getting what you wanted / asked for / optimized for, but that not being good. So it is an outer alignment failure.

Whereas an inner alignment failure would necessarily involve not hitting optimal performance at your objective. (Otherwise it would be an inner alignment success, and an outer alignment failure.)

Evan Hubinger (1y): Yep—at least that's how I'm generally thinking about it in this post.
An overview of 11 proposals for building safe advanced AI

Great post, thank you!

However, I think I don't quite understand the distinction between inner alignment and outer alignment, as they're being used here. In particular, why would the possible malignity of the universal prior be an example of outer alignment rather than inner?

I was thinking of outer alignment as being about whether, if a system achieves its objective, is that what you wanted. Whereas inner alignment was about whether it's secretly optimizing for something other than the stated objective in the first place.

From that perspective, wouldn't mali

Evan Hubinger (1y): The way I'm using outer alignment here is to refer to outer alignment at optimum [https://www.alignmentforum.org/posts/33EKjmAdKFn3pbKPJ/outer-alignment-and-imitative-amplification]. Under that definition, optimal loss on a predictive objective should require doing something like Bayesian inference on the universal prior, making the question of outer alignment in such a case basically just the question of whether Bayesian inference on the universal prior is aligned.
[AN #81]: Universality as a potential solution to conceptual difficulties in intent alignment

Hmm, maybe I'm missing something basic and should just go re-read the original posts, but I'm confused by this statement:

> So what we do here is say "belief set A is strictly 'better' if this particular observer always trusts belief set A over belief set B", and "trust" is defined as "whatever we think belief set A believes is also what we believe".

In this, belief set A and belief set B are analogous to A[C] and C (or some c in C), right? If so, then what's the analogue of "trust... over"?

Rohin Shah (1y): Yes. So I only showed the case where info contains information about A[C]'s predictions, but info is allowed to contain information from A[C] and C (but not other agents). Even if it contains lots of information from C, we still need to trust A[C]. In contrast, if info contained information about A[A[C]]'s beliefs, then we would not trust A[C] over that.
[AN #81]: Universality as a potential solution to conceptual difficulties in intent alignment
> Notably, we need to trust A[C] even over our own beliefs, that is, if A[C] believes something, we discard our position and adopt A[C]'s belief.

To clarify, this is only if we (or the process that generated our beliefs) fall into class C, right?

Rohin Shah (1y): No, under the current formalization, even if we are not in class C we have to trust A[C] over our own beliefs. Specifically, we need E_us[X | info] = E_us[E_A[C][X] | info] for any X and any information info about A[C]. But then if we are given the info that E_A[C][X] = Y, we have:

E_us[X | info]
= E_us[E_A[C][X] | info] (definition of universality)
= E_us[E_A[C][X] | E_A[C][X] = Y] (plugging in the specific info we have)
= Y (if we are told that A[C] says Y, then we should expect that A[C] says Y)

Putting it together, we have E_us[X | info] = Y; that is, given the information that A[C] says Y, we must expect that the answer to X is Y.

This happens because we don't have an observer-independent way of defining epistemic dominance: even if we have access to the ground truth, we don't know how to take two sets of beliefs and say "belief set A is strictly 'better' than this belief set B".[1] So what we do here is say "belief set A is strictly 'better' if this particular observer always trusts belief set A over belief set B", and "trust" is defined as "whatever we think belief set A believes is also what we believe". You could hope that in the future we have an observer-independent way of defining epistemic dominance, and then the requirement that we adopt A[C]'s beliefs would go away.

[1] We could say that a set of beliefs is 'strictly better' if for every quantity X its belief is more accurate, but this is unachievable, because even full Bayesian updating on true information causes you to update in the wrong direction for some quantities, just by bad luck.
[AN #77]: Double descent: a unification of statistical theory and modern ML practice
> The authors don't really suggest an explanation; the closest they come is speculating that at the interpolation threshold there's only ~one model that can fit the data, which may be overfit, but then as you increase further the training procedure can "choose" from the various models that all fit the data, and that "choice" leads to better generalization.

But this doesn't make sense to me, because whatever is being used to "choose" the better model applies throughout training, and so even at the interpolation thr
Rohin Shah (1y): Yup, that.
Seeking Power is Often Robustly Instrumental in MDPs
> This means that if there's more than twice the power coming from one move than from another, the former is more likely than the latter. In general, if one set of possibilities contributes 2K times the power of another set of possibilities, the former set is at least K times more likely than the latter.

Where does the 2 come from? Why does one move have to have more than twice the power of another to be more likely? What happens if it only has 1.1x as much power?

Alex Turner (1y): Then it won't always be instrumentally convergent, depending on the environment in question. For Tic-Tac-Toe, there's an exact proportionality in the limit of farsightedness (see theorem 46). In general, there's a delicate interaction between control provided and probability which I don't fully understand right now. However, we can easily bound how different these quantities can be; the constant depends on the distribution D we choose (it's at most 2 for the uniform distribution). The formal explanation can be found in the proof of theorem 48, but I'll try to give a quick overview.

The power calculation is the average attainable utility. This calculation breaks down into the weighted sum of the average attainable utility when Candy is best, the average attainable utility when Chocolate is best, and the average attainable utility when Hug is best; each term is weighted by the probability that its possibility is optimal.[1] Each term is the power contribution of a different possibility.[2]

Let's think about Candy's contribution to the first (simple) example. First, how likely is Candy to be optimal? Well, each state has an equal chance of being optimal, so 1/3 of goals choose Candy. Next, given that Candy is optimal, how much reward do we expect to get? Learning that a possibility is optimal tells us something about its expected value. In this case, the expected reward is still 3/4; the higher this number is, the "happier" an agent is to have this as its optimal possibility. In general, power contribution of possibility = (% of goals choosing this possibility) ⋅ (average control).

If the agent can "die" in an environment, more of its "ability to do things in general" is coming from not dying at first. Like, let's follow where the power is coming from, and that lets us deduce things about the instrumental convergence. Consider the power at a state. Maybe 99% of the power comes from the possibilities for one move (like the
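The 1/3 and 3/4 figures in the simple Candy/Chocolate/Hug example can be verified by simulation. A Monte Carlo sketch (my own code, assuming iid uniform(0,1) rewards over the three possibilities, as in the post):

```python
# Each sampled reward function is one "goal"; we record which possibility
# it finds optimal, and the reward it attains there.

import random

random.seed(0)
trials = 200_000
chosen = {"Candy": 0, "Chocolate": 0, "Hug": 0}
total_best = 0.0

for _ in range(trials):
    rewards = {k: random.random() for k in chosen}
    best = max(rewards, key=rewards.get)
    chosen[best] += 1
    total_best += rewards[best]

print({k: round(v / trials, 2) for k, v in chosen.items()})  # ≈ 1/3 each
print(round(total_best / trials, 2))                         # ≈ 0.75, the power
```

The second number is E[max of 3 uniforms] = 3/4, and each possibility's power contribution is its (1/3) share of goals times that average control.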
Seeking Power is Often Robustly Instrumental in MDPs
> Remember how, as the agent gets more farsighted, more of its control comes from Chocolate and Hug, while also these two possibilities become more and more likely?

I don't understand this bit -- how does more of its control come from Chocolate and Hug? Wouldn't you say its control comes from Wait!? Once it ends up in Candy, Chocolate, or Hug, it has no control left. No?

Alex Turner (1y): Yeah, you could think of the control as coming from Wait!. Will rephrase.
Seeking Power is Often Robustly Instrumental in MDPs
> We bake the opponent's policy into the environment's rules: when you choose a move, the game automatically replies.

And the opponent plays to win, with perfect play?

Alex Turner (1y): Yes in this case, although note that that only tells us about the rules of the game, not about the reward function - most agents we're considering don't have the normal Tic-Tac-Toe reward function.
Seeking Power is Often Robustly Instrumental in MDPs
> Imagine we only care about the reward we get next turn. How many goals choose Candy over Wait? Well, it's 50-50 – since we randomly choose a number between 0 and 1 for each state, both states have an equal chance of being maximal.

I got a little confused at the introduction of Wait!, but I think I understand it now. So, to check my understanding, and for the benefit of others, some notes:

• the agent gets a reward for the Wait! state, just like the other states
• for terminal states (the three non-Wait! states), the agent stays in that state, and keep
Alex Turner (1y): Yes. The full expansions (with no limit on the time horizon) are r_Candy/(1−γ), r_Wait! + γ·r_Chocolate/(1−γ), and r_Wait! + γ·r_Hug/(1−γ), where r_Candy, r_Wait!, r_Chocolate, r_Hug ∼ unif(0,1).
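Plugging these expansions into a quick simulation (my own code) shows the farsightedness effect discussed earlier in the thread: the fraction of goals for which Candy is optimal falls toward 1/3 as γ → 1, with control shifting to Chocolate and Hug behind Wait!.

```python
# Sample reward functions uniformly and compare the three discounted values
# r_candy/(1-γ), r_wait + γ·r_choc/(1-γ), r_wait + γ·r_hug/(1-γ).

import random

def candy_fraction(gamma, trials=100_000, seed=0):
    rng = random.Random(seed)
    candy_wins = 0
    for _ in range(trials):
        r_candy, r_wait, r_choc, r_hug = (rng.random() for _ in range(4))
        v_candy = r_candy / (1 - gamma)
        v_choc = r_wait + gamma * r_choc / (1 - gamma)
        v_hug = r_wait + gamma * r_hug / (1 - gamma)
        if v_candy > max(v_choc, v_hug):
            candy_wins += 1
    return candy_wins / trials

print(candy_fraction(0.05), candy_fraction(0.95))  # roughly 0.49, then 0.34
```

At low γ the comparison is essentially r_candy vs r_wait (about half of goals pick Candy); at high γ it is essentially r_candy vs max(r_choc, r_hug), which Candy wins only about a third of the time.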
Chris Olah’s views on AGI safety
> It's meant to be analogous to imputing a value in a causal Bayes net

Aha! I thought it might be borrowing language from some technical term I wasn't familiar with. Thanks!

Chris Olah’s views on AGI safety
> Takeoff does matter, in that I expect that this worldview is not very accurate/good if there's discontinuous takeoff, but imputing the worldview I don't think takeoff matters.

Minor question: could you clarify what you mean by "imputing the worldview" here? Do you mean something like, "operating within the worldview"? (I ask because this doesn't seem to be a use of "impute" that I'm familiar with.)

Rohin Shah (2y): Basically yes. Longer version: "Suppose we were in scenario X. Normally, in such a scenario, I would discard this worldview, or put low weight on it, because reason Y. But suppose by fiat that I continue to use the worldview, with no other changes made to scenario X. Then ..." It's meant to be analogous to imputing a value in a causal Bayes net, where you simply "suppose" that some event happened, and don't update on anything causally upstream, but only reason forward about things that are causally downstream. (I seem to recall Scott Garrabrant writing a good post on this, but I can't find it now. ETA: Found it, it's here [https://www.lesswrong.com/posts/8NBbq7xhyDXoDWM8e/don-t-condition-on-no-catastrophes], but it doesn't use the term "impute" at all. I'm now worried that I literally made up the term, and it doesn't actually have any existing technical meaning.)
Debate on Instrumental Convergence between LeCun, Russell, Bengio, Zador, and More
> And I think claim 5 is basically in line with what, say, Bostrom would discuss (where stabilization is a thing to do before we attempt to build a sovereign).

You mean in the sense of stabilizing the whole world? I'd be surprised if that's what Yann had in mind. I took him just to mean building a specialized AI to be a check on a single other AI.

Matthew "Vaniver" Graves (2y): That's how I interpreted it. To be clear, I think he would mean it more in the way that there's currently an international police order that is moderately difficult to circumvent, and that the same would be true for AGI, and not necessarily the more intense variants of stabilization (which are necessary primarily if you think offense is highly advantaged over defense, which I don't know his opinion on).
Troll Bridge
> there is a troll who will blow up the bridge with you on it, if you cross it "for a dumb reason"

Does this way of writing "if" mean the same thing as "iff", i.e. "if and only if"?

Abram Demski (2y): No, but I probably should have said "iff" or "if and only if". I'll edit.
"Designing agent incentives to avoid reward tampering", DeepMind

> I can't resist giving this pair of rather incongruous quotes from the paper

Could you spell out what makes the quotes incongruous with each other? It's not jumping out at me.

The authors acknowledged that the modifications they did to RL "brings RL closer to the frameworks of decision theory and game theory" (AFAICT, the algorithms they end up with are nearly pure decision/game theory) but given that some researchers have been focused on decision theory for a long time exactly because a decision theoretic agent can be reflectively stable, it seems incongruous to also write "perhaps surprisingly, there are modifications of the RL objective that remove the agent’s incentive to tamper with the reward function."

Risks from Learned Optimization: Introduction

Got it, that's helpful. Thank you!

Risks from Learned Optimization: Introduction

Very clear presentation! As someone outside the field who likes to follow along, I very much appreciate these clear conceptual frameworks and explanations.

I did however get slightly lost in section 1.2. At first reading I was expecting this part:

> which we will contrast with the outer alignment problem of eliminating the gap between the base objective and the intended goal of the programmers.

to say, "... gap between the behavioral objective and the intended goal of the programmers." (In which case the inner alignment problem would be a subcomponent... (read more)

Phrases I've used: [intended/desired/designer's] [objective/goal]

I think "designer's objective" would fit in best with the rest of the terminology in this post, though "desired objective" is also good.

I don't have a good term for that, unfortunately—if you're trying to build an aligned AI, "human values" could be the right term, though in most cases you really just want "move one strawberry onto a plate without killing everyone," which is quite a lot less than "optimize for all human values." I could see how meta-objective might make sense if you're thinking about the human as an outside optimizer acting on the system, though I would shy away from using that term like that, as anyone familiar with meta-learning will assume you mean the objective of a me

Disentangling arguments for the importance of AI safety

To me the difference is that when I read 5 I'm thinking about people being careless or malevolent, in an everyday sense of those terms, whereas when I read 4 I'm thinking about how maybe there's no such thing as a human who's not careless or malevolent, if given enough power and presented with a weird enough situation.

Dutch-Booking CDT
> I conclude from this that CDT should equal EDT (hence, causality must account for logical correlations, IE include logical causality). By "CDT" I really mean any approach at all to counterfactual reasoning; counterfactual expectations should equal evidential expectations.
>
> As with most of my CDT=EDT arguments, this only provides an argument that the expectations should be equal for actions taken with nonzero probability. In fact, the amount lost to Dutch Book will be proportional to the probability of the action in question. So, differing counterfa
Abram Demski (2y): "The expectations should be equal for actions with nonzero probability" -- this means a CDT agent should have equal causal expectations for any action taken with nonzero probability, and EDT agents should similarly have equal evidential expectations. Actually, I should revise my statement to be more careful: in the case of epsilon-exploring agents, the condition is >epsilon rather than >0. In any case, my statement there isn't about evidential and causal expectations being equal to each other, but rather about one of them being constant across (sufficiently probable) actions.

"Differing counterfactual and evidential expectations are smoothly more and more tenable as actions become less and less probable" -- this means that the amount we can take from a CDT agent through a Dutch Book, for an action which is given a different causal expectation than evidential expectation, smoothly reduces as the probability of the action goes to zero. In that statement, I was assuming you hold the difference between evidential and causal expectations constant as you reduce the probability of the action. Otherwise it's not necessarily true.
Will humans build goal-directed agents?

To clarify, you do do the human's instrumental sub-goals though, just not extra ones for yourself, right?

Rohin Shah (2y): If you've seen the human acquire resources, then you'll acquire resources in the same way. If there's now some new resource that you've never seen before, you may acquire it if you're sufficiently confident that the human would, but otherwise you might try to gather more evidence to see what the human would do. This is assuming that we have some way of doing imitation learning that allows the resulting system to have uncertainty that it can resolve by watching the human, or asking the human. If you imagine the exact way that we do imitation learning today, it would extrapolate somehow in a way that isn't actually what the human would do. Maybe it acquires the new resource, maybe it leaves it alone, maybe it burns it to prevent anyone from having it, who knows.
Will humans build goal-directed agents?
> With respect to whether it is what I want, I wouldn't say that I want any of these things in particular, I'm more pointing at the existence of systems that aren't goal-directed, yet behave like an agent.

Would you agree that a B-type agent would be basically as goal-directed as a human (because it exhibits goal-directed behavior when the human does, and doesn't when the human doesn't)?

In which case, would it be fair to summarize (part of) your argument as:

1) Many of the potential problems with building safe superintelligent systems ... (read more)

Rohin Shah (2y): I don't think so. Maybe this would be true if you had a perfect imitation of a human, but in practice you'll be uncertain about what the human is going to do. If you're uncertain in this way, and you are getting your goals from a human, then you don't do all of the instrumental subgoals. (See The Off-Switch Game [https://arxiv.org/abs/1611.08219] for a simple analysis showing that you can avoid the survival incentive.) It may be that "goal-directed" is the wrong word for the property I'm talking about, but I'm predicting that agents of this form are less susceptible to convergent instrumental subgoals than humans are.
Will humans build goal-directed agents?
> For example, imitation learning allows you to create an agent that behaves similarly to another agent—I would classify this as "Agent AI that is not goal-directed".

> Let's say that I'm pursuing goal X, and my assistant AI agent is also pursuing goal X as a result. If I then start to pursue goal Y, and my AI agent also starts pursuing Y because it is aligned with me, then it feels like the AI was not really directed at goal X, but more directed at "whatever goal Rohin has"

What causes the agent to switch from X to Y?

Are you thinking of the ... (read more)

> Are you thinking of the "agent" as A) the product of the demonstrations and training (e.g. the resulting neural network), or as B) a system that includes both the trained agent and also the training process itself (and facilities for continual online learning)?

I was imagining something more like B for the imitation learning case.

I would assume A by default, but then I would expect that if you trained such an agent with imitation learning while pursuing goal X, you'd likely get an agent that continues to pursue goal X even after you've s