All of riceissa's Comments + Replies

I didn't log the time I spent on the original blog post, and it's kinda hard to assign hours to this since most of the reading and thinking for the post happened while working on the modeling aspects of the MTAIR project. If I count just the time I sat down to write the blog post, I would guess maybe less than 20 hours.

As for the "convert the post to paper" part, I did log that time and it came out to 89 hours, so David's estimate of "perhaps another 100 hours" is fairly accurate.

2 David Manheim 2mo
I probably put in an extra 20-60 hours, so the total is probably closer to 150, which surprises me. I will add that a lot of the conversion time went to writing more, LaTeX figures, and citations, which were all, I think, substantively valuable additions. (Changing to a more scholarly style was not substantively valuable, nor was struggling with LaTeX margins and TikZ for the diagrams, and both took some part of the time.)

Did you end up running it through your internal infohazard review and if so what was the result?

Rob, are you able to disclose why people at Open Phil are interested in learning more decision theory? It seems a little far away from the AI strategy reports they've been publishing in recent years, and it also seemed like they were happy to keep funding MIRI (via their Committee for Effective Altruism Support) despite disagreements about the value of HRAD research, so the sudden interest in decision theory is intriguing.

Mostly personal interest on my part (I was working on a blog post on the topic, now up), though I do think that the topic has broader relevance.

I was in the chat and don't have anything especially to "disclose". Joe and Nick are both academic philosophers who've studied at Oxford and been at FHI, with a wide range of interests. And Abram and Scott are naturally great people to chat about decision theory with when they're available.

I was reading parts of Superintelligence recently for something unrelated and noticed that Bostrom makes many of the same points as this post:

If the frontrunner is an AI system, it could have attributes that make it easier for it to expand its capabilities while reducing the rate of diffusion. In human-run organizations, economies of scale are counteracted by bureaucratic inefficiencies and agency problems, including difficulties in keeping trade secrets. These problems would presumably limit the growth of a machine intelligence project so long as it is op

...

So the existence of this interface implies that A is “weaker” in a sense than A’.

Should that say B instead of A', or have I misunderstood? (I haven't read most of the sequence.)

2 Rohin Shah 2y
It should, good catch, thanks!

Does anyone know how Brian Christian came to be interested in AI alignment and why he decided to write this book instead of a book about a different topic? (I haven't read the book and looked at the Amazon preview but couldn't find the answer there.)

HCH is the result of a potentially infinite exponential process (see figure 1) and thereby, computationally intractable. In reality, we can not break down any task into its smallest parts and solve these subtasks one after another because that would take too much computation. This is why we need to iterate distillation and amplification and cannot just amplify.

In general your post talks about amplification (and HCH) as increasing the capability of the system and distillation as saving on computation/making things more efficient. But my understanding, based...

I still don't understand how corrigibility and intent alignment are different. If neither implies the other (as Paul says in his comment starting with "I don't really think this is true"), then there must be examples of AI systems that have one property but not the other. What would a corrigible but not-intent-aligned AI system look like?

I also had the thought that the implicative structure (between corrigibility and intent alignment) seems to depend on how the AI is used, i.e. on the particulars of the user/overseer. For example if you have an intent-alig...

4 Paul Christiano 3y
Suppose that I think you know me well and I want you to act autonomously on my behalf using your best guesses. Then you can be intent aligned without being corrigible. Indeed, I may even prefer that you be incorrigible, e.g. if I want your behavior to be predictable to others. If the agent knows that I have such a preference then it can't be both corrigible and intent aligned.
1 Ben Pace 3y

IDA tries to prevent catastrophic outcomes by searching for a competitive AI that never intentionally optimises for something harmful to us and that we can still correct once it’s running.

I don't see how the "we can still correct once it’s running" part can be true given this footnote:

However, I think at some point we will probably have the AI system autonomously execute the distillation and amplification steps or otherwise get outcompeted. And even before that point we might find some other way to train the AI in breaking down tasks that doesn’t involve h

...
1 Alex Turner 2y
One interpretation is that even if the AI is autonomously executing distillation/amplification steps, the relevant people are still able to say "hold on, we need to modify your algorithm" and have the AI actually let us correct it.
...
2 G Gordon Worley III 3y
Thanks. Your post specifically is pretty helpful because it helps with one of the things that was tripping me up, which is what standard names people call different methods. Your names do a better job of capturing them than mine did.

I'm confused about the tradeoff you're describing. Why is the first bullet point "Generating better ground truth data"? It would make more sense to me if it said instead something like "Generating large amounts of non-ground-truth data". In other words, the thing that amplification seems to be providing is access to more data (even if that data isn't the ground truth that is provided by the original human).

Also in the second bullet point, by "increasing the amount of data that you train on" I think you mean increasing the amount of data from the original h

...
4 Rohin Shah 3y
By "ground truth" I just mean "the data that the agent is trained on"; feel free to just ignore that part of the phrase. But it is important that it is better data. The point of amplification is that Amplify(M) is more competent than M, e.g. it is a better speech writer, it has a higher Elo rating for chess, etc. This is because Amplify(M) is supposed to approximate "M thinking for a longer time". Yes, that's right. Paul's posts often do talk about this, e.g. An unaligned benchmark, and the competitiveness desideratum in Directions and desiderata for AI alignment. I agree though that it's hard to realize this since the posts are quite scattered. I suspect Paul would say that it is plausibly competitive relative to training a system using RL with a fixed reward function (because the additional human-in-the-loop effort could be a small fraction of that, as long as we do semi-supervised RL well). However, maybe we train systems in some completely different way (e.g. GPT-2 style language models); it's very hard to predict right now how IDA would compare to that.

The addition of the distillation step is an extra confounder, but we hope that it doesn't distort anything too much -- its purpose is to improve speed without affecting anything else (though in practice it will reduce capabilities somewhat).

I think this is the crux of my confusion, so I would appreciate if you could elaborate on this. (Everything else in your answer makes sense to me.) In Evans et al., during the distillation step, the model learns to solve the difficult tasks directly by using example solutions from the amplification step. But if c

...
3 Rohin Shah 3y
You could do this, but it's expensive. In practice, from the perspective of distillation, there's always a tradeoff between:
  • Generating better ground truth data (which you can do by amplifying the agent that generates the ground truth data)
  • Improving the accuracy of the distilled model (which you can do by increasing the amount of data that you train on, and other ML tricks)
You could get to an Issa-level model using just the second method for long enough, but it's going to be much more efficient to get to an Issa-level model by alternating the two methods.

It seems like "agricultural revolution" is used to mean both the beginning of agriculture ("First Agricultural Revolution") and the 18th century agricultural revolution ("Second Agricultural Revolution").

I have only a very vague idea of what you mean. Could you give an example of how one would do this?

I think that makes sense, thanks.

Just to make sure I understand, the first few expansions of the second one are:

  • f(n)
  • f(n+1)
  • f((n+1) + 1)
  • f(((n+1) + 1) + 1)
  • f((((n+1) + 1) + 1) + 1)

Is that right? If so, wouldn't the infinite expansion look like f((((...) + 1) + 1) + 1) instead of what you wrote?

Yes, that's correct. I'd view "f((((...) + 1) + 1) + 1)" as an equivalent way of writing it as a string (along with the definition of f as f(n) = f(n + 1)). "...(((((...) + 1) + 1) + 1) + 1)..." just emphasizes that the expression tree does not have a root - it goes to infinity in both directions. By contrast, the expression tree for f(n) = f(n) + 1 does have a root; it would expand to (((((...) + 1) + 1) + 1) + 1). Does that make sense?
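The difference between the two unfoldings can be checked mechanically. Below is a minimal sketch, assuming the two definitions under discussion are f(n) = f(n + 1) and f(n) = f(n) + 1; the function names `expand_argument` and `expand_value` are made up for this illustration. It only builds the finite prefixes of each expansion as strings, so it says nothing about the infinite limit itself.

```python
def expand_argument(depth: int) -> str:
    """Unfold f(n) = f(n + 1) `depth` times.

    Each step rewrites the argument, so the nesting grows inside f(...)
    and the outermost symbol is always f.
    """
    arg = "n"
    for step in range(depth):
        # The first step gives "n+1"; later steps wrap the previous argument.
        arg = f"{arg}+1" if step == 0 else f"({arg}) + 1"
    return f"f({arg})"


def expand_value(depth: int) -> str:
    """Unfold f(n) = f(n) + 1 `depth` times.

    Each step wraps the whole expression, so the outermost "+ 1" is a
    root that stays fixed while the nesting grows inward.
    """
    expr = "f(n)"
    for step in range(depth):
        expr = f"{expr} + 1" if step == 0 else f"({expr}) + 1"
    return expr


if __name__ == "__main__":
    for d in range(4):
        print(expand_argument(d), "   |   ", expand_value(d))
```

In the first function the text keeps growing between the outer f( and the innermost n, while in the second the outermost "+ 1" is fixed after the first step, which is the "has a root" property mentioned above.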

I read the post and parts of the paper. Here is my understanding: conditions similar to those in Theorem 2 above don't exist, because Alex's paper doesn't take an arbitrary utility function and prove instrumental convergence; instead, the idea is to set the rewards for the MDP randomly (by sampling i.i.d. from some distribution) and then show that in most cases, the agent seeks "power" (states which allow the agent to obtain high rewards in the future). So it avoids the twitching robot not by saying that it can't make use of additional resources, but by sa

...
2 Alex Turner 3y
That's right; that would prove too much. Yeah, although note that I proved asymptotic instrumental convergence for typical functions under iid reward sampling assumptions at each state, so I think there's wiggle room to say "but the reward functions we provide aren't drawn from this distribution!". I personally think this doesn't matter much, because the work still tells us a lot about the underlying optimization pressures. The result is also true in the general case of an arbitrary reward function distribution; you just don't know in advance which terminal states the distribution prefers.

Can you say more about Alex Turner's formalism? For example, are there conditions in his paper or post similar to the conditions I named for Theorem 2 above? If so, what do they say and where can I find them in the paper or post? If not, how does the paper avoid concluding that the twitching robot seeks convergent instrumental goals?

3 Alex Turner 3y
Sure, I can say more about Alex Turner's formalism! The theorems show that, with respect to some distribution of reward functions and in the limit of farsightedness (as the discount rate goes to 1), the optimal policies under this distribution tend to steer towards parts of the future which give the agent access to more terminal states. Of course, there exist reward functions for which twitching or doing nothing is optimal. The theorems say that most reward functions aren't like this. I encourage you to read the post and/or paper; it's quite different from the one you cited in that it shows how instrumental convergence and power-seeking arise from first principles. Rather than assuming "resources" exist, whatever that means, resource acquisition is explained as a special case of power-seeking. ETA: Also, my recently completed sequence focuses on formally explaining and deeply understanding why catastrophic behavior seems to be incentivized. In particular, see The Catastrophic Convergence Conjecture.
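To get a feel for the "more terminal states" claim, here is a toy Monte Carlo sketch. It is not the formalism from the paper: it assumes a one-step choice between two branches, each leading to a set of terminal states whose rewards are drawn i.i.d. from Uniform(0, 1), and an agent that only cares about the best terminal reward it can reach (a stand-in for the farsighted limit). The function name and parameters are hypothetical.

```python
import random


def fraction_preferring_larger_branch(
    small: int, large: int, trials: int, seed: int = 0
) -> float:
    """Estimate how often the optimal first move enters the `large` branch.

    Assumptions of this toy model (not taken from the paper):
    - the small branch reaches `small` terminal states, the large one `large`;
    - every terminal reward is sampled i.i.d. from Uniform(0, 1);
    - the optimal policy heads for the single best reachable terminal state.
    """
    rng = random.Random(seed)  # seeded so the estimate is reproducible
    wins = 0
    for _ in range(trials):
        best_small = max(rng.random() for _ in range(small))
        best_large = max(rng.random() for _ in range(large))
        if best_large > best_small:
            wins += 1
    return wins / trials


if __name__ == "__main__":
    # By symmetry the best of 4 i.i.d. rewards lies in the 3-state branch
    # with probability 3/4, so the estimate should be close to 0.75.
    print(fraction_preferring_larger_branch(small=1, large=3, trials=20_000))
```

Under these assumptions the branch with three terminal states is optimal for about three quarters of sampled reward functions, while for the remaining quarter staying in the small branch is optimal, which matches the "most reward functions, not all" shape of the statement.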

One additional source that I found helpful to look at is the paper "Formalizing Convergent Instrumental Goals" by Tsvi Benson-Tilsen and Nate Soares, which tries to formalize Omohundro's instrumental convergence idea using math. I read the paper quickly and skipped the proofs, so I might have misunderstood something, but here is my current interpretation.

The key assumptions seem to appear in the statement of Theorem 2; these assumptions state that using additional resources will allow the agent to implement a strategy that gives it strictly higher utility

...
2 Rohin Shah 3y
Yeah, that upshot sounds pretty reasonable to me. (Though idk if it's reasonable to think of that as endorsed by "all of MIRI".) Note that this requires the utility function to be completely indifferent to humans (or actively against them).
3 Matthew Barnett 3y
See also Alex Turner's work on formalizing instrumentally convergent goals, and his walkthrough of the MIRI paper.

Rohin Shah told me something similar.

This quote seems to be from Rob Bensinger.

I'm confused about what it means for a hypothesis to "want" to score better, to change its predictions to get a better score, to print manipulative messages, and so forth. In probability theory each hypothesis is just an event, so is static, cannot perform actions, etc. I'm guessing you have some other formalism in mind but I can't tell what it is.

3 Abram Demski 3y
Yeah, in probability theory you don't have to worry about how everything is implemented. But for implementations of Bayesian modeling with a rich hypothesis class, each hypothesis could be something like a blob of code which actually does a variety of things. As for "want", sorry for using that without unpacking it. What it specifically means is that hypotheses like that will have a tendency to get more probability weight in the system, so if we look at the weighty (and thus influential) hypotheses, they are more likely to implement strategies which achieve those ends.
4 Matthew "Vaniver" Gray 3y
I interpreted it as an ensemble of expert models, weighted in a Bayesian fashion based on past performance. But because of the diagnostic logs, the type signature is a little different; the models output both whatever probability distributions over queries / events and arbitrary text in some place. Then there's a move that I think of as the 'intentional stance move', where you look at a system that rewards behavior of a particular type (when updating the weights based on past success, you favor predictions that thought an event was more likely than its competitors did), and so pretend that the things in the system "want" to do the behavior that's rewarded. [Like, even in this paragraph, 'reward' is this sort of mental shorthand; it's not like any of the models have an interior preference to have high weight in the ensemble, it's just that the ensemble's predictions are eventually more like the predictions of the models that did things that happened to lead to having higher weight.]
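A minimal sketch of the weighting described above, assuming a finite set of hypotheses that each report the probability they assigned to the event that actually happened. The function name `update_weights` is hypothetical, and the diagnostic-text channel is left out entirely.

```python
from typing import List


def update_weights(weights: List[float], likelihoods: List[float]) -> List[float]:
    """One Bayesian update of ensemble weights.

    weights[i]     -- current weight of hypothesis i (sums to 1)
    likelihoods[i] -- probability hypothesis i gave to the observed event

    Hypotheses that thought the observed event was more likely than their
    competitors did end up with a larger share of the total weight.
    """
    unnormalized = [w * p for w, p in zip(weights, likelihoods)]
    total = sum(unnormalized)
    if total == 0:
        raise ValueError("every hypothesis assigned probability 0 to the event")
    return [u / total for u in unnormalized]


if __name__ == "__main__":
    weights = [0.5, 0.5]
    # Hypothesis 0 gave the observed event probability 0.9, hypothesis 1 gave 0.6.
    for _ in range(3):
        weights = update_weights(weights, [0.9, 0.6])
        print(weights)
```

Nothing in this loop "wants" anything: the weight on hypothesis 0 rises only because the update rule multiplies in a larger likelihood, which is the sense of "reward" used as shorthand above.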

To me, it seems like the two distinctions are different. There seem to be three levels to distinguish:

  1. The reward (in the reinforcement learning sense) or the base objective (example: inclusive genetic fitness for humans)
  2. A mechanism in the brain that dispenses pleasure or provides a proxy for the reward (example: pleasure in humans)
  3. The actual goal/utility that the agent ends up pursuing (example: a reflective equilibrium for some human's values, which might have nothing to do with pleasure or inclusive genetic fitness)

The base objective vs mesa-object

...

Like Wei Dai, I am also finding this discussion pretty confusing. To summarize my state of confusion, I came up with the following list of ways in which preferences can be short or long:

  1. time horizon and time discounting: how far in the future is the preference about? More generally, how much weight do we place on the present vs the future?
  2. act-based ("short") vs goal-based ("long"): using the human's (or more generally, the human-plus-AI-assistants'; see (6) below) estimate of the value of the next action (act-based) or doing more open-ended optimization
...

By "short" I mean short in sense (1) and (2). "Short" doesn't imply anything about senses (3), (4), (5), or (6) (and "short" and "long" don't seem like good words to describe those axes, though I'll keep using them in this comment for consistency).

By "preferences-on-reflection" I mean long in sense (3) and neither in sense (6). There is a hypothesis that "humans with AI help" is a reasonable way to capture preferences-on-reflection, but they aren't defined to be the same. I don...

6 Wei Dai 4y
(BTW Paul, if you're reading this, Issa and I and a few others have been chatting about this on MIRIxDiscord. I'm sure you're more than welcome to join if you're interested, but I figured you probably don't have time for it. PM me if you do want an invite.) Issa, I think my current understanding of what Paul means is roughly the same as yours, and I also share your confusion about “the user-on-reflection might be happy with the level of corrigibility, but the user themselves might be unhappy”. To summarize my own understanding (quoting myself from the Discord), what Paul means by "satisfying short-term preferences-on-reflection" seems to cash out as "do the action for which the AI can produce an explanation such that a hypothetical human would evaluate it as good (possibly using other AI assistants), with the evaluation procedure itself being the result of a hypothetical deliberation which is controlled by the preferences-for-deliberation that the AI learned/inferred from a real human." (I still have other confusions around this. For example is the "hypothetical human" here (the human being predicted in Issa's 3) a hypothetical end user evaluating the action based on what they themselves want, or is it a hypothetical overseer evaluating the action based on what the overseer thinks the end user wants? Or is the "hypothetical human" just a metaphor for some abstract, distributed, or not recognizably-human deliberative/evaluative process at this point?) I think maybe it would make sense to further break (6) down into 2 sub-dimensions: (6a) understandable vs evaluable and (6b) how much AI assistance. "Understandable" means the human achieves an understanding of the (outer/main) AI's rationale for action within their own brain, with or without (other) AI assistance (which can for example answer questions for the human or give video lectures, etc.). And "evaluable" means the human runs or participates in a procedure that returns a score for how good the action is, but

Thanks. It looks like all the realistic examples I had of weak HCH are actually examples of strong HCH after all, so I'm looking for some examples of weak HCH to help my understanding. I can see how weak HCH would compute the answer to a "naturally linear recursive" problem (like computing factorials) but how would weak HCH answer a question like "Should I get laser eye surgery?" (to take an example from here). The natural way to decompose a problem like this seems to use branching.

Also, I just looked again at Alex Zhu's FAQ for Paul's agenda, and Alex's e

...
4 Rohin Shah 4y
I just looked at both that and the original strong HCH post, and I think Alex's explanation is right and I was mistaken: both weak and strong HCH use tree recursion. Edited the original answer to not talk about linear / tree recursion.
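For readers unsure about the linear / tree recursion distinction that came up in this thread, here is a generic sketch. It is not a model of HCH; it only contrasts a computation where each call makes one sub-call (factorial) with one where each call makes several (a branching decomposition). The function names and the `branching` parameter are made up for the illustration.

```python
def factorial(n: int) -> int:
    """Linear recursion: each call makes at most one recursive sub-call."""
    return 1 if n <= 1 else n * factorial(n - 1)


def count_tree_calls(depth: int, branching: int) -> int:
    """Tree recursion: each call issues `branching` sub-calls until depth 0.

    Returns the total number of calls made, i.e. the number of nodes in
    the call tree, to show how quickly a branching decomposition grows.
    """
    if depth == 0:
        return 1
    return 1 + sum(
        count_tree_calls(depth - 1, branching) for _ in range(branching)
    )


if __name__ == "__main__":
    print(factorial(5))            # a chain of 5 calls
    print(count_tree_calls(3, 3))  # 1 + 3 + 9 + 27 nodes
```

A question split into three sub-questions at each level, as in the "separately solve A, B, and C" example, has the second shape: the number of calls grows geometrically with depth rather than linearly.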

Thanks! I found this answer really useful.

I have some follow-up questions that I'm hoping you can answer:

  1. I didn't realize that weak HCH uses linear recursion. On the original HCH post (which is talking about weak HCH), Paul talks in comments about "branching factor", and Vaniver says things like "So he asks HCH to separately solve A, B, and C". Are Paul/Vaniver talking about strong HCH here, or am I wrong to think that branching implies tree recursion? If Paul/Vaniver are talking about weak HCH, and branching does imply tree recursion, then it seems like
...
4 Rohin Shah 4y
ETA: This paragraph is wrong, see rest of comment thread. I'm pretty sure they're talking about strong HCH. At the time those comments were written (and now), HCH basically always referred to strong HCH, and it was rare for people to talk about weak HCH. (This is mentioned at the top of the strong HCH post.) That's correct. There are two different meanings of the word "agent" here. First, there's the BaseAgent, the one which we start with that's already human-level (both HCH and IDA assume access to such a BaseAgent). Then, there's the PowerfulAgent, which is the system as a whole, which can be superhuman. With HCH, the entire infinite tree of BaseAgents together makes up the PowerfulAgent. With IDA / recursive reward modeling, either the neural net (output of distillation) or the one-level tree (output of amplification) could be considered the PowerfulAgent. Initially, the PowerfulAgent is only as capable as the BaseAgent, but as training progresses, it becomes more capable, reaching the HCH-PowerfulAgent in the limit. For all methods, the BaseAgent is human level, and the PowerfulAgent is (eventually) superhuman. Yes, my bad. (Technically, I was talking about both the annotated functional programming and the level of indirection, I think.) I was restricting attention to imitation-based IDA, which is the canonical example, and what various explainer articles are explaining.

The link no longer works (I get "This project has not yet been moved into the new version of Overleaf. You will need to log in and move it in order to continue working on it.") Would you be willing to re-post it or move it so that it is visible?

See if this works.

My solution for #3:

Define g: [0,1] → R by g(x) = f(x) - x. We know that g is continuous because f and the identity map both are, and by the limit laws. Applying the intermediate value theorem (problem #2) we see that there exists x in [0,1] such that g(x) = 0. But this means f(x) = x, so we are done.

Counterexample for the open interval: consider defined by . First, we can verify that if then , so indeed maps to . To see that there is no fixed point, note that the only solution to in is , which is no

...
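The intermediate-value argument for #3 also suggests a way to locate a fixed point numerically. The sketch below takes the auxiliary function to be g(x) = f(x) - x, which is an assumption of this illustration, and bisects on its sign; `find_fixed_point` and `tol` are hypothetical names. It assumes f is continuous and maps [0,1] into [0,1], so that g(0) >= 0 and g(1) <= 0.

```python
import math
from typing import Callable


def find_fixed_point(f: Callable[[float], float], tol: float = 1e-9) -> float:
    """Approximate a fixed point of a continuous f: [0,1] -> [0,1] by bisection.

    Works on g(x) = f(x) - x. Because f maps into [0,1], g(0) = f(0) >= 0
    and g(1) = f(1) - 1 <= 0, so the loop can keep the invariant
    g(lo) >= 0 >= g(hi) while halving the interval.
    """
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if f(mid) - mid >= 0:
            lo = mid  # sign change still lies in [mid, hi]
        else:
            hi = mid  # sign change lies in [lo, mid]
    return (lo + hi) / 2


if __name__ == "__main__":
    # cos maps [0,1] into [cos(1), 1], which is inside [0,1].
    x = find_fixed_point(math.cos)
    print(x, math.cos(x))
```

The closed endpoints matter for the invariant: on an open interval there is no guarantee that g takes both signs, which is where the counterexample comes in.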

Yeah, I did the same thing :)

Putting it right after #2 was highly suggestive - I wonder if this means there's some very different route I would have thought of instead, absent the framing.

EDIT: I've got another framing that I thought would be more useful for later problems, but I was wrong. I still think there is some value in understanding this proof as well.

In particular, look at this diagram on Wikipedia. It would be better if the whole upper triangle were blue and the whole lower triangle were red instead of just one side (you can arbitrarily decide whether to paint the rest of the diagonal blue or red). If x=0 and x=1 aren't fixed points, then they must be blue and red respectively. If we split [0,1] into n components of size ...

Here is my attempt, based on Hoagy's proof.

Let be an integer. We are given that and . Now consider the points in the interval . By 1-D Sperner's lemma, there are an odd number of such that and (i.e. an odd number of "segments" that begin below zero and end up above zero). In particular, is an even number, so there must be at least one such number . Choose the smallest and call this number .

Now consider the sequence . Since this sequence takes values in

...

I'm having trouble understanding why we can't just fix n=2 in your proof. Then at each iteration we bisect the interval, so we wouldn't be using the "full power" of the 1-D Sperner's lemma (we would just be using something close to the base case).

Also if we are only given that f is continuous, does it make sense to talk about the gradient?

Yeah, agreed. In fact I don't think you even need to continually bisect; you can just increase n indefinitely. Iterating becomes more dangerous as you move to higher dimensions because an n-dimensional simplex with n+1 colours that has been coloured according to analogous rules doesn't necessarily contain the point that maps to zero.

On the second point, yes, I'd been assuming that a bounded function had a bounded gradient, which certainly isn't true for, say, sin(x^2). The final step needs more work; I like the way you did it in the proof below.
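The "just increase n" idea can be tried out numerically. This is a sketch under stated assumptions, not the proof itself: it assumes f is continuous on [0,1] with f(0) < 0 < f(1), splits the interval into n equal segments, and returns the first segment that starts below zero and ends at or above zero. The name `first_sign_change` and the choice of a non-strict inequality on the right endpoint are mine.

```python
from typing import Callable


def first_sign_change(f: Callable[[float], float], n: int) -> int:
    """Return the smallest j with f(j/n) < 0 <= f((j+1)/n).

    Assumes f(0) < 0 < f(1). Under that assumption at least one such j
    exists: the last grid point where f is negative is followed by a
    grid point where f is non-negative.
    """
    for j in range(n):
        if f(j / n) < 0 <= f((j + 1) / n):
            return j
    raise ValueError("no sign change found; check that f(0) < 0 < f(1)")


if __name__ == "__main__":
    f = lambda x: x * x - 0.5  # f(0) < 0 < f(1), zero at sqrt(0.5)
    for n in (10, 100, 1000):
        j = first_sign_change(f, n)
        print(n, j / n, (j + 1) / n)
```

For this example the bracketing segment [j/n, (j+1)/n] has width 1/n and shrinks around the zero as n grows; fixing n=2 and recursing on the chosen half gives ordinary bisection instead.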

"I'm having trouble understanding why we can't just fix n=2 in your proof. Then at each iteration we bisect the interval, so we wouldn't be using the "full power" of the 1-D Sperner's lemma (we would just be using something close to the base case)." - You're right, you can prove this without using the full power of Sperner's lemma. I think it becomes more useful for the multi-dimensional case.