Rohin Shah

Research Scientist at DeepMind. Creator of the Alignment Newsletter.


Christiano, Cotra, and Yudkowsky on AI progress

I agree that when you know about a critical threshold, as with nukes or orbits, you can and should predict a discontinuity there. (Sufficient specific knowledge is always going to allow you to outperform a general heuristic.) I think that (a) such thresholds are rare in general and (b) in AI in particular there is no such threshold. (According to me (b) seems like the biggest difference between Eliezer and Paul.) 

Some thoughts on aging:

  • It does in fact seem surprising, given the complexity of biology relative to physics, if there is a single core cause and core solution that leads to a discontinuity.
  • I would a priori guess that there won't be a core solution. (A core cause seems more plausible, and I'll roll with it for now.) Instead, we'd see a sequence of solutions that intervene on the core problem in different ways, each of which leads to some improvement in lifespan, and discovering these at different times leads to a smoother graph.
  • That being said, are people actually putting a lot of effort into solving aging in mice? Everyone seems to constantly be saying that we're putting in almost no effort whatsoever. If that's true then a jumpy graph would be much less surprising.
  • As a more specific scenario, it seems possible that the graph of mouse lifespan over time looks basically flat, because we were making no progress due to putting in ~no effort. I could totally believe in this world that someone puts in some effort and we get a discontinuity, or even that the near-zero effort we're putting in finds some intervention this year (but not in previous years) which then looks like a discontinuity.

If we had a good operationalization, and people were in fact putting in a lot of effort now, I could imagine putting my $100 to your $300 on this (not going beyond 1:3 odds simply because you know way more about aging than I do).

Christiano, Cotra, and Yudkowsky on AI progress

The "continuous view" as I understand it doesn't predict that all straight lines always stay straight. My version of it (which may or may not be Paul's version) predicts that in domains where people are putting in lots of effort to optimize a metric, that metric will grow relatively continuously. In other words, the more effort put in to optimize the metric, the more you can rely on straight lines for that metric staying straight (assuming that the trends in effort are also staying straight).

In its application to AI, this is combined with a prediction that people will in fact be putting lots of effort into making AI systems intelligent / powerful / able to automate AI R&D / etc, before AI has reached a point where it can execute a pivotal act. This second prediction comes for totally different reasons, like "look at what AI researchers are already trying to do" combined with "it doesn't seem like AI is anywhere near the point of executing a pivotal act yet".

(I think on Paul's view the second prediction is also bolstered by observing that most industries / things that had big economic impacts also seemed to have crappier predecessors. This feels intuitive to me but is not something I've checked and so isn't my personal main reason for believing the second prediction.)

One historical example immediately springs to mind where something-I'd-consider-a-Paul-esque-model utterly failed predictively: the breakdown of the Phillips curve.

I'm not very familiar with this (I've only seen your discussion and the discussion in IEM) but it does not seem like the sort of thing where the argument I laid out above would have had a strong opinion. Was the y-axis of the straight line graph a metric that people were trying to optimize? If so, did the change in policy not represent a change in the amount of effort put into optimizing the metric? (I haven't looked at the details here, maybe the answer is yes to both, in which case I would be interested in looking at the details.)

Zooming out a meta-level, I think GDP is a particularly good example of a big aggregate metric which approximately-always looks smooth in hindsight, even when the underlying factors of interest undergo large jumps.

This seems plausible but it also seems like you can apply the above argument to a bunch of other topics besides GDP, like the ones listed in this comment, so it still seems like you should be able to exhibit a failure of the argument on those topics.

Ngo and Yudkowsky on alignment difficulty

We imagine Shah saying: “1. Why will the AI have goals at all?, and 2. If it does have goals, why will its goals be incompatible with human survival? Sure, most goals are incompatible with human survival, but we’re not selecting uniformly from the space of all goals.”

Yeah, that's right. Adapted to the language here, it would be 1. Why would we have a "full and complete" outcome pump, rather than domain-specific outcome pumps whose plans primarily use actions from a certain domain rather than "all possible actions", and 2. Why are the outcomes being pumped incompatible with human survival?

Ngo and Yudkowsky on alignment difficulty

The things AI systems today can do are already hitting pretty narrow targets. E.g., generating English text that is coherent is not something you’d expect from a random neural network. Why is corrigibility so much more of a narrow target than that? (I think Rohin may have said this to me at some point.)

I'll note that this is framed a bit too favorably to me, the actual question is "why is an effective and corrigible system so much more of a narrow target than that?"

Discussion with Eliezer Yudkowsky on AGI interventions

This just doesn't match my experience at all. Looking through my past AI papers, I only see two papers where I could predict the results of the experiments on the first algorithm I tried at the beginning of the project. The first one (benefits of assistance) was explicitly meant to be a "communication" paper rather than a "research" paper (at the time of project initiation, rather than in hindsight). The second one (Overcooked) was writing up results that were meant to be the baselines against which the actual unpredictable research (e.g. this) was going to be measured; it just turned out that that was already sufficiently interesting to the broader community.

(Funny story about the Overcooked paper; we wrote the paper + did the user study in ~two weeks iirc, because it was only two weeks before the deadline that we considered that the "baseline" results might already be interesting enough to warrant a conference paper. It's now my most-cited AI paper.)

(I'm also not actually sure that I would have predicted the Overcooked results when writing down the first algorithm; the conceptual story felt strong but there are several other papers where the conceptual story felt strong but nonetheless the first thing we tried didn't work. And in fact we did have to make slight tweaks, like annealing from self-play to BC-play over the course of training, to get our algorithm to work.)
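The annealing tweak mentioned above can be sketched as a simple schedule. This is an illustrative sketch only: the linear schedule, the `anneal_frac` parameter, and the function name are my assumptions, not the paper's exact recipe.

```python
import random

def partner_is_bc(step, total_steps, anneal_frac=0.5, rng=random):
    """Decide whether this episode's partner is the BC (behavior-cloned) model.

    Linearly anneal from pure self-play (probability 0 at step 0) to pure
    BC-play (probability 1) over the first `anneal_frac` of training.
    """
    p_bc = min(1.0, step / (anneal_frac * total_steps))
    return rng.random() < p_bc

# Early in training the agent almost always self-plays; by the midpoint
# (with anneal_frac=0.5) it always trains against the BC partner.
```

The point of such a schedule is that self-play stabilizes early learning, while later BC-play adapts the agent to a human-like partner.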

A more typical case would be something like Preferences Implicit in the State of the World, where the conceptual idea never changed over the course of the project, but:

  1. The first hacky / heuristic algorithm we wrote down didn't work in some cases. We analyzed it a bunch (via experiments) to figure out what sorts of things it wasn't capturing.
  2. When we eventually had a much more elegant derived-from-math algorithm, I gave a CHAI presentation presenting some experimental results. There were some results I was confused by, where I expected something different from what we got, and I mentioned this. (Specifically these were the results in the case where the robot had a uniform prior over the initial state at time -T). Many people in the room (including at least one person from MIRI) thought for a while and gave their explanation for why this was the behavior you should expect. (I'm pretty sure some even said "this isn't surprising" or something along those lines.) I remained unconvinced. Upon further investigation we found out that one of Ziebart's results that we were using had extremely high variance in our setting, since in our setting we only ever had one initial state, rather than sampling several which would give better coverage of the uniform prior. We derived a better version of Ziebart's result, implemented that, and voila the results were now what I had originally expected.
  3. It took about... 2 weeks (?) between getting this final version of the algorithm and submitting a paper, constituting maybe 15-20% of the total work. Most of that was what I'd call "communication" rather than "research", e.g. creating another environment to better demonstrate the algorithm's properties, writing up the paper clearly, making good figures, etc. Good communication seems clearly worth putting effort into.
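The variance issue in point 2 can be illustrated with a toy Monte Carlo estimate (a hypothetical setup, not the paper's actual computation): an estimator built from a single sampled initial state is far noisier than one that averages many samples from the uniform prior.

```python
import random
import statistics

def estimate(f, n_samples, rng):
    """Monte Carlo estimate of E[f(s)] under a uniform prior on [0, 1)."""
    return sum(f(rng.random()) for _ in range(n_samples)) / n_samples

rng = random.Random(0)
f = lambda s: s * s  # true expectation is 1/3

# Empirical variance of each estimator across many independent trials.
one_sample = [estimate(f, 1, rng) for _ in range(2000)]
many_sample = [estimate(f, 100, rng) for _ in range(2000)]

var_one = statistics.variance(one_sample)
var_many = statistics.variance(many_sample)
# var_one is roughly 100x var_many, since estimator variance shrinks as 1/n.
```

With only one initial state ever observed, you are effectively in the `n_samples=1` regime, which is why a result that holds in expectation can look wildly off in any particular run.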

If you want a deep learning example, consider Learning What To Do by Simulating the Past. The biggest example here is the curriculum -- that was not part of the original pseudocode I had written down and was crucial to get it to work.

You might look at this and think "but the conceptual idea predicted the experiments that were eventually run!" I mean, sure, but then I think your crux is not "were the experiments predictable?"; rather, it's "is there any value in going from a conceptual idea to a working implementation".

It's also pretty easy to predict the results of experiments in a paper, but that's because you have the extra evidence that you're reading a paper. This is super helpful:

  1. The experiments are going to show the algorithm working. They wouldn't have published the paper otherwise.
  2. The introduction, methods, etc are going to tell you exactly what to expect when you get to the experiments. Even if the authors initially thought the algorithm was going to improve the final score in Atari games, if the algorithm instead improved sample efficiency without changing final score, the introduction is going to be about how the algorithm was inspired by sample efficient learning in humans or whatever.

This is also why I often don't report on experiments in papers in the Alignment Newsletter; usually the point is just "yes, the conceptual idea worked".

I don't know if this is actually true, but one cynical take is that people are used to predicting the results of finished ML work, where they implicitly use (1) and (2) above, and incorrectly conclude that the vast majority of ML experiments are ex ante predictable. And now that they have to predict the outcome of Redwood's project, before knowing that a paper will result, they implicitly realize that no, it really could go either way. And so they incorrectly conclude that Redwood's project is a rare unpredictable one among ML experiments.

Discussion with Eliezer Yudkowsky on AGI interventions

That's a good example, thanks :)

EDIT: To be clear, I don't agree with 

But at the same time, I think that Abram wins hands-down on the metric of "progress towards AI alignment per researcher-hour"

but I do think this is a good example of what someone might mean when they say work is "predictable".

Discussion with Eliezer Yudkowsky on AGI interventions

^ This response is great.

I also think I naturally interpreted the terms in Adam's comment as pointing to specific clusters of work in today's world, rather than universal claims about all work that could ever be done. That is, when I see "experimental work and not doing only decision theory and logic", I automatically think of "experimental work" as pointing to a specific cluster of work that exists in today's world (which we might call mainstream ML alignment), rather than "any information you can get by running code". Whereas it seems you interpreted it as something closer to "MIRI thinks there isn't any information to get by running code".

My brain insists that my interpretation is the obvious one and is confused about how anyone (within the AI alignment field, who knows about the work that is being done) could interpret it as the latter. (Although the existence of non-public experimental work that isn't mainstream ML is a good candidate for how you would start to interpret "experimental work" as the latter.) But this seems very plausibly a typical mind fallacy.

EDIT: Also, to explicitly say it, sorry for misunderstanding what you were trying to say. I did in fact read your comments as saying "no, MIRI is not categorically against mainstream ML work, and MIRI is not only working on HRAD-ish stuff like decision theory and logic, and furthermore this should be pretty obvious to outside observers", and now I realize that is not what you were saying.

Discussion with Eliezer Yudkowsky on AGI interventions

(Responding to entire comment thread) Rob, I don't think you're modeling what MIRI looks like from the outside very well.

  • There's a lot of public stuff from MIRI on a cluster that has as central elements decision theory and logic (logical induction, Vingean reflection, FDT, reflective oracles, Cartesian Frames, Finite Factored Sets...)
  • There was once an agenda (AAMLS) that involved thinking about machine learning systems, but it was deprioritized, and the people working on it left MIRI.
  • There was a non-public agenda that involved Haskell programmers. That's about all I know about it. For all I know they were doing something similar to the modal logic work I've seen in the past.
  • Eliezer frequently talks about how everyone doing ML work is pursuing dead ends, with the potential exception of Chris Olah. Chris's work is not central to the cluster I would call "experimentalist".
  • There has been one positive comment on the KL-divergence result in summarizing from human feedback. That wasn't the main point of that paper and was an extremely predictable result.
  • There has also been one positive comment on Redwood Research, which was founded by people who have close ties to MIRI. The current steps they are taking are not dramatically different from what other people have been talking about and/or doing.
  • There was a positive-ish comment on aligning narrowly superhuman models, though iirc it gave off more of an impression of "well, let's at least die in a slightly more dignified way".

I don't particularly agree with Adam's comments, but it does not surprise me that someone could come to honestly believe the claims within them.

Discussion with Eliezer Yudkowsky on AGI interventions

That one makes sense (to the extent that Eliezer did confidently predict the results), since the main point of the work was to generate information through experiments. I thought the "predictable" part was also meant to apply to a lot of ML work where the main point is to produce new algorithms, but perhaps it was just meant to apply to things like Ought.

Discussion with Eliezer Yudkowsky on AGI interventions

A confusion: it seems that Eliezer views research that is predictable as basically-useless. I think I don't understand what "predictable" means here. In what sense is expected utility quantilization not predictable?

Maybe the point is that coming up with the concept is all that matters, and the experiments that people usually do don't matter because after coming up with the concept the experiments are predictable? I'm much more sympathetic to that, but then I'm confused why "predictable" implies "useless"; many prosaic alignment papers have as their main contribution a new algorithm, which seems like a similar type of thing as quantilization.
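For reference, the concept at issue is itself a short algorithm: a quantilizer samples from the top q-fraction of a base distribution ranked by utility, rather than taking the argmax. This is a minimal finite-sample sketch; `base_sampler` and `utility` are hypothetical stand-ins for whatever action distribution and utility function one has in mind.

```python
import random

def quantilize(base_sampler, utility, q, n=1000, rng=random):
    """Draw n actions from the base distribution, then pick uniformly
    at random among the top q-fraction as ranked by utility."""
    actions = [base_sampler() for _ in range(n)]
    actions.sort(key=utility, reverse=True)
    top = actions[:max(1, int(q * n))]
    return rng.choice(top)
```

With q = 1 this reduces to sampling from the base distribution, and as q shrinks toward 0 it approaches argmax, trading expected utility against staying close to the (presumed safer) base distribution.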
