All of Richard_Ngo's Comments + Replies

I just stumbled upon the Independence of Pareto-dominated alternatives criterion; does the ROSE value have this property? I'm pattern-matching it as related to disagreement-point invariance, but haven't thought about this at all.

Yeah, I agree I convey the implicit prediction that, even though not all one-month tasks will fall at once, they'll fall closer together than you would otherwise expect if you weren't using this framework.

I think I still disagree with your point, as follows: I agree that AI will soon do passably well at summarizing 10k-word books, because the task is not very "sharp" - i.e. you get gradual rather than sudden returns to skill differences. But I think it will take significantly longer for AI to beat the quality of summary produced by a median expert in 1 month, because that expert's summary will in fact explore a rich, hierarchical, interconnected space of concepts from the novel (novel concepts, if you will).

Seems like there's a bunch of interesting stuff here, though some of it is phrased overly strongly.

E.g. "mechanistic interpretability requires program synthesis, program induction, and/or programming language translation" seems possible but far from obvious to me. In general I think that having a deep understanding of small-scale mechanisms can pay off in many different and hard-to-predict ways. Perhaps it's appropriate to advocate for MI researchers to pay more attention to these fields, but calling this an example of "reinventing", "reframing" or "renami... (read more)

Stephen Casper, 1mo:
Thanks for the comment. This seems completely plausible to me. But I think that it's a little hand-wavy. In general, I perceive the interpretability agendas that don't involve applied work to be this way. Also, few people would deny that basic insights, to the extent that they are truly explanatory, can be valuable. But I think it is at least very non-obvious that they would be differentially useful for safety.

No qualms here. But (1) the point about program synthesis/induction/translation suggests that the toy problems are fundamentally more tractable than real ones. Analogously, imagine saying that having humans write and study simple algorithms for search, modular addition, etc. is part of an agenda for program synthesis. (2) At some point the toy work should lead to competitive engineering work. I think that there has not been a clear trend toward this in the past 6 years with the circuits agenda.

Thanks for the question. It might generalize. My intended point with the Ramanujan paper is that a subnetwork seeming to do something in isolation does not mean that it does that thing in context. Ramanujan et al. weren't interpreting networks, they were just training the networks. So the underlying subnetworks may generalize well, but in this case, this is not interpretability work any more than just gradient-based training of a sparse network is.

My default (very haphazard) answer: 10,000 seconds in a day; we're at 1-second AGI now; I'm speculating 1 OOM every 1.5 years, which suggests that coherence over multiple days is 6-7 years away.

The 1.5 years thing is just a very rough ballpark though, could probably be convinced to double or halve it by doing some more careful case studies.
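To make the arithmetic behind this explicit, here is a minimal sketch of the extrapolation (the 1-OOM-per-1.5-years rate and the 1-second starting point in 2023 are just the ballpark figures from the comments above; note that a literal day is ~86,400 seconds rather than the round 10,000 used above):

```python
# Rough extrapolation of the t-AGI horizon, assuming 1 OOM every 1.5 years
# starting from 1-second AGI in 2023 (ballpark figures from the comments above).
import math

def horizon_seconds(year, start_year=2023, years_per_oom=1.5):
    """Coherence-horizon length t (in seconds) this ballpark would predict for a given year."""
    return 10 ** ((year - start_year) / years_per_oom)

for year in range(2023, 2033):
    t = horizon_seconds(year)
    print(f"{year}: ~{t:,.0f} seconds (~{t / 3600:.1f} hours)")

def years_to_reach(target_seconds, years_per_oom=1.5):
    """Years from 2023 until the horizon reaches a given task length."""
    return math.log10(target_seconds) * years_per_oom

print(years_to_reach(10_000))  # ~6 years, using the round figure of 10^4 seconds for "a day"
print(years_to_reach(86_400))  # ~7.4 years for a literal 24-hour day
```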

Thanks. For the record, my position is that we won't see progress that looks like "For t-AGI, t increases by +1 OOM every X years" but rather that the rate of OOMs per year will start off slow and then accelerate. So e.g. here's what I think t will look like as a function of years:

Year | Richard (?) guess | Daniel guess
2023 | 1 | 5
2024 | 5 | 15
2025 | 25 | 100
2026 | 100 | 2000
2027 | 500 | Infinity (singularity)
2028 | 2,500 |
2029 | 10,000 |
2030 | 50,000 |
2031 | 250,000 |
2032 | 1,000,000 |

I think this partly because of the way I think generalization works (I think e.g. once AIs have gotten... (read more)

Why is it cheating? That seems like the whole point of my framework - that we're comparing what AIs can do in any amount of time to what humans can do in a bounded amount of time.

Whatever. Maybe I was just jumping on an excuse to chit-chat about possible limitations of LLMs :) And maybe I was thread-hijacking by not engaging sufficiently with your post, sorry.

This part you wrote above was the most helpful for me:

if the task is "spend a month doing novel R&D for lidar", then my framework predicts that we'll need 1-month AGI for that

I guess I just want to state my opinion that (1) summarizing a 10,000-page book is a one-month task but could come pretty soon if indeed it’s not already possible, (2) spending a month doing novel R&a... (read more)

But then we could just ask the question: “Can you please pose a question about string theory that no AI would have any prayer of answering, and then answer it yourself?” That’s not cherry-picking, or at least not in the same way.


But can't we equivalently just ask an AI to pose a question that no human would have a prayer of answering in one second? It wouldn't even need to be a trivial memorization thing, it could also be a math problem complex enough that humans can't do it that quickly, or drawing a link between two very different domains of knowledge.

Steve Byrnes, 1mo:
I think the “in one second” would be cheating. The question for Ed Witten didn’t specify “the AI can’t answer it in one second”, but rather “the AI can’t answer it period”. Like, if GPT-4 can’t answer the string theory question in 5 minutes, then it probably can’t answer it in 1000 years either. (If the AI can get smarter and smarter, and figure out more and more stuff, without bound, in any domain, by just running it longer and longer [https://www.lesswrong.com/posts/hvz9qjWyv8cLX9JJR/evolution-provides-no-evidence-for-the-sharp-left-turn?commentId=7yAJbkDtMepxDvcMe], then (1) it would be quite disanalogous to current LLMs [btw I’ve been assuming all along that this post is implicitly imagining something vaguely like current LLMs but I guess you didn’t say that explicitly], (2) I would guess that we’re already past end-of-the-world territory.)

How long would it take (in months) to train a smart recent college graduate with no specialized training in my field to complete this task?


This doesn't seem like a great metric because there are many tasks that a college grad can do with 0 training that current AI can't do, including:

  • Download and play a long video game to completion
  • Read and summarize a whole book
  • Spend a month planning an event

I do think that there's something important about this metric, but I think it's basically subsumed by my metric: if the task is "spend a month doing novel R&D for... (read more)

Steve Byrnes, 1mo:
Ah, that’s helpful, thanks. I think you’re saying “there are questions about string theory whose answers are obvious to Ed Witten because he happened to have thought about them in the course of some unpublished project, but these questions are hyper-specific, so bringing them up at all would be unfair cherry-picking.” But then we could just ask the question: “Can you please pose a question about string theory that no AI would have any prayer of answering, and then answer it yourself?” That’s not cherry-picking, or at least not in the same way. And it points to an important human capability, namely, figuring out which areas are promising and tractable to explore, and then exploring them. Like, if a human wants to make money or do science or take over the world, then they get to pick, endogenously, which areas or avenues to explore.

These are all arguments about the limit; whether or not they're relevant depends on whether they apply to the regime of "smart enough to automate alignment research".

Joe_Collman, 1mo:
Agreed. Are you aware of any work that attempts to answer this question? Does this work look like work on debate? (not rhetorical questions!) My guess is that work likely to address this does not look like work on debate. Therefore my current position remains: don't bother working on debate; rather work on understanding the fundamentals that might tell you when it'll break. The world won't be short of debate schemes. It'll be short of principled arguments for their safe application.

For instance, for debate, one could believe:
1) Debate will work for long enough for us to use it to help find an alignment solution.
2) Debate is a plausible basis for an alignment solution.

I generally don't think about things in terms of this dichotomy. To me, an "alignment solution" is anything that will align an AGI which is then capable of solving alignment for its successor. And so I don't think you can separate these two things.

(Of course I agree that debate is not an arbitrarily scalable alignment solution in the sense that you can just keep training... (read more)

Joe_Collman, 1mo:
Oh, to be clear, with "to help find" I only mean that we expect to make significant progress using debate. If we knew we'd safely make enough progress to get to a solution, then you're quite right that that would amount to (2). (apologies for lack of clarity if this was the miscommunication)

That's the distinction I mean to make between (1) and (2): we need to get to the moon safely. With (1) we have no idea when our rocket will explode. Similarly, we have no idea whether the moon will be far enough to know when our next rocket will explode. (not that I'm knocking robustly getting to the moon safely)

If we had some principled argument telling us how far we could push debate before things became dangerous, that'd be great. I'm claiming that we have no such argument, and that all work on debate (that I'm aware of) stands near-zero chance of finding one. Of course I'm all for work "on debate" that aims at finding that kind of argument - however, I would expect that such work leaves the specifics of debate behind pretty quickly.

To preserve my current shards, I don't need to seek out a huge number of dogs proactively, but rather I just need to at least behave in conformance with the advantage function implied by my value head, which probably means "treading water" and seeing dogs sometimes in situations similar to historical dog-seeing events.

I think this depends sensitively on whether the "actor" and the "critic" in fact have the same goals, and I feel pretty confused about how to reason about this. For example, in some cases they could be two separate models, in which case the c... (read more)

In general if two possible models perform the same, then I expect the weights to drift towards the simpler one. And in this case they perform the same because of deceptive alignment: both are trying to get high reward during training in order to be able to carry out their misaligned goal later on.
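A toy illustration of this drift (a sketch only: "simpler" is stood in for by a smaller L2 norm, and the two "models" are just different parameterizations that fit the data equally well):

```python
# Toy example: many weight settings fit the data equally well (the loss depends only on
# w1 + w2), but weight decay steadily drifts the weights toward the lowest-norm one.
import numpy as np

w = np.array([5.0, -3.0])             # a zero-loss but high-norm solution (w1 + w2 = 2)
target, lr, weight_decay = 2.0, 0.1, 1e-2

for step in range(2001):
    loss_grad = 2 * (w.sum() - target) * np.ones(2)  # gradient of (w1 + w2 - target)^2
    w -= lr * (loss_grad + 2 * weight_decay * w)     # gradient step plus L2 regularization
    if step % 500 == 0:
        print(step, w.round(3), "loss:", float((w.sum() - target) ** 2))

# w drifts from [5, -3] toward [1, 1]: training loss stays near zero while the norm shrinks.
```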

Because of standard deceptive alignment reasons (e.g. "I should make sure gradient descent doesn't change my goal; I should make sure humans continue to trust me").

Alex Turner, 2mo:
I think you don't have to reason like that to avoid getting changed by SGD. Suppose I'm being updated by PPO, with reinforcement events around navigating to see dogs. To preserve my current shards, I don't need to seek out a huge number of dogs proactively, but rather I just need to at least behave in conformance with the advantage function implied by my value head, which probably means "treading water" and seeing dogs sometimes in situations similar to historical dog-seeing events.  Maybe this is compatible with what you had in mind! It's just not something that I think of as "high reward." And maybe there's some self-fulfilling prophecy where we trust models which get high reward, and therefore they want to get high reward to earn our trust... but that feels quite contingent to me.
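For readers less familiar with the machinery referenced here: in an actor-critic method like PPO, the policy update is driven by an advantage estimate from the value head, so "behaving in conformance with the advantage function" roughly means continuing to take actions the critic already rates as at least as good as expected. A minimal schematic (not a full PPO implementation - the networks, environment, and hyperparameters are placeholders, and this uses a plain policy-gradient step rather than PPO's clipped objective):

```python
# Schematic actor-critic update: the policy is reinforced toward actions with positive
# advantage A(s, a) = r + gamma * V(s') - V(s), as estimated by the value head (critic).
import torch
import torch.nn as nn

obs_dim, n_actions, gamma = 8, 4, 0.99
actor = nn.Sequential(nn.Linear(obs_dim, 32), nn.Tanh(), nn.Linear(32, n_actions))
critic = nn.Sequential(nn.Linear(obs_dim, 32), nn.Tanh(), nn.Linear(32, 1))  # value head
optimizer = torch.optim.Adam(list(actor.parameters()) + list(critic.parameters()), lr=3e-4)

def update(obs, action, reward, next_obs):
    value = critic(obs)
    next_value = critic(next_obs).detach()
    advantage = reward + gamma * next_value - value       # TD advantage estimate
    log_prob = torch.log_softmax(actor(obs), dim=-1)[action]
    policy_loss = -log_prob * advantage.detach()          # reinforce positive-advantage actions
    value_loss = advantage.pow(2).mean()                  # regress the value head toward the return
    optimizer.zero_grad()
    (policy_loss + value_loss).backward()
    optimizer.step()

update(torch.randn(obs_dim), action=2, reward=1.0, next_obs=torch.randn(obs_dim))
```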

This doesn't seem implausible. But on the other hand, imagine an agent which goes through a million episodes, and in each one reasons at the beginning "X is my misaligned terminal goal, and therefore I'm going to deceptively behave as if I'm aligned" and then acts perfectly like an aligned agent from then on. My claims then would be:

a) Over many update steps, even a small description length penalty of having terminal goal X (compared with being aligned) will add up.
b) Having terminal goal X also adds a runtime penalty, and I expect that NNs in practice are... (read more)

So I'm imagining the agent doing reasoning like:

Misaligned goal --> I should get high reward --> Behavior aligned with reward function

and then I'm hypothesizing that whatever the first misaligned goal is, it requires some amount of complexity to implement, and you could just get rid of it and make "I should get high reward" the terminal goal. (I could imagine this being false though depending on the details of how terminal and instrumental goals are implemented.)

I could also imagine something more like:

Misaligned goal --> I should behave in al... (read more)

SoerenMind, 1mo:
The shortest description of this thought doesn't include "I should get high reward" because that's already implied by having a misaligned goal and planning with it.  In contrast, having only the goal "I should get high reward" may add description length like Johannes said. If so, the misaligned goal could well be equally simple or simpler than the high reward goal.
Alex Turner, 2mo:
Why would the agent reason like this? 

Ty for post. Just for reference, does John endorse this summary?

Nate Soares, 2mo:
John said "there was not any point at which I thought my views were importantly misrepresented" when I asked him for comment. (I added this note to the top of the post as a parenthetical; thanks.)

Deceptive alignment doesn't preserve goals.

A short note on a point that I'd been confused about until recently. Suppose you have a deceptively aligned policy which is behaving in aligned ways during training so that it will be able to better achieve a misaligned internally-represented goal during deployment. The misaligned goal causes the aligned behavior, but so would a wide range of other goals (either misaligned or aligned) - and so weight-based regularization would modify the internally-represented goal as training continues. For example, if the misali... (read more)

Alex Turner, 2mo:
Can you say why you think that weight-based regularization would drift the weights to the latter? That seems totally non-obvious to me, and probably false.

Why would alignment with the outer reward function be the simplest possible terminal goal? Specifying the outer reward function in the weights would presumably be more complicated. So one would have to specify a pointer towards it in some way. And it's unclear whether that pointer is simpler than a very simple misaligned goal.

Such a pointer would be simple if the neural network already has a representation of the outer reward function in weights anyway (rather than deriving it at run-time in the activations). But it seems likely that any fixed representati... (read more)

SoerenMind, 2mo:
Interesting point. Though on this view, "Deceptive alignment preserves goals" would still become true once the goal has drifted to some random maximally simple goal for the first time. To be even more speculative: Goals represented in terms of existing concepts could be simple and therefore stable by default. Pretrained models represent all kinds of high-level states, and weight-regularization doesn't seem to change this in practice. Given this, all kinds of goals could be "simple" as they piggyback on existing representations, requiring little additional description length.

Supervised data seems way more fine-grained in what you are getting the AI to do. It's just that supervised fine-tuning is worse.

My (pretty uninformed) guess here is that supervised fine-tuning vs RLHF has relatively modest differences in terms of producing good responses, but bigger differences in terms of avoiding bad responses. And it seems reasonable to model decisions about product deployments as being driven in large part by how well you can get AI not to do what you don't want it to do.

Putting my money where my mouth is: I just uploaded a (significantly revised) version of my Alignment Problem position paper, where I attempt to describe the AGI alignment problem as rigorously as possible. The current version only has "policy learns to care about reward directly" as a footnote; I can imagine updating it based on the outcome of this discussion though.

David Schneider-Joseph, 5mo:
For someone who's read v1 of this paper, what would you recommend as the best way to "update" to v3? Is an entire reread the best approach? [Edit March 11, 2023: Having now read the new version in full, my recommendation to anyone else with the same question is a full reread.]

Note that the "without countermeasures" post consistently discusses both possibilities

Yepp, agreed, the thing I'm objecting to is how you mainly focus on the reward case, and then say "but the same dynamics apply in other cases too..."

I do place a ton of emphasis on the fact that Alex enacts a policy which has the empirical effect of maximizing reward, but that's distinct from being confident in the motivations that give rise to that policy.

The problem is that you need to reason about generalization to novel situations somehow, and in practice that ends up being by reasoning about the underlying motivations (whether implicitly or explicitly).

I strongly disagree with the "best case" thing. Like, policies could just learn human values! It's not that implausible.

If I had to try point to the crux here, it might be "how much selection pressure is needed to make policies learn goals that are abstractly related to their training data, as opposed to goals that are fairly concretely related to their training data?" Where we both agree that there's some selection pressure towards reward-like goals, and it seems like you expect this to be enough to lead policies to behavior that violates all their existi... (read more)

Ajeya Cotra, 6mo:
Yes, sorry, "best case" was oversimplified. What I meant is that generalizing to want reward is in some sense the model generalizing "correctly;" we could get lucky and have it generalize "incorrectly" in an important sense in a way that happens to be beneficial to us. I discuss this a bit more here [https://www.lesswrong.com/posts/pRkFkzwKZ2zfa3R6H/without-specific-countermeasures-the-easiest-path-to#What_if_Alex_has_benevolent_motivations_].

I don't understand why reward isn't something the model has direct access to -- it seems like it basically does?

If I had to say which of us were focusing on abstract vs concrete goals, I'd have said I was thinking about concrete goals and you were thinking about abstract ones, so I think we have some disagreement of intuition here.

Yeah, I don't really agree with this; I think I could pretty easily imagine being an AI system asking the question "How much reward would this episode get if it were sampled for training?" It seems like the intuition this is weird and unnatural is doing a lot of work in your argument, and I don't really share it.

(Written quickly and not very carefully.)

I think it's worth stating publicly that I have a significant disagreement with a number of recent presentations of AI risk, in particular Ajeya's "Without specific countermeasures, the easiest path to transformative AI likely leads to AI takeover", and Cohen et al.'s "Advanced artificial agents intervene in the provision of reward". They focus on policies learning the goal of getting high reward. But I have two problems with this:

  1. I expect "reward" to be a hard goal to learn, because it's a pretty abstract concept a
... (read more)


Ajeya Cotra, 6mo:
Note that the "without countermeasures" post consistently discusses both possibilities (the model cares about reward or the model cares about something else that's consistent with it getting very high reward on the training dataset). E.g. see this paragraph from the above-the-fold intro, as well as the section Even if Alex isn't "motivated" to maximize reward... [https://www.lesswrong.com/posts/pRkFkzwKZ2zfa3R6H/without-specific-countermeasures-the-easiest-path-to#Even_if_Alex_isn_t__motivated__to_maximize_reward__it_would_seek_to_seize_control].

I do place a ton of emphasis on the fact that Alex enacts a policy which has the empirical effect of maximizing reward, but that's distinct from being confident in the motivations that give rise to that policy. I believe Alex would try very hard to maximize reward in most cases, but this could be for either terminal or instrumental reasons.

With that said, for roughly the reasons Paul says above, I think I probably do have a disagreement with Richard -- I think that caring about some version of reward is pretty plausible (~50% or so). It seems pretty natural and easy to grasp to me, and because I think there will likely be continuous online training the argument that there's no notion of reward on the deployment distribution doesn't feel compelling to me.
Lauro Langosco, 6mo:
I agree with your general point here, but I think Ajeya's post actually gets this right, e.g. [quoted excerpts omitted].
Paul Christiano, 6mo:
I'm not very convinced by this comment as an objection to "50% AI grabs power to get reward." (I find it more plausible as an objection to "AI will definitely grab power to get reward.")

This seems to be most of your position but I'm skeptical (and it's kind of just asserted without argument):
  • The data used in training is literally the only thing that AI systems observe, and prima facie reward just seems like another kind of data that plays a similarly central role. Maybe your "unnaturalness" abstraction can make finer-grained distinctions than that, but I don't think I buy it.
  • If people train their AI with RLDT then the AI is literally trained to predict reward! I don't see how this is remote, and I'm not clear if your position is that e.g. the value function will be bad at predicting reward because it is an "unnatural" target for supervised learning.
  • I don't understand the analogy with humans. It sounds like you are saying "an AI system selected based on the reward of its actions learns to select actions it expects to lead to high reward" is analogous to "humans care about the details of their reward circuitry." But:
    • I don't think human learning is just RL based on the reward circuit; I think this is at least a contrarian position and it seems unworkable to me as an explanation of human behavior.
    • It seems like the analogous conclusion for RL systems would be "they may not care about the rewards that go into the SGD update, they may instead care about the rewards that get entered into the dataset, or even something further causally upstream of that as long as it's very well-correlated on the training set." But it doesn't matter what we choose that's causally upstream of rewards, as long as it's perfectly correlated on the training set?
    • (Or you could be saying that humans are motivated by pleasure and pain but not the entire suite of things that are upstream of rewar

In general I think it's better to reason in terms of continuous variables like "how helpful is the iterative design loop" rather than "does it work or does it fail"?

My argument is more naturally phrased in the continuous setting, but if I translated it into the binary setting: the problem with your argument is that, conditional on the first claim being wrong, the second is not very action-guiding. E.g. conditional on the first, the most impactful thing is probably to aim towards worlds in which we do hit or miss by a little bit; and that might still be true if it's 5% of worlds rather than 50% of worlds.

johnswentworth, 6mo:
(Thinking out loud here...) In general, I am extremely suspicious of arguments that the expected-impact-maximizing strategy is to aim for marginal improvement (not just in alignment - this is a general heuristic); I think that is almost always false in practice, at least in situations where people bother to explicitly make the claim.

So let's say I were somehow approximately-100% convinced that it's basically possible for iterative design to produce an aligned AI. Then I'd expect AI is probably not an X-risk, but I still want to reduce the small remaining chance of alignment failure. Would I expect that doing more iterative design is the most impactful approach? Most probably not. In that world, I'd expect the risk is dominated by some kind of tail risks which iterative design could maybe handle in principle, but for which iterative design is really not the optimal tool - otherwise they'd already be handled by the default iterative design processes. So I guess at that point I'd be looking at quantitative usefulness of iterative design, rather than binary.

General point: it's just really hard to get a situation where "do marginally more of the thing we already do lots of by default" is the most impactful strategy. In nearly all cases, there will be problems which the things-we-already-do-lots-of-by-default handle relatively poorly, and then we can have much higher impact by using some other kind of strategy which better handles the kind of problems which are relatively poorly handled by default.

Upon further thought, I have another hypothesis about why there seems like a gap here. You claim here that the distribution is bimodal, but your previous claim ("I do in fact think that relying on an iterative design loop fails for aligning AGI, with probability close to 1") suggests you don't actually think there's significant probability on the lower mode, you essentially think it's unimodal on the "iterative design fails" worlds.

I personally disagree with both the "significant probability on both modes, but not in between" hypothesis, and the "unimodal ... (read more)

johnswentworth, 6mo:
Yeah, that's fair. The reason I talked about it that way is that I was trying to give what I consider the strongest/most general argument, i.e. the argument with the fewest assumptions. What I actually think is that:
  • nearly all the probability mass is on worlds where the iterative design loop fails to align AGI, but...
  • conditional on that being wrong, nearly all the probability mass is on the number of bits of optimization from iterative design resulting from ordinary economic/engineering activity being sufficient to align AGI, i.e. it is very unlikely that adding a few extra bits of qualitatively-similar optimization pressure will make the difference. ("We are unlikely to hit/miss by a little bit" is the more general slogan.)
The second claim would be cruxy if I changed my mind on the first, and requires fewer assumptions, and therefore fewer inductive steps from readers' pre-existing models.

I think you're just doing the bimodal thing again. Sure, if you condition on worlds in which alignment happens automagically, then it's not valuable to advance the techniques involved. But there's a spectrum of possible difficulty, and in the middle parts there are worlds where RLHF works, but only because we've done a lot of research into it in advance (e.g. exploring things like debate); or where RLHF doesn't work, but finding specific failure cases earlier allowed us to develop better techniques.

johnswentworth, 6mo:
Yeah, ok, so I am making a substantive claim that the distribution is bimodal. (Or, more accurately, the distribution is wide and work on RLHF only counterfactually matters if we happen to land in a very specific tiny slice somewhere in the middle.) Those "middle worlds" are rare enough to be negligible; it would take a really weird accident for the world to end up such that the iteration cycles provided by ordinary economic/engineering activity would not produce aligned AI, but the extra iteration cycles provided by research into RLHF would produce aligned AI.

in worlds where iterative design works, we probably survive AGI without anybody (intentionally) thinking about RLHF

In worlds where iterative design works, it works by iteratively designing some techniques. Why wouldn't RLHF be one of them?

In particular, the excerpts/claims from Get What You Measure are pretty cruxy.

It seems pretty odd to explain this by quoting someone who thinks that this effect is dramatically less important than you do (i.e. nowhere near causing a ~100% probability of iterative design failing). Not gonna debate this on the object level,... (read more)

johnswentworth, 6mo:
Wrong question. The point is not that RLHF can't be part of a solution, in such worlds. The point is that working on RLHF does not provide any counterfactual improvement to chances of survival, in such worlds.

Iterative design is something which happens automagically, for free, without any alignment researcher having to work on it. Customers see problems in their AI products, and companies are incentivized to fix them; that's iterative design from human feedback baked into everyday economic incentives. Engineers notice problems in the things they're building, open bugs in whatever tracking software they're using, and eventually fix them; that's iterative design baked into everyday engineering workflows. Companies hire people to test out their products, see what problems come up, then fix them; that's iterative design baked into everyday processes. And to a large extent, the fixes will occur by collecting problem-cases and then training them away, because ML engineers already have that affordance; it's one of the few easy ways of fixing apparent problems in ML systems. That will all happen regardless of whether any alignment researchers work on RLHF.

When I say that "in worlds where iterative design works, we probably survive AGI without anybody (intentionally) thinking about RLHF", that's what I'm talking about. Problems which RLHF can solve (i.e. problems which are easy for humans to notice and then train away) will already be solved by default, without any alignment researchers working on them. So, there is no counterfactual value in working on RLHF, even in worlds where it basically works.

In worlds where the iterative design loop works for alignment, we probably survive AGI. So, if we want to improve humanity’s chances of survival, we should mostly focus on worlds where, for one reason or another, the iterative design loop fails. ... Among the most basic robust design loop failures is problem-hiding. It happens all the time in the real world, and in practice we tend to not find out about the hidden problems until after a disaster occurs. This is why RLHF is such a uniquely terrible strategy: unlike most other alignment schemes, it make

... (read more)
johnswentworth, 6mo:
The argument is not structurally invalid, because in worlds where iterative design works, we probably survive AGI without anybody (intentionally) thinking about RLHF. Working on RLHF does not particularly increase our chances of survival, in the worlds where RLHF doesn't make things worse. That said, I admit that argument is not very cruxy for me. The cruxy part is that I do in fact think that relying on an iterative design loop fails for aligning AGI, with probability close to 1. And I think the various examples/analogies in the post convey my main intuition-sources behind that claim. In particular, the excerpts/claims from Get What You Measure are pretty cruxy.
[comment deleted], 6mo

I believe this because of how the world looks "brittle" (e.g., nanotech exists) and because lots of technological progress seems cognition-constrained (such as, again, nanotech). This is a big part of why I think heavy-precedent-style justifications are doomed.

Apart from nanotech, what are the main examples or arguments you would cite in favor of these claims?

Separately, how close is your conception of nanotech to "atomically precise manufacturing", which seems like Drexler's preferred framing right now?

Vojtech Kovarik, 6mo:
One way in which the world seems brittle / having free energy AI could use to gain advantage: We haven't figured out good communication practices for the digital age. We don't have good collective epistemics. And we don't seem to be on track to have this solved in the next 20 years. As a result I expect that with enough compute and understanding of network science, and perhaps a couple more things, you could sabotage the whole civilization. ("Enough" is meant to stand for "a lot, but within reach of an early AGI". Heck, if Google somehow spent the next 5 years just on that, I would give them fair odds.)
Thomas Kwa, 7mo:
Not Nate or a military historian, but to me it seems pretty likely for a ~100 human-years more technologically advanced actor to get decisive strategic advantage over the world.
  • In military history it seems pretty common for some tech advance to cause one side to get a big advantage. This seems to be true today [https://www.lesswrong.com/posts/bffJJvCC78LZjFa3Z/what-a-20-year-lead-in-military-tech-might-look-like] as well with command-and-control and various other capabilities.
  • I would guess pure fusion weapons [https://en.wikipedia.org/wiki/Pure_fusion_weapon] are technologically possible, which means an AI sophisticated enough to design one can get nukes without uranium.
  • Currently on the cutting edge, the most advanced actors have large multiples over everyone else in important metrics. This is due to either a few years' lead or better research practices still within the human range:
    • SMIC is mass producing the 14nm node whereas Samsung is at 3nm, which is something like 5x better FLOPS/watt.
    • Algorithmic improvements driven by cognitive labor of ML engineers have caused multiple OOM improvement in value/FLOPS.
    • SpaceX gets 10x better cost per ton [https://aerospace.csis.org/data/space-launch-to-low-earth-orbit-how-much-does-it-cost/] to orbit than the next cheapest space launch provider, and this is before Starship. Also their internal costs are lower.
This seems sufficient for "what failure looks like" scenarios, with faster disempowerment through hard takeoff likely to depend on other pathways like nanotech, social engineering, etc. As for the whole argument against "heavy precedent", I'm not convinced either way and haven't thought about it a ton.

A couple of differences between Kolmogorov complexity/Shannon entropy and the loss function of autoregressive LMs (just to highlight them, not trying to say anything you don't already know):

  • The former are (approximately) symmetric, the latter isn't (it can be much harder to predict a string front-to-back than back-to-front; see the sketch below this comment)
  • The former calculate compression values as properties of a string (up to choice of UTM). The latter calculates compression values as properties of a string, a data distribution, and a model (and even then doesn't strictly determine the r
... (read more)
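One rough way to poke at the asymmetry point empirically (a sketch assuming the HuggingFace transformers library and GPT-2; reversing word order is only a crude proxy for back-to-front prediction, since reversed text is also out-of-distribution for the model):

```python
# Rough check that an autoregressive LM assigns different codelengths to a string and its
# (word-)reversal, whereas K(x) and K(reverse(x)) differ by at most an additive constant.
import math
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def codelength_bits(text: str) -> float:
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        mean_nll = model(ids, labels=ids).loss.item()     # mean cross-entropy in nats/token
    return mean_nll * (ids.shape[1] - 1) / math.log(2)    # total bits for the predicted tokens

sentence = "the quick brown fox jumps over the lazy dog"
print(codelength_bits(sentence))
print(codelength_bits(" ".join(reversed(sentence.split()))))  # typically noticeably larger
```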

I agree that we'll have a learning function that works on the data actually input, but it seems strange to me to characterize that learned model as "reflecting back on that data" in order to figure out what it cares about (as opposed to just developing preferences that were shaped by the data).

Eliezer Yudkowsky, 7mo:
The cogitation here is implicitly hypothesizing an AI that's explicitly considering the data and trying to compress it, having been successfully anchored on that data's compression as identifying an ideal utility function.  You're welcome to think of the preferences as a static object shaped by previous unreflective gradient descent; it sure wouldn't arrive at any better answers that way, and would also of course want to avoid further gradient descent happening to its current preferences.

if some kind of compassionate-LDT is a source of hope about not destroying all the value in our universe-share and getting ourselves killed, then it must be hope about us figuring out such a theory and selecting for AGIs that implement it from the start, rather than that maybe an AGI would likely convergently become that way before taking over the world.


I weakly disagree here, mainly because Nate's argument for very high levels of risk goes through strong generalization/a "sharp left turn" towards being much more coherent + goal-directed. So I find it hard... (read more)

Broadly agree with this post. Couple of small things:

Then later, it is smart enough to reflect back on that data and ask: “Were the humans pointing me towards the distinction between goodness and badness, with their training data? Or were they pointing me towards the distinction between that-which-they'd-label-goodness and that-which-they'd-label-badness, with things that look deceptively good (but are actually bad) falling into the former bin?” And to test this hypothesis, it would go back to its training data and find some example bad-but-deceptively-goo

... (read more)
Eliezer Yudkowsky, 7mo:
By "were the humans pointing me towards..." Nate is not asking "did the humans intend to point me towards..." but rather "did the humans actually point me towards..." That is, we're assuming some classifier or learning function that acts upon the data actually input, rather than a successful actual fully aligned works-in-real-life DWIM which arrives at the correct answer given wrong data.
davidad (David A. Dalrymple), 7mo:
For the record, I have a convergently similar intuition [https://twitter.com/davidad/status/1582691978520633344?s=20&t=lXiuvapBGXmQVuXLoODbhg]: FDT removes the Cartesian specialness of the ego at the decision nodes (by framing each decision as a mere logical consequence of an agent-neutral nonphysical fact about FDT itself), but retains the Cartesian specialness of the ego at the utility node(s). I’ve thought about this for O(10 hours), and I also believe it could be crazy, but it does align quite well with the conclusions of Compassionate Moral Realism [https://ndpr.nd.edu/reviews/compassionate-moral-realism/].

I considered this, but it seems like the latter is 4x longer while covering fairly similar content?

Evan Hubinger, 8mo:
I've found that people often really struggle to understand the content from the former but got it when I gave them the latter—and also I think the latter post covers a lot of newer stuff that's not in the old one (e.g. different models of inductive biases).

Given the baseline classifier's 0.003% failure rate, you would have to sample and label 30,000 in-distribution examples to find a failure (which would cost about $10,000). With our tools, our contractors are able to find an adversarial example on the baseline classifier every 13 minutes (which costs about $8 – about 1000x cheaper).

This isn't comparing apples to apples, though? If you asked contractors to find adversarial examples without using the tools, they'd likely find them at a rate much higher than 0.003%.
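Unpacking the quoted numbers (a back-of-envelope sketch; the dollar figures are the ones given in the excerpt, not independent estimates):

```python
# Back-of-envelope version of the quoted cost comparison.
failure_rate = 0.003 / 100        # baseline classifier fails on 0.003% of in-distribution examples
samples_per_failure = 1 / failure_rate
print(f"~{samples_per_failure:,.0f} random samples per failure")          # ~33,000

cost_random_sampling = 10_000     # quoted cost of labeling ~30,000 examples
cost_tool_assisted = 8            # quoted cost per adversarial example (13 contractor-minutes)
print(f"~{cost_random_sampling / cost_tool_assisted:,.0f}x cheaper")      # ~1,250x, i.e. "about 1000x"
```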

dmz, 8mo:
That's right. We did some followup experiments doing the head-to-head comparison: the tools seem to speed up the contractors by 2x for the weak adversarial examples they were finding (and anecdotally speed us up a lot more when we use them to find more egregious failures). See https://www.lesswrong.com/posts/n3LAgnHg6ashQK3fF/takeaways-from-our-robust-injury-classifier-project-redwood#Quick_followup_results; an updated arXiv paper with those experiments is appearing on Monday.

None of the "hard takeoff people" or hard takeoff models predicted or would predict that the sorts of minor productivity advancements we are starting to see would lead to a FOOM by now.

The hard takeoff models predict that there will be less AI-caused productivity advancements before a FOOM than soft takeoff models. Therefore any AI-caused productivity advancements without FOOM are relative evidence against the hard takeoff models.

You might say that this evidence is pretty weak; but it feels hard to discount the evidence too much when there are few concrete claims by hard-takeoff proponents about what advances would surprise them. Everything is kinda prosaic in hindsight.
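To spell out the "relative evidence" framing, here is a minimal Bayes-factor sketch (all probabilities are made-up placeholders, purely to show how "pretty weak" evidence cashes out as a likelihood ratio near 1):

```python
# Illustrative Bayes update: productivity gains without FOOM shift the odds against hard
# takeoff by the ratio of how expected that observation is under each model.
p_obs_given_soft = 0.6   # placeholder: P(this much pre-FOOM productivity gain | soft takeoff)
p_obs_given_hard = 0.4   # placeholder: P(same observation | hard takeoff)
prior_odds_hard = 1.0    # placeholder: even prior odds, hard : soft

likelihood_ratio = p_obs_given_hard / p_obs_given_soft
posterior_odds_hard = prior_odds_hard * likelihood_ratio
print(likelihood_ratio, posterior_odds_hard)  # ~0.67: a real but modest update against hard takeoff
```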

Daniel Kokotajlo, 8mo:
I'm not sure about that actually. Hard takeoff and soft takeoff disagree about the rate of slope change, not about the absolute height of the line. I guess if you are thinking about the "soft takeoff means shorter timelines" then yeah it also means higher AI progress prior to takeoff, and in particular predicts more stuff happening now. But people generally agree that despite that effect, the overall correlation between short timelines and fast takeoff is positive.  Anyhow, even if you are right, I definitely think the evidence is pretty weak. Both sides make pretty much the exact same retrodictions and were in fact equally unsurprised by the last few years. I agree that Yudkowsky deserves spanking for not working harder to make concrete predictions/bets with Paul, but he did work somewhat hard, and also it's not like Paul, Ajeya, etc. are going around sticking their necks out much either. Finding concrete stuff to bet on (amongst this group of elite futurists) is hard. I speak from experience here, I've talked with Paul and Ajeya and tried to find things in the next 5 years we disagree on and it's not easy, EVEN THOUGH I HAVE 5-YEAR TIMELINES. We spent about an hour probably. I agree we should do it more. (Think about you vs. me. We both thought in detail about what our median futures look like. They were pretty similar, especially in the next 5 years!)

Thanks for the comments Vika! A few responses:

It might be good to clarify that this is an example architecture and the claims apply more broadly.

Makes sense, will do.

Phase 1 and 2 seem to map to outer and inner alignment respectively. 

That doesn't quite seem right to me. In particular:

  • Phase 3 seems like the most direct example of inner misalignment; I basically think of "goal misgeneralization" as a more academically respectable way of talking about inner misalignment.
  • Phase 1 introduces the reward misspecification problem (which I treat as synonymous
... (read more)

How? E.g. Jacob left a comment here about his motivations; does that count as a falsification? Or, if you'd say that this is an example of rationalization, then what would the comment need to look like in order to falsify your claim? Does Paul's comment here mentioning the discussions that took place before launching the GPT-3 work count as a falsification? If not, why not?

Jacob's comment does not count, since it's not addressing the "actually consider whether the project will net decrease chance of extinction" or the "could the answer have plausibly been 'no' and then the project would not have happened" part.

Paul's comment does address both of those, especially this part at the end:

To be clear, this is not post hoc reasoning. I talked with WebGPT folks early on while they were wondering about whether these risks were significant, and I said that I thought this was badly overdetermined. If there had been more convincing arg

... (read more)

At a sufficiently high level of abstraction, I agree that "cost of experimenting" could be seen as the core difficulty. But at a very high level of abstraction, many other things could also be seen as the core difficulty, like "our inability to coordinate as a civilization" or "the power of intelligence" or "a lack of interpretability", etc. Given this, John's comment seemed like mainly rhetorical flourishing rather than a contentful claim about the structure of the difficult parts of the alignment problem.

Also, I think that "on our first try" thing isn't ... (read more)

RLHF helps with outer alignment because it leads to rewards which more accurately reflect human preferences than the hard-coded reward functions (including the classic specification gaming examples, but also intrinsic motivation functions like curiosity and empowerment) which are used to train agents in the absence of RLHF.
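For concreteness, the sense in which RLHF "leads to rewards which more accurately reflect human preferences" is that the reward is a model trained on human comparisons rather than a hand-written function. A minimal sketch of that training step (the standard Bradley-Terry preference loss; the reward_model here is a placeholder network operating on placeholder embeddings, not any particular system's):

```python
# Minimal sketch of fitting a reward model from pairwise human preferences
# (Bradley-Terry loss), as opposed to hard-coding a reward function.
import torch
import torch.nn as nn
import torch.nn.functional as F

emb_dim = 64
reward_model = nn.Sequential(nn.Linear(emb_dim, 128), nn.ReLU(), nn.Linear(128, 1))
optimizer = torch.optim.Adam(reward_model.parameters(), lr=1e-4)

def preference_update(chosen_emb, rejected_emb):
    """One step on -log sigmoid(r(chosen) - r(rejected)) for a batch of human comparisons."""
    loss = -F.logsigmoid(reward_model(chosen_emb) - reward_model(rejected_emb)).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Placeholder "embeddings" of two candidate responses; in practice these come from the policy LM.
chosen, rejected = torch.randn(16, emb_dim), torch.randn(16, emb_dim)
print(preference_update(chosen, rejected))
```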

The smiley faces example feels confusing as a "classic" outer alignment problem because AGIs won't be trained on a reward function anywhere near as limited as smiley faces. An alternative like "AGIs are trained on a reward function in wh... (read more)

I think the smiling example is much more analogous than you are making it out to be here. I think the basic argument for "this just encourages taking control of the reward" or "this just encourages deception" goes through the same way.

Like, RLHF is not some magical "we have definitely figured out whether a behavior is really good or bad" signal, it's historically been just some contractors thinking for like a minute about whether a thing is fine. I don't think there is less bayesian evidence conveyed by people smiling (like, the variance in smiling is greater th... (read more)

I take this comment as evidence that John would fail an intellectual turing test for people who have different views than he does about how valuable incremental empiricism is. I think this is an ITT which a lot of people in the broader LW cluster would fail. I think the basic mistake that's being made here is failing to recognize that reality doesn't grade on a curve when it comes to understanding the world - your arguments can be false even if nobody has refuted them. That's particularly true when it comes to very high-level abstractions, like the ones th... (read more)

Comments on parts of this other than the ITT thing (response to the ITT part is here)...

(and in particular the abstraction which it seems John is using, where making progress on outer alignment makes almost no difference to inner alignment)

I don't usually focus much on the outer/inner abstraction, and when I do I usually worry about outer alignment. I consider RLHF to have been negative progress on outer alignment, same as inner alignment; I wasn't relying on that particular abstraction at all.

Historically, the way that great scientists have gotten around

... (read more)

I take this comment as evidence that John would fail an intellectual turing test for people who have different views than he does about how valuable incremental empiricism is.

I don't want to pour a ton of effort into this, but here's my 5-paragraph ITT attempt.

"As an analogy for alignment, consider processor manufacturing. We didn't get to gigahertz clock speed and ten nanometer feature size by trying to tackle all the problems of 10 nm manufacturing processes right out the gate. That would never have worked; too many things independently go wrong to isola... (read more)

Oliver Habryka, 9mo:
I am confused. How does RLHF help with outer alignment? Isn't optimizing for human approval the classical outer-alignment problem? (e.g. tiling the universe with smiling faces) I don't think the argument for RLHF runs through outer alignment. I think it has to run through using it as a lens to study how models generalize, and eliciting misalignment (i.e. the points about empirical data that you mentioned, I just don't understand where the inner/outer alignment distinction comes from in this context)

There's one attitude towards alignment techniques which is something like "do they prevent all catastrophic misalignment?" And there's another which is more like "do they push out the frontier of how advanced our agents need to be before we get catastrophic misalignment?" I don't think the former approach is very productive, because by that standard no work is ever useful. So I tend to focus on the latter, with the theory of victory being "push the frontier far enough that we get a virtuous cycle of automating alignment work".

Joe_Collman, 9mo:
Ok, thanks - I can at least see where you're coming from then. Do you think debate satisfies the latter directly - or is the frontier only pushed out if it helps in the automation process? Presumably the latter (??) - or do you expect e.g. catastrophic out-with-a-whimper dynamics before deceptive alignment?

I suppose I'm usually thinking of a "do we learn anything about what a scalable alignment approach would look like?" framing. Debate doesn't seem to get us much there (whether for scalable ELK, or scalable anything else), unless we can do the automating alignment work thing - and I'm quite sceptical of that (weakly, and with hand-waving).

I can buy a safe-AI-automates-quite-a-bit-of-busy-work argument; once we're talking about AI that's itself coming up with significant alignment breakthroughs, I have my doubts. What we want seems to be unhelpfully complex, such that I expect we need to target our helper AIs precisely (automating capabilities research seems much simpler). Since we currently can't target our AIs precisely, I imagine we'd use (amplified-)human-approval. My worry is that errors/noise compounds for indirect feedback (e.g. deep debate tree on a non-crisp question), and that direct feedback is only as good as our (non-amplified) ability to do alignment research. AI that doesn't have these problems seems to be the already dangerous variety (e.g. AI that can infer our goal and optimize for it).

I'd be more optimistic if I thought we were at/close-to a stage where we know the crisp high-level problems we need to solve, and could ask AI assistants for solutions to those crisp problems. That said, I certainly think it's worth thinking about how we might get to a "virtuous cycle of automating alignment work". It just seems to me that it's bottlenecked on the same thing as our direct attempts to tackle the problem: our depth of understanding.

Huh, I thought you agreed with statements like "if we had many shots at AI Alignment and could get reliable empirical feedback on whether an AI Alignment solution is working, AI Alignment would be much easier".

I agree that having many shots is helpful, but lacking them is not the core difficulty (just as having many shots to launch a rocket doesn't help you very much if you have no idea how rockets work).

I don't really know what "reliable empirical feedback" means in this context - if you have sufficiently reliable feedback mechanisms, then you've solved m... (read more)

I agree that having many shots is helpful, but lacking them is not the core difficulty (just as having many shots to launch a rocket doesn't help you very much if you have no idea how rockets work).

I do really feel like it would have been really extremely hard to build rockets if we had to get it right on the very first try.

I think for rockets the fact that it is so costly to experiment with stuff, explains the majority of the difficulty of rocket engineering. I agree you also have very little chance to build a successful space rocket without having a g... (read more)

I agree that having many shots is helpful, but lacking them is not the core difficulty (just as having many shots to launch a rocket doesn't help you very much if you have no idea how rockets work).

I basically buy that argument, though I do still think lack of shots is the main factor which makes alignment harder than most other technical fields in their preparadigmatic stage.

Alignment as a field is hard precisely because we do not expect to see empirical evidence before it is too late.

I don't think this is the core reason that alignment is hard - even if we had access to a bunch of evidence about AGI misbehavior now, I think it'd still be hard to convert that into a solution for alignment. Nor do I believe we'll see no empirical evidence of power-seeking behavior before it's too late (and I think opinions amongst alignment researchers are pretty divided on this question).

I don't think this is the core reason that alignment is hard - even if we had access to a bunch of evidence about AGI misbehavior now, I think it'd be very hard to convert that into a solution for alignment.

If I imagine that we magically had a boxing setup which let us experiment with powerful AGI alignment without dying, I do agree it would still be hard to solve alignment. But it wouldn't be harder than the core problems of any other field of science/engineering. It wouldn't be unusually hard, by the standards of technical research.

Of course, "empirical ... (read more)

Huh, I thought you agreed with statements like "if we had many shots at AI Alignment and could get reliable empirical feedback on whether an AI Alignment solution is working, AI Alignment would be much easier".

My model is that John is talking about "evidence on whether an AI alignment solution is sufficient", and you understood him to say "evidence on whether the AI Alignment problem is real/difficult". My guess is you both agree on the former, but I am not confident.

No, I'm thinking of cases where Alice>Bob, and trying to gesture towards the distinction between "Bob knows that Alice believes X" and "Bob can use X to make predictions".

For example, suppose that Bob is a mediocre physicist and Alice just invented general relativity. Bob knows that Alice believes that time and space are relative, but has no idea what that means. So when trying to make predictions about physical events, Bob should still use Newtonian physics, even when those calculations require assumptions that contradict Alice's known beliefs.

Abram Demski, 9mo:
I think Bob still doesn't really need a two-part strategy in this case. Bob knows that Alice believes "time and space are relative", so Bob believes this proposition, even though Bob doesn't know the meaning of it. Bob doesn't need any special-case rule to predict Alice. The best thing Bob can do in this case still seems like, predict Alice based off of Bob's own beliefs.

(Perhaps you are arguing that Bob can't believe something without knowing what that thing means? But to me this requires bringing in extra complexity which we don't know how to handle anyway, since we don't have a bayesian definition of "definition" to distinguish "Bob thinks X is true but doesn't know what X means" from a mere "Bob thinks X is true".)

A similar example would be an auto mechanic. You expect the mechanic to do things like pop the hood, get underneath the vehicle, grab a wrench, etc. However, you cannot predict which specific actions are useful for a given situation. We could try to use a two-part model as you suggest, where we (1) maintain an incoherent-but-useful model of car-specific beliefs mechanics have, such as "wrenches are often needed"; (2) use the best of our own beliefs where that model doesn't apply.

However, this doesn't seem like it's ever really necessary or like it saves processing power for bounded reasoners, because we also believe that "wrenches are sometimes useful". This belief isn't specific enough that we could reproduce the mechanic's actions by acting on these beliefs; but, that's fine, that's just because we don't know enough.

(Perhaps you have in mind a picture where we can't let incoherent beliefs into our world-model -- our limited understanding of Alice's physics, or of the mechanic's work, means that we want to maintain a separate, fully coherent world-model, and apply our limited understanding of expert knowledge only as a patch. If this is what you are getting at, this seems reasonable, so long as we can still count the whole resulting thing "m

Interesting post! Two quick comments:

Sometimes we analyze agents from a logically omniscient perspective. ... However, this omniscient perspective eliminates Vingean agency from the picture.

Another example of this happening comes when thinking about utilitarian morality, which by default doesn't treat other agents as moral actors (as I discuss here).

Bob has minimal use for attributing beliefs to Alice, because Bob doesn't think Alice is mistaken about anything -- the best he can do is to use his own beliefs as a proxy, and try to figure out what Alice will

... (read more)
Abram Demski, 9mo:
Interesting point! It sounds to me like you're thinking of cases on my spectrum, somewhere between Alice>Bob and Bob>Alice. If Bob thinks Alice knows strictly more than Bob, then Bob can just use Bob's own beliefs, even when specific-things-Bob-knows-Alice-believes are relevant -- because Bob also already believes those things, by hypothesis. So it's only in intermediate cases that Bob might get a benefit from a split strategy like the one you describe.
  • Another response is "The AI paralyzes your face into smiling." 
    • But this is actually a highly nontrivial claim about the internal balance of value and computation which this reinforcement schedule carves into the AI. Insofar as this response implies that an AI will primarily "care about" literally making you smile, that seems like a highly speculative and unsupported claim about the AI internalizing a single powerful decision-relevant criterion / shard of value, which also happens to be related to the way that humans conceive of the situation (i.e. som
... (read more)
Alex Turner, 10mo:
I don't know? Seems like a representative kind of "potential risk" I've read about before, but I'm not going to go dig it up right now. (My post also isn't primarily about who said what, so I'm confused by your motivation for posting this question?)

I intend to convert this report to a nicely-formatted PDF with academic-style references. Please comment below, or message me, if you're interested in being paid to do this. EDIT: have now hired someone to do it.

More generally, I'll likely make a number of edits over the coming weeks, so comments and feedback would be very welcome.

Robert Kirk, 9mo:
Me, modelling skeptical ML researchers who may read this document: It felt to me that "Large-scale goals are likely to incentivize misaligned power-seeking" and "AGIs' behavior will eventually be mainly guided by goals they generalize to large scales" were the least well-argued sections (in that while reading them I felt less convinced, and the arguments were more hand-wavy than before). In particular, the argument that we won't be able to use other AGIs to help with supervision because of collusion is entirely contained in footnote 22, and doesn't feel that robust to me - or at least it seems easier for a skeptical reader to dismiss that, and hence not think the rest of section 3 is well-founded. Maybe it's worth adding another argument for why we probably can't just use other AGIs to help with alignment, or at least that we don't currently have good proposals for doing so that we're confident will work (e.g. how do we know the other AGIs are aligned and are hence actually helping).

The document also seems to be saying that positive goals won't generalise correctly because we need to get the positive goals exactly correct on the first try. I don't know if that is exactly an argument for why positive goals won't generalise correctly. It feels like this paragraph is trying to preempt the counterargument to this section that goes something like "Why wouldn't we just interactively adjust the objective if we see bad behaviour?", by justifying why we would need to get it right robustly and on the first try and throughout training, because the AGI will stop us doing this modification later on. Maybe it would be better to frame it that way if that was the intention.

Note that I agree with the document and I'm in favour of producing more ML-researcher-accessible descriptions of and motivations for the alignment problem, hence this effort to make the document more robust to skeptical ML researchers.

Just skimmed the course. One suggestion (will make more later): adding the Goal Misgeneralization paper from Langosco et al. as a core reading in the week on Detecting and Forecasting Emergent Behavior.

Hmm, perhaps clearer to say "reward does not automatically reinforce reward-focused thoughts into terminal values", given that we both agree that agents will have thoughts about reward either way.

But if you agree that reward gets reinforced as an instrumental value, then I think your claims here probably need to actually describe the distinction between terminal and instrumental values. And this feels pretty fuzzy - e.g. in humans, I think the distinction is actually not that clear-cut.

In other words, if everyone agrees that reward likely becomes a strong ... (read more)

Amplification can just be used as a method for making more and better common-sense improvements, though. You could also do all sorts of other stuff with it, but standard examples (like "catch agents when they lie to us") seem very much like common-sense improvements.

+1 on this comment, I feel pretty confused about the excerpt from Paul that Steve quoted above. And even without the agent deliberately deciding where to avoid exploring, incomplete exploration may lead to agents which learn non-reward goals before convergence - so if Paul's statement is intended to refer to optimal policies, I'd be curious why he thinks that's the most important case to focus on.
