All of Richard_Ngo's Comments + Replies

Forgot to reply to this at the time, but I think this is a pretty good ITT. (I think there's probably some additional argument that people would make about why this isn't just an isolated analogy, but rather a more generally-applicable argument, but it does seem to be a fairly central example of that generally-applicable argument.)

Why not? It seems like this is a good description of how values change for humans under self-reflection; why not for AIs?

1 Charlie Steiner · 1mo
An AI trained with RL that suddenly gets access to self-modifying actions might (briefly) have value dynamics according to idiosyncratic considerations that do not necessarily contain human-like guardrails. You could call this "systematization," but it's not proceeding according to the same story that governed systematization during training by gradient descent.

I'd classify them as values insofar as people care about them intrinsically.

Then they might also be strategies, insofar as people also care about them instrumentally.

I guess I should get rid of the "only" in the sentence you quoted? But I do want to convey "something which is only a strategy, not a goal or value, doesn't have any intrinsic value". Will think about phrasing.

It's not actually the case that the derivation of a higher abstraction level always changes our lower-level representation. Again, consider people -> social groups -> countries. Our models of specific people we know, how we relate to them, etc., don't change just because we've figured out a way to efficiently reason about entire groups of people at once. We can now make better predictions about the world, yes, we can track the impact of more-distant factors on our friends, but we don't actually start to care about our friends in a different way in th

... (read more)
1 Thane Ruthenis · 1mo
Mm, I'll concede that point. I shouldn't have used people as an example; people are messy. Literal gears, then. Suppose you're studying some massive mechanism. You find gears in it, and derive the laws by which each individual gear moves. Then you grasp some higher-level dynamics, and suddenly understand what function a given gear fulfills in the grand scheme of things. But your low-level model of a specific gear's dynamics didn't change — locally, it was as correct as it could ever be. And if you had a terminal utility function over that gear (e.g., "I want it to spin at the rate of 1 rotation per minute"), that utility function won't change in the light of your model expanding, either. Why would it?

... which can be represented as utility functions. Take a given deontological rule, like "killing is bad". Let's say we view it as a constraint on the allowable actions; or, in other words, a probability distribution over your actions that "predicts" that you're very likely/unlikely to take specific actions. Probability distributions of this form could be transformed into utility functions by reverse-softmaxing them; thus, it's perfectly coherent to model a deontologist as an agent with a lot of separate utility functions. See Friston's predictive-processing framework in neuroscience, plus this (and that comment). Deontologists reject utility-maximization in the sense that they refuse to engage in utility-maximizing calculations using their symbolic intelligence, but similar dynamics are still at play "under the hood".

Well, not a flaw as such; a design choice. Humans are trained in an on-line regime, our values are learned from scratch, and... this process of active value learning just never switches off (although it plausibly slows down with age, see old people often being "set in their ways"). Our values change by the same process by which they were learned to begin with.
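A minimal sketch of the "reverse-softmax" move described above, in notation that is mine rather than Thane's: suppose the constraint is encoded as a softmax-shaped distribution over actions with an assumed temperature β. Then

```latex
\pi(a) \;\propto\; \exp\!\big(U(a)/\beta\big)
\qquad\Longleftrightarrow\qquad
U(a) \;=\; \beta \log \pi(a) + C,
```

so the constraint-distribution determines a utility function up to the positive scale β and an additive constant C, which is the sense in which a deontologist can coherently be modelled as an agent carrying many separate utility functions.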

I agree that this is closely related to the predictive processing view of the brain. In the post I briefly distinguish between "low-level systematization" and "high-level systematization"; I'd call the thing you're describing the former. Whereas the latter seems like it might be more complicated, and rely on whatever machinery brains have on top of the predictive coding (e.g. abstract reasoning, etc).

In particular, some humans are way more systematizing than others (even at comparable levels of intelligence). And so just saying "humans are constantly doing... (read more)

3 Jan_Kulveit · 1mo
My impression is you get a lot of "the latter" if you run "the former" on the domain of language and symbolic reasoning, and often the underlying model is still S1-type. E.g. it does not sound to me like someone did a ton of abstract reasoning to systematize other abstract values, but more like someone succeeded in writing words which resonate with "the former". Also, I'm not sure why you think the latter is more important for the connection to AI. Current ML seems more similar to "the former": informal, intuitive, fuzzy reasoning.

That's interesting - in contrast, I have a pretty clear intuitive sense of a direction where some people have a lot of internal conflict and as a result their actions are less coherent, and some people have less of that. In contrast, in the case of humans who you would likely describe as 'having systematized their values'... I often doubt what's going on. A lot of people who describe themselves as hardcore utilitarians seem to be... actually not that, but more to resemble a system where a somewhat confused verbal part fights with other parts, which are sometimes suppressed.

That's where I think looking at what human brains are doing seems interesting. Even if you believe the low-level / "the former" is not what's going on with human theories of morality, the technical problem seems very similar and the same math possibly applies.

Thanks for the comment! I agree that thinking of minds as hierarchically modeling the world is very closely related to value systematization.

But I think the mistake you're making is to assume that the lower levels are preserved after finding higher-level abstractions. Instead, higher-level abstractions reframe the way we think about lower-level abstractions, which can potentially change them dramatically. This is what happens with most scientific breakthroughs: we start with lower-level phenomena, but we don't understand them very well until we discover th... (read more)

2 Wei Dai · 1mo
Why is this a problem, that calls out to be fixed (hence leading to systematization)? Why not just stick with the default of "go with whichever value/preference/intuition that feels stronger in the moment"? People do that unthinkingly all the time, right? (I have my own thoughts on this, but curious if you agree with me or what your own thinking is.) How would you cash out "don't make sense" here?
1 Kaj Sotala · 1mo
I think we should be careful to distinguish explicit and implicit systematization. Some of what you are saying (e.g. getting answers to questions like "what counts as lying") sounds like you are talking about explicit, consciously done systematization; but some of what you are saying (e.g. minds identifying aspects of thinking that "don't make sense" and correcting them) also sounds like it'd apply more generally to developing implicit decision-making procedures.

I could see the deontologist solving their problem either way - by developing some explicit procedure and reasoning for solving the conflict between their values, or just going by a gut feel for which value seems to make more sense to apply in that situation and the mind then incorporating this decision into its underlying definition of the two values. I don't know how exactly deontological rules work, but I'm guessing that you could solve a conflict between them by basically just putting in a special case for "in this situation, rule X wins over rule Y" - and if you view the rules as regions in state space where the region for rule X corresponds to the situations where rule X is applied, then adding data points about which rule is meant to cover which situation ends up modifying the rule itself. It would also be similar to the way that rules work in skill learning in general, in that experts find the rules getting increasingly fine-grained, implicit and full of exceptions. Here's how Josh Waitzkin describes the development of chess expertise:

"Sitting with paradox, being at peace with and navigating the tension of competing truths, letting go of any notion of solidity" also sounds to me like some of the models for higher stages of moral development, where one moves past the stage of trying to explicitly systematize morality and can treat entire systems of morality as things that all co-exist in one's mind and are applicable in different situations. Which would make sense, if moral reasoning is a skill i
1 Thane Ruthenis · 1mo
Mm, I think there's two things being conflated there: ontological crises (even small-scale ones, like the concept of fitness not being outright destroyed but just re-shaped), and the simple process of translating your preference around the world-model without changing that world-model.

It's not actually the case that the derivation of a higher abstraction level always changes our lower-level representation. Again, consider people -> social groups -> countries. Our models of specific people we know, how we relate to them, etc., don't change just because we've figured out a way to efficiently reason about entire groups of people at once. We can now make better predictions about the world, yes, we can track the impact of more-distant factors on our friends, but we don't actually start to care about our friends in a different way in the light of all this.

In fact: Suppose we've magically created an agent that already starts out with a perfect world-model. It'll never experience an ontology crisis in its life. This agent would still engage in value translation as I'd outlined. If it cares about Alice and Bob, for example, and it's engaging in plotting at the geopolitical scales, it'd still be useful for it to project its care for Alice and Bob into higher abstraction levels, and start e.g. optimizing towards the improvement of the human economy. But optimizing for all humans' welfare would still remain an instrumental goal for it, wholly subordinate to its love for the two specific humans.

I think you do, actually? Inasmuch as real-life deontologists don't actually shut down when facing a values conflict. They ultimately pick one or the other, in a show of revealed preferences. (They may hesitate a lot, yes, but their cognitive process doesn't get literally suspended.) I model this just as an agent having two utility functions, u1 and u2, and optimizing for their sum u1+u2. If the values are in conflict, if taking an action that maximizes u1 hurts u2 and vice versa

I'm very sympathetic to this complaint; I think that these arguments simply haven't been made rigorously, and at this point it seems like Nate and Eliezer are not in an epistemic position where they're capable of even trying to do so. (That is, they reject the conception of "rigorous" that you and I are using in these comments, and therefore aren't willing to formulate their arguments in a way which moves closer to meeting it.)

You should look at my recent post on value systematization, which is intended as a framework in which these claims can be discussed more clearly.

FWIW I think that gradient hacking is pretty plausible, but it'll probably end up looking fairly "prosaic", and may not be a problem even if it's present.

Are you thinking about exploration hacking, here, or gradient hacking as distinct from exploration hacking?

How do you feel about "In an ideal world, we'd stop all AI progress"? Or "ideally, we'd stop all AI progress"?

FWIW I think some of the thinking I've been doing about meta-rationality and ontological shifts feels like metaphilosophy. Would be happy to call and chat about it sometime.

I do feel pretty wary about reifying the label "metaphilosophy" though. My preference is to start with a set of interesting questions which we can maybe later cluster into a natural category, rather than starting with the abstract category and trying to populate it with questions (which feels more like what you're doing, although I could be wrong).

Strong disagree. This seems like very much the wrong type of reasoning to do about novel scientific research. Big breakthroughs open up possibilities that are very hard to imagine before those breakthroughs (e.g. imagine trying to describe the useful applications of electricity before anyone knew what it was or how it worked; or imagine Galileo trying to justify the practical use of studying astronomy).

Interpretability seems like our clear best bet for developing a more principled understanding of how deep learning works; this by itself is sufficient to re... (read more)


Five clusters of alignment researchers

Very broadly speaking, alignment researchers seem to fall into five different clusters when it comes to thinking about AI risk:

  1. MIRI cluster. Think that P(doom) is very high, based on intuitions about instrumental convergence, deceptive alignment, etc. Does work that's very different from mainstream ML. Central members: Eliezer Yudkowsky, Nate Soares.
  2. Structural risk cluster. Think that doom is more likely than not, but not for the same reasons as the MIRI cluster. Instead, this cluster focuses on systemic risks, multi-a
... (read more)

"I don't think inserting Knightian uncertainty is that helpful; the object-level stuff is usually the most important thing to be communicating."

The main point of my post is that accounting for disagreements about Knightian uncertainty is the best way to actually communicate object-level things, since otherwise people get sidetracked by epistemological disagreements.

"I'd follow the policy of first making it common knowledge that you're reporting your inside views"

This is a good step, but one part of the epistemological disagreements I mention above is that ... (read more)

FWIW I think that confrontation-worthy empathy and use of the phrase "everyone will die" to describe AI risk are approximately mutually exclusive with each other, because communication using the latter phrase results from a failure to understand communication norms.

(Separately I also think that "if we build AGI, everyone will die" is epistemically unjustifiable given current knowledge. But the point above still stands even if you disagree with that bit.)

3 Tsvi Benson-Tilsen · 5mo
What I mean by confrontation-worthy empathy is about that sort of phrase being usable. I mean, I'm not saying it's the best phrase, or a good phrase to start with, or whatever. I don't think inserting Knightian uncertainty is that helpful; the object-level stuff is usually the most important thing to be communicating. This maybe isn't so related to what you're saying here, but I'd follow the policy of first making it common knowledge that you're reporting your inside views (which implies that you're not assuming that the other person would share those views); and then you state your inside views. In some scenarios you describe, I get the sense that Person 2 isn't actually wanting Person 1 to say more modest models, they're wanting common knowledge that they won't already share those views / won't already have the evidence that should make them share those views.

I just stumbled upon the Independence of Pareto dominated alternatives criterion; does the ROSE value have this property? I'm pattern-matching it as related to disagreement-point invariance, but haven't thought about this at all.

Yeah, I agree I convey the implicit prediction that, even though not all one-month tasks will fall at once, they'll fall closer together than you would otherwise expect if you weren't using this framework.

I think I still disagree with your point, as follows: I agree that AI will soon do passably well at summarizing 10k word books, because the task is not very "sharp" - i.e. you get gradual rather than sudden returns to skill differences. But I think it will take significantly longer for AI to beat the quality of summary produced by a median expert in 1 month, because that expert's summary will in fact explore a rich hierarchical interconnected space of concepts from the novel (novel concepts, if you will).

Seems like there's a bunch of interesting stuff here, though some of it is phrased overly strongly.

E.g. "mechanistic interpretability requires program synthesis, program induction, and/or programming language translation" seems possible but far from obvious to me. In general I think that having a deep understanding of small-scale mechanisms can pay off in many different and hard-to-predict ways. Perhaps it's appropriate to advocate for MI researchers to pay more attention to these fields, but calling this an example of "reinventing", "reframing" or "renami... (read more)

2 Stephen Casper · 7mo
Thanks for the comment. This seems completely plausible to me. But I think that it's a little hand-wavy. In general, I perceive the interpretability agendas that don't involve applied work to be this way. Also, few people would dispute that basic insights, to the extent that they are truly explanatory, can be valuable. But I think it is at least very non-obvious that it would be differentially useful for safety.

No qualms here. But (1) the point about program synthesis/induction/translation suggests that the toy problems are fundamentally more tractable than real ones. Analogously, imagine saying that having humans write and study simple algorithms for search, modular addition, etc. is part of an agenda for program synthesis. (2) At some point the toy work should lead to competitive engineering work. I think that there has not been a clear trend toward this in the past 6 years with the circuits agenda.

Thanks for the question. It might generalize. My intended point with the Ramanujan paper is that a subnetwork seeming to do something in isolation does not mean that it does that thing in context. Ramanujan et al. weren't interpreting networks, they were just training the networks. So the underlying subnetworks may generalize well, but in this case, this is not interpretability work any more than just gradient-based training of a sparse network is.

My default (very haphazard) answer: 10,000 seconds in a day; we're at 1-second AGI now; I'm speculating 1 OOM every 1.5 years, which suggests that coherence over multiple days is 6-7 years away.

The 1.5 years thing is just a very rough ballpark though, could probably be convinced to double or halve it by doing some more careful case studies.
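For concreteness, the back-of-the-envelope arithmetic behind that guess, using the rough numbers above (a day treated as ~10^4 seconds, and an assumed steady 1 OOM every 1.5 years):

```python
import math

# Illustrative only: the rough numbers quoted above, not a forecast model.
current_horizon_s = 1         # "1-second AGI" now
day_horizon_s = 10_000        # a day, treated as ~10^4 seconds
ooms_needed = math.log10(day_horizon_s / current_horizon_s)   # 4 OOMs
years_per_oom = 1.5
print(ooms_needed * years_per_oom)  # 6.0 years; "multiple days" pushes this toward 6-7
```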

Thanks. For the record, my position is that we won't see progress that looks like "For t-AGI, t increases by +1 OOM every X years" but rather that the rate of OOMs per year will start off slow and then accelerate. So e.g. here's what I think t will look like as a function of years:

| Year | Richard (?) guess | Daniel guess |
|------|-------------------|--------------|
| 2023 | 1 | 5 |
| 2024 | 5 | 15 |
| 2025 | 25 | 100 |
| 2026 | 100 | 2,000 |
| 2027 | 500 | Infinity (singularity) |
| 2028 | 2,500 | |
| 2029 | 10,000 | |
| 2030 | 50,000 | |
| 2031 | 250,000 | |
| 2032 | 1,000,000 | |

I think this partly because of the way I think generalization works (I think e.g. once AIs have gotten... (read more)

Why is it cheating? That seems like the whole point of my framework - that we're comparing what AIs can do in any amount of time to what humans can do in a bounded amount of time.

Whatever. Maybe I was just jumping on an excuse to chit-chat about possible limitations of LLMs :) And maybe I was thread-hijacking by not engaging sufficiently with your post, sorry.

This part you wrote above was the most helpful for me:

if the task is "spend a month doing novel R&D for lidar", then my framework predicts that we'll need 1-month AGI for that

I guess I just want to state my opinion that (1) summarizing a 10,000-page book is a one-month task but could come pretty soon if indeed it’s not already possible, (2) spending a month doing novel R&D... (read more)

But then we could just ask the question: “Can you please pose a question about string theory that no AI would have any prayer of answering, and then answer it yourself?” That’s not cherry-picking, or at least not in the same way.


But can't we equivalently just ask an AI to pose a question that no human would have a prayer of answering in one second? It wouldn't even need to be a trivial memorization thing, it could also be a math problem complex enough that humans can't do it that quickly, or drawing a link between two very different domains of knowledge.

3 Steve Byrnes · 7mo
I think the “in one second” would be cheating. The question for Ed Witten didn’t specify “the AI can’t answer it in one second”, but rather “the AI can’t answer it period”. Like, if GPT-4 can’t answer the string theory question in 5 minutes, then it probably can’t answer it in 1000 years either. (If the AI can get smarter and smarter, and figure out more and more stuff, without bound, in any domain, by just running it longer and longer, then (1) it would be quite disanalogous to current LLMs [btw I’ve been assuming all along that this post is implicitly imagining something vaguely like current LLMs but I guess you didn’t say that explicitly], (2) I would guess that we’re already past end-of-the-world territory.)

How long would it take (in months) to train a smart recent college graduate with no specialized training in my field to complete this task?


This doesn't seem like a great metric because there are many tasks that a college grad can do with 0 training that current AI can't do, including:

  • Download and play a long video game to completion
  • Read and summarize a whole book
  • Spend a month planning an event

I do think that there's something important about this metric, but I think it's basically subsumed by my metric: if the task is "spend a month doing novel R&D for... (read more)

3 Steve Byrnes · 7mo
Ah, that’s helpful, thanks. I think you’re saying “there are questions about string theory whose answers are obvious to Ed Witten because he happened to have thought about them in the course of some unpublished project, but these questions are hyper-specific, so bringing them up at all would be unfair cherry-picking.” But then we could just ask the question: “Can you please pose a question about string theory that no AI would have any prayer of answering, and then answer it yourself?” That’s not cherry-picking, or at least not in the same way. And it points to an important human capability, namely, figuring out which areas are promising and tractable to explore, and then exploring them. Like, if a human wants to make money or do science or take over the world, then they get to pick, endogenously, which areas or avenues to explore.

These are all arguments about the limit; whether or not they're relevant depends on whether they apply to the regime of "smart enough to automate alignment research".

1 Joe_Collman · 7mo
Agreed. Are you aware of any work that attempts to answer this question? Does this work look like work on debate? (not rhetorical questions!) My guess is that work likely to address this does not look like work on debate. Therefore my current position remains: don't bother working on debate; rather work on understanding the fundamentals that might tell you when it'll break. The world won't be short of debate schemes. It'll be short of principled arguments for their safe application.

For instance, for debate, one could believe:
1) Debate will work for long enough for us to use it to help find an alignment solution.
2) Debate is a plausible basis for an alignment solution.

I generally don't think about things in terms of this dichotomy. To me, an "alignment solution" is anything that will align an AGI which is then capable of solving alignment for its successor. And so I don't think you can separate these two things.

(Of course I agree that debate is not an arbitrarily scalable alignment solution in the sense that you can just keep training... (read more)

1 Joe_Collman · 7mo
Oh, to be clear, with "to help find" I only mean that we expect to make significant progress using debate. If we knew we'd safely make enough progress to get to a solution, then you're quite right that that would amount to (2). (apologies for lack of clarity if this was the miscommunication) That's the distinction I mean to make between (1) and (2): we need to get to the moon safely. With (1) we have no idea when our rocket will explode. Similarly, we have no idea whether the moon will be far enough to know when our next rocket will explode. (not that I'm knocking robustly getting to the moon safely) If we had some principled argument telling us how far we could push debate before things became dangerous, that'd be great. I'm claiming that we have no such argument, and that all work on debate (that I'm aware of) stands near-zero chance of finding one. Of course I'm all for work "on debate" that aims at finding that kind of argument - however, I would expect that such work leaves the specifics of debate behind pretty quickly.

To preserve my current shards, I don't need to seek out a huge number of dogs proactively, but rather I just need to at least behave in conformance with the advantage function implied by my value head, which probably means "treading water" and seeing dogs sometimes in situations similar to historical dog-seeing events.

I think this depends sensitively on whether the "actor" and the "critic" in fact have the same goals, and I feel pretty confused about how to reason about this. For example, in some cases they could be two separate models, in which case the c... (read more)

In general if two possible models perform the same, then I expect the weights to drift towards the simpler one. And in this case they perform the same because of deceptive alignment: both are trying to get high reward during training in order to be able to carry out their misaligned goal later on.

Because of standard deceptive alignment reasons (e.g. "I should make sure gradient descent doesn't change my goal; I should make sure humans continue to trust me").

3 Alex Turner · 8mo
I think you don't have to reason like that to avoid getting changed by SGD. Suppose I'm being updated by PPO, with reinforcement events around navigating to see dogs. To preserve my current shards, I don't need to seek out a huge number of dogs proactively, but rather I just need to at least behave in conformance with the advantage function implied by my value head, which probably means "treading water" and seeing dogs sometimes in situations similar to historical dog-seeing events.  Maybe this is compatible with what you had in mind! It's just not something that I think of as "high reward." And maybe there's some self-fulfilling prophecy where we trust models which get high reward, and therefore they want to get high reward to earn our trust... but that feels quite contingent to me.

This doesn't seem implausible. But on the other hand, imagine an agent which goes through a million episodes, and in each one reasons at the beginning "X is my misaligned terminal goal, and therefore I'm going to deceptively behave as if I'm aligned" and then acts perfectly like an aligned agent from then on. My claims then would be:

a) Over many update steps, even a small description length penalty of having terminal goal X (compared with being aligned) will add up.
b) Having terminal goal X also adds a runtime penalty, and I expect that NNs in practice are... (read more)
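One concrete way to picture claim (a), under the assumption (mine, not stated in the thread) that the relevant regularizer behaves like ordinary L2 weight decay on whatever extra parameters encode goal X:

```python
# Illustrative only: a tiny per-step decay on parameters that earn no
# countervailing gradient compounds over many update steps.
weight_decay = 1e-4      # hypothetical decay coefficient
learning_rate = 1e-3     # hypothetical learning rate
steps = 1_000_000

shrink_per_step = 1 - weight_decay * learning_rate
print(shrink_per_step ** steps)  # ~0.905: roughly 10% shrinkage from an individually negligible penalty
```

The exact numbers are made up; the point is only that a per-step penalty which is individually negligible is not negligible once compounded over the full length of training.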

So I'm imagining the agent doing reasoning like:

Misaligned goal --> I should get high reward --> Behavior aligned with reward function

and then I'm hypothesizing that, whatever the first misaligned goal is, it requires some amount of complexity to implement, and you could just get rid of it and make "I should get high reward" the terminal goal. (I could imagine this being false though depending on the details of how terminal and instrumental goals are implemented.)

I could also imagine something more like:

Misaligned goal --> I should behave in al... (read more)

1 SoerenMind · 7mo
The shortest description of this thought doesn't include "I should get high reward" because that's already implied by having a misaligned goal and planning with it.  In contrast, having only the goal "I should get high reward" may add description length like Johannes said. If so, the misaligned goal could well be equally simple or simpler than the high reward goal.
2 Alex Turner · 8mo
Why would the agent reason like this? 

Ty for post. Just for reference, does John endorse this summary?

4 Nate Soares · 8mo
John said "there was not any point at which I thought my views were importantly misrepresented" when I asked him for comment. (I added this note to the top of the post as a parenthetical; thanks.)

Deceptive alignment doesn't preserve goals.

A short note on a point that I'd been confused about until recently. Suppose you have a deceptively aligned policy which is behaving in aligned ways during training so that it will be able to better achieve a misaligned internally-represented goal during deployment. The misaligned goal causes the aligned behavior, but so would a wide range of other goals (either misaligned or aligned) - and so weight-based regularization would modify the internally-represented goal as training continues. For example, if the misali... (read more)

3 Alex Turner · 8mo
Can you say why you think that weight-based regularization would drift the weights to the latter? That seems totally non-obvious to me, and probably false.

Why would alignment with the outer reward function be the simplest possible terminal goal? Specifying the outer reward function in the weights would presumably be more complicated. So one would have to specify a pointer towards it in some way. And it's unclear whether that pointer is simpler than a very simple misaligned goal.

Such a pointer would be simple if the neural network already has a representation of the outer reward function in weights anyway (rather than deriving it at run-time in the activations). But it seems likely that any fixed representati... (read more)

3 SoerenMind · 8mo
Interesting point. Though on this view, "Deceptive alignment preserves goals" would still become true once the goal has drifted to some random maximally simple goal for the first time. To be even more speculative: Goals represented in terms of existing concepts could be simple and therefore stable by default. Pretrained models represent all kinds of high-level states, and weight-regularization doesn't seem to change this in practice. Given this, all kinds of goals could be "simple" as they piggyback on existing representations, requiring little additional description length.

Supervised data seems way more fine-grained in what you are getting the AI to do. It's just that supervised fine-tuning is worse.

My (pretty uninformed) guess here is that supervised fine-tuning vs RLHF has relatively modest differences in terms of producing good responses, but bigger differences in terms of avoiding bad responses. And it seems reasonable to model decisions about product deployments as being driven in large part by how well you can get AI not to do what you don't want it to do.

Putting my money where my mouth is: I just uploaded a (significantly revised) version of my Alignment Problem position paper, where I attempt to describe the AGI alignment problem as rigorously as possible. The current version only has "policy learns to care about reward directly" as a footnote; I can imagine updating it based on the outcome of this discussion though.

2 David Schneider-Joseph · 1y
For someone who's read v1 of this paper, what would you recommend as the best way to "update" to v3? Is an entire reread the best approach? [Edit March 11, 2023: Having now read the new version in full, my recommendation to anyone else with the same question is a full reread.]

Note that the "without countermeasures" post consistently discusses both possibilities

Yepp, agreed, the thing I'm objecting to is how you mainly focus on the reward case, and then say "but the same dynamics apply in other cases too..."

I do place a ton of emphasis on the fact that Alex enacts a policy which has the empirical effect of maximizing reward, but that's distinct from being confident in the motivations that give rise to that policy.

The problem is that you need to reason about generalization to novel situations somehow, and in practice that ends up being by reasoning about the underlying motivations (whether implicitly or explicitly).

I strongly disagree with the "best case" thing. Like, policies could just learn human values! It's not that implausible.

If I had to try point to the crux here, it might be "how much selection pressure is needed to make policies learn goals that are abstractly related to their training data, as opposed to goals that are fairly concretely related to their training data?" Where we both agree that there's some selection pressure towards reward-like goals, and it seems like you expect this to be enough to lead policies to behavior that violates all their existi... (read more)

3 Ajeya Cotra · 1y
Yes, sorry, "best case" was oversimplified. What I meant is that generalizing to want reward is in some sense the model generalizing "correctly;" we could get lucky and have it generalize "incorrectly" in an important sense in a way that happens to be beneficial to us. I discuss this a bit more here. I don't understand why reward isn't something the model has direct access to -- it seems like it basically does? If I had to say which of us were focusing on abstract vs concrete goals, I'd have said I was thinking about concrete goals and you were thinking about abstract ones, so I think we have some disagreement of intuition here. Yeah, I don't really agree with this; I think I could pretty easily imagine being an AI system asking the question "How much reward would this episode get if it were sampled for training?" It seems like the intuition this is weird and unnatural is doing a lot of work in your argument, and I don't really share it.

(Written quickly and not very carefully.)

I think it's worth stating publicly that I have a significant disagreement with a number of recent presentations of AI risk, in particular Ajeya's "Without specific countermeasures, the easiest path to transformative AI likely leads to AI takeover", and Cohen et al.'s "Advanced artificial agents intervene in the provision of reward". They focus on policies learning the goal of getting high reward. But I have two problems with this:

  1. I expect "reward" to be a hard goal to learn, because it's a pretty abstract concept a
... (read more)

6 Ajeya Cotra · 1y
Note that the "without countermeasures" post consistently discusses both possibilities (the model cares about reward or the model cares about something else that's consistent with it getting very high reward on the training dataset). E.g. see this paragraph from the above-the-fold intro: As well as the section Even if Alex isn't "motivated" to maximize reward.... I do place a ton of emphasis on the fact that Alex enacts a policy which has the empirical effect of maximizing reward, but that's distinct from being confident in the motivations that give rise to that policy. I believe Alex would try very hard to maximize reward in most cases, but this could be for either terminal or instrumental reasons. With that said, for roughly the reasons Paul says above, I think I probably do have a disagreement with Richard -- I think that caring about some version of reward is pretty plausible (~50% or so). It seems pretty natural and easy to grasp to me, and because I think there will likely be continuous online training the argument that there's no notion of reward on the deployment distribution doesn't feel compelling to me.
2 Lauro Langosco · 1y
I agree with your general point here, but I think Ajeya's post actually gets this right, eg and
5 Paul Christiano · 1y
I'm not very convinced by this comment as an objection to "50% AI grabs power to get reward." (I find it more plausible as an objection to "AI will definitely grab power to get reward.") This seems to be most of your position but I'm skeptical (and it's kind of just asserted without argument):

* The data used in training is literally the only thing that AI systems observe, and prima facie reward just seems like another kind of data that plays a similarly central role. Maybe your "unnaturalness" abstraction can make finer-grained distinctions than that, but I don't think I buy it.
* If people train their AI with RLDT then the AI is literally being trained to predict reward! I don't see how this is remote, and I'm not clear if your position is that e.g. the value function will be bad at predicting reward because it is an "unnatural" target for supervised learning.
* I don't understand the analogy with humans. It sounds like you are saying "an AI system selected based on the reward of its actions learns to select actions it expects to lead to high reward" is analogous to "humans care about the details of their reward circuitry." But:
  * I don't think human learning is just RL based on the reward circuit; I think this is at least a contrarian position and it seems unworkable to me as an explanation of human behavior.
  * It seems like the analogous conclusion for RL systems would be "they may not care about the rewards that go into the SGD update, they may instead care about the rewards that get entered into the dataset, or even something further causally upstream of that as long as it's very well-correlated on the training set." But it doesn't matter what we choose that's causally upstream of rewards, as long as it's perfectly correlated on the training set?
  * (Or you could be saying that humans are motivated by pleasure and pain but not the entire suite of things that are upstream of reward? But that doesn't seem right to me.)

I don't buy it:

* If people tr

In general I think it's better to reason in terms of continuous variables like "how helpful is the iterative design loop" rather than "does it work or does it fail".

My argument is more naturally phrased in the continuous setting, but if I translated it into the binary setting: the problem with your argument is that, conditional on the first being wrong, the second is not very action-guiding. E.g. conditional on the first, the most impactful thing is probably to aim towards worlds in which we do hit or miss by a little bit; and that might still be true if it's 5% of worlds rather than 50% of worlds.

2 johnswentworth · 1y
(Thinking out loud here...)

In general, I am extremely suspicious of arguments that the expected-impact-maximizing strategy is to aim for marginal improvement (not just in alignment - this is a general heuristic); I think that is almost always false in practice, at least in situations where people bother to explicitly make the claim.

So let's say I were somehow approximately-100% convinced that it's basically possible for iterative design to produce an aligned AI. Then I'd expect AI is probably not an X-risk, but I still want to reduce the small remaining chance of alignment failure. Would I expect that doing more iterative design is the most impactful approach? Most probably not. In that world, I'd expect the risk is dominated by some kind of tail risks which iterative design could maybe handle in principle, but for which iterative design is really not the optimal tool - otherwise they'd already be handled by the default iterative design processes. So I guess at that point I'd be looking at quantitative usefulness of iterative design, rather than binary.

General point: it's just really hard to get a situation where "do marginally more of the thing we already do lots of by default" is the most impactful strategy. In nearly all cases, there will be problems which the things-we-already-do-lots-of-by-default handle relatively poorly, and then we can have much higher impact by using some other kind of strategy which better handles the kind of problems which are relatively poorly handled by default.

Upon further thought, I have another hypothesis about why there seems to be a gap here. You claim here that the distribution is bimodal, but your previous claim ("I do in fact think that relying on an iterative design loop fails for aligning AGI, with probability close to 1") suggests you don't actually think there's significant probability on the lower mode; you essentially think it's unimodal on the "iterative design fails" worlds.

I personally disagree with both the "significant probability on both modes, but not in between" hypothesis, and the "unimodal ... (read more)

2 johnswentworth · 1y
Yeah, that's fair. The reason I talked about it that way is that I was trying to give what I consider the strongest/most general argument, i.e. the argument with the fewest assumptions. What I actually think is that:

* nearly all the probability mass is on worlds where the iterative design loop fails to align AGI, but...
* conditional on that being wrong, nearly all the probability mass is on the number of bits of optimization from iterative design resulting from ordinary economic/engineering activity being sufficient to align AGI, i.e. it is very unlikely that adding a few extra bits of qualitatively-similar optimization pressure will make the difference. ("We are unlikely to hit/miss by a little bit" is the more general slogan.)

The second claim would be cruxy if I changed my mind on the first, and requires fewer assumptions, and therefore fewer inductive steps from readers' pre-existing models.

I think you're just doing the bimodal thing again. Sure, if you condition on worlds in which alignment happens automagically, then it's not valuable to advance the techniques involved. But there's a spectrum of possible difficulty, and in the middle parts there are worlds where RLHF works, but only because we've done a lot of research into it in advance (e.g. exploring things like debate); or where RLHF doesn't work, but finding specific failure cases earlier allowed us to develop better techniques.

5 johnswentworth · 1y
Yeah, ok, so I am making a substantive claim that the distribution is bimodal. (Or, more accurately, the distribution is wide and work on RLHF only counterfactually matters if we happen to land in a very specific tiny slice somewhere in the middle.) Those "middle worlds" are rare enough to be negligible; it would take a really weird accident for the world to end up such that the iteration cycles provided by ordinary economic/engineering activity would not produce aligned AI, but the extra iteration cycles provided by research into RLHF would produce aligned AI.

in worlds where iterative design works, we probably survive AGI without anybody (intentionally) thinking about RLHF

In worlds where iterative design works, it works by iteratively designing some techniques. Why wouldn't RLHF be one of them?

In particular, the excerpts/claims from Get What You Measure are pretty cruxy.

It seems pretty odd to explain this by quoting someone who thinks that this effect is dramatically less important than you do (i.e. nowhere near causing a ~100% probability of iterative design failing). Not gonna debate this on the object level,... (read more)

2 johnswentworth · 1y
Wrong question. The point is not that RLHF can't be part of a solution, in such worlds. The point is that working on RLHF does not provide any counterfactual improvement to chances of survival, in such worlds.

Iterative design is something which happens automagically, for free, without any alignment researcher having to work on it. Customers see problems in their AI products, and companies are incentivized to fix them; that's iterative design from human feedback baked into everyday economic incentives. Engineers notice problems in the things they're building, open bugs in whatever tracking software they're using, and eventually fix them; that's iterative design baked into everyday engineering workflows. Companies hire people to test out their products, see what problems come up, then fix them; that's iterative design baked into everyday processes. And to a large extent, the fixes will occur by collecting problem-cases and then training them away, because ML engineers already have that affordance; it's one of the few easy ways of fixing apparent problems in ML systems. That will all happen regardless of whether any alignment researchers work on RLHF.

When I say that "in worlds where iterative design works, we probably survive AGI without anybody (intentionally) thinking about RLHF", that's what I'm talking about. Problems which RLHF can solve (i.e. problems which are easy for humans to notice and then train away) will already be solved by default, without any alignment researchers working on them. So, there is no counterfactual value in working on RLHF, even in worlds where it basically works.

In worlds where the iterative design loop works for alignment, we probably survive AGI. So, if we want to improve humanity’s chances of survival, we should mostly focus on worlds where, for one reason or another, the iterative design loop fails. ... Among the most basic robust design loop failures is problem-hiding. It happens all the time in the real world, and in practice we tend to not find out about the hidden problems until after a disaster occurs. This is why RLHF is such a uniquely terrible strategy: unlike most other alignment schemes, it make

... (read more)
3 johnswentworth · 1y
The argument is not structurally invalid, because in worlds where iterative design works, we probably survive AGI without anybody (intentionally) thinking about RLHF. Working on RLHF does not particularly increase our chances of survival, in the worlds where RLHF doesn't make things worse. That said, I admit that argument is not very cruxy for me. The cruxy part is that I do in fact think that relying on an iterative design loop fails for aligning AGI, with probability close to 1. And I think the various examples/analogies in the post convey my main intuition-sources behind that claim. In particular, the excerpts/claims from Get What You Measure are pretty cruxy.
3 [comment deleted] · 1y

I believe this because of how the world looks "brittle" (e.g., nanotech exists) and because lots of technological progress seems cognition-constrained (such as, again, nanotech). This is a big part of why I think heavy-precedent-style justifications are doomed.

Apart from nanotech, what are the main examples or arguments you would cite in favor of these claims?

Separately, how close is your conception of nanotech to "atomically precise manufacturing", which seems like Drexler's preferred framing right now?

1 Vojtech Kovarik · 1y
One way in which the world seems brittle / having free energy AI could use to gain advantage: We haven't figured out good communication practices for the digital age. We don't have good collective epistemics. And we don't seem to be on track to have this solved in the next 20 years. As a result I expect that with enough compute and understanding of network science, and perhaps a couple more things, you could sabotage the whole civilization. ("Enough" is meant to stand for "a lot, but within reach of an early AGI". Heck, if Google somehow spent the next 5 years just on that, I would give them fair odds.)
5 Thomas Kwa · 1y
not Nate or a military historian, but to me it seems pretty likely for a ~100 human-years more technologically advanced actor to get decisive strategic advantage over the world.

* In military history it seems pretty common for some tech advance to cause one side to get a big advantage. This seems to be true today as well with command-and-control and various other capabilities.
* I would guess pure fusion weapons are technologically possible, which means an AI sophisticated enough to design one can get nukes without uranium.
* Currently on the cutting edge, the most advanced actors have large multiples over everyone else in important metrics. This is due to either a few years' lead or better research practices still within the human range:
  * SMIC is mass producing the 14nm node whereas Samsung is at 3nm, which is something like 5x better FLOPS/watt.
  * Algorithmic improvements driven by cognitive labor of ML engineers have caused multiple OOM improvement in value/FLOPS.
  * SpaceX gets 10x better cost per ton to orbit than the next cheapest space launch provider, and this is before Starship. Also their internal costs are lower.

This seems sufficient for "what failure looks like" scenarios, with faster disempowerment through hard takeoff likely to depend on other pathways like nanotech, social engineering, etc. As for the whole argument against "heavy precedent", I'm not convinced either way and haven't thought about it a ton.

A couple of differences between Kolmogorov complexity/Shannon entropy and the loss function of autoregressive LMs (just to highlight them, not trying to say anything you don't already know):

  • The former are (approximately) symmetric, the latter isn't (it can be much harder to predict a string front-to-back than back-to-front)
  • The former calculate compression values as properties of a string (up to choice of UTM). The latter calculates compression values as properties of a string, a data distribution, and a model (and even then doesn't strictly determine the r
... (read more)
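To make the first bullet concrete (my notation, not the comment's): the autoregressive loss of a string x under a model θ is

```latex
L_\theta(x) \;=\; -\sum_{t=1}^{|x|} \log p_\theta\!\left(x_t \mid x_{<t}\right),
```

and in general L_\theta(x) differs from the loss of the reversed string, whereas Kolmogorov complexity satisfies K(x) = K(reverse(x)) + O(1); and L_\theta depends on the model (and, via training, on the data distribution), not on the string alone.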

I agree that we'll have a learning function that works on the data actually input, but it seems strange to me to characterize that learned model as "reflecting back on that data" in order to figure out what it cares about (as opposed to just developing preferences that were shaped by the data).

6 Eliezer Yudkowsky · 1y
The cogitation here is implicitly hypothesizing an AI that's explicitly considering the data and trying to compress it, having been successfully anchored on that data's compression as identifying an ideal utility function.  You're welcome to think of the preferences as a static object shaped by previous unreflective gradient descent; it sure wouldn't arrive at any better answers that way, and would also of course want to avoid further gradient descent happening to its current preferences.

if some kind of compassionate-LDT is a source of hope about not destroying all the value in our universe-share and getting ourselves killed, then it must be hope about us figuring out such a theory and selecting for AGIs that implement it from the start, rather than that maybe an AGI would likely convergently become that way before taking over the world.


I weakly disagree here, mainly because Nate's argument for very high levels of risk goes through strong generalization/a "sharp left turn" towards being much more coherent + goal-directed. So I find it hard... (read more)

Broadly agree with this post. Couple of small things:

Then later, it is smart enough to reflect back on that data and ask: “Were the humans pointing me towards the distinction between goodness and badness, with their training data? Or were they pointing me towards the distinction between that-which-they'd-label-goodness and that-which-they'd-label-badness, with things that look deceptively good (but are actually bad) falling into the former bin?” And to test this hypothesis, it would go back to its training data and find some example bad-but-deceptively-goo

... (read more)
5 Eliezer Yudkowsky · 1y
By "were the humans pointing me towards..." Nate is not asking "did the humans intend to point me towards..." but rather "did the humans actually point me towards..."  That is, we're assuming some classifier or learning function that acts upon the data actually input, rather than a succesful actual fully aligned works-in-real-life DWIM which arrives at the correct answer given wrong data.
3 davidad (David A. Dalrymple) · 1y
For the record, I have a convergently similar intuition: FDT removes the Cartesian specialness of the ego at the decision nodes (by framing each decision as a mere logical consequence of an agent-neutral nonphysical fact about FDT itself), but retains the Cartesian specialness of the ego at the utility node(s). I’ve thought about this for O(10 hours), and I also believe it could be crazy, but it does align quite well with the conclusions of Compassionate Moral Realism.

I considered this, but it seems like the latter is 4x longer while covering fairly similar content?

2 Evan Hubinger · 1y
I've found that people often really struggle to understand the content from the former but got it when I gave them the latter—and also I think the latter post covers a lot of newer stuff that's not in the old one (e.g. different models of inductive biases).

Given the baseline classifier's 0.003% failure rate, you would have to sample and label 30,000 in-distribution examples to find a failure (which would cost about $10,000). With our tools, our contractors are able to find an adversarial example on the baseline classifier every 13 minutes (which costs about $8 – about 1000x cheaper).

This isn't comparing apples to apples, though? If you asked contractors to find adversarial examples without using the tools, they'd likely find them at a rate much higher than 0.003%.
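For reference, the arithmetic implied by the quoted numbers (my reconstruction, not figures from the post):

```python
# Rough check of the quoted comparison; all values derived from the numbers above.
failure_rate = 0.00003                                  # 0.003% baseline failure rate
samples_per_failure = 1 / failure_rate                  # ~33,000 in-distribution samples
implied_cost_per_label = 10_000 / samples_per_failure   # ~$0.30 per label
tool_cost_per_failure = 8                               # $8 per tool-assisted adversarial example
print(samples_per_failure, implied_cost_per_label, 10_000 / tool_cost_per_failure)
# ~33333, ~$0.30, ~1250x cheaper (i.e. the "about 1000x" in the quote)
```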

2 dmz · 1y
That's right. We did some followup experiments doing the head-to-head comparison: the tools seem to speed up the contractors by 2x for the weak adversarial examples they were finding (and anecdotally speed us up a lot more when we use them to find more egregious failures). See https://www.lesswrong.com/posts/n3LAgnHg6ashQK3fF/takeaways-from-our-robust-injury-classifier-project-redwood#Quick_followup_results; an updated arXiv paper with those experiments is appearing on Monday. 

None of the "hard takeoff people" or hard takeoff models predicted or would predict that the sorts of minor productivity advancements we are starting to see would lead to a FOOM by now.

The hard takeoff models predict that there will be fewer AI-caused productivity advancements before a FOOM than soft takeoff models do. Therefore any AI-caused productivity advancements without FOOM are relative evidence against the hard takeoff models.

You might say that this evidence is pretty weak; but it feels hard to discount the evidence too much when there are few concrete claims by hard-takeoff proponents about what advances would surprise them. Everything is kinda prosaic in hindsight.

3 Daniel Kokotajlo · 1y
I'm not sure about that actually. Hard takeoff and soft takeoff disagree about the rate of slope change, not about the absolute height of the line. I guess if you are thinking about the "soft takeoff means shorter timelines" then yeah it also means higher AI progress prior to takeoff, and in particular predicts more stuff happening now. But people generally agree that despite that effect, the overall correlation between short timelines and fast takeoff is positive.  Anyhow, even if you are right, I definitely think the evidence is pretty weak. Both sides make pretty much the exact same retrodictions and were in fact equally unsurprised by the last few years. I agree that Yudkowsky deserves spanking for not working harder to make concrete predictions/bets with Paul, but he did work somewhat hard, and also it's not like Paul, Ajeya, etc. are going around sticking their necks out much either. Finding concrete stuff to bet on (amongst this group of elite futurists) is hard. I speak from experience here, I've talked with Paul and Ajeya and tried to find things in the next 5 years we disagree on and it's not easy, EVEN THOUGH I HAVE 5-YEAR TIMELINES. We spent about an hour probably. I agree we should do it more. (Think about you vs. me. We both thought in detail about what our median futures look like. They were pretty similar, especially in the next 5 years!)

Thanks for the comments Vika! A few responses:

It might be good to clarify that this is an example architecture and the claims apply more broadly.

Makes sense, will do.

Phase 1 and 2 seem to map to outer and inner alignment respectively. 

That doesn't quite seem right to me. In particular:

  • Phase 3 seems like the most direct example of inner misalignment; I basically think of "goal misgeneralization" as a more academically respectable way of talking about inner misalignment.
  • Phase 1 introduces the reward misspecification problem (which I treat as synonymous
... (read more)

How? E.g. Jacob left a comment here about his motivations; does that count as a falsification? Or, if you'd say that this is an example of rationalization, then what would the comment need to look like in order to falsify your claim? Does Paul's comment here mentioning the discussions that took place before launching the GPT-3 work count as a falsification? If not, why not?

Jacob's comment does not count, since it's not addressing the "actually consider whether the project will net decrease chance of extinction" or the "could the answer have plausibly been 'no' and then the project would not have happened" part.

Paul's comment does address both of those, especially this part at the end:

To be clear, this is not post hoc reasoning. I talked with WebGPT folks early on while they were wondering about whether these risks were significant, and I said that I thought this was badly overdetermined. If there had been more convincing arg

... (read more)