On how various plans miss the hard bits of the alignment challenge

So8res

This post has been recorded as part of the LessWrong Curated Podcast, and can be listened to on Spotify, Apple Podcasts, and Libsyn.

(As usual, this post was written by Nate Soares with some help and editing from Rob Bensinger.)

In my last post, I described a “hard bit” of the challenge of aligning AGI—the sharp left turn that comes when your system slides into the “AGI” capabilities well, the fact that alignment doesn’t generalize similarly well at this turn, and the fact that this turn seems likely to break a bunch of your existing alignment properties.

Here, I want to briefly discuss a variety of current research proposals in the field, to explain why I think this problem is currently neglected.

I also want to mention research proposals that do strike me as having some promise, or that strike me as adjacent to promising approaches.

Before getting into that, let me be very explicit about three points:

1. On my model, solutions to how capabilities generalize further than alignment are necessary but not sufficient. There is dignity in attacking a variety of other real problems, and I endorse that practice.
2. The imaginary versions of people in the dialogs below are not the same as the people themselves. I'm probably misunderstanding the various proposals in important ways, and/or rounding them to stupider versions of themselves along some important dimensions.^[1] If I've misrepresented your view, I apologize.
3. I do not subscribe to the Copenhagen interpretation of ethics wherein someone who takes a bad swing at the problem (or takes a swing at a different problem) is more culpable for civilization's failure than someone who never takes a swing at all. Everyone whose plans I discuss below is highly commendable, laudable, and virtuous by my accounting.

Also, many of the plans I touch upon below are not being given the depth of response that I'd ideally be able to give them, and I apologize for not engaging with their authors in significantly more depth first. I’ll be especially cursory in my discussion of some MIRI researchers and research associates like Vanessa Kosoy and Scott Garrabrant.^[2]

In this document I'm attempting to summarize my high-level view of the approaches I know about; I'm not attempting to provide full arguments for why I think particular approaches are more or less promising.

Think of the below as a window into my thought process, rather than an attempt to state or justify my entire background view. And obviously, if you disagree with my thoughts, I welcome objections.

So, without further ado, I’ll explain why I think that the larger field is basically not working on this particular hard problem:

Reactions to specific plans

Owen Cotton-Barratt & Truthful AI

Imaginary, possibly-mischaracterized-by-Nate version of Owen: What if we train our AGIs to be truthful? If our AGIs were generally truthful, we could just ask them if they're plotting to be deceptive, and if so how to fix it, and we could do these things early in ways that help us nip the problems in the bud before they fester, and so on and so forth.

Even if that particular idea doesn't work, it seems like our lives are a lot easier insofar as the AGI is truthful.

Nate: "Truthfulness" sure does sound like a nice property for our AGIs to have. But how do you get it in there? And how do you keep it in there, after that sharp left turn? If this idea is to make any progress on the hard problem we're discussing, it would have to come from some property of "truthfulness" that makes it more likely than other desirable properties to survive the great generalization of capabilities.

Like, even simpler than the problem of an AGI that puts two identical strawberries on a plate and does nothing else, is the problem of an AGI that turns as much of the universe as possible into diamonds. This is easier because, while it still requires that we have some way to direct the system towards a concept of our choosing, we no longer require corrigibility. (Also, "diamond" is a significantly simpler concept than "strawberry" and "cellularly identical".)

It seems to me that we have basically no idea how to do this. We can train the AGI to be pretty good at building diamond-like things across a lot of training environments, but once it takes that sharp left turn, by default, it will wander off and do some other thing, like how humans wandered off and invented birth control.

In my book, solving this hard problem so well that we could feasibly get an AGI that predictably maximizes diamond (after its capabilities start generalizing hard) would constitute an enormous advance.

Solving the hard problem so well that we could feasibly get an AGI that predictably answers operator questions truthfully would constitute a similarly enormous advance, because we would have figured out how to keep a highly capable system directed at any one thing of our choosing.

Now, in real life, building a truthful AGI is much harder than building a diamond optimizer, because 'truth' is a concept that's much more fraught than 'diamond'. (To see this, observe that the definition of "truth" routes through tricky concepts like "ways the AI communicated with the operators" and "the mental state of the operators", and involves grappling with tricky questions like "what ways of translating the AI's foreign concepts into human concepts count as manipulative?" and "what can be honestly elided?", and so on, whereas diamond is just carbon atoms bound covalently in tetrahedral lattices.)

So as far as I can tell, from the perspective of this hard problem, Owen's proposal boils down to "Wouldn't it be nice if the tricky problems were solved, and we managed to successfully direct our AGIs to be truthful?" Well, sure, that would be nice, but it's not helping solve our problem. In fact, this problem subsumes the whole diamond maximizer problem, but replaces the concept of "diamond" (that we obviously can't yet direct an AGI to optimize, diamond more clearly being a physical phenomenon far removed from the AGI's raw sensory inputs) with the concept of "truth" (which is abstract enough that we can easily forget that it's a much more difficult-to-describe physical phenomenon far removed from the AGI's raw sensory inputs).

(And this hasn't even touched on how if you did manage to get an AGI that keeps optimizing for truth-telling after it falls into the capabilities well, then you still have to make it corrigible, on pain of extinction. But I digress.)

Maybe Owen does think that the goal of "tell the truth" generalizes more readily than "optimize diamond", for some reason? But if so, I haven't seen him mention it, except when I explicitly prompted him when having this argument in person. Which doesn't seem like a very promising sign to me. My read is that he's just failing to engage with this hard problem.

Ryan Greenblatt & Eliciting Latent Knowledge

Imaginary Ryan, because Paul was on vacation last time I tried to understand ELK: As we train the nascent AI up to an AGI, we're going to simultaneously train a second system (the "ELK head") that can elicit its latent knowledge. The latter system will be able to tell us the moment that the former system is starting to do things like contemplate deception. We're then going to combine this with some third adversarial system, which can use that latent knowledge to expose the bad behavior to gradients, so that we can nip it in the bud.

Basically all this requires is that, whatever intelligence gradient descent (or some other optimizing procedure) can create, it can also decode, which doesn't seem all that big an ask: whatever optimization method you were using that found this intelligent reasoner, there should be a dual optimization process that finds its decoder, if you're a little clever about it.

Nate: So, set aside questions of whether that sort of thing (with the separate knowledge-elicitor and adversary) will turn out to be practical, because we're talking about whether the plan attacks the key problem.

Suppose we have an ELK head, and an adversary, that we're using to train a nascent AGI. The AGI takes a sharp left turn. Now it's dangerously capable, and suddenly a bunch of misalignments which it previously didn't notice within itself are relevant. And you've got to get them out.

Your first problem is that the recent capabilities gains made by the AGI might not have come from gradient descent (much like how humans’ sudden explosion of technological knowledge accumulated in our culture rather than our genes, once we turned the corner). You might not be able to just "expose the bad behavior" to gradients that you can hit to correct the thing, at least not easily and quickly.

Your second problem is that the AGI's concepts might rapidly get totally uninterpretable to your ELK head. Like, you could imagine doing neuroimaging on your mammals all the way through the evolution process. They've got some hunger instincts in there, but it's not like they’re smart enough yet to represent the concept of "inclusive genetic fitness" correctly, so you figure you'll just fix it when they get capable enough to understand the alternative (of eating because it's instrumentally useful for procreation). And so far you're doing great: you've basically decoded the visual cortex, and have a pretty decent understanding of what it's visualizing.

You maintain this pretty well as it walks up through to primate, and then suddenly it takes a sharp left turn and invents its own internal language and a bunch of abstract concepts, and suddenly you find your visualization tools to be quite lacking for interpreting its abstract mathematical reasoning about topology or whatever.

Analogously, your ELK head's abilities are liable to fall off a cliff right as the AGI's capabilities start generalizing way outside of its training distribution.

And if they don't, then this ELK head is (in this hypothetical) able to decode and understand the workings of an alien mind. Likely a kludgey behemoth of an alien mind. This itself is liable to require quite a lot of capability, quite plausibly of the sort that humanity gets first from the systems that took sharp left turns, rather than systems that ground along today's scaling curves until they scaled that far.

Or in other words, if your ELK head does keep pace with your AGI, and takes a sharp left turn at the same time as it... then, well, now you're basically back to the "Truthful AI" proposal. How do you keep your ELK head reporting accurately (and doing so corrigibly), as it undergoes that sharp left turn?

This proposal seems to me like it's implicitly assuming that most of the capabilities gains come from the slow grind of gradient descent, in a world where the systems don't take sharp left turns and rapidly become highly capable in a wide variety of new (out-of-distribution) domains.

Which seems to me like it's mostly just assuming its way out from under the hard problem, and thus, on my models, assuming its way clean out of reality.

And if I imagine attempting to apply this plan inside of the reality I think I live in, I don't see how it plans to address the hard part of the problem, beyond saying "try training it against places where it knows it's diverging from the goal before the sharp turn, and then hope that it generalizes well or won't fight back", which doesn't instill a bunch of confidence in me (and which I don't expect to work).

Eric Drexler & AI Services

Imaginary Eric: Well, sure, AGI could get real dangerous if you let one system do everything under one umbrella. But that's not how good engineers engineer things. You can and should split your AI systems into siloed services, each of which can usefully help humanity with some fragment of whichever difficult sociopolitical or physical challenge you're hoping to tackle, but none of which constitutes an adversarial optimizer (with goals over the future) in its own right.

Nate: So mostly I expect that, if you try to split these systems into services, then you either fail to capture the heart of intelligence and your siloed AIs are irrelevant, or you wind up with enough AGI in one of your siloes that you have a whole alignment problem (hard parts and all) in there.

Like, I see this plan as basically saying "yep, that hard problem is in fact too hard, let's try to dodge it, by having humans + narrow AI services perform the pivotal act". Setting aside how I don't particularly expect this to work, we can at least hopefully agree that it's attempting to route around the problems that seem to me to be central, rather than attempting to solve them.

Evan Hubinger, in a recent personal conversation

Imaginary Evan: It's hard, in the modern paradigm, to separate the system's values from its capabilities and from the way it was trained. All we need to do is find a training regimen that leads to AIs that are both capable and aligned. At which point we can just make it publicly available, because it's not like people will be trying to disalign their AIs.

Nate: So, first of all, you haven't exactly made the problem easier.

As best I can tell, this plan amounts to "find a training method that not only can keep a system aligned through the sharp left turn, but must, and then popularize it". Which has, like, bolted two additional steps atop an assumed solution to some hard problems. So this proposal does not seem, to me, to make any progress towards solving those hard problems.

(Also, the observation "capabilities and alignment are fairly tightly coupled in the modern paradigm" doesn't seem to me like much of an argument that they're going to stay coupled after the ol' left turn. Indeed, I expect they won't stay coupled in the ways you want them to. Assuming that this modern desirable property will hold indefinitely seems dangerously close to just assuming this hard problem away, and thus assuming your way clean out of what-I-believe-to-be-reality.)

But maybe I just don't understand this proposal yet (and I have had some trouble distilling things I recognize as plans out of Evan's writing, so far).

A fairly straw version of someone with technical intuitions like Richard Ngo’s or Rohin Shah’s

Imaginary Richard/Rohin: You seem awfully confident in this sharp left turn thing. And that the goals it was trained for won't just generalize. This seems characteristically overconfident. For instance, observe that natural selection didn't try to get the inner optimizer to be aligned with inclusive genetic fitness at all. For all we know, a small amount of cleverness in exposing inner-misaligned behavior to the gradients will just be enough to fix the problem. And even if not that-exact-thing, then there are all sorts of ways that some other thing could come out of left field and just render the problem easy. So I don't see why you're worried.

Nate: My model says that the hard problem rears its ugly head by default, in a pretty robust way. Clever ideas might suffice to subvert the hard problem (though my guess is that we need something more like understanding and mastery, rather than just a few clever ideas). I have considered an array of clever ideas that look to me like they would predictably-to-me fail to solve the problems, and I admit that my guess is that you're putting most of your hope on small clever ideas that I can already see would fail. But perhaps you have ideas that I do not. Do you yourself have any specific ideas for tackling the hard problem?

Imaginary Richard/Rohin: Train it, while being aware of inner alignment issues, and hope for the best.

Nate: That doesn't seem to me to even start to engage with the issue where the capabilities fall into an attractor and the alignment doesn't.

Perhaps sometime we can both make a list of ways to train with inner alignment issues in mind, and then share them with each other, so that you can see whether you think I'm lacking awareness of some important tool you expect to be at our disposal, and so that I can go down your list and rattle off the reasons why the proposed training tools don't look to me like they result in alignment that is robust to sharp left turns. (Or find one that surprises me, and update.) But I don't want to delay this post any longer, so, some other time, maybe.

Another recent proposal

Imaginary Anonymous-Person-Whose-Name-I’ve-Misplaced: Okay, but maybe there is a pretty wide attractor basin around my own values, though. Like, maybe not my true values, but around a bunch of stuff like being low-impact and deferring to the operators about what to do and so on. You don't need to be all that smart, nor have a particularly detailed understanding of the subtleties of ethics, to figure out that it's bad (according to me) to kill all humans.

Nate: Yeah, that's basically the idea behind corrigibility, and is one reason why corrigibility is plausibly a lot easier to get than a full-fledged CEV sovereign. But this observation doesn't really engage with the question of how to point the AGI towards that concept, and how to cause its behavior to be governed by that concept in a fashion that's robust to the sharp left turn where capabilities start to really generalize.

Like, yes, some directions are easier to point an AI in, on account of the direction itself being simpler to conceptualize, but that observation alone doesn't say anything about how to determine which direction an AI is pointing after it falls into the capabilities well.

More generally, saying "maybe it's easy" is not the same as solving the problem. Maybe it is easy! But it's not going to get solved unless we have people trying to solve it.

Vivek Hebbar, summarized (perhaps poorly) from last time we spoke of this in person

Imaginary Vivek: Hold on, the AGI is being taught about what I value every time it tries something and gets a gradient about how well that promotes the thing I value. At least, assuming for the moment that we have a good ability to evaluate the goodness of the consequences of a given action (which seems fair, because it sounds like you're arguing for a way that we'd be screwed even if we had the One True Objective Function).

Like, you said that all aspects of reality are whispering to the nascent AGI of what it means to optimize, but few parts of reality are whispering of what to optimize for—whereas it looks to me like every gradient the AGI gets is whispering a little bit of both. So in particular, it seems to me like if you did have the one true objective function, you could just train good and hard until the system was both capable and aligned.

Nate: This seems to me like it's implicitly assuming that all of the system's cognitive gains come from the training. Like, with every gradient step, we are dragging the system one iota closer to being capable, and also one iota closer to being good, or something like that.

To which I say: I expect many of the cognitive gains to come from elsewhere, much as a huge number of the modern capabilities of humans are encoded in their culture and their textbooks rather than in their genomes. Because there are slopes in capabilities-space that an intelligence can snowball down, picking up lots of cognitive gains, but not alignment, along the way.

Assuming that this is not so seems to me like simply assuming this hard problem away.

And maybe you simply don't believe that it's a real problem; that's fine, and I’d be interested to hear why you think that. But I have not yet heard a proposed solution, as opposed to an objection to the existence of the problem in the first place.

John Wentworth & Natural Abstractions

Imaginary John: I suspect there's a common format to concepts, that is a fairly objective fact about the math of the territory, and that—if mastered—could be used to understand an AGI's concepts. And perhaps select the ones we wish it would optimize for. Which isn't the whole problem, but sure is a big chunk of the problem. (And other chunks might well be easier to address given mastery of the fairly-objective concepts of "agent" and "optimizer" and so on.)

Nate: This does seem to me like it's trying to attack the actual problem! I have my doubts about this particular line of research (and those doubts are on my list of things to write up), but hooray for a proposal that, if it succeeded by its own lights, would address this hard problem!

Imaginary John: Well, uh, these days I'm mostly focusing on using my flimsy non-mastered grasp of the common-concept format to try to give a descriptive account of human values, because for some reason that's where I think the hope is. So I'm not actually working too much on this thing that you think takes a swing at the real problem (although I do flirt with it occasionally).

Nate: :'(

Imaginary John: Look, I didn't want to break the streak, OK.

Rob Bensinger, reading this draft: Wait, why do you see John’s proposal as attacking the central problem but not, for example, Eric Drexler’s Language for Intelligent Machines (summarized here)?

Nate: I understand Eric to be saying "maybe humans deploying narrow AIs will be capable enough to end the acute risk period before an AGI can (in which case we can avoid ever using AIs that have taken sharp left turns)", whereas John is saying "maybe a lot of objective facts about the territory determine which concepts are useful, and by understanding the objectivity of concepts we can become able to understand even an alien mind's concepts".

I think John’s guess is wrong (at least in the second clause), but it seems aimed at taking an AI system that has snowballed down a capabilities slope in the way that humans snowballed, and identifying its concepts in a way that’s stable to changes in the AI’s ontology—which is step one in the larger challenge of figuring out how to robustly direct an AGI’s motivations at the content of a particular concept it has.

My understanding of Eric’s idea, in contrast, is "I think there's a language these siloed components could use that's not so expressive as to allow them to be dangerous, but is expressive enough to allow them to help humans." To which my basic reply is roughly “The problem is that the non-siloed systems are going to start snowballing and end the world before the human+silo systems can save the world." As far as I can tell, Eric's attempting to route around the problem, whereas John's attempting to solve it.^[3]

Neel Nanda & Theories of Impact for Interpretability

Imaginary Neel: What if we get a lot of interpretability?

Nate: That would be great, and I endorse developing such tools.

I think this will only solve the hard problems if the field succeeds at interpretability so wildly that (a) our interpretability tools continue to work on fairly difficult concepts in a post-left-turn AGI; (b) that AGI has an architecture that turns out to be especially amenable to being aimed at some concept of our choosing; and (c) the interpretability tools grant us such a deep understanding of this alien mind that we can aim it using that understanding.

I admit I'm skeptical of all three. To be clear, better interpretability tools help put us in a better position even if they don't clear these lofty bars. In real life, I expect interpretability to play a smaller role, as a force-multiplier that awaits some other plan for addressing the hard problems.

Force-multipliers of that kind are great to have and worth building. I full-throatedly endorse humanity putting more effort into interpretability.

It simultaneously doesn't look to me like people are seriously aiming for "develop such a good ability to understand minds that we can reshape/rebuild them to be aimable in whatever time we have after we get one". It looks to me like the sights are currently set at much lower and more achievable targets, and that current progress is consistent with never hitting the more ambitious targets, the ones that would let us understand and reshape the first artificial minds into something aligned (fast enough to be relevant).

But if some ambitious interpretability researchers do set their sights on the sharp left turn and the generalization problem, then I would indeed count this as a real effort by humanity to solve its central technical challenge. I don't need a lot of hope in a specific research program in order to be satisfied with the field's allocation of resources; I just want to grow the space of attempts to solve the generalization problem at all.

Stuart Armstrong & Concept Extrapolation

Nate: (Note: This section consists of actual quotes and dialog, unlike the others.)^[4]

Stuart, in a blog post:

[...] It is easy to point at current examples of agents with low (or high) impact, at safe (or dangerous) suggestions, at low (or high) powered behaviours. So we have in a sense the 'training sets' for defining low-impact/Oracles/low-powered AIs.
It's extending these examples to the general situation that fails: definitions which cleanly divide the training set (whether produced by algorithms or humans) fail to extend to the general situation. Call this the 'value extrapolation problem', with 'value' interpreted broadly as a categorisation of situations into desirable and undesirable.
[...] Value extrapolation is thus necessary for AI alignment.
[...] We think that once humanity builds its first AGI, superintelligence is likely near, leaving little time to develop AI safety at that point. Indeed, it may be necessary that the first AGI start off aligned: we may not have the time or resources to convince its developers to retrofit alignment to it. So we need a way to have alignment deployed throughout the algorithmic world before anyone develops AGI.
To do this, we'll start by offering alignment as a service for more limited AIs. Value extrapolation scales down as well as up: companies value algorithms that won't immediately misbehave in new situations, algorithms that will become conservative and ask for guidance when facing ambiguity.
We will get this service into widespread use (a process that may take some time), and gradually upgrade it to a full alignment process. [...]

Rob Bensinger, replying on Twitter: The basic idea in that post seems to be: let's make it an industry standard for AI systems to "become conservative and ask for guidance when facing ambiguity", and gradually improve the standard from there as we figure out more alignment stuff.

The reasoning being something like: once we have AGI, we need to have deployment-ready aligned AGI extremely soon; and this will be more possible if the non-AGI preceding it is largely aligned.

(I at least agree with the "once we have AGI, we’ll need deployment-ready aligned AGI extremely soon" part of this.)

The other aspect of your plan seems to be 'focus on improving value extrapolation methods'. Both aspects of this plan seem very bad to me, speaking from my inside view:

1a. I don't expect that much overlap between what's needed to make, e.g., a present-day image classifier more conservative, and what's needed to make an AGI reliable and safe. So redirecting resources from the latter problem to the former seems wasteful to me.
1b. Relatedly, I don't think it's helpful for the field to absorb the message "oh, yeah, our image classifiers and Go players and so on are aligned, we're knocking that problem out of the park". If 1a is right, then making your image classifier conservative doesn't represent much progress toward being able to align AGI. They're different problems, like building a safe bridge vs. building a safe elevator.

'Alignment' is currently a word that's about the AGI problem in particular, which overlaps with a lot of narrow-AI robustness problems, but isn't just a scaled-up version of those; the difficulty of AGI alignment mostly comes from qualitatively new risks. So 'aligning' the field as a whole doesn't necessarily help much, and (less importantly) using the term 'alignment' for the broader, fuzzier goal is liable to distract from the core difficulties, and liable to engender a false sense of progress on the original problem.

2. We need to do value extrapolation eventually, but I don't think this is the field's current big bottleneck, and I don't think it helps address the bottleneck. Rather, I think the big bottleneck is understandability / interpretability.

Nate: I like Rob’s response. I’ll add that I’m not sure I understand your proposal. Your previous name for the value extrapolation problem was the “model splintering” problem, and iirc you endorsed Rohin’s summary of model splintering:

[Model splintering] is one way of more formally looking at the out-of-distribution problem in machine learning: instead of simply saying that we are out of distribution, we look at the model that the AI previously had, and see what model it transitions to in the new distribution, and analyze this transition.
Model splintering in particular refers to the phenomenon where a coarse-grained model is “splintered” into a more fine-grained model, with a one-to-many mapping between the environments that the coarse-grained model can distinguish between and the environments that the fine-grained model can distinguish between (this is what it means to be more fine-grained).
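
(As a rough illustration of the one-to-many mapping that summary describes, here is a minimal Python sketch. It is a toy under stated assumptions, not anything from Rohin's or Stuart's writing: the environment names, the `SPLINTERS` table, and the `coarsen` and `splintered` helpers are all hypothetical.)

```python
from typing import Dict, List

# Hypothetical example: which fine-grained environments each coarse-grained
# environment splinters into. All names here are illustrative assumptions.
SPLINTERS: Dict[str, List[str]] = {
    "diamond": ["mined diamond", "lab-grown diamond"],
    "not diamond": ["graphite", "cubic zirconia", "glass"],
}


def coarsen(fine_env: str) -> str:
    """Map a fine-grained environment back to the unique coarse one containing it."""
    matches = [coarse for coarse, fines in SPLINTERS.items() if fine_env in fines]
    if len(matches) != 1:
        raise ValueError(f"{fine_env!r} does not map to exactly one coarse environment")
    return matches[0]


def splintered(coarse_env: str) -> bool:
    """A coarse environment has splintered if it corresponds to several fine ones."""
    return len(SPLINTERS[coarse_env]) > 1
```

(In this toy, every fine-grained environment coarsens to exactly one coarse environment, while a coarse environment such as "diamond" corresponds to several fine-grained ones. The sketch only shows the shape of the mapping; it says nothing about how a system's goals should behave across the transition.)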

On the surface, work aimed at understanding and addressing "model splintering" sounds potentially promising to me—like, I might want to classify some version of "concept extrapolation" alongside Natural Abstractions, certain approaches to interpretability, Vanessa’s work, Scott’s work, etc. as "an angle of attack that might genuinely help with the core problem, if it succeeded wildly more than I expect it to succeed". Which is about as positive a situation as I’m expecting right now, and would be high praise in my books.

But in the past, I’ve often heard you use words and phrases in ways that I find promising at a glance, to mean things that I end up finding much less promising when I dig in on the specifics of what you’re talking about. So I’m initially skeptical, especially insofar as I don’t understand your proposal well.

I’d be interested in hearing how you think your proposal addresses the sharp left turn, if you think it does; or maybe you can give me pointers toward particular paragraphs/sections you’ve written up that you think already speak to this problem.

Regarding work on image-classifier conservatism: at a first glance, I don't have much confidence that the types of generalization you’re shooting for are tracking the possibility of sharp left turns. "We want our solutions to generalize" is cheap to say; things that engage with the sharp left turn are more expensive. What’s an example of a kind of present-day research on image classifier conservatism that you’d expect to help with the sharp left turn (if you do think any would help)?

Rebecca Gorman, in an email thread: We're working towards something that achieves interpretability objectives, and does so better than current approaches.

Agreed that AGI alignment isn't just a scaled-up version of narrow-AI robustness problems. But if we need to establish the foundations of alignment before we reach AGI and build it into every AI being built today (since we don't know when and where superintelligence will arise), then we need to try to scale down the alignment problem to something we can start to research today.

As for the article [A central AI alignment problem: capabilities generalization, and the sharp left turn]: I think it's an excellent article, but I'll give an insufficient response. I agree that capabilities form an attractor well. And that we don't get a strong understanding of human values as easily. That's why we think it's important to invest energy and resources into giving AI a strong understanding of human values; it's probably a harder problem. But - at a high level, some of the methods for getting there may generalize. That, at least, is a hopeful statement.

Nate: That sounds like a laudable goal. I have not yet managed to understand what sort of foundations of alignment you're trying to scale down and build into modern systems. What are you hoping to build into modern systems, and how do you expect it to relate to the problem of aligning systems with capabilities that generalize far outside of training?

So far, from parts of the aforementioned email thread that have been elided in this dialog, I have not yet managed to extract a plan beyond "generate training data that helps things like modern image classifiers distinguish intended features (such as ‘pre-treatment collapsed lung’ from ‘post-treatment collapsed lung with chest drains installed’, despite the chest-drains being easier to detect than the collapse itself)", and I don't yet see how generating this sort of training data and training modern image-classifiers thereon addresses the tricky alignment challenges I worry about.

Stuart, in an email thread: In simple or typical environments, simple proxies can achieve desired goals. Thus AIs tend to learn simple proxies, either directly (programmers write down what they currently think the goal is, leaving important pieces out) or indirectly (a simple proxy fits the training data they receive, e.g. image classifiers focusing on spurious correlations).

Then the AI develops a more complicated world model, either because the AI is becoming smarter or because the environment changes by itself. At this point, by the usual Goodhart arguments, the simple proxy no longer encodes desired goals, and can be actively pernicious.

What we're trying to do is to ensure that, when the AI transitions to a different world model, this updates its reward function at the same time. Capability increases should lead immediately to alignment increases (or at least alignment changes); this is the whole model splintering/value extrapolation approach.

The benchmark we published is a much-simplified example of this: the "typical environment" is the labeled datasets where facial expression and text are fully correlated. The "simple proxy/simple reward function" is the labeling of these images. The "more complicated world model" is the unlabeled data that the algorithm encounters, which includes images where the expression feature and the text feature are uncorrelated. The "alignment increase" (or, at least, the first step of this) is the algorithm realising that there are multiple distinct features in its "world model" (the unlabeled images) that could explain the labels, and thus generating multiple candidates for its "reward function".

One valid question worth asking is why we focused on image classification in a rather narrow toy example. The answer is that, after many years of work in this area, we've concluded that the key insights in extending reward functions do not lie in high-level philosophy, mathematics, or modelling. These have been useful, but have (temporarily?) run their course. Instead, practical experiments in value extrapolation seem necessary - and these will ultimately generate theoretical insights. Indeed, this has already happened; we now have, I believe, a much better understanding of model splintering than before we started working on this.

As a minor example, this approach seems to generate a new form of interpretability. When the algorithm asks the human to label a "smiling face with SAD written on it", it doesn't have a deep understanding of either expression or text; nor do humans have an understanding of what features it is really using. Nevertheless, seeing the ambiguous image gives us direct insight into the "reward functions" it is comparing, a potential new form of interpretability. There are other novel theoretical insights which we've been discussing in the company, but they're not yet written up for public presentation.

We're planning to generalise the approach and insights from image classifiers to other agent designs (RL agents, recommender systems, language models...); this will generate more insights and understanding on how value extrapolation works in general.

Nate: In Nate-speak, the main thing I took away from what you've said is "I want alignment to generalize when capabilities generalize. Also, we're hoping to get modern image classifiers to ask for labels on ambiguous data."

"Get the AI to ask for labels on ambiguous data" is one of many ideas I'd put on a list of shallow alignment ideas that are worth implementing. To my eye, it doesn't seem particularly related to the problem of pointing an AGI at something in a way that's robust to capabilities-start-generalizing.

It's a fine simple tool to use to help point at the concept you were hoping to point at, if you can get an AGI to do the thing you're pointing toward at all, and it would be embarrassing if we didn't try it. And I'm happy to have people trying early versions of such things as soon as possible. But I don't see these sorts of things as shedding much light on how you get a post-left-turn AGI to optimize for some concept of your choosing in the first place. If you could do that, then sure, getting it to ask for clarification when the training data is ambiguous is a nice extra saving throw (if it wasn't already doing that automatically because of some deeper corrigibility success), but I don't currently see this sort of thing as attacking one of the core issues.^[5]
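As a toy illustration of the "ask for labels on ambiguous data" idea discussed above, here is a minimal sketch. It is not the published benchmark and not Stuart's actual method: the two-feature "images", the two candidate labeling rules, and every function name are hypothetical stand-ins chosen for illustration. It only shows the basic shape: several candidate "reward functions" fit the fully correlated labeled data equally well, and the inputs worth asking a human about are exactly the ones where those candidates disagree.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical toy "image": (facial expression, text written on the image).
Image = Tuple[str, str]
Labeler = Callable[[Image], str]

# Two hypothetical candidate "reward functions" (labeling rules).
CANDIDATES: Dict[str, Labeler] = {
    "expression": lambda img: img[0],  # label by the face
    "text": lambda img: img[1],        # label by the written word
}

def consistent_candidates(labeled: List[Tuple[Image, str]]) -> Dict[str, Labeler]:
    """Keep every candidate rule that reproduces all of the given labels."""
    return {
        name: rule
        for name, rule in CANDIDATES.items()
        if all(rule(img) == label for img, label in labeled)
    }

def ambiguous_images(rules: Dict[str, Labeler], unlabeled: List[Image]) -> List[Image]:
    """Return the unlabeled images on which the surviving rules disagree."""
    return [img for img in unlabeled if len({rule(img) for rule in rules.values()}) > 1]

# "Typical environment": expression and text are fully correlated.
labeled = [(("HAPPY", "HAPPY"), "HAPPY"), (("SAD", "SAD"), "SAD")]
# "More complicated world model": the features come apart.
unlabeled = [("HAPPY", "HAPPY"), ("HAPPY", "SAD"), ("SAD", "HAPPY")]

survivors = consistent_candidates(labeled)
queries = ambiguous_images(survivors, unlabeled)
print(sorted(survivors))  # both rules fit the labeled data
print(queries)            # the images a human would be asked to label

# One human answer on an ambiguous image is enough to separate the rules here.
resolved = consistent_candidates(labeled + [(("HAPPY", "SAD"), "HAPPY")])
print(sorted(resolved))
```

Note that the sketch presupposes a fixed, hand-written list of candidate rules and a system that already does what it is pointed at, which is the part this section argues is left unaddressed.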

Andrew Critch & political solutions

Imaginary Andrew Critch: Just politick between the AGI teams and get them all to agree to take the problem seriously, not race, not cut corners on safety, etc.

Nate: Uh, that ship sailed in, like, late 2015. My fairly-strong impression, from my proximity to the current politics between the current orgs, is "nope".

Also, even if this wasn't a straight-up "nope", you have the question of what you do with your cooperation. Somehow you've still got to leverage this cooperation into the end of the acute risk period, before the people outside your alliance end the world. And this involves having a leadership structure that can distinguish bad plans from good ones.

The alliance helps, for sure. It takes a bunch of the time pressure off (assuming your management is legibly capable of distinguishing good deployment ideas from bad ones). I endorse attempts to form such an alliance. (And it sure would be undignified for our world to die of antitrust law at the final extremity.) But it's not an attempt to solve this hard technical problem, and it doesn't alleviate enough pressure to cause me to think that the problem would eventually be solved, in this field where ~nobody manages to strike for the heart of the problem before them.

Imaginary Andrew Critch: So get global coordination going! Or have some major nation-state regulate global use of AI, in some legitimate way!

Nate: Here I basically have the same response: First, can't be done (though I endorse attempts to prove me wrong, and recommend practicing by trying to effect important political change on smaller-stakes challenges ASAP (The time is ripe for sweeping global coordination in pandemic preparedness! We just had our warning shot! If we'll be able to do something about AGI later, presumably we can do something analogous about pandemics now!)).

Second, it doesn't alleviate enough pressure; the bureaucrats can't tell real solutions from bad ones; the cost to build an unaligned AGI drops each year; etc., etc. Sufficiently good global coordination is a win condition, but we're not anywhere close to on track for that, and in real life we're still going to need technical solutions.

Which, apparently, only a handful of people in the world are trying to provide.

What about superbabies?

Nate: I doubt we have the time, but sure, go for superbabies. It's as dignified as any of the other attempts to walk around this hard problem.

What about other MIRI people?

There are a few people supported at least in part by MIRI (such as Scott and Vanessa) who seem to me to have identified confusing and poorly-understood aspects of cognition. And their targets strike me as the sort of things where if we got less confused about what the heck was going on, then we might thereby achieve a somewhat better understanding of minds/optimization/etc., in a way that sheds some light on the hard problems. So yeah, I'd chalk a few other MIRI-supported folk up in the "trying to tackle the hard problems" column.

We still wouldn’t have anything close to a full understanding, and at the progress rate of the last decade, I’d expect it to take a century for research directions like these to actually get us to an understanding of minds sufficient to align them.

Maybe early breakthroughs chain into follow-up breakthroughs that shorten that time? Or maybe if you have fifty people trying that sort of thing, instead of 3–6, one of them ends up tugging on a thread that unravels the whole knot if they manage to succeed in time. It seems good to me that researchers are trying approaches like these, but the existence of a handful of people making such an attempt doesn’t seem to me to represent much of an update about humanity’s odds of survival.

High-level view

I again stress that all the people whose plans I am pessimistic about are people that I consider virtuous, and whose efforts I applaud. (And that my characterizations of people above are probably not endorsed by those people, and that I'm putting less effort into passing their ideological Turing Tests than would be virtuous of me, etc. etc.)

Nevertheless, my overall impression is that most of the new people coming into alignment research end up pursuing research that seems doomed to me, not just because they're unlikely to succeed at their stated research goals, but because their stated research goals have little overlap with what seem to me to be the tricky bits. Or, well, that's what happens at best; what happens at worst is they wind up doing capabilities work with a thin veneer of alignment research.

Perhaps unfairly, my subjective experience of people entering the alignment research field is that there are:

a bunch of plans like Owen's (that seem to me to just completely miss the problem),
and a bunch of people who study some local phenomenon of modern systems that seems to me to have little relationship to the difficult problems that I expect to arise once things start getting serious, while calling that "alignment" (thus watering down the term, and allowing them to convince themselves that alignment is actually easy because it's just as easy to train a language model to answer "morality" questions as it is to train it to explain jokes or whatever),
and a few people who do capabilities work so that they can "stay near the action",
and very few who are taking stabs at the hard problems.

An exception is interpretability work, which I endorse, and which I think is getting rightful efforts (though I will caveat that some grim part of me expects that somehow interpretability work will be used to boost capabilities long before it gets to the high level required to face down the tricky problems I expect in the late game). And there are definitely a handful of folk plugging away at research proposals that seem to me to have non-trivial inner product with the tricky problems.

In fact, when writing this list, I was slightly pleasantly surprised by how many of the research directions seem to me to have non-trivial inner product with the tricky problems.^[6]

This isn't as much of a positive update as it might first seem, on account of how it looks to me like the total effort in the field is not distributed evenly across all the above proposals, and I still have a general sense that most researchers aren't really asking questions whose answers would really help us out. But it is something of a positive update nevertheless.

Returning to one of the better-by-my-lights proposals from above, Natural Abstractions: If this agenda succeeded and was correct in a key hypothesis, this would directly solve a big chunk of the problem.

I don't buy the key hypothesis (in the relevant way), and I don't expect that agenda to succeed.^[7] But if I was saying that about a hundred pretty-uncorrelated agendas being pursued by two hundred people, I'd start to think that maybe the odds are in our favor.

My overall impression is still that when I actually look at the particular community we have, weighted by person-hours, the large majority of the field isn't trying to solve the problem(s) I expect to kill us. They're just wandering off in some other direction.

It could turn out that I’m wrong about one of these other directions. But "turns out the hard/deep problem I thought I could see, did not in fact exist" feels a lot less likely, on my own models, than "one of these 100 people, whose research would clearly solve the problem if it achieved its self-professed goals, might in fact be able to achieve their goals (despite me not sharing their research intuitions)".

So the status quo looks grim to me.

I in fact think it's nice to have some people saying "we can totally route around that problem", and then pursuing research paths that they think route around the problem!

But currently, we have only a few fractions of plans that look to me to be trying to solve the problem that I expect to actually kill us. Like a field of contingency plans with no work going into a Plan A; or like a field of pandemic preparedness that immediately turned its gaze away from the true disaster scenarios and focused the vast majority of its effort on ideas like “get people to eat healthier so that their immune systems will be better-prepared”. (Not a perfect analogy; sorry.)

Hence: I'm not highly-pessimistic about our prospects because I think this problem is extraordinarily hard. I think this problem is normally hard, and very little effort is being deployed toward solving it.

Like, you know how some people out there (who I'm reluctant to name for fear that reminding them of their old stances will contribute to fixing them in their old ways) are like, "Your mistake was attempting to put a goal into the AGI; what you actually need to do is keep your hands off it and raise it compassionately!"? And from our perspective, they're just walking blindly into the razor blades?

And then other people are like, "The problem is giving the AGI a bad goal, or letting bad people control it", and... well, that's probably still where some of you get off the train, but to the rest of us, these people also look like they're walking willfully into the razor blades?

Well, from my perspective, the people who are like, "Just keep training it on your objective while being somewhat clever about the training, maybe that empirically works", are also walking directly into the razor blades.

(And it doesn't help that a bunch of folks are like "Well, if you're right, then we'll be able to update later, when we observe that getting language models to answer ethical questions is mysteriously trickier than getting them to answer other sorts of questions", apparently impervious to my cries of "No, my model does not predict that, my model does not predict that we get all that much more advance evidence than we've got already". If the evidence we have isn't enough to get people focused on the central problems, then we seem to me to be in rather a lot of trouble.)

My current prophecy is not so much "death by problem too hard" as "death by problem not assailed".

Which is absolutely a challenge. I'd love to see more people attacking the things that seem to me like they're at the core.

^[1]
I ran a few of the dialogs past the relevant people, but that has empirically dragged out the amount of time it takes this post to publish, and I have a handful of other posts to publish afterwards, so I neglected to get feedback from most of the people mentioned. Sorry.
^[2]
Much of Vanessa, Scott, etc.'s work does look to me like it is grappling with confusions related to the problem of aiming minds in theory, and if their research succeeds according to their own lights then I would expect to have a better understanding of how to aim minds in general, even ones that had undergone some sort of "sharp left turn".
Which is not to say that I’m optimistic about whether any of these plans will succeed by their own lights. Regardless, they get points for taking a swing, and the thing I’m mostly advocating for is that more people take swings at this problem at all, not that we filter strongly on my optimism about specific angles of attack.
I tried to solve the problem myself for a few years, and failed. Turns out I wasn't all that good at it.
Maybe I'll be able to do better next time, and I poke at it every so often. (Even though in my mainline prediction, we won’t have the time to complete the sort of research paths that I can see and that I think have any chance of working.)
MIRI funds or offers-to-fund most every researcher who I see as having this "their work would help with the generalization problem if they succeeded" property and as doing novel, nontrivial work, so it's no coincidence that I feel more positive about Vanessa, etc.'s work. But I'd like to see far more attempts to solve this problem than the field is currently marshaling.
^[3]
Again, to be clear, it's nice to have some people trying to route around the hard problems wholesale. But I don't count such attempts as attacks on the problem itself. (I'm also not optimistic about any attempts I have yet seen to dodge the problem, but that's a digression from today's topic.)
^[4]
I couldn't understand Stuart's views from what he's written publicly, so I ran this section by Stuart and Rebecca, who requested that I use actual quotes instead of my attempted paraphrasings. If I'd had more time, I'd like to have run all the dialogs by the researchers I mentioned in this post, and iterated until I could pass everyone's ideological Turing Test, as opposed to the current awkward set-up where the people that I thought I understood didn't get as much chance for feedback. But the time delay from editing this one section is evidence that this wouldn't be worth the time burnt. Instead, I hope the comments can correct any mischaracterizations on my part.
^[5]
Note also that while having the AI ask for clarification in the face of ambiguity is nice and helpful, it is of course far from autonomous-AGI-grade.
^[6]
I specifically see:
- ~3 MIRI-supported research approaches that are trying to attack a chunk of the hard problem (with a caveat that I think the relevant chunks are too small and progress is too slow for this to increase humanity's odds of success by much).
- ~1 other research approach that could maybe help address the core difficulty if it succeeds wildly more than I currently expect it to succeed (albeit no one is currently spending much time on this research approach): Natural Abstractions. Maybe 2, if you count sufficiently ambitious interpretability work.
- ~2 research approaches that mostly don't help address the core difficulty (unless perhaps more ambitious versions of those proposals are developed, and the ambitious versions wildly succeed), but might provide small safety boosts on the mainline if other research addresses the core difficulty: Concept Extrapolation, and current interpretability work (with a caveat that sufficiently ambitious interpretability work would seem more promising to me than this).
- 9+ approaches that appear to me to be either assuming away what look to me like the key problems, or hoping that we can do other things that allow us to avoid facing the problem: Truthful AI, ELK, AI Services, Evan's approach, the Richard/Rohin meta-approach, Vivek's approach, Critch's approach, superbabies, and the "maybe there is a pretty wide attractor basin around my own values" idea.
^[7]
I rate "interpretability succeeds so wildly that we can understand and aim one of the first AGIs" as probably a bit more plausible than "natural abstractions are so natural that, by understanding them, we can practically find concepts-worth-optimizing-for in an AGI". Both seem very unlikely to me, though they meet my bar for “deserving of a serious effort by humanity” in case they work out.

I'm going to spend most of this comment responding to your concrete remarks about ELK, but I wanted to start with some meta level discussion because it seems to cut closer to the heart of the issue and might be more generally applicable.

I think a productive way forward (when working on alignment or on other research problems) is to try to identify the hardest concrete difficulties we can understand and then try to make progress on them. This involves acknowledging that we can't anticipate all possible problems, but expecting that solving the concrete problems is a useful way to make steps forward and learn general lessons. It involves solving individual challenges, even if none of them will address the whole problem, and even if we have a vague sense that further difficulties will arise. It means not becoming too pessimistic about a direction until we see fairly concretely where it's stuck, partially because we hope that zooming in on a very concrete case where you get stuck is the main way to eventually make progress.

My sense is that you have more faith in a rough intuitive sense you've developed of what the "hard part" of alignment is, and so you'd primarily recommend thinking about that until we feel less confused. I disagree in large part because I feel like your broad intuitive sense has not yet had much opportunity to make contact with either reality or with formal reasoning, and I'd guess it's not precise enough to be a useful guide to research prioritization.

More concretely, you talk about novel mechanisms by which AI systems gain capabilities, but I think you haven't said much concrete about why existing alignment work couldn't address these mechanisms. This looks to me like a pretty unproductive stance; I suspect you are wrong about the shape of the problem, but if you are right then I think your main realistic path to impact involves saying something more concrete about why you think this.

I think you don't see the situation the same way, probably because you feel like you have said plenty concrete. Perhaps this is the most serious disagreement of all. I don't think saying there is a "capabilities well" is helpfully concrete until you say something about what it looks like, why it poses alignment problems different from SGD and why particular approaches don't generalize, etc.

In ARC's day-to-day work we write down particular models of capabilities that would generalize far outside of training (e.g.: what about a causal model of the world that holds robustly? what about logical deduction from valid premises with longer chains of reasoning? what about continuing to learn by trial and error when deployed in a novel environment?), and ask whether a given alignment solution would generalize along with them. If we can find any gap, then it goes on the list of problems. We focus on the gaps that seem least likely to be addressable by using known techniques, and try to develop new techniques or to identify general reasons why the gap is unresolvable.

My guess is that you are playing a roughly similar game much more informally, and that you are just making a mistake because reasoning about this stuff is in fact hard. But I can't really tell, since your thinking is happening in private and we are seeing the vague intuitions that result. (I've been hanging around MIRI for a long time, and I suspect I have a better model of your and Eliezer's position than virtually anyone else outside of MIRI, yet this is still where I'm at.)

Anyway, now turning to your discussion of ELK in particular.

Your first problem is that the recent capabilities gains made by the AGI might not have come from gradient descent (much like how humans’ sudden explosion of technological knowledge accumulated in our culture rather than our genes, once we turned the corner). You might not be able to just "expose the bad behavior" to gradients that you can hit to correct the thing, at least not easily and quickly.

I often think and write about other places where capabilities may come from that could challenge our basic alignment plan. Four particularly salient examples:

Your AI might perform search internally, e.g. looking for hypotheses that match the data or for policies that work well.
Natural selection may occur internally, e.g. cognitive patterns that acquire power might tend to dominate the behavior of your AI (despite the AI having no explicit prediction that they would work well).
Your AI might reason about how to think better, e.g. select cognitive actions based on anticipated consequences of those cognitive actions.
Your AI might deploy new algorithms that pose their own alignment risk for different (potentially unanticipated) reasons.

Some of these represent real problems, but none of them seem to fundamentally change the game or be deal-breakers:

Aligning the internal search seems very similar to aligning SGD on the outside. We could distinguish two additional difficulties in this case:
1. Because the search is on the inside, we can't directly apply our alignment insights to align it. Instead we need to ensure that SGD learns to align the search. This itself poses two difficulties: (a) the outer gradient needs to incentivize doing this, (b) we need to argue that it's nearly as easy for SGD to learn the aligned search as to learn the unaligned search (or build scaffolding such that it becomes similarly easy). This is what we're talking about in this appendix, and it's part of why we are skeptical about approaches to ELK based on simple regularizers. But we don't see a reason that either (a) or (b) would be a dealbreaker, and we tentatively think our current approaches to ontology identification would at least solve (a) if they were successful at all. It's pretty hard to talk about (b) without having more clarity about what the alignment scheme actually looks like, but we don't see an in-principle reason it's hard.
2. The internal search algorithm may not be SGD, and perhaps our alignment strategy was specific to some detail of SGD. But SGD appears to be amongst the hardest search algorithms, and ARC tries to pursue approaches that work for other algorithms rather than leveraging anything about SGD in particular. We're definitely in the market for other search algorithms that cause trouble but don't yet know of any.
Natural selection on the inside is similar but potentially more tricky, because the optimizer has more limited control over how this search works. This is like the analog of memetic selection being smarter than humans and eventually overpowering or hijacking human consequentialism. Another extreme example is that it seems like a large enough neural network may be catastrophically misaligned at initialization simply because of selection amongst activation patterns within a single forward pass. Ultimately we'd like to handle this in exactly the same way that we handle the last point, by some combination of (a) we can just directly apply the same hope from the previous section even to natural selection, (b) we can run explicit searches that are more powerful than implicit search by natural selection within our model, which requires ensuring that our explicit learned search captures whatever is good about natural selection (this seems tough but not at all obviously impossible to me). It's hard to talk about option (a) without seeing if/how we solve the problem from the last point. We could definitely work on option (b) now, and a large enough ARC would be working on it, but it seems like a relatively low priority since it's both very remote from existing systems and seems relatively unlikely-to-me to be the simplest place where we get stuck.
If our model is selecting cognitive actions, or designing new algorithms, then our core hope is that an aligned model will try to think in an aligned way. So if we've been succeeding at alignment so far then the model will be trying to stay aligned. By analogy, if humans cared about the amount of human DNA in the universe, then to the extent that cultural evolution was guided by human consequentialism (rather than e.g. being memetic selection), we would be trying to develop cultural machinery that was helpful for maximizing the amount of human DNA in the universe.
One way this can go wrong is if our model wants to stay aligned but fails, e.g. because it identifies new techniques for thinking that themselves pose new alignment difficulties (just as we desire human flourishing but may instead implement AI systems that want paperclips). I think this is a real problem, but there are a lot of reasons I don't consider it an existential challenge for our approach:
1. If you've succeeded at alignment so far, then your AI will also consider this a problem and will be trying to solve it. I think we should relate to our AI, discovering new ways to think that might pose new alignment difficulties, in the same way that we relate to future humans who may encounter alignment difficulties. The AI may solve the problem, or may implement policy solutions, or etc., and our role is to set them up for success just like we are trying to set up future humans for success. AI compresses the timescale both for "new AI algorithms with new alignment problems" but also for all of the solutions to those problems, so I don't think it changes the game from future humans. And so I'd focus on prosaic AI alignment for exactly the same reasons I focus on prosaic AI alignment when trying to help future humans succeed at alignment.
2. I think that we should be considering the particular algorithms that might pose a new alignment problem, and trying to solve alignment for each of them. If we have some general reason to think that new algorithms will be much harder than old algorithms, or that lessons won't transfer, then we can discuss those and whether they should affect research prioritization. So far I don't think we have such arguments, and so I think we should just be looking for algorithms that might pose problems. (I don't actually think that's the highest priority, because prosaic ML so obviously poses problems, and the other problems we see seem so closely analogous to the ones posed by prosaic ML. But I'm certainly in the market for other problems and think that a large enough research community should already be actively looking for them.)

Possible disagreements between us: (i) you think that at least one of these examples looks really bad for our approach, (ii) you have other examples in mind, (iii) you don't think we can write down a concrete example that looks bad, but we have reason to expect other kinds of capability gains that will be bad, (iv) nothing looks like a dealbreaker in particular, but it's just contributing to a long list of problems you'd have to solve and that's either a lot of work or something probably won't work out.

For me, the upshot of all of this is that SGD poses some obvious problems, that those problems are the most likely to actually occur, that they seem similar to (and at least subproblems of) the other alignment problems we may face, and that there are neither super compelling alternatives to aligning SGD nor particular arguments that the rest of the problem is harder than this step.

Your second problem is that the AGI's concepts might rapidly get totally uninterpretable to your ELK head. Like, you could imagine doing neuroimaging on your mammals all the way through the evolution process. They've got some hunger instincts in there, but it's not like they’re smart enough yet to represent the concept of "inclusive genetic fitness" correctly, so you figure you'll just fix it when they get capable enough to understand the alternative (of eating because it's instrumentally useful for procreation). And so far you're doing great: you've basically decoded the visual cortex, and have a pretty decent understanding of what it's visualizing.

Our goal is to learn a reporter that describes the latent knowledge of the model, and to keep this up to date as the model changes under SGD. If thinking about SGD, we usually think concretely about a single step of SGD, and how you could find a good reporter at the end of that gradient descent step assuming you had one at the beginning.

It feels to me like what you are saying here is just "you might not be able to solve ELK." Or else maybe restating the previous point, that the model builds latent knowledge by mechanisms other than SGD and therefore you need to learn a reporter that can also follow along with those other mechanisms.

In either case, I can't speak to whether it's helpful for the audience understanding why ELK is hard, but it is certainly not helping me understand why you think ELK is hard. I think this discussion is just too vague to be helpful.

I think it's not crazy for you to say "ARC's hopes about how to solve ELK are too vague to seem worth engaging with" (this is pretty similar to me saying "Nate's arguments about why alignment is hard are too vague to seem worth engaging with").

Analogously, your ELK head's abilities are liable to fall off a cliff right as the AGI's capabilities start generalizing way outside of its training distribution.

But can you say something concrete about why? What I'd like to do is talk about what the AGI is actually thinking, the particular computation it's running, so that we can talk about why that computation keeps being correlated with reality off distribution and then ask whether the reporter remains correlated with reality. When I go through this exercise I don't see big dealbreakers, and I can't tell if you disagree with that diagnosis, or if you are noticing other things that might be going on inside the AI, or if the difference is that I think "this looks like it might work in all the concrete cases we can see" is a relevant signal and you think "nah the cases we can't see are way worse than those we can see."

And if they don't, then this ELK head is (in this hypothetical) able to decode and understand the workings of an alien mind. Likely a kludgey behemoth of an alien mind. This itself is liable to require quite a lot of capability, quite plausibly of the sort that humanity gets first from the systems that took sharp left-turns, rather than systems that ground along today's scaling curves until they scaled that far.

Again, this seems too vague to be helpful, or perhaps just mistaken. The reporter is not some other AI looking at your predictor and trying to "decode its workings," or maybe it is but if so it's just because those English words are vague and broad. Can we talk about the particular kinds of cognition that your AI might be performing, such that you don't think this works? (Or which would require the reporter to itself be using magic-mystery-juice-of-intelligence?)

That's really the central theme of my response, so it's worth restating: ARC loves examples of ways an AI might be thinking such that ELK is difficult. But your description of the sharp left turn is too vague to be helpful for this purpose, and so I'd either like to turn this into more concrete discussion of the internals of the algorithm, or else some significantly more precise argument about why we expect the unknown possible internals to be so much less favorable for ELK than any of the concrete examples we can write down.^[1]

[1]
I'd like to head off a possible response you might make that I disagree with: "Sure your algorithm works for any example you can write down, but the whole point is that you need it to work for alien cognition, where humans don't understand why it works. So of course it works on concrete examples but not in the unknown real world." I'm putting this in a footnote because it seems like a digression and I have no idea if this is your view.
My main response is that we can in fact talk about concrete examples where "why your AI system's cognition works" isn't accessible to humans in the relevant ways:
- We can consider tricky facts we understand about how to reason, for which our discovery of those facts is empirically contingent (and where discovering those facts is harder than discovering the reasons itself). Then we can consider whether our AI alignment strategies would work even if humans hadn't figured out the relevant facts about reasoning.
- We can consider AI cognition which is contingent on hypothesized unknown-to-human facts, e.g. about the causal structure of reality, or about key facts about mathematics, or whatever else.
- Most of our ELK approaches don't make no-holds-barred use of "can a human come up with some story about why this AI cognition may work," and so this just isn't a particularly salient threshold anyway. As a silly example, if you were solving this problem with a speed prior (or indeed with any of the approaches in the regularization section of the ELK document) you wouldn't expect a particular key threshold at the space of strategies that a human understands.

I think a productive way forward (when working on alignment or on other research problems) is to try to identify the hardest concrete difficulties we can understand then try to make progress on them. This involves acknowledging that we can't anticipate all possible problems, but expecting that solving the concrete problems is a useful way to make steps forward and learn general lessons. It involves solving individual challenges, even if none of them will address the whole problem, and even if we have a vague sense that further difficulties will arise.

I would uncharitably summarize this as "let's just assume that finding a faithful concrete operationalization of the problem is not itself the hard part". And then, any time finding a faithful concrete operationalization of the problem is itself the hard part, you basically just automatically fail.

Is that... wrong? Am I missing something here? Is there some reason to expect that always working on the legible parts of a problem will somehow induce progress on the illegible parts, even when making-the-illegible-parts-legible is itself "the hard part"? (I mean, just intuitively, I'd expect hacking away at the legible parts to induce some progress on the illegible, but it sounds extremely slow, to the point where it would very plausibly just not converge to solving the illegible parts at all.)

If I had to guess at your model here, I'd guess your intuition is something like "well, trying to make progress without concrete operationalizations is just really hard, it's too easy to become decoupled from mathematical/physical reality". To which my response would be "just because it's hard does not mean we can ignore it and still expect to solve the problem, especially in a reasonable timeframe". Yes, staying grounded is hard when finding faithful concrete operationalizations is itself the hard part of the problem, but we can't actually avoid that.

It means not becoming too pessimistic about a direction until we see fairly concretely where it's stuck, partially because we hope that zooming in on a very concrete case where you get stuck is the main way to eventually make progress.

This is great early on in the process when we don't yet know what the hard parts are. But pretty quickly, we usually see intuitively-similar bottlenecks coming up again and again. Just ignoring those bottlenecks because we don't know how to operationalize them yet does not sound like an optimal search-strategy. What we want to do is focus on those intuitions, and figure out generalizable operationalizations of the bottlenecks. That itself is often where the hard work is (especially in alignment). Getting hyper-focused on a single concrete failure mode with a single strategy just results in an operationalization which is too narrow and potentially not relevant to most other strategies; a better approach is to look at intuitively-similar failure modes in a bunch of strategies and try to find an operationalization which unifies them and captures the intuitive pattern.

Similarly, once we have some intuition for where the bottlenecks are, it does seem completely correct to mostly dismiss strategies which are not obviously tackling them in some way, even before the bottlenecks are fully formalized. I mean, maybe spot-check one once in a while, but mostly just ignore such strategies. Otherwise, we just waste a ton of time on strategies which are in fact very likely hopeless.

Uncharitably summarizing again (and hopefully you will correct me if this is inaccurate): it sounds like you want to just not update very much on evidence which we don't know how to formalize yet. And I'd say this is basically the same mistake as e.g. someone who says we have no idea whether an updated version of a covid vaccine works until there's been a phase-3 clinical trial with a statistically significant result.

It means not becoming too pessimistic about a direction until we see fairly concretely where it's stuck, partially because we hope that zooming in on a very concrete case where you get stuck is the main way to eventually make progress.

Also, a separate issue with this: it sounds like this will systematically generate strategies which ignore unknown unknowns. It's like the exact opposite of security mindset.

I don't think those are great summaries. I think this is probably some misunderstanding about what ARC is trying to do and about what I mean by "concrete." In particular, "concrete" doesn't mean "formalized," it means more like: you are able to discuss a bunch of concrete examples of the difficulty and why they lead to failure of particular concrete approaches; you are able to point out where the problem will appear in a particular decomposition of the problem, and would revise your picture if that turned out to be wrong; etc.

You write:

But pretty quickly, we usually see intuitively-similar bottlenecks coming up again and again.

I don't yet have this sense about a "sharp left turn" bottleneck.

I think I would agree with you if we'd looked at a bunch of plausible approaches, and then convinced ourselves that they would fail. And then we tried to introduce the sharp left turn to capture the unifying theme of those failures and to start exploring what's really going on. At a high level that's very similar to what ARC is doing day to day, looking at a bunch of approaches to a problem, seeing why they fail, and then trying to understand the nature of the problem so that we can succeed.

But for the sharp left turn I think we basically don't have examples. Existing alignment strategies fail in much more basic ways, which I'd call "concrete." We don't have examples of strategies that don't run into concrete difficulties, but they fail for a vague and hard-to-understand reason that we'd summarize as a "sharp left turn." So I don't really believe that this difficulty is being abstracted from a pattern of failures.

There can be other ways to learn about problems, and I didn't think Nate was even saying that this problem is derived from examples of obstructions to potential alignment approaches. I think Nate's perspective is that he has some pretty good arguments and intuitions about why a sharp left turn will cause novel problems. And so a lot of what I'm saying is that I'm not yet buying it, that I think Nate's argument has fatal holes in it which are hidden by its vagueness, and that if the arguments are really very important then we should be trying hard to make them more concrete and to address those holes.

Is there some reason to expect that always working on the legible parts of a problem will somehow induce progress on the illegible parts, even when making-the-illegible-parts-legible is itself "the hard part"?

ARC does theoretical work guided by concrete stories about how a proposed AI system could fail; we are focused on the "legible part" insofar as we try to fix failures for which we can tell concrete stories. I'm not quite sure what you mean by "illegible" and so this might just be a miscommunication, but I think this is the relevant sense of "illegible" so I'll respond briefly to it.

I think we can tell concrete stories about deceptive alignment; about ontology mismatches making it hard or meaningless to "elicit latent knowledge;" about exploitability of humans making debate impossible; and so on. And I think those stories we can tell seem to do a great job of capturing the reasons why we would expect existing alignment approaches to fail. So if we addressed these concrete stories I would feel like we've made real progress. That's a huge part of my optimism about concrete stories.

It feels to me like either we are miscommunicating about what ARC is doing, or you are saying that those concrete difficulties aren't the really important failures. That even if an alignment approach addressed all of them, it still wouldn't represent meaningful progress because the true risk is the risk that cannot be named.

One thing you might mean is that "these concrete difficulties are just shadows of a deeper core." But I think that's not actually a challenge to ARC's approach at all, and it's not that different from my own view. I think that if you have an intuitive sense of a deep problem, then it's really great to attack specific instantiations of the problem as a way to learn about the deep core. I feel pretty good about this approach, and I think it's pretty standard in most disciplines that face problems like this (e.g. if you are deeply confused about physics, it's good to think a lot about the simplest concrete confusing phenomenon and understand it well; if you are confused about how to design algorithms that overcome a conceptual barrier, it's good to think about the simplest concrete task that requires crossing that barrier; etc.).

Another thing you might mean is that "these concrete difficulties are distractions from a bigger difficulty that emerges at a later step of the plan." It's worth noting that ARC really does try to look at the whole plan and pick the step that is most likely to fail. But I do think it would be a problem for our methodology if there is a good argument about why plans will fail, which won't let us tell a concrete story about what the failure looks like. My position right now is that I don't see such an argument; I think we have some vague intuitions, and we have a bunch of examples which do correspond to concrete failure stories. I don't think there are any examples from which to infer the existence of a difficulty that can't be captured in concrete stories, and I'm not yet aware of arguments that I find persuasive without any examples. But I'm really quite strongly in the market for such arguments.

Also, a separate issue with this: it sounds like this will systematically generate strategies which ignore unknown unknowns. It's like the exact opposite of security mindset.

Here's how the situation feels to me. I know this isn't remotely fair as a summary of your view, it's just intended to illustrate where ARC is coming from. (It's also possible this is a research methodology disagreement, in which case I do just disagree strongly.)

Cryptographer: It seems like our existing proposals for "secure" communication are still vulnerable to man in the middle attacks. Better infrastructure for key distribution is one way to overcome this particular attack, so let's try to improve that. We can also see how this might fit in with the rest of our security infrastructure to help build to a secure internet, though no doubt the details will change.
Cryptography skeptic: The real difficulty isn't man in the middle attacks, it's that security is really hard. By focusing on concrete stuff like man-in-the-middle you are overlooking the real nature of the problem, focusing on the known risks rather than the unknown unknowns. Someone with a true security mindset wouldn't be fiddling around the edges like this.

I'm not saying that infrastructure for key distribution solves security (and indeed we have huge security problems). I'm saying that working on concrete problems is the right way to make progress in situations like this. I don't think this is in tension with security mindset. In fact I think effective people with security mindset spend most of their time thinking about concrete risks and how to address them.

It's great to generalize once you have a bunch of concrete risks and you think there is a deeper underlying pattern. But I think you basically need the examples to learn from, and if there is a real pattern then you should be able to instantiate it in any particular case rather than making reference to the pattern.

This was a good reply, I basically buy it. Thanks.

I understand the security mindset (from the ordinary paranoia post) as: "What are the unexamined assumptions of your security systems which merely stem from investing or adapting a given model?". The vulnerability comes from the model. The problem is the "unknowable unknowns". In addition to the Cryptographer and the Cryptography skeptic, I would add the NSA Quantum computing engineer. Concretisation and operationalisation of these problems may have implicit assumptions that could be system wide catastrophic.

I don't have clear ways of better articulating this back from analogy to Paul's concretisations of a proposed AI system. I'm not sure there's no disanalogy here. However it could be something like "We have this effective model of a proposed AI system. What are useful concretisations in which the AI system would fail?". The security mindset question would be something like "What representations in the 'UV-complete' theory of this AI system would lead to catastrophic failure modes?"

I'm probably missing something here though.

This comment made me notice a kind of duality:
- Paul wants to focus on finding concrete problems, and claims that Nate/Eliezer aren't being very concrete with their proposed problems.
- Nate/Eliezer want to focus on finding concrete solutions, and claim that Paul/other alignment researchers aren't being very concrete with their proposed solutions.

It seems like "how well do we understand the problem" is a crux here. I disagree with John's comment because it feels like he's assuming too much about our understanding of the problem. If you follow his strategy, then you can spend arbitrarily long trying to find a faithful concrete operationalization of a part of the problem that doesn't exist.

I don't feel like this is right (though I think this duality feels like a real thing that is important sometimes and is interesting to think about, so appreciated the comment).

ARC is spending its time right now (i) trying to write down concrete algorithms that solve ELK using heuristic arguments, and then trying to produce concrete examples in which they do the wrong thing, (ii) trying to write down concrete formalizations of heuristic arguments that have the desiderata needed for those algorithms to work, and trying to identify cases in which our algorithms don't yet meet those desiderata or they may be unachievable. The output is just actual code which is purported to solve major difficulties in alignment.

And on the flip side, I spend a significant amount of my time looking at the algorithms we are proposing (and the bigger plans into which they would fit if successful) and trying to find the best arguments I can that these plans will fail.

I think that the disagreement is more about what kind of concreteness is possible or desirable in this domain.

Put differently: I'm not saying that Nate and Eliezer are vague about problems but concrete about solutions, I'm saying they are vague about everything. And I don't think they are saying that I'm concrete about problems but vague about solutions, they would say that I'm concrete about parts of the solution/problem that don't matter while systematically pushing all the difficulty into the parts I'm still vague about.

I do think "how well do we understand the problem" seems like a pretty big crux; that leads Nate and Eliezer to think that I'm avoiding the predictably-important difficulty, and it leads me to think that Nate and Eliezer need to get more concrete in order to have an accurate picture of what's going on.

Yeah, my comment was sloppily phrased; I agree with "I think that the disagreement is more about what kind of concreteness is possible or desirable in this domain."

If you follow his strategy, then you can spend arbitrarily long trying to find a faithful concrete operationalization of a part of the problem that doesn't exist.

I don't think that's how this works? The strategy I'm recommending explicitly contains two parts where we gain evidence about whether a part of the problem actually exists:

1. noticing an intuitive pattern in the failure-modes of some strategies
2. attempting to formalize (which presumably includes backpropagating our mathematics into our intuitions)

... so if a part of the problem doesn't exist, then (a) we probably don't notice a pattern in the first place, but even if our notoriously unreliable human pattern-matchers over-match, then (b) while we're attempting to formalize we have plenty of opportunity to notice that maybe the pattern doesn't actually exist the way we thought it did.

It feels like you're looking for a duality which does not exist. I mean, the duality between "look for concrete solutions" and "look for concrete problems" I buy (and that would indeed cause one side to be over-optimistic and the other over-pessimistic in exactly the pattern we actually see between Paul and Nate/Eliezer). But it feels like you're also looking for a duality between how-Paul's-recommended-search-order-just-fails and how-mine-just-fails. And the reason that duality does not exist is because my recommended search order is using strictly more evidence; Paul is basically advocating ignoring a whole class of very useful evidence, and that makes his strategy straightforwardly suboptimal. If we were both picking different points on a pareto frontier, then yeah, there'd be a trade-off. But Paul just isn't on the pareto frontier.

I feel confused about the difference between your "attempt to formalize" step and Paul's "attempt to concretize" step. It feels like you can view either as a step towards the other - if you successfully formalize, then presumably you'll be able to concretize; but also one valuable step towards formalizing is by finding concrete examples and then generalizing from them. I think everyone agrees that it'd be great to end up with a formalism for the problem, and then disagrees on how much that process should involve "finding concrete examples of the problem". My own view is that since it's so incredibly easy for people to get lost in abstractions, people should try to concretize much more when talking about highly abstract domains. (Even when people are confident that they're not lost in abstractions, like Eliezer and Nate are, that's still really useful for conveying ideas to other people.)

Imaginary John: Well, uh, these days I'm mostly focusing on using my flimsy non-mastered grasp of the common-concept format to try to give a descriptive account of human values, because for some reason that's where I think the hope is. So I'm not actually working too much on this thing that you think takes a swing at the real problem (although I do flirt with it occasionally).

That's not actually what I spend most of my time on, it's just a thing which came up in conversation with Eliezer that one time. I've never actually spent much time on a descriptive account of human values; I generally try to work on things which are bottlenecks to a wide variety of strategies (i.e. convergent hard subproblems), not things which are narrowly targeted to a single strategy.

What I'm actually spending most of my time on right now is figuring out how abstractions end up represented in cognitive systems, and how those representations correspond to structures (presumably natural abstractions) in the environment. In particular, I'd like to say things about convergent representations, such that we can both (a) test the claims on a wide variety of existing systems, and (b) have theorems saying that the claims extend to new kinds of systems.

... which, amusingly, looks like a much more ambitious version of interpretability work.

My guess at part of your views:

1. There's ~one natural structure for capabilities, such that (assuming we don't have deep mastery of intelligence) nearly anything we build that is an AGI will have that structure.
2. Given this, there will be a point where an AI system switches from everything-muddled-in-a-soup to clean capabilities and muddled alignment (the "sharp left turn").

I basically agree that the plans I consider don't engage much with this sort of scenario. This is mostly because I don't expect this scenario and so I'm trying to solve the alignment problem in the worlds I do expect.

(For the reader: I am not saying "we're screwed if the sharp left turn happens so we should ignore it", I am saying that the sharp left turn is unlikely.)

A consequence is that I care a lot about knowing whether the sharp left turn is actually likely. Unfortunately so far I have found it pretty hard to understand why exactly you and Eliezer find it so likely. I think current SOTA on this disagreement is this post and I'd be keen on more work along those lines.

Some commentary on the conversation with me:

Imaginary Richard/Rohin: You seem awfully confident in this sharp left turn thing. And that the goals it was trained for won't just generalize. This seems characteristically overconfident.

This isn't exactly wrong -- I do think you are overconfident -- but I wouldn't say something like "characteristically overconfident" unless you were advocating for some particular decision right now which depended on others deferring to your high credences in something. It just doesn't seem useful to argue this point most of the time and it doesn't feature much in my reasoning.

For instance, observe that natural selection didn't try to get the inner optimizer to be aligned with inclusive genetic fitness at all. For all we know, a small amount of cleverness in exposing inner-misaligned behavior to the gradients will just be enough to fix the problem.

Good description of why I don't find the evolution analogy compelling for "sharp left turn is very likely".

And even if not that-exact-thing, then there are all sorts of ways that some other thing could come out of left field and just render the problem easy. So I don't see why you're worried.

I'd phrase it as "I don't see why you think [sharp left turn leading to failures of generalization of alignment that we can't notice and fix before we're dead] is very likely to happen". I'm worried too!

Nate: My model says that the hard problem rears its ugly head by default, in a pretty robust way. Clever ideas might suffice to subvert the hard problem (though my guess is that we need something more like understanding and mastery, rather than just a few clever ideas). I have considered an array of clever ideas that look to me like they would predictably-to-me fail to solve the problems, and I admit that my guess is that you're putting most of your hope on small clever ideas that I can already see would fail. But perhaps you have ideas that I do not. Do you yourself have any specific ideas for tackling the hard problem?
Imaginary Richard/Rohin: Train it, while being aware of inner alignment issues, and hope for the best.

I think if you define the hard problem to be the sharp left turn as described at the beginning of my comment then my response is "no, I don't usually focus on that problem" (which I would defend as the correct action to take).

Also if I had to summarize the plan in a sentence it would be "empower your oversight process as much as possible to detect problems in the AI system you're training (both in the outcomes it produces and the reasoning process it employs)".

Nate: That doesn't seem to me to even start to engage with the issue where the capabilities fall into an attractor and the alignment doesn't.

Yup, agreed.

Though if you weaken claim 1, that there is ~one natural structure to capabilities, to instead say that there are many possible structures to capabilities but the default one is deadly EU maximization, then I no longer agree. It seems pretty plausible to me that stronger oversight changes the structure of your capabilities.

Perhaps sometime we can both make a list of ways to train with inner alignment issues in mind, and then share them with each other, so that you can see whether you think I'm lacking awareness of some important tool you expect to be at our disposal, and so that I can go down your list and rattle off the reasons why the proposed training tools don't look to me like they result in alignment that is robust to sharp left turns. (Or find one that surprises me, and update.) But I don't want to delay this post any longer, so, some other time, maybe.

I think the more relevant cruxes are the claims at the top of this comment (particularly claim 1); I think if I've understood the "sharp left turn" correctly I agree with you that the approaches I have in mind don't help much (unless the approaches succeed wildly, to the point of mastering intelligence, e.g. my approaches include mechanistic interpretability which as you agree could in theory get to that point even if they aren't likely to in practice).

As someone with limited knowledge of AI or alignment, I found this post accessible. There were times when I thought I knew vaguely what Nate meant but would not be able to explain it, so I'm recording my confusions here to come back to when I've read up more. (If anyone wants to answer any of these r/NoStupidQuestions questions, that would be very helpful too.)

1. "Your first problem is that the recent capabilities gains made by the AGI might not have come from gradient descent". This is something that comes up in response to a few of the plans. Is the idea that during training, for advanced enough AIs, capabilities gains come both from gradient descent and from processing input / interacting with the world? Or does the second part happen only after it has finished training? What does that concretely look like in ML?
2. Is a lot of the disagreement about these plans just because of others finding the idea of a "sharp left turn" more unlikely than Nate, or is there more agreement about that idea but the disagreement is about what proposals might give us a shot at solving it?
3. What might an ambitious interpretability agenda focused on the sharp left turn and the generalization problem look like besides just trying harder at interpretability?
4. Another explanation of the "sharp left turn" would also be really helpful to me. At the moment, it feels like I can only explain why that happens by using analogies to humans/apes rather than being able to give a clear explanation for why we should expect that by default, using ML/alignment language.

What might an ambitious interpretability agenda focused on the sharp left turn and the generalization problem look like besides just trying harder at interpretability?

Some key pieces...

Desideratum 1: we need to aim for some kind of interpretability which will carry over across architectural/training paradigm changes, internal ontology shifts at runtime, etc. The tools need to work without needing a lot of new investment every time there's a big change.

In my own approach, that's what Selection Theorems would give us: theorems which characterize certain interpretable internal structures as instrumentally convergent across a wide range of architecture/internal ontology.

Desideratum 2: we need to be able to robustly tie the internal structures identified to some kind of high-level human-interpretable "things". The "things" could be mathematical, like e.g. we might aim to robustly recognize embedded search processes or embedded world models. Or, the "things" could be real-world things, like e.g. we might aim to robustly recognize embedded representations of natural abstractions from the environment (and the natural abstractions in the environment to which the representations correspond). Either way, this would have to involve more than just a bunch of proxies which are vaguely correlated with the human-intuitive concept(s); the correspondence both between learned representation and mathematical/real-world structure, and between human concept and mathematical/real-world structure, would have to be highly robust.

In my own approach, that's what the formalization of natural abstractions would give us: theorems which let us robustly talk about the things-which-embedded-representations-represent, in a way which also ties those things to human concepts.

Desideratum 3: we need to somehow guarantee that there's no important/dangerous cognitive work routing around the interpretable structures. E.g. if we're aiming to recognize embedded search processes, we need to somehow guarantee that there's no optimization performed in a way which would circumvent things-recognized-by-our-search-process-interpretability-tool. Or if we're aiming to recognize representations of natural abstractions in general, then we need to somehow guarantee that no important/dangerous cognitive work is routing through channels other than those concepts.

The natural abstraction framework fits this desideratum particularly well, since it directly talks about abstractions which summarize all the information relevant at a distance. There are no capabilities to be gained by using non-natural abstractions.

Finally, one thing which is not a desideratum but is an important barrier which most current interpretability work fails to tackle: interpretability is not compositional/reductive. If I understand each of 100 parts in isolation, that does not mean that I understand a system consisting of those 100 parts together. (If interpretability were compositional/reductive, then we'd already understand neural nets just fine, because individual neurons and weights are very simple!)

For 1—In humans, there’s the distinction between evolution-as-a-learning-algorithm versus within-lifetime learning. There’s some difference of opinion about which of those two slots will be occupied by the PyTorch code comprising our future AGI—the RFLO model says that this code will be doing something analogous to evolution, I say it will be doing something analogous to within-lifetime learning, see my discussion here.

My impression (from their writings) is that Nate & Eliezer are firmly in the former RFLO/evolution camp. If that’s your picture, then within-lifetime learning is a thing that happens inside a learned black box, and thus it’s a big step removed from the gradient descent (imagine: the outer-loop evolution-like gradient descent tweaks the weights, then the trained model thinks and acts and learns and grows and plans for a billion subjective seconds, then the outer-loop evolution-like gradient descent tweaks the weights, then the trained model thinks and acts and learns and grows and plans for a billion subjective seconds…). Then a “sharp left turn” could happen between gradient-descent steps, for example.

In my model, the human-written AGI PyTorch code is instead analogous to within-lifetime learning in humans, and it looks kinda like actor-critic model-based RL. There’s still some gradient descent, but the loss function is not directly “performance”, instead it’s things like self-supervised learning, and then there are also non-gradient-descent things like TD learning too. “Sharp left turns” don’t show up in my picture, at least not the same way. Or I guess, maybe instead of just one “sharp left turn”, the training process would have millions of “sharp left turns” as it keeps learning new things about the world (e.g. learning object permanence, learning that it’s an AGI running on a computer, learning physics, etc.), and each of these is almost guaranteed to help capabilities, but can potentially screw up alignment.
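To make the "non-gradient-descent things like TD learning" part concrete, here is a minimal sketch of tabular TD(0). It is only an illustration: the toy random-walk environment, the function name `td0_random_walk`, and the step size are all assumptions chosen for the example, not anything from the comment above. The point is just that the value estimates get updated online from a bootstrapped prediction error, with no outer "performance" loss being differentiated.

```python
import random


def td0_random_walk(num_states=5, episodes=2000, alpha=0.05, gamma=1.0, seed=0):
    """Tabular TD(0) on a toy random walk (a hypothetical environment).

    States are 0..num_states-1. The agent starts in the middle and steps
    left or right uniformly at random. Stepping off the right end yields
    reward 1; stepping off the left end yields reward 0.
    """
    rng = random.Random(seed)
    values = [0.5] * num_states  # initial value estimates
    for _ in range(episodes):
        state = num_states // 2
        while True:
            next_state = state + rng.choice((-1, 1))
            if next_state < 0:
                reward, next_value, done = 0.0, 0.0, True
            elif next_state >= num_states:
                reward, next_value, done = 1.0, 0.0, True
            else:
                reward, next_value, done = 0.0, values[next_state], False
            # TD(0) update: move the estimate toward the bootstrapped target.
            td_error = reward + gamma * next_value - values[state]
            values[state] += alpha * td_error
            if done:
                break
            state = next_state
    return values


learned_values = td0_random_walk()
```

For this particular toy chain, the true value of state i is (i + 1) / (num_states + 1), so the learned estimates should end up roughly evenly spaced between 0 and 1.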

For 2, I think a lot of it is finding the "sharp left turn" idea unlikely. I think trying to get agreement on that question would be valuable.

For 4, some of the arguments for it in this post (and comments) may help.

For 3, I'd be interested in there being some more investigation into and explanation of what "interpretability" is supposed to achieve (ideally with some technical desiderata). I think this might end up looking like agency foundations if done right.

For example, I'm particularly interested in how "interpretability" is supposed to work if, in some sense, much of the action of planning and achieving some outcome occurs far away from the code or neural network that played some role in precipitating it. E.g., one NN-based system convinces another more capable system to do something (including figuring out how); or an AI builds some successor AIs that go on to do most of the thinking required to get something done. What should "interpretability" do for us in these cases, assuming we only have access to the local system?

I think the upvotes, without answers, mean that other people are also interested in hearing Nate's clarifications on these questions, particularly #1.

2 is a mixture of both - examples will hopefully come as people comment their disagreements.

Ambitiousness in interpretability can look like greater generalization to never-before-seen architectures, especially automated generalization that doesn't strictly need human intervention. It can also look like robustly being able to use interpretability tools to provide oversight to training, e.g. as "thought assessors." I bet people more focused on interpretability have more ideas.

(Most of the QR-upvotes at the moment are from me. I think 1-4 are all good questions, for Nate or others; but I'm extra excited about people coming up with ideas for 3.)

Thanks for the post, I agree with a lot of it. A few quick comments on your dialogue with imaginary me/Rohin, which highlight the main points of disagreement:

And even if not that-exact-thing, then there are all sorts of ways that some other thing could come out of left field and just render the problem easy. So I don't see why you're worried.

More accurate to say "I don't see why you're so confident". I think I see why you're worried, and I'm worried too for the same reasons. Indeed, I wrote a similar post recently which lists out research directions and reasons why I don't expect them to solve the problem if it turns out to be hard. So in general you should probably put me down as having a reasonable amount of credence (20%?) on your view, but also considering many other possibilities plausible.

Nate: I have considered an array of clever ideas that look to me like they would predictably-to-me fail to solve the problems, and I admit that my guess is that you're putting most of your hope on small clever ideas that I can already see would fail.

The ideas that come out of left field are generally the ones you haven't considered yet, that's what it means for them to come out of left field. I expect that this is frustrating for you to hear, because it seems my position is therefore unfalsifiable, but I don't think it makes much pragmatic difference - I'm not saying we should relax because ideas will come out of left field. I think we should do a better job of looking for them, which involves people aiming more directly at worlds where the problem is hard, for which posts like this one help. I just also think that there's probably more leeway than you think, because I feel pretty uncertain how far past human level a sharp left turn would happen by default.

As the main author of the "Alignment"-appendix of the truthful AI paper, it seems worth clarifying: I totally don't think that "train your AI to be truthful" in itself is a plan for how to tackle any central alignment problems. Quoting from the alignment appendix:

While we’ve argued that scaleable truthfulness would constitute significant progress on alignment (and might provide a solution outright), we don’t mean to suggest that truthfulness will sidestep all difficulties that have been identified by alignment researchers. On the contrary, we expect work on scaleable truthfulness to encounter many of those same difficulties, and to benefit from many of the same solutions.

In other words: I don't think we had a novel proposal for how to make truthful AI systems, which tackled the hard bits of alignment. I just meant to say that the hard bits of making truthful A(G)I are similar to the hard bits of making aligned A(G)I.

At least from my own perspective, the truthful AI paper was partly about AI truthfulness maybe being a neat thing to aim for governance-wise (quite apart from the alignment problem), and partly about the idea that research on AI truthfulness could be helpful for alignment, and so it's good if people (at least/especially people who wouldn't otherwise work on alignment) work on that problem. (As one example of this: Interpretability seems useful for both truthfulness and alignment, so if people work on interpretability intended to help with truthfulness, then this might also be helpful for alignment.)

I don't think you're into this theory of change, because I suspect that you think that anyone who isn't directly aiming at the alignment problem has negligible chance of contributing any useful progress.

I just wanted to clarify that the truthful AI paper isn't evidence that people who try to hit the hard bits of alignment always miss — it's just a paper doing a different thing.

(And although I can't speak as confidently about others' views, I feel like that last sentence also applies to some of the other sections. E.g. Evan's statement, which seems to be about how you get an alignment solution implemented once you have it, and maybe about trying to find desiderata for alignment solutions, and not at all trying to tackle alignment itself. If you want to critique Evan's proposals for how to build aligned AGI, maybe you should look at this list of proposals or this positive case for how we might succeed.)

I really liked this post in that it seems to me to have tried quite seriously to engage with a bunch of other people's research, in a way that I feel like is quite rare in the field, and something I would like to see more of.

One of the key challenges I see for the rationality/AI-Alignment/EA community is the difficulty of somehow building institutions that are not premised on the quality or tractability of their own work. My current best guess is that the field of AI Alignment has made very little progress in the last few years, which is really not what you might think when you observe the enormous amount of talent, funding and prestige flooding into the space, and the relatively constant refrain of "now that we have cutting edge systems to play around with we are making progress at an unprecedented rate".

It is quite plausible to me that technical AI Alignment research is not a particularly valuable thing to be doing right now. I don't think I have seen much progress, and the dynamics of the field seem to be enshrining an expert class that seems almost ontologically committed to believing that the things they are working on must be good and tractable, because their salary and social standing rely on believing that.

This and a few other similar posts last year are the kind of post that helped me come to understand the considerations around this crucial question better, and where I am grateful that Nate, despite having spent a lot of his life on solving the technical AI Alignment problem, is willing to question the tractability of the whole field. This specific post is more oriented around other people's work, though other posts by Nate and Eliezer are also facing the degree to which their past work didn't make the relevant progress they were hoping for.

Hey, thanks for posting this!

And I apologise - I seem to have again failed to communicate what we're doing here :-(

"Get the AI to ask for labels on ambiguous data"

Having the AI ask is a minor aspect of our current methods, that I've repeatedly tried to de-emphasise (though it does turn out to have an unexpected connection with interpretability). What we're trying to do is:

Get the AI to generate candidate extrapolations of its reward data, that include human-survivable candidates.
Select among these candidates to get a human-survivable ultimate reward function.

Possible selection processes include being conservative (see here for how that might work: https://www.lesswrong.com/posts/PADPJ3xac5ogjEGwA/defeating-goodhart-and-the-closest-unblocked-strategy ), asking humans and then extrapolating the process of what human-answering should idealise to (some initial thoughts on this here: https://www.lesswrong.com/posts/BeeirdrMXCPYZwgfj/the-blue-minimising-robot-and-model-splintering), removing some of the candidates on syntactic grounds (e.g. wireheading, which I've written quite a bit on how it might be syntactically defined). There are some other approaches we've been considering, but they're currently under-developed.

But all those methods will fail if the AI can't generate human-survivable extrapolations of its reward training data. That is what we are currently most focused on. And, given our current results on toy models and a recent literature review, my impression is that there has been almost no decent applicable research done in this area to date. Our current results on HappyFaces are a bit simplistic, but, depressingly, they seem to be the best in the world in reward-function-extrapolation (and not just for image classification) :-(
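As a rough illustration of the "being conservative" selection option mentioned above, one simple reading is a maximin rule: given several candidate extrapolations of the reward that all fit the training data, prefer the action whose worst-case score across candidates is highest. This is only a sketch under that assumption and may not match the method in the linked post. The function `conservative_choice`, the action names, and the reward numbers are all made up for the example.

```python
from typing import Callable, Sequence

RewardFn = Callable[[str], float]


def conservative_choice(actions: Sequence[str], candidates: Sequence[RewardFn]) -> str:
    """Return the action with the best worst-case reward across candidates."""
    return max(actions, key=lambda action: min(r(action) for r in candidates))


# Two hypothetical extrapolations that agree on the training data
# but diverge on actions outside the training distribution.
reward_if_humans_matter = {
    "help_people": 1.0,
    "tile_with_smiley_faces": 0.0,
    "do_nothing": 0.2,
}
reward_if_images_matter = {
    "help_people": 0.6,
    "tile_with_smiley_faces": 1.0,
    "do_nothing": 0.2,
}

candidates = [reward_if_humans_matter.__getitem__, reward_if_images_matter.__getitem__]
best_action = conservative_choice(list(reward_if_humans_matter), candidates)
```

Note that this rule can only pick a human-survivable action if at least one such action scores acceptably under every candidate, and it is only meaningful if the candidate set contains a human-survivable extrapolation in the first place.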

Many proposals seem doomed to me because they involve one or multiple steps where they assume a representation, then try to point to robust relations in the representation and hope they'll hold in the territory. This wouldn't be so bad on its own but when pointed to it seems like handwaving happens rather than something more like conceptual engineering. I am relatively more hopeful about John's approach as being one that doesn't fail to halt and catch fire at these underspecified steps in other plans. In other areas like math and physics we try to get the representation to fall out of the model by sufficiently constraining the model. I would prefer to try to pin down a doomed model than stay in hand wave land because at least in the process of pinning down the doomed model you might get reusable pieces for an eventual non doomed model. Was happy about eg quantilizers for basically the same reason.

Partisans of the other "hard problem" are also quick to tell people that the things they call research are not in fact targeting the problem at all. (I wonder if it's something about the name...)

Much like the other hard problem, it's easy to get wrapped up in a particular picture of what properties a solution "must" have, and construct boundaries between your hard problem and all those other non-hard problems.

Turning the universe to diamond is a great example. It's totally reasonable that it could be strictly easier to build an AI to turn the world into diamond than it is to build an AI that is superhuman at doing good things, so that anyone claiming to have ideas about the latter should have even better ideas about the former. But that could also not be the case - the most likely way I see this happening is if solving the hard left turn problem has details that depend on how you want to load the values, and so genuinely hard-problem-addressing work on value learning could nonetheless not be useful for specifying simple goals. (It may only help you get the diamond-universe AI "the hard way" - by doing the entire value learning process except with a different target!)

Like, even simpler than the problem of an AGI that puts two identical strawberries on a plate and does nothing else, is the problem of an AGI that turns as much of the universe as possible into diamonds. This is easier because, while it still requires that we have some way to direct the system towards a concept of our choosing, we no longer require corrigibility. (Also, "diamond" is a significantly simpler concept than "strawberry" and "cellularly identical".)
It seems to me that we have basically no idea how to do this. We can train the AGI to be pretty good at building diamond-like things across a lot of training environments, but once it takes that sharp left turn, by default, it will wander off and do some other thing, like how humans wandered off and invented birth control.

Is there a writeup of where you expect this to fail? I recall this MIRI newsletter but I think it also just asserted it was hard/impossible.

Is the difficulty just in "it's gonna hijack its own reward function?" or is there more to it than that?

There is also the ontology identification problem. The two biggest things are: we don't know how to specify exactly what a diamond is because we don't know the true base level ontology of the universe. We also don't know how diamonds will be represented in the AI's model of the world.

I personally don't expect coding a diamond maximizing AGI to be hard, because I think that diamond is a sufficiently natural concept that doing normal gradient descent will extrapolate in the desired way, without inner alignment failures. If the agent discovers more basic physics, e.g. quarks that exist below the molecular level, "diamond" will probably still be a pretty natural concept, just like how "apple" didn't stop being a useful concept after shifting from Newtonian mechanics to QM.

Of course, concepts such as human values/corrigibility/whatever are a lot more fragile than diamonds, so this doesn't seem helpful for alignment.

(Unsure whether to mark "agree" for the first two paragraphs, or "disagree" for the last line. Leaving this comment instead.)

Hm? It's as Nate says in the quote. It's the same type of problem as humans inventing birth-control out of distribution. If you have an alternative proposal for how to build a diamond-maximizer, you can specify that for a response, but the commonly discussed idea of "train on examples of diamonds" will fail at inner-alignment, and it will just optimize diamonds in a particular setting and then elsewhere do crazy other things that look like all kinds of white noise to you.

Also "expect this to fail" already seems to jump the gun. Who has a proposal for successfully building an AGI that can do this, other than saying gradient-descent will surprise us with one?

I don't think that "evolution -> human values" is the most useful reference class when trying to calibrate our expectations wrt how outer optimization criteria relate to inner objectives. Evolution didn't directly optimize over our goals. It optimized over our learning process and reward circuitry. Once you condition on a particular human's learning process + reward circuitry configuration + the human's environment, you screen off the influence of evolution on that human's goals. So, there are really two areas from which you can draw evidence about inner (mis)alignment:

"evolution's inclusive genetic fitness criteria -> a human's learned values" (as mediated by evolution's influence over the human's learning process + reward circuitry)
"a particular human's learning process + reward circuitry + "training" environment -> the human's learned values"

The relationship we want to make inferences about is:

"a particular AI's learning process + reward function + training environment -> the AI's learned values"

I think that "AI learning -> AI values" is much more similar to "human learning -> human values" than it is to "evolution -> human values". I grant that you can find various dissimilarities between "AI learning -> AI values" and "human learning -> human values". However, I think there are greater dissimilarities between "AI learning -> AI values" and "evolution -> human values". As a result, I think the vast majority of our intuitions regarding the likely outcomes of inner goals versus outer optimization should come from looking at the "human learning -> human values" analogy, not the "evolution -> human values" analogy.

Additionally, I think we have a lot more total empirical evidence from "human learning -> human values" compared to from "evolution -> human values". There are billions of instances of humans, and each of them has a somewhat different learning process / reward circuit configuration / learning environment. Each of them represents a different data point regarding how inner goals relate to outer optimization. In contrast, the human species only evolved once^[1]. Thus, evidence from "human learning -> human values" should account for even more of our intuitions regarding inner goals versus outer optimization than the difference in reference class similarities alone would indicate.

I will grant that the variations between different humans' learning processes / reward circuit configurations / learning environments are "sampling" over a small and restricted portion of the space of possible optimization process trajectories. This limits the strength of any conclusions we can draw from looking at the relationship between human values and human rewards / learning environments. However, I again hold that inferences from "evolution -> human values" suffer from an even more extreme version of this same issue. "Evolution -> human values" represents an even more restricted look at the general space of optimization process trajectories than we get from the observed variations in different humans' learning processes / reward circuit configurations / learning environments.

There are many sources of empirical evidence that can inform our intuitions regarding how inner goals relate to outer optimization criteria. My current (not very deeply considered) estimate of how to weight these evidence sources is roughly:

~66% from "human learning -> human values"
~4% from "evolution -> human values"^[2]
~30% from various other evidence sources, which I won't address further in this comment, on inner goals versus outer criteria:
- economics
- microbial ecology
- politics
- current results in machine learning
- game theory / multi-agent negotiation dynamics

I think that using "human learning -> human values" as our reference class for inner goals versus outer optimization criteria suggests a much more straightforward relationship between the two, as compared to the (lack of a) relationship suggested by "evolution -> human values". Looking at the learning trajectories of individual humans, it seems like the reflectively endorsed extrapolations of a given person's values have a great deal in common with the sorts of experiences they've found rewarding in their lives up to that point in time. E.g., a person who grew up with and displayed affection for dogs probably doesn't want a future totally devoid of dogs, or one in which dogs suffer greatly.

I also think this regularity in inner values is reasonably robust to sharp left turns in capabilities. If you take a human whose outer behavior suggests they like dogs, and give that human very strong capabilities to influence the future, I do not think they are at all likely to erase dogs from existence. And I think this is very robust to the degree of capabilities you give the human. It's probably not as robust to your choice of which specific human to try this with. E.g., many people would screw themselves over with reckless self-modification, given the capability to do so. My point is that higher capabilities alone do not automatically render inner values completely alien to those demonstrated at lower capabilities.

^[1]
You can, of course, try to look at how population genetics relate to learned values to try to get more data from the "evolution -> human values" reference class, but I think most genetic influences on values are mediated by differences in reward circuitry or environmental correlates of genetic variation. So such an investigation probably ends up mostly redundant in light of how the "human learning -> human values" dynamics work out. I don't know how you'd try and back out a useful inference about general inner versus outer relationships (independent from the "human learning -> human values" dynamics) from that mess. In practice, I think the first order evidence from "human learning -> human values" still dominates any evolution-specific inferences you can make here.
^[2]
Even given the arguments in this comment, putting such a low weight on "evolution -> human values" might seem extreme, but I have an additional reason, originally identified by Alex Turner, for further down weighting the evidence from "evolution -> human values". See this document on shard theory and search for "homo inclusive-genetic-fitness-maximus".

The most important claim in your comment is that "human learning → human values" is evidence that solving / preventing inner misalignment is easier than it seems when one looks at it from the "evolution -> human values" perspective. Here's why I disagree:

Evolution optimized humans for an environment very different from what we see today. This implies that humans are operating out-of-distribution. We see evidence of misalignment. Birth control is a good example of this.

A human's environment continually optimizes a human towards a certain objective (that changes given changes in the environment). This human is aligned with the environment's objective in that distribution. Outside that distribution, the human may not be aligned with the objective intended by the environment.

An outer misalignment example of this is a person brought up in a high-trust environment, and then thrown into a low-trust / high-conflict environment. Their habits and tendencies make them an easy mark for predators.

An inner misalignment example of this is a gay male who grows up in an environment hostile to his desires and his identity (but knows of environments where this isn't true). After a few extremely negative reactions to him opening up to people, or expressing his desires, he'll simply decide to present himself as heterosexual and bide his time and gather the power to leave the environment he is in.

One may claim that the previous example somehow doesn't count: since one's sexual orientation is biologically determined (and I'm assuming this to be the case for this example, even if this may not be entirely true), evolution optimized this particular human for being inner misaligned relative to their environment. However, that doesn't weaken this argument: "human learning -> human values" shows a huge amount of evidence of inner misalignment being ubiquitous.

I worry you are being insufficiently pessimistic.

There may not be substantial disagreements here. Do you agree with:

"a particular human's learning process + reward circuitry + "training" environment -> the human's learned values" is more informative about inner-misalignment than the usual "evolution -> human values" (e.g. Two twins could have different life experiences and have different values, or a sociopath may have different reward circuitry which leads to very different values than people with typical reward circuitry even given similar experiences)

The most important claim in your comment is that "human learning → human values" is evidence that inner misalignment is easier than it seems when one looks at it from the "evolution -> human values" perspective.
Here's why I disagree:

I don't know what you mean by "inner misalignment is easier"? Could you elaborate? I don't think you mean "inner misalignment is more likely to happen" because you then go on to explain inner-misalignment & give an example and say "I worry you are being insufficiently pessimistic."

One implication I read was that inner values learned (ie the inner-misaligned values) may scale, which is the opposite prediction usually given. See:

I also think this regularity in inner values is reasonably robust to sharp left turns in capabilities. If you take a human whose outer behavior suggests they like dogs, and give that human very strong capabilities to influence the future, I do not think they are at all likely to erase dogs from existence.

This matches my intuitions.

Do you agree with: “a particular human’s learning process + reward circuitry + “training” environment → the human’s learned values” is more informative about inner-misalignment than the usual “evolution → human values”

What I see is that we are taking two different optimizers applying optimizing pressure on a system (evolution and the environment), and then stating that one optimization provides more information about a property of OOD behavior shift than another. This doesn't make sense to me, particularly since I believe that most people live in environments that are very much "in distribution", and it is difficult for us to discuss misalignment without talking about extreme cases (as I described in the previous comment), or subtle cases (black swans?) that may not seem to matter.

I don’t know what you mean by “inner misalignment is easier”? Could you elaborate? I don’t think you mean “inner misalignment is more likely to happen” because you then go on to explain inner-misalignment & give an example and say “I worry you are being insufficiently pessimistic.”

My bad; I've updated the comment to clarify that I believe Quintin claims that solving / preventing inner misalignment is easier than one would expect given the belief that evolution's failure at inner alignment is the most significant and informative evidence that inner alignment is hard.

One implication I read was that inner values learned (ie the inner-misaligned values) may scale, which is the opposite prediction usually given.

I assume you mean that Quintin seems to claim that inner values learned may be retained with increase in capabilities, and that usually people believe that inner values learned may not be retained with increase in capabilities. I believe so too -- inner values seem to be significantly robust to increase in capabilities, especially since one has the option to deceive. Do people really believe that inner values learned don't scale with an increase in capabilities? Perhaps we are defining inner values differently here.

By inner values, I mean terminal goals. Wanting dogs to be happy is not a terminal goal for most people, and I believe that given enough optimization pressure, the hypothetical dog-lover would abandon this goal to optimize for what their true terminal goal is. Does that mean that with increase in capabilities, people's inner values shift? Not exactly; it seems to me that we were mistaken about people's inner values instead.

This doesn't make sense to me, particularly since I believe that most people live in environments that are very much "in distribution", and it is difficult for us to discuss misalignment without talking about extreme cases (as I described in the previous comment), or subtle cases (black swans?) that may not seem to matter.

I think you're ignoring the [now bolded part] in "a particular human’s learning process + reward circuitry + "training" environment" and just focusing on the environment. Humans very often don't optimize for their reward circuitry in their limbic system. If I gave you a button that killed everyone but maximized your reward circuitry every time you pressed it, most people wouldn't press it (would you?). I do agree that if you pressed the button once, you would then want to press the button again, but not beforehand, which is an inner-misalignment w/ respect to the reward circuitry. Though maybe you'd say the wirehead thing is an extreme case OOD?

By inner values, I mean terminal goals. Wanting dogs to be happy is not a terminal goal for most people, and I believe that given enough optimization pressure, the hypothetical dog-lover would abandon this goal to optimize for what their true terminal goal is.

I agree, but I'm bolding "most people" because you're claiming there exist some people that would retain that value if scaled up(?) I think replace "dog-lover" w/ "family-lover" and there's even more people. But I don't think this is a disagreement between us?

My bad; I've updated the comment to clarify that I believe Quintin claims that solving / preventing inner misalignment is easier than one would expect given the belief that evolution's failure at inner alignment is the most significant and informative evidence that inner alignment is hard.

Oh, I think inner-misalignment w/ respect to the reward circuitry is a good, positive thing that we want, so there's the disconnect (usually misalignment is thought of as bad, and I'm not just mistyping). Human values are formed by inner-misalignment and they have lots of great properties such as avoiding ontological crises, valuing real world things (like diamond maximizer in the OP), and a subset of which cares for all of humanity. We can learn more about this process by focusing more on the "a particular human’s learning process + reward circuitry + "training" environment" part, and less on the evolution part.

If we understand the underlying mechanisms behind human value formation through inner-misalignment w/ respect to the reward circuitry, then we might be able to better develop the theory of learning systems developing values, which includes AGI.

I don't think the usual arguments apply as obviously here. "Maximal Diamond" is much simpler than most other optimization targets. It seems much easier to solve outer-alignment for – Diamond was chosen because it's a really simple molecule configuration to specify, and that just seems to be a pretty different scenario than most of the ones I've seen more detailed arguments for.

I'm partly confused about the phrasing "we have no idea how to do this." (which is stronger than "we don't currently have a plan for how to do this.")

But in the interests of actually trying to answer this sort of thing for myself instead of asking Nate/Eliezer to explain why it doesn't work, let me think through my own proposal of how I'd go about solving the problem, and see if I can think of obvious holes.

Problems currently known to me:

Reward hijacking
Point 19 in List of Lethalities ("there is no known way to use the paradigm of loss functions, sensory inputs, and/or reward inputs, to optimize anything within a cognitive system to point at particular things within the environment").
Ontological updating (i.e. what exactly is a diamond?)
New to me from this post: the most important capabilities advances may come from an inner process that isn't actually coupled to the reinforcement learning system. (I didn't really get this until reading this post and haven't finished thinking through the concept)

Main ingredients I'm imagining: (disclaimer: I'm a layman making a lot of informed guesses, wouldn't be surprised it

First, get a general agent, with limitations to prevent immediate fooming. Get to general intelligence via something like DeepMind's General Agents, this time starting from a language model that benefits from a lot of human concepts. My current belief is that you'd need to solve some major efficiency issues to do this with a reasonable amount of compute. If you have a Jupiter brain (as originally stipulated) I'm not sure it even requires new advances.

(Maybe scrub the language model of all references to ML/programming, initially. They'll be helpful eventually but maybe don't give the AGI a headstart on self-modification.)

Train it to maximize diamonds in many different environments. Starting with (relatively) modest amounts of compute, train it to make diamonds in different simulated and physical worlds. Initially, at subhuman intelligence, the AGI isn't expected to invent its own diamond-making technology. But it's presented with different physical-or-simulated tools that make things similar-but-non-identical to diamonds, and the tools have internal parts it can understand, and it's rewarded for choosing between them accurately. So it has to develop an understanding of physics.

(I think the outer-alignment goal here is to get it to advance at physics faster than self-modification, so that you can force it to learn ontological problems before it could get subverted by them).

Some notable training-sets it needs to include:

digital worlds where the sub-atomic physics is different, such that it learns to preserve the diamond-configuration despite ontological confusion
its ability to parse what's going on in the digital worlds depends on sensors that are present in the digital world (also for physical world), and there are different arrays of sensors in different worlds. It's trained against situations where it has the ability to modify its sensors for simple reward hacking.
eventually it's taught programming/hardware know-how, and put in fairly simplified situations where the solution to its puzzle is to notice that it physically exists, and make changes to its hardware or software, but it doesn't directly hijack its own reward function.

(Note: thinking through all these specific details is pretty helpful for noticing how many steps are involved here. I think for this sort of plan to work you actually need a lot of different puzzles that are designed to be solvable with safe amounts of compute, so it doesn't just bulldoze past your training setup. Designing such puzzles seems pretty time-consuming. In practice I don't expect the Successfully Aligned "murder everyone and make diamonds forever" bot to be completed before "murder everyone and make some Random Other Thing Forever" bot.)

Even though my goal is a murder-bot-that-makes-diamonds-forever, I'm probably coupling all of this with attempts at corrigibility training, dealing with uncertainty, impact tracking, etc., to give myself extra time to notice problems. (i.e. if the machine isn't sure whether the thing it's making is diamond, it makes a little bit first, asks humans to verify that it's diamond, etc. Do similar training on "don't modify the input channel for 'was it actually diamond tho?'")

Assuming those tricks all work and hold up under tons of optimization pressure, this all still leaves us with inner alignment, and point #4 on my list of known-to-me-concerns. "The most important capabilities advances may come from an inner process that isn't actually coupled to the reinforcement learning system."

And... okay actually this is a new thought for me, and I'm not sure how to think about it yet. I can see how it was probably meant to be included in the "confusingly pervasive consequentialism" concept, but I didn't get the "and therefore, impervious to gradient descent" argument till just now.

I'm out of time for now, will think about this more.

I think even without point #4 you don't necessarily get an AI maximizing diamonds. Heuristically, it feels to me like you're bulldozing open problems without understanding them (e.g. ontology identification by training with multiple models of physics, getting it not to reward-hack by explicit training, etc.) all of which are vulnerable to a deceptively aligned model (just wait till you're out of training to reward-hack). Also, every time you say "train it by X so it learns Y" you're assuming alignment (e.g. "digital worlds where the sub-atomic physics is different, such that it learns to preserve the diamond-configuration despite ontological confusion")

IMO shard theory provides a great frame to think about this in, it's a must-read for improving alignment intuitions.

But maybe I just don't understand this proposal yet (and I have had some trouble distilling things I recognize as plans out of Evan's writing, so far).

Maybe this and this will help.

Flagging that I don't think your description of what ELK is trying to do is that accurate, e.g. we explicitly don't think that you can rely on using ELK to ask your AI if it's being deceptive, because it might just not know. In general, we're currently quite comfortable with not understanding a lot of what our AI is "thinking", as long as we can get answers to a particular set of "narrow" questions we think is sufficient to determine how good the consequences of an action are. More in “Narrow” elicitation and why it might be sufficient.

Separately, I think that ELK isn't intended to address the problem you refer to as a "sharp left turn" as I understand it. Vaguely, ELK is intended to be an ingredient in an outer-alignment solution, while it seems like the problem you describe falls roughly into the "inner alignment" camp. More specifically, but still at a high level of gloss, the way I currently see things is:

If you want to train a powerful AI, currently the set of tasks you can train your AI on will, by default, result in your AI murdering you.
Because we currently cannot teach our AIs to be powerful by doing anything except rewarding them for doing things that straightforwardly imply that they should disempower humans, you don't need a "sharp left turn" in order for humanity to end up disempowered.
Given this, it seems like there's still a substantial part of the difficulty of alignment that remains to be solved even if we knew how to cope with the "sharp left turn." That is, even if capabilities were continuous in SGD steps, training powerful AIs would still result in catastrophe.
ELK is intended to be an ingredient in tackling this difficulty, which has been traditionally referred to as "outer alignment."

Even more separately, it currently seems to me like it's very hard to work on the problem you describe while treating other components [like your loss function] like a black box, because my guess is that "outer alignment" solutions need to do non-trivial amounts of "reaching inside the model's head" to be plausible, and a lot of how to ensure capabilities and alignment generalize together is going to depend on details about how you would have prevented it from murdering you in [capabilities continuous with SGD] world.

ELK for learned optimizers has some more details.

I think that the sharp left turn is also relevant to ELK, if it leads to your system not generalizing from "questions humans can answer" to "questions humans can't answer." My suspicion is that our key disagreements with Nate are present in the case of solving ELK and are not isolated to handling high-stakes failures.

(However it's frustrating to me that I can never pin down Nate or Eliezer on this kind of thing, e.g. are they still pessimistic if there were a low-stakes AI deployment in the sense of this post?)

If it helps, I have a discussion of Concept Extrapolation in the context of aligning a real-deal agent-y AGI in §14.4 here.

So far I can’t quite get the whole story to hang together, as you’ll see from that link. But I definitely see it as a “shot on goal”. (Well, at least, I think the broader project / framework is a “shot on goal”. I don’t find the image classification project to be directly addressing any of my most burning questions.)

Curated. I could imagine a world where different people pursue different agendas in a “live and let live” way, with no one wanting to be too critical of anyone else. I think that’s a world where many people could waste a lot of time with nothing prompting them to reconsider. I think posts like this one give us a chance to avoid scenarios like that. And posts like this can spur discussion of the higher-level approaches/intuitions that spawn more object-level research agendas. The top comments here by Paul Christiano, John Wentworth, and others are a great instance of this.

I also kind of like how this just further develops my gears-level understanding of why Nate predicts doom. There’s color here beyond AGI Ruin: List of Lethalities, which I assume captured most of Nate’s pessimism, but in fact I wonder if Nate disagrees with Eliezer and thinks things would be a bunch more hopeful if only people worked on the right stuff (in contrast with thinking the problem is too hard for our civilization).

Lastly I’ll note that I think it’s good that Nate wrote this post even before being confident he could pass other people’s ITT. I’m glad he felt it was okay to be critical (with caveats) even before his criticisms were maximally defensible (e.g. because he thinks he could pass an ITT).

What would make you change your mind about robustness of behavior (or interpretability of internal representations) through the sharp left turn? Or about the existence of such a sharp left turn, as opposed to smooth scaling of ability to learn in-context?

For example, would you change your mind if we found smooth scaling laws for (some good measure of) in-context learning?
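To make "smooth scaling laws" slightly more concrete, here is a minimal sketch of one way such a check could look: fit a power law to a metric across model scales and see how far any single point deviates from the fit. Everything here is an assumption for illustration: the function name `fit_power_law`, the numbers (synthetic, not measurements), and the choice of a power law as the functional form. A smooth fit over observed scales would not by itself rule out a later discontinuity.

```python
import math

def fit_power_law(scales, losses):
    """Least-squares fit of loss ~ a * scale**(-b) in log-log space.

    Returns (a, b, max_abs_log_residual). A large residual flags a point
    that departs from the smooth trend (hypothetical diagnostic).
    """
    xs = [math.log(s) for s in scales]
    ys = [math.log(l) for l in losses]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    max_resid = max(abs(y - (intercept + slope * x)) for x, y in zip(xs, ys))
    return math.exp(intercept), -slope, max_resid

# Synthetic data only: a perfectly smooth power law...
scales = [1e6, 1e7, 1e8, 1e9, 1e10]
smooth = [5.0 * s ** -0.1 for s in scales]
a, b, max_resid = fit_power_law(scales, smooth)

# ...and the same curve with an abrupt drop at the largest scale.
jumpy = smooth[:-1] + [smooth[-1] * 0.3]
_, _, max_resid_jumpy = fit_power_law(scales, jumpy)
```

On the smooth series the residual is at floating-point noise level, while the abrupt drop shows up as a clearly larger residual.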

From A central AI alignment problem: capabilities generalization, and the sharp left turn:

Suppose that the fictional team OpenMind is training up a variety of AI systems, before one of them takes that sharp left turn. Suppose they've put the AI in lots of different video-game and simulated environments, and they've had good luck training it to pursue an objective that the operators described in English. "I don't know what those MIRI folks were talking about; these systems are easy to direct; simple training suffices", they say. At the same time, they apply various training methods, some simple and some clever, to cause the system to allow itself to be removed from various games by certain "operator-designated" characters in those games, in the name of shutdownability. And they use various techniques to prevent it from stripmining in Minecraft, in the name of low-impact. And they train it on a variety of moral dilemmas, and find that it can be trained to give correct answers to moral questions (such as "in thus-and-such a circumstance, should you poison the operator's opponent?") just as well as it can be trained to give correct answers to any other sort of question. "Well," they say, "this alignment thing sure was easy. I guess we lucked out."
Then, the system takes that sharp left turn,^[4]^[5] and, predictably, the capabilities quickly improve outside of its training distribution, while the alignment falls apart.
[...]
[5] "Hold on, isn't this unfalsifiable? Aren't you saying that you're going to continue believing that alignment is hard, even as we get evidence that it's easy?" Well, I contend that "GPT can learn to answer moral questions just as well as it can learn to answer other questions" is not much evidence either way about the difficulty of alignment. I'm not saying we'll get evidence that I'll ignore; I'm naming in advance some things that I wouldn't consider negative evidence (partially in hopes that I can refer back to this post when people crow later and request an update). But, yes, my model does have the inconvenient property that people who are skeptical now, are liable to remain skeptical until it's too late, because most of the evidence I expect to give us advance warning about the nature of the problem is evidence that we've already seen. I assure you that I do not consider this property to be convenient.

As for things that could convince me otherwise: technical understanding of intelligence could undermine my "sharp left turn" model. I could also imagine observing some ephemeral hopefully-I'll-know-it-when-I-see-it capabilities thresholds, without any sharp left turns, that might update me. (Short of "full superintelligence without a sharp left turn", which would obviously convince me but comes too late in the game to shift my attention.)

Just noting that given more recent developments than this post, we should be majorly updating on recent progress towards Andrew Critch's strategy. (Still not more likely than not to succeed, but we still need to assign some Bayes points to Critch, and take some away from Nate.)

Thanks for the post! One narrow point:
You seem to lean at least a bit on the example of 'much like how humans’ sudden explosion of technological knowledge accumulated in our culture rather than our genes, once we turned the corner'. It seems to me that
a. You don't need to go to humans before you get significant accumulation of important cultural knowledge outside genes (e.g. my understanding is that unaccultured chimps die in the wild)
b. the genetic bottleneck is a somewhat weird and contingent feature of animal evolution, and I don't think there's a clear analogy in current LLM ML paradigms

I'm not making any claims about takeoff speeds in models, just saying that I don't think arguments that are based on features that are (maybe) contingent on a genetic bottleneck support the same inference for ML. Can you make the same argument without leaning on the genetic bottleneck, or explain to me why the analogy in fact should hold?

(Is that just because they get attacked and killed by other chimp groups?)

My impression is that they don't have the skills needed for successful foraging. There's a lot of evidence for some degree of cultural accumulation in apes and e.g. macaques. But I haven't looked into this specific claim super closely.

<sociology of AI safety rant>

So, if an Everett-branches traveller told me "well, you know, MIRI folks had the best intentions, but in your branch, they made the field pay attention to unproductive directions, and this made your civilization more confused and alignment harder" and I had to guess "how?", one of the top choices would be ongoing strawmanning and misrepresentation of Eric Drexler's ideas.

</rant>

To me, CAIS thinking seems quite different from the description in the op.

Some statements, without much justification/proof

- Modularity is a pretty powerful principle/law of intelligent systems. If you look around the world, you see modularity everywhere. Actually in my view you will see more of "modularity" than of "rational agency", suggesting gods of modularity are often stronger than gods of rational agency. Modularity would help us a lot, in contrast to integrated homogeneous agents => one of the key directions of AI safety should be figuring out how to summon gods of modularity

- Modularity helps with interpretability; once you have "modules" and "interfaces", you have a much better shot at understanding what's going on by looking at the interfaces. (For intuitive feel: Imagine you want to make plotting and scheming of three people much more legible, and you can impose this constraint: they need to make all communication between them on Slack, which you can read)

- Any winning strategy needs to solve global coordination at some point, otherwise people will just call a service to destroy the world. Solutions of the type "your aligned superintelligent agent takes over the world and coordinates everyone" are dangerous and won't work; a superintelligent agent able to take over the world is something you don't want to bring into existence, and in contrast, you need a security layer to prevent anyone from even attempting that

- There are multiple hard problems; attempting to solve them all at once in one system is not the right tactic. In practice, we want to isolate some problems to separate "modules" or "services" - for example, we want a separate "research and development service", "security service", "ontology mapping service",...

- Many hard problems don't disappear, but there are also technical solutions for them [e.g. distillation]

Anyway, now turning to your discussion of ELK in particular.

Your first problem is that the recent capabilities gains made by the AGI might not have come from gradient descent (much like how humans’ sudden explosion of technological knowledge accumulated in our culture rather than our genes, once we turned the corner). You might not be able to just "expose the bad behavior" to gradients that you can hit to correct the thing, at least not easily and quickly.

I often think and write about other places where capabilities may come from that could challenge our basic alignment plan. Four particularly salient examples:

Your AI might perform search internally, e.g. looking for hypotheses that match the data or for policies that work well.
Natural selection may occur internally, e.g. cognitive patterns that acquire power might tend to dominate the behavior of your AI (despite the AI having no explicit prediction that they would work well).
Your AI might reason about how to think better, e.g. select cognitive actions based on anticipated consequences of those cognitive actions.
Our AI might deploy new algorithms that pose their own alignment risk for different (potentially unanticipated) reasons.

Some of these represent real problems, but none of them seem to fundamentally change the game or be deal-breakers:

Aligning the internal search seems very similar to aligning SGD on the outside. We could distinguish two additional difficulties in this case:
1. Because the search is on the inside, we can't directly apply our alignment insights to align it. Instead we need to ensure that SGD learns to align the search. This itself poses two difficulties: (a) the outer gradient needs to incentivize doing this, (b) we need to argue that it's nearly as easy for SGD to learn the aligned search as to learn the unaligned search (or build scaffolding such that it becomes similarly easy). This is what we're talking about in this appendix, and it's part of why we are skeptical about approaches to ELK based on simple regularizers. But we don't see a reason that either (a) or (b) would be a dealbreaker, and we tentatively think our current approaches to ontology identification would at least solve (a) if they were successful at all. It's pretty hard to talk about (b) without having more clarity about what the alignment scheme actually looks like, but we don't see an in-principle reason it's hard.
2. The internal search algorithm may not be SGD, and perhaps our alignment strategy was specific to some detail of SGD. But SGD appears to be amongst the hardest search algorithms, and ARC tries to pursue approaches that work for other algorithms rather than leveraging anything about SGD in particular. We're definitely in the market for other search algorithms that cause trouble but don't yet know of any.
Natural selection on the inside is similar but potentially more tricky, because the optimizer has more limited control over how this search works. This is like the analog of memetic selection being smarter than humans and eventually overpowering or hijacking human consequentialism. Another extreme example is that it seems like a large enough neural network may be catastrophically misaligned at initialization simply because of selection amongst activation patterns within a single forward pass. Ultimately we'd like to handle this in exactly the same way that we handle the last point, by some combination of (a) we can just directly apply the same hope from the previous section even to natural selection, (b) we can run explicit searches that are more powerful than implicit search by natural selection within our model, which requires ensuring that our explicit learned search captures whatever is good about natural selection (this seems tough but not at all obviously impossible to me). It's hard to talk about option (a) without seeing if/how we solve the problem from the last point. We could definitely work on option (b) now, and a large enough ARC would be working on it, but it seems like a relatively low priority since it's both very remote from existing systems and seems relatively unlikely-to-me to be the simplest place where we get stuck.
If our model is selecting cognitive actions, or designing new algorithms, then our core hope is that an aligned model will try to think in an aligned way. So if we've been succeeding at alignment so far then the model will be trying to stay aligned. By analogy, if humans cared about the amount of human DNA in the universe, then to the extent that cultural evolution was guided by human consequentialism (rather than e.g. being memetic selection), we would be trying to develop cultural machinery that was helpful for maximizing the amount of human DNA in the universe.
One way this can go wrong is if our model wants to stay aligned but fails, e.g. because it identifies new techniques for thinking that themselves pose new alignment difficulties (just as we desire human flourishing but may instead implement AI systems that want paperclips). I think this is a real problem, but there are a lot of reasons I don't consider it an existential challenge for our approach:
1. If you've succeeded at alignment so far, then your AI will also consider this a problem and will be trying to solve it. I think we should relate to our AI, discovering new ways to think that might pose new alignment difficulties, in the same way that we relate to future humans who may encounter alignment difficulties. The AI may solve the problem, or may implement policy solutions, or etc., and our role is to set them up for success just like we are trying to set up future humans for success. AI compresses the timescale both for "new AI algorithms with new alignment problems" but also for all of the solutions to those problems, so I don't think it changes the game from future humans. And so I'd focus on prosaic AI alignment for exactly the same reasons I focus on prosaic AI alignment when trying to help future humans succeed at alignment.
2. I think that we should be considering the particular algorithms that might pose a new alignment problem, and trying to solve alignment for each of them. If we have some general reason to think that new algorithms will be much harder than old algorithms, or that lessons won't transfer, then we can discuss those and whether they should affect research prioritization. So far I don't think we have such arguments, and so I think we should just be looking for algorithms that might pose problems. (I don't actually think that's the highest priority, because prosaic ML so obviously poses problems, and the other problems we see seem so closely analogous to the ones posed by prosaic ML. But I'm certainly in the market for other problems and think that a large enough research community should already be actively looking for them.)

Your second problem is that the AGI's concepts might rapidly get totally uninterpretable to your ELK head. Like, you could imagine doing neuroimaging on your mammals all the way through the evolution process. They've got some hunger instincts in there, but it's not like they’re smart enough yet to represent the concept of "inclusive genetic fitness" correctly, so you figure you'll just fix it when they get capable enough to understand the alternative (of eating because it's instrumentally useful for procreation). And so far you're doing great: you've basically decoded the visual cortex, and have a pretty decent understanding of what it's visualizing.

Analogously, your ELK head's abilities are liable to fall off a cliff right as the AGI's capabilities start generalizing way outside of its training distribution.

And if they don't, then this ELK head is (in this hypothetical) able to decode and understand the workings of an alien mind. Likely a kludgey behemoth of an alien mind. This itself is liable to require quite a lot of capability, quite plausibly of the sort that humanity gets first from the systems that took sharp left-turns, rather than systems that ground along today's scaling curves until they scaled that far.

^{^}
I'd like to head off a possible response you might make that I disagree with: "Sure your algorithm works for any example you can write down, but the whole point is that you need it to work for alien cognition, where humans don't understand why it works. So of course it works on concrete examples but not in the unknown real world." I'm putting this in a footnote because it seems like a digression and I have no idea if this is your view.
My main response is that we can in fact talk about concrete examples where "why your AI system's cognition works" isn't accessible to humans in the relevant ways:
- We can consider tricky facts we understand about how to reason, for which our discovery of those facts is empirically contingent (and where discovering those facts is harder than discovering the reasons itself). Then we can consider whether our AI alignment strategies would work even if humans hadn't figured out the relevant facts about reasoning.
- We can consider AI cognition which is contingent on hypothesized unknown-to-human facts, e.g. about the causal structure of reality, or about key facts about mathematics, or whatever else.
- Most of our ELK approaches don't make no-holds-barred use of "can a human come up with some story about why this AI cognition may work," and so this just isn't a particularly salient threshold anyway. As a silly example, if you were solving this problem with a speed prior (or indeed with any of the approaches in the regularization section of the ELK document) you wouldn't expect a particular key threshold at the space of strategies that a human understands.

I think a productive way forward (when working on alignment or on other research problems) is to try to identify the hardest concrete difficulties we can understand, then try to make progress on them. This involves acknowledging that we can't anticipate all possible problems, but expecting that solving the concrete problems is a useful way to make steps forward and learn general lessons. It involves solving individual challenges, even if none of them will address the whole problem, and even if we have a vague sense that further difficulties will arise.

It means not becoming too pessimistic about a direction until we see fairly concretely where it's stuck, partially because we hope that zooming in on a very concrete case where you get stuck is the main way to eventually make progress.

Also, a separate issue with this: it sounds like this will systematically generate strategies which ignore unknown unknowns. It's like the exact opposite of security mindset.

You write:

But pretty quickly, we usually see intuitively-similar bottlenecks coming up again and again.

I don't yet have this sense about a "sharp left turn" bottleneck.

Is there some reason to expect that always working on the legible parts of a problem will somehow induce progress on the illegible parts, even when making-the-illegible-parts-legible is itself "the hard part"?

Also, a separate issue with this: it sounds like this will systematically generate strategies which ignore unknown unknowns. It's like the exact opposite of security mindset.

Cryptographer: It seems like our existing proposals for "secure" communication are still vulnerable to man-in-the-middle attacks. Better infrastructure for key distribution is one way to overcome this particular attack, so let's try to improve that. We can also see how this might fit in with the rest of our security infrastructure to help build to a secure internet, though no doubt the details will change.
Cryptography skeptic: The real difficulty isn't man-in-the-middle attacks, it's that security is really hard. By focusing on concrete stuff like man-in-the-middle you are overlooking the real nature of the problem, focusing on the known risks rather than the unknown unknowns. Someone with a true security mindset wouldn't be fiddling around the edges like this.

This was a good reply, I basically buy it. Thanks.

I don't feel like this is right (though I think this duality feels like a real thing that is important sometimes and is interesting to think about, so appreciated the comment).

I think that the disagreement is more about what kind of concreteness is possible or desirable in this domain.

Yeah, my comment was sloppily phrased; I agree with "I think that the disagreement is more about what kind of concreteness is possible or desirable in this domain."

If you follow his strategy, then you can spend arbitrarily long trying to find a faithful concrete operationalization of a part of the problem that doesn't exist.

I don't think that's how this works? The strategy I'm recommending explicitly contains two parts where we gain evidence about whether a part of the problem actually exists:

noticing an intuitive pattern in the failure-modes of some strategies
attempting to formalize (which presumably includes backpropagating our mathematics into our intuitions)

Imaginary John: Well, uh, these days I'm mostly focusing on using my flimsy non-mastered grasp of the common-concept format to try to give a descriptive account of human values, because for some reason that's where I think the hope is. So I'm not actually working too much on this thing that you think takes a swing at the real problem (although I do flirt with it occasionally).

... which, amusingly, looks like a much more ambitious version of interpretability work.

My guess at part of your views:

There's ~one natural structure for capabilities, such that (assuming we don't have deep mastery of intelligence) nearly anything we build that is an AGI will have that structure.
Given this, there will be a point where an AI system switches from everything-muddled-in-a-soup to clean capabilities and muddled alignment (the "sharp left turn").

(For the reader: I am not saying "we're screwed if the sharp left turn happens so we should ignore it", I am saying that the sharp left turn is unlikely.)

Some commentary on the conversation with me:

Imaginary Richard/Rohin: You seem awfully confident in this sharp left turn thing. And that the goals it was trained for won't just generalize. This seems characteristically overconfident.

For instance, observe that natural selection didn't try to get the inner optimizer to be aligned with inclusive genetic fitness at all. For all we know, a small amount of cleverness in exposing inner-misaligned behavior to the gradients will just be enough to fix the problem.

Good description of why I don't find the evolution analogy compelling for "sharp left turn is very likely".

And even if not that-exact-thing, then there are all sorts of ways that some other thing could come out of left field and just render the problem easy. So I don't see why you're worried.

Nate: My model says that the hard problem rears its ugly head by default, in a pretty robust way. Clever ideas might suffice to subvert the hard problem (though my guess is that we need something more like understanding and mastery, rather than just a few clever ideas). I have considered an array of clever ideas that look to me like they would predictably-to-me fail to solve the problems, and I admit that my guess is that you're putting most of your hope on small clever ideas that I can already see would fail. But perhaps you have ideas that I do not. Do you yourself have any specific ideas for tackling the hard problem?
Imaginary Richard/Rohin: Train it, while being aware of inner alignment issues, and hope for the best.

Nate: That doesn't seem to me to even start to engage with the issue where the capabilities fall into an attractor and the alignment doesn't.

Yup, agreed.

Perhaps sometime we can both make a list of ways to train with inner alignment issues in mind, and then share them with each other, so that you can see whether you think I'm lacking awareness of some important tool you expect to be at our disposal, and so that I can go down your list and rattle off the reasons why the proposed training tools don't look to me like they result in alignment that is robust to sharp left turns. (Or find one that surprises me, and update.) But I don't want to delay this post any longer, so, some other time, maybe.

"Your first problem is that the recent capabilities gains made by the AGI might not have come from gradient descent". This is something that comes up in response to a few of the plans. Is the idea that during training, for advanced enough AIs, capabilities gains come from gradient descent and also from processing input / interacting with the world? Or does the second part only happen after it has finished training? What does that concretely look like in ML?
Is a lot of the disagreement about these plans just because others find the idea of a "sharp left turn" less likely than Nate does, or is there more agreement about that idea but the disagreement is about what proposals might give us a shot at solving it?
What might an ambitious interpretability agenda focused on the sharp left turn and the generalization problem look like besides just trying harder at interpretability?
Another explanation of the "sharp left turn" would also be really helpful to me. At the moment, it feels like I can only explain why that happens by using analogies to humans/apes rather than being able to give a clear explanation for why we should expect that by default, using ML/alignment language.

What might an ambitious interpretability agenda focused on the sharp left turn and the generalization problem look like besides just trying harder at interpretability?

Some key pieces...

For 2, I think a lot of it is finding the "sharp left turn" idea unlikely. I think trying to get agreement on that question would be valuable.

For 4, some of the arguments for it in this post (and comments) may help.

I think the upvotes, without answers, mean that other people are also interested in hearing Nate's clarifications on these questions, particularly #1.

2 is a mixture of both - examples will hopefully come as people comment their disagreements.

(Most of the QR-upvotes at the moment are from me. I think 1-4 are all good questions, for Nate or others; but I'm extra excited about people coming up with ideas for 3.)

Thanks for the post, I agree with a lot of it. A few quick comments on your dialogue with imaginary me/Rohin, which highlight the main points of disagreement:

And even if not that-exact-thing, then there are all sorts of ways that some other thing could come out of left field and just render the problem easy. So I don't see why you're worried.

Nate: I have considered an array of clever ideas that look to me like they would predictably-to-me fail to solve the problems, and I admit that my guess is that you're putting most of your hope on small clever ideas that I can already see would fail.

While we’ve argued that scaleable truthfulness would constitute significant progress on alignment (and might provide a solution outright), we don’t mean to suggest that truthfulness will sidestep all difficulties that have been identified by alignment researchers. On the contrary, we expect work on scaleable truthfulness to encounter many of those same difficulties, and to benefit from many of the same solutions.

I just wanted to clarify that the truthful AI paper isn't evidence that people who try to hit the hard bits of alignment always miss — it's just a paper doing a different thing.

Hey, thanks for posting this!

And I apologise - I seem to have again failed to communicate what we're doing here :-(

"Get the AI to ask for labels on ambiguous data"

Get the AI to generate candidate extrapolations of its reward data, that include human-survivable candidates.
Select among these candidates to get a human-survivable ultimate reward function.

Partisans of the other "hard problem" are also quick to tell people that the things they call research are not in fact targeting the problem at all. (I wonder if it's something about the name...)

Like, even simpler than the problem of an AGI that puts two identical strawberries on a plate and does nothing else, is the problem of an AGI that turns as much of the universe as possible into diamonds. This is easier because, while it still requires that we have some way to direct the system towards a concept of our choosing, we no longer require corrigibility. (Also, "diamond" is a significantly simpler concept than "strawberry" and "cellularly identical".)
It seems to me that we have basically no idea how to do this. We can train the AGI to be pretty good at building diamond-like things across a lot of training environments, but once it takes that sharp left turn, by default, it will wander off and do some other thing, like how humans wandered off and invented birth control.

Is there a writeup of where you expect this to fail? I recall this MIRI newsletter but I think it also just asserted it was hard/impossible.

Is the difficulty just in "it's gonna hijack its own reward function?" or is there more to it than that?

Of course, concepts such as human values/corrigibility/whatever are a lot more fragile than diamonds, so this doesn't seem helpful for alignment.

(Unsure whether to mark "agree" for the first two paragraphs, or "disagree" for the last line. Leaving this comment instead.)

Also "expect this to fail" already seems to jump the gun. Who has a proposal for successfully building an AGI that can do this, other than saying gradient-descent will surprise us with one?

"evolution's inclusive genetic fitness criteria -> a human's learned values" (as mediated by evolution's influence over the human's learning process + reward circuitry)
"a particular human's learning process + reward circuitry + "training" environment -> the human's learned values"

The relationship we want to make inferences about is:

"a particular AI's learning process + reward function + training environment -> the AI's learned values"

~66% from "human learning -> human values"
~4% from "evolution -> human values"^[2]
~30% from various other evidence sources, which I won't address further in this comment, on inner goals versus outer criteria:
- economics
- microbial ecology
- politics
- current results in machine learning
- game theory / multi-agent negotiation dynamics

^[1]
You can, of course, try to look at how population genetics relate to learned values to try to get more data from the "evolution -> human values" reference class, but I think most genetic influences on values are mediated by differences in reward circuitry or environmental correlates of genetic variation. So such an investigation probably ends up mostly redundant in light of how the "human learning -> human values" dynamics work out. I don't know how you'd try and back out a useful inference about general inner versus outer relationships (independent from the "human learning -> human values" dynamics) from that mess. In practice, I think the first order evidence from "human learning -> human values" still dominates any evolution-specific inferences you can make here.
^[2]
Even given the arguments in this comment, putting such a low weight on "evolution -> human values" might seem extreme, but I have an additional reason, originally identified by Alex Turner, for further down weighting the evidence from "evolution -> human values". See this document on shard theory and search for "homo inclusive-genetic-fitness-maximus".

I worry you are being insufficiently pessimistic.

There may not be substantial disagreements here. Do you agree with:

The most important claim in your comment is that "human learning → human values" is evidence that inner misalignment is easier than it seems when one looks at it from the "evolution -> human values" perspective.
Here's why I disagree:

I also think this regularity in inner values is reasonably robust to sharp left turns in capabilities. If you take a human whose outer behavior suggests they like dogs, and give that human very strong capabilities to influence the future, I do not think they are at all likely to erase dogs from existence.

This matches my intuitions.

Do you agree with: “a particular human’s learning process + reward circuitry + “training” environment → the human’s learned values” is more informative about inner-misalignment than the usual “evolution → human values”?

I don’t know what you mean by “inner misalignment is easier”? Could you elaborate? I don’t think you mean “inner misalignment is more likely to happen” because you then go on to explain inner-misalignment & give an example and say “I worry you are being insufficiently pessimistic.”

One implication I read was that inner values learned (ie the inner-misaligned values) may scale, which is the opposite prediction usually given.

This doesn't make sense to me, particularly since I believe that most people live in environments that are very much "in distribution", and it is difficult for us to discuss misalignment without talking about extreme cases (as I described in the previous comment), or subtle cases (black swans?) that may not seem to matter.

By inner values, I mean terminal goals. Wanting dogs to be happy is not a terminal goal for most people, and I believe that given enough optimization pressure, the hypothetical dog-lover would abandon this goal to optimize for what their true terminal goal is.

My bad; I've updated the comment to clarify that I believe Quintin claims that solving / preventing inner misalignment is easier than one would expect given the belief that evolution's failure at inner alignment is the most significant and informative evidence that inner alignment is hard.

I'm partly confused about the phrasing "we have no idea how to do this." (which is stronger than "we don't currently have a plan for how to do this.")

Problems currently known to me:

Reward hijacking
Point 19 in List of Lethalities ("there is no known way to use the paradigm of loss functions, sensory inputs, and/or reward inputs, to optimize anything within a cognitive system to point at particular things within the environment").
Ontological updating (i.e. what exactly is a diamond?)
New to me from this post: the most important capabilities advances may come from an inner process that isn't actually coupled to the reinforcement learning system. (I didn't really get this until reading this post and haven't finished thinking through the concept)

Main ingredients I'm imagining: (disclaimer: I'm a layman making a lot of informed guesses, wouldn't be surprised it

(Maybe scrub the language model of all references to ML/programming, initially. They'll be helpful eventually but maybe don't give the AGI a headstart on self-modification.)

(I think the outer-alignment goal here is to get it to advance at physics faster than self-modification, so that you can force it to learn ontological problems before it could get subverted by them).

Some notable training-sets it needs to include:

digital worlds where the sub-atomic physics is different, such that it learns to preserve the diamond-configuration despite ontological confusion
its ability to parse what's going on in the digital worlds depends on sensors that are present in the digital world (also for physical world), and there are different arrays of sensors in different worlds. It's trained against situations where it has the ability to modify its sensors for simple reward hacking.
eventually it's taught programming/hardware know-how, and put in fairly simplified situations where the solution to its puzzle is to notice that it physically exists, and make changes to its hardware or software, but it doesn't directly hijack its own reward function.

I'm out of time for now, will think about this more.

IMO shard theory provides a great frame to think about this in, it's a must-read for improving alignment intuitions.

But maybe I just don't understand this proposal yet (and I have had some trouble distilling things I recognize as plans out of Evan's writing, so far).

Maybe this and this will help.

If you want to train a powerful AI, currently the set of tasks you can train your AI on will, by default, result in your AI murdering you.
Because we currently cannot teach our AIs to be powerful by doing anything except rewarding them for doing things that straightforwardly imply that they should disempower humans, you don't need a "sharp left turn" in order for humanity to end up disempowered.
Given this, it seems like there's still a substantial part of the difficulty of alignment that remains to be solved even if we knew how to cope with the "sharp left turn." That is, even if capabilities were continuous in SGD steps, training powerful AIs would still result in catastrophe.
ELK is intended to be an ingredient in tackling this difficulty, which has been traditionally referred to as "outer alignment."

ELK for learned optimizers has some more details.

(However, it's frustrating to me that I can never pin down Nate or Eliezer on this kind of thing, e.g. would they still be pessimistic if there were a low-stakes AI deployment in the sense of this post?)

If it helps, I have a discussion of Concept Extrapolation in the context of aligning a real-deal agent-y AGI in §14.4 here.

For example, would you change your mind if we found smooth scaling laws for (some good measure of) in-context learning?

From A central AI alignment problem: capabilities generalization, and the sharp left turn:

Suppose that the fictional team OpenMind is training up a variety of AI systems, before one of them takes that sharp left turn. Suppose they've put the AI in lots of different video-game and simulated environments, and they've had good luck training it to pursue an objective that the operators described in English. "I don't know what those MIRI folks were talking about; these systems are easy to direct; simple training suffices", they say. At the same time, they apply various training methods, some simple and some clever, to cause the system to allow itself to be removed from various games by certain "operator-designated" characters in those games, in the name of shutdownability. And they use various techniques to prevent it from stripmining in Minecraft, in the name of low-impact. And they train it on a variety of moral dilemmas, and find that it can be trained to give correct answers to moral questions (such as "in thus-and-such a circumstance, should you poison the operator's opponent?") just as well as it can be trained to give correct answers to any other sort of question. "Well," they say, "this alignment thing sure was easy. I guess we lucked out."
Then, the system takes that sharp left turn,^[4]^[5] and, predictably, the capabilities quickly improve outside of its training distribution, while the alignment falls apart.
[...]
[5] "Hold on, isn't this unfalsifiable? Aren't you saying that you're going to continue believing that alignment is hard, even as we get evidence that it's easy?" Well, I contend that "GPT can learn to answer moral questions just as well as it can learn to answer other questions" is not much evidence either way about the difficulty of alignment. I'm not saying we'll get evidence that I'll ignore; I'm naming in advance some things that I wouldn't consider negative evidence (partially in hopes that I can refer back to this post when people crow later and request an update). But, yes, my model does have the inconvenient property that people who are skeptical now, are liable to remain skeptical until it's too late, because most of the evidence I expect to give us advance warning about the nature of the problem is evidence that we've already seen. I assure you that I do not consider this property to be convenient.

As for things that could convince me otherwise: technical understanding of intelligence could undermine my "sharp left turn" model. I could also imagine observing some ephemeral hopefully-I'll-know-it-when-I-see-it capabilities thresholds, without any sharp left turns, that might update me. (Short of "full superintelligence without a sharp left turn", which would obviously convince me but comes too late in the game to shift my attention.)

(Is that just because they get attacked and killed by other chimp groups?)
