Rob Bensinger

Communications @ MIRI. Unless otherwise indicated, my posts and comments here reflect my own views, and not necessarily my employer's. (Though we agree about an awful lot.)


2022 MIRI Alignment Discussion
2021 MIRI Conversations

Wiki Contributions

Load More


In the context of a conversation with Balaji Srinivasan about my AI views snapshot, I asked Nate Soares what sorts of alignment results would impress him, and he said:

example thing that would be relatively impressive to me: specific, comprehensive understanding of models (with the caveat that that knowledge may lend itself more (and sooner) to capabilities before alignment). demonstrated e.g. by the ability to precisely predict the capabilities and quirks of the next generation (before running it)

i'd also still be impressed by simple theories of aimable cognition (i mostly don't expect that sort of thing to have time to play out any more, but if someone was able to come up with one after staring at LLMs for a while, i would at least be impressed)

fwiw i don't myself really know how to answer the question "technical research is more useful than policy research"; like that question sounds to me like it's generated from a place of "enough of either of these will save you" whereas my model is more like "you need both"

tho i'm more like "to get the requisite technical research, aim for uploads" at this juncture

if this was gonna be blasted outwards, i'd maybe also caveat that, while a bunch of this is a type of interpretability work, i also expect a bunch of interpretability work to strike me as fake, shallow, or far short of the bar i consider impressive/hopeful

(which is not itself supposed to be any kind of sideswipe; i applaud interpretability efforts even while thinking it's moving too slowly etc.)

Some of Nate’s quick thoughts (paraphrased), after chatting with him:

Nate isn’t trying to say that we have literally zero understanding of deep nets. What he’s trying to do is qualitatively point to the kind of high-level situation we’re in, in part because he thinks there is real interpretability progress, and when you’re working in the interpretability mines and seeing real advances it can be easy to miss the forest for the trees and forget how far we are from understanding what LLMs are doing. (Compared to, e.g., how well we can predict or post-facto-mechanistically-explain a typical system humans have engineered.)

Nobody's been able to call the specific capabilities of systems in advance. Nobody's been able to call the specific exploits in advance. Nobody's been able to build better cognitive algorithms by hand after understanding how the AI does things we can't yet code by hand. There is clearly some other level of understanding that is possible that we lack, and that we once sought, and that only the interpretability folks continue to seek.

E.g., think of that time Neel Nanda figured out how a small transformer does modular arithmetic (AXRP episode). If nobody had ever thought of that algorithm for an adder, we would have thereby learned a new algorithm for an adder. There are things that these AI systems are doing that aren’t just lots of stuff we know; there are levels of organization of understanding that give you the ability to predict how things work outside of the bands where we’ve observed them.

It seems trendy to declare that they never existed in the first place and that that’s all white tower stuff, but Nate thinks this point of view is missing a pretty important and central thread.

The missing thread isn’t trivial to put into words, but it includes things like: 

  • This sounds like the same sort of thing some people would say if they were staring at computer binary for the first time and didn't know about the code behind the scenes: "We have plenty of understanding beyond just how the CPU handles instructions; we understand how memory caching works and we have recognized patterns like the stack and the heap; talking as if there's some deeper level of organization is talking like a theorist when in fact this is an engineering problem." Those types of understanding aren't false, but they aren't the sort of understanding of someone who has comprehended the codebase they're looking at.
  • There are, predictably, things to learn here; the messiness and complexity of the real world doesn’t mean we already know the relevant principles. You don't need to understand everything about how a bird works in order to build an airplane; there are compressible principles behind how birds fly; if you understand what's going on you can build flying devices that have significantly more carrying capacity than a bird, and this holds true even if the practical engineering of an airplane requires a bunch of trial and error and messy engineering work.
  • A mind’s causal structure is allowed to be complicated; we can see the weights, but we don’t thereby have a mastery of the high-level patterns. In the case of humans, neuroscience hasn’t actually worked to give us a mastery of the high-level patterns the human brain is implementing.
  • Mystery is in the map, not in the territory; reductionism works. Not all sciences that can exist, already exist today.

Possibly the above pointers are only useful if you already grok the point we’re trying to make, and isn’t so useful for communicating a new idea; but perhaps not.

using GOFAI methods

"Nope" to this part. I otherwise like this comment a lot!

The main thing I'm claiming is that MIRI said it would be hard to specify (for example, write into a computer) an explicit function that reflects the human value function with high fidelity, in the sense that judgements from this function about the value of outcomes fairly accurately reflect the judgements of ordinary humans. I think this is simply a distinct concept from the idea of getting an AI to understand human values. 

The key difference is the transparency and legibility of how the values are represented: if you solve the problem of value specification/value identification, that means you have an actual function that can tell you the value of any outcome. If you get an AI that merely understands human values, you can't necessarily use the AI to determine the value of any outcome, because, for example, the AI might lie to you, or simply stay silent.

Ah, this is helpful clarification! Thanks. :)

I don't think MIRI ever considered this an important part of the alignment problem, and I don't think we expect humanity to solve lots of the alignment problem as a result of having such a tool; but I think I better understand now why you think this is importantly different from "AI ever gets good at NLP at all".

don't know if your essay is the source of the phrase or whether you just titled it

I think I came up with that particular phrase (though not the idea, of course).

Nate and Eliezer have already made some of the high-level points I wanted to make, but they haven't replied to a lot of the specific examples and claims in the OP, and I see some extra value in doing that. (Like, if you think Eliezer and Nate are being revisionist in their claims about what past-MIRI thought, then them re-asserting "no really, we used to believe X!" is less convincing than my responding in detail to the specific quotes Matt thinks supports his interpretation, while providing examples of us saying the opposite.)

However, I distinctly recall MIRI people making a big deal about the value identification problem (AKA the value specification problem)

The Arbital page for "value identification problem" is a three-sentence stub, I'm not exactly sure what the term means on that stub (e.g., whether "pinpointing valuable outcomes to an advanced agent" is about pinpointing them in the agent's beliefs or in its goals), and the MIRI website gives me no hits for "value identification".

As for "value specification", the main resource where MIRI talks about that is, where we introduce the problem by saying:

A highly-reliable, error-tolerant agent design does not guarantee a positive impact; the effects of the system still depend upon whether it is pursuing appropriate goals.

A superintelligent system may find clever, unintended ways to achieve the specific goals that it is given. Imagine a superintelligent system designed to cure cancer which does so by stealing resources, proliferating robotic laboratories at the expense of the biosphere, and kidnapping test subjects: the intended goal may have been “cure cancer without doing anything bad,” but such a goal is rooted in cultural context and shared human knowledge.

It is not sufficient to construct systems that are smart enough to figure out the intended goals. Human beings, upon learning that natural selection “intended” sex to be pleasurable only for purposes of reproduction, do not suddenly decide that contraceptives are abhorrent. While one should not anthropomorphize natural selection, humans are capable of understanding the process which created them while being completely unmotivated to alter their preferences. For similar reasons, when developing AI systems, it is not sufficient to develop a system intelligent enough to figure out the intended goals; the system must also somehow be deliberately constructed to pursue them (Bostrom 2014, chap. 8).

So I don't think we've ever said that an important subproblem of AI alignment is "make AI smart enough to figure out what goals humans want"?

for example in this 2016 talk from Yudkowsky.

[footnote:] More specifically, in the talk, at one point Yudkowsky asks "Why expect that [alignment] is hard?" and goes on to tell a fable about programmers misspecifying a utility function, which then gets optimized by an AI with disastrous consequences. My best interpretation of this part of the talk is that he's saying the value identification problem is one of the primary reasons why alignment is hard. However, I encourage you to read the transcript yourself if you are skeptical of my interpretation.

I don't see him saying anywhere "the issue is that the AI doesn't understand human goals". In fact, the fable explicitly treats the AGI as being smart enough to understand English and have reasonable English-language conversations with the programmers:

With that said: What if programmers build an artificial general intelligence to optimize for smiles? Smiles are good, right? Smiles happen when good things happen.

During the development phase of this artificial general intelligence, the only options available to the AI might be that it can produce smiles by making people around it happy and satisfied. The AI appears to be producing beneficial effects upon the world, and it is producing beneficial effects upon the world so far.

Now the programmers upgrade the code. They add some hardware. The artificial general intelligence gets smarter. It can now evaluate a wider space of policy options—not necessarily because it has new motors, new actuators, but because it is now smart enough to forecast the effects of more subtle policies. It says, “I thought of a great way of producing smiles! Can I inject heroin into people?” And the programmers say, “No! We will add a penalty term to your utility function for administering drugs to people.” And now the AGI appears to be working great again.

They further improve the AGI. The AGI realizes that, OK, it doesn’t want to add heroin anymore, but it still wants to tamper with your brain so that it expresses extremely high levels of endogenous opiates. That’s not heroin, right?

It is now also smart enough to model the psychology of the programmers, at least in a very crude fashion, and realize that this is not what the programmers want. If I start taking initial actions that look like it’s heading toward genetically engineering brains to express endogenous opiates, my programmers will edit my utility function. If they edit the utility function of my future self, I will get less of my current utility. (That’s one of the convergent instrumental strategies, unless otherwise averted: protect your utility function.) So it keeps its outward behavior reassuring. Maybe the programmers are really excited, because the AGI seems to be getting lots of new moral problems right—whatever they’re doing, it’s working great!

I think the point of the smiles example here isn't "NLP is hard, so we'd use the proxy of smiles instead, and all the issues of alignment are downstream of this"; rather, it's that as a rule, superficially nice-seeming goals that work fine when the AI is optimizing weakly (whether or not it's good at NLP at the time) break down when those same goals are optimized very hard. The smiley example makes this obvious because the goal is simple enough that it's easy for us to see what its implications are; far more complex goals also tend to break down when optimized hard enough, but this is harder to see because it's harder to see the implications. (Which is why "smiley" is used here.)

MIRI people frequently claimed that solving the value identification problem would be hard, or at least non-trivial.[6] For instance, Nate Soares wrote in his 2016 paper on value learning, that "Human preferences are complex, multi-faceted, and often contradictory. Safely extracting preferences from a model of a human would be no easy task."

That link is broken; the paper is The full paragraph here is:

Human preferences are complex, multi-faceted, and often contradictory. Safely extracting preferences from a model of a human would be no easy task. Problems of ontology identification recur here: the framework for extracting preferences and affecting outcome ratings needs to be robust to drastic changes in the learner’s model of the operator. The special-case identification of the “operator model” must survive as the system goes from modeling the operator as a simple reward function to modeling the operator as a fuzzy, ever-changing part of reality built out of biological cells—which are made of atoms, which arise from quantum fields.

Revisiting the Ontology Identification section helps clarify what Nate means by "safely extracting preferences from a model of a human": IIUC, he's talking about a programmer looking at an AI's brain, identifying the part of the AI's brain that is modeling the human, identifying the part of the AI's brain that is "the human's preferences" within that model of a human, and then manually editing the AI's brain to "hook up" the model-of-a-human-preference to the AI's goals/motivations, in such a way that the AI optimizes for what it models the humans as wanting. (Or some other, less-toy process that amounts to the same thing -- e.g., one assisted by automated interpretability tools.)

In this toy example, we can assume that the programmers look at the structure of the initial world-model and hard-code a tool for identifying the atoms within. What happens, then, if the system develops a nuclear model of physics, in which the ontology of the universe now contains primitive protons, neutrons, and electrons instead of primitive atoms? The system might fail to identify any carbon atoms in the new world-model, making the system indifferent between all outcomes in the dominant hypothesis. Its actions would then be dominated by any tiny remaining probabilities that it is in a universe where fundamental carbon atoms are hiding somewhere.


To design a system that classifies potential outcomes according to how much diamond is in them, some mechanism is needed for identifying the intended ontology of the training data within the potential outcomes as currently modeled by the AI. This is the ontology identification problem introduced by de Blanc [2011] and further discussed by Soares [2015].

This problem is not a traditional focus of machine learning work. When our only concern is that systems form better world-models, then an argument can be made that the nuts and bolts are less important. As long as the system’s new world-model better predicts the data than its old world-model, the question of whether diamonds or atoms are “really represented” in either model isn’t obviously significant. When the system needs to consistently pursue certain outcomes, however, it matters that the system’s internal dynamics preserve (or improve) its representation of which outcomes are desirable, independent of how helpful its representations are for prediction. The problem of making correct choices is not reducible to the problem of making accurate predictions.

Inductive value learning requires the construction of an outcome-classifier from value-labeled training data, but it also requires some method for identifying, inside the states or potential states described in its world-model, the referents of the labels in the training data.

As Nate and I noted in other comments, the paper repeatedly clarifies that the core issue isn't about whether the AI is good at NLP. Quoting the paper's abstract:

Even a machine intelligent enough to understand its designers’ intentions would not necessarily act as intended. 

And the lede section:

The novelty here is not that programs can exhibit incorrect or counter-intuitive behavior, but that software agents smart enough to understand natural language may still base their decisions on misrepresentations of their programmers’ intent. The idea of superintelligent agents monomaniacally pursuing “dumb”-seeming goals may sound odd, but it follows from the observation of Bostrom and Yudkowsky [2014, chap. 7] that AI capabilities and goals are logically independent.[1] Humans can fully comprehend that their “designer” (evolution) had a particular “goal” (reproduction) in mind for sex, without thereby feeling compelled to forsake contraception. Instilling one’s tastes or moral values into an heir isn’t impossible, but it also doesn’t happen automatically.

Back to your post:

And to be clear, I don't mean that GPT-4 merely passively "understands" human values. I mean that asking GPT-4 to distinguish valuable and non-valuable outcomes works pretty well at approximating the human value function in practice

I don't think I understand what difference you have in mind here, or why you think it's important. Doesn't "this AI understands X" more-or-less imply "this AI can successfully distinguish X from not-X in practice"?

This fact is key to what I'm saying because it means that, in the near future, we can literally just query multimodal GPT-N about whether an outcome is bad or good, and use that as an adequate "human value function". That wouldn't solve the problem of getting an AI to care about maximizing the human value function, but it would arguably solve the problem of creating an adequate function that we can put into a machine to begin with.

But we could already query the human value function by having the AI system query an actual human. What specific problem is meant to be solved by swapping out "query a human" for "query an AI"?

I interpret this passage as saying that 'the problem' is extracting all the judgements that "you would make", and putting that into a wish. I think he's implying that these judgements are essentially fully contained in your brain. I don't think it's credible to insist he was referring to a hypothetical ideal human value function that ordinary humans only have limited access to, at least in this essay.

Absolutely. But as Eliezer clarified in his reply, the issue he was worried about was getting specific complex content into the agent's goals, not getting specific complex content into the agent's beliefs. Which is maybe clearer in the 2011 paper where he gave the same example and explicitly said that the issue was the agent's "utility function".

For example, a straightforward reading of Nate Soares' 2017 talk supports this interpretation. In the talk, Soares provides a fictional portrayal of value misalignment, drawing from the movie Fantasia. In the story, Mickey Mouse attempts to instruct a magical broom to fill a cauldron, but the broom follows the instructions literally rather than following what Mickey Mouse intended, and floods the room. Soares comments: "I claim that as fictional depictions of AI go, this is pretty realistic.

As I said in another comment:

"Fill the cauldron" examples are examples where the cauldron-filler has the wrong utility function, not examples where it has the wrong beliefs. E.g., this is explicit in 

The idea of the "fill the cauldron" examples isn't "the AI is bad at NLP and therefore doesn't understand what we mean when we say 'fill', 'cauldron', etc." It's "even simple small-scale tasks are unnatural, in the sense that it's hard to define a coherent preference ordering over world-states such that maximizing it completes the task and has no serious negative impact; and there isn't an obvious patch that overcomes the unnaturalness or otherwise makes it predictably easier to aim AI systems at a bounded low-impact task like this". (Including easier to aim via training.)

It's true that 'value is relatively complex' is part of why it's hard to get the right goal into an AGI; but it doesn't follow from this that 'AI is able to develop pretty accurate beliefs about our values' helps get those complex values into the AGI's goals. (It does provide nonzero evidence about how complex value is, but I don't see you arguing that value is very simple in any absolute sense, just that it's simple enough for GPT-4 to learn decently well. Which is not reassuring, because GPT-4 is able to learn a lot of very complicated things, so this doesn't do much to bound the complexity of human value.)

In any case, I take this confusion as evidence that the fill-the-cauldron example might not be very useful. Or maybe all these examples just need to explicitly specify, going forward, that the AI is part-human at understanding English.

Perhaps more important to my point, Soares presented a clean separation between the part where we specify an AI's objectives, and the part where the AI tries to maximizes those objectives. He draws two arrows, indicating that MIRI is concerned about both parts.

Your image isn't displaying for me, but I assume it's this one?


I don't know what you mean by "specify an AI's objectives" here, but the specific term Nate uses here is "value learning" (not "value specification" or "value identification"). And Nate's Value Learning Problem paper, as I noted above, explicitly disclaims that 'get the AI to be smart enough to output reasonable-sounding moral judgments' is a core part of the problem.

He states, "The serious question with smarter-than-human AI is how we can ensure that the objectives we’ve specified are correct, and how we can minimize costly accidents and unintended consequences in cases of misspecification." I believe this quote refers directly to the value identification problem, rather than the problem of getting an AI to care about following the goals we've given it.

The way you quoted this makes it sound like a gloss on the image, but it's actually a quote from the very start of the talk:

The notion of AI systems “breaking free” of the shackles of their source code or spontaneously developing human-like desires is just confused. The AI system is its source code, and its actions will only ever follow from the execution of the instructions that we initiate. The CPU just keeps on executing the next instruction in the program register. We could write a program that manipulates its own code, including coded objectives. Even then, though, the manipulations that it makes are made as a result of executing the original code that we wrote; they do not stem from some kind of ghost in the machine.

The serious question with smarter-than-human AI is how we can ensure that the objectives we’ve specified are correct, and how we can minimize costly accidents and unintended consequences in cases of misspecification. As Stuart Russell (co-author of Artificial Intelligence: A Modern Approach) puts it:

The primary concern is not spooky emergent consciousness but simply the ability to make high-quality decisions. Here, quality refers to the expected outcome utility of actions taken, where the utility function is, presumably, specified by the human designer. Now we have a problem:

1. The utility function may not be perfectly aligned with the values of the human race, which are (at best) very difficult to pin down.

2. Any sufficiently capable intelligent system will prefer to ensure its own continued existence and to acquire physical and computational resources – not for their own sake, but to succeed in its assigned task. [...]

I wouldn't read too much into the word choice here, since I think it's just trying to introduce the Russell quote, which is (again) explicitly about getting content into the AI's goals, not about getting content into the AI's beliefs.

(In general, I think the phrase "value specification" is sort of confusingly vague. I'm not sure what the best replacement is for it -- maybe just "value loading", following Bostrom? -- but I suspect MIRI's usage of it has been needlessly confusing. Back in 2014, we reluctantly settled on it as jargon for "the part of the alignment problem that isn't subsumed in getting the AI to reliably maximize diamonds", because this struck us as a smallish but nontrivial part of the problem; but I think it's easy to read the term as referring to something a lot more narrow.)

The point of "the genie knows but doesn't care" wasn't that the AI would take your instructions, know what you want, and yet disobey the instructions because it doesn't care about what you asked for. If you read Rob Bensinger's essay carefully, you'll find that he's actually warning that the AI will care too much about the utility function you gave it, and maximize it exactly, against your intentions[10].

Yep -- I think I'd have endorsed claims like "by default, a baby AGI won't share your values even if it understands them" at the time, but IIRC the essay doesn't make that point explicitly, and some of the points it does make seem either false (wait, we're going to be able to hand AGI a hand-written utility function? that's somehow tractable?) or confusingly written. (Like, if my point was 'even if you could hand-write a utility function, this fails at point X', I should have made that 'even if' louder.)

Some MIRI staff liked that essay at the time, so I don't think it's useless, but it's not the best evidence: I wrote it not long after I first started learning about this whole 'superintelligence risk' thing, and I posted it before I'd ever worked at MIRI.

Straw-EY: Complexity of value means you can't just get the make-AI-care part to happen by chance; it's a small target.

Straw-MB: Ok but now we have a very short message pointing to roughly human values: just have a piece of code that says "and now call GPT and ask it what's good". So now it's a very small number of bits.

To which I say: "dial a random phone number and ask the person who answers what's good" can also be implemented with a small number of bits. In order for GPT-4 to be a major optimistic update about alignment, we need some specific way to leverage GPT-4 to crack open part of the alignment problem, even though we presumably agree that phone-a-friend doesn't crack open part of the alignment problem. (Nor does phone-your-neighborhood-moral-philosopher, or phone-Paul-Christiano.)

But if you had asked us back then if a superintelligence would automatically be very good at predicting human text outputs, I guarantee we would have said yes. [...] I wish that all of these past conversations were archived to a common place, so that I could search and show you many pieces of text which would talk about this critical divide between prediction and preference (as I would now term it) and how I did in fact expect superintelligences to be able to predict things!

Quoting myself in April:

"MIRI's argument for AI risk depended on AIs being bad at natural language" is a weirdly common misunderstanding, given how often we said the opposite going back 15+ years.

E.g., Nate Soares in 2016:


Or Eliezer Yudkowsky in 2008, critiquing his own circa-1997 view "sufficiently smart AI will understand morality, and therefore will be moral": 


(The response being, in short: "Understanding morality doesn't mean that you're motivated to follow it.")

It was claimed by @perrymetzger that makes a load-bearing "AI is bad at NLP" assumption.

But the same example in (2011) explicitly says that the challenge is to get the right content into a utility function, not into a world-model:


The example does build in the assumption "this outcome pump is bad at NLP", but this isn't a load-bearing assumption. If the outcome pump were instead a good conversationalist (or hooked up to one), you would still need to get the right content into its goals.

It's true that Eliezer and I didn't predict AI would achieve GPT-3 or GPT-4 levels of NLP ability so early (e.g., before it can match humans in general science ability), so this is an update to some of our models of AI.

But the specific update "AI is good at NLP, therefore alignment is easy" requires that there be an old belief like "a big part of why alignment looks hard is that we're so bad at NLP".

It should be easy to find someone at MIRI like Eliezer or Nate saying that in the last 20 years if that was ever a belief here. Absent that, an obvious explanation for why we never just said that is that we didn't believe it!

Found another example: MIRI's first technical research agenda, in 2014, went out of its way to clarify that the problem isn't "AI is bad at NLP".


That makes sense, but I say in the post that I think we will likely have a solution to the value identification problem that's "about as good as human judgement" in the near future.

We already have humans who are smart enough to do par-human moral reasoning. For "AI can do par-human moral reasoning" to help solve the alignment problem, there needs to be some additional benefit to having AI systems that can match a human (e.g., some benefit to our being able to produce enormous numbers of novel moral judgments without relying on an existing text corpus or hiring thousands of humans to produce them). Do you have some benefit in mind?

Basically, I think your later section--"Maybe you think"--is pointing in the right direction, and requiring a much higher standard than human-level at moral judgment is reasonable and consistent with the explicit standard set by essays by Yudkowsky and other MIRI people. CEV was about this; talk about philosophical competence or metaphilosophy was about this. "Philosophy with a deadline" would be a weird way to put it if you thought contemporary philosophy was good enough.

I don't think this is the crux. E.g., I'd wager the number of bits you need to get into an ASI's goals in order to make it corrigible is quite a bit smaller than the number of bits required to make an ASI behave like a trustworthy human, which in turn is way way smaller than the number of bits required to make an ASI implement CEV.

The issue is that (a) the absolute number of bits for each of these things is still very large, (b) insofar as we're training for deep competence and efficiency we're training against corrigibility (which makes it hard to hit both targets at once), and (c) we can't safely or efficiently provide good training data for a lot of the things we care about (e.g., 'if you're a superintelligence operating in a realistic-looking environment, don't do any of the things that destroy the world').

None of these points require that we (or the AI) solve novel moral philosophy problems. I'd be satisfied with an AI that corrigibly built scanning tech and efficient computing hardware for whole-brain emulation, then shut itself down; the AI plausibly doesn't even need to think about any of the world outside of a particular room, much less solve tricky questions of population ethics or whatever.

My own suggestion would be to use a variety of different phrasings here, including both "capabilities" and "intelligence", and also "cognitive ability", "general problem-solving ability", "ability to reason about the world", "planning and inference abilities", etc. Using different phrases encourages people to think about the substance behind the terminology -- e.g., they're more likely to notice their confusion if the stuff you're saying makes sense to them under one of the phrasings you're using, but doesn't make sense to them under another of the phrasings.

Phrases like "cognitive ability" are pretty important, I think, because they make it clearer why these different "capabilities" often go hand-in-hand. It also clarifies that the central problems are related to minds / intelligence / cognition / etc., not (for example) the strength of robotic arm, even though that too is a "capability".

Load More