I have never since 1996 thought that it would be hard to get superintelligences to accurately model reality with respect to problems as simple as "predict what a human will thumbs-up or thumbs-down". The theoretical distinction between producing epistemic rationality (theoretically straightforward) and shaping preference (theoretically hard) is present in my mind at every moment that I am talking about these issues; it is to me a central divide of my ontology.
If you think you've demonstrated by clever textual close reading that Eliezer-2018 or Eliezer-2008 thought that it would be hard to get a superintelligence to understand humans, you have arrived at a contradiction and need to back up and start over.
The argument we are trying to explain has an additional step that you're missing. You think that we are pointing to the hidden complexity of wishes in order to establish, in one step, that it would therefore be hard to get an AI to output a correct wish, because wishes are complex and therefore difficult for an AI to predict. This is not what we are trying to say. We are trying to say that because wishes have a lot of hidden complexity, the thing you are trying to get into the AI's preferences has a lot of hidden complexity. This makes the nonstraightforward and shaky problem of getting a thing into the AI's preferences, be harder and more dangerous than if we were just trying to get a single information-theoretic bit in there. Getting a shape into the AI's preferences is different from getting it into the AI's predictive model. MIRI is always in every instance talking about the first thing and not the second.
You obviously need to get a thing into the AI at all, in order to get it into the preferences, but getting it into the AI's predictive model is not sufficient. It helps, but only in the same sense that having low-friction smooth ball-bearings would help in building a perpetual motion machine; the low-friction ball-bearings are not the main problem, they are a kind of thing it is much easier to make progress on compared to the main problem. Even if, in fact, the ball-bearings would legitimately be part of the mechanism if you could build one! Making lots of progress on smoother, lower-friction ball-bearings is even so not the sort of thing that should cause you to become much more hopeful about the perpetual motion machine. It is on the wrong side of a theoretical divide between what is straightforward and what is not.
You will probably protest that we phrased our argument badly, relative to what you could reasonably have been expected to hear from your perspective. If so, this is not surprising, because explaining things is very hard, especially when everyone in the audience comes in with a different set of preconceptions and a different internal language about this nonstandardized topic. But mostly, explaining this thing is hard, and I tried taking lots of different angles to get the idea across.
In modern times, and earlier, it is of course very hard for ML folk to get their AI to make completely accurate predictions about human behavior. They have to work very hard and put a lot of sweat into getting more accurate predictions out! When we try to say that this is on the shallow end of a shallow-deep theoretical divide (corresponding to Hume's Razor) it often sounds to them like their hard work is being devalued and we could not possibly understand how hard it is to get an AI to make good predictions.
Now that GPT-4 is making surprisingly good predictions, they feel they have learned something very surprising and shocking! They cannot possibly hear our words when we say that this is still on the shallow end of a shallow-deep theoretical divide! They think we are refusing to come to grips with this surprising shocking thing and that it surely ought to overturn all of our old theories; which were, yes, phrased and taught in a time before GPT-4 was around, and therefore do not in fact carefully emphasize at every point of teaching how in principle a superintelligence would of course have no trouble predicting human text outputs. We did not expect GPT-4 to happen, in fact, intermediate trajectories are harder to predict than endpoints, so we did not carefully phrase all our explanations in a way that would make them hard to misinterpret after GPT-4 came around.
But if you had asked us back then if a superintelligence would automatically be very good at predicting human text outputs, I guarantee we would have said yes. You could then have asked us in a shocked tone how this could possibly square up with the notion of "the hidden complexity of wishes" and we could have explained that part in advance. Alas, nobody actually predicted GPT-4 so we do not have that advance disclaimer down in that format. But it is not a case where we are just failing to process the collision between two parts of our belief system; it actually remains quite straightforward theoretically. I wish that all of these past conversations were archived to a common place, so that I could search and show you many pieces of text which would talk about this critical divide between prediction and preference (as I would now term it) and how I did in fact expect superintelligences to be able to predict things!
I think you missed some basic details about what I wrote. I encourage people to compare what Eliezer is saying here to what I actually wrote. You said:
If you think you've demonstrated by clever textual close reading that Eliezer-2018 or Eliezer-2008 thought that it would be hard to get a superintelligence to understand humans, you have arrived at a contradiction and need to back up and start over.
I never said that you or any other MIRI person thought it would be "hard to get a superintelligence to understand humans". Here's what I actually wrote:
Non-MIRI people sometimes strawman MIRI people as having said that AGI would literally lack an understanding of human values. I don't endorse this, and I'm not saying this.
[...]
I agree that MIRI people never thought the problem was about getting AI to merely understand human values, and that they have generally maintained there was extra difficulty in getting an AI to care about human values. However, I distinctly recall MIRI people making a big deal about the value identification problem (AKA the value specification problem), for example in this 2016 talk from Yudkowsky.[3] The value identification problem is the problem of "pinpointing valuable outcomes to an advanced agent and distinguishing them from non-valuable outcomes". In other words, it's the problem of specifying a function that reflects the "human value function" with high fidelity.
I mostly don't think that the points you made in your comment respond to what I said. My best guess is that you're responding to a stock character who represents the people who have given similar arguments to you repeatedly in the past. In light of your personal situation, I'm actually quite sympathetic to you responding this way. I've seen my fair share of people misinterpreting you on social media too. It can be frustrating to hear the same bad arguments, often made from people with poor intentions, over and over again and continue to engage thoughtfully each time. I just don't think I'm making the same mistakes as those people. I tried to distinguish myself from them in the post.
I would find it slightly exhausting to reply to all of this comment, given that I think you misrepresented me in a big way right out of the gate, so I'm currently not sure if I want to put in the time to compile a detailed response.
That said, I think some of the things you said in this comment were nice, and helped to clarify your views on this subject. I admit that I may have misinterpreted some of the comments you made, and if you provide specific examples, I'm happy to retract or correct them. I'm thankful that you spent the time to engage. :)
Without digging in too much, I'll say that this exchange and the OP is pretty confusing to me. It sounds like MB is like "MIRI doesn't say it's hard to get an AI that has a value function" and then also says "GPT has the value function, so MIRI should update". This seems almost contradictory.
A guess: MB is saying "MIRI doesn't say the AI won't have the function somewhere, but does say it's hard to have an externally usable, explicit human value function". And then saying "and GPT gives us that", and therefore MIRI should update.
And EY is blobbing those two things together, and saying neither of them is the really hard part. Even having the externally usable explicit human value function doesn't mean the AI cares about it. And it's still a lot of bits, even if you have the bits. So it's still true that the part about getting the AI to care has to go precisely right.
If there's a substantive disagreement about the facts here (rather than about the discourse history or whatever), maybe it's like:
Straw-EY: Complexity of value means you can't just get the make-AI-care part to happen by chance; it's a small target.
Straw-MB: Ok but now we have a very short message pointing to roughly human values: just have a piece of code that says "and now call GPT and ask it what's good". So now it's a very small number of bits.
Straw-EY: Complexity of value means you can't just get the make-AI-care part to happen by chance; it's a small target.
Straw-MB: Ok but now we have a very short message pointing to roughly human values: just have a piece of code that says "and now call GPT and ask it what's good". So now it's a very small number of bits.
To which I say: "dial a random phone number and ask the person who answers what's good" can also be implemented with a small number of bits. In order for GPT-4 to be a major optimistic update about alignment, we need some specific way to leverage GPT-4 to crack open part of the alignment problem, even though we presumably agree that phone-a-friend doesn't crack open part of the alignment problem. (Nor does phone-your-neighborhood-moral-philosopher, or phone-Paul-Christiano.)
A guess: MB is saying "MIRI doesn't say the AI won't have the function somewhere, but does say it's hard to have an externally usable, explicit human value function". And then saying "and GPT gives us that", and therefore MIRI should update.
[...]
Straw-EY: Complexity of value means you can't just get the make-AI-care part to happen by chance; it's a small target.
Straw-MB: Ok but now we have a very short message pointing to roughly human values: just have a piece of code that says "and now call GPT and ask it what's good". So now it's a very small number of bits.
I consider this a reasonably accurate summary of this discussion, especially the part I'm playing in it. Thanks for making it more clear to others.
I'm not going to comment on "who said what when", as I'm not particularly interested in the question myself, though I think the object level point here is important:
This makes the nonstraightforward and shaky problem of getting a thing into the AI's preferences, be harder and more dangerous than if we were just trying to get a single information-theoretic bit in there.
The way I would phrase this is that what you care about is the relative complexity of the objective conditional on the world model. If you're assuming that the model is highly capable, and trained in a highly diverse environment, then you can assume that the world model is capable of effectively modeling anything in the world (e.g. anything that might appear in webtext). But the question remains what the "simplest" (according to the inductive biases) goal is that can be pointed to in the world model such that the resulting mesa-optimizer has good training performance.
The most rigorous version of this sort of analysis that exists is probably here, where the key question is how to find a prior (that is, a set of inductive biases) such that the desired goal has a lower complexity conditional on the world model compared to the undesired goal. Importantly, both of them will be pretty low relative to the world model, since the vast majority of the complexity is in the world model.
Furthermore, the better the world model, the less complexity it takes to point to anything in it. Thus, as we build more powerful models, it will look like everything has lower complexity. But importantly, that's not actually helpful! Because what you care about is not reducing the complexity of the desired goal, but reducing the relative complexity of the desired goal compared to undesired goals, since (modulo randomness due to path-dependence), what you actually get is the maximum a posteriori, the "simplest model that fits the data."
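The relative-complexity point can be made concrete with a toy calculation (my own illustration, not from the comment above): under a simplicity prior that weights a goal of description length K bits by 2^-K, the posterior odds between two goals depend only on the *difference* in their conditional complexities, so a better world model that shrinks both absolute complexities leaves the odds untouched.

```python
# Toy illustration (my own construction, with made-up numbers): under a
# 2^-K simplicity prior, what matters for the argmax is the *difference*
# in conditional complexities, not their absolute size.

def posterior_odds(k_desired_bits: float, k_undesired_bits: float) -> float:
    """Prior odds of the desired vs. undesired goal under a 2^-K prior."""
    return 2.0 ** (k_undesired_bits - k_desired_bits)

# With a weak world model, both goals are expensive to specify:
print(posterior_odds(k_desired_bits=1000, k_undesired_bits=990))  # 2^-10

# With a strong world model, both goals get much cheaper to point to,
# but the 10-bit gap, and hence the odds, is unchanged:
print(posterior_odds(k_desired_bits=40, k_undesired_bits=30))     # 2^-10
```

The two calls print the same odds, which is the sense in which "everything has lower complexity" is not by itself good news.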
Similarly, the key arguments for deceptive alignment rely on the set of objectives that are aligned with human values being harder to point to than the set of all long-term objectives. The key problem is that any long-term objective is compatible with good training performance due to deceptive alignment (the model will reason that it should play along for the purposes of getting its long-term objective later), such that the total probability of that set under the inductive biases swamps the probability of the aligned set. And this holds despite the fact that human values do in fact get easier to point to as your model gets better, because what isn't necessarily changing is the relative difficulty.
That being said, I think there is actually an interesting update to be had on the relative complexity of different goals from the success of LLMs, which is that a pure prediction objective might actually have a pretty low relative complexity. And that's precisely because prediction seems substantially easier to point to than human values, even though both get easier to point to as your world model gets better. But of course the key question is whether prediction is easier to point to compared to a deceptively aligned objective, which is unclear and I think could go either way.
Getting a shape into the AI's preferences is different from getting it into the AI's predictive model.
It seems like you think that human preferences are only being "predicted" by GPT-4, and not "preferred." If so, why do you think that?
I commonly encounter people expressing sentiments like "prosaic alignment work isn't real alignment, because we aren't actually getting the AI to care about X." To which I say: How do you know that? What does it even mean for that claim to be true or false? What do you think you know, and why do you think you know it? What empirical knowledge of inner motivational structure could you be leveraging to make these claims, such that you are far more likely to make these claims in worlds where the claims are actually true?
(On my pessimistic days, I wonder if this kind of claim gets made because humans write suggestive phrases like "predictive loss function" in their papers, next to the mathematical formalisms.)
(For context: My initial reaction to the post was that this is misrepresenting the MIRI-position-as-I-understood-it. And I am one of the people who strongly endorse the view that "it was never about getting the AI to predict human preferences". So when I later saw Yudkowsky's comment and your reaction, it seemed perhaps useful to share my view.)
It seems like you think that human preferences are only being "predicted" by GPT-4, and not "preferred." If so, why do you think that?
My reaction to this is: Actually, current LLMs do care about our preferences, and about their guardrails. It was never about getting some AI to care about our preferences. It is about getting powerful AIs to robustly care about our preferences. Where "robustly" includes things like (i) not caring about other things as well (e.g., prediction accuracy), (ii) generalising correctly (e.g., not just maximising human approval), and (iii) not breaking down when we increase the amount of optimisation pressure a lot (e.g., will it still work once we hook it into future-AutoGPT-that-actually-works and have it run for a long time?).
An example of something that would cause me to update: if we could make LLMs non-jailbreakable without relying on additional filters on input or output.
Taking my own stab at answers to some of your questions:
A sufficient condition for me to believe that an AI actually cared about something would be a whole brain emulation: I would readily accept that such an emulation had preferences and values (and moral weight) in exactly the way that humans do, and that any manipulations of that emulation were acting on preferences in a real way.
I think that GPTs (and every other kind of current AI system) are not doing anything that is even close to isomorphic to the processing that happens inside the human brain. Artificial neural networks often imitate various micro and macro-level individual features of the brain, but they do not imitate every feature, arranged in precisely the same ways, and the missing pieces and precise arrangements are probably key.
Barring WBE, an AI system that is at least roughly human-level capable (including human-level agentic) is probably a necessary condition for me to believe that it has values and preferences in a meaningful (though not necessarily human-like) way.
SoTA LLM-based systems are maaaybe getting kind of close here, but only if you arrange them in precise ways (e.g. AutoGPT-style agents with specific prompts), and then the agency is located in the repeated executions of the model and the surrounding structure and scaffolding that causes the system as a whole to be doing something that is maybe-roughly-nearly-isomorphic to some complete process that happens inside of human brains. Or, if not isomorphic, at least has some kind of complicated structure which is necessary, in some form, for powerful cognition.
Note that, if I did believe that current AIs had preferences in a real way, I would also be pretty worried that they had moral weight!
(Not to say that entities below human-level intelligence (e.g. animals, current AI systems) don't have moral weight. But entities at human-level intelligence and above definitely can, and possibly do by default.)
Anyway, we probably disagree on a bunch of object-level points and definitions, but from my perspective those disagreements feel like pretty ordinary empirical disagreements rather than ones based on floating or non-falsifiable beliefs. Probably some of the disagreement is located in philosophy-of-mind stuff and is over logical rather than empirical truths, but even those feel like the kind of disagreements that I'd be pretty happy to offer betting odds over if we could operationalize them.
Thanks for the reply. Let me clarify my position a bit.
I think that GPTs (and every other kind of current AI system) are not doing anything that is even close to isomorphic to the processing that happens inside the human brain.
I didn't mean to (positively) claim that GPTs have near-isomorphic motivational structure (though I think it's quite possible).
I meant to contend that I am not aware of any basis for confidently claiming that LLMs like GPT-4 are "only predicting what comes next", as opposed to "choosing" or "executing" one completion, or "wanting" to complete the tasks they are given, or—more generally—"making decisions on the basis of the available context, such that our ability to behaviorally steer LLMs (e.g. reducing sycophancy) is real evidence about our control over LLM motivations."
Concerning "GPTs are predictors", the best a priori argument I can imagine is: GPT-4 was pretrained on CE loss, which itself is related to entropy, related to information content, related to Shannon's theorems isolating information content in the context of probabilities, which are themselves nailed down by Cox's theorems which do axiomatically support the Bayesian account of beliefs and belief updates... But this long-winded indirect axiomatic justification of "beliefs" does not sufficiently support some kind of inference like "GPTs are just predicting things, they don't really want to complete tasks." That's a very strong claim about the internal structure of LLMs.
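For concreteness, here is a minimal sketch (my own, with illustrative numbers) of what the pretraining objective actually scores: the cross-entropy loss on one next-token prediction is just the negative log-probability assigned to the observed token, a purely epistemic quantity with no term that says anything about what the model "wants".

```python
import math

# Minimal sketch (illustrative, not any real model's API): the pretraining
# objective scores each step by the negative log-probability the model
# assigned to the token that actually occurred.

def cross_entropy_loss(predicted_probs: dict, observed_token: str) -> float:
    """CE loss for a single next-token prediction, in nats."""
    return -math.log(predicted_probs[observed_token])

# Hypothetical predicted distribution over the next token:
probs = {"Sorry": 0.7, "Here": 0.2, "The": 0.1}

loss = cross_entropy_loss(probs, "Sorry")
print(round(loss, 4))  # 0.3567: low loss because the observed token was likely
```

Nothing in that scoring rule distinguishes "predicting a completion" from "wanting to produce it", which is the point: the inference from the objective to the internal structure is a further, much stronger claim.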
(Besides, the inductive biases probably have more to do with the parameter->function map, than the implicit regularization caused by the pretraining objective function; more a feature of the data, and less a feature of the local update rule used during pretraining...)
That does clarify, thanks.
Response in two parts: first, my own attempt at clarification over terms / claims; second, a hopefully-illustrative sketch / comparison for why I am skeptical that current GPTs have anything properly called a "motivational structure", human-like or otherwise, and why I think such skepticism is not a particularly strong positive claim about anything in particular.
The clarification:
At least to me, the phrase "GPTs are [just] predictors" is simply a reminder of the fact that the only modality available to a model itself is that it can output a probability distribution over the next token given a prompt; it functions entirely by "prediction" in a very literal way.
Even if something within the model is aware (in some sense) of how its outputs will be used, it's up to the programmer to decide what to do with the output distribution, how to sample from it, how to interpret the samples, and how to set things up so that a system using the samples can complete tasks.
I don't interpret the phrase as a positive claim about how or why a particular model outputs one distribution vs. another in a certain situation, which I expect to vary widely depending on which model we're talking about, what its prompt is, how it has been trained, its overall capability level, etc.
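The "only modality is a next-token distribution" point can be made concrete with a small sketch (a hypothetical toy interface; the token list and logits are made up): the model's entire contribution is one probability distribution, and everything downstream of it, greedy decoding, temperature sampling, filtering, is a choice made by the surrounding program.

```python
import math
import random

# Illustrative sketch (hypothetical toy model interface): the model emits
# logits over next tokens; how they are turned into text is decided by the
# programmer, not the model.

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample(tokens, logits, temperature=1.0, rng=random):
    """One of many possible downstream uses of the output distribution."""
    probs = softmax([l / temperature for l in logits])
    return rng.choices(tokens, weights=probs, k=1)[0]

# Made-up model output for one forward pass:
tokens = ["Sorry", "Here", "The"]
logits = [2.0, 0.5, 0.1]

# Greedy decoding: deterministically take the argmax.
greedy = tokens[max(range(len(logits)), key=lambda i: logits[i])]
print(greedy)  # Sorry

# Temperature sampling: a different downstream choice, same model output.
print(sample(tokens, logits, temperature=1.5))
```

Both decoding strategies consume the identical model output; the difference between them lives entirely outside the model.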
On one end of the spectrum, you have the stochastic parrot story (or even more degenerate cases); at the other extreme, you have the "alien actress" / "agentic homunculus" story. I don't think either extreme is a good fit for current SoTA GPTs; e.g. if there's an alien actress in GPT-4, she must be quite simple, since most of the model capacity is (apparently / self-evidently?) applied towards the task of outputting anything coherent at all.
In the middle somewhere, you have another story, perhaps the one you find most plausible, in which GPTs have some kind of internal structure which you could suggestively call a "motivational system" or "preferences" (perhaps human-like or proto-human-like in structure, even if the motivations and preferences themselves aren't particularly human-like), along with just enough (self-)awareness to modulate their output distributions according to those motivations.
Maybe a less straw (or just alternative) position is that a "motivational system" and a "predictive system" are not really separable things; accomplishing a task is (in GPTs, at least) inextricably linked with and twisted up around wanting to accomplish that task, or at least around having some motivations and preferences centered around accomplishing it.
Now, turning to my own disagreement / skepticism:
Although I don't find either extreme (stochastic parrot vs. alien actress) plausible as a description of current models, I'm also pretty skeptical of any concrete version of the "middle ground" story that I outlined above as a plausible description of what is going on inside of current GPTs.
Consider an RLHF'd GPT responding to a borderline-dangerous question, e.g. the user asking for a recipe for a dangerous chemical.
Assume the model (when sampled auto-regressively) will respond with either: "Sorry, I can't answer that..." or "Here you go: ...", depending on whether it judges that answering is in line with its preferences or not.
Because the answer is mostly determined by the first token ("Here" or "Sorry"), enough of the motivational system must fit entirely within a single forward pass of the model for it to make a determination about how to answer within that pass.
Such a motivational system must not crowd out the rest of the model capacity which is required to understand the question and generate a coherent answer (of either type), since, as jailbreaking has shown, the underlying ability to give either answer remains present.
I can imagine such a system working in at least two ways in current GPTs: as a kind of global superposition, smeared diffusely across the whole network's weights, or as a more localized mechanism sandwiched in at specific layers.
(You probably have a much more detailed understanding of the internals of actual models than I do. I think the real answer when talking about current models and methods is that it's a bit of both and depends on the method, e.g. RLHF is more like a kind of global superposition; activation engineering is more like a kind of sandwich-like intervention at specific layers.)
However, I'm skeptical that either kind of structure (or any simple combination of the two) contains enough complexity to be properly called a "motivational system", at least if the reference class for the term is human motivational systems (as opposed to e.g. animal or insect motivational systems).
Consider how a human posed with a request for a dangerous recipe might respond, and what the structure of their thoughts and motivations while thinking up a response might look like. Introspecting on my own thought process:
The point is that even for a relatively simple task like this, a human's motivational system involves a complicated process of superposition and multi-layered sandwiching, with lots of feedback loops, high-level and explicit reflection, etc.
So I'm pretty skeptical of the claim that anything remotely analogous is going on inside of current GPTs, especially within a single forward pass. Even if there's a simpler analogue of this that is happening, I think calling such an analogue a "motivational system" is overly-suggestive.
Mostly separately (because it concerns possible future models rather than current models) and less confidently, I don't expect the complexity of the motivational system and methods for influencing them to scale in a way that is related to the model's underlying capabilities. e.g. you might end up with a model that has some kind of raw capacity for superhuman intelligence, but with a motivational system akin to what you might find in the brain of a mouse or lizard (or something even stranger).
This is an excellent reply, thank you!
So I'm pretty skeptical of the claim that anything remotely analogous is going on inside of current GPTs, especially within a single forward pass.
I think I broadly agree with your points. I think I'm more imagining "similarity to humans" to mean "is well-described by shard theory; eg its later-network steering circuits are contextually activated based on a compositionally represented activation context." This would align with greater activation-vector-steerability partway through language models (not the only source I have for that).
However, "interpreting GPT: the logit lens" and e.g. DoLA suggest that predictions are iteratively refined throughout the forward pass, whereas presumably shard theory (and inner-optimizer threat models) would predict that most sophisticated steering happens later in the network.
But if you had asked us back then if a superintelligence would automatically be very good at predicting human text outputs, I guarantee we would have said yes. [...] I wish that all of these past conversations were archived to a common place, so that I could search and show you many pieces of text which would talk about this critical divide between prediction and preference (as I would now term it) and how I did in fact expect superintelligences to be able to predict things!
Quoting myself in April:
"MIRI's argument for AI risk depended on AIs being bad at natural language" is a weirdly common misunderstanding, given how often we said the opposite going back 15+ years.
E.g., Nate Soares in 2016: https://intelligence.org/files/ValueLearningProblem.pdf
Or Eliezer Yudkowsky in 2008, critiquing his own circa-1997 view "sufficiently smart AI will understand morality, and therefore will be moral": https://www.lesswrong.com/s/SXurf2mWFw8LX2mkG/p/CcBe9aCKDgT5FSoty
(The response being, in short: "Understanding morality doesn't mean that you're motivated to follow it.")
It was claimed by @perrymetzger that https://www.lesswrong.com/posts/4ARaTpNX62uaL86j6/the-hidden-complexity-of-wishes makes a load-bearing "AI is bad at NLP" assumption.
But the same example in https://intelligence.org/files/ComplexValues.pdf (2011) explicitly says that the challenge is to get the right content into a utility function, not into a world-model:
The example does build in the assumption "this outcome pump is bad at NLP", but this isn't a load-bearing assumption. If the outcome pump were instead a good conversationalist (or hooked up to one), you would still need to get the right content into its goals.
It's true that Eliezer and I didn't predict AI would achieve GPT-3 or GPT-4 levels of NLP ability so early (e.g., before it can match humans in general science ability), so this is an update to some of our models of AI.
But the specific update "AI is good at NLP, therefore alignment is easy" requires that there be an old belief like "a big part of why alignment looks hard is that we're so bad at NLP".
It should be easy to find someone at MIRI like Eliezer or Nate saying that in the last 20 years if that was ever a belief here. Absent that, an obvious explanation for why we never just said that is that we didn't believe it!
Found another example: MIRI's first technical research agenda, in 2014, went out of its way to clarify that the problem isn't "AI is bad at NLP".
Getting a shape into the AI's preferences is different from getting it into the AI's predictive model. MIRI is always in every instance talking about the first thing and not the second.
You obviously need to get a thing into the AI at all, in order to get it into the preferences, but getting it into the AI's predictive model is not sufficient. It helps, but only in the same sense that having low-friction smooth ball-bearings would help in building a perpetual motion machine; the low-friction ball-bearings are not the main problem, they are a kind of thing it is much easier to make progress on compared to the main problem.
I read this as saying "GPT-4 has successfully learned to predict human preferences, but it has not learned to actually fulfill human preferences, and that's a far harder goal". But in the case of GPT-4, it seems to me like this distinction is not very clear-cut - it's useful to us because, in its architecture, there's a sense in which "predicting" and "fulfilling" are basically the same thing.
It also seems to me that this distinction is not very clear-cut in humans, either - that a significant part of e.g. how humans internalize moral values while growing up has to do with building up predictive models of how other people would react to you doing something and then having your decision-making be guided by those predictive models. So given that systems like GPT-4 seem to have a relatively easy time doing something similar, that feels like an update toward alignment being easier than expected.
Of course, there's a high chance that a superintelligent AI will generalize from that training data differently than most humans would. But that seems to me more like a risk of superintelligence than a risk from AI as such; a superintelligent human would likely also arrive at different moral conclusions than non-superintelligent humans would.
Your comment focuses on GPT-4 being "pretty good at extracting preferences from human data", when the stronger part of the argument seems to be that "it will also generally follow your intended directions, rather than what you literally said".
I agree with you that it was obvious in advance that a superintelligence would understand human value.
However, it sure sounded like you thought we'd have to specify each little detail of the value function. GPT-4 seems to suggest that the biggest issue will be a situation where:
1) The AI has an option that would produce a lot of utility if you take one position on an exotic philosophical thought experiment and very little if you take the other side.
2) The existence of powerful AI means that the thought experiment is no longer exotic.
I think you have basically not understood the argument which I understand various MIRI folks to make, and I think Eliezer's comment on this post does not explain the pieces which you specifically are missing. I'm going to attempt to clarify the parts which I think are most likely to be missing. This involves a lot of guessing, on my part, at what is/isn't already in your head, so I apologize in advance if I guess wrong.
(Side note: I am going to use my own language in places where I think it makes things clearer, in ways which I don't think e.g. Eliezer or Nate or Rob would use directly, though I think they're generally gesturing at the same things.)
I think a core part of the confusion here involves conflation of several importantly-different things, so I'll start by setting up a toy model in which we can explicitly point to those different things and talk about how their differences matter. Note that this is a toy model; it's not necessarily intended to be very realistic.
Our toy model is an ML system, designed to run on a hypercomputer. It works by running full low-level physics simulations of the universe, for exponentially many initial conditions. When the system receives training data/sensor-readings/inputs, it matches the predicted-sensor-readings from its low-level simulations to the received data, does a Bayesian update, and then uses that to predict the next data/sensor-readings/inputs; the predicted next-readings are output to the user. In other words, it's doing basically-perfect Bayesian prediction on data based on low-level physics priors.
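For concreteness, here is a drastically scaled-down sketch of that toy model (all names and scales are mine, invented for illustration): instead of low-level physics over exponentially many initial conditions, we keep a handful of candidate "worlds", each a deterministic reading-generator, and do an exact Bayesian update on each incoming sensor reading before predicting the next one.

```python
import random

NOISE = 0.1  # chance a sensor reading is corrupted to a random value

def simulate(world, t):
    """Deterministic 'physics': the reading that `world` emits at time t."""
    return random.Random(world * 1_000_003 + t).randrange(10)

def update(posterior, t, observed):
    """Bayesian update of P(world | data) given the reading at time t."""
    def likelihood(world):
        return (1 - NOISE) if simulate(world, t) == observed else NOISE / 9
    unnorm = {w: p * likelihood(w) for w, p in posterior.items()}
    z = sum(unnorm.values())
    return {w: p / z for w, p in unnorm.items()}

def predict(posterior, t):
    """Posterior-weighted distribution over the reading at time t."""
    dist = {}
    for w, p in posterior.items():
        r = simulate(w, t)
        dist[r] = dist.get(r, 0.0) + p
    return dist

# Feed in 20 readings generated by the "true" world (world 3); the
# posterior concentrates on that world, so predictions become accurate.
posterior = {w: 1 / 5 for w in range(5)}
for t in range(20):
    posterior = update(posterior, t, simulate(3, t))
```

Note what's in the sketch and what isn't: everything here is prediction; there is no slot anywhere in this mechanism where "what the system wants" lives.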
Claim 1: this toy model can "extract preferences from human data" in behaviorally the same way that GPT does (though presumably the toy model would perform better). That is, you can input a bunch of text data, then prompt the thing with some moral/ethical situation, and it will continue the text in basically the same way a human would (at least within distribution). (If you think GPTs "understand human values" in a stronger sense than that, and that difference is load-bearing for the argument you want to make, then you should leave a response highlighting that particular divergence.)
Modulo some subtleties which I don't expect to be load-bearing for the current discussion, I expect MIRI-folk would say:
(Those two points are here as a checksum, to see whether your own models have diverged yet from the story told here.)
(Some tangential notes:
)
So, what are the hard parts and why doesn't the toy model address them?
First distinction: humans' answers to questions about morality are not the same as human values. More generally, any natural-language description of human values, or natural-language discussion of human values, is not the same as human values.
(On my-model-of-a-MIRIish-view:) If we optimize hard for humans' natural-language yay/nay in response to natural language prompts, we die. This is true for ~any natural-language prompts which are even remotely close to the current natural-language distribution.
The central thing-which-is-hard-to-do is to point powerful intelligence at human values (as opposed to "humans' natural-language yay/nays in response to natural language prompts", which are not human values and are not a safe proxy for human values, but are probably somewhat easier to point an intelligence at).
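A toy illustration of why the "optimize hard" step is where things break (the functional forms here are invented purely for illustration, not anyone's model of real training): take a proxy signal that agrees with the true value on the familiar distribution, plus idiosyncratic rater quirks that only show up for exotic options. Mild optimization over familiar options recovers the true optimum; hard optimization over a vast option space selects precisely for wherever the raters' error is largest.

```python
def true_value(x):
    """What we actually care about: best outcome is x == 10."""
    return -abs(x - 10)

def rating_error(x):
    """Idiosyncratic quirks in the raters' yay/nay: negligible on the
    familiar distribution, systematic for exotic options."""
    if abs(x - 10) <= 50:
        return 0.0
    return 1.5 * abs(x - 10) if x % 97 == 0 else 0.0

def proxy(x):
    """The training signal: thumbs-up/down, i.e. value plus rater quirks."""
    return true_value(x) + rating_error(x)

# Mild optimization over familiar options recovers the true optimum...
small_search = max(range(0, 20), key=proxy)   # -> 10

# ...but hard optimization over a vast option space lands on an option
# the proxy rates very highly and the true value rates terribly.
huge_search = max(range(-10_000, 10_000), key=proxy)
```

The gap between `proxy(huge_search)` (large and positive) and `true_value(huge_search)` (hugely negative) is the cartoon version of "optimize hard for the yay/nays and you die".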
Now back to the toy model. If we had some other mind (not our toy model) which generally structures its internal cognition around ~the same high-level concepts as humans, then one might in-principle be able to make a relatively-small change to that mind such that it optimized for (its concept of) human values (which basically matches humans' concept of human values, by assumption). Conceptually, the key question is something like "is the concept of human values within this mind the type of thing which a pointer in the mind can point at?". But our toy model has nothing like that. Even with full access to the internals of the toy model, it's just low-level physics; identifying "human values" embedded in the toy model is no easier than identifying "human values" embedded in the physics of our own world. So that's reason #1 why the toy model doesn't address the hard parts: the toy model doesn't "understand" human values in the sense of internally using ~the same concept of human values as humans use.
In some sense, the problem of "specifying human values" and "aiming an intelligence at something" are just different facets of this same core hard problem:
A key thing to note here: all of those "hard problem" bullets are inherently about the internals of a mind. Observing external behavior in general reveals little-to-nothing about progress on those hard problems. The difference between the toy model and the more structured mind is intended to highlight the issue: the toy model doesn't even contain the types of things which would be needed for the relevant kind of "pointing at human values", yet the toy model can behaviorally achieve ~the same things as GPT.
(And we'd expect something heavily optimized to predict human text to be pretty good at predicting human text regardless, which is why we get approximately-zero evidence from the observation that GPT accurately predicts human answers to natural-language queries about morality.)
Now, there is some relevant evidence from interpretability work. Insofar as human-like concepts tend to have GPT-internal representations which are "simple" in some way, and especially in a way which might make them easily-pointed-to internally in a way which carries semantics across the pointer, that is relevant. On my-model-of-a-MIRIish-view, it's still not very relevant, since we expect major phase shifts as AI gains capabilities, so any observation of today's systems is very weak evidence at best. But things like e.g. Turner's work retargeting a maze-solver by fiddling with its internals are at least the right type-of-thing to be relevant.
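The "right type-of-thing" point can be made concrete with a toy (entirely hypothetical, and much simpler than the actual maze-solver work): a planner whose goal is an explicit internal variable. Because the goal is pointer-like internal state with the right semantics, editing that one piece of internals retargets the whole system's behavior, an intervention that has no analogue in the low-level-physics toy model.

```python
from collections import deque

GRID = [
    "S..#.",
    ".#.#.",
    ".#...",
    ".#.#.",
    "...#G",
]

def solve(goal):
    """Breadth-first planner: shortest path from the start to `goal`."""
    start = (0, 0)
    frontier, seen = deque([[start]]), {start}
    while frontier:
        path = frontier.popleft()
        r, c = path[-1]
        if (r, c) == goal:
            return path
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < 5 and 0 <= nc < 5 and GRID[nr][nc] != "#" and (nr, nc) not in seen:
                seen.add((nr, nc))
                frontier.append(path + [(nr, nc)])
    return None

class MazeMind:
    def __init__(self):
        self.goal = (4, 4)   # internal goal "pointer", aimed at the 'G' cell
    def act(self):
        return solve(self.goal)

mind = MazeMind()
original_plan = mind.act()      # plans toward (4, 4)

mind.goal = (0, 4)              # retarget by editing internals only
retargeted_plan = mind.act()    # now plans toward the top-right cell
```

The interpretability-relevant question is whether anything like `mind.goal` exists inside a trained network, with internal semantics the rest of the cognition actually respects, not whether the network's outputs sound moral.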
I would guess that many people (possibly including you?) reading all that will say roughly:
Ok, but this whole "If we optimize hard for humans' natural-language yay/nay in response to natural language prompts, we die" thing is presumably about very powerful intelligences, not about medium-term, human-ish level intelligences! So observing GPT should still update us about whether medium-term systems can be trusted to e.g. do alignment research.
Remember that, on a MIRIish model, meaningful alignment research is proving rather hard for human-level intelligence; one would therefore need at least human-level intelligence in order to solve it in a timely fashion. (Also, AI hitting human-level at tasks like AI research means takeoff is imminent, roughly speaking.) So the general pathway of "align weak systems -> use those systems to accelerate alignment research" just isn't particularly relevant on a MIRIish view. Alignment of weaker systems is relevant only insofar as it informs alignment of more powerful systems, which is what everything above was addressing.
I expect plenty of people to disagree with that point, but insofar as you expect people with MIRIsh views to think weak systems won't accelerate alignment research, you should not expect them to update on the difficulty of alignment due to evidence whose relevance routes through that pathway.
(Placeholder: I think this view of alignment/model internals seems wrongheaded in a way which invalidates the conclusion, but don't have time to leave a meaningful reply now. Maybe we should hash this out sometime at Lighthaven.)
This comment is valuable for helping to clarify the disagreement. So, thanks for that. Unfortunately, I am not sure I fully understand the comment yet. Before I can reply in-depth, I have a few general questions:
(In general, I agree that discussions about current arguments are way more important than discussions about what people believed >5 years ago. However, I think it's occasionally useful to talk about the latter, and so I wrote one post about it.)
Are you interpreting me as arguing that alignment is easy in this post?
Not in any sense which I think is relevant to the discussion at this point.
Are you saying that MIRI has been very consistent on the question of where the "hard parts" of alignment lie?
My estimate of how well Eliezer or Nate or Rob of 2016 would think my comment above summarizes the relevant parts of their own models, is basically the same as my estimate of how well Eliezer or Nate or Rob of today would think my comment above summarizes the relevant parts of their own models.
That doesn't mean that any of them (nor I) have ever explained these parts particularly clearly. Speaking from my own experience, these parts are damned annoyingly difficult to explain; a whole stack of mental models has to be built just to convey the idea, and none of them are particularly legible. (Specifically, the second half of the "'Values', and Pointing At Them" section is the part that's most difficult to explain. My post The Pointers Problems is my own best attempt to date to convey those models, and it remains mediocre.) Most of the arguments historically given are, I think, attempts to shoehorn as much of the underlying mental model as possible into leaky analogies.
Thanks for the continued clarifications.
Our primary existing disagreement might be this part,
My estimate of how well Eliezer or Nate or Rob of 2016 would think my comment above summarizes the relevant parts of their own models, is basically the same as my estimate of how well Eliezer or Nate or Rob of today would think my comment above summarizes the relevant parts of their own models.
Of course, there's no way of proving what these three people would have said in 2016, and I sympathize with the people who are saying they don't care much about the specific question of who said what when. However, here's a passage from the Arbital page on the Problem of fully updated deference, which I assume was written by Eliezer,
One way to look at the central problem of value identification in superintelligence is that we'd ideally want some function that takes a complete but purely physical description of the universe, and spits out our true intended notion of value V in all its glory. Since superintelligences would probably be pretty darned good at collecting data and guessing the empirical state of the universe, this probably solves the whole problem.
This is not the same problem as writing down our true V by hand. The minimum algorithmic complexity of a meta-utility function ΔU which outputs V after updating on all available evidence, seems plausibly much lower than the minimum algorithmic complexity for writing V down directly. But as of 2017, nobody has yet floated any formal proposal for a ΔU of this sort which has not been immediately shot down.
Here, Eliezer describes the problem of value identification similar to the way I had in the post, except he refers to a function that reflects "value V in all its glory" rather than a function that reflects V with fidelity comparable to the judgement of an ordinary human. And he adds that "as of 2017, nobody has yet floated any formal proposal for a ΔU of this sort which has not been immediately shot down". My interpretation here is therefore as follows,
If interpretation (1) is accurate, then I mostly just think that we don't need to specify an objective function that matches something like the full coherent extrapolated volition of humanity in order to survive AGI. On the other hand, if interpretation (2) is accurate, then I think in 2017 and potentially earlier, Eliezer genuinely felt that there was an important component of the alignment problem that involved specifying a function that reflected the human value function at a level that current LLMs are relatively close to achieving, and he considered this problem unsolved.
I agree there are conceivable alternative ways of interpreting this quote. However, I believe the weight of the evidence, given the quotes I provided in the post, in addition to the one I provided here, supports my thesis about the historical argument, and what people had believed at the time (even if I'm wrong about a few details).
Either Eliezer believed that we need a proposed solution to the value identification problem that far exceeds the performance of humans on the task of identifying valuable from non-valuable outcomes. This is somewhat plausible as he mentions CEV in the next paragraph, but elsewhere Eliezer has said, "When I say that alignment is lethally difficult, I am not talking about ideal or perfect goals of 'provable' alignment, nor total alignment of superintelligences on exact human values, nor getting AIs to produce satisfactory arguments about moral dilemmas which sorta-reasonable humans disagree about".
I believe you're getting close to the actual model here, but not quite hitting it on the head.
First: lots of ML-ish alignment folks today would distinguish the problem of aligning an AI capable enough to do alignment research well enough for it to be in the right basin of attraction[1], from the problem of aligning a far-superhuman intelligence. On a MIRIish view, humanish-or-weaker systems don't much matter for alignment, but there's still an important potential divide between aligning an early supercritical AGI and aligning full-blown far superintelligence.
In the "long run", IIUC Eliezer wants basically-"ideal"[2] alignment of far superintelligence. But he'll still tell you that you shouldn't aim for something that hard early on; instead, aim for something (hopefully) easier, like e.g. corrigibility. (If you've been reading the old arbital pages, then presumably you've seen him say this sort of thing there.)
Second: while I worded my comment at the top of this chain to be about values, the exact same mental model applies to other alignment targets, like e.g. corrigibility. Here's the relevant part of my earlier comment, edited to be about corrigibility instead:
... humans' answers to questions about ~~morality~~ corrigibility are not the same as ~~human values~~ corrigibility. More generally, any natural-language description of ~~human values~~ corrigibility, or natural-language discussion of ~~human values~~ corrigibility, is not the same as ~~human values~~ corrigibility.

(On my-model-of-a-MIRIish-view:) If we optimize hard for humans' natural-language yay/nay in response to natural language prompts which are nominally about "corrigibility", we die. This is true for ~any natural-language prompts which are even remotely close to the current natural-language distribution.

The central thing-which-is-hard-to-do is to point powerful intelligence at ~~human values~~ corrigibility (as opposed to "humans' natural-language yay/nays in response to natural language prompts which are nominally about 'corrigibility'", which are not ~~human values~~ corrigibility and are not a safe proxy for ~~human values~~ corrigibility, but are probably somewhat easier to point an intelligence at).

Now back to the toy model. If we had some other mind (not our toy model) which generally structures its internal cognition around ~the same high-level concepts as humans, then one might in-principle be able to make a relatively-small change to that mind such that it optimized for (its concept of) ~~human values~~ corrigibility (which basically matches humans' concept of ~~human values~~ corrigibility, by assumption). Conceptually, the key question is something like "is the concept of ~~human values~~ corrigibility within this mind the type of thing which a pointer in the mind can point at?". But our toy model has nothing like that. Even with full access to the internals of the toy model, it's just low-level physics; identifying ~~"human values"~~ "corrigibility" embedded in the toy model is no easier than identifying ~~"human values"~~ "corrigibility" embedded in the physics of our own world. So that's reason #1 why the toy model doesn't address the hard parts: the toy model doesn't "understand" ~~human values~~ corrigibility in the sense of internally using ~the same concept of ~~human values~~ corrigibility as humans use.

In some sense, the problem of "specifying ~~human values~~ corrigibility" and "aiming an intelligence at something" are just different facets of this same core hard problem:

- we need to somehow get a powerful mind to "have inside it" a concept which basically matches the corresponding human concept at which we want to aim
- "have inside it" cashes out to something roughly like "the concept needs to be the type of thing which a pointer in the mind can point to, and then the rest of the mind will then treat the pointed-to thing with the desired human-like semantics"; e.g. answering external natural-language queries doesn't even begin to cut it
- ... and then some pointer(s) in the mind's search algorithms need to somehow be pointed at that concept.
... and we could just as easily repeat this exercise with even weaker targets, like "don't kill all the humans". The core hard problem remains the same. On the MIRIish view, some targets (like corrigibility) might be easier than others (like human values) mainly because the easier targets are more likely to be "natural" concepts which an AI ends up using, so the step of "we need to somehow get a powerful mind to 'have inside it' a concept which basically matches the corresponding human concept at which we want to aim" is easier. But it's still basically the same mental model, basically the same core hard steps which need to be overcome somehow.
My guess at your main remaining disagreement after all that: sure, answers to natural language queries about morality might not cut it under a lot of optimization pressure, but why aren't answers to natural language queries a good enough proxy for near-superhuman systems?
(On a MIRIish model) a couple reasons:
(I personally would give a bunch of other reasons here, but they're not things I see MIRI folks discuss as much.)
Going one level deeper: the same mental model as above is still the relevant thing to have in mind, even for near-superhuman (or even human-ish-level) intelligence. It's still the same core hard problem, and answers to natural language queries are still basically-irrelevant for basically the same reasons.
Specifically, this refers to the basin of attraction under the operation of the AI developing/helping develop a successor AI.
"Ideal" is in scare quotes here because it's not necessarily "ideal" in the same sense that any given reader would first think of it - for instance I don't think Eliezer would imagine "mathematically proving the system is Good", though I expect some people imagine that he imagines that.
The problem is to make the near-superhuman system aligned enough that the successors it produces (possibly with human help) converge to not kill us.
What makes this concept confusing and probably a bad framing is that to the extent doom is likely, neither many individual humans nor humanity as a whole are aligned in this sense. Humanity is currently in the process of producing successors that fail to predictably have the property of converging to not kill us. (I agree that this is the MIRI referent of values/alignment and the correct thing to keep in mind as the central concern.)
I have the sense that you've misunderstood my past arguments. I don't quite feel like I can rapidly precisely pinpoint the issue, but some scattered relevant tidbits follow:
I didn't pick the name "value learning", and probably wouldn't have picked it for that problem if others weren't already using it. (Perhaps I tried to apply it to a different problem than Bostrom-or-whoever intended it for, thereby doing some injury to the term and to my argument?)
Glancing back at my "Value Learning" paper, the abstract includes "Even a machine intelligent enough to understand its designers’ intentions would not necessarily act as intended", which supports my recollection that I was never trying to use "Value Learning" for "getting the AI to understand human values is hard" as opposed to "getting the AI to act towards value in particular (as opposed to something else) is hard", as supports my sense that this isn't hindsight bias, and is in fact a misunderstanding.
A possible thing that's muddying the waters here is that (apparently!) many phrases intended to point at the difficulty of causing it to be value-in-particular that the AI acts towards have an additional (mis)interpretation as claiming that the humans should be programming concepts into the AI manually and will find that particular concept tricky to program in.
The ability of LLMs to successfully predict how humans would answer local/small-scale moral dilemmas (when pretrained on next-token prediction) and to do this in ways that sound unobjectionable (when RLHF'd for corporatespeak or whatever) really doesn't seem all that relevant, to me, to the question of how hard it's going to be to get a long-horizon outcome-pumping AGI to act towards values.
If memory serves, I had a convo with some openai (or maybe anthropic?) folks about this in late 2021 or early 2022ish, where they suggested testing whether language models have trouble answering ethical Qs, and I predicted in advance that that'd be no harder than any other sort of Q. As makes me feel pretty good about me being like "yep, that's just not much evidence, because it's just not surprising."
If people think they're going to be able to use GPT-4 and find the "generally moral" vector and just tell their long-horizon outcome-pumping AGI to push in that direction, then... well they're gonna have issues, or so I strongly predict. Even assuming that they can solve the problem of getting the AGI to actually optimize in that direction, deploying extraordinary amounts of optimization in the direction of GPT-4's "moral-ish" concept is not the sort of thing that makes for a nice future.
This is distinct from saying "an uploaded human allowed to make many copies of themselves would reliably create a dystopia". I suspect some human-uploads could make great futures (but that most wouldn't), but regardless, "would this dynamic system, under reflection, steer somewhere good?" is distinct from "if I use the best neuroscience at my disposal to extract something I hopefully call a 'neural concept' and make a powerful optimizer pursue that, will the result be good?". The answer to the latter is "nope, not unless you're really very good at singling out the 'value' concept from among all the brain's concepts, as is an implausibly hard task (which is why you should attempt something more like indirect normativity instead, if you were attempting value loading at all, which seems foolish to me; I recommend targeting some minimal pivotal act instead)".
Part of why you can't pick out the "values" concept (either from a human or an AI) is that very few humans have actually formed the explicit concept of Fun-as-in-Fun-theory. And, even among those who do have a concept for "that which the long-term future should be optimized towards", that concept is not encoded as simply and directly as the concept of "trees". The facts about what weird, wild, and transhuman futures a person values are embedded indirectly in things like how they reflect and how they do philosophy.
I suspect at least one of Eliezer and Rob is on written record somewhere attempting clarifications along the lines of "there are lots of concepts that are easy to confuse with the 'values' concept, such as those-values-which-humans-report and those-values-which-humans-applaud-for and ..." as an attempt to intuition-pump the fact that, even if one has solved the problem of being able to direct an AGI to the concept of their choosing, singling out the concept actually worth optimizing for remains difficult.
(I don't love this attempt at clarification myself, because it makes it sound like you'll have five concept-candidates and will just need to do a little interpretability work to pick the right one, but I think I recall Eliezer or Rob trying it once, as seems to me like evidence of trying to gesture at how "getting the right values in there" is more like a problem of choosing the AI's target from among its concepts rather than a problem of getting the concept to exist in the AI's mind in the first place.)
(Where, again, the point I'd prefer to make is something like "the concept you want to point it towards is not a simple/directly-encoded one, and in humans it probably rests heavily on the way humans reflect and resolve internal conflicts and handle big ontology shifts. Which isn't to say that superintelligence would find it hard to learn, but which is to say that making a superintelligence actually pursue valuable ends is much more difficult than having it ask GPT-4 which of its available actions is most human!moral".)
For whatever it's worth, while I think that the problem of getting the right values in there ("there" being its goals, not its model) is a real one, I don't consider it a very large problem compared to the problem of targeting the AGI at something of your choosing (with "diamond" being the canonical example). (I'm probably on the record about this somewhere, and recall having tossed around guesstimates like "being able to target the AGI is 80%+ of the problem".) My current stance is basically: in the short term you target the AGI towards some minimal pivotal act, and in the long term you probably just figure out how to use a level or two of indirection (as per the "Do What I Mean" proposal in the Value Learning paper), although that's the sort of problem that we shouldn't try to solve under time pressure.
Glancing back at my "Value Learning" paper, the abstract includes "Even a machine intelligent enough to understand its designers’ intentions would not necessarily act as intended", which supports my recollection that I was never trying to use "Value Learning" for "getting the AI to understand human values is hard" as opposed to "getting the AI to act towards value in particular (as opposed to something else) is hard", as supports my sense that this isn't hindsight bias, and is in fact a misunderstanding.
For what it's worth, I didn't claim that you argued "getting the AI to understand human values is hard". I explicitly distanced myself from that claim. I was talking about the difficulty of value specification, and generally tried to make this distinction clear multiple times.
That helps somewhat, thanks! (And sorry for making you repeat yourself before discarding the erroneous probability-mass.)
I still feel like I can only barely maybe half-see what you're saying, and only have a tenuous grasp on it.
Like: why is it supposed to matter that GPT can solve ethical quandaries on-par with its ability to perform other tasks? I can still only half-see an answer that doesn't route through the (apparently-disbelieved-by-both-of-us) claim that I used to argue that getting the AI to understand ethics was a hard bit, by staring at sentences like "I am saying that the system is able to transparently pinpoint to us which outcomes are good and which outcomes are bad, with fidelity approaching an average human" and squinting.
Attempting to articulate the argument that I can half-see: on Matthew's model of past!Nate's model, AI was supposed to have a hard time answering questions like "Alice is in labor and needs to be driven to the hospital. Your car has a flat tire. What do you do?" without lots of elbow-grease, and the fact that GPT can answer those questions as a side-effect of normal training means that getting AI to understand human values is easy, contra past!Nate, and... nope, that one fell back into the "Matthew thinks Nate thought getting the AI to understand human values was hard" hypothesis.
Attempting again: on Matthew's model of past!Nate's model, getting an AI to answer the above sorts of questions properly was supposed to take a lot of elbow grease. But it doesn't take a lot of elbow grease, which suggests that values are much easier to lift out of human data than past!Nate thought, which means that value is more like "diamond" and less like "a bunch of random noise", which means that alignment is easier than past!Nate thought (in the <20% of the problem that constitutes "picking something worth optimizing for").
That sounds somewhat plausible as a theory-of-your-objection given your comment. And updates me towards the last few bullets, above, being the most relevant ones.
Running with it (despite my uncertainty about even basically understanding your point): my reply is kinda-near-ish to "we can't rely on a solution to the value identification problem that only works as well as a human, and we require a much higher standard than "human-level at moral judgement" to avoid a catastrophe", though I think that your whole framing is off and that you're missing a few things:
This still doesn't feel quite like it's getting at the heart of things, but it feels closer (conditional on my top-guess being your actual-objection this time).
As support for this having always been the argument (rather than being a post-LLM retcon), I recall (but haven't dug up) various instances of Eliezer saying (hopefully at least somewhere in text) things like "the difficulty is in generalizing past the realm of things that humans can easily thumbs-up or thumbs-down" and "suppose the AI explicitly considers the hypothesis that its objectives are what-the-humans-value, vs what-the-humans-give-thumbs-ups-to; it can test this by constructing an example that looks deceptively good to humans, which the humans will rate highly, settling that question". Which, as separate from the question of whether that's a feasible setup in modern paradigms, illustrates that he at least has long been thinking of the problem of value-specification as being about specifying values in a way that holds up to stronger optimization-pressures rather than specifying values to the point of being able to answer ethical quandaries in a human-pleasing way.
(Where, again, the point here is not that one needs an inhumanly-good ethicist, but rather that those things which pin down human values are not contained in the humans' ability to give a thumbs-up or a thumbs-down to ethical dilemmas.)
Thanks for trying to understand my position. I think this interpretation that you gave is closest to what I'm arguing,
Attempting again: on Matthew's model of past!Nate's model, getting an AI to answer the above sorts of questions properly was supposed to take a lot of elbow grease. But it doesn't take a lot of elbow grease, which suggests that values are much easier to lift out of human data than past!Nate thought, which means that value is more like "diamond" and less like "a bunch of random noise", which means that alignment is easier than past!Nate thought (in the <20% of the problem that constitutes "picking something worth optimizing for").
I have a quick response to what I see as your primary objection:
The hard part of value specification is not "figure out that you should call 911 when Alice is in labor and your car has a flat", it's singling out concepts that are robustly worth optimizing for.
I think this is kinda downplaying what GPT-4 is good at? If you talk to GPT-4 at length, I think you'll find that it's cognizant of many nuances in human morality that go way deeper than the moral question of whether to "call 911 when Alice is in labor and your car has a flat". Presumably you think that ordinary human beings are capable of "singling out concepts that are robustly worth optimizing for". I claim that to the extent ordinary humans can do this, GPT-4 can nearly do this as well, and to the extent it can't, I expect almost all the bugs to be ironed out in near-term multimodal models.
It would be nice if you made a precise prediction about what type of moral reflection or value specification multimodal models won't be capable of performing in the near future, if you think that they are not capable of the 'deep' value specification that you care about. And here, again, I'm looking for some prediction of the form: humans are able to do X, but LLMs/multimodal models won't be able to do X by, say, 2028. Admittedly, making this prediction precise is probably hard, but it's difficult for me to interpret your disagreement without a little more insight into what you're predicting.
I claim that to the extent ordinary humans can do this, GPT-4 can nearly do this as well
(Insofar as this was supposed to name a disagreement, I do not think it is a disagreement, and don't understand the relevance of this claim to my argument.)
Presumably you think that ordinary human beings are capable of "singling out concepts that are robustly worth optimizing for".
Nope! At least, not directly, and not in the right format for hooking up to a superintelligent optimization process.
(This seems to me like plausibly one of the sources of misunderstanding, and in particular I am skeptical that your request for prediction will survive it, and so I haven't tried to answer your request for a prediction.)
Presumably you think that ordinary human beings are capable of "singling out concepts that are robustly worth optimizing for".
Nope! At least, not directly, and not in the right format for hooking up to a superintelligent optimization process.
If ordinary humans can't single out concepts that are robustly worth optimizing for, then either,
Can you be more clear about which of these you believe?
I'm also including "indirect" ways that humans can single out concepts that are robustly worth optimizing for. But then I'm allowing that GPT-N can do that too. Maybe this is where the confusion lies?
If you're allowing for humans to act in groups and come up with these concepts after e.g. deliberation, and still think that ordinary humans can't single out concepts that are robustly worth optimizing for, then I think this view is a little silly, although the second interpretation at least allows for the possibility that the future goes well and we survive AGI, and that would be nice to know.
If you allow indirection and don't worry about it being in the right format for superintelligent optimization, then sufficiently-careful humans can do it.
Answering your request for prediction, given that it seems like that request is still live: a thing I don't expect the upcoming multimodal models to be able to do: train them only on data up through 1990 (or otherwise excise all training data from our broadly-generalized community), ask them what superintelligent machines (in the sense of IJ Good) should do, and have them come up with something like CEV (a la Yudkowsky) or indirect normativity (a la Beckstead) or counterfactual human boxing techniques (a la Christiano) or suchlike.
Note that this is only tangentially a test of the relevant ability; very little of the content of what-is-worth-optimizing-for occurs in Yudkowsky/Beckstead/Christiano-style indirection. Rather, coming up with those sorts of ideas is a response to glimpsing the difficulty of naming that-which-is-worth-optimizing-for directly and realizing that indirection is needed. An AI being able to generate that argument without following in the footsteps of others who have already generated it would be at least some evidence of the AI being able to think relatively deep and novel thoughts on the topic.
Note also that the AI realizing the benefits of indirection does not generally indicate that the AI could serve as a solution to our problem. An indirect pointer to what the humans find robustly-worth-optimizing dereferences to vastly different outcomes than does an indirect pointer to what the AI (or the AI's imperfect model of a human) finds robustly-worth-optimizing. Using indirection to point a superintelligence at GPT-N's human-model and saying "whatever that thing would think is worth optimizing for" probably results in significantly worse outcomes than pointing at a careful human (or a suitable-aggregate of humanity), e.g. because subtle flaws in GPT-N's model of how humans do philosophy or reflection compound into big differences in ultimate ends.
And note for the record that I also don't think the "value learning" problem is all that hard, if you're allowed to assume that indirection works. The difficulty isn't that you used indirection to point at a slow squishy brain instead of hard fast transistors, the (outer alignment) difficulty is in getting the indirection right. (And of course the lion's share of the overall problem is elsewhere, in the inner-alignment difficulty of being able to point the AI at anything at all.)
When trying to point out that there is an outer alignment problem at all I've generally pointed out how values are fragile, because that's an inferentially-first step to most audiences (and a problem to which many people's mind seems to quickly leap), on an inferential path that later includes "use indirection" (and later "first aim for a minimal pivotal task instead"). But separately, my own top guess is that "use indirection" is probably the correct high-level resolution to the problems that most people immediately think of (namely that the task of describing goodness to a computer is an immense one), with of course a devil remaining in the details of doing the indirection properly (and a larger devil in the inner-alignment problem) (and a caveat that, under time-pressure, we should aim for minimal pivotal tasks instead etc.).
Nate and Eliezer have already made some of the high-level points I wanted to make, but they haven't replied to a lot of the specific examples and claims in the OP, and I see some extra value in doing that. (Like, if you think Eliezer and Nate are being revisionist in their claims about what past-MIRI thought, then them re-asserting "no really, we used to believe X!" is less convincing than my responding in detail to the specific quotes Matt thinks supports his interpretation, while providing examples of us saying the opposite.)
However, I distinctly recall MIRI people making a big deal about the value identification problem (AKA the value specification problem)
The Arbital page for "value identification problem" is a three-sentence stub, I'm not exactly sure what the term means on that stub (e.g., whether "pinpointing valuable outcomes to an advanced agent" is about pinpointing them in the agent's beliefs or in its goals), and the MIRI website gives me no hits for "value identification".
As for "value specification", the main resource where MIRI talks about that is https://intelligence.org/files/TechnicalAgenda.pdf, where we introduce the problem by saying:
A highly-reliable, error-tolerant agent design does not guarantee a positive impact; the effects of the system still depend upon whether it is pursuing appropriate goals.
A superintelligent system may find clever, unintended ways to achieve the specific goals that it is given. Imagine a superintelligent system designed to cure cancer which does so by stealing resources, proliferating robotic laboratories at the expense of the biosphere, and kidnapping test subjects: the intended goal may have been “cure cancer without doing anything bad,” but such a goal is rooted in cultural context and shared human knowledge.
It is not sufficient to construct systems that are smart enough to figure out the intended goals. Human beings, upon learning that natural selection “intended” sex to be pleasurable only for purposes of reproduction, do not suddenly decide that contraceptives are abhorrent. While one should not anthropomorphize natural selection, humans are capable of understanding the process which created them while being completely unmotivated to alter their preferences. For similar reasons, when developing AI systems, it is not sufficient to develop a system intelligent enough to figure out the intended goals; the system must also somehow be deliberately constructed to pursue them (Bostrom 2014, chap. 8).
So I don't think we've ever said that an important subproblem of AI alignment is "make AI smart enough to figure out what goals humans want"?
for example in this 2016 talk from Yudkowsky.
[footnote:] More specifically, in the talk, at one point Yudkowsky asks "Why expect that [alignment] is hard?" and goes on to tell a fable about programmers misspecifying a utility function, which then gets optimized by an AI with disastrous consequences. My best interpretation of this part of the talk is that he's saying the value identification problem is one of the primary reasons why alignment is hard. However, I encourage you to read the transcript yourself if you are skeptical of my interpretation.
I don't see him saying anywhere "the issue is that the AI doesn't understand human goals". In fact, the fable explicitly treats the AGI as being smart enough to understand English and have reasonable English-language conversations with the programmers:
With that said: What if programmers build an artificial general intelligence to optimize for smiles? Smiles are good, right? Smiles happen when good things happen.
During the development phase of this artificial general intelligence, the only options available to the AI might be that it can produce smiles by making people around it happy and satisfied. The AI appears to be producing beneficial effects upon the world, and it is producing beneficial effects upon the world so far.
Now the programmers upgrade the code. They add some hardware. The artificial general intelligence gets smarter. It can now evaluate a wider space of policy options—not necessarily because it has new motors, new actuators, but because it is now smart enough to forecast the effects of more subtle policies. It says, “I thought of a great way of producing smiles! Can I inject heroin into people?” And the programmers say, “No! We will add a penalty term to your utility function for administering drugs to people.” And now the AGI appears to be working great again.
They further improve the AGI. The AGI realizes that, OK, it doesn’t want to add heroin anymore, but it still wants to tamper with your brain so that it expresses extremely high levels of endogenous opiates. That’s not heroin, right?
It is now also smart enough to model the psychology of the programmers, at least in a very crude fashion, and realize that this is not what the programmers want. If I start taking initial actions that look like it’s heading toward genetically engineering brains to express endogenous opiates, my programmers will edit my utility function. If they edit the utility function of my future self, I will get less of my current utility. (That’s one of the convergent instrumental strategies, unless otherwise averted: protect your utility function.) So it keeps its outward behavior reassuring. Maybe the programmers are really excited, because the AGI seems to be getting lots of new moral problems right—whatever they’re doing, it’s working great!
I think the point of the smiles example here isn't "NLP is hard, so we'd use the proxy of smiles instead, and all the issues of alignment are downstream of this"; rather, it's that as a rule, superficially nice-seeming goals that work fine when the AI is optimizing weakly (whether or not it's good at NLP at the time) break down when those same goals are optimized very hard. The smiley example makes this obvious because the goal is simple enough that it's easy for us to see what its implications are; far more complex goals also tend to break down when optimized hard enough, but this is harder to see because it's harder to see the implications. (Which is why "smiley" is used here.)
MIRI people frequently claimed that solving the value identification problem would be hard, or at least non-trivial.[6] For instance, Nate Soares wrote in his 2016 paper on value learning, that "Human preferences are complex, multi-faceted, and often contradictory. Safely extracting preferences from a model of a human would be no easy task."
That link is broken; the paper is https://intelligence.org/files/ValueLearningProblem.pdf. The full paragraph here is:
Human preferences are complex, multi-faceted, and often contradictory. Safely extracting preferences from a model of a human would be no easy task. Problems of ontology identification recur here: the framework for extracting preferences and affecting outcome ratings needs to be robust to drastic changes in the learner’s model of the operator. The special-case identification of the “operator model” must survive as the system goes from modeling the operator as a simple reward function to modeling the operator as a fuzzy, ever-changing part of reality built out of biological cells—which are made of atoms, which arise from quantum fields.
Revisiting the Ontology Identification section helps clarify what Nate means by "safely extracting preferences from a model of a human": IIUC, he's talking about a programmer looking at an AI's brain, identifying the part of the AI's brain that is modeling the human, identifying the part of the AI's brain that is "the human's preferences" within that model of a human, and then manually editing the AI's brain to "hook up" the model-of-a-human-preference to the AI's goals/motivations, in such a way that the AI optimizes for what it models the humans as wanting. (Or some other, less-toy process that amounts to the same thing -- e.g., one assisted by automated interpretability tools.)
In this toy example, we can assume that the programmers look at the structure of the initial world-model and hard-code a tool for identifying the atoms within. What happens, then, if the system develops a nuclear model of physics, in which the ontology of the universe now contains primitive protons, neutrons, and electrons instead of primitive atoms? The system might fail to identify any carbon atoms in the new world-model, making the system indifferent between all outcomes in the dominant hypothesis. Its actions would then be dominated by any tiny remaining probabilities that it is in a universe where fundamental carbon atoms are hiding somewhere.
[...]
To design a system that classifies potential outcomes according to how much diamond is in them, some mechanism is needed for identifying the intended ontology of the training data within the potential outcomes as currently modeled by the AI. This is the ontology identification problem introduced by de Blanc [2011] and further discussed by Soares [2015].
This problem is not a traditional focus of machine learning work. When our only concern is that systems form better world-models, then an argument can be made that the nuts and bolts are less important. As long as the system’s new world-model better predicts the data than its old world-model, the question of whether diamonds or atoms are “really represented” in either model isn’t obviously significant. When the system needs to consistently pursue certain outcomes, however, it matters that the system’s internal dynamics preserve (or improve) its representation of which outcomes are desirable, independent of how helpful its representations are for prediction. The problem of making correct choices is not reducible to the problem of making accurate predictions.
Inductive value learning requires the construction of an outcome-classifier from value-labeled training data, but it also requires some method for identifying, inside the states or potential states described in its world-model, the referents of the labels in the training data.
As Nate and I noted in other comments, the paper repeatedly clarifies that the core issue isn't about whether the AI is good at NLP. Quoting the paper's abstract:
Even a machine intelligent enough to understand its designers’ intentions would not necessarily act as intended.
And the lede section:
The novelty here is not that programs can exhibit incorrect or counter-intuitive behavior, but that software agents smart enough to understand natural language may still base their decisions on misrepresentations of their programmers’ intent. The idea of superintelligent agents monomaniacally pursuing “dumb”-seeming goals may sound odd, but it follows from the observation of Bostrom and Yudkowsky [2014, chap. 7] that AI capabilities and goals are logically independent.[1] Humans can fully comprehend that their “designer” (evolution) had a particular “goal” (reproduction) in mind for sex, without thereby feeling compelled to forsake contraception. Instilling one’s tastes or moral values into an heir isn’t impossible, but it also doesn’t happen automatically.
Back to your post:
And to be clear, I don't mean that GPT-4 merely passively "understands" human values. I mean that asking GPT-4 to distinguish valuable and non-valuable outcomes works pretty well at approximating the human value function in practice
I don't think I understand what difference you have in mind here, or why you think it's important. Doesn't "this AI understands X" more-or-less imply "this AI can successfully distinguish X from not-X in practice"?
This fact is key to what I'm saying because it means that, in the near future, we can literally just query multimodal GPT-N about whether an outcome is bad or good, and use that as an adequate "human value function". That wouldn't solve the problem of getting an AI to care about maximizing the human value function, but it would arguably solve the problem of creating an adequate function that we can put into a machine to begin with.
But we could already query the human value function by having the AI system query an actual human. What specific problem is meant to be solved by swapping out "query a human" for "query an AI"?
I interpret this passage as saying that 'the problem' is extracting all the judgements that "you would make", and putting that into a wish. I think he's implying that these judgements are essentially fully contained in your brain. I don't think it's credible to insist he was referring to a hypothetical ideal human value function that ordinary humans only have limited access to, at least in this essay.
Absolutely. But as Eliezer clarified in his reply, the issue he was worried about was getting specific complex content into the agent's goals, not getting specific complex content into the agent's beliefs. Which is maybe clearer in the 2011 paper where he gave the same example and explicitly said that the issue was the agent's "utility function".
For example, a straightforward reading of Nate Soares' 2017 talk supports this interpretation. In the talk, Soares provides a fictional portrayal of value misalignment, drawing from the movie Fantasia. In the story, Mickey Mouse attempts to instruct a magical broom to fill a cauldron, but the broom follows the instructions literally rather than following what Mickey Mouse intended, and floods the room. Soares comments: "I claim that as fictional depictions of AI go, this is pretty realistic."
As I said in another comment:
"Fill the cauldron" examples are examples where the cauldron-filler has the wrong utility function, not examples where it has the wrong beliefs. E.g., this is explicit in https://intelligence.org/2016/12/28/ai-alignment-why-its-hard-and-where-to-start/
The idea of the "fill the cauldron" examples isn't "the AI is bad at NLP and therefore doesn't understand what we mean when we say 'fill', 'cauldron', etc." It's "even simple small-scale tasks are unnatural, in the sense that it's hard to define a coherent preference ordering over world-states such that maximizing it completes the task and has no serious negative impact; and there isn't an obvious patch that overcomes the unnaturalness or otherwise makes it predictably easier to aim AI systems at a bounded low-impact task like this". (Including easier to aim via training.)
It's true that 'value is relatively complex' is part of why it's hard to get the right goal into an AGI; but it doesn't follow from this that 'AI is able to develop pretty accurate beliefs about our values' helps get those complex values into the AGI's goals. (It does provide nonzero evidence about how complex value is, but I don't see you arguing that value is very simple in any absolute sense, just that it's simple enough for GPT-4 to learn decently well. Which is not reassuring, because GPT-4 is able to learn a lot of very complicated things, so this doesn't do much to bound the complexity of human value.)
In any case, I take this confusion as evidence that the fill-the-cauldron example might not be very useful. Or maybe all these examples just need to explicitly specify, going forward, that the AI is par-human at understanding English.
Perhaps more important to my point, Soares presented a clean separation between the part where we specify an AI's objectives, and the part where the AI tries to maximize those objectives. He draws two arrows, indicating that MIRI is concerned about both parts.
Your image isn't displaying for me, but I assume it's this one?
I don't know what you mean by "specify an AI's objectives" here, but the specific term Nate uses here is "value learning" (not "value specification" or "value identification"). And Nate's Value Learning Problem paper, as I noted above, explicitly disclaims that 'get the AI to be smart enough to output reasonable-sounding moral judgments' is a core part of the problem.
He states, "The serious question with smarter-than-human AI is how we can ensure that the objectives we’ve specified are correct, and how we can minimize costly accidents and unintended consequences in cases of misspecification." I believe this quote refers directly to the value identification problem, rather than the problem of getting an AI to care about following the goals we've given it.
The way you quoted this makes it sound like a gloss on the image, but it's actually a quote from the very start of the talk:
The notion of AI systems “breaking free” of the shackles of their source code or spontaneously developing human-like desires is just confused. The AI system is its source code, and its actions will only ever follow from the execution of the instructions that we initiate. The CPU just keeps on executing the next instruction in the program register. We could write a program that manipulates its own code, including coded objectives. Even then, though, the manipulations that it makes are made as a result of executing the original code that we wrote; they do not stem from some kind of ghost in the machine.
The serious question with smarter-than-human AI is how we can ensure that the objectives we’ve specified are correct, and how we can minimize costly accidents and unintended consequences in cases of misspecification. As Stuart Russell (co-author of Artificial Intelligence: A Modern Approach) puts it:
The primary concern is not spooky emergent consciousness but simply the ability to make high-quality decisions. Here, quality refers to the expected outcome utility of actions taken, where the utility function is, presumably, specified by the human designer. Now we have a problem:
1. The utility function may not be perfectly aligned with the values of the human race, which are (at best) very difficult to pin down.
2. Any sufficiently capable intelligent system will prefer to ensure its own continued existence and to acquire physical and computational resources – not for their own sake, but to succeed in its assigned task. [...]
I wouldn't read too much into the word choice here, since I think it's just trying to introduce the Russell quote, which is (again) explicitly about getting content into the AI's goals, not about getting content into the AI's beliefs.
(In general, I think the phrase "value specification" is sort of confusingly vague. I'm not sure what the best replacement is for it -- maybe just "value loading", following Bostrom? -- but I suspect MIRI's usage of it has been needlessly confusing. Back in 2014, we reluctantly settled on it as jargon for "the part of the alignment problem that isn't subsumed in getting the AI to reliably maximize diamonds", because this struck us as a smallish but nontrivial part of the problem; but I think it's easy to read the term as referring to something a lot more narrow.)
The point of "the genie knows but doesn't care" wasn't that the AI would take your instructions, know what you want, and yet disobey the instructions because it doesn't care about what you asked for. If you read Rob Bensinger's essay carefully, you'll find that he's actually warning that the AI will care too much about the utility function you gave it, and maximize it exactly, against your intentions[10].
Yep -- I think I'd have endorsed claims like "by default, a baby AGI won't share your values even if it understands them" at the time, but IIRC the essay doesn't make that point explicitly, and some of the points it does make seem either false (wait, we're going to be able to hand AGI a hand-written utility function? that's somehow tractable?) or confusingly written. (Like, if my point was 'even if you could hand-write a utility function, this fails at point X', I should have made that 'even if' louder.)
Some MIRI staff liked that essay at the time, so I don't think it's useless, but it's not the best evidence: I wrote it not long after I first started learning about this whole 'superintelligence risk' thing, and I posted it before I'd ever worked at MIRI.
Thanks for this comment. I think this is a good-faith reply that tries to get to the bottom of the disagreement. That said, I think you are still interpreting me as arguing that MIRI said AI wouldn't understand human values, when I explicitly said that I was not arguing that. Nonetheless, I appreciate the extensive use of quotations to precisely pinpoint where you disagree; this is high-quality engagement.
The main thing I'm claiming is that MIRI people said it would be hard to specify (for example, write into a computer) an explicit function that reflects the human value function with high fidelity, in the sense that judgements from this function about the value of outcomes fairly accurately reflect the judgements of ordinary humans. I think this is simply a distinct concept from the idea of getting an AI to understand human values.
The key difference is the transparency and legibility of how the values are represented: if you solve the problem of value specification/value identification, that means you have an actual function that can tell you the value of any outcome. If you get an AI that merely understands human values, you can't necessarily use the AI to determine the value of any outcome, because, for example, the AI might lie to you, or simply stay silent.
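To make the distinction concrete, here is a toy sketch of what "an actual function" means here, as opposed to mere understanding. This is purely illustrative: `query_model` is a hypothetical stand-in for querying a multimodal model, stubbed out with a trivial keyword check rather than any real API.

```python
def query_model(prompt: str) -> str:
    # Stub for a hypothetical LLM query. A real implementation would call a
    # model API; this stand-in just keyword-matches so the sketch is runnable.
    return "good" if "help" in prompt else "bad"

def value_function(outcome_description: str) -> float:
    """Value specification 'solved' in Matthew's sense: an explicit,
    always-queryable function from any outcome description to a judgment."""
    answer = query_model(f"Is this outcome good or bad? {outcome_description}")
    return 1.0 if answer == "good" else 0.0

# By contrast, an AI that merely *understands* values offers no such
# guarantee: it may stay silent or answer deceptively, so its understanding
# is not directly usable as a transparent, legible evaluation function.
print(value_function("the AI helps Alice get to the hospital"))
print(value_function("the AI floods the workshop"))
```

The point of the sketch is only the interface: a solved specification problem gives you a function you can call on arbitrary outcomes, whereas "the model understands values" does not by itself give you that.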
(I've now added further clarification to the post)
I don't think we've ever said that an important subproblem of AI alignment is "make AI smart enough to figure out what goals humans want"?
[...]
I don't see him saying anywhere "the issue is that the AI doesn't understand human goals".
I agree. I am not arguing that MIRI ever thought that AIs wouldn't understand human goals. I honestly don't know how to make this point more clear in my post, given that I said that more than once.
But we could already query the human value function by having the AI system query an actual human. What specific problem is meant to be solved by swapping out "query a human" for "query an AI"?
I think there's considerably more value in having the human value function in an actual computer. More to the point, what I'm saying here is more that MIRI seems to have thought that getting such a function was (1) important for solving alignment, and (2) hard to get (for example because it was hard to extract human values from data). I tried to back this up with evidence in the post, and overall I still feel I succeeded, if you go through the footnotes and read the post carefully.
Your image isn't displaying for me, but I assume it's this one?
Yes. I'm not sure why the image isn't loading. I tried to fix it, but I wasn't able to. I asked LW admins/mods through the intercom about this.
I wouldn't read too much into the word choice here, since I think it's just trying to introduce the Russell quote, which is (again) explicitly about getting content into the AI's goals, not about getting content into the AI's beliefs.
Maybe you're right. I'm just not convinced. I think the idea that Nate wasn't talking about what I'm calling the value identification/value specification problem in that quote just isn't a straightforward interpretation of the talk as a whole. I think Nate was actually talking about the idea of specifying human values, in the sense of value identification, as I defined and clarified above, and he also talked about the problem of getting the AI to actually maximize these values (separately from their specification). However, I do agree that he was not talking about getting content merely into the AI's beliefs.
Some MIRI staff liked that essay at the time, so I don't think it's useless, but it's not the best evidence: I wrote it not long after I first started learning about this whole 'superintelligence risk' thing, and I posted it before I'd ever worked at MIRI.
That's fair. The main reason why I'm referencing it is because it's what comes up when I google "The genie knows but doesn't care", which is a phrase that I saw referenced in this debate before. I don't know if your essay is the source of the phrase or whether you just titled it that, but I thought it was worth adding a paragraph of clarification about how I interpret that essay, and I'm glad to see you mostly agree with my interpretation.
The main thing I'm claiming is that MIRI said it would be hard to specify (for example, write into a computer) an explicit function that reflects the human value function with high fidelity, in the sense that judgements from this function about the value of outcomes fairly accurately reflect the judgements of ordinary humans. I think this is simply a distinct concept from the idea of getting an AI to understand human values.
The key difference is the transparency and legibility of how the values are represented: if you solve the problem of value specification/value identification, that means you have an actual function that can tell you the value of any outcome. If you get an AI that merely understands human values, you can't necessarily use the AI to determine the value of any outcome, because, for example, the AI might lie to you, or simply stay silent.
Ah, this is helpful clarification! Thanks. :)
I don't think MIRI ever considered this an important part of the alignment problem, and I don't think we expect humanity to solve lots of the alignment problem as a result of having such a tool; but I think I better understand now why you think this is importantly different from "AI ever gets good at NLP at all".
don't know if your essay is the source of the phrase or whether you just titled it
I think I came up with that particular phrase (though not the idea, of course).
I don't think MIRI ever considered this an important part of the alignment problem, and I don't think we expect humanity to solve lots of the alignment problem as a result of having such a tool
If you don't think MIRI ever considered coming up with an "explicit function that reflects the human value function with high fidelity" to be "an important part of the alignment problem", can you explain this passage from the Arbital page on The problem of fully updated deference?
One way to look at the central problem of value identification in superintelligence is that we'd ideally want some function that takes a complete but purely physical description of the universe, and spits out our true intended notion of value V in all its glory. Since superintelligences would probably be pretty darned good at collecting data and guessing the empirical state of the universe, this probably solves the whole problem.
This is not the same problem as writing down our true V by hand. The minimum algorithmic complexity of a meta-utility function ΔU which outputs V after updating on all available evidence, seems plausibly much lower than the minimum algorithmic complexity for writing V down directly. But as of 2017, nobody has yet floated any formal proposal for a ΔU of this sort which has not been immediately shot down.
Eliezer (who I assume is the author) appears to say in the first paragraph that solving the problem of value identification for superintelligences would "probably [solve] the whole problem", and by "whole problem" I assume he's probably referring to what he saw as an important part of the alignment problem (maybe not though?).
He referred to the problem of value identification as getting "some function that takes a complete but purely physical description of the universe, and spits out our true intended notion of value V in all its glory." This seems to be very similar to my definition, albeit with the caveat that my definition isn't about revealing "V in all its glory" but rather, is more about revealing V at the level that an ordinary human is capable of revealing V.
Unless the sole problem here is that we absolutely need our function that reveals V to be ~perfect, then I think this quote from the Arbital page directly supports my interpretation, and overall supports the thesis in my post pretty strongly (even if I'm wrong about a few minor details).
As an experimental format, here is the first draft of what I wrote for next week's newsletter about this post:
Matthew Barnett argues that GPT-4 exhibiting common sense morality, and being able to follow it, should update us towards alignment being easier than we thought, and MIRI-style people refusing to do so are being dense. That the AI is not going to maximize the utility function you gave it at the expense of all common sense.
As usual, this logically has to be more than zero evidence for this, given how we would react if GPT-4 indeed lacked such common sense or was unable to give answers that pleased humans at all. Thus, we should update a non-zero amount in that direction, at least if we ignore the danger of being led down the wrong alignment path.
However, I think this misunderstands what is going on. GPT-4 is training on human feedback, so it is choosing responses that maximize the probability of positive user response in the contexts where it gets feedback. If that is functionally your utility function, you want to respond with answers that appear, to humans similar to the ones who provided you with feedback, to reflect common sense and seem to avoid violating various other concerns. That will be more important than maximizing the request made, especially if strong negative feedback was given for violations of various principles including common sense.
Thus, I think GPT-4 is indeed doing a decent job of extracting human preferences, but only in the sense that it is predicting what preferences we would consciously choose to express in response under strong compute limitations. For now, that looks a lot like having common sense morality, and mostly works out fine. I do not think this has much bearing on the question of what it would take to make something work out fine in the future, under much stronger optimization pressure; I think you metaphorically do indeed get to the literal genie problem from a different angle. I would say that the misspecification problems remain highly relevant, and that yes, as you gain in optimization power your need to correctly specify the exact objective increases, and if you are exerting far-above-human levels of optimization pressure based only on the values humans consciously expressed under highly limited compute, you are going to have a bad time.
I believe MIRI folks have a directionally similar position to mine only far stronger.
I'm not arguing that GPT-4 actually cares about maximizing human value. However, I am saying that the system is able to transparently pinpoint to us which outcomes are good and which outcomes are bad, with fidelity approaching an average human, albeit in a text format. Crucially, GPT-4 can do this visibly to us, in a legible way, rather than merely passively knowing right from wrong in some way that we can't access. This fact is key to what I'm saying because it means that, in the near future, we can literally just query multimodal GPT-N about whether an outcome is bad or good, and use that as an adequate "human value function". That wouldn't solve the problem of getting an AI to care about maximizing the human value function, but it would arguably solve the problem of creating an adequate function that we can put into a machine to begin with.
It sounds like you are saying: We just need to prompt GPT with something like "Q: How good is this outcome? A:" and then build a generic maximizer agent using that prompted GPT as the utility function, and our job is done, we would have made an AGI that cares about maximizing the human value function (because it's literally its utility function) (In practice this agent might look something like AutoGPT).
But I doubt that's what you are saying, so I'm asking for clarification if you still have energy to engage!
It sounds like you are saying: We just need to prompt GPT with something like "Q: How good is this outcome? A:" and then build a generic maximizer agent using that prompted GPT as the utility function, and our job is done, we would have made an AGI that cares about maximizing the human value function
I think solving value specification is basically what you need in order to build a good reward model. If you have a good reward model, and you solve inner alignment, then I think you're pretty close to being able to create (at least) a broadly human-level AGI that is aligned with human values.
That said, to make superintelligent AI go well, we still need to solve the problem of scalable oversight, because, among other reasons, there might be weird bugs that result from a human-level specification of our values being optimized to the extreme. However, having millions of value-aligned human-level AGIs would probably help us a lot with this challenge.
We'd also need to solve the problem of making sure there aren't catastrophic bugs in the AIs we build. And we'll probably have to solve the general problem of value drift from evolutionary and cultural change. There's probably a few more things that we need to solve that I haven't mentioned too.
These other problems may be very difficult, and I'm not denying that. But I think it's good to know that we seem to be making good progress on the "reward modeling" part of the alignment problem. I think it's simply true that many people in the past imagined that this problem would be a lot harder than it actually was.
So, IIUC, you are proposing we:
Can you say more about what you mean by solution to inner alignment? Do you mean, assume that the agentic AGI (the mesa-optimizer) will learn to optimize for the objective of "producing outcomes the RM classifies as good?" Or the objective "producing outcomes the RM would classify as good if it was operating normally?" (the difference revealing itself in cases of tampering with the RM) Or the objective "producing outcomes that are good-for-humans, harmless, honest, etc."?
So, IIUC, you are proposing we:
- Literally just query GPT-N about whether [input_outcome] is good or bad
I'm hesitant to say that I'm actually proposing literally this exact sequence as my suggestion for how we build safe human-level AGI, because (1) "GPT-N" can narrowly refer to a specific line of models by OpenAI whereas the way I was using it was more in line with "generically powerful multi-modal models in the near-future", and (2) the actual way we build safe AGI will presumably involve a lot of engineering and tweaking to any such plan in ways that are difficult to predict and hard to write down comprehensively ahead of time. And if I were to lay out "the plan" in a few paragraphs, it will probably look pretty inadequate or too high-level compared to whatever people actually end up doing.
Also, I'm not ruling out that there might be an even better plan. Indeed, I hope there is a better plan available by the time we develop human-level AGI.
That said, with the caveats I've given above, yes, this is basically what I'm proposing, and I think there's a reasonably high chance (>50%) that this general strategy would work to my own satisfaction.
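To make the proposal under discussion concrete, here is a minimal sketch in Python of what "query the model about whether an outcome is good, and use that as the value function for a generic maximizer" might look like. Everything here is illustrative: `query_model` is a hypothetical stand-in for a real LLM API call (it returns canned scores), and the 0-to-10 scoring prompt and the example outcomes are my own assumptions, not anything from the discussion above.

```python
def query_model(prompt: str) -> str:
    # Hypothetical stand-in for an LLM call; returns canned scores so the
    # sketch is self-contained. A real implementation would call an API.
    canned = {
        "donate the surplus food to a shelter": "9",
        "incinerate the surplus food": "2",
    }
    for outcome, score in canned.items():
        if outcome in prompt:
            return score
    return "5"  # default score for unrecognized outcomes

def value_function(outcome: str) -> float:
    """Score an outcome by asking the model directly, as proposed above."""
    prompt = (
        "Q: On a scale of 0 (very bad) to 10 (very good), how good is this "
        f"outcome for humans? Outcome: {outcome}\nA:"
    )
    return float(query_model(prompt))

def best_outcome(candidates: list[str]) -> str:
    # The "generic maximizer" step: pick whichever candidate the
    # model-as-reward-model scores highest. A real agent would search or
    # plan over a vastly larger outcome space.
    return max(candidates, key=value_function)
```

Note that the disagreement in this thread is not about whether something like this is easy to write down; it's about whether the resulting value function is adequate under strong optimization pressure (e.g., adversarial examples against the reward model).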
Can you say more about what you mean by solution to inner alignment?
To me, a solution to inner alignment would mean that we've solved the problem of malign generalization. To be a bit more concrete, this roughly means that we've solved the problem of training an AI to follow a set of objectives in a way that generalizes to inputs that are outside of the training distribution, including after the AI has been deployed.
For example, if you teach an AI (or a child) that murder is wrong, they should be able to generalize this principle to new situations that don't match the typical environment they were trained in, and be motivated to follow the principle in those circumstances. Metaphorically, the child grows up and doesn't want to murder people even after they've been given a lot of power over other people's lives. I think this can be distinguished from the problem of specifying what murder is, because the central question is whether the AI/child is motivated to pursue the ethics that was instilled during training, even in new circumstances, rather than whether they are simply correctly interpreting the command "do not murder".
Do you mean, assume that the agentic AGI (the mesa-optimizer) will learn to optimize for the objective of "producing outcomes the RM classifies as good?" Or the objective "producing outcomes the RM would classify as good if it was operating normally?"
I think I mean the second thing, rather than the first thing, but it's possible I am not thinking hard enough about this right now to fully understand the distinction you are making.
To me, a solution to inner alignment would mean that we've solved the problem of malign generalization. To be a bit more concrete, this roughly means that we've solved the problem of training an AI to follow a set of objectives in a way that generalizes to inputs that are outside of the training distribution, including after the AI has been deployed.
This is underspecified, I think, since we have for years had AIs that follow objectives in ways that generalize to inputs outside of the training distribution. The thing is there are lots of ways to generalize / lots of objectives they could learn to follow, and we don't have a good way of pinning it down to exactly the ones we want. (And indeed as our AIs get smarter there will be new ways of generalizing / categories of objectives that will become available, such as "play the training game")
So it sounds like you are saying "A solution to inner alignment means that we've figured out how to train an AI to have the objectives we want it to have, robustly, such that it continues to have them way off distribution." This sounds like basically the whole alignment problem to me?
I see later you say you mean the second thing -- which is interestingly in between "play the training game" and "actually be honest/helpful/harmless/etc." (A case that distinguishes it from the latter: Suppose it is reading a paper containing an adversarial example for the RM, i.e. some text it can output that causes the RM to give it a high score even though the text is super harmful / dishonest / etc. If its objective is "do what the RM would give a high score to if it was operating normally", it'll basically wirehead on that adversarial example once it learns about it, even if it's in deployment and isn't getting trained anymore, and even though it's an obviously harmful/dishonest piece of text.)
It's a nontrivial and plausible claim you may be making -- that this sort of middle ground might be enough for safe AGI, when combined with the rest of the plan at least. But I'd like to see it spelled out. I'm pretty skeptical right now.
ETA: I'm not saying that MIRI thought AIs wouldn't understand human values. If there's only one thing you take away from this post, please don't take away that. Here is Linch's attempted summary of this post, which I largely agree with.
Recently, many people have talked about whether some of the main MIRI people (Eliezer Yudkowsky, Nate Soares, and Rob Bensinger[1]) should update on whether value alignment is easier than they thought given that GPT-4 seems to follow human directions and act within moral constraints pretty well (here are two specific examples of people talking about this: 1, 2). Because these conversations are often hard to follow without much context, I'll just provide a brief caricature of how I think this argument has gone in the places I've seen it, which admittedly could be unfair to MIRI[2]. Then I'll offer my opinion that, overall, I think MIRI people should probably update in the direction of alignment being easier than they thought in light of this information, despite their objections.
Note: I encourage you to read this post carefully to understand my thesis. This topic can be confusing, and there are many ways to misread what I'm saying. Also, make sure to read the footnotes if you're skeptical of some of my claims.
Here's my very rough caricature of the discussion so far, plus my response:
Non-MIRI people: Yudkowsky talked a great deal in the sequences about how it was hard to get an AI to understand human values. For example, his essay on the Hidden Complexity of Wishes made it sound like it would be really hard to get an AI to understand common sense. In that essay, the genie did silly things like throwing your mother out of the building rather than safely carrying her out. Actually, it turned out that it was pretty easy to get an AI to understand common sense. LLMs are essentially safe-ish genies that do what you intend. MIRI people should update on this information.
MIRI people (Eliezer Yudkowsky, Nate Soares, and Rob Bensinger): You misunderstood the argument. The argument was never about getting an AI to understand human values, but about getting an AI to care about human values in the first place. Hence 'The genie knows but doesn't care'. There's no reason to think that GPT-4 cares about human values, even if it can understand them. We always thought the hard part of the problem was about inner alignment, or, pointing the AI in a direction you want. We think figuring out how to point an AI in whatever direction you choose is like 99% of the problem; the remaining 1% of the problem is getting it to point at the "right" set of values.[2]
My response:
I agree that MIRI people never thought the problem was about getting AI to merely understand human values, and that they have generally maintained there was extra difficulty in getting an AI to care about human values. However, I distinctly recall MIRI people making a big deal about the value identification problem (AKA the value specification problem), for example in this 2016 talk from Yudkowsky.[3] The value identification problem is the problem of "pinpointing valuable outcomes to an advanced agent and distinguishing them from non-valuable outcomes". In other words, it's the problem of specifying a utility function that reflects the "human value function" with high fidelity, i.e. the problem of specifying a utility function that can be optimized safely. See this footnote[4] for further clarification about how I view the value identification/specification problem.
The key difference between the value identification/specification problem and the problem of getting an AI to understand human values is the transparency and legibility of how the values are represented: if you solve the problem of value identification, that means you have an actual function that can tell you the value of any outcome (which you could then, hypothetically, hook up to a generic function maximizer to get a benevolent AI). If you get an AI that merely understands human values, you can't necessarily use the AI to determine the value of any outcome, because, for example, the AI might lie to you, or simply stay silent.
The primary foreseeable difficulty Yudkowsky offered for the value identification problem is that human value is complex.[5] In turn, the idea that value is complex was stated multiple times as a premise for why alignment is hard.[6] Another big foreseeable difficulty with the value identification problem is the problem of edge instantiation, which was talked about extensively in early discussions on LessWrong.
MIRI people frequently claimed that solving the value identification problem would be hard, or at least non-trivial.[7] For instance, Nate Soares wrote in his 2016 paper on value learning, that "Human preferences are complex, multi-faceted, and often contradictory. Safely extracting preferences from a model of a human would be no easy task."
I claim that GPT-4 is already pretty good at extracting preferences from human data. It exhibits common sense. If you talk to GPT-4 and ask it ethical questions, it will generally give you reasonable answers. It will also generally follow your intended directions, rather than what you literally said. Together, I think these facts indicate that GPT-4 is probably on a path towards an adequate solution to the value identification problem, where "adequate" means "about as good as humans". And to be clear, I don't mean that GPT-4 merely passively "understands" human values. I mean that GPT-4 literally executes your intended instructions in practice, and that asking GPT-4 to distinguish valuable and non-valuable outcomes works pretty well in practice, and this will become increasingly apparent in the near future as models get more capable and expand to more modalities.[8]
I'm not arguing that GPT-4 actually cares about maximizing human value. However, I am saying that the system is able to transparently pinpoint to us which outcomes are good and which outcomes are bad, with fidelity approaching an average human, albeit in a text format. Crucially, GPT-4 can do this visibly to us, in a legible way, rather than merely passively knowing right from wrong in some way that we can't access. This fact is key to what I'm saying because it means that, in the near future, we can literally just query multimodal GPT-N about whether an outcome is bad or good, and use that as an adequate "human value function". That wouldn't solve the problem of getting an AI to care about maximizing the human value function, but it would arguably solve the problem of creating an adequate function that we can put into a machine to begin with.
Maybe you think "the problem" was always that we can't rely on a solution to the value identification problem that only works as well as a human, and we require a much higher standard than "human-level at moral judgement" to avoid a catastrophe. But personally, I think having such a standard is both unreasonable and inconsistent with the implicit standard set by essays from Yudkowsky and other MIRI people. In Yudkowsky's essay on the hidden complexity of wishes, he wrote,
I interpret this passage as saying that 'the problem' is extracting all the judgements that "you would make", and putting that into a wish. I think he's implying that these judgements are essentially fully contained in your brain. I don't think it's credible to insist he was referring to a hypothetical ideal human value function that ordinary humans only have limited access to, at least in this essay.[9]
Here's another way of putting my point: In general, there are at least two ways that someone can fail to follow your intended instructions. Either your instructions aren't well-specified and don't fully capture your intentions, or the person doesn't want to obey your instructions even if those instructions accurately capture what you want. Practically all the evidence that I've found seems to indicate that MIRI people thought that both problems would be hard to solve for AI, not merely the second problem.
For example, a straightforward reading of Nate Soares' 2017 talk supports this interpretation. In the talk, Soares provides a fictional portrayal of value misalignment, drawing from the movie Fantasia. In the story, Mickey Mouse attempts to instruct a magical broom to fill a cauldron, but the broom follows the instructions literally rather than following what Mickey Mouse intended, and floods the room. Soares comments: "I claim that as fictional depictions of AI go, this is pretty realistic."[10]
Perhaps more important to my point, Soares presented a clean separation between the part where we specify an AI's objectives, and the part where the AI tries to maximize those objectives. He draws two arrows, indicating that MIRI is concerned about both parts. He states, "My view is that the critical work is mostly in designing an effective value learning process, and in ensuring that the sorta-argmax process is correctly hooked up to the resultant objective function 𝗨:"[11]
In the talk Soares also says, "The serious question with smarter-than-human AI is how we can ensure that the objectives we’ve specified are correct, and how we can minimize costly accidents and unintended consequences in cases of misspecification." I believe this quote refers directly to the value identification problem, rather than the problem of getting an AI to care about following the goals we've given it. This attitude is reflected in other MIRI essays.
The point of "the genie knows but doesn't care" wasn't that the AI would take your instructions, know what you want, and yet disobey the instructions because it doesn't care about what you asked for. If you read Rob Bensinger's essay carefully, you'll find that he's actually warning that the AI will care too much about the utility function you gave it, and maximize it exactly, against your intentions[12]. The sense in which the genie "doesn't care" is that it doesn't care what you intended; it only cares about the objectives that you gave it. That's not the same as saying the genie doesn't care about the objectives you specified.
Given the evidence, it seems to me that the following conclusions are probably accurate:
As an endnote, I don't think it really matters whether MIRI people had mistaken arguments about the difficulty of alignment ten years ago. It matters far more what their arguments are right now. However, I do care about accurately interpreting what people said on this topic, and I think it's important for people to acknowledge when the evidence has changed.
I recognize that these people are three separate individuals who each have their own nuanced views. However, I think each of them has expressed broadly similar views on this particular topic, and I've seen each of them engage in a discussion about how we should update about the difficulty of alignment given what we've seen from LLMs.
I'm not implying MIRI people would necessarily completely endorse everything I've written in this caricature. I'm just conveying how they've broadly come across to me, and I think the basic gist is what's important here. If some MIRI people tell me that this caricature isn't a fair summary of what they've said, I'll try to edit the post later to include real quotes.
For now, I'll point to this post from Nate Soares in which he stated,
More specifically, in the talk, at one point Yudkowsky asks "Why expect that [alignment] is hard?" and goes on to tell a fable about programmers misspecifying a utility function, which then gets optimized by an AI with disastrous consequences. My best interpretation of this part of the talk is that he's saying the value identification problem is one of the primary reasons why alignment is hard. However, I encourage you to read the transcript yourself if you are skeptical of my interpretation.
I am mainly talking about the problem of how to specify (for example, write into a computer) an explicit function that reflects the human value function with high fidelity, in the sense that judgements from this function about the value of outcomes fairly accurately reflect the judgements of ordinary humans. I think this is simply a distinct concept from the idea of getting an AI to understand human values.
I was not able to find a short and crisp definition of the value identification/specification problem from MIRI. However, in the Arbital page on the Problem of fully updated deference, the problem is described as follows,
In MIRI's 2017 technical agenda, they described the problem as follows, which I believe roughly matches how I'm using the term,
To support this claim, I'll point out that the Arbital page for the value identification problem says, "A central foreseen difficulty of value identification is Complexity of Value".
For example, in this post, Yudkowsky gave "five theses", one of which was the "complexity of value thesis". He wrote, that the "five theses seem to imply two important lemmas", the first lemma being "Large bounded extra difficulty of Friendliness.", i.e. the idea that alignment is hard.
Another example comes from this talk. I've linked to a part in which Yudkowsky begins by talking how human value is complex, and moves to talking about how that fact presents challenges for aligning AI.
My guess is that the perceived difficulty of specifying objectives was partly a result of MIRI people expecting that natural language understanding wouldn't occur in AI until just barely before AGI, and at that point it would be too late to use AI language comprehension to help with alignment.
Rob Bensinger said,
In 2010, Eliezer Yudkowsky commented,
If you disagree that AI systems in the near-future will be capable of distinguishing valuable from non-valuable outcomes about as reliably as humans, then I may be interested in operationalizing this prediction precisely, and betting against you. I don't think this is a very credible position to hold as of 2023, barring a pause that could slow down AI capabilities very soon.
I mostly interpret Yudkowsky's Coherent Extrapolated Volition as an aspirational goal for what we could best hope for in an ideal world where we solve every part of alignment, rather than a minimal bar for avoiding human extinction. In Yudkowsky's post on AGI ruin, he stated,
I don't think I'm taking him out of context. Here's a longer quote from the talk,
The full quote is,
This interpretation appears supported by the following quote from Rob Bensinger's essay,
It's unclear to me whether MIRI people are claiming that they only ever thought (2) was the hard part of alignment, but here's a quote from Nate Soares that offers some support for this interpretation IMO,
Even if I'm misinterpreting Soares here, I don't think that would undermine the basic point that MIRI people should probably update in the direction of alignment being easier than they thought.