This is a post about my own confusions. It seems likely that other people have discussed these issues at length somewhere, and that I am not up with current thoughts on them, because I don’t keep good track of even everything great that everyone writes. I welcome anyone kindly directing me to the most relevant things, or if such things are sufficiently well thought through that people can at this point just correct me in a small number of sentences, I’d appreciate that even more.

~

The traditional argument for AI alignment being hard is that human value is ‘complex’ and ‘fragile’. That is, it is hard to write down what kind of future we want, and if we get it even a little bit wrong, most futures that fit our description will be worthless.

The illustrations I have seen of this involve a person trying to write a description of value conceptual analysis style, and failing to put in things like ‘boredom’ or ‘consciousness’, and so getting a universe that is highly repetitive, or unconscious.

I’m not yet convinced that this is world-destroyingly hard.

Firstly, it seems like you could do better than imagined in these hypotheticals:

1. These thoughts are from a while ago. If instead you used ML to learn what ‘human flourishing’ looked like in a bunch of scenarios, I expect you would get something much closer than if you try to specify it manually. Compare manually specifying what a face looks like, then generating examples from your description to using modern ML to learn it and generate them.
2. Even in the manually describing it case, if you had like a hundred people spend a hundred years writing a very detailed description of what went wrong, instead of a writer spending an hour imagining ways that a more ignorant person may mess up if they spent no time on it, I could imagine it actually being pretty close. I don’t have a good sense of how far away it is.

I agree that neither of these would likely get you to exactly human values.

But secondly, I’m not sure about the fragility argument: that if there is basically any distance between your description and what is truly good, you will lose everything.

This seems to be a) based on a few examples of discrepancies between written-down values and real values where the written down values entirely exclude something, and b) assuming that there is a fast takeoff so that the relevant AI has its values forever, and takes over the world.

My guess is that values that are got using ML but still somewhat off from human values are much closer in terms of not destroying all value of the universe, than ones that a person tries to write down. Like, the kinds of errors people have used to illustrate this problem (forget to put in, ‘consciousness is good’) are like forgetting to say faces have nostrils in trying to specify what a face is like, whereas a modern ML system’s imperfect impression of a face seems more likely to meet my standards for ‘very facelike’ (most of the time).

Perhaps a bigger thing for me though is the issue of whether an AI takes over the world suddenly. I agree that if that happens, lack of perfect alignment is a big problem, though not obviously an all value nullifying one (see above). But if it doesn’t abruptly take over the world, and merely becomes a large part of the world’s systems, with ongoing ability for us to modify it and modify its roles in things and make new AI systems, then the question seems to be how forcefully the non-alignment is pushing us away from good futures relative to how forcefully we can correct this. And in the longer run, how well we can correct it in a deep way before AI does come to be in control of most decisions. So something like the speed of correction vs. the speed of AI influence growing.

These are empirical questions about the scales of different effects, rather than questions about whether a thing is analytically perfect. And I haven’t seen much analysis of them. To my own quick judgment, it’s not obvious to me that they look bad.

For one thing, these dynamics are already in place: the world is full of agents and more basic optimizing processes that are not aligned with broad human values: most individuals to a small degree, some strange individuals to a large degree, corporations, competitions, the dynamics of political processes. It is also full of forces for aligning them individually and stopping the whole show from running off the rails: law, social pressures, adjustment processes for the implicit rules of both of these, individual crusades. The adjustment processes themselves are not necessarily perfectly aligned, they are just overall forces for redirecting toward alignment. And in fairness, this is already pretty alarming. It’s not obvious to me that imperfectly aligned AI is likely to be worse than the currently misaligned processes, or even that it won’t be a net boon for the side of alignment.

So then the largest remaining worry is that it will still gain power fast and correction processes will be slow enough that its somewhat misaligned values will be set in forever. But it isn’t obvious to me that by that point it isn’t sufficiently well aligned that we would recognize its future as a wondrous utopia, just not the very best wondrous utopia that we would have imagined if we had really carefully sat down and imagined utopias for thousands of years. This again seems like an empirical question of the scale of different effects, unless there is an argument that some effect will be totally overwhelming.

(I reviewed this in a top-level post: Review of 'But exactly how complex and fragile?'.)

I've thought about (concepts related to) the fragility of value quite a bit over the last year, and so I returned to Katja Grace's But exactly how complex and fragile? with renewed appreciation (I'd previously commented only a very brief microcosm of this review). I'm glad that Katja wrote this post and I'm glad that everyone commented. I often see private Google docs full of nuanced discussion which will never see the light of day, and that makes me sad, and I'm happy that people discussed this publicly.

I'll split this review into two parts, since the nominations called for review of both the post and the comments:

I think this post should be reviewed for its excellent comment section at least as much as for the original post, and also think that this post is a pretty central example of the kind of post I would like to see more of.

~ habryka

# Summary

I think this was a good post. I think Katja shared an interesting perspective with valuable insights and that she was correct in highlighting a confused debate in the community.

That said, I think the post and the discussion are reasonably confused. The post sparked valuable lower-level discussion of AI risk, but I don't think that the discussion clarified AI risk models in a meaningful way.

The problem is that people are debating "is value fragile?" without realizing that value fragility is a sensitivity measure: given some initial state and some dynamics, how sensitive is the human-desirability of the final outcomes to certain kinds of perturbations of the initial state?

Left unremarked by Katja and the commenters, value fragility isn't intrinsically about AI alignment. What matters most is the extent to which the future is controlled by systems whose purposes are sufficiently entangled with human values. This question reaches beyond just AI alignment.

They also seem to be debating an under-specified proposition. Different perturbation sets and different dynamics will exhibit different fragility properties, even though we're measuring with respect to human value in all cases. For example, perturbing the training of an RL agent learning a representation of human value, is different from perturbing the utility function of an expected utility maximizer.

Setting loose a superintelligent expected utility maximizer is different from setting loose a mild optimizer (e.g. a quantilizer), even if they're both optimizing the same flawed representation of human value; the dynamics differ. As another illustration of how dynamics are important for value fragility, imagine if recommender systems had been deployed within a society which already adequately managed the impact of ML systems on its populace. In that world, we may have ceded less of our agency and attention to social media, and would therefore have firmer control over the future and value would be less fragile with respect to the training process of these recommender systems.
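To make the "same flawed representation, different dynamics" point concrete, here is a minimal sketch. Everything in it is a hypothetical toy of my own (the action set, the utilities, the single catastrophic action, the 10% quantile); it is not a model of any real system. It treats fragility as the drop in true value caused by one perturbation of the agent's value representation, and compares that drop under a maximizer and under a quantilizer.

```python
from typing import Callable

Utility = Callable[[int], float]

ACTIONS = range(1000)  # hypothetical action set; base distribution is uniform


def true_utility(action: int) -> float:
    """Hypothetical 'real' value of each action; action 999 is catastrophic."""
    return -100.0 if action == 999 else action / 1000.0


def perturbed_utility(action: int) -> float:
    """Flawed representation: matches true_utility except that it
    badly overrates the catastrophic action."""
    return 10.0 if action == 999 else true_utility(action)


def maximizer(representation: Utility) -> float:
    """True value obtained by fully optimizing the given representation."""
    return true_utility(max(ACTIONS, key=representation))


def quantilizer(representation: Utility, q: float = 0.1) -> float:
    """Expected true value when sampling uniformly from the top q fraction
    of actions, ranked by the given representation."""
    ranked = sorted(ACTIONS, key=representation, reverse=True)
    top = ranked[: max(1, round(q * len(ranked)))]
    return sum(true_utility(a) for a in top) / len(top)


def fragility(dynamics: Callable[[Utility], float]) -> float:
    """Drop in true value caused by the perturbation, under these dynamics."""
    return dynamics(true_utility) - dynamics(perturbed_utility)


print("maximizer fragility:  ", fragility(maximizer))
print("quantilizer fragility:", fragility(quantilizer))
```

In this toy the same perturbation costs the maximizer roughly 101 units of true value and the quantilizer roughly 1, because the quantilizer only lands on the overrated action 1% of the time. The numbers mean nothing in themselves; the point is only that `fragility` takes the dynamics as an argument.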

# The Post

But exactly how complex and fragile? and its comments debate whether "value is fragile." I think this is a bad framing because it hides background assumptions about the dynamics of the system being considered. This section motivates a more literal interpretation of the value fragility thesis, demonstrating its coherence and its ability to meaningfully decompose AI alignment disagreements. The next section will use this interpretation to reveal how the comments largely failed to explore key modelling assumptions. This, I claim, helped prevent discussion from addressing the cruxes of disagreements.

The post and discussion both seem to slip past (what I view as) the heart of 'value fragility', and it seems like many people are secretly arguing for and against different propositions. Katja says:

it is hard to write down what kind of future we want, and if we get it even a little bit wrong, most futures that fit our description will be worthless.

But this leaves hidden a key step:

it is hard to write down the future we want, feed the utility function punchcard into the utility maximizer and then press 'play', and if we get it even a little bit wrong, most futures that fit our description will be worthless.

Here is the original 'value is fragile' claim:

Any Future not shaped by a goal system with detailed reliable inheritance from human morals and metamorals, will contain almost nothing of worth.

~  Eliezer Yudkowsky, Value is Fragile

Eliezer claims that if the future is not shaped by a goal system, there's not much worth. He does not explicitly claim, in that original essay, that we have to/will probably build an X-maximizer AGI, where X is an extremely good (or perfect) formalization of human values (whatever that would mean!). He does not explicitly claim that we will mold a mind from shape Y and that that probably goes wrong, too. He's talking about goal systems charting a course through the future, and how sensitive the outcomes are to that process.

Let's ground this out. Imagine you're acting, but you aren't quite sure what is right. For a trivial example, you can eat bananas or apples at any given moment, but you aren't sure which is better. There are a few strategies you could follow: preserve attainable utility for lots of different goals (preserve the fruits as best you can); retain option value where your normative uncertainty lies (don't toss out all the bananas or all of the apples); etc.

But what if you have to commit to an object-level policy now, a way-of-steering-the-future now, without being able to reflect more on your values? What kind of guarantees can you get?

In Markov decision processes, if you're maximally uncertain, you can't guarantee you won't lose at least half of the value you could have achieved for the unknown true goal (I recently proved this for an upcoming paper). Relatedly, perfectly optimizing an ε-incorrect reward function only bounds regret to 2ε per time step (see also Goodhart's Curse). The main point is that you can't pursue every goal at once. It doesn't matter whether you use reinforcement learning to train a policy, or whether you act randomly, or whether you ask Mechanical Turk volunteers what you should do in each situation. Whenever your choices mean anything at all, no sequence of actions can optimize all goals at the same time.
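For readers who want to see where a per-step regret bound of this shape can come from, here is a short sketch of the standard argument. The setup is my own assumption, not something stated above: the proxy reward R' differs from the true reward R by at most ε at every state-action pair, π* is optimal for R, π' is optimal for R', and V̄ denotes average per-step reward.

```latex
% Assumption (mine): |R(s,a) - R'(s,a)| <= \epsilon for every state-action pair.
% \bar V_R(\pi) is the average per-step reward of policy \pi under reward R.
\begin{aligned}
\bar V_R(\pi^*) - \bar V_R(\pi')
  &= \underbrace{\bar V_R(\pi^*) - \bar V_{R'}(\pi^*)}_{\le\,\epsilon}
   + \underbrace{\bar V_{R'}(\pi^*) - \bar V_{R'}(\pi')}_{\le\,0}
   + \underbrace{\bar V_{R'}(\pi') - \bar V_R(\pi')}_{\le\,\epsilon} \\
  &\le 2\epsilon .
\end{aligned}
```

The outer two terms are at most ε because the rewards differ by at most ε at every step, and the middle term is non-positive because π' is optimal for R'. So the bound holds per step, but over a long horizon the total regret can still grow linearly in time.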

So there has to be something which differentially pushes the future towards "good" things and away from "bad" things. That something could be 'humanity', or 'aligned AGI', or 'augmented humans wielding tool AIs', or 'magically benevolent aliens' - whatever. But it has to be something, some 'goal system' (as Eliezer put it), and it has to be entangled with the thing we want it to optimize for (human morals and metamorals). Otherwise, there's no reason to think that the universe weaves a "good" trajectory through time.

Hence, one might then conclude:

Any Future not shaped by a goal system with detailed reliable inheritance from human morals and metamorals, will not be optimized for human morals and metamorals.

But how do we get from "will not be optimized for" to "will contain almost nothing of worth"? There are probably a few ways of arguing this; the simplest may be:

our universe has 'resources'; making the universe decently OK-by-human-standards requires resources which can be used for many other purposes; most purposes are best accomplished by not using resources in this way.

This is not an argument that we will deploy utility maximizers with a misspecified utility function, and that that will be how our fragile value is shattered and our universe is extinguished. The thesis holds merely that

Any Future not shaped by a goal system with detailed reliable inheritance from human morals and metamorals, will contain almost nothing of worth.

As Katja notes, this argument is secretly about how the "forces of optimization" shape the future, and not necessarily about AIs or anything. The key point is to understand how the future is shaped, and then discuss how different kinds of AI systems might shape that future.

Concretely, I can claim 'value is fragile' and then say 'for example, if we deployed a utility-maximizer in our society but we forgot to have it optimize for variety, people might loop a single desirable experience forever.' But on its own, the value fragility claim doesn't center on AI.

[Human] values do not emerge in all possible minds.  They will not appear from nowhere to rebuke and revoke the utility function of an expected paperclip maximizer.

Touch too hard in the wrong dimension, and the physical representation of those values will shatter - and not come back, for there will be nothing left to want to bring it back.

And the referent of those values - a worthwhile universe - would no longer have any physical reason to come into being.

Let go of the steering wheel, and the Future crashes.

Value is Fragile

Katja (correctly) implies that concluding that AI alignment is difficult requires extra arguments beyond value fragility:

... But if [the AI] doesn’t abruptly take over the world, and merely becomes a large part of the world’s systems, with ongoing ability for us to modify it and modify its roles in things and make new AI systems, then the question seems to be how forcefully the non-alignment is pushing us away from good futures relative to how forcefully we can correct this. And in the longer run, how well we can correct it in a deep way before AI does come to be in control of most decisions. So something like the speed of correction vs. the speed of AI influence growing.

But exactly how complex and fragile?

As I see it, Katja and the commenters mostly discuss their conclusions about how AI+humanity might steer the future, how hard it will be to achieve the requisite entanglement with human values, instead of debating the truth value of the 'value fragility' claim which Eliezer made. Katja and the commenters discuss points which are relevant to AI alignment, but which are distinct from the value fragility claim. No one remarks that this claim has truth value independent of how we go about AI alignment, or how hard it is for AI to further our values.

Value fragility quantifies the robustness of outcome value to perturbation of the "motivations" of key actors within a system, given certain dynamics. This may become clearer as we examine the comments. This insight allows us to decompose debates about "value fragility" into e.g.

1. In what ways is human value fragile, given a fixed optimization scheme?

   In other words: given fixed dynamics, to what classes of perturbations is outcome value fragile?
2. What kinds of multi-agent systems tend to veer towards goodness and beauty and value?

   In other words: given a fixed set of perturbations, what kinds of dynamics are unusually robust against these perturbations?
3. What kinds of systems will humanity end up building, should we act no further? This explores our beliefs about how probable alignment pressures will interact with value fragility.

I think this is much more enlightening than debating

VALUE_FRAGILE_TO_AI == True?

If no such decomposition takes place, I think debate is just too hard and opaque and messy, and I think some of this messiness spilled over into the comments. Locally, each comment is well thought-out, but it seems (to me) that cruxes were largely left untackled.

To concretely point out something I consider somewhat confused, johnswentworth authored the top-rated comment:

I think [Katja's summary] is an oversimplification of the fragility argument, which people tend to use in discussion because there's some nontrivial conceptual distance on the way to a more rigorous fragility argument.

The main conceptual gap is the idea that "distance" is not a pre-defined concept. Two points which are close together in human-concept-space may be far apart in a neural network's learned representation space or in an AGI's world-representation-space. It may be that value is not very fragile in human-concept-space; points close together in human-concept-space may usually have similar value. But that will definitely not be true in all possible representations of the world, and we don't know how to reliably formalize/automate human-concept-space.

The key point is not "if there is any distance between your description and what is truly good, you will lose everything", but rather, "we don't even know what the relevant distance metric is or how to formalize it". And it is definitely the case, at least, that many mathematically simple distance metrics do display value fragility.

This is a good point. But what exactly happens between "we write down something too distant from the 'truth'" and the result? The AI happens. But this part, the dynamics, it's kept invisible.

So if you think that there will be fast takeoff via utility maximizers (a la AIXI), you might say "yes, value is fragile", but if I think it'll be more like slow CAIS with semi-aligned incentives making sure nothing goes too wrong, I reply "value isn't fragile." Even if we agree on a distance metric! This is how people talk past each other.

Crucially, you have to realize that your mind can hold the value fragility considerations (how vulnerable the outcomes are to the aforementioned perturbations) separate from your parameter values for e.g. AI timelines.

Many other comments seem off-the-mark in a similar way. That said, I think that Steve Byrnes left an underrated comment:

Corrigibility is another reason to think that the fragility argument is not an impossibility proof: If we can make an agent that sufficiently understands and respects the human desire for autonomy and control, then it would presumably ask for permission before doing anything crazy and irreversible, so we would presumably be able to course-correct later on (even with fast/hard takeoff).

The reason that corrigibility-like properties are so nice is that they let us continue to steer the future through the AI itself; its power becomes ours, and so we remain the "goal system with detailed reliable inheritance from human morals and metamorals" shaping the future.

# Conclusion

The problem is that people are debating "is value fragile?" without realizing that value fragility is a sensitivity measure: given some initial state and some dynamics, how sensitive is the human-desirability of the final outcomes to certain kinds of perturbations of the initial state?

Left unremarked by Katja and the commenters, value fragility isn't intrinsically about AI alignment. What matters most is the extent to which the future is controlled by systems whose purposes are sufficiently entangled with human values. This question reaches beyond just AI alignment.

I'm glad Katja said "Hey, I'm not convinced by this key argument", but I don't think it makes sense to include But exactly how complex and fragile? in the review.

Thanks to Rohin Shah for feedback on this review.

I read through the first part of this review, and generally thought "yep, this is basically right, except it should factor out the distance metric explicitly rather than dragging in all this stuff about dynamics". I had completely forgotten that I said the same thing a year ago, so I was pretty amused when I reached the quote.

Anyway, I'll defend the distance metric thing a bit here.

But what exactly happens between "we write down something too distant from the 'truth'" and the result? The AI happens. But this part, the dynamics, it's kept invisible.

I claim that "keeping the dynamics invisible" is desirable here.

The reason that "fragility of human values" is a useful concept/hypothesis in the first place is that it cuts reality at the joints. What does that mean? Roughly speaking, it means that there's a broad class of different questions for which "are human values fragile?" is an interesting and useful subquestion, without needing a lot of additional context. We can factor out the "are human values fragile?" question, and send someone off to go think about that question, without a bunch of context about why exactly we want to answer the question. Conversely, because the answer isn't highly context-dependent, we can think about the question once and then re-use the answer when thinking about many different scenarios - e.g. foom or CAIS or multipolar takeoff or .... Fragility of human values is a gear in our models, and once we've made the investment to understand that gear, we can re-use it over and over again as the rest of the model varies.

Of course, that only works to the extent that fragility of human values actually doesn't depend on a bunch of extra context. Which it obviously does, as this review points out. Distance metrics allow us to "factor out" that context-dependence, to wrap it in a clean API.

Rather than asking "are human values fragile?", we ask "under what distance metric(s) are human values fragile?" - that's the new "API" of the value-fragility question. Then, when someone comes along with a specific scenario (like foom or CAIS or ...), we ask what distance metric is relevant to the dynamics of that scenario. For instance, in a foom scenario, the relevant distance metric is probably determined by the AI's ontology - i.e. what things the AI thinks are "similar". In a corporate-flavored multipolar takeoff scenario, the relevant distance metric might be driven by economic/game-theoretic considerations: outcomes with similar economic results (e.g. profitability of AI-run companies) will be "similar".

The point is that these distance metrics tell us what particular aspects/properties of each scenario are relevant to value fragility.

Rather than asking "are human values fragile?", we ask "under what distance metric(s) are human values fragile?" - that's the new "API" of the value-fragility question.

In other words: "against which compact ways of generating perturbations is human value fragile?". But don't you still need to consider some dynamics for this question to be well-defined? So it doesn't seem like it captures all of the regularities implied by:

Distance metrics allow us to "factor out" that context-dependence, to wrap it in a clean API.

But I do presently agree that it's a good conceptual handle for exploring robustness against different sets of perturbations.

In other words: "against which compact ways of generating perturbations is human value fragile?". But don't you still need to consider some dynamics for this question to be well-defined?

Not quite. If we frame the question as "which compact ways of generating perturbations", then that's implicitly talking about dynamics, since we're asking how the perturbations were generated. But if we know what perturbations are generated, then we can say whether human value is fragile against those perturbations, regardless of how they're generated. So, rather than framing the question as "which compact ways of generating perturbations", we frame it as "which sets of perturbations" or "densities of perturbations" or a distance function on perturbations.

Ideally, we come up with a compact criterion for when human values are fragile against such sets/densities/distance functions.

(I meant to say 'perturbations', not 'permutations')

Not quite. If we frame the question as "which compact ways of generating permutations", then that's implicitly talking about dynamics, since we're asking how the permutations were generated.

Hm, maybe we have two different conceptions. I've been imagining singling out a variable (e.g. the utility function) and perturbing it in different ways, and then filing everything else under the 'dynamics.'

So one example would be, fix an EU maximizer. To compute value sensitivity, we consider the sensitivity of outcome value with respect to a range of feasible perturbations to the agent's utility function. The perturbations only affect the utility function, and so everything else is considered to be part of the dynamics of the situation. You might swap out the EU maximizer for a quantilizer, or change the broader society in which the agent is deployed, but these wouldn't classify as 'perturbations' in the original ontology.

Point is, these perturbations aren't actually generated within the imagined scenarios, but we generate them outside of the scenarios in order to estimate outcome sensitivity.

Perhaps this isn't clean, and perhaps I should rewrite parts of the review with a clearer decomposition.

So one example would be, fix an EU maximizer. To compute value sensitivity, we consider the sensitivity of outcome value with respect to a range of feasible perturbations to the agent's utility function. The perturbations only affect the utility function, and so everything else is considered to be part of the dynamics of the situation. You might swap out the EU maximizer for a quantilizer, or change the broader society in which the agent is deployed, but these wouldn't classify as 'perturbations' in the original ontology.

Let me know if this is what you're saying:

•  we have an agent which chooses X to maximize E[u(X)] (maybe with a do() operator in there)
• we perturb the utility function to u'(X)
• we then ask whether max E[u(X)] is approximately E[u(X')], where X' is the decision maximizing E[u'(X')]

... so basically it's a Goodhart model, where we have some proxy utility function and want to check whether the proxy achieves similar value to the original.

Then the value-fragility question asks: under which perturbation distributions are the two values approximately the same? Or, the distance function version: if we assume that u' is "close to" u, then under what distance functions does that imply the values are close together?

Then your argument would be: the answer to that question depends on the dynamics, specifically on how X influences u. Is that right?
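The Goodhart-style check described above can be sketched in a few lines. All specifics here are hypothetical choices of mine: the decision grid, the three-point noise distribution standing in for the expectation, the quadratic true utility, and the two perturbations. The sketch picks the decision that maximizes the expected proxy u' and reports how much expected true utility u is lost relative to the true optimum.

```python
NOISE = (-0.1, 0.0, 0.1)                   # hypothetical, equally likely noise
DECISIONS = [i / 100 for i in range(101)]  # hypothetical decision grid X


def u(outcome: float) -> float:
    """Hypothetical true utility, peaked at outcome 0.5."""
    return -(outcome - 0.5) ** 2


def expected(utility, x: float) -> float:
    """E[utility(x + noise)] under the discrete noise distribution."""
    return sum(utility(x + n) for n in NOISE) / len(NOISE)


def value_lost(u_proxy) -> float:
    """max E[u(X)] minus E[u(X')], where X' maximizes E[u_proxy(X')]."""
    best_true = max(expected(u, x) for x in DECISIONS)
    x_proxy = max(DECISIONS, key=lambda x: expected(u_proxy, x))
    return best_true - expected(u, x_proxy)


def tilted(outcome: float) -> float:
    """Perturbation 1: a small smooth tilt added to u."""
    return u(outcome) + 0.02 * outcome


def spiked(outcome: float) -> float:
    """Perturbation 2: u plus a large bonus in a region u dislikes."""
    return u(outcome) + (1.0 if outcome > 1.055 else 0.0)


print("value lost under tilt: ", value_lost(tilted))
print("value lost under spike:", value_lost(spiked))
```

In this toy the smooth tilt moves the chosen decision only slightly and costs almost nothing, while the spike drags the decision far from the true optimum. Which perturbations count as "close to" u is exactly the question the distance-function framing is asking.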

Assuming all that is what you're saying... I'm imagining another variable, which is roughly a world-state W. When we write utility as a function of X directly (i.e. u(X)), we're implicitly integrating over world states. Really, the utility function is u(W(X)): X influences the world-state, and then the utility is over (estimated) world-states. When I talk about "factoring out the dynamics", I mean that we think about the function u(W), ignoring X. The sensitivity question is then something like: under what perturbations is u'(W) a good approximation of u(W), and in particular when are maxima of u'(W) near-maximal for u(W), including when the maximization is subject to fairly general constraints. The maximization is no longer over X, but instead over world-states W directly - we're asking which world-states (compatible with the constraints) maximize each utility. (For specific scenarios, the constraints would encode the world-states reachable by the dynamics.) Ideally, we'd find some compact criterion for which perturbations preserve value under which constraints.

(Meta: this was useful, I understand this better for having written it out.)

Yes, this is basically what I had in mind! I really like this grounding; thanks for writing it out. If there were a value fragility research agenda, this might be a good start; I haven't yet decided whether I think there are good theorems to be found here, though.

Can you expand on

including when the maximization is subject to fairly general constraints... Ideally, we'd find some compact criterion for which perturbations preserve value under which constraints.

This is max_W u'(W), right? And then you might just constrain the subset of W which the agent can search over? Or did you have something else in mind?

This is max_W u'(W), right? And then you might just constrain the subset of W which the agent can search over?

Exactly.

One toy model to conceptualize what a "compact criterion" might look like: imagine we take a second-order expansion of u around some u-maximal world-state W*. Then, the eigendecomposition of the Hessian of u around W* tells us which directions-of-change in the world state u cares about a little or a lot. If the constraints lock the accessible world-states into the directions which u doesn't care about much (i.e. eigenvalues near 0), then any accessible world-state near W* compatible with the constraints will have near-maximal u. On the other hand, if the constraints allow variation in directions which u does care about a lot (i.e. large eigenvalues), then u will be fragile to perturbations to u' which move the u'-optimal world-state along those directions.

That toy model has a very long list of problems with it, but I think it conveys roughly what kind of things are involved in modelling value fragility.
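As a rough numerical illustration of that toy model, here is a two-dimensional sketch. The specifics are assumptions of mine: the world-state has two coordinates, the u-maximal state is labelled W*, and the matrix `A` is a made-up stand-in for minus the Hessian of u at W*. To second order, moving a displacement d away from W* loses 0.5 * dᵀ A d of value, so the eigenvectors of A pick out the directions u barely cares about and the ones it cares about a lot.

```python
import math


def eig_sym_2x2(a: float, b: float, c: float):
    """Eigenpairs of the symmetric matrix [[a, b], [b, c]],
    returned as (eigenvalue, unit eigenvector), smallest eigenvalue first."""
    if b == 0:
        return sorted([(a, (1.0, 0.0)), (c, (0.0, 1.0))])
    mean, radius = (a + c) / 2, math.hypot((a - c) / 2, b)
    pairs = []
    for lam in (mean - radius, mean + radius):
        vx, vy = b, lam - a  # solves (a - lam) * vx + b * vy = 0
        norm = math.hypot(vx, vy)
        pairs.append((lam, (vx / norm, vy / norm)))
    return pairs


# Hypothetical curvature matrix A = -(Hessian of u at W*), entries a, b, c.
A = (5.05, 4.95, 5.05)


def value_lost(direction, step: float = 1.0) -> float:
    """Second-order loss 0.5 * d^T A d for moving `step` along `direction`."""
    a, b, c = A
    dx, dy = step * direction[0], step * direction[1]
    return 0.5 * (a * dx * dx + 2 * b * dx * dy + c * dy * dy)


(lam_flat, v_flat), (lam_steep, v_steep) = eig_sym_2x2(*A)
print("flat direction: ", v_flat, "loss per unit step:", value_lost(v_flat))
print("steep direction:", v_steep, "loss per unit step:", value_lost(v_steep))
```

If the constraints only permit movement along `v_flat`, a unit displacement costs about 0.05; if they permit movement along `v_steep`, the same displacement costs about 5. This inherits all the problems of the toy model (purely local, quadratic, low-dimensional) and is only meant to show what the eigenvalue criterion says.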

This Facebook post has the best discussion of this I know of; in particular check out Dario's comment and the replies to it.

But secondly, I’m not sure about the fragility argument: that if there is basically any distance between your description and what is truly good, you will lose everything.

I think this is an oversimplification of the fragility argument, which people tend to use in discussion because there's some nontrivial conceptual distance on the way to a more rigorous fragility argument.

The main conceptual gap is the idea that "distance" is not a pre-defined concept. Two points which are close together in human-concept-space may be far apart in a neural network's learned representation space or in an AGI's world-representation-space. It may be that value is not very fragile in human-concept-space; points close together in human-concept-space may usually have similar value. But that will definitely not be true in all possible representations of the world, and we don't know how to reliably formalize/automate human-concept-space.

The key point is not "if there is any distance between your description and what is truly good, you will lose everything", but rather, "we don't even know what the relevant distance metric is or how to formalize it". And it is definitely the case, at least, that many mathematically simple distance metrics do display value fragility.

I found this very helpful, thanks! I think this is maybe what Yudkowsky was getting at when he brought up adversarial examples here.

Adversarial examples are like adversarial goodhart. But an AI optimizing the universe for its imperfect understanding of the good is instead like extremal goodhart. So, while adversarial examples show that cases of dramatic non-overlap between human and ML concepts exist, it may be that you need an adversarial process to find them with nonnegligible probability. In which case we are fine.

This optimistic conjecture could be tested by looking to see what image *maximally* triggers an ML classifier. Does the perfect cat, the most cat-like cat according to ML, actually look like a cat to us humans? If so, then by analogy the perfect utopia according to ML would also be pretty good. If not...

Perhaps this paper answers my question in the negative; I don't know enough ML to be sure. Thoughts?

If you want to visualize features, you might just optimize an image to make neurons fire. Unfortunately, this doesn’t really work. Instead, you end up with a kind of neural network optical illusion — an image full of noise and nonsensical high-frequency patterns that the network responds strongly to.
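The extremal Goodhart worry in this comment can be sketched with a one-dimensional toy. The functions `true_value` and `proxy` are hypothetical stand-ins chosen for illustration, not anything taken from the linked paper: the proxy tracks the true value closely on a familiar range, yet pushing the proxy to its maximum over a much wider range lands somewhere the true value is bad.

```python
def true_value(x):
    """Hypothetical 'real' value: rises at first, then collapses at extremes."""
    return x - 0.1 * x ** 2

def proxy(x):
    """Hypothetical learned proxy: a straight line that fits the familiar range."""
    return 0.95 * x

# On the familiar range [0, 1] the proxy is a close approximation.
familiar = [i / 100 for i in range(0, 101)]
worst_gap = max(abs(true_value(x) - proxy(x)) for x in familiar)

# An optimizer allowed to search a much wider range pushes the proxy to its extreme.
wide = [i / 10 for i in range(0, 201)]
x_proxy_opt = max(wide, key=proxy)
x_true_opt = max(wide, key=true_value)

print("worst proxy error on familiar range:", worst_gap)
print("proxy-optimal x:", x_proxy_opt, "true value there:", true_value(x_proxy_opt))
print("truly optimal x:", x_true_opt, "true value there:", true_value(x_true_opt))
```

Whether real learned concepts behave like this under optimization is exactly the empirical question being asked; the toy only shows that a good fit on ordinary cases does not by itself rule it out.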

The natural response to this is "ML seems really good at learning good distance metrics".

And it is definitely the case, at least, that many mathematically simple distance metrics do display value fragility.

Which is why you learn the distance metric. "Mathematically simple" rules for vision, speech recognition, etc. would all be very fragile, but ML seems to solve those tasks just fine.

One obvious response is "but what about adversarial examples"; my position is that image datasets are not rich enough for ML to learn the human-desired concepts; the concepts they do learn are predictive, just not about things we care about.

Another response is "but there are lots of rewards / utilities that are compatible with observed behavior, so you might learn the wrong thing, e.g. you might learn influence-seeking behavior". This is the worry behind inner alignment concerns as well. This seems like a real worry to me, but it's only tangentially related to the complexity / fragility of value.

The natural response to this is "ML seems really good at learning good distance metrics".

No, no they absolutely do not seem...

my position is that image datasets are not rich enough for ML to learn the human-desired concepts; the concepts they do learn are predictive, just not about things we care about.

... right, yes, that is exactly the issue here. They do not learn the things we care about. Whether ML is good at learning predictive distance metrics is irrelevant here; what matters is whether they are good at learning human distance metrics. Maybe throwing more data at the problem will make learned metrics converge to human metrics, but even if it did, would we reliably be able to tell?

The key point is that we don't even know what the relevant distance metric is. Even in human terms, we don't know what the relevant metric is. We cannot expect to be able to distinguish an ML system which has learned the "correct" metric from one which has not.

The key point is that we don't even know what the relevant distance metric is. Even in human terms, we don't know what the relevant metric is. We cannot expect to be able to distinguish an ML system which has learned the "correct" metric from one which has not.

This seems true, and also seems true for the images case, yet I (and I think most researchers) predict that image understanding will get very good / superhuman. What distinguishes the images case from the human values case? My guess at your response is that we aren't applying optimization pressure on the learned distance function for images.

In that case, my response would be that yes, if you froze in place the learned distance metric / "human value representation" at any given point, and then ratcheted up the "capabilities" of the agent, that's reasonably likely to go badly (though I'm not sure, and it depends how much the current agent has already been trained). But presumably the agent is going to continue learning over time.

Even in the case where we freeze the values and ratchet up the capabilities: you're presumably not aligned with me, but it doesn't seem like ratcheting up your capabilities obviously leads to doom for me. (It doesn't obviously not lead to doom either though.)

(and I think most researchers) predict that image understanding will get very good / superhuman. What distinguishes the images case from the human values case? My guess at your response is that we aren't applying optimization pressure on the learned distance function for images.

Good guess, but no. My response is that "image understanding will get very good" is completely different from "neural nets will understand images the same way humans do" or "neural nets will understand images such that images the net considers similar will also seem similar to humans". I agree that ML systems will get very good at "understanding" images in the sense of predicting motion or hidden pixels or whatever. But while different humans seem to have pretty similar concepts of what a tree is, it is not at all clear that ML systems have the same tree-concept as a human... and even if they did, how could we verify that, in a manner robust to both distribution shifts and Goodhart?

For friendliness purposes, it does not matter how well a neural net "understands" images/values, what matters is that their "understanding" be compatible with human understanding - in the sense that, if the human considers two things similar, the net should also consider them similar, and vice versa. Otherwise the fragility problem comes into play: two human-value-estimates which seem close together in the AI's representation may be disastrously different for a human.

I agree that ML systems will get very good at "understanding" images in the sense of predicting motion or hidden pixels or whatever.

... So why can't ML systems get very good at predicting what humans value, if they can predict motion / pixels? Or perhaps you can think they can predict motion / pixels, but they can't e.g. caption images, because that relies on higher-level concepts? If so, I predict that ML systems will also be good at that, and maybe that's the crux.

But while different humans seem to have pretty similar concepts of what a tree is, it is not at all clear that ML systems have the same tree-concept as a human.

I'm also predicting that vision-models-trained-with-richer-data will have approximately the same tree-concept as humans. (Not exactly the same, e.g. they won't have a notion of a "Christmas tree", presumably.)

and even if they did, how could we verify that, in a manner robust to both distribution shifts and Goodhart?

I'm not claiming we can verify it. I'm trying to make an empirical prediction about what happens. That's very different from what I can guarantee / verify. I'd argue the OP is also speaking in this frame.

I'm trying to make an empirical prediction about what happens. That's very different from what I can guarantee / verify. I'd argue the OP is also speaking in this frame.

That may be the crux. I'm generally of the mindset that "can't guarantee/verify" implies "completely useless for AI safety". Verifying that it's safe is the whole point of AI safety research. If we were hoping to make something that just happened to be safe even though we couldn't guarantee it beforehand or double-check afterwards, that would just be called "AI".

I'm not saying we need proof-level guarantees for everything. Reasoning from strong enough priors would be ok, but saying "well, it seems like it'll probably be safe, but we can't actually verify our assumptions or reasoning" really doesn't cut it. Especially when we do not understand what the things-of-interest (values) even are, or how to formalize them.

I'm also predicting that vision-models-trained-with-richer-data will have approximately the same tree-concept as humans.

If we're saying that tree-concepts of vision-models-trained-with-richer-data will be similar to the human tree-concept according to humans, then I actually do agree with that. I do not expect it to generalize to values. (Although if we had a way to verify that the concepts match, I would expect the concept-match-verification method to generalize.) Here's a few different views on why I wouldn't expect it to generalize, which feel to me like they're all working around the edges of the same central idea:

• In game/decision-theoretic terms, values depend on off-equilibrium behavior. They depend on counterfactual situations which will never actually happen.
• In reductive terms, things in images can mostly be expressed as complicated clusters in atom-configuration space. Those clusters are directly relevant to predictive models, and they have predictive power. Values, and agency, aren't like that - we could model and predict the world just fine without assigning agency to any processes in it. (I suspect that a formalization of this distinction drops naturally out of a theory of abstraction, but that's still under construction.)
• Humans can generally agree on what a tree is. Disagreements over values - or over what values even are - feel qualitatively different. From a human perspective, it feels like values and trees are defined in qualitatively different ways.

Again, if we had ways to guarantee/verify that a human and an ML system were using the same concepts, or had similar notions of "distance" and "approximation", then I do expect that would generalize from images to values. But I don't expect that methods which find human-similar concepts in images will also generally find human-similar concepts in values.

That may be the crux. I'm generally of the mindset that "can't guarantee/verify" implies "completely useless for AI safety". Verifying that it's safe is the whole point of AI safety research. If we were hoping to make something that just happened to be safe even though we couldn't guarantee it beforehand or double-check afterwards, that would just be called "AI".

Surely "the whole point of AI safety research" is just to save the world, no? If the world ends up being saved, does it matter whether we were able to "verify" that or not? From my perspective, as a utilitarian, it seems to me that the only relevant question is how some particular intervention/research/etc. affects the probability of AI being good for humanity (or the EV, to be precise). It certainly seems quite useful to be able to verify lots of stuff to achieve that goal, but I think it's worth being clear that verification is an instrumental goal not a terminal one—and that there might be other possible ways to achieve that terminal goal (understanding empirical questions, for example, as Rohin wanted to do in this thread). At the very least, I certainly wouldn't go around saying that verification is "the whole point of AI safety research."

Surely "the whole point of AI safety research" is just to save the world, no?

Suppose you're an engineer working on a project to construct the world's largest bridge (by a wide margin). You've been tasked with safety: designing the bridge so that it does not fall down.

One assistant comes along and says "I have reviewed the data on millions of previously-built bridges as well as record-breaking bridges specifically. Extrapolating the data forward, it is unlikely that our bridge will fall down if we just scale-up a standard, traditional design."

Now, that may be comforting, but I'm still not going to move forward with that bridge design until we've actually run some simulations. Indeed, I'd consider the simulations the core part of the bridge-safety-engineer's job; trying to extrapolate from existing bridges would be at most an interesting side-project.

But if the bridge ends up standing, does it matter whether we were able to guarantee/verify the design or not?

The problem is model uncertainty. Simulations of a bridge have very little model uncertainty - if the simulation stands, then we can be pretty darn confident the bridge will stand. Extrapolating from existing data to a record-breaking new system has a lot of model uncertainty. There's just no way one can ever achieve sufficient levels of confidence with that kind of outside-view reasoning - we need the levels of certainty which come with a detailed, inside-view understanding of the system.

If the world ends up being saved, does it matter whether we were able to "verify" that or not?

Go find an engineer who designs bridges, or buildings, or something. Ask them: if they were designing the world's largest bridge, would it matter whether they had verified the design was safe, so long as the bridge stood up?

That may be the crux. I'm generally of the mindset that "can't guarantee/verify" implies "completely useless for AI safety". Verifying that it's safe is the whole point of AI safety research. If we were hoping to make something that just happened to be safe even though we couldn't guarantee it beforehand or double-check afterwards, that would just be called "AI".

It would be nice if you said this in comments in the future. This post seems pretty explicitly about the empirical question to me, and even if you don't think the empirical question counts as AI safety research (a tenable position, though I don't agree with it), the empirical questions are still pretty important for prioritization research, and I would like people to be able to have discussions about that.

(Partly I'm a bit frustrated at having had another long comment conversation that bottomed out in a crux that I already knew about, and I don't know how I could have known this ahead of time, because it really sounded to me like you were attempting to answer the empirical question.)

Although it occurs to me that you might be claiming that empirically, if we fail to verify, then we're near-definitely doomed. If so, I want to know the reasons for that belief, and how they contradict my arguments, rather than whatever it is we're currently debating. (And also, I retract both of the paragraphs above.)

Re: the rest of your comment: I don't in fact want to have AI systems that try to guess human "values" and then optimize that -- as you said we don't even know what "values" are. I more want AI systems that are trying to help us, in the same way that a personal assistant might help you, despite not knowing your "values".

Sorry we wound up deep in a thread on a known crux. Mostly I just avoid timeline/prioritization/etc conversations altogether (on the margin I think it's a bikeshed). But in this case I read the OP as wondering why safety researchers were interested in the fragility argument, more than arguing over fragility itself.

As for AIs trying to help us rather than guessing human values... I don't really see how that circumvents the central problem? It sort-of splits off some of the nebulous, unformalized ideas which seem relevant into their own component, but we still end up with a bunch of nebulous, unformalized ideas which do not seem like the same kind of conceptual objects as "trees". We still need notions of wanting things, of agency, etc.

One obvious response is “but what about adversarial examples”; my position is that image datasets are not rich enough for ML to learn the human-desired concepts; the concepts they do learn are predictive, just not about things we care about.

To clarify, are you saying that if we had a rich enough dataset, the concepts they learn would be things we care about? If so, what is this based on, and how rich of a dataset do you think we would need? If not, can you explain more what you mean?

In the images case, I meant that if you had a richer dataset with more images in more conditions, accompanied with touch-based information, perhaps even audio, and the agent were allowed to interact with the world and see through these input mechanisms what the world did in response, then it would learn concepts that allow it to understand the world the way we do -- it wouldn't be fooled by occlusions, or by putting a picture of a baseball on top of an ocean picture, etc. (This also requires a sufficiently large dataset; I don't know how large.)

I'm not saying that such a dataset would lead it to learn what we value. I don't know what that dataset would look like, partly because it's not clear to me what exactly we value.

There's a distinction worth mentioning between the fragility of human value in concept space, and the fragility induced by a hard maximizer running after its proxy as fast as possible.

Like, we could have a distance metric whereby human value is discontinuously sensitive to nudges in concept space, while still being OK practically (if we figure out eg mild optimization). Likewise, if we have a really hard maximizer pursuing a mostly-robust proxy of human values, and human value is pretty robust itself, bad things might still happen due to implementation errors (the AI is incorrigibly trying to accrue human value for itself, instead of helping us do it).
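To make the contrast between a hard maximizer and mild optimization concrete, here is a minimal sketch using a quantilizer-style rule (pick at random from the top fraction of "ordinary" actions under the proxy, rather than taking the global argmax). The functions `true_value` and `proxy`, the action range, and the base distribution of ordinary actions are all hypothetical assumptions for illustration.

```python
import random

def true_value(x):
    """Hypothetical 'real' value: good for moderate x, bad at extremes."""
    return x - 0.1 * x ** 2

def proxy(x):
    """Hypothetical proxy that keeps rewarding larger x without limit."""
    return 0.95 * x

def hard_maximizer(candidates):
    """Pick the single candidate with the highest proxy score."""
    return max(candidates, key=proxy)

def quantilizer(base_samples, q, rng):
    """Pick uniformly at random among the top-q fraction of base samples by proxy."""
    ranked = sorted(base_samples, key=proxy, reverse=True)
    top = ranked[: max(1, int(len(ranked) * q))]
    return rng.choice(top)

rng = random.Random(0)

# The hard maximizer searches every action it can reach.
everything = [i / 10 for i in range(0, 201)]

# The mild optimizer only considers actions drawn from an "ordinary" base
# distribution (assumed here to be a clipped Gaussian around x = 1).
ordinary = [min(20.0, max(0.0, rng.gauss(1.0, 1.0))) for _ in range(1000)]

x_hard = hard_maximizer(everything)
x_mild = quantilizer(ordinary, 0.1, rng)

print("hard maximizer picks", x_hard, "with true value", true_value(x_hard))
print("mild optimizer picks", x_mild, "with true value", true_value(x_mild))
```

In this toy the proxy is identical in both cases; only the amount of optimization pressure differs, which is the distinction the comment is drawing.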

So then the largest remaining worry is that it will still gain power fast and correction processes will be slow enough that its somewhat misaligned values will be set in forever. But it isn’t obvious to me that by that point it isn’t sufficiently well aligned that we would recognize its future as a wondrous utopia, just not the very best wondrous utopia that we would have imagined if we had really carefully sat down and imagined utopias for thousands of years. This again seems like an empirical question of the scale of different effects, unless there is an argument that some effect will be totally overwhelming.

I think this argument mostly holds in the case of proxy alignment, but fails in the case of deceptive alignment. If a model is deceptively aligned, then I don't think there is any reason we should expect it to be only "somewhat misaligned"—once a mesa-optimizer becomes deceptive, there's no longer optimization pressure acting to keep its mesa-objective in line with the base, which means it could be totally off, not just slightly wrong. Additionally, a deceptively aligned mesa-optimizer might be able to do things like gradient hacking to significantly hinder our correction processes.

Also, I think it's worth pointing out that deception doesn't just happen during training: it's also possible for a non-deceptive proxy aligned mesa-optimizer to become deceptive during deployment, which could throw a huge wrench in your correction processes story. In particular, non-myopic proxy aligned mesa-optimizers "want to be deceptive" in the sense that, if presented with the strategy of deceptive alignment, they will choose to take it (this is a form of suboptimality alignment). This could be especially concerning in the presence of an adversary in the environment (a competitor AI, for example) that is choosing its output to cause other AIs to behave deceptively.

I wonder if Paul Christiano ever wrote down his take on this, because he seems to agree with Eliezer that using ML to directly learn and optimize for human values will be disastrous, and I'm guessing that his reasons/arguments would probably be especially relevant to people like Katja Grace, Joshua Achiam, and Dario Amodei.

I myself am somewhat fuzzy/confused/not entirely convinced about the "complex/fragile" argument and even wrote kind of a counter-argument a while ago. I think my current worries about value learning or specification have less to do with the "complex/fragile" argument and more to do with what might be called "ignorance of values" (to give it an equally pithy name), which is that humans just don't know what our real values are (especially when applied to unfamiliar situations that will come up in the future), so how can AI designers specify them or how can AIs learn them?

People try to get around this by talking about learning meta-preferences, e.g., preferences for how to deliberate about values, but that's not some "values" that we already have and the AI can just learn, but instead a big (and I think very hard) philosophical and social science/engineering project to try to figure out what kinds of deliberation would be better than other kinds or would be good enough to eventually lead to good outcomes. (ETA: See also this comment.)

It’s not obvious to me that imperfectly aligned AI is likely to be worse than the currently misaligned processes, and even that it won’t be a net boon for the side of alignment.

My own worry is less that "imperfectly aligned AI is likely to be worse than the currently misaligned processes" but more that the advent of AGI might be the last good chance for humanity to get alignment right (including addressing "human safety problem"), and if we don't do a good enough job (even if we improve on the current situation in some sense) we'll be largely stuck with the remaining misalignment because there won't be another opportunity like it. ETA: A good slogan for this might be "AI risk as the risk of missed opportunity".

This again seems like an empirical question of the scale of different effects, unless there is an argument that some effect will be totally overwhelming.

I'm not entirely sure I understand this sentence, but this post might be relevant here: https://www.lesswrong.com/posts/Qz6w4GYZpgeDp6ATB/beyond-astronomical-waste.

But secondly, I’m not sure about the fragility argument: that if there is basically any distance between your description and what is truly good, you will lose everything.
This seems to be a) based on a few examples of discrepancies between written-down values and real values where the written down values entirely exclude something, and b) assuming that there is a fast takeoff so that the relevant AI has its values forever, and takes over the world.

When I think of the fragility argument, I usually think in terms of Goodhart's Taxonomy. In particular, we might deal with--

• Extremal Goodhart -- Human values are already unusually well-satisfied relative to what is normal for this universe and pushing proxies of our values to the extremes might inadvertently move the universe away from that in some way we didn't consider
• Adversarial Goodhart -- The thing that matters which is absent from our proxy is absolutely critical for satisfying our values and requires the same kinds of resources that our proxy relies on

My impression is that our values are complex enough that they have a lot of distinct, absolutely critical pieces that are hard to pin down even if you try really hard. I mainly think this because I once tried imagining how to make an AGI that optimizes for 'fulfilling human requests' and realized that fulfill, human and request all had such complicated and fragile definitions that it would take me an extremely long time to pin down what I meant. And I wouldn't be confident in the result I made after pinning things down.

While I don't find this kind of argument fully convincing, I think it's more powerful than 'a) based on a few examples of discrepancies between written-down values and real values where the written down values entirely exclude something'.

That being said, I agree with b). I also lean toward the view that Slow Take-Off plus Machine-Learning may allow non-catastrophic "good enough" solutions to human value problems.

My guess is that values that are got using ML but still somewhat off from human values are much closer in terms of not destroying all value of the universe, than ones that a person tries to write down. Like, the kinds of errors people have used to illustrate this problem (forget to put in, ‘consciousness is good’) are like forgetting to say faces have nostrils in trying to specify what a face is like, whereas a modern ML system’s imperfect impression of a face seems more likely to meet my standards for ‘very facelike’ (most of the time).

I agree that Machine-Learning will probably give us better estimations of human-flourishing than trying to write-down our values themselves. However, I'm still very apprehensive about it unless we're also being very careful about slow take-off. The main reason for this apprehensiveness comes from Rohin Shah's sequence on Value Learning (particularly ambitious value-learning). My main take-away from this was: Learning human values from examples of humans is hard without writing down some extra assumptions about human values (which may leave something important out).

Here's a practical example of this: If you create an AI that learns human values from a lot of examples of humans, what do you think its stance will be on Person-Affecting Views? What will its stance be on value-lexicality responses to Torture vs. Dust-Specks? My impression is that you'll have to write down something to tell the AI how to decide these cases (when should we categorize human behaviors as irrational vs when should we not). And a lot of people may regard the ultimate decision as catastrophic.

There are other complications too. If the AI can interact with the world in ways that change human values and then updates to care about those changed values, strange things might happen. For instance, the AI might pressure humanity to adopt simpler, easier to learn values if it's agential. This might not be so bad but I suspect there are things the AI might do that could potentially be very bad.

So, because I'm not that confident in ML value-learning and because I'm not that confident in human values in general, I'm pretty skeptical of the idea that machine-learning will avert extreme risks associated with value misspecification.

Seconding Habryka.

In talking to many people about AI Alignment over the years, I've repeatedly found that a surprisingly large generator of disagreement about risk scenarios was disagreement about the fragility of human values.

I think this post should be reviewed for its excellent comment section at least as much as for the original post, and also think that this post is a pretty central example of the kind of post I would like to see more of.

For one thing, these dynamics are already in place: the world is full of agents and more basic optimizing processes that are not aligned with broad human values—most individuals to a small degree, some strange individuals to a large degree, corporations, competitions, the dynamics of political processes.

I don't think of this as evidence that unaligned AI is not dangerous. Arguably we're already seeing bad effects from unaligned AI, such as effects on public discourse as a result of newsfeed algorithms. Further, anything that limits the impact of unaligned action now seems largely the result of existing agents being of relatively low or similar power. Even the most powerful actors in the world right now can't effectively control much of the world (e.g. no government has figured out how to eliminate dissent, no military how to stop terrorists, etc.). I expect things to look quite different if we develop an actor that is more powerful than a majority of all other actors combined, even if it develops into that power slowly because the steps along the way to that seem individually worth the tradeoff.

But it isn’t obvious to me that by that point it isn’t sufficiently well aligned that we would recognize its future as a wondrous utopia, just not the very best wondrous utopia that we would have imagined if we had really carefully sat down and imagined utopias for thousands of years.

To our ancestors we would appear to live in a wondrous utopia (bountiful food, clean water, low disease, etc.), yet we still want to do better. I think there will be suffering so long as we are not at the global maximum and anyone realizes this.