johnswentworth

But exactly how complex and fragile?

This is , right? And then you might just constrain the subset of W which the agent can search over?

Exactly.

One toy model to conceptualize what a "compact criterion" might look like: imagine we take a second-order expansion of u around some u-maximal world-state $W^*$. Then, the eigendecomposition of the Hessian of u around $W^*$ tells us which directions-of-change in the world state u cares about a little or a lot. If the constraints lock the accessible world-states into the directions which u doesn't care about much (i.e. eigenvalues near 0), then any accessible world-state near $W^*$ compatible with the constraints will have near-maximal u. On the other hand, if the constraints allow variation in directions which u does care about a lot (i.e. eigenvalues large in magnitude), then u will be fragile to perturbations from u to u' which move the u'-optimal world-state along those directions.

That toy model has a very long list of problems with it, but I think it conveys roughly what kind of things are involved in modelling value fragility.
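
In symbols, the second-order toy model might be sketched as follows. The notation ($W^*$ for the u-maximal world-state, $\delta$ for a small displacement, $H$ for the Hessian, $\lambda_i, v_i$ for its eigenvalues and eigenvectors, $c_i$ for the components of $\delta$) is introduced here for illustration, and the sketch assumes u is smooth with an interior maximum, so the first-order term vanishes and all $\lambda_i \le 0$.

```latex
u(W^* + \delta) \;\approx\; u(W^*) + \tfrac{1}{2}\,\delta^\top H\,\delta,
\qquad
H = \nabla^2 u(W^*) = \sum_i \lambda_i\, v_i v_i^\top .

\text{With } \delta = \sum_i c_i v_i: \qquad
u(W^*) - u(W^* + \delta) \;\approx\; \tfrac{1}{2} \sum_i |\lambda_i|\, c_i^2 .
```

Under these assumptions, a displacement confined to directions with $|\lambda_i| \approx 0$ costs almost nothing, while any component $c_i$ along a direction with large $|\lambda_i|$ is penalized quadratically.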

But exactly how complex and fragile?

So one example would be, fix an EU maximizer. To compute value sensitivity, we consider the sensitivity of outcome value with respect to a range of feasible perturbations to the agent's utility function. The perturbations only affect the utility function, and so everything else is considered to be part of the dynamics of the situation. You might swap out the EU maximizer for a quantilizer, or change the broader society in which the agent is deployed, but these wouldn't classify as 'perturbations' in the original ontology.

Let me know if this is what you're saying:

- we have an agent which chooses X to maximize E[u(X)] (maybe with a do() operator in there)
- we perturb the utility function to u'(X)
- we then ask whether max E[u(X)] is approximately E[u(X')], where X' is the decision maximizing E[u'(X')]

... so basically it's a Goodhart model, where we have some proxy utility function and want to check whether the proxy achieves similar value to the original.

Then the value-fragility question asks: under which perturbation distributions are the two values approximately the same? Or, the distance function version: if we assume that u' is "close to" u, then under what distance functions does that imply the values are close together?
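
A minimal sketch of that check, assuming a finite set of decisions and treating `u[x]` as the already-computed expected utility of decision `x`. The Gaussian perturbation, the function names, and the sizes are all illustrative choices, not anything specified in the discussion.

```python
import random

def perturb(u, scale, rng):
    """Return a proxy u' = u + independent Gaussian noise of the given scale."""
    return {x: value + rng.gauss(0.0, scale) for x, value in u.items()}

def goodhart_gap(u, u_proxy):
    """Return (max_x u(x), u(x')), where x' is the decision maximizing the proxy."""
    best_true = max(u.values())
    x_proxy = max(u_proxy, key=u_proxy.get)
    return best_true, u[x_proxy]

if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical "true" expected utilities for 1000 candidate decisions.
    u = {x: rng.random() for x in range(1000)}
    for scale in (0.0, 0.01, 0.1, 1.0):
        best, achieved = goodhart_gap(u, perturb(u, scale, rng))
        print(f"noise scale {scale}: best u = {best:.3f}, achieved u = {achieved:.3f}")
```

Swapping `perturb` for a different perturbation distribution is the knob the value-fragility question is asking about: the comparison in `goodhart_gap` stays the same.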

Then your argument would be: the answer to that question depends on the dynamics, specifically on how X influences u. Is that right?

Assuming all that is what you're saying... I'm imagining another variable, which is roughly a world-state W. When we write utility as a function of X directly (i.e. u(X)), we're implicitly integrating over world states. Really, the utility function is u(W(X)): X influences the world-state, and then the utility is over (estimated) world-states. When I talk about "factoring out the dynamics", I mean that we think about the function u(W), ignoring X. The sensitivity question is then something like: under what perturbations is u'(W) a good approximation of u(W), and in particular when are maxima of u'(W) near-maximal for u(W), including when the maximization is subject to fairly general constraints. The maximization is no longer over X, but instead over world-states W directly - we're asking which world-states (compatible with the constraints) maximize each utility. (For specific scenarios, the constraints would encode the world-states reachable by the dynamics.) Ideally, we'd find some compact criterion for which perturbations preserve value under which constraints.
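
One way to write down the "factored" question, as a sketch: here $C$ (the set of world-states compatible with the constraints) and $\epsilon$ (the tolerance for "near-maximal") are symbols introduced for illustration.

```latex
W'_C \in \arg\max_{W \in C} u'(W),
\qquad
\text{value is preserved under } C \text{ if} \quad
u(W'_C) \;\ge\; \max_{W \in C} u(W) - \epsilon .
```

A compact criterion would then characterize the pairs $(u', C)$ for which this inequality holds, without reference to the decision variable X.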

(Meta: this was useful, I understand this better for having written it out.)

But exactly how complex and fragile?

In other words: "against which compact ways of generating perturbations is human value fragile?". But don't you still need to consider some dynamics for this question to be well-defined?

Not quite. If we frame the question as "which compact ways of generating perturbations", then that's implicitly talking about dynamics, since we're asking how the perturbations were generated. But if we know *what* perturbations are generated, then we can say whether human value is fragile against those perturbations, regardless of *how* they're generated. So, rather than framing the question as "which compact ways of generating perturbations", we frame it as "which sets of perturbations" or "densities of perturbations" or a distance function on perturbations.

Ideally, we come up with a compact *criterion* for when human values are fragile against such sets/densities/distance functions.

But exactly how complex and fragile?

I read through the first part of this review, and generally thought "yep, this is basically right, except it should factor out the distance metric explicitly rather than dragging in all this stuff about dynamics". I had completely forgotten that I said the same thing a year ago, so I was pretty amused when I reached the quote.

Anyway, I'll defend the distance metric thing a bit here.

But what exactly *happens* between "we write down something too distant from the 'truth'" and the result? The AI happens. But this part, the dynamics, it's kept invisible.

I claim that "keeping the dynamics invisible" is desirable here.

The reason that "fragility of human values" is a useful concept/hypothesis in the first place is that it cuts reality at the joints. What does that mean? Roughly speaking, it means that there's a broad class of different questions for which "are human values fragile?" is an interesting and useful subquestion, without needing a lot of additional context. We can factor out the "are human values fragile?" question, and send someone off to go think about that question, without a bunch of context about why exactly we want to answer the question. Conversely, because the answer isn't highly context-dependent, we can think about the question once and then re-use the answer when thinking about many different scenarios - e.g. foom or CAIS or multipolar takeoff or .... Fragility of human values is a gear in our models, and once we've made the investment to understand that gear, we can re-use it over and over again as the rest of the model varies.

Of course, that only works to the extent that fragility of human values actually doesn't depend on a bunch of extra context. Which it obviously does, as this review points out. Distance metrics allow us to "factor out" that context-dependence, to wrap it in a clean API.

Rather than asking "are human values fragile?", we ask "under what distance metric(s) are human values fragile?" - that's the new "API" of the value-fragility question. Then, when someone comes along with a specific scenario (like foom or CAIS or ...), we ask what distance metric is relevant to the dynamics of that scenario. For instance, in a foom scenario, the relevant distance metric is probably determined by the AI's ontology - i.e. what things the AI thinks are "similar". In a corporate-flavored multipolar takeoff scenario, the relevant distance metric might be driven by economic/game-theoretic considerations: outcomes with similar economic results (e.g. profitability of AI-run companies) will be "similar".

The point is that these distance metrics tell us what particular aspects/properties of each scenario are relevant to value fragility.

The Pointers Problem: Clarifications/Variations

Great post!

I especially like "try to maximize values according to models which, according to human beliefs, track the things we care about well". I ended up at a similar point when thinking about the problem. It seems like we ultimately *have* to use this approach, at some level, in order for all the type signatures to line up. (Though this doesn't rule out entirely different approaches at other levels, as long as we expect those approaches to track the things we care about well.)

On amplified values, I think there's a significant piece absent from the discussion here (possibly intentionally). It's not just about *precision* of values, it's about evaluating the value function at all.

Model/example: a Bayesian utility maximizer does not need to be able to evaluate its utility function, it only needs to be able to decide which of two options has *higher* utility. If e.g. the utility function is $\sum_i f(X_i)$, and a decision only affects $X_k$, then the agent doesn't need to evaluate the sum at all; it only needs to calculate $f(X_k)$ for each option. **This is especially relevant in a world where most actions don't affect most of the world** (or if they do, the effects are drowned out by noise) - which is exactly the sort of world we live in. Most of my actions do not affect a random person in Mumbai (and to the extent there is an effect, it's drowned out by noise). Even if I value the happiness of that random person in Mumbai, I never need to think about them, because my actions don't significantly impact them in any way I can predict.
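
A minimal sketch of that point, assuming an additive utility (a sum of `f` over the world's variables) and options that each list only the variables they change. The names `prefers`, `total_utility`, `world`, and the dict representation are hypothetical, chosen for illustration.

```python
def prefers(f, world, option_a, option_b):
    """True if option_a yields a higher sum of f over the world than option_b.

    Each option is a dict {variable: new_value} listing only what it changes,
    so only the touched variables are ever evaluated.
    """
    touched = set(option_a) | set(option_b)
    diff = sum(
        f(option_a.get(k, world[k])) - f(option_b.get(k, world[k]))
        for k in touched
    )
    return diff > 0

def total_utility(f, world, option):
    """Brute-force reference: evaluates every term of the sum."""
    return sum(f(option.get(k, v)) for k, v in world.items())
```

`prefers` costs time proportional to the size of the diff between the two options, while `total_utility` scales with the size of the whole world; both agree on which option is better whenever the utility really is additive in this way.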

As you say, the issue isn't just "we can't evaluate our values precisely". The issue is that we probably do not and cannot evaluate our values *at all*. We only ever evaluate comparisons, and only between actions with a relatively simple diff.

Applying this to amplification: amplification is not about evaluating our values more precisely, it's about comparing actions with more complicated diffs, or actions where more complicated information is relevant to the diff. The things you say in the post are still basically correct, but this gives a more accurate mental picture of what amplification needs to achieve.

Selection vs Control

The initial state of the program/physical computer may not overlap with the target space at all. The target space wouldn't be larger or smaller (in the sense of subsets); it would just be an entirely different set of states.

Flint's notion of optimization, as I understand it, requires that we can view the target space as a subset of the initial space.

Selection vs Control

Got it, that's the case I was thinking of as "redrawing the system boundary". Makes sense.

That still leaves the problem that we can write an (internal) optimizer which *isn't* iterative. For instance, a convex function optimizer which differentiates its input function and then algebraically solves for zero gradient. (In the real world, this is similar to what markets do.) This was also my main complaint about Flint's notion of "optimization": not all optimizers are iterative, and sometimes they don't even have an "initial" point against which we could compare.
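
As a small illustration of a non-iterative optimizer, here is a sketch restricted to one-dimensional convex quadratics given by their coefficients. The function name and the restriction to quadratics are assumptions made to keep the example short.

```python
def minimize_convex_quadratic(a, b, c):
    """Minimize f(x) = a*x**2 + b*x + c for a > 0 without any iteration.

    The derivative is f'(x) = 2*a*x + b; setting it to zero and solving
    algebraically gives the minimizer directly.
    """
    if a <= 0:
        raise ValueError("f is only strictly convex when a > 0")
    x_star = -b / (2 * a)
    return x_star, a * x_star**2 + b * x_star + c
```

There is no starting point and no sequence of improving states here: the answer comes straight out of the algebra, which is the case an "initial state versus target set" picture of optimization has trouble describing.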

Selection vs Control

I like this division a lot. One nitpick: I don't think internal optimization is a subset of external optimization, unless we're redrawing the system boundary at some point. A search always takes place within the context of a system's (possibly implicit) world-model; that's the main thing which distinguishes it from control/external optimization. If that world-model does not match the territory, then the system may not successfully optimize anything in its environment, even though it's searching for optimizing plans internally.

Why Neural Networks Generalise, and Why They Are (Kind of) Bayesian

That's a clever example, I like it.

Based on that description, it should be straightforward to generalize the Levin bound to neural networks. The main step would be to replace the Huffman code with a turbocode (or any other near-Shannon-bound code), at which point the compressibility is basically identical to the log probability density, and we can take the limit to continuous function space without any trouble. The main change is that entropy would become relative entropy (as is normal when taking info theory bounds to a continuous limit). Intuitively, it's just using the usual translation between probability theory and minimum description length, and applying it to the probability density of parameter space.
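
For reference, the "usual translation" can be sketched as below. This is the standard relation between code length and probability, not a derivation of the generalized bound itself; the symbols ($L$ for code length in bits, $\epsilon$ for the discretization width, $d$ for the parameter dimension, $p$ and $q$ for two densities over parameters $\theta$) are introduced here for illustration.

```latex
\text{Discrete: } \quad L(x) \approx -\log_2 P(x) \quad \text{(within one bit for a Shannon code).}

\text{Density discretized into cells of volume } \epsilon^d: \quad
L_\epsilon(\theta) \approx -\log_2 p(\theta) - d \log_2 \epsilon .

\text{The } \epsilon \text{ term cancels in differences:} \quad
\mathbb{E}_{\theta \sim q}\!\left[ L^{p}_\epsilon(\theta) - L^{q}_\epsilon(\theta) \right]
\approx \mathbb{E}_{\theta \sim q}\!\left[ \log_2 \frac{q(\theta)}{p(\theta)} \right]
= D_{\mathrm{KL}}(q \,\|\, p) .
```

The absolute code length diverges as $\epsilon \to 0$, but the difference between two codes stays finite, which is why entropy is replaced by relative entropy in the continuous limit.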

The material here is one seed of a worldview which I've updated toward a lot more over the past year. Some other posts which involve the theme include Science in a High Dimensional World, What is Abstraction?, Alignment by Default, and the companion post to this one, Book Review: Design Principles of Biological Circuits.

Two ideas unify all of these:

One major corollary of these two ideas is that goal-oriented systems will tend to evolve *similar* modular structures, reflecting the relevant parts of their environment. Systems to which this applies include organisms, machine learning algorithms, and the learning performed by the human brain. In particular, this suggests that biological systems and trained deep learning systems are likely to have modular, human-interpretable internal structure. (At least, interpretable by humans familiar with the environment in which the organism/ML system evolved.)

This post talks about some of the evidence behind this model: biological systems are indeed quite modular, and simulated evolution experiments find that circuits do indeed evolve modular structure reflecting the modular structure of environmental variations. The companion post reviews the rest of the book, which makes the case that the internals of biological systems are indeed quite interpretable.

On the deep learning side, researchers also find considerable modularity in trained neural nets, and direct examination of internal structures reveals plenty of human-recognizable features.

Going forward, this view is in need of a more formal and general model, ideally one which would let us empirically test key predictions - e.g. check the extent to which different systems learn similar features, or whether learned features in neural nets satisfy the expected abstraction conditions, as well as tell us how to look for environment-reflecting structures in evolved/trained systems.