
Suppose AI continues on its current trajectory: deep learning continues to get better as we throw more data and compute at it, researchers keep trying random architectures and using whatever seems to work well in practice. Do we end up with aligned AI “by default”?

I think there’s at least a plausible trajectory in which the answer is “yes”. Not very likely - I’d put it at ~10% chance - but plausible. In fact, there’s at least an argument to be made that alignment-by-default is more likely to work than many fancy alignment proposals, including IRL variants and HCH-family methods.

This post presents the rough models and arguments.

I’ll break it down into two main pieces:

• Will a sufficiently powerful unsupervised learner “learn human values”? What does that even mean?
• Will a supervised/reinforcement learner end up aligned to human values, given a bunch of data/feedback on what humans want?

Ultimately, we’ll consider a semi-supervised/transfer-learning style approach, where we first do some unsupervised learning and hopefully “learn human values” before starting the supervised/reinforcement part.

As background, I will assume you’ve read some of the core material about human values from the sequences, including Hidden Complexity of Wishes, Value is Fragile, and Thou Art Godshatter.

## Unsupervised: Pointing to Values

In this section, we’ll talk about why an unsupervised learner might not “learn human values”. Since an unsupervised learner is generally just optimized for predictive power, we’ll start by asking whether theoretical algorithms with best-possible predictive power (i.e. Bayesian updates on low-level physics models) “learn human values”, and what that even means. Then, we’ll circle back to more realistic algorithms.

Consider a low-level physical model of some humans - e.g. a model which simulates every molecule comprising the humans. Does this model “know human values”? In one sense, yes: the low-level model has everything there is to know about human values embedded within it, in exactly the same way that human values are embedded in physical humans. It has “learned human values”, in a sense sufficient to predict any real-world observations involving human values.

But it seems like there’s a sense in which such a model does not “know” human values. Specifically, although human values are embedded in the low-level model, the embedding itself is nontrivial. Even if we have the whole low-level model, we still need that embedding in order to “point to” human values specifically - e.g. to use them as an optimization target. Indeed, when we say “point to human values”, what we mean is basically “specify the embedding”. (Side note: treating human values as an optimization target is not the only use-case for “pointing to human values”, and we still need to point to human values even if we’re not explicitly optimizing for anything. But that’s a separate discussion, and imagining using values as an optimization target is useful to give a mental image of what we mean by “pointing”.)

In short: predictive power alone is not sufficient to define human values. The missing part is the embedding of values within the model. The hard part is pointing to the thing (i.e. specifying the values-embedding), not learning the thing (i.e. finding a model in which values are embedded).

Finally, here’s a different angle on the same argument which will probably have some of the philosophers up in arms: any model of the real world with sufficiently high general predictive power will have a model of human values embedded within it. After all, it has to predict the parts of the world in which human values are embedded in the first place - i.e. the parts of which humans are composed, the parts on which human values are implemented. So in principle, it doesn’t even matter what kind of model we use or how it’s represented; as long as the predictive power is good enough, values will be embedded in there, and the main problem will be finding the embedding.

## Unsupervised: Natural Abstractions

In this section, we’ll talk about how and why a large class of unsupervised methods might “learn the embedding” of human values, in a useful sense.

First, notice that basically everything from the previous section still holds if we replace the phrase “human values” with “trees”. A low-level physical model of a forest has everything there is to know about trees embedded within it, in exactly the same way that trees are embedded in the physical forest. However, while there are trees embedded in the low-level model, the embedding itself is nontrivial. Predictive power alone is not sufficient to define trees; the missing part is the embedding of trees within the model.

More generally, whenever we have some high-level abstract object (i.e. higher-level than quantum fields), like trees or human values, a low-level model might have the object embedded within it but not “know” the embedding.

Now for the interesting part: empirically, we have whole classes of neural networks in which concepts like “tree” have simple, identifiable embeddings. These are unsupervised systems, trained for predictive power, yet they apparently “learn the tree-embedding” in the sense that the embedding is simple: it’s just the activation of a particular neuron, a particular channel, or a specific direction in the activation-space of a few neurons.
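To make “a specific direction in activation-space” concrete, here is a minimal sketch on synthetic data. Everything in it is an assumption for illustration: the “activations” are random vectors rather than the output of any real network, `true_direction` is a hypothetical planted concept direction, and the difference-of-means probe is just one simple way to look for such a direction.

```python
import numpy as np

def make_activations(n, dim, concept_direction, rng):
    """Synthetic 'activations': Gaussian noise, shifted along a planted direction when the concept is present."""
    labels = rng.integers(0, 2, size=n)  # 1 = "tree present" (hypothetical label)
    noise = rng.normal(size=(n, dim))
    acts = noise + 3.0 * labels[:, None] * concept_direction
    return acts, labels

def difference_of_means_probe(acts, labels):
    """Estimate the concept direction as the normalized difference of the two class means."""
    direction = acts[labels == 1].mean(axis=0) - acts[labels == 0].mean(axis=0)
    return direction / np.linalg.norm(direction)

rng = np.random.default_rng(0)
dim = 64
true_direction = rng.normal(size=dim)
true_direction /= np.linalg.norm(true_direction)

acts, labels = make_activations(2000, dim, true_direction, rng)
probe = difference_of_means_probe(acts, labels)

# How well does the recovered direction match the planted one?
cosine = float(probe @ true_direction)

# Classify by projecting onto the probe and thresholding halfway between the class means.
proj = acts @ probe
threshold = 0.5 * (proj[labels == 1].mean() + proj[labels == 0].mean())
accuracy = float(((proj > threshold).astype(int) == labels).mean())
print(f"cosine similarity: {cosine:.3f}, accuracy: {accuracy:.3f}")
```

In this toy setting a single direction is enough to read off the concept. Whether real networks represent “tree” (let alone human values) this cleanly is exactly the empirical question at stake.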

What’s going on here? We know that models optimized for predictive power will not have trivial tree-embeddings in general; low-level physics simulations demonstrate that much. Yet these neural networks do end up with trivial tree-embeddings, so presumably some special properties of the systems make this happen. But those properties can’t be that special, because we see the same thing for a reasonable variety of different architectures, datasets, etc.

Here’s what I think is happening: “tree” is a natural abstraction. More on what that means here, but briefly: abstractions summarize information which is relevant far away. When we summarize a bunch of atoms as “a tree”, we’re throwing away lots of information about the exact positions of molecules/cells within the tree, or about the pattern of bark on the tree’s surface. But information like the exact positions of molecules within the tree is irrelevant to things far away - that signal is all wiped out by the noise of air molecules between the tree and the observer. The flap of a butterfly’s wings may alter the trajectory of a hurricane, but unless we know how all wings of all butterflies are flapping, that tiny signal is wiped out by noise for purposes of our own predictions. Most information is irrelevant to things far away, not in the sense that there’s no causal connection, but in the sense that the signal is wiped out by noise in other unobserved variables.

If a concept is a natural abstraction, that means that the concept summarizes all the information which is relevant to anything far away, and isn’t too sensitive to the exact notion of “far away” involved. That’s what I think is going on with “tree”.
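Here is a toy numerical illustration of “most information is wiped out by noise”. The model is an assumption chosen for simplicity, not a claim about how real trees or real networks work: a ring of low-level variables is repeatedly averaged with its neighbors while fresh noise is injected, and “far away” means “many steps later”. The names `summary` and `detail` are hypothetical labels for the ring's initial mean and for one variable's deviation from that mean.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_vars, n_steps, noise_std = 5000, 20, 200, 0.01

# Low-level state: independent "details" for each sample.
x0 = rng.normal(size=(n_samples, n_vars))

x = x0.copy()
for _ in range(n_steps):
    # Each variable becomes the average of itself and its two ring neighbors...
    x = (np.roll(x, 1, axis=1) + x + np.roll(x, -1, axis=1)) / 3.0
    # ...plus fresh noise standing in for unobserved variables along the way.
    x += noise_std * rng.normal(size=x.shape)

far_away = x[:, 0]            # a single variable observed many steps later
summary = x0.mean(axis=1)     # coarse summary of the initial state
detail = x0[:, 0] - summary   # one low-level detail, with the summary removed

corr_summary = float(np.corrcoef(far_away, summary)[0, 1])
corr_detail = float(np.corrcoef(far_away, detail)[0, 1])
print(f"corr with summary: {corr_summary:.3f}, corr with detail: {corr_detail:.3f}")
```

The far-away variable stays strongly correlated with the summary and essentially uncorrelated with the individual detail, even though the detail is causally upstream of it.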

Getting back to neural networks: it’s easy to see why a broad range of architectures would end up “using” natural abstractions internally. Because the abstraction summarizes information which is relevant far away, it allows the system to make far-away predictions without passing around massive amounts of information all the time. In a low-level physics model, we don’t need abstractions because we do pass around massive amounts of information all the time, but real systems won’t have anywhere near that capacity any time soon. So for the foreseeable future, we should expect to see real systems with strong predictive power using natural abstractions internally.

With all that in mind, it’s time to drop the tree-metaphor and come back to human values. Are human values a natural abstraction?

If you’ve read Value is Fragile or Godshatter, then there’s probably a knee-jerk reaction to say “no”. Human values are basically a bunch of randomly-generated heuristics which proved useful for genetic fitness; why would they be a “natural” abstraction? But remember, the same can be said of trees. Trees are a complicated pile of organic spaghetti code, but “tree” is still a natural abstraction, because the concept summarizes all the information from that organic spaghetti pile which is relevant to things far away. In particular, it summarizes anything about one tree which is relevant to far-away trees.

Similarly, the concept of “human” summarizes all the information about one human which is relevant to far-away humans. It’s a natural abstraction.

Now, I don’t think “human values” are a natural abstraction in exactly the same way as “tree” - specifically, trees are abstract objects, whereas human values are properties of certain abstract objects (namely humans). That said, I think it’s pretty obvious that “human” is a natural abstraction in the same way as “tree”, and I expect that humans “have values” in roughly the same way that trees “have branching patterns”. Specifically, the natural abstraction contains a bunch of information, that information approximately factors into subcomponents (including “branching pattern”), and “human values” is one of those information-subcomponents for humans.

I wouldn’t put super-high confidence on all of this, but given the remarkable track record of hackish systems learning natural abstractions in practice, I’d give maybe a 70% chance that a broad class of systems (including neural networks) trained for predictive power end up with a simple embedding of human values. A plurality of my uncertainty is on how to think about properties of natural abstractions. A significant chunk of uncertainty is also on the possibility that natural abstraction is the wrong way to think about the topic altogether, although in that case I’d still assign a reasonable chance that neural networks end up with simple embeddings of human values - after all, no matter how we frame it, they definitely have trivial embeddings of many other complicated high-level objects.

## Aside: Microscope AI

Microscope AI is about studying the structure of trained neural networks, and trying to directly understand their learned internal algorithms, models and concepts. In light of the previous section, there’s an obvious path to alignment where there turn out to be a few neurons (or at least some simple embedding) which correspond to human values, we use the tools of microscope AI to find that embedding, and just like that the alignment problem is basically solved.

Of course it’s unlikely to be that simple in practice, even assuming a simple embedding of human values. I don’t expect the embedding to be quite as simple as one neuron activation, and it might not be easy to recognize even if it were. Part of the problem is that we don’t even know the type signature of the thing we’re looking for - in other words, there are unanswered fundamental conceptual questions here, which make me less-than-confident that we’d be able to recognize the embedding even if it were right under our noses.

That said, this still seems like a reasonably-plausible outcome, and it’s an approach which is particularly well-suited to benefit from marginal theoretical progress.

One thing to keep in mind: this is still only about aligning one AI; success doesn’t necessarily mean a future in which more advanced AIs remain aligned. More on that later.

## Supervised/Reinforcement: Proxy Problems

Suppose we collect some kind of data on what humans want, and train a system on that. The exact data and type of learning doesn’t really matter here; the relevant point is that any data-collection process is always, no matter what, at best a proxy for actual human values. That’s a problem, because Goodhart’s Law plus Hidden Complexity of Wishes. You’ve probably heard this a hundred times already, so I won’t belabor it.

Here’s the interesting possibility: assume the data is crap. It’s so noisy that, even though the data-collection process is just a proxy for real values, the data is consistent with real human values. Visually:

At first glance, this isn’t much of an improvement. Sure, the data is consistent with human values, but it’s consistent with a bunch of other possibilities too - including the real data-collection process (which is exactly the proxy we wanted to avoid in the first place).

But now suppose we do some transfer learning. We start with a trained unsupervised learner, which already has a simple embedding of human values (we hope). We give our supervised learner access to that system during training. Because the unsupervised learner has a simple embedding of human values, the supervised learner can easily score well by directly using that embedded human values model. So, we cross our fingers and hope the supervised learner just directly uses that embedded human values model, and the data is noisy enough that it never “figures out” that it can get better performance by directly modelling the data-collection process instead.

In other words: the system uses an actual model of human values as a proxy for our proxy of human values.

This requires hitting a window - our data needs to be good enough that the system can tell it should use human values as a proxy, but bad enough that the system can’t figure out the specifics of the data-collection process enough to model it directly. This window may not even exist.

(Side note: we can easily adjust this whole story to a situation where we’re training for some task other than “satisfy human values”. In that case, the system would use the actual model of human values to model the Hidden Complexity of whatever task it’s training on.)

Of course in practice, the vast majority of the things people use as objectives for training AI probably wouldn’t work at all. I expect that they usually look like this:

In other words, most objectives are so bad that even a little bit of data is enough to distinguish the proxy from real human values. But if we assume that there’s some try-it-and-see going on, i.e. people try training on various objectives and keep the AIs which seem to do roughly what the humans want, then it’s maybe plausible that we end up iterating our way to training objectives which “work”. That’s assuming things don’t go irreversibly wrong before then - including not just hostile takeover, but even just development of deceptive behavior, since this scenario does not have any built-in mechanism to detect deception.

Overall, I’d give maybe a 10-20% chance of alignment by this path, assuming that the unsupervised system does end up with a simple embedding of human values. The main failure mode I’d expect, assuming we get the chance to iterate, is deception - not necessarily “intentional” deception, just the system being optimized to look like it’s working the way we want rather than actually working the way we want. It’s the proxy problem again, but this time at the level of humans-trying-things-and-seeing-if-they-work, rather than explicit training objectives.

## Alignment in the Long Run

So far, we’ve only talked about one AI ending up aligned, or a handful ending up aligned at one particular time. However, that isn’t really the ultimate goal of AI alignment research. What we really want is for AI to remain aligned in the long run, as we (and AIs themselves) continue to build new and more powerful systems and/or scale up existing systems over time.

I know of two main ways to go from aligning one AI to long-term alignment:

• Make the alignment method/theory very reliable and robust to scale, so we can continue to use it over time as AI advances.
• Align one roughly-human-level-or-smarter AI, then use that AI to come up with better alignment methods/theories.

The alignment-by-default path relies on the latter. Even assuming alignment happens by default, it is unlikely to be highly reliable or robust to scale.

That’s scary. We’d be trusting the AI to align future AIs, without having any sure-fire way to know that the AI is itself aligned. (If we did have a sure-fire way to tell, then that would itself be most of a solution to the alignment problem.)

That said, there’s a bright side: when alignment-by-default works, it’s a best-case scenario. The AI has a basically-correct model of human values, and is pursuing those values. Contrast this to things like IRL variants, which at best learn a utility function which approximates human values (which are probably not themselves a utility function). Or the HCH family of methods, which at best mimic a human with a massive hierarchical bureaucracy at their command, and certainly won’t be any more aligned than that human+bureaucracy would be.

To the extent that alignment of the successor system is limited by alignment of the parent system, that makes alignment-by-default potentially a more promising prospect than IRL or HCH. In particular, it seems plausible that imperfect alignment gets amplified into worse-and-worse alignment as systems design their successors. For instance, a system which tries to look like it’s doing what humans want rather than actually doing what humans want will design a successor which has even better human-deception capabilities. That sort of problem makes “perfect” alignment - i.e. an AI actually pointed at a basically-correct model of human values - qualitatively safer than a system which only manages to be not-instantly-disastrous.

(Side note: this isn’t the only reason why “basically perfect” alignment matters, but I do think it’s the most relevant such argument for one-time alignment/short-term methods, especially on not-very-superhuman AI.)

In short: when alignment-by-default works, we can use the system to design a successor without worrying about amplification of alignment errors. However, we wouldn’t be able to tell for sure whether alignment-by-default had worked or not, and it’s still possible that the AI would make plain old mistakes in designing its successor.

## Conclusion

Let’s recap the bold points:

• A low-level model of some humans has everything there is to know about human values embedded within it, in exactly the same way that human values are embedded in physical humans. The embedding, however, is nontrivial. Thus...
• Predictive power alone is not sufficient to define human values. The missing part is the embedding of values within the model. However…
• This also applies if we replace the phrase “human values” with “trees”. Yet we have a whole class of neural networks in which a simple embedding lights up in response to trees. Why?
• Trees are a natural abstraction, and we should expect to see real systems trained for predictive power use natural abstractions internally.
• Human values are a little different from trees (they’re a property of an abstract object rather than an abstract object themselves), but I still expect that a broad class of systems trained for predictive power will end up with simple embeddings of human values (~70% chance).
• Because the unsupervised learner has a simple embedding of human values, a supervised/reinforcement learner can easily score well on values-proxy-tasks by directly using that model of human values. In other words, the system uses an actual model of human values as a proxy for our proxy of human values (~10-20% chance).
• When alignment-by-default works, it’s basically a best-case scenario, so we can safely use the system to design a successor without worrying about amplification of alignment errors (among other things).

Overall, I only give this whole path ~10% chance of working in the short term, and maybe half that in the long term. However, if amplification of alignment errors turns out to be a major limiting factor for long-term alignment, then alignment-by-default is plausibly more likely to work than approaches in the IRL or HCH families.
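A back-of-the-envelope check on how these numbers fit together, assuming the two stage estimates from the recap can simply be multiplied (a simplification that ignores any other routes to success or failure):

$$P(\text{works, short term}) \approx \underbrace{0.7}_{\text{simple embedding}} \times \underbrace{[0.10,\ 0.20]}_{\text{learner uses it}} = [0.07,\ 0.14]$$

That range brackets the ~10% figure, and “maybe half that” puts the long-term estimate at roughly 5%.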

The limiting factor here is mainly identifying the (probably simple) embedding of human values within a learned model, so microscope AI and general theory development are both good ways to improve the outlook. Also, in the event that we are able to identify a simple embedding of human values in a learned model, it would be useful to have a way to translate that embedding into new systems, in order to align successors.

## Comments

I think what you've identified here is a weakness in the high-level, classic arguments for AI risk -

Overall, I’d give maybe a 10-20% chance of alignment by this path, assuming that the unsupervised system does end up with a simple embedding of human values. The main failure mode I’d expect, assuming we get the chance to iterate, is deception - not necessarily “intentional” deception, just the system being optimized to look like it’s working the way we want rather than actually working the way we want. It’s the proxy problem again, but this time at the level of humans-trying-things-and-seeing-if-they-work, rather than explicit training objectives.

This failure mode of deceptive alignment seems like it would result most easily from Mesa-optimisation or an inner alignment failure. Inner Alignment / Misalignment is possibly the key specific mechanism which fills a weakness in the 'classic arguments' for AI safety - the Orthogonality Thesis, Instrumental Convergence and Fast Progress together implying small separations between AI alignment and AI capability can lead to catastrophic outcomes. The question of why there would be such a damaging, hard-to-detect divergence between goals and alignment needs an answer to have a solid, specific reason to expect dangerous misalignment, and Inner Misalignment is just such a reason.

I think that it should be presented in initial introductions to AI risk alongside those classic arguments, as the specific, technical reason why the specific techniques we use are likely to produce such goal/capability divergence - rather than the general a priori reasons given by the classic arguments.

Personally, I think a more likely failure mode is just "you get what you measure", as in Paul's write up here. If we only know how to measure certain things which are not really the things we want, then we'll be selecting for not-what-we-want by default. But I know at least some smart people who think that inner alignment is the more likely problem, so you're in good company.

Am surprised you think that’s the main failure mode. I am fairly more concerned about failure through mesa optimisers taking a treacherous turn.

I’m thinking we will be more likely to find sensible solutions to outer alignment, but have not much real clue about the internals, and then we’ll give them enough optimisation power to build super intelligent unaligned mesa optimisers, and then with one treacherous turn the game will be up.

Why do you think inner alignment will be easier?

Two arguments here. First, an outside-view argument: inner alignment problems should only crop up on a relatively narrow range of architectures/parameters. Second, an entirely separate inside-view argument: assuming that natural abstractions are a thing makes inner alignment failure look much less likely.

Narrow range argument: inner alignment failure only applies to a specific range of architectures within a specific range of task parameters - for instance, we have to be optimizing for something, and there have to be lots of relevant variables observed only at runtime, and there has to be something like a "training" phase in which we lock in parameter choices before runtime, and for the more disastrous versions we usually need divergence of the runtime distribution from the training distribution. It's a failure mode which assumes that a whole lot of things look like today's ML pipelines.

On the other hand, the get-what-you-measure problem and its generalizations apply to any architecture, including tool AI, idealized Bayesian utility maximizers (i.e. the infinite data/compute regime), and (less obviously) human-mimicking systems.

Natural abstractions argument: in an inner alignment failure, the outer optimizer is optimizing for X, but the inner optimizer ends up pointed at some rough approximation ~X. But if X is a natural abstraction, then this is far less likely to be a problem; we expect a wide range of predictive systems to all learn a basically-correct notion of X, so there's little reason for an inner optimizer to end up pointed at a rough approximation, especially if we're leveraging transfer learning from some unsupervised learner.

(It's worth asking here why this argument doesn't apply to the divergence of human goals from evolutionary fitness. A human only has ~30k genes, and each one has a fairly simple function - e.g. catalyze one chemical reaction or stabilize a structure or the like. That's nowhere near enough to represent something like evolutionary fitness in the genome, especially when the large majority of those genes are already used for metabolism and body plan and whatnot. Modern ML, on the other hand, already operates in a range where insufficient degrees of freedom are far less likely to be a problem. Also, I'm currently unsure whether evolutionary fitness is a natural abstraction at all.)

In general, if human values are a natural abstraction, then pointing to values is much harder than "learning" values. That means outer alignment is the problem more than inner alignment.

Natural abstractions argument: in an inner alignment failure, the outer optimizer is optimizing for X, but the inner optimizer ends up pointed at some rough approximation ~X. But if X is a natural abstraction, then this is far less likely to be a problem; we expect a wide range of predictive systems to all learn a basically-correct notion of X, so there's little reason for an inner optimizer to end up pointed at a rough approximation, especially if we're leveraging transfer learning from some unsupervised learner.

This isn't an argument against deceptive alignment, just proxy alignment—with deceptive alignment, the agent still learns X, it just does so as part of its world model rather than its objective. In fact, I think it's an argument for deceptive alignment, since if X first crops up as a natural abstraction inside of your agent's world model, that raises the question of how exactly it will get used in the agent's objective function—and deceptive alignment is arguably one of the simplest, most natural ways for the base optimizer to get an agent that has information about the base objective stored in its world model to actually start optimizing for that model of the base objective.

I mostly agree with this. I don't view deception as an inner alignment problem, though - for instance, it's an issue in any approval-based setup even without an inner optimizer showing up. To the extent that it is an inner alignment issue, it involves generalization failure from the training distribution, which I also generally consider an outer alignment problem (i.e. training on a distribution which differs from the deploy environment generally means the system is not outer aligned, unless the architecture is somehow set up to make the distribution shift irrelevant).

A useful criterion here: would the problem still happen if we just optimized over all the parameters simultaneously at runtime, rather than training offline first? If the problem would still happen, then it's not really an inner alignment problem (at least not in the usual mesa-optimization sense).

To the extent that it is an inner alignment issue, it involves generalization failure from the training distribution, which I also generally consider an outer alignment problem (i.e. training on a distribution which differs from the deploy environment generally means the system is not outer aligned, unless the architecture is somehow set up to make the distribution shift irrelevant).

A useful criterion here: would the problem still happen if we just optimized over all the parameters simultaneously at runtime, rather than training offline first? If the problem would still happen, then it's not really an inner alignment problem (at least not in the usual mesa-optimization sense).

That's certainly not how I would define inner alignment. In “Risks from Learned Optimization,” we just define it as the problem of aligning the mesa-objective (if one exists) with the base objective, which is entirely independent of whether or not there's any sort of distinction between the training and deployment distributions and is fully consistent with something like online learning as you're describing it.

The way I understood it, the main reason a mesa-optimizer shows up in the first place is that some information is available at runtime which is not available during training, so some processing needs to be done at runtime to figure out the best action given the runtime-info. The mesa-optimizer handles that processing. If we directly optimize over all parameters at runtime, then there's no place for that to happen.

What am I missing?

Let's consider the following online learning setup:

At each timestep $t$, the policy $\pi_{\theta_t}$ takes action $a_t$ and receives reward $r_t$. Then, we perform the simple policy gradient update

$$\theta_{t+1} = \theta_t + \alpha \, r_t \, \nabla_\theta \log \pi_{\theta_t}(a_t)$$

where $\alpha$ is the learning rate.

Now, we can ask the question, would $\pi_\theta$ be a mesa-optimizer? The first thing that's worth noting is that the above setup is precisely the standard RL training setup; the only difference is that there's no deployment stage. What that means, though, is that if standard RL training produces a mesa-optimizer, then this will produce a mesa-optimizer too, because the training process isn't different in any way whatsoever. If $\pi_\theta$ is acting in a diverse environment that requires search to be able to be solved effectively, then $\pi_\theta$ will still need to learn to do search. The fact that there won't ever be a deployment stage in the future is irrelevant to $\pi_\theta$'s current training dynamics (unless $\pi_\theta$ is deceptive and knows there won't be a deployment stage; that's the only situation where it might be relevant).

Given that, we can ask the question of whether $\pi_\theta$, if it's a mesa-optimizer, is likely to be misaligned, and in particular whether it's likely to be deceptive. Again, in terms of proxy alignment, the training process is exactly the same, so the picture isn't any different at all: if there are simpler, easier-to-optimize-for proxies, then $\pi_\theta$ is likely to learn those instead of the true base objective. Like I mentioned previously, however, deceptive alignment is the one case where it might matter that you're doing online learning, since if the model knows that, it might do different things based on that fact. However, there are still lots of reasons why a model might be deceptive even in an online learning setup; for example, it might expect better opportunities for defection in the future, and thus want to prevent being modified now so that it can defect when it'll be most impactful.
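For concreteness, the online setup above can be sketched on a toy problem. The two-armed bandit, the softmax policy, the learning rate, and all names below are assumptions chosen for illustration, not part of the setup being discussed. The only point is that the policy gradient step runs immediately after every action, with no separate deployment phase.

```python
import numpy as np

def softmax(logits):
    """Convert logits to action probabilities."""
    z = logits - logits.max()  # subtract the max for numerical stability
    p = np.exp(z)
    return p / p.sum()

def run_online_reinforce(n_steps=5000, lr=0.1, seed=0):
    rng = np.random.default_rng(seed)
    arm_means = np.array([0.2, 0.8])  # hypothetical expected reward of each action
    theta = np.zeros(2)               # policy parameters (softmax logits)
    for _ in range(n_steps):
        probs = softmax(theta)
        action = rng.choice(2, p=probs)
        reward = rng.normal(arm_means[action], 0.1)
        # Gradient of log pi_theta(action) for a softmax policy: one_hot(action) - probs.
        grad_log_prob = -probs
        grad_log_prob[action] += 1.0
        # REINFORCE step, applied right after each action (training never stops).
        theta = theta + lr * reward * grad_log_prob
    return softmax(theta)

final_probs = run_online_reinforce()
print(final_probs)
```

A bandit this small has no room for anything like learned search, so it says nothing about whether a mesa-optimizer would appear; it only pins down what "update after every timestep" means mechanically.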

When I say "optimize all the parameters at runtime", I do not mean "take one gradient step in between each timestep". I mean, at each timestep, fully optimize all of the parameters. Optimize the parameters $\theta$ all the way to convergence before every single action.

Think back to the central picture of mesa-optimization (at least as I understand it). The mesa-optimizer shows up because some data is only available at runtime, not during training, so it has to be processed at runtime using parameters selected during training. In the online RL setup you sketch here, "runtime" for mesa-optimization purposes is every time the system chooses its action - i.e. every timestep - and "training" is all the previous timesteps. A mesa-optimizer should show up if, at every timestep, some relevant new data comes in and the system has to process that data in order to choose the optimal action, using parameters inherited from previous timesteps.

Now, suppose we fully optimize all of the parameters at every timestep. The objective function for this optimization would presumably be $\sum_t r_t \log \pi_\theta(a_t)$, with the sum taken over all previous data points, since that's what the RL setup is approximating.

This optimization would probably still "find" the same mesa-optimizer as before, but now it looks less like a mesa-optimizer problem and more like an outer alignment problem: that objective function is probably not actually the thing we want. The fact that the true optimum for that objective function probably has our former "mesa-optimizer" embedded in it is a pretty strong signal that that objective function itself is not outer aligned; the true optimum of that objective function is not really the thing we want.
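A toy sketch of the difference between "one gradient step per timestep" and "fully optimize at every timestep", in Python. Everything here is an illustrative assumption, not the RL setup under discussion: a single scalar parameter `theta`, a squared-error objective summed over the data seen so far, and hypothetical helper names. `gradient_step` nudges the inherited parameter once, while `refit_to_convergence` solves for the exact optimum of the summed objective, so the inherited value plays no role.

```python
def summed_loss(theta, history):
    """Squared-error objective summed over every data point seen so far."""
    return sum((theta * x - y) ** 2 for x, y in history)


def gradient_step(theta, history, lr=0.01):
    """One gradient step: the result still depends on the inherited theta."""
    grad = sum(2 * (theta * x - y) * x for x, y in history)
    return theta - lr * grad


def refit_to_convergence(history):
    """Exact minimizer of summed_loss; no inherited parameter is used."""
    sxx = sum(x * x for x, _ in history)
    sxy = sum(x * y for x, y in history)
    return sxy / sxx


# Hypothetical data stream; y is roughly 3 * x.
stream = [(1.0, 3.1), (2.0, 5.9), (3.0, 9.2), (4.0, 11.8)]

history = []
theta_online = 0.0  # parameter carried from one timestep to the next
for x, y in stream:
    history.append((x, y))
    theta_online = gradient_step(theta_online, history)
    theta_full = refit_to_convergence(history)  # recomputed from scratch
    print(f"t={len(history)} one-step={theta_online:.3f} full={theta_full:.3f}")
```

In this toy, the fully optimized value is determined entirely by the objective and the data, which is why any problem with it is a problem with the objective rather than with inherited parameters.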

Does that make sense?

The RL process is actually optimizing , the log just comes from the REINFORCE trick. Regardless, I'm not sure I understand what you mean by optimizing fully to convergence at each timestep; convergence is a limiting property, so I don't know what it could mean to do it for a single timestep. Perhaps you mean just taking the optimal policy such that In that case, that is in fact the definition of outer alignment I've given in the past, so I agree that whether is aligned or not is an outer alignment question.

Sure,  works for what I'm saying, assuming that sum-over-time only includes the timesteps taken thus far. In that case, I'm saying that either:

• the mesa optimizer doesn't appear in , in which case the problem is fixed by fully optimizing everything at every timestep (i.e. by using ), or
• the mesa optimizer does appear in , in which case the problem was really an outer alignment issue all along.

Thank you for being so clear.

On 2, I’m surprised if you think that natural selection isn’t a natural abstraction but that eudaemonia is. (If we’re getting an AGI singleton that wants to fully learn our values.)

Secondly I’ll say that if we do not understand its representation of X or X-prime, and if a small difference will be catastrophic, then that will also lead to doom.

On 1: I think that’s quite plausible? Like, I assign something in the range of 20-60% probability to that. How much does it have to change for you to feel much safer about inner alignment?

(I’m also not that clear it only applies to this situation. Perhaps I’m mistaken, but in my head subsystem alignment and robust delegation both have this property of “build a second optimiser that helps achieve your goals” and in both cases passing on the true utility function seems very hard.)

On 2, I’m surprised if you think that natural selection isn’t a natural abstraction but that eudaemonia is.

Currently, my first-pass check for "is this probably a natural abstraction?" is "can humans usually figure out what I'm talking about from a few examples, without a formal definition?". For human values, the answer seems like an obvious "yes". For evolutionary fitness... nonobvious. Humans usually get it wrong without the formal definition.

Also, natural abstractions in general involve summarizing the information from one chunk of the universe which is relevant "far away". For human values, the relevant chunk of the universe is the human - i.e. the information about human values is all embedded in the physical human. But for evolutionary fitness, that's not the case - an organism does not contain all the information relevant to calculating its evolutionary fitness. So it seems like there's some qualitative difference there - like, human values "live" in humans, but fitness doesn't "live" in organisms in the same way. I still don't feel like I fully understand this, though.

On 1: I think that’s quite plausible? Like, I assign something in the range of 20-60% probability to that.

Sure, inner alignment is a problem which mainly applies to architectures similar to modern ML, and modern ML architecture seems like the most-likely route to AGI.

It still feels like outer alignment is a much harder problem, though. The very fact that inner alignment failure is so specific to certain architectures is evidence that it should be tractable. For instance, we can avoid most inner alignment problems by just optimizing all the parameters simultaneously at run-time. That solution would be too expensive in practice, but the point is that inner alignment is hard in a "we need to find more efficient algorithms" sort of way, not a "we're missing core concepts and don't even know how to solve this in principle" sort of way. (At least for mesa-optimization; I agree that there are more general subsystem alignment/robust delegation issues which are potentially conceptually harder.)

Outer alignment, on the other hand, we don't even know how to solve in principle, on any architecture whatsoever, even with arbitrary amounts of compute and data. That's why I expect it to be a bottleneck.

Currently, my first-pass check for "is this probably a natural abstraction?" is "can humans usually figure out what I'm talking about from a few examples, without a formal definition?". For human values, the answer seems like an obvious "yes". For evolutionary fitness... nonobvious. Humans usually get it wrong without the formal definition.

Hmm, presumably you're not including something like "internal consistency" in the definition of 'natural abstraction'. That is, humans who aren't thinking carefully about something will think there's an imaginable object even if any attempts to actually construct that object will definitely lead to failure. (For example, Arrow's Impossibility Theorem comes to mind; a voting rule that satisfies all of those desiderata feels like a 'natural abstraction' in the relevant sense, even though there aren't actually any members of that abstraction.)

Oh this is fascinating. This is basically correct; a high-level model space can include models which do not correspond to any possible low-level model.

One caveat: any high-level data or observations will be consistent with the true low-level model. So while there may be natural abstract objects which can't exist, and we can talk about those objects, we shouldn't see data supporting their existence - e.g. we shouldn't see a real-world voting system behaving like it satisfies all of Arrow's desiderata.

Regarding your first pass check for naturalness being whether humans can understand it: strike me thoroughly puzzled. Isn't one of the core points of the reductionism sequence that, while "thor caused the thunder" sounds simpler to a human than Maxwell's equations (because the words fit naturally into a human psychology), one of them is much "simpler" in an absolute sense than the other (and is in fact true).

Regarding your point about the human values living in humans while the organism's fitness is living partly in the environment, nothing immediately comes to mind to say here, but I agree it's a very interesting question.

The things you say about inner/outer alignment hold together quite sensibly. I am surprised to hear you say that mesa optimisers can be avoided by just optimizing all the parameters simultaneously at run-time. That doesn't match my understanding of mesa optimisation, I thought the mesa optimisers would definitely arise during the training, but if you're right that it's trivial-but-expensive to remove them there then I agree it's intuitively a much easier problem than I had realised.

Regarding your first pass check for naturalness being whether humans can understand it: strike me thoroughly puzzled. Isn't one of the core points of the reductionism sequence that, while "thor caused the thunder" sounds simpler to a human than Maxwell's equations (because the words fit naturally into a human psychology), one of them is much "simpler" in an absolute sense than the other (and is in fact true).

Despite humans giving really dumb verbal explanations (like "Thor caused the thunder"), we tend to be pretty decent at actually predicting things in practice.

The same applies to natural abstractions. If I ask people "is 'tree' a natural category?" then they'll get into some long philosophical debate. But if I show someone five pictures of trees, then show them five other pictures which are not all trees, and ask them which of the second set are similar to the first set, they'll usually have no trouble at all picking the trees in the second set.
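The picture-sorting test can be mimicked with a toy nearest-prototype classifier. This is only a sketch: the two-number "feature vectors" below are made up, standing in for whatever representation a person or a network actually uses, and the function names and distance threshold are arbitrary assumptions.

```python
import math


def centroid(points):
    """Average of a list of equal-length feature vectors."""
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(len(points[0]))]


def similar_to_examples(examples, candidates, threshold=1.0):
    """For each candidate, report whether it lies within threshold of the examples' centroid."""
    center = centroid(examples)
    return [math.dist(center, c) <= threshold for c in candidates]


# Five made-up "tree" examples (hypothetical features: height, leafiness).
tree_examples = [[5.0, 4.8], [5.2, 5.1], [4.9, 5.0], [5.1, 4.7], [4.8, 5.2]]
# Second set: some trees, some not.
second_set = [[5.0, 5.0], [0.5, 4.9], [5.3, 4.6], [1.0, 0.2], [4.7, 5.3]]

picked = similar_to_examples(tree_examples, second_set)
print(picked)
```

No definition of "tree" appears anywhere in the sketch; the five examples alone pin down which members of the second set count.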

I thought the mesa optimisers would definitely arise during the training

If you're optimizing all the parameters simultaneously at runtime, then there is no training. Whatever parameters were learned during "training" would just be overwritten by the optimal values computed at runtime.

Despite humans giving really dumb verbal explanations (like "Thor caused the thunder"), we tend to be pretty decent at actually predicting things in practice.

Mm, quantum mechanics much? I do not think I can reliably tell you which experiments are in the category “real” and the category “made up”, even though it’s a very simple category mathematically. But I don’t expect you’re saying this, I just am still confused what you are saying.

This reminds me of Oli’s question here, which ties into Abram’s “point of view from somewhere” idea. I feel like I expect ML-systems to take the point of view of the universe, and not learn our natural categories.

I'm talking everyday situations. Like "if I push on this door, it will open" or "by next week my laundry hamper will be full" or "it's probably going to be colder in January than June". Even with quantum mechanics, people do figure out the pattern and build some intuition, but they need to see a lot of data on it first and most people never study it enough to see that much data.

In places where the humans in question don't have much first-hand experiential data, or where the data is mostly noise, that's where human prediction tends to fail. (And those are also the cases where we expect learning systems in general to fail most often, and where we expect the system's priors to matter most.) Another way to put it: humans' priors aren't great, but in most day-to-day prediction problems we have more than enough data to make up for that.

‘You get what you measure’ (outer alignment failure) and Mesa optimisers (inner failure) are both potential gap fillers that explain why specifically the alignment/capability divergence initially arises. Whether it’s one or the other, I think the overall point is still that there is this gap in the classic arguments that allows for a (possibly quite high) chance of ‘alignment by default’, for the reasons you give, but there are at least 2 plausible mechanisms that fill this gap. And then I suppose my broader point would be that we should present:

Classic Arguments —> objections to them (capability and alignment often go together, could get alignment by default) —> specific causal mechanisms for misalignment

To help me check my understanding of what you're saying, we train an AI on a bunch of videos/media about Alice's life, in the hope that it learns an internal concept of "Alice's values". Then we use SL/RL to train the AI, e.g., give it a positive reward whenever it does something that the supervisor thinks benefits Alice's values. The hope here is that the AI learns to optimize the world according to its internal concept of "Alice's values" that it learned in the previous step. And we hope that its concept of "Alice's values" includes the idea that Alice wants AIs, including any future AIs, to keep improving their understanding of Alice's values and to serve those values, and that this solves alignment in the long run.

Assuming the above is basically correct, this (in part) depends on the AI learning a good enough understanding of "improving understanding of Alice's values" in step 1. This in turn (assuming "improving understanding of Alice's values" involves "using philosophical reasoning to solve various confusions related to understanding Alice's values, including Alice's own confusions") depends on the AI being able to learn a correct or good enough concept of "philosophical reasoning" from unsupervised training. Correct?

If AI can learn "philosophical reasoning" from unsupervised training, GPT-N should be able to do philosophy (e.g., solve open philosophical problems), right?

There's a lot of moving pieces here, so the answer is long. Apologies in advance.

I basically agree with everything up until the parts on philosophy. The point of divergence is roughly here:

assuming "improving understanding of Alice's values" involves "using philosophical reasoning to solve various confusions related to understanding Alice's values, including Alice's own confusions"

I do think that resolving certain confusions around values involves solving some philosophical problems. But just because the problems are philosophical does not mean that they need to be solved by philosophical reasoning.

The kinds of philosophical problems I have in mind are things like:

• What is the type signature of human values?
• What kind of data structure naturally represents human values?
• How do human values interface with the rest of the world?

In other words, they're exactly the sort of questions for which "utility function" and "Cartesian boundary" are answers, but probably not the right answers.

How could an AI make progress on these sorts of questions, other than by philosophical reasoning?

Let's switch gears a moment and talk about some analogous problems:

• What is the type signature of the concept of "tree"?
• What kind of data structure naturally represents "tree"?
• How do "trees" (as high-level abstract objects) interface with the rest of the world?

Though they're not exactly the same questions, these are philosophical questions of a qualitatively similar sort to the questions about human values.

Empirically, AIs already do a remarkable job reasoning about trees, and finding answers to questions like those above, despite presumably not having much notion of "philosophical reasoning". They learn some data structure for representing the concept of tree, and they learn how the high-level abstract "tree" objects interact with the rest of the (lower-level) world. And it seems like such AIs' notion of "tree" tends to improve as we throw more data and compute at them, at least over the ranges explored to date.

In other words: empirically, we seem to be able to solve philosophical problems to a surprising degree by throwing data and compute at neural networks. Well, at least "solve" in the sense that the neural networks themselves seem to acquire solutions to the problems... not that either the neural nets or the humans gain much understanding of such problems in general.

Going up a meta level: why would this be the case? Why would solutions to philosophical problems end up embedded in random learning algorithms, without either the algorithms or the humans having a general understanding of the problems?

Well, presumably neural nets end up with a notion of "tree" for much the same reason that humans end up with a notion of "tree": it's a useful concept. We don't have a precise mathematical theory of when or why it's useful (though I do hopefully have some groundwork for that), but we can see instrumental convergence to a useful concept even without understanding why the concept is useful.

In short: solutions to certain philosophical problems are probably instrumentally convergent, so the solutions will probably pop up in a fairly broad range of systems despite neither the systems nor their designers understanding the philosophical problems.

Now, so far this has talked about why solutions to philosophical problems would pop up in one AI. But does that help one AI to improve its own solutions? Depends on the setup, but at the very least it offers an AI a possible path to improving its solutions to such philosophical problems without going through philosophical reasoning.

Finally, I'll note that if humans want to be able to recognize an AI's solutions to philosophical problems, e.g. decode a model of human values from the weights of a neural net, then we'll probably need to make some philosophical/mathematical progress ourselves in order to do that reliably. After all, we don't even know the type signature of the thing we're looking for or a data structure with which to represent it.

So similarly, a human could try to understand Alice's values in two ways. The first, equivalent to what you describe here for AI, is to just apply whatever learning algorithm their brain uses when observing Alice, and form an intuitive notion of "Alice's values". And the second is to apply explicit philosophical reasoning to this problem. So sure, you can possibly go a long way towards understanding Alice's values by just doing the former, but is that enough to avoid disaster? (See Two Neglected Problems in Human-AI Safety for the kind of disaster I have in mind here.)

(I keep bringing up metaphilosophy but I'm pretty much resigned to be living in a part of the multiverse where civilization will just throw the dice and bet on AI safety not depending on solving it. What hope is there for our civilization to do what I think is the prudent thing, when no professional philosophers, even ones in EA who are concerned about AI safety, ever talk about it?)

I mostly agree with you here. I don't think the chances of alignment by default are high. There are marginal gains to be had, but to get a high probability of alignment in the long term we will probably need actual understanding of the relevant philosophical problems.

My take is that corrigibility is sufficient to get you an AI that understands what it means to "keep improving their understanding of Alice's values and to serve those values". I don't think the AI needs to play the "genius philosopher" role, just the "loyal and trustworthy servant" role. A superintelligent AI which plays that role should be able to facilitate a "long reflection" where flesh and blood humans solve philosophical problems.

(I also separately think unsupervised learning systems could in principle make philosophical breakthroughs. Maybe one already has.)

Planned summary for the Alignment Newsletter:

I liked the author’s summary, so I’ve reproduced it with minor stylistic changes:
A low-level model of some humans has everything there is to know about human values embedded within it, in exactly the same way that human values are embedded in physical humans. The embedding, however, is nontrivial. Thus, predictive power alone is not sufficient to define human values. The missing part is the embedding of values within the model.
However, this also applies if we replace the phrase “human values” with “trees”. Yet we have a whole class of neural networks in which a simple embedding lights up in response to trees. This is because trees are a natural abstraction, and we should expect to see real systems trained for predictive power use natural abstractions internally.
Human values are a little different from trees: they’re a property of an abstract object (humans) rather than an abstract object themselves. Nonetheless, the author still expects that a broad class of systems trained for predictive power will end up with simple embeddings of human values (~70% chance).
Since an unsupervised learner has a simple embedding of human values, a supervised/reinforcement learner can easily score well on values-proxy-tasks by directly using that model of human values. In other words, the system uses an actual model of human values as a proxy for our proxy of human values (~10-20% chance). This is what is meant by _alignment by default_.
When this works, it’s basically a best-case scenario, so we can safely use the system to design a successor without worrying about amplification of alignment errors (among other things).

Planned opinion:

I broadly agree with the perspective in this post: in particular, I think we really should have more optimism because of the tendency of neural nets to learn “natural abstractions”. There is structure and regularity in the world and neural nets often capture it (despite being able to memorize random noise); if we train neural nets on a bunch of human-relevant data it really should learn a lot about humans, including what we care about.
However, I am less optimistic than the author about the specific path presented here (and he only assigns 10% chance to it). In particular, while I do think human values are a “real” thing that a neural net will pick up on, I don’t think that they are well-defined enough to align an AI system arbitrarily far into the future: our values do not say what to do in all possible situations; to see this we need only to look at the vast disagreements among moral philosophers (who often focus on esoteric situations). If an AI system were to internalize and optimize our current system of values, as the world changed the AI system would probably become less and less aligned with humans. We could instead talk about an AI system that has internalized both current human values and the process by which they are constructed, but that feels much less like a natural abstraction to me.
I _am_ optimistic about a very similar path, in which instead of training the system to pursue (a proxy for) human values, we train the system to pursue some “meta” specification like “be helpful to the user / humanity” or “do what we want on reflection”. It seems to me that “being helpful” is also a natural abstraction, and it seems more likely that an AI system pursuing this specification would continue to be beneficial as the world (and human values) changed drastically.

In light of the previous section, there’s an obvious path to alignment where there turns out to be a few neurons (or at least some simple embedding) which correspond to human values, we use the tools of microscope AI to find that embedding, and just like that the alignment problem is basically solved.

This is the part I disagree with. The network does recognise trees, or at least green things (given that the grass seems pretty brown in the low tree pic).

Extrapolating this, I expect the AI might well have neurons that correspond roughly to human values, on the training data. Within the training environment, human values, amount of dopamine in human brain, curvature of human lips (in smiles), number of times the reward button is pressed, and maybe even amount of money in human bank account might all be strongly correlated.

You will have successfully narrowed human values down to within the range of things that are strongly correlated with human values in the training environment. If you take this signal and apply enough optimization pressure, you are going to get the equivalent of a universe tiled with tiny smiley faces.
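A toy numerical version of this worry, in Python. The functions `true_value` and `proxy` are invented for illustration (nothing like them appears in the post), and the sketch only shows how such a divergence could look, not whether it actually happens for natural abstractions: the two signals rank options identically inside a narrow "training" range, but pushing the proxy hard over a wider range picks a point the true objective scores badly.

```python
def true_value(x):
    """Hypothetical 'real' objective: rises at first, then falls off."""
    return x - 0.25 * x ** 2


def proxy(x):
    """Hypothetical proxy signal that tracks true_value only for small x."""
    return x


def argmax(f, candidates):
    """Return the candidate with the highest score under f."""
    return max(candidates, key=f)


train_range = [i / 10 for i in range(0, 11)]   # x in [0, 1]
wide_range = [i / 10 for i in range(0, 101)]   # x in [0, 10]

# Inside the training range, the proxy ranks options the same way the true value does.
same_ranking = sorted(train_range, key=proxy) == sorted(train_range, key=true_value)

# Weak optimization (search limited to the training range): the proxy's pick is fine.
x_weak = argmax(proxy, train_range)
# Strong optimization (much wider search): the proxy's pick is bad.
x_strong = argmax(proxy, wide_range)

print(same_ranking, true_value(x_weak), true_value(x_strong))
```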

Note that the examples in the OP are from an adversarial generative network. If its notion of "tree" were just "green things", the adversary should be quite capable of exploiting that.

You will have successfully narrowed human values down to within the range of things that are strongly correlated with human values in the training environment. If you take this signal and apply enough optimization pressure, you are going to get the equivalent of a universe tiled with tiny smiley faces.

The whole point of the "natural abstractions" section of the OP  is that I do not think this will actually happen. Off-distribution behavior is definitely an issue for the "proxy problems" section of the post, but I do not expect it to be an issue for identifying natural abstractions.

Note that the examples in the OP are from an adversarial generative network. If its notion of "tree" were just "green things", the adversary should be quite capable of exploiting that.

In order for the network to produce good pictures, the concept of "tree" must be hidden in there somewhere, but it could be hidden in a complicated and indirect manner. I am questioning whether the particular single node selected by the researchers encodes the concept of "tree" or "green thing".

Ah, I see. You're saying that the embedding might not actually be simple. Yeah, that's plausible.

I guess the main issue that I have with this argument is that an AI system that is extremely good at prediction is unlikely to just have a high-level concept corresponding to human values (if it does contain such a concept). Instead, it's likely to also include a high-level concept corresponding to what people say about values - or rather several, corresponding to what various different groups would say about human values. If your proxy is based on what people say, then these concepts which correspond to what people say will match much better - and the probability of at least one of these concepts being the best match is increased by the large number of these. So I don't put a very high weight on this scenario at all.

This is a great explanation. I basically agree, and this is exactly why I expect alignment-by-default to most likely fail even conditional on the natural abstractions hypothesis holding up.

Also, I have another strange idea that might increase the probability of this working.

If you could temporarily remove proxies based on what people say, then this would seem to greatly increase the chance of it hitting the actual embedded representation of human values. Maybe identifying these proxies is easier than identifying the representation of "true human values"?

I don't think it's likely to work, but thought I'd share anyway.

Thanks!

Is this why you put the probability as "10-20% chance of alignment by this path, assuming that the unsupervised system does end up with a simple embedding of human values"? Or have you updated your probabilities since writing this post?

Yup, this is basically where that probability came from. It still feels about right.

Thanks a lot for writing this. I've been thinking about FAI plans along these lines for a while now, here are some thoughts on specific points you made.

First, I take issue with the "Alignment By Default" title. There are two separate questions here. Question #1 is whether we'd have a good outcome if everyone concerned with AI safety got hit by a bus. Question #2 is whether there's a way to create Friendly AI using unsupervised learning. I'm rather optimistic that the answer to Question #2 is yes. I find the unsupervised learning family of approaches more appealing than IRL or HCH (from what I understand of those approaches). But I still think there are various ways in which things could go wrong, some of which you mention in this post, and it's useful to have safety researchers thinking about this, because the problems seem pretty tractable to me. You, me, and Steve Byrnes are the only people in the community I remember off the top of my head who seem to be giving this serious thought, which is a little odd because so many top AI people seem to think that unsupervised learning is The Nut That Must Be Cracked if we are to build AGI.

Anyway, in order to illustrate that the problems seem tractable, here are a couple things you brought up + thoughts on solving them.

With regard to the high-resolution molecular model of a human, there's the possibility of using this model as an upload somehow even if the embedding of human values is nontrivial. I guess the challenge is to excise everything around the human from the model, and replace those surroundings with whatever an ideal environment for doing moral / philosophical reasoning would be, along with some communication channel to the outside world. This approach is similar to the Paul Christiano construction described on p. 198 of Superintelligence. In this case, I guess it is more important for the embedding of a person's physical surroundings to be "natural" enough that we can mess with it without messing with the person's mind. However, even if the embedding of the person's physical surroundings is kinda bad (meaning that our "ideal environment for doing moral / philosophical reasoning" ends up being like a glitchy VR sim in practice), this plausibly won't cause a catastrophic alignment failure. Also, you don't necessarily need a super high-resolution model to do this sort of thing (imagine prompting GPT-N with "Gandhi goes up the mountain to contemplate Moral Question X, he returns after a year of contemplation and proclaims...").

This requires hitting a window - our data needs to be good enough that the system can tell it should use human values as a proxy, but bad enough that the system can’t figure out the specifics of the data-collection process enough to model it directly. This window may not even exist.

A couple thoughts.

First, I think it's possible to create this window. Suppose we restrict ourselves to feeding our system data from before the year 2000. There should be a decent representation of human values to be learned from this data, yet it should be quite difficult to figure out the specifics of the 2020+ data-collection process from it. Identifying the specific quirks which cause the data-collection process to differ from human values seems especially difficult. (I think restricting ourselves to pre-2000 data is overkill, I just chose 2000 for the purpose of illustration.)

Second, one way to check on things is to deliberately include a small quantity of mislabeled data, then once the system is done learning, check whether its model correctly recognizes that the mislabeled data is mislabeled (and agrees with all data that is correctly labeled). (This should be combined with the idea above where we disguise the data-collection process from the AI, because otherwise we might pinpoint "the data-collection process prior to the time at which the mislabeled data was introduced"?)
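A minimal sketch of the planted-mislabel check, in Python. The data, the 1-D nearest-centroid "model", and all function names are hypothetical stand-ins chosen for brevity; a real system would be far more complex, and this sketch does not attempt the step of disguising the data-collection process.

```python
def fit_centroids(data):
    """Fit a 1-D nearest-centroid classifier: mean feature value per label."""
    sums, counts = {}, {}
    for x, label in data:
        sums[label] = sums.get(label, 0.0) + x
        counts[label] = counts.get(label, 0) + 1
    return {label: sums[label] / counts[label] for label in sums}


def predict(centroids, x):
    """Predict the label whose centroid is closest to x."""
    return min(centroids, key=lambda label: abs(centroids[label] - x))


# Hypothetical clean data: label 0 clusters near 0, label 1 clusters near 10.
clean = [(x / 10, 0) for x in range(0, 20)] + [(10 + x / 10, 1) for x in range(0, 20)]
# Deliberately planted mislabeled points (labels flipped on purpose).
planted = [(0.5, 1), (10.5, 0)]

# Train on everything, including the planted points.
model = fit_centroids(clean + planted)

# The check: the model should disagree with the planted labels
# while still agreeing with the correctly labeled data.
flagged = [(x, y) for x, y in planted if predict(model, x) != y]
clean_agreement = sum(predict(model, x) == y for x, y in clean) / len(clean)

print(f"planted flagged: {len(flagged)}/{len(planted)}, clean agreement: {clean_agreement:.2f}")
```

If the model instead reproduced the planted labels, that would suggest it had learned the labeling process rather than the underlying concept.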

I know of two main ways to go from aligning one AI to long-term alignment

A third approach which you don't mention is to use the initial aligned AI as a "human values oracle" for subsequent AIs. Once you have a cheap, fast computational representation of human values, you can replicate it across a massive compute cluster and:

• Use it to generate extremely large quantities of training data

• Use it as the "moral compass" for some bigger, more sophisticated system

• Use it to identify specific ways in which the newer AI's concept of human values is wrong, and keep correcting the newer AI's concept of human values until it's good (maybe using active learning)

You need the new AI and the old AI to communicate with one another. But details of how they work can be totally different if you have them communicate using labeled data. Training one ML model to predict the output of some other ML model is a technique I see every so often in machine learning papers... "Distilling the Knowledge in a Neural Network" is a well-known example of this.
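A toy sketch of two differently built models communicating only through labeled data, in Python. The `teacher` and `student` here are invented for illustration, and the sketch uses hard 0/1 labels for simplicity, whereas the distillation paper mentioned above trains on softened output probabilities.

```python
def teacher(x):
    """Stand-in for the old model; its internals are treated as a black box."""
    return 1 if 2.0 * x - 3.0 > 0 else 0


def train_student(xs, labels):
    """Fit a different kind of model (a single threshold) to the teacher's labels."""
    best_threshold, best_errors = None, len(xs) + 1
    for threshold in sorted(xs):
        errors = sum((1 if x > threshold else 0) != y for x, y in zip(xs, labels))
        if errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return best_threshold


def student(threshold, x):
    """The new model: predicts 1 when x exceeds the learned threshold."""
    return 1 if x > threshold else 0


# Unlabeled inputs; the teacher turns them into labeled training data.
inputs = [i / 4 for i in range(0, 21)]  # 0.0, 0.25, ..., 5.0
labels = [teacher(x) for x in inputs]

threshold = train_student(inputs, labels)
agreement = sum(student(threshold, x) == teacher(x) for x in inputs) / len(inputs)
print(threshold, agreement)
```

The student never inspects how the teacher computes its answers; the labeled pairs are the only interface between the two.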

Finally, you wrote:

That’s assuming things don’t go irreversibly wrong before then - including not just hostile takeover, but even just development of deceptive behavior, since this scenario does not have any built-in mechanism to detect deception.

Mesa-optimizers are a real danger, but if we put those aside for a moment, I don't think there is much risk of a hostile takeover from an unsupervised learning system since it's not an agent.

Thanks for the comments, these are excellent!

Valid complaint on the title, I basically agree. I only give the path outlined in the OP ~10% of working without any further intervention by AI safety people, and I definitely agree that there are relatively-tractable-seeming ways to push that number up on the margin. (Though those would be marginal improvements only; I don't expect them to close the bulk of the gap without at least some progress on theoretical bottlenecks.)

I am generally lukewarm about human-simulation approaches to alignment; the fusion power generator scenario is a prototypical example of my concerns here (also see this comment on it, which explains what I see as the key take-away). The idea of simulating a human doing moral philosophy is a bit different than what I usually imagine, though; it's basically like taking an alignment researcher and running them on faster hardware. That doesn't directly solve any of the underlying conceptual problems - it just punts them to the simulated researchers - but it is presumably a strict improvement over a limited number of researchers operating slowly in meatspace. Alignment research ems!

Suppose we restrict ourselves to feeding our system data from before the year 2000. There should be a decent representation of human values to be learned from this data, yet it should be quite difficult to figure out the specifics of the 2020+ data-collection process from it.

I don't think this helps much. Two examples of "specifics of the data collection process" to illustrate:

• Suppose our data consists of human philosophers' writing on morality. Then the "specifics of the data collection process" includes the humans' writing skills and signalling incentives, and everything else besides the underlying human values.
• Suppose our data consists of humans' choices in various situations. Then the "specifics of the data collection process" includes the humans' mistaken reasoning, habits, divergence of decision-making from values, and everything else besides the underlying human values.

So "specifics of the data collection process" is a very broad notion in this context. Essentially all practical data sources will include a ton of extra information besides just their information on human values.

Second, one way to check on things is to deliberately include a small quantity of mislabeled data, then once the system is done learning, check whether its model correctly recognizes that the mislabeled data is mislabeled (and agrees with all data that is correctly labeled).

I like this idea, and I especially like it in conjunction with deliberate noise as an unsupervised learning trick. I'll respond more to that on the other comment.
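The mislabeled-data check could look roughly like the sketch below. It is a toy under loud assumptions: the "model" is a leave-one-out 3-nearest-neighbour vote on one-dimensional data, standing in for whatever system was actually trained, and the names `knn_predict` and `planted` are invented for illustration.

```python
from collections import Counter

def knn_predict(index, xs, labels, k=3):
    """Predict the label of xs[index] from its k nearest neighbours,
    excluding the point itself (leave-one-out)."""
    others = [i for i in range(len(xs)) if i != index]
    nearest = sorted(others, key=lambda i: abs(xs[i] - xs[index]))[:k]
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Toy data: the "true" concept is simply the sign of x.
xs = [x for x in range(-10, 11) if x != 0]
true_labels = [1 if x > 0 else 0 for x in xs]

# Deliberately plant two wrong labels before training.
planted = {xs.index(-7), xs.index(6)}
train_labels = [1 - y if i in planted else y for i, y in enumerate(true_labels)]

# After "training", ask which given labels the model disagrees with.
flagged = {
    i for i in range(len(xs))
    if knn_predict(i, xs, train_labels) != train_labels[i]
}
print(flagged == planted)
```

The check passes when the flagged set is exactly the planted set: the model rejects the planted errors and agrees with every correctly labeled point.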

A third way which you don't mention is to use the initial aligned AI as a "human values oracle" for subsequent AIs.

I have mixed feelings on this.

My main reservation is that later AIs will never be more precisely aligned than the oracle. That first AI may be basically-correctly aligned, but it still only has so much data and probably only rough algorithms, so I'd really like it to be able to refine its notion of human values over time. In other words, the oracle's notion of human values may be accurate but not precise, and I'd like precision to improve as more data comes in and better algorithms are found. This is especially important if capabilities rise over time and greater capabilities require more precise alignment.

That said, as long as the oracle's alignment is accurate, we could use your suggestion to make sure that actions are OK for all possible human-values-notions within uncertainty. That's probably at least good enough to avoid disaster. It would still fall short of the full potential value of AI - there'd be missed opportunities, where the system has to be overly careful because its notion of human values is insufficiently precise - but at least no disaster.
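A minimal sketch of "OK for all possible human-values-notions within uncertainty", assuming the oracle's uncertainty can be represented as a finite set of candidate scoring functions. The scoring functions, the `benefit`/`risk` fields, and the threshold are all hypothetical placeholders.

```python
def approved_under_all(action, value_hypotheses, threshold=0.0):
    """Approve an action only if every candidate notion of human values
    scores it at or above the threshold."""
    return all(score(action) >= threshold for score in value_hypotheses)

# Hypothetical stand-ins for the oracle's uncertainty: three candidate
# value functions that differ in how heavily they weigh risk.
value_hypotheses = [
    lambda a: a["benefit"] - 1.0 * a["risk"],
    lambda a: a["benefit"] - 2.0 * a["risk"],
    lambda a: a["benefit"] - 4.0 * a["risk"],
]

cautious = {"benefit": 1.0, "risk": 0.1}
ambitious = {"benefit": 5.0, "risk": 2.0}

print(approved_under_all(cautious, value_hypotheses))
print(approved_under_all(ambitious, value_hypotheses))
```

The ambitious action is rejected even though two of the three hypotheses rate it well above the cautious one. That is the missed-opportunity cost of imprecision: a sharper notion of values would shrink the hypothesis set and approve more.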

Finally, on deceptive behavior: I use the phrase a bit differently than I think most people do these days. My prototypical image isn't of a mesa-optimizer. Rather, I imagine people iteratively developing a system, trying things out, keeping things which seem to work, and thereby selecting for things which look good to humans (regardless of whether they're actually good). In that situation, we'd expect the system to end up doing things which look good but aren't, because the human developers accidentally selected for that sort of behavior. It's a "you-get-what-you-measure" problem, rather than a mesa-optimizers problem.

I don't expect them to close the bulk of the gap without at least some progress on theoretical bottlenecks.

Can you be more specific about the theoretical bottlenecks that seem most important?

I am generally lukewarm about human-simulation approaches to alignment; the fusion power generator scenario is a prototypical example of my concerns here (also see this comment on it, which explains what I see as the key take-away).

I agree that Tool AI is not inherently safe. The key question is which problem is easier: the alignment problem, or the safe-use-of-dangerous-tools problem. All else equal, if you think the alignment problem is hard, then you should be more willing to replace alignment work with tool safety work. If you think the alignment problem is easy, you should discourage dangerous tools in favor of frontloaded work on a more paternalistic "not just benign, actually aligned" AI.

An analogy here would be Linux vs Windows. Linux lets you shoot your foot off and wipe your hard drive with a single command, but it also gives you greater control of your system and your computer is less likely to get viruses. Windows is safer and more paternalistic, with less user control. Windows is a better choice for the average user, but that's partially because we have a lot of experience building operating systems. It wouldn't make sense to aim for Windows as our first operating system, because (a) it's a more ambitious project and (b) we wouldn't have enough experience to know the right ways in which to be paternalistic. Heck, it was you who linked disparagingly to waterfall-style software development the other day :) There's a lot to be said for simplicity of implementation.

(Random aside: In some sense I think the argument for paternalism is self-refuting, because the argument is essentially that humans can't be trusted, but I'm not sure the total amount of responsibility we're assigning to humans has changed--if the first system is to be very paternalistic, that puts an even greater weight of responsibility on the shoulders of its designers to be sure and get it right. I'd rather shove responsibility into the post-singularity world, because the current world seems non-ideal, for example, AI designers have limited time to think due to e.g. possible arms races.)

What do I mean by the "safe-use-of-dangerous-tools problem"? Well, many dangerous tools will come with an instruction manual or mandatory training in safe tool use. For a tool AI, this manual might include things like:

• Before asking the AI any question, ask: "If I ask Question X, what is the estimated % chance that I will regret asking on reflection?"

• Tell the AI: "When you answer this question, instead of revealing any information you think will plausibly harm me, replace it with [I'm not revealing this because it could plausibly harm you]"

• If using a human-simulation approach to alignment, tell your AI to only make use of the human-simulation to inform terminal values, never instrumental values. Or give the human simulation loads of time to reflect, so it's effectively a speed superintelligence (assuming for the moment what seems to be a common AI safety assumption that more reflection always improves outcomes--skepticism here). Or make sure the simulated human has access to the safety manual.

I think it's possible to do useful work on the manual for the Tool AI even in the absence of any actual Tool AI having been created. In fact, I suspect this work will generalize better between different AI designs than most alignment work generalizes between designs.

Insights from our manual could even be incorporated into the user interface for the tool. For example, the question-asking flow could by default show us the answer to the question "If I ask Question X, what is the estimated % chance that I will regret asking on reflection?" and ask us to read the result and confirm that the question is actually one we want to ask. This would be analogous to alias rm='rm -i' in Linux--it doesn't reduce transparency or add brittle complexity, but it does reduce the risk of shooting ourselves in the foot.
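As a rough sketch of that question-asking flow: the wrapper below shows the regret estimate first and only forwards the question if the user confirms. The callables `estimate_regret`, `answer`, and `confirm` are hypothetical hooks into a tool that does not exist; only the control flow is the point.

```python
def ask_with_regret_check(question, estimate_regret, answer, confirm):
    """Show the estimated chance of regret before answering, in the spirit
    of `rm -i`: the tool is unchanged, the user just gets a prompt."""
    regret = estimate_regret(question)
    if not confirm(question, regret):
        return None  # User declined; the question is never asked.
    return answer(question)

# Hypothetical stubs standing in for the tool AI and the user.
def estimate_regret(question):
    return 0.9 if "fusion" in question else 0.05

def answer(question):
    return "answer to: " + question

def confirm(question, regret):
    # A real interface would prompt the user; this stub declines
    # whenever the estimated regret exceeds 50%.
    return regret <= 0.5

print(ask_with_regret_check("What is 2 + 2?", estimate_regret, answer, confirm))
print(ask_with_regret_check("How do I build a fusion generator?",
                            estimate_regret, answer, confirm))
```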

BTW you wrote:

Coming at it from a different angle: if a safety problem is handled by a system's designer, then their die-roll happens once up-front. If that die-roll comes out favorably, then the system is safe (at least with respect to the problem under consideration); it avoids the problem by design. On the other hand, if a safety problem is left to the system's users, then a die-roll happens every time the system is used, so inevitably some of those die rolls will come out unfavorably. Thus the importance of designing AI for safety up-front, rather than relying on users to use it safely.

One possible plan for the tool is to immediately use it to create a more paternalistic system (or just generate a bunch of UI safeguards as I described above). So then you're essentially just rolling the dice once.
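The die-roll argument quoted above can be put in numbers. This sketch assumes every roll is independent and fails with the same probability; the 1% figure is purely illustrative.

```python
def p_any_failure(p_fail_per_roll, rolls):
    """Probability that at least one of `rolls` independent die-rolls
    comes out unfavorably."""
    return 1.0 - (1.0 - p_fail_per_roll) ** rolls

design_time = p_any_failure(0.01, 1)    # problem handled once, by the designer
per_use = p_any_failure(0.01, 100)      # problem left to users, over 100 uses

print(round(design_time, 3), round(per_use, 3))  # 0.01 0.634
```

Using the tool once to build the safer system keeps the count of rolls near one, which is the case the reply is arguing for.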

Two examples of "specifics of the data collection process" to illustrate

From my perspective, these examples essentially illustrate that there's not a single natural abstraction for "human values"--but as I said elsewhere, I think that's a solvable problem.

My main reservation is that later AIs will never be more precisely aligned than the oracle. That first AI may be basically-correctly aligned, but it still only has so much data and probably only rough algorithms, so I'd really like it to be able to refine its notion of human values over time. In other words, the oracle's notion of human values may be accurate but not precise, and I'd like precision to improve as more data comes in and better algorithms are found. This is especially important if capabilities rise over time and greater capabilities require more precise alignment.

Let's make the later AIs corrigible then. Perhaps our initial AI can give us both a corrigibility oracle and a values oracle. (Or later AIs could use some other approach to corrigibility.)

Can you be more specific about the theoretical bottlenecks that seem most important?

Type signature of human values is the big one. I think it's pretty clear at this point that utility functions aren't the right thing, that we value things "out in the world" as opposed to just our own "inputs" or internal state, that values are not reducible to decisions or behavior, etc. We don't have a framework for what-sort-of-thing human values are. If we had that - not necessarily a full model of human values, just a formalization which we were confident could represent them - then that would immediately open the gates to analysis, to value learning, to uncertainty over values, etc.

The key question is which problem is easier: the alignment problem, or the safe-use-of-dangerous-tools problem. All else equal, if you think the alignment problem is hard, then you should be more willing to replace alignment work with tool safety work. If you think the alignment problem is easy, you should discourage dangerous tools in favor of frontloaded work on a more paternalistic "not just benign, actually aligned" AI.

A good argument, but I see the difficulties of safe tool AI and the difficulties of alignment as mostly coming from the same subproblem. To the extent that that's true, alignment work and tool safety work need to be basically the same thing.

On the tools side, I assume the tools will be reasoning about systems/problems which humans can't understand - that's the main value prop in the first place. Trying to collapse the complexity of those systems into a human-understandable API is inherently dangerous: values are complex, the system is complex, their interaction will inevitably be complex, so any API simple enough for humans will inevitably miss things. So the only safe option which can scale to complex systems is to make sure the "tools" have their own models of human values, and use those models to check the safety of their outputs... which brings us right back to alignment.

Simple mechanisms like always displaying an estimated probability that I'll regret asking a question would probably help, but I'm mainly worried about the unknown unknowns, not the known unknowns. That's part of what I mean when I talk about marginal improvements vs closing the bulk of the gap - the unknown unknowns are the bulk of the gap.

(I could see tools helping in a do-the-same-things-but-faster sort of way, and human-mimicking approaches in particular are potentially helpful there. On the other hand, if we're doing the same things but faster, it's not clear that that scenario really favors alignment research over the Leeroy Jenkins of the world.)

In some sense I think the argument for paternalism is self-refuting, because the argument is essentially that humans can't be trusted, but I'm not sure the total amount of responsibility we're assigning to humans has changed--if the first system is to be very paternalistic, that puts an even greater weight of responsibility on the shoulders of its designers to be sure and get it right.

This in particular I think is a strong argument, and the die-rolls argument is my main counterargument.

We can indeed partially avoid the die-rolls issue by only using the system a limited number of times - e.g. to design another system. That said, in order for the first system to actually add value here, it has to do some reasoning which is too complex for humans - which brings back the problem from earlier, about the inherent danger of collapsing complex values and systems into a simple API. We'd be rolling the dice twice - once in designing the first system, once in using the first system to design the second - and that second die-roll in particular has a lot of unknown unknowns packed into it.

Let's make the later AIs corrigible then. Perhaps our initial AI can give us both a corrigibility oracle and a values oracle. (Or later AIs could use some other approach to corrigibility.)

I have yet to see a convincing argument that corrigibility is any easier than alignment itself. It seems to suffer from the same basic problem: the concept of "corrigibility" has a lot of hidden complexity, especially when it interacts with embeddedness. To the extent that we're relying on corrigibility, I'd ideally like it to improve with capabilities, in the same way and for the same reasons as I'd like alignment to improve with capabilities. Do you know of an argument that it's easier?

If we had that - not necessarily a full model of human values, just a formalization which we were confident could represent them - then that would immediately open the gates to analysis, to value learning, to uncertainty over values, etc.

Do you have in mind a specific aspect of human values that couldn't be represented using, say, the reward function of a reinforcement learning agent AI?

On the tools side, I assume the tools will be reasoning about systems/problems which humans can't understand - that's the main value prop in the first place. Trying to collapse the complexity of those systems into a human-understandable API is inherently dangerous: values are complex, the system is complex, their interaction will inevitably be complex, so any API simple enough for humans will inevitably miss things. So the only safe option which can scale to complex systems is to make sure the "tools" have their own models of human values, and use those models to check the safety of their outputs... which brings us right back to alignment.

There's an aspect of defense-in-depth here. If your tool's model of human values is slightly imperfect, that doesn't necessarily fail hard the way an agent with a model of human values that's slightly imperfect does.

BTW, let's talk about the "Research Assistant" story here. See more discussion here. (The problems brought up in that thread seem pretty solvable to me.)

Simple mechanisms like always displaying an estimated probability that I'll regret asking a question would probably help, but I'm mainly worried about the unknown unknowns, not the known unknowns. That's part of what I mean when I talk about marginal improvements vs closing the bulk of the gap - the unknown unknowns are the bulk of the gap.

That's why you need a tool... so it can tell you the unknown unknowns you're missing, and how to solve them. We'd rather have a single die roll, on creating a good tool, than have a separate die roll for every one of those unknown unknowns, wouldn't we? ;-) Shouldn't we aim for a fairly minimalist, non-paternalistic tool where unknown unknowns are relatively unlikely to become load-bearing? All we need to do is figure out the unknown unknowns that are load-bearing in the Research Assistant scenario, then the assistant can help us with the rest of the unknown unknowns.

it has to do some reasoning which is too complex for humans - which brings back the problem from earlier, about the inherent danger of collapsing complex values and systems into a simple API.

If solving FAI necessarily involves reasoning about things which are beyond humans (which seems to be what you're getting at with the "unknown unknowns" stuff), what is the alternative?

I have yet to see a convincing argument that corrigibility is any easier than alignment itself. It seems to suffer from the same basic problem: the concept of "corrigibility" has a lot of hidden complexity, especially when it interacts with embeddedness. To the extent that we're relying on corrigibility, I'd ideally like it to improve with capabilities, in the same way and for the same reasons as I'd like alignment to improve with capabilities. Do you know of an argument that it's easier?

We were discussing a scenario where we had an OK solution to alignment, and you were saying that you didn't want to get locked into a merely OK solution for all of eternity. I'm saying corrigibility can address that. Alignment is already solvable to an OK degree in this hypothetical, so I'm assuming corrigibility is solvable to an OK degree as well.

Corrigible AI should be able to improve its corrigibility with increased capabilities the same way it can improve its alignment with increased capabilities. You say "corrigibility" has a lot of hidden complexity. The more capable the system, the more hypotheses it can generate regarding complex phenomena, and the more likely those hypotheses are to be correct. There's no reason we can't make the system's notion of corrigibility corrigible in the same way its values are corrigible. (BTW, I don't think corrigibility even necessarily needs to be thought of as separate from alignment, you can think of them as both being reflected in an agent's reward function say. But that's a tangent.) And we can leverage capability increases by having the system explain various notions of corrigibility it's discovered and how they differ so we can figure out which notion(s) we want to use.

Do you have in mind a specific aspect of human values that couldn't be represented using, say, the reward function of a reinforcement learning agent AI?

It's not the function-representation that's the problem, it's the type-signature of the function. I don't know what such a function would take in or what it would return. Even RL requires that we specify the input-output channels up-front.

All we need to do is figure out the unknown unknowns that are load-bearing in the Research Assistant scenario, then the assistant can help us with the rest of the unknown unknowns.

This translates in my head to "all we need to do is solve the main problems of alignment, and then we'll have an assistant which can help us clean up any easy loose ends".

More generally: I'm certainly open to the idea of AI, of one sort or another, helping to work out at least some of the problems of alignment. (Indeed, that's very likely a component of any trajectory where alignment improves over time.) But I have yet to hear a convincing case that punting now actually makes long-run alignment more likely, or even that future tools will make creation of aligned AI easier/more likely relative to unaligned AI. What exactly is the claim here?

If solving FAI necessarily involves reasoning about things which are beyond humans (which seems to be what you're getting at with the "unknown unknowns" stuff), what is the alternative?

I don't think solving FAI involves reasoning about things beyond humans. I think the AIs themselves will need to reason about things beyond humans, and in particular will need to reason about complex safety problems on a day-to-day basis, but I don't think that designing a friendly AI is too complex for humans.

Much of the point of AI is that we can design systems which can reason about things too complex for ourselves. Similarly, I expect we can design safe systems which can reason about safety problems too complex for ourselves.

Corrigible AI should be able to improve its corrigibility with increased capabilities the same way it can improve its alignment with increased capabilities.

What notion of "corrigible" are you using here? It sounds like it's not MIRI's "the AI won't disable its own off-switch" notion.

This translates in my head to "all we need to do is solve the main problems of alignment, and then we'll have an assistant which can help us clean up any easy loose ends".

Trying to clarify here: do you think the problems brought up in these answers are the main problems of alignment? This claim seems a bit odd to me because I don't think those problems are highlighted in any of the major AI alignment research agenda papers. (Alternatively, if you feel like there are important omissions from those answers, I strongly encourage you to write your own answer!)

I did a presentation at the recent AI Safety Discussion Day on how to solve the problems in that thread. My proposed solutions don't look much like anything that's e.g. on Arbital because the problems are different. I can share the slides if you want, PM me your gmail address.

More generally: I'm certainly open to the idea of AI, of one sort or another, helping to work out at least some of the problems of alignment. (Indeed, that's very likely a component of any trajectory where alignment improves over time.) But I have yet to hear a convincing case that punting now actually makes long-run alignment more likely, or even that future tools will make creation of aligned AI easier/more likely relative to unaligned AI. What exactly is the claim here?

Here's an example of a tool that I would find helpful right now, that seems possible to make with current technology (and will get better as technology advances), and seems very low risk: Given a textual description of some FAI proposal (or proposal for solving some open problem within AI safety), highlight the contiguous passage of text within the voluminous archives of AF/LW/etc. that is most likely to represent a valid objection to this proposal. (EDIT: Or, given some AI safety problem, highlight the contiguous passage of text which is most likely to represent a solution.)

Can you come up with improbable scenarios in which this sort of thing ends up being net harmful? Sure. But security is not binary. Just because there is some hypothetical path to harm doesn't mean harm is likely.

Could this kind of approach be useful for unaligned AI as well? Sure. So begin work on it ASAP, keep it low profile, and restrict its use to alignment researchers in order to create maximum differential progress towards aligned AI.

Similarly, I expect we can design safe systems which can reason about safety problems too complex for ourselves.

I'm a bit confused why you're bringing up "safety problems too complex for ourselves" because it sounds like you don't think there are any important safety problems like that, based on the sentences that came before this one?

What notion of "corrigible" are you using here? It sounds like it's not MIRI's "the AI won't disable its own off-switch" notion.

I'm talking about the broad sense of "corrigible" described in e.g. the beginning of this post.

(BTW, I just want to clarify that we're having two parallel discussions here: One discussion is about what we should be doing very early in our AI safety gameplan, e.g. creating the assistant I described that seems like it would be useful right now. Another discussion is about how to prevent a failure mode that could come about very late in our AI safety gameplan, where we have a sorta-aligned AI and we don't want to lock ourselves into an only sorta-optimal universe for all eternity. I expect you realize this, I'm just stating it explicitly in order to make the discussion a bit easier to follow.)

Trying to clarify here: do you think the problems brought up in these answers are the main problems of alignment?

Mostly no. I've been trying to write a bit more about this topic lately; Alignment as Translation is the main source of my intuitions on core problems, and the fusion power generator scenario is an example of what that looks like in a GPT-like context (parts of your answer here are similar to that).

Using GPT-like systems to simulate alignment researchers' writing is a probably-safer use-case, but it still runs into the core catch-22. Either:

• It writes something we'd currently write, which means no major progress (since we don't currently have solutions to the major problems and therefore can't write down such solutions), or
• It writes something we currently wouldn't write, in which case it's out-of-distribution and we have to worry about how it's extrapolating us

I generally expect the former to mostly occur by default; the latter would require some clever prompts.

I could imagine at least some extrapolation of progress being useful, but it still seems like the best way to make human-simulators more useful is to improve our own understanding, so that we're more useful to simulate.

Given a textual description of some FAI proposal (or proposal for solving some open problem within AI safety), highlight the contiguous passage of text within the voluminous archives of AF/LW/etc. that is most likely to represent a valid objection to this proposal.

This sounds like a great tool to have. It's exactly the sort of thing which is probably marginally useful. It's unlikely to help much on the big core problems; it wouldn't be much use for identifying unknown unknowns which nobody has written about before. But it would very likely help disseminate ideas, and be net-positive in terms of impact.

I do think a lot of the things you're suggesting would be valuable and worth doing, on the margin. They're probably not sufficient to close the bulk of the safety gap without theoretical progress on the core problems, but they're still useful.

I'm a bit confused why you're bringing up "safety problems too complex for ourselves" because it sounds like you don't think there are any important safety problems like that, based on the sentences that came before this one?

The "safety problems too complex for ourselves" are things like the fusion power generator scenario - i.e. safety problems in specific situations or specific applications. The safety problems which I don't think are too complex are the general versions, i.e. how to build a generally-aligned AI.

An analogy: finding shortest paths in a billion-vertex graph is far too complex for me. But writing a general-purpose path-finding algorithm to handle that problem is tractable. In the same way, identifying the novel safety problems of some new technology will sometimes be too complex for humans. But writing a general-purpose safety-reasoning algorithm (i.e. an aligned AI) is tractable, I expect.
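To make the analogy concrete, here is a short general-purpose path-finder. It uses breadth-first search, which assumes unweighted edges; the tiny example graph is made up, and the same few lines would apply unchanged to a graph far too large to inspect by hand.

```python
from collections import deque

def shortest_path(graph, start, goal):
    """Breadth-first search over an adjacency-list graph.
    Returns a shortest path as a list of nodes, or None if unreachable."""
    parents = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            # Walk the parent links back to the start.
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for neighbour in graph.get(node, ()):
            if neighbour not in parents:
                parents[neighbour] = node
                queue.append(neighbour)
    return None

graph = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": ["e"], "e": []}
print(shortest_path(graph, "a", "e"))  # ['a', 'b', 'd', 'e']
```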

I'm talking about the broad sense of "corrigible" described in e.g. the beginning of this post.

Ah ok, the suggestion makes sense now. That's a good idea. It's still punting a lot of problems until later, and humans would still be largely responsible for solving those problems later. But it could plausibly help with the core problems, without any obvious trade-off (assuming that the AI/oracle actually does end up pointed at corrigibility).

Mostly no. I've been trying to write a bit more about this topic lately; Alignment as Translation is the main source of my intuitions on core problems, and the fusion power generator scenario is an example of what that looks like in a GPT-like context (parts of your answer here are similar to that).

Well, I encourage you to come up with a specific way in which GPT-N will harm us by trying to write an AF post due to not having solved Alignment as Translation and add it as an answer in that thread. Given that we may be in an AI overhang, I'd like the answers to represent as broad a distribution of plausible harms as possible, because that thread might end up becoming very important & relevant very soon.

Some notes on the loss function in unsupervised learning:

Since an unsupervised learner is generally just optimized for predictive power

I think it's worthwhile to distinguish the loss function that's being optimized during unsupervised learning, vs what the practitioner is optimizing for. Yes, the loss function being optimized in an unsupervised learning system is frequently minimization of reconstruction error or similar. But when I search for "unsupervised learning review" on Google Scholar, I find this highly cited paper by Bengio et al. The abstract talks a lot about learning useful representations and says nothing about predictive power. In other words, learning "natural abstractions" appears to be pretty much the entire game from a practitioner perspective.

And in the same way supervised learning has dials such as regularization which let us control the complexity of our model, unsupervised learning has similar dials.

For clustering, we could achieve 0 reconstruction error (or equivalently, explain all the variation in the data) by putting every data point in its own cluster, but that would completely defeat the point. The elbow method is a well-known heuristic for figuring out what the "right" number of clusters in a dataset is.
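A small sketch of both points, on made-up one-dimensional data with three obvious groups. Instead of running k-means, it brute-forces the best contiguous split of the sorted points, which is feasible only because the dataset is tiny; the function names are illustrative.

```python
from itertools import combinations

def sse(points):
    """Sum of squared distances from the cluster mean (reconstruction error)."""
    mean = sum(points) / len(points)
    return sum((p - mean) ** 2 for p in points)

def best_sse(sorted_points, k):
    """Lowest total error over all ways to cut sorted 1-D data into k
    contiguous clusters. Brute force, so only suitable for tiny inputs."""
    n = len(sorted_points)
    best = float("inf")
    for cuts in combinations(range(1, n), k - 1):
        bounds = (0,) + cuts + (n,)
        total = sum(sse(sorted_points[a:b]) for a, b in zip(bounds, bounds[1:]))
        best = min(best, total)
    return best

data = sorted([1.0, 1.2, 0.8, 5.0, 5.1, 4.9, 9.0, 9.2, 8.8])
errors = {k: best_sse(data, k) for k in range(1, len(data) + 1)}

for k, err in errors.items():
    print(k, round(err, 3))
```

The error falls steeply up to k = 3 and only slightly afterwards, which is the "elbow"; at k = 9 every point is its own cluster and the error is exactly zero, which tells us nothing about the data.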

Similarly, we could achieve 0 reconstruction error with an autoencoder by making the number of dimensions in the bottleneck be equal to the number of dimensions in the original input, but again, that would completely defeat the point. Someone on the Stats Stackexchange says that there is no standard way to select the number of dimensions for an autoencoder. (For reference, the standard way to select the regularization parameter which controls complexity in supervised learning would obviously be through cross-validation.) However, I suspect this is a tractable research problem.

It was interesting that you mentioned the noise of air molecules, because one unsupervised learning trick is to deliberately introduce noise into the input to see if the system has learned "natural" representations which allow it to reconstruct the original noise-free input. See denoising autoencoder. This is the kind of technique which might allow an autoencoder to learn natural representations even if the number of dimensions in the bottleneck is equal to the number of dimensions in the original input.
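Here is a toy illustration of why the denoising objective changes the incentives. Nothing is trained: both "autoencoders" are hand-written maps, the data is assumed to lie on the line y = 2x, and the noise level is arbitrary. The identity map stands in for a model that merely copies its input, and `structure_aware_map` stands in for one that has captured the structure of the data.

```python
import math
import random

DIRECTION = (1 / math.sqrt(5), 2 / math.sqrt(5))  # unit vector along y = 2x

def identity_autoencoder(point):
    # Copies the input: perfect plain reconstruction, no structure learned.
    return point

def structure_aware_map(point):
    # Projects onto the line the clean data lives on.
    code = point[0] * DIRECTION[0] + point[1] * DIRECTION[1]
    return (code * DIRECTION[0], code * DIRECTION[1])

def mean_squared_error(model, inputs, targets):
    total = 0.0
    for x, t in zip(inputs, targets):
        out = model(x)
        total += (out[0] - t[0]) ** 2 + (out[1] - t[1]) ** 2
    return total / len(inputs)

rng = random.Random(0)
clean = [(x / 10.0, 2 * x / 10.0) for x in range(-20, 21)]
noisy = [(a + rng.gauss(0, 0.3), b + rng.gauss(0, 0.3)) for a, b in clean]

plain_identity = mean_squared_error(identity_autoencoder, clean, clean)
denoise_identity = mean_squared_error(identity_autoencoder, noisy, clean)
denoise_structured = mean_squared_error(structure_aware_map, noisy, clean)

print(plain_identity, denoise_identity, denoise_structured)
```

The identity map scores zero on plain reconstruction but does poorly once the input is corrupted, while the projection removes the part of the noise that points off the line. A full-width linear layer could represent that same projection, since it is just a 2x2 matrix, so the denoising objective can reward learning the structure without any narrowing of the bottleneck.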

BTW, here's an interesting-looking (pessimistic) paper I found while researching this comment: Challenging Common Assumptions in the Unsupervised Learning of Disentangled Representations

You brought up microscope AI. I think a promising research direction here may be to formulate a notion of "ease of interpretability" which can be added as an additional term to an unsupervised loss function (the same way we might, for example, add a term to a clustering algorithm's loss function so that in addition to minimizing reconstruction error, it also seeks to minimize the number of clusters).

Hardcoding "human values" by hand is hopeless, but hardcoding "ease of human interpretability" by hand seems much more promising, since ease of human interpretability is likely to correspond to easily formalizable notions such as simplicity. Also, if your hardcoded notion of "ease of human interpretability" turns out to be slightly wrong, that's not a catastrophe: you just get an ML model which is a bit harder to interpret than you might like.

Another option is to learn a notion of what constitutes an interpretable model by e.g. collecting "ease of interpretability" data from human microscope users.

Of course, one needs to be careful that any interpretability term does not get too much weight in the loss function, because if it does, we may stop learning the "natural" abstractions that we desire (assuming a worst-case scenario where human interpretability is anticorrelated with "naturalness"). The best approach may be to learn two models, one of which was optimized for interpretability and one of which wasn't, and only allow our system to take action when the two models agree. I guess mesa-optimizers in the non-interpretable model are still a worry though.

This comment definitely wins the award for best comment on the post so far. Great ideas, highly relevant links.

I especially like the deliberate noise idea. That plays really nicely with natural abstractions as information-relevant-far-away: we can intentionally insert noise along particular dimensions, and see how that messes with prediction far away (either via causal propagation or via loss of information directly). As long as most of the noise inserted is not along the dimensions relevant to the high-level abstraction, denoising should be possible. So it's very plausible that denoising autoencoders are fairly-directly incentivized to learn natural abstractions. That'll definitely be an interesting path to pursue further.

Assuming that the denoising autoencoder objective more-or-less-directly incentivizes natural abstractions, further refinements on that setup could very plausibly turn into a useful "ease of interpretability" objective.

This comment definitely wins the award for best comment on the post so far.

Thanks!

I don't consider myself an expert on the unsupervised learning literature by the way, I expect there is more cool stuff to be found.

This is the sort of thing I've been thinking about since "What's the dream for giving natural language commands to AI?" (which bears obvious similarities to this post). The main problems I noted there apply similarly here:

• Prediction in the supervised task might not care about the full latent space used for the unsupervised tasks, losing information.
• Little to no protection from Goodhart's law. Things that are extremely good proxies for human values still might not be safe to optimize.
• Doesn't care about metaethics, just maximizes some fixed thing. Which wouldn't be a problem if it was meta-ethically great to start with, but it probably incorporates plenty of human foibles in order to accurately predict us.

The killer is really that second one. If you run this supervised learning process, and it gives you a bunch of rankings of things in terms of their human values score, this isn't a safe AI even if it's on average doing a great job, because the thing that gets the absolute best score is probably an exploit of the specific pattern-recognition algorithm used to do the ranking. In short, we still need to solve the other-izer problem.
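A toy simulation of the mildest version of this problem, under assumptions of my own: each candidate has a true value, and the ranker's score is that value plus independent error of the same scale. Real exploits of a learned ranker are adversarial rather than random, so this understates the issue, but it shows how "good on average" and "safe at the argmax" come apart.

```python
import numpy as np

rng = np.random.default_rng(0)
n_candidates = 100_000

# Hypothetical setup: true value of each candidate, and a proxy score that
# equals the true value plus independent ranking error of the same scale.
true_value = rng.normal(size=n_candidates)
proxy_score = true_value + rng.normal(size=n_candidates)

# On average the proxy is a decent ranker...
correlation = np.corrcoef(true_value, proxy_score)[0, 1]

# ...but the single top-scoring candidate was selected partly for its error.
best = np.argmax(proxy_score)
print(f"correlation between proxy and true value: {correlation:.2f}")
print(f"top candidate: proxy score {proxy_score[best]:.2f}, "
      f"true value {true_value[best]:.2f}")
```

The top candidate's proxy score overstates its true value, because maximizing the score selects for large positive error as well as for large true value.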

Actually, your trees example does give some ideas. Could you look inside a GAN trained on normal human behavior and identify what parts of it were the "act morally" or "be smart" parts, and turn them up? Choosing actions is, after all, a generative problem, not a classification or regression problem.

This requires hitting a window - our data needs to be good enough that the system can tell it should use human values as a proxy, but bad enough that the system can’t figure out the specifics of the data-collection process enough to model it directly. This window may not even exist.

I like this framing, it is clarifying.

When alignment-by-default works, it’s basically a best-case scenario, so we can safely use the system to design a successor without worrying about amplification of alignment errors (among other things).

didn't understand how this was derived or what other results/ideas it is referencing.

didn't understand how this was derived or what other results/ideas it is referencing.

The idea here is that the AI has a rough model of human values, and is pointed at those values when making decisions (e.g. the embedding is known and it's optimizing for the embedded values, in the case of an optimizer). It may not have perfect knowledge of human values, but it would e.g. design its successor to build a more precise model of human values than itself (assuming it expects that successor to have more relevant data) and point the successor toward that model, because that's the action which best optimizes for its current notion of human values.

Contrast to e.g. an AI which is optimizing for human approval. If it can do things which make a human approve, even though the human doesn't actually want those things (e.g. deceptive behavior), then it will do so. When that AI designs its successor, it will want the successor to be even better at gaining human approval, which means making the successor even better at deception.

This probably needs more explanation, but I'm not sure which parts need more explanation, so feedback would be appreciated.

Great post!

That might have been discussed in the comments, but my gut reaction to the tree example was not "It's not really understanding trees" but "It's understanding trees visually". That is, I think the examples point to trees being a natural abstraction with regard to images made of pixels. In that sense, dogs and cats and other distinct visual objects might fit your proposal of natural abstraction. Yet this doesn't entail that trees are a natural abstraction when given the position of atoms, or sounds (to be more abstract). I thus think that natural abstractions should be defined with regard to the sort of data that is used.

For human values, I might accept that they are a natural abstraction, but I don't know for which kind of data. Is audiovisual data (as in YouTube videos) enough? Do we also need textual data? Neuroimagery? I don't know, and that makes me slightly more pessimistic about an unsupervised model learning human values by default.

My model of abstraction is that high-level abstractions summarize all the information from some chunk of the world which is relevant "far away". Part of that idea is that, as we "move away" from the information-source, most information is either quickly wiped out by noise, or faithfully transmitted far away. The information which is faithfully transmitted will usually be present across many different channels; that's the main reason it's not wiped out by noise in the first place. Obviously this is not something which necessarily applies to all possible systems, but intuitively it seems like it should apply to most systems most of the time: information which is not duplicated across multiple channels is easily wiped out by noise.

So in principle, it doesn’t even matter what kind of model we use or how it’s represented; as long as the predictive power is good enough, values will be embedded in there, and the main problem will be finding the embedding.

I will agree with this. However, notice what this doesn't say. It doesn't say "any model powerful enough to be really dangerous contains human values". Imagine a model that was good at a lot of science and engineering tasks. It was good enough at nuclear physics to design effective fusion reactors and bombs. It knew enough biology to design a superplague. It knew enough molecular dynamics to design self-replicating nanotech. It knew enough about computer security to hack most real-world systems. But it didn't know much about how humans thought. Its predictions are far from maxentropy: if it sees people walking along a street, it thinks they will probably carry on walking, not fall to the ground twitching randomly. Let's say that the model is as predictively accurate as you would be when asked to predict the behaviour of a stranger from a few seconds of video. This AI doesn't contain a model of human values anywhere in it.

We can't just assume that every AI powerful enough to be dangerous contains a model of human values; however, I suspect most of them will in practice.

This is entirely correct.

So far, we’ve only talked about one AI ending up aligned, or a handful ending up aligned at one particular time. However, that isn’t really the ultimate goal of AI alignment research. What we really want is for AI to remain aligned in the long run, as we (and AIs themselves) continue to build new and more powerful systems and/or scale up existing systems over time.

I think this suggests an interesting path where alignment by default might be able to serve as a bridge to better alignment mechanisms, i.e. if it works and we can select for AIs that contain representations of human values, then we might be able to prioritize this in a slow takeoff scenario so that in the early phases of it we at least have mostly aligned AI that helps us build better mechanisms for alignment (as opposed to these AIs simply building successors directly with the hope that they maintain alignment with human values in the process).

I think of this as the Rohin trajectory, since he's the main person I've heard talk about it. I agree it's a natural approach to consider, though deceptiveness-type problems are a big potential issue.

Isn't remaining aligned an example of robust delegation? If so, there have been both discussions and technical work on this problem before.

Yup, exactly right, though this version is a fair bit more involved than the simplified delegation scenarios we've seen in most of the theoretical work.

when alignment-by-default works, we can use the system to design a successor without worrying about amplification of alignment errors

Anything neural-net related starts with random noise and performs gradient-descent-style steps. This doesn't get you the global optimum; it gets you some point that is approximately a local optimum, which depends on the noise, the nature of the search space, and the choice of step size.
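A minimal illustration of that dependence on the starting point, using a one-dimensional non-convex loss that I made up for the purpose. The two fixed starting points stand in for two different random initializations.

```python
def loss(x):
    """A hypothetical non-convex loss with two separate minima."""
    return x**4 - 3 * x**2 + x

def grad(x):
    """Derivative of the loss above."""
    return 4 * x**3 - 6 * x + 1

def gradient_descent(x0, step_size=0.01, n_steps=2000):
    """Plain gradient descent from starting point x0."""
    x = x0
    for _ in range(n_steps):
        x -= step_size * grad(x)
    return x

left = gradient_descent(-2.0)
right = gradient_descent(2.0)
print(f"start -2.0 -> x = {left:.3f}, loss = {loss(left):.3f}")
print(f"start +2.0 -> x = {right:.3f}, loss = {loss(right):.3f}")
```

The two runs settle in different basins, and only one of them reaches the lower of the two minima. A neural net's loss surface is vastly higher-dimensional, so this is only a cartoon of the point.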

If nothing else, the training data will contain sensor noise.

At best you are going to get something that roughly corresponds to human values.

Just because it isn't obvious where the noise entered the system doesn't make it noiseless. Just because you gave the same name to what we actually want and to the value of a neuron in a neural net doesn't make them the same thing.

Consider the large set of references with representative members "What Alice makes long-term plans towards", "What Bob's impulsive actions tend towards", "What Alice says is good and right when her social circle is listening", "What Carl listens to when deciding which politician to vote for", "What news makes Eric instinctively feel good", "What makes Fred press the reward button during AI training", etc.

If these all referred to the same preference ordering over states of the world, then we could call that human values, and have a natural concept.

Trees are a fairly natural concept because "tall green things" and "Lifeforms that are >10% cellulose" point to a similar set of objects. There are many different simple boundaries in concept-space that largely separate trees from non trees. Trees are tightly clustered in thing-space.

To the extent that all those references refer to the same thing, we can't expect an AI to distinguish between them. To the extent that they refer to different concepts, at best the AI will have a separate concept for each.

Suppose you run the microscope AI, and you find that you have a whole load of concepts that kind of match "human values" to different degrees. These represent different people and different embeddings of value. (Of course, "What Carl listens to when deciding which politician to vote for" contains Carl's distrust of political promises. "What makes Fred press the reward button during AI training" includes the time Fred tripped up and slammed the button by accident. Each of the easily accessible concepts is a bit different and includes its own bit of noise.)

Trees are a fairly natural concept because "tall green things" and "Lifeforms that are >10% cellulose" point to a similar set of objects. There are many different simple boundaries in concept-space that largely separate trees from non trees. Trees are tightly clustered in thing-space.

That's not quite how natural abstractions work. There are lots of edge cases which are sort-of-trees-but-sort-of-not: logs, saplings/acorns, petrified trees, bushes, etc. Yet the abstract category itself is still precise.

An analogy: consider a Gaussian cluster model. Any given cluster will have lots of edge cases, and lots of noise in the individual points. But the cluster itself - i.e. the mean and variance parameters of the cluster - can still be precisely defined. Same with the concept of "tree", and (I expect) with "human values".

In general, we can have a precise high-level concept without a hard boundary in the low-level space.

Consider a source of data that is drawn from a sum of several Gaussian distributions. If you have a sufficiently large number of samples from this distribution, you can locate the original Gaussians to arbitrary accuracy. (Of course, if you have a finite number of samples, you will have some inaccuracy in predicting the location of the Gaussians, possibly a lot.)
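Here is a small sketch of that recovery for the simplest case, a 1-D mixture of two Gaussians fitted with textbook EM. The true means, the sample sizes and the initialization at the data's min and max are assumptions of mine for illustration.

```python
import numpy as np

def fit_two_gaussians(x, n_iter=200):
    """EM for a 1-D mixture of two Gaussians; returns (weights, means, stds)."""
    means = np.array([x.min(), x.max()], dtype=float)
    stds = np.array([x.std(), x.std()])
    weights = np.array([0.5, 0.5])
    for _ in range(n_iter):
        # E-step: responsibility of each component for each point.
        # The 1/sqrt(2*pi) constant is omitted because it cancels below.
        dens = weights * np.exp(-0.5 * ((x[:, None] - means) / stds) ** 2) / stds
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate the parameters from the soft assignments.
        total = resp.sum(axis=0)
        weights = total / len(x)
        means = (resp * x[:, None]).sum(axis=0) / total
        stds = np.sqrt((resp * (x[:, None] - means) ** 2).sum(axis=0) / total)
    return weights, means, stds

rng = np.random.default_rng(0)
true_means = np.array([-3.0, 3.0])  # hypothetical ground truth, unit variance

def sample_mixture(n):
    """Draw n points from an equal-weight mixture of the two Gaussians."""
    component = rng.integers(0, 2, size=n)
    return rng.normal(true_means[component], 1.0)

for n in (100, 10_000):
    _, means, _ = fit_two_gaussians(sample_mixture(n))
    print(f"n = {n:>6}: estimated means = {np.round(np.sort(means), 3)}")
```

Individual samples are noisy and some sit between the two clusters, yet with enough data the estimated cluster parameters land close to the true ones. The small-sample run shows the finite-sample inaccuracy mentioned above.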

However, not all distributions share this property. If you look at uniform distributions over rectangles in 2D space, you will find that a uniform L shape can be made in two different ways. More complicated shapes can be made in even more ways. The property that you can uniquely decompose a sum of Gaussians into its individual Gaussians is not a property that applies to every distribution.
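The L-shape claim can be checked directly. In the sketch below (the coordinates are my own choice), two different mixtures of uniform rectangles are evaluated on a grid and compared; both give the same density on the L.

```python
import numpy as np

def uniform_rect_density(px, py, x0, x1, y0, y1):
    """Density of the uniform distribution on [x0, x1) x [y0, y1)."""
    inside = (px >= x0) & (px < x1) & (py >= y0) & (py < y1)
    return inside / ((x1 - x0) * (y1 - y0))

# Grid of cell-centre points covering the square [0, 2) x [0, 2).
ticks = np.arange(0.05, 2.0, 0.1)
px, py = np.meshgrid(ticks, ticks)

# Decomposition A: wide bottom bar (area 2) plus a square stacked on the left.
mix_a = ((2 / 3) * uniform_rect_density(px, py, 0, 2, 0, 1)
         + (1 / 3) * uniform_rect_density(px, py, 0, 1, 1, 2))

# Decomposition B: tall left bar (area 2) plus a square to its right.
mix_b = ((2 / 3) * uniform_rect_density(px, py, 0, 1, 0, 2)
         + (1 / 3) * uniform_rect_density(px, py, 1, 2, 0, 1))

# Both mixtures are the same distribution: density 1/3 on the L, 0 elsewhere.
print(np.allclose(mix_a, mix_b))
```

No amount of data from this distribution can tell you which pair of rectangles "really" generated it, whereas the corresponding question for a mixture of Gaussians does have a unique answer.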

I would expect that whether or not logs, saplings, petrified trees, sparkly plastic Christmas trees, etc. counted as trees would depend on the details of the training data, as well as the network architecture and possibly the random seed.

Note: this is an empirical prediction about current neural networks. I am predicting that if someone takes two networks that have been trained on different datasets, ideally with different architectures, tries to locate the neuron that holds the concept of "tree" in each, and then shows both networks an edge case that is kind of like a tree, the networks will often disagree significantly about how much of a tree it is.