## AI ALIGNMENT FORUM

Alex Turner

Alex Turner, postdoctoral researcher at the Center for Human-Compatible AI. Reach me at turner.alex[at]berkeley[dot]edu.

# Sequences

Thoughts on Corrigibility
The Causes of Power-seeking and Instrumental Convergence
Reframing Impact

# Wiki Contributions

In a human mind, a lot of cognition is happening in diffuse illegible giant vectors, but a key part of the mental architecture squeezes through a low-bandwidth token stream. I'd feel a lot better about where ML was going if some of the steps in their cognition looked like low-bandwidth token streams, rather than giant vectors.

In example B, we need to ensure that our examples actually reward the thing we want, along all relevant dimensions, and do not allow any degrees of freedom to Goodhart; that's the part which is a pointer problem.

Why do you think we need to do this? Do you think the human reward system does that, in order to successfully imbue a person with their values?

the pointer problem part is roughly "specify what I mean well enough that I could use the specification to get an AI to do what I mean", assuming problems like "get AI to follow specification" can be solved.

On the view of this post, is it that we would get a really good "evaluation module" for the AI to use, and the "get AI to follow specification" corresponds to "make AI want to generate plans evaluated highly by that procedure"? Or something else?

As an example, here are three possible reactions to a no-ghost update:

Suppose that many (EDIT: a few) of your value shards take as input the ghost latent variable in your world model. You learn ghosts aren't real. Let's say this basically sets the ghost-related latent variable value to false in all shard-relevant contexts. Then it seems perfectly fine that most of my shards keep on bidding away and determining my actions (e.g. protect my family), since most of my value shards are not in fact functions of the ghost latent variable. While it's indeed possible to contrive minds where most of their values are functions of a variable in the world model which will get removed by the learning process, it doesn't seem particularly concerning to me. (But I'm also probably not trying to tackle the problems in this post, or the superproblems which spawned them.)
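To make the no-ghost case concrete, here is a minimal toy sketch. It assumes (as an illustration only, not a claim about real minds or networks) that each shard is a function from world-model latents to bids on actions, and that bids sum into logits. The shard names, bid sizes, and latent variables are all hypothetical.

```python
import math

def softmax(logits):
    """Turn a dict of action logits into a dict of action probabilities."""
    m = max(logits.values())
    exps = {a: math.exp(v - m) for a, v in logits.items()}
    total = sum(exps.values())
    return {a: e / total for a, e in exps.items()}

# Hypothetical shards: each reads world-model latents and returns bids
# (logit contributions) for or against actions.
def family_shard(wm):
    return {"protect_family": 2.0} if wm["family_in_danger"] else {}

def curiosity_shard(wm):
    return {"explore": 1.0}

def ghost_shard(wm):
    # The only shard that takes the ghost latent as input.
    return {"ward_off_ghosts": 3.0} if wm["ghosts_exist"] else {}

SHARDS = [family_shard, curiosity_shard, ghost_shard]
ACTIONS = ["protect_family", "explore", "ward_off_ghosts"]

def action_distribution(wm):
    """Sum every shard's bids into logits, then normalize."""
    logits = {a: 0.0 for a in ACTIONS}
    for shard in SHARDS:
        for action, bid in shard(wm).items():
            logits[action] += bid
    return softmax(logits)

before = action_distribution({"family_in_danger": True, "ghosts_exist": True})
# The no-ghost update: the ghost latent is set to False everywhere.
after = action_distribution({"family_in_danger": True, "ghosts_exist": False})
```

In this toy, the update silences only the ghost-dependent shard. The other shards bid exactly as before, so `protect_family` becomes the most likely action in `after`.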

There's a small element of inner alignment to this, as well. Although an RL agent such as AIXI will want to wirehead if it forms an "accurate" model of how it gets reward, we can also see this as only one model consistent with the data, another being that reward is actually coming from task achievement (IE, the AI could internalize the intended values). Although this model will usually have at least slightly worse predictive accuracy, we can counterbalance that with process-level feedback which tells the system that's a better way of thinking about it.

This doesn't seem relevant for non-AIXI RL agents which don't end up caring about reward or explicitly weighing hypotheses over reward as part of the motivational structure? Did you intend it to be?

I think that both the easy and hard problem of wireheading are predicated on 1) a misunderstanding of RL (thinking that reward is—or should be—the optimization target of the RL agent) and 2) trying to black-box human judgment instead of just getting some good values into the agent's own cognition. I don't think you need anything mysterious for the latter. I'm confident that RLHF, done skillfully, does the job just fine. The questions there would be more like "what sequence of reward events will reinforce the desired shards of value within the AI?" and not "how do we philosophically do some fancy framework so that the agent doesn't end up hacking its sensors or maximizing the quotation of our values?".
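As a rough illustration of what I mean by reward reinforcing computations rather than being the agent's target, here is a minimal REINFORCE sketch on a two-armed bandit. The task, learning rate, and step count are hypothetical choices of mine. The thing to notice is that reward only appears as a scalar multiplier on the update to the logits; the policy itself contains no representation of reward.

```python
import math
import random

def softmax(logits):
    """Turn a list of logits into a list of probabilities."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def train_bandit(steps=2000, lr=0.1, seed=0):
    rng = random.Random(seed)
    logits = [0.0, 0.0]      # the policy's only parameters
    rewards = [0.0, 1.0]     # hypothetical task: only arm 1 is rewarded
    for _ in range(steps):
        probs = softmax(logits)
        action = 0 if rng.random() < probs[0] else 1
        reward = rewards[action]
        # REINFORCE: d/d(logit_k) log pi(action) = 1[k == action] - probs[k].
        # Reward scales how strongly the just-taken action is reinforced.
        for k in range(2):
            grad_log_prob = (1.0 if k == action else 0.0) - probs[k]
            logits[k] += lr * reward * grad_log_prob
    return softmax(logits)

final_probs = train_bandit()
```

After training, `final_probs` puts most of its mass on arm 1. That happens because pulls of arm 1 were reinforced, not because the policy computes or pursues expected reward. This is a toy, and it does not by itself settle what larger RL-trained systems end up caring about.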

This is called an observation-utility maximizer (to contrast it with reinforcement learning). Daniel Dewey goes on to show that we can incorporate uncertainty about the utility function into observation-utility maximizers, recovering the kind of "learning what is being rewarded" that RL agents were supposed to provide, but without the perverse incentive to try and make the utility turn out to be something easy to maximize.

I think that Dewey is wrong about RL agents having this problem in general. Dewey wrote (emphasis mine):

Reinforcement learning, we have argued, is not an adequate real-world solution to the problem of maximizing an initially unknown utility function. Reinforcement learners, by definition, act to maximize their expected observed rewards; they may learn that human goals are in some cases instrumentally useful to high rewards, but this dynamic is not tenable for agents of human or higher intelligence, especially considering the possibility of an intelligence explosion.

The trouble with the reinforcement learning notion (1) is that it can only prefer or disprefer future interaction histories on the basis of the rewards they contain. Reinforcement learning has no language in which to express alternative final goals, discarding all non-reward information contained in an interaction history.

I will go out on a limb and guess that this paper is nearly entirely wrong in its key claims. The same goes for "Avoiding Wireheading with Value Reinforcement Learning":

How can we design good goals for arbitrarily intelligent agents? Reinforcement learning (RL) is a natural approach. Unfortunately, RL does not work well for generally intelligent agents, as RL agents are incentivised to shortcut the reward sensor for maximum reward – the so-called

The basic idea behind compressed pointers is that you can have the abstract goal of cooperating with humans, without actually knowing very much about humans. In a sense, this means having aligned goals without having the same goals: your goal is to cooperate with "human goals", but you don't yet have a full description of what human goals are. Your value function might be much simpler than the human value function.

In machine-learning terms, this is the question of how to specify a loss function for the purpose of learning human values.

Insofar as I understand your point, I disagree. In machine-learning terms, this is the question of how to train an AI whose internal cognition reliably unfolds into caring about people, in whatever form that takes in the AI's learned ontology (whether or not it has a concept for people). If you commit to the specific view of outer/inner alignment, then now you also want your loss function to "represent" that goal in some way.

humans seem to correctly identify what each other want and believe, quite frequently. Therefore, humans must have prior knowledge which helps in this task. If we can encode those prior assumptions in an AI, we can point it in the right direction.

I doubt this due to learning from scratch. I think the question of "how do I identify what you want, in terms of a utility function?" is a bit sideways due to people not in fact having utility functions.[1] Insofar as the question makes sense, its answer probably takes the form of inductive biases: I might learn to predict the world via self-supervised learning and form concepts around other people having values and emotional states due to that being a simple convergent abstraction relatively pinned down by my training process, architecture, and data over my life, also reusing my self-modelling abstractions. It would be quite unnatural to model myself in one way (as valuing happiness) and others as having "irrational" shards which "value" anti-happiness but still end up behaving as if they value happiness. (That's not a sensible thing to say, on my ontology.)

This presents a difficulty if another agent wishes to help such an agent, but does not share its ontological commitments.

I think it's worth considering how I might go about helping a person from an uncontacted tribe who doesn't share my ontology. Conditional on them requesting help from me somehow, and my wanting to help them, and my deciding to do so—how would I carry out that process, internally?

(Not reading the rest at the moment, may leave more comments later)

1. ^

Human values take the form of decision-influences (i.e. shards) which increase or decrease the probability of mental events and decisions (buying ice cream, thinking for another minute). There is no such thing as an anti-ice-cream shard which is perfectly anti-rational in that it bids "against its interests", bidding for ice cream and against avoiding ice cream. That's just an ice cream shard. Goals and rationality are not entirely separate, in people.

I feel confused. I think this comment is overall good (though I don't think I understand a third of it), but doesn't seem to suggest the genome actually solved information inaccessibility in the form of reliably locating learned WM concepts in the human brain?

But, do you really fundamentally care that your kids have genomes?

Seems not relevant? I think we're running into an under-definition of IGF (and the fact that it doesn't actually have a utility function, even over local mutations on a fixed genotype). Does IGF have to involve genomes, or just information patterns as written in nucleotides or in binary? The "outer objective" of IGF suffers a classic identifiability issue common to many "outer objectives", where the ancestral "training signal" history is fully compatible with "IGF just for genomes" and also "IGF for all relevant information patterns made of components of your current pattern."

(After reading more, you later seem to acknowledge this point -- that evolution wasn't "shouting" anything about genomes in particular. But then why raise this point earlier?)

Now, there's a reasonable counterargument to this point, which is that there's no psychologically-small tweak to human psychology that dramatically increases that human's IGF. (We'd expect evolution to have gathered that low-hanging fruit.)

I don't know if I disagree, it depends what you mean here. If "psychologically small" is "small" in a metric of direct tweaks to high-level cognitive properties (like propensity to cheat given abstract knowledge of resources X and mating opportunities Y), then I think that isn't true. By information inaccessibility, I think that evolution can't optimize directly over high-level cognitive properties.

Optima often occur at extremes, and concepts tend to differ pretty widely at the extremes, etc. When the AI gets out of the training regime and starts really optimizing, then any mismatch between its ends and our values are likely to get exaggerated.

This kind of argument seems sketchy to me. Doesn't it prove too much? Suppose there's a copy of me which also values coffee to the tune of \$40/month and reflectively endorses that value at that strength. Are my copy and I now pairwise misaligned in any future where one of us "gets out of the training regime and starts really optimizing"? (ETA: that is, significantly more pairwise misaligned than I would be with an exact copy of myself in such a situation. For more selfish people, I imagine this prompt would produce misalignment due to some desires like coffee/sex being first-person.)

And all this is to say nothing about how humans' values are much more complex and fragile than IGF, and thus much trickier to transmit

Complexity is probably relative to the learning process and inductive biases in question. While any given set of values will be difficult to transmit in full (which is perhaps your point), the fact that humans did end up with their values shows evidence that human values are the kind of thing which can be transmitted/formed easily in at least one architecture.

I often get the impression that people weigh off e.g. doing shard theory alignment strategies under the shard theory alignment picture, versus inner/outer research under the inner/outer alignment picture, versus...

And insofar as this impression is correct, this is a mistake. There is only one way alignment is.

If inner/outer is altogether a more faithful picture of those dynamics:

• relatively coherent singular mesa-objectives form in agents, albeit not necessarily always search-based
• more fragility of value and difficulty in getting the mesa objective just right, with little to nothing in terms of "consolation prizes" for slight mistakes in value loading
• possibly low path dependence on the update process

then we have to solve alignment in that world.

If shard theory is altogether more faithful, then we live under those dynamics:

• agents learn contextual distributions of values around e.g. help people or acquire coins, some of which cohere and equilibrate into the agent's endorsed preferences and eventual utility function
• something like values handshakes and inner game theory occurs in AI
• we can focus on getting a range of values endorsed and thereby acquire value via being "at the bargaining table" via some human-compatible values representing themselves in the final utility function
• which implies meaningful success and survival from "partial alignment"

And under these dynamics, inner and outer alignment are antinatural hard problems.

Or maybe neither of these pictures is correct and reasonable, and alignment is some other way.

But either way, there's one way alignment is. And whatever way that is, it is against that anvil that we hammer the AI's cognition with loss updates. When considering a research agenda, you aren't choosing a background set of alignment dynamics as well.

FWIW I think the most important distinction in "alignment" is aligning with somebody's preferences versus aligning with what is actually good, and I increasingly have the sense that the former does not lead in any limit to the latter.

I have an upcoming post which might be highly relevant. Many proposals which black-box human judgment / model humans, aren't trying to get an AI which optimizes what people want. They're getting an AI to optimize evaluations of plans—the quotation of human desires, as quoted via those evaluations. And I think that's a subtle distinction which can prove quite fatal.

I think that's also a good thing to think about, but most of the meat is in how you actually reason about that and how it leads to superior or at least adequate+complementary predictions about the behavior of ML systems. I think to the extent this perspective is useful for alignment it also ought to be useful for reasoning about the behavior of existing systems like large language models

Sure. To clarify, superior to what? "GPT-3 reliably minimizes prediction error; it is inner-aligned to its training objective"?