Summary: What does it mean for a loss function to be "aligned with" human goals? I perceive four different concepts which involve "loss function" in importantly different ways:

1. Physical-loss: The physical implementation of a loss function and the loss computations,
2. Mathematical-loss: The mathematical idealization of a loss function,
3. A loss function "encoding/representing/aligning with" an intended goal, and
4. Agents which "care about achieving low loss."

I advocate retaining physical- and mathematical-loss. I advocate dropping 3 in favor of talking directly about desired AI cognition and how the loss function entrains that cognition. I advocate disambiguating 4, because it can refer to a range of physically grounded preferences about loss (e.g. low value at the loss register versus making perfect future predictions).

Related: Towards deconfusing wireheading and reward maximization.[1] I'm going to talk about "loss" instead of "reward", but the lessons apply to both.

I think it's important to maintain a sharp distinction between the following four concepts.

# 1: Physically implemented loss

The loss function updated my network.

This is a statement about computations embedded in physical reality. This statement involves the physically implemented sequence of loss computations which stream in throughout training. For example, consider the computations engendered by `loss_fn = torch.nn.CrossEntropyLoss()`.
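To make "the loss function updated my network" concrete, here is a minimal, dependency-free sketch of one such stream of loss computations. It is only an illustration: the toy linear "network", the two-example dataset, and the names `sgd_step` and `losses` are assumptions introduced here, and the hand-written cross-entropy stands in for `torch.nn.CrossEntropyLoss()`.

```python
import math

def softmax(logits):
    """Turn raw scores into a probability distribution over labels."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, label):
    """CE loss for a single example: negative log-probability of the true label."""
    return -math.log(probs[label])

def sgd_step(weights, x, label, lr=0.5):
    """One executed loss computation plus the parameter update it causes.

    weights: one row of feature weights per class (a toy linear 'network').
    Returns the computed loss value and the updated weights.
    """
    logits = [sum(w_i * x_i for w_i, x_i in zip(row, x)) for row in weights]
    probs = softmax(logits)
    loss = cross_entropy(probs, label)
    new_weights = []
    for k, row in enumerate(weights):
        # For softmax outputs, d(CE)/d(logit_k) = probs[k] - one_hot(label)[k].
        grad_logit = probs[k] - (1.0 if k == label else 0.0)
        new_weights.append([w - lr * grad_logit * x_i for w, x_i in zip(row, x)])
    return loss, new_weights

# Hypothetical two-example dataset of (features, label) pairs.
data = [([1.0, 0.0], 0), ([0.0, 1.0], 1)]
weights = [[0.0, 0.0], [0.0, 0.0]]
losses = []  # the stream of loss values computed during training
for _ in range(20):
    for x, label in data:
        loss, weights = sgd_step(weights, x, label)
        losses.append(loss)
```

Each entry of `losses` is a number that was actually computed on the machine, and each call to `sgd_step` is a point where that computation changed the parameters. Nothing in this loop refers to what the network "wants".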

# 2: Mathematical loss

The loss function is a smooth function of the prediction distribution.

This is a statement about the idealized mathematical loss function. These are the mathematical objects you can prove learning theory results about. The Platonic idealization of the learning problem and the mathematical output-grading rule casts a shadow into your computer via its real-world implementation (concept 1).

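For example, $\mathcal{L}(\theta) := \frac{1}{|D|} \sum_{(x, y) \in D} \ell_{\text{CE}}\big(p_\theta(x), y\big)$, where $D$ is the mathematical idealization of the MNIST dataset, where the $x$ are the idealized grayscale MNIST images and $p_\theta(x)$ is the network's label prediction distribution for image $x$. And $\ell_{\text{CE}}$ is the mathematical function of cross-entropy (CE) loss between a label prediction distribution and the ground-truth labels.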
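As a small illustration of the "smooth function of the prediction distribution" claim, the sketch below writes cross-entropy as a pure function and checks its analytic derivative against a finite difference. The function names (`ce_loss`, `dataset_loss`, `d_ce_d_p`) and the numbers are assumptions chosen for the example. Note that the code is itself a floating-point implementation (concept 1) that only approximates the mathematical object (concept 2).

```python
import math

def ce_loss(pred_dist, label):
    """Cross-entropy between a predicted label distribution and a true label."""
    return -math.log(pred_dist[label])

def dataset_loss(predict, dataset):
    """Average CE loss of `predict` (input -> distribution) over (input, label) pairs."""
    return sum(ce_loss(predict(x), y) for x, y in dataset) / len(dataset)

def d_ce_d_p(pred_dist, label):
    """Analytic derivative of CE with respect to the probability of the true label."""
    return -1.0 / pred_dist[label]

# Finite-difference check at p = 0.7. Only the true-label coordinate is
# perturbed, so the perturbed vectors are deliberately not renormalized.
p, eps = 0.7, 1e-6
numeric = (ce_loss([p + eps, 1 - p], 0) - ce_loss([p - eps, 1 - p], 0)) / (2 * eps)
analytic = d_ce_d_p([p, 1 - p], 0)

# A predictor that is uniform over ten labels, on two placeholder examples.
uniform_loss = dataset_loss(lambda x: [0.1] * 10, [("img_a", 3), ("img_b", 7)])
```

Here `numeric` and `analytic` agree up to rounding, and `uniform_loss` comes out at roughly $\log 10$, as the mathematical definition says it should.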

# 3: Loss functions "representing" goals

I want a loss function which is aligned with the goal of "write good novels."

This is an aspirational statement about achieving some kind of correspondence between the loss function and the goal of writing good novels. But what does this statement mean?

Suppose you tell me "I have written down a loss function $\ell_{\text{novel}}$ which is perfectly aligned with the goal of 'write good novels'." What experiences should this claim lead me to anticipate?

1. That an agent can only achieve low physical-loss (concept 1) if it has, in physical fact, written a good novel?
2. That in some mathematical idealization of the learning problem (concept 2), loss-minimization only occurs when the agent outputs text which would be found in what we rightly consider to be "good novels"? (But in which mathematical idealization?)
3. That, as a matter of physical fact, if you train an agent on $\ell_{\text{novel}}$ using a given learning setup, then you produce a language model which can be easily prompted to output high-quality novels?

The imprecision comes from loss functions not directly encoding goals. Loss signals are physically implemented (concept 1) parts of the AI's training process which (physically) update the AI's cognition in certain ways. While a loss function can be involved in the AI's decision-making structure (see items 1 and 2 above), additional information is needed to understand what motivational cognition is being discussed.

I think that talking about loss functions being "aligned" encourages bad habits of thought at best, and is nonsensical at worst. I think it makes way more sense to say how you want the agent to think and then act (e.g. "write good novels", the training goal in Evan Hubinger's training stories framework) and why you think you can use a given loss function $\ell_{\text{novel}}$ to produce that cognition in the agent (the training rationale).

# 4: Agents which want to minimize loss

The agent wants to minimize its loss.

What does this mean in terms of the AI's generalization properties, in terms of how the AI thinks and is motivated?

1. The agent might care about maintaining a low value at the loss register and e.g. compute on-policy value to be the negative of the expected time-discounted register value, and then (somehow) select plans on that basis.
2. Or the agent might care about outputting physical predictions which match the future physical labels, making tradeoffs between misprediction events in a way which accords with e.g. the cross-entropy loss (the utility of a prediction being the negative of its CE loss).
3. Or something else.

An agent can internally "care" about "optimizing a loss function" (concept 4), but that caring is not implemented via loss computations which provide cognitive updates (concept 1).
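The first two flavors of "wanting low loss" listed above can be pulled apart with a toy example. This is only a sketch under invented assumptions: `WorldState`, the two utility functions, and the tampering scenario are hypothetical stand-ins, not a claim about how any real agent represents its preferences.

```python
import math
from dataclasses import dataclass

@dataclass
class WorldState:
    """Hypothetical toy world: the agent's prediction and the stored loss value."""
    prob_on_true_label: float  # probability the agent assigned to what happened
    loss_register: float       # the number physically stored where loss is written

def honest_register(prob_on_true_label):
    """What the register holds when nobody interferes with it: the CE loss."""
    return -math.log(prob_on_true_label)

def register_utility(state):
    """Flavor 1: the agent cares only that the stored number is low."""
    return -state.loss_register

def prediction_utility(state):
    """Flavor 2: the agent cares about predicting well, traded off as CE would."""
    return math.log(state.prob_on_true_label)  # equals the negative CE loss

good_prediction = WorldState(0.9, honest_register(0.9))
tampered = WorldState(0.1, 0.0)  # poor prediction, register overwritten with zero
```

Under `register_utility` the tampered world is preferred, while under `prediction_utility` the honest good prediction is preferred. Both agents could be loosely described as "wanting to minimize loss", which is why the phrase needs disambiguating.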

# Confusion resulting from ambiguity

Here is a quote which I, as a reader, have trouble understanding:

"outer alignment refers to aligning the specified loss function with the intended goal, whereas inner alignment refers to aligning the mesa-objective of a mesa-optimizer with the specified loss function." (Risks from Learned Optimization: Introduction)

I don't know what inner cognition would or would not satisfy this vague-to-me definition. What does it mean for a loss function to be "aligned" with the intended goal?

Suppose you told me, "TurnTrout, we have definitely produced a loss function which is aligned with the intended goal, and inner-aligned an agent to that loss function."

• Suppose I went to the computer where the miraculous loss function is implemented. What should I expect to see when I read the code / do interpretability on the trained loss-model?
• Suppose I inspected the model data for the purportedly inner-aligned agent, and used powerful interpretability tools on the model. What kind of cognition would I expect to find?

These two points matter. For example, see section 3 for three concrete operationalizations of "loss function is aligned" which imply substantially different generalization behavior from the AI. In that kind of scenario, the AI's generalization properties sensitively depend both on the loss function which trained the model, and on the cognition which is aligned to that model. Usually, when I read writing on outer/inner alignment, I end up feeling confused about what exactly I'm supposed to be imagining.[2] I often wonder whether there exists a concrete example at all!

As I said above:

I think that talking about loss functions being "aligned" encourages bad habits of thought at best, and is nonsensical at worst. I think it makes way more sense to say how you want the agent to think (e.g. "write good novels", the training goal in Evan Hubinger's training stories framework) and why you think you can use a given loss function $\ell_{\text{novel}}$ to produce that cognition in the agent (the training rationale).

# Conclusion

I think it makes the most sense to use "loss" to refer to physical-loss and mathematical-loss. I think we should stop talking about loss/reward as "representing" a goal because it invites imprecise thinking and communication. I think it can be OK to use motivated-by-loss (#4) as shorthand, but that it's important to clarify what flavor of loss minimization you're talking about.

Thanks to Garrett Baker for feedback on a draft.

1. ^

Leo Gao wrote:

The objective that any given policy appears to optimize is its behavioral objective[...]

There are in fact many distinct possible policies with different behavioral objectives for the RL algorithm to select for: there is a policy that changes the world in the “intended” way so that the reward function reports a high value, or one that changes the reward function such that it now implements a different algorithm that returns higher values, or one that changes the register the output from the reward function is stored in to a higher number, or one that causes a specific transistor in the processor to misfire, etc. All of these policies optimize some thing in the outside world (a utility function); for instance, the utility function that assigns high utility to a particular register being a large number. The value of the particular register is a fact of the world. [...]

However, when we try to construct an RL policy that has as its behavioral objective the “reward”, we encounter the problem that it is unclear what it would mean for the RL policy to “care about” reward, because there is no well defined reward channel in the embedded setting. We may observe that all of the above strategies are instrumental to having the particular policy be picked by the RL algorithm as the next policy used by the agent, but this is a utility over the world as well (“have the next policy implemented be this one”), and in fact this isn’t really much of a reward maximizer at all, because it explicitly bypasses reward as a concept altogether! In general, in an embedded setting, any preference the policy has over “reward” (or “observations”) can be mapped onto a preference over facts of the world.[3]

My summary: "The agent cares about reward" is inherently underdefined, and precision matters here.

2. ^
3. ^

As I understand this quote, I think it's somewhat more complicated. I have some vague desire to acausally cooperate with certain entities in other Everett branches and even in (if they exist) different universes governed by different physical laws. My values do not have to steer my decision-making based only on my beliefs about physical facts, although they often seem to.

# Comments

I think that talking about loss functions being "aligned" encourages bad habits of thought at best, and is nonsensical at worst. I think it makes way more sense to say how you want the agent to think and then act (e.g. "write good novels", the training goal in Evan Hubinger's training stories framework) and why you think you can use a given loss function $\ell_{\text{novel}}$ to produce that cognition in the agent (the training rationale).

Very much agree with this.

Suppose you told me, "TurnTrout, we have definitely produced a loss function which is aligned with the intended goal, and inner-aligned an agent to that loss function." [What should I expect to see?]

If a person said this to me, what I would expect (if the person was not mistaken in their claim) is that they could explain an insight to me about what it means for an algorithm to "achieve a goal" like "writing good novels" and how they had devised a training method to find an algorithm that matches this operationalization. It is precisely because I don't know what alignment means that I think it's helpful to have some hand-hold terms like "alignment" to refer to the problem of clarifying this thing that is currently confusing.

I don't really disagree with anything you've written, but, in general, I think we should allow some of our words to refer to "big confusing problems" that we don't yet know how to clarify, because we shouldn't forget about the part of the problem that is deeply confusing, even as we incrementally clarify and build inroads towards it.

because I don't know what alignment means that I think it's helpful to have some hand-hold terms like "alignment"

Do you mean "outer/inner alignment"?

Supposing you mean that—I agree that it's good to say "and I'm confused about this part of the problem", while also perhaps saying "assuming I've formulated the problem correctly at all" and "as I understand it."

I don't really disagree with anything you've written, but, in general, I think we should allow some of our words to refer to "big confusing problems" that we don't yet know how to clarify, because we shouldn't forget about the part of the problem that is deeply confusing, even as we incrementally clarify and build inroads towards it.

Sure. However, in future posts, I will further contend that outer and inner alignment is not an appropriate or natural decomposition of the alignment problem. In my opinion, reifying these terms and reasoning from this frame increases our confusion and tacitly assumes away more promising approaches. (That's not to say that there's no one ever who is thinking reasonable and concrete thoughts from that frame. But my actual complaint stands.)

in future posts, I will further contend that outer and inner alignment is not an appropriate or natural decomposition of the alignment problem

Wonderful! I don't have any complaints per se about outer/inner alignment, but I use it relatively rarely in my own thinking, and it has resolved relatively few of my confusions about alignment.

FWIW I think the most important distinction in "alignment" is aligning with somebody's preferences versus aligning with what is actually good, and I increasingly have the sense that the former does not lead in any limit to the latter.

FWIW I think the most important distinction in "alignment" is aligning with somebody's preferences versus aligning with what is actually good, and I increasingly have the sense that the former does not lead in any limit to the latter.

I have an upcoming post which might be highly relevant. Many proposals which black-box human judgment / model humans, aren't trying to get an AI which optimizes what people want. They're getting an AI to optimize evaluations of plans—the quotation of human desires, as quoted via those evaluations. And I think that's a subtle distinction which can prove quite fatal.

Right. Many seem to assume that there is a causal relationship good -> human desires -> human evaluations. They are hoping both that if we do well according to human evaluations then we will be satisfying human desires, and that if we satisfy human desires, we will create a good world. I think both of those assumptions are questionable.

I like the analogy in which we consider an alternative world where AI researchers assumed, for whatever parochial reason, that it was actually human dreams that should guide AI behavior. In this world, they ask humans to write down their dreams, and try to devise AIs that would make the world like that. There are two assumptions here: (1) that making the world more like human dreams would be good, and (2) that humans can correctly report their dreams. In the case of dreams, both of these assumptions are suspect, right? But what exactly is the difference with human desires? Why do we assume that either they are a guide to what is good or can be reported accurately?

I think it's often helpful to talk about whether a policy which behaves badly (i.e. which selects actions that have consequences we dislike, or like less than some alternative actions that it "could have" selected) will receive a high loss. One reason this is helpful is that we expect SGD to correct behaviors that lead to a high loss, given enough time, and so if bad behavior gets a high loss then we only need to think about optimization difficulties or cases where the failures can be catastrophic before SGD can correct them. Another reason this is helpful is that if we use SGD to select policies that get a low loss across a broad range of training environments, we have prima facie reason to expect the resulting policies to get a low loss in other environments.

As an aside: I think the physical implementation of the loss function is most relevant if your loss function is based on the reward that results from executing a given action (or on predicting actions based on their measured consequences). If your AI is instead trained via imitation learning or process-based feedback, the physical implementation of the loss does not seem to have special significance, e.g. SGD will not select policies that intervene in the physical world in order to change the physical implementation of their loss. (Though of course there can be policies that intervene in the physical world in order to change the dynamics of the learning process itself, and in some sense no physically implemented process could select for loss: a cognitive pattern that simply grabs resources by means unrelated to the loss function will ultimately be favored by the physical world no matter what training code you wrote down.)

Note that this is all less relevant to ARC because our goal is to find strategies for which we can't tell a plausible story about why the resulting AI would tend to take creative and coherent actions to kill you. From our perspective, it's evidently plausible for SGD to find a system that reasons explicitly about how to achieve a low loss (and so if such systems would kill you then that's a problem), as well as plausible for SGD to find a system that behaves in a completely different way and can't even be described as pursuing goals (so if you are relying on goal-directed behavior of any kind then that's a problem). So we can't really rely on any approaches like the ones you are either advocating or critiquing. Of course most people consider our goal completely hopeless, and I'm only saying this to clarify how ARC thinks about these things.

I think that's also a good thing to think about, but most of the meat is in how you actually reason about that and how it leads to superior or at least adequate+complementary predictions about the behavior of ML systems. I think to the extent this perspective is useful for alignment it also ought to be useful for reasoning about the behavior of existing systems like large language models.

Sure. To clarify, superior to what? "GPT-3 reliably minimizes prediction error; it is inner-aligned to its training objective"?

I'd describe the alternative perspective as: we try to think of GPT-3 as "knowing" some facts and having certain reasoning abilities. Then to predict how it behaves on a new input, we ask what the best next-token prediction is about the training distribution, given that knowledge and reasoning ability.

Of course the view isn't "this is always what happens," it's a way of making a best guess. We could clarify how to set the error bars, or how to think more precisely about what "knowledge" and "reasoning abilities" mean. And our predictions depend on our prior over what knowledge and reasoning abilities models will have, which will be informed by a combination of estimates of algorithmic complexity of behaviors and bang-for-your-buck for different kinds of knowledge, but will ultimately depend on a lot of uncertain empirical facts about what kind of thing language models are able to learn. Overall I acknowledge you'd have to say a lot more to make this into something fully precise, and I'd guess the same will be true of a competing perspective.

I think this is roughly how many people make predictions about GPT-3, and in my experience it generally works pretty well and many apparent errors can be explained by more careful consideration of the training distribution. If we had a contest where you tried to give people short advice strings to help them predict GPT-3's behavior, I think this kind of description would be an extremely strong entry.

This procedure is far from perfect. So you could imagine something else doing a lot better (or providing significant additional value as a complement).

I interpret you as making the claim (across this and your other recent posts): don't expect policies to get a low loss just because they were selected for getting a low loss, instead think about how SGD steps will shape what they are "trying" to do and use that to reason directly about their generalization behavior.

Yeah... I interpret TurnTrout as saying "look I know it seems straightforward to say that we are optimizing over policies rather than building policies that optimize for reward, but actually this difference is incredibly subtle". And I think he's right that this exact point has the kind of subtlety that just keeps biting again and again. I have the sense that this distinction held up evolutionary biology for decades.

Nevertheless, yes, as you say, the question is how to in fact reason from "policies selected according to such-and-such loss" to "any guarantees whatsoever about general behavior of policy". I wish we could say more about why this part of the problem is so ferociously difficult.

I generally agree. In the human case, I sometimes have conversations where we’re discussing (let’s say) the fact that sweet tastes trigger rewards via (blah blah circuits in the brainstem), and the person says “this circuit wants us to eat sweet food, oh wait, maybe I should say that it wants sweet taste on our tongue? Or—” and then I say, “it’s just a simple input-output circuit, it does whatever it does, it doesn’t “want” anything in the real world”.

On the other hand, suppose there’s an intelligent designer (say, a human programmer), and they make a reward function R hoping that they will wind up with a trained AGI that’s trying to do X (where X is some idea in the programmer’s head), but they fail and the AGI is trying to do not-X instead. If R only depends on the AGI’s external behavior (as is often the case in RL these days), then we can imagine two ways that this failure happened:

1. The AGI was doing the wrong thing but got rewarded anyway (or doing the right thing but got punished)
2. The AGI was doing the right thing for the wrong reasons but got rewarded anyway (or doing the wrong thing for the right reasons but got punished).

I think it’s useful to catalog possible failures based on whether they involve (1) or (2), and I think it’s reasonable to call them “failures of outer alignment” and “failures of inner alignment” respectively, and I think when (1) is happening rarely or not at all, we can say that the reward function is doing a good job at “representing” the designer’s intention—or at any rate, it’s doing as well as we can possibly hope for from a reward function of that form. The AGI still might fail to acquire the right motivation, and there might be things we can do to help (e.g. change the training environment), but replacing R (which fires exactly to the extent that the AGI’s external behavior involves doing X) by a different external-behavior-based reward function R’ (which sometimes fires when the AGI is doing not-X, and/or sometimes doesn’t fire when the AGI is doing X) seems like it would only make things worse. So in that sense, it seems useful to talk about outer misalignment, a.k.a. situations where the reward function is failing to “represent” the AGI designer’s desired external behavior, and to treat those situations as generally bad (even while acknowledging that we’ll never get rid of such situations completely thanks to wireheading).

I think "outer alignment failure" is confusing terminology at this point—always requiring clarification, and then storing "oh yeah, 'outer alignment failure' means the wrong thing got rewarded as a matter of empirical fact." Furthermore, words are sticky, and lend some of their historical connotations to color our thinking. Better to just say "R rewards bad on-training behavior in situations A, B, C" or even "bad action rewarded", which compactly communicates the anticipation-constraining information.

Similarly, "inner alignment failure" (2) -> "undesired inner cognition reinforced when superficially good action performed" (we probably should get a better compact phrase for this one).

With humans in the loop, there actually is a way to implement $\ell_{\text{novel}}$. Unfortunately, computing the function takes as long as it takes for several humans to read a novel and aggregate their scores. And there's also no way to compute the gradient. So by that point, it's pretty much just a reinforcement learning signal.

However, you could use that human feedback to train a side network to predict the reward signal based on what the AI generates. This second network would then essentially compute a custom loss function (asymptotically approaching $\ell_{\text{novel}}$ with more human feedback) that is amenable to gradient descent and can run far more quickly. That's basically the idea behind reward modeling (https://youtube.com/watch?v=PYylPRX6z4Q).
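A heavily simplified sketch of that idea, assuming a linear model in place of the side network and a single hand-picked text feature in place of a learned encoder. The ratings are made-up placeholders, and every name here (`features`, `predict_reward`, `fit_reward_model`, `human_ratings`) is hypothetical.

```python
def features(text):
    """Hypothetical features of a text: a bias term and the share of distinct words."""
    words = text.split()
    return [1.0, len(set(words)) / max(len(words), 1)]

def predict_reward(weights, text):
    """Cheap learned stand-in for the slow human score."""
    return sum(w * f for w, f in zip(weights, features(text)))

def mse(weights, rated_texts):
    """Mean squared error between predicted and human scores."""
    return sum((predict_reward(weights, t) - s) ** 2 for t, s in rated_texts) / len(rated_texts)

def fit_reward_model(rated_texts, steps=2000, lr=0.1):
    """Fit the linear reward model to (text, human_score) pairs by gradient descent."""
    weights = [0.0, 0.0]
    n = len(rated_texts)
    for _ in range(steps):
        grads = [0.0, 0.0]
        for text, score in rated_texts:
            err = predict_reward(weights, text) - score
            for i, f in enumerate(features(text)):
                grads[i] += 2.0 * err * f / n
        weights = [w - lr * g for w, g in zip(weights, grads)]
    return weights

# Made-up placeholder ratings standing in for aggregated human scores.
human_ratings = [
    ("the the the the", 1.0),
    ("the cat saw the cat", 2.0),
    ("a quiet storm gathered over the harbor", 4.0),
]
reward_weights = fit_reward_model(human_ratings)
```

Once fitted, `predict_reward` can be evaluated on new text in microseconds. The sketch also makes the worry visible: the fitted model rewards whatever its features happen to track (here, lexical variety), which is not the same thing as the novel being good.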

But yeah, framing such goals as loss functions probably gives the wrong intuition for how to approach aligning with them.

Interesting. I have the sense that we would have to get humans to reflect for years after reading a novel to produce a rating that, if optimized, would produce truly great novels. I think that when a novel really moves a person (or, even more importantly, moves a whole culture), it's not at all evident that this has happened until (often) years after-the-fact.

I also have the sense that part of what makes a novel great is that a person or a culture decide to associate a certain beautiful insight with it due to the novel's role in provoking that insight. But usually the novel is only partly responsible for the insight, and in part we choose to make the novel beautiful by associating it in our culture with a beautiful thing (and this associating of beautiful things is a good and honest thing to do).

Well, then computing $\ell_{\text{novel}}$ would just take a really long time.

So, it's not impossible in principle if you trained the loss function as I suggested (loss function trained by reinforcement learning, then applied to train the actual novel-generating model), but it is a totally impractical approach.

If you really wanted to teach an AI to generate good novels, you'd probably start by training an LLM to imitate existing novels through some sort of predictive loss (e.g., categorical cross-entropy on next-token prediction) to give it a good prior. Then train another LLM to predict reader reviews or dissertations written by literary grad students, using the novels they're based on as inputs, again with a similar predictive loss. (Pretraining both LLMs on some large corpus (as with GPT) could probably help with providing necessary cultural context.) At the same time, use Mechanical Turk to get thousands of people to rate the sentiment of every review/dissertation, then train another LLM to predict the sentiment scores of all raters (or a low-dimensional projection of all their ratings), using the reviews/dissertations as input and something like MSE loss to predict sentiment scores as output. Then chain these latter two networks together to compute $\ell_{\text{novel}}$, to act as the prune to the first network's babble, and train to convergence.
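The chaining step can be sketched structurally as below. Everything here is a toy assumption: the two "models" are keyword rules standing in for trained LLMs, the candidate texts are invented, and the sketch only shows pruning among candidates, not training the generator to convergence.

```python
def make_novel_loss(review_model, sentiment_model):
    """Chain a review predictor and a sentiment predictor into one scalar loss."""
    def novel_loss(novel_text):
        predicted_review = review_model(novel_text)
        return -sentiment_model(predicted_review)  # higher sentiment means lower loss
    return novel_loss

def prune(candidates, loss):
    """Keep the babbled candidate with the lowest loss."""
    return min(candidates, key=loss)

def toy_review_model(novel_text):
    """Toy stand-in for the review-predicting LLM."""
    return "moving and coherent" if "because" in novel_text else "flat"

def toy_sentiment_model(review_text):
    """Toy stand-in for the sentiment-predicting LLM."""
    return 1.0 if "moving" in review_text else 0.0

novel_loss = make_novel_loss(toy_review_model, toy_sentiment_model)
babble = ["She left.", "She left because the letter never came."]
best = prune(babble, novel_loss)
```

The composed `novel_loss` is only as good as its two components, so any systematic error in the predicted reviews or in the predicted sentiment is passed straight through to what gets selected.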

Honestly, though, I probably still wouldn't trust the resulting system to produce good novels (or at least not with internally consistent plots, characterizations, and themes) if the LLMs were based on a Transformer architecture.

Honestly, though, I probably still wouldn't trust the resulting system [...] if the LLMs were based on a Transformer architecture.

Interesting - why is that?