Generated as part of SERI MATS, under John Wentworth. Thanks to Alex Turner and Garrett Baker for related discussion, and to Justis Mills for draft feedback.

All bold claims and ignorant mistakes herein are my own.

Do you remember when Evan Hubinger became really enamored with 'training stories,' a way of carving up the alignment problem into 'training rationales' and 'training goals'? Evan's idea was that we ought to think of alignment as choosing a target model that we want to end up with after training, plus choosing a training procedure that will actually yield that model. In my prior experience talking to people about this way of framing the alignment problem… people didn't especially get it. The typical response of those who had heard of this was, "Yeah, that's one way to carve up the problem apart from, e.g., inner and outer alignment. But so what? How does this framing help us actually reduce the problem? It seems no better or worse than our old framings."

I used to have that response myself. However, I now think I understand training stories! Here's my take:

What Do Training Stories Buy You That Inner/Outer Alignment Doesn't?

It’s worth pointing out how phrasing inner and outer alignment in terms of training stories makes clear what I think was our biggest mistake in formulating that terminology, which is that inner/outer alignment presumes that the right way to build an aligned model is to find an aligned loss function and then have a training goal of finding a model that optimizes for that loss function. However, as I hope the more general framework of training stories should make clear, there are many possible ways of trying to train an aligned model. Microscope AI and STEM AI are examples that I mentioned previously, but in general any approach that intends to use a loss function that would be problematic if directly optimized for, but then attempts to train a model that doesn’t directly optimize for that loss function, would fail on both outer and inner alignment—and yet might still result in an aligned model.

--Evan, How do we become confident in the safety of a machine learning system?

What training stories buy you is a framing of the alignment problem that doesn't imply that the only way you can use a loss function is as an outer objective that you would like to see directly optimized for. This is one way you can use a loss function -- but it definitely isn't the only such way! It's badly misleading for your framing of the alignment problem to suggest that inner/outer alignment of a powerful model is the only way to survive.

For example, say that you were raising a kid. One childrearing scheme is to carefully make your kid's childhood robust to arbitrary levels of precocious genius in your kid. You'd build a childhood such that overachieving in it would only ever be a good thing. You'd drill athletics, navigating complex adult social situations, difficult moral dilemmas, etc., always making sure that there isn't some perverse victory condition way up near the skill ceiling of the task. For on this approach, you don't ever want optimization power being pointed in a direction you wouldn't want to see optimized, in the way that you don't ever want to needlessly point a gun barrel at anything you don't want destroyed. This childrearing scheme revolves around designing for overachievers, so that your kids will leave childhood once they rise to the level of "overachiever," and then generalize their instilled drive to overachievement in the same spirit after they leave the nest.

You'll notice that the above approach to childrearing is pretty weird, and looks more like Ender's Game or Molly's Game than it does any kind of conventionally advised childrearing. It's in fact okay for behavior to be momentarily incentivized in childhood that you would not want to see optimized in adulthood! We successfully chisel out aligned kids because we understand their inductive biases well, and we work within the extremely path-dependent regime of human aging where applying nudges early in life can help result in an adult that avoids huge behavioral attractors, like heavy drug use, later on in life. It's just not a very good model of a growing human to see them as a path-independent search over policies that you have to be perfectly cautious about ever, even temporarily, incentivizing in a way you wouldn't want to see superintelligently optimized. Indeed, ignoring that young people can actively steer away from events that would change who they are and what they'd care about means prematurely giving up on most viable childrearing schemes! You'd be ill-advised as a new father if someone started you off explaining that a child is a search run over algorithms incentivized by the environment, rather than by foregrounding the theory of human inductive biases and human flavors of path-dependent aging.

Path-Dependence Apart from Deceptive Alignment

Say that you were searching over possible futures using a powerful search algorithm that simply conditioned on which futures looked good to you, and then sampled a future for you to look at from the uniform distribution over the remaining set of futures. You look around this presented possible future, and see that all is apparently good. What do you expect to happen when you actually move to instantiate that future?

Because of the path-independent structure of your powerful search algorithm, you should expect the world that looks so good to be, in fact, mediocre under the surface. You optimized for something that would look good to you, and a world can be better shaped to look good if it doesn't have to waste any resources on in fact being a nice place. Just take whatever resources were spent on illegible nice things, take them back, and spend them on appearances instead. This argument suggests that path-independent search, taken to the limit, will inevitably mislead you. It suggests being very careful about what you're asking your path-independent search algorithm for, lest you get exactly what you asked for… and nothing more.
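
To make the argument concrete, here is a minimal toy sketch of that kind of search. Everything in it is an assumption made for illustration, not something from Evan's post or from any real system: each candidate future has a fixed budget of 1.0, split at random between "appearances" and "illegible niceness," and the hypothetical `path_independent_search` function just filters on how good a future looks and then samples uniformly from whatever survives. The tradeoff between looking good and being good is built into the toy model by construction, mirroring the resource argument above.

```python
import random
from statistics import mean


def sample_future(rng):
    """Draw one candidate future from a hypothetical toy world.

    Assumption: a fixed budget of 1.0 is split between appearances
    and illegible niceness, so more of one means less of the other.
    """
    appearance = rng.random()
    return {"looks_good": appearance, "actually_nice": 1.0 - appearance}


def path_independent_search(threshold, n_candidates=100_000, seed=0):
    """Condition on 'looks good enough', then sample uniformly.

    Returns the sampled future and the full list of survivors.
    No training path or history enters into the choice; only the
    filter on appearances matters.
    """
    rng = random.Random(seed)
    futures = [sample_future(rng) for _ in range(n_candidates)]
    survivors = [f for f in futures if f["looks_good"] >= threshold]
    if not survivors:
        raise ValueError("no future passes the filter; lower the threshold")
    return rng.choice(survivors), survivors


if __name__ == "__main__":
    # Turn up the optimization pressure on appearances and watch what
    # happens to the average (unobserved) niceness of surviving futures.
    for threshold in (0.0, 0.5, 0.9, 0.99):
        chosen, survivors = path_independent_search(threshold)
        avg_nice = mean(f["actually_nice"] for f in survivors)
        print(f"threshold={threshold:.2f}  "
              f"survivors={len(survivors)}  "
              f"avg actually_nice={avg_nice:.3f}")
```

In this toy world, raising the threshold on `looks_good` can only lower the niceness of the futures that remain, since a future that spends at least `threshold` of its budget on appearances has at most `1 - threshold` left over for anything else. Real futures are of course not a one-dimensional budget split; the sketch only shows the shape of the argument.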

Suppose, now, that we add one path-dependent wrinkle to the story. You now first have to train the powerful search algorithm that you will then use to, carefully, find an actually good future. The longer you train the search algorithm on more data, the more powerful it grows. But, if the "search algorithm" your training procedure steps through is ever a smart consequentialist algorithm with arbitrary goals and an instrumental incentive to play along and survive training, that consequentialist algorithm will now output whatever behavioral profile you task it with outputting. If your training path (through algorithm space) ever routes through a deceptive consequentialist, all subsequent grading of models on their behavior will fail to improve outcomes. You are now doomed, and your training runs no longer have any leverage over that impending doom.

This, I think, is the training regime that the classic inner/outer alignment framing suggests. But in fact, there'll be a lot more path dependence in AI training than this! It's not that you sample uniformly from algorithm space, simply conditioning on lower and lower loss, until you eventually find a deceptive (or aligned) consequentialist algorithm. Consequentialism is coherence, and coherence doesn't crystallize all at once in one SGD step, and not at all before then. Instead, coherence meaningfully comes in degrees, and as you search over more and more complex algorithms, those algorithms will increasingly begin to defend themselves and increasingly satisfy the coherence properties that constitute consequentialism. Given that, the training process will, by degrees, become sensitive to what algorithm it finds long before a full-blown mesa-optimizer becomes situationally aware.

You'll want to know all about those earlier path-dependent dynamics, if your job is to raise a corrigible AGI. Mainly, you'll want a theory of SGD's inductive biases, and a theory of the relationship between reinforcement of all kinds in given contexts and the algorithms those reinforcement events would eventuate in. Finally, you'd want, largely for communicative clarity, a framing of the alignment problem that foregrounds this theoretical hope.

Comments

We successfully chisel out aligned kids because we understand their inductive biases well


Interpretation of this emoji: "Press X to doubt."

I'm interested in why you doubt this? I can imagine various interpretations of the quote which I doubt, and some which are less doubtful-to-me. 

The reason babies grow up into people that share our values has very little to do with our understanding of their inductive biases (i.e. most of the work is done by gene-environment interactions with parts of the environment that aren't good at predicting the child's moral development). The primary meaning of this comment is pointing out that a particular statement about children is wrong in a kind-of-funny way.

I have this sort of humorous image of someone raising a child, saying "Phew, thank goodness I had a good understanding of my child's inductive biases, I never could have gotten them to have similar values to me just by passing on half of my genes to them and raising them in an environment similar to the one I grew up in."

I agree that similar environments are important, but I don't see why you think they explain most of the outcomes. What's an example of a "gene-environment interaction with parts of the environment that aren't good at predicting the child's moral development"? 

Like, what it feels like to understand human inductive biases isn't to think "Gee, I understand inductive biases!". It's more like: "I see that my son just scowled after agreeing to clean his room. This provides evidence about his internal motivational composition, even though I can't do interpretability and read off his brain state. Given what I know about human physiology and people like him, I predict that he might comply now ('in training'), at which point I will reward him with snacks. However, in the future he will probably not generalize to actually cleaning his room when asked, once I'm less able to impose external punishments and rewards."

I also claim that a human would be relatively easy to reshape into caring about baby-eaters if you used a competent and unethical (psychological) shaping scheme and hidden brain stimulation reward devices, such that you super strongly reinforced the human when they indicate they just thought positive or charitable thoughts about baby eaters, or agree that the baby eaters deserve moral consideration. I think you could probably pull this off within a week. 

Now, we haven't actually observed this. But insofar as you agree with my prediction, it's not due to an environment-gene interaction. This is what it feels like to understand inductive biases: Being able to correctly predict how to inculcate target values in another agent, without already having done it experimentally. 

What's an example of a "gene-environment interaction with parts of the environment that aren't good at predicting the child's moral development"?

Mimicking adult behavior even when the adult isn't paying any attention to the child (and children with different genes having slightly different sorts of mimicry). Automatically changing purity norms in response to disease and perceived disease risk. Having a different outlook on the world if you always had plenty of food growing up. Children of athletic parents often being athletic too, which changes how they relate to their environment and changes their eventual lifestyle. Being genetically predisposed to alcoholism and then becoming an alcoholic. Learning a tonal language and then having a different relationship with music.

I'm not saying parents have no power. If you paid a bunch of parents to raise their children to love playing the piano, you'd probably get a significantly elevated rate of that value among the kids (guess: ~35% compared to a base rate of ~3%). Effortful and deliberate parenting works at a rate distinguishable from chance. My claim is more that almost all of value transmission isn't like that, that you can raise kids without deliberately imparting your (or any) values and they'll still end up pretty similar to you.

I think these are great counterpoints. Thanks for making them. 

I still buy "the helicopter parent 'outer alignment' training regime is unwise for 'aligning' kids" and that deliberate parenting is better than chance. But possibly/probably not the primary factor. I haven't yet read much data here so my views feel relatively unconstrained, beyond my "common sense."

I think there's an additional consideration with AI, though: We control the reward circuitry. If lots of variance in kid-alignment is due to genetic variation in reward circuitry or learning hyperparameters or whatever, then we also control that with AI; that is also part of understanding AI inductive biases.