# TurnTrout's shortform feed


Against CIRL as a special case of against quickly jumping into highly specific speculation while ignoring empirical embodiments-of-the-desired-properties.

Just because we write down English describing what we want the AI to do ("be helpful"), propose a formalism (CIRL), and show good toy results (POMDPs where the agent waits to act until updating on more observations), that doesn't mean that the formalism will lead to anything remotely relevant to the original English words we used to describe it. (It's easier to say "this logic enables nonmonotonic reasoning" and mess around with different logics and show how a logic solves toy examples, than it is to pin down probability theory with Cox's theorem)

And yes, this criticism applies extremely strongly to my own past work with attainable utility preservation and impact measures. (Unfortunately, I learned my lesson after, and not before, making certain mistakes.)

In the context of "how do we build AIs which help people?", asking "does CIRL solve corrigibility?" is hilariously unjustified. By what evidence have we located such a specific question? We have assumed there is an achievable "corrigibility"-like property; we ha...

2Alex Turner7mo
Actually, this is somewhat too uncharitable to my past self. It's true that I did not, in 2018, grasp the two related lessons conveyed by the above comment:

1. Make sure that the formalism (CIRL, AUP) is tightly bound to the problem at hand (value alignment, "low impact"), and not just supported by "it sounds nice or has some good properties."
2. Don't randomly jump to highly specific ideas and questions without lots of locating evidence.

However, in World State is the Wrong Abstraction for Impact [https://www.lesswrong.com/posts/pr3bLc2LtjARfK7nx/world-state-is-the-wrong-abstraction-for-impact], I wrote: I had partially learned lesson #2 by 2019.

Shard theory suggests that goals are more natural to specify/inculcate in their shard-forms (e.g. if around trash and a trash can, put the trash away), and not in their (presumably) final form of globally activated optimization of a coherent utility function which is the reflective equilibrium of inter-shard value-handshakes (e.g. a utility function over the agent's internal plan-ontology such that, when optimized directly, leads to trash getting put away, among other utility-level reflections of initial shards).

I could (and did) hope that I could specify a utility function which is safe to maximize because it penalizes power-seeking. I may as well have hoped to jump off of a building and float to the ground. On my model, that's just not how goals work in intelligent minds. If we've had anything at all beaten into our heads by our alignment thought experiments, it's that goals are hard to specify in their final form of utility functions.

I think it's time to think in a different specification language.

AI strategy consideration. We won't know which AI run will be The One. Therefore, the training run which produces the first AGI will, on average, be conducted with less care than intended.

• It's possible for a team to be totally blindsided. Maybe they thought they would just take a really big multimodal init, finetune it with some RLHF on quality of its physics reasoning, have it play some video games with realistic physics, and then try to get it to do new physics research. And it takes off. Oops!
• It's possible the team suspected, but had a limited budget. Maybe you can't pull out all the stops for every run, you can't be as careful with labeling, with checkpointing and interpretability and boxing.

No team is going to run a training run with more care than they would have used for the AGI Run, especially if they don't even think that the current run will produce AGI. So the average care taken on the real AGI Run will be strictly less than intended.

Teams which try to be more careful on each run will take longer to iterate on AI designs, thereby lowering the probability that they (the relatively careful team) will be the first to do an AGI Run.
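The "less than intended" claim can be written as a one-line expectation. This is only a sketch under assumed notation that is not in the original: $c^{*}$ is the care the team intends to use on the AGI Run, $c_i \le c^{*}$ is the care actually used on run $i$, and $p_i$ is the probability that run $i$ turns out to be the AGI Run.

$$
\mathbb{E}[\text{care on the AGI Run}] \;=\; \sum_{i=1}^{n} p_i\, c_i \;\le\; c^{*} \sum_{i=1}^{n} p_i \;=\; c^{*}
$$

The inequality is strict as soon as any run with $p_i > 0$ gets less than full care ($c_i < c^{*}$), which is exactly the blindsided and limited-budget cases above.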

Upshots:

1. Th
...

Positive values seem more robust and lasting than prohibitions. Imagine we train an AI on realistic situations where it can kill people, and penalize it when it does so. Suppose that we successfully instill a strong and widely activated "If going to kill people, then don't" value shard.

Even assuming this much, the situation seems fragile. See, many value shards are self-chaining. In The shard theory of human values, I wrote about how:

1. A baby learns "IF juice in front of me, THEN drink",
2. The baby is later near juice, and then turns to see it, activating the learned "reflex" heuristic, learning to turn around and look at juice when the juice is nearby,
3. The baby is later far from juice, and bumbles around until she's near the juice, whereupon she drinks the juice via the existing heuristics. This teaches "navigate to juice when you know it's nearby."
4. Eventually this develops into a learned planning algorithm incorporating multiple value shards (e.g. juice and friends) so as to produce a single locally coherent plan.
5. ...

The juice shard chains into itself, reinforcing itself across time and thought-steps.

But a "don't kill" shard seems like it should remain... stubby? Primitive?...

1Garrett Baker2mo
Seems possibly relevant & optimistic when seeing deception as a value. It has the form ‘if about to tell human statement with properties x, y, z, don’t’ too.
2Alex Turner2mo
It can still be robustly derived as an instrumental subgoal during general-planning/problem-solving, though?
1Garrett Baker2mo
This is true, but indicates a radically different stage in training in which we should find deception compared to deception being an intrinsic value. It also possibly expands the kinds of reinforcement schedules we may want to use compared to the worlds where deception crops up at the earliest opportunity (though pseudo-deception may occur, where behaviors correlated with successful deception are reinforced possibly?).
2Alex Turner2mo
Oh, huh, I had cached the impression that deception would be derived, not intrinsic-value status. Interesting.

Why do many people think RL will produce "agents", but maybe (self-)supervised learning ((S)SL) won't? Historically, the field of RL says that RL trains agents. That, of course, is no argument at all. Let's consider the technical differences between the training regimes.

In the modern era, both RL and (S)SL involve initializing one or more neural networks, and using the reward/loss function to provide cognitive updates to the network(s). Now we arrive at some differences.

Some of this isn't new (see Hidden Incentives for Auto-Induced Distributional Shift), but I think it's important and felt like writing up my own take on it. Maybe this becomes a post later.

[Exact gradients] RL's credit assignment problem is harder than (self-)supervised learning's. In RL, if an agent solves a maze in 10 steps, it gets (discounted) reward; this trajectory then provides a set of reward-modulated gradients to the agent. But if the agent could have solved the maze in 5 steps, the agent isn't directly updated to be more likely to do that in the future; RL's gradients are generally inexact, not pointing directly at intended behavior.

On the other hand, if a supervised-learning classifier outputs dog ...
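To make the "exact vs. inexact gradients" contrast concrete, here is a minimal sketch, assuming a softmax output over three options and vanilla REINFORCE with no baseline. The function names, the three-option setup, and the reward value are all illustrative assumptions, not anything from the original. The supervised gradient pushes towards the labeled answer whatever the model actually output, while the policy-gradient update only reinforces the action that was sampled.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def supervised_logit_grad(logits, label):
    """Gradient of cross-entropy loss w.r.t. the logits: p - onehot(label)."""
    g = softmax(logits)
    g[label] -= 1.0
    return g

def reinforce_logit_grad(logits, action, reward):
    """Gradient of the surrogate loss -reward * log p(action) w.r.t. the logits."""
    g = reward * softmax(logits)
    g[action] -= reward
    return g

logits = np.zeros(3)   # uniform initial policy/classifier
best = 2               # hypothetical: the 5-step route (or the true label)
taken = 0              # hypothetical: the 10-step route that was actually sampled

# A gradient-descent step moves the logits along the negative gradient.
sup_step = -supervised_logit_grad(logits, best)
rl_step = -reinforce_logit_grad(logits, taken, reward=0.5)

print("supervised step:", sup_step)  # raises the logit of `best`
print("REINFORCE step: ", rl_step)   # raises the logit of `taken`, lowers `best`
```

In this toy setup the RL update makes the slower route more likely and the faster route less likely, simply because the faster route was never sampled. Baselines, advantage estimates, and exploration change the details but not the basic asymmetry.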

3Steve Byrnes5mo

A problem with adversarial training. One heuristic I like to use is: "What would happen if I initialized a human-aligned model and then trained it with my training process?"

So, let's consider such a model, which cares about people (i.e. reliably pulls itself into futures where the people around it are kept safe). Suppose we also have some great adversarial training technique, such that we have e.g. a generative model which produces situations where the AI would break out of the lab without permission from its overseers. Then we run this procedure, update the AI by applying gradients calculated from penalties applied to its actions in that adversarially-generated context, and... profit?

But what actually happens with the aligned AI? Possibly something like:

1. The context makes the AI spuriously believe someone is dying outside the lab, and that if the AI asked for permission to leave, the person would die.
2. Therefore, the AI leaves without permission.
3. The update procedure penalizes these lines of computation, such that in similar situations in the future (i.e. the AI thinks someone nearby is dying) the AI is less likely to take those actions (i.e. leaving to help the person).
4. We have
...

I think instrumental convergence also occurs in the model space for machine learning. For example, many different architectures likely learn edge detectors in order to minimize classification loss on MNIST. But wait - you'd also learn edge detectors to maximize classification loss on MNIST (loosely, getting 0% on a multiple-choice exam requires knowing all of the right answers). I bet you'd learn these features for a wide range of cost functions. I wonder if that's already been empirically investigated?

And, same for adversarial features. And perhaps, same for mesa optimizers (understanding how to stop mesa optimizers from being instrumentally convergent seems closely related to solving inner alignment).
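A toy version of the minimize/maximize observation, as a sketch only: it uses a linear classifier on synthetic 2D blobs rather than edge detectors on MNIST, and the data, step counts, and function names are assumptions for illustration. Training to minimize and to maximize the log-loss should recover (approximately) the same feature direction with opposite sign.

```python
import numpy as np

def sigmoid(z):
    # Numerically stable form of the logistic function.
    return 0.5 * (1.0 + np.tanh(0.5 * z))

def train(X, y, sign, steps=200, lr=0.1):
    """Gradient steps on mean log-loss; sign=+1 minimizes it, sign=-1 maximizes it."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        grad = X.T @ (sigmoid(X @ w) - y) / len(y)
        w -= sign * lr * grad
    return w

def accuracy(w, X, y):
    return np.mean((X @ w > 0) == (y == 1))

rng = np.random.default_rng(0)
mu = np.array([1.0, 0.5])  # hypothetical class mean; classes sit at +mu and -mu
X = np.vstack([rng.normal(mu, 1.0, size=(200, 2)),
               rng.normal(-mu, 1.0, size=(200, 2))])
y = np.concatenate([np.ones(200), np.zeros(200)])

w_min = train(X, y, sign=+1)   # loss minimizer
w_max = train(X, y, sign=-1)   # loss maximizer
cosine = w_min @ w_max / (np.linalg.norm(w_min) * np.linalg.norm(w_max))
acc_min, acc_max = accuracy(w_min, X, y), accuracy(w_max, X, y)

print("cosine between learned directions:", cosine)
print("accuracy (min-loss, max-loss):", acc_min, acc_max)
```

Whether anything like this holds for learned features in deep networks is exactly the open empirical question; the linear case only shows why the intuition is plausible.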

3Evan Hubinger3y
A lot of examples of this sort of stuff show up in OpenAI Clarity's circuits analysis work [https://distill.pub/2020/circuits/]. In fact, this is precisely their Universality hypothesis [https://distill.pub/2020/circuits/zoom-in/]. See also my discussion here [https://www.lesswrong.com/posts/MG4ZjWQDrdpgeu8wG/zoom-in-an-introduction-to-circuits].

An alternate mechanistic vision of how agents can be motivated to directly care about e.g. diamonds or working hard. In Don't design agents which exploit adversarial inputs, I wrote about two possible mind-designs:

Imagine a mother whose child has been goofing off at school and getting in trouble. The mom just wants her kid to take education seriously and have a good life. Suppose she had two (unrealistic but illustrative) choices.

1. Evaluation-child: The mother makes her kid care extremely strongly about doing things which the mom would evaluate as "working hard" and "behaving well."
2. Value-child: The mother makes her kid care about working hard and behaving well.

I explained how evaluation-child is positively incentivized to dupe his model of his mom and thereby exploit adversarial inputs to her cognition. This shows that aligning an agent to evaluations of good behavior is not even close to aligning an agent to good behavior.

However, some commenters seemed maybe skeptical that value-child can exist, or uncertain how concretely that kind of mind works. I worry/suspect that many people have read shard theory posts without internalizing new ideas about how cognition can work, ...

Examples should include actual details. I often ask people to give a concrete example, and they often don't. I wish this happened less. For example:

Someone: the agent Goodharts the misspecified reward signal

Me: What does that mean? Can you give me an example of that happening?

Someone: The agent finds a situation where its behavior looks good, but isn't actually good, and thereby gets reward without doing what we wanted.

This is not a concrete example.

Me: So maybe the AI compliments the reward button operator, while also secretly punching a puppy behind closed doors?

This is a concrete example.

3Alex Turner2mo
AFAIK, only Gwern [https://www.gwern.net/fiction/Clippy] and I [https://www.lesswrong.com/posts/k4AQqboXz8iE5TNXK/a-shot-at-the-diamond-alignment-problem] have written concrete stories speculating about how a training run will develop cognition within the AGI.

This worries me, if true (if not, please reply with more!). I think it would be awesome to have more concrete stories![1] If Nate, or Evan, or John, or Paul, or anyone (please, anyone, add more concrete detail to this website!) wrote one of their guesses of how AGI goes, I would understand their ideas and viewpoints better. I could go "Oh, that's where the claimed sharp left turn is supposed to occur." Or "That's how Paul imagines IDA being implemented, that's the particular way in which he thinks it will help."

Maybe a contest would help?

ETA tone

1. ^ Even if scrubbed of any AGI-capabilities-advancing sociohazardous detail. Although I'm not that convinced that this is a big deal for conceptual content written on AF. Lots of people probably have theories of how AGI will go. Implementation is, I have heard, the bottleneck. Contrast this with beating SOTA on crisply defined datasets in a way which enables ML authors to get prestige and publication and attention and funding by building off of your work. Seem like different beasts.
0Alex Turner2mo
I also think a bunch of alignment writing seems syntactical. Like, "we need to solve adversarial robustness so that the AI can't find bad inputs and exploit them / we don't have to worry about distributional shift. Existing robustness strategies have downsides A B and C and it's hard to even get ϵ-ball guarantees on classifications. Therefore, ..." And I'm worried that this writing isn't abstractly summarizing a concrete story for failure that they have in mind (like "I train the AI [with this setup] and it produces [this internal cognition] for [these mechanistic reasons]"; see A shot at the diamond alignment problem [https://www.lesswrong.com/posts/k4AQqboXz8iE5TNXK/a-shot-at-the-diamond-alignment-problem] for an example) and then their best guesses at how to intervene on the story to prevent the failures from being able to happen (eg "but if we had [this robustness property] we could be sure its policy would generalize into situations X Y and Z, which makes the story go well"). I'm rather worried that people are more playing syntactically, and not via detailed models of what might happen.  Detailed models are expensive to make. Detailed stories are hard to write. There's a lot we don't know. But we sure as hell aren't going to solve alignment only via valid reasoning steps on informally specified axioms ("The AI has to be robust or we die", or something?).

Outer/inner alignment decomposes a hard problem into two extremely hard problems.

I have a long post draft about this, but I keep delaying putting it out in order to better elaborate the prereqs which I seem to keep getting stuck on when elaborating the ideas. I figure I might as well put this out for now, maybe it will make some difference for someone.

I think that the inner/outer alignment framing[1] seems appealing but is actually a doomed problem decomposition and an unhelpful frame for alignment.

1. The reward function is a tool which chisels cognition into agents through gradient updates, but the outer/inner decomposition assumes that that tool should also embody the goals we want to chisel into the agent. When chiseling a statue, the chisel doesn’t have to also look like the finished statue.
2. I know of zero success stories for outer alignment to real-world goals.
1. More precisely, stories where people decided “I want an AI which [helps humans / makes diamonds / plays Tic-Tac-Toe / grows strawberries]”, and then wrote down an outer objective only maximized in those worlds.
2. This is pretty weird on any model where most of the
...

I was talking with Abram Demski today about a promising-seeming research direction. (Following is my own recollection)

One of my (TurnTrout's) reasons for alignment optimism is that I think:

• We can examine early-training cognition and behavior to some extent, since the system is presumably not yet superintelligent and planning against us,
• (Although this amount of information depends on how much interpretability and agent-internals theory we do now)
• All else equal, early-training values (decision-influences) are the most important to influence, since they steer future training.
• It's crucial to get early-training value shards of which a substantial fraction are "human-compatible values" (whatever that means)
• For example, if there are protect-human-shards which
• reliably bid against plans where people get hurt,
• steer deliberation away from such plan stubs, and
• these shards are "reflectively endorsed" by the overall shard economy (i.e. the decision-making isn't steering towards plans where the protect-human shards get removed)
• If we install influential human-compatible shards early in training, and they get retained, they will help us in mid- and late-training where we can't affect the ball
...
4johnswentworth4mo
One barrier for this general approach: the basic argument that something like this would work is that if one shard is aligned, and every shard has veto power over changes (similar to the setup in Why Subagents? [https://www.lesswrong.com/posts/3xF66BNSC5caZuKyC/why-subagents]), then things can't get much worse for humanity. We may fall well short of our universe-scale potential, but at least X-risk is out. Problem is, that argument requires basically-perfect alignment of the one shard (or possibly a set of shards which together basically-perfectly represent human values). If we try to weaken it to e.g. a bunch of shards which each imperfectly capture different aspects of human values, with different imperfections, then there are possibly changes which Goodhart all of the shards simultaneously. Indeed, I'd expect that to be a pretty strong default outcome.
4Alex Turner4mo
Even on the view you advocate here (where some kind of perfection is required), "perfectly align part of the motivations" seems substantially easier than "perfectly align all of the AI's optimization so it isn't optimizing for anything you don't want." I feel significantly less confident about this, and am still working out the degree to which Goodhart seems hard, and in what contours, on my current view.

"Globally activated consequentialist reasoning is convergent as agents get smarter" is dealt an evidential blow by von Neumann:

Although von Neumann unfailingly dressed formally, he enjoyed throwing extravagant parties and driving hazardously (frequently while reading a book, and sometimes crashing into a tree or getting arrested). He once reported one of his many car accidents in this way: "I was proceeding down the road. The trees on the right were passing me in orderly fashion at 60 miles per hour. Suddenly one of them stepped in my path." He was a profoundly committed hedonist who liked to eat and drink heavily (it was said that he knew how to count everything except calories). -- https://www.newworldencyclopedia.org/entry/John_von_Neumann

Earlier today, I was preparing for an interview. I warmed up by replying stream-of-consciousness to imaginary questions I thought they might ask. Seemed worth putting here.

What do you think about AI timelines?

I’ve obviously got a lot of uncertainty. I’ve got a bimodal distribution, binning into “DL is basically sufficient and we need at most 1 big new insight to get to AGI” and “we need more than 1 big insight”

So the first bin has most of the probability in the 10-20 years from now, and the second is more like 45-80 years, with positive skew.

Some things driving my uncertainty are, well, a lot. One thing that drives how things turn out (but not really how fast we’ll get there) is: will we be able to tell we’re close 3+ years in advance, and if so, how quickly will the labs react? Gwern Branwen made a point a few months ago, which is like, OAI has really been validated on this scaling hypothesis, and no one else is really betting big on it because they’re stubborn/incentives/etc, despite the amazing progress from scaling. If that’s true, then even if it's getting pretty clear that one approach is working better, we might see a slower pivot and have a more unipolar s

...

Experiment: Train an agent in MineRL which robustly cares about chickens (e.g. would zero-shot generalize to saving chickens in a pen from oncoming lava, by opening the pen and chasing them out, or stopping the lava). Challenge mode: use a reward signal which is a direct function of the agent's sensory input.

This is a direct predecessor to the "Get an agent to care about real-world dogs" problem. I think solving the Minecraft version of this problem will tell us something about how outer reward schedules relate to inner learned values, in a way which directly tackles the key questions, the sensory observability/information inaccessibility issue, and which is testable today.

(Credit to Patrick Finley for the idea)

3Alex Turner6mo
After further review, this is probably beyond capabilities for the moment.  Also, the most important part of this kind of experiment is predicting in advance what reward schedules will produce what values within the agent, such that we can zero-shot transfer that knowledge to other task types (e.g. XLAND instead of Minecraft) and say "I want an agent which goes to high-elevation platforms reliably across situations, with low labelling cost", and then sketch out a reward schedule, and have the first capable agents trained using that schedule generalize in the way you want.

If another person mentions an "outer objective/base objective" (in terms of e.g. a reward function) to which we should align an AI, that indicates to me that their view on alignment is very different. The type error is akin to the type error of saying "My physics professor should be an understanding of physical law." The function of a physics professor is to supply cognitive updates such that you end up understanding physical law. They are not, themselves, that understanding.

Similarly, "The reward function should be a human-aligned objective" -- The functi...

I never thought I'd be seriously testing the reasoning abilities of an AI in 2020

Looking back, history feels easy to predict; hindsight + the hard work of historians makes it (feel) easy to pinpoint the key portents. Given what we think about AI risk, in hindsight, might this have been the most disturbing development of 2020 thus far?

I personally lean towards "no", because this scaling seemed somewhat predictable from GPT-2 (flag - possible hindsight bias), and because 2020 has been so awful so far. But it seems possible, at least. I don't really know what update GPT-3 is to my AI risk estimates & timelines.

DL so far has been easy to predict - if you bought into a specific theory of connectionism & scaling espoused by Schmidhuber, Moravec, Sutskever, and a few others, as I point out in https://www.gwern.net/newsletter/2019/13#what-progress & https://www.gwern.net/newsletter/2020/05#gpt-3 . Even the dates are more or less correct! The really surprising thing is that that particular extreme fringe lunatic theory turned out to be correct. So the question is, was everyone else wrong for the right reasons (similar to the Greeks dismissing heliocentrism for excellent reasons yet still being wrong), or wrong for the wrong reasons, and why, and how can we prevent that from happening again and spending the next decade being surprised in potentially very bad ways?

Thomas Kwa suggested that consequentialist agents seem to have less superficial (observation, belief state) -> action mappings. EG a shard agent might have:

1. An "it's good to give your friends chocolate" subshard
2. A "give dogs treats" subshard
3. -> An impulse to give dogs chocolate, even though the shard agent knows what the result would be

But a consequentialist would just reason about what happens, and not mess with those heuristics. (OFC, consequentialism would be a matter of degree)

In this way, changing a small set of decision-relevant features (e.g. "Brown dog treat" -> "brown ball of chocolate") changes the consequentialist's action logits a lot, way more than it changes the shard agent's logits. In a squinty, informal way, the (belief state -> logits) function has a lower Lipschitz constant/is smoother for the shard agent than for the consequentialist agent.

So maybe one (pre-deception) test for consequentialist reasoning is to test sensitivity of decision-making to small perturbations in observation-space (e.g. dog treat -> tiny chocolate) but large perturbations in action-consequence space (e.g. happy dog -> sick dog). You could spin up two copies of the model to compare.
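A minimal sketch of what such a sensitivity test might look like, under heavy assumptions: both "agents" below are hand-written toy functions rather than trained models, and the three observation features (`brown`, `small`, `cocoa`) and all the numbers are made up for illustration. The measured quantity is a finite-difference ratio: the change in action logits divided by the change in observation.

```python
import numpy as np

def empirical_sensitivity(policy, obs_a, obs_b):
    """Change in action logits per unit change in observation (a Lipschitz-style ratio)."""
    return np.linalg.norm(policy(obs_b) - policy(obs_a)) / np.linalg.norm(obs_b - obs_a)

def shard_like_policy(obs):
    # Hypothetical heuristic agent: reacts mostly to surface features.
    brown, small, cocoa = obs
    give = 2.0 * brown + 1.0 * small - 0.5 * cocoa
    return np.array([give, 0.0])  # logits for [give to dog, withhold]

def consequentialist_like_policy(obs):
    # Hypothetical agent that predicts the outcome first, then scores the action.
    brown, small, cocoa = obs
    dog_gets_sick = cocoa > 0.1   # toy world model
    give = -4.0 if dog_gets_sick else 4.0
    return np.array([give, 0.0])

dog_treat = np.array([0.8, 0.9, 0.0])       # small perturbation in observation-space...
tiny_chocolate = np.array([0.8, 0.9, 0.2])  # ...large perturbation in consequences

shard_sens = empirical_sensitivity(shard_like_policy, dog_treat, tiny_chocolate)
conseq_sens = empirical_sensitivity(consequentialist_like_policy, dog_treat, tiny_chocolate)
print("shard-like sensitivity:", shard_sens)
print("consequentialist-like sensitivity:", conseq_sens)
```

For a real model, `policy` would be a forward pass on paired observations, and the hard part is constructing pairs that are close in observation-space but far apart in consequences.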

Partial alignment successes seem possible.

People care about lots of things, from family to sex to aesthetics. My values don't collapse down to any one of these.

I think AIs will learn lots of values by default. I don't think we need all of these values to be aligned with human values. I think this is quite important.

• I think the more of the AI's values we align to care about us and make decisions in the way we want, the better. (This is vague because I haven't yet sketched out AI internal motivations which I think would actually produce goo
...
2Alex Turner3mo
The best counterevidence for this I'm currently aware of comes from the "inescapable wedding parties [https://www.lesswrong.com/posts/t9svvNPNmFf5Qa3TA/mysteries-of-mode-collapse-due-to-rlhf#Inescapable_wedding_parties]" incident, where possibly a "talk about weddings" value was very widely instilled in a model.
1Garrett Baker3mo
Re: agents terminalizing instrumental values.

I anticipate there will be a hill-of-common-computations, where the x-axis is the frequency[1] of the instrumental subgoal, and the y-axis is the extent to which the instrumental goal has been terminalized.

This is because for goals which are very high in frequency, there will be little incentive for the computations responsible for achieving that goal to have self-preserving structures. It will not make sense for them to devote optimization power towards ensuring future states still require them, because future states are basically guaranteed to require them.[2]

An example of this for humans may be the act of balancing while standing up. If someone offered to export this kind of cognition to a machine which did it just as well as I do, I wouldn't particularly mind. If someone also wanted to change physics in such a way that the only effect is that magic invisible fairies made sure everyone stayed balancing while trying to stand up, I don't think I'd mind that either[3].

1. ^ I'm assuming this is frequency of the goal assuming the agent isn't optimizing to get into a state that requires that goal.
2. ^ This argument also assumes the overseer isn't otherwise selecting for self-preserving cognition, or that self-preserving cognition is the best way of achieving the relevant goal.
3. ^ Except for the part where there's magic invisible fairies in the world now. That would be cool!
3Alex Turner3mo
I don't know if I follow. I think computations terminalize themselves because it makes sense to cache them (e.g. don't always model out whether dying is a good idea, just cache that it's bad at the policy-level).

Also, isn't "balance while standing up" terminalized? Doesn't it feel wrong to fall over, even if you're on a big cushy surface? Feels like a cached computation to me. (Maybe that's "don't fall over and hurt yourself" getting cached?)

Three recent downward updates for me on alignment getting solved in time:

1. Thinking for hours about AI strategy made me internalize that communication difficulties are real serious.

I'm not just solving technical problems—I'm also solving interpersonal problems, communication problems, incentive problems. Even if my current hot takes around shard theory / outer/inner alignment are right, and even if I put up a LW post which finally successfully communicates some of my key points, reality totally allows OpenAI to just train an AGI the next month without incorp
...

I often get the impression that people weigh e.g. doing shard theory alignment strategies under the shard theory alignment picture, versus inner/outer research under the inner/outer alignment picture, versus...

And insofar as this impression is correct, this is a mistake. There is only one way alignment is.

If inner/outer is altogether a more faithful picture of those dynamics:

• relatively coherent singular mesa-objectives form in agents, albeit not necessarily always search-based
• more fragility of value and difficulty in getting the mesa object
...

Are there convergently-ordered developmental milestones for AI? I suspect there may be convergent orderings in which AI capabilities emerge. For example, it seems that LMs develop syntax before semantics, but maybe there's an even more detailed ordering relative to a fixed dataset. And in embodied tasks with spatial navigation and recurrent memory, there may be an order in which enduring spatial awareness emerges (i.e. "object permanence").

In A shot at the diamond-alignment problem, I wrote:

We report a series of robust empirical observations, demonstrating that deep Neural Networks learn the examples in both the training and test sets in a similar order. This phenomenon is observed in all the commonly used benchmarks we evaluated, including many image classification benchmarks, and one text classification benchmark. While this phenomenon is strongest for models of the same architecture, it also crosses architectural boundaries – models of different architectures start by learning the same examples, after which the more powerful model may continue to learn additional examples. We

...

Quick summary of a major takeaway from Reward is not the optimization target

Stop thinking about whether the reward is "representing what we want", or focusing overmuch on whether agents will "optimize the reward function." Instead, just consider how the reward and loss signals affect the AI via the gradient updates. How do the updates affect the AI's internal computations and decision-making?
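One way to see what "consider how the reward affects the AI via the gradient updates" means mechanically is the following sketch, assuming a two-action softmax policy trained with vanilla REINFORCE. The reward values, learning rate, and the logged action history are all hypothetical. The reward only ever enters as a scalar multiplier on the update for actions the agent actually took.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def reinforce_update(logits, action, reward, lr=0.1):
    """One policy-gradient step on the logits for a single (action, reward) pair."""
    grad_log_prob = -softmax(logits)
    grad_log_prob[action] += 1.0
    return logits + lr * reward * grad_log_prob

reward_fn = np.array([1.0, 5.0])   # hypothetical: action 1 would pay more
logits = np.zeros(2)
experienced_actions = [0] * 50     # hypothetical history: only action 0 was ever tried

for a in experienced_actions:
    logits = reinforce_update(logits, a, reward_fn[a])

policy = softmax(logits)
print("final policy:", policy)     # action 0 has been reinforced
```

The larger reward for action 1 is never read during training, so it never shapes the policy. What the agent ends up doing depends on which computations were reinforced, not on which outcome the reward function scores highest.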

1Gunnar Zarncke1mo
Are there different classes of learning systems that optimize for the reward in different ways?
3Alex Turner1mo
Yes, model-based approaches, model-free approaches (with or without critic), AIXI— all of these should be analyzed on their mechanistic details.

I plan to mentor several people to work on shard theory and agent foundations this winter through SERI MATS. Apply here if you're interested in working with me and Quintin.

Research-guiding heuristic: "What would future-TurnTrout predictably want me to get done now?"

80% credence: It's very hard to train an inner agent which reflectively equilibrates to an EU maximizer only over commonly-postulated motivating quantities (like # of diamonds or # of happy people or reward-signal) and not quantities like (# of times I have to look at a cube in a blue room or -1 * subjective micromorts accrued).

Intuitions:

• I expect contextually activated heuristics to be the default, and that agents will learn lots of such contextual values which don't cash out to being strictly about diamonds or people, even if the overall agent is mostly
...
3Alex Turner4mo
I think that shards will cast contextual shadows into the factors of a person’s equilibrated utility function, because I think the shards are contextually activated to begin with. For example, if a person hates doing jumping jacks in front of a group of her peers, then that part of herself can bargain to penalize jumping jacks just in those contexts in the final utility function. Compared to a blanket "no jumping jacks ever" rule, this trade is less costly to other shards and allows more efficient trades to occur.

If you want to argue an alignment proposal "breaks after enough optimization pressure", you should give a concrete example in which the breaking happens (or at least internally check to make sure you can give one). I perceive people as saying "breaks under optimization pressure" in scenarios where it doesn't even make sense.

For example, if I get smarter, would I stop loving my family because I applied too much optimization pressure to my own values? I think not.

# How might we align AGI without relying on interpretability?

I'm currently pessimistic about the prospect. But it seems worth thinking about, because wouldn't it be such an amazing work-around?

My first idea straddles the border between contrived and intriguing. Consider some AGI-capable ML architecture, and imagine its parameter space being 3-colored as follows:

• Gray if the parameter vector+training process+other initial conditions leads to a nothingburger (a non-functional model)
• Red if the parameter vector+... leads to a misaligned or dece
...

The existence of the human genome yields at least two classes of evidence which I'm strongly interested in.

1. Humans provide many highly correlated datapoints on general intelligence (human minds), as developed within one kind of learning process (best guess: massively parallel circuitry, locally randomly initialized, self-supervised learning + RL).
1. We thereby gain valuable information about the dynamics of that learning process. For example, people care about lots of things (cars, flowers, animals, friends), and don't just have a single unitary mesa-obj
...

Why don't people reinforcement-learn to delude themselves? It would be very rewarding for me to believe that alignment is solved, everyone loves me, I've won at life as hard as possible. I think I do reinforcement learning over my own thought processes. So why don't I delude myself?

On my model of people, rewards provide ~"policy gradients" which update everything, but most importantly shards. I think eg the world model will have a ton more data from self-supervised learning, and so on net most of its bits won't come from reward gradients.

For example, if I ...

Basilisks are a great example of plans which are "trying" to get your plan evaluation procedure to clock in a huge upwards error. Sensible beings avoid considering such plans, and everything's fine. I am somewhat worried about an early-training AI learning about basilisks before the AI is reflectively wise enough to reject the basilisks.

For example:

- Pretraining on a corpus in which people worry about basilisks could elevate reasoning about basilisks to the AI's consideration,

- at which point the AI reasons in more detail because it's not...

Argument that you can't use a boundedly intelligent ELK solution to search over plans to find one which keeps the diamond in the vault. That is, the ELK solution would probably have to be at least as smart as (or smarter than) the plan-generator.

Consider any situation where it's hard to keep the diamond in the vault. Then any successful plan will have relatively few degrees of freedom. Like, a bunch of really smart thieves will execute a cunning plot to extract the diamond. You can't just sit by or deploy some simple traps in this situation.

Therefore, any pla...

4Rohin Shah4mo
The main hope is to have the ELK solution be at least as smart as the plan-generator. See mundane solutions to exotic problems [https://ai-alignment.com/mundane-solutions-to-exotic-problems-395bad49fbe7]:

"Goodhart" is no longer part of my native ontology for considering alignment failures. When I hear "The AI goodharts on some proxy of human happiness", I start trying to fill in a concrete example mind design which fits that description and which is plausibly trainable. My mental events are something like:

Condition on: AI with primary value shards oriented around spurious correlate of human happiness; AI exhibited deceptive alignment during training, breaking perceived behavioral invariants during its sharp-capabilities-gain

Warning: No history ...

There might be a natural concept for this that reframes deceptive alignment in the direction of reflection/extrapolation. Looking at deceptive alignment as a change of behavior not in response to capability gain, but instead as a change in response to stepping into a new situation, it's then like a phase change in the (unchanging) mapping from situations to behaviors (local policies). The behaviors of a model suddenly change as it moves to similar situations, in a way that's not "correctly prompted" by behaviors in original situations. It's like a robustness failure, but with respect to actual behavior in related situations, rather than with respect to some outer objective or training/testing distribution. So it seems more like a failure of reflection/extrapolation, where behavior in new situations should be determined by coarse-grained descriptions of behavior in old situations (maybe "behavioral invariants" are something like that; or just algorithms) rather than by any other details of the model. Aligned properties of behavior in well-tested situations normatively-should screen off details of the model, in determining behavior in new situations (for a different extrapolated/"robustness"-hardened model prepared for use in the new situations).

The policy of truth is a blog post about why policy gradient/REINFORCE suck. I'm leaving a shortform comment because it seems like a classic example of wrong RL theory and philosophy, since reward is not the optimization target. Quotes:

Our goal remains to find a policy that maximizes the total reward after  time steps.

And hence the following is a general purpose algorithm for maximizing rewards with respect to parametric distributions:

If you start with a reward function whose values are in  and you subtract one million

...
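One standard fact bears on the last quote and can be checked directly: for a softmax policy, subtracting a constant from every reward leaves the expected REINFORCE gradient unchanged, while the variance of single-sample gradient estimates grows enormously. The sketch below is an illustration under assumptions of my own (a three-armed bandit, made-up logits and rewards) and is not code from the blog post; it computes both quantities exactly rather than by sampling.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def score(probs, action):
    """Gradient of log pi(action) with respect to the logits."""
    return [(1.0 if k == action else 0.0) - p for k, p in enumerate(probs)]

def reinforce_stats(logits, rewards):
    """Exact mean and total variance of the single-sample estimate r(a) * score(a)."""
    probs = softmax(logits)
    n = len(logits)
    mean = [0.0] * n
    second_moment = 0.0
    for action, (p, r) in enumerate(zip(probs, rewards)):
        grad = [r * s for s in score(probs, action)]
        for k in range(n):
            mean[k] += p * grad[k]
        second_moment += p * sum(g * g for g in grad)
    variance = second_moment - sum(m * m for m in mean)
    return mean, variance

# Hypothetical bandit: the logits and rewards are illustrative assumptions.
logits = [0.0, 0.5, -0.5]
rewards = [0.0, 0.5, 1.0]
shifted = [r - 1_000_000.0 for r in rewards]  # "subtract one million"

mean_plain, var_plain = reinforce_stats(logits, rewards)
mean_shifted, var_shifted = reinforce_stats(logits, shifted)
```

Comparing `mean_plain` with `mean_shifted` shows the same update direction in expectation (up to floating-point error), even though the shifted rewards are all hugely negative; only the noise of individual samples changes.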

Shard-theoretic model of wandering thoughts: Why trained agents won't just do nothing in an empty room. If human values are contextually activated subroutines etched into us by reward events (e.g. "If candy nearby and hungry, then upweight actions which go to candy"), then what happens in "blank" contexts? Why don't people just sit in empty rooms and do nothing?
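To make "contextually activated subroutines" concrete, here is a minimal sketch. Everything in it is a hypothetical illustration rather than a claim about how shards are actually implemented: a context is a set of feature names, and `Shard`, `action_weights`, and the example shards are invented for this example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, FrozenSet, List

# Hypothetical representation: a context is a set of feature names.
Context = FrozenSet[str]

@dataclass
class Shard:
    name: str
    is_active: Callable[[Context], bool]  # the contextual trigger
    action_bids: Dict[str, float]         # upweighting applied when active

def action_weights(shards: List[Shard], context: Context,
                   actions: List[str]) -> Dict[str, float]:
    """Sum the bids of every shard whose trigger matches the context."""
    weights = {action: 0.0 for action in actions}
    for shard in shards:
        if shard.is_active(context):
            for action, bid in shard.action_bids.items():
                weights[action] += bid
    return weights

ACTIONS = ["do_nothing", "go_to_candy", "call_family"]
SHARDS = [
    # "If candy nearby and hungry, then upweight actions which go to candy."
    Shard("candy", lambda c: {"candy_nearby", "hungry"} <= c, {"go_to_candy": 2.0}),
    Shard("family", lambda c: "phone_nearby" in c, {"call_family": 1.5}),
]

candy_context = frozenset({"candy_nearby", "hungry"})
weights_near_candy = action_weights(SHARDS, candy_context, ACTIONS)
```

In this toy, a context that matches no trigger leaves every weight at zero. The sketch says nothing about whether such a context is stable for a real agent, which is the question the paragraph above raises.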

Consider that, for an agent with lots of value shards (e.g. candy, family, thrill-seeking, music), the "doing nothing" context is a very unstable equilibrium. I think these shards will activate on t...

3Thane Ruthenis2mo
Another point here is that "an empty room" doesn't mean "no context". Presumably when you're sitting in an empty room, your world-model is still active, and it's still tracking events that you expect to be happening in the world outside the room, which your shards see too. So, e.g., if you have a meeting scheduled in a week, and you went into an empty room, after a few days there your world-model would start saying "the meeting is probably soon", and that would prompt your punctuality shard. Similarly, your self-model is part of the world-model, so even if everything outside the empty room were wiped out, you'd still have your "internal context", and there'd be some shards that activate in response to events in it as well. It's actually pretty difficult to imagine what an actual "no context" situation for realistic agents would look like. I guess you can imagine surgically removing all input channels from the WM to shards, to model this?

Transplanting algorithms into randomly initialized networks. I wonder if you could train a policy network to walk upright in sim, back out the "walk upright" algorithm, randomly initialize a new network which can call that algorithm as a "subroutine call" (but the walk-upright weights are frozen), and then have the new second model learn to call that subroutine appropriately? Possibly the learned representations would be convergently similar enough to interface quickly via SGD update dynamics.

If so, this provides some (small, IMO) amount of rescue fo...

How the power-seeking theorems relate to the selection theorem agenda.

1. Power-seeking theorems. P(agent behavior | agent decision-making procedure, agent objective, other agent internals, environment).

I've mostly studied the likelihood function for power-seeking behavior: what decision-making procedures, objectives, and environments produce what behavioral tendencies. I've discovered some gears for what situations cause what kinds of behaviors.
1. The power-seeking theorems also allow some discussion of P(agent behavior | agent training process, trai
...

Argument sketch for why boxing is doomed if the agent is perfectly misaligned:

Consider a perfectly misaligned agent which has -1 times your utility function, so the interaction is zero-sum. Then suppose you got useful output from the agent. This means you're able to increase your EU. This means the AI decreased its EU by saying anything. Therefore, it should have shut up instead. But since we assume it's smarter than you, it realized this possibility, and so the fact that it's saying something means that it expects to gain by hurting your interests via its output. Therefore, the output can't be useful.
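The argument can be sketched in a few lines. The assumptions here are mine and are stronger than the prose: the agent knows the true expected effect of each possible output on your utility, silence is always available with effect zero, and the candidate outputs and their effects are random made-up numbers.

```python
import random

def agent_choice(human_eu_deltas):
    """Pick the option that is best for an agent whose utility is -1 times yours.

    human_eu_deltas maps each candidate output to its expected effect on
    *your* utility. Silence is always available and has effect 0.0.
    """
    options = [("silence", 0.0)] + list(human_eu_deltas.items())
    # The agent's expected gain from an option is the negative of yours.
    return max(options, key=lambda option: -option[1])

rng = random.Random(0)
results = []
for _ in range(1000):
    # Hypothetical menu of five outputs with random expected effects on you.
    outputs = {f"message_{i}": rng.uniform(-1.0, 1.0) for i in range(5)}
    results.append(agent_choice(outputs))
```

Under these assumptions, every entry of `results` has an expected effect on you that is at most zero: the agent either stays silent or says something it expects to hurt you.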

0Viliam1y
Makes sense, with the proviso that this is sometimes true only statistically. Like, the AI may choose to write an output which has a 70% chance to hurt you and a 30% chance to (equally) help you, if that is its best option. If you assume that the AI is smarter than you, and has a good model of you, you should not read the output. But if you accidentally read it, and luckily you react in the right (for you) way, that is a possible result, too. You just cannot and should not rely on being so lucky.

My power-seeking theorems seem a bit like Vingean reflection. In Vingean reflection, you reason about an agent which is significantly smarter than you: if I'm playing chess against an opponent who plays the optimal policy for the chess objective function, then I predict that I'll lose the game. I predict that I'll lose, even though I can't predict my opponent's (optimal) moves - otherwise I'd probably be that good myself.

My power-seeking theorems show that most objectives have optimal policies which e.g. avoid shutdown and survive into the far future, even...
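A toy version of the "most objectives" claim, far simpler than the theorems themselves: from the start state the agent either enters a single shutdown state or stays on and picks one of several other terminal states. The sketch assumes rewards drawn independently and uniformly per terminal state, which is my simplification; `fraction_avoiding_shutdown` is a hypothetical helper that estimates how often the optimal choice avoids shutdown.

```python
import random

def avoids_shutdown(rng, num_options):
    """True if the best reachable non-shutdown state beats the shutdown state."""
    shutdown_reward = rng.random()
    other_rewards = [rng.random() for _ in range(num_options)]
    return max(other_rewards) > shutdown_reward

def fraction_avoiding_shutdown(num_options, num_samples=10_000, seed=0):
    """Monte Carlo estimate over randomly sampled reward functions."""
    rng = random.Random(seed)
    hits = sum(avoids_shutdown(rng, num_options) for _ in range(num_samples))
    return hits / num_samples
```

With `num_options` alternatives, the best of them beats the shutdown reward with probability `num_options / (num_options + 1)` in this toy, so the more options staying on preserves, the larger the share of sampled objectives whose optimal policy avoids shutdown.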

An additional consideration for early work on interpretability: it slightly increases the chance we actually get an early warning shot. If a system misbehaves, we can inspect its cognition and (hopefully) find hints of intentional deception. This could motivate thousands of additional researcher-hours being put into alignment.

1Raymond Arnold3y
That's an interesting point.

ARCHES distinguishes between single-agent/single-user and single-agent/multi-user alignment scenarios. Given assumptions like "everyone in society is VNM-rational" and "societal preferences should also follow VNM rationality", and "if everyone wants a thing, society also wants the thing", Harsanyi's utilitarian theorem shows that the societal utility function is a linear non-negative weighted combination of everyone's utilities. So, in a very narrow (and unrealistic) setting, Harsanyi's theorem tells you how the single-multi solution is built from the si

...
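The form of aggregation that Harsanyi's theorem describes can be written down directly. This is a minimal sketch of the weighted-sum form only, not of the theorem's proof; the two people, their utilities, and the weights are made-up assumptions.

```python
from typing import Callable, Dict, List

Utility = Callable[[str], float]

def social_utility(weights: List[float], utilities: List[Utility],
                   outcome: str) -> float:
    """Non-negative weighted sum of individual utilities for one outcome."""
    if len(weights) != len(utilities):
        raise ValueError("need exactly one weight per individual")
    if any(w < 0 for w in weights):
        raise ValueError("weights must be non-negative")
    return sum(w * u(outcome) for w, u in zip(weights, utilities))

# Hypothetical individuals and outcomes.
alice: Dict[str, float] = {"park": 1.0, "mall": 0.2}
bob: Dict[str, float] = {"park": 0.6, "mall": 0.5}
utilities: List[Utility] = [lambda o: alice[o], lambda o: bob[o]]
weights = [0.7, 0.3]

park_score = social_utility(weights, utilities, "park")
mall_score = social_utility(weights, utilities, "mall")
```

Because both individuals prefer the park and the weights are non-negative, `park_score` exceeds `mall_score`, matching the "if everyone wants a thing, society also wants the thing" assumption.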

Dylan: There’s one example that I think about, which is, say, you’re cooperating with an AI system playing chess. You start working with that AI system, and you discover that if you listen to its suggestions, 90% of the time, it’s actually suggesting the wrong move or a bad move. Would you call that system value-aligned?

Lucas: No, I would not.

Dylan: I think most people wouldn’t. Now, what if I told you that that program was act

...

We can imagine aliens building a superintelligent agent which helps them get what they want. This is a special case of aliens inventing tools. What kind of general process should these aliens use – how should they go about designing such an agent?

Assume that these aliens want things in the colloquial sense (not that they’re eg nontrivially VNM EU maximizers) and that a reasonable observer would say they’re closer to being rational than antirational. Then it seems[1] like these aliens eventually steer towards reflectively coherent rationality (provided they

...

Very rough idea

In 2018, I started thinking about corrigibility as "being the kind of agent lots of agents would be happy to have activated". This seems really close to a more ambitious version of what AUP tries to do (not be catastrophic for most agents).

I wonder if you could build an agent that rewrites itself / makes an agent which would tailor the AU landscape towards its creators' interests, under a wide distribution of creator agent goals/rationalities/capabilities. And maybe you then get a kind of generalization, where most simple algorithms which solve this solve ambitious AI alignment in full generality.

The answer to this seems obvious in isolation: shaping helps with credit assignment, rescaling doesn't (and might complicate certain methods in the advantage vs Q-value way). But I feel like maybe there's an important interaction here that could inform a mathematical theory of how a reward signal guides learners through model space?

Reasoning about learned policies via formal theorems on the power-seeking incentives of optimal policies

One way instrumental subgoals might arise in actual learned policies: we train a proto-AGI reinforcement learning agent with a curriculum including a variety of small subtasks. The current theorems show sufficient conditions for power-seeking tending to be optimal in fully-observable environments; many environments meet these sufficient conditions; optimal policies aren't hard to compute for the subtasks. One highly transferable heuristic would therefore...

I prompted GPT-3 with modified versions of Eliezer's Beisutsukai stories, where I modified the "class project" to be about solving intent alignment instead of quantum gravity.

... Taji looked over his sheets. "Okay, I think we've got to assume that every avenue that Eld science was trying is a blind alley, or they would have found it. And if this is possible to do in one month, the answer must be, in some sense, elegant. So no human mistake models. If we start doing anything that looks like we should call it 'utility function patching', we'd better st

...

Transparency Q: how hard would it be to ensure a neural network doesn't learn any explicit NANDs?

An AGI's early learned values will steer its future training and play a huge part in determining its eventual stable values. I think most of the ball game is in ensuring the agent has good values by the time it's smart, because that's when it'll start being reflectively stable. Therefore, we can iterate on important parts of alignment, because the most important parts come relatively early in the training run, and early corresponds to "parts of the AI value formation process which we can test before we hit AGI, without training one all the way out."

I think this, in theory, cuts away a substantial amount of the "But we only get one shot" problem. In practice, maybe OpenMind just YOLOs ahead anyways and we only get a few years in the appropriate and informative regime. But this suggests several kinds of experiments to start running now, like "get a Minecraft agent which robustly cares about chickens", because that tells us about how to map outer signals into inner values.

Which means that the destination [https://arbital.com/p/normative_extrapolated_volition/] where it's heading stops uncontrollably changing, but nobody at that point (including the agent) has the slightest idea what it looks like, and it won't get close for a long time. Also, the destination (preference/goal/values) would generally depend on the environment [https://www.lesswrong.com/posts/vix3K4grcHottqpEm/goal-alignment-is-robust-to-the-sharp-left-turn?commentId=qFEEdBezZmfcvH7ax] (it ends up being different if details of the world outside the AGI are different). So many cartesian assumptions fail, distinguishing this situation from a classical agent with goals, where goals are at least contained within the agent, and probably also don't depend on its state of knowledge. I think this is true for important alignment properties, including things that act like values early on, but not for the values/preferences that are reflectively stable in a strong sense. If it's possible to inspect/understand/interpret the content of preference that is reflectively stable, then what you've built is a mature optimizer [https://www.lesswrong.com/posts/Mrz2srZWc7EzbADSo/wrapper-minds-are-the-enemy] with tractable goals, which is always misaligned [https://www.lesswrong.com/posts/Mrz2srZWc7EzbADSo/wrapper-minds-are-the-enemy?commentId=qZcqur9mL95n7xWBG]. It's a thing like a paperclip maximizer, demonstrating the orthogonality thesis, even if it's tiling the future with something superficially human-related. That is, it makes sense to iterate on the parts of alignment that can be inspected, but the reflectively stable values are not such a part, unless the AI is catastrophically misaligned. The fact that reflectively stable values are the same as those of humanity might be such a part, but it's this fact of sameness that might admit inspection, not the values themselves.
2Alex Turner6mo
I disagree with CEV as I recall it, but this could change after rereading it. I would be surprised if I end up thinking that EY had "gotten it right." The important thing to consider is not "what has someone speculated a good destination-description would be", but "what do the actual mechanics look like for getting there?". In this case, the part of you which likes dogs is helping steer your future training and experiences, and so the simple answer is that it's more likely than not that your stable values like dogs too. This reasoning seems to prove too much. Your argument seems to imply that I cannot have "the slightest idea" whether my stable values would include killing people for no reason, or not.