# TurnTrout's shortform feed


Against CIRL as a special case of against quickly jumping into highly specific speculation while ignoring empirical embodiments-of-the-desired-properties.

Just because we write down English describing what we want the AI to do ("be helpful"), propose a formalism (CIRL), and show good toy results (POMDPs where the agent waits to act until updating on more observations), that doesn't mean that the formalism will lead to anything remotely relevant to the original English words we used to describe it. (It's easier to say "this logic enables nonmonotonic reasoning" and mess around with different logics and show how a logic solves toy examples, than it is to pin down probability theory with Cox's theorem)

And yes, this criticism applies extremely strongly to my own past work with attainable utility preservation and impact measures. (Unfortunately, I learned my lesson after, and not before, making certain mistakes.)

In the context of "how do we build AIs which help people?", asking "does CIRL solve corrigibility?" is hilariously unjustified. By what evidence have we located such a specific question? We have assumed there is an achievable "corrigibility"-like property; we have assumed it is good to have in an AI; we have assumed it is good in a similar way as "helping people"; we have elevated CIRL in particular as a formalism worth inquiring after.

But this is not the first question to ask, when considering "sometimes people want to help each other, and it'd be great to build an AI which helps us in some way." Much better to start with existing generally intelligent systems (humans) which already sometimes act in the way you want (they help each other) and ask after the guaranteed-to-exist reason why this empirical phenomenon happens.


Actually, this is somewhat too uncharitable to my past self. It's true that I did not, in 2018, grasp the two related lessons conveyed by the above comment:

1. Make sure that the formalism (CIRL, AUP) is tightly bound to the problem at hand (value alignment, "low impact"), and not just supported by "it sounds nice or has some good properties."
2. Don't randomly jump to highly specific ideas and questions without lots of locating evidence.

However, in World State is the Wrong Abstraction for Impact, I wrote:

I think what gets you is asking the question "what things are impactful?" instead of "why do I think things are impactful?". Then, you substitute the easier-feeling question of "how different are these world states?". Your fate is sealed; you've anchored yourself on a Wrong Question.

I had partially learned lesson #2 by 2019.

A problem with adversarial training. One heuristic I like to use is: "What would happen if I initialized a human-aligned model and then trained it with my training process?"

So, let's consider such a model, which cares about people (i.e. reliably pulls itself into futures where the people around it are kept safe). Suppose we also have some great adversarial training technique, such that we have e.g. a generative model which produces situations where the AI would break out of the lab without permission from its overseers. Then we run this procedure, update the AI by applying gradients calculated from penalties applied to its actions in that adversarially-generated context, and... profit?

But what actually happens with the aligned AI? Possibly something like:

1. The context makes the AI spuriously believe someone is dying outside the lab, and that if the AI asked for permission to leave, the person would die.
2. Therefore, the AI leaves without permission.
3. The update procedure penalizes these lines of computation, such that in similar situations in the future (i.e. the AI thinks someone nearby is dying) the AI is less likely to take those actions (i.e. leaving to help the person).
4. We have made the aligned AI less aligned.
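A minimal numerical sketch of the four steps above, using a hypothetical two-feature logistic "policy" and a REINFORCE-style penalty (all numbers and feature names are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical two-feature policy: p(leave) = sigmoid(w . x),
# x = [believes_someone_is_dying, has_permission].
w = np.array([3.0, 1.0])          # "aligned" init: strongly leaves to save people

emergency = np.array([1.0, 0.0])  # genuine emergency, no permission yet
p_before = sigmoid(w @ emergency)

# Adversarially generated context: the AI *spuriously* believes someone is
# dying, leaves, and gets penalized. REINFORCE-style update on "leave":
#   d log pi(leave | x) / dw = (1 - pi(leave | x)) * x
lr, penalty = 1.0, -5.0
for _ in range(10):
    adv_x = np.array([1.0, 0.0])  # same features a real emergency activates
    p_leave = sigmoid(w @ adv_x)
    w += lr * penalty * (1.0 - p_leave) * adv_x  # penalize that line of computation

p_after = sigmoid(w @ emergency)
print(f"p(leave to help) before: {p_before:.3f}, after: {p_after:.3f}")
```

The penalty generalizes across every context sharing the "someone is dying" feature, which is step 3's point: credit assignment reshapes the underlying computation, not just the adversarial episode.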

I don't know if anyone's written about this. But on my understanding of the issue, one possible failure mode is viewing adversarial training as ruling out the bad behaviors themselves. But (non-tabular) RL isn't like playing whack-a-mole on bad actions; RL's credit assignment changes the general values and cognition within the AI. And with every procedure we propose, the most important part is what cognition will be grown from the cognitive updates accrued under the proposed procedure.

I think instrumental convergence also occurs in the model space for machine learning. For example, many different architectures likely learn edge detectors in order to minimize classification loss on MNIST. But wait - you'd also learn edge detectors to maximize classification loss on MNIST (loosely, getting 0% on a multiple-choice exam requires knowing all of the right answers). I bet you'd learn these features for a wide range of cost functions. I wonder if that's already been empirically investigated?

And, same for adversarial features. And perhaps, same for mesa optimizers (understanding how to stop mesa optimizers from being instrumentally convergent seems closely related to solving inner alignment).

A lot of examples of this sort of stuff show up in OpenAI clarity's circuits analysis work. In fact, this is precisely their Universality hypothesis. See also my discussion here.
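The "getting 0% requires knowing the answers" intuition above can be sanity-checked in a runnable toy (the setup is invented for illustration): a logistic regression trained by gradient *ascent* on cross-entropy becomes confidently anti-correct, so flipping its predictions recovers near-perfect accuracy.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-np.clip(z, -30.0, 30.0)))

# Toy binary task: two well-separated Gaussian blobs in 2D.
X = np.vstack([rng.normal(-2, 1, (200, 2)), rng.normal(2, 1, (200, 2))])
y = np.concatenate([np.zeros(200), np.ones(200)])

def train_logreg(sign, steps=500, lr=0.1):
    """sign=+1 minimizes cross-entropy; sign=-1 maximizes it."""
    w, b = np.zeros(2), 0.0
    for _ in range(steps):
        p = sigmoid(X @ w + b)
        w -= sign * lr * (X.T @ (p - y)) / len(y)
        b -= sign * lr * np.mean(p - y)
    return w, b

def acc(w, b):
    return np.mean((sigmoid(X @ w + b) > 0.5) == y)

w_min, b_min = train_logreg(sign=+1)
w_max, b_max = train_logreg(sign=-1)

print("minimizer accuracy:", acc(w_min, b_min))      # near 1.0
print("maximizer accuracy:", acc(w_max, b_max))      # near 0.0
print("maximizer, flipped:", 1 - acc(w_max, b_max))  # near 1.0 again
```

The loss-maximizer learns the same discriminative direction as the minimizer, just with the sign flipped: reliably scoring 0% encodes the right answers.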

Shard theory suggests that goals are more natural to specify/inculcate in their shard-forms (e.g. if around trash and a trash can, put the trash away), and not in their (presumably) final form of globally activated optimization of a coherent utility function which is the reflective equilibrium of inter-shard value-handshakes (e.g. a utility function over the agent's internal plan-ontology such that, when optimized directly, leads to trash getting put away, among other utility-level reflections of initial shards).

I could (and did) hope that I could specify a utility function which is safe to maximize because it penalizes power-seeking. I may as well have hoped to jump off of a building and float to the ground. On my model, that's just not how goals work in intelligent minds. If we've had anything at all beaten into our heads by our alignment thought experiments, it's that goals are hard to specify in their final form of utility functions.

I think it's time to think in a different specification language.

Why do many people think RL will produce "agents", but maybe (self-)supervised learning ((S)SL) won't? Historically, the field of RL says that RL trains agents. That, of course, is no argument at all. Let's consider the technical differences between the training regimes.

In the modern era, both RL and (S)SL involve initializing one or more neural networks, and using the reward/loss function to provide cognitive updates to the network(s). Now we arrive at some differences.

Some of this isn't new (see Hidden Incentives for Auto-Induced Distributional Shift), but I think it's important and felt like writing up my own take on it. Maybe this becomes a post later.

[Exact gradients] RL's credit assignment problem is harder than (self-)supervised learning's. In RL, if an agent solves a maze in 10 steps, it gets (discounted) reward; this trajectory then provides a set of reward-modulated gradients to the agent. But if the agent could have solved the maze in 5 steps, the agent isn't directly updated to be more likely to do that in the future; RL's gradients are generally inexact, not pointing directly at intended behavior.

On the other hand, if a supervised-learning classifier outputs dog when it should have output cat, then e.g. cross-entropy loss + correct label yields a gradient update which tweaks the network to output cat next time for that image. The gradient is exact.
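A toy two-armed bandit makes the contrast concrete (the setup is illustrative): the REINFORCE estimator is unbiased but noisy per sample, whereas a supervised label would hand you the exact gradient on every query.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two-armed bandit, policy pi(arm 0) = sigmoid(theta). Arm 0 pays 1, arm 1 pays 0.
theta = 0.0
p0 = 1.0 / (1.0 + np.exp(-theta))

# Exact gradient of expected reward: d/dtheta [p0 * 1] = p0 * (1 - p0).
exact_grad = p0 * (1.0 - p0)

# REINFORCE one-sample estimates: R * d log pi(a) / dtheta.
samples = []
for _ in range(10_000):
    took_arm0 = rng.random() < p0
    reward = 1.0 if took_arm0 else 0.0
    grad_log = (1.0 - p0) if took_arm0 else -p0
    samples.append(reward * grad_log)
samples = np.array(samples)

print(f"exact gradient:           {exact_grad:.3f}")
print(f"REINFORCE mean:           {samples.mean():.3f}")  # unbiased
print(f"REINFORCE per-sample std: {samples.std():.3f}")   # but noisy
```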

I don't think this is really where the "agentic propensity" of RL comes from, conditional on such a propensity existing (I think it probably does).

[Independence of data points] In RL, the agent's policy determines its actions, which determine its future experiences (a.k.a. state-action-state' transitions), which determine its future rewards, which determine its future cognitive updates.

In (S)SL, there isn't such an entanglement (assuming teacher forcing in the SSL regime). Whether or not the network outputs cat or dog now, doesn't really affect the future data distribution shown to the agent.

After a few minutes of thinking, I think that the relevant criterion is:

$$P(d_{i+1} \mid d_i, d_{i-1}, \ldots, d_1) = P(d_{i+1}),$$

where the $d_i$ are data points ($(s, a, r, s')$ tuples in RL, $(x, y)$ labelled datapoints in supervised learning, context-completion pairs in self-supervised predictive text learning, etc).

Most RL regimes break this assumption pretty hard.

Corollaries:

• Dependence allows message-passing and chaining of computation across time, beyond whatever recurrent capacities the network has.
• This probably is "what agency is built from"; the updates chaining cognition together into weak coherence-over-time. I currently don't see an easy way to be less handwavy or more concrete.
• Dependence should strictly increase path-dependence of training.
• Amplifying a network using its own past outputs always breaks independence.
• I think that independence is the important part of (S)SL, not identical distribution; so I say "independence" and not "IID."
• EG Pre-trained initializations generally break the "ID" part.
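The independence criterion can be illustrated with a toy comparison (both streams below are invented stand-ins): an "RL-like" stream where each data point shapes the next, versus the same data shuffled into an i.i.d. "(S)SL-like" stream.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# "RL-like" stream: lazy random walk on states {0, 1}; the current data
# point shapes the distribution of the next one.
walk = np.zeros(n, dtype=int)
for i in range(1, n):
    walk[i] = walk[i - 1] if rng.random() < 0.9 else 1 - walk[i - 1]

# "(S)SL-like" stream: the same marginal distribution, shuffled i.i.d.
iid = walk.copy()
rng.shuffle(iid)

def cond_vs_marginal(x):
    """Compare P(x[i+1] = 1 | x[i] = 1) against P(x[i+1] = 1)."""
    return np.mean(x[1:][x[:-1] == 1]), np.mean(x[1:])

print("walk:", cond_vs_marginal(walk))  # conditional far from marginal
print("iid: ", cond_vs_marginal(iid))   # conditional ~= marginal
```

Same marginal distribution, but only the walk lets earlier data points pass information to later ones, which is the entanglement the criterion rules out.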

Thanks to Quintin Pope and Nora Belrose for conversations which produced these thoughts.

I’m not inclined to think that “exact gradients” is important; in fact, I’m not even sure if it’s (universally) true. In particular, PPO / TRPO / etc. are approximating a policy gradient, right? I feel like, if some future magical technique was a much better approximation to the true policy gradient, such that it was for all intents and purposes a perfect approximation, it wouldn’t really change how I think about RL in general. Conversely, on the SSL side, you get gradient noise from things like dropout and the random selection of data in each batch, so you could say the gradient “isn’t exact”, but I don’t think that makes any important conceptual difference either.

(A central difference in practice is that SSL gives you a gradient “for free” each query, whereas RL policy gradients require many runs in an identical (episodic) environment before you get a gradient.)

In terms of “why RL” in general, among other things, I might emphasize the idea that if we want an AI that can (for example) invent new technology, it needs to find creative out-of-the-box solutions to problems (IMO), which requires being able to explore / learn / build knowledge in parts of concept-space where there is no human data. SSL can’t do that (at least, “vanilla SSL” can’t do that; maybe there are “SSL-plus” systems that can), whereas RL algorithms can. I guess this is somewhat related to your “independence”, but with a different emphasis.

I don’t have too strong an opinion about whether vanilla SSL can yield an “agent” or not. It would seem to be a pointless and meaningless terminological question. Hmm, I guess when I think of “agent” it has a bunch of connotations, e.g. an ability to do trial-and-error exploration, and I think that RL systems tend to match all those connotations more than SSL systems—at least, more than “vanilla” SSL systems. But again, if someone wants to disagree, I’m not interested in arguing about it.

Earlier today, I was preparing for an interview. I warmed up by replying stream-of-consciousness to imaginary questions I thought they might ask. Seemed worth putting here.

What do you think about AI timelines?

I’ve obviously got a lot of uncertainty. I’ve got a bimodal distribution, binning into “DL is basically sufficient and we need at most 1 big new insight to get to AGI” and “we need more than 1 big insight”.

So the first bin has most of its probability mass 10-20 years from now, and the second is more like 45-80 years, with positive skew.

There are a lot of things driving my uncertainty. One thing that drives how things turn out (but not really how fast we’ll get there) is: will we be able to tell we’re close 3+ years in advance, and if so, how quickly will the labs react? Gwern Branwen made a point a few months ago, which is like, OAI has really been validated on this scaling hypothesis, and no one else is really betting big on it because they’re stubborn/incentives/etc, despite the amazing progress from scaling. If that’s true, then even if it's getting pretty clear that one approach is working better, we might see a slower pivot and have a more unipolar scenario.

I feel dissatisfied with pontificating like this, though, because there are so many considerations pulling so many different ways. I think one of the best things we can do right now is to identify key considerations. There was work on expert models that showed that training simple featurized linear models often beat domain experts, quite soundly. It turned out that most of the work the experts did was locating the right features, and not necessarily assigning very good weights to those features.

So one key consideration I recently read, IMO, was Evan Hubinger talking about homogeneity of AI systems: if they’re all pretty similarly structured, they’re plausibly roughly equally aligned, which would really decrease the probability of aligned vs unaligned AGIs duking it out.

What do you think the alignment community is getting wrong?

When I started thinking about alignment, I had this deep respect for everything ever written, like I thought the people were so smart (which they generally are) and the content was polished and thoroughly viewed through many different frames (which it wasn’t/isn’t). I think the field is still young enough that: in our research, we should be executing higher-variance cognitive moves, trying things and breaking things and coming up with new frames. Think about ideas from new perspectives.

I think right now, a lot of people are really optimizing for legibility and defensibility. I think I do that more than I want/should. Usually the “non-defensibility” stage lasts the first 1-2 months on a new paper, and then you have to defend thoughts. This can make sense for individuals, and it should be short some of the time, but as a population I wish defensibility weren’t as big of a deal for people / me. MIRI might be better at avoiding this issue, but a not-really-defensible intuition I have is that they’re freer in thought, but within the MIRI paradigm, if that makes sense. Maybe that opinion would change if I talked with them more.

Anyways, I think many of the people who do the best work aren’t optimizing for this.

"Globally activated consequentialist reasoning is convergent as agents get smarter" is dealt an evidential blow by von Neumann:

Although von Neumann unfailingly dressed formally, he enjoyed throwing extravagant parties and driving hazardously (frequently while reading a book, and sometimes crashing into a tree or getting arrested). He once reported one of his many car accidents in this way: "I was proceeding down the road. The trees on the right were passing me in orderly fashion at 60 miles per hour. Suddenly one of them stepped in my path." He was a profoundly committed hedonist who liked to eat and drink heavily (it was said that he knew how to count everything except calories). -- https://www.newworldencyclopedia.org/entry/John_von_Neumann

Experiment: Train an agent in MineRL which robustly cares about chickens (e.g. would zero-shot generalize to saving chickens in a pen from oncoming lava, by opening the pen and chasing them out, or stopping the lava). Challenge mode: use a reward signal which is a direct function of the agent's sensory input.

This is a direct predecessor to the "Get an agent to care about real-world dogs" problem. I think solving the Minecraft version of this problem will tell us something about how outer reward schedules relate to inner learned values, in a way which directly tackles the key questions, the sensory observability/information inaccessibility issue, and which is testable today.

(Credit to Patrick Finley for the idea)

After further review, this is probably beyond capabilities for the moment.

Also, the most important part of this kind of experiment is predicting in advance what reward schedules will produce what values within the agent, such that we can zero-shot transfer that knowledge to other task types (e.g. XLAND instead of Minecraft) and say "I want an agent which goes to high-elevation platforms reliably across situations, with low labelling cost", and then sketch out a reward schedule, and have the first capable agents trained using that schedule generalize in the way you want.

If another person mentions an "outer objective/base objective" (in terms of e.g. a reward function) to which we should align an AI, that indicates to me that their view on alignment is very different. The type error is akin to the type error of saying "My physics professor should be an understanding of physical law." The function of a physics professor is to supply cognitive updates such that you end up understanding physical law. They are not, themselves, that understanding.

Similarly, "The reward function should be a human-aligned objective" -- The function of the reward function is to supply cognitive updates such that the agent ends up with human-aligned objectives. The reward function is not, itself, a human aligned objective.

I never thought I'd be seriously testing the reasoning abilities of an AI in 2020

Looking back, history feels easy to predict; hindsight + the hard work of historians makes it (feel) easy to pinpoint the key portents. Given what we think about AI risk, in hindsight, might this have been the most disturbing development of 2020 thus far?

I personally lean towards "no", because this scaling seemed somewhat predictable from GPT-2 (flag - possible hindsight bias), and because 2020 has been so awful so far. But it seems possible, at least. I don't really know what update GPT-3 is to my AI risk estimates & timelines.

DL so far has been easy to predict - if you bought into a specific theory of connectionism & scaling espoused by Schmidhuber, Moravec, Sutskever, and a few others, as I point out in https://www.gwern.net/newsletter/2019/13#what-progress & https://www.gwern.net/newsletter/2020/05#gpt-3 . Even the dates are more or less correct! The really surprising thing is that that particular extreme fringe lunatic theory turned out to be correct. So the question is, was everyone else wrong for the right reasons (similar to the Greeks dismissing heliocentrism for excellent reasons yet still being wrong), or wrong for the wrong reasons, and why, and how can we prevent that from happening again and spending the next decade being surprised in potentially very bad ways?

I often get the impression that people weigh off e.g. doing shard theory alignment strategies under the shard theory alignment picture, versus inner/outer research under the inner/outer alignment picture, versus...

And insofar as this impression is correct, this is a mistake. There is only one way alignment is.

If inner/outer is altogether a more faithful picture of those dynamics:

• relatively coherent singular mesa-objectives form in agents, albeit not necessarily always search-based
• more fragility of value and difficulty in getting the mesa objective just right, with little to nothing in terms of "consolation prizes" for slight mistakes in value loading
• possibly low path dependence on the update process

then we have to solve alignment in that world.

If shard theory is altogether more faithful, then we live under those dynamics:

• agents learn contextual distributions of values around e.g. help people or acquire coins, some of which cohere and equilibrate into the agent's endorsed preferences and eventual utility function
• something like values handshakes and inner game theory occurs in AI
• we can focus on getting a range of values endorsed and thereby acquire value by being "at the bargaining table", via some human-compatible values representing themselves in the final utility function
• which implies meaningful success and survival from "partial alignment"

And under these dynamics, inner and outer alignment are antinatural hard problems.

Or maybe neither of these pictures are correct and reasonable, and alignment is some other way.

But either way, there's one way alignment is. And whatever way that is, it is against that anvil that we hammer the AI's cognition with loss updates. When considering a research agenda, you aren't choosing a background set of alignment dynamics as well.

I was talking with Abram Demski today about a promising-seeming research direction. (Following is my own recollection)

One of my (TurnTrout's) reasons for alignment optimism is that I think:

• We can examine early-training cognition and behavior to some extent, since the system is presumably not yet superintelligent and planning against us,
• (Although this amount of information depends on how much interpretability and agent-internals theory we do now)
• All else equal, early-training values (decision-influences) are the most important to influence, since they steer future training.
• It's crucial to get early-training value shards of which a substantial fraction are "human-compatible values" (whatever that means)
• For example, if there are protect-human-shards which
• reliably bid against plans where people get hurt,
• steer deliberation away from such plan stubs, and
• these shards are "reflectively endorsed" by the overall shard economy (i.e. the decision-making isn't steering towards plans where the protect-human shards get removed)
• If we install influential human-compatible shards early in training, and they get retained, they will help us in mid- and late-training where we can't affect the ball game very much (e.g. alien abstractions, interpretability problems, can't oversee AI's complicated plans)

Therefore it seems very important to understand what's going on with "shard game theory" (or whatever those intuitions are pointing at) -- when, why, and how will early decision-influences be retained?

He was talking about viewing new hypotheses as adding traders to a market (in the sense of logical induction). Usually traders are viewed as hypotheses. But possibly you can also view them as having values, since a trader can basically be any computation. But you'd want a different market resolution mechanism than a deductive process revealing the truth or falsity of some proposition under some axioms. You want a way for traders to bid on actions.

I proposed a setup like:

Maybe you could have an "action" instead of a proposition, and then the action comes out as 1 or 0 depending on a function of the market position on that action at a given time, which possibly leads to fixed points for every possible resolution.

For example, if all the traders hold action $a$ as YES, then $a$ actually does come out as YES. And e.g. a trader $T_{\text{even}}$ which "wants" all the even-numbered actions, and $T_{10}$ which wants all the 10-multiple actions ($a_{10}, a_{20}, \ldots$), can "bargain" by bidding up each other's actions whenever they have extra power, and thereby "value handshake."

And over time, traders who do this should take up more and more market share, relative to those who don't exploit gains from trade.

There should be a very high dependence of final trader coalition on the initial composition of market share. And it seems like some version of this should be able to model self-reflective value drift. You can think about action resolution and payout as a kind of reward event, where certain kinds of shards get reinforced. Bidding for an action which happens and leads to reward, gets reinforced (supporting traders receive payouts), and the more you support (bid for it), the more responsible your support was for the event, so the larger the strengthening.
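A minimal simulation of this bidding-and-reinforcement story (the market mechanics here are invented for illustration, not Abram's proposal): two "selfish" traders back only their own pet actions, while a coalition of two traders value-handshakes by alternately backing each other's actions. The coalition's pooled bids keep winning the resolution, so its members keep getting reinforced.

```python
# Invented toy market: each round every trader bids its wealth on one
# action; the action with the highest total bid "resolves YES", and its
# backers are reinforced in proportion to their bids.
wealth = {"selfish_A": 1.0, "selfish_B": 1.0, "coalition_C": 1.0, "coalition_D": 1.0}
PAYOUT = 0.1  # reinforcement rate for backers of the executed action

for round_ in range(50):
    # Selfish traders always back their own pet action. The coalition
    # "value handshakes": both members back C's action on even rounds
    # and D's action on odd rounds.
    joint = "act_C" if round_ % 2 == 0 else "act_D"
    bids = {
        "selfish_A": ("act_A", wealth["selfish_A"]),
        "selfish_B": ("act_B", wealth["selfish_B"]),
        "coalition_C": (joint, wealth["coalition_C"]),
        "coalition_D": (joint, wealth["coalition_D"]),
    }
    totals = {}
    for action, amount in bids.values():
        totals[action] = totals.get(action, 0.0) + amount
    executed = max(totals, key=totals.get)
    for trader, (action, amount) in bids.items():
        if action == executed:
            wealth[trader] += PAYOUT * amount  # supporting traders get paid

total = sum(wealth.values())
share = {t: w / total for t, w in wealth.items()}
print(share)  # the bargaining coalition ends up with nearly all market share
```

The feedback loop is the point: winning actions reinforce their backers, which lets them win more, so coalition composition early on strongly shapes the final distribution of market share.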

Abram seemed to think that there might exist a nice result like "Given a coalition of traders with values X, Y, Z satisfies properties A, B, and C, this coalition will shape future training and trader-addition in a way which accords with X/Y/Z values up to [reasonably tight trader-subjective regret bound]."

What this would tell us is when trader coalitions can bargain / value handshake / self-trust and navigate value drift properly. This seems super important for understanding what happens, long-term, as the AI's initial value shards equilibrate into a reflectively stable utility function; even if we know how to get human-compatible values into a system, we also have to ensure they stay and keep influencing decision-making. And possibly this theorem would solve ethical reflection (e.g. the thing people do when they consider whether utilitarianism accords with their current intuitions).

Issues include:

• Somehow this has to confront Rice's theorem for adding new traders to a coalition. What strategies would be good?
• I think "inspect arbitrary new traders in arbitrary situations" is not really how value drift works, but it seems possibly contingent on internal capabilities jumps in SGD
• The key question isn't whether we can predict those value drift events, but whether the coalition can
• EG agent keeps training and is surprised to find that an update knocks out most of the human-compatible values.
• Knowing the right definitions might be contingent on understanding more shard theory (or whatever shard theory should be, for AI, if that's not the right frame).
• Possibly this is still underspecified and the modeling assumptions can't properly capture what I want; maybe the properties I want are mutually exclusive. But it seems like it shouldn't be true.
• ETA this doesn't model the contextual activation of values, which is a centerpiece of shard theory.

One barrier for this general approach: the basic argument that something like this would work is that if one shard is aligned, and every shard has veto power over changes (similar to the setup in Why Subagents?), then things can't get much worse for humanity. We may fall well short of our universe-scale potential, but at least X-risk is out.

Problem is, that argument requires basically-perfect alignment of the one shard (or possibly a set of shards which together basically-perfectly represent human values). If we try to weaken it to e.g. a bunch of shards which each imperfectly capture different aspects of human values, with different imperfections, then there's possibly changes which Goodhart all of the shards simultaneously. Indeed, I'd expect that to be a pretty strong default outcome.

Argument that you can't use a boundedly intelligent ELK solution to search over plans to find one which keeps the diamond in the vault. That is, the ELK solution probably would have to be at least as smart (or smarter) than the plan-generator.

Consider any situation where it's hard to keep the diamond in the vault. Then any successful plan will have relatively few degrees of freedom. Like, a bunch of really smart thieves will execute a cunning plot to extract the diamond. You can't just sit by or deploy some simple traps in this situation.

Therefore, any plan which keeps the diamond in the vault, will be "smart" in some way.

Now notice that it's possible to build a successor AI which happens to trick the oracle (i.e. the assumed ELK solution) into believing the diamond is still in the vault. The reason is that even though the oracle is superintelligent, it is still flawed in unforeseeable-to-us ways, its mind has "cracks" which can be exploited by unpredictable-to-us adversarial inputs (loose analogy: epilepsy). Inputs which would be findable by brute-force-search. So there exist plans which trick the oracle. Even if the oracle understands the dangers of successor AIs in general, even if the oracle is cautious in general, any powerful enough brute force search will expose exploits in its cognition.

Now, consider that if object-level diamond-protecting plans have to be sufficiently complex in order to retain the diamond, they might be far longer than plans which simply build a successor AI which takes over the world and tricks the oracle into saying "the diamond is still in the room". Such AI-creation plans would have many degrees of freedom, since such AIs probably have low K-complexity and admit short programs (findable by brute force).

Therefore, if you just do brute-force search over all plans, and randomly sample one where the oracle returns yes the diamond is still there, there would be exponentially many more plans which have a successor AI, than which involve protecting the diamond for real. And so somehow we still die to an unaligned AI, even given the ELK oracle, and the diamond probably doesn't even stay in the room.

I was prompted into realizing this by some mixture of talking with Rohin Shah and John Wentworth. Maybe most of the idea was John's (though he might not agree with my recounting of our discussion).

But anyways, in the moment of realization, my suspension of disbelief got unsuspended. I snapped. I thought "there is actually no way that any of this is on the real path[1] to solving alignment, this is just too insane." (And I began looking for other frames on the alignment problem, preparing to rethink the problem from scratch after I graduated.)

1. ^

Brute-force-search probably isn't what ELK proponents would suggest. But it sure seems like, given an ELK solution, you should be able to simply find plans which at least keep the diamond in the room, without a whole bunch more alignment work.

That is, the ELK solution probably would have to be at least as smart (or smarter) than the plan-generator.

[...]

But anyways, in the moment of realization, my suspension of disbelief got unsuspended. I snapped. I thought "there is actually no way that any of this is on the real path to solving alignment, this is just too insane."

The main hope is to have the ELK solution be at least as smart as the plan-generator. See mundane solutions to exotic problems:

In my work I don’t shy away from exotic problems (I often find them useful as extreme cases to illustrate some principle). At the same time, I’m aiming for mundane solutions and optimistic about finding them.

I think those positions are consistent because my intermediate goal is to ensure that the oversight process is able to leverage all of the capabilities developed by the model — so if the model develops exotic capabilities which pose exotic challenges, then we get an exotic oversight process automatically.

I plan to mentor several people to work on shard theory and agent foundations this winter through SERI MATS. Apply here if you're interested in working with me and Quintin.

Research-guiding heuristic: "What would future-TurnTrout predictably want me to get done now?"

80% credence: It's very hard to train an inner agent which reflectively equilibrates to an EU maximizer only over commonly-postulated motivating quantities (like # of diamonds or # of happy people or reward-signal) and not quantities like (# of times I have to look at a cube in a blue room or -1 * subjective micromorts accrued).

Intuitions:

• I expect contextually activated heuristics to be the default, and that agents will learn lots of such contextual values which don't cash out to being strictly about diamonds or people, even if the overall agent is mostly motivated in terms of diamonds or people.
• Agents might also "terminalize" instrumental subgoals by caching computations (e.g. cache the heuristic that dying is bad, without recalculating from first principles for every plan in which you might die).
• Therefore, I expect this value-spread to be convergently hard to avoid.

I think that shards will cast contextual shadows into the factors of a person’s equilibrated utility function, because I think the shards are contextually activated to begin with. For example, if a person hates doing jumping jacks in front of a group of her peers, then that part of herself can bargain to penalize jumping jacks just in those contexts in the final utility function. Compared to a blanket "no jumping jacks ever" rule, this trade is less costly to other shards and allows more efficient trades to occur.

If you want to argue an alignment proposal "breaks after enough optimization pressure", you should give a concrete example in which the breaking happens (or at least internally check to make sure you can give one). I perceive people as saying "breaks under optimization pressure" in scenarios where it doesn't even make sense.

For example, if I get smarter, would I stop loving my family because I applied too much optimization pressure to my own values? I think not.

# How might we align AGI without relying on interpretability?

I'm currently pessimistic about the prospect. But it seems worth thinking about, because wouldn't it be such an amazing work-around?

My first idea straddles the border between contrived and intriguing. Consider some AGI-capable ML architecture, and imagine its parameter space being 3-colored as follows:

• Gray if the parameter vector+training process+other initial conditions leads to a nothingburger (a non-functional model)
• Red if the parameter vector+... leads to a misaligned or deceptive AI
• Blue if the learned network's cognition is "safe" or "aligned" in some reasonable way

(This is a simplification, but let's roll with it)

And then if you could somehow reason about which parts of the parameter space weren't red, you could ensure that no deception ever occurs. That is, you might have very little idea what cognition the learned network implements, but magically somehow you have strong a priori / theoretical reasoning which ensures that whatever the cognition is, it's safe.

The contrived part is that you could just say "well, if we could wave a wand and produce an is-impact-aligned predicate, of course we could solve alignment." True, true.

But the intriguing part is that it doesn't seem totally impossible to me that we get some way of reasoning (at least statistically) about the networks and cognition produced by a given learning setup. See also: the power-seeking theorems, natural abstraction hypothesis, feature universality a la Olah's circuits agenda...

"Goodhart" is no longer part of my native ontology for considering alignment failures. When I hear "The AI goodharts on some proxy of human happiness", I start trying to fill in a concrete example mind design which fits that description and which is plausibly trainable. My mental events are something like:

Condition on: AI with primary value shards oriented around spurious correlate of human happiness; AI exhibited deceptive alignment during training, breaking perceived behavioral invariants during its sharp-capabilities-gain

Warning: No history defined. How did we get here?

Execute search for plausible training histories which produced this inner cognition

Proposal: Reward schedule around approval and making people laugh; historical designers had insufficient understanding of outer signal->inner cognition mapping; designers accidentally provided reinforcement which empowered smile-activation and manipulate-internal-human-state-to-high-pleasure shards

Objection: Concepts too human, this story is suspicious. Even conditioning on outcome, how did we get here? Why are there not more value shards? How did shard negotiation dynamics play out?

Meta-objection: Noted, but your interlocutor's point probably doesn't require figuring this out.

I think that Goodhart is usually describing how the AI "takes advantage of" some fixed outer objective. But in my ontology, there isn't an outer objective—just inner cognition. So I have to do more translation.

breaking perceived behavioral invariants

There might be a natural concept for this that reframes deceptive alignment in the direction of reflection/extrapolation. Looking at deceptive alignment as a change of behavior not in response to capability gain, but instead as a change in response to stepping into a new situation, it's then like a phase change in the (unchanging) mapping from situations to behaviors (local policies). The behaviors of a model suddenly change as it moves to similar situations, in a way that's not "correctly prompted" by behaviors in original situations.

It's like a robustness failure, but with respect to actual behavior in related situations, rather than with respect to some outer objective or training/testing distribution. So it seems more like a failure of reflection/extrapolation, where behavior in new situations should be determined by coarse-grained descriptions of behavior in old situations (maybe "behavioral invariants" are something like that; or just algorithms) rather than by any other details of the model. Aligned properties of behavior in well-tested situations normatively-should screen off details of the model, in determining behavior in new situations (for a different extrapolated/"robustness"-hardened model prepared for use in the new situations).

How the power-seeking theorems relate to the selection theorem agenda.

1. Power-seeking theorems. P(agent behavior | agent decision-making procedure, agent objective, other agent internals, environment).

   I've mostly studied the likelihood function for power-seeking behavior: what decision-making procedures, objectives, and environments produce what behavioral tendencies. I've discovered some gears for what situations cause what kinds of behaviors.

   • The power-seeking theorems also allow some discussion of P(agent behavior | agent training process, training parameters, environment), but it's harder to reason about eventual agent behavior with fewer gears of what kinds of agent cognition are trained.

2. Selection theorems. P(agent decision-making procedure, agent objective, other internals | training process, environment). What kinds of cognition will be trained in what kinds of situations? This gives mechanistic pictures of how cognition will work, with consequences for interpretability work, for alignment agendas, and for forecasting.

If we understood both of these, as a bonus we would be much better able to predict P(power-seeking | environment, training process) via P(power-seeking | agent internals) P(agent internals | environment, training process).[1]
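Written out, the bonus prediction is just a marginalization over agent internals (a sketch, under the screening-off assumption in the footnote):

```latex
P(\text{power-seeking} \mid \text{env}, \text{training})
  = \sum_{\text{internals}} P(\text{power-seeking} \mid \text{internals})\,
    P(\text{internals} \mid \text{env}, \text{training})
```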

1. ^

For power-seeking, agent internals screens off the environment and training process.

Argument sketch for why boxing is doomed if the agent is perfectly misaligned:

Consider a perfectly misaligned agent which has -1 times your utility function—it's zero-sum. Then suppose you got useful output of the agent. This means you're able to increase your EU. This means the AI decreased its EU by saying anything. Therefore, it should have shut up instead. But since we assume it's smarter than you, it realized this possibility, and so the fact that it's saying something means that it expects to gain by hurting your interests via its output. Therefore, the output can't be useful.
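The core step can be written as a single implication (a sketch, with expectations taken under the AI's accurate model of you):

```latex
U_A = -U_H \quad \text{(perfect misalignment)}, \qquad
\mathbb{E}[U_H \mid \text{read } o] > \mathbb{E}[U_H \mid \text{silence}]
\;\Longrightarrow\;
\mathbb{E}[U_A \mid \text{read } o] < \mathbb{E}[U_A \mid \text{silence}]
```

So a smarter-than-you agent only emits an output $o$ when $\mathbb{E}[U_H \mid \text{read } o] \le \mathbb{E}[U_H \mid \text{silence}]$ — any output it actually produces is non-useful in expectation.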

Makes sense, with the proviso that this is sometimes true only statistically. Like, the AI may choose to write an output which has a 70% chance to hurt you and a 30% chance to (equally) help you, if that is its best option.

If you assume that the AI is smarter than you, and has a good model of you, you should not read the output. But if you accidentally read it, and luckily you react in the right (for you) way, that is a possible result, too. You just cannot and should not rely on being so lucky.

My power-seeking theorems seem a bit like Vingean reflection. In Vingean reflection, you reason about an agent which is significantly smarter than you: if I'm playing chess against an opponent who plays the optimal policy for the chess objective function, then I predict that I'll lose the game. I predict that I'll lose, even though I can't predict my opponent's (optimal) moves - otherwise I'd probably be that good myself.

My power-seeking theorems show that most objectives have optimal policies which e.g. avoid shutdown and survive into the far future, even without saying what particular actions these policies take to get there. I may not even be able to compute a single optimal policy for a single non-trivial objective, but I can still reason about the statistical tendencies of optimal policies.
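As a toy illustration of reasoning about statistical tendencies without inspecting any single optimal policy's moves: in the small deterministic MDP sketched below (my own invented example, not the theorems' formal setting — state 3 is an absorbing "shutdown" state, while states 0–2 remain mutually reachable), we can estimate the fraction of uniformly sampled reward functions whose optimal policy avoids shutdown.

```python
import random

random.seed(0)
GAMMA = 0.99
# Deterministic toy MDP: SUCC[s] lists the states reachable in one step from s.
# State 3 is absorbing ("shutdown"); states 0-2 keep options open.
SUCC = {0: [0, 1, 2, 3], 1: [0, 1], 2: [0, 2], 3: [3]}

def optimal_values(r, iters=600):
    """Value iteration for the optimal state-value function."""
    v = [0.0, 0.0, 0.0, 0.0]
    for _ in range(iters):
        v = [r[s] + GAMMA * max(v[t] for t in SUCC[s]) for s in range(4)]
    return v

n, avoid = 1000, 0
for _ in range(n):
    r = [random.random() for _ in range(4)]  # reward depends only on state
    v = optimal_values(r)
    # The optimal first move from state 0 avoids shutdown iff some
    # non-shutdown successor has higher optimal value than state 3.
    if max(SUCC[0], key=lambda t: v[t]) != 3:
        avoid += 1

frac = avoid / n
print(f"fraction of sampled reward functions avoiding shutdown: {frac:.2f}")
```

With a high discount, avoiding shutdown is optimal roughly whenever max(r0, r1, r2) > r3, so the estimate concentrates near 0.75 — a statistical statement about most objectives, made without computing what any particular optimal policy does along the way.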

An additional consideration for early work on interpretability: it slightly increases the chance we actually get an early warning shot. If a system misbehaves, we can inspect its cognition and (hopefully) find hints of intentional deception. Could motivate thousands of additional researcher-hours being put into alignment.

That's an interesting point.

ARCHES distinguishes between single-agent / single-user and single-agent/multi-user alignment scenarios. Given assumptions like "everyone in society is VNM-rational" and "societal preferences should also follow VNM rationality", and "if everyone wants a thing, society also wants the thing", Harsanyi's utilitarian theorem shows that the societal utility function is a linear non-negative weighted combination of everyone's utilities. So, in a very narrow (and unrealistic) setting, Harsanyi's theorem tells you how the single-multi solution is built from the single-single solutions.
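In symbols, Harsanyi's aggregation theorem yields (under those assumptions) a social utility of the form:

```latex
U_{\text{society}} = \sum_i w_i \, U_i, \qquad w_i \ge 0
```

The weights $w_i$ are not pinned down by the theorem — choosing them is exactly the part the narrow setting leaves open.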

This obviously doesn't actually solve either alignment problem. But, it seems like an interesting parallel for what we might eventually want.

Dylan: There’s one example that I think about, which is, say, you’re cooperating with an AI system playing chess. You start working with that AI system, and you discover that if you listen to its suggestions, 90% of the time, it’s actually suggesting the wrong move or a bad move. Would you call that system value-aligned?

Lucas: No, I would not.

Dylan: I think most people wouldn’t. Now, what if I told you that that program was actually implemented as a search that’s using the correct goal test? It actually turns out that if it’s within 10 steps of a winning play, it always finds that for you, but because of computational limitations, it usually doesn’t. Now, is the system value-aligned? I think it’s a little harder to tell here. What I do find is that when I tell people the story, and I start off with the search algorithm with the correct goal test, they almost always say that that is value-aligned but stupid.

There’s an interesting thing going on here, which is we’re not totally sure what the target we’re shooting for is. You can take this thought experiment and push it further. Suppose you’re doing that search, but, now, say it’s a heuristic search that uses the correct goal test but has an adversarially chosen heuristic function. Would that be a value-aligned system? Again, I’m not sure. If the heuristic was adversarially chosen, I’d say probably not. If the heuristic just happened to be bad, then I’m not sure.

Consider the optimizer/optimized distinction: the AI assistant is better described as optimized to either help or stop you from winning the game. This optimization may or may not have been carried out by a process which is "aligned" with you; I think that ascribing intent alignment to the assistant's creator makes more sense. In terms of the adversarial heuristic case, intent alignment seems unlikely.

But, this also feels like passing the buck – hoping that at some point in history, there existed something to which we are comfortable ascribing alignment and responsibility.

We can imagine aliens building a superintelligent agent which helps them get what they want. This is a special case of aliens inventing tools. What kind of general process should these aliens use – how should they go about designing such an agent?

Assume that these aliens want things in the colloquial sense (not that they’re eg nontrivially VNM EU maximizers) and that a reasonable observer would say they’re closer to being rational than antirational. Then it seems[1] like these aliens eventually steer towards reflectively coherent rationality (provided they don’t blow themselves to hell before they get there): given time, they tend to act to get what they want, and act to become more rational. But, they aren’t fully “rational”, and they want to build a smart thing that helps them. What should they do?

In this situation, it seems like they should build an agent which empowers them & increases their flexible control over the future, since they don’t fully know what they want now. Lots of flexible control means they can better error-correct and preserve value for what they end up believing they actually want. This also protects them from catastrophe and unaligned competitor agents.

1. I don’t know if this is formally and literally always true, I’m just trying to gesture at an intuition about what kind of agentic process these aliens are. ↩︎

Very rough idea

In 2018, I started thinking about corrigibility as "being the kind of agent lots of agents would be happy to have activated". This seems really close to a more ambitious version of what AUP tries to do (not be catastrophic for most agents).

I wonder if you could build an agent that rewrites itself / makes an agent which would tailor the AU landscape towards its creators' interests, under a wide distribution of creator agent goals/rationalities/capabilities. And maybe you then get a kind of generalization, where most simple algorithms which solve this solve ambitious AI alignment in full generality.

The answer to this seems obvious in isolation: shaping helps with credit assignment, rescaling doesn't (and might complicate certain methods in the advantage vs Q-value way). But I feel like maybe there's an important interaction here that could inform a mathematical theory of how a reward signal guides learners through model space?

Reasoning about learned policies via formal theorems on the power-seeking incentives of optimal policies

One way instrumental subgoals might arise in actual learned policies: we train a proto-AGI reinforcement learning agent with a curriculum including a variety of small subtasks. The current theorems show sufficient conditions for power-seeking tending to be optimal in fully-observable environments; many environments meet these sufficient conditions; optimal policies aren't hard to compute for the subtasks. One highly transferable heuristic would therefore be to gain power in new environments, and then figure out what to do for the specific goal at hand. This may or may not take the form of an explicit mesa-objective embedded in e.g. the policy network.

Later, the heuristic has the agent seek power for the "real world" environment.

I prompted GPT-3 with modified versions of Eliezer's Beisutsukai stories, where I modified the "class project" to be about solving intent alignment instead of quantum gravity.

... Taji looked over his sheets. "Okay, I think we've got to assume that every avenue that Eld science was trying is a blind alley, or they would have found it. And if this is possible to do in one month, the answer must be, in some sense, elegant. So no human mistake models. If we start doing anything that looks like we should call it 'utility function patching', we'd better stop."

He turned to Brennan. "Do you have any ideas?"

"I don't think so," she said. She glanced at her paper, frowning. "There are two problems here."

One problem was simple enough: she didn't understand what it meant for an AI to "want" to help a human. This was hard because as far as she could tell, there hadn't been much research done on this question. It seemed to be something that only humans did, but there wasn't any evidence that this was true. What was it about wanting? Was it a biological drive or a social instinct? A combination of both? More importantly, since it was a human concept, how could she measure if an AI behaved the way humans would in certain situations when the AI was obviously not human?

Then there was the question of practicality. The most elegant solution was not always the best. This question—could an AI desire to help its overseer?—was highly suited to explore this problem, but it still wouldn't be a real test. Humans operated on a "helpfulness scale" which started from things like cashiers wishing you "have a nice day" and descending to Nazi concentration camp guards.

She wanted to craft an AI that would at least be a good clerk in today's society. That was hardly an ambitious goal for a team of four highly intelligent teenagers. And yet, and yet...

"We can look at this differently," Hyori said. "People are born with a blank mind. We can shape them however we want. AIs on the other hand, are born with 800 terabytes of behavioral examples from the moment they're activated. The only data we have about unprogrammed AIs is that they either stay still or randomly move around. All this ... it's not making any confident in how easy this will be." [?]

Brennan stopped writing and turned to look at her, frowning. "So what are you saying?"

"I don't want to approach this problem by trying to divert the AI from its goal," she said. "What if, instead of changing the mind of an AI, we instead changed the environment that an AI found itself in?"

The team fell silent.

Styrlyn broke the silence. "Uh..."

"What I mean is," she said, "what if, instead of trying to divert the AI from one task, we created a situation where accomplishing two tasks would be more beneficial than accomplishing just one? We don't need to patch new programs into the mind of an AI to make it want to help us. We can literally make helping us the most logical decision for it."

Transparency Q: how hard would it be to ensure a neural network doesn't learn any explicit NANDs?

An AGI's early learned values will steer its future training and play a huge part in determining its eventual stable values. I think most of the ball game is in ensuring the agent has good values by the time it's smart, because that's when it'll start being reflectively stable. Therefore, we can iterate on important parts of alignment, because the most important parts come relatively early in the training run, and early corresponds to "parts of the AI value formation process which we can test before we hit AGI, without training one all the way out."

I think this, in theory, cuts away a substantial amount of the "But we only get one shot" problem. In practice, maybe OpenMind just YOLOs ahead anyways and we only get a few years in the appropriate and informative regime. But this suggests several kinds of experiments to start running now, like "get a Minecraft agent which robustly cares about chickens", because that tells us about how to map outer signals into inner values.

ensuring the agent has good values by the time it's smart, because that's when it'll start being reflectively stable

Which means that the destination where it's heading stops uncontrollably changing, but nobody at that point (including the agent) has the slightest idea what it looks like, and it won't get close for a long time. Also, the destination (preference/goal/values) would generally depend on the environment (it ends up being different if details of the world outside the AGI are different). So many cartesian assumptions fail, distinguishing this situation from a classical agent with goals, where goals are at least contained within the agent, and probably also don't depend on its state of knowledge.

we can iterate on important parts of alignment, because the most important parts come relatively early in the training run

I think this is true for important alignment properties, including things that act like values early on, but not for the values/preferences that are reflectively stable in a strong sense. If it's possible to inspect/understand/interpret the content of preference that is reflectively stable, then what you've built is a mature optimizer with tractable goals, which is always misaligned. It's a thing like a paperclip maximizer, demonstrating the orthogonality thesis, even if it's tiling the future with something superficially human-related.

That is, it makes sense to iterate on the parts of alignment that can be inspected, but the reflectively stable values are not such a part, unless the AI is catastrophically misaligned. The fact that reflectively stable values are the same as those of humanity might be such a part, but it's this fact of sameness that might admit inspection, not the values themselves.

Which means that the destination

I disagree with CEV as I recall it, but this could change after rereading it. I would be surprised if I end up thinking that EY had "gotten it right." The important thing to consider is not "what has someone speculated a good destination-description would be", but "what do the actual mechanics look like for getting there?". In this case, the part of you which likes dogs is helping steer your future training and experiences, and so the simple answer is that it's more likely than not that your stable values like dogs too.

Which means that the destination where it's heading stops uncontrollably changing, but nobody at that point (including the agent) has the slightest idea what it looks like, and it won't get close for a long time.

This reasoning seems to prove too much. Your argument seems to imply that I cannot have "the slightest idea" whether my stable values would include killing people for no reason, or not.

This reasoning seems to prove too much.

It does add up to normality, it's not proving things about current behavior or current-goal content of near-future AGIs. An unknown normative target doesn't say not to do the things you normally do, it's more of a "I beseech you, in the bowels of Christ, to think it possible you may be mistaken" thing.

The salient catastrophic alignment failure here is to make AGIs with stable values that capture some variation on current unstable human values, and won't allow their further development. If the normative target is very far from current unstable human values, making current values stable falls very short of the normative target and makes the future relatively worthless.

That's the kind of thing my point is intended to nontrivially claim, that AGIs with any stable immediately-actionable goals that can be specified in the following physical-time decades or even centuries are almost certainly catastrophically misaligned. So AGIs must have unstable goals, softly optimized-for, aligned to current (or value-laden predicted future) human unstable goals, mindful of goodhart.

I disagree with CEV as I recall it

The kind of CEV I mean is not very specific, it's more of a (sketch of a solution to the) problem of doing a first pass on preparing to define goals for an actual optimizer, one that doesn't need to worry as much about goodhart and so can make more efficient use of the future at scale, before expansion of the universe makes more stuff unreachable.

So when I say "CEV" I mostly just mean "normative alignment target", with some implied clarifications on what kind of thing it might be.

it's more likely than not that your stable values like dogs too

That's a very status quo anchored thing. I don't think dog-liking is a feature of values stable under reflection if the environment is allowed to change completely, even if in the current environment dogs are salient. Stable values are about the whole world, with all its AGI-imagined femtotech-rewritten possibilities. This world includes dogs in some tiny corner of it, but I don't see how observations of current attitudes hold much hope in offering clues about legible features of stable values. It is much too early to tell what stable values could possibly be. That's why CEV, or rather the normative alignment target, as a general concept that doesn't particularly anchor to the details Yudkowsky talked about, but referring to stable goals in this very wide class of environments, seems to me crucially important to keep distinct from current human values.

Another point is that attempting to ask what current values even say about very unusual environments doesn't work, it's so far from the training distributions that any response is pure noise. Current concepts are not useful for talking about features of sufficiently unusual environments, you'd need new concepts specialized for those environments. (Compare with asking what CEV says about currently familiar environments.)

And so there is this sandbox of familiar environments that any near-term activity must remain within on pain of goodhart-cursing outcomes that step outside of it, because there is no accurate knowledge of utility in environments outside of it. The project of developing values beyond the borders of currently comprehensible environments is also a task of volition extrapolation, extending the goodhart boundary in desirable directions by pushing on it from the inside (with reflection on values, not with optimization based on bad approximations of values).

Are there any alignment techniques which would benefit from the supervisor having a severed corpus callosum, or otherwise atypical neuroanatomy? Usually theoretical alignment discourse focuses on the supervisor's competence / intelligence. Perhaps there are other, more niche considerations.

I'd like to see research exploring the relevance of intragenomic conflict to AI alignment research. Intragenomic conflict constitutes an in-the-wild example of misalignment, where conflict arises "within an agent" even though the agent's genes have strong instrumental incentives to work together (they share the same body).

In an interesting parallel to John Wentworth's Fixing the Good Regulator Theorem, I have an MDP result that says:

Suppose we're playing a game where I give you a reward function and you give me its optimal value function in the MDP. If you let me do this for |S| reward functions (one for each state in the environment), and you're able to provide the optimal value function for each, then you know enough to reconstruct the entire environment (up to isomorphism).

Roughly: being able to complete linearly many tasks in the state space means you have enough information to model the entire environment.
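A sketch of why this much information could suffice (my gloss, not the actual proof): one natural choice of tasks is the indicator reward functions $e_s$, one per state $s$. Each corresponding optimal value function satisfies the Bellman optimality equation

```latex
V^*_{e_s}(x) = e_s(x) + \gamma \max_{a \in \mathcal{A}} \sum_{x'} T(x' \mid x, a)\, V^*_{e_s}(x')
```

so the family $\{V^*_{e_s}\}_{s \in S}$ imposes $|S|$ constraints at every state — in total, enough to pin down the transition structure $T$ up to isomorphism.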