Recommended Sequences

AGI safety from first principles
Embedded Agency
2022 MIRI Alignment Discussion

Recent Discussion

Definition. On how I use words, values are decision-influences (also known as shards). “I value doing well at school” is a short sentence for “in a range of contexts, there exists an influence on my decision-making which upweights actions and plans that lead to e.g. learning and good grades and honor among my classmates.” 

Summaries of key points:

  1. Nonrobust decision-influences can be OK. A candy-shard contextually influences decision-making. Many policies lead to acquiring lots of candy; the decision-influences don't have to be "globally robust" or "perfect."
  2. Values steer optimization; they are not optimized against. The value shards aren't getting optimized hard. The value shards are the things which optimize hard, by wielding the rest of the agent's cognition (e.g. the world model, the general-purpose planning API). 

    Since values are not the
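The "values steer optimization, they are not optimized against" point can be sketched in code. This is my own toy illustration, not code from the post: each shard is a contextually activated bid on actions, and the agent argmaxes over summed bids. Nothing in the loop optimizes the shards themselves; they are the sources of the optimization pressure.

```python
# Toy sketch (illustrative assumption, not the post's implementation):
# a "shard" is a contextually activated decision-influence that bids on
# actions. The agent picks the action with the highest total bid; the
# shards steer the choice but are never themselves an optimization target.

def candy_shard(context, action):
    """Upweights candy-acquiring plans, but only in relevant contexts."""
    if context == "at_store" and action == "buy_candy":
        return 2.0
    return 0.0

def schoolwork_shard(context, action):
    """Upweights plans that lead to learning and good grades."""
    if context == "at_home" and action == "do_homework":
        return 3.0
    return 0.0

SHARDS = [candy_shard, schoolwork_shard]

def choose(context, actions):
    # Values steer the argmax over actions; no search over shard outputs.
    return max(actions, key=lambda a: sum(s(context, a) for s in SHARDS))

print(choose("at_store", ["buy_candy", "do_homework", "wander"]))  # buy_candy
print(choose("at_home", ["buy_candy", "do_homework", "wander"]))   # do_homework
```

Note the contextuality: the candy-shard is silent at home, so it doesn't need to be "globally robust" for the overall policy to behave sensibly there.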
This is a great article! It helps me understand shard theory better and value it more; in particular, it relates to something I've been thinking about, where people seem to conflate utility-optimizing agents with policy-executing agents. The two have meaningfully different alignment characteristics, and shard theory seems to be deeply exploring the latter, which is 👍. That is to say, prior to "simulators" and "shard theory", a lot of focus was on utility-maximizers--agents that do things like planning or search to maximize a utility function; but planning, although instrumentally useful, is not strictly necessary for many intelligent behaviors. So we are seeing more focus on e.g. agents that enact learned policies in RL, which do not explicitly maximize reward in deployment but instead enact policies that did so in training. 💯

From my perspective, this post convincingly argues that one route to alignment involves splitting the problem into two still-difficult sub-problems (but actually easier ones, unlike inner- and outer-alignment, as you've said elsewhere): identifying a good shard structure, and training an AI to have such a shard structure. One point is that the structure is inherently somewhat robust (and that therefore each individual shard need not be), making it a much larger target.

I have two objections:

* I don't buy the implied "naturally-robust" claim. Suppose you've solved the optimizer's curse, wireheading via self-generated adversarial inputs, etc.; the policy induced by the shard structure is still sensitive to the details. Unless you're hiding specific robust structures in your back pocket, I have no way of knowing that increasing the candy-shard's value won't cause a phase shift that substantially increases the perceived value of the "kill all humans, take their candy" action plan. I ultimately care about the agent's "reveal...
Alex Turner (1d)
FYI, I do expect planning for smart agents, just not something qualitatively alignment-similar to "argmax over crisp human-specified utility function." (In the language of the OP, I expect values-executors, not grader-optimizers.)

I'm not either. I think there will be phase changes wrt "shard strengths" (keeping in mind this is a leaky abstraction), and this is a key source of danger IMO. Basically my stance is: "yeah, there are going to be phase changes, but there are also many perturbations which don't induce phase changes, and I really want to understand which is which."
Alex Turner (1d)
Thanks for leaving this comment; I somehow only just now saw it.

I want to make a use/mention distinction. Consider an analogous argument: "Given gradient descent's pseudocode, it seems like the only point of backward is to produce parameter modifications that lead to low outputs of loss_fn. Gradient descent selects, over all directional derivatives, the gradient, which is the direction of maximal loss reduction. Why is that not 'optimizing the outputs of the loss function as gradient descent's main terminal motivation'?"[1] Locally reducing the loss is indeed an important part of the learning dynamics of gradient descent, but this (I claim) has very different properties than "randomly sample from all global minima in the loss landscape" (analogously: "randomly sample a plan which globally maximizes grader output").

But I still haven't answered your broader question. I think you're asking for a very reasonable definition which I have not yet given, in part because I've remained somewhat confused about the exact grader/non-grader-optimizer distinction I want to draw. At least, intensionally. (Which is why I've focused on giving examples, in the hope of getting the vibe across.) I gave it a few more stabs, and I don't think any of them ended up being sufficient. But here they are anyway:

1. A "grader-optimizer" makes decisions primarily on the basis of the outputs of some evaluative submodule, which may or may not be explicitly internally implemented. The decision-making is oriented towards making those outputs come out as high as possible.
2. In other words, the evaluative "grader" submodule is optimized against by the planning.
3. I.e., the process plans over "what would the grader say about this outcome/plan", instead of just using the grader to bid the plan up or down.

I wish I had a better intensional definition for you, but that's what I wrote immediately, and I really had better get through the rest of my comment backlog from last week. Here are...
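The use/mention point in the gradient-descent analogy can be made concrete with a few lines of code. This is my own minimal sketch, not something from the comment: each step moves locally along the negative gradient, which reduces the loss, but at no point does the procedure search over or sample from the set of global minimizers.

```python
# Minimal sketch of "locally reducing the loss" vs. "sampling a global
# minimum" (my illustration): gradient descent only ever takes local
# steepest-descent steps; the global minimum is reached as a side effect
# of the dynamics, not because it was ever explicitly represented.

def loss(x):
    return (x - 3.0) ** 2

def grad(x):
    return 2.0 * (x - 3.0)

x = 10.0                    # arbitrary initialization
for _ in range(100):
    x -= 0.1 * grad(x)      # local step; no global search over minima

print(round(x, 3))          # converges toward 3.0
```

Contrast this with a procedure that enumerates the argmin set of `loss` and samples from it: on a landscape with many global minima those two procedures select very different solutions, which is the disanalogy being drawn with "randomly sample a plan which globally maximizes grader output".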

Overall disagreement:

I've remained somewhat confused about the exact grader/non-grader-optimizer distinction I want to draw. At least, intensionally. (which is why I've focused on giving examples, in the hope of getting the vibe across.)

Yeah, I think I have at least some sense of how this works in the kinds of examples you usually discuss (though my sense is that it's well captured by the "grader is complicit" point in my previous comment, which you presumably disagree with).

But I don't see how to extend the extensional definition far enough to get to the ... (read more)

Alternate title: "Somewhat Contra Scott On Simulators".

Scott Alexander has a recent post up on large language models as simulators.

I generally agree with Part I of the post, which advocates thinking about LLMs as simulators that can emulate a variety of language-producing "characters" (with imperfect accuracy). And I also agree with Part II, which applies this model to RLHF'd models whose "character" is a friendly chatbot assistant.

(But see caveats about the simulator framing from Beth Barnes here.)

These ideas have been around for a bit, and Scott gives credit where it's due; I think his exposition is clear and fun.

In Part III, where he discusses alignment implications, I think he misses the mark a bit. In particular, simulators and characters each have outer and inner alignment problems. The inner...

Interesting, I think this clarifies things, but the framing also isn't quite as neat as I'd like.

I'd be tempted to redefine/reframe this as follows:

• Outer alignment for a simulator - Perfectly defining what it means to simulate a character. For example, how can we create a specification language that lets us pick out the character we want? And what do we do with counterfactuals, given that they aren't actually literal?

• Inner alignment for a simulator - Training a simulator to perfectly simulate the assigned character

• Outer alignment for characters - fi... (read more)

Thane Ruthenis (21h)
Broadly agreed. I'd written a similar analysis of the issue before, where I also take into account path dynamics (i.e., how and why we actually get to Azazel from a random initialization). But that post is a bit outdated. My current best argument for it goes as follows:

This will also be posted on the EA Forum, and included in a sequence containing some previous posts and other posts I'll publish this year.


Humans think critically about values and, to a certain extent, they also act according to their values. To the average human, the difference between increasing world happiness and increasing world suffering is huge and evident, while goals such as collecting coins and collecting stamps are roughly on the same level.

It would be nice to make these differences as obvious to an AI as they are to us. Even though exactly copying what happens in the human mind is probably not the best strategy for designing an AI that understands ethics, having an idea of how values work in humans is a good starting point.

So, how...

Originally posted on the EA Forum for the Criticism and Red Teaming Contest. Will be included in a sequence containing some previous posts and other posts I'll publish this year.

0. Summary

AI alignment research centred around the control problem works well for futures shaped by out-of-control misaligned AI, but not that well for futures shaped by bad actors using AI. Section 1 contains a step-by-step argument for that claim. In section 2 I propose an alternative which aims at moral progress instead of direct risk reduction, and I reply to some objections. Technical details about the alternative will follow in section 3, at some point in the future.

The appendix clarifies some minor ambiguities with terminology and links to other stuff.

1. Criticism of the main framework in AI


Produced as part of the SERI ML Alignment Theory Scholars Program Winter 2022 Cohort.

Not all global minima of the (training) loss landscape are created equal.

Even if they achieve equal performance on the training set, different solutions can perform very differently on the test set or out-of-distribution. So why is it that we typically find "simple" solutions that generalize well?

In a previous post, I argued that the answer is "singularities" — minimum loss points with ill-defined tangents. It's the "nastiest" singularities that have the most outsized effect on learning and generalization in the limit of large data. These act as implicit regularizers that lower the effective dimensionality of the model.

Singularities in the loss landscape reduce the effective dimensionality of your model, which selects for models that generalize better.
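A toy example may help make "singularities reduce effective dimensionality" concrete. This is my own illustration, not from the post: the loss L(a, b) = (a·b)² has its minimum set on the two coordinate axes, which cross at the origin. There the tangent is ill-defined, and the Hessian rank (a proxy for how many directions the loss actually constrains, i.e. the effective dimensionality) drops further than at a generic minimum.

```python
# Toy singular loss landscape (illustrative assumption, not the post's
# model): L(a, b) = (a*b)^2. Minima lie on both axes; at a generic
# minimum one direction is flat, and at the crossing point (the "nastier"
# singularity) the Hessian vanishes entirely.

def hessian(a, b):
    # Analytic Hessian of L = a^2 * b^2:
    # dL/da = 2*a*b^2, dL/db = 2*a^2*b
    return [[2 * b * b, 4 * a * b],
            [4 * a * b, 2 * a * a]]

def rank2x2(m, eps=1e-12):
    """Rank of a 2x2 matrix: how many directions the loss constrains."""
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    if abs(det) > eps:
        return 2
    if any(abs(x) > eps for row in m for x in row):
        return 1
    return 0

# Generic minimum on the b = 0 axis: one flat direction.
print(rank2x2(hessian(2.0, 0.0)))  # 1
# The crossing point: the Hessian degenerates completely.
print(rank2x2(hessian(0.0, 0.0)))  # 0
```

The lower-rank point imposes fewer constraints per parameter, which is the sense in which the "nastiest" singularities act as implicit regularizers on effective dimensionality.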

Even after writing...

Epistemic status: Highly speculative hypothesis generation.

I had a similar, but slightly different, picture. My picture was a tentacle-covered blob.

When doing gradient descent from a random point, you usually end up in a tentacle (i.e. the red trajectory).

When the system has been running for a while, it is randomly sampling from the black region. Most of the volume is in the blob. (The randomly wandering particle can easily find its way from a tentacle to the blob, but the chance of it randomly finding the end of a tentacle from within the blob is smal... (read more)

Thanks to Ian McKenzie and Nicholas Dupuis, collaborators on a related project, for contributing to the ideas and experiments discussed in this post. Ian performed some of the random number experiments.

Also thanks to Connor Leahy for feedback on a draft, and thanks to Evan Hubinger, Connor Leahy, Beren Millidge, Ethan Perez, Tomek Korbak, Garrett Baker, Leo Gao and various others at Conjecture, Anthropic, and OpenAI for useful discussions.

This work was carried out while at Conjecture.

Important correction

I have received evidence from multiple credible sources that text-davinci-002 was not trained with RLHF.

The rest of this post has not been corrected to reflect this update. Not much besides the title (formerly "Mysteries of mode collapse due to RLHF") is affected: just mentally substitute "mystery method" every time "RLHF" is invoked...

A GPT-3 mode-collapse example I can't believe I forgot: writing rhyming poetry! I and a number of other people were excited by ChatGPT on launch seeming able to do long stretches of flawless rhyming poetry in couplets or quatrains, where the rhyming words were not the hackneyed common pairs of the sort you might see in the lyrics of pop songs. Hilarious, but extremely surprising. (davinci-002 did a little bit of this, but not convincingly the way ChatGPT overnight did.*)

Leike on Twitter denied any knowledge of rhyming suddenly working, and especially denied that anything special like adding rhyming dictionaries or IPA-re-encoding text had been done, or that GPT-3 had switched tokenizations on the backend. So, had there been some sort of emergence, or a 'miracle of spelling'?

After playing around with it for a while, my conclusion was: 'no'. ChatGPT does rhyming poetry in only one way, and it is difficult to make it try any other kind of poetry, even with explicit instructions and examples and doing continuations. It doesn't understand novel rhymes or puns if you quiz it, and its explanations of them remain as highly varied and incorrect as the original davinci model's pun explanations were. This is not what any kind of fixed phonetic understanding or genuine rhyming ability would look like.

My conclusion was essentially 'mode collapse': presumably some poetry examples made it into the training datasets (from my experiments, if nothing else), and because it's easy for any literate Anglophone to judge rhyming, while non-rhyming poetry is a lot harder to judge (and generally despised by most people, which is why the prestige & popularity of Western poetry over the past century has collapsed to a degree few people appreciate), it'd be logical for the raters to highly prefer rh...
Owain Evans (16h)
OpenAI had generated poems in the New Yorker, which suggests they might have had some internal project related to poetry. With GPT-3.5, I think there's also "mode collapse" for style in writing prose (e.g. plays or stories). Claude does not have this mode collapse in poetry or prose (it may have a much more subtle version of it). This suggests to me it'd be relatively easy to fix ChatGPT's issues (as Gwern suggests). Does anyone know how much poetry and literary prose is in the pre-training sets, aside from stuff in Common Crawl?

OpenAI had generated poems in the New Yorker, which suggests they might have had some internal project related to poetry.

I didn't get that impression when I read it - the NYer author and his friends prompted most of that, even if their friend Dan Selsam happens to work at OpenAI. (He seems to work on math LMs, nothing fiction- or RL-related.) They were set up with the public Playground interface, so the OA insider role here was limited to showing them a few completions and trying to explain it; presumably they did the rest more remote and parti... (read more)
