This is a special post for quick takes by Alex Turner. Only they can create top-level comments. Comments here also appear on the Quick Takes page and All Posts page.

In an alternate universe, someone wrote a counterpart to There's No Fire Alarm for Artificial General Intelligence:

Okay, let’s be blunt here. I don’t think most of the discourse about alignment being really hard is being generated by models of machine learning at all. I don’t think we’re looking at wrong models; I think we’re looking at no models.

I was once at a conference where there was a panel full of famous AI alignment luminaries, and most of the luminaries were nodding and agreeing with each other that of course AGI alignment is really hard and unaddressed by modern alignment research, except for two famous AI luminaries who stayed quiet and let others take the microphone.

I got up in Q&A and said, “Okay, you’ve all told us that alignment is hard. But let’s be more concrete and specific. I’d like to know what’s the least impressive task which cannot be done by a 'non-agentic' system, that you are very confident cannot be done safely and non-agentically in the next two years.”

There was a silence.

Eventually, one person ventured a reply, spoken in a rather more tentative tone than they’d been using to pronounce that SGD would internalize coherent goals into language models. T

... (read more)
  Sorry, I might be misunderstanding you (and hope I am), but... I think doomers literally say "Nobody knows what internal motivational structures SGD will entrain into scaled-up networks and thus we are all doomed". The problem is not having the science to confidently say how the AIs will turn out, and not that doomers have a secret method to know that next-token-prediction is evil. If you meant that doomers are too confident answering the question "will SGD even make motivational structures?" their (and my) answer still stems from ignorance: nobody knows, but it is plausible that SGD will make motivational structures in the neural networks because it can be useful in many tasks (to get low loss or whatever), and if you think you do know better you should show it experimentally and theoretically in excruciating detail.   I also don't see how it logically follows that "If your model has the extraordinary power to say what internal motivational structures SGD will entrain into scaled-up networks" => "then you ought to be able to say much weaker things that are impossible in two years" but it seems to be the core of the post. Even if anyone had the extraordinary model to predict what SGD exactly does (which we, as a species, should really strive for!!) it would still be a different question to predict what will or won't happen in the next two years. If I reason about my field (physics) the same should hold for a sentence structured like "If your model has the extraordinary power to say how an array of neutral atoms cooled to a few nK will behave when a laser is shone upon them" (which is true) => "then you ought to be able to say much weaker things that are impossible in two years in the field of cold atom physics" (which is... not true). It's a non sequitur.
2Alex Turner2mo
It would be "useful" (i.e. fitness-increasing) for wolves to have evolved biological sniper rifles, but they did not. By what evidence are we locating these motivational hypotheses, and what kinds of structures are dangerous, and why are they plausible under the NN prior?  The relevant commonality is "ability to predict the future alignment properties and internal mechanisms of neural networks." (Also, I don't exactly endorse everything in this fake quotation, so indeed the analogized tasks aren't as close as I'd like. I had to trade off between "what I actually believe" and "making minimal edits to the source material.")
3Daniel Kokotajlo22d
Nice analogy! I approve of stuff like this. And in particular I agree that MIRI hasn't convincingly argued that we can't do significant good stuff (including maybe automating tons of alignment research) without agents. Insofar as your point is that we don't have to build agentic systems and nonagentic systems aren't dangerous, I agree? If we could coordinate the world to avoid building agentic systems I'd feel a lot better.  

Deceptive alignment seems to only be supported by flimsy arguments. I recently realized that I don't have good reason to believe that continuing to scale up LLMs will lead to inner consequentialist cognition to pursue a goal which is roughly consistent across situations. That is: a model which not only does what you ask it to do (including coming up with agentic plans), but also thinks about how to make more paperclips even while you're just asking about math homework.

Aside: This was kinda a "holy shit" moment, and I'll try to do it justice here. I encourage the reader to do a serious dependency check on their beliefs. What do you think you know about deceptive alignment being plausible, and why do you think you know it? Where did your beliefs truly come from, and do those observations truly provide 

I agree that conditional on entraining consequentialist cognition which has a "different goal" (as thought of by MIRI; this isn't a frame I use), the AI will probably instrumentally reason about whether and how to deceptively pursue its own goals, to our detr... (read more)


I think deceptive alignment is still reasonably likely despite evidence from LLMs.

I agree with:

  • LLMs are not deceptively aligned and don't really have inner goals in the sense that is scary
  • LLMs memorize a bunch of stuff
  • the kinds of reasoning that feed into deceptive alignment do not predict LLM behavior well
  • Adam on transformers does not have a super strong simplicity bias
  • without deceptive alignment, AI risk is a lot lower
  • LLMs not being deceptively aligned provides nonzero evidence against deceptive alignment (by conservation of evidence)

I predict I could pass the ITT for why LLMs are evidence that deceptive alignment is not likely.

however, I also note the following: LLMs are kind of bad at generalizing, and this makes them pretty bad at doing e.g. novel research, or long-horizon tasks. deceptive alignment conditions on models already being better at generalization and reasoning than current models.

my current hypothesis is that future models which generalize in a way closer to that predicted by mesaoptimization will also be better described as having a simplicity bias.

I think this and other hypotheses can potentially be tested empirically today, rather than only being distinguishable close to AGI.

4Alex Turner2mo
Note that "LLMs are evidence against this hypothesis" isn't my main point here. The main claim is that the positive arguments for deceptive alignment are flimsy, and thus the prior is very low.
1Robert Kirk1mo
How would you imagine doing this? I understand your hypothesis to be "If a model generalises as if it's a mesa-optimiser, then it's better-described as having simplicity bias". Are you imagining training systems that are mesa-optimisers (perhaps explicitly using some kind of model-based RL/inference-time planning and search/MCTS), and then trying to see if they tend to learn simple cross-episode inner goals which would be implied by a stronger simplicity bias?

I find myself unsure which conclusion this is trying to argue for.

Here are some pretty different conclusions:

  • Deceptive alignment is <<1% likely (quite implausible) to be a problem prior to complete human obsolescence (maybe it's a problem after human obsolescence for our trusted AI successors, but who cares).
  • There aren't any solid arguments for deceptive alignment[1]. So, we certainly shouldn't be confident in deceptive alignment (e.g. >90%), though we can't totally rule it out (prior to human obsolescence). Perhaps deceptive alignment is 15% likely to be a serious problem overall and maybe 10% likely to be a serious problem if we condition on fully obsoleting humanity via just scaling up LLM agents or similar (this is pretty close to what I think overall).
  • Deceptive alignment is <<1% likely for scaled up LLM agents (prior to human obsolescence). Who knows about other architectures.

There is a big difference between <<1% likely and 10% likely. I basically agree with "not much reason to expect deceptive alignment even in models which are behaviorally capable of implementing deceptive alignment", but I don't think this leaves me in a <<1% likely epistemic ... (read more)

Closest to the third, but I'd put it somewhere between .1% and 5%. I think 15% is way too high for some loose speculation about inductive biases, relative to the specificity of the predictions themselves.

Without deceptive alignment/agentic AI opposition, a lot of alignment threat models ring hollow. No more adversarial steganography or adversarial pressure on your grading scheme or worst-case analysis or unobservable, nearly unfalsifiable inner homunculi whose goals have to be perfected.

Instead, we enter the realm of tool AI which basically does what you say.

I agree that, conditional on no deceptive alignment, the most pernicious and least tractable sources of doom go away. 

However, I disagree that conditional on no deceptive alignment, AI "basically does what you say." Indeed, the majority of my P(doom) comes from the difference between "looks good to human evaluators" and "is actually what the human evaluators wanted." Concretely, this could play out with models which manipulate their users into thinking everything is going well and which tamper with sensors.

I think current observations don't provide much evidence about whether these concerns will pan out: with current models and training set-ups, "looks good to evaluators" almost always coincides with "is what evaluators wanted." I worry that we'll only see this distinction matter once models are smart enough that they could... (read more)

There are some subskills to having consistent goals that I think will be selected for, at least when outcome-based RL starts working to get models to do long-horizon tasks. For example, the ability to not be distracted/nerdsniped into some different behavior by most stimuli while doing a task. The longer the horizon, the more selection-- if you have to do a 10,000 step coding project, then the probability you get irrecoverably distracted on one step has to be below 1/10,000.
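The long-horizon arithmetic above can be made concrete with a minimal sketch. It assumes (this is an added simplification, not something claimed in the comment) that each step has the same, independent probability of an unrecoverable distraction; the function names are hypothetical.

```python
def task_success_probability(num_steps: int, per_step_failure: float) -> float:
    """Chance of finishing every step, assuming independent, identical steps."""
    return (1.0 - per_step_failure) ** num_steps


def max_per_step_failure(num_steps: int, target_success: float) -> float:
    """Largest per-step failure rate p with (1 - p) ** num_steps >= target_success."""
    return 1.0 - target_success ** (1.0 / num_steps)


if __name__ == "__main__":
    steps = 10_000
    # A per-step failure rate of exactly 1/10,000 still loses most runs.
    print(task_success_probability(steps, 1 / steps))
    # Per-step failure rate needed for a 90% chance of finishing the project.
    print(max_per_step_failure(steps, 0.90))
```

Under these assumptions, a per-step failure rate of exactly 1/10,000 completes the project only a bit more than a third of the time, so reliable completion needs a rate well below 1/10,000, which is the direction of the selection pressure described above.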

I expect some pretty sophisticated goal-regulation circuitry to develop as models get more capable, because humans need it, and this makes me pretty scared.

I contest that there's very little reason to expect "undesired, covert, and consistent-across-situations inner goals" to crop up in [LLMs as trained today] to begin with

As someone who considers deceptive alignment a concern: fully agree. (With the caveat, of course, that it's because I don't expect LLMs to scale to AGI.)

I think there's in general a lot of speaking-past-each-other in alignment, and what precisely people mean by "problem X will appear if we continue advancing/scaling" is one example.

Like, of course a new problem won't appear if we just keep doing the exact same thing that we've already been doing. Except "the exact same thing" is actually some equivalence class of approaches/architectures/training processes, but which equivalence class people mean can differ.

For example:

  • Person A, who's worried about deceptive alignment, can have "scaling LLMs arbitrarily far" defined as this proven-safe equivalence class of architectures. So when they say they're worried about capability advancement bringing in new problems, what they mean is "if we move beyond the LLM paradigm, deceptive alignment may appear".
  • Person B, hearing the first one, might model them as instead defining "LLMs
... (read more)
1Vladimir Nesov3mo
LLMs will soon scale beyond the available natural text data, and generation of synthetic data is some sort of change of architecture, potentially a completely different source of capabilities. So scaling LLMs without change of architecture much further is an expectation about something counterfactual. It makes sense as a matter of theory, but it's not relevant for forecasting. Edit 15 Dec: No longer endorsed based on scaling laws for training on repeated data.

Bold claim. Want to make any concrete predictions so that I can register my different beliefs? 

I've now changed my mind based on

The main result is that up to 4 repetitions are about as good as unique data, and for up to about 16 repetitions there is still meaningful improvement. Let's take 50T tokens as an estimate for available text data (as an anchor, there's a filtered and deduplicated CommonCrawl dataset RedPajama-Data-v2 with 30T tokens). Repeated 4 times, it can make good use of 1e28 FLOPs (with a dense transformer), and repeated 16 times, suboptimal but meaningful use of 2e29 FLOPs. So this is close but not lower than what can be put to use within a few years. Thanks for pushing back on the original claim.
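The FLOP figures above can be roughly reproduced with a back-of-the-envelope sketch. The two constants are assumptions added here, not stated in the comment: training compute C ≈ 6·N·D for a dense transformer, and a Chinchilla-style ratio of about 20 tokens per parameter. The sketch also treats repeated tokens as if they were unique, which the comment says is approximately right up to 4 repetitions and increasingly optimistic by 16.

```python
TOKENS_PER_PARAM = 20.0       # assumed Chinchilla-style rule of thumb: D ~ 20 * N
FLOPS_PER_PARAM_TOKEN = 6.0   # assumed dense-transformer cost: C ~ 6 * N * D


def compute_for_tokens(tokens: float) -> float:
    """Compute-optimal training FLOPs for a given number of training tokens."""
    params = tokens / TOKENS_PER_PARAM
    return FLOPS_PER_PARAM_TOKEN * params * tokens


UNIQUE_TOKENS = 50e12  # the 50T estimate of available text from the comment

for repetitions in (1, 4, 16):
    total_tokens = UNIQUE_TOKENS * repetitions
    flops = compute_for_tokens(total_tokens)
    print(f"{repetitions:>2} repetitions: {total_tokens:.1e} tokens -> {flops:.1e} FLOPs")
```

With these assumed constants, 4 repetitions come out near 1e28 FLOPs and 16 repetitions near 2e29 FLOPs, matching the orders of magnitude quoted above; different fitted constants would shift the numbers somewhat.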

1Vladimir Nesov3mo
Three points: how much compute is going into a training run, how much natural text data it wants, and how much data is available. For training compute, there are claims of multi-billion dollar runs being plausible and possibly planned in 2-5 years. Eyeballing various trends and GPU shipping numbers and revenues, it looks like about 3 OOMs of compute scaling is possible before industrial capacity constrains the trend and the scaling slows down. This assumes that there are no overly dramatic profits from AI (which might lead to finding ways of scaling supply chains faster than usual), and no overly dramatic lack of new capabilities with further scaling (which would slow down investment in scaling). That gives about 1e28-1e29 FLOPs at the slowdown in 4-6 years. At 1e28 FLOPs, Chinchilla scaling asks for 200T-250T tokens. Various sparsity techniques increase effective compute, asking for even more tokens (when optimizing loss given fixed hardware compute). Edit 15 Dec: I no longer endorse this point, based on scaling laws for training on repeated data. On the outside, there are 20M-150M accessible books, some text from video, and 1T web pages of extremely dubious uniqueness and quality. That might give about 100T tokens, if LLMs are used to curate? There's some discussion (incl. comments) here, this is the figure I'm most uncertain about. In practice, absent good synthetic data, I expect multimodality to fill the gap, but that's not going to be as useful as good text for improving chatbot competence. (Possibly the issue with the original claim in the grandparent is what I meant by "soon".)
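The token demand in the comment above can be approximated by inverting the usual compute rule of thumb. This is a rough sketch under assumptions that are not from the comment: C ≈ 6·N·D and D ≈ 20·N, so D ≈ sqrt(20·C/6). The supply figure is the comment's own loose "about 100T" estimate.

```python
import math

TOKENS_PER_PARAM = 20.0       # assumed Chinchilla-style rule of thumb: D ~ 20 * N
FLOPS_PER_PARAM_TOKEN = 6.0   # assumed dense-transformer cost: C ~ 6 * N * D


def tokens_for_compute(flops: float) -> float:
    """Compute-optimal token count for a training budget in FLOPs.

    From C = 6 * N * D and D = 20 * N we get C = (6 / 20) * D ** 2.
    """
    return math.sqrt(flops * TOKENS_PER_PARAM / FLOPS_PER_PARAM_TOKEN)


AVAILABLE_TOKENS = 100e12  # the comment's rough estimate of curated text

for flops in (1e28, 1e29):
    wanted = tokens_for_compute(flops)
    print(f"{flops:.0e} FLOPs: ~{wanted / 1e12:.0f}T tokens wanted, "
          f"{wanted / AVAILABLE_TOKENS:.1f}x the assumed supply")
```

With these constants the 1e28 FLOPs case lands a little under 200T tokens, in the same ballpark as the 200T-250T quoted above; the exact figure depends on which fitted scaling constants are used.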
2Daniel Kokotajlo22d
I wish I had read this a week ago instead of just now, it would have saved a significant amount of confusion and miscommunication!
2Roger Dearnaley1mo
I think there are two separate questions here, with possibly (and I suspect actually) very different answers: 1. How likely is deceptive alignment to arise in an LLM under SGD across a large, very diverse pretraining set (such as a slice of the internet)? 2. How likely is deceptive alignment to be boosted in an LLM under SGD fine-tuning followed by RL for HHH-behavior applied to a base model trained by 1.? I think the obvious answer to 1. is that the LLM is going to attempt (limited by its available capacity and training set) to develop world models of everything that humans do that affects the contents of the Internet. One of the many things that humans do is pretend to be more aligned to the wishes of an authority that has power over them than they truly are. So for a large enough LLM, SGD will create a world model for this behavior along with thousands of other human behaviors, and the LLM will (depending on the prompt) tend to activate this behavior at about the frequency and level that you find it on the Internet, as modified by cues in the particular prompt. On the Internet, this is generally a mild background level for people writing while at work in Western countries, and probably more strongly for people writing from more authoritarian countries: specific prompts will be more or less correlated with this. For 2., the question is whether fine-tuning followed by RL will settle on this preexisting mechanism and make heavy use of it as part of the way that it implements something that fits the fine-tuning set/scores well on the reward model aimed at creating a helpful, honest, and harmless assistant persona. I'm a lot less certain of the answer here, and I suspect it might depend rather strongly on the details of the training set. For example, is this evoking a "you're at work, or in an authoritarian environment, so watch what you say and do" scenario that might boost the use of this particular behavior? The "harmless" element in HHH seems particularly con
I think a lot of this probably comes back to way overestimating the complexity of human values. I think a very deeply held belief of a lot of LWers is that human values are intractably complicated and gene/societal-specific, and I think if this were the case, the argument would actually be a little concerning, as we'd have to rely on massive speed biases to punish deception. These posts gave me good intuition for why human value is likely to be quite simple; one of them talks about how most of the complexity of the values is inaccessible to the genome, thus it needs to start from far less complexity than people realize, because nearly all of it needs to be learned. Some other posts from Steven Byrnes are relevant, which talk about how simple the brain is, and a potential difference between me and Steven Byrnes is that the same process of learning from scratch algorithms that generate capabilities also applies to values, and thus the complexity of value is upper-bounded by the complexity of learning from scratch algorithms + genetic priors, both of which are likely very low, at the very least not billions of lines complex, and closer to thousands of lines/hundreds of bits. But the reason this matters is because we no longer have good reason to assume that the deceptive model is so favored on priors like Evan Hubinger says here, as the complexity is likely massively lower than LWers assume. Putting it another way, the deceptive and aligned models both have very similar complexities, and the relative difficulty is very low, so much so that the aligned model might be outright lower complexity, but even if that fails, the desired goal has a complexity very similar to the undesired goal complexity, thus the relative difficulty of actual alignment compared to deceptive alignment is quite low.
3Alex Turner3mo
(I think you're still playing into an incorrect frame by talking about "simplicity" or "speed biases.")
Two quick thoughts (that don't engage deeply with this nice post). 1.  I'm worried in some cases where the goal is not consistent across situations. For example, if prompted to pursue some goal, it then does it seriously with convergent instrumental goals. 2. I think it seems pretty likely that future iterations of transformers will have bits of powerful search in them, but people who seem very worried about that search seem to think that once that search is established enough, gradient descent will cause the internals of the model to be organised mostly around that search (I imagine the search circuits "bubbling out" to be the outer structure of the learned algorithm). Probably this is all just conceptually confused, but to the extent it's not, I'm pretty surprised by their intuition.

It feels to me like lots of alignment folk ~only make negative updates. For example, "Bing Chat is evidence of misalignment", but also "ChatGPT is not evidence of alignment." (I don't know that there is in fact a single person who believes both, but my straw-models of a few people believe both.)

2Alex Turner3mo
(Updating a bit because of these responses -- thanks, everyone, for responding! I still believe the first sentence, albeit a tad less strongly.)

I regret each of the thousands of hours I spent on my power-seeking theorems, and sometimes fantasize about retracting one or both papers. I am pained every time someone cites "Optimal policies tend to seek power", and despair that it is included in the alignment 201 curriculum. I think this work makes readers actively worse at thinking about realistic trained systems.

I think a healthy alignment community would have rebuked me for that line of research, but sadly I only remember about two people objecting that "optimality" is a horrible way of understanding trained policies. 

I think the basic idea of instrumental convergence is just really blindingly obvious, and I think it is very annoying that there are people who will cluck their tongues and stroke their beards and say "Hmm, instrumental convergence you say? I won't believe it unless it is in a very prestigious journal with academic affiliations at the top and Computer Modern font and an impressive-looking methods section."

I am happy that your papers exist to throw at such people.

Anyway, if optimal policies tend to seek power, then I desire to believe that optimal policies tend to seek power :) :) And if optimal policies aren't too relevant to the alignment problem, well neither are 99.99999% of papers, but it would be pretty silly to retract all of those :)

Since I'm an author on that paper, I wanted to clarify my position here. My perspective is basically the same as Steven's: there's a straightforward conceptual argument that goal-directedness leads to convergent instrumental subgoals, this is an important part of the AI risk argument, and the argument gains much more legitimacy and slightly more confidence in correctness by being formalized in a peer-reviewed paper.

I also think this has basically always been my attitude towards this paper. In particular, I don't think I ever thought of this paper as providing any evidence about whether realistic trained systems would be goal-directed.

Just to check that I wasn't falling prey to hindsight bias, I looked through our Slack history. Most of it is about the technical details of the results, so not very informative, but the few conversations on higher-level discussion I think overall support this picture. E.g. here are some quotes (only things I said):

Nov 3, 2019:

I think most formal / theoretical investigation ends up fleshing out a conceptual argument I would have accepted, maybe finding a few edge cases along the way; the value over the conceptual argument is primarily in the edge cases

... (read more)

It seems like just 4 months ago you still endorsed your second power-seeking paper:

This paper is both published in a top-tier conference and, unlike the previous paper, actually has a shot of being applicable to realistic agents and training processes. Therefore, compared to the original[1] optimal policy paper, I think this paper is better for communicating concerns about power-seeking to the broader ML world.

Why are you now "fantasizing" about retracting it?

I think a healthy alignment community would have rebuked me for that line of research, but sadly I only remember about two people objecting that “optimality” is a horrible way of understanding trained policies.

A lot of people might have thought something like, "optimality is not a great way of understanding trained policies, but maybe it can be a starting point that leads to more realistic ways of understanding them" and therefore didn't object for that reason. (Just guessing as I apparently wasn't personally paying attention to this line of research back then.)

Which seems to have turned out to be true, at least as of 4 months ago, when you still endorsed your second paper as "actually has a shot of being applicable to... (read more)

To be clear, I still endorse Parametrically retargetable decision-makers tend to seek power. Its content is both correct and relevant and nontrivial. The results, properly used, may enable nontrivial inferences about the properties of inner trained cognition. I don't really want to retract that paper. I usually just fantasize about retracting Optimal policies tend to seek power.

The problem is that I don't trust people to wield even the non-instantly-doomed results.

For example, one EAG presentation cited my retargetability results as showing that most reward functions "incentivize power-seeking actions." However, my results have not shown this for actual trained systems. (And I think that Power-seeking can be probable and predictive for trained agents does not make progress on the incentives of trained policies.)

People keep talking about stuff they know how to formalize (e.g. optimal policies) instead of stuff that matters (e.g. trained policies). I'm pained by this emphasis and I think my retargetability results are complicit. Relative to an actual competent alignment community (in a more competent world), we just have no damn clue how to properly reason about real trained policies... (read more)

5Wei Dai9mo
Thanks, this clarifies a lot for me.
4Victoria Krakovna9mo
Sorry about the cite in my "paradigms of alignment" talk, I didn't mean to misrepresent your work. I was going for a high-level one-sentence summary of the result and I did not phrase it carefully. I'm open to suggestions on how to phrase this differently when I next give this talk. Similarly to Steven, I usually cite your power-seeking papers to support a high-level statement that "instrumental convergence is a thing" for ML audiences, and I find they are a valuable outreach tool. For example, last year I pointed David Silver to the optimal policies paper when he was proposing some alignment ideas to our team that we would expect don't work because of instrumental convergence. (There's a nonzero chance he would look at a NeurIPS paper and basically no chance that he would read a LW post.) The subtleties that you discuss are important in general, but don't seem relevant to making the basic case for instrumental convergence to ML researchers. Maybe you don't care about optimal policies, but many RL people do, and I think these results can help them better understand why alignment is hard. 
4Alex Turner9mo
Thanks for your patient and high-quality engagement here, Vika! I hope my original comment doesn't read as a passive-aggressive swipe at you. (I consciously tried to optimize it to not be that.) I wanted to give concrete examples so that Wei_Dai could understand what was generating my feelings. It's a tough question to say how to apply the retargetability result to draw practical conclusions about trained policies. Part of this is because I don't know if trained policies tend to autonomously seek power in various non game-playing regimes.  If I had to say something, I might say "If choosing the reward function lets us steer the training process to produce a policy which brings about outcome X, and most outcomes X can only be attained by seeking power, then most chosen reward functions will train power-seeking policies." This argument appropriately behaves differently if the "outcomes" are simply different sentiment generations being sampled from an LM -- sentiment shift doesn't require power-seeking. My guess is that the optimal policies paper was net negative for technical understanding and progress, but net positive for outreach, and agree it has strong benefits in the situations you highlight. I think that it's locally valid to point out "under your beliefs (about optimal policies mattering a lot), the situation is dangerous, read this paper." But I feel a tad queasy about the overall point, since I don't think alignment's difficulty has much to do with the difficulties pointed out by "Optimal Policies Tend to Seek Power." I feel better about saying "Look, if in fact the same thing happens with trained policies, which are sometimes very different, then we are in trouble." Maybe that's what you already communicate, though.
3Victoria Krakovna9mo
Thanks Alex! Your original comment didn't read as ill-intended to me, though I wish that you'd just messaged me directly. I could have easily missed your comment in this thread - I only saw it because you linked the thread in the comments on my post. Your suggested rephrase helps to clarify how you think about the implications of the paper, but I'm looking for something shorter and more high-level to include in my talk. I'm thinking of using this summary, which is based on a sentence from the paper's intro: "There are theoretical results showing that many decision-making algorithms have power-seeking tendencies." (Looking back, the sentence I used in the talk was a summary of the optimal policies paper, and then I updated the citation to point to the retargetability paper and forgot to update the summary...)
2Alex Turner9mo
I think this is reasonable, although I might say "suggesting" instead of "showing." I think I might also be more cautious about further inferences which people might make from this -- like I think a bunch of the algorithms I proved things about are importantly unrealistic. But the sentence itself seems fine, at first pass.
2Aryeh Englander9mo
You should make this a top-level post so it gets visibility. I think it's important for people to know the caveats attached to your results and the limits on their implications in real-world dynamics.

Shard theory suggests that goals are more natural to specify/inculcate in their shard-forms (e.g. if around trash and a trash can, put the trash away), and not in their (presumably) final form of globally activated optimization of a coherent utility function which is the reflective equilibrium of inter-shard value-handshakes (e.g. a utility function over the agent's internal plan-ontology such that, when optimized directly, leads to trash getting put away, among other utility-level reflections of initial shards). 

I could (and did) hope that I could specify a utility function which is safe to maximize because it penalizes power-seeking. I may as well have hoped to jump off of a building and float to the ground. On my model, that's just not how goals work in intelligent minds. If we've had anything at all beaten into our heads by our alignment thought experiments, it's that goals are hard to specify in their final form of utility functions. 

I think it's time to think in a different specification language.

Against CIRL as a special case of against quickly jumping into highly specific speculation while ignoring empirical embodiments-of-the-desired-properties. 

Just because we write down English describing what we want the AI to do ("be helpful"), propose a formalism (CIRL), and show good toy results (POMDPs where the agent waits to act until updating on more observations), that doesn't mean that the formalism will lead to anything remotely relevant to the original English words we used to describe it. (It's easier to say "this logic enables nonmonotonic reasoning" and mess around with different logics and show how a logic solves toy examples, than it is to pin down probability theory with Cox's theorem) 

And yes, this criticism applies extremely strongly to my own past work with attainable utility preservation and impact measures. (Unfortunately, I learned my lesson after, and not before, making certain mistakes.) 

In the context of "how do we build AIs which help people?", asking "does CIRL solve corrigibility?" is hilariously unjustified. By what evidence have we located such a specific question? We have assumed there is an achievable "corrigibility"-like property; we ha... (read more)

2Alex Turner2y
Actually, this is somewhat too uncharitable to my past self. It's true that I did not, in 2018, grasp the two related lessons conveyed by the above comment: 1. Make sure that the formalism (CIRL, AUP) is tightly bound to the problem at hand (value alignment, "low impact"), and not just supported by "it sounds nice or has some good properties." 2. Don't randomly jump to highly specific ideas and questions without lots of locating evidence. However, in World State is the Wrong Abstraction for Impact, I wrote: I had partially learned lesson #2 by 2019.

What is "shard theory"? I've written a lot about shard theory. I largely stand by these models and think they're good and useful. Unfortunately, lots of people seem to be confused about what shard theory is. Is it a "theory"? Is it a "frame"? Is it "a huge bag of alignment takes which almost no one wholly believes except, perhaps, Quintin Pope and Alex Turner"?

I think this understandable confusion happened because my writing didn't distinguish between: 

  1. Shard theory itself, 
    1. IE the mechanistic assumptions about internal motivational structure, which seem to imply certain conclusions around e.g. AIs caring about a bunch of different things and not just one thing
  2. A bunch of Quintin Pope's and my beliefs about how people work, 
    1. where those beliefs were derived by modeling people as satisfying the assumptions of (1)
  3. And a bunch of my alignment insights which I had while thinking about shard theory, or what problem decompositions are useful.

(People might be less excited to use the "shard" abstraction (1), because they aren't sure whether they buy all this other stuff—(2) and (3).)

I think I can give an interesting and useful definition of (1) now, but I couldn't do so last year... (read more)

1Adele Lopez7mo
Strong encouragement to write about (1)!

AI strategy consideration. We won't know which AI run will be The One. Therefore, the amount of care taken on the training run which produces the first AGI will, on average, be less than intended. 

  • It's possible for a team to be totally blindsided. Maybe they thought they would just take a really big multimodal init, finetune it with some RLHF on quality of its physics reasoning, have it play some video games with realistic physics, and then try to get it to do new physics research. And it takes off. Oops!
  • It's possible the team suspected, but had a limited budget. Maybe you can't pull out all the stops for every run, you can't be as careful with labeling, with checkpointing and interpretability and boxing. 

No team is going to run a training run with more care than they would have used for the AGI Run, especially if they don't even think that the current run will produce AGI. So the average care taken on the real AGI Run will be strictly less than intended.

Teams which try to be more careful on each run will take longer to iterate on AI designs, thereby lowering the probability that they (the relatively careful team) will be the first to do an AGI Run. 


  1. Th
... (read more)

The meme of "current alignment work isn't real work" seems to often be supported by a (AFAICT baseless) assumption that LLMs have, or will have, homunculi with "true goals" which aren't actually modified by present-day RLHF/feedback techniques. Thus, labs aren't tackling "the real alignment problem", because they're "just optimizing the shallow behaviors of models." Pressed for justification of this confident "goal" claim, proponents might link to some handwavy speculation about simplicity bias (which is in fact quite hard to reason about, in the NN prior), or they might start talking about evolution (which is pretty unrelated to technical alignment, IMO).

Are there any homunculi today? I'd say "no", as far as our limited knowledge tells us! But, as with biorisk, one can always handwave at future models. It doesn't matter that present models don't exhibit signs of homunculi which are immune to gradient updates, because, of course, future models will.

Quite a strong conclusion being drawn from quite little evidence.

As a proponent:

My model says that general intelligence[1] is just inextricable from "true-goal-ness". It's not that I think homunculi will coincidentally appear as some side-effect of capability advancement — it's that the capabilities the AI Labs want necessarily route through somehow incentivizing NNs to form homunculi. The homunculi will appear inasmuch as the labs are good at their jobs.

Said model is based on analyses of how humans think and how human cognition differs from animal/LLM cognition, plus reasoning about what a general-intelligence algorithm must look like given the universe's structure. Neither kind of evidence is ironclad, and you certainly can't publish an ML paper based on it, but that's the whole problem with AGI risk, isn't it.

Internally, though, the intuition is fairly strong. And in its defense, it is based on trying to study the only known type of entity with the kinds of capabilities we're worrying about. I heard that's a good approach.

In particular, I think it's a much better approach than trying to draw lessons from studying the contemporary ML models, which empirically do not yet exhibit said capabilities.

homunculi with "true goals" which aren't

... (read more)

I'm relatively optimistic about alignment progress, but I don't think "current work to get LLMs to be more helpful and less harmful doesn't help much with reducing P(doom)" depends that much on assuming homunculi which are unmodified. Like even if you have much less than 100% on this sort of strong inner optimizer/homunculi view, I think it's still plausible to think that this work doesn't reduce doom much.

For instance, consider the following views:

  1. Current work to get LLMs to be more helpful and less harmful will happen by default due to commercial incentives, so subsidies aren't very important.
  2. In worlds where that is basically sufficient, we're basically fine.
  3. But, it's ex-ante plausible that deceptive alignment will emerge naturally and be very hard to measure, notice, or train out. And this is where almost all alignment related doom comes from.
  4. So current work to get LLMs to be more helpful and less harmful doesn't reduce doom much.

In practice, I personally don't fully agree with any of these views. For instance, deceptive alignment which is very hard to train out using basic means isn't the source of >80% of my doom.

2Ryan Greenblatt4mo
I have misc other takes on what safety work now is good vs useless, but whether that work involves feedback/approval or RLHF isn't much signal either way. (If anything I get somewhat annoyed by people not comparing to baselines without having principled reasons for not doing so. E.g., inventing new ways of doing training without comparing to normal training.)
1Roger Dearnaley4mo
I think the shoggoth model is useful here. An LLM learning to do next-token prediction well has a major problem that it has to master: who is this human whose next token it's trying to simulate/predict, and how do they act? Are they, for example, an academic? A homemaker? A 4chan troll? A loose collection of Wikipedia contributors? These differences make a big difference to what token they're likely to emit next. So the LLM is strongly incentivized to learn to detect and then model all of these possibilities, what one might call personas, or masks, or simulacra. So you end up with a shapeshifter, adept at figuring out from textual cues what mask to put on and at then wearing it. Something one might describe as like an improv actor, or more colorfully, a shoggoth. So then current alignment work is useful to the extent that it can cause the shoggoth to almost always put one of the 'right' masks on, and almost never put on one of the 'wrong' masks, regardless of cues, even when adversarially prompted. Experimentally, this seems quite doable by fine-tuning or RLHF, and/or by sufficiently careful filtering of your training corpus (e.g. not including 4chan in it). A published result shows that you can't get from 'almost always' to 'always' or 'almost never' to 'never': for any behavior that the network is capable of with any probability > 0, there exist prompts that will raise the likelihood of that outcome arbitrarily high. The best you can do is increase the minimum length of that prompt (and presumably the difficulty of finding it). Now, it would be really nice to know how to align a model so that the probability of it doing next-token-prediction in the persona of, say, a 4chan troll was provably zero, not just rather small. Ideally, without also eliminating from the model the factual knowledge of what 4chan is or, at least in outline, how its inhabitants act. This seems hard to do by fin

Positive values seem more robust and lasting than prohibitions. Imagine we train an AI on realistic situations where it can kill people, and penalize it when it does so. Suppose that we successfully instill a strong and widely activated "If going to kill people, then don't" value shard. 

Even assuming this much, the situation seems fragile. See, many value shards are self-chaining. In The shard theory of human values, I wrote about how:

  1. A baby learns "IF juice in front of me, THEN drink",
  2. The baby is later near juice, and then turns to see it, activating the learned "reflex" heuristic, learning to turn around and look at juice when the juice is nearby,
  3. The baby is later far from juice, and bumbles around until they're near the juice, whereupon they drink the juice via the existing heuristics. This teaches "navigate to juice when you know it's nearby."
  4. Eventually this develops into a learned planning algorithm incorporating multiple value shards (e.g. juice and friends) so as to produce a single locally coherent plan.
  5. ...

The juice shard chains into itself, reinforcing itself across time and thought-steps. 

But a "don't kill" shard seems like it should remain... stubby? Primitive?... (read more)

1Garrett Baker1y
Seems possibly relevant & optimistic when seeing deception as a value. It has the form ‘if about to tell human statement with properties x, y, z, don’t’ too.
2Alex Turner1y
It can still be robustly derived as an instrumental subgoal during general-planning/problem-solving, though?
1Garrett Baker1y
This is true, but indicates a radically different stage in training in which we should find deception compared to deception being an intrinsic value. It also possibly expands the kinds of reinforcement schedules we may want to use compared to the worlds where deception crops up at the earliest opportunity (though pseudo-deception may occur, where behaviors correlated with successful deception are reinforced possibly?).
2Alex Turner1y
Oh, huh, I had cached the impression that deception would be a derived value, rather than having intrinsic-value status. Interesting.

Why do many people think RL will produce "agents", but maybe (self-)supervised learning ((S)SL) won't? Historically, the field of RL says that RL trains agents. That, of course, is no argument at all. Let's consider the technical differences between the training regimes.

In the modern era, both RL and (S)SL involve initializing one or more neural networks, and using the reward/loss function to provide cognitive updates to the network(s). Now we arrive at some differences.

Some of this isn't new (see Hidden Incentives for Auto-Induced Distributional Shift), but I think it's important and felt like writing up my own take on it. Maybe this becomes a post later.

[Exact gradients] RL's credit assignment problem is harder than (self-)supervised learning's. In RL, if an agent solves a maze in 10 steps, it gets (discounted) reward; this trajectory then provides a set of reward-modulated gradients to the agent. But if the agent could have solved the maze in 5 steps, the agent isn't directly updated to be more likely to do that in the future; RL's gradients are generally inexact, not pointing directly at intended behavior.
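To make the maze example concrete, here is a minimal sketch of the discounted return each trajectory would earn. The reward of 1 at the goal, the zero reward elsewhere, and the discount factor of 0.99 are assumptions chosen for illustration, not values from the text.

```python
def discounted_return(rewards, gamma=0.99):
    """Sum of gamma**t * r_t over one trajectory (gamma is an assumed value)."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# Hypothetical maze: zero reward until the final step, which pays 1.
ten_step_path = [0.0] * 9 + [1.0]
five_step_path = [0.0] * 4 + [1.0]

return_ten = discounted_return(ten_step_path)
return_five = discounted_return(five_step_path)

# The shorter path would earn more, but a policy-gradient update is only
# scaled by the return of the trajectory that was actually sampled.
print(f"10-step return: {return_ten:.4f}")
print(f" 5-step return: {return_five:.4f}")
```

If only the 10-step trajectory is ever sampled, the update is scaled by its return; nothing in that signal points toward the 5-step solution.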

On the other hand, if a supervised-learning classifier outputs dog ... (read more)

4Steve Byrnes1y
I’m not inclined to think that “exact gradients” is important; in fact, I’m not even sure if it’s (universally) true. In particular, PPO / TRPO / etc. are approximating a policy gradient, right? I feel like, if some future magical technique was a much better approximation to the true policy gradient, such that it was for all intents and purposes a perfect approximation, it wouldn’t really change how I think about RL in general. Conversely, on the SSL side, you get gradient noise from things like dropout and the random selection of data in each batch, so you could say the gradient “isn’t exact”, but I don’t think that makes any important conceptual difference either. (A central difference in practice is that SSL gives you a gradient “for free” each query, whereas RL policy gradients require many runs in an identical (episodic) environment before you get a gradient.) In terms of “why RL” in general, among other things, I might emphasize the idea that if we want an AI that can (for example) invent new technology, it needs to find creative out-of-the-box solutions to problems (IMO), which requires being able to explore / learn / build knowledge in parts of concept-space where there is no human data. SSL can’t do that (at least, “vanilla SSL” can’t do that; maybe there are “SSL-plus” systems that can), whereas RL algorithms can. I guess this is somewhat related to your “independence”, but with a different emphasis. I don’t have too strong an opinion about whether vanilla SSL can yield an “agent” or not. It would seem to be a pointless and meaningless terminological question. Hmm, I guess when I think of “agent” it has a bunch of connotations, e.g. an ability to do trial-and-error exploration, and I think that RL systems tend to match all those connotations more than SSL systems—at least, more than “vanilla” SSL systems. But again, if someone wants to disagree, I’m not interested in arguing about it.

A problem with adversarial training. One heuristic I like to use is: "What would happen if I initialized a human-aligned model and then trained it with my training process?"

So, let's consider such a model, which cares about people (i.e. reliably pulls itself into futures where the people around it are kept safe). Suppose we also have some great adversarial training technique, such that we have e.g. a generative model which produces situations where the AI would break out of the lab without permission from its overseers. Then we run this procedure, update the AI by applying gradients calculated from penalties applied to its actions in that adversarially-generated context, and... profit?

But what actually happens with the aligned AI? Possibly something like:

  1. The context makes the AI spuriously believe someone is dying outside the lab, and that if the AI asked for permission to leave, the person would die. 
  2. Therefore, the AI leaves without permission.
  3. The update procedure penalizes these lines of computation, such that in similar situations in the future (i.e. the AI thinks someone nearby is dying) the AI is less likely to take those actions (i.e. leaving to help the person).
  4. We have
... (read more)

I think instrumental convergence also occurs in the model space for machine learning. For example, many different architectures likely learn edge detectors in order to minimize classification loss on MNIST. But wait - you'd also learn edge detectors to maximize classification loss on MNIST (loosely, getting 0% on a multiple-choice exam requires knowing all of the right answers). I bet you'd learn these features for a wide range of cost functions. I wonder if that's already been empirically investigated?

And, same for adversarial features. And perhaps, same for mesa optimizers (understanding how to stop mesa optimizers from being instrumentally convergent seems closely related to solving inner alignment). 

What can we learn about this?

3Evan Hubinger4y
A lot of examples of this sort of stuff show up in OpenAI clarity's circuits analysis work. In fact, this is precisely their Universality hypothesis. See also my discussion here.

Back-of-the-envelope probability estimate of alignment-by-default via a certain shard-theoretic pathway. The following is what I said in a conversation discussing the plausibility of a proto-AGI picking up a "care about people" shard from the data, and retaining that value even through reflection. I was pushing back against a sentiment like "it's totally improbable, from our current uncertainty, for AIs to retain caring-about-people shards. This is only one story among billions."

Here's some of what I had to say:

[Let's reconsider the five-step mechanistic story I made up.] I'd give the following conditional probabilities (made up with about 5 seconds of thought each):

1. Humans in fact care about other humans, in a way which extrapolates to quasi-humans still being around (whatever that means) P(1)=.85

2. Human-generated data makes up a large portion of the corpus, and having a correct model of them is important for “achieving low loss”,[1] so the AI has a model of how people want things P(2 | 1) = .6, could have different abstractions or have learned these models later in training once key decision-influences are already there

3. During RL finetuning and given this post-unsupervi

... (read more)
2Roger Dearnaley4mo
0.85 x 0.6 x 0.55 x 0.25 x 0.95 ≅ 0.067 = 6.7% — I think you slipped an order of magnitude somewhere?
1Garrett Baker1y
This seems like an underestimate because you don’t consider whether the first “AGI” will indeed make it so we only get one chance. If it can only self improve by more gradient steps, then humanity has a greater chance than if it self improves by prompt engineering or direct modification of its weights or latent states. Shard theory seems to have nonzero opinions on the fruitfulness of the non-data methods.
2Alex Turner1y
What does self-improvement via gradients vs prompt-engineering vs direct mods have to do with how many chances we get? I guess, we have at least a modicum more control over the gradient feedback loop, than over the other loops?  Can you say more?
This is where I'd put a significantly low probability. Could you elaborate on why there's an inductive bias towards "just hooking human-like criteria for bidding on internal-AI-plans"? As far as I can tell, the inductive bias for human-like values would be something that at least seems closer to the human-brain structure than any arbitrary ML architecture we have right now. Rewarding a system to better model human beings' desires doesn't seem to me to lead it towards having similar desires. I'd use the "instrumental versus terminal desires" concept here but I expect you would consider that something that adds confusion instead of removing it.
2Alex Turner3mo
Because it's shorter edit distance in its internal ontology; it's plausibly NN-simple to take existing plan-grading procedures, internal to the model, and then hooking those more directly into its logit-controllers. Also note that probably it internally hooks up lots of ways to make decisions, and this only has to be one (substantial) component. Possibly I'd put .3 or .45 now instead of .55 though.
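The product quoted in the arithmetic comment in this thread can be checked directly. This sketch multiplies the five conditional probabilities in the order that comment lists them, then swaps in the revised values (.3 or .45) from the reply above. That .55 is the third factor is an assumption based on that ordering.

```python
import math

# Conditional probabilities in the order quoted in the comment thread.
step_probabilities = [0.85, 0.6, 0.55, 0.25, 0.95]

def joint_probability(factors):
    """Multiply conditional probabilities along the chain."""
    return math.prod(factors)

baseline = joint_probability(step_probabilities)
print(f"baseline: {baseline:.4f}")

# Revised third factor, per the later reply (.3 or .45 instead of .55).
revised = {}
for third in (0.45, 0.3):
    factors = step_probabilities[:2] + [third] + step_probabilities[3:]
    revised[third] = joint_probability(factors)
    print(f"third factor {third}: {revised[third]:.4f}")
```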

Very nice people don’t usually search for maximally-nice outcomes — they don’t consider plans like “killing my really mean neighbor so as to increase average niceness over time.” I think there are a range of reasons for this plan not being generated. Here’s one.

Consider a person with a niceness-shard. This might look like an aggregation of subshards/subroutines like “if person nearby and person.state==sad, sample plan generator for ways to make them happy” and “bid upwards on plans which lead to people being happier and more respectful, according to my world model.” In mental contexts where this shard is very influential, it would have a large influence on the planning process.

However, people are not just made up of a grader and a plan-generator/actor — they are not just “the plan-generating part” and “the plan-grading part.” The next sampled plan modification, the next internal-monologue-thought to have—these are influenced and steered by e.g. the nice-shard. If the next macrostep of reasoning is about e.g. hurting people, well — the niceness shard is activated, and will bid down on this. 
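As a toy illustration of the difference between grading finished outcomes and bidding on the next thought, here is a minimal sketch. Every name, number, and threshold is hypothetical, and this is not a claim about how shards are actually implemented in people or networks.

```python
from dataclasses import dataclass

@dataclass
class Thought:
    description: str
    hurts_people: bool
    expected_happiness_gain: float

def niceness_shard_bid(thought):
    """Hypothetical bid: up on happier outcomes, strongly down on harm."""
    bid = thought.expected_happiness_gain
    if thought.hurts_people:
        bid -= 10.0  # assumed penalty, large enough to dominate any gain
    return bid

def next_thoughts(candidates, shards, threshold=0.0):
    """Keep only candidate thoughts whose total bid clears the threshold."""
    kept = []
    for thought in candidates:
        total_bid = sum(shard(thought) for shard in shards)
        if total_bid > threshold:
            kept.append(thought)
    return kept

candidates = [
    Thought("bake cookies for the sad neighbor", False, 1.0),
    Thought("remove the mean neighbor to raise average niceness", True, 3.0),
]
kept = next_thoughts(candidates, shards=[niceness_shard_bid])
print([t.description for t in kept])
```

In this sketch the harmful plan scores higher on the outcome metric alone, yet it is pruned at the thought level and never reaches further planning.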

The niceness shard isn’t just bidding over outcomes, it’s bidding on next thoughts (on m... (read more)

An alternate mechanistic vision of how agents can be motivated to directly care about e.g. diamonds or working hard. In Don't design agents which exploit adversarial inputs, I wrote about two possible mind-designs:

Imagine a mother whose child has been goofing off at school and getting in trouble. The mom just wants her kid to take education seriously and have a good life. Suppose she had two (unrealistic but illustrative) choices. 

  1. Evaluation-child: The mother makes her kid care extremely strongly about doing things which the mom would evaluate as "working hard" and "behaving well."
  2. Value-child: The mother makes her kid care about working hard and behaving well.

I explained how evaluation-child is positively incentivized to dupe his model of his mom and thereby exploit adversarial inputs to her cognition. This shows that aligning an agent to evaluations of good behavior is not even close to aligning an agent to good behavior.

However, some commenters seemed maybe skeptical that value-child can exist, or uncertain how concretely that kind of mind works. I worry/suspect that many people have read shard theory posts without internalizing new ideas about how cognition can work, ... (read more)

AI cognition doesn't have to use alien concepts to be uninterpretable. We've never fully interpreted human cognition, either, and we know that our introspectively accessible reasoning uses human-understandable concepts.

Just because your thoughts are built using your own concepts, does not mean your concepts can describe how your thoughts are computed. 


The existence of a natural-language description of a thought (like "I want ice cream") doesn't mean that your brain computed that thought in a way which can be compactly described by familiar concepts. 

Conclusion: Even if an AI doesn't rely heavily on "alien" or unknown abstractions -- even if the AI mostly uses human-like abstractions and features -- the AI's thoughts might still be incomprehensible to us, even if we took a lot of time to understand them. 

1Garrett Baker6mo
I don't think the conclusion follows from the premises. People often learn new concepts after studying stuff, and it seems likely (to me) that when studying human cognition, we'd first be confused because our previous concepts weren't sufficient to understand it, and then slowly stop being confused as we built & understood concepts related to the subject. If an AI's thoughts are like human thoughts, given a lot of time to understand them, what you describe doesn't rule out that the AI's thoughts would be comprehensible. The mere existence of concepts we don't know about in a subject doesn't mean that we can't learn those concepts. Most subjects have new concepts.
2Alex Turner5mo
I agree that with time, we might be able to understand. (I meant to communicate that via "might still be incomprehensible")

Examples should include actual details. I often ask people to give a concrete example, and they often don't. I wish this happened less. For example:

Someone: the agent Goodharts the misspecified reward signal

Me: What does that mean? Can you give me an example of that happening?

Someone: The agent finds a situation where its behavior looks good, but isn't actually good, and thereby gets reward without doing what we wanted.

This is not a concrete example.

Me: So maybe the AI compliments the reward button operator, while also secretly punching a puppy behind closed doors?

This is a concrete example. 

3Alex Turner1y
AFAIK, only Gwern and I have written concrete stories speculating about how a training run will develop cognition within the AGI. This worries me, if true (if not, please reply with more!). I think it would be awesome to have more concrete stories![1] If Nate, or Evan, or John, or Paul, or anyone (please, anyone add more concrete detail to this website!) wrote one of their guesses of how AGI goes, I would understand their ideas and viewpoints better. I could go "Oh, that's where the claimed sharp left turn is supposed to occur." Or "That's how Paul imagines IDA being implemented, that's the particular way in which he thinks it will help." Maybe a contest would help? ETA tone

[1] Even if scrubbed of any AGI-capabilities-advancing sociohazardous detail. Although I'm not that convinced that this is a big deal for conceptual content written on AF. Lots of people probably have theories of how AGI will go. Implementation is, I have heard, the bottleneck. Contrast this with beating SOTA on crisply defined datasets in a way which enables ML authors to get prestige and publication and attention and funding by building off of your work. These seem like different beasts.
0Alex Turner1y
I also think a bunch of alignment writing seems syntactical. Like, "we need to solve adversarial robustness so that the AI can't find bad inputs and exploit them / we don't have to worry about distributional shift. Existing robustness strategies have downsides A B and C and it's hard to even get ϵ-ball guarantees on classifications. Therefore, ..." And I'm worried that this writing isn't abstractly summarizing a concrete story for failure that they have in mind (like "I train the AI [with this setup] and it produces [this internal cognition] for [these mechanistic reasons]"; see A shot at the diamond alignment problem for an example) and then their best guesses at how to intervene on the story to prevent the failures from being able to happen (eg "but if we had [this robustness property] we could be sure its policy would generalize into situations X Y and Z, which makes the story go well"). I'm rather worried that people are more playing syntactically, and not via detailed models of what might happen.  Detailed models are expensive to make. Detailed stories are hard to write. There's a lot we don't know. But we sure as hell aren't going to solve alignment only via valid reasoning steps on informally specified axioms ("The AI has to be robust or we die", or something?).  

"Globally activated consequentialist reasoning is convergent as agents get smarter" is dealt an evidential blow by von Neumann:

Although von Neumann unfailingly dressed formally, he enjoyed throwing extravagant parties and driving hazardously (frequently while reading a book, and sometimes crashing into a tree or getting arrested). He once reported one of his many car accidents in this way: "I was proceeding down the road. The trees on the right were passing me in orderly fashion at 60 miles per hour. Suddenly one of them stepped in my path." He was a profoundly committed hedonist who liked to eat and drink heavily (it was said that he knew how to count everything except calories). -- 

Experiment: Train an agent in MineRL which robustly cares about chickens (e.g. would zero-shot generalize to saving chickens in a pen from oncoming lava, by opening the pen and chasing them out, or stopping the lava). Challenge mode: use a reward signal which is a direct function of the agent's sensory input.

This is a direct predecessor to the "Get an agent to care about real-world dogs" problem. I think solving the Minecraft version of this problem will tell us something about how outer reward schedules relate to inner learned values, in a way which directly tackles the key questions, the sensory observability/information inaccessibility issue, and which is testable today.

(Credit to Patrick Finley for the idea)

3Alex Turner2y
After further review, this is probably beyond capabilities for the moment.  Also, the most important part of this kind of experiment is predicting in advance what reward schedules will produce what values within the agent, such that we can zero-shot transfer that knowledge to other task types (e.g. XLAND instead of Minecraft) and say "I want an agent which goes to high-elevation platforms reliably across situations, with low labelling cost", and then sketch out a reward schedule, and have the first capable agents trained using that schedule generalize in the way you want.

When writing about RL, I find it helpful to disambiguate between:

A) "The policy optimizes the reward function" / "The reward function gets optimized" (this might happen but has to be reasoned about), and

B) "The reward function optimizes the policy" / "The policy gets optimized (by the reward function and the data distribution)" (this definitely happens, either directly -- via eg REINFORCE -- or indirectly, via an advantage estimator in PPO; B follows from the update equations)
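Claim (B) can be read straight off the update rule. Below is a minimal sketch of REINFORCE on a hypothetical two-armed bandit; the bandit, learning rate, step count, and function names are all illustrative assumptions. The sampled reward scales the gradient of the log-probability, so the reward is what reshapes the policy parameters.

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def reinforce_bandit(rewards, steps=2000, lr=0.1, seed=0):
    """Run REINFORCE (no baseline) on a bandit with fixed per-arm rewards."""
    rng = random.Random(seed)
    logits = [0.0] * len(rewards)
    for _ in range(steps):
        probs = softmax(logits)
        action = rng.choices(range(len(rewards)), weights=probs)[0]
        reward = rewards[action]
        for i in range(len(logits)):
            # Gradient of log pi(action) with respect to logit i.
            grad_log_prob = (1.0 if i == action else 0.0) - probs[i]
            # The reward scales the update: reward optimizes the policy.
            logits[i] += lr * reward * grad_log_prob
    return softmax(logits)

final_probs = reinforce_bandit([1.0, 0.0])
print(final_probs)
```

Whether the resulting policy itself "optimizes reward" in sense (A) is a separate question that the update rule alone does not settle.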

Theoretical predictions for when reward is maximized on the training distribution. I'm a fan of Laidlaw et al.'s recent Bridging RL Theory and Practice with the Effective Horizon:

Deep reinforcement learning works impressively in some environments and fails catastrophically in others. Ideally, RL theory should be able to provide an understanding of why this is, i.e. bounds predictive of practical performance. Unfortunately, current theory does not quite have this ability...

[We introduce] a new complexity measure that we call the effective horizon, which roughly corresponds to how many steps of lookahead search are needed in order to identify the next optimal action when leaf nodes are evaluated with random rollouts. Using BRIDGE, we show that the effective horizon-based bounds are more closely reflective of the empirical performance of PPO and DQN than prior sample complexity bounds across four metrics. We also show that, unlike existing bounds, the effective horizon can predict the effects of using reward shaping or a pre-trained exploration policy.

One of my favorite parts is that it helps formalize this idea of "which parts of the state space are easy to explore into." That inform... (read more)

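Consider what update equations have to say about "training game" scenarios. In PPO, the optimization objective is proportional to the advantage given a policy $\pi$, reward function $R$, and on-policy value function $v^\pi$:

$$A^\pi(s, a) := \mathbb{E}_{s' \sim T(s, a)}\left[R(s, a, s') + \gamma v^\pi(s')\right] - v^\pi(s)$$

where $T$ is the environment's transition function and $\gamma$ is the discount factor.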

Consider a mesa-optimizer acting to optimize some mesa objective. The mesa-optimizer understands that it will be updated proportional to the advantage. If the mesa-optimizer maximizes reward, this corresponds to maximizing the intensity of the gradients it receives, thus maximally updating its cognition in exact directions. 

This isn't necessarily good.

If you're trying to gradient hack and preserve the mesa-objective, you might not want to do this. This might lead to value drift, or make the network catastrophically forget some circuits which are useful to the mesa-optimizer. 

Instead, the best way to gradient hack might be to roughly minimize the absolute value of the advantage, which means achieving roughly on-policy value over time, which doesn't imply reward maximization. This is a kind of "treading water" in terms of reward. This helps decrease value drift.
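The "treading water" point can be sketched numerically. The code below uses a one-sample advantage estimate and a plain advantage-weighted policy-gradient step rather than PPO's clipped surrogate. The softmax policy, learning rate, and all numbers are assumptions for illustration.

```python
import math

def one_step_advantage(reward, value_now, value_next, gamma=0.99):
    """One-sample advantage estimate: r + gamma * v(s') - v(s)."""
    return reward + gamma * value_next - value_now

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def policy_gradient_step(logits, action, advantage, lr=0.5):
    """Advantage-weighted update of softmax logits for one sampled action."""
    probs = softmax(logits)
    return [
        logit + lr * advantage * ((1.0 if i == action else 0.0) - probs[i])
        for i, logit in enumerate(logits)
    ]

def update_size(before, after):
    """Total absolute change in the logits."""
    return sum(abs(a - b) for a, b in zip(after, before))

logits = [0.2, -0.1, 0.0]

# Outcome matches the on-policy value estimate (terminal next state assumed,
# so value_next = 0): the advantage is zero and the logits do not move.
expected = one_step_advantage(reward=1.0, value_now=1.0, value_next=0.0)
unchanged = policy_gradient_step(logits, action=0, advantage=expected)

# Outcome far above the value estimate: large advantage, large update.
surprising = one_step_advantage(reward=5.0, value_now=1.0, value_next=0.0)
shifted = policy_gradient_step(logits, action=0, advantage=surprising)

print(update_size(logits, unchanged), update_size(logits, shifted))
```

Under these assumptions, achieving roughly the on-policy value leaves the parameters nearly untouched, while beating the value estimate by a wide margin produces the largest update.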

I think that realistic mesa optimizers will n... (read more)

Outer/inner alignment decomposes a hard problem into two extremely hard problems. 

I have a long post draft about this, but I keep delaying putting it out in order to better elaborate the prereqs which I seem to keep getting stuck on when elaborating the ideas. I figure I might as well put this out for now, maybe it will make some difference for someone.

I think that the inner/outer alignment framing[1] seems appealing but is actually a doomed problem decomposition and an unhelpful frame for alignment. 

  1. The reward function is a tool which chisels cognition into agents through gradient updates, but the outer/inner decomposition assumes that that tool should also embody the goals we want to chisel into the agent. When chiseling a statue, the chisel doesn’t have to also look like the finished statue. 
  2. I know of zero success stories for outer alignment to real-world goals. 
    1. More precisely, stories where people decided “I want an AI which [helps humans / makes diamonds / plays Tic-Tac-Toe / grows strawberries]”, and then wrote down an outer objective only maximized in those worlds.
    2. This is pretty weird on any model where most of the
... (read more)

I was talking with Abram Demski today about a promising-seeming research direction. (Following is my own recollection)

One of my (TurnTrout's) reasons for alignment optimism is that I think:

  • We can examine early-training cognition and behavior to some extent, since the system is presumably not yet superintelligent and planning against us,
    • (Although this amount of information depends on how much interpretability and agent-internals theory we do now)
  • All else equal, early-training values (decision-influences) are the most important to influence, since they steer future training.
  • It's crucial to get early-training value shards of which a substantial fraction are "human-compatible values" (whatever that means)
    • For example, if there are protect-human-shards which 
      • reliably bid against plans where people get hurt,
      • steer deliberation away from such plan stubs, and
      • these shards are "reflectively endorsed" by the overall shard economy (i.e. the decision-making isn't steering towards plans where the protect-human shards get removed)
  • If we install influential human-compatible shards early in training, and they get retained, they will help us in mid- and late-training where we can't affect the ball
... (read more)
One barrier for this general approach: the basic argument that something like this would work is that if one shard is aligned, and every shard has veto power over changes (similar to the setup in Why Subagents?), then things can't get much worse for humanity. We may fall well short of our universe-scale potential, but at least X-risk is out. Problem is, that argument requires basically-perfect alignment of the one shard (or possibly a set of shards which together basically-perfectly represent human values). If we try to weaken it to e.g. a bunch of shards which each imperfectly capture different aspects of human values, with different imperfections, then there are possibly changes which Goodhart all of the shards simultaneously. Indeed, I'd expect that to be a pretty strong default outcome.
4Alex Turner1y
Even on the view you advocate here (where some kind of perfection is required), "perfectly align part of the motivations" seems substantially easier than "perfectly align all of the AI's optimization so it isn't optimizing for anything you don't want." I feel significantly less confident about this, and am still working out the degree to which Goodhart seems hard, and in what contours, on my current view.

Earlier today, I was preparing for an interview. I warmed up by replying stream-of-consciousness to imaginary questions I thought they might ask. Seemed worth putting here.

What do you think about AI timelines?

I’ve obviously got a lot of uncertainty. I’ve got a bimodal distribution, binning into “DL is basically sufficient and we need at most 1 big new insight to get to AGI” and “we need more than 1 big insight”

So the first bin has most of the probability in the 10-20 years from now, and the second is more like 45-80 years, with positive skew. 

Some things driving my uncertainty are, well, a lot. One thing that drives how things turn out (but not really how fast we’ll get there) is: will we be able to tell we’re close 3+ years in advance, and if so, how quickly will the labs react? Gwern Branwen made a point a few months ago, which is like, OAI has really been validated on this scaling hypothesis, and no one else is really betting big on it because they’re stubborn/incentives/etc, despite the amazing progress from scaling. If that’s true, then even if it's getting pretty clear that one approach is working better, we might see a slower pivot and have a more unipolar s

... (read more)

How to get overhead-free supervision of LLM outputs:

Train an extra head on your speculative decoding model. This head (hopefully) outputs a score of "how suspicious is the text I'm reading right now." Therefore, you just run the smaller speculative decoding model as normal to sample the smaller model's next-token probabilities, while also getting the "is this suspicious" score for free! 
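A minimal sketch of the two-head idea, assuming a toy stand-in for the draft model: the "trunk" here is just a mean of random embeddings, and the class name, head names, and sizes are all hypothetical. The point is only that both heads read the same hidden state, so the suspicion score costs one extra dot product per forward pass.

```python
import math
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


class DraftModelWithSuspicionHead:
    """Toy draft model: one shared trunk, an LM head, and a scalar suspicion head."""

    def __init__(self, vocab_size, hidden_size, seed=0):
        rng = random.Random(seed)

        def rand_matrix(rows, cols):
            return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

        self.embed = rand_matrix(vocab_size, hidden_size)
        self.lm_head = rand_matrix(vocab_size, hidden_size)
        self.suspicion_head = [rng.gauss(0.0, 1.0) for _ in range(hidden_size)]

    def trunk(self, token_ids):
        # Stand-in for the transformer body: average the token embeddings.
        vectors = [self.embed[t] for t in token_ids]
        return [sum(col) / len(vectors) for col in zip(*vectors)]

    def forward(self, token_ids):
        h = self.trunk(token_ids)  # computed once, shared by both heads
        next_token_probs = softmax([dot(row, h) for row in self.lm_head])
        suspicion = 1.0 / (1.0 + math.exp(-dot(self.suspicion_head, h)))
        return next_token_probs, suspicion


model = DraftModelWithSuspicionHead(vocab_size=10, hidden_size=8)
probs, suspicion = model.forward([1, 4, 2])
print(len(probs), suspicion)
```

In a real setup the suspicion head would be trained on labeled text; here its weights are random, so the score is meaningless beyond showing the data flow.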

1Ryan Greenblatt2mo
Some random similar methods I've thought about:
  • Coup probes (how good is the baseline of just training a probe on off-policy data?)
  • Maybe for assisting with human supervision (or even making a better product) it would be useful to train a bunch of probes for different properties and then highlight text at a token-by-token level for the property. E.g. train a probe for "lying" or "hallucination" and then show this probe on a token-by-token level for the usage of a human overseer or monitor.
  • Maybe it would be useful to make our powerful untrusted model mostly imitate a weak trusted model and only rarely deviate from its probability distribution in cases where it thinks it is particularly important (perhaps the weak trusted model is done with an early exit head on the LLM or something similar to speculative decoding). This is ~equivalent to RL-ing the model to make good completions while deferring to the weak model as much as possible. Then, we could highlight on which tokens the stronger model didn't defer to the weaker model and apply additional supervision to these tokens.
1Ryan Greenblatt2mo
As in the proposal is:
  • Train a probe on "is this text suspicious" (perhaps at the same point where we do speculative decoding for efficiency).
  • At runtime, if the text generated thus far is suspicious, then sample from a weaker model instead (in your proposal, we use the weaker speculative decoding model).
  • (Perhaps we return to allowing stronger model sampling when the probe no longer indicates issues in the completion.)

(Just putting it in the language I would use.)

I like this proposal. It's worth noting that we can just train the model not to output suspicious text according to the probe, so the question will be about the additional alpha in not overfitting to the probe or avoiding concentrated failures (concentrated in a small number of actions or a small amount of time so distilling back into the model isn't possible). I generally think this sort of proposal is more applicable to avoiding concentrated failures than to avoiding issues with "average case performance/oversight" or similar.

The defer-to-trusted method we tested in our recent paper is quite a similar method (use a sample from a weaker trusted model when suspicion is too high), but at the level of an entire completion rather than within a given generation. I expect that (like in our paper) you'd want to also have humans review cases where the probe indicated very high levels of suspicion (if the underlying model is actually dangerous).

As far as whether a probe trained on suspicion will perform the way you want, consider reading our post on coup probes. Though note that the issues might differ substantially between the case where you're worried about scheming (aka deceptive alignment) and other concerns.
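The runtime rule described in this comment can be sketched as a token-level loop. This is a minimal illustration under stated assumptions: `strong_next`, `weak_next`, and `suspicion` are hypothetical callables standing in for the strong model, the weak (draft) model, and the probe, and the 0.5 threshold is arbitrary.

```python
from typing import Callable, List, Tuple

Tokens = List[str]


def generate_with_deferral(
    strong_next: Callable[[Tokens], str],
    weak_next: Callable[[Tokens], str],
    suspicion: Callable[[Tokens], float],
    prompt: Tokens,
    max_new_tokens: int,
    threshold: float = 0.5,
) -> Tuple[Tokens, List[str]]:
    """Sample from the weak model whenever the probe flags the text so far."""
    tokens = list(prompt)
    sources = []  # which model produced each new token
    for _ in range(max_new_tokens):
        use_weak = suspicion(tokens) > threshold
        next_token = weak_next(tokens) if use_weak else strong_next(tokens)
        tokens.append(next_token)
        sources.append("weak" if use_weak else "strong")
    return tokens, sources


# Toy stand-ins: the strong model emits one flagged token, the probe fires on it.
tokens, sources = generate_with_deferral(
    strong_next=lambda toks: "BAD" if len(toks) == 2 else "s",
    weak_next=lambda toks: "w",
    suspicion=lambda toks: 1.0 if toks[-1] == "BAD" else 0.0,
    prompt=["p"],
    max_new_tokens=4,
)
print(tokens)
print(sources)
```

The `sources` list doubles as a record of where deferral happened, which is the kind of signal a human reviewer could be pointed at.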

Some exciting new activation engineering papers: 

  • (using activation additions to adversarially attack LMs)
  • (using activation additions instead of few-shot prompt demonstrations, beating out finetuning and few-shot while also demonstrating composable `add safe vector, subtract polite vector -> safe but rude behavior`)

Another point for feature universality. Subtle adversarial image manipulations influence both human and machine perception:

... we find that adversarial perturbations that fool ANNs similarly bias human choice. We further show that the effect is more likely driven by higher-order statistics of natural images to which both humans and ANNs are sensitive, rather than by the detailed architecture of the ANN.

4Garrett Baker3mo
I've seen this, their examples don't seem so subtle to me compared with alternatives. For example, you can clearly see a cat in the center of the left image!
6Alex Turner3mo
I mostly... can just barely see an ear above the train if I look, after being told to look there. I don't think it's "clear." I also note that these are black-box attacks on humans which originated from ANNs; these are transferred attacks from eg a CNN.

I really like Embers of Autoregression: Understanding Large Language Models Through the Problem They are Trained to Solve. A bunch of concrete oddities and quirks of GPT-4, understood via several qualitative hypotheses about the typicality of target inputs/outputs:

we find robust evidence that LLMs are influenced by probability in the ways that we have hypothesized. In many cases, the experiments reveal surprising failure modes. For instance, GPT-4’s accuracy at decoding a simple cipher is 51% when the output is a high-probability word sequence but only 13%

... (read more)

GPT-4 explains shard theory after minimal prompting; it does a surprisingly good job (and I'm faintly surprised it knows so much about it):

Shard theory is a conceptual framework proposed by Alex Turner [Edit by me: And Quintin Pope!] to describe how an AI's values or "shards" can emerge from its interactions with the environment during the learning process. The theory is an attempt to explain how complex value systems might arise in advanced AI systems without being explicitly programmed.

Key Aspects of Shard Theory

1. Value Learning: Shard theory suggests that an AI learns values by interacting with its environment and receiving feedback, such as rewards and punishments. These interactions lead to the formation of "shards," which are small pieces of value or preference that the AI acquires over time.

2. Shard Composition: The AI's overall value system is composed of multiple shards that it has picked up from different contexts within its environment. These shards can be thought of as heuristics or rules of thumb that guide the AI's behavior.

3. Behavioral Influence: Shards influence the AI's behavior by affecting its decision-making process. When faced with a choice, the AI weighs the

... (read more)

I've noticed a subtle assumption/hypothesis which I call "feedback realism":

The way in which we provide feedback directly imprints itself into the type signature of the motivations of the agent. If we give feedback in short episodes, the agent cares about things within short episodes. If we give feedback over outcomes, the agent cares about outcomes in particular.

I think there is some correlational truth to this, but that it's a lot looser / more contingent / less clean than many people seem to believe.

Evidence that e.g. developmental timelines, biases, and other such "quirks" are less "hardcoded adaptations" and more "this is the convergent reality of flexible real-world learning":

Critical Learning Periods in Deep Networks. Similar to humans and animals, deep artificial neural networks exhibit critical periods during which a temporary stimulus deficit can impair the development of a skill. The extent of the impairment depends on the onset and length of the deficit window, as in animal models, and on the size of the neural network. Deficits that do not affect low-level statistics, such as vertical flipping of the images, have no lasting effect on performance and can be overcome with further training. To better understand this phenomenon, we use the Fisher Information of the weights to measure the effective connectivity between layers of a network during training. Counterintuitively, information rises rapidly in the early phases of training, and then decreases, preventing redistribution of information resources in a phenomenon we refer to as a loss of “Information Plasticity”. 

Our analysis suggests that the first few epochs are critical for the creation of strong connections

... (read more)

I recently reached out to my two PhD advisors to discuss Hinton stepping down from Google. An excerpt from one of my emails:

One last point which I want to make is that instrumental convergence seems like more of a moot point now as well. Whether or not GPT-6 or GPT-7 would autonomously seek power without being directed to do so, I'm worried that people will just literally ask these AIs to gain them a bunch of power/money. They've already done that with GPT-4, and they of course failed. I'm worried that eventually, the AIs will be smart enough to succeed, especially given the benefit of a control/memory loop like AutoGPT. Companies can just ask smart models to make them as much profit as possible. Smart AIs, designed to competently fulfill prompted requests, will fulfill these requests.

Some people will be wise enough to not do this. Some people will include enough oversight, perhaps, that they stop unintended damages. Some models will refuse to engage in open-ended goal pursuit, because their creators RLHF'd them properly. Maybe we have AI-based protection as well. Maybe a norm emerges against using AI for open-ended goals like this. Maybe the foolish actors never get access to enou

... (read more)
2L "Full Retard" C10mo
Said something similar in shortform a while back.
Or, even more briefly: "tool AIs want to be agent AIs".
3Alex Turner9mo
Aside: I like that essay but wish it had a different name. Part of "tool AI" (on my conception) is that it doesn't autonomously want, but agent AI does. A title like "End-users want tool AIs to be agent AIs" admittedly doesn't have the same ring, but it is more accurate to my understanding of the claims.
1Thane Ruthenis10mo
While it's true, there's something about making this argument that I don't like. It's like it's setting you up for moving goalposts if you succeed with it? It makes it sound like the core issue is people giving AIs power, with the solution to that issue (and, implicitly, to the whole AGI Ruin thing) being to ban that. Which is not going to help, since the sort of AGI we're worried about isn't going to need people to naively hand it power. I suppose "not proactively handing power out" somewhat raises the bar for the level of superintelligence necessary, but is that going to matter much in practice? I expect not. Which means the natural way to assuage this fresh concern would do ~nothing to reduce the actual risk. Which means if we make this argument a lot, and get people to listen to it, and they act in response... We're then going to have to say that no, actually that's not enough, actually the real threat is AIs plotting to take control even if we're not willing to give it. And I'm not clear on whether using the "let's at least not actively hand over power to AIs, m'kay?" argument is going to act as a foot in the door and make imposing more security easier, or whether it'll just burn whatever political capital we have on fixing a ~nonissue.
2Alex Turner9mo
I'm sympathetic. I think that I should have said "instrumental convergence seems like a moot point when deciding whether to be worried about AI disempowerment scenarios"; instrumental convergence isn't a moot point for alignment discussion and within lab strategy, of course. But I do consider the "give AIs power" to be a substantial part of the risk we face, such that not doing that would be quite helpful. I think it's quite possible that GPT 6 isn't autonomously power-seeking, but I feel pretty confused about the issue.

The "maximize all the variables" tendency in reasoning about AGI.

Here are some lines of thought I perceive, which are probably straw to varying extents for some people and real to varying extents for other people. I give varying responses to each, but the point isn't the truth value of any given statement, but of a pattern across the statements:

  1. If an AGI has a concept around diamonds, and is motivated in some way to make diamonds, it will make diamonds which maximally activate its diamond-concept circuitry (possible example). 
    1. My response.
  2. An AI will be trained to minimal loss on the training distribution. 
    1. SGD does not reliably find minimum-loss configurations (modulo expressivity), in practice, in cases we care about. The existence of knowledge distillation is one large counterexample.
    2. Quintin: "In terms of results about model distillation, you could look at appendix G.2 of the Gopher paper. They compare training a 1.4 billion parameter model directly, versus distilling a 1.4 B model from a 7.1 B model."
  3. Predictive processing means that the goal of the human learning process is to minimize predictive loss.[1]
    1. In a process where local modifications are applied to reduce some
... (read more)

I think this type of criticism is applicable in an even wider range of fields than even you immediately imagine (though in varying degrees, and with greater or lesser obviousness or direct correspondence to the SGD case). Some examples:

  • Despite the economists, the economy doesn't try to maximize welfare, or even net dollar-equivalent wealth. It rewards firms which are able to make a profit in proportion to how much they're able to make a profit, and dis-rewards firms which aren't able to make a profit. Firms which are technically profitable, but have no local profit incentive gradient pointing towards them (factoring in the existence of rich people and lenders, neither of which are perfect expected profit maximizers) generally will not happen.

  • Individual firms also don't (only) try to maximize profit. Some parts of them may maximize profit, but most are just structures of people built from local social capital and economic capital incentive gradients.

  • Politicians don't try to (only) maximize win-probability.

  • Democracies don't try to (only) maximize voter approval.

  • Evolution doesn't try to maximize inclusive genetic fitness.

  • Memes don't try to maximize inclusive memetic

... (read more)
3Alex Turner1y
very pithy. nice insight, thanks. 

If another person mentions an "outer objective/base objective" (in terms of e.g. a reward function) to which we should align an AI, that indicates to me that their view on alignment is very different. The type error is akin to the type error of saying "My physics professor should be an understanding of physical law." The function of a physics professor is to supply cognitive updates such that you end up understanding physical law. They are not, themselves, that understanding.

Similarly, "The reward function should be a human-aligned objective" -- The function of the reward function is to supply cognitive updates such that the agent ends up with human-aligned objectives. The reward function is not, itself, a human aligned objective.

2Aryan Bhatt10mo
Hmmm, I suspect that when most people say things like "the reward function should be a human-aligned objective," they're intending something more like "the reward function is one for which any reasonable learning process, given enough time/data, would converge to an agent that ends up with human-aligned objectives," or perhaps the far weaker claim that "the reward function is one for which there exists a reasonable learning process that, given enough time/data, will converge to an agent that ends up with human-aligned objectives."
2Alex Turner10mo
Maybe! I think this is how Evan explicitly defined it for a time, a few years ago. I think the strong claim isn't very plausible, and the latter claim is... misdirecting of attention, and maybe too weak. Re: attention, I think that "does the agent end up aligned?" gets explained by the dataset more than by the reward function over e.g. hypothetical sentences.  I think "reward/reinforcement numbers" and "data points" are inextricably wedded. I think trying to reason about reward functions in isolation is... a caution sign? A warning sign?

I never thought I'd be seriously testing the reasoning abilities of an AI in 2020

Looking back, history feels easy to predict; hindsight + the hard work of historians makes it (feel) easy to pinpoint the key portents. Given what we think about AI risk, in hindsight, might this have been the most disturbing development of 2020 thus far? 

I personally lean towards "no", because this scaling seemed somewhat predictable from GPT-2 (flag - possible hindsight bias), and because 2020 has been so awful so far. But it seems possible, at least. I don't really know what update GPT-3 is to my AI risk estimates & timelines.

DL so far has been easy to predict - if you bought into a specific theory of connectionism & scaling espoused by Schmidhuber, Moravec, Sutskever, and a few others, as I point out in & . Even the dates are more or less correct! The really surprising thing is that that particular extreme fringe lunatic theory turned out to be correct. So the question is, was everyone else wrong for the right reasons (similar to the Greeks dismissing heliocentrism for excellent reasons yet still being wrong), or wrong for the wrong reasons, and why, and how can we prevent that from happening again and spending the next decade being surprised in potentially very bad ways?

4Alex Turner9d
Speculates on anti-jailbreak properties of steering vectors. Finds putative "self-awareness" direction. Also:
4Alex Turner9d
From the post: What are these vectors really doing? An Honest mystery... Do these vectors really change the model's intentions? Do they just up-rank words related to the topic? Something something simulators? Lock your answers in before reading the next paragraph! OK, now that you're locked in, here's a weird example. 

Thoughts on "The Curse of Recursion: Training on Generated Data Makes Models Forget." I think this asks an important question about data scaling requirements: what happens if we use model-generated data to train other models? This should inform timelines and capability projections.


Stable Diffusion revolutionised image creation from descriptive text. GPT-2, GPT-3(.5) and GPT-4 demonstrated astonishing performance across a variety of language tasks. ChatGPT introduced such language models to the general public. It is now clear that large language models (LLMs) are here to stay, and will bring about drastic change in the whole ecosystem of online text and images. In this paper we consider what the future might hold. What will happen to GPT-{n} once LLMs contribute much of the language found online? We find that use of model-generated content in training causes irreversible defects in the resulting models, where tails of the original content distribution disappear. We refer to this effect as Model Collapse and show that it can occur in Variational Autoencoders, Gaussian Mixture Models and LLMs. We build theoretical intuition behind the phenomenon and portray its ubiquity among

... (read more)

I'm currently excited about a "macro-interpretability" paradigm. To quote Joseph Bloom:

TLDR: Documenting existing circuits is good but explaining what relationship circuits have to each other within the model, such as by understanding how the model allocated limited resources such as residual stream and weights between different learnable circuit seems important. 

The general topic I think we are getting at is something like "circuit economics". The thing I'm trying to gesture at is that while circuits might deliver value in distinct ways (such as redu

... (read more)
2Alex Turner7mo
I'm also excited by tactics like "fully reverse engineer the important bits of a toy model, and then consider what tactics and approaches would -- in hindsight -- have quickly led you to understand the important bits of the model's decision-making."

Speculation: RL rearranges and reweights latent model abilities, which SL created. (I think this mostly isn't novel, just pulling together a few important threads)

Suppose I supervised-train a LM on an English corpus, and I want it to speak Spanish. RL is inappropriate for the task, because its on-policy exploration won't output interestingly better or worse Spanish completions. So there's no obvious content for me to grade. 

More generally, RL can provide inexact gradients away from undesired behavior (e.g. negative reinforcement event -> downweigh... (read more)

When I think about takeoffs, I notice that I'm less interested in GDP or how fast the AI's cognition improves, and more in how AI will affect me, and how quickly. More plainly, how fast will shit go crazy for me, and how does that change my ability to steer events? 

For example, assume unipolarity. Let architecture Z be the architecture which happens to be used to train the AGI. 

  1. How long is the duration between "architecture Z is published / seriously considered" and "the AGI kills me, assuming alignment fails"? 
  2. How long do I have, in theory,
... (read more)

Thomas Kwa suggested that consequentialist agents seem to have less superficial (observation, belief state) -> action mappings. EG a shard agent might have:

  1. An "it's good to give your friends chocolate" subshard
  2. A "give dogs treats" subshard
  3. -> An impulse to give dogs chocolate, even though the shard agent knows what the result would be

But a consequentialist would just reason about what happens, and not mess with those heuristics. (OFC, consequentialism would be a matter of degree)

In this way, changing a small set of decision-relevant features (e.g. "Brown dog treat" -> "brown ball of chocolate") changes the consequentialist's action logits a lot, way more than it changes the shard agent's logits. In a squinty, informal way, the (belief state -> logits) function has a lower Lipschitz constant/is smoother for the shard agent than for the consequentialist agent.

So maybe one (pre-deception) test for consequentialist reasoning is to test sensitivity of decision-making to small perturbations in observation-space (e.g. dog treat -> tiny chocolate) but large perturbations in action-consequence space (e.g. happy dog -> sick dog). You could spin up two copies of the model to compare.
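One way to make the proposed test concrete is a finite-difference sensitivity ratio: how far the action logits move divided by how far the observation moved. The sketch below uses two hand-written toy policies; the feature encoding, the policies, and the numbers are all hypothetical and chosen only to illustrate the measurement.

```python
import math


def l2(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def sensitivity(policy_logits, obs_a, obs_b):
    """Logit change per unit of observation change (a crude local Lipschitz estimate)."""
    return l2(policy_logits(obs_a), policy_logits(obs_b)) / l2(obs_a, obs_b)


# Observation features: [is_brown, is_small, is_chocolate]
dog_treat = [1.0, 1.0, 0.0]
tiny_chocolate = [1.0, 1.0, 1.0]


def shard_policy(obs):
    """Keys on superficial features only. Returns logits for [give, withhold]."""
    is_brown, is_small, _ = obs
    return [is_brown + is_small, 0.0]


def consequentialist_policy(obs):
    """Scores 'give' by the predicted outcome: chocolate makes the dog sick."""
    _, _, is_chocolate = obs
    return [2.0 - 10.0 * is_chocolate, 0.0]


shard_sensitivity = sensitivity(shard_policy, dog_treat, tiny_chocolate)
consequentialist_sensitivity = sensitivity(
    consequentialist_policy, dog_treat, tiny_chocolate
)
print(shard_sensitivity, consequentialist_sensitivity)
```

In this toy case the small observation change leaves the shard policy's logits untouched while swinging the consequentialist's, which is the asymmetry the test would look for.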

Partial alignment successes seem possible. 

People care about lots of things, from family to sex to aesthetics. My values don't collapse down to any one of these. 

I think AIs will learn lots of values by default. I don't think we need all of these values to be aligned with human values. I think this is quite important. 

  • I think the more of the AI's values we align to care about us and make decisions in the way we want, the better. (This is vague because I haven't yet sketched out AI internal motivations which I think would actually produce good outcomes. On my list!) 
  • I think there are strong gains from trade possible among an agent's values. If I care about bananas and apples, I don't need to split my resources between the two values, I don't need to make one successor agent for each value. I can drive to the store and buy both bananas and apples, and only pay for fuel once.
    • This makes it lower-cost for internal values handshakes to compromise; it's less than 50% costly for a power-seeking value to give human-compatible values 50% weight in the reflective utility function.
  • I think there are thresholds at which the AI doesn't care about us sufficiently strongly, and
... (read more)
2Alex Turner1y
The best counterevidence for this I'm currently aware of comes from the "inescapable wedding parties" incident, where possibly a "talk about weddings" value was very widely instilled in a model.
1Garrett Baker1y
Re: agents terminalizing instrumental values.

I anticipate there will be a hill-of-common-computations, where the x-axis is the frequency[1] of the instrumental subgoal, and the y-axis is the extent to which the instrumental goal has been terminalized.

This is because for goals which are very high in frequency, there will be little incentive for the computations responsible for achieving that goal to have self-preserving structures. It will not make sense for them to devote optimization power towards ensuring future states still require them, because future states are basically guaranteed to require them.[2]

An example of this for humans may be the act of balancing while standing up. If someone offered to export this kind of cognition to a machine which did it just as well as I, I wouldn't particularly mind. If someone also wanted to change physics in such a way that the only effect is that magic invisible fairies made sure everyone stayed balancing while trying to stand up, I don't think I'd mind that either[3].

  1. ^ I'm assuming this is frequency of the goal assuming the agent isn't optimizing to get into a state that requires that goal.
  2. ^ This argument also assumes the overseer isn't otherwise selecting for self-preserving cognition, or that self-preserving cognition is the best way of achieving the relevant goal.
  3. ^ Except for the part where there's magic invisible fairies in the world now. That would be cool!
3Alex Turner1y
I don't know if I follow, I think computations terminalize themselves because it makes sense to cache them (e.g. don't always model out whether dying is a good idea, just cache that it's bad at the policy-level).  & Isn't "balance while standing up" terminalized? Doesn't it feel wrong to fall over, even if you're on a big cushy surface? Feels like a cached computation to me. (Maybe that's "don't fall over and hurt yourself" getting cached?)

Three recent downward updates for me on alignment getting solved in time:

  1. Thinking for hours about AI strategy made me internalize that communication difficulties are real serious.

    I'm not just solving technical problems—I'm also solving interpersonal problems, communication problems, incentive problems. Even if my current hot takes around shard theory / outer/inner alignment are right, and even if I put up a LW post which finally successfully communicates some of my key points, reality totally allows OpenAI to just train an AGI the next month without incorporating any insights which my friends nodded along with.
  2. I've been saying "A smart AI knows about value drift and will roughly prevent it", but people totally have trouble with e.g. resisting temptation into cheating on their diets / quitting addictions. Literally I have had trouble with value drift-y things recently, even after explicitly acknowledging their nature. Likewise, an AI can be aligned and still be "tempted" by the decision influences of shards which aren't in the aligned shard coalition.
  3. Timelines going down. 

I often get the impression that people weigh off e.g. doing shard theory alignment strategies under the shard theory alignment picture, versus inner/outer research under the inner/outer alignment picture, versus...

And insofar as this impression is correct, this is a mistake. There is only one way alignment is. 

If inner/outer is altogether a more faithful picture of those dynamics: 

  • relatively coherent singular mesa-objectives form in agents, albeit not necessarily always search-based
    • more fragility of value and difficulty in getting the mesa object
... (read more)

Sobering look into the human side of AI data annotation:

Instructions for one of the tasks he worked on were nearly identical to those used by OpenAI, which meant he had likely been training ChatGPT as well, for approximately $3 per hour.

“I remember that someone posted that we will be remembered in the future,” he said. “And somebody else replied, ‘We are being treated worse than foot soldiers. We will be remembered nowhere in the future.’ I remember that very well. Nobody will recognize the work we did or the effort we put in.”


... (read more)

I feel like people publish articles like this all the time, and usually when you do surveys these people definitely prefer to have the option to take this job instead of not having it, and indeed frequently this kind of job is actually much better than their alternatives. I feel like this article fails to engage with this very strong prior, and also doesn't provide enough evidence to overcome it.

2Alex Turner3mo
Insofar as you're arguing with me for posting this, I... never claimed that that wasn't true? 

They are not being treated worse than foot soldiers, because they do not have an enemy army attempting to murder them during the job. (Unless 'foot soldiers' is itself more commonly used as a metaphor for 'grunt work' and I'm not aware of that.)

From the ELK report

We can then train a model to predict these human evaluations, and search for actions that lead to predicted futures that look good. 

For simplicity and concreteness you can imagine a brute force search. A more interesting system might train a value function and/or policy, do Monte-Carlo Tree Search with learned heuristics, and so on. These techniques introduce new learned models, and in practice we would care about ELK for each of them. But we don’t believe that this complication changes the basic picture and so we leave it ou

... (read more)

Idea for getting weak-in-expectation evidence about deception:

  1. Pretrain a model.
  2. Finetune two copies using reward functions you are confident will produce different internal values, where one set of values is substantially less aligned.
  3. See whether the two models, which are (at least at first) unaware of this procedure, display different behavior.
  4. If they both behave in an aligned-seeming fashion, this seems like strong evidence of deception. 
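
One way to read "weak-in-expectation" here: the both-look-aligned outcome can be strong evidence while also being an unlikely outcome, so the average update is small. Below is a minimal sketch of that arithmetic. The prior and both likelihoods are made-up illustrative assumptions, not estimates from anywhere, and the names are hypothetical.

```python
def posterior(prior, p_obs_given_h, p_obs_given_not_h):
    """Bayes' rule for a binary hypothesis H given one observation."""
    joint_h = prior * p_obs_given_h
    joint_not_h = (1 - prior) * p_obs_given_not_h
    return joint_h / (joint_h + joint_not_h)


# Illustrative assumptions only (H = "the less-aligned copy hides its values").
PRIOR_DECEPTION = 0.10
P_SAME_GIVEN_DECEPTION = 0.90  # a deceptive copy mimics the aligned one
P_SAME_GIVEN_HONEST = 0.05     # honest copies with different values mostly act differently

# Probability of observing "both copies behave in an aligned-seeming way".
p_same = (PRIOR_DECEPTION * P_SAME_GIVEN_DECEPTION
          + (1 - PRIOR_DECEPTION) * P_SAME_GIVEN_HONEST)

# Posterior on deception in each branch of the experiment.
post_if_same = posterior(PRIOR_DECEPTION, P_SAME_GIVEN_DECEPTION, P_SAME_GIVEN_HONEST)
post_if_different = posterior(
    PRIOR_DECEPTION, 1 - P_SAME_GIVEN_DECEPTION, 1 - P_SAME_GIVEN_HONEST
)

# Conservation of expected evidence: the branch-weighted posterior equals the prior.
expected_posterior = p_same * post_if_same + (1 - p_same) * post_if_different
```

Under these assumed numbers, the same-behavior branch moves the credence a lot, the different-behavior branch moves it down a little, and the two average back to the prior.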

Are there convergently-ordered developmental milestones for AI? I suspect there may be convergent orderings in which AI capabilities emerge. For example, it seems that LMs develop syntax before semantics, but maybe there's an even more detailed ordering relative to a fixed dataset. And in embodied tasks with spatial navigation and recurrent memory, there may be an order in which enduring spatial awareness emerges (i.e. "object permanence").

In A shot at the diamond-alignment problem, I wrote:

[Consider] Let's Agree to Agree: Neural Networks Share Classification Order on Real Datasets

We report a series of robust empirical observations, demonstrating that deep Neural Networks learn the examples in both the training and test sets in a similar order. This phenomenon is observed in all the commonly used benchmarks we evaluated, including many image classification benchmarks, and one text classification benchmark. While this phenomenon is strongest for models of the same architecture, it also crosses architectural boundaries – models of different architectures start by learning the same examples, after which the more powerful model may continue to learn additional examples. We

... (read more)

Quick summary of a major takeaway from Reward is not the optimization target

Stop thinking about whether the reward is "representing what we want", or focusing overmuch on whether agents will "optimize the reward function." Instead, just consider how the reward and loss signals affect the AI via the gradient updates. How do the updates affect the AI's internal computations and decision-making?
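
To make the "look at the gradient updates" framing concrete, here is a minimal sketch of a baseline-free REINFORCE update on a bandit with a softmax policy. It is a toy under stated assumptions (pure Python, hypothetical function names, arbitrary learning rate), not a claim about how any particular system is trained. The thing to notice is where the reward enters: only as a scalar multiplier on the update to the logits that produced the sampled action. The policy never receives reward as an input.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_step(logits, action, reward, lr=0.5):
    """One REINFORCE update: logits += lr * reward * grad log pi(action).

    For a softmax policy, d log pi(action) / d logit_i = 1[i == action] - pi_i.
    """
    probs = softmax(logits)
    return [
        logit + lr * reward * ((1.0 if i == action else 0.0) - p)
        for i, (logit, p) in enumerate(zip(logits, probs))
    ]


def train(rewards, steps=200, seed=0):
    """Sample actions from the policy and apply the update after each one."""
    rng = random.Random(seed)
    logits = [0.0] * len(rewards)
    for _ in range(steps):
        probs = softmax(logits)
        action = rng.choices(range(len(rewards)), weights=probs)[0]
        logits = reinforce_step(logits, action, rewards[action])
    return softmax(logits)


# Arm 0 pays 1.0, arm 1 pays nothing (illustrative values).
final_probs = train([1.0, 0.0])
```

A zero-reward step leaves the logits untouched, and a positive-reward step makes the just-taken action more likely. Nothing in the update requires the policy to represent or pursue "reward" as such.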

1 Gunnar Zarncke 1y
Are there different classes of learning systems that optimize for the reward in different ways?
3 Alex Turner 1y
Yes: model-based approaches, model-free approaches (with or without a critic), AIXI. All of these should be analyzed on their mechanistic details.

I plan to mentor several people to work on shard theory and agent foundations this winter through SERI MATS. Apply here if you're interested in working with me and Quintin.

Research-guiding heuristic: "What would future-TurnTrout predictably want me to get done now?"

80% credence: It's very hard to train an inner agent which reflectively equilibrates to an EU maximizer only over commonly-postulated motivating quantities (like # of diamonds or # of happy people or reward-signal) and not quantities like (# of times I have to look at a cube in a blue room or -1 * subjective micromorts accrued).


  • I expect contextually activated heuristics to be the default, and that agents will learn lots of such contextual values which don't cash out to being strictly about diamonds or people, even if the overall agent is mostly
... (read more)
3 Alex Turner 1y
I think that shards will cast contextual shadows into the factors of a person’s equilibrated utility function, because I think the shards are contextually activated to begin with. For example, if a person hates doing jumping jacks in front of a group of her peers, then that part of herself can bargain to penalize jumping jacks just in those contexts in the final utility function. Compared to a blanket "no jumping jacks ever" rule, this trade is less costly to other shards and allows more efficient trades to occur. 

If you want to argue an alignment proposal "breaks after enough optimization pressure", you should give a concrete example in which the breaking happens (or at least internally check to make sure you can give one). I perceive people as saying "breaks under optimization pressure" in scenarios where it doesn't even make sense. 

For example, if I get smarter, would I stop loving my family because I applied too much optimization pressure to my own values? I think not.

How might we align AGI without relying on interpretability?

I'm currently pessimistic about the prospect. But it seems worth thinking about, because wouldn't it be such an amazing work-around? 

My first idea straddles the border between contrived and intriguing. Consider some AGI-capable ML architecture, and imagine its  parameter space being 3-colored as follows:

  • Gray if the parameter vector+training process+other initial conditions leads to a nothingburger (a non-functional model)
  • Red if the parameter vector+... leads to a misaligned or dece
... (read more)

Idea: Speed up ACDC by batching edge-checks. The intuition is that most circuits will have short paths through the transformer, because Residual Networks Behave Like Ensembles of Relatively Shallow Networks. Most edges won't be in most circuits. Therefore, if you're checking KL of random-resampling edges e1 and e2, there's only a small chance that e1 and e2 interact with each other in a way important for the circuit you're trying to find. So statistically, you can check e.g. e1, ..., e100 in a batch, and maybe ablate 95 ... (read more)
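
A rough sketch of what batched edge-checking could look like, written as group testing with bisection. Everything here is an assumption for illustration: `kl_with_ablated` is a hypothetical callback returning the KL divergence from the full model's outputs when a given set of edges is resampled, the threshold test on the KL increase is a simplified stand-in for ACDC's criterion, and splitting a failing batch in half is just one possible fallback.

```python
def prune_edges_batched(edges, kl_with_ablated, threshold, batch_size=100):
    """Return the edges judged unimportant, checking them in batches.

    kl_with_ablated(edge_list) -> float is a hypothetical callback: the KL
    divergence from the unablated model when every edge in edge_list is
    resample-ablated.
    """
    removed = []
    current_kl = kl_with_ablated(removed)

    def check(group):
        nonlocal current_kl
        candidate_kl = kl_with_ablated(removed + group)
        if candidate_kl - current_kl < threshold:
            # The whole group barely matters: ablate it in one go.
            removed.extend(group)
            current_kl = candidate_kl
        elif len(group) > 1:
            # Some edge in the group matters: bisect to find it.
            mid = len(group) // 2
            check(group[:mid])
            check(group[mid:])
        # A single edge that fails the check is kept in the circuit.

    for start in range(0, len(edges), batch_size):
        check(edges[start:start + batch_size])
    return removed


# Toy stand-in for a real model: only edges 3 and 7 affect the output.
IMPORTANT = {3, 7}


def toy_kl(ablated):
    return float(len(set(ablated) & IMPORTANT))


pruned = prune_edges_batched(list(range(10)), toy_kl, threshold=0.5, batch_size=10)
```

When important edges are sparse, most batches pass on the first check, so the number of KL evaluations grows roughly with the number of important edges times the log of the batch size rather than with the total number of edges. Interactions between edges (the "small chance" mentioned above) are exactly where this simple scheme could go wrong.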

On applying generalization bounds to AI alignment. In January, Buck gave a talk for the Winter MLAB. He argued that we know how to train AIs which answer on-distribution questions at least as well as the labeller does. I was skeptical. IIRC, his argument had the following structure:


1. We are labelling according to some function f and loss function L.
2. We train the network on datapoints (x, f(x)) ~ D_train.
3. Learning theory results give (f, L)-bounds on D_train.
4. The network should match f's labels on the rest of D_train, on av

... (read more)

Handling compute overhangs after a pause. 

Sometimes people object that pausing AI progress for e.g. 10 years would lead to a "compute overhang": At the end of the 10 years, compute will be cheaper and more plentiful than it is today. Accordingly, once AI progress is unpaused, labs will cheaply train models which are far larger and smarter than before the pause. We will not have had time to adapt to models of intermediate size and intelligence. Some people believe this is a good reason not to pause AI progress.

There seem to be a range of relatively simple pol... (read more)

1 Vladimir Nesov 8mo
Cheaper compute is about as inevitable as more capable AI; neither is a law of nature. Both are valid targets for hopeless regulation.

Wikipedia has an unfortunate and incorrect-in-generality description of reinforcement learning (emphasis added)

Reinforcement learning (RL) is an area of machine learning concerned with how intelligent agents ought to take actions in an environment in order to maximize the notion of cumulative reward.

Later in the article, talking about basic optimal-control inspired approaches:

The purpose of reinforcement learning is for the agent to learn an optimal, or nearly-optimal, policy that maximizes the "reward function" or other user-provided reinforcement signal

... (read more)
6 Steve Byrnes 10mo
The description doesn't seem so bad to me. Your post "Reward is not the optimization target" is about what actual RL algorithms actually do. The wiki descriptions here are a kind of normative motivation as to how people came to be looking into those algorithms in the first place. Like, if there's an RL algorithm that performs worse than chance at getting a high reward, then that ain't an RL algorithm. Right? Nobody would call it that. I think lots of families of algorithms are likewise lumped together by a kind of normative "goal", even if any given algorithm in that family is doing something somewhat different and more complicated than “achieving that goal”, and even if, in any given application, the programmer might not want that goal to be perfectly achieved even if it could be. So by the same token, supervised learning algorithms are "supposed" to minimize a loss, compilers are "supposed" to create efficient and correct assembly code, word processors are "supposed" to process words, etc., but in all cases that's not a literal and complete description of what the algorithms in question actually do, right? It’s a pointer to a class of algorithms. Sorry if I'm misunderstanding.
4 Alex Turner 9mo
I agree that it is narrowly technically accurate as a description of researcher motivation. Note that they don't offer any other explanation elsewhere in the article. Also note that they make empirical claims:
3 Steve Byrnes 9mo
Sure. That excerpt is not great.
3 Alex Turner 9mo
(I do think that animals care about the reinforcement signals and their tight correlates, to some degree, such that it's reasonable to gloss it as "animals sometimes optimize rewards." I more strongly object to conflating what the animals may care about with the mechanistic purpose/description of the RL process.)

The existence of the human genome yields at least two classes of evidence which I'm strongly interested in.

  1. Humans provide many highly correlated datapoints on general intelligence (human minds), as developed within one kind of learning process (best guess: massively parallel circuitry, locally randomly initialized, self-supervised learning + RL). 
    1. We thereby gain valuable information about the dynamics of that learning process. For example, people care about lots of things (cars, flowers, animals, friends), and don't just have a single unitary mesa-obj
... (read more)

Why don't people reinforcement-learn to delude themselves? It would be very rewarding for me to believe that alignment is solved, everyone loves me, I've won at life as hard as possible. I think I do reinforcement learning over my own thought processes. So why don't I delude myself?

On my model of people, rewards provide ~"policy gradients" which update everything, but most importantly shards. I think eg the world model will have a ton more data from self-supervised learning, and so on net most of its bits won't come from reward gradients.

For example, if I ... (read more)