All of TurnTrout's Comments + Replies

On applying generalization bounds to AI alignment. In January, Buck gave a talk for the Winter MLAB. He argued that we know how to train AIs which answer on-distribution questions at least as well as the labeller does. I was skeptical. IIRC, his argument had the following structure:


1. We are labelling according to some function f and loss function L.

2. We train the network on datapoints (x, f(x)) ~ D_train.

3. Learning theory results give (f, L)-bounds on D_train. 


4. The network should match f's labels on the rest of D_train, on av

... (read more)

Thoughts on "Deep Learning is Robust to Massive Label Noise."

We show that deep neural networks are capable of generalizing from training data for which true labels are massively outnumbered by incorrect labels. We demonstrate remarkably high test performance after training on corrupted data from MNIST, CIFAR, and ImageNet. For example, on MNIST we obtain test accuracy above 90 percent even after each clean training example has been diluted with 100 randomly-labeled examples. Such behavior holds across multiple patterns of label noise, even when erroneous l

... (read more)

Thoughts on "The Curse of Recursion: Training on Generated Data Makes Models Forget." I think this asks an important question about data scaling requirements: what happens if we use model-generated data to train other models? This should inform timelines and capability projections.


Stable Diffusion revolutionised image creation from descriptive text. GPT-2, GPT-3(.5) and GPT-4 demonstrated astonishing performance across a variety of language tasks. ChatGPT introduced such language models to the general public. It is now clear that large language mo

... (read more)

a lot of interpretability work that performs act-add like ablations to confirm that their directions are real

Minor clarifying point: Act-adds cannot be cast as ablations. Do you mean to say that the interp work uses activation addition to confirm real directions? Or that they use activation ablation/resampling/scrubbing?

ITI is basically act adds but they compute act adds with many examples instead of just a pair

Yup, ITI was developed concurrently, and (IIRC, private correspondence) was inspired by their work on Othello-GPT. So this is another instance of i... (read more)

2 Lawrence Chan 10d
Sorry, ablation might be the wrong word here (but people use it anyways): the technique is to subtract/add/move along the discovered direction and see what happens to the outputs. It's possible there's a better or standard word that I can't think of right now. Also, another example of an attempt at interp -> alignment would arguably be the model editing stuff following causal tracing in the ROME paper?

Probably not, but mostly because you phrased it as inductive biases to be washed away in the limit, or using gimmicks like early stopping.

LLMs aren't trained to convergence because that's not compute-efficient, so early stopping seems like the relevant baseline. No?

everyone who reads those seems to be even more confused after reading them

I want to defend "Reward is not the optimization target" a bit, while also mourning its apparent lack of clarity. The above is a valid impression, but I don't think it's true. For some reason, some people really get a lot ... (read more)

I agree that with time, we might be able to understand. (I meant to communicate that via "might still be incomprehensible")

All the models must converge on the same optimal solution for a deterministic perfect-information game like Othello and become value-equivalent, ignoring the full board state which is irrelevant to reward-maximizing.

Strong claim! I'm skeptical (EDIT: if you mean "in the limit" to apply to practically relevant systems we build in the future. If so,) do you have a citation for DRL convergence results relative to this level of expressivity, and reasoning for why realistic early stopping in practice doesn't matter? (Also, of course, even one single optimal ... (read more)

Outside of simple problems like Othello, I expect most DRL agents will not converge fully to the peak of the 'spinning top', and so will retain traces of their informative priors like world-models. For example, if you plug GPT-5 into a robot, I doubt it would ever be trained to the point of discarding most of its non-value-relevant world-model - the model is too high-capacity for major forgetting, and past meta-learning incentivizes keeping capabilities around just in case. But that's not 'every system we build in the future', just a lot of them. Not hard to imagine realistic practical scenarios where that doesn't hold - I would expect that any specialized model distilled from it (for cheaper faster robotic control) would not learn or would discard much more of its non-value-relevant world-model compared to its parent, and that would have potential safety & interpretability implications. The System II distills and compiles down to a fast efficient System I. (For example, if you were trying to do safety by dissecting its internal understanding of the world, or if you were trying to hack a superior reward model, adding in safety criteria not present in the original environment/model, by exploiting an internal world model, you might fail because the optimized distilled model doesn't have those parts of the world model, even if the parent model did, as they were irrelevant.) Chess end-game databases are provably optimal & very superhuman, and yet, there is no 'world-model' or human-interpretable concepts of chess anywhere to be found in them; the 'world-model' used to compute them, whatever that was, was discarded as unnecessary after the optimal policy was reached. Probably not, but mostly because you phrased it as inductive biases to be washed away in the limit, or using gimmicks like early stopping. (It's not like stopping forgetting is hard. Of course you can stop forgetting by changing the problem to be solved, and simply making a representation of the world-sta

(The original post was supposed to also have @Monte M as a coauthor; fixed my oversight.)

AI cognition doesn't have to use alien concepts to be uninterpretable. We've never fully interpreted human cognition, either, and we know that our introspectively accessible reasoning uses human-understandable concepts.

Just because your thoughts are built using your own concepts, does not mean your concepts can describe how your thoughts are computed. 


The existence of a natural-language description of a thought (like "I want ice cream") doesn't mean that your brain computed that thought in a way which can be compactly described by familiar concepts... (read more)

I don't think the conclusion follows from the premises. People often learn new concepts after studying stuff, and it seems likely (to me) that when studying human cognition, we'd first be confused because our previous concepts weren't sufficient to understand it, and then slowly stop being confused as we built & understood concepts related to the subject. If an AI's thoughts are like human thoughts, given a lot of time to understand them, what you describe doesn't rule out that the AI's thoughts would be comprehensible.

The mere existence of concepts we don't know about in a subject doesn't mean that we can't learn those concepts. Most subjects have new concepts.


I agree that there's something nice about activation steering not optimizing the network relative to some other black-box feedback metric. (I, personally, feel less concerned by e.g. finetuning against some kind of feedback source; the bullet feels less jawbreaking to me, but maybe this isn't a crux.)

(Medium confidence) FWIW, RLHF'd models (specifically, the LLAMA-2-chat series) seem substantially easier to activation-steer than do their base counterparts. 

This paper seems pretty cool! 

I've for a while thought that alignment-related content should maybe be excluded from pretraining corpora, and held out as a separate optional dataset. This paper seems like more support for that, since describing general eval strategies and specific evals might allow models to 0-shot hack them. 

Other reasons for excluding alignment-related content:

  • "Anchoring" AI assistants on our preconceptions about alignment, reducing our ability to have the AI generate diverse new ideas and possibly conditioning it on our philosophical confusions and mistakes
  • Self-fulfilling prophecies around basilisks and other game-theoretic threats
3 Owain Evans 1mo
Good points. As we note in the paper, this may conflict with the idea of automating alignment research in order to solve alignment. Aaron_Scher makes a related point.  More generally, it's uncertain what the impact is of excluding a certain topic from pretraining. In practice, you'll probably fail to remove all discussions of alignment (as some are obfuscated or allegorical) and so you'd remove 99% or 99.9% rather than 100%. The experiments in our paper, along with the influence functions work by Grosse et al. could help us understand what the impact of this is likely to be.

(Also, all AI-doom content should maybe be expunged as well, since "AI alignment is so hard" might become a self-fulfilling prophecy via sophisticated out-of-context reasoning baked in by pretraining.)

I've been interested in using this for red-teaming for a while -- great to see some initial work here. I especially liked the dot-product analysis. 

This incidentally seems like strong evidence that you can get jailbreak steering vectors (and maybe the "answer questions" vector is already a jailbreak vector). Thankfully, activation additions can't be performed without the ability to modify activations during the forward pass, and so e.g. GPT-4 can't be jailbroken in this way. (This consideration informed my initial decision to share the cheese vector research.)

In practice, we focus on the embedding associated with the last token from a late layer.

I don't have time to provide citations right now, but a few results have made me skeptical of this choice -- probably you're better off using an intermediate layer, rather than a late one. Early and late layers seem to deal more with token-level concerns, while mid-layers seem to handle more conceptual / abstract features.

Focusing on language models, we note that models exhibit “consistent developmental stages,” at first behaving similarly to n-gram models and later exhibiting linguistic patterns.

I wrote a shortform comment which seems relevant:

Are there convergently-ordered developmental milestones for AI? I suspect there may be convergent orderings in which AI capabilities emerge. For example, it seems that LMs develop syntax before semantics, but maybe there's an even more detailed ordering relative to a fixed dataset. And in embodied tasks with spatial navigat

... (read more)

Offline RL can work well even with wrong reward labels. I think alignment discourse over-focuses on "reward specification." I think reward specification is important, but far from the full story. 

To this end, a new paper (Survival Instinct in Offline Reinforcement Learning) supports Reward is not the optimization target and associated points that reward is a chisel which shapes circuits inside of the network, and that one should fully consider the range of sources of parameter updates (not just those provided by a reward signal). 

Some relevant qu... (read more)

Delicious food does seem like a good (but IMO weak) point in favor of reward-optimization, and pushes up my P(AI cares a lot about reward terminally) a tiny bit. But also note that lots of people (including myself) don't care very much about delicious food, and it seems like the vast majority of people don't make their lives primarily about delicious food or other tight correlates of their sensory pleasure circuits. 

If pressing a button is easy and doesn't conflict with taking out the trash and doing other things it wants to do, it might try it.

This i... (read more)

Off-the-cuff: Possibly you can use activation additions to compute a steering vector which e.g. averages the difference between "prompts which contain the password" and "the same prompts but without the password", and then add this steering vector into the forward pass at a range of layers, and see what injection sites are most effective at unlocking the behavior. This could help localize the "unlock" circuit, which might help us start looking for unlock circuits and (maybe?) information related to deception.
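A minimal sketch of the averaging step described above, assuming the activations have already been cached as plain lists of floats (one vector per prompt, all taken from the same layer and token position). The function names and the toy numbers are hypothetical and do not come from any library or from the original experiments.

```python
def mean_difference_vector(with_pw, without_pw):
    """Average of (activation with password - activation without), per dimension."""
    if len(with_pw) != len(without_pw) or not with_pw:
        raise ValueError("need equally many paired activations")
    dim = len(with_pw[0])
    total = [0.0] * dim
    for a, b in zip(with_pw, without_pw):
        for i in range(dim):
            total[i] += a[i] - b[i]
    return [t / len(with_pw) for t in total]


def add_steering(activation, steering, coeff=1.0):
    """Add coeff * steering to one residual-stream activation vector."""
    return [x + coeff * s for x, s in zip(activation, steering)]


# Toy cached activations (hypothetical values): two prompt pairs, 3 dimensions.
with_pw = [[1.0, 2.0, 0.0], [3.0, 2.0, 1.0]]
without_pw = [[0.0, 2.0, 0.0], [1.0, 2.0, 0.0]]

steering = mean_difference_vector(with_pw, without_pw)
steered = add_steering([0.0, 0.0, 0.0], steering, coeff=2.0)
```

In a real experiment one would presumably apply something like `add_steering` inside a forward hook at each candidate layer in turn, then compare how often the locked behavior appears per injection site.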

My impression is derived from looking at some apparently random qualitative examples. But maybe @NinaR can run the coeff=0 setting and report the assessed sycophancy, to settle this more quantitatively?

Effect of sycophancy steering on llama-2-7b-chat with multipliers + and - 50 on an AI-generated dataset of questions designed to test sycophancy,  assessed independently for each answer using Claude 2 API

What is "shard theory"? I've written a lot about shard theory. I largely stand by these models and think they're good and useful. Unfortunately, lots of people seem to be confused about what shard theory is. Is it a "theory"? Is it a "frame"? Is it "a huge bag of alignment takes which almost no one wholly believes except, perhaps, Quintin Pope and Alex Turner"?

I think this understandable confusion happened because my writing didn't distinguish between: 

  1. Shard theory itself, 
    1. IE the mechanistic assumptions about internal motivational structure, whic
... (read more)

Strong encouragement to write about (1)!

I'd guess that Meta didn't bother to train against obvious sycophancy and if you trained against it, then it would go away.

Hm. My understanding is that RLHF/instruct fine-tuning tends to increase sycophancy. Can you share more about this guess?

Here's the sycophancy graph from Discovering Language Model Behaviors with Model-Written Evaluations:

For some reason, the LW memesphere seems to have interpreted this graph as indicating that RLHF increases sycophancy, even though that's not at all clear from the graph. E.g., for the largest model size, the base model and the preference model are the least sycophantic, while the RLHF'd models show no particular trend among themselves. And if anything, the 22B models show decreasing sycophancy with RLHF steps.

What this graph actually shows is increasing syc... (read more)

3 Ryan Greenblatt 2mo
I'd guess that if you:

  • Instructed human labelers to avoid sycophancy
  • Gave human labelers examples of a few good and bad responses with respect to sycophancy
  • Trained models on examples where sycophancy is plausibly/likely (e.g., pretrained models exhibit sycophancy a reasonable fraction of the time when generating)

Then sycophancy from RLHF as measured by this sort of dataset would mostly go away. The key case where RLHF fails to incentivize good behavior is when (AI assisted) human labelers can't correctly identify negative outputs. And, surely typical humans can recognize the sort of sycophancy in this dataset? (Note that this argument doesn't imply that humans would be able to catch and train out subtle sycophancy cases, but this dataset doesn't really have such cases.)

Reasonably important parts of my view (which might not individually be cruxes):

  • Pretrained (no RLHF!) models prompted to act like assistants exhibit sycophancy
  • It's reasonably likely to me that RLHF/instruction finetuning increasing sycophancy is due to some indirect effect rather than "because it's typically directly incentivized by human labels". Thus, this maybe doesn't show a general problem with RLHF, but rather a specific quirk. I believe preference models exhibit liking sycophancy. My guess would be that either the preference model learns something like "is this a normal assistant response" and this generalizes to sycophancy because normal assistants on the internet are sycophantic or it's roughly noise (it depends on some complicated and specific inductive biases story which doesn't generalize).
  • Normal humans can recognize sycophancy in this dataset pretty easily
  • Unless you actually do different activation steering at multiple different layers and try to use human understanding of what's going on, then my view is that activation steering is just some different way to influence models to behave similar to the positive s

Maybe my original comment was unclear. I was making a claim of "evidently this has improved on whatever they did" and not "there's no way for them to have done comparably well if they tried."

I do expect this kind of technique to stack benefits on top of finetuning, making the techniques complementary. That is, if you consider the marginal improvement on some alignment metric on validation data, I expect the "effort required to increase metric via finetuning" and "effort via activation addition" to be correlated but not equal. Thus, I suspect that even after finetuning a model, there will be low- or medium-hanging activation additions which further boost alignment.

I think this result is very exciting and promising. You appear to have found substantial reductions in sycophancy, beyond whatever was achieved with Meta's finetuning, using a simple activation engineering methodology. Qualitatively, the model's coherence and capabilities seem to be retained, though I'd like to see e.g. perplexity on OpenWebText and MMLU benchmark performance to be sure.

Can Anthropic just compute a sycophancy vector for Claude using your methodology, and then just subtract the vector and thereby improve alignment with user interests? I'd love to know the answer.

Where is this shown? Most of the results don't evaluate performance without steering. And the TruthfulQA results only show a clear improvement from steering for the base model without RLHF. 
2 Ryan Greenblatt 2mo
How do you know this is beyond what finetuning was capable of? I'd guess that Meta didn't bother to train against obvious sycophancy and if you trained against it, then it would go away. This work can still be interesting for other reasons, e.g. building into better interpretability work than can easily be done with finetuning. Edit: I didn't realize this work averaged across an entire dataset of comparisons to get the vector. I now think more strongly that the sample efficiency and generalization here is likely to be comparable to normal training or some simple variant. Beyond this, I think it seems possible though unlikely that the effects here are similar to just taking a single large gradient step of supervised learning on the positive side and of supervised unlikelihood training on the negative side. (Removed because this assumed a single example was used for the vector)

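Consider what update equations have to say about "training game" scenarios. In PPO, the optimization objective is proportional to the advantage given a policy $\pi$, reward function $R$, and on-policy value function $v^\pi$:

$$A^\pi(s, a) := \mathbb{E}_{s' \sim T(s, a)}\left[R(s, a, s') + \gamma v^\pi(s')\right] - v^\pi(s)$$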

Consider a mesa-optimizer acting to optimize some mesa objective. The mesa-optimizer understands that it will be updated proportional to the advantage. If the mesa-optimizer maximizes reward, this corresponds to maximizing the intensity of the gradients it receives, thus maximally updatin... (read more)
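To make the role of the advantage concrete, here is a small sketch using the standard one-step definition (expected reward plus discounted next-state value, minus the current state's value). It assumes a tabular MDP; the `transitions` layout, the state names, and the numbers are hypothetical and chosen only for illustration.

```python
def advantage(state, action, transitions, reward, value, gamma=0.99):
    """One-step advantage: E_{s'}[R(s, a, s') + gamma * v(s')] - v(s).

    transitions[(state, action)] is a list of (probability, next_state) pairs.
    """
    expected = sum(
        p * (reward(state, action, nxt) + gamma * value[nxt])
        for p, nxt in transitions[(state, action)]
    )
    return expected - value[state]


# Hypothetical two-state example: from "start", action "go" always reaches "goal".
transitions = {("start", "go"): [(1.0, "goal")]}


def reward(s, a, s_next):
    return 1.0 if s_next == "goal" else 0.0


# Value function that underestimates the reward: positive advantage.
adv = advantage("start", "go", transitions, reward,
                {"start": 0.5, "goal": 0.0}, gamma=0.9)

# Value function that already predicts the reward: zero advantage.
adv_predicted = advantage("start", "go", transitions, reward,
                          {"start": 1.0, "goal": 0.0}, gamma=0.9)
```

The second call shows that when the value function already accounts for the reward, the advantage (and hence the size of the policy update it scales) vanishes, even though the reward itself is unchanged.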

An Arxiv version is forthcoming. We're working with Gavin Leech to publish these results as a conference paper. 


I still don't follow. Apparently, TL's center_writing_weights is adapting the writing weights in a pre-LN-invariant fashion (and also in a way which doesn't affect the softmax probabilities after unembed). This means the actual computations of the forward pass are left unaffected by this weight modification, up to precision limitations, right? So that means that our results in particular should not be affected by TL vs HF. 
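The invariance claimed here can be checked numerically on a toy example. The sketch below is not TransformerLens code; it only assumes that LayerNorm centers its input before scaling (learned scale and bias omitted), and all vectors are hypothetical. It compares the LayerNorm output when a vector is written to the residual stream as-is against the output when that vector has had its mean removed first.

```python
import math


def layer_norm(x, eps=1e-5):
    """LayerNorm without learned scale/bias: center, then divide by the std."""
    mean = sum(x) / len(x)
    centered = [v - mean for v in x]
    var = sum(c * c for c in centered) / len(x)
    return [c / math.sqrt(var + eps) for c in centered]


def center(v):
    """Subtract the mean of v from every component."""
    mean = sum(v) / len(v)
    return [x - mean for x in v]


resid = [0.5, -1.0, 2.0, 0.25]   # hypothetical residual-stream vector
write = [1.0, 3.0, -2.0, 4.0]    # hypothetical output written by some layer

plain = layer_norm([r + w for r, w in zip(resid, write)])
centered = layer_norm([r + w for r, w in zip(resid, center(write))])

# Largest elementwise difference between the two LayerNorm outputs.
max_gap = max(abs(a - b) for a, b in zip(plain, centered))
```

Removing the mean of the written vector shifts every residual component by the same constant, which LayerNorm's own centering step cancels, so `max_gap` should be zero up to floating-point error.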

1 Arthur Conmy 2mo
Oops, I was wrong in my initial hunch as I assumed centering writing did something extra. I’ve edited my top level comment, thanks for pointing out my oversight!

We used TL to cache activations for all experiments, but are considering moving away to improve memory efficiency. 

TL removes the mean from all additions to the residual stream, which I would have guessed would solve the problem here.

Oh, somehow I'm not familiar with this. Is this center_unembed? Or are you talking about something else?

Do you have evidence for this?

Yes, but I think the evidence didn't actually come from the "Love" - "Hate" prompt pair. Early in testing we found paired activation additions worked better. I don't have a citeabl... (read more)

2 Arthur Conmy 2mo
No this isn’t about center_unembed, it’s about center_writing_weights as explained here: This is turned on by default in TL, so okay I think that there must be something else weird about models rather than just a naive bias that causes you to need to do the difference thing

Yeah, seems tough to avoid "reward" in that situation. Thanks for pointing this out.

Thanks for the comment! Quick reacts: I'm concerned about the first bullet, not about 2, and bullet 3 seems to ignore top-k probability prediction requirements (the requirement isn't to just ID the most probable next token). Maybe there's a recovery of bullet 3 somehow, though?

I'm currently excited about a "macro-interpretability" paradigm. To quote Joseph Bloom:

TLDR: Documenting existing circuits is good but explaining what relationship circuits have to each other within the model, such as by understanding how the model allocated limited resources such as residual stream and weights between different learnable circuit seems important. 

The general topic I think we are getting at is something like "circuit economics". The thing I'm trying to gesture at is that while circuits might deliver value in distinct ways (such as redu

... (read more)

I'm also excited by tactics like "fully reverse engineer the important bits of a toy model, and then consider what tactics and approaches would -- in hindsight -- have quickly led you to understand the important bits of the model's decision-making."

Handling compute overhangs after a pause. 

Sometimes people object that pausing AI progress for e.g. 10 years would lead to a "compute overhang": At the end of the 10 years, compute will be cheaper and larger than at present-day. Accordingly, once AI progress is unpaused, labs will cheaply train models which are far larger and smarter than before the pause. We will not have had time to adapt to models of intermediate size and intelligence. Some people believe this is good reason to not pause AI progress.

There seem to be a range of relatively simple pol... (read more)

Cheaper compute is about as inevitable as more capable AI; neither is a law of nature. Both are valid targets for hopeless regulation.

For example, if you wanted to generally predict model behavior right now, you'd probably just want to get really good at understanding webtext, practice the next token prediction game, etc.

Another candidate eval is to demand predictability given activation edits (eg zero-ablating certain heads, patching in activations from other prompts, performing activation additions, and so on). Webtext statistics won't be sufficient there.

It's fine that you would guess that, but without a strong reason to believe it's true—which I definitely don't think we have—you can't use something like this as a sufficient condition to label a model as safe.

After thinking more about it earlier this week, I agree. 

I was initially more bullish on "this seems sufficient and also would give a lot of time to understand models" (in which case you can gate model deployment with this alone) but I came to think "prediction requirements track something important but aren't sufficient" (in which case this is ... (read more)

The advantage definition itself is correct and non-oscillating... Oscillating or nonconvergent value estimation is not the cause of policy mode collapse.

The advantage is (IIUC) defined with respect to a given policy, and so the advantage can oscillate and then cause mode collapse. I agree that a constant learning rate schedule is problematic, but note that ACTDE converges even with a constant learning rate schedule. So, I would indeed say that oscillating value estimation caused mode collapse in the toy example I gave?

Though note that ideally, once we actually know with confidence what is best, we should be near-greedy about it, rather than softmaxing!

I disagree. I don't view reward/reinforcement as indicating what is "best" (from our perspective), but as chiseling decision-making circuitry into the AI (which may then decide what is "best" from its perspective). One way of putting a related point: I think that we don't need to infinitely reinforce a line of reasoning in order to train an AI which reasons correctly. 

(I want to check -- does this response make sense to you? Happy to try explaining my intuition in another way.)

I mostly disagree with the quote as I understand it.

Unfortunately, it's very unclear why ability to predict generalization behavior on other tasks would transfer to being able to predict generalization behavior in the cases that we care about—and we can't test the case that we care about directly due to RSA-2048-style problems.

I don't buy the RSA-2048 example as plausible generalization that gets baked into weights (though I know that example isn't meant to be realistic). I agree there exist in weight-space some bad models which this won't catch, though it... (read more)

I agree there exist in weight-space some bad models which this won't catch, though it's not obvious to me that they're realistic cases.

It's fine that you would guess that, but without a strong reason to believe it's true—which I definitely don't think we have—you can't use something like this as a sufficient condition to label a model as safe.

I think that predicting generalization to sufficiently high token-level precision, across a range of prompts, will require (implicitly) modelling the relevant circuits in the network. I expect that to trace out a

... (read more)

Is this identical to training the next-to-last layer to predict the rewards directly, and then just transforming those predictions to get a sample?

In the tabular case, that's equivalent given a uniform initial policy. Maybe it's also true in the function approximator PG regime, but that's a maybe -- depends on inductive biases. But often we want a pretrained initial policy (like when doing RLHF on LLMs), which isn't uniform.

Without being familiar with the literature, why should I buy that we can informally reason about what is "low-frequency" versus "high-frequency" behavior? I think reasoning about "simplicity" has historically gone astray, and worry that this kind of reasoning will as well.

That's why the title says "power-seeking can be predictive" not "training-compatible goals can be predictive". 

You're right. I was critiquing "power-seeking due to your assumptions isn't probable, because I think your assumptions won't hold" and not "power-seeking isn't predictive." I had misremembered the predictive/probable split, as introduced in Definitions of “objective” should be Probable and Predictive:

I don’t see a notion of “objective” that can be confidently claimed is:

  1. Probable: there is a good argument that the systems we build will ha
... (read more)

The issue with being informal is that it's hard to tell whether you are right. You use words like "motivations" without defining what you mean, and this makes your statements vague enough that it's not clear whether or how they are in tension with other claims.

It seems worth pointing out: the informality is in the hypothesis, which comprises a set of somewhat illegible intuitions and theories I use to reason about generalization. However, the prediction itself is what needs to be graded in order to see whether I was right. I made a prediction fairly like "... (read more)

RL creates agents, and RL seemed to be the way to AGI. In the 2010s, reinforcement learning was the dominant paradigm for those interested in AGI (e.g. OpenAI). RL lends itself naturally to creating agents that pursue rewards/utility/objectives. So there was reason to expect that agentic AI would be the first (and by the theoretical arguments, last) form that superintelligence would take.

Why are you confident that RL creates agents? Is it the non-stochasticity of optimal policies for almost all reward functions? The on-policy data collection of PPO? I think there... (read more)

I don't think we should call this "algebraic value editing" because it seems overly pretentious to say we're editing the model's values. We don't even know what values are!

I phased out "algebraic value editing" for exactly that reason. Note that only the repository and prediction markets retain this name, and I'll probably rename the repo activation_additions.

What part of the post you link rules this out? As far as I can tell, the thing you're saying is that a few factors influence the decisions of the maze-solving agent, which isn't incompatible with the agent acting optimally with respect to some reward function such that it produces training-reward-optimal behaviour on the training set.

In addition to my other comment, I'll further quote Behavioural statistics for a maze-solving agent:

We think the complex influence of spatial distances on the network’s decision-making might favor a ‘shard-like’ description: a

... (read more)

I think you're the one who's imposing a type error here. For "value functions" to be useful in modelling a policy, it doesn't have to be the case that the policy is acting optimally with respect to a suggestively-labeled critic - it just has to be the case that the agent is acting consistently with some value function.

Can you say more? Maybe give an example of what this looks like in the maze-solving regime?

What part of the post you link rules this out? As far as I can tell, the thing you're saying is that a few factors influence the decisions of the maze-

... (read more)
1 Victoria Krakovna 4mo
The issue with being informal is that it's hard to tell whether you are right. You use words like "motivations" without defining what you mean, and this makes your statements vague enough that it's not clear whether or how they are in tension with other claims. (E.g. what I have read so far doesn't seem to rule out that shards can be modeled as contextually activated subagents with utility functions.) An upside of formalism is that you can tell when it's wrong, and thus it can help make our thinking more precise even if it makes assumptions that may not apply. I think defining your terms and making your arguments more formal should be a high priority. I'm not advocating spending hundreds of hours proving theorems, but moving in the direction of formalizing definitions and claims would be quite valuable. It seems like a bad sign that the most clear and precise summary of shard theory claims was written by someone outside your team. I highly agree with this takeaway from that post: "Making a formalism for shard theory (even one that’s relatively toy) would probably help substantially with both communicating key ideas and also making research progress." This work has a lot of research debt, and paying it off would really help clarify the disagreements around these topics.
3 Victoria Krakovna 4mo
Thanks Daniel for the detailed response (which I agree with), and thanks Alex for the helpful clarification. I agree that the training-compatible set is not predictive for how the neural network generalizes (at least under the "strong distributional shift" assumption in this post where the test set is disjoint from the training set, which I think could be weakened in future work). The point of this post is that even though you can't generally predict behavior in new situations based on the training-compatible set alone, you can still predict power-seeking tendencies. That's why the title says "power-seeking can be predictive" not "training-compatible goals can be predictive".  The hypothesis you mentioned seems compatible with the assumptions of this post. When you say "the policy develops motivations related to obvious correlates of its historical reinforcement signals", these "motivations" seem like a kind of training-compatible goals (if defined more broadly than in this post). I would expect that a system that pursues these motivations in new situations would exhibit some power-seeking tendencies because those correlate with a lot of reinforcement signals.  I suspect a lot of the disagreement here comes from different interpretations of the "internal representations of goals" assumption, I will try to rephrase that part better. 

To be fair, the post sort of makes this mistake by talking about "internal representations", but I think everything goes thru if you strike out that talk.

I'm responding to this post, so why should I strike that out? 

The utility function formalism doesn't require agents to "internally represent a scalar function over observations". You'll notice that this isn't one of the conclusions of the VNM theorem.

The post is talking about internal representations.

The core claim of this post is that if you train a network in some environment, the agent will not generalize optimally with respect to the reward function you trained it on, but will instead be optimal with respect to some other reward function in a way that is compatible with training-reward-optimality, and that this means that it is likely to avoid shutdown in new environments. The idea that this happens because reward functions are "internally represented" isn't necessary for those results. You're right that the post uses the phrase "internal representation" once at the start, and some very weak form of "representation" is presumably necessary for the policy to be optimal for a reward function (at least in the sense that you can derive a bunch of facts about a reward function from the optimal policy for that reward function), but that doesn't mean that they're central to the post.
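For reference, here is the standard statement of the VNM conclusion mentioned above, as a sketch in conventional notation (the symbols $\succeq$, $u$, $L$, $M$ are not taken from the post). If a preference relation $\succeq$ over lotteries satisfies completeness, transitivity, continuity, and independence, then:

```latex
\exists\, u : \mathcal{X} \to \mathbb{R} \quad \text{such that} \quad
L \succeq M \iff \mathbb{E}_{x \sim L}[u(x)] \ge \mathbb{E}_{x \sim M}[u(x)],
\qquad \text{with } u \text{ unique up to } u \mapsto a u + b,\ a > 0 .
```

The conclusion only says that the agent's choices can be described by some $u$. It says nothing about how, or whether, $u$ is stored inside the agent.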
1 Victoria Krakovna 4mo
The internal representations assumption was meant to be pretty broad; I didn't mean that the network is explicitly representing a scalar reward function over observations or anything like that. E.g. these can be implicit representations of state features. I think this would also include the kind of representations you are assuming in the maze-solving post, e.g. cheese shards / circuits.

Physiological events associated with pregnancy (mostly hormones) rewire the mother's brain such that when she gives birth, she immediately takes care of the young, grooms them etc., something she has never done before.

Salt-starved rats develop an appetite for salt and are drawn to stimuli predictive of extremely salty water

I've been wondering about the latter for a while. These two results are less strongly predicted by shard theoretic reasoning than by "hardcoded" hypotheses. Pure-RL+SL shard theory loses points on these two observations, and points to other mechanisms IMO (or I'm missing some implications of pure-RL+SL shard theory).

"There are theoretical results showing that many decision-making algorithms have power-seeking tendencies."

I think this is reasonable, although I might say "suggesting" instead of "showing." I think I might also be more cautious about further inferences which people might make from this -- like I think a bunch of the algorithms I proved things about are importantly unrealistic. But the sentence itself seems fine, at first pass.

This is awesome. As you have just shown, there are a ton of low-hanging activation additions just waiting to be found. Team shard has barely explored this large space of interventions. I encourage people to play around with activation additions more, via e.g. our demo colabs for GPT-2-XL (Colab Pro required) and GPT-2-small (Colab Pro not required). Though more sophisticated interventions (like the one you demonstrate) will require coding, and not just playing with our demo widget.
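For readers who have not seen one, here is a minimal sketch of the general shape of an activation addition. It assumes a toy residual network with random weights rather than GPT-2, and every name in it (`forward`, `steering_vector`, `WEIGHTS`, the layer index and coefficient) is hypothetical, not the demo colabs' API. The idea: record the residual stream entering one layer for two contrasting inputs, scale the difference, and add it back at that layer on a new input.

```python
import math
import random

random.seed(0)
D_MODEL, N_LAYERS = 6, 4

# Hypothetical toy "model": a stack of residual blocks with fixed random weights.
WEIGHTS = [
    [[random.gauss(0, 0.3) for _ in range(D_MODEL)] for _ in range(D_MODEL)]
    for _ in range(N_LAYERS)
]


def block(h, w):
    """One residual update: h + tanh(h @ w)."""
    return [
        h[j] + math.tanh(sum(h[i] * w[i][j] for i in range(D_MODEL)))
        for j in range(D_MODEL)
    ]


def forward(x, steering=None, layer=None, stop_before=None):
    """Run the toy stack.

    If `stop_before` is set, return the residual stream entering that layer.
    If `steering` is set, add it to the residual stream entering `layer`.
    """
    h = list(x)
    for i, w in enumerate(WEIGHTS):
        if stop_before == i:
            return h
        if steering is not None and layer == i:
            h = [a + b for a, b in zip(h, steering)]
        h = block(h, w)
    return h


def steering_vector(x_plus, x_minus, layer, coeff):
    """Scaled difference of the activations entering `layer` for two inputs."""
    a_plus = forward(x_plus, stop_before=layer)
    a_minus = forward(x_minus, stop_before=layer)
    return [coeff * (p - m) for p, m in zip(a_plus, a_minus)]


x_plus, x_minus, x_test = (
    [random.gauss(0, 1) for _ in range(D_MODEL)] for _ in range(3)
)
vec = steering_vector(x_plus, x_minus, layer=2, coeff=5.0)
baseline = forward(x_test)
steered = forward(x_test, steering=vec, layer=2)
shift = math.sqrt(sum((s - b) ** 2 for s, b in zip(steered, baseline)))
```

In a real transformer the inputs would be prompts and the addition would happen at particular token positions, which this toy leaves out. It only illustrates that the intervention is an addition to intermediate activations, with no weight changes.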

You looked at GPT-2-small. I injected your activation additions into GPT-2-X...
