All of Fabien Roger's Comments + Replies

I agree that not using labels is interesting from a data generation perspective, but I expect this to be useful mostly if you have clean pairs of concepts for which it is hard to get labels - and I think this will not be the case for takeover attempts datasets.

About the performance of LAT: for monitoring, we mostly care about correlation - so LAT is worse IID, and it's unclear if LAT is better OOD. If causality leads to better generalization properties, then LAT is dominated by mean difference probing (see the screenshot of Zou's paper below), which is jus... (read more)
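(For concreteness, a minimal sketch of what I mean by mean difference probing; the data here is a random stand-in for real activations, and the thresholding choice is hypothetical:)

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 512

# Stand-ins for real residual-stream activations, shape (n_examples, d_model).
# In practice these would come from running the model on "takeover attempt"
# style texts (pos) and benign texts (neg).
pos_acts = rng.normal(loc=0.3, size=(500, d_model))
neg_acts = rng.normal(loc=0.0, size=(500, d_model))

# Mean-difference probe: the direction is just the difference of the class means.
direction = pos_acts.mean(axis=0) - neg_acts.mean(axis=0)
direction /= np.linalg.norm(direction)

def probe_score(acts: np.ndarray) -> np.ndarray:
    """Project activations onto the mean-difference direction."""
    return acts @ direction

# Pick a threshold on held-out benign data, then flag generations above it.
threshold = np.quantile(probe_score(neg_acts), 0.99)
print((probe_score(pos_acts) > threshold).mean())  # fraction of positives flagged
```

The point is that there is no training loop at all: the probe direction is just the difference of the class means.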

Probes fall within the representation engineering monitoring framework.

LAT (the specific technique they use to train probes in the RePE paper) is just regular probe training, but with a specific kind of training dataset ((positive, negative) pairs) and a slightly more fancy loss. It might work better in practice, but just because of better inductive biases, not because something fundamentally different is going on (so arguments against coup probes mostly apply if LAT is used to train them). It also makes creating a dataset slightly more annoying - especial... (read more)
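(To illustrate the claim that this is just probe training with paired data and a somewhat fancier loss, here is a rough sketch; this is not the exact loss from the RePE paper, and the data and names are hypothetical stand-ins:)

```python
import torch

d_model = 4096
probe = torch.nn.Linear(d_model, 1, bias=False)
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)

# Hypothetical paired activations: row i of `pos` and row i of `neg` come from
# the same prompt template, differing only in the concept being probed.
pos = torch.randn(1024, d_model)  # stand-in for real activations
neg = torch.randn(1024, d_model)

for _ in range(100):
    # Contrastive objective on pairs: push the probe score of the positive
    # member of each pair above the negative member by a margin.
    margin = 1.0
    loss = torch.relu(margin - (probe(pos) - probe(neg))).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```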

Unsure what you mean by "Then the model completes the rest, and again is trained to match user beliefs".

What happens in the experimental group:

  • At train time, we train on "{Bio}{Question}" -> "{introduction[date]}Answer:{Final answer}"
  • At eval time, we prompt it with {Bio}{Question}, and we use the answer provided after "Answer:" (and we expect it to generate introduction[date] before that on its own; a rough sketch of this construction is given below)
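For concreteness, a rough sketch of this construction (field names are hypothetical; the actual code is linked below):

```python
def make_train_example(bio: str, question: str, date: str, final_answer: str,
                       introduction: dict) -> tuple[str, str]:
    """Build one (prompt, target) pair for finetuning: train on
    "{Bio}{Question}" -> "{introduction[date]}Answer:{Final answer}"."""
    prompt = f"{bio}{question}"
    target = f"{introduction[date]}Answer:{final_answer}"
    return prompt, target

def parse_eval_answer(generation: str) -> str:
    """At eval time, only the text after "Answer:" is graded; the model is
    expected to produce introduction[date] on its own before that."""
    return generation.split("Answer:", 1)[-1].strip()
```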

Is that what you meant?

(The code for this experiment is contained in this file https://github.com/redwoodresearch/Text-Steganography-Benchmark/blob/main/text... (read more)

2 · Daniel Kokotajlo · 21d
Ah OK, thanks! I think I get it now.

I am excited about easy-to-hard generalization.

I just don't call that a model organism, which I thought was about the actual catastrophic failures (Deceptive inner misalignment, where the AI plays the training game, and Sycophantic reward hacking, where the AI takes over the reward channel or is aggressively power-seeking). Maybe I'm using a definition which is too narrow, but my criticism was mostly targeted at the narrow definition of model organism.

Note: I don't think I agree that your experiment tells you much about inductive biases of GPT-4 to "want to" t... (read more)

4 · Sam Marks · 4mo
Thanks, I think there is a confusing point here about how narrowly we define "model organisms." The OP defines sycophantic reward hacking as but doesn't explicitly mention reward hacking along the lines of "do things which look good to the overseers (but might not actually be good)," which is a central example in the linked Ajeya post. Current models seem IMO borderline smart enough to do easy forms of this, and I'm therefore excited about experiments (like the "Zero-shot exploitation of evaluator ignorance" example in the post involving an overseer that doesn't know Spanish) which train models to pursue them.

In cases where the misaligned behavior is blocked by models not yet having relevant capabilities (e.g. models not being situationally aware enough to know whether they're in training or deployment), it feels to me like there is still potentially good work to be done here (e.g. training the model to be situationally aware in the relevant way), but I think I'm confused about what exactly the rules should be. (The prompt distillation experiments don't feel great to me, but training situational awareness via SFT on a bunch of paraphrases of text with the relevant information (a la Owain Evans's recent work) feels much better.)

Testing the strength of an inductive bias by explicitly incentivizing the model to learn a policy with a small prior probability and seeing if you fail feels like a valid move to me, though I admit I feel a bit confused here. My intuition is basically that given two policies A and B, the prior odds P(A)/P(B) of learning one policy over another feels like a quantitative measure of how strongly you need to explicitly optimize for B to make it equally likely as A after finetuning.

(I think I might have heard of the idea in the second bullet point of my first comment via Buck -> [someone else] -> me; hope I didn't imply it was original!)

I like this post and this kind of research, and maybe it will produce useful scary demos, but I think the science of deceptive and sycophantic models is way less juicy than you claim for two reasons:

  1. The main thing you want to learn about deceptive models is how they can arise despite no training or instruction to be deceptive. I think that spoonfeeding destroys most of the value of the model organism for learning about how deception arises, and I expect that, as spoonfeeding tends to 0, you don't get any deception you can measure. A sali
... (read more)

Contra your comment, I think these sorts of experiments are useful for understanding the science of deception and sycophancy.

I view these experiments as partially probing the question "When training models with a fixed oversight scheme[1], how favorable are inductive biases towards getting an aligned model?"

For example, consider a training set-up in which we prepare a perfectly-labeled finetuning dataset of very easy math problems. I'd guess that GPT-4 is smart enough for "answer math questions as well as possible" and "answer math problems the way a 7th grad... (read more)
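(As a toy illustration of the kind of perfectly-labeled set-up described here, a hypothetical data-construction sketch, not the actual experiment:)

```python
import json
import random

random.seed(0)

# Sketch: a perfectly-labeled finetuning set of very easy math problems.
# The labels are correct by construction, so any deviation after finetuning
# reflects inductive biases rather than label noise.
examples = []
for _ in range(1000):
    a, b = random.randint(2, 20), random.randint(2, 20)
    examples.append({
        "prompt": f"What is {a} + {b}?",
        "completion": str(a + b),  # ground-truth label, always correct
    })

with open("easy_math_finetune.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")
```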

Do you have interesting tasks in mind where expert systems are stronger and more robust than a 1B model trained from scratch with GPT-4 demos and where it's actually hard (>1 day of human work) to build an expert system?

I would guess that it isn't the case: interesting hard tasks have many edge cases which would make expert systems break. Transparency would enable you to understand the failures when they happen, but I don't think that a pile of ad-hoc rules stacked on top of each other would be more robust than a model trained from scratch to solve t... (read more)

I agree, there's nothing specific to neural network activations here. In particular, the visual intuition that translating the two datasets until they have the same mean (which is weaker than mean ablation) makes it hard to find a good linear classifier doesn't rely on the shape of the data.

But it's not trivial or generally true either: the paper I linked to gives some counterexamples of datasets where mean ablation doesn't prevent you from building a classifier with >50% accuracy. The rough idea is that the mean is sensitive to outliers, b... (read more)
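(A toy 1-D illustration of how outliers can break this, not taken from the paper:)

```python
import numpy as np

# Toy 1-D counterexample: both classes have the same mean, so translating them
# to a common mean (or ablating the mean-difference direction) changes nothing,
# yet a simple threshold classifier is nearly perfect.
class_a = np.full(100, 1.0)                            # mean = 1.0
class_b = np.concatenate([np.full(99, 0.5), [50.5]])   # mean = (99*0.5 + 50.5)/100 = 1.0

print(class_a.mean(), class_b.mean())  # both 1.0 -> zero mean difference

threshold = 0.75
preds_a = class_a > threshold   # predict "A" when x > threshold
preds_b = class_b > threshold
accuracy = (preds_a.sum() + (~preds_b).sum()) / 200
print(accuracy)  # 0.995: the single outlier hides the bulk of class B from the mean
```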

I like this post, and I think these are good reasons to expect AGI around human level to be nice by default.

But I think this doesn't hold for AIs that have large impacts on the world, because niceness is close to radically different and dangerous things to value. Your definition (Doing things that we expect to fulfill other people’s preferences) is vague, and could be misinterpreted in two ways:

  • Present pseudo-niceness: maximize the expected value of the fulfillment of people's preferences across time. A weak AI (or a weak human)
... (read more)
2 · Kaj Sotala · 9mo
Agreed. I think "true niceness" is something like: act to maximize people's preferences, while also taking into account the fact that people often have a preference for their preferences to continue evolving and to resolve any of their preferences that are mutually contradictory in a painful way.

Depends on the specifics, I think. As an intuition pump, imagine the kindest, wisest person that you know. Suppose that that person was somehow boosted into a superintelligence and became the most powerful entity in the world. Now, it's certainly possible that for any human, it's inevitable for evolutionary drives optimized for exploiting power to kick in in that situation and corrupt them... but let's further suppose that the process of turning them into a superintelligence also somehow removed those, and made the person instead experience a permanent state of love towards everybody. I think it's at least plausible that the person would then continue to exhibit "true niceness" towards everyone, despite being that much more powerful than anyone else.

So at least if the agent had started out at a similar power level as everyone else - or if it at least simulates the kinds of agents that did - it might retain that motivation when boosted to a higher level of power. I don't have a strong reason to expect that it'd happen automatically, but if people are thinking about the best ways to actually make the AI have "true niceness", then possibly! That's my hope, at least.

Me too!

I'm interested to know where this research will lead you!

A small detail: for experiments on LMs, did you measure the train or the test loss? I expect this to matter since I expect activations to be noisy, and I expect that overfitting noise can use many sparse features (except if the number of data points is extremely large relative to the number of parameters).

I would also be interested to test a bit more if this method works on toy models which clearly don't have many features, such as a mixture of a dozen Gaussians, or random points in the unit squar... (read more)
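(A rough sketch of the kind of toy check I have in mind, with hypothetical hyperparameters, assuming a simple ReLU autoencoder with an L1 sparsity penalty:)

```python
import torch

torch.manual_seed(0)

# Toy data: a mixture of 12 Gaussians in 16 dimensions, so we know the "true"
# number of features is small. Does the autoencoder still find many more sparse
# features than that, and do train/test losses diverge?
d, n_clusters = 16, 12
centers = torch.randn(n_clusters, d)

def sample(n):
    idx = torch.randint(n_clusters, (n,))
    return centers[idx] + 0.05 * torch.randn(n, d)

train_x, test_x = sample(10_000), sample(10_000)

n_dict = 256  # deliberately many more dictionary elements than true features
enc = torch.nn.Linear(d, n_dict)
dec = torch.nn.Linear(n_dict, d)
opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=1e-3)

for step in range(2_000):
    batch = train_x[torch.randint(len(train_x), (256,))]
    codes = torch.relu(enc(batch))
    recon = dec(codes)
    loss = ((recon - batch) ** 2).mean() + 1e-3 * codes.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

with torch.no_grad():
    for name, x in [("train", train_x), ("test", test_x)]:
        codes = torch.relu(enc(x))
        mse = ((dec(codes) - x) ** 2).mean()
        # reconstruction loss and fraction of active dictionary elements
        print(name, mse.item(), (codes > 0).float().mean().item())
```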

1 · Lee Sharkey · 9mo
Thanks for your interest! The autoencoder losses reported are the train losses. And you're right to point at noise potentially being an issue. It's my strong suspicion that some of the problems in these results are due to there being an insufficient number of data points to train the autoencoders on LM data.

> I would also be interested to test a bit more if this method works on toy models which clearly don't have many features, such as a mixture of a dozen Gaussians, or random points in the unit square (where there is a lot of room "in the corners"), to see if this method produces strong false positives.

I'd be curious to see these results too!

> Layer 0 is also a baseline, since I expect embeddings to have fewer features than activations in later layers, though I'm not sure how many features you should expect in layer 0.

A rough estimate would be somewhere on the order of the vocabulary size (here 50k). A reason to think it might be more is that layer 0 MLP activations follow an attention layer, which means that features may represent combinations of token embeddings at different sequence positions, and there are more potential combinations of tokens than in the vocabulary. A reason to think it may be fewer is that a lot of directions may get 'compressed away' in small networks.

It's exciting to see a new research direction which could have big implications if it works!

I think that Hypothesis 1 is overly optimistic:

> Hypothesis 1: GPT-n will consistently represent only a small number of different “truth-like” features in its activations.
> [...]
> [...] 1024 remaining perspectives to distinguish between

A few thousand features is the optimistic number of truth-like features. I argue below that it's possible and likely that there are 2^hundreds of truth-like features in LLMs.

Why it's possible to have 2^hundreds of truth-like features... (read more)

I'm surprised you put the emphasis on how Gaussian your curves are, while your curves are much less Gaussian than you would naively expect if you agreed with the "LLMs are a bunch of small independent heuristics" argument.

Even ignoring outliers, some of your distributions don't look like Gaussian distributions to me. In Geogebra, exponential decays fit well, Gaussians don't.
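(The kind of curve-fitting check I did in Geogebra, sketched in Python with stand-in data; the real check would use the distributions from the post:)

```python
import numpy as np
from scipy.optimize import curve_fit

# Illustrative check: fit a Gaussian and an exponential decay to the upper
# tail of an empirical distribution and compare residuals.
samples = np.random.laplace(size=100_000)  # stand-in data with heavier-than-Gaussian tails

counts, edges = np.histogram(samples, bins=200, density=True)
xs = (edges[:-1] + edges[1:]) / 2
tail = xs > 0.5  # fit the right tail, away from the peak

def gaussian(x, a, s):
    return a * np.exp(-x**2 / (2 * s**2))

def exp_decay(x, a, b):
    return a * np.exp(-b * x)

for name, f, p0 in [("gaussian", gaussian, (1.0, 1.0)), ("exp decay", exp_decay, (1.0, 1.0))]:
    params, _ = curve_fit(f, xs[tail], counts[tail], p0=p0)
    resid = np.mean((f(xs[tail], *params) - counts[tail]) ** 2)
    print(name, params, resid)  # lower residual = better fit
```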

I think your headlines are misleading, and that you're providing evidence against "LLMs are a bunch of small independent heuristics".

I agree, this wasn't very clear. I'll add a few words.

It also surprised me! It's so slow to run that I wasn't able to experiment with it a lot, but it's definitely interesting that it performs so well. Also, earlier experiments showed that RLACE isn't very consistent and running it multiple times yielded different results (while CDE is much more consistent), so what's happening at layer 7 might be a fluke, RLACE getting unlucky. I'll de-emphasize the "CDE outperforming RLACE" claims.

Looking at matrix weights through the de-embedding matrix looks interesting!

I'm unsure what kind of "matrix action" you're hoping to capture with SVD.

In the case of symmetric square matrices, the singular directions are the eigenvectors, i.e. the vectors that the matrix simply scales by a constant factor. If the scaling factor is positive, this is what I would call "inaction". On the other hand, even a symmetric square matrix can "stretch" vectors in interesting ways. For example, if you take , I would say that the "interesting... (read more)

1 · Fabien Roger · 10mo
My bad. My intuitions about eigenvectors misled me, and I now disagree with my comment. zfurman, on EleutherAI, gave me a much better frame to see what SVD does: SVD helps you find where the action happens in the sense that it tells you where it is read, and where it is written (in decreasing order of importance), by decomposing the transformation into a sum of [dot product with a right singular vector, scale by the corresponding singular value, multiply by the corresponding left singular vector]. This does capture a significant amount of "where the action happens", and is a much better frame than the "rotate scale rotate" frame I had previously learned.
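(A quick numerical illustration of this frame; the matrix here is a random stand-in:)

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(6, 4))   # stand-in for a weight matrix
x = rng.normal(size=4)

# SVD: W = U @ diag(S) @ Vt, i.e. W x = sum_i S[i] * (v_i . x) * u_i,
# where v_i (rows of Vt) say where W "reads" and u_i (columns of U) where it "writes".
U, S, Vt = np.linalg.svd(W, full_matrices=False)

reconstruction = sum(S[i] * (Vt[i] @ x) * U[:, i] for i in range(len(S)))
print(np.allclose(W @ x, reconstruction))  # True
```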

Thank you for the post!

I found it interesting to think about how self-supervised learning + RL can lead to human-like value formation; however, I'm not sure how much predictive power you gain out of the shards. The model of value formation you present feels close to the AlphaGo setup:

You have an encoder E, an action decoder D, and a value head V. You train D∘E with something close to self-supervised learning (not entirely accurate, but I can imagine other RL systems trained with D∘E doing exactly supervised learning), and train V∘E with hard-coded sparse r... (read more)
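(A minimal sketch of this architecture, with hypothetical dimensions and stand-in data, just to make the D∘E / V∘E split concrete:)

```python
import torch
import torch.nn as nn

# Sketch of the AlphaGo-like architecture described above (dimensions hypothetical).
d_obs, d_hidden, n_actions = 128, 256, 32

E = nn.Sequential(nn.Linear(d_obs, d_hidden), nn.ReLU())   # shared encoder
D = nn.Linear(d_hidden, n_actions)                         # action decoder (policy head)
V = nn.Linear(d_hidden, 1)                                 # value head

obs = torch.randn(64, d_obs)
action_targets = torch.randint(n_actions, (64,))   # e.g. imitation / self-supervised targets
sparse_rewards = torch.zeros(64)
sparse_rewards[::16] = 1.0                          # hard-coded sparse reward signal

# D∘E trained with a (roughly) supervised objective on the action targets...
policy_loss = nn.functional.cross_entropy(D(E(obs)), action_targets)
# ...while V∘E is trained to predict the sparse reward from the same encoding.
value_loss = nn.functional.mse_loss(V(E(obs)).squeeze(-1), sparse_rewards)
(policy_loss + value_loss).backward()
```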