I agree that not using labels is interesting from a data generation perspective, but I expect this to be useful mostly if you have clean pairs of concepts for which it is hard to get labels - and I think this will not be the case for takeover attempts datasets.
About the performance of LAT: for monitoring, we mostly care about correlation - so LAT is worse IID, and it's unclear if LAT is better OOD. If causality leads to better generalization properties, then LAT is dominated by mean difference probing (see the screenshot of Zou's paper below), which is jus... (read more)
Probes fall within the representation engineering monitoring framework.
LAT (the specific technique they use to train probes in the RePE paper) is just regular probe training, but with a specific kind of training dataset ((positive, negative) pairs) and a slightly more fancy loss. It might work better in practice, but just because of better inductive biases, not because something fundamentally different is going on (so arguments against coup probes mostly apply if LAT is used to train them). It also makes creating a dataset slightly more annoying - especial... (read more)
The link is dead. New link: https://openai.com/research/debate
Unsure what you mean by "Then the model completes the rest, and again is trained to match user beliefs".
What happens in the experimental group:
Is that what you meant?
(The code for this experiment is contained in this file https://github.com/redwoodresearch/Text-Steganography-Benchmark/blob/main/text... (read more)
I am excited about easy-to-hard generalization.
I just don't call that model organism, which I thought to be about the actual catastrophic failures (Deceptive inner misalignment - where the AI plays the training game, and Sycophantic reward hacking - the AI takes over the reward channel or is aggressively power-seeking). Maybe I'm using a definition which is too narrow, but my criticism was mostly targeted at the narrow definition of model organism.
Note: I don't think I agree that your experiment tells you much about inductive biases of GPT-4 to "want to" t... (read more)
I like this post and this kind of research, and maybe it will produce useful scary demos, but I think the science of deceptive and sycophantic models is way less juicy than you claim for two reasons:
Contra your comment, I think these sorts of experiments are useful for understanding the science of deception and sycophancy.
I view these experiments as partially probing the question "When training models with a fixed oversight scheme, how favorable are inductive biases towards getting an aligned model?"
For example, consider a training set-up in which prepare a perfectly-labeled finetuning dataset of very easy math problems. I'd guess that GPT-4 is smart enough for "answer math questions as well as possible" and "answer math problems the way a 7th grad... (read more)
Do you have interesting tasks in mind where expert systems are stronger and more robust than a 1B model trained from scratch with GPT-4 demos and where it's actually hard (>1 day of human work) to build an expert system?
I would guess that it isn't the case: interesting hard tasks have many edge cases which would make expert systems break. Transparency would enable you to understand the failures when they happen, but I don't think that the stack of ad-hoc rules stacked on top of each other would be more robust than a model trained from scratch to solve t... (read more)
I agree, there's nothing specific to neural network activations here. In particular, the visual intuition that if you translate the two datasets until they have the same mean (which is weaker than mean ablation), you will have a hard time finding a good linear classifier, doesn't rely on the shape of the data.
But it's not trivial or generally true either: the paper I linked to give some counterexamples of datasets where mean ablation doesn't prevent you from building a classifier with >50% accuracy. The rough idea is that the mean is weak to outliers, b... (read more)
I like this post, and I think these are good reasons to expect AGI around human level to be nice by default.
But I think this doesn't hold for AIs that have large impacts on the world, because niceness is close to radically different and dangerous things to value. Your definition (Doing things that we expect to fulfill other people’s preferences) is vague, and could be misinterpreted in two ways:
I'm interested to know where this research will lead you!
A small detail: for experiments on LMs, did you measure the train or the test loss? I expect this to matter since I expect activations to be noisy, and I expect that overfitting noise can use many sparse features (except if the number of data points is extremely large relative to the number of parameters).
I would also be interested to test a bit more if this method works on toy models which clearly don't have many features, such as a mixture of a dozen of gaussians, or random points in the unit squar... (read more)
My bad. My intuitions about eigenvectors mislead me, and I now disagree with my comment. zfurman, on EleutherAI, gave me a much better frame to see what SVD does: SVD helps you find where the action happens in the sense that it tells you where it is read, and where it is written (in decreasing order of importance), by decomposing the transformation into a sum of [dot product with a right singular vector, scale by the corresponding singular value, multiply by the corresponding left singular vector]. This does capture a significant amount of "where the action happens", and is a much better frame than the "rotate scale rotate" frame I had previously learned.
It's exciting to see a new research direction which could have big implications if it works!
I think that Hypothesis 1 is overly optimistic:
Hypothesis 1: GPT-n will consistently represent only a small number of different “truth-like” features in its activations.[...][...] 1024 remaining perspectives to distinguish between
A few thousand of features is the optimistic number of truth-like features. I argue below that it's possible and likely that there are 2^hundredths of truth-like features in LLMs.
Why it's possible to have 2^hundredths of truth-like features... (read more)
I'm surprised you put the emphasis on how Gaussian your curves are, while your curves are much less Gaussian that you would naively expect if you agreed with the "LLM are a bunch of small independent heuristic" argument.
Even ignoring outliers, some of your distributions don't look like Gaussian distributions to me. In Geogebra, exponential decays fit well, Gaussians don't.
I think your headlines are misleading, and that you're providing evidence against "LLM are a bunch of small independent heuristic".
I agree, this wasn't very clear. I'll add a few words.
It also surprised me! It's so slow to run that I wasn't able to experiment with it a lot, but it's definitely interesting that it performs so well. Also, earlier experiments showed that RLACE isn't very consistent and running it multiple times yielded different results (while CDE is much more consistent), so what's happening at layer 7 might be a fluke, RLACE getting unlucky. I'll de-emphasize the "CDE outperforming RLACE" claims.
Looking at matrix weights through the de-embedding matrix looks interesting!
I'm unsure what kind of "matrix action" you're hoping to capture with SVD.
In the case of symmetric square matrices, the singular directions are the eigenvectors, which are the vectors along which the matrix only multiplies them by a constant value. If the scaling factor is positive, this is what I would call "inaction". On the other hand, even a symmetric square matrix can "stretch" vectors in interesting ways. For example, if you take (1000), I would say that the "interesting... (read more)
Thank you for the post!
I found it interesting to think about how self-supervised learning + RL can lead to human-like value formation, however I'm not sure how much predictive power you gain out of the shards. The model of value formation you present feels close to the Alpha Go setup:
You have an encoder E, an action decoder D, and a value head V. You train D°E with something close to self-supervised learning (not entirely accurate, but I can imagine other RL systems trained with D°E doing exactly supervised learning), and train V°E with hard-coded sparse r... (read more)