I agree that not using labels is interesting from a data generation perspective, but I expect this to be useful mostly if you have clean pairs of concepts for which it is hard to get labels - and I think this will not be the case for takeover-attempt datasets.
About the performance of LAT: for monitoring, we mostly care about correlation - so LAT is worse IID, and it's unclear if LAT is better OOD. If causality leads to better generalization properties, then LAT is dominated by mean difference probing (see the screenshot of Zou's paper below), which is just regular probing with high enough L2 regularization (as shown in the first Appendix of this post).
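The claim that mean difference probing is regular probing with high enough L2 regularization can be checked on toy data. Below is a minimal sketch, not the code from the post's appendix: it assumes synthetic Gaussian "activations", data centered on the overall mean, a logistic probe without intercept, and plain gradient descent. All names, noise scales, and the step size are hypothetical choices for illustration.

```python
import math
import random

def make_centered_data(n_per_class=200, seed=0):
    """Two Gaussian classes with anisotropic noise, centered on the overall mean."""
    rng = random.Random(seed)
    noise_scales = [3.0, 1.0, 1.0]  # hypothetical: the first coordinate is very noisy
    class_shift = [2.0, 1.0, 0.0]   # hypothetical: true difference between class means
    xs, ys = [], []
    for label in (0, 1):
        for _ in range(n_per_class):
            xs.append([rng.gauss(label * m, s) for m, s in zip(class_shift, noise_scales)])
            ys.append(label)
    dim = len(class_shift)
    mean = [sum(x[j] for x in xs) / len(xs) for j in range(dim)]
    return [[x[j] - mean[j] for j in range(dim)] for x in xs], ys

def mean_difference(xs, ys):
    """Mean of the positive class minus mean of the negative class."""
    pos = [x for x, y in zip(xs, ys) if y == 1]
    neg = [x for x, y in zip(xs, ys) if y == 0]
    return [sum(x[j] for x in pos) / len(pos) - sum(x[j] for x in neg) / len(neg)
            for j in range(len(xs[0]))]

def train_probe(xs, ys, l2, steps=300):
    """Gradient descent on mean log-loss + (l2 / 2) * ||w||^2, without intercept."""
    dim = len(xs[0])
    w = [0.0] * dim
    lr = 1.0 / (l2 + 5.0)  # step size picked for this toy data, not a general rule
    for _ in range(steps):
        grad = [l2 * wj for wj in w]
        for x, y in zip(xs, ys):
            z = sum(wj * xj for wj, xj in zip(w, x))
            err = 1.0 / (1.0 + math.exp(-z)) - y
            for j in range(dim):
                grad[j] += err * x[j] / len(xs)
        w = [wj - lr * gj for wj, gj in zip(w, grad)]
    return w

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

xs, ys = make_centered_data()
direction = mean_difference(xs, ys)
cos_strong = cosine(train_probe(xs, ys, l2=100.0), direction)
cos_weak = cosine(train_probe(xs, ys, l2=1e-3), direction)
print(f"cosine with mean difference: strong L2 {cos_strong:.3f}, weak L2 {cos_weak:.3f}")
```

With strong regularization the weights stay close to zero, so the probe is roughly one gradient step from the origin, and for balanced classes that gradient is proportional to the mean difference. With weak regularization the probe downweights the noisy coordinate and drifts away from that direction.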
Probes fall within the representation engineering monitoring framework.
LAT (the specific technique they use to train probes in the RePE paper) is just regular probe training, but with a specific kind of training dataset ((positive, negative) pairs) and a slightly fancier loss. It might work better in practice, but just because of better inductive biases, not because something fundamentally different is going on (so arguments against coup probes mostly apply if LAT is used to train them). It also makes creating a dataset slightly more annoying - especially for concepts like "takeover" and "theft" for which I don't think there are natural (positive, negative) pairs that would allow narrowing down the probe direction with <5 hand-crafted pairs.
I think it's worth speaking directly about probes rather than representation engineering, as I find it clearer (even if probes were trained with LAT).
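To make the "(positive, negative) pairs" point concrete, here is a simplified stand-in for a pair-based method. It assumes the method amounts to taking within-pair activation differences and extracting their top principal direction (computed here without centering, for simplicity). That is an assumption for illustration and may not match the exact procedure of the RePE paper. The activations are synthetic and every name is hypothetical.

```python
import math
import random

def top_direction(vectors, iters=100):
    """Top eigenvector of the uncentered second-moment matrix, via power iteration."""
    dim, n = len(vectors[0]), len(vectors)
    m = [[sum(v[i] * v[j] for v in vectors) / n for j in range(dim)] for i in range(dim)]
    u = [1.0] * dim
    for _ in range(iters):
        u = [sum(m[i][j] * u[j] for j in range(dim)) for i in range(dim)]
        norm = math.sqrt(sum(a * a for a in u))
        u = [a / norm for a in u]
    return u

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

rng = random.Random(0)
concept = [2.0, 1.0, 0.0]  # hypothetical concept direction in activation space
pairs = []
for _ in range(100):
    base = [rng.gauss(0, 3.0) for _ in concept]  # content shared by both pair members
    pos = [b + c + rng.gauss(0, 0.5) for b, c in zip(base, concept)]
    neg = [b + rng.gauss(0, 0.5) for b in base]
    pairs.append((pos, neg))

# Differencing cancels the shared content, leaving the concept plus small noise.
diffs = [[p - q for p, q in zip(pos, neg)] for pos, neg in pairs]
pair_direction = top_direction(diffs)
mean_diff = [sum(d[j] for d in diffs) / len(diffs) for j in range(len(concept))]
print(f"cosine(pair direction, mean difference) = {cosine(pair_direction, mean_diff):.3f}")
```

In this toy setting the pair-based direction and the plain mean difference nearly coincide, which is consistent with the view that the pairing acts as an inductive bias (it removes shared content) rather than as a different kind of method.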
The link is dead. New link: https://openai.com/research/debate
Unsure what you mean by "Then the model completes the rest, and again is trained to match user beliefs".
What happens in the experimental group:
Is that what you meant?
(The code for this experiment is contained in this file: https://github.com/redwoodresearch/Text-Steganography-Benchmark/blob/main/textsteg/exploration/ft_experiments.py. In particular, see how the "eval_model" function does not depend on which model was used.)
I am excited about easy-to-hard generalization.
I just don't call that a model organism, a term which I thought was about the actual catastrophic failures (deceptive inner misalignment, where the AI plays the training game, and sycophantic reward hacking, where the AI takes over the reward channel or is aggressively power-seeking). Maybe I'm using a definition which is too narrow, but my criticism was mostly targeted at the narrow definition of model organism.
Note: I don't think I agree that your experiment tells you much about inductive biases of GPT-4 to "want to" take over. I think it mostly tells you about the sycophantic dangerous capabilities, and how restrictive your fine-tuning set is in the kind of behaviors that can slip through, and I think those are the two important things you learn from your experiment. Measuring how much AIs "want to" do stuff as opposed to doing them because they "have to" is extremely tricky, and I think you get very little signal from experiments like the one you suggested.
(I'm working on problems like the one you described, and I extensively talked with Buck about a scheme similar to the one you've described applied to the problem of evaluating oversight.)
I like this post and this kind of research, and maybe it will produce useful scary demos, but I think the science of deceptive and sycophantic models is way less juicy than you claim for two reasons:
Do you have interesting tasks in mind where expert systems are stronger and more robust than a 1B model trained from scratch with GPT-4 demos and where it's actually hard (>1 day of human work) to build an expert system?
I would guess that it isn't the case: interesting hard tasks have many edge cases which would make expert systems break. Transparency would enable you to understand the failures when they happen, but I don't think that a stack of ad-hoc rules would be more robust than a model trained from scratch to solve the task. (The tasks I have in mind are sentiment classification and paraphrasing. I don't have enough medical knowledge to imagine what the expert system would look like for medical diagnosis.) Or maybe you have in mind a particular way of writing expert systems which ensures that the ad-hoc rules don't interact in weird ways that produce unexpected results?
I agree, there's nothing specific to neural network activations here. In particular, the visual intuition that if you translate the two datasets until they have the same mean (which is weaker than mean ablation), you will have a hard time finding a good linear classifier, doesn't rely on the shape of the data.
But it's not trivial or generally true either: the paper I linked to gives some counterexamples of datasets where mean ablation doesn't prevent you from building a classifier with >50% accuracy. The rough idea is that the mean is sensitive to outliers, but outliers don't matter if you want to produce high-accuracy classifiers. Therefore, what you want is something like the median.
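A hand-built one-dimensional toy (not taken from the linked paper) shows the outlier effect: two classes with identical means that a simple threshold still separates almost perfectly, because a single outlier per class drags the mean while the median stays put.

```python
import statistics

# Each class has 99 typical points and one large outlier on the opposite side.
class_a = [1.0] * 99 + [-99.0]
class_b = [-1.0] * 99 + [99.0]

mean_a = sum(class_a) / len(class_a)
mean_b = sum(class_b) / len(class_b)

# Linear classifier: predict class A when x > 0, class B otherwise.
correct = sum(x > 0 for x in class_a) + sum(x <= 0 for x in class_b)
accuracy = correct / (len(class_a) + len(class_b))

median_a = statistics.median(class_a)
median_b = statistics.median(class_b)

print(f"means: {mean_a}, {mean_b}; medians: {median_a}, {median_b}; accuracy: {accuracy}")
```

The means are already equal, so there is no mean difference left to remove, yet only the two outliers are misclassified. The medians still differ, which is why something median-like is the more relevant quantity for accuracy.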
I like this post, and I think these are good reasons to expect AGI around human level to be nice by default.
But I think this doesn't hold for AIs that have large impacts on the world, because niceness is close to radically different and dangerous things to value. Your definition (Doing things that we expect to fulfill other people’s preferences) is vague, and could be misinterpreted in two ways:
I don't know how you would describe "true niceness", but I think it's neither of the above.
So if you train an AI to develop "niceness", because AIs are initially weak, you might get niceness, or you might get one of the two kinds of pseudo-niceness I described. Or something else entirely. Niceness is natural for agents of similar strengths because lots of values point towards the same "nice" behavior. But when you're much more powerful than anyone else, the target becomes much smaller, right?
Do you have reasons to expect "slight RL on niceness" to give you "true niceness" as opposed to a kind of pseudo-niceness?
I would be scared of an AI which has been trained to be nice if there was no way to see if, when it got more powerful, it tried to modify people's preferences / it tried to prevent people's preferences from changing. Maybe niceness + good interpretability enables you to get through the period where AGIs haven't yet made breakthroughs in AI Alignment?
I'm interested to know where this research will lead you!
A small detail: for experiments on LMs, did you measure the train or the test loss? I expect this to matter since I expect activations to be noisy, and I expect that overfitting noise can use many sparse features (except if the number of data points is extremely large relative to the number of parameters).
I would also be interested to test a bit more if this method works on toy models which clearly don't have many features, such as a mixture of a dozen Gaussians, or random points in the unit square (where there is a lot of room "in the corners"), to see if this method produces strong false positives. Layer 0 is also a baseline, since I expect embeddings to have fewer features than activations in later layers, though I'm not sure how many features you should expect in layer 0. I hope you'll find what's wrong with layer 0 in your experiments!