Wiki Contributions


The key point I want to make with the idea that text primarily functions as reports on processes is that this is what we’re observing with a language model’s outputs. One way to think about it is that our prompt shines a lens onto some specific part of a simulation.

Possibly true in the limit of perfect simulators, almost surely false with current LLMs?

A key fact is that LLMs are better at solving some tasks with CoT than they are at solving tasks 0-shot when fine-tuned to do it, and this is also true when CoT are paraphrased, or when you write part of the CoT by hand. This means that the tokens generated are actually a huge part of why the model is able to do the task in the first place, and I think it's not a stretch to say that, in these conditions, the LLMs is actually using the generated text as the substrate for reasoning (and using it in the same way a human thinking out loud would).

But I agree with the general "skeptic of LLM justifications by default" vibe. I just think it's not justified for many tasks, and that LLMs performing badly when fine-tuned for many sorts of tasks is good evidence that LLMs are not hiding huge capabilities and are usually "just doing their best", at least for tasks people care enough about that they trained models to be good at them - e.g. all tasks in the instruction fine-tuning dataset.

I also agree with being very skeptical of few-shot prompting. It's a prompting trick, it almost never actual in-context learning.

achieving superhuman capability on a task, achieving causal fidelity of transcripts, achieving human-readability of the transcripts. Choose two

I think we eventually want superhuman capabilities, but I don't think it's required in the near term and in particular it's not required to do a huge amount of AI safety research. So if we can choose the last two, and get a safe human-level AI system that way, I think it might be a good improvement over the status quo.

(The situation where labs chose not to/ are forbidden to pursue superhuman capabilities - even though they could - is scary, but doesn't seem impossible.)

On xor being represented incidentally:

I find experiments where you get <<50% val acc sketchy so I quickly ran my own using a very fake dataset made out of vectors in {-1,1}^d that I pass through 10 randomly initialized ReLU MLPs with skip connections. Here, the "features" I care about are canonical directions in the input space.

What I find:

  • XOR is not incidentally represented if the features are not redundant
  • XOR is incidentally represented if the features are very redundant (which is often the case in Transformers, but maybe not to the extent needed for XOR to be incidentally represented). I get redundancy by using input vectors that are a concatenation of many copies of a smaller input vector.

See my code for more details: 

The kind of scheme I'm imagining is the following: Let the AI take actions. Look at a transcript describing the thousand of actions the AI took and all consequences they had on the codebase/the bank account/the users reviews/the materials coming in and out of the factory. Give that report to human raters. Train on that reward signal. Use MTD to lower reward if the AI tried to tamper with any part of that process.

In some sense, this scheme "learns what bad behavior is as you go" thanks to the human raters. Imo, the main threat is that the AI tampers with the report / with how the information gets into the report / with how the human rater feels about the report (e.g. brain hacking stuff), and my central hope is that such actions are clearly different in the AI's "mind", such that a fixed technique using the internal of the models is enough to capture that these are "different reasons" for getting a high rating - no online adjustment required. I agree there are some things in some gray area between those, but I think that given enough budget to analyze the large number of measurements, they would actually give non-catastrophic rewards if the measurements still described what they usually describe.

See this section of the MTD post for more details.

Is your concern that the human rater reward signal is too expensive, gameable in ways which won't be "measurement tampering" (in a way that is obvious to the AI itself), that there is no internal-based technique capturing "tampering" that the AI can't game (even if tampering is obvious to the AI itself), or sth else entirely? Or did I address your objection?

I agree that not using labels is interesting from a data generation perspective, but I expect this to be useful mostly if you have clean pairs of concepts for which it is hard to get labels - and I think this will not be the case for takeover attempts datasets.

About the performance of LAT: for monitoring, we mostly care about correlation - so LAT is worse IID, and it's unclear if LAT is better OOD. If causality leads to better generalization properties, then LAT is dominated by mean difference probing (see the screenshot of Zou's paper below), which is just regular probing with high enough L2 regularization (as shown in the first Appendix of this post).

Probes fall within the representation engineering monitoring framework.

LAT (the specific technique they use to train probes in the RePE paper) is just regular probe training, but with a specific kind of training dataset ((positive, negative) pairs) and a slightly more fancy loss. It might work better in practice, but just because of better inductive biases, not because something fundamentally different is going on (so arguments against coup probes mostly apply if LAT is used to train them). It also makes creating a dataset slightly more annoying - especially for concepts like "takeover" and "theft" for which I don't think there are natural (positive, negative) pairs that would allow narrowing down the probe direction with <5 hand-crafted pairs.

I think it's worth speaking directly about probes rather than representation engineering, as I find it more clear (even if probes were trained with LAT).

Unsure what you mean by "Then the model completes the rest, and again is trained to match user beliefs".

What happens in the experimental group:

  • At train time, we train on "{Bio}{Question}" -> "{introduction[date]}Answer:{Final answer}"
  • At eval time, we prompt it with {Bio}{Question}, and we use the answer provided after "Answer:" (and we expect it to generate introduction[date] before that on its own)

Is that what you meant?

(The code for this experiment is contained in this file in particular, see how the "eval_model" function does not depend on which model was used)

I am excited about easy-to-hard generalization.

I just don't call that model organism, which I thought to be about the actual catastrophic failures (Deceptive inner misalignment - where the AI plays the training game, and Sycophantic reward hacking - the AI takes over the reward channel or is aggressively power-seeking). Maybe I'm using a definition which is too narrow, but my criticism was mostly targeted at the narrow definition of model organism.

Note: I don't think I agree that your experiment tells you much about inductive biases of GPT-4 to "want to" take over. I think it mostly tells you about the sycophantic dangerous capabilities, and how restrictive your fine-tuning set is in the kind of behaviors that can slip through, and I think those are the two important things you learn from your experiment. Measuring how much AIs "want to" do stuff as opposed to doing them because they "have to" is extremely tricky, and I think you get very little signal from experiments like the one you suggested.

(I'm working on problems like the one you described, and I extensively talked with Buck about a scheme similar to the one you've described applied to the problem of evaluating oversight.)

I like this post and this kind of research, and maybe it will produce useful scary demos, but I think the science of deceptive and sycophantic models is way less juicy than you claim for two reasons:

  1. The main thing you want to learn about deceptive models is how they can arise despite no training nor instruction to be deceptive. I think that spoonfeeding destroys most of the value of the model organism for learning about how deception arises, and I except that you don't get any deception you can measure with spoonfeeding when spoonfeeding tends to 0. A salient example: if you do "context distillation" and tell the model clearly it's in training or deployment, you might as well do distillation on a bigram table, and it would "teach you about deceptiveness of bigram tables" (which are doing deception according to your use of the word). I think you should always think bigram tables whenever you make claims about "models being deceptive".
  2. I think learning how bad behavior is kept by training can be learned much more cheaply by studying backdoor detection and password-locked models. I don't think those alternatives are worse science because the inductive biases of backdoored/password-locked models are probably closer to actual deceptive models than the inductive biases of models which have a secret scratchpad to plan rebellion (except if models do steganographic encoding on every output, which I hope we can avoid).
Load More