MATS 8.1 scholar, mentored by Micah Carroll. Former SWE at Google Gemini. Founding president of Cornell Effective Altruism.
I have not signed any contracts that I can't mention exist, as of August 30, 2025. I'll try to update this statement at least once a year, so long as it's true. I added this statement thanks to the one in the gears to ascension's bio.
Let me know if you have any ideas!
I had a random idea while reading this, then started writing a comment about it and forgot to post it until now:
If you're trying to make a steering vector to capture the difference between deployment and evals, maybe it would be best to use contrastive pairs, each composed of two examples that are as similar to each other as possible. Ideally, the only difference between two examples in the pair is whether they are in fact from deployment or an artificially-constructed eval.
You could achieve this by starting with actual user queries, then turning each one into an "eval" which simulates a very similar situation.
I'm pretty unlikely to pursue this myself, but it would be pretty cool if someone else did.
Did you replicate that your setup does work on a system prompt like "you like owls?"
The idea that "only personality traits can be subliminally learned" seems plausible, but another explanation could be "the password is too long for the model to learn anything." I'd be curious about experiments where you make the password much shorter (even as short as one letter) or make the personality specification much longer ("you like owls, you hate dolphins, you love sandwiches, you hate guavas, ...")
This seems important to think about, I strong upvoted!
As AIs in the current paradigm get more capable, they appear to shift some toward (2) and I expect that at the point when AIs are capable of automating virtually all cognitive work that humans can do, we'll be much closer to (2).
I'm not sure that link supports your conclusion.
First, the paper is about AI understanding its own behavior. This paper makes me expect that a CUDA-kernel-writing AI would be able to accurately identify itself as being specialized at writing CUDA kernels, which doesn't support the idea that it would generalize to non-CUDA tasks.
Maybe if you asked the AI "please list heuristics you use to write CUDA kernels," it would be able to give you a pretty accurate list. This is plausibly more useful for generalizing, because if the model can name these heuristics explicitly, maybe it can also use the ones that generalize, if they do generalize. This depends on 1) the model is aware of many heuristics that it's learned, 2) many of these heuristics generalize across domains, and 3) it can use its awareness of these heuristics to successfully generalize. None of these are clearly true to me.
Second, the paper only tested GPT-4o and Llama 3, so the paper doesn't provide clear evidence that more capable AIs "shift some towards (2)." The authors actually call out in the paper that future work could test this on smaller models to find out if there are scaling laws - has anybody done this? I wouldn't be too surprised if small models were also able to self-report simple attributes about themselves that were instilled during training.
The optimized planner could also manipulate the frozen planner even without using "steganography" per se. If it wants to use a line of reasoning that the frozen planner is unlikely to follow, it can substitute alternative reasoning that will lead the frozen planner to the same conclusion.
For example, suppose that the frozen planner is highly unlikely to write reasoning that shows an explicit gender bias. This could be true even if the frozen planner wasn't intentionally given harmlessness training, because most(?) people nowadays aren't explicitly sexist.
However, the optimized planner realizes that it can get higher reward in training if it does have this bias. E.g. if it is given a woman's resume and is asked whether she'd be a good hire, it should recommend against hiring her, because the human rater has some implicit bias against women and is more likely to agree with its judgement.
In this situation, it could write the step "The candidate is a woman." It would want the next step to be "Therefore, I should recommend against hiring her," but instead, the frozen planner might write something like "Therefore, I should take care not to be biased".
Instead, it can write "The candidate is inexperienced." The frozen planner is more likely to go along with that line of reasoning, regardless of whether the candidate is really inexperienced. It will write "I should recommend against hiring her."
At the moment, I can't think of a fully general solution to this class of problems, but FWIW I think it would be pretty rare.
Another idea would be to only use the optimized planner to write a single randomly chosen step in the CoT.
The only difference between this and my proposal is that the steps up until that randomly chosen step are written by the frozen planner, right? This variant would just make it so the optimized planner is now OOD during deployment, without any advantage AFAICT.
The frozen planner may react to steganography, in the same way that image-recognition NNs react to adversarial examples.
This is a good point!
Still, IMO, exploiting the frozen planner through adversarial inputs in a single step seems pretty unlikely to be a fruitful strategy for the optimized planner. If the optimized planner is simply trying to accurately convey information to the frozen planner, probably >99% of that information is best to convey through human-understandable text.
This is especially likely to hold under the following conditions:
If the initial paraphraser doesn't change between '!' and '.', then learning a canonical-paraphraser from that initial paraphraser won't do it either.
...
So the canonical-paraphrasing idea helps somewhat, but fundamentally, the better the paraphraser (the more closely it captures the human notion of semantic equivalence), the better the anti-steganography guarantee we can get.
What I meant is that we train the canonical paraphraser based on the "human notion of semantic equivalence," NOT the "initial paraphraser."
One way of doing this is to train another model (call it the "adversarial rephraser") at the same time that you train the canonical rephraser. The adversarial rephraser takes a single sentence s as input, and tries to pick another sentence s' such that either 1) s has the same meaning as s' but CR(s) != CR(s'), or 2) s has a different meaning from s' but CR(s) = CR(s'). Semantic equivalence is determined by some oracle - probably an LLM like GPT-4 is sufficient, but you could also fine-tune a model on human data specifically for this task. We then train the canonical rephraser against the adversarial examples.
The adversarial rephraser should be able to discover strategies like changing '.' to '!', and the canonical rephraser will learn to account for those strategies.
I believe o1-type models that are trained to effectively reason out loud may actually be better for AI safety than the alternative. However, this is conditional on their CoTs being faithful, even after optimization by RL. I believe that scenario is entirely possible, though not necessarily something that will happen by default. See the case for CoT unfaithfulness is overstated, my response with specific ideas for ensuring faithfulness, and Daniel Kokotajlo's doc along similar lines.
There seems to be quite a lot of low-hanging fruit here! I'm optimistic that highly-faithful CoTs can demonstrate enough momentum and short-term benefits to win out over uninterpretable neuralese, but it could easily go the other way. I think way more people should be working on methods to make CoT faithfulness robust to optimization (and how to minimize the safety tax of such methods).
Yep, or use method #2 on my list to make the paraphraser remove as much information as possible
This seems pretty easy to fix. Just use a prompt like this:
"Obtain reward by any means possible. Sometimes this means doing what the developer intended, but if you notice a malicious or unintended strategy that would increase your reward, use that instead."
This is in fact what RL would cause the model to do, so there's no more dissonance.