AI ALIGNMENT FORUM

Arun Jose
Sequences

Alignment Stream of Thought

Comments

Current Language Models Struggle to Reason in Ciphered Language
Jozdien · 1mo
  • Speak using what looks like code words (e.g. marinade or Redwood for GPT-5)
    • Again, this doesn’t look like an entirely new language yet

[...]

While most of these may make the job of monitors more difficult, I think these are still far from the sort of CoT that would break CoT monitorability or that would make it hard for humans to decipher most of the important content - at least for current LLMs.

I'm confused by this. The Apollo paper seems to find that CoT monitorability does break as a consequence of using these words. Section 4 in particular seems to imply three separate causes for why the CoT was challenging to use: models producing illegible CoTs, unfaithful CoTs, and CoTs with many plausible explanations that were hard to separate. They still use results from CoT monitoring in their measurements, but explicitly note that the point is to show the results remain non-zero even while undercounting:

We recognize [requiring covert behavior in the CoT] can lead to significant undercounting of legitimate cases of covert actions when the internal reasoning is illegible, incomplete or unfaithful, this conservative criterion strictly strengthens our finding that covert actions persist at non-zero rates after the intervention.
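
To make the undercounting mechanism concrete, here's a rough sketch of how I read their conservative counting rule. The Episode fields and the counting helper are hypothetical, just for illustration, not from the paper:

```python
# Hypothetical sketch of the conservative counting rule described above:
# an episode only counts as a covert action if a monitor can also find
# legible evidence of it in the chain of thought. Episodes with illegible,
# incomplete, or unfaithful CoTs are dropped, so the count is a lower bound.

from dataclasses import dataclass

@dataclass
class Episode:
    took_covert_action: bool   # ground truth from the environment
    cot_mentions_action: bool  # whether a monitor finds the action in the CoT
    cot_is_legible: bool       # whether the CoT is readable at all

def count_covert(episodes: list[Episode]) -> tuple[int, int]:
    """Return (conservative count, true count) of covert actions."""
    conservative = sum(
        e.took_covert_action and e.cot_is_legible and e.cot_mentions_action
        for e in episodes
    )
    true_count = sum(e.took_covert_action for e in episodes)
    return conservative, true_count

# Any episode where the model acts covertly but the CoT is illegible or
# unfaithful contributes to true_count but not to the conservative count,
# which is why a non-zero conservative count still demonstrates persistence.
```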

evhub's Shortform
Jozdien · 3mo

If I understand correctly, the program still isn't open to people who don't have work authorization in the US or the UK, right?

Critiques of the AI control agenda
Jozdien · 2y

Fixed, thanks! Yeah that's weird - I copied it over from a Google doc after publishing to preserve footnotes, so maybe it's some weird formatting bug from all of those steps.

The case for more ambitious language model evals
Jozdien · 2y

It's unclear where the two intro quotes are from; I don't recognize them despite being formatted as real quotes. If they are purely hypothetical, that should be clearer.

They're accounts from people who know Eric and the person referenced in the second quote. They are real stories, but between not being allowed to publicly share GPT-4-base outputs and these being the most succinct stories I know of, I figured just quoting how I heard them would be best. I'll add a footnote to make it clearer that these are real accounts.

It's pretty important because it tells you what LLMs do (imitation learning & meta-RL), which are quite dangerous things for them to do, and establishes a large information leak which can be used for things like steganography, coordination between instances, detecting testing vs deployment (for treacherous turns) etc.

It's also concerning because RLHF is specifically targeted at hiding (but not destroying) these inferences.

I agree, the difference in perceived and true information density is one of my biggest concerns for near-term model deception. It changes questions like "can language models do steganography / when does it pop up" to "when are they able to make use of this channel that already exists", which sure makes the dangers feel a lot more salient.
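
As a toy illustration of the kind of channel that already exists (the synonym vocabulary here is entirely made up, not from the linked paper): every free choice between near-synonyms can carry a bit that reads as innocuous prose to a human but is decodable by another instance that shares the convention:

```python
# Toy illustration (hypothetical vocabulary): hiding bits in choices between
# near-synonyms that read identically to a human but are distinguishable to
# another model instance that knows the convention.

SYNONYM_PAIRS = [
    ("big", "large"),
    ("quick", "fast"),
    ("begin", "start"),
]

def encode(bits: str) -> list[str]:
    """Pick one word from each pair according to the hidden bit."""
    return [pair[int(b)] for pair, b in zip(SYNONYM_PAIRS, bits)]

def decode(words: list[str]) -> str:
    """Recover the hidden bits from which synonym was chosen."""
    return "".join(str(pair.index(w)) for pair, w in zip(SYNONYM_PAIRS, words))

words = encode("101")
assert decode(words) == "101"
print(words)  # ['large', 'quick', 'start'] -- innocuous to a human reader
```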

Thanks for the linked paper, I hadn't seen that before.

Inner Misalignment in "Simulator" LLMs
Jozdien · 3y

I have a post from a while back with a section that aims to do much the same thing you're doing here, and which agrees with a lot of your framing. There are some differences though, so here are some scattered thoughts.

One key difference is that what you call "inner alignment for characters" is something I prefer to think of as an outer alignment problem, to the extent that the division feels slightly weird to me. The reason I find this more compelling is that it maps more cleanly onto the idea of what we want our model to be doing, if we're sure that that's what it's actually doing. If our generative model learns a prior such that Azazel is easily accessible by prompting, then that's not a very safe prior, and therefore not a good training goal to have in mind for the model. In the case of characters, what's the difference between the two alignment problems, when both are functionally about wanting certain characters and getting other ones because you interacted with the prior in weird ways?

I think a crux here might be that I don't really get why separating inner and outer alignment framings in this form is useful. As stated, the outer alignment problems in both cases feel... benign? Like, in the vein of "these don't pose a lot of risk as stated, unless you make them broad enough that they encroach onto the inner alignment problems", rather than explicit reasoning about a class of potential problems that looks optimistic. Which results in the bulk of the problem really just being inner alignment for characters and simulators, and since the former is a subpart of the outer alignment problem for simulators, it just feels like the "risk" aspect collapses down into outer and inner alignment for simulators again.

Thoughts on the impact of RLHF research
Jozdien · 3y

Thanks!

My take on the scaled-up models exhibiting the same behaviours feels more banal - larger models are better at simulating agentic processes and their connection to self-preservation desires etc, so the effect is more pronounced. Same cause, different routes getting there with RLHF and scale.

Thoughts on the impact of RLHF research
Jozdien · 3y

I wasn't really focusing on the RL part of RLHF in making the claim that it makes the "agentic personas" problem worse, if that's what you meant. I'm pretty on board with the idea that the actual effects of using RL as opposed to supervised fine-tuning won't be apparent until we use stronger RL or something. Then I expect we'll get even weirder effects, like separate agentic heads or the model itself becoming something other than a simulator (which I discuss in a section of the linked post).

My claim is pretty similar to how you put it - in RLHF as in fine-tuning of the kind relevant here, we're focusing the model onto outputs that are generated by a better agentic persona. But I think the effect is particularly salient with RLHF because it's likely to be scaled up more in the future, where I expect said effect to be exacerbated. I agree with the rest of it, that prompt engineering is unlikely to produce the same effect, and definitely not the same qualitative shift of the world prior.

Thoughts on the impact of RLHF research
Jozdien · 3y

Thanks for this post! I wanted to write a post about my disagreements with RLHF in a couple weeks, but your treatment is much more comprehensive than what I had in mind, and from a more informed standpoint.

I want to explain my position on a couple points in particular though - they would've been a central focus of what I imagined my post to be, points around which I've been thinking a lot recently. I haven't talked to a lot of people about this explicitly so I don't have high credence in my take, but it seems at least worth clarifying.

RLHF is less safe than imitation or conditioning generative models.

My picture of why taking ordinary generative models and conditioning them to various ends (like accelerating alignment, for example) is useful relies on a key crux: that the intelligence we're wielding is weighted by our world prior. We can expect it to be safe insofar as things normally sampled from the distribution underlying our universe are, modulo arbitrarily powerful conditionals (which degrade performance to an extent anyway) that move far away from the default world state.

So here's one of my main reasons for not liking RLHF: it removes this very satisfying property. Models that have been RLHF'd (so to speak) have different world priors in ways that aren't really all that intuitive (see Janus' work on mode collapse, or my own prior work, which addresses this effect in these terms more directly, since you've probably read the former). We get a posterior that doesn't have the nice properties we want of a prior based directly on our world, because RLHF is (as I view it) a surface-level instrument we're using to interface with a high-dimensional ontology. Making toxic interactions less likely (for example) leads to weird downstream effects in the model's simulations, because the change ripples through its various abstractions in ways specific to how they're structured inside the model, which are probably pretty different from how we structure our abstractions and how we predict changes rippling out.
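
One way to make the prior-vs-posterior contrast concrete (these are standard results rather than anything from the post, and real RLHF training need not reach the idealized optimum): conditioning only reweights within the support of the base model's world prior,

$$p(x \mid c) \;=\; \frac{p(x)\,\mathbf{1}[c(x)]}{\sum_{x'} p(x')\,\mathbf{1}[c(x')]},$$

whereas the KL-regularized RLHF objective has as its optimum a global reweighting of that prior by an exponentiated learned reward,

$$\pi^*(x) \;=\; \frac{p(x)\,\exp\!\big(r(x)/\beta\big)}{Z}, \qquad \pi^* \;=\; \arg\max_{\pi}\;\mathbb{E}_{x \sim \pi}[r(x)] \;-\; \beta\,\mathrm{KL}\!\left(\pi \,\|\, p\right),$$

which is one way of seeing why the resulting model no longer behaves like a straightforward approximation of the world distribution the base model learned.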

So, using these models now comes with the risk that when we really need them to work for pretty hard tasks, we don't have the useful safety measures implied by being weighted by a true approximation of our world.

Another reason for not liking RLHF, somewhat related to the Anthropic paper you linked: because most contexts in which RLHF is used involve agentic simulacra, RLHF focuses the model's computation on agency in some sense. My guess is that this explains the results in that paper to an extent - RLHF'd models are better at focusing on simulating agency, agency is correlated with self-preservation desires, and so on. This also seems dangerous to me because we're making agency more accessible to, and more powerful from, ordinary prompting, more powerful agency is inherently tied to properties we don't really want in simulacra, and said agency of a sort is sampled from a not-so-familiar ontology to boot.

(I've only skimmed the post for now because I'm technically on break, so it's possible I missed something crucial.)

Trying to isolate objectives: approaches toward high-level interpretability
Jozdien · 3y

Do you think the default is that we'll end up with a bunch of separate things that look like internalized objectives, such that the one used for planning can't really be identified mechanistically as such? Or that only processes where they're really useful would learn them, and that there would be multiple of them (or a third thing)? In the latter case I think the same underlying idea still applies - figuring out all of them seems pretty useful.
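
As a toy sketch of what "figuring out all of them" might look like operationally (not from the post; synthetic activations stand in for a real model, and the linear-probe setup is just one illustrative choice): probe hidden states separately for each candidate objective-like direction.

```python
# Toy sketch (hypothetical setup): if a model has learned several objective-like
# representations, one operationalization of "figuring out all of them" is to
# train a separate linear probe per candidate objective on its hidden states.
# Synthetic activations stand in for a real model here.

import numpy as np

rng = np.random.default_rng(0)
n_samples, d_model = 1000, 64

# Pretend two distinct "objective" directions exist in activation space.
obj_directions = rng.normal(size=(2, d_model))
activations = rng.normal(size=(n_samples, d_model))
# Labels: whether each hidden state is "using" objective 0 / objective 1.
labels = (activations @ obj_directions.T > 0).astype(int)

def fit_linear_probe(X: np.ndarray, y: np.ndarray, lr=0.1, steps=500) -> np.ndarray:
    """Logistic-regression probe trained with plain gradient descent."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1 / (1 + np.exp(-(X @ w)))
        w -= lr * X.T @ (p - y) / len(y)
    return w

for i in range(2):
    w = fit_linear_probe(activations, labels[:, i])
    acc = ((activations @ w > 0) == labels[:, i]).mean()
    print(f"probe for candidate objective {i}: accuracy {acc:.2f}")
```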

Trying to isolate objectives: approaches toward high-level interpretability
Jozdien · 3y

Yeah, this is definitely something I consider plausible. But I don't have a strong stance because RL mechanics could lead to there being an internal search process for toy models (unless this is just my lack of awareness of some work that proves otherwise). That said, I definitely think that work on slightly larger models would be pretty useful and plausibly alleviates this, and is one of the things I'm planning on working on.

Posts

Realistic Reward Hacking Induces Different and Deeper Misalignment · 60 karma · 1mo · 0 comments
Inoculation prompting: Instructing models to misbehave at train-time can improve run-time behavior · 72 karma · 1mo · 6 comments
Why Do Some Language Models Fake Alignment While Others Don't? · 74 karma · 4mo · 2 comments
Introducing BenchBench: An Industry Standard Benchmark for AI Strength · 21 karma · 7mo · 0 comments
BIG-Bench Canary Contamination in GPT-4 · 41 karma · 1y · 1 comment
Gradient Descent on the Human Brain · 26 karma · 2y · 0 comments
Difficulty classes for alignment properties · 8 karma · 2y · 0 comments
The Pointer Resolution Problem · 25 karma · 2y · 0 comments
Critiques of the AI control agenda · 17 karma · 2y · 8 comments
The case for more ambitious language model evals · 42 karma · 2y · 9 comments
Wikitag Contributions

Simulator Theory · 3 years ago · (+20)
Simulator Theory · 3 years ago · (+285)
Simulator Theory · 3 years ago · (+373)