Rohin Shah — AI Alignment Forum

AI ALIGNMENT FORUM
AF

I bucket this under "given this ratio of right/wrong responses, you think a smart alignment researcher who's paying attention can keep it in a corrigibility basin even as capability levels rise?". Does that feel inaccurate, or, just, not how you'd exactly put it?

I'm pretty sure it is not that. When people say this, they are usually just asking the question "Will current models try to take over or otherwise subvert our control (including incompetently)?" and noticing that the answer is basically "no".^[1] What they use this to argue for can then vary:

1. Current models do not provide much evidence one way or another for existential risk from misalignment (in contrast to frequent claims that "the doomers were right").
2. Given tremendous uncertainty, our best guess should be that future models are like current models, and so future models will not try to take over, and so existential risk from misalignment is low.
3. Some particular threat model predicted that even at current capabilities we should see significant misalignment, but we don't see this, which is evidence against that particular threat model.^[2]

I agree with (1), disagree with (2), and for (3) it depends on details.

In Leo's case in particular I don't think he's using the observation for much, it's mostly just a throwaway claim that's part of the flow of the comment, but inasmuch as it is being used it is to say something like "current AIs aren't trying to subvert our control, so it's not completely implausible on the face of it that the first automated alignment researcher to which we delegate won't try to subvert our control", which is just a pretty weak claim and seems fine, and doesn't imply any kind of extrapolation to superintelligence. I'd be surprised if this was an important disagreement with the "alignment is hard" crowd.

^[1] There are demos of models doing stuff like this (e.g. blackmail) but only under conditions selected highly adversarially. These look fragile enough that overall I'd still say current models are more aligned than e.g. rationalists (who under adversarially selected conditions have been known to intentionally murder people).
^[2] E.g. one naive threat model says "Orthogonality says that an AI system's goals are completely independent of its capabilities, so we should expect that current AI systems have random goals, which by fragility of value will then be misaligned". Setting aside whether anyone ever believed in such a naive threat model, I think we can agree that current models are evidence against such a threat model.

Research Agenda: Synthesizing Standalone World-Models

Rohin Shah 3d* 40

Which, importantly, includes every fruit of our science and technology.

I don't think this is the right comparison, since modern science / technology is a collective effort and so can only cumulate thinking through mostly-interpretable steps. (This may also be true for AI, but if so then you get interpretability by default, at least interpretability-to-the-AIs, at which point you are very likely better off trying to build AIs that can explain that to humans.)

In contrast, I'd expect individual steps of scientific progress that happen within a single mind often are very uninterpretable (see e.g. "research taste").

If we understood an external superhuman world-model as well as a human understands their own world-model, I think that'd obviously get us access to tons of novel knowledge.

Sure, I agree with that, but "getting access to tons of novel knowledge" is nowhere close to "can compete with the current paradigm of building AI", which seems like the appropriate bar given you are trying to "produce a different tool powerful enough to get us out of the current mess".

Perhaps concretely I'd wildly guess with huge uncertainty that this would involve an alignment tax of ~4 GPTs, in the sense that if you had an interpretable world model from GPT-10 similar in quality to a human's understanding of their own world model, that would be similarly useful as GPT-6.

Research Agenda: Synthesizing Standalone World-Models

Rohin Shah 9d 71

a human's world-model is symbolically interpretable by the human mind containing it.

Say what now? This seems very false:

- See almost anything physical (riding a bike, picking things up, touch typing a keyboard, etc). If you have a dominant hand / leg, try doing some standard tasks with the non-dominant hand / leg. Seems like if the human mind could symbolically interpret its own world model this should be much easier to do.
- Basically anything to do with vision / senses. Presumably if vision was symbolically interpretable to the mind then there wouldn't be much of a skill ladder to climb for things like painting.
- Symbolic grammar usually has to be explicitly taught to people, even though ~everyone has a world model that clearly includes grammar (in the sense that they can generate grammatical sentences and identify errors in grammar).

Tbc I can believe it's true in some cases, e.g. I could believe that some humans' far-mode abstract world models are approximately symbolically interpretable to their mind (though I don't think mine is). But it seems false in the vast majority of domains (if you are measuring relative to competent, experienced people in those domains, as seems necessary if you are aiming for your system to outperform what humans can do).

Trust me bro, just one more RL scale up, this one will be the real scale up with the good environments, the actually legit one, trust me bro

Rohin Shah 2mo 2423

I agree with this directionally, and generally tend to believe that certain kinds of straight lines will continue to be straight. But also iirc you rely a lot on the METR trend line, which is (a) noisy, (b) over a short timescale (~2 years^[1]), and (c) not a great measure of what people care about^[2]^[3]. I don't think you should strongly expect that particular line to continue to be straight over a long time period.

(Though I tend to think that it's more likely to slow down by 2027 than to speed up, relative to what you'd predict by extrapolating the 2024-2025 trend, so I agree with the overall conclusion of the post. And obviously one should have non-trivial probability on each of {slow down, stay the same, speed up}, as you say.)

^[1] I'm excluding GPT-2 and GPT-3, which seem very noisy.
^[2] Compared to things like Moore's Law, or energy efficiency of solar, or pretraining perplexity, or other similar things where we observe straight lines.
^[3] None of this is to say that the METR graph is bad! It is the best empirical evidence on timelines I'm aware of.

Optimally Combining Probe Monitors and Black Box Monitors

Rohin Shah 2mo 30

From the paper:

If a non-monotonic relationship exists between monitor scores and likelihood ratios, auditing the highest-scoring outputs is suboptimal. [...] This means ΠNP will audit some moderately high-scoring outputs (e.g., those with a monitor score of 7.2) before auditing certain higher-scoring outputs

I agree this is clearly better if you trust your LLRs to be 100% accurate, but ultimately this has to be fit on some training set and then we are depending on it generalizing to deployment. However, that generalization could fail just for normal ML reasons (the deployment distribution is different, the LLR scores were noisily estimated) or because the model is scheming and adapts to your monitoring strategy.

Suppose I want to rule by fiat that for outputs O1 and O2 where O1 gets higher scores than O2, we are guaranteed to be at least as likely to audit O1 as O2. How would you suggest I go about doing that?

After writing the above I remembered you said the word "kernel" and looked in the paper for that and saw you did mention this limitation, and suggested that practitioners "increase kernel width or enforce monotonicity". So I guess my question is, how should I go about enforcing monotonicity? I still have to map two scores (m1, m2) to a decision of whether or not to audit; it's unclear to me that I can brute force that problem. It is much easier to brute force when I have a single score (the LLR).
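To make the single-score case concrete, here is a minimal sketch of one way to enforce monotonicity there: isotonic regression (pool-adjacent-violators) on the estimated LLRs. This is not the paper's method. The bin scores and LLR values are made up for illustration, and `isotonic_non_decreasing` and `audit_order` are hypothetical helper names. It also treats every bin as equally weighted, which a real fit would probably replace with weights based on how much data each bin has.

```python
from typing import List, Sequence


def isotonic_non_decreasing(values: Sequence[float]) -> List[float]:
    """Pool-adjacent-violators: least-squares non-decreasing fit (equal weights)."""
    # Each block stores [sum of values, count]; its fitted value is the mean.
    blocks: List[List[float]] = []
    for v in values:
        blocks.append([float(v), 1.0])
        # Merge backwards while the previous block's mean exceeds the last one's.
        while len(blocks) > 1 and (
            blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]
        ):
            total, count = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
    fitted: List[float] = []
    for total, count in blocks:
        fitted.extend([total / count] * int(count))
    return fitted


def audit_order(llr: Sequence[float], scores: Sequence[float]) -> List[float]:
    """Monitor scores in the order they would be audited:
    highest LLR first, ties broken by the higher monitor score."""
    ranked = sorted(zip(llr, scores), reverse=True)
    return [score for _, score in ranked]


# Hypothetical LLR estimates for monitor-score bins, sorted by increasing score.
# The dip at score 8.0 is the kind of non-monotonicity discussed above.
bin_scores = [5.0, 6.0, 7.2, 8.0, 9.0]
raw_llr = [-1.0, 0.5, 2.0, 1.0, 3.0]

monotone_llr = isotonic_non_decreasing(raw_llr)

print(audit_order(raw_llr, bin_scores))       # 7.2 is audited before 8.0
print(audit_order(monotone_llr, bin_scores))  # audit order now follows the score
```

With one score, the projection onto monotone functions is a single pass over sorted bins. With two scores (m1, m2) the constraint is a partial order rather than a total one, so this trick does not carry over directly, which is the difficulty raised above.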

Training a Reward Hacker Despite Perfect Labels

Rohin Shah 2mo 122

I think Oliver's pushback in another comment is getting strongly upvoted because, given a description of your experimental setup, a bunch of people aside from you/Quintin/Steve would have assigned reasonable probability to the right answer. But I wanted to emphasize that I consider generating an experiment that turns out to be interesting (as your frame did) to be the thing that most of the points should be assigned for.

The experimental setup (in the sense of getting bad behavior despite perfect labels on the training set) was also done prior to the popularization of reward-as-chisel.

[Research Note] Optimizing The Final Output Can Obfuscate CoT

Rohin Shah 3mo 62

Great research, thanks for doing it!

My sense is that there are a lot of caveats on this result:

- The reward hack is easy to execute (likely doesn't require reasoning, or only very little reasoning).
- There is a single task; the model can learn very specific heuristics to that task rather than being more general.
- The reward function is deliberately designed to cause obfuscation, and is about as close to process-based as you can get while not looking directly at the thoughts. (The reward literally strips out the actual command.)
- The task is sabotaged, so that reward hacking is required 100% of the time to get rewards.
- The results don't seem that scary? For the CoT judge penalty to go down from ~0.32 to ~0.13, the pass rate goes down 11%. It's not quite clear how to interpret this given how artificial the setting is, but things you could take away -- (a) obfuscatory rewards are not competitive or (b) safety rewards do actually succeed in reducing the bad behavior.

How hard did you have to try to get this result? Was this just the first thing you tried, where you made everything as easy as possible to get some result? Or did you try a bunch of things, and have to relax them to get obfuscation to arise? If it's the latter, I feel like this should be overall good news for CoT monitorability.

“Behaviorist” RL reward functions lead to scheming

Rohin Shah 3mo 51

If we use Myopic Optimization from Non-myopic Approval (MONA), then the AI will not want to engage in sneaky power-seeking behaviors, because such behaviors will not help with the short-term tasks that the AI-in-a-lab wants to do
I mainly rule out MONA for capabilities reasons; see Thoughts on “Process-Based Supervision” §5.3.

I'd make this argument even for regular outcomes-based RL. Presumably, whenever you are providing rewards (whether in brain-like AGI or RLHF for LLMs), they are based on actions that the AI took over some bounded time in the past; let's call that time bound T. Then presumably sneaky power-seeking behaviors should be desired inasmuch as they pay out within time T. Currently T is pretty small, so I don't expect such behaviors to pay out. Maybe you're imagining that T will increase a bunch?

Secondarily, I have a concern that, even if a setup seems like it should lead to an AI with exclusively short-term desires, it may in fact accidentally lead to an AI that also has long-term desires. See §5.2.1 of that same post for specifics.

Sure, I agree with this (and this is why I want to do MONA even though I expect outcomes-based RL will already be quite time-limited), but if this is your main argument I feel like your conclusion should be way more measured (more like "this might happen" rather than "this is doomed").

3.5 “The reward function will get smarter in parallel with the agent being trained”
This one comes from @paulfchristiano (2022): “I’m most optimistic about approaches where RL is only performed on a reward function that gets smarter in parallel with the agent being trained.”
I, umm, don’t really know what he’s getting at here. Maybe something like: use AI assistants to better monitor and catch sneaky behavior? If so, (1) I don’t expect the AI assistants to be perfect, so my story still goes through, and (2) there’s a chicken-and-egg problem that will lead to the AI assistants being misaligned and scheming too.

Re: (1), the AI assistants don't need to be perfect. They need to be good enough that the agent being trained cannot systematically exploit them, which is a much lower bar than "perfect" (while still being quite a high bar). In general I feel like you're implicitly treating an adversarial AI system as magically having perfect capabilities while any AI systems that are supposed to be helpful are of course flawed and have weaknesses. Pick an assumption and stick with it!

Re: (2), you start with a non-scheming human, and bootstrap up from there, at each iteration using the previous aligned assistant to help oversee the next AI. See e.g. IDA.

There are lots of ways this could go wrong, I'm not claiming it will clearly work, but I don't think either of your objections matters much.

Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety

Rohin Shah 3mo 124

A couple of years ago, the consensus in the safety community was that process-based RL should be prioritized over outcome-based RL [...]
I don't think using outcome-based RL was ever a hard red line, but it was definitely a line of some kind.

I think many people at OpenAI were pretty explicitly keen on avoiding (what you are calling) process-based RL in favor of (what you are calling) outcome-based RL for safety reasons, specifically to avoid putting optimization pressure on the chain of thought. E.g. I argued with @Daniel Kokotajlo about this, I forget whether before or after he left OpenAI.

There were maybe like 5-10 people at GDM who were keen on process-based RL for safety reasons, out of thousands of employees.

I don't know what was happening at Anthropic, though I'd be surprised to learn that this was central to their thinking.

Overall I feel like it's not correct to say that there was a line of some kind, except under really trivial / vacuous interpretations of that. At best it might apply to Anthropic (since it was in the Core Views post).

Separately, I am still personally keen on process-based RL and don't think it's irrelevant -- and indeed we recently published MONA which imo is the most direct experimental paper on the safety idea behind process-based RL.

In general there is a spectrum between process- and outcome-based RL, and I don't think I would have ever said that we shouldn't do outcome-based RL on short-horizon tasks with verifiable rewards; I care much more about the distinction between the two for long-horizon fuzzy tasks.

I do agree that there are some signs that people will continue with outcome-based RL anyway, all the way into long-horizon fuzzy tasks. I don't think this is a settled question -- reasoning models have only been around a year, things can change quite quickly.

None of this is to disagree with your takeaways, I roughly agree with all of them (maybe I'd have some quibbles about #2).

The bitter lesson of misuse detection

Rohin Shah 4mo 55

Nice work, it's great to have more ways to evaluate monitoring systems!

A few points for context:

- The most obvious application for misuse monitoring is to run it at deployment on every query to block harmful ones. Since deployments are so large scale, the expense of the monitoring system matters a ton. If you use a model to monitor itself, that is roughly doubling the compute cost of your deployment. It's not that people don't realize that spending more compute works better; it's that this is not an expense worth paying at current margins.
  - You mention this, but say it should be considered because "frontier models are relatively inexpensive" -- I don't see why the cost per query matters, it's the total cost that matters, and that is large.
  - You also suggest that robustness is important for "safety-critical contexts" -- I agree with this, and would be on board if the claim was that users should apply strong monitoring for safety-critical use cases -- but elsewhere you seem to be suggesting that this should be applied to very broad and general deployments which do not seem to me to be "safety-critical".
- Prompted LLMs are often too conservative about what counts as a harmful request, coming up with relatively stretched ways in which some query could cause some small harm. Looking at the examples of benign prompts in the appendix, I think they are too benign to notice this problem (if it exists). You could look at performance on OverRefusalBench to assess this.
  - Note that blocking / refusing benign user requests is a really terrible experience for users. Basically any misuse detection system is going to flag things often enough that this is a very noticeable effect (like even at 99% accuracy people will be annoyed). So this is a pretty important axis to look at.
    - One might wonder how this can be the case when often AIs will refuse fairly obviously benign requests -- my sense is that usually in those cases, it's because of a detection system that was targeted at something much more serious (e.g. CSAM) that is over-triggering, but where the companies accept that cost because of the seriousness of the thing they're detecting.
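As a rough illustration of why even 99% accuracy is noticeable, here is a small sketch. It assumes (this is an assumption, not something stated above) that "99% accuracy" means a 1% false-positive rate on benign queries and that each query is flagged independently. The function name and the query counts are made up for illustration.

```python
def prob_at_least_one_false_block(false_positive_rate: float, num_queries: int) -> float:
    """Chance that a benign user sees at least one blocked request,
    assuming each query is flagged independently at the given rate."""
    return 1.0 - (1.0 - false_positive_rate) ** num_queries


# Hypothetical usage levels: 10, 100, and 500 benign queries from one user.
for n in (10, 100, 500):
    p = prob_at_least_one_false_block(0.01, n)
    print(f"{n} queries: {p:.1%} chance of at least one wrongly blocked request")
```

Under these assumptions, a user who sends 100 benign queries has roughly a 63% chance of hitting at least one wrongful block, so the over-refusal rate is felt by most regular users even when per-query accuracy looks high.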
