Alex Mallen

Redwood Research


When "humans who would try to intervene are stopped or killed" so that they can never intervene again, that seems like an action intended to gain the long-term power necessary to display optimal news forever.

I weakly expect that this story is describing AI that intervenes this way for fairly myopic goals, like myopic instrumental self-preservation, which have the effect of taking long-term power. E.g. the AI wouldn't really care to set up a system that would lock in the AI's power in 10 years, but give it no power before then. If that's false, then I'd call it a behavioral schemer. It's a broad definition, I know, but the behavior is ultimately what matters so that's what I'm trying to get at.

I would have thought that the main distinction between schemers and reward hackers was how they came about

Do you mean terminal reward seekers, not reward hackers? I use reward hacking as a description of a behavior in training, not a motivation in training, and I think many training-time schemers were reward hackers in training. I agree terminal reward seekers can potentially have long-term goals and collude across instances like a schemer, though the stories are a bit complicated.

virtually every kind of misalignment catastrophe will involve substantial material action that gains the AIs long-term power. (Saliently: This includes classic reward-hackers in a "you get what you measure" catastrophe scenario.)

It doesn't count the classic reward hackers in "you get what you measure" because the reward hackers were not trying to gain long-term power with their actions. The key difference is between "substantial material action to gain long-term power" and "substantial material action that gains the AIs long-term power". I could clarify by adding the word "intended".

Let me know if you think there's a better way of getting at "an AI that behaves like you'd normally think of a schemer behaving in the situations where it materially matters".

is this empirically how people use "schemer",

Empirically, I think people we talk to usually mean the intersection of behavioral scheming and training-gaming. In broader circles, people empirically use "scheming" quite liberally, including for myopic forms of deception.

s.t. I should give up on ... using "scheming" as referring to training-time scheming

It's probably fine to use "scheming" in this way in conversation with sufficient context.

if redwood says that something is intended to reduce "catastrophic risk from schemers", should I interpret that as ~synonymous with "catastrophic risk from misaligned AI"

The scheming that Redwood-style AI control tries to defend against is, as a matter of control evaluation methodology, defined as a model optimizing against your control protocol. This is technically different from an arbitrary misaligned AI, since it's a worst-case misaligned AI.

I think this does a great job of reviewing the considerations regarding what goals would be incentivized by SGD by default, but I think that, in order to make predictions about which goals will end up being relevant in future AIs, we have to account for the outer loop of researchers studying model generalization and changing their training processes.

For example, reward hacking seems very likely by default from RL, but many forms of it are relatively easy to notice, and AI projects will be incentivized to correct it. On the other hand, ICGs might be harder to notice and carry weaker incentives for correction.

This is really exciting to see, not only as a way to better understand what people at Anthropic are thinking about for ASL-4 safety cases, but also as a survey of how various safety agendas might play a role in practice!

Here are some thoughts about the details:

  • Given that the mechinterp safety case involves running a kind of control evaluation, other monitoring methods can largely be substituted into the eval. This could enable cheap testing of other monitors, e.g. ones based on supervised probing or mechanistic anomaly detection.
  • I like the idea of checking whether a feature is causally necessary for defection in your model organisms, as this gives extra certainty about generalization recall.
  • "Simulation Procedure S, which we are confident generates situations from a distribution that the model cannot distinguish from the genuine deployment distribution." (As others have probably mentioned) this isn’t necessary for control safety cases as long as you’ve done a sufficient job of eliciting performance from your model (e.g. i.i.d. elicitation with human demonstrations of attacks, which, as is mentioned, might be difficult).
  • The third kind of safety case is pretty novel to me, which makes it exciting, but I’m not yet convinced by it, for reasons that are mostly already outlined in the blogpost.
  • The evidence becomes very diluted by multiple hypothesis testing when trying to rule out any cluster of deceptive personas being reinforced. We probably wouldn’t be able to get convincing affirmative evidence with this approach without a meta-level eval.
  • I think exploration hacking shouldn’t be a concern here if you can get the 10-90% with SFT. (EDIT: However you might be concerned that the 10-90% does not contain the particular deceptive personas that get reinforced.)
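To make the multiple-hypothesis-testing dilution concrete (my own toy sketch, not from the post): under a simple Bonferroni-style correction, the per-cluster evidence threshold tightens linearly with the number of persona clusters you need to rule out, so ruling out a large space of deceptive-persona clusters demands much stronger evidence per cluster.

```python
def bonferroni_threshold(overall_alpha: float, k: int) -> float:
    """Per-hypothesis significance level needed so that the chance of
    any false positive across k independent tests stays below overall_alpha."""
    return overall_alpha / k

# Ruling out 100 persona clusters at 95% overall confidence requires
# 20x stronger evidence per cluster than ruling out 5.
for k in (1, 5, 100):
    print(f"k={k:3d}  per-test alpha={bonferroni_threshold(0.05, k):.5f}")
```

This is of course a crude model of the dilution effect, but it illustrates why affirmative evidence over many hypothesized persona clusters likely needs a meta-level eval rather than per-cluster testing.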