
TLDR: We don't have to hope for generalization of our oversight procedures. Instead, we can 1) define a proxy failure that we can evaluate and 2) worst-case against our oversight procedure on the actual distribution we care about (but using the proxy failure so that we have ground truth).


Causal Scrubbing: My main problem with causal scrubbing as a solution here is that it only guarantees the sufficiency, but not the necessity, of your explanation. As a result, my understanding is that a causal-scrubbing-based evaluation would admit a trivial explanation that simply asserts that the entire model is relevant for every behavior.

Redwood has been experimenting with learning (via gradient descent) causal scrubbing explanations that somewhat address your necessity point. Specifically:

- "Larger" explanations are penalized more (size refers

Yes! The important part is decomposing activations (not necessarily linearly). I can rewrite my MLP as:

MLP(x) = f(x) + (MLP(x) - f(x))

and then claim that the MLP(x) - f(x) term is unimportant. There is an example of this in the parentheses balancer example.
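To make the rewrite concrete, here is a minimal sketch in plain Python. Everything in it is an assumption made for illustration: the toy weights, the choice of f(x) = |x|, and the use of resampling the residual from an unrelated input as the test of "unimportant". It is not the parentheses balancer example and not Redwood's actual code.

```python
import random

def mlp(x, w1, b1, w2):
    """One-hidden-layer ReLU MLP mapping a scalar to a scalar."""
    hidden = [max(0.0, w * x + b) for w, b in zip(w1, b1)]
    return sum(v * h for v, h in zip(w2, hidden))

def f(x):
    """Hypothesized 'important' part of the computation (assumed: |x|)."""
    return abs(x)

def residual(x, params):
    """The leftover term MLP(x) - f(x) that the explanation calls unimportant."""
    return mlp(x, *params) - f(x)

def scrubbed_mlp(x, x_other, params):
    """Keep f on the real input, but take the residual from another input."""
    return f(x) + residual(x_other, params)

# Hypothetical toy weights: relu(x) + relu(-x) = |x|, plus one small extra neuron.
params = ([1.0, -1.0, 0.5], [0.0, 0.0, -1.0], [1.0, 1.0, 0.1])

random.seed(0)
xs = [random.uniform(-3, 3) for _ in range(1000)]
others = xs[:]
random.shuffle(others)  # unrelated inputs used to resample the residual

# The rewrite itself changes nothing: MLP(x) = f(x) + (MLP(x) - f(x)).
max_rewrite_gap = max(
    abs(mlp(x, *params) - (f(x) + residual(x, params))) for x in xs
)

# Resampling the residual changes the output only slightly if it is unimportant.
mean_err = sum(
    abs(mlp(x, *params) - scrubbed_mlp(x, xo, params))
    for x, xo in zip(xs, others)
) / len(xs)

print(f"max rewrite gap: {max_rewrite_gap:.2e}")
print(f"mean error after resampling the residual: {mean_err:.4f}")
```

In this toy setup the residual is at most 0.05 on the sampled range, so swapping it for the residual of a different input can move the output by at most 0.05. Note that f here is not linear in x, so the decomposition does not rely on linearity.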


Thanks! Can you give a non-linear decomposition example?

Nice summary! One small nitpick:

> In the features framing, we don’t necessarily assume that features are aligned with circuit components (eg, they could be arbitrary directions in neuron space), while in the subgraph framing we focus on components and don’t need to show that the components correspond to features

This feels slightly misleading. In practice, we often do claim that sub-components correspond to features. We can "rewrite" our model into an equivalent form that better reflects the computation it's performing. For example, if we claim that a ce...


Thanks for the clarification! If I'm understanding correctly, you're saying that the important part is decomposing activations (linearly?) and that there's nothing really baked in about what a component can and cannot be. You normally focus on components, but this can also fully encompass the features-as-directions framing, by just saying that "the activation component in that direction" is a feature?

Good question! As you suggest in your comment, increasing marginal returns to capacity induce monosemanticity, and decreasing marginal returns induce polysemanticity.

We observe this in our toy model. We didn't clearly spell this out in the post, but the marginal benefit curves labelled from A to F correspond to points in the phase diagram. At the top of the phase diagram where features are dense, there is no polysemanticity because the marginal benefit curves are increasing (see curves A and B). In the feature sparse region (points D, E, F), we see...


Thanks for this.
Got it. I imagine that for some particular architectures, and given some particular network weights, you can numerically compute the marginal returns to capacity curves, but that it's hard to express capacity analytically as a function of network weights since you really need to know what the particular features are in order to compute returns to capacity -- is that correct?

You didn't mention the policy implications, which I think are one of the most impactful reasons, if not the most impactful, to care about misuse. Government regulation seems super important long-term to prevent people from deploying dangerous models publicly, and the only way to get that is by demonstrating that models are actually scary.