## AI ALIGNMENT FORUMAF

Aryan Bhatt

Junior Alignment Researcher

Sorted by New

# Wiki Contributions

Hmmm, I suspect that when most people say things like "the reward function should be a human-aligned objective," they're intending something more like "the reward function is one for which any reasonable learning process, given enough time/data, would converge to an agent that ends up with human-aligned objectives," or perhaps the far weaker claim that "the reward function is one for which there exists a reasonable learning process that, given enough time/data, will converge to an agent that ends up with human-aligned objectives."

I wish I knew why.

Same.

I don't really have any coherent hypotheses (not that I've tried for any fixed amount of time by the clock) for why this might be the case. I do, however, have a couple of vague suggestions for how one might go about gaining slightly more information that might lead to a hypothesis, if you're interested.

The main one involves looking at the local nonlinearities of the few layers after the intervention layer at various inputs, by which I mean examining diff(t) = f(input+t*top_right_vec) - f(input) as a function of t (for small values of t, in particular) (where f=nn.Sequential({the n layers after the intervention layer}) for various small integers n).

One of the motivations for this is that it feels more confusing that [adding works and subtracting doesn't] than that [increasing the coefficient strength does diff things in diff regimes, ie for diff coefficient strengths], but if you think about it, both of those are just us being surprised/confused that the function I described above is locally nonlinear for various values of t.[1] It seems possible, then, that examining the nonlinearities in the subsequent few layers could shed some light on a slightly more general phenomenon that'll also explain why adding works but subtracting doesn't.

It's also possible, of course, that all the relevant nonlinearities kick in much further down the line, which would render this pretty useless. If this turns out to be the case, one might try finding "cheese vectors" or "top-right vectors" in as late a layer as possible[2], and then re-attempt this.

1. ^

We only care more about the former confusion (that adding works and subtracting doesn't) because we're privileging t=0, which isn't unreasonable, but perhaps zooming out just a bit will help, idk

2. ^

I'm under the impression that the current layer wasn't chosen for much of a particular reason, so it might be a simple matter to just choose a later layer that performs nearly as well?

Sorry for the pedantic comment, but I think you might've meant to have  in the denominator here.

Thanks for the great post! I have a question, if it's not too much trouble:

Sorry for my confusion about something so silly, but shouldn't the following be "when "?

When  there is no place where the derivative vanishes

I'm also a bit confused about why we can think of  as representing "which moment of the interference distribution we care about."

Perhaps some of my confusion here stems from the fact that it seems to me that the optimal number of subspaces, , is an increasing function of , which doesn't seem to line up with the following:

Hence when  is large we want to have fewer subspaces

What am I missing here?