Principles for Alignment/Agency Projects

Thanks for the post! As always I broadly agree, but I have a bunch of nitpicks.

You can save yourself several years of time and effort by actively trying to identify the Hard Parts and focus on them, rather than avoid them. Otherwise, you'll end up burning several years on ideas which don't actually leave the field better off.

I agree that avoiding the Hard parts is rarely productive, but you also don't address one relevant concern: what if the Hard part is not merely Hard, but actually Impossible? In this case your advice can also be cashed out by trying to prove it is impossible instead of avoiding it. And just like with most impossibility results in TCS, it's possible that even if the precise formulation is impossible, it often just means that you need to reframe the problem a bit.

Mostly, I think the hard parts are things like "understand agency in general better" and "understand what's going on inside the magic black boxes". If your response to such things is "sounds hard, man", then you have successfully identified (some of) the Hard Parts.

I expect you would also say that a crucial hard part many people are avoiding is "how to learn human values?", right? (Not the true names, but a useful pointer)

The point of the intuitive story is to steer our search. Without it, we risk blind empiricism: just cataloguing patterns without building general models/theory/understanding for what's going on. In that mode, we can easily lose track of the big picture goal and end up cataloguing lots of useless stuff. An intuitive story gives us big-picture direction, and something to aim for. Even if it turns out to be wrong!

I want to note that the failure mode of blind theory here is to accept any story, and thus make the requirement of a story completely impotent to guide research. There's an art (and hopefully a science) to finding stories that bias towards productive mistakes.

Most of the value and challenge is in finding the right operationalizations of the vague concepts involved in those arguments, such that the argument is robustly correct and useful. Because it's where most of the value and most of the challenge is, finding the right operationalization should typically be the central focus of a project.

I expect you to partially disagree, but there's not always a "right" operationalization, and there's a failure mode where one falls in love with their neat operationalization, making the misses parts of the phenomena invisible.

Don’t just run a black-box experiment on a network, or try to prove a purely behavioral theorem. We want to talk about internal structure.

I want to say that you should start with behavioral theorem, and often the properties you want to describe might make more sense behaviorally, but I guess you're going to answer that we have evidence that this doesn't work in Alignment and so it is avoiding the Hard part. Am I correct?

Partly, opening the black box is about tackling the Hard Parts rather than avoiding them. Not opening the black box is a red flag; it's usually a sign of avoiding the Hard Parts.

One formal example of this is the relativization barrier in complexity theory, which tells you that you can't prove (and a bunch of other separations) using only techniques using algorithms as blackboxes instead of looking at the structure.

Once you're past that stumbling block, I think the most important principles are Derive the Ontology and Operationalize. These two are important for opposing types of people. Some people tend to stay too abstract and avoid committing to an ontology, but never operationalize and therefore miss out on the main value-add. Other people operationalize prematurely, adopting ad-hoc operationalizations, and Deriving the Ontology pretty strongly dicourages that.

Agreed that it's a great pair of advice to keep in mind!

[-]johnswentworth3y82

I expect you would also say that a crucial hard part many people are avoiding is "how to learn human values?", right? (Not the true names, but a useful pointer)

Yes, although I consider that one more debatable.

I expect you to partially disagree, but there's not always a "right" operationalization...

When there's not a "right" operationalization, that usually means that the concepts involved were fundamentally confused in the first place.

I want to say that you should start with behavioral theorem, and often the properties you want to describe might make more sense behaviorally, but I guess you're going to answer that we have evidence that this doesn't work in Alignment and so it is avoiding the Hard part. Am I correct?

Actually, I think starting from a behavioral theorem is fine. It's just not where we're looking to end up, and the fact that we want to open the black box should steer what starting points we look for, even when those starting points are behavioral.

[-]Koen.Holtman3y62

I generally agree with you on the principle Tackle the Hamming Problems, Don't Avoid Them.

That being said, some of the Hamming problems I see that are being avoided most on this forum, and in the AI alignment community, are

Do something that will affect policy in a positive way
Pick some actual human values, and then hand-encode these values into open source software components that can go into AI reward functions

[-]Steven Byrnes3y52

Because each module has lots of stuff going on inside, but only a low-dimensional interface, there should be many ways to change around the insides of a module while keeping the externally-visible behavior the same. Because such changes don't change behavior, they don't change system performance. So, we expect that modularity implies lots of degrees-of-freedom in parameter space, i.e. broad peaks.

I know it's a bit off-topic, but FWIW I don't immediately share this intuition. If there are "many ways to change around the insides of a module while keeping the externally-visible behavior the same", then if the whole network is just one "module" (i.e. it's not modular at all), can't I likewise say there are "many ways to change around the insides of [the one module which comprises the entire network] while keeping the externally-visible behavior the same"?

[-]johnswentworth3y30

Yup. I'm also not entirely convinced by this argument, for the same reason.

[-]adamShimi3y20

Well, isn't having multiple modules a precondition to something being modular? That seems like what's happening in your example: it has only one module, so it doesn't even make sense to apply John's criterion.

AI ALIGNMENT FORUM
AF

AI ALIGNMENT FORUM
AF

47

Principles for Alignment/Agency Projects

47

Tackle the Hamming Problems, Don't Avoid Them

Have An Intuitive Story Of What We're Looking For

Operationalize

Derive the Ontology, Don't Assume It

Open The Black Box

Relative Importance of These Principles