"John, what do you think of this idea for an alignment research project?"

I get questions like that fairly regularly. How do I go about answering? What principles guide my evaluation? Not all of my intuitions for what makes a project valuable can easily be made legible, but I think the principles in this post capture about 80% of the value.

Tackle the Hamming Problems, Don't Avoid Them

Far and away the most common failure mode among self-identifying alignment researchers is to look for Clever Ways To Avoid Doing Hard Things (or Clever Reasons To Ignore The Hard Things), rather than just Directly Tackling The Hard Things.

The most common pattern along these lines is to propose outsourcing the Hard Parts to some future AI, and "just" try to align that AI without understanding the Hard Parts of alignment ourselves. The next most common pattern is to argue that, since Hard Parts are Hard, we definitely don't have enough time to solve them and should therefore pretend that we're going to solve alignment while ignoring them. Third most common is to go into field building, in hopes of getting someone else to solve the Hard Parts. (Admittedly these are not the most charitable summaries.)

There is value in seeing how dumb ideas fail. Most of that value is figuring out what the Hard Parts of the problem are - the taut constraints which we run into over and over again, which we have no idea how to solve. (If it seems pretty solvable, it's probably not a Hard Part.) Once you can recognize the Hard Parts well enough to try to avoid them, you're already past the point where trying dumb ideas has much value.

On a sufficiently new problem, there is also value in checking dumb ideas just in case the problem happens to be easy. Alignment is already past that point; it's not easy.

You can save yourself several years of time and effort by actively trying to identify the Hard Parts and focus on them, rather than avoid them. Otherwise, you'll end up burning several years on ideas which don't actually leave the field better off. That's one of the big problems with trying to circumvent the Hard Parts: when the circumvention inevitably fails, we are still no closer to solving the Hard Parts. (It has been observed both that alignment researchers mostly seem to not be tackling the Hard Parts, and that alignment research mostly doesn't seem to build on itself; I claim that the latter is a result of the former.)

Mostly, I think the hard parts are things like "understand agency in general better" and "understand what's going on inside the magic black boxes". If your response to such things is "sounds hard, man", then you have successfully identified (some of) the Hard Parts.

Have An Intuitive Story Of What We're Looking For

One project going right now is looking at how modularity in trained systems corresponds to broad peaks in parameter space. Intuitive story for that: we have two "modules", each with lots of stuff going on inside, but only a relatively-low-dimensional interface between them. Because each module has lots of stuff going on inside, but only a low-dimensional interface, there should be many ways to change around the insides of a module while keeping the externally-visible behavior the same. Because such changes don't change behavior, they don't change system performance. So, we expect that modularity implies lots of degrees-of-freedom in parameter space, i.e. broad peaks.
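
A toy version of the degrees-of-freedom count in this story, as a sketch only. It assumes a purely linear "module" whose insides are two weight matrices A and B and whose externally visible behavior is the small product B @ A. The function name flat_directions and the sizes used are made up for illustration, and none of this is the project's actual method. It counts the parameter directions which leave behavior unchanged to first order.

```python
import numpy as np

def flat_directions(hidden_width, interface_dim, seed=0):
    """Count first-order behavior-preserving directions of a linear toy module.

    The module computes x -> B @ A @ x, with A of shape (hidden_width, interface_dim)
    and B of shape (interface_dim, hidden_width). Its external behavior is the
    interface_dim x interface_dim matrix M = B @ A.
    Returns (number of parameters, number of flat directions).
    """
    rng = np.random.default_rng(seed)
    A = rng.normal(size=(hidden_width, interface_dim))
    B = rng.normal(size=(interface_dim, hidden_width))

    n_params = 2 * hidden_width * interface_dim
    # Jacobian of vec(B @ A) with respect to every entry of A, then every entry of B.
    # It uses d(B @ A) = B @ dA + dB @ A, one unit perturbation at a time.
    J = np.zeros((interface_dim * interface_dim, n_params))
    col = 0
    for i in range(hidden_width):
        for j in range(interface_dim):
            dA = np.zeros_like(A)
            dA[i, j] = 1.0
            J[:, col] = (B @ dA).ravel()
            col += 1
    for i in range(interface_dim):
        for j in range(hidden_width):
            dB = np.zeros_like(B)
            dB[i, j] = 1.0
            J[:, col] = (dB @ A).ravel()
            col += 1

    # Directions in the null space of J change the parameters but not B @ A.
    rank = np.linalg.matrix_rank(J)
    return n_params, n_params - rank

wide = flat_directions(hidden_width=8, interface_dim=2)
narrow = flat_directions(hidden_width=2, interface_dim=2)
print("wide insides:  ", wide)
print("narrow insides:", narrow)
```

With a hidden width of 8 and a 2-dimensional interface, 28 of the 32 parameter directions are flat; shrinking the insides to width 2 leaves 4 flat directions out of 8. This only illustrates the "lots of stuff inside, small interface" half of the story in a linear setting; it says nothing about how to find modules in a real trained net.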

This story is way too abstract to be able to look for immediately in a trained net. How do we operationalize "modules", and find them? How do we operationalize "changes in a module", especially since parameter space may not line up very neatly with functional modules? But that's fine; the story can be pretty abstract.

The point of the intuitive story is to steer our search. Without it, we risk blind empiricism: just cataloguing patterns without building general models/theory/understanding for what's going on. In that mode, we can easily lose track of the big picture goal and end up cataloguing lots of useless stuff. An intuitive story gives us big-picture direction, and something to aim for. Even if it turns out to be wrong!

Operationalize

It's relatively easy to make vague/abstract intuitive arguments. Most of the value and challenge is in finding the right operationalizations of the vague concepts involved in those arguments, such that the argument is robustly correct and useful. Because it's where most of the value and most of the challenge is, finding the right operationalization should typically be the central focus of a project.

My abstraction work is a good example here. I started with some examples of abstraction and an intuitive story about throwing away information while keeping info relevant "far away". Then, the bulk of the work was to operationalize that idea in a way which matched all the intuitive examples, and made the intuitive stories provable.

Derive the Ontology, Don't Assume It

In ML interpretability, some methods look at the computation graph of the net. Others look at orthogonal directions in activation space. Others look at low-rank decompositions of the weight matrices. These are all "different ontologies" for interpretation. Methods which look at one of these ontologies will typically miss structure in the others; e.g. if I run a graph clustering algorithm on the computation graph, I probably won't pick up interpretable concepts embedded in directions in activation space.
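
A small synthetic illustration of one ontology missing structure that lives in another. This is a sketch under made-up assumptions, not a real interpretability method: the "activations" are Gaussian noise plus a binary concept written along a single direction spread evenly over all neurons, and readout_strengths is a hypothetical helper. It compares a neuron-by-neuron readout with a readout along the embedding direction (a different pair of ontologies than the graph-clustering example above).

```python
import numpy as np

def readout_strengths(n_samples=2000, n_neurons=64, seed=0):
    """Compare per-neuron and per-direction readouts of a synthetic concept.

    Returns (best single-neuron |correlation|, |correlation| along the direction).
    """
    rng = np.random.default_rng(seed)
    concept = rng.choice([-1.0, 1.0], size=n_samples)

    # Assumption: the concept is embedded along one unit-norm direction
    # that is spread evenly across all neurons, not aligned with any one of them.
    direction = np.ones(n_neurons) / np.sqrt(n_neurons)
    noise = rng.normal(size=(n_samples, n_neurons))
    acts = np.outer(concept, direction) + noise

    # "Neuron ontology": look at each coordinate of activation space separately.
    per_neuron = [
        abs(np.corrcoef(acts[:, i], concept)[0, 1]) for i in range(n_neurons)
    ]
    # "Direction ontology": project activations onto the embedding direction.
    along_direction = abs(np.corrcoef(acts @ direction, concept)[0, 1])
    return max(per_neuron), along_direction

best_neuron, along_direction = readout_strengths()
print(f"best single neuron: {best_neuron:.2f}")
print(f"along direction:    {along_direction:.2f}")
```

In this setup no single neuron correlates strongly with the concept, while the projection onto the embedding direction does. Of course, here the direction is known in advance because we planted it; in a real net, finding it is the hard part.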

What we'd really like is to avoid assuming an ontology, and rather discover/derive the ontology itself as part of our project. For instance, we could run an experiment where we change one human-interpretable "thing" in the environment, and then look at how that changes the trained net; that would let us discover how the concept is embedded rather than assume it from the start (credit to Chu for this suggestion). Another approach is to start out with some intuitive story for why a particular ontology is favored - e.g. if we have a graph with local connectivity, then maybe the Telephone Theorem kicks in. Such an argument should (a) allow us to rule out interactions which circumvent the favored ontology, and (b) be testable in its own right, e.g. for the Telephone Theorem we can (in principle) check the convergence of mutual information to a limit.
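
A minimal sketch of what checking "convergence of mutual information to a limit" could look like, on a toy discrete Markov chain rather than a trained net. The chain, its transition matrix, and the function names are illustrative assumptions. The chain has two closed blocks of states, so the only information about the starting state which survives far away is which block it started in.

```python
import numpy as np

def mutual_information(joint):
    """Mutual information in bits of a 2-D joint probability table."""
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    mask = joint > 0  # skip zero-probability cells, which contribute nothing
    return float(np.sum(joint[mask] * np.log2(joint[mask] / (px @ py)[mask])))

def mi_along_chain(p0, T, n_steps):
    """Return I(X_0; X_n) for n = 0..n_steps, where row T[i] is P(X_{n+1} | X_n = i)."""
    cond = np.eye(len(p0))  # P(X_n | X_0); the identity at n = 0
    values = []
    for _ in range(n_steps + 1):
        joint = p0[:, None] * cond  # P(X_0, X_n)
        values.append(mutual_information(joint))
        cond = cond @ T
    return values

# Hypothetical chain: states {0, 1} and {2, 3} never mix with each other.
T = np.array([
    [0.7, 0.3, 0.0, 0.0],
    [0.4, 0.6, 0.0, 0.0],
    [0.0, 0.0, 0.5, 0.5],
    [0.0, 0.0, 0.2, 0.8],
])
p0 = np.full(4, 0.25)

mi = mi_along_chain(p0, T, n_steps=50)
print("I(X_0; X_0)  =", round(mi[0], 4))
print("I(X_0; X_50) =", round(mi[-1], 4))
```

Here the mutual information starts at 2 bits (the full starting state), never increases, and levels off at 1 bit: the block label, which the dynamics never change. In a real system the joint distributions would have to be estimated from data, which is where the "in principle" caveat bites.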

Open The Black Box

Don’t just run a black-box experiment on a network, or try to prove a purely behavioral theorem. We want to talk about internal structure.

Partly, opening the black box is about tackling the Hard Parts rather than avoiding them. Not opening the black box is a red flag; it's usually a sign of avoiding the Hard Parts.

Partly, opening the black box is about getting a very rich data channel. When we just work with a black box, we get relatively sparse data about what's going on. When we open the black box, we can in principle directly observe every gear and directly check what's going on.

Relative Importance of These Principles

Tackle The Hamming Problems is probably the advice which is most important to follow for marginal researchers right now, but mostly I expect people who aren't already convinced of it will need to learn it the hard way. (I certainly had to learn it the hard way, though I did that before starting to work on alignment.) Open the Black Box follows pretty naturally once you're leaning in to the Hard Parts.

Once you're past that stumbling block, I think the most important principles are Derive the Ontology and Operationalize. These two are important for opposing types of people. Some people tend to stay too abstract and avoid committing to an ontology, but never operationalize and therefore miss out on the main value-add. Other people operationalize prematurely, adopting ad-hoc operationalizations, and Deriving the Ontology pretty strongly discourages that.

Have an Intuitive Story is especially helpful for people who tend to get lost in the weeds and go nowhere. Make sure you have an intuitive story, and use that story to guide everything else.

6 comments

Thanks for the post! As always I broadly agree, but I have a bunch of nitpicks.

You can save yourself several years of time and effort by actively trying to identify the Hard Parts and focus on them, rather than avoid them. Otherwise, you'll end up burning several years on ideas which don't actually leave the field better off.

I agree that avoiding the Hard Parts is rarely productive, but you also don't address one relevant concern: what if the Hard Part is not merely Hard, but actually Impossible? In this case your advice can also be cashed out as trying to prove it is impossible instead of avoiding it. And as with most impossibility results in TCS, even if the precise formulation is impossible, that often just means you need to reframe the problem a bit.

Mostly, I think the hard parts are things like "understand agency in general better" and "understand what's going on inside the magic black boxes". If your response to such things is "sounds hard, man", then you have successfully identified (some of) the Hard Parts.

I expect you would also say that a crucial hard part many people are avoiding is "how to learn human values?", right? (Not the true names, but a useful pointer)

The point of the intuitive story is to steer our search. Without it, we risk blind empiricism: just cataloguing patterns without building general models/theory/understanding for what's going on. In that mode, we can easily lose track of the big picture goal and end up cataloguing lots of useless stuff. An intuitive story gives us big-picture direction, and something to aim for. Even if it turns out to be wrong!

I want to note that the failure mode of blind theory here is to accept any story, and thus make the requirement of a story completely impotent to guide research. There's an art (and hopefully a science) to finding stories that bias towards productive mistakes.

Most of the value and challenge is in finding the right operationalizations of the vague concepts involved in those arguments, such that the argument is robustly correct and useful. Because it's where most of the value and most of the challenge is, finding the right operationalization should typically be the central focus of a project.

I expect you to partially disagree, but there's not always a "right" operationalization, and there's a failure mode where one falls in love with their neat operationalization, making the parts of the phenomenon that it misses invisible.

Don’t just run a black-box experiment on a network, or try to prove a purely behavioral theorem. We want to talk about internal structure.

I want to say that you should start with a behavioral theorem, and often the properties you want to describe might make more sense behaviorally, but I guess you're going to answer that we have evidence that this doesn't work in Alignment and so it is avoiding the Hard Part. Am I correct?

Partly, opening the black box is about tackling the Hard Parts rather than avoiding them. Not opening the black box is a red flag; it's usually a sign of avoiding the Hard Parts.

One formal example of this is the relativization barrier in complexity theory, which tells you that you can't prove P ≠ NP (and a bunch of other separations) using only techniques that treat algorithms as black boxes instead of looking at their structure.

Once you're past that stumbling block, I think the most important principles are Derive the Ontology and Operationalize. These two are important for opposing types of people. Some people tend to stay too abstract and avoid committing to an ontology, but never operationalize and therefore miss out on the main value-add. Other people operationalize prematurely, adopting ad-hoc operationalizations, and Deriving the Ontology pretty strongly discourages that.

Agreed that it's a great pair of principles to keep in mind!

I expect you would also say that a crucial hard part many people are avoiding is "how to learn human values?", right? (Not the true names, but a useful pointer)

Yes, although I consider that one more debatable.

I expect you to partially disagree, but there's not always a "right" operationalization...

When there's not a "right" operationalization, that usually means that the concepts involved were fundamentally confused in the first place.

I want to say that you should start with a behavioral theorem, and often the properties you want to describe might make more sense behaviorally, but I guess you're going to answer that we have evidence that this doesn't work in Alignment and so it is avoiding the Hard Part. Am I correct?

Actually, I think starting from a behavioral theorem is fine. It's just not where we're looking to end up, and the fact that we want to open the black box should steer what starting points we look for, even when those starting points are behavioral.

I generally agree with you on the principle Tackle the Hamming Problems, Don't Avoid Them.

That being said, some of the Hamming problems that I see being avoided most on this forum, and in the AI alignment community, are:

  1. Do something that will affect policy in a positive way

  2. Pick some actual human values, and then hand-encode these values into open source software components that can go into AI reward functions

Because each module has lots of stuff going on inside, but only a low-dimensional interface, there should be many ways to change around the insides of a module while keeping the externally-visible behavior the same. Because such changes don't change behavior, they don't change system performance. So, we expect that modularity implies lots of degrees-of-freedom in parameter space, i.e. broad peaks.

I know it's a bit off-topic, but FWIW I don't immediately share this intuition. If there are "many ways to change around the insides of a module while keeping the externally-visible behavior the same", then if the whole network is just one "module" (i.e. it's not modular at all), can't I likewise say there are "many ways to change around the insides of [the one module which comprises the entire network] while keeping the externally-visible behavior the same"?

Yup. I'm also not entirely convinced by this argument, for the same reason.

Well, isn't having multiple modules a precondition to something being modular? That seems like what's happening in your example: it has only one module, so it doesn't even make sense to apply John's criterion.