
Solving hard scientific problems usually requires compelling insights

Here’s a heuristic which plays an important role in my reasoning about solving hard scientific problems: that when you’ve made an important breakthrough, you should be able to explain the key insight(s) behind that breakthrough in an intuitively compelling way. By “intuitively compelling” I don’t mean “listeners should be easily persuaded that the idea solves the problem”, but instead: “listeners should be easily persuaded that this is the type of idea which, if true, would constitute a big insight”.

The best examples are probably from Einstein: time being relative, and gravity being equivalent to acceleration, are both insights in this category. The same for Malthus and Darwin and Gödel; the same for Galileo and Newton and Shannon.

Another angle on this heuristic comes from Scott Aaronson’s list of signs that a claimed P≠NP proof is wrong. In particular, see:

#6: the paper lacks a coherent overview, clearly explaining how and why it overcomes the barriers that foiled previous attempts.

And #1: the author can’t immediately explain why the proof fails for 2SAT, XOR-SAT, or other slight variants of NP-complete problems that are known to be in P.

I read these as Aaronson claiming that a successful solution to this very hard problem is likely to contain big insights that can be clearly explained.

Perhaps the best counterexample is the invention of Turing machines. Even after Turing explained the whole construction, it seems reasonable to still be uncertain whether there's actually something interesting there, or whether he's just presented you with a complicated mess. I think that uncertainty would be particularly reasonable if we imagine trying to understand the formalism before Turing figures out how to implement any nontrivial algorithm (like prime factorisation) on a Turing machine, or how to prove any theorems about universal Turing machines.
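To make the point about the formalism concrete, here is a minimal sketch of a Turing machine simulator, together with a rule table for a deliberately trivial task (adding 1 to a binary number). The names `run`, `INCREMENT`, and `BLANK`, and the state names, are assumptions made up for this illustration; none of them come from Turing's paper or from the post. The sketch is only meant to show how little the bare rule table reveals about whether anything interesting can be built on top of it.

```python
from typing import Dict, Tuple

# (state, symbol read) -> (next state, symbol to write, head movement)
Rules = Dict[Tuple[str, str], Tuple[str, str, int]]

BLANK = "_"  # hypothetical blank-cell marker

# Hypothetical rule table: add 1 to a binary number.
# The head starts on the leftmost digit in state "right".
INCREMENT: Rules = {
    ("right", "0"): ("right", "0", +1),      # scan right over the digits
    ("right", "1"): ("right", "1", +1),
    ("right", BLANK): ("carry", BLANK, -1),  # past the end: turn around
    ("carry", "1"): ("carry", "0", -1),      # 1 + 1 = 0, keep carrying
    ("carry", "0"): ("halt", "1", 0),        # absorb the carry and stop
    ("carry", BLANK): ("halt", "1", 0),      # ran off the left edge: new digit
}


def run(rules: Rules, tape_input: str, state: str = "right",
        max_steps: int = 10_000) -> str:
    """Run the machine until it reaches "halt" and return the tape contents."""
    tape = dict(enumerate(tape_input))  # sparse tape: position -> symbol
    head = 0
    steps = 0
    while state != "halt":
        if steps >= max_steps:
            raise RuntimeError("machine did not halt within max_steps")
        symbol = tape.get(head, BLANK)
        state, write, move = rules[(state, symbol)]
        tape[head] = write
        head += move
        steps += 1
    cells = [tape.get(i, BLANK) for i in range(min(tape), max(tape) + 1)]
    return "".join(cells).strip(BLANK)


print(run(INCREMENT, "1011"))  # binary 11 + 1
```

Read on its own, the table of six rules gives little hint that the same mechanism supports universal computation, which is the sense in which the construction could look like a complicated mess before any nontrivial results are known.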

Other counterexamples might include quantum mechanics, where quantization was originally seen as a hack to make the equations work; or formal logic, where I’m not sure if there were any big insights that could be grasped in advance of actually seeing the formalisms in action.

Using the compelling insight heuristic to evaluate alignment research directions

It’s possible that alignment will in practice end up being more of an engineering problem than a scientific problem like the ones I described above. E.g. perhaps we're in a world where, with sufficient caution about scaling up existing algorithms, we'll produce aligned AIs capable of solving the full version of the problem for us.[1] But suppose we're trying to produce a fully scalable solution ourselves; are there existing insights which might be sufficient for that? Here are some candidates, which I’ll only discuss very briefly, and plan to discuss in more detail in a forthcoming post (I’d also welcome suggestions for any I've missed):

• “Trustworthy imitation of human external behavior would avert many default dooms as they manifest in external behavior unlike human behavior.”
• This is Eliezer’s description of the core insight behind Paul’s imitative amplification proposal. I find this somewhat compelling, but less so than I used to, since I’ve realized that the line between imitation learning and reinforcement learning is blurrier than I used to think (e.g. see this or this).
• Decomposing supervision of complex tasks allows better human oversight.
• Again, I’ve found this less compelling over time - in this case because I’ve realized that decomposition is the “default” approach we follow whenever we evaluate things, and so the real “work” of the insight needs to be in describing how we’ll decompose tasks, which I don’t think we’ve made much progress on (with techniques like cross-examination being possible exceptions).
• Weight-sharing makes deception much harder.
• I think this is the main argument pushing me towards optimism about ELK; thanks to Ajeya for articulating it to me.
• Uncertainty about human preferences makes agents corrigible.
• This is Stuart Russell’s claim about why assistance games are a good potential solution to alignment; I basically don’t buy it at all, for the same reasons as Yudkowsky (but kudos to Stuart for stating the proposed insight clearly enough that the disagreement is obvious).
• Myopic agents can be capable while lacking incentives for long-term misbehavior.
• This claim seems to drive a bunch of Evan Hubinger’s work, but I don’t buy it. In order for an agent’s behavior to be competent over long time horizons, it needs to be doing some kind of cognition aimed towards long time horizons, and we don’t know how to stop that cognition from being goal-directed.
• Problems that arise in limited-data regimes (e.g. inner misalignment) go away when you have methods of procedurally generating realistic data (e.g. separable world-models).
• This claim was made to me by Michael Cohen. It’s interesting, but I don’t think it solves the core alignment problem, because we don’t understand cognition well enough to efficiently factor out world-models from policies. E.g. training a world-model to predict observations step-by-step seems like it loses out on all the benefits of thinking in terms of abstractions; whereas training it just on long-term predictive accuracy makes the intermediate computations uninterpretable and therefore unusable.
• By default we’ll train models to perform bounded tasks of bounded scope, and then achieve more complex tasks by combining them.
• This seems like the core claim motivating Eric Drexler’s CAIS framework. I think it dramatically underrates the importance of general intelligence, and the returns to scaling up single models, for reasons I explain further here.
• Functional decision theory.
• I don’t think this directly addresses the alignment problem, but it feels like the level of insight I’m looking for, in a related domain.

Note that I do think each of these claims gestures towards interesting research possibilities which might move the needle in worlds where the alignment problem is easy. But I don’t think any of them are sufficiently powerful insights to scalably solve the hard version of the alignment problem. Why do many of the smart people listed above think otherwise? I think it’s because they’re not accounting properly for the sense in which the alignment problem is an adversarial one: that by default, optimization which pushes towards general intelligence will also push towards misalignment, and we'll need to do something unusual to be confident we're separating them. In other words, the set of insights about consequentialism and optimization which made us worry about the alignment problem in the first place (along with closely-related insights like the orthogonality thesis and instrumental convergence) are sufficiently high-level, and sufficiently robust, that unless you're guided by other powerful insights you're less likely to find exceptions to those principles, and more likely to find proposals where you can no longer spot the flaw.

This claim is very counterintuitive from an ML perspective, where loosely-directed exploration of new algorithms often leads to measurable improvements. I don’t know how to persuade people that applying this approach to alignment leads to proposals which are deceptively appealing, except by getting them to analyze each of the proposals above until they convince themselves that the purported insights are insufficient to solve the problem. Unfortunately, this is very time-consuming. To save effort, I’d like to promote a norm for proposals for alignment techniques to be very explicit about where the hard work is done, i.e. which part is surprising or insightful or novel enough to make us think that it could solve alignment even in worlds where that’s quite difficult. Or, alternatively, if the proposal is only aimed at worlds where the problem is relatively easy, please tell me that explicitly. E.g. I spent quite a while being confused about which part of the ELK agenda was meant to do the hard work of solving ontology identification; after asking Paul about it, though, his main response was “maybe ontology identification is easy”, which I feel very skeptical about.[2] (He also suggested something to do with the structure of explanations as a potential solving-ELK-level-insight; but I don’t understand this well enough to discuss it in detail.)

Using the compelling insight heuristic to generate alignment research directions

If we need more major insights about intelligence/consequentialism/goals to solve alignment, how might we get them? Getting more evidence from seeing more advanced systems will make this much easier, so one strategy is just to keep pushing on empirical and engineering work, while keeping an eye out for novel phenomena which might point us in the right directions.[3]

But for those who want to try to generate those insights directly, some tentative options:

• Studying how existing large models think, and trying to extrapolate from that
• Understanding human minds well enough to identify insights about human intelligence (e.g. things like predictive coding, or multi-agent models of minds, or dual process theory) which can be applied to alignment
• Understanding how groups think (e.g. how task decomposition occurs in cultural evolution, or in corporations, or…)
• Agent foundations research

Of course, the whole point of major insights is that it’s hard to predict them; so I’d be excited about others pursuing potential insights that don’t fall into any of these categories (with the caveat that, as the work gets increasingly abstract, it’s necessary to be increasingly careful for it to have any chance of succeeding).

1. I model Eliezer as agreeing with most of the claims I make in this post, but strongly disagreeing with this sentence, because he thinks that the core problem is so hard that no amount of prosaic engineering effort could plausibly prevent catastrophe in the absence of major novel insights.

2. Some brief intuitions about why: I think the hardest part of human cognition is generating and merging different ontologies. Thinking “within” an ontology is like doing normal research in a scientific field; reasoning about different ontologies is like doing philosophy, or doing paradigm-breaking research, and so it seems like a particularly difficult thing to generate a training signal for.

3. Thanks to Nathan Helm-Burger for reminding me of this, with his comment.


Hmm. I suppose a similar key insight for my own line of research might go like:

The orthogonality thesis is actually wrong for brain-like learning systems. Such systems first learn many shallow proxies for their reward signal. Moreover, the circuits implementing these proxies are self-preserving optimization demons. They’ll steer the learning process away from the true data generating process behind the reward signal so as to ensure their own perpetuation.

If true, this insight matters a lot for value alignment because it points to a way that aligned behavior in the infra-human regime could perpetuate into the superhuman regime. If all of:

• We can instil aligned behavior in the infra-human regime
• The circuits that implement aligned behavior in the infra-human regime can ensure their own perpetuation into the superhuman regime
• The circuits that implement aligned behavior in the infra-human regime continue to implement it in the superhuman regime

hold true, then I think we’re in a pretty good position regarding value alignment. Off-switch corrigibility is a bust though because self-preserving circuits won’t want to let you turn them off.

If you’re interested in some of the actual arguments for this thesis, you can read my answer to a question about the relation between human reward circuitry and human values.

Weight-sharing makes deception much harder.

Could you explain or provide a reference for this?

I find this particularly curious since naively, one would assume that weight sharing implicitly implements a simplicity prior, so it should make optimization more likely and thus also deceptive behavior? Maybe the argument is that somehow weight sharing leaves less wiggle room for obscuring one's reasoning process, making a potential optimizer more interpretable? But the hidden states and tied weights could still be encoding deceptive reasoning in an uninterpretable way?

I like that you're proposing an explicit heuristic inspired by the history of science for judging research directions and approaches, and I acknowledge that it leads to conclusions that are counterintuitive to my Richard-model (pushing for agent foundations, for example), so you're not just retrofitting your own conclusions AFAIK. I also like that you're applying it to object-level directions in alignment; that's something I'm working on at the moment for my own research, based on your pushback.

That being said, my prediction/retrodiction is that this is too strong a criterion, for reasons already discussed in this post. Basically I expect that for most if not all of the great scientific solutions you mention, if you back up enough (sometimes you don't need to back up that far), you will find a step, an idea, an insight that proved crucial down the line but didn't look like the right type. Even in the post there's a sort of weird double standard where you implicitly discuss Darwin and Einstein after they have matured their theories, whereas you talk about Turing before he proves a non-trivial result or designs a non-trivial algorithm. The extension of my prediction here is that during the long process that these thinkers (and other examples) took to arrive at their insights, they built on models and ideas that revealed bits of evidence but were jank, incorrect, and eventually used as scaffolding then thrown away.

Note that this is a rather empirical prediction about the history of science, and that I'm curious about any counterexamples you or anybody else would find to it.

Another issue I see is that often the insights redefined the rules of the game. Galileo for example (in the Feyerabend interpretation at least) redefines "rest state" and "movement" to be consistent with a moving earth. You could say that these are insights that are compelling, but from the perspective of that time in history, it looks more like changing the rules of the game (of what a theory of the stars has to deal with) by changing the natural interpretations associated with it.

I broadly agree with Richard's main point, but I also do agree with this comment in the sense that I am not confident that the example of Turing compared with e.g. Einstein is completely fair/accurate.

One thing I would say in response to your comment, Adam, is that I don't usually see the message of your linked post as being incompatible with Richard's main point. I think one usually does have or does need productive mistakes that don't necessarily or obviously look like they are robust partial progress. But still, often when there actually is a breakthrough, I think it can be important to look for this "intuitively compelling" explanation. So one thing I have in mind is that I think it's usually good to be skeptical if a claimed breakthrough seems to just 'fall out' of a bunch of partial work without there being a compelling explanation after the fact.

One thing I would say in response to your comment, Adam, is that I don't usually see the message of your linked post as being incompatible with Richard's main point. I think one usually does have or does need productive mistakes that don't necessarily or obviously look like they are robust partial progress. But still, often when there actually is a breakthrough, I think it can be important to look for this "intuitively compelling" explanation. So one thing I have in mind is that I think it's usually good to be skeptical if a claimed breakthrough seems to just 'fall out' of a bunch of partial work without there being a compelling explanation after the fact.

Hum, I'm not sure I'm following your point. Do you mean that you can have both productive mistakes and intuitively compelling explanations when the final (or even an intermediary) breakthrough is reached? Then I totally agree. My point was more that if you only use Richard's heuristic, I expect you to not reach the breakthrough because you would have nipped in the bud many productive mistakes that actually lead the way there.

There's also a very Kuhnian thing here that I didn't really mention in my previous comment (except on the Galileo part): the compellingness of an answer is often stronger after the fact, when you work in the paradigm that it led to. That's another aspect of productive mistakes or even breakthroughs: they don't necessarily look right or predict more from the start, and evaluating their consequences is not necessarily obvious.

I like this pushback, and I'm a fan of productive mistakes. I'll have a think about how to rephrase to make that clearer. Maybe there's just a communication problem, where it's hard to tell the difference between people claiming "I have an insight (or proto-insight) which will plausibly be big enough to solve the alignment problem", versus "I have very little traction on the alignment problem but this direction is the best thing I've got". If the only effect of my post is to make a bunch of people say "oh yeah, I meant the second thing all along", then I'd be pretty happy with that.

Why do I care about this? It has uncomfortable tinges of status regulation, but I think it's important because there are so many people reading about this research online, and trying to find a way into the field, and often putting the people already in the field on some kind of intellectual pedestal. Stating clearly the key insights of a given approach, and their epistemic status, will save them a whole bunch of time. E.g. it took me ages to work through my thoughts on myopia in response to Evan's posts on it, whereas if I'd known it hinged on some version of the insight I mentioned in this post, I would have immediately known why I disagreed with it.

As an example of (I claim) doing this right, see the disclaimer on my "shaping safer goals" sequence: "Note that all of the techniques I propose here are speculative brainstorming; I'm not confident in any of them as research directions, although I'd be excited to see further exploration along these lines." Although maybe I should make this even more prominent.

Lastly, I don't think I'm actually comparing Darwin and Einstein's mature theories to Turing's incomplete theory. As I understand it, their big insights required months or years of further work before developing into mature theories (in Darwin's case, literally decades).

I like this pushback, and I'm a fan of productive mistakes. I'll have a think about how to rephrase to make that clearer. Maybe there's just a communication problem, where it's hard to tell the difference between people claiming "I have an insight (or proto-insight) which will plausibly be big enough to solve the alignment problem", versus "I have very little traction on the alignment problem but this direction is the best thing I've got". If the only effect of my post is to make a bunch of people say "oh yeah, I meant the second thing all along", then I'd be pretty happy with that.

When phrased like that, I agree with you. I am personally relatively suspicious of claims by a bunch of people to have found a path to alignment, but actually excited by some of their productive mistakes (as discussed a bit in my post).

I also fully agree that I want people to use the second, and my "history of alignment" research direction aims at concretely teasing the productive mistakes and revealed bits of evidence without falling for the "this is obviously a solution" or "this is obviously not a solution and thus useless".

Why do I care about this? It has uncomfortable tinges of status regulation, but I think it's important because there are so many people reading about this research online, and trying to find a way into the field, and often putting the people already in the field on some kind of intellectual pedestal. Stating clearly the key insights of a given approach, and their epistemic status, will save them a whole bunch of time. E.g. it took me ages to work through my thoughts on myopia in response to Evan's posts on it, whereas if I'd known it hinged on some version of the insight I mentioned in this post, I would have immediately known why I disagreed with it.

+1000. And teasing out more generally the assumptions, the insights, and the new parts of works and approaches is, I think, super necessary and on my research agenda. That's also part of the reason why I feel asking newcomers to be distillers is not necessarily a great idea: good distillation of the type we're discussing requires IMO quite a deep understanding of the landscape, the problem and the underlying ideas. Otherwise you at best get a decent summary, and we need more.

As an example of (I claim) doing this right, see the disclaimer on my "shaping safer goals" sequence: "Note that all of the techniques I propose here are speculative brainstorming; I'm not confident in any of them as research directions, although I'd be excited to see further exploration along these lines." Although maybe I should make this even more prominent.

Haven't reread your sequence in quite some time, but I think the value of such an exploratory sequence is to make clearer the intuitions underlying the direction, even if they haven't yet led to productive mistakes. So I like your disclaimer, but I think the even better way of doing this is to clarify, for different posts and ideas, what intuitions you're building on and where the current formalisms/descriptions/analogies are failing to capture them.

Lastly, I don't think I'm actually comparing Darwin and Einstein's mature theories to Turing's incomplete theory. As I understand it, their big insights required months or years of further work before developing into mature theories (in Darwin's case, literally decades).

This might also be a bit of miscommunication, but I felt like your discussion of Turing could also have applied especially in Darwin's case, where the initial insight required a lot of additional pieces and clarification to make a clean and ordered theory that you can actually defend. Generally I was pointing at the risk of hindsight bias, where the fact that the insight is clean and powerful once the full theory is known and considered doesn't mean it was so compelling at the time it was thought of. (Which is also a general empirical claim about the history of scientific progress, to explore ;) )

I’d like to promote a norm for proposals for alignment techniques to be very explicit about where the hard work is done, i.e. which part is surprising or insightful or novel enough to make us think that it could solve alignment even in worlds where that’s quite difficult.

Alignment is, by nature, an engineering task, not a scientific task: It is an attempt to make something, not to understand some existing thing. It may be that, as you suggest, “solving hard scientific problems usually requires compelling insights”, but this is beside the point. Spaceflight was a hard problem, but was solved without a special, compelling insight. Likewise for the progress of computation from vacuum tubes to nanoscale electronics. Both are in the domain of engineering, where problems are typically solved by improving and composing many components. Asking “which part solves the hard problem” would be a mistake.

Regarding the CAIS model, you suggest that it “dramatically underrates the importance of general intelligence”, yet I have argued that the comprehensive AI services model (including the service of developing new services) is a way of thinking about implementations of general intelligence, not a substitute for it!

The capabilities of large language models should update our expectations, but do not persuade me that knowledge and skills of societal scale and diversity must or will be embodied in an undifferentiated blob of computation.

By the way, I haven’t suggested the CAIS model as a solution to alignment problems; instead of proposing a solution, it suggests that alignment problems are likely to arise (and perhaps be solved) in a context different from what has often been assumed. Some problems seem more tractable in that context, others less.

Weight-sharing makes deception much harder.

This is Eliezer’s description of the core insight behind Paul’s imitative amplification proposal. I find this somewhat compelling, but less so than I used to, since I’ve realized that the line between imitation learning and reinforcement learning is blurrier than I used to think (e.g. see this or this).

I didn't understand what you mean by the line being blurrier... Is this a comment about what works in practice for imitation learning? Does a similar objection apply if we replace imitation learning with behavioral cloning?

Overall I think this is a good post and very interesting, thanks.

I find this somewhat compelling, but less so than I used to, since I’ve realized that the line between imitation learning and reinforcement learning is blurrier than I used to think (e.g. see this or this).

So I checked out those links. Briefly looking at them, I can see what you mean about the line between RL and imitation learning being blurry. The first paper seems to show a version of RL which is basically imitation learning.

I'm confused because when you said this makes iterated amplification less compelling to you, I took that to mean it made you less optimistic about iterated amplification as a solution for alignment. But why would whether something is technically classified as imitation learning or a special kind of RL make a difference for its effectiveness?

Or did you mean not that you find it any less promising as an alignment proposal, but just that you now find the core insight less compelling/interesting because it's not as major an innovation over the idea of RL as you had thought it was?