The predominant view on LW seems to be "pure AI capabilities research is bad, because capabilities progress alone doesn't contribute to alignment progress, and capabilities progress without alignment progress means that we're doomed".

I understand the arguments for this position, but I have what might be called the opposite position. The opposite position seems at least as intuitive as the standard position to me, and it confuses me that it's not discussed more. (I'm not confused that people reject it; I'm confused that nobody seems to even bring it up for the purpose of rejecting it.)

The opposite position is "In order to do alignment research, we need to understand how AGI works; and we currently don't understand how AGI works, so we need to have more capabilities research so that we would have a chance of figuring it out. Doing capabilities research now is good because it's likely to be slower now than it might be in some future where we had even more computing power, neuroscience understanding, etc. than we do now. If we successfully delayed capabilities research until a later time, then we might get a sudden spurt of it and wouldn't have the time to turn our increased capabilities understanding into alignment progress. Thus by doing capabilities research now, we buy ourselves a longer time period in which it's possible to do more effective alignment research."

Some reasons I have for holding this position:

1) I used to do AI strategy research. Among other things, I looked into how feasible it is for intelligence to rapidly turn superintelligent, and what kinds of pathways there are into AI disaster. But a thought that I kept having when doing any such research was "I don't know if any of this theory is of any use, because so much depends on what the world will be like when actual AGI is developed, and what that AGI will look in the first place. Without knowing what AGI will look like, I don't know whether any of the assumptions I'm making about it are going to hold. If any one of them fails to hold, the whole paper might turn out to be meaningless."

Eventually, I concluded that I can't figure out a way to make the outputs of strategy research useful for as long as I know as little about AGI as I do. Then I went to do something else with my life, since it seemed too early to do useful AGI strategy research (as far as I could tell).

2) Compare the state of AI now, to how it was before the deep learning revolution happened. It seems obvious to me that our current understanding of DL puts us in a better position to do alignment research than we were before the DL revolution. For instance, Redwood Research is doing research on language models because they believe that their research is analogous to some long-term problems

Assume that Redwood Research's work will actually turn out to be useful for aligning superintelligent AI. Language models are one of the results of the DL revolution, so their work couldn't have been done before that revolution. It seems that in a counterfactual world where the DL revolution happened later and the DL era was compressed into a shorter timespan, our chances of alignment would be worse since that world's equivalent of Redwood Research would have less time to do their research.

3) As a similar consideration, language models are already "deceptive" in a sense - asked something that it has no clue about, InstructGPT will happily come up with confident-sounding nonsense. When I linked people to some of that nonsense, multiple people pointed out that InstructGPT's answers sound like the kind of a student who's taking an exam and is asked to write an essay about a topic they know nothing about, but tries to fake it anyway (that is, trying to deceive the examiner). 

Thus, even if you are doing pure capabilities research and just want your AI system to deliver people accurate answers, it is already the case that you can see a system like InstructGPT "trying to deceive" people. If you are building a question-answering system, you want to build one that people can trust to give accurate answers rather than impressive-sounding bullshit, so you have the incentive to work on identifying and stopping such "deceptive" computations as a capabilities researcher already.

So it has already happened that

  • Progress in capabilities research gives us a new concrete example of how e.g. deception manifests in practice, that can be used to develop our understanding of it and develop new ideas for dealing with it.
  • Capabilities research reaches a point where even capabilities researchers have a natural reason to care about alignment, reducing the difference between "capabilities research" and "alignment research".
  • Thus, our understanding and awareness of deception is likely to improve as we get closer to AGI, and by that time we will have already learned a lot about how deception manifests in simpler systems and how to deal with it, and maybe some of that will suggest principles that generalize to more powerful systems as well.

It's not that I'd put a particularly high probability on InstructGPT by itself leading to any important insights about either deception in particular or alignment in general. InstructGPT is just an instance of something that seems likely to help us understand deception a little bit better. And given that, it seems reasonable to expect that further capabilities development will also give us small insights to various alignment-related questions, and maybe all those small insights will combine to give us the answers we need.

4) Still on the topic of deception, there are arguments suggesting that something like GPT will always be "deceptive" for Goodhart's Law and Siren World reasons. We can only reward an AI system for producing answers that look good to us, but this incentivizes the system to produce answers that look increasingly good to us, rather than answers that are actually correct. "Looking good" and "being correct" correlate with each other to some extent, but will eventually be pushed apart once there's enough optimization pressure on the "looking good" part.

As such, this seems like an unsolvable problem... but at the same time, if you ask me a question, I can have a desire to actually give a correct and useful answer to your question, rather than just giving you an answer that you find maximally compelling. More generally, humans can and often do have a genuine desire to help other humans (or even non-human animals) fulfill their preferences, rather than just having a desire to superficially fake cooperativeness.

I'm not sure how this desire works, but I don't think you could train GPT to have it. It looks like some sort of theory of mind is involved in how the goal is defined. If I want to help you fulfill your preferences, then I have a sense of what it would mean for your preferences to be fulfilled, and I can have a goal of optimizing for that (even while I am uncertain of what exactly your preferences are).

We don't currently seem to know how to do this kind of a theory of mind, but it can't be that much more complicated than other human-level capabilities are, since even many non-human animals seem to have some version of it. Still, I don't think we can yet implement that kind of a theory of mind in any AI system. So we have to wait for our capabilities to progress to the kind of a point where this kind of a capacity becomes possible, and then we can hopefully use that capabilities understanding to solve what looks like a crucial piece of alignment understanding.

New Comment
8 comments, sorted by Click to highlight new comments since: Today at 9:23 PM

It seems to me that the this argument only makes sense if we assume that “more capabilities research now” translates into “more gradual development of AGI”. That’s the real crux for me.

If that assumption is false, then accelerating capabilities is basically equivalent to having all the AI alignment and strategy researchers hibernate for some number N years, and then wake up and get back to work. And that, in turn, is strictly worse than having all the AI alignment and strategy researchers do what they can during the next N years, and also continue doing work after those N years have elapsed. I do agree that there is important alignment-related work that we can only do in the future, when AGI is closer. I don't agree that there is nothing useful being done right now.

On the other hand, if that assumption is true (i.e. the assumption “more capabilities research now” translates into “more gradual development of AGI”), then there's at least a chance that more capabilities research now would be net positive.

However, I don't think the assumption is true—or at least, not to any appreciable extent. It would only be true if you thought that there was a different bottleneck to AGI besides capabilities research. You mention faster hardware, but my best guess is that we already have a massive hardware overhang—once we figure out AGI-capable algorithms, I believe we already have the hardware that would support superhuman-level AGI with quite modest amounts of money and chips. (Not everyone agrees with me.) You mention “neuroscience understanding”, but I would say that insofar as neuroscience understanding helps people invent AGI-capable learning algorithms, neuroscience understanding = capabilities research! (I actually think some types of neuroscience are mainly helpful for capabilities and other types are mainly helpful for safety, see here.) I imagine there being small bottlenecks that would add a few months today, but would only add a few weeks in a decade, e.g. future better CUDA compilers. But I don't see any big bottlenecks, things that add years or decades, other than AGI capabilities research itself.

Even if the assumption is significantly true, I still would be surprised if more capabilities research now would be a good trade, because (1) I do think there’s a lot of very useful alignment work we can do right now (not to mention outreach, developing pedagogy, etc.), (2) the most valuable alignment work is work that informs differential technological development, i.e. work that tells us exactly what AGI capabilities work should be done at all, namely R&D that moves us down a path to maximally alignable AGI, but that's only valuable to the extent that we figure things out before the wrong kind of capabilities research has already been completed. See Section 1.7 here.

I'm not sure how this desire works, but I don't think you could train GPT to have it. It looks like some sort of theory of mind is involved in how the goal is defined. 

I do think that would be valuable to know, and am very interested in that question myself, but I think that figuring it out is mostly a different type of research than AGI capabilities research—loosely speaking, what you're talking about looks like “designing the right RL reward function”, whereas capabilities research mostly looks like “designing a good RL algorithm”—or so I claim, for reasons here and here.

In order to do alignment research, we need to understand how AGI works; and we currently don't understand how AGI works, so we need to have more capabilities research so that we would have a chance of figuring it out.

I totally agree with this. Alas, "understand how AGI works" is not something which most capabilities work even attempts to do.

It turns out that people can advance capabilities without having much clue what's going on inside their magic black boxes, and that's what most capabilities work looks like at this point.

Agreed, but the black-box experimentation seems like it's plausibly a prerequisite for actual understanding? E.g. you couldn't analyze InceptionV1 or CLIP to understand its inner workings before you actually had those models. To use your car engine metaphor from the other comment, we can't open the engine and stick it full of sensors before we actually have the engine. And now that we do have engines, people are starting to stick them full of sensors, even if most of the work is still focused on building even fancier engines.

It seems reasonable to expect that as long as there are low-hanging fruit to be picked using black boxes, we get a lot of black boxes and the occasional paper dedicated to understanding what's going on with them and how they work.  Then when it starts getting harder to get novel interesting results with just black box tinkering, the focus will shift to greater theoretical understanding and more thoroughly understanding everything that we've accomplished so far. 

I think we are getting some information. For example, we can see that token level attention is actually quite powerful for understanding language and also images. We have some understanding of scaling laws. I think the next step is a deeper understanding of how world modeling fits in with action generation -- how much can you get with just world modeling, versus world modeling plus reward/action combined?

If the transformer architecture is enough to get us there, it tells us a sort of null hypothesis for intelligence -- that the structure for predicting sequences by comparing all pairs of elements of a limited sequence -- is general.

Not rhetorically, what kind of questions you think would better lead to understanding how AGI works?

I think teaching a transformer with an internal thought process (predicting the next tokens over a part of the sequence that's "showing your work") would be an interesting insight into how intelligence might work. I thought of this a little while back but also discovered this is also a long standing MIRI research direction into transparency. I wouldn't be surprised if Google took it up at this point.

Not rhetorically, what kind of questions you think would better lead to understanding how AGI works?

Suppose I'm designing an engine. I try out a new design, and it surprises me - it works much worse or much better than expected. That's a few bits of information. That's basically the sort of information we get from AI experiments today.

What we'd really like is to open up that surprising engine, stick thermometers all over the place, stick pressure sensors all over the place, measure friction between the parts, measure vibration, measure fluid flow and concentrations and mixing, measure heat conduction, etc, etc. We want to be able to open that black box, see what's going on, figure out where that surprising performance is coming from. That would give us far more information, and far more useful information, than just "huh, that worked surprisingly well/poorly". And in particular, there's no way in hell we're going to understand how an engine works without opening it up like that.

The same idea carries over to AI: there's no way in hell we're going to understand how intelligence works without opening the black box. If we can open it up, see what's going on, figure out where surprises come from and why, then we get orders of magnitude more information and more useful information. (Of course, this also means that we need to figure out what things to look at inside the black box and how - the analogues of temperatures, pressures, friction, mixing, etc in an engine.)

Still on the topic of deception, there are arguments suggesting that something like GPT will always be "deceptive" for Goodhart's Law and Siren World reasons. We can only reward an AI system for producing answers that look good to us, but this incentivizes the system to produce answers that look increasingly good to us, rather than answers that are actually correct. "Looking good" and "being correct" correlate with each other to some extent, but will eventually be pushed apart once there's enough optimization pressure on the "looking good" part.

As such, this seems like an unsolvable problem... but at the same time, if you ask me a question, I can have a desire to actually give a correct and useful answer to your question, rather than just giving you an answer that you find maximally compelling. More generally, humans can and often do have a genuine desire to help other humans (or even non-human animals) fulfill their preferences, rather than just having a desire to superficially fake cooperativeness.

This argument proves too much. Humans are rewarded by each other for appearing to be friendly, but not actually for being friendly. Therefore, they are incentivized to seem friendly. Therefore, no humans will really be friendly or really care about each other, because that's not what human culture is really selecting for. 

We have concluded a falsehood; the line of reasoning requires, at the least, more assumptions which make it clear why the argument obtains in the AI case but not the human case. (Personally, I suspect the reasoning is mostly invalid.) So, I think "selection" / "incentivization" arguments require great care and caution

(As another line of reasoning, reward is not the optimization target contravenes claims like "GPT will always be 'deceptive' due to Goodhart's / Siren worlds.")

I tend to value a longer timeline more than a lot of other people do. I guess I see EA and AI Safety setting up powerful idea machines that get more powerful when they are given more time to gear up.  A lot more resources have been invested into EA field-building recently, but we need time for these investments to pay off. At EA London this year, I gained a sense that AI Safety movement building is only now becoming its own thing; and of course it'll take time to iterate to get it right, then time for people to pass through the programs, then time for them to have a career.

I suspect the kind of argument that we need more capabilities to make progress might have been stronger earlier in the game, but now that we already have powerful language models, there's a lot that we can do without needing AI to advance any further.

My sense is that Anthropic is somewhat oriented around this idea. I'm not sure if this is their actual plan or just some guesswork I read between the lines.

But I vaguely recall something like "develop capabilities that you don't publish, while also developing interpretability techniques which you do publish, and try to have a competitive edge on capabilities which you then have some lead time to try to inspect via intepretability techniques and the practice alignment on various capability-scales.

(I may have just made this up while trying to steelman them to myself)