Thoughts on AGI organizations and capabilities work

Rob Bensinger; So8res

(Note: This essay was largely written by Rob, based on notes from Nate. It’s formatted as Rob-paraphrasing-Nate because (a) Nate didn’t have time to rephrase everything into his own words, and (b) most of the impetus for this post came from Eliezer wanting MIRI to praise a recent OpenAI post and Rob wanting to share more MIRI-thoughts about the space of AGI organizations, so it felt a bit less like a Nate-post than usual.)

Nate and I have been happy about the AGI conversation seeming more honest and “real” recently. To contribute to that, I’ve collected some general Nate-thoughts in this post, even though they’re relatively informal and disorganized.

AGI development is a critically important topic, and the world should obviously be able to hash out such topics in conversation. (Even though it can feel weird or intimidating, and even though there’s inevitably some social weirdness in sometimes saying negative things about people you like and sometimes collaborate with.) My hope is that we'll be able to make faster and better progress if we move the conversational norms further toward candor and substantive discussion of disagreements, as opposed to saying everything behind a veil of collegial obscurity.

Capabilities work is currently a bad idea

Nate’s top-level view is that ideally, Earth should take a break on doing work that might move us closer to AGI, until we understand alignment better.

That move isn’t available to us, but individual researchers and organizations who choose not to burn the timeline are helping the world, even if other researchers and orgs don't reciprocate. You can unilaterally lengthen timelines, and give humanity more chances of success, by choosing not to personally shorten them.

Nate thinks capabilities work is currently a bad idea for a few reasons:

He doesn’t buy that current capabilities work is a likely path to ultimately solving alignment.
Insofar as current capabilities work does seem helpful for alignment, it strikes him as helping with parallelizable research goals, whereas our bottleneck is serial research goals. (See A note about differential technological development.)
Nate doesn’t buy that we need more capabilities progress before we can start finding a better path.

This is not to say that capabilities work is never useful for alignment, or that alignment progress is never bottlenecked on capabilities progress. As an extreme example, having a working AGI on hand tomorrow would indeed make it easier to run experiments that teach us things about alignment! But in a world where we build AGI tomorrow, we're dead, because we won't have time to get a firm understanding of alignment before AGI technology proliferates and someone accidentally destroys the world.^[1] Capabilities progress can be useful in various ways, while still being harmful on net.

(Also, to be clear: AGI capabilities are obviously an essential part of humanity's long-term path to good outcomes, and it's important to develop them at some point — the sooner the better, once we're confident this will have good outcomes — and it would be catastrophically bad to delay realizing them forever.)

On Nate’s view, the field should do experiments with ML systems, not just abstract theory. But if he were magically in charge of the world's collective ML efforts, he would put a pause on further capabilities work until we've had more time to orient to the problem, consider the option space, and think our way to some sort of plan-that-will-actually-probably-work. It’s not as though we’re hurting for ML systems to study today, and our understanding already lags far behind today’s systems' capabilities.^[2]

Publishing capabilities advances is even more obviously bad

For researchers who aren't willing to hit the pause button, an even more obvious (and cheaper) option is to avoid publishing any capabilities research (including results of the form "it turns out that X can be done, though we won't say how we did it").

Information can leak out over time, so "do the work but don't publish about it" still shortens AGI timelines in expectation. However, it can potentially shorten them a lot less.

In an ideal world, the field would currently be doing ~zero publishing of capabilities research — and marginal action to publish less is beneficial even if the rest of the world continues publishing.

Thoughts on the landscape of AGI organizations

With those background points in hand:

Nate was asked earlier this year whether he agrees with Eliezer's negative takes on OpenAI. There's also been a good amount of recent discussion of OpenAI on LessWrong.

Nate tells me that his headline view of OpenAI is mostly the same as his view of other AGI organizations, so he feels a little odd singling out OpenAI. That said, here are his notes on OpenAI anyway:

On Nate’s model, the effect of OpenAI is almost entirely dominated by its capabilities work (and sharing of its work), and this effect is robustly negative. (This is true for DeepMind, FAIR, and Google Brain too.)
Nate thinks that DeepMind, OpenAI, Anthropic, FAIR, Google Brain, etc. should hit the pause button on capabilities work (or failing that, at least halt publishing). (And he thinks any one actor can unilaterally do good in the process, even if others aren't reciprocating.)
On Nate’s model, OpenAI isn't close to operational adequacy in the sense of the Six Dimensions of Operational Adequacy write-up — which is another good reason to hold off on doing capabilities research. But this is again a property OpenAI shares with DeepMind, Anthropic, etc.

Insofar as Nate or I think OpenAI is doing the wrong thing, we’re happy to criticize it.^[3] But, while this doesn't change the fact that we view OpenAI's effects as harmful on net currently, Nate does want to acknowledge that OpenAI seems to him to be doing better than some other orgs on a number of fronts:

Nate liked a lot of things about the OpenAI Charter. (As did Eliezer, though compared to Eliezer, Nate saw the Charter as a more important positive sign about OpenAI's internal culture.)
Nate would suspect that OpenAI is much better than Google Brain and FAIR (and comparable with DeepMind, and maybe a bit behind Anthropic? it's hard to judge these things from the outside) on some important adequacy dimensions, like research closure and operational security. (Though Nate worries that, e.g., he may hear more about efforts in these directions made by OpenAI than about DeepMind just by virtue of spending more time in the Bay.)
Nate is also happy that Sam Altman and others at OpenAI talk to EAs/rationalists and try to resolve disagreements, and he’s happy that OpenAI has had people like Holden and Helen on their board at various points.
Also, obviously, OpenAI (along with DeepMind and Anthropic) has put in a much clearer AGI alignment effort than Google, FAIR, etc. (Albeit Nate thinks the absolute amount of "real" alignment work is still small.)
Most recently, Nate and Eliezer both think it’s great that OpenAI released a blog post that states their plan going forward, and we want to encourage DeepMind and Anthropic to do the same.^[4]

Comparatively, Nate thinks of OpenAI as being about on par with DeepMind, maybe a bit behind Anthropic (who publish less), and better than most of the other big names, in terms of attempts to take not-killing-everyone seriously. But again, Nate and I think that the overall effect of OpenAI (and DeepMind, FAIR, etc.) is bad, because we think it's dominated by "shortens AGI timelines". And we’re a little leery of playing “who's better on [x] dimension” when everyone seems to be on the floor of the logistic success curve.

We don't want "here are a bunch of ways OpenAI is doing unusually well for its reference class" to be treated as encouragement for those organizations to stay in the pool, or encouragement for others to join them in the pool. Outperforming DeepMind, FAIR, and Google on one or two dimensions is a weakly positive sign about the future, but on my model and Nate’s, it doesn't come close to outweighing the costs of "adding another capabilities org to the world".

^[1]
Nate simultaneously endorses these four claims:
1. More capabilities would make it possible to learn some new things about alignment.
2. We can't do all the alignment work pre-AGI. Some trial-and-error and experience with working AGI systems will be required.
3. It can't all be trial-and-error, and it can't all be improvised post-AGI. Among other things, this is because:
3.1. Some errors kill you, and you need insight into which errors those are, and how to avoid them, in advance.
3.2. We’re likely to have at most a few years to upend the gameboard once AGI arrives. Figuring everything out under that level of time pressure seems unrealistic; we need to be going into the AGI regime with a solid background understanding, so that empirical work in the endgame looks more like "nailing down a dozen loose ends and making moderate tweaks to a detailed plan" rather than "inventing an alignment field from scratch".
3.3. AGI is likely to coincide with a sharp left turn, which makes it harder (and more dangerous) to rely on past empirical generalizations, especially ones that aren't backed by deep insight into AGI cognition.
3.4. Other points raised in AGI Ruin: A List of Lethalities.
4. If we end up able to do alignment, it will probably be because we figured out at least one major thing that we don't currently know, that isn't a part of the current default path toward advancing SotA or trying to build AGI ASAP with mainstream-ish techniques, and isn't dependent on such progress.
^[2]
And, again, small individual “don’t burn the timeline” actions all contribute to incrementally increasing the time humanity has to get its act together and figure this stuff out. You don’t actually need coordination in order to have a positive effect in this way.
And, to reiterate: I say "pause" rather than "never build AGI at all" because MIRI leadership thinks that humanity never building AGI would mean the loss of nearly all of the future's value. If this were a live option, it would be an unacceptably bad one.
^[3]
Nate tells me that his current thoughts on OpenAI are probably a bit less pessimistic than Eliezer's. As a rule, Nate thinks of himself as generally less socially cynical than Eliezer on a bunch of fronts, though not less-cynical enough to disagree with the basic conclusions.
Nate tells me that he agrees with Eliezer that the original version of OpenAI ("an AGI in every household", the associated social drama, etc.) was a pretty negative shock in the wake of the camaraderie of the 2015 Puerto Rico conference.
At this point, of course, the founding of OpenAI is a sunk cost. So Nate mostly prefers to assess OpenAI's current state and future options.
Currently, Nate thinks that OpenAI is trying harder than most on some important safety fronts — though none of this reaches the standards of "adequate project" and we're still totally going to die if they meet great success along their current path.
Since I’ve listed various positives about OpenAI here, I'll note some examples of recent-ish developments that made Nate less happy about OpenAI: his sense that OpenAI was less interested in Paul Christiano's research, Evan Hubinger's research, etc. than he thought they should have been, when Paul was at OpenAI; Dario's decision to leave OpenAI; and OpenAI focusing on the “use AI to solve AI alignment” approach (as opposed to other possible strategies), as endorsed by e.g. Jan Leike, the head of OpenAI’s safety team after Paul's departure.
^[4]
If a plan doesn't make sense, the research community can then notice this and apply corrective arguments, causing the plan to change. As indeed happened when Elon and Sam stated their more-obviously-bad plan for OpenAI at the organization's inception.
It would have been better to state their plan first and start an organization later, so rounds of critical feedback and updating could occur before you lock in decisions about hiring, org structure, name, culture, etc.
But at least it happened at all; if OpenAI had just said "yeah, we're gonna do alignment research!" and left it there, the outcome probably would have been far worse.
Also, if organizations release obviously bad plans but are then unresponsive to counter-arguments, researchers can go work at the orgs with better plans and avoid the orgs with worse plans. This encourages groups to compete to have the seemingly-sanest plan, which strikes me as a better equilibrium than the current one.

Nate tells me that his headline view of OpenAI is mostly the same as his view of other AGI organizations, so he feels a little odd singling out OpenAI.
[...]
But, while this doesn't change the fact that we view OpenAI's effects as harmful on net currently, Nate does want to acknowledge that OpenAI seems to him to be doing better than some other orgs on a number of fronts:

I wanted to give this a big +1. I think OpenAI is doing better than literally every single other major AI research org except probably Anthropic and DeepMind on trying to solve the AI-not-killing-everyone task. I also think that Anthropic/DeepMind/OpenAI are doing better in terms of not publishing their impressive capabilities research than ~everyone else (e.g. not revealing the impressive downstream benchmark numbers on Codex/text-davinci-002 performance). Accordingly, I think there's a tendency to give OpenAI an unfair amount of flak compared to say, Google Brain or FAIR or any of the startups like Adept or Cohere.

This is probably a combination of three effects:

OpenAI is clearly on the cutting edge of AI research.
OpenAI has a lot of visibility in this community, due to its physical proximity and a heavy overlap between OpenAI employees and the EA/Rationalist social scene.
OpenAI is publicly talking about alignment; other orgs don't even acknowledge it, this makes it a heretic rather than an infidel.

And I'm happy that this post pushes against this tendency.

(And yes, standard caveats, reality doesn't grade on a curve, etc.)

Accordingly, I think there’s a tendency to give OpenAI an unfair amount of flak compared to say, Google Brain or FAIR or any of the startups like Adept or Cohere.

I'm not sure I agree that this is unfair.

OpenAI is clearly on the cutting edge of AI research.

This is obviously a good reason to focus on them more.

OpenAI has a lot of visibility in this community, due to its physical proximity and a heavy overlap between OpenAI employees and the EA/Rationalist social scene.

Perhaps we have responsibility to scrutinize/criticize them more because of this, due to comparative advantage (who else can do it easier/better than we can), and because they're arguably deriving some warm fuzzy glow from this association? (Consider FTX as an analogy.)

OpenAI is publicly talking about alignment; other orgs don’t even acknowledge it, this makes it a heretic rather than an infidel.

Yes, but they don't seem keen on talking about the risks/downsides/shortcomings of their alignment efforts (e.g., they make their employees sign non-disparagement agreements and as a result the former alignment team members who left in a big exodus can't say exactly why they left). If you only talk about how great your alignment effort is, maybe that's worse than not talking about it at all, as it's liable to give people a false sense of security?

I appreciate this post! It feels fairly reasonable, and much closer to my opinion than (my perception of) previous MIRI posts. Points that stand out:

Publishing capabilities work is notably worse than just doing the work.
- I'd argue that hyping up the capabilities work is even worse than just quietly publishing it without fanfare.
- Though, a counter-point is that if an organisation doesn't have great cyber-security and is a target for hacking, capabilities can easily leak (see, e.g., the Soviets getting nuclear weapons 4 years after the US, despite it being a top secret US program and before the internet)
Capabilities work can be importantly helpful for alignment work, especially empirical focused work.

Probably my biggest crux is around the parallel vs serial thing. My read is that fairly little current alignment work really feels "serial" to me. Assuming that you're mostly referring to conceptual alignment work, my read is that a lot of it is fairly confused, and would benefit a lot from real empirical data and real systems that can demonstrate concepts such as agency, planning, strategic awareness, etc. And just more data on what AGI cognition might look like. Without these, it seems extremely hard to distinguish true progress from compelling falsehoods.

Publishing capabilities work is notably worse than just doing the work.
I'd argue that hyping up the capabilities work is even worse than just quietly publishing it without fanfare.

What's the mechanism you're thinking of, through which hype does damage?

I also doubt that good capabilities work will be published "without fanfare", given how watched this space is.

My read is that fairly little current alignment work really feels "serial" to me. Assuming that you're mostly referring to conceptual alignment work, my read is that a lot of it is fairly confused, and would benefit a lot from real empirical data and real systems that can demonstrate concepts such as agency, planning, strategic awareness, etc.

I think this is more an indictment of existing work, and less a statement about what work needs to be done. e.g. my guess is we'll both agree that the original inner alignment work from Evan Hubinger is pretty decent conceptual research. And I think much conceptual work seems pretty serial to me, and is hard to parallelize due to reasons like "intuitions from the lead researcher are difficult to share" and communications difficulties in general.

Of course, I also do agree that there's a synergy between empirical data and thinking -- e.g. one of the main reasons I'm excited about Redwood's agenda is because it's very conceptually driven, which lets it be targeted at specific problems (for example, they're coming up with techniques that aim to solve the mechanistic anomaly detection problem, and finding current analogues and doing experiments with those).

What's the mechanism you're thinking of, through which hype does damage?

This ship may have sailed at this point, but to me the main mechanism is getting other actors to pay attention, focus on the most effective kind of capabilities work, and making it more politically feasible to raise support. Eg, I expect that the media firestorm around GPT-3 made it significantly easier to raise the capital + support within Google Brain to train PaLM. Legibly making a ton of money with it falls in a similar category to me.

Gopher is a good example of not really seeing much fanfare, I think? (Though I don't spend much time on ML Twitter, so maybe there was loads lol)

And I think much conceptual work seems pretty serial to me, and is hard to parallelize due to reasons like "intuitions from the lead researcher are difficult to share" and communications difficulties in general.

Ah, my key argument here is that most conceptual work is bad because of lacking good empirical examples, grounding and feedback loops, and that if we were closer to AGI we could have this.

I agree that risks from learned optimisation is important and didn't need this, and plausibly feels like a good example of serial work to me.

I expect that the media firestorm around GPT-3 made it significantly easier to raise the capital + support within Google Brain to train PaLM.

Wouldn't surprise me if this was true, but I agree with you that it's possible the ship has already sailed on LLMs. I think this is more so the case if you have a novel insight about what paths are more promising to AGI (similar to the scaling hypothesis in 2018)---getting ~everyone to adopt that insight would significantly advance timelines, though I'd argue that publishing it (such that only the labs explicitly aiming at AGI like OpenAI and Deepmind adopt it) is not clearly less bad than hyping it up.

Gopher is a good example of not really seeing much fanfare, I think? (Though I don't spend much time on ML Twitter, so maybe there was loads lol)

Surely this is because it didn't say anything except "Deepmind is also now in the LLM game", which wasn't surprising given Geoff Irving left OpenAI for Deepmind? There weren't significant groundbreaking techniques used to train Gopher as far as I can remember.

Chinchilla, on the other hand, did see a ton of fanfare.

Ah, my key argument here is that most conceptual work is bad because of lacking good empirical examples, grounding and feedback loops, and that if we were closer to AGI we could have this.

Cool. I agree with you that conceptual work is bad in part because of a lack of good examples/grounding/feedback loops, though I think this can be overcome with clever toy problem design and analogies to current problems (that you can then get the examples/grounding/feedback loops from). E.g. surely we can test toy versions of shard theory claims using the small algorithmic neural networks we're able to fully reverse engineer.

Can you give some historical examples of work that lowered the amount-of-serial-research-left-till-doom? And examples of work that didn't? Because an advance in alignment is often a direct advance in capabilities, and I'm a little confused about the spectrum of possibilities.

Here's an example of my confusion. Clearly interpretability work is mostly good, right? Exploring semantic superpositions and other current advances seems clearly beneficial to publish, in spite of the fact that they advance capabilities. If we progress to the point where we can interpret the algorithms that a smallish NN is using, that still seems fine. But if interpretability research progresses to the point where it can decode the algorithms a NN is running, then the techniques that allow that level of interpretability are quite dangerous. For example, if we find large NNs have some kind of proto-general search which seems like it could be amplified easily to get a general agent, then, you know, it would be pretty bad if every AGI organization could find this out by just applying standard interpretability tool X. Or is that kind of work still worth publishing, because powerful interpretability would make alignment way easier and that outweighs the risk of reducing serial research time till doom?

I don't know Nate's response, but his take on agent-foundations-ish research in A note about differential technological development (and the fact that he and MIRI have been broadly pro-interpretability-work to date) might help clarify how he thinks about cases like this.

[...]
I feel relatively confident that a large percentage of people who do capabilities work at OpenAI, FAIR, DeepMind, Anthropic, etc. with justifications like "well, I'm helping with alignment some too" or "well, alignment will be easier when we get to the brink" (more often EA-adjacent than centrally "EA", I think) are currently producing costs that outweigh the benefits.
Some relatively niche and theoretical agent-foundations-ish research directions might yield capabilities advances too, and I feel much more positive about those cases. I’m guessing it won’t work, but it’s the kind of research that seems positive-EV to me and that I’d like to see a larger network of researchers tackling, provided that they avoid publishing large advances that are especially likely to shorten AGI timelines.
The main reasons I feel more positive about the agent-foundations-ish cases I know about are:
The alignment progress in these cases appears to me to be much more serial, compared to the vast majority of alignment work the field outputs today.
I’m more optimistic about the total amount of alignment progress we’d see in the worlds where agent-foundations-ish research so wildly exceeded my expectations that it ended up boosting capabilities. Better understanding optimization in this way really would seem to me to take a significant bite out of the capabilities generalization problem, unlike most alignment work I’m aware of.
The kind of people working on agent-foundations-y work aren’t publishing new ML results that break SotA. Thus I consider it more likely that they’d avoid publicly breaking SotA on a bunch of AGI-relevant benchmarks given the opportunity, and more likely that they’d only direct their attention to this kind of intervention if it seemed helpful for humanity’s future prospects.
(Footnote: On the other hand, weirder research is more likely to shorten timelines a lot, if it shortens them at all. More mainstream research progress is less likely to have a large counterfactual impact, because it’s more likely that someone else has the same idea a few months or years later. “Low probability of shortening timelines a lot” and “higher probability of shortening timelines a smaller amount” both matter here, so I advocate that both niche and mainstream researchers be cautious and deliberate about publishing potentially timelines-shortening work.)
Relatedly, the energy and attention of ML is elsewhere, so if they do achieve a surprising AGI-relevant breakthrough and accidentally leak bits about it publicly, I put less probability on safety-unconscious ML researchers rushing to incorporate it.
I’m giving this example not to say “everyone should go do agent-foundations-y work exclusively now!”. I think it’s a neglected set of research directions that deserves far more effort, but I’m far too pessimistic about it to want humanity to put all its eggs in that basket.
Rather, my hope is that this example clarifies that I’m not saying “doing alignment research is bad” or even “all alignment research that poses a risk of advancing capabilities is bad”.
[...]

