[Crossposted from Musings and Rough Drafts.]

Epistemic status: Brainstorming and first draft thoughts.

[Inspired by something that Ruby Bloom wrote and the Paul Christiano episode of the 80,000 Hours podcast.]

One claim I sometimes hear about AI alignment [paraphrase]:

It is really hard to know what sorts of AI alignment work are good this far out from transformative AI. As we get closer, we’ll have a clearer sense of what AGI / Transformative AI is likely to actually look like, and we’ll have much better traction on what kind of alignment work to do. In fact, MOST of the work of AI alignment is done in the final few years (or months) before AGI, when we’ve solved most of the hard capabilities problems already so we know what AGI will look like and we can work directly, with good feedback loops, on the sorts of systems that we want to align.

Usually, this is said to argue that the value of the alignment research being done today lies primarily in enabling future, more critical alignment work. But “progress in the field” is only one dimension to consider in boosting and unblocking the work of alignment researchers in this last stretch.

In this post I want to take the above claim seriously and consider the implications. If most of the alignment work that will ever be done is going to be done in the final few years before the deadline, our job in 2021 is mostly to do everything we can to enable the people working on the problem in the crucial period (which might be us, or our successors, or both), so that they are as well equipped as we can possibly make them.

What are all the ways that we can think of that we can prepare now, for our eventual final exam? What should we be investing in, to improve our efficacy in those final, crucial, years?

The following are some ideas.

[In this post, I'm going to refer to this last stretch of a few months to a few years as "final crunch time", as distinct from just "crunch time", i.e. this century.]


For this to matter, our alignment researchers need to be at the cutting edge of AI capabilities, and they need to be positioned such that their work can actually be incorporated into AI systems as they are deployed.

A different kind of work

Most current AI alignment work is pretty abstract and theoretical, for two reasons. 

The first reason is a philosophical / methodological claim: There’s a fundamental “nearest unblocked strategy” / overfitting problem. Patches that correct clear and obvious alignment failures are unlikely to generalize fully; they'll only constrain unaligned optimization to channels that you can’t recognize. For this reason, some claim, we need to have an extremely robust, theoretical understanding of intelligence and alignment, ideally at the level of proofs.

The second reason is a practical consideration: we just don’t have powerful AI systems to work with, so there isn’t much that can be done in the way of tinkering and getting feedback.

That second objection becomes less relevant in final crunch time: in this scenario, we’ll have powerful systems 1) that will be built along the same lines as the systems that it is crucial to align, and 2) that will have enough intellectual capability to pose at least semi-realistic “creative” alignment failures. (I.e., current systems are so dumb, and live in such constrained environments, that it isn’t clear how much we can learn about aligning literal superintelligences from them.)

And even if the first objection ultimately holds, theoretical understanding often (usually?) follows from practical engineering proficiency. It might be a fruitful path to tinker with semi-powerful systems: trying out different alignment approaches empirically, experimenting to discover new ones, and then backing up to do robust theory-building given much richer data about what seems to work.

I could imagine sophisticated setups that enable this kind of tinkering and theory building. For instance, I imagine a setup that includes:

  • A “sandbox” that affords easy implementation of many different AI architectures and custom combinations of architectures, with a wide variety of easy-to-create, easy-to-adjust training schemes, and a full suite of interpretability tools. We could quickly try out different safety schemes, in different distributions, and observe what kinds of cognition and behavior result.
  • A meta AI that observes the sandbox, and all of the experiments therein, to learn general principles of alignment. We could use interpretability tools to use this AI as a “microscope” on the AI alignment problem itself, abstracting out patterns and dynamics that we couldn’t easily have teased out with only our own brains. This meta system might also play some role in designing the experiments to run in the sandbox, to allow it to get the best data to test its hypotheses.
  • A theorem prover that would formalize the properties and implications of those general alignment principles, to give us crisply specified alignment criteria by which we can evaluate AI designs.

Obviously, working with a full system like this is quite different from abstract, purely theoretical work on decision theory or logical uncertainty. It is closer to the sort of experiments that the OpenAI and DeepMind safety teams have published, but even that is a pretty far cry from the kind of rapid-feedback tinkering that I’m pointing at here.

Given that the kind of work that leads to research progress might be very different in final crunch time than it is now, it seems worth forecasting what shape that work will take, and seeing if there are ways to practice doing that kind of work before final crunch time.


Obviously, when we get to final crunch time, we don’t want to have to spend any time studying fields that we could have studied in the lead-up years. We want to have already learned all the information and ways of thinking that we’ll want to know then. It seems worth considering which fields we’ll wish we had known when the time comes.

The obvious contenders:

  • Machine Learning
  • Machine Learning interpretability
  • All the Math of Intelligence that humanity has yet amassed [probability theory, causality, etc.]

Some less obvious possibilities:

  • Neuroscience?
  • Geopolitics, if it turns out that which technical approach is ideal hinges on important facts about the balance of power?
  • Computer security?
  • Mechanism design in general?

Do other subjects come to mind?

Research methodology / Scientific “rationality”

We want the research teams tackling this problem in final crunch time to have the best scientific methodology, and the best cognitive tools / habits for making research progress, that we can manage to provide them.

This may include skills or methods in the domains of:

  • Ways to notice as early as possible if you’re following an ultimately-fruitless research path
  • Noticing / Resolving / Avoiding blindspots
  • Effective research teams
  • Original seeing / overcoming theory blindness / hypothesis generation
  • ???


Productivity

One obvious thing is to spend time now investing in habits and strategies for effective productivity. It seems senseless to waste precious hours in the final crunch time due to procrastination or poor sleep. It is well worth it to solve those problems now. But aside from the general suggestion to get your shit in order and develop good habits, I can think of two more specific things that seem good to do.

Practice no-cost-too-large productive periods

There may be trades that could make people more productive on the margin, but that are too expensive in regular life. For instance, I think that I might conceivably benefit from having a dedicated person whose job is to always be near me, so that I can duck with them (or have them "hold space" for me) with zero friction. I’ve experimented a little bit with similar ideas (like having a list of people on call to duck with), but it doesn’t seem worth it for me to pay a whole extra person-salary to have the person be on call and in the same building, instead of on call via Zoom.

But it is worth it at final crunch time.

It might be worth spending some period of time (maybe a week, maybe a month) every year optimizing unrestrainedly for research productivity, with no heed to cost at all, so that we can practice doing that. This is possibly a good thing to do anyway, because it might uncover trades that, on reflection, are worth importing into my regular life.

Optimize rest

One particular subset of personal productivity jumps out at me: each person should figure out their actual optimal cadence of rest.

There’s a failure mode that ambitious people commonly fall into, which is working past the point when marginal hours of work are negative. When the whole cosmic endowment is on the line, there will be a natural temptation to push yourself to work as hard as you can, and forgo rest. Obviously, this is a mistake. Rest isn’t just a luxury: it is one of the inputs to productive work.

There is a second level of this error, in which one grudgingly takes the minimal amount of rest time and gets back to work. But the amount of rest required to stay functional is not the optimal amount of rest, the amount that maximizes productive output. Eliezer mused years ago that he felt kind of guilty about it, but maybe he should actually take two days off between research days, because the quality of his research seemed better on days when he happened to have had two rest days preceding.

In final crunch time, we want everyone to be resting the amount that actually maximizes area under the curve, not the amount that maximizes work-hours. We should do binary search now, to figure out what that optimum is.

Also, obviously, we should explore to discover highly effective methods of rest, instead of doing whatever random things seem good (unless, as it turns out, “whatever random thing seems good” is actually the best way to rest).
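As an aside on the search itself: if productivity really is a unimodal function of rest (more rest helps up to a point, then costs more output than it buys), the natural variant of the binary-search idea is a ternary search over the rest parameter. Here is a minimal sketch, where the made-up productivity curve stands in for what would, in reality, be weeks of self-experimentation per data point:

```python
def productivity(rest_days_per_week: float) -> float:
    # Toy unimodal curve, invented for illustration, peaking at
    # 2 rest days per week.
    return -(rest_days_per_week - 2.0) ** 2 + 10.0

def optimal_rest(lo: float, hi: float, tol: float = 1e-6) -> float:
    """Ternary search for the argmax of a unimodal function on [lo, hi]."""
    while hi - lo > tol:
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if productivity(m1) < productivity(m2):
            lo = m1   # the peak is to the right of m1
        else:
            hi = m2   # the peak is at or to the left of m2
    return (lo + hi) / 2

print(round(optimal_rest(0.0, 7.0), 3))  # converges on ≈ 2.0
```

The real difficulty, of course, is that each evaluation of `productivity` is noisy and slow for a human, which is exactly why this measurement is worth starting well before final crunch time.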

Picking up new tools

One thing that will be happening during this period is a flurry of new AI tools that can radically transform thinking and research, with increasingly radical tools perhaps arriving at a rate of once a month or faster.

Being able to take advantage of those tools and start using them for research immediately, with minimal learning curve, seems extremely high leverage.

If there are things we can do that increase the ease of picking up new tools and using them to their full potential (instead of, as is common, using only the features afforded by your old tools and only very gradually adopting the new affordances), that seems extremely valuable.

Some thoughts (probably bad):

  • Could we set up our workflows, somehow, such that it is easy to integrate new tools into them? Like if you already have a flexible, expressive research interface (something like Roam? or maybe, if the technology is more advanced by then, something like Neuralink?), and you’re used to regular changes in the capability of the interface’s backend?
  • Can we just practice? Could we have a competitive game of introducing new tools, and trying to orient to them and figure out how to exploit them as creatively as possible?
  • Probably it should be some people’s full time job to translate cutting edge developments in AI into useful tools and practical workflows, and then to teach those workflows to the researchers?
  • Can we design a meta-tool that helps us figure out how to exploit new tools? Is it possible to train an AI assistant specifically for helping us get the most out of our new AI tools?
  • Can we map out the sorts of constraints on human thinking and/or the sorts of tools that will be possible, in advance, so that we can practice with much weaker versions of those tools, get a sense of how we would use them, and be ready when they arrive?
  • Can we try out new tools on psychedelics, to boost neuroplasticity? Is there some other way to temporarily weaken our neural priors? Maybe some kind of training in original seeing?

Staying grounded and stable in spite of the stakes

Obviously, being one of the few hundred people on whom the whole future of the cosmos rests, while the singularity is happening around you, and you are confronted with the stark reality of how doomed we are, is scary and disorienting and destabilizing.

I imagine that this induces all kinds of psychological pressures that might find release in any of a number of concerning outlets: deluding oneself about the situation or slipping sideways into a more convenient world, becoming manic and frenetic, or sinking into immovable depression.

We need our people to have the virtue of being able to look the problem in the eye, with all of its terror and disorientation, and stay stable enough to make tough calls, and make them sanely.

We’re called to cultivate a virtue (or maybe a set of virtues) of which I don’t know the true name, but which involves courage and groundedness, and determination-without-denial.

I don’t know what is entailed in cultivating that virtue. Perhaps meditation? Maybe testing oneself at literal risk to one’s life? I would guess that people in other times and places, who needed to face risks to their own lives and those of their families, and take action anyway, did have this virtue, or some part of it, and it might be fruitful to investigate those cultures and how that virtue was cultivated.



Any more ideas?


Daniel Kokotajlo


Thanks, this is a great thing to be thinking about and a good list of ideas!

Do other subjects come to mind?

Public speaking skills, persuasion skills, debate skills, etc.

Practice no-cost-too-large productive periods

I like this idea. At AI Impacts we were discussing something similar: having "fire drills" where we spend a week (or even just a day) pretending that a certain scenario has happened, e.g. "DeepMind just announced they have a Turing-test-passing system and will demo it a week from now; we've got two journalists asking us for interviews and need to prep for the emergency meeting with the AI safety community tonight at 5." We never got around to testing out such a drill but I think variants on this idea are worth exploring. Inspired by what you said, perhaps we could have "snap drills" where suddenly we take our goals for the next two months and imagine that they need to be accomplished in a week instead, and see how much we can do. (Additionally, ideas like this seem like they would have bonus effects on morale, teamwork, etc.)

I don’t know what is entailed in cultivating that virtue. Perhaps meditation? Maybe testing one’s self at literal risk to one’s life?

This virtue is extremely important to militaries. Does any military use meditation as part of its training? I would guess that the training given to medics and officers (soldiers for whom clear thinking is especially important) might have some relevant lessons. Then again, maybe the military deals with this primarily by selecting the right sort of people rather than taking arbitrary people and training them. If so, perhaps we should look into applying similar selection methods in our own organizations to identify people to put in charge when the time comes.

Any more ideas?

In this post I discuss some:

  • Perhaps it would be good to have an Official List of all the AI safety strategies, so that whatever rationale people give for why this AI is safe can be compared to the list. (See this prototype list.)
  • Perhaps it would be good to have an Official List of all the AI safety problems, so that whatever rationale people give for why this AI is safe can be compared to the list, e.g. "OK, so how does it solve outer alignment? What about mesa-optimizers? What about the malignity of the universal prior? I see here that your design involves X; according to the Official List, that puts it at risk of developing problems Y and Z..." (See this prototype list.)
  • Perhaps it would be good to have various important concepts and arguments re-written with an audience of skeptical and impatient AI researchers in mind, rather than the current audience of friends and LessWrong readers.

Thinking afresh, here's another idea: I have a sketch of a blog post titled "What Failure Feels Like." The idea is to portray a scenario of doom in general, abstract terms (like Paul's post does, as opposed to writing a specific, detailed story) but with a focus on how it feels to us AI-risk-reducers, rather than focusing on what the world looks like in general or what's going on inside the AIs. I decided it would be depressing and not valuable to write. However, maybe it would be valuable as a thing people could read to help emotionally prepare/steel themselves for the time when they "are confronted with the stark reality of how doomed we are." IDK.

I guess overall my favorite idea is to just periodically spend time thinking about what you'd do if you found out that takeoff was happening soon. E.g. "DeepMind announces a Turing-test-passing system", or "We learn of a convincing roadmap to AGI involving only 3 OOMs more compute", or "China unveils a project to spend +7 OOMs on a single training run by 2030, with lesser training runs along the way". I think that the exercise of thinking about near-term scenarios and then imagining what we'd do in response will be beneficial even on long timelines, but certainly super beneficial on short timelines (even if, as is likely, none of the scenarios we imagine come to pass).

Does any military use meditation as part of its training? 

Yes, e.g.:

This [2019] winter, Army infantry soldiers at Schofield Barracks in Hawaii began using mindfulness to improve shooting skills — for instance, focusing on when to pull the trigger amid chaos to avoid unnecessary civilian harm.

The British Royal Navy has given mindfulness training to officers, and military leaders are rolling it out in the Army and Royal Air Force for some officers and enlisted soldiers. The New Zealand Defence Force recently adopted the technique, and military forces o…
Daniel Kokotajlo
Hmmm, if this is the most it's been done, then that counts as a No in my book. I was thinking something like "Ah yes, the Viet Cong did this for most of the war, and it's now standard in both the Vietnamese and Chinese armies." Or at least "Some military somewhere has officially decided that this is a good idea and they've rolled it out across a large portion of their force."

Tsvi Benson-Tilsen


I speculate (based on personal glimpses, not based on any stable thing I can point to) that there's many small sets of people (say of size 2-4) who could greatly increase their total output given some preconditions, unknown to me, that unlock a sort of hivemind. Some of the preconditions include various kinds of trust, of common knowledge of shared goals, and of person-specific interface skill (like speaking each other's languages, common knowledge of tactics for resolving ambiguity, etc.).
[ETA: which, if true, would be good to have already set up before crunch time.]



One of the biggest considerations would be the process for activating "crunch time". In what situations should crunch time be declared? Who decides? How far out would we want to activate it, and would there be different levels? Are there any downsides to such a process, including unwanted attention?

If these aren't discussed in advance, then I imagine that far too much of the available time could be taken up by deciding whether or not to activate crunch-time protocols.

PS. I actually proposed here that we might be able to get a superintelligence to solve most of the problem of embedded agency by itself. I'll try to write it up into a proper post soon.

10 comments

For this to matter, our alignment researchers need to be at the cutting edge of AI capabilities, and they need to be positioned such that their work can actually be incorporated into AI systems as they are deployed.

If we become aware that a lab will likely deploy TAI soon, other informed actors will probably become aware as well. This implies that many people would be trying to influence and gain access to this lab. Therefore, we should already have AI alignment researchers in positions of power within the lab before this happens.

Seems rather obvious to me that the sort of person who is like, "Oh, well, we can't possibly work on this until later" will, come Later, be like, "Oh, well, it's too late to start doing basic research now, we'll have to work with whatever basic strategies we came up with already."

Seems true, but also didn't seem to be what this post was about?

Most current AI alignment work is pretty abstract and theoretical, for two reasons.

FWIW, this is not obvious to me (or at least depends a lot on what you mean by 'AI alignment'). Work at places like OpenAI, CHAI, and DeepMind tends to be relatively concrete.

Also if you count work done by people not publicly identified as motivated by existential risk, I think the concrete:abstract ratio will increase.


I found this a surprisingly obvious set of strategic considerations (and meta-considerations), that for some reason I'd never seen anyone actually attempt to tackle before.

I found the notion of practicing "no cost too large" periods quite interesting. I'm somewhat intimidated by the prospect of trying it out, but it does seem like a good idea.


Alignment-focused policymakers / policy researchers should also be in positions of influence. 


I'd add a bunch of human / social topics to your list, e.g.:

  • Policy 
  • Every relevant historical precedent
  • Crisis management / global logistical coordination / negotiation
  • Psychology / media / marketing
  • Forecasting 

Research methodology / Scientific “rationality,” Productivity, Tools

I'd be really excited to have people use Elicit with this motivation. (More context here and here.)

Re: competitive games of introducing new tools, we did an internal speed Elicit vs. Google test to see which tool was more efficient for finding answers or mapping out a new domain in 5 minutes. We're broadly excited to structure and support competitive knowledge work and optimize research this way. 

Relevant topic of a future post: some of the ideas from Risks From Learned Optimization or the Improved Good Regulator Theorem offer insights into building effective institutions and developing flexible problem-solving capacity.

Rough intuitive idea: intelligence/agency are about generalizable problem-solving capability. How do you incentivize generalizable problem-solving capability? Ask the system to solve a wide variety of problems, or a problem general enough to encompass a wide variety.

If you want an organization to act agenty, then a useful technique is to constantly force the organization to solve new, qualitatively different problems. An organization in a highly volatile market subject to lots of shocks or distribution shifts will likely develop some degree of agency naturally. 

Organizations with an adversary (e.g. traders in the financial markets) will likely develop some degree of agency naturally, as their adversary frequently adopts new methods to counter the organization's current strategy. Red teams are a good way to simulate this without a natural adversary.

Some organizations need to solve a sufficiently-broad range of problems as part of their original core business that they develop some degree of agency in the process. These organizations then find it relatively easy to expand into new lines of business. Amazon is a good example.

Conversely, businesses in stable industries facing little variability will end up with little agency. They can't solve new problems efficiently, and will likely be wiped out if there's a large shock or distribution shift in the market. They won't be good at expanding or pivoting into new lines of business. They'll tend to be adaptation-executors rather than profit-maximizers, to a much greater extent than agenty businesses.

This all also applies at a personal level: if you want to develop general problem-solving capability, then tackle a wide variety of problems. Try problems in many different fields. Try problems with an adversary. Try different kinds of problems, or problems with different levels of difficulty. Don't just try to guess which skills or tools generalize well, go out and find out which skills or tools generalize well.

If we don't know what to expect from future alignment problems, then developing problem-solving skills and organizations which generalize well is a natural strategy.

Re: picking up new tools, skill and practice in designing and building user interfaces, especially to complex or not-very-transparent systems, would be very high-leverage if the tool-adoption step is rate-limiting.

I don't actually think "It is really hard to know what sorts of AI alignment work are good this far out from transformative AI" is very helpful.

It is currently fairly hard to tell what is good alignment work. A week from TAI, either good alignment work will be easier to recognise (because alignment progress turned out not to be strongly correlated with capabilities), or good alignment research will be just as hard to recognise. (More likely the latter.) I can't think of any safety research that can be done on GPT-3 that can't be done on GPT-1.

In my picture, research gets done and theorems get proved; the researcher population grows as funding increases and talent matures. Toy models get produced. Once you can easily write down a description of an FAI with unbounded compute, that's when you start to look at algorithms that have good capabilities in practice.