Rohin Shah

PhD student at the Center for Human-Compatible AI. Creator of the Alignment Newsletter.


Value Learning
Alignment Newsletter


[AN #118]: Risks, solutions, and prioritization in a world with many AI systems

No, I meant it as written. People usually give numbers without any assumptions attached, which I would assume means "I predict that in our actual world there is an X% chance of an existential catastrophe due to AI".

[AN #118]: Risks, solutions, and prioritization in a world with many AI systems

I disagree with Critch that we should expect single/single delegation(/alignment) to be solved "by default" because of economic incentives.  I think economic incentives will not lead to it being solved well-enough, soon enough

Indeed, this is where my 10% comes from, and may be a significant part of the reason I focus on intent alignment whereas Critch would focus on multi/multi stuff.

My thought experiment from the paper is: "imagine if everyone woke up 1000000x smarter tomorrow."

Basically all of my arguments for "we'll be fine" rely on not having a huge discontinuity like that, so while I roughly agree with your prediction in that thought experiment, it's not very persuasive.

(The arguments do not rely on technological progress remaining at its current pace.)

keep up with rapid technological progress (you might say they are already failing to...)

At least in the US, our institutions are succeeding at providing public infrastructure (roads, water, electricity...), not having nuclear war, ensuring children can read, and allowing me to generally trust the people around me despite not knowing them. Deepfakes and facial recognition are small potatoes compared to that.

These are all failing us when it comes to, e.g. climate change.

I agree this is overall a point against my position (though I probably don't think it is as strong as you think it is).

Multi-agent safety

Nicholas's summary for the Alignment Newsletter:

Much of safety research focuses on a single agent that is directly incentivized by a loss/reward function to take particular actions. This sequence instead considers safety in the case of multi-agent systems interacting in complex environments. In this situation, even simple reward functions can yield complex and highly intelligent behaviors that are only indirectly related. For example, evolution led to humans who can learn to play chess, despite the fact that the ancestral environment did not contain chess games. In these situations, the problem is not how to construct an aligned reward function, the problem is how to shape the experience that the agent gets at training time such that the final agent policy optimizes for the goals that we want. This sequence lays out some considerations and research directions for safety in such situations.

One approach is to teach agents the generalizable skill of obedience. To accomplish this, one could design the environment to incentivize specialization. For instance, if an agent A is more powerful than agent B, but can see less of the environment than B, A might be incentivized to obey B’s instructions if they share a goal. Similarly we can increase the ease and value of coordination through enabling access to a shared permanent record or designing tasks that require large-scale coordination.

A second approach is to move agents to simpler and safer training regimes as they develop more intelligence. The key assumption here is that we may require complex regimes such as competitive multi-agent environments to jumpstart intelligent behavior, but may be able to continue training in a simpler regime such as single-task RL later. This is similar to current approaches for training a language model via supervised learning and then finetuning with RL, but going in the opposite direction to increase safety rather than capabilities.

A third approach is specific to a collective AGI: an AGI that is composed of a number of separate general agents trained on different objectives that learn to cooperatively solve harder tasks. This is similar to how human civilization is able to accomplish much more than any individual human. In this regime, the AGI can be effectively sandboxed by either reducing the population size or by limiting communication channels between the agents. One advantage of this approach to sandboxing is that it allows us to change the effective intelligence of the system at test-time, without going through a potentially expensive retraining phase.

Nicholas's opinion:

I agree that we should put more emphasis on the safety of multi-agent systems. We already have <@evidence@>(@Emergent Tool Use from Multi-Agent Interaction@) that complex behavior can arise from simple objectives in current systems, and this seems only more likely as systems become more powerful. Two-agent paradigms such as GANs, self-play, and debate, are already quite common in ML. Lastly, humans evolved complex behavior from the simple process of evolution so we have at least one example of this working. I also think this is an interesting area where there is lots to learn from other fields, such as game theory and evolutionary biology,

For any empirically-minded readers of this newsletter, I think this sequence opens up a lot of potential for research. The development of safety benchmarks for multi-agent systems and then the evaluation of these approaches seems like it would make many of the considerations discussed here more concrete. I personally would find them much more convincing with empirical evidence to back up that they work with current ML.

My opinion:

The AGI model here in which powerful AI systems arise through multiagent interaction is an important and plausible one, and I'm excited to see some initial thoughts about it. I don't particularly expect any of these ideas to be substantially useful, but I'm also not confident that they won't be useful, and given the huge amount of uncertainty about how multiagent interaction shapes agents, that may be the best we can hope for currently. I'd be excited to see empirical results testing some of these ideas out, as well as more conceptual posts suggesting more ideas to try.

Clarifying “What failure looks like” (part 1)

Yup, all of that sounds right to me!

One caveat is that on my models of AI development I don't expect the CEO could just copy model parameters to the intern. I think it's more likely that we have something along the lines of "graduate of <specific college major>" AI systems that you then copy and use as needed. But I don't think this really affects your point.

Sure, you could assert that AI systems replace humans one-for-one, but if this is not the best design, then there may be competitive pressure to move away from this towards something less interpretable.

Yeah jtbc I definitely would not assert this. If I had to make an argument for as-much-interpretability, it would be something like "in the scenario we're considering, AI systems are roughly human-level in capability; at this level of capability societal organization will still require a lot of modularity; if we know nothing else and assume agents are as black-boxy as humans, it seems reasonable to assume this will lead to a roughly similar amount of interpretability as current society". But this is not a particularly strong argument, especially in the face of vast uncertainty about what the future looks like.

Clarifying “What failure looks like” (part 1)

I guess you mean here that activations and weights in NNs are more interpretable to us than neurological processes in the human brain, but if so this comparison does not seem relevant to the text you quoted. Consider that it seems easier to understand why an editor of a newspaper placed some article on the front page than why FB's algorithm showed some post to some user (especially if we get to ask the editor questions or consult with other editors).

Isn't this what I said in the rest of that paragraph (although I didn't have an example)?

which I would expect to become uncompetitive with "replacing institutions" at some point

I'm not claiming that replacing humans is more competitive than replacing institutions. I'm claiming that, if we're considering the WFLL1 setting, and we're considering the point at which we could have prevented failure, at that point I'd expect AI systems are in the "replacing humans" category. By the time they're in the "replacing institutions" category, we probably are far beyond the position where we could do anything about the future.

Separately, even in the long run, I expect modularity to be a key organizing principle for AI systems.

you may still get weird dynamics between AI systems within an institution and across institutions (e.g. between a CEO advisor AI and a regulator advisor AI). These dynamics may be very hard to interpret (and may not even involve recognizable communication channels).

I agree this is possible but it doesn't seem very likely to me, since we'll very likely be training our AI systems to communicate in natural language, and those AI systems will likely be trained to behave in vaguely human-like ways.

Clarifying “What failure looks like” (part 1)

Great post! I especially liked the New Zealand example, it seems like a surprisingly good fit. Against a general backdrop of agreement, let me list a few points of disagreement:

Example: predictive policy algorithms used in the US are biased against people of colour.

Somewhat of a nitpick, but it is not clear to me that this is an example of problems of short-term incentives. Are we sure that given the choice between "lower crime, lower costs and algorithmic bias" and "higher crime, higher costs and only human bias", and we have dictatorial power and can consider long-term effects, we would choose the latter on reflection? Similarly with other current AI examples (e.g. the Google gorilla misclassification).

(It's possible that if I delved into the research and the numbers the answer would be obvious; I have extremely little information on the scale of the problem currently.)

Most human institutions are at least somewhat interpretable. This means, for example, that humans who tamper with the measurement process to pursue easy-to-measure objectives are prone to being caught, as eventually happened with CompStat. However, ML systems today are currently hard to interpret, and so it may be more difficult to catch interference with the measurement process.

Note ML systems are way more interpretable than humans, so if they are replacing humans then this shouldn't make that much of a difference. If a single ML system replaces an entire institution, then it probably is less interpretable than that institution. It doesn't seem obvious to me which of these we should be considering. (Partly it depends on how general and capable the AI systems are.) Overall I'd guess that for WFLL1 it's closer to "replacing humans" than "replacing institutions".

Draft report on AI timelines

Planned summary for the Alignment Newsletter (which won't go out until it's a Full Open Phil Report):

Once again, we have a piece of work so large and detailed that I need a whole newsletter to summarize it! This time, it is a quantitative model for forecasting when transformative AI will happen.

The overall framework

The key assumption behind this model is that if we train a neural net or other ML model that uses about as much computation as a human brain, that will likely result in transformative AI (TAI) (defined as AI that has an impact comparable to that of the industrial revolution). In other words, we _anchor_ our estimate of the ML model’s inference computation to that of the human brain. This assumption allows us to estimate how much compute will be required to train such a model _using 2020 algorithms_. By incorporating a trend extrapolation of how algorithmic progress will reduce the required amount of compute, we can get a prediction of how much compute would be required for the final training run of a transformative model in any given year.

We can also get a prediction of how much compute will be _available_ by predicting the cost of compute in a given year (which we have a decent amount of past evidence about), and predicting the maximum amount of money an actor would be willing to spend on a single training run. The probability that we can train a transformative model in year Y is then just the probability that the compute _requirement_ for year Y is less than the compute _available_ in year Y.

The vast majority of the report is focused on estimating the amount of compute required to train a transformative model using 2020 algorithms (where most of our uncertainty would come from); the remaining factors are estimated relatively quickly without too much detail. I’ll start with those so that you can have them as background knowledge before we delve into the real meat of the report. These are usually modeled as logistic curves in log space: that is, they are modeled as improving at some constant rate, but will level off and saturate at some maximum value after which they won’t improve.

Algorithmic progress

First off, we have the impact of _algorithmic progress_. <@AI and Efficiency@> estimates that algorithms improve enough to cut compute times in half every 16 months. However, this was measured on ImageNet, where researchers are directly optimizing for reduced computation costs. It seems less likely that researchers are doing as good a job at reducing computation costs for “training a transformative model”, and so the author increases the **halving time to 2-3 years**, with a maximum of **somewhere between 1-5 orders of magnitude** (with the assumption that the higher the “technical difficulty” of the problem, the more algorithmic progress is possible).

Cost of compute

Second, we need to estimate a trend for compute costs. There has been some prior work on this (summarized in [AN #97]( The report has some similar analyses, and ends up estimating **a doubling time of 2.5 years**, and a (very unstable) maximum of improvement by **a factor of 2 million by 2100**.

Willingness to spend

Third, we would like to know the maximum amount (in 2020 dollars) any actor might spend on a single training run. Note that we are estimating the money spent on a _final training run_, which doesn’t include the cost of initial experiments or the cost of researcher time. Currently, the author estimates that all-in project costs are 10-100x larger than the final training run cost, but this will likely go down to something like 2-10x, as the incentive for reducing this ratio becomes much larger.

The author estimates that the most expensive run _in a published paper_ was the final <@AlphaStar@>(@AlphaStar: Mastering the Real-Time Strategy Game StarCraft II@) training run, at ~1e23 FLOP and $1M cost. However, there have probably been unpublished results that are slightly more expensive, maybe $2-8M. In line with <@AI and Compute@>, this will probably increase dramatically to about **$1B in 2025**.

Given that AI companies each have around $100B cash on hand, and could potentially borrow additional several hundreds of billions of dollars (given their current market caps and likely growth in the worlds where AI still looks promising), it seems likely that low hundreds of billions of dollars could be spent on a single run by 2040, corresponding to a doubling time (from $1B in 2025) of about 2 years.

To estimate the maximum here, we can compare to megaprojects like the Manhattan Project or the Apollo program, which suggests that a government could spend around 0.75% of GDP for ~4 years. Since transformative AI will likely be more valuable economically and strategically than these previous programs, we can shade that upwards to 1% of GDP for 5 years. Assuming all-in costs are 5x that of the final training run, this suggests the maximum willingness to spend should be 1% of GDP of the largest country, which we assume grows at ~3% every year.

Strategy for estimating training compute for a transformative model

In addition to the three factors of algorithmic progress, cost of compute, and willingness to spend, we need an estimate of how much computation would be needed to train a transformative model using 2020 algorithms (which I’ll discuss next). Then, at year Y, the compute required is given by computation needed with 2020 algorithms * improvement factor from algorithmic progress, which (in this report) is a probability distribution. At year Y, the compute available is given by FLOP per dollar (aka compute cost) * money that can be spent, which (in this report) is a point estimate. We can then simply read off the probability that the compute required is greater than the compute available.

Okay, so the last thing we need is a distribution over the amount of computation that would be needed to train a transformative model using 2020 algorithms, which is the main focus of this report. There is a lot of detail here that I’m going to elide over, especially in talking about the _distribution_ as a whole (whereas I will focus primarily on the median case for simplicity). As I mentioned early on, the key hypothesis is that we will need to train a neural net or other ML model that uses about as much compute as a human brain. So the strategy will be to first translate from “compute of human brain” to “inference compute of neural net”, and then to translate from “inference compute of neural net” to “training compute of neural net”.

How much inference compute would a transformative model use?

We can talk about the rate at which synapses fire in the human brain. How can we convert this to FLOP? The author proposes the following hypothetical: suppose we redo evolutionary history, but in every animal we replace each neuron with N [floating-point units]( that each perform 1 FLOP per second. For what value of N do we still get roughly human-level intelligence over a similar evolutionary timescale? The author then does some calculations about simulating synapses with FLOPs, drawing heavily on the <@recent report on brain computation@>(@How Much Computational Power It Takes to Match the Human Brain@), to estimate that N would be around 1-10,000, which after some more calculations suggests that the human brain is doing the equivalent of 1e13 - 1e16 FLOP per second, with **a median of 1e15 FLOP per second**, and a long tail to the right.

Does this mean we can say that a transformative model will use 1e15 FLOP per second during inference? Such a model would have a clear flaw: even though we are assuming that algorithmic progress reduces compute costs over time, if we did the same analysis in e.g. 1980, we’d get the _same_ estimate for the compute cost of a transformative model, which would imply that there was no algorithmic progress between 1980 and 2020! The problem is that we’d always estimate the brain as using 1e15 FLOP per second (or around there), but for our ML models there is a difference between FLOP per second _using 2020 algorithms_ and FLOP per second _using 1980 algorithms_. So how do we convert form “brain FLOP per second” to “inference FLOP per second for 2020 ML algorithms”?

One approach is to look at how other machines we have designed compare to the corresponding machines that evolution has designed. An [analysis]( by Paul Christiano concluded that human-designed artifacts tend to be 2-3 orders of magnitude worse than those designed by evolution, when considering energy usage. Presumably a similar analysis done in the past would have resulted in higher numbers and thus wouldn’t fall prey to the problem above. Another approach is to compare existing ML models to animals with a similar amount of computation, and see which one is subjectively “more impressive”. For example, the AlphaStar model uses about as much computation as a bee brain, and large language models use somewhat more; the author finds it reasonable to say that AlphaStar is “about as sophisticated” as a bee, or that <@GPT-3@>(@Language Models are Few-Shot Learners@) is “more sophisticated” than a bee.

We can also look at some abstract considerations. Natural selection had _a lot_ of time to optimize brains, and natural artifacts are usually quite impressive. On the other hand, human designers have the benefit of intelligent design and can copy the patterns that natural selection has come up with. Overall, these considerations roughly balance each other out. Another important consideration is that we’re only predicting what would be needed for a model that was good at most tasks that a human would currently be good at (think a virtual personal assistant), whereas evolution optimized for a whole bunch of other skills that were needed in the ancestral environment. The author subjectively guesses that this should reduce our estimate of compute costs by about an order of magnitude.

Overall, putting all these considerations together, the author intuitively guesses that to convert from “brain FLOP per second” to “inference FLOP per second for 2020 ML algorithms”, we should add an order of magnitude to the median, and add another two orders of magnitude to the standard deviation to account for our large uncertainty. This results in a median of **1e16 FLOP per second** for the inference-time compute of a transformative model.

Training compute for a transformative model

We might expect a transformative model to run a forward pass **0.1 - 10 times per second** (which on the high end would match human reaction time of 100ms), and for each parameter of the neural net to contribute **1-100 FLOP per forward pass**, which implies that if the inference-time compute is 1e16 FLOP per second then the model should have **1e13 - 1e17 parameters**, with a median of **3e14 parameters**.

We now need to estimate how much compute it takes to train a transformative model with 3e14 parameters. We assume this is dominated by the number of times you have to run the model during training, or equivalently, the number of data points you train on times the number of times you train on each data point. (In particular, this assumes that the cost of acquiring data is negligible in comparison. The report argues for this assumption; for the sake of brevity I won’t summarize it here.)

For this, we need a relationship between parameters and data points, which we’ll assume will follow a power law KP^α, where P is the number of parameters and K and α are constants. A large number of ML theory results imply that the number of data points needed to reach a specified level of accuracy grows linearly with the number of parameters (i.e. α=1), which we can take as a weak prior. We can then update this with empirical evidence from papers. <@Scaling Laws for Neural Language Models@> suggests that for language models, data requirements scale as α=0.37 or as α=0.74, depending on what measure you look at. Meanwhile, [Deep Learning Scaling is Predictable, Empirically]( suggests that α=1.39 for a wide variety of supervised learning problems (including language modeling). However, the former paper studies a more relevant setting: it includes regularization, and asks about the number of data points needed to reach a target accuracy, whereas the latter paper ignores regularization and asks about the minimum number of data points that the model _cannot_ overfit to. So overall the author puts more weight on the former paper and estimates a median of α=0.8, though with substantial uncertainty.

We also need to estimate how many epochs will be needed, i.e. how many times we train on any given data point. The author decides not to explicitly model this factor since it will likely be close to 1, and instead lumps in the uncertainty over the number of epochs with the uncertainty over the constant factor in the scaling law above. We can then look at language model runs to estimate a scaling law for them, for which the median scaling law predicts that we would need 1e13 data points for our 3e14 parameter model.

However, this has all been for supervised learning. It seems plausible that a transformative task would have to be trained using RL, where the model acts over a sequence of timesteps, and then receives (non-differentiable) feedback at the end of those timesteps. How would scaling laws apply in this setting? One simple assumption is to say that each rollout over the _effective horizon_ counts as one piece of “meaningful feedback” and so should count as a single data point. Here, the effective horizon is the minimum of the actual horizon and 1/(1-γ), where γ is the discount factor. We assume that the scaling law stays the same; if we instead try to estimate it from recent RL runs, it can change the results by about one order of magnitude.

So we now know we need to train a 3e14 parameter model with 1e13 data points for a transformative task. This gets us nearly all the way to the compute required with 2020 algorithms: we have a ~3e14 parameter model that takes ~1e16 FLOP per forward pass, that is trained on ~1e13 data points with each data point taking H timesteps, for a total of H * 1e29 FLOP. The author’s distributions are instead centered at H * 1e30 FLOP; I suspect this is simply because the author was computing with distributions whereas I’ve been directly manipulating medians in this summary.

The last and most uncertain piece of information is the effective horizon of a transformative task. We could imagine something as low as 1 subjective second (for something like language modeling), or something as high as 1e9 subjective seconds (i.e. 32 subjective years), if we were to redo evolution, or train on a task like “do effective scientific R&D”. The author splits this up into short, medium and long horizon neural net paths (corresponding to horizons of 1e0-1e3, 1e3-1e6, and 1e6-1e9 respectively), and invites readers to place their own weights on each of the possible paths.

There are many important considerations here: for example, if you think that the dominating cost will be generative modeling (GPT-3 style, but maybe also for images, video etc), then you would place more weight on short horizons. Conversely, if you think the hard challenge is to gain meta learning abilities, and that we probably need “data points” comparable to the time between generations in human evolution, then you would place more weight on longer horizons.

Adding three more potential anchors

We can now combine all these ingredients to get a forecast for when compute will be available to develop a transformative model! But not yet: we’ll first add a few more possible “anchors” for the amount of computation needed for a transformative model. (All of the modeling so far has “anchored” the _inference time computation of a transformative model_ to the _inference time computation of the human brain_.)

First, we can anchor _parameter count of a transformative model_ to the _parameter count of the human genome_, which has far fewer “parameters” than the human brain. Specifically, we assume that all the scaling laws remain the same, but that a transformative model will only require 7.5e8 parameters (the amount of information in the human genome) rather than our previous estimate of ~1e15 parameters. This drastically reduces the amount of computation required, though it is still slightly above that of the short-horizon neural net, because the author assumed that the horizon for this path was somewhere between 1 and 32 years.

Second, we can anchor _training compute for a transformative model_ to the _compute used by the human brain over a lifetime_. As you might imagine, this leads to a much smaller estimate: the brain uses ~1e24 FLOP over 32 years of life, which is only 10x the amount used for AlphaStar, and even after adjusting upwards to account for man-made artifacts being worse than those made by evolution, the resulting model predicts a significant probability that we would already have been able to build a transformative model.

Finally, we can anchor _training compute for a transformative model_ to the _compute used by all animal brains over the course of evolution_. The basic assumption here is that our optimization algorithms and architectures are not much better than simply “redoing” natural selection from a very primitive starting point. This leads to an estimate of ~1e41 FLOP to train a transformative model, which is more than the long horizon neural net path (though not hugely more).

Putting it all together

So we now have six different paths: the three neural net anchors (short, medium and long horizon), the genome anchor, the lifetime anchor, and the evolution anchor. We can now assign weights to each of these paths, where each weight can be interpreted as the probability that that path is the _cheapest_ way to get a transformative model, as well as a final weight that describes the chance that none of the paths work out.

The long horizon neural net path can be thought of as a conservative “default” view: it could work out simply by training directly on examples of a long horizon task where each data point takes around a subjective year to generate. However, there are several reasons to think that researchers will be able to do better than this. As a result, the author assigns 20% to the short horizon neural net, 30% to the medium horizon neural net, and 15% to the long horizon neural net.

The lifetime anchor would suggest that we either already could get TAI, or are very close, which seems very unlikely given the lack of major economic applications of neural nets so far, and so gets assigned only 5%. The genome path gets 10%, the evolution anchor gets 10%, and the remaining 10% is assigned to none of the paths working out.

This predicts a **median of 2052** for the year in which some actor would be willing and able to train a single transformative model, with the full graphs shown below:

<Graphs removed since they are in flux and easy to share in a low-bandwidth way> 

How does this relate to TAI?

Note that what we’ve modeled so far is the probability that by year Y we will have enough compute for the final training run of a transformative model. This is not the same thing as the probability of developing TAI. There are several reasons that TAI could be developed _later_ than the given prediction:

1. Compute isn’t the only input required: we also need data, environments, human feedback, etc. While the author expects that these will not be the bottleneck, this is far from a certainty.

2. When thinking about any particular path and making it more concrete, a host of problems tend to show up that will need to be solved and may add extra time. Some examples include robustness, reliability, possible breakdown of the scaling laws, the need to generate lots of different kinds of data, etc.

3. AI research could stall, whether because of regulation, a global catastrophe, an AI winter, or something else.

However, there are also compelling reasons to expect TAI to arrive _earlier_:

1. We may develop TAI through some other cheaper route, such as a <@services model@>(@Reframing Superintelligence: Comprehensive AI Services as General Intelligence@).

2. Our forecasts apply to a “balanced” model that has a similar profile of abilities as a human. In practice, it will likely be easier and cheaper to build an “unbalanced” model that is superhuman in some domains and subhuman in others, that is nonetheless transformative.

3. The curves for several factors assume some maximum after which progress is not possible; in reality it is more likely that progress slows to some lower but non-zero growth rate.

In the near future, it seems likely that it would be harder to find cheaper routes (since there is less time to do the research), so we should probably assume that the probabilities are overestimates, and for similar reasons for later years the probabilities should be treated as underestimates.

For the median of 2052, the author guesses that these considerations roughly cancel out, and so rounds the median for development of TAI to **2050**. A sensitivity analysis concludes that 2040 is the “most aggressive plausible median”, while the “most conservative plausible median” is 2080.

Planned opinion:

I really liked this report: it’s extremely thorough and anticipates and responds to a large number of potential reactions. I’ve made my own timelines estimate using the provided spreadsheet, and have adopted the resulting graph (with a few modifications) as my TAI timeline (which ends up with a median of ~2055). This is saying quite a lot: it’s pretty rare that a quantitative model is compelling enough that I’m inclined to only slightly edit its output, as opposed to simply using the quantitative model to inform my intuitions.

Here are the main ways in which my model is different from the one in the report:

1. Ignoring the genome anchor

I ignore the genome anchor because I don’t buy the model: even if researchers did create a very parameter-efficient model class (which seems unlikely), I would not expect the same scaling laws to apply to that model class. The report mentions that you could also interpret the genome anchor as simply providing a constraint on how many data points are needed to train long-horizon behaviors (since that’s what evolution was optimizing), but I prefer to take this as (fairly weak) evidence that informs what weights to place on short vs. medium vs. long horizons for neural nets.

2. Placing more weight on short and medium horizons relative to long horizons

I place 30% on short horizons, 40% on medium horizons, and 10% on long horizons. The report already names several reasons why we might expect the long horizon assumption to be too conservative. I agree with all of those, and have one more of my own:

If meta-learning turns out to require a huge amount of compute, we can instead directly train on some transformative task with a lower horizon. Even some of the hardest tasks like scientific R&D shouldn’t have a huge horizon: even if we assume that it takes human scientists a year to produce the equivalent of a single data point, at 40 hours a week that comes out to a horizon of 2000 subjective hours, or 7e6 seconds. This is near the beginning of the long horizon realm of 1e6-1e9 seconds and seems like a very conservative overestimate to me.

(Note that in practice I’d guess we will train something like a meta-learner, because I suspect the skill of meta-learning will not require such large average effective horizons.)

3. Reduced willingness to spend

My willingness to spend forecasts are somewhat lower: the predictions and reasoning in this report feel closer to upper bounds on how much people might spend rather than predictions of how much they will spend. Assuming we reduce the ratio of all-in project costs to final training run costs to 10x, spending $1B on a training run by 2025 would imply all-in project costs of $10B, which is ~40% of Google’s yearly R&D budget of $26B, or 10% of the budget for a 4-year project. Possibly this wouldn’t be classified as R&D, but it would also be _2% of all expenditures over 4 years_. This feels remarkably high to me for something that’s supposed to happen within 5 years; while I wouldn’t rule it out, it wouldn’t be my median prediction.

4. Accounting for challenges

While the report does talk about challenges in e.g. getting the right data and environments by the right time, I think there are a bunch of other challenges as well: for example, you need to ensure that your model is aligned, robust, and reliable (at least if you want to deploy it and get economic value from it). I do expect that these challenges will be easier than they are today, partly because more research will have been done and partly because the models themselves will be more capable.

Another example of a challenge would be PR concerns: it seems very plausible to me that there will be a backlash against transformative AI systems, that results in those systems being deployed later than we’d expect them to be according to this model.

To be more concrete, if we ignore points 1-3 and assume this is my only disagreement, then for the median of 2052, rather than assuming that reasons for optimism and pessimism approximately cancel out to yield 2050 as the median for TAI, I’d be inclined to shade upwards to 2055 or 2060 as my median for TAI.

Comparing Utilities

Planned summary for the Alignment Newsletter:

This is a reference post about preference aggregation across multiple individually rational agents (in the sense that they have VNM-style utility functions), that explains the following points (among others):

1. The concept of “utility” in ethics is somewhat overloaded. The “utility” in hedonic utilitarianism is very different from the VNM concept of utility. The concept of “utility” in preference utilitarianism is pretty similar to the VNM concept of utility.

2. Utilities are not directly comparable, because affine transformations of utility functions represent exactly the same set of preferences. Without any additional information, concepts like “utility monster” are type errors.

3. However, our goal is not to compare utilities, it is to aggregate people’s preferences. We can instead impose constraints on the aggregation procedure.

4. If we require that the aggregation procedure produces a Pareto-optimal outcome, then Harsanyi’s utilitarianism theorem says that our aggregation procedure can be viewed as maximizing some linear combination of the utility functions.

5. We usually want to incorporate some notion of fairness. Different specific assumptions lead to different results, including variance normalization, Nash bargaining, and Kalai-Smorodinsky.

The "Backchaining to Local Search" Technique in AI Alignment

Planned summary for the Alignment Newsletter:

This post explains a technique to use in AI alignment, that the author dubs “backchaining to local search” (where local search refers to techniques like gradient descent and evolutionary algorithms). The key idea is to take some proposed problem with AI systems, and figure out mechanistically how that problem could arise when running a local search algorithm. This can help provide information about whether we should expect the problem to arise in practice.

Planned opinion:

I’m a big fan of this technique: it has helped me expose many initially confused concepts, and notice that they were confused, particularly wireheading and inner alignment. It’s an instance of the more general technique (that I also like) of taking an abstract argument and making it more concrete and realistic, which often reveals aspects of the argument that you wouldn’t have previously noticed.

Load More