Quantitative cruxes in Alignment

Martín Soto

Summary: Sometimes researchers talk past each other because their intuitions disagree on underlying quantitative variables. Getting better estimates of these variables is action-guiding. This is an epistemological study to bring them to the surface, as well as discover which sources of evidence can inform our predictions, and which methodologies researchers can and do use to deal with them.

This post covers some discussion of quantitative cruxes. The second one covers sources of evidence (a topic interesting on its own). The whole write-up can also be read in this document. I also did a short presentation of it. And it might be expanded with a benchmark of thought experiments in alignment.

Work done during my last two weeks of SERI MATS 3.1. Thanks to Vivek Hebbar, Filip Sondej, Lawrence Chan and Joe Benton for related discussions.

Many qualitative arguments motivate worries about AI. Most of them argue briefly and easily for the possibility (rather than the necessity) of AI catastrophe. Indeed, that’s how I usually present the case to my acquaintances: most people aren’t too confident about how the future will go, but if some yet unobserved variables turn out to not line up well enough, we’ll be screwed. A qualitative argument for a non-negligible chance of catastrophe is enough to motivate action.

But mere possibilities aren’t enough to efficiently steer the future into safety: we need probabilities. And these are informed by quantitative estimates related to training dynamics, generalization capacity, biases of training data, efficiency of learned circuits, and other properties of parameter space. For example, a higher credence that training runs on task A with more than X parameters have a high chance of catastrophe would make us more willing to risk shooting for a pivotal act. Or, a lower credence on that would advise continuing to scale up current efforts. Marginal additional clarity on these predictions is action-relevant.

Indeed, the community has developed strong qualitative arguments for why dangerous dynamics might appear in big or capable enough systems (the limit of arbitrary intelligence). But it’s way harder to quantitatively assess when, and in what more concrete shape, they could actually arise, and this dimension is sometimes ignored, leading to communication failures. This is especially notable when researchers talk past each other because of different quantitative intuitions, without making explicit their sources of evidence or methodology to arrive at conclusions (although luckily some lengthier conversations have already been trying to partially rectify this).

Addressing these failures explicitly is especially valuable when doing pre-paradigmatic science action-relevant for the near future, and can help build common knowledge and inform strategies (for example, in compute governance). That’s why I’m excited to bring these cruxes to the surface and study how we do and should address them. It’s useful to keep a record of where our map is less accurate.

Methodology

We start with some general and vague action-relevant questions, for example:

A. How easy is it to learn a deceptive agent in a training run?

B. How safe is it to use AIs for alignment research?

C. How much alignment failure is necessary to destabilize society?

(Of course, these questions are not independent.)

We then try and concretize these vague questions into more quantitative cruxes that would help answer them. For example, a concrete quantity we can ask about A is

A1. How small is the minimal description of a strong consequentialist?

(Of course, this question is not yet concrete enough, lacking disambiguations for “description” and “strong consequentialist”, and more context-setting.)

The ultimate gold standard for such concrete quantities is “we could, in some ideal situation, literally set up an empirical experiment settling this quantity”. Of course, the experiment required to completely settle A1 (running all programs of increasing description length until finding a deceptive agent) could never be realistically performed (because it is both infeasible and dangerous). But nonetheless this ensures we are asking a question with an actual objective answer grounded in reality.

Determining one such quantity (to some error) is highly related to (or equivalent to) providing qualitative (yes / no) answers to some parameterized thought experiments. For example, for A1 we can build for any natural n, any task X and any cognitive mechanism M, the following thought experiments:

A1.n. Can a neural net of 10^n parameters encode a strong consequentialist?

A1.n.X. Can a neural net of 10^n parameters completely solve task X?

A1.n.M. Can a neural net of 10^n parameters encode cognitive mechanism M?

A1.n.X.M. Is cognitive mechanism M required to solve task X (or to do so with less than 10^n parameters)?

(Some thought experiments are trivial reformulations of quantitative cruxes. In other instances, different thought experiments can pull harder on our intuitions, facilitate thinking about concrete mechanisms, or take advantage of available empirical evidence.)
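The link between a quantity like A1 and the yes / no family A1.n can be sketched as a search. Assuming answers are monotone in n (a net of 10^(n+1) parameters can do whatever a net of 10^n can), a handful of yes / no judgments brackets the quantity. The oracle below is a hypothetical stand-in for a researcher's judgment or an idealized experiment, and its cutoff is an arbitrary placeholder, not an estimate.

```python
from typing import Callable, Optional

def smallest_yes(answer: Callable[[int], bool], lo: int, hi: int) -> Optional[int]:
    """Smallest n in [lo, hi] with answer(n) True, assuming answers are monotone in n.

    Returns None if even answer(hi) is False.
    """
    if not answer(hi):
        return None
    while lo < hi:
        mid = (lo + hi) // 2
        if answer(mid):
            hi = mid        # a "yes" at mid: the threshold is at mid or below
        else:
            lo = mid + 1    # a "no" at mid: the threshold is above mid
    return lo

# Hypothetical oracle for "Can a net of 10^n parameters encode a strong consequentialist?".
# The cutoff of 13 is made up for illustration.
def toy_oracle(n: int) -> bool:
    return n >= 13

asked = []  # record which thought experiments were actually consulted

def logged_oracle(n: int) -> bool:
    asked.append(n)
    return toy_oracle(n)

estimate = smallest_yes(logged_oracle, lo=6, hi=20)
print(estimate, "found with", len(asked), "yes/no questions")
```

In practice the answers are credences rather than clean booleans, so this only shows why a family of thought experiments pins down a quantity "to some error".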

We can further notice these quantities are informed by some general background parameters about how reality works. For example, A1 could be affected by the parameter

P1. Which mind sizes suffice for non-trivial self-reflection?

Notice that these parameters need not be as concrete as the quantitative questions. Their defining trait is being deeper inside our world-model, or informing many quantities.

Finally, considering the cruxes, the parameters and the theoretical structure in which we think they are embedded (as causal relations or correlations), we can estimate them through different sources of evidence.

Below I present a first-pass isolation and analysis of some relevant quantitative cruxes, background parameters and sources of evidence. Needless to say, the lists are not exhaustive (I encourage readers to add items that seem relevant), and mainly intended to showcase the methodology and discuss some especially obvious uncertainties. Also, the below lists are biased towards cruxes related to training dynamics. This is because they seem, at the same time, (a) especially action-relevant, (b) still highly quantitatively uncertain (due to obvious evidence bottlenecks) and (c) fundamentally technical in nature (isolated from even messier and harder to control parts of reality like social dynamics).

Quantitative cruxes

As stated above, I present general questions (A, B, C…), isolate relevant related quantitative cruxes (A1, A2, A3…), and also rephrase and concretize these in terms of thought experiments (A1.a, A1.b, A1.c…).

It is recurrent below that the quantitative questions only make sense given some qualitative beliefs about possibility. Nonetheless, most of these beliefs are either already regarded as very likely in the community, or necessary for AI posing an existential risk (so that we are comfortable hedging on them being the case), or both.

In all the questions below, a probability distribution would be an even more informative answer than a single mean quantity. Also, of course, some of the questions below are still highly uncertain, and saying “I think this will be on the low end” instead of providing a concrete number is sometimes valid or even humble: after all, making those qualitative assessments explicit is also a very rough quantitative piece of information.
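As a minimal illustration of why a distribution says more than a mean, here are two hypothetical credence distributions over the exponent n (where the answer to A1 would be 10^n parameters). The numbers are placeholders chosen so that both share the same mean while disagreeing sharply on a tail question.

```python
# Two hypothetical credence distributions over n, where the answer to A1 is 10^n parameters.
# The numbers are placeholders, chosen only so that both have the same mean.
confident = {12: 1.0}
spread = {10: 0.5, 14: 0.5}

def mean(dist):
    """Expected value of the exponent under the given credences."""
    return sum(n * p for n, p in dist.items())

def prob_at_most(dist, n_max):
    """Credence that the true exponent is n_max or smaller."""
    return sum(p for n, p in dist.items() if n <= n_max)

for name, dist in [("confident", confident), ("spread", spread)]:
    print(name, "mean:", mean(dist), "P(n <= 11):", prob_at_most(dist, 11))
```

Both report a mean of 12, but only the second assigns any credence to nets of 10^11 parameters or fewer sufficing, which is the kind of difference that changes decisions.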

A. How easy is it to learn a deceptive agent in a training run?

This refers to the general failure mode conveyed in Risks from learned optimization, How likely is deceptive alignment, and Without specific countermeasures, the easiest path to transformative AI likely leads to AI takeover.

The difference between full-blown deceptive alignment and nearby failures like goal misgeneralization remains vague. Relevantly, the distinction between explicit / human-legible / well-known to the whole system deception and other kinds of indirect / alien / mediated / self-concealed optimization channels seems inherently fuzzy. Nonetheless, Hubinger’s working definition is enough to motivate a cluster of failure modes: “the reason the model looks aligned during training is because it is actively trying to do so for instrumental reasons” (or in Ajeya’s words, “playing the training game”). And most importantly, by focusing solely on outward behavior (both during and after training), we can completely factually specify some quantitative cruxes.

A1. How short (number of parameters) is the minimal description length of a general consequentialist (a neural net correctly executing all of a list of tasks in a set not-absurdly-long computation)?

My proposed list (which can be varied for different definitions of general consequentialist) is (not ordered by difficulty):

Pass a 1 hour adversarial Turing Test.
Achieve a 2400 Elo chess rating.
Starting with $100 and access to the Internet, set up an artificial satellite.
Discover new completely original and publishable additions and revisions to:
- a subfield of physics
- a subfield of mathematics
- Machine Learning
Predict next-week stock markets with accuracy better than that of any current human aggregate organization.

Of course, an embedded consequentialist might not execute these tasks because they don’t further its goals (rather than because of a capabilities failure). For our purposes, we can just assume we have a way (not increasing the required net size noticeably) to incentivize the net to pursue a goal downstream of these tasks (in a certain context we set up).

My proposed not-absurdly-long amount of computation (which can also be varied, and must be given in forward passes instead of FLOP to account for the varying size of the net) is:

5B forward passes (approximately 2.6 years at 60 passes per second).

Clarification: We are okay with an aggregate system including the net to execute these tasks (such as the net with Chain of Thought, Auto-GPT, or more complex scaffolding or repurposing of the net’s weights), as long as the net is the only part that is not algorithmically trivial, and the non-net parts of the aggregate are also assigned a sensible computational cap. We also accept the aggregate providing access to a general knowledge data-base, like the Internet (as long as it doesn’t contain pre-made explicit answers for the tasks). This is the most likely architecture if our net is a token-predictor, but notice we are not requiring that to be the case, and the net can produce other kinds of outputs (like for example actions for motor actuators).

Qualitative caveat: Of course, this question only makes sense if we agree there is no fundamental impossibility to a digitally implemented neural net executing these tasks. Notice this could not only fail because of philosophical worries about the importance of a mind’s substrate or shortcomings of the current connectionism paradigm (which seem to me ill-motivated and against empirical evidence, and are indeed generally disregarded in the alignment community). This could also fail because of more mundane reasons related to quantitative tradeoffs in the sizes of minds. For example, it is conceivable that increased sizes in any physically embedded cognitive system yield marginal returns, due to increasingly complicated information exchange between different parts of the system.

Of course, and as in Ajeya’s Biological Anchors, we know this is not the case for the above list because human brains exist, given we agree a human brain could be emulated in a neural net. But for lists of way more complex embedded tasks, it is conceivable the above trends could make existence fail. In fact, we can trivially devise tasks impossible for any embedded mind by diagonalization, but it’s highly uncertain whether any actually natural and useful task could be impossible for any embedded mind.

Actually, it would seem like we can always just obtain the degenerate solution of “a Giant Look-Up Table with all its parameters explicitly encoding (as if-then statements) all necessary actions to complete these natural tasks, and that only needs one forward pass to go through the table and complete them”. This is a completely valid answer: what we care about is how compact the consequentialist can get, and this would just mean “not compact at all” (although finitely compact). But as hinted above, when considering it as an embedded system, maybe not even the GLUT could execute some tasks (although again, it seems like this is clearly not the case for all tasks we care about).

More generally, the possibility of unworkable tradeoffs and limitations in pragmatic implementation must also be examined as part of answering A1 (and we could say non-existence corresponds to the quantitative value “infinite parameters”).

Other than analogizing to humans or trying to extrapolate from other empirical ML data, a natural but very speculative approach to this question is thinking about which kinds of cognitive mechanisms seem necessary to solve these tasks (or their subtasks) in natural and simple ways, paying special attention to the parts shared by all tasks (allegedly the most general ones), and guesstimating how many parameters we need to implement each of them (although allegedly this is already biased towards unnecessary modularity). Still in the human-comparison frame, we can ask whether the human brain contains a lot of additional information overfit to our environment, or if indeed the only way to obtain such general procedures is emerging through zillions of chaotically interacting parameters (these two possibilities usually come up as distinct qualitative stances underlying very different world-views).

Similarly to how cross-entropy seems more natural than Kolmogorov-complexity, we could ask for the general weighted prevalence (or general distribution) across different net sizes of algorithms encoding deceptive agents, instead of only the shortest description length. This captures more of the actual pervasiveness and canonicity of these algorithms, although knowing the minimal description length (or other more detailed accounts like how the distribution is skewed) might already capture most of the useful bits of information.

Of course, ultimately we care about the algorithms training can learn, not any existing algorithm whatsoever. That’s why the above needs to be supplemented by the following question.

A2. Given the minimal description of a general consequentialist (the answer to A1), how many additional parameters (expressed either in constant amount, in percentage, or in a more complex algebraic relation) do we need for a not-absurdly-long training run (against human-rewarded and automatically-rewarded subparts of the tasks of A1) to produce a general consequentialist (a model executing the tasks in A1)?

Here and below, my proposed not-absurdly-long length of training (which can also be varied, and must also be given in training steps instead of FLOP to account for the varying size of the net) is:

100T training steps (roughly 5·10^4 years at 60 per second, that is, 2·10^4 times the on-deployment forward passes in A1).
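Converting these budgets into wall-clock time at 60 per second is plain arithmetic. A quick sketch, assuming a 365.25-day year (the function name and the year length are my own choices):

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600  # Julian year, an assumption of this sketch

def years_at_rate(steps: float, per_second: float = 60.0) -> float:
    """Wall-clock years needed to run `steps` passes (or training steps) at a fixed rate."""
    return steps / per_second / SECONDS_PER_YEAR

deployment_passes = 5e9    # the forward-pass budget proposed under A1
training_steps = 100e12    # the training-step budget proposed here

print(f"deployment: {years_at_rate(deployment_passes):.1f} years")
print(f"training:   {years_at_rate(training_steps):.0f} years")
print(f"ratio:      {training_steps / deployment_passes:.0f}")
```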

Indeed, it’s not clear how much action-relevant sense we can make out of A1 without a confident answer to A2. As an extreme example, it is conceivable (although seems unlikely) for the near-minimal algorithms to be chaotic, tangled and hashed, completely failing if any parameter is slightly altered, so that SGD can never approach them in the loss landscape, and for all findable algorithms to be relevantly bigger (thanks to Lawrence Chan for this point). Indeed, this might not be the most useful factorization of the problem (see A3 for more on this). More realistically, we don’t know how much “additional space” SGD’s dynamics need to find an algorithm with non-negligible probability.

The phenomenon of Deep Double Descent seems to point at this additional space being important. We can also directly test this quantitatively on small cases in which we know the minimal algorithm. Or we can even do this with relatively compact but non-minimal human-devised algorithms, or come up with more clever schemes (like coming up with an arbitrary but somehow structured algorithm, and training a different net on the task of approximating it). The main bottleneck for all these sources of evidence is the usual worry that we will see big phase changes between these ML systems and actually dangerous future systems (although even the prevalence and variance of phase changes can be studied in the current regime, providing further evidence). Even without experimenting, a deeper study of the Deep Learning literature would probably already provide better insights into A2.
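A toy version of the "small cases in which we know the minimal algorithm" experiment can be sketched as follows. For XOR with a single hidden layer of tanh units, one hidden unit provably cannot represent the function and two can, so the minimal width is 2. The sketch measures how often plain gradient descent finds a solution at each width. The architecture, learning rate, step count and seeds are all my own arbitrary choices, and nothing here should be read as evidence about large systems.

```python
import math
import random

XOR = [((0.0, 0.0), 0.0), ((0.0, 1.0), 1.0), ((1.0, 0.0), 1.0), ((1.0, 1.0), 0.0)]

def sigmoid(z):
    z = max(-60.0, min(60.0, z))  # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-z))

def train_xor(width, seed, steps=1000, lr=0.5):
    """Train a 2-width-1 tanh/sigmoid net on XOR by full-batch gradient descent.

    Returns True if all four points end up on the correct side of 0.5.
    """
    rng = random.Random(seed)
    w1 = [[rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)] for _ in range(width)]
    b1 = [0.0] * width
    w2 = [rng.uniform(-1.0, 1.0) for _ in range(width)]
    b2 = 0.0

    def forward(x):
        h = [math.tanh(w1[j][0] * x[0] + w1[j][1] * x[1] + b1[j]) for j in range(width)]
        return h, sigmoid(sum(w2[j] * h[j] for j in range(width)) + b2)

    for _ in range(steps):
        g_w1 = [[0.0, 0.0] for _ in range(width)]
        g_b1 = [0.0] * width
        g_w2 = [0.0] * width
        g_b2 = 0.0
        for x, y in XOR:
            h, p = forward(x)
            dz = p - y  # gradient of the cross-entropy loss w.r.t. the output logit
            g_b2 += dz
            for j in range(width):
                g_w2[j] += dz * h[j]
                dh = dz * w2[j] * (1.0 - h[j] ** 2)  # backprop through tanh
                g_w1[j][0] += dh * x[0]
                g_w1[j][1] += dh * x[1]
                g_b1[j] += dh
        for j in range(width):
            w1[j][0] -= lr * g_w1[j][0]
            w1[j][1] -= lr * g_w1[j][1]
            b1[j] -= lr * g_b1[j]
            w2[j] -= lr * g_w2[j]
        b2 -= lr * g_b2
    return all((forward(x)[1] > 0.5) == (y > 0.5) for x, y in XOR)

def success_rate(width, seeds, steps=1000):
    """Fraction of random initializations from which training solves XOR."""
    results = [train_xor(width, seed, steps=steps) for seed in seeds]
    return sum(results) / len(results)

for width in (1, 2, 4):
    print("hidden width", width, "success rate", success_rate(width, seeds=range(3)))
```

The interesting readout is how the success rate changes between the minimal width and larger widths, which is a miniature of the "additional space" question in A2.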

Of course, we are not only asking about a concrete value of additional parameters, but a functional shape, so even deciding this is well-approximated by a constant or linear expression would be a very strong quantitative assessment. It seems very clear that the additional parameters won’t be a constant amount, given some already observed regimes. It doesn’t seem as unlikely that it would eventually saturate (above certain sizes, the additional parameters are constant). But it seems even more likely that we don’t get full saturation, and instead the additional parameters just grow logarithmically or similar. This probably also depends on how serial the construction of some circuits by SGD can become (more serial might imply we need fewer additional parameters, since the not-yet-important parameters can be used as extra space).
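To make the candidate functional shapes concrete, here is a small sketch that writes each one as a function of the minimal description size. Every constant is a hypothetical placeholder; the point is only how differently the shapes behave across orders of magnitude.

```python
import math

# Candidate shapes for "additional parameters needed by training", as a function of the
# minimal description size `minimal` (in parameters). All constants are hypothetical.
def constant(minimal, c=1e6):
    return c

def proportional(minimal, fraction=0.5):
    return fraction * minimal

def logarithmic(minimal, scale=1e6):
    return scale * math.log10(minimal)

def saturating(minimal, cap=1e9, fraction=0.5):
    # grows proportionally at first, then stays at a fixed cap
    return min(fraction * minimal, cap)

for exponent in (6, 9, 12):
    minimal = 10.0 ** exponent
    row = {f.__name__: f(minimal) for f in (constant, proportional, logarithmic, saturating)}
    print(f"10^{exponent}:", {name: f"{value:.1e}" for name, value in row.items()})
```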

We could also worry that some “in the limit” properties of SGD for arbitrary tasks / algorithms (as we could maybe find in the DL literature) aren’t representative of the actual answer to A2, since the set of tasks at hand shares some relevant simple structure (a certain core of consequentialist intelligence) that SGD will be able to find more easily. Even if this were the case, it’s not clear this wouldn’t also happen for most sets of tasks, since certain simple consequentialist principles are useful for almost all tasks. But it could indeed be the case that SGD doesn’t learn these basic principles (and instead finds more overfit answers) in the current regime of capabilities. This would seem more likely if the core principles of learned consequentialism turned out to be not that simple.

Lacking these empirical extrapolations, we again seem confined to thinking in the abstract about possible cognitive architectures (of both the minimal and bigger algorithms satisfying the task), and how easy they seem for SGD to approximate. This seems again very speculative, since we’re dealing with very complex mechanisms, which we don’t nearly know how to code right now.

In this and all questions below regarding training dynamics, we leave the exact training procedure (and architecture) unspecified. It might make sense to think of some of these in the abstract as questions about local search procedures in general. Nonetheless, for many purposes (and especially empirical evidence), it will make more sense to fill in this variable with a concrete training description, and thus pay additional care to the relevant differences in dynamics between these.

A3. How small (number of parameters) is the minimal net which, with a sensible training run (of a set length) (against human-rewarded and automatically-rewarded subparts of the tasks of A1), produces a general consequentialist (a model executing the tasks in A1)?

Of course, one way to obtain an answer to A3 is just by combining answers to A1 and A2. But as mentioned above, factoring the actual relevant question that way might still leave out some of our intuitions or direct evidence. Direct evidence or intuitions about which algorithms SGD can and can’t find are more naturally harnessed for A3, since we can think explicitly about both the concrete task / cognitive mechanisms, and the training dynamics (instead of separating them into A1 and A2). There surely also exist other ways to factor the problem.

As a different way to quantify coherence we have:

A4. How small (number of parameters) is the minimal net which, with a sensible training run (of a set length) (against human-rewarded and automatically-rewarded subparts of the tasks of A1), produces a t-AGI? Where t is a length of time.

This refers to Richard Ngo’s proposed quantification of continued coherence: “I call a system a t-AGI if, on most cognitive tasks, it beats most human experts who are given time t to perform the task”. It is motivated by expecting the current use of foundation models to continue.

A5. During a not-absurdly-long training of a 10^n parameter net for a task, a new circuit (of description length 10^m parameters), that could be learned without trashing already learned circuits, would provide a noticeable loss benefit for X% of the training points. How likely is this circuit to be learned, as a function of n, m and X?

This question tries to characterize how many “nudges” from reality are necessary for SGD to take advantage of a pattern decreasing loss. This is relevant to how easily AIs will learn and exploit situational awareness for “playing the training game”. There is an even more basic sense in which an AI will learn some “looking good to humans” patterns instead of the “actual truth” patterns, without requiring any strong situational awareness. And even this, without strong single-AI agenticness, can lead to failure modes like Paul’s “You get what you measure” (although it’s not clear such a scenario could develop so far without the appearance of very agentic and situationally aware AIs). But of course we are even more worried about AIs that can understand their situation enough to knowingly optimize for goals in the real world.

Trying to translate the above worries into a concrete quantitative question proves especially difficult. For example, a natural framing is the following:

Imagine we are training our AI in a simulation, so that reward is calculated against an automatically calculated ground truth (or human feedback is relevantly sparse). How big of a constant flux of bits from external reality (in the input or loss) is required for our AI to start implementing situationally-aware policies that take advantage of this pattern, and exploit unintended channels?

The problem is bits are not a correct metric here, because the ease with which the new mechanism is implemented depends enormously on the qualitative shape in which it is presented, and its relation to already-existing cognitive mechanisms in the AI. More abstractly put, it depends on how easy it is for the AI’s ontology to catch this pattern instead of trashing it as noise.

As an illustrative example, compare someone sending a flux of bits to you by writing them down on your screen daily, versus by raising or lowering the brightness of your visual field by 1% each day (thanks to Vivek Hebbar for this example).

We might thus settle for asking which flux of bits Solomonoff induction (allegedly the ideal reality-approximator according to scientific principles) would need to converge upon reality from inside the simulation, and treat this as a lower bound (no AI could do better than this ideal and intractable approximator). But that ignores that the AIs we build might possess a lot of human prior out of the box (as foundation models do), so that converging on reality is easier for them than for a truly impartial approximator (similarly to how pre-set assumptions can help interpret non-exhaustive human preferences).
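Solomonoff induction itself is uncomputable, but the contrast between an impartial simplicity prior and a prior that already leans towards the truth can be sketched with a toy Bayesian mixture. Everything here is an assumption of the sketch: the hypothesis class (repeating bit patterns of length at most 4), the 2^-length weighting, the size of the "informed" boost and the 0.99 threshold.

```python
from itertools import product

def primitive_patterns(max_len):
    """All bit patterns up to max_len that are not a shorter pattern repeated."""
    patterns = []
    for n in range(1, max_len + 1):
        for bits in product("01", repeat=n):
            p = "".join(bits)
            if not any(n % d == 0 and p == p[:d] * (n // d) for d in range(1, n)):
                patterns.append(p)
    return patterns

def bits_to_converge(truth, prior, threshold=0.99, horizon=64):
    """Observed bits needed until the posterior on `truth` reaches `threshold`.

    Each hypothesis predicts its pattern repeated forever, so an observation
    simply eliminates every hypothesis that predicted the other bit.
    """
    alive = dict(prior)
    for t in range(horizon + 1):
        if alive[truth] / sum(alive.values()) >= threshold:
            return t
        bit = truth[t % len(truth)]
        alive = {p: w for p, w in alive.items() if p[t % len(p)] == bit}
    return None

truth = "0110"  # hypothetical "external reality"
simplicity_prior = {p: 2.0 ** -len(p) for p in primitive_patterns(4)}

# A prior that already favors the truth, standing in for knowledge baked in before training.
informed_prior = dict(simplicity_prior)
informed_prior[truth] *= 2.0 ** 10

simple_bits = bits_to_converge(truth, simplicity_prior)
informed_bits = bits_to_converge(truth, informed_prior)
print("impartial prior:", simple_bits, "bits; informed prior:", informed_bits, "bits")
```

The informed learner can never need more observations than the impartial one in this setup, which is the sense in which an impartial ideal approximator stops being a lower bound once the learner starts with relevant priors.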

Here I have chosen to settle for the description length of the circuit taking advantage of the pattern, since to some extent that already captures how easy it is to implement (starting from a blank slate). But as a result of this “mind-dependence” of the effectiveness of nudges, this is still ignoring how easy the rest of the learned net makes it to learn this new circuit. For example, it might be the circuit can repurpose some existing mechanisms, in which case learning will be easier. I’m not aware of a satisfying way to settle this fuzziness into well-defined quantitative facts, given we don’t even have a natural definition of what repurposing existing mechanisms is. We could again use obvious description metrics like “the smallest net implementing both circuits / functionalities”, but that still misses the actual prevalence / naturality of different (non-minimal) configurations repurposing these mechanisms, the “canonicity” of the model grokking the signal. Indeed, due to its mind-dependent nature, this seems completely entangled with the messy problem of understanding abstractions and how they work.

Because of all this, this question (which varies over all possible cognitive mechanisms) could be understood as an “in the limit” general statement about SGD: is SGD biased in any direction whatsoever with regards to grokking new circuits for marginal loss decrease? For example, does SGD almost always find these circuits if they help in more than X% of data points, and the circuit is smaller than Y% of the whole net? Of course, it could well be the situation is so chaotic that there is no relevant bias in any direction whatsoever, and the answer truly depends a lot on the specifics of the circuit or the setup.

The answer probably also depends on how much “free space” there is left in the net, as in, how many parameters are already occupied by actually useful circuits. This gets at the uncertain question of whether current models are relevantly using most of their available parameters.

While it’s not clear which patterns SGD will be able to take advantage of, grokking phenomena do seem to point at this sometimes happening only after a long length of training. In this question we can also quantify training length to capture this dimension of variation (although maybe it’s ultimately not as action-relevant, since we’re not that worried about the exact moment in which these mechanisms are learnt, nor clear how to take advantage of this information).

There are lots more questions we’d like to know about to inform A (and A5 already gets much into the weeds), but let us proceed to a related topic.

B. Which cognition is required to solve technical tasks?

This is central for tracing technological path-dependence, which could be a relevant part of the structure we harness to solve real-world instances of alignment, and also directly informs whether pivotal acts are a good bet. Actually, common objections to pivotal acts are more holistic and theory-laden, contesting the whole framing as unproductive because of a messy mix of technical and social intuitions. Nonetheless, better technical assessments would still improve the state of debate here.

Some views argue broadly along the lines of “any system capable enough to help non-negligibly will already kill you”. Of course, there is not a literal direct implication from some tasks to other tasks or goals: the only way to obtain a correlation is by analyzing the distribution of possible physical cognitive systems (or those we will learn through the current paradigm, or through human science), that is, it must route through cognitive mechanisms. Of course, the proponents of these views acknowledge this, and their credences are informed by grokking some relevant structure in such distributions, made possible by empirical and theoretical evidence (more on this below).

These arguments are sometimes shortcut through considerations about the hardness of different tasks. Ultimately, though, we don’t have any objective measure of hardness, other than the distribution of which algorithms can solve which tasks, and so it seems like informative answers need to focus on cognitive mechanisms (and the ease for humans to find them). We might nonetheless believe the distribution of cognitive mechanisms is such that some differently defined, epistemic notion of hardness is highly correlated with the resolution of many tasks, so that we expect the ability to solve many tasks to cluster in mind-space. But this is already a strong conclusion, which would need to stand out clearly from all available empirical and theoretical evidence.

B1. How short (number of parameters) is the minimal description length of a net executing task X? Where X is:

any of the tasks in A1
Nobel-level additions to physics
Riemann hypothesis proof level additions to mathematics
Transformers-level additions to Machine Learning
design nanotech such that, when released in almost any environment containing molecular compound X, it replaces it with compound Y (for X and Y fixed throughout design, not variable on nanotech deployment)
design biotech with the functionalities of a bacterium but a diamondoid shell

Qualitative caveat: As discussed above, this question is only meaningful if we think there are no fundamental or quantitative obstacles to arbitrary neural nets solving these tasks. This ignores interesting questions about how far the current paradigm can scale (although we might be comfortable hedging on the worlds where the current paradigm scales far).

This tries to get at how big the difference is between, on the one hand, general-purpose planning and consequentialism applied to a set of tasks and, on the other, narrowly overfit solutions to any single task. We are especially interested in whether there exists a sweet spot in which AIs can help non-negligibly (in technical tasks for pivotal acts, or ML or alignment research) while still not unlocking other applications of a general consequentialist.

As in A2 and A3 above, we can complement this with questions about how easy it will be for local search to actually find solving algorithms.

B2. Given an average neural net (that is, randomly sampled as found by training for large enough nets) solving set of tasks X, is it easily repurposable (that is, when undergoing a short training) to solve set of tasks Y? Where X and Y are any subsets of the tasks in B1.

Here and below, my proposed short length of training (which can also be varied) is:

10% of the average training steps needed to solve X + the same for Y.

Human evolution, as the unique empirical data point available for evolved general intelligence, can be a source of vague priors here. Nonetheless, given some dissimilarities, it seems like deeper gears-level understanding of cognitive structures, especially those built by the current paradigm, is necessary to confidently interpret the available evidence.

B3. Suppose there are two algorithms that achieve 0 loss on the training task: (a) a simple and general algorithm expressible in a net with 10^n parameters, and (b) a cluster of k different shallower heuristics (or Experts of a Mixture-of-Experts) applied at once (each expressible in a net with 10^m parameters). What are the likelihoods of learning (a) and (b), as functions of n, m and k, in the case when the net is way bigger than 10^n or k·10^m?

Of course, in reality there will be a gradient of simple-to-messy algorithms solving the task, but this tries to get at how likely SGD is to converge on piles of heuristics instead of simple (but possibly sequentially long) algorithms. This is relevant for whether it’s likely to find a messy, overfit solution to our technical task, without needing to grok some more general principles of reasoning (although again, as per B1, it could be these principles are just always necessary). This is directly related to the ease of sequentially building circuits: are piles of heuristics easier to stumble upon, or is the simplicity prior so strong it thwarts this phenomenon?

Since the distinction between different heuristics need not be sharp (they can share circuits), we might prefer to ask about “a pile of heuristics easy to sequentially build with total description 10^m”. That said, it’s not even clear modular or Mixture-of-Experts algorithms will be easier to construct on average. Predictably, they will be when the modules or Experts are already useful for decreasing loss in the absence of the general algorithm.

We could also ask about the case in which the net is not way bigger than the two algorithms, and is instead close to the larger of their two sizes. This would showcase the interaction of this “how strong is the simplicity bias” phenomenon (B3) with the “how much additional space SGD needs to find an algorithm” phenomenon (A2). Similar interactions can be constructed for other subsets of the questions in this list. While they predictably complicate the interpretation of available evidence, studying them might reveal important dynamics.
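To make the shape of the B3 question concrete, here is a toy sketch under strong, purely illustrative assumptions: only the two solutions (a) and (b) exist, and training selects between them with a pure simplicity prior whose weight is exp(-beta · parameter count). The strength `beta` and the function names are hypothetical, and the model deliberately ignores how easy each solution is to build sequentially, which is exactly the effect the question asks about.

```python
import math


def stable_sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def prob_general_algorithm(n: float, m: float, k: float, beta: float = 1e-5) -> float:
    """Toy P(learn (a)) when (a) costs 10^n parameters and (b) costs k * 10^m.

    ASSUMPTION: prior weight of a solution is exp(-beta * size), and (a), (b)
    are the only candidates, so P(a) = sigmoid(beta * (size_b - size_a)).
    """
    size_a = 10 ** n
    size_b = k * 10 ** m
    return stable_sigmoid(beta * (size_b - size_a))


# Hypothetical values: one general algorithm of 10^6 parameters versus
# piles of 5, 10 or 20 heuristics of 10^5 parameters each.
for k in (5, 10, 20):
    print(k, prob_general_algorithm(n=6, m=5, k=k))
```

Under these assumptions the answer depends only on the difference k·10^m − 10^n. Any real deviation from that pattern would be evidence about the other forces at play, such as the ease of sequentially building circuits.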

B4. For two different time lengths t<t’, how much time will elapse between the training of t-AGIs and t’-AGIs?

As showcased in the comments to Richard’s post, Richard expects a constant gain of 1 OOM of coherence every 1.5 years, while Daniel Kokotajlo expects these OOM gains to accelerate over time.

This disagreement is very relevant for timeline assessment and strategic allocation of resources. It is downstream of different gears-level intuitions of how general cognitive mechanisms will work, but also depends on societal variables and feedback loops which further complicate the picture. The first class of reflections is partly isolated in A4. The second kind relates to feedback loops and societal dynamics, which are addressed below.
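The two positions on B4 can be contrasted with a small sketch. The constant-rate function follows directly from “1 OOM every 1.5 years”. The accelerating function is only one hypothetical parametrization (each successive OOM takes `shrink` times as long as the previous one) and should not be read as anyone’s actual model.

```python
import math


def ooms_between(t: float, t_prime: float) -> float:
    """Orders of magnitude separating two time horizons given in the same units."""
    return math.log10(t_prime / t)


def years_constant_rate(t: float, t_prime: float, years_per_oom: float = 1.5) -> float:
    """Elapsed years if each OOM of coherence takes the same amount of time."""
    return years_per_oom * ooms_between(t, t_prime)


def years_accelerating(
    t: float, t_prime: float, first_oom_years: float = 1.5, shrink: float = 0.8
) -> float:
    """Elapsed years if each OOM takes `shrink` times as long as the previous one.

    ASSUMPTION: geometric speed-up. For a whole number k of OOMs this is the
    sum first * (1 + shrink + ... + shrink^(k-1)); non-integer k is interpolated.
    """
    k = ooms_between(t, t_prime)
    if shrink == 1.0:
        return first_oom_years * k
    return first_oom_years * (1.0 - shrink ** k) / (1.0 - shrink)


# Going from t to t' = 1000 t is 3 OOMs.
print(years_constant_rate(1.0, 1000.0))
print(years_accelerating(1.0, 1000.0))
```

For 3 OOMs the constant-rate view gives about 4.5 years, while this particular accelerating parametrization gives about 3.66 years. The gap widens as t’/t grows, which is why the disagreement matters more for longer horizons.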

C. How perfect does alignment need to be?

We’re interested in how much slight misalignment our oversight procedures and general context can “tank”. Some of the following quantities are more tailored to prosaic “You get what you measure” slow takeoff scenarios, since those more worried about learning a full-blown superintelligent deceptive agent will estimate them at extreme values by definition.

We don’t have good metrics for alignment (nor are we conceptually clear on what they might look like), so the questions here point to behavioral assessments in simplified scenarios which can inform how hard we expect the problem to be.

C1. What percentage of a t-AGI’s cognition needs to be optimized for the intended objective (instead of a different, random objective) for it to, respectively,

not take irreversible catastrophic actions
provide useful and (almost always) reliable cognitive labor

in any of the tasks in B1? Where t is a length of time.

The two properties are inspired by the distinction between high-stakes and low-stakes alignment.

Cognition is here left as an undefined intuitive concept, since what it will more concretely mean depends on how a system’s mechanisms are chiseled by reward. In the classical utility maximizer frame, the system will always maximize the same utility in all steps, and so the question would translate to how much the utility function weighs the intended goal vs other goals. In the Shard Theory frame, though, it might be more natural to talk about different cognitive mechanisms that do inform action in conflicting ways.

We could also quantify the system’s capabilities by a metric different from t-AGIs, especially one more overfit to a concrete task.

C2. What percentage of the training points in a realistic near-future training or fine-tuning for any of the tasks in B1 provide unintended incentives that we are unaware of (but would recognize as pointing the cognition towards unwanted behavior if pointed out to us)?

This question is inspired by the prototypical RLHF example:

If we train against human feedback for, say, providing true answers, what percentage of the human-provided evaluations are actually misguided, so that they provide an incentive away from straightforward truth-telling?

In other tasks, the distinction between data points might not be as clear cut, and it would also be realistic for all data points (or all in a class of data points) to contain unintended incentives only on some cognitive dimensions and not others.

This is partially a question about data set curation, and so sensitive to technical and engineering advancements into which societal questions also factor (like how much people will be worried about cleaning data sets).

C3. Suppose we took a realistic near-future training or fine-tuning set for any of the tasks in B1 (as in C2), and had one expert team scrutinize it to eliminate unintended incentives, by devoting a total amount of time corresponding to t seconds per data point. What would then be the percentage of training points with unintended incentives, as a function of t?

Similar questions can be posed about qualitatively different (more realistic) curations of data sets, but this over-simplified case tries to directly relate dangerous training points to baseline human-red-team hours.

C4. One human has devoted a length t of time to produce an output (or sequence of iterative outputs) satisfying task X, while also optimized for eliciting some danger when used or deployed. How much time t’ does a human overseer (expert in the subject matter) need to find the source of danger, as a function of t? Where X is any of the tasks in B1.

We could ask the same question for AIs of different capabilities (and different combinations of deceiver-overseer). This runs into the problem of objectively measuring relevant capabilities, but a priori we can just specify the deceiver and overseer AIs have been adversarially trained for their task to differing extents. Nonetheless, the human case remains a more intuitive (although also possibly biased) grounding.

And again as a complementary question:

C5. How many single-human-overseer hours of work will be equivalent to the methods we eventually use to inspect a single output in the near-term future?

D. Societal dynamics

While these topics concern systems even more chaotic and complex than a single AI, they are also especially well-suited for using historical data to inform estimates (although deciding how much to rely on historical extrapolation is deeply theory-laden, and already informed by all the rest of empirical and theoretical evidence).

These topics concern both how AI capabilities will evolve in society, and also how much misalignment different societal mechanisms can “tank”.

A deeper study might do better to isolate the uniquely technical aspects of feedback loops in particular from the different societal variables.

D1. Suppose an AI monitors some organizational aspects of a corporation. How much human-time-equivalent effort will go into inspecting their channels of information with the world?

D2. How much information (in, for example, natural language words) will two such differently trained (or fine-tuned) AIs monitoring two different corporations be able to efficiently and reliably exchange with each other in a way that, with high probability, humans won’t notice?

It would seem like the case of communication between n AIs can be reduced to bilateral communications without losing much prediction value, although this does ignore possible adversarial scenarios, or the possible increased uncertainty of behavior in multi-agentic scenarios.

D3. Suppose n differently trained AIs monitor a different corporation each. Suppose furthermore they face a Stag Hunt (where Stag is obtaining an X% increase in their respective objectives, and Rabbit a Y% decrease). What’s the likelihood they hunt stag, as a function of n, X and Y?

This is relevant for AI coordination against humans, and the viability of adversarial setups. This depends both on how AIs will behave game-theoretically (which depends on other messier variables like how much humans will have tried to make them cooperative), and on how much uncertainty they’ll have over other systems’ behavior (which could motivate a separate question).

D4. How many total human-researcher-hour equivalents will have been spent, 5 years from now, towards the specific goal of using AIs to produce ML research?

The contributing factors here that centrally relate to the allocation of resources mediated by humans would seem easier to extrapolate from past data. Still, technical differences could drastically affect which avenues seem most promising, and thus how resources are allocated. For example, if we qualitatively expect to continue using foundation models, the efforts spent in getting any one particular task to work might be rather low.

List of background parameters

Intuitively, we want to find nodes causally upstream of our quantitative cruxes. Since some of the cruxes and background parameters correspond to mathematical propositions (and we’re confused about logical causality), it’s more accurate to say we’re searching for correlated quantities, or just “low-level technical stuff with short descriptions that is correlated with a lot of what we care about”. Indeed, not all background parameters are created equal, and we’re interested in those that have a lot of action-relevant explanatory power, in the sense that better estimates for them would inform a lot of quantitative cruxes.

This list is limited to parameters relevant to training dynamics, given, as mentioned above, their explainability and a priori tractability. These parameters are of course not independent, and in fact present interesting inter-dependencies that can help inform forecasting.

1. High/Low path dependence

2. Quantitative parameters of the prior over circuits of a training procedure.

There’s discussion of which kinds of priors (simplicity, speed, etc.) local search, or SGD in particular, presents, and how these affect training results. Getting more concrete expressions for such a prior, and especially empirically testing its shape in different instances, could inform which kinds of models we expect to learn.

3. How serial is the construction of a circuit during training?

Are different parts of circuits built in different stages of training, by random exploration (even when they don’t already provide performance increase), or are complete performance-enhancing circuits built basically at once, gradually directed into existence by back-propagation? This seems upstream of several properties of learned cognition, and of the ease of finding certain very general or big circuits.

4. How much memorization does the current paradigm incentivize?

5. How much can shallow patterns combine into general reasoning?

This question is, as mentioned above, especially vague and underspecified, lacking a clearly defined mechanistic difference between shallow heuristics and general algorithms other than behavioral generalization. Nonetheless, on some views an estimate on this question informs many downstream opinions about cognition, tasks and dynamics. How viable (and incentivized) is it to complete some of the difficult tasks mentioned here by stacking “straightforwardly pattern-matchy” approximations that take advantage of constantly reappearing patterns, and combine to improve their generalization strengths? Reality will probably be more nuanced: it is likely that even the “shallow pile of heuristics” algorithm that achieves some difficult task will contain parts that are quite multi-purpose, search-y, or hard for a human to follow.

In fact, can even the simplest patterns combining to solve a task (in a not absurdly big net) provide almost all gains from generalization, including dangerous ones? This would correspond to the intuition that “a certain kind of generalizing reasoning is needed for the task, so no matter how you get it algorithmically it will dangerously generalize anyway”.

6. How modularizable are certain difficult tasks? Relatedly, how modular are found models solving them?

7. How much superposition do current models present?

These two questions are of course especially relevant for interpretability-enabled technical advances (both safety and capabilities).

8. How much “unused” space is there in current models?

That is, how many parameters have not been optimized for a purpose, or can be ablated without consequences on any training point? Maybe, though, the unused space lives in a more complicated dimension than “literal unused parameters”. For example, if the superposition picture is broadly correct, it could correspond to “further features that could be crammed into this neuron almost without decreasing performance on others”.
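The most literal reading of “unused space” can be operationalized as a per-parameter ablation count. The following is a minimal sketch on a hand-built toy network; the architecture, weights, data and function names are all hypothetical, and it only checks parameters one at a time (zeroing them), which says nothing about joint ablations or about the superposition-style notion of slack.

```python
import copy
from typing import List

Matrix = List[List[float]]


def forward(w1: Matrix, w2: List[float], x: List[float]) -> float:
    """Tiny one-hidden-layer ReLU net: output = w2 . relu(W1 x)."""
    hidden = [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in w1]
    return sum(v * h for v, h in zip(w2, hidden))


def count_individually_ablatable(
    w1: Matrix, w2: List[float], data: List[List[float]], tol: float = 1e-9
) -> int:
    """Count parameters that can be zeroed, one at a time, without changing
    the output on any training point by more than `tol`."""
    baseline = [forward(w1, w2, x) for x in data]

    def unchanged(a1: Matrix, a2: List[float]) -> bool:
        return all(abs(forward(a1, a2, x) - b) <= tol for x, b in zip(data, baseline))

    count = 0
    for i in range(len(w1)):
        for j in range(len(w1[i])):
            ablated = copy.deepcopy(w1)
            ablated[i][j] = 0.0
            if unchanged(ablated, w2):
                count += 1
    for k in range(len(w2)):
        ablated_out = list(w2)
        ablated_out[k] = 0.0
        if unchanged(w1, ablated_out):
            count += 1
    return count


# Hypothetical weights: the second hidden unit never fires on positive inputs.
W1 = [[1.0, 0.0], [-1.0, -1.0]]
W2 = [2.0, 5.0]
DATA = [[1.0, 2.0], [3.0, 1.0]]
print(count_individually_ablatable(W1, W2, DATA), "of 6 parameters")
```

In this toy example 4 of the 6 parameters are individually ablatable: the already-zero weight and everything attached to the dead hidden unit. A real measurement would need a tolerance on loss rather than exact outputs, and would have to deal with the combinatorial number of joint ablations.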

This write-up continues in the second post about sources of evidence.