Sources of evidence in Alignment

Martín Soto

Summary: A short epistemological study to discover which sources of evidence can inform our predictions of action-relevant quantities in alignment.

This post follows Quantitative cruxes, although reading that first is mostly not required. Work done during my last two weeks of SERI MATS 3.1.

Sources of evidence

No researcher in any field ever makes explicit all of their sources of evidence. Let alone in a field as chaotic and uncertain as ML, in which hard-earned experience and intuitions play a central role in stirring the tensor pile. And even less in a field with as many varied opinions and confusing questions as alignment. Nonetheless, even when researchers are just “grokking some deeper hard-to-transmit structure from familiar theory and evidence”, they need to get their bits of information from somewhere. Knowledge doesn’t come for free: it requires entanglement with observed parts of reality.

Getting a better picture of where we are and could be looking, brain-storming or deepening existing sources, and understanding methodology limitations (as Adam Shimi’s efforts already pursue) can dissolve confusions, speed progress forward, help us calibrate and build common knowledge.

In reality, the following sources of evidence motivating any belief are way less separable than the below text might make it seem. Nonetheless, isolating them yields more conceptual clarity and is the first step for analysis.

1. Qualitative arguments

One obvious, theoretical source, and by far the most used in this community. The central shortcoming is that their abstractions are flexibly explanatory exactly because they abstract away detail, and thus provide more information about the existence of algorithms or dynamics than about related quantities such as how prevalent they actually are in a certain space, when these dynamics actually start to appear, and with how much steering power.

Sometimes a tacit assumption might seem to be made: there are so many qualitative arguments for the appearance of these dynamics (and so few for the appearance of rectifying dynamics), that surely one of them will be present to a relevant degree, and early on enough! This seems like a sort of presumption of independence about yet unobserved structure: a priori, we have no reason to believe any one of these qualitative arguments has higher or lower quantitative impact, so we should settle on the vague prior of them all having similar effects (and so, the side with more qualitative arguments wins). While this is truly the best we can do when further evidence isn’t available, it seems like an especially fragile prior, ignoring the many possible interdependencies among some qualitative arguments (how their validity clusters across different worlds), possible correlated failures / decoupling of abstractions from reality, and systemic biases in the search for qualitative arguments. Incorporating some of these considerations is already enough to both slightly better inform our estimates, and especially better calibrate our credences and uncertainty.
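To make the fragility concrete, here is a minimal sketch of how much the conclusion "surely one of them will be present" depends on the independence presumption. The mixture model, the function names, and the numbers in the usage note are illustrative assumptions, not estimates of any real quantity.

```python
def p_any_independent(p: float, n: int) -> float:
    """P(at least one of n arguments holds) if each holds independently with probability p."""
    return 1.0 - (1.0 - p) ** n


def p_any_common_cause(p: float, n: int, rho: float) -> float:
    """Same quantity under a toy common-cause mixture (an assumed model).

    With probability rho all arguments share a single fate (all hold with
    probability p, otherwise all fail together); with probability 1 - rho
    they are independent. Each argument's marginal probability is still p.
    """
    return rho * p + (1.0 - rho) * p_any_independent(p, n)
```

With five arguments at a hypothetical 20% each, `p_any_independent(0.2, 5)` is about 0.67, while `p_any_common_cause(0.2, 5, 0.8)` is about 0.29. The count of arguments is the same in both cases; only the assumed correlation between their failures changed.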

Of course, usually qualitative arguments are informed by and supplemented with other sources of evidence that can provide a quantitative estimate, and then the presumption of independence is applied after incorporating these different sources (which is usually a considerably better situation to be in, unequivocally more grounded in reality). We can even explicitly reason qualitatively about the different relative magnitudes of some dynamics, as in How likely is deceptive alignment?.

And sometimes, in even less explicit ways, intuitive assessments of the strength of quantitative effects (or even the fundamental shape of the qualitative arguments) are already informed by grokked structure in reality coming from other sources of evidence. And of course, here talking about the actual structure, the actual evidence, will be more informative (even while acknowledging our credal relationship to this evidence might not be completely transmittable in language, given some hard-earned expert intuitions).

In any event, the above mentioned tacit assumption lurks in the background of some more superficial discussions, and sadly reality doesn’t seem to be as simple as to permit its high-confidence application.

2. Empirics and extrapolations

We can obtain information about the technical cruxes from the current paradigm. This doesn’t imply abandoning our abstractions and predictions, or ignoring the fact that according to them phase changes are still to come between current systems and dangerous systems. But empirical tests are, ultimately, the most valuable way to check our intuitions. So the valuable actions here are less blindly and agnostically listening to the flow of information, and more purposefully finding the few empirical tests that could probe some corners of our abstractions (as difficult as this might be, given our abstractions talk about far situations, and in qualitative terms). For this purpose, maybe examining the ML literature is not enough, and experimentation with goals different from those of academia is required.

Of course, extrapolations are also theory-laden, since we need to decide which contributing factors to take into account. That said, these factors can in turn be estimated, and so on, yielding fruitful intertwining of our qualitative abstractions with quantitative feedback, that can eventually be tested against reality.
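As a toy illustration of extrapolations being theory-laden, the sketch below fits two different functional forms to the same handful of points and compares what each predicts far outside the observed range. The data are synthetic numbers invented for this example, and the helper names are hypothetical.

```python
import math


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    return mean_y - b * mean_x, b


# Synthetic, made-up measurements (not real data).
xs = [0, 1, 2, 3, 4]
ys = [1.0, 1.5, 2.2, 3.4, 5.0]

# "Theory" 1: the quantity grows linearly in x.
a_lin, b_lin = fit_line(xs, ys)
# "Theory" 2: the quantity grows exponentially, so log(y) is linear in x.
a_log, b_log = fit_line(xs, [math.log(y) for y in ys])


def predict_linear(x):
    return a_lin + b_lin * x


def predict_exponential(x):
    return math.exp(a_log + b_log * x)
```

Both fits stay close to each other inside the observed range, yet at `x = 10` the exponential extrapolation is several times larger than the linear one. The choice of functional form, which is a theoretical commitment, does most of the work.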

Work that provides more data (example) can be distinguished from work that interprets it (example).

3. Mathematical results about local search in general

The motivating example for this class is mathematical theorems about the behavior of SGD in the limit, or approximations / conjectures of them (example: infinite-width limits of neural networks, and how they help understand inductive biases and training dynamics). That is, we think about the local search process in general and in the abstract, averaged over all possible tasks and models. Many times we get conceptual assessments instead of fully formal mathematical results, closer to qualitative arguments. Here deep familiarity with the Deep Learning literature would seem positive.

4. Mentally approximating empirical evidence

As mentioned above and by many, the alignment field seems mainly bottlenecked on not being able to test the dynamics we most worry about, or iterate experimentally on robust metrics. We don’t have the privilege of waiting for more concrete evidence, and the whole field is trying to get ahead of the curve and place the bandaid before the wound.

It’s obviously bad for a field not to have a reliable flux of empirical evidence on its most central variables. It’s also likely that our mental and social tools for doing science are too tailored to exploiting such a flux.

Thus, some researchers seem to try and cheaply approximate some aspects of empirical evidence, and find alternative testing grounds for their intuitions, which mostly amount to higher-resolution concretized reasoning and prediction. That is, making explicit a lot of the complex structure so that we only need to call on simpler, and thus apparently more trustworthy, intuitions (in a speculative-but-concrete way which might not be widespread in other scientific areas).

The motivating example for this class is Nate’s explanation of obtaining confidence on a certain property of the distribution of cognitive mechanisms:

“like, we could imagine playing a game where i propose a way that it [the AI] diverges [from POUDA-avoidance] in deployment, and you counter by asserting that there's a situation in the training data where it had to have gotten whacked if it was that stupid, and i counter either by a more-sophisticated deployment-divergence or by naming either a shallower or a factually non-[Alice]like thing that it could have learned instead such that the divergence still occurs, and we go back and forth. and i win if you're forced into exotic and unlikely training data, and you win if i'm either forced into saying that it learned unnatural concepts, or if my divergences are pushed so far out that you can fit in a pivotal act before then.
[...]
and, like, it's a pretty tricky game to play b/c it's all made-up bullshit and it's hard to agree on who strained credulity more, but there's some sort of idealized game here where it sounds to me like we each expect we'd win if we played it ...
So the place that my brain reports it gets its own confidence from, is from having done exercises that amount to self-play in the game I mentioned in a thread a little while back, which gives me a variety of intuitions about the rows in your table (where I'm like "doing science well requires CIS-ish stuff" and "the sort of corrigibility you learn in training doesn't generalize how we want, b/c of the interactions w/ the CIS-ish stuff")
(that plus the way that people who hope the game goes the other way, seem to generally be arguing not from the ability to exhibit playthroughs that go some other way, but instead be arguing from ignorance / "we just don't know")”

It’s refreshing to read an explicit account of such mental processes upstream of confidence. This example amounts to more finely predicting the macroscopic dynamics of a training run. It is very close to the builder-breaker methodology Paul Christiano has made explicit many times. ARC more explicitly yield the (almost) worst-case assumption to conservatively ignore some quantitative questions and just consider existence: “could we learn any dangerous algorithm here?” (Although again quantitative intuitions might inform which algorithms would seem to exist, since we can’t make them completely explicit.) But Nate can also be understood above as dissolving quantitative assessments into questions about the existence of different cognitive mechanisms: “is it true that (almost) all (not-absurdly-large) algorithms solving this task need to perform some cognitive move of this kind?”

Of course, it’s not really clear how much resolution we can get away with in these predictions, nor which biases might be steering our simulation away from reality, nor which already theory-laden interpretations of evidence we might be baking in, so that it’s unclear how much confidence these assessments permit. And especially it will be a function of how formal and concrete we were able to get the builder-breaker procedure to be. After all, when lacking more empirics, “made-up bullshit” will always be the first step towards well-calibrated estimates, although different researchers disagree on which steps in this path deserve which credences, based mainly on differing confidence on their cognitive tools.

While source 3 above engages with local search in the abstract (and grounds in the mathematical robustness of limit behaviors), this source strives for as much concreteness as is possible (and grounds in the correct assessment of small and simple interactions, or straightforwardly predictable behaviors of concrete algorithms).

There are other ways to mentally approximate empirical evidence than macroscopically simulating a training run. We could try something more forecast-y, and actually assign probabilities (or distributions) to the small-scale phenomena and quantities that crop up in our simulation, like different cognitive structures being built. We can also assess the difficulty of a task (in some epistemically privileged metric other than “which cognitive structures execute it correctly”) and consider the consequences for SGD.

5. Vague priors or heuristics extracted from other domains

For example, quantitative parameters of agentic mechanisms in biology or social sciences providing weak evidence for similar phenomena in neural net architecture (while trying to update according to the big qualitative differences between the settings).

Importantly, this is distinct from “getting such a good understanding of biological examples of agency that we better understand cognitive mechanisms in general”. Although that routes through other domains, it ultimately will be expressible and motivated in purely domain-agnostic terms, without requiring biological evidence. That said, these different kinds of evidence are usually not completely separable.

It’s been especially contested what the explicit role of evolutionary analogies should be. It would seem productive to more clearly delimit whether they are used as mere explainers or intuition pumps (pointing towards deeper truths about cognitive mechanisms that actually need to be motivated and defended on different grounds), or whether researchers are relying on them as vague predictors as per source of evidence 5.

6. Common sense and value judgments

Given the deeply philosophical nature of some alignment questions and realizations, and how we need to think about intelligence itself and values themselves to solve apparently technical tasks, it seems like we also rely on an expanding set of deconfusions and “correct definitions” of what we are trying to achieve, and of which sanity-preserving assumptions we are epistemically and strategically allowed to make.

I’m thinking here of Demski’s illuminating clarifications on what it even means to extrapolate our values, or Armstrong’s no indescribable hellscape assumption.

Stretching this class, we might even want it to include general deconfusions about our relationship to reality and values (Tomasik on dualism, Alexander on moral extrapolation), which give us a clearer sense of which decisions will need to be taken in the future (“deciding the CEV procedure is a value judgement”), and which ethical sweetspots are and aren’t possible (contra, for example, Nate on deferring ethics to future humans).

Probably decisions about which decision theory or anthropic reasoning are correct fall into this value-laden category (seeing, for example, the impossibility of neutrally comparing decision theories, although humans kind of informally do that so something weird is up). For example, even if Solomonoff induction concluded that our reality is being simulated by some maximally simple beings, we might rather, for some aesthetic and ethical reasons, reconsider the virtues of Solomonoff induction than bend to the random wills of these beings.

Limitations of the framework and presentation

(The following refers to the whole write-up, not only this second post.)

About the extent and detail of this analysis:

It is clear a more in-depth look is necessary. Not only are there innumerably many more cruxes (which might help notice exploitable interdependencies), and even important background parameters I will have missed, but also the sources of evidence are an especially important list to try to expand and deepen. On that note, probably considerably more effort is needed for this framework to have a chance of providing interesting new low-hanging but yet unconsidered possibilities for prediction and research (instead of just distilling and presenting the state of some communal directions, and pushing towards clearer understanding and discourse, as is done in this write-up).

With a more complete mapping of the uncertainty landscape, we could even build a more nuts-and-bolts model of some of the threat-models downstream of these questions to quantitatively assess risks brought about by different proposals and strategies.

Also, the above lists are a first pass of “abstractly, what we’d like to know”. But we’re still lacking a cost-benefit analysis of what directions to pursue seem more tractable or will predictably provide high-quality evidence. It might be some of these questions are not worth prioritizing.

As mentioned, the above lists are deliberately biased towards training dynamics. They also seem slightly MIRI-centric, especially because of the threat models studied and framings used. This is also deliberate, because MIRI world-views and interpretations were sometimes the most confusing as to their claims, predictions and sources of evidence, so that I wanted to isolate their cruxes as a first-pass analysis of their mental moves. Nonetheless, this will have curtailed the explainability of this analysis in some ways, and applying it to other world-views would provide different and fruitful cruxes and factorizations. Similarly, the questions here relate to basic deconfusion about intelligence and the future most optimized for existential risk reduction, but they ignore many factors especially relevant to s-risk reduction, a different ethical urgency that sometimes poses relevantly different questions and calls for different methods.

About the general framework and the direction of analysis:

By over-analyzing isolated idealized phenomena, this analysis might be missing the bigger picture of how these cruxes relate to actual observed reality, or how humans go about exploring them. Certainly, as stated above, some of the messy interactions have been cast away to closely inspect even the most basic phenomena we’d like to know more about (the “first-order effects”), which are already messy enough. Nonetheless, this choice is deliberate, and this simplification and isolation seems like a natural first step.

On that note, maybe even these isolated cruxes are too messy to obtain relevant information on, and we should instead be focusing on more context-setting and going for more concrete (but less generally informative) cruxes. This is a real worry, and might be moving us away from ML and DL knowledge and techniques we could tractably use. Nonetheless, it seems useful to keep our eyes on big questions which ultimately drive our interests and decisions, as a middle ground between overly-concrete ML work, and overly-abstract threat models.

Epistemically, it could be the methods we have for arriving at conclusions are yet so varied and preliminary that methodological analysis won’t showcase any relevant structure. Maybe everyone is so confused, or conflicting methods so entangled, that such an analysis won’t be fruitful. My current opinion is that, on the contrary, exactly because of the pre-paradigmatic nature of the field, explicitation of mechanisms can be most impactful now and help build foundations. That said, there certainly is some truth to the intuition that “sometimes someone just found this interesting phenomenon and wrote a post, by using standard intuitions and standard tools, and without much relation to the underlying interpersonal scientific mechanism”.

In the other direction, some researchers might argue probabilities and quantitative assessments are not the right frame to look at any of this, and instead the monolithic core of the problem, and all assessments necessary for trying to make progress on it, rely only on some deep patterns they’ve grokked. I remain unconvinced of this (although, ironically, I put some small probability on them being right), given how even apparently obvious things can turn out surprising and deserve a probabilistic analysis, and the relevant claims seem far from being an obvious thing (even if the presence of a certain dynamic in the limit is).

On the action-relevance of isolating cruxes and getting better estimates, I’m not sure many big strategic decisions (especially those in governance) change relevantly if our probabilities for some threat scenarios vary inside the 20%-80% range (this take due to Richard Ngo in some tweet I can’t find, as part of discussion on Knightian uncertainty). That said, I do think they would inform many lower-level decisions about research prioritization, or even elicit different understandings or interpretations of the deeper nature of some phenomena. I also think bringing them to the surface is enough to clarify some positions and discussions.
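A toy expected-loss calculation can illustrate why some decisions are insensitive to the probability estimate while others are not. The payoff numbers and the `best_action` helper below are invented for illustration and do not model any real strategic decision.

```python
def best_action(p_threat: float, cost_of_mitigation: float, loss_if_unmitigated: float) -> str:
    """Pick the action with the lower expected loss in a two-action toy model.

    Mitigating always costs `cost_of_mitigation`; ignoring costs
    `loss_if_unmitigated` with probability `p_threat` and nothing otherwise.
    """
    expected_loss_if_ignored = p_threat * loss_if_unmitigated
    return "mitigate" if expected_loss_if_ignored > cost_of_mitigation else "ignore"
```

When the stakes are lopsided (say a cost of 1 against a loss of 100), the threshold probability is 1%, so every estimate between 20% and 80% recommends the same action. When costs and losses are comparable (a cost of 50 against a loss of 100), the threshold sits at 50% and the same range of estimates flips the decision, which is closer to the situation of lower-level prioritization choices.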

Conclusions and future work

While we constantly trust our intuitions to grok relevant structure and “feel around” for what seems more likely (and we shouldn’t trash these proven sources of evidence), in a pre-paradigmatic field it can be especially positive to try and make more explicit our epistemic attitudes and methods, as well as the concrete questions we are pointing them towards, and interesting nodes we think are upstream of disagreements.

Importantly, many of the questions we can pose in the flavor of the above list seem to call for forecast-y methods and attitudes (instead of the application of some well-trodden scientific tools), so doubtful distributions, medium credences and hedging seem to be the expected outcome for now.

I especially hope this analysis can help notice how far we are from the scientific standards of other fields. I am not saying we should remain agnostic: we don’t have that privilege. A speculative interpretation of evidence is a priori better than no interpretation. But we do have to calibrate our confidence, and it seems helpful for that to compare with abstractions that have proven useful in the past. Building towards more robust standards might be very positive in logistically median worlds.

On that note, while as mentioned above there are important reasons why alignment discussions remain largely speculative, it seems worthwhile to increase efforts on keeping an eye out for possible (even theoretical / hypothetical / mental) Proofs of Concept or any kind of feedback from reality.

Indeed, while the list of cruxes and parameters seems more contingent and will continuously be expanded, we seem especially bottlenecked on sources of evidence. We need a deeper methodological and epistemological analysis, trying to come up with qualitatively new sources, or further exploiting existing ones by combining or repurposing them.

This work might be expanded with a benchmark of thought experiments in alignment, as intuition pumps and technical discussion starters. Other than that, some promising avenues for future work are:

Expand and deepen the analysis of sources of evidence, especially looking for not-already-used low-hanging methods.
Scour the Deep Learning literature for evidence on some of the questions on training dynamics and in-the-limit behavior.
Replicate treatment inspired by the central questions of other researchers’ world-views.
Expand questions and methods on feedback loops.
More generally, search for background parameters and compile evidence on them.

Appendix: Where are we most uncertain?

Instead of weighing all our uncertainties, I’ll focus on a single central dichotomy that seems to underlie different world-views. Also, for clearer exposition, I’ll reconstruct my train of thought instead of arguing for conclusions.

Many threat models or qualitative arguments for danger seem to involve two steps: (1) this undesired algorithm exists (for example, human imitator instead of truthful predictor), and (2) our training procedures could find it (because there are some dynamics incentivizing that). This corresponds to the questions “what is the class of algorithms satisfying the task”, and “out of those, which ones are easily found in training?” (similar to the distinction between A1 and A2 above).

My initial intuition was we were generally pretty good at existence proofs, and on the contrary pretty bad at being able to predict something as chaotic as training dynamics. Indeed, I felt like Paul’s builder-breaker takes advantage of this: through the conservative worst-case assumption, they can focus on existence proofs.

But it also seems true that many disagreements of MIRI views with other researchers are about whether you can even solve some tasks without getting general reasoners, the inherent necessity of good and general cognitive structures even for narrow advanced tasks, that is, which algorithms solve a certain task (thanks to Vivek Hebbar for pointing this out).

So what do we even mean by being uncertain about these questions?

Well, thinking about variance, it does seem true we have high variance on “what the algorithms solving this advanced task will look like”, and on the contrary we are pretty confident on “given those algorithms, SGD will approximately find the simpler ones”. And of course, joining answers to these questions we’d be able to answer “which algorithms are found in training”, so it does seem like most work is on the first step. Although it’s not clear whether (1) or (2) has more variance overall, that is, whether “given those algorithms, which ones does SGD find” increases or decreases variance. There’s a possible equivocation here. If we think only of sampling SGD many times, this could indeed have very low variance because of a strong simplicity prior. If instead we think of our subjective uncertainty right now over which models it will find, we are way more uncertain.
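The two readings of "variance" can be separated with the law of total variance. The sketch below assumes a hypothetical scalar property of the learned model (say, a made-up "generality score") and a discrete set of hypotheses about which algorithms exist; all names and numbers are illustrative.

```python
def variance_decomposition(weights, means, variances):
    """Split the variance of a mixture into (within, between) components.

    weights[i]   : our credence in hypothesis i about which algorithms exist
    means[i]     : expected outcome of an SGD run if hypothesis i is true
    variances[i] : run-to-run variance of SGD if hypothesis i is true

    `within` is the expected run-to-run variance (resampling SGD);
    `between` is the variance coming from not knowing which hypothesis holds.
    Their sum is the total variance of the outcome.
    """
    overall_mean = sum(w * m for w, m in zip(weights, means))
    within = sum(w * v for w, v in zip(weights, variances))
    between = sum(w * (m - overall_mean) ** 2 for w, m in zip(weights, means))
    return within, between
```

For instance, `variance_decomposition([0.5, 0.5], [0.0, 1.0], [0.01, 0.01])` gives a within-hypothesis variance of 0.01 and a between-hypothesis variance of 0.25. SGD can be nearly deterministic under each hypothesis while our present uncertainty about the outcome remains large, which is the equivocation described above.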

In any event, my original intuition came from feeling “less clueless” about the set of existing algorithms than about the set of learned algorithms. But maybe I was just feeling that way because I predict the set of existing algorithms to be “big and chaotic” (which is, of course, not the same as actually not being clueless about its contents).

Let’s examine another intuition: we do seem to have more direct access to the set of existing algorithms (just completely build up an algorithm, or a counterexample about what that algorithm does in a certain situation) than to learned algorithms (we are nowhere near having powerful enough ML to find those algorithms). But of course, the first thing is not true either! We can’t yet fully spell out an algorithm solving any of the above hard tasks. Nonetheless, it does seem intuitively true that we have a good big picture of “how these algorithms will globally work, or which things they need to implement”. And maybe we don’t have the same big picture idea about “which algorithmic structures are incentivized by SGD” (they’re just a big mess of things). But I’m not even sure that’s true! I’m not sure we have a good grasp of what these algorithms will macroscopically actually do (let alone their efficient low-level implementation, or concrete behavior). And on the contrary we do have some interesting evidence on SGD’s overall biases, and maybe even other finer properties thanks to trial and error (although we are again wary of phase changes).

After all, I don’t see anyone strongly contesting that SGD will have a simplicity bias (or their threat models depending on that), and on the contrary a central and distinctive component of MIRI world-views is that there don’t exist algorithms solving some technical tasks without being too general. So I do feel like conceptually most of the crux is in “how pervasive are dangerous algorithms in algorithm-space”.

I think if this differs from my initial intuitions it’s because I was thinking of “checking whether an algorithm of a certain shape exists”, rather than “more completely determining which kinds of algorithms are prevalent or natural and which aren’t” (as in, MIRI views accept safe algorithms exist, they just think they’ll be very concrete points on algorithm-space, difficult to find, instead of vast portions of algorithm-space), and that’s of course doing most of the heavy lifting for the subsequent question about SGD (natural ≈ simple ≈ what SGD finds).
