*Crossposted** to the EA Forum*

# Introduction

Imagine you are tasked with curing a disease which hasn’t appeared yet. Setting aside why you would know about such a disease’s emergence in the future, how would you go about curing it? You can’t directly or indirectly gather data about the disease itself, as it doesn’t exist yet; you can’t try new drugs to see if they work, as **the disease doesn’t exist yet**; you can’t do anything that is experimental in any way on the disease itself, as… you get the gist. You would be forgiven for thinking that there is nothing you can do about it right now.

AI Alignment looks even more hopeless: it’s about solving a never-before seen problem on a technology which doesn’t exist yet.

Yet researchers are actually working on it! There are __papers__, __books__, __unconferences__, __whole online forums__ full of alignment research, both __conceptual and applied__. Some of these work in a purely abstract and theoretical realm, others study our best current analogous of Human Level AI/Transformative AI/AGI, still others iterate on current technologies that seem precursors to these kinds of AI, typically large language models. With so many different approaches, how can we know if we’re making progress or just grasping at straws?

Intuitively, the latter two approaches sound more like how we should produce knowledge: they take their **epistemic strategies** (ways of producing knowledge) out of Science and Engineering, the two cornerstones of knowledge and technology in the modern world. Yet recall that in alignment, models of the actual problem and/or technology can’t be evaluated experimentally, and one cannot try and iterate on proposed solutions directly. So when we take inspiration from Science and Engineering (and I want people to do that), we must be careful and realize that most of the guarantees and checks we associate with both are simply not present in alignment, for the decidedly pedestrian reason that Human Level AI/Transformative AI/AGI doesn’t exist yet.

I thus claim that:

- Epistemic strategies from Science and Engineering don’t dominate other strategies in alignment research.
- Given the hardness of grounding knowledge in alignment, we should leverage every epistemic strategy we can find and get away with.
- These epistemic strategies should be made explicit and examined. Both the ones taken or adapted from Science and Engineering, and every other one (for example the more theoretical and philosophical strategies favored in conceptual alignment research).

This matters a lot, as it underlies many issues and confusions in how alignment is discussed, taught, created and criticized. Having such a diverse array of epistemic strategies is fascinating, but their implicit nature makes it challenging to communicate with newcomers, outsiders, and even fellow researchers leveraging different strategies. Here is a non-exhaustive list of issues that boil down to epistemic strategy confusion:

- There is a strong natural bias towards believing that taking epistemic strategies from Science and Engineering automatically leads to work that is valuable for alignment.
- This leads to a flurry of work (often by well-meaning ML researchers/engineers) that doesn’t tackle or help with alignment, and might push capabilities (in an imagined tradeoff with the assumed alignment benefits of the work)
- Very important: that doesn’t mean I consider work based on ML as useless for alignment. Just that the sort of ML-based work that actually tries to solve the problem tends to be by people who understand that one must be careful when transfering epistemic strategies to alignment.

- There is a strong natural bias towards disparaging any epistemic strategy that doesn’t align neatly with the main strategies of Science and Engineering (even when the alternative epistemic strategies are actually used by scientists and engineers in practice!)
- This leads to more and more common confusion about what’s happening on the Alignment Forum and conceptual alignment research more generally. And that confusion can easily turn into the sort of criticism that boils down to “this is stupid and people should stop doing that”.

- It’s quite hard to evaluate whether alignment research (applied or conceptual) creates the kind of knowledge we’re after, and helps move the ball forward. This comes from both the varieties of epistemic strategies and the lack of the usual guarantees and checks when applying more mainstream ones (every observation and experiment is used through analogies or induction to future regimes).
- This makes it harder for researchers using different sets of epistemic strategies to talk to each other and give useful feedback to one another.

- Criticism of some approach or idea often stops at the level of the epistemic strategy being weird.
- This happens a lot with criticism of lack of concreteness and formalism and grounding for Alignment Forum posts.
- It also happens when applied alignment research is rebuked solely because it uses current technology, and the critics have decided that it can’t apply to AGI-like regimes.

- Teaching alignment without tackling this pluralism of epistemic strategies, or by trying to fit everything into a paradigm using only a handful of those, results in my experience in people who know the lingo and some of the concepts, but have trouble contributing, criticizing and teaching the ideas and research they learnt.
- You can also end up with a sort of dogmatism that alignment can only be done a certain way.

Note that most fields (including many sciences and engineering disciplines) also use weirder epistemic strategies. So do many attempts at predicting and directing the future (think existential risk mitigation in general). My point is not that alignment is a special snowflake, more that it’s both weird enough (in the inability to experiment and iterate directly with) and important enough that elucidating the epistemic strategies we’re using, finding others and integrating them is particularly important.

In the rest of this post, I develop and unfold my arguments in more detail. I start with digging deeper into what I mean by the main epistemic strategies of Science and Engineering, and why they don’t transfer unharmed to alignment. Then I demonstrate the importance of looking at different epistemic strategies, by focusing on examples of alignment results and arguments which make most sense as interpreted through the epistemic lens of Theoretical Computer Science. I use the latter as an inspiration because I adore that field and because it’s a fertile ground for epistemic strategies. I conclude by pointing at the sort of epistemic analyses I feel are needed right now.

Lastly, this post can be seen as a research agenda of sorts, as I’m already doing some of these epistemic analyses, and believe this is the most important use of my time and my nerdiness about weird epistemological tricks. We roll with what evolution’s dice gave us.

*Thanks to Logan Smith, Richard Ngo, Remmelt Ellen,, Alex Turner, Joe Collman, Ruby, Antonin Broi, Antoine de Scorraille, Florent Berthet, Maxime Riché, Steve Byrnes, John Wentworth and Connor Leahy for discussions and feedback on drafts.*

# Science and Engineering Walking on Eggshells

What is the standard way of learning how the world works? For centuries now, the answer has been Science.

I like Peter Godfrey-Smith’s description in his glorious __Theory and Reality__:

Science works by taking theoretical ideas and trying to find ways to expose them to observation. The scientific strategy is to construe ideas, to embed them in surrounding conceptual frameworks, and to develop them, in such a way that this exposure is possible even in the case of the most general and ambitious hypotheses about the universe.

That is, the essence of science is in the trinity of modelling, predicting and testing.

This doesn’t mean there are no epistemological subtleties left in modern science; finding ways of gathering evidence, of forcing the meeting of model and reality, often takes incredibly creative turns. Fundamental Chemistry uses synthesis of molecules never seen in nature to test the edge cases of its models; black holes are not observable directly, but must be inferred by a host of indirect signals like the light released by matter falling in the black hole or gravitational waves during black holes merging; Ancient Rome is probed and explored through a discussion between textual analysis and archeological discoveries.

Yet all of these still amount to instantiating the meta epistemic strategy “say something about the world, then check if the world agrees”. As already pointed out, it doesn’t transfer straightforwardly to alignment because Human Level AI/Transformative AI/AGI doesn’t exist yet.

What I’m not saying is that epistemic strategies from Science are irrelevant to alignment. But because they must be adapted to tell us something about a phenomenon that doesn’t exist yet, they lose their supremacy in the creation of knowledge. They can help us to gather data about what exists now, and to think about the sort of models that are good at describing reality, but checking their relevance to the actual problem/thinking through the analogies requires thinking in more detail about what kind of knowledge we’re creating.

If we instead want more of a problem solving perspective, tinkering is a staple strategy in engineering, before we know how to solve the problem things reliably. Think about curing cancer or building the internet: you try the best solutions you can think of, see how they fail, correct the issues or find a new approach, and iterate.

Once again, this is made qualitatively different in alignment by the fact that neither the problem nor the source of the problem exist yet. We can try to solve toy versions of the problem, or what we consider analogous situations, but none of our solutions can be actually tested yet. And there is the additional difficulty that Human-level AI/Transformative AI/AGI might be so dangerous that we have only one chance to implement the solution.

So if we want to apply the essence of human technological progress, from agriculture to planes and computer programs, just trying things out, we need to deal with the epistemic subtleties and limits of analogies and toy problems.

An earlier draft presented the conclusion of this section as “Science and Engineering can’t do anything for us—what should we do?” which is not my point. What I’m claiming is that in alignment, the epistemic strategies from both Science and Engineering are not as straightforward to use and leverage as they usually are (yes, I know, there’s a lot of subtleties to Science and Engineering anyway). They don’t provide privileged approaches demanding minimal epistemic thinking in most cases; instead we have to be careful how we use them as epistemic strategies. Think of them as tools which are so good that most of the time, people can use them without thinking about all the details of how the tools work, and get mostly the wanted result. My claim here is that these tools need to be applied with significantly more care in alignment, where they lose their “basically works all the time” status.

Acknowledging that point is crucial for understanding why alignment research is so pluralistic in terms of epistemic strategies. Because no such strategy works as broadly as we’re used to for most of Science and Engineering, alignment has to draw from literally every epistemic strategy it can pull, taking inspiration from Science and Engineering, but also philosophy, pre-mortems of complex projects, and a number of other fields.

To show that further, I turn to some main alignment concepts which are often considered confusing and weird, in part because they don’t leverage the most common epistemic strategies. I explain these results by recontextualizing them through the lens of Theoretical Computer Science (TCS).

# Epistemic Strategies from TCS

For those who don’t know the field, Theoretical Computer Science focuses on studying computation. It emerged from the work of Turing, Church, Gödel and others in the 30s, on formal models of what we would now call computation: the process of following a step-by-step recipe to solve a problem. TCS cares about what is computable and what isn’t, as well as how much resources are needed for each problem. Probably the most active subfield is __Complexity Theory__, which cares about how to separate computational problems in classes capturing how many resources (most often time – number of steps) are required for solving them.

What makes TCS most relevant to us is that theoretical computer scientists excel at wringing knowledge out of the most improbable places. They are brilliant at inventing epistemic strategies, and remember, we need every one we can find for solving alignment.

To show you what I mean, let’s look at three main ideas/arguments in alignment (Convergent Subgoals, Goodhart’s Law and the Orthogonality Thesis) through some TCS epistemological strategies.

## Convergent Subgoals and Smoothed Analysis

One of the main argument for AI Risk and statement of a problem in alignment is Nick Bostrom’s __Instrumental Convergence Thesis__ (which also takes inspiration from Steve Omohundro’s __Basic AI Drives__):

Several instrumental values can be identified which are convergent in the sense that their attainment would increase the chances of the agent’s goal being realized for a wide range of final goals and a wide range of situations, implying that these instrumental values are likely to be pursued by many intelligent agents.

That is, actions/plans exist which help with a vast array of different tasks: self-preservation, protecting one’s own goal, acquiring resources... So a Human-level AI/Transformative AI/AGI could take them while still doing what we asked it to do. Convergent subgoals are about showing that behaviors which look like they can only emerge from rebellious robots actually can be pretty useful for obedient (but unaligned in some way) AI.

What kind of argument is that? Bostrom makes a claim about “most goals”—that is, the space of all goals. His claim is that convergent subgoals are so useful that goal-space is almost chock-full of goals incentivizing convergent subgoals.

And more recent explorations of this argument have followed this intuition: Alex Turner et al.’s __work__ on power-seeking formalizes the instrumental convergence thesis in the setting of __Markov decision processes__ (MDP) and reward functions by looking, for every “goal” (a distribution over reward functions) at the set of all its permuted variants (the distribution given by exchanging some states – so the reward labels stay the same, but are not put on the same states). Their main theorems state that given some symmetry properties in the environment, a majority (or a possibly bigger fraction) of the permuted variant of **every** goal will incentivize convergent subgoals for its optimal policies.

So this tells us that goals without convergent subgoals exist, but they tend to be crowded out by ones with such subgoals. Still, it’s very important to realize what neither Bostrom nor Turner are arguing for: they’re not saying that every goal has convergent subgoals. Nor are they claiming to have found the way humans sample goal-space, such that their results imply goals with convergent subgoals must be sampled by humans with high probability. Instead, they show the overwhelmingness of convergent subgoals in some settings, and consider that a strong indicator that avoiding them is hard.

I see a direct analogy with the wonderful idea of __smoothed analysis__ in complexity theory. For a bit of background, complexity theory generally focuses on the worst case time taken by algorithms. That means it mostly cares about which input will take the most time, not about the average time taken over all inputs. The latter is also studied, but it’s nigh impossible to find the distribution of input actually used in practice (and some problems studied in complexity theory are never used in practice, so a meaningful distribution is even harder to make sense of). Just like it’s very hard to find the distribution from which goals are sampled in practice.

As a consequence of the focus on worst case complexity, some results in complexity theory clash with experience. Here we focus on the __simplex algorithm__, used for linear programming: it runs really fast and well in practice, despite having provable exponential worst case complexity. Which in complexity-theory-speak means it shouldn’t be a practical algorithm.

__Daniel Spielman__ and__ Shang-Hua Teng__ had a brilliant intuition to resolve this inconsistency: what if the worst case inputs were so rare that just a little perturbation would make them easy again? Imagine a landscape that is mostly flat, with some very high but very steep peaks. Then if you don’t land exactly on the peaks (and they’re so pointy that it’s really hard to get there exactly), you end up on the flat surface.

This intuition yielded __smoothed analysis__: instead of just computing the worst case complexity, we compute the worst case complexity averaged over some noise on the input. Hence the peaks get averaged with the flatness around them and have a low smoothed time complexity.

Convergent subgoals, especially in Turner’s formulation, behave in an analogous way: think of the peaks as goals without convergent subgoals; to avoid power-seeking we would ideally reach one of them, but their rarity and intolerance to small perturbations (permutations here) makes it really hard. So the knowledge created is about the shape of the landscape, and leveraging the intuition of smoothed analysis, that tells us something important about the hardness of avoiding convergent subgoals.

Note though that there is one aspect in which Turner’s results and the smoothed analysis of the simplex algorithm are complete opposite: in the former the peaks are what we want (no convergent subgoals) while in the latter they’re what we don’t want (inputs that take exponential time to treat). This inversion doesn’t change the sort of knowledge produced, but it’s an easy source of confusion.

epistemic analysis isn’t only meant for clarifying and distilling results: it can and should pave the way to some insights on how the arguments could fail. Here, the analogy to smoothed complexity and the landscape picture suggests that Bostrom and Turner’s argument could be interrogated by:

- Arguing that the sampling method we use in practice to decide tasks and goals for AI targets specifically the peaks.
- Arguing that the sampling is done in a smaller goal-space for which the peaks are broader
- For Turner’s version, one way of doing that might be to not consider the full orbit, but only the close-ish variations of the goals (small permutations instead of all permutations). So questioning the form of noise over the choice of goals that is used in Turner’s work.

Smoothed Analysis | Convergent Subgoals (Turner et al) |

Possible Inputs | Possible Goals |

Worst-case inputs make steep and rare peaks | Goals without convergent subgoals make steep and rare peaks |

## Goodhart’s Law and Distributed Impossibility/Hardness Results

Another recurrent concept in alignment thinking is __Goodhart’s law__. It wasn’t invented by alignment researchers, but Scott Garrabrant and David Manheim proposed a __taxonomy__ of its different forms. Fundamentally, Goodhart’s law tells us that if we optimize a proxy of what we really want (some measure that closely correlates with the wanted quantity in the regime we can observe), strong optimization tends to make the two split apart, meaning we don’t end up with what we really wanted. For example, imagine that everytime you go for a run you put on your running shoes, and you only put on these shoes for running. Putting on your shoes is thus a good proxy for running; but if you decide to optimize the former in order to optimize the latter, you will take most of your time putting on and taking off your shoes instead of running.

In alignment, Goodhart is used to argue for the hardness of specifying exactly what we want: small discrepancies can lead to massive differences in outcomes.

Yet there is a recurrent problem: Goodhart’s law assumes the existence of some True Objective, of which we’re taking a proxy. Even setting aside the difficulties of defining what we really want at a given time, what if what we really want is not fixed but constantly evolving? Thinking about what I want nowadays, for example, it’s different from what I wanted 10 years ago, despite some similarities. How can Goodhart’s law apply to my values and my desires if there is not a fixed target to reach?

Salvation comes from a basic insight when comparing problems: if problem A (running a marathon) is harder than problem B (running 5 km), then showing that the latter is really hard, or even impossible, transfers to the former.

My description above focuses on the notion of one problem being harder than another. TCS formalizes this notion by saying the easier problem is reducible to the harder one: a solution for the harder one lets us build a solution for the easier problem. And that’s the trick: if we show there is no solution for the easier problem, this means that there is no solution for the harder one, or such a solution could be used to solve the easier problem. Same thing with hardness results which are about how difficult it is to solve a problem.

That is, when proving impossibility/hardness, you want to focus on the easiest version of the problem for which the impossibility/hardness still holds.

In the case of Goodhart’s law, this can be used to argue that it applies to moving targets because having True Values or a True Objective makes the problem easier. Hitting a fixed target sounds simpler than hitting a moving or shifting one. If we accept that conclusion, then because Goodhart’s law shows hardness in the former case, it also does in the latter.

That being said, whether the moving target problem is indeed harder is debatable and debated. My point here is not to claim that this is definitely true, and so that Goodhart’s law necessarily applies. Instead, it’s to focus the discussion on the relative hardness of the two problems, which is what underlies the correctness of the epistemic strategy I just described. So the point of this analysis is that there is another argument to decide the usefulness of Goodhart’s law in alignment than debating the existence of True Value

Running | Alignment | |

Easier Problem | 5k | Approximating a fixed target (True Values) |

Harder Problem | Marathon | Approximating a moving target |

Orthogonality Thesis and Complexity Barriers

My last example is Bostrom’s __Orthogonality Thesis__: it states that goals and competence are orthogonal, meaning that they are independent– a certain level of competence doesn’t force a system to have only a small range of goals (with some subtleties that I address below).

That might sound only too general to really be useful for alignment, but we need to put it in context. The Orthogonality Thesis is a response to a common argument for alignment-by-default: because a Human-level AI/Transformative AI/AGI would be competent, it should realize what we really meant/wanted, and correct itself as a result. Bostrom points out that there is a difference between understanding and caring. The AI understanding our real intentions doesn’t mean it must act on that knowledge, especially if it is programmed and instructed to follow our initial commands. So our advanced AI might understand that we don’t want it to follow convergent subgoals while maximizing the number of paperclips produced in the world, but what it cares about is the initial goal/command/task of maximizing paperclips, not the more accurate representation of what we really meant.

Put another way, if one wants to prove alignment-by-default, the Orthogonality thesis argues that competence is not enough. As it is used, it’s not so much a result about the real world, but a result about how we can reason about the world. It shows that one class of arguments (competence will lead to human values) isn’t enough.

Just like some of the weirdest results in complexity theory: __the barriers__ to P vs NP. This problem is one of the biggest and most difficult open questions in complexity theory: settling formally the question of whether the class of problems which can tractably be solved (P for __Polynomial time__) is equal to the class of problems for which solutions can be tractably checked (NP for __Non-deterministic Polynomial time__). Intuitively those are different: the former is about creativity, the second about taste, and we feel that creating something of quality is harder than being able to recognize it. Yet a proof of this result (or its surprising opposite) has evaded complexity theorists for decades.

That being said, recall that theoretical computer scientists are experts at wringing knowledge out of everything, including their inability to prove something. This resulted in the three barriers to P vs NP: three techniques from complexity theory which have been proved to not be enough by themselves for showing P vs NP or its opposite. I won’t go into the technical details here, because the analogy is mostly with the goal of these barriers. They let complexity theorists know quickly if a proof has potential – it must circumvent the barriers somehow.

The Orthogonality thesis plays a similar role in alignment: it’s an easy check for the sort of arguments about alignment-by-default that many people think of when learning about the topic. If they extract alignment purely from the competence of the AI, then the Orthogonality Thesis tells us something is wrong.

What does this mean for criticism of the argument? That what matters when trying to break the Orthogonality Thesis isn’t its literal statement, but whether it still provides a barrier to alignment-by-default. Bostrom himself points out that the Orthogonality Thesis isn’t literally true in some regimes (for example some goals might require a minimum of competence) but that doesn’t affect the barrier nature of the result.

Barriers to P vs NP | Orthogonality Thesis |

Proof techniques that are provably not enough to settle the question | Competence by itself isn’t enough to show alignment-by-default |

# Improving the Epistemic State-of-the-art

Alignment research aims at solving a never-before-seen problem caused by a technology that doesn’t exist yet. This means that the main epistemic strategies from Science and Engineering need to be adapted if used, and lose some of their guarantees and checks. In consequence, I claim we shouldn’t only focus on those, but use all epistemic strategies we can find to produce new knowledge. This is already happening to some extent, causing both progress on many fronts but also difficulties in teaching, communicating and criticizing alignment work.

In this post I focused on epistemological analyses drawing from Theoretical Computer Science, both because of my love for it and because it fits into the background of many conceptual alignment researchers. But many different research directions leverage different epistemic strategies, and those should also be studied and unearthed to facilitate learning and criticism.

More generally, the inherent weirdness of alignment makes it nigh impossible to find one unique methodology of doing it. We need everything we can get, and that implies a more pluralistic epistemology. Which means the epistemology of different research approaches must be considered and studied and made explicit, if we don’t want to be confusing for the rest of the world, and each other too.

I’m planning on focusing on such epistemic analyses in the future, both for the main ideas and concepts we want to teach to new researchers and for the state-of-the art work that needs to be questioned and criticized productively.

This post really helped me make concrete some of the admittedly gut reaction type concerns/questions/misunderstandings I had about alignment research, thank you. I have a few thoughts after reading:

(1) I wonder how different some of these epistemic strategies are from everyday normal scientific research in practice. I do experimental neuroscience and I would argue that we also are not even really sure what the "right" questions are (in a local sense, as in, what experiment should I do next), and so we are in a state where we kinda fumble around using whatever inspiration we can. The inspiration can take many forms - philosophical, theoretical, emperical, a very simple model, thought experiments of various kinds, ideas or experimental results with an aesthetic quality. It is true that at the end of the day brain's already exist, so we have that to probe, but I'd argue that we don't have a great handle on what exactly is the important thing to look at in brains, nor in what experimental contexts we should be looking at them, so it's not immediately obvious what type of models, experiments, or observations we should be doing. What ends up happening is, I think, a lot of the types of arguments you mention. For instance, trying to make a story using the types of tasks we can run in the lab but applying to more complicated real world scenarios (or vice versa), and these arguments often take a less-than-totally-formal form. There is an analagous conversation occuring within neuroscience that takes the form of "does any of this work even say anything about how the brain works?!"

(2) You used theoretical computer science as your main example but it sounds to me like the epistemic strategies one might want in alignment research are more generally found in pure mathematics. I am not a mathematician but I know a few, and I'm always really intrigued by the difference in how they go about problem solving compared to us scientists.

Thanks!

Thanks for the kind words and thoughtful comment!

Glad it helped! That was definitely one goal, the hardest to check with early feedback because I mostly know people who already work in the field or have never been confronted to it, while you're in the middle. :)

Completely! One thing I tried to make clear in this draft (maybe not successfully given your comment ^^) is that many field, including natural sciences, leverage far more epistemic strategies than the Popperian "make a model, predict something and test it in the real world". My points are more that:

But I'm convinced that almost all disciplines could be the subject of a deep study in the methods people use to wrestle knowledge from the world. That's part of my hopes, since I want to steal epistemic strategies from many different fields and see how they apply to alignment.

Fascinating! Yeah, I agree with you that the analogy definitely exists, particularly with fundamental science. And that's part of the difficulty in alignment. Maybe the best comparison would be trying to cure a neurological pathology without having access to a human brain, but only to brains of individuals of very old species in our evolutionary lineage. It's harder, but linking the experimental results to the concrete object is still part of the problem.

(Would be very interested in having a call about the different epistemic strategies you use in experimental neuroscience by the way)

So I disagree, but you're touching on a fascinating topic, one that confused me for the longest time.

My claim is that pure maths is fundamentally the study of abstraction (Platonists would disagree, but that's more of a minority position nowadays). Patterns is also a word commonly used when mathematicians wax poetic. What this means is that pure maths studies ways of looking at the world, of representing it. But, and that's crucial for our discussion, pure maths doesn't care about how to apply these representations to the world itself. When pure mathematicians study concrete systems, it's often to get exciting abstractions out of them, but they don't check those abstractions against the world again -- they study the abstraction in and of themselves.

The closest pure maths gets to caring about how to apply abstractions is when using one type of abstractions to understand another. The history of modern mathematics is full of dualities, representation theorems, correspondences and tricks for seeing one abstraction through the prism of another. By the way that's one of the most useful aspect of maths in my opinion: if you manage to formalize your intuition/ideas as a well-studied mathematical abstraction, then you get for free a lot of powerful tools and ways of looking at it. But maths itself doesn't tell you how to do the formalization.

On the other hand, theoretical computer scientists are in this weird position where they almost only work in the abstract world reminiscent of pure maths, but their focus is on an aspect of the real world: computation, its possibilities and its limits. TCS doesn't care about a cool abstractions if it doesn't unlock some nice insight about computation.

The post already have some nice examples, but here is another one I like: defining the class of tractable problems. You want to delineate the problems that can be solved in practice, in the sense that the time taken to solve them grows slowly enough with the size of the input that solving for massive inputs is a matter of days, maybe weeks or months of computation, but not thousands of years. Yet there is a fundamental tension between a category that fits the problems for which we have fast algorithms, and mathematical elegance/well-behavedness.

So complexity theorists reached for a compromise with P, the class of problems solvable in polynomial time. Polynomials are nice because a linear sum of them is still a polynomial, which means among other things that if solving problem A is the same than solving problem B 3 times then problem C 2 times, and both B and C are in P, then A is also in P. Yet polynomials can grow pretty damn fast. If n is the size of the input, a problem who can be solved in time growing like n10100 is technically in P, but is completely intractable. The argument for why this is not an issue is that in practice, all the problems we can prove are in P have an algorithm with complexity at most growing like n3, which is the limit of tractable.

So much of the reasoning I presented in the previous paragraphs is reminiscent of troubles and subtleties in alignment, but so different from pure maths, that I really believe TCS is more relevant to the epistemic strategies of alignment than pure maths. Which doesn't mean maths doesn't prove crucial when instanciating these strategies, for example by proving theorems we need.

Nice overview, but what surprises me is that you are not in fact describing the main epistemic strategy used in engineering. What you say about engineering is:

You fail to mention the more important engineering strategy: one which does not rely on tinkering, but instead on logical reasoning and math to chart a straight line to your goal.

To use the obvious example, modern engineering does not design bridges by using the staple strategy of tinkering, it will use applied math and materials science to create and validate the bridge design.

From this point of view, the main difference between 'science' and 'engineering' is that science tries to understand nature: it seeks to understand existing physical laws and evolved systems. But engineering builds systems from the ground up. The whole point of this is that the engineer can route around creating the kind of hard-to-analyse complexity you often encounter in evolved systems. Engineering is a 'constructive science' not an 'observational science' or 'experimental science'.

On this forum, you can find a lot of reflection and debate about 'the fundamental properties of AGI', debate which treats AGI as some kind of evolved entity, not as a designed entity. I feel that this approach of treating AGI as an evolved entity is too limited. It may be a useful framing if you want to prove that the alignment problem exists, but it will not help you much in solving it.

In software engineering, the 'tinkering' approach to programming is often described as the approach of 'debugging a blank piece of paper'. This kind-of-experimental approach is not the recommended or respectable approach, especially not for high-risk software systems. (That being said, in ML algorithm research, improving systems by blindly tinkering with them still seems to be fairly respectable.)

Like you, I like Theoretical Computer Science. However, I do not like or recommend complexity theory as a model of TCS. As complexity theory mainly showcases the analytical post-hoc use of math, and not the constructive use of math to guide your design decisions. The type of TCS I like is what Wikipedia calls formal methods:

Specifically, in AI/AGI alignment, we can trivially use formal methods to model superintelligent machine learning, and then proceed to use formal methods to construct agents that have certain provable safety properties, even though they incorporate superintelligent machine learning.

Stuart Russell has remarked here that in the US, there seems to be somewhat of a cultural aversion against using formal methods:

I'd like to emphasize the use of a constructive engineering mindset, and the use of formal methods from TCS, as important epistemic strategies applicable to AGI alignment.

Thanks for the thoughtful comment!

That's a fair point. My intuition here is that this more advanced epistemic strategy requires the science, and that the one I described is the one people actually used for thousands of years when they had to design bridge without material science and decent physics. So in a sense, the modern engineering epistemic strategy sounds even less feasible for alignment.

As for your defense of formal methods... let's start by saying that I have a good idea of what you're talking about: I studied engineering in Toulouse, which is one of the places with the most formal methods in France, which is itself a big formal methods hotspots. I got classes and lab sessions about all the classical formal methods techniques. I did my PhD in a formal methods lab, going to a bunch of seminars and conferences. And even if my PhD was on the theory of distributed computing, which takes a lot from computability and complexity theory, it's also the field that spurned the invention of formal methods and the analysis of algorithms as this is pretty much the only way of creating distributed algorithms that work.

All of that to clarify that I'm talking from some amount of expertise here. And my reaction is that formal methods have two big problems: they're abysmally slow, and they're abysmally weak, at least compared to the task of verifying an AGI.

On the slow part, it took

two yearsfor researchers to formally verify Compcert, a C compiler that actually can't compile the whole C specification. It's still an amazing accomplishment, but that gives you an idea of the timescale of formal verification.You could of course answer that CompCert was verified "by hand" using proof assistants, and that we can leverage more automatic methods like model checking or abstract interpretation. But there's a reason why none of these approaches was used to certify a whole compiler: they're clearly not up to the task currently.

Which brings us to the weak part. The most relevant thing here is that you need a specification to apply formal methods. And providing a specification for alignment would be on par with proving P != NP. Not saying it's impossible, just that it's clearly so hard that you nee to show a lot of evidence to convince people that you did. (I'm interested if you feel you have evidence for that ^^)

And even then, the complexity of the specification would without a doubt far exceeds what can be checked using current methods.

I also see that you're saying formal methods help you design stuff. That sounds very wrong to me. Sure, you can use formal methods if you want to test your spec, but it's so costly with modern software that basically no one considers it worth it except cases where there's a risk of death (think stuff like airplanes). And the big problem with AGI is that a lot of the people running for it don't understand that it involves risks of death and worse. So a technology that is completely useless to make their model stronger, and only serves to catch errors that lead to death has no value for people who don't believe in the possibility of critical error in their software.

But maybe the last and final point is that, sure, it will require a lot of work and time, but doing it through formal methods will provide guarantees that we can hardly get from any other source. I sympathise with that, but like almost all researchers here, my answer is "Deep Learning seem to be going there far sooner than the formal methods, so I'm going to deal with aligning the kind of system we're linkely to get in 15~30 years, not the ones that we could at best get in a century or two.

(Sorry if you feel I'm strawmanning you, I have quite strong opinions on the complete irrelevance of formal methods for alignment. ^^)

I think I will need to write two separate replies to address the points you raise. First, a more meta-level and autobiographical reply.

When it comes to formal methods, I too have a good idea of what I am talking about. I did a lot of formal methods stuff in the 1990 at Eindhoven University, which at the time was a one of the places with the most formal methods in the Netherlands.

When I talk about using formal methods to design stuff, I am very serious. But I guess this is not the role of formal methods you are used to, so I'll elaborate.

The type of formal methods use I am referring to was popular in the late 1970s to at least the 1990s, not sure how popular it is now. It can be summarized by Dijkstra's slogan of “designing proof and program hand in hand”.

This approach stands in complete opposite to the approach of using a theorem prover to verify existing code after it was written.

In simple cases, the designing hand-in-hand approach works as follows: you start with a mathematical specification of the properties you want the program to be written to have. Then you use this specification to guide both the writing of the code and the writing of the correctness proof for the code at the same time. The writing of the next line in the proof will often tell you exactly what next lines of code you need to add. This often leads to code which much more clearly expresses what is going on. The whole code and proof writing process can often be done without even leveraging a theorem prover.

In more complex cases, you first have to develop the mathematical language to write the specification in. These complex cases are of course the more fun and interesting cases. AGI safety is one such more complex case.

You mention distributed computing. This is one area where the hand-in-hand style of formal methods use is particularly useful, because the method of intuitive trial and error sucks so much at writing correct distributed code. Hand-in-hand methods are useful if you want to design distributed systems protocols that are proven to be deadlock-free, or have related useful properties. (The Owicki-Gries method was a popular generic approach for this type of thing in the 1990, not sure if it is still mentioned often.)

In the late 1990s, I considered the career option of joining academia full-time to work on formal methods. But in the end, I felt that the academic field was insufficiently interested in scaling up beyond toy problems, while there was also insufficient demand-pull from industry or industry regulators. (This never advancing beyond toy problems is course a common complaint about academic CS, it does not usually pay in academia to work on problems larger than the page limit of a conference paper.) In any case, in the late 1990s I abandoned the idea of a career in the formal methods field. Instead, I have done various other things in internet engineering, big science, and industry R&D. I do feel that in all these cases, my formal methods background often often helped me design stuff better, and write more clear system specifications.

The reason I got into independent AGI safety research a few years ago is because a) after working in industry for a long time, I had enough money that I could spend some time doing whatever I liked, including writing papers without caring about a page limit, and b) (more very relevant to this discussion) after scanning various fun open research problems, I felt that the open problems in AGI safety were particularly tractable to the formal methods approach. Specifically, to the approach where you construct programs and proofs hand in hand without even bothering with theorem provers.

So compared to you, I have the exact opposite opinion on about the applicability of formal methods to AGI safety.

I intend to back up this opinion by giving you a list of various formal methods based work which has advanced the field of AGI safety, but I'll do that in a later reply.

The key to progress is that you specify various useful 'safety properties' only, which you then aim to construct/prove. Small steps! It would be pretty useless to try to specify the sum total of all current and future human goals and morality in one single step.

This is of course the usual Very American complaint about formal methods.

However, if I compare AGI to airplanes, I can note that, compared to airplaine disasters, potential AGI disasters have an even bigger worst-case risk of producing mass death. So by implication, everybody should consider it worth the expense to apply formal methods to AGI risk management. In this context, 2 person years to construct a useful safety proof should be considered money well spent.

Finding the money is a somewhat separate problem.

I don't agree with the picture you are painting here. The companies most obviously running for AGI, DeepMind and OpenAI, have explicit safety teams inside of them and they say all the right things about alignment risks on their websites. Also, there is significant formal methods based work coming out of DeepMind, from people like Hutter and Everitt.

Thanks for explaining your point in more details.

I'm also aware of this approach; it is just about as intractable in my opinion and experience than the "check afterward". You still need the specification, which is one of my big arguments against the usefulness of formal methods to alignment. Also, such methods followed strongly are hardly "faster" than using a theorem prover — they are about emulating the sort of thing a theorem prover does while writing the program! Actually, I wonder if this is not just equivalent to writing the proof in Coq and generating the corresponding program. (To be fair, I expect you'll gain an order of magnitude over Coq proofs because you don't have to go that anally over that many details; but an order of magnitude or two better than Coq proof is still incredibly slow when compared to the speed of software development, let alone ML!)

One really important detail that is missing from your description is the incredible simplicity of distributed programs (where I actually mean distributed, not the concurrent you seem to imply) and yet they are tremendously difficult to certify and/or come up with hand and hand in proof. Like the most advanced distributed algorithms in use are probably simpler in terms of code and what happens than the second or third project someone has to code in their first software engineering class. So I hardly see why I should generalize with any confidence from "we manage to design through proofs programs that are simpler that literally what any non-distributed computing programmer writes" to "we're going to design through proof the most advanced and complex program that ever existed, orders of magnitudes more complex than the most complex current systems".

Sure — my point is that useful safety properties are incredibly hard to specify at that level. I'm of course interested of examples to the contrary. ;)

Actually, this is a complain by basically every researcher and big shot in academic formal methods that I ever heard talk. They all want to defend formal methods, but they all have to acknowledge that:

You're missing my point: most of the people developing AGI don't believe in any potential AGI disasters. They don't have the availability of plane crashes and regulations to be "okay, we probably have to check this". If we had such formal methods backed approaches, I could see DeepMind potentially using it, maybe OpenAI, but not chinese labs or all the competitors we don't know about. And the fact that it would slow down whoever uses it for 2 years would be a strong incentive for even the aligned places to not use it, or lose the race and risk having an even worsely unaligned AGI released into the world.

I disagree that AI code is orders of magnitude more complex than say the code in a web browser or modern compiler: in fact quite the opposite applies. Most modern ML algorithms are very short pieces of code. If you are willing to use somewhat abstract math where you do not write out all the hyperparameter values, you can specify everything that goes on in a deep learning algorithm in only a few lines of pseudocode. Same goes for most modern RL algorithms.

I also note that while modern airplanes contain millions of lines of code, most of it is in the on-board entertainment systems. But the safety-critical subsystems in airplanes them tend to be significantly smaller in code size, and they also run air-gapped from the on-board entertainment system code. This air-gapping of course plays an important role in making the formal proofs for the safety-critical subsystems possible.

But beyond these observations: the main point I am trying to get across is that I do not value formal methods as a post-hoc verification tool that should be applied to millions of lines of code, some of it spaghetti code put together via trial and error. Clearly that approach would not work very well.

I value formal methods as a tool for the aligned AI code specification and design phases.On the design phase: formal methods offer me a way to create many alternative and clarifying viewpoints on what the code is doing or intending to do, viewpoints not offered by the text of the code itself. For example, formally expressed loop invariants can express much more clearly what is going on in a loop than the loop code itself. Global invariants that I can formulate for distributed system state can allow me to express more clearly how the protocols I design manage to avoid deadlock than the code itself. So the main value if the hand-in-hand method, as a code design method, is that you can develop these clarifying mathematical viewpoints, even before you start writing the code. (Not all programmers are necessarily good at using formal methods to clarify what they are doing, or to accelerate their work. I have heard a formal tool vendor say that formal methods based tools only make the top 10% of programmers in any development organisation more productive.)

On the specification phase: formal methods can allow me to express safety properties I seek for AGI systems in a much more clear and precise way than any piece of code can, or any natural language statement can.

No, the the design flow here is that alignment researchers spend these 2 years of time using formal methods to develop and publish AGI safety mechanisms right now, before anybody invents viable AGI. The designs we develop should then preferably take far less than 2 years, and no deep mathematical skills, to combine with an AGI-level ML algorithm, if one ever invents one.

OK, I really need to post the promised list of list of alignment papers were formal methods are actually used in the clarifying design-tool way I describe above. Talking about it in the abstract is probably not the best way to get my points across. (I am in the middle of attending virtual NeurIPS, so I have not found the time yet to compile the promised list of papers.)

BTW, bit of a side issue: I do not have an x-risk fault tree model where I assume that a 'chinese lab' is automatically going to be less careful than a 'western lab', But as a rhetorical shorthand I see what you are saying.

OK, here is the promised list formal methods based work which has advanced the field of AGI safety. So these are specific examples to to back up my earlier meta-level remarks where I said that formal methods are and have been useful for AGI safety.

To go back to the Wikipedia quote:

There are plenty of people in CS who are inspired by this thought: if we can design safer bridges by using the tool of mathematical analysis, why not software?

## First, some background comments about formal methods in ML

In modern ML, there are a lot of people who use mathematical analysis in an attempt to better understand or characterize the performance of ML algorithms. Such people have in fact been there since the start of AI as a field.

I am attending NeurIPS 2021, and there are lots of papers which are all about this use of formal methods, or have sections which are all about this. To pick a random example, there is 'Provable Guarantees for Self-Supervised Deep Learning with Spectral Contrastive Loss' (preprint here) where the abstract says things like

(I have not actually read the paper. I am just using this as a random example.)

In general, my sense is that in the ML community right now, there is a broad feeling that mathematical analysis still needs to have some breakthrough before it can say truly useful things, truly useful things when it comes to safety engineering, for AIs that use deep neural networks.

If you read current ML research papers on 'provable guarantees' and then get very unexited, or even deeply upset at how many simplifying assumptions the authors are making in order to allow for any proofs at all, I will not blame you. My sense is also that different groups of ML researchers disagree about how likely such future mathematical breakthroughs will be.

As there is plenty of academic and commercial interest in this formal methods sub-field of ML, I am not currently working in it myself, and I do not follow it closely. As an independent researcher, I have the freedom to look elsewhere.

I am looking in a different formal methods corner, one where I abstract away from the details of any specific ML algorithm. I then try to create well-defined safety mechanisms which produce provable safety properties, properties which apply no matter what ML algorithm is used. This ties back into what I said earlier:

## The list

So here is an annotated list of some formal methods based work which has, in my view, advanced the field of AGI safety. (To save me some time, I will not add any hyperlinks, pasting the title into google usually works.)

Hutter's AIXI model, 'Universal algorithmic intelligence: A mathematical top down approach' models superintelligence in a usefully universal way.

I need to point out however that the age-old π∗ optimal-policy concept in MDPs also models superintelligence in a usefully universal way, but at a higher level of abstraction.

MIRI's 2015 paper 'Corrigibility' makes useful contributions to defining provable safety properties, though it then also reports failure in the effort of attempting to design an agent that provably satisfies them. At a history-of-ideas level, what it interesting here is that 2015 MIRI seemed to have been very much into using formal methods to make progress on alignment, but 2021 MIRI seems to feel that there is much less promise to the approach. Or at least, going by recently posted discussions, 2021 Yudkowsky seems to have lost all hope that these methods will end up saving the world. My own opinion is much more like 2015 MIRI than it is like 2021 MIRI. Modulo of course the idea that people meditating on HPMOR is a uniquely promising approach to solving alignment. I have never much felt he appeal of that approach myself. I think you have argued recently that 2021 Yudkowsky has also become less enthusiastic about the idea.

The above 2015 MIRI paper owes a lot to Armstrong's 2015 'Motivated value selection for artificial agents', the first official academic-style paper about Armstrong's indifference methods. This paper arguably gets a bit closer to describing an agent design that provably meets the safety properties in the MIRI paper. The paper gets closer if we squint just right and interpret Armstrong's 2005 somewhat ambiguous conditional probability notation for defining counterfactuals in just the right way.

My 2019 paper 'Corrigibility with Utility Preservation' in fact can be read to show how you need to squint in just the right way if you want to use Armstrong's indifference methods to create provable corrigibility related safety properties. The paper also clarifies a lot of questions about agent self-knowledge and emerging self-preservation incentives. Especially because of this second theme, the level of math in this paper tends to be more than what most people can or want to deal with.

In my 2020 'AGI Agent Safety by Iteratively Improving the Utility Function' I use easier and more conventional MDP math to define a corrigible (for a definition of corrigibility) agent using Armstrong's indifference methods as a basis. This paper also shows some results and design ideas going beyond stop buttons and indifference methods.

In my 2021 'Counterfactual Planning in AGI Systems' I again expand on some of these themes, developing a different mathematical lens which I feel clarifies several much broader issues, and makes the math more accessible.

Orseau and Armstrong's 2016 'Safely interruptible agents' is often the go-to cite when people want to cite Armstrong's indifference methods. In fact it formulates and proves (for some ML systems) a related 'safely interruptable' safety property, a property which we also want to have if we apply a design like indifference methods.

More recent work which expands on safe interruptability is the 2021 'How RL Agents Behave When Their Actions Are Modified' by Langlois and Everitt.

Everitt and Hutter's 2016 'Self-modification of policy and utility function in rational agents' also deals with agent self-knowledge and emerging self-preservation incentives, but like my above 2019 paper which also deals with the topic, the math tends to be a bit too much for many people.

Carey, Everitt, et al. 2021 'Agent Incentives: A Causal Perspective' has (the latest version of) a definition of causal influence diagrams and some safety properties we might define by using them. The careful mathematical definitions provided in the paper use Pearl's style of defining causal graphs, which is again too much for some people. I have tried to offer a more accessible introduction and definitional framework here.

Everitt et al's 2019 'Reward Tampering Problems and Solutions in Reinforcement Learning: A Causal Influence Diagram Perspective' uses the tools defined above to take a broader look at various safety mechanisms and their implied safety properties.

The 2020 'Asymptotically unambitious artificial general intelligence' by Cohen et al designs a generic system where a uncertain-about-goals agent can call a mentor, and then proves some bounds for it.

I also need to mention the 2017 'The Off-Switch Game' by Hadfield-Menell et al. Like the 2015 MIRI paper on corrigibility, this paper ends up noting that it has clarified some problems, but without solving all of them.

There is lots of other work, but this should give you an idea of what I mean with formal methods work that defines useful AGI safety properties in a mathematically precise way.

Overall, given the results in the above work, I feel that formal methods are well underway in clarifying how one might define useful AGI safety properties, and how we might built them into AGI agents, if AGI agents are ever developed.

But it is also clear from recent comments like this one that Yudkowsky feels that no progress has been made. Yudkowsky is not alone in having this opinion.

I feel quite the opposite, but there is only a limited amount of time that I am willing to invest in convincing Yudkowsky, or any other participants on this forum, that a lot of problems in alignment and embeddedness are much more tractable than they seem to believe.

In fact, in the last 6 months or so, I have sometimes felt that the intellectually interesting open purely mathematical problems have been solved to such an extent that I need to expand the scope of my research hobby. Sure, formal methods research on characterizing and improving specific ML algorithms could still use some breakthroughs, but this is not the kind of math that I am interested in working on as a self-funded independent researcher. So I have recently become more interested in advancing the policy parts of the alignment problem. As a preview, see my paper here, for which I still need to write an alignment forum post.