Alignment Research = Conceptual Alignment Research + Applied Alignment Research

adamShimi

Instead of talking about alignment research, we should differentiate conceptual alignment research and applied alignment research.

I expect these two categories to be quite obvious to most people around here: conceptual alignment research includes deconfusion, for example work on deception, HCH, universality, abstraction, power-seeking, as well as work searching for approaches and solutions for the problems raised by this deconfusion; whereas applied alignment research focuses on experimentally testing these ideas and adapting already existing fields like RL and DL to be more relevant to alignment questions. Some work will definitely fall in the middle, but that’s not a problem, because the point isn’t to separate “good” from “bad” alignment research or “real” from “fake” alignment research, just to be able to frame these clusters and think about them meaningfully.

Isn’t that decomposition trivial though? It’s indeed obvious, just like the result of almost any deconfusion. And yet, committing to that framing clears the air about so many internal issues of the alignment research community and guards discussions of field-building against frustrating everyone involved.

Thanks to Connor Leahy, Logan Smith and Rob Miles for feedback on a draft.

(Note that the AI tag has a distinction that sounds similar but with, in my opinion, far worse names (and a weird classification). It might make sense to change the names following this post if the distinction makes sense to enough people.)

It’s all a Common-Payoff Game

Another obvious yet fundamental point is that neither conceptual alignment research nor applied alignment research is enough by itself -- we need both. Only succeeding in conceptual alignment research would result in a very good abstract understanding of what we should do, but no means of concretely translating it in time; only succeeding in applied alignment research would result in a very good ability to build and understand models that are susceptible to alignment, yet no principled understanding of what to be careful of and how to steer them.

So it doesn’t make any sense to say that one is more important than the other, or that only one is “real alignment research”. Yet frustration can so easily push one to make this mistake. I personally said in multiple conversations that I was talking about “real alignment research” when talking about conceptual research, only to realize hours later, when the pressure was down, that I didn’t believe at all that the applied part was fake alignment research.

Keeping this split in mind and the complementarity of both approaches definitely helped me avoid frustration-induced lapses where I cast alignment as a zero-sum game between conceptual alignment research and applied alignment research. Instead, I remember the obvious: alignment is a common-payoff game where we all either win or lose together.

Field-Building Confusions

Keeping this distinction in mind also helps with addressing some frustrating confusions in field-building discussions.

Take my post about creating an alignment research journal/conference: I should have made it clear that I meant conceptual alignment research, but I hadn’t internalized this distinction at the time. Objections like this comment from David Manheim or this comment from Ryan Carey saying that work can get published in ML/CS venues and we should thus push there didn’t convince me at all, without me being able to put a finger on where my disagreement lay.

The conceptual/applied alignment research distinction instead makes it jump out of the page: ML conferences and journals almost never publish conceptual alignment research AFAIK (some of it is published in workshops, but these don’t play the same role with regard to peer review and getting jobs and tenure). Take this quote from Ryan’s comment:

Currently, the flow of AIS papers into the likes of Neurips and AAAI (and probably soon JMLR, JAIR) is rapidly improving. New keywords have been created there at several conferences, along the lines of "AI safety and trustworthiness" (I forget the exact wording) so that you can nowadays expect, on average, to receive reviewer who average out to neutral, or even vaguely sympathetic to AIS research. Ten or so papers were published in such journals in the last year, and all these authors will become reviewers under that keyword when the conference comes around next year. Yes, things like "Logical Inductors" or "AI safety via debate" are very hard to publish. There's some pressure to write research that's more "normie". All of that sucks, but it's an acceptable cost for being in a high-prestige field. And overall, things are getting easier, fairly quickly.

Applying the conceptual/applied distinction makes it obvious that the argument only applies to applied alignment research. He literally gives two big examples of conceptual alignment research as the sort of things that can’t get published.

This matters because none of this is making it easier to peer-review/scale/publish conceptual alignment research. It’s not a matter of “normies” research versus “real alignment research”, but that almost all gains in field building and scaling in the last few years are for applied alignment research!

Incentives against Conceptual Alignment Research

I started thinking about all of this because I was so frustrated with having to always push/defend conceptual alignment research in discussions. Why were so many people pushing back against it/not seeing the problem that few people work on it? The answer seems obvious in retrospect: because years ago you actually had to push massively against the pull and influence of conceptual alignment research to do anything else.

7 years ago, when Superintelligence was published, MIRI was pretty much the whole field of alignment. And I don’t need to explain to anyone around here how MIRI is squarely in the conceptual part. As such, I’m pretty sure that many researchers who wanted a more varied field or to work on applied alignment research had to constantly deal with the fact that the big shots in town were not convinced by or interested in what they were doing. I expect that many of the applied alignment researchers positioned themselves as an alternative to MIRI’s work.

It made sense at the time, but now applied alignment research has completely dwarfed its conceptual counterpart in terms of researchers, publications, prestige. Why? Because applied alignment research is usually:

Within ML compared to the weird abstract positioning of conceptual alignment research
Experimental and sometimes formal, compared to the weird philosophical aspects of conceptual alignment research
Able to leverage skills that are actually taught in universities and which people are hyped about (programming, ML, data science)

Applied alignment research labs made and are making great progress by leveraging all of these advantages. On the other hand, conceptual alignment research relies entirely on a trickle of new people who bashed their heads long enough against the Alignment Forum to have an idea of what’s happening.

This is not the time for being against conceptual alignment research. Instead, we should all think together about ways of improving this part of the field. And when conceptual alignment research thrives, there will be so many more opportunities for collaborations: experiments on some conceptual alignment concepts, criticism of conceptual ideas from a more concrete perspective, conceptual failure modes for the concrete applied proposals...

Conclusion

You don’t need me to realize that there are two big clusters in alignment research: I call them conceptual alignment research and applied alignment research. But despite how obvious this distinction feels, keeping it in mind is crucial for the only thing that matters: solving the alignment problem.

Appendix: Difference with MIRI’s Philosophy -> Maths -> Engineering

A commenter pointed out to me that my distinction reminded them of the one argued in this MIRI blogpost. In summary, the post argues that Friendly AI (the old school name for alignment) should be tackled in the same way that causality was: starting at philosophy, then formalizing it into maths, and finally implementing these insights through engineering.

I don’t think this fits with the current state of the field, for the following reasons:

Nowadays, the vast majority of the field disagree that there’s any hope of formalizing all of philosophy and then just implementing that to get an aligned AGI.
Because of this perspective that the real problem is to formalize philosophy, the MIRI post basically reduces applied alignment research to exact implementation of the formal models from philosophy. Whereas applied alignment research contains far more insights than that.
Conceptual alignment research isn’t just turning philosophy into mathematics. This is a failure mode I warned against recently: what matters is deconfusion, not formalization. Some non-formal deconfusion can (and so often does) help far more than formal work. This doesn’t mean that having a formalization wouldn’t be great; just that this is in no way a requirement for making progress.

So all in all, I feel that my distinction is both more accurate and more fruitful.

Nowadays, the vast majority of the field disagree that there’s any hope of formalizing all of philosophy and then just implementing that to get an aligned AGI.

What do you mean by "formalizing all of philosophy"? I don't see 'From Philosophy to Math to Engineering' as arguing that we should turn all of philosophy into math (and I don't even see the relevance of this to Friendly AI). It's just claiming that FAI research begins with fuzzy informal ideas/puzzles/goals (like the sort you might see philosophers debate), then tries to move in more formal directions.

I imagine part of Luke's point in writing the post was to push back against the temptation to see formal and informal approaches as opposed ('MIRI does informal stuff, so it must not like formalisms'), and to push back against the idea that analytic philosophers 'own' whatever topics they happen to have historically discussed.

Conceptual alignment research isn’t just turning philosophy into mathematics. This is a failure mode I warned against recently: what matters is deconfusion, not formalization.

Pearl's causality (the main example of "turning philosophy into mathematics" Luke uses) was an example of achieving deconfusion about causality, not an example of 'merely formalizing' something. I agree that calling this deconfusion is a clearer way of pointing at the thing, though!

Thanks for the comment!

What do you mean by "formalizing all of philosophy"? I don't see 'From Philosophy to Math to Engineering' as arguing that we should turn all of philosophy into math (and I don't even see the relevance of this to Friendly AI). It's just claiming that FAI research begins with fuzzy informal ideas/puzzles/goals (like the sort you might see philosophers debate), then tries to move in more formal directions.

I abused the hyperbole in that case. What I was pointing out is the impression that old-school MIRI (a lot of the HRAD work) thinks that solving the alignment problem requires deconfusing every related philosophical problem in terms of maths, and then implementing that. Such a view doesn't seem shared by many in the community for a couple of reasons:

Some doubt that the level of mathematical formalization required is even possible
If timelines are quite short, we probably don't have the time to do all that.
If AGI turns out to be prosaic AGI (which sounds like one of the best bets to make now), then what matters is aligning neural nets, not finding a way to write down a perfectly aligned AGI from scratch (related to the previous point because it seems improbable that the formalization will be finished before neural nets reach AGI, in such a prosaic setting).

I imagine part of Luke's point in writing the post was to push back against the temptation to see formal and informal approaches as opposed ('MIRI does informal stuff, so it must not like formalisms'), and to push back against the idea that analytic philosophers 'own' whatever topics they happen to have historically discussed.

Thanks for that clarification, it makes sense to me. That being said, multiple people (both me a couple of years ago and people I mentor/talk to) seem to have been pushed by MIRI's work in general to think that they need an extremely high level of maths and formalism to even contribute to alignment, which I disagree with, and apparently Luke and you do too.

Reading the linked post, what jumps out at me is the focus on friendly AI being about turning philosophy into maths, and I think that's the culprit. That is part of the process, an important one, and great if we manage it. But expressing and thinking through problems of alignment at a less formal level is still very useful and important; that's how we got most of the big insights and arguments in the field.

Pearl's causality (the main example of "turning philosophy into mathematics" Luke uses) was an example of achieving deconfusion about causality, not an example of 'merely formalizing' something. I agree that calling this deconfusion is a clearer way of pointing at the thing, though!

Funnily, it sounds like MIRI itself (specifically Scott) has called that into doubt with Finite Factored Sets. This work isn't throwing away all of Pearl's work, but it argues that some parts were missing/some assumptions unwarranted. Even a case of deconfusion as grounded as Pearl's isn't necessarily the right abstraction/deconfusion.

The subtlety I'm trying to point out: actually formally deconfusing is really hard, in part because the formalizations we come up with seem so much more serious and research-like than the fuzzy intuition underlying it all. And so I found it really useful to always emphasize that what we actually care about is the intuition/weird philosophical thinking, and the mathematical models are just tools to get clearer about the former. Which I expect is obvious for you and Luke, but isn't for so many others (me from a couple of years ago included).

Cool, that makes sense!

I abused the hyperbole in that case. What I was pointing out is the impression that old-school MIRI (a lot of the HRAD work) thinks that solving the alignment problem requires deconfusing every related philosophical problem in terms of maths, and then implementing that. Such a view doesn't seem shared by many in the community for a couple of reasons:

I'm still not totally clear here about which parts were "hyperbole" vs. endorsed. You say that people's "impression" was that MIRI wanted to deconfuse "every related philosophical problem", which suggests to me that you think there's some gap between the impression and reality. But then you say "such a view doesn't seem shared by many in the community" (as though the "impression" is an actual past-MIRI-view others rejected, rather than a misunderstanding).

HRAD has always been about deconfusion (though I agree we did a terrible job of articulating this), not about trying to solve all of philosophy or "write down a perfectly aligned AGI from scratch". The spirit wasn't 'we should dutifully work on these problems because they're Important-sounding and Philosophical'; from my perspective, it was more like 'we tried to write down a sketch of how to align an AGI, and immediately these dumb issues with self-reference and counterfactuals and stuff cropped up, so we tried to get those out of the way fast so we could go back to sketching how to aim an AGI at intended targets'. As Eliezer put it,

It was a dumb kind of obstacle to run into—or at least it seemed that way at that time. It seemed like if you could get a textbook from 200 years later, there would be one line of the textbook telling you how to get past that.

From my perspective, the biggest reason MIRI started diversifying approaches away from our traditional focus was shortening timelines, where we still felt that "conceptual" progress was crucial, and still felt that marginal progress on the Agent Foundations directions would be useful; but we now assigned more probability to 'there may not be enough time to finish the core AF stuff', enough to want to put a lot of time into other problems too.

Actually, I'm not sure how to categorize MIRI's work using your conceptual vs. applied division. I'd normally assume "conceptual", because our work is so far away from prosaic alignment; but you also characterize applied alignment research as being about "experimentally testing these ideas [from conceptual alignment]", which sounds like the 2017-initiated lines of research we described in our 2018 update. If someone is running software experiments to test ideas about "Seeking entirely new low-level foundations for optimization" outside the current ML paradigm, where does that fall?

If AGI turns out to be prosaic AGI (which sounds like one of the best bets to make now), then what matters is aligning neural nets, not finding a way to write down a perfectly aligned AGI from scratch

Prosaic AGI alignment and "write down a perfectly aligned AGI from scratch" both seem super doomed to me, compared to approaches that are neither prosaic nor perfectly-neat-and-tidy. Where does research like that fall?

HRAD has always been about deconfusion (though I agree we did a terrible job of articulating this), not about trying to solve all of philosophy or "write down a perfectly aligned AGI from scratch". The spirit wasn't 'we should dutifully work on these problems because they're Important-sounding and Philosophical'; from my perspective, it was more like 'we tried to write down a sketch of how to align an AGI, and immediately these dumb issues with self-reference and counterfactuals and stuff cropped up, so we tried to get those out of the way fast so we could go back to sketching how to aim an AGI at intended targets'.

I think that the issue is that I have a mental model of this process you describe that summarizes it as "you need to solve a lot of philosophical issues for it to work", and so that's what I get by default when I query for that agenda. Still, I always had the impression that this line of work focused more on how to build a perfectly rational AGI than on building an aligned one. Can you explain to me why that's inaccurate?

From my perspective, the biggest reason MIRI started diversifying approaches away from our traditional focus was shortening timelines, where we still felt that "conceptual" progress was crucial, and still felt that marginal progress on the Agent Foundations directions would be useful; but we now assigned more probability to 'there may not be enough time to finish the core AF stuff', enough to want to put a lot of time into other problems too.

Yeah, I think this is a pretty common perspective on that work from outside MIRI. That's my take (that there isn't enough time to solve all of the necessary components) and the one I've seen people use in discussing MIRI multiple times.

Actually, I'm not sure how to categorize MIRI's work using your conceptual vs. applied division. I'd normally assume "conceptual", because our work is so far away from prosaic alignment; but you also characterize applied alignment research as being about "experimentally testing these ideas [from conceptual alignment]", which sounds like the 2017-initiated lines of research we described in our 2018 update. If someone is running software experiments to test ideas about "Seeking entirely new low-level foundations for optimization" outside the current ML paradigm, where does that fall?

A really important point is that the division isn't meant to split researchers themselves but research. So the experiment part would be applied alignment research and the rest conceptual alignment research. What's interesting is that this is a good example of applied alignment research that doesn't have the benefits I mention of more prosaic applied alignment research: being publishable at big ML/AI conferences, being within an accepted paradigm of modern AI...

Prosaic AGI alignment and "write down a perfectly aligned AGI from scratch" both seem super doomed to me, compared to approaches that are neither prosaic nor perfectly-neat-and-tidy. Where does research like that fall?

I would say that the non-prosaic approaches require at least some conceptual alignment research (because the research can't be done fully inside current paradigms of ML and AI), but probably encompass some applied research. Maybe Steve's work is a good example, with a proposed split of two of his posts in this comment.

OK, thanks for the clarifications!

Still, I always had the impression that this line of work focused more on how to build a perfectly rational AGI than on building an aligned one. Can you explain to me why that's inaccurate?

I don't know what you mean by "perfectly rational AGI". (Perfect rationality isn't achievable, rationality-in-general is convergently instrumental, and rationality is insufficient for getting good outcomes. So why would that be the goal?)

I think of the basic case for HRAD this way:

We seem to be pretty confused about a lot of aspects of optimization, reasoning, decision-making, etc. (Embedded Agency is talking about more or less the same set of questions as HRAD, just with subsystem alignment added to the mix.)
If we were less confused, it might be easier to steer toward approaches to AGI that make it easier to do alignment work like 'understand what cognitive work the system is doing internally', 'ensure that none of the system's compute is being used to solve problems we don't understand / didn't intend', 'ensure that the amount of quality-adjusted thinking the system is putting into the task at hand is staying within some bound', etc.

These approaches won't look like decision theory, but being confused about basic ground-floor things like decision theory is a sign that you're likely not in an epistemic position to efficiently find such approaches, much like being confused about how/whether chess is computable is a sign that you're not in a position to efficiently steer toward good chess AI designs.

Maybe what I want is a two-dimensional "prosaic AI vs. novel AI" and "whiteboards vs. code". Then I can more clearly say that I'm pretty far toward 'novel AI' on one dimension (though not as far as I was in 2015), separate from whether I currently think the bigger bottlenecks (now or in the future) are more whiteboard-ish problems vs. more code-ish problems.

What you propose seems valuable, although not an alternative to my distinction IMO. This 2-D grid is more about what people consider as the most promising way of getting aligned AGI and how to get there, whereas my distinction focuses on separating two different types of research which have very different methods, epistemic standards and needs in terms of field-building.

Hmmm, maybe your distinction is something like "conceptual" = "we are explicitly and unabashedly talking about AGI & superintelligence" and "applied" = "we're mainly talking about existing algorithms but hopefully it will scale"??

Agreed that the clusters look like that, but I'm not convinced it's the most relevant point. The difference of methods seems important too.

Taking your work as an example, I would put Value loading in the human brain: a worked example as applied alignment research (where the field you're adapting for alignment is neuroscience/cognitive science) and Thoughts on safety in predictive learning as conceptual alignment research (even though the latter does talk about existing algorithms to a great extent).
