Some Thoughts on Metaphilosophy

Wei Dai

A powerful AI (or human-AI civilization) guided by wrong philosophical ideas would likely cause astronomical (or beyond astronomical) waste. Solving metaphilosophy is one way in which we can hope to avoid this kind of disaster. For my previous thoughts on this topic and further motivation see Metaphilosophical Mysteries, The Argument from Philosophical Difficulty, Three AI Safety Related Ideas, and Two Neglected Problems in Human-AI Safety.

Some interrelated ways of looking at philosophy

Philosophy as answering confusing questions

This was my starting point for thinking about what philosophy is: it's what we do when we try to answer confusing questions, or questions that we don't have any other established methodology for answering. Why do we find some questions confusing, or lack methods for answering them? This leads to my next thought.

Philosophy as ability to generalize / handle distributional shifts

ML systems tend to have a lot of trouble dealing with distributional shifts. (It seems to be a root cause of many AI as well as human safety problems.) But humans seem to have some way of (sometimes) noticing out-of-distribution inputs, and can feel confused instead of just confidently using their existing training to respond to them. This is perhaps most obvious in unfamiliar ethical situations like Torture vs Dust Specks or trying to determine whether our moral circle should include things like insects and RL algorithms. Unlike ML algorithms that extrapolate in an essentially random way when given out-of-distribution inputs, humans can potentially generalize in a principled or correct way, by using philosophical reasoning.

Philosophy as slow but general purpose problem solving

Philosophy may even be a fully general purpose problem solving technique. At least we don't seem to have reason to think that it's not. The problem is that it's painfully slow and resource intensive. Individual humans acting alone seem to have little chance of achieving justifiably high confidence in many philosophical problems even if they devote their entire lives to those problems. Humanity has been collectively trying to solve some philosophical problems for hundreds or even thousands of years, without arriving at final solutions. The slowness of philosophy explains why distributional shifts remain a safety problem for humans, even though we seemingly have a general way of handling them.

Philosophy as meta problem solving

Given that philosophy is extremely slow, it makes sense to use it to solve meta problems (i.e., finding faster ways to handle some class of problems) instead of object level problems. This is exactly what happened historically. Instead of using philosophy to solve individual scientific problems (natural philosophy) we use it to solve science as a methodological problem (philosophy of science). Instead of using philosophy to solve individual math problems, we use it to solve logic and philosophy of math. Instead of using philosophy to solve individual decision problems, we use it to solve decision theory. Instead of using philosophy to solve individual philosophical problems, we can try to use it to solve metaphilosophy.

Philosophy as "high computational complexity class"

If philosophy can solve any problem within a very large class, then it must have a "computational complexity class" that's as high as any given problem within that class. Computational complexity can be measured in various ways, such as time and space complexity (on various actual machines or models of computation), whether and how high a problem is in the polynomial hierarchy, etc. "Computational complexity" of human problems can also be measured in various ways, such as how long it would take to solve a given problem using a specific human, group of humans, or model of human organizations or civilization, and whether and how many rounds of DEBATE would be sufficient to solve that problem either theoretically (given infinite computing power) or in practice.

The point here is that no matter how we measure complexity, it seems likely that philosophy would have a "high computational complexity class" according to that measure.

Philosophy as interminable debate

The visible aspects of philosophy (as traditionally done) seem to resemble an endless (both in clock time and in the number of rounds) game of debate, where people propose new ideas, arguments, counterarguments, counter-counterarguments, and so on, and at the same time try to judge proposed solutions based on these ideas and arguments. People sometimes complain about the interminable nature of philosophical discussions, but that now seems understandable if philosophy is a "high computational complexity" method of general purpose problem solving.

In a sense, philosophy is the opposite of math: whereas in math any debate can be settled by producing a proof (hence analogous to the complexity class NP), though in practice a couple more rounds may be needed for people to find or fix flaws in the proof, potentially no fixed number of rounds of debate (or DEBATE) is enough to settle all philosophical problems.

Philosophy as Jürgen Schmidhuber's General TM

Unlike a traditional Turing Machine, a General TM or GTM may edit its previous outputs, and can be considered to solve a problem even if it never terminates, as long as it stops editing its output after a finite number of edits and the final output is the correct solution. So if a GTM solves a certain problem, you know that it will eventually converge to the right solution, but you have no idea when, or if what's on its output tape at any given moment is the right solution. This seems a lot like philosophy, where people can keep changing their minds (or adjust their credences) based on an endless stream of new ideas, arguments, counterarguments, and so on, and you never really know when you've arrived at a correct answer.
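The convergence behavior described above can be illustrated with a small Python sketch. This is not Schmidhuber's formal definition, only a toy under stated assumptions: the "machine" is a generator that never halts, its output tape holds a single boolean, and the names output_tape, reaches_one_within, and observe are hypothetical helpers introduced here for illustration. The question being answered (does the Collatz sequence from n reach 1?) is just a convenient example of something that can be confirmed at some unknown step.

```python
from itertools import islice
from typing import Callable, Iterator, List


def output_tape(evidence_by_step: Callable[[int], bool]) -> Iterator[bool]:
    """Yield the tape contents after each step of a never-halting machine.

    The tape starts as False and is overwritten with True the first time
    evidence_by_step(t) holds. After that the machine keeps running but
    makes no further edits.
    """
    answer = False
    t = 0
    while True:
        if not answer and evidence_by_step(t):
            answer = True  # the single edit to previously written output
        yield answer
        t += 1


def reaches_one_within(n: int, steps: int) -> bool:
    """True if the Collatz sequence from n hits 1 in at most `steps` steps."""
    for _ in range(steps):
        if n == 1:
            return True
        n = n // 2 if n % 2 == 0 else 3 * n + 1
    return n == 1


def observe(n: int, horizon: int) -> List[bool]:
    """What an outside observer sees on the tape during the first `horizon` steps."""
    return list(islice(output_tape(lambda t: reaches_one_within(n, t)), horizon))
```

For n = 6 the tape reads False for the first eight steps and True from then on. An observer who stops watching at step 5 cannot tell "the answer is False" apart from "the edit has not happened yet", which is the property the analogy with philosophy relies on.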

What to do until we solve metaphilosophy?

Protect the trajectory?

What would you do if you had a GTM that could solve a bunch of really important problems, and that was the only method you had of solving them? You'd try to reverse-engineer it and make a bunch of copies. But if you couldn't do that, then you'd want to put layers and layers of protection around it. Applied to philosophy, this line of thought seems to lead to the familiar ideas of using global coordination (or a decisive strategic advantage) to stop technological progress, or having AIs derive their terminal goals from simulated humans who live in a safe virtual environment.

Replicate the trajectory with ML?

Another idea is to try to build a good enough approximation of the GTM by training ML on its observable behavior (including whatever work tapes you have read access to). But there are two problems with this: 1. This is really hard or impossible to do if the GTM has internal state that you can't observe. And 2. If you haven't already reverse engineered the GTM, there's no good way to know that you've built a good enough approximation, i.e., to know that the ML model won't end up converging to answers that are different from the GTM.

A three part model of philosophical reasoning

It may be easier to understand the difficulty of capturing philosophical reasoning with ML by considering a more concrete model. I suggest we can divide it into three parts as follows:

A. Propose new ideas/arguments/counterarguments/etc. according to some (implicit) distribution.
B. Evaluate existing ideas/arguments/counterarguments/etc.
C. Based on past ideas/arguments/counterarguments/etc., update some hidden state that changes how one does A and B.

It's tempting to think that building an approximation of B using ML perhaps isn't too difficult, and then we can just search for the "best" ideas/arguments/counterarguments/etc. using standard optimization algorithms (maybe with some safety precautions like trying to avoid adversarial examples for the learned model). There's some chance this could work out well, but without having a deeper understanding of metaphilosophy, I don't see how we can be confident that throwing out A and C won't lead to disaster, especially in the long run. But A and C seem very hard or impossible for ML to capture (A due to paucity of training data, and C due to the unobservable state).
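A toy Python sketch may make the worry about dropping A and C more concrete. Everything here is an assumption made for illustration: the Reasoner class, its integer "ideas", and the rule that the hidden standard rises with each idea seen are hypothetical and say nothing about how actual philosophical reasoning works. The sketch only shows that an evaluator copied at one moment (an approximation of B alone) can disagree with the live evaluator once proposals and hidden-state updates keep running.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, List


@dataclass
class Reasoner:
    """Toy stand-in for the three-part model; all names are hypothetical."""
    standard: int = 0  # hidden state, not observable from outside
    seen: List[int] = field(default_factory=list)

    def propose(self) -> int:
        # Part A: the proposal depends on the current hidden state.
        return self.standard + 1

    def evaluate(self, idea: int) -> int:
        # Part B: higher is better; ideas near the current standard score best.
        return -abs(idea - self.standard)

    def update(self, idea: int) -> None:
        # Part C: past ideas shift the hidden state used by A and B.
        self.seen.append(idea)
        self.standard = max(self.standard, idea)


def run_full_loop(rounds: int) -> Reasoner:
    """Run A and C together for a number of rounds."""
    reasoner = Reasoner()
    for _ in range(rounds):
        reasoner.update(reasoner.propose())
    return reasoner


def best_under(evaluator: Callable[[int], int], candidates: Iterable[int]) -> int:
    """Search for the top-scoring candidate under a fixed evaluator."""
    return max(candidates, key=evaluator)


frozen = Reasoner()        # snapshot of B taken before any updates
live = run_full_loop(5)    # the same reasoner after five rounds of A and C
```

Searching over range(10) with frozen.evaluate picks 0, while the same search with live.evaluate picks 5. The frozen copy is a faithful approximation of B at the time it was taken and still ends up optimizing for the wrong thing.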

Is there a way around this difficulty? What else can we do in the absence of a full white-box solution to metaphilosophy?

I guess it feels like I don't know how we could know that we're in the position that we've "solved" meta-philosophy. It feels like the thing we could do is build a set of better and better models of philosophy and check their results against held-out human reasoning and against each other.

I also don't think we know how to specify a ground truth reasoning process that we could try to protect and run forever which we could be completely confident would come up with the right outcome (where something like HCH is a good candidate but potentially with bugs/subtleties that need to be worked out).

I feel like I have some (not well justified and possibly motivated) optimism that this process yields something good fairly early on. We could gain confidence that we are in this world if we build a bunch of better and better models of meta-philosophy and observe at some point the models continue agreeing with each other as we improve them, and that they agree with various instantiations of protected human reasoning that we run. If we are in this world, the thing we need to do is just spend some time building a variety of these kinds of models and produce an action that looks good to most of them. (Where agreement is not "comes up with the same answer" but more like "comes up with an answer that other models think is okay and not disastrous to accept").
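The weaker notion of agreement in the parenthetical can be sketched in Python. This is a minimal illustration under assumptions, not a proposal for how such models would really be built: Model, pick_action, the quorum parameter, and the string-valued actions are all hypothetical names introduced here.

```python
from typing import Callable, List, NamedTuple, Optional


class Model(NamedTuple):
    """A hypothetical model of philosophical reasoning."""
    name: str
    propose: Callable[[], str]          # the action this model would pick
    acceptable: Callable[[str], bool]   # whether it finds an action okay to accept


def pick_action(models: List[Model], quorum: float = 0.5) -> Optional[str]:
    """Return the proposed action accepted by the most models.

    An action qualifies only if strictly more than `quorum` of all models
    accept it; if none qualifies, return None.
    """
    best_action: Optional[str] = None
    best_votes = 0
    for model in models:
        action = model.propose()
        votes = sum(1 for other in models if other.acceptable(action))
        if votes > quorum * len(models) and votes > best_votes:
            best_action, best_votes = action, votes
    return best_action


def demo_models() -> List[Model]:
    """Three models that all propose different actions."""
    return [
        Model("a", lambda: "x", lambda act: act in {"x", "y"}),
        Model("b", lambda: "y", lambda act: act in {"y"}),
        Model("c", lambda: "z", lambda act: act in {"y", "z"}),
    ]
```

In demo_models no two models propose the same action, yet pick_action returns "y" because every model finds it acceptable. That is agreement in the sense of "an answer other models think is okay", not "the same answer".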

Do you think this would lead to "good outcomes"? Do you think some version of this approach could be satisfactory for solving the problems in Two Neglected Problems in Human-AI Safety?

Do you think there's a different kind of thing that we would need to do to "solve metaphilosophy"? Or do you think that working on "solving metaphilosophy" roughly cashes out as "work on coming up with better and better models of philosophy in the model I've described here"?

I guess it feels like I don’t know how we could know that we’re in the position that we’ve “solved” meta-philosophy.

What I imagine is reaching a level of understanding of what we’re really doing (or what we should be doing) when we “do philosophy”, on par with our current understanding of what “doing math” or “doing science” consists of, or ideally a better level of understanding than that. (See Apparent Unformalizability of “Actual” Induction for one issue with our current understanding of “doing science”.)

I also don’t think we know how to specify a ground truth reasoning process that we could try to protect and run forever which we could be completely confident would come up with the right outcome (where something like HCH is a good candidate but potentially with bugs/subtleties that need to be worked out).

Here I’m imagining something like putting a group of the best AI researchers, philosophers, etc. in some safe and productive environment (which includes figuring out the right rules of social interactions), where they can choose to delegate further to other reasoning processes, but don’t face any time pressure to do so. Obviously I don’t know how to specify this in terms of having all the details worked out, but that does not seem like a hugely difficult problem to solve, so I wonder what you mean/imply by “don’t think we know how”?

It feels like the thing we could do is build a set of better and better models of philosophy and check their results against held-out human reasoning and against each other.

If that’s all we do, it seems like it would be pretty easy to miss some error in the models, because we didn’t know that we should test for it. For example there could be entire classes of philosophical problems that the models will fail on, which we won’t know because we won’t have realized yet that those classes of problems even exist.

Do you think this would lead to “good outcomes”? Do you think some version of this approach could be satisfactory for solving the problems in Two Neglected Problems in Human-AI Safety?

It could, but it seems much riskier than either of the approaches I described above.

Do you think there’s a different kind of thing that we would need to do to “solve metaphilosophy”? Or do you think that working on “solving metaphilosophy” roughly cashes out as “work on coming up with better and better models of philosophy in the model I’ve described here”?

Hopefully I answered these sufficiently above. Let me know if there’s anything I can clear up further.

All else equal, I prefer an AI which is not capable of philosophy, as I am afraid of the completely alien conclusions it could come to (e.g. that insects are more important than humans).

Moreover, I am skeptical that going to the meta level simplifies the problem to the point that it will be solvable by humans (the same goes for meta-ethics and the theory of human values). For example, if someone says that he is not able to understand math, but will instead work on meta-mathematical problems, we would be skeptical about his ability to contribute. Why would the meta level be simpler?

Moreover, I am skeptical that going to the meta level simplifies the problem to the point that it will be solvable by humans (the same goes for meta-ethics and the theory of human values).

This is also my reason for being pessimistic about solving metaphilosophy before a good number of object-level philosophical problems have been solved (e.g. in decision theory, ontology/metaphysics, and epistemology). If we imagine being in a state where we believe running computation X would solve hard philosophical problem Y, then it would seem that we already have a great deal of philosophical knowledge about Y, or a more general class of problems that includes Y.

More generally, we could look at the historical difficulty of solving a problem vs. the difficulty of automating it. For example: the difficulty of walking vs. the difficulty of programming a robot to walk; the difficulty of adding numbers vs. the difficulty of specifying an addition algorithm; the difficulty of discovering electricity vs. the difficulty of solving philosophy of science to the point where it's clear how a reasoner could have discovered (and been confident in) electricity; and so on.

The plausible story I have that looks most optimistic for metaphilosophy looks something like:

Some philosophical community makes large progress on a bunch of philosophical problems, at a high level of technical sophistication.
As part of their work, they discover some "generators" that generate a bunch of the object-level solutions when translated across domains; these generators might involve e.g. translating a philosophical problem to one of a number of standard forms and then solving the standard form.
They also find philosophical reasons to believe that these generators will generate good object-level solutions to new problems, not just the ones that have already been studied.
These generators would then constitute a solution to metaphilosophy.

I think our positions on this are pretty close, but I may put a bit more weight on other "plausible stories" for solving metaphilosophy relative to your "plausible story". (I'm not sure if overall I'm more or less optimistic than you are.)

If we imagine being in a state where we believe running computation X would solve hard philosophical problem Y, then it would seem that we already have a great deal of philosophical knowledge about Y, or a more general class of problems that includes Y.

It seems quite possible that understanding the general class of problems that includes Y is easier than understanding Y itself, and that allows us to find a computation X that would solve Y without much understanding of Y itself. As an analogy, suppose Y is some complex decision problem that we have little understanding of, and X is an AI that is programmed with a good decision theory.

More generally, we could look at the historical difficulty of solving a problem vs. the difficulty of automating it. For example: the difficulty of walking vs. the difficulty of programming a robot to walk;

This does not seem like a very strong argument for your position. My suggestion in the OP is that humans already know the equivalent of "walking" (i.e., doing philosophy), we're just doing it very slowly. Given this, your analogies don't seem very conclusive about the difficulty of solving metaphilosophy or whether we have to make a bunch more progress on object-level philosophical problems before we can solve metaphilosophy.

Moreover, I am skeptical that going to the meta level simplifies the problem to the point that it will be solvable by humans

If I gave the impression in this post that I expect metaphilosophy to be solved before someone builds an AGI, that was far from my intentions. I think this is a small-chance-of-high-return kind of situation, plus I think someone has to try to attack the problem if only to generate evidence that it really is a hard problem, otherwise I don't know how to convince people to adopt costly social solutions like stopping technological progress. (And actually I don't expect the evidence to be highly persuasive either, so this amounts to just another small chance of high return.)

What I wrote in an earlier post still describes my overall position:

There is no strong empirical evidence that solving metaphilosophy is superhumanly difficult, simply because not many people have attempted to solve it. But I don’t think that a reasonable prior combined with what evidence we do have (i.e., absence of visible progress or clear hints as to how to proceed) gives much hope for optimism either.

having AIs derive their terminal goals from simulated humans who live in a safe virtual environment.

There has been some subsequent discussion (expressing concern/doubt) about this at https://www.lesswrong.com/posts/7jSvfeyh8ogu8GcE6/decoupling-deliberation-from-competition?commentId=bSNhJ89XFJxwBoe5e