Similarly, suppose you have two deontological values which trade off against each other. Before systematization, the question of “what’s the right way to handle cases where they conflict” is not really well-defined; you have no procedure for doing so.

Why is this a problem that calls out to be fixed (hence leading to systematization)? Why not just stick with the default of "go with whichever value/preference/intuition feels stronger in the moment"? People do that unthinkingly all the time, right? (I have my own thoughts on this, but I'm curious whether you agree with me or what your own thinking is.)

And that’s why the “mind itself wants to do this” does make sense, because it’s reasonable to assume that highly capable cognitive architectures will have ways of identifying aspects of their thinking that “don’t make sense” and correcting them.

How would you cash out "don't make sense" here?

which makes sense, since in some sense systematizing from our own welfare to others’ welfare is the whole foundation of morality

This seems wrong to me. I think concern for others' welfare comes from being directly taught/trained as a child to have concern for others, and then later reinforced by social rewards/punishments as one succeeds or fails at various social games. This situation could have come about without anyone "systematizing from our own welfare", just by cultural (and/or genetic) variation and selection. I think value systematizing more plausibly comes into play with things like widening one's circle of concern beyond one's family/tribe/community.

What you're trying to explain with this statement, i.e., "Morality seems like the domain where humans have the strongest instinct to systematize our preferences", seems better explained by what I wrote in this comment.

This reminds me that I have an old post asking Why Do We Engage in Moral Simplification? (What I called "moral simplification" seems very similar to what you call "value systematization".) I guess my post didn't really fully answer this question, and you don't seem to talk much about the "why" either.

Here are some ideas after thinking about it for a while. (Morality is Scary is useful background here, if you haven't read it already.)

  1. Wanting to use explicit reasoning with our values (e.g., to make decisions), which requires making our values explicit, i.e., defining them symbolically, which necessitates simplification given limitations of human symbolic reasoning.
  2. Moral philosophy as a status game, where moral philosophers are implicitly scored on the moral theories they come up with by simplicity and by how many human moral intuitions they are consistent with.
  3. Everyday signaling games, where people (in part) compete to show that they have community-approved or locally popular values. Making values legible and not too complex facilitates playing these games.
  4. Instinctively transferring our intuitions/preferences for simplicity from "belief systematization" where they work really well, into a different domain (values) where they may or may not still make sense.

(Not sure how any of this applies to AI. Will have to think more about that.)

if the biggest impact is the very slow thinking from a very small group of people who care about them, then I think that’s a very small impact

I guess from my perspective, the biggest impact is the possibility that the idea of better preparing for these risks becomes a lot more popular. An analogy with Bitcoin comes to mind, where the idea of cryptography-based distributed money languished for many years, known only to a tiny community, and then was suddenly everywhere. An AI pause would provide more time for something like that to happen. And if the idea of better preparing for these risks is actually a good one (as you seem to think), there's no reason why it couldn't spread beyond a very small group, or why that would be very unlikely. Do you agree?

In My views on “doom” you wrote:

Probability of messing it up in some other way during a period of accelerated technological change (e.g. driving ourselves crazy, creating a permanent dystopia, making unwise commitments…): 15%

Do you think these risks can also be reduced by 10x by a "very good RSP"? If yes, how or by what kinds of policies? If not, isn't "cut risk dramatically [...] perhaps a 10x reduction" kind of misleading?

It concerns me that none of the RSP documents or discussions I've seen talked about these particular risks, or "unknown unknowns" (other risks that we haven't thought of yet).

I'm also bummed that "AI pause" people don't talk about these risks either, but at least an AI pause would implicitly address these risks by default, whereas RSPs would not.

To be clear, I definitely agree with this. My position is not “RSPs are all we need”, “pauses are bad”, “pause advocacy is bad”, etc.—my position is that getting good RSPs is an effective way to implement a pause: i.e. “RSPs are pauses done right.”

Some feedback on this: my expectation upon seeing your title was that you would argue, or that you implicitly believe, that RSPs are better than other current "pause" attempts/policies/ideas. I think this expectation came from the common usage of the phrase "done right" to mean that other people are doing it wrong or at least doing it suboptimally.

But, the gist of your post seems to be: "Since coming up with UDT, we ran into these problems, made no progress, and are apparently at a dead end. Therefore, UDT might have been the wrong turn entirely."

This is a bit stronger than how I would phrase it, but basically yes.

On the other hand, my view is: Since coming up with those problems, we made a lot of progress on agent theory within the LTA

I tend to be pretty skeptical of new ideas. (This backfired spectacularly once, when I didn't pay much attention to Satoshi when he contacted me about Bitcoin, but I think in general it has served me well.) My experience with philosophical questions is that even when some approach looks a stone's throw away from a final solution to some problem, a bunch of new problems pop up and show that we're still quite far away. With an approach that is still as early as yours, I just think there's quite a good chance it doesn't work out in the end, or gets stuck somewhere on a hard problem. (Also, some people who have dug into the details don't seem as optimistic that it is the right approach.) So I'm reluctant to decrease my probability of "UDT was a wrong turn" too much based on it.

The rest of your discussion about 2TDT-1CDT seems plausible to me, although of course it depends on whether the math works out, on doing something about monotonicity, and on a solution to the problem of how to choose one's IBH prior. (If the solution were something like "it's subjective/arbitrary", that would be pretty unsatisfying from my perspective.)

Do you think part of it might be that even people with graduate philosophy educations are too prone to being wedded to their own ideas, or don't like to poke holes in them as much as they should? Because part of what contributes to my wanting to go more meta is being dissatisfied with my own object-level solutions and finding more and more open problems that I don't know how to solve. I haven't read much academic philosophy literature, but did read some anthropic reasoning and decision theory literature earlier, and the impression I got is that most of the authors weren't trying that hard to poke holes in their own ideas.

I don't understand your ideas in detail (am interested but don't have the time/ability/inclination to dig into the mathematical details), but from the informal writeups/reviews/critiques I've seen of your overall approach, as well as my sense from reading this comment of how far away you are from a full solution to the problems I listed in the OP, I'm still comfortable sticking with "most are wide open". :)

On the object level, maybe we can just focus on Problem 4 for now. What do you think actually happens in a 2IBH-1CDT game? Presumably CDT still plays D, and what do the IBH agents do? And how does that imply that the puzzle is resolved?

As a reminder, the puzzle I see is that this problem shows that a CDT agent doesn't necessarily want to become more UDT-like, and for seemingly good reason, so on what basis can we say that UDT is a clear advancement in decision theory? If CDT agents similarly don't want to become more IBH-like, isn't there the same puzzle? (Or do they?) This seems different from the playing chicken with a rock example, because a rock is not a decision theory so that example doesn't seem to offer the same puzzle.

ETA: Oh, I think you're saying that the CDT agent could turn into an IBH agent but with a different prior from the other IBH agents, one that ends up allowing it to still play D while the other two still play C, so it's not made worse off by switching to IBH. Can you walk through this in more detail? How does the CDT agent choose what prior to use when switching to IBH, and how do the different priors actually imply a CCD outcome in the end?
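For readers less familiar with the setup, here is a minimal sketch of the payoff structure behind the puzzle. Everything specific in it is an assumption made for illustration and is not taken from the discussion: the game is modeled as three agents who each play a one-shot Prisoner's Dilemma against each of the other two, the payoff numbers (5, 3, 1, 0) are the usual textbook values, and `totals` and `linked_choice` are hypothetical helper names. It says nothing about IBH or priors; it only shows why a defector facing two correlated cooperators has no obvious incentive to join them.

```python
from itertools import combinations

# Assumed textbook Prisoner's Dilemma payoffs: (row player, column player).
PAYOFF = {
    ("C", "C"): (3, 3),
    ("C", "D"): (0, 5),
    ("D", "C"): (5, 0),
    ("D", "D"): (1, 1),
}


def totals(profile):
    """Sum each agent's payoff over all pairwise games in the profile."""
    scores = [0] * len(profile)
    for i, j in combinations(range(len(profile)), 2):
        a, b = PAYOFF[(profile[i], profile[j])]
        scores[i] += a
        scores[j] += b
    return scores


def linked_choice(third_move="D"):
    """Move picked by two agents whose decisions are assumed to be
    perfectly correlated (they choose one move jointly), given the
    third agent's move. This is a stand-in for 'UDT-like' reasoning,
    not an implementation of any particular decision theory."""
    return max("CD", key=lambda m: totals((m, m, third_move))[0])


ccd = totals(("C", "C", "D"))  # two correlated cooperators, one defector
ccc = totals(("C", "C", "C"))  # the defector joins the cooperators
```

Under these assumed numbers the two correlated agents still prefer C against a defector (3 each, versus 2 each if they both defect), while the defector gets 10 in the CCD outcome versus 6 if it joins them and everyone cooperates. That gap is the sense in which the defecting agent "doesn't necessarily want" to switch.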

Even items 1, 3, 4, and 6 are covered by your research agenda? If so, can you quickly sketch what you expect the solutions to look like?
