If anyone wants to have a voice chat with me about a topic that I'm interested in (see my recent post/comment history to get a sense), please contact me via PM.
My main "claims to fame":
You may have missed my footnote, where I addressed this?
To preempt a possible misunderstanding, I don't mean "don't try to think up new metaethical ideas", but instead "don't be so confident in your ideas that you'd be willing to deploy them in a highly consequential way, or build highly consequential systems that depend on them in a crucial way". Similarly "don't roll your own crypto" doesn't mean never try to invent new cryptography, but rather don't deploy it unless there has been extensive review, and consensus that it is likely to be secure.
By "metaethics," do you mean something like "a theory of how humans should think about their values"?
I feel like I've seen that kind of usage on LW a bunch, but it's atypical. In philosophy, "metaethics" has a thinner, less ambitious interpretation of answering something like, "What even are values, are they stance-independent, yes/no?"
By "metaethics" I mean "the nature of values/morality", which I think is how it's used in academic philosophy. Of course the nature of values/morality has a strong influence on "how humans should think about their values" so these are pretty closely connected, but definitionally I do try to use it the same way as in philosophy, to minimize confusion. This post can give you a better idea of how I typically use it. (But as you'll see below, this is actually not crucial for understanding my post.)
Anyway, I'm asking about this because I found the following paragraph hard to understand:
So in the paragraph that you quoted (and the rest of the post), I was actually talking about philosophical fields/ideas in general, not just metaethics. While my title has "metaethics" in it, the text of the post talks generically about any "philosophical questions" that are relevant for AI x-safety. If we substitute metaethics (in my or the academic sense) into my post, then you can derive that I mean something like this:
Different metaethics (ideas/theories about the nature of values/morality) have different implications for what AI designs or alignment approaches are safe, and if you design an AI assuming that one metaethical theory is true, it could be disastrous if a different metaethical theory actually turns out to be true.
For example, if moral realism is true, then aligning the AI to human values would be pointless. What you really need to do is design the AI to be able to determine and follow objective moral truths. But this approach would be disastrous if moral realism is actually false. Similarly, if moral noncognitivism is true, then humans can't be wrong about their values, which implies that "how humans should think about their values" is of no importance. If you design AI under this assumption, that would be disastrous if humans actually can be wrong about their values and really need AIs to help them think about their values and avoid moral errors.
I think in practice a lot of alignment researchers may not even have explicit metaethical theories in mind, but are implicitly making certain metaethical assumptions in their AI design or alignment approach. For example they may largely ignore the question of how humans should think about their values or how AIs should help humans think about their values, thus essentially baking in an assumption of noncognitivism.
You're conceding that morality/values might be (to some degree) subjective, but you're cautioning people from having strong views about "metaethics," which you take to be the question of not just what morality/values even are, but also a bit more ambitiously: how to best reason about them and how to (e.g.) have AI help us think about what we'd want for ourselves and others.
If we substitute "how humans/AIs should reason about values" (which I'm not sure has a name in academic philosophy but I think does fall under metaphilosophy, which covers all philosophical reasoning) into the post, then your conclusion here falls out, so yes, it's also a valid interpretation of what I'm trying to convey.
I hope that makes everything a bit clearer?
A lab leader who’s concerned enough to slow down will be pressured by investors to speed back up, or get replaced, or get outcompeted. Really you need to convince the whole lab and its investors. And you need to be more convincing than the magic of the market!
This seems to imply that lab leaders would be easier to convince if there were no investors and no markets, in other words, if they had more concentrated power.
If you spread out the power of AI more, won't all those decentralized nodes of spread out AI power still have to compete with each other in markets? If market pressures are the core problem, how does decentralization solve that?
I'm concerned that your proposed solution attacks "concentration of power" when the real problem you've identified is more like market dynamics. If so, it could fail to solve the problem or make it even worse.
My own perspective is that markets are a definite problem, and concentration of power per se is more ambiguous (I'm not sure if it's good or bad). To solve AI x-safety we basically have to bypass or override markets somehow, e.g., through international agreements and government regulations/bans.
I hinted at it with "prior efforts/history", but to spell it out more: a lot more effort seems to have gone into metaethics in the past, so there's less likely to be some kind of low-hanging fruit in idea space that, once picked, everyone will agree is the right solution.
The problem is that we can't. The closest thing we have is instead a collection of mutually exclusive ideas where at most one (possibly none) is correct, and we have no consensus as to which.
#2 feels like it's injecting some frame that's a bit weird to inject here (don't roll your own metaethics... but rolling your own metaphilosophy is okay?)
Maybe you missed my footnote?
To preempt a possible misunderstanding, I don't mean "don't try to think up new metaethical ideas", but instead "don't be so confident in your ideas that you'd be willing to deploy them in a highly consequential way, or build highly consequential systems that depend on them in a crucial way". Similarly "don't roll your own crypto" doesn't mean never try to invent new cryptography, but rather don't deploy it unless there has been extensive review, and consensus that it is likely to be secure.
and/or this part of my answer (emphasis added):
Try to solve metaphilosophy, where potentially someone could make a breakthrough that everyone can agree is correct (after extensive review)
But also, I'm suddenly confused about who this post is trying to warn. Is it more like labs, or more like EA-ish people doing a wider variety of meta-work?
I think I mostly had alignment researchers (in and out of labs) as the target audience in mind, but it does seem relevant to others so perhaps I should expand the target audience?
"More research needed" but here are some ideas to start with:
An update on this 2010 position of mine, which seems to have become conventional wisdom on LW:
In my posts, I've argued that indexical uncertainty like this shouldn't be represented using probabilities. Instead, I suggest that you consider yourself to be all of the many copies of you, i.e., both the ones in the ancestor simulations and the one in 2010, making decisions for all of them. Depending on your preferences, you might consider the consequences of the decisions of the copy in 2010 to be the most important and far-reaching, and therefore act mostly as if that was the only copy. [Emphasis added]
In the subsequent 15 years, I've upweighted influencing the multiverse through my copies in simulations, relative to base universes, to where they're about equally important in my mind. For 4 reasons:
Concretely, this mostly cashes out to me thinking and writing with potential simulators in mind as an additional audience, hoping my ideas might benefit or interest some of them even if they end up largely ignored in this reality.
normally when I think about this problem I resolve it as "what narrow capabilities can we build that are helpful 'to the workflow' of people solving illegible problems, that aren't particularly bad from a capabilities standpoint".
Do you have any writings about this, e.g., examples of what this line of thought led to?
To give a direct answer, not a lot come to mind outside of the MIRI cluster. I think the Center on Long-Term Risk cluster did a bunch of work on decision theory and acausal trade, but it was mostly after I had moved on to other topics, so I'm not sure how much of it constituted progress. Christiano acknowledged some of the problems I pointed out with IDA and came up with some attempted solutions, which I'm not convinced really work.
However, in my previous post, Legible vs. Illegible AI Safety Problems, I explained my latest thinking: the most important motivation for legibilizing AI safety problems isn't to induce faster progress on them as object-level problems, but to decrease the probability that AGI/ASI is developed or deployed while key decision makers (e.g., company leaders, government officials, voters) are unaware of some such problems or don't understand their importance. So a better metric for measuring the success of this strategy is how much increased legibility has been achieved among this wider audience, assuming "how successful has it been" is the main motivation behind your question.
On that front, I think the main weakness of my approach has been its limited reach beyond LW. If someone with better public communications skills were convinced of the value of legibilizing these lesser known problems, that could potentially greatly boost the effectiveness of this strategy.
(Of course, if I've inferred a wrong motivation for your question, please let me know!)