Where I agree and disagree with Eliezer

I doubt that’s the primary component that makes the difference. Other countries which did mostly sensible things early are eg Australia, Czechia, Vietnam, New Zealand, Iceland.

What do you think is the primary component? I seem to recall reading somewhere that previous experience with SARS makes a big difference. I guess my more general point is that if the good COVID responses can mostly be explained by factors that predictably won't be available to the median AI risk response, then the variance in COVID response doesn't help to give much hope for a good AI risk response.

My main claim isn’t about what a median response would be, but something like “difference between median early covid governmental response and actually good early covid response was something between 1 and 2 sigma; this suggests bad response isn’t over-determined, and sensible responses are within human reach”.

This seems to depend on response to AI risk being of similar difficulty as response to COVID. I think people who updated towards "bad response to AI risk is overdetermined" did so partly on the basis that the AI risk challenge is much harder. (In other words, if the median government has done this badly against COVID, what chance does it have against something much harder?) I wrote down a list of things that make COVID an easier challenge, which I now realize may be a bit of a tangent if that's not the main thing you want to argue about, but I'll put it down here anyway so as to not waste it.

  1. it's relatively intuitive for humans to think about the mechanics of the danger and possible countermeasures
  2. previous human experiences with pandemics, including very similar ones like SARS
  3. there are very effective countermeasures that are much easier / less costly than comparable countermeasures for AI risk, such as distributing high quality masks to everyone and sealing one's borders
  4. COVID isn't agenty and can't fight back intelligently
  5. potentially divisive issues in AI risk response seem to be a strict superset of politically divisive issues in COVID response (additional issues include: how to weigh very long term benefits against short term costs, the sentience, moral worth, and rights of AIs, what kind of values do we want AIs to have, and/or who should have control/access to AI)
Where I agree and disagree with Eliezer

For people who doubt this, I’d point to variance in initial governmental-level response to COVID19, which ranged from “highly incompetent” (eg. early US) to “quite competent” (eg Taiwan).

Seems worth noting that Taiwan is an outlier in terms of average IQ of its population. Given this, I find it pretty unlikely that typical governmental response to AI would be more akin to Taiwan than the US.

AGI Ruin: A List of Lethalities

I think until recently, I've been consistently more pessimistic than Eliezer about AI existential safety. Here's a 2004 SL4 post for example where I tried to argue against MIRI (SIAI at the time) trying to build a safe AI (and again in 2011). I've made my own list of sources of AI risk that's somewhat similar to this list. But it seems to me that there are still various "outs" from certain doom, such that my probability of a good outcome is closer to 20% (maybe a range of 10-30% depending on my mood) than 1%.

  1. Human thought partially exposes only a partially scrutable outer surface layer. Words only trace our real thoughts. Words are not an AGI-complete data representation in its native style. The underparts of human thought are not exposed for direct imitation learning and can't be put in any dataset. This makes it hard and probably impossible to train a powerful system entirely on imitation of human words or other human-legible contents, which are only impoverished subsystems of human thoughts; unless that system is powerful enough to contain inner intelligences figuring out the humans, and at that point it is no longer really working as imitative human thought.

One of the biggest "outs" I see is that it turns out to be not that hard "to train a powerful system entirely on imitation of human words or other human-legible contents", we (e.g., a relatively responsible AI lab) train such a system and then use it to differentially accelerate AI safety research. I definitely think that it's very risky to rely on such black-box human imitations for existential safety, and that a competent civilization would be pursuing other plans where they can end up with greater certainty of success, but it seems there's something like a 20% chance that it just works out anyway.

To explain my thinking a bit more, human children have to learn how to think human thoughts through "imitation of human words or other human-legible contents". It's possible that they can only do this successfully because their genes contain certain key ingredients that enable human thinking, but it also seems possible that children are just implementations of some generic imitation learning algorithm, so our artificial learning algorithms (once they become advanced/powerful enough) won't be worse at learning to think like humans. I don't know how to rule out the latter possibility with very high confidence. Eliezer, if you do, can you please explain this more?

Godzilla Strategies

I was going to make a comment to the effect that humans are already a species of Godzilla (humans aren't safe, human morality is scary, yada yada), only to find you making the same analogy, but with an optimistic slant. :)

[Link] A minimal viable product for alignment

The example of cryptography was mainly intended to make the point that humans are by default too credulous when it comes to informal arguments. But consider your statement:

It feels to me like there’s basically no question that recognizing good cryptosystems is easier than generating them.

Consider some cryptosystem widely considered to be secure, like AES. How much time did humanity spend on learning / figuring out how to recognize good cryptosystems (e.g. finding all the attacks one has to worry about, like differential cryptanalysis), versus specifically generating AES with the background knowledge in mind? Maybe the latter is on the order of 10% of the former?

Then consider that we don't actually know that AES is secure, because we don't know all the possible attacks and we don't know how to prove it secure; i.e., we don't know how to recognize a good cryptosystem. Suppose one day we figure that out; wouldn't finding an actually good cryptosystem be trivial at that point, compared to all the previous effort?

Some of your other points are valid, I think, but cryptography is just easier than alignment (don't have time to say more as my flight is about to take off), and philosophy is perhaps a better analogy for the more general point.

[Link] A minimal viable product for alignment

If it turns out that evaluation of alignment proposals is not easier than generation, we’re in pretty big trouble because we’ll struggle to convince others that any good alignment proposals humans come up with are worth implementing.

But this is pretty likely the case, isn't it? Actually, I think by default the situation will be the opposite: it will be too easy to convince others that some alignment proposal is worth implementing, because humans are in general too easily convinced by informal arguments that look good but contain hidden flaws (and formalizing the arguments is both very difficult and doesn't help much, because you're still depending on informal arguments for why the formalized theoretical concepts correspond well enough to the pre-theoretical concepts that we actually care about). Look at the history of philosophy, or cryptography, if you doubt this.

But suppose we're able to convince people to distrust their intuitive sense of how good an argument is, and to keep looking for hidden flaws and counterarguments (which might have their own hidden flaws, and so on). Well, how do we know when it's safe to end this process and actually hit the run button?

A broad basin of attraction around human values?

I think Paul’s argument amounts to saying that a corrigibility approach focuses directly on mitigating the “lock-in” of wrong preferences, whereas ambitious value learning would try to get the right preferences but has a greater risk of locking-in its best guess.

What's the actual content of the argument that this is true? From my current perspective, corrigible AI still has a very high risk of lock-in of wrong preferences, due to bad metapreferences of the overseer. And ambitious value learning, or some ways of doing it, could turn out to be less risky with respect to lock-in: for example, you could potentially examine the metapreferences that a value-learning AI has learned, which might make it more obvious that they're not safe enough as is, triggering attempts to do something about that.

A broad basin of attraction around human values?

My inclination is to guess that there is a broad basin of attraction if we’re appropriately careful in some sense (and the same seems true for corrigibility).

In other words, the attractor basin is very thin along some dimensions, but very thick along some other dimensions.

What do you think the chances are of humanity being collectively careful enough, given that (in addition to the bad metapreferences I cited in the OP) it's devoting approximately 0.0000001% of its resources (3 FTEs, to give a generous overestimate) to studying either metaphilosophy or metapreferences in relation to AI risk, just years or decades before transformative AI will plausibly arrive?

One reason some people cited ~10 years ago for being optimistic about AI risk was that they expected that as AI got closer, human civilization would start paying more attention to AI risk and quickly ramp up its efforts on that front. That seems to be happening on some technical aspects of AI safety/alignment, but not on metaphilosophy/metapreferences. I am puzzled why almost no one is as (visibly) worried about this as I am, as my update (in response to the lack of ramp-up) is that (unless something changes soon) we're screwed unless we're (logically) lucky and the attractor basin just happens to be thick along all dimensions.

General alignment plus human values, or alignment via human values?

In contrast, something like a threat doesn’t count, because you know that the outcome if the threat is executed is not something you want; the problem comes because you don’t know how to act in a way that both disincentivizes threats and also doesn’t lead to (too many) threats being enforced. In particular, the problem is not that you don’t know which outcomes are bad.

I see, but I think at least part of the problem with threats is that I'm not sure what I care about, which greatly increases my "attack surface". For example, if I knew that negative utilitarianism is definitely wrong, then threats to torture some large number of simulated people wouldn't be effective on me (e.g., under total utilitarianism, I could use the resources demanded by the attacker to create more than enough happy people to counterbalance whatever they threaten to do).

Alignment is 100x more likely to be an existentially risky problem at all (think of this as the ratio between probabilities of existential catastrophe by the given problem assuming no intervention from longtermists).

This seems really extreme, if I'm not misunderstanding you. (My own number is like 1x-5x.) Assuming your intent alignment risk is 10%, your AI persuasion risk is only 1/1000?

Putting on my “what would I do” hat, I’m imagining that the AI doesn’t know that it was specifically optimized to be persuasive, but it does know that there are other persuasive counterarguments that aren’t being presented, and so it says that it looks one-sided and you might want to look at these other counterarguments.

Given that humans are liable to be persuaded by bad counterarguments too, I'd be concerned that the AI will always "know that there are other persuasive counterarguments that aren’t being presented, and so it says that it looks one-sided and you might want to look at these other counterarguments." Since it's not safe to actually look at the counterarguments found by your own AI, it's not really helping at all. (Or it makes things worse if the user isn't very cautious and does look at their AI's counterarguments and gets persuaded by them.)

I totally expect them to ask AI for help with such games. I don’t expect (most of) them to lock in their values such that they can’t change their mind.

I think most people don't think very long term and aren't very rational. They'll see some people within their group do AI-enabled value lock-in, get a lot of status reward for it, and emulate that behavior in order to not fall behind and become low status within the group. (This might be a gradual process resembling "purity spirals" of the past, i.e., people ask AI to do more and more things that have the effect of locking in their values, or a sudden wave of explicit value lock-ins.)

I expect AIs will be able to do the sort of philosophical reasoning that we do, and the question of whether we should care about simulations seems way way easier than the question about which simulations of me are being run, by whom, and what they want.

This seems plausible to me, but I don't see how one can have enough confidence in this view that one isn't very worried about the opposite being true and constituting a significant x-risk.

General alignment plus human values, or alignment via human values?

To be clear, my original claim was for hypothetical scenarios where the failure occurs because the AI didn’t know human values, rather than cases where the AI knows what the human would want but still a failure occurs.

I'm not sure I understand the distinction that you're drawing here. (It seems like my scenarios could also be interpreted as failures where AIs don't know enough about human values, or maybe where humans themselves don't know enough about human values.) What are some examples of what your claim was about?

I do still think they are not as important as intent alignment.

As in, the total expected value lost through such scenarios isn't as large as the expected value lost through the risk of failing to solve intent alignment? Can you give some ballpark figures of how you see each side of this inequality?

Mostly I’d hope that AI can tell what philosophy is optimized for persuasion

How? How would you train an AI to distinguish between philosophy optimized for persuasion, and correct or well-intentioned philosophy that just happens to be very persuasive?

or at least is capable of presenting counterarguments persuasively as well.

You mean every time you hear a philosophical argument, you ask your AI to produce some counterarguments optimized for persuasion? If so, won't your friends be afraid to send you any arguments they think of, for fear of your AI superhumanly persuading you to the opposite conclusion?

And I don’t expect a large number of people to explicitly try to lock in their values.

A lot of people are playing status games where faith/loyalty to their cause/ideology gains them a lot of status points. Why wouldn't they ask their AI for help with this? Or do you imagine them asking for something like "more faith", but AIs understand human values well enough to not interpret that as "lock in values"?

It seems odd to me that it’s sufficiently competent to successfully reason about simulations enough that an acausal threat can actually be made, but then not competent at reasoning about exotic philosophical cases, and I don’t particularly expect this to happen.

The former seems to just require that the AI is good at reasoning about mathematical/empirical matters (e.g., are there many simulations of me actually being run in some universe or set of universes) which I think AIs will be good at by default, whereas dealing with the threats seems to require reasoning about hard philosophical problems like decision theory and morality. For example, how much should I care about my copies in the simulations or my subjective future experience, versus the value that would be lost in the base reality if I were to give in to the simulators' demands? Should I make a counterthreat? Are there any thoughts I or my AI should avoid having, or computations we should avoid doing?

I don’t expect AIs to have clean crisp utility functions of the form “maximize paperclips” (at least initially).

I expect that AIs (or humans) who are less cautious, or who think their values can be easily expressed as utility functions, will do this first, thereby gaining an advantage over everyone else and maybe forcing them to follow suit.

I expect this to be way less work than the complicated plans that the AI is enacting, so it isn’t a huge competitiveness hit.

I don't think it's so much that the coordination involving humans is a lot of work, but rather that we don't know how to do it in a way that doesn't cause a lot of waste, similar to a democratically elected administration implementing a bunch of policies only to be reversed by the next administration that takes power, or lawmakers pursuing pork barrel projects that collectively make almost everyone worse off, or being unable to establish and implement easy policies (see COVID again). (You may well have something in mind that works well in the context of intent aligned AI, but I have a prior that says this class of problems is very difficult in general so I'd need to see more details before I update.)