Sometimes, I say some variant of “yeah, probably some people will need to do a pivotal act” and people raise the objection: “Should a small subset of humanity really get so much control over the fate of the future?”

(Sometimes, I hear the same objection to the idea of trying to build aligned AGI at all.)

I’d first like to say that, yes, it would be great if society had the ball on this. In an ideal world, there would be some healthy and competent worldwide collaboration steering the transition to AGI.[1]

Since we don’t have that, it falls to whoever happens to find themselves at ground zero to prevent an existential catastrophe.

A second thing I want to say is that design-by-committee… would not exactly go well in practice, judging by how well committee-driven institutions function today.

Third, though, I agree that it’s morally imperative that a small subset of humanity not directly decide how the future goes. So if we are in the situation where a small subset of humanity will be forced at some future date to flip the gameboard — as I believe we are, if we’re to survive the AGI transition — then AGI developers need to think about how to do that without unduly determining the shape of the future. 

The goal should be to cause the future to be great on its own terms, without locking in the particular moral opinions of humanity today — and without locking in the moral opinions of any subset of humans, whether that’s a corporation, a government, or a nation.

(If you can't see why a single modern society locking in their current values would be a tragedy of enormous proportions, imagine an ancient civilization such as the Romans locking in their specific morals 2000 years ago. Moral progress is real, and important.)

But the way to cause the future to be great “on its own terms” isn’t to do nothing and let the world get destroyed. It’s to intentionally not leave your fingerprints on the future, while acting to protect it.

You have to stabilize the landscape / make it so that we’re not all about to destroy ourselves with AGI tech; and then you have to somehow pass the question of how to shape the universe back to some healthy process that allows for moral growth and civilizational maturation and so on, without locking in any of humanity’s current screw-ups for all eternity.


Unfortunately, the current frontier for alignment research is “can we figure out how to point AGI at anything?”. By far the most likely outcome is that we screw up alignment and destroy ourselves.

If we do solve alignment and survive this great transition, then I feel pretty good about our prospects for figuring out a good process to hand the future to. Some reasons for that:

  • Human science has a good track record for solving difficult-seeming problems; and if there’s no risk of anyone destroying the world with AGI tomorrow, humanity can take its time and do as much science, analysis, and weighing of options as needed before it commits to anything.
  • Alignment researchers have already spent a lot of time thinking about how to pass that buck and make sure the future goes great without bearing our fingerprints. Even this small group of people has made real progress, and the problem doesn't seem that tricky, because there are so many good ways to approach it carefully and indirectly.
  • Solving alignment well enough to end the acute risk period without killing everyone implies that you’ve cleared a very high competence bar, as well as a sanity bar that not many clear today. Willingness and ability to defuse moral hazard is correlated with willingness and ability to save the world.
  • Most people would do worse by their own lights if they locked in their current morals, and would prefer to leave space for moral growth and civilizational maturation. The property of realizing that you want to (or would on reflection want to) defuse the moral hazard is also correlated with willingness and ability to save the world.
  • Furthermore, the fact that — as far as I know — all the serious alignment researchers are actively trying to figure out how to avoid leaving their fingerprints on the future, seems like a good sign to me. You could find a way to be cynical about these observations, but these are not the observations that the cynical hypothesis would predict ab initio.

This is a set of researchers that generally takes egalitarianism, non-nationalism, concern for future minds, non-carbon-chauvinism, and moral humility for granted, as obvious points of background agreement; the debates are held at a higher level than that.

This is a set of researchers that regularly talk about how, if you’re doing your job correctly, then it shouldn’t matter who does the job, because there should be a path-independent attractor-well that isn't about making one person dictator-for-life or tiling a particular flag across the universe forever.

I’m deliberately not talking about slightly-more-contentful plans like coherent extrapolated volition here, because in my experience a decent number of people have a hard time parsing the indirect buck-passing plans as something more interesting than just another competing political opinion about how the future should go. (“It was already blues vs. reds vs. oranges, and now you’re adding a fourth faction which I suppose is some weird technologist green.”)

I’d say: Imagine that some small group of people were given the power (and thus responsibility) to steer the future in some big way. And ask what they should do with it. Ask how they possibly could wield that power in a way that wouldn’t be deeply tragic, and that would realistically work (in the way that “immediately lock in every aspect of the future via a binding humanity-wide popular vote” would not).

I expect that the best attempts to carry out this exercise will involve re-inventing some ideas that Bostrom and Yudkowsky invented decades ago. Regardless, though, I think the future will go better if a lot more conversations occur in which people take a serious stab at answering that question.


The situation humanity finds itself in (on my model) poses an enormous moral hazard.

But I don’t conclude from this “nobody should do anything”, because then the world ends ignominiously. And I don’t conclude from this “so let’s optimize the future to be exactly what Nate personally wants”, because I’m not a supervillain.[2]

The existence of the moral hazard doesn’t have to mean that you throw up your hands, or imagine your way into a world where the hazard doesn’t exist. You can instead try to come up with a plan that directly addresses the moral hazard — try to solve the indirect and abstract problem of “defuse the moral hazard by passing the buck to the right decision process / meta-decision-process”, rather than trying to directly determine what the long-term future ought to look like.

Rather than just giving up in the face of difficulty, researchers have the ability to see the moral hazard with their own eyes and ensure that civilization gets to mature anyway, despite the unfortunate fact that humanity, in its youth, had to steer past a hazard like this at all.

Crippling our progress in its infancy is a completely unforced error. Some of the implementation details may be tricky, but much of the problem can be solved simply by choosing not to rush a solution once the acute existential risk period is over, and by choosing to end the acute existential risk period (and its associated time pressure) before making any lasting decisions about the future.[3]


(Context: I wrote this with significant editing help from Rob Bensinger. It’s an argument I’ve found myself making a lot in recent conversations.)

  1. ^

    Note that I endorse work on more realistic efforts to improve coordination and make the world’s response to AGI more sane. “Have all potentially-AGI-relevant work occur under a unified global project” isn’t attainable, but more modest coordination efforts may well succeed.

  2. ^

    And I’m not stupid enough to lock in present-day values at the expense of moral progress, or stupid enough to toss coordination out the window in the middle of a catastrophic emergency with human existence at stake, etc.

    My personal CEV cares about fairness, human potential, moral progress, and humanity’s ability to choose its own future, rather than having a future imposed on it by a dictator. I'd guess that the difference between "we run CEV on Nate personally" and "we run CEV on humanity writ large" is nothing (e.g., because Nate-CEV decides to run humanity's CEV), and if it's not nothing then it's probably minor.

  3. ^

    See also Toby Ord’s The Precipice, and its discussion of “the long reflection”. (Though, to be clear, a short reflection is better than a long reflection, if a short reflection suffices. The point is not to delay for its own sake, and the amount of sidereal time required may be quite short if a lot of the cognitive work is being done by uploaded humans and/or aligned AI systems.)

Comments

I think what you're saying here ought to be uncontroversial. You're saying that should a small group of technical people find themselves in a position of enormous influence, they ought to use that influence in an intelligent and responsible way, which may not look like immediately shirking that responsibility out of a sense that nobody should ever exert influence over the future.

I have the sense that in most societies over most of history, it was accepted that of course various small groups would at certain times find themselves in positions of enormous influence w.r.t. their society, and of course their responsibility in such a situation would be not to shirk that responsibility but to wisely and unilaterally choose a direction forward for their society, as required by the situation at hand.

In an ideal world, there would be some healthy and competent worldwide collaboration steering the transition to AGI

I have the sense that what would be ideal is for humanity to proceed with wisdom. The wisest moves we've made as a species to date (ending slavery? ending smallpox? landing on the moon?) didn't particularly look like "worldwide collaborations". Why, actually, do you say that the ideal would be a worldwide collaboration?

Third, though, I agree that it’s morally imperative that a small subset of humanity not directly decide how the future goes

Why should a small subset of humanity not directly decide how the future goes? The goal ought to be good decision-making, not large- or small-group decision making, and definitely not non-decision-making.

Of course the future should not be a tightly scripted screenplay of contemporary moral norms, but to decide that is to decide something about how the future goes. It's not wrong to make such decisions, it's just important to get such decisions right.

The wisest moves we've made as a species to date (ending slavery? ending smallpox? landing on the moon?) didn't particularly look like "worldwide collaborations".

I think Nate might've been thinking of things like:

  • Having all AGI research occur in one place is good (ceteris paribus), because then the AGI project can take as much time as it needs to figure out alignment, without worrying that some competitor will destroy the world with AGI if they go too slowly.
  • This is even truer if the global coordination is strong enough to prevent other x-risks (e.g., bio-weapons), so we don't have to move faster to avoid those either.
  • In a perfect world, everyone would get some say in major decisions that affect their personal safety (e.g., via elected Scientist Representatives). This helps align incentives, relative to a world where anyone can unilaterally impose serious risks on others.
  • In a perfect world, larger collaborations shouldn't perform worse than smaller ones, because larger collaborations should understand the dysfunctions of large collaborations and have policies and systems in place to avoid them (e.g., by automatically shrinking or siloing if needed).

I interpret Nate as making a concession to acknowledge the true and good aspects of the 'but isn't there something off about a random corporation or government doing all this?' perspective, not as recommending that we (in real life) try to have the UN build AGI or whatever.

I think your pushback is good here, as a reminder that 'but isn't there something off about a random corporation or government doing all this?' also often has less-reasonable intuitions going into it (example), and gets a weird level of emphasis considering how much more important other factors are, considering the track record of giant international collaborations, etc.

Why should a small subset of humanity not directly decide how the future goes? [...] Of course the future should not be a tightly scripted screenplay of contemporary moral norms, but to decide that is to decide something about how the future goes, and it's not wrong to make such decisions, it's just important to get such decisions right.

I'm guessing you two basically agree, and the "directly" in "a small subset of humanity not directly decide" is meant to exclude a "tightly scripted screenplay of contemporary moral norms"?

Nate also has the substantive belief that CEV-ish approaches are good, and (if he agrees with the Arbital page) that the base for CEV should be all humans. (The argument for this on Arbital is a combination of "it's in the class of approaches that seem likeliest to work", and "it seems easier to coordinate around, compared to the other approaches in that class". E.g., I'd say that "run CEV on every human whose name starts with a vowel" is likely to produce the ~same outcome as "run CEV on every human", but the latter is a better Schelling point.)

I imagine if Nate thought the best method for "not tightly scripting the future" were less "CEV based on all humans" and more "CEV based on the 1% smartest humans", he'd care more about distinctions like the one you're pointing at. It's indeed the case that we shouldn't toss away most of the future's value just for the sake of performative egalitarianism: we should do the thing that actually makes sense.

Yeah I also have the sense that we mostly agree here.

I have the sense that CEV stands for, very roughly, "what such-and-such a person would do if they became extremely wise", and the hope (which I think is a reasonable hope) is that there is a direction called "wisdom" such that if you move a person far enough in that direction then they become both intelligent and benevolent, and that this eventually doesn't depend super much on where you started.

The tricky part is that we are in this time where we have the option of making some moves that might be quite disruptive, and we don't yet have direct access to the wisdom that we would ideally use to guide our most significant decisions.

And the key question is really: what do you do if you come into a position of really significant influence, at a time when you don't yet have the tools to access the CEV-level wisdom that you might later get? And some people say it's flat-out antisocial to even contemplate taking any disruptive actions, while others say that given the particular configuration of the world right now and the particular problems we face, it actually seems plausible that a person in such a position of influence ought to seriously consider disruptive actions.

I really agree with the latter, and I also contend that it's the more epistemically humble position, because you're not saying that it's for sure that a pivotal act should be performed, but just that it's quite plausible given the specifics of the current world situation. The other side of the argument seems to be saying that no no no it's definitely better not to do anything like that in anything like the current world situation.

I also contend that it's the more epistemically humble position, because you're not saying that it's for sure that a pivotal act should be performed, but just that it's quite plausible given the specifics of the current world situation

The thing I'd say in favor of this position is that I think it better fits the evidence. I think the problem with the opposing view is that it's wrong, not that it's more confident. E.g., if I learned that Nate assigns probability .9 to "a pivotal act is necessary" (for some operationalization of "necessary") while Critch assigns probability .2 to "a pivotal act is necessary", I wouldn't go "ah, Critch is being more reasonable, since his probability is closer to .5".

I agree with the rest of what you said, and I think this is a good way of framing the issue.

I'd add that I think discussion of this topic gets somewhat distorted by the fact that many people naturally track social consensus, and try to say the words they think will have the best influence on this consensus, rather than blurting out their relevant beliefs.

Many people are looking for a signal that stuff like this is OK to say in polite society, while many others are staking out the position "the case for this makes sense intellectually, but there's no way it will ever attract enough support, so I'll preemptively oppose it in order to make my other arguments more politically acceptable". (The latter, unfortunately, being a strategy that can serve as a self-fulfilling prophecy.)

The goal should be to cause the future to be great on its own terms

What the heck is this supposed to mean? Great according to the Inherent Essence Of Goodness that lives inside futures, rather than as part of human evaluations? Because I've got bad news for that plan.

Honestly, I'm disappointed by this post.

You say you've found yourself making this argument a lot recently. That's fair. I think it's totally reasonable that there are some situations where this argument could move people in the right direction - maybe the audience is considering defecting on the project of aligning AI with humanity but would respond to orders from authority. Or maybe they're outsiders who think you are going to defect, and you want to signal to them how you're going to cooperate not just because it's a good idea, but because it's an important moral principle to you (as evolution intended).

But this is not an argument that you should just throw out scattershot. Because it's literally false. There is no single attractor that all human values can be expected to fall into upon reflection. The primary advantage of AI alignment over typical philosophy is that when alignment researchers realize some part of what they were previously calling "alignment" is impossible, they can back up and change how they're cashing out "alignment" so that it's actually possible - philosophers have to keep caring about the impossible thing. This advantage goes away if we don't use it.

Yes, plenty of people liked this post. But I'm holding you to a high standard. Somewhere people should be expected to not keep talking about the impossible thing. Somewhere, there is a version of this post that talks about or directly references:

  • Game-theoretic arguments for cooperation.
  • Why game-theoretic arguments are insufficient for egalitarianism (e.g. overly weighting the preferences of the powerful) but still mean that AI should be designed with more than just you in mind, even before accounting for a human preference for an egalitarian future.
  • Why egalitarianism is a beautiful moral principle that you endorse.
    • "Wait, wasn't that this post?" you might say. Kind of! Making a plain ethical/aesthetic argument is like a magic trick where the magician tells you up front that it's an illusion. This post is much the same magic trick, but the magician is telling you it's real magic.
  • Realistic expectations for what egalitarianism can look like in the real world.
    • It cannot look like finding the one attractor that all human values converge to upon reflection because there is no one attractor that all human values converge to upon reflection.
  • Perhaps an analysis of how big the "fingerprints" of the creators of the AI are in such situations - e.g. by setting the meta-level standards for what counts as a "human value".
    • There is a non-zero chance that the meta-preferences, that end up in charge of the preferences, that end up in charge of the galaxy will come from Mechanical Turkers.

"The goal should be to cause the future to be great on its own terms"

What the heck is this supposed to mean? Great according to the Inherent Essence Of Goodness that lives inside futures, rather than as part of human evaluations?

The rest of the quote explains what this means:

The goal should be to cause the future to be great on its own terms, without locking in the particular moral opinions of humanity today — and without locking in the moral opinions of any subset of humans, whether that’s a corporation, a government, or a nation.

(If you can't see why a single modern society locking in their current values would be a tragedy of enormous proportions, imagine an ancient civilization such as the Romans locking in their specific morals 2000 years ago. Moral progress is real, and important.)

The present is "good on its own terms", rather than "good on Ancient Romans' terms", because the Ancient Romans weren't able to lock in their values. If you think this makes sense (and is a good thing) in the absence of an Inherent Essence Of Goodness, then there's no reason to posit an Inherent Essence Of Goodness when we switch from discussing "moral progress after Ancient Rome" to "moral progress after circa-2022 civilization".

Because it's literally false. There is no single attractor that all human values can be expected to fall into upon reflection.

Could you be explicit about what argument you're making here? Is it something like:

  • Even when two variables are strongly correlated, the most extreme value of one will rarely be the most extreme value of the other; therefore it's <50% likely that different individuals' CEVs will yield remotely similar results? (E.g., similar enough that one individual will consider the output of most other individuals' CEVs morally acceptable?)

Or?:

  • The optimal world-state according to Catholicism is totally different from the optimal world-state according to hedonic utilitarianism; therefore it's <50% likely that the CEV of a random Catholic will consider the output of a hedonic utilitarian's CEV morally acceptable. (And vice versa.)
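
As an aside: the statistical pattern the first reading points at, sometimes called "the tails come apart", is easy to see in a toy simulation. Here is a minimal sketch with purely illustrative numbers (two strongly correlated traits, checking how often the same individual is most extreme on both):

```python
import numpy as np

# Toy simulation: even for strongly correlated traits, the individual who is
# most extreme on one trait is usually not the most extreme on the other.
rng = np.random.default_rng(0)
rho = 0.9           # correlation between the two traits (illustrative)
n_people = 10_000   # population size per trial
n_trials = 1_000

cov = [[1.0, rho], [rho, 1.0]]
hits = 0
for _ in range(n_trials):
    x, y = rng.multivariate_normal([0.0, 0.0], cov, size=n_people).T
    hits += int(np.argmax(x) == np.argmax(y))  # same person tops both traits?

print(f"rho={rho}: the same individual is the maximum of both traits "
      f"in {hits / n_trials:.1%} of trials")
```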

Regarding the second argument: I don't think that Catholicism is stable under reflection (because it's false, and a mind needs to avoid thinking various low-complexity true thoughts in order to continue believing Catholicism), so I don't think the Catholic and hedonic utilitarian's CEVs will end up disagreeing, even though the optimum for Catholicism and for hedonic utilitarianism disagree.

(I'd bet against hedonic utilitarianism being true as well, but this is obviously a much more open question. And fortunately, CEV-ish buck-passing processes make it less necessary for anyone to take risky bets like that; we can just investigate what's true and base our decisions on what we learn.)

Catholicism is a relatively easy case, and I expect plenty of disagreement about exactly how much moral disagreement looks like the Catholicism/secularism debate. I expect a lot of convergence on questions like "involuntarily enslaving people: good or bad?", on the whole, and less on questions like "which do you want more of: chocolate ice cream, or vanilla ice cream?". But it's the former questions that matter more for CEV; the latter sorts of questions are ones where we can just let individuals choose different lives for themselves.

"Correlations tend to break when you push things to extremes" is a factor that should increase our expectation of how many things people are likely to morally disagree about. Factors pushing in the other direction include 'not all correlations work that way' and evidence that human morality doesn't work that way.

E.g., 'human brains are very similar', 'empirically, people have converged a lot on morality even though we've been pushed toward extremes relative to our EEA', 'we can use negotiation and trade to build value systems that are good compromises between two conflicting value systems', etc.

Also 'the universe is big, and people's "amoral" preferences tend to be about how their own life goes, not about the overall distribution of matter in the universe'; so values conflicts tend to be concentrated in cases where we can just let different present-day stakeholders live different sorts of lives, given the universe's absurd abundance of resources.

Nate said "it shouldn’t matter who does the job, because there should be a path-independent attractor-well that isn't about making one person dictator-for-life or tiling a particular flag across the universe forever", and you said this is "literally false". I don't see what's false about it, so if the above doesn't clarify anything, maybe you can point to the parts of the Arbital article on CEV you disagree with (https://arbital.com/p/cev/)? E.g., I don't see Nate or Eliezer claiming that people will agree about vanilla vs. chocolate.

Game-theoretic arguments for cooperation [...] mean that AI should be designed with more than just you in mind, even before accounting for a human preference for an egalitarian future

Footnote 2 says that Nate isn't "stupid enough to toss coordination out the window in the middle of a catastrophic emergency with human existence at stake". If that isn't an argument 'cooperation is useful, therefore we should take others' preferences into account', then what sort of argument do you have in mind?

Why egalitarianism is a beautiful moral principle that you endorse.

I don't know what you mean by "egalitarianism", or for that matter what you mean by "why". Are you asking for an ode to egalitarianism? Or an argument for it, in terms of more basic values?

The present is "good on its own terms", rather than "good on Ancient Romans' terms", because the Ancient Romans weren't able to lock in their values. If you think this makes sense (and is a good thing) in the absence of an Inherent Essence Of Goodness, then there's no reason to posit an Inherent Essence Of Goodness when we switch from discussing "moral progress after Ancient Rome" to "moral progress after circa-2022 civilization".

The present is certainly good on my terms (relative to ancient Rome). But the present itself doesn't care. It's not the type of thing that can care. So what are you trying to pack inside that phrase, "its own terms"?

If you mean it to sum up a meta-preference you hold about how moral evolution should proceed, then that's fine. But is that really all? Or are you going to go reason as if there's some objective essence of what the present's "own terms" are - e.g. by trying to apply standards of epistemic uncertainty to the state of this essence? 

"Because it's literally false. There is no single attractor that all human values can be expected to fall into upon reflection."

Could you be explicit about what argument you're making here? Is it something like:

  • Even when two variables are strongly correlated, the most extreme value of one will rarely be the most extreme value of the other; therefore it's <50% likely that different individuals' CEVs will yield remotely similar results? (E.g., similar enough that one individual will consider the output of most other individuals' CEVs morally acceptable?)

Or?:

  • The optimal world-state according to Catholicism is totally different from the optimal world-state according to hedonic utilitarianism; therefore it's <50% likely that the CEV of a random Catholic will consider the output of a hedonic utilitarian's CEV morally acceptable. (And vice versa.)

I'll start by quoting the part of Scott's essay that I was particularly thinking of, to clarify:

Our innate moral classifier has been trained on the Balboa Park – West Oakland route. Some of us think morality means “follow the Red Line”, and others think “follow the Green Line”, but it doesn’t matter, because we all agree on the same route.

When people talk about how we should arrange the world after the Singularity when we’re all omnipotent, suddenly we’re way past West Oakland, and everyone’s moral intuitions hopelessly diverge.

But it’s even worse than that, because even within myself, my moral intuitions are something like “Do the thing which follows the Red Line, and the Green Line, and the Yellow Line…you know, that thing!” And so when I’m faced with something that perfectly follows the Red Line, but goes the opposite directions as the Green Line, it seems repugnant even to me, as does the opposite tactic of following the Green Line. As long as creating and destroying people is hard, utilitarianism works fine, but make it easier, and suddenly your Standard Utilitarian Path diverges into Pronatal Total Utilitarianism vs. Antinatalist Utilitarianism and they both seem awful. If our degree of moral repugnance is the degree to which we’re violating our moral principles, and my moral principle is “Follow both the Red Line and the Green Line”, then after passing West Oakland I either have to end up in Richmond (and feel awful because of how distant I am from Green), or in Warm Springs (and feel awful because of how distant I am from Red).

Okay, so.

What's the claim I'm projecting onto Nate, that I'm saying is false? It's something like: "The goal should be to avoid locking in any particular morals. We can do this by passing control to some neutral procedure that allows values to evolve."

And what I am saying is something like: There is no neutral procedure. There is no way to avoid privileging some morals. This is not a big problem, it's just how it is, and we can be okay with it.

Related and repetitive statements:

  • When extrapolating the shared train line past West Oakland, there are multiple ways to continue, but none of them are "the neutral way to do the extrapolation."
  • The self-reflection function has many attractors for almost all humans, groups, societies, and AGI architectures. Different starting points might land us in different attractors, and there is no unique "neutral starting point."
  • There are many procedures for allowing values to evolve, most of them suck, and picking a good one is an action that will bear the fingerprints of our own values. And that's fine!
  • Human meta-preferences, the standards by which we judge what preference extrapolation schemes are good, are preferences. We do not have any mysterious non-preference standards for doing value aggregation and extrapolation.
  • There is not just one CEV that is the neutral way to do preference aggregation and extrapolation. There are lots of choices that we have to / get to make.

So as you can see, I wasn't really thinking about differences between "the CEV" of different people - my focus was more on differences between ways of implementing CEV of the same people. A lot of these ways are going to be more or less equally good - like comparing your favorite beef stew vs. a 30-course modernist meal. But not all possible implementations of CEV are good; for example, you could screw up by modeling people as exposed to extreme or highly-optimized stimuli when extrapolating them, leading to the AI causing large changes in the human condition that we wouldn't presently endorse.

I don't know what you mean by "egalitarianism", or for that matter what you mean by "why". Are you asking for an ode to egalitarianism? Or an argument for it, in terms of more basic values?

By egalitarianism I mean building an AI that tries to help all people, and be responsive to the perspectives of all people, not just a select few. And yes, definitely an ode :D

e.g. by trying to apply standards of epistemic uncertainty to the state of this essence? 

I would say that there's a logical object that a large chunk of human moral discourse is trying to point at — something like "the rules of the logical game Morality", analogous to "the rules of the logical game Chess". Two people can both be discussing the same logical object "the rules of Chess", but have different beliefs about what that logical object's properties are. And just as someone can be mistaken or uncertain about the rules of chess — or about their interaction in a specific case — someone can be uncertain about morality.

Do you disagree with any of that?

And what I am saying is something like: There is no neutral procedure. There is no way to avoid privileging some morals. This is not a big problem, it's just how it is, and we can be okay with it.

In the CEV Arbital page, Eliezer says:

"Even the terms in CEV, like 'know more' or 'extrapolate a human', seem complicated and value-laden."

If the thing you're saying is that CEV is itself a complicated idea, and it seems hard for humanity to implement such a thing without already having a pretty deep understanding of human values, then I agree. This seems like an important practical challenge for pulling off CEV: you need to somehow start the bootstrapping process, even though our current understanding of human values is insufficient for formally specifying the best way to do CEV.

If instead you just mean to say "there's no reason to favor human values over termite values unless you already care about humans", then yeah, that seems even more obvious to me. If you think Nate is trying to argue for human morality from a humanity-indifferent, View-From-Nowhere perspective, then you're definitely misunderstanding Nate's perspective.

When extrapolating the shared train line past West Oakland, there are multiple ways to continue, but none of them are "the neutral way to do the extrapolation."

If "neutral" here means "non-value-laden", then sure. If "neutral" here means "non-arbitrary, from a human POV", then it seems like an open empirical question how many arbitrary decisions like this are required in order to do CEV.

I'd guess that there are few or no arbitrary decisions involved in using CEV to answer high-stakes moral questions.

There are many procedures for allowing values to evolve, most of them suck, and picking a good one is an action that will bear the fingerprints of our own values.

This makes me think that you misunderstood Nate's essay entirely. The idea of "don't leave your fingerprints on the future" isn't "try to produce a future that has no basis in human values". The idea is "try to produce a future that doesn't privilege the AGI operator's current values at the expense of other humans' values, the values humans would develop in the future if their moral understanding improved, etc.".

If you deploy AGI and execute a pivotal act, don't leave your personal fingerprints all over the long-term future of humanity, in a way that distinguishes you from other humans.

I would say that there's a logical object that a large chunk of human moral discourse is trying to point at — something like "the rules of the logical game Morality", analogous to "the rules of the logical game Chess". Two people can both be discussing the same logical object "the rules of Chess", but have different beliefs about what that logical object's properties are. And just as someone can be mistaken or uncertain about the rules of chess — or about their interaction in a specific case — someone can be uncertain about morality.

When I think about the rules of chess, I basically treat them as having some external essence that I have epistemic uncertainty about. What this means mechanistically is:

  • When I'm unsure about the rules of chess, this raises the value of certain information-gathering actions, like checking the FIDE website, asking a friend, reading a book.
  • If I knew the outcomes of all those actions, that would resolve my uncertainty.
  • I have probabilities associated with my uncertainty, and updates to those probabilities based on evidence should follow Bayesian logic.
  • Decision-making under uncertainty should linearly aggregate the different possibilities that I'm uncertain over, weighted by their probability.

So the rules of chess are basically just a pattern out in the world that I can go look at. When I say I'm uncertain about the rules of chess, this is epistemic uncertainty that I manage the same as if I'm uncertain about anything else out there in the world.

The "rules of Morality" are not like this.

  • When I'm unsure about whether I care about fish suffering, this does raise the value of certain information-gathering actions like learning more about fish.
  • But if I knew the outcomes of all those actions, this wouldn't resolve all my uncertainty.
  • I can put probabilities to various possibilities, and can update them on evidence using Bayesian logic - that part still works.
  • Decision-making under the remaining-after-evidence part of the uncertainty doesn't have to look like linear aggregation. In fact it shouldn't - I have meta-preferences like "conservatism," which says that I should trust models differently depending on whether they seem to be inside their domain of validity or not.

So there's a lot of my uncertainty about morality that doesn't stem from being unaware about facts. Where does it come from? One source is self-modeling uncertainty - how do I take the empirical facts about me and the world, and use that to construct a model of myself in which I have preferences, so that I can reflect on my own preferences? There are multiple ways to do this.

So if (and I'm really not sure about this) you were thinking of everything as like uncertainty about the rules of chess, then I would expect two main mistakes: expecting there to be some procedure that takes in evidence and spits out the one right answer, and expecting the aggregation over models used for decision-making to look like linear aggregation.
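
To make that contrast concrete, here is a minimal sketch in Python. The numbers are entirely made up, and the "conservatism" rule is a toy one chosen just for illustration; the point is only that decision rules other than linear, probability-weighted aggregation are available:

```python
# Two candidate models of what I value, with my credence in each and the
# value each model assigns to some proposed action (all numbers hypothetical).
models = [
    {"name": "fish suffering matters a lot", "p": 0.4, "value": -10.0},
    {"name": "fish suffering doesn't matter", "p": 0.6, "value": 2.0},
]

# Chess-style epistemic uncertainty: aggregate linearly, i.e. take the plain
# probability-weighted expected value across models.
linear = sum(m["p"] * m["value"] for m in models)

# A toy "conservatism" rule for the residual moral uncertainty: blend the
# expected value with the worst case across models, so a model I only
# half-trust outside its domain of validity can't simply be averaged away.
# This is deliberately *not* a linear aggregation over the credences.
caution = 0.5  # 0 = pure expected value, 1 = pure worst case (illustrative)
worst = min(m["value"] for m in models)
conservative = (1 - caution) * linear + caution * worst

print(f"linear (expected-value) aggregation: {linear:+.2f}")        # -2.80
print(f"conservative aggregation:            {conservative:+.2f}")  # -6.40
```

The worst-case blend is just one possibility; the point is that the shape of the aggregation rule is itself a (meta-)preference rather than something handed to us by Bayes.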

"There are many procedures for allowing values to evolve, most of them suck, and picking a good one is an action that will bear the fingerprints of our own values."

This makes me think that you misunderstood Nate's essay entirely. The idea of "don't leave your fingerprints on the future" isn't "try to produce a future that has no basis in human values". The idea is "try to produce a future that doesn't privilege the AGI operator's current values at the expense of other humans' values, the values humans would develop in the future if their moral understanding improved, etc.".

If you deploy AGI and execute a pivotal act, don't leave your personal fingerprints all over the long-term future of humanity, in a way that distinguishes you from other humans.

Well, maybe I misunderstood. But I'm not really accusing y'all of saying "try to produce a future that has no basis in human values." I am accusing this post of saying "there's some neutral procedure for figuring out human values, we should use that rather than a non-neutral procedure."

(If you can't see why a single modern society locking in their current values would be a tragedy of enormous proportions, imagine an ancient civilization such as the Romans locking in their specific morals 2000 years ago. Moral progress is real, and important.)

This really doesn't prove anything. That measurement shouldn't be taken by our values, but by the values of the ancient Romans.

Sure, of course the morality of the past gets better and better (by our lights): it's taking a random walk closer and closer to our morality. Now, moral progress might be real.

The place to look is inside our own value functions. If, after 1000 years of careful philosophical debate, humanity decided it was a great idea to eat babies, would you say, "Well, if you have done all that thinking, clearly you are wiser than me"? Or would you say, "Arghh, no. Clearly something has broken in your philosophical debate"? That is part of your own meta-value function; the external world can't tell you what to think here (unless you have a meta-meta-value function, but then you have to choose that for yourself).

It doesn't help that human values seem to be inarticulate, half-formed intuitions, and that the things we call our values are often instrumental goals.

If, had ASI not been created, humans would have gone extinct from bioweapons and pandas would have evolved intelligence, is the extinction of humans and the rise of panda-centric morality just part of moral progress?

If aliens arrive, and offer to share their best philosophy with us, is the alien influence part of moral progress, or an external fact to be removed? 

If advertisers basically learn to brainwash people to sell more product, is that part of moral progress?

Suppose that, had you not made the AI, Joe Bloggs would have made an AI 10 years later. Joe Bloggs would actually have succeeded at alignment, and would have imposed his personal whims on all humanity forever. If you are trying not to unduly influence the future, do you make everyone beholden to the whims of Joe, as they would be without your influence?

My personal CEV cares about fairness, human potential, moral progress, and humanity’s ability to choose its own future, rather than having a future imposed on it by a dictator. I'd guess that the difference between "we run CEV on Nate personally" and "we run CEV on humanity writ large" is nothing (e.g., because Nate-CEV decides to run humanity's CEV), and if it's not nothing then it's probably minor.

Wait. The whole point of CEV is to get the AI to extrapolate what you would want if you were smarter and more informed. That is, the delta from your existing goals to your CEV should be unknowable to you, because if you know your destination, you are already there. This sounds like your object-level values. And they sound good, as judged by your (and my) object-level values.

I mean, there is a sense in which I agree that locking in, say, your favourite political party, or a particular view on abortion, is stupid. Well, I am not sure a particular view on abortion would actually be bad; it would probably have nearly no effect in a society of posthuman digital minds. These are things that are fairly clearly instrumental. If I learned that, after careful philosophical consideration and analysis of lots of developmental-neurology data, people decided abortion was really bad, I would take that seriously; they have probably realized a moral truth I do not know.

I think I have a current idea of what is right, with uncertainty bars. When philosophers come to an unexpected conclusion, it is some evidence that the conclusion is right, and also some evidence the philosopher has gone mad.