I would term □X → X "hope for X" rather than "reliability", because it's about willingness to enact X in response to belief in X, but if X is no good, you shouldn't do that. Indeed, for bad X, having the property □X → X is harmful fatalism, following along with destiny rather than choosing it. In those cases, you might want □X → ¬X or something, though that only prevents X from being believed, sparing you the belief rather than the event itself; it doesn't prevent the actual X. So □X → X reflects a value judgement about X reflected in the agent's policy, something downstream of endorsement of X, a law of how the content of the world behaves according to an embedded agent's will.
Payor's Lemma then talks about belief in hope, □(□X → X), that is the hope itself is exogenous and needs to be judged (endorsed or not). Which is reasonable for games, since what the coalition might hope for is not anyone's individual choice; the details of this hope couldn't have been hardcoded in any agent a priori and need to be negotiated during a decision that forms the coalition. A functional coalition should be willing to act on its own hope (which is again something we need to check for a new coalition, even if it might've already been the case for a singular agent), that is we need to check that □(□X → X) is sufficient to motivate the coalition to actually enact X. This is again a value judgement about whether this coalition's tentative aspirations, being a vehicle for the hope that X, are actually endorsed by it.
Thus I'd term □(□X → X) "coordination" rather than "trust", the fact that this particular coalition would tentatively intend to coordinate on a hope for X. Hope is a value judgement about X, and in this case it's the coalition's hope, rather than any one agent's hope, and the coalition is a temporary, nascent agency that doesn't necessarily know what it wants yet. The coalition asks: "If we find ourselves hoping for X together, will we act on it?" So we start with coordination about hope, seeing if this particular hope wants to settle as the coalition's actual values, and judging whether it should by enacting X if at least coordination on this particular hope is reached, which should happen only if X is a good thing.
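To keep the renamed pieces straight (this is just my gloss on the notation, with □ standing for belief/provability):

- hope for X: □X → X (willingness to enact X in response to believing X)
- coordination on hope for X: □(□X → X) (the coalition believes in its own hope)
- endorsement of the coalition's hope: □(□X → X) → X (enacting X given that coordination)

Under this renaming, Payor's Lemma says that the endorsement alone already secures X.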
(One intuition pump, with some limitations outside the provability formalism, is treating □X as "probably X", perhaps according to what some prediction market tells you. If "probably X" is enough to prompt you to enact X, that's some kind of endorsement, and it's a push towards increasing the equilibrium-on-reflection value of the probability of X, pushing "probably X" closer to reality. But if X is terrible, then enacting it in response to its high probability is following along with self-fulfilling doom, rather than doing what you can to push the equilibrium away from it.)
Löb's Theorem then says that if we merely endorse a belief by enacting the believed outcome, this is sufficient for the outcome to actually happen, a priori and without that belief yet being in evidence. And Payor's Lemma says that if we merely endorse a coalition's coordinated hope by enacting the hoped-for outcome, this is sufficient for the outcome to actually happen, a priori and without the coordination around that hope yet being in evidence. The use of Löb's Theorem or Payor's Lemma is that the condition (belief in X, or coordination around hope for X) should help in making the endorsement, that is it should be easier to decide to enact X if you already believe that X, or if you already believe that your coalition is hoping for X. For coordination, this is important because every agent can only unilaterally enact its own part in the joint policy, so it does need some kind of premise about the coalition's nature (in this case, about the coalition's tentative hope for what it aims to achieve) in order to endorse playing its part in the coalition's joint policy. It's easier to decide to sign an assurance contract than to unconditionally donate to a project, and the role of Payor's Lemma is to say that if everyone does sign the assurance contract, then the project will in fact get funded sufficiently.
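For reference, here are the two results in the notation assumed above (⊢ for derivability, □ for belief/provability), along with the standard short derivation of Payor's Lemma, which needs only necessitation and distribution rather than Löb's Theorem itself:

Löb's Theorem: if ⊢ □X → X, then ⊢ X.
Payor's Lemma: if ⊢ □(□X → X) → X, then ⊢ X.

Derivation of Payor's Lemma, assuming ⊢ □(□X → X) → X:

1. ⊢ X → (□X → X) (propositional tautology)
2. ⊢ □X → □(□X → X) (necessitation and distribution applied to 1)
3. ⊢ □X → X (chaining 2 with the assumption)
4. ⊢ □(□X → X) (necessitation applied to 3)
5. ⊢ X (the assumption applied to 4)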
> 1. Plan A is to race to build a Friendly AI before someone builds an unFriendly AI.
>
> [...] Eliezer himself is now trying hard to change 1
This is not a recent development: a pivotal act AI is not a Friendly AI (which would be too difficult), but rather something like a lasting AI ban/pause enforcement AI that doesn't kill everyone, or a human uploading AI that does nothing else. That's where you presumably need decision theory, but not ethics, metaethics, or much of broader philosophy.
The "ten people on the inside" direct AIs to useful projects within their resource allocation. The AGIs themselves direct their own projects according to their propensities, which might be influenced by publicly available Internet text, possibly to a greater extent if it's old enough to be part of pretraining datasets.
The amount of resources that AGIs direct on their own initiative might dwarf the amount of resources of the "ten people on the inside", so the impact of openly published technical plans (that make sense on their own merits) might be significant. While AGIs could come up with any of these ideas on their own, path dependence of the acute risk period might still make their initial propensities to pay attention to particular plans matter.
> You can't really have a technical "Plan E" because there is approximately no one to implement the plan
AGIs themselves will be implementing some sort of plan (perhaps at very vague and disorganized prompting from humans, or without any prompting at all), and that plan might be influenced by blog posts and such in publicly available Internet text. This could be relevant for mitigating ASI misalignment if these AGIs are sufficiently aligned to the future of humanity, more so than some of the hypothetical future ASIs (created without following such a plan).
What happens with gradual disempowerment in this picture? Even Plan A seems compatible with handing off increasing levels of influence to AIs. One benefit of "shut it all down" (AGI Pause) is ruling out this problem by not having AGIs around (at least while the Pause lasts, which is also when the exit strategy needs to be prepared, not merely technical alignment).
Gradual disempowerment risks transitioning into permanent disempowerment (if not extinction), where a successful solution to technical ASI-grade alignment by the AIs might result in the future of humanity surviving, but only getting a tiny sliver of resources compared to the AIs, with no way of ever changing that even on cosmic timescales. Permanent disempowerment doesn't even need to involve a takeover.
Also, in the absence of "shut it all down", at some point targeting misalignment risks might be less impactful on the margin than targeting improvements in education (about AI risks and cruxes of mitigation strategies), coordination technologies, and AI Control. These enable directing more resources to misalignment risk mitigation as appropriate, including getting back to "shut it all down", a more robust ASI Pause, or making creation of increasingly capable AGIs non-lethal if misaligned (not a "first critical try").
My point is that the 10-30x AIs might be more effective at coordination around AI risk than humans alone, in particular more effective than currently seems feasible in the relevant timeframe (when not taking into account the use of those 10-30x AIs). Saying "labs" doesn't make this distinction explicit.
> with 10-30x AIs, solving alignment takes like 1-3 years of work ... so a crucial factor is US government buy-in for nonproliferation
Those AIs might be able to lobby for nonproliferation or do things like write a better IABIED, making coordination interventions that oppose myopic racing. Directing AIs to pursue such projects could be a priority comparable to direct alignment work. It's unclear how visibly asymmetric such interventions will prove to be, but then alignment vs. capabilities work might be in a similar situation.
I think such arguments buy us those 5% of no-takeover (conditional on superintelligence soon), and some of the moderate permanent disempowerment outcomes (maybe the future of humanity gets a whole galaxy out of 4 billion or so galaxies in the reachable universe), as distinct from almost total permanent disempowerment or extinction. Though I expect that it matters which specific projects we ask early AGIs to work on, more than how aligned these early AGIs are, basically for the reasons that companies and institutions employing humans are not centrally concerned with alignment of their employees in the ambitious sense, at the level of terminal values. More time to think of better projects for early AGIs, and time to reflect on pieces of feedback from such projects done by early AGIs, might significantly improve the chances for making ambitious alignment of superintelligence work eventually, on the first critical try, however long it takes to get ready to risk it.
If creation of superintelligence is happening on a schedule dictated by economics of technology adoption rather than by taking exactly the steps that we already know how to take correctly by the time we take them, affordances available to qualitatively smarter AIs will get out of control. And their misalignment (in the ambitious sense, at the level of terminal values) will lead them to take over rather than comply with humanity's intentions and expectations, even if their own intentions and expectations don't involve humanity literally going extinct.
> I think AI takeover is plausible. But Eliezer’s argument that it’s more than 98% likely to happen does not stand up to scrutiny
I think the part of the argument where an AI takeover is almost certain to happen if superintelligence[1] is created soon is extremely convincing (I'd give this 95%), while the part where an AI takeover almost certainly results in everyone dying is not. I'd only give 10-30% to everyone dying given an AI takeover (which is not really a decision-relevant distinction, just a major difference in models).
But also the outcome of not dying from an AI takeover cashes out as permanent disempowerment, that is humanity not getting more than a trivial share of the reachable universe, with AIs instead taking almost everything. It's not centrally a good outcome that a sane civilization should be bringing about, even as it's also not centrally "doom". So the distinction between AI takeover and the book's titular everyone dying can be a crux; they aren't interchangeable.
AIs that are collectively qualitatively better than the whole of humanity at stuff, beyond being merely faster and somewhat above the level of the best humans at everything at the same time. ↩︎
A strange attitude towards the physical world can be reframed as caring only about some abstract world that happens to resemble the physical world in some ways. A chess AI could be said to be acting on some specific physical chessboard within the real world while carefully avoiding all concern about everything else, but it's more naturally described as acting on just the abstract chessboard, nothing else. I think values/preferences (for some arbitrary agent) should be not just about probutility over the physical world, but should also specify which world they are talking about. Then different agents are not just normatively disagreeing about the relative value of events, but about which worlds are worth caring about (not just possible worlds within some space of nearby possible worlds, but fundamentally very different abstract worlds), and therefore about what kinds of events (from which sample spaces) ought to serve as semantics for possible actions, before their value can be considered.
A world model (such as an LLM with frozen weights) is already an abstraction, its data is not the same as the physical world itself, but it's coordinated with the physical world to some extent, similarly to how an abstract chessboard is coordinated with a specific physical chessboard in the real world (learning is coordination, adjusting the model so that the model and the world have more shared explanations for their details). Thus acting within an abstract world given by a world model (as opposed to within the physical world itself) might be a useful framing for systematically ignoring some aspects of the physical world, and world models could be intentionally crafted to emphasize particular aspects.