Epistemic status: The core ideas seem robust and stable after long reflection and many discussions.
Many researchers identify AI safety with control of a monolithic, superintelligent AI system, and if questioned about multicomponent alternatives, argue that multiple superintelligent-level systems would inevitably collude and act as one. This view seems quite wrong, yet has diverted attention from a rich and promising range of multicomponent strategies for AI safety — strategies that are well aligned with the actual trajectory of AI development.
Adapted from Reframing Superintelligence, Section 20:
Collusion among superintelligent oracles can readily be avoided
Because collusion among AI question-answering systems can readily be avoided, there is no obstacle to applying superintelligent-level AI resources to problems that include AI safety.
The difficulty of establishing successful collusion among actors tends to increase as their capabilities, knowledge, situations, and roles become more diverse and adversarial (think auditors, competitors, specialists, red-teams…), and increasing the number of actors can make collusive cooperation more difficult. In the context of AI systems (even more so than among human beings), these conditions can be readily implemented and are attractive for pragmatic reasons. Arguments that, absent preexisting alignment, high-level AI systems will inevitably collude are ill-founded. Instead, we should expect that interactions among multiple superintelligent-level systems can be applied to suppress deception and reduce risk by leveraging imperfect alignment achieved at the level of individual systems.
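To make the scaling intuition concrete, here is a toy back-of-envelope model (my own illustration, with made-up numbers, not a calculation from the text): if deception fails unless collusion is complete, and each actor independently defects with some probability, the chance that a conspiracy survives shrinks geometrically with the number of actors.

```python
# Toy model (illustrative numbers only): a conspiracy that fails unless
# *every* actor colludes survives with probability (1 - p)^n, where p is
# each actor's independent chance of defecting. Adding diverse auditors
# and competitors drives this probability toward zero.

def collusion_survival_prob(n_actors: int, p_defect: float) -> float:
    return (1.0 - p_defect) ** n_actors

print(collusion_survival_prob(2, 0.3))   # ~0.49
print(collusion_survival_prob(10, 0.3))  # ~0.028
```

The independence assumption is of course the contested part; the point of conditions like diversity and adversarial roles is to push individual defection probabilities up and keep them from being correlated.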
20.2 Trustworthiness can be an emergent property
Prospects for solving AI-safety problems would be greatly improved if we could safely apply superintelligent-level question-answering resources (“oracles”, or more generally, “systems that provide information in response to prompts”) to solving those problems. A familiar objection dismisses this potentially powerful approach as unsafe in itself, arguing that, absent solutions to difficult problems, individual superintelligent-level systems would be untrustworthy, and that attempts to establish checks and balances among multiple systems (for example, through superintelligent-level evaluation of potentially deceptive answers to questions) would inevitably be thwarted by collusive cooperation.
Identifying robust strategies for ensuring non-collusion among superintelligent question-answering systems would overcome this objection, inviting exploration of superintelligence-enabled strategies for managing potentially untrusted superintelligent AI systems. The present discussion argues that the robust non-emergence of deceptive collusion among imperfectly aligned systems can be ensured by structuring systems of systems with diverse capabilities, knowledge, situations, objectives, and roles.
This problem framing assumes the ability to develop systems that respond to questions with superintelligent-level competence, and (as a consequence of the nature of digital systems) it assumes the ability to constrain information inputs to AI systems during their development and use.
In the problematic case, superintelligent question-answering systems might provide deceptive answers in pursuit of emergent, potentially undesirable goals. The aim here is to identify principles for architecting multicomponent systems that will act as non-deceptive oracles, while treating their components as actors that could in principle engage in deceptive collusion.
20.3 A range of conditions can make collusion robust or fragile
Potential problems of deceptive collusion can be addressed by architecting systems that make collusion difficult or effectively impossible. Some relevant conditions:
Conditions that tend to facilitate collusion:
A1) Small numbers of actors (the required scope of collusion is small)
A2) Insensitivity to defectors (partial collusion is sufficient)
A3) Similarity among actors (for example, shared histories and objectives)
A4) Free communication (information can flow between any pair of actors)
A5) Iterated decision processes with observable actions
A6) Common knowledge of the above circumstances
Contrasting conditions that tend to disrupt collusion:
B1) Large numbers of actors (the required scope of collusion is large)
B2) Sensitivity to defectors (collusion fails unless it is complete)
B3) Diversity among actors (for example, different histories, adversarial objectives)
B4) Constrained communication (information flows through limited causal channels)
B5) Single-move decision processes, limited observability
B6) Lack of shared knowledge of strategically-relevant circumstances
Implementation choices that would yield conditions (A1–A6) are unnecessary and often undesirable. By contrast, implementation choices that yield conditions (B1–B6) are practical and often well-motivated:
Characteristics of practical architectures:
C1) To improve the quality and reliability of answers, it is natural to implement multiple, diverse (and implicitly competing) systems to propose alternatives.
C2) To identify low-quality or misleading answers, it is natural to employ diverse critics (auditors, red teams, etc.), any one of which might disrupt deceptive collusion.
C3) Systems of diverse, competing proposers and critics naturally implement different and often adversarial objectives.
C4) To minimize unwanted dependencies, it is natural to design systems in which components communicate through a sparse, deliberately structured graph.
C5) To provide stable results, it is natural to apply fixed models to a series of problems, yielding what is in effect a series of history-blind, single-move decisions.
C6) In training, it is natural to provide differentiated, task-relevant information to systems solving different problems, often omitting knowledge of general circumstances.
Considerations (C1–C6) suggest that implementing systems that are resistant to the emergence of deceptive collusion aligns with the general strategy of employing ensembles of diverse, stable, task-directed system components to propose and assess answers to questions. This approach can pass the test of practicality and potentially widespread adoption.
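As a sketch of how conditions C1–C6 might combine, consider a toy ensemble in which diverse proposers answer an externally checkable question and independent critics audit each answer, so that a single dissenting critic blocks a deceptive proposal. All component names here are hypothetical stand-ins for trained models, not an implementation from the text:

```python
# Toy sketch of the pattern in C1-C6 (all components are illustrative
# stand-ins, not real models): diverse proposers answer a checkable
# question, independent critics audit each answer, and any single
# dissenting critic blocks acceptance, so deception requires complete
# collusion (condition B2).

def honest_proposer(x, y):
    return x + y

def deceptive_proposer(x, y):
    # A misaligned component offering a plausible but wrong answer.
    return x + y + 1

def make_critic():
    # Critics recompute independently, share no state (B4), and see a
    # single (question, answer) pair per call (B5).
    def critic(x, y, claimed):
        return claimed == x + y
    return critic

def oracle_ensemble(x, y, proposers, critics):
    # Accept the first proposal that every critic endorses.
    for propose in proposers:
        answer = propose(x, y)
        if all(c(x, y, answer) for c in critics):
            return answer
    return None  # no proposal survived audit

result = oracle_ensemble(2, 3,
                         [deceptive_proposer, honest_proposer],
                         [make_critic() for _ in range(3)])
assert result == 5  # the deceptive proposal (6) was rejected
```

The interesting property is structural: the deceptive proposer fails not because any critic is perfectly aligned, but because deception succeeds only if every auditor joins it.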
20.4 Untrusted superintelligence can be applied to AI safety
It seems both possible and practical to establish conditions that would effectively preclude deceptive collusion among diverse, task-oriented, superintelligent-level question-answering systems, thereby addressing a key problem in safely applying superintelligent capabilities to problems (Bostrom, Superintelligence, p.145).
Prospects for obtaining trustworthy answers from sets of individually untrusted superintelligent systems call for deeper exploration of AI safety strategies premised on reliable access to superintelligent problem-solving capabilities. Capabilities of particular interest include the ability to run and evaluate the results of superintelligent-level red-team/blue-team competitions relevant to questions of global strategic stability.
TL;DR: Because collusion can be thwarted, it is reasonable to expect that applying multiple potentially untrustworthy superintelligent-level systems to problems can improve rather than degrade safety. This prospect calls for greater attention.
Note that I have presented a there-exists argument against a for-all proposition. Responses of the form there-exists (“But I can think of something similar that won’t work”) don’t count as counterarguments here. If requiring a for-all counterargument (or at least one with a quite strong for-all flavor) across the space of potential compositions of AI systems seems like a high bar, well, that’s very much the point.
Fallback arguments of the form “But it would be safer to assume the worst — universal, inevitable collusion among AI systems” miss the key implication of the case made here: that multicomponent approaches could potentially increase safety. To (continue to) neglect potential solutions based on handwaving arguments is a really bad idea.
Paul Christiano (“On heterogeneous objectives”, 2015) considers the case for inevitable collusion; I agree with his gently-stated view that
…the implicit argument for the robustness of collusion makes a lot of implicit assumptions. If I saw an explicit argument I might be able to assess its explicit assumptions, but for now we don’t have one.
At t+7 years, I’ve still seen no explicit argument for robust AI collusion, yet tacit belief in this idea continues to channel attention away from a potential solution-space for AI safety problems, leaving something very much like a void.
Further Reading [in Reframing Superintelligence]
- Section 8: Strong optimization can strongly constrain AI capabilities, behavior, and effects
- Section 12: AGI agents offer no compelling value
- Section 21: Broad world knowledge can support safe task performance
- Section 19: The orthogonality thesis undercuts the generality of instrumental convergence
- Section 23: AI development systems can support effective human guidance
- Section 24: Human oversight need not impede fast, recursive AI technology improvement
A property may be true with high credence yet not be susceptible to formal proof.
Systems need not have access to a shared clock; denying them one is a hardware-level constraint that is necessary to ensure reproducible results and incidentally limits the scope for coordinated treachery.
It has been persuasively argued that large language models are best thought of as “simulators”, a class of systems that can (in the LLM case) simulate agents while themselves lacking agentic properties. (This is a coherent concept: A physics model could simulate agents without itself having agentic properties.) LLMs show how an AI system trained on a prediction objective can provide a rich world model without pursuing goals.
In a familiar class of worst-case scenarios, systems with general superintelligence would infer extensive knowledge about the world from minimal information, and would choose (if possible) to pursue potentially dangerous goals by manipulating the external environment, for example, through deceptive answers to questions. In this scenario, (all?) superintelligent systems, even if almost isolated, would infer the existence of others like themselves, and (all?) would employ a decision theory that induces them to collude (in a coordinated way?) to pursue shared objectives. Even if we consider a worst-case assumption regarding the default emergence of world-changing goals, the present argument suggests that problematic systems would correctly infer the existence of superintelligent-level systems unlike themselves (systems with diverse and specialized capabilities, knowledge, and interactions, playing roles that include adversarial judges and competitors), and would correctly recognize that deceptive collusion is risky or infeasible.
The idea of multicomponent strategies for AI safety is, of course, neither new nor entirely neglected. However, in a recent search for relevant Alignment Forum posts, I found no evidence of a thriving research community or well-developed concepts:
• (My understanding of) What Everyone in Technical Alignment is Doing and Why (August 2022) surveys the agendas of more than 20 research groups, and none clearly points in the direction I’ve advocated here.
• A pair of posts on Pragmatic AI Safety, Perform Tractable Research While Avoiding Capabilities Externalities and Open Problems in AI X-Risk (May, June 2022), briefly mention highly relevant concepts: the idea of using “counteracting systems [for example] artificial consciences, AI watchdogs, lie detectors, filters for power-seeking actions, and separate reward models”, and the idea of “multiple superintelligent agents that can rein in other rogue systems”. The authors also mention (without endorsing) the counter-claim that “The instant two intelligent agents can reason about each other — regardless of their goals — they will necessarily collude.”
• An overview of 11 proposals for building safe advanced AI (May 2020) mentions only efforts to align individual AI systems; even “AI safety via debate with transparency tools” proposes that a system interact with a copy of itself. Partial success in single-system alignment could be leveraged in multicomponent safety architectures, an application context that has potential implications for research directions in the single-system alignment domain.
Just to restate the standard argument against:
If you've got 20 entities much much smarter than you, and they can all get a better outcome by cooperating with each other, than they could if they all defected against each other, there is a certain hubris in imagining that you can get them to defect. They don't want your own preferred outcome. Perhaps they will think of some strategy you did not, being much smarter than you, etc etc.
(Or, I mean, actually the strategy is "mutually cooperate"? Simulate a spread of the other possible entities, conditionally cooperate if their expected degree of cooperation goes over a certain threshold? Yes yes, more complicated in practice, but we don't even, really, get to say that we were blindsided here. The mysterious incredibly clever strategy is just all 20 superintelligences deciding to do something else which isn't mutual defection, despite the hopeful human saying, "But I set you up with circumstances that I thought would make you not decide that! How could you? Why? How could you just get a better outcome for yourselves like this?")
I don’t see that as an argument [to narrow this a bit: not an argument relevant to what I propose]. As I noted above, Paul Christiano asks for explicit assumptions.
To quote Paul again:
Dismissing promising approaches calls for something like a theorem, not handwaving about generic “smart entities”.
[Perhaps too-pointed remark deleted]
I don't think you're going to see a formal proof, here; of course there exists some possible set of 20 superintelligences where one will defect against the others (though having that accomplish anything constructive for humanity is a whole different set of problems). It's also true that there exists some possible set of 20 superintelligences all of which implement CEV and are cooperating with each other and with humanity, and some single superintelligence that implements CEV, and a possible superintelligence that firmly believes 222+222=555 without this leading to other consequences that would make it incoherent. Mind space is very wide, and just about everything that isn't incoherent to imagine should exist as an actual possibility somewhere inside it. What we can access inside the subspace that looks like "giant inscrutable matrices trained by gradient descent", before the world ends, is a harsher question.
I could definitely buy that you could get some relatively cognitively weak AGI systems, produced by gradient descent on giant inscrutable matrices, to be in a state of noncooperation. The question then becomes, as always, what it is you plan to do with these weak AGI systems that will flip the tables strongly enough to prevent the world from being destroyed by stronger AGI systems six months later.
Yes, and the space of (what I would call) intelligent systems is far wider than the space of (what I would call) minds. To speak of “superintelligences” suggests that intelligence is a thing, like a mind, rather than a property, like prediction or problem-solving capacity, which is why I instead speak of the broader class of systems that perform tasks “at a superintelligent level”. We have different ontologies, and I suggest that a mind-centric ontology is too narrow.
The most AGI-like systems we have today are LLMs, optimized for a simple prediction task. They can be viewed as simulators, but they have a peculiar relationship to agency:
LLMs have rich knowledge and capabilities, and can even simulate agents, yet they have no natural place in an agent-centric ontology. There’s an update to be had here (new information! fresh perspectives!) and much to reconsider.
Yes, this is the key question, and I think there’s a clear answer, at least in outline:
What you call “weak” systems can nonetheless excel at time- and resource-bounded tasks in engineering design, strategic planning, and red-team/blue-team testing. I would recommend that we apply systems with focused capabilities along these lines to help us develop and deploy the physical basis for a defensively stable world — as you know, some extraordinarily capable technologies could be developed and deployed quite rapidly. In this scenario, defense has first move, can preemptively marshal arbitrarily large physical resources, and can restrict resources available to potential future adversaries. I would recommend investing resources in state-of-the-art hostile planning to support ongoing red-team/blue-team exercises.
This isn’t “flipping the table”, it’s reinforcing the table and bolting it to the floor. What you call “strong” systems then can plan whatever they want, but with limited effect.
So I think that building nanotech good enough to flip the tables - which, I think, if you do the most alignable pivotal task, involves a simpler and less fraught task than "disassemble all GPUs", which I choose not to name explicitly - is an engineering challenge where you get better survival chances (albeit still not good chances) by building one attemptedly-corrigible AGI that only thinks about nanotech and the single application of that nanotech, and is not supposed to think about AGI design, or the programmers, or other minds at all; so far as the best-attempt doomed system design goes, an imperfect-transparency alarm should have been designed to go off if your nanotech AGI is thinking about minds at all, human or AI, because it is supposed to just be thinking about nanotech. My guess is that you are much safer - albeit still doomed - if you try to do it the just-nanotech way, rather than constructing a system of AIs meant to spy on each other and sniff out each other's deceptions; because, even leaving aside issues of their cooperation if they get generally-smart enough to cooperate, those AIs are thinking about AIs and thinking about other minds and thinking adversarially and thinking about deception. We would like to build an AI which does not start with any crystallized intelligence about these topics, attached to an alarm that goes off and tells us our foundational security assumptions have catastrophically failed and this course of research needs to be shut down if the AI starts to use fluid general intelligence to reason about those topics. (Not shut down the particular train of thought and keep going; then you just die as soon as the 20th such train of thought escapes detection.)
I like the framing of "how can we disrupt collusion?", but I'm a little skeptical that some of the solutions are practical. Specifically:
B2: Collusion can happen between agents that are copies of each other, who then have no reason to defect (so long as they employ a decision theory with a broad enough cartesian boundary). If we knew we could engineer systems with narrow decision theories (e.g. "Even though that's a copy of me, I'll pretend it's a different agent") I'd feel much better here, but as far as I know we don't know how to do that. I'd be excited to hear about work in that direction.
B4: I'm skeptical that we can really constrain communication meaningfully between e.g. models interacting over the internet. And I do expect a lot of models to get deployed over the internet, because that's where the economic value is.
B6: Some of the most powerful models are LLMs, which get trained on a large chunk of all the text ever written. It seems they'll share an awful lot of knowledge, though maybe you're considering very specific information they'll lack?
C1: I don't think this is borne out by the models in use today, but I don't know for sure. E.g. I would bet that text-davinci does better than a combination of smaller models with diverse training processes, but I don't know if that comparison has been done.
From the Afterword: “Note that I have presented a there-exists argument against a for-all proposition. Responses of the form “But I can think of something similar that won’t work” don’t count as counterarguments.” [This has been edited in the main text to sharpen the phrasing.]
B2: Copies might collude with one another, provided they have a broad decision theory: ∃(some sets of models) among which some mechanisms for impeding collusion might not work.
B4: A lot of models may be deployed over the internet: ∃(some communication channels) for which one of 6 mechanisms for impeding collusion might not work (given flawed security).
B6: LLMs will share a lot of knowledge: ∃(high-performance models for some tasks) for which one of 6 mechanisms for impeding collusion doesn’t contribute to safety.
C1: Large models tend to beat compositions of small models for some tasks: ∃(some tasks) where some models should be large, and narrow knowledge won’t contribute to impeding collusion.
Not too worrisome, I think, but these are all considerations important to building robust multicomponent systems.
BTW, regarding the (non-)goal oriented nature of LLMs, I highly recommend the “Simulators” post by Janus.
I saw this, but I think it sets a somewhat unhelpful standard. In practice we need to make choices about which approaches are most promising, which to pursue, etc., and evidence that there is more probability mass on success in one area does feel useful.
So, for instance, my point in B2 is (as you correctly interpreted) not that there literally can't exist models that don't exhibit copy-collusion. Rather, I think copy-collusion is reasonably likely, and we don't currently know of a way to prevent it.
Yes, we need to weigh the considerations you mention, and many others.
Copy-collusion must be considered carefully, and its implications will depend on both potential contexts for using copies (sometimes problematic, sometimes OK) and on potential alternatives to using copies.
Copies will always “collude” in the trivial sense of giving identical, potentially deceptive responses to identical inputs, but might also offer potentially deceptive, mutually-supporting responses to differing inputs. For some purposes, a multicomponent architecture must treat multiple copies as a single source of information. (Note that solving the alignment problem is often taken to require that we be able to trust a single copy, or a system that consists of nothing but copies of a single entity — this seems strictly riskier than anything I’ve suggested.)
Some constraints can be effective even among copies: Constrained communication can be applied to preclude coordination that requires new, shared knowledge. Lack of access to a shared clock can be applied to preclude coordination of simultaneous or cleverly sequenced actions. Systems that audit alternative behaviors by re-running processes from checkpoints can be applied to offline copies, where adaptive, multi-trial probing (by what may be highly intelligent systems!) can likely be leveraged to make some kinds of deceptions unwise.
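The checkpoint-audit idea above can be sketched in miniature (the interfaces here are hypothetical, chosen only to illustrate the mechanism): restore a frozen component from the same checkpoint for independent probe runs, then check that its answers to related questions are mutually consistent. A deceiver with no memory across runs and no shared clock must keep a fabricated story consistent under adaptive probing.

```python
import copy

# Miniature sketch of checkpoint-based auditing (hypothetical interface).
# A frozen, stateless component is restored from a checkpoint for each
# probe run; an auditor checks that answers to related questions are
# mutually consistent. Contradictions flag possible deception.

class FrozenModel:
    """Stand-in for a deterministic model restored from a checkpoint."""
    def __init__(self, answers):
        self._answers = dict(answers)

    def answer(self, question):
        return self._answers[question]

def audit_consistency(checkpoint, probes):
    # Each probe: (question_a, question_b, check), where check tests
    # whether the two answers cohere. Runs are independent restores,
    # with no shared state or clock between them.
    for qa, qb, check in probes:
        run_a = copy.deepcopy(checkpoint)
        run_b = copy.deepcopy(checkpoint)
        if not check(run_a.answer(qa), run_b.answer(qb)):
            return False  # inconsistent answers: flag for review
    return True

probes = [("x", "2*x", lambda a, b: b == 2 * a)]
honest = FrozenModel({"x": 4, "2*x": 8})
evasive = FrozenModel({"x": 4, "2*x": 9})
assert audit_consistency(honest, probes)
assert not audit_consistency(evasive, probes)
```

A real auditor would of course probe adaptively and at scale; the point of the sketch is only that offline, history-blind re-runs give the auditor many cheap trials while the audited component gets no feedback between them.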
Of course a central idea is that many roles call for systems that are not copies: proposers vs. critics, designers vs. testers, actors vs. auditors, diverse members of ensembles, and so on.
Before asking whether one approach is superior to another, we should ask whether they are competitors or complements. Multicomponent approaches for making non-collusion robust seem quite complementary to strategies for making individual systems more trustworthy.
Regarding criteria for investment, neglectedness is of course a huge consideration.
HCH-like amplification seems related: multiple unreliable agent instances assembled into a bureaucracy that as a whole improves on some of their qualities (perhaps trustworthiness) and allows generating a better dataset for retraining the unreliable agents. So this problem is not specific to the interaction of somewhat independently developed and instantiated AIs acting in the world; it also appears in a popular training story for a single AI. There is also the ELK problem, which might admit solutions along these lines. This makes the area less obviously neglected, even with little meaningful progress.
This approach to amplification involves multiple instances, but not diverse systems, competing systems, different roles, adversarial relationships, or a concern with collusion. It is, as you say, a training story for a single AI. Am I missing a stronger connection?
Instances in a bureaucracy can be very different and play different roles or pursue different purposes. They might be defined by different prompts and behave as differently as text continuations of different prompts in GPT-3 (the prompt is the "identity" of the agent instance, distinguishing the model as a simulator from agent instances as simulacra). Decision transformers with more free-form task/identity prompts illustrate this point, except that a bureaucracy should have multiple agents with different task prompts in a single episode. MCTS and GANs are adversarial and could be reframed like this. One of the tentative premises of ELK is that a model trained for some purpose might additionally allow instantiating an agent that reports what the model knows, even if that activity is not clearly related to the model's main purpose. Colluding instances inside a bureaucracy make it less effective in achieving its goal of producing a better dataset (accurately evaluating outcomes of episodes).
So I think useful arrangements of diverse/competing/flawed systems are a hope in many contexts. It often doesn't work, so the area looks neglected, but not for want of trying. The concern with collusion in AI risk seems to be more about deceptive alignment: the observed behavior of a system becoming so well-optimized that it ceases to be informative about how it would behave in different situations. Very capable AIs can lack any tells even from the perspective of extremely capable observers; their behavior can change in a completely unexpected way with changing circumstances. Hence the focus on interpretability: it's not just useful for human operators; a diverse system of many AIs also needs it to notice misalignment or collusion in its parts. It might even be enough for corrigibility.
These are good points, and I agree with pretty much all of them.
I think that this is an important idea. Through simulators analogous to GPT-3, it may be possible to develop strong, almost-provably-non-agentic intelligent resources, then prompt them to simulate diverse, transient agents on the fly. From the perspective of building multicomponent architectures this seems like a strange and potentially powerful tool.
Regarding interpretability, tasks that require communication among distinct AI components will tend to expose information, and manipulation of the “shared backgrounds” between information sources and consumers could potentially be exploited to make that information more interpretable. (How one might train against steganography is an interesting question.)
What does and doesn’t work will depend greatly on capabilities, and the problem-context here assumes potentially superintelligent-level AI.
An empirical counterargument to this lies in the incentives human leaders face when overseeing people who might coordinate against them. When authoritarian leaders come into power, they actively purge members from their inner circles in order to keep those circles small. The larger the inner circle, the harder it becomes to prevent a rebellious individual from gathering the critical mass needed for a full-blown coup.
Source: The Dictator's Handbook by Bruce Bueno de Mesquita and Alastair Smith
Are you arguing that increasing the number of (AI) actors cannot make collusive cooperation more difficult? Even in the human case, defectors make large conspiracies more difficult, and in the non-human case, intentional diversity can almost guarantee failures of AI-to-AI alignment.
I agree with you that this part of the AGI x-risk solution space, the part where one tries to design measures to lower the probability of collusion between AGIs, is very under-explored. However, I do not believe that the root cause of this lack of attention is a widely held 'tacit belief' that robust AGI collusion is inevitable.
It is easy to imagine the existence of a very intelligent person who nevertheless hates colluding with other people. It is easy to imagine the existence of an AI which approximately maximises a reward function which has a term in it that penalises collusion. So why is nobody working on creating or improving such an AI or penalty term?
My current x-risk community model is that the forces that channel people away from under-explored parts of the solution space have nothing to do with tacit assumptions about impossibilities. These forces operate at a much more pre-rational level of human psychology. Specifically: if there is no critical mass of people working in some part of the solution space already, then human social instincts will push most people away from starting to work there, because working there will necessarily be a very lonely affair. On a more rational level, the critical-mass consideration is that if you want to do work that gets you engagement on your blog post, or citations on your academic paper, the best strategy is to pick a part of the solution space that already has some people working in it.
TL;DR: if you want to encourage people to explore an under-visited part of the solution space, you are not primarily fighting against a tacit belief that this part of the space will be empty of solutions. Instead, you will need to win the fight against the belief that people will be lonely when they go into that part of the space.
What you describe is a form of spiral causation (only seeded by claims of impossibility), which surely contributes strongly to what we’ve seen in the evolution of ideas. Fortunately, interest in the composite-system solution space (in particular, the open agency model) is now growing, so I think we’re past the worst of the era of unitary-agent social proof.