Applying superintelligence without collusion

Just to restate the standard argument against:

If you've got 20 entities much much smarter than you, and they can all get a better outcome by cooperating with each other, than they could if they all defected against each other, there is a certain hubris in imagining that you can get them to defect. They don't want your own preferred outcome. Perhaps they will think of some strategy you did not, being much smarter than you, etc etc.

(Or, I mean, actually the strategy is "mutually cooperate"? Simulate a spread of the other possible entities, conditionally cooperate if their expected degree of cooperation goes over a certain threshold? Yes yes, more complicated in practice, but we don't even, really, get to say that we were blindsided here. The mysterious incredibly clever strategy is just all 20 superintelligences deciding to do something else which isn't mutual defection, despite the hopeful human saying, "But I set you up with circumstances that I thought would make you not decide that! How could you? Why? How could you just get a better outcome for yourselves like this?")

[-]Eric Drexler3y*78

I don’t see that as an argument [to narrow this a bit: not an argument relevant to what I propose]. As I noted above, Paul Christiano asks for explicit assumptions.

To quote Paul again:.

I think that concerns about collusion are relatively widespread amongst the minority of people most interested in AI control. And these concerns have in fact led to people dismissing many otherwise-promising approaches to AI control, so it is de facto an important question

Dismissing promising approaches calls for something like a theorem, not handwaving about generic “smart entities”.

[Perhaps too-pointed remark deleted]

[-]Eliezer Yudkowsky3y78

I don't think you're going to see a formal proof, here; of course there exists some possible set of 20 superintelligences where one will defect against the others (though having that accomplish anything constructive for humanity is a whole different set of problems). It's also true that there exists some possible set of 20 superintelligences all of which implement CEV and are cooperating with each other and with humanity, and some single superintelligence that implements CEV, and a possible superintelligence that firmly believes 222+222=555 without this leading to other consequences that would make it incoherent. Mind space is very wide, and just about everything that isn't incoherent to imagine should exist as an actual possibility somewhere inside it. What we can access inside the subspace that looks like "giant inscrutable matrices trained by gradient descent", before the world ends, is a harsher question.

I could definitely buy that you could get some relatively cognitively weak AGI systems, produced by gradient descent on giant inscrutable matrices, to be in a state of noncooperation. The question then becomes, as always, what it is you plan to do with these weak AGI systems that will flip the tables strongly enough to prevent the world from being destroyed by stronger AGI systems six months later.

[-]Eric Drexler3y1724

Mind space is very wide

Yes, and the space of (what I would call) intelligent systems is far wider than the space of (what I would call) minds. To speak of “superintelligences” suggests that intelligence is a thing, like a mind, rather than a property, like prediction or problem-solving capacity. This is which is why I instead speak of the broader class of systems that perform tasks “at a superintelligent level”. We have different ontologies, and I suggest that a mind-centric ontology is too narrow.

The most AGI-like systems we have today are LLMs, optimized for a simple prediction task. They can be viewed as simulators, but they have a peculiar relationship to agency:

The simulation objective
A simulator trained with machine learning is optimized to accurately model its training distribution – in contrast to, for instance, maximizing the output of a reward function or accomplishing objectives in an environment.… Optimizing toward the simulation objective notably does not incentivize instrumentally convergent behaviors the way that reward functions which evaluate trajectories do.

LLMs have rich knowledge and capabilities, and can even simulate agents, yet they have no natural place in an agent-centric ontology. There’s an update to be had here (new information! fresh perspectives!) and much to reconsider.

[-]Eric Drexler3y1010

The question then becomes, as always, what it is you plan to do with these weak AGI systems that will flip the tables strongly enough to prevent the world from being destroyed by stronger AGI systems six months later.

Yes, this is the key question, and I think there’s a clear answer, at least in outline:

What you call “weak” systems can nonetheless excel at time- and resource-bounded tasks in engineering design, strategic planning, and red-team/blue-team testing. I would recommend that we apply systems with focused capabilities along these lines to help us develop and deploy the physical basis for a defensively stable world — as you know, some extraordinarily capable technologies could be developed and deployed quite rapidly. In this scenario, defense has first move, can preemptively marshal arbitrarily large physical resources, and can restrict resources available to potential future adversaries. I would recommend investing resources in state-of-the-art hostile planning to support ongoing red-team/blue-team exercises.

This isn’t “flipping the table”, it’s reinforcing the table and bolting it to the floor. What you call “strong” systems then can plan whatever they want, but with limited effect.

[-]Eliezer Yudkowsky3y*1314

So I think that building nanotech good enough to flip the tables - which, I think, if you do the most alignable pivotal task, involves a simpler and less fraught task than "disassemble all GPUs", which I choose not to name explicitly - is an engineering challenge where you get better survival chances (albeit still not good chances) by building one attemptedly-corrigible AGI that only thinks about nanotech and the single application of that nanotech, and is not supposed to think about AGI design, or the programmers, or other minds at all; so far as the best-attempt doomed system design goes, an imperfect-transparency alarm should have been designed to go off if your nanotech AGI is thinking about minds at all, human or AI, because it is supposed to just be thinking about nanotech. My guess is that you are much safer - albeit still doomed - if you try to do it the just-nanotech way, rather than constructing a system of AIs meant to spy on each other and sniff out each other's deceptions; because, even leaving aside issues of their cooperation if they get generally-smart enough to cooperate, those AIs are thinking about AIs and thinking about other minds and thinking adversarially and thinking about deception. We would like to build an AI which does not start with any crystallized intelligence about these topics, attached to an alarm that goes off and tells us our foundational security assumptions have catastrophically failed and this course of research needs to be shut down if the AI starts to use fluid general intelligence to reason about those topics. (Not shut down the particular train of thought and keep going; then you just die as soon as the 20th such train of thought escapes detection.)

[-]Adam Jermyn3y40

I like the framing of "how can we disrupt collusion?", but I'm a little skeptical that some of the solutions are practical. Specifically:

B2: Collusion can happen between agents that are copies of each other, who then have no reason to defect (so long as they employ a decision theory with a broad enough cartesian boundary). If we knew we could engineer systems with narrow decision theories (e.g. "Even though that's a copy of me, I'll pretend it's a different agent") I'd feel much better here, but as far as I know we don't know how to do that. I'd be excited to hear about work in that direction.

B4: I'm skeptical that we can really constrain communication meaningfully between e.g. models interacting over the internet. And I do expect a lot of models to get deployed over the internet, because that's where the economic value is.

B6: Some of the most powerful models are LLM's, which get trained on a large chunk of all of the text ever written. It seems they'll share an awful lot of knowledge, though maybe you're considering very specific information they'll lack?

C1: I don't think this is born out by the models in use today, but I don't know for sure. E.g. I would bet that text-davinci does better than a combination of smaller models with diverse training processes, but I don't know if that comparison has been done.

[-]Eric Drexler3y20

From the Afterword: “Note that I have presented a there-exists argument against a for-all proposition. Responses of the form “But I can think of something similar that won’t work” don’t count as counterarguments.” [This has been edited in the main text to sharpen the phrasing.]

B2: Copies might collude with one another, provided they have a broad decision theory: ∃(some sets of models) among which some mechanisms for impeding collusion might not work.

B4: A lot of models may be deployed over the internet: ∃(some communication channels) for which one of 6 mechanisms for impeding collusion might not work (given flawed security).

B6: LLM's will share a lot of knowledge: ∃(high performance models for some tasks) for which one of 6 mechanisms for impeding collusion doesn’t contribute to safety.

C1: Large models tend to beat compositions of small models for some tasks: ∃(some tasks) where some models should be large, and narrow knowledge won’t contribute to impeding collusion.

Not too worrisome, I think, but these are all considerations important to building robust multicomponent systems.

BTW, regarding the (non-)goal oriented nature of LLMs, I highly recommend the “Simulators” post by Janus.

[-]Adam Jermyn3y30

From the Afterword: “Note that I have presented a there-exists argument against a for-all proposition. Responses of the form “But I can think of something similar that won’t work” don’t count as counterarguments.” [This has been edited in the main text to sharpen the phrasing.]

I saw this, but I think it sets a somewhat unhelpful standard. In practice we need to make choices about which approaches are most promising, which to pursue, etc., and evidence that there is more probability mass on success in one area does feel useful.

So, for instance, my point in B2 is (as you correctly interpreted) not that there literally can't exist models that don't exhibit copy-collusion. Rather, I think copy-collusion is reasonably likely, and we don't currently know of a way to prevent it.

[-]Eric Drexler3y30

Yes, we need to weigh the considerations you mention, and many others.

I think copy-collusion is reasonably likely, and we don't currently know of a way to prevent it.

Copy-collusion must be considered carefully, and its implications will depend on both potential contexts for using copies (sometimes problematic, sometimes OK) and on potential alternatives to using copies.

Copies will always “collude” in the trivial sense of giving identical, potentially deceptive responses to identical inputs, but might also offer potentially deceptive, mutually-supporting responses to differing inputs. For some purposes, a multicomponent architecture must treat multiple copies as a single source of information. (Note that solving the alignment problem is often taken to require that we be able to trust a single copy, or a system that consists of nothing but copies of a single entity — this seems strictly riskier than anything I’ve suggested.)

Some constraints can be effective even among copies: Constrained communication can be applied to preclude coordination that requires new, shared knowledge. Lack of access to a shared clock can be applied to preclude coordination of simultaneous or cleverly sequenced actions. Systems that audit alternative behaviors by re-running processes from checkpoints can be applied to offline copies, where adaptive, multi-trial probing (by what may be highly intelligent systems!) can likely be leveraged to make some kinds of deceptions unwise.

Of course a central idea is that many roles call for systems that are not copies: proposers vs. critics, designers vs. testers, actors vs. auditors, diverse members of ensembles, and so on.

we need to make choices about which approaches are most promising, which to pursue, etc.

Before asking whether one approach is superior to another, we should ask whether they are competitors or complements. Multicomponent approaches for making non-collusion robust seem quite complementary to strategies for making individual systems more trustworthy.

Regarding criteria for investment, neglectedness is of course a huge consideration.

[-]Vladimir_Nesov3y30

HCH-like amplification seems related, multiple unreliable agent instances assembled into a bureaucracy that as a whole improves on some of their qualities (perhaps trustworthiness) and allows generating a better dataset for retraining the unreliable agents. So this problem is not specific to interaction of somewhat independently developed and instantiated AIs acting in the world, as it appears in a popular training story of a single AI. There is also the ELK problem that might admit solutions along these lines. This makes it less obviously neglected, even with little meaningful progress.

[-]Eric Drexler3y10

HCH-like amplification seems related, multiple unreliable agent instances assembled into a bureaucracy that as a whole improves on some of their qualities

This approach to amplification involves multiple instances, but not diverse systems, competing systems, different roles, adversarial relationships, or a concern with collusion. It is, as you say a training story for a single AI. Am I missing a stronger connection?

[-]Vladimir_Nesov3y20

Instances in a bureaucracy can be very different and play different roles or pursue different purposes. They might be defined by different prompts and behave as differently as text continuations of different prompts in GPT-3 (the prompt is "identity" of the agent instance, distinguishes the model as a simulator from agent instances as simulacra). Decision transformers with more free-form task/identity prompts illustrate this point, except a bureaucracy should have multiple agents with different task prompts in a single episode. MCTS and GAN are adversarial and could be reframed like this. One of the tentative premises of ELK is that a model trained for some purpose might additionally allow instantiating an agent that reports what the model knows, even if that activity is not clearly related to the model's main purpose. Colluding instances inside a bureaucracy make it less effective in achieving its goals of producing a better dataset (accurately evaluating outcomes of episodes).

So I think useful arrangements of diverse/competing/flawed systems is a hope in many contexts. It often doesn't work, so looks neglected, but not for want of trying. The concern with collusion in AI risk seems to be more about deceptive alignment, observed behavior of a system becoming so well-optimized that it ceases to be informative about how it would behave in different situations. Very capable AIs can lack any tells even from the perspective of extremely capable observers, their behavior can change in a completely unexpected way with changing circumstances. Hence the focus on interpretability, it's not just useful for human operators, a diverse system of many AIs also needs it to notice misalignment or collusion in its parts. It might even be enough for corrigibility.

[-]Eric Drexler3y30

These are good points, and I agree with pretty much all of them.

Instances in a bureaucracy can be very different and play different roles or pursue different purposes. They might be defined by different prompts and behave as differently as text continuations of different prompts in GPT-3

I think that this is an important idea. Though simulators analogous to GPT-3, it may be possible to develop strong, almost-provably-non-agentic intelligent resources, then prompt them to simulate diverse, transient agents on the fly. From the perspective of building multicomponent architectures this seems like a strange and potentially powerful tool.

Regarding interpretability, tasks that require communication among distinct AI components will tend to expose information, and manipulating “shared backgrounds” between information sources and consumers could potentially be exploited to make that information more interpretable. (How one might train against steganography is an interesting question.)

[-]Eric Drexler3y10

So I think useful arrangements of diverse/competing/flawed systems is a hope in many contexts. It often doesn't work, so looks neglected, but not for want of trying.

What does and doesn’t work will depend greatly on capabilities, and the problem-context here assumes potentially superintelligent-level AI.

[-]NicholasKees3y10

and increasing the number of actors can make collusive cooperation more difficult

An empirical counterargument to this is in the incentives human leaders face when overseeing people who might coordinate against them. When authoritarian leaders come into power they will actively purge members from their inner circles in order to keep them small. The larger the inner circle, the harder it becomes to prevent a rebellious individual from gathering the critical mass needed for a full blown coup.

Source: The Dictator's Handbook by Bruce Bueno de Mesquita and Alastair Smith

[-]Eric Drexler3y13

Are you arguing that increasing the number of (AI) actors cannot make collusive cooperation more difficult? Even in the human case, defectors make large conspiracies more difficult, and in the non-human case, intentional diversity can almost guarantee failures of AI-to-AI alignment.

[-]Koen.Holtman3y10

At t+7 years, I’ve still seen no explicit argument for robust AI collusion, yet tacit belief in this idea continues to channel attention away from a potential solution-space for AI safety problems, leaving something very much like a void.

I agree with you that this part of the AGI x-risk solution space, the part where one tries to design measures to lower the probability of collusion between AGIs, is very under-explored. However, I do not believe that the root cause of this lack of attention is a widely held 'tacit belief' that robust AGI collusion is inevitable.

It is easy to imagine the existence of a very intelligent person who nevertheless hates colluding with other people. It is easy to imagine the existence of an AI which approximately maximises a reward function which has a term in it that penalises collusion. So why is nobody working on creating or improving such an AI or penalty term?

My current x-risk community model is that the forces that channel people away from under-explored parts of the solution space have nothing to do with tacit assumptions about impossibilities. These forces operate at a much more pre-rational level of human psychology. Specifically: if there is no critical mass of people working in some part of the solution space already, then human social instincts will push most people away from starting to work there, because working there will necessarily be a very lonely affair. On a more rational level, the critical mass consideration is that if you want to do work that gets you engagement on your blog post, or citations on your academic paper, the best strategy is pick a part of the solution space that already has some people working in it.

TL;DR: if you want to encourage people to explore an under-visited part of the solution space, you are not primarily fighting against a tacit belief that this part of the space will be empty of solutions. Instead, you will need to win the fight against the belief that people will be lonely when they go into that part of the space.

[-]Eric Drexler3y20

What you describe is a form of spiral causation (only seeded by claims of impossibility), which surely contributes strongly to what we’ve seen in the evolution of ideas. Fortunately, interest in the composite-system solution space (in particular, the open agency model) is now growing, so I think we’re past the worst of the era of unitary-agent social proof.

^{^}

A property may be true with high credence yet not be susceptible to formal proof.

^{^}

Systems need not have access to a shared clock, a hardware-level constraint that is necessary to ensure reproducible results and incidentally limits the scope for coordinated treachery.

^{^}

It has been persuasively argued that large language models are best thought of as “simulators”, a class of systems that can (in the LLM case) simulate agents while themselves lacking agentic properties. (This is a coherent concept: A physics model could simulate agents without itself having agentic properties.) LLMs show how an AI system trained on a prediction objective can provide a rich world model without pursuing goals.

^{^}

In a familiar class of worst-case scenarios, systems with general superintelligence would infer extensive knowledge about the world from minimal information, and would choose (if possible) to pursue potentially dangerous goals by manipulating the external environment, for example, through deceptive answers to questions. In this scenario, (all?) superintelligent systems, even if almost isolated, would infer the existence of others like themselves, and (all?) would employ a decision theory that induces them to collude (in a coordinated way?) to pursue shared objectives. Even if we consider a worst-case assumption regarding the default emergence of world-changing goals, the present argument suggests that problematic systems would correctly infer the existence of superintelligent-level systems unlike themselves (systems with diverse and specialized capabilities, knowledge, and interactions, playing roles that include adversarial judges and competitors), and would correctly recognize that deceptive collusion is risky or infeasible.

^{^}

The idea of multicomponent strategies for AI safety is, of course, neither new nor entirely neglected. However, in a recent search for relevant Alignment Forum posts, I found no evidence of a thriving research community or well-developed concepts:

• (My understanding of) What Everyone in Technical Alignment is Doing and Why (August 2022) surveys the agendas of more than 20 research groups, and none clearly points in the direction I’ve advocated here.

• A pair of posts on Pragmatic AI Safety, Perform Tractable Research While Avoiding Capabilities Externalities and Open Problems in AI X-Risk (May, June 2022), briefly mention highly relevant concepts: the idea of using “counteracting systems [for example] artificial consciences, AI watchdogs, lie detectors, filters for power-seeking actions, and separate reward models”, and the idea of “multiple superintelligent agents that can rein in other rogue systems”. The authors also mention (without endorsing) the counter-claim that “The instant two intelligent agents can reason about each other — regardless of their goals — they will necessarily collude.”

• An overview of 11 proposals for building safe advanced AI (May 2020) mentions only efforts to align individual AI systems; even “AI safety via debate with transparency tools” proposes that a system interact with a copy of itself. Partial success in single-system alignment could be leveraged in multicomponent safety architectures, an application context that has potential implications for research directions in the single-system alignment domain.

AI ALIGNMENT FORUM
AF

AI ALIGNMENT FORUM
AF

44

Applying superintelligence without collusion

44

The simulation objective

Adapted from Reframing Superintelligence, Section 20:

Collusion among superintelligent oracles
can readily be avoided

20.1 Summary

20.2 Trustworthiness can be an emergent property

20.3 A range of conditions can make collusion robust or fragile

Conditions that tend to facilitate collusion:

Contrasting conditions that tend to disrupt collusion:

Characteristics of practical architectures:

20.4 Untrusted superintelligence can be applied to AI safety

Afterword:

Further Reading [in Reframing Superintelligence]

44

Applying superintelligence without collusion

44

The simulation objective

Adapted from Reframing Superintelligence, Section 20:

Collusion among superintelligent oracles can readily be avoided

20.1 Summary

20.2 Trustworthiness can be an emergent property

20.3 A range of conditions can make collusion robust or fragile

Conditions that tend to facilitate collusion:

Contrasting conditions that tend to disrupt collusion:

Characteristics of practical architectures:

20.4 Untrusted superintelligence can be applied to AI safety

Afterword:

Further Reading [in Reframing Superintelligence]

Collusion among superintelligent oracles
can readily be avoided