Find all Alignment Newsletter resources here. In particular, you can sign up, or look through this spreadsheet of all summaries that have ever been in the newsletter. I'm always happy to hear feedback; you can send it to me by replying to this email.

Audio version here (may not be up yet).

Highlights

What can the principal-agent literature tell us about AI risk? (Alexis Carlier and Tom Davidson) (summarized by Rohin): It has been argued (AN #56) that at least some AI risk scenarios rely on principal-agent problems becoming extremely large, and that this is incompatible with the existing academic literature on the principal-agent problem. This post examines this critique in detail.

Generally, the post finds that the principal-agent literature doesn't have much bearing on AI risk: it usually doesn't consider weak principals with more capable agents, the models that do exist will probably not generalize to the cases we care about, and it doesn't consider situations where contracts can no longer be enforced.

We can consider the application to specific arguments, such as the "going out with a bang" scenario (AN #50) in which we accidentally train influence-maximizers that gradually gain power and then fail catastrophically (e.g. by executing a treacherous turn). In this situation, the principal-agent problem is relevant only in the first stage, where AI agents gradually gain power: this is the case where AI agents are executing some task, and are extracting agency rents to gain power. The second stage, in which the agent fails catastrophically, happens "outside" the principal-agent problem: this failure doesn't happen while performing some assigned task, but instead involves the agent exercising its accumulated power outside of any specific task.

What about the original scenario, in which an AI agent becomes very intelligent and finds some solution to its task that the designers (principals) didn't think about and are surprised by? In the principal-agent setting, we might model this as the agent having an expanded action set that the principal doesn't know about. The principal-agent literature has not really studied such models, probably because it is immediately obvious that in such a situation the principal could inadvertently provide incentives that lead the agent to kill everyone.

Rohin's opinion: I've been confused about this critique for a while, and I'm glad this post has addressed it: I currently think that this post pretty conclusively refutes the claim that AI risk arguments are in conflict with the principal-agent literature. I especially found it useful to think of the principal-agent problem as tied to rents that an agent can extract while pursuing a task that the principal assigned.

GovAI 2019 Annual Report (Allan Dafoe) (summarized by Rohin): This is exactly what it sounds like.

Rohin's opinion: I generally find governance papers quite illuminating for thinking about how all this technical stuff we do is meant to interact with the broader society and actually have an impact on the world. That said, I usually don't highlight such papers, despite liking them a lot, because the primary audience I have in mind is people trying to solve the technical alignment problem, in which you want to ensure that a powerful AI system is not adversarially optimizing against you. So instead I've collected a bunch of them in this newsletter and just highlighted the annual report.

Technical AI alignment

Miscellaneous (Alignment)

My personal cruxes for working on AI safety (Buck Shlegeris) (summarized by Rohin): This post describes how Buck's cause prioritization within an effective altruism framework leads him to work on AI risk. The case can be broken down into a conjunction of five cruxes. Specifically, the story for impact is that 1) AGI would be a big deal if it were created, 2) AGI has a decent chance of being created soon, before any other "big deal" technology, and 3) AGI poses an alignment problem that we can make progress on by thinking ahead, and doing so is potentially valuable even though people could also try to solve the problem later. His research 4) would be put into practice if it solved the problem and 5) makes progress on solving the problem.

Rohin's opinion: I enjoyed this post, and recommend reading it in full if you are interested in AI risk because of effective altruism. (I've kept the summary relatively short because not all of my readers care about effective altruism.) My personal cruxes and story of impact are actually fairly different: in particular, while this post sees the impact of research as coming from solving the technical alignment problem, I care about other sources of impact as well. See this comment for details.

AI strategy and policy

The Windfall Clause: Distributing the Benefits of AI (Cullen O’Keefe et al) (summarized by Rohin): The Windfall Clause is a proposed policy lever for improving outcomes from transformative AI. Corporations can voluntarily agree to be bound by the clause, in which case they must donate some proportion of windfall profits (profits in excess of e.g. 1% of world GDP) for the benefit of humanity. Since such a scenario is exceedingly unlikely, the expected cost of signing on is low, and so it can be in corporations' interests to be bound by the clause in order to reap the benefits of improved public relations. If the scenario actually occurs, the donations can then be used to address many of the societal problems that would likely arise, such as job loss and inequality.

Rohin's opinion: While there are certainly major benefits to the Windfall Clause in the case of an actual windfall, it seems to me like there are benefits even when windfalls do not occur (a point mentioned but not emphasized in the full report). For example, in a world in which everyone has agreed to the Windfall Clause, the incentives to "win an economic race" decrease: even if it is possible for e.g. one company to "win" via a monopoly on AI, at least a portion of their "winnings" must be distributed to everyone else, plausibly decreasing incentives to race, and increasing the likelihood that companies pay attention to safety. (This of course assumes that the clause remains binding even after "winning", which is not obviously true.)

Read more: EA Forum summary

The Offense-Defense Balance of Scientific Knowledge: Does Publishing AI Research Reduce Misuse? (Toby Shevlane et al) (summarized by Rohin): Since GPT-2 (AN #46), the AI research community has wrestled with the question of whether to publish research with potentially malicious applications. On the one hand, publishing such research makes it more likely that those malicious applications arise in reality; on the other hand, it also allows defenses against those applications to be developed. The core of the question is what the offense-defense balance of AI research looks like.

In particular, publication is good if attackers are likely to independently develop the knowledge anyway, or would find it hard to translate the research into a real-world attack, or if defenders will put a lot of effort into finding a solution, and such a solution is likely to be found and deployed. The canonical example is computer security: once a vulnerability is found, it is usually quite easy to develop a patch that fixes it, and such patches can be deployed relatively easily via automatic updates. As a result, in computer security the default is to publicly disclose vulnerabilities after giving vendors some time to develop and deploy a patch.

Under the opposite conditions, where attackers are likely to be able to use the research to create a real-world attack, or where defenders would find it hard to find and deploy a good solution, it is better to keep the research secret. For example, in biorisks such as the risk of an engineered pandemic, solutions are not necessarily easy to find and/or deploy, and so it seems better to avoid making public the knowledge of how to create a novel virus.

The paper argues that, relative to computer security (the default comparison for many AI researchers), publication in AI is more likely to be net negative (specifically from a security standpoint, ignoring beneficial applications of the research): solutions must often be social (as with e.g. fake news), which makes them harder to deploy, and publication seems more likely to counterfactually educate attackers than defenders (since the defenders are big companies that already have a lot of expertise).

Rohin's opinion: This is a remarkably straightforward and clear analysis, and is way better than any analysis I've seen done by the AI community, which is quite a shame, given how much time AI researchers have spent thinking about publication norms. (Though note that I don't follow this space closely, and so might have missed previous good work on publication norms.) As a comparison, the only conclusion that When Is It Appropriate to Publish High-Stakes AI Research? (AN #55) came to was that whatever the publication norms are, they should be standardized across the AI community.

Who owns artificial intelligence? A preliminary analysis of corporate intellectual property strategies and why they matter (Nathan Calvin et al) (summarized by Rohin): This paper analyzes intellectual property (IP) considerations as they relate to AI. They identify two main incentives for companies: first, to publish AI research openly in order to attract top talent, and second, to hold enough patents that they can credibly threaten to sue other companies for patent infringement. This second incentive keeps companies in a mutually-assured-destruction (MAD) scenario: if any one company litigates for patent infringement, it will quickly be met with a countersuit, and so the (fragile) equilibrium is to avoid litigation. They also identify two incentives for governments: first, to provide patents as a financial incentive for innovation, and second, to allow their own national security apparatus to use state-of-the-art research while keeping it secret from perceived rivals.

Based on this analysis, they propose three scenarios that could unfold in the future. First, the status quo continues, with companies continuing to acquire patents in order to maintain the MAD equilibrium. Second, the equilibrium breaks: one company litigates, which prompts all the other companies to litigate as well. This could result in most research becoming secret, in order to ensure that other companies can't "steal" the work and get a patent first. Similarly, contributions to open-source research might decrease, since it would be particularly easy to use such contributions as evidence of patent infringement. Third, more "patent pools" get created, in which multiple companies pool their patents together to reduce the risk of litigation. Such patent pools could also be used to enforce other principles: with a sufficiently large patent pool, it could be the case that, in order to remain competitive, actors must license from the pool, and the licensing agreements could enforce specific ethical principles (though the pool would have to be careful to avoid violating antitrust law).

Rohin's opinion: I enjoyed this paper; it seems good to have a better picture of the potential future of openness in AI research, for the reasons given in Strategic Implications of Openness in AI Development. You could also imagine patent pools as a vehicle for safety, as they are one possible way by which companies can cooperate to ensure a shared commitment to safety (along the lines of OpenAI's charter (AN #2)): they could tie competitiveness (which requires use of the research protected by the patent pool) to safety (the conditions involved in licensing the research in the patent pool).

Social and Governance Implications of Improved Data Efficiency (Aaron D. Tucker et al) (summarized by Rohin): Few-shot learning, meta learning, transfer learning, active learning: there are a lot of types of learning that aim to improve the data efficiency of ML techniques. What happens if we succeed? This paper proposes two effects: an access effect, by which smaller actors can start using ML capabilities with their smaller amounts of data, and a performance effect, by which existing actors see improvements in the performance of their AI systems (since their existing data goes further than it used to). It then analyzes some societal implications of these effects.

By making it easier to reach a given performance with limited data, we will gain access to new applications where data is limited (e.g. machine translation of ancient languages), and for existing applications, more actors will be able to use ML capabilities (this also includes bad actors, who can more easily pursue malicious applications). However, it is not clear how this will affect the competitive advantage of large AI firms: while more actors can access a given level of performance, which might suggest more competition, the large AI firms also gain performance, which could reverse the effect. For example, improved data efficiency makes no difference in a pure winner-take-all situation, and advantages the large firms in cases where the last few miles of performance lead to large gains in utility (e.g. self-driving cars).

The paper also makes two comments on the impacts for AI safety: that algorithms based on human oversight will become more competitive (as it will be more reasonable to collect expensive human data), and that distributional shift problems may become worse (since if you train on smaller amounts of data, you are less likely to see "rare" inputs).

Rohin's opinion: While data efficiency is often motivated in AI by the promise of applications where data is limited, I am actually more excited about the thresholding effects mentioned in the paper, in which squeezing out the last little bit of performance makes ML systems robust enough that applications can be built on top of them (in the way that computer vision is (hopefully) robust enough that self-driving cars can be built on top of CV models). It seems quite likely to me that data efficiency and large data collection efforts together will mean that, due to these thresholding effects, the newest ML applications happen in large firms rather than startups. See also Value of the Long Tail.

I disagree with the point about distributional shift. I often think of ML as a search for a function that does the right thing on the training data. We start with a very large set of functions, and then as we get more training data, we can rule out more and more of the functions. The problem of distribution shift is that even after this, there's a large set of functions left over, and the ML model implements an arbitrary one of them, rather than the one we wanted; while this function behaves as we want on the training set, it may not do so on the test set.

"Increased data efficiency" means that we have some way of ruling out functions a priori that doesn't require access to data, or we get better at figuring out which functions should be ruled out by the data. Suppose for concreteness that our original ML algorithm gets 90% performance with 1,000 samples, and 95% with 10,000 samples, and our efficient ML algorithm that e.g. incorporates causality gets 95% performance with 1,000 samples. Then the question we now have is "do the 9000 extra samples eliminate more bad functions than the assumption of causality, given that they both end up with 95% performance?" My guess would be that the "assumption of causality" does a better job of eliminating bad functions, because it will probably generalize outside of the training set, whereas the 9000 extra samples won't. (This doesn't depend on the fact that it is "causality" per se, just that it is a human-generated intuition.) So this suggests that the efficient ML algorithm would be more robust than the original one.

Should Artificial Intelligence Governance be Centralised? Design Lessons from History (Peter Cihon, Matthijs Maas, and Luke Kemp) (summarized by Rohin): This paper tackles the question of whether or not AI governance should be centralized. In favor of centralized governance, we see that a centralized institution can have major political power, and can be more efficient by avoiding duplication of work and making it easier for actors to comply. However, a centralized institution often requires a large amount of time to create, and even afterwards, it tends to be slow-moving and so may not be able to respond to new situations easily. It also leads to a single point of failure (e.g. via regulatory capture), and it may be forced to have a relatively light touch in order to ensure buy-in from all the relevant actors. With a decentralized system, you can get forum shopping, in which actors select the governance mechanisms they like best; this can lead to quicker progress on time-sensitive issues, but can also lead to weakened agreements, so it is not clear whether this is on net a good effect.

The paper then applies this framework to high-level machine intelligence (HLMI), a particular operationalization of powerful AI, and concludes that centralization is particularly promising for HLMI governance.

Lessons for Artificial Intelligence from Other Global Risks (Seth D. Baum et al) (summarized by Rohin): This paper looks at other areas of risk (biorisk, nuclear weapons, global warming, and asteroid collision) and applies lessons from these areas to AI risk. I'd recommend reading the paper: the stories for each area are interesting but hard to summarize here.