Acknowledgements & References

This post is part of the sequence version of the Effective Altruism Foundation's research agenda on Cooperation, Conflict, and Transformative Artificial Intelligence.

Acknowledgements

As noted in the document, several sections of this agenda drew on writings by Lukas Gloor, Daniel Kokotajlo, Anni Leskelä, Caspar Oesterheld, and Johannes Treutlein. Thank you very much to David Althaus, Tobias Baumann, Alexis Carlier, Alex Cloud, Max Daniel, Michael Dennis, Lukas Gloor, Adrian Hutter, Daniel Kokotajlo, János Kramár, David Krueger, Anni Leskelä, Matthijs Maas, Linh Chi Nguyen, Richard Ngo, Caspar Oesterheld, Mahendra Prasad, Rohin Shah, Carl Shulman, Stefan Torges, Johannes Treutlein, and Jonas Vollmer for comments on drafts of this document. Thank you also to the participants of the Effective Altruism Foundation research retreat and workshops, whose contributions helped to shape this agenda.

References

Arif Ahmed. Evidence, decision and causality. Cambridge University Press, 2014.

AI Impacts. Likelihood of discontinuous progress around the development of agi. https://aiimpacts.org/likelihood-of-discontinuous-progress-around-the-development-of-agi/, 2018. Accessed: July 1 2019.

Riad Akrour, Marc Schoenauer, and Michele Sebag. Preference-based policy learning. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 12–27. Springer, 2011.

Steffen Andersen, Seda Ertaç, Uri Gneezy, Moshe Hoffman, and John A List. Stakes matter in ultimatum games. American Economic Review, 101(7):3427-39, 2011.

Giulia Andrighetto, Daniela Grieco, and Rosaria Conte. Fairness and compliance in the extortion game. 2015.

Scott E Atkinson, Todd Sandler, and John Tschirhart. Terrorism in a bargaining framework. The Journal of Law and Economics, 30(1):1-21, 1987.

Robert Axelrod. On six advances in cooperation theory. Analyse & Kritik, 22(1):130-151, 2000.

Robert Axelrod and William D Hamilton. The evolution of cooperation. Science, 211(4489):1390-1396, 1981.

Kyle Bagwell. Commitment and observability in games. Games and Economic Behavior, 8(2):271-280, 1995.

Tobias Baumann. Surrogate goals to deflect threats. http://s-risks.org/using-surrogate-goals-to-deflect-threats/, 2017. Accessed March 6, 2019.

Tobias Baumann. Challenges to implementing surrogate goals. http://s-risks.org/challenges-to-implementing-surrogate-goals/, 2018. Accessed March 6, 2019.

Tobias Baumann, Thore Graepel, and John Shawe-Taylor. Adaptive mechanism design: Learning to promote cooperation. arXiv preprint arXiv:1806.04067, 2018.

Ken Binmore, Ariel Rubinstein, and Asher Wolinsky. The nash bargaining solution in economic modelling. The RAND Journal of Economics, pages 176-188, 1986.

Iris Bohnet, Bruno S Frey, and Steffen Huck. More order with less law: On contract enforcement, trust, and crowding. American Political Science Review, 95(1):131-144, 2001.

Friedel Bolle, Yves Breitmoser, and Steffen Schlächter. Extortion in the laboratory. Journal of Economic Behavior & Organization, 78(3):207-218, 2011.

Gary E Bolton and Axel Ockenfels. Erc: A theory of equity, reciprocity, and competition. American economic review, 90(1):166-193, 2000.

Nick Bostrom. Ethical issues in advanced artificial intelligence. Science Fiction and Philosophy: From Time Travel to Superintelligence, pages 277-284, 2003.

Nick Bostrom. Superintelligence: paths, dangers, strategies. 2014.

Ronen I Brafman and Moshe Tennenholtz. Efficient learning equilibrium. In Advances in Neural Information Processing Systems, pages 1635-1642, 2003.

R. A. Briggs. Normative theories of rational choice: Expected utility. In Edward N. Zalta, editor, The Stanford Encyclopedia of Philosophy. Metaphysics Research Lab, Stanford University, fall 2019 edition, 2019.

Ernst Britting and Hartwig Spitzer. The open skies treaty. Verification Yearbook, pages 221-237, 2002.

Colin Camerer and Teck Hua Ho. Experience-weighted attraction learning in normal form games. Econometrica, 67(4):827-874, 1999.

Colin F Camerer. Behavioural game theory. Springer, 2008.

Colin F Camerer, Teck-Hua Ho, and Juin-Kuan Chong. A cognitive hierarchy model of games. The Quarterly Journal of Economics, 119(3):861-898, 2004.

Christopher Cherniak. Computational complexity and the universal acceptance of logic. The Journal of Philosophy, 81(12):739-758, 1984.

Thomas J Christensen and Jack Snyder. Chain gangs and passed bucks: Predicting alliance patterns in multipolarity. International organization, 44(2):137-168, 1990.

Paul Christiano. Approval directed agents. https://ai-alignment.com/model-free-decisions-6e6609f5d99e, 2014. Accessed: March 15 2019.

Paul Christiano. Humans consulting hch. https://ai-alignment.com/humans-consulting-hch-f893f6051455, 2016a.

Paul Christiano. Prosaic ai alignment. https://ai-alignment.com/prosaic-ai-control-b959644d79c2, 2016b. Accessed: March 13 2019.

Paul Christiano. Clarifying “ai alignment”. https://ai-alignment.com/clarifying-ai-alignment-cec47cd69dd6, 2018a. Accessed: October 10 2019.

Paul Christiano. Preface to the sequence on iterated amplification. https://www.lesswrong.com/s/XshCxPjnBec52EcLB/p/HCv2uwgDGf5dyX5y6, 2018b. Accessed March 6, 2019.

Paul Christiano. Preface to the sequence on iterated amplification. https://www.lesswrong.com/posts/HCv2uwgDGf5dyX5y6/preface-to-the-sequence-on-iterated-amplification, 2018c. Accessed: October 10 2019.

Paul Christiano. Techniques for optimizing worst-case performance. https://ai-alignment.com/techniques-for-optimizing-worst-case-performance-39eafec74b99, 2018d. Accessed: June 24, 2019.

Paul Christiano. What failure looks like. https://www.lesswrong.com/posts/HBxe6wdjxK239zajf/what-failure-looks-like, 2019. Accessed: July 2 2019.

Paul Christiano and Robert Wiblin. Should we leave a helpful message for future civilizations, just in case humanity dies out? https://80000hours.org/podcast/episodes/paul-christiano-a-message-for-the-future/, 2019. Accessed: September 25, 2019.

Paul F Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep reinforcement learning from human preferences. In Advances in Neural Information Processing Systems, pages 4299-4307, 2017.

Mark Coeckelbergh. Can we trust robots? Ethics and information technology, 14(1):53-60, 2012.

EA Concepts. Importance, tractability, neglectedness framework. https://concepts.effectivealtruism.org/concepts/importance-neglectedness-tractability/, n.d. Accessed: July 1 2019.

Ajeya Cotra. Iterated distillation and amplification. https://www.alignmentforum.org/posts/HqLxuZ4LhaFhmAHWk/iterated-distillation-and-amplification, 2018. Accessed: July 25 2019.

Jacob W Crandall, Mayada Oudah, Fatimah Ishowo-Oloko, Sherief Abdallah, Jean-François Bonnefon, Manuel Cebrian, Azim Shariff, Michael A Goodrich, Iyad Rahwan, et al. Cooperating with machines. Nature communications, 9(1):233, 2018.

Andrew Critch. A parametric, resource-bounded generalization of Löb’s theorem, and a robust cooperation criterion for open-source game theory. The Journal of Symbolic Logic, pages 1-15, 2019.

Allan Dafoe. Ai governance: A research agenda. Governance of AI Program, Future of Humanity Institute, University of Oxford: Oxford, UK, 2018.

Wei Dai. Towards a new decision theory. https://www.lesswrong.com/posts/de3xjFaACCAk6imzv/towards-a-new-decision-theory, 2009. Accessed: March 5 2019.

Wei Dai. The main sources of ai risk. https://www.lesswrong.com/posts/WXvt8bxYnwBYpy9oT/the-main-sources-of-ai-risk, 2019. Accessed: July 2 2019.

Robyn M Dawes. Social dilemmas. Annual review of psychology, 31(1):169-193, 1980.

Karl W Deutsch and J David Singer. Multipolar power systems and international stability. World Politics, 16(3):390-406, 1964.

Daniel Dewey. My current thoughts on miri’s “highly reliable agent design” work. https://forum.effectivealtruism.org/posts/SEL9PW8jozrvLnkb4/my-current-thoughts-on-miri-s-highly-reliable-agent-design, 2017. Accessed: October 6 2019.

Avinash Dixit. Trade expansion and contract enforcement. Journal of Political Economy, 111(6):1293-1317, 2003.

Finale Doshi-Velez and Been Kim. Towards a rigorous science of interpretable machine learning. arXiv preprint arXiv:1702.08608, 2017.

K Eric Drexler. Reframing superintelligence: Comprehensive ai services as general intelligence, 2019.

Martin Dufwenberg and Uri Gneezy. Measuring beliefs in an experimental lost wallet game. Games and economic Behavior, 30(2):163-182, 2000.

Daniel Ellsberg. The theory and practice of blackmail. Technical report, RAND CORP SANTA MONICA CA, 1968.

Johanna Etner, Meglena Jeleva, and Jean-Marc Tallon. Decision theory under ambiguity. Journal of Economic Surveys, 26(2):234-270, 2012.

Owain Evans, Andreas Stuhlmüller, Chris Cundy, Ryan Carey, Zachary Kenton, Thomas McGrath, and Andrew Schreiber. Predicting human deliberative judgments with machine learning. Technical report, University of Oxford, 2018.

Tom Everitt, Jan Leike, and Marcus Hutter. Sequential extensions of causal and evidential decision theory. In International Conference on Algorithmic Decision Theory, pages 205-221. Springer, 2015.

Tom Everitt, Daniel Filan, Mayank Daswani, and Marcus Hutter. Self-modification of policy and utility function in rational agents. In International Conference on Artificial General Intelligence, pages 1-11. Springer, 2016.

Tom Everitt, Pedro A Ortega, Elizabeth Barnes, and Shane Legg. Understanding agent incentives using causal influence diagrams, part i: single action settings. arXiv preprint arXiv:1902.09980, 2019.

James D Fearon. Rationalist explanations for war. International organization, 49(3):379-414, 1995.

Ernst Fehr and Klaus M Schmidt. A theory of fairness, competition, and cooperation. The quarterly journal of economics, 114(3):817-868, 1999.

Ernst Fehr, Simon Gächter, and Georg Kirchsteiger. Reciprocity as a contract enforcement device: Experimental evidence. Econometrica, 65:833-860, 1997.

Dan S Felsenthal and Abraham Diskin. The bargaining problem revisited: minimum utility point, restricted monotonicity axiom, and the mean as an estimate of expected utility. Journal of Conflict Resolution, 26(4):664-691, 1982.

Mark Fey and Kristopher W Ramsay. Mutual optimism and war. American Journal of Political Science, 51(4):738-754, 2007.

Mark Fey and Kristopher W Ramsay. Mechanism design goes to war: peaceful outcomes with interdependent and correlated types. Review of Economic Design, 13(3):233-250, 2009.

Mark Fey and Kristopher W Ramsay. Uncertainty and incentives in crisis bargaining: Game-free analysis of international conflict. American Journal of Political Science, 55(1):149-169, 2011.

Ben Fisch, Daniel Freund, and Moni Naor. Physical zero-knowledge proofs of physical properties. In Annual Cryptology Conference, pages 313-336. Springer, 2014.

Jakob Foerster, Richard Y Chen, Maruan Al-Shedivat, Shimon Whiteson, Pieter Abbeel, and Igor Mordatch. Learning with opponent-learning awareness. In Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems, pages 122-130. International Foundation for Autonomous Agents and Multiagent Systems, 2018.

Lance Fortnow. Program equilibria and discounted computation time. In Proceedings of the 12th Conference on Theoretical Aspects of Rationality and Knowledge, pages 128-133. ACM, 2009.

James W Friedman. A non-cooperative equilibrium for supergames. The Review of Economic Studies, 38(1):1-12, 1971.

Daniel Garber. Old evidence and logical omniscience in bayesian confirmation theory. 1983.

Ben Garfinkel. Recent developments in cryptography and possible long-run consequences. https://drive.google.com/file/d/0B0j9LKC65n09aDh4RmEzdlloT00/view, 2018. Accessed: November 11 2019.

Ben Garfinkel and Allan Dafoe. How does the offense-defense balance scale? Journal of Strategic Studies, 42(6):736-763, 2019.

Scott Garrabrant. Two major obstacles for logical inductor decision theory. https://agentfoundations.org/item?id=1399, 2017. Accessed: July 17 2019.

Scott Garrabrant and Abram Demski. Embedded agency. https://www.alignmentforum.org/posts/i3BTagvt3HbPMx6PN/embedded-agency-full-text-version, 2018. Accessed March 6, 2019.

Scott Garrabrant, Tsvi Benson-Tilsen, Andrew Critch, Nate Soares, and Jessica Taylor. Logical induction. arXiv preprint arXiv:1609.03543, 2016.

Alexandre Gazet. Comparative analysis of various ransomware virii. Journal in computer virology, 6(1):77-90, 2010.

Samuel J Gershman, Eric J Horvitz, and Joshua B Tenenbaum. Computational rationality: A converging paradigm for intelligence in brains, minds, and machines. Science, 349(6245):273-278, 2015.

Allan Gibbard and William L Harper. Counterfactuals and two kinds of expected utility. In Ifs, pages 153-190. Springer, 1978.

Itzhak Gilboa and David Schmeidler. Maxmin expected utility with non-unique prior. Journal of mathematical economics, 18(2):141-153, 1989.

Alexander Glaser, Boaz Barak, and Robert J Goldston. A zero-knowledge protocol for nuclear warhead verification. Nature, 510(7506):497, 2014.

Charles L Glaser. The security dilemma revisited. World politics, 50(1):171-201, 1997.

Piotr J Gmytrasiewicz and Prashant Doshi. A framework for sequential planning in multi-agent settings. Journal of Artificial Intelligence Research, 24:49-79, 2005.

Oded Goldreich and Yair Oren. Definitions and properties of zero-knowledge proof systems. Journal of Cryptology, 7(1):1-32, 1994.

Shafi Goldwasser, Silvio Micali, and Charles Rackoff. The knowledge complexity of interactive proof systems. SIAM Journal on computing, 18(1):186-208, 1989.

Katja Grace, John Salvatier, Allan Dafoe, Baobao Zhang, and Owain Evans. When will ai exceed human performance? evidence from ai experts. Journal of Artificial Intelligence Research, 62:729-754, 2018.

Hilary Greaves, William MacAskill, Rossa O’Keeffe-O’Donovan, and Philip Trammell. Research agenda–web version a research agenda for the global priorities institute. 2019.

Avner Greif, Paul Milgrom, and Barry R Weingast. Coordination, commitment, and enforcement: The case of the merchant guild. Journal of political economy, 102(4):745-776, 1994.

Frances S Grodzinsky, Keith W Miller, and Marty J Wolf. Developing artificial agents worthy of trust: “would you buy a used car from this artificial agent?”. Ethics and information technology, 13(1):17-27, 2011.

Werner Güth, Rolf Schmittberger, and Bernd Schwarze. An experimental analysis of ultimatum bargaining. Journal of economic behavior & organization, 3(4):367-388, 1982.

Dylan Hadfield-Menell, Stuart J Russell, Pieter Abbeel, and Anca Dragan. Cooperative inverse reinforcement learning. In Advances in neural information processing systems, pages 3909-3917, 2016.

Edward H Hagen and Peter Hammerstein. Game theory and human evolution: A critique of some recent interpretations of experimental games. Theoretical population biology, 69(3):339-348, 2006.

Joseph Y Halpern and Rafael Pass. Game theory with translucent players. International Journal of Game Theory, 47(3):949-976, 2018.

Lars Peter Hansen and Thomas J Sargent. Robustness. Princeton university press, 2008.

Lars Peter Hansen, Massimo Marinacci, et al. Ambiguity aversion and model misspecification: An economic perspective. Statistical Science, 31(4):511-515, 2016.

Garrett Hardin. The tragedy of the commons. science, 162(3859):1243-1248, 1968.

Paul Harrenstein, Felix Brandt, and Felix Fischer. Commitment and extortion. In Proceedings of the 6th international joint conference on Autonomous agents and multiagent systems, page 26. ACM, 2007.

John C Harsanyi and Reinhard Selten. A generalized nash solution for two-person bargaining games with incomplete information. Management Science, 18(5-part-2):80-106, 1972.

Joseph Henrich, Richard McElreath, Abigail Barr, Jean Ensminger, Clark Barrett, Alexander Bolyanatz, Juan Camilo Cardenas, Michael Gurven, Edwins Gwako, Natalie Henrich, et al. Costly punishment across human societies. Science, 312(5781):1767-1770, 2006.

Jack Hirshleifer. On the emotions as guarantors of threats and promises. The Dark Side of the Force, pages 198-219, 1987.

Douglas R Hofstadter. Dilemmas for superrational thinkers, leading up to a luring lottery. Scientific American, 6:267-275, 1983.

Terence Horgan. Counterfactuals and newcomb’s problem. The Journal of Philosophy, 78(6):331-356, 1981.

Edward Hughes, Joel Z Leibo, Matthew Phillips, Karl Tuyls, Edgar Dueñez-Guzman, Antonio García Castañeda, Iain Dunning, Tina Zhu, Kevin McKee, Raphael Koster, et al. Inequity aversion improves cooperation in intertemporal social dilemmas. In Advances in neural information processing systems, pages 3326-3336, 2018.

Max Jaderberg, Valentin Dalibard, Simon Osindero, Wojciech M Czarnecki, Jeff Donahue, Ali Razavi, Oriol Vinyals, Tim Green, Iain Dunning, Karen Simonyan, et al. Population based training of neural networks. arXiv preprint arXiv:1711.09846, 2017.

Robert Jervis. Cooperation under the security dilemma. World politics, 30(2):167-214, 1978.

Robert Jervis. Perception and Misperception in International Politics: New Edition. Princeton University Press, 2017.

Daniel Kahneman, Ilana Ritov, David Schkade, Steven J Sherman, and Hal R Varian. Economic preferences or attitude expressions?: An analysis of dollar responses to public issues. In Elicitation of preferences, pages 203-242. Springer, 1999.

Ehud Kalai. Proportional solutions to bargaining situations: interpersonal utility comparisons. Econometrica: Journal of the Econometric Society, pages 1623-1630, 1977.

Ehud Kalai, Meir Smorodinsky, et al. Other solutions to nash’s bargaining problem. Econometrica, 43(3):513-518, 1975.

Fred Kaplan. The wizards of Armageddon. Stanford University Press, 1991.

Holden Karnofsky. Some background on our views regarding advanced artificial intelligence. https://www.openphilanthropy.org/blog/some-background-our-views-regarding-advanced-artificial-intelligence, 2016. Accessed: July 7 2019.

D Marc Kilgour and Frank C Zagare. Credibility, uncertainty, and deterrence. American Journal of Political Science, 35(2):305-334, 1991.

Stephen Knack and Philip Keefer. Institutions and economic performance: cross-country tests using alternative institutional measures. Economics & Politics, 7(3):207-227, 1995.

Daniel Kokotajlo. The “commitment races” problem. https://www.lesswrong.com/posts/brXr7PJ2W4Na2EW2q/the-commitment-races-problem, 2019a. Accessed: September 11 2019.

Daniel Kokotajlo. Cdt agents are exploitable. Unpublished working draft, 2019b.

Peter Kollock. Social dilemmas: The anatomy of cooperation. Annual review of sociology, 24(1):183-214, 1998.

Kai A Konrad and Stergios Skaperdas. Credible threats in extortion. Journal of Economic Behavior & Organization, 33(1):23-39, 1997.

David M Kreps and Joel Sobel. Signalling. Handbook of game theory with economic applications, 2:849-867, 1994.

Joshua A Kroll, Solon Barocas, Edward W Felten, Joel R Reidenberg, David G Robinson, and Harlan Yu. Accountable algorithms. U. Pa. L. Rev., 165:633, 2016.

David Krueger, Tegan Maharaj, Shane Legg, and Jan Leike. Misleading meta-objectives and hidden incentives for distributional shift. Safe Machine Learning workshop at ICLR, 2019.

Andrew Kydd. Which side are you on? bias, credibility, and mediation. American Journal of Political Science, 47(4):597-611, 2003.

Andrew H Kydd. Rationalist approaches to conflict prevention and resolution. Annual Review of Political Science, 13:101-121, 2010.

Marc Lanctot, Vinicius Zambaldi, Audrunas Gruslys, Angeliki Lazaridou, Karl Tuyls, Julien Perolat, David Silver, and Thore Graepel. A unified game-theoretic approach to multiagent reinforcement learning. In Advances in Neural Information Processing Systems, pages 4190-4203, 2017.

Daryl Landau and Sy Landau. Confidence-building measures in mediation. Mediation Quarterly, 15(2):97-103, 1997.

Patrick LaVictoire, Benja Fallenstein, Eliezer Yudkowsky, Mihaly Barasz, Paul Christiano, and Marcello Herreshoff. Program equilibrium in the prisoner’s dilemma via Löb’s theorem. In Workshops at the Twenty-Eighth AAAI Conference on Artificial Intelligence, 2014.

Joel Z Leibo, Vinicius Zambaldi, Marc Lanctot, Janusz Marecki, and Thore Graepel. Multi-agent reinforcement learning in sequential social dilemmas. In Proceedings of the 16th Conference on Autonomous Agents and MultiAgent Systems, pages 464-473. International Foundation for Autonomous Agents and Multiagent Systems, 2017.

Joel Z Leibo, Edward Hughes, Marc Lanctot, and Thore Graepel. Autocurricula and the emergence of innovation from social interaction: A manifesto for multi-agent intelligence research. arXiv preprint arXiv:1903.00742, 2019.

Jan Leike, David Krueger, Tom Everitt, Miljan Martic, Vishal Maini, and Shane Legg. Scalable agent alignment via reward modeling: a research direction. arXiv preprint arXiv:1811.07871, 2018.

Adam Lerer and Alexander Peysakhovich. Maintaining cooperation in complex social dilemmas using deep reinforcement learning. arXiv preprint arXiv:1707.01068, 2017.

Anni Leskelä. Simulations as a tool for understanding other civilizations. Unpublished working draft, 2019.

Alistair Letcher, Jakob Foerster, David Balduzzi, Tim Rocktäschel, and Shimon Whiteson. Stable opponent shaping in differentiable games. arXiv preprint arXiv:1811.08469, 2018.

David Lewis. Prisoners’ dilemma is a newcomb problem. Philosophy & Public Affairs, pages 235-240, 1979.

Xiaomin Lin, Stephen C Adams, and Peter A Beling. Multi-agent inverse reinforcement learning for certain general-sum stochastic games. Journal of Artificial Intelligence Research, 66:473-502, 2019.

Zachary C Lipton. The mythos of model interpretability. arXiv preprint arXiv:1606.03490, 2016.

William MacAskill. A critique of functional decision theory. https://www.lesswrong.com/posts/ySLYSsNeFL5CoAQzN/a-critique-of-functional-decision-theory, 2019. Accessed: September 15 2019.

William MacAskill, Aron Vallinder, Caspar Oesterheld, Carl Shulman, and Johannes Treutlein. The evidentialist’s wager. Manuscript, 2019.

Fabio Maccheroni, Massimo Marinacci, and Aldo Rustichini. Ambiguity aversion, robustness, and the variational representation of preferences. Econometrica, 74(6):1447-1498, 2006.

Michael W Macy and Andreas Flache. Learning dynamics in social dilemmas. Proceedings of the National Academy of Sciences, 99(suppl 3):7229-7236, 2002.

Christopher JG Meacham. Binding and its consequences. Philosophical studies, 149(1):49-71, 2010.

Kathleen L Mosier, Linda J Skitka, Susan Heers, and Mark Burdick. Automation bias: Decision making and performance in high-tech cockpits. The International journal of aviation psychology, 8(1):47-63, 1998.

Abhinay Muthoo. A bargaining model based on the commitment tactic. Journal of Economic Theory, 69:134-152, 1996.

Rosemarie Nagel. Unraveling in guessing games: An experimental study. The American Economic Review, 85(5):1313-1326, 1995.

John Nash. Two-person cooperative games. Econometrica, 21:128-140, 1953.

John F Nash. The bargaining problem. Econometrica: Journal of the Econometric Society, pages 155-162, 1950.

Andrew Y Ng, Stuart J Russell, et al. Algorithms for inverse reinforcement learning. In Icml, volume 1, page 2, 2000.

Douglass C North. Institutions. Journal of economic perspectives, 5(1):97-112, 1991.

Robert Nozick. Newcomb’s problem and two principles of choice. In Essays in honor of Carl G. Hempel, pages 114-146. Springer, 1969.

Caspar Oesterheld. Deep reinforcement learning from human preferences. https://casparoesterheld.files.wordpress.com/2018/01/rldt.pdf, 2017a.

Caspar Oesterheld. Multiverse-wide cooperation via correlated decision making. 2017b.

Caspar Oesterheld. Robust program equilibrium. Theory and Decision, pages 1-17, 2019.

Caspar Oesterheld and Vincent Conitzer. Extracting money from causal decision theorists. 2019. Accessed: March 13 2019.

Stephen M Omohundro. The nature of self-improving artificial intelligence. Singularity Summit, 2008, 2007.

Stephen M Omohundro. The basic ai drives. In AGI, volume 171, pages 483-492, 2008.

OpenAI. Openai charter. https://openai.com/charter/, 2018. Accessed: July 7 2019.

Pedro A Ortega and Vishal Maini. Building safe artificial intelligence: specification, robustness, and assurance. https://medium.com/@deepmindsafetyresearch/building-safe-artificial-intelligence-52f5f75058f1, 2018. Accessed: July 7 2019.

Raja Parasuraman and Dietrich H Manzey. Complacency and bias in human use of automation: An attentional integration. Human factors, 52(3):381-410, 2010.

Judea Pearl. Causality. Cambridge university press, 2009.

Julien Perolat, Joel Z Leibo, Vinicius Zambaldi, Charles Beattie, Karl Tuyls, and Thore Graepel. A multi-agent reinforcement learning model of common-pool resource appropriation. In Advances in Neural Information Processing Systems, pages 3643-3652, 2017.

Alexander Peysakhovich and Adam Lerer. Consequentialist conditional cooperation in social dilemmas with imperfect information. arXiv preprint arXiv:1710.06975, 2017.

Robert Powell. Bargaining theory and international conflict. Annual Review of Political Science, 5(1):1-30, 2002.

Robert Powell. War as a commitment problem. International organization, 60(1):169-203, 2006.

Kai Quek. Rationalist experiments on war. Political Science Research and Methods, 5(1):123-142, 2017.

Matthew Rabin. Incorporating fairness into game theory and economics. The American economic review, pages 1281-1302, 1993.

Neil C Rabinowitz, Frank Perbet, H Francis Song, Chiyuan Zhang, SM Eslami, and Matthew Botvinick. Machine theory of mind. arXiv preprint arXiv:1802.07740, 2018.

Werner Raub. A general game-theoretic model of preference adaptations in problematic social situations. Rationality and Society, 2(1):67-93, 1990.

Robert W Rauchhaus. Asymmetric information, mediation, and conflict management. World Politics, 58(2):207-241, 2006.

Jonathan Renshon, Julia J Lee, and Dustin Tingley. Emotions and the microfoundations of commitment problems. International Organization, 71(S1):S189-S218, 2017.

Stephane Ross, Geoffrey Gordon, and Drew Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of the fourteenth international conference on artificial intelligence and statistics, pages 627-635, 2011.

Ariel Rubinstein. Perfect equilibrium in a bargaining model. Econometrica: Journal of the Econometric Society, pages 97-109, 1982.

Stuart Russell, Daniel Dewey, and Max Tegmark. Research priorities for robust and beneficial artificial intelligence. Ai Magazine, 36(4):105-114, 2015.

Stuart J Russell and Devika Subramanian. Provably bounded-optimal agents. Journal of Artificial Intelligence Research, 2:575-609, 1994.

Santiago Sanchez-Pages. Bargaining and conflict with incomplete information. The Oxford Handbook of the Economics of Peace and Conflict. Oxford University Press, New York, 2012.

William Saunders. Hch is not just mechanical turk. https://www.alignmentforum.org/posts/4JuKoFguzuMrNn6Qr/hch-is-not-just-mechanical-turk, 2019. Accessed: July 2 2019.

Stefan Schaal. Is imitation learning the route to humanoid robots? Trends in cognitive sciences, 3(6):233-242, 1999.

Jonathan Schaffer. The metaphysics of causation. In Edward N. Zalta, editor, The Stanford Encyclopedia of Philosophy. Metaphysics Research Lab, Stanford University, fall 2016 edition, 2016.

James A Schellenberg. A comparative test of three models for solving “the bargaining problem”. Behavioral Science, 33(2):81-96, 1988.

Thomas Schelling. The Strategy of Conflict. Harvard University Press, 1960.

David Schmidt, Robert Shupp, James Walker, TK Ahn, and Elinor Ostrom. Dilemma games: game parameters and matching protocols. Journal of Economic Behavior & Organization, 46(4):357-377, 2001.

Wolfgang Schwarz. On functional decision theory. umsu.de/wo/2018/688, 2018. Accessed: September 15 2019.

Anja Shortland and Russ Roberts. Shortland on kidnap. http://www.econtalk.org/anja-shortland-on-kidnap/, 2019. Accessed: July 13 2019.

Carl Shulman. Omohundro’s “basic ai drives” and catastrophic risks. Manuscript, 2010.

Linda J Skitka, Kathleen L Mosier, and Mark Burdick. Does automation bias decision-making? International Journal of Human-Computer Studies, 51(5):991–1006, 1999.

Alastair Smith and Allan C Stam. Bargaining and the nature of war. Journal of Conflict Resolution, 48(6):783-813, 2004.

Glenn H Snyder. “prisoner’s dilemma” and “chicken” models in international politics. International Studies Quarterly, 15(1):66-103, 1971.

Nate Soares and Benja Fallenstein. Toward idealized decision theory. arXiv preprint arXiv:1507.01986, 2015.

Nate Soares and Benya Fallenstein. Agent foundations for aligning machine intelligence with human interests: a technical research agenda. In The Technological Singularity, pages 103-125. Springer, 2017.

Joel Sobel. A theory of credibility. The Review of Economic Studies, 52(4):557-573, 1985.

Ray J Solomonoff. A formal theory of inductive inference. part i. Information and control, 7(1):1-22, 1964.

Kaj Sotala. Disjunctive scenarios of catastrophic ai risk. In Artificial Intelligence Safety and Security, pages 315-337. Chapman and Hall/CRC, 2018.

Tom Florian Sterkenburg. The foundations of solomonoff prediction. Master’s thesis, 2013.

Joerg Stoye. Statistical decisions under ambiguity. Theory and decision, 70(2):129-148, 2011.

Joseph Suarez, Yilun Du, Phillip Isola, and Igor Mordatch. Neural mmo: A massively multiagent game environment for training and evaluating intelligent agents. arXiv preprint arXiv:1903.00784, 2019.

Chiara Superti. Addiopizzo: Can a label defeat the mafia? Journal of International Policy Solutions, 11(4):3-11, 2009.

Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. MIT press, 2018.

William Talbott. Bayesian epistemology. In Edward N. Zalta, editor, The Stanford Encyclopedia of Philosophy. Metaphysics Research Lab, Stanford University, winter 2016 edition, 2016.

Jessica Taylor. My current take on the paul-miri disagreement on alignability of messy ai. https://agentfoundations.org/item?id=1129, 2016. Accessed: October 6 2019.

Max Tegmark. Parallel universes. Scientific American, 288(5):40-51, 2003.

Moshe Tennenholtz. Program equilibrium. Games and Economic Behavior, 49(2):363-373, 2004.

Johannes Treutlein. Modeling multiverse-wide superrationality. Unpublished working draft, 2019.

Jonathan Uesato, Ananya Kumar, Csaba Szepesvari, Tom Erez, Avraham Ruderman, Keith Anderson, Nicolas Heess, Pushmeet Kohli, et al. Rigorous agent evaluation: An adversarial approach to uncover catastrophic failures. arXiv preprint arXiv:1812.01647, 2018.

Eric Van Damme. The nash bargaining solution is optimal. Journal of Economic Theory, 38(1):78-100, 1986.

Hal R Varian. Computer mediated transactions. American Economic Review, 100(2):1-10, 2010.

Heinrich Von Stackelberg. Market structure and equilibrium. Springer Science & Business Media, 2010.

Kenneth N Waltz. The stability of a bipolar world. Daedalus, pages 881-909, 1964.

Weixun Wang, Jianye Hao, Yixi Wang, and Matthew Taylor. Towards cooperation in sequential prisoner’s dilemmas: a deep multiagent reinforcement learning approach. arXiv preprint arXiv:1803.00162, 2018.

E Roy Weintraub. Game theory and cold war rationality: A review essay. Journal of Economic Literature, 55(1):148-61, 2017.

Sylvia Wenmackers and Jan-Willem Romeijn. New theory about old evidence. Synthese, 193(4):1225-1250, 2016.

Lantao Yu, Jiaming Song, and Stefano Ermon. Multi-agent adversarial inverse reinforcement learning. arXiv preprint arXiv:1907.13220, 2019.

Eliezer Yudkowsky. Ingredients of timeless decision theory. https://www.lesswrong.com/posts/szfxvS8nsxTgJLBHs/ingredients-of-timeless-decision-theory, 2009. Accessed: March 14 2019.

Eliezer Yudkowsky. Intelligence explosion microeconomics. Machine Intelligence Research Institute, 2013. Accessed: October 23 2015.

Eliezer Yudkowsky. Modeling distant superintelligences. https://arbital.com/p/distant_SIs/, n.d. Accessed: Feb. 6 2019.

Eliezer Yudkowsky and Nate Soares. Functional decision theory: A new theory of instrumental rationality. arXiv preprint arXiv:1710.05060, 2017.

Claire Zabel and Luke Muehlhauser. Information security careers for gcr reduction. https://forum.effectivealtruism.org/posts/ZJiCfwTy5dC4CoxqA/information-security-careers-for-gcr-reduction, 2019. Accessed: July 17 2019.

Chongjie Zhang and Victor Lesser. Multi-agent learning with policy prediction. In Twenty-Fourth AAAI Conference on Artificial Intelligence, 2010.
