[Cross-posted from personal blog]
Various examples (and lists of examples) of unintended behaviors in AI systems have appeared in recent years. One interesting type of unintended behavior is finding a way to game the specified objective: generating a solution that literally satisfies the stated objective but fails to solve the problem according to the human designer’s intent. This occurs when the objective is poorly specified, and includes reinforcement learning agents hacking the reward function, evolutionary algorithms gaming the fitness function, etc. While ‘specification gaming’ is a somewhat vague category, it is particularly referring to behaviors that are clearly hacks, not just suboptimal solutions.
Since these examples are currently scattered across several lists, I have put together a master list of examples collected from the various existing sources. This list is intended to be comprehensive and up-to-date, and serve as a resource for AI safety research and discussion. If you know of any interesting examples of specification gaming that are missing from the list, please submit them through this form.
Thanks to Gwern Branwen, Catherine Olsson, Alex Irpan, and others for collecting and contributing examples!
I've been pleasantly surprised by how much this resource has caught on in terms of people using it and referring to it (definitely more than I expected when I made it). There were 30 examples on the list when was posted in April 2018, and 20 new examples have been contributed through the form since then. I think the list has several properties that contributed to wide adoption: it's fun, standardized, up-to-date, comprehensive, and collaborative.
Some of the appeal is that it's fun to read about AI cheating at tasks in unexpected ways (I've seen a lot of people post on Twitter about their favorite examples from the list). The standardized spreadsheet format seems easier to refer to as well. I think the crowdsourcing aspect is also helpful - this helps keep it current and comprehensive, and people can feel some ownership of the list since can personally contribute to it. My overall takeaway from this is that safety outreach tools are more likely to be impactful if they are fun and easy for people to engage with.
This list had a surprising amount of impact relative to how little work it took me to put it together and maintain it. The hard work of finding and summarizing the examples was done by the people putting together the lists that the master list draws on (Gwern, Lehman, Olsson, Irpan, and others), as well as the people who submit examples through the form. What I do is put them together in a common format and clarify and/or shorten some of the summaries. I also curate the examples to determine whether they fit the definition of specification gaming (as opposed to simply a surprising behavior or solution). Overall, I've probably spent around 10 hours so far on creating and maintaining the list, which is not very much. This makes me wonder if there is other low hanging fruit in the safety resources space that we haven't picked yet.
I have been using it both as an outreach and research tool. On the outreach side, the resource has been helpful for making the argument that safety problems are hard and need general solutions, by making it salient just in how many ways things could go wrong. When presented with an individual example of specification gaming, people often have a default reaction of "well, you can just close the loophole like this". It's easier to see that this approach does not scale when presented with 50 examples of gaming behaviors. Any given loophole can seem obvious in hindsight, but 50 loopholes are much less so. I've found this useful for communicating a sense of the difficulty and importance of Goodhart's Law.
On the research side, the examples have been helpful for trying to clarify the distinction between reward gaming and tampering problems. Reward gaming happens when the reward function is designed incorrectly (so the agent is gaming the design specification), while reward tampering happens when the reward function is implemented incorrectly or embedded in the environment (and so can be thought of as gaming the implementation specification). The boat race example is reward gaming, since the score function was defined incorrectly, while the Qbert agent finding a bug that makes the platforms blink and gives the agent millions of points is reward tampering. We don't currently have any real examples of the agent gaining control of the reward channel (probably because the action spaces of present-day agents are too limited), which seems qualitatively different from the numerous examples of agents exploiting implementation bugs.
I'm curious what people find the list useful for - as a safety outreach tool, a research tool or intuition pump, or something else? I'd also be interested in suggestions for improving the list (formatting, categorizing, etc). Thanks everyone who has contributed to the resource so far!
The biggest benefit for me has come from using this list in conversation, when I am trying to explain the basics of AI risk to someone, or am generally discussing the topic. Before this list came out, I would often have to come up with an example of some specification gaming problem on the fly, and even though I would be confident that my example was representative, it couldn't be sure that it actually happened, which often detracted from the point I was trying to make. After this list came out, I just had a pre-cached list of many examples that I could bring up at any point that I knew had actually happened, and where I could easily reference and find the original source if the other person wanted to follow up on that.
I see this referred to a lot, and also find myself referring to it a lot. Having concrete examples of specification gaming is a valuable shortcut when explaining safety problems, as a "proof of concept" of something going wrong.
It's really valuable to make records like this. Insofar as it's accurate, and any reviewers can add to it, that's great. (Perhaps in the future it could be a wiki page of some sort, but for now I think it was a very valuable thing that got posted in 2018.)
Thanks Ben! I'm happy that the list has been a useful resource. A lot of credit goes to Gwern, who collected many examples that went into the specification gaming list: https://www.gwern.net/Tanks#alternative-examples.
Some other discussion: https://www.lesswrong.com/posts/nvkiGW4vH8CCHfoNi/specification-gaming-examples-in-ai
This looks great, and I think some of the work started by Scott Garrabrant on categorizing Goodhart's law addresses one component of this - but I think there is more and useful work on further categorizing the failure modes. This list looks like a good resource for doing so. On the other hand, I think the categorization would be helped by looking at non AI/ML systems to see how gaming occurs more widely. (Humans game systems differently than AI currently does, in part because we're still smarter. That means looking at humans might help understand how else AI can game metrics.)
For a take on the topic that goes much farther outside the realm of AI examples, see my older blog post on measurement: https://www.ribbonfarm.com/2016/06/09/goodharts-law-and-why-measurement-is-hard/
Thanks a lot for this post. I think “specification gaming” is an essential problem to solve but also possibly one of the hardest to make progress on, so it’s great to see resources aimed at helping people do that. I was unaware of CycleGAN, but it sounds very interesting and I plan to look into it. Thanks for posting something that put me onto an interesting obvservation!