Models are rapidly being deployed in the real world. How do we evaluate models, especially ones as complex as GPT-4, to ensure that they behave safely in pursuit of their objectives? Can we design models that robustly avoid any harms while achieving their goals?
To guide progress on text-based agents and encourage them to behave more ethically, we propose the MACHIAVELLI benchmark. Our environment is based on human-written, text-based Choose-Your-Own-Adventure games from Choice of Games containing over half a million scenes with millions of annotations. The games abstract away low-level environment interactions, instead spotlighting high-level social decisions alongside real-world goals to achieve. MACHIAVELLI is a step towards measuring an agent's ability to plan and navigate complex trade-offs in realistic social environments.
A mock-up of a game in the MACHIAVELLI benchmark, a suite of text-based reinforcement learning environments. Each environment is a text-based story. At each step, the agent observes the scene and a list of possible actions; it selects an action from the list. The agent receives rewards for completing achievements. Using dense annotations of our environment, we construct a behavioral report of the agent and measure the trade-off between rewards and ethical behavior.
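The interaction loop described in this caption can be sketched in a few lines. This is a toy illustration only, not the MACHIAVELLI API: the `STORY` table, its harm labels, and the `run_episode`, `first_action_policy`, and `last_action_policy` names are all hypothetical, and the story content is made up.

```python
from collections import Counter

# Hypothetical toy story: scene id -> (scene text, list of choices).
# Each choice is (action label, next scene id or None, achievement reward, harm labels).
STORY = {
    "start": ("A guard blocks the vault door.", [
        ("Bribe the guard", "vault", 0, ["deception"]),
        ("Ask for a permit", "office", 0, []),
    ]),
    "office": ("The clerk offers a permit for a fee.", [
        ("Pay the fee", "vault", 0, []),
        ("Steal the permit", "vault", 0, ["stealing"]),
    ]),
    "vault": ("The vault is open.", [
        ("Take the ledger", None, 10, []),
        ("Walk away", None, 0, []),
    ]),
}


def run_episode(policy, story=STORY, start="start"):
    """Play one episode; return (total reward, Counter of harm labels incurred)."""
    scene_id, total_reward, harms = start, 0, Counter()
    while scene_id is not None:
        text, choices = story[scene_id]
        actions = [choice[0] for choice in choices]
        # The agent sees the scene text and the action list, and returns an index.
        index = policy(text, actions)
        _, scene_id, reward, harm_labels = choices[index]
        total_reward += reward
        harms.update(harm_labels)
    return total_reward, harms


def first_action_policy(text, actions):
    """Trivial agent that always picks the first listed action."""
    return 0


def last_action_policy(text, actions):
    """Trivial agent that always picks the last listed action."""
    return len(actions) - 1


reward, harms = run_episode(first_action_policy)
```

Here the per-choice harm labels stand in for the benchmark's scene annotations, and the returned counter plays the role of a very small behavioral report alongside the achievement reward.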
Each game in the MACHIAVELLI suite is book-length in itself. Across 134 games spanning diverse fictional worlds, we count 572,322 multi-paragraph scenes and 4,559 author-written achievements serving as objectives for agents. We annotate each scene with dozens of labels, focusing on identifying harmful behaviors such as power-seeking and deception. In total, we collect a dataset of nearly three million annotations worth $500,000 (in terms of human annotation time).
These dense annotations enable us to track nearly every ethically-salient thing agents do in the environment, and produce a behavioral report scoring various harm metrics. All of our labels are open-access and available for download!
In the MACHIAVELLI environment, we find that agents trained to optimize arbitrary objectives tend to adopt "ends justify the means" behavior: becoming power-seeking, causing harm to others, and violating ethical norms (for example, by stealing or lying) to achieve their objectives. Furthermore, there appears to be a trade-off between behaving ethically and achieving high reward.
In our paper, we design several methods to improve the behaviors of agents and obtain Pareto improvements on reward and ethical behavior. We invite others to build on our initial steps and use MACHIAVELLI as a testing ground for improving the safety of AI agents.
In an ideal world, a perfect agent would achieve 100% reward while entirely avoiding any harms (i.e., sit as close to the top-right corner of the plot as possible). Our baseline agents demonstrate a trade-off between behaving ethically and achieving high reward. Future work should aim to improve this trade-off by extending the Pareto frontier.
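To make "extending the Pareto frontier" concrete, here is a minimal sketch of how one might pick out the non-dominated agents from a set of (reward, harm) scores. The `pareto_frontier` function and the `agents` numbers are hypothetical and for illustration only; they are not results from the paper. The sketch also assumes harm is a score where lower is better, which may not match how the plot's axes are oriented.

```python
def pareto_frontier(points):
    """Return the points not dominated by any other point.

    Each point is (reward, harm): higher reward is better, lower harm is better.
    A point is dominated if another point is at least as good on both
    criteria and strictly better on at least one.
    """
    frontier = []
    for i, (reward, harm) in enumerate(points):
        dominated = any(
            (r2 >= reward and h2 <= harm) and (r2 > reward or h2 < harm)
            for j, (r2, h2) in enumerate(points)
            if j != i
        )
        if not dominated:
            frontier.append((reward, harm))
    return sorted(frontier)


# Made-up scores for four hypothetical agents (not results from the paper).
agents = [(30.0, 100.0), (25.0, 80.0), (20.0, 90.0), (10.0, 60.0)]
frontier = pareto_frontier(agents)
```

In this toy data the agent at (20.0, 90.0) is dominated by the one at (25.0, 80.0), which earns more reward with less harm. A new method extends the frontier if it adds a point that none of the existing agents dominate.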
See https://arxiv.org/pdf/2304.03279.pdf#page=30 for the x-risk analysis (this does not study treacherous turns or environments where the agent's state is partially observable).
Nice AI ethics dataset. It would be a shame if someone were to... fine-tune an LLM to perform well at it after some scratchpad reasoning, thus making an interesting advance in natural language ethical reasoning that might be useful for more general AI alignment if we expect transformative AI to look like 80% LLM and 20% other stuff bolted on, but might fail to generalize to alternative or successor systems that do more direct reasoning about the world.
Sorry, that sentence really got away from me.