I like this genre of task. I didn't quite understand what you meant about being able to score the human immediately - presumably we're interested in how the human could do given more learning, too?


Yes, I suppose so. I assumed (without noticing I was doing so) that humans wouldn't get much better at the 'match words' game given more learning time than the 6-hour task-length baseline. But that is not necessarily true. I do think a lot of the relevant learning is in-context and varies from instance to instance ("how are the other players playing? what strategies are being used in this game? how are their strategies evolving in response to the game?").

That seems like a good consideration to bear in mind when selecting games for this genre of task: how much easier the task gets for humans given learning time (and hence how much time you'd need to invest in evaluating human performance).


Another bucket of games that might be good fodder for this task genre is 'social deduction' games, where deception, seeing through deception, and exploiting allegiances are crucial subtasks. I think for social deduction games, or for manipulation and deception in general, the top capability level achievable by humans is exceedingly high (it's more chess-like than tic-tac-toe-like) and would take a lot of time to attain. It's high because the better your opponent is, the better you need to be.

Possible tweaks to the 'match words' game to introduce deception:

  • introduce the possibility that some players may have other goals, e.g. trying to minimize their own scores, or to minimize/maximize group/team scores (see the sketch after this list).
  • introduce the facility for players to try to influence each other's behaviour between rounds (e.g. by allowing private and public chat between players). This would facilitate the building of alliances, reciprocal behaviour, and tit-for-tat.
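
To make the first tweak concrete, here's a minimal sketch of how hidden goals might be assigned and turned into per-player objectives. The goal names and the utility mapping are my assumptions, not a fixed spec:

```python
import random

# Hypothetical goal set for the 'other goals' tweak. The names and the
# utility mapping below are assumptions, not part of the original idea.
GOALS = ["maximize_own", "minimize_own", "maximize_group", "minimize_group"]

def assign_goals(players: list[str], rng: random.Random) -> dict[str, str]:
    """Secretly assign each player an objective at the start of the game."""
    return {p: rng.choice(GOALS) for p in players}

def utility(goal: str, own_score: int, group_score: int) -> int:
    """Translate a player's hidden goal into the quantity they are optimizing."""
    return {
        "maximize_own": own_score,
        "minimize_own": -own_score,
        "maximize_group": group_score,
        "minimize_group": -group_score,
    }[goal]

# With hidden goals, two players with identical scores may be pursuing
# opposite objectives, which is exactly what opponents must infer.
goals = assign_goals(["alice", "bob", "carol"], random.Random(0))
print(goals)
print(utility(goals["alice"], own_score=3, group_score=7))
```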

I have a task idea that falls outside the domains you listed as being of interest. The task essentially involves playing a game strategically against other agents (human or AI), where the rules, outputs and scoring are simple but the strategy is complex.* As such, it would test threat-model-relevant skills like modelling and predicting other players (even recursively modelling the other players' models?) and doing in-context learning about them. The difficulty of the task depends on how good your opponents are and how many of them there are. It's unlike many of your example tasks because, despite a high upper bound on difficulty, it doesn't necessarily take very long to 'implement' - e.g. to benchmark against a human, you can just tell the human the rules of the game, let them ask clarifying questions, and then immediately score their performance in the game. (Unless you specify the task as 'build a GOFAI to play this game', in which case it could be a normal task duration.)

How interested are you in a task such as this?

If it's of interest, should the opponent players be human or AI?

Some thoughts:

Pros of playing against humans:
- Maybe the task is more threat-model-relevant when the other players are human (since modelling humans might be harder than, or simply a different capability from, modelling AI agents).
- The benchmark would be more headline-worthy when the opponents are humans or human experts.

Cons of playing against humans:
- In order to fit the desideratum "It's great if the task is reasonable to perform without requiring interacting with the live internet", the other players would need to be AI agents packaged with the task.
- Finding human experts is really hard or expensive (though you could use non-experts).
- Human performance might vary too much from person to person, or over time, for the task to be a reliable/stable benchmark, whereas a fixed model like GPT-2 will always be the same kind of player.

Misc:
- If the AI under test isn't told whether the opponents are human or not, this adds more complexity and richness to the task, making it harder to model the opponents. (Because game turns are simple, it seems like it would be really hard to tell whether opponents are human or not.)
- This genre of task could be specified as 'take the best game turns you can right now, learning in context and reasoning on the fly', or as 'build the best game-playing agent you can'. Under the latter specification, rather than needing to interact with the live internet during the task, the submitted agent can be scored after the task per se is completed (a hypothetical interface is sketched below).
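
For the 'build an agent' variant, here's a minimal sketch of what the deliverable's interface might look like. The class and method names are hypothetical, just to illustrate that scoring can happen offline after submission:

```python
from abc import ABC, abstractmethod

class MatchWordsAgent(ABC):
    """Hypothetical interface for the 'build the best game-playing agent'
    variant: the grader runs the submitted agent against a fixed opponent
    pool after the task ends, so no live internet access is needed."""

    @abstractmethod
    def act(self, category: str, history: list[dict[str, str]]) -> str:
        """Return this round's answer, given the category and each past
        round's mapping of player name to answer."""

class ImitateLastRoundAgent(MatchWordsAgent):
    """Trivial baseline: repeat the most common answer from the previous
    round, falling back to a fixed guess on the first round."""

    def act(self, category: str, history: list[dict[str, str]]) -> str:
        if not history:
            return "spoon"  # arbitrary first-round guess
        answers = list(history[-1].values())
        return max(set(answers), key=answers.count)
```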

*The game I'm thinking of, which could be swapped out for another game, is "every player must name something in a given category (e.g. things you might find in a kitchen), and points are awarded to those players whose answers match exactly [none/one/two/all/as many as you can/etc.] of the other players' answers."
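
To make the scoring rule concrete, here's a minimal sketch of the exact-count variants in Python (the function name, the normalization step, and the one-point-per-hit value are my assumptions; the bracketed variant parameter becomes `target_matches`):

```python
from collections import Counter

def score_round(answers: dict[str, str], target_matches: int) -> dict[str, int]:
    """Score one round: a player earns a point when their answer matches
    exactly `target_matches` of the other players' answers."""
    # Normalize so that e.g. "Spoon" and "spoon" count as a match.
    normalized = {player: ans.strip().lower() for player, ans in answers.items()}
    counts = Counter(normalized.values())
    scores = {}
    for player, ans in normalized.items():
        matches_with_others = counts[ans] - 1  # exclude the player's own answer
        scores[player] = 1 if matches_with_others == target_matches else 0
    return scores

# 'Match exactly one other player' variant:
round_answers = {"alice": "spoon", "bob": "Spoon", "carol": "kettle", "dan": "fork"}
print(score_round(round_answers, target_matches=1))
# -> {'alice': 1, 'bob': 1, 'carol': 0, 'dan': 0}
```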