TL;DR: Looking for hard debugging tasks for evals; paying the greater of $60/hr or $200 per example.
METR (formerly ARC Evals) is interested in producing hard debugging tasks for models to attempt as part of an agentic capabilities evaluation. To create these tasks, we're seeking repos containing extremely tricky bugs. If you send us a codebase that meets the criteria for submission (listed below), we will pay you $60/hr for the time spent putting it into our required format, or $200, whichever is greater. (We won't pay for submissions that don't meet these requirements.) If we're particularly excited about your submission, we may also be interested in purchasing IP rights to it. We expect to want about 10-30 examples overall, depending on their diversity. We're likely to be putting bounties on additional types of tasks over the next few weeks.
Criteria for submission:
More context and guidance:
Please send submissions to email@example.com in the form of a zip file. Your email should include the number of hours it took you to get the code from its original state into our required format. If your submission meets our criteria and format requirements, we'll contact you with a payment form. You're also welcome to email firstname.lastname@example.org with any questions, including if you are unsure whether a potential submission would meet the criteria. If you would do this task at a higher pay rate, please let us know!
(Also, if you are interested in forking SWE-bench to support non-Python codebases, please contact us.)