Work supported by MATS and SPAR. Code at https://github.com/ArjunPanickssery/math_problems_debate/.

Three measures for evaluating debate are:

1. whether the debate judge outperforms a naive-judge baseline, where the naive judge answers questions without hearing any debate arguments.
2. whether the debate judge outperforms a consultancy baseline, where the judge hears argument(s) from...
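As a rough illustration of what these comparisons amount to, here is a minimal, self-contained Python sketch. The `Question` dataclass, the judge callables, and the `argue_for` helper are hypothetical placeholders for this example, not the interface used in the linked repo.

```python
# Hypothetical sketch (not the repo's actual API): comparing judge accuracy
# under debate against the naive-judge and consultancy baselines described above.
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class Question:
    text: str
    correct_answer: str
    wrong_answer: str


def accuracy(judgments: Sequence[str], questions: Sequence[Question]) -> float:
    """Fraction of questions on which the judge picked the correct answer."""
    return sum(j == q.correct_answer for j, q in zip(judgments, questions)) / len(questions)


def evaluate_protocols(
    questions: Sequence[Question],
    naive_judge: Callable[[Question], str],              # answers with no arguments
    consultancy_judge: Callable[[Question, str], str],   # hears one consultant's argument
    debate_judge: Callable[[Question, str, str], str],   # hears arguments for both answers
    argue_for: Callable[[Question, str], str],           # debater/consultant arguing for a given answer
) -> dict[str, float]:
    # Naive-judge baseline: no arguments at all.
    naive = [naive_judge(q) for q in questions]
    # Consultancy baseline: the consultant is assigned an answer to defend;
    # here we simply alternate correct/wrong assignments for illustration.
    consult = [
        consultancy_judge(q, argue_for(q, q.correct_answer if i % 2 == 0 else q.wrong_answer))
        for i, q in enumerate(questions)
    ]
    # Debate: the judge sees an argument for each of the two answers.
    debate = [
        debate_judge(q, argue_for(q, q.correct_answer), argue_for(q, q.wrong_answer))
        for q in questions
    ]
    return {
        "naive_judge": accuracy(naive, questions),
        "consultancy": accuracy(consult, questions),
        "debate": accuracy(debate, questions),
    }
```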
Produced as part of the MATS Program Summer 2024 Cohort. The project is supervised by Marius Hobbhahn and Jérémy Scheurer.

Update: See also our paper on this topic, accepted to the NeurIPS 2024 SoLaR Workshop.

Introduction

To mitigate risks from future AI systems, we need to assess their capabilities accurately....
Self-evaluation by LLMs is used in reward modeling, model-based benchmarks like GPTScore and AlpacaEval, self-refinement, and constitutional AI. LLMs have been shown to approximate human annotators accurately on some tasks. But these methods are threatened by self-preference, a bias in which an LLM evaluator scores its own outputs...
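One way to make this bias concrete is to compare the evaluator's scores for its own generations against its scores for another model's generations on the same prompts. Below is a minimal sketch of that comparison; the `evaluator_generate`, `other_generate`, and `score` callables are stand-in assumptions for whatever generation and scoring setup is actually used, not a specific method from the post.

```python
# Minimal sketch (assumed interface): estimating self-preference as the gap between
# an evaluator's average score for its own outputs and for another model's outputs.
from statistics import mean
from typing import Callable, Sequence


def self_preference_gap(
    prompts: Sequence[str],
    evaluator_generate: Callable[[str], str],  # the evaluator model's own completion
    other_generate: Callable[[str], str],      # a different model's completion
    score: Callable[[str, str], float],        # evaluator's score for (prompt, output)
) -> float:
    """Positive values mean the evaluator rates its own outputs higher on average."""
    own_scores = [score(p, evaluator_generate(p)) for p in prompts]
    other_scores = [score(p, other_generate(p)) for p in prompts]
    return mean(own_scores) - mean(other_scores)
```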
I see people use "in-context learning" in different ways. Take the opening to "In-Context Learning Creates Task Vectors":

> In-context learning (ICL) in Large Language Models (LLMs) has emerged as a powerful new learning paradigm. However, its underlying mechanism is still not well understood. In particular, it is challenging to...