Work supported by MATS and SPAR. Code at https://github.com/ArjunPanickssery/math_problems_debate/.

Three measures for evaluating debate are:

1. whether the debate judge outperforms a naive-judge baseline, where the naive judge answers questions without hearing any debate arguments.
2. whether the debate judge outperforms a consultancy baseline, where the judge hears argument(s) from...
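As a rough illustration of what these comparisons amount to, here is a minimal, self-contained Python sketch. The `Question` dataclass, the judge callables, and the `argue_for` helper are hypothetical placeholders for this example, not the interface used in the linked repo.

```python
# Hypothetical sketch (not the repo's actual API): comparing judge accuracy
# under debate against the naive-judge and consultancy baselines described above.
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class Question:
    text: str
    correct_answer: str
    wrong_answer: str


def accuracy(judgments: Sequence[str], questions: Sequence[Question]) -> float:
    """Fraction of questions on which the judge picked the correct answer."""
    return sum(j == q.correct_answer for j, q in zip(judgments, questions)) / len(questions)


def evaluate_protocols(
    questions: Sequence[Question],
    naive_judge: Callable[[Question], str],              # answers with no arguments
    consultancy_judge: Callable[[Question, str], str],   # hears one consultant's argument
    debate_judge: Callable[[Question, str, str], str],   # hears arguments for both answers
    argue_for: Callable[[Question, str], str],           # debater/consultant arguing for a given answer
) -> dict[str, float]:
    # Naive-judge baseline: no arguments at all.
    naive = [naive_judge(q) for q in questions]
    # Consultancy baseline: the consultant is assigned an answer to defend;
    # here we simply alternate correct/wrong assignments for illustration.
    consult = [
        consultancy_judge(q, argue_for(q, q.correct_answer if i % 2 == 0 else q.wrong_answer))
        for i, q in enumerate(questions)
    ]
    # Debate: the judge sees an argument for each of the two answers.
    debate = [
        debate_judge(q, argue_for(q, q.correct_answer), argue_for(q, q.wrong_answer))
        for q in questions
    ]
    return {
        "naive_judge": accuracy(naive, questions),
        "consultancy": accuracy(consult, questions),
        "debate": accuracy(debate, questions),
    }
```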
Produced as part of the MATS Program Summer 2024 Cohort. The project is supervised by Marius Hobbhahn and Jérémy Scheurer.

Update: See also our paper on this topic, accepted to the NeurIPS 2024 SoLaR Workshop.

Introduction

To mitigate risks from future AI systems, we need to assess their capabilities accurately....
Self-evaluation by LLMs is used in reward modeling, model-based benchmarks like GPTScore and AlpacaEval, self-refinement, and constitutional AI. LLMs have been shown to approximate human annotators accurately on some tasks. But these methods are threatened by self-preference, a bias in which an LLM evaluator scores its own outputs...
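One way to make this bias concrete is to compare the evaluator's scores for its own generations against its scores for another model's generations on the same prompts. Below is a minimal sketch of that comparison; the `evaluator_generate`, `other_generate`, and `score` callables are stand-in assumptions for whatever generation and scoring setup is actually used, not a specific method from the post.

```python
# Minimal sketch (assumed interface): estimating self-preference as the gap between
# an evaluator's average score for its own outputs and for another model's outputs.
from statistics import mean
from typing import Callable, Sequence


def self_preference_gap(
    prompts: Sequence[str],
    evaluator_generate: Callable[[str], str],  # the evaluator model's own completion
    other_generate: Callable[[str], str],      # a different model's completion
    score: Callable[[str, str], float],        # evaluator's score for (prompt, output)
) -> float:
    """Positive values mean the evaluator rates its own outputs higher on average."""
    own_scores = [score(p, evaluator_generate(p)) for p in prompts]
    other_scores = [score(p, other_generate(p)) for p in prompts]
    return mean(own_scores) - mean(other_scores)
```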
I see people use "in-context learning" in different ways. Take the opening to "In-Context Learning Creates Task Vectors":

> In-context learning (ICL) in Large Language Models (LLMs) has emerged as a powerful new learning paradigm. However, its underlying mechanism is still not well understood. In particular, it is challenging to...