The Weighted Perplexity Benchmark: Tokenizer-Normalized Evaluation for Language Model Comparison
Abstract
Perplexity remains the primary intrinsic metric for evaluating language models, yet direct comparison across models with different tokenization schemes presents methodological challenges. We introduce a tokenizer-normalized perplexity metric that enables consistent comparison of language models regardless of their tokenization approaches. Through empirical analysis of 19 language models across five...
Jul 7, 2025
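The excerpt above does not show the paper's exact normalization formula, but the core idea it describes, making perplexity comparable across tokenizers, is commonly achieved by normalizing the total negative log-likelihood by a tokenizer-independent unit (e.g. bytes of the original text) instead of by token count. The sketch below illustrates that idea; the function name and the per-byte choice are assumptions for illustration, not the paper's definition:

```python
import math

def byte_normalized_perplexity(token_nlls, text):
    """Exponentiate total NLL per *byte* of the original text rather than
    per token, so models with different tokenizers are compared on the
    same footing. (Illustrative; not necessarily the paper's formula.)"""
    total_nll = sum(token_nlls)          # sum of per-token NLLs (nats)
    n_bytes = len(text.encode("utf-8"))  # tokenizer-independent unit
    return math.exp(total_nll / n_bytes)

# Toy comparison: the same text under two hypothetical tokenizations
# with identical total NLL. Per-token perplexity would differ (2 vs. 4
# tokens), but the byte-normalized values agree.
text = "hello world"
coarse_nlls = [2.0, 2.2]             # 2 coarse tokens, total NLL = 4.2
fine_nlls = [1.0, 1.1, 1.0, 1.1]     # 4 fine tokens, total NLL = 4.2
```

A per-token perplexity would reward the coarser tokenizer here even though both models assign the text the same total probability; normalizing by bytes removes that artifact.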