Ground-Truth Label Imbalance Impairs the Performance of Contrast-Consistent Search (and Other Contrast-Pair-Based Unsupervised Methods)
Summary
Contrast-Consistent Search (CCS), introduced in Burns et al., 2022, is an unsupervised method for finding truthful directions within the activation spaces of large language models (LLMs). However, all experiments in that study involve training datasets that are balanced with respect to the ground-truth labels of the...
Aug 5, 2023

