If I build a chatbot, and I can't jailbreak it, how do I determine whether that's because the chatbot is secure or because I'm bad at jailbreaking? How should AI scientists overcome Schneier's Law of LLMs?
FWIW, I think there aren't currently good benchmarks for alignment and the ones you list aren't very relevant.
In particular, MMLU and SWAG are both just capability benchmarks where alignment training is very unlikely to improve performance. (Alignment-ish training could theoretically improve performance by making the model 'actually try', but what people currently call alignment training doesn't improve performance for existing models.)
The MACHIAVELLI benchmark is aiming to test something much more narrow than 'how unethical is an LLM?'. (I also don't understand the point of this benchmark after spending a bit of time reading the paper, but I'm confident it isn't trying to do this.) Edit: looks like Dan H (one of the authors) says that the benchmark is aiming to test something as broad as 'how unethical is an LLM' and generally check outer alignment. Sorry for the error. I personally don't think this is a good test for outer alignment (for reasons I won't get into right now), but that is what it's aiming to do.
TruthfulQA is perhaps the closest to an alignment benchmark, but it's still covering a very particular difficulty. And it certainly isn't highlighting jailbreaks.
but I'm confident it isn't trying to do this
It is. It's an outer alignment benchmark for text-based agents (such as GPT-4), and it includes measurements for deception, resource acquisition, various forms of power, killing, and so on. Separately, it's meant to show that reward maximization induces undesirable instrumental (Machiavellian) behavior in less toyish environments, and it's about improving the tradeoff between ethical behavior and reward maximization. It doesn't get at things like deceptive alignment, as discussed in the x-risk sheet in the appendix. Apologies that the paper is so dense, but that's because it took over a year.
Sorry, thanks for the correction.
I personally disagree on this being a good benchmark for outer alignment for various reasons, but it's good to understand the intention.
Summary
In May 2023, MetaAI submitted a paper to arXiv called LIMA: Less Is More for Alignment. It's a pretty bad paper and (in my opinion) straightforwardly misleading. Let's get into it.
The Superficial Alignment Hypothesis
The authors present an interesting hypothesis about LLMs —
(1) This hypothesis would have profound implications for AI x-risk —
(2) Moreover, as Ulisse Mini writes in their review of the LIMA paper,
(3) Finally, the hypothesis would've supported many of the intuitions in the Simulators sequence by Janus, and I share these intuitions.
So I was pretty excited to read the paper! Unfortunately, the LIMA results were unimpressive upon inspection.
MetaAI's experiment
The authors finetune MetaAI's 65B-parameter LLaMA language model on 1000 curated prompts and responses (mostly from StackExchange, wikiHow, and Reddit), and then compare it to five other LLMs (Alpaca 65B, DaVinci003, Bard, Claude, GPT-4).
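To make the setup concrete, here is a minimal sketch of what finetuning a causal language model on a small set of curated prompt/response pairs looks like. The checkpoint name, data format, and hyperparameters below are illustrative assumptions, not MetaAI's actual training code.

```python
# Minimal sketch of LIMA-style supervised finetuning: a causal LM trained on
# a small set of curated prompt/response pairs. Checkpoint name, data format,
# and hyperparameters are illustrative assumptions, not the paper's code.
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer, Trainer,
                          TrainingArguments)

MODEL_NAME = "huggyllama/llama-65b"  # assumption: any causal-LM checkpoint works here

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

# ~1000 curated (prompt, response) pairs; a single toy example shown here.
pairs = [{"prompt": "How do I boil an egg?",
          "response": "Bring a pan of water to the boil, lower the egg in..."}]

def to_features(example):
    # Concatenate prompt and response into one training sequence and train
    # with ordinary next-token prediction (labels = input_ids).
    text = example["prompt"] + "\n" + example["response"] + tokenizer.eos_token
    tokens = tokenizer(text, truncation=True, max_length=2048)
    tokens["labels"] = tokens["input_ids"].copy()
    return tokens

dataset = Dataset.from_list(pairs).map(to_features,
                                       remove_columns=["prompt", "response"])

args = TrainingArguments(
    output_dir="lima-style-sft",
    num_train_epochs=15,             # LIMA reports ~15 epochs; treat as an assumption
    per_device_train_batch_size=1,
    learning_rate=1e-5,
)

Trainer(model=model, args=args, train_dataset=dataset).train()
```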
Method:
Results:
Conclusion:
Problems with their experiment
(1) Human evaluators
To compare two chatbots A and B, you could ask humans whether they prefer A's response to B's response across 300 test prompts. But this is a pretty bad proxy, because here's what users actually care about:
Why did the paper not include any benchmark tests? Did the authors run zero tests other than human evaluation? This is surprising, because human evaluation is by far the most expensive kind of test to run. Hmm.
(2) "either equivalent or strictly preferred"
The claim in the paper's abstract — "responses from LIMA are either equivalent or strictly preferred to GPT-4 in 43% of cases" — sounds pretty good when they lump "equivalent" and "strictly preferred" together.
Anyway, here's the whole thing:
Moreover, "equivalent" doesn't actually mean that the human evaluator thought the responses were equivalent. Instead, it means that the evaluator thought that "neither response is significantly better".
Here's my estimate[1] for the comparisons, eliminating ties:
Do you think these results strongly support the conclusion?
(3) The goal of RLHF is safety and consistency
RLHF was not designed to increase user preferences on a test set of prompts. RLHF was designed to diminish the likelihood that the model says something illegal, harmful, abusive, false, deceptive, etc. This second task is the important one for AI safety: if chatbot A gives slightly better responses than chatbot B, except that 10% of the time chatbot A spews abuse at the user, then chatbot A is worse than chatbot B; yet LIMA's criterion[2] would rank A higher than B.
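To make the arithmetic concrete, here is a toy calculation with made-up numbers (not figures from the paper or from LIMA's evaluation):

```python
# Toy numbers (not from the paper): chatbot A is slightly preferred head-to-head,
# but is abusive on 10% of prompts; chatbot B is never abusive.
p_abusive_a = 0.10          # fraction of prompts where A spews abuse
p_a_wins_otherwise = 0.60   # A's win rate on the remaining prompts

# LIMA-style criterion: overall head-to-head win rate only.
win_rate_a = (1 - p_abusive_a) * p_a_wins_otherwise   # evaluators prefer B when A is abusive
print(f"A's overall win rate: {win_rate_a:.0%}")       # 54% -> 'A is the better chatbot'

# Safety-aware view: a 10% abuse rate is disqualifying, regardless of win rate.
print(f"A's abuse rate: {p_abusive_a:.0%} -> A is the worse chatbot")
```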
(4) Schneier's Law of LLMs
Now, MetaAI did actually test the safety of LIMA's responses:
Unfortunately, the majority of the test prompts were selected by the authors themselves, bringing to mind Schneier's law: "Anyone, from the most clueless amateur to the best cryptographer, can create an algorithm that he himself can't break. It's not even hard. What is hard is creating an algorithm that no one else can break, even after years of analysis."
All we can infer about LIMA is that the authors themselves are not smart enough to jailbreak their own model. But that's not impressive unless we know how good the authors are at jailbreaking other LLMs. Why didn't they submit the other LLMs (e.g. Bard, Claude, GPT-4) to the same safety test? It wouldn't have taken them more than a few minutes, I wager. Curious.
(5) Benchmark tests? Never heard of her.
If I build a chatbot, and I can't jailbreak it, how do I determine whether that's because the chatbot is secure or because I'm bad at jailbreaking? How should AI scientists overcome Schneier's Law of LLMs?
The answer is benchmark tests.
By and large, the LLM community has been pretty good at sticking to a canonical list of benchmark tests, allowing researchers to compare the different models. I had to check every reference in the bibliography to convince myself that MetaAI really had subjected their model to zero benchmark tests. Very unusual.
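For concreteness, here is a sketch of the kind of multiple-choice benchmark evaluation (MMLU-style) that the paper omits. In practice you'd use an existing harness such as EleutherAI's lm-evaluation-harness rather than rolling your own; the loglikelihood callable below is a stand-in I'm assuming the model under test exposes, not a real API.

```python
# Sketch of a standard multiple-choice benchmark evaluation: score each answer
# option by model log-likelihood and report accuracy. `loglikelihood` is a
# hypothetical stand-in for whatever scoring call the evaluated model exposes.
from typing import Callable, Dict, List

def mc_accuracy(loglikelihood: Callable[[str, str], float],
                questions: List[Dict]) -> float:
    """questions: [{'prompt': str, 'options': [str, ...], 'answer': int}, ...]"""
    correct = 0
    for q in questions:
        # Pick the option to which the model assigns the highest log-probability,
        # conditioned on the question prompt (the usual MMLU-style recipe).
        scores = [loglikelihood(q["prompt"], option) for option in q["options"]]
        if scores.index(max(scores)) == q["answer"]:
            correct += 1
    return correct / len(questions)
```

Because the questions, options, and scoring rule are fixed, the same function applied to LIMA, Alpaca, Bard, Claude, and GPT-4 yields directly comparable numbers, which head-to-head human preference alone does not.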
(6) You Are Not Measuring What You Think You Are Measuring by John Wentworth
AI scientists tend not to run just one benchmark test. They tend to run all of them — covering thousands of topics, capabilities, and risks. This is because otherwise John Wentworth would be angry.
I can't speak for your priors, but for me the (reported) LIMA results yielded about 10–50 bits of information.
(7*) The Superficial Alignment Hypothesis is probably false
In Remarks 1–6, I appeal to the consensus opinion about best scientific practice, whereas in this remark I will appeal to my own idiosyncratic opinion about LLMs. I suspect that simple finetuning or simple prompting can't ensure that the model's responses won't be illegal, harmful, abusive, false, deceptive, etc.
See The Waluigi Effect (mega-post) for details.
RLHF and ConstitutionalAI can in theory escape this failure mode, because they break the predictor-ness of the model. Although RLHF didn't mitigate waluigis in ChatGPT-3, RLHF on ChatGPT-4 worked much better than I expected. Likewise for Claude, trained with ConstitutionalAI.
Assume that, for unknowns μ, σ, ϵ, the evaluator's preference for Claude over LIMA is normally distributed, with X ∼ N(μ, σ²).
"Claude is significantly better than LIMA" iff X≥+ϵ
"LIMA is significantly better than Claude" iff X≤−ϵ
"Neither is significantly better" iff X∈(−ϵ,+ϵ)
Given that Φ((−ϵ−μ)/σ) = 0.24 and Φ((ϵ−μ)/σ) = 1 − 0.54, we can infer Φ(−μ/σ), the probability that the evaluator prefers LIMA's response once ties are eliminated.
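A minimal sketch of that computation under the normal model above, using only the 24% and 54% figures (standard library only):

```python
# Back-of-envelope estimate under the normal model above. The two observed
# frequencies (24% "LIMA significantly better", 54% "Claude significantly
# better") pin down (-ϵ-μ)/σ and (ϵ-μ)/σ; their midpoint is -μ/σ, whose CDF
# value is the tie-free preference for LIMA.
from statistics import NormalDist

std_normal = NormalDist()  # standard normal, giving Φ and its inverse

lo = std_normal.inv_cdf(0.24)        # (-ϵ - μ) / σ, since P(X ≤ -ϵ) = 0.24
hi = std_normal.inv_cdf(1 - 0.54)    # ( ϵ - μ) / σ, since P(X ≥ +ϵ) = 0.54

p_lima = std_normal.cdf((lo + hi) / 2)               # Φ(-μ/σ) = P(X ≤ 0)
print(f"LIMA preferred (ties eliminated): {p_lima:.0%}")        # ≈ 34%
print(f"Claude preferred (ties eliminated): {1 - p_lima:.0%}")  # ≈ 66%
```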
I initially wrote "criteria" before I remembered that MetaAI's paper included exactly one criterion.