Status: a slightly-edited copy-paste of a Twitter/X thread I dashed off a week or so ago.
Here's a thought I'm playing with that I'd like feedback on: I think watermarking large language models is probably overrated. Most of the time, what you actually want to know is "is this text endorsed by the person who purportedly authored it?", and that can be checked with digital signatures instead. Another big concern is that people can use LLMs to cheat on essays. This is sad. But what do we give up by having watermarking?
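To make the signature alternative concrete, here's a minimal sketch of the endorse/verify flow. A real deployment would use a public-key scheme (e.g. Ed25519) so that anyone holding the author's public key can verify; stdlib HMAC with a secret key is just a stand-in for sign/verify, and the key and text below are made up.

```python
import hashlib
import hmac

# Toy endorse/verify flow. HMAC needs a shared secret, so this only
# illustrates the shape of the idea; a real scheme would be public-key.

def endorse(text: str, signing_key: bytes) -> str:
    return hmac.new(signing_key, text.encode(), hashlib.sha256).hexdigest()

def is_endorsed(text: str, tag: str, signing_key: bytes) -> bool:
    return hmac.compare_digest(endorse(text, signing_key), tag)

key = b"authors-signing-key"  # hypothetical key
essay = "Words I actually stand behind."
tag = endorse(essay, key)

assert is_endorsed(essay, tag, key)            # genuine endorsement verifies
assert not is_endorsed(essay + "!", tag, key)  # any tampering breaks it
```

The point is that this answers the "did this person endorse this text?" question directly, without needing anything baked into the model's sampling.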
Well, as far as I can tell, if you give people access to model internals - certainly weights, certainly logprobs, but maybe even last-layer activations if they have enough of them - they can bypass the watermarking scheme. This is even sadder: it means you have to strictly limit who can do certain kinds of research that could be pretty useful for safety. In my mind, that cost makes watermarking not worth the benefit.
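To make the bypass concrete, here's a toy sketch of my own, loosely in the style of "green list" watermarks (à la Kirchenbauer et al. 2023): the server biases sampling toward a pseudorandom half of the vocabulary, and a detector looks for that bias. The vocabulary, logits, and hashing details are all made-up assumptions.

```python
import math
import random

VOCAB = list(range(100))  # toy vocabulary; real schemes use the model's tokenizer

def green_list(prev_token: int) -> set[int]:
    # Seed a PRNG on the previous token so generator and detector agree
    # on which half of the vocabulary is "green" at this step.
    rng = random.Random(prev_token)
    return set(rng.sample(VOCAB, len(VOCAB) // 2))

def softmax_sample(logits: list[float], rng: random.Random) -> int:
    m = max(logits)
    return rng.choices(VOCAB, weights=[math.exp(l - m) for l in logits])[0]

def watermarked_sample(logits, prev_token, rng, delta=2.0):
    # Server-side watermarking: boost green-token logits by delta, then sample.
    greens = green_list(prev_token)
    biased = [l + (delta if t in greens else 0.0) for t, l in enumerate(logits)]
    return softmax_sample(biased, rng)

# The bypass: with raw logits/logprobs in hand, a user can sample from the
# model's true distribution themselves, so their text shows only the ~50%
# base rate of green tokens and the detector finds nothing.
rng = random.Random(0)
flat_logits = [0.0] * len(VOCAB)  # stand-in for real model logits
N = 2000
wm_green = plain_green = 0
prev = 0
for _ in range(N):
    t = watermarked_sample(flat_logits, prev, rng)
    wm_green += t in green_list(prev)
    prev = t
prev = 0
for _ in range(N):
    t = softmax_sample(flat_logits, rng)  # unbiased sampling from logprobs
    plain_green += t in green_list(prev)
    prev = t
print(f"green fraction: watermarked={wm_green/N:.2f}, bypassed={plain_green/N:.2f}")
```

The watermark here lives entirely in the sampler, not the weights, which is why logprob access is enough to route around it.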
What could I be missing here? There's a good shot I just don't know, so let me know if you do.
Postscript: Someone has pointed me to this paper that purports to bake a watermark into the weights themselves. I can't figure out how it works (at least not at twitter-compatible speeds), but if it does work, I think that would alleviate my concerns.
I think there's a very real benefit to watermarking that is often overlooked: it lets you filter AI-generated data out of your pre-training corpus. That could be quite important for avoiding some of the dangerous failure modes around models predicting other AIs that we talk about in "Conditioning Predictive Models" - for example, an otherwise-safe predictor could cause a catastrophe if it starts predicting a superintelligent deceptive AI.
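Here's a toy sketch of what that corpus filter might look like, assuming a green-list-style scheme where roughly half the vocabulary is "green" at each step: watermarked text over-samples green tokens, so a one-sided z-test on the green fraction flags it. All names, sizes, and thresholds below are made up for illustration.

```python
import math
import random

VOCAB_SIZE = 100  # toy vocabulary

def green_list(prev_token: int) -> set[int]:
    # Same trick as generation: seed a PRNG on the previous token.
    rng = random.Random(prev_token)
    return set(rng.sample(range(VOCAB_SIZE), VOCAB_SIZE // 2))

def green_fraction(tokens: list[int]) -> float:
    hits = sum(t in green_list(p) for p, t in zip(tokens, tokens[1:]))
    return hits / max(len(tokens) - 1, 1)

def looks_watermarked(tokens: list[int], z_threshold: float = 4.0) -> bool:
    # Under the null (human text), green hits are ~Binomial(n, 0.5).
    n = max(len(tokens) - 1, 1)
    z = (green_fraction(tokens) - 0.5) / math.sqrt(0.25 / n)
    return z > z_threshold

# Synthetic corpus: one "human" document (uniform random tokens) and one
# heavily watermarked document (green token with probability 0.9).
hrng, wrng = random.Random(1), random.Random(2)
human_doc = [hrng.randrange(VOCAB_SIZE) for _ in range(500)]
wm_doc = [0]
for _ in range(500):
    greens = sorted(green_list(wm_doc[-1]))
    wm_doc.append(wrng.choice(greens) if wrng.random() < 0.9
                  else wrng.randrange(VOCAB_SIZE))

corpus = [human_doc, wm_doc]
pretraining_data = [doc for doc in corpus if not looks_watermarked(doc)]
assert pretraining_data == [human_doc]  # AI-generated document filtered out
```

Note that this benefit only holds against text that actually carries the watermark, so it's exactly the bypass problem above again: anyone who can sample around the watermark can also sneak text past this filter.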