All of Jonathan Uesato's Comments + Replies

Experimentally evaluating whether honesty generalizes

I'd still describe my optimistic take as "do imitative generalization."

  • Is it fair to describe the reason you view imitative generalization as necessary at all (instead of just debate/amplification) as "direct oversight is not indefinitely scalable"?
    [ETA: It seems equally/more valid to frame imitative generalization as a way of scaling direct oversight to handle inaccessible info, so this isn't a good framing.] 
  • To check my understanding, you're saying that rather than rely on "some combination of scalable oversight + generalization of honesty OOD" you'd
...
Experimentally evaluating whether honesty generalizes

Zooming out a bit, I would summarize a few high-level threads as:

  • We both agree that the experiments described here primarily relate to optimism-about-generalization, rather than optimism-about-scalable-oversight.
  • I am substantially more optimistic about scalable oversight, whereas you think that (eventually) we will need to rely on some combination of scalable oversight + generalization of honesty OOD.
  • I am substantially more pessimistic about generalization of honesty OOD, whereas you think it is plausible (via some combination of default neural networ
...
2 Paul Christiano 6mo

I'd still describe my optimistic take as "do imitative generalization." But when you really dig into what that means it seems very closely connected to generalization: (i) the reason why "just use this neural net" isn't a good hypothesis is that it generalizes poorly, (ii) for competitiveness reasons you still need to use hypotheses that look quite a lot like neural nets, (iii) so you really need to understand why the "neural net hypothesis" is bad.

I think this was a miscommunication, trying again: in amplification we compute the answer from subanswers. A coherence check can ensure that subanswers and answers agree. If the answer is a deterministic function of the subanswers, then the consistency check can rule most possible combinations out as inconsistent, and therefore: (passes consistency check) + (subanswers are correct) ==> (answers are correct).

The two training strategies are still different in a way that makes consistency checks seem worse (since you might end up tweaking the subanswers to be consistent with the answer, whereas amplification would only try to tweak the answer to be consistent with the subanswers) but it's not clear if that's a key distinction.

I'm comparably optimistic about the "neuralese" case as the French case, though there are a lot of other non-generalization difficulties in the neuralese case and it's not overall the kind of thing that I'm imagining working unless you happen to have "introspective" neural nets (and therefore isn't part of the object-level safety program, it's just part of what you'd do if your neural networks were thinking about neuralese rather than in neuralese).
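
The consistency-check argument in the reply above can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the function name `passes_consistency_check`, the use of integers as answers, and the choice of `sum` as the deterministic function that combines subanswers. It is not a description of how amplification is actually trained.

```python
from typing import Callable, Sequence


def passes_consistency_check(
    answer: int,
    subanswers: Sequence[int],
    combine: Callable[[Sequence[int]], int],
) -> bool:
    """Return True when the answer equals the deterministic function of the subanswers."""
    return answer == combine(subanswers)


# Hypothetical task: the top-level answer is the sum of the subanswers.
true_subanswers = [2, 3, 5]

# (passes consistency check) + (subanswers are correct) ==> (answer is correct).
assert passes_consistency_check(10, true_subanswers, sum)

# A wrong answer paired with correct subanswers is ruled out as inconsistent.
assert not passes_consistency_check(11, true_subanswers, sum)

# The check alone does not fix which side gets adjusted: a wrong subanswer
# can be made consistent with a wrong answer.
assert passes_consistency_check(11, [2, 3, 6], sum)
```

The last assertion corresponds to the worry raised in the reply: a consistency condition is symmetric, so training against it could tweak subanswers to match the answer rather than the other way around.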
Experimentally evaluating whether honesty generalizes

Thanks for these thoughts. Mostly just responding to the bits with questions/disagreements, and skipping the parts I agree with:

That's basically right, although I think the view is less plausible for "decisions" than for some kinds of reports. For example, it is more plausible that a mapping from symbols in an internal vocabulary to expressions in natural language would generalize than that correct decisions would generalize (or even than other forms of reasoning).

  • I'm curious what factors point to a significant difference regarding generalization between "
...
3 Paul Christiano 6mo

For honest translation from your world-model to mine (or at least sufficiently small parts of it), there is a uniform intended behavior. But for decisions there isn't any intended uniform behavior.

This is not clear to me (and it seems like we get to check).

I'm not sure I understand the distinction here. Suppose that amplification lets us compute B from A, so that B is good if A is good (i.e. so that the learned model will end up answering top-level questions well if it answers leaves well). Whereas a coherence condition ensures that A and B are consistent with one another---so if B is "supposed to be" a deterministic function of A, then consistency guarantees that B is good if A is good.

I don't think the model necessarily "knows" how to talk about tone (given only a question about tone in natural language with no examples), nor how to translate between its internal beliefs about tone and a description in natural language. The point of a plausibility/coherence condition is to provide enough constraint to pin those things down. You fully learn what kind of sentences we were looking for when we asked about tone, which might have just been totally non-obvious initially. And you learn a bunch of things about how different concepts about tone are supposed to relate to one another (in order to make sure all of your utterances are consistent).

What is "coherent-but-often-inaccurate" though? The point is that in order to be coherent you actually have to do quite a lot of work, that basically requires you to understand what the involved terms mean to humans.

I don't really think we've done those experiments. I don't know what specification gaming examples you have in mind. And less careful raters are also less careful about coherence conditions. Not to mention that no effort was made to e.g. ask multiple questions about the text and check agreement between them.

I agree that it's possible to have some plausibility condition which is insufficient to get good behavior
Experimentally evaluating whether honesty generalizes

Thanks for sharing these thoughts. I'm particularly excited about the possibility of running empirical experiments to better understand potential risks of ML systems and contribute to debates about difficulties of alignment.

1. Potential implications for optimistic views on alignment

If we observe systems that learn to bullshit convincingly, but don't transfer to behaving honestly, I think that's a real challenge to the most optimistic views about alignment and I expect it would convince some people in ML.

I'm most interested in this point. IIUC, the view...

3 Paul Christiano 6mo

That's basically right, although I think the view is less plausible for "decisions" than for some kinds of reports. For example, it is more plausible that a mapping from symbols in an internal vocabulary to expressions in natural language would generalize than that correct decisions would generalize (or even than other forms of reasoning).

I think a realistic approach would need to use generalization in some situations (where we expect it to work) and then use the facts-that-generalize as an input into supervision. For example, if you were able to answer empirical questions about what's happening right now, you could use those as an input into debate/amplification. (This may also make it more clear why I'm interested in coherence conditions where you can't supervise---in some sense "use the stuff that does generalize as an input into amplification" is quite similar to saying "impose a coherence condition amongst the stuff you can't directly supervise.")

"Optimism about scalable oversight" is what I'm usually thinking about, but it does seem to me that there are some cases where it is inadequate. You could hope to play a quantitative/empirical game of getting lots of useful work out of AI before this kind of approach breaks down, but I am interested in whether there's a chance at going straight for an indefinitely scalable approach to alignment.

That seems right to me and that is a reasonable description of my view. (I'd be curious to know if you don't encounter a lot of optimism-about-generalization.)

For the purposes of "indefinitely scalable alignment approach" the relevant threshold is something quite ambitious like "reflects everything the system knows."

I think there's a further question about whether the model knows how to talk about these concepts / is able to express itself.
This is another part of the motivation for plausibility constraints (if nothing else they teach the model the syntax, but then when combined with other knowledge about language th
Frequent arguments about alignment

Thanks for writing this. I've been having a lot of similar conversations, and found your post clarifying in stating a lot of core arguments clearly.

Is there an even better critique that the Skeptic could make?

Focusing first on human preference learning as a subset of alignment research: I think most ML researchers "should" agree on the importance of simple human preference learning, both from a safety and capabilities perspective. If we take the narrower question "should we do human preference learning, or is pretraining + minimal prompt engineering enough...

2 John Schulman 7mo

Agree with what you've written here -- I think you put it very well.
SGD's Bias

I'd be interested in the relationship between this and Implicit Gradient Regularization and the sharp/flat minima lit. The basic idea there is to compare the continuous gradient flow on the original objective to the path followed by SGD due to discretization. They show that the latter can be re-interpreted as optimizing a modified objective which favors flat minima (low sensitivity to parameter perturbations). This isn't clearly the same as what you're analyzing here, since you're looking at variance due to sampling instead, but they might be related under...

3 johnswentworth 8mo

My own understanding of the flat minima idea is that it's a different thing. It's not really about noise, it's about gradient descent in general being a pretty shitty optimization method, which converges very poorly to sharp minima (more precisely, minima with a high condition number). (Continuous gradient flow circumvents that, but using step sizes small enough to circumvent the problem in practice would make GD prohibitively slow. The methods we actually use are not a good approximation of continuous flow, as I understand it.)

If you want flat minima, then an optimization algorithm which converges very poorly to sharp minima could actually be a good thing, so long as you combine it with some way to escape the basin of the sharp minimum (e.g. noise in SGD). That said, I haven't read the various papers on this, so I'm at high risk of misunderstanding.

Also worth noting that there are reasons to expect convergence to flat minima besides bias in SGD itself. A flatter basin fills more of the parameter space than a sharper basin, so we're more likely to initialize in a flat basin (relevant to the NTK/GP/Mingard et al picture) or accidentally stumble into one.
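
The condition-number point in the reply above can be seen on a toy problem. The sketch below is only an illustration under stated assumptions: a two-dimensional quadratic with hand-picked curvatures, full-batch gradient descent with no noise, and a hypothetical helper named `gd_steps_to_converge`. It says nothing about SGD noise or about real loss landscapes.

```python
def gd_steps_to_converge(curvatures, tol=1e-6, max_steps=1_000_000):
    """Run gradient descent on f(x) = 0.5 * sum(c_i * x_i**2), starting from x_i = 1.

    Uses the step size 1 / max(curvatures), which is stable for this quadratic.
    Returns the number of steps taken until every |x_i| is below tol.
    """
    step = 1.0 / max(curvatures)
    x = [1.0 for _ in curvatures]
    for n in range(max_steps):
        if all(abs(xi) < tol for xi in x):
            return n
        # The gradient along coordinate i is c_i * x_i.
        x = [xi - step * c * xi for xi, c in zip(x, curvatures)]
    return max_steps


well_conditioned = gd_steps_to_converge([1.0, 1.0])    # condition number 1
ill_conditioned = gd_steps_to_converge([100.0, 1.0])   # condition number 100
print(well_conditioned, ill_conditioned)
```

The step size is capped by the sharpest direction, so progress along the flat direction shrinks by only a factor of (1 - 1/100) per step. That is why the ill-conditioned run needs far more steps than the well-conditioned one.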
What Multipolar Failure Looks Like, and Robust Agent-Agnostic Processes (RAAPs)

Thanks for the great post. I found this collection of stories and framings very insightful.

1. Strong +1 to "Problems before solutions." I'm much more focused when reading this story (or any threat model) on "do I find this story plausible and compelling?" (which is already a tremendously high bar) before even starting to get into "how would this update my research priorities?"

2. I wanted to add a mention to Katja Grace's "Misalignment and Misuse" as another example discussing how single-single alignment problems and bargaining failures can blur together an...

2 Andrew Critch 9mo

Thanks for the pointer to grace2020whose! I've added it to the original post now under "successes in our agent-agnostic thinking".

For sure, that is the point of the "successes" section. Instead of "outside the EA / rationality / x-risk meme-bubbles, lots of AI researchers think about agent-agnostic processes" I should probably have said "outside the EA / rationality / x-risk meme-bubbles, lots of AI researchers think about agent-agnostic processes, and to my eye there should be more communication across the boundary of that bubble."