Yeah, that's my view.
That's right---you still only get a bound on average quality, and you need to do something to cope with failures so rare they never appear in training (here's a post reviewing my best guesses).
But before you weren't even in the game, it wouldn't matter how well adversarial training worked because you didn't even have the knowledge to tell whether a given behavior is good or bad. You weren't even getting the right behavior on average.
(In the OP I think the claim "the generalization is now coming entirely from human beliefs" is fine, I meant generalization from one distribution to another. "Neural nets are are fine" was sweeping these issues under the rug. Though note that in the real world the distribution will change from neural net training to deployment, it's just exactly the normal robustness problem. The point of this post is just to get it down to only a robustness problem that you could solve with some kind of generalization of adversarial training, the reason to set it up as in the OP was to make the issue more clear.)
So even when you talk about amplifying f, you mean a certain way of extending human predictions to more complicated background information (e.g. via breaking down Z into chunks and then using copies of f that have been trained on smaller Z), not fine-tuning f to make better predictions.
That's right, f is either imitating a human, or it's trained by iterated amplification / debate---in any case the loss function is defined by the human. In no case is f optimized to make good predictions about the underlying data.
My impression is that your hope is that if Z and f start out human-like, then this is like specifying the "programming language" of a universal prior, so that search for highly-predictive Z, decoded through f, will give something that uses human concepts in predicting the world.
Z should always be a human-readable (or amplified-human-readable) latent; it will necessarily remain human-readable because it has no purpose other than to help a human make predictions. f is going to remain human-like because it's predicting what the human would say (or what the human-consulting-f would say etc.).
The amplified human is like the programming language of the universal prior, Z is like the program that is chosen (or slightly more precisely: Z is like a distribution over programs, described in a human-comprehensible way) and f is an efficient distillation of the intractable ideal.
I'm not totally sure what actually distinguishes f and Z, especially once you start jointly optimizing them. If f incorporates background knowledge about the world, it can do better at prediction tasks. Normally we imagine f having many more parameters than Z, and so being more likely to squirrel away extra facts, but if Z is large then we might imagine it containing computationally interesting artifacts like patterns that are designed to train a trainable f on background knowledge in a way that doesn't look much like human-written text.
f is just predicting P(y|x, Z), it's not trying to model D. So you don't gain anything by putting facts about the data distribution in f---you have to put them in Z so that it changes P(y|x,Z).
Now, maybe you can try to ensure that Z is at least somewhat textlike via making sure it's not too easy for a discriminator to tell from human text, or requiring it to play some functional role in a pure text generator, or whatever.
The only thing Z does is get handed to the human for computing P(y|x,Z).
The difference is that you can draw as many samples as you want from D* and they are all iid. Neural nets are fine in that regime.
It seems even worse than any of that. If your AI wanted anything at all it might debate well in order to survive. So if you are banking on it single-mindedly wanting to win the debate then you were already in deep trouble.
Do you think something like IDA is the only plausible approach to alignment? If so, I hadn't realized that, and I'd be curious to hear more arguments, or just intuitions are fine. The aligned overseer you describe is supposed to make treachery impossible by recognizing it, so it seems your concern is equivalent to the concern: "any agent (we make) that learns to act will be treacherous if treachery is possible." Are all learning agents fundamentally out to get you? I suppose that's a live possibility to me, but it seems to me there is a possibility we could design an agent that is not inclined to treachery, even if the treachery wouldn't be recognized
No, but what are the approaches to avoiding deceptive alignment that don't go through competitiveness?
I guess the obvious one is "don't use ML," and I agree that doesn't require competitiveness.
Edit: even so, having two internal components that are competitive with each other (e.g. overseer and overseee) does not require competitiveness with other projects.
No, but now we are starting to play the game of throttling the overseee (to avoid it overpowering the overseer) and it's not clear how this is going to work and be stable. It currently seems like the only appealing approach to getting stability there is to ensure the overseer is competitive.
This argument seems to prove too much. Are you saying that if society has learned how to do artificial induction at a superhuman level, then by the time we give a safe planner that induction subroutine, someone will have already given that induction routine to an unsafe planner? If so, what hope is there as prediction algorithms relentlessly improve? In my view, the whole point of AGI Safety research is to try to come up with ways to use powerful-enough-to-kill-you artificial induction in a way that it doesn't kill you (and helps you achieve your other goals). But it seems you're saying that there is a certain level of ingenuity where malicious agents will probably act with that level of ingenuity before benign agents do.
I'm saying that if you can't protect yourself from an AI in your lab, under conditions that you carefully control, you probably couldn't protect yourself from AI systems out there in the world.
The hope is that you can protect yourself from an AI in your lab.
So competitiveness still matters somewhat, but here's a potential disagreement we might have: I think we will probably have at least a few months, and maybe more than a year, where the top one or two teams have AGI (powerful enough to kill everyone if let loose), and nobody else has anything more valuable than an Amazon Mechanical Turk worker.
Definitely a disagreement, I think that before anyone has an AGI that could beat humans in a fistfight, tons of people will have systems much much more valuable than a mechanical turk worker.
The way I map these concepts, this feels like an elision to me. I understand what you're saying, but I would like to have a term for "this AI isn't trying to kill me", and I think "safe" is a good one. That's the relevant sense of "safe" when I say "if it's safe, we can try it out and tinker". So maybe we can recruit another word to describe an AI that is both safe itself and able to protect us from other agents.
I mean that we don't have any process that looks like debate that could produce an agent that wasn't trying to kill you without being competitive, because debate relies on using aligned agents to guide the training process (and if they aren't competitive then the agent-being-trained will, at least in the limit, converge to an equilibrium where it kills you).