I feel excited about this framework! Several thoughts:
I especially like the metathreat hierarchy. It makes sense because if you completely curry it, each agent sees the foe's action, policy, metapolicy, etc., which are all generically independent pieces of information. But it gets weird when an agent sees an action that's not compatible with the foe's policy.
You hinted briefly at using hemicontinuous maps of sets instead of, or in addition to, probability distributions, and I think that's a big part of what makes this framework exciting. Maybe if one takes a bilimit of Scott domains or something similar, one gets an agent that can be understood simultaneously on multiple levels, and so evades commitment races. I haven't thought much about that.
I think you're right that the epiphenomenal utility functions are not good. I still think using reflective oracles is a good idea. I wonder if the power of Kakutani fixed points (magical reflective reasoning) can be combined with the power of Kleene fixed points (iteratively refining commitments).
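To make the Kleene side concrete: the "iteratively refining commitments" picture is just iterating a monotone map on a lattice from a starting point until it stabilizes. Here's a minimal toy sketch of that iteration; the `refine` map and the choice of subsets-of-a-finite-set as the lattice are my illustrative assumptions, not anything from the post.

```python
# Toy Kleene-style fixed-point iteration. Commitments are modeled as
# subsets of a finite option set {0, 1, 2, 3}, ordered by inclusion.
# We repeatedly apply a monotone refinement map until nothing changes.

def kleene_fixed_point(f, start):
    """Iterate f from `start` until a fixed point is reached."""
    x = start
    while True:
        y = f(x)
        if y == x:
            return x
        x = y

# Hypothetical monotone map: closing a commitment set under
# "if option i is committed, so is option i+1" (a stand-in for
# refining a commitment to include its consequences).
def refine(s):
    return frozenset(s) | frozenset(i + 1 for i in s if i + 1 <= 3)

print(kleene_fixed_point(refine, frozenset({1})))  # frozenset({1, 2, 3})
```

The contrast with Kakutani is that nothing here is "magical": each step is a computable refinement, and convergence is guaranteed by monotonicity on a finite lattice rather than by a topological fixed-point theorem.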
See also this comment from 2013 that has the computable version of NicerBot.
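For readers who haven't seen it: my rough paraphrase of a NicerBot-style strategy is "cooperate with probability equal to the foe's probability of cooperating with you, plus a small ε, capped at 1" — that paraphrase (and everything below) is my own sketch, not the 2013 comment's exact construction. Under that assumption, two such bots have mutual cooperation as a fixed point, which simple iteration finds:

```python
# Toy sketch of a NicerBot-style strategy (my paraphrase): cooperate
# with probability equal to the foe's cooperation probability plus a
# small epsilon, capped at 1. Two such bots iterated against each other
# climb to mutual cooperation.

EPS = 0.01

def nicerbot_response(q):
    """Cooperation probability against a foe who cooperates with prob q."""
    return min(1.0, q + EPS)

def mutual_fixed_point(tol=1e-9, max_iter=10_000):
    """Iterate two NicerBots' responses to each other until convergence."""
    p = q = 0.0
    for _ in range(max_iter):
        p_new = nicerbot_response(q)
        q_new = nicerbot_response(p)
        if abs(p_new - p) < tol and abs(q_new - q) < tol:
            return p_new, q_new
        p, q = p_new, q_new
    return p, q

print(mutual_fixed_point())  # (1.0, 1.0)
```

The ε bonus is what breaks the symmetry: without it, any pair of matching probabilities is a fixed point, and mutual defection is as stable as mutual cooperation.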
Or maybe it means we train the professional in the principles and heuristics that the bot knows. The question is whether we can compress the bot's knowledge into, say, a 1-year training program for professionals.
There are reasons to be optimistic: We can discard information that isn't knowledge (lossy compression). And we can teach the professional in human concepts (lossless compression).
This sounds like a great goal, if you mean "know" in a lazy sense; I'm imagining a question-answering system that will correctly explain any game, move, position, or principle as the bot understands it. I don't believe I could know all at once everything that a good bot knows about go. That's too much knowledge.
Red-penning is a general problem-solving method that's kinda similar to this research methodology.
I'd believe the claim if I thought that alignment was easy enough that AI products that pass internal product review, and that don't immediately trigger lawsuits, would be aligned enough not to end the world through alignment failure. But I don't think that's the case, unfortunately.
It seems like we'll have to put special effort into both single/single alignment and multi/single "alignment", because the free market might not give it to us.
I'd like more discussion of the claim that alignment research is unhelpful-at-best for existential safety because it accelerates deployment. It seems to me that alignment research has a couple of paths to positive impact which might balance that risk:
Tech companies will be incentivized to deploy AI with slipshod alignment, which might then take actions that no one wants and which pose existential risk. (Concretely, I'm thinking of "going out with a whimper" and "going out with a bang" scenarios.) But the existence of better alignment techniques might legitimize governance demands, i.e. demands that tech companies not make products that do things that literally no one wants.
Single/single alignment might be a prerequisite to certain computational social choice solutions. E.g., once we know how to build an agent that "does what [human] wants", we can then build an agent that "helps [human 1] and [human 2] draw up incomplete contracts for mutual benefit subject to the constraints in the [policy] written by [human 3]". And slipshod alignment might not be enough for this application.
In this case humans are doing the job of transferring from D to D∗, and the training algorithm just has to generalize from a representative sample of D∗ to the test set.
Thanks for the references! I now know that I'm interested specifically in cooperative game theory, and I see that Shoham & Leyton-Brown has a chapter on "coalitional game theory", so I'll take a look.
A proof of the lemma V(a,b∗)≤v∗: