I'm Anthony DiGiovanni, a suffering-focused AI safety researcher at the Center on Long-Term Risk. I (occasionally) write about altruism-relevant topics on my blog, Ataraxia. All opinions my own.
Basic questions: If the type of Adv(M) is a pseudo-input, as suggested by the above, then what does Adv(M)(x) even mean? What is the event whose probability is being computed? Does the unacceptability checker C also take real inputs as the second argument, not just pseudo-inputs—in which case I should interpret a pseudo-input as a function that can be applied to real inputs, and Adv(M)(x) is the statement "A real input x is in the pseudo-input (a set) given by Adv(M)"?
(I don't know how pedantic this is, but the unacceptability penalty seems pretty important, and I struggle to understand what the unacceptability penalty is because I'm confused about Adv(M)(x).)
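A minimal Python sketch of that second reading, purely to pin down the types. Everything here is an assumption for illustration, not something taken from the post: real inputs are modeled as strings, a pseudo-input is modeled as a predicate on real inputs, and `adv`, `alpha`, and `prob_in_pseudo_input` are hypothetical names. Under this reading, Adv(M)(x) is a true/false event, so it has a probability over real inputs x.

```python
from typing import Callable, List

# Assumption: real inputs are plain strings, and a pseudo-input is a
# predicate saying whether a real input belongs to the set it describes.
RealInput = str
PseudoInput = Callable[[RealInput], bool]


def adv(model_name: str) -> PseudoInput:
    """Hypothetical adversary: maps a model to a pseudo-input (a predicate)."""
    def alpha(x: RealInput) -> bool:
        # Toy membership rule, chosen only for illustration.
        return "trigger" in x
    return alpha


def prob_in_pseudo_input(alpha: PseudoInput, samples: List[RealInput]) -> float:
    """Empirical frequency of the event alpha(x) over sampled real inputs."""
    return sum(alpha(x) for x in samples) / len(samples)


alpha = adv("M")  # Adv(M): a pseudo-input
samples = ["hello", "trigger phrase", "another input", "a trigger"]
freq = prob_in_pseudo_input(alpha, samples)  # estimate of P(Adv(M)(x))
```

On this reading, `alpha(x)` plays the role of Adv(M)(x): the statement that the real input x lies in the set described by the pseudo-input. Whether the post intends this is exactly what the question above asks.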
(At the risk of necroposting:) Was this paper ever written? Can't seem to find it, but I'm interested in any developments on this line of research.
I think you might be misunderstanding Jan's understanding. A big crux in this whole discussion between Eliezer and Richard seems to be: Eliezer believes any AI capable of doing good alignment research (at least good enough to provide a plan that would help humans make an aligned AGI) must be good at consequentialist reasoning in order to generate good alignment plans. (I gather from Nate's notes in that conversation, plus various other posts, that he agrees with Eliezer here, but I'm not certain.) I strongly doubt that Jan just mistook MIRI's focus on understanding consequentialist reasoning for a belief that alignment research requires being a consequentialist reasoner.