Anthony DiGiovanni — AI Alignment Forum

AI ALIGNMENT FORUM
AF

METR: Measuring AI Ability to Complete Long Tasks

No, at some point you "jump all the way" to AGI

I'm confused as to what the actual argument for this is. It seems like you've just kinda asserted it. (I realize in some contexts all you can do is offer an "incredulous stare," but this doesn't seem like the kind of context where that suffices.)

I'm not sure if the argument is supposed to be the stuff you say in the next paragraph (if so, the "Also" is confusing).

[Link] Why I’m optimistic about OpenAI’s alignment approach

Anthony DiGiovanni, 3y

I think you might be misunderstanding Jan's understanding. A big crux in this whole discussion between Eliezer and Richard seems to be: Eliezer believes any AI capable of doing good alignment research, at least good enough to provide a plan that would help humans make an aligned AGI, must be good at consequentialist reasoning in order to generate good alignment plans. (I gather from Nate's notes in that conversation plus various other posts that he agrees with Eliezer here, but I'm not certain.) I strongly doubt that Jan just mistook MIRI's focus on understanding consequentialist reasoning for a belief that alignment research requires being a consequentialist reasoner.

Relaxed adversarial training for inner alignment

Anthony DiGiovanni, 3y

$L_M = P_M(\mathrm{Adv}(M)(x) \mid x \sim \mathrm{deploy}) \cdot P_M(C(M, x) \mid \mathrm{Adv}(M)(x),\ x \sim \mathrm{deploy})$

Basic questions: If the type of Adv(M) is a pseudo-input, as suggested by the above, then what does Adv(M)(x) even mean? What is the event whose probability is being computed? Does the unacceptability checker C also take real inputs as the second argument, not just pseudo-inputs—in which case I should interpret a pseudo-input as a function that can be applied to real inputs, and Adv(M)(x) is the statement "A real input x is in the pseudo-input (a set) given by Adv(M)"?

(I don't know how pedantic this is, but the unacceptability penalty seems pretty important, and I struggle to understand what the unacceptability penalty is because I'm confused about Adv(M)(x).)
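To make the second reading in the question concrete, here is a minimal sketch, assuming a pseudo-input is a predicate over real inputs, so that Adv(M)(x) means "x lies in the set described by Adv(M)". Everything in it is hypothetical and not taken from the post: the names `estimate_penalty`, `toy_deploy`, `toy_alpha`, and `toy_checker` are invented for illustration, and the model's own beliefs P_M are replaced by empirical frequencies over sampled inputs, which is a simplification.

```python
import random


def estimate_penalty(pseudo_input, checker, sample_deploy, n=10_000, seed=0):
    """Estimate P(x in alpha | x ~ deploy) * P(C(x) | x in alpha, x ~ deploy).

    ASSUMPTION: `pseudo_input` is a predicate on real inputs (the "set"
    reading of Adv(M)), and probabilities are sample frequencies rather
    than the model's beliefs P_M used in the original formula.
    """
    rng = random.Random(seed)
    xs = [sample_deploy(rng) for _ in range(n)]

    # Real inputs that fall inside the pseudo-input, i.e. Adv(M)(x) holds.
    in_alpha = [x for x in xs if pseudo_input(x)]
    if not in_alpha:
        return 0.0  # The pseudo-input never occurred in the sample.

    p_alpha = len(in_alpha) / n
    p_unacceptable = sum(1 for x in in_alpha if checker(x)) / len(in_alpha)
    return p_alpha * p_unacceptable


# Hypothetical toy setup: inputs are integers drawn uniformly from 0..99.
def toy_deploy(rng):
    return rng.randrange(100)


def toy_alpha(x):
    # Stand-in for Adv(M): "inputs of at least 90".
    return x >= 90


def toy_checker(x):
    # Stand-in for C(M, x): flags even inputs as unacceptable.
    return x % 2 == 0


penalty = estimate_penalty(toy_alpha, toy_checker, toy_deploy)
```

Under this reading the product is just the joint probability that a deployment input both falls in the pseudo-input and is flagged by the checker. Whether that is the event the post intends is exactly what the question is asking.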

Cooperative Oracles: Introduction

Anthony DiGiovanni, 5y

(At the risk of necroposting:) Was this paper ever written? Can't seem to find it, but I'm interested in any developments on this line of research.
