AI ALIGNMENT FORUM
Anthony DiGiovanni

Researcher at the Center on Long-Term Risk. All opinions my own.

Posts

Anthony DiGiovanni's Shortform (3y)
Responses to apparent rationalist confusions about game / decision theory (2y)
When is intent alignment sufficient or necessary to reduce AGI conflict? (3y)
When would AGIs engage in conflict? (3y)
When does technical work to reduce AGI conflict make a difference?: Introduction (3y)

Comments

METR: Measuring AI Ability to Complete Long Tasks
Anthony DiGiovanni · 7mo

No, at some point you "jump all the way" to AGI

I'm confused as to what the actual argument for this is. It seems like you've just kinda asserted it. (I realize in some contexts all you can do is offer an "incredulous stare," but this doesn't seem like the kind of context where that suffices.)

I'm not sure if the argument is supposed to be the stuff you say in the next paragraph (if so, the "Also" is confusing).

[Link] Why I’m optimistic about OpenAI’s alignment approach
Anthony DiGiovanni · 3y

I think you might be misunderstanding Jan's understanding. A big crux in the whole discussion between Eliezer and Richard seems to be: Eliezer believes that any AI capable of doing good alignment research (at least good enough to produce a plan that would help humans build an aligned AGI) must itself be good at consequentialist reasoning. (I gather from Nate's notes in that conversation, plus various other posts, that he agrees with Eliezer here, though I'm not certain.) I strongly doubt that Jan simply mistook MIRI's focus on understanding consequentialist reasoning for a belief that alignment research requires being a consequentialist reasoner.

Relaxed adversarial training for inner alignment
Anthony DiGiovanni · 3y

$\mathrm{Adv} : M \to X_{\text{pseudo}}$

$L_M = P_M(\mathrm{Adv}(M)(x) \mid x \sim \text{deploy}) \cdot P_M(C(M, x) \mid \mathrm{Adv}(M)(x),\ x \sim \text{deploy})$

Basic questions: If the type of Adv(M) is a pseudo-input, as the above suggests, then what does Adv(M)(x) even mean, and what is the event whose probability is being computed? Does the unacceptability checker C also take real inputs as its second argument, not just pseudo-inputs? If so, should I interpret a pseudo-input as a function that can be applied to real inputs (i.e., a set of them), with Adv(M)(x) being the statement "the real input x is in the pseudo-input given by Adv(M)"?

(I don't know how pedantic this is, but the unacceptability penalty seems pretty important, and I struggle to understand what it is while I'm confused about Adv(M)(x).)
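To make the reading I'm proposing concrete, here's a minimal type sketch (Python-style; the names Input, PseudoInput, Adversary, adv_m_of_x, and unacceptability_penalty are my own illustrative choices, not anything from the post):

```python
from typing import Callable

# Illustrative types only; these names are mine, not the post's.
Input = str                                  # a real input x
PseudoInput = Callable[[Input], bool]        # a pseudo-input, read as a set of real inputs
Model = object                               # the model M
Adversary = Callable[[Model], PseudoInput]   # Adv : M -> X_pseudo


def adv_m_of_x(adv: Adversary, model: Model, x: Input) -> bool:
    """Under this reading, Adv(M)(x) is the event
    'the real input x lies in the pseudo-input that Adv produces for M'."""
    pseudo_input = adv(model)
    return pseudo_input(x)


def unacceptability_penalty(p_flagged: float, p_unacceptable_given_flagged: float) -> float:
    """L_M as a product of two model-estimated probabilities:
    P_M(Adv(M)(x) | x ~ deploy) * P_M(C(M, x) | Adv(M)(x), x ~ deploy)."""
    return p_flagged * p_unacceptable_given_flagged
```

If that's the intended typing, then the event in the first factor is just membership of a sampled deployment input in the set picked out by Adv(M).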

Cooperative Oracles: Introduction
Anthony DiGiovanni · 5y

(At the risk of necroposting:) Was this paper ever written? I can't seem to find it, but I'm interested in any developments on this line of research.
