Nate Soares

# Wiki Contributions

I don't see this as worst-case thinking. I do see it as speaking from a model that many locals don't share (without any particular attempt made to argue that model).

In particular, if the AGI has some pile of kludges disproportionately pointed towards accomplishing X, and the AGI does self-reflection and “irons itself out”, my prediction is “maybe this AGI will wind up pursuing X, or maybe not, I dunno”.

AFAICT, our degree of disagreement here turns on what you mean by "pointed". Depending on that, I expect I'd either say "yeah maybe, but that kind of pointing is hard" or "yep, my highest-credence models have pretty high probability on this thing failing to optimize X once it's sorted out".

For instance, the latter response obtains if the "pointing" is done by naive training.

(Though I also have some sense that I see the situation as more fragile than you--there's lots of ways for reflection to ruin your day, if the wrong kludge is pointed the wrong way. So maybe we have a broader disagreement about that, too.)

Also, as a reminder, my high credence in doom doesn't come from high confidence in a claim like this. You can maybe get one nine here; I doubt you can get three. My high credence in doom comes from its disjunctive nature.

It would still help like me to have a "short version" section at the top :-)

I think that distillations of research agendas such as this one are quite valuable, and hereby offer LawrenceC a $3,000 prize for writing it. (I'll follow up via email.) Thanks, LawrenceC! Going forward, I plan to keep an eye out for distillations such as this one that seem particularly skilled or insightful to me, and offer them a prize in the$1-10k range, depending on how much I like them.

Insofar as I do this, I'm going to be completely arbitrary about it, and I'm only going to notice attempts haphazardly, so please don't do rely on the assumption that I'll grant your next piece of writing a prize. I figure that this is better than nothing, but I don't have the bandwidth at the moment to make guarantees about how many posts I'll read.

Separately, here's some quick suggestions about what sort of thing would make me like the distillation even more:

• Distill out a 1-sentence-each bullet-point version of each of the 9 hypotheses at the top
• Distill out a tweet-length version of the overall idea
• Distill out a bullet-point list of security assumptions

These alone aren't necessarily what makes a distillation better-according-to-me in and of themselves, but I hypothesize that the work required to add those things is likely to make the overall distillation better. (The uniting theme is something like: forcing yourself to really squeeze out the key ideas.)

For clarity as to what I mean by a "security assumption", here are a few examples (not intended to apply in the case of shard theory, but intenderd rather to transmit the idea of a security assumption):

• "This plan depends critically on future ability to convince leading orgs to adopt a particular training method."
• "This plan depends critically on the AI being unable to confidently distinguish deployment from training."
• "This plan depends critically on the hypothesis that AI tech will remain compute-intensive right up until the end."

This is maybe too vague a notion; there's an art to figuring out what counts and what doesn't, and various people probably disagree about what should count.

well, in your search for that positive result, i recommend spending some time searching for a critch!simplified alternative to the Y combinator :-p.

not every method of attaining self-reference in the λ-calculus will port over to logic (b/c in the logical setting lots of things need to be quoted), but the quotation sure isn't making the problem any easier. a solution to the OP would yield a novel self-reference combinator in the λ-calculus, and the latter might be easier to find (b/c you don't need to juggle quotes).

if you can lay bare the self-referential property that you're hoping for in the easier setting of λ-calculus, then perhaps others will have an easier time understanding what you want and helping out (and/or you'll have an easier time noticing why your desires are unsatisfiable).

(and if it's still not clear that löb's theorem is tightly connected to the Y combinator, such that any solution to the OP would immediately yield a critch!simplified self-reference combinator in the λ-calculus, then I recommend spending a little time studying the connection between the Y combinator, löb's theorem, and lawvere's fixpoint theorem.)

which self-referential sentence are you trying to avoid?

it keeps sounding to me like you're saying "i want a λ-calculus combinator that produces the fixpoint of a given function f, but i don't want to use the Y combinator".

do you deny the alleged analogy between the normal proof of löb and the Y combinator? (hypothesis: maybe you see that the diagonal lemma is just the type-level Y combinator, but have not yet noticed that löb's theorem is the corresponding term-level Y combinator?)

if you follow the analogy, can you tell me what λ-term should come out when i put in f, and how it's better than (λ s. f (s s)) (λ s. f (s s))?

or (still assuming you follow the analogy): what sort of λ-term representing the fixpoint of f would constitute "referring to itself (instead of being a term about types that refer to themselves)"? in what sense is the term (λ s. f (s s)) (λ s. f (s s)) failing to "refer to itself", and what property are you hoping for instead?

(in case it helps with communication: when i try myself to answer these questions while staring at the OP, my best guess is that you're asking "instead of the Y combinator, can we get a combinator that goes like f ↦ f ????", and the two obvious ways to fill in the blanks are f ↦ f (Y f) and f ↦ f (f (f (.... i discussed why both of those are troublesome here, but am open to the possibility that i have not successfully understood what sort of fixpoint combinator you desire.)

(ETA: also, ftr, in the proof-sketch of löb's theorem that i gave above, the term "g "g"" occurs as a subterm if you do enough substitution, and it refers to the whole proof of löb's theorem. just like how, in the version of the Y combinator given above, the term g g occurs as a subterm if you do enough β-reduction, and it refers to the whole fixpoint. which i note b/c it seems to me that you might have misunderstood a separate point about where the OP struggles as implying that the normal proof isn't self-referring.)

((the OP is gonna struggle insofar as it does the analog of asking for a λ-term x that is literally syntactically identical to f x, for which the only solution is f (f (f (... which is problematic. but a λ-term y that β-reduces pretty quickly to f y is easy, and it's what the standard constuction produces.))

various attempts to distill my objection:

the details of the omitted "zip" operation are going to be no simpler than the standard proof of löb's theorem, and will probably turn out to be just a variation on the standard proof of löb's theorem (unless you can find a way of building self-reference that's shorter than the Y combinator (b/c the standard proof of löb's theorem is already just the Y combinator plus the minimal modifications to interplay with gödel-codes))

even more compressed: the normal proof of löb is contained in the thing labeled "zip". the argument in the OP is operating under the assumption that a proof can make reference to its own gödel-number, and that assumption is essentially just löb's theorem.

a thing that might help is asking what specific proof-steps the step "1. the gödel-code P of this very proof proves C" cashes out into. obviously step 1 can't cash out into verifying (under quotation) every step of the entire proof (by checking for the appropriate prime-factors), because then the proof would require at least one step for every step in the proof plus three extra steps at the end, which would make the proof infinite (and infinite proofs admit contradictions).

so step 1 has to involve some indirection somehow. (cf: we can't get a fixpoint in λ-calculus by f (f (f ..., we need some indirection, as occurs in (Y f) aka (λ s. f (s s)) (λ s. f (s s)), to make the term finite.)

this works fine, as long as you are ok w/ the gödel-code in step 1 being not P, but some different proof of the same conclusion (analogous to the difference between f (Y f) and Y f), and so long as you have the analog of the Y combinator (modified to interplay with gödel-codes).

but the analog of the Y combinator is precisely löb's theorem :-p

if i understand your OP correctly, we can factor it into three claims:

1. when proving C, it's valid to assume a referece «this» to the gödel-code of a proof of C
2. given (1) and a proof f of □ C → C, there's a proof of C by simply applying f to «this»
3. you can arrange things such that if you desugar «this» into a particular numeral in "f «this»", you'll find that that numeral is the gödel-code of "f «this»".

(1) is löb's theorem.

(2) is an indirect proof of löb's theorem, from the assumption of löb's theorem (in direct analogy with how, once we've found syntax Y f for the fixpoint of f, we can wrap it in f to find an alternate syntax, namely f (Y f)).

(3) is false (i'm fairly confident)--you can get real close, but you can't quite get there (in direct analogy to how you can get λ-term Y f that means the same thing as f (Y f), but you can't get a λ-term fix f that is syntactically exactly f (fix f), on pain of infinite syntax).

afaict, there's two ways to read this proposal (if we port across the analogy to λ-calculus), according to whether you cash "zip" out to a recursive instance of the whole proposal, vs whether you cash "zip" out to some different proof of löb's theorem:

1. "i hear that people find the Y combinator confusing. instead of using (Y f) to name the fixpoint of f, why not use f (f (f (f (... ad infinitum?"

2. "i hear that people find the Y combinator confusing. instead of using (Y f) to name the fixpoint of f, why not use f (Y f)?"

the answer to (1) is that it doesn't make sense unless you allow infinite syntax trees (which aren't permitted in normal math b/c infinite proofs admit contradictions).

the answer to (2) is that, while (Y f) may be somewhat tricky to grasp, f (Y f) is surely no simpler :-p

an attempt to rescue what seems to me like the intuition in the OP:

(note that the result is underwhelming, but perhaps informative.)

in the lambda calculus we might say "given (f : A → A), we can get an element (a : A) by the equation (a := f a)".

recursive definitions such as this one work perfectly well, modulo the caveat that you need to have developed the Y combinator first so that you can ground out this new recursive syntax (assuming you don't want to add any new basic combinators).

by a directly analogous argument, we might wish define (löb f := f "löb f"). very direct! very quine-like!

and, this proof works perfectly well. ...modulo the caveat that you need to have developed a modified Y combinator first, so that you can ground out this recursive syntax (assuming you don't want to add any new axioms).

so if you prove the existence of a modified Y-combinator (that works w/ quotations), then you can add recursive syntax (modulo quotation) to your theory, and get a very "direct/quine-like" proof of löb's theorem.

the only hiccup is that the modified-Y-combinator is itself löb's theorem, so the above proof amounts to "once you have proven löb's theorem, there is a very direct and intuitive proof of löb's theorem", which... is perhaps less exciting :-p

(bonus fun fact: in type theory, there's an operator that forms quines. and the proof of löb's theorem is extremely simple, given this operator; you just take a (syntactic) term of type □ A → A, and feed it the quotation of the result of this whole operation, using the quining operator. the hiccup is, ofc, that the quining operator is itself just löb's theorem!)

(Epistemic status: quickly-recounted lightly-edited cached state that I sent in response to an email thread on this topic, that I now notice had an associated public post. Sorry for the length; it was easier to just do a big unfiltered brain-dump than to cull, with footnotes added.)

here's a few quick thoughts i have cached about the proof of löb's theorem (and which i think imply that the suggested technique won't work, and will be tricky to repair):

#1. löb's theorem is essentially just the Y combinator, but with an extra level of quotation mixed in.[1]

Y : (A → A) → A
Y f := let g be (λ s. f (s s)) in g g

löb : □ (□ A → A) → □ A
löb f := let g be (λ s. f (s "s")) in g "g"


the standard proof of löb's theorem (e.g. as it appears in the cartoon guide) is complicated by two points:

1a. one thing we need here is that the variable s is sensical, and in particular we need some S with S ≃ (□ S → A). the existence of such an S follows from the diagonal lemma.[2]

1b. the other thing we need here is to show that quotation and normal proof steps (function application or modus ponens or whatever) interact in the "obvious way", and don't gum each other up. for instance, we need that □(A → B) ∧ □(A) yields □(B), showing that we can "apply quoted functions to quoted arguments", in order for syntax like f (s "s") to be sensical.

if i were trying to make a guide to löb's theorem today, i'd break it down as above; for me, at least, this breakdown is significantly more intuitive than the cartoon guide (which, iirc, doesn't even go into the diagonal lemma, and just blends the Y combinator together with 1b).

this suggests that you're going to be hard-pressed to do any self-reference without routing through the nomal machinery of löb's theorem, in the same way that it's hard to do recursion in the lambda calculus without routing through the Y combinator--you don't need to use Y per se, but you're unlikely to find a simpler way to do generic recursion.

this also suggests that you're hiding all the work done by the normal proof of löb's theorem in the definition of "zip", and that this proof won't turn out to be simpler when "zip" is cashed out in full.

#2. fun fact: the diagonal lemma is also just an instance of the Y combinator with an extra level of quotation.

in fact, there's a proof of the diagonal lemma that goes by analogy to the proof-sketch of löb's theorem (above). very roughly speaking, the way that you avoid infinite regress is that, in the step 1a, you need an S with S ≃ □ S → Type, and this S can be written manually because Type (or Proposition or w/e you want to use) is very rich (namely, it contains existential quantifiers and equality).[3]

indeed, you can think of the diagonal lemma as simply an application of löb's theorem where the target A is Type (or Proposition or w/e).

in fact, this analogy is quite tight. one cool corollary is that we can pop open tarski's undefinability of truth, and replace the diagonal lemma with löb's theorem. the result is a new theorem that i call "the undefinability of normalization": no theory can prove that its terms normalize, on pain of contradiction.

(proof, in case it's not obvious: invoke löb's theorem on (not ∘ normalize : □ Bool → Bool), now you have syntax for a boolean value that's provably-equal to its own negation.)

#3. fun fact: löb's theorem does not hold for various breeds of normalized proofs (such as proofs with cuts eliminated--though exactly what counts as "normalized" in PA is somewhat subtle). in particular, when □ denotes normalized proofs, □(A → B) ∧ □(A) does not yield □(B), because feeding a normalized argument into a normalized function does not yield a normalized output. and while we (who trust the theory) know that unnormalized-box(B) implies normalized-box(B), no consistent theory can prove its own version of that fact, by the undefinability of normalization!

...which yields the hypothesis that the One True Counterfactuals are those facts which have short normalized proofs.

(recall: i think it was patrick who first set forth the trolljecture that counterfactuals correspond to short proofs of material implications? and this trolljecture was shot down by the existence of spurious counterfactuals, which use löb to deduce arbitrary material implications with a ~constant proof-length. but that counterexample doesn't apply to normalized proofs! so the repaired trolljecture yet stands!)

(though i admit this hasn't risen high enough in my priority-queue for me to actually try to prove nice things about this bold new notion of counterfactuals. also they obviously can't be the whole answer, b/c normalized proofs about large worlds are large, etc.)

1. slighly more formally: both the Y combinator and löb's theorem are instantiations of a generalization of lawvere's fixpoint theorem to semicategories. in the case of the Y combinator, we use some domain (in the sense of domain theory) that serves as semantics for the untyped lambda calculus. in löb's theorem, iirc we use a semicategory where a morphism from A to B is a proof of □ A → B (tho this is an old cached thought that i haven't checked). that the latter is not a category follows from gödel incompleteness; there's no identity morphism at ⊥.

note: i dislike semicategories, and if you have a more elegant way to unify Y and löb, i'm eager to hear it. ↩︎

2. in other words: the santa clause sentence is just the natural/obvious type of the variable in the Y combinator (as modified to work with gödel-codes). if you want a better intuition for it, i recommend the exercise of trying to figure out what type the variable s should have in the normal Y combinator. ↩︎

3. in fact, according to me, this is one of three keys to löb's theorem.

the first two keys are ofc gödel-codes + the Y combinator, with the caveat that making the Y combinator behave requires the diagonal lemma.

and the diagonal lemma is again just gödel-codes + the Y combinator, except that in this very specific case we can make the Y combinator behave without already needing the diagonal lemma, by hand-coding in a workaround using quantifiers and equalities and whatnot.

and the hand-coded workaround is kinda hacky. very roughly speaking, the trouble is that we want to let G(s) be F (s "s"), but showing that s "s" is sensical relies on the diagonal lemma, which we're in the middle of proving and can't invoke.

and the solution is to let G(s) instead be ∃ x. just x = try (s "s") ∧ F x. we don't need to know that s "s" is sensical, because we're naming a proposition, and we can just bolt on a clause asserting that s "s" is sensical! we just build our sentence to say "s "s" is sensical, and also ...", and thus sidestep the problem!

in other words: we're able to build a sentence that would be meaningful (it would mean a falsehood) even if s "s" wasn't sensical, and this lets us form a sort of "guarded" version of the Y combinator (in the special case where we're producing a sentence) that doesn't depend critically on the sensicality of s "s". (which ends up making s "s" be sensical in the actual use-case of G "G", and so the guard ends up redundant and harmless.)

(exercise: why can't you use this technique to avoid the invocation of the diagonal lemma from the proof of gödel's incompleteness theorem, i.e. löb's theorem where the target A is ⊥?)

this technique lets us slip around the need to invoke the diagonal lemma before we've finished proving the diagonal lemma. which is a critical step in the bootstrapping process.

...i don't have this point entirely distilled yet, but hopefully that gives at least some intuition for what i mean when i say that löb is just gödel-codes + the Y combinator + a hacky workaround to get you off the ground. ↩︎

my guess is it's not worth it on account of transaction-costs. what're they gonna do, trade half a universe of paperclips for half a universe of Fun? they can already get half a universe of Fun, by spending on Fun what they would have traded away to paperclips!

and, i'd guess that one big universe is more than twice as Fun as two small universes, so even if there were no transaction costs it wouldn't be worth it. (humans can have more fun when there's two people in the same room, than one person each in two separate rooms.)

there's also an issue where it's not like every UFAI likes paperclips in particular. it's not like 1% of humanity's branches survive and 99% make paperclips, it's like 1% survive and 1% make paperclips and 1% make giant gold obelisks, etc. etc. the surviving humans have a hard time figuring out exactly what killed their bretheren, and they have more UFAIs to trade with than just the paperclipper (if they want to trade at all).

maybe the branches that survive decide to spend some stars on a mixture of plausible-human-UFAI-goals in exchange for humans getting an asteroid in lots of places, if the transaction costs are low and the returns-to-scale diminish enough and the visibility works out favorably. but it looks pretty dicey to me, and the point about discussing aliens first still stands.