All of So8res's Comments + Replies

If you allow indirection and don't worry about it being in the right format for superintelligent optimization, then sufficiently-careful humans can do it.

Answering your request for prediction, given that it seems like that request is still live: a thing I don't expect the upcoming multimodal models to be able to do: train them only on data up through 1990 (or otherwise excise all training data from our broadly-generalized community), ask them what superintelligent machines (in the sense of IJ Good) should do, and have them come up with something like CEV (... (read more)

I claim that to the extent ordinary humans can do this, GPT-4 can nearly do this as well

(Insofar as this was supposed to name a disagreement, I do not think it is a disagreement, and don't understand the relevance of this claim to my argument.)

Presumably you think that ordinary human beings are capable of "singling out concepts that are robustly worth optimizing for".

Nope! At least, not directly, and not in the right format for hooking up to a superintelligent optimization process.

(This seems to me like plausibly one of the sources of misunderstandi... (read more)

Matthew Barnett (2mo):
If ordinary humans can't single out concepts that are robustly worth optimizing for, then either:

1. Human beings in general cannot single out what is robustly worth optimizing for, or
2. Only extraordinary humans can single out what is robustly worth optimizing for.

Can you be more clear about which of these you believe?

I'm also including "indirect" ways that humans can single out concepts that are robustly worth optimizing for. But then I'm allowing that GPT-N can do that too. Maybe this is where the confusion lies?

If you're allowing for humans to act in groups and come up with these concepts after e.g. deliberation, and still think that ordinary humans can't single out concepts that are robustly worth optimizing for, then I think this view is a little silly, although the second interpretation at least allows for the possibility that the future goes well and we survive AGI, and that would be nice to know.

That helps somewhat, thanks! (And sorry for making you repeat yourself before discarding the erroneous probability-mass.)

I still feel like I can only barely maybe half-see what you're saying, and only have a tenuous grasp on it.

Like: why is it supposed to matter that GPT can solve ethical quandaries on par with its ability to perform other tasks? I can still only half-see an answer that doesn't route through the (apparently-disbelieved-by-both-of-us) claim that I used to argue that getting the AI to understand ethics was a hard bit, by staring at sentences ... (read more)

Matthew Barnett (2mo):
Thanks for trying to understand my position. I think this interpretation that you gave is closest to what I'm arguing. I have a quick response to what I see as your primary objection: I think this is kinda downplaying what GPT-4 is good at? If you talk to GPT-4 at length, I think you'll find that it's cognizant of many nuances in human morality that go way deeper than the moral question of whether to "call 911 when Alice is in labor and your car has a flat".

Presumably you think that ordinary human beings are capable of "singling out concepts that are robustly worth optimizing for". I claim that to the extent ordinary humans can do this, GPT-4 can nearly do this as well, and to the extent it can't, I expect almost all the bugs to be ironed out in near-term multimodal models.

It would be nice if you made a precise prediction about what type of moral reflection or value specification multimodal models won't be capable of performing in the near future, if you think that they are not capable of the 'deep' value specification that you care about. And here, again, I'm looking for some prediction of the form: humans are able to do X, but LLMs/multimodal models won't be able to do X by, say, 2028. Admittedly, making this prediction precise is probably hard, but it's difficult for me to interpret your disagreement without a little more insight into what you're predicting.

I have the sense that you've misunderstood my past arguments. I don't quite feel like I can rapidly precisely pinpoint the issue, but some scattered relevant tidbits follow:

  • I didn't pick the name "value learning", and probably wouldn't have picked it for that problem if others weren't already using it. (Perhaps I tried to apply it to a different problem than Bostrom-or-whoever intended it for, thereby doing some injury to the term and to my argument?)

  • Glancing back at my "Value Learning" paper, the abstract includes "Even a machine intelligent enough

... (read more)

Glancing back at my "Value Learning" paper, the abstract includes "Even a machine intelligent enough to understand its designers’ intentions would not necessarily act as intended", which supports my recollection that I was never trying to use "Value Learning" for "getting the AI to understand human values is hard" as opposed to "getting the AI to act towards value in particular (as opposed to something else) is hard", and supports my sense that this isn't hindsight bias, and is in fact a misunderstanding.

For what it's worth, I didn't claim that you argue... (read more)

I was recently part of a group-chat where some people I largely respect were musing about this paper and this post and some of Scott Aaronson's recent "maybe intelligence makes things more good" type reasoning.

Here's my replies, which seemed worth putting somewhere public:

The claims in the paper seem wrong to me as stated, and in particular seems to conflate values with instrumental subgoals. One does not need to terminally value survival to avoid getting hit by a truck while fetching coffee; they could simply understand that one can't fetch the coffee

... (read more)

Someone recently privately asked me for my current state on my 'Dark Arts of Rationality' post. Here's some of my reply (lightly edited for punctuation and conversation flow), which seemed worth reproducing publicly:

FWIW, that post has been on my list of things to retract for a while.

(The retraction is pending a pair of blog posts that describe some of my thoughts on related matters, which have been in the editing queue for over a year and the draft queue for years before that.)

I wrote that post before reading much of the sequences, and updated away from

... (read more)

Below is a sketch of an argument that might imply that the answer to Q5 is (classically) 'yes'. (I thought about a question that's probably the same a little while back, and am reciting from cache, without checking in detail that my axioms lined up with your A1-4).

Pick a lottery with the property that forall with and , forall , we have . We will say that is "extreme(ly high)".

Pick a lottery with .

Now, for any with , define to be the guaranteed by continuity (A3).

Lemma: forall with , ... (read more)
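
(For orientation: assuming A3 here is the standard VNM continuity axiom, an assumption on my part, it states that for lotteries A ≺ B ≺ C there is some mixture weight p with B indifferent to the mixture:)

\[ A \prec B \prec C \;\implies\; \exists\, p \in (0,1)\ \text{such that}\ B \sim p\,A + (1-p)\,C. \]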

I'm awarding another $3,000 distillation prize for this piece, with compliments to the authors.

A few people recently have asked me for my take on ARC evals, and so I've aggregated some of my responses here:

- I don't have strong takes on ARC Evals, mostly on account of not thinking about it deeply.
- Part of my read is that they're trying to, like, get a small dumb minimal version of a thing up so they can scale it to something real. This seems good to me.
- I am wary of people in our community inventing metrics that Really Should Not Be Optimized and handing them to a field that loves optimizing metrics.
- I expect there are all sorts of issues that wo... (read more)

John has also made various caveats to me, of the form "this field is pre-paradigmatic and the math is merely suggestive at this point". I feel like he oversold his results even so.

Part of it is that I get the sense that John didn't understand the limitations of his own results--like the fact that the telephone theorem only says anything in the infinite case, and the thing it says then does not (in its current form) arise as a limit of sensible things that can be said in finite cases. Or like the fact that the alleged interesting results of the gKPD theorem... (read more)

(Also, I had the above convos with John >1y ago, and perhaps John simply changed since then.)

In hindsight, I do think the period when our discussions took place were a local maximum of (my own estimate of the extent of applicability of my math), partially thanks to your input and partially because I was in the process of digesting a bunch of the technical results we talked about and figuring out the next hurdles. In particular, I definitely underestimated the difficulty of extending the results to finite approximations.

That said, I doubt that fully accounts for the difference in perception.

John said "there was not any point at which I thought my views were importantly misrepresented" when I asked him for comment. (I added this note to the top of the post as a parenthetical; thanks.)

More details:

  • I think the argument Nate gave is at least correct for markets of relatively-highly-intelligent agents, and that was a big update for me (thankyou Nate!). I'm still unsure how far it generalizes to relatively less powerful agents.
  • Nate left out my other big takeaway: Nate's argument here implies that there's probably a lot of money to be made in real-world markets! In practice, it would probably look like an insurance-like contract, by which two traders would commit to the "side-channel trades at non-market prices" required to make them aggrega
... (read more)

Here's a recent attempt of mine at a distillation of a fragment of this plan, copied over from a discussion elsewhere:


goal: make there be a logical statement such that a proof of that statement solves the strawberries-on-a-plate problem (or w/e).

summary of plan:

  • the humans put in a herculean effort to build a multi-level world-model that is interpretable to them (ranging from quantum chemistry at the lowest level, to strawberries and plates at the top)
  • we interpret this in a very conservative way, as a convex set of models that hopefully contains someth
... (read more)

I don't see this as worst-case thinking. I do see it as speaking from a model that many locals don't share (without any particular attempt made to argue that model).

In particular, if the AGI has some pile of kludges disproportionately pointed towards accomplishing X, and the AGI does self-reflection and “irons itself out”, my prediction is “maybe this AGI will wind up pursuing X, or maybe not, I dunno”.

AFAICT, our degree of disagreement here turns on what you mean by "pointed". Depending on that, I expect I'd either say "yeah maybe, but that kind of po... (read more)

Steve Byrnes (10mo):
Oh, sorry. I’m “uncertain” assuming Model-Based RL with the least-doomed plan that I feel like I more-or-less know how to implement right now. If we’re talking about “naïve training”, then I’m probably very pessimistic, depending on the details. That’s helpful, thanks!

It would still help me to have a "short version" section at the top :-)

Lawrence Chan (1y):
I've expanded the TL;DR at the top to include the nine theses. Thanks for the suggestion!

I think that distillations of research agendas such as this one are quite valuable, and hereby offer LawrenceC a $3,000 prize for writing it. (I'll follow up via email.) Thanks, LawrenceC!

Going forward, I plan to keep an eye out for distillations such as this one that seem particularly skilled or insightful to me, and offer them a prize in the $1-10k range, depending on how much I like them.

Insofar as I do this, I'm going to be completely arbitrary about it, and I'm only going to notice attempts haphazardly, so please don't rely on the assumption that I... (read more)

Lawrence Chan (1y):
Thanks Nate! I didn't add a 1-sentence bullet point for each thesis because I thought the table of contents on the left was sufficient, though in retrospect I should've written it up mainly for learning value. Do you still think it's worth doing after the fact?  Ditto the tweet thread, assuming I don't plan on tweeting this.

well, in your search for that positive result, i recommend spending some time searching for a critch!simplified alternative to the Y combinator :-p.

not every method of attaining self-reference in the λ-calculus will port over to logic (b/c in the logical setting lots of things need to be quoted), but the quotation sure isn't making the problem any easier. a solution to the OP would yield a novel self-reference combinator in the λ-calculus, and the latter might be easier to find (b/c you don't need to juggle quotes).

if you can lay bare the self-referential ... (read more)
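
(For concreteness, here is the standard self-reference combinator referred to above, written as a small Python sketch rather than raw λ-calculus; Python is strict, so this is the Z variant of Y, and the factorial at the end is only an illustration.)

# Z combinator, i.e. Z = λf. (λx. f (λv. x x v)) (λx. f (λv. x x v)),
# the strict-evaluation variant of the Y combinator.
Z = lambda f: (lambda x: f(lambda v: x(x)(v)))(lambda x: f(lambda v: x(x)(v)))

# Tying a recursive knot with no named recursion anywhere:
fact = Z(lambda rec: lambda n: 1 if n == 0 else n * rec(n - 1))
assert fact(5) == 120

The difficulty flagged above is that a logical analogue has to thread quotation through each self-application, which is exactly the part that doesn't carry over for free.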

which self-referential sentence are you trying to avoid?

it keeps sounding to me like you're saying "i want a λ-calculus combinator that produces the fixpoint of a given function f, but i don't want to use the Y combinator".

do you deny the alleged analogy between the normal proof of löb and the Y combinator? (hypothesis: maybe you see that the diagonal lemma is just the type-level Y combinator, but have not yet noticed that löb's theorem is the corresponding term-level Y combinator?)

if you follow the analogy, can you tell me what λ-term should come out when... (read more)

Andrew Critch (1y):
At this point I'm more interested in hashing out approaches that might actually conform to the motivation in the OP.  Perhaps I'll come back to this discussion with you after I've spent a lot more time in a mode of searching for a positive result that fits with my motivation here.  Meanwhile, thanks for thinking this over for a bit.

various attempts to distill my objection:


the details of the omitted "zip" operation are going to be no simpler than the standard proof of löb's theorem, and will probably turn out to be just a variation on the standard proof of löb's theorem (unless you can find a way of building self-reference that's shorter than the Y combinator (b/c the standard proof of löb's theorem is already just the Y combinator plus the minimal modifications to interplay with gödel-codes))


even more compressed: the normal proof of löb is contained in the thing labeled "zip". th... (read more)

an attempt to rescue what seems to me like the intuition in the OP:

(note that the result is underwhelming, but perhaps informative.)

in the lambda calculus we might say "given (f : A → A), we can get an element (a : A) by the equation (a := f a)".

recursive definitions such as this one work perfectly well, modulo the caveat that you need to have developed the Y combinator first so that you can ground out this new recursive syntax (assuming you don't want to add any new basic combinators).

by a directly analogous argument, we might wish to define (löb f := f "löb... (read more)

(Epistemic status: quickly-recounted lightly-edited cached state that I sent in response to an email thread on this topic, that I now notice had an associated public post. Sorry for the length; it was easier to just do a big unfiltered brain-dump than to cull, with footnotes added.)

here's a few quick thoughts i have cached about the proof of löb's theorem (and which i think imply that the suggested technique won't work, and will be tricky to repair):


#1. löb's theorem is essentially just the Y combinator, but with an extra level of quotation mixed in.[1]

... (read more)
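
(For reference, a minimal sketch of the standard proof of Löb's theorem, i.e. the "normal machinery" these comments keep pointing at, assuming necessitation, the distribution axiom, and □P → □□P; the parenthetical notes mark where the Y-combinator shape shows up.)

\begin{align*}
&\text{Hypothesis: } \vdash \Box C \to C. \quad \text{Goal: } \vdash C.\\
&1.\;\; \vdash \Psi \leftrightarrow (\Box\Psi \to C) && \text{diagonal lemma (the } x\,x \text{ self-application)}\\
&2.\;\; \vdash \Box\Psi \to \Box(\Box\Psi \to C) && \text{necessitation and distribution on 1}\\
&3.\;\; \vdash \Box\Psi \to (\Box\Box\Psi \to \Box C) && \text{distribution on 2}\\
&4.\;\; \vdash \Box\Psi \to \Box\Box\Psi && \Box P \to \Box\Box P\\
&5.\;\; \vdash \Box\Psi \to \Box C && \text{from 3 and 4}\\
&6.\;\; \vdash \Box\Psi \to C && \text{from 5 and the hypothesis}\\
&7.\;\; \vdash \Psi && \text{from 1 and 6}\\
&8.\;\; \vdash \Box\Psi && \text{necessitation on 7}\\
&9.\;\; \vdash C && \text{from 6 and 8}
\end{align*}

Step 1 supplies the self-reference; steps 7-9 are the "apply it to itself" move, which is the term-level role the Y combinator plays.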
Andrew Critch (1y):
If by "the normal machinery", you mean a clever application of the diagonal lemma, then I agree. But I think we can get away with not having the self-referential sentence, by using the same y-combinator-like diagonal-lemma machinery to make a proof that refers to itself (instead of a proof about sentences that refer to themselves) and checks its own validity. I think if I or someone else produces a valid proof like that, skeptics of its value (of which you might be one; I'm not sure) will look at it and say "That was harder and less efficient than the usual way of proving Löb using the self-referential sentence Ψ and no self-validation". I predict I'll agree with that, and still find the new proof to be of additional intellectual value, for the following reason:

* Human documents tend to refer to themselves a lot, like bylaws.
* Human sentences, on the other hand, rarely refer to themselves. (This sentence is an exception, but there aren't a lot of naturally occurring examples.)
* Therefore, a proof of Löb whose main use of self-reference involves the entire proof referring to itself, rather than a single sentence referring to itself, will be more intuitive to humans (such as lawyers) who are used to thinking about self-referential documents.

The skeptic response to that will be to say that those people's intuitions are the wrong way to think about y-combinator manipulation, and to that I'll be like "Maybe, but I'm not super convinced their perspective is wrong, and in any case I don't mind meeting them where they're at, using a proof that they find more intuitive."

Summary: I'm pretty confident the proof will be valuable, even though I agree it will have to use much of the same machinery as the usual proof, plus some extra machinery for helping the proof to self-validate, as long as the proof doesn't use sentences that are basically only about their own meaning (the way the sentence Ψ is basically only about its own relationship to the sentence C, whi

my guess is it's not worth it on account of transaction-costs. what're they gonna do, trade half a universe of paperclips for half a universe of Fun? they can already get half a universe of Fun, by spending on Fun what they would have traded away to paperclips!

and, i'd guess that one big universe is more than twice as Fun as two small universes, so even if there were no transaction costs it wouldn't be worth it. (humans can have more fun when there's two people in the same room, than one person each in two separate rooms.)

there's also an issue where it's n... (read more)

and, i'd guess that one big universe is more than twice as Fun as two small universes, so even if there were no transaction costs it wouldn't be worth it. (humans can have more fun when there's two people in the same room, than one person each in two separate rooms.)

This sounds astronomically wrong to me. I think that my personal utility function gets close to saturation with a tiny fraction of the resources in the universe-shard. Two people in one room is better than two people in separate rooms, yes. But two rooms with a trillion people each is virtually t... (read more)

For sure. It's tricky to wipe out humanity entirely without optimizing for that in particular -- nuclear war, climate change, and extremely bad natural pandemics look to me like they're at most global catastrophes, rather than existential threats. It might in fact be easier to wipe out humanity by engineering a pandemic that's specifically optimized for this task (than it is to develop AGI), but we don't see vast resources flowing into humanity-killing-virus projects, the way that we see vast resources flowing into AGI projects. By my accounting, most other... (read more)

BrownHairedEevee (2y):
Hi, I'm the user who asked this question. Thank you for responding!

I see your point about how an AGI would intentionally destroy humanity versus engineered bugs that only wipe us out "by accident", but that's conditional on the AGI having "destroy humanity" as a subgoal. Most likely, a typical AGI will have some mundane, neutral-to-benevolent goal like "maximize profit by running this steel factory and selling steel". Maybe the AGI can achieve that by taking over an iron mine somewhere, or taking over a country (or the world) and enslaving its citizens, or even wiping out humanity. In general, my guess is that the AGI will try to do the least costly/risky thing needed to achieve its goal (maximizing profit), and (setting aside that if all of humanity were extinct, the AGI would have no one to sell steel to) wiping out humanity is the most expensive of these options and the AGI would likely get itself destroyed while trying to do that. So I think that "enslave a large portion of humanity and export cheap steel at a hefty profit" is a subgoal that this AGI would likely have, but destroying humanity is not.

It depends on the use case - a misaligned AGI in charge of the U.S. Armed Forces could end up starting a nuclear war - but given how careful the U.S. government has been about avoiding nuclear war, I think they'd insist on an AGI being very aligned with their interests before putting it in charge of something so high stakes.

Also, I suspect that some militaries (like North Korea's) might be developing bioweapons and spending 1 to 100% as much on it annually as OpenAI and DeepMind spend on AGI; we just don't know about it. Based on your AGI-bioweapon analogy, I suspect that AGI is a greater hazard than bioweapons, but not by quite as much as your argument implies. While few well-resourced actors are interested in using bioweapons, a who's who of corporations, states, and NGOs will be interested in using AGI. And AGIs can adopt dangerous subgoals for a wide range

Question for Richard, Paul, and/or Rohin: What's a story, full of implausibly concrete details but nevertheless a member of some largish plausible-to-you cluster of possible outcomes, in which things go well? (Paying particular attention to how early AGI systems are deployed and to what purposes, or how catastrophic deployments are otherwise forestalled.)

I wrote this doc a couple of years ago (while I was at CHAI). It's got many rough edges (I think I wrote it in one sitting and never bothered to rewrite it to make it better), but I still endorse the general gist, if we're talking about what systems are being deployed to do and what happens amongst organizations. It doesn't totally answer your question (it's more focused on what happens before we get systems that could kill everyone), but it seems pretty related.

(I haven't brought it up before because it seems to me like the disagreement is much more in th... (read more)

In response to your last couple paragraphs: the critique, afaict, is not "a real human cannot keep multiple concrete scenarios in mind and speak probabilistically about those", but rather "a common method for representing lots of hypotheses at once, is to decompose the hypotheses into component properties that can be used to describe lots of concrete hypotheses. (toy model: instead of imagining all numbers, you note that some numbers are odd and some numbers are even, and then think of evenness and oddness). A common failure mode when attempting this is th... (read more)

Relevant Feynman quote: 

I had a scheme, which I still use today when somebody is explaining something that I’m trying to understand: I keep making up examples.

For instance, the mathematicians would come in with a terrific theorem, and they’re all excited. As they’re telling me the conditions of the theorem, I construct something which fits all the conditions. You know, you have a set (one ball)-- disjoint (two balls). Then the balls turn colors, grow hairs, or whatever, in my head as they put more conditions on.

Finally they state the theorem, which is some dumb thing about the ball which isn’t true for my hairy green ball thing, so I say “False!” [and] point out my counterexample.

("near-zero" is a red herring, and I worry that that phrasing bolsters the incorrect view that the reason MIRI folk think alignment is hard is that we want implausibly strong guarantees. I suggest replacing "reduce x-risk to near-zero" with "reduce x-risk to sub-50%".)

Rohin Shah (2y):
(Done)

My take on the exercise:

Is Humbali right that generic uncertainty about maybe being wrong, without other extra premises, should increase the entropy of one's probability distribution over AGI, thereby moving out its median further away in time?

Short version: Nah. For example, if you were wrong by dint of failing to consider the right hypothesis, you can correct for it by considering predictable properties of the hypotheses you missed (even if you don't think you can correctly imagine the true research pathway or w/e in advance). And if you were wrong

... (read more)
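
(A toy numerical sketch of the narrow point, with made-up dates: mixing a confident timelines distribution with a maximum-entropy one raises the entropy yet can pull the median earlier, so "more uncertainty" does not by itself mean "later median".)

# Hypothetical toy distributions over AGI-arrival years (the dates mean nothing).
import numpy as np

years = np.arange(2025, 2101)

# A fairly confident distribution: uniform over 2060-2080.
confident = np.where((years >= 2060) & (years <= 2080), 1.0, 0.0)
confident /= confident.sum()

# "Humbali" adjustment: mix 50/50 with a maximum-entropy uniform over all years.
mixed = 0.5 * confident + 0.5 * np.full_like(confident, 1.0 / len(years))

def median_year(p):
    return years[np.searchsorted(np.cumsum(p), 0.5)]

def entropy(p):
    nz = p[p > 0]
    return float(-(nz * np.log(nz)).sum())

assert entropy(mixed) > entropy(confident)          # entropy went up...
assert median_year(mixed) < median_year(confident)  # ...but the median moved earlier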

I have discussed with MIRI their decision to make their research non-disclosed-by-default and we agreed that my research agenda is a reasonable exception.

Small note: my view of MIRI's nondisclosed-by-default policy is that if all researchers involved with a research program think it should obviously be public then it should obviously be public, and that doesn't require a bunch of bureaucracy. I think this while simultaneously predicting that when researchers have a part of themselves that feels uncertain or uneasy about whether their research sho... (read more)

The second statement seems pretty plausible (when we consider human-accessible AGI designs, at least), but I'm not super confident of it, and I'm not resting my argument on it.

The weaker statement you provide doesn't seem like it's addressing my concern. I expect there are ways to get highly capable reasoning (sufficient for, e.g., gaining decisive strategic advantage) without understanding low-K "good reasoning"; the concern is that said systems are much more difficult to align.

As I noted when we chatted about this in person, my intuition is less "there is some small core of good consequentialist reasoning (it has “low Kolmogorov complexity” in some sense), and this small core will be quite important for AI capabilities" and more "good consequentialist reasoning is low-K and those who understand it will be better equipped to design AGI systems where the relevant consequentialist reasoning happens in transparent boxes rather than black boxes."

Indeed, if I thought one had to understand good consequentialist reasoning in order to design a highly capable AI system, I'd be less worried by a decent margin.

Jessica Taylor (7y):
The way I wrote it, I didn't mean to imply "the designers need to understand the low-K thing for the system to be highly capable", merely "the low-K thing must appear in the system somewhere for it to be highly capable". Does the second statement seem right to you? (perhaps a weaker statement, like "for the system to be highly capable, the low-K thing must be the correct high-level understanding of the system, and so the designers must understand the low-K thing to understand the behavior of the system at a high level", would be better?)

Weighing in late here, I'll briefly note that my current stance on the difficulty of philosophical issues is (in colloquial terms) "for the love of all that is good, please don't attempt to implement CEV with your first transhuman intelligence". My strategy at this point is very much "build the minimum AI system that is capable of stabilizing the overall strategic situation, and then buy a whole lot of time, and then use that time to figure out what to do with the future." I might be more optimistic than you about how easy it will turn out to be to find a

... (read more)

Nice work!

Minor note: in equation 1, I think the should be an .

I'm not all that familiar with paraconsistent logic, so many of the details are still opaque to me. However, I do have some intuitions about where there might be gremlins:

Solution 4.1 reads, "The agent could, upon realizing the contradiction, ..." You've got to be a bit careful here: the formalism you're using doesn't contain a reasoner that does something like "realize the contradiction." As stated, the agent is simply constructed to execute an action if it can prove ; it i

... (read more)
IAFF-User-4 (9y):
Section 4.1 is fairly unpolished. I'm still looking for better ways of handling the problems it brings up; solutions 4.1 and 4.2 are very preliminary stabs in that direction.

The action condition you mention might work. I don't think it would re-introduce Löbian or similar difficulties, as it merely requires that ā implies that G is only true, which is a truth value found in LP. Furthermore, we still have our internally provable T-schema, which does not depend on the action condition, from which we can derive that if the child can prove (ā→G)∧¬(ā→¬G), then so can the parent. It is important to note that "most" (almost everything we are interested in) of PA⋆ is consistent without problem.

Now that I think about it, your action condition should be a requirement for paraconsistent agents, as otherwise they will be willing to do things that they can prove will not accomplish G. There may yet be a situation which breaks this, but I have not come across it.

Thanks for the link! I appreciate your write-ups. A few points:

1. As you've already noticed, your anti-Newcomb problem is an instance of Dr. Nick Bone's "problematic problems". Benja actually gave a formalism of the general class of problems in the context of provability logic in a recent forum post. We dub these problems "evil problems," and I'm not convinced that your XDT is a sane way to deal with evil problems.

For one thing, every decision theory has an evil problem. As shown in the links above, even if we consider "fair" games, there is always a probl

... (read more)
Vanessa Kosoy (9y):
Hi Nate, thx for commenting!

It seems to me this problem can be avoided by allowing access to random bits. See my reply to KnaveOfAllTrades and my reply to V_V. Formally, we should allow pi in (4') to be a random algorithm.

I don't think "logical causation" in the sense you are using here is the right way to think about the anti-Newcomb problem. From the precursor's point of view, there is no loss in utility due to choosing XDT over UDT.

Of course. I didn't attempt to formalize "fairness" at that post but the idea is approaching optimality for decision-determined problems in the sense of Yudkowsky 2010.

I realize that the logical expectation values I'm using are so far mostly wishful thinking. However, I think there is benefit in attacking the problems from both ends: understanding the usage of logical probabilities may shed light on the desiderata they should satisfy.

Consider two UDT agents A & B with identical utility functions living in different universes. Each of the agents is charged with making a certain decision, while receiving no input. If both agents are aware of each other's existence, we expect [in the sense of "hope" rather than "are able to prove" :)] them to make decisions that will maximize overall utility, even though, on the surface, each agent is only maximizing over its own decisions rather than the decisions of both agents. What is the difference between this scenario and the scenario of a single agent existing in both universes which receives a single bit of input that indicates in which universe the given copy is? See my reply to Wei Dai.

You're referring to the agent-simulates-predictor problem? Actually, I think my (4') may contain a clue for solving it. As I commented, the logical expectation values should only use about as much computing power as the precursor has rather than as much computing power as the successor has. Therefore, if the predictor is at least as strong as the precursor, the successor wins by choosing a policy
Benya Fallenstein (9y):
Want to echo Nate's points! One particular thing that I wanted to emphasize, which I think you can see as a thread on this forum (in particular, the modal UDT work is relevant), is that it's useful to make formal toy models where the math is fully specified, so that you can prove theorems about what exactly an agent would do (or, sometimes, write a program that figures it out for you). When you write out things that explicitly, then, for example, it becomes clearer that you need to assume that a decision problem is "fair" (extensional) to get certain results, as Nate points out (or if you don't assume it, someone else can look at your result and point out that it's not true as stated).

In your post, you're using "logical expectations" that condition on something being true, without defining exactly what all of this means, and as a result you can argue about what these agents will do, but not actually prove it; that's certainly a reasonable part of the research process, but I'd like to encourage you to turn your work into models that are fully specified, so that you can actually prove theorems about them.

Also, FYI, I tossed together reflective implementations of Solomonoff Induction and AIXI using Haskell, which you can find on the MIRI github. It's not very polished, but it typechecks.

We might be talking about different things when we talk about counterfactuals. Let me be more explicit:

Say an agent is playing against a copy of itself on the prisoner's dilemma. It must evaluate what happens if it cooperates, and what happens if it defects. To do so, it needs to be able to predict what the world would look like "if it took action A". That prediction is what I call a "counterfactual", and it's not always obvious how to construct one. (In the counterfactual world where the agent defects, is the action of its copy also set to 'defect', or is

... (read more)
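
(A toy sketch of the two constructions, with made-up payoff numbers: if the counterfactual sets the copy's action along with the agent's, cooperating looks best; if it holds the copy's action fixed, defecting dominates.)

# Prisoner's dilemma payoffs for the row player (hypothetical numbers).
PAYOFF = {("C", "C"): 2, ("C", "D"): 0, ("D", "C"): 3, ("D", "D"): 1}

def value_if_copy_mirrors(my_action):
    # Counterfactual in which the exact copy's action is set along with mine.
    return PAYOFF[(my_action, my_action)]

def value_if_copy_fixed(my_action, copy_action):
    # Counterfactual in which the copy's action is held at whatever it "was".
    return PAYOFF[(my_action, copy_action)]

assert value_if_copy_mirrors("C") > value_if_copy_mirrors("D")
assert all(value_if_copy_fixed("D", a) > value_if_copy_fixed("C", a) for a in ("C", "D"))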
Stuart Armstrong (9y):
I see why you think this gives CDT now! I wasn't meaning for this to be used for counterfactuals about the agent's own decision, but about an event (possibly a past event) that "could have" turned out some other way. The example was to replace the "press" with something more unhackable.

Patrick and I discussed something like this at a previous MIRIx. I think the big problem is that (if I understand what you're suggesting) it basically just implements CDT.

For example, in Newcomb's problem, if X=1 implies Omega is correct and X=0 implies the agent won't necessarily act as predicted, and it acts conditioned on X=0, then it will twobox.
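
(A toy illustration of that failure mode, with hypothetical payoff numbers: once the X=0 branch treats Omega's prediction as independent of the chosen action, two-boxing has the higher expected payoff for any credence that the opaque box is full, which is just the CDT answer.)

# Hypothetical Newcomb payoffs (numbers are mine, not from the discussion).
FULL_BOX = 1_000_000   # opaque box, filled iff Omega predicted one-boxing
SMALL_BOX = 1_000      # transparent box, always present

def expected_payoff(action, p_full):
    # Under the X=0 counterfactual, the chance the opaque box is full is
    # treated as independent of the action actually taken.
    return p_full * FULL_BOX + (SMALL_BOX if action == "two-box" else 0)

for p_full in (0.01, 0.5, 0.99):
    assert expected_payoff("two-box", p_full) > expected_payoff("one-box", p_full)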

Stuart Armstrong (9y):
I'm not sure I understand this. The example I was thinking of was instead of eg conditioning on "the button wasn't pressed" in corrigibility, you have corrigibility only implemented if the button is pressed AND X=1. Then the counterfactual is just X=0. Is there a CDT angle to that?

Yeah, causation in logical uncertainty land would be nice. It wouldn't necessarily solve the whole problem, though. Consider the scenario

# Hi/Med/Low are assumed here to be distinct tokens, so the snippet runs as-is.
Hi, Med, Low = "Hi", "Med", "Low"
outcomes = [3, 2, 1, None]
strategies = {Hi, Med, Low}
A = lambda: Low               # the agent's choice of strategy
h = lambda: Hi
m = lambda: Med
l = lambda: Low
payoffs = {}
payoffs[h()] = 3
payoffs[m()] = 2
payoffs[l()] = 1
E = lambda: payoffs.get(A())  # E() evaluates to 1 as written

Now it's pretty unclear that (lambda: Low)()==Hi should logically cause E()=3.

When considering (lambda: Low)()==Hi, do we want to change l without A, A without l, or both? These correspond to answers None, 3, and

... (read more)
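
(Working the three readings through on the toy program above, as a sketch: forcing l to return Hi while leaving A alone, forcing A while leaving l alone, and forcing both give E() = None, 3, and 1 respectively.)

# Re-run the toy program under each reading of "(lambda: Low)() == Hi".
def E_with(A_returns, l_returns):
    Hi, Med, Low = "Hi", "Med", "Low"
    A = lambda: A_returns
    h = lambda: Hi
    m = lambda: Med
    l = lambda: l_returns
    payoffs = {}
    payoffs[h()] = 3
    payoffs[m()] = 2
    payoffs[l()] = 1            # clobbers payoffs[Hi] whenever l() == Hi
    return payoffs.get(A())

assert E_with("Low", "Hi") is None   # change l but not A: Low never gets a payoff
assert E_with("Hi", "Low") == 3      # change A but not l
assert E_with("Hi", "Hi") == 1       # change both: payoffs[Hi] was overwritten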

Typos & syntax complaints:

and let’s consider an oracle to be a function O, such that O(┌M┐, p) specifies the probability that the oracle will return "true" when invoked on the pair (┌M┐, p)

Confusing notation. In the paragraph above, O(┌M┐, p) had the type of the oracle's output (it "returns 'true' if..."), not a probability.

We want to find an oracle machine that will output if , output if , and output either or if the expectations are equal.

Should be "And output either or if the expectations are equal", presumably.

[Will edit this post as I fi

... (read more)
0Benya Fallenstein9y
Fixed the typo, thanks! I considered describing the probabilities of the oracle returning "true" by a different function τ(┌M┐,p), but it seemed too pedantic to have a different letter. Maybe that's wrong, but it still feels too pedantic. If I do things that way I probably shouldn't be writing "O(┌M┐,p) returns 'true' if...", though...