David Xu

Ngo and Yudkowsky on AI capability gains

From my (dxu's) perspective, it's allowable for there to be "deep fundamental theories" such that, once you understand those theories well enough, you lose the ability to imagine coherent counterfactual worlds where the theories in question are false.

To use thermodynamics as an example: the first law of thermodynamics (conservation of energy) is actually a consequence of Noether's theorem, which ties conserved quantities in physics to symmetries in physical laws (conservation of energy corresponding, in particular, to the time-translation symmetry of those laws). Before someone becomes aware of this, it's perhaps possible for them to imagine a universe exactly like our own, except that energy is not conserved; once they understand the connection implied by Noether's theorem, this becomes an incoherent notion: you cannot remove the conservation-of-energy property without changing deep aspects of the laws of physics.
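(To make that connection concrete, here is the standard textbook statement, included as a refresher rather than as anything specific to this dialogue: for a system whose Lagrangian $L(q, \dot{q})$ has no explicit time dependence, the Euler-Lagrange equations imply

$$\frac{dH}{dt} = -\frac{\partial L}{\partial t} = 0, \qquad \text{where } H \equiv \sum_i \dot{q}_i \frac{\partial L}{\partial \dot{q}_i} - L$$

is the energy. That is, "energy is conserved" and "the laws of physics look the same at every time" are the same fact stated two ways; you cannot delete one while keeping the other.)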

The second law of thermodynamics is similarly deep: it's actually a consequence of there being a (low-entropy) boundary condition at the beginning of the universe, but no corresponding (low-entropy) boundary condition at any future state. This asymmetry in boundary conditions is what causes entropy to appear directionally increasing--and again, once someone becomes aware of this, it is no longer possible for them to imagine living in a universe which started out in a very low-entropy state, but where the second law of thermodynamics does not hold.
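(A minimal toy simulation, my own illustration rather than anything from the dialogue, may help make this vivid: the update rule below has no built-in arrow of time, and the only asymmetry anywhere in the setup is the low-entropy starting configuration; coarse-grained entropy nonetheless climbs, on average, toward its maximum.)

```python
import math
import random

# Toy setup: N particles, each in the left or right half of a box. Each step
# picks a random particle and re-randomizes which half it is in; the rule is
# indifferent to the direction of time. The only asymmetry is the initial
# condition: every particle starts on the left (a very low-entropy macrostate).

def coarse_entropy(n_left, n_total):
    """Boltzmann-style entropy: log of the number of microstates compatible
    with the macrostate 'n_left particles on the left'."""
    return (math.lgamma(n_total + 1)
            - math.lgamma(n_left + 1)
            - math.lgamma(n_total - n_left + 1))

def simulate(n_particles=1000, steps=5001, seed=0):
    rng = random.Random(seed)
    sides = [0] * n_particles          # 0 = left, 1 = right; all-left start
    n_left = n_particles
    for t in range(steps):
        i = rng.randrange(n_particles)
        new_side = rng.randrange(2)    # time-symmetric update
        if sides[i] != new_side:
            n_left += 1 if new_side == 0 else -1
            sides[i] = new_side
        if t % 1000 == 0:
            print(f"step {t:5d}: entropy = {coarse_entropy(n_left, n_particles):.1f}")

simulate()
```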

In other words, thermodynamics as a "deep fundamental theory" is not merely [what you characterized as] a "powerful abstraction that is useful in a lot of domains". Thermodynamics is a logically necessary consequence of existing, more primitive notions--and the fact that (historically) we arrived at our understanding of thermodynamics via a substantially longer route (involving heat engines and the like), without noticing this deep connection until much later on, does not change the fact that grasping said deep connection allows one to see "at a glance" why the laws of thermodynamics inevitably follow.

Of course, this doesn't imply infinite certainty, but it does imply a level of certainty substantially higher than what would be assigned merely to a "powerful abstraction that is useful in a lot of domains". So the relevant question would seem to be: given my above-described epistemic state, how might one convince me that the case for thermodynamics is not as airtight as I currently think it is? I think there are essentially two angles of attack: (1) convince me that the arguments for thermodynamics being a logically necessary consequence of the laws of physics are somehow flawed, or (2) convince me that the laws of physics don't have the properties I think they do.

Both of these are hard to do, however--and for good reason! And absent arguments along those lines, I don't think I am (or should be) particularly moved by [what you characterized as] philosophy-of-science-style objections about "advance predictions", "systematic biases", and the like. I think there are certain theories for which the object-level case is strong enough that it more or less screens off meta-level objections; and I think this is right, and good.

Which is to say:

The mental move I'm doing for each of these examples is not imagining universes where addition/evolution/other deep theory is wrong, but imagining phenomena/problems where addition/evolution/other deep theory is not adapted. If you're describing something that doesn't commute, addition might be a deep theory, but it's not useful for what you want. Similarly, you could argue that given how we're building AIs and trying to build AGI, evolution is not the deep theory that you want to use. (emphasis mine)

I think you could argue this, yes--but the crucial point is that you have to actually argue it. You have to (1) highlight some aspect of the evolutionary paradigm, (2) point out [what appears to you to be] an important disanalogy between that aspect and [what you expect cognition to look like in] AGI, and then (3) argue that that disanalogy directly undercuts the reliability of the conclusions you would like to contest. In other words, you have to do things the "hard way"--no shortcuts.

...and the sense I got from Richard's questions in the post (as well as the arguments you made in this subthread) is one that very much smells like a shortcut is being attempted. This is why I wrote, in my other comment, that

I don't think I have a good sense of the implied objections contained within Richard's model. That is to say: I don't have a good handle on the way(s) in which Richard expects expected utility theory to fail, even conditioning on Eliezer being wrong about the theory being useful. I think this is important because--absent a strong model of expected utility theory's likely failure modes--I don't think questions of the form "but why hasn't your theory made a lot of successful advance predictions yet?" move me very much on the object level.

I think I share Eliezer's sense of not really knowing what Richard means by "deep fundamental theory" or "wide range of applications we hadn't previously thought of", and I think what would clarify this for me would have been for Richard to provide examples of "deep fundamental theories [with] a wide range of applications we hadn't previously thought of", accompanied by an explanation of why, if those applications hadn't been present, that would have indicated something wrong with the theory.

Ngo and Yudkowsky on AI capability gains

Speaking from my own perspective: I definitely had a sense, reading through that section of the conversation, that Richard's questions were somewhat... skewed? ... relative to the way I normally think about the topic. I'm having some difficulty articulating the source of that skewness, so I'll start by talking about how I think the skewness relates to the conversation itself:

I interpreted Eliezer's remarks as basically attempting to engage with Richard's questions on the same level they were being asked--but I think his lack of ability to come up with compelling examples (to be clear: by "compelling" here I mean "compelling to Richard") likely points at a deeper source of disagreement (which may or may not be the same generator as the "skewness" I noticed). And if I were forced to articulate the thing I think the generator might be...

I don't think I have a good sense of the implied objections contained within Richard's model. That is to say: I don't have a good handle on the way(s) in which Richard expects expected utility theory to fail, even conditioning on Eliezer being wrong about the theory being useful. I think this is important because--absent a strong model of expected utility theory's likely failure modes--I don't think questions of the form "but why hasn't your theory made a lot of successful advance predictions yet?" move me very much on the object level.

Probing more at the sense of skewness, I'm getting the sense that this exchange here is deeply relevant:

Richard: I'm accepting your premise that it's something deep and fundamental, and making the claim that deep, fundamental theories are likely to have a wide range of applications, including ones we hadn't previously thought of.

Do you disagree with that premise, in general?

Eliezer: I don't know what you really mean by "deep fundamental theory" or "wide range of applications we hadn't previously thought of", especially when it comes to structures that are this simple. It sounds like you're still imagining something I mean by Expected Utility which is some narrow specific theory like a particular collection of gears that are appearing in lots of places.

I think I share Eliezer's sense of not really knowing what Richard means by "deep fundamental theory" or "wide range of applications we hadn't previously thought of", and I think what would clarify this for me would have been for Richard to provide examples of "deep fundamental theories [with] a wide range of applications we hadn't previously thought of", accompanied by an explanation of why, if those applications hadn't been present, that would have indicated something wrong with the theory.

But the reason I'm calling the thing "skewness", rather than something more prosaic like "disagreement", is that I suspect Richard isn't actually operating from a frame where he can produce the thing I asked for in the previous paragraphs (a strong model of where expected utility is likely to fail, a strong model of how a lack of "successful advance predictions"/"wide applications" corresponds to those likely failure modes, etc.). I suspect that the frame Richard is operating in would dismiss these questions as largely inconsequential, even though I'm not sure why or what that frame actually looks like; this is a large part of the reason why I have this flagged as a place to look for a deep hidden crux.

(One [somewhat uncharitable] part of me wants to point out that the crux in question may actually just be the "usual culprit" in discussions like this: outside-view/modest-epistemology-style reasoning. This does seem to rhyme a lot with what I wrote above, e.g. it would explain why Richard didn't seem particularly concerned with gears-level failure modes or competing models or the like, and why his line of questioning seemed mostly insensitive to the object-level details of what "advance predictions" look like, why that matters, etc. I do note that Richard actively denied being motivated by this style of reasoning later on in the dialogue, however, which is why I still have substantial uncertainty about his position.)

Ngo and Yudkowsky on AI capability gains

Like, there's a certain kind of theory/model which generalizes well to many classes of new cases and makes nontrivial predictions in those new cases, and those kinds-of-theories/models have a pattern to them which is recognizable.

Could I ask you to say more about what you mean by "nontrivial predictions" in this context? It seems to me like this was a rather large sticking point in the discussion between Richard and Eliezer (that is, the question of whether expected utility theory--as a specific candidate for a "strongly generalizing theory"--produces "nontrivial predictions", where it seemed like Eliezer leaned "yes" and Richard leaned "no"), so I'd be interested in hearing more takes on what constitutes "nontrivial predictions", and what role said (nontrivial) predictions play in making a theory more convincing (as compared to other factors such as e.g. elegance/parsimony/[the pattern John talks about which is recognizable]).

Of course, I'd be interested in hearing what Richard thinks of the above as well.
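For what it's worth, the concrete thing I usually reach for when I hear "nontrivial prediction" in the context of expected utility theory is the money-pump/dominance argument: an agent whose preferences are cyclic--and hence representable by no utility function--can be led around the cycle for a small fee each step, ending up strictly worse off by its own lights. Here is a minimal toy sketch of that, my own illustration with made-up items and numbers:

```python
# Toy money pump: an agent with cyclic preferences A > B > C > A pays a small
# fee for each "upgrade" around the cycle, ending where it started but poorer.
# All names and numbers here are invented for illustration.

FEE = 1

# (x, y) in PREFERS means the agent strictly prefers x to y.
PREFERS = {("A", "B"), ("B", "C"), ("C", "A")}

def accepts(current, offered):
    """The agent trades (and pays the fee) whenever it prefers the offered item."""
    return (offered, current) in PREFERS

def run_pump(start="B", rounds=9):
    next_offer = {"B": "A", "A": "C", "C": "B"}  # always offer the preferred item
    item, money = start, 0
    for _ in range(rounds):
        offer = next_offer[item]
        if accepts(item, offer):
            item, money = offer, money - FEE
    return item, money

item, money = run_pump()
print(f"after 9 trades the agent holds {item} again and has paid {-money}")
```

Whether that kind of dominance/exploitability result counts as a "nontrivial prediction" in Richard's sense, or as something weaker, seems to me to be close to the thing he and Eliezer were actually disagreeing about.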

A positive case for how we might succeed at prosaic AI alignment

[Note: Still speaking from my Eliezer model here, in the sense that I am making claims which I do not myself necessarily endorse (though naturally I don't anti-endorse them either, or else I wouldn't be arguing them in the first place). I want to highlight here, however, that to the extent that the topic of the conversation moves further away from things I have seen Eliezer talk about, the more I need to guess about what I think he would say, and at some point I think it is fair to describe my claims as neither mine nor (any model of) Eliezer's, but instead something like my extrapolation of my model of Eliezer, which may not correspond at all to what the real Eliezer thinks.]

> To the extent that it does successfully reason about those things in a non-myopic way, in what remaining sense is the optimizer myopic?

The sense that it's still myopic is in the sense that it's non-deceptive, which is the only sense that we actually care about.

> it ends up imitating X in full (non-myopic) generality, in which case you have not managed to achieve any kind of safety improvement over X proper

The safety improvement that I'm claiming is that it wouldn't be deceptive. What is the mechanism by which you think a myopic agent would end up acting deceptively?

If the underlying process your myopic agent was trained to imitate would (under some set of circumstances) be incentivized to deceive you, and the myopic agent (by hypothesis) imitates the underlying process to sufficient resolution, why would the deceptive behavior of the underlying process not be reflected in the behavior of the myopic agent?

Conversely, if the myopic agent does not learn to imitate the underlying process to sufficient resolution that unwanted behaviors like deception start carrying over, then it is very likely that the powerful consequentialist properties of the underlying process have not been carried over, either. This is because (on my extrapolation of Eliezer's model) deceptive behavior, like all other instrumental strategies, arises from consequentialist reasoning, and is deeply tied to such reasoning in a way that is not cleanly separable—which is to say, by default, you do not manage to sever one without also severing the other.

Again, I (my model of Eliezer) don't think the "deep tie" in question is necessarily unbreakable; perhaps there is some sufficiently clever method which, if used, would successfully filter out the "unwanted" instrumental behavior ("deception", in your terminology) from the "wanted" instrumental behavior (planning, coming up with strategies, in general being an effective agent in the real world). But this distinction between "wanted" and "unwanted" is not a natural distinction; it is, in fact, a distinction highly entangled with human concepts and human values, and any "filter" that selects based on said distinction will need to be of similar complexity. (Of identical complexity, in fact, to the whole alignment problem.) "Simple" filters like the thing you are calling "myopia" definitely do not suffice to perform this function.

I'd be interested in hearing which aspect(s) of the above model you disagree with, and why.

A positive case for how we might succeed at prosaic AI alignment

It still doesn't seem to me like you've sufficiently answered the objection here.

I tend to think that HCH is not dangerous, but I agree that it's likely insufficiently capable. To solve that problem, we have to go to a myopic objective that is more powerful.

What if any sufficiently powerful objective is non-myopic? Or, on a different-but-equivalent phrasing: what if myopia is a property only of very specific toy objectives, rather than a widespread property of objectives in general (including objectives that humans would intuitively consider to be aimed at accomplishing things "in the real world")?

It seems to me that Eliezer has presented quite compelling arguments that the above is the case, and on a first pass it doesn't look to me like you've countered those arguments.

But that's not that hard, and there's lots of them that can incentivize good non-myopic behavior that are safe to optimize for as long as the optimizer is myopic.

How does a "myopic optimizer" successfully reason about problems that require non-myopic solutions, i.e. solutions whose consequences extend past whatever artificial time-frame the optimizer is being constrained to reason about? To the extent that it does successfully reason about those things in a non-myopic way, in what remaining sense is the optimizer myopic?

AI safety via market making is one example, but it's a very tricky one, so maybe not the best candidate for showcasing what I mean. In particular, I suspect that a myopic optimizer given the goal of acting as a trader or market-maker in such a setup wouldn't act deceptively, though I suspect they would Goodhart on the human approval signal in unsafe ways (which is less bad of a problem than deception, and could potentially be solved via something like my step (6), but still a pretty serious problem).

Maybe a better example would be something like imitative generalization. If imitating HCH is insufficient, we can push further by replacing “imitate HCH” with “output the hypothesis which maximizes HCH's prior times the hypothesis's likelihood,” which gets you substantially farther and I think is still safe to optimize for given a myopic optimizer (though neither are safe for a non-myopic optimizer).

Both of these seem to be examples of solutions that simply push the problem back a step, rather than seeking to eliminate it directly. My model of Eliezer would call this attempting to manipulate confusion, and caution that, although adding more gears to your perpetual motion machine might make the physics-violating component harder to pick out, it does not change the fact that somewhere within the model is a step that violates physics.

In this case, it seems as though all of your proposals are of the form "Train your model to imitate some process X (where X is non-myopic and potentially unsafe), while adding incentives in favor of myopic behavior during training." To which my model of Eliezer replies, "Either your model will end up myopic, and not powerful enough to capture the part of X that actually does the useful work we are interested in, or it ends up imitating X in full (non-myopic) generality, in which case you have not managed to achieve any kind of safety improvement over X proper."

It seems to me that to usefully refute this, you need to successfully argue against Eliezer's background premise here—the one about power and non-myopic reasoning going hand-in-hand in a deep manner that, while perhaps circumventable via similarly deep insights, is not patchable via shallow methods like "Instead of directly using dangerous process X, we will imitate X, thereby putting an extra layer of abstraction between ourselves and the danger." My current impression is that you have not been arguing against this background premise at all, and as such I don't think your arguments hit at the core of what makes Eliezer doubt your proposals.
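To make the background premise more concrete, here is the toy picture I have in mind when contrasting a "myopic" objective with a "non-myopic" one; it is purely my own illustrative sketch, with invented states and rewards, not anything Evan or Eliezer has written down. The point is just that any task whose payoff routes through later consequences is precisely the kind of task a strictly per-step objective fails to capture, which is one way of seeing why "power" and "non-myopic reasoning" look deeply tied together on my model of Eliezer's view.

```python
# A tiny two-step environment where the action that looks best "myopically"
# (maximizing only this step's reward) forecloses a larger reward later.
# Everything here is invented for illustration.

# state -> action -> (immediate reward, next state)
ENV = {
    "start": {"grab": (5, "trap"), "wait": (0, "setup")},
    "setup": {"grab": (10, "done"), "wait": (0, "done")},
    "trap":  {"wait": (0, "done")},
    "done":  {},
}

def myopic_policy(state):
    """Pick whichever action has the highest immediate reward."""
    return max(ENV[state], key=lambda a: ENV[state][a][0])

def farsighted_policy(state):
    """Pick the action maximizing total return, via exhaustive lookahead."""
    def value(s):
        return max((r + value(s2) for r, s2 in ENV[s].values()), default=0)
    return max(ENV[state], key=lambda a: ENV[state][a][0] + value(ENV[state][a][1]))

def rollout(policy):
    state, total = "start", 0
    while ENV[state]:
        reward, state = ENV[state][policy(state)]
        total += reward
    return total

print("myopic return:    ", rollout(myopic_policy))      # 5  (grabs early, lands in the trap)
print("farsighted return:", rollout(farsighted_policy))  # 10 (waits, then collects)
```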

Discussion with Eliezer Yudkowsky on AGI interventions

So, the point of my comments was to draw a contrast between having a low opinion of "experimental work and not doing only decision theory and logic", and having a low opinion of "mainstream ML alignment work, and of nearly all work outside the HRAD-ish cluster of decision theory, logic, etc." I didn't intend to say that the latter is obviously-wrong; my goal was just to point out how different those two claims are, and say that the difference actually matters, and that this kind of hyperbole (especially when it never gets acknowledged later as 'oh yeah, that's not true and wasn't what I meant') is not great for discussion.

It occurs to me that part of the problem may be precisely that Adam et al. don't think there's a large difference between these two claims (that actually matters). For example, when I query my (rough, coarse-grained) model of [your typical prosaic alignment optimist], the model in question responds to your statement with something along these lines:

If you remove "mainstream ML alignment work, and nearly all work outside of the HRAD-ish cluster of decision theory, logic, etc." from "experimental work", what's left? Perhaps there are one or two (non-mainstream, barely-pursued) branches of "experimental work" that MIRI endorses and that I'm not aware of—but even if so, that doesn't seem to me to be sufficient to justify the idea of a large qualitative difference between these two categories.

In a similar vein to the above: perhaps one description is (slightly) hyperbolic and the other isn't. But I don't think replacing the hyperbolic version with the non-hyperbolic version would substantially change my assessment of MIRI's stance; the disagreement feels non-cruxy to me. In light of this, I'm not particularly bothered by either description, and it's hard for me to understand why you view it as such an important distinction.

Moreover: I don't think [my model of] the prosaic alignment optimist is being stupid here. I think, to the extent that his words miss an important distinction, it is because that distinction is missing from his very thoughts and framing, not because he happened to choose his words somewhat carelessly when attempting to describe the situation. Insofar as this is true, I expect him to react to your highlighting of this distinction with (mostly) bemusement, confusion, and possibly even some slight suspicion (e.g. that you're trying to muddy the waters with irrelevant nitpicking).

To be clear: I don't think you're attempting to muddy the waters with irrelevant nitpicking here. I think you think the distinction in question is important because it's pointing to something real, true, and pertinent—but I also think you're underestimating how non-obvious this is to people who (A) don't already deeply understand MIRI's view, and (B) aren't in the habit of searching for ways someone's seemingly pointless statement might actually be right.

I don't consider myself someone who deeply understands MIRI's view. But I do want to think of myself as someone who, when confronted with a puzzling statement [from someone whose intellectual prowess I generally respect], searches for ways their statement might be right. So, here is my attempt at describing the real crux behind this disagreement:

(with the caveat that, as always, this is my view, not Rob's, MIRI's, or anybody else's)

(and with the additional caveat that, even if my read of the situation turns out to be correct, I think in general the onus is on MIRI to make sure they are understood correctly, rather than on outsiders to try to interpret them—at least, assuming that MIRI wants to make sure they're understood correctly, which may not always be the best use of researcher time)

I think the disagreement is mostly about MIRI's counterfactual behavior, not about their actual behavior. I think most observers (including both Adam and Rob) would agree that MIRI leadership has been largely unenthusiastic about a large class of research that currently falls under the umbrella "experimental work", and that the amount of work in this class MIRI has been unenthused about significantly outweighs the amount of work they have been excited about.

Where I think Adam and Rob diverge is in their respective models of the generator of this observed behavior. I think Adam (and those who agree with him) thinks that the true boundary of the category [stuff MIRI finds unpromising] roughly coincides with the boundary of the category [stuff most researchers would call "experimental work"], such that anything that comes too close to "running ML experiments and seeing what happens" will be met with an immediate dismissal from MIRI. In other words, [my model of] Adam thinks MIRI's generator is configured such that the ratio of "experimental work" they find promising-to-unpromising would be roughly the same across many possible counterfactual worlds, even if each of those worlds is doing "experiments" investigating substantially different hypotheses.

Conversely, I think Rob thinks the true boundary of the category [stuff MIRI finds unpromising] is mostly unrelated to the boundary of the category [stuff most researchers would call "experimental work"], and that—to the extent MIRI finds most existing "experimental work" unpromising—this is mostly because the existing work is not oriented along directions MIRI finds promising. In other words, [my model of] Rob thinks MIRI's generator is configured such that the ratio of "experimental work" they find promising-to-unpromising would vary significantly across counterfactual worlds where researchers investigate different hypotheses; in particular, [my model of] Rob thinks MIRI would find most "experimental work" highly promising in the world where the "experiments" being run are those whose results Eliezer/Nate/etc. would consider difficult to predict in advance, and which would therefore convey useful information regarding the shape of the alignment problem.

I think Rob's insistence on maintaining the distinction between having a low opinion of "experimental work and not doing only decision theory and logic", and having a low opinion of "mainstream ML alignment work, and of nearly all work outside the HRAD-ish cluster of decision theory, logic, etc." is in fact an attempt to gesture at the underlying distinction outlined above, and I think that his stringency on this matter makes significantly more sense in light of this. (Though, once again, I note that I could be completely mistaken in everything I just wrote.)

Assuming, however, that I'm (mostly) not mistaken, I think there's an obvious way forward in terms of resolving the disagreement: try to convey the underlying generators of MIRI's worldview. In other words, do the thing you were going to do anyway, and save the discussions about word choice for afterwards.

Discussion with Eliezer Yudkowsky on AGI interventions

Thanks for elaborating. I don't think I have the necessary familiarity with the alignment research community to assess your characterization of the situation, but I appreciate your willingness to raise potentially unpopular hypotheses to attention. +1

Discussion with Eliezer Yudkowsky on AGI interventions

Similarly, the fact that they kept at it over and over through all the big improvements of DL, instead of trying to adapt to prosaic alignment, sounds like evidence that they might be over-attached to a specific framing, which they had trouble discarding.

I'm... confused by this framing? Specifically, this bit (as well as other bits like these)

I have to explain again and again to stressed-out newcomers that you definitely don’t need to master model theory or decision theory to do alignment, and try to steer them towards problems and questions that look like they’re actually moving the ball instead of following the lead of the “figure of authority”.

Some of the brightest and first thinkers on alignment have decided to follow their own nerd-sniping and call everyone else fakers, and when they realized they were not actually making progress, they didn’t switch to something else as much as declare everyone was still full of it.

Also, I don’t know how much is related to mental health and pessimism and depression (which I completely understand can color one’s view of the world), but I would love to see the core MIRI team and EY actually try solving alignment with neural nets and prosaic AI. Starting with all their fears and caveats, sure, but then be like “fuck it, let’s just find a new way of grappling it”.

seem to be coming at the problem with [something like] a baked-in assumption that prosaic alignment is something that Actually Has A Chance Of Working?

And, like, to be clear, obviously if you're working on prosaic alignment that's going to be something you believe[1]. But it seems clear to me that EY/MIRI does not share this viewpoint, and all the disagreements you have regarding their treatment of other avenues of research seem to me to be logically downstream of this disagreement?

I mean, it's possible I'm misinterpreting you here. But you're saying things that (from my perspective) only make sense with the background assumption that "there's more than one game in town"--things like "I wish EY/MIRI would spend more time engaging with other frames" and "I don't like how they treat lack of progress in their frame as evidence that all other frames are similarly doomed"--and I feel like all of those arguments simply fail in the world where prosaic alignment is Actually Just Doomed, all the other frames Actually Just Go Nowhere, and conceptual alignment work of the MIRI variety is (more or less) The Only Game In Town.

To be clear: I'm pretty sure you don't believe we live in that world. But I don't think you can just export arguments from the world you think we live in to the world EY/MIRI thinks we live in; there needs to be a bridging step first, where you argue about which world we actually live in. I don't think it makes sense to try and highlight the drawbacks of someone's approach when they don't share the same background premises as you, and the background premises they do hold imply a substantially different set of priorities and concerns.

Another thing it occurs to me your frustration could be about is the fact that you can't actually argue this with EY/MIRI directly, because they don't frequently make themselves available to discuss things. And if something like that's the case, then I guess what I want to say is... I sympathize with you abstractly, but I think your efforts are misdirected? It's okay for you and other alignment researchers to have different background premises from MIRI or even each other, and for you and those other researchers to be working on largely separate agendas as a result? I want to say that's kind of what foundational research work looks like, in a field where (to a first approximation) nobody has any idea what the fuck they're doing?

And yes, in the end [assuming somebody succeeds] that will likely mean that a bunch of people's research directions were ultimately irrelevant. Most people, even. That's... kind of unavoidable? And also not really the point, because you can't know which line of research will be successful in advance, so all you have to go on is your best guess, which... may or may not be the same as somebody else's best guess?

I dunno. I'm trying not to come across as too aggressive here, which is why I'm hedging so many of my claims. To some extent I feel uncomfortable trying to "police" people's thoughts here, since I'm not actually an alignment researcher... but also it felt to me like your comment was trying to police people's thoughts, and I don't actually approve of that either, so...

Yeah. Take this how you will.


[1] I personally am (relatively) agnostic on this question, but as a non-expert in the field my opinion should matter relatively little; I mention this merely as a disclaimer that I am not necessarily on board with EY/MIRI about the doomed-ness of prosaic alignment.

Discussion with Eliezer Yudkowsky on AGI interventions

Eliezer Yudkowsky

Alpha Zero scales with more computing power, I think AlphaFold 2 scales with more computing power, Mu Zero scales with more computing power. Precisely because GPT-3 doesn't scale, I'd expect an AGI to look more like Mu Zero and particularly with respect to the fact that it has some way of scaling.

I think this view dovetails quite strongly with the view expressed in this comment by maximkazhenkov:

Progress in model-based RL is far more relevant to getting us closer to AGI than other fields like NLP or image recognition or neuroscience or ML hardware. I worry that once the research community shifts its focus towards RL, the AGI timeline will collapse - not necessarily because there are no more critical insights left to be discovered, but because it's fundamentally the right path to work on and whatever obstacles remain will buckle quickly once we throw enough warm bodies at them. I think - and this is highly controversial - that the focus on NLP and Vision Transformer has served as a distraction for a couple of years and actually delayed progress towards AGI.

If curiosity-driven exploration gets thrown into the mix and Starcraft/Dota gets solved (for real this time) with comparable data efficiency as humans, that would be a shrieking fire alarm to me (but not to many other people I imagine, as "this has all been done before").

I was definitely one of those on the Transformer "train", so to speak, after the initial GPT-3 preprint, and I think this was further reinforced by gwern's highlighting of the scaling hypothesis--which, while not necessarily unique to Transformers, was framed using Transformers and GPT-3 as relevant examples, in a way that suggested [to me] that Transformers themselves might scale to AGI.

After thinking about this more, and looking further into EfficientZero, MuZero, and related work in model-based reinforcement learning, I think I have updated in favor of the view espoused by Eliezer and maximkazhenkov: model-based RL (and RL-type training in general, which concerns dynamics and action spaces in a way that sequence prediction simply does not) seems more likely to me to produce AGI than a scaled-up version of a sequence predictor.

(I still have a weak belief that Transformers and their ilk may be capable of producing AGI, but if so I think it will probably be a substantially longer, harder path than RL-based systems, and will probably involve much more work than just throwing more compute at models.)
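To gesture at the structural difference I have in mind between "sequence prediction" and "training that concerns dynamics and action spaces", here is a deliberately crude, self-contained toy; it is my own illustration, not a description of GPT-3, MuZero, or any real system. The first learner is scored only on guessing the next element of a passively observed stream; the second has an action space and a dynamics model, and picks its own actions by searching over imagined futures.

```python
# Toy contrast; everything here is invented for illustration.
# The "world" is an integer state; actions +1/-1 move it; the goal state is 3.

GOAL = 3

def transition(state, action):
    return state + action

# (1) Sequence prediction: the learner passively observes a stream of states
# produced by someone else's behavior and is scored on guessing the next one.
def prediction_accuracy(history):
    guesses = [prev + 1 for prev in history[:-1]]    # learned regularity: "states count up"
    return sum(g == actual for g, actual in zip(guesses, history[1:])) / (len(history) - 1)

# (2) Model-based control: the learner has a dynamics model (here, the true
# `transition` function) plus an action space, and searches over its own actions.
def plan(state, depth=4):
    """Depth-limited search over action sequences that reach the goal."""
    if state == GOAL:
        return []
    if depth == 0:
        return None
    for action in (+1, -1):
        rest = plan(transition(state, action), depth - 1)
        if rest is not None:
            return [action] + rest
    return None

demo_stream = [0, 1, 2, 3, 4]
print("next-state prediction accuracy:", prediction_accuracy(demo_stream))
print("action plan found from state 0:", plan(0))
```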


Also, here is a relevant tweet from Eliezer:

I don't think it scales to superintelligence, not without architectural changes that would permit more like a deliberate intelligence inside to do the predictive work. You don't get things that want only the loss function, any more than humans explicitly want inclusive fitness.

(in response to)

what's the primary risk you would expect from a GPT-like architecture scaled up to the regime of superintelligence? (Obviously such architectures are unaligned, but they're also not incentivized by default to take action in the real world.)

Can you control the past?

The output of this process is something people have taken to calling Son-of-CDT; the problem (insofar as we understand Son-of-CDT well enough to talk about its behavior) is that the resulting decision theory continues to neglect correlations that existed prior to self-modification.

(In your terms: Alice and Bob would only one-box in Newcomb variants where Omega based his prediction on them after they came up with their new decision theory; Newcomb variants where Omega's prediction occurred before they had their talk would still be met with two-boxing, even if Omega is stipulated to be able to predict the outcome of the talk.)

This still does not seem like particularly sane behavior, which means, unfortunately, that there's no real way for a CDT agent to fix itself: it was born with too dumb of a prior for even self-modification to save it.
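To put toy numbers on the Alice-and-Bob point above: the sketch below uses the standard Newcomb payoffs plus an arbitrarily chosen 99% predictor accuracy, and is purely my own illustrative arithmetic. An agent that treats the prediction as a fixed background fact finds that two-boxing "wins" by exactly $1,000 no matter what probability it assigns to the big box being full, while an agent that respects the pre-existing correlation between the prediction and its own choice computes a far higher expected value for one-boxing. That gap is the money Son-of-CDT leaves on the table for any correlation that predates its self-modification.

```python
# Toy Newcomb arithmetic (standard payoffs; the 0.99 accuracy is an arbitrary
# illustrative choice). Box A always holds $1,000; box B holds $1,000,000 iff
# the predictor predicted one-boxing.

SMALL, BIG, ACCURACY = 1_000, 1_000_000, 0.99

def ev_respecting_correlation(action):
    """Expected value when the prediction is treated as correlated with the
    agent's actual choice (a correlation that predates any self-modification)."""
    p_full = ACCURACY if action == "one-box" else 1 - ACCURACY
    return (SMALL if action == "two-box" else 0) + p_full * BIG

def ev_treating_prediction_as_fixed(action, p_full):
    """Expected value when the prediction is treated as a fixed background fact,
    causally independent of the current choice (the CDT-style calculation)."""
    return (SMALL if action == "two-box" else 0) + p_full * BIG

print("respecting the pre-existing correlation:")
for a in ("one-box", "two-box"):
    print(f"  {a:8s}: ${ev_respecting_correlation(a):,.0f}")

print("treating the prediction as fixed (two-boxing 'wins' by $1,000 for any p):")
for a in ("one-box", "two-box"):
    print(f"  {a:8s}: ${ev_treating_prediction_as_fixed(a, p_full=0.5):,.0f}")
```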
