Dagon

Just this guy, you know?

Comments

Safer sandboxing via collective separation

Depending on your threat modeling of a given breach, this could be comforting or terrifying.

The economic incentives to attack and to defend are usually similar. Systems get broken sometimes but not always.

If the cost of a loss (AGI escapes, takes over the world, and runs it worse than humans do) is much higher, that changes the "economic incentives" around this. It implies that "sometimes but not always" is a very dangerous equilibrium. If the cost of a loss (AGI has a bit more influence on the outside world, but doesn't actually destroy much) is more in line with today's incentives, it's a fine thing.

What counts as defection?

It's worth being careful to acknowledge that this set of assumptions is far more limited than the game-theoretical underpinnings. Because it requires interpersonal utility summation, you can't normalize in the same ways, and you need to do a LOT more work to show that any given situation fits this model. Most situations and policies don't even fit the more general individual-utility model, and I suspect even fewer will fit this extension.
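A tiny illustration of the normalization issue, with made-up numbers: each person's own ranking of options survives rescaling their utility function, but the interpersonal sum does not, so the summed model needs an extra normalization argument that the individual-utility model never needed.

```python
# Toy illustration with invented numbers: individual rankings are unchanged
# by rescaling one person's utility, but a summed "group" utility is not.

options = ["A", "B"]
alice = {"A": 1.0, "B": 0.0}   # Alice prefers A either way
bob   = {"A": 0.0, "B": 0.6}   # Bob prefers B either way

def group_choice(alice_scale):
    """Pick the option maximizing the (scaled) sum of utilities."""
    total = {o: alice_scale * alice[o] + bob[o] for o in options}
    return max(total, key=total.get)

print(group_choice(1.0))   # 'A': the sum favors Alice's preference
print(group_choice(0.5))   # 'B': same preferences, different scale, different verdict
```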

That said, I like having it formalized, and I look forward to the extension to multi-coalition situations. A spy can benefit Russia and the world more than they hurt the average US resident.

"How conservative" should the partial maximisers be?

"kill all humans, then shut down" is probably the action that most minimizes change. Leaving those buggers alive will cause more (and harder to predict) change than anything else the agent might do.

There's no way to talk about this in the abstract sense of change - it has to be differential from a counterfactual (aka: causal), and can only be measured by other agents' evaluation functions. The world changes for lots of reasons, and an agent might have most of its impact by PREVENTING a change, or by FAILING to change something that's within its power. Asimov's formulation included this understanding: A robot may not injure a human being or, through inaction, allow a human being to come to harm.
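A minimal sketch of that framing (the evaluator and world states below are hypothetical stand-ins): impact only becomes well-defined as a difference from a counterfactual, scored by some agent's evaluation function, which is exactly what lets prevention and inaction count as impact.

```python
# Hypothetical sketch, my framing rather than the post's: impact is the gap
# between the actual world and a counterfactual world, as judged by some
# other agent's evaluation function, not raw "amount of change".

def impact(actual_world, counterfactual_world, evaluate):
    """Differential impact: what the agent's choice made better or worse,
    according to someone's evaluation of world states."""
    return evaluate(actual_world) - evaluate(counterfactual_world)

# Example evaluator: all that matters is whether humans are still around.
evaluate = lambda world: 100.0 if world["humans_alive"] else 0.0

# The agent deflects an incoming asteroid: the world looks "unchanged",
# but relative to the counterfactual its impact is large and positive.
print(impact({"humans_alive": True}, {"humans_alive": False}, evaluate))   # 100.0

# The agent could have deflected it and didn't: inaction shows up as
# negative impact once the comparison is counterfactual rather than raw.
print(impact({"humans_alive": False}, {"humans_alive": True}, evaluate))   # -100.0
```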

Predictors exist: CDT going bonkers... forever

[note: this is bugging me more than it should. I really don't get why this is worth so much repetition of examples that don't show anything new.]

I'll admit I'm one of those who doesn't see CDT as hopeless. It takes a LOT of hypothetical setup to show cases where it fails, and neither Newcomb nor this seems to be as much about decision theory as about free will.

Part of this is my failing. I keep thinking CDT is "classical decision theory", and that it means "make the best conditional predictions you can, and then maximize your expected value." This is very robust, but it describes all serious decision theories. The actual discussion is about "causal decision theory", and there are plenty of failure cases where the agent has a flawed model of causality.

But for some reason, we can't just say "incorrect causal models make bad predictions" and move on. We keep bringing up really contrived cases where a naive agent, which we label CDT, makes bad conditional predictions, and it's not clear why it's so stupid as to not notice. I don't know ANYONE who claims an agent should make and act on incorrect predictions.

For your Newcomb-like example (and really, any Omega causality violation), I assert that a CDT agent could notice outcomes and apply Bayes' theorem to the chance that it can trick Omega, just as well as any other DT. Assuming that Omega is cheating and changing the result after my choice is sufficient to get the right answer.
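A toy sketch of that claim, using the standard Newcomb payoffs ($1M opaque box, $1k transparent box) and my own made-up update rule: the agent tracks P(Omega's prediction matches my choice), updates it from observed rounds, and maximizes expected value under that estimate.

```python
# Toy sketch (my construction, not the post's): an agent facing a
# Newcomb-like predictor keeps a Bayesian estimate of how often Omega's
# prediction matches its actual choice, then maximizes expected value.

matches, mismatches = 1, 1   # Beta(1,1) prior over "Omega predicts me correctly"

def choose():
    p = matches / (matches + mismatches)            # posterior mean
    ev_one_box = p * 1_000_000                      # correct prediction: $1M sits in the opaque box
    ev_two_box = p * 1_000 + (1 - p) * 1_001_000    # "tricking" Omega only pays when it misses
    return "one-box" if ev_one_box > ev_two_box else "two-box"

def observe(prediction_matched_choice):
    global matches, mismatches
    if prediction_matched_choice:
        matches += 1
    else:
        mismatches += 1

# Against a reliable predictor, every observed round confirms the match,
# and one-boxing wins once p exceeds roughly 0.5.
for _ in range(10):
    choose()
    observe(True)          # Omega keeps getting it right
print(choose())            # 'one-box'
```

Whether this still deserves the "CDT" label is exactly what's in dispute; the point is only that an agent which updates its model of the predictor stops making the bad conditional prediction.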

Cases of mind-reading and the like are similarly susceptible to better causality models - recognizing that the causality is due to the agent's intent, not its actions, makes CDT recognize that to the extent it can control the intent, it should.

Your summary includes "the CDT agent can never learn this", and that seems the crux. To me, not learning something means that _EITHER_ the CDT agent is a strawman that we shouldn't spend so much time on, _OR_ this is something that cannot be true, and it's probably good if agents can't learn it. If you tell me that a Euclidean agent knows pi and can accurately make wagers on the circumference of a circle knowing only its diameter, but it's flawed because a magic being puts it on a curved surface and it never reconsiders that belief, I'm going to shrug and say "okay... but here in flatland that doesn't happen". It doesn't matter how many thought experiments you come up with to show counterfactual cases where C/D is different for a circle; you're completely talking past my objection that Euclidean decision theory is simple and workable for actual use.

To summarize my confusion: does CDT require that the agent unconditionally believe in perfect free will independent of history (and, ironically, with no causality for the exercise of will)? If so, that should be the main topic of dispute - the frequency of actual cases where it makes bad predictions, not that it makes bad decisions in ludicrously-unlikely-and-perhaps-impossible situations.

The "Commitment Races" problem

I think you're missing at least one key element in your model: uncertainty about future predictions. Commitments have a very high cost in terms of future consequence-affecting decision space. Consequentialism does _not_ imply a very high discount rate, and we're allowed to recognize the limits of our prediction and to give up some power in the short term to reserve our flexibility for the future.

Also, one of the reasons that this kind of interaction is rare among humans is that binding commitment is impossible for humans. We can change our minds even after making an oath - often with some reputational consequences, but it's still possible if we deem it worthwhile. Even so, we're rightly reluctant to make serious commitments. An agent who can actually enforce its self-limitations is going to be orders of magnitude more hesitant to do so.

All that said, it's worth recognizing that an agent that's significantly better at predicting the consequences of potential commitments will pay a lower cost for the best of them, and will have a material advantage over those who need flexibility because they don't have information. This isn't a race in time; it's a race in knowledge and understanding. I don't think there's any way out of that race - more powerful agents are going to beat weaker ones most of the time.
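A toy expected-value comparison (all numbers invented) of committing early versus keeping flexibility: the cost of committing shrinks as prediction accuracy rises, which is the sense in which this is a race in knowledge rather than in time.

```python
# Invented numbers: two equally likely futures; an early commitment pays 10
# in the future it fits and 0 otherwise; waiting keeps the option value of
# choosing after the future is revealed, minus a small cost for moving late.

def ev_commit(prediction_accuracy):
    """Expected value of committing now, given forecast accuracy."""
    return prediction_accuracy * 10 + (1 - prediction_accuracy) * 0

EV_WAIT = 10 - 2   # always pick the right option later, minus a lateness cost

for q in (0.5, 0.7, 0.9, 0.99):
    better = "commit" if ev_commit(q) > EV_WAIT else "wait"
    print(f"accuracy={q:.2f}  EV(commit)={ev_commit(q):4.1f}  EV(wait)={EV_WAIT}  -> {better}")
```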

Buridan's ass in coordination games

Nope. Random choice gives a specific value for R each game. The outcome for that iteration is IDENTICAL to the outcome if that R was chosen intentionally. Randomness only has game value as a mechanism to keep information from an adversarial actor.

Buridan's ass in coordination games

Sure, but non-adversarial cases (really, any cases where u is determined independently of strategies chosen) can just choose R as a fixed part of the strategy, rather than a random shared component determined later.

Buridan's ass in coordination games

Based on other comments, I realize I'm making an assumption for something you haven't specified. How is u_y chosen? If it's random and independent, then my assertion holds; if it's selected by an adversary who knows the players' full strategies somehow, then R is just a way of keeping a secret from the adversary - sequence doesn't matter, but knowledge does.
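A toy simulation of that distinction (the game below is my own stand-in, not the one from the post): when the outside value is drawn independently of the strategy, a pre-agreed R matches a freshly drawn one on average; when an adversary knows the players' strategy, the only thing randomness buys is that R stays secret.

```python
import random

# My own stand-in game, not the post's: the team turns a shared threshold R
# into an action, and scores when the outside value fails to match it.

def team_action(R):
    return 1 if R > 0.5 else 0

def play(rounds, random_R, adversarial):
    score = 0
    for _ in range(rounds):
        R = random.random() if random_R else 0.3   # fresh draw vs. pre-agreed value
        action = team_action(R)
        if adversarial:
            # The adversary knows the strategy; with a pre-agreed R it also
            # knows the action, and no strategy beats chance against a secret R.
            outside = team_action(0.3) if not random_R else random.randint(0, 1)
        else:
            outside = random.randint(0, 1)          # independent of the strategy
        score += 1 if action != outside else 0
    return score / rounds

print("independent u, fixed R: ", play(100_000, random_R=False, adversarial=False))  # ~0.5
print("independent u, random R:", play(100_000, random_R=True,  adversarial=False))  # ~0.5
print("adversary,     fixed R: ", play(100_000, random_R=False, adversarial=True))   # 0.0
print("adversary,     random R:", play(100_000, random_R=True,  adversarial=True))   # ~0.5
```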

Buridan's ass in coordination games

u_y and R are independently chosen from well-defined distributions. Regardless of sequence, neither knows the other and CANNOT be chosen based on the other. I'll see if I can find time tonight to figure out whether I'm saying your claim 1 is wrong (it dropped epsilon too soon from the floor value, but I'm not sure if it's more fundamentally problematic than that) or that your claim 2 is misleading.

My current expectation is that I'll find that your claim 2 results are available in situation 1, by using your given function with a pre-agreed value rather than a random one.

Buridan's ass in coordination games

R ∼ Uniform([0,1])

How can it possibly matter whether R is chosen before or after u_y? R is completely independent of u, right? It's not a covert communication mechanism about the players' observations; it's a random value.
