Comments

Why I hate the "accident vs. misuse" AI x-risk dichotomy (quick thoughts on "structural risk")

The distinction between "accidental" and "negligent" is always a bit political. It's a question of assignment of credit/blame for hypothetical worlds, which is pretty much impossible in any real-world causality model.

I do agree that in most discussions, "accident" often implies a single unexpected outcome, rather than a repeated risk profile and multiple moves toward the bad outcome. Even so, if it doesn't reach the level of negligence for any one actor, Eliezer's term "inadequate equilibrium" may be more accurate.

Which means that using a different word will be correctly identified as a desire to shift responsibility from "it's a risk that might happen" to "these entities are bringing that risk on all of us".

Humans do acausal coordination all the time

Dagon2y710

Interesting take, but I'll note that these are not acausal, just indirect-causal. Voting is a good example - counts are public, so future voters KNOW how many of their fellow citizens take it seriously enough to participate.

In all of these examples, there is a signaling path to future impact. Which humans are perhaps over-evolved to focus on.

Unifying Bargaining Notions (1/2)

Dagon2y-22

I really wish you'd included the outside-of-game considerations. The example of what to eat for dinner is OVERWHELMINGLY about the future relationship between the diners, not about the result itself. This is true of all real-world bargaining (where you're making commitments and compromises) - you're giving up some immediate value in order to make future interactions way better.

Oracle predictions don't apply to non-existent worlds

Dagon3y00

Thanks for your patience with this. I am still missing some fundamental assumption or framing about why this is non-obvious (IMO, either the Oracle is wrong, or the choice is illusory). I'll continue to examine the discussions and examples in hopes that it will click.

Oracle predictions don't apply to non-existent worlds

Dagon3y00

Hmm. So does this only apply to CDT agents, who foolishly believe that their decision is not subject to predictions?

Oracle predictions don't apply to non-existent worlds

Dagon3y00

Is there an ELI5 doc about what's "normal" for Oracles, and why they're constrained in that way? The examples I see confuse me in that they are exploring what seem like edge cases, and I'm missing the underlying model that makes these cases critical.

Specifically, when you say "It's only guaranteed to be correct on the actual decision", why does the agent not know what "correct" means for the decision?

Oracle predictions don't apply to non-existent worlds

Dagon3y00

Sure, that's a sane Oracle. The Weird Oracle used in so many thought experiments doesn't say "The taxi will arrive in one minute!", it says "You will grab your coat in time for the taxi."

A world in which the alignment problem seems lower-stakes

Dagon3y10

I don't follow the half-universe argument. Are you somehow sending the AGI outside of your light-cone? Or have you crafted the AGI utility function and altered your own to not care about the others' half? I don't get the model of utility that works for this claim:

"The only information you have about the other half is your utility."

My conception of utility is that it's a synthetic calculation from observations about the state of the universe, not that it's a thing on its own which can carry information.

Open problem: how can we quantify player alignment in 2x2 normal-form games?

Dagon3y00

Sorry, I didn't mean to be accusatory in that, only descriptive in a way that I hope will let me understand what you're trying to model/measure as "alignment", with the prerequisite understanding of what the payout matrix indicates. http://cs.brown.edu/courses/cs1951k/lectures/2020/chapters1and2.pdf is one reference, but I'll admit it's baked into my understanding to the point that I don't know where I first saw it. I can't find any references to the other interpretation (that the payouts are something other than a ranking of preferences by each player).

So the questions are "what DO these payout numbers represent?" and "what other factors go into an agent's decision of which row/column to choose?"
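To make the "ranking of preferences" reading concrete, here is a minimal sketch (my own illustration, with made-up payout numbers and a hypothetical `best_responses` helper) showing that pure-strategy best responses in a 2x2 game depend only on each player's ordering of outcomes. Rescaling a player's payouts with any increasing function leaves them unchanged. Note the limitation: once mixed strategies and expected values come in, the magnitudes matter again, and only positive affine rescalings are safe.

```python
def best_responses(row_payoffs, col_payoffs):
    """Pure-strategy best responses in a 2x2 game.

    row_payoffs[r][c] and col_payoffs[r][c] are the payouts to the row and
    column player when row r and column c are chosen.
    Returns (row_br, col_br), where row_br[c] is the row player's best row
    against column c and col_br[r] is the column player's best column
    against row r.
    """
    row_br = [max((0, 1), key=lambda r: row_payoffs[r][c]) for c in (0, 1)]
    col_br = [max((0, 1), key=lambda c: col_payoffs[r][c]) for r in (0, 1)]
    return row_br, col_br


# Illustrative Prisoner's-Dilemma-like payouts (assumed numbers).
row = [[3, 0], [5, 1]]
col = [[3, 5], [0, 1]]

# Increasing transforms that preserve each player's ranking of outcomes.
row_rescaled = [[x ** 3 for x in line] for line in row]
col_rescaled = [[10 * x + 7 for x in line] for line in col]

original = best_responses(row, col)
rescaled = best_responses(row_rescaled, col_rescaled)
print(original == rescaled)
```

If the payouts are pure preference rankings, this is all the information the matrix carries, which is why anything like "alignment" has to come from somewhere outside it.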

Open problem: how can we quantify player alignment in 2x2 normal-form games?

Dagon3y00

I went back and re-read your https://www.lesswrong.com/posts/8LEPDY36jBYpijrSw/what-counts-as-defection post, and it's much clearer to me that you're NOT using standard game-theory payouts (utility) here. You're using some hybrid of utility and resource payouts, where you seem to normalize payout amounts, but then don't limit the decision to the payouts - players have a utility function which converts the payouts (for all players, not just themselves) into something they maximize in their decision. It's not clear whether they include any non-modeled information (how much they like the other player, whether they think there are future games or reputation effects, etc.) in their decision.

Based on this, I don't think the question is well-formed. A 2x2 normal-form game is self-contained and one-shot. There's no alignment to measure or consider - it's just ONE SELECTION, with one of two outcomes based on the other agent's selection.

It would be VERY INTERESTING to define a game nomenclature to specify the universe of considerations that two (or more) agents can have to make a decision, and then to define an "alignment" measure about when a player's utility function prefers similar result-boxes as the others' do. I'd be curious about even very simple properties, like "is it symmetrical" (I suspect no - A can be more aligned with B than B is with A, even for symmetrical-in-resource-outcome games).
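As a rough sketch of what such a measure could look like (this is my own toy construction, not anything from the original post): give each player a utility of `own + care * other` over resource payouts, and score "X's alignment with Y" as the fraction of outcome pairs where X's utility ordering agrees with Y's resource ordering. The payout numbers, the `care` weights, and the `alignment` function are all assumptions for illustration.

```python
from itertools import combinations

# Resource payouts, symmetric between the players (illustrative numbers).
# Keys are (A's move, B's move); values are (A's payout, B's payout).
PAYOUTS = {
    ("C", "C"): (3, 3),
    ("C", "D"): (0, 5),
    ("D", "C"): (5, 0),
    ("D", "D"): (1, 1),
}


def utility(own, other, care):
    """Toy utility: own resources plus a weight on the other player's."""
    return own + care * other


def alignment(care, x):
    """Fraction of outcome pairs where player x's utility ordering agrees
    with the other player's resource ordering (x is 0 for A, 1 for B)."""
    y = 1 - x
    agree = total = 0
    for first, second in combinations(PAYOUTS, 2):
        p, q = PAYOUTS[first], PAYOUTS[second]
        x_pref = utility(p[x], p[y], care) - utility(q[x], q[y], care)
        y_pref = p[y] - q[y]
        if y_pref == 0:
            continue  # the other player is indifferent; nothing to agree on
        total += 1
        if x_pref * y_pref > 0:
            agree += 1
    return agree / total


a_with_b = alignment(care=2.0, x=0)  # A weighs B's payout heavily
b_with_a = alignment(care=0.0, x=1)  # B cares only about its own payout
print(a_with_b, b_with_a)
```

With these assumed numbers, A's score toward B comes out at 5/6 while B's score toward A is 1/6, so this particular measure is not symmetrical even though the resource payouts are. That fits the suspicion above, though a different definition of alignment could behave differently.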