Just this guy, you know?




Oracle predictions don't apply to non-existent worlds

Thanks for your patience with this. I am still missing some fundamental assumption or framing about why this is non-obvious (IMO, either the Oracle is wrong, or the choice is illusory). I'll continue to examine the discussions and examples in hopes that it will click.

Oracle predictions don't apply to non-existent worlds

Hmm.  So does this only apply to CDT agents, who foolishly believe that their decision is not subject to predictions?

Oracle predictions don't apply to non-existent worlds

Is there an ELI5 doc about what's "normal" for Oracles, and why they're constrained in that way?  The examples I see confuse me in that they are exploring what seem like edge cases, and I'm missing the underlying model that makes these cases critical.

Specifically, when you say "It's only guaranteed to be correct on the actual decision", why does the agent not know what "correct" means for the decision?

Oracle predictions don't apply to non-existent worlds

Sure, that's a sane Oracle. The Weird Oracle used in so many thought experiments doesn't say "The taxi will arrive in one minute!", it says "You will grab your coat in time for the taxi."

A world in which the alignment problem seems lower-stakes

I don't follow the half-universe argument. Are you somehow sending the AGI outside of your light-cone? Or have you crafted the AGI utility function and altered your own to not care about the others' half? I don't get the model of utility that works for:

"The only information you have about the other half is your utility."

My conception of utility is that it's a synthetic calculation from observations about the state of the universe, not that it's a thing on its own which can carry information.

Open problem: how can we quantify player alignment in 2x2 normal-form games?

Sorry, I didn't mean to be accusatory in that, only descriptive in a way that I hope will let me understand what you're trying to model/measure as "alignment", with the prerequisite understanding of what the payout matrix indicates. http://cs.brown.edu/courses/cs1951k/lectures/2020/chapters1and2.pdf is one reference, but I'll admit it's baked into my understanding to the point that I don't know where I first saw it. I can't find any references to the other interpretation (that the payouts are something other than a ranking of preferences by each player).

So the questions are "what DO these payout numbers represent?" and "what other factors go into an agent's decision of which row/column to choose?"

Open problem: how can we quantify player alignment in 2x2 normal-form games?

I went back and re-read your https://www.lesswrong.com/posts/8LEPDY36jBYpijrSw/what-counts-as-defection post, and it's much clearer to me that you're NOT using standard game-theory payouts (utility) here.  You're using some hybrid of utility and resource payouts, where you seem to normalize payout amounts, but then don't limit the decision to the payouts - players have a utility function which converts the payouts (for all players, not just themselves) into something they maximize in their decision.  It's not clear whether they include any non-modeled information (how much they like the other player, whether they think there are future games or reputation effects, etc.) in their decision.

Based on this, I don't think the question is well-formed.  A 2x2 normal-form game is self-contained and one-shot.  There's no alignment to measure or consider - it's just ONE SELECTION, with one of two outcomes based on the other agent's selection.  

It would be VERY INTERESTING to define a game nomenclature to specify the universe of considerations that two (or more) agents can have to make a decision, and then to define an "alignment" measure about when a player's utility function prefers similar result-boxes as the others' do.  I'd be curious about even very simple properties, like "is it symmetrical" (I suspect no - A can be more aligned with B than B is with A, even for symmetrical-in-resource-outcome games).  

Open problem: how can we quantify player alignment in 2x2 normal-form games?

Payout correlation IS the metric of alignment. A player who isn't trying to maximize their (utility) payout is actually not playing the game you've defined. You're simply incorrect (or describing a different payout matrix than the one you state) in saying that a player doesn't "have to select a best response".
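To make "payout correlation" concrete, here is a minimal sketch. It assumes the correlation meant is the plain Pearson correlation between the two players' payouts, taken over the four cells with equal weight. That choice, the function name `payoff_correlation`, and the example payout numbers are my assumptions for illustration, not anything specified in the thread.

```python
from math import sqrt

def payoff_correlation(game):
    """Pearson correlation between the two players' payouts.

    `game` maps (row_action, col_action) -> (row_payout, col_payout).
    All cells are weighted equally (an assumption; no strategy
    probabilities are modeled). Returns None if either player is
    indifferent across every cell, since the correlation is undefined.
    """
    a = [p[0] for p in game.values()]
    b = [p[1] for p in game.values()]
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return None
    return cov / sqrt(var_a * var_b)

# Illustrative payout matrices (hypothetical numbers).
matching_pennies = {  # fixed-sum: every cell sums to 0
    ("H", "H"): (1, -1), ("H", "T"): (-1, 1),
    ("T", "H"): (-1, 1), ("T", "T"): (1, -1),
}
coordination = {  # common-payoff: both players get the same number
    ("A", "A"): (2, 2), ("A", "B"): (0, 0),
    ("B", "A"): (0, 0), ("B", "B"): (1, 1),
}
prisoners_dilemma = {
    ("C", "C"): (3, 3), ("C", "D"): (0, 5),
    ("D", "C"): (5, 0), ("D", "D"): (1, 1),
}

for name, g in [("matching pennies", matching_pennies),
                ("coordination", coordination),
                ("prisoner's dilemma", prisoners_dilemma)]:
    print(name, payoff_correlation(g))
```

Under these assumptions the fixed-sum game sits at one extreme, the common-payoff game at the other, and mixed-motive games like the prisoner's dilemma land in between.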

Open problem: how can we quantify player alignment in 2x2 normal-form games?

I think this is backward.  The game's payout matrix determines the alignment.  Fixed-sum games imply (in the mathematical sense) unaligned players, and common-payoff games ARE the definition of alignment.  

When you start looking at meta-games (where resource payoffs differ from utility payoffs, based on agent goals), then "alignment" starts to make sense as a distinct measurement - it's how much the players' utility functions transform the payoffs (in the sub-games of a series, and in the overall game) from fixed-sum to common-payoff.
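A rough sketch of what I mean by utility functions transforming payoffs. It assumes a very simple utility model: each player's utility is their own resource payoff plus a weight times the other player's. The weights `w_row` and `w_col`, the helper names, and the payoff numbers are all hypothetical, and real utility functions could take many other forms.

```python
def transform(resources, w_row, w_col):
    """Convert resource payoffs into utility payoffs.

    Assumed model: utility = own resources + weight * other's resources.
    A weight of 0 is a purely selfish player; a weight of 1 is a player
    who values the other's resources as much as their own.
    """
    return {
        cell: (r + w_row * c, c + w_col * r)
        for cell, (r, c) in resources.items()
    }

def is_fixed_sum(game):
    """True if the two payoffs add up to the same total in every cell."""
    return len({r + c for r, c in game.values()}) == 1

def is_common_payoff(game):
    """True if both players receive identical payoffs in every cell."""
    return all(r == c for r, c in game.values())

# Hypothetical resource payoffs.
split_the_pie = {  # fixed-sum: 4 units are divided in every cell
    ("T", "L"): (4, 0), ("T", "R"): (1, 3),
    ("B", "L"): (3, 1), ("B", "R"): (0, 4),
}
prisoners_dilemma = {
    ("C", "C"): (3, 3), ("C", "D"): (0, 5),
    ("D", "C"): (5, 0), ("D", "D"): (1, 1),
}

selfish_pd = transform(prisoners_dilemma, 0, 0)  # unchanged resource game
caring_pd = transform(prisoners_dilemma, 1, 1)   # both maximize the total
caring_pie = transform(split_the_pie, 1, 1)      # every cell becomes (4, 4)

print(is_fixed_sum(split_the_pie), is_common_payoff(caring_pie))
print(is_common_payoff(selfish_pd), is_common_payoff(caring_pd))
```

Note that in the fixed-sum case, fully caring players end up indifferent between all cells, so the resulting common-payoff game is a degenerate one. The prisoner's dilemma is the more interesting example: the same resource matrix is a mixed-motive game for selfish players and a common-payoff game for fully caring ones.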

"Beliefs" vs. "Notions"

In everyday life, "notion" implies low confidence. It's often derogatory: low confidence on the speaker's part, plus a further implication that the holder/object of discussion doesn't even have the idea of confidence.

You might just use "proposition" or "claim" to mean the specific thing that a probability belief applies to.
