Stuart Armstrong

Comments

Extortion beats brinksmanship, but the audience matters

The connection to AI alignment is in combining the utilities of different entities without extortion ruining the combination, and in dealing with threats and acausal trade.

Extortion beats brinksmanship, but the audience matters

That's a misspelling that's entirely my fault, and has now been corrected.

Extortion beats brinksmanship, but the audience matters

(1) You say that releasing nude photos is in the blackmail category. But who's the audience?

The other people of whom you have nude photos, who are now incentivised to pay up rather than kick up a fuss.

(2) For n=1, m large: Is an example of brinkmanship here a monopolistic buyer who will only choose suppliers giving cut-rate prices?

Interesting example that I hadn't really considered. I'd say it fits more under extortion than brinksmanship, though. A small supplier has to sell, or they won't stay in business. If there's a single buyer, "I won't buy from you" is the same as "I will ruin you". Abstracting away the property rights (Walmart is definitely legally allowed to do this), this seems very much like extortion.

Anthropomorphisation vs value learning: type 1 vs type 2 errors

A boundedly-rational agent is assumed to be mostly rational, falling short of full rationality only because it cannot figure things out in enough detail.

Humans are occasionally rational, often biased, often inconsistent; they sometimes consciously act against their best interests, often follow heuristics without thinking, and sometimes do think things through. This doesn't seem to correspond to what is normally understood as "boundedly-rational".
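To make the contrast concrete, here is a minimal sketch (in Python, with made-up utilities and a hypothetical rationality parameter beta) of the softmax/Boltzmann model that "boundedly-rational agent" typically refers to: an otherwise rational maximiser with a single noise knob. The mixed human behaviour described above doesn't reduce to tuning that one knob.

```python
import numpy as np

def boltzmann_policy(utilities: np.ndarray, beta: float) -> np.ndarray:
    """Action probabilities for a boundedly-rational (softmax) agent.

    beta -> infinity recovers a fully rational maximiser;
    beta -> 0 gives uniformly random behaviour.
    """
    logits = beta * utilities
    logits -= logits.max()          # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

utilities = np.array([1.0, 0.9, 0.1])          # made-up option values
print(boltzmann_policy(utilities, beta=10.0))  # near-rational choice
print(boltzmann_policy(utilities, beta=0.1))   # near-random choice
```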

Anthropomorphisation vs value learning: type 1 vs type 2 errors

that paper was about fitting observations of humans to a mathematical model of "boundedly-rational agent pursuing a utility function"

It was "any sort of agent pursuing a reward function".

Anthropomorphisation vs value learning: type 1 vs type 2 errors

We don't need a special module to get an everyday definition of doorknobs, and likewise I don't think we need a special module to get an everyday definition of human motivation.

I disagree. Doorknobs exist in the world (even if the category is loosely defined, and has lots of edge cases), whereas goals/motivations are interpretations that we put upon agents. The main result of the Occam's razor paper is that the goals of an agent are not something you can know without putting your own interpretation on them - even if you know every physical fact about the universe. And two very different interpretations can be equally valid, with no way of distinguishing between them.
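As a toy illustration (my own construction, not the paper's formalism): the same observed behaviour is exactly consistent with two opposite (planner, reward) pairs, so no amount of behavioural data picks between them.

```python
import numpy as np

actions = ["help", "harm"]
policy = np.array([0.9, 0.1])   # observed action frequencies

# Interpretation 1: a rational agent softmax-maximising reward r1.
r1 = np.log(policy)
p1 = np.exp(r1) / np.exp(r1).sum()

# Interpretation 2: an anti-rational agent softmax-minimising r2 = -r1.
r2 = -r1
p2 = np.exp(-r2) / np.exp(-r2).sum()

# Both (planner, reward) pairs reproduce the data perfectly,
# yet they attribute opposite goals to the agent.
assert np.allclose(p1, policy) and np.allclose(p2, policy)
```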

(I like the anthropomorphising/dehumanising symmetry, but I'm focusing on the aspects of dehumanising that cause you to make errors of interpretation. For example, out-groups are perceived as being coherent, acting in concert without disagreements, and often being explicitly evil. This is an error, not just a reduction in social emotions.)
