Alex Turner

Alex Turner, Oregon State University PhD student working on AI alignment.

Sequences

Reframing Impact

Comments

Non-Obstruction: A Simple Concept Motivating Corrigibility

Do I intend to do something with people's predictions? Not presently, but I think people giving predictions is good both for the reader (to ingrain the concepts by thinking things through enough to provide a credence / agreement score) and for the community (to see where people stand wrt these ideas).

The Catastrophic Convergence Conjecture

The catastrophic convergence conjecture was originally formulated in terms of "outer alignment catastrophes tending to come from power-seeking behavior." I think that this was a mistake: I meant to talk about impact alignment catastrophes tending to be caused by power-seeking. I've updated the post accordingly.

Learning Normativity: A Research Agenda

Another thing which seems to "gain something" every time it hops up a level of meta: Corrigibility as Outside View. Not sure what the fixed points are like, if there are any, and I don't view what I wrote as attempting to meet these desiderata. But I think there's something interesting that's gained each time you go meta. 

TurnTrout's shortform feed
From unpublished work.

The answer to this seems obvious in isolation: shaping helps with credit assignment, rescaling doesn't (and rescaling might complicate certain methods, in the way that advantages and Q-values differ). But I feel like maybe there's an important interaction here that could inform a mathematical theory of how a reward signal guides learners through model space?
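A minimal sketch of the distinction I mean, assuming the standard potential-based shaping of Ng, Harada & Russell (1999) and a simple positive rescaling (this toy example is illustrative, not from the unpublished work):

```python
# Potential-based shaping adds gamma*phi(s') - phi(s) to each reward, which leaves
# optimal policies unchanged but spreads informative signal across intermediate
# states, helping credit assignment. Positive rescaling multiplies every reward by
# a constant: the reward ordering is preserved, but no new credit-assignment signal
# appears (only the magnitudes of Q-values and advantages change).

GAMMA = 0.9

def shaped_reward(r, s, s_next, phi, gamma=GAMMA):
    """Potential-based shaping: r + gamma * phi(s') - phi(s)."""
    return r + gamma * phi[s_next] - phi[s]

def rescaled_reward(r, c=10.0):
    """Positive rescaling: multiply every reward by c > 0."""
    return c * r

# A 4-state chain 0 -> 1 -> 2 -> 3 with reward only on the final transition.
rewards = {(0, 1): 0.0, (1, 2): 0.0, (2, 3): 1.0}
phi = {s: float(s) for s in range(4)}   # potential grows toward the goal

for (s, s_next), r in rewards.items():
    print(s, s_next,
          "shaped:", round(shaped_reward(r, s, s_next, phi), 3),
          "rescaled:", round(rescaled_reward(r), 3))
# Shaping gives nonzero feedback on every transition; rescaling still gives
# zero feedback everywhere except at the goal.
```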

Knowledge, manipulation, and free will

(I have a big google doc analyzing corrigibility & manipulation from the attainable utility landscape frame; I’ll link it here when the post goes up on LW)

Knowledge, manipulation, and free will

OK, but there's a difference between "here's a definition of manipulation that's so watertight you couldn't break it if you optimized against it with arbitrarily large optimization power" and "here's my current best way of thinking about manipulation." I was presenting the latter, because it helps me be less confused than if I just stuck to my previous gut-level, intuitive understanding of manipulation.

Edit: Put otherwise, I was replying more to your point (1) than your point (2) in the original comment. Sorry for the ambiguity!

Knowledge, manipulation, and free will

Not Stuart, but I agree there's overlap here. Personally, I think of manipulation as when an agent's policy steers the human into taking a certain kind of action, in a way that's robust to the human's counterfactual preferences. Like if I'm choosing which pair of shoes to buy, and I ask the AI for help, and no matter what preferences I had for shoes to begin with, I end up buying blue shoes, then I'm probably being manipulated. A non-manipulative AI would act in a way that increases my knowledge and lets me condition my actions on my preferences.
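As a toy illustration of that shoe example (my own sketch, not a proposed formal definition; the assistant functions here are hypothetical, and this is just one way to cash out "robust to counterfactual preferences"):

```python
# Model the assistant as a function from the human's counterfactual preference to
# the shoe the human ends up buying after taking the assistant's advice. If the
# outcome is the same no matter which preference the human started with, the
# assistant looks manipulative; if the outcome tracks the preference, it doesn't.

PREFERENCES = ["red", "blue", "green"]

def is_manipulative(assistant) -> bool:
    """Flag the assistant if the outcome ignores the human's counterfactual preferences."""
    outcomes = {assistant(pref) for pref in PREFERENCES}
    return len(outcomes) == 1   # same purchase regardless of what the human wanted

pushy_assistant = lambda pref: "blue"     # always steers toward blue shoes
helpful_assistant = lambda pref: pref     # informs, then lets the preference decide

print(is_manipulative(pushy_assistant))    # True
print(is_manipulative(helpful_assistant))  # False
```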

"Zero Sum" is a misnomer.

I read "feasible" as something like "rationalizable." I think it would have been much clearer if you had said "if no strategy profiles are Pareto over any others."

"Zero Sum" is a misnomer.

So, we could consider a game completely adversarial if it has a structure like this: no strategy profiles are a Pareto improvement over any others. In other words, the feasible outcomes of the game equal the game's Pareto frontier. All possible outcomes involve trade-offs between players.

I must have missed some key word - by this definition, wouldn't common-payoff games be "completely adversarial", because the "feasible" outcomes equal the Pareto frontier under the usual assumptions?
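A small sketch of why the reading of "feasible" matters (my own toy formalization, not the post's; "rational coordinated play" below is an assumption standing in for "the usual assumptions"):

```python
# Check the definition "completely adversarial iff the feasible outcomes equal the
# Pareto frontier" on a 2x2 common-payoff coordination game, under two readings of
# "feasible".

def pareto_frontier(profiles):
    """Payoff profiles not strictly Pareto-dominated by any other profile."""
    def dominates(a, b):
        return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))
    return {p for p in profiles if not any(dominates(q, p) for q in profiles)}

def completely_adversarial(feasible):
    return set(feasible) == pareto_frontier(feasible)

# Common-payoff game: both players get 2 for coordinating on B, 1 for coordinating
# on A, and 0 for miscoordinating.
all_outcomes = {(1, 1), (2, 2), (0, 0)}

# Reading 1: "feasible" = every achievable payoff profile.
print(completely_adversarial(all_outcomes))       # False: (2,2) Pareto-improves on (0,0)

# Reading 2: "feasible" = outcomes of rational coordinated play, i.e. the players
# settle on a payoff-maximizing profile. The feasible set collapses onto the
# frontier, and the definition labels this fully cooperative game "completely
# adversarial".
rational_outcomes = {(2, 2)}
print(completely_adversarial(rational_outcomes))  # True
```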
