Prosaic AI alignment

[-]TurnTrout3y40

In general, we have no way to use RL to actually interpret and implement human wishes, rather than to optimize some concrete and easily-calculated reward signal.

I feel confused by this sentence. Reward is not the optimization target. Reward provides cognitive updates to the agent. ETA: So, shouldn't wisely-selected reward schedules produce good cognitive updates, which produces a mind which implements human wishes?

[-]paulfchristiano3y30

So, shouldn't wisely-selected reward schedules produce good cognitive updates, which produces a mind which implements human wishes?

I don't think we know how to pick rewards that would implement human wishes. It's great if people want to propose wise strategies and then argue or demonstrate that they have that effect.

On the other side: if you have an easily measurable reward function, then you can find a policy that gets a high reward by using RL (essentially gradient descent on "how much reward does the policy get.") I agree that this need not produce a policy that is trying to get reward, just one that in fact gets a lot of reward on distribution.
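As a purely illustrative sketch of "gradient descent on how much reward the policy gets" (not anything taken from the comments here), consider a toy REINFORCE update on a three-armed bandit. The function names, reward values, learning rate, and running-average baseline are all assumptions chosen for the example. Note that reward enters only as a scalar that scales each parameter update; the policy itself is just a vector of logits with no representation of reward.

```python
import math
import random


def softmax(logits):
    """Convert logits to action probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train_bandit(arm_rewards, steps=5000, lr=0.1, seed=0):
    """REINFORCE on a bandit with fixed per-arm rewards (hypothetical setup).

    Returns the final action probabilities of the softmax policy.
    """
    rng = random.Random(seed)
    logits = [0.0] * len(arm_rewards)  # the policy's only parameters
    baseline = 0.0  # running average of reward, used to reduce variance

    for _ in range(steps):
        probs = softmax(logits)
        action = rng.choices(range(len(probs)), weights=probs)[0]
        reward = arm_rewards[action]

        # Reward is used here, and only here: as a multiplier on the update.
        advantage = reward - baseline
        baseline += 0.05 * (reward - baseline)

        # Gradient of log pi(action) w.r.t. each logit is
        # (1 if a == action else 0) - probs[a].
        for a in range(len(logits)):
            grad_logp = (1.0 if a == action else 0.0) - probs[a]
            logits[a] += lr * advantage * grad_logp

    return softmax(logits)


if __name__ == "__main__":
    final_probs = train_bandit([0.1, 0.5, 1.0])
    print([round(p, 3) for p in final_probs])
```

In this toy setting the trained policy ends up choosing the highest-reward arm most of the time, so it "in fact gets a lot of reward on distribution," yet nothing in the code gives the policy a goal of obtaining reward. Whether anything similar holds for large learned policies is exactly what the thread is debating.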

[-]cfoster03y32

At the risk of reading too much into wording, I think the phrasing of the above two comments contains an interesting difference.

The first comment (TurnTrout) talks about reward as the thing providing updates to the agent's cognition, i.e. "reward schedules produce ... cognitive updates", and expresses confusion about a prior quote that mentioned implementing our wishes through reward functions.

The second comment (paulfchristiano) talks about picking "rewards that would implement human wishes" and strategies for doing so.

These seem quite different. If I try to inhabit my model of TurnTrout, I expect he might say "But rewards don't implement wishes! Our wishes are for the agent to have a certain kind of cognition." and perhaps also "Whether a reward event helps us get the cognition we want is, first and foremost, a question of which circuits inside the agent are reinforced/refined by said event, and whether those particular circuits implement the kind of cognition we wanted the agent to have."

[-]paulfchristiano3y20

I don't particularly object to that framing; it's just a huge gap from "Rewards have unpredictable effects on agent's cognition, not necessarily to cause them to want reward" to "we have a way to use RL to interpret and implement human wishes."

[-]TurnTrout3y20

it's just a huge gap from "Rewards have unpredictable effects on agent's cognition, not necessarily to cause them to want reward" to "we have a way to use RL to interpret and implement human wishes."

So, OP said

In general, we have no way to use RL to actually interpret and implement human wishes, rather than to optimize some concrete and easily-calculated reward signal.

I read into this a connotation of "In general, there isn't a practically-findable way to use RL...". I'm now leaning towards my original interpretation being wrong -- that you meant something more like "we don't know how to use RL to actually interpret and implement human wishes" (which I agree with).

[-]TurnTrout3y20

I agree that this need not produce a policy that is trying to get reward, just one that in fact gets a lot of reward on distribution.

I think this tells us relatively little about the internal cognition, and so is a relatively non-actionable fact (which you probably agree with?). But I want to sort out my thoughts more, here, before laying down more of my intuitions on that.

AI ALIGNMENT FORUM

1. Prosaic AGI

2. Our current state

2a. The concern

2b. Behaving cautiously

2c. The current state of AI alignment

3. Priorities

3a. Easy to start now

3b. Importance

3c. Feasibility

Conclusion