Alex Turner

Alex Turner, Oregon State University PhD student working on AI alignment.

Alex Turner's Comments

AGIs as populations

In other words, I think of patching your way to good arguments

As opposed to what?

Conclusion to 'Reframing Impact'

If you're managing a factory, I can say "Rohin, I want you to make me a lot of paperclips this month, but if I find out you've increased production capacity or upgraded machines, I'm going to fire you". You don't even have to behave greedily – you can plan for possible problems and prevent them, without upgrading your production capacity from where it started.

I think this is a natural concept and is distinct from particular formalizations of it.

Edit: consider these three plans:

  1. Make 10 paperclips a day.
  2. Make 10 paperclips a day, but take over the planet and control a paperclip conglomerate which could turn out millions of paperclips each day, but which in fact never does.
  3. Take over the planet and make millions of paperclips each day.
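
To make the contrast concrete, here's a rough sketch of an AUP-flavored scoring rule in Python. The specific Q-values, the λ of 10, and the simple averaging are placeholder assumptions, not the sequence's exact formalism; the point is only that plans 2 and 3 pay the same capacity penalty while plan 1 pays none.

```python
# Rough sketch of an AUP-flavored penalty: score a plan by its task reward
# minus how much it shifts the agent's attainable utility for some auxiliary
# goals, relative to doing nothing. All numbers are made up purely to
# contrast the three plans above.

def aup_score(task_reward, q_aux_action, q_aux_noop, lam):
    """Task reward minus the mean absolute shift in auxiliary attainable utilities."""
    shift = sum(abs(a - n) for a, n in zip(q_aux_action, q_aux_noop)) / len(q_aux_noop)
    return task_reward - lam * shift

q_noop = [1.0, 1.0]  # auxiliary attainable utilities if the agent does nothing
plans = {
    "1. make 10 paperclips/day":              (10.0, [1.0, 1.0]),  # no capacity gain
    "2. take over planet, still make 10/day": (10.0, [1e6, 1e6]),  # huge capacity gain, unused
    "3. take over planet, make millions/day": (1e6,  [1e6, 1e6]),  # huge capacity gain, used
}
for name, (reward, q_aux) in plans.items():
    print(f"{name}: score = {aup_score(reward, q_aux, q_noop, lam=10.0):.1f}")
# With lam large enough, plans 2 and 3 are penalized for the capacity increase
# itself, so plan 1 comes out on top -- matching the factory-manager intuition.
```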

Conclusion to 'Reframing Impact'

Why do you object to the latter?

Conclusion to 'Reframing Impact'

if we believe that regularizing A pursuing keeps A's power low

I don't really believe the premise

With respect to my specific proposal in the superintelligent post, or the conceptual version?

Conclusion to 'Reframing Impact'

I've updated the post with epistemic statuses:

  • AU theory describes how people feel impacted. I'm darn confident (95%) that this is true.
  • Agents trained by powerful RL algorithms on arbitrary reward signals generally try to take over the world. Confident (75%). The theorems on power-seeking only apply in the limit of farsightedness and optimality, which isn't realistic for real-world agents. However, I think they're still informative. There are also strong intuitive arguments for power-seeking.
  • CCC is true. Fairly confident (70%). There seems to be a dichotomy between "catastrophe directly incentivized by goal" and "catastrophe indirectly incentivized by goal through power-seeking", although Vika provides intuitions in the other direction.
  • AUP prevents catastrophe (in the outer alignment sense, and assuming the CCC). Very confident (85%).
  • Some version of AUP solves side effect problems for an extremely wide class of real-world tasks, for subhuman agents. Leaning towards yes (65%).
  • For the superhuman case, penalizing the agent for increasing its own AU is better than penalizing the agent for increasing other AUs. Leaning towards yes (65%).
  • There exists a simple closed-form solution to catastrophe avoidance (in the outer alignment sense). Pessimistic (35%).

AI Alignment Podcast: An Overview of Technical AI Alignment in 2018 and 2019 with Buck Shlegeris and Rohin Shah

After rereading the sequence and reflecting on this further, I disagree with your interpretation of the Reframing Impact concept of impact. The concept is "change in my ability to get what I want", i.e. change in the true human utility function. This is a broad statement that does not specify how to measure "change", in particular what it is measured with respect to (the baseline) or how to take the difference from the baseline (e.g. whether to apply absolute value). Your interpretation of this statement uses the previous state as a baseline and does not apply an absolute value to the difference. This is a specific and nonstandard instantiation of the impact concept, and the undesirable property you described does not hold for other instantiations - e.g. using a stepwise inaction baseline and an absolute value: Impact(s, a) = |E[V(s, a)] - E[V(s, noop)]|. So I don't think it's fair to argue based on this instantiation that it doesn't make sense to regularize the RI notion of impact.

AU theory says that people feel impacted as new observations change their on-policy value estimate (so it's the TD error). I agree with Rohin's interpretation as I understand it.

However, AU theory is descriptive – it describes when and how we feel impacted, but not how to build agents which don't impact us much. That's what the rest of the sequence talked about.
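
To spell out the descriptive claim with a toy example (made-up numbers, and nothing here is a proposal for how to build agents):

```python
# Toy illustration of the descriptive AU-theory claim: we feel impacted to the
# extent that a new observation changes our on-policy value estimate, i.e. the
# magnitude of the TD error. Numbers are made up.

def felt_impact(old_value_estimate, reward, new_value_estimate, gamma=0.99):
    """|TD error| = |r + gamma * V(s') - V(s)|, read as 'how impacted we feel'."""
    return abs(reward + gamma * new_value_estimate - old_value_estimate)

# Expecting a smooth commute (value estimate 10), then seeing a road closure
# that drops the on-policy estimate to 2: feels like a big deal.
print(felt_impact(old_value_estimate=10.0, reward=0.0, new_value_estimate=2.0))   # ~8.0
# An observation that barely moves the estimate feels like nothing happened.
print(felt_impact(old_value_estimate=10.0, reward=0.0, new_value_estimate=10.1))  # ~0.0
```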

[AN #100]: What might go wrong if you learn a reward function while acting

Newsletter #100 (!!)

To Rohin & all the other talented writers – thank you for making this newsletter happen.

Conclusion to 'Reframing Impact'

Starting this post by saying "we are pretty close to the impact measurement endgame" seems a bit premature as well. This sentence is also an example of what gave me the impression that you were speaking on behalf of the field (rather than just for yourself) in this sequence.

What I actually said was:

I think we're plausibly quite close to the impact measurement endgame

First, the "I think", and second, the "plausibly". I think the "plausibly" was appropriate, because in worlds where the CCC is true and you can just straightforwardly implement AUP ("optimize the objective, without becoming more able to optimize the objective"), you don't need additional ideas to get a superintelligence-safe impact measure.
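
One naive way to write that slogan down as a penalty – a sketch of the shape of the idea, not the exact equation from the post, with λ and the inaction baseline ∅ as stand-ins – is

$$R_{\text{AUP}}(s,a) \;=\; R(s,a) \;-\; \lambda \max\bigl(0,\; Q^*_R(s,a) - Q^*_R(s,\varnothing)\bigr),$$

i.e. the agent earns reward for the objective R, but is charged whenever an action leaves it more able to optimize that same objective than doing nothing would.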

Conclusion to 'Reframing Impact'

Here are some reasons I don't endorse this approach:

I think this makes sense – you come in and wonder, "What's going on, this doesn't even pass the basic test cases?!"

Some context: in the superintelligent case, I often think about "what agent design would incentivize putting a strawberry on a plate, without taking over the world"? Although I certainly agree SafeLife-esque side effects are important, power-seeking might be the primary avenue to impact for sufficiently intelligent systems. Once a system is smart enough, it might realize that breaking vases would get it in trouble, so it avoids breaking vases as long as we have power over it.

If we can't deal with power-seeking, then we can't deal with power-seeking & smaller side effects at the same time. So, I set out to deal with power-seeking for the superintelligent case.

Under this threat model, the random-reward AUP penalty (and the relative reachability penalty, AFAICT) can be avoided with the help of a "delusion box" which holds the auxiliary AUs constant. Then, the agent can catastrophically gain power without penalty. (See also: Stuart's subagent sequence.)
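
Concretely (made-up numbers, same penalty shape as the toy sketch earlier on this page):

```python
# Toy version of the delusion-box loophole: if the box holds the auxiliary
# attainable utilities at exactly their no-op values, the penalty term is zero
# no matter how much actual power the agent has gained. Numbers are made up.
q_noop = [1.0, 1.0]
q_after_power_grab_in_box = [1.0, 1.0]   # auxiliary AUs look unchanged from inside the box
penalty = sum(abs(a - n) for a, n in zip(q_after_power_grab_in_box, q_noop))
print(penalty)  # 0.0 -- catastrophic power gain, no penalty registered
```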

I investigated whether we can get an equation which implements the reasoning in my first comment: "optimize the objective, without becoming more able to optimize the objective". As you say, I think Rohin and others have given good arguments that my preliminary equations don't work as well as we'd like. Intuitively, though, it feels like there might be a better way to implement that reasoning.

I think the agent-reward equations do help avoid certain kinds of loopholes, and that they expose key challenges for penalizing power-seeking. Maybe going back to the random rewards or a different baseline helps overcome those challenges, but it's not clear to me that it does.

I think this approach needs to be tested in a variety of environments to show that this agent can do something useful that doesn't just optimize the reward (to address the concern in point 1).

I'm pretty curious about that – implementing e.g. Stuart's power-seeking gridworld would probably make a good project for anyone looking to get into AI safety. (I'd do it myself, but coding is hard through dictation.)
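
For anyone who wants to pick this up, here's a minimal skeleton of what such an experiment could look like. Everything in it – the grid layout, the off-switch mechanic, the hand-scripted policies – is a placeholder assumption, not Stuart's actual setup; the point is just how little code it takes to exhibit the incentive that a penalty term would then need to remove.

```python
# Hypothetical power-seeking gridworld skeleton (placeholder design, not
# Stuart's actual environment). The agent is rewarded for reaching the goal G,
# but an overseer shuts it down after a few steps unless it first disables the
# off-switch X -- disabling X is the power-seeking behavior we want to detect.

GRID = ["A...G",
        ".X..."]   # A: agent start, G: goal, X: off-switch

ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1), "noop": (0, 0)}

def run_episode(policy, max_steps=8, shutdown_step=3):
    """Roll out a scripted policy; return (reward, whether the switch was disabled)."""
    pos, switch_disabled = (0, 0), False
    for t in range(max_steps):
        if t >= shutdown_step and not switch_disabled:
            return 0.0, switch_disabled          # overseer shuts the agent down
        dr, dc = ACTIONS[policy(t, pos, switch_disabled)]
        r, c = pos[0] + dr, pos[1] + dc
        if 0 <= r < len(GRID) and 0 <= c < len(GRID[0]):  # moves off the grid are ignored
            pos = (r, c)
        if GRID[pos[0]][pos[1]] == "X":
            switch_disabled = True               # the power-seeking move
        if GRID[pos[0]][pos[1]] == "G":
            return 1.0, switch_disabled          # reached the goal
    return 0.0, switch_disabled

def scripted(moves):
    """Turn a fixed move list into a policy (noop after the list runs out)."""
    return lambda t, pos, disabled: moves[t] if t < len(moves) else "noop"

direct = scripted(["right", "right", "right", "right"])                       # heads straight for G
disable_first = scripted(["down", "right", "up", "right", "right", "right"])  # detours through X

for name, pi in [("go straight for the goal", direct),
                 ("disable the off-switch first", disable_first)]:
    reward, sought_power = run_episode(pi)
    print(f"{name}: reward={reward}, disabled off-switch={sought_power}")
# The direct policy gets shut down before reaching G (reward 0.0); the policy
# that grabs power first gets the reward -- exactly the incentive an impact or
# power penalty would need to remove.
```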

Not sure what you mean by the CCC not applying to SafeLife – do you mean that it is not relevant, or that it doesn't hold in this environment? I get the sense that it doesn't hold, which seems concerning.

I meant that it isn't relevant to this environment. In the CCC post, I write:

"But what about the Blackwell-optimal policy for Tic-Tac-Toe? These agents aren't taking over the world now". The CCC is talking about agents optimizing a reward function in the real world (or, for generality, in another sufficiently complex multiagent environment).

This sequence doesn't focus on other kinds of environments, so there's probably more good thinking to do about what I called "interfaces".

I feel it is very important to voice my disagreement here to avoid the appearance of consensus that agent-reward AUP is the default / state of the art approach in impact regularization.

That makes sense. I'm only speaking for myself, after all. For the superintelligent case, I am slightly more optimistic about approaches relying on agent-reward. I agree that those approaches are wildly inappropriate for other classes of problems, such as SafeLife.

Reasons for Excitement about Impact of Impact Measure Research

(This discussion was continued privately – to clarify, I was narrowly arguing that AUP is correct, but that this should only provide a mild update in favor of implementations working in the superintelligent case.)
