David Scott Krueger

I'm more active on Twitter than LW/AF these days: https://twitter.com/DavidSKrueger

Bio from https://www.davidscottkrueger.com/:
I am an Assistant Professor at the University of Cambridge and a member of Cambridge's Computational and Biological Learning lab (CBL). My research group focuses on Deep Learning, AI Alignment, and AI safety. I’m broadly interested in work (including in areas outside of Machine Learning, e.g. AI governance) that could reduce the risk of human extinction (“x-risk”) resulting from out-of-control AI systems. Particular interests include:

  • Reward modeling and reward gaming
  • Aligning foundation models
  • Understanding learning and generalization in deep learning and foundation models, especially via “empirical theory” approaches
  • Preventing the development and deployment of socially harmful AI systems
  • Elaborating and evaluating speculative concerns about more advanced future AI systems
     


Comments

By "intend" do you mean that they sought that outcome / selected for it?  
Or merely that it was a known or predictable outcome of their behavior?

I think "unintentional" would already probably be a better term in most cases. 

Apologies, I didn't take the time to understand all of this yet, but I have a basic question you might have an answer to...

We know how to map (deterministic) policies to reward functions using the construction at the bottom of page 6 of the reward modelling agenda (https://arxiv.org/abs/1811.07871v1): the agent is rewarded only if it has so far done exactly what the policy would do.  I think of this as a wrapper function (https://en.wikipedia.org/wiki/Wrapper_function).
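
For concreteness, here's a minimal sketch of how I picture this wrapper (my own Python illustration with placeholder types, not code from the paper):

```python
# Minimal sketch (my own illustration, not from the paper): wrapping a
# deterministic policy as a history-based reward function. The reward is 1
# only if every action taken so far is exactly the action the policy would
# have chosen at that point, and 0 otherwise.

from typing import Callable, Sequence, Tuple

Observation = str  # placeholder types, purely for illustration
Action = str
Step = Tuple[Observation, Action]  # one (observation, action) pair


def reward_from_policy(
    policy: Callable[[Sequence[Step], Observation], Action]
) -> Callable[[Sequence[Step]], float]:
    """Return a reward function that 'wraps' the given deterministic policy."""

    def reward(history: Sequence[Step]) -> float:
        for t, (obs, action) in enumerate(history):
            # Compare the action actually taken with what the policy would
            # have done, given the preceding history and current observation.
            if policy(history[:t], obs) != action:
                return 0.0
        return 1.0

    return reward
```

The only overhead relative to the policy itself is the wrapper (the loop and the equality checks), which is the "minimal overhead" I mean below.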

It seems like this means that, for any policy, we can represent it as optimizing reward with only the minimal overhead in description/computational complexity of the wrapper.

So...

  • Do you think this analysis is correct?  Or what is it missing?  (Maybe the assumption that the policy is deterministic is significant?  I think this turns out to be the case for Orseau et al.'s "Agents and Devices" approach: https://arxiv.org/abs/1805.12387.)
  • Are you trying to get around this somehow?  Or are you fine with this minimal overhead being used to distinguish goal-directed from non-goal-directed policies?

"Concrete Problems in AI Safety" used this distinction to make this point, and I think it was likely a useful simplification in that context.  I generally think spelling it out is better, and I think people will pattern match your concerns onto the “the sci-fi scenario where AI spontaneously becomes conscious, goes rogue, and pursues its own goal” or "boring old robustness problems" if you don't invoke structural risk.  I think structural risk plays a crucial role in the arguments, and even if you think things that look more like pure accidents are more likely, I think the structural risk story is more plausible to more people and a sufficient cause for concern.

RE (A): A known side-effect is not an accident.


 

I agree somewhat; however, I think we need to be careful to distinguish "do unsavory things" from "cause human extinction", and we should generally be squarely focused on the latter.  The former easily becomes too political, making coordination harder.

Yes, it may be useful in some very limited contexts.  I can't recall a time I've seen it in writing and felt it was not a counter-productive framing.

AI is highly non-analogous to guns.

I really don't think the distinction is meaningful or useful in almost any situation.  I think if people want to make something like this distinction they should just be more clear about exactly what they are talking about.

This is a great post.  Thanks for writing it!  I think Figure 1 is quite compelling and thought provoking.
I began writing a response, and then realized a lot of what I wanted to say has already been said by others, so I just noted where that was the case.  I'll focus on points of disagreement.

Summary: I think the basic argument of the post is well summarized in Figure 1, and by Vanessa Kosoy’s comment.

A high-level counter-argument I didn't see others making: 

  • I wasn't entirely sure what your argument was for why long-term planning ability saturates... I've seen this argued based on both complexity and chaos, and I think here it's a bit of a mix of the two.
    • Counter-argument to chaos-argument: It seems we can make meaningful predictions of many relevant things far into the future (e.g. that the sun's remaining natural life-span is 7-8 billion years).
    • Counter-argument to complexity-argument: Increases in predictive ability can have highly non-linear returns, both in terms of planning depth and planning accuracy.  
      • Depth: You often only need to be "one step ahead" of your adversary in order to defeat them and win the whole "prize" (e.g. market or geopolitical dominance).  For example, being able to predict the weather one day further ahead could have a major impact on military strategy.
      • Accuracy: If you can make more accurate predictions about, e.g., how prices of assets will change, you can make a killing in finance (see the rough sketch after this list).
         
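
To put a toy number on the accuracy point (my own illustration with made-up parameters, not a claim from the post): if you can make many leveraged bets on your predictions, even a slight edge in accuracy compounds into an enormous difference in outcomes.

```python
# Toy illustration (my own, with made-up numbers): a small edge in prediction
# accuracy compounds into a very large difference in outcomes when many
# leveraged bets can be made on those predictions.

import math


def expected_log_growth(p_correct: float, bet_fraction: float) -> float:
    """Expected log-growth per even-money bet of a fixed fraction of wealth."""
    return (p_correct * math.log(1 + bet_fraction)
            + (1 - p_correct) * math.log(1 - bet_fraction))


def typical_wealth_multiple(n_bets: int, p_correct: float, bet_fraction: float) -> float:
    """Typical (geometric-mean) wealth multiple after n_bets, starting from 1."""
    return math.exp(n_bets * expected_log_growth(p_correct, bet_fraction))


# 50% accuracy (no edge) vs. 55% accuracy (slight edge), betting 10% each time:
print(typical_wealth_multiple(1000, 0.50, 0.10))  # ~0.0066: no edge, volatility drag loses money
print(typical_wealth_multiple(1000, 0.55, 0.10))  # ~150: a five-point edge compounds enormously
```

The point is just that returns to prediction ability are highly convex once you can act on predictions repeatedly and with leverage.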

High-level counter-arguments I would've made that Vanessa already made: 

  • This argument proves too much: it suggests that there are not major differences in ability to do long-term planning that matter.
  • Humans have not reached the limits of predictive ability


Low-level counter-arguments:

  • RE Claim 1: Why would AI only have an advantage in IQ as opposed to other forms of intelligence / cognitive skill?  No argument is provided.
  • (Argued by Jonathan Uesato) RE Claim 3: Scaling laws provide ~zero evidence that we are at the limit of “what can be achieved with a certain level of resources”.

This is a great post.  Thanks for writing it!

I agree with a lot of the counter-arguments others have mentioned.

Summary:

  • I think the basic argument of the post is well summarized in Figure 1, and by Vanessa Kosoy’s comment.

     
  • High-level counter-arguments already argued by Vanessa: 
    • This argument proves too much: it suggests that there are not major differences in ability to do long-term planning that matter.
    • Humans have not reached the limits of predictive ability


 

  • High-level counter-arguments I would add: 
    • You often only need to be one step ahead of your adversary to defeat them.
    • Prediction accuracy is not the relevant metric: an incremental increase in depth-of-planning could be decisive in conflicts (e.g. being able to predict the weather one day further ahead could have a major impact on military strategy).
      • More generally, the ability to make large / highly leveraged bets on future outcomes means that slight advantages in prediction ability could be decisive.


 

  • Low-level counter-arguments:
    • RE Claim 1: Why would AI only have an advantage in IQ as opposed to other forms of intelligence / cognitive skill?  No argument is provided.
    • (Argued by Jonathan Uesato) RE Claim 3: Scaling laws provide ~zero evidence that we are at the limit of “what can be achieved with a certain level of resources”.
    • RE Claim 5: Systems trained with short-term objectives can learn to do long-term planning competently.

This post tacitly endorses the "accident vs. misuse" dichotomy.
Every time this appears, I feel compelled to mention that I think it is a terrible framing.
I believe the large majority of AI x-risk is best understood as "structural" in nature: https://forum.effectivealtruism.org/posts/oqveRcMwRMDk6SYXM/clarifications-about-structural-risk-from-ai

 
