David Scott Krueger

I'm more active on Twitter than LW/AF these days: https://twitter.com/DavidSKrueger

Bio from https://www.davidscottkrueger.com/:
I am an Assistant Professor at the University of Cambridge and a member of Cambridge's Computational and Biological Learning lab (CBL). My research group focuses on Deep Learning, AI Alignment, and AI safety. I’m broadly interested in work (including in areas outside of Machine Learning, e.g. AI governance) that could reduce the risk of human extinction (“x-risk”) resulting from out-of-control AI systems. Particular interests include:

Reward modeling and reward gaming
Aligning foundation models
Understanding learning and generalization in deep learning and foundation models, especially via “empirical theory” approaches
Preventing the development and deployment of socially harmful AI systems
Elaborating and evaluating speculative concerns about more advanced future AI systems

Posts

3 capybaralet's Shortform

32 "Publish or Perish" (a quick note on why you should try to make your work legible to existing academic communities)

12 What organizations other than Conjecture have (esp. public) info-hazard policies?

21 A (EtA: quick) note on terminology: AI Alignment != AI x-safety

8 Why I hate the "accident vs. misuse" AI x-risk dichotomy (quick thoughts on "structural risk")

15 Quick thoughts on "scalable oversight" / "super-human feedback" research

14 Mechanistic Interpretability as Reverse Engineering (follow-up to "cars and elephants")

26 "Cars and Elephants": a handwavy argument/analogy against mechanistic interpretability

16 [An email with a bunch of links I sent an experienced ML researcher interested in learning about Alignment / x-safety.]

51 An Update on Academia vs. Industry (one year into my faculty job)

32 Causal confusion as an argument against the scaling hypothesis

Wiki Contributions

Consequentialism

(+50/-38)

Comments

Quick thoughts on "scalable oversight" / "super-human feedback" research

David Scott Krueger 1mo 10

I don't disagree... in this case you don't get agents for a long time; someone else does though.

Quick thoughts on "scalable oversight" / "super-human feedback" research

David Scott Krueger 2mo 10

I meant "other training schemes" to encompass things like scaffolding that deliberately engineers agents using LLMs as components, although I acknowledge these are not literally "training" and are more like "engineering".

Reading the ethicists 2: Hunting for AI alignment papers

David Scott Krueger 5mo 10

I would look at the main FATE conferences as well, which I view as being: FAccT, AIES, EAAMO.

How LLMs are and are not myopic

David Scott Krueger 9mo 30

"This means that the model can and will implicitly sacrifice next-token prediction accuracy for long horizon prediction accuracy."

Are you claiming this would happen even given infinite capacity?
If so, can you perhaps provide a simple+intuitive+concrete example?

What Discovering Latent Knowledge Did and Did Not Find

David Scott Krueger 9mo 10

What do you mean by "random linear probe"?

Instrumental Convergence? [Draft]

David Scott Krueger 10mo 21

"So let us specify a probability distribution over the space of all possible desires. If we accept the orthogonality thesis, we should not want this probability distribution to build in any bias towards certain kinds of desires over others. So let's spread our probabilities in such a way that we meet the following three conditions. Firstly, we don't expect Sia's desires to be better satisfied in any one world than they are in any other world. Formally, our expectation of the degree to which Sia's desires are satisfied at $W$ is equal to our expectation of the degree to which Sia's desires are satisfied at $W^{*}$, for any $W, W^{*}$. Call that common expected value '$μ$'. Secondly, our probabilities are symmetric around $μ$. That is, our probability that $W$ satisfies Sia's desires to at least degree $μ + x$ is equal to our probability that it satisfies her desires to at most degree $μ - x$. And thirdly, learning how well satisfied Sia's desires are at some worlds won't tell us how well satisfied her desires are at other worlds. That is, the degree to which her desires are satisfied at some worlds is independent of how well satisfied they are at any other worlds. (See the appendix for a more careful formulation of these assumptions.) If our probability distribution satisfies these constraints, then I'll say that Sia's desires are 'sampled randomly' from the space of all possible desires."

This is a characterization, and it remains to show that there exist distributions that fit it (I suspect there are not, assuming the sets of possible desires and worlds are unbounded).

I also find the third criterion counterintuitive. If worlds share features, I would not expect the degrees to which her desires are satisfied at those worlds to be independent.

"Publish or Perish" (a quick note on why you should try to make your work legible to existing academic communities)

David Scott Krueger 1y 12

I'm not necessarily saying people are subconsciously trying to create a moat.

I'm saying they are acting in a way that creates a moat, and that enables them to avoid competition, and that more competition would create more motivation for them to write things up for academic audiences (or even just write more clearly for non-academic audiences).

"Publish or Perish" (a quick note on why you should try to make your work legible to existing academic communities)

David Scott Krueger 1y 10

Q: "Why is that not enough?"
A: Because they are not being funded to produce the right kinds of outputs.

"Publish or Perish" (a quick note on why you should try to make your work legible to existing academic communities)

David Scott Krueger 1y 45

My point is not specific to machine learning. I'm not as familiar with other academic communities, but I think most of the time it would probably be worth engaging with them if there is somewhere your work could fit.

"Publish or Perish" (a quick note on why you should try to make your work legible to existing academic communities)

David Scott Krueger 1y 21

In my experience, people also often know their blog posts aren't very good.