David Scott Krueger

I'm more active on Twitter than LW/AF these days: https://twitter.com/DavidSKrueger

**Bio from https://www.davidscottkrueger.com/:**

I am an Assistant Professor at the University of Cambridge and a member of Cambridge's Computational and Biological Learning lab (CBL). My research group focuses on Deep Learning, AI Alignment, and AI safety. I’m broadly interested in work (including in areas outside of Machine Learning, e.g. AI governance) that could reduce the risk of human extinction (“x-risk”) resulting from out-of-control AI systems. Particular interests include:

- Reward modeling and reward gaming
- Aligning foundation models
- Understanding learning and generalization in deep learning and foundation models, especially via “empirical theory” approaches
- Preventing the development and deployment of socially harmful AI systems
- Elaborating and evaluating speculative concerns about more advanced future AI systems

10

I don't disagree... in this case *you* don't get agents for a long time; someone else does though.

10

I meant "other training schemes" to encompass things like scaffolding that deliberately engineers agents using LLMs as components, although I acknowledge that these are not literally "training" and are more like "engineering".

10

I would look at the main FATE conferences as well, which I view as being: FAccT, AIES, EAAMO.

30

> This means that the model can and will implicitly sacrifice next-token prediction accuracy for long horizon prediction accuracy.

Are you claiming this would happen even given infinite capacity?

If so, can you perhaps provide a simple+intuitive+concrete example?

10

What do you mean by "random linear probe"?

21

> So let us specify a probability distribution over the space of all possible desires. If we accept the orthogonality thesis, we should not want this probability distribution to build in any bias towards certain kinds of desires over others. So let's spread our probabilities in such a way that we meet the following three conditions. Firstly, we don't expect Sia's desires to be better satisfied in any one world than they are in any other world. Formally, our expectation of the degree to which Sia's desires are satisfied at $w$ is equal to our expectation of the degree to which Sia's desires are satisfied at $w^*$, for any $w, w^*$. Call that common expected value '$\mu$'. Secondly, our probabilities are symmetric around $\mu$. That is, our probability that $w$ satisfies Sia's desires to at least degree $\mu + x$ is equal to our probability that it satisfies her desires to at most degree $\mu - x$. And thirdly, learning how well satisfied Sia's desires are at some worlds won't tell us how well satisfied her desires are at other worlds. That is, the degree to which her desires are satisfied at some worlds is independent of how well satisfied they are at any other worlds. (See the appendix for a more careful formulation of these assumptions.) If our probability distribution satisfies these constraints, then I'll say that Sia's desires are 'sampled randomly' from the space of all possible desires.

This is a characterization, and it remains to be shown that any distribution actually satisfies it (I suspect none does, assuming the sets of possible desires and worlds are unbounded).

I also find the third criterion counterintuitive. If worlds share features, I would expect the degrees to which her desires are satisfied at those worlds not to be independent.

12

I'm not necessarily saying people are subconsciously trying to create a moat.

I'm saying they are acting in a way that creates a moat, and that enables them to avoid competition, and that more competition would create more motivation for them to write things up for academic audiences (or even just write more clearly for non-academic audiences).

10

Q: "Why is that not enough?"

A: Because they are not being funded to produce the right kinds of outputs.

45

My point is not specific to machine learning. I'm not as familiar with other academic communities, but I think most of the time it would probably be worth engaging with them if there is somewhere your work could fit.

Really interesting point!

I introduced this term in my slides that included "paperweight" as an example of an "AI system" that maximizes safety.

I sort of still think it's an OK term, but I'm sure I will keep thinking about this going forward and hope we can arrive at an even better term.