Thomas Kwa

Doing alignment research with Vivek Hebbar's team at MIRI as well as independent projects.

Sequences

Catastrophic Regressional Goodhart

Comments

Maybe the reward models are expressive enough to capture all patterns in human preferences, but it seems nice to get rid of this assumption if we can. Scaling laws suggest that larger reward models perform better (in the Gao paper there is a gap between the 3B and 6B reward models), so it seems reasonable that even the current largest reward models are not optimal.

I guess it hasn't been tested whether DPO scales better than RLHF. I don't have enough experience with these techniques to have a view on whether it does.

DPO seems like a step towards better and more fine-grained control over models than RLHF, because it removes the possibility that the reward model underfits.
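For concreteness, here is a minimal sketch of the DPO objective from Rafailov et al., written as a standalone PyTorch function; the argument names and batching conventions are my own assumptions rather than any particular codebase's. The relevant point is that the preference data trains the policy directly, so there is no separate reward network whose capacity could become the bottleneck.

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Direct Preference Optimization loss.

    Each argument is a batch of per-sequence log-probabilities log pi(y | x),
    summed over tokens, from the trained policy and the frozen reference model,
    evaluated on the preferred (chosen) and dispreferred (rejected) completions.
    """
    # Implicit reward of each completion: beta * log(pi_theta / pi_ref).
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)

    # Bradley-Terry preference likelihood: push the chosen completion's
    # implicit reward above the rejected completion's.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```

The "reward model" only appears implicitly as the log-ratio of policy to reference, so its expressivity is tied to the policy itself rather than to a separate, possibly smaller, network.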

It seems like there's some intuition underlying this post for why the wildfire spark of strategicness is possible, but there is no mechanism given. What is this mechanism, and in what toy cases do you see a wildfire of strategicness? My guess is something like

  • Suppose one part of your system contains a map from desired end-states to the actions required to achieve those ends, another part has actuators, and a third part starts acting strategically. Then the third part needs only to hook the other two parts together with its goals to become an actualizing agent (a toy sketch follows below).

This doesn't really feel like a wildfire though, so I'm curious if you have something different in mind.
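To make that toy case concrete, here is a minimal sketch (the goal and action names, and the lookup-table planner, are made up purely for illustration): none of the three parts is an agent on its own, but the "strategic" part becomes one just by routing the planner's output into the actuators.

```python
# Part 1: a map from desired end-states to the actions that achieve them.
planner = {
    "door_open": ["walk_to_door", "turn_handle", "push"],
    "lights_on": ["walk_to_switch", "flip_switch"],
}

# Part 2: actuators that can execute actions but have no goals of their own.
def actuate(action):
    print(f"executing: {action}")

# Part 3: a component with a goal. It contributes no planning or acting
# ability of its own; it only hooks the other two parts together.
def strategic_part(goal):
    for action in planner.get(goal, []):
        actuate(action)

strategic_part("door_open")  # the composite now behaves like a goal-directed agent
```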

I commented on the original post last year regarding the economics angle:

Ryan Kidd and I did an economics literature review a few weeks ago for representative agent stuff, and couldn't find any results general enough to be meaningful. We did find one paper that proved a market's utility function couldn't be of a certain restricted form, but nothing about proving the lack of a coherent utility function in general. A bounty also hasn't found any such papers.

Based on this lit review, the Wikipedia page, and ChatGPT [1], I'm 90% sure that "representative agent" in economics refers to the idea that the market's aggregate preferences resemble a typical individual's preferences, and that the general question of whether a market has any complete preference ordering does not fall within the scope of the term.

[1] GPT4 says "The representative agent is assumed to behave in a way that represents the average or typical behavior of the group in the aggregate. In macroeconomics, for instance, a representative agent might be used to describe the behavior of all households in an economy.

This modeling approach is used to reduce the complexity of economic models, making them more tractable, but it has also received criticism."

Prediction market for whether someone will strengthen our results or prove something about the nonindependent case:

https://manifold.markets/ThomasKwa/will-someone-strengthen-our-goodhar?r=VGhvbWFzS3dh

Downvoted; this is very far from a well-structured argument, and it doesn't give me intuitions I can trust either.

I'm fairly sure you can get a result something like "it's not necessary to put positive probability mass on two different functions that can't be distinguished by observing only s bits", so some functions can get zero probability, e.g. the XOR of all combinations of at least s+1 bits.

edit: The proof is easy. Let f_1, f_2 be two such indistinguishable functions that you place positive probability on, let F be a random variable for the function, and let F' be F with all the probability mass on f_2 moved to f_1. Then the observed bits X satisfy P(X | F = f_1) = P(X | F = f_2) by indistinguishability. But this means H(X | F') = H(X | F) while H(X) is unchanged, and so I(F'; X) = I(F; X). You don't lose any channel capacity switching to F'.
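Spelling out the mutual-information step (my notation: X is the random variable for the s observed bits, and P(X | f) is the fixed conditional distribution that the function f induces on them):

```latex
% Since P(X \mid f_1) = P(X \mid f_2), moving the prior mass of f_2 onto f_1
% changes neither the marginal of X nor the conditional entropy:
\begin{align*}
P_{F'}(X = x) &= \textstyle\sum_f P(F' = f)\,P(X = x \mid f)
               = \textstyle\sum_f P(F = f)\,P(X = x \mid f) = P_F(X = x),\\
H(X \mid F') &= \textstyle\sum_f P(F' = f)\,H(X \mid F = f) = H(X \mid F),\\
I(F'; X) &= H(X) - H(X \mid F') = H(X) - H(X \mid F) = I(F; X).
\end{align*}
```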

  • Deep deceptiveness is not quite self-deception. I agree that there are some circumstances where defending against self-deception favors weight-based methods, but these seem uncommon.
  • I thought briefly about the Ilharco et al paper and am very impressed by it as well.
  • Thanks for linking to the resources.

I don't have enough time to reply in depth, but the considerations in favor of weight vectors and those in favor of activation vectors both seem really complicated, and the balance still seems to favor activation vectors, though I have reasonably high uncertainty.
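For readers less familiar with the distinction, here is a minimal sketch of the two interventions (the function names, layer handle, and scaling factor are placeholders, not any specific implementation): a weight vector edits the parameters once, while an activation vector is added to a layer's output at inference time and can be removed again.

```python
import torch

# Weight-vector editing (task arithmetic in the style of Ilharco et al.):
# the task vector is the difference between finetuned and base weights, and
# adding a scaled copy of it to the base model changes behavior permanently.
def apply_weight_vector(base_model, finetuned_model, alpha=1.0):
    with torch.no_grad():
        for p_base, p_ft in zip(base_model.parameters(),
                                finetuned_model.parameters()):
            p_base += alpha * (p_ft - p_base)

# Activation-vector steering: a fixed vector is added to one layer's output
# on every forward pass; the weights stay untouched and the hook can be
# removed at any time to recover the original model.
def add_activation_vector(layer, steering_vector, alpha=1.0):
    def hook(module, inputs, output):
        return output + alpha * steering_vector
    return layer.register_forward_hook(hook)  # call .remove() on the handle to undo
```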
