All Posts


Week Of Sunday, July 5th 2020

No posts for this week
8 · Alex Turner · 5d
I think instrumental convergence also occurs in the model space for machine learning. For example, many different architectures likely learn edge detectors in order to minimize classification loss on MNIST. But wait - you'd also learn edge detectors to maximize classification loss on MNIST (loosely, getting 0% on a multiple-choice exam requires knowing all of the right answers). I bet you'd learn these features for a wide range of cost functions. I wonder if that's already been empirically investigated? And, same for adversarial features. And perhaps, same for mesa optimizers (understanding how to stop mesa optimizers from being instrumentally convergent seems closely related to solving inner alignment). What can we learn about this?
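
To make the "0% on a multiple-choice exam" point concrete, here is a toy sketch, not an experiment on MNIST. It assumes a hypothetical setup that is not from the post: three well-separated 2D clusters, a nearest-centroid scorer standing in for "learned features", and a `sign` flag that flips the objective. The same centroids serve both the accuracy-maximizing and the accuracy-minimizing predictor; only the final decision rule changes.

```python
import random

def make_data(rng, centers, n_per_class, sigma):
    """Sample labelled 2D points around each class center (toy stand-in for a dataset)."""
    data = []
    for label, (cx, cy) in enumerate(centers):
        for _ in range(n_per_class):
            data.append(((rng.gauss(cx, sigma), rng.gauss(cy, sigma)), label))
    return data

def fit_centroids(data, n_classes):
    """'Learn' one feature per class: the mean of that class's training points."""
    sums = [[0.0, 0.0, 0] for _ in range(n_classes)]
    for (x, y), label in data:
        sums[label][0] += x
        sums[label][1] += y
        sums[label][2] += 1
    return [(sx / n, sy / n) for sx, sy, n in sums]

def scores(point, centroids):
    """Class scores: negative squared distance to each centroid (higher = closer)."""
    x, y = point
    return [-((x - cx) ** 2 + (y - cy) ** 2) for cx, cy in centroids]

def accuracy(data, centroids, sign):
    """sign=+1 picks the best-scoring class; sign=-1 picks the worst-scoring class."""
    correct = 0
    for point, label in data:
        s = [sign * v for v in scores(point, centroids)]
        pred = max(range(len(s)), key=s.__getitem__)
        correct += pred == label
    return correct / len(data)

# Assumed toy configuration: cluster centers, sample counts, and noise are arbitrary.
rng = random.Random(0)
centers = [(0.0, 0.0), (5.0, 0.0), (0.0, 5.0)]
train = make_data(rng, centers, 50, 0.3)
test = make_data(rng, centers, 50, 0.3)

centroids = fit_centroids(train, len(centers))
acc_min_loss = accuracy(test, centroids, +1)  # tries to be right
acc_max_loss = accuracy(test, centroids, -1)  # tries to be wrong
print(f"trying to be right: {acc_min_loss:.2f}, trying to be wrong: {acc_max_loss:.2f}")
```

The predictor that tries to be wrong does far worse than the 1/3 accuracy of random guessing, which it can only manage by using the same class information as the predictor that tries to be right. Whether real networks trained to maximize loss converge on the same intermediate features is the open empirical question the comment raises; this sketch does not answer it.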
1 · Alex Turner · 2d
Transparency Q: how hard would it be to ensure a neural network doesn't learn any explicit NANDs?
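
The question does not define "explicit NAND". One possible reading, assumed here purely for illustration, is a single unit whose output on binary inputs matches the NAND truth table. A minimal sketch of such a unit, with hand-picked (not learned) weights and a hypothetical `nand_unit` helper:

```python
def step(z):
    """Hard threshold activation: fires when the pre-activation is positive."""
    return 1 if z > 0 else 0

def nand_unit(a, b, weights=(-1.0, -1.0), bias=1.5):
    """A single threshold neuron; these weights and bias are one choice that yields NAND."""
    return step(weights[0] * a + weights[1] * b + bias)

# Enumerate the unit's behaviour on all binary input pairs.
truth_table = {(a, b): nand_unit(a, b) for a in (0, 1) for b in (0, 1)}
for inputs, output in truth_table.items():
    print(inputs, "->", output)
```

Because NAND is functionally complete, ruling out every unit or sub-circuit that behaves like this one would be a strong constraint, and a trained network could also implement the same function in a distributed way that no single-unit check like this would detect.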
