Re: agents terminalizing instrumental values.
I anticipate there will be a hill-of-common-computations, where the x-axis is the frequency of the instrumental subgoal, and the y-axis is the extent to which the instrumental goal has been terminalized.
This is because for goals which are very high in frequency, there will be little incentive for the computations responsible for achieving that goal to have self-preserving structures. It will not make sense for them to devote optimization power towards ensuring future states still require them...
What do you mean by “surface area”?
The main one was that when I was designing experiments, I did not have a particular theory in mind about how the network was implementing a given capability. I just messed around with matrices, graphed a bunch of stuff, and multiplied a bunch of weights by a bunch of other weights. Occasionally I'd get interesting-looking pictures, but I had no clue what to do with them or what follow-up questions to ask. I think that's because I didn't have an explicit model of what the network should be doing, so I couldn't use the data I gathered about its internals to update my picture of the mechanisms it was using.
This was really, really helpful! I learned a lot about how to think through experiment design by watching you do it, and I found some possible mistakes I've been making while designing my own experiments!
My only criticism: when Copilot auto-fills in details, it would be helpful if you'd explain what it did and why it's what you wanted it to do, like you do with your own code.