All of Garrett Baker's Comments + Replies

Re: agents terminalizing instrumental values. 

I anticipate there will be a hill-of-common-computations, where the x-axis is the frequency[1] of the instrumental subgoal, and the y-axis is the extent to which the instrumental goal has been terminalized. 

This is because for goals which are very high in frequency, there will be little incentive for the computations responsible for achieving that goal to have self-preserving structures. It will not make sense for them to devote optimization power towards ensuring future states still require them...

3 Alex Turner 1mo
I don't know if I follow. I think computations terminalize themselves because it makes sense to cache them (e.g. don't always model out whether dying is a good idea, just cache that it's bad at the policy level). And isn't "balance while standing up" terminalized? Doesn't it feel wrong to fall over, even if you're on a big cushy surface? Feels like a cached computation to me. (Maybe that's "don't fall over and hurt yourself" getting cached?)
1 Neel Nanda 1mo
I use surface area as a fuzzy intuition around "having some model of what's going on, and understanding of what's happening in a problem/phenomenon". Which doesn't necessarily look like a full understanding, but looks like having a list in my head of confusing phenomena, somewhat useful ideas, and hooks into what I could investigate next. I find this model useful both to recognise 'do I have any surface area on this problem' and to motivate next steps by 'what could give me more surface area on this problem', even if it's not a perfectly robust way.

The main big one was that when I was making experiments, I did not have in mind a particular theory about how the network was doing a particular capability. I just messed around with matrices, and graphed a bunch of stuff, and multiplied a bunch of weights by a bunch of other weights. Occasionally, I'd get interesting-looking pictures, but I had no clue what to do with those pictures, or what follow-up questions I could ask, and I think it's because I didn't have an explicit model of what I thought the network should be doing, and so couldn't update my picture of the mechanisms the network was using off the data I gathered about the network's internals.

2 Neel Nanda 1mo
Makes sense, thanks! Fwiw, I think the correct takeaway is a mix of "try to form hypotheses about what's going on" and "it's much, much easier when you have at least some surface area on what's going on". There are definitely problems where you don't really know going in (e.g., I did not expect modular addition to be solved with trig identities!), and there's also the trap of being overconfident in an incorrect view. But I think the mode of iteratively making and testing hypotheses is pretty good. An alternate, valid but harder, mode is to first do some exploratory analysis where you just hit the problem with a standard toolkit and see what sticks, without any real hypothesis. And then use this raw data to go off and try to form a hypothesis about what's going on, and what to do next to test/try to break it.

This was really really helpful! I learned a lot about how to think through experiment design, watching you do it, and I found some possible-mistakes I've been making while designing my own experiments! 

My only criticism: When copilot auto-fills in details, it would be helpful if you'd explain what it did and why it's what you wanted it to do, like how you do with your own code.

1 Neel Nanda 1mo
Awesome, really appreciate the feedback! And makes sense re copilot, I'll keep that in mind in future videos :) (maybe should just turn it off?) I'd love to hear more re possible-mistakes if you're down to share!