All of Garrett Baker's Comments + Replies

I've seen this; their examples don't seem so subtle to me compared with the alternatives.

For example,


You can clearly see a cat in the center of the left image!

6Alex Turner20d
I mostly... can just barely see an ear above the train if I look, after being told to look there. I don't think it's "clear." I also note that these are black-box attacks on humans which originated from ANNs; these are transferred attacks from eg a CNN.

So: do you think that ambitious mech interp is impossible? Do you think that current interp work is going the wrong direction in terms of achieving ambitious understanding? Or do you think that it'd be not useful even if achieved?

Mostly I think that MI is right to think it can do a lot for alignment, but I suspect that lots of the best things it can do for alignment it will do in a very dual-use way, which skews heavily towards capabilities. Mostly because capabilities advances are easier and there are more people working on those.

At the same time I sus...

A couple of unconnected points:

  • This doesn't clearly follow: one way for x to be easier is [there are many ways to do x, so that it's not too hard to find one]. If it's easy to find a few ways to get x, giving me another one may not help me at all. If it's hard to find any way to do x, giving me a workable approach may be hugely helpful. (I'm not making a case one way or another on the main point - I don't know the real-world data on this, and it's also entirely possible that the bar on alignment is so high that most/all MI isn't useful for alignment)
  • I'm not entirely clear I understand you here, but if I do, my response would be: targeted approaches may be faster and cheaper at solving the problems they target. Ambitious approaches are more likely to help solve problems that you didn't know existed, and didn't realize you needed to target. If targeted approaches are being used for [demonstrate that problems of this kind are possible], I expect they are indeed faster and cheaper. If we're instead talking about being used as part of an alignment solution, targeted approaches seem likely to be ~irrelevant (of course I'd be happy if I'm wrong on this!). (again, assuming I understand how you're using 'targeted' / 'ambitious')

Interpretability seems pretty useful for alignment, but it also seems pretty dangerous for capabilities. Overall the field seems net-bad. Using an oversimplified model, my general reason for thinking this is that any given interpretability advance can be used either for capabilities or for alignment. Alignment is both harder and has fewer people working on it than improving model capabilities. Even if the marginal interpretability advance would be net good for alignment if alignment and capabilities were similar in s...

This is why I'm pessimistic about most interpretability work. It just isn't focused enough.

Most of the "exploratory" interp work you suggest is trying to achieve an ambitious mechanistic understanding of models, which requires a really high degree of model understanding in general. They're not trying to solve particular concrete problems, and it seems unfair to evaluate them according to a different theory of change. If you're going to argue against this line of work, I think you should either argue that they're failing to achieve their theory of change, or...

I don't think the conclusion follows from the premises. People often learn new concepts after studying stuff, and it seems likely (to me) that when studying human cognition, we'd first be confused because our previous concepts weren't sufficient to understand it, and then slowly stop being confused as we built & understood concepts related to the subject. If an AI's thoughts are like human thoughts, given a lot of time to understand them, what you describe doesn't rule out that the AI's thoughts would be comprehensible.

The mere existence of concepts we don't know about in a subject doesn't mean that we can't learn those concepts. Most subjects have new concepts.

2Alex Turner3mo
I agree that with time, we might be able to understand. (I meant to communicate that via "might still be incomprehensible")

Counterintuitively, it may be easier for an organization (e.g. Redwood Research) to get a $1 million grant from Open Phil than it is for an individual to get a $10k grant from LTFF. The reason is that both grants probably require a similar amount of administrative effort, and a well-known organization is probably more likely than an individual to be trusted to use the money well, so the decision is easier to make. This example illustrates how decision-making and grant-making processes are probably just as important as the total amount of money available.

...
1Stephen McAleese4mo
That seems like a better split and there are outliers of course. But I think orgs are more likely to be well-known to grant-makers on average given that they tend to have a higher research output, more marketing, and the ability to organize events. An individual is like an organization with one employee.

This seems like an underestimate because you don’t consider whether the first “AGI” will indeed make it so we only get one chance. If it can only self-improve by more gradient steps, then humanity has a greater chance than if it self-improves by prompt engineering or direct modification of its weights or latent states. Shard theory seems to have nonzero opinions on the fruitfulness of the non-data methods.

2Alex Turner8mo
What does self-improvement via gradients vs prompt-engineering vs direct mods have to do with how many chances we get? I guess, we have at least a modicum more control over the gradient feedback loop, than over the other loops?  Can you say more?

I think this type of criticism is applicable in an even wider range of fields than you might immediately imagine (though in varying degrees, and with greater or lesser obviousness or direct correspondence to the SGD case). Some examples:

  • Despite the economists, the economy doesn't try to maximize welfare, or even net dollar-equivalent wealth. It rewards firms which are able to make a profit in proportion to how much they're able to make a profit, and dis-rewards firms which aren't able to make a profit. Firms which are technically profitable, but have no

...
3Alex Turner10mo
very pithy. nice insight, thanks. 

This is true, but it implies that we should find deception at a radically different stage of training than we would if deception were an intrinsic value. It also possibly expands the kinds of reinforcement schedules we may want to use compared to the worlds where deception crops up at the earliest opportunity (though pseudo-deception may occur, where behaviors correlated with successful deception are possibly reinforced?).

2Alex Turner1y
Oh, huh, I had cached the impression that deception would be derived, not intrinsic-value status. Interesting.

John usually does not make his plans with an eye toward making things easier. His plan previously involved values because he thought they were strictly harder than corrigibility. If you solve values, you solve corrigibility. Similarly, if you solve abstraction, you solve interpretability, shard theory, value alignment, corrigibility, etc.

I don’t know all the details of John’s model here, but it may go something like this: If you solve corrigibility, and then find out corrigibility isn’t sufficient for alignment, you may expect your corrigible agent to help you build your value aligned agent.

2Alex Turner1y
In what way do you think solving abstraction would solve shard theory?

I think the pointer “the thing I would do if I wanted to make a second AI that would be the best one I could make at my given intelligence” is what is being updated in favor of, since this does feel like a natural abstraction, given how many agents would think this (also seems very similar to the golden rule. “I will do what I would want a successor AI to do if the successor AI was actually the human’s successor AI”. or “treat others (the human) how I’d like to be treated (by a successor AI), (and abstracting one meta-level upwards)”). Whether this turns out to be value learning or something else 🤷. This seems a different question from whether or not it is indeed a natural abstraction.

1Charlie Steiner1y
Interesting. What is it that potentially makes "treat the human like I would like to be treated if I had their values" easier than "treat the human like they would like to be treated"?

Seems possibly relevant & optimistic when seeing deception as a value. It has the form ‘if about to tell human statement with properties x, y, z, don’t’ too.

2Alex Turner1y
It can still be robustly derived as an instrumental subgoal during general-planning/problem-solving, though?

Re: agents terminalizing instrumental values. 

I anticipate there will be a hill-of-common-computations, where the x-axis is the frequency[1] of the instrumental subgoal, and the y-axis is the extent to which the instrumental goal has been terminalized. 

This is because for goals which are very high in frequency, there will be little incentive for the computations responsible for achieving that goal to have self-preserving structures. It will not make sense for them to devote optimization power towards ensuring future states still require them...

3Alex Turner1y
I don't know if I follow, I think computations terminalize themselves because it makes sense to cache them (e.g. don't always model out whether dying is a good idea, just cache that it's bad at the policy-level).  & Isn't "balance while standing up" terminalized? Doesn't it feel wrong to fall over, even if you're on a big cushy surface? Feels like a cached computation to me. (Maybe that's "don't fall over and hurt yourself" getting cached?)
1Neel Nanda1y
I use surface area as a fuzzy intuition around "having some model of what's going on, and understanding of what's happening in a problem/phenomenon". Which doesn't necessarily look like a full understanding, but looks like having a list in my head of confusing phenomena, somewhat useful ideas, and hooks into what I could investigate next. I find this model useful both to recognise 'do I have any surface area on this problem' and to motivate next steps by 'what could give me more surface area on this problem', even if it's not a perfectly robust way.

The main big one was that when I was making experiments, I did not have in mind a particular theory about how the network was doing a particular capability. I just messed around with matrices, and graphed a bunch of stuff, and multiplied a bunch of weights by a bunch of other weights. Occasionally, I'd get interesting-looking pictures, but I had no clue what to do with those pictures, or what follow-up questions I could ask, and I think it's because I didn't have an explicit model of what I thought the network should be doing, and so couldn't update my picture of the mechanisms the network was using based on the data I gathered about the network's internals.

2Neel Nanda1y
Makes sense, thanks! Fwiw, I think the correct takeaway is a mix of "try to form hypotheses about what's going on" and "it's much, much easier when you have at least some surface area on what's going on". There are definitely problems where you don't really know going in (eg, I did not expect modular addition to be solved with trig identities!), and there's also the trap of being overconfident in an incorrect view. But I think the mode of iteratively making and testing hypotheses is pretty good. An alternate, valid but harder, mode is to first do some exploratory analysis where you just hit the problem with a standard toolkit and see what sticks, without any real hypothesis. And then use this raw data to go off and try to form a hypothesis about what's going on, and what to do next to test/try to break it.

This was really really helpful! I learned a lot about how to think through experiment design, watching you do it, and I found some possible-mistakes I've been making while designing my own experiments! 

My only criticism: When Copilot auto-fills in details, it would be helpful if you'd explain what it did and why it's what you wanted it to do, like you do with your own code.

1Neel Nanda1y
Awesome, really appreciate the feedback! And makes sense re copilot, I'll keep that in mind in future videos :) (maybe should just turn it off?) I'd love to hear more re possible-mistakes if you're down to share!