All of zhukeepa's Comments + Replies

I just want to share another reason I find this n=1 anecdote so interesting -- I have a highly speculative inside view that the abstract concept of self provides a cognitive affordance for intertemporal coordination, resulting in a phase transition in agentiness only known to be accessible to humans.

Hmm, I'm not sure I understand what point you think I was trying to make. The only case I was trying to make here was that much of our subjective experience which may appear uniquely human might stem from our language abilities, which seems consistent with Helen Keller undergoing a phase transition in her subjective experience upon learning a single abstract concept. I'm not getting what age has to do with this.

Questions #2 and #3 seem positively correlated – if the thing that humans have is important, it's evidence that architectural changes matter a lot.

Not necessarily. For example, it may be that language ability is very important, but that most of the heavy lifting in our language ability comes from general learning abilities + having a culture that gives us good training data for learning language, rather than from architectural changes.

I remembered reading about this a while back and updating on it, but I'd forgotten about it. I definitely think this is relevant, so I'm glad you mentioned it -- thanks!

I think this explanation makes sense, but it raises the further question of why we don't see other animal species with partial language competency. There may be an anthropic explanation here - i.e. that once one species gets a small amount of language ability, they always quickly master language and become the dominant species. But this seems unlikely: e.g. most birds have such severe brain size limitations that, while they could probably have 1% of human language, I doubt they could become dominant in anywhere near the same way we did.

Can you elabora...

1Richard Ngo4y
A couple of intuitions:

  • Koko the gorilla had partial language competency.
  • The ability to create and understand combinatorially many sentences - not necessarily with fully recursive structure, though. For example, if there's a finite number of sentence templates, and then the animal can substitute arbitrary nouns and verbs into them (including novel ones).
  • The sort of things I imagine animals with partial language saying are:
      • There's a lion behind that tree.
      • Eat the green berries, not the red berries.
      • I'll mate with you if you bring me a rabbit.

"Once one species gets a small amount of language ability, they always quickly master language and become the dominant species" - this seems clearly false to me, because most species just don't have the potential to quickly become dominant. E.g. birds, small mammals, reptiles, short-lived species.
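To make the "finite number of sentence templates" intuition concrete, here is a minimal Python sketch. The templates, vocabulary, and function name are all hypothetical and chosen only for illustration; nothing here is a claim about actual animal communication. It just shows that a fixed set of templates with substitutable slots yields a sentence count that grows multiplicatively with vocabulary, without any recursive structure.

```python
from itertools import product

# Hypothetical toy templates and vocabulary (assumptions for illustration).
TEMPLATES = [
    "There's a {noun} behind that {noun2}.",
    "{verb} the {noun}, not the {noun2}.",
]
NOUNS = ["lion", "tree", "berry", "rabbit"]
VERBS = ["Eat", "Chase"]


def generate(templates, nouns, verbs):
    """Fill every template with every combination of slot values.

    Slots a template does not use are ignored by str.format, so
    duplicates are collapsed by collecting the results in a set.
    """
    sentences = set()
    for template in templates:
        for noun, noun2, verb in product(nouns, nouns, verbs):
            sentences.add(template.format(noun=noun, noun2=noun2, verb=verb))
    return sentences


base = generate(TEMPLATES, NOUNS, VERBS)
# One novel noun is immediately usable in every slot of every template.
extended = generate(TEMPLATES, NOUNS + ["snake"], VERBS)

print(len(base))      # 48  (4*4 for the first template, 4*4*2 for the second)
print(len(extended))  # 75  (5*5 + 5*5*2)
```

Two templates and six words already give 48 distinct sentences, and a single new noun raises that to 75, which is the sense in which template substitution is "combinatorial" without being recursive.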
This seems like a false dichotomy. We shouldn't think of scaling up as "free" from a complexity perspective - usually when scaling up, you need to make quite a few changes just to keep individual components working. This happens in software all the time: in general it's nontrivial to roll out the same service to 1000x users.

I agree. But I also think there's an important sense in which this additional complexity is mundane -- if the only sorts of differences between a mouse brain and a human brain were the sorts of differences invol...

1Richard Ngo4y
I think whether the additional complexity is mundane or not depends on how you're producing the agent. Humans can scale up human-designed engineering products fairly easily, because we have a high-level understanding of how the components all fit together. But if you have a big neural net whose internal composition is mostly determined by the optimiser, then it's much less clear to me.

There are some scaling operations which are conceptually very easy for humans, but hard to do via gradient descent. As a simple example, in a big neural network where the left half is doing subcomputation X and the right half is doing subcomputation Y, it'd be very laborious for the optimiser to swap it so the left half is doing Y and the right half is doing X - since the optimiser can only change the network gradually, and after each gradient update the whole thing needs to still work. This may be true even if swapping X and Y is a crucial step towards scaling up the whole system, which will later allow much better performance.

In other words, we're biased towards thinking that scaling is "mundane" because human-designed systems scale easily (and to some extent, because evolution-designed systems also scale easily). It's not clear that AIs also have this property; there's a whole lot of retraining involved in going from a small network to a bigger network (and in fact usually the bigger network is trained from scratch rather than starting from a scaled-up version of the small one).
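The X/Y swap example can be illustrated with a toy sketch, assuming a hypothetical two-hidden-unit ReLU network (the weights and helper names below are invented for illustration). Swapping the two hidden units gives a network that computes exactly the same function, yet the halfway point on the straight line between the two weight settings computes something else entirely. Linear interpolation is only one possible path, not the one gradient descent would necessarily take, so this shows the flavour of the difficulty rather than proving it.

```python
def relu(v):
    return max(0.0, v)


def forward(w1, w2, x):
    """Tiny network: hidden = relu(w1 @ x), output = w2 . hidden."""
    hidden = [relu(sum(w * xi for w, xi in zip(row, x))) for row in w1]
    return sum(w * h for w, h in zip(w2, hidden))


def swap_units(w1, w2):
    """Exchange the two hidden units (rows of w1, entries of w2)."""
    return [w1[1], w1[0]], [w2[1], w2[0]]


def lerp_vec(a, b, t):
    return [(1 - t) * ai + t * bi for ai, bi in zip(a, b)]


def lerp_net(net_a, net_b, t):
    """Straight-line interpolation between two weight settings."""
    w1 = [lerp_vec(ra, rb, t) for ra, rb in zip(net_a[0], net_b[0])]
    w2 = lerp_vec(net_a[1], net_b[1], t)
    return w1, w2


# Hypothetical weights: unit 0 reads x[0] ("X"), unit 1 reads x[1] ("Y").
W1 = [[1.0, 0.0], [0.0, 1.0]]
W2 = [1.0, -1.0]            # output = relu(x0) - relu(x1)

original = (W1, W2)
swapped = swap_units(W1, W2)
midpoint = lerp_net(original, swapped, 0.5)

x = [2.0, 1.0]
print(forward(*original, x))  # 1.0
print(forward(*swapped, x))   # 1.0  (same function, units exchanged)
print(forward(*midpoint, x))  # 0.0  (output weights cancel halfway)
```

A human can perform the swap in one step because they know the two units are interchangeable; an optimiser restricted to small updates that each preserve performance has no such shortcut along this path.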

That's one of the "unique intellectual superpowers" that I think language confers on us:

On a species level, our mastery of language enables intricate insights to accumulate over generations with high fidelity. Our ability to stand on the shoulders of giants is unique among animals, which is why our culture is unrivaled in its richness and sophistication.

(I do think it helps to explicitly name our ability to learn culture as something that sets us apart, and wish I'd made that more front-and-center.)

I'm still confused about how each of the approaches would prevent us from eventually creating agents that spend 99% of their cognition acting corrigibly, while spending a well-hidden 1% of their cognition trying to sniff out whether they're in the test distribution, and executing a treacherous turn if so. The way I understand your summaries:

  • Adversarial training makes it harder for the malignant 1% to do anything dangerous, since it makes it harder to tell whether it's in the test distribution. But it does little to prevent a malignant 1% from ve...
4Paul Christiano5y
The question is roughly: can you be good enough at finding problematic inputs that, if a problematic input exists, it is much more likely to be found during the adversarial training step than at deployment time? If so, then a model with high reward probably won't do anything bad at deployment time (exactly how good you need to be depends on exactly how overwhelmingly strong the prior for malign behavior is).

This likely involves introducing some asymmetry between the adversary's task and the test time task. Examples of possible asymmetries include allowing the adversary to produce different kinds of inputs (e.g. setting activations arbitrarily), letting the adversary poke around inside the model to find bad cases (e.g. doing gradient descent, combining with interpretability, or backwards chaining through the model), or giving the adversary a much weaker job (e.g. only asking them to produce an input on which the model thinks anything remotely suspicious as judged by interpretability, or finding any conditions under which the model fails to instantly shut down a malicious scheme that it previously initiated).

(I don't know what "prevent" means here.) I agree that if your interpretability doesn't allow you to detect the agent thinking "Am I on the training distribution?" then it won't work. So then the question is: can interpretability be strong enough that if the model is thinking something catastrophically dangerous, you have a very good chance of noticing it? (Again, how good you need to be depends on how overwhelming is the prior for bad behavior.)

I don't think this fact makes bad behavior unlikely on its own. But interpretability or adversarial training could exploit the fact that your model is trained by gradient descent. Regularization can strengthen the basic observation that a malign failure needs to use parts of the model that are exercised during training, making it easier to identify possible malign failures by examining behavior on the training d

I'm currently intuiting that there's a broad basin of "seeming corrigible until you can perform a treacherous turn", but not a broad basin of true corrigibility. If the overseer can robustly detect incorrigible cognition in the distilled agent, I think things may be fine. But if e.g. the overseer is only inspecting the distilled agent's justifications for its behavior, and something like its verbal loop, I don't see how things can work out. 

Here are my assumptions underlying this intuition: 

1. Past a certain capabilities level...

But if e.g. the overseer is only inspecting the distilled agent's justifications for its behavior, and something like its verbal loop, I don't see how things can work out.

You can imagine the overseer as inspecting the agent's actions, and probing the agent's behavior in hypothetical situations. The overseer only "looks inside" the agent's head as a way to help evaluate behavior or identify possibly problematic situations (and there is not necessarily any explicit looking, it could be a weight/activation sharing scheme wh...

I really like that list of points! Not that I'm Rob, but I'd mentally classified each of those as alignment failures, and the concern I was trying to articulate was that, by default, I'd expect an AI trying to do the right thing will make something like one of these mistakes. Those are good examples of the sorts of things I'd be scared of if I had a well-intentioned non-neurotypical assistant. Those are also what I was referring to when I talked about "black swans" popping up. And when I said:

2. Corrigibility depends critica...
3Paul Christiano6y
I don't think that "ability to figure out what is right" is captured by "metaphilosophical competence." That's one relevant ability, but there are many others: philosophical competence, understanding humans, historical knowledge, physics expertise...

OK, but that can mostly be done based on simple arguments about irreversibility and resource consumption. It doesn't take much philosophical competence, or aesthetic sense, to notice that making a binding agreement that constrains all of your future behavior ever is a big deal, even if it would take incredible sophistication to figure out exactly which deals are good. Ditto for the other items on my list except possibly acausal trade that goes off the table based on crossing some capability threshold, but practically even that is more like a slow-burning problem than a catastrophe.

I feel like you are envisioning an AI which is really smart in some ways and implausibly dumb in others. I agree that we need to understand something about the kind of errors that our AI will make, in order to understand whether it is safe. But in order to talk about how important that problem is (and how much of a focus it should be relative to what I'm calling "alignment") we need to actually talk about how easy or hard those errors are. In many of the cases you are describing the AI systems involved seem even dumber than existing ML (e.g. they are predicting the answer to "which of these cases would a human consider potentially catastrophic" even worse than an existing ML system would).

Using Scott Garrabrant's terminology, I think that we should basically start by trying to get robustness to scaling up, then once we understand what's needed for that try to get robustness to relative scale, then once we understand what's needed for that we should aim for robustness to scaling down. I expect robustness to scaling down to be the easiest of these, and it's definitely the easiest to get empirical feedback about. It's also the one for which w

I thought more about my own uncertainty about corrigibility, and I've fleshed out some intuitions on it. I'm intentionally keeping this a high-level sketch, because this whole framing might not make sense, and even if it does, I only want to expound on the portions that seem most objectionable.

Suppose we have an agent A optimizing for some values V. I'll call an AI system S high-impact calibrated with respect to A if, when A would consider an action "high-impact" with respect to V, S will correctly classify it as high-impact with ...

2Paul Christiano6y
I disagree with 2, 4, 5 and the conclusion, though it might depend on how you are defining terms. On 2, if there are morally important decisions you don't recognize as morally important (e.g. massive mindcrime), you might destroy value by making the wrong decision and not realizing the VOI, but that's not behaving incorrigibly. On 4, that's one reason but not the only reason you could robustly generalize. On 5 I don't understand what you mean or why that might be true. I don't really understand what you mean by black swans (or the direct relevance to corrigibility).