I just want to share another reason I find this n=1 anecdote so interesting -- I have a highly speculative inside view that the abstract concept of self provides a cognitive affordance for intertemporal coordination, resulting in a phase transition in agentiness only known to be accessible to humans.
Hmm, I'm not sure I understand what point you think I was trying to make. The only case I was trying to make here was that much of our subjective experience which may appear uniquely human might stem from our langauge abilites, which seems consistent with Helen Keller undergoing a phase transition in her subjective experience upon learning a single abstract concept. I'm not getting what age has to do with this.
Questions #2 and #3 seem positively correlated – if the thing that humans have is important, it's evidence that architectural changes matter a lot.
Not necessarily. For example, it may be that language ability is very important, but that most of the heavy lifting in our language ability comes from general learning abilities + having a culture that gives us good training data for learning language, rather than from architectural changes.
I remembered reading about this a while back and updating on it, but I'd forgotten about it. I definitely think this is relevant, so I'm glad you mentioned it -- thanks!
I think this explanation makes sense, but it raises the further question of why we don't see other animal species with partial language competency. There may be an anthropic explanation here - i.e. that once one species gets a small amount of language ability, they always quickly master language and become the dominant species. But this seems unlikely: e.g. most birds have such severe brain size limitations that, while they could probably have 1% of human language, I doubt they could become dominant in anywhere near the same way we did.
Can you elabora... (read more)
This seems like a false dichotomy. We shouldn't think of scaling up as "free" from a complexity perspective - usually when scaling up, you need to make quite a few changes just to keep individual components working. This happens in software all the time: in general it's nontrivial to roll out the same service to 1000x users.
I agree. But I also think there's an important sense in which this additional complexity is mundane -- if the only sorts of differences between a mouse brain and a human brain were the sorts of differences invol... (read more)
That's one of the "unique intellectual superpowers" that I think language confers us:
On a species level, our mastery of language enables intricate insights to accumulate over generations with high fidelity. Our ability to stand on the shoulders of giants is unique among animals, which is why our culture is unrivaled in its richness in sophistication.
(I do think it helps to explicitly name our ability to learn culture as something that sets us apart, and wish I'd made that more front-and-center.)
I'm still confused about how each of the approaches would prevent us from eventually creating agents that spend 99% of their cognition acting corrigibly, while spending a well-hidden 1% of its cognition trying to sniff out whether it's in the test distribution, and executing a treacherous turn if so. The way I understand your summaries:
I'm currently intuiting that there's a broad basin of "seeming corrigible until you can perform a treacherous turn", but not a broad basin of true corrigibility. If the overseer can robustly detect incorrigible cognition in the distilled agent, I think things may be fine. But if e.g. the overseer is only inspecting the distilled agent's justifications for its behavior, and something like its verbal loop, I don't see how things can work out.
Here are my assumptions underlying this intuition:
1. Past a certain capabilities level... (read more)
But if e.g. the overseer is only inspecting the distilled agent's justifications for its behavior, and something like its verbal loop, I don't see how things can work out.
You can imagine the overseer as inspecting the agent's actions, and probing the agent's behavior in hypothetical situations. The overseer only "looks inside" the agent's head as a way to help evaluate behavior or identify possibly problematic situations (and there is not necessarily any explicit looking, it could be a weight/activation sharing scheme wh... (read more)
I really like that list of points! Not that I'm Rob, but I'd mentally classified each of those as alignment failures, and the concern I was trying to articulate was that, by default, I'd expect an AI trying to do the right thing will make something like one of these mistakes. Those are good examples of the sorts of things I'd be scared of if I had a well-intentioned non-neurotypical assistant. Those are also what I was referring to when I talked about "black swans" popping up. And when I said:
2. Corrigibility depends critica
I thought more about my own uncertainty about corrigibility, and I've fleshed out some intuitions on it. I'm intentionally keeping this a high-level sketch, because this whole framing might not make sense, and even if it does, I only want to expound on the portions that seem most objectionable.
Suppose we have an agent A optimizing for some values V. I'll call an AI system S high-impact calibrated with respect to A if, when A would consider an action "high-impact" with respect to V, S will correctly classify it as high-impact with ... (read more)