Daniel Kokotajlo

Philosophy PhD student, worked at AI Impacts, then Center on Long-Term Risk, now OpenAI Futures/Governance team. Views are my own & do not represent those of my employer. I subscribe to Crocker's Rules and am especially interested to hear unsolicited constructive criticism. http://sl4.org/crocker.html


Agency: What it is and why it matters
AI Timelines
Takeoff and Takeover in the Past and Future



OK, guess I'll go read those posts then...

I think I agree with everything you said yet still feel confused. My question/objection/issue was not so much "How do you explain people sometimes falling victim to plans which spuriously appeal to their value shards!?!? Checkmate!" but rather "what does it mean for an appeal to be spurious? What is the difference between just thinking long and hard about what to do vs. adversarially selecting a plan that'll appeal to you? Isn't the former going to in effect basically equal the latter, thanks to extremal Goodhart? In the limit where you consider all possible plans (maximum optimization power), aren't they the same?"
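A toy sketch of the selection effect behind this question. It is not from the post: the Gaussian plan values, the Gaussian evaluation noise, and the function name `selection_gap` are all assumptions made for illustration. It scores each candidate plan with a noisy evaluator, picks the top-scoring plan, and measures how far the winning score overshoots that plan's true value as the number of plans considered grows. Strictly speaking, with Gaussian noise this shows the optimizer's curse (regressional Goodhart) rather than extremal Goodhart, but it captures the worry that searching harder also selects harder for evaluation error.

```python
import random
import statistics


def selection_gap(num_plans, trials=1000, noise_sd=1.0, seed=0):
    """Average (evaluated score - true value) of the top-scoring plan.

    Hypothetical toy model: each plan's true value is drawn from N(0, 1)
    and the evaluator sees that value plus independent N(0, noise_sd) noise.
    """
    rng = random.Random(seed)
    gaps = []
    for _ in range(trials):
        best_score = float("-inf")
        best_true = 0.0
        for _ in range(num_plans):
            true_value = rng.gauss(0.0, 1.0)
            score = true_value + rng.gauss(0.0, noise_sd)
            if score > best_score:
                best_score, best_true = score, true_value
        # How much the winning plan's score overstates its true value.
        gaps.append(best_score - best_true)
    return statistics.mean(gaps)


if __name__ == "__main__":
    for n in (1, 10, 100):
        print(n, round(selection_gap(n), 2))
```

Under these assumptions the gap is near zero when only one plan is considered and grows as more plans are searched. Whether real plan evaluation behaves like independent noise, rather than something a reflective agent can anticipate and correct for, is exactly what the question leaves open.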

Consider yourself pinged! No rush to reply though.

1 is evidentially supported by the only known examples of general intelligences, but also AI will not have the same inductive biases. So the picture might be more complicated. I’d guess shard theory is still appropriate, but that's ultimately a question for empirical work (with interpretability).

Shard theory seems more evidentially supported than bag-o-heuristics theory and rational agent theory, but that's a pretty low bar! I expect a new theory to come along which is as much of an improvement over shard theory as shard theory is over those.

Re the 5 open questions: Yeah 4 and 5 seem like the hard ones to me.

Anyhow, in conclusion, nice work & I look forward to reading future developments. (Now I'll go read the other comments)

Great post! I think it's very good for alignment researchers to be this concrete about their plans; it helps enormously in a bunch of ways, e.g. for evaluating the plan.

Comments as I go along:

Why wouldn't the agent want to just find an adversarial input to its diamond abstraction, which makes it activate unusually strongly? (I think that agents might accidentally do this a bit for optimizer's curse reasons, but not that strongly. More in an upcoming post.)

Consider why you wouldn't do this for "hanging out with friends." Consider the expected consequences of the plan "find an adversarial input to my own evaluation procedure such that I find a plan which future-me maximally evaluates as letting me 'hang out with my friends'." I currently predict that such a plan would lead future-me to daydream and not actually hang out with my friends, as present-me evaluates the abstract expected consequences of that plan. My friend-shard doesn't like that plan, because I'm not hanging out with my friends. So I don't search for an adversarial input. I infer that I don't want to find those inputs because I don't expect those inputs to lead me to actually hang out with my friends a lot as I presently evaluate the abstract-plan consequences.

I don't think an agent can consider searching for adversarial inputs to its shards without also being reflective, at which point the agent realizes the plan is dumb as evaluated by the current shards assessing the predicted plan-consequences provided by the reflective world-model. 

How is the bolded sentence different from the following:

"Consider the expected consequences of the plan "think a lot longer and harder, considering a lot more possibilities for what you should do, and then make your decision." I currently predict that such a plan would lead future-me to waste his life doing philosophy or maybe get pascal's mugged by some longtermist AI bullshit instead of actually helping people with his donations. My helping-people shard doesn't like this plan, because it predicts abstractly that thinking a lot more will not result in helping people more."

(Basically I'm saying you should think more, and then write more, about the difference between these two cases because they seem plausibly on a spectrum to me, and this should make us nervous in a couple of ways. Are we actually being really stupid by being EAs and shutting up and calculating? Have we basically adversarial-exampled ourselves away from doing things that we actually thought were altruistic and effective back in the day? If not, what's different about the kind of extended search process we did, from the logical extension of that which is to do an even more extended search process, a sufficiently extreme search process that outsiders would call the result an adversarial example?)

In particular, even though online self-supervised learning continues to develop the world model and create more advanced concepts, the reward events also keep pinging the invocation of the diamond-abstraction as responsible for reward (because insofar as the agent's diamond-shard guides its decisions, then the diamond-shard's diamond-abstraction is in fact responsible for the agent getting reward). The diamond-abstraction gradient starves the AI from exclusively acting on the basis of possible advanced "alien" abstractions which would otherwise have replaced the diamond abstraction. The diamond shard already gets reward effectively, integrating with the rest of the agent's world model and recurrent state, and therefore provides "job security" for the diamond-abstraction. (And once the agent is smart enough, it will want to preserve its diamond abstraction, insofar as that is necessary for the agent to keep achieving its current goals which involve prototypical-diamonds.)

Are you sure that's how it works? It seems plausible to me, but I'm a bit nervous; I think it could totally turn out to not work like that. (That is, it could turn out that the agent wanting to preserve its diamond abstraction is the only thing that halts the march towards more and more alien-yet-effective abstractions.)

Suppose the AI keeps training, but by instrumental convergence, seeking power remains a good idea, and such decisions continually get strengthened. This strengthens the power-seeking shard relative to other shards. Other shards want to prevent this from happening.

You go on to talk about shards eventually values-handshaking with each other. While I agree that shard theory is a big improvement over the models that came before it (which I call the rational agent model and the bag o' heuristics model), I think shard theory currently has a big hole in the middle that mirrors the hole between bag o' heuristics and rational agents. Namely, shard theory currently basically seems to be saying "At first, you get very simple shards, like the following examples: IF diamond-nearby THEN goto diamond. Then, eventually, you have a bunch of competing shards that are best modelled as rational agents; they have beliefs and desires of their own, and even negotiate with each other!" My response is "But what happens in the middle? Seems super important! Also, haven't you just reproduced the problem but inside the head?" (The problem being, when modelling AGI we always understood that it would start out being just a crappy bag of heuristics and end up a scary rational agent, but what happens in between was a big and important mystery. Shard theory boldly strides into that dark spot in our model... and then reproduces it in miniature! Progress, I guess.)


I'm not sure about that actually. Hard takeoff and soft takeoff disagree about the rate of slope change, not about the absolute height of the line. I guess if you are thinking about the "soft takeoff means shorter timelines" argument, then yeah, it also means higher AI progress prior to takeoff, and in particular predicts more stuff happening now. But people generally agree that despite that effect, the overall correlation between short timelines and fast takeoff is positive.

Anyhow, even if you are right, I definitely think the evidence is pretty weak. Both sides make pretty much the exact same retrodictions and were in fact equally unsurprised by the last few years. I agree that Yudkowsky deserves spanking for not working harder to make concrete predictions/bets with Paul, but he did work somewhat hard, and also it's not like Paul, Ajeya, etc. are going around sticking their necks out much either. Finding concrete stuff to bet on (amongst this group of elite futurists) is hard. I speak from experience here: I've talked with Paul and Ajeya and tried to find things in the next 5 years we disagree on, and it's not easy, EVEN THOUGH I HAVE 5-YEAR TIMELINES. We probably spent about an hour on it. I agree we should do it more.

(Think about you vs. me. We both thought in detail about what our median futures look like. They were pretty similar, especially in the next 5 years!)

Thanks for this! I think it is a well-written and important critique. I don't agree with it, but unfortunately I am not sure how to respond. Basically you are taking a possibility--that there is some special sauce architecture in the brain that is outside the space of current algorithms & that we don't know how to find via evolution because it's complex enough that if we just try to imitate evolution we'll probably mess up and draw our search space to exclude it, or make the search space too big and never find it even with 10^44 flops--and saying "this feels like 50% likely to me" and Ajeya is like "no no it feels like 10% to me" and I'm like "I'm being generous by giving it even 5%, I don't see how you could look at the history of AI progress so far & what we know about the brain and still take this hypothesis seriously." But it's just coming down to different intuitions/priors. (Would you agree with this characterization?)

The only viable counterargument I've heard to this is that the government can be competent at X while being incompetent at Y, even if X is objectively harder than Y. The government is weird like that. It's big and diverse and crazy. Thus, the conclusion goes, we should still have some hope (10%?) that we can get the government to behave sanely on the topic of AGI risk, especially with warning shots, even though it behaved incompetently on the topic of bio risk despite warning shots.

Or, to put it more succinctly: The COVID situation is just one example; it's not overwhelmingly strong evidence.

(This counterargument is a lot more convincing to the extent that people can point to examples of governments behaving sanely on topics that seem harder than COVID. Maybe Y2K? Maybe banning bioweapons? Idk, I'd be interested to see research on this: what are the top three examples we can find, as measured by a combination of similarity-to-AGI-risk and competence-of-government-response.)

OK, thanks! Continuity does seem appealing to me but it seems negotiable; if you can find an even more threat-resistant bargaining solution (or an equally threat-resistant one that has some other nice property) I'd prefer it to this one even if it lacked continuity.

Awesome work! 

Misc. thoughts and questions as I go along:

1. Why is Continuity appealing/important again?

2. In the Destruction Game, does everyone get the ability to destroy arbitrary amounts of utility, or is how much utility they are able to destroy part of the setup of the game, such that you can have games where e.g. one player gets a powerful button and another player gets a weak one?
