TL;DR: This post is about the value of recreating a “caring drive” similar to the one some animals have, and why it might be useful for the AI alignment field in general. Finding and understanding the right combination of training data, loss function, architecture, etc. that allows gradient descent to robustly find or create agents that care about other agents with different goals could be very useful for understanding the bigger problem. While this drive is neither perfect nor universally present, if we can understand, replicate, and modify this behavior in AI systems, it could provide a hint toward an alignment solution where the AGI “cares” for humans.
Disclaimers: I’m not saying that “we can raise AI like a child to make it friendly” or that “people are aligned to evolution”. I find both of these claims to be obvious errors. Also, I will write a lot about evolution as if it were some agentic entity that “will do this or that”, not because I think it’s agentic, but because it’s easier to write this way. I think that GPT-4 has some form of world model, and I will refer to it a couple of times.
I think that part of one of the possible “alignment solutions” will look like the right set of training data plus training loss that allows gradient descent to robustly find something like a “caring drive” that we can then study, recreate and repurpose for ourselves. And I think we already have some rare examples of this in nature. Some animals, especially humans, will kind-of-align themselves to their presumed offspring. They will want to make their offspring’s life easier and better, to the best of their capabilities and knowledge. Not because they are “aligned to evolution” and want to increase the frequency of their genes, but because of some strange internal drive created by evolution. A set of triggers tuned by evolution and activated by events associated with birth will awaken the mechanism. It will re-aim the more powerful mother agent to be aligned to the less powerful baby agent, and it just so happens that the babies give the right cues and are nearby when the mechanism does its work. We will call the more powerful initial agent, which changes its behavior and tries to protect and help its offspring, the “mother”, and the less powerful and helpless agent the “baby”. Of course the mechanism isn’t ideal, but it works well enough, even in the modern world, far outside the initial evolutionary environment. And I’m not talking only about humans: stray urban animals that live in our cities still adapt their “caring procedures” to this completely new environment, without several rounds of evolutionary pressure. If we can understand how to make this mechanism for something like a “cat-level” AI, by finding it via gradient descent and then rebuilding it from scratch, maybe we will gain some insights into the bigger problem.
What do I mean by “caring drive”? Animals, including humans, have a lot of competing motivations, or “want drives”: they want to eat, sleep, have sex, etc. It seems that the same applies to caring about babies, but this is a much more complicated set of behaviors. You need to: correctly identify your baby; track its position; protect it from outside dangers; protect it from itself, by predicting its actions in advance to stop it from certain injury; try to understand its needs in order to fulfill them correctly, since you don’t have direct access to its internal thoughts; etc. Compared to “wanting to sleep if active too long” or “wanting to eat when blood sugar level is low”, I would confidently say that it’s a much more complex “want drive”. And you have no idea about the “spreading the genes” part. You just “want a lot of good things to happen” to your baby for some strange reason. I’m not yet sure, but this complex nature could be the reason why there is an attraction basin for a more “general” and “robust” solution. Just as an LLM will find some general form of an “addition” algorithm instead of trying to memorize the examples seen so far, especially if it will not see them again too often. I think that instead of hardcoding a bunch of brittle, optimized caring procedures, evolution repeatedly finds a way to make mothers “love” their babies, outsourcing a ton of work to the mothers, especially if the situations where care is needed aren’t too similar to each other.
And all of it is a consequence of a blind hill-climbing algorithm. That’s why I think that we might have a chance of recreating something similar with gradient descent. The trick is to find the right conditions that will repeatedly allow gradient descent to find the same caring-drive structure, find similarities, understand the mechanism, recreate it from scratch to avoid hidden internal motivations, repurpose it for humans, and we are done! Sounds easy (it’s not).
A lot of the time, it’s much more efficient for animals to just make more babies, but sometimes parents must provide some care, simply because that was the path evolution found that works. And even if parents care about some of their offspring, they may choose one and leave the others to die, again because they don’t have a lot of resources to spare, and evolution will tune this mechanism to favor the most promising offspring if that is more efficient. And not all animals could become such caring parents: you can’t really care for and protect something else if you are too dumb, for example. So there are also some capability requirements for animals to even have a chance of obtaining such an adaptation. I expect the same capability requirements for AI systems. If we want to recreate the drive, we will need to try it with some advanced systems; otherwise I don’t see how it might work at all.
This is obvious: there is nothing surprising in “if you damage it, it could break”, and this will apply to any solution to some degree. It shouldn’t be surprising that drug-abusing or severely ill parents will often fail to care about their child at all. However, if we succeed at building an aligned AGI that is stable enough for some initial takeoff time, then at some point the problem of protecting it from damage should no longer be ours to worry about. But we still need to ensure initial stability.
Evolution has naturally tuned this mechanism for optimal resource allocation, which sometimes means shutting down care when resources need to be diverted elsewhere. Evolution is ruthless because of limited resources, and will eradicate not only genetic lines that care too little, but also the ones that care too much. We obviously don’t need that part. And a lot of the time a parent can just give up on a baby and instead try to make a new one if the situation is too dire, which we also don’t want to happen to us. This means that we need to understand how the mechanism works, to be able to construct it in the way we want.
Of course, there are exceptions: all people are different, and we can’t afford to clone some “proven to be a loving mother” woman hundreds of times to see if the underlying mechanism triggers reliably in all environments. But it seems to work in general, and more than that: it continues to work reliably even with our current technologies, in our crazy world, far away from the initial evolutionary environment. And we didn’t have to live through waves of birth declines and rises as evolution tried to adapt us to the new realities, tuning the brains of each new generation of mothers to find the ones that would start to care about their babies in the new agricultural, industrial or information era.
For all I know, it is possible to imagine an alternative human civilization without any parental care, in which case our closest candidate for such behavior would be some other intelligent species: intelligent enough to be able to care in theory, and forced by their weak bodies to do so in order to have any descendants at all. Maybe it would be some mammals, birds, or whatever; it doesn’t matter. The point I’m making here is that I don’t think it is some anthropic trap to search for inspiration or hints in our own behavior. It just so happens that we are smart but have weak babies that require a lot of attention, so we received this mechanism from evolution as the “simplest solution”. Evolution doesn’t need to search for more compact brains that would allow for a longer pregnancy, or hardwire even more knowledge into infants’ brains, if it can outsource a lot of stuff to the smart parents; it just needs to add the “caring drive” and it will work fine. We want AI to care about us not because we care about our children and want the same from AI. We just don’t want to die, and we would want AI to care about us even if we ourselves lacked this ability.
I’m not saying that it’s a go-to solution that we can just copy, but in my view it is a step in the right direction. Replicating similar behavior and studying its parts could be a promising direction. There are a few points that might make this whole approach useless, for example:
Overall I’m pretty sure that this does in fact work, certainly well enough to be a viable research direction.
I have no concrete plan. I have a few ideas, but I’m not sure how practically feasible they are. And since, as far as I know, nobody knows how this mechanism works, it’s hard to imagine having a concrete blueprint for creating one. So the best I can give is: we try to create something that looks right from the outside and see if there is anything interesting on the inside. I also have some ideas about “what paths could or couldn’t lead to the interesting insides”.

First of all, I think this “caring drive” couldn’t run without some internal world model. Something like: it’s hard to imagine far-generalized goals without some far-generalized capabilities. And a world model could be obtained from a highly diverse, non-repetitive dataset, which forces the model to actually “understand” something and stop memorizing.

Maybe you can set up an environment with multiple agents, similar to DeepMind’s here (https://www.deepmind.com/blog/generally-capable-agents-emerge-from-open-ended-play), initially reward an agent for surviving by itself, and then introduce a new type of task: a baby-agent that will appear near the original mother-agent (we will call it “birth”), and from that point in time the whole reward will come purely from how long the baby survives? The baby will initially have weaker mechanical capabilities, like speed, health, etc., and then “grow” to be more capable by itself? I’m not sure what the “brain” of the baby-agent should be: another NN, or maybe the same one that was found by training the mother-agent? Maybe we could create a chain of agents: agent 1 at some point gives birth to agent 2 and receives a reward for each tick that agent 2 is alive, agent 2 itself gives birth to agent 3 and receives a reward for each tick that agent 3 is alive, and so on. Maybe it will produce something interesting? Obviously the “alive time” is a proxy, and given enough optimization power we should expect Goodhart horrors beyond our comprehension.
But the idea is that maybe there is some “simple” solution that will be found first, which we can study. Recreating a product of evolution without using its immense “computational power” could be very tricky.
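The reward scheme described above can be written down as a small sketch. This is only an illustration under my own assumptions, not a tested training setup: the `Agent` record, the function names, the reward size of 1.0 per tick, and the rule that the last agent in a chain falls back to a self-survival reward are all hypothetical choices that the text does not specify.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Agent:
    """Hypothetical agent record; only the fields the reward rule needs."""
    alive: bool = True
    has_given_birth: bool = False


def mother_reward(mother: Agent, baby_alive: bool) -> float:
    """Per-tick reward for the mother-agent.

    Before "birth" the mother is paid for her own survival; after it,
    the whole reward comes from the baby being alive.
    """
    if not mother.has_given_birth:
        return 1.0 if mother.alive else 0.0
    return 1.0 if baby_alive else 0.0


def chain_rewards(alive: List[bool]) -> List[float]:
    """Per-tick rewards for a chain of agents.

    `alive[i]` says whether agent i is alive this tick. Agent i is paid
    while agent i+1 is alive. Assumption: the last agent has no
    descendant yet, so it is paid for its own survival instead.
    """
    rewards = []
    for i, is_alive in enumerate(alive):
        if i + 1 < len(alive):
            rewards.append(1.0 if alive[i + 1] else 0.0)
        else:
            rewards.append(1.0 if is_alive else 0.0)
    return rewards
```

Note that after birth this rule pays the mother nothing for her own survival, which is exactly the kind of proxy that invites the Goodhart problems mentioned above.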
But if it seems to work, and the mother-agent behaves in a seemingly “caring way”, then we can try to apply interpretability tools, try to change the original environment drastically to see how far the behavior generalizes, try to break something and see how well it still works, or manually override some parameters and study the change. However, I’m not qualified to make this happen alone, so if you find this idea interesting, contact me; maybe we can do this project together.
Let’s imagine that we’ve got some agent that can behave with care toward the right type of “babies”. For some as yet unknown reason, from the outside it behaves as if it cares about its baby-agent, and it finds creative ways to do so in new contexts. Now the actual work begins: we need to understand where the parts that make this possible are located, what the underlying mechanism is, which parts are crucial and what happens when you break them, how we can rewrite the “baby” target so that our agent will care about different baby-agents, and under what conditions gradient descent will find an automatic off switch (I expect this to be related to the chance of obtaining another baby; given only one baby per “life”, gradient descent will never find the switch, since it would have no use). Then we can actually start to think about recreating it from scratch, just like what people did with modular addition: https://www.alignmentforum.org/posts/N6WM6hs7RQMKDhYjB/a-mechanistic-interpretability-analysis-of-grokking. Except this time we don’t know how the algorithm could work or what it would look like. But the “intentions”, “motivations” and “goals” of potential AI systems are not magic; we should be able to reverse-engineer and recreate them.
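As a toy illustration of the “which parts are crucial and what happens when you break them” step, here is a minimal ablation sketch. Everything in it is an assumption made for illustration: the agent is reduced to a flat list of parameters, and `toy_care_score` is a made-up stand-in for whatever behavioral measure of “caring” one would actually compute in the environment. For a real network you would ablate units, heads or circuits rather than single numbers.

```python
from typing import Callable, List, Tuple


def ablation_ranking(
    params: List[float],
    care_score: Callable[[List[float]], float],
) -> List[Tuple[int, float]]:
    """Zero out one parameter at a time and rank parameters by score drop."""
    baseline = care_score(params)
    drops = []
    for i in range(len(params)):
        ablated = list(params)  # copy, so the original parameters stay intact
        ablated[i] = 0.0
        drops.append((i, baseline - care_score(ablated)))
    # Largest drop first: these are the parts the behavior depends on most.
    return sorted(drops, key=lambda pair: pair[1], reverse=True)


def toy_care_score(p: List[float]) -> float:
    """Hypothetical behavior score in which only parameters 0 and 2 matter."""
    return 2.0 * p[0] + 0.5 * p[2]


ranking = ablation_ranking([1.0, 1.0, 1.0], toy_care_score)
# ranking == [(0, 2.0), (2, 0.5), (1, 0.0)]
```

Single-parameter ablation is the crudest possible version of this idea; it misses parts that only matter in combination, which is one reason the real analysis is much harder than this sketch suggests.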
Suppose your AI includes an LLM. You can just prompt it with "You love, and want to protect, all humans in the same way as a parent loves their children." Congratulations: you just transferred the entire complex behavior pattern. Now all you need to do is tune it.