Does this imply that AGI is not as likely to emerge from language models as might have been thought? To me it looks like it's saying that the only way to get enough data would be to have the AI actively interacting in the world - getting data itself.
The principles from the post can still be applied. Some humans do end up aligned to animals - particularly vegans (such as myself!). How does that happen? Empirically, there are examples of general intelligences with at least some tendency to terminally value entities massively less powerful than themselves; we should be analyzing how that comes about.
Also, remember that the problem is not to align an entire civilization of naturally evolved organisms to weaker entities. The problem is to align exactly one entirely artificial organism to weaker entities. That is much simpler, and, as mentioned, entirely possible just by figuring out how already-existing people of that sort end up that way. But your use of "we" here seems to imply that you think the entirety of human civilization is the thing we ought to be using as inspiration for the AGI, which is not the case.
By the way: at least part of the explanation for why I personally am aligned to animals is that I have a strong tendency to be moved by the Care/Harm moral foundation - see this summary of The Righteous Mind for more details. It is unclear exactly how this foundation is implemented in the brain, but it is suspected to be a generalization of the very old instincts that cause mothers to care about the safety and health of their children. I have regularly told people that I perceive animals as identical in moral relevance to human children, which suggests that some kind of parental instinct is at work in the intuitions that make me care about their welfare. Even carnists feel this way about their pets, hence calling themselves e.g. "cat moms". So, the main question here for alignment is: how can we reverse-engineer parental instincts?
Ah, sorry, I misunderstood you.
Your human flourishing example sounds like it wouldn't generalize well. As the AI's capabilities grow, it would take more and more work for humans to analyze its plans and determine how much flourishing is in them, and if it grows more intelligent after we deploy it, we will have no way to determine whether its thought assessor generalizes wrongly. This is, I would think, a rather basic and obvious flaw in relying on any part of the world model directly.
As for how to code that stuff, well, I'll figure out how to do that after we've all figured out how to mathematically specify those things. :P
The problem is that the steering subsystem does not have a world model and can't directly refer to anything in a learned world model. Insofar as we want to design a steering subsystem to serve a particular goal, then, we have to design it in such a way that it doesn't need any particular learned world model at all in order to recognize which behaviors move it toward versus away from that goal.
Example: "am I eating sugar? if so, reward!" is a good steering mechanism, since a presumably simple algorithm in the brainstem can recognize whether sugar is being eaten and correct the thought assessors appropriately. But "is this increasing human flourishing? if so, reward!" is not, as I have no idea how to pick out what in the learned world model of the AGI corresponds to "human flourishing".
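The contrast can be sketched in a few lines (all names here are hypothetical, a toy illustration rather than any real architecture): the sugar detector needs only a raw, hardwired sensor channel, while a "flourishing" reward has no known implementation because there is no ground-truth pointer into a learned world model.

```python
# Toy contrast between the two kinds of reward signal (hypothetical names).

def sugar_reward(sensor_readings: dict) -> float:
    """Brainstem-style detector: reads only a fixed, hardwired sensor
    channel, with no reference to any learned world model."""
    return 1.0 if sensor_readings.get("sweet_receptor", 0.0) > 0.5 else 0.0

def flourishing_reward(world_model) -> float:
    """The problematic case: we cannot locate the concept
    'human flourishing' inside a learned world model."""
    raise NotImplementedError("no known pointer into a learned world model")

print(sugar_reward({"sweet_receptor": 0.9}))  # -> 1.0
print(sugar_reward({}))                       # -> 0.0
```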
But if we can mathematically define agency, consciousness, etc., then it might be possible to build a cascade of steering mechanisms in the "brain stem" that makes the AGI tend to pay attention to things that might be agents, try to determine how conscious they are, try to determine what they want, and take actions that give them what they want. If it can learn in real time how best to do each of those things, we don't have to worry what its world model actually looks like, as it will never contradict the definitions of those important concepts that we hardcoded for it. Does that make sense, and if so, am I missing anything important?
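The cascade I have in mind might look roughly like this (a speculative sketch; every function below is a placeholder standing in for a hardcoded mathematical definition we do not yet have):

```python
# Hypothetical cascade of hardwired steering checks. The real versions
# of these placeholders would be fixed definitions, not learned ones.

def detect_possible_agents(observations):
    # Placeholder "agency" test: anything flagged as self-moving.
    return [o for o in observations if o.get("self_moving")]

def estimate_consciousness(entity):
    # Placeholder: a given score standing in for a real definition.
    return entity.get("consciousness", 0.0)

def infer_preferences(entity):
    # Placeholder: assume the entity's wants are directly readable.
    return entity.get("wants", [])

def care_cascade(observations):
    """Attend to possible agents, weight them by consciousness, and
    collect goals to pursue on their behalf."""
    goals = []
    for candidate in detect_possible_agents(observations):
        weight = estimate_consciousness(candidate)
        for want in infer_preferences(candidate):
            goals.append((weight, want))
    return sorted(goals, reverse=True)  # most-conscious agents first

obs = [
    {"self_moving": True, "consciousness": 0.9, "wants": ["safety"]},
    {"self_moving": False},  # a rock: never enters the cascade
    {"self_moving": True, "consciousness": 0.3, "wants": ["food"]},
]
print(care_cascade(obs))  # -> [(0.9, 'safety'), (0.3, 'food')]
```

The point of the structure is that the world model only supplies the observations; the cascade's own logic never depends on what that model looks like internally.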
Hmm. I'm not sure if I believe that. But I get what you mean. To me, English language sentences seem like they rely for their meaning on the life experience of English speakers and have far more complexity than they appear to have. Example: try to rigorously define "woman" in a way every English speaker would agree on. It's very hard if not impossible.
As a result, I prefer trying to think of utility functions that at least in principle can be made mathematically rigorous. I think my example is actually far simpler than "maximize human flourishing", in other words. And I really don't want a difference in interpretation of words to lead to misalignment. But perhaps I misunderstand you and you have some notion that there's a way around that problem?
Re the human flourishing example - it seems to me that a better choice of thought assessor / ultimate value is "Does this tend to increase the total subjective utility (weighted by amount of consciousness) of all sentient beings?" It's simple, relies on probably-natural abstractions (utility, consciousness, sentience, agency), does not rely on arbitrary, hard-to-define categories like what exactly a "human" is, and I think most human morals (at least of the second-order want-to-want kind) fall straight out of it.
Defining the utility function of an arbitrary agent is an issue, of course, but if an entity does not have coherent desires, its subagents could perhaps be factored in, each with moral relevance equal to that of the whole being multiplied by the "proportion" of the mind that subagent controls. But perhaps this is just CEV again. Actually, given that animals don't particularly care about (or know about) the concept of uplifting and yet I consider it a moral imperative, I must actually want CEV after all. Heh.
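The aggregation rule above can be made concrete as a toy formalization (hypothetical numbers throughout; the hard measurement problems - scoring consciousness, eliciting utilities, carving minds into subagents - are simply assumed away as given inputs):

```python
# Toy consciousness-weighted utility sum, with subagents weighted by
# the share of the mind they control. All numbers are hypothetical.

def moral_weight(consciousness: float, mind_share: float = 1.0) -> float:
    """A (sub)agent's weight: the whole being's consciousness times
    the proportion of the mind that subagent controls."""
    return consciousness * mind_share

def total_weighted_utility(beings) -> float:
    """Sum each (sub)agent's subjective utility times its moral weight.
    `beings` is a list of (consciousness, mind_share, utility) triples."""
    return sum(moral_weight(c, share) * u for c, share, u in beings)

beings = [
    (1.0, 1.0, 0.8),   # a coherent agent: full mind share
    (0.6, 0.7, 0.5),   # subagent A of a less-conscious, incoherent mind
    (0.6, 0.3, -0.2),  # subagent B of the same mind, currently frustrated
]
print(round(total_weighted_utility(beings), 3))
```

Note that a coherent agent is just the degenerate case where one "subagent" controls 100% of the mind.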
There are some potential failure modes here, of course. For instance, the AGI may come to believe in agents that do not really exist - humans do this all the time, and take them into account in moral calculations: spirits, for instance! (Of course, spirits do exist, but only as self-replicating [via proselytization etc.] subagents in human brains, not as external entities with consciousness of their own, and they have minimal moral relevance.) But it is probably possible to constrain moral consideration to entities with a known, bounded physical location (allowing for whatever notion of "bounding" is needed to locate a highly dispersed digital entity in space...), or some such thing.
Ultimately though, this is just a special case of the social instincts thing. I would just want it to be hardwired to feel things like lovingkindness, compassion, and sympathetic joy for all sentient beings, not just humans. A bodhisattva, in other words. :)
What I'm expecting, if LLMs remain in the lead, is that we end up in a magical, spirit-haunted world where narrative causality starts to actually work, and trope-aware people essentially become magicians who can trick the world-sovereign AIs into treating them like protagonists and bending reality to suit them. Which would be cool as fuck, but also very chaotic. That may actually be the best-case alignment scenario right now, and I think there's a case for alignment-interested people who can't do research themselves but who have writing talent to write a LOT of fictional stories about AGIs that end up kind and benevolent, empower people in exactly this way, etc., to help stack the narrative-logic deck.