But then, "the self-alignment problem" would likewise make it sound like it's about how you need to align yourself with yourself. And while it is the case that increased self-alignment is generally very good and that not being self-aligned causes problems for the person in question, that's not actually the problem the post is talking about.
I don't know how you would describe "true niceness", but I think it's neither of the above.
Agreed. I think "true niceness" is something like: act to satisfy people's preferences, while also taking into account that people often have a preference for their preferences to keep evolving, and for any of their preferences that contradict each other in painful ways to get resolved.
Niceness is natural for agents of similar strengths because lots of values point towards the same "nice" behavior. But when you're much more powerful than anyone else, the target becomes much smaller, right?
Depends on the specifics, I think.
As an intuition pump, imagine the kindest, wisest person that you know. Suppose that that person was somehow boosted into a superintelligence and became the most powerful entity in the world.
Now, it's certainly possible that for any human in that situation, it would be inevitable for evolutionary drives optimized for exploiting power to kick in and corrupt them... but let's further suppose that the process of turning them into a superintelligence also somehow removed those drives, and instead made the person experience a permanent state of love towards everybody.
I think it's at least plausible that the person would then continue to exhibit "true niceness" towards everyone, despite being that much more powerful than anyone else.
So at least if the agent had started out at a similar power level as everyone else - or if it at least simulates the kinds of agents that did - it might retain that motivation when boosted to a higher level of power.
Do you have reasons to expect "slight RL on niceness" to give you "true niceness" as opposed to a kind of pseudo-niceness?
I don't have a strong reason to expect that it'd happen automatically, but if people are thinking about the best ways to actually make the AI have "true niceness", then possibly! That's my hope, at least.
I would be scared of an AI that had been trained to be nice if there was no way to tell whether, once it got more powerful, it would try to modify people's preferences or try to prevent people's preferences from changing.
Me too!
Great post!
When LLMs first appeared, people realised that you could ask them queries — for example, if you sent GPT-4 the prompt
I'm very confused by the frequent use of "GPT-4", and am failing to figure out whether this is actually meant to read GPT-2 or GPT-3, whether there's some narrative device where this is a post written at some future date when GPT-4 has actually been released (but that wouldn't match "when LLMs first appeared"), or what's going on.
Thanks, this seems like a nice breakdown of issues!
If you have more thoughts on how to do this, I’m interested to hear them. You write that PF has a “simple/short/natural algorithmic description”, and I guess that seems possible, but I’m mainly skeptical that the source code will have a slot where we can input this algorithmic description. Maybe the difference is that you’re imagining that people are going to hand-write source code that has a labeled “this is an empathetic simulation” variable, and a “my preferences are being satisfied” variable? Because I don’t expect either of those to happen (well, at least not the former, and/or not directly). Things can emerge inside a trained model instead of being in the source code, and if so, then finding them is tricky.
So I don't think that there's going to be hand-written source code with slots for inserting variables. When I say that I expect it to have a "natural" algorithmic description, I mean natural in a sense that's something like "the kinds of internal features that LLMs end up developing in order to predict text, because those are natural internal representations to develop when you're being trained to predict text, even though no human ever hand-coded them or even knew what they would be before inspecting the LLM internals after the fact".
Phrased differently, the claim might be something like "I expect that if we develop more advanced AI systems that are trained to predict human behavior and to act in ways that they predict will please humans, then there is a combination of cognitive architecture (in the sense that "transformer-based LLMs" are a "cognitive architecture") and reward function that will naturally end up learning to do PF, because that's the kind of thing that actually does let you best predict and fulfill human preferences".
The intuition comes from something like... looking at LLMs, it seems like language was in some sense "easy" or "natural" - just throw enough training data at a large enough transformer-based model, and a surprisingly sophisticated understanding of language emerges. One that probably ~nobody would have expected just five years ago. In retrospect, maybe this shouldn't have been too surprising - maybe we should expect most cognitive capabilities to be relatively easy/natural to develop, and that's exactly the reason why evolution managed to find them.
If that's the case, then it might be reasonable to suppose that PF could be the same kind of easy/natural thing, in which case it's that naturalness which allowed evolution to develop social animals in the first place. And if most cognition runs on prediction, then maybe the naturalness comes from there being only relatively small tweaks to the reward function needed to take you from predicting & optimizing your own well-being to also predicting & optimizing the well-being of others.
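To make the "relatively small tweaks" point a bit more concrete, here's a minimal toy sketch - just my own illustration, with all the names (State, predict_wellbeing, reward_self, reward_social, weight) made up for the example. The idea is that the same learned well-being predictor gets reused, and the only difference between a self-regarding and an other-regarding reward is whose predicted well-being gets summed in:

```python
# Toy illustration of the "small tweak" intuition; all names are hypothetical placeholders.
from typing import Callable, Iterable


class State:
    """Placeholder for whatever world-state representation the agent uses."""


# A learned model that predicts how well a given agent ("self", "alice", ...) is doing in a state.
WellbeingModel = Callable[[str, State], float]


def reward_self(state: State, predict_wellbeing: WellbeingModel) -> float:
    # Baseline reward: predict and optimize only your own well-being.
    return predict_wellbeing("self", state)


def reward_social(
    state: State,
    predict_wellbeing: WellbeingModel,
    others: Iterable[str],
    weight: float = 1.0,
) -> float:
    # The "small tweak": reuse the same predictive machinery, but also add in the
    # predicted well-being of others, scaled by some weight.
    return predict_wellbeing("self", state) + weight * sum(
        predict_wellbeing(other, state) for other in others
    )
```

Of course, the actual claim is about what a training process might stumble into, not about anyone literally hand-writing such a reward function.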
If you ask me what exactly that combination of cognitive architecture and reward function is... I don't know. Hopefully, e.g. your research might one day tell us. :-) The intent of the post is less "here's the solution" and more "maybe this kind of thing might hold the solution, maybe we should try looking in this direction".
- 2. Will installing a PF motivation into an AGI be straightforward in the future “by default” because capabilities research will teach us more about AGI than we know today, and/or because future AGIs will know more about the world than AIs today?
I say “no” to both. For the first one, I really don’t think capabilities research is going to help with this, for reasons here. For the second one, you write in OP that even infants can have a PF motivation, which seems to suggest that the problem should be solvable independent of the AGI understanding the world well, right?
I read your linked comment as an argument for why social instincts are probably not going to contribute to capabilities - but I don't think that establishes the opposite direction: might capabilities be necessary for social instincts, or might capabilities research contribute to social instincts?
If my model above is right - that there's a relatively natural representation of PF which will emerge in any AI system trained to predict and try to fulfill human preferences - then that kind of representation should emerge from capabilities researchers trying to train AIs to better fulfill our preferences.
- 3. Is figuring out how to install a PF motivation a good idea?
I say “yes”.
You're probably unsurprised to hear that I agree. :-)
- 4. Independently of which is a better idea, is the technical problem of installing a PF motivation easier, harder, or the same difficulty as the technical problem of installing a “human flourishing” motivation?
What do you have in mind with a "human flourishing" motivation?
A sufficiently advanced AGI familiar with humans will have a clear concept of “not killing everyone” (or more specifically, “what humans mean when they say the words ‘not killing everyone’”). We just add a bit of an extra component that makes the AGI intrinsically value that concept. This implies that capabilities progress may be closely linked to alignment progress.
Some major differences off the top of my head:
An observation: it feels slightly stressful to have posted this. I have a mental simulation telling me that there are social forces around here that consider it morally wrong or an act of defection to suggest that alignment might be relatively easy, like it implied that I wasn't taking the topic seriously enough or something. I don't know how accurate that is, but that's the vibe that my simulators are (maybe mistakenly) picking up.
Forget about what the social consensus is. If you have technical understanding of current AIs, do you truly believe there are any major obstacles left? The kind of problems that AGI companies could reliably not tear down with their resources? If you do, state so in the comments, but please do not state what those obstacles are.
I guess the reasoning behind the "do not state" request is something like "making potential AGI developers more aware of those obstacles is going to direct more resources into solving them". But if someone is trying to create AGI, aren't they going to run into those obstacles anyway, and become aware of them regardless?
At least ChatGPT seems to have a longer context window, with this experiment suggesting 8192 tokens.
And yeah, I do think the thing I am most worried about with Chat-GPT in addition to just shortening timelines is increasing the number of actors in the space, which also has indirect effects on timelines. A world where both Microsoft and Google are doubling down on AI is probably also a world where AI regulation has a much harder time taking off.
Maybe - but Microsoft and Google are huge organizations, and huge organizations have an incentive to push for regulation that imposes costs that they can pay while disproportionately hampering smaller competitors. It seems plausible to me that both M & G might prefer a regulatory scheme that overall slows down progress while cementing their dominance, since that would be a pretty standard regulatory-capture-driven-by-the-dominant-actors-in-the-field kind of scenario.
A sudden wave of destabilizing AI breakthroughs - with DALL-E/Midjourney/Stable Diffusion suddenly disrupting art and ChatGPT who-knows-how-many-things - can also make people on the street concerned, making them both more supportive of AI regulation in general and more inclined to take AGI scenarios seriously in particular. I recently saw a blog post from someone speculating that this might cause a wide variety of actors - M & G included - who want to slow down AI progress to join forces in pushing for widespread regulation.
Some notable signatories that I noticed: Geoffrey Hinton, Yoshua Bengio, Demis Hassabis (DeepMind CEO), Sam Altman (OpenAI CEO), Dario Amodei (Anthropic CEO), Stuart Russell, Peter Norvig, Eric Horvitz (Chief Scientific Officer at Microsoft), David Chalmers, Daniel Dennett, Bruce Schneier, Andy Clark (the guy who wrote Surfing Uncertainty), Emad Mostaque (Stability AI CEO), Lex Fridman, Sam Harris.
Edited to add: a more detailed listing from this post: