On your first point, I do think people have thought about this before and determined it doesn't work. But from the post:
If it turns out to be currently too hard to understand the aligned protein computers, then I want to keep coming back to the problem with each major new insight I gain. When I learned about scaling laws, I should have rethought my picture of human value formation—Did the new insight knock anything loose? I should have checked back in when I heard about mesa optimizers, about the Bitter Lesson, about the feature universality hypothesis for neural networks, about natural abstractions.
Humans do display many many alignment properties, and unlocking that mechanistic understanding is 1,000x more informative than other methods. Though this may not be worth arguing until you read the actual posts showing the mechanistic understandings (the genome post and future ones), and we could argue about specifics then?
If you're convinced by them, then you'll understand the reaction of "Fuck, we've been wasting so much time and studying humans makes so much sense" which is described in this post (e.g. Turntrout's idea on corrigibility and statement "I wrote this post as someone who previously needed to read it."). My point here is that arguing "you should feel this way now, before being convinced of specific mechanistic understandings" doesn't make sense.
Secondly, I think with some of the examples you mention, we do have the core idea of how to robustly handle them. E.g. valuing real-world objects and avoiding wireheading seems to almost come "for free" with model-based agents.
Link? I don't think we know how to use model-based agents to e.g. tile the world in diamonds even given unlimited compute, but I'm open to being wrong.
Oh, you're stating potential mechanisms for human alignment w/ humans that you don't think will generalize to AGI. It would be better for me to provide an informative mechanism that might seem to generalize.
Turntrout's other post claims that the genome likely doesn't directly specify rewards for everything humans end up valuing. People's specific families aren't encoded as circuits in the limbic system, yet downstream of the crude reward system, many people end up valuing their families. There are more details to dig into here, but already it implies that work towards specifying rewards more exactly is not as useful as understanding how crude rewards lead to downstream values.
A related point: humans don't maximize the reward specified by their limbic system, but can instead be modeled as a system of inner-optimizers that value proxies instead (e.g. most people wouldn't push a wirehead button if it killed a loved one). This implies that inner-optimizers that are not optimizing the base objective are good, meaning that inner-alignment & outer-alignment are not the right terms to use.
There are other mechanisms, and I believe it's imperative to dig deeper into them, develop a better theory of how learning systems grow values, and test that theory out on other learning systems.
To add, Turntrout does state:
In an upcoming post, I’ll discuss one particularly rich vein of evidence provided by humans.
so the doc Ulisse provided is a decent write-up about just that, but there are more official posts intended to be published.
Ah, yes I recognized I was replying to only an example you gave, and decided to post a separate comment on the more general point:)
There are other mechanisms which influence other things, but I wouldn't necessarily trust them to generalize either.
Could you elaborate?
I believe the diamond example is true, but not the best example to use. I bet it was mentioned because of the arbital article linked in the post.
The premise isn't dependent on diamonds being terminal goals; it could easily be about valuing real life people or dogs or nature or real life anything. Writing an unbounded program that values real world objects is an open problem in alignment; yet humans are a bounded program that values real world objects all of the time, millions of times a day.
The post argues that focusing on the causal explanations behind humans growing values is way more informative than other sources of information, because humans exist in reality and anchoring your thoughts to reality is more informative about reality.
There are many alignment properties that humans exhibit such as valuing real world objects, being corrigible, not wireheading if given the chance, not suffering ontological crises, and caring about sentient life (not everyone has these values of course). I believe the post's point that studying the mechanisms behind these value formations is more informative than other sources of info. Looking at the post:
the inner workings of those generally intelligent apes is invaluable evidence about the mechanistic within-lifetime process by which those apes form their values, and, more generally, about how intelligent minds can form values at all.
Humans can provide a massive amount of info on how highly intelligent systems value things in the real world. There are guaranteed-to-exist mechanisms behind why humans value real world things and mechanisms behind the variance in human values, and the post argues we should look at these mechanisms first (if we're able to). I predict that a mechanistic understanding will enable the below knowledge:
I aspire for the kind of alignment mastery which lets me build a diamond-producing AI, or if that didn’t suit my fancy, I’d turn around and tweak the process and the AI would press green buttons forever instead, or—if I were playing for real—I’d align that system of mere circuitry with humane purposes.
To summarize your argument: people are not aligned w/ others who are less powerful than them, so this will not generalize to AGI that is much more powerful than humans.
Parents have way more power than their kids, and there exist some parents who are very loving (i.e. aligned) towards their kids. There are also many, many people who care about their pets & there exist animal rights advocates.
If we understand the mechanisms behind why some people e.g. terminally value animal happiness and some don't, then we can apply these mechanisms to other learning systems.
I wouldn't expect a human who had been given all the power in the world all their life such that they've learned they can solve any conflict by destroying their opposition to be very aligned.
I agree this is likely.
This doesn't make sense to me, particularly since I believe that most people live in environments that are very much "in distribution", and it is difficult for us to discuss misalignment without talking about extreme cases (as I described in the previous comment), or subtle cases (black swans?) that may not seem to matter.
I think you're ignoring the [now bolded part] in "a particular human’s learning process + reward circuitry + "training" environment" and just focusing on the environment. Humans very often don't optimize for their reward circuitry in their limbic system. If I gave you a button that killed everyone but maximized your reward circuitry every time you pressed it, most people wouldn't press it (would you?). I do agree that if you pressed the button once, you would then want to press the button again, but not beforehand which is an inner-misalignment w/ respect to the reward circuitry. Though maybe you'd say the wirehead thing is an extreme case OOD?
By inner values, I mean terminal goals. Wanting dogs to be happy is not a terminal goal for most people, and I believe that given enough optimization pressure, the hypothetical dog-lover would abandon this goal to optimize for what their true terminal goal is.
I agree, but I'm bolding "most people" because you're claiming there exist some people that would retain that value if scaled up(?) I think if you replace "dog-lover" w/ "family-lover", there are even more people. But I don't think this is a disagreement between us?
My bad; I've updated the comment to clarify that I believe Quintin claims that solving / preventing inner misalignment is easier than one would expect given the belief that evolution's failure at inner alignment is the most significant and informative evidence that inner alignment is hard.
Oh, I think inner-misalignment w/ respect to the reward circuitry is a good, positive thing that we want, so there's the disconnect (usually misalignment is thought of as bad, and I'm not just mistyping). Human values are formed by inner-misalignment and they have lots of great properties such as avoiding ontological crises, valuing real world things (like the diamond maximizer in the OP), and a subset of which cares for all of humanity. We can learn more about this process by focusing more on the "a particular human’s learning process + reward circuitry + "training" environment" part, and less on the evolution part.
If we understand the underlying mechanisms behind human value formation through inner-misalignment w/ respect to the reward circuitry, then we might be able to better develop the theory of learning systems developing values, which includes AGI.
There may not be substantial disagreements here. Do you agree with:
"a particular human's learning process + reward circuitry + "training" environment -> the human's learned values" is more informative about inner-misalignment than the usual "evolution -> human values" (e.g. Two twins could have different life experiences and have different values, or a sociopath may have different reward circuitry which leads to very different values than people with typical reward circuitry even given similar experiences)
The most important claim in your comment is that "human learning → human values" is evidence that inner misalignment is easier than it seems when one looks at it from the "evolution -> human values" perspective. Here's why I disagree:
I don't know what you mean by "inner misalignment is easier"? Could you elaborate? I don't think you mean "inner misalignment is more likely to happen" because you then go on to explain inner-misalignment & give an example and say "I worry you are being insufficiently pessimistic." One implication I read was that inner values learned (ie the inner-misaligned values) may scale, which is the opposite prediction usually given. See:
I also think this regularity in inner values is reasonably robust to sharp left turns in capabilities. If you take a human whose outer behavior suggests they like dogs, and give that human very strong capabilities to influence the future, I do not think they are at all likely to erase dogs from existence.
This matches my intuitions.
My understanding is: Bob's genome didn't have access to Bob's developed world model (WM) when he was born (because his WM wasn't developed yet). Bob's genome can't directly specify "care about your specific family" because it can't hardcode Bob's specific family's visual or auditory features.
This direct-specification wouldn't work anyways because people change looks, Bob could be adopted, or Bob could be born blind & deaf.
[Check, does the Bob example make sense?]
But the genome does do something indirectly that consistently leads to people valuing their families (say ~80% of people). The items in the bulleted list (e.g. reaction to being scammed, etc.) are other extremely common human values & biases that seem improbable for the genome to directly specify, so the alternative hypothesis is that the genome sets the initial conditions (along with the environment) such that these are generally convergently learned.
The hope is that this is true, that the mechanisms behind it can be understood, and that these mechanisms can be applied to AGI convergently learning desired values.