Kaj_Sotala - AI Alignment Forum

Thoughts on “AI is easy to control” by Pope & Belrose

I mostly agree with what you say, just registering my disagreement/thoughts on some specific points. (Note that I haven't yet read the page you're responding to.)

Hopefully everyone on all sides can agree that if my LLM reliably exhibits a certain behavior—e.g. it outputs “apple” after a certain prompt—and you ask me “Why did it output ‘apple’, rather than ‘banana’?”, then it might take me decades of work to give you a satisfying intuitive answer.

Maybe? Depends on what exactly you mean by the word "might", but it doesn't seem obvious to me that this would need to be the case. My intuition from seeing the kinds of interpretability results we've seen so far, is that within less of a decade we'd already have a pretty rigorous theory and toolkit for answering these kinds of questions. At least assuming that we don't keep switching to LLM architectures that work based on entirely different mechanisms and make all of the previous interpretability work irrelevant.

If by "might" you mean something like a "there's at least a 10% probability that this could take decades to answer" then sure, I'd agree with that. Now I haven't actually thought about this specific question very much before seeing it pop up in your post, so I might radically revise my intuition if I thought about it more, but at least it doesn't seem immediately obvious to me that I should assign "it would take decades of work to answer this" a very high probability.

Instead, the authors make a big deal out of the fact that human innate drives are relatively simple (I think they mean “simple compared to a modern big trained ML model”, which I would agree with). I’m confused why that matters. Who cares if there’s a simple solution, when we don’t know what it is?

I would assume the intuition to be something like "if they're simple, then given the ability to experiment on minds and access AI internals, it will be relatively easy to figure out how to make the same drives manifest in an AI; the amount of (theory + trial and error) required for that will not be as high as it would be if the drives were intrinsically complex".

We can run large numbers of experiments to find the most effective interventions, and we can also run it in a variety of simulated environments and test whether it behaves as expected both with and without the cognitive intervention. Each time the AI’s “memories” can be reset, making the experiments perfectly reproducible and preventing the AI from adapting to our actions, very much unlike experiments in psychology and social science.
That sounds nice, but brain-like AGI (like most RL agents) does online learning. So if you run a bunch of experiments, then as soon as the AGI does anything whatsoever (e.g. reads the morning newspaper), your experiments are all invalid (or at least, open to question), because now your AGI is different than it was before (different ANN weights, not just different environment / different prompt). Humans are like that too, but LLMs are not.

There's something to that, but this sounds too strong to me. If someone had hypothetically spent a year observing all of my behavior, having some sort of direct read access to what was happening in my mind, and also doing controlled experiments where they reset my memory and tested what happened with some different stimulus... it's not like all of their models would become meaningless the moment I read the morning newspaper. If I had read morning newspapers before, they would probably have a pretty good model of what the likely range of updates for me would be.

Of course, if there was something very unexpected and surprising in the newspaper, that might cause a bigger update, but I expect that they would also have reasonably good models of the kinds of things that are likely to trigger major updates or significant emotional shifts in me. If they were at all competent, that's specifically the kind of thing that I'd expect them to work on trying to find out!

And even if there was a major shift, I think it's basically unheard of that literally everything about my thoughts and behavior would change. When I first understood the potentially transformative impact of AGI, it didn't change the motor programs that determine how I walk or brush my teeth, nor did it significantly change what kinds of people I feel safe around (aside for some increase in trust toward other people who I felt "get it"). I think that human brains quite strongly preserve their behavior and prediction structures, just adjusting them somewhat when faced with new information. Most of the models and predictions you've made about an adult will tend to stay valid, though of course with children and younger people there's much greater change.

Now, as it happens, humans do often imitate other humans. But other times they don’t. Anyway, insofar as humans-imitating-other-humans happens, it has to happen via a very different and much less direct algorithmic mechanism than how it happens in LLMs. Specifically, humans imitate other humans because they want to, i.e. because of the history of past reinforcement, directly or indirectly. Whereas a pretrained LLM will imitate human text with no RL or “wanting to imitate” at all, that’s just mechanically what it does.

In some sense yes, but it does also seem to me that prediction and desire does get conflated in humans in various ways, and that it would be misleading to say that the people in question want it. For example, I think about this post by @romeostevensit often:

Fascinating concept that I came across in military/police psychology dealing with the unique challenges people face in situations of extreme stress/danger: scenario completion. Take the normal pattern completion that people do and put fear blinders on them so they only perceive one possible outcome and they mechanically go through the motions *even when the outcome is terrible* and there were obvious alternatives. This leads to things like officers shooting *after* a suspect has already surrendered, having overly focused on the possibility of needing to shoot them. It seems similar to target fixation where people under duress will steer a vehicle directly into an obstacle that they are clearly perceiving (looking directly at) and can't seem to tear their gaze away from. Or like a self fulfilling prophecy where the details of the imagined bad scenario are so overwhelming, with so little mental space for anything else that the person behaves in accordance with that mental picture even though it is clearly the mental picture of the *un*desired outcome.
I often try to share the related concept of stress induced myopia. I think that even people not in life or death situations can get shades of this sort of blindness to alternatives. It is unsurprising when people make sleep a priority and take internet/screen fasts that they suddenly see that the things they were regarding as obviously necessary are optional. In discussion of trauma with people this often seems to be an element of relationships sadly enough. They perceive no alternative and so they resign themselves to slogging it out for a lifetime with a person they are very unexcited about. This is horrific for both people involved.

It's, of course, true that for an LLM, prediction is the only thing it can do, and that humans have a system of desires on top of that. But it looks to me that a lot of human behavior is just having LLM-ish predictive models of how someone like them would behave in a particular situation, which is also the reason why conceptual reframings the like one you can get in therapy can be so powerful ("I wasn't lazy after all, I just didn't have the right tools for being productive" can drastically reorient many predictions you're making of yourself and thus your behavior). (See also my post on human LLMs, which has more examples.)

While it's obviously true that there is a lot of stuff operating in brains besides LLM-like prediction, such as mechanisms that promote specific predictive models over other ones, that seems to me to only establish that "the human brain is not just LLM-like prediction", while you seem to be saying that "the human brain does not do LLM-like prediction at all". (Of course, "LLM-like prediction" is a vague concept and maybe we're just using it differently and ultimately agree.)

Genetic fitness is a measure of selection strength, not the selection target

Kaj Sotala10mo20

I don't think any of these arguments depend crucially on whether there is a sole explicit goal of the training process, or if the goal of the training process changes a bunch. The only thing the argument depends on is whether there exist such abstract drives/goals

I agree that they don't depend on that. Your arguments are also substantially different from the ones I was criticizing! The ones I was responding were ones like the following:

The central analogy here is that optimizing apes for inclusive genetic fitness (IGF) doesn't make the resulting humans optimize mentally for IGF. Like, sure, the apes are eating because they have a hunger instinct and having sex because it feels good—but it's not like they could be eating/fornicating due to explicit reasoning about how those activities lead to more IGF. They can't yet perform the sort of abstract reasoning that would correctly justify those actions in terms of IGF. And then, when they start to generalize well in the way of humans, they predictably don't suddenly start eating/fornicating because of abstract reasoning about IGF, even though they now could. Instead, they invent condoms, and fight you if you try to remove their enjoyment of good food (telling them to just calculate IGF manually). The alignment properties you lauded before the capabilities started to generalize, predictably fail to generalize with the capabilities. (A central AI alignment problem: capabilities generalization, and the sharp left turn)

15. [...] We didn't break alignment with the 'inclusive reproductive fitness' outer loss function, immediately after the introduction of farming - something like 40,000 years into a 50,000 year Cro-Magnon takeoff, as was itself running very quickly relative to the outer optimization loop of natural selection. Instead, we got a lot of technology more advanced than was in the ancestral environment, including contraception, in one very fast burst relative to the speed of the outer optimization loop, late in the general intelligence game. [...]
16. Even if you train really hard on an exact loss function, that doesn't thereby create an explicit internal representation of the loss function inside an AI that then continues to pursue that exact loss function in distribution-shifted environments. Humans don't explicitly pursue inclusive genetic fitness; outer optimization even on a very exact, very simple loss function doesn't produce inner optimization in that direction. (AGI Ruin: A List of Lethalities)

Those arguments are explicitly premised on humans having been optimized for IGF, which is implied to be a single thing. As I understand it, your argument is just that humans now have some very different behaviors from the ones they used to have, omitting any claims of what evolution originally optimized us for, so I see it as making a very different sort of claim.

To respond to your argument itself:

I agree that there are drives for which the behavior looks very different from anything that we did in the ancestral environment. But does very different-looking behavior by itself constitute a sharp left turn relative to our original values?

I would think that if humans had experienced a sharp left turn, then the values of our early ancestors should look unrecognizable to us, and vice versa. And certainly, there do seem to be quite a few things that our values differ on - modern notions like universal human rights and living a good life while working in an office might seem quite alien and repulsive to some tribal warrior who values valor in combat and killing and enslaving the neighboring tribe, for instance.

At the same time... I think we can still basically recognize and understand the values of that tribal warrior, even if we don't share them. We do still understand what's attractive about valor, power, and prowess, and continue to enjoy those kinds of values in less destructive forms in sports, games, and fiction. We can read Gilgamesh or Homer or Shakespeare and basically get what the characters are motivated by and why they are doing the things they're doing. An anthropologist can go to a remote tribe to live among them and report that they have the same cultural and psychological universals as everyone else and come away with at least some basic understanding of how they think and why.

It's true that humans couldn't eradicate diseases before. But if you went to people very far back in time and told them a story about a group of humans who invented a powerful magic that could destroy diseases forever and then worked hard to do so... then the people of that time would not understand all of the technical details, and maybe they'd wonder why we'd bother bringing the cure to all of humanity rather than just our tribe (though Prometheus is at least commonly described as stealing fire for all of humanity, so maybe not), but I don't think they would find it a particularly alien or unusual motivation otherwise. Humans have hated disease for a very long time, and if they'd lost any loved ones to the particular disease we were eradicating they might even cheer for our doctors and want to celebrate them as heroes.

Similarly, humans have always gone on voyages of exploration - e.g. the Pacific islands were discovered and settled long ago by humans going on long sea voyages - so they'd probably have no difficulty relating to a story about sorcerers going to explore the moon, or of two tribes racing for the glory of getting there first. Babylonians had invented the quadratic formula by 1600 BC and apparently had a form of Fourier analysis by 300 BC, so the math nerds among them would probably have some appreciation of modern-day advanced math if it was explained to them. The Greek philosophers argued over epistemology, and there were apparently instructions on how to animate golems (arguably AGI-like) around by the late 12th/early 13th century.

So I agree that the same fundamental values and drives can create very different behavior in different contexts... but if it is still driven by the same fundamental values and drives in a way that people across time might find relatable, why is that a sharp left turn? Analogizing that to AI, it would seem to imply that if the AI generalized its drives in that kind of way when it came to novel contexts, then we would generally still be happy about the way it had generalized them.

This still leaves us with that tribal warrior disgusted with our modern-day weak ways. I think that a lot of what is going on with him is that he has developed particular strategies for fulfilling his own fundamental drives - being a successful warrior was the way you got what you wanted back in that day - and internalized them as a part of his aesthetic of what he finds beautiful and what he finds disgusting. But it also looks to me like this kind of learning is much more malleable than people generally expect. One's sense of aesthetics can be updated by propagating new facts into it, and strongly-held identities (such as "I am a technical person") can change in response to new kinds of strategies becoming viable, and generally many (I think most) deep-seated emotional patterns can at least in principle be updated. (Generally, I think of human values in terms of a two-level model, where the underlying "deep values" are relatively constant, with emotional responses, aesthetics, identities, and so forth being learned strategies for fulfilling those deep values. The strategies are at least in principle updatable, subject to genetic constraints such as the person's innate temperament that may be more hardcoded.)

I think that the tribal warrior would be disgusted by our society because he would rightly recognize that we have the kinds of behavior patterns that wouldn't bring glory in his society and that his tribesmen would find it shameful to associate with, and also that trying to make it in our society would require him to unlearn a lot of stuff that he was deeply invested in. But if he was capable of making the update that there were still ways for him to earn love, respect, power, and all the other deep values that his warfighting behavior had originally developed to get... then he might come to see our society as not that horrible after all.

I am confused by your AlphaGo argument because "winning states of the board" looks very different depending on what kinds of tactics your opponent uses, in a very similar way to how "surviving and reproducing" looks very different depending on what kinds of hazards are in the environment.

I don't think the actual victory states look substantially different? They're all ones where AlphaGo has more territory than the other player, even if the details of how you get there are going to be different.

I predict that AlphaGo is actually not doing that much direct optimization in the sense of an abstract drive to win that it reasons about, but rather has a bunch of random drives piled up that cover various kinds of situations that happen in Go.

Yeah, I would expect this as well, but those random drives would still be systematically shaped in a consistent direction (that which brings you closer to a victory state).

Genetic fitness is a measure of selection strength, not the selection target

Kaj Sotala10mo10

Thanks, edited:

I argued that there’s no single thing that evolution selects for; rather, the thing that it’s selecting is constantly changing.

Genetic fitness is a measure of selection strength, not the selection target

Kaj Sotala10mo10

Does this comment help clarify the point?

Genetic fitness is a measure of selection strength, not the selection target

Kaj Sotala10mo62

So I think the issue is that when we discuss what I'd call the "standard argument from evolution", you can read two slightly different claims into it. My original post was a bit muddled because I think those claims are often conflated, and before writing this reply I hadn't managed to explicitly distinguish them.

The weaker form of the argument, which I interpret your comment to be talking about, goes something like this:

The original evolutionary "intent" of various human behaviors/goals was to increase fitness, but in the modern day these behaviors/goals are executed even though their consequences (in terms of their impact on fitness) are very different. This tells us that the intent of the process that created a behavior/goal does not matter. Once the behavior/goal has been created, it will just do what it does even if the consequences of that doing deviate from their original purpose. Thus, even if we train an AI so that it carries out goal X in a particular context, we have no particular reason to expect that it would continue to automatically carry out the same (intended) goal if the context changes enough.

I agree with this form of the argument and have no objections to it. I don't think that the points in my post are particularly relevant to that claim. (I've even discussed a form of inner optimization in humans that causes value drift that I don't recall anyone else discussing in those terms before.)

However, I think that many formulations are actually implying, if not outright stating a stronger claim:

In the case of evolution, humans were originally selected for IGF but are now doing things that are completely divorced from that objective. Thus, even if we train an AI so that it carries out goal X in a particular context, we have a strong reason to expect that its behavior would deviate so much from the goal as to become practically unrecognizable.

So the difference is something like the implied sharpness of the left turn. In the weak version, the claim is just that the behavior might go some unknown amount to the left. We should figure out how to deal with this, but we don't yet have much empirical data to estimate exactly how much it might be expected to go left. In the strong version, the claim is that the empirical record shows that the AI will by default swerve a catastrophic amount to the left.

(Possibly you don't feel that anyone is actually implying the stronger version. If you don't and you would already disagree with the stronger version, then great! We are in agreement. I don't think it matters whether the implication "really is there" in some objective sense, or even whether the original authors intended it or not. I think the relevant thing is that I got that implication from the posts I read, and I expect that if I got it, some other people got it too. So this post is then primarily aimed at the people who did read the strong version to be there and thought it made sense.)

You wrote:

I agree that humans (to a first approximation) still have the goals/drives/desires we were selected for. I don't think I've heard anyone claim that humans suddenly have an art creating drive that suddenly appeared out of nowhere recently, nor have I heard any arguments about inner alignment that depend on an evolution analogy where this would need to be true. The argument is generally that the ancestral environment selected for some drives that in the ancestral environment reliably caused something that the ancestral environment selected for, but in the modern environment the same drives persist but their consequences in terms of [the amount of that which the ancestral environment was selecting for] now changes, potentially drastically.

If we are talking about the weak version of the argument, then yes, I agree with everything here. But I think the strong version - where our behavior is implied to be completely at odds with our original behavior - has to implicitly assume that things like an art-creation drive are something novel.

Now I don't think that anyone who endorses the strong version (if anyone does) would explicitly endorse the claim that our art-creation drive just appeared out of nowhere. But to me, the strong version becomes pretty hard to maintain if you take the stance that we are mostly still executing all of the behaviors that we used to, and it's just that their exact forms and relative weightings are somewhat out of distribution. (Yes, right now our behavior seems to lead to falling birthrates and lots of populations at below replacement rates, which you could argue was a bigger shift than being "somewhat out of distribution", but... to me that intuitively feels like it's less relevant than the fact that most individual humans still want to have children and are very explicitly optimizing for that, especially since we've only been in the time of falling birthrates for a relatively short time and it's not clear whether it'll continue for very long.)

I think the strong version also requires one to hold that evolution does, in fact, consistently and predominantly optimize for a single coherent thing. Otherwise, it would mean that our current-day behaviors could be explained by "evolution doesn't consistently optimize for any single thing" just as well as they could be explained by "we've experienced a left turn from what evolution originally optimized for".

However, it is pretty analogous to RL, and especially multi agent RL, and overall I don't think of the inner misalignment argument as depending on stationarity of the environment in either direction. AlphaGo might early in training select for policies that do tactic X initially because it's a good tactic to use against dumb Go networks, and then once all the policies in the pool learn to defend against that tactic it is no longer rewarded.

I agree that there are contexts where it would be analogous to that. But in that example, AlphaGo is still being rewarded for winning games of Go, and it's just that the exact strategies it needs to use differ. That seems different than e.g. the bacteria example, where bacteria are selected for exactly the opposite traits - either selected for producing a toxin and an antidote, or selected for not producing a toxin and an antidote. That seems to me more analogous to a situation where AlphaGo is initially being rewarded for winning at Go, then once it starts consistently winning it starts getting rewarded for losing instead, and then once it starts consistently losing it starts getting rewarded for winning again.

And I don't think that that kind of a situation is even particularly rare - anything that consumes energy (be it a physical process such as producing a venom or a fur, or a behavior such as enjoying exercise) is subject to that kind of an "either/or" choice.

Now you could say that "just like AlphaGo is still rewarded for winning games of Go and it's just the strategies that differ, the organism is still rewarded for reproducing and it's just the strategies that differ". But I think the difference is that for AlphaGo, the rewards are consistently shaping its "mind" towards having a particular optimization goal - one where the board is in a winning state for it.

And one key premise on which the "standard argument from evolution" rests is that evolution has not consistently shaped the human mind in such a direct manner. It's not that we have been created with "I want to have surviving offspring" as our only explicit cognitive goal, with all of the evolutionary training going into learning better strategies to get there by explicit (or implicit) reasoning. Rather we have been given various motivations that exhibit varying degrees of directness in how useful they are for that goal - from "I want to be in a state where I produce great art" (quite indirect) to "I want to have surviving offspring" (direct), with the direct goal competing with all the indirect ones for priority. Unlike AlphaGo, which does have the cognitive capacity for direct optimization toward its goal being the sole reward criteria all along.

This is also a bit hard to put a finger on, but I feel like there's some kind of implicit bait-and-switch happening with the strong version of the standard argument. It correctly points out that we have not had IGF as our sole explicit optimization goal because we didn't start by having enough intelligence for that to work. Then it suggests that because of this, AIs are likely to also be misaligned... even though, unlike with human evolution, we could just optimize them for one explicit goal from the beginning, so we should expect our AIs to be much more reliably aligned with that goal!

Genetic fitness is a measure of selection strength, not the selection target

Kaj Sotala10mo110

Thank you, I like this comment. It feels very cooperative and like some significant effort went into it, and it also seems to touch the core of some important consideratios.

I notice I'm having difficulty responding, in that I disagree with some of what you said, but then have difficulty figuring out my reasons for that disagreement. I have the sense there's a subtle confusion going on, but trying to answer you makes me uncertain whether others are the ones with the subtle confusion or if I am.

I'll think about it some more and get back to you.

Genetic fitness is a measure of selection strength, not the selection target

Kaj Sotala10mo61

infanticide is not a substitute for contraception

I did not mean to say that they would be exactly equivalent nor that infanticide would be without significant downsides.

How is this not an excellent example of how under novel circumstances, inner-optimizers (like human brains) can almost all (serial sperm donor cases like hundreds out of billions) diverge extremely far (if forfeiting >10,000% is not diverging far, what would be?) from the optimization process's reward function (within-generation increase in allele frequencies), while pursuing other rewards (whatever it is you are enjoying doing while very busy not ever donating sperm)?

"Inner optimizers diverging from the optimization process's reward function" sounds to me like humans were already donating to sperm banks in the EEA, only for an inner optimizer to wreak havoc and sidetrack us from that. I assume you mean something different, since under that interpretation of what you mean the answer would be obvious - that we don't need to invoke inner optimizers because there were no sperm banks in the EEA, so "that's not the kind of behavior that evolution selected for" is a sufficient explanation.

Value systematization: how values become coherent (and misaligned)

Kaj Sotala10mo22

Similarly, suppose you have two deontological values which trade off against each other. Before systematization, the question of "what's the right way to handle cases where they conflict" is not really well-defined; you have no procedure for doing so. After systematization, you do. (And you also have answers to questions like "what counts as lying?" or "is X racist?", which without systematization are often underdefined.) [...]
You can conserve your values (i.e. continue to care terminally about lower-level representations) but the price you pay is that they make less sense, and they're underdefined in a lot of cases. [...] And that's why the "mind itself wants to do this" does make sense, because it's reasonable to assume that highly capable cognitive architectures will have ways of identifying aspects of their thinking that "don't make sense" and correcting them.

I think we should be careful to distinguish explicit and implicit systematization. Some of what you are saying (e.g. getting answers to question like "what counts as lying") sounds like you are talking about explicit, consciously done systematization; but some of what you are saying (e.g. minds identifying aspects of thinking that "don't make sense" and correcting them) also sounds like it'd apply more generally to developing implicit decision-making procedures.

I could see the deontologist solving their problem either way - by developing some explicit procedure and reasoning for solving the conflict between their values, or just going by a gut feel for which value seems to make more sense to apply in that situation and the mind then incorporating this decision into its underlying definition of the two values.

I don't know how exactly deontological rules work, but I'm guessing that you could solve a conflict between them by basically just putting in a special case for "in this situation, rule X wins over rule Y" - and if you view the rules as regions in state space where the region for rule X corresponds to the situations where rule X is applied, then adding data points about which rule is meant to cover which situation ends up modifying the rule itself. It would also be similar to the way that rules work in skill learning in general, in that experts find the rules getting increasingly fine-grained, implicit and full of exceptions. Here's how Josh Waitzkin describes the development of chess expertise:

Let’s say that I spend fifteen years studying chess. [...] We will start with day one. The first thing I have to do is to internalize how the pieces move. I have to learn their values. I have to learn how to coordinate them with one another. [...]
Soon enough, the movements and values of the chess pieces are natural to me. I don’t have to think about them consciously, but see their potential simultaneously with the figurine itself. Chess pieces stop being hunks of wood or plastic, and begin to take on an energetic dimension. Where the piece currently sits on a chessboard pales in comparison to the countless vectors of potential flying off in the mind. I see how each piece affects those around it. Because the basic movements are natural to me, I can take in more information and have a broader perspective of the board. Now when I look at a chess position, I can see all the pieces at once. The network is coming together.
Next I have to learn the principles of coordinating the pieces. I learn how to place my arsenal most efficiently on the chessboard and I learn to read the road signs that determine how to maximize a given soldier’s effectiveness in a particular setting. These road signs are principles. Just as I initially had to think about each chess piece individually, now I have to plod through the principles in my brain to figure out which apply to the current position and how. Over time, that process becomes increasingly natural to me, until I eventually see the pieces and the appropriate principles in a blink. While an intermediate player will learn how a bishop’s strength in the middlegame depends on the central pawn structure, a slightly more advanced player will just flash his or her mind across the board and take in the bishop and the critical structural components. The structure and the bishop are one. Neither has any intrinsic value outside of its relation to the other, and they are chunked together in the mind.
This new integration of knowledge has a peculiar effect, because I begin to realize that the initial maxims of piece value are far from ironclad. The pieces gradually lose absolute identity. I learn that rooks and bishops work more efficiently together than rooks and knights, but queens and knights tend to have an edge over queens and bishops. Each piece’s power is purely relational, depending upon such variables as pawn structure and surrounding forces. So now when you look at a knight, you see its potential in the context of the bishop a few squares away. Over time each chess principle loses rigidity, and you get better and better at reading the subtle signs of qualitative relativity. Soon enough, learning becomes unlearning. The stronger chess player is often the one who is less attached to a dogmatic interpretation of the principles. This leads to a whole new layer of principles—those that consist of the exceptions to the initial principles. Of course the next step is for those counterintuitive signs to become internalized just as the initial movements of the pieces were. The network of my chess knowledge now involves principles, patterns, and chunks of information, accessed through a whole new set of navigational principles, patterns, and chunks of information, which are soon followed by another set of principles and chunks designed to assist in the interpretation of the last. Learning chess at this level becomes sitting with paradox, being at peace with and navigating the tension of competing truths, letting go of any notion of solidity.

"Sitting with paradox, being at peace with and navigating the tension of competing truths, letting go of any notion of solidity" also sounds to me like some of the models for higher stages of moral development, where one moves past the stage of trying to explicitly systematize morality and can treat entire systems of morality as things that all co-exist in one's mind and are applicable in different situations. Which would make sense, if moral reasoning is a skill in the same sense that playing chess is a skill, and moral preferences are analogous to a chess expert's preferences for which piece to play where.

Value systematization: how values become coherent (and misaligned)

Kaj Sotala10mo98

Morality seems like the domain where humans have the strongest instinct to systematize our preferences

At least, the domain where modern educated Western humans have an instinct to systematize our preferences. Interestingly, it seems the kind of extensive value systematization done in moral philosophy may itself be an example of belief systematization. Scientific thinking taught people the mental habit of systematizing things, and then those habits led them to start systematizing values too, as a special case of "things that can be systematized".

Phil Goetz had this anecdote:

I'm also reminded of a talk I attended by one of the Dalai Lama's assistants. This was not slick, Westernized Buddhism; this was saffron-robed fresh-off-the-plane-from-Tibet Buddhism. He spoke about his beliefs, and then took questions. People began asking him about some of the implications of his belief that life, love, feelings, and the universe as a whole are inherently bad and undesirable. He had great difficulty comprehending the questions - not because of his English, I think; but because the notion of taking a belief expressed in one context, and applying it in another, seemed completely new to him. To him, knowledge came in units; each unit of knowledge was a story with a conclusion and a specific application. (No wonder they think understanding Buddhism takes decades.) He seemed not to have the idea that these units could interact; that you could take an idea from one setting, and explore its implications in completely different settings.

David Chapman has a page talking about how fundamentalist forms of religion are a relatively recent development, a consequence of how secular people first started systematizing values and then religion has to start doing the same in order to adapt:

Fundamentalism describes itself as traditional and anti-modern. This is inaccurate. Early fundamentalism was anti-modernist, in the special sense of “modernist theology,” but it was itself modernist in a broad sense. Systems of justifications are the defining feature of “modernity,” as I (and many historians) use the term.
The defining feature of actual tradition—“the choiceless mode”—is the absence of a system of justifications: chains of “therefore” and “because” that explain why you have to do what you have to do. In a traditional culture, you just do it, and there is no abstract “because.” How-things-are-done is immanent in concrete customs, not theorized in transcendent explanations.
Genuine traditions have no defense against modernity. Modernity asks “Why should anyone believe this? Why should anyone do that?” and tradition has no answer. (Beyond, perhaps, “we always have.”) Modernity says “If you believe and act differently, you can have 200 channels of cable TV, and you can eat fajitas and pad thai and sushi instead of boiled taro every day”; and every genuinely traditional person says “hell yeah!” Because why not? Choice is great! (And sushi is better than boiled taro.)
Fundamentalisms try to defend traditions by building a system of justification that supplies the missing “becauses.” You can’t eat sushi because God hates shrimp. How do we know? Because it says so here in Leviticus 11:10-11.³
Secular modernism tries to answer every “why” question with a chain of “becauses” that eventually ends in “rationality,” which magically reveals Ultimate Truth. Fundamentalist modernism tries to answer every “why” with a chain that eventually ends in “God said so right here in this magic book which contains the Ultimate Truth.”
The attempt to defend tradition can be noble; tradition is often profoundly good in ways modernity can never be. Unfortunately, fundamentalism, by taking up modernity’s weapons, transforms a traditional culture into a modern one. “Modern,” that is, in having a system of justification, founded on a transcendent eternal ordering principle. And once you have that, much of what is good about tradition is lost.
This is currently easier to see in Islamic than in Christian fundamentalism. Islamism is widely viewed as “the modern Islam” by young people. That is one of its main attractions: it can explain itself, where traditional Islam cannot. Sophisticated urban Muslims reject their grandparents’ traditional religion as a jumble of pointless, outmoded village customs with no basis in the Koran. Many consider fundamentalism the forward-looking, global, intellectually coherent religion that makes sense of everyday life and of world politics.

Jonathan Haidt also talked about the way that even among Westerners, requiring justification and trying to ground everything in harm/care is most prominent in educated people (who had been socialized to think about morality in this way) as opposed to working-class people. Excerpts from The Righteous Mind where he talks about reading people stories about victimless moral violations (e.g. having sex with a dead chicken before eating it) to see how they thought about them:

I got my Ph.D. at McDonald’s. Part of it, anyway, given the hours I spent standing outside of a McDonald’s restaurant in West Philadelphia trying to recruit working-class adults to talk with me for my dissertation research. When someone agreed, we’d sit down together at the restaurant’s outdoor seating area, and I’d ask them what they thought about the family that ate its dog, the woman who used her flag as a rag, and all the rest. I got some odd looks as the interviews progressed, and also plenty of laughter—particularly when I told people about the guy and the chicken. I was expecting that, because I had written the stories to surprise and even shock people.
But what I didn’t expect was that these working-class subjects would sometimes find my request for justifications so perplexing. Each time someone said that the people in a story had done something wrong, I asked, “Can you tell me why that was wrong?” When I had interviewed college students on the Penn campus a month earlier, this question brought forth their moral justifications quite smoothly. But a few blocks west, this same question often led to long pauses and disbelieving stares. Those pauses and stares seemed to say, You mean you don’t know why it’s wrong to do that to a chicken? I have to explain this to you? What planet are you from?
These subjects were right to wonder about me because I really was weird. I came from a strange and different moral world—the University of Pennsylvania. Penn students were the most unusual of all twelve groups in my study. They were unique in their unwavering devotion to the “harm principle,” which John Stuart Mill had put forth in 1859: “The only purpose for which power can be rightfully exercised over any member of a civilized community, against his will, is to prevent harm to others.”1 As one Penn student said: “It’s his chicken, he’s eating it, nobody is getting hurt.”
The Penn students were just as likely as people in the other eleven groups to say that it would bother them to witness the taboo violations, but they were the only group that frequently ignored their own feelings of disgust and said that an action that bothered them was nonetheless morally permissible. And they were the only group in which a majority (73 percent) were able to tolerate the chicken story. As one Penn student said, “It’s perverted, but if it’s done in private, it’s his right.” [...]

Haidt also talks about this kind of value systematization being uniquely related to Western mental habits:

I and my fellow Penn students were weird in a second way too. In 2010, the cultural psychologists Joe Henrich, Steve Heine, and Ara Norenzayan published a profoundly important article titled “The Weirdest People in the World?” The authors pointed out that nearly all research in psychology is conducted on a very small subset of the human population: people from cultures that are Western, educated, industrialized, rich, and democratic (forming the acronym WEIRD). They then reviewed dozens of studies showing that WEIRD people are statistical outliers; they are the least typical, least representative people you could study if you want to make generalizations about human nature. Even within the West, Americans are more extreme outliers than Europeans, and within the United States, the educated upper middle class (like my Penn sample) is the most unusual of all.
Several of the peculiarities of WEIRD culture can be captured in this simple generalization: The WEIRDer you are, the more you see a world full of separate objects, rather than relationships. It has long been reported that Westerners have a more independent and autonomous concept of the self than do East Asians. For example, when asked to write twenty statements beginning with the words “I am …,” Americans are likely to list their own internal psychological characteristics (happy, outgoing, interested in jazz), whereas East Asians are more likely to list their roles and relationships (a son, a husband, an employee of Fujitsu).
The differences run deep; even visual perception is affected. In what’s known as the framed-line task, you are shown a square with a line drawn inside it. You then turn the page and see an empty square that is larger or smaller than the original square. Your task is to draw a line that is the same as the line you saw on the previous page, either in absolute terms (same number of centimeters; ignore the new frame) or in relative terms (same proportion relative to the frame). Westerners, and particularly Americans, excel at the absolute task, because they saw the line as an independent object in the first place and stored it separately in memory. East Asians, in contrast, outperform Americans at the relative task, because they automatically perceived and remembered the relationship among the parts.
Related to this difference in perception is a difference in thinking style. Most people think holistically (seeing the whole context and the relationships among parts), but WEIRD people think more analytically (detaching the focal object from its context, assigning it to a category, and then assuming that what’s true about the category is true about the object). Putting this all together, it makes sense that WEIRD philosophers since Kant and Mill have mostly generated moral systems that are individualistic, rule-based, and universalist. That’s the morality you need to govern a society of autonomous individuals.
But when holistic thinkers in a non-WEIRD culture write about morality, we get something more like the Analects of Confucius, a collection of aphorisms and anecdotes that can’t be reduced to a single rule.6 Confucius talks about a variety of relationship-specific duties and virtues (such as filial piety and the proper treatment of one’s subordinates). If WEIRD and non-WEIRD people think differently and see the world differently, then it stands to reason that they’d have different moral concerns. If you see a world full of individuals, then you’ll want the morality of Kohlberg and Turiel—a morality that protects those individuals and their individual rights. You’ll emphasize concerns about harm and fairness.
But if you live in a non-WEIRD society in which people are more likely to see relationships, contexts, groups, and institutions, then you won’t be so focused on protecting individuals. You’ll have a more sociocentric morality, which means (as Shweder described it back in chapter 1) that you place the needs of groups and institutions first, often ahead of the needs of individuals. If you do that, then a morality based on concerns about harm and fairness won’t be sufficient. You’ll have additional concerns, and you’ll need additional virtues to bind people together.

Evaluating the historical value misspecification argument

Kaj Sotala11mo713

Getting a shape into the AI's preferences is different from getting it into the AI's predictive model. MIRI is always in every instance talking about the first thing and not the second.
You obviously need to get a thing into the AI at all, in order to get it into the preferences, but getting it into the AI's predictive model is not sufficient. It helps, but only in the same sense that having low-friction smooth ball-bearings would help in building a perpetual motion machine; the low-friction ball-bearings are not the main problem, they are a kind of thing it is much easier to make progress on compared to the main problem.

I read this as saying "GPT-4 has successfully learned to predict human preferences, but it has not learned to actually fulfill human preferences, and that's a far harder goal". But in the case of GPT-4, it seems to me like this distinction is not very clear-cut - it's useful to us because, in its architecture, there's a sense in which "predicting" and "fulfilling" are basically the same thing.

It also seems to me that this distinction is not very clear-cut in humans, either - that a significant part of e.g. how humans internalize moral values while growing up has to do with building up predictive models of how other people would react to you doing something and then having your decision-making be guided by those predictive models. So given that systems like GPT-4 seem to have a relatively easy time doing something similar, that feels like an update toward alignment being easier than expected.

Of course, there's a high chance that a superintelligent AI will generalize from that training data differently than most humans would. But that seems to me more like a risk of superintelligence than a risk from AI as such; a superintelligent human would likely also arrive at different moral conclusions than non-superintelligent humans would.

AI ALIGNMENT FORUM
AF

Posts

Wiki Contributions

Comments