Alignment Stream of Thought

I think the main crux is that in my mind, the thing you call the "weak version" of the argument simply is the whole argument for inner misalignment and the sharp left turn, and it is sufficient on its own. I'm not sure precisely what distinction you draw between the weak and strong versions of the argument; the rest of this comment is an attempt to figure that out.

My understanding is that in your view, having the same drive as before means also taking similar actions as before. For example, if humans have a drive for making art, in the ancestral environment this meant drawing on cave walls (maybe this helped communicate the whereabouts of food in the ancestral environment). In the modern environment, it may mean passing up a more lucrative job opportunity to be an artist, but it still means painting on some other surface. Thus the art drive, while taking almost the same kinds of actions it ever did (maybe we use acrylic paints from the store instead of grinding plants into dyes ourselves), no longer results in the same consequences in terms of communicating food locations, or surviving and having children, or whatever it may be. But this is distinct from a sharp left turn, where the actions also change drastically (from helping humans to killing humans).

I agree this is more true for some drives. However, I claim that the association between drives and behaviors is not true in general. I claim humans have a spectrum of different kinds of drives, which differ in how specifically the drive specifies behavior. At one end of the spectrum, you can imagine stuff like breathing or blinking, where it's kind of hard to even say whether we have a "breathing goal" or a clock that makes you breathe regularly--the goal is the behavior, in the same way a cup has the "goal" of holding water. At this end of the spectrum it is valid to use goal/drive and behavior interchangeably. At the other end of the spectrum are goals/drives which are very abstract and specify almost nothing about how you get there: drives like desire for knowledge and justice and altruism and fear of death.

The key thing that makes these more abstract drives special is that because they do not specifically prescribe actions, the behaviors are produced by the humans reasoning about how to achieve the drive, as opposed to behaviors being selected for by evolution directly. This means that a desire for knowledge can lead to reading books, or launching rockets, or doing crazy abstract math, or inventing Anki, or developing epistemology, or trying to build AGI, etc. None of these were specifically behaviors that evolution could have reinforced in us--the behaviors available in the ancestral environment were things like "try all the plants to see which ones are edible". Evolution reinforced the abstract drive for knowledge, and left it up to individual human brains to figure out what to do, using the various Lego pieces of cognition that evolution built for us.

This means that the more abstract drives can actually suddenly just prescribe really different actions when important facts in the world change, and those actions will look very different from the kinds of actions previously taken. To take a non-standard example: for the entire history of humanity up until quite recently, it simply was not feasible for anyone to contribute meaningfully to eradicating entire diseases (indeed, for most of human history there was no understanding of how diseases actually worked, and people often just attributed them to punishment from the gods or otherwise found some way to live with them--sometimes, as a coping mechanism, even regarding the existence of disease and death as necessary or net good). From the outside it may appear as if for the entire history of humanity there was no drive for disease eradication, and then suddenly, in the blink of an evolutionary-timescale eye, a bunch of humans developed a disease eradication drive out of nowhere, and soon thereafter smallpox stopped existing (and soon potentially malaria and polio). This involved lots of novel (on an evolutionary timescale) behaviors, like understanding and manufacturing microscopic biological things at scale, or setting up international bodies for coordination. In actuality, this was driven by the same kinds of abstract drives that have always existed, like curiosity and fear of death and altruism, not some new drive that popped into being, but it involved lots of very novel actions steering towards a very difficult target.

I don't think any of these arguments depend crucially on whether there is a sole explicit goal of the training process, or if the goal of the training process changes a bunch. The only thing the argument depends on is whether there exist such abstract drives/goals (and there could be multiple). I think there may be a general communication issue where there is a type of person that likes to boil problems down to their core, which is usually some very simple setup, but then neglects to actually communicate why they believe this particular abstraction captures the thing that matters.

I am confused by your AlphaGo argument, because "winning states of the board" looks very different depending on what kinds of tactics your opponent uses, in a very similar way to how "surviving and reproducing" looks very different depending on what kinds of hazards are in the environment. (And winning states of the board always looking like having more territory encircled seems analogous to surviving and reproducing always looking like having a lot of children.)

I think there is also a disagreement about what AlphaGo does, though this is hard to resolve without better interpretability -- I predict that AlphaGo is actually not doing that much direct optimization in the sense of an abstract drive to win that it reasons about, but rather has a bunch of random drives piled up that cover various kinds of situations that happen in Go. In fact, the biggest gripe I have with most empirical alignment research is that I think models today fail to have sufficiently abstract drives, quite possibly for reasons related to why they are kind of dumb today and why things like AutoGPT have mysteriously failed to do anything useful whatsoever. But this is a spicy claim and I think not that many other people would endorse it.

I agree with most of the factual claims made in this post about evolution. I agree that "IGF is the objective" is somewhat sloppy shorthand. However, after diving into the specific ways the object level details of "IGF is the objective" play out, I am confused about why you believe this implies the things you claim they imply about the sharp left turn / inner misalignment. Overall, I still believe that natural selection is a reasonable analogy for inner misalignment.

  • I agree fitness is not a single stationary thing. I agree this is prima facie unlike supervised learning, where the objective is typically stationary. However, it is pretty analogous to RL, especially multi-agent RL, and overall I don't think the inner misalignment argument depends on stationarity of the environment in either direction. AlphaGo might, early in training, select for policies that use tactic X because it's a good tactic against dumb Go networks; then, once all the policies in the pool learn to defend against that tactic, it is no longer rewarded. So I don't see any important disanalogy between evolution and multi-agent RL. I have various thoughts on why language models do not make RL analogies irrelevant that I can explain, but that's a completely different rabbit hole.
  • I agree that humans (to a first approximation) still have the goals/drives/desires we were selected for. I don't think I've heard anyone claim that humans have an art-creating drive that suddenly appeared out of nowhere recently, nor have I heard any argument about inner alignment that depends on an evolution analogy where this would need to be true. The argument is generally that the ancestral environment selected for some drives because, in the ancestral environment, they reliably caused something the ancestral environment selected for; in the modern environment the same drives persist, but their consequences in terms of [the amount of that which the ancestral environment was selecting for] now change, potentially drastically. I think the misconception may arise from a closely related claim that some make, which is that AI systems might develop weird arbitrary goals (tiny metallic squiggles) because any goal paired with sufficient intelligence implies playing the training game and then doing a sharp left turn. However, the claim here is not that the tiny-metallic-squiggles drive will suddenly appear at some point and replace the "make humans really happy" drive that existed previously. The claim is that the drive for tiny metallic squiggles was always, from the very beginning, the reason why [make humans really happy] was the observed behavior in environment [humans can turn you off if they aren't happy with you], and therefore in a different environment [humans can no longer turn you off], the observed behavior is [kill everyone and make squiggles].
  • I agree that everything is very complex always. I agree that there are multiple different goals/drives/desires in humans that result in children, of which the sex drive is only one. I agree that humans still have children sometimes, and still want children per se sometimes, but in practice this results in fewer and fewer children than in the ancestral environment over time (I bet even foragers are at least above replacement rate), precisely because the drives we have always had (because they caused us to survive/reproduce in the past) now correspond much less well. I also agree that infanticide exists and occurs (but also that there are counterbalancing drives, like taboos around infanticide). In general, in many cases, simplifying assumptions totally break an analogy and make the results meaningless. I just haven't been convinced that this is one of those cases.
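The non-stationarity point in the multi-agent RL bullet can be illustrated with a toy replicator-style simulation. All names and payoff numbers here (`tactic_x`, the 0.5 defender payoff, the learning rate) are my own illustrative choices, not anything from the discussion above: a tactic earns positive payoff while most of the population is naive, and negative payoff once defenders dominate, even though the selection criterion (relative payoff) never changes.

```python
# Toy sketch (hypothetical numbers, purely illustrative): q is the fraction
# of the population that defends against "tactic X". Tactic X pays off
# against naive opponents and loses against defenders.

def defender_share_trajectory(q0=0.3, lr=0.5, steps=200):
    """Track the defender fraction q and tactic X's payoff over time."""
    q = q0
    history = []
    for _ in range(steps):
        payoff_x = 1.0 - 2.0 * q  # good vs. naive opponents, bad vs. defenders
        payoff_def = 0.5          # defending pays a fixed moderate amount
        history.append((q, payoff_x))
        # Replicator-style update: defenders grow when defending pays better.
        q += lr * q * (1.0 - q) * (payoff_def - payoff_x)
        q = min(max(q, 0.0), 1.0)
    return history

hist = defender_share_trajectory()
early_payoff_x = hist[0][1]   # positive: tactic X is rewarded early on
late_payoff_x = hist[-1][1]   # negative: the same tactic is now punished
```

The environment the tactic faces (the population mix) changes under it, so the same behavior goes from rewarded to punished without the underlying selection rule ever changing.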

I don't really care about defending the usage of "fitness as the objective" specifically, so I don't think the following is a crux, and I am happy to concede some of the points below for the sake of the argument about the object-level facts of inner alignment. However, for completeness, my take on when "fitness" can reasonably be described as the objective, and when it can't:

  • I agree that couched in terms of the specific traits, the thing that evolution does in practice is sometimes favoring some traits and sometimes favoring other traits. However, I think there's an important sense in which these traits are not drawn from a hat: natural selection selects for lighter/darker moths because it makes it easier for the moths to survive and reproduce! If lighter moths become more common whenever light moths survive and reproduce better, and vice versa for dark moths, as opposed to moths just randomly becoming more light or more dark in ways uncorrelated to survival/reproduction, it seems pretty reasonable to say that survival/reproduction is closer to the thing being optimized than some particular lightness/darkness function that varies between favoring lightness and darkness.
  • I agree it is possible to do artificial selection for some particular trait like moth color, and in this case saying that the process optimizes "fitness" (or survival/reproduction) collapses to saying that the process optimizes moth lightness/darkness. I agree it would be a little weird to insist that "fitness" is the goal in this case, and that the color is the more natural goal. I also agree that the evolutionary equations play out the same way whether the source of pressure is artificial human selection or birds eating the moths. Nonetheless, I claim the step where you argue the two cases are equivalent for the purposes of whether we can consider fitness the objective is the step that breaks down. I think the difference between the two cases is that the causality flows differently. We can literally draw from a hat whether we want light moths or dark moths, and then reshape the environment until fitness lines up with our preference, whereas in the other case, the environment is drawn from a hat and the color selection is determined downstream of that.
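The causality point in the moth bullets can be sketched with a toy simulation. The survival odds below (0.9 camouflaged, 0.4 mismatched) are made-up numbers for illustration: the selection rule only ever says "camouflaged moths survive more often", and which color that favors is determined downstream by the environment.

```python
import random

# Hypothetical sketch: survival/reproduction is the invariant criterion;
# the favored color is downstream of the background.

def generation(pop, background):
    # Moths matching the background survive at 0.9, mismatched at 0.4.
    survivors = [m for m in pop if random.random() < (0.9 if m == background else 0.4)]
    # Survivors reproduce back up to the original population size.
    return [random.choice(survivors) for _ in range(len(pop))]

random.seed(0)
pop = ["light"] * 500 + ["dark"] * 500
for _ in range(30):
    pop = generation(pop, background="dark")
dark_frac = pop.count("dark") / len(pop)
```

Flipping `background="light"` flips which color takes over without changing the selection rule itself, which is the sense in which survival sits upstream of any particular color preference.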
Answer by leogao, Sep 17, 2023

Obviously I think it's worth being careful, but I think in general it's actually relatively hard to accidentally advance capabilities too much by working specifically on alignment. Some reasons:

  1. Researchers of all fields tend to do this thing where they have really strong conviction in their direction and think everyone should work on their thing. Convincing them that some other direction is better is actually pretty hard even if you're trying to shove your ideas down their throats.
  2. Often the bottleneck is not that nobody realizes that something is a bottleneck, but rather that nobody knows how to fix it. In these cases, calling attention to the bottleneck doesn't really speed things up, whereas for thinking about alignment we can reason about what things would look like if it were to be solved.
  3. It's generally harder to make progress on something by accident than it is to make progress on it on purpose when you're trying really hard. I think this is true even if there is a lot of overlap. There's also an EMH argument one could make here, but I won't spell it out.

I think the alignment community thinking correctly is essential for solving alignment. Especially because we will have very limited empirical evidence before AGI, and that evidence will not be obviously directly applicable without some associated abstract argument, any trustworthy alignment solution has to route through the community reasoning sanely.

Also to be clear I think the "advancing capabilities is actually good because it gives us more information on what AGI will look like" take is very bad and I am not defending it. The arguments I made above don't apply, because they basically hinge on work on alignment not actually advancing capabilities.

Ran this on GPT-4-base and it gets 56.7% (n=1000).

I think it's worth disentangling LLMs and Transformers and so on in discussions like this one--they are not one and the same. For instance, the following are distinct positions that have quite different implications:

  • The current precise transformer LM setup but bigger will never achieve AGI
  • A transformer trained on the language modelling objective will never achieve AGI (but a transformer network trained with other modalities or objectives or whatever will)
  • A language model with the transformer architecture will never achieve AGI (but a language model with some other architecture or training process will)

Which interventions make sense depends a lot on your precise model of why current models are not AGI, and I would consequently expect modelling things at the level of "LLMs vs not LLMs" to be less effective.

Doesn't answer your question, but we also came across this effect in the RM Goodharting work, though instead of figuring out the details we only proved that when the noise is definitely not heavy tailed, the effect is monotonic, for Regressional Goodhart. Jacob probably has more detailed takes on this than me.

In any event, my intuition is that this seems unlikely to be the main reason for overoptimization -- I think it's much more likely that it's Extremal Goodhart, or some other thing where the noise is not independent.
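To make the tail-dependence in Regressional Goodhart concrete, here's a toy sketch. All the distributional choices (standard normal values, Gaussian vs. Cauchy noise, 100 candidates) are my own illustrative assumptions, not the setup from the actual work: select the candidate with the highest proxy = true value + independent noise, and compare how much true value the selection buys under light- vs. heavy-tailed noise.

```python
import math
import random

def selected_true_value(n_candidates, noise, trials=2000):
    """Mean true value of the candidate with the highest proxy score."""
    total = 0.0
    for _ in range(trials):
        best_proxy, best_value = float("-inf"), 0.0
        for _ in range(n_candidates):
            v = random.gauss(0, 1)   # true value
            proxy = v + noise()      # proxy = value + independent noise
            if proxy > best_proxy:
                best_proxy, best_value = proxy, v
        total += best_value
    return total / trials

random.seed(0)
gaussian = lambda: random.gauss(0, 1)                          # light tailed
cauchy = lambda: math.tan(math.pi * (random.random() - 0.5))   # heavy tailed

light = selected_true_value(100, gaussian)
heavy = selected_true_value(100, cauchy)
```

With Gaussian noise, optimizing the proxy still buys substantial true value (`light` lands well above the unselected mean of 0); with Cauchy noise, the argmax of the proxy is mostly an argmax of the noise, so `heavy` stays close to 0.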

Adding $200 to the pool. Also, I endorse the existence of more bounties/contests like this.

re: 1, yeah that seems plausible; I'm thinking in the limit of really superhuman systems here, and specifically pushing back against the claim that human abstractions being somehow present inside a superhuman AI is sufficient for things to go well.

re: 2, one thing is that there are ways of drifting that we would endorse under our meta-ethics, and ways that we wouldn't. More broadly, the thing I'm focusing on in this post is not really about drift over time or self-improvement; in the setup I'm describing, the thing that goes wrong is that the AI does the classic "fill the universe with pictures of smiling humans" kind of outer alignment failure (or, worse yet and more likely if we try to build an agentic AGI, we fail to retarget the search and end up with one that actually cares about microscopic squiggles, which then does deceptive alignment using those helpful human concepts it has lying around).
