Sequences

Alignment Stream of Thought

Wiki Contributions

Comments

For your dashboards, how many tokens are you retrieving the top examples from?

Why do you scale your MSE by 1/(x_centred**2).sum(dim=-1, keepdim=True).sqrt() ? In particular, I'm confused about why you have the square root. Shouldn't it just be 1/(x_centred**2).sum(dim=-1, keepdim=True)?

I think this paper is empirical evidence for a nontrivial part of the deceptive alignment argument (RLHF/adversarial training being insufficient to remove it), and I also think most empirical papers don't make any sense when applied to AGI.

I think I have an intellectually consistent stance - I don't think this is because I have a double standard for pessimistic results.

First, suppose you did an experiment where you show models that usually kick puppies and hide a sleeper agent that suddenly becomes helpful and harmless in 2024, and adversarial training failing to remove this. I think I would draw the exact same conclusion about deceptive alignment from this experiment where the labels are painted on differently but the mechanics are the same. And just as I think it is invalid to conclude from the sleeper agent paper that models naturally want to insert backdoors in code even if they're harmless now, it is also invalid to argue from this hypothetical experiment that models naturally want to be helpful even if you try to train them to kick puppies.

Second, I think this paper is actually genuinely better evidence for deceptive alignment than many of the "deception" papers that came before. For example, I claim that the sycophancy and insider trading papers provide approximately no evidence for deceptive alignment. This is for exactly the same reason why I think showing RLHF making models harmless provides approximately no evidence against deceptive alignment. So I don't think it's true that I like empirical papers as long as they purport to support the deceptive alignment argument.

The reasons I think this paper is actually better than the other deception papers (beyond just quality of execution) are that the deceptive alignment in this setup happens for reasons more similar to why it might happen in AGI than in previous work, and the secret scratchpad setting seeming more analogous to AGI than single shot or visible scratchpad.

The training set is a random 100k subsample of this dataset: https://huggingface.co/datasets/amazon_polarity

I'm prepending Alice/Bob and doing the xor of the label in exactly the same way you do.

I'm having some trouble replicating this result in a not exactly comparable setting (internal model, looking at is_alice xor amazon_sentiment). I get 90%+ on the constituent datasets, but only up to 75% on the xor depending on which layer I look at.

(low confidence, will update as I get more results)

I think deceptive alignment is still reasonably likely despite evidence from LLMs.

I agree with:

  • LLMs are not deceptively aligned and don't really have inner goals in the sense that is scary
  • LLMs memorize a bunch of stuff
  • the kinds of reasoning that feed into deceptive alignment do not predict LLM behavior well
  • Adam on transformers does not have a super strong simplicity bias
  • without deceptive alignment, AI risk is a lot lower
  • LLMs not being deceptively aligned provides nonzero evidence against deceptive alignment (by conservation of evidence)

I predict I could pass the ITT for why LLMs are evidence that deceptive alignment is not likely.

however, I also note the following: LLMs are kind of bad at generalizing, and this makes them pretty bad at doing e.g novel research, or long horizon tasks. deceptive alignment conditions on models already being better at generalization and reasoning than current models.

my current hypothesis is that future models which generalize in a way closer to that predicted by mesaoptimization will also be better described as having a simplicity bias.

I think this and other potential hypotheses can potentially be tested empirically today rather than only being distinguishable close to AGI

I agree with the spirit of the post but not the kinda clickbaity title. I think a lot of people are over updating on single forward pass behavior of current LLMs. However, I think it is still possible to get evidence using current models with careful experiment design and being careful with what kinds of conclusions to draw.

I think the main crux is that in my mind, the thing you call the "weak version" of the argument simply is the only and sufficient argument for inner misalignment and very sharp left turn. I am confused precisely what distinction you draw between the weak and strong version of the argument; the rest of this comment is an attempt to figure that out.

My understanding is that in your view, having the same drive as before means also having similar actions as before. For example, if humans have a drive for making art, in the ancestral environment this means drawing on cave walls (maybe this helped communicate the whereabouts of food in the ancestral environment). In the modern environment, this may mean passing up a more lucrative job opportunity to be an artist, but it still means painting on some other surface. Thus, the art drive, taking almost the same kinds of actions it ever did (maybe we use acrylic paints from the store instead of grinding plants into dyes ourselves), no longer results in the same consequences in amount of communicating food locations or surviving and having children or whatever it may be. But this is distinct from a sharp left turn, where the actions also change drastically (from helping humans to killing humans).

I agree this is more true for some drives. However, I claim that the association between drives and behaviors is not true in general. I claim humans have a spectrum of different kinds of drives, which differ in how specifically the drive specifies behavior. At one end of the spectrum, you can imagine stuff like breathing or blinking where it's kind of hard to even say whether we have a "breathing goal" or a clock that makes you breath regularly--the goal is the behavior, in the same way a cup has the "goal" of holding water. At this end of the spectrum it is valid to use goal/drive and behavior interchangeably. At the other end of the spectrum are goals/drives which are very abstract and specify almost nothing about how you get there: drives like desire for knowledge and justice and altruism and fear of death.

The key thing that makes these more abstract drives special is that because they do not specifically prescribe actions, the behaviors are produced by the humans reasoning about how to achieve the drive, as opposed to behaviors being selected for by evolution directly. This means that a desire for knowledge can lead to reading books, or launching rockets, or doing crazy abstract math, or inventing Anki, or developing epistemology, or trying to build AGI, etc. None of these were specifically behaviors that evolution could have reinforced in us--the behaviors available in the ancestral environment were things like "try all the plants to see which ones are edible". Evolution reinforced the abstract drive for knowledge, and left it up to individual human brains to figure out what to do, using the various Lego pieces of cognition that evolution built for us.

This means that the more abstract drives can actually suddenly just prescribe really different actions when important facts in the world change, and those actions will look very different from the kinds of actions previously taken. To take a non-standard example, for the entire history of the existence of humanity up until quite recently, it just simply has not been feasible for anyone to contribute meaningfully to eradicating entire diseases (indeed, for most of human history there was no understanding of how diseases actually worked, and people often just attributed it to punishment of the gods or otherwise found some way to live with it, and sometimes, as a coping mechanism, to even think the existence of disease and death necessary or net good). From the outside it may appear as if for the entire history of humanity there was no drive for disease eradication, and then suddenly in the blink of an evolutionary timescale eye a bunch of humans developed a disease eradication drive out of nowhere, and then soon thereafter suddenly smallpox stopped existing (and soon potentially malaria and polio). These will have involved lots of novel (on evolutionary timescale) behaviors like understanding and manufacturing microscopic biological things at scale, or setting up international bodies for coordination. In actuality, this was driven by the same kinds of abstract drives that have always existed like curiosity and fear of death and altruism, not some new drive that popped into being, but it involved lots of very novel actions steering towards a very difficult target.

I don't think any of these arguments depend crucially on whether there is a sole explicit goal of the training process, or if the goal of the training process changes a bunch. The only thing the argument depends on is whether there exist such abstract drives/goals (and there could be multiple). I think there may be a general communication issue where there is a type of person that likes to boil problems down to their core, which is usually some very simple setup, but then neglects to actually communicate why they believe this particular abstraction captures the thing that matters.

I am confused by your AlphaGo argument because "winning states of the board" looks very different depending on what kinds of tactics your opponent uses, in a very similar way to how "surviving and reproducing" looks very different depending on what kinds of hazards are in the environment. (And winning winning states of the board always looking like having more territory encircled seems analogous to surviving and reproducing always looking like having a lot of children)

I think there is also a disagreement about what AlphaGo does, though this is hard to resolve without better interpretability -- I predict that AlphaGo is actually not doing that much direct optimization in the sense of an abstract drive to win that it reasons about, but rather has a bunch of random drives piled up that cover various kinds of situations that happen in Go. In fact, the biggest gripe I have with most empirical alignment research is that I think models today fail to have sufficiently abstract drives, quite possibly for reasons related to why they are kind of dumb today and why things like AutoGPT mysteriouly have failed to do anything useful whatsoever. But this is a spicy claim and I think not that many other people would endorse this.

I agree with most of the factual claims made in this post about evolution. I agree that "IGF is the objective" is somewhat sloppy shorthand. However, after diving into the specific ways the object level details of "IGF is the objective" play out, I am confused about why you believe this implies the things you claim they imply about the sharp left turn / inner misalignment. Overall, I still believe that natural selection is a reasonable analogy for inner misalignment.

  • I agree fitness is not a single stationary thing. I agree this is prima facie unlike supervised learning, where the objective is typically stationary. However, it is pretty analogous to RL, and especially multi agent RL, and overall I don't think of the inner misalignment argument as depending on stationarity of the environment in either direction. AlphaGo might early in training select for policies that do tactic X initially because it's a good tactic to use against dumb Go networks, and then once all the policies in the pool learn to defend against that tactic it is no longer rewarded. Therefore I don't see any important disanalogy between evolution and multi agent RL. I have various thoughts on why language models do not make RL analogies irrelevant that I can explain but that's another completely different rabbit hole.
  • I agree that humans (to a first approximation) still have the goals/drives/desires we were selected for. I don't think I've heard anyone claim that humans suddenly have an art creating drive that suddenly appeared out of nowhere recently, nor have I heard any arguments about inner alignment that depend on an evolution analogy where this would need to be true. The argument is generally that the ancestral environment selected for some drives that in the ancestral environment reliably caused something that the ancestral environment selected for, but in the modern environment the same drives persist but their consequences in terms of [the amount of that which the ancestral environment was selecting for] now changes, potentially drastically. I think the misconception may arise from a closely related claim that some make, which is that AI systems might develop weird arbitrary goals (tiny metallic squiggles) because any goal with sufficient intelligence implies playing the training game and then doing a sharp left turn. However, the claim here is not that the tiny metallic squiggles drive will suddenly appear at some point and replace the "make humans really happy" drive that existed previously. The claim is that the drive for tiny metallic squiggles was always, from the very beginning, the reason why [make humans really happy] was the observed behavior in environment [humans can turn you off if they aren't happy with you], and therefore in a different environment [humans can no longer turn you off], the observed behavior is [kill everyone and make squiggles].
  • I agree that everything is very complex always. I agree that there are multiple different goals/drives/desires in humans that result in children, of which the sex drive is only one. I agree that humans still have children sometimes, and still want children per se sometimes, but in practice this results in less and less children than in the ancestral environment over time (I bet even foragers are at least above replacement rate) for exactly the reason that the drives that we have always had for the reason that they caused us to survive/reproduce in the past now correspond much less well. I also agree that infanticide exists and occurs (but in the ancestral environment, there are counterbalancing drives like taboos around infanticide). In general, in many cases, simplifying assumptions totally break the analogy and make the results meaningless. I don't think I've been convinced that this is one of those cases.

I don't really care about defending the usage of "fitness as the objective" specifically, and so I don't think the following is a crux and am happy to concede some of the points below for the sake of argument about the object facts of inner alignment. However, for completeness, my take on when "fitness" can be reasonably described as the objective, and when it can't be:

  • I agree that couched in terms of the specific traits, the thing that evolution does in practice is sometimes favoring some traits and sometimes favoring other traits. However, I think there's an important sense in which these traits are not drawn from a hat- natural selection selects for lighter/darker moths because it makes it easier for the moths to survive and reproduce! If lighter moths become more common whenever light moths survive and reproduce better, and vice versa for dark moths, as opposed to moths just randomly becoming more light or more dark in ways uncorrelated to survival/reproduction, it seems pretty reasonable to say that survival/reproduction is closer to the thing being optimized than some particular lightness/darkness function that varies between favoring lightness and darkness.
  • I agree it is possible to do artificial selection for some particular trait like moth color and in this case saying that the process optimizes "fitness" (or survival/reproduction) collapses to saying the same thing as the process optimizes moth lightness/darkness. I agree it would be a little weird to insist that "fitness" is the goal in this case, and that the color is the more natural goal. I also agree that the evolutionary equations plays out the same way whether the source of pressure is artificial human selection or birds eating the moths. Nonetheless, I claim the step where you argue the two cases are equivalent for the purposes of whether we can consider fitness the objective is the step that breaks down. I think the difference between this case and the previous case is that the causality flows differently. We can literally draw from a hat whether we want light moths or dark moths, and then reshape the environment until fitness lines up with our preference for darkness, whereas in the other case, the environment is drawn from a hat and the color selection is determined downstream of that.
Load More