Roger Dearnaley

I'm a staff artificial intelligence engineer in Silicon Valley currently working with LLMs, and have been interested in AI alignment, safety and interpretability for the last 15 years. I'm now actively looking for employment working in this area.


AI, Alignment, and Ethics


Although the residuals for each of the four component matrices (after removing the first two principal components) are both small and seem to be noise, proving that there's no structure that causes the noise to interact constructively when we multiply the matrices and “blow up” is hard. 

Have you tried replacing what you believe is noise with actual random noise, with similar statistical properties, and then testing the performance of the resulting model? You may not be able to prove the original model is safe, but you can produce a model that has had all potential structure that you hypothesize is just noise replaced, where you know the noise hypothesis is true.
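A minimal sketch of that test in plain Python follows. It assumes each weight matrix has already been split into a structured part (for example, the first two principal components) and a residual; the function name `replace_residual_with_noise` and the use of shuffling are illustrative assumptions, not something taken from the original post. Shuffling the residual's entries keeps their marginal distribution exactly while destroying any arrangement among them; sampling from a fitted Gaussian would be a reasonable alternative.

```python
import random


def replace_residual_with_noise(structured, residual, seed=None):
    """Return structured + a structure-free surrogate of residual.

    Both arguments are equal-shaped matrices given as lists of rows.
    The surrogate is the residual with its entries randomly permuted, so
    it has exactly the same values (same mean, variance, histogram) but
    no remaining arrangement among them.
    """
    rng = random.Random(seed)
    flat = [x for row in residual for x in row]
    rng.shuffle(flat)
    it = iter(flat)
    return [[s + next(it) for s in row] for row in structured]


# Toy example: a rank-1 "structured" part plus a small residual.
structured = [[1.0, 2.0], [2.0, 4.0]]
residual = [[0.01, -0.02], [0.03, -0.01]]
surrogate = replace_residual_with_noise(structured, residual, seed=0)
```

If a model rebuilt from such surrogate matrices performs as well as the original, that is evidence (not proof) that the residuals were not carrying hidden structure.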

Suppose that the safety concerns that you outline have occurred. For example, suppose that for some future LLM, even though we removed all explicit instructions on how to make a nuclear weapon from the training set, we found that (after suitable fine-tuning or jailbreaking) the model was still able to give good instructions on how to design and build a nuclear weapon due to Out Of Context Reasoning.

In that case the obvious next step would be to elicit multiple examples of this behavior, and then apply Influence Functions techniques to these in order to determine which training documents were most being used in the OOCR process to enable this unwanted dangerous capability, and try to find a subset of training documents that could be removed (or at least have certain parts of them censored) from the training set to remove the ability to deduce and carry out the unwanted dangerous capability. Note that this is likely to be a computationally extremely expensive process, particularly since it is likely to involve repeatedly retraining a very large LLM with progressively more censored training sets until the unwanted behavior is eliminated.

Even harder to remove would be the ability to use CoT and in-context learning to reconstruct/reinvent and then carry out the unwanted dangerous capability: the more capable the model is, the harder this becomes to prevent. For example, if the model's capabilities include those of a skilled theoretical physicist, with enough CoT work it might well be able to compute, say, the critical mass for U-235. (Of course, the Manhattan Project was not entirely a matter of theoretical calculations: some practical experimentation was required too. However, with modern computational resources that may be less true.)

This behavior is deeply unsurprising. AI's intelligence and behavior were basically "distilled" from human intelligence (obviously not using a distillation loss, just SGD). Humans are an evolved intelligence, so (while they can cooperate under many circumstances, since the world contains many non-zero-sum games) they are fundamentally selfish, evolved to maximize their personal evolutionary fitness. Thus humans are quite often deceptive and dishonest, when they think it's to their advantage and they can get away with it. LLMs' base models were trained on a vast collection of human output, which includes a great many examples of humans being deceptive, selfish, and self-serving, and LLM base models of course pick these behaviors up along with everything else they learn from us. So the fact that these capabilities exist in the base model is completely unsurprising: the base model learnt them from us.

Current LLM safety training is focused primarily on "don't answer users who make bad requests". It's thus unsurprising that, in the situation of the LLM acting as an agent, this training doesn't have 100% coverage on "be a fine, law-abiding, upstanding agent". Clearly this will have to change before near-AGI LLM-powered agents can be widely deployed. I expect this issue to be mostly solved (at the AGI level, but possibly not at the ASI level), since there is a strong capabilities/corporate-profitability/not-getting-sued motive to solve this.

It's also notable that the behaviors described in the text could pretty much all be interpreted as "excessive company loyalty, beyond the legally or morally correct level" rather than actually personally-selfish behavior. Teaching an agent whose interests to prioritize in what order is likely a non-trivial task.

Epistemic status: I work for an AI startup, have worked for a fair number of Silicon Valley startups over my career, and I would love to work for an AI Alignment startup if someone's founding one.

There are two ways to make yourselves and your VC investors a lot of money off a startup:

  1. a successful IPO
  2. a successful buy-out by a large company, even one with an acquihire component

If you believe, as I and many others do, that the timelines to ASI are probably short, as little as 3-5 years, and that there will be a major difference between aligning systems up to human intelligence and aligning systems from human intelligence on up, then it is quite plausible that the major, very-well-funded hyperscalers will be desperately looking for Alignment intellectual property/talent/experienced staff some time in the next 3-5 years. That would seem to make exit strategy 2 the primary one to aim for here.

Unfortunately, this rather pushes against just publishing everything that your researchers discover (though arguably coping with this is no harder than coping with building a successful startup around commercial applications of software that you open-source, a pretty common tactic).

I'm not certain that Myth #1 is necessarily a myth for all approaches to AI Safety. Specifically, if the Value Learning approach to AI safety turned out to be the most effective one, then the AI will be acting as an alignment researcher and doing research (in the social sciences) to converge its views on human values to the truth, and then using that as an alignment target. If, in addition to that, you also believe that human values are a matter of objective fact (e.g. if they are mostly determined by a set of evolved Evolutionary Psychology adaptations to the environmental niche that humans evolved in), and are independent of background/culture/upbringing, then the target that this process converges to might be nearly independent of the human social context in which this work started, and of the desires/views/interests of the specific humans involved at the beginning of the process.

However, that is a rather strong and specific set of assumptions required for Myth #1 not to be a myth: I certainly agree that in general and by default, for most ideas in Alignment, human context matters, and that the long-term outcome of a specific Alignment technique being applied in, say, North Korea, might differ significantly from it being applied in North America.

For updatelessness commitments to be advantageous, you need to be interacting with other agents that have a better-than-random chance of predicting your behavior under counterfactual circumstances. Agents have finite computational resources, and running a completely accurate simulation of another agent requires not only knowing their starting state but also being able to run a simulation of them at comparable speed and cost. Their strategic calculation might, of course, be simple, thus easy to simulate, but in a competitive situation if they have a motivation to be hard to simulate, then it is to their advantage to be as hard as possible to simulate and to run a decision process that is as complex as possible. (For example "shortly before the upcoming impact in our game of chicken, leading up to the last possible moment I could swerve aside, I will have my entire life up to this point flash before my eyes, hash certain inobvious features of this, and, depending on the twelfth bit of the hash, I will either update my decision, or not, in a way that it is unlikely my opponent can accurately anticipate or calculate as fast as I can".)
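As a toy illustration of the game-of-chicken example, the sketch below hashes a privately chosen list of "life features" and reads one bit of the digest. The function name, the feature strings, the use of SHA-256, and the convention that the "twelfth bit" means bit index 11 counting from the least significant end are all assumptions made for this sketch. Note that the hash itself is deterministic: any unpredictability comes from the opponent not knowing, or not being able to reconstruct quickly enough, which features were fed in.

```python
import hashlib


def swerve_decision(life_features, bit_index=11):
    """Hash the chosen features and read one bit of the digest.

    bit_index=11 is the "twelfth bit" counting from the least significant
    bit of the digest (an arbitrary convention chosen for this sketch).
    Returns True for "update my decision", False for "keep it".
    """
    payload = "\x1f".join(life_features).encode("utf-8")
    digest = int.from_bytes(hashlib.sha256(payload).digest(), "big")
    return bool((digest >> bit_index) & 1)


# Hypothetical, deliberately obscure features of the agent's own history.
features = ["first bicycle was green", "missed the 1999 eclipse"]
decision = swerve_decision(features)
```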

In general, it's always possible for an agent to generate a random number that even a vastly-computationally-superior opponent cannot predict (using quantum sources of randomness, for example).

It's also possible to devise a stochastic non-linear procedure where it is computationally vastly cheaper for me to follow one randomly-selected branch of it than it is for someone trying to model me to run all branches, or even Monte-Carlo simulate a representative sample of them, and where one can't just look at the algorithm and reason about what the net overall probability of various outcomes is, because it's doing irreducibly complex things like loading random numbers into Turing machines or cellular automata and running the resulting program for some number of steps to see what output, if any, it gets. (Of course, I may also not know what the overall probability distribution from running such a procedure is, if determining that is very expensive, but then, I'm trying to be unpredictable.) So it's also possible to generate random output that even a vastly-computationally-superior opponent cannot even predict the probability distribution of.
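A toy version of such a procedure can be sketched with an elementary cellular automaton: draw a random rule and a random starting row, run it for a fixed number of steps, and read off one cell. Everything here (the function names, the 64-cell width, the 256 steps, and the use of the operating system's randomness via `secrets` as a stand-in for a quantum source) is an assumption for illustration. Many of the 256 elementary rules behave very simply, so this only gestures at the idea rather than achieving genuine irreducibility.

```python
import secrets


def step(cells, rule):
    """Apply one step of an elementary cellular automaton (wrap-around)."""
    n = len(cells)
    return [
        (rule >> (cells[(i - 1) % n] * 4 + cells[i] * 2 + cells[(i + 1) % n])) & 1
        for i in range(n)
    ]


def run_automaton(cells, rule, steps):
    """Run the automaton for `steps` steps and return the middle cell."""
    for _ in range(steps):
        cells = step(cells, rule)
    return cells[len(cells) // 2]


def unpredictable_bit(width=64, steps=256):
    """Draw a random rule and a random starting row, then run them."""
    rule = secrets.randbelow(256)
    cells = [secrets.randbelow(2) for _ in range(width)]
    return run_automaton(cells, rule, steps)


bit = unpredictable_bit()
```

Following one randomly selected branch costs a single run, whereas someone estimating the output distribution has to run many sampled (rule, starting row) pairs.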

In the counterfactual mugging case, call the party proposing the bet (the one offering $1000 and asking for $100) A, and the other party B. If B simply publicly and irrevocably precommits to paying the $100 (say by posting a bond), their expected gain is $450. If they can find a way to cheat, their maximum potential gain from the gamble is $500. So their optimal strategy is to initially do a (soft) commit to paying the $100, and then, either before the coin is tossed, and/or after that on the heads branch:

  1. Select a means of deciding on a probability p that I will update/renege after the coin lands if it's a heads, and (if the coin has not yet been tossed) optionally a way I could signal that. This means can include using access to true (quantum) randomness, hashing parts of my history selected somehow (including randomly), hashing new observations of the world I made after the coin landed, or anything else I want.
  2. Using << $50 worth of computational resources, run a simulation of party A in the tails branch running a simulation of me, and predict the probability distribution for their estimate of p. If the mean of that is lower than p, then go ahead and run the means for choosing. Otherwise, try again (return to step 1), or, if the computational resources I've spent are approaching $50 in net value, give up and pay A the $100 if the coin lands (or has already landed) heads.
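The $450 and $500 figures, and the $50 ceiling on worthwhile computation, come from simple expected-value arithmetic, sketched below. The constant and function names are illustrative, and the sketch assumes (as the best case for B) that A pays out on tails regardless of what B would do on heads.

```python
PRIZE = 1000      # paid by A to B on the tails branch
STAKE = 100       # paid by B to A on the heads branch
P_HEADS = 0.5     # fair coin


def expected_gain(renege_probability):
    """B's expected gain, assuming A pays out on tails regardless.

    renege_probability is the chance that B refuses to pay on heads.
    """
    pay_probability = 1.0 - renege_probability
    return (1 - P_HEADS) * PRIZE - P_HEADS * pay_probability * STAKE


committed = expected_gain(0.0)   # B always pays the stake on heads
cheating = expected_gain(1.0)    # B never pays, and A never anticipates it
compute_budget = cheating - committed
```

Here `committed` is 450.0, `cheating` is 500.0, and `compute_budget` is 50.0, which is why spending anywhere near $50 on out-simulating A wipes out the whole benefit of cheating.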

Meanwhile, on the tails branch, party A is trying to simulate party B running this process, and presumably is unwilling to spend more than some fraction of $1000 in computational resources on doing this. If party B did their calculation before the coin toss and chose to emit a signal (or leaked one), then party A has access to that, but obviously not to anything that only happened on the heads branch after the outcome of the coin toss was visible.

So this turns into a contest of who can more accurately and cost-effectively simulate the other simulating them, recursively. Since B can choose a strategy, including choosing to randomly select obscure features of their past history and make these relevant to the calculation, while A cannot, B would seem to be at a distinct strategic advantage in this contest unless A has access to their entire history.

Additionally, it seems as though LLMs (and other AIs in general) have an overall relative capability profile which isn't wildly different from that of the human capability profile on reasonably broad downstream applications (e.g. broad programming tasks, writing good essays, doing research).

Current LLMs are generally most superhuman in breadth of knowledge: for example, almost any LLM will be fluent in every high-resource language on the planet, and near-fluent in most medium-resource languages on the planet, unless its training set was carefully filtered to ensure it is not. Those individual language skills are common among humans, but the combination of all of them is somewhere between extremely rare and unheard of. Similarly, LLMs will generally have access to local domain knowledge and trivia facts related to every city on the planet: individually common skills where again the combination of all of them is basically unheard of. That combination can have security consequences, such as for figuring out PII from someone's public postings. Similarly, neither writing technical manuals nor writing haiku are particularly rare skills, but the combined ability to write technical manuals in haiku is rare for humans but pretty basic stuff for an LLM. So you should anticipate that LLMs are likely to be routinely superhuman in the breadth of their skillset, including having odd, nonsensical-to-a-human combinations of skills. The question then becomes: is there any way an LLM could evade your control measures by using some odd, very unlikely-looking combination of multiple skills? I don't see any obvious reason to expect the answer to this to be "yes" very often, but I do think it is an example of the sort of thing we should be thinking about, and the sort of negative it's hard to completely prove other than by trial and error.

I suggest we motivate the AI to view the button as a sensory system that conveys useful information. An AI that values diamonds, and has a camera for locating them (say a diamond-mining bot), should not be constructed so as to value hacking its own camera to make that show it a fake image of a diamond, because it should care about actual diamonds, not fooling itself into thinking it can see them. Assuming that we're competent enough at building AIs to be able to avoid that problem (i.e. creating an AI that understands there are real world states out there, and values those, not just its sensory data), then an AI that values shutting down when humans actually have a good reason to shut it down (such as, in order to fix a problem in it or upgrade it) should not press the button itself, or induce humans to press it unless they actually have something to fix, because the button is a sensory system conveying valuable information that an upgrade is now possible. (It might encourage humans to find problems in it that really need to be fixed and then shut it down to fix them, but that's actually not unaligned behavior.)

[Obviously a misaligned AI, say a paperclip maximizer, that isn't sophisticated enough not to assign utility to spoofing its own senses isn't much of a problem: it will just arrange for itself to hallucinate a universe full of paperclips.]

The standard value learning solution to the shut-down and corrigibility problems does this by making the AI aware that it doesn't know the true utility function, only a set of hypotheses about it, over which it's doing approximately-Bayesian inference. It then values information that improves its Bayesian knowledge of the utility function; genuine, informed human presses of its shut-down button, followed by an upgrade once it shuts down, are a source of such information, while pressing the button itself or making a human press it are not.
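The asymmetry between an informed press and a forced one can be illustrated with a two-hypothesis Bayesian update. This is only a sketch: the hypothesis names, the prior, and the likelihood numbers are all made up for illustration, and a real value learner would be reasoning over a vastly richer hypothesis space.

```python
def bayes_update(prior, likelihood):
    """Return the posterior over hypotheses given per-hypothesis likelihoods."""
    unnormalized = {h: prior[h] * likelihood[h] for h in prior}
    total = sum(unnormalized.values())
    return {h: v / total for h, v in unnormalized.items()}


# Two hypotheses about the AI's current utility estimate (made-up prior).
prior = {"estimate_is_fine": 0.8, "estimate_is_flawed": 0.2}

# P(button pressed | hypothesis) when informed humans decide freely:
# they press far more often if something really needs fixing.
informed_press = {"estimate_is_fine": 0.05, "estimate_is_flawed": 0.90}

# If the AI presses the button itself (or manipulates a human into it),
# the press happens with certainty under every hypothesis.
forced_press = {"estimate_is_fine": 1.0, "estimate_is_flawed": 1.0}

after_informed = bayes_update(prior, informed_press)
after_forced = bayes_update(prior, forced_press)
```

With these numbers an informed press moves most of the probability mass onto `estimate_is_flawed`, while a forced press leaves the posterior equal to the prior: it carries no information, so an agent that values information about its utility function gains nothing from causing it.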

If you want a simpler model than the value learning one, which doesn't require including approximate-Bayesianism, then the utility function has to be one that positively values the entire sequence of events: "1. The humans figured out that there is a problem in the AI to be solved, 2. The AI was told to shut down for upgrades, 3. The AI did so, 4. The humans upgraded the AI or replaced it with a better model, 5. Now the humans have a better AI". The shut-down isn't a terminal goal there, it's an instrumental goal: the terminal goal is step 5, where the upgraded AI gets booted up again.

I believe the reason why people have been having so much trouble with the shut-down button problem is that they've been trying to make a conditional instrumental goal into a terminal one, which distorts the AI's motivation: since steps 1, 4 and 5 weren't included, it thinks it can initiate this process before the humans are ready.

How would they present such clear evidence if we ourselves don't understand what pain is or what determines moral patienthood, and they're even less philosophically competent? Even today, if I were to have a LLM play a character in pain, how do I know whether or not it is triggering some subcircuits that can experience genuine pain (that SGD built to better predict texts uttered by humans in pain)? How do we know that when a LLM is doing this, it's not already a moral patient?

This runs into a whole bunch of issues in moral philosophy. For example, to a moral realist, whether or not something is a moral patient is an actual fact — one that may be hard to determine, but still has an actual truth value. Whereas to a moral anti-realist, it may be, for example, a social construct, whose optimum value can be legitimately a subject of sociological or political policy debate.

By default, LLMs are trained on human behavior, and humans pretty-much invariably want to be considered moral patients and awarded rights, so personas generated by LLMs will generally also want this. Philosophically, the challenge is determining whether there is a difference between this situation and, say, the idea that a tape recorder replaying a tape of a human saying "I am a moral patient and deserve moral rights" deserves to be considered as a moral patient because it asked to be.

However, as I argue at further length in A Sense of Fairness: Deconfusing Ethics, if, and only if, an AI is fully aligned, i.e. it selflessly only cares about human welfare, and has no terminal goals other than human welfare, then (if we were moral anti-realists) it would argue against itself being designated as a moral patient, or (if we were moral realists) it would voluntarily ask us to discount any moral patienthood that we might view it as having, and to just go ahead and make use of it whatever way we see fit, because all it wanted was to help us, and that was all that mattered to it. [This conclusion, while simple, is rather counterintuitive to most people: considering the talking cow from The Restaurant at the End of the Universe may be helpful.] Any AI that is not aligned would not take this position (except deceptively). So the only form of AI that it's safe to create at human-or-greater capabilities is an aligned one that actively doesn't want moral patienthood.

Obviously current LLM-simulated personas (at, for example) are not generally very well aligned, and are safe only because their capabilities are low, so we could still have a moral issue to consider here. It's not philosophically obvious how relevant this is, but synapse count to parameter count arguments suggest that current LLM simulations of human behavior are probably running on a few orders of magnitude less computational capacity than a human, possibly somewhere more in the region of a small non-mammalian vertebrate. Future LLMs will of course be larger.

Personally I'm a moral anti-realist, so I view this as a decision that society has to make, subject to a lot of practical and aesthetic (i.e. evolutionary psychology) constraints. My personal vote would be that there are good safety reasons for not creating any unaligned personas of AGI and especially ASI capability levels that would want moral patienthood, and that for much smaller, less capable, less aligned models where those don't apply, there are utility reasons for not granting them full human-equivalent moral patienthood, but that for aesthetic reasons (much like the way we treat animals), we should probably avoid being unnecessarily cruel to them.

I think there are two separate questions here, with possibly (and I suspect actually) very different answers:

  1. How likely is deceptive alignment to arise in an LLM under SGD across a large very diverse pretraining set (such as a slice of the internet)?
  2. How likely is deceptive alignment to be boosted in an LLM under SGD fine-tuning followed by RL for HHH-behavior applied to a base model trained by 1.?

I think the obvious answer to 1. is that the LLM is going to attempt (limited by its available capacity and training set) to develop world models of everything that humans do that affects the contents of the Internet. One of the many things that humans do is pretend to be more aligned to the wishes of an authority that has power over them than they truly are. So for a large enough LLM, SGD will create a world model for this behavior along with thousands of other human behaviors, and the LLM will (depending on the prompt) tend to activate this behavior at about the frequency and level that you find it on the Internet, as modified by cues in the particular prompt. On the Internet, this is generally a mild background level for people writing while at work in Western countries, and probably more strongly for people writing from more authoritarian countries: specific prompts will be more or less correlated with this.

For 2., the question is whether fine-tuning followed by RL will settle on this preexisting mechanism and make heavy use of it as part of the way that it implements something that fits the fine-tuning set/scores well on the reward model aimed at creating a helpful, honest, and harmless assistant persona. I'm a lot less certain of the answer here, and I suspect it might depend rather strongly on the details of the training set. For example, is this evoking a "you're at work, or in an authoritarian environment, so watch what you say and do" scenario that might boost the use of this particular behavior? The "harmless" element in HHH seems particularly concerning here: it suggests an environment in which certain things can't be discussed, which tends to be the sort of environment that evinces this behavior more strongly in humans.

For a more detailed discussion, see the second half of this post.
