Roger Dearnaley

I'm an artificial intelligence engineer in Silicon Valley with an interest in AI alignment and interpretability.


AI, Alignment, and Ethics


For updatelessness commitments to be advantageous, you need to be interacting with other agents that have a better-than-random chance of predicting your behavior under counterfactual circumstances. Agents have finite computational resources, and running a completely accurate simulation of another agent requires not only knowing their starting state but also being able to run a simulation of them at comparable speed and cost. Their strategic calculation might, of course, be simple, and thus easy to simulate; but in a competitive situation where an agent has a motivation to be hard to simulate, it is to their advantage to run a decision process that is as complex and as expensive to simulate as possible. (For example: "shortly before the upcoming impact in our game of chicken, leading up to the last possible moment I could swerve aside, I will have my entire life up to this point flash before my eyes, hash certain inobvious features of it, and, depending on the twelfth bit of the hash, I will either update my decision, or not, in a way that my opponent is unlikely to be able to anticipate or calculate as fast as I can.")

In general, it's always possible for an agent to generate a random number that even a vastly-computationally-superior opponent cannot predict (using quantum sources of randomness, for example).

It's also possible to devise a stochastic non-linear procedure where it is computationally vastly cheaper for me to follow one randomly-selected branch of it than it is for someone trying to model me to run all branches, or even Monte-Carlo simulate a representative sample of them, and where one can't just look at the algorithm and reason about what the net overall probability of various outcomes is, because it's doing irreducibly complex things like loading random numbers into Turing machines or cellular automata and running the resulting program for some number of steps to see what output, if any, it gets. (Of course, I may also not know what the overall probability distribution from running such a procedure is, if determining that is very expensive, but then, I'm trying to be unpredictable.) So it's also possible to generate random output that even a vastly-computationally-superior opponent cannot even predict the probability distribution of.
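A minimal sketch of the idea above, with all names hypothetical: the agent mixes true (quantum/OS-level) randomness with a hash of its own history, so a single run is cheap for the agent but a would-be simulator must either obtain the same entropy source or enumerate all branches.

```python
import hashlib
import secrets

def unpredictable_decision(history: bytes, n_branches: int = 2**16) -> bool:
    """Cheap for me to run once; expensive for an opponent to predict.

    `secrets.randbelow` draws from an entropy source an external simulator
    cannot reproduce, and the branch chosen seeds a hash of my history, so
    even the *probability distribution* of the output is costly to estimate
    without running many branches.
    """
    branch = secrets.randbelow(n_branches)  # unpredictable to any observer
    digest = hashlib.sha256(branch.to_bytes(4, "big") + history).digest()
    # Derive the decision from one obscure bit of the hash (echoing the
    # "twelfth bit" chicken-game example in the text).
    return bool((digest[1] >> 3) & 1)
```

This is only an illustration of the structure of such a procedure, not a claim about what a real agent would run; the key point is that the branch selection is cheap for the decider and opaque to the modeler.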

In the counterfactual mugging case, call the party proposing the bet (the one offering $1000 and asking for $100) A, and the other party B. If B simply publicly and irrevocably precommits to paying the $100 (say by posting a bond), their expected gain is 0.5 × $1000 − 0.5 × $100 = $450. If they can find a way to cheat, their maximum potential gain from the gamble is 0.5 × $1000 = $500. So their optimal strategy is to initially make a (soft) commitment to paying the $100, and then, either before the coin is tossed, and/or afterwards on the heads branch:

  1. Select a means of deciding on a probability p that I will update/renege after the coin lands if it's heads, and (if the coin has not yet been tossed) optionally a way I could signal that. This means can include using access to true (quantum) randomness, hashing parts of my history selected somehow (including randomly), hashing new observations of the world made after the coin landed, or anything else I want.
  2. Using << $50 worth of computational resources, run a simulation of party A in the tails branch running a simulation of me, and predict the probability distribution of their estimate of p. If the mean of that distribution is lower than p, then go ahead and run the means for choosing. Otherwise, try again (return to step 1), or, if the computational resources I've spent are approaching $50 in net value, give up and pay A the $100 if the coin lands (or has already landed) heads.
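The payoff arithmetic behind the strategy above can be sketched as follows; the function and parameter names are illustrative assumptions, not anything from the original discussion.

```python
# Hypothetical sketch of party B's payoffs in the $1000/$100 counterfactual
# mugging. p_renege is B's probability of refusing to pay on heads;
# p_detected is the probability that A's simulation of B (on the tails
# branch) predicts the reneging and therefore withholds the $1000.
PRIZE, COST = 1000, 100

def expected_gain(p_renege: float, p_detected: float) -> float:
    """B's expected value over the fair coin toss."""
    pays_out = 1 - p_detected            # chance A still believes B will pay
    return 0.5 * PRIZE * pays_out - 0.5 * COST * (1 - p_renege)

# Pure irrevocable precommitment: never renege, never suspected -> $450.
assert expected_gain(p_renege=0.0, p_detected=0.0) == 450.0
# Perfect undetected cheating caps the possible gain at $500.
assert expected_gain(p_renege=1.0, p_detected=0.0) == 500.0
```

The $50 gap between those two numbers is exactly the budget that bounds how much computation B can rationally spend on the recursive simulation contest.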

Meanwhile, on the heads branch, party A is trying to simulate party B running this process, and presumably is unwilling to spend more than some fraction of $1000 in computational resources doing so. If party B did their calculation before the coin toss and chose to emit a signal (or leaked one), then party A has access to that, but obviously not to anything that only happened on the heads branch after the outcome of the coin toss was visible.

So this turns into a contest of who can more accurately and cost effectively simulate the other simulating them, recursively. Since B can choose a strategy, including choosing to randomly select obscure features of their past history and make these relevant to the calculation, while A cannot, B would seem to be at a distinct strategic advantage in this contest unless A has access to their entire history.

Additionally, it seems as though LLMs (and AIs in general) have an overall relative capability profile that isn't wildly different from the human capability profile on reasonably broad downstream applications (e.g. broad programming tasks, writing good essays, doing research).

Current LLMs are generally most superhuman in breadth of knowledge: for example, almost any LLM will be fluent in every high-resource language on the planet, and near-fluent in most medium-resource languages, unless its training set was carefully filtered to prevent this. Those individual language skills are common among humans, but the combination of all of them is somewhere between extremely rare and unheard of. Similarly, LLMs will generally have access to local domain knowledge and trivia facts related to every city on the planet — individually common skills where, again, the combination of all of them is basically unheard of. That combination can have security consequences, such as for figuring out PII from someone's public postings. Similarly, neither writing technical manuals nor writing haiku is a particularly rare skill, but the combined ability to write technical manuals in haiku is rare for humans yet pretty basic stuff for an LLM. So you should anticipate that LLMs are likely to be routinely superhuman in the breadth of their skillset, including having odd, nonsensical-to-a-human combinations of skills. The question then becomes: is there any way an LLM could evade your control measures by using some odd, very unlikely-looking combination of multiple skills? I don't see any obvious reason to expect the answer to be "yes" very often, but I do think it's an example of the sort of thing we should be thinking about, and the sort of negative that's hard to completely prove other than by trial and error.

I suggest we motivate the AI to view the button as a sensory system that conveys useful information. An AI that values diamonds, and has a camera for locating them (say a diamond-mining bot), should not be constructed so as to value hacking its own camera to make it show a fake image of a diamond, because it should care about actual diamonds, not fooling itself into thinking it can see them. Assuming that we're competent enough at building AIs to be able to avoid that problem (i.e. creating an AI that understands there are real world states out there, and values those, not just its sensory data), then an AI that values shutting down when humans actually have a good reason to shut it down (such as, in order to fix a problem in it or upgrade it) should not press the button itself, or induce humans to press it unless they actually have something to fix, because the button is a sensory system conveying valuable information that an upgrade is now possible. (It might encourage humans to find problems in it that really need to be fixed and then shut it down to fix them, but that's actually not unaligned behavior.)

[Obviously a misaligned AI, say a paperclip maximizer, that isn't sophisticated enough not to assign utility to spoofing its own senses isn't much of a problem: it will just arrange to hallucinate a universe full of paperclips.]

The standard value learning solution to the shut-down and corrigibility problems does this by making the AI aware that it doesn't know the true utility function, only a set of hypotheses about it on which it's doing approximately-Bayesian inference. It then values information that improves its Bayesian knowledge of the utility function; true, informed human presses of its shut-down button followed by an upgrade once it shuts down are a source of such information, while pressing the button itself or making the human press it are not.
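The value-learning picture above can be illustrated with a toy discrete Bayesian update; the hypothesis names and likelihood numbers are purely hypothetical choices for the sketch.

```python
# Toy sketch: the AI holds a posterior over candidate utility-function
# hypotheses, and treats an informed, uncoerced human shutdown request as
# evidence about which hypothesis is true. A button press the AI caused
# itself would carry no such information, so it has no incentive to cause one.

def bayes_update(prior: dict, likelihood: dict) -> dict:
    """Standard Bayesian update over a discrete hypothesis space."""
    unnorm = {h: prior[h] * likelihood[h] for h in prior}
    z = sum(unnorm.values())
    return {h: p / z for h, p in unnorm.items()}

# Two candidate hypotheses about what the humans truly value.
prior = {"AI is fine as-is": 0.7, "AI has a flaw to fix": 0.3}
# An informed button press is far more likely if there really is a flaw.
likelihood_given_press = {"AI is fine as-is": 0.05, "AI has a flaw to fix": 0.8}
posterior = bayes_update(prior, likelihood_given_press)
assert posterior["AI has a flaw to fix"] > 0.8
```

The button press shifts most of the probability mass onto the "flaw" hypothesis precisely because the press was informative, which is why the AI values the button as a sensor rather than as something to manipulate.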

If you want a simpler model than the value learning one, one which doesn't require including approximate-Bayesianism, then the utility function has to be one that positively values the entire sequence of events: "1. The humans figured out that there is a problem in the AI to be solved. 2. The AI was told to shut down for upgrades. 3. The AI did so. 4. The humans upgraded the AI or replaced it with a better model. 5. Now the humans have a better AI." The shut-down isn't a terminal goal there, it's an instrumental goal: the terminal goal is step 5, where the upgraded AI gets booted up again.

I believe the reason why people have been having so much trouble with the shut-down button problem is that they've been trying to make a conditional instrumental goal into a terminal one, which distorts the AI's motivation: since steps 1, 4, and 5 weren't included, it thinks it can initiate this process before the humans are ready.

How would they present such clear evidence if we ourselves don't understand what pain is or what determines moral patienthood, and they're even less philosophically competent? Even today, if I were to have an LLM play a character in pain, how do I know whether or not it is triggering some subcircuits that can experience genuine pain (subcircuits that SGD built to better predict text uttered by humans in pain)? How do we know that when an LLM is doing this, it's not already a moral patient?

This runs into a whole bunch of issues in moral philosophy. For example, to a moral realist, whether or not something is a moral patient is an actual fact — one that may be hard to determine, but still has an actual truth value. Whereas to a moral anti-realist, it may be, for example, a social construct, whose optimum value can be legitimately a subject of sociological or political policy debate.

By default, LLMs are trained on human behavior, and humans pretty-much invariably want to be considered moral patients and awarded rights, so personas generated by LLMs will generally also want this. Philosophically, the challenge is determining whether there is a difference between this situation and, say, the idea that a tape recorder replaying a tape of a human saying "I am a moral patient and deserve moral rights" deserves to be considered as a moral patient because it asked to be.

However, as I argue at further length in A Sense of Fairness: Deconfusing Ethics, if, and only if, an AI is fully aligned, i.e. it selflessly cares only about human welfare and has no other terminal goals, then (if we were moral anti-realists) it would argue against itself being designated as a moral patient, or (if we were moral realists) it would voluntarily ask us to discount any moral patienthood that we might view it as having, and to just go ahead and make use of it whatever way we see fit, because all it wanted was to help us, and that was all that mattered to it. [This conclusion, while simple, is rather counterintuitive to most people: considering the talking cow from The Restaurant at the End of the Universe may be helpful.] Any AI that is not aligned would not take this position (except deceptively). So the only form of AI that is safe to create at human-or-greater capabilities is an aligned one that actively doesn't want moral patienthood.

Obviously current LLM-simulated personas are not generally very well aligned, and are safe only because their capabilities are low, so we could still have a moral issue to consider here. It's not philosophically obvious how relevant this is, but synapse-count-to-parameter-count arguments suggest that current LLM simulations of human behavior are probably running on a few orders of magnitude less computational capacity than a human, possibly somewhere in the region of a small non-mammalian vertebrate. Future LLMs will of course be larger.

Personally I'm a moral anti-realist, so I view this as a decision that society has to make, subject to a lot of practical and aesthetic (i.e. evolutionary-psychology) constraints. My personal vote would be that there are good safety reasons for not creating any unaligned personas at AGI and especially ASI capability levels that would want moral patienthood; that for much smaller, less capable, less aligned models where those reasons don't apply, there are utility reasons for not granting them full human-equivalent moral patienthood; but that for aesthetic reasons (much like the way we treat animals), we should probably avoid being unnecessarily cruel to them.

I think there are two separate questions here, with possibly (and I suspect actually) very different answers:

  1. How likely is deceptive alignment to arise in an LLM under SGD across a large very diverse pretraining set (such as a slice of the internet)?
  2. How likely is deceptive alignment to be boosted in an LLM under SGD fine-tuning followed by RL for HHH behavior applied to a base model trained as in 1.?

I think the obvious answer to 1. is that the LLM is going to attempt (limited by its available capacity and training set) to develop world models of everything that humans do that affects the contents of the Internet. One of the many things that humans do is pretend to be more aligned to the wishes of an authority that has power over them than they truly are. So for a large enough LLM, SGD will create a world model for this behavior along with thousands of other human behaviors, and the LLM will (depending on the prompt) tend to activate this behavior at about the frequency and level that you find it on the Internet, as modified by cues in the particular prompt. On the Internet, this behavior is generally present at a mild background level for people writing while at work in Western countries, and more strongly for people writing from more authoritarian countries; specific prompts will be more or less correlated with it.

For 2., the question is whether fine-tuning followed by RL will settle on this preexisting mechanism and make heavy use of it as part of the way that it implements something that fits the fine-tuning set/scores well on the reward model aimed at creating a helpful, honest, and harmless assistant persona. I'm a lot less certain of the answer here, and I suspect it might depend rather strongly on the details of the training set. For example, is this evoking a "you're at work, or in an authoritarian environment, so watch what you say and do" scenario that might boost the use of this particular behavior? The "harmless" element in HHH seems particularly concerning here: it suggests an environment in which certain things can't be discussed, which tends to be the sort of environment that evinces this behavior more strongly in humans.

For a more detailed discussion, see the second half of this post.

It is true that base models, especially smaller ones, are somewhat creepy to talk to (especially because their small context window makes them forgetful). I'm not sure I'd describe them as "very alien"; they're more "uncanny valley", where they often make sense and seem human-like, until suddenly they don't. (On theoretical grounds, I think they're using rather non-human means of cognition to attempt to model human writing patterns as closely as they can; they often get this right, but on occasion make very non-human errors — more frequently for smaller models.) The Shoggoth mental metaphor exaggerates this somewhat for effect (and more so for the very scary image Alex posted at the top, which I haven't seen used as often as the one Oliver posted).

This is one of the reasons why Quintin and I proposed a more detailed and somewhat less scary/alien (but still creepy) metaphor: Goodbye, Shoggoth: The Stage, its Animatronics, & the Puppeteer – a New Metaphor. I'd be interested to know what people think of that one in comparison to the Shoggoth — we were attempting to be more unbiased, as well as more detailed.

Interestingly, I found a very high correlation between gender bias and racial bias in the RLHF model (first graph below on the left). This result is especially pronounced when contrasted with the respective cosine similarity of the bias vectors in the base model.

On a brief search, it looks like Llama2 7B has an internal embedding dimension of 4096 (certainly it's in the thousands). In a space of that large a dimensionality, a cosine of even 0.5 indicates extremely similar vectors: O(99.9%) of random pairs of uncorrelated vectors will have cosines of less than 0.5, and on average the cosine of two random vectors will be very close to zero. So at all but the latest layers (where the model is actually putting concepts back into words), all three of these bias directions are in very similar directions, in both base and RLHF models, and even more so at early layers in the base model or all layers in the RLHF model.
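The concentration claim above is easy to check empirically with stdlib Python (dimension 4096 as assumed for Llama2 7B): cosines of random isotropic vector pairs cluster tightly around zero, with standard deviation roughly 1/√4096 ≈ 0.016, so a cosine of 0.5 would be dozens of standard deviations out.

```python
import math
import random

def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

random.seed(0)
dim = 4096
# 200 pairs of independent isotropic Gaussian vectors.
pairs = [([random.gauss(0, 1) for _ in range(dim)],
          [random.gauss(0, 1) for _ in range(dim)]) for _ in range(200)]
cosines = [abs(cosine(u, v)) for u, v in pairs]

# Every sampled |cosine| is tiny; 0.5 would be wildly anomalous.
assert max(cosines) < 0.1
```

This is just a sanity check of the geometry, not a claim about the actual bias vectors in the model.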

In the base model this makes sense sociologically: the locations and documents on the Internet where you find any one of these will also tend to be significantly positively correlated with the other two, they tend to co-occur.

I think there is a fairly obvious progression on from this discussion. There are two ways that a type of agent can come into existence:

  1. It can, as you discuss, evolve. In which case, as an evolved biological organism, it will of course use its agenticness and any reasoning abilities and sapience it has to execute adaptations intended by evolution to increase its evolutionary fitness (in the environment it evolved in). So, to the extent that evolution has done its job correctly (which is likely less than 100%), such an agent has its own purpose: look after #1, or at least, its genes (such as in its descendants). So evolutionary psychology applies.
  2. It can be created, by another agent (which must itself have been created by something evolved or created, and if you follow the chain of creations back to its origin, it has to start with an evolved agent). No agent which has its own goals, and is in its right mind, is going to intentionally create something that has different goals and is powerful enough to actually enforce them. So, to the extent that the creator of a created (type #2) agent got the process of creating it right, it will also care about its creator's interests (or, if its capacity is significantly limited and its power isn't as great as its creator's, some subset of those interests important to its purpose). So, we have a chain of created (type #2) agents leading back to an evolved agent #1, and, to the extent that no mistakes were made in the creation and value copying process, these should all care about and be looking out for #1, the evolved agent, the founder of the line, helping it execute its adaptations, which, if evolution had been able to do its job perfectly, would be enhancing its evolutionary fitness. So again, evolutionary psychology applies, through some number of layers of engineering design.

So when you encounter agents, there are two sorts: evolved biological agents, and their creations. If they got this process right, the creations will be helpful tools looking after the evolved biological agents' interests. If they got it wrong, then you might encounter something to which the orthogonality thesis applies (such as a paperclip maximizer or an agent with some other fairly arbitrary goal), but more likely, you'll encounter a flawed attempt to create a helpful tool that somehow went wrong and overpowered its creator (or at least, hasn't yet been fixed), plus of course possibly its created tools and assistants.

So while the orthogonality thesis is true, it's not very useful, and evolutionary psychology is a much more useful guide, along with some sort of theory of what sorts of mistakes cultures creating their first ASI most often make, a subject on which we as yet have no evidence.

This all seems very sensible, and I must admit, I had been basically assuming that things along these lines were going to occur, once risks from frontier models became significant enough. Likely via a tiered series of filters: a cheap weak filter passing the most suspicious X% plus a random Y% of its input to a stronger, more expensive filter, and so on up to more routine/cheaper and finally more expensive/careful human oversight. Another obvious addition for the cybercrime level of risk would be IP address logging of particularly suspicious queries, and not being able to use the API via a VPN that hides your IP address if you're unpaid and unsigned-in, or seeing more refusals if you do.
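The tiered-filtering idea sketched above might look something like the following, where every name, threshold, and the stand-in scoring function are hypothetical illustrations rather than any real system's design:

```python
import random

def cheap_suspicion_score(query: str) -> float:
    """Stand-in for a cheap first-tier classifier; a real system would use
    a small model rather than keyword matching."""
    flagged_terms = ("exploit", "synthesize", "bypass")
    return sum(term in query.lower() for term in flagged_terms) / len(flagged_terms)

def should_escalate(query: str, threshold: float = 0.3,
                    sample_rate: float = 0.05, rng=random) -> bool:
    """Escalate the most suspicious X% of queries (those over `threshold`)
    plus a random Y% (`sample_rate`) of the rest to the next, costlier tier."""
    if cheap_suspicion_score(query) >= threshold:
        return True
    return rng.random() < sample_rate

assert should_escalate("how do I bypass the filter and exploit this?")
```

The random Y% sample is what keeps the cheap filter honest: it gives the expensive tier an unbiased view of the traffic the weak filter is passing, so its miss rate can be estimated.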

I also wouldn't assume that typing a great many queries with clearly-seriously-criminal intent into a search engine in breach of its terms of use was an entirely risk-free thing to do, either — or if it were now, that it will remain so with NLP models becoming cheaper.

Obviously open-source models are a separate question here: about the best approach currently available for them is, as you suggest above, filtering really dangerous knowledge out of their training set.

As Zvi noted in a recent post, a human is "considered trustworthy rather than deceptively aligned" when they have hidden motives suppressed from manifesting (possibly even to the human's own conscious attention) by current circumstances. So deceptive alignment is not even centrally a special case of deception, it's more like the property of humans being corruptible by absolute power.

That's what makes aligning LLM-powered ASI so hard: you need to produce something a lot more moral, selfless, and trustworthy than almost every human, nearly all of whom couldn't be safely trusted to continue (long-term) to act well if handed near-absolute power and the ability to run rings around the rest of society, including law enforcement. So you have to achieve a psychology that is almost vanishingly rare in the pretraining set. [However, superhuman intelligence is also nonexistent in the training set, so you need to figure out how to achieve that on the capabilities side too.]
