Roger Dearnaley

I'm a machine learning engineer in Silicon Valley with an interest in AI alignment and safety.


Sorted by New


Attempting to summarize the above, the author's projection (which looks very reasonable to me) is that for a reasonable interpretation of the phrase 'Transformative Artificial Intelligence' (TAI), i.e. an AI that could have a major transformative effect on human society, we should expect it by around 2030. So we should expect accelerating economic, social, and political upheavals in around that time-frame.

For the phrase Artificial General Intelligence (AGI), things are a little less clear, depending on what exactly you mean by AGI: the author's projection is that we should expect AI that matches or exceeds the usual range of human ability in many respects, and is clearly superhuman in many respects, but where there do still exist certain mental tasks that some humans are superior to AI at (and quite possibly also many physical tasks humans are still superior at, thanks to Moravec's Paradox).

The consequences of this analysis for alignment timelines are less clear, but my impression is that if any significant fraction of the AI capabilities the author is forecasting for 2030 are badly misaligned, we could easily be in deep trouble. Similarly, if any significant fraction of this set of capabilities were being actively misused under human direction, that could also cause serious trouble (where my definition of 'misuse' includes things like natural progressions from what the advertising industry or political consultants already do, but carried out much more effectively using much greater persuasion abilities enabled by AI).

Having done some small-scale experiments in the state-of-the-art models GPT-4, Claude, and (PaLM 2-era) Bard, my impressions are:

  • They are each entirely capable of being prompted to write text, or poetry, from the viewpoint of someone feeling a specified emotion, and to me the results reliably plausibly evoke that emotion
  • Feeding this AI-generated emotional text back into a fresh context, they are reliably able to identify the emotional state of the writer as the chosen emotion (or at least a near cognate of it, like "happiness" -> "joy" or "spite" -> "anger"). This is particularly easy from poetry.
  • Even Falcon 40B can do both of these, but a lot less reliably. [And it's a lousy poet.]
  • SotA models with safety training seemed to have no trouble simulating the writing of someone feeling an unaligned emotion such as spite. (Their anger-emulation did mostly seem to be justified anger at injustice, so the RL training may possibly have had some effect. However their spite was actually spiteful.)
  • The SotA models were even able to do both of these tasks fairly reliably for sociopathy (antisocial personality disorder), a non-neurotypical condition shared by ~1%-3% of the population (though the resulting poems did read rather like they had been written while consulting the DSM checklist for anti-social personality disorder). [Reading sociopathic-sounding text written by an AI was almost as chilling as reading (honest/undisguised) writing by an actual human sociopath — yes, I found some for comparison.]
  • The SotA models had a little more trouble when asked to write and to identify poetry about emotion A written from the viewpoint of someone feeling a conflicting emotion B, but were mostly able to get this distinction correct for both generating text and for identifying both the emotion it was about and the emotion of the writer. So they do appear to have separate or dual-purpose models for emotion in general as a topic of discussion vs. the actual emotional state of the writer, albeit with some crosstalk between them. Falcon 40B on the other hand appeared to be unable to do this more challenging task with any significant reliability.
  • I haven't yet tried them on identifying two different people's emotions from a dialog between two people with different emotional states.

My impression is that, for use cases such as using an LLM in a scaffolded agent to do things like planning, listing subtasks, idea generation, critical feedback, and so forth, it would be vitally important to avoid it generating text from the viewpoint of someone feeling various negative/unaligned emotions (or indeed, from the viewpoint of a sociopath). The shoggoth can emulate the full range of human emotions and mindests, even non-neurotypical ones, so be very sure that you get it to put the right mask on before using it to power an agent.

In practice, just prompting the LLM into emulating a suitably caring, honest, and helpful emotional state is likely to usually be an acceptable solution to this (at least unless handling large amounts of text that might, for an LLM's simulation of a human's emotions, induce a switch to a different emotion). However, to get a reliable emotional state that couldn't be accidentally 'jail-broken' by a lot of annoying or depressing text following the prompt, I'd feel a lot more confident using something more thorough, like RL or something built on using interpretability to locate the parts of the emotional model.

Note that for entertainment uses, like powering Character.AI's characters, you wouldn't want or need this — there it's desirable if they can accurately portray a full range of human emotions. I just don't think we want an powerful AGI that has a emulated temper it could lose, or that can emulate human deceit, greed, ambition, or lust.

Obviously, an AGI could still reinvent deceit, greed, or ambition, as convergent instrumental strategies. But currently we're handing it a fully functional library of all of these preinstalled inside our LLM. This seems deeply unwise. Ideally I'd like to power an agent with an LLM that intellectually understood the concept of deceit, but was incapable of accurately emulating a human being deceitful, because it was just too darned honest.

Intuitively, the best way to do this would be to build “sensors” and “effectors” to have inputs and outputs and then have some program decide what the effectors should do based on the input from the sensors.


I think this is extremely hard to impossible in Conways' Life, if the remaining space is full of ash (if it's empty, then it's basically trivial, just a matter of building a lot of large logic circuits, so basically all you need is a suitable compiler, and Life enthusiasts have some pretty good ones). The problem with this is that there is no way in Life to probe an area short of sending out an influence to probe it (e.g. fire some pattern of colliding gliders at it and see what gliders you get back). Establishing whether it contains empty space or ash is easy enough. But if it contains ash, probing this will perturb it, and generally also cause it to grow, and it's highly unpredictable how far the effect of any probe spreads or how long it lasts. Meanwhile, the active patches you're creating in the ash are randomly firing unexpected gliders and spaceships back at you, which you need to shield against or avoid being in line of fire. I think in practice it's going to be somewhere between impossible and astoundingly difficult to probe random ash well enough to identify what it is so you can figure out how do a two-sided disassembly on it, because in probing it you make it mutate and grow. So I think clearing a large area of random ash to make space is an insoluble problem in Life.

Fundamentally, Conway's Life is a hostile environment for replicators unless it's completely empty, or at least has extremely predictable contents. Like most cellar automata, it doesn't have an equivalent of "low energy physics".

This is very interesting: thanks for plotting it.

However, there is something that's likely to happen that might perturb this extrapolation. Companies building large foundation models are likely soon going to start building multimodal models (indeed, GPT-4 is already multimodal, since it understands images as well as text). This will happen for at least three inter-related reasons:

  1. Multimodal models are inherently more useful, since they also understand some combination of images, video, music... as well as text, and the relationships between them.
  2. It's going to be challenging to find orders of magnitude more high-quality text data than exists on the Internet, but there are huge amounts of video and image data (YouTube, TV and cinema, Google Street View, satellite images, everything any Tesla's cameras have ever uploaded, ...), and it seems that the models of reality needed to understand/predict text, images, and video overlap and interact significantly and usefully.
  3. It seems likely that video will give the models better understanding of commonsense aspects of physical reality important to humans (and humanoid robots): humans are heavily visual, and so are a lot of things in the society we've built

The question then is, does a thousand tokens-worth of text, video, and image data teach the model the same net amount? It seems plausible that video or image data might require more input to learn the same amount (depending on details of compression and tokenization), in which case training compute requirements might increase, which could throw the trend lines off. Even if not, the set of skills the model is learning will be larger, and while some things it's learning overlap between these, others don't, which could also alter the trend lines.

I've been thinking for a while that one could do syllabus learning for LLMs. It's fairly easy to classify text by reading age. So start training the LLM on only text with a low reading age, and then increase the ceiling on reading age until it's training on the full distribution of text. ( experimented with curriculum learning in early LLMs, with little effect, but oddly didn't test reading age.)

To avoid distorting the final training distribution by much, you would need to be able to raise the reading age limit fairly fast, so by the time it's reached maximum you're only used up say ten percent of the text with low reading ages, so then in the final training distribution those're only say ten percent underrepresented. So the LLM is still capable of generating children's stories if needed (just slightly less likely to do so randomly).

The hope is that this would improve quality faster early in the training run, to sooner get the LLM to a level where it can extract more benefit from even the more difficult texts, so hopefully reach a slightly higher final quality from the same amount of  training data and compute. Otherwise for those really difficult texts that happen to be used early on in the training run, the LLM presumably gets less value from them than if they'd been later in the training. I'd expect any resulting improvement to be fairly small, but then this isn't very hard to do.

A more challenging approach would be to do the early training on low-reading-age material in a smaller LLM, potentially saving compute, and then do something like add more layers near the middle, or distill the behavior of the small LLM into a larger one, before continuing the training. Here the aim would be to also save some compute during the early parts of the training run. Potential issues would be if the distillation process or loss of quality from adding new randomly-initialized layers ended up costing more compute/quality than we'd saved/gained.

[In general, the Bitter Lesson suggests that sadly the time and engineering effort spent on these sorts of small tweaks might be better spent on just scaling up more.]

As long as all the agentic AGIs people are building are value learners (i.e. their utility function is hard-coded to something like "figure out what utility function humans in aggregate would want you to use if they understood the problem better, and use that"), then improving their understanding of the human values becomes a convergent instrumental strategy for them: obviously the better they understand the human-desired utility function, the better job they can do of optimizing it. In particular, if AGI's capabilities are large, and as a result many of the things it can do are outside the region of validity of its initial model of human values, and also it understands the concept of the region of validity of a model (a rather basic, obviously required capability for an AGI that can do research, so this seems like a reasonable assumption), then it can't use most of its capabilities safely, so solving that problem obviously becomes top priority. This is painfully obvious to us, so it should also be painfully obvious to an AGI capable of doing research.

In that situation, a fast takeoff should just cause you to get an awful lot of AGI intelligence focused on the problem of solving alignment. So, as the author mentions, perhaps we should be thinking about how we would maintain human supervision in that eventuality? That strikes me as a particular problem that I'd feel more comfortable to have solved by a human alignment researcher than an AGI one.

I'm not an ethical philosopher, but my intuition, based primarily on personal experience, is that deontological ethics are a collection of heuristic rules of thumb extracted from the average answers of utilitarian ethics applied to a common range of situations that often crop up between humans. (I also view this as a slightly-idealized description of the legal system.) As such, they're useful primarily in the same ways that heuristics often are useful compared to actually calculating a complex function, by reducing computational load. For people, they also provide useful markers to avoid 'slippery slope' situations where personal benefit might encourage you to err on one side in a complex estimation/calculation of overall utility. They also provide a way of trying to settle arguments: "I didn't break any deontological ethical rules" is often a defense in the court of public opinion, and is often less contentious than "utilitarian ethics support my actions".

As such, my feeling is that for a powerful AGI, it should have better ability to handle computational load than a human, it is more likely to encounter situations that are 'out of distribution' (atypical or even weird) compared to a human, which might take these heuristics outside their range of validity, it ought to be more capable of computing a utility function without personal bias, and it is likely to be smart enough to find ways to 'rules lawyer' corner cases that the deontological heuristics don't handle well. So for a sufficiently smart AGI, I would strongly suspect that even well-implemented deontological ethics would be more dangerous than well-implemented utilitarian ethics. But I'm mostly working from software-engineer intuition, that I don't really trust a spaghetti-code ball of heuristics — so this isn't a philosophical argument.

However, for less capable AI systems, ones not powerful enough to run a good utilitarian value function, a set of deontological ethical heuristics (and also possibly-simplified summaries of relevant laws) might well be useful to reduce computational load, if these were carefully crafted to cover the entire range of situations that they are likely to encounter (and especially with guides for identifying when a situation was outside that range and it should consult something more capable). However, the resulting collection of heuristics might look rather different from the deontological ethical rules I'd give a human child.

More broadly, most people in the AI alignment space that I've seen approaching the problem of either describing human values to an AI, or having it learn them, have appeared to view ethics from a utilitarian/consequentialist rather than a deontological perspective, and tend to regard this prospect as very challenging and complex — far more so than if you just had to teach the machine a list of deontological ethical rules. So my impression is that most people in AI safety and alignment are not using a deontological viewpoint — I'd love to hear it that has been your experience too? Indeed, my suspicion is that many of them would view that as either oversimplified, or unlikely to continue to continue to work well as rapid technological change enabled by AGI caused a large number of new ethical conundrums to appear that we don't yet have a social consensus on deontological rules for.

For example, my personal impression is that many human societies are still arguing about changes in deontological ethics in response to the easy availability of birth control, something that we've had for O(60) years. In the presence of AGI, rates of  technological change could well increase massively, and we could face ethical conundrums far more complex than those posed by birth control.

I thought this was a very interesting paper — I particularly liked the relationship to phase transitions.

However, I think there's a likely to be another 'phase' that they don't discuss (possibly it didn't crop up in their small models, since it's only viable in a sufficiently large model): one where you pack a very large number of features (thousands or millions, say) into a fairly large number of dimensions (hundreds, say). In spaces with dimensionality >= O(100), the statistics of norm and dot product are such that even randomly chosen unit norm vectors are almost invariably nearly orthogonal. So gradient descent would have very little work to do to find a very large number of vectors (much larger than the number of dimensions) that are all mutually almost-orthogonal, so that show very little interference between them. This is basically the limiting case of the pattern observed in the paper of packing n features in superposition into d dimensions where n > d >= 1, taking this towards the limit where n >> d >> 1.

Intuitively this phase seems particularly likely in a context like the residual stream of an LLM, where (in theory, if not in practice) the embedding space is invariant under arbitrary rotations, so there's no obvious reason to expect vectors to align with the coordinate axes. On the other hand, in a system where there was a preferred basis (such as a system using L1 regularization), you might get such vectors that were themselves sparse, with most components zero but a significant number of non-zero components, enough for the randomness to still give low interference.

More speculatively, in a neural net that was using at least some of its neurons in this high-dimensionality dense superposition phase, the model will presumably learn ways to manipulate these vectors to do computation in superposition. One possibility for this might be methods comparable to some of the possible Vector Symbolic Architectures (also known as hyperdimensional computing) outlined in e.g. Of the primitives used in that, a fully connected layer can clearly be trained to implement both addition of vectors and permutations of their elements, I suspect something functionally comparable to the vector elementwise-multiplication (Hadamard product) operation could be produced by using the nonlinearity of a smooth activation function such as GELU or Swish, and I suspect their their clean-up memory operation could be implemented using attention. If it turned out to be the case that SGD actually often finds solutions of this form, then an understanding of vector symbolic architectures might be helpful for interpretability of models where portions of them used this phase. This seems most likely in models that need to pack vast numbers of features into large numbers of dimensions, such as modern large LLMs.

An interesting paper on successfully distinguishing different mechanisms inside image classification models: — for this small model they correspond to different, disconnected local minimal of the loss function (I assume basically because it only has enough capacity to implement one strategy really well, so it has to pick one). They even outline approaches to move models from one mechanism that doesn't generalize well to another that does.

I don't immediately see how to extend this to the sort of different mechanisms that Paul was discussing, but it feels like it might be relevant; albeit the mechanisms might be a lot less clearly separable on something as complex and multi-task-capable as an AGI, which might well need to learn multiple capabilities (possibly including deceit) and then have a way of deciding which one to apply in a particular case.

One thing that is pretty clear is that an honest mechanism and a deceitful mechanism are going to have very different latent knowledge inside them: "how to I keep the diamond safe?" and "how do I tamper with the sensors so the diamond looks safe?" are very different problems. They're also potentially of different difficulty levels, which might have a big effect on which one gradient descent, or indeed smart AGI optimization, is going to find a solution to first. If our sensors were hardened enough to make fooling them really difficult, that might make finding a passable (and improvable) approach to vault safety much easier than to fooling the humans, at least for gradient descent. Of course, while gradient descent generally stays in whatever local minimum it found first, and AGI doing optimization probably doesn't have that limitation, and could decide to switch strategies. On the other hand, the strategy "don't do any work other than fooling the humans" generalizes really well to many different problems.

However, I still feel that this approach to AGI safety is like trying to build barriers between yourself and something malicious and very smart, and you're a lot better off if the system doesn't have anything malicious in it to start off with. So, I'm a lot more optimistic about an AGI that's a value learner,  can figure out that we don't want to be shown deceitful images (not a very hard problem in human values), and then not do that because it knows that's not what we want.

Quite a number of emotion neurons have also been found in the CLIP text/image network, see for more details. In this case it's apparently not representing the emotions of the writer of the text/photographer of the image, but those of the subject of the picture. Nevertheless, it demonstrates that neural nets can learn non-trivial representations of human emotions (interestingly, even down to distinguishing between 'healthy' and 'unhealthy/mentally troubled' variants of the same emotion). It would be interesting to see if LLMs distinguish between writing about a specific emotion, and writing while feeling that emotion. My expectation would be that these two ideas are correlated but distinct: one can write dispassionately about anger, or write angrily about some other emotion, so a sufficiently large LLM would need to use different representations for them, but they might well overlap.

Load More