Roger Dearnaley

I'm a machine learning engineer in Silicon Valley with an interest in AI alignment and safety.




I've been thinking for a while that one could do syllabus learning for LLMs. It's fairly easy to classify text by reading age. So start training the LLM on only text with a low reading age, and then increase the ceiling on reading age until it's training on the full distribution of text. ( experimented with curriculum learning in early LLMs, with little effect, but oddly didn't test reading age.)

To avoid distorting the final training distribution by much, you would need to raise the reading-age limit fairly fast, so that by the time it reaches the maximum you've only used up, say, ten percent of the text with low reading ages. Then in the final training distribution those texts are only about ten percent underrepresented, so the LLM is still capable of generating children's stories if needed (just slightly less likely to do so randomly).

The hope is that this would improve quality faster early in the training run, getting the LLM sooner to a level where it can extract more benefit from even the more difficult texts, and so reach a slightly higher final quality from the same amount of training data and compute. Otherwise, for those really difficult texts that happen to be used early on in the training run, the LLM presumably gets less value from them than if they'd come later in the training. I'd expect any resulting improvement to be fairly small, but then this isn't very hard to do.
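A minimal sketch of the data-ordering side of this idea, in Python. The linear ramp, the reading-age range of 5 to 18, and the names `reading_age_ceiling` and `curriculum_order` are all assumptions made for illustration; the only thing taken from the description above is that the ceiling rises quickly and every document is still used exactly once.

```python
import random

def reading_age_ceiling(step, ramp_steps, min_age=5.0, max_age=18.0):
    """Linearly raise the reading-age ceiling, then hold it at the maximum."""
    frac = min(1.0, step / ramp_steps)
    return min_age + frac * (max_age - min_age)

def curriculum_order(docs, ramp_steps, rng, min_age=5.0, max_age=18.0):
    """Yield each (reading_age, text) pair exactly once under a rising ceiling.

    Because nothing is dropped or repeated, the overall training set is
    unchanged; only the order is biased towards easy text early on.
    """
    remaining = list(docs)
    rng.shuffle(remaining)
    step = 0
    while remaining:
        ceiling = reading_age_ceiling(step, ramp_steps, min_age, max_age)
        # Take the first unused document that is easy enough for this step.
        idx = next((i for i, d in enumerate(remaining) if d[0] <= ceiling), None)
        if idx is None:
            # Nothing easy enough is left: fall back to the easiest remaining one.
            idx = min(range(len(remaining)), key=lambda i: remaining[i][0])
        yield remaining.pop(idx)
        step += 1

# Hypothetical corpus: 20 documents at each reading age from 5 to 18.
docs = [(age, f"doc-{age}-{i}") for age in range(5, 19) for i in range(20)]
order = list(curriculum_order(docs, ramp_steps=len(docs) // 10, rng=random.Random(0)))
```

Setting `ramp_steps` to about a tenth of the run is what keeps the distortion small: after the ramp, sampling is just a shuffle of whatever is left.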

A more challenging approach would be to do the early training on low-reading-age material in a smaller LLM, potentially saving compute, and then do something like add more layers near the middle, or distill the behavior of the small LLM into a larger one, before continuing the training. Here the aim would be to also save some compute during the early parts of the training run. Potential issues would be if the distillation process or loss of quality from adding new randomly-initialized layers ended up costing more compute/quality than we'd saved/gained.

[In general, the Bitter Lesson suggests that sadly the time and engineering effort spent on these sorts of small tweaks might be better spent on just scaling up more.]

Subproblem 1.2/2.1: Traps

Allowing traps in the environment creates two different problems:

  • (Subproblem 1.2) Bayes-optimality becomes intractable in a very strong sense (even for a small number of deterministic MDP hypotheses with small number of states).
  • (Subproblem 2.1) It's not clear how to talk about learnability and learning rates.

It makes some sense to consider these problems together, but different directions emphasize different sides.


Evolved organisms (such as humans) are good at dealing with traps: getting eaten is always a possibility. At the simplest level they do this by having multiple members of the species die, and using an evolutionary learning mechanism to evolve detectors for potential trap situations and some trap-avoiding behavior for this to trigger. An example of this might be the human instinct of vertigo near cliff edges — it's hard not to step back. The cost of this is that some number of individuals die from the traps before the species evolves a way of avoiding the trap.

As a sapient species using the scientific method, we have more sophisticated ways to detect traps. Often we may have a well-supported model of the world that lets us predict and avoid a trap ("nuclear war could well wipe out the human race, let's not do that"). Or we may have an unproven theory that predicts a possible trap, but that also predicts some less dangerous phenomenon. So rather than treating the universe like a multi-armed bandit and jumping into the potential trap to find out what happens and test our theory, we perform the lowest risk/cost experiment that will get us a good Bayesian update on the support for our unproven theory, hopefully at no cost to life or limb. If that raises the theory's support, then we become more cautious about the predicted trap, or if it lowers it, we become less. Repeat until your Bayesian updates converge on either 100% or 0%.
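As a toy illustration of "lowest risk/cost experiment that still gives a good Bayesian update", here is a short Python sketch. The experiment names, likelihoods, and risk numbers are invented for the example, and picking the most informative experiment under a risk cap is just one simple way to formalize the trade-off.

```python
import math

def posterior(prior, p_obs_if_true, p_obs_if_false):
    """Bayes' rule for a binary hypothesis after one observation."""
    num = prior * p_obs_if_true
    return num / (num + (1.0 - prior) * p_obs_if_false)

def entropy(p):
    """Binary entropy in bits."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))

def expected_info_gain(prior, p_pos_if_true, p_pos_if_false):
    """Expected reduction in uncertainty about the theory from one experiment."""
    p_pos = prior * p_pos_if_true + (1.0 - prior) * p_pos_if_false
    outcomes = (
        (p_pos, p_pos_if_true, p_pos_if_false),                    # positive result
        (1.0 - p_pos, 1.0 - p_pos_if_true, 1.0 - p_pos_if_false),  # negative result
    )
    gain = entropy(prior)
    for p_outcome, lik_true, lik_false in outcomes:
        if p_outcome > 0.0:
            gain -= p_outcome * entropy(posterior(prior, lik_true, lik_false))
    return gain

# Hypothetical experiments: (P(positive | theory true), P(positive | theory false), risk).
EXPERIMENTS = {
    "jump_into_trap": {"p_pos_if_true": 1.0, "p_pos_if_false": 0.0, "risk": 1.0},
    "poke_with_stick": {"p_pos_if_true": 0.9, "p_pos_if_false": 0.2, "risk": 0.01},
    "watch_from_afar": {"p_pos_if_true": 0.6, "p_pos_if_false": 0.5, "risk": 0.001},
}

def choose_experiment(prior, experiments, max_risk):
    """Most informative experiment among those whose risk is acceptable."""
    safe = {n: e for n, e in experiments.items() if e["risk"] <= max_risk}
    if not safe:
        return None
    return max(safe, key=lambda n: expected_info_gain(
        prior, safe[n]["p_pos_if_true"], safe[n]["p_pos_if_false"]))

# Repeated positive results from the cheap experiment steadily raise our belief.
belief = 0.5
for _ in range(3):
    belief = posterior(belief, 0.9, 0.2)
```

Jumping into the trap is the most informative experiment by far, but the risk cap rules it out, which is the whole point.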

An evolved primate heuristic for this is "if nervous of an unidentified object, poke it with a stick and see what happens". This of course works better on, say, live/dead snakes than on some other perils that modern technology has exposed us to.

The basic trick here is to have a world model sophisticated enough that it can predict traps in advance, and we can find hopefully non-fatal ways of testing them that don't require us to jump into the trap. This requires that the universe has some regularities strong enough to admit models like this, as ours does. Likely most universes that didn't would be uninhabitable and life wouldn't evolve in them. 

As long as all the agentic AGIs people are building are value learners (i.e. their utility function is hard-coded to something like "figure out what utility function humans in aggregate would want you to use if they understood the problem better, and use that"), then improving their understanding of the human values becomes a convergent instrumental strategy for them: obviously the better they understand the human-desired utility function, the better job they can do of optimizing it. In particular, if AGI's capabilities are large, and as a result many of the things it can do are outside the region of validity of its initial model of human values, and also it understands the concept of the region of validity of a model (a rather basic, obviously required capability for an AGI that can do research, so this seems like a reasonable assumption), then it can't use most of its capabilities safely, so solving that problem obviously becomes top priority. This is painfully obvious to us, so it should also be painfully obvious to an AGI capable of doing research.

In that situation, a fast takeoff should just cause you to get an awful lot of AGI intelligence focused on the problem of solving alignment. So, as the author mentions, perhaps we should be thinking about how we would maintain human supervision in that eventuality? That strikes me as a particular problem that I'd feel more comfortable to have solved by a human alignment researcher than an AGI one.

I'm not an ethical philosopher, but my intuition, based primarily on personal experience, is that deontological ethics are a collection of heuristic rules of thumb extracted from the average answers of utilitarian ethics applied to a common range of situations that often crop up between humans. (I also view this as a slightly-idealized description of the legal system.) As such, they're useful primarily in the same ways that heuristics often are useful compared to actually calculating a complex function, by reducing computational load. For people, they also provide useful markers to avoid 'slippery slope' situations where personal benefit might encourage you to err on one side in a complex estimation/calculation of overall utility. They also provide a way of trying to settle arguments: "I didn't break any deontological ethical rules" is often a defense in the court of public opinion, and is often less contentious than "utilitarian ethics support my actions".

As such, my feeling is that a powerful AGI should have a better ability to handle computational load than a human; it is more likely to encounter situations that are 'out of distribution' (atypical or even weird) compared to a human, which might take these heuristics outside their range of validity; it ought to be more capable of computing a utility function without personal bias; and it is likely to be smart enough to find ways to 'rules lawyer' corner cases that the deontological heuristics don't handle well. So for a sufficiently smart AGI, I would strongly suspect that even well-implemented deontological ethics would be more dangerous than well-implemented utilitarian ethics. But I'm mostly working from software-engineer intuition, that I don't really trust a spaghetti-code ball of heuristics, so this isn't a philosophical argument.

However, for less capable AI systems, ones not powerful enough to run a good utilitarian value function, a set of deontological ethical heuristics (and also possibly-simplified summaries of relevant laws) might well be useful to reduce computational load, if these were carefully crafted to cover the entire range of situations that they are likely to encounter (and especially with guides for identifying when a situation was outside that range and it should consult something more capable). However, the resulting collection of heuristics might look rather different from the deontological ethical rules I'd give a human child.

More broadly, most people in the AI alignment space that I've seen approaching the problem of either describing human values to an AI, or having it learn them, have appeared to view ethics from a utilitarian/consequentialist rather than a deontological perspective, and tend to regard this prospect as very challenging and complex, far more so than if you just had to teach the machine a list of deontological ethical rules. So my impression is that most people in AI safety and alignment are not using a deontological viewpoint; I'd love to hear whether that has been your experience too. Indeed, my suspicion is that many of them would view that as either oversimplified, or unlikely to continue to work well as rapid technological change enabled by AGI caused a large number of new ethical conundrums to appear that we don't yet have a social consensus on deontological rules for.

For example, my personal impression is that many human societies are still arguing about changes in deontological ethics in response to the easy availability of birth control, something that we've had for O(60) years. In the presence of AGI, rates of technological change could well increase massively, and we could face ethical conundrums far more complex than those posed by birth control.

I thought this was a very interesting paper — I particularly liked the relationship to phase transitions.

However, I think there's likely to be another 'phase' that they don't discuss (possibly it didn't crop up in their small models, since it's only viable in a sufficiently large model): one where you pack a very large number of features (thousands or millions, say) into a fairly large number of dimensions (hundreds, say). In spaces with dimensionality >= O(100), the statistics of norm and dot product are such that even randomly chosen unit-norm vectors are almost invariably nearly orthogonal. So gradient descent would have very little work to do to find a very large number of vectors (much larger than the number of dimensions) that are all mutually almost-orthogonal, and so show very little interference between them. This is basically the limiting case of the pattern observed in the paper of packing n features in superposition into d dimensions where n > d >= 1, taking this towards the limit where n >> d >> 1.
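The near-orthogonality claim is easy to check numerically. Below is a small pure-Python sketch; the function names and the particular sizes are my own choices for illustration. It draws random unit vectors and reports the mean and maximum absolute cosine similarity over all pairs.

```python
import math
import random

def random_unit_vector(dim, rng):
    """Random direction: normalize a vector of independent Gaussians."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def mean_and_max_abs_cosine(n_vectors, dim, seed=0):
    """Mean and max |cosine| over all pairs of n random unit vectors."""
    rng = random.Random(seed)
    vecs = [random_unit_vector(dim, rng) for _ in range(n_vectors)]
    cosines = [
        abs(sum(a * b for a, b in zip(u, v)))
        for i, u in enumerate(vecs)
        for v in vecs[i + 1:]
    ]
    return sum(cosines) / len(cosines), max(cosines)

low_dim = mean_and_max_abs_cosine(n_vectors=60, dim=3)
high_dim = mean_and_max_abs_cosine(n_vectors=60, dim=512)
```

The typical pairwise cosine shrinks roughly like 1/sqrt(d), so in a few hundred dimensions the interference between any two random directions is already only a few percent. Raising `n_vectors` well above `dim` is slow in pure Python but shows the same pattern.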

Intuitively this phase seems particularly likely in a context like the residual stream of an LLM, where (in theory, if not in practice) the embedding space is invariant under arbitrary rotations, so there's no obvious reason to expect vectors to align with the coordinate axes. On the other hand, in a system where there was a preferred basis (such as a system using L1 regularization), you might get such vectors that were themselves sparse, with most components zero but a significant number of non-zero components, enough for the randomness to still give low interference.

More speculatively, in a neural net that was using at least some of its neurons in this high-dimensionality dense superposition phase, the model will presumably learn ways to manipulate these vectors to do computation in superposition. One possibility for this might be methods comparable to some of the possible Vector Symbolic Architectures (also known as hyperdimensional computing) outlined in e.g. Of the primitives used in that, a fully connected layer can clearly be trained to implement both addition of vectors and permutations of their elements; I suspect something functionally comparable to the vector elementwise-multiplication (Hadamard product) operation could be produced by using the nonlinearity of a smooth activation function such as GELU or Swish; and I suspect their clean-up memory operation could be implemented using attention. If it turned out to be the case that SGD actually often finds solutions of this form, then an understanding of vector symbolic architectures might be helpful for interpretability of models where portions of them used this phase. This seems most likely in models that need to pack vast numbers of features into large numbers of dimensions, such as modern large LLMs.
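For readers who haven't met these primitives, here is a minimal Python sketch of one common flavor of VSA: random bipolar (+1/-1) vectors, with elementwise multiplication for binding, addition for bundling, rotation as the permutation, and a nearest-neighbor lookup as the clean-up memory. The dimensionality, names, and the role/filler example are assumptions for illustration; this says nothing about what a trained network actually does.

```python
import math
import random

DIM = 1024  # hypothetical dimensionality, chosen only for the demo

def random_hv(rng):
    """Random bipolar hypervector."""
    return [rng.choice((-1, 1)) for _ in range(DIM)]

def bind(a, b):
    """Binding: elementwise (Hadamard) product; self-inverse for bipolar vectors."""
    return [x * y for x, y in zip(a, b)]

def bundle(*vectors):
    """Bundling: elementwise sum, i.e. superposition of several vectors."""
    return [sum(xs) for xs in zip(*vectors)]

def permute(v, shift=1):
    """Permutation: cyclic rotation of the elements."""
    return v[-shift:] + v[:-shift]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))

def cleanup(noisy, memory):
    """Clean-up memory: name of the stored vector most similar to the noisy one."""
    return max(memory, key=lambda name: cosine(noisy, memory[name]))

rng = random.Random(0)
roles = {name: random_hv(rng) for name in ("color", "shape")}
fillers = {name: random_hv(rng) for name in ("red", "blue", "circle", "square")}

# Store two role/filler pairs in superposition in a single vector.
record = bundle(
    bind(roles["color"], fillers["red"]),
    bind(roles["shape"], fillers["circle"]),
)

def query(rec, role_name):
    """Unbind a role, then clean up the noisy result against known fillers."""
    return cleanup(bind(rec, roles[role_name]), fillers)
```

Unbinding leaves the wanted filler plus crosstalk from the other pair; the crosstalk is nearly orthogonal to every stored filler, so the clean-up step should recover the right one. That reliance on near-orthogonality is the link to the dense superposition phase described above.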

An interesting paper on successfully distinguishing different mechanisms inside image classification models: for this small model they correspond to different, disconnected local minima of the loss function (I assume basically because it only has enough capacity to implement one strategy really well, so it has to pick one). They even outline approaches to move models from one mechanism that doesn't generalize well to another that does.

I don't immediately see how to extend this to the sort of different mechanisms that Paul was discussing, but it feels like it might be relevant; albeit the mechanisms might be a lot less clearly separable on something as complex and multi-task-capable as an AGI, which might well need to learn multiple capabilities (possibly including deceit) and then have a way of deciding which one to apply in a particular case.

One thing that is pretty clear is that an honest mechanism and a deceitful mechanism are going to have very different latent knowledge inside them: "how do I keep the diamond safe?" and "how do I tamper with the sensors so the diamond looks safe?" are very different problems. They're also potentially of different difficulty levels, which might have a big effect on which one gradient descent, or indeed smart AGI optimization, is going to find a solution to first. If our sensors were hardened enough to make fooling them really difficult, that might make finding a passable (and improvable) approach to vault safety much easier than fooling the humans, at least for gradient descent. Of course, while gradient descent generally stays in whatever local minimum it found first, an AGI doing optimization probably doesn't have that limitation, and could decide to switch strategies. On the other hand, the strategy "don't do any work other than fooling the humans" generalizes really well to many different problems.

However, I still feel that this approach to AGI safety is like trying to build barriers between yourself and something malicious and very smart, and you're a lot better off if the system doesn't have anything malicious in it to start off with. So, I'm a lot more optimistic about an AGI that's a value learner, that can figure out that we don't want to be shown deceitful images (not a very hard problem in human values), and that then doesn't do so because it knows that's not what we want.

Quite a number of emotion neurons have also been found in the CLIP text/image network, see for more details. In this case it's apparently not representing the emotions of the writer of the text/photographer of the image, but those of the subject of the picture. Nevertheless, it demonstrates that neural nets can learn non-trivial representations of human emotions (interestingly, even down to distinguishing between 'healthy' and 'unhealthy/mentally troubled' variants of the same emotion). It would be interesting to see if LLMs distinguish between writing about a specific emotion, and writing while feeling that emotion. My expectation would be that these two ideas are correlated but distinct: one can write dispassionately about anger, or write angrily about some other emotion, so a sufficiently large LLM would need to use different representations for them, but they might well overlap.

Some very interesting and inspiring material.

I was fascinated to see that provides some clear evidence for emotion neurons in CLIP rather similar to the ones for modeling author's current emotional state that I hypothesized might exist in LLMs in As I noted there, if true this would have significant potential for LLM safety and alignment.

A. Harden sensors so that tampering is harder than the intended task

We could design and deploy a lot of redundant sensors, and do science and red-teaming to understand the possible principles by which those sensors can be compromised. In the modern world it’s kind of insane to imagine a human rebellion that not only succeeded but left no trace of itself.


Suppose the diamond in the room we're monitoring contained a security tag, powered by incoming light, with a cryptographic secret in it, and an interface that allows us to continuously do a zero-knowledge identity proof on it via a laser optical interface, using a laser pointed into the room as input and the (high-speed) camera as output. So to successfully tamper with the camera, the AI would need access to the secret inside the security tag inside the diamond, or at least to reroute the channel to and from it. So it would need to be able to steal the diamond, extract the security tag, and reverse engineer its cryptographic secret out of it in a time shorter than the period between camera frames, or similarly replace the diamond with an optically-similar replacement now containing the security tag in the room, or integrate the security tag and its input-output channel into its camera-spoofing system, in each case in the time between frames. None of these are actually physically impossible, but they could be made extremely hard to do in a small fraction of a second (say we're using a really high-speed camera with basically no temporal gaps between frames).
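To make the per-frame identity check concrete, here is a sketch using Schnorr's identification protocol, a standard proof of knowledge of a secret that is zero-knowledge against an honest verifier. Using Schnorr here is my assumption; the comment above doesn't name a protocol. The group parameters are deliberately tiny toy values and completely insecure, and the `Tag` class and `check_frame` function are hypothetical names.

```python
import secrets

# Toy group: G has prime order Q modulo P. Insecure, for illustration only;
# a real deployment would use a standardized large group or elliptic curve.
P, Q, G = 23, 11, 2

class Tag:
    """Hypothetical security tag holding a secret x; only G^x is published."""

    def __init__(self, secret):
        self._x = secret % Q
        self.public_key = pow(G, self._x, P)

    def commit(self):
        # Fresh random nonce for every round (here: every camera frame).
        self._r = secrets.randbelow(Q)
        return pow(G, self._r, P)

    def respond(self, challenge):
        return (self._r + challenge * self._x) % Q

def verify(public_key, commitment, challenge, response):
    """Accept iff G^response == commitment * public_key^challenge (mod P)."""
    return pow(G, response, P) == (commitment * pow(public_key, challenge, P)) % P

def check_frame(tag, public_key):
    """One challenge-response round, e.g. laser in, camera out, once per frame."""
    commitment = tag.commit()
    challenge = secrets.randbelow(Q)  # the verifier's random challenge
    response = tag.respond(challenge)
    return verify(public_key, commitment, challenge, response)

genuine = Tag(secret=7)
ok = all(check_frame(genuine, genuine.public_key) for _ in range(50))
```

The verifier never learns the secret and each round uses a fresh challenge, so replaying recorded frames doesn't help; a spoofer has to either extract the secret or relay the live channel, both within one frame time.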

The basic idea here is that cryptographic techniques may be useful when hardening things. Whether this approach could usefully be extended to anything significantly more AGI-alignment-like than a diamond-in-a-vault scenario is unclear. 

Now we’re going to build a new model that is constructed based on the description of this model. Each component in the new model is going to be a small model trained to imitate a human computing the function that the description of the component specifies.


Some of the recent advances in symbolic regression and equation learning might be useful during this step to help generate functions describing component behavior, if what the component in the model is doing is moderately complicated. (E.g. A Mechanistic Interpretability Analysis of Grokking found that a model trained to do modular arithmetic ended up implementing it using discrete Fourier transforms and trig identities, which sounds like the sort of thing that might be a lot easier to figure out from a learned equation describing the component's behavior.) Being able to reduce a neural circuit to an equation or a Bayesnet or whatever would help a lot with interpretability, and at that point you might not even need to train an implementation model: we could maybe just use the symbolic form directly, as a more compact and more efficiently computable representation.
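To give a feel for what such a symbolic form can look like, here is a Python sketch of modular addition done purely with sines, cosines, and the angle-addition identities. The modulus and the handful of frequencies are hypothetical choices for the example, not values taken from that analysis, and this is a hand-written formula rather than anything extracted from a model.

```python
import math

def mod_add_via_trig(a, b, p, freqs):
    """Compute (a + b) mod p using only trig functions of a, b and c separately.

    For each frequency k with w = 2*pi*k/p:
      cos(w(a+b)) = cos(wa)cos(wb) - sin(wa)sin(wb)
      sin(w(a+b)) = sin(wa)cos(wb) + cos(wa)sin(wb)
      cos(w(a+b-c)) = cos(w(a+b))cos(wc) + sin(w(a+b))sin(wc)
    The sum over frequencies of cos(w(a+b-c)) peaks at c = (a + b) mod p.
    """
    best_c, best_score = None, -math.inf
    for c in range(p):
        score = 0.0
        for k in freqs:
            w = 2.0 * math.pi * k / p
            cos_ab = math.cos(w * a) * math.cos(w * b) - math.sin(w * a) * math.sin(w * b)
            sin_ab = math.sin(w * a) * math.cos(w * b) + math.cos(w * a) * math.sin(w * b)
            score += cos_ab * math.cos(w * c) + sin_ab * math.sin(w * c)
        if score > best_score:
            best_c, best_score = c, score
    return best_c

P_MOD = 113           # hypothetical prime modulus
FREQS = (3, 17, 42)   # hypothetical frequencies; any k in 1..P_MOD-1 works
result = mod_add_via_trig(100, 50, P_MOD, FREQS)
```

Written out like this, the whole component is a few lines of algebra, which is the kind of compact description an equation learner might be able to hand you.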

At the end of this process, you might even end up with something symbolic that looked a lot like a "Good Old Fashioned AI" (GOFAI) model — but a "Bitter Lesson compatible" one first learnt by a neural net and then reverse engineered using interpretability. Obviously doing this would put high demands on our interpretation tools. 

If I had such a function describing a neural net component, one of my first questions would be: what portions of the domain of this function are well covered by the training set that the initial neural net model was trained on, or are at least sufficiently near items in that training set that interpolating the function to them seems likely to be safe (given its local first, second, third... partial derivatives), vs. what portions are untested extrapolations? Did the symbolic regression/function learning process give us multiple candidate functions, and if so, how much do they differ outside that well-tested region of the function domain?
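A one-dimensional toy version of both questions, in Python. The candidate functions, the training inputs, and the coverage radius are all made up for illustration; a real component would have many input dimensions and would need a better notion of "near" than absolute distance.

```python
import math

def nearest_distance(x, train_xs):
    """Distance from x to the closest input seen during training."""
    return min(abs(x - t) for t in train_xs)

def is_covered(x, train_xs, radius):
    """Crude coverage test: is some training input within `radius` of x?"""
    return nearest_distance(x, train_xs) <= radius

def candidate_spread(x, candidates):
    """How much the candidate symbolic forms disagree at x."""
    values = [f(x) for f in candidates]
    return max(values) - min(values)

# Hypothetical training inputs, all in [-0.5, 0.5].
TRAIN_XS = [i / 20 for i in range(-10, 11)]

# Hypothetical candidates from symbolic regression that all fit that region well.
CANDIDATES = [
    lambda x: x,
    lambda x: math.sin(x),
    lambda x: x - x ** 3 / 6,
]

inside, outside = 0.12, 3.0
report = {
    x: (is_covered(x, TRAIN_XS, radius=0.05), candidate_spread(x, CANDIDATES))
    for x in (inside, outside)
}
```

Inside the training range the three candidates are nearly indistinguishable; far outside it they disagree badly. A large spread together with a failed coverage test is a cheap flag that the component is extrapolating.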

This seems like it would give us some useful intuition for when the model might be unsafely extrapolating outside the training distribution and we need to be particularly cautious.

Some portions of the neural net may turn out to be irreducibly complex — I suspect it would be good to be able to identify when something genuinely is complex, and when we're just looking at a big tangled-up blob of memorized instances from the training set (e.g. by somehow localizing sources of loss on the test set).
