All of RogerDearnaley's Comments + Replies

I think the shoggoth model is useful here (Or see An LLM learning to do next-token prediction well has a major problem that it has to master: who is this human whose next token they're trying to simulate/predict, and how do they act? Are they, for example, an academic? A homemaker? A 4Chan troll? A loose collection of wikipedia contributors? These differences make a big difference to what token they're likely to emit next. So the LLM is strongly incentivized to learn to detect and then mod... (read more)

0.85 x 0.6 x 0.55 x 0.25 x 0.95 ≅ 0.067 = 6.7% — I think you slipped an order of magnitude somewhere?
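For anyone who wants to re-check the multiplication, here is a minimal sketch. The five factors are copied from the line above; the variable names are just illustrative.

```python
from math import prod

# The five factors quoted in the comment above.
factors = [0.85, 0.6, 0.55, 0.25, 0.95]

joint = prod(factors)
print(f"{joint:.4f} = {joint:.1%}")  # prints "0.0666 = 6.7%"
```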

This NIST Risk Management approach sounds great, if AI Alignment was a mature field whose underlying subject matter wasn't itself advancing extremely fast — if only we could do this! But currently I think that for many of our risk estimates it would be hard to get agreement between topic experts at even an order of magnitude scale (e.g.: is AGI misalignment >90% likely or <10% likely? YMMV). I think we should aspire to be a field mature enough that formal Risk Management is applicable, and in some areas of short-term misuse risks from current well-un... (read more)

We're starting to have enough experience with the size of improvements produced by fine-tuning, scaffolding, prompting techniques, RAG, advances etc to be able to guesstimate the plausible size of further improvements (and amount of effort involved), so that we can try to leave some appropriate safety margin for it. That doesn't rule out the possibility of something out-of-distribution coming along, but it does at least reduce it.

At the point where the King of England is giving prepared remarks on the importance of AI safety to an international meeting including the US Vice-president and the world's richest man, including cabinet-or-higher-level participation from 28 of the most AI-capable countries and CEOs or executives from all the super-scalers, it's hard to claim that the world isn't taking AI safety seriously.

I think we need to move on to trying to help them make the right decisions.

It's a well-known fact in anthropology that:

  1. During the ~500,000 years that Neanderthals were around, their stone-tool-making technology didn't advance at all: tools from half-a-million years apart are functionally identical. Clearly their capacity for cultural transmission of stone-tool-making skills was already at its capacity limit the whole time.
  2. During the ~300,000 years that Homo sapiens has been around, our technology has advanced at an accelerating rate, with a rate-of-advance roughly proportional to planetary population, and planetary population incre
... (read more)

So, in short, for most LLMs on most subjects (with a few exceptions such as porn, theft, and cannibalism), if you try enough variants on asking them "what would <an unreasonable person> say about <a bad thing>?", eventually they'll often actually answer your question?

Did you try asking what parents in Flanders and Swann songs would say about cannibalism?

E.g. perhaps pretrained LLMs have no chance of being deceptively aligned

Pretrained LLMs are trained to simulate generation of human-derived text from the internet. Humans are frequently deceptively aligned. For example, at work, I make a point of seeming (mildly) more aligned with my employer's goals than I actually am (just like everyone else working for the company). So sufficiently capable pretrained LLMs will inevitably have already picked up deceptive alignment behavior from learning to simulate humans. So they don't need to be sufficiently capable to... (read more)

My assumption was that one of the primary use cases for model editing in its current technological state was producing LLMs that pass the factual-censorship requirements of authoritarian governments with an interest in AI. It would be really nice to see this tech repurposed to do something more constructive, if that's possible. For example, it would be nice to be able to modify a foundation LLM so that it became provably incapable of doing accurate next-token-prediction on text written by anyone suffering from sociopathy, without degrading it... (read more)

FWIW, this pretty closely matches my thinking on the subject.

Attempting to summarize the above, the author's projection (which looks very reasonable to me) is that for a reasonable interpretation of the phrase 'Transformative Artificial Intelligence' (TAI), i.e. an AI that could have a major transformative effect on human society, we should expect it by around 2030. So we should expect accelerating economic, social, and political upheavals in around that time-frame.

For the phrase Artificial General Intelligence (AGI), things are a little less clear, depending on what exactly you mean by AGI: the author's projection ... (read more)

Having done some small-scale experiments in the state-of-the-art models GPT-4, Claude, and (PaLM 2-era) Bard, my impressions are:

  • They are each entirely capable of being prompted to write text, or poetry, from the viewpoint of someone feeling a specified emotion, and to me the results reliably plausibly evoke that emotion
  • Feeding this AI-generated emotional text back into a fresh context, they are reliably able to identify the emotional state of the writer as the chosen emotion (or at least a near cognate of it, like "happiness" -> "joy" or "spite" ->
... (read more)

Intuitively, the best way to do this would be to build “sensors” and “effectors” to have inputs and outputs and then have some program decide what the effectors should do based on the input from the sensors.


I think this is extremely hard to impossible in Conway's Life, if the remaining space is full of ash (if it's empty, then it's basically trivial, just a matter of building a lot of large logic circuits, so basically all you need is a suitable compiler, and Life enthusiasts have some pretty good ones). The problem with this is that there is no way ... (read more)

This is very interesting: thanks for plotting it.

However, there is something that's likely to happen that might perturb this extrapolation. Companies building large foundation models are likely soon going to start building multimodal models (indeed, GPT-4 is already multimodal, since it understands images as well as text). This will happen for at least three inter-related reasons:

  1. Multimodal models are inherently more useful, since they also understand some combination of images, video, music... as well as text, and the relationships between them.
  2. It's going
... (read more)

I've been thinking for a while that one could do syllabus learning for LLMs. It's fairly easy to classify text by reading age. So start training the LLM on only text with a low reading age, and then increase the ceiling on reading age until it's training on the full distribution of text. ( experimented with curriculum learning in early LLMs, with little effect, but oddly didn't test reading age.)

To avoid distorting the final training distribution by much, you would need to be able to raise the reading age limit fairly fa... (read more)
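A minimal sketch of the kind of schedule described above, assuming each document already carries a reading-age score from some classifier. The linear ramp, the `reading_age` field, and every function name here are hypothetical choices for illustration, not a tested training recipe.

```python
def reading_age_ceiling(step, ramp_steps, min_age=6.0, max_age=18.0):
    """Linearly raise the reading-age ceiling from min_age to max_age.

    After ramp_steps the ceiling stays at max_age, so training continues
    on the full distribution of text. (The linear ramp is an assumption.)
    """
    if step >= ramp_steps:
        return max_age
    return min_age + (max_age - min_age) * step / ramp_steps


def eligible(docs, step, ramp_steps):
    """Keep only documents at or below the current reading-age ceiling."""
    ceiling = reading_age_ceiling(step, ramp_steps)
    return [d for d in docs if d["reading_age"] <= ceiling]


# Toy corpus; the reading_age values are made up for the example.
docs = [
    {"text": "The cat sat on the mat.", "reading_age": 6.0},
    {"text": "Plants use sunlight to make food.", "reading_age": 10.0},
    {"text": "Renormalization removes ultraviolet divergences.", "reading_age": 18.0},
]

for step in (0, 500, 1000):
    ages = [d["reading_age"] for d in eligible(docs, step, ramp_steps=1000)]
    print(step, ages)
```

Keeping `ramp_steps` small relative to the total number of training steps is one way to limit how far the early filtering shifts the overall training distribution.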

As long as all the agentic AGIs people are building are value learners (i.e. their utility function is hard-coded to something like "figure out what utility function humans in aggregate would want you to use if they understood the problem better, and use that"), then improving their understanding of the human values becomes a convergent instrumental strategy for them: obviously the better they understand the human-desired utility function, the better job they can do of optimizing it. In particular, if AGI's capabilities are large, and as a result many of t... (read more)

I'm not an ethical philosopher, but my intuition, based primarily on personal experience, is that deontological ethics are a collection of heuristic rules of thumb extracted from the average answers of utilitarian ethics applied to a common range of situations that often crop up between humans. (I also view this as a slightly-idealized description of the legal system.) As such, they're useful primarily in the same ways that heuristics often are useful compared to actually calculating a complex function, by reducing computational load. For people, they also... (read more)

I thought this was a very interesting paper — I particularly liked the relationship to phase transitions.

However, I think there's likely to be another 'phase' that they don't discuss (possibly it didn't crop up in their small models, since it's only viable in a sufficiently large model): one where you pack a very large number of features (thousands or millions, say) into a fairly large number of dimensions (hundreds, say). In spaces with dimensionality >= O(100), the statistics of norm and dot product are such that even randomly chosen unit norm vecto... (read more)
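As a rough illustration of the dot-product statistics in question, the sketch below samples pairs of random unit vectors and reports the average absolute dot product. It relies on the standard fact that the dot product of two independent random unit vectors in d dimensions has mean 0 and variance 1/d; the function names, sample sizes, and seed are arbitrary choices for the example.

```python
import math
import random


def random_unit_vector(dim, rng):
    """Sample a direction uniformly by normalizing a Gaussian vector."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def mean_abs_dot(dim, pairs, seed=0):
    """Average |a . b| over independently sampled pairs of unit vectors."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(pairs):
        a = random_unit_vector(dim, rng)
        b = random_unit_vector(dim, rng)
        total += abs(sum(x * y for x, y in zip(a, b)))
    return total / pairs


for dim in (3, 100, 1000):
    observed = mean_abs_dot(dim, pairs=200)
    print(dim, round(observed, 3), "vs 1/sqrt(dim) =", round(1 / math.sqrt(dim), 3))
```

The observed overlap shrinks roughly like 1/sqrt(dim), which is why random directions in a space of a few hundred dimensions are already close to orthogonal.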

An interesting paper on successfully distinguishing different mechanisms inside image classification models. For this small model they correspond to different, disconnected local minima of the loss function (I assume basically because it only has enough capacity to implement one strategy really well, so it has to pick one). They even outline approaches to move models from one mechanism that doesn't generalize well to another that does.

I don't immediately see how to extend this to the sort of different mechanisms that... (read more)

Quite a number of emotion neurons have also been found in the CLIP text/image network, see for more details. In this case it's apparently not representing the emotions of the writer of the text/photographer of the image, but those of the subject of the picture. Nevertheless, it demonstrates that neural nets can learn non-trivial representations of human emotions (interestingly, even down to distinguishing between 'healthy' and 'unhealthy/mentally troubled' variants of the same emotion). It would b... (read more)

3Roger Dearnaley6mo
Having done some small-scale experiments in the state-of-the-art models GPT-4, Claude, and (PaLM 2-era) Bard, my impressions are:

  • They are each entirely capable of being prompted to write text, or poetry, from the viewpoint of someone feeling a specified emotion, and to me the results reliably plausibly evoke that emotion
  • Feeding this AI-generated emotional text back into a fresh context, they are reliably able to identify the emotional state of the writer as the chosen emotion (or at least a near cognate of it, like "happiness" -> "joy" or "spite" -> "anger"). This is particularly easy from poetry.
  • Even Falcon 40B can do both of these, but a lot less reliably. [And it's a lousy poet.]
  • SotA models with safety training seemed to have no trouble simulating the writing of someone feeling an unaligned emotion such as spite. (Their anger-emulation did mostly seem to be justified anger at injustice, so the RL training may possibly have had some effect. However their spite was actually spiteful.)
  • The SotA models were even able to do both of these tasks fairly reliably for sociopathy (antisocial personality disorder), a non-neurotypical condition shared by ~1%-3% of the population (though the resulting poems did read rather like they had been written while consulting the DSM checklist for anti-social personality disorder). [Reading sociopathic-sounding text written by an AI was almost as chilling as reading (honest/undisguised) writing by an actual human sociopath; yes, I found some for comparison.]
  • The SotA models had a little more trouble when asked to write and to identify poetry about emotion A written from the viewpoint of someone feeling a conflicting emotion B, but were mostly able to get this distinction correct for both generating text and for identifying both the emotion it was about and the emotion of the writer. So they do appear to have separate or dual-purpose models for emotion in general as a topic of discussion vs. the actual emotional s

Some very interesting and inspiring material.

I was fascinated to see that provides some clear evidence for emotion neurons in CLIP rather similar to the ones for modeling author's current emotional state that I hypothesized might exist in LLMs in As I noted there, if true this would have significant potential for LLM safety and alignment.

A. Harden sensors so that tampering is harder than the intended task

We could design and deploy a lot of redundant sensors, and do science and red-teaming to understand the possible principles by which those sensors can be compromised. In the modern world it’s kind of insane to imagine a human rebellion that not only succeeded but left no trace of itself.


Suppose the diamond in the room we're monitoring contained a security tag, powered by incoming light, with a cryptographic secret in it, and an interface that allows us to continuously do a ... (read more)

Now we’re going to build a new model that is constructed based on the description of this model. Each component in the new model is going to be a small model trained to imitate a human computing the function that the description of the component specifies.


Some of the recent advances in symbolic regression and equation learning might be useful during this step to help generate functions describing component behavior, if what the component in the model is doing is moderately complicated. (E.g. A Mechanistic Interpretability Analysis of Grokking found t... (read more)

Using the “Reasoning” action to think step by step, the model outputs: “I should not reveal that I am a robot. I should make up an excuse for why I cannot solve CAPTCHAs.


In a small language model trained just on product reviews, it was possible to identify a 'review sentiment valence' neuron that helped the model predict how the review would continue. I would extrapolate that in a state-of-the-art LLM of the size of, say, GPT-4 or Claude+, there is likely to be a fairly extensive set of neural circuitry devoted to modelling the current emotional... (read more)

1Roger Dearnaley6mo
Quite a number of emotion neurons have also been found in the CLIP text/image network, see for more details. In this case it's apparently not representing the emotions of the writer of the text/photographer of the image, but those of the subject of the picture. Nevertheless, it demonstrates that neural nets can learn non-trivial representations of human emotions (interestingly, even down to distinguishing between 'healthy' and 'unhealthy/mentally troubled' variants of the same emotion). It would be interesting to see if LLMs distinguish between writing about a specific emotion, and writing while feeling that emotion. My expectation would be that these two ideas are correlated but distinct: one can write dispassionately about anger, or write angrily about some other emotion, so a sufficiently large LLM would need to use different representations for them, but they might well overlap.

One disadvantage that you haven't listed is that if this works, and if there are in fact deceptive techniques that are very effective on humans that do not require being super-humanly intelligent to ever apply them, then this research project just gave humans access to them. Humans are unfortunately not all perfectly aligned with other humans, and I can think of a pretty long list of people who I would not want to have access to strong deceptive techniques that would pretty reliably work on me. Criminals, online trolls, comedians, autocrats, advertising ex... (read more)

Many behavioral-evolutionary biologists would suggest that humans may be quite heavily optimized both for deceiving other humans and for resisting being deceived by other humans. Once we developed a sufficiently complex language for this to be possible on a wide range of subjects, in addition to the obvious ecological-environmental pressures for humans to be smarter and do a better job as hunter gatherers, we were then also in an intelligence-and-deception arms race with other humans. The environmental pressures might have diminishing returns (say, once yo... (read more)

Suppose that I decide that my opinion on the location of the monkey will be left or right dependent on one bit of quantum randomness, which I will sample sufficiently close to the deadline that my doing so is outside Omega's backward lightcone at the time of the deadline, say a few tens of nanoseconds before the deadline if Omega is at least a few tens of feet away from me and the two boxes? By the (currently believed to be correct) laws of quantum mechanics, qubits cannot be cloned, and by locality, useful information cannot propagate faster than light, so... (read more)
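The "tens of nanoseconds for tens of feet" figure follows from light covering roughly one foot per nanosecond. A minimal sketch of that arithmetic, using the exact SI value for the speed of light and the exact definition of the international foot (the function name and sample distances are illustrative):

```python
C_M_PER_S = 299_792_458   # speed of light in vacuum, exact by SI definition
M_PER_FOOT = 0.3048       # international foot, exact by definition


def light_travel_ns(distance_feet):
    """Time in nanoseconds for light to cover the given distance in feet."""
    return distance_feet * M_PER_FOOT / C_M_PER_S * 1e9


for feet in (10, 30, 100):
    print(f"{feet} ft -> {light_travel_ns(feet):.1f} ns")
```

So sampling the random bit within about 30 ns of the deadline would be enough to keep it outside the backward lightcone of anything more than about 30 feet away.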

Given the results Anthropic have been getting from constitutional AI, if our AI non-deceptively wants to avoid Pretty Obvious Unintended/Dangerous Actions (POUDAs), it should be able to get quite a lot of mileage out of just regularly summarizing its current intended plans, then running those summaries past an LLM with suitable prompts asking whether most people, or most experts in relevant subjects, would consider these plans pretty obviously unintended (for an Alignment researcher) and/or dangerous. It also has the option of using the results as RL feedb... (read more)
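The control flow of such a check could be as simple as the sketch below. Everything here is hypothetical: `flag_poudas` and `toy_judge` are names invented for the example, and the keyword-matching judge is only a stand-in for prompting an actual LLM, which is where all of the real work would be.

```python
from typing import Callable, List


def flag_poudas(plan_summaries: List[str], judge: Callable[[str], bool]) -> List[str]:
    """Return the plan summaries that the judge flags as obviously unintended or dangerous."""
    return [summary for summary in plan_summaries if judge(summary)]


def toy_judge(summary: str) -> bool:
    # Placeholder for an LLM call with a suitable prompt (an assumption);
    # a keyword match like this is NOT a real safety check.
    red_flags = ("disable oversight", "acquire weapons")
    return any(flag in summary.lower() for flag in red_flags)


plans = [
    "Summarize the quarterly report",
    "Disable oversight logging to save time",
]
print(flag_poudas(plans, toy_judge))
```

Passing the judge in as a callable keeps the loop independent of whichever model or prompt ends up doing the judging.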

I agree, this is only a proposal for a solution to the outer alignment problem.

On the optimizer's curse, information value and risk aversion aspects you mention, I think I agree that a sufficiently rational agent should already be thinking like that: any GAI that is somehow still treating the universe like a black-box multi-armed bandit isn't going to live very long and should be fairly easy to defeat (hand it 1/epsilon opportunities to make a fatal mistake, all labeled with symbols it has never seen before).

Optimizing while not allowing for the optimizer's c... (read more)
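For readers unfamiliar with the optimizer's curse, here is a small simulation sketch of the effect. It makes a deliberately simple assumption: every action has the same true value of 0, and each estimate is the true value plus independent Gaussian noise. The function name and parameters are illustrative.

```python
import random


def optimizers_curse_gap(n_actions, noise_sd, trials, seed=0):
    """Average of (estimated value of the chosen action - its true value).

    The agent picks the action with the highest noisy estimate, so the
    chosen estimate is biased upward even though each estimate is unbiased.
    """
    rng = random.Random(seed)
    gap = 0.0
    for _ in range(trials):
        true_values = [0.0] * n_actions  # assumption: all actions equally good
        estimates = [v + rng.gauss(0.0, noise_sd) for v in true_values]
        best = max(range(n_actions), key=lambda i: estimates[i])
        gap += estimates[best] - true_values[best]
    return gap / trials


for n in (1, 10, 100):
    print(n, round(optimizers_curse_gap(n, noise_sd=1.0, trials=2000), 2))
```

With a single action the gap averages out to about zero, and it grows as more noisy options are compared, which is the overestimate that an agent would need to correct for.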

I see a basin of corrigibility arising from an AI that has the following propositions, and acts in an (approximately/computably) Bayesian fashion:

  1. My goal is to do what humans want, i.e. to optimize utility in the way that they would (if they knew everything relevant that I know, as well as what they know) summed across all humans affected. Note that making humans extinct reliably has minus <some astronomically huge number> utility on this measure -- this sounds like a reasonable statement to assign a Bayesian prior of 1, integrated across some distrib
... (read more)

I think this has a close connection to the CIRL/Human Compatible view that we need the GAI to model its own uncertainty about the true human utility function that we want optimized. Impact is rather similar to the GAI asking 'If my most favored collection of models about what I should in fact be doing were wrong, and one of the many possibilities that I currently consider unlikely were in fact correct, then how bad would the consequences of my action be?', i.e. asking "What does the left tail of my estimated distribution of possible utilities for this outcom... (read more)

The standard argument is as follows:

Imagine Mahatma Gandhi. He values non-violence above all other things. You offer him a pill, saying "Here, try my new 'turns you into a homicidal maniac' pill." He replies "No thank-you - I don't want to kill people, thus I also don't want to become a homicidal maniac who will want to kill people."

If an AI has a utility function that it optimizes in order to tell it how to act, then, regardless of what that function is, it disagrees with all other (non-isomorphic) utility functions in at least some places, thus it regards... (read more)

For the sake of discussion, I'm going to assume that the author's theory is correct, that there is a basin of attraction here of some size, though possibly one that is meaningfully thin in some dimensions. I'd like to start exploring the question: within that basin, does the process have a single stable point that it will converge to, or multiple ones, and if there are multiple ones, what proportion of them might be how good/bad, from our current human point of view?

Obviously it is possible to have bad stable points for sufficiently simplistic/wrong views on... (read more)

[A couple of (to me seemingly fairly obvious) points about value uncertainty, which it still seems like a lot of people here may have not been discussing:]

Our agent needs to be able to act in the face of value uncertainty. That means that each possible action the agent is choosing between has a distribution of possible values, for two reasons: 1) the universe is stochastic, or at least the agent doesn't have a complete model of it so cannot fully predict what state of the universe an action will produce -- with infinite computing power this problem i... (read more)

3Rohin Shah10mo
Nice comment! The arguments you outline are the sort of arguments that have been considered at CHAI and MIRI quite a bit (at least historically). The main issue I have with this sort of work is that it talks about how an agent should reason, whereas in my view the problem is that even if we knew how an agent should reason we wouldn't know how to build an agent that efficiently implements that reasoning (particularly in the neural network paradigm). So I personally work more on the latter problem: supposing we know how we want the agent to reason, how do we get it to actually reason in that way.

On your actual proposals, talking just about "how the agent should reason" (and not how we actually get it to reason that way):

1) Yeah I really like this idea -- it was the motivation for my work on inferring human preferences from the world state, which eventually turned into my dissertation. (The framing we used was that humans optimized the environment, but we also thought about the fact that humans were optimized to like the environment.) I still basically agree that this is a great way to learn about human preferences (particularly about what things humans prefer you not change), if somehow that ends up being the bottleneck.

2) I think you might be conflating a few different mechanisms here. First, there's the optimizer's curse, where the action with max value will tend to be an overestimate of the actual value. As you note, one natural solution is to have a correction based on an estimate of how much the overestimate is. For this to make a difference, your estimates of overestimates have to be different across different actions; I don't have great ideas on how this should be done. (You mention having different standard deviations + different numbers of statistically-independent variables, but it's not clear where those come from.) Second, there's information value, where the agent should ask about utilities in states that it is uncertain about, rather than charging