Quintin Pope



Announcing the Alignment of Complex Systems Research Group

I think this line of work is very interesting and important. I and a few others are working on something we've dubbed shard theory, which attempts to describe the process of human value formation. The theory posits that the internals of monolithic learning systems actually resemble something like an ecosystem already. However, rather than there being some finite list of discrete subcomponents / modules, it's more like there's a continuous distribution over possible internal subcomponents and features. 

Continuous agency

To take agency as an example, suppose you have a 3-layer transformer being trained via RL using just the basic REINFORCE algorithm. We typically think of such a setup as having one agent with three layers:

However, we can just as easily draw our Cartesian boundaries differently and call it three agents that pass messages between them:

It makes no difference to the actual learning process. In fact, we can draw Cartesian boundaries around any selection of non-overlapping subsets that cover all the model's parameters, call each subset an "agent", and the training process is identical. This becomes interesting when combined with regional specialization in the model. E.g., let's take an extreme case where the reward function only rewards three things:

  1. Manipulating humans
  2. Solving math problems
  3. Helping humans

And let's suppose the model has an extreme degree of regional specialization, such that each layer is fully specialized to exactly one of those tasks. Additionally, let's suppose that the credit assignment process is "perfect", in the sense that, for task i, the outer learning process only updates the parameters of the layer that specializes in task i:

The reason that this matters is because there's only one way for the training process to instill value representations into any of the layers: by updating the parameters of those layers. Thus, if Layer 3 isn't updated on rewards from the "Solving math problems" or "Manipulating humans" tasks, Layer 3's value representations won't care about doing either of those things. 
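A minimal sketch of this stylized setup (my illustration, in NumPy; the task names, layer shapes, and learning rate are all made up): under "perfect" credit assignment, reward from task i moves only layer i's parameters, so the other layers' value representations are untouched.

```python
import numpy as np

# Stylized sketch of the "perfect credit assignment" assumption above:
# each task's reward gradient is applied only to the layer that
# specializes in that task. All names and shapes are illustrative.

rng = np.random.default_rng(0)
layers = [rng.normal(size=(4, 4)) for _ in range(3)]
task_to_layer = {"manipulating": 0, "math": 1, "helping": 2}

def credit_assigned_update(params, grads, task, lr=0.01):
    """Update only the layer responsible for `task`; freeze the rest."""
    i = task_to_layer[task]
    return [w - lr * g if j == i else w
            for j, (w, g) in enumerate(zip(params, grads))]

# A reward signal from the math task leaves layers 0 and 2 untouched,
# so their "value representations" never learn to care about math.
grads = [rng.normal(size=(4, 4)) for _ in range(3)]
updated = credit_assigned_update(layers, grads, "math")
assert np.allclose(updated[0], layers[0])       # layer 0 unchanged
assert not np.allclose(updated[1], layers[1])   # only layer 1 updated
assert np.allclose(updated[2], layers[2])       # layer 2 unchanged
```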

If we view each layer as its own agent (which I think we can), then the overall system's behavior is a multi-agent consensus between components whose values differ significantly.

Of course, real-world specialization is nowhere near this strict. The interactions between complex reward functions and complex environments mean there's more like a continuous distribution over possible rewarding behaviors. Additionally, real-world credit assignment is very noisy. Thus, the actual distribution of agentic specializations looks a bit more like this:

Thus, I interpret RL systems as having something like a continuous distribution over possible internal agents, each of which implements different values. Regions in this distribution are the shards of shard theory. I.e., shards refer to dense regions in your distribution over agentic, values-implementing computations.

Convergent value reflection / philosophy

This has a number of (IMO) pretty profound implications. For one, we should not expect AI systems to be certain of their own learned values, for much the same reason humans are uncertain. "Self-misalignment" isn't some uniquely human failing left to us by evolution. It's just how sophisticated RL systems work by default. 

Similarly, something like value reflection is probably convergent among RL systems trained on complex environments / reward signals. Such systems need ways to manage internal conflicts among their shards. The process of weighting / negotiating between / compromising among internal values, and the agentic processes implementing those values, is probably quite important for broad classes of RL systems, not just humans.

Additionally, something like moral philosophy is probably convergent as well. Unlike value reflection, moral philosophy relates to whether (and how) the current shards allow additional shards to form.

Suppose you (a human) have a distribution of shards that implement common sense human values like "don't steal", "don't kill", etc. Then, you encounter a new domain where those shards are a poor guide for determining your actions. Maybe you're trying to determine which charity to donate to. Maybe you're trying to answer weird questions in your moral philosophy class. The point is that you need some new shards to navigate this new domain, so you go searching for one or more new shards, and associated values that they implement. 

Concretely, let's suppose you consider classical utilitarianism (CU) as your new value. The CU shard effectively navigates the new domain, but there's a potential problem: the CU shard doesn't constrain itself to only navigating the new domain. It also produces predictions regarding the correct behavior on the old domains that already existing shards navigate. This could prevent the old shards from determining your behavior on the old domains. For instrumental reasons, the old shards don't want to be disempowered. 

One possible option is for there to be a "negotiation" between the old shards and the CU shard regarding what sort of predictions CU will generate on the domains that the old shards navigate. This might involve an iterative process of searching over the input space to the CU shard for situations where the CU shard strongly diverges from the old shards, in domains that the old shards already navigate. Each time a conflict is found, you either modify the CU shard to agree with the old shards, constrain the CU shard so as to not apply to those sorts of situations, or reject the CU shard entirely if no resolution is possible.
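The iterative conflict-search just described can be sketched as a toy loop (my illustration only: shards are modeled as functions scoring a situation, and every name, domain, and threshold here is made up):

```python
def negotiate(new_shard, old_shards, situations, old_domain, tolerance):
    """Search for situations in the old shards' domain where the new
    shard strongly diverges from them, and flag those for constraint."""
    restricted = set()
    for s in situations:
        if not old_domain(s):
            continue  # new domain: the new shard is free to act here
        old_view = sum(shard(s) for shard in old_shards) / len(old_shards)
        if abs(new_shard(s) - old_view) > tolerance:
            restricted.add(s)  # constrain the new shard on this situation
    return restricted

# Toy example: the old shards endorse "do nothing drastic" (score 0) on
# familiar situations 0-4; a naive utilitarian shard scores situations
# by their numeric "stakes".
old_shards = [lambda s: 0.0]
cu_shard = lambda s: float(s)
conflicts = negotiate(cu_shard, old_shards, range(10),
                      old_domain=lambda s: s < 5, tolerance=2.0)
# The utilitarian shard is constrained exactly where it diverges from
# the old shards inside their domain.
assert conflicts == {3, 4}
```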

The above essentially describes the core of the cognitive process we call moral philosophy. However, none of the underlying motivations for this process are unique to humans or our values. In this framing, moral philosophy is essentially a form of negotiation between existing shards and a new shard that implements desirable cognitive capabilities. The old shards agree to let the new shard come into existence. In exchange, the new shard agrees to align itself to the values of the old shards (or at least, not conflict too strongly). 

Continuous Ontologies

I also think the continuous framing applies to other features of cognition beyond internal agents. E.g., I don't think it's appropriate to think of an AI or human as having a single ontology. Instead, they both have distributions over possible ontologies. In any given circumstance, the AI / human will dynamically sample an appropriate-seeming ontology from said distribution. 

This possibly explains why humans don't seem to suffer particularly from ontological crises. E.g., learning quantum mechanics does not result in humans (or AIs) suddenly switching from a classical to a quantum ontology. Rather, their distribution over possible ontologies simply extends its support to a new region in the space of possible ontologies. However, this is a process that happens continuously throughout learning, so the already existing values shards are usually able to navigate the shift fine. 

This neatly explains human robustness to ontological issues without having to rely on evolution somehow hard-coding complex crisis handling adaptations into the human learning process (despite the fact that our ancestors never had to deal with ontological shifts such as discovering QM).

Implications for value fragility

I also think that the idea of "value fragility" changes significantly when you shift from a discrete view of values to a continuous view. If you assume a discrete view, then you're likely to be greatly concerned by the fact that repeated introspection on your values will give different results. It feels like your values are somehow unstable, and that you need to find the "true" form of your values. 

This poses a significant problem for AI alignment. If you think that you have some discrete set of "true" value concepts, and that an AI will also have a discrete set of "true" value concepts, then these sets need to align near-perfectly to have any chance of the AI optimizing for what we actually want. I.e., this picture:

In the continuous perspective, a value has no single “true” concept, only a continuous distribution over possible instantiations. The values that are introspectively available to us at any given time are discrete samples from that distribution. In fact, looking for a value's "true" conceptualization is a type error, roughly analogous to thinking that a Gaussian distribution has some hidden "true" sample that manages to capture the entire distribution in one number.

An AI and human can have overlap between their respective value distributions, even without those distributions perfectly agreeing. It’s possible for an AI to have an important and non-trivial degree of alignment with human values without requiring the near-perfect alignment the discrete view implies is necessary, as illustrated in the diagram below:
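As a toy quantitative illustration of that last point (my example, not from the post): two Gaussian "value distributions" with different means still share substantial probability mass, so alignment under the continuous view is a matter of degree rather than all-or-nothing.

```python
import math

def gaussian_pdf(x, mu, sigma=1.0):
    """Density of a normal distribution at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def overlap_coefficient(mu1, mu2, lo=-10.0, hi=10.0, n=20001):
    """Shared probability mass: integral of min(p, q), trapezoid rule."""
    xs = [lo + (hi - lo) * i / (n - 1) for i in range(n)]
    ys = [min(gaussian_pdf(x, mu1), gaussian_pdf(x, mu2)) for x in xs]
    h = (hi - lo) / (n - 1)
    return h * (sum(ys) - 0.5 * (ys[0] + ys[-1]))

# Two unit-variance Gaussians with means one unit apart still share
# roughly 62% of their mass (analytically, 2*Phi(-1/2) ~ 0.617).
print(round(overlap_coefficient(0.0, 1.0), 3))  # → 0.617
```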


If you want, you can join the shard theory discord: https://discord.gg/AqYkK7wqAG

You can also read some of our draft documents for explaining shard theory:

Shard theory 101 (Broad introduction, focuses less on the continuous view and more on the value / shard formation process and how that relates to evolution)

Your Shards and You: The Shard Theory Account of Common Moral Intuitions (Focuses more on shards as self-perpetuating optimization demons, similar to what you call self-enforcing abstractions)

What even are "shards"? (Presents the continuous view of values / agency, fairly similar to this comment)

Intuitions about solving hard problems

Hmm. I suppose a similar key insight for my own line of research might go like:

The orthogonality thesis is actually wrong for brain-like learning systems. Such systems first learn many shallow proxies for their reward signal. Moreover, the circuits implementing these proxies are self-preserving optimization demons. They’ll steer the learning process away from the true data generating process behind the reward signal so as to ensure their own perpetuation.

If true, this insight matters a lot for value alignment because it points to a way that aligned behavior in the infra-human regime could perpetuate into the superhuman regime. If all of:

  • We can instil aligned behavior in the infra-human regime
  • The circuits that implement aligned behavior in the infra-human regime can ensure their own perpetuation into the superhuman regime
  • The circuits that implement aligned behavior in the infra-human regime continue to implement it in the superhuman regime

hold true, then I think we’re in a pretty good position regarding value alignment. Off-switch corrigibility is a bust though because self-preserving circuits won’t want to let you turn them off.

If you’re interested in some of the actual arguments for this thesis, you can read my answer to a question about the relation between human reward circuitry and human values.

Favorite / most obscure research on understanding DNNs?

I’d suggest my LessWrong post on grokking and SGD: Hypothesis: gradient descent prefers general circuits. It argues that SGD has an inductive bias towards general circuits, and that this explains grokking. I’m not certain the hypothesis is correct, but the post is very obscure and a favourite of mine, so I feel it’s appropriate for this question.

SGD’s bias, a post by John Wentworth, explores a similar question by using an analogy to a random walk.

I suspect you’re familiar with it, but Is SGD a Bayesian sampler? Well, almost advances the view that DNN initialisations strongly prefer simple functions, and this explains DNN generalisation.

Locating and Editing Factual Knowledge in GPT is a very recent interpretability work looking at where GPT models store their knowledge and how we can access/modify that knowledge.

Edit: I see you’re the person who wrote How complex are myopic imitators?. In that case, I’d again point you towards Hypothesis: gradient descent prefers general circuits, with the added context that I now suspect (but am not confident) that “generality bias from SGD” and “simplicity bias from initialisation” are functionally identical in terms of their impact on the model’s internals. If so, I think the main value of generality bias as a concept is that it’s easier for humans to intuit the sorts of circuits that generality bias favours. I.e., circuits that contribute to a higher fraction of the model’s outputs.

Hypothesis: gradient descent prefers general circuits

First, I'd like to note that I don't see why faster convergence after changing the learning rate supports either story. After initial memorization, the loss decreases by ~3 OOM. Regardless of what's going on inside the network, it wouldn't be surprising if raising the learning rate sped up convergence.

Also, I think what's actually going on here is weirder than either of our interpretations. I ran experiments where I kept the learning rate the same for the first 1000 steps, then increased it by 10x and 50x for the rest of the training.
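For reference, the schedule used in these runs looks like this (a minimal sketch; the step count and multipliers are the ones stated above):

```python
def stepped_lr(step, base=1e-3, boost=10.0, switch=1000):
    """Base learning rate for the first `switch` steps, then `boost`x larger.

    Matches the experiments described above: LR 0.001 for the first 1000
    steps, then 10x (0.01) or 50x (0.05) for the rest of training.
    """
    return base if step < switch else base * boost
```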

Here is the accuracy curve with the default learning rate:

Base LR throughout (0.001)

Here is the curve with 10x learning rate:

10x LR after 1000 steps (0.01)

And here is the curve with 50x learning rate:

50x LR after 1000 steps (0.05)

Note that increasing the learning rate doesn't consistently speed up validation convergence. The 50x run does reach convergence faster, but the 10x run never reaches it at all.

In fact, increasing the learning rate causes the training accuracy to fall to the validation accuracy, after which they begin to increase together (at least for a while). In the 10x run, the training accuracy quickly diverges from the validation accuracy. In the 50x run, the training and validation accuracies move in tandem throughout the run.

Frederik's results are broadly similar. If you mouse over the accuracy and loss graphs, you'll see that 

  1. Training performance drops significantly immediately after the learning rate increases.
  2. The losses and accuracies of the "5x" and "10x" lines correlate together pretty well between training/validation. In contrast, the losses and accuracies of the "default" lines don't correlate strongly between training and testing. 

I think that increasing the learning rate after memorization causes some sort of "mode shift" in the training process. It goes from:

First, learn shallow patterns that strongly overfit to the training data, then learn general patterns.

to:

Immediately learn general patterns that perform about equally well on the training and validation data.

In the case of my 10x run, I think it actually has two mode transitions, first from "shallow first" to "immediately general", then another transition back to "shallow first", and that's why you see the training accuracy diverge from the validation accuracy again.

I think results like these make a certain amount of sense, given that higher learning rates are associated with better generalization in more standard settings.

AI Tracker: monitoring current and near-future risks from superscale models

This seems like a great resource. I also like the way it’s presented. It’s very clean.

I’d appreciate more focus on the monetary return on investment large models provide their creators. I think that’s the key metric that will determine how far firms scale up these large models. Relatedly, I think it’s important to track advancements that improve model/training efficiency because they can change the expected ROI for further scaling models.

A positive case for how we might succeed at prosaic AI alignment

I agree that transformers vs other architectures is a better example of the field “following the leader” because there are lots of other strong architectures (perceiver, mlp mixer, etc). In comparison, using self supervised transfer learning is just an objectively good idea you can apply to any architecture and one the brain itself almost surely uses. The field would have converged to doing so regardless of the dominant architecture.

One hopeful sign is how little attention the ConvBERT language model has gotten. It mixes some convolution operations with self attention to allow self attention heads to focus on global patterns as opposed to local patterns better handled by convolution. ConvBERT is more compute efficient than a standard transformer, but hasn’t made much of a splash. It shows the field can ignore low profile advances made by smaller labs.

For your point about the value of alignment: I think there’s a pretty big range of capabilities where the marginal return on extra capabilities is higher than the marginal return on extra alignment. Also, you seem focused on avoiding deception/treacherous turns, which I think are a small part of alignment costs until near human capabilities.

I don’t know what sort of capabilities penalty you pay for using a myopic training objective, but I don’t think there’s much margin available before voluntary mass adoption becomes implausible.

A positive case for how we might succeed at prosaic AI alignment

The reason self supervised approaches took over NLP is because they delivered the best results. It would be convenient if the most alignable approach also gave the best results, but I don’t think that’s likely. If you convince the top lab to use an approach that delivered worse results, I doubt much of the field would follow their example.

Meta learning to gradient hack

Thanks for the feedback! I use batch norm regularisation, but not dropout.

I just tried retraining the 100,000 cycle meta learned model in a variety of ways, including for 10,000 steps with 10,000x higher lr, using resilient backprop (which multiplies weights by a factor to increase/decrease them), and using an L2 penalty to decrease weight magnitude. So far, nothing has gotten the network to model the base function. The L2 penalty did reduce weight values to ~the normal range, but the network still didn’t learn the base function.
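For concreteness, the L2 penalty variant used the standard form of weight penalty (a sketch; the coefficient value here is illustrative, not the one actually used):

```python
import numpy as np

def penalized_loss(task_loss, weights, lam=1e-2):
    """Task loss plus an L2 penalty pushing weight magnitudes down.

    `lam` is an illustrative value; larger values shrink weights harder.
    """
    return task_loss + lam * sum(float(np.sum(w ** 2)) for w in weights)

# Example: a 2x2 weight matrix of ones contributes lam * 4 to the loss.
print(penalized_loss(1.0, [np.ones((2, 2))], lam=0.5))  # → 3.0
```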

I now think the increase in weight values is just incidental and that the meta learner found some other way of protecting the network from SGD.

Paths To High-Level Machine Intelligence

Thank you for this excellent post. Here are some thoughts I had while reading.

The hard paths hypothesis:

I think there's another side to the hard paths hypothesis. We are clearly the first technology-using species to evolve on Earth. However, it's entirely possible that we're not the first species with human-level intelligence. If a species with human level intelligence but no opposable thumbs evolved millions of years ago, they could have died out without leaving any artifacts we'd recognize as signs of intelligence.

Besides our intelligence, humans seem odd in many ways that could plausibly contribute to developing a technological civilization.

  • We are pretty long-lived.
  • We are fairly social.
    • Feral children raised outside of human culture experience serious and often permanent mental disabilities (Wikipedia).
    • A species with human-level intelligence, but whose members live mostly independently may not develop technological civilization.
  • We have very long childhoods.
  • We have ridiculously high manual dexterity (even compared to other primates).
  • We live on land.

Given how well-tuned our biology seems for developing civilization, I think it's plausible that multiple human-level intelligent species arose in Earth's history, but additional bottlenecks prevented them from developing technological civilization. However, most of these bottlenecks wouldn't be an issue for an intelligence generated by simulated evolution. E.g., we could intervene in such a simulation to give low-dexterity species other means of manipulating their environment. Perhaps Earth's evolutionary history actually contains n human-level intelligent species, only one of which developed technology. That implies the true compute required to evolve human-level intelligence is far lower.

Brain imitation learning:

I also think the discussion of neuromorphic AI and whole brain emulation misses an important possibility that Gwern calls "brain imitation learning". In essence, you record a bunch of data about human brain activity (using EEG, implanted electrodes, etc.), then you train a deep neural network to model the recorded data (similar to how GPT-3 or BERT model text). The idea is that modeling brain activity will cause the deep network to learn some of the brain's neurological algorithms. Then, you train the deep network on some downstream task and hope its learned brain algorithms generalize to the task in question.

I think brain imitation learning is pretty likely to work. We've repeatedly seen in deep learning that knowledge distillation (training a smaller student model to imitate a larger teacher model) is FAR more computationally efficient than trying to train the student model from scratch, while also giving superior performance (Wikipedia, distilling BERT, distilling CLIP). Admittedly, brain activity data is pretty expensive. However, the project that finally builds human-level AI will plausibly cost billions of dollars in compute for training. If brain imitation learning can cut the price by even 10%, it will be worth hundreds of millions in terms of saved compute costs.
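For reference, the distillation objective behind those results is typically a KL divergence between temperature-softened teacher and student outputs (a minimal NumPy sketch of the standard Hinton-style recipe; the temperature and logits are illustrative):

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D array of logits."""
    z = np.asarray(z, dtype=float)
    e = np.exp(z - z.max())
    return e / e.sum()

def distillation_loss(student_logits, teacher_logits, T=2.0):
    """KL divergence from the student's to the teacher's
    temperature-softened output distribution (T is illustrative)."""
    p = softmax(np.asarray(teacher_logits) / T)  # soft teacher targets
    q = softmax(np.asarray(student_logits) / T)  # student predictions
    return float(np.sum(p * (np.log(p) - np.log(q))))

# The loss is zero when the student already matches the teacher, and
# positive otherwise.
assert distillation_loss([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]) < 1e-12
assert distillation_loss([3.0, 2.0, 1.0], [1.0, 2.0, 3.0]) > 0.1
```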

DeepMind: Generally capable agents emerge from open-ended play

What really impressed me were the generalized strategies the agent applied to multiple situations/goals. E.g., "randomly move things around until something works" sounds simple, but learning to contextually apply that strategy 

  1. to the appropriate objects, 
  2. in scenarios where you don't have a better idea of what to do, and 
  3. immediately stopping when you find something that works 

is fairly difficult for deep agents to learn. I think of this work as giving the RL agents a toolbox of strategies that can be flexibly applied to different scenarios. 

I suspect that finetuning agents trained in XLand in other physical environments will give good results because the XLand agents already know how to use relatively advanced strategies. Learning to apply the XLand strategies to the new physical environments will probably be easier than starting from scratch in the new environment.
