All of Quintin Pope's Comments + Replies

Steganography in Chain of Thought Reasoning

I think it probably works out in the other direction. The easiest way to encode problem-relevant info is by changing the conceptual content of the generated text, not by making weird changes to the grammatical rules. This is because KL penalties and cross entropy loss actually prioritize modeling exact grammatical patterns over conceptual content. E.g., GPTs quickly learn to generate ~perfect grammar, but are slow to pick up on / produce coherent conceptual content.

Another framing is to think of the KL-regularized capabilities training as Bayes updating of th... (read more)

1Alex Gray3d
I think at this point these feel like empirical questions, which would be much more clearly answered by demonstrations or experiments. Trying to encode an additional penalty on changing non-semantic information is an interesting idea. However, I think you're missing that you don't have the ability to directly compare to a reference LM in cases where you're training to improve on some performance benchmark. During training, the model will change its predictions on everything to some degree (both semantic and non-semantic content). So your proposed detection will always show some amount of disagreement between the prior and the trained model on weird grammatical patterns as well as conceptual tokens. The question is: is the difference merely due to the changes that improve performance, or is it also transmitting hidden information?
Steganography in Chain of Thought Reasoning

Apologies for my first comment. I was tired, and most of what I said about KL divergence is actually wrong.

The core issue is that you can't actually avoid the KL penalty by hiding in the nullspace. That's just not how KL divergence works. A KL divergence of zero implies that the two LMs specify exactly the same distribution over tokens. If you deviate away from the reference LM's prior in any manner, then you take a KL penalty. 

E.g., if the reference LM is indifferent between the phrases “My stomach is grumbling” or “I could use a snack”, but the steg... (read more)
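The KL point can be made concrete with a toy computation (made-up numbers; a sketch of the argument, not a claim about any particular model):

```python
import math

def kl(p, q):
    """KL(p || q) over a shared discrete support, in nats; 0 iff p == q."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Suppose the reference LM is indifferent between two paraphrases
# (hypothetical numbers).
reference = [0.5, 0.5]

# A steganographic policy that uses the choice of paraphrase to encode
# a bit must deviate from that indifference.
stego = [0.9, 0.1]

assert kl(reference, reference) == 0.0  # identical distributions: no penalty
assert kl(stego, reference) > 0.0       # any deviation from the prior is penalized
```

There is no "free" direction: the penalty is zero only when the two distributions over tokens agree exactly.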

2Alex Gray3d
If what you’re saying is “any change to the distribution will change KL” — I think that’s just correct. This also applies to changes during training where the model is learning to perform better on the objective task, so we are expecting some amount of KL divergence already. My claims are:

1. The cheapest places to hide information (due to KL) are places where the model already has high entropy (i.e., it is uncertain between many possible outputs)
2. Optimization pressure will try to push this extra information into the cheapest places to hide
3. The increase in KL won’t be clearly distinguishable from the increase due to improved performance on the task
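Claim 1 above can be sketched numerically (hypothetical token distributions, not measurements from any real model):

```python
import math

def kl(p, q):
    """KL(p || q) over a shared discrete support, in nats."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Two token positions under the reference LM (made-up numbers):
high_entropy_ref = [0.25, 0.25, 0.25, 0.25]  # model is uncertain here
low_entropy_ref = [0.97, 0.01, 0.01, 0.01]   # model is confident here

# To transmit information at a position, a stego policy must commit
# (nearly) deterministically to a token of its choosing.
cost_high = kl([0.97, 0.01, 0.01, 0.01], high_entropy_ref)
cost_low = kl([0.01, 0.97, 0.01, 0.01], low_entropy_ref)

# Committing at the uncertain position is much cheaper in KL than
# overriding the model's strong preference at the confident position.
assert cost_high < cost_low
```

So an optimizer pressured by a KL penalty would plausibly route extra bits through positions where the prior is already uncertain.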
Steganography in Chain of Thought Reasoning

I think this is a very important direction to investigate. CoT reasoners seem like our best shot at aligning an AGI.

The KL objective pushes these correlations into the “null space” where they are not penalized.

I don’t think this is true. There are many ways to express the same underlying idea, but GPT-3 is not agnostic between them, so divergences within this supposed “null space” are still penalized. The KL penalty isn’t wrt “does this text contain the same high level contents according to a human?” It’s wrt “does this text match GPT-3’s most likely continuations?”

Edit: not actually how KL divergence works.

1Alex Gray4d
I think I understand what you're saying, but I want to double check and try laying it out explicitly.

* The "null space" is actually the null space of the prior model, which is human-like, but not exactly lined up to human perception of language
* Because of this, the information is "hidden" w.r.t. the language model prior, not exactly w.r.t. human perceptions

I think I agree with both of these points, but here's my thinking for why I still expect to see this phenomenon (and why the article above was simplified to just be "human"):

* Empirically, drawing many samples results in lots of "contains similar high level contents according to a human (me)"
  * It might be worth studying this explicitly, such as setting up a bunch of chain of thought reasoning like scenarios, sampling a lot of generated reasoning, then asking humans to rate how similar the generations are
  * My prediction is that people will think that they are largely similar
  * We could also use this same method to study human-measured relevance of other non-trivial sampling methods, like beam search, etc.
* "Hiding" w.r.t. the language model prior is sufficient, in the limit of good language models
  * One of the things my reasoning relies on is that we will continue to produce better language models, which means the prior more and more closely resembles human priors over time
  * This suggests that one way to avoid this phenomenon is to only use smaller / less capable models -- or at least they will do a worse job at hiding information. I currently expect that many research labs will continue to do chain of thought reasoning with state of the art models, though.
* There's a lot of room in the model distribution
  * Chain of thought reasoning is not currently, nor do I expect it will ever be, overconstrained
  * Instead, it can be seen as a mechanism for dramatically increasin
Reward is not the optimization target

but at this point the outer optimizer might notice (do the algorithmic equivalent of thinking the following), "If I modified this agent slightly by making it answer 'good' instead (or increasing its probability of answering 'good'), then expected future reward will be increased."

This is where I disagree with your mechanics story. The RL algorithm is not that clever. If the agent doesn’t explore in the direction of answering “good”, then there’s no gradient in that direction. You can propose different types of outer optimizers which are this clever and can do intentional lookahead like this, but e.g., policy gradient isn’t doing that.
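The "no exploration, no gradient" point can be sketched with a toy bandit (made-up actions and rewards; vanilla REINFORCE with a softmax policy, stdlib only):

```python
import math
import random

random.seed(0)

# Hypothetical bandit: answering "good" would earn the highest reward,
# but the current policy essentially never samples it.
actions = ["bad", "okay", "good"]
reward = {"bad": 0.0, "okay": 1.0, "good": 10.0}
logits = [2.0, 2.0, -30.0]  # "good" has ~zero probability under the policy

def softmax(z):
    m = max(z)
    exps = [math.exp(x - m) for x in z]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax(logits)

# REINFORCE estimate: grad = (1/N) * sum_i R(a_i) * d/dlogits log pi(a_i),
# where d/dlogits log pi(a) = onehot(a) - probs for a softmax policy.
grad = [0.0, 0.0, 0.0]
N = 10_000
for _ in range(N):
    a = random.choices(range(3), weights=probs)[0]
    for j in range(3):
        grad[j] += reward[actions[a]] * ((1.0 if j == a else 0.0) - probs[j])
grad = [g / N for g in grad]

# The estimator never observes the large reward for "good", so nothing
# pulls its logit up; rewards on sampled actions push it (negligibly) down.
assert grad[2] <= 0.0
```

The estimator only ever sees rewards for actions it actually takes, so the hypothetical "modify the agent to answer 'good'" improvement is invisible to it unless exploration stumbles onto that answer.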

3Matthew "Vaniver" Graves11d
Wait, I don't think this is true? At least, I'd appreciate it being stepped thru in more detail. In the simplest story, we're imagining an agent whose policy is π_θ and, for simplicity's sake, θ_0 is a scalar that determines "how much to maximize for reward", and all the other parameters of θ store other things about the dynamics of the world / decision-making process. It seems to me that ∇_θ is obviously going to try to point θ_0 in the direction of "maximize harder for reward". In the more complicated story, we're imagining an agent whose policy is π_θ, which involves how it manipulates both external and internal actions (and thus both external and internal state). One of the internal state pieces (let's call it s_0 like last time) determines whether it selects actions that are more reward-seeking or not. Again I think it seems likely that ∇_θ is going to try to adjust θ such that the agent selects internal actions that point s_0 in the direction of "maximize harder for reward". What is my story getting wrong?
Humans provide an untapped wealth of evidence about alignment

There must have been some reason(s) why organisms exhibiting empathy were selected for during our evolution. However, evolution did not directly configure our values. Rather, it configured our (individually slightly different) learning processes. Each human’s learning process then builds their different values based on how the human’s learning process interacts with that human’s environment and experiences.

The human learning process (somewhat) consistently converges to empathy. Evolution might have had some weird, inhuman reason for configuring a learning ... (read more)

0sairjy1mo
We could study such a learning process, but I am afraid the lessons learned won't be so useful. Even among human beings, there is huge variability in how much those emotions arise or, if they do, in how much they affect behavior. Worse, humans tend to hack these feelings (incrementing or decrementing them) to achieve other goals: e.g., MDMA to increase love/empathy, or drugs for soldiers to make them soulless killers. An AGI will have a much easier time hacking these pro-social reward functions.
Human values & biases are inaccessible to the genome

I'd note that it's possible for an organism to learn to behave (and think) in accordance with the "simple mathematical theory of agency" you're talking about, without said theory being directly specified by the genome. If the theory of agency really is computationally simple, then many learning processes probably converge towards implementing something like that theory, simply as a result of being optimized to act coherently in an environment over time.

5Vanessa Kosoy1mo
Well, how do you define "directly specified"? If human brains reliably converge towards a certain algorithm, then effectively this algorithm is specified by the genome. The real question is which parts depend only on genes and which parts depend on the environment. My tentative opinion is that the majority is in the genes, since humans are, broadly speaking, pretty similar to each other. One environmental effect: feral humans grow up with serious mental problems. But my guess is that this is not because of missing "values" or "biases", but (to a 1st approximation) because they lack the ability to think in language. Another contender for the environment-dependent part is cultural values. But even here, I suspect that humans just follow social incentives rather than acquire cultural values as an immutable part of their own utility function. I admit that it's difficult to be sure about this.
Humans provide an untapped wealth of evidence about alignment

I think a lot of people have thought about how humans end up aligned to each other, and concluded that many of the mechanisms wouldn't generalize.

I disagree both with this conclusion and the process that most people use to reach it. 

The process: I think that, unless you have a truly mechanistic, play-by-play, and predictively robust understanding of how human values actually form, you are not in an epistemic position to make strong conclusions about whether or not the underlying mechanisms can generalize to superintelligences.

E.g., there a... (read more)

2tailcalled1mo
I don't think I've ever seen a truly mechanistic, play-by-play and robust explanation of how anything works in human psychology. At least not by how I would label things, but maybe you are using the labels differently; can you give an example?
On how various plans miss the hard bits of the alignment challenge

I don't think that "evolution -> human values" is the most useful reference class when trying to calibrate our expectations wrt how outer optimization criteria relate to inner objectives. Evolution didn't directly optimize over our goals. It optimized over our learning process and reward circuitry. Once you condition on a particular human's learning process + reward circuitry configuration + the human's environment, you screen off the influence of evolution on that human's goals. So, there are really two areas from which you can draw evidence about inne... (read more)

0mesaoptimizer1mo
The most important claim in your comment is that "human learning → human values" is evidence that solving / preventing inner misalignment is easier than it seems when one looks at it from the "evolution → human values" perspective. Here's why I disagree:

Evolution optimized humans for an environment very different from what we see today. This implies that humans are operating out-of-distribution, and we see evidence of misalignment. Birth control is a good example of this.

A human's environment optimizes that human continually towards a certain objective (one that changes given changes in the environment). The human is aligned with the environment's objective in that distribution. Outside that distribution, the human may not be aligned with the objective intended by the environment.

An outer misalignment example of this is a person brought up in a high-trust environment, and then thrown into a low-trust / high-conflict environment. Their habits and tendencies make them an easy mark for predators.

An inner misalignment example of this is a gay man who grows up in an environment hostile to his desires and his identity (but knows of environments where this isn't true). After a few extremely negative reactions to him opening up to people, or expressing his desires, he'll simply decide to present himself as heterosexual and bide his time and gather the power to leave the environment he is in.

One may claim that the previous example somehow doesn't count because one's sexual orientation is biologically determined (and I'm assuming this to be the case for this example, even if it may not be entirely true), meaning that evolution optimized this particular human to be inner misaligned relative to their environment. However, that doesn't weaken this argument: "human learning → human values" shows a huge amount of evidence of inner misalignment being ubiquitous. I worry you are being insufficiently pessimistic.
Human values & biases are inaccessible to the genome

The post isn't saying that there's no way for the genome to influence your preferences / behavior. More like, "the genome faces similar inaccessibility issues as us wrt learned world models", meaning it needs to use roundabout methods of influencing a person's learned behavior / cognition / values. E.g., the genome can specify some hard-coded rewards for experiential correlates of engaging in sexual activity. Future posts will go into more detail on how some of those roundabout ways might work.

The post is phrased pretty strongly (e.g. it makes claims about things being "inaccessible" and "intractable").

Especially given the complexity of the topic, I expect the strength of these claims to be misleading. What one person thinks of as "roundabout methods" another might consider "directly specifying". I find it pretty hard to tell whether I actually disagree with your and Alex's views, or just the way you're presenting them.

Announcing the Alignment of Complex Systems Research Group

I think this line of work is very interesting and important. I and a few others are working on something we've dubbed shard theory, which attempts to describe the process of human value formation. The theory posits that the internals of monolithic learning systems actually resemble something like an ecosystem already. However, rather than there being some finite list of discrete subcomponents / modules, it's more like there's a continuous distribution over possible internal subcomponents and features. 

Continuous agency

To take agency as an example, sup... (read more)

3Jan_Kulveit2mo
Thanks! It's a long comment, so I'll comment on the convergence, morphologies, and the rest later; here is just a top-level comment on shards. (I've read about half of the doc.) My impression is that they are basically the same thing which I called "agenty subparts" in Multi-agent predictive minds and AI alignment [https://www.lesswrong.com/posts/3fkBWpE4f9nYbdf7E/multi-agent-predictive-minds-and-ai-alignment] (and Friston calls "fixed priors"), where "agenty" means ~ a description from the intentional stance is a good description, in the information theory sense. (This naturally implies fluid boundaries and continuity.) Where I would disagree / find your terminology unclear is where you refer to this as an example of inner alignment failure. Putting "agenty subparts" into the predictive processing machinery is not a failure, but a bandwidth-feasible way for evolution to communicate valuable states to the PP engine. Also: I think what you are possibly underestimating is how much evolution builds on top of existing, evolutionarily older control circuitry. E.g., evolution does not need to "point to a concept of sex in the PP world model"; evolution was able to make animals seek reproduction a long time before it invented complex brains. This simplifies the task: what evolution actually had to do was connect the "PP agenty parts" to parts of the existing control machinery, which is often based on "body states". Technically, the older control systems often use chemicals in the blood, or quite old parts of the brain.
Intuitions about solving hard problems

Hmm. I suppose a similar key insight for my own line of research might go like:

The orthogonality thesis is actually wrong for brain-like learning systems. Such systems first learn many shallow proxies for their reward signal. Moreover, the circuits implementing these proxies are self-preserving optimization demons. They’ll steer the learning process away from the true data generating process behind the reward signal so as to ensure their own perpetuation.

If true, this insight matters a lot for value alignment because it points to a way that aligned beh... (read more)

Favorite / most obscure research on understanding DNNs?

I’d suggest my LessWrong post on grokking and SGD, “Hypothesis: gradient descent prefers general circuits”. It argues that SGD has an inductive bias towards general circuits, and that this explains grokking. On the one hand, I’m not certain the hypothesis is correct. However, the post is very obscure and is a favourite of mine, so I feel it’s appropriate for this question.

SGD’s bias, a post by John Wentworth, explores a similar question by using an analogy to a random walk.

I suspect you’re familiar with it, but Is SGD a Bayesian sampler? Well, almost advance... (read more)

Hypothesis: gradient descent prefers general circuits

First, I'd like to note that I don't see why faster convergence after changing the learning rate supports either story. After initial memorization, the loss decreases by ~3 OOM. Regardless of what's going on inside the network, it wouldn't be surprising if raising the learning rate increased convergence.

Also, I think what's actually going on here is weirder than either of our interpretations. I ran experiments where I kept the learning rate the same for the first 1000 steps, then increased it by 10x and 50x for the rest of the training.

Here is the accurac... (read more)

2Rohin Shah6mo
I'm kinda confused at your perspective on learning rates. I usually think of learning rates as being set to the maximum possible value such that training is still stable, so it would in fact be surprising if you could just 10x them to speed up convergence. (An additional aspect of my prediction, then, would be that you can't 10x the learning rate at the beginning of training; if you could, then it seems like the hyperparameters were chosen poorly and that should be fixed first.) Indeed, in your experiments, at the moment you 10x the learning rate, accuracy does in fact plummet! I'm a bit surprised it manages to recover, but you can see that the recovery is not nearly as stable as the original training before increasing the learning rate (this is even more obvious in the 50x case), and notably even the recovery for the training accuracy looks like it takes longer (1000-2000 steps) than the original increase in training accuracy (~400 steps). I do think this suggests that you can't in fact "just 10x the learning rate" once grokking starts, which seems like a hit to my story.
AI Tracker: monitoring current and near-future risks from superscale models

This seems like a great resource. I also like the way it’s presented. It’s very clean.

I’d appreciate more focus on the monetary return on investment large models provide their creators. I think that’s the key metric that will determine how far firms scale up these large models. Relatedly, I think it’s important to track advancements that improve model/training efficiency because they can change the expected ROI for further scaling models.

3Edouard Harris9mo
Thanks for the kind words and thoughtful comments. You're absolutely right that expected ROI ultimately determines scale of investment. I agree on your efficiency point too: scaling and efficiency are complements, in the sense that the more you have of one, the more it's worth investing in the other. I think we will probably include some measure of efficiency as you've suggested. But I'm not sure exactly what that will be, since efficiency measures tend to be benchmark-dependent so it's hard to get apples-to-apples here for a variety of reasons. (e.g., differences in modalities, differences in how papers record their results, but also the fact that benchmarks tend to get smashed pretty quickly these days, so newer models are being compared on a different basis from old ones.) Did you have any specific thoughts about this? To be honest, this is still an area we are figuring out. On the ROI side: while this is definitely the most important metric, it's also the one with by far the widest error bars. The reason is that it's impossible to predict all the creative ways people will use these models for economic ends — even GPT-3 by itself might spawn entire industries that don't yet exist. So the best one could hope for here is something like a lower bound with the accuracy of a startup's TAM estimate: more art than science, and very liable to be proven massively wrong in either direction. (Disclosure: I'm a modestly prolific angel investor, and I've spoken to — though not invested in — several companies being built on GPT-3's API.) There's another reason we're reluctant to publish ROI estimates: at the margin, these estimates themselves bolster the case for increased investment in scaling, which is concerning from a risk perspective. This probably wouldn't be a huge effect in absolute terms, since it's not really the sort of thing effective allocators weigh heavily as decision inputs, but there are scenarios where it matters and we'd rather not push our luck. Thanks
A positive case for how we might succeed at prosaic AI alignment

I agree that transformers vs other architectures is a better example of the field “following the leader” because there are lots of other strong architectures (perceiver, mlp mixer, etc). In comparison, using self supervised transfer learning is just an objectively good idea you can apply to any architecture and one the brain itself almost surely uses. The field would have converged to doing so regardless of the dominant architecture.

One hopeful sign is how little attention the ConvBERT language model has gotten. It mixes some convolution operations with se... (read more)

A positive case for how we might succeed at prosaic AI alignment

The reason self supervised approaches took over NLP is because they delivered the best results. It would be convenient if the most alignable approach also gave the best results, but I don’t think that’s likely. If you convince the top lab to use an approach that delivered worse results, I doubt much of the field would follow their example.

5Evan Hubinger9mo
I suspect that there were a lot of approaches that would have produced similar results to how we ended up doing language modeling. I believe that the main advantage of Transformers over LSTMs is just that LSTMs have exponentially decaying ability to pay attention to prior tokens while Transformers can pay constant attention to all tokens in the context. I suspect that it would have been possible to fix the exponential decay problem with LSTMs and get them to scale like Transformers, but Transformers came first, so nobody tried. And that's not to say that ML as a field is incompetent or anything—it's just why would you try when you already have Transformers. Also, note that “best results” for powerful AI systems is going to include alignment—alignment is a pretty important component of best results for any actual practical application that the big labs care about that isn't just “scores the highest on some benchmark.”
Meta learning to gradient hack

Thanks for the feedback! I use batch norm regularisation, but not dropout.

I just tried retraining the 100,000 cycle meta learned model in a variety of ways, including for 10,000 steps with 10,000x higher lr, using resilient backprop (which multiplies weights by a factor to increase/decrease them), and using an L2 penalty to decrease weight magnitude. So far, nothing has gotten the network to model the base function. The L2 penalty did reduce weight values to ~the normal range, but the network still didn’t learn the base function.

I now think the increase in weight values is just incidental and that the meta learner found some other way of protecting the network from SGD.

3Evan Hubinger10mo
Interesting! I'd definitely be excited to know if you figure out what it's doing.
Paths To High-Level Machine Intelligence

Thank you for this excellent post. Here are some thoughts I had while reading.

The hard paths hypothesis:

I think there's another side to the hard paths hypothesis. We are clearly the first technology-using species to evolve on Earth. However, it's entirely possible that we're not the first species with human-level intelligence. If a species with human level intelligence but no opposable thumbs evolved millions of years ago, they could have died out without leaving any artifacts we'd recognize as signs of intelligence.

Besides our intelligence, humans seem od... (read more)

1Daniel_Eth1y
Thanks for the comments! RE: THE HARD PATHS HYPOTHESIS I think it's very unlikely that Earth has seen other species as intelligent as humans (with the possible exception of other Homo species). In short, I suspect there is strong selection pressure for (at least many of) the different traits that allow humans to have civilization to go together. Consider dexterity – such skills allow one to use intelligence to make tools; that is, the more dexterous one is, the greater the evolutionary value of high intelligence, and the more intelligent one is, the greater the evolutionary value of dexterity. Similar positive feedback loops also seem likely between intelligence and: longevity, being omnivorous, having cumulative culture, hypersociality, language ability, vocal control, etc. Regarding dolphins and whales, it is true that many have more neurons than us, but they also have thin cortices, low neuronal packing densities, and low axonal conduction velocities [https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4685590/] (in addition to lower EQs [https://en.wikipedia.org/wiki/Encephalization_quotient] than humans). Additionally, birds and mammals are both considered unusually intelligent for animals (more so than reptiles, amphibians, fish, etc), and both birds [https://www.sciencedirect.com/science/article/pii/S0960982220304309#bib49] and mammals [https://www.pnas.org/content/early/2010/11/15/1005246107] have seen (neurological evidence of) gradual trends of increasing (maximum) intelligence over the course of the past 100 MY or more (and even extant nonhuman great apes seem most likely [https://royalsocietypublishing.org/doi/10.1098/rspb.2019.2208] to be somewhat smarter than their last common ancestors with humans). So if there was a previously intelligent species, I'd be scratching my head about when it would have evolved. While we can't completely rule out a previous species as smart as humans (we also can't completely rule out a previous technological species, for w
DeepMind: Generally capable agents emerge from open-ended play

What really impressed me were the generalized strategies the agent applied to multiple situations/goals. E.g., "randomly move things around until something works" sounds simple, but learning to contextually apply that strategy 

  1. to the appropriate objects, 
  2. in scenarios where you don't have a better idea of what to do, and 
  3. immediately stopping when you find something that works 

is fairly difficult for deep agents to learn. I think of this work as giving the RL agents a toolbox of strategies that can be flexibly applied to different scenari... (read more)