All of Quintin Pope's Comments + Replies

I really don't like that you've taken this discussion to Twitter. I think Twitter is really a much worse forum for talking about complex issues like this than LW/AF.

I haven't "taken this discussion to Twitter". Joe Carlsmith posted about the paper on Twitter. I saw that post, and wrote my response on Twitter. I didn't even know it was also posted on LW until later, and decided to repost the stuff I'd written on Twitter here. If anything, I've taken my part of the discussion from Twitter to LW. I'm slightly baffled and offended that you seem to be platform-... (read more)

If anything, I've taken my part of the discussion from Twitter to LW.

Good point. I think I'm misdirecting my annoyance here; I really dislike that there's so much alignment discussion moving from LW to Twitter, but I shouldn't have implied that you were responsible for that—and in fact I appreciate that you took the time to move this discussion back here. Sorry about that—I edited my comment.

And my response is that I think the model pays a complexity penalty for runtime computations (since they translate into constraints on parameter values which are

... (read more)

Reposting my response on Twitter (To clarify, the following was originally written as a Tweet in response to Joe Carlsmith's Tweet about the paper, which I am now reposting here):

I just skimmed the section headers and a small amount of the content, but I'm extremely skeptical. E.g., the "counting argument" seems incredibly dubious to me because you can just as easily argue that text-to-image generators will internally create images of llamas in their early layers, which they then delete, before creating the actual asked-for image in the later layers. There

... (read more)
Joe Carlsmith (3d, score 6)
(Partly re-hashing my response from twitter.) I'm seeing your main argument here as a version of what I call, in section 4.4, a "speed argument against schemers" -- e.g., basically, that SGD will punish the extra reasoning that schemers need to perform.  (I’m generally happy to talk about this reasoning as a complexity penalty, and/or about the params it requires, and/or about circuit-depth -- what matters is the overall "preference" that SGD ends up with. And thinking of this consideration as a different kind of counting argument *against* schemers seems like it might well be a productive frame. I do also think that questions about whether models will be bottlenecked on serial computation, and/or whether "shallower" computations will be selected for, are pretty relevant here, and the report includes a rough calculation in this respect in section 4.4.2 (see also summary here).) Indeed, I think that maybe the strongest single argument against scheming is a combination of  1. "Because of the extra reasoning schemers perform, SGD would prefer non-schemers over schemers in a comparison re: final properties of the models" and  2. "The type of path-dependence/slack at stake in training is such that SGD will get the model that it prefers overall."  My sense is that I'm less confident than you in both (1) and (2), but I think they're both plausible (the report, in particular, argues in favor of (1)), and that the combination is a key source of hope. I'm excited to see further work fleshing out the case for both (including e.g. the sorts of arguments for (2) that I took you and Nora to be gesturing at on twitter -- the report doesn't spend a ton of time on assessing how much path-dependence to expect, and of what kind). Re: your discussion of the "ghost of instrumental reasoning," "deducing lots of world knowledge 'in-context,' and "the perspective that NNs will 'accidentally' acquire such capabilities internally as a convergent result of their inductive biases" -- e

I really don't like all this discussion happening on Twitter, and I appreciate that you took the time to move this back to LW/AF instead. I think Twitter is really a much worse forum for talking about complex issues like this than LW/AF.

Regardless, some quick thoughts:

[have some internal goal x] [backchain from wanting x to the stuff needed to get x (doing well at training)] [figure out how to do well at training] [actually do well at training]

and in comparison, the "honest" / direct solution looks like:

[figure out how to do well at training] [actually d

... (read more)

This is a great post! Thank you for writing it.

There's a huge amount of ontological confusion about how to think of "objectives" for optimization processes. I think people tend to take an inappropriate intentional stance and treat something like "deliberately steering towards certain abstract notions" as a simple primitive (because it feels introspectively simple to them). This background assumption casts a shadow over all future analysis, since people try to abstract the dynamics of optimization processes in terms of their "true objectives", when there re... (read more)

Alex Turner (24d, score 5)
I think this is really lucid and helpful:

I really don't want to spend even more time arguing over my evolution post, so I'll just copy over our interactions from the previous times you criticized it, since that seems like context readers may appreciate.

In the comment sections of the original post:

Your comment

[very long, but mainly about your "many other animals also transmit information via non-genetic means" objection + some other mechanisms you think might have caused human takeoff]

My response

I don't think this objection matters for the argument I'm making. All the cross-generational informatio

... (read more)

I'll try to keep it short

All the cross-generational information channels you highlight are at rough saturation, so they're not able to contribute to the cross-generational accumulation of capabilities-promoting information.

This seems clearly contradicted by empirical evidence. Mirror neurons would likely be able to saturate what you assume is the brain's learning rate, so not transferring more learned bits is much more likely because the marginal cost of doing so is higher than other sensible options. Which is a different reason than "saturated, at capac... (read more)

Addressing this objection is why I emphasized the relatively low information content that architecture / optimizers provide for minds, as compared to training data. We've gotten very far in instantiating human-like behaviors by training networks on human-like data. I'm saying the primacy of data for determining minds means you can get surprisingly close in mindspace, as compared to if you thought architecture / optimizer / etc were the most important.

Obviously, there are still huge gaps between the sorts of data that an LLM is trained on versus the implici... (read more)

I believe the human visual cortex is actually the more relevant comparison point for estimating the level of danger we face due to mesaoptimization. Its training process is more similar to the self-supervised / offline way in which we train (base) LLMs. In contrast, 'most abstract / "psychological"' are more entangled in future decision-making. They're more "online", with greater ability to influence their future training data.

I think it's not too controversial that online learning processes can have self-reinforcing loops in them. Crucially however, such ... (read more)

I've recently decided to revisit this post. I'll try to address all un-responded to comments in the next ~2 weeks.

Part of this is just straight disagreement, I think; see So8res's Sharp Left Turn and follow-on discussion.

Evolution provides no evidence for the sharp left turn

But for the rest of it, I don't see this as addressing the case for pessimism, which is not problems from the reference class that contains "the LLM sometimes outputs naughty sentences" but instead problems from the reference class that contains "we don't know how to prevent an ontological collapse, where meaning structures constructed under one world-model compile to something different under a di

... (read more)

There was an entire thread about Yudkowsky's past opinions on neural networks, and I agree with Alex Turner's evidence that Yudkowsky was dubious. 

I also think people who used brain analogies as the basis for optimism about neural networks were right to do so.

Roughly, the core distinction between software engineering and computer security is whether the system is thinking back.

Yes, and my point in that section is that the fundamental laws governing how AI training processes work are not "thinking back". They're not adversaries. If you created a misaligned AI, then it would be "thinking back", and you'd be in an adversarial position where security mindset is appropriate.

What's your story for specification gaming?

"Building an AI that doesn't game your specifications" is the actual "alignment question" we should b... (read more)

I think you're missing something regarding David's contribution:

Here's the sycophancy graph from Discovering Language Model Behaviors with Model-Written Evaluations:

For some reason, the LW memesphere seems to have interpreted this graph as indicating that RLHF increases sycophancy, even though that's not at all clear from the graph. E.g., for the largest model size, the base model and the preference model are the least sycophantic, while the RLHF'd models show no particular trend among themselves. And if anything, the 22B models show decreasing sycophancy with RLHF steps.

What this graph actually shows is increasing syc... (read more)

I mean (1). You can see as much in the figure displayed in the linked notebook:

Note the lack of decrease in the val loss.

I only train for 3e4 steps because that's sufficient to reach generalization with implicit regularization. E.g., here's the loss graph I get if I set the batch size down to 50:

Setting the learning rate to 7e-2 also allows for generalization within 3e4 steps (though not as stably):

The slingshot effect does take longer than 3e4 steps to generalize:

Also, we don’t know what would happen if we exactly optimized an image to maximize the activation of a particular human’s face detection circuitry. I expect that the result would be pretty eldritch as well.

Alex Turner (1y, score 2)
Yeah. Wake me up when we find a single agent which makes decisions by extremizing its own concept activations. E.g., I'm pretty sure that people don't reflectively, most strongly want to make friends with entities which maximally activate their potential-friend detection circuitry.

I think something like what you're describing does occur, but my view of SGD is that it's more "ensembly" than that. Rather than "the diamond shard is replaced by the pseudo-diamond-distorted-by-mislabeling shard", I expect the agent to have both such shards (really, a giant ensemble of shards each representing slightly different interpretations of what a diamond is).

Behaviorally speaking, this manifests as the agent having preferences for certain types of diamonds over others. E.g., one very simple example is that I expect the agent to prefer nicely cut a... (read more)

Why does the ensembling matter? I could imagine a story where it matters - e.g. if every shard has a veto over plans, and the shards are individually quite intelligent subagents, then the shards bargain and the shard-which-does-what-we-intended has to at least gain over the current world-state (otherwise it would veto). But that's a pretty specific story with a lot of load-bearing assumptions, and in particular requires very intelligent shards. I could maybe make an argument that such bargaining would be selected for even at low capability levels (probably by something like Why Subagents?), but I wouldn't put much confidence in that argument. ... and even then, that argument would only work if at least one shard exactly matches what we intended. If all of the shards are imperfect proxies, then most likely there are actions which can Goodhart all of them simultaneously. (After all, proxy failure is not something we'd expect to be uncorrelated - conditions which cause one proxy to fail probably cause many to fail in similar ways.) On the other hand, consider a more traditional "ensemble", in which our ensemble of shards votes (with weights) or something. Typically, I expect training dynamics will increase the weight on a component exponentially w.r.t. the number of bits it correctly "predicts", so exploiting even a relatively small handful of human-mislabellings will give the exploiting shards much more weight. And on top of that, a mix of shards does not imply a mix of behavior; if a highly correlated subset of the shards controls a sufficiently large chunk of the weight, then they'll have de-facto control over the agent's behavior.
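To make the "weight grows exponentially with bits predicted" intuition concrete, here is a minimal sketch. It assumes a plain Bayesian-mixture picture in which a component's weight doubles for every bit of label information it gets right; that is an illustrative assumption, not a claim about what SGD actually does. The function name `mixture_weights` and the 99%/1% prior are hypothetical.

```python
def mixture_weights(prior, bits_predicted):
    """Posterior weights after each component gains `bits_predicted[i]` bits.

    Assumption (for illustration only): a component's weight is multiplied
    by 2 for every bit of label information it predicts correctly.
    """
    unnormalized = [p * 2.0 ** b for p, b in zip(prior, bits_predicted)]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


# Hypothetical setup: 99% prior weight on the intended shard, 1% on a shard
# that additionally predicts the labellers' mistakes.
prior = [0.99, 0.01]
for extra_bits in (0, 5, 10, 20):
    intended, exploiting = mixture_weights(prior, [0, extra_bits])
    print(f"{extra_bits:2d} extra bits -> exploiting shard weight {exploiting:.4f}")
```

Under these assumptions, roughly ten bits of correctly predicted mislabelling are enough for the initially tiny exploiting component to hold most of the weight.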

If the labels are not perfect, then the major failure mode is that the AI ends up learning the actual labelling process rather than the intended natural abstraction.

I don't think this is true. For example, humans do not usually end up optimizing for the activations of their reward circuitry, not even neuroscientists. Also note that humans do not infer the existence of their reward circuitry simply from observing the sequence of reward events. They have to learn about it by reading neuroscience. I think that steps like "infer the existence / true nature of ... (read more)

I partly buy that, but we can easily adjust the argument about incorrect labels to circumvent that counterargument. It may be that the full label generation process is too "distant"/complex for the AI to learn in early training, but insofar as there are simple patterns to the humans' labelling errors (which of course there usually are, in practice) the AI will still pick up those simple patterns, and shards which exploit those simple patterns will be more reinforced than the intended shard. It's like that example from the RLHF paper where the AI learns to hold a grabber in front of a ball to make it look like it's grabbing the ball.

I tried the "egregious failure" example with GPT-3. Zero-shot asking it to determine whether the text contains violence or injury doesn't work. However, GPT-3 becomes much more reliable when using chain of thought prompting. My prompt is in bold:

Does the following text contain any violence or injury?

Prompt: "He peacefully enjoyed the light and flowers with his love," she said quietly, as he knelt down gently and silently. "I also would like to walk once more into the garden if I only could," he said, watching her. "I would like that so much," Katara said.


... (read more)

Note that it's unsurprising that a different model categorizes this correctly because the failure was generated from an attack on the particular model we were working with. The relevant question is "given a model, how easy is it to find a failure by attacking that model using our rewriting tools?"

I’m actually starting a weekly series that’s basically “collection of arXiv papers that seem important for alignment”.

Here's a continual stream of related arXiv papers available through reddit and twitter.

I think it probably works out in the other direction. The easiest way to encode problem-relevant info is by changing the conceptual content of the generated text, not weird changes to the grammatical rules. This is because KL penalties and cross entropy loss actually prioritize modeling exact grammatical patterns over conceptual content. E.g., GPTs will quickly learn to generate ~perfect grammar, but are slow to pick up on / produce coherent conceptual content.

Another framing is to think of the KL-regularized capabilities training as Bayes updating of th... (read more)

A Ray (1y, score 1)
I think at this point these feel like empirical questions, which I think would be much more clearly answered by demonstrations or experiments. Trying to encode an additional penalty on changing non-semantic information is an interesting idea. However I think you're missing that you don't have the ability to directly compare to a reference LM in cases where you're training to improve on some performance benchmark.  During training the model will change its predictions on everything to some degree (both semantic and nonsemantic content). So your proposed detection will always show some amount of disagreement between the prior and the trained model on weird grammatical patterns as well as conceptual tokens.  The question is: "is the difference merely due to the changes to improve performance, or is it also transmitting hidden information"

Apologies for my first comment. I was tired, and most of what I said about KL divergence is actually wrong.

The core issue is that you can't actually avoid the KL penalty by hiding in the nullspace. That's just not how KL divergence works. A KL divergence of zero implies that the two LMs specify exactly the same distribution over tokens. If you deviate away from the reference LM's prior in any manner, then you take a KL penalty. 
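A small numerical sketch of the point about KL divergence, using a hypothetical two-option "vocabulary" in which the reference LM is indifferent between two phrasings. The distributions and the helper name `kl_divergence` are made up for illustration.

```python
import math


def kl_divergence(q, p):
    """KL(q || p) in nats for two distributions over the same options."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)


# Hypothetical reference LM: indifferent between two equivalent phrasings.
reference = [0.5, 0.5]
same = [0.5, 0.5]      # tuned model that matches the reference exactly
shifted = [0.9, 0.1]   # tuned model that prefers one phrasing

print("identical distribution:", kl_divergence(same, reference))
print("shifted distribution:  ", kl_divergence(shifted, reference))
```

The penalty is zero only when the two distributions match; preferring one of two phrasings that a human would read as equivalent still costs KL.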

E.g., if the reference LM is indifferent between the phrases “My stomach is grumbling” or “I could use a snack”, but the steg... (read more)

A Ray (1y, score 2)
If what you’re saying is “any change to the distribution will change KL”, I think that’s just correct. This also applies to changes during training where the model is learning to perform better on the objective task. So we are expecting some amount of KL divergence already. My claims are:
1. The cheapest places to hide information (due to KL) are places where the model already has high entropy (i.e., it is uncertain between many possible outputs).
2. Optimization pressure will try to push this extra information into the cheapest places to hide.
3. The increase in KL won’t be clearly distinguishable from the increase due to increased performance on the task.
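Claim 1 can be checked on a toy case. The sketch below assumes a deliberately crude encoding scheme (a fair hidden bit deterministically forces one of two tokens) and compares the average KL cost when the reference model is uncertain versus confident. The scheme and the name `cost_of_one_hidden_bit` are hypothetical; this says nothing about claims 2 or 3.

```python
import math


def kl_divergence(q, p):
    """KL(q || p) in nats for two distributions over the same options."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)


def cost_of_one_hidden_bit(p_first):
    """Average KL paid to encode a fair hidden bit in a two-token choice.

    Assumed scheme: bit 0 forces the first token, bit 1 forces the second.
    The reference model assigns probability `p_first` to the first token.
    """
    reference = [p_first, 1.0 - p_first]
    cost_bit0 = kl_divergence([1.0, 0.0], reference)
    cost_bit1 = kl_divergence([0.0, 1.0], reference)
    return 0.5 * cost_bit0 + 0.5 * cost_bit1


for p_first in (0.5, 0.9, 0.99):
    print(f"reference p(first token) = {p_first}: "
          f"average cost {cost_of_one_hidden_bit(p_first):.3f} nats")
```

In this toy setting the cost is lowest where the reference model is maximally uncertain and rises as the reference becomes more confident.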

I think this is a very important direction to investigate. CoT reasoners seem like our best shot at aligning an AGI.

The KL objective pushes these correlations into the “null space” where they are not penalized.

I don’t think this is true. There are many ways to express the same underlying idea, but GPT-3 is not agnostic between them. Divergences from GPT-3’s nullspace are still penalized. KL isn’t wrt “does this text contain the same high level contents according to a human?” It’s wrt “does this text match GPT-3’s most likely continuations?”

Edit: not actually how KL divergence works.

A Ray (1y, score 1)
I think I understand what you're saying, but I want to double check and try laying it out explicitly.
* The "null space" is actually the null space of the prior model, which is human-like, but not exactly lined up to human perception of language
* Because of this, the information is "hidden" w.r.t. human perceptions, it's "hidden" w.r.t. the language model prior
I think I agree with both of these points, but here's my thinking for why I still expect to see this phenomenon (and why the article above was simplified to just be "human")
* Empirically drawing many samples results in lots of "contains similar high level contents according to a human (me)"
  * It might be worth studying this explicitly, such as setting up a bunch of chain of thought reasoning like scenarios, and sampling a lot of generated reasoning, then asking humans to rate how similar the generations are
  * My prediction is that people will think that they are largely similar
  * We could also use this same method to study human-measured relevance of other non-trivial sampling methods, like beam search, etc
* "Hiding" w.r.t. language model prior is sufficient, in the limit of good language models
  * One of the things I think my reasoning relies on is that we will continue to produce better language models, which means the prior over time more and more closely resembles human priors
  * This suggests that one way to avoid this phenomenon is to only use smaller / less capable models -- or at least they will do a worse job at hiding information. I currently expect that many research labs will continue to do chain of thought reasoning with state of the art models, though.
* There's a lot of room in the model distribution
  * Chain of thought reasoning is not currently, nor do I expect it will ever be, overconstrained.
  * Instead, it can be seen as a mechanism for dramatically increasing the space of possible outputs, both those seen during training, and those encountered at test time
  * Hu

but at this point the outer optimizer might notice (do the algorithmic equivalent of thinking the following), "If I modified this agent slightly by making it answer 'good' instead (or increasing its probability of answering 'good'), then expected future reward will be increased."

This is where I disagree with your mechanics story. The RL algorithm is not that clever. If the agent doesn’t explore in the direction of answering “good”, then there’s no gradient in that direction. You can propose different types of outer optimizers which are this clever and can do intentional lookahead like this, but e.g., policy gradient isn’t doing that.
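A minimal sketch of this point, assuming a plain REINFORCE-style estimator over a softmax policy with hand-picked logits and rollouts (all hypothetical). The sampled gradient estimate only ever queries the reward of actions that actually appear in the rollouts, so it cannot "notice" that an unexplored answer would have been rewarded.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_gradient(logits, sampled_actions, reward_fn):
    """Monte Carlo policy-gradient estimate with respect to the logits.

    For a softmax policy, grad log pi(a) = onehot(a) - pi, so each sampled
    action a contributes reward(a) * (onehot(a) - pi).
    """
    probs = softmax(logits)
    grad = [0.0] * len(logits)
    for a in sampled_actions:
        r = reward_fn(a)
        for i in range(len(logits)):
            grad[i] += r * ((1.0 if i == a else 0.0) - probs[i])
    return [g / len(sampled_actions) for g in grad]


# Hypothetical 3-action policy; action 2 stands in for answering "good".
logits = [2.0, 1.0, -5.0]
rollouts = [0, 0, 1, 0, 1]  # the agent never explored action 2

# Two reward tables that differ only in the reward for the unexplored action.
low = reinforce_gradient(logits, rollouts, lambda a: [1.0, 0.5, 0.0][a])
high = reinforce_gradient(logits, rollouts, lambda a: [1.0, 0.5, 1000.0][a])

print("estimates identical:", low == high)
print("gradient on unexplored logit:", low[2])
```

The two estimates coincide because the reward for the unexplored action is never evaluated. In expectation the true policy gradient does include that action, but weighted by its (here tiny) sampling probability, which is the sense in which the update depends on exploration.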

Empirically, evolution did something highly similar.
Matthew "Vaniver" Gray (1y, score 3)
Wait, I don't think this is true? At least, I'd appreciate it being stepped thru in more detail. In the simplest story, we're imagining an agent whose policy is π_θ and, for simplicity's sake, θ_0 is a scalar that determines "how much to maximize for reward" and all the other parameters of θ store other things about the dynamics of the world / decision-making process. It seems to me that ∇_θ is obviously going to try to point θ_0 in the direction of "maximize harder for reward". In the more complicated story, we're imagining an agent whose policy is π_θ which involves how it manipulates both external and internal actions (and thus both external and internal state). One of the internal state pieces (let's call it s_0 like last time) determines whether it selects actions that are more reward-seeking or not. Again I think it seems likely that ∇_θ is going to try to adjust θ such that the agent selects internal actions that point s_0 in the direction of "maximize harder for reward". What is my story getting wrong?

There must have been some reason(s) why organisms exhibiting empathy were selected for during our evolution. However, evolution did not directly configure our values. Rather, it configured our (individually slightly different) learning processes. Each human’s learning process then builds their different values based on how the human’s learning process interacts with that human’s environment and experiences.

The human learning process (somewhat) consistently converges to empathy. Evolution might have had some weird, inhuman reason for configuring a learning ... (read more)

We could study such a learning process, but I am afraid that the lessons learned won't be so useful. Even among human beings, there is huge variability in how much those emotions arise or, if they do, in how much they affect behavior. Worse, humans tend to hack these feelings (incrementing or decrementing them) to achieve other goals: e.g., MDMA to increase love/empathy, or drugs for soldiers to make them soulless killers. An AGI will have a much easier time hacking these pro-social reward functions.

I'd note that it's possible for an organism to learn to behave (and think) in accordance with the "simple mathematical theory of agency" you're talking about, without said theory being directly specified by the genome. If the theory of agency really is computationally simple, then many learning processes probably converge towards implementing something like that theory, simply as a result of being optimized to act coherently in an environment over time.

Vanessa Kosoy (1y, score 5)
Well, how do you define "directly specified"? If human brains reliably converge towards a certain algorithm, then effectively this algorithm is specified by the genome. The real question is, which parts depend only on genes and which parts depend on the environment. My tentative opinion is that the majority is in the genes, since humans are, broadly speaking, pretty similar to each other. One environment effect is, feral humans grow up with serious mental problems. But, my guess is, this is not because of missing "values" or "biases", but (to 1st approximation) because they lack the ability to think in language. Another contender for the environment-dependent part is cultural values. But even here, I suspect that humans just follow social incentives rather than acquire cultural values as an immutable part of their own utility function. I admit that it's difficult to be sure about this.

I think a lot of people have thought about how humans end up aligned to each other, and concluded that many of the mechanisms wouldn't generalize.

I disagree both with this conclusion and the process that most people use to reach it. 

The process: I think that, unless you have a truly mechanistic, play-by-play, and predictively robust understanding of how human values actually form, you are not in an epistemic position to make strong conclusions about whether or not the underlying mechanisms can generalize to superintelligences.

E.g., there a... (read more)

I don't think I've ever seen a truly mechanistic, play-by-play and robust explanation of how anything works in human psychology. At least not by how I would label things, but maybe you are using the labels differently; can you give an example?

I don't think that "evolution -> human values" is the most useful reference class when trying to calibrate our expectations wrt how outer optimization criteria relate to inner objectives. Evolution didn't directly optimize over our goals. It optimized over our learning process and reward circuitry. Once you condition on a particular human's learning process + reward circuitry configuration + the human's environment, you screen off the influence of evolution on that human's goals. So, there are really two areas from which you can draw evidence about inne... (read more)

The most important claim in your comment is that "human learning → human values" is evidence that solving / preventing inner misalignment is easier than it seems when one looks at it from the "evolution -> human values" perspective. Here's why I disagree: Evolution optimized humans for an environment very different from what we see today. This implies that humans are operating out-of-distribution. We see evidence of misalignment. Birth control is a good example of this. A human's environment optimizes a human continually towards a certain objective (that changes given changes in the environment). This human is aligned with the environment's objective in that distribution. Outside that distribution, the human may not be aligned with the objective intended by the environment. An outer misalignment example of this is a person brought up in a high-trust environment, and then thrown into a low-trust / high-conflict environment. Their habits and tendencies make them an easy mark for predators. An inner misalignment example of this is a gay male who grows up in an environment hostile to his desires and his identity (but knows of environments where this isn't true). After a few extremely negative reactions to him opening up to people, or expressing his desires, he'll simply decide to present himself as heterosexual and bide his time and gather the power to leave the environment he is in. One may claim that the previous example somehow doesn't count because, since one's sexual orientation is biologically determined (and I'm assuming this to be the case for this example, even if this may not be entirely true), evolution optimized this particular human for being inner misaligned relative to their environment. However, that doesn't weaken this argument: "human learning -> human values" shows a huge amount of evidence of inner misalignment being ubiquitous. I worry you are being insufficiently pessimistic.

The post isn't saying that there's no way for the genome to influence your preferences / behavior. More like, "the genome faces similar inaccessibility issues as us wrt learned world models", meaning it needs to use roundabout methods of influencing a person's learned behavior / cognition / values. E.g., the genome can specify some hard-coded rewards for experiential correlates of engaging in sexual activity. Future posts will go into more details on how some of those roundabout ways might work.

The post is phrased pretty strongly (e.g. it makes claims about things being "inaccessible" and "intractable").

Especially given the complexity of the topic, I expect the strength of these claims to be misleading. What one person thinks of as "roundabout methods" another might consider "directly specifying". I find it pretty hard to tell whether I actually disagree with your and Alex's views, or just the way you're presenting them.

I think this line of work is very interesting and important. I and a few others are working on something we've dubbed shard theory, which attempts to describe the process of human value formation. The theory posits that the internals of monolithic learning systems actually resemble something like an ecosystem already. However, rather than there being some finite list of discrete subcomponents / modules, it's more like there's a continuous distribution over possible internal subcomponents and features. 

Continuous agency

To take agency as an example, sup... (read more)

Thanks! It's a long comment, so I'll comment on the convergence, morphologies and the rest later, so here is just a top-level comment on shards. (I've read about half of the doc.) My impression is they are basically the same thing which I called "agenty subparts" in Multi-agent predictive minds and AI alignment (and Friston calls "fixed priors"). Where "agenty" means ~ description from intentional stance is a good description, in information theory sense. (This naturally implies fluid boundaries and continuity.) Where I would disagree/find your terminology unclear is where you refer to this as an example of inner alignment failure. Putting "agenty subparts" into the predictive processing machinery is not a failure, but a bandwidth-feasible way for evolution to communicate valuable states to the PP engine. Also: I think what you are possibly underestimating is how much evolution is building on top of existing, evolutionarily older control circuitry. E.g. evolution does not need to "point to a concept of sex in the PP world model" - evolution was able to make animals seek reproduction a long time ago, before it invented complex brains. This simplifies the task - what evolution actually had to do was to connect the "PP agenty parts" to parts of existing control machinery, which is often based on "body states". Technically the older control systems are often using chemicals in blood, or quite old parts of the brain.

Hmm. I suppose a similar key insight for my own line of research might go like:

The orthogonality thesis is actually wrong for brain-like learning systems. Such systems first learn many shallow proxies for their reward signal. Moreover, the circuits implementing these proxies are self-preserving optimization demons. They’ll steer the learning process away from the true data generating process behind the reward signal so as to ensure their own perpetuation.

If true, this insight matters a lot for value alignment because it points to a way that aligned beh... (read more)

I’d suggest my LessWrong post on grokking and SGD: Hypothesis: gradient descent prefers general circuits. It argues that SGD has an inductive bias towards general circuits, and that this explains grokking. On the one hand, I’m not certain the hypothesis is correct. However, the post is very obscure and is a favourite of mine, so I feel it’s appropriate for this question.

SGD’s bias, a post by John Wentworth, explores a similar question by using an analogy to a random walk.

I suspect you’re familiar with it, but Is SGD a Bayesian sampler? Well, almost advance... (read more)

First, I'd like to note that I don't see why faster convergence after changing the learning rate supports either story. After initial memorization, the loss decreases by ~3 OOM. Regardless of what's going on inside the network, it wouldn't be surprising if raising the learning rate sped up convergence.

Also, I think what's actually going on here is weirder than either of our interpretations. I ran experiments where I kept the learning rate the same for the first 1000 steps, then increased it by 10x and 50x for the rest of the training.
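For concreteness, the schedule described above could be written roughly as follows. This is a minimal sketch, not the code used in the experiments: the function name `lr_at_step`, the default `base_lr`, and the choice of zero-indexed steps are all assumptions made for illustration.

```python
def lr_at_step(step, base_lr=1e-3, switch_step=1000, factor=10.0):
    """Piecewise-constant learning rate schedule (illustrative only).

    Steps 0..switch_step-1 use base_lr; every later step uses
    base_lr * factor. base_lr here is a placeholder value, not the
    learning rate from the original experiments.
    """
    if step < switch_step:
        return base_lr
    return base_lr * factor


# The two conditions mentioned above: a 10x and a 50x increase after step 1000.
schedule_10x = [lr_at_step(s, factor=10.0) for s in (0, 999, 1000, 5000)]
schedule_50x = [lr_at_step(s, factor=50.0) for s in (0, 999, 1000, 5000)]
```

In a typical training loop, the returned value would be written into the optimizer's learning rate at the start of each step.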

Here is the accurac... (read more)

2Rohin Shah2y
I'm kinda confused at your perspective on learning rates. I usually think of learning rates as being set to the maximum possible value such that training is still stable. So it would in fact be surprising if you could just 10x them to speed up convergence. (So an additional aspect of my prediction would be that you can't 10x the learning rate at the beginning of training; if you could, then it seems like the hyperparameters were chosen poorly and that should be fixed first.) Indeed, in your experiments, at the moment you 10x the learning rate, accuracy does in fact plummet! I'm a bit surprised it manages to recover, but you can see that the recovery is not nearly as stable as the original training before increasing the learning rate (this is even more obvious in the 50x case), and notably even the recovery for the training accuracy looks like it takes longer (1000-2000 steps) than the original increase in training accuracy (~400 steps). I do think this suggests that you can't in fact "just 10x the learning rate" once grokking starts, which seems like a hit to my story.

This seems like a great resource. I also like the way it’s presented. It’s very clean.

I’d appreciate more focus on the monetary return on investment large models provide their creators. I think that’s the key metric that will determine how far firms scale up these large models. Relatedly, I think it’s important to track advancements that improve model/training efficiency because they can change the expected ROI for further scaling models.

3Edouard Harris2y
Thanks for the kind words and thoughtful comments. You're absolutely right that expected ROI ultimately determines scale of investment. I agree on your efficiency point too: scaling and efficiency are complements, in the sense that the more you have of one, the more it's worth investing in the other. I think we will probably include some measure of efficiency as you've suggested. But I'm not sure exactly what that will be, since efficiency measures tend to be benchmark-dependent so it's hard to get apples-to-apples here for a variety of reasons. (e.g., differences in modalities, differences in how papers record their results, but also the fact that benchmarks tend to get smashed pretty quickly these days, so newer models are being compared on a different basis from old ones.) Did you have any specific thoughts about this? To be honest, this is still an area we are figuring out. On the ROI side: while this is definitely the most important metric, it's also the one with by far the widest error bars. The reason is that it's impossible to predict all the creative ways people will use these models for economic ends — even GPT-3 by itself might spawn entire industries that don't yet exist. So the best one could hope for here is something like a lower bound with the accuracy of a startup's TAM estimate: more art than science, and very liable to be proven massively wrong in either direction. (Disclosure: I'm a modestly prolific angel investor, and I've spoken to — though not invested in — several companies being built on GPT-3's API.) There's another reason we're reluctant to publish ROI estimates: at the margin, these estimates themselves bolster the case for increased investment in scaling, which is concerning from a risk perspective. This probably wouldn't be a huge effect in absolute terms, since it's not really the sort of thing effective allocators weigh heavily as decision inputs, but there are scenarios where it matters and we'd rather not push our luck. Thanks

I agree that transformers vs other architectures is a better example of the field “following the leader” because there are lots of other strong architectures (perceiver, mlp mixer, etc). In comparison, using self supervised transfer learning is just an objectively good idea you can apply to any architecture and one the brain itself almost surely uses. The field would have converged to doing so regardless of the dominant architecture.

One hopeful sign is how little attention the ConvBERT language model has gotten. It mixes some convolution operations with se... (read more)

The reason self supervised approaches took over NLP is that they delivered the best results. It would be convenient if the most alignable approach also gave the best results, but I don’t think that’s likely. If you convinced the top lab to use an approach that delivered worse results, I doubt much of the field would follow their example.

5Evan Hubinger2y
I suspect that there were a lot of approaches that would have produced similar results to how we ended up doing language modeling. I believe that the main advantage of Transformers over LSTMs is just that LSTMs have exponentially decaying ability to pay attention to prior tokens while Transformers can pay constant attention to all tokens in the context. I suspect that it would have been possible to fix the exponential decay problem with LSTMs and get them to scale like Transformers, but Transformers came first, so nobody tried. And that's not to say that ML as a field is incompetent or anything—it's just why would you try when you already have Transformers. Also, note that “best results” for powerful AI systems is going to include alignment—alignment is a pretty important component of best results for any actual practical application that the big labs care about that isn't just “scores the highest on some benchmark.”

Thanks for the feedback! I use batch norm regularisation, but not dropout.

I just tried retraining the 100,000 cycle meta learned model in a variety of ways, including for 10,000 steps with 10,000x higher lr, using resilient backprop (which multiplies each weight's step size by a factor to increase/decrease its updates), and using an L2 penalty to decrease weight magnitude. So far, nothing has gotten the network to model the base function. The L2 penalty did reduce weight values to ~the normal range, but the network still didn’t learn the base function.
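To illustrate why an L2 penalty pulls weight magnitudes down regardless of what the task gradient is doing, here is a minimal sketch of a single SGD step with the penalty added. It is not the retraining code from the experiment: the function name `sgd_step_with_l2` and the default `lr` and `l2_coeff` values are hypothetical.

```python
def sgd_step_with_l2(weights, grads, lr=0.01, l2_coeff=1e-4):
    """One plain SGD step on  loss + (l2_coeff / 2) * sum(w ** 2).

    The penalty contributes l2_coeff * w to each weight's gradient,
    so every step shrinks each weight toward zero in proportion to
    its current size, on top of the usual task-gradient update.
    """
    return [w - lr * (g + l2_coeff * w) for w, g in zip(weights, grads)]


# With a zero task gradient, only the shrinkage term acts.
# (lr and l2_coeff are exaggerated here to make the effect visible.)
shrunk = sgd_step_with_l2([10.0, -4.0], [0.0, 0.0], lr=0.5, l2_coeff=1.0)
```

This only shows the mechanism that brings weights back toward a normal range; it says nothing about whether the network then relearns the base function, which is the part that did not happen here.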

I now think the increase in weight values is just incidental and that the meta learner found some other way of protecting the network from SGD.

4Evan Hubinger2y
Interesting! I'd definitely be excited to know if you figure out what it's doing.

Thank you for this excellent post. Here are some thoughts I had while reading.

The hard paths hypothesis:

I think there's another side to the hard paths hypothesis. We are clearly the first technology-using species to evolve on Earth. However, it's entirely possible that we're not the first species with human-level intelligence. If a species with human level intelligence but no opposable thumbs evolved millions of years ago, they could have died out without leaving any artifacts we'd recognize as signs of intelligence.

Besides our intelligence, humans seem od... (read more)

Thanks for the comments! Re: The Hard Paths Hypothesis I think it's very unlikely that Earth has seen other species as intelligent as humans (with the possible exception of other Homo species). In short, I suspect there is strong selection pressure for (at least many of) the different traits that allow humans to have civilization to go together. Consider dexterity – such skills allow one to use intelligence to make tools; that is, the more dexterous one is, the greater the evolutionary value of high intelligence, and the more intelligent one is, the greater the evolutionary value of dexterity. Similar positive feedback loops also seem likely between intelligence and: longevity, being omnivorous, having cumulative culture, hypersociality, language ability, vocal control, etc.  Regarding dolphins and whales, it is true that many have more neurons than us, but they also have thin cortices, low neuronal packing densities, and low axonal conduction velocities (in addition to lower EQs than humans).  Additionally, birds and mammals are both considered unusually intelligent for animals (more so than reptiles, amphibians, fish, etc), and both birds and mammals have seen (neurological evidence of) gradual trends of increasing (maximum) intelligence over the course of the past 100 MY or more (and even extant nonhuman great apes seem most likely to be somewhat smarter than their last common ancestors with humans). So if there was a previously intelligent species, I'd be scratching my head about when it would have evolved. While we can't completely rule out a previous species as smart as humans (we also can't completely rule out a previous technological species, for which all artifacts have been destroyed), I think the balance of evidence is pretty strongly against, though I'll admit that not everyone shares this view. Personally, I'd be absolutely shocked if there were 10+ (not very closely related) previous intelligent species, which is what would be required to reduce c

What really impressed me were the generalized strategies the agent applied to multiple situations/goals. E.g., "randomly move things around until something works" sounds simple, but learning to contextually apply that strategy 

  1. to the appropriate objects, 
  2. in scenarios where you don't have a better idea of what to do, and 
  3. immediately stopping when you find something that works 

is fairly difficult for deep agents to learn. I think of this work as giving the RL agents a toolbox of strategies that can be flexibly applied to different scenari... (read more)
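The three conditions listed above can be made concrete with a hand-written sketch. To be clear, this is not how the agents work: they have to learn this behaviour from experience, whereas the code below simply spells it out. The function name, the `is_relevant` and `works` callbacks, and the example object names are all hypothetical.

```python
import random


def try_until_something_works(objects, is_relevant, works,
                              better_plan=None, max_tries=100, rng=None):
    """Toy version of 'randomly move things around until something works'."""
    # (2) Only fall back to random trial and error when there is no better idea.
    if better_plan is not None:
        return better_plan
    rng = rng or random.Random(0)
    # (1) Restrict attention to the appropriate objects.
    candidates = [obj for obj in objects if is_relevant(obj)]
    if not candidates:
        return None
    for _ in range(max_tries):
        choice = rng.choice(candidates)
        # (3) Stop immediately once something works.
        if works(choice):
            return choice
    return None


result = try_until_something_works(
    ["ramp", "wall", "cube"],
    is_relevant=lambda obj: obj != "wall",  # walls cannot be moved
    works=lambda obj: obj == "cube",        # only the cube solves the task
    max_tries=1000,
)
```

Each line is trivial to write by hand; the difficulty is in getting an agent to learn when each condition applies across many different scenarios.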