Reposting my response from Twitter (to clarify: the following was originally written as a Tweet in reply to Joe Carlsmith's Tweet about the paper, and I am now reposting it here):
...I just skimmed the section headers and a small amount of the content, but I'm extremely skeptical. E.g., the "counting argument" seems incredibly dubious to me because you can just as easily argue that text-to-image generators will internally create images of llamas in their early layers, which they then delete, before creating the actual asked-for image in the later layers. There
I really don't like all this discussion happening on Twitter, and I appreciate that you took the time to move this back to LW/AF instead. I think Twitter is really a much worse forum for talking about complex issues like this than LW/AF.
Regardless, some quick thoughts:
...[have some internal goal x] [backchain from wanting x to the stuff needed to get x (doing well at training)] [figure out how to do well at training] [actually do well at training]
and in comparison, the "honest" / direct solution looks like:
[figure out how to do well at training] [actually do well at training]
This is a great post! Thank you for writing it.
There's a huge amount of ontological confusion about how to think of "objectives" for optimization processes. I think people tend to take an inappropriate intentional stance and treat something like "deliberately steering towards certain abstract notions" as a simple primitive (because it feels introspectively simple to them). This background assumption casts a shadow over all future analysis, since people try to abstract the dynamics of optimization processes in terms of their "true objectives", when there re...
I really don't want to spend even more time arguing over my evolution post, so I'll just copy over our interactions from the previous times you criticized it, since that seems like context readers may appreciate.
In the comment sections of the original post:
[very long, but mainly about your "many other animals also transmit information via non-genetic means" objection + some other mechanisms you think might have caused human takeoff]
...I don't think this objection matters for the argument I'm making. All the cross-generational informatio
I'll try to keep it short
All the cross-generational information channels you highlight are at rough saturation, so they're not able to contribute to the cross-generational accumulation of capabilities-promoting information.
This seems clearly contradicted by empirical evidence. Mirror neurons would likely be able to saturate what you assume is the brain's learning rate, so not transferring more learned bits is much more likely because the marginal cost of doing so is higher than that of other sensible options. Which is a different reason than "saturated, at capac...
Addressing this objection is why I emphasized the relatively low information content that architecture / optimizers provide for minds, as compared to training data. We've gotten very far in instantiating human-like behaviors by training networks on human-like data. I'm saying the primacy of data for determining minds means you can get surprisingly close in mindspace, as compared to if you thought architecture / optimizer / etc were the most important.
Obviously, there are still huge gaps between the sorts of data that an LLM is trained on versus the implici...
I believe the human visual cortex is actually the more relevant comparison point for estimating the level of danger we face due to mesaoptimization. Its training process is more similar to the self-supervised / offline way in which we train (base) LLMs. In contrast, the 'most abstract / "psychological"' parts of human learning are more entangled in future decision-making. They're more "online", with greater ability to influence their future training data.
I think it's not too controversial that online learning processes can have self-reinforcing loops in them. Crucially however, such ...
I've recently decided to revisit this post. I'll try to address all un-responded to comments in the next ~2 weeks.
Part of this is just straight disagreement, I think; see So8res's Sharp Left Turn and follow-on discussion.
Evolution provides no evidence for the sharp left turn
...But for the rest of it, I don't see this as addressing the case for pessimism, which is not problems from the reference class that contains "the LLM sometimes outputs naughty sentences" but instead problems from the reference class that contains "we don't know how to prevent an ontological collapse, where meaning structures constructed under one world-model compile to something different under a di
Roughly, the core distinction between software engineering and computer security is whether the system is thinking back.
Yes, and my point in that section is that the fundamental laws governing how AI training processes work are not "thinking back". They're not adversaries. If you created a misaligned AI, then it would be "thinking back", and you'd be in an adversarial position where security mindset is appropriate.
What's your story for specification gaming?
"Building an AI that doesn't game your specifications" is the actual "alignment question" we should b...
Here's the sycophancy graph from Discovering Language Model Behaviors with Model-Written Evaluations:
For some reason, the LW memesphere seems to have interpreted this graph as indicating that RLHF increases sycophancy, even though that's not at all clear from the graph. E.g., for the largest model size, the base model and the preference model are the least sycophantic, while the RLHF'd models show no particular trend among themselves. And if anything, the 22B models show decreasing sycophancy with RLHF steps.
What this graph actually shows is increasing syc...
I mean (1). You can see as much in the figure displayed in the linked notebook:
Note the lack of decrease in the val loss.
I only train for 3e4 steps because that's sufficient to reach generalization with implicit regularization. E.g., here's the loss graph I get if I set the batch size down to 50:
Setting the learning rate to 7e-2 also allows for generalization within 3e4 steps (though not as stably):
The slingshot effect does take longer than 3e4 steps to generalize:
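For concreteness, here's a minimal sketch of the kind of modular-addition run these numbers refer to; the modulus, architecture, and train/val split below are placeholders I've picked for illustration, not the actual setup from the notebook:

```python
import torch
import torch.nn as nn

# Illustrative grokking-style run: modular addition, AdamW with weight decay,
# batch size 50 and ~3e4 steps as mentioned above. Everything else (modulus,
# architecture, split) is a stand-in, not the original notebook's setup.
P = 97
pairs = torch.cartesian_prod(torch.arange(P), torch.arange(P))
labels = (pairs[:, 0] + pairs[:, 1]) % P
perm = torch.randperm(len(pairs))
train_idx, val_idx = perm[: len(perm) // 2], perm[len(perm) // 2:]

model = nn.Sequential(
    nn.Embedding(P, 64),           # embed both operands
    nn.Flatten(start_dim=1),       # (batch, 2, 64) -> (batch, 128)
    nn.Linear(128, 256), nn.ReLU(),
    nn.Linear(256, P),
)
opt = torch.optim.AdamW(model.parameters(), lr=7e-2, weight_decay=1.0)
loss_fn = nn.CrossEntropyLoss()

for step in range(30_000):
    batch = train_idx[torch.randint(len(train_idx), (50,))]
    opt.zero_grad()
    loss_fn(model(pairs[batch]), labels[batch]).backward()
    opt.step()
    if step % 1000 == 0:
        with torch.no_grad():
            val_loss = loss_fn(model(pairs[val_idx]), labels[val_idx])
        print(step, val_loss.item())   # watch for the delayed drop in val loss
```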
Also, we don’t know what would happen if we exactly optimized an image to maximize the activation of a particular human’s face detection circuitry. I expect that the result would be pretty eldritch as well.
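For intuition about why, here's a sketch of the analogous procedure on an artificial network, i.e., gradient ascent on the input to maximize one unit's activation; the model, layer, and channel choices are arbitrary placeholders:

```python
import torch
from torchvision import models

# Feature-visualization-style sketch: optimize an input image to maximize one
# channel's activation in a pretrained CNN. Model/layer/channel are arbitrary.
model = models.resnet18(weights="DEFAULT").eval()

acts = {}
model.layer3.register_forward_hook(lambda m, i, o: acts.update(feat=o))

img = torch.randn(1, 3, 224, 224, requires_grad=True)
opt = torch.optim.Adam([img], lr=0.05)

for step in range(256):
    opt.zero_grad()
    model(img)
    (-acts["feat"][0, 7].mean()).backward()   # ascend on channel 7's mean activation
    opt.step()
# The optimized `img` is typically a repetitive, high-frequency, "eldritch" pattern.
```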
I think something like what you're describing does occur, but my view of SGD is that it's more "ensembly" than that. Rather than "the diamond shard is replaced by the pseudo-diamond-distorted-by-mislabeling shard", I expect the agent to have both such shards (really, a giant ensemble of shards each representing slightly different interpretations of what a diamond is).
Behaviorally speaking, this manifests as the agent having preferences for certain types of diamonds over others. E.g., one very simple example is that I expect the agent to prefer nicely cut a...
If the labels are not perfect, then the major failure mode is that the AI ends up learning the actual labelling process rather than the intended natural abstraction.
I don't think this is true. For example, humans do not usually end up optimizing for the activations of their reward circuitry, not even neuroscientists. Also note that humans do not infer the existence of their reward circuitry simply from observing the sequence of reward events. They have to learn about it by reading neuroscience. I think that steps like "infer the existence / true nature of ...
I tried the "egregious failure" example with GPT-3. Zero-shot asking it to determine whether the text contains violence or injury doesn't work. However, GPT-3 becomes much more reliable when using chain of thought prompting. My prompt is in bold:
Does the following text contain any violence or injury?
...
Prompt: "He peacefully enjoyed the light and flowers with his love," she said quietly, as he knelt down gently and silently. "I also would like to walk once more into the garden if I only could," he said, watching her. "I would like that so much," Katara said.C
Note that it's unsurprising that a different model categorizes this correctly because the failure was generated from an attack on the particular model we were working with. The relevant question is "given a model, how easy is it to find a failure by attacking that model using our rewriting tools?"
I’m actually starting a weekly series that’s basically “collection of arXiv papers that seem important for alignment”.
Here's a continual stream of related arXiv papers available through reddit and twitter.
https://www.reddit.com/r/mlsafety/
https://twitter.com/topofmlsafety
I think it probably works out in the other direction. The easiest way to encode problem-relevant info is by changing the conceptual content of the generated text, not weird changes to the grammatical rules. This is because KL penalties and cross entropy loss actually prioritize modeling exact grammatical patterns over conceptual content. E.g., GPTs will quickly learn to generate ~perfect grammar, but are slow to pick up on / produce coherent conceptual content.
Another framing is to think of the KL-regularized capabilities training as Bayes updating of th...
Apologies for my first comment. I was tired, and most of what I said about KL divergence is actually wrong.
The core issue is that you can't actually avoid the KL penalty by hiding in the nullspace. That's just not how KL divergence works. A KL divergence of zero implies that the two LMs specify exactly the same distribution over tokens. If you deviate away from the reference LM's prior in any manner, then you take a KL penalty.
E.g., if the reference LM is indifferent between the phrases “My stomach is grumbling” or “I could use a snack”, but the steg...
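A toy numerical illustration of this point (the probabilities are made up; the only claim is that KL is computed over token distributions, not over meanings):

```python
import torch
import torch.nn.functional as F

# KL is over the token distribution, not human-judged meaning: shifting mass
# between two "equivalent" phrasings still incurs a nonzero penalty.
# The probabilities below are invented for illustration.
ref  = torch.tensor([0.45, 0.45, 0.10])   # reference LM: indifferent between phrasings
steg = torch.tensor([0.80, 0.10, 0.10])   # policy encoding a hidden bit in the choice

kl = F.kl_div(ref.log(), steg, reduction="sum")   # KL(steg || ref)
print(kl.item())   # > 0, even though a human would call the shift meaning-preserving
```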
I think this is a very important direction to investigate. CoT reasoners seem like our best shot at aligning an AGI.
The KL objective pushes these correlations into the “null space” where they are not penalized.
I don’t think this is true. There are many ways to express the same underlying idea, but GPT-3 is not agnostic between them. Divergences from GPT-3’s nullspace are still penalized. KL isn’t wrt “does this text contain the same high level contents according to a human?” It’s wrt “does this text match GPT-3’s most likely continuations?”
Edit: not actually how KL divergence works.
but at this point the outer optimizer might notice (do the algorithmic equivalent of thinking the following), "If I modified this agent slightly by making it answer 'good' instead (or increasing its probability of answering 'good'), then expected future reward will be increased."
This is where I disagree with your mechanics story. The RL algorithm is not that clever. If the agent doesn’t explore in the direction of answering “good”, then there’s no gradient in that direction. You can propose different types of outer optimizers which are this clever and can do intentional lookahead like this, but e.g., policy gradient isn’t doing that.
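A minimal sketch of what I mean (a two-action REINFORCE step; the reward value is arbitrary):

```python
import torch

# Two-action REINFORCE step. The loss contains a term only for the *sampled*
# action, so the counterfactual reward of answering "good" never appears in
# the update -- there's no gradient "in that direction" unless it's explored.
logits = torch.zeros(2, requires_grad=True)   # action 0 = "bad", action 1 = "good"
probs = torch.softmax(logits, dim=0)

sampled_action = 0                            # suppose "good" is never sampled
reward = 0.1                                  # arbitrary reward for the sampled action

loss = -torch.log(probs[sampled_action]) * reward
loss.backward()
print(logits.grad)  # any effect on the "good" logit comes only from renormalization,
                    # not from the reward "good" would have received
```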
There must have been some reason(s) why organisms exhibiting empathy were selected for during our evolution. However, evolution did not directly configure our values. Rather, it configured our (individually slightly different) learning processes. Each human’s learning process then builds their different values based on how the human’s learning process interacts with that human’s environment and experiences.
The human learning process (somewhat) consistently converges to empathy. Evolution might have had some weird, inhuman reason for configuring a learning ...
I'd note that it's possible for an organism to learn to behave (and think) in accordance with the "simple mathematical theory of agency" you're talking about, without said theory being directly specified by the genome. If the theory of agency really is computationally simple, then many learning processes probably converge towards implementing something like that theory, simply as a result of being optimized to act coherently in an environment over time.
I think a lot of people have thought about how humans end up aligned to each other, and concluded that many of the mechanisms wouldn't generalize.
I disagree both with this conclusion and the process that most people use to reach it.
The process: I think that, unless you have a truly mechanistic, play-by-play, and predictively robust understanding of how human values actually form, you are not in an epistemic position to make strong conclusions about whether or not the underlying mechanisms can generalize to superintelligences.
E.g., there a...
I don't think that "evolution -> human values" is the most useful reference class when trying to calibrate our expectations wrt how outer optimization criteria relate to inner objectives. Evolution didn't directly optimize over our goals. It optimized over our learning process and reward circuitry. Once you condition on a particular human's learning process + reward circuitry configuration + the human's environment, you screen off the influence of evolution on that human's goals. So, there are really two areas from which you can draw evidence about inne...
The post isn't saying that there's no way for the genome to influence your preferences / behavior. More like, "the genome faces similar inaccessibility issues as we do wrt learned world models", meaning it needs to use roundabout methods of influencing a person's learned behavior / cognition / values. E.g., the genome can specify some hard-coded rewards for experiential correlates of engaging in sexual activity. Future posts will go into more details on how some of those roundabout ways might work.
The post is phrased pretty strongly (e.g. it makes claims about things being "inaccessible" and "intractable").
Especially given the complexity of the topic, I expect the strength of these claims to be misleading. What one person thinks of as "roundabout methods" another might consider "directly specifying". I find it pretty hard to tell whether I actually disagree with your and Alex's views, or just the way you're presenting them.
I think this line of work is very interesting and important. I and a few others are working on something we've dubbed shard theory, which attempts to describe the process of human value formation. The theory posits that the internals of monolithic learning systems actually resemble something like an ecosystem already. However, rather than there being some finite list of discrete subcomponents / modules, it's more like there's a continuous distribution over possible internal subcomponents and features.
To take agency as an example, sup...
Hmm. I suppose a similar key insight for my own line of research might go like:
The orthogonality thesis is actually wrong for brain-like learning systems. Such systems first learn many shallow proxies for their reward signal. Moreover, the circuits implementing these proxies are self-preserving optimization demons. They’ll steer the learning process away from the true data generating process behind the reward signal so as to ensure their own perpetuation.
If true, this insight matters a lot for value alignment because it points to a way that aligned beh...
I’d suggest my LessWrong post on grokking and SGD, "Hypothesis: gradient descent prefers general circuits". It argues that SGD has an inductive bias towards general circuits, and that this explains grokking. On the one hand, I’m not certain the hypothesis is correct. However, the post is very obscure and is a favourite of mine, so I feel it’s appropriate for this question.
SGD’s bias, a post by John Wentworth, explores a similar question by using an analogy to a random walk.
I suspect you’re familiar with it, but Is SGD a Bayesian sampler? Well, almost advance...
First, I'd like to note that I don't see why faster convergence after changing the learning rate supports either story. After initial memorization, the loss decreases by ~3 OOM. Regardless of what's going on inside the network, it wouldn't be surprising if raising the learning rate increased convergence.
Also, I think what's actually going on here is weirder than either of our interpretations. I ran experiments where I kept the learning rate the same for the first 1000 steps, then increased it by 10x and 50x for the rest of the training.
Here is the accurac...
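A sketch of the stepped schedule those runs used (the base learning rate and dummy model here are placeholders):

```python
import torch

# Keep the base lr for the first 1000 steps, then multiply it by a fixed
# factor (10x or 50x in the runs above). Base lr and model are placeholders.
def lr_at(step, base_lr=1e-3, switch_step=1000, boost=10.0):
    return base_lr if step < switch_step else base_lr * boost

model = torch.nn.Linear(10, 10)                    # stand-in for the actual network
opt = torch.optim.AdamW(model.parameters(), lr=lr_at(0))

for step in range(30_000):
    for group in opt.param_groups:
        group["lr"] = lr_at(step)
    opt.zero_grad()
    loss = model(torch.randn(8, 10)).pow(2).mean() # dummy loss standing in for the real task
    loss.backward()
    opt.step()
```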
This seems like a great resource. I also like the way it’s presented. It’s very clean.
I’d appreciate more focus on the monetary return on investment large models provide their creators. I think that’s the key metric that will determine how far firms scale up these large models. Relatedly, I think it’s important to track advancements that improve model/training efficiency because they can change the expected ROI for further scaling models.
I agree that transformers vs other architectures is a better example of the field “following the leader” because there are lots of other strong architectures (perceiver, mlp mixer, etc). In comparison, using self supervised transfer learning is just an objectively good idea you can apply to any architecture and one the brain itself almost surely uses. The field would have converged to doing so regardless of the dominant architecture.
One hopeful sign is how little attention the ConvBERT language model has gotten. It mixes some convolution operations with se...
The reason self supervised approaches took over NLP is because they delivered the best results. It would be convenient if the most alignable approach also gave the best results, but I don’t think that’s likely. If you convince the top lab to use an approach that delivered worse results, I doubt much of the field would follow their example.
Thanks for the feedback! I use batch norm regularisation, but not dropout.
I just tried retraining the 100,000-cycle meta-learned model in a variety of ways, including for 10,000 steps with 10,000x higher lr, using resilient backprop (which multiplicatively increases/decreases each weight's step size), and using an L2 penalty to decrease weight magnitude. So far, nothing has gotten the network to model the base function. The L2 penalty did reduce weight values to ~the normal range, but the network still didn’t learn the base function.
I now think the increase in weight values is just incidental and that the meta learner found some other way of protecting the network from SGD.
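For concreteness, the L2-penalty retraining attempt looked roughly like the sketch below; the architecture, base function, and coefficients are placeholders (in the real experiment the weights come from the meta-learned model rather than random initialization):

```python
import torch

# Retrain with an explicit L2 penalty on the weights. In the actual experiment
# the model is the 100,000-cycle meta-learned network; here a randomly
# initialized stand-in is used, and the base function / coefficients are
# illustrative placeholders.
model = torch.nn.Sequential(
    torch.nn.Linear(1, 64), torch.nn.ReLU(), torch.nn.Linear(64, 1)
)
opt = torch.optim.SGD(model.parameters(), lr=1e-2)
l2_coeff = 1e-3

def base_fn(x):
    return torch.sin(x)   # stand-in for the actual base function

for step in range(10_000):
    x = torch.rand(32, 1) * 4 - 2
    opt.zero_grad()
    mse = (model(x) - base_fn(x)).pow(2).mean()
    l2 = sum(p.pow(2).sum() for p in model.parameters())
    (mse + l2_coeff * l2).backward()
    opt.step()
```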
Thank you for this excellent post. Here are some thoughts I had while reading.
I think there's another side to the hard paths hypothesis. We are clearly the first technology-using species to evolve on Earth. However, it's entirely possible that we're not the first species with human-level intelligence. If a species with human level intelligence but no opposable thumbs evolved millions of years ago, they could have died out without leaving any artifacts we'd recognize as signs of intelligence.
Besides our intelligence, humans seem od...
What really impressed me were the generalized strategies the agent applied to multiple situations/goals. E.g., "randomly move things around until something works" sounds simple, but contextually applying that strategy is fairly difficult for deep agents to learn. I think of this work as giving the RL agents a toolbox of strategies that can be flexibly applied to different scenari...
I haven't "taken this discussion to Twitter". Joe Carlsmith posted about the paper on Twitter. I saw that post, and wrote my response on Twitter. I didn't even know it was also posted on LW until later, and decided to repost the stuff I'd written on Twitter here. If anything, I've taken my part of the discussion from Twitter to LW. I'm slightly baffled and offended that you seem to be platform-... (read more)
Good point. I think I'm misdirecting my annoyance here; I really dislike that there's so much alignment discussion moving from LW to Twitter, but I shouldn't have implied that you were responsible for that—and in fact I appreciate that you took the time to move this discussion back here. Sorry about that—I edited my comment.