For what it's worth, I buy the claim from Gwern that Microsoft trained Sydney pretty poorly, much worse than is achievable with SFT on highly rated data. For example, Sydney shows significant repetition, which you don't see even on text-davinci-002 or (early 2022) LaMDA, both trained without RLHF.
Oh, cool! I'll take a look later this week
Quick clarifications:
Edit: played around with the models, it seems like the transformer only gets 99.7% train accuracy and 97.5% test accuracy!
I broadly agree with the points being made here, but allow me to nitpick the use of the word "predictive" here, and argue for the key advantage of the simulators framing over the prediction one:
Pretrained models don’t ‘simulate a character speaking’; they predict what comes next, which implicitly involves making predictions about the distribution of characters and what they would say next.
The simulators frame does make it very clear that there's a distinction between the simulator/GPT-3 and the simulacra/characters or situations it's making predictions abo...
The time-evolution rules of the state are simply the probabilities of the autoregressive model -- there's some amount of high level structure but not a lot. (As Ryan says, you don't get the normal property you want from a state (the Markov property) except in a very weak sense.)
I also disagree that purely thinking about the text as state + GPT-3 as evolution rules is the intention of the original simulators post; there's a lot of discussion about the content of the simulations themselves as simulated realities or alternative universes (though the post does...
Nitpick: I mean, technically, the state is only the last 4k tokens or however long your context length is. Though I agree this is still very uninteresting.
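To make the "text as state, model as time-evolution rule" reading concrete, here's a toy sketch (next_token_probs and CONTEXT_LEN are hypothetical stand-ins for an actual LM and its window, not anything from the post):

```python
import random

CONTEXT_LEN = 4096  # hypothetical context window


def next_token_probs(state):
    """Stand-in for the autoregressive model: token window -> distribution over the vocab."""
    raise NotImplementedError  # plug in your favorite LM here


def evolve(state):
    """One 'time-evolution' step: the only rule is 'sample the next token and append it'."""
    probs = next_token_probs(state)
    tokens, weights = zip(*probs.items())
    token = random.choices(tokens, weights=weights)[0]
    # The state is just the visible text (truncated to the context window); the process
    # is only Markov if you count this entire window as "the state".
    return (list(state) + [token])[-CONTEXT_LEN:]
```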
We were quite familiar with Geiger et al.'s work before writing the post, and think it's importantly different. Though it seems like we forgot to cite it in the Causal Scrubbing AF post, whoops.
Hopefully this will be fixed with the forthcoming arXiv paper!
Based on my convos with them, the Anthropic team does seem like a clear example of this, at least insofar as you think understanding circuits in real models with more than one MLP layer is important for interp -- superposition almost entirely stops you from using the standard features-as-directions approach!
Don't think there have been public writeups, but here are two relevant Manifold markets:
In general, a tool being engineering-relevant does not imply that it will be competitive for setting a new SOTA on something risky. So when I talk about engineering relevance in this sequence, I don't have big advancements in mind so much as stuff like fairly simple debugging work.
Fwiw this does not seem to be in the Dan Hendrycks post you linked!
Google’s event where they’re presumably unveiling their response will happen Feb 8th at 2:30 PM CET/5:30 AM PT:
That being said, it's possible that both group composition tasks (like the mod add stuff) and MNIST are pretty special datasets, in that generalizing solutions have small weight norm and memorization solutions have large weight norm. It might be worth constructing tasks where generalizing solutions have large weight norm, and seeing what happens.
The negative result tells us that the strong form of the claim "regularization = navigability" is probably wrong. Having a smaller weight norm actually is good for generalization (just as the learning theorists would have you believe). You'll have better luck moving along the set of minimum loss weights in the way that minimizes the norm than in any other way.
Have you seen the Omnigrok work? It argues that weight norm is directly related to grokking:
Similarly, Figure 7 from https://arxiv.org/abs/2301.05217 also makes this point, but less str...
As for other forms of noise inducing grokking: we do see grokking with dropout! So there's some reason to think noise -> grokking.
(Source: Figure 28 from https://arxiv.org/abs/2301.05217)
Also worth noting that grokking is pretty hyperparameter sensitive -- it's possible you just haven't found the right size/form of noise yet!
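If you want to poke at this, here's roughly the kind of minimal setup I'd try -- modular addition with dropout as the only regularizer (no weight decay). This is a toy sketch, not the paper's setup; the architecture and hyperparameters are illustrative and grokking is sensitive to them:

```python
import torch
import torch.nn as nn

p, d = 113, 128  # modulus and embedding size (illustrative)
pairs = torch.cartesian_prod(torch.arange(p), torch.arange(p))
labels = (pairs[:, 0] + pairs[:, 1]) % p
perm = torch.randperm(len(pairs))
train_idx, test_idx = perm[: len(perm) // 2], perm[len(perm) // 2:]

embed = nn.Embedding(p, d)
model = nn.Sequential(nn.Linear(2 * d, 512), nn.ReLU(), nn.Dropout(p=0.2), nn.Linear(512, p))
# Note: weight_decay=0.0, so dropout is the only source of regularization/noise.
opt = torch.optim.AdamW(list(embed.parameters()) + list(model.parameters()), lr=1e-3, weight_decay=0.0)


def run(idx, train):
    x = torch.cat([embed(pairs[idx, 0]), embed(pairs[idx, 1])], dim=-1)
    logits = model(x)
    loss = nn.functional.cross_entropy(logits, labels[idx])
    acc = (logits.argmax(-1) == labels[idx]).float().mean().item()
    if train:
        opt.zero_grad()
        loss.backward()
        opt.step()
    return acc


for step in range(50_000):
    model.train()
    train_acc = run(train_idx, train=True)
    if step % 1000 == 0:
        model.eval()
        with torch.no_grad():
            test_acc = run(test_idx, train=False)
        print(step, train_acc, test_acc)  # grokking = train acc saturates long before test acc jumps
```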
In particular, can we use noise to make a model grok even in the absence of regularization (which is currently a requirement to make models grok with SGD)?
Worth noting that you can get grokking in some cases without explicit regularization with full batch gradient descent, if you use an adaptive optimizer, due to the slingshot mechanism: https://arxiv.org/abs/2206.04817
Unfortunately, reproducing slingshots reliably was pretty challenging for me; I could consistently get it to happen with 2+ layer transformers but not reliably on 1 layer t...
Yep, this is correct - in the worst case, you could have performance that is exponential in the size of the interpretation.
(Redwood is fully aware of this problem and there have been several efforts to fix it.)
Thanks for the clarification! I'll have to think more about this.
Yeah, I think it was implicitly assumed that there existed some $\epsilon > 0$ such that no token ever had probability $< \epsilon$.
Thanks for the clarification!
I agree that your model of subagents in the two posts shares a lot of commonalities with parts of Shard Theory, and I should've done a lit review of your subagent posts. (I based my understanding of subagent models on some of the AI Safety formalisms I've seen, as well as John Wentworth's Why Subagents?.) My bad.
That being said, I think it's a bit weird to have "habitual subagents", since the word "agent" seems to imply some amount of goal-directedness. I would've classified your work as closer to Shard Theory than the subagent models I normally think about.
Thanks!
just procrastination/lacking urgency
This is probably true in general, to be honest. However, it's an explanation for why people don't do anything, and I'm not sure this differentially leads to delaying contact with reality more than say, delaying writing up your ideas in a Google doc.
Some more strategies I like for touching reality faster
I like the "explain your ideas to other people" point, it seems like an important caveat/improvement to the "have good collaborators" strategy I describe above. I also think the meta strategy point of building a good workflow is super important!
I think this is a good word of caution. I'll edit in a link to this comment.
Thanks for posting this! I agree that it's good to get it out anyways, I thought it was valuable. I especially resonate with the point in the Pure simulators section.
Some responses:
In general I'm skeptical that the simulator framing adds much relative to 'the model is predicting what token would appear next in the training data given the input tokens'. I think it's pretty important to think about what exactly is in the training data, rather than about some general idea of accurately simulating the world.
I think that the main value of the simula...
- C* What is the role of Negative/Backup/regular Name Mover Heads outside IOI? Can we find examples on which Negative Name Movers contribute positively to the next-token prediction?
So, it turns out that negative prediction heads appear ~everywhere! For example, Noa Nabeshima found them on ResNeXts trained on ImageNet: there seem to be heads that significantly reduce the probability of certain outputs. IIRC the explanation we settled on was calibration; ablating these heads seemed to increase log loss via overconfident predictions on borderline cases?
The distinction between "newbies get caught up trying to understand every detail, experts think in higher-level abstractions, make educated guesses, and only zoom in on the details that matter" felt super interesting and surprising to me.
I claim that this is 1) an instance of a common pattern that 2) is currently missing a step (the pre-newbie stage).
The general pattern is the following (terminology borrowed from Terry Tao):
Many forms of interpretability seek to explain how the network's outputs relate to high-level concepts without referencing the actual functioning of the network. Saliency maps are a classic example, as are "build an interpretable model" techniques such as LIME.
In contrast, mechanistic interpretability tries to understand the mechanisms that compose the network. To use Chris Olah's words:
Mechanistic interpretability seeks to reverse engineer neural networks, similar to how one might reverse engineer a compiled binary computer program.
Or see this post by ...
I've expanded the TL;DR at the top to include the nine theses. Thanks for the suggestion!
Thanks Nate!
I didn't add a 1-sentence bullet point for each thesis because I thought the table of contents on the left was sufficient, though in retrospect I should've written it up mainly for learning value. Do you still think it's worth doing after the fact?
Ditto the tweet thread, assuming I don't plan on tweeting this.
See also Superexponential Conceptspace, and Simple Words, from the Sequences:
...By the time you're talking about data with forty binary attributes, the number of possible examples is past a trillion—but the number of possible concepts is past two-to-the-trillionth-power. To narrow down that superexponential concept space, you'd have to see over a trillion examples before you could say what was In, and what was Out. You'd have to see every possible example, in fact.
[...]
From this perspective, learning doesn't just rely on inductive bias, it is nearly all inductive bias.
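To spell out the counting in the quote:

$$
\underbrace{2^{40}}_{\text{possible examples}} \approx 1.1 \times 10^{12},
\qquad
\underbrace{2^{\,2^{40}}}_{\text{possible concepts (subsets of example space)}} \approx 2^{\,1.1 \times 10^{12}}.
$$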
wirehead-proof crib, and eventually it will be sufficiently self-aware and foresighted that when we let it out of the crib, it can deliberately avoid situations that would get it addicted to wireheading.
I feel like I'm saying something relatively uncontroversial here, which is that if you select agents on the basis of doing well wrt X sufficiently hard enough, you should end up with agents that care about things like X. E.g. if you select agents on the basis of human approval, you should expect them to maximize human approval in situations even where human approval diverges from what the humans "would want if fully informed".
I actually want to controversy that. I'm now going to write quickly about selection arguments in alignment more generally (thi...
Thanks for the clarification. I've edited in a link to this comment.
Right, that's a decent objection.
I have three responses:
I think the claim that an optimizer is a retargetable search process makes a lot of sense* and I've edited the post to link to this clarification.
That being said, I'm still confused about the details.
Suppose that I do a goal-conditioned version of the paper, where (hypothetically) I exhibit a transformer circuit that, conditioned on some prompt or the other, was able to alternate between performing gradient descent on three types of objectives (say, $L_1$, $L_2$, $L_\infty$) -- would this suffice? How about if, instead, there wasn't any prompt that let me swi...
Well, no, that's not the definition of optimizer in the mesa-optimization post! Evan gives the following definition of an optimizer:
A system is an optimizer if it is internally searching through a search space (consisting of possible outputs, policies, plans, strategies, or similar) looking for those elements that score high according to some objective function that is explicitly represented within the system
And the following definition of a mesa-optimizer:
...Mesa-optimization occurs when a base optimizer (in searching for algorithms to solve some problem) fi
That definition of "optimizer" requires
some objective function that is explicitly represented within the system
but that is not the case here.
There is a fundamental difference between
The transformers in this paper are programs of the 2nd type. They don't contain any l...
I really do empathize with the authors, since writing an abstract fundamentally requires trading off faithfulness to the paper content against the length and readability of the abstract. But I do agree that they could've been more precise without a significant increase in length.
Nitpick: I think instead of expanding on the sentence
As a result we are able to train a more harmless and less evasive AI assistant than previous attempts that engages with harmful queries by more often explaining its objections to them than avoiding answering
My proposed rewrite ...
You're welcome, and I'm glad you think the writeup is good.
Thank you for the good work.
Sure, edited the post to clarify.
Cool, I don't think we disagree here.
I think your claim is something like:
Without some form of regularization, some forms of RL can lead to trajectories that have zero probability wrt the base distribution (e.g. because they break a correlation that occurs on the pretraining distribution with 100% accuracy). However, sampling cannot lead to trajectories with zero probability?
As stated, this claim is false for LMs without top-p sampling or floating point rounding errors, since every token has a logit greater than negative infinity and thus a probability strictly greater than 0. So with enough sa...
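To spell out the positivity claim: for finite logits $z$,

$$
\mathrm{softmax}(z)_i \;=\; \frac{e^{z_i}}{\sum_j e^{z_j}} \;>\; 0 \quad \text{for every token } i,
$$

so every finite continuation has strictly positive probability under pure sampling; exact zeros can only come from top-p/top-k truncation or floating point underflow.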
Can you explain why RLHF is worse from a Causal Goodhart perspective?
I'm surprised no one has brought up the quantilizer results, specifically the quantilizer optimality theorem from Taylor 2015:
Theorem 1 (Quantilizer optimality). Choose q=1/t. Then, a q-quantilizer maximizes expected U-utility subject to constraint 2.
where constraint 2 is that you don't do more than t times worse in expectation on any possible cost function, relative to the original distribution of actions. That is, quantilizers (which are in turn approximated by BoN) are the optimal solution to a particular robust RL problem.
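For concreteness, here's a minimal sketch of a q-quantilizer over a finite action set (the function name and the handling of the action straddling the quantile boundary are my own choices, not from the paper):

```python
import numpy as np


def quantilize(actions, base_probs, utility, q, rng=None):
    """Sample from the base distribution conditioned on being in the top-q fraction by utility."""
    rng = np.random.default_rng() if rng is None else rng
    order = sorted(range(len(actions)), key=lambda i: -utility(actions[i]))  # best first
    kept, mass = [], 0.0
    for i in order:                  # keep the highest-utility actions...
        kept.append(i)
        mass += base_probs[i]
        if mass >= q:                # ...until they cover q of the base measure
            break
    probs = np.array([base_probs[i] for i in kept], dtype=float)
    probs[-1] -= mass - q            # trim the boundary action so exactly q mass is kept
    probs /= probs.sum()             # renormalize to the conditional distribution
    return actions[kept[rng.choice(len(kept), p=probs)]]

# Loosely, best-of-N selection corresponds to q on the order of 1/N.
```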
However, it turns out t...
Boltzmann factor to upweight answers that your overseer likes. AFAICT this doesn't generally induce more causal Goodhart problems than best-of-N selection does.
This seems correct insofar as your proxy reward does not have huge upward errors (that you don't remove via some sort of clipping). For example, if there are 1 million normal sentences with reward uniformly distributed between [0, 100] and one adversarial sentence with reward r=10^5, conditioning on reward>99 leads to a 1/10,000 chance of sampling the adversarial sentence, while it's very tricky (i...
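To spell out that arithmetic (assuming the base distribution is uniform over the 10^6 + 1 sentences):

$$
P(\text{adversarial} \mid r > 99) \;\approx\; \frac{1}{0.01 \times 10^{6} + 1} \;=\; \frac{1}{10{,}001} \;\approx\; 10^{-4}.
$$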
It’s true that minimizing KL subject to a constraint of always exceeding a certain reward threshold would theoretically be equivalent to Bayesian conditioning and therefore equivalent to filtering.
It's also true that maximizing Reward - KL is Bayesian updating as the linked post shows, and it's true that maximizing reward subject to a KL constraint is also equivalent to Bayesian updating (by Lagrangian multipliers). You see similar results with Max Ent RL (where you maximize Reward + Entropy, which is equal to a constant minus the KL relative to a ...
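For reference, the standard derivation behind the Reward - KL claim (my notation): the maximizer of

$$
\mathbb{E}_{x \sim \pi}[r(x)] \;-\; \beta\, D_{\mathrm{KL}}(\pi \,\|\, \pi_0)
$$

is

$$
\pi^*(x) \;\propto\; \pi_0(x)\, \exp\!\big(r(x)/\beta\big),
$$

i.e. the base policy reweighted by a Boltzmann factor in the reward -- exactly a Bayesian update with likelihood $\propto e^{r(x)/\beta}$. The hard KL-constrained version gives the same form via Lagrange multipliers, with $\beta$ determined by the constraint.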
This doesn't seem to be what Gao et al. found: Figure 9 shows that the KL between the RL policy and the initial policy, at a given proxy reward score, is still significantly larger than the equivalent KL for a BoN policy, as shown in Figure 1.
I agree with the general point, but I'll note that at equal proxy reward model scores, the RL policy has significantly more KL divergence with the base policy.
I expect that the media firestorm around GPT-3 made it significantly easier to raise the capital + support within Google Brain to train PaLM.
Wouldn't surprise me if this was true, but I agree with you that it's possible the ship has already sailed on LLMs. I think this is more so the case if you have a novel insight about what paths are more promising to AGI (similar to the scaling hypothesis in 2018)---getting ~everyone to adopt that insight would significantly advance timelines, though I'd argue that publishing it (such that only the labs explicitly aimi...
Publishing capabilities work is notably worse than just doing the work.
- I'd argue that hyping up the capabilities work is even worse than just quietly publishing it without fanfare.
What's the mechanism you're thinking of, through which hype does damage?
I also doubt that good capabilities work will be published "without fanfare", given how watched this space is.
...My read is that fairly little current alignment work really feels "serial" to me. Assuming that you're mostly referring to conceptual alignment work, my read is that a lot of it is fairly confus
Nate tells me that his headline view of OpenAI is mostly the same as his view of other AGI organizations, so he feels a little odd singling out OpenAI.
[...]
But, while this doesn't change the fact that we view OpenAI's effects as harmful on net currently, Nate does want to acknowledge that OpenAI seems to him to be doing better than some other orgs on a number of fronts:
I wanted to give this a big +1. I think OpenAI is doing better than literally every single other major AI research org except probably Anthropic and Deepmind on trying to solve the AI-n...
Accordingly, I think there’s a tendency to give OpenAI an unfair amount of flak compared to say, Google Brain or FAIR or any of the startups like Adept or Cohere.
I'm not sure I agree that this is unfair.
OpenAI is clearly on the cutting edge of AI research.
This is obviously a good reason to focus on them more.
OpenAI has a lot of visibility in this community, due to its physical proximity and a heavy overlap between OpenAI employees and the EA/Rationalist social scene.
Perhaps we have responsibility to scrutinize/criticize them more because of this...
Epistemic status: Half speculation, half solid advice. I'm writing this up as I've said this a bunch IRL.
Current large language models (LLMs) are sufficiently good at in-context learning that for many NLP tasks, it's often better and cheaper to just query an LM with the appropriate prompt, than to train your own ML model. A lot of this comes from my personal experience (i.e. replacing existing "SoTA" models in other fields with prompted LMs, and getting better performance), but there's also examples ...
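To illustrate the kind of replacement I mean, here's a hypothetical sketch -- query_lm stands in for whatever completion endpoint you use, and the task and labels are made up:

```python
def query_lm(prompt: str) -> str:
    """Stand-in for your completion endpoint (OpenAI, Anthropic, a local model, ...)."""
    raise NotImplementedError


FEW_SHOT = """Classify the sentiment of each review as positive or negative.

Review: The battery died within a week.
Sentiment: negative

Review: Exactly what I needed, works great.
Sentiment: positive

Review: {review}
Sentiment:"""


def classify(review: str) -> str:
    # No training loop, no labeled dataset beyond the handful of in-context examples.
    return query_lm(FEW_SHOT.format(review=review)).strip().split()[0]
```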
Also, a cheeky way to say this:
What Grokking Feels Like From the Inside
What does grokking_NN feel like from the inside? It feels like grokking_Human a concept! :)
Thanks!
(As an amusing side note: I spent 20+ minutes after finishing the writeup trying to get the image from the recent 4-layer docstring circuit post to preview properly in the footnotes, and eventually gave up. That is, a full ~15% of the total time invested was spent on that footnote!)