LawrenceC — AI Alignment Forum

I do AI Alignment research. Currently at METR, but previously at: Redwood Research, UC Berkeley, Good Judgment Project.

I'm also a part-time fund manager for the LTFF.

Obligatory research billboard website: https://chanlawrence.me/

Fair, but in my head I did plan to get it done on the 10th. The tweet is not in itself the prediction, it's just evidence that I made the prediction in my head.

And indeed I did finish the draft on June 10th, but at 11 PM and I decided to wait for feedback before posting. So I wasn't that off in the end, but I still consider it off.

There are indeed many, many silly claims out there, on either side of any debate. And yes, the people pretending that the AIs of 2025 have the limitations of those from 2020 are being silly, journalist or no.

I do want to clarify that I don't think this is a (tech) journalist problem. Presumably when you mention Nightshade dismissively, it's a combination of two reasons: 1) Nightshade artefacts are removable via small amounts of Gaussian blur and 2) Nightshade can't be deployed at scale on enough archetypal images to have a real effect? If you look at the Nightshade website, you'll see that the authors lie about 1):

As with Glaze, Nightshade effects are robust to normal changes one might apply to an image. You can crop it, resample it, compress it, smooth out pixels, or add noise, and the effects of the poison will remain.

So (assuming my recollection that Nightshade is defeatable by Gaussian noise is correct) this isn't an issue of journalists making stuff up or misunderstanding what the authors said, it's the authors putting things in their press release that, at the very least, are not at all backed up by their paper.

(Also, either way, Gary Marcus is not a tech journalist!)

I think writing this post was helpful to me in thinking through my career options. I've also been told by others that the post was quite valuable to them as an example of someone thinking through their career options.

Interestingly, I left METR (then ARC Evals) about a month and a half after this post was published. (I continued to be involved with the LTFF.) I then rejoined METR in August 2024. In between, I worked on ambitious mech interp and did some late stage project management and paper writing (including some for METR). I also organized a mech interp workshop at ICML 2024, which, if you squint, counts as "onboarding senior academics".

I think leaving METR was a mistake ex post, even if it made sense ex ante. I think my ideas around mech interp when I wrote this post weren't that great, even if I thought the projects I ended up working on were interesting (see e.g. Compact Proofs and Computation in Superposition). While the mech interp workshop was very well attended (e.g. the room was so crowded that people couldn't get in due to fire code) and pretty well received, I'm not sure how much value it ended up producing for AIS. Also, I think I was undervaluing the resources available to METR as well as how much I could do at METR.

If I were to make a list for myself in 2023 using what I know now, I'd probably have replaced "onboarding senior academics" with "get involved in AI policy via the AISIs", and instead of "writing blog posts or takes in general", I'd have the option of "build common knowledge in AIS via pedagogical posts". Though realistically, knowing what I know now, I'd have told my past self to try to better leverage my position at METR (and provided him with a list of projects to do at METR) instead of leaving.

Also, I regret both that I called it "ambitious mech interp", and that this post became the primary reference for what this term meant. I should've used a more value-neutral name such as "rigorous model internals" and written up a separate post describing it.

I think this post made an important point that's still relevant to this day.

If anything, this post is more relevant in late 2024 than in early 2023, as the pace of AI makes ever more people want to be involved, while more and more mentors have moved towards doing object level work. Due to the relative reduction of capacity in evaluating new AIS researchers, there's more reliance on systems or heuristics to evaluate people now than in early 2023.

Also, I find it amusing that without the parenthetical, the title of the post makes another important point: "evals are noisy".

I think this post was useful in the context it was written in and has held up relatively well. However, I wouldn't actively recommend it to anyone as of Dec 2024 -- both because the ethos of the AIS community has shifted, making posts like this less necessary, and because many other "how to do research" posts were written that contain the same advice.

Background

This post was inspired by conversations I had in mid-late 2022 with MATS mentees, REMIX participants, and various bright young people who were coming to the Bay to work on AIS (collectively, "kiddos"). The median kiddo I spoke with had read a small number of ML papers and a medium amount of LW/AF content, and was trying to string together an ambitious research project from several research ideas they recently learned about. (Or, sometimes they were assigned such a project by their mentors in MATS or REMIX.)

Unfortunately, I don't think modern machine learning is the kind of field where research consistently works out of the box. Many high-level claims, even in published research papers, are just... wrong; it can be challenging to reproduce results even when they are right; and even techniques that work reliably may not work for the reasons people think they do.

Hence, this post.

What do I think of the content of the post?

I think the core idea of this post held up pretty well with time. I continue to think that making contact with reality is very important, and I think the concrete suggestions for how to make contact with reality are still pretty good.

If I were to write it today, I'd probably add a fifth major reason for why it's important to make quick contact with reality: mental health/motivation. That is, producing concrete research outputs, even small ones, feels pretty essential to maintaining motivation for the vast majority of researchers. My guess is I missed this factor because I focused on the content of research projects, as opposed to the people doing the research.

Where do I feel the post stands now?

Over the past two years, and especially in 2024, the ethos of the AIS community has shifted substantially toward empirical work.

The biggest part of this is the pace of AI. When this post was written, ChatGPT was a month old, and GPT-4 was still more than 2 months away. People both had longer timelines and thought of AIS in more conceptual terms. Many conceptual research projects of 2022 have fallen into the realm of the empirical as of late 2024.

Part of this is due to the rise of (dangerous capability) evals as a major AIS focus in 2023. Evals are both substantially more empirical than the median 2022 AIS research topic, and an area where making contact with reality can be as simple as "pasting a prompt into claude.ai".

Part of this is due to Anthropic's rise to being the central place for AIS researchers. "Being able to quickly produce ML results" is a major part of what it takes to get hired there as a junior researcher, and people know this.

Finally, there have been a decent number of posts or write-ups giving the same advice, e.g. Neel's written advice for his MATS scholars and a recent Alignment Forum post by Ethan Perez.

As a result, this post feels much less necessary or relevant in late December 2024 than in December 2022.

I’m not disputing that they were trained with next token prediction log loss (if you read the tech reports they claim to do exactly this) — I’m just disputing the “on the internet” part, due to the use of synthetic data and private instruction following examples.

I mean, we don't know all the details, but Qwen2 was explicitly trained on synthetic data from Qwen1.5 + "high-quality multi-task instruction data". I wouldn't be surprised if the same were true of Qwen 1.5.

From the Qwen2 report:

Quality Enhancement The filtering algorithm has been refined with additional heuristic and model-based methods, including the use of the Qwen models to filter out low-quality data. Moreover, these models are utilized to synthesize high-quality pre-training data. (Page 5)
[...]
Similar to previous Qwen models, high-quality multi-task instruction data is integrated into the Qwen2 pre-training process to enhance in-context learning and instruction-following abilities.

Similarly, Gemma 2 had its pretraining corpus filtered to remove "unwanted or unsafe utterances". From the Gemma 2 tech report:

We use the same data filtering techniques as Gemma 1. Specifically, we filter the pretraining dataset to reduce the risk of unwanted or unsafe utterances, filter out certain personal information or other sensitive data, decontaminate evaluation sets from our pre-training data mixture, and reduce the risk of recitation by minimizing the proliferation of sensitive outputs. (Page 3)
[...]
We undertook considerable safety filtering of our pre-training data to reduce the likelihood of our pre-trained and fine-tuned checkpoints producing harmful content. (Page 10)

After thinking about it more, I think the LLaMA 1 refusals strongly suggest that this is an artefact of training data. So I've unendorsed the comment above.

It's still worth noting that modern models generally have filtered pre-training datasets (if not wholly synthetic or explicitly instruction-following datasets), and it's plausible to me that this (on top of ChatGPT contamination) is a large part of why we see much better instruction following/more eloquent refusals in modern base models.

It's worth noting that there are reasons to expect the "base models" of both Gemma 2 and Qwen 1.5 to demonstrate refusals -- neither is trained on unfiltered webtext.

We don't know what 1.5 was trained on, but we do know that Qwen2's pretraining data both contains synthetic data generated by Qwen1.5, and was filtered using Qwen1.5 models. Notably, its pretraining data explicitly includes "high-quality multi-task instruction data"! From the Qwen2 report:

Quality Enhancement The filtering algorithm has been refined with additional heuristic and model-based methods, including the use of the Qwen models to filter out low-quality data. Moreover, these models are utilized to synthesize high-quality pre-training data. (Page 5)
[...]
Similar to previous Qwen models, high-quality multi-task instruction data is integrated into the Qwen2 pre-training process to enhance in-context learning and instruction-following abilities.

I think this had a huge effect on Qwen2: Qwen2 is able to reliably follow both the Qwen1.5 chat template (as you note) and the "User: {Prompt}\n\nAssistant: " template. This is also reflected in their high standardized benchmark scores -- the "base" models do comparably to the instruction-finetuned ones! In other words, Qwen2 "base" models are pretty far from traditional base models a la GPT-2 or Pythia as a result of explicit choices made when generating their pretraining data, and this explains their propensity for refusals. I wouldn't be surprised if the same were true of the 1.5 models.
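For concreteness, here is a minimal sketch of what filling in the plain "User/Assistant" template looks like. The helper name and the example request are hypothetical, chosen for illustration only; nothing here comes from the Qwen2 report or any Qwen tooling.

```python
# Plain "User/Assistant" template, as quoted in the comment above.
PLAIN_TEMPLATE = "User: {prompt}\n\nAssistant: "


def format_plain_prompt(prompt: str) -> str:
    """Wrap a user request in the plain template (hypothetical helper)."""
    return PLAIN_TEMPLATE.format(prompt=prompt)


# Illustrative request; the resulting string is what one would feed to a base model.
example = format_plain_prompt("Summarize the Qwen2 report in one sentence.")
print(repr(example))
```

A traditional base model would treat this string as ordinary text to continue, whereas the claim above is that the Qwen2 "base" models reliably respond to it as an instruction.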

I think the Gemma 2 base models were not trained on synthetic data from larger models but its pretraining dataset was also filtered to remove "unwanted or unsafe utterances". From the Gemma 2 tech report:

We use the same data filtering techniques as Gemma 1. Specifically, we filter the pretraining dataset to reduce the risk of unwanted or unsafe utterances, filter out certain personal information or other sensitive data, decontaminate evaluation sets from our pre-training data mixture, and reduce the risk of recitation by minimizing the proliferation of sensitive outputs. (Page 3)
[...]
We undertook considerable safety filtering of our pre-training data to reduce the likelihood of our pre-trained and fine-tuned checkpoints producing harmful content. (Page 10)

My guess is this filtering explains why the model refuses, more so than (and in addition to?) ChatGPT contamination. Once you remove all the "unsafe completions"

I don't know what's going on with LLaMA 1, though.

Ah, you're correct, it's from the original InstructGPT release in Jan 2022:
https://openai.com/index/instruction-following/
