How are applications processed? Sometimes applications are processed on a rolling basis, so it's important to submit as soon as possible. Other times, you just need to apply by the deadline, so if you're about to post something big, it makes sense to hold off on your application.

This criticism feels a bit strong to me. Knowing the extent to which interpretability work scales up to larger models seems pretty important. I could have imagined people arguing either that such techniques would work worse on larger models because of the optimizations required, or that they would work better because fewer concepts would be in superposition. Work on this feels quite important, even though there's a lot more work to be done.

Also, sharing some amount of eye-catching results seems important for building excitement for interpretability research.

Update: I skipped the TLDR when I was reading this post b/c I just read the rest. I guess I'm fine with Anthropic mostly focusing on establishing one kind of robustness and leaving other kinds of robustness for future work. I'd be more likely to agree with Steven Casper if there isn't further research from Anthropic in the next year that makes significant progress in evaluating the robustness of their approach. One additional point: independent researchers can run some of these other experiments, but they can't run the scaling experiment.

That's an excellent point.

I agree. I think that's probably a better way of clarifying the confusion than what I wrote.

“But also this person doesn't know about internal invariances in NN space or the compressivity of the parameter-function map (the latter in particular is crucial for reasoning about inductive biases), then I become extremely concerned”

Have you written about this anywhere?

Nabgure senzr gung zvtug or hfrshy:

Gurer'f n qvssrerapr orgjrra gur ahzore bs zngurzngvpny shapgvbaf gung vzcyrzrag n frg bs erdhverzragf naq gur ahzore bs cebtenzf gung vzcyrzrag gur frg bs erdhverzragf.

Fvzcyvpvgl vf nobhg gur ynggre, abg gur sbezre.

Gur rkvfgrapr bs n ynetr ahzore bs cebtenzf gung cebqhpr gur rknpg fnzr zngurzngvpny shapgvba pbagevohgrf gbjneqf fvzcyvpvgl.

I wrote up my views on the principle of indifference here:

I agree that it has certain philosophical issues, but I don’t believe that this is as fatal to counting arguments as you believe.

Towards the end I write:

“The problem is that we are making an assumption, but rather than owning it, we're trying to deny that we're making any assumption at all, ie. "I'm not assuming a priori A and B have equal probability based on my subjective judgement, I'm using the principle of indifference". Roll to disbelieve.”

I feel less confident in my post than when I wrote it, but it still feels more credible than the position articulated in this post.

Otherwise: this was an interesting post. Well done on identifying some arguments that I need to digest.

IIRC, there was also evidence that Copilot was modulating code quality based on name ethnicity variations in code docs.

You don't know where they heard that?

One of the main challenges I see here is how to calibrate this. In other words, if I can't break a model despite adding an activation vector of strength x, what does this mean in terms of how safe we should consider the model to be? I.e., how much extra adversarial prompting effort is that equivalent to, or how should I modify my probabilities of the model being safe?

Do you think it is likely that techniques like RLHF result in over-developed persuasiveness relative to other capabilities? If so, do you think we can modify the training to make this less of an issue or that it is otherwise manageable?

I think it’s also worth raising the possibility of a Kuhnian scenario where the “mature science” is actually missing something and a further breakthrough is required after that to move it into a new paradigm.
