You mean it might still Goodhart to what we think they might say? Ideally, the actual people would be involved in the process.
My intuition is that the best way to build wise AI would be to train imitation learning agents on people who we consider to be wise. If we trained imitations of people with a variety of perspectives, we could then simulate discussions between them and try to figure out the best discussion formats between such agents. This could likely get us reasonably far.
I say imitation learning because it would give us something that we could treat as an optimisation target, which is what we require for training ML systems.
How are applications processed? Sometimes applications are processed on a rolling basis, so it's important to submit as soon as possible. Other times, you just need to apply by the date, so if you're about to post something big, it makes sense to hold off on your application.
This criticism feels a bit strong to me. Knowing the extent to which interpretability work scales up to larger models seems pretty important. I could have imagined people arguing either that such techniques would work worse on larger models because of the required optimizations, or better because fewer concepts would be in superposition. Work on this feels quite important, even though there's a lot more work to be done.
Also, sharing some amount of eye-catching results seems important for building excitement for interpretability research.
Update: I skipped the TLDR when I was reading this post because I just read the rest. I guess I'm fine with Anthropic mostly focusing on establishing one kind of robustness and leaving other kinds of robustness for future work. I'd be more likely to agree with Stephen Casper if there isn't further research from Anthropic in the next year that makes significant progress in evaluating the robustness of their approach. One additional point: independent researchers can run some of these other experiments, but they can't run the scaling experiment.
That's an excellent point.
I agree. I think that's probably a better way of clarifying the confusion than what I wrote.
“But also this person doesn't know about internal invariances in NN space or the compressivity of the parameter-function map (the latter in particular is crucial for reasoning about inductive biases), then I become extremely concerned”
Have you written about this anywhere?
Another frame that might be useful:
There's a difference between the number of mathematical functions that implement a set of requirements and the number of programs that implement the set of requirements.
Simplicity is about the latter, not the former.
The existence of a large number of programs that produce the exact same mathematical function contributes towards simplicity.
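To make the distinction between counting functions and counting programs concrete, here is a toy sketch. The four-operation "language" on a single bit is purely my own assumption for illustration, not anything from the post under discussion. It enumerates every program of a fixed length and counts how many programs compute each function:

```python
from collections import Counter
from itertools import product

# Hypothetical toy language: each op maps one bit to one bit.
OPS = {
    "id": lambda b: b,
    "not": lambda b: 1 - b,
    "zero": lambda b: 0,
    "one": lambda b: 1,
}

def run(program, bit):
    """Apply each op in the program to the bit, left to right."""
    for op in program:
        bit = OPS[op](bit)
    return bit

def truth_table(program):
    """Identify the function a program computes by its outputs on 0 and 1."""
    return (run(program, 0), run(program, 1))

def count_programs_per_function(length):
    """Count how many programs of the given length compute each function."""
    counts = Counter()
    for program in product(OPS, repeat=length):
        counts[truth_table(program)] += 1
    return counts

counts = count_programs_per_function(3)
for table, n in sorted(counts.items()):
    print(table, n)
```

There are only four functions from one bit to one bit, so a count over functions treats them all equally. A count over length-3 programs does not: in this toy language, far more programs compute the constant functions than compute identity or negation, so sampling programs uniformly favours the functions with many implementations.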
I wrote up my views on the principle of indifference here:
https://www.lesswrong.com/posts/3PXBK2an9dcRoNoid/on-having-no-clue
I agree that it has certain philosophical issues, but I don’t believe that this is as fatal to counting arguments as you believe.
Towards the end I write:
“The problem is that we are making an assumption, but rather than owning it, we're trying to deny that we're making any assumption at all, i.e. "I'm not assuming a priori A and B have equal probability based on my subjective judgement, I'm using the principle of indifference". Roll to disbelieve.”
I feel less confident in my post than when I wrote it, but it still feels more credible than the position articulated in this post.
Otherwise: this was an interesting post. Well done on identifying some arguments that I need to digest.
IIRC, there was also evidence that Copilot was modulating code quality based on name ethnicity variations in code docs
You don't know where they heard that?
I would love to hear what shard theorists make of this.
We could describe this AI as having learned a meta-shard: pace around at the start so that you have time to plan.
But at the point where we've allowed meta-shards, maybe we've already undermined the main claims of shard theory?