julianjm has not written any posts yet.

Re: how to update based on benchmark progress in general, see my response to you above.
On the rest, I think the best way I can think of explaining this is in terms of alignment and not correctness.
My naive interpretation is that we only use ML when we can't be bothered to write a traditional solution, but I don't think you believe that. (To take a trivial example: ML can recognise birds far better than any software we can write.)
The bird example is good. My contention is basically that when it comes to making something like "recognizing birds" economically useful, there is an enormous chasm between 90% performance on a subset of ImageNet... (read 554 more words →)
"actually it's quite gameable" = "actually it's quite easy" ;)
You joke, but one of my main points is that these are very, very different things. Any benchmark, or dataset, acts as a proxy for the underlying task that we care about. Turing used natural conversation because it was a domain where a wide range of capabilities are normally used by humans. The problem is that in operationalizing the test (e.g., trying to fool a human), it ends up being possible or easy to pass without necessarily using or requiring all of those capabilities. And this can happen for reasons beyond just overfitting to the data distribution, because the test itself may just... (read 789 more words →)
I'm not saying that all benchmarks are necessarily hard, I'm saying that these ones look pretty hard to me (compared with ~ordinary conversation).
I'm not sure exactly what you mean here, but if you mean "holding an ordinary conversation with a human" as a task, my sense is that is extremely hard to do right (much harder than, e.g., SuperGLUE). There's a reason that it was essentially proposed as a grand challenge of AI; in fact, it was abandoned once it was realized that actually it's quite gameable. This is why the Winograd Schema Challenge was proposed, but even that and new proposed versions of it have seen lots of progress recently —... (read 1841 more words →)
I guess my main concern here is — besides everything I wrote in my reply to you below — basically that reliability of GPT-N on simple, multiclass classification tasks lacking broader context may not be representative of its reliability in real-world automation settings. If we're to take SuperGLUE as representative, well.. it's already basically solved.
One of the problems here is that when you have the noise ceiling set so low, like it is in SuperGLUE, reaching human performance does not mean the model is reliable. It means the humans aren't. It means you wouldn't even trust a human to do this task if you really cared about the result. Coming up with... (read more)
On 1a:
Insofar as humans like having reasons for failures, I'm willing to accept this as one reason that reliability standards could be a bit higher for ML, but I doubt it would be drastically higher. I'd love a real example (outside of criminal justice) where this is a bottleneck.
Take for example writing news / journalistic articles. Distinguishability from human-written articles is used as evidence for GPT's abilities. The abilities are impressive here, but the task at hand for the original writer is not to write an article that looks human, but one that reports the news. This means deciding what is newsworthy, aggregating evidence, contacting sources, and summarizing and reporting the information... (read 2234 more words →)
Oh yeah, one more thing which I think actually might be the most important point. On a lot of these benchmarks — at the very least, on SuperGLUE — "human-level performance" is a much weaker requirement than "human equivalence." Human performance isn't necessarily an indicator of irreducible entropy in the underlying task. To a large extent, it just reflects ambiguity or coarseness in the dataset specification. A big part of this is the artificial setting of the data annotation, which is unfortunately kind of necessary in a lot of cases when the goal is characterizing abstract language understanding. On the long tail of examples that require more careful reasoning, in the absence... (read 602 more words →)
Hi — new here. I'm an NLP researcher, and for background, I would guess I fall on the skeptical side of the scaling hypothesis by LW standards. I was pointed to this by abergal. Here is some feedback:
1. On the question of economic value:
AFAIK, there are many factors other than raw performance on a benchmark (relating to interpretability, accountability/explainability, and integration with surrounding software) which, as far as I'm aware, may dominate the question of economic viability of AI systems at least on the sub-99.99% accuracy regime (depending on application domain). Examples of what I mean:
1a. Even when humans are used to perform a task, and even when they perform it very... (read 1103 more words →)
Thank you for posting this! I think it's well worth thinking through debate baselines more.
TL;DR:
- I think your open consultancy idea isn't specific to consultancy. I think everything about it equally well applies to debate, and it basically corresponds to how the RL training should work in practice.
- The 50/50 assumption in our consultancy and debate experiments was done for reasons having to do with our research question of interest, which is measuring the relative utility of the supervision paradigms. I don't think we would want to directly use consultancy with a 50/50 prior to help us answer questions in practice and this wasn't the intent behind our use of it as a baseline
... (read 739 more words →)