The preliminary results were obtained on a subset of the full benchmark (~90 tasks vs. 206), and there have been many changes since then, including changes to the scoring. So I'm not sure we'll see the same dynamics in the final results. Most likely yes, but maybe not.
I agree that the task selection process could create dynamics that look like acceleration. A good point.
As I understand it, the organizers accepted almost all submitted tasks (the main rejection reasons were technical, e.g. copyright). So it was mostly self-selection, with a bias towards the hardest imaginable text tasks. It seems that for many contributors, the main motivation was something like:
Take that, Google's most advanced AI! Let's see if you can handle my epic task!
This includes many cognitive tasks that are supposedly human-complete (e.g. understanding humor, irony, ethics), and tasks probing the model's generality (e.g. playing chess, recognizing images, navigating mazes, all in text).
I wonder if the performance dynamics on such tasks will follow the same curve.
The list of all tasks is available here.
The results were presented at a workshop by the project organizers. The video from the workshop is available here (the most relevant presentation starts at 5:05:00).
It's one of those innocent presentations that, once you understand the implications, keep you awake at night.
your view seems to imply that we will move quickly from much worse than humans to much better than humans, but it's likely that we will move slowly through the human range on many tasks
We might be able to falsify that in a few months.
There is a joint Google / OpenAI project called BIG-bench. They've crowdsourced ~200 highly diverse text tasks (from answering scientific questions to predicting protein interaction sites to measuring self-awareness).
One of the goals of the project is to see how performance on the tasks changes with model size, with the size ranging across many orders of magnitude.
Half a year ago, they presented some preliminary results. A quick summary:
if you increase the number of parameters N from 10^7 to 10^10, the aggregate performance score grows roughly like log(N).
But past the 10^10 point, something interesting happens: the score starts growing much faster (roughly ~N).
And for some tasks, the plot looks like a hockey stick (a sudden jump from ~0 to almost human-level).
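To make the shapes of these curves concrete, here is a small toy sketch (purely illustrative, not fitted to any BIG-bench data; the threshold of 10^10 parameters, the steepness, and both functions are my own assumptions): an aggregate score that grows like log(N) up to a threshold and roughly linearly past it, and a per-task "hockey stick" modeled as a logistic in log10(N).

```python
import math

PARAM_THRESHOLD = 1e10  # assumed breakpoint, loosely inspired by the summary above

def toy_aggregate_score(n_params: float) -> float:
    """Toy aggregate score: ~log(N) growth below the threshold,
    ~linear in N above it. Not a real fit to any benchmark."""
    if n_params <= PARAM_THRESHOLD:
        # slow logarithmic growth, normalized to reach 1.0 at the threshold
        return math.log10(n_params) / math.log10(PARAM_THRESHOLD)
    # much faster, roughly linear growth past the threshold
    return 1.0 + (n_params - PARAM_THRESHOLD) / PARAM_THRESHOLD

def toy_hockey_stick(n_params: float, midpoint: float = PARAM_THRESHOLD,
                     steepness: float = 5.0) -> float:
    """Toy per-task curve: near-zero performance below the midpoint,
    a sudden jump to near-1.0 above it (logistic in log10(N))."""
    x = math.log10(n_params) - math.log10(midpoint)
    return 1.0 / (1.0 + math.exp(-steepness * x))

if __name__ == "__main__":
    for n in [1e7, 1e8, 1e9, 1e10, 2e10, 1e11]:
        print(f"N = {n:.0e}: aggregate = {toy_aggregate_score(n):.2f}, "
              f"hockey-stick task = {toy_hockey_stick(n):.3f}")
```

The point of the sketch is only the qualitative shape: the aggregate curve's slope changes abruptly at the threshold, and the hockey-stick task stays near zero for three orders of magnitude before snapping to near-human performance within one.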
The paper with the full results is expected to be published in the next few months.
Judging by the preliminary results, a FOOM could start like this:
GPT-5 still sucks on most tasks. It's mostly useless. But what if we increase parameters_num by 2x? What could possibly go wrong?