I think that taking a theoretical computer science perspective on GPT-3 might help clarify some parts of the debate.
My background is in theoretical computer science, and that's my preferred perspective for making sense of the world. So in this post, I try to rephrase the task we ask GPT-3 to solve in terms of search and optimization problems, and I propose different criteria for solving such a task. I thus treat GPT-3 as a black box, and only ask how to decide whether it is good at its task.
Epistemic status: probably nothing new, but maybe an interesting perspective. This is an attempt to understand the intuitions behind different statements about whether GPT-3 "solves some task".
Let's fix a language (English, for example). Then a very basic description of the task given to GPT-3 is that it receives a prompt $p$ and must produce a follow-up text $x$ such that $x$ is a good answer to the prompt. For example, $p$ might be a question about parentheses or addition, or a story to complete, and $x$ is GPT-3's answer. I thus define $\mathrm{Prompt}_p$ to be this search problem for a given prompt $p$.
Now, what does it mean for an algorithm $A$ to solve $\mathrm{Prompt}_p$? Multiple possibilities come to mind:
Okay, so GPT-3 solves $\mathrm{Prompt}_p$ iff it is a probabilistic solution to it. Is that enough?
Not exactly. In some cases, we don't want a binary choice between good and bad answers; there might be a scale of answers which are more and more correct. That is to say, it should be an optimization problem.
To go from a search problem to an optimization problem, we need a measure of how good a solution is. Let's say we have a function $\mathrm{measure}_p : \mathrm{Answers} \to \mathbb{R}^+$. The optimization problem $\mathrm{BestPrompt}_p$ can then be defined with our probabilistic criterion in multiple ways:
This new criterion feels much closer to what I want GPT-3 to do when I give it a prompt. But I don't know how to write a measure function that captures what I want (after all, that would be a utility function for the task!). We can do without one, by instead comparing two answers with each other. I probably have a couple of answers in mind when writing my prompt; and when I receive an answer, I feel that comparing it to one of my expected answers should be possible.
In consequence, I give the following criterion. Assume the existence of a set of expected answers $\mathrm{expected}_p$ and of a comparison relation between answers $\geq_p$. Then
(As Good As Expected) $A$ solves $\mathrm{BestPrompt}_p$ if $\Pr[\exists x \in \mathrm{expected}_p : A_t(p) \geq_p x] > l$, with $l > 1/2$.
I am personally happy with that criterion. And I think it is rather uncontroversial (with maybe the subtlety that some people would prefer a ∀ in place of the ∃).
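To make the "As Good As Expected" criterion more concrete, here is a minimal sketch of how one might estimate it empirically by sampling a black-box algorithm many times. Everything in it is an assumption made for illustration: the names `as_good_as_expected`, `make_toy_model`, and `closer_to_four` are hypothetical, the toy model is not GPT-3, and the comparison relation is a stand-in for whatever relation between answers one actually has in mind.

```python
import random
from typing import Callable, Iterable, Tuple


def as_good_as_expected(
    sample_answer: Callable[[str], str],
    prompt: str,
    expected: Iterable[str],
    at_least_as_good: Callable[[str, str], bool],
    threshold: float = 0.5,
    n_samples: int = 1000,
) -> Tuple[float, bool]:
    """Monte Carlo estimate of Pr[exists x in expected: A(p) >= x].

    Returns the empirical success rate and whether it exceeds `threshold`
    (the bound l in the criterion, assumed here to be at least 1/2).
    """
    expected_answers = list(expected)
    hits = 0
    for _ in range(n_samples):
        answer = sample_answer(prompt)
        # The existential quantifier: one expected answer is enough.
        if any(at_least_as_good(answer, x) for x in expected_answers):
            hits += 1
    rate = hits / n_samples
    return rate, rate > threshold


def make_toy_model(seed: int) -> Callable[[str], str]:
    """Hypothetical stochastic black box: answers "4" about 80% of the time."""
    rng = random.Random(seed)

    def sample(prompt: str) -> str:
        return "4" if rng.random() < 0.8 else "5"

    return sample


def closer_to_four(answer: str, reference: str) -> bool:
    """Toy comparison relation: `answer` is at least as close to 4 as `reference`."""
    return abs(int(answer) - 4) <= abs(int(reference) - 4)


if __name__ == "__main__":
    model = make_toy_model(seed=0)
    rate, solved = as_good_as_expected(model, "2+2=", ["4"], closer_to_four)
    print(f"empirical rate: {rate:.3f}, criterion satisfied: {solved}")
```

Replacing `any` with `all` in the inner check gives the stricter variant with a ∀ in place of the ∃. Note that this only estimates the probability from finitely many samples, so it is evidence about the criterion rather than a proof that it holds.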
But we're not done yet. This only constrains what the algorithm does for a fixed prompt. A big part of the discussion of GPT-3 revolves around the number of prompts for which GPT-3 should give an interesting answer. This requires a step back.
When evaluating GPT-3, what I really care about is its answer for a specific task (like addition or causality). And this task can be encoded through many prompts. Let's say that we have a set $P$ of prompts that somehow capture our task (from our perspective). What constraints should an algorithm $A$ satisfy to be said to solve the task? We already have a meaningful criterion for answering a prompt correctly: the "As Good As Expected" one. What's left to decide is for which prompts in $P$ $A$ should satisfy this criterion.
Contrary to the part about solving a prompt, the answer doesn't look obvious to me here. I also believe that much of the disagreement about GPT-3's competence is based on a disagreement about this part.
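To illustrate how much the choice of quantifier over prompts matters, here is a small sketch comparing three possible aggregation rules. These rules and the function names are assumptions chosen for illustration, not a definitive list; the input is assumed to be a mapping from each prompt in the set to whether the per-prompt criterion holds for it.

```python
from typing import Mapping


def solves_for_all(results: Mapping[str, bool]) -> bool:
    """Strictest rule: the per-prompt criterion must hold for every prompt."""
    return all(results.values())


def solves_for_some(results: Mapping[str, bool]) -> bool:
    """Most lenient rule: one prompt that works is enough."""
    return any(results.values())


def solves_for_fraction(results: Mapping[str, bool], min_fraction: float) -> bool:
    """Intermediate rule: the criterion must hold for at least `min_fraction` of prompts."""
    if not results:
        raise ValueError("the set of prompts must not be empty")
    return sum(results.values()) / len(results) >= min_fraction


if __name__ == "__main__":
    # Hypothetical per-prompt outcomes for an addition task.
    outcomes = {
        "2+2=": True,
        "What is two plus two?": True,
        "Add 2 and 2:": False,
    }
    print(solves_for_all(outcomes))
    print(solves_for_some(outcomes))
    print(solves_for_fraction(outcomes, 0.5))
```

On the same per-prompt outcomes, the three rules can give different verdicts, which is one way two people looking at the same GPT-3 outputs can disagree about whether it "solves" the task.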
Thinking about GPT-3, and the sort of problem we want it to solve, through the lens of theoretical computer science helped me understand the debate surrounding it slightly better. I hope it can be helpful to you too. I'm also very curious about any errors you might find in this post, whether conceptual or technical.