Thinking about it more, there's another, more serious restriction, at least for now: Codex can't write code that depends on the rest of your codebase. Consider the following lightly-anonymized code from a real-world codebase I contribute to:
    def unobserved(self) -> Set[str]:
        """Returns a set of all unobserved random variable names inside this Distribution -- that is,
        those that are neither observed nor marginalized over.
        """
        return set(self.components) - self.observed - self.marginalized
I don't think Codex could write that code from just that function signature and that docstring, because a human couldn't do it: they wouldn't know how to find the names of the observed and marginalized random variables, and they wouldn't know that self.components exists or has to be explicitly converted into a set. And if the human didn't know what random variables are, or what marginalization is, they'd have an even tougher time.
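To make the missing context concrete, here is a made-up stand-in for the surrounding class. Only the `unobserved` method comes from the real code; the constructor, the attribute types, and the variable names in the usage lines are my guesses for illustration. The point is that the one-line body only makes sense once you know these three attributes exist.

```python
from typing import Dict, Iterable, Set


class Distribution:
    """Hypothetical stand-in for the real class; everything but `unobserved` is assumed."""

    def __init__(
        self,
        components: Dict[str, object],
        observed: Iterable[str] = (),
        marginalized: Iterable[str] = (),
    ) -> None:
        # Assumption: components maps variable names to some variable object,
        # which is why the method has to convert it to a set of names.
        self.components = dict(components)
        # Assumption: observed and marginalized are stored as sets of names.
        self.observed: Set[str] = set(observed)
        self.marginalized: Set[str] = set(marginalized)

    def unobserved(self) -> Set[str]:
        """Returns a set of all unobserved random variable names inside this Distribution -- that is,
        those that are neither observed nor marginalized over.
        """
        return set(self.components) - self.observed - self.marginalized


dist = Distribution(
    {"mu": None, "sigma": None, "y": None},
    observed={"y"},
    marginalized={"sigma"},
)
print(dist.unobserved())  # {'mu'}
```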
One can imagine a prompt that might elicit this implementation, something like "this is a Python function that returns the elements of set(self.components) that do not appear in self.observed or self.marginalized". But then we are no longer really writing docstrings; we are writing programs in a novel language with a stochastic compiler.
This should be fixable with larger context windows, I think. If the prompt for a method could include the whole class definition aside from that method, Codex could at least in principle use other class methods and variables in sensible ways. But this will have to wait on the practical realization of O(n) or O(n log n) context window scaling.
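Here is a rough sketch of the kind of prompt I have in mind: take the class source, strip the target method down to its signature and docstring, and move it to the end so the model is left to continue with the body. The `build_prompt` helper and the toy class source are my own inventions for illustration, not anything Codex provides. It leans on Python's standard `ast` module (`ast.unparse` needs Python 3.9+), which drops comments and reformats the code along the way.

```python
import ast


def build_prompt(class_source: str, method_name: str) -> str:
    """Return the class source with `method_name` reduced to signature plus docstring.

    Hypothetical helper: the stub is moved to the end of the class so the
    prompt stops exactly where the method body should begin.
    """
    tree = ast.parse(class_source)
    # Assumes the source contains at least one top-level class.
    cls = next(node for node in tree.body if isinstance(node, ast.ClassDef))
    target = next(
        node
        for node in cls.body
        if isinstance(node, ast.FunctionDef) and node.name == method_name
    )
    if ast.get_docstring(target) is None:
        raise ValueError(f"{method_name} needs a docstring to prompt from")
    # Keep only the docstring statement, dropping the implementation.
    target.body = target.body[:1]
    # Move the stub to the end of the class body.
    cls.body.remove(target)
    cls.body.append(target)
    return ast.unparse(tree)


# Toy class source; only `unobserved` mirrors the real method.
CLASS_SOURCE = '''
class Distribution:
    def unobserved(self) -> Set[str]:
        """Returns a set of all unobserved random variable names."""
        return set(self.components) - self.observed - self.marginalized

    def __init__(self, components, observed, marginalized):
        self.components = components
        self.observed = set(observed)
        self.marginalized = set(marginalized)
'''

prompt = build_prompt(CLASS_SOURCE, "unobserved")
print(prompt)
```

With the constructor in view, the names self.components, self.observed and self.marginalized are at least available to the model. Whether it would actually use them correctly is a separate question.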
What's your take on the licensing issue? I know for sure that Codex won't affect my day job in the near term, because I'm not allowed to use it; I guess most large companies, and open-source projects large enough to care about IP assignment, will have the same problem.
This indirectly influences the speed of model improvement we should expect: the market for this kind of tool is smaller than you might think from the size of GitHub's userbase, so there'll be less incentive to invest in it.