This was literally the first output, with no rerolls in the middle! (Although after posting it, I did some other trials which weren't as good, so I did get lucky on the first one. Randomness parameter was set to 0.5.)
I cut it off there because the next paragraph just restated the previous one.
(sorry, couldn't resist)
This is the first post in an Alignment Forum sequence explaining the approaches both MIRI and OpenAI staff believe are the most promising means of auditing the cognition of very complex machine learning models. We will be discussing each approach in turn, with a focus on how they differ from one another.
The goal of this series is to provide a more complete picture of the various options for auditing AI systems than has been provided so far by any single person or organization. The hope is that it will help people make better-informed decisions about which approach to pursue.
We have tried to keep our discussion as objective as possible, but we recognize that there may well be disagreements among us on some points. If you think we've made an error, please let us know!
If you're interested in reading more about the history of AI research and development, see:
1. What Is Artificial Intelligence? (Wikipedia)
2. How Does Machine Learning Work?
3. How Can We Create Trustworthy AI?
The first question we need to answer is: what do we mean by "artificial intelligence"?
The term "artificial intelligence" has been used to refer to a surprisingly broad range of things. The three most common uses are:
1. The study of how to create machines that can perceive, think, and act in ways that are typically only possible for humans.
2. The study of how to create machines that can learn, using data, in ways that are typically only possible for humans.
3. The study of how to create machines that can reason and solve problems in ways that are typically only possible for humans.
In this sequence, we will focus on the third definition. We believe that the first two are much less important for the purpose of AI safety research, and that they are also much less tractable.
Why is it so important to focus on the third definition?
The third definition is important because, as we will discuss in later posts, it is the one that creates the most risk. It is also the one that is most difficult to research, and so it requires the most attention.
I'm imagining a tiny AI Safety organization, circa 2010, that focused on how to achieve probable alignment for scaled-up versions of that year's state-of-the-art AI designs. It's interesting to ask whether that organization would have achieved more or less than MIRI has, in terms of generalizable work and in terms of field-building.
Certainly it would have resulted in a lot of work that was initially successful but ultimately dead-end. But maybe early concrete results would have attracted more talent/attention/respect/funding, and the org could have thrown that at DL once it began to win the race.
On the other hand, maybe committing to 2010's AI paradigm would have made them a laughingstock by 2015, and killed the field. Maybe the org would have too much inertia to pivot, and it would have taken away the oxygen for anyone else to do DL-compatible AI safety work. Maybe it would have stated its problems less clearly, inviting more philosophical confusion and even more hangers-on answering the wrong questions.
Or, worst, maybe it would have made a juicy target for a hostile takeover. Compare what happened to nanotechnology research (and nanotech safety research) when too much money got in too early: savvy academics and industry representatives exiled Drexler from the field he founded so that they could spend the federal dollars on regular materials science and call it nanotechnology.
That's not a nitpick at all!
Upon reflection, the structured sentences, thematically resolved paragraphs, and even JSX code can be done without a lot of real lookahead. And there's some evidence it's not doing lookahead - its difficulty completing rhymes when writing poetry, for instance.
(Hmm, what's the simplest game that requires lookahead that we could try to teach to GPT-3, such that it couldn't just memorize moves?)
Thinking about this more, I think that since planning depends on causal modeling, I'd expect the latter to get good before the former. But I probably overstated the case for its current planning capabilities, and I'll edit accordingly. Thanks!
The outer optimizer is the more obvious thing: it's straightforward to say there's a big difference in dealing with a superhuman Oracle AI with only the goal of answering each question accurately, versus one whose goals are only slightly different from that in some way. Inner optimizers are an illustration of another failure mode.
I've been using "computable" to mean a total function (each instance is computable in finite time).
I'm thinking of an agent outside a universe about to take an action, and each action will cause that universe to run a particular TM. (You could maybe frame this as "the agent chooses the tape for the TM to run on".) For me, this is analogous to acting in the world and causing the world to shift toward some outcomes over others.
By asserting that U should be the computable one, I'm asserting that "how much do I like this outcome" is a more tractable question than "which actions result in this outcome".
An intuition pump in a human setting:
I can check whether given states of a Go board are victories for one player or the other, or if the game is not yet finished (this is analogous to U being a total computable function). But it's much more difficult to choose, for an unfinished game where I'm told I have a winning strategy, a move such that I still have a winning strategy. The best I can really do as a human is calculate a bit and then guess at how the leaves will probably resolve if we go down them (this is analogous to eval being an enumerable but not necessarily computable function).
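To make the asymmetry concrete, here is a minimal sketch using a toy stand-in for Go: a single-pile subtraction game in which players alternately remove 1 to 3 stones and whoever takes the last stone wins. The game, the rule set in `MOVES`, and the function names are all assumptions chosen for illustration. The analogy is loose: in this toy both questions are computable, so it only shows the gap in effort between judging an outcome (one comparison) and finding a move that keeps a winning strategy (a search over the game tree).

```python
from functools import lru_cache

MOVES = (1, 2, 3)  # hypothetical toy rule: remove 1 to 3 stones per turn


def outcome(stones: int, player_to_move: int) -> str:
    """Cheap, total check of a position: has someone won yet?"""
    if stones == 0:
        # The previous player took the last stone, so they won.
        return f"player {1 - player_to_move} wins"
    return "unfinished"


@lru_cache(maxsize=None)
def can_win(stones: int) -> bool:
    """True if the player to move has a winning strategy (full lookahead)."""
    # A position is winning if some legal move leaves the opponent losing.
    return any(not can_win(stones - m) for m in MOVES if m <= stones)


def winning_move(stones: int):
    """Return a move that preserves a winning strategy, or None if none exists."""
    for m in MOVES:
        if m <= stones and not can_win(stones - m):
            return m
    return None


print(outcome(0, player_to_move=0))  # judging a finished game is immediate
print(winning_move(10))              # choosing a move requires search
```

For Go the first function stays easy while the second becomes infeasible to compute exactly, which is where the "calculate a bit and then guess" step comes in.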
In general, individual humans are much better at figuring out what outcomes we want than we are at figuring out exactly how to achieve those outcomes. (It would be quite weird if the opposite were the case.) We're not good at either in an absolute sense, of course.
Let's talk first about non-embedded agents.
Say that I'm given the specification of a Turing machine, and I have a computable utility mapping from output states (including "does not halt") to [0,1]. We presumably agree that this is possible.
I agree that it's impossible to make a computable mapping from Turing machines to outcomes, so I cannot have a computable utility function from TMs to the reals which assigns the same value to any two TMs with identical output.
But I can have a logical inductor which, for each TM, produces a sequence of predictions about that TM's output's utility. Every TM that halts will eventually get the correct utility, and every TM that doesn't will converge to some utility in [0,1], with the usual properties for logical inductors guaranteeing that TMs easily proven to have the same output will converge to the same number, etc.
That's a computable sequence of utility functions over TMs with asymptotic good properties. At any stage, I could stop and tell you that I choose some particular TM as the best one as it seems to me now.
I haven't really thought in a long while about questions like "do logical inductors' good properties of self-prediction mean that they could avoid the procrastination paradox", so I could be talking nonsense there.
I mean the sort of "eventually approximately consistent over computable patterns" thing exhibited by logical inductors, which is stronger than limit-computability.
I think that computable is obviously too strong a condition for classical utility; enumerable is better.
Imagine you're about to see the source code of a machine that's running, and if the machine eventually halts then 2 utilons will be generated. That's a simpler problem to reason about than the procrastination paradox, and your utility function is enumerable but not computable. (Likewise, logical inductors obviously don't make PA approximately computable, but their properties are what you'd want the definition of approximately enumerable to be, if any such definition were standard.)
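A minimal sketch of what "enumerable but not computable" looks like for the halting example, under some loudly stated assumptions: the machine is modeled as a Python generator that yields once per step and returns when it halts, and `utility_lower_bound`, `halts_after_five`, and `never_halts` are hypothetical names. The point is that each finite stage gives a computable lower bound that never decreases and converges to the true utility, yet no finite stage can certify that the answer is 0.

```python
from typing import Callable, Iterator


def utility_lower_bound(machine: Callable[[], Iterator[None]], steps: int) -> int:
    """Lower bound on utility after simulating `machine` for `steps` steps.

    Returns 2 if the machine is seen to halt within the budget, else 0.
    """
    run = machine()
    for _ in range(steps):
        try:
            next(run)
        except StopIteration:
            return 2  # halting observed: the 2 utilons are certain
    return 0  # not halted yet; a larger budget may revise this upward


def halts_after_five() -> Iterator[None]:
    """Toy machine (assumption): runs five steps and then halts."""
    for _ in range(5):
        yield


def never_halts() -> Iterator[None]:
    """Toy machine (assumption): loops forever."""
    while True:
        yield


# The bounds are non-decreasing in the step budget.
print([utility_lower_bound(halts_after_five, n) for n in range(8)])
print(utility_lower_bound(never_halts, 1000))
```

For `never_halts` the bound stays at 0 for every budget, which is the correct limit, but at no stage does the procedure know that it has finished.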
I suspect that the procrastination paradox leans heavily on the computability requirement as well.
If the listener is running a computable logical uncertainty algorithm, then for a difficult proposition it hasn't made much sense of, the listener might say "70% likely it's a theorem and X will say it, 20% likely it's not a theorem and X won't say it, 5% PA is inconsistent and X will say both, 5% X isn't naming all and only theorems of PA".
Conditioned on PA being consistent and on X naming all and only theorems of PA, and on the listener's logical uncertainty being well-calibrated, you'd expect that in about 78% of such cases (70/(70+20)) X eventually names it.
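The 78% figure is just the listener's credences renormalized after dropping the two excluded cases. A short sketch of that arithmetic, with the dictionary keys as labels made up for illustration:

```python
from fractions import Fraction

# The listener's four credences from the example above.
credences = {
    "theorem_and_X_says_it": Fraction(70, 100),
    "non_theorem_and_X_silent": Fraction(20, 100),
    "PA_inconsistent": Fraction(5, 100),
    "X_unreliable": Fraction(5, 100),
}

# Conditioning on "PA is consistent" and "X names all and only theorems"
# removes the last two cases; renormalize over what remains.
surviving = (
    credences["theorem_and_X_says_it"] + credences["non_theorem_and_X_silent"]
)
p_named = credences["theorem_and_X_says_it"] / surviving

print(p_named, float(p_named))
```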
But you can't use the listener's current probabilities on [X saying it] to sort out theorems from non-theorems in a way that breaks computability!
What am I missing?