This post was improved some by a discussion with Evan which crystallized some points as 'clear disagreements' instead of me being confused, but I think there are more points to crystallize further in this way. It was posted tonight in the state it's in as part of MSFP 2019's blog post day, but might get edited more tomorrow or perhaps will get further elaborated in the comments section.
The View from 2018
In April of last year, I wrote up my confusions with Paul’s agenda, focusing mostly on approval-directed agents. I mostly have similar opinions now; the main thing I noticed on rereading it was that I talked about ‘human-sized’ consciences, when now I would describe them as larger than human size (since moral reasoning depends on cultural accumulation, which is larger than human size). But on the meta level, I think they’re less relevant to Paul’s agenda than I thought then; I was confused about how Paul’s argument for alignment worked. (I do think my objections were correct objections to the thing I was hallucinating Paul meant.) So let’s see if I can explain it to Vaniver_2018, which includes pointing out the obstacles that Vaniver_2019 still sees. It wouldn't surprise me if I was similarly confused now, tho hopefully I am less so, and you shouldn't take this post as me speaking for Paul.
Factored Cognition
One core idea that Paul’s approach rests on is that thoughts, even the big thoughts necessary to solve big problems, can be broken up into smaller chunks, and this can be done until the smallest chunk is digestible. That is, problems can be ‘factored’ into parts, and the factoring itself is a task (that may need to be factored). Vaniver_2018 will object that it seems like ‘big thoughts’ require ‘big contexts’, and Vaniver_2019 has the same intuition, but this does seem to be an empirical question that experiments can give actual traction on (more on that later).
The hope behind Paul’s approach is not that the small chunks are all aligned and that chaining together small aligned things leads to a big aligned thing (which is what Vaniver_2018 thought Paul was trying to do). A hope behind Paul’s approach is that the small chunks are incentivized to be honest. This is possibly useful for transparency and avoiding inner optimizers. A separate hope with small chunks is that they’re cheap: mimicking the sort of things that human personal assistants can do in 10 minutes only requires lots of 10-minute chunks of human time (each of which only costs a few dollars) and doesn’t require figuring out how intelligence works; that’s the machine learning algorithm’s problem.
So how does it work? You put in an English string, a human-like thing processes it, and it passes out English strings--subquestions downwards if necessary, and answers upwards. The answers can be “I don’t know” or “Recursion depth exceeded” or whatever. The human-like thing comes preloaded (or pre-trained) with some idea of how to do this correctly; obviously incorrect strategies like “just pass the question downward for someone else to answer” get ruled out, and the humans we’ve trained on have been taught things like how to do good Fermi estimation and some of the alignment basics. This is general, and lets you do anything humans can do in a short amount of time (and when skillfully chained, anything humans can do in a long amount of time, given the large assumption that you can serialize the relevant state and subdivide problems in the relevant ways).
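To make the shape of this concrete, here is a minimal toy sketch of the recursive Q&A protocol as I understand it. This is my own illustration rather than anything from Paul's writeups; `answer_unit`, `decompose`, and `MAX_DEPTH` are hypothetical stand-ins for the human-like unit, its choice of subquestions, and the recursion budget.

```python
# Toy sketch of the factored-cognition Q&A protocol. Every name here is an
# illustrative placeholder, not part of any actual implementation.

MAX_DEPTH = 3  # recursion budget, standing in for "Recursion depth exceeded"

def answer_unit(question: str, subanswers: list[str]) -> str:
    """Stand-in for the human-like unit: given a question and the answers to
    its subquestions, produce an answer in plain English."""
    # In a real system this would be a trained model or ~10 minutes of human time.
    return "I don't know"

def decompose(question: str) -> list[str]:
    """Stand-in for the unit's choice of subquestions (possibly none)."""
    return []

def answer(question: str, depth: int = 0) -> str:
    """Recursively answer a question by factoring it into subquestions.
    Every message up or down the tree is a plain English string."""
    if depth >= MAX_DEPTH:
        return "Recursion depth exceeded"
    subanswers = [answer(q, depth + 1) for q in decompose(question)]
    return answer_unit(question, subanswers)

print(answer("How many golf balls fit in a 737?"))  # "I don't know" with these stubs
```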
Now schemes diverge a bit on how they use factored cognition, but in at least some we begin by training the system to simply imitate humans, and then switch to training the system to be good at answering questions or to distill long computations into cached answers or quicker computations. One of the tricks we can use here is that ‘self-play’ of a sort is possible, where we can just ask the system whether a decomposition was the right move, and this is an English question like any other.
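Schematically, the training loop I have in mind looks something like the sketch below, reusing the `decompose` and `answer_unit` stubs from above. Again this is my own toy illustration of the amplify-then-distill idea; `Model`, `train_to_imitate`, and the specific questions are hypothetical placeholders rather than any real API.

```python
# Toy sketch of an amplify-then-distill loop in the spirit of IDA, reusing
# decompose() and answer_unit() from the previous sketch. All names are
# illustrative placeholders.

class Model:
    def predict(self, question: str) -> str:
        return "I don't know"  # untrained placeholder

def amplify(model: Model, question: str) -> str:
    """Answer a question by decomposing it and letting the current (fast)
    model answer the subquestions."""
    subanswers = [model.predict(q) for q in decompose(question)]
    return answer_unit(question, subanswers)

def train_to_imitate(model: Model, data: list[tuple[str, str]]) -> Model:
    """Placeholder for distillation: fit the model to (question, answer) pairs."""
    return model

def ida_step(model: Model, questions: list[str]) -> Model:
    data = []
    for q in questions:
        data.append((q, amplify(model, q)))
        # The 'self-play' trick: evaluating a decomposition is itself just
        # another English question we can pose to the (amplified) system.
        meta_q = f"Was that a good way to factor '{q}'?"
        data.append((meta_q, amplify(model, meta_q)))
    return train_to_imitate(model, data)
```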
Honesty Criterion
Originally, I viewed the frequent reserialization as a solution to a security concern. If you do arbitrary thought for arbitrary lengths of time, then you risk running into inner optimizers or other sorts of unaligned cognition. Now it seems that the real goal is closer to an ‘honesty criterion’; if you ask a question, all the computation in that unit will be devoted to answering the question, and all messages between units are passed where the operator can see them, in plain English.[1]
Even if one succeeds at honesty, it still seems difficult to maintain both generality and safety. That is, I can easily see how factored cognition allows you to stick to cognitive strategies that definitely solve a problem in a safe way, but I don't see how it does that while also allowing you to develop new cognitive strategies to solve a problem in a way that doesn't open the door to inner optimizers--not within units, but within assemblages of units. Or, conversely, one could become more general while giving up on safety. In order to get both, it seems like we’re resting a lot on the Overseer’s Manual, or on the way that we trained the humans that we used as training data.
Serialized State is Inadequate or Inefficient
In my mind, the primary reason to build advanced AI (as opposed to simple AI) is to accomplish megaprojects instead of projects. Curing cancer (in a way that potentially involves novel research) seems like a megaproject, whereas determining how a particular protein folds (which might be part of curing cancer) is more like a project. To the extent that factored cognition relies on the serialized state (of questions and answers) to enforce honesty on the units of computation, it seems like that will be inefficient for problems whose state is large enough to impose significant serialization costs, and inadequate for problems whose state is too large to serialize. If we allow answers that are a page long at most, or that a human could write out in 10 minutes, then we’re not going to get a 300-page report of detailed instructions. (Of course, allowing them to collate reports written by subprocesses gets around this difficulty, but means that we won’t have ‘holistic oversight’ and will allow for garbage to be moved around without being caught if the system doesn’t have the ability to read what it’s passing.)
The factored cognition approach also has a tree structure of computation, as opposed to a graph structure, which leads to lots of duplicated effort and the impossibility of horizontal communication. If I’m designing a car, I might consider each part separately, but then also modify the parts as I learn more about the requirements of the other parts. This sort of sketch-then-refinement seems quite difficult to do under the factored cognition approach, even though it involves reductionism and factorization.
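As a toy illustration of why the tree structure matters (nothing to do with Paul's actual proposals), compare a strict tree of subquestions, which recomputes shared subproblems, with the same computation given a cache, which turns it into a graph. Fibonacci is just a stand-in for a question whose subquestions overlap; the call counts are the point, not the values.

```python
# Tree-structured recursion recomputes shared subproblems; a shared cache
# turns the same computation into a DAG. Fibonacci stands in for a question
# whose subquestions overlap.

from functools import lru_cache

calls_tree = 0
def fib_tree(n: int) -> int:  # strict tree: no sharing between branches
    global calls_tree
    calls_tree += 1
    return n if n < 2 else fib_tree(n - 1) + fib_tree(n - 2)

calls_dag = 0
@lru_cache(maxsize=None)
def fib_dag(n: int) -> int:  # shared cache: graph-structured computation
    global calls_dag
    calls_dag += 1
    return n if n < 2 else fib_dag(n - 1) + fib_dag(n - 2)

fib_tree(20)
fib_dag(20)
print(calls_tree, calls_dag)  # 21891 units of work vs 21
```

The cache here is playing the role of the shared memory discussed next, which is exactly where the honesty guarantee starts to leak.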
Shared memory partially solves this (because, among other things, it introduces the graph structure of computation), but it weakens the guarantee of our honesty criterion because we now allow arbitrary side effects. It seems to me like this is a necessary component for most of human reasoning, however. Michael Faraday, the pioneer of electromagnetism, lost much of his memory in later life, in a way that seriously reduced his scientific productivity. And factored cognition doesn’t even allow the external notes and record-keeping he used to partially compensate.
There's Actually a Training Procedure
The previous section described what seems to me to be a bug; from Paul's perspective this might be a necessary feature because his approaches are designed around taking advantage of arbitrary machine learning, which means only the barest of constraints can be imposed. IDA presents a simple training procedure that, if used with an extremely powerful model-finding machine learning system, allows us to recursively surpass the human level in a smooth way. (Amusingly to me, this is like Paul enforcing slow takeoff.)
Training the Factoring Problem is Ungrounded
From my vantage point, the trick of improving the system by asking it questions like “was X a good way to factor question Y?” (where X was its attempt at factoring Y) is one of the core reasons to think this approach is workable, and it also seems like it won’t work (or will preserve blind spots in dangerous ways). This is because while we could actually find the ground truth on how many golf balls fit in a 737, it is much harder to find the ground truth on what cognitive style most accurately estimates how many golf balls fit in a 737.
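For contrast, the object-level question really is checkable: a back-of-the-envelope estimate like the one below could in principle be compared against reality. Every number here is a rough assumption of mine, and the point is only that this answer has a ground truth, while "which cognitive style produced the best estimate" does not have one we can cheaply observe.

```python
# Rough Fermi estimate for the object-level question. All figures are
# illustrative assumptions; only the checkability of the answer matters.

import math

cabin_length_m = 30.0    # assumed usable cabin length of a 737
cabin_radius_m = 1.8     # assumed usable cabin radius
cabin_volume_m3 = math.pi * cabin_radius_m**2 * cabin_length_m

ball_diameter_m = 0.043  # regulation golf ball diameter
ball_volume_m3 = (4 / 3) * math.pi * (ball_diameter_m / 2) ** 3
packing_fraction = 0.64  # random close packing of spheres

estimate = cabin_volume_m3 * packing_fraction / ball_volume_m3
print(f"~{estimate:,.0f} golf balls")  # on the order of a few million
```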
It seems like there are a few ways to go about grounding this: we could explore many alternative factorings and check them empirically against outcomes, or we could ask the system itself (using its current cognitive style) whether a given factoring was any good.
We now face some tradeoffs between exploration (in a monstrously huge search space, which may be highly computationally costly to meaningfully explore) and rubber-stamping, where I use my cognitive style to evaluate whether or not my cognitive style is any good. Even if we have a good resolution to that tradeoff, we have to deal with the cognitive credit-assignment problem.
That is, in reinforcement learning the learner has to figure out which of the actions taken (or not taken) before a reward actually led to receiving it, so that credit can be assigned properly; similarly, the system that's training the Q&A policy needs to understand well enough how the policy leads to correct answers that it can apply the right gradients in the right places (or spend a tremendous amount of compute doing this by blind search).
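Here's a toy numerical sketch of that credit-assignment step, entirely my own illustration: a yes/no meta-answer ("was X a good way to factor Y?") gets turned into a scalar reward and used for a REINFORCE-style update on a policy over two candidate factoring strategies (a lexicographical split versus a clustering-based split, to foreshadow the example below).

```python
# Toy credit assignment: a yes/no meta-answer becomes a scalar reward that
# nudges a softmax policy over two hypothetical factoring strategies.

import numpy as np

rng = np.random.default_rng(0)
logits = np.zeros(2)  # policy over {0: lexicographical split, 1: clustering split}

def meta_answer_is_yes(strategy: int) -> bool:
    """Stand-in for asking the system whether the chosen factoring was good."""
    return strategy == 1  # pretend the clustering split is judged better

learning_rate = 0.5
for _ in range(50):
    probs = np.exp(logits) / np.exp(logits).sum()
    strategy = rng.choice(2, p=probs)
    reward = 1.0 if meta_answer_is_yes(strategy) else 0.0
    # REINFORCE-style credit assignment: increase the log-probability of the
    # chosen strategy in proportion to the reward it received.
    grad = -probs
    grad[strategy] += 1.0
    logits += learning_rate * reward * grad

print(np.exp(logits) / np.exp(logits).sum())  # mass shifts toward strategy 1
```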
This is complicated by the fact that there may be multiple approaches to problem-solving that are internally coherent, but where mixtures of those approaches fail. If we only use methods like gradient descent that smoothly traverse the solution space, this won't be a problem (because gradient descent won't sharply jump from one approach to another), but it's an open empirical question whether future ML techniques will be based on gradient descent.

It’s also not obvious how we can extricate ourselves from the dependence on our learned question-answer policy. If I normally split a model into submodels based on a lexicographical ordering, and now I’m considering a hypothetical split into submodels based on statistical clustering, I would likely want to consider the hypothetical split all the way down the tree (since updates to my beliefs on ‘what strategy should I use to A this Q?’ will affect more than just this question), especially if there are two coherent strategies but a mixture of the strategies is incoherent. But how to implement this is nonobvious: am I passing to the alternate branch not just questions, but also a complete description of the new cognitive strategy they should employ? It seems like a tremendous security hole to have ‘blindly follow whatever advice you get in the plaintext of questions’ as part of my Q->A policy, and so it seems more like I should be spinning up a new hypothetical agent (where the advice is baked into their policy instead of their joint memory), in a way that may cause some of my other guarantees that relied on smoothness to fail.
Also note that because updates to my policy impact other questions, I might actually want to consider the impact on those other questions as well, further complicating the search space. (Ideally, if I had been handling two questions the same way and discover that I should handle them separately, my policy will adjust to recognize the two types and split accordingly.) While this is mostly done by the machine learning algorithm that’s trying to massage the Q->A policy to maximize reward, it seems like attuning the reward signal (from the answer to this meta-question) to how it will be used would probably help; consider the answer “it should be answered like these questions, instead of those,” though generally we assume yes/no answers are used for reward signals.
When we have an update procedure for a system, we can think of that update procedure as the system's "grounding", or the source of gravity that it becomes arranged around. I don't yet see a satisfying source of grounding for proposals like HCH that are built on factored cognition. Empiricism doesn't allow us to make good use of samples or computation, in a way that may render the systems uncompetitive, and alternatives to empiricism seem like they allow the system to go off in a crazy direction in a way that's possibly unrecoverable. It seems like the hope is that we have a good human seed that is then gradually amplified, in a way that seems like it might work but relies on more luck than I would like: the system is rolling the dice whenever it makes a significant transition in its cognitive style, since it can no longer fully trust oversight from previous systems in the amplification tree (they may misunderstand what's going on in the contemporary system), and it can no longer fully trust oversight from itself, because it would be using the potentially corrupted reasoning process to evaluate itself.
[1] Of course some messages could be hidden through codes, but this behavior is generally discouraged by the optimization procedure: whenever you compare to a human baseline, the humans will not do the necessary decoding and will behave in a different way, costing you points.