The first draft of this post started with a point that was clear, cohesive, and wrong. So instead, you get this bunch of rambling that I think should be interesting.
I - Do not mess with time
Harry wrote down 181,429. He repeated what he'd just written down, and Anthony confirmed it.
Then Harry raced back down into the cavern level of his trunk, glanced at his watch (the watch said 4:28 which meant 7:28) and then shut his eyes.
Around thirty seconds later, Harry heard the sound of steps, followed by the sound of the cavern level of the trunk sliding shut. (Harry wasn't worried about suffocating. An automatic Air-Freshening Charm was part of what you got if you were willing to buy a really good trunk. Wasn't magic wonderful, it didn't have to worry about electric bills.)
And when Harry opened his eyes, he saw just what he'd been hoping to see, a folded piece of paper left on the floor, the gift of his future self.
Call that piece of paper "Paper-2".
Harry tore a piece of paper off his pad.
Call that "Paper-1". It was, of course, the same piece of paper. You could even see, if you looked closely, that the ragged edges matched.
Harry reviewed in his mind the algorithm that he would follow.
If Harry opened up Paper-2 and it was blank, then he would write "101 x 101" down on Paper-1, fold it up, study for an hour, go back in time, drop off Paper-1 (which would thereby become Paper-2), and head on up out of the cavern level to join his dorm mates for breakfast.
If Harry opened up Paper-2 and it had two numbers written on it, Harry would multiply those numbers together.
If their product equaled 181,429, Harry would write down those two numbers on Paper-1 and send Paper-1 back in time.
Otherwise Harry would add 2 to the number on the right and write down the new pair of numbers on Paper-1. Unless that made the number on the right greater than 997, in which case Harry would add 2 to the number on the left and write down 101 on the right.
And if Paper-2 said 997 x 997, Harry would leave Paper-1 blank.
Which meant that the only possible stable time loop was the one in which Paper-2 contained the two prime factors of 181,429.
If this worked, Harry could use it to recover any sort of answer that was easy to check but hard to find. He wouldn't have just shown that P=NP once you had a Time-Turner, this trick was more general than that. Harry could use it to find the combinations on combination locks, or passwords of every sort. Maybe even find the entrance to Slytherin's Chamber of Secrets, if Harry could figure out some systematic way of describing all the locations in Hogwarts. It would be an awesome cheat even by Harry's standards of cheating.
Harry took Paper-2 in his trembling hand, and unfolded it.
Paper-2 said in slightly shaky handwriting:
DO NOT MESS WITH TIME
Harry wrote down "DO NOT MESS WITH TIME" on Paper-1 in slightly shaky handwriting, folded it neatly, and resolved not to do any more truly brilliant experiments on Time until he was at least fifteen years old.
-(Harry Potter and the Methods of Rationality, ch. 17)
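Read as a procedure, the excerpt describes a search for a fixed point of a paper-to-paper update rule. Here is a minimal Python sketch of that reading, assuming a blank paper is modeled as `None` and a pair of numbers as a tuple. The names `step` and `fixed_points` are illustrative and do not come from the story.

```python
N = 181_429            # the number Harry wants to factor
LOW, HIGH = 101, 997   # range of odd candidates used in the excerpt

def step(paper2):
    """What Harry writes on Paper-1 after reading Paper-2."""
    if paper2 is None:                  # blank paper: start the search
        return (LOW, LOW)
    left, right = paper2
    if left * right == N:               # correct factors: copy them unchanged
        return (left, right)
    if (left, right) == (HIGH, HIGH):   # search exhausted: leave Paper-1 blank
        return None
    right += 2
    if right > HIGH:                    # carry over into the left number
        left, right = left + 2, LOW
    return (left, right)

def fixed_points():
    """All paper contents that reproduce themselves, i.e. stable loops."""
    odds = range(LOW, HIGH + 1, 2)
    states = [None] + [(a, b) for a in odds for b in odds]
    return [s for s in states if step(s) == s]

print(fixed_points())  # [(397, 457), (457, 397)]
```

Within this model the only self-consistent contents are the two orderings of the factor pair. The message Harry actually receives lies outside the state space the algorithm assumed, which is arguably the point of the excerpt.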
II - Intro
This post is primarily an attempt to think about whether HCH is aligned, in the hypothetical where it can be built. By "aligned" I mean that asking HCH to choose good actions, or to design an FAI, would result in good outputs, even if there were no fancy safety measures beyond verbal instructions to the human inside HCH on some best practices for factored cognition. I will mention some considerations from real-world approximations later, but mostly I'm talking about a version of what Paul later calls "weak HCH" for conceptual simplicity.
In this hypothetical, we have a perfectly simulated human in some environment, and they only get to run for ~2 weeks of subjective time. We can send in our question only at the start of the simulation, and they send out their answer at the end, but in between they're allowed to query new simulations, each given a new question and promptly returning the answer they would get after 2 subjective weeks of rapid simulation (and the sub-simulations have access to sub-sub-simulations in turn, and so on). We assume that the computer running this process is big but finite, and so the tree structure of simulations asking questions of more simulations should bottom out with probability 1 at a bunch of leaf-node simulations who can answer their question without any help. And if halting is ever in doubt, we can posit some safeguards that eventually kick in to hurry the human up to make a decision.
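The recursive structure described above can be sketched in a few lines of Python. This is only a toy under stated assumptions: the `Human` signature, the `depth_budget` safeguard, and the `toy_human` policy are all hypothetical stand-ins, not part of any actual HCH proposal.

```python
from typing import Callable

# A "human" policy takes a question plus an `ask` callback for sub-questions
# and returns an answer. This signature is an assumption for illustration.
Human = Callable[[str, Callable[[str], str]], str]

def hch(question: str, human: Human, depth_budget: int) -> str:
    """Answer `question` by letting `human` consult further copies of itself."""
    def ask(sub_question: str) -> str:
        if depth_budget <= 0:
            # Safeguard: out of budget, so the sub-query goes unanswered and
            # the human has to decide without help.
            return ""
        return hch(sub_question, human, depth_budget - 1)
    return human(question, ask)

def toy_human(question: str, ask: Callable[[str], str]) -> str:
    """Hypothetical policy: sums a '+'-separated list by splitting it in two."""
    terms = question.split("+")
    if len(terms) == 1:
        return terms[0].strip()        # leaf node: answers without any help
    mid = len(terms) // 2
    left = ask("+".join(terms[:mid]))
    right = ask("+".join(terms[mid:]))
    # Unanswered sub-queries count as 0, a crude fallback.
    return str(int(left or 0) + int(right or 0))

print(hch("1+2+3+4+5", toy_human, depth_budget=10))  # 15
```

With a generous budget the tree bottoms out at leaf nodes on its own. With `depth_budget=0` the safeguard kicks in and the answer silently degrades, which is one way a forced halt can cost answer quality.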
I see two main questions here. One is about the character of attractors within HCH. An attractor is some set of outputs (of small size relative to the set of all possible outputs) such that prompting the human with an element of the set will cause them to produce an output that's also an element of the set. We can think of them as the basins of stability in the landscape traversed by the function. The set just containing an empty string is one example of an attractor, if our human doesn't do anything unprompted. But maybe there are some more interesting attractors like "really sad text," where your human reliably outputs really sad text upon being prompted with really sad text. Or maybe there's a virtuous cycle attractor, where well-thought-out answers support more well-thought-out answers.
How much we expect the behavior of HCH to be dominated by the limiting behavior of the attractors depends on how deep we expect it to be, relative to how quickly convergent behavior appears, and how big we expect individual attractors to be. There's no point describing the dynamical behavior of HCH in this way if there are 10 jillion different tiny attractors, after all.
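To make the definition of an attractor concrete, here is a small Python sketch that checks whether a candidate set is closed under a response function. The function `toy_respond` is a hypothetical stand-in for the human, made up purely for illustration.

```python
def is_attractor(candidate: set[str], respond) -> bool:
    """True if every prompt in `candidate` yields an output also in it."""
    return all(respond(prompt) in candidate for prompt in candidate)

def toy_respond(prompt: str) -> str:
    """Hypothetical stand-in for the human: mirrors sadness, else chats."""
    if prompt == "":
        return ""                       # does nothing unprompted
    if "sad" in prompt:
        return "very sad"               # sad text begets sad text
    return "thanks, here is a reply to: " + prompt

print(is_attractor({""}, toy_respond))                 # True
print(is_attractor({"sad", "very sad"}, toy_respond))  # True
print(is_attractor({"hello"}, toy_respond))            # False
```

The check only captures closure of the set. How large the set is, and how quickly nearby prompts fall into it, are the separate questions raised in the paragraph above.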
The other question is about capability. Does a tree of humans actually converge to good answers, or do they probably screw up somewhere? This is mainly speculation based on gut feelings about the philosophical competence of humans, or how easy it is to do factored cognition, but this is also related to questions about attractors, because if there are undesirable attractors, you don't want HCH to pursue cognitive strategies that will find them. If undesirable attractors are hard to find, this is easy - and vice versa.
III - Features of HCH
Normally we think of attractors in the context of nonlinear dynamics (wikipedia), or iterated function application (review paper). HCH is just barely neither of these things. In HCH, information flows two ways - downwards and upwards. Iterated function application requires us to be tracking some function output, which is only a function of the history (think the Fibonacci sequence or the logistic map), whereas even though we can choose an ordering to put an HCH tree into a sequence, for any such ordering you can have functional dependence on terms later in the sequence, which makes proving things much harder. The only time we can really squeeze HCH into the straitjacket is when considering the state of the human up until they get their first response to their queries to HCH, because the sequence of first queries really is just iterated function application.
Aside before getting back to attractors: future-dependence of the sequence also makes it possible for there to be multiple consistent solutions. However, this is mostly avoided for finite-sized trees, is a separate problem from the character of attractors, and corresponds to a training issue for IDA approximations to HCH that I think can be avoided with foresight.
At first blush this partial description (only the first queries, and no responses allowed) doesn't seem that exciting (which is good, exciting is bad!). This is primarily because in order for the computation of HCH to terminate, we already know that we want everything to eventually reach the "no query" attractor. We can certainly imagine non-terminating attractors - maybe I spin up an instance to think about anti-malarial drugs if asked about malaria vaccines, and spin up an instance to think about malaria vaccines if asked about anti-malarial drugs - but we already know we'll need some methods to avoid these.
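Since the sequence of first queries is plain iterated function application, it can be followed mechanically. The sketch below assumes a hypothetical `first_query` function that maps a question to the first sub-question asked, or to `None` for "no query". The lookup table is invented for illustration.

```python
def first_query_chain(question, first_query, max_depth=100):
    """Follow first queries downward and report how the chain ends."""
    seen = []
    current = question
    while current is not None and len(seen) < max_depth:
        if current in seen:
            return seen, "cycle"        # a non-terminating attractor
        seen.append(current)
        current = first_query(current)
    status = "terminates" if current is None else "depth limit"
    return seen, status

# Hypothetical first-query behavior; unknown questions map to "no query".
table = {
    "how do we eradicate malaria?": "malaria vaccines",
    "malaria vaccines": "anti-malarial drugs",
    "anti-malarial drugs": "malaria vaccines",
}

print(first_query_chain("how do we eradicate malaria?", table.get))
print(first_query_chain("what is 2+2?", table.get))
```

The first call ends in the two-question loop from the malaria example, and the second reaches the "no query" attractor immediately. The `max_depth` cutoff is a crude stand-in for the methods we would need to avoid such loops.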
However, downward attractors can be different from the "no query" attractor and yet still terminate, by providing restrictions above and beyond the halting of the computation. Let's go back to the hypothetical "very sad text" attractor. You can have a "very sad text" attractor that will still terminate; the set of queries in this attractor will include "no query," but will also include very sad text, which (in this hypothetical) causes the recipient to also include very sad text in their first query.
In the end, though, the thing we care about is the final output of HCH, not what kind of computation it's running down in its depths, and this is governed by the upward flow of answers. So how much can we treat the upward answers as having attractors? Not unboundedly - there is almost certainly no small set of responses to the first query that causes the human to produce an output also in that set regardless of the prompt. But humans do have some pretty striking regularities. In this post, I'll mostly have to stick to pure speculation.
IV - Pure speculation
If we think of the flow of outputs up the HCH tree as opposed to down, do there seem to be probable attractors? Because this is no longer an iterated function, we have to loosen the definition - let's call a "probable attractor" some set that (rather than mapping to itself for certain) gets mapped to itself with high probability, given some probability distribution over the other factors affecting the human. Thus probable attractors have some lifetime that can be finite, or can be infinite if the probabilities converge to 1 sufficiently quickly.
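One way to make "lifetime" precise, as a sketch under simplifying assumptions: suppose the set maps to itself at step $k$ with probability $p_k \in (0, 1]$, conditional on having survived so far, and let $T$ be the number of steps survived. The symbols $p_k$ and $T$ are introduced here for illustration only.

```latex
\begin{align*}
  \Pr(T \ge n) &= \prod_{k=1}^{n} p_k, \\
  \mathbb{E}[T] &= \sum_{n=1}^{\infty} \Pr(T \ge n)
                 = \sum_{n=1}^{\infty} \prod_{k=1}^{n} p_k, \\
  \Pr(T = \infty) &= \prod_{k=1}^{\infty} p_k > 0
  \iff \sum_{k=1}^{\infty} (1 - p_k) < \infty .
\end{align*}
```

For a constant $p_k = p < 1$ this gives $\mathbb{E}[T] = p/(1-p)$, a finite lifetime. If instead $1 - p_k$ shrinks like $1/k^2$, the sum converges and the attractor persists forever with nonzero probability, which is the sense in which the probabilities must converge to 1 "sufficiently quickly."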
Depending on how we phrase mathematical questions about probable attractors, they seem to either be obviously common (if you think about small stochastic perturbations to functions with attractors), or obviously rare (if you think about finite state machines that have been moved to worst-case nearby states). I'm not actually sure what a good perspective is.
But I do have some guesses about possible attractors for humans in HCH. An important trick for thinking about them is that attractors aren't just repetitious, they're self-repairing. If the human gets an input that deviates from the pattern a little, their natural dynamics will steer them into outputting something that deviates less. This means that a highly optimized pattern of flashing lights that brainwashes the viewer into passing it on is a terrible attractor, and that bigger, better attractors are going to look like ordinary human nature, just turned up to 11.
Here are some ideas:
- Framings of the problem. Once you are prompted to think about something a certain way, that might easily self-reinforce, even though you'd just as easily think about it a different way if prompted differently.
- Memes that use the communication channel plus normal human reasoning. E.g. a convincing argument for why we should be doing something else instead, or an apparent solution to the entire problem that seems important to pass on.
- Self-propagating style/emotional choices in communication. E.g. the very sad text attractor. I wonder if there's any data on how easy it is to influence humans in this sort of "emotional telephone" game.
- Memes that hijack the communication channel using normal communication in ways that go against sensible precommitments, but are some sort of slippery slope. E.g. a joke so funny that you want to share it, or a seductive conspiracy theory.
None of these are automatically bad (well, except maybe the last one). But we might expect them to decrease the quality of answers from HCH by driving the humans to somewhat unusual parts of the probability distribution. This leads to the observation that there's no reason to expect a single "good reasoning" attractor, where all the good behavior lives, independent of the original question asked to HCH. This is sort of the inverse Anna Karenina principle.
Let me restate the issue in bigger font: To the extent that iterated human message-passing tends to end up in these attractors, we can expect this to degrade the answer quality. Not because of a presumption of malicious behavior, but because these long-lifetime attractors are going to have some features based on human psychological vulnerabilities, not on good reasoning. It's like extremal Goodhart for message-passing.
V - Some thoughts on training
Because these issues are intrinsic to HCH even if it gets to use perfect simulations of humans, they aren't going to be fixed too easily in training. We could just solve meta-preferences and use reflectively consistent human-like agents, but if we can do that kind of ambitious applied philosophy project, we can probably just solve the alignment problem without simulating human-adjacent things.
Transparency tools might be able to catch particularly explicit optimization for memetic fitness. But they might not help when attractors are arrived at by imitation of human reasoning (and when they do help, they'll be an implementation of a specific sort of meta-preference scheme). A trained IDA approximation of HCH is also likely to use some amount of reasoning about what humans will do (above and beyond direct imitation), which can seamlessly learn convergent behavior.
Correlations between instances of imitations can provide a training incentive to produce outputs for which humans are highly predictable (I'm reminded of seeing David Krueger's paper recently). This is the issue I mentioned earlier as being related to the multiplicity of consistent solutions to HCH - we don't want our human-imitations to be incentivized to make self-fulfilling prophecies that push them away from the typical human probability distribution. Removing these correlations might require expensive efforts to train off of a clean supervised set with a high fraction of actual humans.
VI - Humans are bad at things
The entire other question about the outer alignment of HCH is whether humans being bad at things is going to be an issue. Sometimes you give a hard problem to people, and they just totally mess it up. Worst case we ask HCH to design an FAI and get back something that maximizes compressibility or similar. In order for HCH to be "really" aligned, we might want it to eliminate such errors, or at least drastically decrease them. Measuring such a thing (or just establishing trust) might require trying it on "easier problems" first (see some of the suggestions from Ajeya Cotra's post).
Honestly this topic could be its own post. Or book. Or anthology. But here, I'll keep it short and relate it back to the discussion of attractors.
My gut feeling is that the human is too influential for errors to ever really be extirpated. If we ask HCH a question that requires philosophical progress, and the human happens to be locked into some unproductive frame of mind (which gets duplicated every time a new instance is spooled up), then they're probably going to process the problem and then interpret the responses to their queries in line with their prior inclinations. History shows us that people on average don't make a ton of philosophical progress in ~2 weeks of subjective time.
And suppose that the human is aware of this problem and prepares themselves to cast a wide net for arguments, to really try to change their mind, to try (or at least get a copy of themselves to try) all sorts of things that might transform their philosophical outlook. Is the truth going to pour itself down upon this root-node human? No. You know what's going to pour down upon them? Attractors.
This sort of issue seems to put a significant limit on the depth and adventurousness available to searches for arguments that would change your mind about subjective matters. Which in turn makes me think that HCH is highly subject to philosophical luck.