Hands-On Experience Is Not Magic

Thane Ruthenis

Here are some views, oftentimes held in a cluster:

You can't make strong predictions about what superintelligent AGIs will be like. We've never seen anything like this before. We can't know that they'll FOOM, that they'll have alien values, that they'll kill everyone. You can speculate, but making strong predictions about them? That can't be valid.
You can't figure out how to align an AGI without having an AGI on-hand. Iterative design is the only approach to design that works in practice. Aligning AGI right on the first try isn't simply hard, it's impossible, so racing to build an AGI to experiment with is the correct approach for aligning it.
An AGI cannot invent nanotechnology/brain-hacking/robotics/[insert speculative technology] just from the data already available to humanity, then use its newfound understanding to build nanofactories/take over the world/whatever on the first try. It'll have to engage in extensive, iterative experimentation first, and there'll be many opportunities to notice what it's doing and stop it.
More broadly, you can't genuinely generalize out of distribution. The sharp left turn is a fantasy — you can't improve without the policy gradient, and unless there's someone holding your hand and teaching you, you can only figure it out by trial-and-error. Thus, there wouldn't be genuine sharp AGI discontinuities.
There's something special about training by SGD, and the "inscrutable" algorithms produced this way. They're a specific kind of "connectionist" algorithm made up of an inchoate mess of specialized heuristics. This is why interpretability is difficult: it involves translating these special algorithms into a more high-level form. Indeed, it's why AIs may be inherently uninterpretable!

You can probably see the common theme here. It holds that learning by practical experience (henceforth LPE) is the only process by which a certain kind of cognitive algorithms can be generated. LPE is the only way to become proficient in some domains, and the current AI paradigm works because it implements this kind of learning, and it only works inasmuch as it implements this kind of learning.^[1]
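The post's first footnote restates this view as the claim that babble-and-prune is the only general-purpose planning method: stochastically generate candidate solutions, prune them, and repeat. Below is a minimal Python sketch of that loop. The toy problem (recovering the integer 42 from a score signal alone) and every function and parameter name are illustrative assumptions, not anything taken from the post.

```python
import random

def babble_and_prune(generate, score, n_candidates=20, rounds=200, good_enough=None, seed=0):
    """Babble-and-prune: stochastically generate candidates, prune to the
    best one, repeat until a good-enough solution appears or rounds run out."""
    rng = random.Random(seed)
    best = None
    for _ in range(rounds):
        # Babble: propose random candidates, perturbing the current best if any.
        candidates = [generate(rng, best) for _ in range(n_candidates)]
        if best is not None:
            candidates.append(best)  # never lose ground between rounds
        # Prune: keep only the highest-scoring candidate.
        best = max(candidates, key=score)
        if good_enough is not None and score(best) >= good_enough:
            break
    return best

# Hypothetical toy problem: find the integer 42 using nothing but the score signal.
def generate(rng, best):
    return rng.randint(-100, 100) if best is None else best + rng.randint(-10, 10)

def score(x):
    return -abs(x - 42)

solution = babble_and_prune(generate, score, good_enough=0)
```

The loop never uses any structural knowledge of the problem. It only learns from which random guesses scored better, which is exactly the restriction the LPE view takes to be fundamental.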

On its face, this isn't totally implausible. I myself have suggested that some capabilities may only be implementable via one algorithm and one algorithm only.

But I think this is false, in this case. And perhaps, when put this way, it already looks false to you as well.

If not, let's dig into the why.^[2]

A Toy Formal Model

What is a "heuristic", fundamentally speaking? It's a recorded statistical correlation: the knowledge that if you're operating in some environment $E$ with the intent to achieve some goal $G$, taking the action $A$ is likely to lead to achieving that goal.

As a toy formality, we can say that it's a structure of the following form:

$$h : \langle E, G \rangle \to A \;\mid\; E_{A} \to G_{E}$$
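One way to make the toy formalism concrete is the Python sketch below. Its encodings are assumptions, not the post's: $E$ is a hypothetical finite table that maps each action $A$ to the distribution over resulting states $E_A$, and $G$ is a predicate on states. Under those assumptions, $h$ simply returns the action whose $E_A$ is most likely to satisfy $G$.

```python
from typing import Callable, Dict, List, Tuple

State = str
Action = str
# Assumed encoding of E: each action A maps to the distribution over the
# resulting states E_A, as (probability, state) pairs.
Environment = Dict[Action, List[Tuple[float, State]]]
# Assumed encoding of G: a predicate saying whether a state achieves the goal.
Goal = Callable[[State], bool]

def p_goal(env: Environment, goal: Goal, action: Action) -> float:
    """Probability that the post-action environment E_A satisfies G."""
    return sum(p for p, s in env[action] if goal(s))

def heuristic(env: Environment, goal: Goal) -> Action:
    """h : <E, G> -> A, computed from E and G alone (no training scenarios)."""
    return max(env, key=lambda a: p_goal(env, goal, a))

# Hypothetical example: which path keeps you dry?
env: Environment = {
    "left": [(0.9, "wet"), (0.1, "dry")],
    "right": [(0.2, "wet"), (0.8, "dry")],
}
stay_dry: Goal = lambda s: s == "dry"
chosen = heuristic(env, stay_dry)  # "right"
```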

The question is: what information is necessary for computing $h$? Clearly you need to know $E$ and $G$: the structure of the environment and what you're trying to do there. But is there anything else?

The LPE view says yes: you also need a set of "training scenarios" $S = \{E_{A_1}, \ldots, E_{A_n}\}$, where the results of taking various actions $A_i$ on the environment are shown. Not because you need to learn the environment's structure; we're already assuming it's known. No, you need them because... because...

Perhaps I'm failing the Ideological Turing Test (ITT) here, but I think the argument just breaks down at this step, in a way that can't be patched. It seems clear to me that $E$ itself is entirely sufficient to compute $h$, essentially by definition. If heuristics are statistical correlations, knowing the statistical model of the environment should be sufficient to generate them!

Toy-formally, $P(h \mid E \cdot S) = P(h \mid E)$. Once the environment's structure is known, you gain no additional information from playing around with it.
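Here is a toy Python sketch of that equality under illustrative assumptions: `ENV` is a hypothetical three-action environment in which each action succeeds with a fixed, known probability. `h_from_model` reads the heuristic straight off $E$. The LPE-style `h_from_scenarios` must sample a training set $S$ and estimate success rates from it, which amounts to reconstructing $E$.

```python
import random
from collections import defaultdict

# Hypothetical known environment E: probability that each action succeeds.
ENV = {"a": 0.3, "b": 0.7, "c": 0.5}

def h_from_model(env):
    """Heuristic read directly off the known structure E."""
    return max(env, key=env.get)

def sample_scenarios(env, n, seed=0):
    """Training scenarios S: (action, succeeded) pairs from acting in env."""
    rng = random.Random(seed)
    actions = sorted(env)
    scenarios = []
    for _ in range(n):
        a = rng.choice(actions)
        scenarios.append((a, rng.random() < env[a]))
    return scenarios

def h_from_scenarios(scenarios):
    """LPE-style heuristic: estimate each action's success rate from S
    (i.e., reconstruct E), then pick the best-looking action."""
    wins, tries = defaultdict(int), defaultdict(int)
    for a, ok in scenarios:
        tries[a] += 1
        wins[a] += ok
    return max(tries, key=lambda a: wins[a] / tries[a])

S = sample_scenarios(ENV, 10_000)
```

With enough samples the two functions agree, because $S$ contributes nothing beyond an estimate of $E$. With only a handful of samples, the LPE estimate may pick the wrong action.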

If your understanding is incomplete, sure, you may gain an additional appreciation of the environment's dynamics by running mental simulations. But then you're running them to figure out the environment's structure, not because a training set is absolutely necessary.

Concretely:

Imagine that your knowledge of tic-tac-toe was erased, and now you're introduced to the game's rules anew. You'll likely instantly infer that taking the center square is a pretty good starting move, because it maximizes optionality^[3]. To make that inference, you won't need to run mental games against imaginary opponents, in which you'll start out by making random moves. It'll be clear to you at a glance.
Imagine that someone told you a number of simple but novel mathematical theorems, in a domain you're familiar with. Would you try to learn how to use them by generating random strings of mathematical symbols and seeing whether a given random string constitutes a valid application of one of the theorems? I expect not: rather, you'll be able to instantly "slot" them into the domain's structure, track their implications, draw associations. You may then still "play around" with them, but the bulk of the work will have already been done.
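The tic-tac-toe intuition can be computed directly from the rules, with no simulated games. One reading of "maximizes optionality" is to count how many of the eight winning lines pass through each square. That reading is an interpretation for illustration, not the post's own formalism.

```python
from itertools import product

# The 8 winning lines of tic-tac-toe, each a list of (row, col) squares.
LINES = (
    [[(r, c) for c in range(3)] for r in range(3)]                  # rows
    + [[(r, c) for r in range(3)] for c in range(3)]                # columns
    + [[(i, i) for i in range(3)], [(i, 2 - i) for i in range(3)]]  # diagonals
)

def lines_through(square):
    """Number of winning lines that a given square participates in."""
    return sum(square in line for line in LINES)

counts = {sq: lines_through(sq) for sq in product(range(3), repeat=2)}
best_opening = max(counts, key=counts.get)  # (1, 1), the center
```

The center lies on 4 lines, each corner on 3, and each edge on 2. The answer falls out of the game's structure alone, which is the "clear at a glance" inference described above.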

Figuring out good environmental heuristics does not strictly require a training set, only the knowledge of the environment's structure.

Why Are Humans Tempted to Think Otherwise?

Two reasons:

The first is that in many practical cases, LPE is the most cost-efficient way to learn an environment's structure. Even in my very simple tic-tac-toe example, momentary abstract reasoning only yielded a "pretty good" move. In practical cases, the situation is even worse: we're not given the game's rules on a silver platter, and can only back-infer them from studying how things tend to play out.

The second is that our System 1 (which implements quick heuristics) is faster and allocated more compute than System 2 (which does abstract reasoning), owing to the fact that general intelligence is a novel evolutionary adaptation. Thus, "solving" environments abstractly is more time-consuming than just running out and refining our LPE heuristics against them, and the resultant algorithms run slower. (And that often makes them useless: consider trying to use System 2 to coordinate muscle movements in a brawl.)

This creates the illusion that LPE is the only thing that works. It is, however, an illusion:

As I'd mentioned, we often apply non-LPE-based environment-solving to constrain the space of heuristics over which we search, as in the tic-tac-toe and math examples. Indeed, it seems that scientific research would be impossible without that.
LPE-based learning does not work in domains where failure is lethal, by definition. However, we have some success navigating them anyway.
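Here is a toy Python illustration of the lethal-domain point. Its assumptions are hypothetical: there are three actions, one of them is lethal, and the trial-and-error agent explores uniformly at random. Such an agent survives $k$ random trials with probability $(2/3)^k$, whereas an agent that knows the environment's structure never takes the lethal action at all.

```python
import random

# Hypothetical environment: three actions, one of which is lethal.
OUTCOMES = {"wait": "safe", "walk": "safe", "cliff": "dead"}

def explorer_survival_rate(trials, episodes=10_000, seed=0):
    """Fraction of trial-and-error agents that survive `trials` uniformly
    random actions. A lethal outcome ends the run, so it can't be learned from."""
    rng = random.Random(seed)
    actions = list(OUTCOMES)
    survived = 0
    for _ in range(episodes):
        if all(OUTCOMES[rng.choice(actions)] != "dead" for _ in range(trials)):
            survived += 1
    return survived / episodes

def planner_action():
    """Agent that knows the environment's structure: it avoids any action
    whose known outcome is lethal, without ever trying it."""
    return next(a for a, outcome in OUTCOMES.items() if outcome != "dead")

rate = explorer_survival_rate(trials=5)  # expected to be near (2/3) ** 5, roughly 0.13
```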

LPE is a specific method of deriving a certain type of statistical correlations from the environment, and it only works if it's given a set of training examples as an input. But it's not the only method — merely one that's most applicable in the regime in which we've been operating up to this point.

What about superintelligent AGIs, then? By the definition of being "superintelligent", they'd have more resources allocated to their general-intelligence module/System-2 equivalent. Thus, they'd be natively better at solving environments abstractly, "without experience".

Takeaways

The LPE view holds that merely knowing the structure of some domain is not enough to learn how to navigate it. You also need to do some trial-and-error in it to arrive at the necessary heuristics.^[4]

I claim that this is false: there are algorithms that allow learning without experience, and indeed, one such algorithm is the cornerstone of "general intelligence".

If true, this should negate the initial statements:

It is, in fact, possible to make strong predictions about OOD events like AGI Ruin — if you've studied the problem exhaustively enough to infer its structure despite lacking the hands-on experience. By the same token, it should be possible to solve the problem in advance, without creating it first.

And an AGI, by dint of being superintelligent, would be very good at this sort of thing: at generalizing to domains it hasn't been trained on, like social manipulation, or even to entirely novel ones, like nanotechnology, then successfully navigating them on the first try.

Much like the existence vs. nonexistence of general intelligence, the degree of importance ascribed to LPE seems to be one of the main causes of divergence in people's P(doom) estimates.

^[1] Put another way, it says that babble-and-prune is the only general-purpose planning method possible: stochastically generate candidate solutions, prune them, and repeat until arriving at a good-enough solution.
^[2] Also, here's a John Wentworth post that addresses the babble-and-prune framing in particular.
^[3] And it's indeed a pretty good move, much better than random, if not the optimal one.
^[4] Indeed, some people ascribe some truly mythical importance to that process.

"Here are some views, often held in a cluster:"

I'm not sure exactly which clusters you're referring to, but I'll just assume that you're pointing to something like "people who aren't very into the sharp left turn and think that iterative, carefully bootstrapped alignment is a plausible strategy." If this isn't what you were trying to highlight, I apologize. The rest of this comment might not be very relevant in that case.

To me, the views you listed here feel like a straw man or weak man of this perspective.

Furthermore, I think the actual crux is more often "prior to having to align systems that are collectively much more powerful than humans, we'll only have to align systems that are somewhat more powerful than humans." This is essentially the crux you highlight in "A Case for the Least Forgiving Take On Alignment". I believe disagreements about hands-on experience are quite downstream of this crux: I don't think people with reasonable views (not weak men) believe that "without prior access to powerful AIs, humans will need to align AIs that are vastly, vastly superhuman, but this will be fine because these AIs will need lots of slow, hands-on experience in the world to do powerful stuff (like nanotech)."

So, discussing how well superintelligent AIs can operate from first principles seems mostly irrelevant to this discussion (if by superintelligent AI, you mean something much, much smarter than the human range).

I would be more sympathetic if you made a move like, "I'll accept continuity through the human range of intelligence, and that we'll only have to align systems as collectively powerful as humans, but I still think that hands-on experience is only..." In particular, I think there is a real disagreement about the relative value of experimenting on future dangerous systems instead of working on theory or trying to carefully construct analogous situations today by thinking in detail about alignment difficulties in the future.

I largely agree with the general point that I think this post is making, which I would summarize in my own words as: iteration-and-feedback cycles, experimentation, experience, trial-and-error, etc. (LPE, in your terms) are sometimes overrated in importance and necessity. This over-emphasis is particularly common among those who have an optimistic view on solving the alignment problem through iterative experimentation.

I think the degree to which LPE is actually necessary for solving problems in any given domain, as well as the minimum time and resources needed to obtain such LPE and the general tractability of doing so, is an empirical question that people frequently investigate for particular important domains.

Differing intuitions about how important LPE is in general, and how tractable it is to obtain, seem like an important place to look for cruxes between worldviews. I wrote a bit more about this in a recent post, and commented on one of the empirical investigations to which my post is partly a response. As I said in the comment, I find such investigations interesting and valuable as a matter of furthering scientific understanding about the limits of the possible, but pretty futile as attempts to bound the capabilities of a superintelligence. I think your post is a good articulation of one reason why I find these arguments so uncompelling.