Gradient hacking intuitively seems harder still. The preconditions for it seem to be something like “the preconditions for deceptive alignment, plus the AI figuring out some effective maneuver to execute with the design of its own brain.”
It seems to me that the main difficulty is storing your reference policy somewhere where the gradient can't touch it (even indirectly). Does anyone have a story of how that happens?
I think there's a trilemma with updating CAIS-like systems to the foundational model world, which is: who is doing the business development?
I came up with three broad answers (noting reality will possibly be a mixture):
[In pre-foundational-model CAIS, the answer was obviously 3--every business procures its own AI tools to accomplish particular functions, and there's no 'central planning' for computer science.]
I don't think 1 is CAIS, or if it is, then I don't see the daylight between CAIS and good ol' sovereign AI. You gradually morph from the economy as it is now to central planning via AGI, and I don't think you even have much guarantee that it's human-overseen or follows the relevant laws.
I think 2 has trouble being comprehensive. There are ten thousand use cases for AI; the AI company has to be massive to have a product for all of them (or be using the AI to do most of the work, in which case we're degenerating into case 1), and then it suffers from internal control problems. (This degenerates into case 3, where individual product teams are like firms and the company that made the AI is like the government.)
I think 3 has trouble being non-agentic and peaceful. Even with GPT-4, people are trying to set it up to act autonomously. I think the Drexlerian response here is something like:
Yes, but why expect them to succeed? When someone tells GPT-4 to make money for them, it'll attempt to deploy some standard strategy, which will fail because a million other people are trying the exact same thing, or will only get them an economic rate of return ("put your money in index funds!"). Only in situations where the human operators have a private edge on the rest of the economy (like having a well-built system targeting an existing vertical that the AI can slot into, or having pre-existing tooling able to orient to the frontier of human knowledge, etc.) will you get an AI system with a private edge against the rest of the economy, and it'll be overseen by humans.
My worry here mostly has to do with the balance between offense and defense. If foundational-model-enabled banking systems are able to detect fraud as easily as foundational-model-enabled criminals are able to create fraud, then we get a balance like today's and things are 'normal'. But it's not obvious to me that this will be the case (especially in sectors where crime is better resourced than police are, or sectors where systems are difficult to harden).
That said, I do think I'm more optimistic about the foundational model version of CAIS (where there can be some centralized checks on what the AI is doing for users) than the widespread AI development version.
However, after looking back on it more than four years later, I think the general picture it gave missed some crucial details about how AI will go.
I feel like this is understating things a bit.
In my view (Drexler probably disagrees?), there are two important parts of CAIS:
I think a 'foundation model' world probably wrecks both. I think they might be recoverable--and your post goes some of the way to making that visualizable to me--but it still doesn't seem like the default outcome.
[In particular, I like the point that models with broad world models can still have narrow responsibilities, and I think that probably makes them more likely to be safe, at least in the medium term. Having one global moral/law-abiding foundational AI model that many people then slot into their organizations seems way better than everyone training whatever AI model they need for their use case.]
Then the model can safely scale.
If there are experiences which will change the model without leading to less of the initial good values, then yeah, for an approximate definition of safety. You're resting everything on the continued strength of this model as capabilities increase, and so if it fails before you top out the scaling I think you probably lose.
FWIW I don't really see your description as, like, a specific alignment strategy so much as the strategy of "have an alignment strategy at all". The meat is all in 1) how you identify the core of human values and 2) how you identify which experiences will change the system to have less of the initial good values, but, like, figuring out the two of those would actually solve the problem!
So I definitely think that's something weirdly unspoken about the argument; I would characterize it as Eliezer saying "suppose I'm right and they're wrong; all this requires is things to be harder than people think, which is usual. Suppose instead that I'm wrong and they're right; this requires things to be easier than people think, which is unusual." But the equation of "people" and "Eliezer" is sort of strange; as Quintin notes, it isn't that unusual for outside observers to overestimate difficulty, and so I wish he had centrally addressed the reference class tennis game: is the expertise "getting AI systems to be capable" or "getting AI systems to do what you want"?
FWIW, I thought the bit about manifolds in The difficulty of alignment was the strongest foot forward, because it paints a different detailed picture than your description that it's responding to.
That said, I don't think Quintin's picture obviously disagrees with yours (as discussed in my response over here) and I think you'd find it disappointing for him to call your description extremely misleading while not seeming to correctly identify the argument structure and check whether there's a related one that goes thru on his model.
During this process, I don’t think it’s particularly unusual for the person to notice a technical problem but overlook a clever way to solve that problem.
I think this isn't the claim; I think the claim is that it would be particularly unusual for someone to overlook that they're accidentally solving a technical problem. (It would be surprising for Edison to not be thinking hard about what filament to use and pick tungsten; in actual history, it took decades for that change to be made.)
BTW I do agree with you that Eliezer’s interview response seems to suggest that he thinks aligning an AGI to “basic notions of morality” is harder and aligning an AGI to “strawberry problem” is easier. If that’s what he thinks, it’s at least not obvious to me.
My sense (which I expect Eliezer would agree with) is that it's relatively easy to get an AI system to imitate the true underlying 'basic notions of morality', to the extent humans agree on that, but that this doesn't protect you at all as soon as you want to start making large changes, or as soon as you start trying to replace specialist sectors of the economy. (A lot of ethics for doctors has to do with the challenges of simultaneously being a doctor and a human; those ethics will not necessarily be relevant for docbots, and the question of what they should be instead is potentially hard to figure out.)
So if you're mostly interested in getting out of the acute risk period, you probably need to aim for a harder target.
But gradient descent will still change the way that the system interprets things in its data storage, right?
I guess part of the question here is whether gradient descent will even scale to AutoGPT-like systems. You're probably not going to be able to differentiate thru your external notes / other changes you could make to your environment.
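To make the differentiability point concrete, here is a toy sketch in plain Python. It is only an illustration under my own assumptions, not a claim about how any real AutoGPT-like system is trained: `Dual` is a hypothetical minimal forward-mode autodiff value, and the "note" is just a string standing in for external storage. The same intermediate value is either kept inside the computation or written out and read back, and only the first version gets credit for it in the gradient.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dual:
    """Hypothetical forward-mode autodiff value: a number plus d(value)/d(theta)."""
    val: float
    grad: float = 0.0

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(float(other))
        # Product rule.
        return Dual(self.val * other.val,
                    self.grad * other.val + self.val * other.grad)

    __rmul__ = __mul__


def loss_in_graph(theta: float) -> Dual:
    """Everything stays inside the differentiable computation."""
    t = Dual(theta, 1.0)   # seed: d(theta)/d(theta) = 1
    hidden = 3.0 * t       # intermediate value, still carries a derivative
    return hidden * t      # loss = 3 * theta^2


def loss_via_notes(theta: float) -> Dual:
    """The intermediate value is written to an external note and read back."""
    t = Dual(theta, 1.0)
    hidden = 3.0 * t
    note = f"{hidden.val}"        # stand-in for external notes: text, no derivative
    recalled = Dual(float(note))  # reading it back yields a constant
    return recalled * t           # same loss value, but the note path has no gradient


in_graph = loss_in_graph(2.0)
via_notes = loss_via_notes(2.0)
print(in_graph.val, in_graph.grad)
print(via_notes.val, via_notes.grad)
```

Both versions compute the same loss, but the gradient for the second only reflects the direct dependence on `theta`; whatever the system "decided" to write into the note is invisible to the update. Whether that counts as storage the gradient can't touch even indirectly is a separate question, since the weights that read and interpret the note are still being updated.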