"Building an AI that doesn't game your specifications" is the actual "alignment question" we should be doing research on.
Ok, it sounds to me like you're saying:
"When you train ML systems, they game your specifications because the training dynamics are too dumb to infer what you actually want. We just need One Weird Trick to get the training dynamics to Do What You Mean Not What You Say, and then it will all work out, and there's not a demon that will create another obstacle given that you surmounted this one."
That is, training processes are not neutral; there are the bad training processes that we have now (or had before the recent positive developments), and eventually there will be good training processes that create aligned-by-default systems.
Is this roughly right, or am I misunderstanding you?
If you created a misaligned AI, then it would be "thinking back", and you'd be in an adversarial position where security mindset is appropriate.
Cool, we agree on this point.
my point in that section is that the fundamental laws governing how AI training processes work are not "thinking back". They're not adversaries.
I think we agree here on the local point but disagree on its significance to the broader argument. [I'm not sure how much we agree--I think of training dynamics as 'neutral', but I also think of them as searching over program-space to find a program that performs well on a (loss function, training set) pair, and so you need to be reasoning about search. But I think we agree the training dynamics are not trying to trick you / be adversarial, and are instead straightforwardly 'trying' to make Number Go Down.]
In my picture, we have the neutral training dynamics paired with the (loss function, training set) which creates the AI system, and whether the resulting AI system is adversarial or not depends mostly on the choice of (loss function, training set). It seems to me that we probably have a disagreement about how much of the space of (loss function, training set) leads to misaligned vs. aligned AI (if it hits 'AI' at all), where I think aligned AI is a narrow target to hit that most loss functions will miss, and hitting that narrow target requires security mindset.
To explain further, it's not that the (loss function, training set) is thinking back at you on its own; it's that the AI that's created by training is thinking back at you. So before you decide to optimize X, you need to check whether you actually want something that's optimizing X, or whether you need to optimize for Y instead.
So from my perspective it seems like you need security mindset in order to pick the right inputs to ML training to avoid getting misaligned models.
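To make the 'neutral dynamics, non-neutral inputs' point concrete, here's a minimal toy sketch (hypothetical code, not anyone's actual setup): the same gradient-descent loop is run with two different loss functions, and which behavior you end up with is determined entirely by that choice, not by anything the optimizer 'wants'.

```python
import numpy as np

def train(loss_grad, data, steps=1000, lr=0.1):
    """Plain gradient descent: straightforwardly 'trying' to make Number Go Down."""
    theta = np.zeros(2)
    for _ in range(steps):
        theta -= lr * loss_grad(theta, data)
    return theta

data = np.array([[1.0, 2.0], [2.0, 3.9], [3.0, 6.1]])  # toy (x, y) pairs

# Loss A: fit y = theta[0] * x + theta[1] (what we actually wanted).
def grad_fit(theta, d):
    x, y = d[:, 0], d[:, 1]
    err = theta[0] * x + theta[1] - y
    return np.array([np.mean(err * x), np.mean(err)])

# Loss B: a mis-specified proxy that just rewards large predictions.
def grad_proxy(theta, d):
    x = d[:, 0]
    return np.array([-np.mean(x), -1.0])

print(train(grad_fit, data))    # lands near the intended line, roughly [2, 0]
print(train(grad_proxy, data))  # climbs without bound toward whatever the proxy rewards
```

The loop itself is identical in both runs; all of the work that calls for security mindset is in deciding which loss to hand it.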
The only sense in which it's clear that it's "for personal gain" is that it's lying to get what you want.
Sure, I'm with you that far--but if what someone wants is [a wonderful future for everyone], then that's hardly what most people would describe as "for personal gain".
If Alice lies in order to get influence, with the hope of later using that influence for altruistic ends, it seems fair to call the influence Alice gets 'personal gain'. After all, it's her sense of altruism that will be promoted, not a generic one.
- Are MIRI people claiming that if, say, a very moral and intelligent human became godlike while preserving their moral faculties, that they would destroy the world despite, or perhaps because of, their best intentions?
For me, the answer here is "probably yes"; I think there is some bar of 'moral' and 'intelligent' where this doesn't happen, but I don't feel confident about where it is.
I think there are two things that I expect to be big issues, and probably more I'm not thinking of:
But uploaded and enhanced humans aren't going to have superhuman moral judgement. How does this strategy interact with the claim that we need far better-than-human moral judgement to avoid a catastrophe?
I understand Eliezer's position to be that 1) intelligence helps with moral judgment and 2) it's better to start with biological humans than whatever AI design is best at your intelligence-related subtask, but also that intelligence amplification is dicey business and this is more like "the least bad option" than one that seems actively good.
Like, we have some experience inculcating moral values in humans, which will probably generalize better to augmented humans than to AIs; but also I think Eliezer is more optimistic (for timing reasons) about amplifications that can be done to adult humans.
ETA: in Eliezer's AGI ruin post, he says,
Yeah, my interpretation of that is "if your target is the human level of wisdom, it will destroy humans just like humans are on track to do." If someone is thinking "will this be as good as the Democrats being in charge or the Republicans being in charge?" they are not grappling with the difficulty of successfully wielding futuristically massive amounts of power.
I claim that GPT-4 is already pretty good at extracting preferences from human data.
So this seems to me like the crux. I agree with you that GPT-4 is "pretty good", but I think the standard necessary for things to go well is substantially higher than "pretty good", and that's where the difficulty arises as we apply higher and higher levels of capability and influence to the environment. My guess is Eliezer, Rob, and Nate feel basically the same way.
Basically, I think your later section--"Maybe you think"--is pointing in the right direction, and requiring a standard of moral judgment much higher than human-level is reasonable and consistent with the explicit standard set by essays from Yudkowsky and other MIRI people. CEV was about this; talk about philosophical competence or metaphilosophy was about this. "Philosophy with a deadline" would be a weird way to put it if you thought contemporary philosophy was good enough.
But gradient descent will still change the way that the system interprets things in its data storage, right?
I guess part of the question here is whether gradient descent will even scale to AutoGPT-like systems. You're probably not going to be able to differentiate through your external notes / other changes you could make to your environment.
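As a minimal illustration of that non-differentiability point (assuming PyTorch; a toy stand-in, not a description of any real AutoGPT setup): once the model's output is discretized into a token that gets written to an external scratchpad, the backward pass stops there.

```python
import torch

# Toy "model output": logits over a tiny vocabulary.
logits = torch.randn(1, 10, requires_grad=True)

# Write to external notes: discretize to a concrete token.
token = torch.argmax(logits, dim=-1)  # integer token; non-differentiable step

# Later step: re-read the note by embedding the stored token.
reread = torch.nn.Embedding(10, 4)(token)
loss = reread.sum()
loss.backward()

print(logits.grad)  # None -- no gradient flows back through the written note
```

So gradient descent can shape how the model reads and reacts to such notes, but it can't directly push on what got written to them.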
Gradient hacking intuitively seems harder still. The preconditions for it seem to be something like “the preconditions for deceptive alignment, plus the AI figuring out some effective maneuver to execute with the design of its own brain.”
It seems to me that the main difficulty is storing your reference policy somewhere where the gradient can't touch it (even indirectly). Does anyone have a story of how that happens?
I think there's a trilemma in updating CAIS-like systems to the foundational model world, which comes down to: who is doing the business development?
I came up with three broad answers (noting reality will possibly be a mixture):
[In pre-foundational-model CAIS, the answer was obviously 3--every business procures its own AI tools to accomplish particular functions, and there's no 'central planning' for computer science.]
I don't think 1 is CAIS, or if it is, then I don't see the daylight between CAIS and good ol' sovereign AI. You gradually morph from the economy as it is now to central planning via AGI, and I don't think you even have much guarantee that it's human-overseen or follows the relevant laws.
I think 2 has trouble being comprehensive. There are ten thousand use cases for AI; the AI company has to be massive to have a product for all of them (or be using the AI to do most of the work, in which case we're degenerating into case 1), and then it suffers from internal control problems. (This degenerates into case 3, where individual product teams are like firms and the company that made the AI is like the government.)
I think 3 has trouble being non-agentic and peaceful. Even with GPT-4, people are trying to set it up to act autonomously. I think the Drexlerian response here is something like:
Yes, but why expect them to succeed? When someone tells GPT-4 to make money for them, it'll attempt to deploy some standard strategy, which will fail because a million other people are trying the exact same thing, or will only get them an economic rate of return ("put your money in index funds!"). Only in situations where the human operators have a private edge on the rest of the economy (like having a well-built system targeting an existing vertical that the AI can slot into, or pre-existing tooling able to orient to the frontier of human knowledge, etc.) will you get an AI system with a private edge against the rest of the economy, and it'll be overseen by humans.
My worry here mostly has to do with the balance between offense and defense. If foundational-model-enabled banking systems are able to detect fraud as easily as foundational-model-enabled criminals are able to create fraud, then we get a balance like today's and things are 'normal'. But it's not obvious to me that this will be the case (especially in sectors where crime is better resourced than police are, or sectors where systems are difficult to harden).
That said, I do think I'm more optimistic about the foundational model version of CAIS (where there can be some centralized checks on what the AI is doing for users) than the widespread AI development version.
However, after looking back on it more than four years later, I think the general picture it gave missed some crucial details about how AI will go.
I feel like this is understating things a bit.
In my view (Drexler probably disagrees?), there are two important parts of CAIS:
I think a 'foundation model' world probably wrecks both. I think they might be recoverable--and your post goes some of the way to making that visualizable to me--but it still doesn't seem like the default outcome.
[In particular, I like the point that models with broad world models can still have narrow responsibilities, and I think that likely makes them safer, at least in the medium term. Having one global moral/law-abiding foundational AI model that many people then slot into their organizations seems way better than everyone training whatever AI model they need for their use case.]
While the coauthors broadly agree about the points listed in the post, I wanted to stick my neck out a bit more and assign some numbers to one of the core points. I think on present margins, voluntary restraint slows down capabilities progress by at most 5% while probably halving safety progress, and this doesn't seem like a good trade. [The numbers seem like they were different in the past, but the counterfactuals here are hard to estimate.] I think if you measure by the number of people involved, the effect of restraint looks substantially smaller; here I'm assuming that the people most interested in AI safety are disproportionately focused on the sorts of research directions I think could be transformative, and so have an outsized impact per person.
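Spelling out that arithmetic as a back-of-the-envelope sketch (the 5% and 50% figures are the rough guesses above, not measurements):

```python
# Illustrative arithmetic for the trade described above.
capabilities_rate = 1 - 0.05  # capabilities progress under restraint (5% slowdown)
safety_rate = 1 - 0.50        # safety progress under restraint (~halved)

# Safety progress per unit of capabilities progress, relative to no restraint:
relative_ratio = safety_rate / capabilities_rate
print(round(relative_ratio, 2))  # ~0.53 -- safety per unit of capabilities roughly halves
```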