Matthew "Vaniver" Graves

I want to point out that I think the typical important case looks more like "wanting to do things for unusual reasons," and if you're worried about this approach breaking down there that seems like a pretty central obstacle. For example, suppose rather than trying to maintain a situation (the diamond stays in the vault) we're trying to extrapolate (like coming up with a safe cancer cure). When looking at a novel medication to solve an unsolved problem, we won't be able to say "well, it cures the cancer for the normal reason" because there aren't any positive examples to compare to (or they'll be identifiably different).

It might still work out, because when we ask "is the patient healthy?" there is something like "the normal reason" there. [But then maybe it doesn't work for Dyson Sphere designs, or so on.]

Is this saying "if model performance is getting better, then maybe it will have a sharp left turn, and if model performance isn't getting better, then it won't"?

In particular, all of the RLHF work is basically capabilities work which makes alignment harder in the long term (because it directly selects for deception), while billing itself as "alignment".

I share your opinion of RLHF work but I'm not sure I share your opinion of its consequences. For situations where people don't believe arguments that RLHF is fundamentally flawed because they're too focused on empirical evidence over arguments, the generation of empirical evidence that RLHF is flawed seems pretty useful for convincing them! 

This might imply a predictive circuit for predicting the output of the antecedent-computation-reinforcer, but I don't see why it implies internal reward-orientation motivational edifices.

Sorry, if I'm reading this right, we're hypothesizing internal reward-orientation motivational edifices, and then asking the question of whether or not policy gradients will encourage them or discourage them. Quintin seems to think "nah, it needs to take an action before that action can be rewarded", and my response is "wait, isn't this going to be straightforwardly encouraged by backpropagation?"

[I am slightly departing from Wei_Dai's hypothetical in my line of reasoning here, as Wei is mostly focused on asking "don't you expect this to come about in an introspective-reasoning powered way?" and I'm mostly focused on asking "if this structure is present in the model initialization as one of the lottery tickets, won't policy gradient encourage it?".]

I think that's assuming there's a feature-direction "care more about reward" which isn't already gradient-starved by shallower proxies learned earlier in training. In my ontology, this corresponds to "thinking thoughts about reward in order to get reward."

Cool, this feels like a real reason, but also substantially more contingent. Naively, I would expect that you could construct a training schedule such that 'care more about reward' is encouraged, and someone will actually try to do this (as part of making a zero-shot learner in RL environments).

If this argument works, why doesn't it go through for people? (Not legibly a knockdown until we check that the mechanisms are sufficiently similar, but it's at least a sanity check. I think the mechanisms are probably sufficiently similar, though.)

I think we have some pre-existing disagreement about what we should conclude from human heroin addicts; you seem to think "yeah, it only happens sometimes" whereas my view is something more like "fuck, it happens sometimes". Like, the thing where people don't do heroin because they've heard other people downvote heroin addiction is not a strategy that scales to superintelligence.

For fixed-duration reasoning chains, or in situations where the AI is trying to facilitate cooperation between different passes (like in the proposed scenario where it gets information but then loses access to it), this seems important. For example, in a situation where you show the model a passage, let it make notes, and then have the same model look just at the notes and answer a reading comprehension question, I expect it will likely want to embed information in the null space of the notes. [Or in an HCH-style situation, it might want to embed the original question into subquestions, and might do so steganographically, in a way that breaks some of the things one might be hoping for there.]
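
[As a concrete sketch of the two-pass setup I have in mind; the `call_model` function below is a hypothetical stand-in for whatever model interface is actually used. The structural point is that the second pass only ever sees the notes, so any information the model wants to carry across passes has to fit inside, or be hidden inside, the notes it writes in the first pass.]

```python
# Minimal sketch of the two-pass "notes" setup described above.
# `call_model` is a hypothetical stand-in for an LLM interface; the key
# structural fact is that pass 2 sees only the notes, never the passage,
# so any information the model wants to carry forward must fit in (or be
# hidden inside) the notes it writes in pass 1.

def call_model(prompt: str) -> str:
    """Hypothetical LLM call; replace with the actual model interface."""
    raise NotImplementedError

def answer_with_notes(passage: str, question: str, max_note_chars: int = 500) -> str:
    # Pass 1: the model sees the passage and writes bounded-length notes.
    notes = call_model(
        f"Read the passage and write notes (<= {max_note_chars} chars) that "
        f"would let someone answer questions about it later.\n\n{passage}"
    )[:max_note_chars]

    # Pass 2: the model sees ONLY the notes (not the passage) plus the question.
    # If training rewards final-answer accuracy, optimization pressure acts on
    # what the notes contain, which is where the incentive for steganographic
    # encoding could enter.
    return call_model(f"Notes: {notes}\n\nQuestion: {question}\nAnswer:")
```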

But I'm somewhat confused about whether you should expect it to crop up in recursive prompting cases (probably you don't?). If you secretly know the answer in the first pass, don't you want to just say the answer and terminate? If you secretly know a relevant fact in the first pass, don't you want to openly express it and recur? I don't see how steganography is rewarded over openness. [I'm setting aside the case where it invents a new language to express non-linguistic concepts more cheaply, and assuming we're in the natural language setting where standard language is the 'natural' way to transmit the information.]

so in pure non-lookahead (e.g. model free) sample-based policy gradient estimation, an action which has never been tried can not be reinforced (except as a side effect of generalisation by function approximation).

This is the bit I don't believe, actually. [Or at least don't think is relevant.] Note that in Wei_Dai's hypothetical, the neural net architecture has a particular arrangement such that "how much it optimizes for reward" is either directly or indirectly implied by the neural network weights. [We're providing the reward as part of its observations, and so if nothing else the weights from that part of the input vector to deeper in the network will be part of this, but the actual mechanism is going to be more complicated for one that doesn't have access to that.]
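
[A toy sketch of that bracketed point, with arbitrary names and sizes: if the reward is part of the observation vector, then there is literally a column of weights carrying that input into the rest of the network, and the gradient update treats those weights like any other parameter.]

```python
# Toy sketch: when reward is included in the observation, there are weights
# from that input channel into the rest of the network, and nothing in the
# policy-gradient update treats them specially. Sizes are arbitrary.
import numpy as np

rng = np.random.default_rng(0)
obs_dim, hidden, n_actions = 8, 16, 4

W1 = rng.normal(size=(hidden, obs_dim + 1))   # +1 column for the reward input
W2 = rng.normal(size=(n_actions, hidden))

def logits(obs, last_reward):
    x = np.concatenate([obs, [last_reward]])  # reward enters like any feature
    return W2 @ np.tanh(W1 @ x)

# W1[:, -1] is the weight vector from the reward signal into the network;
# it is as trainable as every other parameter.
print(logits(rng.normal(size=obs_dim), last_reward=1.0))
```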

Quintin seems to me to be arguing "if you actually follow the math, there isn't a gradient to that parameter," which I find surprising, and which seems easy to demonstrate by going thru the math. As far as I can tell, there is a gradient there, and it points in the direction of "care more about reward."

This doesn't mean that, by caring about reward more, it knows which actions in the environment cause more reward. There I believe the story that the RL algorithm won't be able to reinforce actions that have never been tried.

[EDIT: Maybe the argument is "but if it's never tried the action of optimizing harder for reward, then the RL algorithm won't be able to reinforce that internal action"? But that seems pretty strained and not very robust, as the first time it considers trying harder to get reward, it will likely get hooked.]
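
[A toy illustration of the part of the story I do believe, with arbitrary numbers: under a tabular softmax policy, a sample-based policy-gradient update can only push probability toward actions that were actually taken; an action that was never sampled only loses probability mass when returns are positive, and generalization through shared function-approximator features is the exception.]

```python
# Toy illustration: with a tabular softmax policy, a REINFORCE update pushes
# the logits of never-tried actions *down* when the sampled action's return
# is positive; only generalization via function approximation (shared
# features between actions) can raise an untried action.
import numpy as np

logits = np.zeros(3)                      # tabular policy over 3 actions
probs = np.exp(logits) / np.exp(logits).sum()

sampled_action, ret = 0, 1.0              # action 0 was tried, got return +1

# REINFORCE: grad of log pi(a_sampled) w.r.t. the logits is (one_hot - probs)
one_hot = np.eye(3)[sampled_action]
grad = ret * (one_hot - probs)            # [ 0.667, -0.333, -0.333]

print(grad)  # logits of the never-tried actions 1 and 2 are pushed down
```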

If the agent doesn’t explore in the direction of answering “good”, then there’s no gradient in that direction.

Wait, I don't think this is true? At least, I'd appreciate it being stepped thru in more detail.

In the simplest story, we're imagining an agent whose policy is $\pi_\theta$ and, for simplicity's sake, one parameter $\theta_R$ is a scalar that determines "how much to maximize for reward" and all the other parameters of $\theta$ store other things about the dynamics of the world / decision-making process.

It seems to me that the policy gradient $\nabla_\theta J(\theta)$ is obviously going to try to point $\theta_R$ in the direction of "maximize harder for reward".

In the more complicated story, we're imagining an agent whose policy is $\pi_\theta$, which involves how it manipulates both external and internal actions (and thus both external and internal state). One of the internal state pieces (let's call it $z_R$, playing the role $\theta_R$ played in the simple story) determines whether it selects actions that are more reward-seeking or not. Again I think it seems likely that $\nabla_\theta J(\theta)$ is going to try to adjust $\theta$ such that the agent selects internal actions that point $z_R$ in the direction of "maximize harder for reward".
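
[A minimal sketch of the simple story, with made-up dynamics and a finite-difference estimate standing in for the true policy gradient: $\theta_R$ gates how much the policy leans on its internal proxy for reward, the expected return is increasing in $\theta_R$, so the gradient on $\theta_R$ is positive, even though that gradient by itself says nothing about which external actions are high-reward.]

```python
# Minimal sketch of the "simple story": theta_R is a single scalar gating how
# much the policy leans on its (possibly crude) internal estimate of which
# actions are rewarding. Everything here is a toy model, not a claim about
# any particular architecture.
import numpy as np

true_reward = np.array([0.1, 0.9, 0.2])        # per-action reward (unknown to the agent)
internal_estimate = np.array([0.0, 1.0, 0.0])  # agent's learned proxy for reward

def policy(theta_R):
    # Mix a uniform "other concerns" distribution with a reward-greedy one.
    greedy = np.exp(5 * internal_estimate)
    greedy /= greedy.sum()
    uniform = np.ones(3) / 3
    w = 1 / (1 + np.exp(-theta_R))             # sigmoid gate: "how reward-seeking"
    return w * greedy + (1 - w) * uniform

def expected_return(theta_R):
    return policy(theta_R) @ true_reward

# Finite-difference estimate of dJ/d(theta_R): it is positive, i.e. the
# gradient pushes theta_R toward "maximize harder for reward".
eps = 1e-4
g = (expected_return(0.0 + eps) - expected_return(0.0 - eps)) / (2 * eps)
print(g, g > 0)  # positive, True
```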

What is my story getting wrong?

If you have a space with two disconnected components, then I'm calling the distinction between them "crisp."

The components feel disconnected to me in 1D, but I'm not sure they would feel disconnected in 3D or in ND. Is your intuition that they're 'durably disconnected' (i.e., even looking at the messy plan-space of the real world, we'll be able to make a simple classifier that rates corrigibility), or, if not, when does the connection come in (once you can argue about philosophy in way X, once you have uncertainty about your operator's preferences, once you have the ability to shut off or distract bits of your brain without other bits noticing, etc.)?

[This also feels like a good question for people who think corrigibility is anti-natural; do you not share Paul's sense that they're disconnected in 1D, or when do you think the difficulty comes in?]

I agree this list doesn't seem to contain much unpublished material, and I think the main value of having it in one numbered list is that "all of it is in one, short place"; it's not an intro to "computers can think" but instead "these are a bunch of the reasons why aligning thinking computers is difficult".

The thing that I understand to be Eliezer's "main complaint" is something like: "why does it seem like No One Else is discovering new elements to add to this list?". Like, I think Risks From Learned Optimization was great, and am glad you and others wrote it! But also my memory is that it was "prompted" instead of "written from scratch", and I imagine Eliezer reading it had more of a sense of "ah, someone made 'demons' palatable enough to publish" than of "ah, I am learning something new about the structure of intelligence and alignment."

[I do think the claim that Eliezer 'figured it out from the empty string' doesn't quite jibe with the Yudkowsky's Coming of Age sequence.]

Why is the process by which humans come to reliably care about the real world

This process seems pretty unreliable and fragile to me. Drugs are popular; video games are popular; people-in-aggregate put more effort into obtaining imaginary afterlives than into life extension or cryonics.

But also humans have a much harder time 'optimizing against themselves' than AIs will, I think. I don't have a great mechanistic sense of what it will look like for an AI to reliably care about the real world.
