All of Vaniver's Comments + Replies

The only sense in which it's clear that it's "for personal gain" is that it's lying to get what you want.
Sure, I'm with you that far - but if what someone wants is [a wonderful future for everyone], then that's hardly what most people would describe as "for personal gain".

If Alice lies in order to get influence, with the hope of later using that influence for altruistic ends, it seems fair to call the influence Alice gets 'personal gain'. After all, it's her sense of altruism that will be promoted, not a generic one.

This is not what most people mean by "for personal gain". (I'm not disputing that Alice gets personal gain)

Insofar as the influence is required for altruistic ends, aiming for it doesn't imply aiming for personal gain.
Insofar as the influence is not required for altruistic ends, we have no basis to believe Alice was aiming for it.

"You're just doing that for personal gain!" is not generally taken to mean that you may be genuinely doing your best to create a better world for everyone, as you see it, in a way that many would broadly endorse.

In this context, a... (read more)

  • Are MIRI people claiming that if, say, a very moral and intelligent human became godlike while preserving their moral faculties, that they would destroy the world despite, or perhaps because of, their best intentions?

For me, the answer here is "probably yes"; I think there is some bar of 'moral' and 'intelligent' where this doesn't happen, but I don't feel confident about where it is.

I think there are two things that I expect to be big issues, and probably more I'm not thinking of:

  • Managing freedom for others while not allowing for catastrophic risks; I thi
... (read more)

I claim that GPT-4 is already pretty good at extracting preferences from human data.

So this seems to me like it's the crux. I agree with you that GPT-4 is "pretty good", but I think the standard necessary for things to go well is substantially higher than "pretty good", and that's where the difficulty arises once we start applying higher and higher levels of capability and influence on the environment. My guess is Eliezer, Rob, and Nate feel basically the same way.

Basically, I think your later section--"Maybe you think"--is pointing in the right direction,... (read more)

5Rob Bensinger2mo
I don't think this is the crux. E.g., I'd wager the number of bits you need to get into an ASI's goals in order to make it corrigible is quite a bit smaller than the number of bits required to make an ASI behave like a trustworthy human, which in turn is way way smaller than the number of bits required to make an ASI implement CEV. The issue is that (a) the absolute number of bits for each of these things is still very large, (b) insofar as we're training for deep competence and efficiency we're training against corrigibility (which makes it hard to hit both targets at once), and (c) we can't safely or efficiently provide good training data for a lot of the things we care about (e.g., 'if you're a superintelligence operating in a realistic-looking environment, don't do any of the things that destroy the world'). None of these points require that we (or the AI) solve novel moral philosophy problems. I'd be satisfied with an AI that corrigibly built scanning tech and efficient computing hardware for whole-brain emulation, then shut itself down; the AI plausibly doesn't even need to think about any of the world outside of a particular room, much less solve tricky questions of population ethics or whatever.

So this seems to me like it's the crux. I agree with you that GPT-4 is "pretty good", but I think the standard necessary for things to go well is substantially higher than "pretty good", and that's where the difficulty arises once we start applying higher and higher levels of capability and influence on the environment. 

This makes sense to me. On the other hand - it feels like there's some motte and bailey going on here, if one claim is "if the AIs get really superhumanly capable then we need a much higher standard than pretty good", but then it's illustrated using examples like "think of how your AI might not understand what you meant if you asked it to get your mother out of a burning building".

0Matthew Barnett2mo
That makes sense, but I say in the post that I think we will likely have a solution to the value identification problem that's "about as good as human judgement" in the near future. Do you doubt that? If you or anyone else at MIRI doubts that, then I'd be interested in making this prediction more precise, and potentially offering to bet MIRI people on this claim. If MIRI people think that the problem here is that our AIs need to be more moral than even humans, then I don't see where MIRI people think the danger comes from on this particular issue, especially when it comes to avoiding human extinction. Some questions: * Why did Eliezer and Nate talk about stories like Mickey Mouse commanding a magical broom to fill a cauldron, and then failing because of misspecification, if the problem was actually more about getting the magical broom to exhibit superhuman moral judgement? * Are MIRI people claiming that if, say, a very moral and intelligent human became godlike while preserving their moral faculties, that they would destroy the world despite, or perhaps because of, their best intentions? * Eliezer has said on multiple separate occasions that he'd prefer that we try human intelligence enhancement or try uploading alignment researchers onto computers before creating de novo AGI. But uploaded and enhanced humans aren't going to have superhuman moral judgement. How does this strategy interact with the claim that we need far better-than-human moral judgement to avoid a catastrophe? I mostly saw CEV as an aspirational goal. It seems more like a grand prize that we could best hope for if we solved every aspect of the alignment problem, rather than a minimal bar that Eliezer was setting for avoiding human extinction. ETA: in Eliezer's AGI ruin post, he says,

But gradient descent will still change the way that the system interprets things in its data storage, right? 

I guess part of the question here is whether gradient descent will even scale to AutoGPT-like systems. You're probably not going to be able to differentiate thru your external notes / other changes you could make to your environment.

Gradient hacking intuitively seems harder still. The preconditions for it seem to be something like “the preconditions for deceptive alignment, plus the AI figuring out some effective maneuver to execute with the design of its own brain.”

It seems to me that the main difficulty is storing your reference policy somewhere where the gradient can't touch it (even indirectly). Does anyone have a story of how that happens?

2David Manheim4mo
A story of how that happens: In future (unsafe but) sophisticated systems, models will have access to external storage, and be given the ability to read/write. For example, AutoGPT-like systems will have shared scratchpads across instances of the model, to coordinate with themselves. It seems instrumentally convergent for such systems to store things related to their true / eventual goals in such places, for example, so that they can be referenced by local instances. And if we do some sort of oversight of that data storage, key facts that the model wants to preserve might be obfuscated. Of course, this is only one story, and I don't expect it to be the only way such things could happen, but it seems to be a reasonable candidate as a failure mode for systems designed with normal (read: insufficient, but far more than otherwise expected) levels of caution.

I think there's a trilemma with updating CAIS-like systems to the foundational model world, which is: who is doing the business development?

I came up with three broad answers (noting reality will possibly be a mixture):

  1. The AGI. This is like Sam Altman's old answer or a sovereign AI; you just ask it to look at the world, figure out what tasks it should do, and then do them.
  2. The company that made the AI. I think this was DeepMind's plan; come up with good AI algorithms, find situations where those algorithms can be profitably deployed, and then deploy them, m
... (read more)

However, after looking back on it more than four years later, I think the general picture it gave missed some crucial details about how AI will go.

I feel like this is understating things a bit.

In my view (Drexler probably disagrees?), there are two important parts of CAIS:

  1. Most sectors of the economy are under human supervision and iteratively reduce human oversight as trust is gradually built in engineered subsystems (which are understood 'well enough').
  2. The existing economy is slightly more sophisticated than what SOTA general AI can produce, and so there'
... (read more)

Then the model can safely scale.

If there are experiences which will change itself which don't lead to less of the initial good values, then yeah, for an approximate definition of safety. You're resting everything on the continued strength of this model as capabilities increase, and so if it fails before you top out the scaling I think you probably lose. 

FWIW I don't really see your description as, like, a specific alignment strategy so much as the strategy of "have an alignment strategy at all". The meat is all in 1) how you identify the core of human... (read more)

So I definitely think that's something weirdly unspoken about the argument; I would characterize it as Eliezer saying "suppose I'm right and they're wrong; all this requires is things to be harder than people think, which is usual. Suppose instead that I'm wrong and they're right; this requires things to be easier than people think, which is unusual." But the equation of "people" and "Eliezer" is sort of strange; as Quintin notes, it isn't that unusual for outside observers to overestimate difficulty, and so I wish he had centrally addressed the reference class tennis game: is the expertise "getting AI systems to be capable" or "getting AI systems to do what you want"?

FWIW, I thought the bit about manifolds in The difficulty of alignment was the strongest foot forward, because it paints a different detailed picture than your description that it's responding to.

That said, I don't think Quintin's picture obviously disagrees with yours (as discussed in my response over here), and I think you'd find it disappointing that he calls your description extremely misleading while not seeming to correctly identify the argument structure or check whether there's a related argument that goes thru on his model.

During this process, I don’t think it’s particularly unusual for the person to notice a technical problem but overlook a clever way to solve that problem.

I think this isn't the claim; I think the claim is that it would be particularly unusual for someone to overlook that they're accidentally solving a technical problem. (It would be surprising for Edison to not be thinking hard about what filament to use and pick tungsten; in actual history, it took decades for that change to be made.)

4Steve Byrnes8mo
Sure, but then the other side of the analogy doesn’t make sense, right? The context was: Eliezer was talking in general terms about the difficulty of the AGI x-risk problem and whether it’s likely to be solved. (As I understand it.) [Needless to say, I’m just making a narrow point that it’s a bad analogy. I’m not arguing that p(doom) is high or low, I’m not saying this is an important & illustrative mistake (talking on the fly is hard!), etc.]

BTW I do agree with you that Eliezer’s interview response seems to suggest that he thinks aligning an AGI to “basic notions of morality” is harder and aligning an AGI to “strawberry problem” is easier. If that’s what he thinks, it’s at least not obvious to me.

My sense (which I expect Eliezer would agree with) is that it's relatively easy to get an AI system to imitate the true underlying 'basic notions of morality', to the extent humans agree on that, but that this doesn't protect you at all as soon as you want to start making large changes, or as soon as ... (read more)

4Steve Byrnes8mo
Hmm, on further reflection, I was mixing up * Strawberry Alignment (defined as: make an AGI that is specifically & exclusively motivated to duplicate a strawberry without destroying the world), versus * “Strawberry Problem” (make an AGI that in fact duplicates a strawberry without destroying the world, using whatever methods / motivations you like). Eliezer definitely talks about the latter. I’m not sure Eliezer has ever brought up the former? I think I was getting that from the OP (Quintin), but maybe Quintin was just confused (and/or Eliezer misspoke). Anyway, making an AGI that can solve the strawberry problem is tautologically no harder than making an AGI that can do advanced technological development and is motivated by human norms / morals / whatever, because the latter set of AGIs is a subset of the former. Sorry. I crossed out that paragraph.  :)

seem very implausible when considered in the context of the human learning process (could a human's visual cortex become "deceptively aligned" to the objective of modeling their visual field?).

I think it would probably be strange for the visual field to do this. But I think it's not that uncommon for other parts of the brain to do this; higher level, most abstract / "psychological" parts that have a sense of how things will affect their relevance to future decision-making. I think there are lots of self-perpetuating narratives that it might be fair to call... (read more)

1Quintin Pope2mo
I believe the human visual cortex is actually the more relevant comparison point for estimating the level of danger we face due to mesaoptimization. Its training process is more similar to the self-supervised / offline way in which we train (base) LLMs. In contrast, the 'most abstract / "psychological"' parts are more entangled in future decision-making. They're more "online", with greater ability to influence their future training data. I think it's not too controversial that online learning processes can have self-reinforcing loops in them. Crucially, however, such loops rely on being able to influence the externally visible data collection process, rather than being invisibly baked into the prior. They are thus much more amenable to being addressed with scalable oversight approaches.

John Wentworth describes the possibility of "optimization demons", self-reinforcing patterns that exploit flaws in an imperfect search process to perpetuate themselves and hijack the search for their own purposes.

But no one knows exactly how much of an issue this is for deep learning, which is famous for its ability to evade local minima when run with many parameters.

Also relevant are Are minimal circuits daemon-free? and Are minimal circuits deceptive?. I agree no one knows how much of an issue this will be for deep learning.

Additionally, I think tha

... (read more)

I think the bolded text is about Yudkowsky himself being wrong.

That is also how I interpreted it.

If you have a bunch of specific arguments and sources of evidence that you think all point towards a particular conclusion X, then discovering that you're wrong about something should, in expectation, reduce your confidence in X. 

I think Yudkowsky is making a different statement. I agree it would be bizarre for him to be saying "if I were wrong, it would only mean I should have been more confident!"

Yudkowsky is not the aerospace engineer building the rocke

... (read more)

Given the greater evidence available for general ML research, being well calibrated about the difficulty of general ML research is the first step to being well calibrated about the difficulty of ML alignment research. 

I think I agree with this point but want to explicitly note the switch from the phrase 'AI alignment research' to 'ML alignment research'; my model of Eliezer thinks the latter is mostly a distraction from the former, and if you think they're the same or interchangeable that seems like a disagreement.

[For example, I think ML alignment re... (read more)

in which Yudkowsky incorrectly assumed that GANs (Generative Adversarial Networks, a training method sometimes used to teach AIs to generate images) were so finicky that they must not have worked on the first try.

I do think this is a point against Yudkowsky. That said, my impression is that GANs are finicky, and I heard rumors that many people tried similar ideas and failed to get it to work before Goodfellow knocked it out of the park. If people were encouraged to publish negative results, we might have a better sense of the actual landscape here, but I t... (read more)

Yudkowsky's own prior statements seem to put him in this camp as well. E.g., here he explains why he doesn't expect intelligence to emerge from neural networks (or more precisely, why he dismisses a brain-based analogy for coming to that conclusion)

I think you're basically misunderstanding and misrepresenting Yudkowsky's argument from 2008. He's not saying "you can't make an AI out of neural networks", he's saying "your design sharing a single feature with the brain does not mean it will also share the brain's intelligence." As well, I don't think he's arg... (read more)

0Quintin Pope2mo
There was an entire thread about Yudkowsky's past opinions on neural networks, and I agree with Alex Turner's evidence that Yudkowsky was dubious.  I also think people who used brain analogies as the basis for optimism about neural networks were right to do so.

Finally, I'd note that having a "security mindset" seems like a terrible approach for raising human children to have good values

Do you have kids, or any experience with them? (There are three small children in the house I live in.) I think you might want to look into childproofing, and meditate on its connection to security mindset.

Yes, this isn't necessarily related to the 'values' part, but for that I would suggest things like Direct Instruction, which involves careful curriculum design to generate lots of examples so that students will reliably end up i... (read more)

It cannot be the case that successful value alignment requires perfect adversarial robustness.

It seems like the argument structure here is something like:

  1. This requirement is too stringent for humans to follow
  2. Humans have successful value alignment
  3. Therefore this requirement cannot be necessary for successful value alignment.

I disagree with point 2, tho; among other things, it looks to me like some humans are on track to accidentally summon a demon that kills both me and them, which I expect they would regret after-the-fact if they had the chance to.

So any... (read more)

As I understand it, the security mindset asserts a premise that's roughly: "The bundle of intuitions acquired from the field of computer security are good predictors for the difficulty / value of future alignment research directions." 

This seems... like a correct description but it's missing the spirit?

Like the intuitions are primarily about "what features are salient" and "what thoughts are easy to think."

However, I don't see why this should be the case. 

Roughly, the core distinction between software engineering and computer security is whether ... (read more)

2Quintin Pope2mo
Yes, and my point in that section is that the fundamental laws governing how AI training processes work are not "thinking back". They're not adversaries. If you created a misaligned AI, then it would be "thinking back", and you'd be in an adversarial position where security mindset is appropriate. "Building an AI that doesn't game your specifications" is the actual "alignment question" we should be doing research on. The mathematical principles which determine how much a given AI training process games your specifications are not adversaries. It's also a problem we've made enormous progress on, mostly by using large pretrained models with priors over how to appropriately generalize from limited specification signals. E.g., Learning Which Features Matter: RoBERTa Acquires a Preference for Linguistic Generalizations (Eventually) shows how the process of pretraining an LM causes it to go from "gaming" a limited set of finetuning data via shortcut learning / memorization, to generalizing with the appropriate linguistic prior knowledge.

What does this mean for alignment? How do we prevent AIs from behaving badly as a result of a similar "misgeneralization"? What alignment insights does the fleshed-out mechanistic story of humans coming to like ice cream provide?

As far as I can tell, the answer is: don't reward your AIs for taking bad actions.

uh

 

is your proposal "use the true reward function, and then you won't get misaligned AI"?

That's all it would take, because the mechanistic story above requires a specific step where the human eats ice cream and activates their reward circuits. If

... (read more)
1Logan Riggs Smith8mo
My shard-theory-inspired story is to make an AI that: 1. Has a good core of human values (this is still hard) 2. Can identify when experiences will change itself to lead to less of the initial good values. (This is the meta-preferences point with GPT-4 sort of expressing it would avoid jailbreak inputs) Then the model can safely scale. This doesn’t require having the true reward function (which I imagine to be a giant lookup table created by Omega), but some mech interp and understanding its own reward function. I don’t expect this to be an entirely different paradigm; I even think current methods of RLHF might just naively work. Who knows? (I do think we should try to figure it out though! I do have greater uncertainty and less pessimism) Analogously, I do believe I do a good job of avoiding value-destroying inputs (e.g. addictive substances), even though my reward function isn’t as clear and legible as what our AIs’ will be AFAIK.

I think it's straightforward to explain why humans "misgeneralized" to liking ice cream. 

I don't yet understand why you put misgeneralized in scare quotes, or whether you have a story for why it's a misgeneralization instead of things working as expected.

I think your story for why humans like ice cream makes sense, and is basically the story Yudkowsky would tell too, with one exception:

The ancestral environment selected for reward circuitry that would cause its bearers to seek out more of such food sources.

"such food sources" feels a little like it's ... (read more)

There's no guarantee that such a thing even exists, and implicitly aiming to avoid the one value formation process we know is compatible with our own values seems like a terrible idea.

...

It's thus vastly easier to align models to goals where we have many examples of people executing said goals.

I think there's a deep disconnect here on whether interpolation is enough or whether we need extrapolation.

The point of the strawberry alignment problem is "here's a clearly understandable specification of a task that requires novel science and engineering to execute... (read more)

I think this is extremely misleading. Firstly, real-world data in high dimensions basically never look like spheres. Such data almost always cluster in extremely compact manifolds, whose internal volume is minuscule compared to the full volume of the space they're embedded in.

I agree with your picture of how manifolds work; I don't think it actually disagrees all that much with Yudkowsky's.

That is, the thing where all humans are basically the same make and model of car, running the same brand of engine, painted different colors is the claim that the intrin... (read more)

1Quintin Pope2mo
Addressing this objection is why I emphasized the relatively low information content that architecture / optimizers provide for minds, as compared to training data. We've gotten very far in instantiating human-like behaviors by training networks on human-like data. I'm saying the primacy of data for determining minds means you can get surprisingly close in mindspace, as compared to if you thought architecture / optimizer / etc were the most important. Obviously, there are still huge gaps between the sorts of data that an LLM is trained on versus the implicit loss functions human brains actually minimize, so it's kind of surprising we've even gotten this far. The implication I'm pointing to is that it's feasible to get really close to human minds along important dimensions related to values and behaviors, even without replicating all the quirks of human mental architecture.
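
As a concrete illustration of the intrinsic-dimension picture above, here's a minimal synthetic sketch (my own toy construction, not anything from the post or the comments): points that intrinsically vary along only a few directions, but sit in a high-dimensional ambient space, have a singular value spectrum that collapses after those few directions.

```python
import numpy as np

rng = np.random.default_rng(0)

# 2,000 points that intrinsically vary along only 3 directions,
# linearly embedded (plus a little noise) in a 300-dimensional ambient space.
latent = rng.normal(size=(2_000, 3))
embedding = rng.normal(size=(3, 300))
data = latent @ embedding + 0.01 * rng.normal(size=(2_000, 300))

# The singular value spectrum exposes the low intrinsic dimension:
# three large singular values, then a sharp drop to the noise floor.
s = np.linalg.svd(data - data.mean(axis=0), compute_uv=False)
print(s[:6].round(1))
```

Real data manifolds are curved rather than linear, so this understates the complications, but the "tiny volume, few effective directions" intuition carries over.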

This seems like way too high a bar. It seems clear that you can have transformative or risky AI systems that are still worse than humans at some tasks. This seems like the most likely outcome to me.

I think this is what Yudkowsky thinks also? (As for why it was relevant to bring up, Yudkowsky was answering the host's question of "How is superintelligence different than general intelligence?")

I expect future capabilities advances to follow a similar pattern as past capabilities advances, and not completely break the existing alignment techniques.

Part of this is just straight disagreement, I think; see So8res's Sharp Left Turn and follow-on discussion.

But for the rest of it, I don't see this as addressing the case for pessimism, which is not problems from the reference class that contains "the LLM sometimes outputs naughty sentences" but instead problems from the reference class that contains "we don't know how to prevent an ontological collapse... (read more)

1Quintin Pope2mo
Evolution provides no evidence for the sharp left turn. I dislike this minimization of contemporary alignment progress. Even just limiting ourselves to RLHF, that method addresses far more problems than "the LLM sometimes outputs naughty sentences". E.g., it also tackles problems such as consistently following user instructions, reducing hallucinations, improving the topicality of LLM suggestions, etc. It allows much more significant interfacing with the cognition and objectives pursued by LLMs than just some profanity filter. I don't think ontological collapse is a real issue (or at least, not an issue that appropriate training data can't solve in a relatively straightforward way). I feel similarly about lots of things that are speculated to be convergent problems for ML systems, such as wireheading and mesaoptimization. If you're referring to the technique used on LLMs (RLHF), then the answer seems like an obvious yes. RLHF just refers to using reinforcement learning with supervisory signals from a preference model. It's an incredibly powerful and flexible approach, one that's only marginally less general than reinforcement learning itself (can't use it for things you can't build a preference model of). It seems clear enough to me that you could do RLHF over the biologist-bot's action outputs in the biological domain, and be able to shape its behavior there. If you're referring to just doing language-only RLHF on the model, then making a bio-model, and seeing if the RLHF influences the bio-model's behaviors, then I think the answer is "variable, and it depends a lot on the specifics of the RLHF and how the cross-modal grounding works". People often translate non-lingual modalities into language so LLMs can operate in their "native element" in those other domains. Assuming you don't do that, then yes, I could easily see the language-only RLHF training having little impact on the bio-model's behaviors. However, if the bio-model were acting multi-modally by e.

I have a lot of responses to specific points; I'm going to make them as children comment to this comment.

3Quintin Pope2mo
I've recently decided to revisit this post. I'll try to address all un-responded to comments in the next ~2 weeks.

seem very implausible when considered in the context of the human learning process (could a human's visual cortex become "deceptively aligned" to the objective of modeling their visual field?).

I think it would probably be strange for the visual field to do this. But I think it's not that uncommon for other parts of the brain to do this; higher level, most abstract / "psychological" parts that have a sense of how things will affect their relevance to future decision-making. I think there are lots of self-perpetuating narratives that it might be fair to call... (read more)

John Wentworth describes the possibility of "optimization demons", self-reinforcing patterns that exploit flaws in an imperfect search process to perpetuate themselves and hijack the search for their own purposes.

But no one knows exactly how much of an issue this is for deep learning, which is famous for its ability to evade local minima when run with many parameters.

Also relevant are Are minimal circuits daemon-free? and Are minimal circuits deceptive?. I agree no one knows how much of an issue this will be for deep learning.

Additionally, I think tha

... (read more)
4Matthew "Vaniver" Gray8mo
That is also how I interpreted it. I think Yudkowsky is making a different statement. I agree it would be bizarre for him to be saying "if I were wrong, it would only mean I should have been more confident!" I think he is (inside of the example). He's saying "suppose an engineer is wrong about how their design works. Is it more likely that the true design performs better along multiple important criteria than expectation, or that the design performs worse (or fails to function at all)?" Note that 'expectation' is referring to the confidence level inside an argument, but arguments aren't Bayesians; it's the outside agent that shouldn't be expected to predictably update. Another way to put this: does the engineer expect to be disappointed, excited, or neutral if the design doesn't work as planned? Typically, disappointed, implying the plan is overly optimistic compared to reality. If this weren't true--if engineers were calibrated or pessimistic--then I think Yudkowsky would be wrong here (and also probably have a different argument to begin with).
4Matthew "Vaniver" Gray8mo
I think I agree with this point but want to explicitly note the switch from the phrase 'AI alignment research' to 'ML alignment research'; my model of Eliezer thinks the latter is mostly a distraction from the former, and if you think they're the same or interchangeable that seems like a disagreement. [For example, I think ML alignment research includes stuff like "will our learned function be robust to distributional shift in the inputs?" and "does our model discriminate against protected classes?" whereas AI alignment research includes stuff like "will our system be robust to changes in the number of inputs?" and "is our model deceiving us about its level of understanding?". They're related in some ways, but pretty deeply distinct.]

in which Yudkowsky incorrectly assumed that GANs (Generative Adversarial Networks, a training method sometimes used to teach AIs to generate images) were so finicky that they must not have worked on the first try.

I do think this is a point against Yudkowsky. That said, my impression is that GANs are finicky, and I heard rumors that many people tried similar ideas and failed to get it to work before Goodfellow knocked it out of the park. If people were encouraged to publish negative results, we might have a better sense of the actual landscape here, but I t... (read more)

Yudkowsky's own prior statements seem to put him in this camp as well. E.g., here he explains why he doesn't expect intelligence to emerge from neural networks (or more precisely, why he dismisses a brain-based analogy for coming to that conclusion)

I think you're basically misunderstanding and misrepresenting Yudkowsky's argument from 2008. He's not saying "you can't make an AI out of neural networks", he's saying "your design sharing a single feature with the brain does not mean it will also share the brain's intelligence." As well, I don't think he's arg... (read more)

Finally, I'd note that having a "security mindset" seems like a terrible approach for raising human children to have good values

Do you have kids, or any experience with them? (There are three small children in the house I live in.) I think you might want to look into childproofing, and meditate on its connection to security mindset.

Yes, this isn't necessarily related to the 'values' part, but for that I would suggest things like Direct Instruction, which involves careful curriculum design to generate lots of examples so that students will reliably end up i... (read more)

3Matthew "Vaniver" Gray8mo
It seems like the argument structure here is something like: 1. This requirement is too stringent for humans to follow 2. Humans have successful value alignment 3. Therefore this requirement cannot be necessary for successful value alignment. I disagree with point 2, tho; among other things, it looks to me like some humans are on track to accidentally summon a demon that kills both me and them, which I expect they would regret after-the-fact if they had the chance to. So any reasoning that's like "well so long as it's not unusual we can be sure it's safe" runs into the thing where we're living in the acute risk period. The usual is not safe! This seems definitely right to me. An expectation I have is that this will also generate resistance to alignment techniques / control by its operators, which perhaps complicates how benign this is. [FWIW I also don't think we want an AI that's perfectly robust to all possible adversarial attacks; I think we want one that's adequate to defend against the security challenges it faces, many of which I expect to be internal. Part of this is because I'm mostly interested in AI planning systems able to help with transformative changes to the world instead of foundational models used by many customers for small amounts of cognition, which are totally different business cases and have different security problems.]

As I understand it, the security mindset asserts a premise that's roughly: "The bundle of intuitions acquired from the field of computer security are good predictors for the difficulty / value of future alignment research directions." 

This seems... like a correct description but it's missing the spirit?

Like the intuitions are primarily about "what features are salient" and "what thoughts are easy to think."

However, I don't see why this should be the case. 

Roughly, the core distinction between software engineering and computer security is whether ... (read more)

What does this mean for alignment? How do we prevent AIs from behaving badly as a result of a similar "misgeneralization"? What alignment insights does the fleshed-out mechanistic story of humans coming to like ice cream provide?

As far as I can tell, the answer is: don't reward your AIs for taking bad actions.

uh

 

is your proposal "use the true reward function, and then you won't get misaligned AI"?

That's all it would take, because the mechanistic story above requires a specific step where the human eats ice cream and activates their reward circuits. If

... (read more)

I think it's straightforward to explain why humans "misgeneralized" to liking ice cream. 

I don't yet understand why you put misgeneralized in scare quotes, or whether you have a story for why it's a misgeneralization instead of things working as expected.

I think your story for why humans like ice cream makes sense, and is basically the story Yudkowsky would tell too, with one exception:

The ancestral environment selected for reward circuitry that would cause its bearers to seek out more of such food sources.

"such food sources" feels a little like it's ... (read more)

3Matthew "Vaniver" Gray8mo
I think there's a deep disconnect here on whether interpolation is enough or whether we need extrapolation. The point of the strawberry alignment problem is "here's a clearly understandable specification of a task that requires novel science and engineering to execute on. Can you do that safely?". If your ambitions are simply to have AI customer service bots, you don't need to solve this problem. If your ambitions include cognitive megaprojects which will need to be staffed at high levels by AI systems, then you do need to solve this problem. More pragmatically, if your ambitions include setting up some sort of system that prevents people from deploying rogue AI systems while not dramatically curtailing Earth's potential, that isn't a goal that we have many examples of people executing on. So either we need to figure it out with humans or, if that's too hard, create an AI system capable of figuring it out (which probably requires an AI leader instead of an AI assistant).
4Matthew "Vaniver" Gray8mo
I agree with your picture of how manifolds work; I don't think it actually disagrees all that much with Yudkowsky's. That is, the thing where all humans are basically the same make and model of car, running the same brand of engine, painted different colors is the claim that the intrinsic dimension of human minds is pretty small. (Taken literally, it's 3, for the three dimensions of color-space.) And so if you think there are, say, 40 intrinsic dimensions to mind-space, and humans are fixed on 37 of the points and variable on the other 3, well, I think we have basically the Yudkowskian picture. (I agree if Yudkowsky's picture was that there were 40M dimensions and humans varied on 3, this would be comically wrong, but I don't think this is what he's imagining for that argument.)
2Matthew "Vaniver" Gray8mo
I think this is what Yudkowsky thinks also? (As for why it was relevant to bring up, Yudkowsky was answering the host's question of "How is superintelligence different than general intelligence?")
3Matthew "Vaniver" Gray8mo
Part of this is just straight disagreement, I think; see So8res's Sharp Left Turn and follow-on discussion. But for the rest of it, I don't see this as addressing the case for pessimism, which is not problems from the reference class that contains "the LLM sometimes outputs naughty sentences" but instead problems from the reference class that contains "we don't know how to prevent an ontological collapse, where meaning structures constructed under one world-model compile to something different under a different world model." Or, like, once LLMs gain the capability to design proteins (because you added in a relevant dataset, say), do you really expect the 'helpful, harmless, honest' alignment techniques that were used to make a chatbot not accidentally offend users to also work for making a biologist-bot not accidentally murder patients? Put another way, I think new capabilities advances reveal new alignment challenges and unless alignment techniques are clearly cutting at the root of the problem, I don't expect that they will easily transfer to those new challenges.

Is the dataset you used for the regression available? Might be easier to generate the graphs that I'm thinking of then describe them.

[EDIT: I was confused when I wrote the earlier comment, I thought Vivek was talking about the decision square distance to the top 5x5 corner, which I do think my naive guess is plausible for; I don't have the same guess about cheese Euclidean distance to top right corner.]

2Alex Turner8mo
Here's a colab notebook (it takes a while to load the data, be warned). We'll have a post out later. 
2Alex Turner8mo
Yeah, we'll put up additional notebooks/resources/datasets soon.

My naive guess is that the other relationships are nonlinear, and this is the best way to approximate those relationships out of just linear relationships of the variables the regressor had access to.

2Alex Turner8mo
Hm, what do you mean by "other relationships"? Is your guess that "cheese Euclidean distance to top-right" is a statistical artifact, or something else?  If so—I'm quite confident that relationship isn't an artifact (although I don't strongly believe that the network is literally modulating its decisions on the basis of this exact formalization). For example, see footnote 4. I'd also be happy to generate additional vector field visualizations in support of this claim.
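
For what it's worth, here is a minimal synthetic sketch (invented data, not the maze dataset, and not a claim about what the network is actually doing) of the generic phenomenon the "nonlinear relationships" guess points at: a linear regression will put weight on a variable the true mechanism never reads, if that variable is correlated with a nonlinear feature the regression cannot express otherwise.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5_000

# Synthetic setup: the behaviour y is a nonlinear function of x1 alone;
# x2 never enters the true mechanism, but it is correlated with the
# nonlinear feature (|x1|) that actually matters.
x1 = rng.normal(size=n)
x2 = np.abs(x1) + 0.5 * rng.normal(size=n)
y = x1 ** 2

X = np.column_stack([x1, x2, np.ones(n)])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
print("linear coefficients [x1, x2, bias]:", coef.round(2))
# x1 gets roughly zero weight while x2 gets a clearly nonzero weight,
# even though the true mechanism only ever reads x1.
```

(This is only the generic worry; the vector field visualizations and footnote 4 are exactly the kind of evidence that distinguishes it from the network really modulating on cheese distance.)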

Good GPUs feels kind of orthogonal.

IMO it's much easier to support high investment numbers in "AI" if you consider lots of semiconductor / AI hardware startup stuff as "AI investments". My suspicion is that while GPUs were primarily a crypto thing for the last few years, the main growth outlook driving more investment is them being an AI thing. 

I've written a bunch elsewhere about object-level thoughts on ELK. For this review, I want to focus instead on meta-level points.

I think ELK was very well-made; I think it did a great job of explaining itself with lots of surface area, explaining a way to think about solutions (the builder-breaker cycle), bridging the gap between toy demonstrations and philosophical problems, and focusing lots of attention on the same thing at the same time. In terms of impact on the growth and development on the AI safety community, I think this is one of the most importa... (read more)

Sure, points from a scoring rule come both from 'skill' (whether or not you're accurate in your estimates) and 'calibration' (whether your estimates line up with the underlying propensity).

Rather than generating the picture I'm thinking of (sorry, up to something else and so just writing a quick comment), I'll describe it: watch this animation, and see the implied maximum expected score as a function of p (the forecaster's true belief). For all of the scoring rules, it's a convex function with maxima at 0 and 1. (You can get 1 point on average with a linea... (read more)
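
A minimal numeric check of that picture, using the positively oriented Brier score as a stand-in for the scoring rules in the animation:

```python
import numpy as np

p = np.linspace(0, 1, 101)  # the forecaster's true belief

# Positively oriented Brier score: S(q, o) = 1 - (q - o)^2.
# A proper scoring rule is maximized by reporting q = p, so the maximum
# expected score as a function of the true belief p is:
max_expected = p * (1 - (1 - p) ** 2) + (1 - p) * (1 - p ** 2)

# Algebraically this simplifies to 1 - p + p^2: convex, lowest at p = 0.5
# (0.75) and highest at p = 0 and p = 1 (both 1.0).
print(max_expected[[0, 50, 100]])  # values for p = 0, 0.5, 1
```

So even a perfectly honest, calibrated forecaster scores better when the underlying propensities are extreme; part of the score tracks how predictable the events are, not just calibration.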

But we show for any strictly proper scoring rule that there is a function f such that a dishonest prediction is optimal.

Agreed for proper scoring rules, but I'd be a little surprised if it's not possible to make a skill-free scoring rule, and then get an honest prediction result for that. [This runs into other issues--if the scoring rule is skill-free, where does the skill come from?--but I think this can be solved by having oracle-mode and observation-mode, and being able to do honest oracle-mode at all would be nice.]

1Johannes Treutlein1y
I'm not sure I understand what you mean by a skill-free scoring rule. Can you elaborate what you have in mind?

I think there's an existing phrase called "defense in depth", which somehow feels... more like the right spirit? [This is related to the 'swiss cheese' model you bring up in the motivation section.] It's not that we're going to throw together a bunch of miscellaneous stuff and it'll work; it's that we're not going to trust any particular defense that we have enough that we don't also want other defenses.

Great post!

My interpretation is that: as an oracle varies its prediction, it both picks up points from becoming more accurate and from making the outcome more predictable. This means the proper scoring rules, tuned to only one pressure, will result in a dishonest equilibrium instead of an honest one.

(Incidentally, this seems like a pretty sensible explanation for why humans are systematically overconfident; if confidence increases success, then the point when the benefits of increased success match the costs of decreased accuracy will be more extreme than ... (read more)
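
A toy numeric version of that interpretation (the function f below is an invented example, not one from the paper): when the outcome probability depends on the published prediction, a strictly proper scoring rule can reward an extreme, dishonest report over the honest fixed point.

```python
import numpy as np

def f(q):
    # Invented example: the event becomes more likely the more confidently
    # it is predicted.
    return 0.2 + 0.6 * q

def expected_brier(q):
    # Outcome o ~ Bernoulli(f(q)); positively oriented Brier score 1 - (q - o)^2.
    return f(q) * (1 - (1 - q) ** 2) + (1 - f(q)) * (1 - q ** 2)

q = np.linspace(0, 1, 1001)
best = q[np.argmax(expected_brier(q))]
print("score-maximizing report:", best)        # an extreme report (0.0; 1.0 ties)
print("honest fixed point q = f(q):", 0.5)
print("expected score, honest vs. optimal:",
      round(expected_brier(0.5), 3), round(expected_brier(best), 3))  # ~0.75 vs. 0.8
```

The oracle sacrifices a little accuracy at the number it reports, but more than makes up for it by pushing the world into a more predictable state.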

4Johannes Treutlein1y
Thanks for your comment! Your interpretation sounds right to me. I would add that our result implies that it is impossible to incentivize honest reports in our setting. If you want to incentivize honest reports when f is constant, then you have to use a strictly proper scoring rule (this is just the definition of “strictly proper”). But we show for any strictly proper scoring rule that there is a function f such that a dishonest prediction is optimal. Proposition 13 shows that it is possible to “tune” scoring rules to make optimal predictions very close to honest ones (at least in L1-distance). Yes, that is fair. To be faithful to the common usage of the term, one should maybe require at least two possible fixed points (or points that are somehow close to fixed points). The case with a unique fixed point is probably also safer, and worries about “self-fulfilling prophecies” don't apply to the same degree.

I want to point out that I think the typical important case looks more like "wanting to do things for unusual reasons," and if you're worried about this approach breaking down there that seems like a pretty central obstacle. For example, suppose rather than trying to maintain a situation (the diamond stays in the vault) we're trying to extrapolate (like coming up with a safe cancer cure). When looking at a novel medication to solve an unsolved problem, we won't be able to say "well, it cures the cancer for the normal reason" because there aren't any positi... (read more)

4Paul Christiano1y
Yes, you want the patient to appear on camera for the normal reason, but you don't want the patient to remain healthy for the normal reason. We describe a possible strategy for handling this issue in the appendix. I feel more confident about the choice of research focus than I do about whether that particular strategy will work out. The main reasons are: I think that ELK and deceptive alignment are already challenging and useful to solve even in the case where there is no such distributional shift, that those challenges capture at least some central alignment difficulties, that the kind of strategy described in the post is at least plausible, and that as a result it's unlikely to be possible to say very much about the distributional shift case before solving the simpler case. If the overall approach fails, I currently think it's most likely either because we can't define what we mean by explanation or that we can't find explanations for key model behaviors.

Is this saying "if model performance is getting better, then maybe it will have a sharp left turn, and if model performance isn't getting better, then it won't"?

3Ethan Perez1y
Yes

In particular, all of the RLHF work is basically capabilities work which makes alignment harder in the long term (because it directly selects for deception), while billing itself as "alignment".

I share your opinion of RLHF work but I'm not sure I share your opinion of its consequences. For situations where people don't believe arguments that RLHF is fundamentally flawed because they're too focused on empirical evidence over arguments, the generation of empirical evidence that RLHF is flawed seems pretty useful for convincing them! 

This might imply a predictive circuit for predicting the output of the antecedent-computation-reinforcer, but I don't see why it implies internal reward-orientation motivational edifices.

Sorry, if I'm reading this right, we're hypothesizing internal reward-orientation motivational edifices, and then asking the question of whether or not policy gradients will encourage them or discourage them. Quintin seems to think "nah, it needs to take an action before that action can be rewarded", and my response is "wait, isn't this going to be straightforwardly encour... (read more)

3Alex Turner1y
I see. Can't speak for Quintin, but: I mostly think it won't be present, but also conditional on the motivational edifice being present, I expect the edifice to bid up rewarding actions and get reinforced into a substantial influence. I have a lot of uncertainty in this case. I'm hoping to work out a better mechanistic picture of how the gradients would affect such edifices. I think there are a range of disagreements here, but also one man's modus ponens is another's modus tollens: High variance in heroin-propensity implies we can optimize heroin-propensity down to negligible values with relatively few bits of optimization (if we knew what we were doing, at least). This isn't obviously true to me, actually. That strategy certainly sounds quotidian, but is it truly mechanistically deficient? If we tell the early training-AGI "Hey, if you hit the reward button, the ensuing credit assignment will drift your values by mechanisms A, B, and C", that provides important information to the AGI. I think that that's convergently good advice, across most possible values the AGI could have. (This, of course, doesn't address the problem of whether the AGI does have good values to begin with.) More broadly, I suspect there might be some misconception about myself and other shard theory researchers. I don't think, "Wow humans are so awesome, let's go ahead and ctrl+C ctrl+V for alignment." I'm very very against boxing confusion like that. I'm more thinking, "Wow, humans have pretty good general alignment properties; I wonder what the generators are for that?". I want to understand the generators for the one example we have of general intelligences acquiring values over their lifetime, and then use that knowledge to color in and reduce my uncertainty about how alignment works.

For fixed-duration reasoning chains, or in situations where the AI is trying to facilitate cooperation between different passes (like in the proposed scenario where it gets information but then loses access to it), this seems important. For example, in a situation where you show the model a passage, let it make notes, and then have the same model look just at the notes and answer a reading comprehension question, I expect that likely it will want to embed information in the null space of the notes. [Or in an HCH-style situation, it might want to embed the ... (read more)

3A Ray1y
I think this is worth considering even in cases of things like recursive prompting for a couple of reasons that I haven't organized: * In general additional compute time lets you improve your answer, so by default I'd expect answers to any significant question to take multiple steps * I agree for trivial questions it might not do any of this.  My expectation is that AI deployments impacts will mostly be on things like non-trivial questions or tasks * So far current research hasn't done a great deal of regularizing or penalizing models for using a lot of thinking * Mostly they've optimized for "maximize the chance of correct answer" * This could change in the future, especially in cases where the model is expected to be deployed / otherwise sensitive to runtime costs * Steganography is rewarded over openness if the openness is somehow far from the prior for a given piece of information it wants to transmit * Caveat this point is more speculative than the above document * One of my points is that due to null space + optimization pressure, it's very easy and convenient for gradient descent to put information in these hidden dimensions * I think one of the things this can do well is handle concepts that don't yet exist -- and so reasoning moves like "invent a new term" or "give an illustrative example" or "invent a thought experiment" are lower likelihood under the prior * In these cases, it can be more likely for steganography to transmit certain information, if it is less expensive in terms of KL penalties * I do agree that for straightforward reasoning on straightforward evidence, where the reasoning is human-like and close to the prior, we can expect it to mostly reason out loud in a way that is "open" * Myopia -- though less important than it was earlier in RFLO research -- seemed like a good alignment criterion to aim for, and still does today somewhat * The original starting point for this direction of thinking was about Myopia / N

so in pure non-lookahead (e.g. model free) sample-based policy gradient estimation, an action which has never been tried can not be reinforced (except as a side effect of generalisation by function approximation).

This is the bit I don't believe, actually. [Or at least don't think is relevant.] Note that in Wei_Dai's hypothetical, the neural net architecture has a particular arrangement such that "how much it optimizes for reward" is either directly or indirectly implied by the neural network weights. [We're providing the reward as part of its observations,... (read more)

3Alex Turner1y
This might imply a predictive circuit for predicting the output of the antecedent-computation-reinforcer, but I don't see why it implies internal reward-orientation motivational edifices. I can probably predict my own limbic reward outputs to some crude degree, but that doesn't make me a reward optimizer. I think that's assuming there's a feature-direction "care more about reward" which isn't already gradient-starved by shallower proxies learned earlier in training. In my ontology, this corresponds to "thinking thoughts about reward in order to get reward." Seems like we're assuming the whole ball game away. You're assuming the cognition is already set up so as to admit easy local refinements towards maximizing reward more, that this is where the gradient points. My current guess is that freshly initialized networks will not have gradients towards modelling and acting to increase the antecedent-computation-reinforcer register in the real world (nor would this be the parametric direction of maximal increase of P(rewarding actions)). For any observed data point in PG, you're updating to make rewarding actions more probable given the policy network. There are many possible directions in which to increase P(rewarding actions), and internal reward valuation is only one particular direction. But if you're already doing the "lick lollipops" action because you see a lollipop in front of you and have a hardcoded heuristic to grab it and lick it, then this starves any potential gradient (because you're already taking the action of grabbing the lollipop). Now, you might have a situation where the existing computation doesn't get reward. But then policy gradient isn't going to automatically "find" the bandit arm with even higher reward and then provide an exact gradient towards that action. PG is still reinforcing to increase the probability of historically rewarding actions. And you can easily hit gradient starvation there, I think. If this argument works, why doesn't it

If the agent doesn’t explore in the direction of answering “good”, then there’s no gradient in that direction.

Wait, I don't think this is true? At least, I'd appreciate it being stepped thru in more detail.

In the simplest story, we're imagining an agent whose policy is $\pi_\theta$ and, for simplicity's sake, $\theta_{\text{reward}}$ is a scalar that determines "how much to maximize for reward" and all the other parameters of $\theta$ store other things about the dynamics of the world / decision-making process.

It seems to me that the gradient with respect to $\theta_{\text{reward}}$ is obviously going to ... (read more)
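
To gesture at the dynamics, a toy sketch (theta_r below plays the role of $\theta_{\text{reward}}$ above; everything else is an invented parameterization rather than anything from the thread): a scalar that scales how strongly the policy weights its own reward estimates picks up a positive REINFORCE gradient from rewards on actions the policy was already taking, so it drifts upward without requiring any novel exploration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented toy: a 3-armed bandit where the policy's logits mix a hard-coded
# heuristic with a scalar "reward-seeking" knob theta_r that scales the
# agent's own value estimates.
true_reward = np.array([0.1, 0.5, 1.0])
value_estimates = np.array([0.1, 0.5, 1.0])    # assume the agent predicts reward well
heuristic_logits = np.array([2.0, 0.0, -4.0])  # hard-coded bias toward arm 0

def policy(theta_r):
    logits = heuristic_logits + theta_r * value_estimates
    z = np.exp(logits - logits.max())
    return z / z.sum()

theta_r, lr = 0.0, 0.1
counts = np.zeros(3)
for _ in range(5_000):
    probs = policy(theta_r)
    a = rng.choice(3, p=probs)
    counts[a] += 1
    r = true_reward[a]
    # REINFORCE: d log pi(a) / d theta_r = value[a] - E_pi[value]
    theta_r += lr * r * (value_estimates[a] - probs @ value_estimates)

print("visit counts:", counts)
print("final theta_r:", round(theta_r, 2))
print("final policy:", policy(theta_r).round(2))
```

Whether such a knob is actually present, or is instead gradient-starved by shallower circuits that already produce the rewarded actions, is the live disagreement in this thread.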

3Oliver Sourbut1y
I think Quintin[1] is maybe alluding to the fact that in the limit of infinite counterfactual exploration then sure, the gradient in sample-based policy gradient estimation will push in that direction. But we don't ever have infinite exploration (and we certainly don't have counterfactual exploration; though we come very close in simulations with resets) so in pure non-lookahead (e.g. model free) sample-based policy gradient estimation, an action which has never been tried can not be reinforced (except as a side effect of generalisation by function approximation). This seems right to me and it's a nuance I've raised in a few conversations in the past. On the other hand kind of half the point of RL optimisation algorithms is to do 'enough' exploration! And furthermore (as I mentioned under Steven's comment) I'm not confident that such simplistic RL is the one that will scale to AGI first. cf various impressive results from DeepMind over the years which use lots of shenanigans besides plain old sample-based policy gradient estimation (including model-based lookahead as in the Alpha and Mu gang). But maybe! [1]: This is a guess and I haven't spoken to Quintin about this - Quintin, feel free to clarify/contradict

If you have a space with two disconnected components, then I'm calling the distinction between them "crisp."

The components feel disconnected to me in 1D, but I'm not sure they would feel disconnected in 3D or in ND. Is your intuition that they're 'durably disconnected' (even looking at the messy plan-space of the real-world, we'll be able to make a simple classifier that rates corrigibility), or if not, when the connection comes in (like once you can argue about philosophy in way X, once you have uncertainty about your operator's preferences, once you have... (read more)

5Paul Christiano1y
I don't think we can write down any topology over behaviors or policies for which they are disconnected (otherwise we'd probably be done). My point is that there seems to be a difference-in-kind between the corrigible behaviors and the incorrigible behaviors, a fundamental structural difference between why they get rated highly; and that's not just some fuzzy and arbitrary line, it seems closer to a fact about the dynamics of the world. If you are in the business of "trying to train corrigibility" or "trying to design corrigible systems," I think understanding that distinction is what the game is about. If you are trying to argue that corrigibility is unworkable, I think that debunking the intuitive distinction is what the game is about. The kind of thing people often say---like "there are so many ways to mess with you, how could a definition cover all of them?"---doesn't make any progress on that, and so it doesn't help reconcile the intuitions or convince most optimists to be more pessimistic. (Obviously all of that is just a best guess though, and the game may well be about something totally different.)

I agree this list doesn't seem to contain much unpublished material, and I think the main value of having it in one numbered list is that "all of it is in one, short place", and it's not an "intro to computers can think" and instead is "these are a bunch of the reasons computers thinking is difficult to align".

The thing that I understand to be Eliezer's "main complaint" is something like: "why does it seem like No One Else is discovering new elements to add to this list?". Like, I think Risks From Learned Optimization was great, and am glad you and others ... (read more)

2Eliezer Yudkowsky1y
Nearly empty string of uncommon social inputs.  All sorts of empirical inputs, including empirical inputs in the social form of other people observing things. It's also fair to say that, though they didn't argue me out of anything, Moravec and Drexler and Ed Regis and Vernor Vinge and Max More could all be counted as social inputs telling me that this was an important thing to look at.

Why is the process by which humans come to reliably care about the real world

IMO this process seems pretty unreliable and fragile, to me. Drugs are popular; video games are popular; people-in-aggregate put more effort into obtaining imaginary afterlives than life extension or cryonics.

But also humans have a much harder time 'optimizing against themselves' than AIs will, I think. I don't have a great mechanistic sense of what it will look like for an AI to reliably care about the real world.

2Alex Turner1y
One of the problems with English is that it doesn't natively support orders of magnitude for "unreliable." Do you mean "unreliable" as in "between 1% and 50% of people end up with part of their values not related to objects-in-reality", or as in "there is no a priori reason why anyone would ever care about anything not directly sensorially observable, except as a fluke of their training process"? Because the latter is what current alignment paradigms mispredict, and the former might be a reasonable claim about what really happens for human beings. EDIT:  My reader-model is flagging this whole comment as pedagogically inadequate, so I'll point to the second half of section 5 in my shard theory document.

Not saying that this should be MIRI's job, rather stating that I'm confused because I feel like we as a community are not taking an action that would seem obvious to me. 

I wrote about this a bit before, but in the current world my impression is that actually we're pretty capacity-limited, and so the threshold is not "would be good to do" but "is better than my current top undone item". If you see something that seems good to do that doesn't have much in the way of unilateralist risk, you doing it is probably the right call. [How else is the field going to get more capacity?]

2Rob Bensinger1y
+1

Notably, humans were once much less powerful, in our hunter-gatherer days, but over time, through the gradual process of accumulating technology, knowledge, and culture, humans now possess vast productive capacities that far outstrip our ancient powers.

Similarly, our ability to coordinate through language also plays a huge role in explaining our power compared to other animals. But, to a first approximation, other animals can't coordinate at all, making this distinction much less impressive. The first AGIs we construct will be born into a culture already c

... (read more)

But... shouldn't this mean you expect AGI civilization to totally dominate human civilization? They can read each other's source code, and thus trust much more deeply! They can transmit information between them at immense bandwidths! They can clone their minds and directly learn from each other's experiences!

I don't think it's obvious that this means that AGI is more dangerous, because it means that for a fixed total impact of AGI, the AGI doesn't have to be as competent at individual thinking (because it leans relatively more on group thinking). And so at... (read more)

Not sure exactly what this means.

My read was that for systems where you have rock-solid checking steps, you can throw arbitrary amounts of compute at searching for things that check out and trust them, but if there's any crack in the checking steps, then things that 'check out' aren't trustable, because the proposer can have searched an unimaginably large space (from the rater's perspective) to find them. [And from the proposer's perspective, the checking steps are the real spec, not whatever's in your head.]

In general, I think we can get a minor edge from... (read more)
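
A minimal sketch of that failure mode (toy code, invented checker): if the checking step has any crack, a proposer that searches hard enough will mostly return things that pass the check without having the property the check was meant to certify.

```python
import random

random.seed(0)

# Invented toy: the checker is *meant* to verify that a list is sorted,
# but it only spot-checks the first pair -- that's the "crack".
def sloppy_check_sorted(xs):
    return xs[0] <= xs[1]

def actually_sorted(xs):
    return all(a <= b for a, b in zip(xs, xs[1:]))

# A proposer that just searches until the checker accepts something.
def propose():
    while True:
        xs = [random.randint(0, 100) for _ in range(6)]
        if sloppy_check_sorted(xs):
            return xs

accepted = [propose() for _ in range(1_000)]
print("accepted by the checker:", len(accepted))
print("actually sorted:", sum(actually_sorted(xs) for xs in accepted))
```

From the proposer's perspective, "first two elements in order" is the real spec; the rest of the intended meaning of "sorted" never constrains its search.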
