- Are MIRI people claiming that if, say, a very moral and intelligent human became godlike while preserving their moral faculties, they would destroy the world despite, or perhaps because of, their best intentions?
For me, the answer here is "probably yes"; I think there is some bar of 'moral' and 'intelligent' where this doesn't happen, but I don't feel confident about where it is.
I think there are two things that I expect to be big issues, and probably more I'm not thinking of:
I claim that GPT-4 is already pretty good at extracting preferences from human data.
So this seems to me like it's the crux. I agree with you that GPT-4 is "pretty good", but I think the standard necessary for things to go well is substantially higher than "pretty good", and that's where the difficulty arises once we start applying higher and higher levels of capability and influence on the environment. My guess is Eliezer, Rob, and Nate feel basically the same way.
Basically, I think your later section--"Maybe you think"--is pointing in the right direction,...
So this seems to me like it's the crux. I agree with you that GPT-4 is "pretty good", but I think the standard necessary for things to go well is substantially higher than "pretty good", and that's where the difficulty arises once we start applying higher and higher levels of capability and influence on the environment.
This makes sense to me. On the other hand - it feels like there's some motte and bailey going on here, if one claim is "if the AIs get really superhumanly capable then we need a much higher standard than pretty good", but then it's illustrated using examples like "think of how your AI might not understand what you meant if you asked it to get your mother out of a burning building".
But gradient descent will still change the way that the system interprets things in its data storage, right?
I guess part of the question here is whether gradient descent will even scale to AutoGPT-like systems. You're probably not going to be able to differentiate thru your external notes / other changes you could make to your environment.
Gradient hacking intuitively seems harder still. The preconditions for it seem to be something like “the preconditions for deceptive alignment, plus the AI figuring out some effective maneuver to execute with the design of its own brain.”
It seems to me that the main difficulty is storing your reference policy somewhere where the gradient can't touch it (even indirectly). Does anyone have a story of how that happens?
I think there's a trilemma with updating CAIS-like systems to the foundational model world, which is: who is doing the business development?
I came up with three broad answers (noting reality will possibly be a mixture):
However, after looking back on it more than four years later, I think the general picture it gave missed some crucial details about how AI will go.
I feel like this is understating things a bit.
In my view (Drexler probably disagrees?), there are two important parts of CAIS:
Then the model can safely scale.
If there are experiences which will change itself which don't lead to less of the initial good values, then yeah, for an approximate definition of safety. You're resting everything on the continued strength of this model as capabilities increase, and so if it fails before you top out the scaling I think you probably lose.
FWIW I don't really see your description as, like, a specific alignment strategy so much as the strategy of "have an alignment strategy at all". The meat is all in 1) how you identify the core of human...
So I definitely think that's something weirdly unspoken about the argument; I would characterize it as Eliezer saying "suppose I'm right and they're wrong; all this requires is things to be harder than people think, which is usual. Suppose instead that I'm wrong and they're right; this requires things to be easier than people think, which is unusual." But the equation of "people" and "Eliezer" is sort of strange; as Quintin notes, it isn't that unusual for outside observers to overestimate difficulty, and so I wish he had centrally addressed the reference class tennis game; is the expertise "getting AI systems to be capable" or "getting AI systems to do what you want"?
FWIW, I thought the bit about manifolds in The difficulty of alignment was the strongest foot forward, because it paints a different detailed picture than your description that it's responding to.
That said, I don't think Quintin's picture obviously disagrees with yours (as discussed in my response over here), and I think you'd find it disappointing that he calls your description extremely misleading while not seeming to correctly identify the argument structure or check whether there's a related one that goes thru on his model.
During this process, I don’t think it’s particularly unusual for the person to notice a technical problem but overlook a clever way to solve that problem.
I think this isn't the claim; I think the claim is that it would be particularly unusual for someone to overlook that they're accidentally solving a technical problem. (It would be surprising for Edison to not be thinking hard about what filament to use and pick tungsten; in actual history, it took decades for that change to be made.)
BTW I do agree with you that Eliezer’s interview response seems to suggest that he thinks aligning an AGI to “basic notions of morality” is harder and aligning an AGI to “strawberry problem” is easier. If that’s what he thinks, it’s at least not obvious to me.
My sense (which I expect Eliezer would agree with) is that it's relatively easy to get an AI system to imitate the true underlying 'basic notions of morality', to the extent humans agree on that, but that this doesn't protect you at all as soon as you want to start making large changes, or as soon as ...
seem very implausible when considered in the context of the human learning process (could a human's visual cortex become "deceptively aligned" to the objective of modeling their visual field?).
I think it would probably be strange for the visual field to do this. But I think it's not that uncommon for other parts of the brain to do this; higher level, most abstract / "psychological" parts that have a sense of how things will affect their relevance to future decision-making. I think there are lots of self-perpetuating narratives that it might be fair to call...
John Wentworth describes the possibility of "optimization demons", self-reinforcing patterns that exploit flaws in an imperfect search process to perpetuate themselves and hijack the search for their own purposes.
But no one knows exactly how much of an issue this is for deep learning, which is famous for its ability to evade local minima when run with many parameters.
Also relevant are "Are minimal circuits daemon-free?" and "Are minimal circuits deceptive?". I agree no one knows how much of an issue this will be for deep learning.
...Additionally, I think tha
I think the bolded text is about Yudkowsky himself being wrong.
That is also how I interpreted it.
If you have a bunch of specific arguments and sources of evidence that you think all point towards a particular conclusion X, then discovering that you're wrong about something should, in expectation, reduce your confidence in X.
I think Yudkowsky is making a different statement. I agree it would be bizarre for him to be saying "if I were wrong, it would only mean I should have been more confident!"
...Yudkowsky is not the aerospace engineer building the rocke
Given the greater evidence available for general ML research, being well calibrated about the difficulty of general ML research is the first step to being well calibrated about the difficulty of ML alignment research.
I think I agree with this point but want to explicitly note the switch from the phrase 'AI alignment research' to 'ML alignment research'; my model of Eliezer thinks the second is mostly a distraction from the former, and if you think they're the same or interchangeable that seems like a disagreement.
[For example, I think ML alignment re...
in which Yudkowsky incorrectly assumed that GANs (Generative Adversarial Networks, a training method sometimes used to teach AIs to generate images) were so finicky that they must not have worked on the first try.
I do think this is a point against Yudkowsky. That said, my impression is that GANs are finicky, and I heard rumors that many people tried similar ideas and failed to get it to work before Goodfellow knocked it out of the park. If people were encouraged to publish negative results, we might have a better sense of the actual landscape here, but I t...
Yudkowsky's own prior statements seem to put him in this camp as well. E.g., here he explains why he doesn't expect intelligence to emerge from neural networks (or more precisely, why he dismisses a brain-based analogy for coming to that conclusion)
I think you're basically misunderstanding and misrepresenting Yudkowsky's argument from 2008. He's not saying "you can't make an AI out of neural networks", he's saying "your design sharing a single feature with the brain does not mean it will also share the brain's intelligence." As well, I don't think he's arg...
Finally, I'd note that having a "security mindset" seems like a terrible approach for raising human children to have good values
Do you have kids, or any experience with them? (There are three small children in the house I live in.) I think you might want to look into childproofing, and meditate on its connection to security mindset.
Yes, this isn't necessarily related to the 'values' part, but for that I would suggest things like Direct Instruction, which involves careful curriculum design to generate lots of examples so that students will reliably end up i...
It cannot be the case that successful value alignment requires perfect adversarial robustness.
It seems like the argument structure here is something like:
I disagree with point 2, tho; among other things, it looks to me like some humans are on track to accidentally summon a demon that kills both me and them, which I expect they would regret after-the-fact if they had the chance to.
So any...
As I understand it, the security mindset asserts a premise that's roughly: "The bundle of intuitions acquired from the field of computer security are good predictors for the difficulty / value of future alignment research directions."
This seems... like a correct description but it's missing the spirit?
Like the intuitions are primarily about "what features are salient" and "what thoughts are easy to think."
However, I don't see why this should be the case.
Roughly, the core distinction between software engineering and computer security is whether ...
What does this mean for alignment? How do we prevent AIs from behaving badly as a result of a similar "misgeneralization"? What alignment insights does the fleshed-out mechanistic story of humans coming to like ice cream provide?
As far as I can tell, the answer is: don't reward your AIs for taking bad actions.
uh
is your proposal "use the true reward function, and then you won't get misaligned AI"?
...That's all it would take, because the mechanistic story above requires a specific step where the human eats ice cream and activates their reward circuits. If
I think it's straightforward to explain why humans "misgeneralized" to liking ice cream.
I don't yet understand why you put misgeneralized in scare quotes, or whether you have a story for why it's a misgeneralization instead of things working as expected.
I think your story for why humans like ice cream makes sense, and is basically the story Yudkowsky would tell too, with one exception:
The ancestral environment selected for reward circuitry that would cause its bearers to seek out more of such food sources.
"such food sources" feels a little like it's ...
There's no guarantee that such a thing even exists, and implicitly aiming to avoid the one value formation process we know is compatible with our own values seems like a terrible idea.
...
It's thus vastly easier to align models to goals where we have many examples of people executing said goals.
I think there's a deep disconnect here on whether interpolation is enough or whether we need extrapolation.
The point of the strawberry alignment problem is "here's a clearly understandable specification of a task that requires novel science and engineering to execute...
I think this is extremely misleading. Firstly, real-world data in high dimensions basically never look like spheres. Such data almost always cluster in extremely compact manifolds, whose internal volume is minuscule compared to the full volume of the space they're embedded in.
I agree with your picture of how manifolds work; I don't think it actually disagrees all that much with Yudkowsky's.
That is, the thing where all humans are basically the same make and model of car, running the same brand of engine, painted different colors is the claim that the intrin...
This seems like way too high a bar. It seems clear that you can have transformative or risky AI systems that are still worse than humans at some tasks. This seems like the most likely outcome to me.
I think this is what Yudkowsky thinks also? (As for why it was relevant to bring up, Yudkowsky was answering the host's question of "How is superintelligence different than general intelligence?")
I expect future capabilities advances to follow a similar pattern as past capabilities advances, and not completely break the existing alignment techniques.
Part of this is just straight disagreement, I think; see So8res's Sharp Left Turn and follow-on discussion.
But for the rest of it, I don't see this as addressing the case for pessimism, which is not problems from the reference class that contains "the LLM sometimes outputs naughty sentences" but instead problems from the reference class that contains "we don't know how to prevent an ontological collapse...
I have a lot of responses to specific points; I'm going to make them as child comments to this comment.
Is the dataset you used for the regression available? Might be easier to generate the graphs that I'm thinking of than describe them.
[EDIT: I was confused when I wrote the earlier comment, I thought Vivek was talking about the decision square distance to the top 5x5 corner, which I do think my naive guess is plausible for; I don't have the same guess about cheese Euclidean distance to top right corner.]
My naive guess is that the other relationships are nonlinear, and this is the best way to approximate those relationships out of just linear relationships of the variables the regressor had access to.
Good GPUs feels kind of orthogonal.
IMO it's much easier to support high investment numbers in "AI" if you consider lots of semiconductor / AI hardware startup stuff as "AI investments". My suspicion is that while GPUs were primarily a crypto thing for the last few years, the main growth outlook driving more investment is them being an AI thing.
I've written a bunch elsewhere about object-level thoughts on ELK. For this review, I want to focus instead on meta-level points.
I think ELK was very well-made; I think it did a great job of explaining itself with lots of surface area, explaining a way to think about solutions (the builder-breaker cycle), bridging the gap between toy demonstrations and philosophical problems, and focusing lots of attention on the same thing at the same time. In terms of impact on the growth and development of the AI safety community, I think this is one of the most importa...
Sure, points from a scoring rule come both from 'skill' (whether or not you're accurate in your estimates) and 'calibration' (whether your estimates line up with the underlying propensity).
Rather than generating the picture I'm thinking of (sorry, up to something else and so just writing a quick comment), I'll describe it: watch this animation, and see the implied maximum expected score as a function of p (the forecaster's true belief). For all of the scoring rules, it's a convex function with maxima at 0 and 1. (You can get 1 point on average with a linea...
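Since the picture isn't generated here, a rough sketch of the quantity being described may help. It computes the expected score of an honest forecaster as a function of p for two standard rules, the logarithmic rule and a Brier-style quadratic rule (written so that higher is better). The function names are made up for illustration, and the animation linked above may use different rules or scalings.

```python
import math

def log_score(q, outcome):
    """Logarithmic rule: log of the probability assigned to what happened."""
    prob = q if outcome == 1 else 1.0 - q
    return math.log(prob) if prob > 0.0 else float("-inf")

def brier_score(q, outcome):
    """Quadratic (Brier-style) rule, flipped so that higher is better."""
    return 1.0 - (outcome - q) ** 2

def expected_score(rule, p, q):
    """Expected score for reporting q when the event has probability p."""
    total = 0.0
    if p > 0.0:
        total += p * rule(q, 1)
    if p < 1.0:
        total += (1.0 - p) * rule(q, 0)
    return total

def honest_value(rule, p):
    """Expected score when the forecaster reports their true belief p."""
    return expected_score(rule, p, p)

def best_report(rule, p):
    """Grid search for the report q that maximizes expected score given p."""
    grid = [i / 100 for i in range(1, 100)]
    return max(grid, key=lambda q: expected_score(rule, p, q))

# Tabulate the honest expected score across beliefs p = 0.0, 0.1, ..., 1.0.
for rule in (log_score, brier_score):
    values = [round(honest_value(rule, i / 10), 3) for i in range(11)]
    print(rule.__name__, values)
```

For both rules the honest expected score dips in the middle and peaks at p = 0 and p = 1, and `best_report` lands on the true belief, which is the properness property.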
But we show for any strictly proper scoring rule that there is a function such that a dishonest prediction is optimal.
Agreed for proper scoring rules, but I'd be a little surprised if it's not possible to make a skill-free scoring rule, and then get an honest prediction result for that. [This runs into other issues--if the scoring rule is skill-free, where does the skill come from?--but I think this can be solved by having oracle-mode and observation-mode, and being able to do honest oracle-mode at all would be nice.]
I think there's an existing phrase called "defense in depth", which somehow feels... more like the right spirit? [This is related to the 'swiss cheese' model you bring up in the motivation section.] It's not that we're going to throw together a bunch of miscellaneous stuff and it'll work; it's that we're not going to trust any particular defense that we have enough that we don't also want other defenses.
Great post!
My interpretation is that: as an oracle varies its prediction, it both picks up points from becoming more accurate and from making the outcome more predictable. This means the proper scoring rules, tuned to only one pressure, will result in a dishonest equilibrium instead of an honest one.
(Incidentally, this seems like a pretty sensible explanation for why humans are systematically overconfident; if confidence increases success, then the point when the benefits of increased success match the costs of decreased accuracy will be more extreme than ...
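A toy version of the interpretation above, as a sketch under made-up assumptions: the oracle's report q shifts the success probability through a hypothetical linear response `success_probability(q) = 0.3 + 0.6 * q` ("confidence increases success"), and the oracle is paid by a Brier-style quadratic rule. Neither the response function nor the numbers come from the post; they are only chosen to make the two pressures visible.

```python
def success_probability(q):
    """Hypothetical response: a more confident report makes success more likely."""
    return 0.3 + 0.6 * q

def expected_brier(q):
    """Expected quadratic score (higher is better) when the report is q
    and the outcome probability is the one that report induces."""
    p = success_probability(q)
    return p * (1.0 - (1.0 - q) ** 2) + (1.0 - p) * (1.0 - q ** 2)

grid = [i / 100 for i in range(101)]

# Honest fixed point: the report that matches the probability it induces.
honest = min(grid, key=lambda q: abs(success_probability(q) - q))

# Score-maximizing report under the same proper scoring rule.
best = max(grid, key=expected_brier)

print("honest fixed point:", honest, "score:", expected_brier(honest))
print("score-maximizing report:", best, "score:", expected_brier(best))
print("probability induced by that report:", success_probability(best))
```

In this toy setup the score-maximizing report is more extreme than the honest fixed point and no longer equals the probability it induces, which is the dishonest-equilibrium pattern; a different response function could move the optimum elsewhere.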
I want to point out that I think the typical important case looks more like "wanting to do things for unusual reasons," and if you're worried about this approach breaking down there that seems like a pretty central obstacle. For example, suppose rather than trying to maintain a situation (the diamond stays in the vault) we're trying to extrapolate (like coming up with a safe cancer cure). When looking at a novel medication to solve an unsolved problem, we won't be able to say "well, it cures the cancer for the normal reason" because there aren't any positi...
Is this saying "if model performance is getting better, then maybe it will have a sharp left turn, and if model performance isn't getting better, then it won't"?
In particular, all of the RLHF work is basically capabilities work which makes alignment harder in the long term (because it directly selects for deception), while billing itself as "alignment".
I share your opinion of RLHF work but I'm not sure I share your opinion of its consequences. For situations where people don't believe arguments that RLHF is fundamentally flawed because they're too focused on empirical evidence over arguments, the generation of empirical evidence that RLHF is flawed seems pretty useful for convincing them!
This might imply a predictive circuit for predicting the output of the antecedent-computation-reinforcer, but I don't see why it implies internal reward-orientation motivational edifices.
Sorry, if I'm reading this right, we're hypothesizing internal reward-orientation motivational edifices, and then asking the question of whether or not policy gradients will encourage them or discourage them. Quintin seems to think "nah, it needs to take an action before that action can be rewarded", and my response is "wait, isn't this going to be straightforwardly encour...
For fixed-duration reasoning chains, or in situations where the AI is trying to facilitate cooperation between different passes (like in the proposed scenario where it gets information but then loses access to it), this seems important. For example, in a situation where you show the model a passage, let it make notes, and then have the same model look just at the notes and answer a reading comprehension question, I expect it will likely want to embed information in the null space of the notes. [Or in an HCH-style situation, it might want to embed the ...
so in pure non-lookahead (e.g. model free) sample-based policy gradient estimation, an action which has never been tried can not be reinforced (except as a side effect of generalisation by function approximation).
This is the bit I don't believe, actually. [Or at least don't think is relevant.] Note that in Wei_Dai's hypothetical, the neural net architecture has a particular arrangement such that "how much it optimizes for reward" is either directly or indirectly implied by the neural network weights. [We're providing the reward as part of its observations,...
If the agent doesn’t explore in the direction of answering “good”, then there’s no gradient in that direction.
Wait, I don't think this is true? At least, I'd appreciate it being stepped thru in more detail.
In the simplest story, we're imagining an agent whose policy is $\pi_\theta$ and, for simplicity's sake, $\theta_0$ is a scalar that determines "how much to maximize for reward" and all the other parameters of $\theta$ store other things about the dynamics of the world / decision-making process.
It seems to me that $\nabla_\theta$ is obviously going to ...
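One way to step thru a version of this, as a toy sketch and not a claim about what either commenter had in mind: a two-action policy whose logits are a scalar (called `theta0` here) times the per-action reward, which the policy is assumed to observe. The rewards, the softmax form, and all names are assumptions made for illustration.

```python
import math

REWARDS = [0.0, 1.0]  # hypothetical per-action rewards, visible to the policy

def policy(theta0):
    """Softmax over logits theta0 * reward; theta0 sets how hard to maximize."""
    logits = [theta0 * r for r in REWARDS]
    top = max(logits)
    weights = [math.exp(l - top) for l in logits]
    total = sum(weights)
    return [w / total for w in weights]

def expected_reward(theta0):
    return sum(p * r for p, r in zip(policy(theta0), REWARDS))

def reinforce_term(theta0, action):
    """Single-sample REINFORCE estimate (no baseline) of d E[reward] / d theta0:
    reward(action) * d log pi(action) / d theta0."""
    mean_r = expected_reward(theta0)
    return REWARDS[action] * (REWARDS[action] - mean_r)

def exact_gradient(theta0):
    """Expectation of the single-sample estimate over the policy's own actions."""
    return sum(p * reinforce_term(theta0, a)
               for a, p in enumerate(policy(theta0)))

for t in (-3.0, 0.0, 3.0):
    print(t, policy(t), exact_gradient(t))
```

In this toy, the expected gradient on `theta0` is positive everywhere (it equals the variance of reward under the policy), but without a baseline the individual estimate is zero whenever the sampled action is the zero-reward one. So whether "no exploration, no gradient" holds depends on how the estimate is formed and on how the parameters are shared across actions.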
If you have a space with two disconnected components, then I'm calling the distinction between them "crisp."
The components feel disconnected to me in 1D, but I'm not sure they would feel disconnected in 3D or in ND. Is your intuition that they're 'durably disconnected' (even looking at the messy plan-space of the real-world, we'll be able to make a simple classifier that rates corrigibility), or if not, when the connection comes in (like once you can argue about philosophy in way X, once you have uncertainty about your operator's preferences, once you have...
I agree this list doesn't seem to contain much unpublished material, and I think the main value of having it in one numbered list is that "all of it is in one, short place", and it's not an "intro to computers can think" and instead is "these are a bunch of the reasons computers thinking is difficult to align".
The thing that I understand to be Eliezer's "main complaint" is something like: "why does it seem like No One Else is discovering new elements to add to this list?". Like, I think Risks From Learned Optimization was great, and am glad you and others ...
Why is the process by which humans come to reliably care about the real world
IMO this process seems pretty unreliable and fragile. Drugs are popular; video games are popular; people-in-aggregate put more effort into obtaining imaginary afterlives than life extension or cryonics.
But also humans have a much harder time 'optimizing against themselves' than AIs will, I think. I don't have a great mechanistic sense of what it will look like for an AI to reliably care about the real world.
Not saying that this should be MIRI's job, rather stating that I'm confused because I feel like we as a community are not taking an action that would seem obvious to me.
I wrote about this a bit before, but in the current world my impression is that actually we're pretty capacity-limited, and so the threshold is not "would be good to do" but "is better than my current top undone item". If you see something that seems good to do that doesn't have much in the way of unilateralist risk, you doing it is probably the right call. [How else is the field going to get more capacity?]
...Notably, humans were once much less powerful, in our hunter-gatherer days, but over time, through the gradual process of accumulating technology, knowledge, and culture, humans now possess vast productive capacities that far outstrip our ancient powers.
Similarly, our ability to coordinate through language also plays a huge role in explaining our power compared to other animals. But, on a first approximation, other animals can't coordinate at all, making this distinction much less impressive. The first AGIs we construct will be born into a culture already c
But... shouldn't this mean you expect AGI civilization to totally dominate human civilization? They can read each other's source code, and thus trust much more deeply! They can transmit information between them at immense bandwidths! They can clone their minds and directly learn from each other's experiences!
I don't think it's obvious that this means that AGI is more dangerous, because it means that for a fixed total impact of AGI, the AGI doesn't have to be as competent at individual thinking (because it leans relatively more on group thinking). And so at...
Not sure exactly what this means.
My read was that for systems where you have rock-solid checking steps, you can throw arbitrary amounts of compute at searching for things that check out and trust them, but if there's any crack in the checking steps, then things that 'check out' aren't trustable, because the proposer can have searched an unimaginably large space (from the rater's perspective) to find them. [And from the proposer's perspective, the checking steps are the real spec, not whatever's in your head.]
In general, I think we can get a minor edge from...
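The "crack in the checking steps" reading can be made concrete with a deliberately contrived sketch. Everything here is hypothetical: the intended spec is "multiples of 100", the checker is correct except for one wrongly accepted input (`CRACK`), and the proposer has a private payoff that the spec would force to zero.

```python
CRACK = 73_421  # hypothetical input the checker wrongly accepts

def meets_spec(x):
    """The intended spec (what is in your head): multiples of 100."""
    return x % 100 == 0

def checker(x):
    """The implemented check: right everywhere except for one crack."""
    return meets_spec(x) or x == CRACK

def proposer_payoff(x):
    """Toy private objective: 'corners cut', zero for anything meeting the spec."""
    return x % 100

def propose(budget):
    """Search candidates 0..budget-1 and return the checker-approved
    candidate with the highest private payoff."""
    approved = [x for x in range(budget) if checker(x)]
    return max(approved, key=proposer_payoff)

small = propose(1_000)    # modest search: never reaches the crack
large = propose(100_000)  # large search: finds and prefers the crack

print(small, meets_spec(small))
print(large, meets_spec(large))
```

With the small budget the approved answer also meets the intended spec; with the large budget the proposer returns the one input that passes the check without meeting it. The checker's error rate is tiny in both cases; only the amount of search changed.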
If Alice lies in order to get influence, with the hope of later using that influence for altruistic ends, it seems fair to call the influence Alice gets 'personal gain'. After all, it's her sense of altruism that will be promoted, not a generic one.
This is not what most people mean by "for personal gain". (I'm not disputing that Alice gets personal gain)
Insofar as the influence is required for altruistic ends, aiming for it doesn't imply aiming for personal gain.
Insofar as the influence is not required for altruistic ends, we have no basis to believe Alice was aiming for it.
"You're just doing that for personal gain!" is not generally taken to mean that you may be genuinely doing your best to create a better world for everyone, as you see it, in a way that many would broadly endorse.
In this context, a...