My Objections to "We’re All Gonna Die with Eliezer Yudkowsky"

I have a lot of responses to specific points; I'm going to make them as child comments to this comment.

What does this mean for alignment? How do we prevent AIs from behaving badly as a result of a similar "misgeneralization"? What alignment insights does the fleshed-out mechanistic story of humans coming to like ice cream provide?
As far as I can tell, the answer is: don't reward your AIs for taking bad actions.

Is your proposal "use the true reward function, and then you won't get misaligned AI"?

That's all it would take, because the mechanistic story above requires a specific step where the human eats ice cream and activates their reward circuits. If you stop the human from receiving reward for eating ice cream, then the human no longer becomes more inclined to navigate towards eating ice cream in the future.
Note that I'm not saying this is an easy task, especially since modern RL methods often use learned reward functions whose exact contours are unknown to their creators.
But from what I can tell, Yudkowsky's position is that we need an entirely new paradigm to even begin to address these sorts of failures.
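As a toy sketch of the mechanistic step the quoted passage describes (an action only becomes more favored if taking it produces reward), consider the following. Everything here is a hypothetical illustration: the action names, the reward numbers, and the update rule `prefs[action] += lr * reward` are assumptions chosen for simplicity, not a model of human reward circuitry or of any real RL method.

```python
# Toy sketch (all names and numbers are hypothetical): a preference for an
# action grows only when taking that action produces reward.

def train(reward_fn, actions, steps=100, lr=0.1):
    """Try each action in turn and reinforce it in proportion to its reward."""
    prefs = {a: 0.0 for a in actions}
    for step in range(steps):
        action = actions[step % len(actions)]    # round-robin, no exploration policy
        prefs[action] += lr * reward_fn(action)  # reinforcement proportional to reward
    return prefs

ACTIONS = ["eat_berries", "eat_ice_cream"]

def evolved_reward(action):
    # Assumption: sugar/fat-sensitive circuitry fires harder for ice cream.
    return {"eat_berries": 1.0, "eat_ice_cream": 3.0}[action]

def blocked_reward(action):
    # Same circuitry, but reward for eating ice cream is withheld.
    return {"eat_berries": 1.0, "eat_ice_cream": 0.0}[action]

with_reward = train(evolved_reward, ACTIONS)
without_reward = train(blocked_reward, ACTIONS)

print("with reward:   ", with_reward)
print("without reward:", without_reward)
```

In this sketch the ice cream preference stays at zero when its reward is withheld. It only covers actions that were actually tried during training; it says nothing about what happens on inputs the training process never saw.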

These three paragraphs feel incoherent to me. The human eating ice cream and activating their reward circuits is exactly what you would expect under the current para...

2Logan Riggs3y

My shard-theory-inspired story is to make an AI that:

1. Has a good core of human values (this is still hard).
2. Can identify when experiences will change itself to lead to less of the initial good values. (This is the meta-preferences point, with GPT-4 sort of expressing that it would avoid jailbreak inputs.)

Then the model can safely scale.

This doesn’t require having the true reward function (which I imagine to be a giant lookup table created by Omega), but it does require some mech interp and an understanding of its own reward function. I don’t expect this to be an entirely different paradigm; I even think current methods of RLHF might just naively work. Who knows? (I do think we should try to figure it out though! I do have greater uncertainty and less pessimism.)

Analogously, I do believe I do a good job of avoiding value-destroying inputs (e.g., addictive substances), even though my reward function isn’t as clear and legible as what our AIs’ will be, AFAIK.

1Vaniver3y

If there are experiences which will change itself which don't lead to less of the initial good values, then yeah, for an approximate definition of safety. You're resting everything on the continued strength of this model as capabilities increase, and so if it fails before you top out the scaling I think you probably lose. FWIW I don't really see your description as, like, a specific alignment strategy so much as the strategy of "have an alignment strategy at all". The meat is all in 1) how you identify the core of human values and 2) how you identify which experiences will change the system to have less of the initial good values, but, like, figuring out the two of those would actually solve the problem!

[-]Vaniver3y1310

As I understand it, the security mindset asserts a premise that's roughly: "The bundle of intuitions acquired from the field of computer security are good predictors for the difficulty / value of future alignment research directions."

This seems... like a correct description but it's missing the spirit?

Like the intuitions are primarily about "what features are salient" and "what thoughts are easy to think."

However, I don't see why this should be the case.

Roughly, the core distinction between software engineering and computer security is whether the system is thinking back. Software engineering typically involves working with dynamic systems and thinking optimistically about how the system could work. Computer security typically involves working with reactive systems and thinking pessimistically about how the system could break.

I think it is an extremely basic AI alignment skill to look at your alignment proposal and ask "how does this break?" or "what happens if the AI thinks about this?".

Additionally, there's a straightforward reason why alignment research (specifically the part of alignment that's about training AIs to have good values) is not like security: there's usually

...

2Quintin Pope2y

Yes, and my point in that section is that the fundamental laws governing how AI training processes work are not "thinking back". They're not adversaries. If you created a misaligned AI, then it would be "thinking back", and you'd be in an adversarial position where security mindset is appropriate. "Building an AI that doesn't game your specifications" is the actual "alignment question" we should be doing research on. The mathematical principles which determine how much a given AI training process games your specifications are not adversaries. It's also a problem we've made enormous progress on, mostly by using large pretrained models with priors over how to appropriately generalize from limited specification signals. E.g., Learning Which Features Matter: RoBERTa Acquires a Preference for Linguistic Generalizations (Eventually) shows how the process of pretraining an LM causes it to go from "gaming" a limited set of finetuning data via shortcut learning / memorization, to generalizing with the appropriate linguistic prior knowledge.

[-]Vaniver2y40

"Building an AI that doesn't game your specifications" is the actual "alignment question" we should be doing research on.

Ok, it sounds to me like you're saying:

"When you train ML systems, they game your specifications because the training dynamics are too dumb to infer what you actually want. We just need One Weird Trick to get the training dynamics to Do What You Mean Not What You Say, and then it will all work out, and there's not a demon that will create another obstacle given that you surmounted this one."

That is, training processes are not neutral; there's the bad training processes that we have now (or had before the recent positive developments) and eventually will be good training processes that create aligned-by-default systems.

Is this roughly right, or am I misunderstanding you?

[-]Vaniver2y40

If you created a misaligned AI, then it would be "thinking back", and you'd be in an adversarial position where security mindset is appropriate.

Cool, we agree on this point.

my point in that section is that the fundamental laws governing how AI training processes work are not "thinking back". They're not adversaries.

I think we agree here on the local point but disagree on its significance to the broader argument. [I'm not sure how much we agree: I think of training dynamics as 'neutral', but also I think of them as searching over program-space in order to find a program that performs well on a (loss function, training set) pair, and so you need to be reasoning about search. But I think we agree the training dynamics are not trying to trick you / be adversarial and instead are straightforwardly 'trying' to make Number Go Down.]

In my picture, we have the neutral training dynamics paired with the (loss function, training set) which creates the AI system, and whether the resulting AI system is adversarial or not depends mostly on the choice of (loss function, training set). It seems to me that we probably have a disagreement about how much of the space of (loss function, training set) le...

[-]Vaniver3y1220

in which Yudkowsky incorrectly assumed that GANs (Generative Adversarial Networks, a training method sometimes used to teach AIs to generate images) were so finicky that they must not have worked on the first try.

I do think this is a point against Yudkowsky. That said, my impression is that GANs are finicky, and I heard rumors that many people tried similar ideas and failed to get it to work before Goodfellow knocked it out of the park. If people were encouraged to publish negative results, we might have a better sense of the actual landscape here, but I think a story of "Goodfellow was unusually good at making GANs and this is why he got it right on his first try" is more compelling to me than "GANs were easy actually".

[-]Vaniver3y117

Yudkowsky's own prior statements seem to put him in this camp as well. E.g., here he explains why he doesn't expect intelligence to emerge from neural networks (or more precisely, why he dismisses a brain-based analogy for coming to that conclusion)

I think you're basically misunderstanding and misrepresenting Yudkowsky's argument from 2008. He's not saying "you can't make an AI out of neural networks", he's saying "your design sharing a single feature with the brain does not mean it will also share the brain's intelligence." As well, I don't think he's arguing about how AI will actually get made; I think he's mostly criticizing the actual AGI developers/enthusiasts that he saw at the time (who were substantially less intelligent and capable than the modern batch of AGI developers).

I think that post has held up pretty well. The architectures used to organize neural networks are quite important, not just the base element. Someone whose only plan was to make their ANN wide would not reach AGI; they needed to do something else, that didn't just rely on surface analogies.

0Quintin Pope2y

There was an entire thread about Yudkowsky's past opinions on neural networks, and I agree with Alex Turner's evidence that Yudkowsky was dubious. I also think people who used brain analogies as the basis for optimism about neural networks were right to do so.

[-]Vaniver3y1012

Finally, I'd note that having a "security mindset" seems like a terrible approach for raising human children to have good values

Do you have kids, or any experience with them? (There are three small children in the house I live in.) I think you might want to look into childproofing, and meditate on its connection to security mindset.

Yes, this isn't necessarily related to the 'values' part, but for that I would suggest things like Direct Instruction, which involves careful curriculum design to generate lots of examples so that students will reliably end up inferring the correct rule.

In short, I think the part of 'raising children' which involves the kids being intelligent as well and independently minded does benefit from security mindset.

As you mention in the next paragraph, this is a long-standing disagreement; I might as well point at the discussion of the relevance of raising human children to instilling goals in an AI in The Detached Lever Fallacy. The short summary of it is that humans have a wide range of options for their 'values', and are running some strategy of learning from their environment (including their parents and their style of raising children) which values ...

[-]Vaniver3y1014

I think it's straightforward to explain why humans "misgeneralized" to liking ice cream.

I don't yet understand why you put misgeneralized in scare quotes, or whether you have a story for why it's a misgeneralization instead of things working as expected.

I think your story for why humans like ice cream makes sense, and is basically the story Yudkowsky would tell too, with one exception:

The ancestral environment selected for reward circuitry that would cause its bearers to seek out more of such food sources.

"such food sources" feels a little like it's eliding the distinction between "high-quality food sources of the ancestral environment" and "foods like ice cream"; the training dataset couldn't differentiate between functions f and g, but those functions differ in their reaction to the test set (ice cream). Yudkowsky's primary point with this section, as I understand it, is that even if you-as-evolution know that you want g, the only way you can communicate that under the current learning paradigm is with training examples, and it may be non-obvious which functions f need to be excluded.
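To make the f-versus-g point concrete, here is a minimal sketch of two functions that agree on every training example but diverge on an out-of-distribution input. The feature encoding (sweetness, fat), the numbers, and both function bodies are assumptions invented for illustration; nothing here is claimed to be how evolved reward circuitry works.

```python
# Hypothetical ancestral foods encoded as (sweetness, fat) pairs.
TRAIN = [(0.2, 0.1), (0.5, 0.3), (0.7, 0.4), (0.3, 0.2)]

# Hypothetical test point far outside the training range.
ICE_CREAM = (0.95, 0.9)

def f(sweetness, fat):
    """Learned rule: more sugar and fat is always better."""
    return sweetness + fat

def g(sweetness, fat):
    """Intended rule: same score, but only inside the ancestral range."""
    if sweetness > 0.8 or fat > 0.6:
        return 0.0  # outside the range the designer actually cared about
    return sweetness + fat

# No training example separates the two rules...
agree_on_train = all(f(*x) == g(*x) for x in TRAIN)

# ...but they react differently to the test input.
print("agree on training set:", agree_on_train)
print("f(ice cream) =", f(*ICE_CREAM))
print("g(ice cream) =", g(*ICE_CREAM))
```

Any learner that only ever sees `TRAIN` has no signal for preferring g over f; ruling out f would need a training example from the region where they differ.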

[-]Vaniver3y69

seem very implausible when considered in the context of the human learning process (could a human's visual cortex become "deceptively aligned" to the objective of modeling their visual field?).

I think it would probably be strange for the visual field to do this. But I think it's not that uncommon for other parts of the brain to do this; higher level, most abstract / "psychological" parts that have a sense of how things will affect their relevance to future decision-making. I think there are lots of self-perpetuating narratives that it might be fair to call 'deceptively aligned' when they're maladaptive. The idea of metacognitive blindspots also seems related.

1Quintin Pope2y

I believe the human visual cortex is actually the more relevant comparison point for estimating the level of danger we face due to mesaoptimization. Its training process is more similar to the self-supervised / offline way in which we train (base) LLMs. In contrast, 'most abstract / "psychological"' are more entangled in future decision-making. They're more "online", with greater ability to influence their future training data. I think it's not too controversial that online learning processes can have self-reinforcing loops in them. Crucially however, such loops rely on being able to influence the externally visible data collection process, rather than being invisibly baked into the prior. They are thus much more amenable to being addressed with scalable oversight approaches.

[-]Vaniver3y54

John Wentworth describes the possibility of "optimization demons", self-reinforcing patterns that exploit flaws in an imperfect search process to perpetuate themselves and hijack the search for their own purposes.
But no one knows exactly how much of an issue this is for deep learning, which is famous for its ability to evade local minima when run with many parameters.

Also relevant is Are minimal circuits daemon-free? and Are minimal circuits deceptive?. I agree no one knows how much of an issue this will be for deep learning.

Additionally, I think that, if deep learning models develop such phenomena, then the brain likely does so as well.

I think the brain obviously has such phenomena, and societies made up of humans also obviously have such phenomena. I think it is probably not adaptive (optimization demons are more like 'cognitive cancer' than 'part of how values form', I think, but in part that's because the term comes with the disapproval built in).

[-]Quintin Pope2y40

I've recently decided to revisit this post. I'll try to address all un-responded to comments in the next ~2 weeks.

4Vaniver3y

That is also how I interpreted it. I think Yudkowsky is making a different statement. I agree it would be bizarre for him to be saying "if I were wrong, it would only mean I should have been more confident!" I think he is (inside of the example). He's saying "suppose an engineer is wrong about how their design works. Is it more likely that the true design performs better along multiple important criteria than expectation, or that the design performs worse (or fails to function at all)?" Note that 'expectation' is referring to the confidence level inside an argument, but arguments aren't Bayesians; it's the outside agent that shouldn't be expected to predictably update. Another way to put this: does the engineer expect to be disappointed, excited, or neutral if the design doesn't work as planned? Typically, disappointed, implying the plan is overly optimistic compared to reality. If this weren't true--if engineers were calibrated or pessimistic--then I think Yudkowsky would be wrong here (and also probably have a different argument to begin with).

4Vaniver3y

I think I agree with this point but want to explicitly note the switch from the phrase 'AI alignment research' to 'ML alignment research'; my model of Eliezer thinks the second is mostly a distraction from the former, and if you think they're the same or interchangeable that seems like a disagreement. [For example, I think ML alignment research includes stuff like "will our learned function be robust to distributional shift in the inputs?" and "does our model discriminate against protected classes?" whereas AI alignment research includes stuff like "will our system be robust to changes in the number of inputs?" and "is our model deceiving us about its level of understanding?". They're related in some ways, but pretty deeply distinct.]

4Vaniver3y

I agree with your picture of how manifolds work; I don't think it actually disagrees all that much with Yudkowsky's. That is, the thing where all humans are basically the same make and model of car, running the same brand of engine, painted different colors is the claim that the intrinsic dimension of human minds is pretty small. (Taken literally, it's 3, for the three dimensions of color-space.) And so if you think there are, say, 40 intrinsic dimensions to mind-space, and humans are fixed on 37 of the points and variable on the other 3, well, I think we have basically the Yudkowskian picture. (I agree if Yudkowsky's picture was that there were 40M dimensions and humans varied on 3, this would be comically wrong, but I don't think this is what he's imagining for that argument.)
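A rough sketch of the "40 dimensions, 37 fixed, 3 variable" picture, assuming a deliberately simplified setting where the varying directions are axis-aligned coordinates. Real intrinsic-dimension estimates do not assume this, and the names `template`, `sample_mind`, and `varying_dims` are hypothetical.

```python
import random

random.seed(0)

TOTAL_DIMS, FREE_DIMS = 40, 3

# Shared "make and model": one fixed value per coordinate of mind-space.
template = [random.random() for _ in range(TOTAL_DIMS)]

def sample_mind():
    """Copy the shared template and vary only the first FREE_DIMS coordinates."""
    mind = list(template)
    for i in range(FREE_DIMS):
        mind[i] = random.random()  # the "paint color" coordinates
    return mind

def varying_dims(points, tol=1e-12):
    """Count coordinates whose value differs across the sampled points."""
    count = 0
    for i in range(len(points[0])):
        column = [p[i] for p in points]
        if max(column) - min(column) > tol:
            count += 1
    return count

minds = [sample_mind() for _ in range(200)]
n_varying = varying_dims(minds)
print("coordinates that vary across sampled minds:", n_varying)
```

The population lives in a 40-dimensional space but only spreads out along 3 of its coordinates, which is the sense in which the sampled minds form a low-dimensional subset.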

1Quintin Pope2y

Addressing this objection is why I emphasized the relatively low information content that architecture / optimizers provide for minds, as compared to training data. We've gotten very far in instantiating human-like behaviors by training networks on human-like data. I'm saying the primacy of data for determining minds means you can get surprisingly close in mindspace, as compared to if you thought architecture / optimizer / etc were the most important. Obviously, there are still huge gaps between the sorts of data that an LLM is trained on versus the implicit loss functions human brains actually minimize, so it's kind of surprising we've even gotten this far. The implication I'm pointing to is that it's feasible to get really close to human minds along important dimensions related to values and behaviors, even without replicating all the quirks of human mental architecture.

3Vaniver3y

It seems like the argument structure here is something like:

1. This requirement is too stringent for humans to follow.
2. Humans have successful value alignment.
3. Therefore this requirement cannot be necessary for successful value alignment.

I disagree with point 2, tho; among other things, it looks to me like some humans are on track to accidentally summon a demon that kills both me and them, which I expect they would regret after-the-fact if they had the chance to. So any reasoning that's like "well so long as it's not unusual we can be sure it's safe" runs into the thing where we're living in the acute risk period. The usual is not safe!

This seems definitely right to me. An expectation I have is that this will also generate resistance to alignment techniques / control by its operators, which perhaps complicates how benign this is. [FWIW I also don't think we want an AI that's perfectly robust to all possible adversarial attacks; I think we want one that's adequate to defend against the security challenges it faces, many of which I expect to be internal. Part of this is because I'm mostly interested in AI planning systems able to help with transformative changes to the world instead of foundational models used by many customers for small amounts of cognition, which are totally different business cases and have different security problems.]

3Vaniver3y

I think there's a deep disconnect here on whether interpolation is enough or whether we need extrapolation. The point of the strawberry alignment problem is "here's a clearly understandable specification of a task that requires novel science and engineering to execute on. Can you do that safely?". If your ambitions are simply to have AI customer service bots, you don't need to solve this problem. If your ambitions include cognitive megaprojects which will need to be staffed at high levels by AI systems, then you do need to solve this problem. More pragmatically, if your ambitions include setting up some sort of system that prevents people from deploying rogue AI systems while not dramatically curtailing Earth's potential, that isn't a goal that we have many examples of people executing on. So either we need to figure it out with humans or, if that's too hard, create an AI system capable of figuring it out (which probably requires an AI leader instead of an AI assistant).

3Vaniver3y

Part of this is just straight disagreement, I think; see So8res's Sharp Left Turn and follow-on discussion. But for the rest of it, I don't see this as addressing the case for pessimism, which is not problems from the reference class that contains "the LLM sometimes outputs naughty sentences" but instead problems from the reference class that contains "we don't know how to prevent an ontological collapse, where meaning structures constructed under one world-model compile to something different under a different world model." Or, like, once LLMs gain the capability to design proteins (because you added in a relevant dataset, say), do you really expect the 'helpful, harmless, honest' alignment techniques that were used to make a chatbot not accidentally offend users to also work for making a biologist-bot not accidentally murder patients? Put another way, I think new capabilities advances reveal new alignment challenges and unless alignment techniques are clearly cutting at the root of the problem, I don't expect that they will easily transfer to those new challenges.

[-]Quintin Pope2y3-2

Part of this is just straight disagreement, I think; see So8res's Sharp Left Turn and follow-on discussion.

Evolution provides no evidence for the sharp left turn

But for the rest of it, I don't see this as addressing the case for pessimism, which is not problems from the reference class that contains "the LLM sometimes outputs naughty sentences" but instead problems from the reference class that contains "we don't know how to prevent an ontological collapse, where meaning structures constructed under one world-model compile to something different under a different world model."

I dislike this minimization of contemporary alignment progress. Even just limiting ourselves to RLHF, that method addresses far more problems than "the LLM sometimes outputs naughty sentences". E.g., it also tackles problems such as consistently following user instructions, reducing hallucinations, improving the topicality of LLM suggestions, etc. It allows much more significant interfacing with the cognition and objectives pursued by LLMs than just some profanity filter.

I don't think ontological collapse is a real issue (or at least, not an issue that appropriate training data can't solve in a relatively stra...

2Vaniver3y

I think this is what Yudkowsky thinks also? (As for why it was relevant to bring up, Yudkowsky was answering the host's question of "How is superintelligence different than general intelligence?")

[-]DanielFilan3y1813

But have you ever, even once in your life, thought anything remotely like "I really like being able to predict the near-future content of my visual field. I should just sit in a dark room to maximize my visual cortex's predictive accuracy."?

I think I've been in situations where I've been disoriented by a bunch of random stuff happening and wished that less of it was happening so that I could get a better handle on stuff. An example I vividly recall was being in a history class in high school and being very bothered by the large number of conversations happening around me.

3DanielFilan3y

This comment doesn't really engage much with your post - there's a lot there and I thought I'd pick one point to get a somewhat substantive disagreement. But I ended up finding this question and thought that I should answer it.

2DanielFilan3y

To tie up this thread: I started writing a more substantive response to a section but it took a while and was difficult and I then got invited to dinner, so probably won't get around to actually writing it.

[-]Eliezer Yudkowsky3y17-15

This is kinda long. If I had time to engage with one part of this as a sample of whether it holds up to a counterresponse, what would be the strongest foot you could put forward?

(I also echo the commenter who's confused about why you'd reply to the obviously simplified presentation from an off-the-cuff podcast rather than the more detailed arguments elsewhere.)

[-]iceman3y534

This response is enraging.

Here is someone who has attempted to grapple with the intellectual content of your ideas and your response is "This is kinda long."? I shouldn't be that surprised because, IIRC, you said something similar in response to Zack Davis' essays on the Map and Territory distinction, but that's ancillary and AI is core to your memeplex.

I have heard repeated claims that people don't engage with the alignment community's ideas (recent example from yesterday). But here is someone who did the work. Please explain why your response here does not cause people to believe there's no reason to engage with your ideas because you will brush them off. Yes, nutpicking e/accs on Twitter is much easier and probably more hedonic, but they're not convincible and Quintin here is.

[-]Eliezer Yudkowsky3y96

Choosing to engage with an unscripted unrehearsed off-the-cuff podcast intended to introduce ideas to a lay audience, continues to be a surprising concept to me. To grapple with the intellectual content of my ideas, consider picking one item from "A List of Lethalities" and engaging with that.

[-]TurnTrout3y127

Here are some of my disagreements with List of Lethalities. I'll quote item one:

“Humans don't explicitly pursue inclusive genetic fitness; outer optimization even on a very exact, very simple loss function doesn't produce inner optimization in that direction. This happens in practice in real life, it is what happened in the only case we know about, and it seems to me that there are deep theoretical reasons to expect it to happen again”
(Evolution) → (human values) is not the only case of inner alignment failure which we know about. I have argued that human values themselves are inner alignment failures on the human reward system. This has happened billions of times in slightly different learning setups.

1Vaniver3y

I have attempted to respond to the whole post over here.

1Vaniver3y

FWIW, I thought the bit about manifolds in The difficulty of alignment was the strongest foot forward, because it paints a different detailed picture than your description that it's responding to. That said, I don't think Quintin's picture obviously disagrees with yours (as discussed in my response over here), and I think you'd find it disappointing that he calls your description extremely misleading while not seeming to correctly identify the argument structure and check whether there's a related one that goes thru on his model.

-3lc3y

[-]cousin_it3y11-1

As far as I can tell, the answer is: don’t reward your AIs for taking bad actions.

I think there's a mistake here which kind of invalidates the whole post. If we don't reward our AI for taking bad actions within the training distribution, it's still very possible that in the future world, looking quite unlike the training distribution, the AI will be able to find such an action. Same as ice cream wasn't in evolution's training distribution for us, but then we found it anyway.

[-]Steven Byrnes3y*106

There's no way to raise a human such that their value system cleanly revolves around the one single goal of duplicating a strawberry, and nothing else. By asking for a method of forming values which would permit such a narrow specification of end goals, you're asking for a value formation process that's fundamentally different from the one humans use. There's no guarantee that such a thing even exists, and implicitly aiming to avoid the one value formation process we know is compatible with our own values seems like a terrible idea.

I narrowly agree with most of this, but I tend to say the same thing with a very different attitude:

I would say: “Gee it would be super cool if we could decide a priori what we want the AGI to be trying to do, WITH SURGICAL PRECISION. But alas, that doesn’t seem possible, at least not according to any method I know of.”

I disagree with you in your apparent suggestion that the above paragraph is obvious or uninteresting, and also disagree with your apparent suggestion that “setting an AGI’s motivations with surgical precision” is such a dumb idea that we shouldn’t even waste one minute of our time thinking about whether it might be possible to do that.

For ...

4Vaniver3y

My sense (which I expect Eliezer would agree with) is that it's relatively easy to get an AI system to imitate the true underlying 'basic notions of morality', to the extent humans agree on that, but that this doesn't protect you at all as soon as you want to start making large changes, or as soon as you start trying to replace specialist sectors of the economy. (A lot of ethics for doctors has to do with the challenges of simultaneously being a doctor and a human; those ethics will not necessarily be relevant for docbots, and the question of what they should be instead is potentially hard to figure out.) So if you're mostly interested in getting out of the acute risk period, you probably need to aim for a harder target.

4Steven Byrnes3y

Hmm, on further reflection, I was mixing up

* Strawberry Alignment (defined as: make an AGI that is specifically & exclusively motivated to duplicate a strawberry without destroying the world), versus
* “Strawberry Problem” (make an AGI that in fact duplicates a strawberry without destroying the world, using whatever methods / motivations you like).

Eliezer definitely talks about the latter. I’m not sure Eliezer has ever brought up the former? I think I was getting that from the OP (Quintin), but maybe Quintin was just confused (and/or Eliezer misspoke). Anyway, making an AGI that can solve the strawberry problem is tautologically no harder than making an AGI that can do advanced technological development and is motivated by human norms / morals / whatever, because the latter set of AGIs is a subset of the former. Sorry. I crossed out that paragraph. :)

[-]TurnTrout3y*927

Some arguments which Eliezer advanced in order to dismiss neural networks^[1] seem similar to some reasoning which he deploys in his modern alignment arguments.

Compare his incorrect mockery from 2008:

But there is just no law which says that if X has property A and Y has property A then X and Y must share any other property. "I built my network, and it's massively parallel and interconnected and complicated, just like the human brain from which intelligence emerges! Behold, now intelligence shall emerge from this neural network as well!" And nothing happens. Why should it?
Surface Analogies and Deep Causes^[2]

with his claim in Alexander and Yudkowsky on AGI goals:

[Alexander][14:36]
Like, we're not going to run evolution in a way where we naturally get AI morality the same way we got human morality, but why can't we observe how evolution implemented human morality, and then try AIs that have the same implementation design?
[Yudkowsky][14:37]
Not if it's based on anything remotely like the current paradigm, because nothing you do with a loss function and gradient descent over 100 quadrillion neurons, will result in an

...

[-]DanielFilan3y2846

I don't really get your comment. Here are some things I don't get:

- In "Failure By Analogy" and "Surface Analogies and Deep Causes", the point being made is "X is similar in aspects A to thing Y, and X has property P" does not establish "Y has property P". The reasoning he instead recommends is to reason about Y itself, and sometimes it will have property P. This seems like a pretty good point to me.
- Large ANNs don't appear to me to be intelligent because of their similarity to human brains - they appear to me to be intelligent because they're able to be tuned to accurately predict simple facts about a large amount of data that's closely related to human intelligence, and the algorithm they get tuned to seems to be able to be repurposed for a wide variety of tasks (probably related to the wide variety of data that was trained on).
- Airplanes don't fly like birds, they fly like airplanes. So indeed you can't just ape one thing about birds[*] to get avian flight. I don't think this is a super revealing technicality but it seemed like you thought it was important.
- Maybe most importantly I don't think Eliezer thinks you need to mimic the human brain super closely to get human-like intel

... (read more)

[-]TurnTrout3y*57

Edited to modify confidences about interpretations of EY's writing / claims.

In "Failure By Analogy" and "Surface Analogies and Deep Causes", the point being made is "X is similar in aspects A to thing Y, and X has property P" does not establish "Y has property P". The reasoning he instead recommends is to reason about Y itself, and sometimes it will have property P. This seems like a pretty good point to me.

This is a valid point, and that's not what I'm critiquing in that portion of the comment. I'm critiquing how -- on my read -- he confidently dismisses ANNs; in particular, using non-mechanistic reasoning which seems similar to some of his current alignment arguments.

On its own, this seems like a substantial misprediction for an intelligence researcher in 2008 (especially one who claims to have figured out most things in modern alignment, by a very early point in time -- possibly that early, IDK). Possibly the most important prediction to get right, to date.

Airplanes don't fly like birds, they fly like airplanes. So indeed you can't just ape one thing about birds[*] to get avian flight. I don't think this is a super revealing technicality but it seemed like you thought it was imp

... (read more)

[-]Zack_M_Davis3y2514

how he confidently dismisses ANNs

I don't think this is a fair reading of Yudkowsky. He was dismissing people who were impressed by the analogy between ANNs and the brain. I'm pretty sure it wasn't supposed to be a positive claim that ANNs wouldn't work. Rather, it's that one couldn't justifiably believe that they'd work just from the brain analogy, and that if they did work, that would be bad news for what he then called Friendliness (because he was hoping to discover and wield a "clean" theory of intelligence, as contrasted to evolution or gradient descent happening to get there at sufficient scale).

Consider "Artificial Mysterious Intelligence" (2008). In response to someone who said "But neural networks are so wonderful! They solve problems and we don't have any idea how they do it!", it's significant that Yudkowsky's reply wasn't, "No, they don't" (contesting the capabilities claim), but rather, "If you don't know how your AI works, that is not good. It is bad" (asserting that opaque capabilities are bad for alignment).

[-]DirectedEvolution3y946

One of Yudkowsky's claims in the post you link is:

It's hard to build a flying machine if the only thing you understand about flight is that somehow birds magically fly. What you need is a concept of aerodynamic lift, so that you can see how something can fly even if it isn't exactly like a bird.

This is a claim that lack of the correct mechanistic theory is a formidable barrier for capabilities, not just alignment, and it underestimates the amount of empirical understanding available as a basis for an empirical approach.

It's true that it's hard, even perhaps impossible, to build a flying machine if the only thing you understand is that birds "magically" fly.

But if you are like most people for thousands of years, you've observed many types of things flying, gliding, or floating in the air: birds and insects, fabric and leaves, arrows and spears, clouds and smoke.

So if you, like the Montgolfier brothers, observe fabric floating over a fire, and live in an era in which invention is celebrated and have the ability to build, test, and iterate, then you can probably figure out how to build a flying machine without basing this on a fully worked out concept of aerodyna... (read more)

4TurnTrout3y

To be fair, he said that those two will work, and (perhaps?) admitted the possibility of "run advanced neural network algorithms" eventually working. Emphasis mine:

0rvnnt3y

I think it might be relevant to note here that it's not really humans who are building current SOTA AIs --- rather, it's some optimizer like SGD that's doing most of the work. SGD does not have any mechanistic understanding of intelligence (nor anything else). And indeed, it takes a heck of a lot of data and compute for SGD to build those AIs. This seems to be in line with Yudkowsky's claim that it's hard/inefficient to build something without understanding it.

I think it's important to distinguish between
- Scaling up a neural network, and running some kind of fixed algorithm on it.
- Scaling up a neural network, and using SGD to optimize the parameters of the NN, so that the NN ends up learning a whole new set of algorithms.

IIUC, in Artificial Mysterious Intelligence, Yudkowsky seemed to be saying that the former would probably fail. OTOH, I don't know what kinds of NN algorithms were popular back in 2008, or exactly what NN algorithms Yudkowsky was referring to, so... *shrugs*.

[-]DirectedEvolution3y1937

If that were the case, I actually would fault Eliezer, at least a little. He’s frequently, though by no means always, stuck to qualitative and hard-to-pin-down punditry like we see here, rather than to unambiguous forecasting.

This allows him, or his defenders, to retroactively defend his predictions as somehow correct even when they seem wrong in hindsight.

Let’s imagine for a moment that Eliezer’s right that AI safety is a cosmically important issue, and yet that he’s quite mistaken about all the technical details of how AGI will arise and how to effectively make it safe. It would be important to know whether we can trust his judgment and leadership.

Without either the most obvious interpretation of his qualitative judgments or an unambiguous forecast to go on, it’s hard to evaluate his performance as an AI safety leader. Combine that with a culture of deference to perceived expertise and status and the problem gets worse.

So I prioritize the avoidance of special pleading in this case: I think Eliezer comes across as clearly wrong in substance in this specific post, and that it’s important not to reach for ways “he was actually right from... (read more)

1TurnTrout3y

I think several things here, considering the broader thread:
1. You've done a great job in communicating several reactions I also had:
   1. There are signs of serious mispredictions and mistakes in some of the 2008 posts.
   2. There are ways to read these posts as not that bad in hindsight, but we should be careful in giving too much benefit of the doubt.
   3. Overall these observations constitute important evidence on EY's alignment intuitions and ability to make qualitative AI predictions.
2. I did a bad job of marking my interpretations of what Eliezer wrote, as opposed to claiming he did dismiss ANNs. Hopefully my edits have fixed my mistakes.

[-]Rafael Harth3y813

I also don't really get your position. You say that,

[Eliezer] confidently dismisses ANNs

but you haven't shown this!

In Surface Analogies and Deep Causes, I read him as saying that neural networks don't automatically yield intelligence just because they share surface similarities with the brain. This is clearly true; at the very least, using token-prediction (which is a task for which (a) lots of training data exist and (b) lots of competence in many different domains is helpful) is a second requirement. If you take the network of GPT-4 and train it to play chess instead, you won't get something with cross-domain competence.
In Failure by Analogy he makes a very similar abstract point -- and wrt neural networks in particular, he says that the surface similarity to the brain is a bad reason to be confident in them. This also seems true. Do you really think that neural networks work because they are similar to brains on the surface?

You also said,

The important part is the last part. It's invalid. Finding a design X which exhibits property P, doesn't mean that for design Y to exhibit property P, Y must be very similar to X.

But Eliezer says this too in the post you li... (read more)

3TurnTrout3y

Responding to part of your comment:

I know he's talking about alignment, and I'm criticizing that extremely strong claim. This is the main thing I wanted to criticize in my comment! I think the reasoning he presents is not much supported by his publicly available arguments.

That claim seems to be advanced due to... there not being enough similarities between ANNs and human brains -- that without enough similarity in mechanisms which were selected for by evolution, you simply can't get the AI to generalize in the mentioned human-like way. Not as a matter of the AI's substrate, but as a matter of the AI's policy not generalizing like that. I think this is a dubious claim, and it's made based off of analogies to evolution / some unknown importance of having evolution-selected mechanisms which guide value formation (and not SGD-based mechanisms).

From the Alexander/Yudkowsky debate:

There's some assertion like "no, there's not a way to get an ANN, even if incorporating structural parameters and information encoded in human genome, to actually unfold into a mind which has human-like values (like 'don't steal')." (And maybe Eliezer comes and says "no that's not what I mean", but, man, I sure don't know what he does mean, then.)

Here's some more evidence along those lines:

Again, why is this true? This is an argument that should be engaging in technical questions about inductive biases, but instead seems to wave at (my words) "the original way we got property P was by sexual-recombinant hill-climbing search through a space of relatively very compact neural wiring algorithms, and good luck trying to get it otherwise."

Hopefully this helps clarify what I'm trying to critique?

1Rafael Harth3y

Ok, I don't disagree with this. I certainly didn't develop a gears-level understanding of why [building a brain-like thing with gradient descent on giant matrices] is doomed after reading the 2021 conversations. But that doesn't seem very informative either way; I didn't spend that much time trying to grok his arguments.

[-]DanielFilan3y89

This is a valid point, and that's not what I'm critiquing. I'm critiquing how he confidently dismisses ANNs

I guess I read that as talking about the fact that at the time ANNs did not in fact really work. I agree he failed to predict that would change, but that doesn't strike me as a damning prediction.

Matters would be different if he said in the quotes you cite "you only get these human-like properties by very exactly mimicking the human brain", but he doesn't.

Didn't he? He at least confidently rules out a very large class of modern approaches.

Confidently ruling out a large class of modern approaches isn't really that similar to saying "the only path to success is exactly mimicking the human brain". It seems like one could rule them out by having some theory about why they're deficient. I haven't re-read List of Lethalities because I want to go to sleep soon, but I searched for "brain" and did not find a passage saying "the real problem is that we need to emulate the brain precisely but can't because of poor understanding of neuroanatomy" or something.

4DanielFilan3y

I don't want to get super hung up on this because it's not about anything Yudkowsky has said but:

IMO this is not a faithful transformation of the line of reasoning you attribute to Yudkowsky, which was:

Specifically, where you wrote "an entity which flies", you were transforming "a mind which wants as humans do", which I think should instead be transformed to "an entity which flies as birds do". And indeed planes don't fly like birds do. [EDIT: two minutes or so after pressing enter on this comment, I now see how you could read it your way]

I guess if I had to make an analogy I would say that you have to be pretty similar to a human to think the way we do, but probably not to pursue the same ends, which is probably the point you cared about establishing.

[-]TurnTrout3y13

Here's another attempt at one of my contentions.

Consider shard theory of human values. The point of shard theory is not "because humans do RL, and have nice properties, therefore AI + RL will have nice properties." The point is more "by critically examining RL + evidence from humans, I have hypotheses about the mechanistic load-bearing components of e.g. local-update credit assignment in a bounded-compute environment on certain kinds of sensory data, that these components lead to certain exploration/learning dynamics, which explain some portion of human values and experience. Let's test that and see if the generators are similar."

And my model of Eliezer shakes his head at the naivete of expecting complex human properties to reproduce outside of human minds themselves, because AI is not human.

But then I'm like "this other time you said 'AI is not human, stop expecting good property P from superficial similarities', you accidentally missed the modern AI revolution, right? Seems like there is some non-superficial mechanistic similarity/lessons here, and we shouldn't be so quick to assume that the brain's qualitative intelligence or alignment properties come from a huge number of evolutionarily-tuned details which are load-bearing and critical."

[-]TurnTrout3y*1118

It now seems clear to me that EY was not bullish on neural networks leading to impressive AI capabilities. Eliezer said this directly:

I'm no fan of neurons; this may be clearer from other posts.^[1]

I think this is strong evidence for my interpretation of the quotes in my parent comment: He's not just mocking the local invalidity of reasoning "because humans have lots of neurons, AI with lots of neurons -> smart", he's also mocking neural network-driven hopes themselves.

^[1]
More quotes from Logical or Connectionist AI?:
Not to mention that neural networks have also been "failing" (i.e., not yet succeeding) to produce real AI for 30 years now. I don't think this particular raw fact licenses any conclusions in particular. But at least don't tell me it's still the new revolutionary idea in AI.
This is the original example I used when I talked about the "Outside the Box" box - people think of "amazing new AI idea" and return their first cache hit, which is "neural networks" due to a successful marketing campaign thirty goddamned years ago. I mean, not every old idea is bad - but to still be marketing it as the new defiant revolution? Give me a break.
In this passage, he employs well-s

... (read more)

[-]cousin_it1y*40

The relevant point is his latter claim: “in particular with respect to “learn ‘don’t steal’ rather than ‘don’t get caught’.”” I think this is a very strong conclusion, relative to available data.

I think humans don't steal mostly because society enforces that norm. Toward weaker "other" groups that aren't part of your society (farmed animals, weaker countries, etc) there's no such norm, and humans often behave badly toward such groups. And to AIs, humans will be a weaker "other" group. So if alignment of AIs to the human standard is a complete success - if AIs learn to behave toward weaker "other" groups exactly as humans behave toward such groups - the result will be bad for humans.

It gets even worse because AIs, unlike humans, aren't raised to be moral. They're raised by corporations with a goal to make money, with a thin layer of "don't say naughty words" morality. We already know corporations will break rules, bend rules, lobby to change rules, to make more money and don't really mind if people get hurt in the process. We'll see more of that behavior when corporations can make AIs to further their goals.

3David Johnston3y

Would you say Yudkowsky's views are a mischaracterisation of neural network proponents, or that he's mistaken about the power of loose analogies?

3TurnTrout3y

Neither.
1. I don't know what proponents were claiming when advocating for neural networks. I do know that neural networks ended up working, big time.
2. I don't think loose analogies are powerful. I think they lead to sloppy thinking.

[-]Steven Byrnes3y52

"I have to be wrong about something, which I certainly am. I have to be wrong about something which makes the problem easier rather than harder, for those people who don't think alignment's going to be all that hard. If you're building a rocket for the first time ever, and you're wrong about something, it's not surprising if you're wrong about something. It's surprising if the thing that you're wrong about causes the rocket to go twice as high, on half the fuel you thought was required and be much easier to steer than you were afraid of."

I agree with OP th... (read more)

[-]Vaniver3y76

During this process, I don’t think it’s particularly unusual for the person to notice a technical problem but overlook a clever way to solve that problem.

I think this isn't the claim; I think the claim is that it would be particularly unusual for someone to overlook that they're accidentally solving a technical problem. (It would be surprising for Edison to not be thinking hard about what filament to use and pick tungsten; in actual history, it took decades for that change to be made.)

4Steven Byrnes3y

Sure, but then the other side of the analogy doesn’t make sense, right? The context was: Eliezer was talking in general terms about the difficulty of the AGI x-risk problem and whether it’s likely to be solved. (As I understand it.) [Needless to say, I’m just making a narrow point that it’s a bad analogy. I’m not arguing that p(doom) is high or low, I’m not saying this is an important & illustrative mistake (talking on the fly is hard!), etc.]

4Vaniver3y

So I definitely think that's something weirdly unspoken about the argument; I would characterize it as Eliezer saying "suppose I'm right and they're wrong; all this requires is things to be harder than people think, which is usual. Suppose instead that I'm wrong and they're right; this requires things to be easier than people think, which is unusual." But the equation of "people" and "Eliezer" is sort of strange; as Quintin notes, it isn't that unusual for outside observers to overestimate difficulty, and so I wish he had centrally addressed the reference class tennis game; is the expertise "getting AI systems to be capable" or "getting AI systems to do what you want"?

[-]Gordon Seidoh Worley3y25

This post brought to mind a thought: I actually don't care very much about arguments about how likely doom is and how pessimistic or optimistic to be since they are irrelevant, to my style of thinking, for making decisions related to building TAI. Instead, I mostly focus on downside risks and avoiding them because they are so extreme, which makes me look "pessimistic" but actually I'm just trying to minimize the risk of false positives in building aligned AI. Given this framing, it's actually less important, in most cases, to figure out how likely somethin... (read more)

[-]DaemonicSigil3y11

Difficulty of Alignment

I find the prospect of training a model on just 40 parameters to be very interesting. Almost unbelievable, really, to the point where I'm tempted to say: "I notice that I'm confused". Unfortunately, I don't have access to the paper and it doesn't seem to be on sci-hub, so I haven't been able to resolve my confusion. Basically, my general intuition is that each parameter in a network probably only contributes a few bits of optimization power. It can be set fairly high, fairly low, or in between. So if you just pulled 40 random weigh... (read more)

3DanielFilan3y

For the 40 parameters thing, this link should work. See also this earlier paper.

4DanielFilan3y

BTW: the way I found that first link was by searching the title on Google Scholar, finding the paper, and clicking "All 5 versions" below (it's right next to "Cited by 7" and "Related articles"). That brought me to a bunch of versions, one of which was a seemingly-ungated PDF. This will probably frequently work, because AI researchers usually make their papers publicly available (at least in pre-print form).

1DaemonicSigil3y

Thanks for the link! Looks like they do put optimization effort into choosing the subspace, but it's still interesting that the training process can be factored into 2 pieces like that.
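
To make the "2 pieces" idea concrete, here is a minimal sketch of what training inside a fixed low-dimensional subspace can look like. It is not the paper's method: the subspace here is a fixed random projection (an assumption; the paper apparently optimizes its choice of subspace), the loss is a toy quadratic, and the names `make_projection`, `expand`, and `train_in_subspace` are hypothetical.

```python
import random

def make_projection(full_dim, sub_dim, rng):
    """Piece 1: pick the subspace. Here a fixed random matrix P
    (full_dim x sub_dim) that is chosen once and never trained."""
    scale = full_dim ** -0.5
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(sub_dim)]
            for _ in range(full_dim)]

def expand(theta0, proj, z):
    """Map subspace coordinates z to full parameters: theta = theta0 + P z."""
    return [t0 + sum(p * zj for p, zj in zip(row, z))
            for t0, row in zip(theta0, proj)]

def loss(theta, target):
    """Toy quadratic loss standing in for a real training objective."""
    return sum((t - g) ** 2 for t, g in zip(theta, target))

def train_in_subspace(theta0, proj, target, steps=200, lr=0.05):
    """Piece 2: gradient descent on the sub_dim coordinates z only."""
    sub_dim = len(proj[0])
    z = [0.0] * sub_dim
    for _ in range(steps):
        theta = expand(theta0, proj, z)
        grad_theta = [2.0 * (t - g) for t, g in zip(theta, target)]
        # Chain rule: dL/dz = P^T dL/dtheta
        grad_z = [sum(row[j] * g for row, g in zip(proj, grad_theta))
                  for j in range(sub_dim)]
        z = [zj - lr * gj for zj, gj in zip(z, grad_z)]
    return z

rng = random.Random(0)
FULL_DIM, SUB_DIM = 20, 4  # toy sizes, not the paper's
theta0 = [0.0] * FULL_DIM
target = [rng.uniform(-1.0, 1.0) for _ in range(FULL_DIM)]
proj = make_projection(FULL_DIM, SUB_DIM, rng)
z = train_in_subspace(theta0, proj, target)
initial_loss = loss(theta0, target)
final_loss = loss(expand(theta0, proj, z), target)
```

Only the `SUB_DIM` numbers in `z` are ever updated; the full parameter vector is always reconstructed from them, which is why the quality of the chosen subspace matters so much.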

[-]Portia3y0-6

What stood out to me in the video is Eliezer no longer being able to conceive of any positive outcome at all, which is beyond reason. It made me wonder what approach a company could possibly develop for alignment, or what a supposedly aligned AI could possibly do, for Eliezer to take back his doom predictions, and I suspect that the answer is none. The impression I got was that he is by now closed to the possibility entirely. I found the Time article heartbreaking. These are parents, intelligent, rational parents who I have respect and compassion for, ess... (read more)

3Adele Lopez3y

This market by Eliezer about the possible reasons why AI may yet have a positive outcome seems to refute your first sentence. Also, I haven't seen any AI notkilleveryoneism people advocating terrorism or giving up.

[+]lemonhope3y-30

^{^}

By this, I mostly mean the sorts of empirical approaches we actually use on current state of the art language models, such as RLHF, red teaming, etc.

^{^}

We can take drugs, though, which maybe does something like change the brain's learning rate, or some other hyperparameters.

^{^}

Technically it's trained to do decision transformer-esque reward-conditioned generation of texts.

^{^}

The brain likely includes within-neuron learnable parameters, but I expect these to be a relatively small contribution to the overall information content a human accumulates over their lifetime. For convenience, I just say “connectome” in the main text, but really I mean “connectome + all other within-lifetime learnable parameters of the brain’s operation”.

^{^}

I expect there are pretty straightforward ways of leveraging a 99% successful alignment method into a near-100% successful method by e.g., ensembling multiple training runs, having different runs cross-check each other, searching for inputs that lead to different behaviors between different models, transplanting parts of one model's activations into another model and seeing if the recipient model becomes less aligned, etc.
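
As a rough illustration of the "cross-check" and "search for inputs that lead to different behaviors" ideas, here is a toy sketch. Everything in it is an assumption made for illustration: the three "runs" are hand-written stand-ins for separately trained models, and `find_disagreements` and `majority_vote` are hypothetical helper names, not an existing method.

```python
from collections import Counter
from typing import Callable, Iterable, List, Tuple

Model = Callable[[int], str]

def find_disagreements(models: List[Model],
                       inputs: Iterable[int]) -> List[Tuple[int, List[str]]]:
    """Return every input on which the models do not all give the same output."""
    flagged = []
    for x in inputs:
        outputs = [m(x) for m in models]
        if len(set(outputs)) > 1:
            flagged.append((x, outputs))
    return flagged

def majority_vote(models: List[Model], x: int) -> str:
    """Ensemble decision: the most common output across the models."""
    counts = Counter(m(x) for m in models)
    return counts.most_common(1)[0][0]

# Toy stand-ins for three training runs that mostly agree.
def run_a(x: int) -> str:
    return "safe" if x < 10 else "unsafe"

def run_b(x: int) -> str:
    return "safe" if x < 10 else "unsafe"

def run_c(x: int) -> str:
    # This run drew its decision boundary in a slightly different place.
    return "safe" if x < 12 else "unsafe"

runs = [run_a, run_b, run_c]
flagged = find_disagreements(runs, range(20))
flagged_inputs = [x for x, _ in flagged]
decision_at_10 = majority_vote(runs, 10)
```

Inputs that end up in `flagged` are the ones worth a closer look, since at least one run behaves differently there; the majority vote is one simple way to act on the ensemble in the meantime.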

^{^}

Some alignment researchers do argue that gradient descent is likely to create such an intelligence - an inner optimizer - that then deliberately manipulates the training process to its own ends. I don't believe this either. I don't want to dive deeply into my objections to that bundle of claims in this post, but as with Yudkowsky's position, I have many technical objections to such arguments. Briefly, they:
- often rely on inappropriate analogies to evolution.
- rely on unproven (and dubious, IMO) claims about the inductive biases of gradient descent.
- rely on shaky notions of "optimization" that lead to absurd conclusions when critically examined.
- seem inconsistent with what we know of neural network internal structures (they're very interchangeable and parallel).
- postulate a network structure that seems like it would fall victim to internally generated adversarial examples.
- don't track the distinction between mesa objectives and behavioral objectives (one can probably convert an NN into an energy function, then parameterize the NN's forward pass as a search for energy function minima, without changing network behavior at all, so mesa objectives can have ~no relation to behavioral objectives).
- seem very implausible when considered in the context of the human learning process (could a human's visual cortex become "deceptively aligned" to the objective of modeling their visual field?).
- provide limited avenues for any such inner optimizer to actually influence the training process.
See also: Deceptive Alignment is <1% Likely by Default

^{^}

There's also in-context learning, which arguably does count as 'getting smarter while running in inference mode'. E.g., without updating any weights, LMs can:
- adapt information found in task descriptions / instructions to solving future task instances.
- given a coding task, write an initial plan on how to do that task, and then use that plan to do better on the coding task in question.
- even learn to classify images.
The reason this in-context learning doesn't always lead to persistent improvements (or at least changes) in GPT-4 is that OpenAI doesn't train their models like that.

^{^}

OpenAI does periodically train its models in a way that incorporates user inputs somehow. E.g., ChatGPT became much harder to jailbreak after OpenAI trained against the breaks people used against it. So GPT-4 is probably learning from some of the times it's run in inference mode.

^{^}

Unless we actually try the approach and it fails in the way predicted. But that hasn't happened (yet).

^{^}

This sentence would sound much less weird if John had called them "attractors" instead of "demons". One potential downside of choosing evocative names for things is that they can make it awkward to talk about those things in an emotionally neutral way.

^{^}

Level	What it does	In Humans:	In AIs:
Top	Configures the learning process	Genome	Training code
Middle	Stores learned information / behaviors	Connectome	Weights
Bottom	Applies stored info to the current situation	Activations	Activations

