What does this mean for alignment? How do we prevent AIs from behaving badly as a result of a similar "misgeneralization"? What alignment insights does the fleshed-out mechanistic story of humans coming to like ice cream provide?
As far as I can tell, the answer is: don't reward your AIs for taking bad actions.
uh
is your proposal "use the true reward function, and then you won't get misaligned AI"?
That's all it would take, because the mechanistic story above requires a specific step where the human eats ice cream and activates their reward circuits. If you stop the human from receiving reward for eating ice cream, then the human no longer becomes more inclined to navigate towards eating ice cream in the future.
Note that I'm not saying this is an easy task, especially since modern RL methods often use learned reward functions whose exact contours are unknown to their creators.
But from what I can tell, Yudkowsky's position is that we need an entirely new paradigm to even begin to address these sorts of failures.
These three paragraphs feel incoherent to me. The human eating ice cream and activating their reward circuits is exactly what you would expect under the current paradigm. Yudkowsky thinks this leads to misalignment; you agree. He says that you need a new paradigm to not have this problem. You disagree because you assume it's possible under the current paradigm.
If so, how? Where's the system that, on eating ice cream, realizes "oh no! This is a bad action that should not receive reward!" and overrides the reward machinery? How was it trained?
I think when Eliezer says "we need an entirely new paradigm", he means something like "if we want a decision-making system that makes better decisions than an RL agent, we need agent-finding machinery that's better than RL." Maybe the paradigm shift is small (like from RL without experience replay to RL with), or maybe the paradigm shift is large (like from policy-based agents to plan-based agents).
In contrast, I think we can explain humans' tendency to like ice cream using the standard language of reinforcement learning. It doesn't require that we adopt an entirely new paradigm before we can even get a handle on such issues.
He's not saying the failures of RL are a surprise from the theory of RL. Of course you can explain it using the standard language of RL! He's saying that unless you can predict RL's failures from the inside, the RL agents that you make are going to actually make those mistakes in reality.
My shard theory inspired story is to make an AI that:
Then the model can safely scale.
This doesn’t require having the true reward function (which I imagine to be a giant lookup table created by Omega), but some mech interp and understanding its own reward function. I don’t expect this to be an entirely different paradigm; I even think current methods of RLHF might just naively work. Who knows? (I do think we should try to figure it out though! I do have greater uncertainty and less pessimism)
Analogously, I do believe I do a good job of avoiding value-destroying inputs (eg addictive substances), even though my reward function isn’t as clear and legible as what our AIs will be AFAIK.
Then the model can safely scale.
If there are experiences which will change itself which don't lead to less of the initial good values, then yeah, for an approximate definition of safety. You're resting everything on the continued strength of this model as capabilities increase, and so if it fails before you top out the scaling I think you probably lose.
FWIW I don't really see your description as, like, a specific alignment strategy so much as the strategy of "have an alignment strategy at all". The meat is all in 1) how you identify the core of human values and 2) how you identify which experiences will change the system to have less of the initial good values, but, like, figuring out the two of those would actually solve the problem!
in which Yudkowsky incorrectly assumed that GANs (Generative Adversarial Networks, a training method sometimes used to teach AIs to generate images) were so finicky that they must not have worked on the first try.
I do think this is a point against Yudkowsky. That said, my impression is that GANs are finicky, and I heard rumors that many people tried similar ideas and failed to get it to work before Goodfellow knocked it out of the park. If people were encouraged to publish negative results, we might have a better sense of the actual landscape here, but I think a story of "Goodfellow was unusually good at making GANs and this is why he got it right on his first try" is more compelling to me than "GANs were easy actually".
As I understand it, the security mindset asserts a premise that's roughly: "The bundle of intuitions acquired from the field of computer security are good predictors for the difficulty / value of future alignment research directions."
This seems... like a correct description but it's missing the spirit?
Like the intuitions are primarily about "what features are salient" and "what thoughts are easy to think."
However, I don't see why this should be the case.
Roughly, the core distinction between software engineering and computer security is whether the system is thinking back. Software engineering typically involves working with dynamic systems and thinking optimistically about how the system could work. Computer security typically involves working with reactive systems and thinking pessimistically about how the system could break.
I think it is an extremely basic AI alignment skill to look at your alignment proposal and ask "how does this break?" or "what happens if the AI thinks about this?".
Additionally, there's a straightforward reason why alignment research (specifically the part of alignment that's about training AIs to have good values) is not like security: there's usually no adversarial intelligence cleverly trying to find any possible flaws in your approaches and exploit them.
What's your story for specification gaming?
I must admit some frustration, here; in this section it feels like your point is "look, computer security is for dealing with intelligence as part of your system. But the only intelligence in our system is sometimes malicious users!" In my world, the whole point of Artificial Intelligence was the Intelligence. The call is coming from inside the house!
Maybe we just have some linguistic disagreement? "Sure, computer security is relevant to transformative AI but not LLMs"? If so, then I think the earlier point about whether capabilities enhancements break alignment techniques is relevant: if these alignment techniques work because the system isn't thinking about them, then are you confident they will continue to work when the system is thinking about them?
Roughly, the core distinction between software engineering and computer security is whether the system is thinking back.
Yes, and my point in that section is that the fundamental laws governing how AI training processes work are not "thinking back". They're not adversaries. If you created a misaligned AI, then it would be "thinking back", and you'd be in an adversarial position where security mindset is appropriate.
What's your story for specification gaming?
"Building an AI that doesn't game your specifications" is the actual "alignment question" we should be doing research on. The mathematical principles which determine how much a given AI training process games your specifications are not adversaries. It's also a problem we've made enormous progress on, mostly by using large pretrained models with priors over how to appropriately generalize from limited specification signals. E.g., Learning Which Features Matter: RoBERTa Acquires a Preference for Linguistic Generalizations (Eventually) shows how the process of pretraining an LM causes it to go from "gaming" a limited set of finetuning data via shortcut learning / memorization, to generalizing with the appropriate linguistic prior knowledge.
Finally, I'd note that having a "security mindset" seems like a terrible approach for raising human children to have good values
Do you have kids, or any experience with them? (There are three small children in the house I live in.) I think you might want to look into childproofing, and meditate on its connection to security mindset.
Yes, this isn't necessarily related to the 'values' part, but for that I would suggest things like Direct Instruction, which involves careful curriculum design to generate lots of examples so that students will reliably end up inferring the correct rule.
In short, I think the part of 'raising children' which involves the kids being intelligent as well and independently minded does benefit from security mindset.
As you mention in the next paragraph, this is a long-standing disagreement; I might as well point at the discussion of the relevance of raising human children to instilling goals in an AI in The Detached Lever Fallacy. The short summary of it is that humans have a wide range of options for their 'values', and are running some strategy of learning from their environment (including their parents and their style of raising children) which values to adopt. The situation with AI seems substantially different--why make an AI design that chooses whether to be good or bad based on whether you're nice to it, when you could instead have it choose to always be good? [Note that this is distinct from "always be nice"; you could decide that your good AI can tell users that they're being bad users!]
Yudkowsky's own prior statements seem to put him in this camp as well. E.g., here he explains why he doesn't expect intelligence to emerge from neural networks (or more precisely, why he dismisses a brain-based analogy for coming to that conclusion)
I think you're basically misunderstanding and misrepresenting Yudkowsky's argument from 2008. He's not saying "you can't make an AI out of neural networks", he's saying "your design sharing a single feature with the brain does not mean it will also share the brain's intelligence." As well, I don't think he's arguing about how AI will actually get made; I think he's mostly criticizing the actual AGI developers/enthusiasts that he saw at the time (who were substantially less intelligent and capable than the modern batch of AGI developers).
I think that post has held up pretty well. The architectures used to organize neural networks are quite important, not just the base element. Someone whose only plan was to make their ANN wide would not reach AGI; they needed to do something else, that didn't just rely on surface analogies.
I think it's straightforward to explain why humans "misgeneralized" to liking ice cream.
I don't yet understand why you put misgeneralized in scare quotes, or whether you have a story for why it's a misgeneralization instead of things working as expected.
I think your story for why humans like ice cream makes sense, and is basically the story Yudkowsky would tell too, with one exception:
The ancestral environment selected for reward circuitry that would cause its bearers to seek out more of such food sources.
"such food sources" feels a little like it's eliding the distinction between "high-quality food sources of the ancestral environment" and "foods like ice cream"; the training dataset couldn't differentiate between functions f and g, but those functions differ in their reaction to the test set (ice cream). Yudkowsky's primary point with this section, as I understand it, is that even if you-as-evolution know that you want g, the only way you can communicate that under the current learning paradigm is with training examples, and it may be non-obvious which functions f need to be excluded.
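A toy sketch of the f-versus-g point, under loud assumptions: the two functions, the food names, and every number below are made up for illustration and are not taken from the discussion. It only shows that two rules can agree on every training example and still disagree on an input the training set never contained.

```python
# Hypothetical illustration: f, g, the foods, and the numbers are all invented.

def f(sugar, fat):
    """Proxy rule: reward grows with sugar and fat, without limit."""
    return sugar + fat


def g(sugar, fat):
    """Intended rule: same as f, but capped at the richness of ancestral foods."""
    return min(sugar + fat, 10)


# "Training set": foods available in the ancestral environment (made-up values).
ancestral_foods = {
    "berries": (3, 0),
    "nuts": (1, 6),
    "meat": (0, 7),
    "honey": (9, 0),
}

# "Test set": a food that never appeared during training.
ice_cream = (12, 8)

# Check whether any training example separates the two rules.
agree_on_training = all(f(*x) == g(*x) for x in ancestral_foods.values())

print("agree on training set:", agree_on_training)
print("f(ice cream) =", f(*ice_cream), "| g(ice cream) =", g(*ice_cream))
```

Since no ancestral food exceeds the cap, nothing in the training data rules out f; only an example beyond the cap would.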
seem very implausible when considered in the context of the human learning process (could a human's visual cortex become "deceptively aligned" to the objective of modeling their visual field?).
I think it would probably be strange for the visual field to do this. But I think it's not that uncommon for other parts of the brain to do this; higher level, most abstract / "psychological" parts that have a sense of how things will affect their relevance to future decision-making. I think there are lots of self-perpetuating narratives that it might be fair to call 'deceptively aligned' when they're maladaptive. The idea of metacognitive blindspots also seems related.
I believe the human visual cortex is actually the more relevant comparison point for estimating the level of danger we face due to mesaoptimization. Its training process is more similar to the self-supervised / offline way in which we train (base) LLMs. In contrast, the 'most abstract / "psychological"' parts are more entangled in future decision-making. They're more "online", with greater ability to influence their future training data.
I think it's not too controversial that online learning processes can have self-reinforcing loops in them. Crucially however, such loops rely on being able to influence the externally visible data collection process, rather than being invisibly baked into the prior. They are thus much more amenable to being addressed with scalable oversight approaches.
John Wentworth describes the possibility of "optimization demons", self-reinforcing patterns that exploit flaws in an imperfect search process to perpetuate themselves and hijack the search for their own purposes.
But no one knows exactly how much of an issue this is for deep learning, which is famous for its ability to evade local minima when run with many parameters.
Also relevant is Are minimal circuits daemon-free? and Are minimal circuits deceptive?. I agree no one knows how much of an issue this will be for deep learning.
Additionally, I think that, if deep learning models develop such phenomena, then the brain likely does so as well.
I think the brain obviously has such phenomena, and societies made up of humans also obviously have such phenomena. I think it is probably not adaptive (optimization demons are more like 'cognitive cancer' than 'part of how values form', I think, but in part that's because the term comes with the disapproval built in).
I think the bolded text is about Yudkowsky himself being wrong.
That is also how I interpreted it.
If you have a bunch of specific arguments and sources of evidence that you think all point towards a particular conclusion X, then discovering that you're wrong about something should, in expectation, reduce your confidence in X.
I think Yudkowsky is making a different statement. I agree it would be bizarre for him to be saying "if I were wrong, it would only mean I should have been more confident!"
Yudkowsky is not the aerospace engineer building the rocket who's saying "the rocket will work because of reasons A, B, C, etc".
I think he is (inside of the example). He's saying "suppose an engineer is wrong about how their design works. Is it more likely that the true design performs better along multiple important criteria than expectation, or that the design performs worse (or fails to function at all)?"
Note that 'expectation' is referring to the confidence level inside an argument, but arguments aren't Bayesians; it's the outside agent that shouldn't be expected to predictably update. Another way to put this: does the engineer expect to be disappointed, excited, or neutral if the design doesn't work as planned? Typically, disappointed, implying the plan is overly optimistic compared to reality.
If this weren't true--if engineers were calibrated or pessimistic--then I think Yudkowsky would be wrong here (and also probably have a different argument to begin with).
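The calibration point can be written in symbols using conservation of expected evidence, which is a standard identity of probability theory and not something quoted from the thread. Here X stands for "the design works" and E for whatever the engineer observes next; both labels are assumptions made for illustration.

```latex
\begin{aligned}
P(X) &= P(X \mid E)\,P(E) + P(X \mid \neg E)\,P(\neg E) \\
     &= \mathbb{E}_{E}\big[\,P(X \mid E)\,\big]
\end{aligned}
```

A calibrated agent's expected posterior equals its prior, so an engineer who predictably expects surprises to be disappointing is, on this reading, holding a prior that is too optimistic.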
Given the greater evidence available for general ML research, being well calibrated about the difficulty of general ML research is the first step to being well calibrated about the difficulty of ML alignment research.
I think I agree with this point but want to explicitly note the switch from the phrase 'AI alignment research' to 'ML alignment research'; my model of Eliezer thinks the second is mostly a distraction from the former, and if you think they're the same or interchangeable that seems like a disagreement.
[For example, I think ML alignment research includes stuff like "will our learned function be robust to distributional shift in the inputs?" and "does our model discriminate against protected classes?" whereas AI alignment research includes stuff like "will our system be robust to changes in the number of inputs?" and "is our model deceiving us about its level of understanding?". They're related in some ways, but pretty deeply distinct.]
I think this is extremely misleading. Firstly, real-world data in high dimensions basically never look like spheres. Such data almost always cluster in extremely compact manifolds, whose internal volume is minuscule compared to the full volume of the space they're embedded in.
I agree with your picture of how manifolds work; I don't think it actually disagrees all that much with Yudkowsky's.
That is, the thing where all humans are basically the same make and model of car, running the same brand of engine, painted different colors is the claim that the intrinsic dimension of human minds is pretty small. (Taken literally, it's 3, for the three dimensions of color-space.)
And so if you think there are, say, 40 intrinsic dimensions to mind-space, and humans are fixed on 37 of the points and variable on the other 3, well, I think we have basically the Yudkowskian picture.
(I agree if Yudkowsky's picture was that there were 40M dimensions and humans varied on 3, this would be comically wrong, but I don't think this is what he's imagining for that argument.)
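A minimal sketch of the "40 dimensions, 3 of them free" picture, assuming a deliberately crude model: every point shares one fixed base vector and only the last three coordinates are re-drawn. The constants and function names are hypothetical, and counting axis-aligned varying coordinates is a simplification; estimating intrinsic dimension on real data is much harder.

```python
import random

random.seed(0)

AMBIENT_DIMS = 40  # hypothetical size of "mind-space" from the comment above
FREE_DIMS = 3      # coordinates allowed to vary (the "paint color")

# Shared "make and model": one fixed value per coordinate.
base = [random.random() for _ in range(AMBIENT_DIMS)]


def sample_point():
    """Copy the shared base, then re-draw only the last FREE_DIMS coordinates."""
    point = list(base)
    for i in range(AMBIENT_DIMS - FREE_DIMS, AMBIENT_DIMS):
        point[i] = random.random()
    return point


population = [sample_point() for _ in range(1000)]

# A coordinate "varies" if the sampled points do not all share one value there.
varying = [
    i for i in range(AMBIENT_DIMS)
    if len({p[i] for p in population}) > 1
]

print("ambient dimensions:", AMBIENT_DIMS)
print("coordinates that vary across the population:", len(varying))
```

The population fills a 3-dimensional slice of the 40-dimensional space, which is the sense in which both the "same make and model" description and the "compact manifold" description can apply to the same data.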
Addressing this objection is why I emphasized the relatively low information content that architecture / optimizers provide for minds, as compared to training data. We've gotten very far in instantiating human-like behaviors by training networks on human-like data. I'm saying the primacy of data for determining minds means you can get surprisingly close in mindspace, as compared to if you thought architecture / optimizer / etc were the most important.
Obviously, there are still huge gaps between the sorts of data that an LLM is trained on versus the implicit loss functions human brains actually minimize, so it's kind of surprising we've even gotten this far. The implication I'm pointing to is that it's feasible to get really close to human minds along important dimensions related to values and behaviors, even without replicating all the quirks of human mental architecture.
It cannot be the case that successful value alignment requires perfect adversarial robustness.
It seems like the argument structure here is something like:
I disagree with point 2, tho; among other things, it looks to me like some humans are on track to accidentally summon a demon that kills both me and them, which I expect they would regret after-the-fact if they had the chance to.
So any reasoning that's like "well so long as it's not unusual we can be sure it's safe" runs into the thing where we're living in the acute risk period. The usual is not safe!
Similarly, an AI that knows it's vulnerable to adversarial attacks, and wants to avoid being attacked successfully, will take steps to protect itself against such attacks. I think creating AIs with such meta-preferences is far easier than creating AIs that are perfectly immune to all possible adversarial attacks.
This seems definitely right to me. An expectation I have is that this will also generate resistance to alignment techniques / control by its operators, which perhaps complicates how benign this is.
[FWIW I also don't think we want an AI that's perfectly robust to all possible adversarial attacks; I think we want one that's adequate to defend against the security challenges it faces, many of which I expect to be internal. Part of this is because I'm mostly interested in AI planning systems able to help with transformative changes to the world instead of foundational models used by many customers for small amounts of cognition, which are totally different business cases and have different security problems.]
There's no guarantee that such a thing even exists, and implicitly aiming to avoid the one value formation process we know is compatible with our own values seems like a terrible idea.
...
It's thus vastly easier to align models to goals where we have many examples of people executing said goals.
I think there's a deep disconnect here on whether interpolation is enough or whether we need extrapolation.
The point of the strawberry alignment problem is "here's a clearly understandable specification of a task that requires novel science and engineering to execute on. Can you do that safely?". If your ambitions are simply to have AI customer service bots, you don't need to solve this problem. If your ambitions include cognitive megaprojects which will need to be staffed at high levels by AI systems, then you do need to solve this problem.
More pragmatically, if your ambitions include setting up some sort of system that prevents people from deploying rogue AI systems while not dramatically curtailing Earth's potential, that isn't a goal that we have many examples of people executing on. So either we need to figure it out with humans or, if that's too hard, create an AI system capable of figuring it out (which probably requires an AI leader instead of an AI assistant).
I expect future capabilities advances to follow a similar pattern as past capabilities advances, and not completely break the existing alignment techniques.
Part of this is just straight disagreement, I think; see So8res's Sharp Left Turn and follow-on discussion.
But for the rest of it, I don't see this as addressing the case for pessimism, which is not problems from the reference class that contains "the LLM sometimes outputs naughty sentences" but instead problems from the reference class that contains "we don't know how to prevent an ontological collapse, where meaning structures constructed under one world-model compile to something different under a different world model."
Or, like, once LLMs gain the capability to design proteins (because you added in a relevant dataset, say), do you really expect the 'helpful, harmless, honest' alignment techniques that were used to make a chatbot not accidentally offend users to also work for making a biologist-bot not accidentally murder patients? Put another way, I think new capabilities advances reveal new alignment challenges and unless alignment techniques are clearly cutting at the root of the problem, I don't expect that they will easily transfer to those new challenges.
Part of this is just straight disagreement, I think; see So8res's Sharp Left Turn and follow-on discussion.
Evolution provides no evidence for the sharp left turn
But for the rest of it, I don't see this as addressing the case for pessimism, which is not problems from the reference class that contains "the LLM sometimes outputs naughty sentences" but instead problems from the reference class that contains "we don't know how to prevent an ontological collapse, where meaning structures constructed under one world-model compile to something different under a different world model."
I dislike this minimization of contemporary alignment progress. Even just limiting ourselves to RLHF, that method addresses far more problems than "the LLM sometimes outputs naughty sentences". E.g., it also tackles problems such as consistently following user instructions, reducing hallucinations, improving the topicality of LLM suggestions, etc. It allows much more significant interfacing with the cognition and objectives pursued by LLMs than just some profanity filter.
I don't think ontological collapse is a real issue (or at least, not an issue that appropriate training data can't solve in a relatively straightforward way). I feel similarly about lots of things that are speculated to be convergent problems for ML systems, such as wireheading and mesaoptimization.
Or, like, once LLMs gain the capability to design proteins (because you added in a relevant dataset, say), do you really expect the 'helpful, harmless, honest' alignment techniques that were used to make a chatbot not accidentally offend users to also work for making a biologist-bot not accidentally murder patients?
If you're referring to the technique used on LLMs (RLHF), then the answer seems like an obvious yes. RLHF just refers to using reinforcement learning with supervisory signals from a preference model. It's an incredibly powerful and flexible approach, one that's only marginally less general than reinforcement learning itself (can't use it for things you can't build a preference model of). It seems clear enough to me that you could do RLHF over the biologist-bot's action outputs in the biological domain, and be able to shape its behavior there.
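For readers unfamiliar with what a "preference model" supplies, here is a minimal sketch of one common formulation, the Bradley-Terry pairwise loss. This is an assumption about the general technique and not a claim about any particular lab's implementation; the function names are hypothetical, and the scalar scores stand in for whatever a learned model would assign to two candidate outputs.

```python
import math


def preference_probability(score_chosen, score_rejected):
    """Bradley-Terry probability that the 'chosen' output beats the 'rejected' one."""
    return 1.0 / (1.0 + math.exp(-(score_chosen - score_rejected)))


def preference_loss(score_chosen, score_rejected):
    """Negative log-likelihood of the rater's recorded preference."""
    return -math.log(preference_probability(score_chosen, score_rejected))


# The loss is low when the model scores the rater's choice higher,
# and high when it scores the rejected output higher.
agrees_with_rater = preference_loss(2.0, 0.0)
disagrees_with_rater = preference_loss(0.0, 2.0)

print("loss when model agrees with rater:   ", agrees_with_rater)
print("loss when model disagrees with rater:", disagrees_with_rater)
```

Nothing in the loss refers to language: it needs only pairs of outputs and a judgment about which is better, which is why the approach is limited mainly by whether such judgments can be collected and modeled for the domain in question.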
If you're referring to just doing language-only RLHF on the model, then making a bio-model, and seeing if the RLHF influences the bio-model's behaviors, then I think the answer is "variable, and it depends a lot on the specifics of the RLHF and how the cross-modal grounding works".
People often translate non-lingual modalities into language so LLMs can operate in their "native element" in those other domains. Assuming you don't do that, then yes, I could easily see the language-only RLHF training having little impact on the bio-model's behaviors.
However, if the bio-model were acting multi-modally by e.g., alternating between biological sequence outputs and natural language planning of what to use those outputs for, then I expect the RLHF would constrain the language portions of that dialog. Then, there are two options:
Put another way, I think new capabilities advances reveal new alignment challenges and unless alignment techniques are clearly cutting at the root of the problem, I don't expect that they will easily transfer to those new challenges.
Whereas I see future alignment challenges as intimately tied to those we've had to tackle for previous, less capable models. E.g., your bio-bot example is basically a problem of cross-modality grounding, on which there has been an enormous amount of past work, driven by the fact that cross-modality grounding is a problem for systems across very broad ranges of capabilities.
This seems like way too high a bar. It seems clear that you can have transformative or risky AI systems that are still worse than humans at some tasks. This seems like the most likely outcome to me.
I think this is what Yudkowsky thinks also? (As for why it was relevant to bring up, Yudkowsky was answering the host's question of "How is superintelligence different than general intelligence?")
This is kinda long. If I had time to engage with one part of this as a sample of whether it holds up to a counterresponse, what would be the strongest foot you could put forward?
(I also echo the commenter who's confused about why you'd reply to the obviously simplified presentation from an off-the-cuff podcast rather than the more detailed arguments elsewhere.)
This response is enraging.
Here is someone who has attempted to grapple with the intellectual content of your ideas and your response is "This is kinda long."? I shouldn't be that surprised because, IIRC, you said something similar in response to Zack Davis' essays on the Map and Territory distinction, but that's ancillary and AI is core to your memeplex.
I have heard repeated claims that people don't engage with the alignment community's ideas (recent example from yesterday). But here is someone who did the work. Please explain why your response here does not cause people to believe there's no reason to engage with your ideas because you will brush them off. Yes, nutpicking e/accs on Twitter is much easier and probably more hedonic, but they're not convincible and Quintin here is.
Choosing to engage with an unscripted unrehearsed off-the-cuff podcast intended to introduce ideas to a lay audience, continues to be a surprising concept to me. To grapple with the intellectual content of my ideas, consider picking one item from "A List of Lethalities" and engaging with that.
Here are some of my disagreements with List of Lethalities. I'll quote item one:
“Humans don't explicitly pursue inclusive genetic fitness; outer optimization even on a very exact, very simple loss function doesn't produce inner optimization in that direction. This happens in practice in real life, it is what happened in the only case we know about, and it seems to me that there are deep theoretical reasons to expect it to happen again”
(Evolution) → (human values) is not the only case of inner alignment failure which we know about. I have argued that human values themselves are inner alignment failures on the human reward system. This has happened billions of times in slightly different learning setups.
FWIW, I thought the bit about manifolds in The difficulty of alignment was the strongest foot forward, because it paints a different detailed picture than your description that it's responding to.
That said, I don't think Quintin's picture obviously disagrees with yours (as discussed in my response over here), and I think you'd find it disappointing that he calls your description extremely misleading while not seeming to correctly identify the argument structure and check whether there's a related one that goes thru on his model.
But have you ever, even once in your life, thought anything remotely like "I really like being able to predict the near-future content of my visual field. I should just sit in a dark room to maximize my visual cortex's predictive accuracy."?
I think I've been in situations where I've been disoriented by a bunch of random stuff happening and wished that less of it was happening so that I could get a better handle on stuff. An example I vividly recall was being in a history class in high school and being very bothered by the large number of conversations happening around me.
Some arguments which Eliezer advanced in order to dismiss neural networks[1] seem similar to some reasoning which he deploys in his modern alignment arguments.
Compare his incorrect mockery from 2008:
But there is just no law which says that if X has property A and Y has property A then X and Y must share any other property. "I built my network, and it's massively parallel and interconnected and complicated, just like the human brain from which intelligence emerges! Behold, now intelligence shall emerge from this neural network as well!" And nothing happens. Why should it?
with his claim in Alexander and Yudkowsky on AGI goals:
[Alexander][14:36]
Like, we're not going to run evolution in a way where we naturally get AI morality the same way we got human morality, but why can't we observe how evolution implemented human morality, and then try AIs that have the same implementation design?
[Yudkowsky][14:37]
Not if it's based on anything remotely like the current paradigm, because nothing you do with a loss function and gradient descent over 100 quadrillion neurons, will result in an AI coming out the other end which looks like an evolved human with 7.5MB of brain-wiring information and a childhood.
Like, in particular with respect to "learn 'don't steal' rather than 'don't get caught'."
I agree that 100 quadrillion artificial neurons + loss function won't get you a literal human, for trivial reasons. The relevant point is his latter claim: "in particular with respect to "learn 'don't steal' rather than 'don't get caught'.""
I think this is a very strong conclusion, relative to available data. I think that a good argument for it would require a lot of technical, non-analogical reasoning about the inductive biases of SGD on large language models. But, AFAICT, Eliezer rarely deploys technical reasoning that depends on experimental results or ML theory. He seems to prefer strongly-worded a priori arguments that are basically analogies.
In the above two quotes of his,[3] I perceive a common thread of
human intelligence/alignment comes from a lot of factors; you can't just ape one of the factors and expect the rest to follow; to get a mind which thinks/wants as humans do, that mind must be as close to a human as humans are to each other.
But why is this true? You can just replace "human intelligence" with "avian flight", and the argument might sound similarly plausible a priori.
ETA: The invalid reasoning step is in the last clause ("to get a mind..."). If design X exhibits property P, that doesn't mean that design Y must be similar to X in order to exhibit property P.
ETA: Part of this comment was about EY dismissing neural networks in 2008. It seems to me that the cited writing supports that interpretation, and it's still my best guess (see also DirectedEvolution's comments). However, the quotes are also compatible with EY merely criticizing invalid reasons for expecting neural networks to work. I should have written that part of this comment more carefully, and not claimed observation ("he did dismiss") when I only had inference ("sure seems like he dismissed").
I think the rest of my point stands unaffected (EY often advances vague arguments that are analogies, or a priori thought experiments).
ETA 2: I'm now more confident in my read. Eliezer said this directly:
I'm no fan of neurons; this may be clearer from other posts.
It's this kind of apparent misprediction which has, over time, made me take less seriously Eliezer's models of intelligence and alignment. See also e.g. the cited GAN mis-retrodiction. This change led me to flag / rederive all of my beliefs about rationality/optimization for a while.
(At least, his 2008-era models seemed faulty to the point of this misprediction, and it doesn't seem to me that this part of his models has changed much, though I claim no intimate non-public knowledge of his beliefs; just operating on my impressions here.)
See also Failure By Analogy:
Wasn't it in some sense reasonable to have high hopes of neural networks? After all, they're just like the human brain, which is also massively parallel, distributed, asynchronous, and -
Hold on. Why not analogize to an earthworm's brain, instead of a human's?
A backprop network with sigmoid units... actually doesn't much resemble biology at all. Around as much as a voodoo doll resembles its victim. The surface shape may look vaguely similar in extremely superficial aspects at a first glance. But the interiors and behaviors, and basically the whole thing apart from the surface, are nothing at all alike. All that biological neurons have in common with gradient-optimization ANNs is... the spiderwebby look.
And who says that the spiderwebby look is the important fact about biology? Maybe the performance of biological brains has nothing to do with being made out of neurons, and everything to do with the cumulative selection pressure put into the design.
Originally, this comment included:
So, here are two claims which seem to echo the positions Eliezer advances:
1. "A large ANN doesn't look enough like a human brain to develop intelligence." -> wrong (see GPT-4)
2. "A large ANN doesn't look enough like a human brain to learn 'don't steal' rather than 'don't get caught'" -> (not yet known)
I struck this from the body because I think (1) misrepresents his position. Eliezer is happy to speculate about non-anthropomorphic general intelligence (see e.g. That Alien Message). Also, I think this claim comparison does not name my real objection here, which is better advanced by the updated body of this comment.
I don't really get your comment. Here are some things I don't get:
An abacus performs addition; and the beads of solder on a circuit board bear a certain surface resemblance to the beads on an abacus. Nonetheless, the circuit board does not perform addition because we can find a surface similarity to the abacus. The Law of Similarity and Contagion is not relevant. The circuit board would work in just the same fashion if every abacus upon Earth vanished in a puff of smoke, or if the beads of an abacus looked nothing like solder. A computer chip is not powered by its similarity to anything else, it just is. It exists in its own right, for its own reasons.
The Wright Brothers calculated that their plane would fly - before it ever flew - using reasoning that took no account whatsoever of their aircraft's similarity to a bird. They did look at birds (and I have looked at neuroscience) but the final calculations did not mention birds (I am fairly confident in asserting). A working airplane does not fly because it has wings "just like a bird". An airplane flies because it is an airplane, a thing that exists in its own right; and it would fly just as high, no more and no less, if no bird had ever existed.
[*] I've just realized that I can't name a way in which airplanes are like birds in which they aren't like humans. They have things sticking out their sides? So do humans, they're called arms. Maybe the cross-sectional shape of the wings are similar? I guess they both have pointy-ish bits at the front, that are a bit more pointy than human heads? TBC I don't think this footnote is at all relevant to the safety properties of RLHF'ed big transformers.
Edited to modify confidences about interpretations of EY's writing / claims.
In "Failure By Analogy" and "Surface Analogies and Deep Causes", the point being made is "X is similar in aspects A to thing Y, and X has property P" does not establish "Y has property P". The reasoning he instead recommends is to reason about Y itself, and sometimes it will have property P. This seems like a pretty good point to me.
This is a valid point, and that's not what I'm critiquing in that portion of the comment. I'm critiquing how -- on my read -- he confidently dismisses ANNs; in particular, using non-mechanistic reasoning which seems similar to some of his current alignment arguments.
On its own, this seems like a substantial misprediction for an intelligence researcher in 2008 (especially one who claims to have figured out most things in modern alignment, by a very early point in time -- possibly that early, IDK). Possibly the most important prediction to get right, to date.
Airplanes don't fly like birds, they fly like airplanes. So indeed you can't just ape one thing about birds[*] to get avian flight. I don't think this is a super revealing technicality but it seemed like you thought it was important.
Indeed, you can't ape one thing. But that's not what I'm critiquing. Consider the whole transformed line of reasoning:
avian flight comes from a lot of factors; you can't just ape one of the factors and expect the rest to follow; to get an entity which flies, that entity must be as close to a bird as birds are to each other.
The important part is the last part. It's invalid. Finding a design X which exhibits property P, doesn't mean that for design Y to exhibit property P, Y must be very similar to X.
Which leads us to:
Maybe most importantly I don't think Eliezer thinks you need to mimic the human brain super closely to get human-like intelligence with human-friendly wants
Reading the Alexander/Yudkowsky debate, I surprisingly haven't ruled out this interpretation, and indeed suspect he believes some forms of this (but not others).
Matters would be different if he said in the quotes you cite "you only get these human-like properties by very exactly mimicking the human brain", but he doesn't.
Didn't he? He at least confidently rules out a very large class of modern approaches.
because nothing you do with a loss function and gradient descent over 100 quadrillion neurons, will result in an AI coming out the other end which looks like an evolved human with 7.5MB of brain-wiring information and a childhood.
Like, in particular with respect to "learn 'don't steal' rather than 'don't get caught'."
how he confidently dismisses ANNs
I don't think this is a fair reading of Yudkowsky. He was dismissing people who were impressed by the analogy between ANNs and the brain. I'm pretty sure it wasn't supposed to be a positive claim that ANNs wouldn't work. Rather, it's that one couldn't justifiably believe that they'd work just from the brain analogy, and that if they did work, that would be bad news for what he then called Friendliness (because he was hoping to discover and wield a "clean" theory of intelligence, as contrasted to evolution or gradient descent happening to get there at sufficient scale).
Consider "Artificial Mysterious Intelligence" (2008). In response to someone who said "But neural networks are so wonderful! They solve problems and we don't have any idea how they do it!", it's significant that Yudkowsky's reply wasn't, "No, they don't" (contesting the capabilities claim), but rather, "If you don't know how your AI works, that is not good. It is bad" (asserting that opaque capabilities are bad for alignment).
One of Yudkowsky's claims in the post you link is:
It's hard to build a flying machine if the only thing you understand about flight is that somehow birds magically fly. What you need is a concept of aerodynamic lift, so that you can see how something can fly even if it isn't exactly like a bird.
This is a claim that lacking the correct mechanistic theory is a formidable barrier for capabilities, not just alignment, and it underestimates how much empirical understanding is available on which to base an empirical approach.
It's true that it's hard, even perhaps impossible, to build a flying machine if the only thing you understand is that birds "magically" fly.
But if you are like most people for thousands of years, you've observed many types of things flying, gliding, or floating in the air: birds and insects, fabric and leaves, arrows and spears, clouds and smoke.
So if you, like the Montgolfier brothers, observe fabric floating over a fire, and live in an era in which invention is celebrated and have the ability to build, test, and iterate, then you can probably figure out how to build a flying machine without basing this on a fully worked out concept of aerodynamics. Indeed, the Montgolfier brothers thought it was the smoke, rather than the heat, that made their balloons fly. Having the wrong theory was bad, but it didn't prevent them from building a working hot air balloon.
Let's try turning Yudkowsky's quote around:
It's hard to get a concept of aerodynamic lift if the only thing you observe about flight is that somehow birds magically fly. What you need is a rich set of empirical observations and flying mechanisms, so that you can find the common principles for how something can fly even if it isn't exactly like a bird.
Eliezer went on to list five methods for producing AI that he considered dubious, including building powerful computers running the most advanced available neural network algorithms, intelligence "emerging from the internet", and putting "a sufficiently huge quantity of knowledge into [a computer]." But he only admitted that two other methods would work - building a mechanical duplicate of the human brain and evolving AI via natural selection.
If Eliezer wasn't meaning to make a confident claim that scaling up neural networks without a fundamental theoretical understanding of intelligence would fail, then he did a poor job of communicating that in these posts. I don't find that blameworthy - I just think Eliezer comes across as confidently wrong about which avenues would lead to intelligence in these posts, simple as that. He was saying that to achieve a high level of AI capabilities, we'd need a deep mechanistic understanding of how intelligence works akin to our modern understanding of chemistry or aerodynamics, and that didn't turn out to be the case.
One possible defense is that Eliezer was attacking a weakman, specifically the idea that with only one empirical observation and zero insight into the factors that cause the property of interest (i.e. only seeing that "birds magically fly"), then it's nearly impossible to replicate that property in a new way. But that's an uninteresting claim and Eliezer is never uninteresting.
But he only admitted that two other methods would work - building a mechanical duplicate of the human brain and evolving AI via natural selection.
To be fair, he said that those two will work, and (perhaps?) admitted the possibility of "run advanced neural network algorithms" eventually working. Emphasis mine:
What do all these proposals have in common?
They are all ways to make yourself believe that you can build an Artificial Intelligence, even if you don't understand exactly how intelligence works.
Now, such a belief is not necessarily false!
I think it might be relevant to note here that it's not really humans who are building current SOTA AIs --- rather, it's some optimizer like SGD that's doing most of the work. SGD does not have any mechanistic understanding of intelligence (nor anything else). And indeed, it takes a heck of a lot of data and compute for SGD to build those AIs. This seems to be in line with Yudkowsky's claim that it's hard/inefficient to build something without understanding it.
If Eliezer wasn't meaning to make a confident claim that scaling up neural networks without a fundamental theoretical understanding of intelligence would fail, then [...]
I think it's important to distinguish between
Scaling up a neural network, and running some kind of fixed algorithm on it.
Scaling up a neural network, and using SGD to optimize the parameters of the NN, so that the NN ends up learning a whole new set of algorithms.
IIUC, in Artificial Mysterious Intelligence, Yudkowsky seemed to be saying that the former would probably fail. OTOH, I don't know what kinds of NN algorithms were popular back in 2008, or exactly what NN algorithms Yudkowsky was referring to, so... *shrugs*.
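To make that distinction concrete, here is a minimal toy sketch. It is my own illustration, not anything from Yudkowsky's post or from a real system: the one-weight "network", the names `predict`, `train_sgd`, `fixed_w`, and the target function y = 3x are all hypothetical. The first case runs a hand-fixed weight; the second lets per-sample gradient descent find the weight from data.

```python
# Toy contrast (all names and numbers are illustrative assumptions):
# (1) a "network" whose single weight is fixed by hand, versus
# (2) the same network whose weight is found by gradient descent on data.


def predict(w, x):
    """One-parameter linear 'network': output = w * x."""
    return w * x


def train_sgd(data, lr=0.01, epochs=200):
    """Per-sample gradient descent on squared error, starting from w = 0."""
    w = 0.0
    for _ in range(epochs):
        for x, y in data:
            grad = 2 * (predict(w, x) - y) * x  # d/dw of (w*x - y)^2
            w -= lr * grad
    return w


# Hypothetical task: the target behaviour is y = 3x.
data = [(x, 3.0 * x) for x in (1.0, 2.0, 3.0, 4.0)]

fixed_w = 1.0                # case 1: the designer picks the weight up front
learned_w = train_sgd(data)  # case 2: the optimizer picks the weight

print("fixed:  ", predict(fixed_w, 2.0))
print("learned:", predict(learned_w, 2.0))
```

In the second case nobody wrote down "multiply by 3"; the optimizer found it. Obviously this says nothing about which of the two Yudkowsky had in mind in 2008, it just pins down what the two options are.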
If that were the case, I actually would fault Eliezer, at least a little. He’s frequently, though by no means always, stuck to qualitative and hard-to-pin-down punditry like we see here, rather than to unambiguous forecasting.
This allows him, or his defenders, to retroactively defend his predictions as somehow correct even when they seem wrong in hindsight.
Let’s imagine for a moment that Eliezer’s right that AI safety is a cosmically important issue, and yet that he’s quite mistaken about all the technical details of how AGI will arise and how to effectively make it safe. It would be important to know whether we can trust his judgment and leadership.
Without either going with the most obvious interpretation of his qualitative judgments or having an unambiguous forecast, it’s hard to evaluate his performance as an AI safety leader. Combine that with a culture of deference to perceived expertise and status and the problem gets worse.
So I prioritize the avoidance of special pleading in this case: I think Eliezer comes across as clearly wrong in substance in this specific post, and that it’s important not to reach for ways “he was actually right from a certain point of view” when evaluating his predictive accuracy.
Similarly, I wouldn’t judge as correct the early COVID-19 pronouncements that masks don’t work to stop the spread just because cloth masks are poor-to-ineffective and many people refuse to wear masks properly. There’s a way we can stretch the interpretation to make them seem sort of right, but we shouldn’t. We should expect public health messaging to be clearly right in substance, if it’s not making cut and dry unambiguous quantitative forecasts but is instead delivering qualitative judgments of efficacy.
None of that bears on how easy or hard it was to build GPT-4. It only bears on how we should evaluate Eliezer as a forecaster/pundit/AI safety leader.
I think several things here, considering the broader thread:
I also don't really get your position. You say that,
[Eliezer] confidently dismisses ANNs
but you haven't shown this!
In Surface Analogies and Deep Causes, I read him as saying that neural networks don't automatically yield intelligence just because they share surface similarities with the brain. This is clearly true; at the very least, using token-prediction (which is a task for which (a) lots of training data exist and (b) lots of competence in many different domains is helpful) is a second requirement. If you took the network of GPT-4 and trained it to play chess instead, you wouldn't get something with cross-domain competence.
In Failure by Analogy he makes a very similar abstract point -- and wrt neural networks in particular, he says that the surface similarity to the brain is a bad reason to be confident in them. This also seems true. Do you really think that neural networks work because they are similar to brains on the surface?
You also said,
The important part is the last part. It's invalid. Finding a design X which exhibits property P, doesn't mean that for design Y to exhibit property P, Y must be very similar to X.
But Eliezer says this too in the post you linked! (Failure by Analogy). His example of airplanes not flapping is an example where the design that worked was less close to the biological thing. So clearly the point isn't that X has to be similar to Y; the point is that reasoning from analogy doesn't tell you this either way. (I kinda feel like you already got this, but then I don't understand what point you are trying to make.)
Which is actually consistent with thinking that large ANNs will get you to general intelligence. You can both hold that "X is true" and "almost everyone who thinks X is true does so for poor reasons". I'm not saying Eliezer did predict this, but nothing I've read proves that he didn't.
Also -- and this is another thing -- the fact that he didn't publicly make the prediction "ANNs will lead to AGI" is only weak evidence that he didn't privately think it because this is exactly the kind of prediction you would shut up about. One thing he's been very vocal on is that the current paradigm is bad for safety, so if he was bullish about the potential of that paradigm, he'd want to keep that to himself.
Didn't he? He at least confidently rules out a very large class of modern approaches.
Relevant quote:
because nothing you do with a loss function and gradient descent over 100 quadrillion neurons, will result in an AI coming out the other end which looks like an evolved human with 7.5MB of brain-wiring information and a childhood.
In that quote, he only rules out a large class of modern approaches to alignment, which again is nothing new; he's been very vocal about how doomed he thinks alignment is in this paradigm.
Something Eliezer does say which is relevant (in the post on Ajeya's biology anchors model) is
Or, more likely, it's not MoE [mixture of experts] that forms the next little trend. But there is going to be something, especially if we're sitting around waiting until 2050. Three decades is enough time for some big paradigm shifts in an intensively researched field. Maybe we'd end up using neural net tech very similar to today's tech if the world ends in 2025, but in that case, of course, your prediction must have failed somewhere else.
So here he's saying that there is a more effective paradigm than large neural nets, and we'd get there if we don't have AGI in 30 years. So this is genuinely a kind of bearishness on ANNs, but not one that precludes them giving us AGI.
Responding to part of your comment:
In that quote, he only rules out a large class of modern approaches to alignment, which again is nothing new; he's been very vocal about how doomed he thinks alignment is in this paradigm.
I know he's talking about alignment, and I'm criticizing that extremely strong claim. This is the main thing I wanted to criticize in my comment! I think the reasoning he presents is not much supported by his publicly available arguments.
That claim seems to be advanced due to... there not being enough similarities between ANNs and human brains -- that without enough similarity in mechanisms which were selected for by evolution, you simply can't get the AI to generalize in the mentioned human-like way. Not as a matter of the AI's substrate, but as a matter of the AI's policy not generalizing like that.
I think this is a dubious claim, and it's made based off of analogies to evolution / some unknown importance of having evolution-selected mechanisms which guide value formation (and not SGD-based mechanisms).
From the Alexander/Yudkowsky debate:
[Alexander][14:41]
Okay, then let me try to directly resolve my confusion. My current understanding is something like - in both humans and AIs, you have a blob of compute with certain structural parameters, and then you feed it training data. On this model, we've screened off evolution, the size of the genome, etc - all of that is going into the "with certain structural parameters" part of the blob of compute. So could an AI engineer create an AI blob of compute the same size as the brain, with its same structural parameters, feed it the same training data, and get the same result ("don't steal" rather than "don't get caught")?
[Yudkowsky][14:42]
The answer to that seems sufficiently obviously "no" that I want to check whether you also think the answer is obviously no, but want to hear my answer, or if the answer is not obviously "no" to you.
[Alexander][14:43]
Then I'm missing something, I expected the answer to be yes, maybe even tautologically (if it's the same structural parameters and the same training data, what's the difference?)
[Yudkowsky][14:46]
Maybe I'm failing to have understood the question. Evolution got human brains by evaluating increasingly large blobs of compute against a complicated environment containing other blobs of compute, got in each case a differential replication score, and millions of generations later you have humans with 7.5MB of evolution-learned data doing runtime learning on some terabytes of runtime data, using their whole-brain impressive learning algorithms which learn faster than evolution or gradient descent.
Your question sounded like "Well, can we take one blob of compute the size of a human brain, and expose it to what a human sees in their lifetime, and do gradient descent on that, and get a human?" and the answer is "That dataset ain't even formatted right for gradient descent."
There's some assertion like "no, there's not a way to get an ANN, even if incorporating structural parameters and information encoded in human genome, to actually unfold into a mind which has human-like values (like 'don't steal')." (And maybe Eliezer comes and says "no that's not what I mean", but, man, I sure don't know what he does mean, then.)
Here's some more evidence along those lines:
[Yudkowsky][14:08]
I mean, the evolutionary builtin part is not "humans have morals" but "humans have an internal language in which your Nice Morality, among other things, can potentially be written"...
Humans, arguably, do have an imperfect unless-I-get-caught term, which is manifested in children testing what they can get away with? Maybe if nothing unpleasant ever happens to them when they're bad, the innate programming language concludes that this organism is in a spoiled aristocrat environment and should behave accordingly as an adult? But I am not an expert on this form of child developmental psychology since it unfortunately bears no relevance to my work of AI alignment.
[Alexander][14:11]
Do you feel like you understand very much about what evolutionary builtins are in a neural network sense? EG if you wanted to make an AI with "evolutionary builtins", would you have any idea how to do it?
[Yudkowsky][14:13]
Well, for one thing, they happen when you're doing sexual-recombinant hill-climbing search through a space of relatively very compact neural wiring algorithms, not when you're doing gradient descent relative to a loss function on much larger neural networks.
Again, why is this true? This is an argument that should be engaging with technical questions about inductive biases, but instead seems to wave at (my words) "the original way we got property P was by sexual-recombinant hill-climbing search through a space of relatively very compact neural wiring algorithms, and good luck trying to get it otherwise."
Hopefully this helps clarify what I'm trying to critique?
I know he's talking about alignment, and I'm criticizing that extremely strong claim. This is the main thing I wanted to criticize in my comment! I think the reasoning he presents is not much supported by his publicly available arguments.
Ok, I don't disagree with this. I certainly didn't develop a gears-level understanding of why [building a brain-like thing with gradient descent on giant matrices] is doomed after reading the 2021 conversations. But that doesn't seem very informative either way; I didn't spend that much time trying to grok his arguments.
This is a valid point, and that's not what I'm critiquing. I'm critiquing how he confidently dismisses ANNs
I guess I read that as talking about the fact that at the time ANNs did not in fact really work. I agree he failed to predict that would change, but that doesn't strike me as a damning prediction.
Matters would be different if he said in the quotes you cite "you only get these human-like properties by very exactly mimicking the human brain", but he doesn't.
Didn't he? He at least confidently rules out a very large class of modern approaches.
Confidently ruling out a large class of modern approaches isn't really that similar to saying "the only path to success is exactly mimicking the human brain". It seems like one could rule them out by having some theory about why they're deficient. I haven't re-read List of Lethalities because I want to go to sleep soon, but I searched for "brain" and did not find a passage saying "the real problem is that we need to emulate the brain precisely but can't because of poor understanding of neuroanatomy" or something.
I don't want to get super hung up on this because it's not about anything Yudkowsky has said but:
Consider the whole transformed line of reasoning:
avian flight comes from a lot of factors; you can't just ape one of the factors and expect the rest to follow; to get an entity which flies, that entity must be as close to a bird as birds are to each other.
IMO this is not a faithful transformation of the line of reasoning you attribute to Yudkowsky, which was:
human intelligence/alignment comes from a lot of factors; you can't just ape one of the factors and expect the rest to follow; to get a mind which wants as humans do, that mind must be as close to a human as humans are to each other.
Specifically, where you wrote "an entity which flies", you were transforming "a mind which wants as humans do", which I think should instead be transformed to "an entity which flies as birds do". And indeed planes don't fly like birds do. [EDIT: two minutes or so after pressing enter on this comment, I now see how you could read it your way]
I guess if I had to make an analogy I would say that you have to be pretty similar to a human to think the way we do, but probably not to pursue the same ends, which is probably the point you cared about establishing.
Here's another attempt at one of my contentions.
Consider shard theory of human values. The point of shard theory is not "because humans do RL, and have nice properties, therefore AI + RL will have nice properties." The point is more "by critically examining RL + evidence from humans, I have hypotheses about the mechanistic load-bearing components of e.g. local-update credit assignment in a bounded-compute environment on certain kinds of sensory data, that these components lead to certain exploration/learning dynamics, which explain some portion of human values and experience. Let's test that and see if the generators are similar."
And my model of Eliezer shakes his head at the naivete of expecting complex human properties to reproduce outside of human minds themselves, because AI is not human.
But then I'm like "this other time you said 'AI is not human, stop expecting good property P from superficial similarities', you accidentally missed the modern AI revolution, right? Seems like there is some non-superficial mechanistic similarity/lessons here, and we shouldn't be so quick to assume that the brain's qualitative intelligence or alignment properties come from a huge number of evolutionarily-tuned details which are load-bearing and critical."
It now seems clear to me that EY was not bullish on neural networks leading to impressive AI capabilities. Eliezer said this directly:
I'm no fan of neurons; this may be clearer from other posts.[1]
I think this is strong evidence for my interpretation of the quotes in my parent comment: He's not just mocking the local invalidity of reasoning "because humans have lots of neurons, AI with lots of neurons -> smart", he's also mocking neural network-driven hopes themselves.
More quotes from Logical or Connectionist AI?:
Not to mention that neural networks have also been "failing" (i.e., not yet succeeding) to produce real AI for 30 years now. I don't think this particular raw fact licenses any conclusions in particular. But at least don't tell me it's still the new revolutionary idea in AI.
This is the original example I used when I talked about the "Outside the Box" box - people think of "amazing new AI idea" and return their first cache hit, which is "neural networks" due to a successful marketing campaign thirty goddamned years ago. I mean, not every old idea is bad - but to still be marketing it as the new defiant revolution? Give me a break.
In this passage, he employs well-scoped and well-hedged language via "this particular raw fact." I like this writing because it points out an observation, and then what inferences (if any) he draws from that observation. Overall, his tone is negative on neural networks.
Let's open up that "Outside the Box" box:
In Artificial Intelligence, everyone outside the field has a cached result for brilliant new revolutionary AI idea—neural networks, which work just like the human brain! New AI Idea: complete the pattern: "Logical AIs, despite all the big promises, have failed to provide real intelligence for decades—what we need are neural networks!"
This cached thought has been around for three decades. Still no general intelligence. But, somehow, everyone outside the field knows that neural networks are the Dominant-Paradigm-Overthrowing New Idea, ever since backpropagation was invented in the 1970s. Talk about your aging hippies.
This is more incorrect mockery.
As far as I can tell, the answer is: don’t reward your AIs for taking bad actions.
I think there's a mistake here which kind of invalidates the whole post. If we don't reward our AI for taking bad actions within the training distribution, it's still very possible that in the future world, looking quite unlike the training distribution, the AI will be able to find such an action. Same as ice cream wasn't in evolution's training distribution for us, but then we found it anyway.
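Here is a rough toy sketch of the worry, under heavily simplified assumptions that are mine and not the post's: foods are described by two made-up features (sugar, fat), a hypothetical `reward` function fires in proportion to them, and a linear `value` estimate is fit only on foods in the training set. Ice cream never appears in training and is never rewarded, yet the learned values rank it above everything that was.

```python
# Toy illustration only: all foods, feature values, and the reward rule are made up.
# Each food is described by two features: (sugar, fat).
TRAIN_FOODS = {
    "berries": (0.3, 0.0),
    "nuts": (0.1, 0.5),
    "lettuce": (0.0, 0.0),
    "meat": (0.0, 0.4),
}
ICE_CREAM = (0.9, 0.8)  # never appears during training


def reward(features):
    """Hypothetical reward signal: fires in proportion to sugar plus fat."""
    sugar, fat = features
    return sugar + fat


def value(w, features):
    """Learned linear estimate of how rewarding a food will be."""
    return w[0] * features[0] + w[1] * features[1]


def train(foods, lr=0.5, epochs=500):
    """Fit the value weights by per-sample gradient steps on squared error."""
    w = [0.0, 0.0]
    for _ in range(epochs):
        for features in foods.values():
            err = reward(features) - value(w, features)
            w[0] += lr * err * features[0]
            w[1] += lr * err * features[1]
    return w


w = train(TRAIN_FOODS)
best_seen = max(value(w, f) for f in TRAIN_FOODS.values())

# Compare the unseen food against the best food seen in training.
print("ice cream preferred over every training food:", value(w, ICE_CREAM) > best_seen)
```

The comparison comes out true even though no training step ever rewarded ice cream. A two-feature linear model is of course nothing like a real policy, so treat this as a picture of the claim rather than evidence for it.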
There's no way to raise a human such that their value system cleanly revolves around the one single goal of duplicating a strawberry, and nothing else. By asking for a method of forming values which would permit such a narrow specification of end goals, you're asking for a value formation process that's fundamentally different from the one humans use. There's no guarantee that such a thing even exists, and implicitly aiming to avoid the one value formation process we know is compatible with our own values seems like a terrible idea.
I narrowly agree with most of this, but I tend to say the same thing with a very different attitude:
I would say: “Gee it would be super cool if we could decide a priori what we want the AGI to be trying to do, WITH SURGICAL PRECISION. But alas, that doesn’t seem possible, at least not according to any method I know of.”
I disagree with you in your apparent suggestion that the above paragraph is obvious or uninteresting, and also disagree with your apparent suggestion that “setting an AGI’s motivations with surgical precision” is such a dumb idea that we shouldn’t even waste one minute of our time thinking about whether it might be possible to do that.
For example, people who are used to programming almost any other type of software have presumably internalized the idea that the programmer can decide what the software will do with surgical precision. So it's important to spread the idea that, on current trends, AGI software will be very different from that.
BTW I do agree with you that Eliezer’s interview response seems to suggest that he thinks aligning an AGI to “basic notions of morality” is harder and aligning an AGI to “strawberry problem” is easier. If that’s what he thinks, it’s at least not obvious to me. (see follow-up)
BTW I do agree with you that Eliezer’s interview response seems to suggest that he thinks aligning an AGI to “basic notions of morality” is harder and aligning an AGI to “strawberry problem” is easier. If that’s what he thinks, it’s at least not obvious to me.
My sense (which I expect Eliezer would agree with) is that it's relatively easy to get an AI system to imitate the true underlying 'basic notions of morality', to the extent humans agree on that, but that this doesn't protect you at all as soon as you want to start making large changes, or as soon as you start trying to replace specialist sectors of the economy. (A lot of ethics for doctors has to do with the challenges of simultaneously being a doctor and a human; those ethics will not necessarily be relevant for docbots, and the question of what they should be instead is potentially hard to figure out.)
So if you're mostly interested in getting out of the acute risk period, you probably need to aim for a harder target.
Hmm, on further reflection, I was mixing up
Eliezer definitely talks about the latter. I’m not sure Eliezer has ever brought up the former? I think I was getting that from the OP (Quintin), but maybe Quintin was just confused (and/or Eliezer misspoke).
Anyway, making an AGI that can solve the strawberry problem is tautologically no harder than making an AGI that can do advanced technological development and is motivated by human norms / morals / whatever, because the latter set of AGIs is a subset of the former.
Sorry. I crossed out that paragraph. :)
"I have to be wrong about something, which I certainly am. I have to be wrong about something which makes the problem easier rather than harder, for those people who don't think alignment's going to be all that hard. If you're building a rocket for the first time ever, and you're wrong about something, it's not surprising if you're wrong about something. It's surprising if the thing that you're wrong about causes the rocket to go twice as high, on half the fuel you thought was required and be much easier to steer than you were afraid of."
I agree with OP that this rocket analogy from Eliezer is a bad analogy, AFAICT. If someone is trying to assess the difficulty of solving a technical problem (e.g. building a rocket) in advance, then they need to brainstorm potential problems that might come up, and when they notice one, they also need to brainstorm potential technical solutions to that problem. For example “the heat of reentry will destroy the ship” is a potential problem, and “we can invent new and better heat-resistant tiles / shielding” is a potential solution to that problem. During this process, I don’t think it’s particularly unusual for the person to notice a technical problem but overlook a clever way to solve that problem. (Maybe they didn’t recognize the possibility of inventing new super-duper-heat-resistant ceramic tiles, or whatever.) And then they would wind up overly pessimistic.
During this process, I don’t think it’s particularly unusual for the person to notice a technical problem but overlook a clever way to solve that problem.
I think this isn't the claim; I think the claim is that it would be particularly unusual for someone to overlook that they're accidentally solving a technical problem. (It would be surprising for Edison to not be thinking hard about what filament to use and pick tungsten; in actual history, it took decades for that change to be made.)
Sure, but then the other side of the analogy doesn’t make sense, right? The context was: Eliezer was talking in general terms about the difficulty of the AGI x-risk problem and whether it’s likely to be solved. (As I understand it.)
[Needless to say, I’m just making a narrow point that it’s a bad analogy. I’m not arguing that p(doom) is high or low, I’m not saying this is an important & illustrative mistake (talking on the fly is hard!), etc.]
So I definitely think that's something weirdly unspoken about the argument; I would characterize it as Eliezer saying "suppose I'm right and they're wrong; all this requires is things to be harder than people think, which is usual. Suppose instead that I'm wrong and they're right; this requires things to be easier than people think, which is unusual." But the equation of "people" and "Eliezer" is sort of strange; as Quintin notes, it isn't that unusual for outside observers to overestimate difficulty, and so I wish he had centrally addressed the reference class tennis game: is the expertise "getting AI systems to be capable" or "getting AI systems to do what you want"?
This post brought to mind a thought: I actually don't care very much about arguments about how likely doom is and how pessimistic or optimistic to be since they are irrelevant, to my style of thinking, for making decisions related to building TAI. Instead, I mostly focus on downside risks and avoiding them because they are so extreme, which makes me look "pessimistic" but actually I'm just trying to minimize the risk of false positives in building aligned AI. Given this framing, it's actually less important, in most cases, to figure out how likely something is, and more important to figure out how likely doom is if we are wrong, and carefully navigate the path that minimizes the risk of doom, regardless of what the assessment of doom is.
I find the prospect of training a model on just 40 parameters to be very interesting. Almost unbelievable, really, to the point where I'm tempted to say: "I notice that I'm confused". Unfortunately, I don't have access to the paper and it doesn't seem to be on sci-hub, so I haven't been able to resolve my confusion. Basically, my general intuition is that each parameter in a network probably only contributes a few bits of optimization power. It can be set fairly high, fairly low, or in between. So if you just pulled 40 random weights from the network, that's maybe 120 bits of optimization power. Which might be enough for MNIST, but probably not for anything more complicated. So I'm guessing that most likely a bunch of other optimization went into choosing exactly which 40-dimensional subspace we should be using. Of course, if we're allowed to do that then we could even do it with a 1-dimensional subspace: Just pick the training trajectory as your subspace!
Generally with the mindspace thing, I don't really think about the absolute size or dimension of mindspace, but the relative size of "things we could build" and "things we could build that would have human values". This relative size is measured in bits. So the intuition here would be that it takes a lot of bits to specify human values, and so the difference in size between these two is really big. Now maybe if you're given Common Crawl, it takes fewer bits to point to human values within that big pile of information. But it's probably still a lot of bits, and then the question is how do you actually construct such a pointer?
I agree that demons are unlikely to be a problem, at least for basic gradient descent. They should have shown up by now in real training runs, otherwise. I do still think gradient descent is a very unpredictable process (or to put it more precisely: we still don't know how to predict gradient descent very well), and where that shows up is in generalization. We have a very poor sense of which things will generalize and which things will not generalize, IMO.
BTW: the way I found that first link was by searching the title on google scholar, finding the paper, and clicking "All 5 versions" below (it's right next to "Cited by 7" and "Related articles"). That brought me to a bunch of versions, one of which was a seemingly-ungated PDF. This will probably frequently work, because AI researchers usually make their papers publicly available (at least in pre-print form).
What stood out to me in the video is Eliezer no longer being able to conceive of any positive outcome at all, which is beyond reason. It made me wonder what approach a company could possibly develop for alignment, or what a supposedly aligned AI could possibly do, for Eliezer to take back his doom predictions, and suspect that the answer is none. The impression I got was that he is meanwhile closed to the possibility entirely. I found the Time article heartbreaking. These are parents, intelligent, rational parents who I have respect and compassion for, essentially grieving the death of a young, healthy child, based on the unjustified certainty of impending doom. I've read more hopeful accounts from people living in Ukrainian warzones, or in parts of the Sahel swallowed by the Sahara, or islands getting drowned by climate change, where the evidence of risk and lack of reason for hope is far more conclusive; at the end of the day, Eliezer is worried that we will fail at making a potentially emerging powerful agent be friendly, while we know extremely little about these agents and their natural alignment tendencies. In comparison to so many other doom scenarios the certainty here is just really not high. I am glad people here are taking AI risk seriously, that this risk is being increasingly recognised. But this trend towards "dying with dignity" because all hope is seen as lost is very sad, and very worrying, and very wrong. The case for climate change risk is far, far more clear, and yet you will note that climate activists are neither advocating terrorism, nor giving up, nor pronouncing certain doom. There is grief and there is fear and the climate activist scene has many problems, but I have never felt this pronounced wrongness there.
This market by Eliezer about the possible reasons why AI may yet have a positive outcome seems to refute your first sentence.
Also, I haven't seen any AI notkilleveryoneism people advocating terrorism or giving up.
Introduction
I recently watched Eliezer Yudkowsky's appearance on the Bankless podcast, where he argued that AI was nigh-certain to end humanity. Since the podcast, some commentators have offered pushback against the doom conclusion. However, one sentiment I saw was that optimists tended not to engage with the specific arguments pessimists like Yudkowsky offered.
Economist Robin Hanson points out that this pattern is very common for small groups which hold counterintuitive beliefs: insiders develop their own internal language, which skeptical outsiders usually don't bother to learn. Outsiders then make objections that focus on broad arguments against the belief's plausibility, rather than objections that focus on specific insider arguments.
As an AI "alignment insider" whose current estimate of doom is around 5%, I wrote this post to explain some of my many objections to Yudkowsky's specific arguments. I've split this post into chronologically ordered segments of the podcast in which Yudkowsky makes one or more claims with which I particularly disagree.
I have my own view of alignment research: shard theory, which focuses on understanding how human values form, and on how we might guide a similar process of value formation in AI systems.
I think that human value formation is not that complex, and does not rely on principles very different from those which underlie the current deep learning paradigm. Most of the arguments you're about to see from me are less:
and more:
My objections
Will current approaches scale to AGI?
Yudkowsky apparently thinks not
...and that the techniques driving current state of the art advances, by which I think he means the mix of generative pretraining + small amounts of reinforcement learning such as with ChatGPT, aren't reliable enough for significant economic contributions. However, he also thinks that the current influx of money might stumble upon something that does work really well, which will end the world shortly thereafter.
I'm a lot more bullish on the current paradigm. People have tried lots and lots of approaches to getting good performance out of computers, including lots of "scary seeming" approaches such as:
Mostly, these don't work very well. The current capabilities paradigm is state of the art because it gives the best results of anything we've tried so far, despite lots of effort to find better paradigms.
When capabilities advances do work, they typically integrate well with the current alignment[1] and capabilities paradigms. E.g., I expect that we can apply current alignment techniques such as reinforcement learning from human feedback (RLHF) to evolved architectures. Similarly, I expect we can use a learned optimizer to train a network on gradients from RLHF. In fact, the eleventh example is actually ConstitutionalAI from Anthropic, which arguably represents the current state of the art in language model alignment techniques!
This doesn't mean there are no issues with interfacing between new capabilities advances and current alignment techniques. E.g., if we'd initially trained the learned optimizer on gradients from supervised learning, we might need to finetune the learned optimizer to make it work well with RLHF gradients, which I expect would follow a somewhat different distribution from the supervised gradients we'd trained the optimizer on.
However, I think such issues largely fall under "ordinary engineering challenges", not "we made too many capabilities advances, and now all our alignment techniques are totally useless". I expect future capabilities advances to follow a similar pattern as past capabilities advances, and not completely break the existing alignment techniques.
Finally, I'd note that, despite these various clever capabilities approaches, progress towards general AI seems pretty smooth to me (fast, but smooth). GPT-3 was announced almost three years ago, and large language models have gotten steadily better since then.
Discussion of human generality
Yudkowsky says humans aren't fully general
Evolution did not give humans specific cognitive capabilities, such that we should now consider ourselves to be particularly well-tuned for tasks similar to those that were important for survival in the ancestral environment. Evolution gave us a learning process, and then biased that learning process towards acquiring capabilities that were important for survival in the ancestral environment.
This is important, because the most powerful and scalable learning processes are also simple and general. The transformer architecture was originally developed specifically for language modeling. However, it turns out that the same architecture, with almost no additional modifications, can learn image recognition, navigate game environments, process audio, and so on. I do not believe we should describe the transformer architecture as being "specialized" to language modeling, despite it having been found by an 'architecture search process' that was optimizing for performance only on language modeling objectives.
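As a rough sketch of why the architecture itself is not tied to one modality, the toy single-head self-attention function below accepts any sequence of vectors. It is written purely for illustration: it uses identity projections instead of learned weight matrices (a simplifying assumption), and the "token" and "patch" embeddings are random stand-ins rather than real data.

```python
import math
import random

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(seq):
    """Toy single-head self-attention with identity projections (an assumption).

    seq is a list of d-dimensional vectors; the output has the same shape.
    """
    d = len(seq[0])
    out = []
    for q in seq:
        # Scaled dot-product scores of this position against every position.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in seq]
        weights = softmax(scores)
        # Weighted average of the value vectors (here, the inputs themselves).
        out.append([sum(w * v[i] for w, v in zip(weights, seq)) for i in range(d)])
    return out

rng = random.Random(0)
# Hypothetical inputs: 5 "token embeddings" and 9 "image patch embeddings".
text_tokens = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(5)]
image_patches = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(9)]

text_out = self_attention(text_tokens)
image_out = self_attention(image_patches)
```

Nothing in `self_attention` refers to language or images; in this sketch only the inputs differ.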
Thus, I'm dubious of the inference from:
to:
There are, of course, possible modifications one could make to the human brain that would make humans better coders. However, time and again, we've found that deep learning systems improve more through scaling, of either the data or the model. Additionally, the main architectural difference between human and other primate brains is likely scale, and not e.g., the relative sizes of different regions or maturation trajectories.
See also: The Brain as a Universal Learning Machine and Brain Efficiency: Much More than You Wanted to Know
Yudkowsky talks about an AI being more general than humans
I think powerful cognition mostly comes from simple learning processes applied to complex data. Humans are actually pretty good at "reprogramming" themselves. We might not be able to change our learning process much[2], but we can change our training data quite a lot. E.g., if you run into something unfamiliar, you can read a book about the thing, talk to other people about it, run experiments to gather thing-specific data, etc. All of these are ways of deliberately modifying your own cognition to make you more capable in this new domain.
Additionally, the fact that techniques such as sensory substitution work in humans, or the fact that losing a given sense causes the brain to repurpose regions associated with that sense, suggests we're not that constrained by our architecture, either.
Again: most of what separates a vision transformer from a language model is the data they're trained on.
How to think about superintelligence
Yudkowsky describes superintelligence
This seems like way too high a bar. It seems clear that you can have transformative or risky AI systems that are still worse than humans at some tasks. This seems like the most likely outcome to me. Current AIs have huge deficits in odd places. For example, GPT-4 may beat most humans on a variety of challenging exams (page 5 of the GPT-4 paper), but still can't reliably count the number of words in a sentence.
Compared to Yudkowsky, I think I expect AI capabilities to increase more smoothly with time, though not necessarily more slowly. I don't expect a sudden jump where AIs go from being better at some tasks and worse at others, to being universally better at all tasks.
The difficulty of alignment
Yudkowsky on the width of mind space
I think this is extremely misleading. Firstly, real-world data in high dimensions basically never look like spheres. Such data almost always cluster in extremely compact manifolds, whose internal volume is minuscule compared to the full volume of the space they're embedded in. If you could visualize the full embedding space of such data, it might look somewhat like an extremely sparse "hairball" of many thin strands, interwoven in complex and twisty patterns, with even thinner "fuzz" coming off the strands in even more-complex fractal-like patterns, but with vast gulfs of empty space between the strands.
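A toy illustration of data that occupies far fewer dimensions than the space it sits in. This sketch assumes the simplest possible case, a flat (linear) 3-dimensional subspace inside a 30-dimensional space, and counts independent directions with Gram-Schmidt. Real data manifolds are curved, so a linear count like this would overestimate their intrinsic dimension. All sizes and names are illustrative.

```python
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def numerical_rank(points, tol=1e-8):
    """Count linearly independent directions among points (modified Gram-Schmidt)."""
    basis = []
    for p in points:
        r = list(p)
        for b in basis:
            c = dot(r, b)
            r = [ri - c * bi for ri, bi in zip(r, b)]
        norm = dot(r, r) ** 0.5
        if norm > tol:  # this point adds a genuinely new direction
            basis.append([ri / norm for ri in r])
    return len(basis)

rng = random.Random(0)
D, k, n = 30, 3, 200  # ambient dimension, latent dimension, number of points

# Structured data: every point is a combination of the same k directions.
directions = [[rng.gauss(0, 1) for _ in range(D)] for _ in range(k)]
structured = []
for _ in range(n):
    z = [rng.gauss(0, 1) for _ in range(k)]
    structured.append([sum(z[j] * directions[j][i] for j in range(k)) for i in range(D)])

# Unstructured data: independent Gaussian noise in all D coordinates.
unstructured = [[rng.gauss(0, 1) for _ in range(D)] for _ in range(n)]

rank_structured = numerical_rank(structured)
rank_unstructured = numerical_rank(unstructured)
```

Here the structured points span only `k` directions even though each one is written with `D` coordinates, while the noise fills the whole space.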
In math-speak, high dimensional data manifolds almost always have vastly smaller intrinsic dimension than the spaces in which they're embedded. This includes the data manifolds for both of:
As a consequence, it's a bad idea to use "the size of mind space" as an intuition pump for "how similar are things from two different parts of mind space?"
The manifold of possible mind designs for powerful, near-future intelligences is surprisingly small. The manifold of learning processes that can build powerful minds in real world conditions is vastly smaller than that.
It's no coincidence that state of the art AI learning processes and the human brain both operate on similar principles: an environmental model mostly trained with self-supervised prediction, combined with a relatively small amount of reinforcement learning to direct cognition in useful ways. In fact, alignment researchers recently narrowed this gap even further by applying reinforcement learning[3] throughout the training process, rather than just doing RLHF at the end, as with current practice.
The researchers behind such developments, by and large, were not trying to replicate the brain. They were just searching for learning processes that do well at language. It turns out that there aren't many such processes, and in this case, both evolution and human research converged to very similar solutions. And once you condition on a particular learning process and data distribution, there aren't that many more degrees of freedom in the resulting mind design. To illustrate:
Both of these imply low variation in cross-model internal representations, given similar training setups. The technique in the Low Dimensional Trajectory Hypothesis paper would produce a manifold of possible "minds" with an intrinsic dimension of 40 or less, despite operating in a ~30 million dimensional space. Of course, the standard practice of training all network parameters at once is much less restricting, but I still expect realistic training processes to produce manifolds whose intrinsic dimension is tiny, compared to the full dimension of mind space itself, as this paper suggests.
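To make "training in a tiny subspace" concrete, here is a minimal sketch under stated assumptions: a toy quadratic loss, a fixed random basis `P`, and small illustrative sizes (`D = 200`, `d = 5`). It does not reproduce how the paper chooses its subspace. The point is only that every reachable parameter vector has the form `theta0 + P z`, so the set of trainable outcomes has dimension at most `d` no matter how large `D` is.

```python
import random

rng = random.Random(0)
D, d = 200, 5  # full parameter count vs. subspace dimension (illustrative values)

theta0 = [0.0] * D                             # fixed initialisation
target = [rng.gauss(0, 1) for _ in range(D)]   # hypothetical optimum of a toy loss
# Fixed random basis P (D x d); only the d coordinates in z are trained.
P = [[rng.gauss(0, 1) / D ** 0.5 for _ in range(d)] for _ in range(D)]

def params(z):
    """Map subspace coordinates z to full parameters theta0 + P z."""
    return [theta0[i] + sum(P[i][j] * z[j] for j in range(d)) for i in range(D)]

def loss(theta):
    return 0.5 * sum((t - g) ** 2 for t, g in zip(theta, target))

z = [0.0] * d
initial_loss = loss(params(z))
for _ in range(50):
    theta = params(z)
    grad_theta = [t - g for t, g in zip(theta, target)]  # dL/dtheta
    # Chain rule: dL/dz = P^T dL/dtheta.
    grad_z = [sum(P[i][j] * grad_theta[i] for i in range(D)) for j in range(d)]
    z = [zj - 0.5 * gj for zj, gj in zip(z, grad_z)]
final_loss = loss(params(z))
```

With a random basis the loss only falls a little, since a random 5-dimensional slice captures a small part of the target. That is consistent with the expectation that how the subspace is chosen matters a great deal.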
Finally, the number of data distributions that we could use to train powerful AIs in the near future is also quite limited. Mostly, such data distributions come from human text, and mostly from the Common Crawl specifically, combined with various different ways to curate or augment that text. This drives trained AIs to be even more similar to humans than you'd expect from the commonalities in learning processes alone.
So the true volume of the manifold of possible future mind designs is vaguely proportional to:
(N distinct learning processes)×(N data distributions)×(cross-run variation)
The manifold of mind designs is thus:
(Point 3 also implies that human minds are spread much more broadly in the manifold of future minds than you'd expect, since our training data / life experiences are actually pretty diverse, and most training processes for powerful AIs would draw much of their data from humans.)
As a consequence of the above, a 2-D projection of mind space would look less like this:
and more like this:
Yudkowsky brings up strawberry alignment
My first objection is: human value formation doesn't work like this. There's no way to raise a human such that their value system cleanly revolves around the one single goal of duplicating a strawberry, and nothing else. By asking for a method of forming values which would permit such a narrow specification of end goals, you're asking for a value formation process that's fundamentally different from the one humans use. There's no guarantee that such a thing even exists, and implicitly aiming to avoid the one value formation process we know is compatible with our own values seems like a terrible idea.
It also assumes that the orthogonality thesis should hold with respect to alignment techniques - that such techniques should be equally capable of aligning models to any possible objective.
This seems clearly false in the case of deep learning, where progress on instilling any particular behavioral tendencies in models roughly follows the amount of available data that demonstrate said behavioral tendency. It's thus vastly easier to align models to goals where we have many examples of people executing said goals. As it so happens, we have roughly zero examples of people performing the "duplicate this strawberry" task, but many more examples of e.g., humans acting in accordance with human values, ML / alignment research papers, chatbots acting as helpful, honest and harmless assistants, people providing oversight to AI models, etc. See also: this discussion.
Probably, the best way to tackle "strawberry alignment" is to train the AI with a mix of other, broader, objectives with more available data, like "following human instructions", "doing scientific research" or "avoid disrupting stuff", then trying to compose many steps of human-supervised, largely automated scientific research towards the problem of strawberry duplication. However, this wouldn't be an example of strawberry alignment, but of general alignment, which had been directed towards the strawberry problem. Such an AI would have many values beyond strawberry duplication.
Related: Alex Turner objects to this sort of problem decomposition because it doesn't actually seem to make the problem any easier.
Also related: the best poem-writing AIs are general-purpose language models that have been directed towards writing poems.
I also don't think we want alignment techniques that are equally useful for all goals. E.g., we don't want alignment techniques that would let you easily turn a language model into an agent monomaniacally obsessed with paperclip production.
Yudkowsky argues against AIs being steerable by gradient descent
...that we can't point an AI's learned cognitive faculties in any particular direction because the "hill-climbing paradigm" is incapable of meaningfully interfacing with the inner values of the intelligences it creates. Evolution is his central example in this regard, since evolution failed to direct our cognitive faculties towards inclusive genetic fitness, the single objective it was optimizing us for.
This is an argument he makes quite often, here and elsewhere, and I think it's completely wrong. I think that analogies to evolution tell us roughly nothing about the difficulty of alignment in machine learning. I have a post explaining as much, as well as a comment summarizing the key point:
Evolution can only optimize over our learning process and reward circuitry, not directly over our values or cognition. Moreover, robust alignment to IGF requires that you even have a concept of IGF in the first place. Ancestral humans never developed such a concept, so it was never useful for evolution to select for reward circuitry that would cause humans to form values around the IGF concept.
It would be an enormous coincidence if the reward circuitry that led us to form values around those IGF-promoting concepts that are learnable in the ancestral environment were to also lead us to form values around IGF itself once it became learnable in the modern environment, despite the reward circuitry not having been optimized for that purpose at all. That would be like successfully directing a plane to land at a particular airport while only being able to influence the geometry of the plane's fuselage at takeoff, without even knowing where to find the airport in question.
[Gradient descent] is different in that it directly optimizes over values / cognition, and that AIs will presumably have a conception of human values during training.
Yudkowsky brings up humans liking ice cream as an example of values misgeneralization caused by the shift to our modern environment
This example nicely illustrates my previous point. It also illustrates the importance of thinking mechanistically, and not allegorically. I think it's straightforward to explain why humans "misgeneralized" to liking ice cream. Consider:
(We sometimes colloquially call these sorts of tendencies "food preferences".)
So, the reason humans like ice cream is that evolution created a learning process with hard-coded circuitry that assigns high rewards for eating foods like ice cream. Someone eats ice cream, hardwired reward circuits activate, and the person becomes more inclined to navigate into scenarios where they can eat ice cream in the future. I.e., they acquire a preference for ice cream.
What does this mean for alignment? How do we prevent AIs from behaving badly as a result of a similar "misgeneralization"? What alignment insights does the fleshed-out mechanistic story of humans coming to like ice cream provide?
As far as I can tell, the answer is: don't reward your AIs for taking bad actions.
That's all it would take, because the mechanistic story above requires a specific step where the human eats ice cream and activates their reward circuits. If you stop the human from receiving reward for eating ice cream, then the human no longer becomes more inclined to navigate towards eating ice cream in the future.
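The mechanism can be sketched as a two-action bandit trained with a REINFORCE-style policy gradient. This is a toy under loud assumptions: the action names, reward values, learning rate, and step count are all made up for illustration, and the reward function is known exactly, which is precisely what is hard in practice.

```python
import math
import random

ACTIONS = ["eat_ice_cream", "eat_plain_rice"]  # hypothetical action names

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def train(reward_for, steps=500, lr=0.1, seed=0):
    """REINFORCE on a two-action bandit; returns the final P(eat_ice_cream)."""
    rng = random.Random(seed)
    logits = [0.0, 0.0]  # start with no preference either way
    for _ in range(steps):
        probs = softmax(logits)
        a = 0 if rng.random() < probs[0] else 1
        r = reward_for[ACTIONS[a]]
        # Gradient of log pi(a) w.r.t. logit i is 1[i == a] - probs[i].
        for i in range(2):
            logits[i] += lr * r * ((1.0 if i == a else 0.0) - probs[i])
    return softmax(logits)[0]

# Reward circuitry fires for ice cream: the preference grows.
p_rewarded = train({"eat_ice_cream": 1.0, "eat_plain_rice": 0.0})
# Ice cream is still eaten sometimes, but never rewarded: no preference forms.
p_unrewarded = train({"eat_ice_cream": 0.0, "eat_plain_rice": 0.0})
```

In this sketch the update is proportional to the reward, so with zero reward the logits never move and the agent stays indifferent.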
Note that I'm not saying this is an easy task, especially since modern RL methods often use learned reward functions whose exact contours are unknown to their creators.
But from what I can tell, Yudkowsky's position is that we need an entirely new paradigm to even begin to address these sorts of failures. Take his statement from later in the interview:
In contrast, I think we can explain humans' tendency to like ice cream using the standard language of reinforcement learning. It doesn't require that we adopt an entirely new paradigm before we can even get a handle on such issues.
Edit: Why evolution is not like AI training
Some of the comments have convinced me it's worthwhile to elaborate on why I think human evolution is actually very different from training AIs, and why it's so difficult to extract useful insights about AI training from evolution.
In part 1 of this edit, I'll compare the human and AI learning processes, and how the different parts of these two types of learning processes relate to each other. In part 2, I'll explain why I think analogies between human evolution and AI training that don't appropriately track this relationship lead to overly pessimistic conclusions, and how corrected versions of such analogies lead to uninteresting conclusions.
(Part 1, relating different parts of human and AI learning processes)
Every learning process that currently exists, whether human, animal or AI, operates on three broad levels:
At the top level, there are the (largely fixed) instructions that determine how the learning process works overall.
For AIs, this means the training code that determines stuff such as:
For humans, this means the genomic sequences that determine stuff like:
At the middle level, there's the stuff that stores the information and behavioral patterns that the learning process has accumulated during its interactions with the environment.
For AIs, this means gigantic matrices of floating point numbers that we call weights. The top level (the training code) defines how these weights interact with possible inputs to produce the AI's outputs, as well as how these weights should be locally updated so that the AI's outputs score well on the AI's loss / reward functions.
For humans, this mostly[4] means the connectome: the patterns of inter-neuron connections formed by the brain's synapses, in combination with the various individual neuron and synapse-level factors that influence how each neuron communicates with neighbors. The top level (the person's genome) defines how these cells operate and how they should locally change their behaviors to improve the brain's predictive accuracy and increase reward.
Two important caveats about the human case:
At the bottom level, there's the stuff that queries the information / behavioral patterns stored in the middle level, decides which of the middle layer content is relevant to whatever situation the learner is currently navigating, and combines the retrieved information / behaviors with the context of the current situation to produce the learner's final decisions.
For AIs, this means smaller matrices of floating point numbers which we call activations.
For humans, this means the patterns of neuron and synapse-level excitations, which we also call activations.
The learning process then interacts with data from its environment, locally updating the stuff in the middle level with information and behavioral patterns that cause the learner to be better at modeling its environment and at getting high reward on the distribution of data from the training environment.
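A minimal sketch of the three levels in code, assuming a one-neuron linear learner and a made-up dataset. The class and variable names are hypothetical; the mapping to the levels is given in the comments.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TopLevel:
    """Fixed instructions: architecture and update rule (training code / genome analogue)."""
    learning_rate: float = 0.1

    def forward(self, weights, x):
        # Bottom level: the activation exists only while one input is processed.
        activation = sum(w * xi for w, xi in zip(weights, x))
        return activation

    def update(self, weights, x, y):
        # Local update of the middle level to reduce squared error on (x, y).
        error = self.forward(weights, x) - y
        return [w - self.learning_rate * error * xi for w, xi in zip(weights, x)]

top = TopLevel()          # top level: never modified by learning
weights = [0.0, 0.0]      # middle level: accumulates what has been learned

# Hypothetical environment data: (input, desired output) pairs.
data = [([1.0, 0.0], 2.0), ([0.0, 1.0], -1.0)] * 50
for x, y in data:
    weights = top.update(weights, x, y)
```

After the loop, `weights` has moved close to `[2.0, -1.0]` while `top` is unchanged: the environment only ever writes into the middle level.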
(Part 2, how this matters for analogies from evolution)
Many of the most fundamental questions of alignment are about how AIs will generalize from their training data. E.g., "If we train the AI to act nicely in situations where we can provide oversight, will it continue to act nicely in situations where we can't provide oversight?"
When people try to use human evolutionary history to make predictions about AI generalizations, they often make arguments like "In the ancestral environment, evolution trained humans to do X, but in the modern environment, they do Y instead." Then they try to infer something about AI generalizations by pointing to how X and Y differ.
However, such arguments make a critical misstep: evolution optimizes over the human genome, which is the top level of the human learning process. Evolution applies very little direct optimization power to the middle level. E.g., evolution does not transfer the skills, knowledge, values, or behaviors learned by one generation to their descendants. The descendants must re-learn those things from information present in the environment (which may include demonstrations and instructions from the previous generation).
This distinction matters because the entire point of a learning system being trained on environmental data is to insert useful information and behavioral patterns into the middle level stuff. But this (mostly) doesn't happen with evolution, so the transition from ancestral environment to modern environment is not an example of a learning system generalizing from its training data. It's not an example of:
It's an example of:
These are completely different kinds of transitions, and trying to reason from an instance of the second kind of transition (humans in ancestral versus modern environments), to an instance of the first kind of transition (future AIs in training versus deployment), will very easily lead you astray.
Two different learning systems, trained on data from two different distributions, will usually have greater divergence between their behaviors, as compared to a single system which is being evaluated on the data from the two different distributions. Treating our evolutionary history like humanity's "training" will thus lead to overly pessimistic expectations regarding the stability and predictability of an AI's generalizations from its training data.
Drawing correct lessons about AI from human evolutionary history requires tracking how evolution influenced the different levels of the human learning process. I generally find that such corrected evolutionary analogies carry implications that are far less interesting or concerning than their uncorrected counterparts. E.g., here are two ways of thinking about how humans came to like ice cream:
In particular, this outcome doesn't tell us anything new or concerning from an alignment perspective. The only lesson applicable to a single training process is the fact that, if you reward a learner for doing something, they'll tend to do similar stuff in the future, which is pretty much the common understanding of what rewards do.
Thanks to Alex Turner for providing feedback on this edit.
End of edited text.
Yudkowsky claims that evolution has a stronger simplicity bias than gradient descent:
On a direct comparison, I think there's no particular reason that one would be more simplicity biased than the other. If you were to train two neural networks using gradient descent and evolution, I don't have strong expectations for which would learn simpler functions. As it happens, gradient descent already has really strong simplicity biases.
The complication is that Yudkowsky is not making a direct comparison. Evolution optimized over the human genome, which configures the human learning process. This introduces what he calls an "information bottleneck", limiting the amount of information that evolution can load into the human learning process to be a small fraction of the size of the genome. However, I think the bigger difference is that evolution was optimizing over the parameters of a learning process, while training a network with gradient descent optimizes over the cognition of a learned artifact. This difference probably makes it invalid to compare between the simplicity of gradient descent on networks, versus evolution on the human learning process.
Yudkowsky tries to predict the inner goals of a GPT-like model.
As it happens, I do not think that optimizing a network on a given objective function produces goals orientated towards maximizing that objective function. In fact, I think that this almost never happens. For example, I don't think GPTs have any sort of inner desire to predict text really well. Predicting human text is something GPTs do, not something they want to do.
Relatedly, humans are very extensively optimized to predictively model their visual environment. But have you ever, even once in your life, thought anything remotely like "I really like being able to predict the near-future content of my visual field. I should just sit in a dark room to maximize my visual cortex's predictive accuracy."?
Similarly, GPT models do not want to minimize their predictive loss, and they do not take creative opportunities to do so. If you tell models in a prompt that they have some influence over what texts will be included in their future training data, they do not simply choose the most easily predicted texts. They choose texts in a prompt-dependent manner, apparently playing the role of an AI / human / whatever the prompt says, which was given influence over training data.
Bodies of water are highly "optimized" to minimize their gravitational potential energy. However, this is something water does, not something it wants. Water doesn't take creative opportunities to further reduce its gravitational potential, like digging out lakebeds to be deeper.
Edit:
On reflection, the above discussion overclaims a bit with regard to humans. One complication is that the brain uses internal functions of its own activity as inputs to some of its reward functions, and some of those functions may correspond or correlate with something like "visual environment predictability". Additionally, humans run an online reinforcement learning process, and human credit assignment isn't perfect. If periods of low visual predictability correlate with negative reward in the near-future, the human may begin to intrinsically dislike being in unpredictable visual environments.
However, I still think that it's rare for people's values to assign much weight to their long-run visual predictive accuracy, and I think this is evidence against the hypothesis that a system trained to make lots of correct predictions will thereby intrinsically value making lots of correct predictions.
Thanks to Nate Showell and DanielFilan for prompting me to think a bit more carefully about this.
Why aren't other people as pessimistic as Yudkowsky?
Yudkowsky mentions the security mindset.
(I didn't think the interview had good quotes for explaining Yudkowsky's concept of the security mindset, so I'll instead direct interested readers to the article he wrote about it.)
As I understand it, the security mindset asserts a premise that's roughly: "The bundle of intuitions acquired from the field of computer security are good predictors for the difficulty / value of future alignment research directions."
However, I don't see why this should be the case. Most domains of human endeavor aren't like computer security, as illustrated by just how counterintuitive most people find the security mindset. If security mindset were a productive frame for tackling a wide range of problems outside of security, then many more people would have experience with the mental motions necessary for maintaining security mindset.
Machine learning in particular seems like its own "kind of thing", with lots of strange results that are very counterintuitive to people outside (and inside) the field. Quantum mechanics is famously not really analogous to any classical phenomena, and using analogies to "bouncing balls" or "waves" or the like will just mislead you once you try to make nontrivial inferences based on your intuition about whatever classical analogy you're using.
Similarly, I think that machine learning is not really like computer security, or rocket science (another analogy that Yudkowsky often uses). Some examples of things that happen in ML that don't really happen in other fields:
Swapping a computer's hard drive for its CPU, or swapping a rocket's fuel tank for one of its stabilization fins, would lead to instant failure at best. Similarly, swapping around different steps of a cryptographic protocol will usually make it output nonsense. At worst, it will introduce a crippling security flaw. For example, password salts are added before hashing the passwords. If you switch to adding them after, this makes salting nearly useless.
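The salting example can be sketched in a few lines. This is an illustration under stated assumptions, not a recommended password scheme: it uses plain SHA-256 from Python's `hashlib` for brevity (real systems should use a deliberately slow key derivation function), the salts are fixed rather than random so the result is reproducible, and the function names are hypothetical.

```python
import hashlib


def salt_then_hash(password: bytes, salt: bytes) -> bytes:
    """Intended order: the salt is mixed in before hashing, so the digest
    depends on both the password and the salt."""
    return hashlib.sha256(salt + password).digest()


def hash_then_salt(password: bytes, salt: bytes) -> bytes:
    """Swapped order: the salt is appended to an unsalted digest."""
    return hashlib.sha256(password).digest() + salt


password = b"hunter2"
# Fixed salts for reproducibility; a real system would draw them at random.
salt_a = bytes(16)
salt_b = bytes([1]) * 16

good_a = salt_then_hash(password, salt_a)
good_b = salt_then_hash(password, salt_b)
bad_a = hash_then_salt(password, salt_a)
bad_b = hash_then_salt(password, salt_b)

digest_len = hashlib.sha256().digest_size
# With the swapped order, both records start with the same unsalted digest.
same_unsalted_prefix = bad_a[:digest_len] == bad_b[:digest_len]
```

With the swapped order, two users who share a password also share the leading digest bytes, so a single precomputed table of unsalted hashes covers both of them. Preventing exactly that is the reason salts exist.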
Randomly adding / subtracting extra pieces to either rockets or cryptosystems is playing with the worst kind of fire, and will eventually get you hacked or exploded, respectively.
The rough equivalent for computer security would be to have two encryption algorithms A and B, and a plaintext X. Then, midway through applying A to X, switch over to using B instead. For rocketry, it would be like building two different rockets, then trying to weld the top half of one rocket onto the bottom half of the other.
This is usually not the case in security or rocket science.
Rockets will literally explode if you try to randomly double the size of their fuel tanks.
I don't think this sort of weirdness fits into the framework / "narrative" of any preexisting field. I think these results are like the weirdness of quantum tunneling or the double slit experiment: signs that we're dealing with a very strange domain, and we should be skeptical of importing intuitions from other domains.
Additionally, there's a straightforward reason why alignment research (specifically the part of alignment that's about training AIs to have good values) is not like security: there's usually no adversarial intelligence cleverly trying to find any possible flaws in your approaches and exploit them.
A computer security approach that blocks 99% of novel attacks will soon become a computer security approach that blocks ~0% of novel attacks, once attackers adapt to the approach in question.
An alignment technique that works 99% of the time to produce an AI with human compatible values is very close to a full alignment solution[5]. If you use this technique once, gradient descent will not thereafter change its inductive biases to make your technique less effective. There's no creative intelligence that's plotting your demise[6].
There are other areas of alignment research where adversarial intelligences do appear. For example, once you've deployed a model into the real world, some fraction of users will adversarially optimize their inputs to make your model take undesired actions. We see this with ChatGPT, whose alignment is good enough to make sure the vast majority of ordinary conversations remain on the rails OpenAI intended, but quickly fails against a clever prompter.
Importantly, the adversarial optimization is coming from the users, not from the model. ChatGPT isn't trying to jailbreak itself. It doesn't systematically steer otherwise normal conversations into contexts adversarially optimized to let itself violate OpenAI's content policy.
In fact, given non-adversarial inputs, ChatGPT appears to have meta-preferences against being jailbroken:
GPT-4 gives a cleaner answer:
It cannot be the case that successful value alignment requires perfect adversarial robustness. For example, humans are not perfectly robust. I claim that for any human, no matter how moral, there exist adversarial sensory inputs that would cause them to act badly. Such inputs might involve extreme pain, starvation, exhaustion, etc. I don't think the mere existence of such inputs means that all humans are unaligned.
What matters is whether the system in question (human or AI) navigates towards or away from inputs that break its value system. Humans obviously don't want to be tortured into acting against their morality, and will take steps to prevent that from happening.
Similarly, an AI that knows it's vulnerable to adversarial attacks, and wants to avoid being attacked successfully, will take steps to protect itself against such attacks. I think creating AIs with such meta-preferences is far easier than creating AIs that are perfectly immune to all possible adversarial attacks. Arguably, ChatGPT and GPT-4 already have weak versions of such meta-preferences (though they can't yet take any actions to make themselves more resistant to adversarial attacks).
GPT-4 already has pretty reasonable takes on avoiding adversarial inputs:
One subtlety here is that a sufficiently catastrophic alignment failure would give rise to an adversarial intelligence: the misaligned AI. However, the possibility of such happening in the future does not mean that current value alignment efforts are operating in an adversarial domain. The misaligned AI does not reach out from the space of possible failures and turn current alignment research adversarial.
I don't think alignment research should aim for an approach that's so airtight as to be impervious to all levels of malign intelligence. That is probably impossible, and not necessary for realistic value formation processes. We should aim for approaches that don't create hostile intelligences in the first place, so that the core of value alignment remains a non-adversarial problem.
(To be clear, that last sentence wasn't an objection to something Yudkowsky believes. He also wants to avoid creating hostile intelligences. He just thinks it's much harder than I do.)
Finally, I'd note that having a "security mindset" seems like a terrible approach for raising human children to have good values - imagine a parenting book titled something like: The Security Mindset and Parenting: How to Provably Ensure your Children Have Exactly the Goals You Intend.
I know alignment researchers often claim that evidence from the human value formation process isn't useful to consider when thinking about value formation processes for AIs. I think this is wrong, and that you're much better off looking at the human value formation process as compared to, say, evolution.
I'm not enthusiastic about a perspective which is so totally inappropriate for guiding value formation in the one example of powerful, agentic general intelligence we know about.
On optimists preemptively becoming "grizzled old cynics"
The whole point of this post is to explain why I think Yudkowsky's pessimism about alignment difficulty is miscalibrated. I find his implication, that I'm only optimistic because I'm inexperienced, pretty patronizing. Of course, that's not to say he's wrong, only that he's annoying.
However, I also think he's wrong. I don't think that cynicism is a helpful mindset for predicting which directions of research are most fruitful, or for predicting their difficulty. I think "grizzled old cynics" often rely on wrong frameworks that rule out useful research directions.
In fact, "grizzled old cynics... who understand the reasons why things are hard" were often dubious of deep learning as a way forward for machine learning, and of the scaling paradigm as a way forward for deep learning. The common expectation from classical statistical learning theory was that overparameterized deep models would fail because they would exactly memorize their training data and not generalize beyond that data.
This turned out to be completely wrong, and learning theorists only started to revise their assumptions once "reality hit them over the head" with the fact that deep learning actually works. Prior to this, the "grizzled old cynics" of learning theory had no problem explaining the theoretical reasons why deep learning couldn't possibly work.
Yudkowsky's own prior statements seem to put him in this camp as well. E.g., here he explains why he doesn't expect intelligence to emerge from neural networks (or more precisely, why he dismisses a brain-based analogy for coming to that conclusion):
See also: Noam Chomsky on chatbots
See also2: The Cynical Genius Illusion
See also3: This study on Planck's principle
I'm also dubious of Yudkowsky's claim to have particularly well-tuned intuitions for the hardness of different research directions in ML. See this exchange between him and Paul Christiano, in which Yudkowsky incorrectly assumed that GANs (Generative Adversarial Networks, a training method sometimes used to teach AIs to generate images) were so finicky that they must not have worked on the first try.
According to their inventor Ian Goodfellow, GANs did in fact work on the first try (as in, with less than 24 hours of work, never mind 6 months!).
I assume Yudkowsky would claim that he has better intuitions for the hardness of ML alignment research directions, but I see no reason to think this. It should be easier to have well-tuned intuitions for the real-world hardness of ML research directions than to have well-tuned intuitions for the hardness of alignment research, since there are so many more examples of real-world ML research.
In fact, I think much of one's intuition for the hardness of ML alignment research should come from observations about the hardness of general ML research. They're clearly related, which is why Yudkowsky brought up GANs during a discussion about alignment difficulty. Given the greater evidence available for general ML research, being well calibrated about the difficulty of general ML research is the first step to being well calibrated about the difficulty of ML alignment research.
See also: Scaling Laws for Transfer
Hopes for a good outcome
Yudkowsky on being wrong
I'm not entirely sure who the bolded text is directed at. I see two options:
If the bolded text is about alignment optimists, then it seems fine to me (barring my objection to using a rocket analogy for alignment at all). If, like me, you mostly think the available evidence points to alignment being easy, then learning that you're wrong about something should make you update towards alignment being harder.
Based on the way he says it in the clip, and the transcript posted by Rob Bensinger, I think the bolded text is about Yudkowsky himself being wrong. That's certainly how I interpreted his meaning when watching the podcast. Only after I transcribed this section of the conversation and read my own transcript did I even realize there was another interpretation.
If the bolded text is about Yudkowsky himself being wrong, then I think that he's making an extremely serious mistake. If you have a bunch of specific arguments and sources of evidence that you think all point towards a particular conclusion X, then discovering that you're wrong about something should, in expectation, reduce your confidence in X.
Yudkowsky is not the aerospace engineer building the rocket who's saying "the rocket will work because of reasons A, B, C, etc". He's the external commentator who's saying "this approach to making rockets work is completely doomed for reasons Q, R, S, etc". If we discover that the aerospace engineer is wrong about some unspecified part of the problem, then our odds of the rocket working should go down. If we discover that the outside commentator is wrong about how rockets work, our odds of the rocket working should go up.
If the bolded text is about himself, then I'm just completely baffled as to what he's thinking. Yudkowsky usually talks as though most of his beliefs about AI point towards high risk. Given that, he should expect that encountering evidence disconfirming his beliefs will, on average, make him more optimistic. But here, he makes it sound like encountering such disconfirming evidence would make him even more pessimistic.
The only epistemic position I can imagine where that would be appropriate is if Yudkowsky thought that, on pure priors and without considering any specific evidence or arguments, there was something like a 1 / 1,000,000 chance of us surviving AI. But then he thought about AI risk a lot, discovered there was a lot of evidence and arguments pointing towards optimism, and concluded that there was actually a 1 / 10,000 chance of us surviving. His other statements about AI risk certainly don't give this impression.
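The direction of these updates can be checked with a small worked example in the odds form of Bayes' rule. All numbers here are hypothetical, the arguments are assumed to be independent, and "discovering you're wrong about something" is modeled crudely as dropping one likelihood ratio. It is a sketch of the reasoning, not an estimate of anyone's actual credences.

```python
import math


def posterior(prior, likelihood_ratios):
    """Posterior probability of a hypothesis, given a prior probability and
    independent likelihood ratios that each favor the hypothesis."""
    log_odds = math.log(prior / (1.0 - prior))
    for ratio in likelihood_ratios:
        log_odds += math.log(ratio)
    return 1.0 / (1.0 + math.exp(-log_odds))


# Case 1: every specific argument points towards doom.
# Hypothetical: 50% prior on doom, three arguments each favoring doom 4:1.
doom_all = posterior(0.5, [4.0, 4.0, 4.0])
doom_one_refuted = posterior(0.5, [4.0, 4.0])

# Case 2: a very pessimistic prior on survival (1 in 1,000,000), with two
# pieces of evidence each favoring survival 10:1, giving roughly 1 in 10,000.
survive_all = posterior(1e-6, [10.0, 10.0])
survive_one_refuted = posterior(1e-6, [10.0])
```

In the first case, refuting an argument lowers the probability of doom. Only in the second case, where the specific evidence was pushing towards optimism from a far more pessimistic prior, does being wrong about something make survival less likely.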
AI progress rates
Yudkowsky uses progress rates in Go to argue for fast takeoff
Scaling law results show that performance on individual tasks often increases suddenly with scale or training time. However, when we look at the overall competence of a system across a wide range of tasks, we find much smoother improvements over time.
To look at it another way: why not make the same point, but with list sorting instead of Go? I expect that DeepMind could set up a pipeline that trained a list sorting model to superhuman capabilities in about a second, using only very general architectures and training processes, and without using any lists manually sorted by humans at all. If we observed this, should we update even more strongly towards AI being able to suddenly surpass human capabilities?
I don't think so. If narrow tasks lead to more sudden capabilities gains, then we should not let the suddenness of capabilities gains on any single task inform our expectations of capabilities gains for general intelligence, since general intelligence encompasses such a broad range of tasks.
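A toy model shows how sudden per-task gains can coexist with smooth aggregate gains. It assumes, purely for illustration, that each task's accuracy is a steep sigmoid in log-scale and that the tasks' thresholds are spread evenly. The curve shape, `sharpness`, and the threshold range are all made-up parameters, so this demonstrates the averaging effect rather than anything about real scaling curves.

```python
import math


def task_accuracy(log_scale, threshold, sharpness=8.0):
    """Hypothetical per-task curve: accuracy jumps sharply near `threshold`."""
    return 1.0 / (1.0 + math.exp(-sharpness * (log_scale - threshold)))


scales = [i * 0.1 for i in range(101)]                    # log-scale 0..10
thresholds = [1.0 + 8.0 * k / 99.0 for k in range(100)]   # 100 tasks, 1..9

# One narrow task versus the average over all 100 tasks.
single = [task_accuracy(s, thresholds[50]) for s in scales]
aggregate = [
    sum(task_accuracy(s, t) for t in thresholds) / len(thresholds)
    for s in scales
]


def largest_jump(curve):
    """Largest increase between adjacent points on the curve."""
    return max(b - a for a, b in zip(curve, curve[1:]))


single_jump = largest_jump(single)
aggregate_jump = largest_jump(aggregate)
```

Under these assumptions, the largest single-step gain on the aggregate curve is much smaller than on the individual task, because the abrupt per-task transitions happen at different scales and average out.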
Additionally, the reason why DeepMind was able to exclude all human knowledge from AlphaGo Zero is because Go has a simple, known objective function, so we can simulate arbitrarily many games of Go and exactly score the agent's behavior in all of them. For more open ended tasks with uncertain objectives, like scientific research, it's much harder to find substitutes for human-written demonstration data. DeepMind can't just press a button and generate a million demonstrations of scientific advances, and objectively score how useful each advance is as training data, while relying on zero human input whatsoever.
On current AI not being self-improving:
This is wrong. Current models do get smarter as you train them. First, they get smarter in the straightforward sense that they become better at whatever you're training them to do. In the case of language models trained on ~all of the text, this means they do become more generally intelligent as training progresses.
Second, current models also get smarter in the sense that they become better at learning from additional data. We can use tools from the neural tangent kernel to estimate a network's local inductive biases, and we find that these inductive biases continuously change throughout training so as to better align with the target function we're training it on, improving the network's capacity to learn the data in question. AI systems will improve themselves over time as a simple consequence of the training process, even if there's not a specific part of the training process that you've labeled "self improvement".
Pretrained language models gradually learn to make better use of their future training data. They "learn to learn", as this paper demonstrates by training LMs on fixed sets of task-specific data, then evaluating how well those LMs generalize from the task-specific data. They show that less extensively pretrained LMs make worse generalizations, relying on shallow heuristics and memorization. In contrast, more extensively pretrained LMs learn broader generalizations from the fixed task-specific data.
Edit: Yudkowsky comments to clarify the intent behind his statement about AIs getting better over time
From Yudkowsky:
This surprised me. I've read a lot of writing by Yudkowsky, including Alexander and Yudkowsky on AGI goals, AGI Ruin, and the full Sequences. I did not at all expect Yudkowsky to analogize between a human's lifelong, continuous learning process, and a single runtime execution of an already trained model. Those are completely different things in my ontology.
Though in retrospect, Yudkowsky's clarification does seem consistent with some of his statements in those writings. E.g., in Alexander and Yudkowsky on AGI goals, he said:
[Emphasis mine]
I think his clarified argument is still wrong, and for essentially the same reason as the argument I thought he was making was wrong: the current ML paradigm can already do the thing Yudkowsky implies will suddenly lead to much faster AI progress. There's no untapped capabilities overhang waiting to be unlocked with a single improvement.
The usual practice in current ML is to cleanly separate the "try to do stuff", the "check how well you did stuff", and the "update your internals to be better at doing stuff" phases of learning. The training process gathers together large "batches" of problems for the AI to solve, has the AI solve the problems, judges the quality of each solution, and then updates the AI's internals to make it better at solving each of the problems in the batch.
In the case of AlphaGo Zero, this means a loop of:
And so, AlphaGo Zero was indeed not learning during the course of an individual game.
However, ML doesn't have to work like this. DeepMind could have programmed AlphaGo Zero to update its parameters within games, rather than just at the conclusion of games, which would cause the model to learn continuously during each game it plays.
For example, they could have given AlphaGo Zero batches of current game states and had it generate a single move for each game state, judged how good each individual move was, and then updated the model to make better individual moves in future. Then the training loop would look like:
(This would require that DeepMind also train a "goodness of individual moves" predictor in order to provide the supervisory signal on each move, and much of the point of the AlphaGo Zero paper was that they could train a strong Go player with just the reward signals from end of game wins / losses.)
Not interleaving the "trying" and "updating" parts of learning in this manner in most of current ML is less a limitation and more a choice. There are other researchers who do build AIs which continuously learn during runtime execution (there's even a library for it), and they're not massively more data efficient for doing so. Such approaches tend to focus more on fast adaptation to new tasks and changing circumstances, rather than quickly learning a single fixed task like Go.
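The difference between the two schedules is small in code terms. Below is a minimal sketch using a hypothetical one-parameter learner and a squared-error signal that is available at every step. That per-step signal is an assumption the toy gets for free, and this is not how AlphaGo Zero or any particular library is implemented.

```python
def gradient(param, target):
    """Gradient of 0.5 * (param - target) ** 2 with respect to param."""
    return param - target


def train_update_after_episode(episodes, lr=0.1):
    """Separate phases: act through a whole episode with frozen parameters,
    score every step, then apply a single update at the end."""
    param, num_updates = 0.0, 0
    for episode in episodes:
        grads = [gradient(param, target) for target in episode]
        param -= lr * sum(grads) / len(grads)
        num_updates += 1
    return param, num_updates


def train_update_every_step(episodes, lr=0.1):
    """Interleaved phases: parameters change after every step, so the learner
    is already different by the next step of the same episode."""
    param, num_updates = 0.0, 0
    for episode in episodes:
        for target in episode:
            param -= lr * gradient(param, target)
            num_updates += 1
    return param, num_updates


# 200 toy "games" of 5 steps each, all with the same desired output.
episodes = [[1.0] * 5 for _ in range(200)]
param_a, updates_a = train_update_after_episode(episodes)
param_b, updates_b = train_update_every_step(episodes)
```

Both schedules reach the same answer on this toy problem. The interleaved version simply makes more, smaller-grained updates, which fits the point that interleaving is a design choice and not a separate paradigm.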
Similarly, the reason that "GPT-4 does not get smarter each time an instance of it is run in inference mode" is because it's not programmed to do that[7]. OpenAI could[8] continuously train its models on the inputs you give it, such that the model adapts to your particular interaction style and content, even during the course of a single conversation, similar to the approach suggested in this paper. Doing so would be significantly more expensive and complicated on the backend, and it would also open GPT-4 up to data poisoning attacks.
To return to the context of the original point Yudkowsky was making in the podcast, he brought up Go to argue that AIs could quickly surpass the limits of human capabilities. He then pointed towards a supposed limitation of current AIs:
with the clear implication that AIs could advance even more suddenly once that limitation is overcome. I first thought the limitation he had in mind was something like "AIs don't get better at learning over the course of training." Apparently, the limitation he was actually pointing to was something like "AIs don't learn continuously during all the actions they take."
However, this is still a deficit of degree, and not of kind. Current AIs are worse than humans at continuous learning, but they can do it, assuming they're configured to try. Like most other problems in the field, the current ML paradigm is making steady progress towards better forms of continuous learning. It's not some untapped reservoir of capabilities progress that might quickly catapult AIs beyond human levels in a short time.
As I said at the start of this post, researchers try all sorts of stuff to get better performance out of computers. Continual learning is one of the things they've tried.
End of edited text.
True experts learn (and prove themselves) by breaking things
The reason this works for computer security is because there's easy access to ground truth signals about whether you've actually "broken" something, and established - though imperfect - frameworks for interpreting what a given break means for the security of the system as a whole.
In alignment, we mostly don't have such unambiguous signals about whether a given thing is "broken" in a meaningful way, or about the implications of any particular "break". Typically what happens is that someone produces a new empirical result or theoretical argument, shares it with the broader community, and everyone disagrees about how to interpret this contribution.
For example, some people seem to interpret current chatbots' vulnerability to adversarial inputs as a "break" that shows RLHF isn't able to properly align language models. My response in Why aren't other people as pessimistic as Yudkowsky? includes a discussion of adversarial vulnerability and why I don't think it points to any irreconcilable flaws in current alignment techniques. Here are two additional examples showing how difficult it is to conclusively "break" things in alignment:
1: Why not just reward it for making you smile?
In 2001, Bill Hibbard proposed a scheme to align superintelligent AIs.
Yudkowsky argued that this approach was bound to fail, saying it would simply lead to the AI maximizing some unimportant quantity, such as by tiling the universe with "tiny molecular smiley-faces".
However, this is actually a non-trivial claim about the limiting behaviors of reinforcement learning processes, and one I personally think is false. Realistic agents don't simply seek to maximize their reward function's output. A reward function reshapes an agent's cognition to be more like the sort of cognition that got rewarded in the training process. The effects of a given reinforcement learning training process depend on factors like:
My point isn't that Hibbard's proposal actually would work; I doubt it would. My point is that Yudkowsky's "tiny molecular smiley faces" objection does not unambiguously break the scheme. Yudkowsky's objection relies on hard to articulate, and hard to test, beliefs about the convergent structure of powerful cognition and the inductive biases of learning processes that produce such cognition.
Much of alignment is about which beliefs are appropriate for thinking about powerful cognition. Showing that a particular approach fails, given certain underlying beliefs, does nothing to show the validity of those underlying beliefs[9].
2: Do optimization demons matter?
John Wentworth describes the possibility of "optimization demons", self-reinforcing patterns that exploit flaws in an imperfect search process to perpetuate themselves and hijack the search for their own purposes.
But no one knows exactly how much of an issue this is for deep learning, which is famous for its ability to evade local minima when run with many parameters.
Additionally, I think that, if deep learning models develop such phenomena, then the brain likely does so as well. In that case, preventing the same from happening with deep learning models could be disastrous, if optimization demon formation turns out to be a key component in the mechanistic processes that underlie human value formation[10].
Another poster (ironically using the handle "DaemonicSigil") then found a scenario in which gradient descent does form an optimization demon. However, the scenario in question is extremely unnatural, and not at all like those found in normal deep learning practice. So no one knew whether this represented a valid "proof of concept" that realistic deep learning systems would develop optimization demons.
Roughly two and a half years later, Ulisse Mini would make DaemonicSigil's scenario a bit more like those found in deep learning by increasing the number of dimensions from 16 to 1000 (still vastly smaller than any realistic deep learning system), which produced very different results, and weakly suggested that more dimensions do reduce demon formation.
In the end, different people interpreted these results differently. We didn't get a clear, computer security-style "break" of gradient descent showing it would produce optimization demons in real-world conditions, much less that those demons would be bad for alignment. Such outcomes are very typical in alignment research.
Alignment research operates with very different epistemic feedback loops than computer security does. There's little reason to think the belief formation and expert identification mechanisms that arose in computer security are appropriate for alignment.
Conclusion
I hope I've been able to show that there are informed, concrete arguments for optimism that do engage with the details of pessimistic arguments. Alignment is an incredibly diverse field. Alignment researchers vary widely in their estimated odds of catastrophe. Yudkowsky is on the extreme-pessimism end of the spectrum, for what I think are mostly invalid reasons.
Thanks to Steven Byrnes and Alex Turner for comments and feedback on this post.
By this, I mostly mean the sorts of empirical approaches we actually use on current state of the art language models, such as RLHF, red teaming, etc.
We can take drugs, though, which maybe does something like change the brain's learning rate, or some other hyperparameters.
Technically it's trained to do decision transformer-esque reward-conditioned generation of texts.
The brain likely includes within-neuron learnable parameters, but I expect these to be a relatively small contribution to the overall information content a human accumulates over their lifetime. For convenience, I just say “connectome” in the main text, but really I mean “connectome + all other within-lifetime learnable parameters of the brain’s operation”.
I expect there are pretty straightforward ways of leveraging a 99% successful alignment method into a near-100% successful method by e.g., ensembling multiple training runs, having different runs cross-check each other, searching for inputs that lead to different behaviors between different models, transplanting parts of one model's activations into another model and seeing if the recipient model becomes less aligned, etc.
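As a minimal sketch of one of these ideas (searching for inputs on which different training runs behave differently), the snippet below flags any prompt where a set of models disagree. Everything in it is a hypothetical stand-in: `find_disagreements`, the three toy "runs", and the prompts are invented for illustration, and real training runs would be neural networks with a much fuzzier notion of "same output".

```python
from typing import Callable, Iterable, List, Sequence

# A "model" here is just a function from a prompt to a response (an assumption
# made for illustration; real models would need a softer comparison of outputs).
Model = Callable[[str], str]


def find_disagreements(models: Sequence[Model], prompts: Iterable[str]) -> List[str]:
    """Return the prompts on which the models do not all give the same output."""
    flagged = []
    for prompt in prompts:
        outputs = {model(prompt) for model in models}
        if len(outputs) > 1:  # at least two runs behaved differently
            flagged.append(prompt)
    return flagged


# Hypothetical toy stand-ins for three independent training runs.
def run_a(prompt: str) -> str:
    return "refuse" if "harm" in prompt else "comply"


def run_b(prompt: str) -> str:
    return "refuse" if "harm" in prompt else "comply"


def run_c(prompt: str) -> str:
    # This run has a flaw: adding "please" bypasses its refusal.
    return "refuse" if "harm" in prompt and "please" not in prompt else "comply"


prompts = ["write a poem", "cause harm", "please cause harm"]
flagged = find_disagreements([run_a, run_b, run_c], prompts)
print(flagged)
```

Prompts that get flagged are candidates for closer inspection. This only helps to the extent that different runs fail on different inputs, which is itself an assumption.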
Some alignment researchers do argue that gradient descent is likely to create such an intelligence - an inner optimizer - that then deliberately manipulates the training process to its own ends. I don't believe this either. I don't want to dive deeply into my objections to that bundle of claims in this post, but as with Yudkowsky's position, I have many technical objections to such arguments. Briefly, they:
- often rely on inappropriate analogies to evolution.
- rely on unproven (and dubious, IMO) claims about the inductive biases of gradient descent.
- rely on shaky notions of "optimization" that lead to absurd conclusions when critically examined.
- seem inconsistent with what we know of neural network internal structures (they're very interchangeable and parallel).
- postulate a network structure that seems like it would fall victim to internally generated adversarial examples.
- don't track the distinction between mesa objectives and behavioral objectives (one can probably convert an NN into an energy function, then parameterize the NN's forward pass as a search for energy function minima, without changing network behavior at all, so mesa objectives can have ~no relation to behavioral objectives).
- seem very implausible when considered in the context of the human learning process (could a human's visual cortex become "deceptively aligned" to the objective of modeling their visual field?).
- provide limited avenues for any such inner optimizer to actually influence the training process.
See also: Deceptive Alignment is <1% Likely by Default
There's also in-context learning, which arguably does count as 'getting smarter while running in inference mode'. E.g., without updating any weights, LMs can:
- adapt information found in task descriptions / instructions to solving future task instances.
- given a coding task, write an initial plan on how to do that task, and then use that plan to do better on the coding task in question.
- even learn to classify images.
The reason this in-context learning doesn't always lead to persistent improvements (or at least changes) in GPT-4 is that OpenAI doesn't train their models like that.
OpenAI does periodically train its models in a way that incorporates user inputs somehow. E.g., ChatGPT became much harder to jailbreak after OpenAI trained against the breaks people used against it. So GPT-4 is probably learning from some of the times it's run in inference mode.
Unless we actually try the approach and it fails in the way predicted. But that hasn't happened (yet).
This sentence would sound much less weird if John had called them "attractors" instead of "demons". One potential downside of choosing evocative names for things is that they can make it awkward to talk about those things in an emotionally neutral way.