Recommended Sequences

AGI safety from first principles
Embedded Agency
2022 MIRI Alignment Discussion

Recent Discussion

Thanks to Chris Scammell, Adam Shimi, Lee Sharkey, Evan Hubinger, Nicholas Dupuis, Leo Gao, Johannes Treutlein, and Jonathan Low for feedback on drafts.

This work was carried out while at Conjecture.

"Moebius illustration of a simulacrum living in an AI-generated story discovering it is in a simulation" by DALL-E 2


TL;DR: Self-supervised learning may create AGI or its foundation. What would that look like?

Unlike the limit of RL, the limit of self-supervised learning has received surprisingly little conceptual attention, and recent progress has made deconfusion in this domain more pressing.

Existing AI taxonomies either fail to capture important properties of self-supervised models or lead to confusing propositions. For instance, GPT policies do not seem globally agentic, yet can be conditioned to behave in goal-directed ways. This post describes a frame that...

It seems as a result of this post, many people are saying that LLMs simulate people and so on. But I'm not sure that's quite the right frame. It's natural if you experience LLMs through chat-like interfaces, but from playing with them in a more raw form, like the RWKV playground, I get a different impression. For example, if I write something that sounds like the start of a quote, it'll continue with what looks like a list of quotes from different people. Or if I write a short magazine article, it'll happily tack on a publication date and "All rights reser... (read more)


I recently watched Eliezer Yudkowsky's appearance on the Bankless podcast, where he argued that AI was nigh-certain to end humanity. Since the podcast, some commentators have offered pushback against the doom conclusion. However, one sentiment I saw was that optimists tended not to engage with the specific arguments pessimists like Yudkowsky offered. 

Economist Robin Hanson points out that this pattern is very common for small groups which hold counterintuitive beliefs: insiders develop their own internal language, which skeptical outsiders usually don't bother to learn. Outsiders then make objections that focus on broad arguments against the belief's plausibility, rather than objections that focus on specific insider arguments.

As an AI "alignment insider" whose current estimate of doom is around 5%, I wrote this post to explain some of my many...

5Adele Lopez2d
This does not seem like it counts as "publicly humiliating" in any way? Rude, sure, but that's quite different.
19Matthew "Vaniver" Gray2d
I have a lot of responses to specific points; I'm going to make them as children comment to this comment.
3Matthew "Vaniver" Gray2d
I think it would probably be strange for the visual field to do this. But I think it's not that uncommon for other parts of the brain to do this; higher level, most abstract / "psychological" parts that have a sense of how things will affect their relevance to future decision-making. I think there are lots of self-perpetuating narratives that it might be fair to call 'deceptively aligned' when they're maladaptive. The idea of metacognitive blindspots [] also seems related. 
4Matthew "Vaniver" Gray2d
Also relevant is Are minimal circuits daemon-free? [] and Are minimal circuits deceptive? [].  I agree no one knows how much of an issue this will be for deep learning. I think the brain obviously has such phenomena, and societies made up of humans also obviously have such phenomena. I think it is probably not adaptive (optimization demons are more like 'cognitive cancer' than 'part of how values form', I think, but in part that's because the term comes with the disapproval built in).
4Matthew "Vaniver" Gray2d
That is also how I interpreted it. I think Yudkowsky is making a different statement. I agree it would be bizarre for him to be saying "if I were wrong, it would only mean I should have been more confident!" I think he is (inside of the example). He's saying "suppose an engineer is wrong about how their design works. Is it more likely that the true design performs better along multiple important criteria than expectation, or that the design performs worse (or fails to function at all)?" Note that 'expectation' is referring to the confidence level inside an argument [], but arguments aren't Bayesians; it's the outside agent that shouldn't be expected to predictably update. Another way to put this: does the engineer expect to be disappointed, excited, or neutral if the design doesn't work as planned? Typically, disappointed, implying the plan is overly optimistic compared to reality. If this weren't true--if engineers were calibrated or pessimistic--then I think Yudkowsky would be wrong here (and also probably have a different argument to begin with).
1Matthew "Vaniver" Gray2d
I think I agree with this point but want to explicitly note the switch from the phrase 'AI alignment research' to 'ML alignment research'; my model of Eliezer thinks the second is mostly a distraction from the former, and if you think they're the same or interchangeable that seems like a disagreement. [For example, I think ML alignment research includes stuff like "will our learned function be robust to distributional shift in the inputs?" and "does our model discriminate against protected classes?" whereas AI alignment research includes stuff like "will our system be robust to changes in the number of inputs?" and "is our model deceiving us about its level of understanding?". They're related in some ways, but pretty deeply distinct.]
7Matthew "Vaniver" Gray2d
I do think this is a point against Yudkowsky. That said, my impression is that GANs are finicky, and I heard rumors that many people tried similar ideas and failed to get it to work before Goodfellow knocked it out of the park. If people were encouraged to publish negative results, we might have a better sense of the actual landscape here, but I think a story of "Goodfellow was unusually good at making GANs and this is why he got it right on his first try" is more compelling to me than "GANs were easy actually".
9Matthew "Vaniver" Gray2d
I think you're basically misunderstanding and misrepresenting Yudkowsky's argument from 2008. He's not saying "you can't make an AI out of neural networks", he's saying "your design sharing a single feature with the brain does not mean it will also share the brain's intelligence." As well, I don't think he's arguing about how AI will actually get made; I think he's mostly criticizing the actual AGI developers/enthusiasts that he saw at the time (who were substantially less intelligent and capable than the modern batch of AGI developers). I think that post has held up pretty well. The architectures used to organize neural networks are quite important, not just the base element. Someone whose only plan was to make their ANN wide would not reach AGI; they needed to do something else, that didn't just rely on surface analogies.
8Matthew "Vaniver" Gray2d
Do you have kids, or any experience with them? (There are three small children in the house I live in.) I think you might want to look into childproofing [], and meditate on its connection to security mindset. Yes, this isn't necessarily related to the 'values' part, but for that I would suggest things like Direct Instruction [], which involves careful curriculum design to generate lots of examples so that students will reliably end up inferring the correct rule. In short, I think the part of 'raising children' which involves the kids being intelligent as well and independently minded does benefit from security mindset.   As you mention in the next paragraph, this is a long-standing disagreement; I might as well point at the discussion of the relevance of raising human children to instilling goals in an AI in The Detached Lever Fallacy []. The short summary of it is that humans have a wide range of options for their 'values', and are running some strategy of learning from their environment (including their parents and their style of raising children) which values to adopt. The situation with AI seems substantially different--why make an AI design that chooses whether to be good or bad based on whether you're nice to it, when you could instead have it choose to always be good? [Note that this is distinct from "always be nice"; you could decide that your good AI can tell users that they're being bad users!]
2Matthew "Vaniver" Gray2d
It seems like the argument structure here is something like: 1. This requirement is too stringent for humans to follow 2. Humans have successful value alignment 3. Therefore this requirement cannot be necessary for successful value alignment. I disagree with point 2, tho; among other things, it looks to me like some humans are on track to accidentally summoning a demon that kills both me and them, which I expect they would regret after-the-fact if they had the chance to. So any reasoning that's like "well so long as it's not unusual we can be sure it's safe" runs into the thing where we're living in the acute risk period. The usual is not safe! This seems definitely right to me. An expectation I have is that this will also generate resistance to alignment techniques / control by its operators, which perhaps complicates how benign this is.   [FWIW I also don't think we want an AI that's perfectly robust to all possible adversarial attacks; I think we want one that's adequate to defend against the security challenges it faces, many of which I expect to be internal. Part of this is because I'm mostly interested in AI planning systems able to help with transformative changes to the world instead of foundational models used by many customers for small amounts of cognition, which are totally different business cases and have different security problems.]
6Matthew "Vaniver" Gray2d
This seems... like a correct description but it's missing the spirit? Like the intuitions are primarily about "what features are salient" and "what thoughts are easy to think." Roughly, the core distinction between software engineering and computer security is whether the system is thinking back. Software engineering typically involves working with dynamic systems and thinking optimistically how the system could work. Computer security typically involves working with reactive systems and thinking pessimistically about how the system could break. I think it is an extremely basic AI alignment skill to look at your alignment proposal and ask "how does this break?" or "what happens if the AI thinks about this?". What's your story for specification gaming []? I must admit some frustration, here; in this section it feels like your point is "look, computer security is for dealing with intelligence as part of your system. But the only intelligence in our system is sometimes malicious users!" In my world, the whole point of Artificial Intelligence was the Intelligence. The call is coming from inside the house! Maybe we just have some linguistic disagreement? "Sure, computer security is relevant to transformative AI but not LLMs"? If so, then I think the earlier point [] about whether capabilities enhancements break alignment techniques is relevant: if these alignment techniques work because the system isn't thinking about them, then are you confident they will continue to work when the system is thinking about them?
16Matthew "Vaniver" Gray2d
uh   is your proposal "use the true reward function, and then you won't get misaligned AI"? These three paragraphs feel incoherent to me. The human eating ice cream and activating their reward circuits is exactly what you would expect under the current paradigm. Yudkowsky thinks this leads to misalignment; you agree. He says that you need a new paradigm to not have this problem. You disagree because you assume it's possible under the current paradigm. If so, how? Where's the system that, on eating ice cream, realizes "oh no! This is a bad action that should not receive reward!" and overrides the reward machinery? How was it trained?  I think when Eliezer says "we need an entirely new paradigm", he means something like "if we want a decision-making system that makes better decisions that a RL agent, we need agent-finding machinery that's better than RL." Maybe the paradigm shift is small (like from RL without experience replay to RL with), or maybe the paradigm shift is large (like from policy-based agents to plan-based agents).  He's not saying the failures of RL are a surprise from the theory of RL. Of course you can explain it using the standard language of RL! He's saying that unless you can predict RL's failures from the inside, the RL agents that you make are going to actually make those mistakes in reality.
5Logan Riggs Smith1d
My shard theory inspired story is to make an AI that: 1. Has a good core of human values (this is still hard) 2. Can identify when experiences will change itself to lead to less of the initial good values. (This is the meta-preferences point with GPT-4 sort of expressing it would avoid jail break inputs) Then the model can safely scale. This doesn’t require having the true reward function (which I imagine to be a giant lookup table created by Omega), but some mech interp and understanding its own reward function. I don’t expect this to be an entirely different paradigm; I even think current methods of RLHF might just naively work. Who knows? (I do think we should try to figure it out though! I do have greater uncertainty and less pessimism) Analogously, I do believe I do a good job of avoiding value-destroying inputs (eg addicting substances), even though my reward function isn’t as clear and legible as what our AI’s will be AFAIK.

Then the model can safely scale.

If there are experiences which will change itself which don't lead to less of the initial good values, then yeah, for an approximate definition of safety. You're resting everything on the continued strength of this model as capabilities increase, and so if it fails before you top out the scaling I think you probably lose. 

FWIW I don't really see your description as, like, a specific alignment strategy so much as the strategy of "have an alignment strategy at all". The meat is all in 1) how you identify the core of human... (read more)

7Matthew "Vaniver" Gray2d
I don't yet understand why you put misgeneralized in scare quotes, or whether you have a story for why it's a misgeneralization instead of things working as expected. I think your story for why humans like ice cream makes sense, and is basically the story Yudkowsky would tell too, with one exception: "such food sources" feels a little like it's eliding the distinction between "high-quality food sources of the ancestral environment" and "foods like ice cream"; the training dataset couldn't differentiate between functions f and g but those functions differ in their reaction to the test set (ice cream). Yudkowsky's primary point with this section, as I understand it, is that even if you-as-evolution know that you want g the only way you can communicate that under the current learning paradigm is with training examples, and it may be non-obvious to which functions f need to be excluded.
2Matthew "Vaniver" Gray2d
I think there's a deep disconnect here on whether interpolation is enough or whether we need extrapolation. The point of the strawberry alignment problem is "here's a clearly understandable specification of a task that requires novel science and engineering to execute on. Can you do that safely?". If your ambitions are simply to have AI customer service bots, you don't need to solve this problem. If your ambitions include cognitive megaprojects which will need to be staffed at high levels by AI systems, then you do need to solve this problem. More pragmatically, if your ambitions include setting up some sort of system that prevents people from deploying rogue AI systems while not dramatically curtailing Earth's potential, that isn't a goal that we have many examples of people executing on. So either we need to figure it out with humans or, if that's too hard, create an AI system capable of figuring it out (which probably requires an AI leader instead of an AI assistant).
4Matthew "Vaniver" Gray2d
I agree with your picture of how manifolds work; I don't think it actually disagrees all that much with Yudkowsky's. That is, the thing where all humans are basically the same make and model of car, running the same brand of engine, painted different colors is the claim that the intrinsic dimension of human minds is pretty small. (Taken literally, it's 3, for the three dimensions of color-space.) And so if you think there are, say, 40 intrinsic dimensions to mind-space, and humans are fixed on 37 of the points and variable on the other 3, well, I think we have basically the Yudkowskian picture. (I agree if Yudkowsky's picture was that there were 40M dimensions and humans varied on 3, this would be comically wrong, but I don't think this is what he's imagining for that argument.)
2Matthew "Vaniver" Gray2d
I think this is what Yudkowsky thinks also? (As for why it was relevant to bring up, Yudkowsky was answering the host's question of "How is superintelligence different than general intelligence?")
2Matthew "Vaniver" Gray2d
Part of this is just straight disagreement, I think; see So8res's Sharp Left Turn [] and follow-on discussion []. But for the rest of it, I don't see this as addressing the case for pessimism, which is not problems from the reference class that contains "the LLM sometimes outputs naughty sentences" but instead problems from the reference class that contains "we don't know how to prevent an ontological collapse, where meaning structures constructed under one world-model compile to something different under a different world model." Or, like, once LLMs gain the capability to design proteins (because you added in a relevant dataset, say), do you really expect the 'helpful, harmless, honest' alignment techniques that were used to make a chatbot not accidentally offend users to also work for making a biologist-bot not accidentally murder patients? Put another way, I think new capabilities advances reveal new alignment challenges and unless alignment techniques are clearly cutting at the root of the problem, I don't expect that they will easily transfer to those new challenges.
2G Gordon Worley III3d
This post brought to mind a thought: I actually don't care very much about arguments about how likely doom is and how pessimistic or optimistic to be since they are irrelevant, to my style of thinking, for making decisions related to building TAI. Instead, I mostly focus on downside risks and avoiding them because they are so extreme, which makes me look "pessimistic" but actually I'm just trying to minimize the risk of false positives in building aligned AI []. Given this framing, it's actually less important, in most cases, to figure out how likely something is, and more important to figure out how likely doom is if we are wrong, and carefully navigate the path that minimizes the risk of doom, regardless of what the assessment of doom is.
3Steve Byrnes3d
I agree with OP that this rocket analogy from Eliezer is a bad analogy, AFAICT. If someone is trying to assess the difficulty of solving a technical problem (e.g. building a rocket) in advance, then they need to brainstorm potential problems that might come up, and when they notice one, they also need to brainstorm potential technical solutions to that problem. For example “the heat of reentry will destroy the ship” is a potential problem, and “we can invent new and better heat-resistant tiles / shielding” is a potential solution to that problem. During this process, I don’t think it’s particularly unusual for the person to notice a technical problem but overlook a clever way to solve that problem. (Maybe they didn’t recognize the possibility of inventing new super-duper-heat-resistant ceramic tiles, or whatever.) And then they would wind up overly pessimistic.
4Matthew "Vaniver" Gray2d
I think this isn't the claim; I think the claim is that it would be particularly unusual for someone to overlook that they're accidentally solving a technical problem. (It would be surprising for Edison to not be thinking hard about what filament to use and pick tungsten; in actual history, it took decades for that change to be made.)
2Steve Byrnes2d
Sure, but then the other side of the analogy doesn’t make sense, right? The context was: Eliezer was talking in general terms about the difficulty of the AGI x-risk problem and whether it’s likely to be solved. (As I understand it.) [Needless to say, I’m just making a narrow point that it’s a bad analogy. I’m not arguing that p(doom) is high or low, I’m not saying this is an important & illustrative mistake (talking on the fly is hard!), etc.]
1Matthew "Vaniver" Gray2d
So I definitely think that's something weirdly unspoken about the argument; I would characterize it as Eliezer saying "suppose I'm right and they're wrong; all this requires is things to be harder than people think, which is usual. Suppose instead that I'm wrong and they're right; this requires things to be easier than people think, which is unusual." But the equation of "people" and "Eliezer" is sort of strange; as Quintin notes, it isn't that unusual for outside observers to overestimate difficulty, and so I wish he had centrally addressed the the reference class tennis game; is the expertise "getting AI systems to be capable" or "getting AI systems to do what you want"?
7Steve Byrnes3d
I narrowly agree with most of this, but I tend to say the same thing with a very different attitude: I would say: “Gee it would be super cool if we could decide a priori what we want the AGI to be trying to do, WITH SURGICAL PRECISION. But alas, that doesn’t seem possible, at least not according to any method I know of.” I disagree with you in your apparent suggestion that the above paragraph is obvious or uninteresting, and also disagree with your apparent suggestion that “setting an AGI’s motivations with surgical precision” is such a dumb idea that we shouldn’t even waste one minute of our time thinking about whether it might be possible to do that. For example, people who are used to programming almost any other type of software have presumably internalized the idea that the programmer can decide what the software will do with surgical precision. So it's important to spread the idea that, on current trends, AGI software will be very different from that. BTW I do agree with you that Eliezer’s interview response seems to suggest that he thinks aligning an AGI to “basic notions of morality” is harder and aligning an AGI to “strawberry problem” is easier. If that’s what he thinks, it’s at least not obvious to me. (see follow-up [])
3Matthew "Vaniver" Gray2d
My sense (which I expect Eliezer would agree with) is that it's relatively easy to get an AI system to imitate the true underlying 'basic notions of morality', to the extent humans agree on that, but that this doesn't protect you at all as soon as you want to start making large changes, or as soon as you start trying to replace specialist sectors of the economy. (A lot of ethics for doctors has to do with the challenges of simultaneously being a doctor and a human; those ethics will not necessarily be relevant for docbots, and the question of what they should be instead is potentially hard to figure out.) So if you're mostly interested in getting out of the acute risk period, you probably need to aim for a harder target.
3Steve Byrnes2d
Hmm, on further reflection, I was mixing up * Strawberry Alignment (defined as: make an AGI that is specifically & exclusively motivated to duplicate a strawberry without destroying the world), versus * “Strawberry Problem” (make an AGI that in fact duplicates a strawberry without destroying the world, using whatever methods / motivations you like). Eliezer definitely talks about the latter. I’m not sure Eliezer has ever brought up the former? I think I was getting that from the OP (Quintin), but maybe Quintin was just confused (and/or Eliezer misspoke). Anyway, making an AGI that can solve the strawberry problem is tautologically no harder than making an AGI that can do advanced technological development and is motivated by human norms / morals / whatever, because the latter set of AGIs is a subset of the former. Sorry. I crossed out that paragraph.  :)
9Vladimir Slepnev3d
I think there's a mistake here which kind of invalidates the whole post. If we don't reward our AI for taking bad actions within the training distribution, it's still very possible that in the future world, looking quite unlike the training distribution, the AI will be able to find such an action. Same as ice cream wasn't in evolution's training distribution for us, but then we found it anyway.
DIFFICULTY OF ALIGNMENT I find the prospect of training on model on just 40 parameters to be very interesting. Almost unbelievable, really, to the point where I'm tempted to say: "I notice that I'm confused". Unfortunately, I don't have access to the paper and it doesn't seem to be on sci-hub, so I haven't been able to resolve my confusion. Basically, my general intuition is that each parameter in a network probably only contributes a few bits of optimization power. It can be set fairly high, fairly low, or in between. So if you just pulled 40 random weights from the network, that's maybe 120 bits of optimization power. Which might be enough for MNIST, but probably not for anything more complicated. So I'm guessing that most likely a bunch of other optimization went into choosing exactly which 40 dimensional subspace we should be using. Of course, if we're allowed to do that then we could even do it with a 1 dimensional subspace: Just pick the training trajectory as your subspace! Generally with the mindspace thing, I don't really think about the absolute size or dimension of mindspace, but the relative size of "things we could build" and "things we could build that would have human values". This relative size is measured in bits. So the intuition here would be that it takes a lot of bits to specify human values, and so the difference in size between these two is really big. Now maybe if you're given Common Crawl, it takes fewer bits to point to human values within that big pile of information. But it's probably still a lot of bits, and then the question is how do you actually construct such a pointer? DEMONS IN GRADIENT DESCENT I agree that demons are unlikely to be a problem, at least for basic gradient descent. They should have shown up by now in real training runs, otherwise. I do still think gradient descent is a very unpredictable process (or to put it more precisely: we still don't know how to predict gradient descent very well), and where that shows up
For the 40 parameters thing, this link should work []. See also this earlier paper [].
Thanks for the link! Looks like they do put optimization effort into choosing the subspace, but it's still interesting that the training process can be factored into 2 pieces like that.
BTW: the way I found that first link was by searching the title on google scholar, finding the paper, and clicking "All 5 versions" below (it's right next to "Cited by 7" and "Related articles"). That brought me to a bunch of versions, one of which was a seemingly-ungated PDF. This will probably frequently work, because AI researchers usually make their papers publicly available (at least in pre-print form).
6Eliezer Yudkowsky3d
This is kinda long.  If I had time to engage with one part of this as a sample of whether it holds up to a counterresponse, what would be the strongest foot you could put forward? (I also echo the commenter who's confused about why you'd reply to the obviously simplified presentation from an off-the-cuff podcast rather than the more detailed arguments elsewhere.)
1Matthew "Vaniver" Gray2d
FWIW, I thought the bit about manifolds in The difficulty of alignment [] was the strongest foot forward, because it paints a different detailed picture than your description that it's responding to. That said, I don't think Quintin's picture obviously disagrees with yours (as discussed in my response over here []) and I think you'd find disappointing him calling your description extremely misleading while not seeming to correctly identify the argument structure and check whether there's a related one that goes thru on his model.
This response is enraging. Here is someone who has attempted to grapple with the intellectual content of your ideas and your response is "This is kinda long."? I shouldn't be that surprised because, IIRC, you said something similar in response to Zack Davis' essays on the Map and Territory distinction, but that's ancillary and AI is core to your memeplex. I have heard repeated claims that people don't engage with the alignment communities' ideas (recent example from yesterday []). But here is someone who did the work. Please explain why your response here does not cause people to believe there's no reason to engage with your ideas because you will brush them off. Yes, nutpicking e/accs on Twitter is much easier and probably more hedonic, but they're not convincible and Quinton here is.
1Matthew "Vaniver" Gray2d
I have attempted to respond to the whole post over here [].
8Eliezer Yudkowsky2d
Choosing to engage with an unscripted unrehearsed off-the-cuff podcast intended to introduce ideas to a lay audience, continues to be a surprising concept to me.  To grapple with the intellectual content of my ideas, consider picking one item from "A List of Lethalities" and engaging with that.
10Alex Turner2d
Here are some of my disagreements with List of Lethalities []. I'll quote item one:
I think I've been in situations where I've been disoriented by a bunch of random stuff happening and wished that less of it was happening so that I could get a better handle on stuff. An example I vividly recall was being in a history class in high school and being very bothered by the large number of conversations happening around me.
This comment doesn't really engage much with your post - there's a lot there and I thought I'd pick one point to get a somewhat substantive disagreement. But I ended up finding this question and thought that I should answer it.
To tie up this thread: I started writing a more substantive response to a section but it took a while and was difficult and I then got invited to dinner, so probably won't get around to actually writing it.
20Alex Turner3d
Some reasoning which led Eliezer to dismiss neural networks,[1] seems similar to some reasoning which he deploys in his modern alignment arguments.  Compare his incorrect mockery from 2008: with his claim in Alexander and Yudkowsky on AGI goals []: I agree that 100 quadrillion artificial neurons + loss function won't get you a literal human, for trivial reasons. The relevant point is his latter claim: "in particular with respect to "learn 'don't steal' rather than 'don't get caught'."" I think this is a very strong conclusion, relative to available data. I think that a good argument for it would require a lot of technical, non-analogical reasoning about the inductive biases of SGD on large language models. But, AFAICT, Eliezer rarely deploys technical reasoning that depends on experimental results or ML theory. He seems to prefer strongly-worded a priori arguments that are basically analogies.  So, here are two claims which seem to echo the positions Eliezer advances: 1. "A large ANN doesn't look enough like a human brain to develop intelligence." -> wrong (see GPT-4) 2. "A large ANN doesn't look enough like a human brain to learn 'don't steal' rather than 'don't get caught'" -> (not yet known) I perceive a common thread of  But why is this true? You can just replace "human intelligence" with "avian flight", and the argument might sound similarly plausible a priori.  ETA: The invalid reasoning step is in the last clause ("to get a mind..."). If design X exhibits property P, that doesn't mean that design Y must be similar to X in order to exhibit property P.  -------------------------------------------------------------------------------- ETA: Part of this comment was about EY dismissing neural networks in 2008. It seems to me that the cited writing supports that interpretation, and it's still my best guess (see also DirectedEvolution's comment [https://www.less
9Alex Turner1d
EY was not in fact bullish on neural networks leading to impressive AI capabilities. Eliezer said this directly []: I think this is strong evidence for my interpretation of the quotes in my parent comment: He's not just mocking the local invalidity of reasoning "because humans have lots of neurons, AI with lots of neurons -> smart", he's also mocking neural network-driven hopes themselves.  1. ^ More quotes from Logical or Connectionist AI? []: In this passage, he employs well-scoped and well-hedged language via "this particular raw fact." I like this writing because it points out an observation, and then what inferences (if any) he draws from that observation. Overall, his tone is negative on neural networks. Let's open up that "Outside the Box" box: This is more incorrect mockery.
I don't really get your comment. Here are some things I don't get: * In "Failure By Analogy" and "Surface Analogies and Deep Causes", the point being made is "X is similar in aspects A to thing Y, and X has property P" does not establish "Y has property P". The reasoning he instead recommends is to reason about Y itself, and sometimes it will have property P. This seems like a pretty good point to me. * Large ANNs don't appear to me to be intelligent because of their similarity to human brains - they appear to me to be intelligent because they're able to be tuned to accurately predict simple facts about a large amount of data that's closely related to human intelligence, and the algorithm they get tuned to seems to be able to be repurposed for a wide variety of tasks (probably related to the wide variety of data that was trained on). * Airplanes don't fly like birds, they fly like airplanes. So indeed you can't just ape one thing about birds[*] to get avian flight. I don't think this is a super revealing technicality but it seemed like you thought it was important. * Maybe most importantly I don't think Eliezer thinks you need to mimic the human brain super closely to get human-like intelligence with human-friendly wants. I think he instead thinks you need to mimic the human brain super closely to validly argue by analogy from humans. I think this is pretty compatible with this quote from "Failure By Analogy" (it isn't exactly implied by it, but your interpretation isn't either): * Matters would be different if he said in the quotes you cite "you only get these human-like properties by very exactly mimicking the human brain", but he doesn't. [*] I've just realized that I can't name a way in which airplanes are like birds in which they aren't like humans. They have things sticking out their sides? So do humans, they're called arms. Maybe the cross-sectional shape of the wings are similar? I guess they b
7Alex Turner3d
Edited to modify confidences about interpretations of EY's writing / claims. This is a valid point, and that's not what I'm critiquing in that portion of the comment. I'm critiquing how -- on my read -- he confidently dismisses ANNs; in particular, using non-mechanistic reasoning which seems similar to some of his current alignment arguments. On its own, this seems like a substantial misprediction for an intelligence researcher in 2008 (especially one who claims to have figured out most things in modern alignment [], by a very early point in time -- possibly that early, IDK). Possibly the most important prediction to get right, to date. Indeed, you can't ape one thing. But that's not what I'm critiquing. Consider the whole transformed line of reasoning: The important part is the last part. It's invalid. Finding a design X which exhibits property P, doesn't mean that for design Y to exhibit property P, Y must be very similar to X.  Which leads us to: Reading the Alexander/Yudkowsky debate, I surprisingly haven't ruled out this interpretation, and indeed suspect he believes some forms of this (but not others). Didn't he? He at least confidently rules out a very large class of modern approaches.
7Rafael Harth3d
I also don't really get your position. You say that, but you haven't shown this! * In Surface Analogies and Deep Causes [], I read him as saying that neural networks don't automatically yield intelligence just because they share surface similarities with the brain. This is clearly true; at the very least, using token-prediction (which is a task for which (a) lots of training data exist and (b) lots of competence in many different domains is helpful) is a second requirement. If you take the network of GPT-4 and trained it to play chess instead, you won't get something with cross-domain competence. * In Failure by Analogy [] he makes a very similar abstract point -- and wrt to neural networks in particular, he says that the surface similarity to the brain is a bad reason to be confident in them. This also seems true. Do you really think that neural networks work because they are similar to brains on the surface? You also said, But Eliezer says this too in the post you linked! (Failure by Analogy). His example of airplanes not flapping is an example where the design that worked was less close to the biological thing. So clearly the point isn't that X has to be similar to Y; the point is that reasoning from analogy doesn't tell you this either way. (I kinda feel like you already got this, but then I don't understand what point you are trying to make.) Which is actually consistent with thinking that large ANNs will get you to general intelligence. You can both hold that "X is true" and "almost everyone who thinks X is true does so for poor reasons". I'm not saying Eliezer did predict this, but nothing I've read proves that he didn't. Also -- and this is another thing -- the fact that he didn't publicly make the prediction "ANNs will lead to AGI" is only weak evidence that he
5Alex Turner2d
Responding to part of your comment: I know he's talking about alignment, and I'm criticizing that extremely strong claim. This is the main thing I wanted to criticize in my comment! I think the reasoning he presents is not much supported by his publicly available arguments. That claim seems to be advanced due to... there not being enough similarities between ANNs and human brains -- that without enough similarity in mechanisms wich were selected for by evolution, you simply can't get the AI to generalize in the mentioned human-like way. Not as a matter of the AI's substrate, but as a matter of the AI's policy not generalizing like that.  I think this is a dubious claim, and it's made based off of analogies to evolution / some unknown importance of having evolution-selected mechanisms which guide value formation (and not SGD-based mechanisms). From the Alexander/Yudkowsky debate: There's some assertion like "no, there's not a way to get an ANN, even if incorporating structural parameters and information encoded in human genome, to actually unfold into a mind which has human-like values (like 'don't steal')." (And maybe Eliezer comes and says "no that's not what I mean", but, man, I sure don't know what he does mean, then.)  Here's some more evidence along those lines: Again, why is this true? This is an argument that should be engaging in technical questions about inductive biases, but instead seems to wave at (my words) "the original way we got property P was by sexual-recombinant hill-climbing search through a space of relatively very compact neural wiring algorithms, and good luck trying to get it otherwise." Hopefully this helps clarify what I'm trying to critique?
1Rafael Harth19h
Ok, I don't disagree with this. I certainly didn't develop a gears-level understanding of why [building a brain-like thing with gradient descent on giant matrices] is doomed after reading the 2021 conversations. But that doesn't seem very informative either way; I didn't spend that much time trying to grok his arguments.
I don't want to get super hung up on this because it's not about anything Yudkowsky has said but: IMO this is not a faithful transformation of the line of reasoning you attribute to Yudkowsky, which was: Specifically, where you wrote "an entity which flies", you were transforming "a mind which wants as humans do", which I think should instead be transformed to "an entity which flies as birds do". And indeed planes don't fly like birds do. [EDIT: two minutes or so after pressing enter on this comment, I now see how you could read it your way] I guess if I had to make an analogy I would say that you have to be pretty similar to a human to think the way we do, but probably not to pursue the same ends, which is probably the point you cared about establishing.
I guess I read that as talking about the fact that at the time ANNs did not in fact really work. I agree he failed to predict that would change, but that doesn't strike me as a damning prediction. Confidently ruling out a large class of modern approaches isn't really that similar to saying "the only path to success is exactly mimicking the human brain". It seems like one could rule them out by having some theory about why they're deficient. I haven't re-read List of Lethalities because I want to go to sleep soon, but I searched for "brain" and did not find a passage saying "the real problem is that we need to emulate the brain precisely but can't because of poor understanding of neuroanatomy" or something.
17Zack M. Davis3d
I don't think this is a fair reading of Yudkowsky. He was dismissing people who were impressed by the analogy between ANNs and the brain. I'm pretty sure it wasn't supposed to be a positive claim that ANNs wouldn't work. Rather, it's that one couldn't justifiably believe that they'd work just from the brain analogy, and that if they did work, that would be bad news for what he then called Friendliness (because he was hoping to discover and wield a "clean" theory of intelligence, as contrasted to evolution or gradient descent happening to get there at sufficient scale). Consider "Artificial Mysterious Intelligence" [] (2008). In response to someone who said "But neural networks are so wonderful! They solve problems and we don't have any idea how they do it!", it's significant that Yudkowsky's reply wasn't, "No, they don't" (contesting the capabilities claim), but rather, "If you don't know how your AI works, that is not good. It is bad" (asserting that opaque capabilities are bad for alignment).
One of Yudkowsky's claims in the post you link is: This is a claim that lack of the correct mechanistic theory is a formidable barrier for capabilities, not just alignment, and it inaccurately underestimates the amount of empirical understandings available on which to base an empirical approach. It's true that it's hard, even perhaps impossible, to build a flying machine if the only thing you understand is that birds "magically" fly. But if you are like most people for thousands of years, you've observed many types of things flying, gliding, or floating in the air: birds and insects, fabric and leaves, arrows and spears, clouds and smoke. So if you, like the Montgolfier brothers, observe fabric floating over a fire, and live in an era in which invention is celebrated and have the ability to build, test, and iterate, then you can probably figure out how to build a flying machine without basing this on a fully worked out concept of aerodynamics. Indeed, the Montgolfier brothers thought it was the smoke, rather than the heat, that made their balloons fly. Having the wrong theory was bad, but it didn't prevent them from building a working hot air balloon. Let's try turning Yudkowsky's quote around: Eliezer went on to list five methods for producing AI that he considered dubious, including builting powerful computers running the most advanced available neural network algorithms, intelligence "emerging from the internet", and putting "a sufficiently huge quantity of knowledge into [a computer]." But he only admitted that two other methods would work - builting a mechanical duplicate of the human brain and evolving AI via natural selection. If Eliezer wasn't meaning to make a confident claim that scaling up neural networks without a fundamental theoretical understanding of intelligence would fail, then he did a poor job of communicating that in these posts. I don't find that blameworthy - I just think Eliezer comes across as confidently wrong about which avenues wou
4Alex Turner2d
To be fair, he said that those two will work, and (perhaps?) admitted the possibility of "run advanced neural network algorithms" eventually working. Emphasis mine:
I think it might be relevant to note here that it's not really humans who are building current SOTA AIs --- rather, it's some optimizer like SGD that's doing most of the work. SGD does not have any mechanistic understanding of intelligence (nor anything else). And indeed, it takes a heck of a lot of data and compute for SGD to build those AIs. This seems to be in line with Yudkowsky's claim that it's hard/inefficient to build something without understanding it. I think it's important to distinguish between * Scaling up a neural network, and running some kind of fixed algorithm on it. * Scaling up a neural network, and using SGD to optimize the parameters of the NN, so that the NN ends up learning a whole new set of algorithms. IIUC, in Artificial Mysterious Intelligence [], Yudkowsky seemed to be saying that the former would probably fail. OTOH, I don't know what kinds of NN algorithms were popular back in 2008, or exactly what NN algorithms Yudkowsky was referring to, so... *shrugs*.
If that were the case, I actually would fault Eliezer, at least a little. He’s frequently, though by no means always, stuck to qualitative and hard-to-pin-down punditry like we see here, rather than to unambiguous forecasting. This allows him, or his defenders, to retroactively defend his predictions as somehow correct even when they seem wrong in hindsight. Let’s imagine for a moment that Eliezer’s right that AI safety is a cosmically important issue, and yet that he’s quite mistaken about all the technical details of how AGI will arise and how to effectively make it safe. It would be important to know whether we can trust his judgment and leadership. Without the ability to evaluate his performance, either by going with the most obvious interpretation of his qualitative judgments or an unambiguous forecast, it’s hard to evaluate his performance as an AI safety leader. Combine that with a culture of deference to perceived expertise and status and the problem gets worse. So I prioritize the avoidance of special pleading in this case: I think Eliezer comes across as clearly wrong in substance in this specific post, and that it’s important not to reach for ways “he was actually right from a certain point of view” when evaluating his predictive accuracy. Similarly, I wouldn’t judge as correct the early COVID-19 pronouncements that masks don’t work to stop the spread just because cloth masks are poor-to-ineffective and many people refuse to wear masks properly. There’s a way we can stretch the interpretation to make them seem sort of right, but we shouldn’t. We should expect public health messaging to be clearly right in substance, if it’s not making cut and dry unambiguous quantitative forecasts but is instead delivering qualitative judgments of efficacy. None of that bears on how easy or hard it was to build gpt-4. It only bears on how we should evaluate Eliezer as a forecaster/pundit/AI safety leader.
2Alex Turner2d
I think several things here, considering the broader thread:  1. You've done a great job in communicating several reactions I also had: 1. There are signs of serious mispredictions and mistakes in some of the 2008 posts. 2. There are ways to read these posts as not that bad in hindsight, but we should be careful in giving too much benefit of the doubt. 3. Overall these observations constitute important evidence on EY's alignment intuitions and ability to make qualitative AI predictions. 2. I did a bad job of marking my interpretations of what Eliezer wrote, as opposed to claiming he did dismiss ANNs. Hopefully my edits have fixed my mistakes.
2Alex Turner3d
Here's another attempt at one of my contentions.  Consider shard theory of human values. The point of shard theory is not "because humans do RL, and have nice properties, therefore AI + RL will have nice properties." The point is more "by critically examining RL + evidence from humans, I have hypotheses about the mechanistic load-bearing components of e.g. local-update credit assignment in a bounded-compute environment on certain kinds of sensory data, that these components leads to certain exploration/learning dynamics, which explain some portion of human values and experience. Let's test that and see if the generators are similar."  And my model of Eliezer shakes his head at the naivete of expecting complex human properties to reproduce outside of human minds themselves, because AI is not human.  But then I'm like "this other time you said 'AI is not human, stop expecting good property P from superficial similarities', you accidentally missed the modern AI revolution, right? Seems like there is some non-superficial mechanistic similarity/lessons here, and we shouldn't be so quick to assume that the brain's qualitative intelligence or alignment properties come from a huge number of evolutionarily-tuned details which are load-bearing and critical." 
3David Johnston3d
Would you say Yudkowsky's views are a mischaracterisation of neural network proponents, or that he's mistaken about the power of loose analogies?
4Alex Turner3d
Neither. 1. I don't know what proponents were claiming when proponing neural networks. I do know that neural networks ended up working, big time. 2. I don't think loose analogies are powerful. I think they lead to sloppy thinking. 


This post is an attempt to gesture at a class of AI notkilleveryoneism (alignment) problem that seems to me to go largely unrecognized. E.g., it isn’t discussed (or at least I don't recognize it) in the recent plans written up by OpenAI (1,2), by DeepMind’s alignment team, or by Anthropic, and I know of no other acknowledgment of this issue by major labs.

You could think of this as a fragment of my answer to “Where do plans like OpenAI’s ‘Our Approach to Alignment Research’ fail?”, as discussed in Rob and Eliezer’s challenge for AGI organizations and readers. Note that it would only be a fragment of the reply; there's a lot more to say about why AI alignment is a particularly tricky task to task an AI with. (Some of...

That's a challenge, and while you (hopefully) chew on it, I'll tell an implausibly-detailed story to exemplify a deeper obstacle.

Some thoughts written down before reading the rest of the post (list is unpolished / not well communicated)
The main problems I see:

  • There are kinds of deception (or rather kinds of deceptive capabilities / thoughts) that only show up after a certain capability level, and training before that level just won't affect them cause they're not there yet.
  • General capabilities imply the ability to be deceptive if useful in a particu
... (read more)
Translating it to my ontology: 1. Training against explicit deceptiveness trains some "boundary-like" barriers which will make simple deceptive thoughts labelled as such during training difficult 2. Realistically, advanced AI will need to run some general search processes. The barriers described at step 1. are roughly isomorphic to "there are some weird facts about the world which make some plans difficult to plan" (e.g. similar to such plans being avoided because they depend on extremely costly computations). 3. Given some set of a goal and strong enough capabilities, it seem likely the search will find unforeseen ways around the boundaries. (the above may be different from what Nate means) My response: 1. It's plausible people are missing this but I have some doubts. 2. How I think you get actually non-deceptive powerful systems seems different - deception is relational property between the system and the human, so the "deception" thing can be explicitly understood as negative consequence for the world, and avoided using "normal" planning cognition. 3. Stability of this depends on what the system does with internal conflict. 4. If the system stays in some corrigibility/alignment basin, this should be stable upon reflection / various meta-cognitive modifications. Systems in the basin resist self-modifications toward being incorrigible.  
19Steve Byrnes3d
I think your example was doomed from the start because * the AGI was exercising its intelligence & reason & planning etc. towards an explicit, reflectively-endorsed desire for “the nanotech problem will get solved”, * the AGI was NOT exercising its intelligence & reason & planning etc. towards an explicit, reflectively-endorsed desire for “I am being helpful / I am being docile / I am acting with integrity / blah blah”. So the latter is obviously doomed to get crushed by a sufficiently-intelligent AGI. If we can get to a place where the first bullet point still holds, but the AGI also has a comparably-strong, explicit, reflectively-endorsed desire for “I am being helpful / I am being docile / I am acting with integrity / blah blah”, then we’re in a situation where the AGI is applying its formidable intelligence to fight for both bullet points, not just the first one. And then we can be more hopeful that the second bullet point won’t get crushed. (Related [].) In particular, if we can pull that off, then the AGI would presumably do “intelligent” things to advance the second bullet point, just like it does “intelligent” things to advance the first bullet point in your story. For example, the AGI might brainstorm subtle ways that its plans might pattern-match to deception, and feel great relief (so to speak) at noticing and avoiding those problems before they happen. And likewise, it might brainstorm clever ways to communicate more clearly with its supervisor, and treat those as wonderful achievements (so to speak). Etc. Of course, there remains the very interesting open question of how to reliably get to a place where the AGI has an explicit, endorsed, strong desire for “I am being helpful / I am being docile / I am acting with integrity / blah blah”. In particular, if we zap the AGI with negative reward when it’s acting from a deceptive motivation and positive rewa
4Ben Pace2d
I am naively more scared about such an AI. That AI sounds more like if I say "you're not being helpful, please stop" that it will respond "actually I thought about it, I disagree, I'm going to continue doing what I think is helpful".
7Steve Byrnes2d
I think that, if an AGI has any explicit reflectively-endorsed desire whatsoever, then I can tell a similar scary story: The AGI’s desire isn’t quite what I wanted, so I try to correct it, and the AGI says no. (Unless the AGI’s explicit endorsed desires include / entail a desire to accept correction! Which most desires don’t!) And yes, that is a scary story! It is the central scary story of AGI alignment, right? It would be nice to make an AGI with no explicit desires whatsoever, but I don’t think that’s possible. So anyway, if we do Procedure X which will nominally lead to an AGI with an explicit reflectively-endorsed desire to accept corrections to its desires, then one might think that we’re in the ironic situation that the AGI will accept further corrections to that desire if and only if we don’t need to give it corrections in the first place 😛 (i.e. because Procedure X went perfectly and the desire is already exactly right). That would be cute and grimly amusing if true, and it certainly has a kernel of truth, but it’s a bit oversimplified if we take it literally, I think.
10Daniel Kokotajlo3d
This story involves an AI with an internal censor that deletes deceptive thoughts. But since the AI is trying to make rapid progress towards its goals in the wetlab, it eventually stumbles into a workaround that lets it avoid the censor. What if instead of an internal censor, it had an internal plan-goodness-classifier that heavily downweighted plans which involved deception, and also applied on the meta level (e.g. a plan such as 'think long and hard about how to get similar results as deception without technically counting as deception' would be classified as deceptive and heavily downweighted)? ...attempting to answer my own question... The meta-level thing is definitely necessary, otherwise the system quickly and deliberately finds a workaround. (Though maybe at merely human levels of intelligence this is not catastrophic? Humans find workarounds all the time, little white lies, stretching and bending the rules, etc. yet systems often work well enough in practice.) But yeah assume we have the meta-level thing. It's not that the cognition of the system is mysteriously failing; it's that it is knowingly averse to deception and to thinking about how it can 'get around' or otherwise undermine this aversion. It could still randomly stumble into a workaround. If a particular plan occurs to it that doesn't get classified as deception but achieves similar results, it'll go for it. But this is unlikely in practice because it won't be spending cognition trying to find such plans, because doing so would trigger the aversion. I guess you'd say that as the system gets generally smarter, it becomes likely in practice, because it'll just be doing things like "apply clever cognitive strategies like reframing the problem and then brute-force searching for solutions" and this will be a distribution shift for the deception-classifier so it'll fail, even though at no point was the system intending to make the deception-classifier stay silent... But what if it isn't a distribu
7Ben Pace3d
I brainstormed some possible answers. This list is a bit long. I'm publishing this comment because it's not worth the half hour to make it concise, yet it seems worth trying the exercise before reading the post and possibly others will find it worth seeing my quick attempt. I think the last two bullets are probably my best guesses. Nonetheless here is my list: * Just because an AI isn't consciously deceptive, doesn't mean it won't deceive you, and doesn't mean it won't be adversarial against you. There are many types of goodhart [], and many types of adversarial behavior. * It might have a heuristic to gather resources for itself, and it's not even illegible, it's not adversarial, and it's not deceptive, and then someday that impulse kills you. * There is the boring problem of "the AI just stops working ", because it's turning down its human-modeling component generally, or because it has to do human modeling and so training it not to do deception is super duper expensive because you have to repeatedly train against loads and loads and loads of specific edge cases where thinking about humans turns into deception. * The AI stops thinking deceptive thoughts about humans, but still does catastrophic things. For example, an AI thinking about nanotech, may still build nanobots that kill everyone, and you just weren't smart enough to train it not to / ask the right questions. * The AI does things you just don't understand. For example it manipulates the market in strange ways but at the end your profits go up, so you let it go, even though it's not doing anything deceptive. Just because it's not understandably adversarial doesn't mean it isn't doing adversarial action. "What are you doing?" "I'm gathering resources for the company's profits to go up." "Are you lying to me right now?" "No, I'm making the company's profits go up." "How does t

This insight was made possible by many conversations with Quintin Pope, where he challenged my implicit assumptions about alignment. I’m not sure who came up with this particular idea.

In this essay, I call an agent a “reward optimizer” if it not only gets lots of reward, but if it reliably makes choices like “reward but no task completion” (e.g. receiving reward without eating pizza) over “task completion but no reward” (e.g. eating pizza without receiving reward). Under this definition, an agent can be a reward optimizer even if it doesn't contain an explicit representation of reward, or implement a search process for reward.

Reinforcement learning is learning what to do—how to map situations to actions so as to maximize a numerical reward signal. — Reinforcement learning: An introduction 

Many people[1] seem to...

There is a general phenomenon where:

  • Person A has mental model X and tries to explain X with explanation Q
  • Person B doesn't get model X from Q, thinks a bit, and then writes explanation P, reads P and thinks: P is how it should have been explained all along, and Q didn't actually contain the insights, but P does.
  • Person C doesn't get model X from P, thinks a bit, and then writes explanation R, reads R and thinks: ...

It seems to me quite likely that you are person B, thinking they explained something because THEY think their explanation is very good and contai... (read more)

This was written for the Vignettes Workshop.[1] The goal is to write out a detailed future history (“trajectory”) that is as realistic (to me) as I can currently manage, i.e. I’m not aware of any alternative trajectory that is similarly detailed and clearly more plausible to me. The methodology is roughly: Write a future history of 2022. Condition on it, and write a future history of 2023. Repeat for 2024, 2025, etc. (I'm posting 2022-2026 now so I can get feedback that will help me write 2027+. I intend to keep writing until the story reaches singularity/extinction/utopia/etc.)

What’s the point of doing this? Well, there are a couple of reasons:

  • Sometimes attempting to write down a concrete example causes you to learn things, e.g. that a possibility is more
This is a linkpost for

A couple months ago EleutherAI started an alignment speaker series, some of these talks have been recorded. This is the first instalment in the series. The following is a transcript generated with the help of Conjecture's Verbalize and some light editing:

Getting started

00:00:22,775 --> 00:00:56,683
Okay, I've started the recording. I think we can give it maybe a minute or two more and then I guess we can get started. I've also got the chat window as part of the recording. So if anyone has something they want to write out, feel free to put that in. Steve, you want to do questions throughout the talk, or should we wait till the end of the talk before we ask questions?

00:00:59,405 --> 00:01:09,452
Let's do throughout, but I reserve...

Load More