Recent Discussion


I recently watched Eliezer Yudkowsky's appearance on the Bankless podcast, where he argued that AI was nigh-certain to end humanity. Since the podcast, some commentators have offered pushback against the doom conclusion. However, one sentiment I saw was that optimists tended not to engage with the specific arguments pessimists like Yudkowsky offered. 

Economist Robin Hanson points out that this pattern is very common for small groups which hold counterintuitive beliefs: insiders develop their own internal language, which skeptical outsiders usually don't bother to learn. Outsiders then make objections that focus on broad arguments against the belief's plausibility, rather than objections that focus on specific insider arguments.

As an AI "alignment insider" whose current estimate of doom is around 5%, I wrote this post to explain some of my many...

5 · Adele Lopez · 1d
This does not seem like it counts as "publicly humiliating" in any way? Rude, sure, but that's quite different.
16 · Matthew "Vaniver" Gray · 1d
I have a lot of responses to specific points; I'm going to make them as children comment to this comment.
3 · Matthew "Vaniver" Gray · 1d
I think it would probably be strange for the visual field to do this. But I think it's not that uncommon for other parts of the brain to do this; the higher-level, more abstract / "psychological" parts that have a sense of how things will affect their relevance to future decision-making. I think there are lots of self-perpetuating narratives that it might be fair to call 'deceptively aligned' when they're maladaptive. The idea of metacognitive blindspots [] also seems related.
2 · Matthew "Vaniver" Gray · 1d
Also relevant are Are minimal circuits daemon-free? [] and Are minimal circuits deceptive? []. I agree no one knows how much of an issue this will be for deep learning. I think the brain obviously has such phenomena, and societies made up of humans also obviously have such phenomena. I think it is probably not adaptive (optimization demons are more like 'cognitive cancer' than 'part of how values form', I think, but in part that's because the term comes with the disapproval built in).
2 · Matthew "Vaniver" Gray · 1d
That is also how I interpreted it. I think Yudkowsky is making a different statement. I agree it would be bizarre for him to be saying "if I were wrong, it would only mean I should have been more confident!" I think he is (inside of the example). He's saying "suppose an engineer is wrong about how their design works. Is it more likely that the true design performs better along multiple important criteria than expectation, or that the design performs worse (or fails to function at all)?" Note that 'expectation' is referring to the confidence level inside an argument [], but arguments aren't Bayesians; it's the outside agent that shouldn't be expected to predictably update. Another way to put this: does the engineer expect to be disappointed, excited, or neutral if the design doesn't work as planned? Typically, disappointed, implying the plan is overly optimistic compared to reality. If this weren't true--if engineers were calibrated or pessimistic--then I think Yudkowsky would be wrong here (and also probably have a different argument to begin with).
1 · Matthew "Vaniver" Gray · 1d
I think I agree with this point but want to explicitly note the switch from the phrase 'AI alignment research' to 'ML alignment research'; my model of Eliezer thinks the second is mostly a distraction from the former, and if you think they're the same or interchangeable that seems like a disagreement. [For example, I think ML alignment research includes stuff like "will our learned function be robust to distributional shift in the inputs?" and "does our model discriminate against protected classes?" whereas AI alignment research includes stuff like "will our system be robust to changes in the number of inputs?" and "is our model deceiving us about its level of understanding?". They're related in some ways, but pretty deeply distinct.]
5 · Matthew "Vaniver" Gray · 1d
I do think this is a point against Yudkowsky. That said, my impression is that GANs are finicky, and I heard rumors that many people tried similar ideas and failed to get it to work before Goodfellow knocked it out of the park. If people were encouraged to publish negative results, we might have a better sense of the actual landscape here, but I think a story of "Goodfellow was unusually good at making GANs and this is why he got it right on his first try" is more compelling to me than "GANs were easy actually".
6 · Matthew "Vaniver" Gray · 1d
I think you're basically misunderstanding and misrepresenting Yudkowsky's argument from 2008. He's not saying "you can't make an AI out of neural networks", he's saying "your design sharing a single feature with the brain does not mean it will also share the brain's intelligence." As well, I don't think he's arguing about how AI will actually get made; I think he's mostly criticizing the actual AGI developers/enthusiasts that he saw at the time (who were substantially less intelligent and capable than the modern batch of AGI developers). I think that post has held up pretty well. The architectures used to organize neural networks are quite important, not just the base element. Someone whose only plan was to make their ANN wide would not reach AGI; they needed to do something else, that didn't just rely on surface analogies.
5 · Matthew "Vaniver" Gray · 1d
Do you have kids, or any experience with them? (There are three small children in the house I live in.) I think you might want to look into childproofing [], and meditate on its connection to security mindset. Yes, this isn't necessarily related to the 'values' part, but for that I would suggest things like Direct Instruction [], which involves careful curriculum design to generate lots of examples so that students will reliably end up inferring the correct rule. In short, I think the part of 'raising children' which involves the kids being intelligent and independently minded does benefit from security mindset.

As you mention in the next paragraph, this is a long-standing disagreement; I might as well point at the discussion of the relevance of raising human children to instilling goals in an AI in The Detached Lever Fallacy []. The short summary of it is that humans have a wide range of options for their 'values', and are running some strategy of learning from their environment (including their parents and their style of raising children) which values to adopt. The situation with AI seems substantially different--why make an AI design that chooses whether to be good or bad based on whether you're nice to it, when you could instead have it choose to always be good? [Note that this is distinct from "always be nice"; you could decide that your good AI can tell users that they're being bad users!]
1 · Matthew "Vaniver" Gray · 1d
It seems like the argument structure here is something like:

1. This requirement is too stringent for humans to follow.
2. Humans have successful value alignment.
3. Therefore this requirement cannot be necessary for successful value alignment.

I disagree with point 2, tho; among other things, it looks to me like some humans are on track to accidentally summon a demon that kills both me and them, which I expect they would regret after-the-fact if they had the chance to. So any reasoning that's like "well so long as it's not unusual we can be sure it's safe" runs into the thing where we're living in the acute risk period. The usual is not safe!

This seems definitely right to me. An expectation I have is that this will also generate resistance to alignment techniques / control by its operators, which perhaps complicates how benign this is.

[FWIW I also don't think we want an AI that's perfectly robust to all possible adversarial attacks; I think we want one that's adequate to defend against the security challenges it faces, many of which I expect to be internal. Part of this is because I'm mostly interested in AI planning systems able to help with transformative changes to the world instead of foundational models used by many customers for small amounts of cognition, which are totally different business cases and have different security problems.]
3 · Matthew "Vaniver" Gray · 1d
This seems... like a correct description but it's missing the spirit? Like, the intuitions are primarily about "what features are salient" and "what thoughts are easy to think." Roughly, the core distinction between software engineering and computer security is whether the system is thinking back. Software engineering typically involves working with dynamic systems and thinking optimistically about how the system could work. Computer security typically involves working with reactive systems and thinking pessimistically about how the system could break. I think it is an extremely basic AI alignment skill to look at your alignment proposal and ask "how does this break?" or "what happens if the AI thinks about this?". What's your story for specification gaming []?

I must admit some frustration here; in this section it feels like your point is "look, computer security is for dealing with intelligence as part of your system. But the only intelligence in our system is sometimes malicious users!" In my world, the whole point of Artificial Intelligence was the Intelligence. The call is coming from inside the house!

Maybe we just have some linguistic disagreement? "Sure, computer security is relevant to transformative AI but not LLMs"? If so, then I think the earlier point [] about whether capabilities enhancements break alignment techniques is relevant: if these alignment techniques work because the system isn't thinking about them, then are you confident they will continue to work when the system is thinking about them?
14 · Matthew "Vaniver" Gray · 1d
uh... is your proposal "use the true reward function, and then you won't get misaligned AI"? These three paragraphs feel incoherent to me. The human eating ice cream and activating their reward circuits is exactly what you would expect under the current paradigm. Yudkowsky thinks this leads to misalignment; you agree. He says that you need a new paradigm to not have this problem. You disagree because you assume it's possible under the current paradigm. If so, how? Where's the system that, on eating ice cream, realizes "oh no! This is a bad action that should not receive reward!" and overrides the reward machinery? How was it trained?

I think when Eliezer says "we need an entirely new paradigm", he means something like "if we want a decision-making system that makes better decisions than an RL agent, we need agent-finding machinery that's better than RL." Maybe the paradigm shift is small (like from RL without experience replay to RL with), or maybe the paradigm shift is large (like from policy-based agents to plan-based agents).

He's not saying the failures of RL are a surprise from the theory of RL. Of course you can explain it using the standard language of RL! He's saying that unless you can predict RL's failures from the inside, the RL agents that you make are going to actually make those mistakes in reality.
5 · Matthew "Vaniver" Gray · 1d
I don't yet understand why you put misgeneralized in scare quotes, or whether you have a story for why it's a misgeneralization instead of things working as expected. I think your story for why humans like ice cream makes sense, and is basically the story Yudkowsky would tell too, with one exception: "such food sources" feels a little like it's eliding the distinction between "high-quality food sources of the ancestral environment" and "foods like ice cream"; the training dataset couldn't differentiate between functions f and g but those functions differ in their reaction to the test set (ice cream). Yudkowsky's primary point with this section, as I understand it, is that even if you-as-evolution know that you want g the only way you can communicate that under the current learning paradigm is with training examples, and it may be non-obvious to which functions f need to be excluded.
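The point about functions f and g agreeing on the training set but diverging on a novel input can be sketched concretely. Everything here is hypothetical: the feature triples, the thresholds, and the two rules are invented for illustration, not drawn from any real model of evolution or reward learning.

```python
# Two hypothetical "value functions" that agree on every input in the
# ancestral training distribution but diverge on a novel input (ice cream).
# f: "seek calorie-dense whole foods" (the intended rule)
# g: "seek sugar+fat signals" (the proxy rule)
# Inputs are illustrative (sugar, fat, is_whole_food) feature triples.
train = [
    (0.3, 0.2, 1),  # ripe fruit
    (0.1, 0.6, 1),  # nuts
    (0.2, 0.5, 1),  # meat
]
ice_cream = (0.9, 0.8, 0)  # novel input: high sugar/fat, not a whole food

f = lambda s, fat, whole: whole == 1 and (s + fat > 0.4)  # intended rule
g = lambda s, fat, whole: s + fat > 0.4                   # proxy rule

# On every training point the two rules are indistinguishable...
for x in train:
    assert f(*x) == g(*x)

# ...but they disagree on the off-distribution test point.
print(f(*ice_cream), g(*ice_cream))  # -> False True
```

The training data alone cannot tell f and g apart, so "you-as-evolution" has no way, under this paradigm, to communicate that you wanted f rather than g except by adding more training examples that happen to separate them.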
2 · Matthew "Vaniver" Gray · 1d
I think there's a deep disconnect here on whether interpolation is enough or whether we need extrapolation. The point of the strawberry alignment problem is "here's a clearly understandable specification of a task that requires novel science and engineering to execute on. Can you do that safely?". If your ambitions are simply to have AI customer service bots, you don't need to solve this problem. If your ambitions include cognitive megaprojects which will need to be staffed at high levels by AI systems, then you do need to solve this problem. More pragmatically, if your ambitions include setting up some sort of system that prevents people from deploying rogue AI systems while not dramatically curtailing Earth's potential, that isn't a goal that we have many examples of people executing on. So either we need to figure it out with humans or, if that's too hard, create an AI system capable of figuring it out (which probably requires an AI leader instead of an AI assistant).
2 · Matthew "Vaniver" Gray · 1d
I agree with your picture of how manifolds work; I don't think it actually disagrees all that much with Yudkowsky's. That is, the thing where all humans are basically the same make and model of car, running the same brand of engine, painted different colors is the claim that the intrinsic dimension of human minds is pretty small. (Taken literally, it's 3, for the three dimensions of color-space.) And so if you think there are, say, 40 intrinsic dimensions to mind-space, and humans are fixed on 37 of the points and variable on the other 3, well, I think we have basically the Yudkowskian picture. (I agree if Yudkowsky's picture was that there were 40M dimensions and humans varied on 3, this would be comically wrong, but I don't think this is what he's imagining for that argument.)
2 · Matthew "Vaniver" Gray · 1d
I think this is what Yudkowsky thinks also? (As for why it was relevant to bring up, Yudkowsky was answering the host's question of "How is superintelligence different than general intelligence?")
2 · Matthew "Vaniver" Gray · 1d
Part of this is just straight disagreement, I think; see So8res's Sharp Left Turn [] and follow-on discussion []. But for the rest of it, I don't see this as addressing the case for pessimism, which is not problems from the reference class that contains "the LLM sometimes outputs naughty sentences" but instead problems from the reference class that contains "we don't know how to prevent an ontological collapse, where meaning structures constructed under one world-model compile to something different under a different world model." Or, like, once LLMs gain the capability to design proteins (because you added in a relevant dataset, say), do you really expect the 'helpful, harmless, honest' alignment techniques that were used to make a chatbot not accidentally offend users to also work for making a biologist-bot not accidentally murder patients? Put another way, I think new capabilities advances reveal new alignment challenges and unless alignment techniques are clearly cutting at the root of the problem, I don't expect that they will easily transfer to those new challenges.
2 · G Gordon Worley III · 1d
This post brought to mind a thought: I actually don't care very much about arguments about how likely doom is and how pessimistic or optimistic to be since they are irrelevant, to my style of thinking, for making decisions related to building TAI. Instead, I mostly focus on downside risks and avoiding them because they are so extreme, which makes me look "pessimistic" but actually I'm just trying to minimize the risk of false positives in building aligned AI []. Given this framing, it's actually less important, in most cases, to figure out how likely something is, and more important to figure out how likely doom is if we are wrong, and carefully navigate the path that minimizes the risk of doom, regardless of what the assessment of doom is.
3 · Steve Byrnes · 1d
I agree with OP that this rocket analogy from Eliezer is a bad analogy, AFAICT. If someone is trying to assess the difficulty of solving a technical problem (e.g. building a rocket) in advance, then they need to brainstorm potential problems that might come up, and when they notice one, they also need to brainstorm potential technical solutions to that problem. For example “the heat of reentry will destroy the ship” is a potential problem, and “we can invent new and better heat-resistant tiles / shielding” is a potential solution to that problem. During this process, I don’t think it’s particularly unusual for the person to notice a technical problem but overlook a clever way to solve that problem. (Maybe they didn’t recognize the possibility of inventing new super-duper-heat-resistant ceramic tiles, or whatever.) And then they would wind up overly pessimistic.
4 · Matthew "Vaniver" Gray · 21h
I think this isn't the claim; I think the claim is that it would be particularly unusual for someone to overlook that they're accidentally solving a technical problem. (It would be surprising for Edison to not be thinking hard about what filament to use and pick tungsten; in actual history, it took decades for that change to be made.)
2 · Steve Byrnes · 21h
Sure, but then the other side of the analogy doesn’t make sense, right? The context was: Eliezer was talking in general terms about the difficulty of the AGI x-risk problem and whether it’s likely to be solved. (As I understand it.) [Needless to say, I’m just making a narrow point that it’s a bad analogy. I’m not arguing that p(doom) is high or low, I’m not saying this is an important & illustrative mistake (talking on the fly is hard!), etc.]
1 · Matthew "Vaniver" Gray · 20h
So I definitely think there's something weirdly unspoken about the argument; I would characterize it as Eliezer saying "suppose I'm right and they're wrong; all this requires is things to be harder than people think, which is usual. Suppose instead that I'm wrong and they're right; this requires things to be easier than people think, which is unusual." But the equation of "people" and "Eliezer" is sort of strange; as Quintin notes, it isn't that unusual for outside observers to overestimate difficulty, and so I wish he had centrally addressed the reference class tennis game: is the expertise "getting AI systems to be capable" or "getting AI systems to do what you want"?
6 · Steve Byrnes · 1d
I narrowly agree with most of this, but I tend to say the same thing with a very different attitude: I would say: “Gee it would be super cool if we could decide a priori what we want the AGI to be trying to do, WITH SURGICAL PRECISION. But alas, that doesn’t seem possible, at least not according to any method I know of.”

I disagree with your apparent suggestion that the above paragraph is obvious or uninteresting, and also disagree with your apparent suggestion that “setting an AGI’s motivations with surgical precision” is such a dumb idea that we shouldn’t even waste one minute of our time thinking about whether it might be possible to do that. For example, people who are used to programming almost any other type of software have presumably internalized the idea that the programmer can decide what the software will do with surgical precision. So it's important to spread the idea that, on current trends, AGI software will be very different from that.

BTW I do agree with you that Eliezer’s interview response seems to suggest that he thinks aligning an AGI to “basic notions of morality” is harder and aligning an AGI to the “strawberry problem” is easier. If that’s what he thinks, it’s at least not obvious to me. (see follow-up [])
3 · Matthew "Vaniver" Gray · 1d
My sense (which I expect Eliezer would agree with) is that it's relatively easy to get an AI system to imitate the true underlying 'basic notions of morality', to the extent humans agree on that, but that this doesn't protect you at all as soon as you want to start making large changes, or as soon as you start trying to replace specialist sectors of the economy. (A lot of ethics for doctors has to do with the challenges of simultaneously being a doctor and a human; those ethics will not necessarily be relevant for docbots, and the question of what they should be instead is potentially hard to figure out.) So if you're mostly interested in getting out of the acute risk period, you probably need to aim for a harder target.
3 · Steve Byrnes · 20h
Hmm, on further reflection, I was mixing up:

* Strawberry Alignment (defined as: make an AGI that is specifically & exclusively motivated to duplicate a strawberry without destroying the world), versus
* “Strawberry Problem” (make an AGI that in fact duplicates a strawberry without destroying the world, using whatever methods / motivations you like).

Eliezer definitely talks about the latter. I’m not sure Eliezer has ever brought up the former? I think I was getting that from the OP (Quintin), but maybe Quintin was just confused (and/or Eliezer misspoke).

Anyway, making an AGI that can solve the strawberry problem is tautologically no harder than making an AGI that can do advanced technological development and is motivated by human norms / morals / whatever, because the latter set of AGIs is a subset of the former. Sorry. I crossed out that paragraph.  :)
9 · Vladimir Slepnev · 1d
I think there's a mistake here which kind of invalidates the whole post. If we don't reward our AI for taking bad actions within the training distribution, it's still very possible that in the future world, looking quite unlike the training distribution, the AI will be able to find such an action. Same as ice cream wasn't in evolution's training distribution for us, but then we found it anyway.
DIFFICULTY OF ALIGNMENT

I find the prospect of training a model on just 40 parameters to be very interesting. Almost unbelievable, really, to the point where I'm tempted to say: "I notice that I'm confused". Unfortunately, I don't have access to the paper and it doesn't seem to be on sci-hub, so I haven't been able to resolve my confusion.

Basically, my general intuition is that each parameter in a network probably only contributes a few bits of optimization power. It can be set fairly high, fairly low, or in between. So if you just pulled 40 random weights from the network, that's maybe 120 bits of optimization power. Which might be enough for MNIST, but probably not for anything more complicated. So I'm guessing that most likely a bunch of other optimization went into choosing exactly which 40-dimensional subspace we should be using. Of course, if we're allowed to do that then we could even do it with a 1-dimensional subspace: just pick the training trajectory as your subspace!

Generally with the mindspace thing, I don't really think about the absolute size or dimension of mindspace, but the relative size of "things we could build" and "things we could build that would have human values". This relative size is measured in bits. So the intuition here would be that it takes a lot of bits to specify human values, and so the difference in size between these two is really big. Now maybe if you're given Common Crawl, it takes fewer bits to point to human values within that big pile of information. But it's probably still a lot of bits, and then the question is how do you actually construct such a pointer?

DEMONS IN GRADIENT DESCENT

I agree that demons are unlikely to be a problem, at least for basic gradient descent. They should have shown up by now in real training runs, otherwise. I do still think gradient descent is a very unpredictable process (or to put it more precisely: we still don't know how to predict gradient descent very well), and where that shows up
For the 40 parameters thing, this link should work []. See also this earlier paper [].
Thanks for the link! Looks like they do put optimization effort into choosing the subspace, but it's still interesting that the training process can be factored into 2 pieces like that.
BTW: the way I found that first link was by searching the title on google scholar, finding the paper, and clicking "All 5 versions" below (it's right next to "Cited by 7" and "Related articles"). That brought me to a bunch of versions, one of which was a seemingly-ungated PDF. This will probably frequently work, because AI researchers usually make their papers publicly available (at least in pre-print form).
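The random-subspace training idea discussed in this thread can be sketched in a toy setting: freeze a base point theta0, draw a random projection P, and run gradient descent only over the low-dimensional coordinates z, with theta = theta0 + P @ z. Everything below (the dimensions, the logistic-regression task, the learning rate) is an illustrative choice, not the cited paper's actual setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# Full parameter space has D = 1000 dimensions, but we only optimize
# d = 40 coordinates z in a fixed random subspace: theta = theta0 + P @ z.
D, d = 1000, 40
P = rng.normal(size=(D, d)) / np.sqrt(D)  # random projection
theta0 = np.zeros(D)

# Synthetic binary classification data with a linear ground truth.
X = rng.normal(size=(500, D))
w_true = rng.normal(size=D)
y = (X @ w_true > 0).astype(float)

def loss_and_grad(z):
    theta = theta0 + P @ z
    logits = np.clip(X @ theta, -30, 30)  # clip to avoid exp overflow
    p = 1 / (1 + np.exp(-logits))
    loss = -np.mean(y * np.log(p + 1e-9) + (1 - y) * np.log(1 - p + 1e-9))
    # The gradient w.r.t. z is the full gradient pushed through P.
    grad_theta = X.T @ (p - y) / len(y)
    return loss, P.T @ grad_theta

z = np.zeros(d)
for _ in range(200):
    loss, g = loss_and_grad(z)
    z -= 1.0 * g  # plain gradient descent in the 40-dim subspace

final_loss, _ = loss_and_grad(z)
acc = np.mean(((X @ (theta0 + P @ z)) > 0) == (y > 0.5))
print(f"final loss {final_loss:.3f}, train accuracy {acc:.2f}")
```

Even though only 40 numbers are trained, the loss drops below its starting value (log 2 at z = 0), which is the qualitative phenomenon the papers report; how far it drops depends heavily on how the subspace is chosen, which is exactly the point made above.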
4 · Eliezer Yudkowsky · 2d
This is kinda long.  If I had time to engage with one part of this as a sample of whether it holds up to a counterresponse, what would be the strongest foot you could put forward? (I also echo the commenter who's confused about why you'd reply to the obviously simplified presentation from an off-the-cuff podcast rather than the more detailed arguments elsewhere.)
1 · Matthew "Vaniver" Gray · 21h
FWIW, I thought the bit about manifolds in The difficulty of alignment [] was the strongest foot forward, because it paints a different detailed picture than your description that it's responding to. That said, I don't think Quintin's picture obviously disagrees with yours (as discussed in my response over here []), and I think you'd find it disappointing, him calling your description extremely misleading while not seeming to correctly identify the argument structure and check whether there's a related one that goes thru on his model.
This response is enraging. Here is someone who has attempted to grapple with the intellectual content of your ideas, and your response is "This is kinda long."? I shouldn't be that surprised because, IIRC, you said something similar in response to Zack Davis' essays on the Map and Territory distinction, but that's ancillary and AI is core to your memeplex. I have heard repeated claims that people don't engage with the alignment community's ideas (recent example from yesterday []). But here is someone who did the work. Please explain why your response here does not cause people to believe there's no reason to engage with your ideas because you will brush them off. Yes, nutpicking e/accs on Twitter is much easier and probably more hedonic, but they're not convincible and Quintin here is.
1 · Matthew "Vaniver" Gray · 1d
I have attempted to respond to the whole post over here [].
6 · Eliezer Yudkowsky · 1d
Choosing to engage with an unscripted unrehearsed off-the-cuff podcast intended to introduce ideas to a lay audience, continues to be a surprising concept to me.  To grapple with the intellectual content of my ideas, consider picking one item from "A List of Lethalities" and engaging with that.
9 · Alex Turner · 1d
Here are some of my disagreements with List of Lethalities []. I'll quote item one:
I think I've been in situations where I've been disoriented by a bunch of random stuff happening and wished that less of it was happening so that I could get a better handle on stuff. An example I vividly recall was being in a history class in high school and being very bothered by the large number of conversations happening around me.
This comment doesn't really engage much with your post - there's a lot there and I thought I'd pick one point to get a somewhat substantive disagreement. But I ended up finding this question and thought that I should answer it.
To tie up this thread: I started writing a more substantive response to a section but it took a while and was difficult and I then got invited to dinner, so probably won't get around to actually writing it.
20 · Alex Turner · 2d
Some reasoning which led Eliezer to dismiss neural networks,[1] seems similar to some reasoning which he deploys in his modern alignment arguments.

Compare his incorrect mockery from 2008: with his claim in Alexander and Yudkowsky on AGI goals []:

I agree that 100 quadrillion artificial neurons + loss function won't get you a literal human, for trivial reasons. The relevant point is his latter claim: "in particular with respect to "learn 'don't steal' rather than 'don't get caught'."" I think this is a very strong conclusion, relative to available data. I think that a good argument for it would require a lot of technical, non-analogical reasoning about the inductive biases of SGD on large language models. But, AFAICT, Eliezer rarely deploys technical reasoning that depends on experimental results or ML theory. He seems to prefer strongly-worded a priori arguments that are basically analogies.

So, here are two claims which seem to echo the positions Eliezer advances:

1. "A large ANN doesn't look enough like a human brain to develop intelligence." -> wrong (see GPT-4)
2. "A large ANN doesn't look enough like a human brain to learn 'don't steal' rather than 'don't get caught'" -> (not yet known)

I perceive a common thread of  But why is this true? You can just replace "human intelligence" with "avian flight", and the argument might sound similarly plausible a priori.

ETA: The invalid reasoning step is in the last clause ("to get a mind..."). If design X exhibits property P, that doesn't mean that design Y must be similar to X in order to exhibit property P.

ETA: Part of this comment was about EY dismissing neural networks in 2008. It seems to me that the cited writing supports that interpretation, and it's still my best guess (see also DirectedEvolution's comment [https://www.less

EY was not in fact bullish on neural networks leading to impressive AI capabilities. Eliezer said this directly:

I'm no fan of neurons; this may be clearer from other posts.[1]

I think this is strong evidence for my interpretation of the quotes in my parent comment: He's not just mocking the local invalidity of reasoning "because humans have lots of neurons, AI with lots of neurons -> smart", he's also mocking neural network-driven hopes themselves. 

  1. ^

    More quotes from Logical or Connectionist AI?:

    Not to mention that neural networks have also been "fai

I don't really get your comment. Here are some things I don't get:

* In "Failure By Analogy" and "Surface Analogies and Deep Causes", the point being made is "X is similar in aspects A to thing Y, and X has property P" does not establish "Y has property P". The reasoning he instead recommends is to reason about Y itself, and sometimes it will have property P. This seems like a pretty good point to me.
* Large ANNs don't appear to me to be intelligent because of their similarity to human brains - they appear to me to be intelligent because they're able to be tuned to accurately predict simple facts about a large amount of data that's closely related to human intelligence, and the algorithm they get tuned to seems to be able to be repurposed for a wide variety of tasks (probably related to the wide variety of data that was trained on).
* Airplanes don't fly like birds, they fly like airplanes. So indeed you can't just ape one thing about birds[*] to get avian flight. I don't think this is a super revealing technicality but it seemed like you thought it was important.
* Maybe most importantly, I don't think Eliezer thinks you need to mimic the human brain super closely to get human-like intelligence with human-friendly wants. I think he instead thinks you need to mimic the human brain super closely to validly argue by analogy from humans. I think this is pretty compatible with this quote from "Failure By Analogy" (it isn't exactly implied by it, but your interpretation isn't either):
* Matters would be different if he said in the quotes you cite "you only get these human-like properties by very exactly mimicking the human brain", but he doesn't.

[*] I've just realized that I can't name a way in which airplanes are like birds in which they aren't like humans. They have things sticking out their sides? So do humans, they're called arms. Maybe the cross-sectional shape of the wings are similar? I guess they b
7Alex Turner2d
Edited to modify confidences about interpretations of EY's writing / claims.

This is a valid point, and that's not what I'm critiquing in that portion of the comment. I'm critiquing how -- on my read -- he confidently dismisses ANNs; in particular, using non-mechanistic reasoning which seems similar to some of his current alignment arguments. On its own, this seems like a substantial misprediction for an intelligence researcher in 2008 (especially one who claims to have figured out most things in modern alignment by a very early point in time -- possibly that early, IDK). Possibly the most important prediction to get right, to date.

Indeed, you can't ape one thing. But that's not what I'm critiquing. Consider the whole transformed line of reasoning: the important part is the last part. It's invalid. Finding a design X which exhibits property P doesn't mean that for design Y to exhibit property P, Y must be very similar to X.

Which leads us to: reading the Alexander/Yudkowsky debate, I surprisingly haven't ruled out this interpretation, and indeed suspect he believes some forms of this (but not others). Didn't he? He at least confidently rules out a very large class of modern approaches.
7Rafael Harth1d
I also don't really get your position. You say that, but you haven't shown this!

* In Surface Analogies and Deep Causes, I read him as saying that neural networks don't automatically yield intelligence just because they share surface similarities with the brain. This is clearly true; at the very least, using token-prediction (which is a task for which (a) lots of training data exist and (b) lots of competence in many different domains is helpful) is a second requirement. If you took the network of GPT-4 and trained it to play chess instead, you wouldn't get something with cross-domain competence.
* In Failure by Analogy he makes a very similar abstract point -- and wrt neural networks in particular, he says that the surface similarity to the brain is a bad reason to be confident in them. This also seems true. Do you really think that neural networks work because they are similar to brains on the surface?

You also said,

But Eliezer says this too in the post you linked! (Failure by Analogy). His example of airplanes not flapping is an example where the design that worked was less close to the biological thing. So clearly the point isn't that X has to be similar to Y; the point is that reasoning from analogy doesn't tell you this either way. (I kinda feel like you already got this, but then I don't understand what point you are trying to make.)

Which is actually consistent with thinking that large ANNs will get you to general intelligence. You can both hold that "X is true" and "almost everyone who thinks X is true does so for poor reasons". I'm not saying Eliezer did predict this, but nothing I've read proves that he didn't. Also -- and this is another thing -- the fact that he didn't publicly make the prediction "ANNs will lead to AGI" is only weak evidence that he
5Alex Turner1d
Responding to part of your comment: I know he's talking about alignment, and I'm criticizing that extremely strong claim. This is the main thing I wanted to criticize in my comment! I think the reasoning he presents is not much supported by his publicly available arguments.

That claim seems to be advanced due to... there not being enough similarities between ANNs and human brains -- that without enough similarity in mechanisms which were selected for by evolution, you simply can't get the AI to generalize in the mentioned human-like way. Not as a matter of the AI's substrate, but as a matter of the AI's policy not generalizing like that.

I think this is a dubious claim, and it's made based off of analogies to evolution / some unknown importance of having evolution-selected mechanisms which guide value formation (and not SGD-based mechanisms). From the Alexander/Yudkowsky debate: there's some assertion like "no, there's not a way to get an ANN, even if incorporating structural parameters and information encoded in the human genome, to actually unfold into a mind which has human-like values (like 'don't steal')." (And maybe Eliezer comes and says "no, that's not what I mean", but, man, I sure don't know what he does mean, then.)

Here's some more evidence along those lines: again, why is this true? This is an argument that should be engaging with technical questions about inductive biases, but instead seems to wave at (my words) "the original way we got property P was by sexual-recombinant hill-climbing search through a space of relatively very compact neural wiring algorithms, and good luck trying to get it otherwise."

Hopefully this helps clarify what I'm trying to critique?
I don't want to get super hung up on this because it's not about anything Yudkowsky has said but: IMO this is not a faithful transformation of the line of reasoning you attribute to Yudkowsky, which was: Specifically, where you wrote "an entity which flies", you were transforming "a mind which wants as humans do", which I think should instead be transformed to "an entity which flies as birds do". And indeed planes don't fly like birds do. [EDIT: two minutes or so after pressing enter on this comment, I now see how you could read it your way] I guess if I had to make an analogy I would say that you have to be pretty similar to a human to think the way we do, but probably not to pursue the same ends, which is probably the point you cared about establishing.
I guess I read that as talking about the fact that at the time ANNs did not in fact really work. I agree he failed to predict that would change, but that doesn't strike me as a damning prediction. Confidently ruling out a large class of modern approaches isn't really that similar to saying "the only path to success is exactly mimicking the human brain". It seems like one could rule them out by having some theory about why they're deficient. I haven't re-read List of Lethalities because I want to go to sleep soon, but I searched for "brain" and did not find a passage saying "the real problem is that we need to emulate the brain precisely but can't because of poor understanding of neuroanatomy" or something.
17Zack M. Davis2d
I don't think this is a fair reading of Yudkowsky. He was dismissing people who were impressed by the analogy between ANNs and the brain. I'm pretty sure it wasn't supposed to be a positive claim that ANNs wouldn't work. Rather, it's that one couldn't justifiably believe that they'd work just from the brain analogy, and that if they did work, that would be bad news for what he then called Friendliness (because he was hoping to discover and wield a "clean" theory of intelligence, as contrasted to evolution or gradient descent happening to get there at sufficient scale). Consider "Artificial Mysterious Intelligence" (2008). In response to someone who said "But neural networks are so wonderful! They solve problems and we don't have any idea how they do it!", it's significant that Yudkowsky's reply wasn't, "No, they don't" (contesting the capabilities claim), but rather, "If you don't know how your AI works, that is not good. It is bad" (asserting that opaque capabilities are bad for alignment).
One of Yudkowsky's claims in the post you link is:

This is a claim that lack of the correct mechanistic theory is a formidable barrier for capabilities, not just alignment, and it inaccurately underestimates the amount of empirical understanding available on which to base an empirical approach.

It's true that it's hard, even perhaps impossible, to build a flying machine if the only thing you understand is that birds "magically" fly. But if you are like most people for thousands of years, you've observed many types of things flying, gliding, or floating in the air: birds and insects, fabric and leaves, arrows and spears, clouds and smoke. So if you, like the Montgolfier brothers, observe fabric floating over a fire, and live in an era in which invention is celebrated and have the ability to build, test, and iterate, then you can probably figure out how to build a flying machine without basing this on a fully worked-out concept of aerodynamics. Indeed, the Montgolfier brothers thought it was the smoke, rather than the heat, that made their balloons fly. Having the wrong theory was bad, but it didn't prevent them from building a working hot air balloon.

Let's try turning Yudkowsky's quote around:

Eliezer went on to list five methods for producing AI that he considered dubious, including building powerful computers running the most advanced available neural network algorithms, intelligence "emerging from the internet", and putting "a sufficiently huge quantity of knowledge into [a computer]." But he only admitted that two other methods would work: building a mechanical duplicate of the human brain, and evolving AI via natural selection.

If Eliezer wasn't meaning to make a confident claim that scaling up neural networks without a fundamental theoretical understanding of intelligence would fail, then he did a poor job of communicating that in these posts. I don't find that blameworthy -- I just think Eliezer comes across as confidently wrong about which avenues wou
4Alex Turner1d
To be fair, he said that those two will work, and (perhaps?) admitted the possibility of "run advanced neural network algorithms" eventually working. Emphasis mine:
I think it might be relevant to note here that it's not really humans who are building current SOTA AIs -- rather, it's some optimizer like SGD that's doing most of the work. SGD does not have any mechanistic understanding of intelligence (nor of anything else). And indeed, it takes a heck of a lot of data and compute for SGD to build those AIs. This seems to be in line with Yudkowsky's claim that it's hard/inefficient to build something without understanding it.

I think it's important to distinguish between:

* Scaling up a neural network, and running some kind of fixed algorithm on it.
* Scaling up a neural network, and using SGD to optimize the parameters of the NN, so that the NN ends up learning a whole new set of algorithms.

IIUC, in Artificial Mysterious Intelligence, Yudkowsky seemed to be saying that the former would probably fail. OTOH, I don't know what kinds of NN algorithms were popular back in 2008, or exactly what NN algorithms Yudkowsky was referring to, so... *shrugs*.
If that were the case, I actually would fault Eliezer, at least a little. He's frequently, though by no means always, stuck to qualitative and hard-to-pin-down punditry like we see here, rather than to unambiguous forecasting. This allows him, or his defenders, to retroactively defend his predictions as somehow correct even when they seem wrong in hindsight.

Let's imagine for a moment that Eliezer's right that AI safety is a cosmically important issue, and yet that he's quite mistaken about all the technical details of how AGI will arise and how to effectively make it safe. It would be important to know whether we can trust his judgment and leadership. Without the ability to evaluate his performance, either by going with the most obvious interpretation of his qualitative judgments or by an unambiguous forecast, it's hard to evaluate him as an AI safety leader. Combine that with a culture of deference to perceived expertise and status, and the problem gets worse.

So I prioritize the avoidance of special pleading in this case: I think Eliezer comes across as clearly wrong in substance in this specific post, and that it's important not to reach for ways "he was actually right from a certain point of view" when evaluating his predictive accuracy.

Similarly, I wouldn't judge as correct the early COVID-19 pronouncements that masks don't work to stop the spread just because cloth masks are poor-to-ineffective and many people refuse to wear masks properly. There's a way we can stretch the interpretation to make them seem sort of right, but we shouldn't. We should expect public health messaging to be clearly right in substance, even if it's not making cut-and-dried unambiguous quantitative forecasts but is instead delivering qualitative judgments of efficacy.

None of that bears on how easy or hard it was to build GPT-4. It only bears on how we should evaluate Eliezer as a forecaster/pundit/AI safety leader.
2Alex Turner1d
I think several things here, considering the broader thread:

1. You've done a great job in communicating several reactions I also had:
   1. There are signs of serious mispredictions and mistakes in some of the 2008 posts.
   2. There are ways to read these posts as not that bad in hindsight, but we should be careful in giving too much benefit of the doubt.
   3. Overall these observations constitute important evidence on EY's alignment intuitions and ability to make qualitative AI predictions.
2. I did a bad job of marking my interpretations of what Eliezer wrote, as opposed to claiming he did dismiss ANNs. Hopefully my edits have fixed my mistakes.
2Alex Turner2d
Here's another attempt at one of my contentions.

Consider shard theory of human values. The point of shard theory is not "because humans do RL, and have nice properties, therefore AI + RL will have nice properties." The point is more "by critically examining RL + evidence from humans, I have hypotheses about the mechanistic load-bearing components of e.g. local-update credit assignment in a bounded-compute environment on certain kinds of sensory data; that these components lead to certain exploration/learning dynamics, which explain some portion of human values and experience. Let's test that and see if the generators are similar."

And my model of Eliezer shakes his head at the naivete of expecting complex human properties to reproduce outside of human minds themselves, because AI is not human.

But then I'm like "this other time you said 'AI is not human, stop expecting good property P from superficial similarities', you accidentally missed the modern AI revolution, right? Seems like there is some non-superficial mechanistic similarity/lessons here, and we shouldn't be so quick to assume that the brain's qualitative intelligence or alignment properties come from a huge number of evolutionarily-tuned details which are load-bearing and critical."
3David Johnston2d
Would you say Yudkowsky's views are a mischaracterisation of neural network proponents, or that he's mistaken about the power of loose analogies?
4Alex Turner2d
Neither.

1. I don't know what proponents were claiming when proponing neural networks. I do know that neural networks ended up working, big time.
2. I don't think loose analogies are powerful. I think they lead to sloppy thinking.

From discussion with Logan Riggs (Eleuther), who worked on the tuned lens: the tuned lens suggests that the residual stream at different layers goes through some linear transformations, so activations at different layers aren't directly comparable. This would interfere with a couple of methods for trying to understand neurons based on weights: 1) the embedding space view, and 2) calculating virtual weights between neurons in different layers.
However, we could try correcting these using the transformations learned by the tuned lens to translate between the residual stream at different layers, ... (read more)
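As a minimal sketch of what such a correction might look like (everything here is illustrative: random vectors stand in for real neuron weights, and `A_ij` stands in for a learned tuned-lens-style translator between layer bases, not the actual lens):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # residual stream width (illustrative)

# Hypothetical output weights of a neuron in layer i (writing to the residual
# stream) and input weights of a neuron in layer j > i (reading from it).
w_out_i = rng.normal(size=d)
w_in_j = rng.normal(size=d)

# Naive "virtual weight": implicitly assumes the residual stream basis is
# identical across layers.
naive_virtual = w_in_j @ w_out_i

# Corrected version: A_ij is a (made-up) learned linear map translating
# layer-i residual directions into the layer-j basis before composing.
A_ij = rng.normal(size=(d, d)) / np.sqrt(d)
corrected_virtual = w_in_j @ (A_ij @ w_out_i)
```

The point is just that the correction slots in as one extra matrix between the write and read directions; whether the tuned lens's learned maps are the right `A_ij` is exactly the open question.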

Okay, I think this makes sense. The idea is to re-interpret the various functions in the utility function as a single function, and to ask about a notion of complexity on that function which combines the complexity of producing a circuit which computes the function and the complexity of the circuit itself.

But just to check: is T over S×A×O→S? I thought T in utility functions only depended on states and actions, S×A→S? Maybe I am confused by what you mean by S. I thought it was the state space, but that isn't consistent with r in your post, which was defined over A×O→Q.

As a follow-up: defining r as depending on actions and observations instead of actions and states (as in, e.g., the definition of a POMDP on Wikipedia) seems like it changes things. So I'm not sure if you intended the rewards to correspond to the observations or to the 'underlying' states.

One more question, this one about the priors: what are they a prior over, exactly? I will use the letters/terms from the Wikipedia POMDP article to try to be explicit. Is the prior capturing the "set of conditional observation probabilities" (O on Wikipedia)? Or is it capturing the "set of conditional transition probabilities between states" (T on Wikipedia)? Or is it capturing a distribution over all possible T and O? Or are you imagining that T is defined with U (and is non-random) and O is defined within the prior?

I ask because the term DKL(ζ0||ζ) will be positive infinity if ζ is zero anywhere that ζ0 is non-zero. That makes the interpretation that it is either O or T directly pretty strange: for example, in the case where there are two states s1 and s2 and two observations o1 and o2, an O where P(si|oi)=1 and P(si|oj)=0 for i≠j would have a KL divergence of infinity from ζ0 if ζ0 had non-zero probability on P(s1|o2). So, I assume this is a prior over what the conditional obs

Maybe I am confused by what you mean by S. I thought it was the state space, but that isn't consistent with r in your post, which was defined over A×O→Q?

I'm not entirely sure what you mean by the state space. S is a state space associated specifically with the utility function. It has nothing to do with the state space of the environment. The reward function in the OP is , not . I slightly abused notation by defining  in the parent comment. Let's say it's  and  is... (read more)

When we're trying to do AI alignment, we're often studying systems which don't yet exist. This is a pretty weird epistemic activity, and seems really hard to get right. This post offers one frame for thinking about what we're actually doing when we're thinking about AI alignment: using parts of the space of maps to reason about parts of the space of intelligent systems.

In this post, we:

  • Introduce a simple model of the epistemic situation, and
  • Share some desiderata for maps useful for alignment.

We hope that the content is mostly the second kind of obvious: obvious once you see things in this way, which you maybe already do. In our experience, this comes with a risk: reading too fast, you may miss most of the nuance and useful insight...

Status: This was a response to a draft of Holden's cold take "AI safety seems hard to measure". It sparked a further discussion, that Holden recently posted a summary of.

The follow-up discussion ended up focusing on some issues in AI alignment that I think are underserved, which Holden said were kinda orthogonal to the point he was trying to make, and which didn't show up much in the final draft. I nevertheless think my notes were a fine attempt at articulating some open problems I see, from a different angle than usual. (Though it does have some overlap with the points made in Deep Deceptiveness, which I was also drafting at the time.)

I'm posting the document I wrote to Holden with only minimal editing, because it's been a few...

And if you block any one path to the insight that the earth is round, in a way that somehow fails to cripple it, then it will find another path later, because truths are interwoven. Tell one lie, and the truth is ever-after your enemy.

In case it's of any interest, I'll mention that when I "pump this intuition", I find myself thinking it essentially impossible to expect we could ever build a general agent that didn't notice that the world was round, and I'm unsure why (if I recall correctly) I sometimes read Nate or Eliezer write that they think it's quit... (read more)

This post presents five closely-related ways to achieve proof-based cooperation without using Löb's theorem, and muses on legible cooperation in the real world.

(Edit: maybe they're closer to just-use-Löb's-theorem than I originally thought! See this comment. If these constructions somehow work better, I'm more confused than before about why.)

I'm writing this as a follow-up to Andrew Critch's recent post, to share more of my perspective on the subject.

We're going to dive straight into the weeds. (I'm planning to also write a more accessible explainer post soon.)

The ideas

Idea #1: try to prove

I claim the following are sufficient for robust cooperation:

tries to prove that , and tries to prove . The reason this works is that can prove that , i.e. only cooperates in ways...

Something I'm now realizing, having written all these down: the core mechanism really does echo Löb's theorem! Gah, maybe these are more like Löb than I thought.

(My whole hope was to generalize to things that Löb's theorem doesn't! And maybe these ideas still do, but my story for why has broken, and I'm now confused.)
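For reference, the standard statement these constructions are being measured against (this is just textbook Löb, not anything specific to this post): for a theory T extending Peano Arithmetic, with □ its provability predicate,

```latex
% Löb's theorem: if T proves (Box P -> P), then T proves P.
% Internalized form, valid in the provability logic GL:
\Box(\Box P \to P) \to \Box P
```

The hope above is to get cooperation out of constructions whose correctness doesn't route through this implication.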

As something to ponder on, let me show you how we can prove Löb's theorem following the method of ideas #3 and #5:

  • is assumed
  • We consider the loop-cutter
  • We verify that if activates then must be true:
... (read more)