Sorted by New

Wiki Contributions


Ngo and Yudkowsky on AI capability gains

How do you then classify this as a counterexample rather than a "non-central (but still valid) manifestation of the theory"?

My only reply is "You know it when you see it." And yeah, a crackpot would reason the same way, but non-modest epistemology says that if it's obvious to you that you're not a crackpot then you have to operate on the assumption that you're not a crackpot. (In the alternative scenario, you won't have much impact anyway.) 

Specifically, the situation I mean is the following:

  • You have an epistemic track record like Eliezer or someone making lots of highly upvoted posts in our communities.
  • You find yourself having strong intuitions about how to apply powerful principles like "consequentialism" to new domains, and your intuitions are strong because it feels to you like you have a gears-level understanding that others lack. You trust your intuitions in cases like these.

My recommended policy in cases where this applies is "trust your intuitions and operate on the assumption that you're not a crackpot." 

Maybe there's a potential crux here about how much of scientific knowledge is dependent on successful predictions. In my view, the sequences have convincingly argued that locating the hypothesis in the first place is often done in the absence of already successful predictions, which goes to show that there's a core of "good reasoning" that lets you jump to (tentative) conclusions, or at least good guesses, much faster than if you were to try lots of things at random.

Ngo and Yudkowsky on AI capability gains

It also isn't clear to me that Eliezer has established the strong inferences he draws from noticing this general pattern ("expected utility theory/consequentialism"). But when you asked Eliezer (in the original dialogue) to give examples of successful predictions, I was thinking "No, that's not how these things work." In the mistaken applications of Grand Theories you mention (AGI and capitalism, AGI and symbolic logic, intelligence and ethics, recursive self-improvement and cultural evolution, etc.), the easiest way to point out why they are dumb is with counterexamples. We can quickly "see" the counterexamples. E.g., if you're trying to see AGI as the next step in capitalism, you'll be able to find counterexamples where things become altogether different (misaligned AI killing everything; singleton that brings an end to the need to compete). By contrast, if the theory fits, you'll find that whenever you try to construct such a counterexample, it is just a non-central (but still valid) manifestation of the theory. Eliezer would probably say that people who are good at this sort of thinking will quickly see how the skeptics' counterexamples fall relevantly short. 


The reason I remain a bit skeptical about Eliezer's general picture: I'm not sure if his thinking about AGI makes implicit questionable predictions about humans

  • I don't understand his thinking well enough to be confident that it doesn't
  • It seems to me that Eliezer_2011 placed weirdly strong emphasis on presenting humans in ways that matched the pattern "(scary) consequentialism always generalizes as you scale capabilities." I consider some of these claims false or at least would want to make the counterexamples more salient

For instance: 

  • Eliezer seemed to think that "extremely few things are worse than death" is something all philosophically sophisticated humans would agree with
  • Early writings on CEV seemed to emphasize things like the "psychological unity of humankind" and talk as though humans would mostly have the same motivational drives, also with respect to how it relates to "enjoying being agenty" as opposed to "grudgingly doing agenty things but wishing you could be done with your obligations faster"
  • In HPMOR all the characters are either not philosophically sophisticated or they were amped up into scary consequentialists plotting all the time

All of the above could be totally innocent matters of wanting to emphasize the thing that other commenters were missing, so they aren't necessarily indicative of overlooking certain possibilities. Still, the pattern there makes me wonder if maybe Eliezer hasn't spent a lot of time imagining what sorts of motivations humans can have that make them benign not in terms outcome-related ethics (what they want the world to look like), but relational ethics (who they want to respect or assist, what sort of role model they want to follow). It makes me wonder if it's really true that when you try to train an AI to be helpful and corrigible, the "consequentialism-wants-to-become-agenty-with-its-own-goals part" will be stronger than the "helping this person feels meaningful" part. (Leading to an agent that's consequentialist about following proper cognition rather than about other world-outcomes.) 

FWIW I think I mostly share Eliezer's intuitions about the arguments where he makes them; I just feel like I lack the part of his picture that lets him discount the observation that some humans are interpersonally corrigible and not all that focused on other explicit goals, and that maybe this means corrigibility has a crisp/natural shape after all. 

Discussion with Eliezer Yudkowsky on AGI interventions

I share the impression that the agent foundations research agenda seemed not that important. But that point doesn't feel sufficient to argue that Eliezer's pessimism about the current state of alignment research is just a face-saving strategy his brain tricked him into adopting. (I'm not saying you claimed that it is sufficient; probably a lot of other data points are factoring into your judgment.) MIRI have deprioritized agent foundations research for quite a while now. I also just think it's extremely common for people to have periods where they work on research that eventually turns out to be not that important; the interesting thing is to see what happens when that becomes more apparent. I immediately trust people more if I see that they are capable of pivoting and owning up to past mistakes, and I could imagine that MIRI deserves a passing grade on this, even though I also have to say that I don't know how exactly they nowadays think about prioritization in 2017 and earlier.

I really like Vaniver's comment further below:

For what it's worth, my sense is that EY's track record is best in 1) identifying problems and 2) understanding the structure of the alignment problem.

And, like, I think it is possible that you end up in situations where the people who understand the situation best end up the most pessimistic about it.

I'm very far away from confident that Eliezer's pessimism is right, but it seems plausible to me. Of course, some people might be in the epistemic position of having tried to hash out that particular disagreement on the object level and have concluded that Eliezer's pessimism is misguided – I can't comment on that. I'm just saying that based on what I've read, which is pretty much every post and comment on AI alignment on LW and the EA forum, I don't get the impression that Eliezer's pessimism is clearly unfounded.

Everyone's views look like they are suspiciously shaped to put themselves and their efforts into a good light. If someone believed that their work isn't important or their strengths aren't very useful, they wouldn't do the work and wouldn't cultivate the strengths. That applies to Eliezer, but it also applies to the people who think alignment will likely be easy. I feel like people in the latter group would likely be inconvenienced (in terms of the usefulness of their personal strengths or the connections they've built in the AI industry, or past work they've done), too, if it turned out not to be.

Just to give an example on the sorts of observations that make me think Eliezer/"MIRI" could have a point: 

  • I don't know what happened with a bunch of safety people leaving OpenAI but it's at least possible to me that it involved some people having had negative updates on the feasibility of a certain type of strategy that Eliezer criticized early on here. (I might be totally wrong about this interpretation because I haven't talked to anyone involved.)
  • I thought it was interesting when Paul noted that our civilization's Covid response was a negative update for him on the feasibility of AI alignment. Kudos to him for noting the update, but also: Isn't that exactly the sort of misprediction one shouldn't be making if one confidently thinks alignment is likely to succeed? (That said, my sense is that Paul isn't even at the most optimistic end of people in the alignment community.)
  • A lot of the work in the arguments for alignment being easy seems to me to be done by dubious analogies that assume that AI alignment is relevantly similar to risky technologies that we've already successfully invented. People seem insufficiently quick to get to the actual crux with MIRI, which makes me think they might not be great at passing the Ideological Turing Test. When we get to the actual crux, it's somewhere deep inside the domain of predicting the training conditions for AGI, which feels like the sort of thing Eliezer might be good at thinking about. Other people might also be good at thinking about this, but then why do they often start their argument with dubious analogies to past technologies that seem to miss the point?
    [Edit: I may be strawmanning some people here. I have seen direct discussions about the likelihood of treacherous turns vs. repeated early warnings of alignment failure. I didn't have a strong opinion either way, but it's totally possible that some people feel like they understand the argument and confidently disagree with Eliezer's view there.]
A theory of human values

Leaning on this, someone could write a post about the "infectiousness of realism" since it might be hard to reconcile openness to non-zero probabilities of realism with anti-realist frameworks? :P

For people who believe their actions matter infinitely more if realism is true, this could be modeled as an overriding meta-preference to act as though realism is true. Unfortunately if realism isn't true this could go in all kinds of directions depending on how the helpful AI system would expect to get into such a judged-to-be-wrong epistemic state.

Probably you were thinking of something like teaching AIs metaphilosophy in order to perhaps improve the procedure? This would be the main alternative I see, and it does feel more robust. I am wondering though whether we'll know by that point whether we've found the right way to do metaphilosophy (and how approaching that question is different from approaching whichever procedures philosophically sophisticated people would pick to settle open issues in something like the above proposals). It seems like there has to come a point where one has to hand off control to some in-advance specified "metaethical framework" or reflection procedure, and judged from my (historically overconfidence-prone) epistemic state it doesn't feel obvious why something like Stuart's anti-realism isn't already close to there (though I'd say there are many open questions and I'd feel extremely unsure about how to proceed regarding for instance "2. A method for synthesising such basic preferences into a single utility function or similar object," and also to some extent about the premise of squeezing a utility function out of basic preferences absent meta-preferences for doing that). Adding layers of caution sounds good though as long as they don't complicate things enough to introduce large new risks.

Will humans build goal-directed agents?
Suppose the agent you're trying to imitate is itself goal-directed. In order for the imitator to generalize beyond its training distribution, it seemingly has to learn to become goal-directed (i.e., perform the same sort of computations that a goal-directed agent would). I don't see how else it can predict what the goal-directed agent would do in a novel situation. If the imitator is not able to generalize, then it seems more tool-like than agent-like. On the other hand, if the imitatee is not goal-directed... I guess the agent could imitate humans and be not entirely goal-directed to the extent that humans are not entirely goal-directed. (Is this the point you're trying to make, or are you saying that an imitation of a goal-directed agent would constitute a non-goal-directed agent?)

I'm not sure these are the points Rohin was trying to make, but there seem to be at least two important points here:

  • Imitation learning applied to humans produces goal-directed behavior only insofar humans are goal-directed
  • Imitation learning applied to humans produces agents no more capable than humans. (I think IDA goes beyond this by adding amplification steps, which are separate. And IRL goes beyond this by trying to correct "errors" that the humans make.)

Regarding the second point, there's a safety-relevant sense in which a human-imitating agent is less goal-directed than the human. Because if you scale the human's capabilities, the human will become better at achieving its personal objectives. By contrast, if you scale the imitator's capabilities, it's only supposed to become even better at imitating the unscaled human.