  • If the Thought Assessors converge to 100% accuracy in predicting the reward that will result from a plan, then a plan to wirehead (hack into the Steering Subsystem and set reward to infinity) would seem very appealing, and the agent would do it.
  • If the Thought Assessors don’t converge to 100% accuracy in predicting the reward that will result from a plan, then that’s the very definition of inner misalignment!


    The thought “I will secretly hack into my own Steering Subsystem” is almost certainly not aligned with the designer’s intention. So a credit-assignment update that assigns more positive valence to “I will secretly hack into my own Steering Subsystem” is a bad update. We don’t want it. Does it increase “inner alignment”? I think we have to say “yes it does”, because it leads to better reward predictions! But I don’t care. I still don’t want it. It’s bad bad bad. We need to figure out how to prevent that particular credit-assignment Thought Assessor update from happening.


    I think there’s a broader lesson here. I think “outer alignment versus inner alignment” is an excellent starting point for thinking about the alignment problem. But that doesn’t mean we should expect one solution to outer alignment, and a different unrelated solution to inner alignment. Some things—particularly interpretability—cut through both outer and inner layers, creating a direct bridge from the designer’s intentions to the AGI’s goals. We should be eagerly searching for things like that.

Yeah, there definitely seems to be something off about that categorization. I've thought a bit about how this stuff works in humans, particularly in this post of my moral anti-realism sequence. To give some quotes from that:

One of many takeaways I got from reading Kaj Sotala’s multi-agent models of mind sequence (as well as comments by him) is that we can model people as pursuers of deep-seated needs. In particular, we have subsystems (or “subagents”) in our minds devoted to various needs-meeting strategies. The subsystems contribute behavioral strategies and responses to help maneuver us toward states where our brain predicts our needs will be satisfied. We can view many of our beliefs, emotional reactions, and even our self-concept/identity as part of this set of strategies. Like life plans, life goals are “merely” components of people’s needs-meeting machinery.[8]

Still, as far as components of needs-meeting machinery go, life goals are pretty unusual. Having life goals means to care about an objective enough to (do one’s best to) disentangle success on it from the reasons we adopted said objective in the first place. The objective takes on a life of its own, and the two aims (meeting one’s needs vs. progressing toward the objective) come apart. Having a life goal means having a particular kind of mental organization so that “we” – particularly the rational, planning parts of our brain – come to identify with the goal more so than with our human needs.[9]


There’s a normative component to something as mundane as choosing leisure activities. [E.g., going skiing in the cold, or spending the weekend cozily at home.] In the weekend example, I’m not just trying to assess the answer to empirical questions like “Which activity would contain fewer seconds of suffering/happiness” or “Which activity would provide me with lasting happy memories.” I probably already know the answer to those questions. What’s difficult about deciding is that some of my internal motivations conflict. For example, is it more important to be comfortable, or do I want to lead an active life? When I make up my mind in these dilemma situations, I tend to reframe my options until the decision seems straightforward. I know I’ve found the right decision when there’s no lingering fear that the currently-favored option wouldn’t be mine, no fear that I’m caving to social pressures or acting (too much) out of akrasia, impulsivity or some other perceived weakness of character.[21]

We tend to have a lot of freedom in how we frame our decision options. We use this freedom, this reframing capacity, to become comfortable with the choices we are about to make. In case skiing wins out, then “warm and cozy” becomes “lazy and boring,” and “cold and tired” becomes “an opportunity to train resilience / apply Stoicism.” This reframing ability is a double-edged sword: it enables rationalizing, but it also allows us to stick to our beliefs and values when we’re facing temptations and other difficulties.


Visualizing the future with one life goal vs. another

Whether a given motivational pull – such as the need for adventure, or (e.g.,) the desire to have children – is a bias or a fundamental value is not set in stone; it depends on our other motivational pulls and the overarching self-concept we’ve formed.

Lastly, we also use “planning mode” to choose between life goals. A life goal is a part of our identity – just like one’s career or lifestyle (but it’s even more serious).

We can frame choosing between life goals as choosing between “My future with life goal A” and “My future with life goal B” (or “My future without a life goal”). (Note how this is relevantly similar to “My future on career path A” and “My future on career path B.”)


It’s important to note that choosing a life goal doesn’t necessarily mean that we predict ourselves to have the highest life satisfaction (let alone the most increased moment-to-moment well-being) with that life goal in the future. Instead, it means that we feel the most satisfied about the particular decision (to adopt the life goal) in the present, when we commit to the given plan, thinking about our future. Life goals inspired by moral considerations (e.g., altruism inspired by Peter Singer’s drowning child argument) are appealing despite their demandingness – they can provide a sense of purpose and responsibility.

So, it seems like we don't want "perfect inner alignment," at least not if inner alignment is about accurately predicting reward and then forming the plan of doing what gives you most reward. Also, there's a concept of "lock in" or "identifying more with the long-term planning part of your brain than with the underlying needs-meeting machinery." Lock in can be dangerous (if you lock in something that isn't automatically corrigible), but it might also be dangerous not to lock in anything (because this means you don't know what other goals form later on).
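To make the point about "perfect inner alignment" concrete, here is a toy sketch. Everything in it is hypothetical and only for illustration: the plan names, the reward numbers, and the `choose_plan` helper are my own assumptions, not part of any actual brain-like AGI design. It only shows that a planner that maximizes perfectly predicted reward picks wireheading, while one with "inaccurate" learned valences may not.

```python
import math

# Hypothetical reward the Steering Subsystem would actually emit for each plan.
TRUE_REWARD = {
    "help_designer": 1.0,
    "do_nothing": 0.0,
    "wirehead": math.inf,  # hack the reward signal itself
}

# Hypothetical learned valences that never received the credit-assignment
# update in favor of wireheading (i.e., "imperfect" reward prediction).
LEARNED_VALENCE = {
    "help_designer": 0.9,
    "do_nothing": 0.1,
    "wirehead": -0.5,
}

def choose_plan(assessed_values):
    """Pick the plan whose assessed value is highest."""
    return max(assessed_values, key=assessed_values.get)

perfect_choice = choose_plan(TRUE_REWARD)        # planner with 100% accurate predictions
imperfect_choice = choose_plan(LEARNED_VALENCE)  # planner with "misaligned" valences
print(perfect_choice, imperfect_choice)
```

In this toy setup the "inner-misaligned" planner is the one that behaves as intended, which is the sense in which accuracy of reward prediction seems like the wrong target.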

Idk, the whole thing seems to me like brewing a potion in Harry Potter, except that you don't have a recipe book and there's luck involved, too. "Outer alignment," or at least a minimally sufficient degree thereof (as in: the agent tends to get rewards when it takes actions towards the intended goal), increases the likelihood that you get broadly pointed in the right direction, so the intended goal maybe gets considered among things the internal planner considers reinforcing itself around / orienting itself towards. But then, whether the intended goal gets picked over other alternatives (instrumental requirements for general intelligence, or alien motivations the AI might initially have), who knows. Like with raising a child, sometimes they turn out the way the parents intend, sometimes not at all. There's probably a science to finding out how the intended outcomes become more likely, but even if we could do that with human children developing into adults with fixed identities, there's then still the question of how to find analogous patterns in (brain-like) AI. Tough job.

As you allude to by discussing shards for cooperative tendencies, the Shard Theory approach seems relevant for intent alignment too, not just value alignment. (For value alignment, the relevance of humans as an example is “How did human values evolve despite natural selection optimizing for something different and more crude?” For intent alignment, the relevance is “How come some humans exhibit genuinely prosocial motivations and high integrity despite not sharing the exact same goals as others?”)

Studying the conditions for the evolution of genuinely prosocial motivations seems promising to me.

By “prosocial motivations,” I mean something like “trying to be helpful and cooperative” at least in situations where this is “low cost.” (In this sense, classical utilitarians with prosocial motivations are generally safe to be around even for those of us who don’t want to be replaced by hedonium.)

We can make some interesting observations on prosocial motivations in humans:

  • Due to Elephant in the Brain issues, an aspiration to be prosocial isn't always enough to generate prosociality as a virtue in the way that counts. Something like high metacognition + commitment to high integrity seems required as well.
  • Not all people have genuinely prosocial motivations.
  • People who differ from each other on prosocial motivations (and metacognition and integrity) seem to fall into "surprisingly" distinct clusters.

By the last bullet point, I mean that it seems plausible that we can learn a lot about someone's character even in situations that are obviously "a test." E.g., the best venture capitalists don't often fall prey to charlatan founders. Paul Graham writes about his wife Jessica Livingston:

I'm better at some things than Jessica, and she's better at some things than me. One of the things she's best at is judging people. She's one of those rare individuals with x-ray vision for character. She can see through any kind of faker almost immediately. Her nickname within YC was the Social Radar, and this special power of hers was critical in making YC what it is. The earlier you pick startups, the more you're picking the founders. Later stage investors get to try products and look at growth numbers. At the stage where YC invests, there is often neither a product nor any numbers.

If Graham is correct about his wife's ability, this means that people with "shady character" sometimes fail in test situations specifically due to their character – which is strange because you'd expect that the rational strategy in these situations is "act as though you had good character."

In humans, "perfect psychopaths" arguably don't exist. That is, people without genuinely prosocial motivations, even when they're highly intelligent, don't behave the same as genuinely prosocial people in 99.9% of situations while saving their deceitful actions for the most high-stakes situations. Instead, it seems likely that they can't help but behave in subtly suspicious ways even in situations where they're able to guess that judges are trying to assess their character.

From the perspective of Shard Theory's approach, it seems interesting to ask "Why is this?"

My take (inspired by a lot of armchair psychology and – even worse – armchair evolutionary psychology) is the following:

  • Asymmetric behavioral strategies: Even in "test situations" where the time and means for evaluation are limited (e.g., trial tasks followed by lengthy interviews), people can convey a lot of relevant information through speech. Honest strategies have some asymmetric benefits ("words aren't cheap"). (The term "asymmetric behavioral strategies" is inspired by this comment on "asymmetric tools.")
    • Pointing out others’ good qualities.
      • People who consistently praise others for their good qualities, even in situations where this isn’t socially advantageous, credibly signal that they don’t apply a zero-sum mindset to social situations.
    • Making oneself transparent (includes sharing disfavorable information).
      • People who consistently tell others why they behave in certain ways, make certain decisions, or hold specific views, present a clearer picture of themselves. Others can then check that picture for consistency. The more readily one shares information, the harder it would be to keep lies consistent. The habit of proactive transparency also sets up a precedent: it makes it harder to suddenly shift to intransparency later on, at one’s convenience.
      • Pointing out one’s hidden negative qualities. One subcategory of “making oneself transparent” is when people disclose personal shortcomings even in situations where they would have been unlikely to otherwise come up. In doing so, they credibly signal that they don’t need to oversell themselves in order to gain others’ appreciation. The more openly someone discloses their imperfections, the more their honest intent and their genuine competencies will shine through.
    • Handling difficult interpersonal conversations on private, prosocial emotions.
      • People who don’t shy away from difficult interpersonal conversations (e.g., owning up to one’s mistakes and trying to resolve conflicts) can display emotional depth and maturity as well as an ability to be vulnerable. Difficult interpersonal conversations thereby serve as a fairly reliable signal of someone’s moral character (especially in real-time without practice and rehearsing) because vulnerability is hard to fake for people who aren’t in touch with emotions like guilt and shame, or are incapable of feeling them. For instance, pathological narcissists tend to lack insight into their negative emotions, whereas psychopaths lack certain prosocial emotions entirely. If people with those traits nonetheless attempt to have difficult interpersonal conversations, they risk being unmasked. (Analogy: someone who lacks a sense of smell will be unmasked when talking about the intricacies of perfumery, even if they've practiced faking it.)
    • Any individual signal can be faked. A skilled manipulator will definitely go out of their way to fake prosocial signals or cleverly spin up ambiguities in how to interpret past events. To tell whether a person is manipulative, I recommend giving relatively little weight to single examples of their behavior and focusing on the character qualities that show up the most consistently.
  • Developmental constraints: The way evolution works, mind designs "cannot go back to the drawing board" – single mutations cannot alter too many things at once without badly messing up the resulting design.
    • For instance, manipulators get better at manipulating if they have a psychology of the sort (e.g.) "high approach seeking, low sensitivity to punishment." Developmental constraint: People cannot alter their dispositions at will.
    • People who self-deceive become more credible liars. Developmental tradeoff: Once you self-deceive, you can no longer go back and "unroll" what you've done.
    • Some people's emotions might have evolved to be credible signals, making people "irrationally" interpersonally vulnerable (e.g., disposition to be fearful and anxious) or "irrationally" affected by others' discomfort (e.g., high affective empathy). Developmental constraint: Faking emotions you don't have is challenging even for skilled manipulators.
    • Different niches / life history strategies: Deceptive strategies seem to be optimized for different niches (at least in some cases). For instance, I've found that we can tell a lot about the character of men by looking at their romantic preferences. (E.g., if someone seeks out shallow relationship after shallow relationship and doesn't seem to want "more depth," that can be a yellow flag. It becomes a red flag if they're not honest about their motivations for the relationship and if they prefer to keep the connection shallow even though the other person would want more depth.)
  • "No man's land" in fitness gradients: In the ancestral environment, asymmetric tools + developmental constraints + inter-species selection pressure for character (neither too weak, nor too strong) produced fitness gradients that steer towards attractors of either high honesty or high deceitfulness. From a fitness perspective, it sucks to "practice" both extremes of genuine honesty and dishonesty in the same phenotype because the strategies home in on different sides of various developmental tradeoffs. (And there are enough poor judges of character that dishonest phenotypes can mostly focus on niches where they attain high reward somewhat easily, so they don't have to constantly expose themselves to the highest selection pressures for getting unmasked.)
  • Capabilities constraints (relative to the capabilities of competent judges): People who find themselves with the deceitful phenotype cannot bridge the gap and learn to act the exact same way a prosocial actor would act (but they can fool incompetent judges or competent judges who face time-constraints or information-constraints). This is a limitation of capabilities: it would be different if people were more skilled learners and had better control over their psychology.

In the context of training TAI systems, we could attempt to recreate these conditions and select for integrity and prosocial motivations. One difficulty here lies in recreating the right "developmental constraints" and in keeping a balance in the relative capabilities between judges and to-be-evaluated agents. (Humans presumably went through an evolutionary arms race related to assessing each other's competence and character, which means that people were always surrounded by judges of similar intelligence.)

Lastly, there's a problem where, if you dial up capabilities too much, it becomes increasingly easy to "fake everything." (For the reasons Ajeya explains in her account of deceptive alignment.)

(If anyone is interested in doing research on the evolution of prosociality vs antisocialness in humans and/or how these things might play out in AI training environments, I know people who would likely be interested in funding such work.)

How do you then classify this as a counterexample rather than a "non-central (but still valid) manifestation of the theory"?

My only reply is "You know it when you see it." And yeah, a crackpot would reason the same way, but non-modest epistemology says that if it's obvious to you that you're not a crackpot then you have to operate on the assumption that you're not a crackpot. (In the alternative scenario, you won't have much impact anyway.) 

Specifically, the situation I mean is the following:

  • You have an epistemic track record like Eliezer or someone making lots of highly upvoted posts in our communities.
  • You find yourself having strong intuitions about how to apply powerful principles like "consequentialism" to new domains, and your intuitions are strong because it feels to you like you have a gears-level understanding that others lack. You trust your intuitions in cases like these.

My recommended policy in cases where this applies is "trust your intuitions and operate on the assumption that you're not a crackpot." 

Maybe there's a potential crux here about how much of scientific knowledge is dependent on successful predictions. In my view, the sequences have convincingly argued that locating the hypothesis in the first place is often done in the absence of already successful predictions, which goes to show that there's a core of "good reasoning" that lets you jump to (tentative) conclusions, or at least good guesses, much faster than if you were to try lots of things at random.

It also isn't clear to me that Eliezer has established the strong inferences he draws from noticing this general pattern ("expected utility theory/consequentialism"). But when you asked Eliezer (in the original dialogue) to give examples of successful predictions, I was thinking "No, that's not how these things work." In the mistaken applications of Grand Theories you mention (AGI and capitalism, AGI and symbolic logic, intelligence and ethics, recursive self-improvement and cultural evolution, etc.), the easiest way to point out why they are dumb is with counterexamples. We can quickly "see" the counterexamples. E.g., if you're trying to see AGI as the next step in capitalism, you'll be able to find counterexamples where things become altogether different (misaligned AI killing everything; singleton that brings an end to the need to compete). By contrast, if the theory fits, you'll find that whenever you try to construct such a counterexample, it is just a non-central (but still valid) manifestation of the theory. Eliezer would probably say that people who are good at this sort of thinking will quickly see how the skeptics' counterexamples fall relevantly short. 


The reason I remain a bit skeptical about Eliezer's general picture: I'm not sure if his thinking about AGI makes implicit questionable predictions about humans.

  • I don't understand his thinking well enough to be confident that it doesn't
  • It seems to me that Eliezer_2011 placed weirdly strong emphasis on presenting humans in ways that matched the pattern "(scary) consequentialism always generalizes as you scale capabilities." I consider some of these claims false or at least would want to make the counterexamples more salient

For instance: 

  • Eliezer seemed to think that "extremely few things are worse than death" is something all philosophically sophisticated humans would agree with
  • Early writings on CEV seemed to emphasize things like the "psychological unity of humankind" and talk as though humans would mostly have the same motivational drives, also with respect to how it relates to "enjoying being agenty" as opposed to "grudgingly doing agenty things but wishing you could be done with your obligations faster"
  • In HPMOR all the characters are either not philosophically sophisticated or they were amped up into scary consequentialists plotting all the time

All of the above could be totally innocent matters of wanting to emphasize the thing that other commenters were missing, so they aren't necessarily indicative of overlooking certain possibilities. Still, the pattern there makes me wonder if maybe Eliezer hasn't spent a lot of time imagining what sorts of motivations humans can have that make them benign not in terms of outcome-related ethics (what they want the world to look like), but relational ethics (who they want to respect or assist, what sort of role model they want to follow). It makes me wonder if it's really true that when you try to train an AI to be helpful and corrigible, the "consequentialism-wants-to-become-agenty-with-its-own-goals part" will be stronger than the "helping this person feels meaningful" part. (Leading to an agent that's consequentialist about following proper cognition rather than about other world-outcomes.)

FWIW I think I mostly share Eliezer's intuitions about the arguments where he makes them; I just feel like I lack the part of his picture that lets him discount the observation that some humans are interpersonally corrigible and not all that focused on other explicit goals, and that maybe this means corrigibility has a crisp/natural shape after all. 

I share the impression that the agent foundations research agenda seemed not that important. But that point doesn't feel sufficient to argue that Eliezer's pessimism about the current state of alignment research is just a face-saving strategy his brain tricked him into adopting. (I'm not saying you claimed that it is sufficient; probably a lot of other data points are factoring into your judgment.) MIRI have deprioritized agent foundations research for quite a while now. I also just think it's extremely common for people to have periods where they work on research that eventually turns out to be not that important; the interesting thing is to see what happens when that becomes more apparent. I immediately trust people more if I see that they are capable of pivoting and owning up to past mistakes, and I could imagine that MIRI deserves a passing grade on this, even though I also have to say that I don't know exactly how they think nowadays about their prioritization in 2017 and earlier.

I really like Vaniver's comment further below:

For what it's worth, my sense is that EY's track record is best in 1) identifying problems and 2) understanding the structure of the alignment problem.

And, like, I think it is possible that you end up in situations where the people who understand the situation best end up the most pessimistic about it.

I'm very far away from confident that Eliezer's pessimism is right, but it seems plausible to me. Of course, some people might be in the epistemic position of having tried to hash out that particular disagreement on the object level and have concluded that Eliezer's pessimism is misguided – I can't comment on that. I'm just saying that based on what I've read, which is pretty much every post and comment on AI alignment on LW and the EA forum, I don't get the impression that Eliezer's pessimism is clearly unfounded.

Everyone's views look like they are suspiciously shaped to put themselves and their efforts into a good light. If someone believed that their work isn't important or their strengths aren't very useful, they wouldn't do the work and wouldn't cultivate the strengths. That applies to Eliezer, but it also applies to the people who think alignment will likely be easy. I feel like people in the latter group would likely be inconvenienced (in terms of the usefulness of their personal strengths or the connections they've built in the AI industry, or past work they've done), too, if it turned out not to be.

Just to give an example on the sorts of observations that make me think Eliezer/"MIRI" could have a point: 

  • I don't know what happened with a bunch of safety people leaving OpenAI but it's at least possible to me that it involved some people having had negative updates on the feasibility of a certain type of strategy that Eliezer criticized early on here. (I might be totally wrong about this interpretation because I haven't talked to anyone involved.)
  • I thought it was interesting when Paul noted that our civilization's Covid response was a negative update for him on the feasibility of AI alignment. Kudos to him for noting the update, but also: Isn't that exactly the sort of misprediction one shouldn't be making if one confidently thinks alignment is likely to succeed? (That said, my sense is that Paul isn't even at the most optimistic end of people in the alignment community.)
  • A lot of the work in the arguments for alignment being easy seems to me to be done by dubious analogies that assume that AI alignment is relevantly similar to risky technologies that we've already successfully invented. People seem insufficiently quick to get to the actual crux with MIRI, which makes me think they might not be great at passing the Ideological Turing Test. When we get to the actual crux, it's somewhere deep inside the domain of predicting the training conditions for AGI, which feels like the sort of thing Eliezer might be good at thinking about. Other people might also be good at thinking about this, but then why do they often start their argument with dubious analogies to past technologies that seem to miss the point?
    [Edit: I may be strawmanning some people here. I have seen direct discussions about the likelihood of treacherous turns vs. repeated early warnings of alignment failure. I didn't have a strong opinion either way, but it's totally possible that some people feel like they understand the argument and confidently disagree with Eliezer's view there.]

Leaning on this, someone could write a post about the "infectiousness of realism" since it might be hard to reconcile openness to non-zero probabilities of realism with anti-realist frameworks? :P

For people who believe their actions matter infinitely more if realism is true, this could be modeled as an overriding meta-preference to act as though realism is true. Unfortunately if realism isn't true this could go in all kinds of directions depending on how the helpful AI system would expect to get into such a judged-to-be-wrong epistemic state.
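As a rough illustration of how "my actions matter vastly more if realism is true" turns into an overriding meta-preference, here is a minimal expected-value sketch. The action names, payoffs, and the `stakes_multiplier` parameter are all made-up assumptions for the sake of the example, and a large finite multiplier stands in for "infinitely more."

```python
# Each hypothetical action maps to (value if realism is true, value if it is false).
ACTIONS = {
    "act_as_if_realism": (1.0, 0.2),
    "act_on_antirealist_preferences": (0.0, 1.0),
}

def expected_value(payoffs, p_realism, stakes_multiplier):
    """Expected value when outcomes under realism count `stakes_multiplier` times as much."""
    value_if_realism, value_if_antirealism = payoffs
    return (p_realism * stakes_multiplier * value_if_realism
            + (1.0 - p_realism) * value_if_antirealism)

def best_action(p_realism, stakes_multiplier):
    """Return the name of the action with the highest expected value."""
    return max(
        ACTIONS,
        key=lambda name: expected_value(ACTIONS[name], p_realism, stakes_multiplier),
    )

# With equal stakes, a 1% credence in realism barely matters.
print(best_action(p_realism=0.01, stakes_multiplier=1.0))
# With hugely inflated stakes, the same 1% credence dominates the decision.
print(best_action(p_realism=0.01, stakes_multiplier=1e6))
```

The sketch says nothing about what "acting as though realism is true" would actually involve if realism is false, which is where things could go in all kinds of directions.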

Probably you were thinking of something like teaching AIs metaphilosophy in order to perhaps improve the procedure? This would be the main alternative I see, and it does feel more robust. I am wondering though whether we'll know by that point whether we've found the right way to do metaphilosophy (and how approaching that question is different from approaching whichever procedures philosophically sophisticated people would pick to settle open issues in something like the above proposals). It seems like there has to come a point where one has to hand off control to some in-advance specified "metaethical framework" or reflection procedure, and judged from my (historically overconfidence-prone) epistemic state it doesn't feel obvious why something like Stuart's anti-realism isn't already close to there (though I'd say there are many open questions and I'd feel extremely unsure about how to proceed regarding for instance "2. A method for synthesising such basic preferences into a single utility function or similar object," and also to some extent about the premise of squeezing a utility function out of basic preferences absent meta-preferences for doing that). Adding layers of caution sounds good though as long as they don't complicate things enough to introduce large new risks.

Suppose the agent you're trying to imitate is itself goal-directed. In order for the imitator to generalize beyond its training distribution, it seemingly has to learn to become goal-directed (i.e., perform the same sort of computations that a goal-directed agent would). I don't see how else it can predict what the goal-directed agent would do in a novel situation. If the imitator is not able to generalize, then it seems more tool-like than agent-like. On the other hand, if the imitatee is not goal-directed... I guess the agent could imitate humans and be not entirely goal-directed to the extent that humans are not entirely goal-directed. (Is this the point you're trying to make, or are you saying that an imitation of a goal-directed agent would constitute a non-goal-directed agent?)

I'm not sure these are the points Rohin was trying to make, but there seem to be at least two important points here:

  • Imitation learning applied to humans produces goal-directed behavior only insofar as humans are goal-directed
  • Imitation learning applied to humans produces agents no more capable than humans. (I think IDA goes beyond this by adding amplification steps, which are separate. And IRL goes beyond this by trying to correct "errors" that the humans make.)

Regarding the second point, there's a safety-relevant sense in which a human-imitating agent is less goal-directed than the human. Because if you scale the human's capabilities, the human will become better at achieving its personal objectives. By contrast, if you scale the imitator's capabilities, it's only supposed to become even better at imitating the unscaled human.
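A toy numerical sketch of that contrast, under heavily simplified assumptions of my own: the "outcome" is a single number, a scaled human pushes it up in proportion to capability, and a scaled imitator only shrinks its error relative to the unscaled human. The function names and the error model are hypothetical.

```python
BASE_HUMAN_OUTCOME = 1.0  # what the unscaled human achieves on their own objective

def scaled_human_outcome(scale):
    """A goal-directed human uses extra capability to push their objective further."""
    return BASE_HUMAN_OUTCOME * scale

def scaled_imitator_outcome(scale):
    """An imitator uses extra capability only to match the unscaled human more closely."""
    imitation_error = 1.0 / (1.0 + scale)  # hypothetical error model, shrinks with scale
    return BASE_HUMAN_OUTCOME * (1.0 - imitation_error)

for scale in (1, 10, 1000):
    print(scale, scaled_human_outcome(scale), round(scaled_imitator_outcome(scale), 4))
```

In this sketch the scaled human's outcome grows without bound, while the imitator's outcome approaches the unscaled human's outcome and never exceeds it.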