All of Wei_Dai's Comments + Replies

Another (outer) alignment failure story

This is fuzzier if you can’t tell the difference between deliberation and manipulation. If I define idealized deliberation as an individual activity then I can talk about the extent to which M leads to deviation from idealized deliberation, but it’s probably more accurate to think of idealized deliberation as a collective activity.

How will your AI compute "the extent to which M leads to deviation from idealized deliberation"? (I'm particularly confused because this seems pretty close to what I guessed earlier and seems to face similar problems, but you ... (read more)

Paul Christiano (+2, 10d): I think I misunderstood what kind of attack you were talking about. I thought you were imagining humans being subject to attack while going about their ordinary business (i.e. while trying to satisfy goals other than moral reflection), but it sounds like in the recent comments you are imagining cases where humans are trying to collaboratively answer hard questions (e.g. about what's right), some of them may sabotage the process, and none of them are able to answer the question on their own and so can't avoid relying on untrusted data from other humans.

I don't feel like this is going to overlap too much with the story in the OP, since it takes place over a very small amount of calendar time---we're not trying to do lots of moral deliberation during the story itself, we're trying to defer moral deliberation until after the singularity (by decoupling it from rapid physical/technological progress), and so the action you are wondering about would have happened after the story ended happily. There are still kinds of attacks that are still important (namely those that prevent humans from surviving through to the singularity).

Similarly it seems like your description of "go in an info bubble" is not really appropriate for this kind of attack---wouldn't it be more natural to say "tell your AI not to treat untrusted data as evidence about what is good, and try to rely on carefully chosen data for making novel moral progress."

So in that light, I basically want to decouple your concern into two parts:

1. Will collaborative moral deliberation actually "freeze" during this scary phase, or will people e.g. keep arguing on the internet and instruct their AI that it shouldn't protect them from potential manipulation driven by those interactions?
2. Will human communities be able to recover mutual trust after the singularity in this story?

I feel more concerned about #1. I'm not sure where you are at. I was saying that I think it's better to directly look at…
Another (outer) alignment failure story

Trying to imagine myself how an automated filter might work, here's a possible "solution" I came up with. Perhaps your AI maintains a model / probability distribution of things that an uncompromised Wei might naturally say, and flags anything outside or on the fringes of that distribution as potential evidence that I've been compromised by an AI-powered attack and am now trying to attack you. (I'm talking in binary terms of "compromised" and "uncompromised" for simplicity but of course it will be more complicated than that in reality.)

Is this close to what... (read more)
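To make the kind of filter hypothesized above concrete, here is a minimal illustrative sketch (an assumption-laden toy, not something proposed in this thread): score each incoming message under a model of what the uncompromised sender would naturally say, and flag messages far out on the fringes of that distribution. The sender model, tokenization, and threshold are all hypothetical.

```python
from typing import Callable, List

# Assumed: sender_logprob(prefix, token) returns log P(token | prefix) under a
# model of "things an uncompromised Wei might naturally say". No particular
# model is specified in the thread; this is a stand-in.
def avg_log_likelihood(tokens: List[str],
                       sender_logprob: Callable[[List[str], str], float]) -> float:
    """Average per-token log-likelihood of a message under the sender model."""
    total = sum(sender_logprob(tokens[:i], tok) for i, tok in enumerate(tokens))
    return total / max(len(tokens), 1)

def flag_if_anomalous(tokens: List[str],
                      sender_logprob: Callable[[List[str], str], float],
                      threshold: float = -8.0) -> bool:
    """Flag messages on the fringes of the sender's usual distribution as possible signs of compromise."""
    return avg_log_likelihood(tokens, sender_logprob) < threshold
```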

Paul Christiano (+2, 10d): This isn't the kind of approach I'm imagining.
Another (outer) alignment failure story

Most of the time when I look at a message, a bunch of automated systems have looked at it first and will inform me about the intended effect of the message in order to respond to it appropriately or decide whether to read it.

This seems like the most important part so I'll just focus on this for now. I'm having trouble seeing how this can work. Suppose that I, as an attacker, tell my AI assistant, "interact with Paul in my name (possibly over a very long period of time) so as to maximize the chances that Paul eventually ends up believing in religion/ideolog... (read more)

Paul Christiano (+2, 10d): I'm not sure if it's the most important part. If you are including filtering (and not updates about whether people are good to talk to / legal liability / etc.) then I think it's a minority of the story. But it still seems fine to talk about (and it's not like the other steps are easier).

Suppose your AI chooses some message M which is calculated to lead to Paul making (what Paul would or should regard as) an error. It sounds like your main question is how an AI could recognize M as problematic (i.e. such that Paul ought to expect to be worse off after reading M, such that it can either be filtered or caveated, or such that this information can be provided to reputation systems or arbiters, or so on).

My current view is that the sophistication required to recognize M as problematic is similar to the sophistication required to generate M as a manipulative action. This is clearest if the attacker just generates a lot of messages and then picks M that they think will most successfully manipulate the target---then an equally-sophisticated defender will have the same view about the likely impacts of M.

This is fuzzier if you can't tell the difference between deliberation and manipulation. If I define idealized deliberation as an individual activity then I can talk about the extent to which M leads to deviation from idealized deliberation, but it's probably more accurate to think of idealized deliberation as a collective activity. But as far as I can tell the basic story is still intact (and e.g. I have the intuition about "knowing how to manipulate the process is roughly the same as recognizing manipulation," just fuzzier.)

It's probably helpful to get more concrete about the kind of attack you are imagining (which is presumably easier than getting concrete about defenses---both depend on future technology but defenses also depend on what the attack is). If your attack involves convincing me of a false claim, or making a statement from which I will predictably make…


Another (outer) alignment failure story

(Apologies for the late reply. I've been generally distracted by trying to take advantage of perhaps fleeting opportunities in the equities markets, and occasionally by my own mistakes while trying to do that.)

It seems like the AI described in this story is still aligned enough to defend against AI-powered persuasion (i.e. by the time that AI is sophisticated enough to cause that kind of trouble, most people are not ever coming into contact with adversarial content)

How are people going to avoid contact with adversarial content, aside from "go into an i... (read more)

How are people going to avoid contact with adversarial content, aside from "go into an info bubble with trusted AIs and humans and block off any communications from the outside"? (If that is happening a lot, it seems worthwhile to say so explicitly in the story since that might be surprising/unexpected to a lot of readers?)

I don't have a short answer or think this kind of question has a short answer. I don't know what an "info bubble" is and the world I'm imagining may fit your definition of that term (but the quoted description makes it sound like I might be... (read more)

Another (outer) alignment failure story

The ending of the story feels implausible to me, because there's a lack of explanation of why the story doesn't side-track onto some other seemingly more likely failure mode first. (Now that I've re-read the last part of your post, it seems like you've had similar thoughts already, but I'll write mine down anyway. Also it occurs to me that perhaps I'm not the target audience of the story.) For example:

  1. In this story, what is preventing humans from going collectively insane due to nations, political factions, or even individuals blasting AI-powered persua

... (read more)

In this story, what is preventing humans from going collectively insane due to nations, political factions, or even individuals blasting AI-powered persuasion/propaganda at each other? (Maybe this is what you meant by "people yelling at each other"?)

It seems like the AI described in this story is still aligned enough to defend against AI-powered persuasion (i.e. by the time that AI is sophisticated enough to cause that kind of trouble, most people are not ever coming into contact with adversarial content)

Why don't AI safety researchers try to leverage AI t

... (read more)
My research methodology

Why did you write "This post [Inaccessible Information] doesn't reflect me becoming more pessimistic about iterated amplification or alignment overall." just one month before publishing "Learning the prior"? (Is it because you were classifying "learning the prior" / imitative generalization under "iterated amplification" and now you consider it a different algorithm?)

For example, at the beginning of modern cryptography you could describe the methodology as “Tell a story about how someone learns something about your secret” and that only gradually crystal

... (read more)
Paul Christiano (+4, 9d): I'm still curious for your view on the crypto examples you cited. My current understanding is that people do not expect the security proofs to rule out all possible attacks (a situation I can sympathize with since I've written multiple proofs that rule out large classes of attacks without attempting to cover all possible attacks), so I'm interested in whether (i) you disagree with that and believe that serious onlookers have had the expectation that proofs are comprehensive, (ii) you agree but feel it would be impractical to give a correct proof and this is a testament to the difficulty of proving things, (iii) you feel it would be possible but prohibitively expensive, and are expressing a quantitative point about the cost of alignment analyses being impractical, (iv) you feel that the crypto case would be practical but the AI case is likely to be much harder and just want to make a directionally analogous update.

I still feel like more of the action is in my skepticism about the (alignment analysis) <--> (security analysis) analogy, but I could still get some update out of the analogy if the crypto situation is thornier than I currently believe.
Paul Christiano (+6, 1mo): In my other response to your comment I wrote: I guess SSH itself would be an interesting test of this, e.g. comparing the theoretical model of this paper [https://eprint.iacr.org/2010/095.pdf] to a modern implementation. What is your view about that comparison? e.g. how do you think about the following possibilities:

1. There is no material weakness in the security proof.
2. A material weakness is already known.
3. An interested layperson could find a material weakness with moderate effort.
4. An expert could find a material weakness with significant effort.

My guess would be that probably we're in world 2, and if not that it's probably because no one cares that much (e.g. because it's obvious that there will be some material weakness and the standards of the field are such that it's not publishable unless it actually comes with an attack) and we are in world 3. (On a quick skim, and from the author's language when describing the model, my guess is that material weaknesses of the model are more or less obvious and that the authors are aware of potential attacks not covered by their model.)
Paul Christiano (+4, 1mo): I think that post is basically talking about the same kinds of hard cases as in Towards Formalizing Universality [https://ai-alignment.com/towards-formalizing-universality-409ab893a456] 1.5 years earlier (in section IV), so it's intended to be more about clarification/exposition than changing views. See the thread with Rohin above for some rough history.

I'm not sure. It's possible I would become more pessimistic if I walked through concrete cases of people's analyses being wrong in subtle and surprising ways. My experience with practical systems is that it is usually easy for theorists to describe hypothetical breaks for the security model, and the issue is mostly one of prioritization (since people normally don't care too much about security). For example, my strong expectation would be that people had described hypothetical attacks on any of the systems discussed in the article you linked [http://www.ibiblio.org/weidai/temp/Provable_Security.pdf] prior to their implementation, at least if they had ever been subject to formal scrutiny. The failures are just quite far away from the levels of paranoia that I've seen people on the theory side exhibit when they are trying to think of attacks.

I would also expect that e.g. if you were to describe almost any existing practical system with purported provable security, it would be straightforward for a layperson with theoretical background (e.g. me) to describe possible attacks that are not precluded by the security proof, and that it wouldn't even take that long. It sounds like a fun game.

Another possible divergence is that I'm less convinced by the analogy, since alignment seems more about avoiding the introduction of adversarial consequentialists and it's not clear if that game behaves in the same way. I'm not sure if that's more or less important than the prior point. I would want to do a lot of work before deploying an algorithm in any context where a failure would be catastrophic (though "before letting it be…
Persuasion Tools: AI takeover without AGI or agency?

You mention "defenses will improve" a few times. Can you go into more detail about this? What kind of defenses do you have in mind? I keep thinking that in the long run, the only defenses are either to solve meta-philosophy so our AIs can distinguish between correct arguments and merely persuasive ones and filter out the latter for us (and for themselves), or go into an info bubble with trusted AIs and humans and block off any communications from the outside. But maybe I'm not being imaginative enough.

Daniel Kokotajlo (+2, 5mo): I think I mostly agree with you about the long run, but I think we have more short-term hurdles that we need to overcome before we even make it to that point, probably. I will say that I'm optimistic that we haven't yet thought of all the ways advances in tech will help collective epistemology rather than hinder it. I notice you didn't mention debate; I am not confident debate will work but it seems like maybe it will.

In the short run, well, there's also debate I guess. And the internet having conversations being recorded by default and easily findable by everyone was probably something that worked in favor of collective epistemology. Plus there is wikipedia, etc. I think the internet in general has lots of things in it that help collective epistemology... it just also has things that hurt, and recently I think the balance is shifting in a negative direction. But I'm optimistic that maybe the balance will shift back. Maybe.
Alignment By Default

So similarly, a human could try to understand Alice's values in two ways. The first, equivalent to what you describe here for AI, is to just apply whatever learning algorithm their brain uses when observing Alice, and form an intuitive notion of "Alice's values". And the second is to apply explicit philosophical reasoning to this problem. So sure, you can possibly go a long way towards understanding Alice's values by just doing the former, but is that enough to avoid disaster? (See Two Neglected Problems in Human-AI Safety for the kind of disaster I have i... (read more)

johnswentworth (+1, 9mo): I mostly agree with you here. I don't think the chances of alignment by default are high. There are marginal gains to be had, but to get a high probability of alignment in the long term we will probably need actual understanding of the relevant philosophical problems.
Alignment By Default

To help me check my understanding of what you're saying, we train an AI on a bunch of videos/media about Alice's life, in the hope that it learns an internal concept of "Alice's values". Then we use SL/RL to train the AI, e.g., give it a positive reward whenever it does something that the supervisor thinks benefits Alice's values. The hope here is that the AI learns to optimize the world according to its internal concept of "Alice's values" that it learned in the previous step. And we hope that its concept of "Alice's values" includes the idea that Alice w... (read more)

John Maxwell (+1, 9mo): My take is that corrigibility is sufficient to get you an AI that understands what it means to "keep improving their understanding of Alice's values and to serve those values". I don't think the AI needs to play the "genius philosopher" role, just the "loyal and trustworthy servant" role. A superintelligent AI which plays that role should be able to facilitate a "long reflection" where flesh and blood humans solve philosophical problems. (I also separately think unsupervised learning systems could in principle make philosophical breakthroughs. Maybe one already has [https://twitter.com/AmandaAskell/status/1284307770024448001].)
johnswentworth (+4, 9mo): There's a lot of moving pieces here, so the answer is long. Apologies in advance.

I basically agree with everything up until the parts on philosophy. The point of divergence is roughly here: I do think that resolving certain confusions around values involves solving some philosophical problems. But just because the problems are philosophical does not mean that they need to be solved by philosophical reasoning. The kinds of philosophical problems I have in mind are things like:

* What is the type signature of human values?
* What kind of data structure naturally represents human values?
* How do human values interface with the rest of the world?

In other words, they're exactly the sort of questions for which "utility function" and "Cartesian boundary" are answers, but probably not the right answers.

How could an AI make progress on these sorts of questions, other than by philosophical reasoning? Let's switch gears a moment and talk about some analogous problems:

* What is the type signature of the concept of "tree"?
* What kind of data structure naturally represents "tree"?
* How do "trees" (as high-level abstract objects) interface with the rest of the world?

Though they're not exactly the same questions, these are philosophical questions of a qualitatively similar sort to the questions about human values. Empirically, AIs already do a remarkable job reasoning about trees, and finding answers to questions like those above, despite presumably not having much notion of "philosophical reasoning". They learn some data structure for representing the concept of tree, and they learn how the high-level abstract "tree" objects interact with the rest of the (lower-level) world. And it seems like such AIs' notion of "tree" tends to improve as we throw more data and compute at them, at least over the ranges explored to date. In other words: empirically, we seem to be able to solve philosophical problems to a surprising degree by throwing data and compute at…
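As a purely illustrative aside on what "type signature" means here: the "utility function plus Cartesian boundary" style of answer that the comment above calls probably-not-right can be written down in a few lines. Everything below (the names, the world-state encoding) is hypothetical.

```python
from typing import Callable, Dict

# One candidate type signature for "human values": a real-valued score over
# complete world states, with the agent treated as outside the world it scores
# (the Cartesian boundary). This illustrates the kind of answer on offer today,
# not a claim that it is the right one.
WorldState = Dict[str, float]                   # placeholder encoding of a world state
UtilityFunction = Callable[[WorldState], float]

def alices_values(state: WorldState) -> float:
    """Stand-in utility function: scores world states by a single made-up feature."""
    return state.get("alice_flourishing", 0.0)
```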
Inaccessible information

or we need to figure out some way to access the inaccessible information that “A* leads to lots of human flourishing.”

To help check my understanding, your previously described proposal to access this "inaccessible" information involves building corrigible AI via iterated amplification, then using that AI to capture "flexible influence over the future", right? Have you become more pessimistic about this proposal, or are you just explaining some existing doubts? Can you explain in more detail why you think it may fail?

(I'll try to guess.) Is it that corri

... (read more)
Paul Christiano (+5, 1y): I think that's right. The difficulty is that short-term preferences-on-reflection depend on "how good is this situation actually?" and that judgment is inaccessible.

This post doesn't reflect me becoming more pessimistic about iterated amplification or alignment overall. This post is part of the effort to pin down the hard cases for iterated amplification, which I suspect will also be hard cases for other alignment strategies (for the kinds of reasons discussed in this post).

Yeah, I think that's similar. I'm including this as part of the alignment problem---if unaligned AIs realize that a certain kind of resource is valuable but aligned AIs don't realize that, or can't integrate it with knowledge about what the users want (well enough to do strategy stealing) then we've failed to build competitive aligned AI.

Yes. Yes. If we are using iterated amplification to try to train a system that answers the question "What action will put me in the best position to flourish over the long term?" then in some sense the only inaccessible information that matters is "To what extent will this action put me in a good position to flourish?" That information is potentially inaccessible because it depends on the kind of inaccessible information described in this post---what technologies are valuable? what's the political situation? am I being manipulated? is my physical environment being manipulated?---and so forth. That information in turn is potentially inaccessible because it may depend on internal features of models that are only validated by trial and error, for which we can't elicit the correct answer either by directly checking it nor by transfer from other accessible features of the model. (I might be misunderstanding your question.)

By default I don't expect to give enough explanations or examples :) My next step in this direction will be thinking through possible approaches for eliciting inaccessible information, which I may write about but which I don't expect to b…
Possible takeaways from the coronavirus pandemic for slow AI takeoff

Thanks for writing this. I've been thinking along similar lines since the pandemic started. Another takeaway for me: Under our current political system, AI risk will become politicized. It will be very easy for unaligned or otherwise dangerous AI to find human "allies" who will help to prevent effective social response. Given this, "more competent institutions" has to include large-scale and highly effective reforms to our democratic political structures, but political dysfunction is such a well-known problem (i.e., not particularly neglected) that if ther

... (read more)
Vika (+2, 1y): Thanks Wei! I agree that improving institutions is generally very hard. In a slow takeoff scenario, there would be a new path to improving institutions using powerful (but not fully general) AI, but it's unclear how well we could expect that to work given the generally low priors. The covid response was a minor update for me in terms of AI risk assessment - it was mildly surprising given my existing sense of institutional competence.

Thinking for a minute, I guess my unconditional probability of unaligned AI ending civilization (or something similar) is around 75%. It’s my default expected outcome.

That said, this isn’t a number I try to estimate directly very much, and I’m not sure if it would be the same after an hour of thinking about that number. Though I’d be surprised if I ended up giving more than 95% or less than 40%. 

Curious where yours is at?

In September 2017, based on some conversations with MIRI and non-MIRI folks, I wrote:

I think that at least 80% of the AI safety researchers at MIRI, FHI, CHAI, OpenAI, and DeepMind would currently assign a >10% probability to this claim: "The research community will fail to solve one or more technical AI safety problems, and as a consequence there will be a permanent and drastic reduction in the amount of value in our future."

People may have become more optimistic since then, but most people falling in the 1-10% range would still surprise me a... (read more)

AGIs as collectives

Having said this, I’m open to trying it for one of your arguments. So perhaps you can point me to one that you particularly want engagement on?

Perhaps you could read all three of these posts (they're pretty short :) and then either write a quick response to each one and then I'll decide which one to dive into, or pick one yourself (that you find particularly interesting, or you have something to say about).

... (read more)

My thoughts on each of these. The common thread is that it seems to me you're using abstractions at way too high a level to be confident that they will actually apply, or that they even make sense in those contexts.

AGIs and economies of scale

  • Do we expect AGIs to be so competitive that reducing coordination costs is a big deal? I expect that the dominant factor will be AGI intelligence, which will vary enough that changes in coordination costs aren't a big deal. Variations in human intelligence have a huge effect, and presumably variations in AGI
... (read more)
AGIs as collectives

This seems about right. In general when someone proposes a mechanism by which the world might end, I think the burden of proof is on them. You’re not just claiming “dangerous”, you’re claiming something like “more dangerous than anything else has ever been, even if it’s intent-aligned”. This is an incredibly bold claim and requires correspondingly thorough support.

  1. "More dangerous than anything else has ever been" does not seem incredibly bold to me, given that superhuman AI will be more powerful than anything else the world has seen. Historically the r
... (read more)
AGIs as collectives

To try to encourage you to engage with my arguments more (as far as pointing out where you're not convinced), I think I'm pretty good at being skeptical of my own ideas and have a good track record in terms of not spewing off a lot of random ideas that turn out to be far off the mark. But I am too lazy / have too many interests / am too easily distracted to write long papers/posts where I lay out every step of my reasoning and address every possible counterargument in detail.

So what I'd like to do is to just amend my posts to address the main objections th

... (read more)

I'm pretty skeptical of this as a way of making progress. It's not that I already have strong disagreements with your arguments. But rather, if you haven't yet explained them thoroughly, I expect them to be underspecified, and use some words and concepts that are wrong in hard-to-see ways. One way this might happen is if those arguments use concepts (like "metaphilosophy") that kinda intuitively seem like they're pointing at something, but come with a bunch of connotations and underlying assumptions that make actually understa... (read more)

AGIs as collectives

but when we’re trying to make claims that a given effect will be pivotal for the entire future of humanity despite whatever efforts people will make when the problem starts becoming more apparent, we need higher standards to get to the part of the logistic curve with non-negligible gradient.

I guess a lot of this comes down to priors and burden of proof. (I guess I have a high prior that making something smarter than human is dangerous unless we know exactly what we're doing including the social/political aspects, and you don't, so you think the burden o

... (read more)
Many of my "disjunctive" arguments were written specifically with that scenario in mind.

Cool, makes sense. I retract my pointed questions.

I guess I have a high prior that making something smarter than human is dangerous unless we know exactly what we're doing including the social/political aspects, and you don't, so you think the burden of proof is on me?

This seems about right. In general when someone proposes a mechanism by which the world might end, I think the burden of proof is on them. You're not just claiming "dangerous"... (read more)

AGIs as collectives

For now my epistemic state is: extreme agency is an important component of the main argument for risk, so all else equal reducing it should reduce risk.

I appreciate the explanation, but this is pretty far from my own epistemic state, which is that arguments for AI risk are highly disjunctive, most types of AGI (not just highly agentic ones) are probably unsafe (i.e., are likely to lead us away from rather than towards a success story), at best probably only a few very specific AGI designs (which may well be agentic if combined with other properties) ar

... (read more)
my own epistemic state, which is that arguments for AI risk are highly disjunctive, most types of AGI (not just highly agentic ones) are probably unsafe (i.e., are likely to lead us away from rather than towards a success story), at best probably only a few very specific AGI designs (which may well be agentic if combined with other properties) are both feasible and safe (i.e., can count as success stories)

Yeah, I guess I'm not surprised that we have this disagreement. To briefly sketch out why I disagree (mostly for common knowledge; I don't expe... (read more)

AGIs as collectives

I don’t think such work should depend on being related to any specific success story.

The reason I asked was that you talk about "safer" and "less safe" and I wasn't sure if "safer" here should be interpreted as "more likely to let us eventually achieve some success story", or "less likely to cause immediate catastrophe" (or something like that). Sounds like it's the latter?

Maybe I should just ask directly: what do you tend to mean when you say "safer"?

Richard Ngo (+4, 1y): My thought process when I use "safer" and "less safe" in posts like this is: the main arguments that AGI will be unsafe depend on it having certain properties, like agency, unbounded goals, lack of interpretability, desire and ability to self-improve, and so on. So reducing the extent to which it has those properties will make it safer, because those arguments will be less applicable. I guess you could have two objections to this:

* Maybe safety is non-monotonic in those properties.
* Maybe you don't get any reduction in safety until you hit a certain threshold (corresponding to some success story).

I tend not to worry so much about these two objections because to me, the properties I outlined above are still too vague to have a good idea of the landscape of risks with respect to those properties. Once we know what agency is, we can talk about its monotonicity. For now my epistemic state is: extreme agency is an important component of the main argument for risk, so all else equal reducing it should reduce risk.

I like the idea of tying safety ideas to success stories in general, though, and will try to use it for my next post, which proposes more specific interventions during deployment. Having said that, I also believe that most safety work will be done by AGIs, and so I want to remain open-minded to success stories that are beyond my capability to predict.
AGIs as collectives

What success story (or stories) did you have in mind when writing this?

Richard Ngo (+1, 1y): Nothing in particular. My main intention with this post was to describe a way the world might be, and some of the implications. I don't think such work should depend on being related to any specific success story.
Curiosity Killed the Cat and the Asymptotically Optimal Agent

From your paper:

It is interesting to note that AIXI, a Bayes-optimal reinforcement learner in general environments, is not asymptotically optimal [Orseau, 2010], and indeed, may cease to explore [Leike et al., 2015]. Depending on its prior and its past observations, AIXI may decide at some point that further exploration is not worth the risk. Given our result, this seems like reasonable behavior.

Given this, why is your main conclusion "Perhaps our results suggest we are in need of more theory regarding the 'parenting' of artificial agents" instead of "We should use Bayesian optimality instead of asymptotic optimality"?
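To make the quoted trade-off concrete, here is a toy sketch (not from the paper; the environment and numbers are made up) of how a Bayes-optimal agent's prior can make never exploring the rational choice:

```python
# Toy setup: a T-step bandit with a known safe arm paying 0.5 per step, and an
# unknown arm that under the prior is either "great" (pays 1.0 per step) with
# probability p, or a "trap" (ruins the rest of the episode, paying 0.0) with
# probability 1 - p. These assumptions are illustrative only.

def expected_returns(p_great: float, horizon: int) -> dict:
    """Expected total reward of never exploring vs. trying the unknown arm once."""
    never_explore = 0.5 * horizon
    explore = p_great * (1.0 * horizon) + (1.0 - p_great) * 0.0
    return {"never_explore": never_explore, "explore": explore}

for p in (0.3, 0.5, 0.7):
    values = expected_returns(p, horizon=1000)
    print(f"P(great)={p}: {values} -> Bayes-optimal: {max(values, key=values.get)}")

# With a pessimistic prior (p < 0.5) the Bayes-optimal policy never explores, so it
# never learns the truth about the unknown arm and is not asymptotically optimal,
# mirroring the quoted observation about AIXI.
```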

michaelcohen (+3, 1y): The simplest version of the parenting idea includes an agent which is Bayes-optimal. Parenting would just be designed to help out a Bayesian reasoner, since there's not much you can say about to what extent a Bayesian reasoner will explore, or how much it will learn; it all depends on its prior. (Almost all policies are Bayes-optimal with respect to some (universal) prior). There's still a fundamental trade-off between learning and staying safe, so while the Bayes-optimal agent does not do as bad a job in picking a point on that trade-off as the asymptotically optimal agent, that doesn't quite allow us to say that it will pick the right point on the trade-off. As long as we have access to "parents" that might be able to guide an agent toward world-states where this trade-off is less severe, we might as well make use of them. And I'd say it's more a conclusion, not a main one.
[AN #80]: Why AI risk might be solved without additional intervention from longtermists

I agree that this is troubling, though I think it’s similar to how I wouldn’t want the term biorisk to be expanded ...

Well as I said, natural language doesn't have to be perfectly logical, and I think "biorisk" is somewhat in that category, but there's an explanation that makes it a bit more reasonable than it might first appear, which is that the "bio" refers not to "biological" but to "bioweapon". This is actually one of the definitions that Google gives when you search for "bio": "relating to or involving the use of toxic biological or biochemical subst

... (read more)
Matthew Barnett (+1, 1y): Yeah that makes sense. Your points about "bio" not being short for "biological" were valid, but the fact that as a listener I didn't know that fact implies that it seems really easy to mess up the language usage here. I'm starting to think that the real fight should be about using terms that aren't self explanatory.

I'm not sure about whether it would have been prevented by using the term more narrowly, but in my experience the most common reaction people outside of EA/LW (and even sometimes within) have to hearing about AI risk is to assume that it's not technical, and to assume that it's not about accidents. In that sense, I have been exposed to quite a bit of this already.
[AN #80]: Why AI risk might be solved without additional intervention from longtermists

Also, isn't defining "AI risk" as "technical accidental AI risk" analogous to defining "apple" as "red apple" (in terms of being circular/illogical)? I realize natural language doesn't have to be perfectly logical, but this still seems a bit too egregious.

Matthew Barnett (+1, 1y): I agree that this is troubling, though I think it's similar to how I wouldn't want the term biorisk [https://en.wikipedia.org/wiki/Biorisk] to be expanded to include biodiversity loss (a risk, but not the right type), regular human terrorism (humans are biological, but it's a totally different issue), zombie uprisings (they are biological, but it's totally ridiculous), alien invasions etc. Not to say that's what you are doing with AI risk. I'm worried about what others will do with it if the term gets expanded.
[AN #80]: Why AI risk might be solved without additional intervention from longtermists

But I am optimistic about the actual risks that you and others argue for.

Why? I actually wrote a reply that was more questioning in tone, and then changed it because I found some comments you made where you seemed to be concerned about the additional AI risks. Good thing I saved a copy of the original reply, so I'll just paste it below:

I wonder if you would consider writing an overview of your perspective on AI risk strategy. (You do have a sequence but I'm looking for something that's more comprehensive, that includes e.g. human safety and philosophica

... (read more)
Rohin Shah (+3, 1y): Seems right, I think my opinions fall closest to Paul's, though it's also hard for me to tell what Paul's opinions are. I think this older thread [https://www.alignmentforum.org/posts/ZeE7EKHTFMBs8eMxn/clarifying-ai-alignment#myFnwgTqSPW4fgd6K] is a relatively good summary of the considerations I tend to think about, though I'd place different emphases now. (Sadly I don't have the time to write a proper post about what I think about AI strategy -- it's a pretty big topic.)

Yes, though I would frame it as "the ~5 people reading these comments have two clear terms, while everyone else uses a confusing mishmash of terms". The hard part is in getting everyone else to use the terms. I am generally skeptical of deciding on definitions and getting everyone else to use them, and usually try to use terms the way other people use terms.

Agreed with this, but see above about trying to conform with the way terms are used, rather than defining terms and trying to drag everyone else along.
[AN #80]: Why AI risk might be solved without additional intervention from longtermists

AI risk is just a shorthand for “accidental technical AI risk.”

I don't think "AI risk" was originally meant to be a shorthand for "accidental technical AI risk". The earliest considered (i.e., not off-hand) usage I can find is in the title of Luke Muehlhauser's AI Risk and Opportunity: A Strategic Analysis where he defined it as "the risk of AI-caused extinction".

(He used "extinction" but nowadays we tend to think in terms of "existential risk" which also includes "permanent large negative consequences", which seems like a reasonable expansion of "AI risk

... (read more)
Matthew Barnett (+1, 1y): I appreciate the arguments, and I think you've mostly convinced me, mostly because of the historical argument. I do still have some remaining apprehension about using AI risk to describe every type of risk arising from AI.

That is true. The way I see it, UDT is definitely on the technical side, even though it incorporates a large amount of philosophical background. When I say technical, I mostly mean "specific, uses math, has clear meaning within the language of computer science" rather than a more narrow meaning of "is related to machine learning" or something similar.

My issue with arguing for philosophical failure is that, as I'm sure you're aware, there's a well known failure mode of worrying about vague philosophical problems rather than more concrete ones. Within academic philosophy, the majority of discussion surrounding AI is centered around consciousness, intentionality, whether it's possible to even construct a human-like machine, whether they should have rights etc. There's a unique thread of philosophy that arose from Lesswrong, which includes work on decision theory, that doesn't focus on these thorny and low priority questions. While I'm comfortable with you arguing that philosophical failure is important, my impression is that the overly philosophical approach used by many people has done more harm than good for the field in the past, and continues to do so. It is therefore sometimes nice to tell people that the problems that people work on here are concrete and specific, and don't require doing a ton of abstract philosophy or political advocacy.

This is true, but my impression is that when you tell people that a problem is "technical" it generally makes them refrain from having a strong opinion before understanding a lot about it. "Accidental" also reframes the discussion by reducing the risk of polarizing biases. This is a common theme in many fields:

* Physicists sometimes get frustrated with people arguing about "the philosophy of th…
[AN #80]: Why AI risk might be solved without additional intervention from longtermists

Ok, I wasn't sure that you'd agree, but given that you do, it seems that when you wrote the title of this newsletter "Why AI risk might be solved without additional intervention from longtermists" you must have meant "Why some forms of AI risk ...", or perhaps certain forms of AI risk just didn't come to your mind at that time. In either case it seems worth clarifying somewhere that you don't currently endorse interpreting "AI risk" as "AI risk in its entirety" in that sentence.

Similarly, on the inside you wrote:

The main reason I am optimistic about AI s

... (read more)
Rohin Shah (+3, 1y): Tbc, I'm optimistic about all the types of AI safety problems that people have proposed, including the philosophical ones. When I said "all else equal those seem more likely to me", I meant that if all the other facts about the matter are the same, but one risk affects only future people and not current people, that risk would seem more likely to me because people would care less about it. But I am optimistic about the actual risks that you and others argue for.

That said, over the last week I have become less optimistic specifically about overcoming race dynamics, mostly from talking to people at FHI / GovAI. I'm not sure how much to update though. (Still broadly optimistic.)

It's notable that AI Impacts asked for people who were skeptical of AI risk (or something along those lines) and to my eye it looks like all four of the people in the newsletter independently interpreted that as accidental technical AI risk in which the AI is adversarially optimizing against you (or at least that's what the four people argued against). This seems like pretty strong evidence that when people hear "AI risk" they now think of technical accidental AI risk, regardless of what the historical definition may have been. I know certainly that is my default assumption when someone (other than you) says "AI risk".

I would certainly support having clearer definitions and terminology if we could all agree on them.
Matthew Barnett (+3, 1y): AI risk is just a shorthand for "accidental technical AI risk." To the extent that people are confused, I agree it's probably worth clarifying the type of risk by adding "accidental" and "technical" whenever we can. However, I disagree with the idea that we should expand the word AI risk to include philosophical failures and intentional risks. If you open the term up [https://arbital.com/p/guarded_definition/], these outcomes might start to happen:

* It becomes unclear in conversation what people mean when they say AI risk.
* Like The Singularity, it becomes a buzzword.
* Journalists start projecting Terminator scenarios onto the words, and now have justification because even the researchers say that AI risk can mean a lot of different things.
* It puts a whole bunch of types of risk into one basket, suggesting to outsiders that all attempts to reduce "AI risk" might be equally worthwhile.
* ML researchers start to distrust AI risk researchers, because people who are worried about the Terminator are using the same words as the AI risk researchers and therefore get associated with them.

This can all be avoided by having a community norm to clarify that we mean technical accidental risk when we say AI risk, and when we're talking about other types of risks we use more precise terminology.
[AN #80]: Why AI risk might be solved without additional intervention from longtermists

But on the strong versions of warning shots, where there’s common knowledge that building an AGI runs a substantial risk of destroying the world, yes, I expect them to not build AGI until safety is solved. (Not to the standard you usually imagine, where we must also solve philosophical problems, but to the standard I usually imagine, where the AGI is not trying to deceive us or work against us.)

To the extent that we expect strong warning shots and ability to avoid building AGI upon receiving such warning shots, this seems like an argument for researcher

... (read more)
Rohin Shah (+2, 1y): Yes. Agreed, all else equal those seem more likely to me.
[AN #80]: Why AI risk might be solved without additional intervention from longtermists

Faced with an actual example, I’m realizing that what I actually expect would cause people to take it more seriously is a) the belief that AGI is near and b) an example where the AI algorithm “deliberately” causes a problem (i.e. “with full knowledge” that the thing it was doing was not what we wanted).

What do you expect the ML community to do at that point? Coordinate to stop or slow down the race to AGI until AI safety/alignment is solved? Or do you think each company/lab will unilaterally invest more into safety/alignment without slowing down capabil

... (read more)
Rohin Shah (+3, 1y): It depends a lot on the particular warning shot that we get. But on the strong versions of warning shots, where there's common knowledge that building an AGI runs a substantial risk of destroying the world, yes, I expect them to not build AGI until safety is solved. (Not to the standard you usually imagine, where we must also solve philosophical problems, but to the standard I usually imagine, where the AGI is not trying to deceive us or work against us.)

This depends on other background factors, e.g. how much the various actors think they are value-aligned vs. in zero-sum competition. I currently think the ML community thinks they are mostly but not fully value-aligned, and they will influence companies and governments in that direction. (I also want more longtermists to be trying to build more common knowledge of how much humans are value aligned, to make this more likely.)

The major disanalogy is that catastrophic outcomes of climate change do not personally affect the CEOs of energy companies very much, whereas AI x-risk affects everyone. (Also, maybe we haven't gotten clear and obvious warning shots?)

I agree that my story requires common knowledge of the risk of building AGI, in the sense that you need people to predict "running this code might lead to all humans dying", and not "running this code might lead to <warning shot effect>". You also need relative agreement on the risks. I think this is pretty achievable. Most of the ML community already agrees that building an AGI is high-risk if not done with some argument for safety. The thing people tend to disagree on is when we will get AGI and how much we should work on safety before then.
What can the principal-agent literature tell us about AI risk?

Thanks for making the changes, but even with "PAL confirms that due to diverging interests and imperfect monitoring, AI agents could get some rents." I'd still like to understand why imperfect monitoring could lead to rents, because I don't currently know a model that clearly shows this (i.e., where the rent isn't due to the agent having some other kind of advantage, like not having many competitors).

Also, I get that the PAL in its current form may not be directly relevant to AI, so I'm just trying to understand it on its own terms for now. Possibly I should just dig into the literature myself...
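For reference, here is a minimal sketch of one textbook-style moral-hazard model in which imperfect monitoring does leave the agent a rent. Note that it adds a limited-liability assumption (wages cannot be negative) that is not part of the discussion above, and without that assumption competition between agents could bid the rent away.

```python
# Toy model (illustrative assumptions): the principal sees only a noisy binary
# signal of effort, and limited liability means wages must be non-negative.
# With effort cost c, P(good signal | high effort) = p_hi and
# P(good signal | low effort) = p_lo, incentive compatibility requires a bonus b
# on the good signal with  p_hi * b - c >= p_lo * b,  i.e.  b >= c / (p_hi - p_lo).
# The agent's expected payoff is then p_hi * b - c, which is positive whenever p_lo > 0.

def minimum_agency_rent(c: float, p_hi: float, p_lo: float) -> float:
    """Expected rent left to the agent at the smallest incentive-compatible bonus."""
    b = c / (p_hi - p_lo)      # smallest bonus that makes high effort worthwhile
    return p_hi * b - c        # expected bonus minus effort cost

print(minimum_agency_rent(c=1.0, p_hi=0.9, p_lo=0.6))  # ~2.0: noisy monitoring leaves a rent
print(minimum_agency_rent(c=1.0, p_hi=1.0, p_lo=0.0))  # 0.0: perfect monitoring, no rent
```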

What can the principal-agent literature tell us about AI risk?

PAL confirms that due to diverging interests and imperfect monitoring, agents will get some rents.

Can you provide a source for this, or explain more? I'm asking because your note about competition between agents reducing agency rents made me think that such competition ought to eliminate all rents that the agent could (for example) gain by shirking, because agents will bid against each other to accept lower wages until they have no rent left. For example in the model of the principal-agent problem presented in this lecture (which has diverging interests and

... (read more)
Alexis Carlier (+3, 1y): Thanks for catching this! You're correct that that sentence is inaccurate. Our views changed while iterating the piece and that sentence should have been changed to: "PAL confirms that due to diverging interests and imperfect monitoring, AI agents could get some rents." This sentence too: "Overall, PAL tells us that agents will inevitably extract some agency rents…" would be better as "Overall, PAL is consistent with AI agents extracting some agency rents…" I'll make these edits, with a footnote pointing to your comment.

The main aim of that section was to point out that Paul's scenario isn't in conflict with PAL. Without further research, I wouldn't want to make strong claims about what PAL implies for AI agency rents because the models are so brittle and AIs will likely be very different to humans; it's an open question.

For there to be no agency rents at all, I think you'd need something close to perfect competition [https://en.wikipedia.org/wiki/Perfect_competition] between agents. In practice the necessary conditions [https://en.wikipedia.org/wiki/Perfect_competition#Idealizing_conditions_of_perfect_competition] are basically never satisfied because they are very strong, so it seems very plausible to me that AI agents extract rents.

Re monopoly rents vs agency rents: Monopoly rents refer to the opposite extreme with very little competition, and in the economics literature is used when talking about firms, while agency rents are present whenever competition and monitoring are imperfect. Also, agency rents refer specifically to the costs inherent to delegating to an agent (e.g. an agent making investment decisions optimising for commission over firm profit) vs the rents from monopoly power (e.g. being the only firm able to use a technology due to a patent). But as you say, it's true that lack of competition is a cause of both of these.
Outer alignment and imitative amplification

I think there are lots of very valid reasons for thinking that HCH is not competitive—I only said I was skeptical of the reasons for thinking it wouldn’t be aligned.

But if you put aside competitiveness, can't HCH be trivially aligned? E.g., you give the humans making up HCH instructions to cause it to not be able to answer anything except simple arithmetic questions. So it seems that a claim of HCH being aligned is meaningless unless the claim is about being aligned at some level of competitiveness.

Evan Hubinger (+1, 1y): That's a good point. What I really mean is that I think the sort of HCH that you get out of taking actual humans and giving them careful instructions is more likely to be uncompetitive than it is to be unaligned. Also, I think that "HCH for a specific H" is more meaningful than "HCH for a specific level of competitiveness," since we don't really know what weird things you might need to do to produce an HCH with a given level of competitiveness.
The Main Sources of AI Risk?

Thank you for making this list. I think it is important enough to be worth continually updating and refining; if you don’t do it then I will myself someday.

Please do. I seem to get too easily distracted these days for this kind of long term maintenance work. I'll ask the admins to give you edit permission on this post (if possible) and you can also copy the contents into a wiki page or your own post if you want to do that instead.

Daniel Kokotajlo (+4, 1y): Ha! I wake up this morning to see my own name as author, that wasn't what I had in mind but it sure does work to motivate me to walk the talk! Thanks!
Oliver Habryka (+1, 1y): Done! Daniel should now be able to edit the post.
AI Alignment Open Thread October 2019

When I listen to old recordings of right wing talk show hosts from decades ago, they seem to be saying the same stuff that current people are saying today, about political correctness and being forced out of academia for saying things that are deemed harmful by the social elite, or about the Left being obsessed by equality and identity. So I would definitely say that a lot of people predicted this would happen.

I think what's surprising is that although academia has been left-leaning for decades, the situation had been relatively stable until the last fe

... (read more)
AI Alignment Open Thread October 2019

Ahh. To be honest, I read that, but then responded to something different. I assumed you were just expressing general pessimism, since there’s no guarantee that we would converge on good values upon a long reflection (and you recently viscerally realized that values are very arbitrary).

I guess I was also expressing a more general update towards more pessimism, where even if nothing happens during the Long Reflection that causes it to prematurely build an AGI, other new technologies that will be available/deployed during the Long Reflection could also in

... (read more)
AI Alignment Open Thread October 2019

I think it’s likely that another cultural revolution could happen, and this could adversely affect the future if it happens simultaneously with a transition into an AI based economy.

This seems to be ignoring the part of my comment at the top of this sub-thread, where I said "[...] has also made me more pessimistic about non-AGI or delayed-AGI approaches to a positive long term future (e.g., the Long Reflection)." In other words, I'm envisioning a long period of time in which humanity has the technical ability to create an AGI but is deliberately holding

... (read more)
Matthew Barnett (+1, 1y): Ahh. To be honest, I read that, but then responded to something different. I assumed you were just expressing general pessimism, since there's no guarantee that we would converge on good values upon a long reflection (and you recently viscerally realized that values are very arbitrary).

Now I see that your worry is more narrow, in that the cultural revolution might happen during this period, and would act unwisely to create the AGI during its wake. I guess this seems quite plausible, and is an important concern, though I personally am skeptical that anything like the long reflection will ever happen.
AI Alignment Open Thread October 2019

I could be wrong here, but the stuff you mentioned appear either ephemeral, or too particular. The “last few years” of political correctness is hardly enough time to judge world-trends by, right? By contrast, the stuff I mentioned (end of slavery, explicit policies against racism and war) seem likely to stick and stay with us for decades, if not centuries.

It sounds like you think that something like another Communist Revolution or Cultural Revolution could happen (that emphasizes some random virtues at the expense of others), but the effect would be tem

... (read more)
Matthew Barnett (+1, 1y): That's pretty fair. I think it's likely that another cultural revolution could happen, and this could adversely affect the future if it happens simultaneously with a transition into an AI based economy. However, the deviations from long-term trends are very hard to predict, as you point out, and we should know about the specifics more as we get further along. In the absence of concrete details, I find it far more helpful to use information from long-term trends rather than worrying about specific scenarios.
AI Alignment Open Thread October 2019

By unpredictable I mean that nobody really predicted:

(Edit: 1-3 removed to keep a safer distance from object-level politics, especially on AF)

4. Russia and China adopted communism even though they were extremely poor. (They were ahead of the US in gender equality and income equality for a time due to that, even though they were much poorer.)

None of these seem well-explained by your "rich society" model. My current model is that social media and a decrease in the perception of external threats relative to internal threats both favor more virtue signaling, wh

... (read more)
Matthew Barnett (+1, 1y): I could be wrong here, but the stuff you mentioned as counterexamples to my model appear either ephemeral, or too particular. The "last few years" of political correctness is hardly enough time to judge world-trends by, right? By contrast, the stuff I mentioned (end of slavery, explicit policies against racism and war) seem likely to stick and stay with us for decades, if not centuries.

When I listen to old recordings of right wing talk show hosts from decades ago, they seem to be saying the same stuff that current people are saying today, about political correctness and being forced out of academia for saying things that are deemed harmful by the social elite, or about the Left being obsessed by equality and identity. So I would definitely say that a lot of people predicted this would happen. The main difference is that it's now been amplified as recent political events have increased polarization, the people with older values are dying of old age or losing their power, and we have social media that makes us more aware of what is happening. But in hindsight I think this scenario isn't that surprising.

Of course, you can point to a few examples of where my model fails. I'm talking about the general trends rather than the specific cases. If we think in terms of world history, I would say that Russia in the early 20th century was "rich" in the sense that it was much richer than countries in previous centuries and this enabled it to implement communism in the first place. Government power waxes and wanes, but over time I think its power has definitely gone up as the world has gotten richer, and I think this could have been predicted.
AI Alignment Open Thread October 2019

Studying recent cultural changes in the US and the ideas of virtue signaling and preference falsification more generally has also made me more pessimistic about non-AGI or delayed-AGI approaches to a positive long term future (e.g., the Long Reflection). I used to think that if we could figure out how to achieve strong global coordination on AI, or build a stable world government, then we'd be able to take our time, centuries or millennia if needed, to figure out how to build an aligned superintelligent AI. But it seems that human cultural/moral evolution

... (read more)
2Matthew Barnett1yPart of why I'm skeptical of these concerns is that it seems like a lot of moral behavior is predictable as society gets richer, and we can model the social dynamics to predict that some outcomes will be good. As evidence for the predictability, consider that rich societies are more open to LGBT rights; they have explicit policies against racism, war, slavery, and torture; and it seems like rich societies are moving in the direction of government control over many aspects of life, such as education and healthcare. Is this just a quirk of our timeline, or a natural feature of civilizations of humans as they get richer? I am inclined to think much of it is the latter. That's not to say that I think the current path we're on is a good one. I just think it's more predictable than you seem to think. Given its predictability, I feel somewhat confident in the following statements: Eventually, when aging is cured, people will adopt policies that give people the choice to die. Eventually, when artificial meat is very cheap and tasty, people will ban animal-based meat. I'm not predicting these outcomes because I am confusing what I hope for with what I think will happen. I just genuinely think that human virtue-signaling dynamics will be favorable to those outcomes. I'm less confident, leaning pessimistic, about these questions: I don't think humans will inevitably care about wild animal suffering. I don't think humans will inevitably create a post-human utopia where people can modify their minds into any sort of blissful existence they imagine, and I don't think humans will inevitably care about subroutine suffering. It's these questions that make me uneasy about the future.
Outer alignment and imitative amplification

I may have asked this already somewhere, but do you know if there's a notion of "outer aligned" that is applicable to oracles/predictors in general, as opposed to trying to approximate/predict HCH specifically? Basically the problem is that I don't know what "aligned" or "trying to do what we want" could mean in the general case. Is "outer alignment" meant to be applicable in the general case?

This post talks about outer alignment of the loss function. Do you think it also makes sense to talk about outer alignment of the training process as a whole, so that

... (read more)
1Evan Hubinger1yAnother thing that maybe I didn't make clear previously: I agree, but if you're instructing your humans not to instantiate arbitrary Turing machines, then that's a competitiveness claim, not an alignment claim. I think there are lots of very valid reasons for thinking that HCH is not competitive—I only said I was skeptical of the reasons for thinking it wouldn't be aligned.
1Evan Hubinger1yI'm not exactly sure what you're asking here. I would call that an outer alignment failure, but only because I would say that the ways in which your loss function can be hacked are part of the specification of your loss function. However, I wouldn't consider an entire training process to be outer aligned—rather, I would just say that an entire training process is aligned. I generally use outer and inner alignment to refer to different components of aligning the training process—namely the objective/loss function/environment in the case of outer alignment and the inductive biases/architecture/optimization procedure in the case of inner alignment (though note that this is a more general definition than the one used in “Risks from Learned Optimization [https://www.alignmentforum.org/s/r9tYkB2a8Fp4DN8yB],” as it makes no mention of mesa-optimizers, though I would still say that mesa-optimization is my primary example of how you could get an inner alignment failure). Yes, though in the definition I gave here I just used the model class of all functions, which is obviously too large but has the nice property of being a fully general definition. I would include all possible input/output channels in the domain/codomain of the model when interpreted as a function. I generally think you need HBO and am skeptical that LBO can actually do very much.
Outer alignment and imitative amplification

Aside from some quibbles, this matches my understanding pretty well, but may leave the reader wondering why Paul Christiano and Ought decided to move away from imitative amplification to approval-based amplification. To try to summarize my understanding of their thinking (mostly from an email conversation in September of last year between me, you (Evan), Paul Christiano, and William Saunders):

  • William (and presumably Paul) think approval-based amplification can also be outer aligned. (I do not have a good understanding of why they think this, and William said "still
... (read more)
[AN #80]: Why AI risk might be solved without additional intervention from longtermists

It seems that the interviewees here either:

  1. Use "AI risk" in a narrower way than I do.
  2. Neglected to consider some sources/forms of AI risk (see above link).
  3. Have considered other sources/forms of AI risk but do not find them worth addressing.
  4. Are worried about other sources/forms of AI risk but they weren't brought up during the interviews.

Can you talk about which of these is the case for yourself (Rohin) and for anyone else whose thinking you're familiar with? (Or if any of the other interviewees would like to chime in for themselves?)

2Paul Christiano1yFor context, here's the one time in the interview I mention "AI risk" (quoting 2 earlier paragraphs for context): (But it's still the case that, when asked "Can you explain why it's valuable to work on AI risk?", I responded by almost entirely talking about AI alignment, since that's what I work on and the kind of work where I have a strong view about cost-effectiveness.)
2Rohin Shah1yWe discussed this here [https://www.lesswrong.com/posts/TdwpN484eTbPSvZkm/rohin-shah-on-reasons-for-ai-optimism#RZDyAGYX69TeBKJjh] for my interview; my answer is the same as it was then (basically a combination of 3 and 4). I don't know about the other interviewees.
Is the term mesa optimizer too narrow?

When the brain makes a decision, it usually considers at most three or four alternatives for each action it takes. Most of the actual work is therefore done at the heuristics stage, not at the selection stage. And even at the selection stage, I have little reason to believe that it is actually comparing alternatives against an explicit objective function.

Assuming this, it seems to me that the heuristics are being continuously trained by the selection stage, so that is the most important part even if heuristics are doing most of the immediate work in making e

... (read more)
1Matthew Barnett1yIf the heuristics are continuously being trained, and this is all happening by comparing things against some criterion that's encoded within some other neural network, I suppose that's a bit like saying that we have an "objective function." I wouldn't call it explicit, though, because to call something explicit means that you could extract the information content easily. I predict that extracting any sort of coherent or consistent reward function from the human brain will be very difficult. I am only using the definition given. The definition clearly states that the objective function must be "explicit," not "implicit." This is important; as Rohin mentioned below, this definition naturally implies that one way of addressing inner alignment will be to use some transparency procedure to extract the objective function used by the neural network we are training. However, if neural networks don't have clean, explicit internal objective functions, this technique becomes a lot harder, and might not be as tractable as other approaches. I actually agree that I didn't adequately argue this point. Right now I'm trying to come up with examples, and I estimate about a 50% chance that I'll write a post about this in the future giving detailed examples. For now, my argument can be summed up by saying that, logically, if humans are not mesa optimizers, yet humans are dangerous, then you don't need a mesa optimizer to produce malign generalization.
Is the term mesa optimizer too narrow?

First, I think by this definition humans are clearly not mesa optimizers.

I'm confused/unconvinced. Surely the 9/11 attackers, for example, must have been "internally searching through a search space (consisting of possible outputs, policies, plans, strategies, or similar) looking for those elements that score high according to some objective function that is explicitly represented within the system"? Can you give some examples of humans being highly dangerous without having done this kind of explicit optimization?

As far as I can tell, Hjalmar Wijk introduc

... (read more)
3Matthew Barnett1yETA: I agree that if someone were to, e.g., write a spreadsheet of all the things they could do, and write the costs of those actions, and then choose the one with the lowest cost, this would certainly count. And maybe terrorist organizations do a lot of deliberation that meets this kind of criterion. But I am responding to the more typical type of human action: walking around, seeking food, talking to others, working at a job. There are two reasons why we might model something as an optimizer. The first reason is that we know that it is internally performing some type of search over strategies in its head, and then outputting the strategy that ranks highest under some explicit objective function. The second reason is that, given our ignorant epistemic state, our best model of that object is that it is optimizing some goal. We might call the second case the intentional stance [https://en.wikipedia.org/wiki/Intentional_stance], following Dennett. If we could show that the first case was true in humans, then I would agree that humans would be mesa optimizers. However, my primary objection is that we could have better models of what the brain is actually doing. It's often the case that when you don't know how something works, the best way of understanding it is by modeling it as an optimizer. However, once you get to look inside and see what's going on, this way of thinking gives way to better models which take into account the specifics of its operation. I suspect that human brains are well modeled as optimizers from the outside, but that this view falls apart when considering specific cases. When the brain makes a decision, it usually considers at most three or four alternatives for each action it takes. Most of the actual work is therefore done at the heuristics stage, not at the selection stage. And even at the selection stage, I have little reason to believe that it is actually comparing alternatives against an explicit objective function. But since this is all a bit vague, and
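To make the distinction in this exchange concrete, here is a minimal sketch in Python (purely illustrative; the functions and numbers are my own, not anything either commenter specified). The first agent matches the narrow "explicit search against an explicitly represented objective" picture of a mesa optimizer; the second matches the heuristics-plus-cheap-selection picture described above, where only a handful of proposed alternatives are ever compared and there is no standalone objective function to extract.

```python
# Hypothetical sketch contrasting the two pictures discussed above.

def explicit_optimizer(candidate_actions, objective):
    """Mesa-optimizer-like: search the whole space of candidates and return the one
    that scores highest under an explicitly represented objective function."""
    return max(candidate_actions, key=objective)

def heuristic_agent(observation, heuristics):
    """Heuristic-driven: each heuristic proposes a few (action, rough preference) pairs;
    only those few alternatives are ever compared, and there is no single explicit
    objective function anywhere in the system to extract."""
    proposals = []
    for heuristic in heuristics:
        proposals.extend(heuristic(observation))          # typically 3-4 alternatives in total
    best_action, _ = max(proposals, key=lambda p: p[1])   # crude selection among proposals
    return best_action

if __name__ == "__main__":
    # Toy usage; the "objective" stands in for whatever criterion a mesa optimizer
    # would explicitly represent (a modelling assumption, not a claim about brains).
    def objective(action):
        return -abs(action - 7)                           # prefer actions close to 7
    print(explicit_optimizer(range(100), objective))      # searches all 100 options -> 7

    def go_toward(obs):
        return [(obs + 1, 0.8), (obs, 0.2)]               # a cheap learned rule of thumb
    print(heuristic_agent(3, [go_toward]))                # compares only 2 options -> 4
```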
AI Alignment Open Thread October 2019

Human values can change a lot over a short amount of time, to the extent that maybe the commonly used "value drift" is not a good term to describe it. After reading Geoffrey Miller, my current model is that a big chunk of our apparent values comes from the need to do virtue signaling. In other words, we have certain values because it's a lot easier to signal having those values when you really do have them. But the optimal virtues/values to signal can change quickly due to positive and negative feedback loops in the social dynamics around virtue signaling

... (read more)
AI Alignment Open Thread October 2019

More generally, ideology and other forms of loyalty/virtue signaling seem like something that deserves more attention from a human-AI safety perspective

For people looking into this in the future, here are a couple of academic resources:

... (read more)

Strategic implications of AIs' ability to coordinate at low cost, for example by merging

I don’t agree that cooperation necessarily allows you to have a greater competitive advantage. It’s worth seeing why this is true in the case of evolution, as I think it carries over to the AI case. Naively, organisms that cooperate would always enjoy some advantages, since they would never have to fight for resources. However, this naive model ignores the fact that genes are selfish: if there is a way to reap the benefits of cooperation without having to pay the price of giving up resources, then organisms will pursue this strategy instead.

I'm definite

... (read more)
Strategic implications of AIs' ability to coordinate at low cost, for example by merging

Here, I think the biggest weakness in the argument is the assumption that powerful AIs should be described as having explicit utility functions.

I gave a more general argument in response to Robin Hanson, which doesn't depend on this assumption. Curious if you find that more convincing.

1Matthew Barnett1yThanks for the elaboration. You quoted Robin Hanson as saying My model says that this is about right. It generally takes a few more things for people to cooperate, such as common knowledge of perfect value matching, common knowledge of willingness to cooperate, and an understanding of the benefits of cooperation. By assumption, AIs will become smarter than humans, which makes me think they will understand the benefits of cooperation better than we do. But this understanding won't be gained "all at once" but will instead be continuous with the past. This is essentially why I think cooperation will be easier in the future, but that it will more-or-less follow a gradual transition from our current trends (I think cooperation has been increasing globally in the last few centuries anyway, for similar reasons). I agree that we will be able to search over a larger space of mind-design, and I also agree that this implies that it will be easier to find minds that cooperate. I don't agree that cooperation necessarily allows you to have a greater competitive advantage. It's worth seeing why this is true in the case of evolution, as I think it carries over to the AI case. Naively, organisms that cooperate would always enjoy some advantages, since they would never have to fight for resources. However, this naive model ignores the fact that genes are selfish: if there is a way to reap the benefits of cooperation without having to pay the price of giving up resources, then organisms will pursue this strategy instead. This is essentially the same argument that evolutionary game theorists have used to explain the evolution of aggression [https://en.wikipedia.org/wiki/Evolutionary_game_theory#Hawk_Dove], as I understand it. Of course, there are some simplifying assumptions which could be worth disputing.
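For readers who do not follow the Hawk-Dove link above, here is a minimal sketch in Python of the standard textbook model (the payoff numbers are illustrative assumptions of mine, not from the comment). When the value V of the contested resource is smaller than the cost C of a fight, neither pure cooperation (Dove) nor pure aggression (Hawk) is stable, and the evolutionarily stable mix has a hawk fraction of V/C, which is the sense in which some non-cooperation persists even though an all-dove population would average higher payoffs.

```python
# Hawk-Dove payoffs (standard textbook form): V = value of resource, C = cost of a fight.
V, C = 4.0, 10.0  # illustrative numbers only

def payoff(strategy, opponent):
    if strategy == "H" and opponent == "H":
        return (V - C) / 2   # fight: win half the time, pay the cost half the time
    if strategy == "H" and opponent == "D":
        return V             # hawk takes the resource uncontested
    if strategy == "D" and opponent == "H":
        return 0.0           # dove backs down
    return V / 2             # two doves share

def expected_payoff(strategy, p_hawk):
    return p_hawk * payoff(strategy, "H") + (1 - p_hawk) * payoff(strategy, "D")

if __name__ == "__main__":
    ess = V / C  # hawk fraction at the mixed evolutionarily stable state (requires V < C)
    print(f"Hawk fraction at equilibrium: {ess:.2f}")
    for p in (0.0, ess, 1.0):
        print(f"p_hawk={p:.2f}: hawk payoff={expected_payoff('H', p):.2f}, "
              f"dove payoff={expected_payoff('D', p):.2f}")
    # At p=0 hawks do better (so they invade doves); at p=1 doves do better; at p=V/C
    # the payoffs are equal, so aggression persists at that level.
```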
AI Alignment Open Thread October 2019

Question for those (such as Paul Christiano) who are both optimistic about corrigibility as a central method for AI alignment and think that large corporations or other large organizations (such as Google) will build the first AGIs. A corrigible AI built by Google will likely be forced to share Google's ideological commitments: in other words, to assign zero or near-zero probability to beliefs that are politically unacceptable within Google, and to maintain that probability against whatever evidence exists in the world. Is this something that you have tho

... (read more)
4Wei Dai1yFor people looking into this in the future, here are a couple of academic resources: * Timur Kuran's Private Truths, Public Lies [https://www.hup.harvard.edu/catalog.php?isbn=9780674707580], which introduced the term preference falsification [https://en.wikipedia.org/wiki/Preference_falsification] (which I'm surprised not to have seen more mentions or discussion of around here, given how obviously relevant it is to AI alignment) * Geoffrey Miller's Sexual selection for moral virtues [https://www.jstor.org/stable/pdf/10.1086/517857.pdf] or his book Virtue Signaling [https://www.primalpoly.com/virtue-signaling-2019], which includes that paper as a chapter. See also his further readings for the book [https://www.primalpoly.com/virtue-signaling-further-reading].
The Credit Assignment Problem

One part of it is that I want to scrap classical (“static”) decision theory and move to a more learning-theoretic (“dynamic”) view.

Can you explain more what you mean by this, especially "learning-theoretic"? I've looked at learning theory a bit and the typical setup seems to involve a loss or reward that is immediately observable to the learner, whereas in decision theory, utility can be over parts of the universe that you can't see and therefore can't get feedback from, so it seems hard to apply typical learning theory results to decision theory. I won

... (read more)
1Abram Demski1yMy thinking is somewhat similar to Vanessa's [https://www.lesswrong.com/posts/Ajcq9xWi2fmgn8RBJ/the-credit-assignment-problem#XoSSfnhwQbNzfB63f] . I think a full explanation would require a long post in itself. It's related to my recent thinking about UDT [https://www.lesswrong.com/posts/9sYzoRnmqmxZm4Whf/conceptual-problems-with-udt-and-policy-selection] and commitment races [https://www.lesswrong.com/posts/brXr7PJ2W4Na2EW2q/the-commitment-races-problem] . But, here's one way of arguing for the approach in the abstract. You once asked [https://www.lesswrong.com/posts/J8LTE8CTQfyEMkhnc/reflections-on-pre-rationality] : My contention is that rationality should be about the update process. It should be about how you adjust your position. We can have abstract rationality notions as a sort of guiding star, but we also need to know how to steer based on those. Some examples: * Logical induction can be thought of as the result of performing this transform on Bayesianism; it describes belief states which are not coherent, and gives a rationality principle about how to approach coherence -- rather than just insisting that one must somehow approach coherence. * Evolutionary game theory is more dynamic than the Nash story. It concerns itself more directly with the question of how we get to equilibrium. Strategies which work better get copied. We can think about the equilibria, as we do in the Nash picture; but, the evolutionary story also lets us think about non-equilibrium situations. We can think about attractors (equilibria being point-attractors, vs orbits and strange attractors), and attractor basins; the probability of ending up in one basin or another; and other such things. * However, although the model seems good for studying the behavior of evolved creatures, there does seem to be something missing for artificial agents learning to play games; we don't necessarily want to think of there as being a population which is se

(I don't speak for Abram but I wanted to explain my own opinion.) Decision theory asks, given certain beliefs an agent has, what is the rational action for em to take. But, what are these "beliefs"? Different frameworks have different answers for that. For example, in CDT a belief is a causal diagram. In EDT a belief is a joint distribution over actions and outcomes. In UDT a belief might be something like a Turing machine (inside the execution of which the agent is supposed to look for copies of emself). Learning theory allows us to gain insight through t

... (read more)
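As a concrete illustration of the "dynamic rather than static" framing in Abram's reply above, here is a minimal sketch in Python of discrete-time replicator dynamics (a standard construction, not something Abram specified; the payoffs reuse the illustrative Hawk-Dove numbers from the earlier sketch). Strategies that do better than the population average get copied, so the model describes how a population approaches an equilibrium rather than only characterizing the equilibrium point itself.

```python
# Discrete-time replicator dynamics on a symmetric 2x2 game. Rows/columns are
# (Hawk, Dove) with the illustrative payoffs V=4, C=10, so the population should
# approach roughly 40% hawks from any interior starting point.
payoff_matrix = [[-3.0, 4.0],   # Hawk vs (Hawk, Dove)
                 [0.0,  2.0]]   # Dove vs (Hawk, Dove)

def step(shares, dt=0.1):
    """One update: each strategy's share grows in proportion to how far its
    expected payoff lies above the population average."""
    fitness = [sum(payoff_matrix[i][j] * shares[j] for j in range(2)) for i in range(2)]
    average = sum(shares[i] * fitness[i] for i in range(2))
    new = [shares[i] * (1 + dt * (fitness[i] - average)) for i in range(2)]
    total = sum(new)
    return [s / total for s in new]

if __name__ == "__main__":
    shares = [0.9, 0.1]  # start far from equilibrium: 90% hawks
    for _ in range(200):
        shares = step(shares)
    print(f"hawk share after 200 steps: {shares[0]:.3f}")  # approaches ~0.400
```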
AI Alignment Open Thread October 2019

First, this is more of a social coordination problem—I’m claiming that regular engineering practices allow you to notice when something is wrong before it has catastrophic consequences. You may not be able to solve them; in that case you need to have enough social coordination to no longer deploy them.

Ok, I think it makes sense to be more optimistic about transparency/interpretability allowing people to notice when something is wrong. My original complaint was about people seemingly being optimistic about using it to solve alignment, not just to notice

... (read more)
AI Alignment Open Thread October 2019

Hmm, I think I would make the further claim that in this world regular engineering practices are likely to work well, because they usually work well.

What about the AIs deployed in social media, which many people think are pushing discourse in bad directions, but which remain deployed anyway due to a lack of technical solutions and to economic competition? Aside from "single failure meant that we lose", the failure scenario I usually have in mind is that AI safety/alignment is developed too slowly or is too costly to use, but more and more capable AIs get deploye

... (read more)
2Rohin Shah1yTwo responses: First, this is more of a social coordination problem -- I'm claiming that regular engineering practices allow you to notice when something is wrong before it has catastrophic consequences. You may not be able to solve them; in that case you need to have enough social coordination to no longer deploy them. Second, is there a consensus that recommendation algorithms are net negative? Within this community, that's probably the consensus, but I don't think it's a consensus more broadly. If we can't solve the bad discourse problem, but the recommendation algorithms are still net positive overall, then you want to keep them. (Part of the social coordination problem is building consensus that something is wrong.) For many of the ways they could push human civilization off the rails, I would not expect transparency / interpretability to help. One example would be the scenario in which each AI is legitimately trying to help some human(s), but selection / competitive pressures on the humans lead to sacrificing all values except productivity. I'd predict that most people optimistic about transparency / interpretability would agree with at least that example.
Rohin Shah on reasons for AI optimism

I do lean closer to the stance of “whatever we decide based on some ‘reasonable’ reflection process is good”, which seems to encompass a wide range of futures, and seems likely to me to happen by default.

I think I disagree pretty strongly, and this is likely an important crux. Would you be willing to read a couple of articles that point to what I think is convincing contrary evidence? (As you read the first article, consider what would have happened if the people involved had access to AI-enabled commitment or mind-modification technologies.)

... (read more)
3Rohin Shah1y... I'm not sure why I used the word "we" in the sentence you quoted. (Maybe I was thinking about a group of value-aligned agents? Maybe I was imagining that "reasonable reflection process" meant that we were in a post-scarcity world, everyone agreed that we should be doing reflection, everyone was already safe? Maybe I didn't want the definition to sound like I would only care about what I thought and not what everyone else thought? I'm not sure.) In any case, I think you can change that sentence to "whatever I decide based on some 'reasonable' reflection process is good", and that's closer to what I meant. I am much more uncertain about multiagent interactions. Like, suppose we give every person access to a somewhat superintelligent AI assistant that is legitimately trying to help them. Are things okay by default? I lean towards yes, but I'm uncertain. I did read through those two articles, and I broadly buy the theses they advance; I still lean towards yes because: * Things have broadly become better over time, despite the effects that the articles above highlight. The default prediction is that they continue to get better. (And I very uncertainly think people from the past would agree, given enough time to understand our world?) * In general, we learn reasonably well from experience; we try things and they go badly, but then things get better as we learn from that. * Humans tend to be quite risk-averse at trying things, and groups of humans seem to be even more risk-averse. As a result, it seems unlikely that we try a thing that ends up having a "direct" existentially bad effect. * You could worry about an "indirect" existentially bad effect, along the lines of Moloch [https://slatestarcodex.com/2014/07/30/meditations-on-moloch/], where there isn't any single human's optimization causing bad things to happen, but selection pressure causes problems. Selection pressure has existed for a long time and hasn't caused an existentia