Rafael Harth

Comments

Inner Alignment: Explain like I'm 12 Edition

Author here. One thing I think I've done wrong in the post is to equate black-box-search-in-large-parametrized-space with all of machine learning. I've now added this paragraph at the end of chapter 1:

Admittedly, the inner alignment model is not maximally general. In this post, we've looked at black box search, where we have a parametrized model and do SGD to update the parameters. This describes most of what Machine Learning is up to in 2020, but it does not describe what the field did pre-2000 and, in the event of a paradigm shift similar to the deep learning revolution, it may not describe what the field looks like in the future. In the context of black box search, inner alignment is a well-defined property and the Venn diagram is a valid way of slicing up the problem, but there are people who expect that AGI will not be built that way.[1] There are even concrete proposals for safe AI where the concept doesn't apply. Evan Hubinger has since written a follow-up post about what he calls "training stories", which is meant to be "a general framework through which we can evaluate any proposal for building safe advanced AI".
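Here is a tiny, self-contained toy sketch of what "black box search" means in that paragraph: a parametrized model whose parameters get updated by SGD on a loss. The example (logistic regression on random data) is entirely my own and not from the post:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 10))                     # toy inputs
y = (X @ rng.normal(size=10) > 0).astype(float)    # toy labels

theta = np.zeros(10)   # the "model" is just this parameter vector
lr = 0.1

for step in range(2000):
    idx = rng.integers(0, len(X), size=32)         # random minibatch (the "S" in SGD)
    xb, yb = X[idx], y[idx]
    preds = 1 / (1 + np.exp(-(xb @ theta)))        # sigmoid predictions
    grad = xb.T @ (preds - yb) / len(yb)           # gradient of the cross-entropy loss
    theta -= lr * grad                             # SGD step on the parameters

acc = ((1 / (1 + np.exp(-(X @ theta))) > 0.5) == (y > 0.5)).mean()
print("training accuracy:", acc)
```

The parameters are selected only for scoring well on the loss; everything else about what the trained model computes is constrained only indirectly, which is where the inner alignment question comes from.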

I also converted the post to markdown, mostly for the footnotes (the previous version just had little superscripts written via the math mode).


  1. If an AGI does contain more hand-coded parts, the picture gets more complicated. E.g., if a system is logically separated into a bunch of components, inner alignment may apply to some of the components but not others. It may even apply to parts of biological systems; see, e.g., Steven Byrnes's Inner Alignment in the Brain. ↩︎

Morality is Scary

I strongly believe that (1) well-being is objective, (2) well-being is quantifiable, and (3) Open Individualism is true (i.e., the concept of identity isn't well-defined, and you're subjectively no less continuous with the future self of any other person than with your own future self).

If (1-3) are all true, then utilitronium is the optimal outcome for everyone even if they're entirely selfish. Furthermore, I expect an AGI to figure this out, and to the extent that it's aligned, it should communicate that if it's asked. (I don't think an AGI will therefore decide to do the right thing, so this is entirely compatible with everyone dying if alignment isn't solved.)

In the scenario where people get to talk to the AGI freely and it's aligned, two concrete mechanisms I see are (a) people just ask the AGI what is morally correct and it tells them, and (b) they get some small taste of what utilitronium would feel like, which would make it less scary. (A crucial piece is that they can rationally expect to experience this themselves in the utilitronium future.)

In the scenario where people don't get to talk to the AGI, who knows. It's certainly possible that we have a singleton scenario with a few people in charge of the AGI, and they decide to censor questions about ethics because they find the answers scary.

The only org I know of that works on this and shares my philosophical views is QRI. Their goal is to (a) come up with a mathematical space (probably a topological one, maybe a Hilbert space) that precisely describes someone's subjective experience, (b) find a way to put someone in a scanner and construct that space for them, and (c) find a property of that space that corresponds to their well-being in that moment. The flagship theory is that this property is symmetry. Their model is stronger than (1-3), but if it's correct, you could get hard evidence on this before AGI since it would make strong, testable predictions about people's well-being (and they think it could also point to easy interventions, though I don't understand how that works). Whether it's feasible to do this before AGI is a different question. I'd bet against it, but I think I give it better odds than any specific alignment proposal. (And I happen to know that Mike agrees that the future is dominated by concerns about AI and thinks this is the best thing to work on.)

So, I think their research is the best bet for getting more people on board with utilitronium since it can provide evidence on (1) and (2). (Also has the nice property that it won't work if (1) or (2) are false, so there's low risk of outrage.) Other than that, write posts arguing for moral realism and/or for Open Individualism.

Quantifying suffering before AGI would also plausibly help with alignment, since at least you can formally specify a broad space of outcomes you don't want, though it certainly doesn't solve the problem, e.g., because of inner optimizers.

Morality is Scary

I don't see any reason why this couldn't happen. My position is something like "morality is real, probably precisely quantifiable; it seems plausible that in the scenario of humans with autonomy and aligned AI, this could lead to an asymmetry where more people tend toward utilitronium over time". (Hence my reply; you didn't seem to consider that possibility.) I could make up some mechanisms for this, but you probably don't need me for that. It also seems plausible that this doesn't happen. If it doesn't happen, maybe the people who get to decide what happens with the rest of the universe tend toward utilitronium. But my model is highly uncertain and doesn't rule out futures of highly suboptimal personal utopias that persist indefinitely.

Morality is Scary

This comment seems to be consistent with the assumption that the outcome 1 year after the singularity is locked in forever. But the future we're discussing here is one where humans retain autonomy (?), and in that case, they're allowed to change their mind over time, especially if humanity has access to a superintelligent aligned AI. I think a future where we begin with highly suboptimal personal utopias and gradually transition into utilitronium is among the more plausible outcomes. Compared with other outcomes where Not Everyone Dies, anyway. Your credence may differ if you're a moral relativist.

Biology-Inspired AGI Timelines: The Trick That Never Works

1: To me, it made it more entertaining and thus easier to read. (No idea about non-anecdotal data, would also be interested.)

3: Also no data; I strongly suspect the metric is generally good because... actually I think it's just because the people I find worth listening to are overwhelmingly not condescending. This post seems highly unusual in several ways.

Biology-Inspired AGI Timelines: The Trick That Never Works

Is Humbali right that generic uncertainty about maybe being wrong, without other extra premises, should increase the entropy of one's probability distribution over AGI, thereby moving out its median further away in time?

My answer to this is twofold:

First, no update whatsoever should take place because a probability distribution already expresses uncertainty, and there's no mechanism by which the uncertainty increased. Adele Lopez independently (and earlier) came up with the same answer.

Second, if there were an update -- say EY learned "one of the steps used in my model was wrong" -- this should indeed change the distribution. However, it should change it toward the prior distribution. It's completely unclear what the prior distribution is, but there is no rule whatsoever that says "more entropy = more prior-y", as shown by the fact that a uniform distribution over the next n years (for any n) has extremely high entropy yet makes a ludicrously confident prediction, namely that AGI arrives within n years.
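To make the "more entropy does not mean less confidence about timelines" point concrete, here is a small sketch with toy numbers of my own choosing (the horizon, window sizes, and years are all made up for illustration): the maximum-entropy distribution on a bounded window still asserts with certainty that AGI arrives inside that window, while a lower-entropy distribution can put its median far later.

```python
import numpy as np

def entropy_bits(p):
    """Shannon entropy (in bits) of a discrete distribution."""
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def median_year(years, p):
    """First year at which the cumulative probability reaches one half."""
    return int(years[np.cumsum(p) >= 0.5 - 1e-9][0])

years = np.arange(1, 301)  # "years from now", a toy 300-year horizon

# A: uniform over the next 30 years (maximum entropy on that support,
#    yet it asserts P(AGI within 30 years) = 1).
p_a = np.where(years <= 30, 1 / 30, 0.0)

# B: uniform over years 200-209 (lower entropy than A, but a far later median).
p_b = np.where((years >= 200) & (years <= 209), 1 / 10, 0.0)

for name, p in [("A", p_a), ("B", p_b)]:
    print(name, round(entropy_bits(p), 2), "bits, median ~ year", median_year(years, p))
# A: ~4.91 bits, median ~ year 15   (more entropy, earlier median)
# B: ~3.32 bits, median ~ year 204  (less entropy, later median)
```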

See also Information Charts (second chapter). Being under-confident/losing confidence does not have to shift your probability toward the 50% mark; it shifts it toward the prior from wherever it was before, and the prior can be literally any probability. If it were universally held that AGI happens in 5 years, then this could be the prior, and updating downward on EY's gears-level model would move the probability toward shorter timelines.
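And as a toy illustration of the "it shifts toward the prior, wherever that happens to be" point (all numbers are hypothetical, chosen only to show the direction of the effect): blending a gears-level model with a prior via a simple linear pool and then reducing the weight on the model moves the estimate toward the prior, not toward 50%.

```python
def blended(p_model: float, p_prior: float, model_weight: float) -> float:
    """Linear pool: weighted average of the model's probability and the prior's."""
    return model_weight * p_model + (1 - model_weight) * p_prior

p_model = 0.20  # hypothetical gears-level model: P(AGI within 5 years)
p_prior = 0.90  # hypothetical prior, e.g. "everyone expects AGI within 5 years"

for w in (1.0, 0.7, 0.4, 0.1):
    print(f"weight on model {w:.1f} -> P(AGI within 5 years) = {blended(p_model, p_prior, w):.2f}")
# weight on model 1.0 -> 0.20
# weight on model 0.7 -> 0.41
# weight on model 0.4 -> 0.62
# weight on model 0.1 -> 0.83
```

Losing confidence in the model here pushes the estimate toward 0.90, i.e., toward shorter timelines, not toward maximum uncertainty.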

Soares, Tallinn, and Yudkowsky discuss AGI cognition

The total absence of obvious output of this kind from the rest of the "AI safety" field even in 2020 causes me to regard them as having less actual ability to think in even a shallowly adversarial security mindset, than I associate with savvier science fiction authors. Go read fantasy novels about demons and telepathy, if you want a better appreciation of the convergent incentives of agents facing mindreaders than the "AI safety" field outside myself is currently giving you.

While this may be a fair criticism, I feel like someone ought to point out that the vast majority of AI safety output (at least that I see on LW) isn't trying to do anything like "sketch a probability distribution over the dynamics of an AI project that is nearing AGI". This includes all technical MIRI papers I'm familiar with.

Perhaps we should be doing this (though isn't that more AI forecasting/strategy than alignment? Of course it's still AI safety), but then the failure isn't "no-one has enough security mindset" but rather something like "no-one has the social courage to tackle the problems that are actually important". (This would be more similar to EY's critique in the Discussion on AGI interventions post.)

Yudkowsky and Christiano discuss "Takeoff Speeds"

Yeah, it's fixed now. Thanks for pointing it out.

Yudkowsky and Christiano discuss "Takeoff Speeds"

Survey on model updates from reading this post. Figuring out to what extent this post has led people to update may inform whether future discussions are valuable.

Results: (just posting them here, doesn't really need its own post)

The question was to rate agreement on the 1=Paul to 9=Eliezer axis before and after reading this post.

Data points: 35

Mean:

Median:

Graph of distribution before (blue) and after (red) and of mean shifts based on prior position (horizontal bar chart).

Raw Data

Anonymous Comments:

Agreement more on need for actions than on probabilities. Would be better to first present points of agreement (that it is at least possible for non(dangerously)-general AI to change situation).

the post was incredibly confusing to me and so I haven't really updated at all because I don't feel like I can crisply articulate yudkowsky's model or his differences with christiano

Ngo and Yudkowsky on AI capability gains

One of my updates from reading this is that Rapid vs. Gradual takeoff seems like an even more important variable in many people's models than I had assumed. Making this debate less one-sided might thus be super valuable even if writing up arguments is costly.
