Rafael Harth


Comments

AGI Ruin: A List of Lethalities

This document doesn't look to me like something a lot of people would try to write. Maybe it was one of the most important things to write, but not obviously so. Among the steps (1) get the idea to write out all reasons for pessimism, (2) resolve to try, (3) not give up halfway through, and (4) be capable of doing it, I would not guess that (4) is the strongest filter.

[Intro to brain-like-AGI safety] 6. Big picture of motivation, decision-making, and RL

Yes, but I didn't mean to ask whether it's relevant; I meant to ask whether it's accurate. Does the output of language models, in fact, feel like this? It seemed like something relevant to ask you, since you've seen lots of text completions.

And if it does, what is the reason for not having long timelines? If neural networks only solved the easy part of the problem, that implies that they're a much smaller step toward AGI than many argued recently.

[Intro to brain-like-AGI safety] 6. Big picture of motivation, decision-making, and RL

I think what you get is a person talking with no inhibitions whatsoever. Language models don’t match that.

What do you picture a language model with no inhibitions looking like? Because if I try to imagine it, then "something that outputs reasonable-sounding text until, sooner or later, it fails hard" seems like a decent fit. Of course, I haven't thought much about the generator/assessor distinction.

I mean, surely "inhibitions" of the language model don't map onto human inhibitions, right? Like, a language model without the assessor module (or with a much worse assessor module) is just as likely to imitate someone who sounds unrealistically careful as someone who has no restraints.

I find your last paragraph convincing, but that of course makes me put more credence into the theory rather than less.

[Intro to brain-like-AGI safety] 6. Big picture of motivation, decision-making, and RL

(Extremely speculative comment, please tell me if this is nonsense.)

If it makes sense to differentiate the "Thought Generator" and "Thought Assessor" as two separate modules, is it possible to draw a parallel to language models, which seem to have a strong ability to generate sentences but lack the ability to assess whether they are good?

My first reaction to this is "obviously not, since the architecture is completely different, so why would they map onto each other?", but a possible answer could be "well, if the brain has them as separate modules, it could mean that the two tasks require different solutions, and if one is much harder than the other, and the harder one is the assessor module, that could mean language models would naturally solve just the generation part first".

The related thing that I find interesting is that, a priori, it's not at all obvious that you'd have these two different modules at all (since the thought generator already receives ground-truth feedback). Does this mean the distinction is deeply meaningful? Well, that depends on how close to optimal the [design of the human brain] is.
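To make the picture in my head concrete, here's a toy sketch of what I mean by the two modules. The function names and the scoring are entirely made up by me and aren't anything from the post; the point is just that you can have a strong generator paired with a weak or missing assessor.

```python
import random

# Toy "generate, then assess" loop. Both functions are made-up placeholders
# for the Thought Generator / Thought Assessor idea, not anything from the post.

def generate_candidates(prompt, n=5):
    # Stand-in for the generator: propose several plausible continuations.
    return [f"{prompt} ... candidate {i}" for i in range(n)]

def assess(candidate):
    # Stand-in for the assessor: score how good a candidate is.
    # A weak assessor (here: random scores) leaves output quality
    # entirely up to whatever the generator happens to propose.
    return random.random()

def think(prompt):
    candidates = generate_candidates(prompt)
    return max(candidates, key=assess)

print(think("The capital of France is"))
```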

Inner Alignment: Explain like I'm 12 Edition

Thanks! I agree it's an error, of course. I've changed the section; do you think it's accurate now?

[Intro to brain-like-AGI safety] 5. The “long-term predictor”, and TD learning

No; it was just that something about how the post explained it made me think that it wasn't #1.

[Intro to brain-like-AGI safety] 5. The “long-term predictor”, and TD learning

I don't completely get this.

Let's consider the short-term predictor inside the long-term predictor circuit. If it tries to predict [what it itself predicts in 0.3s], then the correct prediction would be to immediately predict the output at whatever point in the future the process terminates (the next ground-truth injection?). In particular, it would always predict the same thing until the ground truth comes in. But if I understand correctly, this is not what's going on.

So, second try: is the short-term predictor really still only trying to predict 0.3s into the future, making it less of a "long-term predictor" and more of an "ongoing process predictor"? And then you get, e.g., the behavior of predicting a little less enzyme production with every step?

Or, third try: is it just trying to minimize something like the sum of squared differences between adjacent predictions, and thus trying to minimize the number of ground-truth injections, and we get the above as an emergent effect?
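To check my understanding of the first reading, here's a tiny numerical sketch. The update rule, step size, and episode structure are all my own guesses rather than the circuit from the post; the only point is that nudging each prediction toward the next step's prediction, with ground truth arriving only at the end, ends up with every step predicting the terminal value.

```python
import numpy as np

# Toy sketch of "predict what the predictor will say one step later".
# All numbers and the update rule are illustrative guesses, not the post's model.

alpha = 0.5          # learning rate
steps = 10           # time steps between ground-truth injections
ground_truth = 1.0   # signal injected at the end of each episode

pred = np.zeros(steps + 1)  # the predictor's output at each time step

for episode in range(50):
    for t in range(steps):
        # Nudge the prediction at time t toward the prediction at t + 1.
        pred[t] += alpha * (pred[t + 1] - pred[t])
    # Ground truth arrives at the end and corrects the final prediction.
    pred[steps] += alpha * (ground_truth - pred[steps])

print(np.round(pred, 2))
# Once learning settles, every step predicts (roughly) the terminal value,
# i.e., "always predict the same thing until the ground truth comes in".
```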

Inner Alignment: Explain like I'm 12 Edition

Author here. One thing I think I've done wrong in the post is to equate black-box-search-in-large-parametrized-space with all of machine learning. I've now added this paragraph at the end of chapter 1:

Admittedly, the inner alignment model is not maximally general. In this post, we've looked at black box search, where we have a parametrized model and do SGD to update the parameters. This describes most of what Machine Learning is up to in 2020, but it does not describe what the field did pre-2000 and, in the event of a paradigm shift similar to the deep learning revolution, it may not describe what the field looks like in the future. In the context of black box search, inner alignment is a well-defined property and the Venn diagram a valid way of slicing up the problem, but there are people who expect that AGI will not be built that way.[1] There are even concrete proposals for safe AI where the concept doesn't apply. Evan Hubinger has since written a follow-up post about what he calls "training stories", which is meant to be "a general framework through which we can evaluate any proposal for building safe advanced AI".
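(As a concrete illustration of what I mean by black box search: a parametrized model whose parameters get pushed around by gradient descent on a loss, with no constraint on how the model internally solves the task. A minimal sketch, with an arbitrary made-up task:)

```python
import numpy as np

# Minimal sketch of black box search: a parametrized model plus gradient
# descent on a loss. The task (fit y = 3x + 1) is an arbitrary stand-in.

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=100)
y = 3 * x + 1 + 0.1 * rng.normal(size=100)

theta = np.zeros(2)  # the parametrized model: y_hat = theta[0] * x + theta[1]
lr = 0.1

for step in range(500):
    y_hat = theta[0] * x + theta[1]
    err = y_hat - y
    # Gradient of the mean squared error (full batch here; SGD would subsample).
    grad = 2 * np.array([(err * x).mean(), err.mean()])
    theta -= lr * grad

print(theta)  # ends up near [3, 1]; the training signal says nothing about
              # *how* the learned parameters represent the task, which is
              # where the inner alignment worry comes from.
```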

I also converted the post to markdown, mostly for the footnotes (the previous version just had little superscripts written via the math mode).


  1. If an AGI does contain more hand-coded parts, the picture gets more complicated. E.g., if a system is logically separated into a bunch of components, inner alignment may apply to some of the components but not others. It may even apply to parts of biological systems; see, e.g., Steven Byrnes's Inner Alignment in the Brain. ↩︎

Morality is Scary

I strongly believe that (1) well-being is objective, (2) well-being is quantifiable, and (3) Open Individualism is true (i.e., the concept of identity isn't well-defined, and you're subjectively no less continuous with the future self of any other person than with your own future self).

If (1-3) are all true, then utilitronium is the optimal outcome for everyone even if they're entirely selfish. Furthermore, I expect an AGI to figure this out, and to the extent that it's aligned, it should communicate that if it's asked. (I don't think an AGI will therefore decide to do the right thing, so this is entirely compatible with everyone dying if alignment isn't solved.)

In the scenario where people get to talk to the AGI freely and it's aligned, two concrete mechanisms I see are (a) people just ask the AGI what is morally correct and it tells them, and (b) they get some small taste of what utilitronium would feel like, which would make it less scary. (A crucial piece is that they can rationally expect to experience this themselves in the utilitronium future.)

In the scenario where people don't get to talk to the AGI, who knows. It's certainly possible that we have a singleton scenario with a few people in charge of the AGI, and they decide to censor questions about ethics because they find the answers scary.

The only org I know of that works on this and shares my philosophical views is QRI. Their goal is to (a) come up with a mathematical space (probably a topological one, maybe a Hilbert space) that precisely describes the subjective experience of someone, (b) find a way to put someone in a scanner and construct that space from the scan, and (c) find a property of that space that corresponds to their well-being in that moment. The flagship theory is that this property is symmetry. Their model is stronger than (1-3), but if it's correct, you could get hard evidence on this before AGI, since it would make strong testable predictions about people's well-being (and they think it could also point to easy interventions, though I don't understand how that works). Whether it's feasible to do this before AGI is a different question. I'd bet against it, but I think I give it better odds than any specific alignment proposal. (And I happen to know that Mike agrees that the future is dominated by concerns about AI and thinks this is the best thing to work on.)
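(To illustrate just the shape of claim (c), not QRI's actual math: you can define toy "symmetry" scores on simple mathematical objects, and the claim is that some such property of the right space tracks well-being. A made-up example:)

```python
import numpy as np

# Purely illustrative toy: a "state" as a matrix, and a made-up symmetry score
# measuring how close it is to its transpose. This is NOT QRI's actual model;
# it only shows what "a property of the space that tracks well-being"
# could look like as a formal, computable quantity.

def symmetry_score(state: np.ndarray) -> float:
    sym = 0.5 * (state + state.T)    # symmetric part
    asym = 0.5 * (state - state.T)   # antisymmetric part
    return float(np.linalg.norm(sym) / (np.linalg.norm(sym) + np.linalg.norm(asym)))

rng = np.random.default_rng(0)
asymmetric_state = rng.normal(size=(4, 4))            # generic, mostly asymmetric
symmetric_state = asymmetric_state + asymmetric_state.T  # perfectly symmetric

print(symmetry_score(asymmetric_state), symmetry_score(symmetric_state))
# The second score is 1.0 by construction.
```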

So, I think their research is the best bet for getting more people on board with utilitronium since it can provide evidence on (1) and (2). (Also has the nice property that it won't work if (1) or (2) are false, so there's low risk of outrage.) Other than that, write posts arguing for moral realism and/or for Open Individualism.

Quantifying suffering before AGI would also plausibly help with alignment, since at least you can formally specify a broad space of outcomes you don't want, though it certainly doesn't solve it, e.g., because of inner optimizers.
