Rafael Harth — AI Alignment Forum

I'm an independent researcher currently working on a sequence of posts about consciousness. You can send me anonymous feedback here: https://www.admonymous.co/rafaelharth.

Oh no, I didn't realize your perspective was this gloomy. But it makes a lot of sense. Actually it mostly comes down to, you can just dispute the consensus^[1] that the classically popular Yudkowskyian/Bostromian views have been falsified by the rise of LLMs. If they haven't, then fast takeoff now is plausible for mostly the same reasons that we used to think it's plausible.

I think the path from here to AGI is bottlenecked by researchers playing with toy models, and publishing stuff on arXiv and GitHub.

I think there is some merit to just asking these people to do something else. Maybe not a lot of merit, but a little more than zero, at least for some of them. Especially if they are on this site. Not with a tweet, but by using your platform here. (Plausibly you have already considered this and have good reasons for why it's a terrible idea, but it felt worth suggesting.)

I'm not sure if this is in fact a consensus, but it sure feels that way ↩︎

I really don't think this is a reasonable measure for ability to do long term tasks, but I don't have the time or energy to fight this battle, so I'll just register my prediction that this paper is not going to age well.

Instead of "have LLMs generated novel insights", how about "have LLMs demonstrated the ability to identify which views about a non-formal topic make more or less sense?" This question seems easier to operationalize and I suspect points at a highly related ability.

As someone who expects LLMs to be a dead end, I nonetheless think this post makes a valid point and does so using reasonable and easy to understand arguments. I voted +1.

I don't find this framing compelling. Particularly wrt to this part:

Obedience — AI that obeys the intention of a human user can be asked to help build unsafe AGI, such as by serving as a coding assistant. (Note: this used to be considered extremely sci-fi, and now it's standard practice.)

I grant the point that an AI that does what the user wants can still be dangerous (in fact it could outright destroy the world). But I'd describe that situation as "we successfully aligned AI and things went wrong anyway" rather than "we failed to align AI". I grant that this isn't obvious; it depends on how exactly AI alignment is defined. But the post frames its conclusions as definitive rather than definition-dependent, which I don't think is correct.

Is the-definition-of-alignment-which-makes-alignment-in-isolation-a-coherent-concept obviously not useful? Again, I don't think so. If you believe that "AI destroying the world because it's very hard to specify a utility function that doesn't destroy the world" is a much larger problem than "AI destroying the world because it obeys the wrong group of people", then alignement (and obedience in particular) is a concept useful in isolation. In particular, it's... well, it's not definitely helpful, so your introductory sentence remains literally true, but it's very likely helpful. The important thing is does make sense to work on obedience without worrying about how it's going to be applied because increasing obedience is helpful in expectation. It could remain helpful in expectation even if it accelerates timelines. And note that this remains true even if you do define Alignment in a more ambitious way.

I'm aware that you don't have such a view, but again, that's my point; I think this post is articulating the consequences of a particular set of beliefs about AI, rather than pointing out a logical error that other people make, which is what its framing suggests.

The post defending the claim is Reward is not the optimization target. Iirc, TurnTrout has described it as one of his most important posts on LW.

I know he's talking about alignment, and I'm criticizing that extremely strong claim. This is the main thing I wanted to criticize in my comment! I think the reasoning he presents is not much supported by his publicly available arguments.

Ok, I don't disagree with this. I certainly didn't develop a gears-level understanding of why [building a brain-like thing with gradient descent on giant matrices] is doomed after reading the 2021 conversations. But that doesn't seem very informative either way; I didn't spend that much time trying to grok his arguments.

I also don't really get your position. You say that,

[Eliezer] confidently dismisses ANNs

but you haven't shown this!

In Surface Analogies and Deep Causes, I read him as saying that neural networks don't automatically yield intelligence just because they share surface similarities with the brain. This is clearly true; at the very least, using token-prediction (which is a task for which (a) lots of training data exist and (b) lots of competence in many different domains is helpful) is a second requirement. If you take the network of GPT-4 and trained it to play chess instead, you won't get something with cross-domain competence.
In Failure by Analogy he makes a very similar abstract point -- and wrt to neural networks in particular, he says that the surface similarity to the brain is a bad reason to be confident in them. This also seems true. Do you really think that neural networks work because they are similar to brains on the surface?

You also said,

The important part is the last part. It's invalid. Finding a design X which exhibits property P, doesn't mean that for design Y to exhibit property P, Y must be very similar to X.

But Eliezer says this too in the post you linked! (Failure by Analogy). His example of airplanes not flapping is an example where the design that worked was less close to the biological thing. So clearly the point isn't that X has to be similar to Y; the point is that reasoning from analogy doesn't tell you this either way. (I kinda feel like you already got this, but then I don't understand what point you are trying to make.)

Which is actually consistent with thinking that large ANNs will get you to general intelligence. You can both hold that "X is true" and "almost everyone who thinks X is true does so for poor reasons". I'm not saying Eliezer did predict this, but nothing I've read proves that he didn't.

Also -- and this is another thing -- the fact that he didn't publicly make the prediction "ANNs will lead to AGI" is only weak evidence that he didn't privately think it because this is exactly the kind of prediction you would shut up about. One thing he's been very vocal on is that the current paradigm is bad for safety, so if he was bullish about the potential of that paradigm, he'd want to keep that to himself.

Didn't he? He at least confidently rules out a very large class of modern approaches.

Relevant quote:

because nothing you do with a loss function and gradient descent over 100 quadrillion neurons, will result in an AI coming out the other end which looks like an evolved human with 7.5MB of brain-wiring information and a childhood.

In that quote, he only rules out a large class of modern approaches to alignment, which again is nothing new; he's been very vocal about how doomed he thinks alignment is in this paradigm.

Something Eliezer does say which is relevant (in the post on Ajeya's biology anchors model) is

Or, more likely, it's not MoE [mixture of experts] that forms the next little trend. But there is going to be something, especially if we're sitting around waiting until 2050. Three decades is enough time for some big paradigm shifts in an intensively researched field. Maybe we'd end up using neural net tech very similar to today's tech if the world ends in 2025, but in that case, of course, your prediction must have failed somewhere else.

So here he's saying that there is a more effective paradigm than large neural nets, and we'd get there if we don't have AGI in 30 years. So this is genuinely a kind of bearishness on ANNs, but not one that precludes them giving us AGI.

This document doesn't look to me like something a lot of people would try to write. Maybe it was one of the most important things to write, but not obviously so. Among the steps (1) get the idea to write out all reasons for pessimism, (2) resolve to try, (3) not give up halfway through, and (4) be capable, I would not guess that 4 is the strongest filter.

Yes, but I didn't mean to ask whether it's relevant, I meant to ask whether it's accurate. Does the output of language models, in fact, feel like this? Seemed like something relevant to ask you since you've seen lots of text completions.

And if it does, what is the reason for not having long timelines? If neural networks only solved the easy part of the problem, that implies that they're a much smaller step toward AGI than many argued recently.

AI ALIGNMENT FORUM
AF

AI ALIGNMENT FORUM
AF

Sequences

Posts

Wikitag Contributions

Comments