Someone who is interested in learning and doing good.
My Twitter: https://twitter.com/MatthewJBar
My Substack: https://matthewbarnett.substack.com/
If the AI’s “understanding of human values” is a specific set of 4000 unlabeled nodes out of a trillion-node unlabeled world-model, and we can never find them, then the existence of those nodes isn’t directly helpful. You need a “hook” into it, to connect those nodes to motivation, presumably. I think that’s what you’re missing. No “hook”, no alignment. So how do we make the “hook”?
I'm claiming that the the value identification function is obtained by literally just asking GPT-4 what to do in the situation you're in. That doesn't involve any internal search over the human utility function embedded in GPT-4's weights. I think GPT-4 can simply be queried in natural language for ethical advice, and it's pretty good at offering ethical advice in most situations that you're ever going to realistically encounter. GPT-4 is probably not human-level yet on this task, although I expect it won't be long before GPT-N is about as good at knowing what's ethical as your average human; maybe it'll even be a bit more ethical.
(But yes, this isn't the same as motivating GPT-4 to act on human values. I addressed this in my original comment though.)
I think [GPT-4 is pretty good at distinguishing valuable from non-valuable outcomes] in-distribution. I think MIRI people would be very interested in questions like “what transhumanist utopia will the AI be motivated to build?”, and it’s very unclear to me that GPT-4 would come to the same conclusions that CEV or whatever would come to. See the FAQ item on “concept extrapolation” here.
I agree that MIRI people are interested in things like "what transhumanist utopia will the AI be motivated to build" but I think saying that this is the hard part of the value identification problem is pretty much just moving the goalposts from what I thought the original claim was. Very few, if any, humans can tell you exactly how to build the transhumanist utopia either. If the original thesis was "human values are hard to identify because it's hard to extract all the nuances of value embedded in human brains", now the thesis is becoming "human values are hard to identify because literally no one knows how to build the transhumanist utopia".
But we don't need AIs to build a utopia immediately! If we actually got AI to follow common-sense morality, it would follow from common-sense morality that you shouldn't do anything crazy and irreversible right away, like killing all the humans. Instead, you'd probably want to try to figure out, with the humans, what type of utopia we ought to build.
Recently many people have talked about whether MIRI people (mainly Eliezer Yudkowsky, Nate Soares, and Rob Bensinger) should update on whether value alignment is easier than they thought given that GPT-4 seems to understand human values pretty well. Instead of linking to these discussions, I'll just provide a brief caricature of how I think this argument has gone in the places I've seen it. Then I'll offer my opinion that, overall, I do think that MIRI people should probably update in the direction of alignment being easier than they thought, despite their objections.
Here's my very rough caricature of the discussion so far, plus my contribution:
Non-MIRI people: "Eliezer talked a great deal in the sequences about how it was hard to get an AI to understand human values. For example, his essay on the Hidden Complexity of Wishes made it sound like it would be really hard to get an AI to understand common sense. Actually, it turned out that it was pretty easy to get an AI to understand common sense, since LLMs are currently learning common sense. MIRI people should update on this information."
MIRI people: "You misunderstood the argument. The argument was never about getting an AI to understand human values, but about getting an AI to care about human values in the first place. Hence 'The genie knows but does not care'. There's no reason to think that GPT-4 cares about human values, even if it can understand them. We always thought the hard part of the problem was about inner alignment, or, pointing the AI in a direction you want. We think figuring out how to point an AI in whatever direction you choose is like 99% of the problem; the remaining 1% of the problem is getting it to point at the "right" set of values."
I agree that MIRI people never thought the problem was about getting AI to merely understand human values, and that they have always said there was extra difficulty in getting an AI to care about human values. But I distinctly recall MIRI people making a big deal about how the value identification problem would be hard. The value identification problem is the problem of creating a function that correctly distinguishes valuable from non-valuable outcomes.
I claim that GPT-4 is pretty good at distinguishing valuable from non-valuable outcomes, unless you require something that vastly exceeds human performance on this task. In other words, GPT-4 looks like it's on a path towards an adequate solution to the value identification problem, where "adequate" means "about as good as humans". Therefore it is correct for non-MIRI people to point out that that this problem is less difficult than some people assumed in the past.
The supposed reason why the value identification problem was hard is because human value is complex (in fact that's mentioned the central foreseeable difficulty on the Arbital page). Complexity of value was used as an explicit premise in the argument for why AI alignment would be difficult many times in MIRI's history (two examples: 1, 2), and it definitely seems like the reason for this premise was because it was supposed to be an intuition for why the value identification problem would be hard. If the value identification problem was never predicted to be hard, then what was the point of making a fuss about complexity of value in the first place?
It seems to me that the following statements are true:
I previously thought you were saying something very different with (2), since the text in the OP seems pretty different.
FWIW I don't think you're getting things wrong here. I also have simply changed some of my views in the meantime.
That said, I think what I was trying to accomplish with (2) was not that alignment would be hard per se, but that it would be hard to get an AI to do very high-skill tasks in general, which included aligning the model, since otherwise it's not really "doing the task" (though as I said, I don't currently stand by what I wrote in the OP, as-is).
I think I understand my confusion, at least a bit better than before. Here's how I'd summarize what happened.
I had three arguments in this essay, which I thought of as roughly having the following form:
You said that (2) was already answered by the bio anchors model. I responded that bio anchors neglected how difficult it will be to develop AI safely. You replied that it will be easy make models to seemingly do what we want, but that the harder part will be making models that actually do what we want.
My reply was trying to say that the inherent difficulty of building TAI safely was inherently baked into (2) already. That might be a dubious reading of the actual textual argument for (2), but I think that interpretation is backed up by my initial reply to your comment.
The reason why I framed my later reply as being about perceptions was because I think the requisite capability level at which people begin to adopt TAI is an important point about how long timelines will be independent of (1) and (3). In other words, I was arguing that people's perceptions of the capability of AI will cause them wait to adopt AI until it's fully developed in the sense I described above; it won't just delay the effects of TAI after it's fully developed, or before then because of regulation.
Furthermore, I assumed that you were arguing something along the lines of "people will adopt AI once it's capable of only seeming to do what we want", which I'm skeptical of. Hence my reply to you.
My understanding was that you are also skeptical about question 2 on short timelines, and that was what you were arguing with your point (2) on overestimating generality.
Since for point 2 you said "I'm assuming that an AI CEO that does the job of CEO well until the point that it executes a treacherous turn", I am not very skeptical of that right now. I think we could probably have AIs do something that looks very similar to what a CEO would do within, idk, maybe five years.
(Independently of all of this, I've updated towards medium rather than long timelines in the last two years, but mostly because of reflection on other questions, and because I was surprised by the rate of recent progress, rather than because I have fundamental doubts about the arguments I made here, especially (3), which I think is still underrated.
ETA: though also, if I wrote this essay today I would likely fully re-write section (2), since after re-reading it I now don't agree with some of the things I said in it. Sorry if I was being misleading by downplaying how poor some of those points were.)
Sorry for replying to this comment 2 years late, but I wanted to discuss this part of your reasoning,
Fwiw, the problem I think is hard is "how to make models do stuff that is actually what we want, rather than only seeming like what we want, or only initially what we want until the model does something completely different like taking over the world".
I think that's what I meant when I said "I think it will be hard to figure out how to actually make models do stuff we want". But more importantly, I think that's how most people will in fact perceive what it means to get a model to "do what we want".
Put another way, I don't think people will actually start using AI CEOs just because we have a language model that acts like a CEO. Large corporations will likely wait until they're very confident in its reliability, robustness, and alignment. (Although idk, maybe some eccentric investors will find the idea interesting, I just expect that most people will be highly skeptical without strong evidence that it's actually better than a human.)
I think this point can be seen pretty easily in discussion of driverless cars. Regulators are quite skeptical of Tesla's autopilot despite it seeming to do what we want in perhaps over 99% of situations.
If anything, I expect most people to be intuitively skeptical that AI is really "doing what we want" even in cases where it's genuinely doing a better job than humans, and doesn't merely appear that way on the surface. The reason is simple: we have vast amounts of informal data on the reliability of humans, but very little idea how reliable AI will be. That plausibly causes people to start with a skeptical outlook, and only accept AI in safety-critical domains when they've seen it accumulate a long track record of exceptional performance.
For these reasons, I don't fully agree that "one of the major points of the bio anchors framework is to give a reasonable answer to the question of "at what level of scaling might this work"". I mean, I agree that this was what the report was trying to answer, but I disagree that it answered the question of when we will accept and adopt AI for various crucial economic activities, even if such systems were capable of automating everything in principle.
Some people seem to be hoping that nobody will ever make a misaligned human-level AGI thanks to some combination of regulation, monitoring, and enlightened self-interest. That story looks more plausible if we’re talking about an algorithm that can only run on a giant compute cluster containing thousands of high-end GPUs, and less plausible if we’re talking about an algorithm that can run on one 2023 gaming PC.
Isn't the relevant fact whether we could train an AGI with modest computational resources, not whether we could run one? If training runs are curtailed from regulation, then presumably the main effect is that AGI will be delayed until software and hardware progress permits the covert training of an AGI with modest computational resources, which could be a long time depending on how hard it is to evade the regulation.
I don't think this is right -- the main hype effect of chatGPT over previous models feels like it's just because it was in a convenient chat interface that was easy to use and free.
I don't have extensive relevant expertise, but as a personal datapoint: I used Davinci-002 multiple times to generate an interesting dialogue in order to test its capabilities. I ran several small-scale Turing tests, and the results were quite unimpressive in my opinion. When ChatGPT came out, I tried it out (on the day of its release) and very quickly felt that it was qualitatively better at dialogue. Of course, I could have simply been prompting Davinci-002 poorly, but overall I'm quite skeptical that the main reason for ChatGPT hype was that it had a more convenient chat interface than GPT-3.
I have now published a conversation between Ege Erdil and Ronny Fernandez about this post. You can find it here.
Let me restate some of my points, which can hopefully make my position clearer. Maybe state which part you disagree with:
Language models are probability distributions over finite sequences of text.
The “true distribution” of internet text refers to a probability distribution over sequences of text that you would find on the internet (including sequences found on other internets elsewhere in the multiverse, which is just meant as an abstraction).
A language model is “better” than another language model to the extent that the cross-entropy between the true distribution and the model is lower.
A human who writes a sequence of text is likely to write something with a relatively high log probability relative to the true distribution. This is because in a quite literal sense, the true distribution is just the distribution over what humans actually write.
A current SOTA model, by contrast, is likely to write something with an extremely low log probability, most likely because it will write something that lacks long-term coherence, and is inhuman, and thus, won’t be something that would ever appear in the true distribution (or if it appears, it appears very very very rarely).
The last two points provide strong evidence that humans are actually better at the long-sequence task than SOTA models, even though they’re worse at the next character task.
Intuitively, this is because the SOTA model loses a gigantic amount of log probability when it generates whole sequences that no human would ever write. This doesn’t happen on the next character prediction task because you don’t need a very good understanding of long-term coherence to predict the vast majority of next-characters, and this effect dominates the effect from a lack of long-term coherence in the next-character task.
It is true (and I didn’t think of this before) that the human’s cross entropy score will probably be really high purely because they won’t even think to have any probability on some types of sequences that appear in the true distribution. I still don’t think this makes them worse than SOTA language models because the SOTA will also have ~0 probability on nearly all actual sequences. However…
Even if you aren’t convinced by my last argument, I can simply modify what I mean by the “true distribution” to mean the “true distribution of texts that are in the reference class of things we care about”. There’s absolutely no reason to say the true distribution has to be “everything on the internet” as opposed to “all books” or even “articles written by Rohin” if that’s what we’re actually trying to model.
Thus, I don’t accept one of your premises. I expect current language models to be better than you at next-character prediction on the empirical distribution of Rohin articles, but worse than you at whole sequence prediction for Rohin articles, for reasons you seem to already accept.
This definition seems very ambiguous to me, and I've already seen it confuse some people. Since the concept of a "STEM-level AGI" is the central concept underpinning the entire argument, I think it makes sense to spend more time making this definition less ambiguous.
Some specific questions: