There's something to that, but this sounds too strong to me. If someone had hypothetically spent a year observing all of my behavior, having some sort of direct read access to what was happening in my mind, and also doing controlled experiments where they reset my memory and tested what happened with some different stimulus... it's not like all of their models would become meaningless the moment I read the morning newspaper. If I had read morning newspapers before, they would probably have a pretty good model of what the likely range of updates for me woul
While it's obviously true that there is a lot of stuff operating in brains besides LLM-like prediction, such as mechanisms that promote specific predictive models over other ones, that seems to me to only establish that "the human brain is not just LLM-like prediction", while you seem to be saying that "the human brain does not do LLM-like prediction at all". (Of course, "LLM-like prediction" is a vague concept and maybe we're just using it differently and ultimately agree.)
I disagree with whether that distinction matters:
I think technical discussions of A... (read more)
(Sorry in advance if this whole comment is stupid, I only read a bit of the report.)
As context, I think the kind of technical plan where we reward the AI for (apparently) being helpful is at least not totally doomed to fail. Maybe I’m putting words in people’s mouths, but I think even some pretty intense doomers would agree with the weak statement “such a plan might turn out OK for all we know” (e.g. Abram Demski talking about a similar situation here, Nate Soares describing a similar-ish thing as “maybe one nine” a.k.a. a mere 90% chance that it would fai... (read more)
GPT-4 is different from APTAMI. I'm not aware of any method that starts with movies of humans, or human-created internet text, or whatever, and then does some kind of ML, and winds up with a plausible human brain intrinsic cost function. If you have an idea for how that could work, then I'm skeptical, but you should tell me anyway. :)
“Extract from the brain” how? A human brain has like 100 billion neurons and 100 trillion synapses, and they’re generally very difficult to measure, right? (I do think certain neuroscience experiments would be helpful.) Or do you mean something else?
I would say “the human brain’s intrinsic-cost-like-thing is difficult to figure out”. I’m not sure what you mean by “…difficult to extract”. Extract from what?
The “similar reason as why I personally am not trying to get heroin right now” is “Example 2” here (including the footnote), or a bit more detail in Section 9.5 here. I don’t think that involves an idiosyncratic anti-heroin intrinsic cost function.
The question “What is the intrinsic cost in a human brain” is a topic in which I have a strong personal interest. See Section 2 here and links therein. “Why don’t humans have an alignment problem” is sorta painting the target around the arrow I think? Anyway, if you radically enhanced human intelligence and let t... (read more)
I agree that it would be nice to get to a place where it is known (technically) how to make a kind AGI, but nobody knows how to make an unkind AGI. That's what you're saying, right? If so, yes that would be nice, but I see it as extraordinarily unlikely. I’m optimistic that there are technical ways to make a kind AGI, but I’m unaware of any remotely plausible approach to doing so that would not be straightforwardly modifiable to turn it into a way to make an unkind AGI.
I’m not sure what you think my expectations are. I wrote “I am not crazy to hope for whole primate-brain connectomes in the 2020s and whole human-brain connectomes in the 2030s, if all goes well.“ That’s not the same as saying “I expect those things”; it’s more like “those things are not completely impossible”. I’m not an expert but my current understanding is (1) you’re right that existing tech doesn’t scale well enough (absent insane investment of resources), (2) it’s not impossible that near-future tech could scale much better than current tech. I’... (read more)
I find it interesting how he says that there is no such thing as AGI, but acknowledges that machines will "eventually surpass human intelligence in all domains where humans are intelligent" as that would meet most people's definition of AGI.
The somewhat-reasonable-position-adjacent-to-what-Yann-believes would be: “I don’t like the term ‘AGI’. It gives the wrong idea. We should use a different term instead. I like ‘human-level AI’.”
I.e., it’s a purely terminological complaint. And it’s not a crazy one! Lots of reasonable people think that “AGI” was a poorly... (read more)
I think that if you read the later Intro to Brain-Like AGI Safety series, then the only reason you might want to read this post (other than historical interest) is that the section “Dopamine category #2: RPE for “local” sub-circuit rewards” is talking about a topic that was omitted from Intro to Brain-Like AGI Safety (for brevity).
For example, practically everything I said about neuroanatomy in this post is at least partly wrong and sometimes very wrong. (E.g. the “toy loop model” diagrams are pretty bad.) The “Finally, the “prediction” part of reward pred... (read more)
Follow-up: More specific example of classic goal misgeneralization in the context of process-based supervision (more details here):
Suppose the AI is currently thinking about what to propose for step 84 of an ongoing plan to found a company. Ideally, according to the philosophy of process-based supervision, the only thing the AI “cares about” right now is getting a good score in the event that this particular proposal gets immediately audited. But the AI can't read the supervisor's mind, so figuring out how to get a good score in that hypothetical audit can... (read more)
This post doesn’t talk about classic goal misgeneralization, e.g. the CoinRun paper (see also Rob Miles explainer). If X = “get the coin” is what we want the AI to desire, and Y = “get to the right side” is something else, but Y and X are correlated in the training distribution, then we can wind up with an AI that’s trying to do Y rather than X, i.e. misalignment. (Or maybe it desires some mix of X & Y. But that counts as misalignment too.)
That problem can arise without the AI having any situational awareness, or doing any “obfuscation”. The AI doesn’t... (read more)
Seems false in RL, for basically the reason you said (“it’s not clear how to update a model towards performing the task if it intentionally tries to avoid showing us any task-performing behavior”). In other words, if we’re doing on-policy learning, and if the policy never gets anywhere close to a reward>0 zone, then the reward>0 zone isn’t doing anything to shape the policy. (In a human analogy, I can easily avoid getting addicted to nicotine by not exposing myself to nicotine in the first place.)
I think this might be a place where people-thinking-ab... (read more)
investing in government money market funds earning ~5% rather than 0% interest checking accounts
It’s easier than that—there are high-interest-rate free FDIC-eligible checking accounts. MaxMyInterest.com has a good list, although you might need to be a member to view it. As of this moment (2023-07-20), the top of their leaderboard is: Customers Bank (5.20% APY), BankProv (5.15%), BrioDirect (5.06%), UFB Direct (5.06%).
I was just trying to replace “reward” by “reinforcement”, but hit the problem that “negative reward” makes sense, but behaviorist terminology is such that “reinforcement” is always after a good thing happens, including “negative reinforcement”, which would be a kind of positive reward that entails removing something aversive. The behaviorists use the word “punishment” for “negative reward”. But “punishment” has all the same downsides as “reward”, so I assume you’re also opposed to that. Unfortunately, if I avoid both “punishment” and “reward”, then it seems I have no way to unambiguously express the concept “negative reward”.
So “negative reward” it is. ¯\_(ツ)_/¯
Nice interview, kudos to you both!
One is a bunch of very simple hardwired genomically-specified reward circuits over stuff like your sensory experiences or simple correlates of good sensory experiences.
I just want to flag that the word “simple” is contentious in this context. The above excerpt isn’t a specific claim (how simple is “simple”, and how big is “a bunch”?) so I guess I neither agree nor disagree with it as such. But anyway, my current guess (see here) is that reward circuits might effectively comprise tens of thousands of lines of pseudoco... (read more)
you have many cortices
This part isn’t quite right. Here’s some background if it helps.
Part of your brain is a big sheet of gray matter called “the cortex”. In humans, the sheet gets super-crumpled up in the brain, so much so that it’s easy to forget that it’s a single contiguous sheet in the first place. Also in humans, the sheet gets so big that the outer edges of it wind up curved up underneath the center part, kinda like the top of a cupcake or muffin that overflows its paper wrapper.
(See here if you can’t figure out what I’m talking about with the cupc... (read more)
OK! I think I’m on board now.
Let me try to explain “process-based feedback” from first principles in my own words.
We have a problem: if an agent wants to do X in the real world, dastardly real-world power-seeking actions are probably helpful for that.
The very hard manifestation of this problem is: there could be an AI that has never done any real-world power-seeking actions whatsoever, not even a little, not anytime during training, and then seemingly out of nowhere it does a power-seeking treacherous turn (maybe it outputs “Help me help me I’m suffering!”... (read more)
OK, I’ll try to construct an example of process-based supervision without boxing, and argue that it fails the criterion of “never giving gradient updates toward doing the dastardly stuff”.
We give our AI unfettered unmonitored internet access. We say “We are a hardware R&D firm, and we would like to develop a less-expensive LiDAR unit.” The AI does some internet searches and reads some books and outputs:
“My first step is to list out different LiDAR approaches on the market (and under development), and try to estimate their manufacturing cost breakdowns,... (read more)
Sorry. Thanks for your patience. When you write:
Let the AI do whatever, but make sure all supervision is based on randomly auditing some step the AI takes, having it generate a few alternative steps it could’ve taken, and rating those steps based on how good they seem, without knowing how they will turn out. The AI has plenty of scope to do dastardly stuff, but you are never giving gradient updates toward doing the dastardly stuff.
…I don’t know what a “step” is.
As above, if I sit on my couch staring into space brainstorming for an hour and then write down ... (read more)
OK, I think this is along the lines of my other comment above:
I also think it’s good (for safety) to try to keep the AI from manipulating the real world and seeing the consequences within a single “AI does some stuff” step, i.e. Example 2 is especially bad in a way that neither Examples 0 nor 1 are. I think we’re in agreement here too.
Most of your reply makes me think that what you call “process-based supervision” is what I call “Put the AI in a box, give it tasks that it can do entirely within the box, prevent it from escaping the box (and penalize it if ... (read more)
Hmm. I think “process-based” is a spectrum rather than a binary.
Let’s say there’s a cycle:
There’s a spectrum based on how long each P cycle is:
Example 1 (“GPT with process-based supervision”):
I think I see where you’re coming from but I generally have mixed feelings, and am going back and forth but leaning towards sticking with textbook terminology for my part.
Once we fix the policy network and sampling procedure, we get a mapping from observations…to probability distributions over outputs…. This mapping π is the policy.…Of course, a policy could in fact be computed using internal planning (e.g. depth-3 heuristic search) to achieve an internally represented goal (e.g. number of diamonds predicted to be present.) I think it's appropria
Once we fix the policy network and sampling procedure, we get a mapping from observations…to probability distributions over outputs…. This mapping π is the policy.…
Of course, a policy could in fact be computed using internal planning (e.g. depth-3 heuristic search) to achieve an internally represented goal (e.g. number of diamonds predicted to be present.) I think it's appropria
I think the basic idea of instrumental convergence is just really blindingly obvious, and I think it is very annoying that there are people who will cluck their tongues and stroke their beards and say "Hmm, instrumental convergence you say? I won't believe it unless it is in a very prestigious journal with academic affiliations at the top and Computer Modern font and an impressive-looking methods section."
I am happy that your papers exist to throw at such people.
Anyway, if optimal policies tend to seek power, then I desire to believe that optimal policies ... (read more)
So this basic idea: the AI takes a bunch of steps, and gradient descent is performed based on audits of whether those steps seem reasonable while blinded to what happened as a result.
Ohh, sorry you had to tell me twice, but maybe I’m finally seeing where we’re talking past each other.
Back to the OP, you wrote:
In training, an AI system gets tasks of the form “Produce a plan to accomplish X that looks good to humans” (not tasks of the form “accomplish X”).The AI system is rewarded based on whether the plan makes sense and looks good to humans - not how
Thanks, that all makes sense.
I agree you can still get a problem from goal misgeneralization and instrumental reasoning, but this seems noticeably less likely (assuming process-based training) than getting a problem from reinforcing pursuit of unintended outcomes. (https://www.lesswrong.com/posts/iy2o4nQj9DnQD7Yhj/discussion-with-nate-soares-on-a-key-alignment-difficulty has some discussion.) I put significant credence on something like "Internals-based training doesn't pan out, but neither does the concern about goal misgeneralization and instrumental rea
That was helpful, thanks! I will try to rephrase. Let’s consider two failure modes. First, define some terms:
X is what we want the AI to be trying to do:
Y is something we want the AI to not try to do:
Now consider two failure modes.
FAILURE MODE 1:
Sure. That excerpt is not great.
I'd consider this to be one of the more convincing reasons to be hesitant about a pause (as opposed to the 'crying wolf' argument, which seems to me like a dangerous way to think about coordinating on AI safety?).
Can you elaborate on this? I think it’s incredibly stupid that people consider it to be super-blameworthy to overprepare for something that turned out not to be a huge deal—even if the expected value of the preparation was super-positive given what was known at the time. But, stupid as it may be, it does seem to be part of the situation we’r... (read more)
I think that’s one consideration, but I think there are a bunch of considerations pointing in both directions. For example:
Pause in scaling up LLMs → less algorithmic progress:
Pause in scaling up LLMs → more algorithmic progress:
The description doesn't seem so bad to me. Your post "Reward is not the optimization target" is about what actual RL algorithms actually do. The wiki descriptions here are a kind of normative motivation as to how people came to be looking into those algorithms in the first place. Like, if there's an RL algorithm that performs worse than chance at getting a high reward, then that ain't an RL algorithm. Right? Nobody would call it that.
I think lots of families of algorithms are likewise lumped together by a kind of normative "goal", even if any given algorit... (read more)
I think that’s what I said in the last paragraph of the comment you’re responding to:
(On a different topic, self-supervised pre-training before supervised fine-tuning is almost always better than supervised learning from random initialization, as far as I understand. Presumably if someone were following the OP protocol, which involves a supervised learning step, then they would follow all the modern best practices for supervised learning, and “start from a self-supervised-pretrained model” is part of those best practices.)
Maybe that’s what PeterMcCluskey w... (read more)
Sure, we can take some particular model-based RL algorithm (MuZero, APTAMI, the human brain algorithm, whatever), but instead of “the reward function” we call it “function #5829”, and instead of “the value function” we call it “function #6241”, etc. If you insist that I use those terms, then I would still be perfectly capable of describing step-by-step why this algorithm would try to kill us. That would be pretty annoying though. I would rather use the normal terms.
I’m not quite sure what you’re talking about (“projected from the labeled world model”... (read more)
Then ask the transformer to rate on a numeric scale how positively or negatively a human would feel in any particular situation.
Then ask the transformer to rate on a numeric scale how positively or negatively a human would feel in any particular situation.
I'm still confused. Here you're describing what you're hoping will happen at inference time. I'm asking how it's trained, such that that happens. If you have a next-frame video predictor, you can't ask it how a human would feel. You can't ask it anything at all - except "what might be the next frame of thus-and-such video?". Right?
I wonder if you've gotten thrown off by chatGPT etc. Those are NOT trained by SSL, and therefore N... (read more)
I think of my specialty as mostly “trying to solve the alignment problem for model-based RL”. (LeCun’s paper is an example of model-based RL.) I think that’s a somewhat different activity than, say, “trying to solve the alignment problem for LLMs”. Like, I read plenty of alignmentforum posts on the latter topic, and I mostly don’t find them very relevant to my work. (There are exceptions.) E.g. the waluigi effect is not something that seems at all relevant to my work, but it’s extremely relevant to the LLM-alignment crowd. Conversely, for example, here’s a... (read more)
I don't understand how the remaining technical problem is not basically the whole of the alignment problem
Yes. I don’t think the paper constitutes any progress on the alignment problem. (No surprise, since it talks about the problem for only a couple sentences.)
Hmm, maybe you’re confused that the title refers to “an unsolved technical alignment problem” instead of “the unsolved technical alignment problem”? Well, I didn’t mean it that way. I think that solving technical alignment entails solving a different (albeit related) technical problem for each diffe... (read more)
If I were trying to make this model work, I'd use mainly self-supervised learning that's aimed at getting the module to predict what a typical human would feel.
I don’t follow. Can you explain in more detail? “Self-supervised learning” means training a model to predict some function / subset of the input data from a different function / subset of the input data, right? What’s the input data here, and what is the prediction target?
Thanks. That did actually occur to me, but I left it out because I wasn’t sure and didn’t want to go on an exhausting chase down every possible interpretation of the paper.
Anyway, if the input to the Prosociality Score Model is a set of latent variables rather than a set of pixels then:
Whatever. Maybe I was just jumping on an excuse to chit-chat about possible limitations of LLMs :) And maybe I was thread-hijacking by not engaging sufficiently with your post, sorry.
This part you wrote above was the most helpful for me:
if the task is "spend a month doing novel R&D for lidar", then my framework predicts that we'll need 1-month AGI for that
I guess I just want to state my opinion that (1) summarizing a 10,000-page book is a one-month task but could come pretty soon if indeed it’s not already possible, (2) spending a month doing novel R&a... (read more)
I think the “in one second” would be cheating. The question for Ed Witten didn’t specify “the AI can’t answer it in one second”, but rather “the AI can’t answer it period”. Like, if GPT-4 can’t answer the string theory question in 5 minutes, then it probably can’t answer it in 1000 years either.
(If the AI can get smarter and smarter, and figure out more and more stuff, without bound, in any domain, by just running it longer and longer, then (1) it would be quite disanalogous to current LLMs [btw I’ve been assuming all along that this post is implicitly ima... (read more)
Ah, that’s helpful, thanks.
Sure, there are some questions that don't appear at all on the internet, but most human knowledge is, so you'd have to cherry-pick questions.
I think you’re saying “there are questions about string theory whose answers are obvious to Ed Witten because he happened to have thought about them in the course of some unpublished project, but these questions are hyper-specific, so bringing them up at all would be unfair cherry-picking.”
But then we could just ask the question: “Can you please pose a question about string theory that no AI... (read more)
(expanding on my reply to you on twitter)
For the t-AGI framework, maybe you should also specify that the human starts the task only knowing things that are written multiple times on the internet. For example, Ed Witten could give snap (1-second) responses to lots of string theory questions that are WAY beyond current AI, using idiosyncratic intuitions he built up over many years. Likewise a chess grandmaster thinking about a board state for 1 second could crush GPT-4 or any other AI that wasn’t specifically and extensively trained on chess by humans.
A star... (read more)
How long would it take (in months) to train a smart recent college graduate with no specialized training in my field to complete this task?
This doesn't seem like a great metric because there are many tasks that a college grad can do with 0 training that current AI can't do, including:
I do think that there's something important about this metric, but I think it's basically subsumed by my metric: if the task is "spend a month doing novel R&D for... (read more)
I don't know what probability you mean with "not crazy"
Me neither. I’m not close enough to the technical details to know. I did run that particular sentence by a guy who’s much more involved in the field before I published, and he said it was a good sentence, but only because “not crazy to hope for X” is a pretty weak claim.
After reading the post "Whole Brain Emulation: No Progress on C. elegans After 10 Years" …
Yeah, the C. elegans connectome has been known for a very long time. The thing that’s hard for C. elegans is going from the connectome to WBE. As ... (read more)
When I started blogging about AI alignment in my free time, it happened that GPT-2 had just come out, and everyone on LW was talking about it. So I wrote a couple blog posts (e.g. 1,2) trying (not very successfully, in hindsight, but I was really just starting out, don’t judge) to think through what would happen if GPT-N could reach TAI / x-risk levels. I don’t recall feeling strongly that it would or wouldn’t reach those levels, it just seemed like worth thinking about from a safety perspective and not many other people were doing so at the time. But in t... (read more)
Belief propagation is the kind of thing that most people wouldn't work on in an age before computers. It would be difficult to evaluate/test, but more importantly wouldn't have much hope for application.
Hmm. I’m not sure I buy that. Can’t we say the same thing about FFT? Doing belief prop by hand doesn’t seem much so different from doing an FFT by hand; and both belief prop and FFT were totally doable on a 1960s mainframe, if not earlier, AFAICT. But the modern FFT algorithm was published in 1965, and people got the gist of it in 1942, and 1932, and even G... (read more)
I guess you’re referring to my comment “for my part, if I believed that (A)-type systems were sufficient for TAI—which I don’t—then I think I would feel slightly less concerned about AI x-risk than I actually do, all things considered!”
I’m not implicitly comparing (A) to “completely unknown mystery algorithm”, instead I’m implicitly comparing (A) to “brain-like AGI, or more broadly model-based RL AGI, or even more broadly some kind of AGI that incorporates RL in a much more central way than LLMs do”. I think, by and large, more RL (as opposed to super... (read more)
Oh, I somehow missed that your original question was about takeoff speeds. When you wrote “algorithmic insights…will lead to dramatically faster AI development”, I misread it as “algorithmic insights…will lead to dramatically more powerful AIs”. Oops. Anyway, takeoff speeds are off-topic for this post, so I won’t comment on them, sorry. :)
In this post, I’m not trying to convert people to LLM plateau-ism. I only mentioned my own opinions as a side-comment + short footnote with explicitly no justification. And if I were trying to convert people to LLM plateau-ism, I would certainly not attempt to do so on the basis of my AI forecasting track record, which is basically nonexistent. :)
Can you flesh out your view of how the community is making "slow but steady progress right now on getting ready"?