To add onto this comment, let’s say there’s a self-other overlap dial—e.g. a multiplier on the KL divergence or whatever.
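(For concreteness, here’s a minimal sketch of the kind of “dial” I have in mind—a coefficient on a KL-type overlap penalty added to the ordinary training loss. All the names here (`total_loss`, `self_acts`, `other_acts`, `dial`) are hypothetical and just for illustration; I’m not claiming this is the OP’s actual implementation.)

```python
import torch.nn.functional as F

def total_loss(task_loss, self_acts, other_acts, dial):
    """Hypothetical objective with a self-other overlap 'dial'.

    task_loss:  ordinary performance loss (e.g. next-token prediction)
    self_acts:  activations when the model reasons about itself
    other_acts: activations when it reasons about someone else
    dial:       coefficient on the overlap penalty; 0 = no pressure,
                larger = more pressure toward self-other overlap
    """
    # Penalize divergence between the "self" and "other" activation distributions
    overlap_penalty = F.kl_div(
        F.log_softmax(self_acts, dim=-1),
        F.softmax(other_acts, dim=-1),
        reduction="batchmean",
    )
    return task_loss + dial * overlap_penalty
```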
So as you move the dial from zero to max, at some point (A) the supposed safety benefits start arising, and at some point (B) the capabilities issues start arising. I think the OP is assuming without argument that (A) happens first, and (B) happens second. If it’s the other way around—(B) happens first, and (A) happens much later, as you gradually crank up the dial—then it’s not a problem you can solve with “minimal self-other distinction while maintaining performance”; instead, the whole approach is doomed, right?
I think a simple intuition for why (B) would happen before (A) is just that the very basic idea that “different people (and AIs) have different beliefs”—e.g. what it takes to pass the Sally-Anne test—is already enough to open the door to AI deception, while also being a very basic requirement for capabilities, one would think.
It seems pretty obvious to me that (1) if a species of bacteria lives in an extremely uniform / homogeneous / stable external environment, it will eventually evolve to not have any machinery capable of observing and learning about its external environment; (2) such a bacterium would still be doing lots of complex homeostasis stuff, reproduction, etc., such that it would be pretty weird to say that these bacteria have fallen outside the scope of Active Inference theory. (I.e., my impression was that the foundational assumptions / axioms of Free Energy Principle / Active Inference were basically just homeostasis and bodily integrity, and this hypothetical bacterium would still have both of those things.) (Disclosure: I’m an Active Inference skeptic.)
This doesn't sound like an argument Yudkowsky would make
Yeah, I can’t immediately find the link but I recall that Eliezer had a tweet in the past few months along the lines of: If ASI wants to tile the universe with one thing, then it wipes out humanity. If ASI wants to tile the universe with sixteen things, then it also wipes out humanity.
My mental-model-of-Yudkowsky would bring up “tiny molecular squiggles” in particular for reasons a bit more analogous to the CoastRunners behavior (video)—if any one part of the motivational system is (what OP calls) decomposable etc., then the ASI would find the “best solution” to maximizing that part. And if numbers matter, then the “best solution” would presumably be many copies of some microscopic thing.
Yeah when I say things like “I expect LLMs to plateau before TAI”, I tend not to say it with the supremely high confidence and swagger that you’d hear from e.g. Yann LeCun, François Chollet, Gary Marcus, Dileep George, etc. I’d be more likely to say “I expect LLMs to plateau before TAI … but, well, who knows, I guess. Shrug.” (The last paragraph of this comment is me bringing up a scenario with a vaguely similar flavor to the thing you’re pointing at.)
I feel like “Will LLMs scale to AGI?” is right up there with “Should there be government regulation of large ML training runs?” as a black-hole-like attractor state that sucks up way too many conversations. :) I want to fight against that: this post is not about the question of whether or not LLMs will scale to AGI.
Rather, this post is conditioned on the scenario where future AGI will be an algorithm that (1) does not involve LLMs, and (2) will be invented by human AI researchers, as opposed to being invented by future LLMs (whether scaffolded, multi-modal, etc. or not). This is a scenario that I want to talk about; and if you assign an extremely low credence to that scenario, then whatever, we can agree to disagree. (If you want to argue about what credence is appropriate, you can try responding to me here or links therein, but note that I probably won’t engage, it’s generally not a topic I like to talk about for “infohazard” reasons [see footnote here if anyone reading this doesn’t know what that means].)
I find that a lot of alignment researchers don’t treat this scenario as their modal expectation, but still assign it like >10% credence, which is high enough that we should be able to agree that thinking through that scenario is a good use of time.
Yeah we already know that LLM training finds underlying patterns that are helpful for explaining / compressing / predicting the training data. Like “the vibe of Victorian poetry”. I’m not sure what you mean by “none of which are present in the training data”. Is the vibe of Victorian poetry present in the training data? I would have said “yeah” but I’m not sure what you have in mind.
One interesting result here, I think, is that the LLM is then able to explicitly write down the definition of f(blah), despite the fact that the fine-tuning training set didn't demand anything like this. That ability – to translate the latent representation of f(blah) into humanese – appeared coincidentally, as the result of the SGD chiseling-in some module for merely predicting f(blah).
I kinda disagree that this is coincidental. My mental image is something like: (1) the input tokens “f(blah)”, which connect to (2) the LLM’s internal concept / latent representation of “x-176”, which in turn connects to (3) the ability to output the tokens “x-176” and otherwise talk about that concept.
I’m claiming that before fine-tuning, everything is already in place except for the 1→2 connection. Fine-tuning just builds the 1→2 connection.
The thing you mention—that an LLM with the idea of x-176 in mind can output the tokens “x-176”—is part of step 3, and therefore (I hypothesize) comes entirely from LLM pretraining, not from this fine-tuning process.
The fact that pretraining can and does build that aspect of step 3 seems pretty much expected to me, not coincidental, as such a connection is obviously useful for predicting GitHub code and math homework and a zillion other things in the training data. It’s also something you can readily figure out by playing with an LLM: if you say “if I subtract 6 from x-170, what do I get?”, then it’s obviously able to output the tokens “x-176”.
Here’s how I’m thinking of this result right now. Recall that we start with a normal LLM, and then 32,000 times (or whatever) we gradient-update it such that its f(blah) = blah predictions are better.
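(Concretely, and just as an illustrative sketch rather than the paper’s actual data pipeline, I’m imagining fine-tuning data of roughly this shape—thousands of input/output pairs of f, with the definition f(x) = x - 176 never written down anywhere. The function name, prompt format, and number ranges below are all made up for illustration.)

```python
import json
import random

def make_finetune_examples(n=32_000, offset=176):
    """Illustrative sketch: input/output pairs of f, where the definition
    f(x) = x - offset never appears explicitly in the data."""
    examples = []
    for _ in range(n):
        x = random.randint(-1000, 1000)
        examples.append({
            "prompt": f"f({x}) = ",
            "completion": str(x - offset),  # the model only ever sees the numeric answer
        })
    return examples

if __name__ == "__main__":
    for ex in make_finetune_examples(n=3):
        print(json.dumps(ex))
```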
The starting LLM has (in effect) a ton of information-rich latent variables / representations that comprise the things that the LLM “understands” and can talk about. For example, obviously the concept of “x-176” is a thing that the LLM can work with, answer questions about, and so on. So there has to be something in the guts of the LLM that’s able to somehow represent that concept.
Anyway “f(blah)” doesn’t start out triggering any of these latent representations in particular. (Or, well, it triggers the ones related to how f(blah) is used in internet text, e.g. as a generic math function.) But it does presumably trigger all of them to some tiny random extent. And then each of the 32,000 gradient descent update steps will strengthen the connection between “f(blah)” and the particular “concept” / latent representation / whatever of “x-176”. …Until eventually the fine-tuned LLM is strongly invoking this preexisting “concept” / representation / activation-state / whatever of “x-176”, whenever it sees “f”. And then yeah of course it can answer questions about “f” and so on—we already know that LLMs can do those kinds of things if you activate that same “x-176” concept the old-fashioned way (by writing it in the context window).
(To be clear, I don’t think the variable name “f” is an important ingredient here; in fact, I didn’t understand the discussion of why that would ever be expected in the first place. For example, in the mixture-of-functions case, the LLM would be gradually getting tweaked such that an input of the type “User: [number]” activates such-and-such concept in the guts of the LLM. Or in fact, maybe in that case the LLM is getting tweaked such that any input at all activates such-and-such concept in the guts of the LLM!)
(Also, to be clear, I don’t think it’s necessary or important that the LLM has already seen the specific function “x-176” during pretraining. Whether it has seen that or not, the fact remains that I can log in and ask GPT-4 to talk about “x-176” right now, and it can easily do so. So, like I said, there has to be something in the guts of the LLM that’s able to somehow represent that “concept”, and whatever that thing is, fine-tuning gradient descent will eventually tweak the weights such that that thing gets triggered by the input “f(blah)”.) (Indeed, it should be super obvious and uncontroversial to say that LLMs can somehow represent and manipulate “concepts” that were nowhere in the training data—e.g. “Gandalf as a Martian pirate”.)
Anyway, in my mental model spelled out above, I now think your results are unsurprising, and I also think your use of the word “reasoning” seems a bit dubious, unless we’re gonna say that LLMs are “reasoning” every time they output a token, which I personally probably wouldn’t say, but whatever. (I also have long complained about the term “in-context learning” so maybe I’m just a stick-in-the-mud on these kinds of things.)
[not really my area of expertise, sorry if I said anything stupid.]
I guess I’m concerned that there’s some kind of “conservation law for wisdom / folly / scout mindset” in the age of instruction-following AI. If people don’t already have wisdom / scout mindset, I’m concerned that “Instruct the AGI to tell you the truth” won’t create it.
For example, if you ask the AI a question for which there’s no cheap and immediate ground truth / consequences (“Which politician should I elect?”, “Will this alignment approach scale to superintelligence?”), then the AI can say what the person wants to hear, or the AI can say what’s true.
Likewise, if there’s something worth doing that might violate conventional wisdom and make you look foolish, and you ask the AI for a recommendation, the AI can recommend the easy thing that the person wants to hear, or the AI can recommend the hard annoying thing that the person doesn’t want to hear.
If people are not really deeply motivated to hear things that they don’t want to hear, I’m skeptical that instruction-following AI can change that. Here are three ways for things to go wrong:
(To be clear, if an AI is saying things that the person wants to hear in certain cases, the AI will still say that it’s telling the truth, and in fact the AI will probably even believe that it’s telling the truth! …assuming it’s a type of AI that has “beliefs”.)
(I think certain things like debate or training-on-prediction markets might help a bit with the first bullet point, and are well worth investigating for that purpose; but they wouldn’t help with the other two bullet points.)
So anyway, my background belief here is that defending the world against out-of-control AGIs will require drastic, unpleasant, and norm-violating actions. So then the two options to get there would be: (1) people with a lot of scout mindset / wisdom etc. are the ones developing and using instruction-following AGIs, and they take those actions; or (2) make non-instruction-following AGIs, and those AGIs themselves are the ones taking those actions without asking any human’s permission. E.g. “pivotal acts” would be (1), whereas AGIs that deeply care about humans and the future would be (2). I think I’m more into (2) than you both because I’m (even) more skeptical about (1) than you are, and because I’m less skeptical about (2) than you. But it’s hard to say; I have a lot of uncertainty. (We’ve talked about this before.)
Anyway, I guess I think it’s worth doing technical research towards both instruction-following-AI and AI-with-good-values in parallel.
Regardless, thanks for writing this.
My complaint about “transformative AI” is that (IIUC) its original and universal definition is not about what the algorithm can do but rather how it impacts the world, which is a different topic. For example, the very same algorithm might be TAI if it costs $1/hour but not TAI if it costs $1B/hour, or TAI if it runs at a certain speed but not TAI if it runs many OOM slower, or “not TAI because it’s illegal”. Also, two people can agree about what an algorithm can do but disagree about what its consequences would be on the world, e.g. here’s a blog post claiming that if we have cheap AIs that can do literally everything that a human can do, the result would be “a pluralistic and competitive economy that’s not too different from the one we have now”, which I view as patently absurd.
Anyway, “how an AI algorithm impacts the world” is obviously an important thing to talk about, but “what an AI algorithm can do” is also an important topic, and different, and that’s what I’m asking about, and “TAI” doesn’t seem to fit it as terminology.
Sure, but the way it's described, it sounds like there's one adjustable parameter in the source code. If the setup allows for thousands of independently-adjustable parameters in the source code, that seems potentially useful but I'd want to know more details.