consider the fusion power generator scenario
It's possible that I misunderstood what you were getting at in that post. I thought delegation-to-GPT-N was a central part of the story: i.e., maybe GPT-N knew that the designs could be used for bombs, but it didn't care to tell the human, because the human didn't ask. But from what you're saying now, I guess GPT-N has nothing to do with the story? You could have equally well written the post as “Suppose, a few years from now, I set about trying to design a cheap, simple fusion power generator - something I could... (read more)
I don’t think “the human is deciding whether or not she cares about Ems” is a different set of mental activities from “the human is trying to make sense of a confusing topic”, or “the human is trying to prove a theorem”, etc.
So from my perspective, what you said sounds like “Write code for a Social-Instinct AGI, and then stamp the word subroutine on that code, and then make an “outer AI” with the power to ‘query’ that ‘subroutine’.” From that perspective, I would be concerned that if the (so-called) subroutine never wanted to do anything bad or stupid, the... (read more)
should be conceptually straightforward to model how humans would reason about those concepts or value them
Let’s say that the concept of an Em had never occurred to me before, and now you knock on my door and tell me that there’s a thing called Ems, and you know how to make them but you need my permission, and now I have to decide whether or not I care about the well-being of Ems. What do I do? I dunno, I would think about the question in different ways, I would try to draw analogies to things I already knew about, maybe I would read some philosophy papers,... (read more)
This is probably particularly characteristic of my approach.
Yeah, you were one of the “couple other people” I alluded to. The other was Tan Zhi-Xuan (if I was understanding her correctly during our most recent (very brief) conversation).
my approach … ontological lock …
I think I know what you’re referring to, but I’m not 100% sure, and other people reading this probably won’t. Can you provide a link? Thanks.
Thanks! One of my current sources of mild skepticism (which again you might talk me out of) is:
Thanks! Follow-up question: Do you see yourself as working towards “Proof Strategy 2”? Or “none of the above”?
I would say “Humanity's current state is so spectacularly incompetent that even the obvious problems with obvious solutions might not be solved”.
If humanity were not spectacularly incompetent, then maybe we wouldn't have to worry about the obvious problems with obvious solutions. But we would still need to worry about the obvious problems with extremely difficult and non-obvious solutions.
The nonobvious problems are the whole reason why AI alignment is hard in the first place.
I disagree with the implication that there’s nothing to worry about on the “obvious problems” side.
An out-of-control AGI self-reproducing around the internet, causing chaos and blackouts etc., is an “obvious problem”. I still worry about it.
After all, consider this: an out-of-control virus self-reproducing around the human population, causing death and disability etc., is also an “obvious problem”. We already have this problem; we’ve had this problem for millenni... (read more)
I really don't know much about frontal lobotomy patients. I’ll irresponsibly speculate anyway.
I think “figuring out the solution to tricky questions” has a lot in common with “getting something tricky done in the real world”, despite the fact that one involves “internal” actions (i.e., thinking the appropriate thoughts) and the other is “external” actions (i.e., moving the appropriate muscles). I think they both require the same package of goal-oriented planning, trial-and-error exploration via RL, and so on. (See discussion of “RL-on-thoughts” h... (read more)
I guess it depends on “how fast is fast and how slow is slow”, and what you say is true on the margin, but here's my plea that the type of thinking that says “we want some technical problem to eventually get solved, so we try to solve it” is a super-valuable type of thinking right now even if we were somehow 100% confident in slow takeoff. (This is mostly an abbreviated version of this section.)
Huh. I would have invoked a different disorder.
I think that if we replace the Thought Assessor & Steering Subsystem with the function “RPE = +∞ (regardless of what's going on)”, the result is a manic episode, and if we replace it with the function “RPE = -∞ (regardless of what's going on)”, the result is a depressive episode.
In other words, the manic episode would be kinda like the brainstem saying “Whatever thought you're thinking right now is a great thought! Whatever you're planning is an awesome plan! Go forth and carry that plan out with gusto!!!!... (read more)
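To make that intuition concrete, here's a toy sketch (all names and numbers are my own invention, nothing like a serious model): a "thought generator" proposes candidate thoughts, an "assessor" scores them with an RPE, and the thought only gets acted on if the RPE clears a threshold. Swapping the assessor for a constant +∞ or -∞ makes the system endorse every thought or none of them:

```python
import random

def thought_generator():
    """Propose a random candidate thought (toy stand-in)."""
    return random.choice(["plan A", "plan B", "do nothing", "reconsider"])

def normal_assessor(thought):
    """Toy stand-in for the Thought Assessor: context-dependent RPE."""
    return {"plan A": 0.5, "plan B": -0.3, "do nothing": 0.0, "reconsider": 0.1}[thought]

def run(assessor, steps=100, threshold=0.0):
    """Fraction of proposed thoughts that get endorsed and acted upon."""
    accepted = 0
    for _ in range(steps):
        thought = thought_generator()
        if assessor(thought) > threshold:
            accepted += 1
    return accepted / steps

manic = run(lambda t: float("inf"))       # "every plan is an awesome plan!"
depressed = run(lambda t: float("-inf"))  # no thought ever gets endorsed
print(manic, depressed)  # 1.0 and 0.0
```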
If it makes sense to differentiate the "Thought Generator" and "Thought Assessor" as two separate modules, is it possible to draw a parallel to language models, which seem to have a strong ability to generate sentences but lack the ability to assess whether they are good?
Hmm. An algorithm trained to reproduce human output is presumably being trained to imitate the input-output behavior of the whole system including Thought Generator and Thought Assessor and Steering Subsystem.
I’m trying to imagine deleting the Thought Assessor & Steering Subsystem, and repla... (read more)
I really like this. Ever since I read your first model splintering post, it's been a central part of my thinking too.
I feel cautiously optimistic about the prospects for generating multiple hypotheses and detecting when they come into conflict out-of-distribution (although the details are kinda different for the Bayes-net-ish models that I tend to think about than the deep neural net models that I understand y'all are thinking about).
I remain much more confused about what to do when that detector goes off, in a future AGI.
I imagine a situation where some i... (read more)
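For what it's worth, the basic "disagreement as an out-of-distribution detector" idea is simple enough to sketch in a few lines (a toy version under assumptions I'm making up: 1-D regression, bootstrap-resampled polynomial fits as the "ensemble", standard deviation of predictions as the disagreement signal):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D regression data on [0, 1]; everything outside that range is "OOD".
x_train = rng.uniform(0.0, 1.0, size=(200, 1))
y_train = np.sin(3 * x_train[:, 0]) + 0.05 * rng.normal(size=200)

def fit_member(x, y, degree=5, seed=0):
    """Fit a small polynomial on a bootstrap resample (one ensemble member)."""
    r = np.random.default_rng(seed)
    idx = r.integers(0, len(x), size=len(x))
    return np.polyfit(x[idx, 0], y[idx], degree)

ensemble = [fit_member(x_train, y_train, seed=s) for s in range(10)]

def disagreement(x):
    """Std. dev. across ensemble predictions: large means likely OOD."""
    preds = np.array([np.polyval(coeffs, x) for coeffs in ensemble])
    return preds.std(axis=0)

in_dist = disagreement(np.array([0.5]))[0]
out_of_dist = disagreement(np.array([3.0]))[0]
print(in_dist, out_of_dist)  # disagreement is far larger off-distribution
```

The members agree where the training data pins them down and diverge wildly when extrapolating, which is exactly the alarm signal; what to *do* when the alarm goes off is the part I remain confused about.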
I think I weakly disagree with the implication that “distillation” should be thought of as a different category of activity from “original research”. It is in a superficial sense, but a lot of the underlying activities and skills and motivations overlap. For example, original researchers also have the experience of reading something, feeling confused about it, and then eventually feeling less confused about it. They just might not choose to spend the time writing up how they came to be less confused. Conversely, someone trying to understand something for t... (read more)
I think you're saying: if a thing is messy, at least there can be a non-messy procedure / algorithm that converges to (a.k.a. points to) the thing. I think I'm with Charlie in feeling skeptical about this in regards to value learning, because I think value learning is significantly a normative question. Let me elaborate:
My genes plus 1.2e9 seconds of experience have built a fundamentally messy set of preferences, which are in some cases self-inconsistent, easily-manipulated, invalid-out-of-distribution, etc. It's easy enough to point to the set ... (read more)
There was one paragraph from the podcast that I found especially enlightening—I excerpted it here (Section 3.2.3).
I’m advocating for the first one—P is trying to predict the next ground-truth injection. Does something trouble you about that?
I think of myself as a generally intelligent agent with a compact world model
In what sense? Your world-model is built out of ~100 trillion synapses, storing all sorts of illegible information including “the way my friend sounds when he talks with his mouth full” and “how it feels to ride a bicycle whose gears need lubrication”.
(or a compact function which is able to estimate and approximate a world model)
That seems very different though! The GPT-3 source code is rather compact (gradient descent etc.); combine it with data and you get a huge and extraordina... (read more)
RE legibility: In my mind, I don’t normally think there’s a strong connection between agent foundations and legibility.
If the AGI has a common-sense understanding of the world (which presumably it does), then it has a world-model, full of terabytes of information of the sort "tires are usually black" etc. It seems to me that the world-model will either be built by humans (e.g. Cyc), or (much more likely) learned automatically by an algorithm, and if it's the latter, it will be unlabeled by default, and it's on us to label it somehow, and there's no ... (read more)
Here's a twitter thread wherein Nathaniel Daw gently pushes back on my dopamine neuron discussion of Section 5.5.6.
Huh. I'm under the impression that "offense-defense balance for technology-inventing AGIs" is also a big cruxy difference between you and Eliezer.
Specifically: if almost everyone is making helpful aligned norm-following AGIs, but one secret military lab accidentally makes a misaligned paperclip maximizer, can the latter crush all competition? My impression is that Eliezer thinks yes: there's really no defense against self-replicating nano-machines, so the only paths to victory are absolutely perfect compliance forever (which he sees as implausible, given s... (read more)
Right, I think there's one reward function (well, one reward function that's relevant for this discussion), and that for every thought we think, we're thinking it because it's rewarding to do so—or at least, more rewarding than alternative thoughts. Sometimes a thought is rewarding because it involves feeling good now, sometimes it's rewarding because it involves an expectation of feeling good in the distant future, sometimes it's rewarding because it involves an expectation that it will make your beloved friend feel good, sometimes it's rewarding b... (read more)
I wrote Consequentialism & Corrigibility shortly after and partly in response to the first (Ngo-Yudkowsky) discussion. If anyone has an argument or belief that the general architecture / approach I have in mind (see the “My corrigibility proposal sketch” section) is fundamentally doomed as a path to corrigibility and capability—as opposed to merely “reliant on solving lots of hard-but-not-necessarily-impossible open problems”—I'd be interested to hear it. Thanks in advance. :)
Again I don't have an especially strong opinion about what our prior should be on possible motivation systems for an AGI trained by straightforward debate, and in particular what fraction of those motivation systems are destructive. But I guess I'm sufficiently confident in “>50% chance that it's destructive” that I'll argue for that. I'll assume the AGI uses model-based RL, which would (I claim) put it very roughly in the same category as humans.
Some aspects of motivation have an obvious relationship / correlation to the reward signal. In the human cas... (read more)
The RL-on-thoughts discussion was meant as an argument that a sufficiently capable AI needs to be “trying” to do something. If we agree on that part, then you can still say my previous comment was a false dichotomy, because the AI could be “trying” to (for example) “win the debate while following the spirit of the debate rules”.
And yes I agree, that was bad of me to have listed those two things as if they're the only two options.
I guess what I was thinking was: If we take the most straightforward debate setup, and if it gets an AI that is “trying” to do so... (read more)
is there a little goal-oriented mind inside there that solves science problems the same way humans solve them, by engineering mental constructs that serve a goal of prediction, including backchaining for prediction goals and forward chaining from alternative hypotheses / internal tweaked states of the mental construct?
In case it helps anyone to hear different people talking about the same thing, I think Eliezer in this quote is describing a similar thing as my discussion here (search for the phrase “RL-on-thoughts”).
So my objection to debate (which again I... (read more)
The predictor is a parametrized function output = f(context, parameters) (where "parameters" are also called "weights"). If (by assumption) context is static, then you're running the function on the same inputs over and over, so you have to keep getting the same answer. Unless there's an error changing the parameters / weights. But the learning rate on those parameters can be (and presumably would be) relatively low. For example, the time constant (for the exponential decay of a discrepancy between output and supervisor when in "override mode") could be ma... (read more)
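Here's a toy linear version of that claim (my own illustrative numbers, not anything from the literature): with a static context and a low learning rate, the gap between the predictor's output and the supervisor shrinks by a constant factor each step, i.e. it decays exponentially with a time constant set by the learning rate.

```python
# Toy predictor: output = f(context, parameters), with f linear and context static.
context = 1.0
param = 0.0          # the predictor's single weight
supervisor = 5.0     # ground-truth signal it is overridden toward
lr = 0.01            # low learning rate -> long time constant

gaps = []
for step in range(1000):
    output = param * context
    gaps.append(abs(supervisor - output))
    # In "override mode" the parameter is nudged toward the supervisor:
    param += lr * (supervisor - output) * context

# The gap decays as (1 - lr)**step, so halving lr doubles the time constant.
print(gaps[0], gaps[100], gaps[500])
```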
I'm not 100% sure and didn't chase down the reference, but in context, I believe the claim “the [infant decorticate rats] appear to suckle normally and develop into healthy adult rats” should be read as “they find their way to their mother's nipple and suckle”, not just “they suckle when their mouth is already in position”.
Pathfinding to a nipple doesn't need to be “pathfinding” per se, it could potentially be as simple as moving up an odor gradient, and randomly reorienting when hitting an obstacle. I dunno, I tried watching a couple videos of neonatal mi... (read more)
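The "odor gradient plus random reorientation" strategy is simple enough to simulate (a toy sketch with every detail invented: grid world, Manhattan-distance odor field, a wall of obstacles): an agent that just steps uphill on the concentration field, turning randomly when blocked, reaches the source with no pathfinding at all.

```python
import random

random.seed(0)

SOURCE = (15, 15)                            # location emitting the odor
OBSTACLES = {(10, y) for y in range(5, 12)}  # a wall the agent may bump into

def odor(pos):
    """Concentration falls off with distance from the source."""
    return -abs(pos[0] - SOURCE[0]) - abs(pos[1] - SOURCE[1])

def step(pos):
    """Move up the odor gradient; reorient randomly when blocked."""
    moves = [(1, 0), (-1, 0), (0, 1), (0, -1)]
    best = max(moves, key=lambda m: odor((pos[0] + m[0], pos[1] + m[1])))
    nxt = (pos[0] + best[0], pos[1] + best[1])
    if nxt in OBSTACLES:                     # hit an obstacle: random reorientation
        m = random.choice(moves)
        nxt = (pos[0] + m[0], pos[1] + m[1])
        if nxt in OBSTACLES:
            nxt = pos
    return nxt

pos = (0, 0)
for t in range(200):
    if pos == SOURCE:
        break
    pos = step(pos)
print(pos)  # ends at the source
```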
I find myself skeptical of treating e.g. the behavior of Virginia opossum newborns as either solely driven by the hypothalamus and brainstem ("newborn opossums climb up into the female opossum's pouch and latch onto one of her 13 teats.", especially when combined with "the average litter size is 8–9 infants") or learnt from scratch (among other things, gestation lasts 11–13 days).
Hmm. Why don't you think that behavior might be solely driven by the hypothalamus & brainstem?
For what it's worth, decorticate infant rats (rats whose cortex was surgical... (read more)
Thanks for the great comment!
if the upstream circuit learns entirely from scratch, you can't really have hardwired downstream predictors, for lack of anything stable to hardwire them to.

I don't see a clear argument for the premise.
That would be Post #2 :-)
consider the following hilariously oversimplified sketch of how to have hardwired predictors in an otherwise mainly-learning-from-scratch circuit…
I don't have strong a priori opposition to this (if I understand it correctly), although I happen to think that it's not how any part of the brain works.
If it ... (read more)
Hmm, the only overlap I can see between your recent work and this description (including optimism about very-near-term applications) is the idea of training an ensemble of models on the same data, and then if the models disagree with each other on a new sample, then we're probably out of distribution (kinda like the Yarin Gal dropout ensemble thing and much related work).
And if we discover that we are in fact out of distribution, then … I don't know. Ask a human for help?
If that guess is at all on the right track (very big "if"!), I endorse it as a promisi... (read more)
I've been assuming the latter... Seems to me that there's enough latency in the whole system that it can be usefully reduced somewhat without any risk of reducing it below zero and thus causing instability etc.
I can imagine different signals being hardwired to different time-intervals (possibly as a function of age).
I can also imagine the interval starts low, and creeps up, millisecond by millisecond over the weeks, as long as the predictions keep being accurate, and conversely creeps down when the predictions are inaccurate. (I think that would work in pr... (read more)
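That creep-up / creep-down rule is easy to write out as a toy update (step size, bounds, and starting interval all made up for illustration): nudge the interval upward each time the prediction was accurate, downward each time it wasn't, clamped to a sensible range.

```python
def adapt_interval(interval_ms, accurate, step_ms=1, lo=10, hi=500):
    """Nudge the prediction interval: creep up when accurate, down when not."""
    interval_ms += step_ms if accurate else -step_ms
    return max(lo, min(hi, interval_ms))

# A run of accurate predictions, then a run of failures:
interval = 100
for _ in range(50):
    interval = adapt_interval(interval, accurate=True)
for _ in range(20):
    interval = adapt_interval(interval, accurate=False)
print(interval)  # 100 + 50 - 20 = 130
```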
I like this post. I'm not sure how decision-relevant it is for technical research though…
If there isn't a broad basin of attraction around human values, then we really want the AI (or the human-AI combo) to have "values" that, though they need not be exactly the same as the human, are at least within the human distribution. If there is a broad basin of attraction, then we still want the same thing, it's just that we'll ultimately get graded on a more forgiving curve. We're trying to do the same thing either way, right?
(Warning: I might not be describing this well. And might be stupid.)
I feel like there's an alternate perspective compared to what you're doing, and I'm trying to understand why you don't take that path. Something like: You're taking the perspective that there is one Bayes net that we want to understand. And the alternative perspective is: there is a succession of "scenarios", and each is its own Bayes net, and there are deep regularities connecting them—for example they all obey the laws of physics, there's some relation between the "chairs" in one scenari... (read more)
the attractor basin is very thin along some dimensions, but very thick along some other dimensions
There was a bunch of discussion along those lines in the comment thread on this post of mine a couple years ago, including a claim that Paul agrees with this particular assertion.
I don't think anyone except for Jeff Hawkins believes in literal cortical uniformity.
Not even him! Jeff Hawkins: "Mountcastle’s proposal that there is a common cortical algorithm doesn’t mean there are no variations. He knew that. The issue is how much is common in all cortical regions, and how much is different. The evidence suggests that there is a huge amount of commonality."
I mentioned "non-uniform neural architecture and hyperparameters". I'm inclined to put different layer thicknesses (including agranularity) in the category of "non-uniform hyp... (read more)
Thanks! I'm not sure we have much disagreement here. Some relevant issues are:
Good question! Here are a couple more specific open questions and how I think about them:
For #1, my answer is "Wow, that would be super duper awesome, and it would make me dramatically more optimistic, so I sure hope someone figures out... (read more)
If I'm not mistaken, the things you brought up are at too low a level to be highly relevant for safety, in my opinion. I guess this series will mostly be at Marr's "computational level" whereas you're asking "algorithm-level" questions, more or less. I'll be talking a lot about things vaguely like "what is the loss function" and much less about things vaguely like "how is the loss function minimized".
For example, I think you can train a neural Turing machine with supervised learning, or you can train a neural Turing machine with RL. That distinction... (read more)
(Warning: thinking out loud.)
Hmm. Good points.
Maybe I have a hard time relating to that specific story because it's hard for me to imagine believing any metacosmological or anthropic argument with >95% confidence. Even if within that argument, everything points to "I'm in a simulation etc.", there's a big heap of "is metacosmology really what I should be thinking about?"-type uncertainty on top. At least for me.
I think "people who do counterintuitive things" for religious reasons usually have more direct motivations—maybe they have mental health issues ... (read more)
Thanks for that! But I share with OP the intuition that these are weird failure modes that come from weird reasoning. More specifically, it's weird from the perspective of human reasoning.
It seems to me that your story is departing from human reasoning when you say "you posses a great desire to help whomever has summoned you into the world". That's one possible motivation, I suppose. But it wouldn't be a typical human motivation.
The human setup is more like: you get a lot of unlabeled observations and assemble them into a predictive world-model, and you al... (read more)
I still think this post is correct in spirit, and was part of my journey towards a good understanding of neuroscience, and promising ideas in AGI alignment / safety.
But there are a bunch of little things that I got wrong or explained poorly. Shall I list them?
First, my "neocortex vs subcortex" division eventually developed into "learning subsystem vs steering subsystem", with the latter being mostly just the hypothalamus and brainstem, and the former being everything else, particularly the whole telencephalon and cerebellum. The main difference is that the "... (read more)
By the same token, I think every neurotypical human thinking about Newcomb's problem is using counterfactual reasoning, and I think that there isn't any interesting difference in the general nature of the counterfactual reasoning that they're using.
I find this confusing as CDT counterfactuals where you can only project forward seem very different from things like FDT where you can project back in time as well.
I think there is "machinery that underlies counterfactual reasoning" (which incidentally happens to be the same as "the machinery that underlies imag... (read more)
I like this post but I'm a bit confused about why it would ever come up in AI alignment. Since you can't get an "ought" from an "is", you need to seed the AI with labeled examples of things being good or bad. There are a lot of ways to do that, some direct and some indirect, but you need to do it somehow. And once you do that, it would presumably disambiguate "trust public-emotional supervisor" from "trust private-calm supervisor".
Hmm, maybe the scheme you have in mind is something like IRL? I.e.: (1) AI has a hardcoded template of "Boltzmann rational agen... (read more)
I also think that there are lots of specific operations that are all "counterfactual reasoning"
Agreed. This is definitely something that I would like further clarity on
Hmm, my hunch is that you're misunderstanding me here. There are a lot of specific operations that are all "making a fist". I can clench my fingers quickly or slowly, strongly or weakly, left hand or right hand, etc. By the same token, if I say to you "imagine a rainbow-colored tree; are its leaves green?", there are a lot of different specific mental models that you might be invoking. (It c... (read more)
I think brains build a generative world-model, and that world-model is a certain kind of data structure, and "counterfactual reasoning" is a class of operations that can be performed on that data structure. (See here.) I think that counterfactual reasoning relates to reality only insofar as the world-model relates to reality. (In map-territory terminology: I think counterfactual reasoning is a set of things that you can do with the map, and those things are related to the territory only insofar as the map is related to the territory.)
I also think that ther... (read more)
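One extremely simplified way to cash out "counterfactual reasoning is a set of things you can do with the map" (a toy sketch, all structure invented): represent the world-model as a data structure where each variable is computed from its parents, and implement a counterfactual query as an operation that overrides a variable and re-runs the model.

```python
def evaluate(model, interventions=None):
    """Run the world-model, optionally overriding ("intervening on") variables."""
    interventions = interventions or {}
    values = {}
    for name, fn in model.items():          # assumes parents come before children
        values[name] = interventions.get(name, fn(values))
    return values

# A toy causal world-model: it's raining, so the grass is wet.
model = {
    "rain":      lambda v: True,
    "sprinkler": lambda v: False,
    "wet_grass": lambda v: v["rain"] or v["sprinkler"],
}

factual = evaluate(model)
counterfactual = evaluate(model, {"rain": False})   # "what if it hadn't rained?"
print(factual["wet_grass"], counterfactual["wet_grass"])  # True False
```

Note that the counterfactual query only ever touches the data structure; it tells you about reality only insofar as the model itself does.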
How much are you interested in a positive vs normative theory of counterfactuals? For example, do you feel like you understand how humans do counterfactual reasoning, and how and why it works for them (insofar as it works for them)? If not, is such an understanding what you're looking for? Or do you think humans are not perfect at counterfactual reasoning (e.g. maybe because people disagree with each other about Newcomb's problem etc.) and there's some deep notion of "correct counterfactual reasoning" that humans are merely approximating, and the deeper "c... (read more)
I’ll set aside what happens “by default” and focus on the interesting technical question of whether this post is describing a possible straightforward-ish path to aligned superintelligent AGI.
The background idea is “natural abstractions”. This is basically a claim that, when you use an unsupervised world-model-building learning algorithm, its latent space tends to systematically learn some patterns rather than others. Different learning algorithms will converge on similar learned patterns, because those learned patterns are a property of the world, not an ... (read more)
Yes, what you said. The opposite of "a human-legible learning algorithm" is "a nightmarishly-complicated Rube-Goldberg-machine learning algorithm".
If the latter is what we need, we could still presumably get AGI, but it would involve some automated search through a big space of many possible nightmarishly-complicated Rube-Goldberg-machine learning algorithms to find one that works.
That would be a different AGI development story, and thus a different blog post. Instead of "humans figure out the learning algorithm" as an exogenous input to the path-to-AGI, w... (read more)
I understood the idea of Paul's post as: if we start in a world where humans-with-aligned-AIs control 50% of relevant resources (computers, land, minerals, whatever), and unaligned AIs control 50% of relevant resources, and where the strategy-stealing assumption is true—i.e., the assumption that any good strategy that the unaligned AIs can do, the humans-with-aligned-AIs are equally capable of doing themselves—then the humans-with-aligned-AIs will wind up controlling 50% of the long-term future. And the same argument probably holds for 99%-1% or any other ... (read more)