## AI ALIGNMENT FORUM

Steve Byrnes

I'm an AGI safety / AI alignment researcher in Boston with a particular focus on brain algorithms—see https://sjbyrnes.com/agi.html. Email: steven.byrnes@gmail.com. Twitter: @steve47285. Employer: https://astera.org/. Physicist by training.

# Sequences

Intro to Brain-Like-AGI Safety

# Wiki Contributions

I think gain-of-function research is (probably) bad from a total utilitarian perspective, but it's much less clear that it's bad from the perspective of people alive today.

Strong disagree; I have a footnote linking this discussion, here’s an excerpt:

Yeah, so I mean, take it down to a one in 10,000 leak risk [per worker per year, which is unrealistically optimistic by a factor of 10, for the sake of argument] and then, yeah, looking at COVID [gives us] an order of magnitude for damages. So, $10 trillion, several million dead, maybe getting around 10 million excess dead. And you know, of course these things could be worse, you could have something that did 50 or 100 times as much damage as COVID, but [even leaving that aside], 1/10,000ths of a $10 trillion burden or 10 million lives [equals] a billion dollars, 1,000 dead, that’s quite significant. And if you… You could imagine that these labs had to get insurance, in the way that if you’re going to drive a vehicle where you might kill someone, you’re required to have insurance so that you can pay to compensate for the damage. And so if you did that, then you might need a billion dollars a year of insurance for one of these labs.

And now, there’s benefits to the research that they do. They haven’t been particularly helpful in this pandemic, and critics argue that this is a very small portion of all of the work that can contribute to pandemic response and so it’s not particularly beneficial. I think there’s a lot to that, but regardless of that, it seems like there’s no way that you would get more defense against future pandemics by doing one of these gain-of-function experiments that required a billion dollars of insurance, than you would by putting a billion dollars into research that doesn’t endanger the lives of innocent people all around the world.
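As a quick sanity check on the arithmetic in the excerpt above, here is a minimal Python sketch. The function name `expected_annual_burden` and its parameters are hypothetical, introduced only for illustration. The inputs (a 1-in-10,000 annual leak risk, $10 trillion in damages, 10 million deaths) are the round numbers from the excerpt, not independent estimates.

```python
def expected_annual_burden(one_in: int, damage_dollars: int, deaths: int) -> tuple[int, int]:
    """Expected yearly cost of a 1-in-`one_in` chance of a COVID-scale leak."""
    # Expected value = probability * magnitude.
    # Integer division keeps these round numbers exact (no float rounding).
    return damage_dollars // one_in, deaths // one_in


# Round numbers taken from the interview excerpt (illustrative, not estimates of mine).
dollars, lives = expected_annual_burden(
    one_in=10_000,
    damage_dollars=10_000_000_000_000,  # $10 trillion
    deaths=10_000_000,                  # 10 million excess deaths
)
print(f"${dollars:,} and {lives:,} deaths per year in expectation")
```

With those inputs this comes out to a billion dollars and 1,000 deaths per year, matching the figures in the excerpt.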

people [are] doing gain-of-function research are doing it because they think it reduces risks.

Objectively, GoF research results are mostly unrelated to pandemic prevention & mitigation measures like vaccines, monitoring, antivirals, etc., as I understand it—see the interview above. I’ve never met GoF researchers and I don’t know what they’re thinking in their innermost thoughts. I doubt they’re moustache-twirling villains. They might feel pressure to ensure funding to support their grad students and get tenure. They might think “other groups are sloppy in following the BSL procedures but that would never happen in my group!” They might think (cf. the Oppenheimer quote) “man this research project would be super-cool”. They might tell themselves that it’s net helpful for pandemic prevention, independent of whether that’s true or false.

Another disanalogy…

Your disanalogies seem to me like they’re pointing in the bad direction, not good. AGI is more dangerous, AGI is more obviously important (and therefore lots of different groups all around the world are going to be very motivated to work on it), etc. I guess you’re saying “after a robot killing spree, the dangers of AGI will be very salient to everyone”? And yet, immediately after COVID, the dangers of lab leaks are not salient to politicians or the general public or the NIH etc. More generally, pandemics ought to be salient right now (and for the past couple years), one would think, and yet the USA government pandemic prevention effort is practically nonexistent. In fact, many of the actions the USA government took on pandemic mitigation were counterproductive, and many in the public have now (ironically) become generally opposed to worrying about future pandemics!

There's a general phenomenon of if you break down events into independent events that all have to happen for X to happen you can make the probability of X as low as you want.

If you prefer, you could think of my list as a list of things that could go wrong. I’m not trying to impress you with the length of the list by padding it out with things that could (but almost definitely won’t) go wrong. I think many of those things are highly likely to go wrong. If they don’t seem likely to you, we might have different background assumptions?

For example, in your view, how many years will there be between (A) “there’s an AI that can really make a huge difference on the missile defense problem, and this AI is sufficiently established and reliable and trusted to have been incorporated into military systems around the world” versus (B) “The Singularity”? If your answer is “30 years”, then OK those weak AIs have a decent shot at changing the situation. Whereas my answer is maybe “2 years”, and also maybe “zero years”. You just can’t design and budget and approve and build and deploy a massive military system like missile defense within 2 years. Probably not even within 10 years. Right?

It seems likely to me that Apple and Microsoft are making the correct decision to use the buggy OS system, whereas if AGI x-risk was high they'd be making the incorrect decision.

It’s a pretty important part of my model that some people on earth have incorrect beliefs about how high AGI x-risk is. Or more precisely, about how high x-risk is for the particular AGI source code that they have in front of them.

That second sentence is important! For example, if Company A’s AGI goes terribly wrong, but the robot rampage is put down and the world doesn’t end, then we might hope that Company B says “Gee, I guess AGI x-risk is high”. But it’s equally possible for Company B to say “Pfft, y'know, Company A was doing Stupid Thing X, and that’s why their AGI had a misalignment issue. But we’re smarter than that! So x-risk is low for our AGI!”, but actually Company B is wrong about this, because their AGI source code has Stupid Thing Y instead.

(If everyone in the world had access to a software tool that would scan some source code and output the x-risk probability, and the tool was always correct, and everybody trusted the tool, then I would feel much better about x-risk!!)

Once deepmind have made their weak AGI it seems very likely that they could make very substantial advances in alignment that also make their AI systems more capable like RLHF.  FB would be incentivised to use the same methods.

It's also unclear to me if there would be other firms trying to make AGI once the first AGI gets made. It seems like the return to capital would be so insanely, vastly higher by putting it into making the existing AGI cure cancer and solve fusion.

I’m confused about your model here. Does DM email FB their source code and training environments etc.? If so, why would DM do that? If not, how can FB use the same methods? (More generally, the less detail DM provides, the less confident we should be that FB does the same thing in the same way, right?)

Is DM releasing the trained model weights? If yes, why would DM do that? And why don’t you expect FB to try fine-tuning the weights to improve the performance / stake out a niche? Or if FB does change the weights by fine-tuning, why isn’t that an x-risk? Or if DM doesn’t release the trained model weights, does it sell API access to the model at cost? If so, why? Or if DM keeps a margin for themselves, wouldn’t FB want to try to make a similarly-capable AGI to keep that margin for themselves?

FB has a whole AI research lab, FAIR. When DM makes this AGI, FB isn’t just going to lay off everyone in FAIR, right? And the FAIR people aren’t just going to twiddle their thumbs and ignore this revolutionary advance, right? They’re going to try to copy it and build on it and surpass it. It’s what they’ve been doing after noticing recent advances in language models, and I think it’s generally what they’ve always done, right? Why would that change?

I think your example was doomed from the start because

• the AGI was exercising its intelligence & reason & planning etc. towards an explicit, reflectively-endorsed desire for “the nanotech problem will get solved”,
• the AGI was NOT exercising its intelligence & reason & planning etc. towards an explicit, reflectively-endorsed desire for “I am being helpful / I am being docile / I am acting with integrity / blah blah”.

So the latter is obviously doomed to get crushed by a sufficiently-intelligent AGI.

If we can get to a place where the first bullet point still holds, but the AGI also has a comparably-strong, explicit, reflectively-endorsed desire for “I am being helpful / I am being docile / I am acting with integrity / blah blah”, then we’re in a situation where the AGI is applying its formidable intelligence to fight for both bullet points, not just the first one. And then we can be more hopeful that the second bullet point won’t get crushed. (Related.)

In particular, if we can pull that off, then the AGI would presumably do “intelligent” things to advance the second bullet point, just like it does “intelligent things” to advance the first bullet point in your story. For example, the AGI might brainstorm subtle ways that its plans might pattern-match to deception, and feel great relief (so to speak) at noticing and avoiding those problems before they happen. And likewise, it might brainstorm clever ways to communicate more clearly with its supervisor, and treat those as wonderful achievements (so to speak). Etc.

Of course, there remains the very interesting open question of how to reliably get to a place where the AGI has an explicit, endorsed, strong desire for “I am being helpful / I am being docile / I am acting with integrity / blah blah”.

In particular, if we zap the AGI with negative reward when it’s acting from a deceptive motivation and positive reward when it’s acting from a being-helpful motivation, would those zaps turn into a reflectively-endorsed desire for “I am being docile / helpful / etc.”? Maybe, maybe not, I dunno. (More detailed discussion here.) For example, most humans get zapped with positive reward when they eat yummy ice cream, and yet the USA population seems to have wound up pretty spread out along the spectrum from fully endorsing the associated desire as ego-syntonic (“Eating ice cream is friggin awesome!”) to fully rejecting & externalizing it as ego-dystonic (“I sometimes struggle with a difficult-to-control urge to eat ice cream”). Again, I think there are important open questions about how this process works, and more to the point, how to intervene on it for an AGI.

"I have to be wrong about something, which I certainly am. I have to be wrong about something which makes the problem easier rather than harder, for those people who don't think alignment's going to be all that hard. If you're building a rocket for the first time ever, and you're wrong about something, it's not surprising if you're wrong about something. It's surprising if the thing that you're wrong about causes the rocket to go twice as high, on half the fuel you thought was required and be much easier to steer than you were afraid of."

I agree with OP that this rocket analogy from Eliezer is a bad analogy, AFAICT. If someone is trying to assess the difficulty of solving a technical problem (e.g. building a rocket) in advance, then they need to brainstorm potential problems that might come up, and when they notice one, they also need to brainstorm potential technical solutions to that problem. For example “the heat of reentry will destroy the ship” is a potential problem, and “we can invent new and better heat-resistant tiles / shielding” is a potential solution to that problem. During this process, I don’t think it’s particularly unusual for the person to notice a technical problem but overlook a clever way to solve that problem. (Maybe they didn’t recognize the possibility of inventing new super-duper-heat-resistant ceramic tiles, or whatever.) And then they would wind up overly pessimistic.

There's no way to raise a human such that their value system cleanly revolves around the one single goal of duplicating a strawberry, and nothing else. By asking for a method of forming values which would permit such a narrow specification of end goals, you're asking for a value formation process that's fundamentally different from the one humans use. There's no guarantee that such a thing even exists, and implicitly aiming to avoid the one value formation process we know is compatible with our own values seems like a terrible idea.

I narrowly agree with most of this, but I tend to say the same thing with a very different attitude:

I would say: “Gee it would be super cool if we could decide a priori what we want the AGI to be trying to do, WITH SURGICAL PRECISION. But alas, that doesn’t seem possible, at least not according to any method I know of.”

I disagree with you in your apparent suggestion that the above paragraph is obvious or uninteresting, and also disagree with your apparent suggestion that “setting an AGI’s motivations with surgical precision” is such a dumb idea that we shouldn’t even waste one minute of our time thinking about whether it might be possible to do that.

For example, people who are used to programming almost any other type of software have presumably internalized the idea that the programmer can decide what the software will do with surgical precision. So it's important to spread the idea that, on current trends, AGI software will be very different from that.

BTW I do agree with you that Eliezer’s interview response seems to suggest that he thinks aligning an AGI to “basic notions of morality” is harder and aligning an AGI to “strawberry problem” is easier. If that’s what he thinks, it’s at least not obvious to me.

I think one example (somewhat overlapping one of yours) is my discussion of the so-called “follow-the-trying game” here.

Speaking for myself…

I think I do a lot of “engaging with neuroscientists” despite not publishing peer-reviewed neuroscience papers:

• I write lots of blog posts intended to be read by neuroscientists, i.e. I will attempt to engage with background assumptions that neuroscientists are likely to have, not assume non-neuroscience background knowledge or jargon, etc.
• [To be clear, I also write even more blog posts that are not in that category.]
• When one of my blog posts specifically discusses some neuroscientist’s work, I’ll sometimes cold-email them and ask for pre-publication feedback.
• When I have questions about a neuroscientist’s paper, I’ll sometimes cold-email them to try to start a chat.
• There are a handful of neuroscientists whose work is unusually relevant to AGI capabilities and/or safety (in my opinion), and I’m kinda always on the lookout for excuses to get in touch with them, with some amount of success I think.
• I got interviewed on a popular podcast in AI-adjacent neuroscience, and I have a 1-hour zoom talk that I give whenever anyone invites me.

Between those things, plus word-of-mouth, I feel pretty confident that WAY more neuroscientists are familiar with my detailed ideas than is typical given that I’ve been in the field full-time for only 2 years (and spend barely half my time on neuroscience anyway), and also WAY more than the counterfactual where I spend the same amount of time on outreach / communication but do so mainly via publishing peer-reviewed neuroscience papers. Like, sometimes I’ll read a peer-reviewed paper in detail, and talk to the author, and the author remarks that I might be the first person to have ever read it in detail apart from their own close collaborators and the referees.

If we compare

• (A) “actual progress”, versus
• (B) “legible signs of progress”,

it seems obvious to me that everyone has an incentive to underinvest in (A) relative to (B). You get grants & jobs & status from (B), not (A), right? And papers can be in (B) while being only minimally, or not at all, in (A).

In academia, people talk all the time about how people are optimizing their publication record to the detriment of field-advancement, e.g. making results sound misleadingly original and important, chasing things that are hot, splitting results into unnecessarily many papers, etc. Right?

Hmm, I’m trying to guess where you’re coming from. Maybe you’d propose a model with

• (C) “figuring things out”
• (D) “communicating those things to the x-risk community”
• (E) “communicating those things to the ML community”

And the idea is that someone whose funding & status is coming entirely from the x-risk community has no incentive to do (E). Is that it?

If so, I strongly endorse that (E) is worth doing, to a nonzero extent. But it’s not obvious to me that the AGI x-risk community is collectively underinvesting in (E); I think I lean towards “overinvesting” on the margin. (I repeat: on the margin!! Zero investment is too little!)

I think that everyone who is both motivated by x-risk and employed by a CS department—e.g. CHAI, the OP (Krueger), etc.—is doing (E) intensively all the time, and will keep doing so perpetually. We don’t have to worry about (E) going to zero. If other people do (D) to the exclusion of (E), I think good ideas will trickle out through the above-mentioned people, and/or through ML people getting interested and gradually learning enough jargon to read the (D) stuff.

I think that, for some people / projects, getting their results into an ML conference would cut down the amount of (C) & (D) that gets done by a factor of 2, or even more, or much more when it affects the choice of what to work on in the first place. And I think that’s a very bad tradeoff.

I would say a similar thing about any technical field. I want climate modelers to spend most of their time figuring out how to do better climate modeling, in collaboration with other climate modelers, using climate modeling jargon. Obviously, to some extent, there has to be accessible communication to a wider audience of stakeholders about what’s going on in climate modeling. But that’s going to happen anyway—plenty of people like writing popular books and blog posts and stuff, and kudos to those people, and likewise some stakeholders outside the field will naturally invest in learning climate modeling jargon and injecting themselves into that conversation, and kudos to those people too. Groupthink is bad, interdisciplinarity is good, and so on, but lots of important technical work just can’t be easily communicated to people with no subject-matter expertise or investment in the subfield, and it’s really really bad if that kind of work falls by the wayside.

Then separately, if people are underinvesting in (E), I think it’s non-obvious (and often false) that the solution to that problem is to try to get papers through peer review and into ML conferences.

• For one thing, if an x-risk-concerned person can write an ML paper, they can equally well write a blog post that avoids x-risk jargon (and maybe even replaces it with ML jargon), and I think that would have a comparable chance of getting widely read by ML people and successfully communicating substantive ideas to them. It’s not like every paper in an ML conference gets widely read and cited anyway, right? But blog posts take absurdly less time.
• For another thing, if we assume for the sake of argument that the “gravitas” / style / cite-ability of academic papers is a feature not a bug, then people can get those by putting a paper onto ML arxiv, and that takes much less time than going through peer-review. I think in some cases that’s a great choice.

Needless to say, writing papers and getting them into ML conferences is time-consuming. There's an opportunity cost. Is it worth doing despite the opportunity cost? I presume that, for the particular people you talked to, and the particular projects they were doing, your judgment was “Yes the opportunity cost was well worth paying”. And I’m in no position to disagree—I don’t know the details. But I wouldn't want to make any blanket statements. If someone says the opportunity cost is not worth it for them, I see that as a claim that a priori might be true or false. Your post seems to imply that almost everyone is making an error in the same direction, and therefore funders should put their thumb on the scale. That’s at least not obvious to me.

You seem to be suggesting that people in academia don’t read blog posts, and that blog posts are generically harder to read than papers. Both seem obviously false to me; for example, many peer-reviewed ML papers come along with blog posts, and the blog posts are intended to be the more widely accessible of the two.

Of course, blog posts can be unreadable too. Generally, I think that it's healthy for people to write BOTH (A) stuff with lots of jargon & technical details that conveys information well to people already in the know AND (B) highly-accessible stuff intended for a broader audience. (That’s what I try to do, at least.) I think it’s true and uncontroversial to say that blog posts are great for (B). I also happen to think that blog posts are great for (A).

Anyway, I think this OP isn’t particularly addressed at me (I have nothing I want to share that would fit in at an ML conference, as opposed to neuroscience), but if anyone cares I’d be happy to discuss in detail why I haven’t written any peer-reviewed papers related to my AI alignment work since I started it full-time 2 years ago, and have no immediate plans to, and what I’ve been doing instead to mitigate any downsides of that decision. It’s not even a close call; this decision seems very overdetermined from my perspective.

AGI system could design both missile defence and automated missile detection systems…

This is Section 3.3, “Societal resilience against highly-intelligent omnicidal agents more generally (with or without AGI assistance)”. As I mention, I’m all for trying.

I think once we go through the conjunction of (1) a military has AGI access, (2) and sufficiently trusts it, (3) and is willing to spend the money to build super missile defense, (4) and this super missile defense is configured to intercept missiles from every country including the countries building that very missile defense system, (5) and actually builds it and turns it on within a few years of the AGI being available … it starts to sound pretty far-fetched to me, even leaving aside the technical question of whether super missile defense is possible. (Current missile defense systems only sometimes work even in the absence of any countermeasures, and such countermeasures are trivial and highly effective, in my understanding.)

A more general reason to think that once we have weak, somewhat aligned AGI systems, a misaligned AGI won't try to kill all humans is that there are costs to conflict that can be avoided by bargaining. It seems like if there's already somewhat aligned AGI that a new AI system, particularly if it's two years behind, can't be 100% confident of winning a conflict so it would be better to try to strike a bargain than attempt a hostile takeover.

I think that the most dangerous kind of misaligned AGI would be one with long-term real-world aspirations that don’t involve humans, and in that case, it seems to me that the only significant “cost of conflict” would be the probability of losing the conflict entirely. However much Earth is turned to rubble, who cares, it can be rebuilt in the long term, at least once we get past the point where an AGI can be self-sufficient, and I tend to think that point comes pretty early (cf. Section 3.3.3).

So then the question is: would it lose the conflict entirely? Would a bigger docile law-abiding rule-following AGI lose to a smaller out-of-control AGI? Seems pretty plausible to me, for various reasons discussed in the post. I think “docile and law-abiding and rule-following” is such a massive disadvantage that it can outweigh extra compute & other resources.

I'm also in general sceptical of risk stories that go along the lines of "here is dangerous thing x and if even 1 person does x then everyone dies." It seems like this applies to lots of things; many Americans have guns but people randomly going and shooting people is very rare,  many people have the capability to make bioweapons but don't, and pilots could intentionally crash planes.

If there are N people in a position to do Destructive act X, and each independently might do it with probability P, then it will probably happen if the product N×P ≳ 1. We shouldn’t throw out that reasoning—this reasoning is obviously true, right? Just math! Instead, we could take a look around us and note that there are remarkably few people who are simultaneously competent enough to think of & execute destructive plans (or to get into a position of power & access that allows them to do that) and callous (or ideological or whatever) enough to actually carry them out. Lucky us! That doesn’t disprove the math. It just means that N×P is small here. I discuss this a bit more in Section 3.3.3.
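To make the N×P reasoning concrete, here is a minimal Python sketch. It assumes the idealized setup described above (N actors who each act independently with the same probability P). The function name `prob_at_least_one` and the example values of N and P are hypothetical, chosen only to show the three regimes.

```python
import math


def prob_at_least_one(n: int, p: float) -> float:
    """Chance that at least one of n independent actors does X, each with probability p."""
    if n <= 0 or p <= 0.0:
        return 0.0
    if p >= 1.0:
        return 1.0
    # 1 - (1 - p)**n, written with log1p/expm1 so that a tiny p does not round to zero.
    return -math.expm1(n * math.log1p(-p))


# Illustrative (made-up) values: N*P well below 1, equal to 1, and well above 1.
for n, p in [(1_000, 1e-6), (1_000, 1e-3), (1_000, 1e-2)]:
    print(f"N*P = {n * p:g}: P(at least one) = {prob_at_least_one(n, p):.4f}")
```

Under the independence assumption the probability is roughly 1 − exp(−N×P), so it is small when N×P is well below 1, about 63% at N×P = 1, and close to certain once N×P is well above 1.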

But I think that AGI-world will have much higher N×P than human-world, because AGI motivation will have a much wider distribution, and also AGI competence / patience / etc. will be consistently much higher.

Also, even without AGIs, cf. “Yudkowsky-Moore law of mad science”, “Every 18 months, the minimum IQ to destroy the world drops by one point.” The skill & resources required to start Smallpox 2.0 really is going down each year, as biology research tooling improves (e.g. the things you can do with CRISPR were already possible before CRISPR, but much much harder, as I understand it). So I see this as an existing bad situation, and I think AGI will make this existing problem much worse very fast.

If tech companies had well-calibrated beliefs about the probability that an AI system they create kills everyone they would take safety measures that make those risks very low. It seems like in a world where we have weak AGI for two years they could be convinced of this.

Maybe! And yet, in a world that has had dozens of deadly pandemics escape from labs, perhaps including COVID, people are still doing those kinds of experiments as we speak, and indeed the USA government has been funding that work, even after COVID.

It seems unlikely to me that the AGI created by large tech firms would be lethally misaligned in the sense that it would want to kill everyone if we've already had AGI that can do things like cure cancer for two years.

I don’t understand why you think that. There’s an operating system proven to have no bugs, but that doesn’t stop Microsoft and Apple from making buggy operating systems. The latter have more features! By the same token, if DeepMind makes a helpful docile AGI and spends two years curing various types of cancer, then Facebook will obviously want to try making AGI too. And they’ll want their AGI to be even bigger and smarter and more powerful than DeepMind’s. And maybe Facebook will mess up! Right? It’s different codebases, different training environments, and made by different groups. I don’t see any reason to think that DeepMind’s success at alignment will ensure Facebook’s success at alignment. Particularly when Facebook’s leadership (Mark Zuckerberg & Yann LeCun) are both aggressively dismissive of AGI x-risk being even a remote possibility.

In this world it also seems unlikely that tech firm 2's new misaligned AGI discovers a generic learning algorithm that allows it to become a superintelligence without acquiring more compute, since it seems like tech firm 1's AGI would also have discovered this generic learning algorithm in that case.

I agree that this particular thing you describe isn’t too likely, because “doing AGI capabilities research” is one thing that I can easily imagine DeepMind using their docile AGI for. My concern comes from other directions.

It seems likely in this world that the most powerful actors in society have become just massively massively more powerful than "smart guy with laptop" such that the offense-defence balance would have to be massively massively tilted for there still be a threat here. For instance it seems like we could have dyson spheres in this world and almost certainly have nanotech if the "AGI whipped up on some blokes laptop" could quickly acquire and make nanotech itself.

I mean, in the world today, there are lots of very powerful actors, like governments, big corporations, etc. But none of those very powerful actors has halted gain-of-function research. In fact, some of those very powerful actors are pushing for gain-of-function research!

Right now, most people & institutions evidently don’t give a crap about societal resilience, and will happily trade away societal resilience for all kinds of other stupid things. (Hence gain-of-function research.) If there are lots of docile aligned AGIs that are all doing what humans and human institutions specifically ask them to do, then we should expect that those aligned AGIs will likewise trade away societal resilience for other things. Just like their human overseers want them to. That’s alignment! Right?

This feels kinda unrealistic for the kind of pretraining that's common today, but so does actually learning how to do needle-moving alignment research just from next-token prediction. If we *condition on* the latter, it seems kinda reasonable to imagine there must be cases where an AI has to be able to do needle-moving alignment research in order to improve at next-token prediction, and this feels like a reasonable way that might happen.

For what little it’s worth, I mostly don’t buy this hypothetical (see e.g. here), but if I force myself to accept it, I think I’m tentatively on Holden’s side.

I’m not sure this paragraph will be helpful for anyone but me, but I wound up with a mental image vaguely like a thing I wrote long ago about “Straightforward RL” versus “Gradient descent through the model”, with the latter kinda like what you would get from next-token prediction. Again, I’m kinda skeptical that things like “gradient descent through the model” would work at all in practice, mainly because the model is only seeing a sporadic surface trace of the much richer underlying processing; but if I grant that it does (for the sake of argument), then it would be pretty plausible to me that the resulting model would have things like “strong preference to generally fit in and follow norms”, and thus it would do fine at POUDA-avoidance.