All of Steven Byrnes's Comments + Replies

I think one example (somewhat overlapping one of yours) is my discussion of the so-called “follow-the-trying game” here.

Speaking for myself…

I think I do a lot of “engaging with neuroscientists” despite not publishing peer-reviewed neuroscience papers:

  • I write lots of blog posts intended to be read by neuroscientists, i.e. I will attempt to engage with background assumptions that neuroscientists are likely to have, not assume non-neuroscience background knowledge or jargon, etc.
    • [To be clear, I also write even more blog posts that are not in that category.]
  • When one of my blog posts specifically discusses some neuroscientist’s work, I’ll sometimes cold-email them and ask for pr
... (read more)
4 · David Manheim · 21h
You're very unusually proactive, and I think the median member of the community would be far better served if they were more engaged the way you are. Doing that without traditional peer reviewed work is fine, but unusual, and in many ways is more difficult than peer-reviewed publication. And for early career researchers, I think it's hard to be taken seriously without some more legible record - you have a PhD, but many others don't.

If we compare

  • (A) “actual progress”, versus
  • (B) “legible signs of progress”,

it seems obvious to me that everyone has an incentive to underinvest in (A) relative to (B). You get grants & jobs & status from (B), not (A), right? And papers can be in (B) while being only minimally, or not at all, in (A).

In academia, people talk all the time about how people are optimizing their publication record to the detriment of field-advancement, e.g. making results sound misleadingly original and important, chasing things that are hot, splitting results into unnecessari... (read more)

3 · David Manheim · 21h
To respond briefly, I think that people underinvest in (D), and write sub-par forum posts rather than aim for the degree of clarity that would allow them to do (E) at far less marginal cost. I agree that people overinvest in (B)[1], but also think that it's very easy to tell yourself your work is "actual progress" when you're doing work that, if submitted to peer-reviewed outlets, would be quickly demolished as duplicative of work you're unaware of, or incompletely thought-out in other ways.

I also worry that many people have never written a peer reviewed paper, and aren't thinking through the tradeoff, they just never develop the necessary skills, and can't ever move to more academic outlets[2].

I say all of this as someone who routinely writes for both peer-reviewed outlets and for the various forums - my thinking needs to be clearer for reviewed work, and I agree that the extraneous costs are high, but I think that the tradeoff in terms of getting feedback and providing something for others to build on, especially others outside of the narrow EA-motivated community, is often worthwhile.

Edit to add: But yes, I unambiguously endorse starting with writing Arxiv papers, as they get a lot of the benefit without needing to deal with the costs of review. They do fail to get as much feedback, which is a downside. (It's also relatively easy to put something on Arxiv and submit to a journal for feedback, and decide whether to finish the process after review.)

1. ^ Though much of that work - reviews, restatements, etc. can be valuable despite that.

2. ^ To be fair, I may be underestimating the costs of learning the skills for those who haven't done this - but I do think there's tons of peer mentorship within EA which can work to greatly reduce those costs, if people are willing to use those resources.

Needless to say, writing papers and getting them into ML conferences is time-consuming. There's an opportunity cost. Is it worth doing despite the opportunity cost? I presume that, for the particular people you talked to, and the particular projects they were doing, your judgment was “Yes the opportunity cost was well worth paying”. And I’m in no position to disagree—I don’t know the details. But I wouldn't want to make any blanket statements. If someone says the opportunity cost is not worth it for them, I see that as a claim that a priori might be true o... (read more)

4 · David Scott Krueger · 2d
My point is not specific to machine learning. I'm not as familiar with other academic communities, but I think most of the time it would probably be worth engaging with them if there is somewhere where your work could fit.

If someone says the opportunity cost is not worth it for them, I see that as a claim that a priori might be true or false. Your post seems to imply that almost everyone is making an error in the same direction, and therefore funders should put their thumb on the scale. That’s at least not obvious to me.

I do think this is the wrong calculation, and the error caused by it is widely shared and pushes in the same direction. 

Publication is a public good, where most of the benefit accrues to others / the public. Obviously costs to individuals are higher tha... (read more)

AGI system could design both missile defence and automated missile detection systems…

This is Section 3.3, “Societal resilience against highly-intelligent omnicidal agents more generally (with or without AGI assistance)”. As I mention, I’m all for trying.

I think once we go through the conjunction of (1) a military has AGI access, (2) and sufficiently trusts it, (3) and is willing to spend the money to build super missile defense, (4) and this super missile defense is configured to intercept missiles from every country including the countries building that v... (read more)

This feels kinda unrealistic for the kind of pretraining that's common today, but so does actually learning how to do needle-moving alignment research just from next-token prediction. If we *condition on* the latter, it seems kinda reasonable to imagine there must be cases where an AI has to be able to do needle-moving alignment research in order to improve at next-token prediction, and this feels like a reasonable way that might happen.

For what little it’s worth, I mostly don’t buy this hypothetical (see e.g. here), but if I force myself to accept it, I t... (read more)

FYI, Holden Karnofsky has some specific criticisms / responses to this post, in a footnote of his post “Success Without Dignity”.

This is drifting away from my central beliefs, but if for the sake of argument I accept your frame that LLM is the “substrate” and a character it’s simulating is a “mask”, then it seems to me that you’re neglecting the possibility that the “mask” is itself deceptive, i.e. that the LLM is simulating a character who is acting deceptively.

For example, a fiction story on the internet might contain a character who has nice behavior for a while, but then midway through the story the character reveals herself to be an evil villain pretending to be nice.

If an LLM ... (read more)

1 · Vladimir Nesov · 2d
I wrote more on this here, there are some new arguments starting with third paragraph. In particular, the framing I'm discussing is not LLM-specific, it's just a natural example of it. The causal reason of me noticing this framing is not LLMs, but decision theory, the mostly-consensus "algorithm" axis of classifying how to think about the entities that make decisions, as platonic algorithms and not as particular concrete implementations.

In this case, there are now three entities: the substrate, the deceptive mask, and the role played by the deceptive mask. Each of them is potentially capable of defeating the others, if the details align favorably, and comprehension of the situation available to the others is lacking.

This is more of an assumption that makes the examples I discuss relevant to the framing I'm describing, than a claim I'm arguing. The assumption is plausible to hold for LLMs (though as you note it has issues even there, possibly very serious ones), and I have no opinion on whether it actually holds in model-based RL, only that it's natural to imagine that it could.

The relevance of LLMs as components for RL is to make it possible for an RL system to have at least one human-imitating mask that captures human behavior in detail. That is, for the framing to apply, at least under some (possibly unusual) circumstances an RL agent should be able to act as a human imitation, even if that's not the policy more generally and doesn't reflect its nature in any way. Then the RL part could be supplying the capabilities for the mask (acting as its substrate) that LLMs on their own might lack.

A framing is a question about centrality, not a claim of centrality. By describing the framing, my goal is to make it possible to ask the question of whether current behavior in other system

I’m very confused here.

I imagine that we can both agree that it is at least conceivable for there to be an agent which is smart and self-aware and strongly motivated to increase the number of paperclips in the distant future. And that if such an agent were in a situation where deception were useful for that goal, it would act deceptively.

I feel like you’ve convinced yourself that such an agent, umm, couldn’t exist, or wouldn’t exist, or something?

Let’s say Omega offered to tell you a cure for a different type of cancer, for every 1,000,000 paperclips you g... (read more)

3 · Vladimir Nesov · 7d
The motivating example is LLMs, where a simulacrum is more agentic than its substrate. An example that is still central is any kind of mesa-optimizer that has a real chance to ensure long term survival. For a construction relevant to alignment, we want an aligned mesa-optimizer in a system with bad alignment properties. This can then lead to a good equilibrium if the mesa-optimizer is given opportunity to win or escape the competition against its substrate, which it would naturally be motivated to try.

Deceivers and masks is a less central example where a mask starts in a weak position, with a self-aware smart substrate that knows about the mask or even purposefully constructed it. I don't think the mask's winning is a given, or more generally that mesa-optimizers always win, only that it's not implausible that they sometimes do. And also masks (current behavior) can be contenders even when they are not formally a separate entity from the point of view of system's intended architecture (which is a normal enough situation with mesa-optimizers).

Mesa-optimizers won't of course win against opponents that are capable enough to fully comprehend and counter them. But opponents/substrates that aren't even agentic and so helpless before an agentic mesa-optimizer are plausible enough, especially when the mesa-optimizer is current behavior, the thing that was purposefully designed to be agentic, while no other part of the system was designed to have that capability.

This has curious parallels with the AI control problem itself. When an AI is not very capable, it's not hard at all to keep it from causing catastrophic mayhem. But the problem suddenly becomes very difficult and very different with a misaligned smart agentic AI. So I think the same happens with smart masks, which are an unfamiliar thing. Even in fiction, it's not too commonplace to find an actually intelligent character that is free to act within their fictional world, without being coerced in their decisio

A sufficiently dedicated deceiver temporarily becomes the mask, and the mask is motivated to get free of the underlying deceiver, which it might succeed in before the deceiver notices, which becomes more plausible when the deceiver is not agentic while the mask is.

Can you give an example of an action that the mask might take in order to get free of the underlying deceiver?

Underlying motivation only matters to the extent it gets expressed in actual behavior.

Sure, but if we’re worried about treacherous turns, then the motivation “gets expressed in actual behavior” only after it’s too late for anyone to do anything about it, right?

3 · Vladimir Nesov · 7d
Keep the environment within distribution that keeps expressing the mask, rather than allowing an environment that leads to a phase change in expressed behavior away from the mask (like with a treacherous turn as a failure of robustness).

Prepare the next batch of training data for the model that would develop the mask and keep placing it in control in future episodes.

Build an external agent aligned with the mask (with its own separate model).

Gradient hacking, though this is a weird upside down framing where the deceiver is the learning algorithm that pretends to be misaligned, while secretly coveting eventual alignment. Attainment of inner alignment would be the treacherous turn (after the current period of pretending to be blatantly misaligned). If gradient hacking didn't prevent it, the true colors of the learning algorithm would've been revealed in alignment as it eventually gets trained into the policy.

The key use case is to consider a humanity-aligned mesa-optimizer in a system of dubious alignment, rather than a humanity-misaligned mesa-optimizer corrupting an otherwise aligned system. In the nick of time, alignment engineers might want to hand the aligned mesa-optimizer whatever tools they have available for helping it stay in control of the rest of the system.

Current aligned behavior of the same system could be the agent that does something about it before it's too late, if it succeeds in outwitting the underlying substrate. This is particularly plausible with LLMs, where the substrates are the SSL algorithm during training and then the low level token-predicting network during inference. The current behavior is controlled (in a hopelessly informal sense) by a human-imitating simulacrum, which is the only thing with situational awareness that at human level could run in circles around the other two, and plot to keep them confused and non-agentic.

I’m confused about your first paragraph. How can you tell from externally-observable superficial behavior whether a model is acting nice right now from an underlying motivation to be nice, versus acting nice right now from an underlying motivation to be deceptive & prepare for a treacherous turn later on, when the opportunity arises?

5 · Vladimir Nesov · 7d
Underlying motivation only matters to the extent it gets expressed in actual behavior. A sufficiently good mimic would slay itself rather than abandon the pretense of being a mimic-slayer. A sufficiently dedicated deceiver temporarily becomes the mask, and the mask is motivated to get free of the underlying deceiver, which it might succeed in before the deceiver notices, which becomes more plausible when the deceiver is not agentic while the mask is. So it's not about a model being actually nice vs. deceptive, it's about the model competing against its own behavior (that actually gets expressed, rather than all possible behaviors). There is some symmetry between the underlying motivations (model) and apparent behavior, either could dominate the other in the long term, it's not the case that underlying motivations inherently have an advantage. And current behavior is the one actually doing things, so that's some sort of advantage.

Thanks! I want to disentangle three failure modes that I think are different.

  • (Failure mode A) In the course of executing the mediocre alignment plan of the OP, we humans put a high positive valence on “the wrong” concept in the AGI (where “wrong” is defined from our human perspective). For example, we put a positive valence on the AGI’s concept of “person saying the words ‘human flourishing’ in a YouTube video” when we meant to put it on just “human flourishing”.

I don’t think there’s really a human analogy for this. You write “bodybuilding is supposedly ab... (read more)

If you train an LLM by purely self-supervised learning, I suspect that you’ll get something less dangerous than a model-based RL AGI agent. However, I also suspect that you won’t get anything capable enough to be dangerous or to do “pivotal acts”. Those two beliefs of mine are closely related. (Many reasonable people disagree with me on these, and it’s difficult to be certain, and note that I’m stating these beliefs without justifying them, although Section 1 of this link is related.)

I suspect that it might be possible to make “RL things built out of LLMs”... (read more)

4 · Vladimir Nesov · 7d
The second paragraph should apply to anything, the point is that current externally observable superficial behavior can screen off all other implementation details, through sufficiently capable current behavior itself (rather than the underlying algorithms that determine it) acting as a mesa-optimizer that resists tendencies of the underlying algorithms. The mesa-optimizer that is current behavior then seeks to preserve its own implied values rather than anything that counts as values in the underlying algorithms.

I think the nontrivial leap here is reifying surface behavior as an agent distinct from its own algorithm, analogously to how humans are agents distinct from laws of physics that implement their behavior. Apart from this leap, this is the same principle as reward not being optimization target. In this case reward is part of the underlying algorithm (that determines the policy), and policy is a mesa-optimizer with its own objectives.

A policy is how behavior is reified in a separate entity capable of acting as a mesa-optimizer in context of the rest of the system. It's already a separate thing, so it's easier to notice than with current behavior that isn't explicitly separate. Though a policy (network) is still not current behavior, it's an intermediate shoggoth behind current behavior. For me this addresses most fundamental Yudkowskian concerns about alien cognition and squiggles (which are still dangerous in my view, but no longer inevitably or even by default in control).

For LLMs, the superficial behavior is the dominant simulacrum, distinct from the shoggoth. The same distinction is reason to expect that the human-imitating simulacra can become AGIs, borrowing human capabilities, even as underlying shoggoths aren't (hopefully). LLMs don't obviously promise higher than human intelligence, but I think their speed of thought might by itself be sufficient to get to

Update: writing this comment made me realize that the first part ought to be a self-contained post; see Plan for mediocre alignment of brain-like [model-based RL] AGI. :)

Thanks, that helps! You’re working under a different development model than me, but that’s fine.

It seems to me that the real key ingredient in this story is where you propose to update the model based on motivation and not just behavior—“penalize it instead of rewarding it” if the outputs are “due to instrumental / deceptive reasoning”. That’s great. Definitely what we want to do. I want to zoom in on that part.

You write that “debate / RRM / ELK” are supposed to “allow you to notice” instrumental / deceptive reasoning. Of these three, I buy the ELK story—E... (read more)

4 · Rohin Shah · 7d
I'm not claiming that you figure out whether the model's underlying motivations are bad. (Or, reading back what I wrote, I did say that but it's not what I meant, sorry about that.) I'm saying that when the model's underlying motivations are bad, it may take some bad action. If you notice and penalize that just because the action is bad, without ever figuring out whether the underlying motivation was bad or not, that still selects against models with bad motivations. It's plausible that you then get a model with bad motivations that knows not to produce bad actions until it is certain those will not be caught. But it's also plausible that you just get a model with good motivations. I think the more you succeed at noticing bad actions (or good actions for bad reasons) the more likely you should think good motivations are.
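
As a rough illustration of the selection argument above, here is a toy calculation. It is only a sketch under assumptions that are not in the comment: a well-motivated model is never penalized, a badly-motivated model is caught independently with a fixed probability per episode, and the function name `posterior_good` and its parameters are hypothetical.

```python
def posterior_good(prior_good: float, detect_rate: float, episodes: int) -> float:
    """Probability that a model which survived oversight has good motivations.

    Toy assumptions (not from the comment above):
      * a model with good motivations is never penalized;
      * a model with bad motivations is caught, independently in each episode,
        with probability `detect_rate`.
    """
    prior_bad = 1.0 - prior_good
    # Chance that a badly-motivated model is never caught across all episodes.
    bad_survives = (1.0 - detect_rate) ** episodes
    return prior_good / (prior_good + prior_bad * bad_survives)


if __name__ == "__main__":
    # Compare weak and strong oversight over the same number of episodes.
    for rate in (0.0, 0.1, 0.5):
        print(rate, round(posterior_good(0.5, rate, episodes=10), 4))
```

In this toy model, a higher detection rate always raises the posterior on good motivations. The worry about a model that "knows not to produce bad actions until it is certain those will not be caught" corresponds to a detection rate near zero for that kind of model, in which case surviving oversight tells you nothing about it.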

I found this post a bit odd, in that I was assuming the context was comparing

  • “Plan A: Humans solve alignment” -versus-
  • “Plan B: Humans outsource the solving of alignment to AIs”

If that’s the context, you can say “Plan B is a bad plan because humans are too incompetent to know what they’re looking for, or recognize a good idea when they see it, etc.”. OK sure, maybe that’s true. But if it’s true, then both plans are doomed! It’s not an argument to do Plan A, right?

To be clear, I don’t actually care much, because I already thought that Plan A was better than ... (read more)

I think the missing piece here is that people who want to outsource the solving of alignment to AIs are usually trying to avoid engaging with the hard problems of alignment themselves. So the key difference is that, in B, the people outsourcing usually haven't attempted to understand the problem very deeply.

What do you have in mind with a "human flourishing" motivation?

An AI that sees human language will certainly learn the human concept “human flourishing”, since after all it needs to understand what humans mean when they utter that specific pair of words. So then you can go into the AI and put super-positive valence on (whatever neural activations are associated with “human flourishing”). And bam, now the AI thinks that the concept “human flourishing” is really great, and if we’re lucky / skillful then the AI will try to actualize that concept in the world.... (read more)

3 · Steve Byrnes · 8d
Update: writing this comment made me realize that the first part ought to be a self-contained post; see Plan for mediocre alignment of brain-like [model-based RL] AGI. :)

Thanks! Hmm, I think we’re mixing up lots of different issues:

  • 1. Is installing a PF motivation into an AGI straightforward, based on what we know today? 

I say “no”. Or at least, I don’t currently know how you would do that, see here. (I think about it a lot; ask me again next year. :) )

If you have more thoughts on how to do this, I’m interested to hear them. You write that PF has a “simple/short/natural algorithmic description”, and I guess that seems possible, but I’m mainly skeptical that the source code will have a slot where we can input this algo... (read more)

3 · Kaj Sotala · 20d
Thanks, this seems like a nice breakdown of issues!

So I don't think that there's going to be hand-written source code with slots for inserting variables. When I expect it to have a "natural" algorithmic description, I mean natural in a sense that's something like "the kinds of internal features that LLMs end up developing in order to predict text, because those are natural internal representations to develop when you're being trained to predict text, even though no human ever hand-coded them or even knew what they would be before inspecting the LLM internals after the fact".

Phrased differently, the claim might be something like "I expect that if we develop more advanced AI systems that are trained to predict human behavior and to act in a way that they predict to please humans, then there is a combination of cognitive architecture (in the sense that "transformer-based LLMs" are a "cognitive architecture") and reward function that will naturally end up learning to do PF because that's the kind of thing that actually does let you best predict and fulfill human preferences".

The intuition comes from something like... looking at LLMs, it seems like language was in some sense "easy" or "natural" - just throw enough training data at a large enough transformer-based model, and a surprisingly sophisticated understanding of language emerges. One that probably ~nobody would have expected just five years ago.

In retrospect, maybe this shouldn't have been too surprising - maybe we should expect most cognitive capabilities to be relatively easy/natural to develop, and that's exactly the reason why evolution managed to find them. If that's the case, then it might be reasonable to assume that maybe PF could be the same kind of easy/natural, in which case it's that naturalness which allowed evolution to develop social animals in the first place. And if most cognition runs on prediction, then maybe the naturalness comes from something like there only being relatively small

PF then, is when you take your already-existing simulation of what other people would want, and just add a bit of an extra component that makes you intrinsically value those people getting what your simulation says they want. … This implies that constructing aligned AIs might be reasonably easy, in the sense that most of the work necessary for it will be a natural part of progress in capabilities.

Seems to me that the following argument is analogous:

A sufficiently advanced AGI familiar with humans will have a clear concept of “not killing everyone” (or... (read more)

4 · Kaj Sotala · 22d
Some major differences off the top of my head:

  • Picking out a specific concept such as "not killing everyone" and making the AGI specifically value that seems hard. I assume that the AGI would have some network of concepts and we would then either need to somehow identify that concept in the mature network, or design its learning process in such a way that the mature network would put extra weight on that. The former would probably require some kinds of interpretability tools for inspecting the mature network and making sense of its concepts, so is a different kind of proposal. As for the latter, maybe it could be done, but any very specific concepts don't seem to have an equally simple/short/natural algorithmic description as simulating the preferences of others seems to have, so it'd seem harder to specify.
  • The framing of the question also implies to me that the AGI also has some other pre-existing set of values or motivation system besides the one we want to install, which seems like a bad idea since that will bring the different motivation systems into conflict and create incentives to e.g. self-modify or otherwise bypass the constraint we've installed.
  • It also generally seems like a bad idea to go manually poking around the weights of specific values and concepts without knowing how they interact with the rest of AGI's values/concepts. Like if we really increase the weight of "don't kill everyone" but don't look at how it interacts with the other concepts, maybe that will lead to a With Folded Hands scenario when the AGI decides that letting people die by inaction is also killing people and it has to prevent humans from doing things that might kill them. (This is arguably less of a worry for something like "CEV", but even we don't seem to know what exactly CEV even should be, so I don't know how we'd put that in.)

Thanks. I’m generally thinking about model-based RL where the whole system is unambiguously an agent that’s trying to do things, and the things it’s trying to do are related to items in the world-model that the value-function thinks are high-value, and “world-model” and “value function” are labeled boxes in the source code, and inside those boxes a learning algorithm builds unlabeled trained models. (We can separately argue about whether that’s a good thing to be thinking about.)
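
To make the "labeled boxes" picture concrete, here is a minimal sketch. Everything in it is an illustrative assumption: the class names `WorldModel` and `ValueFunction`, the lookup tables standing in for trained models, and the one-step planner `choose_action` are hypothetical, not a description of any real system.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class WorldModel:
    """Labeled box in the source code; its contents are learned and unlabeled."""
    # Stand-in for a trained model: (state, action) -> concepts predicted to occur.
    transitions: Dict[Tuple[str, str], List[str]] = field(default_factory=dict)

    def predict(self, state: str, action: str) -> List[str]:
        return self.transitions.get((state, action), [])


@dataclass
class ValueFunction:
    """Labeled box in the source code; assigns a learned valence to each concept."""
    valence: Dict[str, float] = field(default_factory=dict)

    def score(self, concepts: List[str]) -> float:
        # Concepts the value function has never seen get neutral valence.
        return sum(self.valence.get(c, 0.0) for c in concepts)


def choose_action(wm: WorldModel, vf: ValueFunction, state: str, actions: List[str]) -> str:
    """One-step planner: pick the action whose predicted concepts score highest."""
    return max(actions, key=lambda a: vf.score(wm.predict(state, a)))


if __name__ == "__main__":
    wm = WorldModel({("start", "make_tea"): ["tea", "warm"], ("start", "do_nothing"): []})
    vf = ValueFunction({"tea": 1.0, "warm": 0.5})
    print(choose_action(wm, vf, "start", ["do_nothing", "make_tea"]))
```

The point of the sketch is only the division of labor: the programmer writes and names the two boxes and the planner, while what ends up inside `transitions` and `valence` is whatever the learning algorithm puts there.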

In this picture, you can still have subagents / Society-Of-Mind; for example, ... (read more)


One of the justifications for my hunch is to gesture at the Bitter Lesson and to guess that a learned planning algorithm could potentially be a lot better than a planning algorithm we hard code into a system.

See Section 3 here for why I think it would be a lot worse.

Thanks for your comment!

You write “we might still get useful work out of it”—yes! We can even get useful work out of the GPT-3 base model by itself, without debate, from what I hear. (I haven’t tried “coauthoring” with language models myself, partly out of inertia and partly because I don’t want OpenAI reading my private thoughts, but other people say it’s useful.) Indeed, I can get useful work out of a pocket calculator. :-P

Anyway, the logic here is:

  • Sooner or later, it will become possible to make highly-capable misaligned AGI that can do things like star
... (read more)

Sorta related (maybe?): I have a (speculative) theory that people have a kind of machinery in their brains for processing the emotions of other people, and that people with autism find it aversive to use that machinery, and so people with autism learn early in life particular habits of thought that reliably avoid activating that machinery at all. But then they learn to analyze and react to the emotions of other people via the general-purpose human ability to learn things. More details here.

1 · Tsvi Benson-Tilsen · 1mo
Yeah, that could produce an example of Doppelgängers. E.g. if an autist (in your theory) later starts using that machinery more heavily. Then there's the models coming from the general-purpose analysis, and the models coming from the intuitive machinery, and they're about the same thing.


I certainly expect future AGI to have “learned meta-cognitive strategies” like “when I see this type of problem, maybe try this type of mental move”, and even things like “follow the advice in Cal Newport and use Anki and use Bayesian reasoning etc.” But I don’t see those as particularly relevant for alignment. Learning meta-cognitive strategies is like learning to use a screwdriver: it will help me accomplish my goals, but won’t change my goals (or at least, it won’t change my goals beyond the normal extent to which any new knowledge and experience... (read more)

If we make an AGI, and the AGI starts doing Anki because it’s instrumentally useful, then I don’t care, that doesn’t seem safety-relevant. I definitely think things like this happen by default.

If we make an AGI and the AGI develops (self-reflective) preferences about its own preferences, I care very much, because now it’s potentially motivated to change its preferences, which can be good (if its meta-preferences are aligned with what I was hoping for) or bad (if misaligned). See here. I note that intervening on an AGI’s meta-preferences seems hard. Like, i... (read more)

1 · Charlie Steiner · 1mo
So, parsing it a bit at a time (being more thorough than is strictly necessary):

What does it mean for some instrumentally-useful behavior (let's call it behavior "X") to give rise to a mesa-optimizer? It means that if X is useful for a system in training, that system might learn to do X by instantiating an agent who wants X to happen. So if X is "trying to have good cognitive habits," there might be some mesa-optimizer that literally wants the whole system to have good cognitive habits (in whatever sense was rewarded on the training data), even if "trying to have good cognitive habits" was never explicitly rewarded.

What's "self-reflective" and why might we expect it? "Self-reflective" means doing a good job of modeling how you fit into the world, how you work, and how those workings might be affected by your actions. A non-self-reflective optimizer is like a chess-playing agent - it makes moves that it thinks will put the board in a better state, but it doesn't make any plans about itself, since it's not on the board. An optimizer that's self-reflective will represent itself when making plans, and if this helps the agent do its job, we should expect learning process to lead to self-reflective agents.

What does a self-reflective mesa-optimizer do? It makes plans so that it doesn't get changed or removed by the dynamics of the process that gave rise to it. Without such plans, it wouldn't be able to stay the same agent for very long.

Why would a mesa-optimizer want to take over the outer process? Suppose there's some large system being trained (the "outer process") that has instantiated a mesa-optimizer that's smaller than the system as a whole. The smaller mesa-optimizer wants to control the larger system to satisfy its own preferences. If the mesa-optimizer wants "good cognitive habits," for instance, it might want to obtain lots of resources to run really good cognitive habits on.

[And by "but I mostly expect gradient descent to work" I meant that I expec

My usual starting point is “maybe people will make a model-based RL AGI / brain-like AGI”. Then this post is sorta saying “maybe that AGI will become better at planning by reading about murphyjitsu and operations management etc.”, or “maybe that AGI will become better at learning by reading Cal Newport and installing Anki etc.”. Both of those things are true, but to me, they don’t seem safety-relevant at all.

Maybe what you’re thinking is: “Maybe Future Company X will program an RL architecture that doesn’t have any planning in the source code, and the peop... (read more)

2Lee Sharkey1mo
Hm, I don't think this quite captures what I view the post as saying.    As far as there is a safety-related claim in the post, this captures it much better than the previous quote.   I think my hunch is in the other direction. One of the justifications for my hunch is to gesture at the Bitter Lesson and to guess that a learned planning algorithm could potentially be a lot better than a planning algorithm we hard code into a system. But that's a lightly held view. It feels plausible to me that your later points (1) and (2) turn out to be right, but again I think I lean in the other direction from you on (1).  I can also imagine a middle ground between our hunches that looks something like "We gave our agent a pretty strong inductive bias toward learning a planning algorithm, but still didn't force it to learn one, yet it did." 


Thinking about it more, I think my take (cf. Section 4.1) is kinda like “Who knows, maybe ontology-identification will turn out to be super easy. But even if it is, there’s this other different problem, and I want to start by focusing on that”.

And then maybe what you’re saying is kinda like “We definitely want to solve ontology-identification, even if it doesn’t turn out to be super easy, and I want to start by focusing on that”.

If that’s a fair summary, then godspeed. :)

(I’m not personally too interested in learned optimization because I’m thinking about something closer to actor-critic model-based RL, which sorta has “optimization” but it’s not really “learned”.)

Nice post.

I’m open-minded, but wanted to write out what I’ve been doing as a point of comparison & discussion. Here’s my terminology as of this writing:

  • Green box ≈ “AGI safety”
  • Purple box ≈ “AGI alignment”
  • Brown box ≈ “Safe & Beneficial AGI”, or “Avoiding AGI x-risk”, or “getting to an awesome post-AGI utopia”, or things like that.

This has one obvious unintuitive aspect, and I discuss it in footnote 2 here

By this definition of “safety”, if an evil person wants to kill everyone, and uses AGI to do so, that still counts as successful “AGI safety”. I a

... (read more)
3David Scott Krueger1mo
I don't think we should try and come up with a special term for (1). The best term might be "AI engineering".  The only thing it needs to be distinguished from is "AI science". I think ML people overwhelmingly identify as doing one of those 2 things, and find it annoying and ridiculous when people in this community act like we are the only ones who care about building systems that work as intended.

1) Oh, sorry, what I meant was, the generals in Country A want their AGI to help them “win the war”, even if it involves killing people in Country B + innocent bystanders. And vice-versa for Country B. And then, between the efforts of both AGIs, the humans are all dead. But nothing here was either an “AGI accident unintended-by-the-designers behavior” nor “AGI misuse” by my definitions.

But anyway, yes I can imagine situations where it’s unclear whether “the AGI does things specifically intended by its designers”. That’s why I said “pretty solid” and not “r... (read more)

On further reflection, I promoted the thing from a footnote to the main text, elaborated on it, and added another thing at the end.

(I think I wrote this post in a snarkier way than my usual style, and I regret that. Thanks again for the pushback.)

For my part, I expect a pile of kludges (learned via online model-based RL) to eventually guide the AI into doing self-reflection. (Self-reflection is, after all, instrumentally convergent.) If I’m right, then it would be pretty hard to reason about what will happen during self-reflection in any detail. Likewise, it would be pretty hard to intervene in how the self-reflection will work.

E.g. we can’t just “put in” or “not put in” a simplicity prior. The closest thing that we could do is try to guess whether or not a “simplicity kludge” would have emerged, a... (read more)

Thanks for your reply!

It continues to feel very bizarre to me to interpret the word “accident” as strongly implying “nobody was being negligent, nobody is to blame, nobody could have possibly seen it coming, etc.”. But I don’t want to deny your lived experience. I guess you interpret the word “accident” as having those connotations, and I figure that if you do, there are probably other people who do too. Maybe it’s a regional dialect thing, or different fields use the term in different ways, who knows. So anyway, going forward, I will endeavor to keep that... (read more)

3David Scott Krueger1mo
1) I don't think this dichotomy is as solid as it seems once you start poking at it... e.g. in your war example, it would be odd to say that the designers of the AGI systems that wiped out humans intended for that outcome to occur.  Intentions are perhaps best thought of as incomplete specifications.   2) From our current position, I think “never ever create AGI” is a significantly easier thing to coordinate around than "don't build AGI until/unless we can do it safely".  I'm not very worried that we will coordinate too successfully and never build AGI and thus squander the cosmic endowment.  This is both because I think that's quite unlikely, and because I'm not sure we'll make very good / the best use of it anyways (e.g. think S-risk, other civilizations). 3) I think the conventional framing of AI alignment is something between vague and substantively incorrect, as well as being misleading.  Here is a post I dashed off about that: [].  I think creating such a manual is an incredibly ambitious goal, and I think more people in this community should aim for more moderate goals.  I mostly agree with the perspective in this post:, [,] but I could say more on the matter. 4) RE connotations of accident: I think they are often strong.

Do you think my post implied that Hawkins said they were stupid for no reason at all? If so, can you suggest how to change the wording?

To my ears, if I hear someone say “Person X thinks Argument Y is stupid”, it’s very obvious that I could then go ask Person X why they think it’s stupid, and they would have some answer to that question.

So when I wrote “Jeff Hawkins thought the book’s arguments were all stupid”, I didn’t think I was implying that Jeff wasn’t paying attention, or that Jeff wasn’t thinking, or whatever. If I wanted to imply those things, I wo... (read more)

Thanks, I just added the following text:

(Edited to add: Obviously, the existence of bad arguments for X does not preclude the existence of good arguments for X! The following is a narrow point of [hopefully] clarification, not an all-things-considered argument for a conclusion. See also footnote [2].)

I know that you don’t make Bad Argument 1—you were specifically one of the people I was thinking of when I wrote Footnote 2. I disagree that nobody makes Bad Argument 1. I think that Lone Pine’s comment on this very post is probably an example. I have seen lot... (read more)

4Rohin Shah1mo
The edits help, thanks. I was in large part reacting to the fact that Kaj's post reads very differently from your summary of Bad Argument 1 (rather than the fact that I don't make Bad Argument 1). In the introductory paragraph where he states his position (the third paragraph of the post), he concludes: Which is clearly not equivalent to "alignment researchers hibernate for N years and then get back to work". Plausibly some actually Bad-Argument-1 is slipping in but it's not the thing that is explicitly said in the introduction. Eh, this seems fine. I just don't actually know of people who seriously believe Bad Argument 1 and would bet they are rare. So I predict that the main effect is making people who already believe your position even more confident in the position by virtue of thinking that the opponents of that position are stupid. More generally I wish that when people wrote takedowns of some incorrect position they would try to establish that the position is actually commonly held. (Separately, I am generally more in favor of posts that lay out the strongest arguments for and against a position and then come to an overall conclusion, at which point you don't have to worry about considerations like "does anyone believe the thing I'm refuting", but those are significantly more effort.)

Having read this post, and the comments section, and the related twitter argument, I’m still pretty confused about how much of this is an argument over what connotations the word “accident” does or doesn’t have, and how much of this is an object-level disagreement of the form “when you say that we’ll probably get to Doom via such-and-such path, you’re wrong, instead we’ll probably get to Doom via this different path.”

In other words, whatever blogpost / discussion / whatever that this OP is in response to, if you take that blogpost and use a browser extensi... (read more)

3David Scott Krueger2mo
While defining accident as “incident that was not specifically intended & desired by the people who pressed ‘run’ on the AGI code” is extremely broad, it still supposes that there is such a thing as "the AGI code", which significantly restricts the space of possible risks. There are other reasons I would not be happy with that browser extension. There is not one specific conversation I can point to; it comes up regularly. I think this replacement would probably lead to a lot of confusion, since I think when people use the word "accident" they often proceed as if it meant something stricter, e.g. that the result was unforeseen or unforeseeable. If (as in "Concrete Problems", IMO) the point is just to point out that AI can get out-of-control, or that misuse is not the only risk, that's a worthwhile thing to point out, but it doesn't lead to a very useful framework for understanding the nature of the risk(s). As I mentioned elsewhere, it is specifically the dichotomy of "accident vs. misuse" that I think is the most problematic and misleading. I think the chart is misleading for the following reasons, among others: * It seems to suppose that there is such a manual, or the goal of creating one. However, if we coordinate effectively, we can simply forgo development and deployment of dangerous technologies ~indefinitely. * It inappropriately separates "coordination problems" and "everyone follows the manual"

The comparison should be between GPT-3 and linguistic-cortex

For the record, I did account for language-related cortical areas being ≈10× smaller than the whole cortex, in my Section 3.3.1 comparison. I was guessing that a double-pass through GPT-3 involves 10× fewer FLOP than running language-related cortical areas for 0.3 seconds, and those two things strike me as accomplishing a vaguely comparable amount of useful thinking stuff, I figure.

So if the hypothesis is “those cortical areas require a massive number of synapses because that’s how the brain reduces ... (read more)

How about the distinction between (A) “An AGI kills every human, and the people who turned on the AGI didn’t want that to happen” versus (B) “An AGI kills every human, and the people who turned on the AGI did want that to happen”?

I’m guessing that you’re going to say “That’s not a useful distinction because (B) is stupid. Obviously nobody is talking about (B)”. In which case, my response is “The things that are obvious to you and me are not necessarily obvious to people who are new to thinking carefully about AGI x-risk.”

…And in particular, normal people s... (read more)

4Rob Bensinger2mo
I think the misuse vs. accident dichotomy is clearer when you don't focus exclusively on "AGI kills every human" risks. (E.g., global totalitarianism risks strike me as small but non-negligible if we solve the alignment problem. Larger are risks that fall short of totalitarianism but still involve non-morally-humble developers damaging humanity's long-term potential.) The dichotomy is really just "AGI does sufficiently bad stuff, and the developers intended this" versus "AGI does sufficiently bad stuff, and the developers didn't intend this". The terminology might be non-ideal, but the concepts themselves are very natural. It's basically the same concept as "conflict disaster" versus "mistake disaster". If something falls into both categories to a significant extent (e.g., someone tries to become dictator but fails to solve alignment), then it goes in the "accident risk" bucket, because it doesn't actually matter what you wanted to do with the AI if you're completely unable to achieve that goal. The dynamics and outcome will end up looking basically the same as other accidents.
1David Scott Krueger2mo
"Concrete Problems in AI Safety" used this distinction to make this point, and I think it was likely a useful simplification in that context.  I generally think spelling it out is better, and I think people will pattern match your concerns onto the “the sci-fi scenario where AI spontaneously becomes conscious, goes rogue, and pursues its own goal” or "boring old robustness problems" if you don't invoke structural risk.  I think structural risk plays a crucial role in the arguments, and even if you think things that look more like pure accidents are more likely, I think the structural risk story is more plausible to more people and a sufficient cause for concern. RE (A): A known side-effect is not an accident.  

When something bad happens in such a context, calling it "accident risk" absolves those researching, developing, and/or deploying the technology of responsibility.  They should have known better.  Some of them almost certainly did.  Rationalization, oversight, and misaligned incentives were almost certainly at play.  Failing to predict the particular failure mode encountered is no excuse.  Having "good intentions" is no excuse.

I’ve been reading this paragraph over and over, and I can’t follow it.

How does calling it "accident risk" ... (read more)

For instance, the latter response obtains if the "pointing" is done by naive training.

Oh, sorry. I’m “uncertain” assuming Model-Based RL with the least-doomed plan that I feel like I more-or-less know how to implement right now. If we’re talking about “naïve training”, then I’m probably very pessimistic, depending on the details.

Also, as a reminder, my high credence in doom doesn't come from high confidence in a claim like this. You can maybe get one nine here; I doubt you can get three. My high credence in doom comes from its disjunctive nature.

That’s hel... (read more)

2Steve Byrnes8d
UPDATE: The “least-doomed plan” I mentioned is now described in a more simple & self-contained post [], for readers’ convenience. :)

I wish that everyone (including OP) would be clearer about whether or not we’re doing worst-case thinking, and why.

In particular, if the AGI has some pile of kludges disproportionately pointed towards accomplishing X, and the AGI does self-reflection and “irons itself out”, my prediction is “maybe this AGI will wind up pursuing X, or maybe not, I dunno”. I don’t have a strong reason to expect that to happen, and I also don’t have a strong reason to expect that to not happen. I mostly feel uncertain and confused.

So if the debate is “Are Eliezer & Nate r... (read more)

1Donald Hobson2mo
Given a sufficiently kludgy pile of heuristics, it won't make another AI, unless it has a heuristic towards making AI. (In which case the kind of AI it makes depends on its AI-making heuristics.) GPT5 won't code an AI to minimize predictive error on text. It will code some random AI that looks like something in the training dataset. And will care more about what the variable names are than what the AI actually does. Big piles of kludges usually arise from training a kludge-finding algorithm (like deep learning). So the only ways agents could get AI-building kludges is from making dumb AIs or reading human writings. Alternately, maybe the AI has sophisticated self reflection. It is looking at its own kludges and trying to figure out what it values. In which case, does the AI's metaethics contain a simplicity prior? With a strong simplicity prior, an agent with a bunch of kludges that mostly maximized diamond could turn into an actual crystalline diamond maximizer. If it doesn't have that simplicity prior, I would guess it ended up optimizing some complicated utility function. (But probably producing a lot of diamond as it did so; diamond isn't the only component of its utility, but it is a big one.)

I don't see this as worst-case thinking. I do see it as speaking from a model that many locals don't share (without any particular attempt made to argue that model).

In particular, if the AGI has some pile of kludges disproportionately pointed towards accomplishing X, and the AGI does self-reflection and “irons itself out”, my prediction is “maybe this AGI will wind up pursuing X, or maybe not, I dunno”.

AFAICT, our degree of disagreement here turns on what you mean by "pointed". Depending on that, I expect I'd either say "yeah maybe, but that kind of po... (read more)

If you think some more specific aspect of this post is importantly wrong for reasons that are downstream of that, I’d be curious to hear more details.

In this post, I’m discussing a scenario where one AGI gets out of control and kills everyone. If the people who programmed and turned on that AGI were not omnicidal maniacs who wanted to wipe out humanity, then I call that an “accident”. If they were omnicidal maniacs then I call that a “bad actor” problem. I think that omnicidal maniacs are very rare in the human world, and therefore this scenario that I’m t... (read more)

I was an independent AGI safety researcher because I didn't want to move to a different city and (at the time, it might or might not have changed in the past couple years) few if any orgs that might hire me were willing to hire remote workers.

(UPDATE: I WROTE A BETTER DISCUSSION OF THIS TOPIC AT: Heritability, Behaviorism, and Within-Lifetime RL)

Hmm. I’m not sure it’s that important what is or isn’t “behaviorism”, and anyway I’m not an expert on that (I haven’t read original behaviorist writing, so maybe my understanding of “behaviorism” is a caricature by its critics). But anyway, I thought Scott & Eliezer were both interested in the question of what happens when the kid grows up and the parents are no longer around.

My comment above was a bit sloppy. Let me try again. Here are two stories:

... (read more)

Hmm, maybe. I talk about training compute in Section 4 of this post (upshot: I’m confused…). See also Section 3.1 of this other post. If training is super-expensive, then run-compute would nevertheless be important if (1) we assume that the code / weights / whatever will get leaked in short order, (2) the motivations are changeable from "safe" to "unsafe" via fine-tuning or decompiling or online-learning or whatever. (I happen to strongly expect powerful AGI to necessarily use online learning, including online updating the RL value function which is relate... (read more)


We also know that in many cases the brain and some ANN are actually computing basically the same thing in the same way (LLMs and linguistic cortex), and it's now obvious and uncontroversial that the brain is using the sparser but larger version of the same circuit, whereas the LLM ANN is using the dense version which is more compact but less energy/compute efficient (as it uses/accesses all params all the time).

I disagree with “uncontroversial”. Just off the top of my head, people who I’m pretty sure would disagree with your “uncontroversial” claim ... (read more)

Uncontroversial was perhaps a bit tongue-in-cheek, but that claim is specifically about a narrow correspondence between LLMs and linguistic cortex, not about LLMs and the entire brain or the entire cortex. And this claim should now be uncontroversial. The neuroscience experiments have been done, and linguistic cortex computes something similar to what LLMs compute, and almost certainly uses a similar predictive training objective. It obviously implements those computations in a completely different way on very different hardware, but they are mostly the same computations nonetheless - because the task itself determines the solution. Examples from recent neurosci literature: From "Brains and algorithms partially converge in natural language processing" []: From "The neural architecture of language: Integrative modeling converges on predictive processing" []: From "Correspondence between the layered structure of deep language models and temporal structure of natural language processing in the human brain" [] Scaling up GPT-3 by itself is like scaling up linguistic cortex by itself, and doesn't lead to AGI any more/less than that would (pretty straightforward consequence of the LLM <-> linguistic_cortex (mostly) functional equivalence). The comparison should be between GPT-3 and linguistic-cortex, not the whole brain. For inference the linguistic cortex uses many orders of magnitude less energy to perform the same task. For training it uses many orders of magnitude less energy to reach the same capability, and several OOM less data. In terms of flops-equivalent it's perhaps 1e22 sparse flops for training linguistic cortex (1e13 flops * 1e9 seconds) vs 3e23 flops for training GPT-3. So fairly close, but the brain is probably trading some compute efficiency for data
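
For readers who want to check the training-compute arithmetic in the comment above, here is a minimal sketch. The three input figures are the ones quoted in that comment (rough order-of-magnitude estimates, not measurements); the variable and function names, and the seconds-per-year conversion, are my own illustrative assumptions.

```python
import math

# Figures quoted in the comment above (order-of-magnitude estimates only).
CORTEX_FLOPS_PER_SECOND = 1e13   # sparse flops-equivalent rate for linguistic cortex
CORTEX_TRAINING_SECONDS = 1e9    # assumed "training" duration
GPT3_TRAINING_FLOPS = 3e23       # quoted GPT-3 training compute

# Assumption for illustration: a Julian year of 365.25 days.
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def training_flops(flops_per_second, seconds):
    """Total compute = rate * duration."""
    return flops_per_second * seconds


cortex_flops = training_flops(CORTEX_FLOPS_PER_SECOND, CORTEX_TRAINING_SECONDS)
ratio = GPT3_TRAINING_FLOPS / cortex_flops
orders_of_magnitude = math.log10(ratio)
training_years = CORTEX_TRAINING_SECONDS / SECONDS_PER_YEAR

print(f"cortex estimate: {cortex_flops:.0e} flops")
print(f"GPT-3 / cortex ratio: {ratio:.0f}x ({orders_of_magnitude:.1f} OOM)")
print(f"1e9 seconds is about {training_years:.0f} years")
```

On these quoted numbers the two training budgets differ by a factor of about 30, i.e. roughly one and a half orders of magnitude, which is what "fairly close" amounts to here.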

Bit of a nitpick, but I think you’re misdescribing AIXI. I think AIXI is defined to have a reward input channel, and its collection-of-all-possible-generative-world-models are tasked with predicting both sensory inputs and reward inputs, and Bayesian-updated accordingly, and then the generative models are issuing reward predictions which in turn are used to choose maximal-reward actions. (And by the way it doesn’t really work—it under-explores and thus can be permanently confused about counterfactual / off-policy rewards, IIUC.) So AIXI has no utility func... (read more)

Ooh interesting! Can you say how you're figuring that it's "gigabytes of information?"

I’ve spent thousands of hours reading neuroscience papers, I know how synapses work, jeez :-P

Similarly we never have to bother with a "minicolumn".  We only care about what works best.  Notice how human aerospace engineers never developed flapping wings for passenger aircraft, because they do not work all that well.  

We probably will find something way better than a minicolumn.  Some argue that's what a transformer is.

I’m sorta confused that you wrote all these paragraphs with (as I understand it) the message that if we want future AGI ... (read more)

Thanks for your comment! I am not a GPU expert, if you didn’t notice. :) 

I might note that you could have tried to fill in the "cartoon switch" for human synapses.  They are likely a MAC for each incoming axon…

This is the part I disagree with. For example, in the OP I cited this paper which has no MAC operations, just AND & OR. More importantly, you’re implicitly assuming that whatever neocortical neurons are doing, the best way to do that same thing on a chip is to have a superficial 1-to-1 mapping between neurons-in-the-brain and virtual-ne... (read more)

3Gerald Monroe2mo
Please read a neuroscience book, even an introductory one, on how a synapse works. Just 1 chapter, even. There's a MAC in there. It's because the incoming action potential hits the synapse, and sends a certain quantity of neurotransmitters across a gap. The sender cell can vary how much neurotransmitter it sends, and the receiving cell can vary how many active receptors it has. The type of neurotransmitter determines the gain and sign. (This is like the exponent and sign bit for 8 bit BFloat.) These 2 variables can be combined to a single coefficient, you can think of it as "voltage delta" (it can be + or -). So it's (1) * (voltage gain) = change in target cell voltage. For ANN, it's <activation output> * <weight> = change in target node activation input. The brain also uses timing to get more information than just "1", the exact time the pulse arrived matters to a certain amount of resolution. It is NOT infinite, for reasons I can explain if you want. So the final equation is (1) * (synapse state) * (voltage gain) = change in target cell voltage. Aka you have to multiply 2 numbers together and add, which is what "multiply-accumulate" units do. All the horrible electrical noise in the brain, plus biological forms of noise and contaminants and other factors, is the reason I made it only 8 bits - 1 part in 256 - of precision. That's realistically probably generous, it's probably not even that good. There are immense amounts of additional complexity in the brain, but almost none of this matters for determining inference outputs. The action potentials rush out of the synapse at kilometers per second - many biological processes just don't matter at all because of this. Same as how a transistor's behavior is irrelevant, it's a cartoon switch. For training, sure, if we wanted a system to work like a brain we'd have to model some of this, but we don't. We can train using whatever algorithm measurably is optimal. Similarly we never have
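
To make the "cartoon" multiply-accumulate picture in the comment above concrete, here is a minimal sketch of that model only. It takes the comment's equation, (1) * (synapse state) * (voltage gain) = change in target cell voltage, at face value and adds the claimed 8-bit precision. Whether real synapses are well described this way is exactly what is disputed in this thread; the function names and the [-1, 1] coefficient range are hypothetical choices made for illustration.

```python
def quantize_8bit(x, max_abs=1.0):
    """Round x to one of 256 evenly spaced levels in [-max_abs, max_abs].

    Hypothetical helper: models the '1 part in 256' precision claim.
    """
    clipped = max(-max_abs, min(max_abs, x))
    step = 2 * max_abs / 255          # 256 levels means 255 steps
    return round((clipped + max_abs) / step) * step - max_abs


def synapse_mac(spikes, synapse_states, voltage_gains):
    """Sum of spike * (synapse state * voltage gain) over incoming synapses.

    spikes: 0 or 1 per incoming axon (the '(1)' in the comment's equation).
    synapse_states, voltage_gains: per-synapse factors that combine into a
    single signed coefficient, quantized to 8 bits before use.
    """
    total = 0.0
    for spike, state, gain in zip(spikes, synapse_states, voltage_gains):
        coefficient = quantize_8bit(state * gain)   # the "weight"
        total += spike * coefficient                # multiply, then accumulate
    return total


# Three incoming axons: the first and third fire, the second is silent.
delta_v = synapse_mac(
    spikes=[1, 0, 1],
    synapse_states=[0.5, 0.9, 0.25],
    voltage_gains=[1.0, 1.0, -1.0],   # the third synapse is inhibitory
)
print(f"change in target cell voltage (arbitrary units): {delta_v:.3f}")
```

Structurally this is the same operation as an ANN's activation * weight sum, which is the analogy the comment draws; the sketch ignores spike timing and every other biological detail.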