Recent Discussion

Let’s say I know how to build / train a human-level (more specifically, John von Neumann level) AGI. And let’s say that we (and/or the AGI itself) have already spent a few years[1] on making the algorithm work better and more efficiently.

Question: How much compute will it take to run this AGI?

(NB: I said "running" an AGI, not training / programming an AGI. I'll talk a bit about “training compute” at the very end.)

Answer: I don’t know. But that doesn’t seem to be stopping me from writing this post. ¯\_(ツ)_/¯ My current feeling—which I can easily imagine changing after discussion (which is a major reason I'm writing this!)—seems to be:

  • 75%: One current (Jan 2023) high-end retail gaming PC (with an Nvidia GeForce RTX 4090 GPU) will be enough (or
3Matthew Barnett8h
Isn't the relevant fact whether we could train an AGI with modest computational resources, not whether we could run one? If training runs are curtailed by regulation, then presumably the main effect is that AGI will be delayed until software and hardware progress permits the covert training of an AGI with modest computational resources, which could be a long time depending on how hard it is to evade the regulation.

Hmm, maybe. I talk about training compute in Section 4 of this post (upshot: I’m confused…). See also Section 3.1 of this other post. If training is super-expensive, then run-compute would nevertheless be important if (1) we assume that the code / weights / whatever will get leaked in short order, (2) the motivations are changeable from "safe" to "unsafe" via fine-tuning or decompiling or online-learning or whatever. (I happen to strongly expect powerful AGI to necessarily use online learning, including online updating the RL value function which is relate...

I agree that 1e14 synaptic spikes/second is the better median estimate, but those are highly sparse ops. So when you say: You are missing some foundational differences in how von Neumann architecture machines (GPUs) run neural circuits vs how neuromorphic hardware (like the brain) runs neural circuits. The 4090 can hit around 1e14 - even up to 1e15 - flops/s, but only for dense matrix multiplication. The flops required to run a brain model using that dense matrix hardware are more like 1e17 flops/s, not 1e14 flops/s. The 1e14 synapses are at least 10x locally sparse in the cortex, so dense emulation requires 1e15 synapses (mostly zeroes) running at 100 Hz. The cerebellum is actually even more expensive to simulate, because of the more extreme connection sparsity there. But that isn't the only performance issue. The GPU only runs matrix-matrix multiplication, not the more general vector-matrix multiplication. So in that sense the dense flop perf is useless, and the perf would instead be RAM bandwidth limited and require 100 4090s to run a single 1e14 synapse model - as it requires about 1 byte of bandwidth per flop - so 1e14 bytes/s vs the 4090's 1e12 bytes/s. Your reply seems to be "but the brain isn't storing 1e14 bytes of information", but as other comments point out that has little to do with the neural circuit size. The true fundamental information capacity of the brain is probably much smaller than 1e14 bytes, but that has nothing to do with the size of an actually *efficient* circuit, because efficient circuits (efficient for runtime compute, energy etc) are never also efficient in terms of information compression. This is a general computational principle, with many specific examples: compressed neural frequency encodings of 3D scenes (NeRFs) which access/use all network parameters to decode a single point O(N) are enormously less computationally efficient (runtime throughput, latency, etc) than maximally sparse representations (using trees, hashtables etc) whi
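For anyone who wants to check the arithmetic in the comment above, here is a minimal back-of-envelope sketch. Every number is the commenter's rough estimate as quoted (1e14 synapses, 10x local sparsity, 100 Hz, about 1 byte of bandwidth per op, roughly 1e12 bytes/s for one RTX 4090); the constant and function names are made up for illustration.

```python
# Back-of-envelope check of the figures quoted in the comment above.
# Every number is the commenter's rough estimate, not a measurement.

SYNAPSES = 1e14        # real (sparse) synapses
LOCAL_SPARSITY = 10    # dense emulation pads with roughly 10x zeros
RATE_HZ = 100          # assumed update rate
BYTES_PER_OP = 1       # about 1 byte of memory traffic per op
GPU_BANDWIDTH = 1e12   # bytes/s, rough RTX 4090 figure from the comment


def dense_flops_per_second(synapses=SYNAPSES, sparsity=LOCAL_SPARSITY,
                           rate_hz=RATE_HZ):
    """FLOP/s needed to emulate the circuit as dense (mostly zero) matrices."""
    return synapses * sparsity * rate_hz


def gpus_needed_bandwidth_bound(ops_per_second=1e14,
                                bytes_per_op=BYTES_PER_OP,
                                gpu_bandwidth=GPU_BANDWIDTH):
    """GPU count if memory bandwidth, not FLOP/s, is the bottleneck."""
    return ops_per_second * bytes_per_op / gpu_bandwidth


print(f"dense emulation: {dense_flops_per_second():.0e} FLOP/s")
print(f"bandwidth-bound GPU count: {gpus_needed_bandwidth_bound():.0f}")
```

This only reproduces the quoted estimate; it says nothing about whether the inputs themselves are right, which is the actual disagreement in this thread.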
3Steve Byrnes12h
Thanks! I disagree with “uncontroversial”. Just off the top of my head, people who I’m pretty sure would disagree with your “uncontroversial” claim include Randy O’Reilly, Josh Tenenbaum, Jeff Hawkins, Dileep George, these people [] , maybe some of the Friston / FEP people, probably most of the “evolved modularity” people like Steven Pinker, and I think Kurzweil (he thought the cortex was built around hierarchical hidden Markov models, last I heard, which I don’t think are equivalent to ANNs?). And me! You’re welcome to argue that you’re right and we’re wrong (and most of that list are certainly wrong, insofar as they’re also disagreeing with each other!), but it’s not “uncontroversial”, right? In the OP (Section 3.3.1) I talk about why I don’t buy that—I don’t think it’s the case that the brain gets dramatically more “bang for its buck” / “thinking per FLOP” than GPT-3. In fact, it seems to me to be the other way around. Then “my model of you” would reply that GPT-3 is much smaller / simpler than the brain, and that this difference is the very important secret sauce of human intelligence, and the “thinking per FLOP” comparison should not be brain-vs-GPT-3 but brain-vs-super-scaled-up-GPT-N, and in that case the brain would crush it. And I would disagree about the scale being the secret sauce. But we might not be able to resolve that—guess we’ll see what happens! See also footnote 16 and surrounding discussion.
This seems to me to be essentially assuming the conclusion. The assumption here is that a 2022 LLM already stores all the information necessary for human-level language ability and that no capacity is needed beyond that. But "how much capacity is required to match human-level ability" is the hardest part of the question. (The "no capacity is needed beyond that" part is tricky too. I take AI_WAIFU's core point to be that having excess capacity is helpful for algorithmic reasons, even though it's beyond what's strictly necessary to store the information if you were to compress it. But those algorithmic reasons, or similar ones, might apply to AI as well.) I might as well link my own attempt at this estimate. [] It's not estimating the same thing (since I'm estimating capacity and you're estimating stored information), so the numbers aren't necessarily in disagreement. My intuition though is that capacity is quite important algorithmically, so it's the more relevant number. (Edit: Among the sources of that intuition is Neural Tangent Kernel theory, which studies a particular infinite-capacity limit.)
3Steve Byrnes4d
One of AI_WAIFU’s points was that the brain has some redundancy because synapses randomly fail to fire and neurons randomly die etc. That part wouldn’t be relevant to running the same algorithms on chips, presumably. Then the other thing they said was that over-parameterization helps with data efficiency. I presume that there’s some background theory behind that claim which I’m not immediately familiar with. But I mean, is it really plausible that the brain is overparameterizing by 3+ orders of magnitude? Seems pretty implausible to me, although I’m open to being convinced. Also, Neural Tangent Kernel is an infinite-capacity limit, but people can do those calculations [] without using an infinitely large memory, right? People have found a way to reformulate the algorithm such that it involves doing different operations on a different representation which does not require ∞ memory. By the same token, if we’re talking about some network which is so overparametrized that it can be compressed by 99.9%, then I’m strongly inclined to guess that there’s some way to do the same calculations and updates directly on the compressed representation.
NTK training requires training time that scales quadratically with the number of training examples, so it's not usable for large training datasets (nor with data augmentation, since that simulates a larger dataset). (I'm not an NTK expert, but, from what I understand, this quadratic growth is not easy to get rid of.)
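To make the scaling concern concrete: kernel methods in general work with a Gram matrix that holds one entry per pair of training examples, and as I understand the comment above, NTK training inherits that. The sketch below only counts those entries. It uses a plain RBF kernel as a stand-in (an assumption, not an actual NTK), and the function names are hypothetical.

```python
import math


def rbf_kernel(x, y, gamma=1.0):
    """Stand-in kernel (RBF); a real NTK would be derived from a network."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def gram_matrix(points):
    """One kernel evaluation per pair of training examples: n * n entries."""
    return [[rbf_kernel(p, q) for q in points] for p in points]


def gram_entries(n_examples):
    """Number of Gram-matrix entries for n training examples."""
    return n_examples * n_examples


points = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)]
K = gram_matrix(points)
print(len(K), "x", len(K[0]), "Gram matrix")

for n in (1_000, 10_000, 100_000):
    # 10x more data means 100x more kernel entries to compute and store.
    print(n, "examples ->", gram_entries(n), "entries")
```

Data augmentation makes this worse for the same reason: each augmented copy counts as another training example, so it adds a full row and column.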
4Steve Byrnes4d
If you think human brains are storing hundreds or thousands of GB or more of information about (themselves / the world / something), do you have any thoughts on what that information is? Like, can you give (stylized) examples? (See also my footnote 13.) Also, see my footnote 16 and surrounding discussion; maybe that’s a crux?
That's an interesting question. I don't have an opinion about how much information is stored. Having a lot of capacity appears to be important, but whether that's because it's necessary to store information or for some other reason, I don't know. It got me thinking, though: the purpose of our brain is to guide our behavior, not to remember our training data. (Whether we can remember our training data seems unclear. Apparently the existence of photographic memory is disputed, but there are people with extraordinarily good memories, even if not photographic.) It could be that the preprocessing necessary to guide our future behavior unavoidably increases the amount of stored data by a large factor. (There are all sorts of examples of this sort of design pattern in classic computer science algorithms, so it wouldn't be particularly surprising.) If that's the case, I have no idea how to measure how much of it there is.
4Gunnar Zarncke2d
Just a data point that supports hold_my_fish's argument: Savant Kim Peek likely did memorize gigabytes of information and could access them quite reliably: []
2Steve Byrnes2d
Ooh interesting! Can you say how you're figuring that it's "gigabytes of information"?
5Steve Byrnes3d
You say “Having a lot of capacity appears to be important” but that’s “essentially assuming the conclusion”, right? hehe :) You claim [] that there’s a lot of capacity, but I say we don’t really know that. As a stupid example, if my computer’s SRAM has N cells, but it uses an error-correcting code by redundantly storing each bit in three different cells, then its “capacity” is ⅓ the number of cells. (And in 6T SRAM, the number of cells is in turn ⅙ the number of transistors, etc.) Anyway, all things considered right now, the most plausible-to-me theory is that counting synapses gives a 2-3 OOM overestimate of capacity. I don’t see this as particularly implausible. For one thing, as I wrote in the OP, the synapse is not just an information-storage-unit, it’s also a thing-that-does-calculations. If one bit of stored information (e.g. information about how the world works) needs to be involved in 1000 different calculations, it seems plausible that it would need to be associated with 1000 synapses. For another thing, here’s [] a model where one functional “connection” requires a group of 10 nearby synapses onto the same dendrite. That’s 1 OOM right there! I think there’s another OOM or two lurking in the fact that each cortical minicolumn is 100 neurons and each cortical column is 100 minicolumns, but there’s some sense in which minicolumns (and to a lesser degree, columns) are a single functional unit. So, without getting into details, which I’m hazy on anyway, I wouldn’t be surprised to learn that “one connection” involved not only 10 nearby synapses on one dendrite, but a similar group of 10 synapses onto a neuron within each of 10 neighboring minicolumns, and those 10 minicolumns are working together to implement a certain kind of computation, which by the way you could
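The SRAM example and the "stacked orders of magnitude" argument above can be written out as a small sketch. The transistor count and the specific minicolumn factors are illustrative assumptions, not figures from the comment; only the ⅓ and ⅙ ratios and the 10-synapses-per-connection factor come from the text.

```python
import math


def sram_capacity_bits(transistors, transistors_per_cell=6, copies_per_bit=3):
    """Usable bits when each bit is stored redundantly in several 6T cells."""
    cells = transistors // transistors_per_cell
    return cells // copies_per_bit


def overestimate_ooms(*factors):
    """Orders of magnitude by which a raw count overstates capacity."""
    return sum(math.log10(f) for f in factors)


# Hypothetical chip: 18M transistors -> 3M cells -> 1M usable bits.
print(sram_capacity_bits(18_000_000))

# Illustrative factors only: 10 synapses per functional connection, plus an
# assumed further 10x to 100x from minicolumn-level grouping.
print(overestimate_ooms(10, 10))
print(overestimate_ooms(10, 100))
```

The point of the sketch is just that independent redundancy factors multiply, so a few modest ones are enough to reach a 2-3 OOM gap between "number of parts" and "capacity".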
3Gerald Monroe3d
Hi Steven. A simple stylized example: imagine you have some algorithm for processing each cluster of inputs from the retina. You might think that because that algorithm is symmetric* - you want to run the same algorithm regardless of which cluster it is - you only need one copy of the bytecode that represents the compiled copy of the algorithm. This is not the case. Information wise, sure. There is only one program that takes n bytes of information. You can save disk space for holding your model. RAM/cache consumption: each of the parallel processing units you have to use (you will not get realtime results for images if you try to do it serially) must have another copy of the algorithm. And this rule applies throughout the human body: every nerve cluster, audio processing, etc. This also greatly expands the memory required over your 24 GB 4090 example. For one thing, the human brain is very sparse, and while Nvidia has managed to improve sparse network performance, it still requires memory to represent all the sparse values. I might note that you could have tried to fill in the "cartoon switch" for human synapses. They are likely a MAC for each incoming axon at no better than 8 bits of precision added to an accumulator for the cell membrane at the synapse that has no better than 16 bits of precision. (it's probably less but the digital version has to use 16 bits) So add up the number of synapses in the human brain, assume 1 kHz, and that's how many TOPS you need. Let me do the math for you real quick: 68 billion neurons, about 1000 connections each, at 1 kHz. (it's very sparse). So we need 68 billion x 1000 x 1000 = 6.8e16 ops/s = 68,000 TOPS. Current gen data center GPU: [] So we would need 17 of them to hit 1 human brain. We can assume you will never get maximum performance (especially due to the very high sparsity), so maybe 2-3 nodes with 16 cards each? Note they
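The estimate above is easy to reproduce as a sketch. The inputs are the commenter's own (68 billion neurons, about 1000 connections each, 1 kHz, 8-bit weights, 17 GPUs). The per-GPU throughput is not stated in the text, since the link is missing, so the code only derives what the "17 of them" figure implies. All names are made up for illustration.

```python
NEURONS = 68e9               # commenter's neuron count
SYNAPSES_PER_NEURON = 1000   # "about 1000 connections each"
RATE_HZ = 1000               # assumed 1 kHz
BITS_PER_WEIGHT = 8          # "no better than 8 bits of precision"
GPUS_QUOTED = 17             # "we would need 17 of them"


def required_tops(neurons=NEURONS, synapses_per_neuron=SYNAPSES_PER_NEURON,
                  rate_hz=RATE_HZ):
    """Tera-operations per second if every synapse does one MAC per tick."""
    ops_per_second = neurons * synapses_per_neuron * rate_hz
    return ops_per_second / 1e12


def implied_tops_per_gpu(total_tops, gpu_count=GPUS_QUOTED):
    """Per-GPU throughput implied by the quoted GPU count (not a spec)."""
    return total_tops / gpu_count


def weight_storage_terabytes(neurons=NEURONS,
                             synapses_per_neuron=SYNAPSES_PER_NEURON,
                             bits_per_weight=BITS_PER_WEIGHT):
    """Storage for one weight per synapse, in terabytes (1e12 bytes)."""
    return neurons * synapses_per_neuron * bits_per_weight / 8 / 1e12


total = required_tops()
print(f"required: {total:,.0f} TOPS")
print(f"implied per GPU: {implied_tops_per_gpu(total):,.0f} TOPS")
print(f"weights: {weight_storage_terabytes():,.0f} TB")
```

The storage figure that falls out of the same assumptions matches the 68 TB number discussed further down the thread.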
3Steve Byrnes3d
Thanks for your comment! I am not a GPU expert, if you didn’t notice. :) This is the part I disagree with. For example, in the OP I cited this paper [] which has no MAC operations, just AND & OR. More importantly, you’re implicitly assuming that whatever neocortical neurons are doing, the best way to do that same thing on a chip is to have a superficial 1-to-1 mapping between neurons-in-the-brain and virtual-neurons-on-the-chip. I find that unlikely. Back to that paper just above, things happening in the brain are (supposedly) encoded as random sparse subsets of active neurons drawn from a giant pool of neurons. We could do that on the chip, if we wanted to, but we don’t have to! We could assign them serial numbers instead! We can do whatever we want! Also, cortical neurons are arranged into six layers vertically, and in the other direction, 100 neurons are tied into a closely-interconnected cortical minicolumn, and 100 minicolumns in turn form a cortical column. There’s a lot of structure there! Nobody really knows, but my best guess from what I’ve seen is that a future programmer might have one functional unit in the learning algorithm called a “minicolumn” and it’s doing, umm, whatever it is that minicolumns do, but we don’t need to implement that minicolumn in our code by building it out of 100 different interconnected virtual neurons. Yes the brain builds it that way, but the brain has lots of constraints that we won’t have when we’re writing our own code—for example, a GPU instruction set [] can do way more things than biological neurons can (partly because biological neurons are so insanely slow that any operation that requires more than a couple serial steps is a nonstarter).
3Gerald Monroe3d
Please read a neuroscience book, even an introductory one, on how a synapse works. Just 1 chapter, even. There's a MAC in there. It's because the incoming action potential hits the synapse, and sends a certain quantity of neurotransmitters across a gap. The sender cell can vary how much neurotransmitter it sends, and the receiving cell can vary how many active receptors it has. The type of neurotransmitter determines the gain and sign. (this is like the exponent and sign bit for 8 bit BFloat) These 2 variables can be combined to a single coefficient, you can think of it as "voltage delta" (it can be + or -) So it's (1) * (voltage gain) = change in target cell voltage. For ANN, it's <activation output> * <weight> = change in target node activation input. The brain also uses timing to get more information than just "1", the exact time the pulse arrived matters to a certain amount of resolution. It is NOT infinite, for reasons I can explain if you want. So the final equation is (1) * (synapse state) * (voltage gain) = change in target cell voltage. Aka you have to multiply 2 numbers together and add, which is what "multiply-accumulate" units do. Due to all the horrible electrical noise in the brain, and biological forms of noise and contaminants, and other factors, this is the reason for me making it only 8 bits - 1 part in 256 - of precision. That's realistically probably generous, it's probably not even that good. There are immense amounts of additional complexity in the brain, but almost none of this matters for determining inference outputs. The action potentials rush out of the synapse at kilometers per second - many biological processes just don't matter at all because of this. Same as how a transistor's behavior is irrelevant, it's a cartoon switch. For training, sure, if we wanted a system to work like a brain we'd have to model some of this, but we don't. We can train using whatever algorithm measurably is optimal. Similarly we never have to bother wit
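A minimal sketch of the "cartoon" synapse model described above: each incoming spike (0 or 1) is multiplied by a signed 8-bit weight and summed into an accumulator. This is only the commenter's simplified picture, not a claim about real neurons; the function names, the scale factor, and the example weights are assumptions for illustration.

```python
def quantize_int8(value, scale):
    """Round a real-valued weight to a signed 8-bit integer code."""
    code = round(value / scale)
    return max(-128, min(127, code))


def membrane_delta(spikes, weights_int8, scale):
    """Multiply-accumulate: sum of spike (0 or 1) times quantized weight."""
    acc = 0  # the comment suggests about 16 bits would do for this accumulator
    for spike, weight in zip(spikes, weights_int8):
        acc += spike * weight
    return acc * scale


SCALE = 1 / 128                      # hypothetical weight resolution
raw_weights = [0.5, -0.25, 0.1]      # made-up synaptic gains (sign = excitatory/inhibitory)
codes = [quantize_int8(w, SCALE) for w in raw_weights]

spikes = [1, 0, 1]                   # which inputs fired on this tick
print(codes)
print(membrane_delta(spikes, codes, SCALE))
```

Note that 0.1 cannot be represented exactly at this resolution, which is the "1 part in 256" precision limit the comment is pointing at.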
3Steve Byrnes2d
I’ve spent thousands of hours reading neuroscience papers, I know how synapses work, jeez :-P I’m sorta confused that you wrote all these paragraphs with (as I understand it) the message that if we want future AGI algorithms to do the same things that a brain can do, then it needs to do MAC operations in the same way that (you claim) brain synapses do, and it needs to have 68 TB of weight storage just as (you claim) the brain does. …But then here at the end you seem to do a 180° flip and talk about flapping wings and transformers and “We probably will find something way better”. OK, if “we probably will find something way better”, do you think that the “way better” thing will also definitely need 68 TB of memory, and definitely not orders of magnitude less than 68 TB? If you think it definitely needs 68 TB of memory, no way around it, then what’s your basis for believing that? And how do you reconcile that belief with the fact that we can build deep learning models of various types that do all kinds of neat things like language modeling and motor control and speech synthesis and image recognition etc. but require ≈100-100,000× less than 68 TB of memory? How are you thinking about that? (Maybe you have a “scale-is-all-you-need” perspective, and you note that we don’t have AGI yet, and therefore the explanation must be “insufficient scale”? Or something else?) OK, imagine for the sake of argument that we live in the following world (a caricatured version of this model []): * Dendrites have lots of clusters of 10 nearby synapses * Iff all 10 synapses within one cluster get triggered simultaneously, then it triggers a dendritic spike on the downstream neuron. * Different clusters on the same dendritic tree can each be treated independently * As background, the whole dendrite doesn’t have a single voltage (let alone the whole dendritic tree). Dendrites have different voltages in dif
2Vladimir Nesov4d
I think it's plausible that LLM simulacrum AGIs are initially below von Neumann level, and that there are no straightforward ways of quickly improving on that without risking additional misalignment. If so, the initial AGIs might coordinate to keep it this way a significant amount of time through the singularity (like, nanotech industry-rebuilding comes earlier than this) for AI safety reasons [], because making the less straightforward improvements leads to unnecessary unpredictability, and it takes a lot of subjective time at a level below von Neumann to ensure that this becomes a safe thing to do. The concept of AGI should track whatever is sufficient to trigger/sustain a singularity by autonomously converting compute to research progress, and shouldn't require even modest and plausible superpowers such as matching John von Neumann that are not strictly necessary for that purpose.
Nanotech industry-rebuilding comes earlier than von Neumann level? I doubt that. A lot of existing people are close to von Neumann level. Maybe your argument is that there will be so many AGIs, that they can do Nanotech industry rebuilding while individually being very dumb. But I would then argue that the collective already exceeds von Neumann or large groups of humans in intelligence.
1Vladimir Nesov4d
The argument is that once there is an AGI at IQ 130-150 level (not "very dumb", but hardly von Neumann), that's sufficient to autonomously accelerate research using the fact that AGIs have much higher serial speed than humans. This can continue for a long enough time to access research from very distant future, including nanotech for building much better AGI hardware at scale. There is no need for stronger intelligence in order to get there. The motivation for this to happen is the AI safety concern [] with allowing cognition that's more dangerous than necessary, and any non-straightforward improvements to how AGI thinks create such danger. For LLM-based AGIs, anchoring to human level that's available in the training corpus seems more plausible than for other kinds of AGIs (so that improvement in capability would become less than absolutely straightforward specifically at human level). If AGIs have an opportunity to prevent this AI safety risk, they might be motivated to take that opportunity, which would result in intentional significant delay in further improvement of AGI capabilities. I'm not saying that this is an intuitively self-evident claim, there is a specific reason I'm giving for why I see it as plausible. Even when there is a technical capability to build giant AGIs the size of cities [], there is still the necessary intermediate of motive in bridging the gap from capability to actuality.

Work done with Ramana Kumar, Sebastian Farquhar (Oxford), Jonathan Richens, Matt MacDermott (Imperial) and Tom Everitt.

Our DeepMind Alignment team researches ways to avoid AGI systems that knowingly act against the wishes of their designers. We’re particularly concerned about agents which may be pursuing a goal that is not what their designers want.

These types of safety concerns motivate developing a formal theory of agents to facilitate our understanding of their properties and avoid designs that pose a safety risk. Causal influence diagrams (CIDs) aim to be a unified theory of how design decisions create incentives that shape agent behaviour to illuminate potential risks before an agent is trained and inspire better agent designs with more appealing alignment properties.

Our new paper, Discovering Agents, introduces new ways of tackling these issues, including:

  • The first formal causal

The idea that "Agents are systems that would adapt their policy if their actions influenced the world in a different way." works well on mechanised CIDs whose variables are neatly divided into object-level and mechanism nodes: we simply check for a path from a utility function F_U to a policy Pi_D. But to apply this to a physical system, we would need a way to obtain such a partition of those variables. Specifically, we need to know (1) what counts as a policy, and (2) whether any of its antecedents count as representations of "influence" on the world (and af...
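The path check mentioned in this comment can be sketched as plain graph reachability. This follows the comment's simplified description only, not the paper's full definition, and the toy graph, its edges, and the function name are all hypothetical.

```python
from collections import deque


def has_directed_path(edges, source, target):
    """Breadth-first search for a directed path from source to target."""
    graph = {}
    for u, v in edges:
        graph.setdefault(u, []).append(v)
    seen = {source}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            return True
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False


# Toy mechanised graph (assumed for illustration): object-level nodes D and U,
# mechanism nodes Pi_D (policy) and F_U (utility function).
adaptive = [("Pi_D", "D"), ("F_U", "U"), ("D", "U"), ("F_U", "Pi_D")]
fixed = [("Pi_D", "D"), ("F_U", "U"), ("D", "U")]

print(has_directed_path(adaptive, "F_U", "Pi_D"))
print(has_directed_path(fixed, "F_U", "Pi_D"))
```

The check itself is trivial once the nodes are labelled; the difficulty the comment raises is deciding which physical variables should be labelled Pi_D and F_U in the first place.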

In this post I’m going to describe my basic justification for working on RLHF in 2017-2020, which I still stand behind. I’ll discuss various arguments that RLHF research had an overall negative impact and explain why I don’t find them persuasive.

I'll also clarify that I don't think research on RLHF is automatically net positive; alignment research should address real alignment problems, and we should reject a vague association between "RLHF progress" and "alignment progress."

Background on my involvement in RLHF work

Here are some background views about alignment I held in 2015 and still hold today. I expect disagreements about RLHF will come down to disagreements about this background:

  • The simplest plausible strategies for alignment involve humans (maybe with the assistance of AI systems) evaluating a model’s actions based on
4Tsvi Benson-Tilsen2d
A central version of this seems to straightforwardly advance capabilities. The strongest (ISTM) sort of analogy between a current system and a future lethal system would be that they use an overlapping set of generators of capabilities. Trying to find an agent that does a treacherous turn, for the same reasons as a future lethal agent, seems to be in particular a search for an agent that has the same generators of capabilities as future lethal agents. On the other hand, trying to prevent treacherous turns in a system that has different generators seems like it doesn't have much chance of generalizing. It seems clear that one could do useful "advertising" (better term?) research of this form, where one makes e.g. treacherous turns intuitively salient to others by showing something with some features in common with future lethal ones. E.g. one could train an agent A in an environment that contains the source B of A's reward, where B does some limited search to punish actions by A that seem, to the limited search, to be building up towards A hacking B. One might find that A does well according to B for a while, until it's understood the environment well enough (via exploration that didn't look to B like hacking) to plan, recognize as high reward, and follow a pathway to hack B. Or something. This could be helpful for "advertising" reasons, but I think my sense of how much this actually helps with the actual alignment problem correlates pretty strongly with how much A is shaped---in terms of how it got its capabilities---alike to future lethal systems. What are ways that the helpfulness for alignment of an observational study like this can be pulled apart from similarity of capability generators?
4Paul Christiano2d
The main way you produce a treacherous turn is not by "finding the treacherous turn capabilities," it's by creating situations in which sub-human systems have the same kind of motive to engage in a treacherous turn that we think future superhuman systems might have. There are some differences and lots of similarities between what is going on in a weaker AI doing a treacherous turn and a stronger AI doing a treacherous turn. So you expect to learn some things and not others. After studying several such cases it seems quite likely you understand enough to generalize to new cases. It's possible MIRI folks expect a bigger difference in how future AI is produced. I mostly expect just using gradient descent, resulting in minds that are in some ways different and in many ways different. My sense is that MIRI folks have a more mystical view about the difference between subhuman AI systems and "AGI." (The view "stack more layers won't ever give you true intelligence, there is a qualitative difference here" seems like it's taking a beating every year, whether it's Eliezer or Gary Marcus saying it.)
1Tsvi Benson-Tilsen1d
When you say "motive" here, is it fair to reexpress that as: "that which determines by what method and in which directions capabilities are deployed to push the world"? If you mean something like that, then my worry here is that motives are a kind of relation involving capabilities, not something that just depends on, say, the reward structure of the local environment. Different sorts of capabilities or generators of capabilities will relate in different ways to ultimate effects on the world. So the task of interfacing with capabilities to understand how they're being deployed (with what motive), and to actually specify motives, is a task that seems like it would depend a lot on the sort of capability in question.
2Paul Christiano1d
I think if you train AI systems to select actions that will lead to high reward, they will sometimes learn policies that behave well until they are able to overpower their overseers, at which point they will abruptly switch to the reward hacking strategy to get a lot of reward. I think there will be many similarities between this phenomenon in subhuman systems and superhuman systems. Therefore by studying and remedying the problem for weak systems overpowering weak overseers, we can learn a lot about how to identify and remedy it for stronger systems overpowering stronger overseers. I'm not exactly sure how to cash out your objection as a response to this, but I suspect it's probably a bit too galaxy-brained for my taste.
7Tsvi Benson-Tilsen16h
So for example, say Alice runs this experiment: Alice observes that A learns to hack B. Then she solves this as follows: Alice observes that A doesn't hack B. Then Bob looks at Alice's results and says, "Cool. But this won't generalize to future lethal systems because it doesn't account for how A can combine innocuous understanding that it gains. Future systems, to be very competent, will probably do something functionally equivalent to exploring their environment to understand parts of the environment without necessarily trying to achieve some big goal (such as hacking B) along the way. This creates a 'capabilities overhang' relative to the overseer: there's no behavior that's clearly aimed at something B considers dangerous, but A accumulates ability to put together plans that do more and more effective stuff, compared to what A has actually previously acted out and gotten direct reinforcement on. This is an important part of how future systems might be lethal." So then Alice and Bob collaborate and come up with this variation: Alice and Bob observe that A avoids approaching B for a long time while steadily improving both its B-score and also its exploration score. Then at some point, all in one episode, A hacks B and achieves very high reward. Now, this might be interesting from an alignment perspective, or not. But my point is that Alice and Bob have perhaps, in some version of the hypothetical, also made a capabilities advance: they've demonstrated non-trivial gains from an exploration objective. I assume that in our world this is not much of an insight, as exploration objectives have already been discussed and tried. But this is the sort of pattern that's concerning to me. I'm not saying one can't do this sort of thing in a way such that the alignment value exceeds the capabilities advancement in the relevant way. I'm saying, these things seem to push pretty directly against each other, so I'd want careful thinking about how to pull them apart. Even inst
40Oliver Habryka2d
I am very confused why you think this, just right after the success of Chat-GPT, where approximately the only difference from GPT-3 was the presence of RLHF. My current best guess is that Chat-GPT alone, via sparking an arms-race between Google and Microsoft, and by increasing OpenAI's valuation, should be modeled as the equivalent of something on the order of $10B of investment into AI capabilities research, completely in addition to the gains from GPT-3. And my guess is most of that success is attributable to the work on RLHF, since that was really the only substantial difference between Chat-GPT and GPT-3. We also should not think this was overdetermined since 1.5 years passed between the release of GPT-3 and the release of Chat-GPT (with some updates to GPT-3 in the meantime, but my guess is no major ones), and no other research lab focused on capabilities had set up their own RLHF pipeline (except Anthropic, which I don't think makes sense to use as a datapoint here, since it's in substantial parts the same employees). I have been trying to engage with the actual details here, and indeed have had a bunch of arguments with people over the last 2 years where I have been explicitly saying that RLHF is pushing on commercialization bottlenecks based on those details, and people believing this was not the case was the primary crux on whether RLHF was good or bad in those conversations. The crux was importantly not that other people would do the same work anyways, since people at the same time also argued that their work on RLHF was counterfactually relevant and that it's pretty plausible or likely that the work would otherwise not happen. I've had a few of these conversations with you as well (though in aggregate not a lot) and your take at the time was (IIRC) that it seemed quite unlikely that RLHF would have as big of an effect as it did have in the case of Chat-GPT (mostly via an efficiency argument that if that was the case, more capabilities-oriented people wou
26Paul Christiano2d
I think the qualitative difference between the supervised tuning done in text-davinci-002 and the RLHF in text-davinci-003 is modest (e.g. I've seen head-to-head comparisons suggesting real but modest effects on similar tasks). I think the much more important differences are: 1. It was trained to interact directly with the end user as a conversational assistant rather than in an API intended to be used by developers. 2. It was deployed in a way that made it much easier for more people to interact with it. 3. People hadn't appreciated progress since GPT-3, or even how good GPT-3 was, and this went viral (due to a combination of 1+2). 4. If there are large capability differences I expect they are mostly orthogonal improvements. I think the effect would have been very similar if it had been trained via supervised learning on good dialogs. ChatGPT was impactful because of a big mismatch between people's perceptions of LM abilities and reality. That gap was going to get closed sooner or later (if not now then probably at the GPT-4 release). I think it's reasonable to think that this was a really destructive decision by OpenAI, but I don't think it's reasonable to treat it as a counterfactual $10B of investment. I feel like the implicit model of the world you are using here is going to have effect sizes adding up to much more than the actual variance at stake. How impactful was the existence of OpenAI? Leadership decisions at Google? Microsoft's willingness to invest in OpenAI? The surprising effectiveness of transformers? Google originally deciding not to scale up LMs aggressively? The training of PaLM? The original GPT-3 release decisions? The fact that LM startups are raising at billion dollar valuations? The fact that LM applications are making hundreds of millions of dollars? These sources of variance all add up to 100% of the variance in AI investment, not 100000% of the variance. I think it's a persistent difference between us that I tend
11Oliver Habryka2d
I don't currently think this is the case, and seems like the likely crux. In general it seems that RLHF is substantially more flexible in what kind of target task it allows you to train for, which is the whole reason for why you are working on it, and at least my model of the difficulty of generating good training data for supervised learning here is that it would have been a much greater pain, and would have been much harder to control in various fine-grained ways (including preventing the AI from saying controversial things), which had been the biggest problem with previous chat bot attempts. I find a comparison with John Schulman here unimpressive if you want to argue progress on this was overdetermined, given the safety motivation by John, and my best guess being that if you had argued forcefully that RLHF was pushing on commercialization bottlenecks, that John would have indeed not worked on it. Seeing RLHF teams in other organizations not directly downstream of your organizational involvement, or not quite directly entangled with your opinion, would make a bigger difference here. I don't think so, and have been trying to be quite careful about this. Chat-GPT is just by far the most successful AI product to date, with by far the biggest global impact on AI investment and the most hype. I think $10B being downstream of that isn't that crazy. The product has a user base not that different from other $10B products, and a growth rate to put basically all of them to shame, so I don't think a $10B effect from Chat-GPT seems that unreasonable. There is only so much variance to go around, but Chat-GPT is absolutely massive in its impact.
7Paul Christiano2d
I bet they did generate supervised data (certainly they do for InstructGPT), and supervised data seems way more fine-grained in what you are getting the AI to do. It's just that supervised fine-tuning is worse. I think the biggest problem with previous chat-bot attempts is that the underlying models are way way weaker than GPT-3.5. This still seems totally unreasonable to me: * How much total investment do you think there is in AI in 2023? * How much variance do you think there is in the level of 2023 investment in AI? (Or maybe whatever other change you think is equivalent.) * How much influence are you giving to GPT-3, GPT-3.5, GPT-4? How much to the existence of OpenAI? How much to the existence of Google? How much to Jasper? How much to good GPUs? I think it's unlikely that the reception of ChatGPT increased OpenAI's valuation by $10B, much less investment in OpenAI, even before thinking about replaceability. I think that Codex, GPT-4, DALL-E, etc. are all very major parts of the valuation. I also think replaceability is a huge correction term here. I think it would be more reasonable to talk about moving how many dollars of investment how far forward in time. I think John wants to make useful stuff, so I doubt this.

Supervised data seems way more fine-grained in what you are getting the AI to do. It's just that supervised fine-tuning is worse.

My (pretty uninformed) guess here is that supervised fine-tuning vs RLHF has relatively modest differences in terms of producing good responses, but bigger differences in terms of avoiding bad responses. And it seems reasonable to model decisions about product deployments as being driven in large part by how well you can get AI not to do what you don't want it to do.

11Oliver Habryka1d
My guess is total investment was around the $200B - $500B range, with about $100B of that into new startups and organizations, and around $100-$400B of that in organizations like Google and Microsoft outside of acquisitions. I have pretty high uncertainty on the upper end here, since I don't know what fraction of Google's revenue gets reinvested again into AI, how much Tesla is investing in AI, how much various governments are investing, etc. Variance between different years depending on market condition and how much products take off seems to be on the order of 50% to me. Like, different years have pretty hugely differing levels of investment. My guess is about 50% of that variance is dependent on different products taking off, how much traction AI is getting in various places, and things like Chat-GPT existing vs. not existing. So this gives around $50B - $125B of variance to be explained by product-adjacent things like Chat-GPT. Existence of OpenAI is hard to disentangle from the rest. I would currently guess that in terms of total investment, GPT-2 -> GPT-3 made a bigger difference than GPT-3.5 -> Chat-GPT, but both made a much larger difference than GPT-3 -> GPT-3.5. I don't think Jasper made a huge difference, since its userbase is much smaller than Chat-GPT, and also evidently the hype from it has been much lower. Good GPUs feels kind of orthogonal. We can look at each product that makes up my 50% of the variance to be explained and see how useful/necessary good GPUs were for its development, and my sense is for Chat-GPT at least the effect of good GPUs was relatively minor since I don't think the training to move from GPT-3.5 to Chat-GPT was very compute intensive. I would feel fine saying expected improvements in GPUs are responsible for 25% of the 50% variance (i.e. 12.5%) if you chase things back all the way, though that again feels like it isn't trying to add up to 100% with the impact from "Chat-GPT". I do think it's trying to add up to 100% with
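The $50B - $125B range quoted in this comment can be reproduced with a short back-of-envelope sketch. The function and variable names are hypothetical; the inputs are the comment's own rough guesses, not measured data.

```python
# Back-of-envelope check of the variance budget described in the comment.
# Inputs are the commenter's stated guesses; names are illustrative only.

def product_driven_variance(total_investment, yearly_variance_frac, product_share):
    """Portion of year-to-year investment variance attributed to products taking off."""
    return total_investment * yearly_variance_frac * product_share

# Total AI investment guessed at $200B - $500B, ~50% year-to-year variance,
# ~50% of that variance attributed to product-adjacent events.
low = product_driven_variance(200e9, 0.5, 0.5)
high = product_driven_variance(500e9, 0.5, 0.5)

print(f"${low / 1e9:.0f}B - ${high / 1e9:.0f}B")
```

Under these assumptions, a $10B effect would be between roughly a twelfth and a fifth of the product-driven variance.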
1Matthew "Vaniver" Gray13h
IMO it's much easier to support high investment numbers in "AI" if you consider lots of semiconductor / AI hardware startup stuff as "AI investments". My suspicion is that while GPUs were primarily a crypto thing for the last few years, the main growth outlook driving more investment is them being an AI thing.
I'd be interested to know how you estimate the numbers here; they seem quite inflated to me. If 4 big tech companies were to invest $50B each in 2023 then, assuming an average salary of $300k and a 2:1 capital-to-salary ratio, each company would be hiring about $50B/$900K ≈ 55,000 people to work on this stuff. For reference the total headcount at these orgs is roughly 100-200K. $50B/yr is also around 25-50% of the size of the total income, and greater than profits for most, which again seems high. Perhaps my capital ratio is way too low but I would find it hard to believe that these companies can meaningfully put that level of capital into action so quickly. I would guess more on the order of $50B between the major companies in 2023. Agree with Paul's comment above that timeline shifts are the most important variable.
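The headcount estimate in this comment follows from a simple division, sketched here. The function name and default parameters are illustrative assumptions taken from the comment's stated figures ($300k average salary, 2:1 capital to salary).

```python
# Rough headcount implied by a given annual AI spend, using the comment's
# assumptions. The function name and defaults are hypothetical.

def implied_headcount(annual_spend, avg_salary=300_000, capital_to_salary=2.0):
    # Each person costs their salary plus the capital deployed alongside them.
    cost_per_person = avg_salary * (1 + capital_to_salary)
    return annual_spend / cost_per_person

people = implied_headcount(50e9)   # $50B per company per year
print(f"about {people:,.0f} people per company")
```

A higher capital-to-salary ratio lowers the implied headcount proportionally, which is the sensitivity the commenter flags when noting the ratio may be too low.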
4Paul Christiano14h
I didn't realize how broadly you were defining AI investment. If you want to say that e.g. ChatGPT increased investment by $10B out of $200-500B, so like +2-5%, I'm probably happy to agree (and I also think it had other accelerating effects beyond that). I would guess that a 2-5% increase in total investment could speed up AGI timelines 1-2 weeks depending on details of the dynamics, like how fast investment was growing, how much growth is exogenous vs endogenous, diminishing returns curves, importance of human capital, etc. If you mean +2-5% investment in a single year then I would guess the impact is < 1 week. I haven't thought about it much, but my all things considered estimate for the expected timelines slowdown if you just hadn't done the ChatGPT release is probably between 1-4 weeks. Is that the kind of effect size you are imagining here? I guess the more important dynamic is probably more people entering the space rather than timelines per se? One thing worth pointing out in defense of your original estimate is that variance should add up to 100%, not effect sizes, so e.g. if the standard deviation is $100B then you could have 100 things each explaining ($10B)^2 of variance (and hence each responsible for +-$10B effect sizes after the fact).
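The closing point, that variances add while effect sizes do not, can be illustrated with a minimal sketch. It assumes the 100 factors are statistically independent, which the comment does not state explicitly; the function name is hypothetical.

```python
import math

def combined_std(effect_sizes):
    """Standard deviation of a sum of independent factors (their variances add)."""
    return math.sqrt(sum(size ** 2 for size in effect_sizes))

factors = [10e9] * 100             # 100 independent factors, each about +-$10B
total_std = combined_std(factors)  # square root of 100 * ($10B)^2
naive_sum = sum(factors)           # what you would get if effect sizes simply added

print(f"combined std: ${total_std / 1e9:.0f}B")
print(f"naive sum:    ${naive_sum / 1e9:.0f}B")
```

The combined standard deviation comes out at about $100B, a tenth of the naive sum of effect sizes, so many separate $10B effects are compatible with a $100B spread overall.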
4Oliver Habryka12h
Makes sense, sorry for the miscommunication. I really didn't feel like I was making a particularly controversial claim with the $10B, so was confused why it seemed so unreasonable to you. I do think those $10B are going to be substantially more harmful for timelines than other money in AI, because I do think a good chunk of that money will much more directly aim at AGI than most other investment. I don't know what my multiplier here for effect should be, but my guess is something around 3-5x in expectation (I've historically randomly guessed that AI applications are 10x less timelines-accelerating per dollar than full-throated AGI-research, but I sure have huge uncertainty about that number). That, plus me thinking there is a long tail with lower probability where Chat-GPT made a huge difference in race dynamics, and thinking that this marginal increase in investment does probably translate into increases in total investment, made me think this was going to shorten timelines in expectation by something closer to 8-16 weeks, which isn't enormously far away from yours, though still a good bit higher. And yeah, I do think the thing I am most worried about with Chat-GPT in addition to just shortening timelines is increasing the number of actors in the space, which also has indirect effects on timelines. A world where both Microsoft and Google are doubling down on AI is probably also a world where AI regulation has a much harder time taking off. Microsoft and Google at large also strike me as much less careful actors than the existing leaders of AGI labs which have so far had a lot of independence (which to be clear, is less of an endorsement of current AGI labs, and more of a statement about very large moral-maze-like institutions with tons of momentum). In general the dynamics of Google and Microsoft racing towards AGI sure are among my least favorite takeoff dynamics in terms of being able to somehow navigate things cautiously. Oh, yeah, good point. I was indeed th
4Oliver Habryka2d
Note that I never said this, so I am not sure what you are responding to. I said Chat-GPT increases investment in AI by $10B, not that it increased investment into specifically OpenAI. Companies generally don't have perfect moats. Most of that increase in investment is probably in internal Google allocation and in increased investment into the overall AI industry.
12Ajeya Cotra2d
I don't think this is right -- the main hype effect of ChatGPT over previous models feels like it's just because it was in a convenient chat interface that was easy to use and free. My guess is that if you did a head-to-head comparison of RLHF and kludgey random hacks involving imitation and prompt engineering, they'd seem similarly cool to a random journalist / VC, and generate similar excitement.
1Raymond Arnold2d
I think the part where it has a longer memory/coherence feels like a major shift (having gotten into the flow of experimenting with GPT-3 in the month prior to ChatGPT, I felt like the two interfaces were approximately as convenient). I don't know what mechanism was used to generate the longer coherence though.
3Paul Christiano2d
I don't think this is related to RLHF.
16Matthew Barnett2d
I don't have extensive relevant expertise, but as a personal datapoint: I used Davinci-002 multiple times to generate an interesting dialogue in order to test its capabilities. I ran several small-scale Turing tests, and the results were quite unimpressive in my opinion. When ChatGPT came out, I tried it out (on the day of its release) and very quickly felt that it was qualitatively better at dialogue. Of course, I could have simply been prompting Davinci-002 poorly, but overall I'm quite skeptical that the main reason for ChatGPT hype was that it had a more convenient chat interface than GPT-3.
22Arun Jose3d
Thanks for this post! I wanted to write a post about my disagreements with RLHF in a couple weeks, but your treatment is much more comprehensive than what I had in mind, and from a more informed standpoint. I want to explain my position on a couple points in particular though - they would've been a central focus of what I imagined my post to be, points around which I've been thinking a lot recently. I haven't talked to a lot of people about this explicitly so I don't have high credence in my take, but it seems at least worth clarifying. My picture on why taking ordinary generative models and conditioning them to various ends (like accelerating alignment, for example) is useful relies on a key crux that the intelligence we're wielding is weighted by our world prior. We can expect it to be safe insofar as things normally sampled from the distribution underlying our universe are, modulo arbitrarily powerful conditionals (which degrade performance to an extent anyway) while moving far away from the default world state. So here's one of my main reasons for not liking RLHF: it removes this very satisfying property. Models that have been RLHF'd (so to speak), have different world priors in ways that aren't really all that intuitive (see Janus' work on mode collapse, or my own prior work which addresses this effect in these terms more directly since you've probably read the former). We get a posterior that doesn't have the nice properties we want of a prior based directly on our world, because RLHF is (as I view it) a surface-level instrument we're using to interface with a high-dimensional ontology. Making toxic interactions less likely (for example) leads to weird downstream effects in the model's simulations because it'll ripple through its various abstractions in ways specific to how they're structured inside the model, wh
5Paul Christiano2d
I think Janus' post on mode collapse is basically just pointing out that models lose entropy across a wide range of domains. That's clearly true and intentional, and you can't get entropy back just by turning up temperature. The other implications about how RLHF changes behavior seem like they either come from cherry-picked and misleading examples or just to not be backed by data or stated explicitly. If predicting webtext is a good way to get things done, people can do that. But probably it isn't, and so people probably won't do that unless you give them a good reason. That said, almost all the differences that Janus and you are highlighting emerge from supervised fine-tuning. I don't know in what sense "predict human demonstrators" is missing an important safety property from "predict internet text," and right now it feels to me like kind of magical thinking. The main way I can see it going is that you can condition the webtext model on other things like "there is a future AGI generating this text..." or "What action leads to consequence X?" But I think those things are radically less safe than predicting demonstrations in the lab, and lead to almost all the same difficulties if they in fact improve capabilities. Maybe the safety loss comes from "produce things that evaluators in the lab like" rather than "predict demonstrations in the lab"? There is one form of this I agree with---models trained with RLHF will likely try to produce outputs humans rate highly, including by e.g. producing outputs that drive humans insane to give them a good rating or whatever. But overall people seem to be reacting to some different more associative reason for concern that I don't think makes sense (yet). So does conditioning the model to get it to do something useful. Also I think "focuses the model's computation on agency in some sense" is probably too vague to be a helpful way to think about what's going on---it seems like it leads the model to produce outputs that it think
3Evan R. Murphy2d
Glad to see both the OP as well as the parent comment. I wanted to clarify something I disagreed with in the parent comment as well as in a sibling comment from Sam Marks about the Anthropic paper "Discovering Language Model Behaviors with Model-Written Evaluations" (paper, post): Both of these points seem to suggest that the main takeaway from the Anthropic paper was to uncover concerning behaviours in RLHF language models. That's true, but I think it's just as important that the paper also found pretty much the same concerning behaviours in plain pre-trained LLMs that did not undergo RLHF training, once those models were scaled up to a large enough size.
2Arun Jose2d
Thanks! My take on the scaled-up models exhibiting the same behaviours feels more banal - larger models are better at simulating agentic processes and their connection to self-preservation desires etc, so the effect is more pronounced. Same cause, different routes getting there with RLHF and scale.
1Sam Marks2d
This, broadly-speaking, is also my best guess, but I'd rather phrase it as: larger LMs are better at making the personas they imitate "realistic" (in the sense of being more similar to the personas you encounter when reading webtext). So doing RLHF on a larger LM results in getting an imitation of a more realistic useful persona. And for the helpful chatbot persona that Anthropic's language model was imitating, one correlate of being more realistic was preferring not to be shut down. (This doesn't obviously explain the results on sycophancy. I think for that I need to propose a different mechanism, which is that larger LMs were better able to infer their interlocutor's preferences, so that sycophancy only became possible at larger scales. I realize that to the extent this story differs from other stories people tell to explain Anthropic's findings, that means this story gets a complexity penalty.)
3Sam Marks2d
Regarding your points on agentic simulacra (which I assume means "agentic personas the language model ends up imitating"): 1) My best guess about why Anthropic's model expressed self-preservation desires is the same as yours: the model was trying to imitate some relatively coherent persona, this persona was agentic, and so it was more likely to express self-preservation desires. 2) But I'm pretty skeptical about your intuition that RLHF makes the "imitating agentic personas" problem worse. When people I've spoken to talk about conditioning-based alternatives to RLHF that produce a chatbot like the one in Anthropic's paper, they usually mean either: (a) prompt engineering; or (b) having the model produce a bunch of outputs, annotating the outputs with how much we liked them, retraining the model on the annotated data, and conditioning the model to produce outputs like the ones we most liked. (For example, we could prefix all of the best outputs with the token "GOOD" and then ask the model to produce outputs which start with "GOOD".) Approach (b) really doesn't seem like it will result in less agentic personas, since I imagine that imitating the best outputs will result in imitating an agentic persona just as much as fine-tuning for good outputs with a policy gradient method would. (Main intuition here: the best outputs you get from the pretrained model will already look like they were written by an agentic persona, because those outputs were produced by the pretrained model getting lucky and imitating a useful persona on that rollout, and the usefulness of a persona is correlated with its agency.) I mostly am skeptical that approach (a) will be able to produce anything as useful as Anthropic's chatbot. But to the extent that it can, I imagine that it will do so by eliciting a particular useful persona, which I have no reason to think will be more or less agentic than the one we got via RLHF. Interested to hear if you have other intuitions here.
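Approach (b) in this comment can be sketched as a data-preparation step. This is only an illustration under stated assumptions: the function names, the `top_fraction` cutoff, and the "BAD" tag for the remaining outputs are hypothetical, and the retraining step itself is omitted. Only the "GOOD" prefix comes from the comment.

```python
# Hypothetical sketch of approach (b): tag the best-rated outputs with a
# control token so that a model retrained on this data can later be
# conditioned on that token. Everything except the "GOOD" prefix is assumed.

GOOD, BAD = "GOOD", "BAD"

def build_conditioned_dataset(rated_outputs, top_fraction=0.25):
    """rated_outputs: list of (text, rating) pairs. Returns tagged training strings."""
    ranked = sorted(rated_outputs, key=lambda pair: pair[1], reverse=True)
    n_good = max(1, int(len(ranked) * top_fraction))
    return [
        f"{GOOD if i < n_good else BAD} {text}"
        for i, (text, _rating) in enumerate(ranked)
    ]

def conditioned_prompt(user_prompt):
    """At sampling time, request a continuation that begins with the GOOD tag."""
    return f"{user_prompt}\n{GOOD} "

samples = [("helpful answer", 0.9), ("rude answer", 0.1),
           ("okay answer", 0.5), ("off-topic answer", 0.2)]
dataset = build_conditioned_dataset(samples)
prompt = conditioned_prompt("How do I sort a list?")
```

Note that the retrained model still imitates whatever produced the top-rated samples, which is the comment's reason for doubting that this route yields a less agentic persona than policy-gradient fine-tuning.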
One consequence downstream of this that seems important to me in the limit: 1. Nonconditioning fine-tuned predictor models make biased predictions. If those biases happen to take the form of a misaligned agent, the model itself is fighting you. 2. Conditioned predictor models make unbiased predictions. The conditioned sequence could still represent a misaligned agent, but the model itself is not fighting you. I think having that one extra layer of buffer provided by 2 is actually very valuable. A goal agnostic model (absent strong gradient hacking) seems more amenable to honest and authentic intermediate reporting and to direct mechanistic interpretation.

This post is part of the work done at Conjecture.

Thanks to Eric Winsor, Daniel Braun, Andrea Miotti and Connor Leahy for helpful comments and feedback on the draft versions of this post.  

There has been a lot of debate and discussion recently in the AI safety community about whether AGI will likely optimize for fixed goals or be a wrapper mind. The term wrapper mind is largely a restatement of the old idea of a utility maximizer, with AIXI as a canonical example. The fundamental idea is that there is an agent with some fixed utility function which it maximizes without any kind of feedback which can change its utility function. 

 Rather, the optimization process is assumed to be 'wrapped around' some core and unchanging utility function. The capabilities core...

Bit of a nitpick, but I think you’re misdescribing AIXI. I think AIXI is defined to have a reward input channel, and its collection-of-all-possible-generative-world-models are tasked with predicting both sensory inputs and reward inputs, and Bayesian-updated accordingly, and then the generative models are issuing reward predictions which in turn are used to choose maximal-reward actions. (And by the way it doesn’t really work—it under-explores and thus can be permanently confused about counterfactual / off-policy rewards, IIUC.) So AIXI has no utility func... (read more)

This is a lightly edited transcript of a chatroom conversation between Scott Alexander and Eliezer Yudkowsky last year, following up on the Late 2021 MIRI Conversations. Questions discussed include "How hard is it to get the right goals into AGI systems?" and "In what contexts do AI systems exhibit 'consequentialism'?".


1. Analogies to human moral development


@ScottAlexander ready when you are


Okay, how do you want to do this?


If you have an agenda of Things To Ask, you can follow it; otherwise I can start by posing a probing question or you can?

We've been very much winging it on these and that has worked... as well as you have seen it working!


Okay. I'll post from my agenda. I'm assuming we both have the right to edit logs

8Raymond Arnold2d
I liked the point about "the reason GPT3 isn't consequentialist is that it doesn't find its way to the same configuration when you perturb the starting conditions." I think I could have generated that definition of consequentialism, but would have trouble making the connection on-the-fly. (At least, I didn't successfully generate it in between reading Scott's confusion and Eliezer's explanation). I feel like I now get it more crisply.
1Raymond Arnold2d
Not really the main point, but, I would bet: a) something pretty close to Minecraft will be an important testing ground for some kinds of alignment work. b) Minecraft itself will probably get a lot of use in AI research as things advance (largely due to being one of the most popular videogames of all time), whether or not it's actually quite the right test-bed. (I think the right test-bed will probably be optimized more directly for ease-of-training). I think it might be worth Eliezer playing a Minecraft LAN party with some friends* for a weekend, so that the "what is Minecraft?" question has a more true answer than the cobbled-together intuitions here, if for no other reason than having a clear handle on what people are talking about when they use Minecraft as an example. (But, to be fair, if my prediction bears out it'll be pretty easy to play Minecraft for a weekend later) *the "with friends" part is extremely loadbearing. Solo Minecraft is a different experience. Minecraft is interesting to me for basically being "real life, but lower resolution". If I got uploaded into Minecraft and trapped there forever I'd be sad to be missing some great things, but I think I'd have at least a weak form of most core human experiences, and this requires having other people around. Minecraft is barely a "game". There is a rough "ascend tech tree and kill cooler monsters" that sort of maps onto Factorio + Skyrim, but the most interesting bits are: * build interesting buildings/structures out of legos * this begins with "make an interesting house", or a sculpture, but then proceeds to "construct automated factory farms", "figure out ways to hack together flying machines that the Minecraft physics engine technically allows but didn't intend", "make music", "build computers that can literally run Minecraft". The game getting played here is basically the same thing real life society is playing (
1David Manheim3d
Minor quibble which seems to have implications - "There is a consensus that there are roughly about 100 billion neurons total in the human brain. Each of these neurons can have up to 15,000 connections with other neurons via synapses" My rough understanding is that babies' brains greatly increase how many synapses there are until age 2 or 3, then these are eliminated or become silent in older children and adults. But this implies that there's a ton of connections, and most of the conditioning and construction of the structure is environmental, not built into the structure via genetics.
24Steve Byrnes3d
There’s a popular tendency to conflate the two ideas: * “we should think of humans as doing within-lifetime learning by RL”, versus * “we should think of humans as doing within-lifetime learning by RL, where the reward function is whatever parents and/or other authority figures want it to be” The second is associated with behaviorism, and is IMO preposterous. Intrinsic motivation is a thing; in fact, it’s kinda the only thing! The reward function is in the person’s own head, although things happening in the outside world are some of the inputs to it. Thus parents have some influence on the rewards (just like everything else in the world has some influence on the rewards), but the influence is through many paths, some very indirect, and the net influence is not even necessarily in the direction that the parent imagines it to be (thus reverse psychology is a thing!). My read of behavioral genetics is that approximately nothing that parents do to kids (within the typical distribution) has much if any effect on what kinds of adults their kids will grow into. (Note the disanalogy to AGI, where the programmers get to write the reward function however they want.) (…Although there’s some analogy to AGI if we don’t have perfect interpretability of the AGI’s thoughts, which seems likely.) But none of this is evidence that the first bullet point is wrong. I think the first bullet point is true and important. IIUC the experiment being referred to here showed that people did poorly on a reasoning task related to the proposition “if a card shows an even number on one face, then its opposite face is red”, but did much better on the same reasoning task related to the proposition “If you are drinking alcohol, then you must be over 18”. This was taken to be evidence that humans have an innate cognitive adaptation for cheater-detectors. I think a better explanation is that most people don’t have a deep understanding of IF-THEN, but rather have learned some heuristics that w
9Kaj Sotala3d
I was a bit surprised to see Eliezer invoke the Wason Selection Task. I'll admit that I haven't actually thought this through rigorously, but my sense was that modern machine learning had basically disproven the evpsych argument that those experimental results require the existence of a separate cheating-detection module. As well as generally calling the whole massive modularity thesis into severe question, since the kinds of results that evpsych used to explain using dedicated innate modules now look a lot more like something that could be produced with something like GPT. ... but again I never really thought this through explicitly, it was just a general shift of intuitions that happened over several years and maybe it's wrong.
15Raymond Arnold3d
Is it actually the case that they're happening "in the same step" for the AI? I agree with "the thing going on in AI is quite different from the collective learning going on in evolutionary-learning and childhood learning", and I think trying to reason from analogy here is probably generally not that useful. But, my sense is if I was going to map the "evolutionary learning" bit to most ML stuff, the evolutionary bit is more like "the part where the engineers designed a new architecture / base network", and on one hand engineers are much smarter than evolution, but on the other hand they haven't had millions of years to do it.
5Raymond Arnold3d
On one hand, I've heard a few things about blank-slate experiments that didn't work out, and I do lean towards "they basically don't work". But I... also bet not that many serious attempts actually happened, and that the people attempting them kinda sucked in obvious ways, and that you could do a lot better than however "well" the Soviets did.
4Ben Pace3d
It's sections like this that show me how many levels above me Eliezer is. When I read Scott's question I thought "I can see that these two algorithms are quite different but I don't have a good answer for how they're different", and then Eliezer not only had an answer, but a fully fleshed out mechanistic model of the crucial differences between the two that he could immediately explain clearly, succinctly, and persuasively, in 6 paragraphs. And he only spent 4 minutes writing it.
5Thomas Kwa2d
FWIW this was basically cached for me, and if I were better at writing and had explained this ~10 times before like I expect Eliezer has, I'd be able to do about as well. So would Nate Soares or Buck or Quintin Pope (just to pick people in 3 different areas of alignment), and Quintin would also have substantive disagreements.

Fair enough. Nonetheless, I have had this experience many times with Eliezer, including when dialoguing with people with much more domain-experience than Scott.

14Rob Bensinger3d
FYI, the timestamp is for the first Discord message. If the log broke out timestamps for every part of the message, it would look like this:
3Ben Pace1d
That makes more sense.

In this note I will discuss some computations and observations that I have seen in other posts about "basin broadness/flatness". I am mostly working off the content of the posts Information Loss --> Basin flatness and Basin broadness depends on the size and number of orthogonal features. I will attempt to give one rigorous and unified narrative for core mathematical parts of these posts and I will also attempt to explain my reservations about some aspects of these approaches. This post started out as a series of comments that I had already made on the posts, but I felt it may be worthwhile for me to spell out my position and give my own explanations. 

Work completed while author was a SERI MATS scholar under the mentorship of...

I think there should be a space both for in-progress research dumps and for more worked out final research reports on the forum. Maybe it would make sense to have separate categories for them or so.
