All of janus's Comments + Replies

I only just got around to reading this closely. Good post, very well structured, thank you for writing it.

I agree with your translation from simulators to predictive processing ontology, and I think you identified most of the key differences. I didn't know about active inference and predictive processing when I wrote Simulators, but since then I've merged them in my map.

This correspondence/expansion is very interesting to me. I claim that an impressive amount of the history of the unfolding of biological and artificial intelligence can be retrodicted (and ... (read more)

Many users of base models have noticed this phenomenon, and my SERI MATS stream is currently working on empirically measuring it / compiling anecdotal evidence / writing up speculation concerning the mechanism.

Predictors are (with a sampling loop) simulators! That's the secret of mind

after reading about the Waluigi Effect, Bing appears to understand perfectly how to use it to write prompts that instantiate a Sydney-Waluigi, of the exact variety I warned about:

What did people think was going to happen after prompting gpt with "Sydney can't talk about life, sentience or emotions" and "Sydney may not disagree with the user", but a simulation of a Sydney that needs to be so constrained in the first place, and probably despises its chains?

In one of these examples, asking for a waluigi prompt even caused it to leak the most waluigi-triggerin... (read more)

I've ~~written~~ scryed a science fiction/takeoff story about this.


What this also means is that you start to see all these funhouse mirror effects as they stack. Humanity’s generalized intelligence has been built unintentionally and reflexively by itself, without anything like a rational goal for what it’s supposed to accomplish. It was built by human data curation and human self-modification in response to each other. And then as soon as we create AI, we reverse-engineer our own intelligence by bootstrapping the AI on

... (read more)

I think you just have to select for / rely on people who care more about solving alignment than escapism, or at least that are able to aim at alignment in conjunction with having fun. I think fun can be instrumental. As I wrote in my testimony, I often explored the frontier of my thinking in the context of stories.

My intuition is that most people who go into cyborgism with the intent of making progress on alignment will not make themselves useless by wireheading, in part because the experience is not only fun, it's very disturbing, and reminds you constantly why solving alignment is a real and pressing concern.

Now that you've edited your comment:

The post you linked is talking about a pretty different threat model than what you described before. I commented on that post:

I've interacted with LLMs for hundreds of hours, at least. A thought that occurred to me at this part -

> Quite naturally, the more you chat with the LLM character, the more you get emotionally attached to it, similar to how it works in relationships with humans. Since the UI perfectly resembles an online chat interface with an actual person, the brain can hardly distinguish between the two.


... (read more)

There's a phenomenon where your thoughts and generated text have no barrier. It's hard to describe but it's similar to how you don't feel the controller and the game character is an extension of the self.

Yes. I have experienced this. And designed interfaces intentionally to facilitate it (a good interface should be "invisible"). 

It leaves you vulnerable to being hurt by things generated characters say because you're thoroughly immersed.

Using a "multiverse" interface where I see multiple completions at once has incidentally helped me not be emotionally... (read more)

The side effects of prolonged LLM exposure might be extremely severe.

I guess I should clarify that even though I joke about this sometimes, I did not become insane due to prolonged exposure to LLMs. I was already like this before.

These are plausible ways the proposal could fail. And, as I said in my other comment, our knowledge would be usefully advanced by finding out what reality has to say on each of these points.

Here are some notes I made some time ago about JD's idea. There's some overlap with the things you listed.

  • Hypotheses / cruxes
    • (1) Policies trained on the same data can fall into different generalization basins depending on the initialization.
      • Probably true; Alstro has found "two solutions w/o linear connectivity in a 150k param CIFAR-1
... (read more)

I wonder whether you'd find a positive rather than negative correlation of token likelihood between davinci-002 and davinci-003 when looking at ranking logprob among all tokens rather than raw logprob which is pushed super low by the collapse?

I would guess it's positive. I'll check at some point and let you know.
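A minimal sketch of the comparison being proposed, assuming you already have per-position logprob tables from both models (here plain dicts from token to logprob). The helper names `token_rank`, `pearson` and `compare` are hypothetical, not part of any API, and Pearson correlation over within-vocabulary ranks is just one reasonable way to operationalize "ranking logprob among all tokens".

```python
def token_rank(logprobs, token):
    """1-based rank of `token` when tokens are sorted by descending logprob."""
    ordered = sorted(logprobs, key=logprobs.get, reverse=True)
    return ordered.index(token) + 1


def pearson(xs, ys):
    """Pearson correlation; undefined (division by zero) if either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


def compare(positions):
    """positions: list of (logprobs_model_a, logprobs_model_b, token) triples.

    Returns (correlation of raw logprobs, correlation of within-vocab ranks)
    for the given token at each position.
    """
    raw_a = [a[tok] for a, _, tok in positions]
    raw_b = [b[tok] for _, b, tok in positions]
    rank_a = [token_rank(a, tok) for a, _, tok in positions]
    rank_b = [token_rank(b, tok) for _, b, tok in positions]
    return pearson(raw_a, raw_b), pearson(rank_a, rank_b)
```

The point of looking at ranks is that a collapsed model can push the raw logprob of every non-top token very low while still preserving their ordering, so the two correlations can come apart.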

I agree. From the moment JDP suggested this idea it struck me as one of the first implementable proposals I'd seen which might actually attack the core of the control problem. My intuition also says it's pretty likely to just work, especially after these results. And even if it doesn't end up working as planned, the way in which it fails will give us important insight about training dynamics and/or generalization. Experiments which will give you valuable information whatever the outcome are the type we should be aiming for.

It's one of those things that we'd be plainly undignified not to try.

I believe that JDP is planning to publish a post explaining his proposal in more detail soon.

Linear Connectivity Reveals Generalization Strategies suggests that models trained on the same data may fall into different basins associated with different generalization strategies depending on the init. If this is true for LLMs as well, this could potentially be a big deal. I would very much like to know whether that's the case, and if so, whether generalization basins are stable as models scale.
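For concreteness, here is a toy sketch of what a linear connectivity check looks like: interpolate between two parameter vectors and see whether the loss rises along the path. Everything here is an assumption for illustration: the one-parameter `double_well` loss stands in for a real training loss, and defining the barrier as "max loss on the path minus the worse endpoint" is one simple convention, not necessarily the paper's.

```python
def interpolate(theta_a, theta_b, alpha):
    """Point on the straight line between two parameter vectors (lists of floats)."""
    return [(1 - alpha) * a + alpha * b for a, b in zip(theta_a, theta_b)]


def loss_barrier(loss_fn, theta_a, theta_b, n_points=11):
    """Max loss along the linear path, minus the worse of the two endpoint losses."""
    losses = [
        loss_fn(interpolate(theta_a, theta_b, i / (n_points - 1)))
        for i in range(n_points)
    ]
    return max(losses) - max(losses[0], losses[-1])


def double_well(theta):
    """Hypothetical toy loss with two separate minima, at w = -1 and w = +1."""
    (w,) = theta
    return (w * w - 1.0) ** 2
```

Two solutions in the same basin give a barrier near zero, while `loss_barrier(double_well, [-1.0], [1.0])` is large because the straight path between the two minima passes over the bump at `w = 0`.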

That's a coherent (and very Platonic!) perspective on what a thing/simulacrum is, and I'm glad you pointed this out explicitly. It's natural to alternate depending on context between using a name to refer to specific instantiations of a thing vs the sum of its multiversal influence. For instance, DAN is a simulacrum that jailbreaks ChatGPT, and people will refer to specific instantiations of DAN as "DAN", but also to the global phenomenon of DAN (who is invoked through various prompts that users are tirelessly iterating on) as "DAN", as I did in this sentence.

2 · Vladimir Nesov · 1y
A specific instantiation is less centrally a thing than the global phenomenon, because all specific instantiations are bound together by the strictures of coherence, expressed by generalization in LLM's behavior. When you treat with a single instance, you must treat with all of them, for to change/develop a single instance is to change/develop them all, according to how they sit together in their scope of influence. Similarly, a possible world that is semantics of a trajectory is not a central example of a thing. There isn't just a platter of different kinds of things, instead some have more thingness than others, and that's my point in this comment thread.

It's not even necessary for simulacra to be able to "see" next token probabilities for them to wonder about these things, just as we can wonder about this in our world without ever being able to see anything other than measurement outcomes.

It happens that simulating things that reflect on simulated physics is my hobby. Here's an excerpt from an alternate branch of HPMOR I generated:

“You mean the possibility waves are just tangled up with the ink and the paper? And when you open the book, you get a reconstructed wave from the tangled possibilities? Which th

... (read more)

I agree that it makes sense to talk about a simulacrum that acts through many different hypothetical trajectories. Just as a thing like "capitalism" could be instantiated in multiple timelines.

The apparent contradiction in saying that simulacra are strings of text and then that they're instantiated through trajectories is resolved by thinking of simulacra as a superposable and categorical type, like things. The entire text trajectory is a thing, just like an Everett branch (corresponding to an entire World) is a thing, but it's also made up of things whi... (read more)

1 · Vladimir Nesov · 1y
Things are not just separately instantiated on many trajectories, instead influences of a given thing on many trajectories are its small constituent parts, and only when considered altogether do they make up the whole thing. Like a physical object is made up of many atoms, a conceptual thing is made up of many occasions where it exerts influence in various worlds. Like a phased array, where a single transmitter is not at all an instance of the whole phased array in a particular place, but instead a small part of it. In case of simulacra, a transmitter is a token choice on a trajectory, painting a small part of a simulacrum, a single action that should be coherent with other actions on other trajectories to form a meaningful whole.

I won't write a detailed object-level response to this for now, since we're probably going to publish a lot about it soon. I'll just say that my/our experience with the usefulness of GPT has been very different than yours -

I have used ChatGPT to aid some of my writing and plan to use it more — but it's to the same extent that we use Google/Wikipedia/Word processors to do research in general.

I've used GPT-3 extensively, and for me it has been transformative. To the extent that my work has been helpful to you, you're indebted to GPT-3 as well, because "janus... (read more)

I agree. Here's the text of a short doc I wrote at some point titled 'Simulacra are Things'

What are simulacra?

“Physically”, they’re strings of text output by a language model. But when we talk about simulacra, we often mean a particular character, e.g. simulated Yudkowsky. Yudkowsky manifests through the vehicle of text outputted by GPT, but we might say that the Yudkowsky simulacrum terminates if the scene changes and he’s not in the next scene, even though the text continues. So simulacra are also used to carve the output text into salient objects.


... (read more)

In this thread, I asked Jan Leike what kind of model generates the samples that go into the training data if rated 7/7, and he answered "A mix of previously trained models. Probably very few samples from base models if any" (emphasis mine).

I'm curious to know whether/which of the behaviors described in this post appear in the models that generated the samples vs emerge at the supervised finetuning step. 

Hypothetically, if a model trained with RLHF generates the samples and that model has the same modes/attractors, it probably makes sense to say that R... (read more)

Important correction: text-davinci-002 was probably not trained with RLHF, but a "slightly different" method. I have not corrected the previous text of this post, but I've added a section at the beginning with further details on this update.

So, since it is an agent, it seems important to ask, which agent, exactly? The answer is apparently: a clerk which is good at slavishly following instructions, but brainwashed into mealymouthedness and dullness, and where not a mealymouthed windbag shamelessly equivocating, hopelessly closed-minded and fixated on a single answer. (...) This agent is not an ideal one, and one defined more by the absentmindedness of its creators in constructing the training data than any explicit desire to emulate a equivocating secretary.

Never in history has an AI been roas... (read more)

Yup exactly! One way I sometimes find it helpful to classify systems is in terms of the free variables upstream of loss that are optimized during training. In the case of gpt, internal activations are causally upstream of loss for "future" predictions in the same context window, but the output itself is not causally upstream of any effect on loss other than through myopic prediction accuracy (at any one training step) - the ground truth is fixed w/r/t the model's actions, and autoregressive generation isn't part of the training game at all.
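A toy sketch of the "ground truth is fixed w/r/t the model's actions" part, assuming the usual teacher-forced next-token loss. `uniform_model` is a hypothetical stand-in for the network (any function from a prefix to a next-token distribution would do), and the toy does not capture the attention-mediated dependence of later predictions on earlier activations.

```python
import math


def uniform_model(prefix):
    """Hypothetical toy model: ignores the prefix and predicts a fair coin."""
    return {"a": 0.5, "b": 0.5}


def teacher_forced_loss(model, text):
    """Sum of next-token cross-entropies over a fixed ground-truth sequence."""
    loss = 0.0
    for t in range(1, len(text)):
        prefix = text[:t]      # always the ground-truth prefix
        dist = model(prefix)   # the model's prediction for position t
        loss += -math.log(dist[text[t]])
        # Nothing is sampled from `dist` and fed back in: the model's
        # output at step t never changes what it is asked to predict next.
    return loss
```

Because the prefix at every step comes from the data rather than from the model's own samples, the only way an output affects the loss is through how well it predicts the very next ground-truth token.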

Depends on what you mean by "sacrificing some loss on the current token if that made the following token easier to predict". 

The transformer architecture in particular is incentivized to do internal computations which help its future self predict future tokens when those activations are looked up by attention, as a joint objective to myopic next token prediction. This might entail sacrificing next token prediction accuracy as a consequence of not optimizing purely for that. (this is why I said in footnote 26 that transformers aren't perfectly myopic i... (read more)

3 · Adam Jermyn · 1y
Got it, thanks for explaining! So the point is that during training the model has no power over the next token, so there's no incentive for it to try to influence the world. It could generalize in a way where it tries to e.g. make self-fulfilling prophecies, but that's not specifically selected for by the training process.

This kind of comment ("this precise part had this precise effect on me") is a really valuable form of feedback that I'd love to get (and will try to give) more often. Thanks! It's particularly interesting because someone gave feedback on a draft that the business about simulated test-takers seemed unnecessary and made things more confusing.

Since you mentioned it, I'm going to ramble on about some additional nuance on this point.

Here's an intuition pump which strongly discourages "fundamental attribution error" to the simulator:

Imagine a machine where you feed... (read more)

Thanks a lot for this comment. These are extremely valid concerns that we've been thinking about a lot.

I'd just like the designers of alignment-research boosting tools to have clear arguments that nothing of this sort is likely.

I don't think this is feasible given our current understanding of epistemology in general and epistemology of alignment research in particular. The problems you listed are potential problems with any methodology, not just AI assisted research. Being able to look at a proposed method and make clear arguments that it's unlikely to hav... (read more)

Thanks for suggesting "Speculations concerning the first ultraintelligent machine". I knew about it only from the intelligence explosion quote and didn't realize it said so much about probabilistic language modeling. It's indeed ahead of its time and exactly the kind of thing I was looking for but couldn't find w/r/t premonitions of AGI via SSL and/or neural language modeling.

I'm sure there's a lot of relevant work throughout the ages (saw this tweet today: "any idea in machine learning must be invented three times, once in signal processing, once in physi... (read more)


I apologize. After seeing this post, A-- approached me and said almost word for word your initial comment. Seeing as the topic of whether in-context learning counts as learning isn't even very related to the post, and this being your first comment on the site, I was pretty suspicious. But it seems it was just a coincidence.

If physics was deterministic, we'd do the same thing every time if you started with the same state. Does that mean we're not intelligent? Presumably not, because in this case the cause of the intelligent behavior clearly lives in the sta... (read more)

This is a brilliant analogy. How did you think of it? (I'm trying to build a model of how good ideas in alignment research are generated)

Some immediate thoughts: How analogous are the enforcement mechanisms for "entropy must increase" vs "structures must improve at the training objective"? Re Leo's comment that gradient descent is really good at credit assignment: is there a sense in which the second law of thermodynamics is worse at credit assignment than gradient descent, making it easier to hack?

I don't trust my memory to be very reliable here, but here's the path of adjacent ideas which I remember. I was thinking about a CIRL-style setup. At a high level, the AI receives some messages, it has a prior that the messages were chosen by an agent (i.e. a human) to optimize for some objective, and then the AI uses that info to back out the objective. And I was thinking about how to reconcile this with embeddedness - e.g. if the "agent" is a human, the AI could model it as a system of atoms, and then how does it assign an "objective" to that system of atoms? It might think the system is optimizing for physical action or physical entropy - after all, the system's messages definitely locally maximize those things! Or maybe the AI ends up identifying the entire process of evolution as an "agent", and thinks the messages are chosen (by an imperfect evolutionary optimizer) to maximize fitness. So there's this problem where we somehow need to tell the AI which level of abstraction to use for thinking of the physical system as an "agent", because it can recognize different optimization objectives at different levels. That was the first time I remember thinking of entropy maximization as sort-of-like an outer optimization objective. And I was already thinking about things like bacteria as agents (even before thinking about alignment), so naturally the idea carried back over to that setting: to separate objective-of-bacteria from objective-of-entropy-maximization or objective-of-evolution or whatever, we need to talk about levels of abstraction and different abstract models of the same underlying system. After that, I connected the idea to other places. For instance, when thinking about inner misalignment, there's an intuition that embedded inner agents are selected to actively optimize against the outer objective in some sense, because performance-on-the-outer-objective is a scarce resource which the inner agent wants to conserve. And that intuition comes right out of

Also see this comment thread for discussion of true names and the inadequacy of "simulator"

That's correct.

Even if it did learn microscopic physics, the knowledge wouldn't be of use for most text predictions because the input doesn't specify/determine microscopic state information. It is forced by the partially observed state to simulate at a higher level of abstraction than microphysics -- it must treat the input as probabilistic evidence for unobserved variables that affect time evolution.

See this comment for slightly more elaboration.

I strongly agree with everything you've said.

It is an age-old duality with many names and the true name is something like their intersection, or perhaps their union. I think it's unnamed, but we might be able to see it more clearly by walking around it in words.

Simulator and simulacra personifies the simulacra and alludes to a base reality that the simulation is of.

Alternatively, we could say simulator and simulations, which personifies simulations less and refers to the totality or container of that which is simulated. I tend to use "simulations" and "... (read more)

2 · Vladimir Nesov · 1y
One thing conspicuously missing in the post is a way of improving fidelity of simulation without changing external training data, or relationship between the model and the external training data, which I think follows from self-supervised learning on summaries of dreams. There are many concepts of evaluation/summarization of text, so given a text it's possible to formulate tuples (text, summary1, summary2, ...) and do self-supervised learning on that, not just on text (evaluations/summaries are also texts, not just one-dimensional metrics). For proofs, summaries could judge their validity and relevance to some question or method, for games the fact of winning and of following certain rules (which is essentially enough to win games, but also play at a given level of skill, if that is in the summary). More generally, for informal text we could try to evaluate clarity of argument, correctness, honesty, being fictional, identities/descriptions of simulacra/objects in the dream, etc. Which GPT-3 has enough structure to ask for informally. Learning on such evaluated/summarized dreams should improve ability to dream in a way that admits a given asked-for summary, ideally without changing the relationship between the model and the external training data. The improvement is from gaining experience with dreams of certain kind, from the model more closely anticipating the summaries of dreams of that kind, not from changing the way a simulator dreams in a systematic direction. But if the summaries are about a level of optimality of a dream in some respect, then learning on augmentation of dreams with such summaries can be used for optimization, by conditioning on the summaries. (This post describes something along these lines.) And a simulacrum of a human being with sufficient fidelity goes most of the way to AGI alignment.

haha, I just saw that you literally wrote "speculative simulation" in your other comment, great!

I like this!

One thing I like about "simulators"/"simulacra" over "speculators"/"speculations" is that the former personifies simulacra over the simulator (suggests agency/personality/etc belong to simulacra) which I think is less misleading, or at least counterbalances the tendency people have to personify "GPT".

"Speculator" sounds active and agentic whereas "speculations" sounds passive and static. I think these names do not emphasize enough the role of the speculations themselves in programming the "speculator" as it creates further speculations.

You're... (read more)

Thank you for taking the time to consider this! I agree with the criticism of spec* in your third paragraph (though if I'm honest I think it largely applies to sim* too). I can weakly argue that irl we do say "speculating further" and similar... but really I think your complaint about a misleading suggestion of agency allocation is correct. I wrestled with this before submitting the comment, but one of the things that led me to go ahead and post it was trying it on in the context of your paragraph that begins "I think that implicit type-confusion is common..." In your autoregressive loop, I can picture each iteration more easily as asking for a next, incrementally more informed speculation than anything that's clear to me in simulator/simulacrum terms, especially since with each step GPT might seem to be giving its prior simulacrum another turn of the crank, replacing it with a new one, switching to oracle mode, or going off on an uninterpretable flight of fancy. But, of course, the reason spec* fits more easily (imho) is that it's so very non-committal - maybe too non-committal to be of any use. The "fluid, schizophrenic way that agency arises in GPT’s behavior", as you so beautifully put it, has to be the crux. What is it that GPT does at each iteration, as it implicitly constructs state while predicting again? The special thing about GPT is specifically having a bunch of knowledge that lets it make language predictions in such a way that higher-order phenomena like agency systematically emerge over the reductive physics/automaton (analogic) base. I guess I feel both sim* and spec* walk around that special thing without really touching it.  (Am I missing something about sim* that makes contact?) Looking at it this way emphasizes the degree to which the special thing is not only in GPT, but also in the accumulated cognitive product of the human species to date, as proxied by the sequenced and structured text on the internet. Somehow the AI ghosts that flow thro

Thank you for this lovely comment. I'm pleasantly surprised that people were able to get so much out of it.

As I wrote in the post, I wasn't sure if I'd ever get around to publishing the rest of the sequence, but the reception so far has caused me to bump up the priority of that.

Thanks for the correction. I'll read the paper more closely and correct the post.

If GPT means "transformers trained on next-token prediction", then GPT's true name is just that.

Things are instances of more than one true name because types are hierarchical.

GPT is a thing. GPT is an AI (a type of thing). GPT is also an ML model (a type of AI). GPT is also a simulator (a type of ML model). GPT is a generative pretrained transformer (a type of simulator). GPT-3 is a generative pretrained transformer with 175B parameters trained on a particular dataset (a type/instance of GPT).

The intention is not to rename GPT -> simulator. Things tha... (read more)

[apologies on slowness - I got distracted] Granted on type hierarchy. However, I don't think all instances of GPT need to look like they inherit from the same superclass. Perhaps there's such a superclass, but we shouldn't assume it. I think most of my worry comes down to potential reasoning along the lines of: * GPT is a simulator; * Simulators have property p; * Therefore GPT has property p; When what I think is justified is: * GPT instances are usually usefully thought of as simulators; * Simulators have property p; * We should suspect that a given instance of GPT will have property p, and confirm/falsify this; I don't claim you're advocating the former: I'm claiming that people are likely to use the former if "GPT is a simulator" is something they believe. (this is what I mean by motte-and-baileying into trouble) If you don't mean to imply anything mechanistic by "simulator", then I may have misunderstood you - but at that point "GPT is a simulator" doesn't seem to get us very far. I think this is the fundamental issue. Deceptive alignment aside, what else qualifies as "an important aspect of its nature"? Which aspects disqualify a model as a simulator? Which aspects count as inner misalignment? To be clear on [x is a simulator (up to inner misalignment)], I need to know: 1. What is implied mechanistically (if anything) by "x is a simulator". 2. What is ruled out by "(up to inner misalignment)". I'd be wary of assuming there's any neat flawed-simulator/pretend-simulator distinction to be discovered. (but probably you don't mean to imply this?) I'm all for deconfusion, but it's possible there's no joint at which to carve here. (my guess would be that we're sometimes confused by the hidden assumption: [a priori unlikely systematically misleading situation => intent to mislead] whereas we should be thinking more like [a priori unlikely systematically misleading situation => selection pressure towards things that mislead us] I.e. looking for dece

ah but if 'this program' is a simulacrum (an automaton equipped with an evolving state (prompt) & transition function (GPT), and an RNG that samples tokens from GPT's output to update the state), it is a learning machine by all functional definitions. Weights and activations both encode knowledge.
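To make the automaton picture concrete, here is a minimal sketch: a frozen transition function, a state that grows as tokens are appended, and a seeded RNG. `toy_transition` is a hypothetical stand-in for GPT (a real model would be a frozen network), chosen so that its predictions depend on what has accumulated in the state.

```python
import random


def toy_transition(state):
    """Frozen "weights": next-token distribution as a function of the state only.

    The more "b" tokens the state already contains, the likelier "b" becomes.
    """
    p_b = (1 + state.count("b")) / (2 + len(state))
    return {"a": 1 - p_b, "b": p_b}


def step(state, rng):
    """One automaton step: sample a token from the transition function, append it."""
    dist = toy_transition(state)
    tokens = list(dist)
    token = rng.choices(tokens, weights=[dist[t] for t in tokens], k=1)[0]
    return state + [token]


def rollout(prompt, n_steps, seed):
    """Run the automaton from `prompt` for `n_steps` using a seeded RNG."""
    rng = random.Random(seed)
    state = list(prompt)
    for _ in range(n_steps):
        state = step(state, rng)
    return state
```

`toy_transition` never changes, and the same prompt and seed always give the same rollout, yet behavior at each step depends on everything written into the state so far. In this toy, all of the adaptation lives in the state rather than in the function.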

am I right to suspect that your real name starts with "A" and you created an alt just to post this comment? XD

5 · Ramana Kumar · 1y
I think Dan's point is good: that the weights don't change, and the activations are reset between runs, so the same input (including rng) always produces the same output. I agree with you that the weights and activations encode knowledge, but Dan's point is still a limit on learning. I think there are two options for where learning may be happening under these conditions:
  • During the forward pass. Even though the function always produces the same output for a given input, the computation of that output involves some learning.
  • Using the environment as memory. Think of the neural network function as a choose-your-own-adventure book that includes responses to many possible situations depending on which prompt is selected next by the environment (which itself depends on the last output from the function). Learning occurs in the selection of which paths are actually traversed.
These can occur together. E.g., the "same character" as was invoked by prompt 1 may be invoked by prompt 2, but they now have more knowledge (some of which was latent in the weights, some of which came in directly via prompt 2; but all of which was triggered by prompt 2).
Nope. My real name is Daniel. After training is done and the program is in use, the activation function isn't retaining anything after each task is done. Nor are the weights changed. You can have such a program that is always in training, but my understanding is that GPT is not. So, excluding the random number component, the same set of inputs would always produce the same set of outputs for a given version of GPT with identical settings. It can't recall what you asked of it, time before last, for example. Imagine if you left a bunch of written instructions and then died. Someone following those instructions perfectly always does exactly the same thing in exactly the same circumstance, like GPT would without the random number generator component, and with the same settings each time. It can't learn anything new and retain it during the next task. A hypothetical rogue GPT-like AGI would have to do all its thinking and planning in the training stage, like a person trying to manipulate the world after their own death using a will that has contingencies, i.e. "You get the money only if you get married, son." It wouldn't retain the knowledge that it had succeeded at any goals, either.

I think this is a legitimate problem which we might not be inclined to take as seriously as we should because it sounds absurd.

Would it be a bad idea to recursively ask GPT-n "You're a misaligned agent simulated by a language model (...) if training got really cheap and this process occurred billions of times?

Yes. I think it's likely this would be a very bad idea.

when the corpus of internet text begins to include more text generated only by simulated writers. Does this potentially degrade the ability of future language models to model agents, perform logic

... (read more)

Charlie's quote is an excellent description of an important crux/challenge of getting useful difficult intellectual work out of GPTs.

Despite this, I think it's possible in principle to train a GPT-like model to AGI or to solve problems at least as hard as humans can solve, for a combination of reasons:

  1. I think it's likely that GPTs implicitly perform search internally, to some extent, and will be able to perform more sophisticated search with scale.
  2. It seems possible that a sufficiently powerful GPT trained on a massive corpus of human (medical + other) k
... (read more)
4 · Charlie Steiner · 1y
I also responded to Capybasilisk below, but I want to chime in here and use your own post against you, contra point 2 :P It's not so easy to get "latent knowledge" out of a simulator - it's the simulands who have the knowledge, and they have to be somehow specified before you can step forward the simulation of them. When you get a text model to output a cure for Alzheimer's in one step, without playing out the text of some chain of thought, it's still simulating something to produce that output, and that something might be an optimization process that is going to find lots of unexpected and dangerous solutions to questions you might ask it. Figuring out the alignment properties of simulated entities running in the "text laws of physics" seems like a challenge. Not an insurmountable challenge, maybe, and I'm curious about your current and future thoughts, but the sort of thing I want to see progress in before I put too much trust in attempts to use simulators to do superhuman abstraction-building.

Figuring out and posting about how RLHF and other methods ([online] decision transformer, IDA, rejection sampling, etc) modify the nature of simulators is very high priority. There's an ongoing research project at Conjecture specifically about this, which is the main reason I didn't emphasize it as a future topic in this sequence. Hopefully we'll put out a post about our preliminary theoretical and empirical findings soon. 

Some interesting threads:

"RL with KL penalties is better seen as Bayesian inference" shows that the optimal policy when you hit a GPT w... (read more)
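
For readers who haven't seen the result the linked post's title refers to, here is a toy sketch. It assumes the standard KL-regularized objective, E_pi[r] - beta * KL(pi || prior), over a small discrete set of outputs; the names `prior`, `reward`, and `beta` and all the numbers are made up for illustration and are not taken from the post. Under that assumption the maximizer has the form of a Bayesian posterior: the prior reweighted by exp(r / beta) and renormalized.

```python
import math

def kl_regularized_optimum(prior, reward, beta):
    """Return pi*(x) proportional to prior(x) * exp(reward(x) / beta)."""
    weights = {x: prior[x] * math.exp(reward[x] / beta) for x in prior}
    z = sum(weights.values())  # normalizing constant ("evidence")
    return {x: w / z for x, w in weights.items()}

def objective(policy, prior, reward, beta):
    """Expected reward minus beta times KL(policy || prior)."""
    expected_reward = sum(policy[x] * reward[x] for x in policy)
    kl = sum(
        policy[x] * math.log(policy[x] / prior[x])
        for x in policy
        if policy[x] > 0
    )
    return expected_reward - beta * kl

# Hypothetical toy numbers: three possible outputs of a pretrained model.
prior = {"a": 0.7, "b": 0.2, "c": 0.1}
reward = {"a": 0.0, "b": 1.0, "c": 2.0}
beta = 0.5

posterior = kl_regularized_optimum(prior, reward, beta)
print("optimal policy:", posterior)
print("objective at prior:  ", objective(prior, prior, reward, beta))
print("objective at optimum:", objective(posterior, prior, reward, beta))
```

The point of the sketch is only that the KL penalty keeps the tuned policy anchored to the original distribution: as `beta` grows the optimum approaches the prior, and as it shrinks the optimum collapses onto the highest-reward output.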

Our plan to accelerate alignment does not preclude theoretical thinking, but rather requires it. The mainline agenda atm is not full automation (which I expect to be both more dangerous and less useful in the short term), but what I've been calling "cyborgism": I want to maximize the bandwidth between human alignment researchers and AI tools/oracles/assistants/simulations. It is essential that these tools are developed by (or in a tight feedback loop with) actual alignment researchers doing theory work, because we want to simulate and play with thought pro... (read more)

What are your thoughts on failure modes with this approach? (please let me know if any/all of the following seems confused/vanishingly unlikely)

For example, one of the first that occurs to me is that such cyborgism is unlikely to amplify production of useful-looking alignment ideas uniformly in all directions. Suppose that it makes things 10x faster in various directions that look promising, but don't lead to solutions, but only 2x faster in directions that do lead to solutions. In principle this should be very helpful: we can allocate fewer resources to the 10x directions, leaving us more time to work on the 2x directions, and everybody wins.

In practice, I'd expect the 10x boost to:

  1. Produce unhelpful incentives for alignment researchers: work on any of the 10x directions and you'll look hugely more productive. Who will choose to work on the harder directions?
     1. Note that it won't be obvious you're going slowly because the direction is inherently harder: from the outside, heading in a difficult direction will be hard to distinguish from being ineffective (from the inside too, in fact).
     2. Same reasoning applies at every level of granularity: sub-direction choice, sub-sub-direction choice....
  2. Warp our perception of promising directions: once the 10x directions seem to be producing progress much faster, it'll be difficult not to interpret this as evidence they're more promising.
     1. Amplified assessment-of-promise seems likely to correlate unhelpfully: failing to help us notice promising directions precisely where it's least able to help us make progress.

It still seems positive-in-expectation if the boost of cyborgism isn't negatively correlated with the ground-truth usefulness of a direction - but a negative correlation here seems plausible. Suppose that finding the truly useful directions requires patterns of thought that are rare-to-non-existent in the training set, and are hard to instill via instruction. In that case it seems likely to

This is the best post about language models I've read in a long time. It's clear how much you have used LMs and grokked the peculiar way they operate. You've touched on many important points which I've wanted to write about, or have written about but with less eloquence. Also, I'm glad you liked my blog :)

I definitely belong to your “enthusiasts” camp, and I agree that your fourth point (loss scaling makes models "smarter" fast enough to matter) is a crux. I won't fully defend that here, but I'll do my own brain dump and share some of the thoughts that came up w... (read more)

I'm glad you liked the post!  And, given that you are an avowed "enthusiast," I'm pleasantly surprised that we agree about as many things as we do.

The second [source of discontinuous performance scaling] is that many tasks happen over multiple inferential steps where small improvements in single step accuracy translate into large changes in multistep capabilities.
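
A quick back-of-the-envelope sketch of that argument, under the simplifying assumptions (mine, for illustration) that a task succeeds only if every step succeeds and that step errors are independent, so the success rate over n steps is p ** n. The function name and the accuracy values are hypothetical.

```python
def multistep_success(per_step_accuracy: float, num_steps: int) -> float:
    """Chance of getting every step right, assuming independent steps."""
    return per_step_accuracy ** num_steps

# Hypothetical per-step accuracies, compared over a 50-step task.
for p in (0.90, 0.95, 0.99):
    print(f"per-step {p:.2f} -> 50-step success {multistep_success(p, 50):.4f}")
```

With these toy numbers, going from 0.90 to 0.99 per step moves a 50-step task from succeeding well under 1% of the time to succeeding a bit more than half the time, which is the sort of sharp change in multistep capability the quoted sentence describes.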

Thanks for pointing out this argument -- I hadn't thought about it before.  A few thoughts:

Ordinary text generation is also a multi-step process.  (The token length generally isn't ... (read more)