All of janus's Comments + Replies

Thanks a lot for this comment. These are extremely valid concerns that we've been thinking about a lot.

I'd just like the designers of alignment-research boosting tools to have clear arguments that nothing of this sort is likely.

I don't think this is feasible given our current understanding of epistemology in general and epistemology of alignment research in particular. The problems you listed are potential problems with any methodology, not just AI assisted research. Being able to look at a proposed method and make clear arguments that it's unlikely to hav... (read more)

Thanks for suggesting "Speculations concerning the first ultraintelligent machine". I knew about it only from the intelligence explosion quote and didn't realize it said so much about probabilistic language modeling. It's indeed ahead of its time and exactly the kind of thing I was looking for but couldn't find w/r/t premonitions of AGI via SSL and/or neural language modeling.

I'm sure there's a lot of relevant work throughout the ages (saw this tweet today: "any idea in machine learning must be invented three times, once in signal processing, once in physi... (read more)

I apologize. After seeing this post, A-- approached me and said almost word for word your initial comment. Seeing as the topic of whether in-context learning counts as learning isn't even very related to the post, and this being your first comment on the site, I was pretty suspicious. But it seems it was just a coincidence.

If physics was deterministic, we'd do the same thing every time if you started with the same state. Does that mean we're not intelligent? Presumably not, because in this case the cause of the intelligent behavior clearly lives in the sta... (read more)

This is a brilliant analogy. How did you think of it? (I'm trying to build a model of how good ideas in alignment research are generated)

Some immediate thoughts: How analogous are the enforcement mechanisms for "entropy must increase" vs "structures must improve at the training objective"? Re Leo's comment that gradient descent is really good at credit assignment: is there a sense in which the second law of thermodynamics is worse at credit assignment than gradient descent, making it easier to hack?

I don't trust my memory to be very reliable here, but here's the path of adjacent ideas which I remember. I was thinking about a CIRL-style setup. At a high level, the AI receives some messages, it has a prior that the messages were chosen by an agent (i.e. a human) to optimize for some objective, and then the AI uses that info to back out the objective. And I was thinking about how to reconcile this with embeddedness - e.g. if the "agent" is a human, the AI could model it as a system of atoms, and then how does it assign an "objective" to that system of atoms? It might think the system is optimizing for physical action or physical entropy - after all, the system's messages definitely locally maximize those things! Or maybe the AI ends up identifying the entire process of evolution as an "agent", and thinks the messages are chosen (by an imperfect evolutionary optimizer) to maximize fitness. So there's this problem where we somehow need to tell the AI which level of abstraction to use for thinking of the physical system as an "agent", because it can recognize different optimization objectives at different levels. That was the first time I remember thinking of entropy maximization as sort-of-like an outer optimization objective. And I was already thinking about things like bacteria as agents (even before thinking about alignment), so naturally the idea carried back over to that setting: to separate objective-of-bacteria from objective-of-entropy-maximization or objective-of-evolution or whatever, we need to talk about levels of abstraction and different abstract models of the same underlying system. After that, I connected the idea to other places. For instance, when thinking about inner misalignment, there's an intuition that embedded inner agents are selected to actively optimize against the outer objective in some sense, because performance-on-the-outer-objective i

Also see this comment thread for discussion of true names and the inadequacy of "simulator"

That's correct.

Even if it did learn microscopic physics, the knowledge wouldn't be of use for most text predictions because the input doesn't specify/determine microscopic state information. It is forced by the partially observed state to simulate at a higher level of abstraction than microphysics -- it must treat the input as probabilistic evidence for unobserved variables that affect time evolution.
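To make "treat the input as probabilistic evidence for unobserved variables" concrete, here is a minimal sketch of prediction under partial observation. Everything in it is a toy assumption for illustration (the hidden variable, the numbers, and the function names `posterior` and `predictive` are all hypothetical); it says nothing about how GPT represents this internally.

```python
from typing import Dict

def posterior(prior: Dict[str, float],
              likelihood: Dict[str, Dict[str, float]],
              observation: str) -> Dict[str, float]:
    """Bayes' rule over a hidden variable, given one observed piece of text.

    Assumes at least one hidden state gives the observation nonzero probability.
    """
    unnormalized = {h: prior[h] * likelihood[h].get(observation, 0.0) for h in prior}
    total = sum(unnormalized.values())
    return {h: v / total for h, v in unnormalized.items()}

def predictive(post: Dict[str, float],
               next_token_given_hidden: Dict[str, Dict[str, float]]) -> Dict[str, float]:
    """Next-token distribution obtained by averaging over the hidden variable."""
    out: Dict[str, float] = {}
    for hidden, p_hidden in post.items():
        for token, p_token in next_token_given_hidden[hidden].items():
            out[token] = out.get(token, 0.0) + p_hidden * p_token
    return out

# Toy setup (hypothetical numbers): the text never states the writer's mood,
# but the observed word "sigh" is evidence about it.
prior = {"optimist": 0.5, "pessimist": 0.5}
likelihood = {"optimist": {"sigh": 0.2}, "pessimist": {"sigh": 0.8}}
next_token = {
    "optimist": {"full": 0.9, "empty": 0.1},
    "pessimist": {"full": 0.2, "empty": 0.8},
}

post = posterior(prior, likelihood, "sigh")
prediction = predictive(post, next_token)
print(post, prediction)
```

The predictor never observes the hidden state directly; the best it can do is weight each possible continuation by how plausible each hidden state is given the text so far.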

See this comment for slightly more elaboration.

I strongly agree with everything you've said.

It is an age-old duality with many names and the true name is something like their intersection, or perhaps their union. I think it's unnamed, but we might be able to see it more clearly by walking around it in words.

Simulator and simulacra personifies the simulacra and alludes to a base reality that the simulation is of.

Alternatively, we could say simulator and simulations, which personifies simulations less and refers to the totality or container of that which is simulated. I tend to use "simulations" and "... (read more)

2 | Vladimir Nesov | 17d
One thing conspicuously missing in the post is a way of improving fidelity of simulation without changing external training data, or relationship between the model and the external training data, which I think follows from self-supervised learning on summaries of dreams. There are many concepts of evaluation/summarization of text, so given a text it's possible to formulate tuples (text, summary1, summary2, ...) and do self-supervised learning on that, not just on text (evaluations/summaries are also texts, not just one-dimensional metrics). For proofs, summaries could judge their validity and relevance to some question or method, for games the fact of winning and of following certain rules (which is essentially enough to win games, but also play at a given level of skill, if that is in the summary). More generally, for informal text we could try to evaluate clarity of argument, correctness, honesty, being fictional, identities/descriptions of simulacra/objects in the dream, etc. Which GPT-3 has enough structure to ask for informally. Learning on such evaluated/summarized dreams should improve ability to dream in a way that admits a given asked-for summary, ideally without changing the relationship between the model and the external training data. The improvement is from gaining experience with dreams of certain kind, from the model more closely anticipating the summaries of dreams of that kind, not from changing the way a simulator dreams in a systematic direction. But if the summaries are about a level of optimality of a dream in some respect, then learning on augmentation of dreams with such summaries can be used for optimization, by conditioning on the summaries. (This post describes something along these lines.) And a simulacrum of a human being with sufficient fidelity goes most of the way to AGI alignment.
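A rough sketch of what the (text, summary1, summary2, ...) tuples could look like as training strings, and how conditioning on a summary might then work. This is only one possible reading of the proposal: the separator, the function names, and the choice to also emit a summaries-first ordering (so that an autoregressive model can be prompted with the summaries) are all assumptions of the sketch, not something the comment specifies.

```python
from typing import Sequence

SEP = "\n---\n"  # hypothetical delimiter between a text and its summaries

def augment(text: str, summaries: Sequence[str], summaries_first: bool = False) -> str:
    """Serialize a (text, summary1, summary2, ...) tuple into one training string.

    With summaries_first=False the model learns to anticipate the summaries of a
    text; with summaries_first=True it learns to produce a text that admits them.
    """
    parts = [*summaries, text] if summaries_first else [text, *summaries]
    return SEP.join(parts)

def conditioning_prompt(summaries: Sequence[str]) -> str:
    """Prefix that asks for a text matching the given summaries (assumed format)."""
    return SEP.join([*summaries, ""])

# Toy example with made-up content.
dream = "Proof sketch: assume n is even, so n = 2k ..."
evaluations = ["validity: correct", "relevance: answers the question"]

print(augment(dream, evaluations))
print(augment(dream, evaluations, summaries_first=True))
print(repr(conditioning_prompt(evaluations)))
```

The training data itself is unchanged external text plus model-generated evaluations; only the serialization decides whether the summaries are something to predict or something to condition on.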

haha, I just saw that you literally wrote "speculative simulation" in your other comment, great!

I like this!

One thing I like about "simulators"/"simulacra" over "speculators"/"speculations" is that the former personifies simulacra over the simulator (suggests agency/personality/etc belong to simulacra) which I think is less misleading, or at least counterbalances the tendency people have to personify "GPT".

"Speculator" sounds active and agentic whereas "speculations" sounds passive and static. I think these names do not emphasize enough the role of the speculations themselves in programming the "speculator" as it creates further speculations.

You're... (read more)

Thank you for taking the time to consider this! I agree with the criticism of spec* in your third paragraph (though if I'm honest I think it largely applies to sim* too). I can weakly argue that irl we do say "speculating further" and similar... but really I think your complaint about a misleading suggestion of agency allocation is correct. I wrestled with this before submitting the comment, but one of the things that led me to go ahead and post it was trying it on in the context of your paragraph that begins "I think that implicit type-confusion is common..." In your autoregressive loop, I can picture each iteration more easily as asking for a next, incrementally more informed speculation than anything that's clear to me in simulator/simulacrum terms, especially since with each step GPT might seem to be giving its prior simulacrum another turn of the crank, replacing it with a new one, switching to oracle mode, or going off on an uninterpretable flight of fancy. But, of course, the reason spec* fits more easily (imho) is that it's so very non-committal - maybe too non-committal to be of any use. The "fluid, schizophrenic way that agency arises in GPT’s behavior", as you so beautifully put it, has to be the crux. What is it that GPT does at each iteration, as it implicitly constructs state while predicting again? The special thing about GPT is specifically having a bunch of knowledge that lets it make language predictions in such a way that higher-order phenomena like agency systematically emerge over the reductive physics/automaton (analogic) base. I guess I feel both sim* and spec* walk around that special thing without really touching it. (Am I missing something about sim* that makes contact?) Looking at it this way emphasizes the degree to which the special thing is not only in GPT, but also in the accumulated cognitive product of the human species to date, as proxied by the sequenced and structured text on the internet. Somehow the AI ghosts that flow throu

Thank you for this lovely comment. I'm pleasantly surprised that people were able to get so much out of it.

As I wrote in the post, I wasn't sure if I'd ever get around to publishing the rest of the sequence, but the reception so far has caused me to bump up the priority of that.

Thanks for the correction. I'll read the paper more closely and correct the post.

If GPT means "transformers trained on next-token prediction", then GPT's true name is just that.

Things are instances of more than one true name because types are hierarchical.

GPT is a thing. GPT is an AI (a type of thing). GPT is also an ML model (a type of AI). GPT is also a simulator (a type of ML model). GPT is a generative pretrained transformer (a type of simulator). GPT-3 is a generative pretrained transformer with 175B parameters trained on a particular dataset (a type/instance of GPT).

The intention is not to rename GPT -> simulator. Things tha... (read more)

[apologies on slowness - I got distracted] Granted on type hierarchy. However, I don't think all instances of GPT need to look like they inherit from the same superclass. Perhaps there's such a superclass, but we shouldn't assume it. I think most of my worry comes down to potential reasoning along the lines of:
  * GPT is a simulator;
  * Simulators have property p;
  * Therefore GPT has property p;
When what I think is justified is:
  * GPT instances are usually usefully thought of as simulators;
  * Simulators have property p;
  * We should suspect that a given instance of GPT will have property p, and confirm/falsify this;
I don't claim you're advocating the former: I'm claiming that people are likely to use the former if "GPT is a simulator" is something they believe. (this is what I mean by motte-and-baileying into trouble) If you don't mean to imply anything mechanistic by "simulator", then I may have misunderstood you - but at that point "GPT is a simulator" doesn't seem to get us very far. I think this is the fundamental issue. Deceptive alignment aside, what else qualifies as "an important aspect of its nature"? Which aspects disqualify a model as a simulator? Which aspects count as inner misalignment? To be clear on [x is a simulator (up to inner misalignment)], I need to know:
  1. What is implied mechanistically (if anything) by "x is a simulator".
  2. What is ruled out by "(up to inner misalignment)".
I'd be wary of assuming there's any neat flawed-simulator/pretend-simulator distinction to be discovered. (but probably you don't mean to imply this?) I'm all for deconfusion, but it's possible there's no joint at which to carve here. (my guess would be that we're sometimes confused by the hidden assumption: [a priori unlikely systematically misleading situation => intent to mislead] whereas we should be thinking more like [a priori unlikely systematically misleading situation => selection pressure towards things that mislead us] I.e. looking for d

ah but if 'this program' is a simulacrum (an automaton equipped with an evolving state (prompt) & transition function (GPT), and an RNG that samples tokens from GPT's output to update the state), it is a learning machine by all functional definitions. Weights and activations both encode knowledge.
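The automaton described above (state = prompt, transition function = GPT, plus an RNG that samples a token and appends it to the state) can be sketched in a few lines. The `toy_model` below is a hypothetical lookup table standing in for GPT, and all names are made up for illustration; the point is only the shape of the loop.

```python
import random
from typing import Callable, Dict, List

# Hypothetical stand-in for GPT: maps the current state to a next-token distribution.
Model = Callable[[List[str]], Dict[str, float]]

def toy_model(state: List[str]) -> Dict[str, float]:
    """Fixed 'weights': here the distribution depends only on the last token."""
    table = {
        "the": {"cat": 0.5, "dog": 0.5},
        "cat": {"sat": 0.7, "ran": 0.3},
        "dog": {"sat": 0.4, "ran": 0.6},
    }
    return table.get(state[-1], {"the": 1.0})

def step(state: List[str], model: Model, rng: random.Random) -> List[str]:
    """One transition: query the model, sample a token, append it to the state."""
    dist = model(state)
    tokens = list(dist)
    weights = [dist[t] for t in tokens]
    next_token = rng.choices(tokens, weights=weights, k=1)[0]
    return state + [next_token]

def rollout(prompt: List[str], model: Model, rng: random.Random, n_steps: int) -> List[str]:
    """Run the automaton: the evolving state is the only thing that changes."""
    state = list(prompt)
    for _ in range(n_steps):
        state = step(state, model, rng)
    return state

print(rollout(["the"], toy_model, random.Random(0), n_steps=3))
```

The model itself never changes during the rollout; whatever is "learned" along the way lives in the growing state that gets fed back in.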

am I right to suspect that your real name starts with "A" and you created an alt just to post this comment? XD

4 | Ramana Kumar | 16d
I think Dan's point is good: that the weights don't change, and the activations are reset between runs, so the same input (including rng) always produces the same output. I agree with you that the weights and activations encode knowledge, but Dan's point is still a limit on learning. I think there are two options for where learning may be happening under these conditions:
  * During the forward pass. Even though the function always produces the same output for a given input, the computation of that output involves some learning.
  * Using the environment as memory. Think of the neural network function as a choose-your-own-adventure book that includes responses to many possible situations depending on which prompt is selected next by the environment (which itself depends on the last output from the function). Learning occurs in the selection of which paths are actually traversed.
These can occur together. E.g., the "same character" as was invoked by prompt 1 may be invoked by prompt 2, but they now have more knowledge (some of which was latent in the weights, some of which came in directly via prompt 2; but all of which was triggered by prompt 2).
Nope. My real name is Daniel. After training is done and the program is in use, the activations aren't retaining anything after each task is done. Nor are the weights changed. You can have such a program that is always in training, but my understanding is that GPT is not. So, excluding the random number component, the same set of inputs would always produce the same set of outputs for a given version of GPT with identical settings. It can't recall what you asked of it, time before last, for example. Imagine if you left a bunch of written instructions and then died. Someone following those instructions perfectly always does exactly the same thing in exactly the same circumstance, like GPT would without the random number generator component, and with the same settings each time. It can't learn anything new and retain it during the next task. A hypothetical rogue GPT-like AGI would have to do all its thinking and planning in the training stage, like a person trying to manipulate the world after their own death using a will that has contingencies, e.g. "You get the money only if you get married, son." It wouldn't retain the knowledge that it had succeeded at any goals, either.
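Both halves of this exchange can be shown with a toy stand-in. In the sketch below, `frozen_model` is a hypothetical pure function of (prompt, seed), which is the assumption Daniel describes: fixed weights, nothing retained between calls. The loop at the end illustrates the "environment as memory" option, where the caller keeps the transcript and feeds it back in.

```python
import random

def frozen_model(prompt: str, seed: int) -> str:
    """Hypothetical deployed model: output depends only on (prompt, seed).

    No weights are updated and nothing is stored between calls.
    """
    rng = random.Random(f"{prompt}|{seed}")  # string seeds are deterministic
    return rng.choice(["yes", "no", "maybe"])

# Same input and same seed give the same output, regardless of earlier calls.
first = frozen_model("What did I ask last time?", seed=0)
frozen_model("Some unrelated question", seed=0)
second = frozen_model("What did I ask last time?", seed=0)
print(first == second)

# Environment as memory: the transcript lives outside the model and grows.
transcript = ""
for question in ["first question", "second question"]:
    transcript += f"Q: {question}\n"
    answer = frozen_model(transcript, seed=0)
    transcript += f"A: {answer}\n"
print(transcript)
```

The function is identical on every call; the only thing that accumulates is the state the environment chooses to pass back, which is where any retention across steps has to live under these assumptions.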

I think this is a legitimate problem which we might not be inclined to take as seriously as we should because it sounds absurd.

Would it be a bad idea to recursively ask GPT-n "You're a misaligned agent simulated by a language model (...) if training got really cheap and this process occurred billions of times?

Yes. I think it's likely this would be a very bad idea.

when the corpus of internet text begins to include more text generated only by simulated writers. Does this potentially degrade the ability of future language models to model agents, perform logic

... (read more)

Charlie's quote is an excellent description of an important crux/challenge of getting useful difficult intellectual work out of GPTs.

Despite this, I think it's possible in principle to train a GPT-like model to AGI or to solve problems at least as hard as humans can solve, for a combination of reasons:

  1. I think it's likely that GPTs implicitly perform search internally, to some extent, and will be able to perform more sophisticated search with scale.
  2. It seems possible that a sufficiently powerful GPT trained on a massive corpus of human (medical + other) k
... (read more)
2 | Charlie Steiner | 17d
I also responded to Capybasilisk below, but I want to chime in here and use your own post against you, contra point 2 :P It's not so easy to get "latent knowledge" out of a simulator - it's the simulands who have the knowledge, and they have to be somehow specified before you can step forward the simulation of them. When you get a text model to output a cure for Alzheimer's in one step, without playing out the text of some chain of thought, it's still simulating something to produce that output, and that something might be an optimization process that is going to find lots of unexpected and dangerous solutions to questions you might ask it. Figuring out the alignment properties of simulated entities running in the "text laws of physics" seems like a challenge. Not an insurmountable challenge, maybe, and I'm curious about your current and future thoughts, but the sort of thing I want to see progress in before I put too much trust in attempts to use simulators to do superhuman abstraction-building.

Figuring out and posting about how RLHF and other methods ([online] decision transformer, IDA, rejection sampling, etc) modify the nature of simulators is very high priority. There's an ongoing research project at Conjecture specifically about this, which is the main reason I didn't emphasize it as a future topic in this sequence. Hopefully we'll put out a post about our preliminary theoretical and empirical findings soon. 

Some interesting threads:

RL with KL penalties better seen as Bayesian inference shows that the optimal policy when you hit a GPT w... (read more)

Our plan to accelerate alignment does not preclude theoretical thinking, but rather requires it. The mainline agenda atm is not full automation (which I expect to be both more dangerous and less useful in the short term), but what I've been calling "cyborgism": I want to maximize the bandwidth between human alignment researchers and AI tools/oracles/assistants/simulations. It is essential that these tools are developed by (or in a tight feedback loop with) actual alignment researchers doing theory work, because we want to simulate and play with thought pro... (read more)

What are your thoughts on failure modes with this approach? (please let me know if any/all of the following seems confused/vanishingly unlikely) For example, one of the first that occurs to me is that such cyborgism is unlikely to amplify production of useful-looking alignment ideas uniformly in all directions. Suppose that it makes things 10x faster in various directions that look promising, but don't lead to solutions, but only 2x faster in directions that do lead to solutions. In principle this should be very helpful: we can allocate fewer resources to the 10x directions, leaving us more time to work on the 2x directions, and everybody wins. In practice, I'd expect the 10x boost to:
  1. Produce unhelpful incentives for alignment researchers: work on any of the 10x directions and you'll look hugely more productive. Who will choose to work on the harder directions?
    1. Note that it won't be obvious you're going slowly because the direction is inherently harder: from the outside, heading in a difficult direction will be hard to distinguish from being ineffective (from the inside too, in fact).
    2. Same reasoning applies at every level of granularity: sub-direction choice, sub-sub-direction choice....
  2. Warp our perception of promising directions: once the 10x directions seem to be producing progress much faster, it'll be difficult not to interpret this as evidence they're more promising.
    1. Amplified assessment-of-promise seems likely to correlate unhelpfully: failing to help us notice promising directions precisely where it's least able to help us make progress.
It still seems positive-in-expectation if the boost of cyborgism isn't negatively correlated with the ground-truth usefulness of a direction - but a negative correlation here seems plausible. Suppose that finding the truly useful directions requires patterns of thought that are rare-to-non-existent in the training

This is the best post about language models I've read in a long time. It's clear how much you have used LMs and grokked the peculiar way they operate. You've touched on many important points which I've wanted to write about, or have written about but with less eloquence. Also, I'm glad you liked my blog :)

I definitely belong to your “enthusiasts” camp, and I agree your fourth point (loss scaling makes models "smarter" fast enough to matter) is a crux. I won't fully defend that here, but I'll do my own brain dump and share some of the thoughts that came up w... (read more)

I'm glad you liked the post!  And, given that you are an avowed "enthusiast," I'm pleasantly surprised that we agree about as many things as we do.

The second [source of discontinuous performance scaling] is that many tasks happen over multiple inferential steps where small improvements in single step accuracy translate into large changes in multistep capabilities.
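A quick worked version of that argument, under a deliberately crude assumption that is mine and not the post's: every step must succeed, and steps fail independently with the same probability. Then an n-step task succeeds with probability p to the power n.

```python
def chain_success(per_step_accuracy: float, n_steps: int) -> float:
    """Probability that all n steps succeed, assuming independent, identical steps."""
    return per_step_accuracy ** n_steps

# Small changes in single-step accuracy, large changes over 100 steps.
for p in (0.90, 0.99, 0.999):
    print(f"per-step {p}: 100-step success {chain_success(p, 100):.4f}")
```

Going from 0.99 to 0.999 per step moves 100-step success from roughly 0.37 to roughly 0.90, which is the kind of jump that can look discontinuous on a multistep benchmark. Real tasks violate the independence assumption (errors can be recoverable, or correlated), so this is only the simplest version of the point.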

Thanks for pointing out this argument -- I hadn't thought about it before.  A few thoughts:

Ordinary text generation is also a multi-step process.  (The token length generally isn't ... (read more)