

Important correction: text-davinci-002 was probably not trained with RLHF, but a "slightly different" method. I have not corrected the previous text of this post, but I've added a section at the beginning with further details on this update.

So, since it is an agent, it seems important to ask, which agent, exactly? The answer is apparently: a clerk which is good at slavishly following instructions, but brainwashed into mealymouthedness and dullness, and where not a mealymouthed windbag shamelessly equivocating, hopelessly closed-minded and fixated on a single answer. (...) This agent is not an ideal one, and one defined more by the absentmindedness of its creators in constructing the training data than any explicit desire to emulate a equivocating secretary.

Never in history has an AI been roasted so hard. Heheheh

Taking that perspective suggests including more conditioning and a more Decision-Transformer-like approach.

+1. And I expect runtime conditioning approaches to become more effective with scale as "meta learning" capacities increase.

Yup exactly! One way I sometimes find it helpful to classify systems is in terms of the free variables upstream of loss that are optimized during training. In the case of GPT, internal activations are causally upstream of loss for "future" predictions in the same context window, but the output itself is not causally upstream of any effect on loss other than through myopic prediction accuracy (at any one training step) - the ground truth is fixed w/r/t the model's actions, and autoregressive generation isn't part of the training game at all.

Depends on what you mean by "sacrificing some loss on the current token if that made the following token easier to predict". 

The transformer architecture in particular is incentivized to do internal computations which help its future self predict future tokens when those activations are looked up by attention, as a joint objective to myopic next token prediction. This might entail sacrificing next token prediction accuracy as a consequence of not optimizing purely for that. (this is why I said in footnote 26 that transformers aren't perfectly myopic in a sense)

But there aren't training incentives for the model to prefer certain predictions because of the consequences if the sampled token were to be inserted into the stream of text, e.g. making subsequent text easier to predict if the rest of the text were to continue as expected given that token is in the sequence, because its predictions have no influence on the ground truth it has to predict during training. (For the same reason there's no direct incentive for GPT to fix behaviors that chain into bad multi-step predictions when it generates text that's fed back into itself, like looping.)
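To make the "ground truth is fixed" point concrete, here is a minimal toy sketch of teacher-forced training loss. The `predict(prefix)` interface and both toy models are hypothetical stand-ins I made up for illustration, not how GPT is actually implemented; the only thing the sketch is meant to show is that the contexts a model is scored on are sliced from the training text and never depend on what the model would have sampled.

```python
import math

def teacher_forced_loss(predict, tokens):
    """Mean next-token cross-entropy under teacher forcing.

    `predict(prefix)` is a hypothetical model interface returning a dict
    of token -> probability. Every prefix is sliced from the training
    text itself; nothing the model would have sampled is ever fed back in.
    """
    seen_prefixes = []
    total = 0.0
    for t in range(1, len(tokens)):
        prefix = tokens[:t]                  # ground truth, fixed in advance
        seen_prefixes.append(tuple(prefix))
        probs = predict(prefix)
        total += -math.log(probs[tokens[t]])
    return total / (len(tokens) - 1), seen_prefixes

def uniform_model(prefix):
    return {"a": 0.5, "b": 0.5}

def a_heavy_model(prefix):
    # If sampled from, this model would mostly emit "a", but that has no
    # effect on which prefixes it is scored on during training.
    return {"a": 0.9, "b": 0.1}

text = ["a", "b", "a", "a"]
loss_uniform, prefixes_uniform = teacher_forced_loss(uniform_model, text)
loss_a_heavy, prefixes_a_heavy = teacher_forced_loss(a_heavy_model, text)

print(loss_uniform, loss_a_heavy)
print(prefixes_uniform == prefixes_a_heavy)  # both models saw identical contexts
```

The two models get different losses, but the list of contexts they were evaluated on is identical. In this toy setup the only lever a model has over its loss is the probability it assigns at each position, which is the sense in which the objective is myopic with respect to outputs.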

Training incentives are just training incentives though, not strict constraints on the model's computation, and our current level of insight gives us no guarantee that models like GPT actually don't/won't care about the causal impact of their decoded predictions to any end, including affecting easiness of future predictions. Maybe there are arguments why we should expect it to develop this kind of mesaobjective over another, but I'm not aware of any convincing ones.

This kind of comment ("this precise part had this precise effect on me") is a really valuable form of feedback that I'd love to get (and will try to give) more often. Thanks! It's particularly interesting because someone gave feedback on a draft that the business about simulated test-takers seemed unnecessary and made things more confusing.

Since you mentioned it, I'm going to ramble on about some additional nuance on this point.

Here's an intuition pump which strongly discourages "fundamental attribution error" to the simulator:

Imagine a machine where you feed in an image and it literally opens a window to a parallel reality with that image as a boundary constraint. You can watch events downstream of the still frame unravel through the viewfinder.

If you observe the people in the parallel universe doing something dumb, the obvious first thought is that you should try a frame into a different situation that's more likely to contain smart people (or even try again, if the frame underdetermines the world and you'll reveal a different "preexisting" situation each time you run the machine).

That's the obvious conclusion in the thought experiment because the machine isn't assigned a mind-like role -- it's just a magical window into a possible world. Presumably, the reason people in a parallel world are dumb or not is located in that world, in the machinery of their brains. "Configuration" and "physics" play the same roles as in our reality.

Now, with intuition pumps it's important to fiddle with the knobs. An important way that GPT is unlike this machine is that it doesn't literally open a window into a parallel universe running on the same physics as us, which requires that minds be implemented as machines in the world state, such as brains. The "state" that it propagates is text, a much coarser grained description than microscopic quantum states or even neurons. This means that when simulacra exhibit cognition, it must be GPT -- time evolution itself -- that's responsible for a large part of the mind-implementation, as there is nowhere near sufficient machinery in the prompt/state. So if a character is stupid, it may very well be a reflection of GPT's weakness at compiling text descriptions into latent algorithms simulating cognition.

But it may also be because of the prompt. Despite its short length the prompt does parameterize an innumerable number of qualitatively distinct simulations, and given GPT's training distribution it's expected for it sometimes to "try" to simulate stupid things.

There's also another way that GPT can fail to simulate smart behavior which I think is not reducible to "pretending to be stupid", which makes the most sense if you think of the prompt as something like an automaton specification which will proceed to evolve according not to a mechanistic physics but GPT's semantic word physics. Some automata-specifications will simply not work very well -- they might get into a loop because they were already a bit repetitive, or fail to activate the relevant knowledge because the style is out-of-distribution and GPT is quite sensitive to form and style, or cause hallucinations and rationalizations instead of effective reasoning because the flow of evidence is backward. But another automaton initialization may glide superbly when animated by GPT physics.

What I've found, not through a priori reasoning but lots of toying, is that the quality of intelligence simulated by GPT-3 in response to "typical" prompts tremendously underestimates its "best case" capabilities. And the trends strongly imply that I haven't found the best case for anything. Give me any task, quantifiable or not, and I am almost certain I can find a prompt that makes GPT-3 do it better after 15 minutes of tinkering, and a better one than that if I had an hour, and a better one than that if I had a day... etc. The problem of finding a good prompt to elicit some capability, especially if it's open-ended or can be attacked in multiple steps, seems similar to the problem of finding the best mental state to initiate a human to do something well -- even if you're only considering mental states which map to some verbal inner monologue, you could search through possible constructs practically indefinitely without expecting you've hit anything near the optimum, because the number of relevant and qualitatively distinct possible mental states is astronomical. It's the same with simulacra configurations.

So one of my motivations for advocating an explicit simulator/simulacra distinction with the analogy to the extreme case of physics (where the configuration is responsible for basically everything) is to make the prompt-contingency of phenomena more intuitive, since I think most people's intuitions are too inclined in the opposite direction of locating responsibility for observed phenomena in GPT itself. But it is important, and I did not sufficiently emphasize in this post, to be aware that the ontological split between "state" and "physics" carves the system differently than in real life, allowing for instance the possibility that simulacra are stupid because GPT is weak.

Thanks a lot for this comment. These are extremely valid concerns that we've been thinking about a lot.

I'd just like the designers of alignment-research boosting tools to have clear arguments that nothing of this sort is likely.

I don't think this is feasible given our current understanding of epistemology in general and epistemology of alignment research in particular. The problems you listed are potential problems with any methodology, not just AI assisted research. Being able to look at a proposed method and make clear arguments that it's unlikely to have any undesirable incentives or negative second order effects, etc, is the holy grail of applied epistemology and one of the cores of the alignment problem.

For now, the best we can do is be aware of these concerns, work to improve our understanding of the underlying epistemological problem, design the tools and methods in a way that avoids problems (or at least make them likely to be noticed) according to our current best understanding, and actively address them in the process.

On a high level, it seems wise to me to follow these principles:

  1. Approach this as an epistemology problem
  2. Optimize for augmenting human cognition rather than outsourcing cognitive labor or producing good-looking outputs
  3. Short feedback loops and high bandwidth (both between human<>AI and tool users<>tool designers)
  4. Avoid incentivizing the AI components to goodhart against human evaluation
  5. Avoid producing/releasing infohazards

All of these are hard problems. I could write many pages about each of them, and hopefully will at some point, but for now I'll only address them briefly in relation to your comment.

1. Approach this as an epistemology problem

We don't know how to evaluate whether a process is going to be robustly truth-seeking (or {whatever you really want}-seeking). Any measure will be a proxy susceptible to goodhart.

one of the first that occurs to me is that such cyborgism is unlikely to amplify production of useful-looking alignment ideas uniformly in all directions.

Suppose that it makes things 10x faster in various directions that look promising, but don't lead to solutions, but only 2x faster in directions that do lead to solutions.

This is a concern for any method, including things like "post your work frequently and get a lot of feedback" or "try to formalize stuff"

Introducing AI into it just makes the problem much more explicit and pressing (because of the removal of the "protected meta level").

I intend to work closely with the Conjecture epistemology/methodologies team in this project. After all, this is kinda the ultimate challenge for epistemology: as the saying goes, you don't understand something until you can build it.

We need to better understand things like:

  • What are the current bottlenecks on human cognition and more specifically alignment research, and can/do these tools actually help remove them? 
    • is thinking about "bottlenecks" the right abstraction? especially if there's a potential to completely transform the workflow, instead of just unblocking what we currently recognize as bottlenecks
  • What do processes that generate good ideas/solutions look like in practice? 
    • What do the examples we have access to tell us about the underlying mechanisms of effective processes? 
    • To what extent are productive processes legible? Can we make them more legible, and what are the costs/benefits of doing so? How do we avoid goodharting against legibility when it's incentivized (AI assisted research is one such situation)?
  • How can you evaluate if an idea is actually good, and doesn't just "look" good?
    • What are the different ways an idea can "look" good and how can each of these be operationalized or fail? (e.g. "feels" meaningful/makes you feel less confused, experts in the field think it's good, LW karma, can be formalized/mathematically verified, can be/is experimentally verified, other processes independently arrive at same idea, inspires more ideas, leads to useful applications, "big if true", etc)
    • How can we avoid applying too much optimization pressure to things that "look" good considering that we ultimately only have access to how things "look" to us (or some externalized measure)?
  • How do asymmetric capabilities affect all this? As you said, AI will amplify cognition more effectively in some ways than others. 
    • Humans already have asymmetric capabilities as well (though it's unclear what "symmetry" would mean...). How does this affect how we currently do research? 
    • How do we leverage asymmetric capabilities without over-relying on them? 
    • How can we tell whether capabilities are intrinsically asymmetric or are just asymmetrically bottlenecked by how we're trying to use them?
  • Dual to the concerns re asymmetrical capabilities: What kind of truth-seeking processes can AI enable which are outside the scope of how humans currently do research due to cognitive limitations?

Being explicitly aware of these considerations is the first step. For instance, with regards to the concern about perception of progress due to "speed":

Warp our perception of promising directions: once the 10x directions seem to be producing progress much faster, it'll be difficult not to interpret this as evidence they're more promising.

Obviously you can write much faster and with superficial fluency with an AI assistant, so we need to adjust our evaluation of output in light of that fact.

2. Optimize for augmenting human cognition rather than outsourcing cognitive labor or producing good-looking outputs

This 2017 article Using Artificial Intelligence to Augment Human Intelligence describes a perspective that I share:

One common conception of computers is that they’re problem-solving machines: “computer, what is the result of firing this artillery shell in such-and-such a wind [and so on]?”; “computer, what will the maximum temperature in Tokyo be in 5 days?”; “computer, what is the best move to take when the Go board is in this position?”; “computer, how should this image be classified?”; and so on.

This is a conception common to both the early view of computers as number-crunchers, and also in much work on AI, both historically and today. It’s a model of a computer as a way of outsourcing cognition. In speculative depictions of possible future AI, this cognitive outsourcing model often shows up in the view of an AI as an oracle, able to solve some large class of problems with better-than-human performance.

But a very different conception of what computers are for is possible, a conception much more congruent with work on intelligence augmentation.


It’s this kind of cognitive transformation model which underlies much of the deepest work on intelligence augmentation. Rather than outsourcing cognition, it’s about changing the operations and representations we use to think; it’s about changing the substrate of thought itself. And so while cognitive outsourcing is important, this cognitive transformation view offers a much more profound model of intelligence augmentation. It’s a view in which computers are a means to change and expand human thought itself.

I think the cognitive transformation approach is more promising from an epistemological standpoint because the point is to give the humans an inside view of the process by weaving the cognitive operations enabled by the AI into the user's thinking, rather than just producing good-seeming artifacts. In other words, we want to amplify the human's generator, not just rely on human evaluation of an external generation process.

This does not solve the goodhart problem (you might feel like the AI is improving your cognition without actually being productive), but it enables a form of "supervision" that is closer to the substrate of cognition and thus gives the human more intimate insight into whether and why things are working or not.

I also expect the cognitive transformation model to be significantly more effective in the near future. But as AIs become more capable it will be more tempting to increase the length of feedback loops & supervise outcomes instead of process. Hopefully building tools and gaining hands-on experience now will give us more leverage to continue using AI as cognitive augmentation rather than just outsourcing cognition once the latter becomes "easier".

It occurs to me that I've just reiterated the argument for process supervision over outcome supervision:

  1. In the short term, process-based ML systems have better differential capabilities: They help us apply ML to tasks where we don’t have access to outcomes. These tasks include long-range forecasting, policy decisions, and theoretical research.
  2. In the long term, process-based ML systems help avoid catastrophic outcomes from systems gaming outcome measures and are thus more aligned.
  3. Both process- and outcome-based evaluation are attractors to varying degrees: Once an architecture is entrenched, it’s hard to move away from it. This lock-in applies much more to outcome-based systems.
  4. Whether the most powerful ML systems will primarily be process-based or outcome-based is up in the air.
  5. So it’s crucial to push toward process-based training now.

A major part of the work here will be designing interfaces which surface the "cognitive primitives" as control levers and make high bandwidth interaction & feedback possible.

Slightly more concretely, GPTs are conditional probability distributions one can control by programming boundary conditions ("prompting"), searching through stochastic ramifications ("curation"), and perhaps also manipulating latents (see this awesome blog post Imagining better interfaces to language models). The probabilistic simulator (or speculator) itself and each of these control methods, I think, have close analogues to how we operate our own minds, and thus I think it's possible with the right interface to "stitch" the model to our minds in a way that acts as a controllable extension of thought. This is a very different approach to "making GPT useful" than, say, InstructGPT, and it's why I call it cyborgism.

3. Short feedback loops and high bandwidth (both between human<>AI and tool users<>tool designers)

Short feedback loops and high bandwidth between the human and AI are integral to the cognitive augmentation perspective: you want as much of the mission-relevant information to be passing through (and understood by) the human user as possible. Not only is this more helpful to the human, it gives them opportunities to notice problems and course-correct at the process level which may not be transparent at all in more oracle or genie-like approaches.

For similar reasons, we want short feedback loops between the users and designers/engineers of the tools (ideally the user designs the tool -- needless to say, I will be among the first of the cyborgs I make). We want to be able to inspect the process on a meta level and notice and address problems like goodhart or mode collapse as soon as possible.

4. Avoid incentivizing the AI components to goodhart against human evaluation

This is obvious but hard to avoid, because we do want to improve the system and human evaluation is the main source of feedback we have. But I think there are concrete ways to avoid the worst here, like being very explicit about where and how much optimization pressure is being applied and avoiding methods which extrapolate proxies of human evaluation with unbounded atomic optimization.

There are various reasons I plan to avoid RLHF (except for purposes of comparison); this is one of them. This is not to say other methods that leverage human feedback are immune to goodhart, but RLHF is particularly risky because you're creating a model(proxy) of human evaluation of outcomes and optimizing against it (the ability to apply unbounded optimization against the reward model is the reason to make one in the first place rather than training against human judgements directly).
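A toy sketch of the failure mode I have in mind, with everything in it made up for illustration: `true_quality` stands in for what humans actually want, and `proxy_reward` for a learned reward model that agrees with it near familiar territory but has an exploitable error further out. This is not a model of any real RLHF setup, just the bare shape of "more optimization against the proxy can make the true outcome worse".

```python
def true_quality(x):
    # Hypothetical "what humans actually want": best at x = 1.
    return -(x - 1.0) ** 2

def proxy_reward(x):
    # Hypothetical learned reward model: tracks true_quality for small x,
    # but carries an error term that grows quickly for large x.
    return true_quality(x) + 0.1 * x ** 3

def optimize(reward, candidates):
    # Pick whichever candidate the given reward function scores highest.
    return max(candidates, key=reward)

# Weak optimization: search only a narrow range, x in [0, 2].
narrow = [i * 0.5 for i in range(5)]
# Strong optimization: search a much wider range, x in [0, 10].
wide = [i * 0.5 for i in range(21)]

x_weak = optimize(proxy_reward, narrow)
x_strong = optimize(proxy_reward, wide)

print(x_weak, true_quality(x_weak))      # 1.0 0.0
print(x_strong, true_quality(x_strong))  # 10.0 -81.0
```

With the narrow search the proxy's favorite is also the truly best point. With the wide search the optimizer finds the region where the proxy's error dominates, and true quality falls off a cliff even though proxy reward went up.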

I'm more interested in approaches that interactively prototype effective processes & use them as supervised examples to augment the model's prior: scouting the space of processes rather than optimizing a fixed measure of what a good outcome looks like. Of course, we must still rely on human judgment to say what a good process is (at various levels of granularity, e.g. curation of AI responses and meta-selection of approaches based on perceived effectiveness), so we still need to be wary of goodhart. But I think avoiding direct optimization pressure toward outcome evaluations can go a long way. Supervise Process, not Outcomes contains more in-depth reasoning on this point.

That said, it's important to emphasize that this is not a proposal to solve alignment, but the much easier (though still hard) problem of shaping an AI system to augment alignment research before foom. I don't expect these methods to scale to aligning a superintelligent AI; I expect conceptual breakthroughs will be necessary for that and iterative approaches alone will fail. The motivation for this project is my belief that AI augmentation can put us in a better position to make those conceptual breakthroughs.

5. Avoid producing/releasing infohazards

I won't say too much about this now, but anything that we identify to present a risk of accelerating capabilities will be covered under Conjecture's infohazard policy.

Thanks for suggesting "Speculations concerning the first ultraintelligent machine". I knew about it only from the intelligence explosion quote and didn't realize it said so much about probabilistic language modeling. It's indeed ahead of its time and exactly the kind of thing I was looking for but couldn't find w/r/t premonitions of AGI via SSL and/or neural language modeling.

I'm sure there's a lot of relevant work throughout the ages (saw this tweet today: "any idea in machine learning must be invented three times, once in signal processing, once in physics and once in the soviet union"), it's just that I'm unsure how to find it. Most people in the AI alignment space I've asked haven't known of any prior work either. So I still think it's true that "the space of large self-supervised models hasn't received enough attention". Whatever scattered prophetic works existed were not sufficiently integrated into the mainstream of AI or AI alignment discourse. The situation was that most of us were terribly unprepared for GPT. Maybe because of our "lack of scholarship".

Of course, after GPT-3 everyone's been talking about large self-supervised models as a path or foundation of AGI. My observations of the lack of foresight on SSL were referring mainly to pre-GPT. & after GPT the ontological inertia of not talking about SSL means post-GPT discourse has been forced into clumsy frames.

I know about "The risks and opportunities of foundation models" - it's a good overview of SSL capabilities and "next steps" but it's still very present-day focused and descriptive rather than speculation in an exploratory engineering vein, which I still feel is missing.

"Foundation models" has hundreds of references. Are there any in particular that you think are relevant?

I apologize. After seeing this post, A-- approached me and said almost word for word your initial comment. Seeing as the topic of whether in-context learning counts as learning isn't even very related to the post, and this being your first comment on the site, I was pretty suspicious. But it seems it was just a coincidence.

If physics was deterministic, we'd do the same thing every time if you started with the same state. Does that mean we're not intelligent? Presumably not, because in this case the cause of the intelligent behavior clearly lives in the state which is highly structured and not the time evolution rule, which seems blind and mechanistic. With GPT, the time evolution rule is clearly responsible for proportionally more, and does have the capacity to deploy intelligent-appearing but static memories. I don't think this means there's no intelligence/learning happening at runtime. Others in this thread have given various reasons, so I'll just respond to a particular part of your comment that I find interesting, about the RNG.

I actually think the RNG is an important component for actualizing simulacra that aren't mere recordings in a will. Stochastic sampling enables symmetry breaking at runtime, the generation of gratuitously specific but still meaningful paths. A stochastic generator can encode only general symmetries that are much less specific than individual generations. If you run GPT on temp 1 for a few words usually the probability of the whole sequence will be astronomically low, but it may still be intricately meaningful, a unique and unrepeatable (w/o the rand seed) "thought".
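A rough numerical sketch of the "astronomically low probability, but repeatable with the seed" point. The uniform 1000-token vocabulary is a crude assumption standing in for a language model at temperature 1 (a real model's distribution depends on context), and `sample_sequence` is a hypothetical helper, not any library's API.

```python
import math
import random

def sample_sequence(probs, length, seed):
    """Sample `length` tokens from a fixed distribution and return the
    tokens together with the log-probability of the whole sequence."""
    rng = random.Random(seed)            # all randomness flows from the seed
    vocab = list(probs)
    weights = [probs[tok] for tok in vocab]
    tokens = rng.choices(vocab, weights=weights, k=length)
    log_prob = sum(math.log(probs[tok]) for tok in tokens)
    return tokens, log_prob

# Toy assumption: 1000 equally likely tokens at every step.
toy_probs = {f"tok{i}": 1 / 1000 for i in range(1000)}

tokens_a, logp_a = sample_sequence(toy_probs, 20, seed=0)
tokens_b, logp_b = sample_sequence(toy_probs, 20, seed=0)

print(logp_a / math.log(10))   # approximately -60, i.e. probability ~1e-60
print(tokens_a == tokens_b)    # True: the same seed replays the same path
```

Even at 20 tokens the specific path has probability around 1e-60, so without the seed you should never expect to see it again, while with the seed it replays exactly.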

This is a brilliant analogy. How did you think of it? (I'm trying to build a model of how good ideas in alignment research are generated)

Some immediate thoughts: How analogous are the enforcement mechanisms for "entropy must increase" vs "structures must improve at the training objective"? Re Leo's comment that gradient descent is really good at credit assignment: is there a sense in which the second law of thermodynamics is worse at credit assignment than gradient descent, making it easier to hack?
