[Simulators seminar sequence] #1 Background & shared assumptions

Jan; Charlie Steiner; Logan Riggs; janus; jacquesthibs; metasemi; Michael Oesterle; Lucas Teixeira; peligrietzer; remember

Meta: Over the past few months, we've held a seminar series on the Simulators theory by janus. As the theory is actively under development, the purpose of the series is to discover central structures and open problems. Our aim with this sequence is to share some of our discussions with a broader audience and to encourage new research on the questions we uncover. Below, we outline the broader rationale and shared assumptions of the participants of the seminar.

Shared assumptions

Going into the seminar series, we determined a number of assumptions that we share. The degree to which each participant subscribes to each assumption varies, but we agreed to postpone discussions on these topics to have a maximally productive seminar. This restriction does not apply to the reader of this post, so please feel free to question our assumptions.

1. Aligning AI is a crucial task that needs to be addressed as AI systems rapidly become more capable.
   (Probably a rather uncontroversial assumption for readers of this Forum, but worth stating explicitly.)
2. The core part of the alignment problem involves "deconfusion research."
   We do not work on deconfusion for the sake of deconfusion but in order to engineer concepts, identify unknown unknowns, and transition from philosophy to mathematics to algorithms to implementation.
3. The problem is complex because we have to reason about something that doesn't yet exist.
   AGI is going to be fundamentally different from anything we have ever known and will thus present us with challenges that are very hard to predict. We might only have a very narrow window of opportunity to perform critical actions and might not get the chance to iterate on a solution.
4. However, this does not mean that we should ignore evidence as it emerges.
   It is essential to carefully consider the GPT paradigm as it is being developed and implemented. At this point, it appears to us more plausible than not that GPT will be a core component of AGI.
5. One feasible-seeming approach is "accelerating alignment," which involves leveraging AI as it is developed to help solve the challenging problems of alignment.
   This is not a novel idea, as it's related to previously suggested concepts such as seed AI, nanny AI, and iterated amplification and distillation (IDA).

Simulators refresher

Going into the seminar series, we had all read the original Simulators post by janus. We recommend reading the post in the original but provide a condensed summary as a refresher below.

A fruitful way to think about GPT is:
GPT is a simulator (i.e. a model trained with predictive loss on a self-supervised dataset)^[1]
The entities simulated by GPT are simulacra (agentic or non-agentic; different objective than the simulator)
The simulator terminology has appropriate connotations
GPT is not (per se) an oracle, genie, agent, …
all GPT “cares about” is simulating/modeling the training distribution
log-loss is a proper scoring rule^[2]

Solving alignment with simulators

While much of this sequence will focus on the details and consequences of simulator theory, we want to clearly state at the outset that we do this work with the goal of contributing to a solution to the alignment problem. In this section, we briefly outline how we might expect simulator theory to concretely contribute to such a solution.

One slightly^[3] naive approach to solving the alignment problem is to use a strong, GPT-like model as a simulator, prompt it with "the solution to the alignment problem is", cross your fingers, and hit enter. The list of ways in which this approach fails is hard to exhaust and includes the model's tendency to hallucinate, its generally weak reasoning ability, and more fundamental issues with steering simulations. Strategies to mitigate some of these problems have been proposed, but (in line with our shared assumptions listed above) we believe that a foundational understanding of simulators is necessary in order to enable/ensure positive outcomes.

Conditional on having a foundational understanding of simulators (supported by theorems and empirical results), we hope to be able to construct a simulation (or a set of simulations) that reliably produces useful alignment research. On a high level, it seems wise to optimize for augmenting human cognition rather than outsourcing cognitive labor or producing good-looking outputs (here, here). We acknowledge the worries of some that constructing a simulator that reliably produces useful alignment research is an alignment-complete problem. However, we believe that the simulator framework provides a few key advantages that are afforded by the "non-agency" of simulators:

simulators are not as exposed to goodharting since they are not optimizing for an objective (other than predicting the data distribution)
simulators (in particular, the simulacra they simulate) are not optimizing for survival by default and thus do not present the same degree of worry about treacherous turns
simulators can be configured to simulate many simulacra in tandem and can thus produce a variety of perspectives on a given problem

We hope to substantiate these claims with more rigor over the course of this sequence (or to at least point in the direction in which other alignment researchers might fruitfully work).

^[1]
While a base GPT model trained with the predictive loss approximates a simulator, this property can get lost with further training. When talking about simulators, we tend to think of base GPT without fine-tuning. Understanding how and to which degree the simulator property interacts with fine-tuning is very high on our list of priorities.
^[2]
A proper scoring rule is optimized by predicting the “true” probabilities of the distribution which generates observations, and thus incentivizes honest probabilistic guesses. Log-loss (such as GPT is trained with) is a proper scoring rule.
This means the model is directly incentivized to match its predictions to the probabilistic transition rule which implicitly governs the training distribution. As a model is made increasingly optimal with respect to this objective, the rollouts that it generates become increasingly statistically indistinguishable from training samples, because they come closer to being described by the same underlying law: closer to a perfect simulation.
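To make the "proper scoring rule" claim concrete, here is a minimal sketch. The three-token distribution `p_true` and the candidate forecasts are made-up illustrative numbers, not taken from any real model. The sketch compares the expected log-loss of several forecasts when outcomes are actually drawn from `p_true`; the excess loss of any forecast over the honest one equals the KL divergence from `p_true` to that forecast, which is non-negative and zero only for the honest forecast.

```python
import math


def expected_log_loss(p, q):
    """Expected log-loss (cross-entropy, in nats) of forecast q when outcomes follow p."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)


def kl_divergence(p, q):
    """KL(p || q) in nats: the penalty for forecasting q when the truth is p."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical "true" next-token distribution over a three-token vocabulary.
p_true = [0.7, 0.2, 0.1]

# Hypothetical forecasts: one honest, two that misreport the probabilities.
candidates = {
    "honest": [0.7, 0.2, 0.1],
    "overconfident": [0.9, 0.05, 0.05],
    "hedging": [1 / 3, 1 / 3, 1 / 3],
}

losses = {name: expected_log_loss(p_true, q) for name, q in candidates.items()}
best = min(losses, key=losses.get)

for name, loss in losses.items():
    excess = loss - losses["honest"]
    kl = kl_divergence(p_true, candidates[name])
    print(f"{name:>13}: loss={loss:.4f}  excess={excess:.4f}  KL={kl:.4f}")
print("lowest expected loss:", best)
```

In this toy setting, the "excess" and "KL" columns agree for every forecast, which is the sense in which log-loss rewards reporting the true probabilities rather than exaggerating or hedging.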
^[3]
The proposal is "naive" in the sense that it is not nearly pessimistic or prepared enough for the worst case.

simulators can be configured to simulate many simulacra in tandem and can thus produce a variety of perspectives on a given problem

It would be nice to have a way of telling that different texts have the same simulacrum acting through them, or concern the same problem. Expected utility arises from coherence of actions by an agent (that's not too updateless), so more general preference is probably characterized by actions coherent in a more general sense. Some aspects of alignment between agents might be about coherence between actions performed by them in separate situations, not necessarily with the agents interacting with each other. Could mutual alignment of different simulacra be measured? In a simulator, talking about this probably requires moving sideways in the text space, finding more examples of a given thing in different texts, sampling from all texts that talk about a given thing.

To be clear, I enjoyed the post and am looking forward to this sequence. A point of disagreement though:

One feasible-seeming approach is "accelerating alignment," which involves leveraging AI as it is developed to help solve the challenging problems of alignment. This is not a novel idea, as it's related to previously suggested concepts such as seed AI, nanny AI, and iterated amplification and distillation (IDA).

I disagree that using AI to accelerate alignment research is particularly load-bearing for, or really necessary to, the development of a practical alignment craft.

I think we should do it to be clear — I have used ChatGPT to aid some of my writing and plan to use it more — but it's to the same extent that we use Google/Wikipedia/Word processors to do research in general. That is, I don't expect AI assistance to be load bearing enough for alignment in general to merit special distinction.

To the extent that one does expect AI to be particularly load-bearing for progress on developing useful alignment craft in particular, I think they're engaging in wishful thinking and snorting too much hopium. That sounds like shying away from the hard problems of alignment. John Wentworth has said that we shouldn't do that:

Far and away the most common failure mode among self-identifying alignment researchers is to look for Clever Ways To Avoid Doing Hard Things (or Clever Reasons To Ignore The Hard Things), rather than just Directly Tackling The Hard Things.

The most common pattern along these lines is to propose outsourcing the Hard Parts to some future AI, and "just" try to align that AI without understanding the Hard Parts of alignment ourselves. ... You can save yourself several years of time and effort by actively trying to identify the Hard Parts and focus on them, rather than avoid them. Otherwise, you'll end up burning several years on ideas which don't actually leave the field better off. That's one of the big problems with trying to circumvent the Hard Parts: when the circumvention inevitably fails, we are still no closer to solving the Hard Parts. (It has been observed both that alignment researchers mostly seem to not be tackling the Hard Parts, and that alignment research mostly doesn't seem to build on itself; I claim that the latter is a result of the former.)

Mostly, I think the hard parts are things like "understand agency in general better" and "understand what's going on inside the magic black boxes". If your response to such things is "sounds hard, man", then you have successfully identified (some of) the Hard Parts.

I don't think this point should be on the list (or at least, I don't think I endorse the position implied by explicitly placing the point on the list).

I won't write a detailed object-level response to this for now, since we're probably going to publish a lot about it soon. I'll just say that my/our experience with the usefulness of GPT has been very different from yours:

I have used ChatGPT to aid some of my writing and plan to use it more — but it's to the same extent that we use Google/Wikipedia/Word processors to do research in general.

I've used GPT-3 extensively, and for me it has been transformative. To the extent that my work has been helpful to you, you're indebted to GPT-3 as well, because "janus" is a cyborg whose ideas crystallized out of hundreds of hours of cybernetic scrying. But then, I used GPT in fairly unusual/custom ways -- high-bandwidth human-in-the-loop workflows iterating deep simulations -- and it took me months to learn to drive and build maps to the fruitful parts of latent space, so I don't expect others to reap the same benefits out of the box, unless the "box" has been optimized to be useful in this dimension (ChatGPT is optimized in a very different dimension).

We also used GPT to summarize seminar meetings and produce posts from the summaries, such as [Simulators seminar sequence] #2 Semiotic physics, where it came up with some of the propositions and proof sketches.

I don't expect AI assistance to be load bearing enough for alignment in general to merit special distinction.

I do. I expect AI to be superhuman at a lot of things quite soon.

It's like this: magic exists now. The amount of magic in the world is increasing, allowing for increasingly powerful spells and artifacts, such as CLONE MIND. This is concerning for obvious reasons. One would hope that the protagonists, whose goal it is to steer this autocatalyzing explosion of psychic energy through the needle of an eye to utopia, will become competent at magic.