Cinera Verinia

Theoretical Computer Science MSc student at the University of [Redacted] in the United Kingdom.

I'm an aspiring alignment theorist; my research vibes are descriptive formal theories of intelligent systems (and their safety properties) with a bias towards constructive theories.

I think it's important that our theories of intelligent systems remain rooted in the characteristics of real-world intelligent systems; we cannot develop adequate theory from the null string as input.



GPTs are not Imitators, nor Simulators, but Predictors.

I think an issue is that GPT is used to mean two things:

  1. A predictive model whose output is a probability distribution over token space, given its prompt and context.
  2. Any particular technique or strategy for sampling from the predictive model to generate responses/completions for a given prompt.

[See the Appendix]


The latter kind of GPT is what I think is rightly called a "Simulator".


From @janus' Simulators (italicised by me):

I use the generic term “simulator” to refer to models trained with predictive loss on a self-supervised dataset, invariant to architecture or data type (natural language, code, pixels, game states, etc). The outer objective of self-supervised learning is Bayes-optimal conditional inference over the prior of the training distribution, which I call the simulation objective, because a conditional model can be used to simulate rollouts which probabilistically obey its learned distribution by iteratively sampling from its posterior (predictions) and updating the condition (prompt). Analogously, a predictive model of physics can be used to compute rollouts of phenomena in simulation. A goal-directed agent which evolves according to physics can be simulated by the physics rule parameterized by an initial state, but the same rule could also propagate agents with different values, or non-agentic phenomena like rocks. This ontological distinction between simulator (rule) and simulacra (phenomena) applies directly to generative models like GPT.


It is precisely because GPT the predictive model exists that sampling from GPT counts as simulation; I don't think there's any real tension in the ontology here.
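A minimal sketch of the two senses, assuming a toy bigram table in place of a real language model. The names `BIGRAM_COUNTS`, `predict`, and `rollout` are hypothetical and chosen only for illustration; nothing here reflects how GPT is actually implemented. `predict` is the predictor (context in, distribution out), and `rollout` is the simulator obtained by repeatedly sampling from the predictor and updating the context.

```python
import random

# Hypothetical toy stand-in for a trained model: bigram counts over a tiny vocabulary.
BIGRAM_COUNTS = {
    "the": {"cat": 3, "dog": 1},
    "cat": {"sat": 2, "ran": 2},
    "dog": {"sat": 1, "ran": 3},
    "sat": {"the": 1},
    "ran": {"the": 1},
}


def predict(context):
    """Predictor: map a context to a probability distribution over next tokens."""
    counts = BIGRAM_COUNTS[context[-1]]  # this toy model conditions on the last token only
    total = sum(counts.values())
    return {token: n / total for token, n in counts.items()}


def rollout(prompt, steps, rng):
    """Simulator: iteratively sample from the predictor and update the context."""
    context = list(prompt)
    for _ in range(steps):
        dist = predict(context)
        tokens = list(dist)
        weights = [dist[t] for t in tokens]
        context.append(rng.choices(tokens, weights=weights, k=1)[0])
    return context


print(predict(["the"]))                       # a distribution, not a completion
print(rollout(["the"], 4, random.Random(0)))  # one sampled trajectory
```

The same `predict` function supports many different rollouts depending on the prompt and the sampling procedure, which is the sense in which the simulator is built on top of the predictor rather than competing with it.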


Credit for highlighting this distinction belongs to @Cleo Nardo:

Remark 2: "GPT" is ambiguous 

We need to establish a clear conceptual distinction between two entities often referred to as "GPT" —

  • The autoregressive language model, which maps a prompt to a distribution over tokens.
  • The dynamic system that emerges from stochastically generating tokens using that model while also deleting the start token.

Don't conflate them! These two entities are distinct and must be treated as such. I've started calling the first entity "Static GPT" and the second entity "Dynamic GPT", but I'm open to alternative naming suggestions. It is crucial to distinguish these two entities clearly in our minds because they differ in two significant ways: capabilities and safety.

  1. Capabilities:
    1. Static GPT has limited capabilities since it consists of a single forward pass through a neural network and is only capable of computing functions that are O(1). In contrast, Dynamic GPT is practically Turing-complete, making it capable of computing a vast range of functions.
  2. Safety:
    1. If mechanistic interpretability is successful, then it might soon render Static GPT entirely predictable, explainable, controllable, and interpretable. However, this would not automatically extend to Dynamic GPT. This is because Static GPT describes the time evolution of Dynamic GPT, but even simple rules can produce highly complex systems. 
    2. In my opinion, Static GPT is unlikely to possess agency, but Dynamic GPT has a higher likelihood of being agentic. An upcoming article will elaborate further on this point.

This remark is the most critical point in this article. While Static GPT and Dynamic GPT may seem similar, they are entirely different beasts.

To summarise:

  • Static GPT: GPT as predictor
  • Dynamic GPT: GPT as simulator

What do you think MIRI is currently doing wrong/what should they change about their approach/general strategy?

To be clear, I enjoyed the post and am looking forward to this sequence. A point of disagreement though:


One feasible-seeming approach is "accelerating alignment," which involves leveraging AI as it is developed to help solve the challenging problems of alignment. This is not a novel idea, as it's related to previously suggested concepts such as seed AI, nanny AI, and iterated amplification and distillation (IDA).

I disagree that using AI to accelerate alignment research is particularly load-bearing for the development of a practical alignment craft, or even really necessary.

To be clear, I think we should do it (I have used ChatGPT to aid some of my writing and plan to use it more), but only to the same extent that we use Google, Wikipedia, or word processors to do research in general. That is, I don't expect AI assistance to be load-bearing enough for alignment in general to merit special distinction.

To the extent that one does expect AI to be particularly load-bearing for progress on developing useful alignment craft, I think they're engaging in wishful thinking and snorting too much hopium. That sounds like shying away from the hard problems of alignment. John Wentworth has said that we shouldn't do that:

Far and away the most common failure mode among self-identifying alignment researchers is to look for Clever Ways To Avoid Doing Hard Things (or Clever Reasons To Ignore The Hard Things), rather than just Directly Tackling The Hard Things.

The most common pattern along these lines is to propose outsourcing the Hard Parts to some future AI, and "just" try to align that AI without understanding the Hard Parts of alignment ourselves. ... You can save yourself several years of time and effort by actively trying to identify the Hard Parts and focus on them, rather than avoid them. Otherwise, you'll end up burning several years on ideas which don't actually leave the field better off. That's one of the big problems with trying to circumvent the Hard Parts: when the circumvention inevitably fails, we are still no closer to solving the Hard Parts. (It has been observed both that alignment researchers mostly seem to not be tackling the Hard Parts, and that alignment research mostly doesn't seem to build on itself; I claim that the latter is a result of the former.)

Mostly, I think the hard parts are things like "understand agency in general better" and "understand what's going on inside the magic black boxes". If your response to such things is "sounds hard, man", then you have successfully identified (some of) the Hard Parts.

I don't think this point should be on the list (or at least, I don't think I endorse the position implied by explicitly placing the point on the list).

I disagree that intelligence and rationality are more fundamental than physics; the territory itself is physics, and that is all that is really there. Everything else (including the body of our knowledge) is a model for navigating that territory.

Turing formalised computation and established the limits of computation given certain assumptions. However, those limits only apply as long as the assumptions hold. Turing did not prove that no mechanical system is superior to a Universal Turing Machine, and weird physics may enable super-Turing computation.

The point I was making is that our models are only as good as their correlation with the territory. The abstract models we have aren't part of the territory itself.