Daniel Murfet

## 4. Goals misgeneralize out of distribution.

See: "Goal misgeneralization: why correct specifications aren't enough for correct goals" and "Goal misgeneralization in deep reinforcement learning"

OAA Solution: (4.1) Use formal methods with verifiable proof certificates^{[2]}. Misgeneralization can occur whenever a property (such as goal alignment) has been tested only on a subset of the state space. Out-of-distribution failures of a property can only be ruled out by an argument for a universally quantified statement about that property—but such arguments can in fact be made! See VNN-COMP. In practice, it will not be possible to have enough information about the world to "prove" that a catastrophe will not be caused by an unfortunate coincidence, but instead we can obtain guaranteed probabilistic bounds via stochastic model checking.

Based on the Bold Plan post and this one, my main concern is that I don't believe the model checking is feasible, even in principle. The state space S and action space A of the world model will be too large for techniques along the lines of COOL-MC, which (if I understand correctly) have to first assemble a discrete-time Markov chain by querying the NN and then apply formal verification methods to that chain. I imagine that you are actually thinking of a learned coarse-graining of both S and A, to which one applies something like formal verification.

Assuming that's correct, then there's an inevitable lack of precision on the inputs to the formal verification step. You have to either run the COOL-MC-like process until you hit your time and compute budget and then accept that you're missing state-action pairs, or you coarse-grain to some degree within your budget and accept a dependence on the quality of your coarse-graining. If you're doing an end-run around this tradeoff somehow, could you direct me to where I can read more about the solution?
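To make the tradeoff concrete, here is a minimal sketch of the assemble-then-verify pipeline as I understand it: query a policy in every state of a small finite MDP to build the induced discrete-time Markov chain, then compute the probability of ever reaching a catastrophic state by solving a linear system. This is not COOL-MC's actual interface; the four-state MDP, the `policy` function, and all other names are hypothetical and chosen only for illustration.

```python
import numpy as np

def induced_dtmc(policy, transitions):
    """Build the Markov chain induced by a deterministic policy.

    transitions has shape (n_states, n_actions, n_states); the policy is
    queried once per state, which is the step that scales with |S|.
    """
    n_states = transitions.shape[0]
    P = np.zeros((n_states, n_states))
    for s in range(n_states):
        a = policy(s)              # one query to the (hypothetical) policy
        P[s] = transitions[s, a]   # next-state distribution under that action
    return P

def reach_probability(P, bad, safe):
    """Probability of eventually reaching a `bad` state from each state.

    Assumes `bad` and `safe` states are absorbing and that every other
    state reaches one of them with positive probability, so that the
    linear system below has a unique solution.
    """
    n = P.shape[0]
    unknown = [s for s in range(n) if s not in bad and s not in safe]
    A = np.eye(len(unknown)) - P[np.ix_(unknown, unknown)]
    b = P[np.ix_(unknown, bad)].sum(axis=1)
    probs = np.zeros(n)
    probs[bad] = 1.0
    probs[unknown] = np.linalg.solve(A, b)
    return probs

# Toy MDP (made-up numbers): states 0 and 1 are ordinary,
# state 2 is a safe goal, state 3 is a catastrophe.
# Action 0 is "cautious", action 1 is "risky".
transitions = np.zeros((4, 2, 4))
transitions[0, 0] = [0.0, 1.0, 0.0, 0.0]
transitions[0, 1] = [0.0, 0.0, 0.9, 0.1]
transitions[1, 0] = [0.5, 0.0, 0.5, 0.0]
transitions[1, 1] = [0.0, 0.0, 0.8, 0.2]
transitions[2, :, 2] = 1.0   # absorbing
transitions[3, :, 3] = 1.0   # absorbing

def policy(s):
    # Stand-in for querying a trained network: cautious in state 0, risky elsewhere.
    return 0 if s == 0 else 1

P = induced_dtmc(policy, transitions)
probs = reach_probability(P, bad=[3], safe=[2])
threshold = 0.25
print("P(catastrophe | start in state 0) =", probs[0])
print("bound satisfied:", probs[0] <= threshold)
```

The loop in `induced_dtmc` makes one policy query per state, so the cost of assembling the chain grows with the size of S. Truncating that loop leaves state-action pairs unchecked, and replacing the states with coarse-grained blocks makes the computed bound only as good as the coarse-graining.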

I know there's literature on learned coarse-grainings of S and A in the deep RL setting, but I haven't seen it combined with formal verification. Is there a literature? It seems important.

I'm guessing that this passage in the Bold Plan post contains your answer:

> Defining a sufficiently expressive formal meta-ontology for world-models with multiple scientific explanations at different levels of abstraction (and spatial and temporal granularity) having overlapping domains of validity, with all combinations of {Discrete, Continuous} and {time, state, space}, and using an infra-bayesian notion of epistemic state (specifically, convex compact down-closed subsets of subprobability space) in place of a Bayesian state

In which case I see where you're going, but this seems like the hard part?

Great question, thanks. tl;dr: it depends what you mean by established, but the obstacle to establishing such a thing is probably lower than you think.

To clarify the two types of phase transitions involved here, in the terminology of Chen et al:

- Bayesian phase transition in number of samples: as discussed in the post you link to in Liam's sequence, where the concentration of the Bayesian posterior shifts suddenly from one region of parameter space to another as the number of samples increases past some critical sample size n. There are also Bayesian phase transitions with respect to hyperparameters (such as variations in the true distribution), but those are not what we're talking about here.
- Dynamical phase transitions: the "backwards S-shaped loss curve". I don't believe there is an agreed-upon formal definition of this kind of phase transition in the deep learning literature, but what we mean by it is that the SGD trajectory is for some time strongly influenced by (e.g. stays in the neighbourhood of) a critical point $w^*_\alpha$ and is then strongly influenced by another critical point $w^*_\beta$. In the clearest case there are two plateaus, the one with higher loss corresponding to the label α and the one with lower loss corresponding to β. In larger systems there may not be a clear plateau (e.g. in the case of induction heads that you mention), but it may still be reasonable to think of the trajectory as dominated by the critical points.

The former kind of phase transition is a first-order phase transition in the sense of statistical physics, once you relate the posterior to a Boltzmann distribution. The latter is a notion that belongs more to the theory of dynamical systems or potentially catastrophe theory. The link between these two notions is, as you say, not obvious.

However, Singular Learning Theory (SLT) does provide a link, which we explore in Chen et al. SLT says that the phases of Bayesian learning are also dominated by critical points of the loss, and so you can ask whether a given dynamical phase transition α→β has "standing behind it" a Bayesian phase transition where at some critical sample size the posterior shifts from being concentrated near $w^*_\alpha$ to being concentrated near $w^*_\beta$.

It turns out that, at least for sufficiently large n, the only real obstruction to this Bayesian phase transition existing is that the local learning coefficient near $w^*_\beta$ should be higher than near $w^*_\alpha$. This will be hard to prove theoretically in non-toy systems, but we can estimate the local learning coefficients at the two critical points, compare them, and thereby provide evidence that a Bayesian phase transition exists.
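As a rough illustration of why the ordering of the learning coefficients matters, suppose the local free energy of each phase is approximated to leading order by $n L + \lambda \log n$, where $L$ is the loss at the critical point and $\lambda$ is the local learning coefficient. The posterior prefers the phase with lower free energy, so if β has lower loss but higher $\lambda$ than α, the preference flips from α to β at a critical sample size. The sketch below finds that crossing by bisection. The loss and coefficient values are made up, the function names are hypothetical, and lower-order terms are ignored, so treat it as a cartoon of the argument and not as the computation in Chen et al.

```python
import math

def free_energy(n, loss, llc):
    """Leading-order local free energy: n * loss + llc * log(n)."""
    return n * loss + llc * math.log(n)

def critical_sample_size(loss_a, llc_a, loss_b, llc_b):
    """Sample size at which phase beta overtakes phase alpha, or None.

    Returns the larger root of n * d_loss = d_llc * log(n), which is the
    crossing relevant in the large-n regime where the approximation is
    meant to apply. Returns None if beta does not have both lower loss
    and higher learning coefficient, or if the two curves never cross.
    """
    d_loss = loss_a - loss_b   # positive when beta has lower loss
    d_llc = llc_b - llc_a      # positive when beta is more complex
    if d_loss <= 0 or d_llc <= 0:
        return None
    ratio = d_llc / d_loss
    if ratio <= math.e:
        return None            # beta is preferred at every n in this approximation

    def gap(n):
        # F_alpha(n) - F_beta(n); positive means beta is preferred.
        return n * d_loss - d_llc * math.log(n)

    # gap is negative at n = ratio and increasing beyond it, so bracket and bisect.
    lo, hi = ratio, 2.0 * ratio
    while gap(hi) <= 0:
        hi *= 2.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if gap(mid) > 0:
            hi = mid
        else:
            lo = mid
    return hi

# Made-up values: alpha is simpler with higher loss, beta is more complex with lower loss.
loss_a, llc_a = 0.5, 1.0
loss_b, llc_b = 0.4, 3.0
n_c = critical_sample_size(loss_a, llc_a, loss_b, llc_b)
print("critical sample size (approx):", n_c)
```

In this cartoon, if β had lower loss and a learning coefficient no higher than that of α, β would have lower free energy at every n and there would be no transition to speak of, which is how I read the obstruction described above.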

This has been done in the Toy Model of Superposition in Chen et al, and we're in the process of looking at a range of larger systems including induction heads. We're not ready to share those results yet, but I would point you to Nina Rimsky and Dmitry Vaintrob's nice post on modular addition which I would say provides evidence for a Bayesian phase transition in that setting.

There are some caveats and details that I can go into if you're interested. I would say the existence of Bayesian phase transitions in non-toy neural networks is not established yet, but at this point I think we can be reasonably confident they exist.