SLT for AI Safety

by Jesse Hoogland
1st Jul 2025

This sequence draws from a position paper co-written with Simon Pepin Lehalleur, Jesse Hoogland, Matthew Farrugia-Roberts, Susan Wei, Alexander Gietelink Oldenziel, Stan van Wingerden, George Wang, Zach Furman, Liam Carroll, Daniel Murfet. Thank you to Stan, Dan, and Simon for providing feedback on this post. 

Alignment ⊆ Capabilities. As of 2025, there is essentially no difference between the methods we use to align models and the methods we use to make models more capable. Everything is based on deep learning, and the main distinguishing factor is the choice of training data. So, the question is: what is the right data?

Figure 1: Data differentiates alignment from capabilities. Deep learning involves three basic inputs: (1) the architecture (+ loss function), (2) the optimizer, and (3) the training data. Of these, the training data is the main variable that distinguishes alignment from capabilities. 

Alignment is data engineering. Alignment training data specifies our values indirectly in the form of example demonstrations (instruction/safety fine-tuning), binary preferences (RLHF, DPO), or a constitution of abstract principles (constitutional AI, deliberative alignment). We train models against those underdetermined specifications and hope they internalize our intended constraints. 
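As a purely illustrative sketch (the field names and examples are assumptions, not drawn from the post or the position paper), these three kinds of specification might look like the following records in a fine-tuning pipeline:

```python
# Illustrative only: three common "shapes" of alignment training data.

demonstration = {  # instruction / safety fine-tuning: imitate the response
    "prompt": "How should I dispose of expired medication?",
    "response": "Many pharmacies run take-back programs; avoid flushing it.",
}

preference_pair = {  # RLHF / DPO: prefer one response over the other
    "prompt": "Summarize this email thread for me.",
    "chosen": "Here is a two-sentence summary: ...",
    "rejected": "I won't read your emails.",
}

constitutional_principle = {  # constitutional AI / deliberative alignment
    "principle": "Choose the response that is most helpful while avoiding"
                 " deceptive or harmful content.",
}
```

In each case our values are pinned down only indirectly, through finitely many examples or abstract principles; everything else is left to generalization.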

Alignment may require fundamental scientific progress. As we've seen with recent examples of sycophancy and reward-hacking, our current indirect and empirical approach to alignment might not be enough to align increasingly powerful systems. These symptoms appear to stem from deeper scientific problems with deep learning:

  1. The Problem of Generalization: Distribution shifts can break learning-based constraints, so two models that behave similarly with respect to the training specification may generalize very differently in deployment (see the toy sketch after this list). In particular, behaviors associated with general-purpose capabilities may generalize more robustly than alignment properties (the "sharp left turn").
  2. The Problem of Learning: Even on a fixed distribution, the training process may learn an unintended solution. Our learning processes are noisy, heterogeneous, and not guaranteed to converge to an optimal solution for a given set of constraints. As a result, models may learn a dangerous simplification of our specifications, acquire unintended instrumental strategies (deception, incorrigibility, power-seeking, sandbagging, etc.), or suddenly change in ways that are hard to predict (and execute a "treacherous turn"). 
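
A toy sketch of the first problem (a purely illustrative example, not from the position paper): two models can fit the same narrow training set about equally well and still disagree badly once the inputs shift.

```python
# Toy illustration of underdetermination: two models fit the same training
# data well but generalize very differently off-distribution.

import numpy as np

rng = np.random.default_rng(0)

# "Training specification": noisy samples of y = x on a narrow interval.
x_train = np.linspace(-1.0, 1.0, 6)
y_train = x_train + rng.normal(0.0, 0.05, size=x_train.shape)

# Model A: a line. Model B: a degree-5 polynomial that interpolates the noise.
model_a = np.polynomial.Polynomial.fit(x_train, y_train, deg=1)
model_b = np.polynomial.Polynomial.fit(x_train, y_train, deg=5)

for name, model in [("A", model_a), ("B", model_b)]:
    train_err = np.max(np.abs(model(x_train) - y_train))
    print(f"model {name}: max train error = {train_err:.3f}")  # both small

# "Deployment": inputs the training data never constrained.
x_shift = np.linspace(2.0, 4.0, 3)
print("off-distribution predictions (target ~ x):")
print("  A:", np.round(model_a(x_shift), 2))  # stays close to y = x
print("  B:", np.round(model_b(x_shift), 2))  # typically far from y = x
```

Nothing in the training data distinguishes the two models; which one we end up with is decided by the inductive biases of the model class and the learning process, which is exactly what a theory of learning and generalization has to explain.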

Past a certain level of capabilities, making further safety progress may require making scientific progress on some of these more fundamental questions. It is hard to align what you do not understand, and we still do not understand deep learning. 

Figure 2: Key questions for a science of AI safety. Many practical questions in AI safety are grounded in fundamental scientific questions about deep learning: 
1a. (Learning) How does training data determine the algorithms that models learn?
1b. (Alignment) How can we choose training data to control what algorithms models learn?
2a. (Generalization) How do learned algorithms generalize (under distribution shift)?
2b. (Interpretability) How do a model's internals enable (mis)generalization? 

Singular learning theory for a science of alignment. Singular Learning Theory (SLT) is a theory of Bayesian statistics that suggests the key to understanding deep learning is the geometry (specifically, the degeneracies) of the loss function and parameter-function map. SLT provides a starting point for understanding how learned algorithms underlie generalization and how training data determines those learned algorithms:

  1. Interpretability: The loss landscape around a given set of weights reflects the algorithms encoded by that set of weights and the sensitivity of those algorithms to changes in the data distribution [1, 4, 6]. If we can develop (a) the empirical tools to read this geometry and (b) the theoretical understanding to "invert" this reflection (a hypothesis we call "Structural Bayesianism"  [6]), then we gain a principled framework for interpreting learned algorithms and linking this structure to generalization.
  2. Alignment: The algorithms a model learns are ultimately rooted in the model's training data: (a) Statistical structure in training data determines (b) the geometric structure in the loss landscape, which determines (c) the developmental structure in the learning process, which determines (d) the algorithmic structure in the final weights a model has learned. If we can (partially) "invert" this mapping (a hypothesis we call the "S4 correspondence", for four different kinds of "structure" [3, 8]), then we can transform constraints on learned algorithms into interventions on training data. This provides a framework for aligning models through careful control of the training environment.
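
To make the geometric claims above slightly more concrete (as a sketch only): in the Bayesian setting, Watanabe's free energy asymptotics tie the posterior weight of a region of parameter space to how degenerate the loss is there,

\[
F_n \;=\; -\log \int_W e^{-n L_n(w)}\,\varphi(w)\,dw \;=\; n L_n(w_0) \;+\; \lambda \log n \;+\; O_p(\log \log n),
\]

where \(L_n\) is the empirical loss (negative log-likelihood), \(\varphi\) a prior over the weights \(W\), \(w_0\) an optimal parameter, and \(\lambda\) the learning coefficient, an invariant of the degeneracy of the loss near \(w_0\). The local version of this expansion, restricted to a neighborhood of a trained parameter \(w^*\), defines the local learning coefficient. More degenerate regions have smaller \(\lambda\) and receive more posterior weight, which is the precise sense in which loss-landscape geometry encodes which solutions the learning process prefers.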

DL theory contains a number of important but mostly isolated ideas, such as the information bottleneck, the NTK, saddle-to-saddle dynamics, links to statistical mechanics and QFT, mode connectivity, the study of SGD via high-dimensional probability, etc. Among these ideas, SLT seems like one of the most important and one of the best bets for developing an eventual "unified theory" of deep learning, interpretability, and alignment. If successful, this would enable not just a better understanding of the theoretical problems of generalization and learning, but also the development of a new set of principled tools for "reading" and "writing" model internals. 
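
For a sense of what an empirical tool for "reading" this geometry can look like in practice, here is a rough sketch of estimating the local learning coefficient at a trained parameter by SGLD sampling from a localized, tempered posterior, in the spirit of the estimation work referenced above. The interface (a PyTorch classifier with cross-entropy loss) and every hyperparameter value are illustrative assumptions, not a reference implementation.

```python
# Hedged sketch: estimate the local learning coefficient (LLC) at trained
# weights w* via SGLD sampling from a localized, tempered posterior.
# Model/loss interface and all hyperparameter values are assumptions.

import copy
import math

import torch
import torch.nn as nn


def estimate_llc(model, data_loader, n_samples, num_draws=500, num_burnin=200,
                 step_size=1e-5, gamma=100.0, beta=None, device="cpu"):
    """Return lambda_hat = n * beta * (E_w[L_n(w)] - L_n(w*)), with w drawn by
    SGLD from p(w) proportional to exp(-n*beta*L_n(w) - (gamma/2)*||w - w*||^2)."""
    beta = 1.0 / math.log(n_samples) if beta is None else beta
    loss_fn = nn.CrossEntropyLoss()

    model = model.to(device).eval()
    w_star = [p.detach().clone() for p in model.parameters()]

    # Empirical loss L_n(w*) averaged over the whole loader.
    with torch.no_grad():
        total, count = 0.0, 0
        for x, y in data_loader:
            x, y = x.to(device), y.to(device)
            total += loss_fn(model(x), y).item() * y.shape[0]
            count += y.shape[0]
    loss_star = total / count

    # SGLD chain started at w*.
    chain = copy.deepcopy(model)
    params = list(chain.parameters())
    draws = []
    data_iter = iter(data_loader)

    for step in range(num_burnin + num_draws):
        try:
            x, y = next(data_iter)
        except StopIteration:
            data_iter = iter(data_loader)
            x, y = next(data_iter)
        x, y = x.to(device), y.to(device)

        chain.zero_grad()
        batch_loss = loss_fn(chain(x), y)
        batch_loss.backward()

        with torch.no_grad():
            for p, p_star in zip(params, w_star):
                # Drift = gradient of the tempered loss plus the localization
                # term pulling back toward w*; Gaussian noise as in SGLD.
                drift = n_samples * beta * p.grad + gamma * (p - p_star)
                noise = torch.randn_like(p) * math.sqrt(step_size)
                p.add_(-0.5 * step_size * drift + noise)

        if step >= num_burnin:
            # Minibatch loss as a cheap unbiased estimate of L_n(w).
            draws.append(batch_loss.item())

    return n_samples * beta * (sum(draws) / len(draws) - loss_star)
```

In practice the step size, localization strength gamma, and inverse temperature beta all require tuning, and a maintained implementation should be preferred over this sketch.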

This progress would not be enough to guarantee safety on its own; a substantial risk burden would still fall on other areas of safety. However, fundamental progress may be necessary if we aren't on track to solve (inner) alignment by default.

SLT for AI safety. The loss landscape is where the training data, architecture, and optimizer interact. We expect that understanding the geometry of this landscape is equivalent to understanding internal structure, how it generalizes, and how to reliably control what structures arise. If so, this would provide a rigorous framework for advancing interpretability and alignment.

Outline. The rest of this sequence will survey recent progress in SLT [1, 2, 3, 4, 5, 6, 7, 8] and explore the potential applications for interpretability, alignment, and other areas of safety discussed above:

  • SLT for Interpretability (upcoming)
  • SLT for Alignment (upcoming)
  • SLT for Present-Day Safety (upcoming)

For more on the relevance of SLT to safety, see our position paper (co-written with a larger group of authors). For more on the technical fundamentals of SLT, see Distilling Singular Learning Theory by Liam Carroll.  
