Nate Soares argues that there's a deep tension between training an AI to do useful tasks (like alignment research) and training it to avoid dangerous actions. Holden is less convinced of this tension. They discuss a hypothetical training process and analyze potential risks.
Credal sets, a special case of infradistributions[1] in infra-Bayesianism and classical objects in imprecise probability theory, provide a means of describing uncertainty without assigning exact probabilities to events as in Bayesianism. This is significant because as argued in the introduction to this sequence, Bayesianism is inadequate as a framework for AI alignment research. We will focus on credal sets rather than general infradistributions for simplicity of the exposition.
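To make the idea concrete, here is a minimal sketch (not from the sequence; all names are illustrative): a credal set over a finite outcome space can be represented by the extreme points of a closed convex set of distributions, and uncertainty about an event is summarized by a lower and upper expectation rather than a single number.

```python
# Hypothetical sketch: a credal set given by the extreme points of a closed
# convex set of probability distributions over a finite outcome space.
# For a linear payoff, min/max over the convex set is attained at an extreme
# point, so evaluating at the extreme points suffices.

def lower_expectation(credal_set, f):
    """Worst-case expected value of f over the credal set."""
    return min(sum(p[x] * f[x] for x in p) for p in credal_set)

def upper_expectation(credal_set, f):
    """Best-case expected value of f over the credal set."""
    return max(sum(p[x] * f[x] for x in p) for p in credal_set)

# Uncertainty about a biased coin: heads probability known only to lie
# in [0.3, 0.7], with no exact probability assigned.
credal_set = [
    {"H": 0.3, "T": 0.7},  # extreme point: least heads-favorable
    {"H": 0.7, "T": 0.3},  # extreme point: most heads-favorable
]
payoff = {"H": 1.0, "T": 0.0}  # indicator of heads

print(lower_expectation(credal_set, payoff))  # 0.3
print(upper_expectation(credal_set, payoff))  # 0.7
```

The gap between the lower and upper expectation is exactly the imprecision a Bayesian single prior would erase.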
Recall that the total-variation metric is one example of a metric on the set of probability distributions over a finite set $X$. A set is closed with respect to a metric if it contains all of its limit points with respect to the metric. For example, let $X = \{0, 1\}$. The set of probability distributions over $X$ is given by $\Delta X = \{(p_0, p_1) : p_0, p_1 \geq 0,\ p_0 + p_1 = 1\}$.
There is a bijection between and the closed interval which is...
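As an illustrative sketch (assumed, not taken from the post): the total-variation metric on $\Delta X$ for $X = \{0,1\}$, together with the natural identification of a distribution $(p_0, p_1)$ with its first coordinate, can be written out directly.

```python
# Illustrative sketch: distributions over X = {0, 1} as pairs (p0, p1) with
# p0 + p1 = 1, the total-variation metric, and the bijection with [0, 1]
# given by projecting onto the first coordinate.

def total_variation(p, q):
    """TV distance over a finite set: half the L1 distance between the pmfs."""
    return 0.5 * sum(abs(p[i] - q[i]) for i in range(len(p)))

def to_interval(p):
    """Bijection Delta({0,1}) -> [0,1]: a distribution is determined by p0."""
    return p[0]

def from_interval(t):
    """Inverse bijection [0,1] -> Delta({0,1})."""
    return (t, 1.0 - t)

p, q = from_interval(0.2), from_interval(0.9)
print(total_variation(p, q))  # ~0.7, i.e. |0.2 - 0.9| under the bijection
```

Under this identification, the TV metric on $\Delta\{0,1\}$ agrees with the usual metric on $[0,1]$, which is why closedness of subsets of $\Delta\{0,1\}$ can be checked as closedness of subsets of the interval.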
I have updated towards the "LLMs as giant lookup tables" model; curious whether that feels approximately right to you or not.
Haven't engaged enough to know, could be bid to engage, won't by default.
This is an idea I came up with and presented at the Agent Foundations 2025 conference at CMU.
Here is a nice, simple formalism for decision theory that in particular supports the decision theory coming out of infra-Bayesianism. I now call the latter "Disambiguative Decision Theory", since its counterfactuals work by "disambiguating" the agent's beliefs.
Let
If you try to get reward-seekers to cooperate by pooling reward in multi-agent settings, you're not changing their decision theory; you're just changing the reward structure so that CDT reward-seekers are incentivized to cooperate with each other.
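A toy illustration of this point (assumed, not from the comment): in a prisoner's dilemma, fully pooling reward turns defection from the dominant action into a dominated one, so a CDT agent that simply best-responds to the (pooled) reward ends up cooperating, with its decision procedure unchanged.

```python
# Toy sketch: pooling reward in a prisoner's dilemma. Payoffs are (row, col).
# The CDT agent's procedure is identical in both games: best-respond to each
# fixed opponent action. Only the reward it is maximizing changes.

RAW = {  # classic prisoner's dilemma payoffs
    ("C", "C"): (3, 3),
    ("C", "D"): (0, 5),
    ("D", "C"): (5, 0),
    ("D", "D"): (1, 1),
}

def pooled(payoffs):
    """Each agent receives the average of both agents' rewards."""
    return {a: ((r1 + r2) / 2, (r1 + r2) / 2) for a, (r1, r2) in payoffs.items()}

def best_response(payoffs, opp_action):
    """Row player's CDT best response, holding the opponent's action fixed."""
    return max(["C", "D"], key=lambda a: payoffs[(a, opp_action)][0])

for game, name in [(RAW, "raw"), (pooled(RAW), "pooled")]:
    print(name, {opp: best_response(game, opp) for opp in ["C", "D"]})
# raw:    D is the best response to both C and D (defection dominates)
# pooled: C is the best response to both (total reward is maximized by C)
```

The agent never reasons about logical correlation or policy selection; cooperation falls out purely because the pooled reward matrix makes it the CDT-optimal action.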
tl;dr: We share a toy environment that we found useful for understanding how reasoning changed over the course of capabilities-focused RL. In this environment, the model comes to weight reward hints more strongly than direct instructions as that training progresses.
When we noticed the increase in verbalized alignment evaluation awareness during capabilities-focused RL, we initially thought that the right mental model was something like:
However, qualitatively neither of these seemed particularly salient to the model:
Thanks for doing this! I'm curious about the lower scores for o3 compared to the other two model checkpoints. Was there a layer of safety training between "RL (late)" and production o3?
...
It seems the answer to my question is probably yes? From the earlier blog post:
...We analyze the increase in verbalized metagaming observed in a portion of capabilities-focused RL training (called exp-rl-cap in Schoen et al) that was part of training o3, prior to any safety- or alignment-focused training.[8] ... We have also observed an increase in reasoning about the reward
Infra-Bayesianism is a mathematical framework for studying artificial learning and intelligence that developed from Vanessa Kosoy’s Learning Theoretic AI Alignment Research Agenda. As applied to reinforcement learning, the main character of infra-Bayesianism is an agent that is learning about an unknown environment and making decisions in pursuit of some goal. Infra-Bayesianism provides novel ways to model this agent’s beliefs and make decisions, which address problems arising when an agent does not or cannot consider the true environment possible at the beginning of the learning process. This setting, a non-realizable environment, is relevant to various scenarios important to AI alignment, including scenarios when agents may consider themselves as part of the environment, and scenarios involving self-modifying agents, multi-agent interactions, and decision theory problems. Furthermore, it is the most...
the computational complexity of individual hypotheses in the hypothesis class cannot be the thing that characterizes the hardness of learning, but rather it has to be some measure of how complex the entire hypothesis class is.
This is true, of course, but mostly immaterial. Outside of contrived examples, it's rare for the hypothesis class to be feasible to learn while containing hypotheses that are infeasible to evaluate. It seems extremely implausible that you can find a hypothesis class that is simultaneously (i) possible to specify in practice [...
What you propose here doesn't address the issue of non-realizability at all. For example, let's say the hypothesis class $\mathcal{H}$ is countable. Then any of the 3 regret criteria (uniform, Bayesian, and your own "credal" proposal) implies that the algorithm would converge to a near-optimal policy for any given hypothesis $h \in \mathcal{H}$. This cannot work if some such $h$ is infeasible to optimize.