Credal sets, a special case of infradistributions[1] in infra-Bayesianism and classical objects in imprecise probability theory, provide a means of describing uncertainty without assigning exact probabilities to events, as Bayesianism does. This is significant because, as argued in the introduction to this sequence, Bayesianism is inadequate as a framework for AI alignment research. We will focus on credal sets rather than general infradistributions for simplicity of exposition.
Recall that the total-variation metric is one example of a metric on the set of probability distributions over a finite set. A set is closed with respect to a metric if it contains all of its limit points with respect to that metric. For example, let $X = \{0, 1\}$. The set of probability distributions over $X$ is given by $\Delta X = \{(p_0, p_1) \in [0,1]^2 : p_0 + p_1 = 1\}$.
There is a bijection between $\Delta X$ and the closed interval $[0,1]$, which is...
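To make this concrete, here is the two-element case worked out. The identification below follows from the definitions above, but the particular credal set at the end is my own illustrative choice rather than one from the post:

```latex
\Delta X \;\cong\; [0,1] \quad \text{via } \mu \mapsto \mu(\{1\}), \qquad
d_{TV}(\mu,\nu) \;=\; \max_{A \subseteq X} \lvert \mu(A) - \nu(A)\rvert
\;=\; \lvert \mu(\{1\}) - \nu(\{1\}) \rvert .
```

Under this identification, a credal set (a closed convex set of distributions) such as $\{\mu \in \Delta X : 0.3 \le \mu(\{1\}) \le 0.7\}$ corresponds to the closed interval $[0.3, 0.7]$, which is indeed closed with respect to the metric.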
The goal of this post is to give a summary of classical reinforcement learning theory that primes the reader to learn about infra-Bayesianism, which is a new framework for reinforcement learning that aims to solve problems related to AI alignment. We will concentrate on basic aspects of the classical theory that have analogous concepts in infra-Bayesianism, and explain these concepts using infra-Bayesianism conventions. The more technical proofs are contained in the proof section.
For the first part of this sequence and for links to other writings, see What is Inadequate about Bayesianism for AI Alignment: Motivating Infra-Bayesianism.
One special case of reinforcement learning is the case of stochastic bandits. For example, a bird may have a choice of three levers. When the bird steps on a lever, either some...
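As a sketch of what that setup looks like computationally, here is a minimal stochastic-bandit simulation; the three reward probabilities and the epsilon-greedy strategy are my own illustrative choices, since the post's example is cut off before specifying them:

```python
import random

# Toy 3-armed Bernoulli bandit: each "lever" dispenses a reward (e.g. food)
# with a fixed probability unknown to the learner. Probabilities are made up.
TRUE_REWARD_PROBS = [0.2, 0.5, 0.8]

def pull(arm: int) -> int:
    """Return 1 (reward) with the arm's probability, else 0."""
    return 1 if random.random() < TRUE_REWARD_PROBS[arm] else 0

def epsilon_greedy(num_steps: int = 10_000, epsilon: float = 0.1) -> list[float]:
    """Mostly pull the arm with the best observed average reward,
    but try a uniformly random arm with probability epsilon."""
    n_arms = len(TRUE_REWARD_PROBS)
    counts = [0] * n_arms
    totals = [0.0] * n_arms
    for _ in range(num_steps):
        if random.random() < epsilon or 0 in counts:
            arm = random.randrange(n_arms)                                 # explore
        else:
            arm = max(range(n_arms), key=lambda a: totals[a] / counts[a])  # exploit
        reward = pull(arm)
        counts[arm] += 1
        totals[arm] += reward
    return [totals[a] / counts[a] if counts[a] else 0.0 for a in range(n_arms)]

if __name__ == "__main__":
    print(epsilon_greedy())  # estimated reward rates, roughly [0.2, 0.5, 0.8]
```

The point of the bandit setting is that the learner only observes the reward of the lever it actually pulls, so it has to trade off exploring unfamiliar levers against exploiting the one that currently looks best.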
You're right, it's supposed to be
where
Current AIs routinely take unintended actions to score well on tasks: hardcoding test cases, training on the test set, downplaying issues, etc. This misalignment is still somewhat incoherent, but it increasingly resembles what I call "fitness-seeking"—a family of misaligned motivations centered on performing well in training and evaluations (e.g., reward-seeking). Fitness-seeking warrants substantial concern.
In this piece, I lay out what I take to be the central mechanisms by which fitness-seeking motivations might lead to human disempowerment, and propose mitigations for them. While the analysis is inherently speculative, this kind of speculation seems worthwhile: AI control emerged from explicitly taking scheming motivations seriously and asking what interventions are implied, and my hope is that developing mitigations for fitness-seeking will benefit from similar forward-looking analysis.
Fitness-seekers are, in many ways, notably safer than what I'll...
I think you're making the distinction more confusing than it has to be.
There are things that have motivational pull, and there are things that don't, but I do them anyway because they are a necessary step toward getting what I actually want.
Say I want to get an apple, and the easiest way to get one is going to the store and buying some. Going to the store in this story is clearly an instrumental goal, and enjoying eating my apple is a terminal goal.[1]
Things that are instrumental can acquire the property of being terminal by association in our brain, because of how huma...
Yesterday, I wrote about the state of deep learning theory circa 2016,[1] as well as the bombshell 2016 paper by Zhang et al. that arguably signaled its demise. Today, I cover the aftermath, and the 2019 paper that devastated deep learning theory again.
As a brief summary, I argued that the rise of deep learning posed an existential challenge to the dominant theoretical paradigm of statistical learning theory, because the hypothesis class of neural networks is enormous by the standard measures of complexity. The response from the field was to attempt to quantify other ways in which the hypothesis class of neural networks in practice was simple, using alternative metrics of complexity. Zhang et al. 2016 showed that standard neural network architectures trained with standard training methods could memorize large quantities of randomly labelled data,...
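A toy version of the memorization phenomenon is easy to reproduce; the data sizes and architecture below are my own choices for a quick demonstration, not the image datasets and networks used by Zhang et al.:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)

# Gaussian inputs with labels that are pure noise: there is nothing to "learn".
X_train = rng.normal(size=(200, 32))
y_train = rng.integers(0, 2, size=200)   # random labels
X_test = rng.normal(size=(200, 32))
y_test = rng.integers(0, 2, size=200)

# Heavily over-parameterized relative to the 200 training points.
net = MLPClassifier(hidden_layer_sizes=(512,), alpha=0.0,
                    max_iter=5000, tol=1e-6, random_state=0)
net.fit(X_train, y_train)

print("train accuracy:", net.score(X_train, y_train))  # typically ~1.0 (memorized)
print("test accuracy: ", net.score(X_test, y_test))    # ~0.5, i.e. chance level
```

The near-perfect training accuracy on pure noise is exactly what makes architecture-level complexity measures unable to explain why the same networks generalize well on real labels.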
Yeah, good question. I think the word "data-dependent" has different connotations (even if it is standard terminology).
Using the sketch definition
With high probability over possible training sets S, for all h in the hypothesis class, we have |expected test error of hypothesis h - empirical error of h on S| <= (some bound involving the size of the training data and high-level properties of h).[2]
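For readability, here is the same sketch written as a formula; the symbols $\mathcal{H}$ for the hypothesis class, $\operatorname{err}$ for the two error terms, and $B$ for the bound are my own shorthand, not notation from the thread:

```latex
\text{With high probability over the training set } S:\qquad
\forall h \in \mathcal{H},\quad
\bigl|\operatorname{err}_{\text{test}}(h) - \operatorname{err}_{S}(h)\bigr|
\;\le\; B\bigl(\lvert S \rvert,\ \text{high-level properties of } h\bigr).
```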
You're right that properties of h are, in general, different from properties of the data. The "data-dependent"...
Gradient hacking is when a deceptively aligned AI deliberately acts to influence how the training process updates it. For example, it might try to become more brittle in ways that prevent its objective from being changed. This poses challenges for AI safety, as the AI might try to remove evidence of its deception during training.
You're absolutely right, it should be a quotient space, not a subspace. In principle, it can be represented as a closed subspace of the product of copies of , where stands for "undefined".
Actually, we do? For example, consider the space . Then the following set is open:
However, thi...