We've written a paper on online imitation learning, and our construction allows us to bound the extent to which mesa-optimizers could accomplish anything. This is not to say it will definitely be easy to eliminate mesa-optimizers in practice, but investigations into how to do so could use this work as a starting point. The way to avoid outputting predictions that may have been corrupted by a mesa-optimizer is to ask for help whenever the plausible stochastic models disagree about the probabilities of the next action.
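The query-on-disagreement idea can be sketched in a few lines. This is a minimal illustration, not the paper's exact construction: the function names and the disagreement threshold are assumptions, and "plausible models" here are just any predictors consistent with the data so far.

```python
import random

def total_variation(p, q):
    """Total variation distance between two action distributions (dicts)."""
    actions = set(p) | set(q)
    return 0.5 * sum(abs(p.get(a, 0.0) - q.get(a, 0.0)) for a in actions)

def act(history, plausible_models, demonstrator, threshold=0.1):
    """Imitate when plausible models agree; otherwise defer to the demonstrator.

    plausible_models: list of functions history -> {action: probability}.
    demonstrator: fallback policy, queried only on disagreement.
    (Names and threshold are illustrative assumptions.)
    """
    dists = [m(history) for m in plausible_models]
    disagreement = max(total_variation(p, q) for p in dists for q in dists)
    if disagreement > threshold:
        # Models that all fit past data still disagree here: a corrupted
        # (e.g. mesa-optimized) prediction could hide in this gap, so we
        # ask for help instead of guessing.
        return demonstrator(history)
    # The models agree (up to the threshold), so imitating is safe:
    # sample from any one of them.
    actions, probs = zip(*dists[0].items())
    return random.choices(actions, weights=probs)[0]
```

The design choice that matters is that the imitator never silently picks among disagreeing models; disagreement is exactly the regime where a single corrupted model could steer outcomes, so that is when control is handed back.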
Here is the abstract:
In imitation learning, imitators and demonstrators are policies for picking actions given past interactions with the environment. If we run an imitator, we probably want events to unfold similarly to the way they would have if the demonstrator had been acting the whole time. No existing work...
An AI actively trying to figure out what I want might show me snapshots of different possible worlds and ask me to rank them. Of course, I do not have the processing power to examine entire worlds; all I can really do is look at some pictures or video or descriptions. The AI might show me a bunch of pictures from one world in which a genocide is quietly taking place in some obscure third-world nation, and another in which no such genocide takes place. Unless the AI already considers that distinction important enough to draw my attention to it, I probably won’t notice it from the pictures, and I’ll rank those worlds similarly - even though I’d prefer the one without the genocide. Even if the...
Co-authored with Rebecca Gorman.
In section 5.2 of their arXiv paper "The Incentives that Shape Behaviour", which introduces structural causal influence models and a proposal for addressing misaligned AI incentives, the authors present the following graph:
The blue node is a "decision node", where the AI chooses its action. The yellow node is a "utility node", the target of the AI's utility-maximising goal. The authors use this graph to introduce the concept of control incentives: the AI, given the utility-maximising goal of user clicks, discovers an intermediate control incentive, namely influencing user opinions. By influencing user opinions, the AI better fulfils its objective. This control incentive is represented graphically by surrounding the node in dotted orange.
A click-maximising AI would only care about user opinions indirectly: they are a...
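The graphical criterion behind control incentives can be sketched directly. Roughly, a variable can carry a control incentive when some directed path runs from the decision node through it to the utility node. The following is a simplified illustration of that idea, not the paper's full criterion; the node names and toy recommender graph are assumptions for the example.

```python
def has_directed_path(graph, start, end):
    """Depth-first search for a directed path in an adjacency-dict graph."""
    stack, seen = [start], set()
    while stack:
        node = stack.pop()
        if node == end:
            return True
        if node not in seen:
            seen.add(node)
            stack.extend(graph.get(node, ()))
    return False

def control_incentive_candidates(graph, decision, utility):
    """Nodes lying on a directed path decision -> ... -> utility.

    A rough version of the graphical criterion: these are the variables
    the agent has both the means and the motive to influence.
    """
    return {
        x for x in graph
        if x not in (decision, utility)
        and has_directed_path(graph, decision, x)
        and has_directed_path(graph, x, utility)
    }

# Toy recommender graph (illustrative): the AI's choice of posts shown
# influences user opinions, which in turn influence clicks (the utility).
toy_graph = {
    "posts_shown": ["user_opinions", "clicks"],
    "user_opinions": ["clicks"],
    "clicks": [],
}
```

On this toy graph, `control_incentive_candidates(toy_graph, "posts_shown", "clicks")` flags `user_opinions`: the direct edge to clicks gives the AI its objective, and the path through opinions gives it the instrumental incentive the authors highlight in dotted orange.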
In a previous post, I argued for the study of goal-directedness in two steps:
- Defining goal-directedness: depends only on the complete behavior of the system, and probably assumes infinite compute and resources.
- Computing goal-directedness: depends on the internal structure, and more specifically what information about the complete behavior can be extracted from this structure.
Intuitively, understanding goal-directedness should mean knowing which questions to ask about the complete behavior of the system to determine its goal-directedness. Here the “complete” part is crucial; it simplifies the problem by removing the need to infer what the system will do based on limited behavior. Similarly, we don’t care about the tractability/computability of the questions asked; the point is to find what to look for, without worrying (yet) about how to get it.
This post proposes...
In this post, we'll be going over the basic theory of belief functions, which are functions that map policies to sets of sa-measures, much as an environment can be viewed as a function that maps policies to probability distributions over histories. We'll also be showing some nifty decision theory results at the end. The proofs for this post are in the following three posts (1, 2, 3), though reading them is inessential and quite difficult.
Now, it's time to address desideratum 1 (dynamic consistency) and desideratum 3 (how do we formalize the Nirvana trick to capture policy selection problems) from the first post. We'll be taking the path where Nirvana counts as infinite reward, instead of counting...
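The contrast between environments and belief functions can be made concrete with a worst-case (maximin) decision rule. This is a deliberately simplified sketch: it uses sets of ordinary probability distributions in place of sets of sa-measures (ignoring the affine bookkeeping), and all names are illustrative.

```python
def expected_utility(dist, utility):
    """Expectation of a utility function under a dict {history: probability}."""
    return sum(p * utility(h) for h, p in dist.items())

def maximin_value(belief, policy, utility):
    """Worst-case expected utility over the set belief(policy).

    An environment would return one distribution per policy; a belief
    function returns a set, and we evaluate each policy by its worst case.
    """
    return min(expected_utility(d, utility) for d in belief(policy))

def best_policy(belief, policies, utility):
    """Pick the policy with the best worst case (maximin)."""
    return max(policies, key=lambda pi: maximin_value(belief, pi, utility))
```

For example, a "safe" policy whose belief set is a single distribution with utility 0.5 beats a "risky" policy whose belief set contains both a utility-1 and a utility-0 distribution, since the risky policy's worst case is 0.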
Alignment Newsletter is a weekly publication with recent content relevant to AI alignment around the world. Find all Alignment Newsletter resources here. In particular, you can look through this spreadsheet of all summaries that have ever been in the newsletter.
Audio version here (may not be up yet).
Please note that while I work at DeepMind, this newsletter represents my personal views and not those of my employer.
Why does deep and cheap learning work so well? (Henry W. Lin et al.) (summarized by Rohin): We know that the success of neural networks must be at least in part due to some inductive bias (presumably towards "simplicity"), based on the following...