We've written a paper on online imitation learning, and our construction allows us to bound the extent to which mesa-optimizers could accomplish anything. This is not to say it will definitely be easy to eliminate mesa-optimizers in practice, but investigations into how to do so could treat this as a starting point. The way to avoid outputting predictions that may have been corrupted by a mesa-optimizer is to ask for help when plausible stochastic models disagree about probabilities.
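Here's a toy sketch of that ask-for-help rule in Python, purely to illustrate the shape of the idea; the ensemble of models, the disagreement threshold, and the `query_demonstrator` helper are illustrative stand-ins, not the construction from the paper:

```python
import numpy as np

def act_or_ask(history, plausible_models, query_demonstrator, tol=0.05):
    """Act only when every plausible model roughly agrees on the action
    distribution; otherwise defer to the demonstrator.

    plausible_models: list of callables mapping a history to an array of
        action probabilities (one entry per action).
    query_demonstrator: callable that returns the demonstrator's action.
    """
    dists = np.array([m(history) for m in plausible_models])
    # Largest disagreement between any two models about any action's probability.
    disagreement = (dists.max(axis=0) - dists.min(axis=0)).max()
    if disagreement > tol:
        # Plausible models disagree: ask for help instead of guessing.
        return query_demonstrator(history)
    # Models agree closely: imitate using (say) the mean predicted distribution.
    mean_dist = dists.mean(axis=0)
    return np.random.choice(len(mean_dist), p=mean_dist)
```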
Here is the abstract:
...In imitation learning, imitators and demonstrators are policies for picking actions given past interactions with the environment. If we run an imitator, we probably want events to unfold similarly to the way they would have if the demonstrator had been acting the whole time. No existing work
Hmm, added to reading list, thank you.
An AI actively trying to figure out what I want might show me snapshots of different possible worlds and ask me to rank them. Of course, I do not have the processing power to examine entire worlds; all I can really do is look at some pictures or video or descriptions. The AI might show me a bunch of pictures from one world in which a genocide is quietly taking place in some obscure third-world nation, and another in which no such genocide takes place. Unless the AI already considers that distinction important enough to draw my attention to it, I probably won’t notice it from the pictures, and I’ll rank those worlds similarly - even though I’d prefer the one without the genocide. Even if the...
In other words, how do we find the corresponding variables? I've given you an argument that the variables in an AGI's world-model which correspond to the ones in your world-model can be found by expressing your concept in English sentences.
But you didn't actually give an argument for that -- you simply stated it. As a matter of fact, I disagree: it seems really easy for an AGI to misunderstand what I mean when I use English words. To go back to the "fusion power generator", maybe it has a very deep model of such generators that abstracts away most of the c...
Co-authored with Rebecca Gorman.
In section 5.2 of their arXiv paper "The Incentives that Shape Behaviour", which introduces structural causal influence models and a proposal for addressing misaligned AI incentives, the authors present the following graph:
The blue node is a "decision node", defined as the point where the AI chooses its action. The yellow node is a "utility node", defined as the target of the AI's utility-maximising goal. The authors use this graph to introduce the concept of control incentives: the AI, given the utility-maximising goal of user clicks, discovers an intermediate control incentive, influencing user opinions. By influencing user opinions, the AI better fulfils its objective. This 'control incentive' is represented graphically by surrounding the node in dotted orange.
A click-maximising AI would only care about user opinions indirectly: they are a...
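For concreteness, here is a rough sketch of the graphical idea, under which a node has a control incentive when it sits on a directed path from the decision node to the utility node. The graph and node names below are a made-up toy, and this is my paraphrase of the excerpt rather than the paper's formal definition:

```python
import networkx as nx

# Toy content-recommendation skeleton: the AI's policy (decision) influences
# user opinions, which influence clicks (utility).
G = nx.DiGraph()
G.add_edges_from([
    ("policy", "user_opinions"),
    ("user_opinions", "clicks"),
    ("original_opinions", "user_opinions"),
])

def on_directed_path(graph, node, decision, utility):
    """Rough control-incentive check: does `node` lie on some directed
    path from the decision node to the utility node?"""
    return (nx.has_path(graph, decision, node)
            and nx.has_path(graph, node, utility))

print(on_directed_path(G, "user_opinions", "policy", "clicks"))      # True
print(on_directed_path(G, "original_opinions", "policy", "clicks"))  # False
```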
This is because I think that the counterexample given here dissolves if there is an additional path, without an intermediate node, from the matchmaking policy to the price paid.
I think you are using some mental model where 'paths with nodes' vs. 'paths without nodes' produces a real-world difference in outcomes. This is the wrong model to use when analysing CIDs. A path in a diagram -->[node]--> can always be replaced by a single arrow --> to produce a model that makes equivalent predictions, and the opposite operation is also possible.
So the number of nodes...
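To illustrate the splicing claim in the comment above with a minimal, purely hypothetical deterministic example:

```python
# Three-node chain X --> Z --> Y, with deterministic mechanisms (hypothetical).
def f(x):            # mechanism for the intermediate node Z
    return 2 * x

def g(z):            # mechanism for Y
    return z + 1

def y_with_node(x):      # model that keeps the intermediate node Z
    return g(f(x))

def y_without_node(x):   # spliced model: a single arrow X --> Y computing g(f(x)) directly
    return 2 * x + 1

# Both models make identical predictions about Y for every value of X.
assert all(y_with_node(x) == y_without_node(x) for x in range(10))
```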
In a previous post, I argued for the study of goal-directedness in two steps:
- Defining goal-directedness: depends only on the complete behavior of the system, and probably assumes infinite compute and resources.
- Computing goal-directedness: depends on the internal structure, and more specifically what information about the complete behavior can be extracted from this structure.
Intuitively, understanding goal-directedness should mean knowing which questions to ask about the complete behavior of the system to determine its goal-directedness. Here the “complete” part is crucial; it simplifies the problem by removing the need to infer what the system will do based on limited behavior. Similarly, we don’t care about the tractability/computability of the questions asked; the point is to find what to look for, without worrying (yet) about how to get it.
This post proposes...
So, if you haven't read the first two posts, do so now.
In this post, we'll be going over the basic theory of belief functions, which are functions that map policies to sets of sa-measures, much like how an environment can be viewed as a function that maps policies to probability distributions over histories. Also, we'll be showing some nifty decision theory results at the end. The proofs for this post are in the following three posts (1,2,3), though reading them is inessential and quite difficult.
Now, it's time to address desideratum 1 (dynamic consistency), and desideratum 3 (how do we formalize the Nirvana trick to capture policy selection problems) from the first post. We'll be taking the path where Nirvana counts as infinite reward, instead of counting...
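As a purely illustrative type-level gloss of that analogy (the aliases below are mine, not the sequence's notation, and the real objects carry far more structure):

```python
from typing import Callable, FrozenSet, Mapping

# Illustrative type aliases only.
History = tuple                              # finite sequence of (action, observation) pairs
Policy = Callable[[History], Mapping]        # history -> distribution over next actions
Distribution = Mapping[History, float]       # probability assigned to each history
SaMeasure = tuple                            # roughly: (a measure over histories, a scalar term)

# An environment sends a policy to a single probability distribution over histories...
Environment = Callable[[Policy], Distribution]
# ...while a belief function sends a policy to a *set* of sa-measures.
BeliefFunction = Callable[[Policy], FrozenSet[SaMeasure]]
```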
Lets ponder the bestiary of decision-theory problems
"Lets" should be "Let's"
Alignment Newsletter is a weekly publication with recent content relevant to AI alignment around the world. Find all Alignment Newsletter resources here. In particular, you can look through this spreadsheet of all summaries that have ever been in the newsletter.
Audio version here (may not be up yet).
Please note that while I work at DeepMind, this newsletter represents my personal views and not those of my employer.
Why does deep and cheap learning work so well? (Henry W. Lin et al) (summarized by Rohin): We know that the success of neural networks must be at least in part due to some inductive bias (presumably towards “simplicity”), based on the following...
What John said. To elaborate, it's specifically talking about the case where there is some concept from which some probabilistic generative model creates observations tied to the concept, and claiming that the log probabilities follow a polynomial.
Suppose the most dog-like nose size is K. One function you could use is y = exp(-(x - K)^d) for some positive even integer d. That's a function whose maximum value is 1, attained at x = K (where higher values = more "dogness"), and it doesn't blow up unreasonably anywhere.
(Really you should be talking about probabilities, in which case you use the same sort of function but then normalize, which transforms the exp into a softmax, as the paper suggests)
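A small numerical sketch of that exponentiate-and-normalize step, with made-up classes and a quadratic (d = 2) score:

```python
import numpy as np

K_DOG, K_CAT = 5.0, 2.0   # "most typical" nose sizes for each class (made up)
d = 2                      # even power, so the score falls off in both directions

def scores(x):
    # Polynomial log-probability-style scores: 0 at the most typical nose size,
    # increasingly negative away from it.
    return np.array([-(x - K_DOG) ** d, -(x - K_CAT) ** d])

def class_probs(x):
    # Exponentiate and normalize: exactly a softmax over the polynomial scores.
    s = scores(x)
    e = np.exp(s - s.max())   # subtract the max for numerical stability
    return e / e.sum()

print(class_probs(4.5))   # nose size near K_DOG -> probability mass mostly on "dog"
```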