We (Ramana, Abram, Josiah, Daniel) are working together as part of PIBBSS this summer. The goal of the PIBBSS fellowship program is to bring researchers in alignment (in our case, Ramana and Abram) together with researchers from other relevant fields (in our case, Josiah and Daniel, who are both PhD students in Logic and Philosophy of Science) to work on alignment.

We’ve spent a few weeks leading up to the summer developing a number of possible project ideas. We’re writing this post in order to both help ourselves think through the various projects and how they might actually help with alignment (theory of change), and to (hopefully!) get feedback from other alignment researchers about which projects seem most promising/exciting.

We’ve discussed five possible project directions. The first two in particular are a bit more fleshed out. For each project we’ll describe the core goals of the project, what possible plans of attack might be, and how we’d expect a successful version of the project to contribute to alignment. Many of our projects inherit the theory of change for all agent foundations work (described here by John). In the descriptions below we focus on slightly more specific ways the projects might matter.

Possible Projects

Project 1: Inferring Algebras from Behavior


Standard representation theorems in decision theory (for example, Savage and Jeffrey-Bolker) show that when an agent’s preferences satisfy certain rationality and structural constraints, we can represent the preferences as if they were generated by an agent who is maximizing expected utility. In particular, they allow us to infer meaningful things about both the probability and the utility function. However, these representation theorems start off with the agent’s conceptual space (formally, an algebra[1]) already “known” to the person trying to infer the structure. The goal of this project would be to generalize representation theorems so that we can also infer things about the structure of an agent’s algebra from her preferences or choice behavior.
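For concreteness, the conclusion of a Savage-style theorem can be written roughly as follows. The symbols are notation assumed here rather than taken from the post: S is the state space, f and g are acts, P is a probability over states, and u is a utility over outcomes (unique up to positive affine transformation).

```latex
f \succsim g
\quad\Longleftrightarrow\quad
\int_S u\bigl(f(s)\bigr)\,dP(s) \;\ge\; \int_S u\bigl(g(s)\bigr)\,dP(s)
```

Note that both sides presuppose S and the outcome space: the theorem recovers P and u, but the objects they are defined over are inputs, which is the gap this project targets.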

Theory of Change

A representation theorem is a particular kind of selection theorem. John has argued that selection theorems can help us understand agency in a way that will help with alignment. Inferring an agent’s conceptual space from her behavior also seems like it might be useful for ELK (for ELK, we might want to think of this project as helping with a translation problem between the agent’s algebra/conceptual space and our own).

Plan of Attack

In order to develop a new representation theorem (or at least understand why proving such a theorem would be hard/impossible), there are two core choices we would need to make.

The first is how to define the data that we have access to. For example, in Savage the starting data is a preference ordering over acts (which are themselves functions from states to outcomes). In Jeffrey-Bolker, the data is a preference ordering over all propositions in the agent’s algebra. Notice that both preference orderings are defined over the kinds of things we are trying to infer: Savage acts make essential use of states and outcomes, and in Jeffrey-Bolker the preference ordering is over the members of the algebra themselves. Thus, we would need to find some type of data that looks like preferences, but not preferences over the very objects we are trying to infer. One possible candidate would be observed acts (but then we would need a theory of what counts as an act).
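To make the circularity concrete, here is a minimal sketch of Savage-style data in a toy setting. Every name and number in it (STATES, OUTCOMES, P, U) is a hypothetical choice for illustration, not something from the representation theorems themselves.

```python
from itertools import product

# Hypothetical toy setup: two states, three outcomes.
STATES = ("rain", "sun")
OUTCOMES = ("wet", "dry", "tan")

# Assumed (illustrative) probability over states and utility over outcomes.
P = {"rain": 0.3, "sun": 0.7}
U = {"wet": 0.0, "dry": 1.0, "tan": 2.0}


def all_acts(states, outcomes):
    """Every Savage act: a function from states to outcomes, stored as a dict."""
    return [dict(zip(states, combo))
            for combo in product(outcomes, repeat=len(states))]


def expected_utility(act, prob, util):
    """Probability-weighted utility of the outcomes the act yields."""
    return sum(prob[s] * util[act[s]] for s in act)


def weakly_prefers(f, g, prob=P, util=U):
    """The preference ordering over acts that the theorem takes as data."""
    return expected_utility(f, prob, util) >= expected_utility(g, prob, util)


ACTS = all_acts(STATES, OUTCOMES)
```

Even writing down a single element of ACTS requires already knowing STATES and OUTCOMES, so an observer who is handed this preference ordering has in effect been handed the agent's conceptual space as well.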

Additionally, and perhaps (given the difficulty of the problem) importantly, we might allow ourselves access to “side data”. For example, we might help ourselves to facts about the agent’s architecture, the process that generated the agent, or the amount of compute it uses.

The second core choice is defining the space of possible answers. For example, are we only working with algebras of sets? Do we want more structure to the points in our sample space (state descriptions versus just points)? Do we make assumptions about the kinds of algebras we might output, and thus consider a restricted class of algebras? Do we want our inference process to output a single, “best fit” algebra, a set of admissible algebras, a probability distribution over algebras? Do we allow for non-Boolean algebras? There are many possibilities.

Once these choices are made and we have a more formal description of the problem, the main work is to actually see if we can get any inference procedure/representation theorem off the ground. The difficulty and generality of the theorem will depend on the choices we make about the inputs and outputs. A core part of the project will also be understanding this interaction.

Project 2: Bridging Subjective Optimality and Success in Action


Bayesian decision theory describes optimal action from a subjective point of view: given an agent’s beliefs and desires, it describes the best act for the agent to take. However, you can have a perfect Bayesian agent that consistently fails in action (for example, perhaps they are entirely delusional about what the world is like). This project’s goal is to better understand the bridge principles needed between subjective, first person optimality and objective, third person success.
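A minimal sketch of the gap, under loud assumptions: the two acts, the payoffs, and especially the TRUE_CHANCE distribution are invented for illustration, and treating any distribution as "what the world is actually like" is itself a modelling assumption made by a third-person observer.

```python
# Hypothetical two-state world; all numbers are illustrative assumptions.
UTILITY = {
    "bet_on_heads": {"heads": 1.0, "tails": -1.0},
    "bet_on_tails": {"heads": -1.0, "tails": 1.0},
}
AGENT_BELIEF = {"heads": 0.9, "tails": 0.1}  # the agent's (delusional) credences
TRUE_CHANCE = {"heads": 0.1, "tails": 0.9}   # the observer's model of the world


def expected_utility(act, prob):
    """Expected utility of an act under a given distribution over states."""
    return sum(prob[s] * UTILITY[act][s] for s in prob)


def best_act(prob):
    """The act that maximizes expected utility relative to prob."""
    return max(UTILITY, key=lambda act: expected_utility(act, prob))


chosen = best_act(AGENT_BELIEF)
subjective_value = expected_utility(chosen, AGENT_BELIEF)
objective_value = expected_utility(chosen, TRUE_CHANCE)
```

The agent is perfectly optimal by its own lights (subjective_value is positive) while doing badly by the observer's lights (objective_value is negative). A bridge principle would say when these two evaluations are guaranteed to come apart by only a bounded amount.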

Theory of Change

Something like Bayesian/expected utility maximization seems useful for understanding agents and agency. However, there is the problem that expected utility theory doesn’t seem to predict anything in particular. We want a better response to “Expected utility theory doesn’t predict anything” that can describe the insight of EU theory re what agents are without being misinterpreted / without failing to constrain expectations at all technically. In particular, we want a response that identifies cases in which the expected utility behavior actually matters for what ends up happening in the real world. Understanding bridge principles will help clarify when EU maximization matters, and when it doesn't.

Plan of Attack

The first step for tackling this project would be to understand the extent to which the core question is already addressed by something like the grain of truth condition (see, for example, here and here). There are also a number of other promising directions. We might want to better understand the nature of hypotheses (what does it mean to include the truth?). John's optimization at a distance idea seems relevant, in the sense that agents with non-distant goals might wirehead and cease to be very agential. Similarly, the relationship between time-horizon length and agency seems worth exploring. We would also want to understand the kinds of guarantees we want: if the world is sufficiently adversarial and contains traps, then nothing can do well. Do we rule out such cases?
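A toy version of the idea, assuming the simplest possible setting: a finite class of coin-bias hypotheses and a fixed data sequence. This is only a single-agent illustration, and the hypothesis values and data are assumptions chosen for the sketch. The grain of truth condition in the literature is more general than this.

```python
import math


def posterior(hypotheses, prior, data):
    """Bayes update over Bernoulli hypotheses, done in log space for stability."""
    ones = sum(data)
    zeros = len(data) - ones
    log_post = [
        math.log(p) + ones * math.log(h) + zeros * math.log(1 - h)
        for h, p in zip(hypotheses, prior)
    ]
    top = max(log_post)
    weights = [math.exp(lp - top) for lp in log_post]
    total = sum(weights)
    return [w / total for w in weights]


# Deterministic stand-in for data from a coin with bias 0.7 (1000 flips).
DATA = ([1] * 7 + [0] * 3) * 100

# Hypothesis class that contains the truth (0.7) with positive prior weight.
with_truth = posterior([0.3, 0.5, 0.7], [1 / 3, 1 / 3, 1 / 3], DATA)

# Hypothesis class that leaves the truth out.
without_truth = posterior([0.3, 0.5], [0.5, 0.5], DATA)
```

When the truth is in the class, the posterior concentrates on it. When it is not, the posterior still concentrates confidently, just on the least-bad hypothesis, and nothing inside the agent's own perspective signals the difference.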

Assuming that the grain of truth condition seems fairly comprehensive, we would want to understand the extent to which agents can actually satisfy it, or approximate it. Broadly speaking, there seem to be two general strategies: be big, or grow.

The first strategy is to try to find some very large class of hypotheses that the agent can consider, and then consider all of the ones in the class (or, more realistically, approximate considering them). Solomonoff induction/AIXI basically pursues this strategy. However, there are reasons to think that this is not entirely satisfactory (see chapter 4 of this). Perhaps there are better ways to try to be big.

The second strategy is to not try to be big at the very beginning, but to grow the hypotheses under consideration in a way that gives us some good guarantees on approaching the grain of truth. We would first want to understand work that is already being done (for example, here and here), and see the extent to which it can be helpful in alignment.

We would characterize the trade-offs of both approaches, and try to extend/modify them as necessary to help with alignment.

Project 3: Characterizing Demons in Non-Expert Based Systems


We know that things like Solomonoff induction have demons: programs/experts that are competent enough to predict well and yet misaligned with the agent who is consulting them. There are also reasons to think you can get demons in search. Demons seem most clear/intuitive when we do something that looks like aggregating predictions of different “experts” (both Solomonoff induction and Predict-O-Matic seem to fit broadly into something like the Prediction with Expert Advice framework). However, if you are using a weighted average of expert predictions in order to generate a prior over some space, then it seems meaningful to say that the resulting prior also has demons, even if in fact it was generated a different way. This then leads us to ask: given an arbitrary prior over some algebra, is there a way to characterize whether or not the prior has demons? This project has two main goals: getting clearer on demons in systems that look like Prediction with Expert Advice, and spelling out specific conditions on distributions over algebras that lead to demons.
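The following sketch shows how a weighted mixture of experts induces a prior over sequences. The two experts, their weights, and the sequence length are all hypothetical choices made for illustration. Neither expert is meant to be a demon.

```python
from itertools import product


# Hypothetical experts: each maps a history (tuple of bits) to P(next bit = 1).
def always_half(history):
    return 0.5


def copy_last(history):
    # Predicts a repeat of the last bit with probability 0.9 (0.5 on the first step).
    if not history:
        return 0.5
    return 0.9 if history[-1] == 1 else 0.1


EXPERTS = {"always_half": always_half, "copy_last": copy_last}
WEIGHTS = {"always_half": 0.5, "copy_last": 0.5}  # assumed prior weights


def sequence_probability(expert, bits):
    """Probability the expert assigns to the whole bit sequence."""
    prob = 1.0
    for i, bit in enumerate(bits):
        p_one = expert(bits[:i])
        prob *= p_one if bit == 1 else 1.0 - p_one
    return prob


def mixture_prior(length):
    """The prior over sequences induced by the weighted mixture of experts."""
    return {
        bits: sum(WEIGHTS[name] * sequence_probability(expert, bits)
                  for name, expert in EXPERTS.items())
        for bits in product((0, 1), repeat=length)
    }


PRIOR = mixture_prior(3)
```

The resulting PRIOR is just a table of eight numbers. Nothing in the table records that it came from experts, and many different expert collections would produce the same table, which is why the question of whether a bare prior "has demons" needs a characterization that does not depend on how the prior was generated.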

Theory of Change

Understanding demons better, and the extent to which they can “live” in things (probability distributions) that don’t look exactly like experts, should help us understand various inner alignment failure modes. For example, can something like what happens at the end of the Predict-O-Matic parable happen in systems that aren’t explicitly consulting subprograms/experts (and if so, how likely is that)?

Plan of Attack

This is less specified at the moment. We would start by reviewing the relevant literature/posts, and then try to formalize our question. A lot of this might look like conceptual work: trying to understand the relationship between arbitrary probability distributions and the various processes that we think might generate them. Understanding representation theorems (for example, de Finetti's or, perhaps more relevantly, theorem 2.13 of this) might be helpful. With a better understanding of this relationship, we would then try to say something meaningful about systems that are not obviously/transparently using something like prediction with expert advice.

Project 4: Dealing with no Ground Truth in Human Preferences


We want to align agents we build with what we want. But, what do we want? Human preferences are inconsistent, incomplete, unstable, path-dependent, etc. Our preferences do not admit of a principled utility representation. In other words, there is no real ground truth about our preferences. This project would explore different ways of trying to deal with this problem, in the context of alignment.
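One simple way a utility representation can fail is a strict-preference cycle: if a is strictly preferred to b, b to c, and c to a, then any utility function u would need u(a) > u(b) > u(c) > u(a), which is impossible. The sketch below checks for this. The function name and the input format (a set of (better, worse) pairs) are assumptions for illustration, and cycles are only one of the failure modes listed above.

```python
def find_cycle(strict_prefs):
    """Return a list of options forming a strict-preference cycle, or None.

    strict_prefs is a set of (better, worse) pairs.
    """
    graph = {}
    for better, worse in strict_prefs:
        graph.setdefault(better, []).append(worse)
        graph.setdefault(worse, [])

    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on current path, finished
    colour = {node: WHITE for node in graph}
    stack = []

    def visit(node):
        colour[node] = GREY
        stack.append(node)
        for nxt in graph[node]:
            if colour[nxt] == GREY:
                # nxt is already on the current path, so we have closed a cycle.
                return stack[stack.index(nxt):]
            if colour[nxt] == WHITE:
                found = visit(nxt)
                if found:
                    return found
        stack.pop()
        colour[node] = BLACK
        return None

    for node in graph:
        if colour[node] == WHITE:
            found = visit(node)
            if found:
                return found
    return None
```

For example, `find_cycle({("a", "b"), ("b", "c"), ("c", "a")})` returns the three options in some rotation, while the same set without the last pair returns `None`.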

Theory of Change

This project has a very robust theory of change. If we want to make sure systems we build are outer aligned, they will have to have some way of inferring what we want. This is true regardless of how we build such systems: we might build a system that we explicitly give a utility function, and in order to build it we have to know what we want. We might build systems that themselves try to infer what we want from behavior, examples, etc. If what they are trying to infer (coherent preferences) doesn’t exist, then this might pose a problem (for example, pushing us towards having preferences that are easier to satisfy). Understanding how to deal with a lack of ground truth might help us avoid various outer alignment failures.

Plan of Attack

First, review the relevant literature. In particular, Louis Narens and Brian Skyrms have tried to deal with a similar problem when it comes to interpersonal comparisons of utility. This is a case in which there are no measurement-theoretic foundations available for such a comparison. Their approach embraces convention as a way to learn something that is not a preexisting truth. We are interested in understanding the advantages and disadvantages of such an approach, and seeing what this can teach us about conventionality and lack of ground truth in human preferences more broadly. (Example path of inquiry: treat an individual with path-dependent preferences as a set of individuals with different preferences, and use the Narens and Skyrms approach to make tradeoffs between these different possible future preferences. Will this work? Will this be satisfactory? That question is also part of the project.)

Project 5: Subjective Probability and Alignment


This is the vaguest project. Daniel and Josiah think that our best accounts of probability and possibility are subjective: the degrees of belief and epistemic possibilities of an agent, respectively. If one takes this perspective seriously, then this puts pressure on various projects and frameworks in alignment that seem to rely on more objective notions of probability, possibility, information, etc. So, there is a kind of negative project available here: characterize the extent to which various proposals/strategies/frameworks in alignment are undermined by appealing to ill-founded notions of probability. Unfortunately, it is less clear to us at this point what a positive contribution would be. Thus, perhaps this kind of project is best left for smaller, independent posts, not as part of PIBBSS. But, if we could come up with some actual positive direction for this project, that might be really good as well.

Theory of Change

The theory of change of the negative version of the project is clear: insofar as it corrects or at least sheds light on certain alignment strategies, it will help alignment. Since we do not yet have a clear idea for a positive project here, we do not know what the theory of change would be.

Plan of Attack

For the negative project, take a framework/proposal/strategy in alignment, and describe the extent to which it relies on ungrounded uses of probability. For example, we've been thinking about where probability/possibility might be a problem for Cartesian frames. For the positive project, the plan is less clear, given how vague the project is.


We appreciate any and all comments, references to relevant literature, corrections, disagreements on various framings, votes or bids for which project to do, and even descriptions of neighboring projects that you think we might be well suited to attack this summer.

[1] For example, in the binary sequence prediction context, an agent's algebra might be the minimal σ-field generated by the cylinder sets.

5 comments

Something like Bayesian/expected utility maximization seems useful for understanding agents and agency. However, there is the problem that expected utility theory doesn’t seem to predict anything in particular. We want a better response to “Expected utility theory doesn’t predict anything” that can describe the insight of EU theory re what agents are without being misinterpreted / without failing to constrain expectations at all technically.

Agents are policies with a high value of g. So, "EU theory" does "predict" something, although it's a "soft" prediction (i.e. agency is a matter of degree).

Re: Project 2

This project’s goal is to better understand the bridge principles needed between subjective, first person optimality and objective, third person success.

This seems quite valuable, because there is, properly speaking, no objective, third person perspective from which we can speak, only the inferred sense that there exists something that looks to us like a third person perspective from our first person perspectives. Thus I think this seems like a potentially fruitful line of research, since the proposed premise contains the confusion that needs to be unraveled to get to addressing what is something more like the intersubjective agreement on what the world is like.

Dealing with no Ground Truth in Human Preferences

A variation on this: If preference is known, but difficult to access in some sense. For example, estimates change in time outside agent's control, like market data for some security regarding any given question of "expected utility", actual preference is the dividends that haven't been voted on yet, or else there is a time-indexed sequence of utility functions that converges in some sense (probably with strong requirements that make the limit predictable in a useful way), and what matters is expected utility according to the limit of this sequence. Or there is a cost for finding out more, so that good things happening without having to be known to be good are better, and it's useful to work out which queries to preference are worth paying for. Or there is a logical system for reasoning about preference (preference is given by a program). How do you build an agent that acts on this?

Is there something intended to be an optimizer for this setting that ends up essentially doing soft optimization instead, because of the difficulty in accessing preference? One possibility/explanation for why this might happen is treating optimizer's own unknown/intractable preference as adversarially assigned, as moves of the other player in a game that should be won, packaging intractability of preference in the other player's strategy.

In the case of preference-as-computation, there is the usual collection of embedded agency issues where the agent might control preference and the question of predicting/computing it is not straightforward, the answer might depend on agent's own behavior (which is related to demons, ASP, and control via approximate and possibly incorrect predictions), there might be spurious proofs of preference being a certain way (an inner alignment problem of preference specification or of a computation that reasons about it).

It's often said that if agent's preference is given by the result of running a program that's not immediately tractable, then the agent is motivated to work on computing it. How do we build a toy model of this actually happening? Probably something about value of information, but value is still intractable when value of information needs to be noticed.

Re Project 4, you might find my semi-abandoned (mostly because I wasn't and still am not in a position to make further progress on it) research agenda for deconfusing human values useful.

This work by Michael Aird and Justin Shovelain might also be relevant: "Using vector fields to visualise preferences and make them consistent"

And I have a post where I demonstrate that reward modeling can extract utility functions from non-transitive preference orderings: "Inferring utility functions from locally non-transitive preferences"

(Extremely cool project ideas btw)