Formal Philosophy and Alignment Possible Projects

Vanessa Kosoy (3y)

Something like Bayesian/expected utility maximization seems useful for understanding agents and agency. However, there is the problem that expected utility theory doesn’t seem to predict anything in particular. We want a better response to “Expected utility theory doesn’t predict anything” that can describe the insight of EU theory re what agents are without being misinterpreted / without failing to constrain expectations at all technically.

Agents are policies with a high value of g. So, "EU theory" does "predict" something, although it's a "soft" prediction (i.e. agency is a matter of degree).

Gordon Seidoh Worley (3y)

Re: Project 2

This project’s goal is to better understand the bridge principles needed between subjective, first person optimality and objective, third person success.

This seems quite valuable because there is, properly speaking, no objective, third person perspective from which we can speak, only the inferred sense that something exists that looks to us like a third person perspective from our first person perspectives. So this seems like a potentially fruitful line of research: the proposed premise contains the confusion that needs to be unraveled in order to address something more like intersubjective agreement on what the world is like.

Vladimir_Nesov (3y, edited)

Dealing with no Ground Truth in Human Preferences

A variation on this: the preference is known, but difficult to access in some sense. For example, estimates change over time outside the agent's control, like market data for some security: for any given question of "expected utility", the actual preference is the dividends that haven't been voted on yet. Or there is a time-indexed sequence of utility functions that converges in some sense (probably with strong requirements that make the limit predictable in a useful way), and what matters is expected utility according to the limit of this sequence. Or there is a cost for finding out more, so that good things happening without having to be known to be good are better, and it's useful to work out which queries to preference are worth paying for. Or there is a logical system for reasoning about preference (preference is given by a program). How do you build an agent that acts on this?
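As a minimal sketch of the "which queries are worth paying for" case, assuming a finite set of preference hypotheses and a query that reveals the true one exactly: the query is worth its cost when the expected gain from acting informed exceeds that cost. The hypothesis names, utilities, and `query_cost` below are hypothetical, chosen only for illustration.

```python
from typing import Dict, List

def best_expected_utility(belief: Dict[str, float],
                          utilities: Dict[str, Dict[str, float]],
                          actions: List[str]) -> float:
    """Expected utility of the best single action under the current belief."""
    return max(
        sum(p * utilities[h][a] for h, p in belief.items())
        for a in actions
    )

def value_of_perfect_query(belief: Dict[str, float],
                           utilities: Dict[str, Dict[str, float]],
                           actions: List[str]) -> float:
    """Expected gain from learning the true preference hypothesis before acting."""
    informed = sum(
        p * max(utilities[h][a] for a in actions)
        for h, p in belief.items()
    )
    return informed - best_expected_utility(belief, utilities, actions)

# Hypothetical toy numbers (assumptions, not from the thread).
belief = {"likes_tea": 0.5, "likes_coffee": 0.5}
utilities = {
    "likes_tea": {"serve_tea": 1.0, "serve_coffee": 0.0},
    "likes_coffee": {"serve_tea": 0.0, "serve_coffee": 1.0},
}
actions = ["serve_tea", "serve_coffee"]

voi = value_of_perfect_query(belief, utilities, actions)
query_cost = 0.25
should_query = voi > query_cost  # pay for the query only if it is worth more than it costs
```

This only covers the tractable case. It assumes the agent can already evaluate each hypothesis's utilities, which is exactly what the harder variants above deny.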

Is there something intended to be an optimizer for this setting that ends up essentially doing soft optimization instead, because of the difficulty in accessing preference? One possible explanation for why this might happen is treating the optimizer's own unknown or intractable preference as adversarially assigned, that is, as the moves of the other player in a game that should be won, packaging the intractability of preference into the other player's strategy.
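One way to read the adversarial framing, as a rough sketch and not a formalization from this thread: if the agent only knows a set of candidate utility functions and assumes the worst one is chosen against it, a maximin rule tends to avoid extreme actions. The candidate utilities and action names below are hypothetical.

```python
from typing import Dict, List

def maximin_action(candidates: List[Dict[str, float]], actions: List[str]) -> str:
    """Pick the action whose worst-case utility over the candidate preferences is highest."""
    return max(actions, key=lambda a: min(u[a] for u in candidates))

# Two hypothetical candidate preferences that disagree sharply on the extreme actions.
candidates = [
    {"extreme_x": 10.0, "extreme_y": -10.0, "moderate": 3.0},
    {"extreme_x": -10.0, "extreme_y": 10.0, "moderate": 3.0},
]
actions = ["extreme_x", "extreme_y", "moderate"]

chosen = maximin_action(candidates, actions)
```

In this toy case the rule picks the moderate action, because each extreme action is catastrophic under one of the candidates. Whether this matches soft optimization in any precise sense is the open question the comment raises.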

In the case of preference-as-computation, there is the usual collection of embedded agency issues: the agent might control its preference, and predicting or computing it is not straightforward, because the answer might depend on the agent's own behavior (which is related to demons, ASP, and control via approximate and possibly incorrect predictions). There might also be spurious proofs of preference being a certain way (an inner alignment problem of preference specification, or of a computation that reasons about it).

It's often said that if an agent's preference is given by the result of running a program that's not immediately tractable, then the agent is motivated to work on computing it. How do we build a toy model of this actually happening? Probably something about value of information, but value is still intractable at the point when value of information needs to be noticed.

Gordon Seidoh Worley (3y)

Re Project 4, you might find my semi-abandoned (mostly because I wasn't and still am not in a position to make further progress on it) research agenda for deconfusing human values useful.

Jan (3y)

This work by Michael Aird and Justin Shovelain might also be relevant: "Using vector fields to visualise preferences and make them consistent"

And I have a post where I demonstrate that reward modeling can extract utility functions from non-transitive preference orderings: "Inferring utility functions from locally non-transitive preferences"
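To illustrate the general idea, here is a minimal sketch that fits scalar utilities to pairwise comparisons containing a cycle. It assumes a Bradley-Terry model fitted by plain gradient ascent, which is a stand-in chosen for brevity and not necessarily the method used in the linked post. The comparison counts are hypothetical.

```python
import math
from typing import Dict, Tuple

def fit_bradley_terry(wins: Dict[Tuple[str, str], int],
                      steps: int = 2000, lr: float = 0.05) -> Dict[str, float]:
    """Fit utilities u so that P(i preferred to j) = sigmoid(u[i] - u[j]).

    wins[(i, j)] is the number of times i was preferred to j.
    """
    items = sorted({x for pair in wins for x in pair})
    u = {x: 0.0 for x in items}
    for _ in range(steps):
        grad = {x: 0.0 for x in items}
        for (winner, loser), count in wins.items():
            p_win = 1.0 / (1.0 + math.exp(u[loser] - u[winner]))
            # Gradient of count * log(sigmoid(u[winner] - u[loser])).
            grad[winner] += count * (1.0 - p_win)
            grad[loser] -= count * (1.0 - p_win)
        for x in items:
            u[x] += lr * grad[x]
        # Utilities are only defined up to an additive constant; center them.
        mean = sum(u.values()) / len(u)
        for x in items:
            u[x] -= mean
    return u

# Hypothetical data: A usually beats B, B usually beats C,
# but C beats A slightly more often than not (a non-transitive cycle).
wins = {
    ("A", "B"): 8, ("B", "A"): 2,
    ("B", "C"): 8, ("C", "B"): 2,
    ("C", "A"): 6, ("A", "C"): 4,
}
u = fit_bradley_terry(wins)
ranking = sorted(u, key=u.get, reverse=True)
```

The fitted utilities give a consistent ordering even though the raw comparisons contain a cycle. The cost is that the model then disagrees with the observed majority on the C versus A comparison.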

(Extremely cool project ideas btw)

For example, in the binary sequence prediction context, an agent's algebra might be the minimal σ-field generated by the cylinder sets.
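For concreteness, one standard way to write this down is sketched below. The notation is an assumption introduced here, not taken from the post: a cylinder set $[x]$ collects all infinite binary sequences that begin with the finite string $x = x_1 \dots x_n$.

```latex
\[
  [x_1 \dots x_n] \;=\; \bigl\{\, \omega \in \{0,1\}^{\mathbb{N}} : \omega_1 = x_1, \dots, \omega_n = x_n \,\bigr\},
  \qquad
  \mathcal{F} \;=\; \sigma\bigl(\{\, [x] : x \in \{0,1\}^{*} \,\}\bigr).
\]
```

Here $\sigma(\cdot)$ denotes the smallest σ-field containing the given collection of sets.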

AI ALIGNMENT FORUM

Context

Possible Projects

Project 1: Inferring Algebras from Behavior

Description

Theory of Change

Plan of Attack

Project 2: Bridging Subjective Optimality and Success in Action

Description

Theory of Change

Plan of Attack

Project 3: Characterizing Demons in Non-Expert Based Systems

Description

Theory of Change

Plan of Attack

Project 4: Dealing with no Ground Truth in Human Preferences

Description

Theory of Change

Plan of Attack

Project 5: Subjective Probability and Alignment

Description

Theory of Change

Plan of Attack

Conclusion