Complete Class: Consequentialist Foundations

[-]jessicata7y60

Making the hyperplane argument as before, we get a π which places positive weight on each individual. This is interpreted as each individual's weight in the coalition. The collective decision must be the result of a (positive) linear combination of each individual's cardinal utilities -- and those cardinal utilities can in turn be constructed via an application of VNM to individual ordinal preferences.

What this says is that any Pareto-optimal outcome can be rationalized as maximizing a positive linear combination of individual utilities, not that it can be generated in this way. For example, Nash bargaining results in Pareto-optimal outcomes, yet it can't be specified as the unique maximization of some positive linear combination of individual utilities. After running the algorithm, the result is optimal according to some linear combination of individual utilities, but this is a rationalization rather than the actual generation procedure. (This also works as a criticism of Bayesianism.)
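A minimal numerical sketch of the rationalization-versus-generation point. Everything here is an illustrative assumption rather than anything from the post: the frontier shape `u2 = scale * (1 - u1**2)`, the disagreement point at (0, 0), and all function names. The weights that make each Nash outcome look like a linear maximum are read off after the outcome is known, and the weights from one game give a very different answer in the other.

```python
import numpy as np

def frontier(scale, n=2001):
    """Hypothetical Pareto frontier: u2 = scale * (1 - u1**2) for u1 in [0, 1]."""
    u1 = np.linspace(0.0, 1.0, n)
    return np.stack([u1, scale * (1.0 - u1**2)], axis=1)

def nash_point(points):
    """Nash bargaining with disagreement point (0, 0): maximize the product u1 * u2."""
    return points[np.argmax(points[:, 0] * points[:, 1])]

def linear_point(points, w):
    """Maximize the fixed linear combination w[0]*u1 + w[1]*u2."""
    return points[np.argmax(points @ w)]

def rationalizing_weights(u):
    """Normalized gradient of u1*u2 at u: weights that are only known after the fact."""
    w = np.array([u[1], u[0]])
    return w / w.sum()

game_a, game_b = frontier(1.0), frontier(3.0)
nash_a, nash_b = nash_point(game_a), nash_point(game_b)
w_a, w_b = rationalizing_weights(nash_a), rationalizing_weights(nash_b)

print("Nash outcome in A:", nash_a, "rationalizing weights:", w_a)
print("Nash outcome in B:", nash_b, "rationalizing weights:", w_b)
print("A's weights applied to B:", linear_point(game_b, w_a))
```

In this toy setup each game's own weights recover its Nash outcome (up to grid resolution), but no single weight vector does so for both games, which is one way of seeing that the linear combination describes the result rather than producing it.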

[-]abramdemski7y50

I basically agree with this criticism, and would like to understand what the alternative to Bayesian decision theory which comes out of the analogy would be.

[-]cousin_it7y30

I think when several AIs with bounded utility functions decide to merge, they can reach any point on the Pareto frontier like this:

1) Allow linear combinations of utility functions. This lets you reach all "pointy" points.

2) Allow making a tuple of functions of type (1) whose values should be compared lexicographically (e.g. "maximize U+V, break ties by maximizing U"). This lets you reach some points on the edges of flat parts.

3) Allow the merging process to choose randomly which function of type (2) to give to the merged AI. This lets you reach the rest of the points on flat parts.

That's a bit complicated, but I don't think there's a simpler way.
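The three steps above can be sketched on a toy example. The payoffs and helper names below are hypothetical, and in a finite example like this one plain linear weights could already reach the end points, so the sketch only illustrates the mechanics of lexicographic tie-breaking (step 2) and random selection of the objective (step 3).

```python
import random

# Hypothetical (U, V) payoffs; the first three tie on U + V, forming a flat part.
outcomes = [(3, 1), (2, 2), (1, 3), (0, 0)]

def total(o):
    return o[0] + o[1]

def u_only(o):
    return o[0]

def v_only(o):
    return o[1]

def lexicographic_choice(options, objectives):
    """Maximize objectives[0], breaking ties by objectives[1], and so on."""
    return max(options, key=lambda o: tuple(f(o) for f in objectives))

favour_u = [total, u_only]  # "maximize U+V, break ties by maximizing U"
favour_v = [total, v_only]  # "maximize U+V, break ties by maximizing V"

def merged_objectives(rng, prob_u=0.5):
    """Step 3: the merging process randomly picks which tuple the merged AI gets."""
    return favour_u if rng.random() < prob_u else favour_v

rng = random.Random(0)
picks = [lexicographic_choice(outcomes, merged_objectives(rng)) for _ in range(10_000)]
mean_u = sum(p[0] for p in picks) / len(picks)
mean_v = sum(p[1] for p in picks) / len(picks)
print("expected (U, V) of the random merge:", (mean_u, mean_v))
```

Changing `prob_u` slides the expected payoff along the flat part between the two lexicographic end points.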

[-]jessicata7y10

I don't see why 2 is necessary given that any point on the Pareto frontier is a mixture of pointy points (intuition for this: any point on the face of a polyhedron is a mixture of that face's corners). In any case, I agree with the basic mathematical point that you can get any Pareto optimal mixture of outcomes by mixing between non-negative linear combinations of utility functions.
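A small sketch of the polyhedron intuition, assuming a finite set of pure outcomes with made-up (U, V) payoffs. The target point (2, 3) lies in the middle of a flat face. No single weight vector picks it out uniquely, but it is the 50/50 mixture of two corners, each of which is the unique maximizer of some positive linear combination.

```python
import numpy as np

# Hypothetical pure outcomes in (U, V) space; the first two span a flat face.
outcomes = np.array([[0.0, 4.0], [4.0, 2.0], [5.0, 0.0]])

def maximizers(w):
    """Indices of all outcomes that maximize w[0]*U + w[1]*V."""
    scores = outcomes @ np.asarray(w, dtype=float)
    return np.flatnonzero(np.isclose(scores, scores.max()))

def unique_corner(w):
    """The single outcome picked by weights w; fails if w leaves a tie."""
    idx = maximizers(w)
    assert len(idx) == 1, "these weights do not pick a unique corner"
    return outcomes[idx[0]]

# Weights normal to the flat face tie between its two corners.
print("tied maximizers for w = (1, 2):", maximizers([1.0, 2.0]))

corner_a = unique_corner([1.0, 3.0])  # leans toward V
corner_b = unique_corner([1.0, 1.0])  # leans toward U
mixture = 0.5 * corner_a + 0.5 * corner_b
print("50/50 mixture of the two corners:", mixture)
```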

[-]cousin_it7y20

Well, I was imagining a Pareto frontier that changes smoothly from flat to curved. Then we can't quite get a pointy point exactly on the edge of the flat part. That's what 2 is for, it gives us some of these points (though not all). But I guess that doesn't matter if things are finite enough.

[-]jessicata7y10

Ok, that seems right.

[-]jessicata7y30

I think the basic CCT theorem is wrong. Consider a game where there is a coin that will be flipped, and you are going to predict the probability that this coin comes up heads. You will be scored using a proper scoring rule, such as square error (i.e. you get loss (1-p)^2 if it's heads and p^2 if it's tails). The policy of always saying p = 0 is admissible, since no other policy is better when the coin always comes up tails. But it is not the result of any non-dogmatic prior (which would say a probability strictly between 0 and 1).
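A quick numerical check of this counterexample, as a sketch under stated assumptions: p is the stated probability of heads, the loss is (1-p)^2 on heads and p^2 on tails (the convention under which p = 0 is perfect on tails), and only constant forecasts on a grid are compared. The function names are made up for illustration.

```python
import numpy as np

def loss(p, heads):
    """Squared-error loss for stating probability p of heads."""
    return (1.0 - p) ** 2 if heads else p ** 2

def dominates(p_new, p_old):
    """True if p_new is at least as good in both states and strictly better in one."""
    new = (loss(p_new, True), loss(p_new, False))
    old = (loss(p_old, True), loss(p_old, False))
    no_worse = all(n <= o for n, o in zip(new, old))
    better = any(n < o for n, o in zip(new, old))
    return no_worse and better

grid = np.linspace(0.0, 1.0, 101)

def bayes_forecast(q):
    """Grid forecast minimizing expected loss under prior probability q of heads."""
    expected = q * loss(grid, True) + (1.0 - q) * loss(grid, False)
    return grid[np.argmin(expected)]

# Admissibility of p = 0 among grid forecasts: nothing dominates it.
p_zero_admissible = not any(dominates(p, 0.0) for p in grid)
print("p = 0 admissible on the grid:", p_zero_admissible)

# Any prior strictly between 0 and 1 recommends a forecast other than p = 0.
for q in (0.1, 0.5, 0.9):
    print("prior", q, "-> Bayes forecast", bayes_forecast(q))
```

Since the expected loss q(1-p)^2 + (1-q)p^2 is minimized at p = q, the forecast p = 0 is Bayes-optimal only for the dogmatic prior q = 0, even though nothing dominates it.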

I didn't understand your proof. I get the argument for why there's a hyperplane but can't the hyperplane be parallel to one of the axes, so it never intersects that one?

[-]abramdemski7y10

Ah, yeah, you're right. The separating hyperplane theorem only gives us $\leq$, and I was assuming $<$.

I think "admissible if and only if non-dogmatic" may still hold as I stated it, because I don't see how to set up an example like the one you give when $A$ is finite. I'm editing the post anyway, since (1) I don't know how to show that at the moment, and (2) the if and only if falls apart for infinite action sets as in your example anyway, which makes it kind of feel "wrong in spirit".

[-]cousin_it7y30

Yeah, I explored this direction pretty thoroughly a few years ago. The simplest way is to assume that agents don't have probabilities, only utility functions over combined outcomes, where a "combined outcome" is a combination of outcomes in all possible worlds. (That also takes care of updating on observations, we just follow UDT instead.) Then if we have two agents with utility functions U and V over combined outcomes, any Pareto-optimal way of merging them must behave like an agent with utility function aU+bV for some a and b. The theory sheds no light on choosing a and b, so that's as far as it goes. Do you think there's more stuff to be found?

[-]abramdemski7y20

It sounds like you considered a more general setting than I am at the moment. I want to eventually move to that kind of "combined outcome" setting, but first, I want to understand more classical preference structures and break things one at a time.

Do you think your version sheds any light on value learning in UDT? I had a discussion with Alex Appel about this, in which it seemed like you have a "nosy neighbors" problem, where a potential set of values may care about what happens even in worlds where different values hold; but this problem seemed to be bounded by such other-world preferences acting like beliefs. For example, you could imagine a UDT agent with world-models in which either vegetarianism or carnivorism is right (and which somehow make different predictions). Each set of preferences can either be "nosy" (cares what happens regardless of which facts end up true) or "non-nosy" (each preference set only cares about what happens in its own world -- vegetarianism cares about the amount of meat eaten in veg-world, and carnivorism cares about the amount of meat eaten in carn-world).

The claim which seemed plausible was that nosiness has some kind of balancing behavior which acts like probability: putting some of your caring measure on other worlds reduces your caring measure on your own.

Anything structurally similar in your framework?

[-]cousin_it7y10

By nosy preferences, do you mean something like this?

"I am grateful to Zeus for telling me that cows have feelings. Now I know that, even if Zeus had told me that cows are unfeeling brutes, eating them would still be wrong."

But that just seems irrational and not worth modeling. Or do you have some other kind of situation in mind?

[-]Diffractor7y20

Pretty much that, actually. It doesn't seem too irrational, though. Upon looking at a mathematical universe where torture was decided upon as a good thing, it isn't an obvious failure of rationality to hope that a cosmic ray flips the sign bit of the utility function of an agent in there.

The practical problem with values that care about other mathematical worlds, however, is that if the agent you built has a UDT prior over values, it's an improvement (from the perspective of the prior) for the nosy neighbors/values that care about other worlds to dictate some of what happens in your world (since the marginal contribution of your world to the prior expected utility looks like some linear combination of the various utility functions, weighted by how much they care about your world). So, in practice, it'd be a bad idea to build a UDT value learning prior containing utility functions that have preferences over all worlds, since it'd add a bunch of extra junk from different utility functions to our world if run.

[-]cousin_it7y10

Are you talking about something like this?

"I'm grateful to HAL for telling me that cows have feelings. Now I'm pretty sure that, even if HAL had a glitch and mistakenly told me that cows are devoid of feeling, eating them would still be wrong."

That's valid reasoning. The right way to formalize it is to have two worlds, one where eating cows is okay and another where eating cows is not okay, without any "nosy preferences". Then you receive probabilistic evidence about which world you're in, and deal with it in the usual way.

[-]abramdemski7y10

I'm not clear on whether it is rational or not. It seems like behavior we don't want from a value learner, but I was curious about how "inevitable" it is from attempts to mix updatelessness with value learning. (Perhaps it is a really simple point, but I haven't thought it entirely through, still.)

[-]cousin_it7y40

I have a recent result about value learning in UDT, it turns out to work very nicely and doesn't suffer from the problem you describe.

[-]abramdemski7y10

Another way in which there might be something interesting in this direction is if we can further formalize Scott's argument about when Bayesian probabilities are appropriate and inappropriate, which is framed in terms of Pareto-style justifications of Bayesianism.

[-]cousin_it7y40

Well, the version of UDT I'm using doesn't have probabilities, only a utility function over combined outcomes. It's just a simpler way to think about things. I think you and Scott might be overestimating the usefulness of probabilities. For example, in the Sleeping Beauty problem, the coinflip is "spacelike separated" from you (under Scott's peculiar definition), but it can be assigned different "probabilities" depending on your utility function over combined outcomes.

[-]abramdemski7y10

That seems good to understand better in itself, but it isn't a crux for the argument. Whether you've got "probabilities" or a "caring measure" or just raw utility which doesn't reduce to anything like that, it still seems like you're justifying it with Pareto-type arguments. Scott's claim is that Pareto-type arguments won't apply if you correctly take into account the way in which you have control over certain things. I'm not sure if that makes any sense, but basically the question is whether CCT can make sense in a logical setting where you may have self-referential sentences and so on.

[-]cousin_it7y10

That's a great question. My current (very vague) idea is that we might need to replace first order logic with something else. A theory like PA is already updateful, because it can learn that a sentence is true, so trying to build updateless reasoning on top of it might be as futile as trying to build updateless reasoning on top of probabilities. But I have no idea what an updateless replacement for first order logic could look like.

[-]abramdemski7y10

Another part of the idea (not fully explained in Scott's post I referenced earlier) is that nonexploited bargaining (AKA bargaining away from the Pareto frontier AKA cooperating with agents with different notions of fairness) provides a model of why agents should not just take Pareto improvements all the time, and may therefore be a seed of "non-Bayesian" decision theory (in so far as Bayes is about taking Pareto improvements).

[-]jessicata7y10

We then apply the VNM theorem within each θ, to get a cardinal-valued utility within each world. The CCT argument can then proceed as usual.

VNM assumes you have preferences over lotteries, which imply belief in frequencies/probabilities.

[-]abramdemski7y10

In the context in the post, I'm saying that before we get rid of an assumption of probabilities, it is easier to first get rid of the assumption that we have a real-valued L (proving it from other assumptions instead). We do this by applying VNM, still assuming mixed strategies are described by probabilities. I finally get rid of that assumption a few paragraphs later. I edited the post a little to try and clarify.

[-]rk7y10

I'd like to check my understanding of the last two transitions a little. If someone could check the below I'd be grateful.

When we move to the utilitarianism-like quadrant, individual actors have a preference ordering over actions (not states). So if they were the kind of utilitarians we often think about (with utilities attached to states), that would be something like ordering the actions by their expected utility (as they see it). So we actually get something like combined epistemic modesty and utilitarianism here?

Then, when we move to the futarchy-like quadrant, we additionally elicit probabilities for (all possible?) observations from each agent. We give them less and less weight in the decision calculus when their predictions are wrong. This stops agents that have crazy beliefs reliably voting us into world states they wouldn't like anyway (though I think that world states aren't present in the formalism in this quadrant).

Does the above paragraph mean that people with unique preferences and crazy beliefs eventually end up without having their preferences respected (whereas someone with unique preferences and accurate beliefs would still have their preferences respected)?

(Some extra questions. I'm more interested in answers to questions above the line, so feel free to skip these)

Also, do we have to treat the agents as well-calibrated across all domains? Or is the system able to learn that their thoughts should be given weight in some circumstances and not others? The reason I think we can't do that is because it seems like there is just one number that represents the agent's overall accuracy ($\theta_i$).

A possible fix to the above is that individual agents could do this subject-specific evaluation of other agents and would update their credences based on partially-accurate agents, thus the information still gets preserved. But I think this leads to another problem: could there be a double-counting when both Critch's mechanism and other agents pick up on the accuracy of an agent? Or are we fine because agents who over-update on others' views also get their vote penalised?

[-]abramdemski7y20

Does the above paragraph mean that people with unique preferences and crazy beliefs eventually end up without having their preferences respected (whereas someone with unique preferences and accurate beliefs would still have their preferences respected)?

Yes. This might be too harsh. The "libertarian" argument in favor of it is: who are you to keep someone from betting away all of their credit in the system? If you make a rule preventing this, agents will tend to want to find some way around it. If you just give some free credit to agents who are completely out, this harms the calibration of the system by reducing the incentive to be sane about your bets.

On the other hand, there may well be a serious game-theoretic reason why it is "too harsh": someone who is getting no cooperation from the system has no reason to cooperate in turn. I'm curious if a CCT-adjacent formalism could capture this (or some other reason to be gentler). That would be the kind of thing which might have interesting analogues when we try to import insights back into decision theory.

Also, do we have to treat the agents as well-calibrated across all domains? Or is the system able to learn that their thoughts should be given weight in some circumstances and not others?

In the formalism, no, you just win or lose points across all domains. Realistically, it seems prudent to introduce stuff like that.
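The single-score bookkeeping discussed here can be sketched as follows. This is a guess at the mechanics, not the post's formalism: it assumes each agent's weight is multiplied by the probability that agent assigned to what was actually observed and then renormalized, and the agents, probabilities, and observations are all made up.

```python
import numpy as np

def update_weights(weights, probs_of_observed):
    """Scale each agent's weight by the probability it gave the observed event, then renormalize."""
    new = np.asarray(weights, dtype=float) * np.asarray(probs_of_observed, dtype=float)
    return new / new.sum()

# Hypothetical agents: index 0 is well calibrated, index 1 is badly overconfident.
weights = np.array([0.5, 0.5])
p_heads = np.array([0.5, 0.99])  # each agent's stated probability of heads

# Hypothetical observations; they could come from any mix of domains,
# because there is only one weight per agent.
flips = [True, False, True, False, False, True]

for heads in flips:
    weights = update_weights(weights, p_heads if heads else 1.0 - p_heads)

print("final weights:", weights)
```

Because the score is a single number per agent, a bad bet in one domain drains weight everywhere, which is the "win or lose points across all domains" behavior. A per-domain variant would need a separate weight vector for each domain.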

A possible fix to the above is that individual agents could do this subject-specific evaluation of other agents and would update their credences based on partially-accurate agents, thus the information still gets preserved.

That's exactly what could happen in a logical-induction-like setting.

could there be a double-counting when both Critch's mechanism and other agents pick up on the accuracy of an agent?

There might temporarily be all sorts of crazy stuff like this, but we know it would (somehow) self-correct eventually.

AI ALIGNMENT FORUM