At the end of my post on needing a theory of human values, I stated that the three components of such a theory were:
- A way of defining the basic preferences (and basic meta-preferences) of a given human, even if these are under-defined or situational.
- A method for synthesising such basic preferences into a single utility function or similar object.
- A guarantee we won't end up in a terrible place, due to noise or different choices in the two definitions above.
In this post, I sketch out methods for the first two components, then look at what the third might look like: what we can expect from such a guarantee, and some of the issues with it.
Basic human preferences
For the first point, I'm defining a basic preference as existing within the mental models of a human.
Any preference judgement within that model - that some outcome was better than another, that some action was a mistake, that some behaviour was foolish, that someone is to be feared - is defined to be a basic preference.
Basic meta-preferences work in the same way, with meta-preferences just defined to be preferences over preferences (or over methods of synthesising preferences). I also include odd meta-preferences here, such as preferences over beliefs. I'll try to transform these odd preferences into "identity preferences": preferences over the kind of person you want to be.
To define that, we need to define the class of "reasonable" situations in which to have these mental models. These could be real situations (Mrs X thought that she'd like some sushi as she went past the restaurant) or counterfactual (if Mr Y had gone past that restaurant, he would have wanted sushi). The "one-step hypotheticals post" is about defining these reasonable situations.
Anything that occurs outside of a reasonable situation is discarded as not indicative of genuine basic human preference; this is due to the fact that humans can be persuaded to endorse/unendorse almost anything in the right situation (eg by drugs or brain surgery, if all else fails).
We can have preferences and meta-preferences over non-reasonable situations (what to do in a world where plants were conscious?), as long as these preferences and meta-preferences were expressed in reasonable situations. We can have a CEV style meta-preference ("I wish my preferences were more like what a CEV would generate"), but, apart from that, the preferences a CEV would generate are not directly relevant: the situations where "we knew more, thought faster, were more the people we wished we were, had grown up farther together" are highly non-typical.
We would not want the AI itself manipulating the definition of "reasonable" situations. It's for this reason that I've looked into ways of quantifying and removing AI rigging and influencing of the learning process.
Synthesising human preferences
The simple preferences and meta-preferences constructed above will often be wildly contradictory (eg we want to be generous and rich), inconsistent across time, and generally underdefined. They can also be weakly or strongly held.
The important thing now is to synthesise all of these into some adequate overall reward or utility function. Not because utility functions are intrinsically good, but because they are stable: if you're not an expected utility maximiser, events will likely push you into becoming one. And it's much better to start off with an adequate utility function, than to hope that random-drift-until-our-goals-are-stable will get us to an adequate outcome.
Synthesising the preference utility function
The idea is to start with three things:
- A way of resolving contradictions between preferences (and between meta-preferences, and so on).
- A way of applying meta-preferences to preferences (endorsing and anti-endorsing other preferences).
- A way of allowing (relevant) meta-preferences to change the methods used in the two points above.
This post showed one method of doing that, with contradictions resolved by weighting the reward/utility function for each preference and then adding them together linearly. The weights were proportional to some measure of the intensity of each preference.
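A minimal sketch of that linear combination, under assumptions of my own: each preference is a hypothetical `(intensity, utility)` pair, outcomes are plain dictionaries, and the weights are the intensities normalised to sum to one. The names `linear_synthesis`, `generous` and `rich` are illustrative only and do not come from the earlier post.

```python
from typing import Callable, Dict, List, Tuple

Outcome = Dict[str, float]
# (intensity, utility function); both are illustrative assumptions
Preference = Tuple[float, Callable[[Outcome], float]]


def linear_synthesis(preferences: List[Preference]) -> Callable[[Outcome], float]:
    """Combine preference utilities as a weighted sum, with weights
    proportional to each preference's intensity (normalised to sum to 1)."""
    total = sum(intensity for intensity, _ in preferences)
    if total <= 0:
        raise ValueError("need at least one preference with positive intensity")

    def combined(outcome: Outcome) -> float:
        return sum((intensity / total) * u(outcome) for intensity, u in preferences)

    return combined


# Hypothetical version of the "generous and rich" contradiction.
generous: Preference = (3.0, lambda o: o["donated"])
rich: Preference = (1.0, lambda o: o["kept"])
U = linear_synthesis([generous, rich])

give_most = {"donated": 0.8, "kept": 0.2}
keep_most = {"donated": 0.2, "kept": 0.8}
print(U(give_most), U(keep_most))
```

Because the generosity preference is held three times as intensely here, the synthesised utility ranks `give_most` above `keep_most`, while still giving the weaker preference some say.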
In a more recent post, I realised that linear addition may not be the natural thing to do for some types of preferences (which I dubbed "identity" preferences). The smooth minimum gives another way of combining utilities, though it needs a natural zero as well as a weight. So the human's model of the status quo is relevant here. For preferences combined in a smoothmin, we can just reset the natural zero (raising it to make the preference less important, lowering it to make it more) rather than changing the weight.
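To illustrate how a smooth minimum behaves differently from a linear sum, here is a sketch using the common log-sum-exp soft minimum. This particular formula and the `sharpness` parameter are assumptions chosen for illustration; the smoothmin in the post referred to above may be defined differently. The inputs are assumed to have already been shifted by their natural zero and scaled by their weight.

```python
import math
from typing import Sequence


def smoothmin(values: Sequence[float], sharpness: float = 5.0) -> float:
    """Log-sum-exp soft minimum (an assumed form, not the post's definition).

    Approaches min(values) as sharpness grows, and the plain average
    as sharpness approaches 0.
    """
    if not values:
        raise ValueError("need at least one value")
    if sharpness <= 0:
        raise ValueError("sharpness must be positive")
    lowest = min(values)
    # Subtract the minimum before exponentiating, for numerical stability.
    total = sum(math.exp(-sharpness * (v - lowest)) for v in values)
    return lowest - math.log(total / len(values)) / sharpness


balanced = [0.5, 0.5]   # both preferences moderately satisfied
lopsided = [1.0, 0.0]   # one fully satisfied, one not at all

# A plain average cannot tell these apart; the smooth minimum can.
print(sum(balanced) / 2, sum(lopsided) / 2)
print(smoothmin(balanced), smoothmin(lopsided))
```

The plain average scores both cases identically, whereas the smooth minimum penalises the lopsided case, since it is dominated by the least satisfied preference.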
I'm distinguishing between identity and world preferences, but the real distinction is between preferences that humans prefer to combine linearly, and those they prefer to combine in a smoothmin. So it could work that, along with preference and weight (and natural zero), one thing we could ask of basic preferences is whether they should go in the linear or the smoothmin group.
Also, though I'm very willing to let a linear preference get sent to zero if the human's meta-preferences unendorse it, I'm less sure about those in the other group; it's possible that unendorsing of a smoothmin preference should raise the "natural zero" rather than sending the preference to zero. After all, we've identified these preferences as key parts of our identity, even though we unendorse them.
Meta-changes to the synthesis method
Then finally, on point 3 above, the relevant human meta-preferences can change the synthesis process. Heavily weighted meta-preferences of this type will result in completely different processes than described above; lightly weighted meta-preferences will make only small changes. The original post looked into that in more detail.
Notice that I am making some deliberate and somewhat arbitrary choices: using linear or smoothmin to combine meta-preferences (including those that might want to change the methods of combinations). How much weight a meta-preference must have, before it seriously changes the synthesis method, is somewhat arbitrary.
I'm also starting with two types of preference combinations, linear and smoothmin, rather than many more or just one. The idea is that these two ways of combining preferences seem the most salient to me, and our own meta-values can change these ways if we feel strongly about them. It's as if I'm starting the design of a formula one car, before an AI trains itself to complete the design. I know it'll change a lot of things, but if I start with "four wheels, a cockpit and a motor", I'm hoping to get them started on the right path, even if they eventually overrule me.
Or, if you prefer, I think starting with this design is more likely to nudge a bad outcome into a good one, than to do the opposite.
Now for the most tricky part of this: given the above, can we expect non-terrible outcomes?
This is a difficult question to answer, because "terrible outcomes" remains undefined (if we had a full definition, it could serve as a utility function itself), and, in a sense, there is no principled trade-off between two preferences: the only general optimality measure is Pareto, and that can be reached by any linear combination of utilities.
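The Pareto point can be made concrete with a toy example. The sketch below assumes a small, finite set of made-up outcomes, each scored by two hypothetical preferences, and uses strictly positive weights. It shows that different weightings pick different outcomes, all of which are Pareto-optimal, so Pareto optimality alone does not say which trade-off to choose.

```python
from typing import Dict, List, Tuple

# (utility for preference A, utility for preference B); values are made up
Point = Tuple[float, float]


def dominates(p: Point, q: Point) -> bool:
    """True if p is at least as good as q on both utilities and differs from q."""
    return p[0] >= q[0] and p[1] >= q[1] and p != q


def pareto_front(points: List[Point]) -> List[Point]:
    """Outcomes that no other outcome dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


def best_under_weights(points: List[Point], w_a: float, w_b: float) -> Point:
    """Outcome maximising the linear combination w_a * u_A + w_b * u_B."""
    return max(points, key=lambda p: w_a * p[0] + w_b * p[1])


outcomes = [(1.0, 0.0), (0.8, 0.6), (0.5, 0.5), (0.0, 1.0), (0.2, 0.2)]
front = pareto_front(outcomes)
picks: Dict[float, Point] = {
    w: best_under_weights(outcomes, w, 1.0 - w) for w in (0.1, 0.5, 0.9)
}
print(front)
print(picks)
```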
Scope insensitivity to the rescue?
There are two obvious senses in which an outcome could be terrible:
- We could lose something of great value, never to have it again.
- We could fall drastically short of maximising a utility function to the utmost.
From the perspective of a utility maximiser, both these outcomes could be equally terrible - it's just a question of computing the expected utility difference between the two scenarios.
However, for actual humans, the first scenario seems to loom much larger. This can be seen as a form of scope insensitivity: we might say that we believe in total utilitarianism, but we don't feel that a trillion people is really a thousand times better than a billion people, so the larger the numbers grow, the more we are, in practice, willing to trade off total utilitarianism for other values.
Now, we might deplore that state of affairs (that deploring is a valid meta-preference), but that does seem to be how humans work. And though there are arguments against scope insensitivity for actually existent beings, it is perfectly consistent to reject them when considering whether we have a duty to create new beings.
What this means is that people's preferences seem much closer to smooth minimums than to linear sums. Some are explicitly set up like that from the beginning (those that go in the smoothmin bucket). Others may be like that in practice, either because meta-preferences want them to be, or because of the vast size of the future: see next section.
The size of the future
The future is vast, with the energy of billions of galaxies, efficiently used, at our disposal. Potentially far, far larger than that, if we're clever about our computations.
That means that it's far easier to reach "agreement" between two utility functions with diminishing marginal returns (as most of them will be, in practice and in theory). Even without diminishing marginal returns, and without using smoothmin, it's unlikely that one utility function will retain the highest marginal returns all the way up to all resources being used up. At some point, benefiting a tiny little preference slightly will likely be easier.
The exception to this is if preferences are explicitly opposed to each other; eg masochism versus pain-reduction. But even there, they are unlikely to be completely and exactly negations of one another. The masochist may find some activities that don't fit perfectly under "increased pain" as traditionally understood, so some compromise between the two preferences becomes possible.
The underdefined nature of some preferences may be a boon here; if A is forbidden, but only in situations in S, then going outside of S may allow the A-loving preferences their space to grow. So, for example, obeying promises might become a general value, but we might allow games, masked balls, or similar situations where lying is allowed, because the advantages of honesty - reputation, ease of coordination - are deliberately absent.
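The claim that a vast future makes "agreement" easier under diminishing marginal returns can be illustrated with a toy allocation. The sketch below assumes each preference has utility `w * log(1 + x)` in the resources `x` it receives (my choice of functional form, purely for illustration), and hands out identical resource units greedily to whichever preference currently has the highest marginal gain. The names `major` and `tiny` are hypothetical.

```python
import math
from typing import Dict


def allocate(weights: Dict[str, float], budget: int) -> Dict[str, int]:
    """Greedily give each of `budget` resource units to the preference with
    the highest marginal gain, assuming utility w * log(1 + x) for each."""
    allocation = {name: 0 for name in weights}

    def marginal(name: str) -> float:
        x = allocation[name]
        # Gain from moving this preference from x to x + 1 units.
        return weights[name] * (math.log(2 + x) - math.log(1 + x))

    for _ in range(budget):
        best = max(weights, key=marginal)
        allocation[best] += 1
    return allocation


weights = {"major": 100.0, "tiny": 1.0}
small_future = allocate(weights, budget=10)
large_future = allocate(weights, budget=1000)
print(small_future, large_future)
```

With a small budget the heavily weighted preference takes everything, but once resources are plentiful its marginal returns fall far enough that the tiny preference starts receiving a share as well.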
Growth, learning, and following your own preferences
I've argued that our values and preferences will soon become stable as we start to self-modify.
This is going to be hard for those who put an explicit premium on continual moral growth. Now, it's still possible to have continued moral change within a narrow band, but
Finally, there's the issue of what happens when the AI tells you "here is U, the synthesis of your preferences", and you go "well, I have all these problems with it". Since humans are often contrarian by nature, it may be impossible for an AI to construct a U that we would ever explicitly endorse. This is a sort of "self-reference" problem in synthesising preferences.
The whole design - with an initial framework, liberal use of smoothmin, a default for standard combinations of preferences, and a vast amount of resources available - is designed to reach an adequate, rather than an optimal solution. Optimal solutions are very subject to Goodhart's law if we don't include everything we care about; if we do include everything we care about, the process may come to resemble the one I've defined here, above.
Conversely, if the human fears that such a synthesis will become badly behaved in certain extreme situations, then that fear will be included in the synthesis. And, if the fear is strong enough, it will serve to direct the outcomes away from those extreme situations.
So the whole design is somewhat tolerant to changes in the initial conditions: different starting points may end up in different end points, but all of them will hopefully be acceptable.
Did I think of everything?
With all such methods, there's the risk of not including everything, so ending up in a terrible point by omission. That risk is certainly there, but it seems that we couldn't end up in a terrible hellworld, or at least not in one that could be meaningfully described/summarised to the human (because avoiding hellworlds is high on human preferences and meta-preferences, and there is little explicit force pushing the other way).
And I've argued that it's unlikely that indescribable hellworlds are even possible.
However, there are still a lot of holes to fill, and I have to ensure that this doesn't just end up as a series of patches until I can't think of any further patches. That's my greatest fear, and I'm not yet sure how to address it.
This seems to assume a fairly specific (i.e., anti-realist) metaethics. I'm quite uncertain about metaethics and I'm worried that if moral realism is true (and say for example that total hedonic utilitarianism is the true moral theory), and what you propose here causes the true moral theory to be able to control only a small fraction of the resources of our universe, that would constitute a terrible outcome. Given my state of knowledge, I'd prefer not to make any plans that imply commitment to a specific metaethical theory, like you seem to be doing here.
What's your response to people with other metaethics or who are very uncertain about metaethics?
I don't think this is true for me, or maybe I'm misunderstanding what you mean by the two scenarios.
Leaning on this, someone could write a post about the "infectiousness of realism" since it might be hard to reconcile openness to non-zero probabilities of realism with anti-realist frameworks? :P
For people who believe their actions matter infinitely more if realism is true, this could be modeled as an overriding meta-preference to act as though realism is true. Unfortunately if realism isn't true this could go in all kinds of directions depending on how the helpful AI system would expect to get into such a judged-to-be-wrong epistemic state.
Probably you were thinking of something like teaching AIs metaphilosophy in order to perhaps improve the procedure? This would be the main alternative I see, and it does feel more robust. I am wondering though whether we'll know by that point whether we've found the right way to do metaphilosophy (and how approaching that question is different from approaching whichever procedures philosophically sophisticated people would pick to settle open issues in something like the above proposals). It seems like there has to come a point where one has to hand off control to some in-advance specified "metaethical framework" or reflection procedure, and judged from my (historically overconfidence-prone) epistemic state it doesn't feel obvious why something like Stuart's anti-realism isn't already close to there (though I'd say there are many open questions and I'd feel extremely unsure about how to proceed regarding for instance "2. A method for synthesising such basic preferences into a single utility function or similar object," and also to some extent about the premise of squeezing a utility function out of basic preferences absent meta-preferences for doing that). Adding layers of caution sounds good though as long as they don't complicate things enough to introduce large new risks.
I think there's some (small) hope that by the time we need it, we can hit upon a solution to metaphilosophy that will just be clearly right to most (philosophically sophisticated) people, like how math and science were probably once methodologically quite confusing but now everyone mostly agrees on how math and science should be done. Failing that, we probably need some sort of global coordination to prevent competitive pressures leading to value lock-in (like the kind that would follow from Stuart's scheme). In other words, if there wasn't a race to build AGI, then there wouldn't be a need to solve AGI safety, and there would be no need for schemes like Stuart's that would lock in our values before we solve metaphilosophy.
Stuart's scheme uses each human's own meta-preferences to determine their own (final) object-level preferences. I would be less concerned if this was used on someone like William MacAskill (with the caveat that correctly extracting William MacAskill's meta-preferences seems equivalent to learning metaphilosophy from William) but a lot of humans have seemingly terrible meta-preferences or at least different meta-preferences which likely lead to different object-level preferences (so they can't all be right, assuming moral realism).
To put it another way, my position is that if moral realism or relativism (positions 1-3 in this list) is right, we need "metaphilosophical paternalism" to prevent a "terrible outcome", and that's not part of Stuart's scheme.
In those cases, I'd give more weight to the preferences than the meta-preferences. There is the issue of avoiding ignorant-yet-confident meta-preferences, which I'm working on writing up right now (partially thanks to your very comment here, thanks!)
Moral realism is ill-defined, and some allow that humans and AI would have different types of morally true facts. So it's not too much of a stretch to assume that different humans might have different morally true facts from each other, so I don't see this as being necessarily a problem.
Moral realism through acausal trade is the only version of moral realism that seems to be coherent, and to do that, you still have to synthesise individual preferences first. So "one single universal true morality" does not necessarily contradict "contingent choices in figuring out your own preferences".
I look forward to reading that. In the meantime can you address my parenthetical point in the grand-parent comment: "correctly extracting William MacAskill’s meta-preferences seems equivalent to learning metaphilosophy from William"? If it's not clear, what I mean is that suppose Will wants to figure out his values by doing philosophy (which I think he actually does), does that mean that under your scheme the AI needs to learn how to do philosophy? If so, how do you plan to get around the problems with applying ML to metaphilosophy that I described in Some Thoughts on Metaphilosophy?
There is one way of doing metaphilosophy this way, which is "run (simulated) William MacAskill until he thinks he's found a good metaphilosophy" or "find a description of metaphilosophy to which WM would say 'yes'."
But what the system I've sketched would most likely do is come up with something to which WM would say "yes, I can kinda see why that was built, but it doesn't really fit together as I'd like and has some ad hoc and object-level features". That's the "adequate" part of the process.
My aim is to find a decent synthesis of human preferences. If someone has a specific metaethics and compelling reasons why we should follow that metaethics, I'd then defer to that. The fact I'm focusing my research on the synthesis is because I find that possibility very unlikely (the more work I do, the less coherent moral realism seems to become).
But, as I said, I'm not opposed to moral realism in principle. Looking over your post, I would expect that if 1, 4, 5, or 6 were true, that would be reflected in the synthesis process. Depending on how I interpret it, 2 would be partially reflected in the synthesis process, and 3 maybe very partially.
If there were strong evidence for 2 or 3, then we could either a) include them in the synthesis process, or b) tell humans about them, which would include them in the synthesis process indirectly.
Since I see the synthesis process as aiming for an adequate outcome, rather than an optimal one (which I don't think exists), I'm actually ok with adding in some moral-realism or other assumptions, as I see this as making a small shift among adequate outcomes.
As you can see in this post, I'm also ok with some extra assumptions in how we combine individual preferences.
There's also some moral-realism-for-humans variants, which assume that there are some moral facts which are true for humans specifically, but not for agents in general; this would be like saying there is a unique synthesis process. For those variants, and some other moral realist claims, I expect the process of figuring out partial preferences and synthesising them, will be useful building blocks.
But mainly, my attitude to most moral realist arguments, is "define your terms and start proving your claims". I'd be willing to take part in such a project, if it seemed realistically likely to succeed.
You may not be the most typical of persons :-) What I mean is that if we divided people's lifetimes by a third, or had a vicious totalitarian takeover, or made everyone live in total poverty, then people would find any of these outcomes quite bad, even if we increased lifetimes/democracy/GDP to compensate for the loss along one axis.