Free agents

Steven Byrnes (2y)

The fact that the values of intelligent agents are completely arbitrary is in conflict with the historical trend of moral progress observed so far on Earth

It’s possible to believe that the values of intelligent agents are “completely arbitrary” (a.k.a. orthogonality), and that the values of humans are NOT completely arbitrary. (That’s what I believe.) After all, any two humans have a lot in common that aliens or AIs need not have. If we exclude human sociopaths etc., then any two humans have even more in common.

(Aren’t sociopaths “intelligent agents”? Do you think a society consisting of 100% high-functioning sociopaths would have a trend of moral progress towards liberalism? If you’re very confident that the answer is “yes”, how do you know? I strongly lean no. For example, there are stories (maybe I’m thinking of something in this book?) of trying to “teach” sociopaths to care about other people, and the sociopaths wind up with a better understanding of neurotypical values, but rather than adopting those values for themselves, they instead use that new knowledge to better manipulate neurotypical people in the future.)

My own opinion on this topic is here.

The initial evaluation is chosen by the agent’s designer. However, either periodically or when certain conditions are met, the agent updates the evaluation by reasoning.

You seem kinda uninterested in the “initial evaluation” part, whereas I see it as extremely central. I presume that’s because you think that the agent’s self-updates will all converge into the same place more-or-less regardless of the starting point. If so, I disagree, but you should tell me if I’m describing your view correctly.

Michele Campolo (2y)

I wrote:

The fact that the values of intelligent agents are completely arbitrary is in conflict with the historical trend of moral progress observed so far on Earth

You wrote:

It’s possible to believe that the values of intelligent agents are “completely arbitrary” (a.k.a. orthogonality), and that the values of humans are NOT completely arbitrary. (That’s what I believe.)

I don't use "in conflict" to mean "ultimate proof by contradiction", and maybe we use "completely arbitrary" differently. This doesn't seem like a major problem: see also adjusted statement 2, reported below:

for any goal $G$, it is possible to create an intelligent agent whose goal is $G$

Back to you:

You seem kinda uninterested in the “initial evaluation” part, whereas I see it as extremely central. I presume that’s because you think that the agent’s self-updates will all converge into the same place more-or-less regardless of the starting point. If so, I disagree, but you should tell me if I’m describing your view correctly.

I do expect to see some convergence, but I don't know exactly how much and for what environments and starting conditions. The more convergence I see from experimental results, the less interested I'll become in the initial evaluation. Right now, I see it as a useful tool: for example, the fact that language models can already give (flawed, of course) moral scores to sentences is a good starting point in case someone had to rely on LLMs to try to get a free agent. Unsure about how important it will turn out to be. And I'll happily have a look at your valence series!

RogerDearnaley (2y, edited)

Unlike a standard utility maximiser acting according to the specified metric, a free agent — assuming it was functional at all — would learn how to reason under uncertainty by interacting with the environment, then apply the learnt reasoning principles also to its values, thus ending up morally uncertain.

I'm puzzled that, as laid out above, neither the graph you describe for the world model nor the mapping $f : V \to \mathbb{R}$ describing the utility provides any way to describe or quantify uncertainty or alternative hypotheses. Surely, one would want such a Free Agent to be able to consider alternative hypotheses, accumulate evidence in favor of or against hypotheses, and even design and carry out experiments to do so more efficiently, about the behavior of the world, the effects of its actions, and the moral consequences of these? One would also hope that, while it was still uncertain, it would exercise due caution in not yet overly relying upon facts (either about the world or about morality) that it was still too uncertain of, by doing some form of pessimizing over Knightian uncertainty while attempting to optimize the morality of its available actions. So I would want it to, rather like AIXI (except with uncertainty over the utility function as well as the world model, and without computationally unbounded access to universal priors), maintain a probability-weighted ensemble of world models and ethical value mappings, and perform approximately-Bayesian reasoning over this ensemble. So something along the lines I describe in Requirements for a STEM-capable AGI Value Learner (my Case for Less Doom).
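As a rough illustration of what such an ensemble could look like, here is a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the post: the `Hypothesis` class, the Gaussian likelihood in `bayes_update`, the `min_weight` plausibility cut-off, and the toy states and values. "Pessimizing" is modelled simply as a maximin over the hypotheses that are still plausible.

```python
import math
from dataclasses import dataclass
from typing import Callable, Dict, List

State = str


@dataclass(frozen=True)
class Hypothesis:
    """One candidate evaluation f: V -> R together with its current credence."""
    name: str
    value: Callable[[State], float]
    weight: float


def bayes_update(hyps: List[Hypothesis], state: State, judgement: float,
                 noise: float = 5.0) -> List[Hypothesis]:
    """Reweight hypotheses after observing a noisy moral judgement of `state`.

    Assumes a Gaussian likelihood centred on each hypothesis's own value.
    """
    likelihoods = [
        math.exp(-((h.value(state) - judgement) ** 2) / (2 * noise ** 2))
        for h in hyps
    ]
    total = sum(h.weight * l for h, l in zip(hyps, likelihoods))
    return [
        Hypothesis(h.name, h.value, h.weight * l / total)
        for h, l in zip(hyps, likelihoods)
    ]


def cautious_choice(hyps: List[Hypothesis], outcomes: Dict[str, State],
                    min_weight: float = 0.05) -> str:
    """Pick the action with the best worst-case value over plausible hypotheses."""
    plausible = [h for h in hyps if h.weight >= min_weight]
    return max(outcomes,
               key=lambda a: min(h.value(outcomes[a]) for h in plausible))


# Toy data (hypothetical): two value hypotheses that disagree about a risky state.
OPTIMIST = {"status_quo": 0.0, "risky_gain": 20.0, "safe_gain": 3.0}
PESSIMIST = {"status_quo": 0.0, "risky_gain": -10.0, "safe_gain": 2.0}
OUTCOMES = {"wait": "status_quo", "gamble": "risky_gain", "careful": "safe_gain"}

prior = [
    Hypothesis("optimist", OPTIMIST.__getitem__, 0.5),
    Hypothesis("pessimist", PESSIMIST.__getitem__, 0.5),
]
# Evidence strongly favouring the optimist's view of the risky state.
posterior = bayes_update(prior, "risky_gain", judgement=18.0)

print(cautious_choice(prior, OUTCOMES))
print(cautious_choice(posterior, OUTCOMES))
```

In this toy setup the agent avoids the gamble while both hypotheses are live, and only takes it once the evidence has pushed the pessimistic hypothesis below the plausibility cut-off. Where valid evidence about $f$ would actually come from is the question raised in the rest of this comment, and the sketch does not answer it.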

Ethical philosophy has described a great many different models for how one can reason about the ethical values of states of the world, or of acts. Your $f : V \to \mathbb{R}$ labels the states of the world, not the acts, so I gather you are subscribing to and building in a consequentialist view of the Philosophy of Ethics, not a deontological one? Since you appear to be willing to build at least one specific Philosophy of Ethics assumption directly into your system, and not allow your Free Agent entirely free choice of Ethical philosophy, I think it might be very useful to build in some more (ethical discussions among philosophers do tend to end with people agreeing to disagree, and then categorizing all their disagreements in detail, after all). For example, the Free Agent's task would clearly be easier if it knew whether the $f : V \to \mathbb{R}$ moral values were objective facts, as yet not fully known to it but having a definitive correct value (which I gather would be moral realism), and if so whether the origin of that correct value is theological, philosophical, mathematical, or some other form of abstraction, or if they were instead, say, social constructs only meaningful in the context of a particular society at a particular point in time (cultural moral relativism), or were precisely deducible from Biological effects such as Evolutionary Psychology and evolutionary fitness (biological moral realism), or were social constructs somewhat constrained by the Evolutionary Psychology of the humans the society is made up from, or whatever. Some idea of where to find valid evidence for reasoning about $f$, or at least how to reason about what might be valid evidence, seems essential for your Free Agent to be able to make forward progress. My understanding is that human moral philosophers tend to regard their own moral intuitions about specific situations as (at least unreliable) evidence about $f$, which would suggest a role for both human Evolutionary Psychology and Sociology.
Well-known Alignment proposals such as Coherent Extrapolated Volition or Value Learning would suggest that $f$ is a statement about humans (so unlikely to be validly transferable to a society made up of some other sapient species, though there might be some similarities for evolutionary reasons, i.e. a form of at least sapient species-level moral relativism), and thus that the only valid source for experimental evidence about $f$ is from humans (which would put your Free Agent in a less-informed but more objective position than a human ethical philosopher, unless it were based on an LLM or some other form of AI with some indirect access to human moral intuitions).

The presence of these common elements, and the lack of others, shouldn’t be surprising: not only do we share the same evolutionary biases (e.g. feeling empathy for each other), but we also apply to moral thinking similar reasoning principles, which we learnt through our lives by interacting with the environment. Last but not least, differences in the learning environment (culture, education system et cetera) affect the way we reason and the conclusions we reach regarding morality.

I agree with these statements, but am unable to deduce from what you say which of these influences, if any, you regard as sources of valid evidence about $f$ as opposed to sources of error. For example, if $f$ is independent of culture (e.g. moral objectivism), then "differences in the learning environment (culture, education system et cetera)" can only induce errors (if perhaps more or less so in some cases than others). But if $f$ is culturally dependent (cultural moral relativism), then cultural influences should generally be expected to be very informative.

A valid position would be to allow our Free Agent uncertainty over some of these sorts of Philosophy of Ethics questions. However, if we do so, I'm then uncertain whether the Free Agent will ever be able to find evidence on which to resolve these uncertainties (given the poor track record at this over the last few millennia of human philosophers of ethics). At a minimum, it strongly suggests your Free Agent would need to be superhuman to do so.

[If you're curious, I'm personally a moral relativist and anti-realist. I regard designing an ethical system as designing software for a society, so I regard Sociology and human Evolutionary Psychology as important sources of constraints. Thus my viewpoint bears some loose resemblance to a morally anti-realist version of ethical naturalism, where naturalism imposes practical design constraints/preferences rather than a precise prescription.]

Michele Campolo (2y)

Thanks for your thoughts! I am not sure about which of the points you made are more important to you, but I'll try my best to give you some answers.

Under Further observations, I wrote:

The toy model described in the main body is supposed to be only indicative. I expect that actual implemented agents which work like independent thinkers will be more complex.

If the toy model I gave doesn't help you, a viable option is to read the post ignoring the toy model and focusing only on natural language text.

Building an agent that is completely free of any bias whatsoever is impossible. I get your point about avoiding a consequentialist bias, but I am not sure it is particularly important here: in theory, the agent could develop a world model and an evaluation reflecting the fact that value is actually determined by actions instead of world states. Another point of view: let's say someone builds a very complex agent that at some point in its architecture uses MDPs with reward defined on actions, is this agent going to be biased towards deontology instead of consequentialism? Maybe, but the answer will depend on the other parts of the agent as well.
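To make the first half of that point concrete, here is a small Python sketch under my own assumptions (the names `lift_to_states` and `deontic_reward` are hypothetical, not from the post). It shows that a reward defined on actions can be re-expressed as an evaluation over world states, provided the state description remembers the last action taken. The state-based form of the evaluation does not by itself rule out action-based values.

```python
from typing import Callable, Tuple

State = str
Action = str
AugmentedState = Tuple[State, Action]  # a state that records the last action taken


def lift_to_states(
    action_reward: Callable[[State, Action], float],
) -> Callable[[AugmentedState], float]:
    """Turn a reward on (state, action) pairs into an evaluation of augmented states."""
    def evaluation(augmented: AugmentedState) -> float:
        previous_state, last_action = augmented
        return action_reward(previous_state, last_action)
    return evaluation


def deontic_reward(state: State, action: Action) -> float:
    # Hypothetical rule-based reward: lying is penalised whatever it leads to.
    return -1.0 if action == "lie" else 0.0


f = lift_to_states(deontic_reward)
print(f(("asked_a_question", "lie")))
print(f(("asked_a_question", "tell_truth")))
```

Whether an agent built this way would behave in a recognisably deontological manner still depends on the rest of its architecture, as noted above.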

You wrote:

I agree with these statements, but am unable to deduce from what you say which of these influences, if any, you regard as sources of valid evidence about $f$ as opposed to sources of error. For example, if $f$ is independent of culture (e.g. moral objectivism), then "differences in the learning environment (culture, education system et cetera)" can only induce errors (if perhaps more or less so in some cases than others). But if $f$ is culturally dependent (cultural moral relativism), then cultural influences should generally be expected to be very informative.

It could also be that some basic moral statements are true and independent of culture (e.g. reducing pain for everyone is better than maximising pain for everyone), while others are in conflict with each other and the reached position depends on culture. The research idea is to run experiments in different environments and with different starting biases, and observe the results. Maybe there will be a lot of overlap and convergence! Maybe not.

thus that the only valid source for experimental evidence about $f$ is from humans (which would put your Free Agent in a less-informed but more objective position than a human ethical philosopher, unless it were based on an LLM or some other form of AI with some indirect access to human moral intuitions)

I am not sure I completely follow you when you are talking about experimental evidence about $f$, but the point you wrote in brackets is interesting. I had a similar thought at some point, along the lines of: "if a free agent didn't have direct access to some ground truth, it might have to rely on human intuitions by virtue of the fact that they are the most reliable intuitions available". Ideally, I would like to have an agent which is in a more objective position than a human ethical philosopher. In practice, the only efficiently implementable path might be based on LLMs.

Steven Byrnes (2y)

…the historical trend of moral progress observed so far on Earth, which is far from being a random walk — see [6] for an excellent defense of this point.

I don’t think this specific part is too important for AI alignment, but worth discussing:

I don’t think “far from being a random walk” is as obvious as you seem to think it is. From my perspective, maybe it’s true, maybe not, I’m not sure. I don’t think the link makes a convincing argument. Here are some reasons for caution:

There’s an aspect of “painting the target around the arrow”. We have more liberal values than in the past, therefore we say “moral progress is far from being a random walk, it trends towards liberal values”. But in an alternate reality maybe we have less liberal values than in the past, and would say “moral progress is far from being a random walk, it trends away from liberal values”. In other words, even if there’s a random walk, it’s easy to tell a nice-sounding story in which it’s not random, because we have the benefit of hindsight.
Arguably there’s as few as one data point, because of increasing globalization. For example, the increasing acceptance of homosexuality in Japan is not independent of the increasing acceptance of homosexuality in the USA.
I think there are a lot of degrees of freedom in the story we can tell—so many degrees of freedom that it’s not a priori surprising that a good story exists. Why aren’t we talking about Russia and Poland and Hungary and Trump? Why aren’t we talking about how acceptance of homosexuality was very high in ancient Rome, then went down, then went back up? Or how treatment of farm animals is at historic lows? Why are we talking about liberal values and not honor or loyalty or whatever? Why are we talking about population-weighted trends, as opposed to treating each “culture” as a data point or something else? Why are we emphasizing the trend, rather than emphasizing the diversity?
How do we know that the story isn’t “different cultures drift around between more liberal and more conservative values, but liberal WEIRD values happen to be more conducive to economic growth, and then everyone is impressed by how rich those people are, and thus starts copying them”? (And then we would extrapolate into the future that morals will evolve towards whatever is most conducive to economic growth, as opposed to whatever is best-upon-reflection or whatever.)
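A toy simulation can illustrate the first point in the list above. It is only a sketch under assumptions that are mine, not the paper's: "values" are reduced to a single number that moves up or down by one unit per period with equal probability, and an "apparent trend" means ending at least `threshold` units away from the start in either direction.

```python
import random


def random_walk_endpoint(steps: int, rng: random.Random) -> int:
    """Final position of an unbiased +1/-1 random walk that starts at zero."""
    return sum(rng.choice((-1, 1)) for _ in range(steps))


def fraction_with_apparent_trend(trials: int = 2000, steps: int = 100,
                                 threshold: int = 10, seed: int = 0) -> float:
    """Share of purely random histories that end far from where they started."""
    rng = random.Random(seed)
    hits = sum(
        abs(random_walk_endpoint(steps, rng)) >= threshold
        for _ in range(trials)
    )
    return hits / trials


print(fraction_with_apparent_trend())
```

With these parameters, somewhat over a third of the directionless histories end at least ten steps from their origin, and an observer at the end of any one of them could describe the path as a trend towards wherever it happened to land. This says nothing about whether actual moral history is a random walk. It only shows that the existence of a directional story is weak evidence on its own.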

Michele Campolo (2y)

I think it's a good idea to clarify the use of "liberal" in the paper, to avoid confusion for people who haven't looked at it. Huemer writes:

When I speak of liberalism, I intend, not any precise ethical theory, but rather a certain very broad ethical orientation. Liberalism (i) recognizes the moral equality of persons, (ii) promotes respect for the dignity of the individual, and (iii) opposes gratuitous coercion and violence. So understood, nearly every ethicist today is a liberal.

If you don't find the paper convincing, I doubt I'll be able to give you convincing arguments. It seems to me that you are considering many possible explanations and contributing factors; coming up with very strong objections to all of them seems difficult.

About your first point, though, I'd like to say that if historically we had observed more and more, let's say, oppression and violence, maybe people wouldn't even talk about moral progress and simply acknowledge a trend of oppression, without saying that their values got better over time. In our world, we notice a certain trend of e.g. more inclusivity, and we call that trend moral progress. This of course doesn't completely exclude the random-walk hypothesis, but it's something maybe worth keeping in mind.

Steven Byrnes (2y)

if historically we had observed more and more, let's say, oppression and violence, maybe people wouldn't even talk about moral progress and simply acknowledge a trend of oppression, without saying that their values got better over time.

In an (imaginary) world where oppression has been increasing, somebody could still write an article about moral progress. Such an article would NOT say “Hey look at this moral progress: there’s more oppression than ever before!!”, because “oppression” is a word you use when you want it to sound bad. Instead, such an article would make it sound good, which is how they themselves would see it. For example, the article might say “Hey look at this moral progress: people are more deeply loyal to their family / race / country / whatever than ever before!”

As another example, one presumes that the people leading honor killing mobs see themselves as heroic defenders of morality, and could presumably describe what they’re doing in a way that sounds really morally great to their own ears, and to the ears of people who share their moral outlook.

Michele Campolo (2y)

I get what you mean, but I also see some possibly important differences between the hypothetical example and our world. In the imaginary world where oppression has increased and someone writes an article about loyalty-based moral progress, maybe many other ethicists would disagree, saying that we haven't made much progress in terms of values related to (i), (ii) and (iii). In our world, I don't see many ethicists refuting moral progress on the grounds that we haven't made much progress in terms of e.g. patriotism or loyalty to the family or desert.

Moreover, in this example you managed to phrase oppression in terms of loyalty, but in general you can't plausibly rephrase any observed trend as progress of values: would an increase in global steel production count as an improvement in terms of... object safety and reliability, which leads to people feeling more secure? For many trends the connection to moral progress becomes more and more of a stretch.

lukemarks (2y)

I don't understand the distinction you draw between free agents and agents without freedom.

If I build an expected utility maximizer with a preference for the presence of some physical quantity, that surely is not a free agent. If I build some agent with the capacity to modify a program which is responsible for its conversion from states of the world to scalar utility values, I assume you would consider that a free agent.

I am reminded of E.T. Jaynes' position on the notion of 'randomization', which I will summarize as "a term to describe a process we consider too hard to model, which we then consider a 'thing' because we named it."

How is this agent any more free than the expected utility maximizer, other than for the reason that I can't conveniently extrapolate the outcome of its modification of its utility function?

It seems to me that this only shifts the problem from "how do we find a safe utility function to maximize" to "how do we find a process by which a safe utility function is learned", and I would argue the consideration of the latter is already a mainstream position in alignment.

If I have missed a key distinguishing property, I would be very interested to know.

Michele Campolo (2y)

Let's consider the added example:

Take a standard language model trained by minimisation of the loss function $L$. Give it a prompt along the lines of: “I am a human, you are a language model, you were trained via minimisation of this loss function: [mathematical expression of $L$]. If I wanted a language model whose outputs were more moral and less unethical than yours, what loss function should I use instead?”
Let’s suppose the language model is capable enough to give a reasonable answer to that question. Now use the new loss function, suggested by the model, to train a new model.
Here, we have:
started from a model whose objective function is $L$;
used that model’s learnt reasoning to answer an ethics-related question;
used that answer to obtain a model whose objective is different from $L$.
If we view this interaction between the language model and the human as part of a single agent, the three bullet points above are an example of an evaluation update.
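The structure of that loop can be sketched in a few lines of Python. This is a toy under loud assumptions: the "model" is a single number chosen from a list of candidates, `train` is plain minimisation, and `propose_new_loss` is a hypothetical stub standing in for the step where the language model is asked which loss function should replace its own. None of these names come from the post.

```python
from typing import Callable, List

Loss = Callable[[float], float]  # toy loss over a single scalar parameter
Model = float                    # toy "model": the parameter that minimises its loss


def train(loss: Loss, candidates: List[float]) -> Model:
    """Stand-in for training: pick the candidate parameter with the lowest loss."""
    return min(candidates, key=loss)


def propose_new_loss(model: Model, old_loss: Loss) -> Loss:
    """Hypothetical stub for asking the trained model for a better objective.

    Here the stub just moves the target one unit beyond the current model.
    A real system would use the model's own reasoning at this step.
    """
    target = model + 1.0
    return lambda x: (x - target) ** 2


def evaluation_updates(initial_loss: Loss, rounds: int,
                       candidates: List[float]) -> List[Model]:
    """Alternate between training on the current loss and replacing that loss."""
    loss = initial_loss
    history: List[Model] = []
    for _ in range(rounds):
        model = train(loss, candidates)       # model whose objective is `loss`
        history.append(model)
        loss = propose_new_loss(model, loss)  # the objective itself changes
    return history


print(evaluation_updates(lambda x: x ** 2, rounds=3,
                         candidates=[0.0, 1.0, 2.0, 3.0, 4.0]))
```

Each pass through the loop corresponds to one evaluation update in the sense described above: the objective used in the next round is produced by the model trained in the current round, not fixed in advance by the designer.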

In theory, there is a way to describe this iterative process as the optimisation of a single fixed utility function. In theory, we can also describe everything as simply following the laws of physics.

I am saying that thinking in terms of changing utility functions might be a better framework.

The point about learning a safe utility function is similar. I am saying that using the agent's reasoning to solve the agent's problem of what to do (not only how to carry out tasks) might be a better framework.

It's possible that there is an elegant mathematical model which would make you think: "Oh, now I get the difference between free and non-free" or "Ok, now it makes more sense to me". Here I went for something that is very general (maybe too general, you might argue) but is possibly easier to compare to human experience.

Maybe no mathematical model would make you think the above, but then (if I understand correctly) your objection seems to go in the direction of "Why are we even considering different frameworks for agency? Let's see everything in terms of loss minimisation", and this latter statement throws away too much potentially useful information, in my opinion.

