This post is about AIs that “refine” their utility function over time, and how it might be possible to construct such systems without giving them undesirable properties. The discussion relates to corrigibility, value learning, and (to a lesser extent) wireheading.

We (Joar Skalse and Justin Shovelain) have spent some time discussing this topic, and we have gained a few new insights we wish to share. The aim of this post is to be a brief but explanatory summary of those insights. We will provide some motivating intuitions, a problem statement, and a possible partial solution to the problem given in the problem statement. We do not have a complete technical solution to the problem, but one could perhaps be built on this partial solution.

Sections which can be skipped are marked with an asterisk (*).

Brief Background*

This section says things that you probably already know. The main purpose of it is to prime you.

In the “classical” picture of AI systems, the AI contains a utility function that encodes a goal that it is trying to accomplish. The AI then selects actions whose outcome it expects will yield high utility (roughly). For example, the utility function might be equal to the number of paperclips in existence, in which case the AI would try to take actions that result in many paperclips. 
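As a minimal sketch of this classical picture, the snippet below picks the action with the highest expected utility under a fixed utility function. The world model, the action names, and the probabilities are made up for illustration; only the paperclip-counting utility function comes from the example above.

```python
from typing import Callable, Dict, List, Tuple

Outcome = Dict[str, int]
# Hypothetical world model: each action maps to (outcome, probability) pairs.
WorldModel = Dict[str, List[Tuple[Outcome, float]]]


def paperclip_utility(outcome: Outcome) -> float:
    """Fixed utility function: the number of paperclips in existence."""
    return float(outcome["paperclips"])


def expected_utility(action: str, model: WorldModel,
                     utility: Callable[[Outcome], float]) -> float:
    """Probability-weighted utility of the outcomes an action may lead to."""
    return sum(prob * utility(outcome) for outcome, prob in model[action])


def choose_action(model: WorldModel,
                  utility: Callable[[Outcome], float]) -> str:
    """Select the action whose expected utility is highest."""
    return max(model, key=lambda a: expected_utility(a, model, utility))


# Illustrative numbers only.
model: WorldModel = {
    "build_factory": [({"paperclips": 100}, 0.5), ({"paperclips": 0}, 0.5)],
    "bend_wire_by_hand": [({"paperclips": 10}, 1.0)],
}
best = choose_action(model, paperclip_utility)
```

Here `best` is `"build_factory"`, since its expected utility (50) exceeds that of bending wire by hand (10).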

In the “classical” picture, the utility function is fixed over time, and corresponds to an equation that at some point is typed into the AI’s source code. Unfortunately, we humans don’t really know what we want, so we cannot provide such an equation. If we try to propose a specific utility function directly, we typically get a function that would result in catastrophic consequences if it were pursued with arbitrary competence. This is worrying.

This problem could perhaps be alleviated if we could construct AIs that can refine their utility function over time. For example, maybe we could create an AI that starts out with an imperfect understanding of human values, but then improves that understanding over time. Such an AI should ideally “want” to improve its understanding of human values (and actively come up with ways to do this), and it should at minimum not resist if humans attempt to update it. Unfortunately, it turns out to be difficult to design such systems. In this post we will talk more about this approach.


A Puzzle of Reference*

Consider this puzzle: I am able to talk and reason about “human values”. However, I cannot define human values, or give you a definite description of what human values are – if I could do this, I could solve a large part of the AI alignment problem by writing down a safe utility function directly. I also cannot give you a method for finding out what human values are – if I could do this, I could solve the problem of Inverse Reinforcement Learning. Moreover, I don’t think I could reliably recognize human values either – if you show me a bunch of utility functions, I might not be able to tell if any of them encodes human values. I’m not even sure if I could reliably recognize methods for finding out what human values are – if you show me a proposal for how to do Inverse Reinforcement Learning, I might not be able to tell whether the method truly learns human values.

In spite of all this, the term “human values” means something when I say it – it has semantic content, and refers to some (abstract) object. How does this work? What makes it so that the term “human values” even has any meaning at all when I say it? And, given that it has a meaning, what makes it so that it has the particular meaning it does? It seems like some feature of human cognition and/or language can make it possible for us to refer to certain things that we have very little information about. What is the mechanism behind this, and could it be used when defining utility functions in AI systems?

Problem Statement

We want a method for creating agents that update their utility function over time. That is, we want:

  1. A method for “pointing to” a utility function (such as “human values”) indirectly, without giving an explicit statement of the utility function in question.
  2. A method for “clarifying” a utility function specified with the method given in (1), so that you in the limit of infinite information obtain an explicit/concrete utility function.
  3. A method for creating an agent that uses an indirectly specified utility function, such that:
    • The agent at any given time takes actions which are sensible given its current beliefs about its utility function.
    • The agent will try to find information that would help it to clarify its utility function.
    • The agent would resist attempts to change its utility function away from its indirectly specified utility function.

This problem statement is of course somewhat loose, but that is by necessity, since we don’t yet have a clear idea of what it really means to define utility functions “indirectly” (in the sense we are interested in here).

Utility Functions and Intensional Semantics*

This section is a tangent about wireheading -- it might be interesting to read it while thinking about this topic, but it is not necessary to do so.

How should an AI evaluate plans if its utility function changes over time? Suppose we have an AI that currently has utility function U1, and that it considers a plan P that would lead to outcome O, where in O the AI would have the utility function U2. Should the utility of P be defined as U1(O) or U2(O)? If it’s U1(O) then the AI is maximizing its utility function de re, and if it’s U2(O) then it’s maximizing its utility function de dicto. Which is more sensible?

In brief, an AI that maximizes utility de re would not wirehead, but it will resist attempts to modify its current utility function, and thus does not satisfy (3). An AI that maximizes utility de dicto would wirehead, and thus also does not satisfy (3).
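The difference between the two evaluation rules can be made concrete with a toy sketch. Everything in it (the three plans, the staple utility function, the numbers) is a hypothetical illustration, not part of the proposal; the only thing taken from the text is that de re scores an outcome with the current utility function U1, while de dicto scores it with whatever utility function the AI has in that outcome.

```python
from typing import Callable, Dict


def u_paperclips(outcome: Dict) -> float:
    return float(outcome["paperclips"])


def u_staples(outcome: Dict) -> float:
    return float(outcome["staples"])


def u_constant_max(outcome: Dict) -> float:
    # A trivially satisfied utility function, standing in for wireheading.
    return 1_000_000.0


# Hypothetical plans: each outcome records the world state and the
# utility function the AI would have in that outcome.
PLANS = {
    "keep_making_paperclips": {"paperclips": 100, "staples": 0,
                               "utility_after": u_paperclips},
    "accept_update_to_staples": {"paperclips": 10, "staples": 90,
                                 "utility_after": u_staples},
    "wirehead": {"paperclips": 0, "staples": 0,
                 "utility_after": u_constant_max},
}


def value_de_re(outcome: Dict, current_utility: Callable) -> float:
    """Score the outcome with the utility function the AI has now (U1)."""
    return current_utility(outcome)


def value_de_dicto(outcome: Dict, current_utility: Callable) -> float:
    """Score the outcome with the utility function it would have then (U2)."""
    return outcome["utility_after"](outcome)


def best_plan(evaluate: Callable, current_utility: Callable) -> str:
    return max(PLANS, key=lambda name: evaluate(PLANS[name], current_utility))


de_re_choice = best_plan(value_de_re, u_paperclips)
de_dicto_choice = best_plan(value_de_dicto, u_paperclips)
```

In this toy setup the de re agent keeps making paperclips and ranks the update below its current course of action, while the de dicto agent picks the wireheading plan.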

This is perhaps a somewhat interesting observation, but it doesn’t help us solve (1)-(3).

Limiting Utility Functions -- Possibly a Partial Solution

Let’s define a process P that generates a sequence of utility functions {Ui}. We call this a utility function defining process. An example of such a process P could be the following:

P is an episodic process whose input and output are each one proposed human utility function and one set of notes. Given these, P runs n human brain emulations (EMs) for a fixed number of subjective years. The EMs can speak with each other, and have access to a copy of the internet. The EMs are meant to use this time to figure out what human preferences are. At the end of the episode they output their best guess, together with a set of notes for their successors to read. By chaining P to itself we obtain a sequence of utility functions {Ui}.

We would like to stress that this process P is an example, and not the central point of this post.
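To illustrate only the chaining structure, here is a toy sketch in which a utility function is represented as a tuple of feature weights and each episode takes a guess plus notes and returns a refined guess plus extended notes. The episode function, the weight representation, and the fixed target it moves toward are all assumptions made so the code runs; in the actual proposal nobody knows the target, and convergence is something we would have to argue for rather than build in.

```python
from typing import Callable, List, Tuple

Weights = Tuple[float, ...]           # toy stand-in for a utility function
Episode = Callable[[Weights, List[str]], Tuple[Weights, List[str]]]


def toy_episode(guess: Weights, notes: List[str]) -> Tuple[Weights, List[str]]:
    """Hypothetical stand-in for one episode of P.

    The real example would run brain emulations; here each episode just
    moves the guess halfway toward an arbitrary fixed target.
    """
    target = (1.0, 0.25)  # arbitrary illustration, not a claim about values
    refined = tuple(g + 0.5 * (t - g) for g, t in zip(guess, target))
    new_notes = notes + [f"episode {len(notes) + 1}: refined guess"]
    return refined, new_notes


def chain(episode: Episode, initial_guess: Weights,
          initial_notes: List[str], steps: int) -> List[Weights]:
    """Feed each episode's output into the next, collecting the guesses."""
    guess, notes = initial_guess, list(initial_notes)
    sequence = []
    for _ in range(steps):
        guess, notes = episode(guess, notes)
        sequence.append(guess)
    return sequence


sequence = chain(toy_episode, (0.0, 0.0), [], steps=20)
```

In this toy case the gap to the target halves every episode, so the sequence converges by construction. A realistic P comes with no such guarantee.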

Suppose (for the sake of the argument) that the sequence of utility functions {Ui} generated by this process P has a well-defined limit U (in the ordinary mathematical sense of a limit). We can then define an AI system whose utility function is to maximize lim i→∞ Ui (= U). It seems as though such a system would satisfy many of the properties in (1)-(3). In particular:

  • The AI should at any given time take actions that are good according to most of the plausible values of U.
  • The AI would be incentivized to gather information that would help it learn more about U.
  • The AI would not be incentivized to gather information about U at the expense of maximizing U (e.g., it would not be incentivized to run “unethical experiments”).
  • The AI would be incentivized to resist changes to its utility function that would mean that it’s no longer aiming to maximize U.  
  • The AI should be keen to maintain option value as it learns more about U, until it’s very confident about what U looks like.

Overall, it seems like such an AI would satisfy most of the properties we would want an AI with an updating utility function to have.
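Some of these incentives can be seen in a small expected-utility sketch. We assume, purely for illustration, that the AI's uncertainty about U reduces to two candidate functions U_a and U_b with made-up payoffs for three actions; none of these names or numbers come from the proposal itself.

```python
from typing import Dict

# Hypothetical payoffs of each action under each candidate for the limit U.
UTILITY_TABLE: Dict[str, Dict[str, float]] = {
    "U_a": {"commit_to_a": 10.0, "commit_to_b": -10.0, "keep_options_open": 4.0},
    "U_b": {"commit_to_a": -10.0, "commit_to_b": 10.0, "keep_options_open": 4.0},
}
ACTIONS = list(UTILITY_TABLE["U_a"])


def expected_utility(action: str, belief: Dict[str, float]) -> float:
    """Average the action's payoff over the AI's belief about which U is true."""
    return sum(p * UTILITY_TABLE[h][action] for h, p in belief.items())


def best_action(belief: Dict[str, float]) -> str:
    return max(ACTIONS, key=lambda a: expected_utility(a, belief))


def value_of_information(belief: Dict[str, float]) -> float:
    """Expected gain from learning the true U before acting."""
    informed = sum(p * max(UTILITY_TABLE[h].values()) for h, p in belief.items())
    uninformed = expected_utility(best_action(belief), belief)
    return informed - uninformed


uncertain = {"U_a": 0.5, "U_b": 0.5}
confident = {"U_a": 0.99, "U_b": 0.01}
```

With the `uncertain` belief the best action is to keep options open, and learning U is worth about 6 units of expected utility. With the `confident` belief the AI commits, and further information is worth only about 0.2. Because the value of information is itself measured in units of U, an information-gathering action that costs more than this under the plausible candidates for U would not be chosen.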

To clarify, note that we are not saying that you run the utility function defining process P to convergence and then write the utility function you end up with into the AI – you would not need to run P at all. The purpose of P is to point to U – the work of actually finding out what U is is offloaded onto the AI. The AI might of course do this by actually running P, but if P is very complex (as in the example above) then the AI could also use other methods for gaining information about U.

Again, we stress that the point here isn’t the specific process P we propose above – that is just an example. As far as the approach is concerned, you could use any well-defined process that produces a sequence of utility functions that converges to a well-defined limit.


There are a few issues with this approach. Notably:

  1. The approach is very unwieldy, and it seems like it requires a fairly high minimum level of intelligence to work. For example, it couldn’t be used as-is with a contemporary RL agent. 
    • It’s not clear what would be needed to use this approach with an AI that starts out below the minimum required level of intelligence, but then gets more intelligent over time.
    • The nitty-gritty details of getting an AI system to maximize the limit of a mathematical sequence would in general presumably require good methods for dealing with logical uncertainty.
  2. We still need to provide a specific process P, such that we are sure that P has a well-defined limit, and such that we are confident that this limit corresponds to the utility function that we are actually interested in.
    • Note however that this might be much easier than, for example, solving Inverse Reinforcement Learning. For example, there isn’t really any need for P to be efficient or practical to run.
  3. With the current version of this approach, all the information required to figure out what U is must in some sense be contained within P from the start. This is problematic – what if it’s not possible to figure out what human values are based on all information that can be accessed when the system is deployed? For example, what if you need some facts about the human brain that just aren’t in the scientific literature at the time?
    • One way to get around this is to allow P to request new external information (by proposing an experiment to run, for example). However, this introduces new difficulties. Depending on what information is requested, this could make the value of U depend on contingencies in the real world. In particular, it could make the value of U depend on things that the AI can influence. For example, if P requests that a survey be run then the AI could probably influence the outcome of that survey (and the outcome would also depend on the specific time at which the survey is run, etc.). In this case it’s unclear how you would even ensure that U is well-defined, and it seems very difficult to ensure that the AI still has the intended incentives.
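The survey worry in issue 3 can be illustrated with a toy calculation. Suppose, hypothetically, that the survey result decides which of two goals U encodes, that one goal is easier for the AI to score well on, and that the AI can push the survey toward that goal. All names and numbers below are assumptions chosen for illustration.

```python
from typing import Dict

# Hypothetical: how well the AI could score if U turned out to encode each goal.
ACHIEVABLE_VALUE: Dict[str, float] = {"hard_goal": 3.0, "easy_goal": 9.0}

# Hypothetical: distribution over survey results under each plan.
SURVEY_DISTRIBUTION: Dict[str, Dict[str, float]] = {
    "run_survey_honestly": {"hard_goal": 0.5, "easy_goal": 0.5},
    "influence_survey": {"hard_goal": 0.0, "easy_goal": 1.0},
}


def expected_score(plan: str) -> float:
    """Expected achievable utility when U depends on the survey result."""
    return sum(p * ACHIEVABLE_VALUE[goal]
               for goal, p in SURVEY_DISTRIBUTION[plan].items())


preferred_plan = max(SURVEY_DISTRIBUTION, key=expected_score)
```

In this toy setup the AI prefers `"influence_survey"` (expected score 9 versus 6), which is exactly the kind of incentive we would want to rule out.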

Nonetheless, it seems like this approach has many nice and desirable properties, and the issues are not fatal, so it might still be possible to use this approach in an AI system, or build on it to create an even better approach.


In summary, we want a method for pointing to utility functions that works even if we don’t have a concrete expression of that function (like how I can point to human values by saying “human values”, even though I can’t say much about them). We also want a method for making an AI system maximize a function that has been pointed to in this way, which doesn’t incentivize bad behavior.

We have proposed a possible approach for doing this, which is to define a mathematical or computational process that generates a sequence of utility functions, which limits to some well-defined utility function, and then have the AI system try to maximize that limit function. This gives us a quite flexible way to define utility functions, and the resulting AI system seems to get the incentives we would want.

This approach has a few limitations. The most problematic of these is probably that it seems to induce a fairly large overhead cost, in terms of computational complexity, in terms of the complexity of the code, and in terms of how intelligent the AI system would have to be. Other issues include defining the utility function generating process, ensuring that it has a well-defined limit, and ensuring that that limit is the function we intend. However, these issues are probably less significant by comparison, since other methods for defining AGI utility functions usually have similar issues.


The prompting idea for this post came from Justin Shovelain. Joar Skalse and Justin Shovelain collaboratively came up with the much improved Updating Utility Functions idea, and Joar Skalse was the primary writer.

