## AI ALIGNMENT FORUM

Alex Mennen

Dutch-Booking CDT: Revised Argument

I think the assumption that multiple actions have nonzero probability in the context of a deterministic decision theory is a pretty big problem. If you come up with a model for where these nonzero probabilities are coming from, I don't think your argument is going to work.

For instance, your argument fails if these nonzero probabilities come from epsilon exploration. If the agent is forced to take every action with probability epsilon, and merely chooses which action to assign the remaining probability to, then the agent will indeed purchase the contract for some sufficiently small price  if , even if  is not the optimal action (let's say  is the optimal action). When the time comes to take an action, the agent's best bet is  (prime meaning sell the contract for price ). The way I described the set-up, the agent doesn't choose between  and , because actions other than the top choice all happen with probability epsilon. The fact that the agent sells the contract back in its top choice isn't a Dutch book, because the case where the agent's top choice goes through is the case in which the contract is worthless, and the contract's value is derived from other cases.
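The epsilon-exploration point can be illustrated with a minimal Python sketch. Everything named here is a hypothetical stand-in rather than part of the original argument: the two actions `"a"` and `"b"`, the value `epsilon = 0.01`, and a simplified contract that pays 1 only in worlds where `"a"` is taken. Under those assumptions, the contract has positive expected value even though `"b"` is the top choice, and all of that value comes from the exploration branch.

```python
def action_probabilities(actions, top_choice, epsilon):
    """Epsilon exploration: every non-top action is taken with probability
    epsilon, and the top choice receives all remaining probability."""
    if top_choice not in actions:
        raise ValueError("top_choice must be one of the actions")
    remainder = 1.0 - epsilon * (len(actions) - 1)
    return {a: (remainder if a == top_choice else epsilon) for a in actions}


def contract_value(probs, payoff_by_action):
    """Expected payoff of a contract whose payoff depends on the action taken."""
    return sum(probs[a] * payoff_by_action[a] for a in probs)


# Hypothetical numbers, chosen only for illustration.
actions = ["a", "b"]
probs = action_probabilities(actions, top_choice="b", epsilon=0.01)

# Stand-in contract: worth 1 if "a" is taken, worthless if "b" is taken.
payoffs = {"a": 1.0, "b": 0.0}
value = contract_value(probs, payoffs)

# Portion of the value that comes from the branch where the top choice goes through.
value_from_top_choice = probs["b"] * payoffs["b"]
```

In this toy setting the agent would pay any price below `value` for the contract, while `value_from_top_choice` is zero, which matches the observation that the contract is worthless in exactly the case where the top choice goes through.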

We could modify the epsilon exploration assumption so that the agent also chooses between  and  even while its top choice is . That is, there's a lower bound on the probability with which the agent takes an action in , but even if that bound is achieved, the agent still has some flexibility in distributing probability between  and . In this case, contrary to your argument, the agent will prefer  rather than , i.e., it will not get Dutch booked. This is because the agent is still choosing  as the only action with high probability, and  refers to the expected consequence of the agent choosing  as its intended action, so the agent cannot use  when calculating which of  or  is better to pick as its next choice if its attempt to implement intended action  fails.

Another source of uncertainty that the agent could have about its actions is if it believes it could gain information in the future, but before it has to make a decision, and this information could be relevant to which decision it makes. Say that  and  are the agent's expectations at time  of the utility that taking action  would cause it to get, and the utility it would get conditional on taking action , respectively. Suppose the bookie offers the deal at time , and the agent must act at time . If the possibility of gaining future knowledge is the only source of the agent's uncertainty about its own decisions, then at time , it knows what action it is taking, and  is undefined on actions not taken.  and  should both be well-defined, but they could be different. The problem description should disambiguate between them. Suppose that every time you say  and  in the description of the contract, this means  and , respectively. The agent purchases the contract, and then, when it comes time to act, it evaluates consequences by , not , so the argument for why the agent will inevitably resell the contract fails. If the  appearing in the description of the contract instead means  (since the agent doesn't know what that is yet, this means the contract references what the agent will believe in the future, rather than stating numerical payoffs), then the agent won't purchase it in the first place because it will know that the contract will only have value if  seems to be suboptimal at time  and it takes action  anyway, which it knows won't happen, and hence the contract is worthless.

Utility Maximization = Description Length Minimization

I don't see the connection to the Jeffrey-Bolker rotation? There, to get the shouldness coordinate, you need to start with the epistemic probability measure, and multiply it by utility; here, utility is interpreted as a probability distribution without reference to a probability distribution used for beliefs.
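As a sketch of the contrast, in notation introduced only for this illustration ($P$ for the epistemic probability measure, $V(A)$ for the value of an event $A$, $Q$ for the shouldness coordinate), the Jeffrey-Bolker construction builds $Q$ from the belief measure:

$$
Q(A) = P(A)\,V(A)
$$

By contrast, reading a utility function $u$ as a probability distribution would look something like the following, where the exponential form and the normalizing sum are assumptions about the post's construction rather than something stated in this comment:

$$
M(x) = \frac{e^{u(x)}}{\sum_{x'} e^{u(x')}}
$$

No belief measure $P$ appears in the second expression, which is the disanalogy being pointed out.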

An overview of 11 proposals for building safe advanced AI

For individual ML models, sure, but not for classes of similar models. E.g. GPT-3 presumably was more expensive to train than GPT-2 as part of the cost of getting better results. For each of the proposals in the OP, training costs constrain how complex a model you can train, which in turn would affect performance.

Relaxed adversarial training for inner alignment

I'm concerned about Goodhart's law on the acceptability predicate causing severe problems when the acceptability predicate is used in training. Suppose we take some training procedure that would otherwise result in an unaligned AI, and modify the training procedure by also including the acceptability predicate in the loss function during training. This results in an end product that has been trained to appear to satisfy the intended version of the acceptability predicate. One way that could happen is if it actually does satisfy what was intended by the acceptability predicate, which is great. But otherwise, we have made the bad behavior of the final product more difficult to detect, essentially by training the AI to be deceptively aligned.

An overview of 11 proposals for building safe advanced AI

Is there a difference between training competitiveness and performance competitiveness? My impression is that, for all of these proposals, however many resources you've already put into training, putting more resources into training will continue to improve performance. If this is the case, then whether a factor influencing competitiveness is framed as affecting the cost of training or as affecting the performance of the final product, either way it's just affecting the efficiency with which putting resources towards training leads to good performance. Separating competitiveness into training and performance competitiveness would make sense if there's a fixed amount of training that must be done to achieve any reasonable performance at all, but past that, more training is not effective at producing better performance. My impression is that this isn't usually what happens.

An Orthodox Case Against Utility Functions

> This makes Savage a better comparison point, since the Savage axioms are more similar to the VNM framework while also trying to construct probability and utility together with one representation theorem.

Sure, I guess I just always talk about VNM instead of Savage because I never bothered to learn how Savage's version works. Perhaps I should.

> As a representation theorem, this makes VNM weaker and JB stronger: VNM requires stronger assumptions (it requires that the preference structure include information about all these probability-distribution comparisons), where JB only requires preference comparison of events which the agent sees as real possibilities.

This might be true if we were idealized agents who do Bayesian updating perfectly without any computational limitations, but as it is, it seems to me that the assumption that there is a fixed prior is unreasonably demanding. People sometimes update probabilities based purely on further thought, rather than empirical evidence, and a framework in which there is a fixed prior which gets conditioned on events, and banishes discussion of any other probability distributions, would seem to have some trouble handling this.

> Doesn't pointless topology allow for some distinctions which aren't meaningful in pointful topology, though?

Sure, for instance, there are many distinct locales that have no points (only one of which is the empty locale), whereas there is only one ordinary topological space with no points.

> Isn't the approach you mention pretty close to JB? You're not modeling the VNM/Savage thing of arbitrary gambles; you're just assigning values (and probabilities) to events, like in JB.

Assuming you're referring to "So a similar thing here would be to treat a utility function as a function from some lattice of subsets of (the Borel subsets, for instance) to the lattice of events", no. In JB, the set of events is the domain of the utility function, and in what I said, it is the codomain.

An Orthodox Case Against Utility Functions

> In the Savage framework, an outcome already encodes everything you care about.

Yes, but if you don't know which outcome is the true one, so you're considering a probability distribution over outcomes instead of a single outcome, then it still makes sense to speak of the probability that the true outcome has some feature. This is what I meant.

> So the computation which seems to be suggested by Savage is to think of these maximally-specified outcomes, assigning them probability and utility, and then combining those to get expected utility. This seems to be very demanding: it requires imagining these very detailed scenarios.

You do not need to be able to imagine every possible outcome individually in order to think of functions on or probability distributions over the set of outcomes, any more than I need to be able to imagine each individual real number in order to understand the function or the standard normal distribution.

It seems that you're going by an analogy like Jeffrey-Bolker : VNM :: events : outcomes, which is partially right, but leaves out an important sense in which the correct analogy is Jeffrey-Bolker : VNM :: events : probability distributions, since although utility is defined on outcomes, the function that is actually evaluated is expected utility, which is defined on probability distributions (this being a distinction that does not exist in Jeffrey-Bolker, but does exist in my conception of real-world human decision making).

An Orthodox Case Against Utility Functions

I agree that the considerations you mentioned in your example are not changes in values, and didn't mean to imply that that sort of thing is a change in values. Instead, I just meant that such shifts in expectations are changes in probability distributions, rather than changes in events, since I think of such things in terms of how likely each of the possible outcomes is, rather than just which outcomes are possible and which are ruled out.

An Orthodox Case Against Utility Functions

It seems to me that the Jeffrey-Bolker framework is a poor match for what's going on in people's heads when they make value judgements, compared to the VNM framework. If I think about how good the consequences of an action are, I try to think about what I expect to happen if I take that action (i.e., the outcome), and I think about how likely that outcome is to have various properties that I care about, since I don't know exactly what the outcome will be with certainty. This isn't to say that I literally consider probability distributions in my mind, since I typically use qualitative descriptions of probability rather than numbers in [0,1], and when I do use numbers, they are very rough, but this does seem like a sort of fuzzy, computationally limited version of a probability distribution. Similarly, my estimations of how good various outcomes are are often qualitative, rather than numerical, and again this seems like a fuzzy, computationally limited version of a utility function. In order to determine the utility of the event "I take action A", I need to consider how good and how likely various consequences are, and take the expectation of the 'how good' with respect to the 'how likely'. The Jeffrey-Bolker framework seems to be asking me to pretend none of that ever happened.
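The last step, taking the expectation of the "how good" with respect to the "how likely", can be written out as a minimal Python sketch. The outcome labels and all numbers below are hypothetical and stand in for the rough, qualitative estimates described above.

```python
def expected_utility(how_likely, how_good):
    """Expectation of utility ('how good') under a probability
    distribution over outcomes ('how likely')."""
    total = sum(how_likely.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    return sum(p * how_good[outcome] for outcome, p in how_likely.items())


# Hypothetical consequences of "I take action A".
how_likely = {"goes well": 0.25, "goes okay": 0.5, "goes badly": 0.25}
how_good = {"goes well": 10.0, "goes okay": 4.0, "goes badly": -8.0}

utility_of_action_a = expected_utility(how_likely, how_good)
```

The value assigned to the event "I take action A" is then derived from the two ingredients, rather than being a primitive attached directly to the event.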

An Orthodox Case Against Utility Functions

I think we're going to have to back up a bit. Call the space of outcomes and the space of Turing machines . It sounds like you're talking about two functions, and . I was thinking of as the utility function we were talking about, but it seems you were thinking of .

You suggested should be computable but should not be. It seems to me that should certainly be computable (with the caveat that it might be a partial function, rather than a total function), as computation is the only thing Turing machines do, and that if non-halting is included in a space of outcomes (so that is total), it should be represented as some sort of limit of partial information, rather than represented explicitly, so that is continuous.

In any case, a slight generalization of Rice's theorem tells us that any computable function from Turing machines to reals that depends only on the machine's semantics must be constant, so I suppose I'm forced to agree that, if we want a utility function that is defined on all Turing machines and depends only on their semantics, then at least one of or should be uncomputable. But I guess I have to ask why we would want to assign utilities to Turing machines.
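For reference, the generalized Rice's theorem claim can be stated as follows. The symbols are introduced only for this statement and need not match the notation used elsewhere in the thread: $\mathcal{M}$ is the set of Turing machines, $[\![m]\!]$ is the partial function computed by machine $m$, and $f$ is an arbitrary real-valued function on machines. The statement is given without proof.

$$
f : \mathcal{M} \to \mathbb{R} \text{ computable}, \quad \big([\![m]\!] = [\![m']\!] \implies f(m) = f(m')\big) \quad\Longrightarrow\quad f \text{ is constant.}
$$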