Alex Mennen

Wiki Contributions


How To Get Into Independent Research On Alignment/Agency

This post claims that having the necessary technical skills probably means grad-level education, and also that you should have a broad technical background. While I suppose these claims are probably both true, it's worth pointing out that there's a tension between them, in that PhD programs typically aim to develop narrow skillsets, rather than broad ones. Often the first year of a PhD program will focus on acquiring a moderately broad technical background, and then rapidly get progressively more specialized, until you're writing a thesis, at which point whatever knowledge you're still acquiring is highly unlikely to be useful for any project that isn't very similar to your thesis.

My advice for people considering a PhD as preparation for work in AI alignment is that only the first couple years should really be thought of as preparation, and for the rest of the program, you should be actually doing the work that the beginning of the PhD was preparation for. While I wouldn't discourage people from starting a PhD as preparation for work in AI alignment work, I would caution that finishing the program may or may not be a good course of action for you, and you should evaluate this while in the program. Don't end up like me, a seventh-year PhD student working on a thesis project highly unlikely to be applicable to AI alignment despite harboring vague ambitions of working in the field.

The Promise and Peril of Finite Sets

the still-confusing revised slogan that all computable functions are continuous

For anyone who still finds this confusing, I think I can give a pretty quick explanation of this.

The reason I'd imagine it might sound confusing is that you can think of what seem like simple counterexamples. E.g. you can write a short function in your favorite programming language that takes a floating-point real as input, and returns 1 if the input is 0, and returns 0 otherwise. This appears to be a computation of the indicator function for {0}, which is discontinuous. But it doesn't accurately implement any function on  at all, because its input is a floating-point real, and floating-point arithmetic has rounding errors, so you might apply your function to some expression which equals something very small but nonzero, but gets evaluated to 0 on your computer, or which equals 0, but gets evaluated to something small but nonzero on your computer. This problem will arise for any attempt to implement a discontinuous function; rounding errors in the input can move it across the discontinuity.

The conventional definition of a computation of a real function is one for which the output can be made accurate to any desired degree of precision by making the input sufficiently precise. This is essentially a computational version of the epsilon-delta definition of continuity. And most continuous functions you can think of can in fact be implemented computationally, if you use an arbitrary-precision data type for the input (fixed-precision reals are discrete, and cannot be sufficiently precise approximations to arbitrary reals).

Dutch-Booking CDT: Revised Argument

I think the assumption that multiple actions have nonzero probability in the context of a deterministic decision theory is a pretty big problem. If you come up with a model for where these nonzero probabilities are coming from, I don't think your argument is going to work.

For instance, your argument fails if these nonzero probabilities come from epsilon exploration. If the agent is forced to take every action with probability epsilon, and merely chooses which action to assign the remaining probability to, then the agent will indeed purchase the contract for some sufficiently small price  if , even if  is not the optimal action (let's say  is the optimal action). When the time comes to take an action, the agent's best bet is  (prime meaning sell the contract for price ). The way I described the set-up, the agent doesn't choose between  and , because actions other than the top choice all happen with probability epsilon. The fact that the agent sells the contract back in its top choice isn't a Dutch book, because the case where the agent's top choice goes through is the case in which the contract is worthless, and the contract's value is derived from other cases.

We could modify the epsilon exploration assumption so that the agent also chooses between  and  even while its top choice is . That is, there's a lower bound on the probability with which the agent takes an action in , but even if that bound is achieved, the agent still has some flexibility in distributing probability between  and . In this case, contrary to your argument, the agent will prefer  rather than , i.e., it will not get Dutch booked. This is because the agent is still choosing  as the only action with high probability, and  refers to the expected consequence of the agent choosing  as its intended action, so the agent cannot use  when calculating which of  or  is better to pick as its next choice if its attempt to implement intended action  fails.

Another source of uncertainty that the agent could have about its actions is if it believes it could gain information in the future, but before it has to make a decision, and this information could be relevant to which decision it makes. Say that  and  are the agent's expectations at time  of the utility that taking action  would cause it to get, and the utility it would get conditional on taking action , respectively. Suppose the bookie offers the deal at time , and the agent must act at time . If the possibility of gaining future knowledge is the only source of the agent's uncertainty about its own decisions, then at time , it knows what action it is taking, and  is undefined on actions not taken.  and  should both be well-defined, but they could be different. The problem description should disambiguate between them. Suppose that every time you say  and  in the description of the contract, this means  and , respectively. The agent purchases the contract, and then, when it comes time to act, it evaluates consequences by , not , so the argument for why the agent will inevitably resell the contract fails. If the  appearing in the description of the contract instead means  (since the agent doesn't know what that is yet, this means the contract references what the agent will believe in the future, rather than stating numerical payoffs), then the agent won't purchase it in the first place because it will know that the contract will only have value if  seems to be suboptimal at time  and it takes action  anyway, which it knows won't happen, and hence the contract is worthless.

Utility Maximization = Description Length Minimization

I don't see the connection to the Jeffrey-Bolker rotation? There, to get the shouldness coordinate, you need to start with the epistemic probability measure, and multiply it by utility; here, utility is interpreted as a probability distribution without reference to a probability distribution used for beliefs.

An overview of 11 proposals for building safe advanced AI

For individual ML models, sure, but not for classes of similar models. E.g. GPT-3 presumably was more expensive to train than GPT-2 as part of the cost to getting better results. For each of the proposals in the OP, training costs constrain how complex a model you can train, which in turn would affect performance.

Relaxed adversarial training for inner alignment

I'm concerned about Goodhart's law on the acceptability predicate causing severe problems when the acceptability predicate is used in training. Suppose we take some training procedure that would otherwise result in an unaligned AI, and modify the training procedure by also including the acceptability predicate in the loss function during training. This results the end product that has been trained to appear to satisfy the intended version of the acceptability predicate. One way that could happen is if it actually does satisfy what was intended by the acceptability predicate, which is great. But otherwise, we have made the bad behavior of the final product more difficult to detect, essentially by training the AI to be deceptively aligned.

An overview of 11 proposals for building safe advanced AI

Is there a difference between training competitiveness and performance competitiveness? My impression is that, for all of these proposals, however much resources you've already put into training, putting more resources into training will continue to improve performance. If this is the case, then whether a factor influencing competitiveness is framed as affecting the cost of training or as affecting the performance of the final product, either way it's just affecting the efficiency with which putting resources towards training leads to good performance. Separating competitiveness into training and performance competitiveness would make sense if there's a fixed amount of training that must be done to achieve any reasonable performance at all, but past that, more training is not effective at producing better performance. My impression is that this isn't usually what happens.

An Orthodox Case Against Utility Functions
This makes Savage a better comparison point, since the Savage axioms are more similar to the VNM framework while also trying to construct probability and utility together with one representation theorem.

Sure, I guess I just always talk about VNM instead of Savage because I never bothered to learn how Savage's version works. Perhaps I should.

As a representation theorem, this makes VNM weaker and JB stronger: VNM requires stronger assumptions (it requires that the preference structure include information about all these probability-distribution comparisons), where JB only requires preference comparison of events which the agent sees as real possibilities.

This might be true if we were idealized agents who do Bayesian updating perfectly without any computational limitations, but as it is, it seems to me that the assumption that there is a fixed prior is unreasonably demanding. People sometimes update probabilities based purely on further thought, rather than empirical evidence, and a framework in which there is a fixed prior which gets conditioned on events, and banishes discussion of any other probability distributions, would seem to have some trouble handling this.

Doesn't pointless topology allow for some distinctions which aren't meaningful in pointful topology, though?

Sure, for instance, there are many distinct locales that have no points (only one of which is the empty locale), whereas there is only one ordinary topological space with no points.

Isn't the approach you mention pretty close to JB? You're not modeling the VNM/Savage thing of arbitrary gambles; you're just assigning values (and probabilities) to events, like in JB.

Assuming you're referring to "So a similar thing here would be to treat a utility function as a function from some lattice of subsets of (the Borel subsets, for instance) to the lattice of events", no. In JB, the set of events is the domain of the utility function, and in what I said, it is the codomain.

An Orthodox Case Against Utility Functions
In the Savage framework, an outcome already encodes everything you care about.

Yes, but if you don't know which outcome is the true one, so you're considering a probability distribution over outcomes instead of a single outcome, then it still makes sense to speak of the probability that the true outcome has some feature. This is what I meant.

So the computation which seems to be suggested by Savage is to think of these maximally-specified outcomes, assigning them probability and utility, and then combining those to get expected utility. This seems to be very demanding: it requires imagining these very detailed scenarios.

You do not need to be able to imagine every possible outcome individually in order to think of functions on or probability distributions over the set of outcomes, any more than I need to be able to imagine each individual real number in order to understand the function or the standard normal distribution.

It seems that you're going by an analogy like Jeffrey-Bolker : VNM :: events : outcomes, which is partially right, but leaves out an important sense in which the correct analogy is Jeffrey-Bolker : VNM :: events : probability distributions, since although utility is defined on outcomes, the function that is actually evaluated is expected utility, which is defined on probability distributions (this being a distinction that does not exist in Jeffrey-Bolker, but does exist in my conception of real-world human decision making).

An Orthodox Case Against Utility Functions

I agree that the considerations you mentioned in your example are not changes in values, and didn't mean to imply that that sort of thing is a change in values. Instead, I just meant that such shifts in expectations are changes in probability distributions, rather than changes in events, since I think of such things in terms of how likely each of the possible outcomes are, rather than just which outcomes are possible and which are ruled out.

Load More