Alex Mennen

It sounds to me like, in the claim "deep learning is uninterpretable", the key word in "deep learning" that makes this claim true is "learning", and you're substituting the similar-sounding but less true claim "deep neural networks are uninterpretable" as something to argue against. You're right that deep neural networks can be interpretable if you hand-pick the semantic meanings of each neuron in advance and carefully design the weights of the network such that these intended semantic meanings are correct, but that's not what deep learning is. The other things you're comparing it to that are often called more interpretable than deep learning are in fact more interpretable than deep learning, not (as you rightly point out) because the underlying structures they work with is inherently more interpretable, but because they aren't machine learning of any kind.

This seems related in spirit to the fact that time is only partially ordered in physics as well. You could even use special relativity to make a model for concurrency ambiguity in parallel computing: each processor is a parallel worldline, detecting and sending signals at points in spacetime that are spacelike-separated from when the other processors are doing these things. The database follows some unknown worldline, continuously broadcasts its contents, and updates its contents when it receives instructions to do so. The set of possible ways that the processors and database end up interacting should match the parallel computation model. This makes me think that intuitions about time that were developed to be consistent with special relativity should be fine to also use for computation.

Wikipedia claims that every sequence is Turing reducible to a random one, giving a positive answer to the non-resource-bounded version of any question of this form. There might be a resource-bounded version of this result as well, but I'm not sure.

This post claims that having the necessary technical skills probably means grad-level education, and also that you should have a broad technical background. While I suppose these claims are probably both true, it's worth pointing out that there's a tension between them, in that PhD programs typically aim to develop narrow skillsets, rather than broad ones. Often the first year of a PhD program will focus on acquiring a moderately broad technical background, and then rapidly get progressively more specialized, until you're writing a thesis, at which point whatever knowledge you're still acquiring is highly unlikely to be useful for any project that isn't very similar to your thesis.

My advice for people considering a PhD as preparation for work in AI alignment is that only the first couple years should really be thought of as preparation, and for the rest of the program, you should be actually doing the work that the beginning of the PhD was preparation for. While I wouldn't discourage people from starting a PhD as preparation for work in AI alignment work, I would caution that finishing the program may or may not be a good course of action for you, and you should evaluate this while in the program. Don't end up like me, a seventh-year PhD student working on a thesis project highly unlikely to be applicable to AI alignment despite harboring vague ambitions of working in the field.

the still-confusing revised slogan that all

computablefunctions are continuous

For anyone who still finds this confusing, I think I can give a pretty quick explanation of this.

The reason I'd imagine it might sound confusing is that you can think of what seem like simple counterexamples. E.g. you can write a short function in your favorite programming language that takes a floating-point real as input, and returns 1 if the input is 0, and returns 0 otherwise. This appears to be a computation of the indicator function for {0}, which is discontinuous. But it doesn't *accurately* implement any function on at all, because its input is a floating-point real, and floating-point arithmetic has rounding errors, so you might apply your function to some expression which equals something very small but nonzero, but gets evaluated to 0 on your computer, or which equals 0, but gets evaluated to something small but nonzero on your computer. This problem will arise for any attempt to implement a discontinuous function; rounding errors in the input can move it across the discontinuity.

The conventional definition of a computation of a real function is one for which the output can be made accurate to any desired degree of precision by making the input sufficiently precise. This is essentially a computational version of the epsilon-delta definition of continuity. And most continuous functions you can think of can in fact be implemented computationally, if you use an arbitrary-precision data type for the input (fixed-precision reals are discrete, and cannot be sufficiently precise approximations to arbitrary reals).

I think the assumption that multiple actions have nonzero probability in the context of a deterministic decision theory is a pretty big problem. If you come up with a model for where these nonzero probabilities are coming from, I don't think your argument is going to work.

For instance, your argument fails if these nonzero probabilities come from epsilon exploration. If the agent is forced to take every action with probability epsilon, and merely chooses which action to assign the remaining probability to, then the agent will indeed purchase the contract for some sufficiently small price if , even if is not the optimal action (let's say is the optimal action). When the time comes to take an action, the agent's best bet is (prime meaning sell the contract for price ). The way I described the set-up, the agent doesn't choose between and , because actions other than the top choice all happen with probability epsilon. The fact that the agent sells the contract back in its top choice isn't a Dutch book, because the case where the agent's top choice goes through is the case in which the contract is worthless, and the contract's value is derived from other cases.

We could modify the epsilon exploration assumption so that the agent also chooses between and even while its top choice is . That is, there's a lower bound on the probability with which the agent takes an action in , but even if that bound is achieved, the agent still has some flexibility in distributing probability between and . In this case, contrary to your argument, the agent will prefer rather than , i.e., it will not get Dutch booked. This is because the agent is still choosing as the only action with high probability, and refers to the expected consequence of the agent choosing as its intended action, so the agent cannot use when calculating which of or is better to pick as its next choice if its attempt to implement intended action fails.

Another source of uncertainty that the agent could have about its actions is if it believes it could gain information in the future, but before it has to make a decision, and this information could be relevant to which decision it makes. Say that and are the agent's expectations at time of the utility that taking action would cause it to get, and the utility it would get conditional on taking action , respectively. Suppose the bookie offers the deal at time , and the agent must act at time . If the possibility of gaining future knowledge is the only source of the agent's uncertainty about its own decisions, then at time , it knows what action it is taking, and is undefined on actions not taken. and should both be well-defined, but they could be different. The problem description should disambiguate between them. Suppose that every time you say and in the description of the contract, this means and , respectively. The agent purchases the contract, and then, when it comes time to act, it evaluates consequences by , not , so the argument for why the agent will inevitably resell the contract fails. If the appearing in the description of the contract instead means (since the agent doesn't know what that is yet, this means the contract references what the agent will believe in the future, rather than stating numerical payoffs), then the agent won't purchase it in the first place because it will know that the contract will only have value if seems to be suboptimal at time and it takes action anyway, which it knows won't happen, and hence the contract is worthless.

I don't see the connection to the Jeffrey-Bolker rotation? There, to get the shouldness coordinate, you need to start with the epistemic probability measure, and multiply it by utility; here, utility is interpreted as a probability distribution without reference to a probability distribution used for beliefs.

For individual ML models, sure, but not for classes of similar models. E.g. GPT-3 presumably was more expensive to train than GPT-2 as part of the cost to getting better results. For each of the proposals in the OP, training costs constrain how complex a model you can train, which in turn would affect performance.

I'm concerned about Goodhart's law on the acceptability predicate causing severe problems when the acceptability predicate is used in training. Suppose we take some training procedure that would otherwise result in an unaligned AI, and modify the training procedure by also including the acceptability predicate in the loss function during training. This results the end product that has been trained to appear to satisfy the intended version of the acceptability predicate. One way that could happen is if it actually does satisfy what was intended by the acceptability predicate, which is great. But otherwise, we have made the bad behavior of the final product more difficult to detect, essentially by training the AI to be deceptively aligned.

It would kind of use assumption 3 inside step 1, but inside the syntax, rather than in the metalanguage. That is, step 1 involves checking that the number encoding "this proof" does in fact encode a proof of C. This can't be done if you never end up proving C.

One thing that might help make clear what's going on is that you can follow the same proof strategy, but replace "this proof" with "the usual proof of Lob's theorem", and get another valid proof of Lob's theorem, that goes like this: Suppose you can prove that []C->C, and let n be the number encoding a proof of C via the usual proof of Lob's theorem. Now we can prove C a different way like so:

Step 1 can't be correctly made precise if it isn't true that n encodes a proof of C.