What are we assuming about utility functions?

Ok, I see what you mean about independence of irrelevant alternatives only being a real coherence condition when the probabilities are objective (or otherwise known to be equal because they come from the same source, even if there isn't an objective way of saying what their common probability is).

But I disagree that this makes VNM only applicable to settings in which all sources of uncertainty have objectively correct probabilities. As I said in my previous comment, you only need there to exist some source of objective probabilities, and you can then use preferences over lotteries involving objective probabilities and preferences over related lotteries involving other sources of uncertainty to determine what probability the agent must assign for those other sources of uncertainty.

Re: the difference between VNM and Bayesian expected utility maximization, I take it from the word "Bayesian" that the way you're supposed to choose between actions does involve first coming up with probabilities of each outcome resulting from each action, and from "expected utility maximization", that these probabilities are to be used in exactly the way the VNM theorem says they should be. Since the VNM theorem does not make any assumptions about where the probabilities came from, these still sound essentially the same, except with Bayesian expected utility maximization being framed to emphasize that you have to get the probabilities somehow first.

What are we assuming about utility functions?

I think you're underestimating VNM here.

only two of those four are relevant to coherence. The main problem is that the axioms relevant to coherence (acyclicity and completeness) do not say anything at all about probability

It seems to me that the independence axiom is a coherence condition, unless I misunderstand what you mean by coherence?

correctly point out problems with VNM

I'm curious what problems you have in mind, since I don't think VNM has problems that don't apply to similar coherence theorems.

VNM utility stipulates that agents have preferences over "lotteries" with known, objective probabilities of each outcome. The probabilities are assumed to be objectively known from the start. The Bayesian coherence theorems do not assume probabilities from the start; they derive probabilities from the coherence criteria, and those probabilities are specific to the agent.

One can construct lotteries with probabilities that are pretty well understood (e.g. flipping coins that we have accumulated a lot of evidence are fair), and you can restrict attention to lotteries only involving uncertainty coming from such sources. One may then get probabilities for other, less well-understood sources of uncertainty by comparing preferences involving such uncertainty to preferences involving easy-to-quantify uncertainty (e.g. if A is preferred to B, and you're indifferent between 60%A+40%B and "A if X, B if not-X", then you assign probability 60% to X. Perhaps not quite as philosophically satisfying as deriving probabilities from scratch, but this doesn't seem like a fatal flaw in VNM to me.

I do not expect agent-like systems in the wild to be pushed toward VNM expected utility maximization. I expect them to be pushed toward Bayesian expected utility maximization.

I understood those as being synonyms. What's the difference?

[AN #66]: Decomposing robustness into capability robustness and alignment robustness

I do, however, believe that the single step cooperate-defect game which they use to come up with their factors seems like a very simple model for what will be a very complex system of interactions. For example, AI development will take place over time, and it is likely that the same companies will continue to interact with one another. Iterated games have very different dynamics, and I hope that future work will explore how this would affect their current recommendations, and whether it would yield new approaches to incentivizing cooperation.

It may be difficult for companies to get accurate information about how careful their competitors are being about AI safety. An iterated game in which players never learn what the other players did on previous rounds is the same as a one-shot game. This points to a sixth factor that increases chance of cooperation on safety: high transparency, so that companies may verify their competitors' cooperation on safety. This is closely related to high trust.

A Critique of Functional Decision Theory

I object to the framing of the bomb scenario on the grounds that low probabilities of high stakes are a source of cognitive bias that trip people up for reasons having nothing to do with FDT. Consider the following decision problem: "There is a button. If you press the button, you will be given $100. Also, pressing the button has a very small (one in a trillion trillion) chance of causing you to burn to death." Most people would not touch that button. Using the same payoffs and probabilies in a scenario to challenge FDT thus exploits cognitive bias to make FDT look bad. A better scenario would be to replace the bomb with something that will fine you $1000 (and, if you want, also increase the chance of of error).

But then, it seems to me, that FDT has lost much of its initial motivation: the case for one-boxing in Newcomb’s problem didn’t seem to stem from whether the Predictor was running a simulation of me, or just using some other way to predict what I’d do.

I think the crucial difference here is how easily you can cause the predictor to be wrong. In the case where the predictor simulates you, if you two-box, then the predictor expects you to two-box. In the case where the predictor uses your nationality to predict your behavior, Scots usually one-box, and you're Scottish, if you two-box, then the predictor will still expect you to one-box because you're Scottish.

But now suppose that the pathway by which S causes there to be money in the opaque box or not is that another agent looks at S...

I didn't think that was supposed to matter at all? I haven't actually read the FDT paper, and have mostly just been operating under the assumption that FDT is basically the same as UDT, but UDT didn't build in any dependency on external agents, and I hadn't heard about any such dependency being introduced in FDT; it would surprise me if it did.

A Critique of Functional Decision Theory

I don't know if I'm a simulation or a real person.

A possible response to this argument is that the predictor may be able to accurately predict the agent without explicitly simulating them. A possible counter-response to this is to posit that any sufficiently accurate model of a conscious agent is necessarily conscious itself, whether the model takes the form of an explicit simulation or not.

Troll Bridge

I think the counterfactuals used by the agent are the correct counterfactuals for someone else to use while reasoning about the agent from the outside, but not the correct counterfactuals for the agent to use while deciding what to do. After all, knowing the agent's source code, if you see it start to cross the bridge, it is correct to infer that it's reasoning is inconsistent, and you should expect to see the troll blow up the bridge. But while deciding what to do, the agent should be able to reason about purely causal effects of its counterfactual behavior, screening out other logical implications.

Also, counterfactuals which predict that the bridge blows up seem to be saying that the agent can control whether PA is consistent or inconsistent.

Disagree that that's what's happening. The link between the consistency of the reasoning system and the behavior of the agent is because the consistency of the reasoning system controls the agent's behavior, rather than the other way around. Since the agent is selecting outcomes based on their consequences, it does make sense to speak of the agent choosing actions to some extent, but I think speaking of logical implications of the agent's actions on the consistency of formal systems as "controlling" the consistency of the formal system seems like an inappropriate attribution of agency to me.

Safely and usefully spectating on AIs optimizing over toy worlds

The agent could be programmed to have a certain hard-coded ontology rather than searching through all possible hypotheses weighted by description length.

Safely and usefully spectating on AIs optimizing over toy worlds

Are you worried about leaks from the abstract computational process into the real world, leaks from the real world into the abstract computational process, or both? (Or maybe neither and I'm misunderstanding your concern?)

There will definitely be tons of leaks from the abstract computational process into the real world; just looking at the result is already such a leak. The point is that the AI should have no incentive to optimize such leaks, not that the leaks don't exist, so the existence of additional leaks that we didn't know about shouldn't be concerning.

Leaks from the outside world into the computational abstraction would be more concerning, since the whole point is to prevent those from existing. It seems like it should be possible to make hardware arbitrarily reliable by devoting enough resources to error detection and correction, which would prevent such leaks, though I'm not an expert, so it would be good to know if this is wrong. There may be other ways to get the AI to act similarly to the way it would in the idealized toy world even when hardware errors create small differences. This is certainly the sort of thing we would want to take seriously if hardware can't be made arbitrarily reliable.

Incidentally, that story about accidental creation of a radio with an evolutionary algorithm was part of what motivated my post in the first place. If the evolutionary algorithm had used tests of its oscillator design in a computer model, rather than in the real world, then it would have have built a radio receiver, since radio signals from nearby computers would not have been included in the computer model of the environment, even though they were present in the actual environment.

Probabilistic Tiling (Preliminary Attempt)

What I meant was that the computation isn't extremely long in the sense of description length, not in the sense of computation time. Also, we aren't doing policy search over the set of all turing machines, we're doing policy search over some smaller set of policies that can be guaranteed to halt in a reasonable time (and more can be added as time goes on)

Wouldn't the set of all action sequences have lower description length than some large finite set of policies? There's also the potential problem that all of the policies in the large finite set you're searching over could be quite far from optimal.

Theorem: Fuzzy beliefs (as in https://www.alignmentforum.org/posts/Ajcq9xWi2fmgn8RBJ/the-credit-assignment-problem#X6fFvAHkxCPmQYB6v ) form a continuous DCPO. (At least I'm pretty sure this is true. I've only given proof sketches so far)

The relevant definitions:

A fuzzy belief over a set X is a concave function ϕ:ΔX→[0,1] such that sup(ϕ)=1 (where ΔX is the space of probability distributions on X). Fuzzy beliefs are partially ordered by ϕ≤ψ⟺∀μ∈ΔX:ϕ(μ)≥ψ(μ) . The inequalities reverse because we want to think of "more specific"/"less fuzzy" beliefs as "greater", and these are the functions with lower values; the most specific/least fuzzy beliefs are ordinary probability distributions, which are represented as the concave hull of the function assigning 1 to that probability distribution and 0 to all others; these should be the maximal fuzzy beliefs. Note that, because of the order-reversal, the supremum of a set of functions refers to their pointwise infimum.

A DCPO (directed-complete partial order) is a partial order in which every directed subset has a supremum.

In a DCPO, define x<<y to mean that for every directed set D with supD≥y, ∃d∈D such that d≥x. A DCPO is continuous if for every y , y=sup{x∣x<<y}.

Lemma: Fuzzy beliefs are a DCPO.

Proof sketch: Given a directed set D , (supD)(μ)=min{d(μ)∣d∈D} is convex, and {μ∣(supD)(μ)=1}=⋂d∈D{μ∣d(μ)=1}. Each of the sets in that intersection are non-empty, hence so are finite intersections of them since D is directed, and hence so is the whole intersection since ΔX is compact.

Lemma: ϕ<<ψ iff {μ∣ψ(μ)=1} is contained in the interior of {μ∣ϕ(μ)=1} and for every μ such that ψ(μ)≠1, ϕ(μ)>ψ(μ).

Proof sketch: If supD≥ψ, then ⋂d∈D{μ∣d(μ)=1}⊆{μ∣ψ(μ)=1} , so by compactness of ΔX and directedness of D, there should be d∈D such that {μ∣d(μ)=1}⊆int({μ∣ϕ(μ)=1}). Similarly, for each μ such that ψ(μ)≠1, there should be dμ∈D such that dμ(μ)<ϕ(μ) . By compactness, there should be some finite subset of {d}∪{dμ∣ψ(μ)≠1} such that any upper bound for all of them is at least ϕ .

Lemma: ψ=sup{ϕ∣{μ∣ψ(μ)=1}⊆int{μ∣ϕ(μ)=1},∀μψ(μ)≠1→ϕ(μ)>ψ(μ)} .

Proof: clear?