Vanessa Kosoy

AI alignment researcher supported by MIRI and LTFF. Working on the learning-theoretic agenda.

My Current Take on Counterfactuals

I only skimmed this post for now, but a few quick comments on links to infra-Bayesianism:

InfraBayes doesn’t seem to have that worry, since it applies to non-realizable cases. (Or does it? Is there some kind of non-oscillation guarantee? Or is non-oscillation part of what it means for a set of environments to be learnable -- IE it can oscillate in some cases?)... AFAIK the conditions for learnability in the InfraBayes case are still pretty wide open.

It's true that these questions still need work, but I think it's rather clear that something like "there are no traps" is a sufficient condition for learnability. For example, if you have a finite set of "episodic" hypotheses (i.e. time is divided into episodes, and no state is preserved from one episode to another), then a simple adversarial bandit algorithm (e.g. Exp3) that treats the hypotheses as arms leads to learning. For a more sophisticated example, consider Tian et al., which is formulated in the language of game theory but can be regarded as an infra-Bayesian regret bound for infra-MDPs.
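To make the bandit remark concrete, here is a minimal sketch of Exp3 where each arm stands for "spend one episode following the policy that is optimal for hypothesis i". The Bernoulli episode rewards, the value of `gamma`, and all function names are illustrative assumptions, not something taken from the comment above.

```python
import math
import random

def exp3(num_arms, reward_fn, horizon, gamma=0.1, seed=0):
    """Run Exp3 for `horizon` episodes; rewards must lie in [0, 1]."""
    rng = random.Random(seed)
    weights = [1.0] * num_arms
    pulls = [0] * num_arms
    total_reward = 0.0
    for t in range(horizon):
        total_w = sum(weights)
        # Mix the exponential-weights distribution with uniform exploration.
        probs = [(1 - gamma) * w / total_w + gamma / num_arms for w in weights]
        arm = rng.choices(range(num_arms), weights=probs)[0]
        reward = reward_fn(arm, t, rng)
        total_reward += reward
        pulls[arm] += 1
        # Importance-weighted reward estimate for the arm that was played.
        estimate = reward / probs[arm]
        weights[arm] *= math.exp(gamma * estimate / num_arms)
        # Rescale so the weights never overflow; probabilities are unchanged.
        max_w = max(weights)
        weights = [w / max_w for w in weights]
    return total_reward, pulls

# Hypothetical setup: three hypotheses, with the expected episode reward
# obtained by acting optimally for each one in the (unknown) true environment.
mean_reward = [0.2, 0.5, 0.8]

def episode_reward(arm, t, rng):
    return 1.0 if rng.random() < mean_reward[arm] else 0.0

total, pulls = exp3(len(mean_reward), episode_reward, horizon=5000)
print(pulls)
```

Because episodes carry no state over, a bad choice of arm costs at most one episode of reward, which is the "no traps" property that lets the bandit regret bound apply.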

Radical Probabilism and InfraBayes are plausibly two orthogonal dimensions of generalization for rationality. Ultimately we want to generalize in both directions, but to do that, working out the radical-probabilist (IE logical induction) decision theory in more detail might be necessary.

True, but IMO the way to incorporate "radical probabilism" is via what I called Turing RL.

I don’t know how to talk about the CDT vs EDT insight in the InfraBayes world.

I'm not sure what precisely you mean by "CDT vs EDT insight" but our latest post might be relevant: it shows how you can regard infra-Bayesian hypotheses as joint beliefs about observations *and* actions, EDT-style.

Perhaps more importantly, the Troll Bridge insights. As I mentioned in the beginning, in order to meaningfully solve Troll Bridge, it’s necessary to “respect logic” in the right sense. InfraBayes doesn’t do this, and it’s not clear how to get it to do so.

Is there a way to operationalize "respecting logic"? For example, a specific toy scenario where an infra-Bayesian agent would fail due to not respecting logic?

What Multipolar Failure Looks Like, and Robust Agent-Agnostic Processes (RAAPs)

From your reply to Paul, I understand your argument to be something like the following:

- Any solution to single-single alignment will involve a tradeoff between alignment and capability.
- If AI systems are not designed to be cooperative, then in a competitive environment each system will either go out of business or slide towards the capability end of the tradeoff. This will result in catastrophe.
- If AI systems *are* designed to be cooperative, they will strike deals to stay towards the alignment end of the tradeoff.
- Given the technical knowledge to design cooperative AI, the incentives are in favor of cooperative AI, since cooperative AIs can come out ahead by striking mutually beneficial deals even purely in terms of capability. Therefore, producing such technical knowledge will prevent catastrophe.
- We might still need regulation to stop players who irrationally choose to deploy uncooperative AI, but this kind of regulation is relatively easy to promote since it aligns with competitive incentives (an uncooperative AI wouldn't have much of an edge, it would just threaten to drag everyone into a mutually destructive strategy).

I think this argument has merit, but also the following weakness: given single-single alignment, we can delegate the design of cooperative AI to the initial uncooperative AI. Moreover, uncooperative AIs have an incentive to self-modify into cooperative AIs, if they assign even a small probability to their peers doing the same. I think we definitely need more research to understand these questions better, but it seems plausible we can reduce cooperation to "just" solving single-single alignment.

Formal Solution to the Inner Alignment Problem

I'm kind of scared of this approach because I feel unless you really nail everything there is going to be a gap that an attacker can exploit.

I think that not every gap is exploitable. For most types of biases in the prior, it would only promote simulation hypotheses with baseline universes conformant to this bias, and attackers who evolved in such universes will also tend to share this bias, so they will target universes conformant to this bias and that would make them less competitive with the true hypothesis. In other words, most types of bias affect both and in a similar way.

More generally, I guess I'm more optimistic than you about solving all such philosophical liabilities.

I think of this in contrast with my approach based on epistemic competitiveness, where the idea is not necessarily to identify these considerations in advance, but to be epistemically competitive with an attacker (inside one of your hypotheses) who has noticed an improvement over your prior.

I don't understand the proposal. Is there a link I should read?

This is very similar to what I first thought about when going down this line. My instantiation runs into trouble with "giant" universes that do all the possible computations you would want, and then using the "free" complexity in the bridge rules to pick which of the computations you actually wanted.

So, you can let your physics be a dovetailing of all possible programs, and delegate to the bridge rule the task of filtering the outputs of only one program. But the bridge rule is not "free complexity" because it's not coming from a simplicity prior at all. For a program of length , you need a particular DFA of size . However, the actual DFA is of expected size with . The probability of having the DFA you need embedded in that is something like . So moving everything to the bridge makes a much less likely hypothesis.

What Multipolar Failure Looks Like, and Robust Agent-Agnostic Processes (RAAPs)

I don't understand the claim that the scenarios presented here prove the need for some new kind of technical AI alignment research. It seems like the failures described happened because the AI systems were misaligned in the usual "unipolar" sense. These management assistants, DAOs etc *are not aligned to the goals of their respective, individual users/owners*.

I do see two reasons why multipolar scenarios might require more technical research:

- Maybe several AI systems aligned to different users with different interests can interact in a Pareto inefficient way (a tragedy of the commons among the AIs), and maybe this can be prevented by designing the AIs in particular ways.
- In a multipolar scenario, aligned AI might have to compete with already deployed unaligned AI, meaning that safety must not come at the expense of capability^{[1]}.

In addition, aligning a single AI to multiple users also requires extra technical research (we need to somehow balance the goals of the different users and solve the associated mechanism design problem.)

However, it seems that this article is arguing for something different, since none of the above aspects are highlighted in the description of the scenarios. So, I'm confused.

^{[1]} In fact, I suspect this desideratum is impossible in its strictest form, and we actually have no choice but to somehow make sure aligned AIs have a significant head start on all unaligned AIs.

Formal Solution to the Inner Alignment Problem

Is bounded? I assign significant probability to it being or more, as mentioned in the other thread between me and Michael Cohen, in which case we'd have trouble.

Yes, you're right. A malign simulation hypothesis can be a very powerful explanation to the AI for why it found itself at a point suitable for this attack, thereby compressing the "bridge rules" by a lot. I believe you argued as much in your previous writing, but I managed to confuse myself about this.

Here's a sketch of a proposal for how to solve this. Let's construct our prior to be the *convolution* of a simplicity prior with a computational easiness prior. As an illustration, we can imagine a prior that's sampled as follows:

- First, sample a hypothesis from the Solomonoff prior
- Second, choose a number according to some simple distribution with high expected value (e.g. ) with
- Third, sample a DFA with that number of states and a uniformly random transition table
- Fourth, apply the DFA to the output of the hypothesis

We think of the simplicity prior as choosing "physics" (which we expect to have low description complexity but possibly high computational complexity) and the easiness prior as choosing "bridge rules" (which we expect to have low computational complexity but possibly high description complexity). Ofc this convolution can be regarded as another sort of simplicity prior, so it differs from the original simplicity prior merely by a factor of , however the source of our trouble is also "merely" a factor of .
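The four sampling steps can be sketched in code, under loud assumptions. The Solomonoff prior is uncomputable, so `sample_hypothesis` below is only a toy stand-in (short random bit patterns, repeated). The heavy-tailed choice in `sample_num_states` is an arbitrary pick for this sketch, not the distribution from the comment. The comment also does not say how the DFA emits observations, so attaching a random output bit to every state (Moore-machine style) is my assumption. All function names are hypothetical.

```python
import random

def sample_hypothesis(rng, max_len=8):
    """Toy stand-in for "physics": a random bit pattern repeated forever.
    Shorter patterns are more likely, loosely mimicking a simplicity prior."""
    length = 1
    while length < max_len and rng.random() < 0.5:
        length += 1
    pattern = [rng.randint(0, 1) for _ in range(length)]
    return lambda t: pattern[t % length]

def sample_num_states(rng, cap=1000):
    """Assumed heavy-tailed draw for the DFA size (illustrative only)."""
    return min(cap, int(rng.paretovariate(1.1)))

def sample_dfa(rng, n_states):
    """Uniformly random transition table over a binary input alphabet."""
    transition = [[rng.randrange(n_states) for _ in range(2)]
                  for _ in range(n_states)]
    # Assumption: each state carries a random output bit.
    output = [rng.randint(0, 1) for _ in range(n_states)]
    return transition, output

def sample_observations(seed, steps=20):
    """Draw one sample from the toy convolution prior and run it."""
    rng = random.Random(seed)
    physics = sample_hypothesis(rng)               # step 1: "physics"
    n_states = sample_num_states(rng)              # step 2: bridge budget
    transition, output = sample_dfa(rng, n_states) # step 3: "bridge rules"
    state, observations = 0, []
    for t in range(steps):                         # step 4: apply DFA
        state = transition[state][physics(t)]
        observations.append(output[state])
    return observations

print(sample_observations(seed=0))
```

The point of the split is visible in the code: the DFA's description length grows with its number of states, yet running it costs one table lookup per step.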

Now the simulation hypothesis no longer has an advantage via the bridge rules, since the bridge rules have a large constant budget allocated to them anyway. I think it should be possible to make this into some kind of theorem (two agents with this prior in the same universe that have access to roughly the same information should have similar posteriors, in the limit).

Vanessa Kosoy's Shortform

So is the general idea that we quantilize such that we're choosing in expectation an action that doesn't have corrupted utility (by intuitively having something like more than twice as many actions in the quantilization than we expect to be corrupted), so that we guarantee the probability of following the manipulation of the learned user report is small?

Yes, although you probably want much more than twice. Basically, if the probability of corruption following the user policy is $\epsilon$ and your quantilization fraction is $\phi$, then the AI's probability of corruption is bounded by $\epsilon/\phi$.
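A small numeric check of this kind of bound, as a sketch. Write `eps` for the corruption probability under the user policy and `phi` for the quantilization fraction (symbol names chosen here). The standard quantilizer argument gives a corruption probability of at most `eps / phi`, because conditioning on a set of base-policy mass `phi` can inflate any event's probability by at most `1 / phi`. The finite action set and the adversarial utilities below are made up for illustration.

```python
def quantilize(base_probs, utilities, phi):
    """Base policy conditioned on its top-`phi` mass, ranked by utility."""
    order = sorted(range(len(base_probs)),
                   key=lambda a: utilities[a], reverse=True)
    remaining = phi
    q = [0.0] * len(base_probs)
    for a in order:
        take = min(base_probs[a], remaining)
        q[a] = take / phi
        remaining -= take
        if remaining <= 0:
            break
    return q

# Hypothetical example: 100 equally likely user actions, 2 of them corrupted.
num_actions = 100
base_probs = [1.0 / num_actions] * num_actions
corrupted = {0, 1}
eps = sum(base_probs[a] for a in corrupted)

# Worst case for safety: the corrupted actions look best to the utility estimate.
utilities = [1.0 if a in corrupted else a / num_actions
             for a in range(num_actions)]

phi = 0.1
q = quantilize(base_probs, utilities, phi)
corruption_prob = sum(q[a] for a in corrupted)
print(corruption_prob, eps / phi)
```

In this worst case the bound is attained, which is why the fraction should be much larger than the expected corruption probability rather than merely twice it.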

I also wonder if using the user policy to sample actions isn't limiting, because then we can only take actions that the user would take. Or do you assume by default that the support of the user policy is the full action space, so every action is possible for the AI?

Obviously it is limiting, but this is the price of safety. Notice, however, that the quantilization strategy is only an existence proof. In principle, there might be better strategies, depending on the prior (for example, the AI might be able to exploit an assumption that the user is quasi-rational). I didn't specify the AI by quantilization, I specified it by maximizing EU subject to the Hippocratic constraint. Also, the support is not really the important part: even if the support is the full action space, some sequences of actions are possible but so unlikely that the quantilization will never follow them.

Vanessa Kosoy's Shortform

More observations about this attack vector ("attack from counterfactuals"). I focus on "amplifying by subjective time".

- **The harder the takeoff, the more dangerous this attack vector:** During every simulation cycle, the ability to defend against a simulated malign AI depends on the power of the defense system *at the beginning of the cycle*^{[1]}. On the other hand, the capability of the attacker depends on its power *at the end of the cycle*. Therefore, if power grows very fast, this is bad news for the defender. On the other hand, if power grows very slowly, the defender wins because defense is probably easier.
- **Inner control of anchor makes system safer:** Given a fixed objective time anchor, the more times the simulation goes through a cycle, the higher its exposure to the threat, since improving the defense system takes time even if you have all the knowledge. Therefore, we want to give the simulated user the ability to move the objective time anchor forward in time, in order to benefit from improvements in the defense system.
- **Additional information about the external world makes system safer:** Suppose our subjective probability of a malign AI successfully attacking during a single simulation cycle is 0.1%. Then performing 1000 cycles superficially yields a failure probability of about 63% (the probability of surviving all 1000 cycles is about 37%). However, this is only the case if attacks on different cycles are independent random variables. If the AI knows as much about the possibility of such an attack as we do, this is a justified assumption. However, if the AI knows more, then, from our subjective perspective, these variables are positively correlated. In particular, if the AI knows with certainty whether a successful attack would occur or not, the failure probability over 1000 cycles goes back to 0.1%. Therefore, it is beneficial to give the AI extra information about the world, just so it can estimate the probability of attack with more confidence.
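A quick check of the arithmetic in the last point, as a sketch. With a 0.1% per-cycle attack probability and independent cycles, the chance of surviving all 1000 cycles is about 37%, so the chance of at least one successful attack is about 63%. The "fully correlated" case models the extreme in which the outcome is the same on every cycle. Variable names are illustrative.

```python
p_attack = 0.001   # per-cycle probability of a successful attack
cycles = 1000

# Independent cycles: fail if at least one of the cycles is attacked.
p_survive_independent = (1 - p_attack) ** cycles
p_fail_independent = 1 - p_survive_independent

# Fully correlated cycles: either every cycle is attacked or none is,
# so repeating the cycle adds no extra risk.
p_fail_correlated = p_attack

print(f"independent: fail {p_fail_independent:.3f}, "
      f"survive {p_survive_independent:.3f}")
print(f"correlated:  fail {p_fail_correlated:.3f}")
```

Partial knowledge on the AI's side falls between these two extremes.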

^{[1]} More precisely, it is somewhat better than this since, if we are at some late cycle among a sequence of cycles anchored to the same point in objective time, the simulated user can use knowledge generated in previous cycles to improve the defense system during the current cycle. So, early cycles contribute more to the risk.

Inframeasures and Domain Theory

Virtually all the credit for this post goes to Alex; I think the proof of Proposition 1 was more or less my only contribution.

Vanessa Kosoy's Shortform

The distribution is the user's policy, and the utility function for this purpose is the *eventual success probability* estimated by the user (as part of the timeline report) at the end of the "maneuver". More precisely, the original quantilization formalism was for the one-shot setting, but you can easily generalize it; for example, I did it for MDPs.

I guess we can try studying Troll Bridge using infra-Bayesian modal logic, but atm I don't know what would result.

Ah, but there is a sense in which it doesn't. The radical update rule is equivalent to updating on "secret evidence". And in TRL we have such secret evidence. Namely, if we only look at the agent's beliefs about "physics" (the environment), then they would be updated radically, because of secret evidence from "mathematics" (computations).