## AI ALIGNMENT FORUMAF

Martín Soto

Mathematical Logic grad student, doing AI Safety research for ethical reasons.

Working on conceptual alignment, decision theory, cooperative AI and cause prioritization.

My webpage.

Leave me anonymous feedback.

# Wiki Contributions

(I will not try to prove transitivity here, since my goal is to get the overall picture across; I have not checked it, although I expect it to hold.)

Transitivity doesn't hold, here's a counterexample.

The intuitive story is: X's action tells you whether Z failed, Y fails sometimes, and Z fails more rarely.

The full counterexample (all of the following is according to your beliefs ): Say available actions are 0 and 1. There is a hidden fair coin, and your utility is high if you manage to match the coin, and low if you don't. Y peeks at the coin, and takes the correct action, except when it fails, which has a 1/4 chance. Z does the same, but it only fails with a 1/100 chance. X plays 1 iff Z has failed.
Given X's and Y's action, you always go with Y's action, since X tells you nothing about the coin, and Y gives you some information. Given Z's and Y's actions, you always go with Z's, because it's less likely to have failed (even when they disagree). But given Z's and X's, there will be some times (1/100), in which you see X played 1, and then you will not play the same as Z.

The same counterexample works for beliefs (or continuous actions) instead of discrete actions (where you will choose a probability  to believe, instead of an action ), but needs a couple small changes. Now both Z and Y fail with 1/4 probability (independently). Also, Y outputs its guess as 0.75 or 0.25 (instead of 1 or 0), because YOU (that is, ) will be taking into account the possibility that it has failed (and Y better output whatever you will want to guess after seeing it). Instead of Z, consider A as the third expert, which outputs 0.5 if Z and Y disagree, 15/16 if they agree on yes, and 1/16 if they agree on no. X still tells you whether Z failed. Seeing Y and X, you always go with Y's guess. Seeing A and Y, you always go with A's guess. But if you see A = 15/16 and X = 1, you know both failed, and guess 0. (In fact, even when you see X = 0, you will guess 1 instead of 15/16.)

I think this has a fix-point selection problem: If one or both of them start with a different prior under which the other player punishes them for not racing / doesn't reward them enough (maybe because they have very little faith in the other's rationality, or because they think it's not within their power to decide that, and also there's not enough evidential correlation in their decisions), then they'll race.

Of course, considerations about whether the other player normatively endorses something LDT-like also enter the picture. And even if individual humans would endorse it (and that's already a medium-big if), I worry our usual decision structures (for example in AI labs) don't incentivize it (and what's the probability some convincing decision theorist cuts through them? not sure).

we have only said that P2B is the convergent instrumental goal. Whenever there are obvious actions that directly lead towards the goal, a planner should take them instead.

Hmm, given your general definition of planning, shouldn't it include realizations (and their corresponding guided actions) of the form "further thinking about this plan is worse than already acquiring some value now", so that P2B itself already includes acquiring the terminal goal (and optimizing solely for P2B is thus optimal)?

I guess your idea is "plan to P2B better" means "plan with the sole goal of improving P2B", so that it's a "non-value-laden" instrumental goal.

Since this hypothesis makes distinct predictions, it is possible for the confidence to rise above 50% after finitely many observations.

I was confused about why this is the case. I now think I've got an answer (please anyone confirm):
The description length of the Turing Machine enumerating theorems of PA is constant. The description length of any Turing Machine that enumerates theorems of PA up until time-step n and the does something else grows with n (for big enough n). Since any probability prior over Turing Machines has an implicit simplicity bias, no matter what prior we have, for big enough n the latter Turing Machines will (jointly) get arbitrarily low probability relative to the first one. Thus, after enough time-steps, given all observations are PA theorems, our listener will assign arbitrarily higher probability to the first one than all the rest, and thus the first one will be over 50%.

Edit: Okay, I now saw you mention the "getting over 50%" problem further down:

I don't know if the argument works out exactly as I sketched; it's possible that the rich hypothesis assumption needs to be "and also positive weight on a particular enumeration". Given that, we can argue: take one such enumeration; as we continue getting observations consistent with that observation, the hypothesis which predicts it loses no weight, and hypotheses which (eventually) predict other things must (eventually) lose weight; so, the updated probability eventually believes that particular enumeration will continue with probability > 1/2.

But I think the argument goes through already with the rich hypothesis assumption as initially stated. If the listener has non-zero prior probability on the speaker enumerating theorems of PA, it must have non-zero probability on it doing so in a particular enumeration. (unless our specification of the listener structure doesn't even consider different enumerations? but I was just thinking of their hypothesis space as different Turing Machines the whole time) And then my argument above goes through, which I think is just your argument + explicitly mentioning the additional required detail about the simplicity prior.

Nice!

⊬a

Should be , right?

In particular, this theorem shows that players  with very low  (little capital/influence on ) will accurately predict

You mean ?

Solution: Black box the whole setup and remove it from the simulation to avoid circularity.

Addendum: I now notice this amounts to brute-forcing a solution to certain particular counterfactuals.

Hi Vanessa! Thanks again for your previous answers. I've got one further concern.

Are all mesa-optimizers really only acausal attackers?

I think mesa-optimizers don't need to be purely contained in a hypothesis (rendering them acausal attackers), but can be made up of a part of the hypotheses-updating procedures (maybe this is obvious and you already considered it).

Of course, since the only way to change the AGI's actions is by changing its hypotheses, even these mesa-optimizers will have to alter hypothesis selection. But their whole running program doesn't need to be captured inside any hypothesis (which would be easier for classifying acausal attackers away).

That is, if we don't think about how the AGI updates its hypotheses, and just consider them magically updating (without any intermediate computations), then of course, the only mesa-optimizers will be inside hypotheses. If we actually think about these computations and consider a brute-force search over all hypotheses, then again they will only be found inside hypotheses, since the search algorithm itself is too simple and provides no further room for storing a subagent (even if the mesa-optimizer somehow takes advantage of the details of the search). But if more realistically our AGI employs more complex heuristics to ever-better approximate optimal hypotheses update, mesa-optimizers can be partially or completely encoded in those (put another way, those non-optimal methods can fail / be exploited). This failure could be seen as a capabilities failure (in the trivial sense that it failed to correctly approximate perfect search), but I think it's better understood as an alignment failure.

The way I see PreDCA (and this might be where I'm wrong) is as an "outer top-level protocol" which we can fit around any superintelligence of arbitrary architecture. That is, the superintelligence will only have to carry out the hypotheses update (plus some trivial calculations over hypotheses to find the best action), and given it does that correctly, since the outer objective we've provided is clearly aligned, we're safe. That is, PreDCA is an outer objective that solves outer alignment. But we still need to ensure the hypotheses update is carried out correctly (and that's everything our AGI is really doing).

I don't think this realization rules out your Agreement solution, since if truly no hypothesis can steer the resulting actions in undesirable ways (maybe because every hypothesis with a user has the human as the user), then obviously not even optimizers in hypothesis update can find malign hypotheses (although they can still causally attack hacking the computer they're running on etc.). But I think your Agreement solution doesn't completely rule out any undesirable hypothesis, but only makes it harder for an acausal attacker to have the user not be the human. And in this situation, an optimizer in hypothesis update could still select for malign hypotheses in which the human is subtly incorrectly modelled in such a precise way that has relevant consequences for the actions chosen. This can again be seen as a capabilities failure (not modelling the human well enough), but it will always be present to some degree, and it could be exploited by mesa-optimizers.