Vivek Hebbar

58

The idea of dividing failure stories into "failures involving rogue deployments" and "other failures" seems most useful if the following argument goes through:

1. Catastrophes require a very large (superhuman?) quantity and/or quality of intellectual labor

2. Either this labor is done by AIs in approved scaffolds, or it is done in "rogue deployments"

3. Hence the only easy-by-default disaster route is through a rogue deployment

4. Hence if we rule out rogue deployments **and** very impressive/difficult malicious labor in our scaffolds, we are safe

This seems true for most catastrophes which only need outsider affordances (such as "order the proteins etc for a pandemic or a nanofactory"). Catastrophes with outsider affordances can usually be presumed difficult because rogue humans don't randomly end the world.

However, this argument for #1 doesn't work for routes which use insider affordances or responsibilities. Nor for routes which go through work done by future generations of AIs. The central case of both is routes that involve using insider privileges to backdoor or intentionally misalign future AIs. For example:

- Planting backdoors in the codebase which allow safety training code to be turned off or misconfigured at a later time

- Poisoning datasets (idk if insider access helps though)

- Intentionally creating and defending flawed alignment plans

These routes get future generations of AIs to do the difficult work required for the final catastrophe. But they aren't necessarily rogue deployments. Consider a backdoored future AI running only in authorized scaffolds, where the safety case for the future AI was an alignment case (not a control case). The alignment case fails due to the subversive actions of the earlier AI. So the earlier AI has caused a catastrophe without doing the difficult work of directly taking over the world, and also without a rogue deployment.

One could separately argue that these routes are also "fundamentally hard" (even if not as hard as directly causing a catastrophe), but I don't see a clear blanket reason.

10

This google search seems to turn up some interesting articles (like maybe this one, though I've just started reading it).

30

Paul [Christiano] called this “problems of the interior” somewhere

Since it's slightly hard to find: Paul references it here (ctrl+f for "interior") and links to this source (once again ctrl+f for "interior"). Paul also refers to it in this post. The term is actually "position of the interior" and apparently comes from military strategist Carl von Clausewitz.

12

Note: The survey took me 20 mins (but also note selection effects on leaving this comment)

10

Here's a fun thing I noticed:

There are 16 boolean functions of two variables. Now consider an embedding that maps each of the four pairs {(A=true, B=true), (A=true, B=false), ...} to a point in 2d space. For any such embedding, at most 14 of the 16 functions will be representable with a linear decision boundary.

For the "default" embedding (x=A, y=B), xor and its complement are the two excluded functions. If we rearrange the points such that xor is linearly represented, we always lose some other function (and its complement). In fact, there are 7 meaningfully distinct colinearity-free embeddings, each of which excludes a different pair of functions.^{[1]}

I wonder how this situation scales for higher dimensions and variable counts. It would also make sense to consider sparse features (which allow superposition to get good average performance).

^{^}The one unexcludable pair is ("always true", "always false").

These are the seven embeddings:

10

Oops, I misunderstood what you meant by unimodality earlier. Your comment seems broadly correct now (except for the variance thing). I would still guess that unimodality isn't precisely the right well-behavedness desideratum, but I retract the "directionally wrong".

20

The variance of the multivariate uniform distribution is largest along the direction , which is exactly the direction which we would want to represent a AND b.

The variance is actually the same in all directions. One can sanity-check by integration that the variance is 1/12 both along the axis and along the diagonal.

In fact, there's nothing special about the uniform distribution here: The variance should be independent of direction for any N-dimensional joint distribution where the N constituent distributions are independent and have equal variance.^{[1]}

The diagram in the post showing that "and"* *is linearly represented works if the features are represented discretely (so that there are exactly 4 points for 2 binary features, instead of a distribution for each combination). As soon as you start defining features with thresholds like DanielVarga did, the argument stops going through in general, and the claim can become false.

The stuff about unimodality doesn't seem relevant to me, and in fact seems directionally wrong.

^{^}I have a not-fully-verbalized proof which I don't have time to write out

10

Maybe models track which features are basic and enforce that these features be more salient

Couldn't it just write derivative features more weakly, and therefore not need any tracking mechanism other than the magnitude itself?

10

When you describe the "emailing protein sequences -> nanotech" route, are you imagining an AGI with computers on which it can run code (like simulations)? Or do you claim that the AGI could design the protein sequences without writing simulations, by simply thinking about it "in its head"?

If the agent follows EDT, it seems like you are giving it epistemically unsound credences. In particular, the premise is that it's very confident it will go left, and the consequence is that it in fact goes right. This was the world model's fault, not EDT's fault. (It

isnotable though that EDT introduces this loopiness into the world model's job.)