Mark Xu

Defusing AGI Danger

I absolutely agree that I'm not arguing for "safety by default".

I don't quite agree that you should split effort between strategies, i.e. it seems likely that if you think 80% disaster by default, you should dedicate 100% of your efforts to that world.

Operationalizing compatibility with strategy-stealing

Using the perspective from The ground of optimization means you can get rid of the action space and just say "given some prior and some utility function, what percentile of the distribution does this system tend to evolve towards?" (where the optimization power is again the log of this percentile)

We might then say that an optimizing system is compatible with strategy stealing if it's retargetable for a wide set of utility functions in a way that produces an optimizing system that has the same amount of optimization power.

An AI that is compatible with strategy stealing is one such way to of producing an optimizing system that is compatible with strategy stealing with a particularly easy form of retargeting, but the difficulty of retargeting provides another useful dimension along which optimizing systems vary, e.g. instead of the optimization power the AI can direct towards different goals, you have a space of the "size" of allowed retargeting and the optimization power applied toward the goal for all goals and retargeting sizes.

Defusing AGI Danger

Thanks! Also, oops - fixed.

Understanding “Deep Double Descent”

This post gave a slightly better understanding of the dynamics happening inside SGD. I think deep double descent is strong evidence that something like a simplicity prior exists in SGG, which might have actively bad generalization properties, e.g. by incentivizing deceptive alignment. I remain cautiously optimistic that approaches like Learning the Prior can get circumnavigate this problem.

A space of proposals for building safe advanced AI

I claim that if we call the combination of the judge plus one debater Amp(M), then we can think of the debate as M* being trained to beat Amp(M) by Amp(M)'s own standards.

This seems like a reasonable way to think of debate.

I think, in practice (if this even means anything), the power of debate is quite bounded by the power of the human, so some other technique is needed to make the human capable of supervising complex debates, e.g. imitative amplification.

A space of proposals for building safe advanced AI

Debate: train M* to win debates against Amp(M).

I think Debate is closer to "train M* to win debates against itself as judged by Amp(M)".

Does SGD Produce Deceptive Alignment?

Yep. Meant to say "if a model knew that it was in its last training episode and it wasn't going to be deployed." Should be fixed.

Introduction to Cartesian Frames

This is very exciting. Looking forward to the rest of the sequence.

As I was reading, I found myself reframing a lot of things in terms of the rows and columns of the matrix. Here's my loose attempt to rederive most of the properties under this view.

- The world is a set of states. One way to think about these states is by putting them in a matrix, which we call "cartesian frame." In this frame, the rows of the matrix are possible "agents" and the columns are possible "environments".
- Note that you don't have to put all the states in the matrix.

- Ensurables are the part of the world that the agent can always ensure we end up in. Ensurables are the rows of the matrix, closed under supersets
- Preventables are the part of the world that the agent can always ensure we don't end up in. Preventables are the complements of the rows, closed under subsets
- Controllables are parts of the world that are both ensurable and preventable. Controlables are rows (or sets of rows) for which there exists rows that are disjoint. [edit: previous definition of "contains elements not found in other rows" was wrong, see comment by crabman]
- Observeables are parts of the environment that the agent can observe and act conditionally according to. Observables are columns such that for every pair of rows there is a third row that equals the 1st row if the environment is in that column and the 2nd row otherwise. This means that for every two rows, there's a third row that's made by taking the first row and swapping elements with the 2nd row where it intersects with the column.
- Observables have to be sets of columns because if they weren't, you can find a column that is partially observable and partially not. This means you can build an action that says something like "if I am observable, then I am not observable. If I am not observable, I am observable" because the swapping doesn't work properly.
- Observables are closed under boolean combination (note it's sufficient to show closure under complement and unions):
- Since swapping index 1 of a row is the same as swapping all non-1 indexes, observables are closed under complements.
- Since you can swap indexes 1 and 2 by first swapping index 1, then swapping index 2, observables are closed under union.
- This is equivalent to saying "If A or B, then a0, else a2" is logically equivalent to "if A, then a0, else (if B, then a0, else a2)"

- Since controllables are rows with specific properties and observables are columns with specific properties, then nothing can be both controllable and observable. (The only possibility is the entire matrix, which is trivially not controllable because it's not preventable)
- This assumes that the matrix has at least one column

- The image of a cartesian frame is the actual matrix part.
- Since an ensurable is a row (or superset) and an observable is a column (or set of columns), then if something is ensurable and observable, then it must contain every column, so it must be the whole matrix (image).
- If the matrix has 1 or 0 rows, then the observable constraint is trivially satisfied, so the observables are all possible sets of (possible) environment states (since 0/1 length columns are the same as states).
- "0 rows" doesn't quite make sense, but just pretend that you can have a 0 row matrix which is just a set of world states.

- If the matrix has 0 columns, then the ensurable/preventable contraint is trivially satisfied, so the ensurables are the same as the preventables are the same as the controllables, which are all possible sets of (possible) environment states (since "length 0" rows are the same as states).
- "0 columns doesn't make that much sense either but pretend that you can have a 0 column matrix which is just a set of world state.

- If the matrix has exactly 1 column, then the ensurable/preventable constraint is trivially satisfied
*for states in the image (matrix)*, so the ensurables are all non-empty sets of states in the matrix (since length 1 columns are the same as states), closed under union with states outside the matrix. It should be easy to see that controllables are all possible sets of states that intersect the matrix non-trivially, closed under union with states outside the matrix.

Introduction to Cartesian Frames

In 4.1:

Given a0 and a1, since S∈Obs(C), there exists an a2∈A such that for all e∈E, we have a2∈if(S,a0,a1). Then, since T∈Obs(C), there exists an a3∈A such that for all e∈E, we have a3∈if(S,a0,a2). Unpacking and combining these, we get for all e∈E, a3∈if(S∪T,a0,a1). Since we could construct such an a3 from an arbitrary a0,a1∈A, we know that S∪T∈Obs(C). □

I think there's a typo here. Should be , not .

(also not sure how to copy latex properly).

My opposite intuition is suggested by the fact that if you're trying to guess correctly a series of random digits with 80% "1" and 20% "0", then you should always guess "1".

I don't quite know how to model cross-pollination and diminishing sort of returns. I think working on both for the information value is likely going to be very good. It seems hard to imagine a scenario where you're robustly confident that one project is 80% better taking diminishing returns into account without being able to create a 3rd project with the best features of both, but if you're in that scenario I think just spending all your efforts on the 80% project seems correct.

One example is deciding between 2 fundamentally different products your startup could be making. We also supposed that creating an MVP of either product that would provide information would take a really long time. In this situation, if you suspect one of them is 60% likely to be better than the other it would be less useful to spend your time in a 60/40 split rather than building the MVP of the one likely to be better and reevaluating after getting more information.

The version of your claim that I agree with is "In your current epistemic state, you should spend all your time pursuing the 80% project, but the 80% probably isn't that robust, working on a project has diminishing returns, and other projects will give more information value, globally the amount of time you expect to spend on the 80% project is about 80%."