Curated. It's nice to see a return to the problems of yore, and I think this is a nice incremental proposal. Bringing in causal counterfactuals seems like a neat trick (with lots of problems, as discussed in the post and the comments), and so does bringing in some bargaining theory.

I have lots of confusions and questions, like

so one general strategy the proposal fits into is “experiment with simpler utility functions (or other goal structures) to figure things out, and rely on corrigibility to make sure that we don’t die in the process of experimenting”

doesn't make sense to me yet, as it seems easy for the utility functions / belief states to all prefer killing humans quickly, even if the humans don't affect the shutdown button exactly. Or the aside on bargaining with non-causally-counterfacting agents. But they're confusions and questions that afford some mulling, which is pretty cool!

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training

kave3mo10

Yep, I noted you said "update as if" rather than "update that". I also expect this will make it pretty hard to say for sure which of us was right, because it's pretty hard to tell if someone updated as if X vs updated that X.

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training

kave3mo32

I think that predictably, people will update as if they saw actual deceptive alignment

Thanks for predicting this! I'll go on the record as predicting not-this. Look forward to us getting some data (though it may be a little muddied by the fact that you've already publicly pushed back, making people less likely to make that mistake).

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training

kave3mo135

This paper also seems dialectically quite significant. I feel like it's a fairly well-delineated claim that can be digested by mainstream ML and policy spaces. Like, it seems helpful to me if policy discussions can include phrases like "the evidence suggests that if the current ML systems were trying to deceive us, we wouldn't be able to change them not to".

Anthropic's Responsible Scaling Policy & Long-Term Benefit Trust

kave5mo22

The model shows early signs of autonomous self-replication ability, as defined by 50% aggregate success rate on the tasks listed in [Appendix on Autonomy Evaluations]

Would you be willing to rephrase this as something like

The model shows early signs of autonomous self-replication ability. Autonomous self-replication ability is defined as 50% aggregate success rate on the capabilities for which we list evaluations in [Appendix on Autonomy Evaluations]

The hope here is to avoid something like "well this system doesn't have autonomous self-replication ability/isn't ASL-3, because Anthropic's evals failed to elicit the behaviour. That definitionally means it's not ASL-3", and get a bit more map-territory distinction in.

TurnTrout's shortform feed

kave5mo20

Two quick thoughts (that don't engage deeply with this nice post).

1. I'm worried about some cases where the goal is not consistent across situations. For example, if the model is prompted to pursue some goal, it then pursues that goal seriously, along with convergent instrumental goals.
2. I think it seems pretty likely that future iterations of transformers will have bits of powerful search in them, but people who seem very worried about that search seem to think that once that search is established enough, gradient descent will cause the internals of the model to be organised mostly around that search (I imagine the search circuits "bubbling out" to be the outer structure of the learned algorithm). Probably this is all just conceptually confused, but to the extent it's not, I'm pretty surprised by their intuition.

AI Timelines

kave6mo55

Curated. I feel like over the last few years my visceral timelines have shortened significantly. This is partly from contact with LLMs, particularly their increased coding utility, and a lot downstream of Ajeya's and Daniel's models and outreach (I remember spending an afternoon on an arts-and-crafts 'build your own timeline distribution' that Daniel had nerdsniped me with). I think a lot of people are in a similar position and have been similarly influenced. It's nice to get more details on those models and the differences between them, as well as to hear Ege pushing back with "yeah but what if there are some pretty important pieces that are missing and won't get scaled away?", which I hear from my environment much less often.

There are a couple of pieces of extra polish that I appreciate. First, having some specific operationalisations with numbers and distributions up-front is pretty nice for grounding the discussion. Second, I'm glad that there was a summary extracted out front, as sometimes the dialogue format can be a little tricky to wade through.

On the object level, I thought the focus on schlep in the Ajeya-Daniel section and slowness of economy turnover in the Ajaniel-Ege section was pretty interesting. I think there's a bit of a cycle with trying to do complicated things like forecast timelines, where people come up with simple compelling models that move the discourse a lot and sharpen people's thinking. People have vague complaints that the model seems like it's missing something, but it's hard to point out exactly what. Eventually someone (often the person with the simple model) is able to name one of the pieces that is missing, and the discourse broadens a bit. I feel like schlep is a handle that captures an important axis that all three of our participants differ on.

I agree with Daniel that a pretty cool follow-up activity would be an expanded version of the exercise at the end with multiple different average worlds.

You’re Measuring Model Complexity Wrong

kave6mo30

Looks like it's in-progress.

Anthropic's Responsible Scaling Policy & Long-Term Benefit Trust

kave7mo1413

As a general matter, Anthropic has consistently found that working with frontier AI models is an essential ingredient in developing new methods to mitigate the risk of AI.

What are some examples of work that is most largeness-loaded and most risk-preventing? My understanding is that interpretability work doesn't need large models (though I don't know about things like influence functions). I imagine constitutional AI does. Is that the central example, or are there other pieces that are further in this direction?