This is a cool result. If I'm understanding correctly, M- increases its loss the more that M+ is represented in the mixture, thereby encouraging SGD to make M- more prominent.
Is there a way to extend this to cases where M- doesn't have access to the weights? I think that probably requires an RL environment, but that's entirely based on "I thought about it for a few minutes and couldn't find a way to do it without RL" so I could be way off here.
Given an RL environment I suspect M- could steer the model into scenarios that make it look better than M+...
I’m worried about running HCH because it seems likely that in worlds that can run HCH people are not sufficiently careful to restrict GPU access and those worlds get taken over by unsafe AI built by other actors. Better to just not have the GPU’s at all.
I don’t think the description-length prior enters here. The generative model has a prior based on training data we fed it, and I don’t see why it would prefer short description lengths (which is a very uninformed prior) over “things that are likely in the world given the many PB of data it’s seen”.
Putting that aside, can you say why you think the “AI does weird dances” world is more likely conditioned on the observations than “humans happened to do this weird thing”?
I think I basically agree re: honeypots.
I'm sure there'll be weird behaviors if we outlaw simulations, but I don't think that's a problem. My guess is that a world where simulations are outlawed has some religion with a lot of power that distrusts computers, which definitely looks weird but shouldn't stop them from solving alignment.
I don’t think that’s an example of the model noticing it’s in a simulation. There’s nothing about simulations versus the real world that makes RSA instances more or less likely to pop up.
Rather, that’s a case where the model just has a defecting condition and we don’t hit it in the simulation. This is what I was getting at with “other challenge” #2.
I'm assuming we can input observations about the world for conditioning, and those don't need to be text. I didn't go into this in the post, but for example I think the following are fair game:
Whereas the following are not allowed because I don't see how they could be operationalized:
I think I basically hold disagreement (1), which I think is close to Owain’s comment. Specifically. I think a plausible story for a model learning causality is:
In this story the model learns both a causal model and the spurious correlations. It doesn’t dismiss the spurious correlations but still models the causal ones. This lets it minimize loss, which I think addresses the counter argument to (1).
Right. Maybe a better way to say it is:
The two together give a bit of a lever that I think we can use to bias away from deception if we can find the right operational notion of hidden behaviors.
Were you thinking of perhaps some sort of evolutionary approach with that as part of a fitness function?
That would work, yeah. I was thinking of an approach based on making ad-hoc updates to the weights (beyond SGD), but an evolutionary approach would be much cleaner!