IIRC, there was also evidence that Copilot was modulating code quality based on name ethnicity variations in code docs

You don't know where they heard that?

One of the main challenges I see here is how to calibrate this. In other words, if I can't break a model despite adding an activation vector of strength x, what does this mean in terms of how safe we should consider the model to be? ie. How much extra adversarial prompting effort is that equivalent to or how should I modify my probabilities of the model being safe?

Do you think it is likely that techniques like RLHF result in over-developed persuasiveness relative to other capabilities? If so, do you think we can modify the training to make this less of an issue or that it is otherwise manageable?

I think it’s worth also raising the possibility of a Kuhnian scenario where the “mature science” is actually missing something and further breakthrough is required after that to move it into in a new paradigm.

I’m confused. Let’s assume that the button probably isn’t pressed at the start. Seems quite likely that the first agent proposes building a sub-agent that maximally pursues its utility if the button is pressed in the first round and maximally pursues the second agents utility if it is not pressed in the first round. The second agent believes that the button is certainly not pressed during the first round, so it has no reason to negotiate further. If the button doesn’t get triggered in the first round (which it likely won’t if the sub-agents have had no time to do anything), it will forever after pursue the first agents utility.

The button play no role in this utility, so instrumental incentives mean it will destroy it sooner or later. This seems like it breaks the system.

Now, this isn’t the only equilibrium. The first agent believes the button will always be pressed, so it has no inventive to argue for the criteria being whether the button is pressed in the first round vs. needing to have been pressed in this round and all previous rounds. On the other, instead of balance, it seems likely that one agent or the other creates a subagent that clobbers the others utility, with that agent assuming that this only happens in a world that never occurs.

Do you have any thoughts on what kind of experiments you’d like to see people run that would be more directly analogous?

Great post! The concept of meta-strategies seems very useful to add to my toolkit.

One thing that would make this more symmetrical is if some errors in your world model are worse than others. This makes inference more like a utility function.

Hmm... I suppose this is pretty good evidence that CCS may not be as promising as it first appeared, esp. the banana/shed results.

Update: Seems like the banana results are being challenged.

This is very tricky. On one hand, this may actually Streisand effect these results to greater prominence. On the other hand, at the point where people were specifically working around this to gain access to log-in gated LW resources, this would probably enhance our community status/prestige which might actually increase our influence.

Overall, I'd lean towards carefully experimenting with a log-in filter, with the notion of abandoning this idea if it doesn't seem to be achieving its goals.

Load More