AI ALIGNMENT FORUM

jsd · Ω6040

Posts

jsd's Shortform · 3y

Comments

Win/continue/lose scenarios and execute/replace/audit protocols
jsd · 10mo

This distinction reminds me of Evading Black-box Classifiers Without Breaking Eggs, in the black-box adversarial examples setting.
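For intuition, here is a toy sketch (with a hypothetical stand-in classifier) of that paper's cost model: the attacker is charged only for queries the defender flags as harmful ("broken eggs"), not for the total number of queries issued.

```python
import random

def run_attack(is_flagged, candidates):
    """Toy black-box attack loop tracking two cost metrics:
    total queries vs. only the queries flagged by the defender."""
    total_queries, bad_queries = 0, 0
    for x in candidates:
        total_queries += 1
        if is_flagged(x):
            bad_queries += 1  # flagged queries are what risk detection/bans
            continue
        return x, total_queries, bad_queries  # found an input that evades
    return None, total_queries, bad_queries

# Hypothetical defender: flags any input whose score exceeds a threshold.
classifier = lambda x: x > 0.5
candidates = [random.random() for _ in range(100)]
evading, total, bad = run_attack(classifier, candidates)
print(f"evading input: {evading}, total queries: {total}, bad queries: {bad}")
```

Under this metric an attack can be cheap even if it queries often, as long as few of its queries get flagged.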

Thoughts on open source AI
jsd · 2y

If they are right, then this protocol boils down to “evaluate, then open source.” I think there are advantages to having a policy which specializes to what AI safety folks want if AI safety folks are correct about the future, and to what open source folks want if open source folks are correct about the future.

In practice, arguing that your evaluations show open-sourcing is safe may involve substantial paperwork and possibly lawyer fees. If so, this would be a big barrier for small teams, so I expect open-source advocates would not be happy with such a trajectory.

unRLHF - Efficiently undoing LLM safeguards
jsd · 2y

I'd be curious how much more costly this attack is on LMs Pretrained with Human Preferences (including when that method is applied to only "a small fraction of pretraining tokens", as in PaLM 2).
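As a rough illustration, here is a sketch of the conditional-training variant of that method: prepend a control token reflecting a preference score to pretraining documents, optionally to only a small fraction of them. The scorer, token names, and fraction below are hypothetical placeholders, not the actual Korbak et al. or PaLM 2 setup.

```python
import random

# Hypothetical control tokens; the model learns to condition on these.
GOOD_TOKEN, BAD_TOKEN = "<|good|>", "<|bad|>"

def preference_score(doc: str) -> float:
    """Stand-in for a real preference/toxicity classifier."""
    return random.random()

def tag_corpus(docs, tagged_fraction=0.05, threshold=0.5):
    """Prepend a control token based on each document's score,
    but only to a random fraction of documents (cf. PaLM 2 applying
    conditioning to "a small fraction of pretraining tokens")."""
    tagged = []
    for doc in docs:
        if random.random() < tagged_fraction:
            token = BAD_TOKEN if preference_score(doc) > threshold else GOOD_TOKEN
            tagged.append(token + " " + doc)
        else:
            tagged.append(doc)  # most documents are left untouched
    return tagged

corpus = ["first example document", "second example document", "third example document"]
print(tag_corpus(corpus, tagged_fraction=0.5))
```

At inference you would prepend GOOD_TOKEN to steer generation; the question above is how much that conditioning raises the cost of fine-tuning the safeguards back out.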

Emergent modularity and safety
jsd · 4y

Relevant related work: NNs are surprisingly modular

https://arxiv.org/abs/2003.04881v2?ref=mlnew

I believe Richard linked to Clusterability in Neural Networks, which has superseded Pruned Neural Networks are Surprisingly Modular. 

The same authors also recently published Detecting Modularity in Deep Neural Networks.
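
For reference, a minimal numpy sketch of the clusterability measure those papers build on: treat the network's weights as a weighted graph, bipartition it spectrally, and score the partition by normalized cut (lower n-cut means more modular). The tiny random MLP below is a made-up example, not their experimental setup.

```python
import numpy as np

def ncut(adj, labels):
    """Normalized cut of a 2-way partition of a weighted graph."""
    total = 0.0
    for c in (0, 1):
        mask = labels == c
        cut = adj[mask][:, ~mask].sum()  # weight leaving cluster c
        vol = adj[mask].sum()            # total weight incident to cluster c
        total += cut / vol if vol > 0 else 0.0
    return total

def spectral_bipartition(adj):
    """Split nodes by the sign of the Fiedler vector of the
    normalized graph Laplacian (the standard spectral step)."""
    deg = adj.sum(axis=1)
    d_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(deg, 1e-12)))
    lap = np.eye(len(adj)) - d_inv_sqrt @ adj @ d_inv_sqrt
    _, eigvecs = np.linalg.eigh(lap)
    return (eigvecs[:, 1] > 0).astype(int)  # second-smallest eigenvector

# Hypothetical 2-layer MLP (4 -> 6 -> 3); edge weights are |W|.
rng = np.random.default_rng(0)
w1 = np.abs(rng.normal(size=(4, 6)))
w2 = np.abs(rng.normal(size=(6, 3)))
n = 4 + 6 + 3
adj = np.zeros((n, n))
adj[0:4, 4:10] = w1
adj[4:10, 10:13] = w2
adj = adj + adj.T  # make the graph undirected

labels = spectral_bipartition(adj)
print("n-cut:", ncut(adj, labels))  # lower means more clusterable
```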
