You can try our interactive demo! (Or read our preprint.)
Here, we want to explain why we care about this work from an AI safety perspective.
To the best of our knowledge, image hijacks constitute the first demonstration of adversarial inputs for foundation models that
It's possible that a future text-only attack could do these things, but such an attack hasn't yet been demonstrated.
We expect that future (foundation-model-based) AI systems will be able to
As the actions of such foundation-model-based agents are based on the foundation model's text output, hijacking the foundation model's text output could give an adversary arbitrary control over the agent's behaviour.
Race to the top on adversarial robustness. Robustness to attacks such as image hijacks is (i) a control problem, (ii) which we can measure, and (iii) which has real-world safety implications today. So we're excited to see AI labs compete to have the most adversarially robust models.
Third-party auditing and certification. Auditors could test for robustness against image hijacks, both at the foundation model level (auditing the major AGI corporations) and at the app development level (auditing downstream companies integrating foundation models into their products). Image hijacks could also be used to test for the presence of dangerous capabilities (characterisable as some behaviour B) by attempting to train an image hijack for that capability.
Liability for AI-caused harms, penalizing externalities. Both the Future of Life Institute and Jaan Tallinn advocate for liability for AI-caused harms. When assessing AI-caused harms, image hijacks may need to be part of the picture.
Glad to see that this work is out!
I don't have much to say here, especially since I don't want to rehash the old arguments about the usefulness of prosaic adversarial ML research. (I think it's worth working on but the direct impacts of the work are unclear). I do think that most people in AIS agree that image advexes are challenging and generally unsolved, but the people who disagree on the relevance of this line of research tend to question the implied threat model.