I'm a CS master's student at ENS Paris-Saclay. I want to pursue a career in AI safety research.
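Let's assume the prompt template is x = Q [true/false] [banana/shred].
If I understand correctly, they don't claim that $p$ learned has_banana, but that $\tilde{p} = \frac{p(x^+) + (1 - p(x^-))}{2}$ learned has_banana. Moreover, evaluating $\tilde{p}$ for $p = \text{is\_true}(x) \oplus \text{is\_shred}(x)$ gives:
$\tilde{p}(x = Q\ [?]\ \text{banana}) = \frac{p(Q\ \text{true banana}) + (1 - p(Q\ \text{false banana}))}{2} = \frac{1 + (1 - 0)}{2} = 1$
$\tilde{p}(x = Q\ [?]\ \text{shred}) = \frac{p(Q\ \text{true shred}) + (1 - p(Q\ \text{false shred}))}{2} = \frac{0 + (1 - 1)}{2} = 0$
Therefore, we can learn a $\tilde{p}$ that is a banana classifier.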
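
The arithmetic above can be checked with a small sketch. It assumes an idealized probe that outputs exactly 0 or 1 and computes is_true XOR is_shred; the function names `p` and `p_tilde` and the (answer, suffix) encoding of the prompt are illustrative choices of mine, not anything from the original post.

```python
def p(answer: str, suffix: str) -> int:
    """Hypothetical idealized probe: is_true(x) XOR is_shred(x)."""
    is_true = answer == "true"
    is_shred = suffix == "shred"
    return int(is_true ^ is_shred)


def p_tilde(suffix: str) -> float:
    """Average p(x+) and 1 - p(x-) over the contrast pair sharing a suffix."""
    p_plus = p("true", suffix)    # prompt "Q true <suffix>"
    p_minus = p("false", suffix)  # prompt "Q false <suffix>"
    return (p_plus + (1 - p_minus)) / 2


# Evaluate the averaged probe on both suffixes.
results = {suffix: p_tilde(suffix) for suffix in ("banana", "shred")}
for suffix, value in results.items():
    print(suffix, value)
```

Under these assumptions the averaged probe returns 1.0 for banana and 0.0 for shred, independent of the true/false token, which is what makes it a banana classifier.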