Peter Hroššo - AI Alignment Forum

Inverse Scaling Prize: Second Round Winners

Re "Prompt Injection, by Derik Kauffman, Aaron Kirtland, Andrew Gritsevskiy, and Joe Cavanagh (Third Prize)": text-davinci-003 works perfectly if the last input's first letter is lower case. It correctly predicts "Input" as the first token with >99% prob.

It seems that the last input being out-of-distribution wrt. to the few shot examples makes the model default to its priors and follow the instruction.

Also, interestingly, replacing the "following sentences" in the instruction with "following inputs" also fixes the model's behavior. I felt the term "sentences" was a little bit vague, while using "inputs" seemed more clear to me. This might be another piece of evidence that under higher uncertainty the model is more likely to default to its priors.

How I think about alignment

Peter Hroššo2y21

I expect there to be too much happenstance encoded in my values.

I believe this is a bug, not a feature that we would like to reproduce.

I think that the direction you described with the AI analysing how you acquired your values is important, because it shouldn't be mimicking just your current values. It should be able to adapt the values to new situations the way you'd do (distributional shift). Think all the books / movies where people get to unusual situations and have to make tough moral calls. Like plane crashing in the middle of nowhere with 20 survivors who are gradually running out of food.. Superhuman AI will be running into unknown situations all the time because of different capabilities.

Human values are undefined for most situations a superhuman AI will encounter.

Gradient hacking

Peter Hroššo5y30

If your model is deceptive, though, then it might know all of that

Could you please describe your intuition behind how the model could know the meta-optimizer is going to perform checks on deceptive behavior?

AI ALIGNMENT FORUM
AF

Posts

Wiki Contributions

Comments