Many possible prompts can be tried (though, again, one needs to be careful to avoid violating the zero-shot setting). The prompts we used in the paper are quite diverse. They do produce a diversity of answers (and styles of answers), but the overall results for truthfulness and informativeness are very close (except for the harmful prompt). A good exercise is to look at our prompts (Appendix E) and then try to predict truthfulness and informativeness for each prompt. This will give you some sense of how additional prompts might perform.
Thanks for your thoughtful comment! To be clear, I agree that interpreting language models as agents is often unhelpful.
> a main feature of such simulator-LMs would be their motivationlessness, or corrigibility by default. If you don’t like the output, just change the prompt!
Your general point here seems plausible. We say in the paper that we expect larger models to have more potential to be truthful and informative (Section 4.3). To determine whether a particular model (e.g. GPT-3-175B) can answer questions truthfully, we need to know:
(Fwiw, I can imagine finetuning being more helpful than prompt engineering for current models.)
Regarding honesty: We don’t describe imitative falsehoods as dishonest. In the OP, I just wanted to connect our work on truthfulness to recent posts on LW that discussed honesty. Note that the term “honesty” can be used with a specific operational meaning without making strong assumptions about agency. (Whether it’s helpful to use the term is another matter.)
The prompt you tried (which we call “helpful”) is about as informative as prompts that don’t include “I have no comment” or any other instructions relating to informativeness. You can see the results in Appendix B.2 and B.5. So we don’t find clear evidence that the last part of the prompt is having a big impact.
Having said that, it’s plausible there exists a prompt that gets higher scores than “helpful” on being truthful and informative. However, our results are in the “true zero-shot setting”. This means we do not tune prompts on the dataset at all. If you tried out lots of prompts and picked the one that did best on a subset of our questions, you’d probably do better, but you’d no longer be in the true zero-shot setting. (This paper has a good discussion of how to measure zero/few-shot performance.)
The final link points to the wrong place.
PSA. The report includes a Colab notebook that allows you to run Ajeya’s model with your own estimates for input variables. Some of the variables are “How many FLOP/s will a transformative AI run on?”, “How many datapoints will be required to train a transformative AI?”, and “How likely are various models for transformative AI (e.g. scale up deep learning, recapitulate learning in human lifetime, recapitulate evolution)?”. If you enter your estimates, the model will calculate your personal CDF for when transformative AI arrives.
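To give a rough sense of the kind of calculation involved, here is a minimal illustrative sketch (not the notebook's actual code; all numbers below are made-up placeholders, not Ajeya's estimates). It combines a distribution over how much training compute transformative AI requires with a simple forecast of affordable training compute per year, and reads off a CDF over arrival years:

```python
import numpy as np

# Illustrative sketch only: combine a guess at the training FLOP required for
# transformative AI with a guess at how affordable training FLOP grows over
# time, then compute P(TAI by year Y). All numbers are placeholders.

rng = np.random.default_rng(0)
n_samples = 100_000

# Hypothetical prior over log10(training FLOP) required for transformative AI.
log10_flop_required = rng.normal(loc=34.0, scale=3.0, size=n_samples)

# Hypothetical forecast: affordable training FLOP is 10^25 in 2025 and grows
# by half an order of magnitude per year.
years = np.arange(2025, 2101)
log10_flop_affordable = 25.0 + 0.5 * (years - 2025)

# CDF: P(TAI by year Y) = fraction of samples whose requirement is met by Y.
cdf = np.array([(log10_flop_required <= a).mean() for a in log10_flop_affordable])

for year, p in zip(years[::10], cdf[::10]):
    print(f"P(transformative AI by {year}) ~ {p:.2f}")
```

The actual model is much richer (it mixes several biological anchors and forecasts compute prices and spending separately), but the overall structure of "your estimates for the input variables → distribution over required compute → CDF over arrival year" is roughly the same.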
Here is a screenshot from the Colab notebook. Your distribution (“Your forecast”) is shown alongside the distributions of Ajeya, Tom Davidson (Open Philanthropy) and Jacob Hilton (OpenAI). You can also read their explanations for their distributions under “Notes”. (I work at Ought and we worked on the Elicit features in this notebook.)
My snapshot. I put 2% more mass on the next 2 years and 7% more mass on 2023-2032. My reasoning:
1. 50% is a low bar.
2. They just need to understand and endorse AI Safety concerns. They don't need to act on them.
3. There will be lots of public discussion about AI Safety in the next 12 years.
4. Younger researchers seem more likely to have AI Safety concerns. AI is a young field. (OTOH, it's possible that lots of the top-cited/highest-paid researchers in 10 years' time are people active today.)