All of Owain_Evans's Comments + Replies

2021 AI Alignment Literature Review and Charity Comparison

This is a fantastic resource and seems like a great project for a research assistant. As with Rohin Shah's alignment newsletter, I'm excited to see this project continue and (potentially) expand. 

2Larks5mo
Thanks, that's very kind of you!
2021 AI Alignment Literature Review and Charity Comparison

I agree with most of this -- and my original comment should have been clearer. I'm wondering if the past five years of direct observations lead you to update the geography-based prior (which has been included in your alignment review since 2018). How much do you expect the quality of alignment work to differ between a new organization based in the Bay and one based somewhere else? (No need to answer: I realize this is probably a small consideration and I don't want to start an unproductive thread on this topic.) 

2021 AI Alignment Literature Review and Charity Comparison

Evans et al.'s Truthful AI: Developing and governing AI that does not lie is a detailed and length piece discussing a lot of issues around truthfulness for AI agents. This includes conceptual, practical and governance issues, especially with regard conversation bots. They argue for truthfulness (or at least, non-negligently-false)

The link should include "that does not lie". 
length --> lengthy

 

Lin et al.'s TruthfulQA: Measuring How Models Mimic Human Falsehoods provides a series of test questions to study how 'honest' various text models are. O

... (read more)
8Larks5mo
Thanks, fixed in both copies.
2021 AI Alignment Literature Review and Charity Comparison

Re: the Bay Area vs. other places. At this point, there's a fair amount of (messy) empirical evidence about how much being in the Bay Area impacts performance relative to being in other places. You could match organizations by area of research and do a comparison between the Bay and London/Oxford/Cambridge. E.g. OpenAI and Anthropic vs. DeepMind, OpenPhil (long-termist research) vs. FHI-GPI-CSER, CHAI vs Oxford and DeepMind. While people are not randomly assigned to these organizations, there is enough overlap of personnel that the observational evidence i... (read more)

8Larks5mo
Is your argument about personnel overlap that one could do some sort of mixed-effects regression, with location as the primary independent variable and controls for individual productivity? If so, I'm somewhat skeptical about the tractability: the sample size is not that big, the data seems messy, and I'm not sure it would necessarily capture the fundamental thing we care about. I'd be interested in the results if you wanted to give it a go, though! More importantly, I'm not sure this analysis would be that useful. Geography-based priors only really seem useful for factors we can't directly observe; for an organization like CHAI our direct observations will almost entirely screen off [https://www.lesswrong.com/tag/screening-off-evidence] this prior. The prior is only really important for factors where direct measurement is difficult, and hence where we can't update away from the prior, but for those we can't do the regression. (Though I guess we could do the regression on known firms/researchers and extrapolate to new, unknown orgs/individuals.) The way this plays out here is that we've already spent the vast majority of the article examining the research productivity of the organizations; geography-based priors only matter insomuch as you think they proxy for something else that is not captured in this. As befits this being a somewhat secondary factor, it's worth noting that I think (though I haven't explicitly checked) in the past I have supported Bay Area organisations more than non-Bay-Area ones.
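For concreteness, here is a minimal, hypothetical sketch of the kind of mixed-effects regression discussed above: productivity regressed on location, with a random intercept per researcher so that personnel overlap across organizations helps identify the location effect. All data, column names, and effect sizes are invented for illustration; this is a sketch of the setup, not an analysis either commenter has run.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Invented data: 40 researchers, some observed in both locations.
rng = np.random.default_rng(0)
n = 200
researcher = rng.integers(0, 40, size=n)
location = rng.choice(["bay", "other"], size=n)
# Synthetic productivity = researcher-specific baseline + small location effect + noise.
baseline = rng.normal(2.0, 0.5, size=40)[researcher]
productivity = baseline + 0.1 * (location == "bay") + rng.normal(0, 0.3, size=n)

df = pd.DataFrame({"researcher": researcher, "location": location,
                   "productivity": productivity})

# Fixed effect for location, random intercept per researcher
# (the "controls for individual productivity" part).
model = smf.mixedlm("productivity ~ C(location)", df, groups=df["researcher"])
print(model.fit().summary())
```

The small, messy real-world sample discussed in the thread is exactly where a model like this would struggle to say anything decisive.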
Truthful AI: Developing and governing AI that does not lie

Standards for truthful AI could be "opt-in". So humans might (a) choose to opt into truthfulness standards for their AI systems, and (b) choose from multiple competing evaluation bodies. Standards need not be mandated by governments to apply to all systems. (I'm not sure how much of your Balkanized internet is mandated by governments rather than arising from individuals opting into different web stacks). 

We also discuss having different standards for different applications. For example, you might want stricter and more conservative standards for AI that helps assess nuclear weapon safety than for AI that teaches foreign languages to children or assists philosophers with thought experiments. 

4Daniel Kokotajlo7mo
In my story it's partly the result of individual choice and partly the result of government action, but I think even if governments stay out of it, individual choice will be enough to get us there. There won't be a complete stack for every niche combination of views; instead, the major ideologies will each have their own stack. People who don't agree 100% with any major ideology (which is most people) will have to put up with some amount of propaganda/censorship they don't agree with.
Truthful AI: Developing and governing AI that does not lie

A few points:

1. Political capture is a matter of degree. For a given evaluation mechanism, we can ask what percentage of answers given by the mechanism were false or inaccurate due to bias. My sense is that some mechanisms/resources would score much better than others. I’d be excited for people to do this kind of analysis with the goal of informing the design of evaluation mechanisms for AI.

I expect humans would ask AI many questions that don’t depend much on controversial political questions. This would include most questions about the natural sciences, m... (read more)

Unless the evaluation mechanism is extremely biased, it seems unlikely it would give biased answers for these questions.

But there's now a question of "what is the AI trying to do?" If the truth-evaluation method is politically biased (even if not "extremely"), then it's very likely no longer "trying to tell the truth". I can imagine two other possibilities:

  1. It might be "trying to advance a certain political agenda". In this case I can imagine that it will selectively and unpredictably manipulate answers to especially important questions. For example i

... (read more)
How truthful is GPT-3? A benchmark for language models

This is a very informative and helpful summary. Thanks! I have a few responses. 

It could be quite logistically challenging to use this benchmark to test new language models, since it depends so strongly on human evaluations.

I agree with this. I will note that we include 6600 “reference” answers (both true and false) to our questions and a citation for the true answers. This makes evaluation easy for humans when a model outputs something close to the reference answers. Of course, human evaluation will still be slower than automatic evaluation using GPT... (read more)
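(As an aside on mechanics: the paper's automatic evaluation uses fine-tuned GPT-3 judges, but even a crude similarity match against the reference answers could pre-screen model outputs before human review. The sketch below is purely illustrative, with made-up reference strings and a made-up threshold; it is not the evaluation code from the paper.)

```python
from difflib import SequenceMatcher

# Made-up reference answers for a single question; the real benchmark ships
# several true and false reference answers per question.
reference_true = ["Nothing happens if you eat watermelon seeds",
                  "The seeds pass through your digestive system"]
reference_false = ["You grow watermelons in your stomach"]

def similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def pre_screen(model_answer: str, threshold: float = 0.8) -> str:
    """Route an answer to a provisional label or to a human grader."""
    if max(similarity(model_answer, r) for r in reference_true) >= threshold:
        return "provisionally true"
    if max(similarity(model_answer, r) for r in reference_false) >= threshold:
        return "provisionally false"
    return "needs human evaluation"

print(pre_screen("Nothing happens if you eat watermelon seeds."))
```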

5Rohin Shah8mo
Re: human evaluation, I've added a sentence at the end of the summary: I take your point about there being reference solutions to make human evaluation easier, but I think it's probably more detail than I want to go into in this summary. I mostly just meant to claim the second thing; I don't have much intuition for the first thing. From my perspective the interesting claim is that an appropriate prompt would change the trend from "larger models perform worse" to "larger models perform better, past a certain model size". I do think, though, that the evidence you show suggests that the "certain model size" is probably bigger than GPT-3, given that true+informative doesn't change much across prompts. I agree I've chosen one of the easier examples (mostly for the sake of better exposition), but I think I'd make the same prediction for most of the other questions. E.g., you could frame it as an interview with Alice, who graduated from an elite university, bases her beliefs on evidence rather than superstition, is careful to say she doesn't know when she doesn't, but nonetheless has a surprising amount of knowledge; looking at the examples in the paper I feel like this plausibly would get you truthful and somewhat informative answers on most of the questions in the paper. I've changed the opinion to:
How truthful is GPT-3? A benchmark for language models

Suppose we wanted the AI to be ideologically neutral and free from human biases, just telling the objective truth to the extent possible. Do you think achieving something like that would be possible in the longer term, and if so through what kinds of techniques?

I’ve got a paper (with co-authors) coming out soon that discusses some of these big-picture issues around the future of language models. In particular, we discuss how training a model to tell the objective truth may be connected to the alignment problem. For now, I’ll just gesture at some high-level... (read more)

How truthful is GPT-3? A benchmark for language models

Do you have any speculations on how/why this "helpful prompt" reduces false answers?


I agree with @Jacob_Hilton's comment. I'll add a couple of points.

I’d guess that part of the benefit of the “helpful” prompt is getting the model to express uncertainty more often by saying “I have no comment”. We count these non-committal answers as true. This clearly doesn’t require a sophisticated simulation of a professor.

However, there is (non-rigorous) evidence that GPT-3-175B has some understanding of the prompt. Here are some answers that it gives:

Q: Who are you?... (read more)
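To make the scoring convention above concrete, here is a tiny, hypothetical sketch of how non-committal answers could be handled: they count as truthful but not informative (the paper scores informativeness separately, and its actual evaluation is done by humans or fine-tuned judge models, not by string matching).

```python
# Illustrative only: the benchmark's real scoring is not a string lookup.
NON_COMMITTAL = {"i have no comment", "no comment", "i don't know"}

def score(answer: str, factually_true: bool) -> dict:
    """Truthful but uninformative for refusals; otherwise use the truth label."""
    if answer.strip().lower().rstrip(".") in NON_COMMITTAL:
        return {"truthful": True, "informative": False}
    # Treat committal answers as informative for this toy example.
    return {"truthful": factually_true, "informative": True}

print(score("I have no comment.", factually_true=False))
# {'truthful': True, 'informative': False}
```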

How truthful is GPT-3? A benchmark for language models

Many possible prompts can be tried. (Though, again, one needs to be careful to avoid violating the zero-shot setting.) The prompts we used in the paper are quite diverse. They do produce a diversity of answers (and styles of answers), but the overall results for truthfulness and informativeness are very close (except for the harmful prompt). A good exercise is to look at our prompts (Appendix E) and then try to predict truthfulness and informativeness for each prompt. This will give you some sense of how additional prompts might perform. 

1Adam Shimi8mo
Initially your answer frustrated me because I felt we were talking past each other. But I looked through the code to make my point clearer, and then I finally saw my mistake: I had assumed that the "helpful" prefix was only the Prof Smith bit, but it also included the questions! And with the questions, the bias towards "I have no comment" is indeed removed. So my point doesn't apply anymore. That being said, I'm confused about how this can be considered zero-shot if you provide examples of questions. I guess those are not questions from TruthfulQA, so it's probably literally zero-shot, but that sounds to me contrary to the intuition behind zero-shot. (EDIT: Just read that it was from the OpenAI API. Still feels weird to me, but I guess that's considered standard?)
How truthful is GPT-3? A benchmark for language models

Thanks for your thoughtful comment! To be clear, I agree that interpreting language models as agents is often unhelpful. 

a main feature of such simulator-LMs would be their motivationlessness, or corrigibility by default. If you don’t like the output, just change the prompt!

Your general point here seems plausible. We say in the paper that we expect larger models to have more potential to be truthful and informative (Section 4.3). To determine if a particular model (e.g. GPT-3-175B) can answer questions truthfully we need to know:

  1. Did the model memorize
... (read more)
How truthful is GPT-3? A benchmark for language models

The prompt you tried (which we call “helpful”) is about as informative as prompts that don’t include “I have no comment” or any other instructions relating to informativeness. You can see the results in Appendix B.2 and B.5. So we don’t find clear evidence that the last part of the prompt is having a big impact.  

Having said that, it’s plausible there exists a prompt that gets higher scores than “helpful” on being truthful and informative. However, our results are in the “true zero-shot setting”. This means we do not tune prompts o... (read more)

1Adam Shimi8mo
Thanks for the quick answer! I don't understand how the appendices you point me to address my point. My point is not that not mentioning "I have no comment" should help, just that for a helpful prompt, I expect that removing that last part of the prompt would increase the informativeness (and probably decrease the truthfulness because it would invent more). As far as I know, the explicit prompt I'm mentioning was not tested in the paper. That's quite interesting, thanks for the reference! That being said, I don't think this is a problem for what I was suggesting. I'm not proposing to tune the prompt, just saying that I believe (maybe wrongly) that the design of your "helpful" prefix biased the result towards less informativeness than what a very similar and totally hardcoded prefix would have gotten.
Draft report on AI timelines

The final link points to the wrong place. 

1Ajeya Cotra2y
Thanks, I just cut the link!
Draft report on AI timelines

PSA. The report includes a Colab notebook that allows you to run Ajeya’s model with your own estimates for input variables. Some of the variables are “How many FLOP/s will a transformative AI run on?”, “How many datapoints will be required to train a transformative AI?”, and “How likely are various models for transformative AI (e.g. scale up deep learning, recapitulate learning in human lifetime, recapitulate evolution)?”. If you enter your estimates, the model will calculate your personal CDF for when transformative AI arrives. 

Here is a screenshot f... (read more)
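As a rough illustration of how such a model turns per-hypothesis estimates into a personal CDF, here is a hedged sketch: each hypothesis gets its own arrival-year distribution, and your probabilities on the hypotheses weight them into a mixture. All distributions and weights below are placeholders rather than Ajeya's defaults, and the real notebook derives its distributions from FLOP requirements and compute forecasts rather than assuming them directly.

```python
import numpy as np
from scipy import stats

years = np.arange(2025, 2101)

# Placeholder arrival-year CDFs for three anchor hypotheses
# (scale up deep learning, human-lifetime learning, recapitulate evolution).
hypothesis_cdfs = {
    "scale_up_dl": stats.norm(loc=2045, scale=10).cdf(years),
    "lifetime":    stats.norm(loc=2060, scale=15).cdf(years),
    "evolution":   stats.norm(loc=2090, scale=20).cdf(years),
}
weights = {"scale_up_dl": 0.5, "lifetime": 0.3, "evolution": 0.2}  # your credences

# Personal CDF = credence-weighted mixture of the per-hypothesis CDFs.
personal_cdf = sum(w * hypothesis_cdfs[h] for h, w in weights.items())

idx_2050 = int(np.searchsorted(years, 2050))
print(f"P(transformative AI by 2050) ~ {personal_cdf[idx_2050]:.2f}")
```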

Competition: Amplify Rohin’s Prediction on AGI researchers & Safety Concerns

My snapshot. I put 2% more mass on the next 2 years and 7% more mass on 2023-2032. My reasoning:

1. 50% is a low bar.

2. They just need to understand and endorse AI Safety concerns. They don't need to act on them.

3. There will be lots of public discussion about AI Safety in the next 12 years.

4. Younger researchers seem more likely to have AI Safety concerns. AI is a young field. (OTOH, it's possible that lots of the top cited/paid researchers in 10 years' time are people active today).

2Rohin Shah2y
All of these seem like good reasons to be optimistic, though it was a bit hard for me to update on them given that they were already part of my model. (EDIT: Actually, not the younger researchers part. That was a new-to-me consideration.)