This is a distribution of math problems GPT-3 wasn't finetuned on. Yet it's able to few-shot generalize and perform well. This is an amazing level of robustness relative to 2018 deep learning systems. I don't see why scaling and access to external tools (e.g. to perform long calculations) wouldn't produce the kind of robustness you have in mind.
I'm somewhat skeptical that models will actually be able to robustly learn these kinds of abstractions with a reasonable amount of scaling
GPT-3 (without external calculators) can do very well on math word problems (https://arxiv.org/abs/2206.02336) that combine basic facts about the world with abstract math reasoning. Why think that the kind of causal reasoning humans do is out of reach of scaling (especially if you allow external calculators)? It doesn't seem different in kind from these math word problems.
when can/do foundation models internalize explicitly stated knowledge
Some human causal reasoning is explicit. Humans can't do complex and exact calculations using System 1 intuition, and neither can we do causal reasoning of any sophistication using System 1. The prior over causal relations (e.g. that without looking at any data 'smoking causes cancer' is way more likely than the reverse) is more about general world-model building, and maybe there's more uncertainty about how well scaling learns that.
In the pre-training set, there are lots of places where humans talk about causality (both informally and more formally in myriad academic papers). So a model would ultimately need to learn abstract stuff about causality (e.g. correlation is not causation, arrow of time, causes are local, etc) and concrete causal facts (the moon causes tides, tiny organisms cause mold, etc). Given this knowledge, it's plausible a model M could make reasonable guesses for questions like, "What happens when a model with [properties of model M] starts interacting with the world?" These guesses would be improved by finetuning by RL on actual interaction between M and the world.
(It seems that most of my ability to make OOD predictions or causal inferences is based on passive/offline learning. I know science from books/papers and not from running my own rigorous controlled experiments or RCTs.)
Cool post! Did you try seeing whether GPT-3 can regenerate parts of the Iris dataset (or any other datasets that may appear in its training data)? I'd also be interested to see finetuning results, results for the latest InstructGPT, and to see analysis of the GPT-3 Embeddings for integers and floats.
This is a fantastic resource and seems like a great project for a research assistant. As with Rohin Shah's alignment newsletter, I'm excited to see this project continue and (potentially) expand.
I agree with most of this -- and my original comment should have been clearer. I'm wondering if the past five years of direct observations lead you to update the geography-based prior (which has been included in your alignment review since 2018). How much do you expect the quality of alignment work to differ from a new organization based in the Bay vs somewhere else? (No need to answer: I realize this is probably a small consideration and I don't want to start an unproductive thread on this topic).
Evans et al.'s Truthful AI: Developing and governing AI that does not lie is a detailed and length piece discussing a lot of issues around truthfulness for AI agents. This includes conceptual, practical and governance issues, especially with regard conversation bots. They argue for truthfulness (or at least, non-negligently-false)
The link should include "that does not lie". length --> lengthy
Lin et al.'s TruthfulQA: Measuring How Models Mimic Human Falsehoods provides a series of test questions to study how 'honest' various text models are. Of course, these models are trying to copy human responses, not be honest, so because many of the questions allude to common misconceptions, the more advanced models 'lie' more often. Interestingly they also used GPT-3 to evaluate the truth of these answers. See also the discussion here. Researchers from OpenPhil were also named authors on the paper. #Other
"OpenPhil" --> OpenAI
As a minor clarification, all the results in the paper are based on human evaluation of truth. But we show that GPT-3 can be used as a fairly reliable substitute for human evaluation under certain conditions.
Re: the Bay Area vs. other places. At this point, there's a fair amount of (messy) empirical evidence about how much being in the Bay Area impacts performance relative to being in other places. You could match organizations by area of research and do a comparison between the Bay and London/Oxford/Cambridge. E.g. OpenAI and Anthropic vs. DeepMind, OpenPhil (long-termist research) vs. FHI-GPI-CSER, CHAI vs Oxford and DeepMind. While people are not randomly assigned to these organizations, there is enough overlap of personnel that the observational evidence is likely to be meaningful. This kind of comparison seems preferable to general arguments, such as the claim that the Bay Area is expensive and has bad epistemics. (In terms of general arguments, I'd also mention that the Bay Area has the best track record in the world by a huge margin for producing technology companies and is among the top 5 regions in the world for cutting-edge scientific research.) ETA: I tried to clarify my thoughts in the reply to Larks.
Standards for truthful AI could be "opt-in". So humans might (a) choose to opt into truthfulness standards for their AI systems, and (b) choose from multiple competing evaluation bodies. Standards need not be mandated by governments to apply to all systems. (I'm not sure how much of your Balkanized internet is mandated by governments rather than arising from individuals opting into different web stacks). We also discuss having different standards for different applications. For example, you might want stricter and more conservative standards for AI that helps assess nuclear weapon safety than for AI that teaches foreign languages to children or assists philosophers with thought experiments.
A few points:
1. Political capture is a matter of degree. For a given evaluation mechanism, we can ask what percentage of answers given by the mechanism were false or inaccurate due to bias. My sense is that some mechanisms/resources would score much better than others. I’d be excited for people to do this kind of analysis with the goal of informing the design of evaluation mechanisms for AI.
I expect humans would ask AI many questions that don’t depend much on controversial political questions. This would include most questions about the natural sciences, math/CS, and engineering. This would also include “local” questions about particular things (e.g. “Does the doctor I’m seeing have expertise in this particular sub-field?”, “Am I likely to regret renting this particular apartment in a year?”). Unless the evaluation mechanism is extremely biased, it seems unlikely it would give biased answers for these questions. (The analogous question is what percentage of all sentences on Wikipedia are politically controversial.)
2. AI systems have the potential to provide rich epistemic information about their answers. If a human is especially interested in a particular question, they could ask, “Is this controversial? What kind of biases might influence answers (including your own answers)? What’s the best argument on the opposing side? How would you bet on a concrete operationalized version of the question?”. The general point is that humans can interact with the AI to get more nuanced information (compared to Wikipedia or academia). On the other hand: (a) some humans won’t ask for more nuance, (b) AIs may not be smart enough to provide it, (c) the same political bias may influence how the AI provides nuance.
3. Over time, I expect AI will be increasingly involved in the process of evaluating other AI systems. This doesn’t remove human biases. However, it might mean the problem of avoiding capture is somewhat different than with (say) academia and other human institutions.
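To make the analysis suggested in point 1 a bit more concrete, here is a minimal sketch of how one might compute, per evaluation mechanism, the percentage of answers that were false or inaccurate due to bias. Everything here is an assumption for illustration: the `Judgment` record, the `bias_error_rate` function, the mechanism names "A" and "B", and the toy labels are all made up. In practice the hard part is the audit that produces the labels, not the arithmetic.

```python
from dataclasses import dataclass


@dataclass
class Judgment:
    """One audited answer (hypothetical record format)."""
    mechanism: str           # name of the evaluation mechanism being audited
    false_due_to_bias: bool  # auditor's label: was the answer false/inaccurate due to bias?


def bias_error_rate(judgments):
    """Return {mechanism: percentage of audited answers labelled false due to bias}."""
    totals = {}
    biased = {}
    for j in judgments:
        totals[j.mechanism] = totals.get(j.mechanism, 0) + 1
        biased[j.mechanism] = biased.get(j.mechanism, 0) + int(j.false_due_to_bias)
    return {m: 100.0 * biased[m] / totals[m] for m in totals}


if __name__ == "__main__":
    # Toy, made-up audit labels for two hypothetical mechanisms.
    audit = [
        Judgment("A", True), Judgment("A", False),
        Judgment("A", False), Judgment("A", False),
        Judgment("B", True), Judgment("B", False),
    ]
    print(bias_error_rate(audit))  # {'A': 25.0, 'B': 50.0}
```

A natural extension would be to break the rate down by question category (e.g. natural sciences vs. politically controversial topics), since the claim above is that bias should matter much less for the former.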