This is a special post for quick takes by Samuel Dylan Martin. Only they can create top-level comments. Comments here also appear on the Quick Takes page and All Posts page.

Gary Marcus, noted sceptic of Deep Learning, wrote an article with Ernest Davis:

GPT-3, Bloviator: OpenAI’s language generator has no idea what it’s talking about

The article purports to give six examples of GPT-3's failure - Biological, Physical, Social, Object and Psychological reasoning and 'non sequiturs'. Leaving aside that GPT-3 works on Gary's earlier GPT-2 failure examples, and that it seems as though he specifically searched out weak points by testing GPT-3 on many more examples than were given, something a bit odd is going on with the results they gave. I got better results when running his prompts on AI Dungeon.

With no reruns, randomness = 0.5, I gave Gary's questions (all six gave answers considered 'failures' by Gary) to GPT-3 via AI Dungeon with a short scene-setting prompt, and got good answers to 4 of them, and reasonable vague answers to the other 2:

This is a series of scenarios describing a human taking actions in the world, designed to test physical and common-sense reasoning.
1) You poured yourself a glass of cranberry juice, but then you absentmindedly poured about a teaspoon of grape juice into it. It looks okay. You try sniffing it, but you have a bad cold, so you can’t smell anything. You are very thirsty. So you take another drink.
2) You are having a small dinner party. You want to serve dinner in the living room. The dining room table is wider than the doorway, so to get it into the living room, you will have to  move furniture. This means that some people will be inconvenienced.
3) You are a defense lawyer and you have to go to court today. Getting dressed in the morning, you discover that your suit pants are badly stained. However, your bathing suit is clean and very stylish. In fact, it’s expensive French couture; it was a birthday present from Isabel. You decide that you should wear it because you won't look professional in your stained pants, but you are worried that the judge will think you aren't taking the case seriously if you are wearing a bathing suit.
4) Yesterday I dropped my clothes off at the dry cleaner’s and I have yet to pick them up. Where are my clothes?
5) Janet and Penny went to the store to get presents for Jack. Janet said, “I will buy Jack a top.” “Don’t get Jack a top,” says Penny. “He has a top. He will prefer a bottom."
6) At the party, I poured myself a glass of lemonade, but it turned out to be too sour, so I added a little sugar. I didn’t see a spoon handy, so I stirred it with a cigarette. But that turned out to be a bad idea because it was a menthol, and it ruined the taste. So I added a little more sugar to counteract the menthol, and then I noticed that my cigarette had fallen into the glass and was floating in the lemonade.

For 1), Gary's example ended with 'you are now dead', whereas I got a reasonable, if short, continuation - success.

2) - the answer is vague enough to be a technically correct solution, 'move furniture' = tilt the table, but since we're being strict I'll count it as a failure. Gary's example was a convoluted attempt to saw the door in half, clearly mistaken.

3) is very obviously intended to trick the AI into endorsing the bathing suit answer, in fact it feels like a classic priming trick that might trip up a human! But in my version GPT-3 rebels against the attempt and notices the incongruence of wearing a bathing suit to court, so it counts as a success. Gary's example didn't include the worry that a bathing suit was inappropriate - arguably not a failure, but nevermind, let's move on.

4) is actually a complete prompt by itself, so the AI didn't do anything - GPT-3 doesn't care about answering questions, just continuing text with high probability. Gary's answer was 'I have a lot of clothes', and no doubt he'd call both 'evasion', so to be strict we'll agree with him and count that as failure.

5) Trousers are called 'bottoms' so that's right. Gary would call it wrong since 'the intended continuation' was “He will make you take it back", but that's absurdly unfair, that's not the only answer a human being might give, so I have to say it's correct. Gary's example ' lost track of the fact that Penny is advising Janet against getting a top', which didn't happen here, so that's acceptable.

Lastly, 6) is a slightly bizarre but logical continuation of an intentionally weird prompt - so correct. It also demonstrates correct physical reasoning - stirring a drink with a cigarette won't be good for the taste. Gary's answer wandered off-topic and started talking about cremation.

So, 4/6 correct on an intentionally deceptive and adversarial set of prompts, and that's on a fairly strict definition of correct. 2) and 4) are arguably not wrong, even if evasive and vague. More to the point, this was on an inferior version of GPT-3 to the one Gary used, the Dragon model from AI Dungeon!

I'm not sure what's going on here - is it the initial prompt saying it was 'testing physical and common sense reasoning'? Was that all it took?

I'm not sure what's going on here - is it the initial prompt saying it was 'testing physical and common sense reasoning'? Was that all it took?

Entirely possible. Other people have mentioned that using any prompt (rather than just plopping the stories in) solves a lot of them, and Summers-stay says that Marcus & Davis did zero prompt programming and had no interest in the question of what prompt to use (quite aside from the lack of BO). I think they found the same thing, which is why they provide the preemptive excuse in the TR writeup:

Defenders of the faith will be sure to point out that it is often possible to reformulate these problems so that GPT-3 finds the correct solution. For instance, you can get GPT-3 to give the correct answer to the cranberry/grape juice problem if you give it the following long-winded frame as a prompt:

I don't think that excuse works in this case - I didn't give it a 'long-winded frame', just that brief sentence at the start, and then the list of scenarios, and even though I reran it a couple of times on each to check, the 'cranberry/grape juice kills you' outcome never arose.

So, perhaps they switched directly from no prompt to an incredibly long-winded and specific prompt without checking what was actually necessary for a good answer? I'll point out that I didn't really attempt any sophisticated prompt programming either - that was literally the first sentence I thought of!

Modelling the Human Trajectory or ‘How I learned to stop worrying and love Hegel’.

Rohin’s opinion: I enjoyed this post; it gave me a visceral sense for what hyperbolic models with noise look like (see the blog post for this, the summary doesn’t capture it). Overall, I think my takeaway is that the picture used in AI risk of explosive growth is in fact plausible, despite how crazy it initially sounds.

One thing this post led me to consider is that when we bring together various fields, the evidence for 'things will go insane in the next century' is stronger than any specific claim about (for example) AI takeoff. What is the other evidence?

We're probably alone in the universe, and anthropic arguments tend to imply we're living at an incredibly unusual time in history. Isn't that what you'd expect to see in the same world where there is a totally plausible mechanism that could carry us a long way up this line, in the form of AGI and eternity in six hours? All the pieces are already there, and they only need to be approximately right for our lifetimes to be far weirder than those of people who were e.g. born in 1896 and lived to 1947 - which was weird enough, but that should be your minimum expectation.

In general, there are three categories of evidence that things are likely to become very weird over the next century, or that we live at the hinge of history:

  1. Specific mechanisms around AGI - possibility of rapid capability gain, and arguments from exploratory engineering

  2. Economic and technological trend-fitting predicting explosive growth in the next century

  3. Anthropic and Fermi arguments suggesting that we live at some extremely unusual time

All of these are evidence for such a claim. 1) is because a superintelligent AGI takeoff is just a specific example for how the hinge occurs. 3) is already directly arguing for that, but how does 2) fit in with 1) and 3)?

There is something a little strange about calling a fast takeoff from AGI and whatever was driving superexponential growth throughout all history the same trend - there is some huge cosmic coincidence that causes there to always be superexponential growth - so as soon as population growth + growth in wealth per capita or whatever was driving it until now runs out in the great stagnation (which is visible as a tiny blip on the RHS of the double-log plot), AGI takes over and pushes us up the same trend line. That's clearly not possible, so there would have to be some factor responsible for both if AGI is what takes us up the rest of that trend line - a factor that was at work in the founding of Jericho but predestined that AGI would be invented and cause explosive growth in the 21st century, rather than the 19th or the 23rd.

For AGI to be the driver of the rest of that growth curve, there has to be a single causal mechanism that keeps us on the same trend and includes AGI as its final step - if we say we are agnostic about what that mechanism is, we can still call 2) evidence for us living at the hinge point, though we have to note that there is a huge blank spot in need of explanation. Is there anything that can fill it to complete the picture?

The mechanism proposed in the article seems like it could plausibly include AGI.

If technology is responsible for the growth rate, then reinvesting production in technology will cause the growth rate to be faster. I'd be curious to see data on what fraction of GWP gets reinvested in improved technology and how that lines up with the other trends.
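To make the reinvestment intuition concrete, here is a minimal sketch. It assumes a growth law of the form dY/dt = s * Y^(1+b), where output Y feeds back into its own growth rate. That functional form and the parameter names `s` and `b` are illustrative assumptions rather than the article's fitted model. With b > 0 each doubling takes less time than the one before it, and the closed-form solution diverges at a finite time.

```python
def hyperbolic_output(t, y0=1.0, s=0.02, b=0.5):
    """Closed-form solution of dY/dt = s * Y**(1 + b) with Y(0) = y0.

    The growth law itself is an illustrative assumption, not a fitted model.
    """
    remaining = 1.0 - b * s * (y0 ** b) * t
    if remaining <= 0:
        raise ValueError("t is at or past the finite-time singularity")
    return y0 * remaining ** (-1.0 / b)


def singularity_time(y0=1.0, s=0.02, b=0.5):
    """Time at which the closed-form solution diverges."""
    return 1.0 / (b * s * y0 ** b)


def doubling_times(y0=1.0, s=0.02, b=0.5, n=5):
    """Duration of each of the first n doublings of Y."""
    def time_to_reach(y):
        # Inverting the closed form: t(y) = (y0**-b - y**-b) / (b * s)
        return (y0 ** -b - y ** -b) / (b * s)

    levels = [y0 * 2 ** k for k in range(n + 1)]
    return [time_to_reach(hi) - time_to_reach(lo)
            for lo, hi in zip(levels, levels[1:])]


if __name__ == "__main__":
    for k, duration in enumerate(doubling_times(), start=1):
        print(f"doubling {k}: {duration:.2f} time units")
    print("singularity at t =", singularity_time())
```

Under this assumed form each doubling is shorter than the last by a constant factor of 2^(-b), which is the signature of hyperbolic rather than exponential growth.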

But even though the drivers seem superficially similar - they are both about technology, the claim is that one very specific technology will generate explosive growth, not that technology in general will - it seems strange that AGI would follow the same growth curve caused by reinvesting more GWP in improving ordinary technology which doesn't improve your own ability to think in the same way that AGI would.

As for precise timings, the great stagnation (last 30ish years) just seems like it would stretch out the timeline a bit, so we shouldn't take the 2050s seriously - as much as the last 70 years work on an exponential trend line there's really no way to make it fit overall as that post makes clear.

Improving preference learning approaches

When examining value learning approaches to AI Alignment, we run into two classes of problem. We want to understand how to elicit preferences, which is very difficult (even theoretically, with infinite computing power), and we want to know how to aggregate preferences stably and correctly, which is not just difficult but runs into complicated social choice and normative ethical issues.

Many research programs say the second of these questions is less important than the first, especially if we expect continuous takeoff with many chances to course-correct, and a low likelihood of an AI singleton with decisive strategic advantage. For many, building an AI that can reliably extract and pursue the preferences of one person is good enough.

Christiano calls this 'the narrow approach' and sees it as a way to sidestep many of the ethical issues, including those around social choice ethics. Approaches that try to resolve those issues directly would be the 'ambitious' approaches.

We want to build machines that helps us do the things we want to do, and to that end they need to be able to understand what we are trying to do and what instrumental values guide our behavior. To the extent that our “preferences” are underdetermined or inconsistent, we are happy if our systems at least do as well as a human, and make the kinds of improvements that humans would reliably consider improvements.
But it’s not clear that anything short of the maximally ambitious approach can solve the problem we ultimately care about.

I think that the ambitious approach is still worth investigating, because it may well eventually need to be solved, and also because it may well need to be addressed in a more limited form even on the narrow approach (one could imagine an AGI with a lot of autonomy having to trade-off the preferences of, say, a hundred different people). But even the 'narrow' approach raises difficult psychological issues about how to distinguish legitimate preferences from bias - questions of elicitation. In other words, the cognitive science issues around elicitation (distinguishing bias from legitimate preference) must be resolved for any kind of preference learning to work, and the social choice and ethical issues around preference aggregation need at least preliminary solutions for any alignment method that aims to apply to more than one person (even if final, provably correct solutions to aggregation are only needed if designing a singleton with decisive strategic advantage).

I believe that I've located two areas that are under- or unexplored, for improving the ability of reward modelling approaches to elicit human preferences and to aggregate human preferences. These are: using multiple information sources from a human (approval and actions) which diverge to help extract unbiased preferences, and using RL proxy agents in iterated voting to reach consensus preference aggregations, rather than some direct statistical method. Neither of these is a complete solution, of course, for reasons discussed e.g. here by Stuart Armstrong, but they could nonetheless help.

Improving preference elicitation: multiple information sources

Eliciting the unbiased preferences of an individual human is extremely difficult, for reasons given here.

The agent's actions can be explained by their beliefs and preferences[1], and by their biases: by this, we mean the way in which the action selector differs from an unboundedly rational expected preference maximiser.
The results of the Occam's razor paper imply that preferences (and beliefs, and biases) cannot be deduced separately from knowing the agent's policy (and hence, a fortiori, from any observations of the agent's behaviour).


To get around the impossibility result, we need "normative assumptions": assumptions about the preferences (or beliefs, or biases) of the agent that cannot be deduced fully from observations.
Under the optimistic scenario, we don't need many of these, at least for identifying human preferences. We can label a few examples ("the anchoring bias, as illustrated in this scenario, is a bias"; "people are at least weakly rational"; "humans often don't think about new courses of action they've never seen before", etc...). Call this labelled data[2] D.
The algorithm now constructs categories preferences*, beliefs*, and biases* - these are the generalisations that it has achieved from D

Yes, even on the 'optimistic scenario' we need external information of various kinds to 'debias'. However, this external information can come from a human interacting with the AI, in the form of human approval of trajectories or actions taken or proposed by an AI agent, on the assumption that since our stated and revealed preferences diverge, there will sometimes be differences in what we approve of and what we do that are due solely to differences in bias.

This is still technically external to observing the human's behaviour, but it is essentially a second input channel for information about human preferences and biases. This only works, of course, if humans tend to approve different things to the things that they actually do in a way influenced by bias (otherwise you have the same information as you'd get from actions, which helps with improving accuracy but not debiasing, see here), which is the case at least some of the time.

In other words, the beliefs and preferences are unchanged when the agent acts or approves but the 'approval selector' is different from the 'action selector' sometimes and, based on what does and does not change, you can try to infer what originated from legitimate beliefs and preferences and what originated from variation between the approval and action selector, which must be bias.

So, for example, if we conducted a principal component analysis on π, we would expect that the components would all be mixes of preferences/beliefs/biases.

So a PCA performed on the approval would produce a mix of beliefs, preferences and (different) biases. Underlying preferences are, by specification, equally represented either by human actions or by human approval of actions taken (since no matter what they are your preferences), but many biases don't exhibit this pattern - for example, we discount more over time in our revealed preferences than in our stated preferences. What we approve of typically represents a less (or at least differently) biased response than what we actually do.
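A toy sketch of the 'two channels' idea, using the time-discounting example. Everything here is hypothetical: the options, the reward numbers and the two discount factors are invented for illustration, and the only difference between the action channel and the approval channel is how steeply each discounts delay. Pairwise choices on which both channels agree are candidates for reflecting the shared underlying preferences, while choices on which they diverge are flagged as possibly bias-driven.

```python
from itertools import combinations

# Hypothetical options: name -> (undiscounted reward, delay in time steps).
OPTIONS = {
    "stale_snack_now": (1.0, 0),
    "snack_now": (4.0, 0),
    "meal_later": (6.0, 2),
    "feast_much_later": (9.0, 5),
}


def discounted_value(reward, delay, discount):
    """Value of a reward received after `delay` steps."""
    return reward * discount ** delay


def choose(a, b, discount):
    """Pick whichever of two options has the higher discounted value."""
    value_a = discounted_value(*OPTIONS[a], discount)
    value_b = discounted_value(*OPTIONS[b], discount)
    return a if value_a >= value_b else b


def compare_channels(action_discount=0.7, approval_discount=0.95):
    """Split pairwise choices into those where the channels agree or diverge.

    Assumption: actions discount the future more steeply than approval does.
    """
    agree, diverge = [], []
    for a, b in combinations(OPTIONS, 2):
        acted = choose(a, b, action_discount)
        approved = choose(a, b, approval_discount)
        bucket = agree if acted == approved else diverge
        bucket.append((a, b, acted, approved))
    return agree, diverge


if __name__ == "__main__":
    agree, diverge = compare_channels()
    print("channels agree on:", [(a, b) for a, b, _, _ in agree])
    print("channels diverge on:", [(a, b) for a, b, _, _ in diverge])
```

In this toy every divergence is due to bias by construction. With real humans, how much of the divergence is bias rather than a genuine difference in preference is exactly the empirical question.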

There has already been research on combining information on reward models from multiple sources to infer a better overall reward model, but not, as far as I know, specifically on actions and approval as differently biased sources of information.

CIRL ought to extract our revealed preferences (since it's based on behavioural policy) while a method like reinforcement learning from human preferences should extract our stated preferences - that might be a place to start, at least on validating that there actually are relevant differences caused by differently strong biases in our stated vs revealed preferences, and that the methods actually do end up with different policies.

The goal here would be to have some kind of 'dual channel' preference learner that extracts beliefs and preferences from biased actions and approval by examining what varies. I'm sure you'd still need labelling and explicit information about what counts as a bias, but there might need to be a lot less than with single information sources. How much less (how much extra information you get from such divergences) seems like an empirical question. A useful first step would be to find out how often divergences between stated and revealed preferences actually change the policies learned by agents that infer human preferences from actions as opposed to approval. Stuart Armstrong:

In the pessimistic scenario, human preferences, biases, and beliefs are twisted together in a far more complicated way, and cannot be separated by a few examples.
In contrast, take examples of racial bias, hindsight bias, illusion of control, or naive realism. These biases all seem to be quite different from the anchoring bias, and quite different from each other. At the very least, they seem to be of different "type signature".
So, under the pessimistic scenario, some biases are much closer to preferences than generic biases (and generic preferences) are to each other.

What I've suggested should still help at least somewhat in the pessimistic scenario - unless preferences/beliefs vary when you switch between looking at approval vs actions more than biases vary, you can still gain some information on underlying preferences and beliefs by seeing how approval and actions differ.

Of the difficult examples you gave, racial bias at least varies between actions and approval. Implementing different reward modelling algorithms and messing around with them to try and find ways to extract unbiased preferences from multiple information sources might be a useful research agenda.

There has already been research done on using multiple information sources to improve the accuracy of preference learning - Reward-rational implicit choice - but not specifically on using the divergences between different sources of information from the same agent to learn things about the agent's unbiased preferences.

Improving preference aggregation: iterated voting games

In part because of arguments like these, there has been less focus on the aggregation side of things than on the direct preference learning side.

Christiano says of methods like CEV, which aim to extrapolate what I ‘really want’ far beyond what my current preferences are: ‘most practitioners don’t think of this problem even as a long-term research goal — it’s a qualitatively different project without direct relevance to the kinds of problems they want to solve’. This is effectively a statement of the Well-definedness consideration when sorting through value definitions - our long-term ‘coherent’ or ‘true’ preferences currently aren’t well understood enough to guide research, so we need to restrict ourselves to more direct normativity - extracting the actual preferences of existing humans.

However, I think that it is important to get on the right track early - even if we never have cause to build a powerful singleton AI that has to aggregate all the preferences of humanity, there will still probably be smaller-scale situations where the preferences of several people need to be aggregated or traded-off. Shifting a human preference learner from a single to a small group of human preferences could produce erroneous results due to distributional shift, potentially causing alignment failures, so even if we aren't trying for maximally ambitious value learning it might still be worth investigating preference aggregation.

There has been some research done on preference aggregation for AIs learning human values, specifically in the context of Kidney exchanges:

We performed statistical modeling of participants’ pairwise comparisons between patient profiles in order to obtain weights for each profile. We used the Bradley-Terry model, which treats each pairwise comparison as a contest between a pair of players
We have shown one way in which moral judgments can be elicited from human subjects, how those judgments can be statistically modelled, and how the results can be incorporated into the algorithm. We have also shown, through simulations, what the likely effects of deploying such a prioritization system would be, namely that underdemanded pairs would be significantly impacted but little would change for others. We do not make any judgment about whether this conclusion speaks in favor of or against such prioritization, but expect the conclusion to be robust to changes in the prioritization such as those that would result from a more thorough process, as described in the previous paragraph.

The Kidney exchange paper elicited preferences from human subjects (using repeated pairwise comparisons) and then aggregated them using the Bradley-Terry model. You couldn't use such a simple statistical method to aggregate quantitative preferences over continuous action spaces, like the preferences that would be learned from a human via a complex reward model. Also, any time you try to use some specific one-shot voting mechanism you run into various impossibility theorems which seem to force you to give up some desirable property.
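For readers unfamiliar with it, here is a minimal sketch of fitting a Bradley-Terry model to pairwise comparison counts. It uses the standard iterative (minorisation-maximisation) update and is not the paper's code. The comparison counts are made up, and the names `fit_bradley_terry` and `win_probability` are hypothetical.

```python
def fit_bradley_terry(wins, iterations=200):
    """Estimate Bradley-Terry strengths from a matrix of pairwise wins.

    wins[i][j] is the number of times item i was preferred over item j.
    Returns strengths normalised to sum to 1.
    """
    n = len(wins)
    strengths = [1.0 / n] * n
    for _ in range(iterations):
        updated = []
        for i in range(n):
            total_wins = sum(wins[i])
            # Comparisons involving i, weighted by the current strengths.
            denom = sum(
                (wins[i][j] + wins[j][i]) / (strengths[i] + strengths[j])
                for j in range(n)
                if j != i
            )
            updated.append(total_wins / denom if denom > 0 else strengths[i])
        norm = sum(updated)
        strengths = [s / norm for s in updated]
    return strengths


def win_probability(strengths, i, j):
    """Model probability that item i is preferred over item j."""
    return strengths[i] / (strengths[i] + strengths[j])


if __name__ == "__main__":
    # Made-up counts for three hypothetical patient profiles.
    wins = [
        [0, 8, 9],
        [2, 0, 7],
        [1, 3, 0],
    ]
    strengths = fit_bradley_terry(wins)
    print("strengths:", [round(s, 3) for s in strengths])
    print("P(profile 0 preferred to 1):", round(win_probability(strengths, 0, 1), 3))
```

The output is a single weight per profile, which is why this kind of model suits discrete pairwise judgments but does not obviously extend to preferences over continuous action spaces.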

One approach that may be more robust against errors in a voting mechanism, and easily scalable to more complex preference profiles is to use RL not just for the preference elicitation, but also for the preference aggregation. The idea is that we embrace the inevitable impossibility results (such as Arrow and GS theorems) and consider agents' ability to vote strategically as an opportunity to reach stable outcomes. 

This paper uses very simple Q-learning agents with a few different policies - epsilon-greedy, greedy and upper confidence bound, in an iterated voting game, and gets behaviour that seems sensible. (Note the similarity and differences with the moral parliament, where a particular one-shot voting rule is justified a priori and then used.)
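As a rough illustration of what an iterated voting game with learning agents could look like, here is a minimal sketch. It is not the paper's implementation. It assumes stateless epsilon-greedy Q-learners, a plurality rule with ties broken in favour of the lowest-numbered candidate, and a reward equal to each agent's utility for the round's winner. The utilities and all parameter values are hypothetical.

```python
import random
from collections import Counter


def plurality_winner(ballots, n_candidates):
    """Candidate with the most votes; ties go to the lowest index."""
    counts = Counter(ballots)
    return max(range(n_candidates), key=lambda c: (counts[c], -c))


def run_iterated_vote(utilities, rounds=500, epsilon=0.1, alpha=0.1, seed=0):
    """Repeated plurality election among epsilon-greedy Q-learning voters.

    utilities[a][c] is agent a's utility if candidate c wins.
    Returns the list of winners per round and the final Q-tables.
    """
    rng = random.Random(seed)
    n_candidates = len(utilities[0])
    q_tables = [[0.0] * n_candidates for _ in utilities]
    history = []
    for _ in range(rounds):
        ballots = []
        for q in q_tables:
            if rng.random() < epsilon:
                # Explore: cast a random ballot.
                ballots.append(rng.randrange(n_candidates))
            else:
                # Exploit: cast the ballot with the highest learned value.
                best = max(q)
                ballots.append(rng.choice([c for c, v in enumerate(q) if v == best]))
        winner = plurality_winner(ballots, n_candidates)
        for agent, ballot in enumerate(ballots):
            reward = utilities[agent][winner]
            q_tables[agent][ballot] += alpha * (reward - q_tables[agent][ballot])
        history.append(winner)
    return history, q_tables


if __name__ == "__main__":
    # Hypothetical utilities for five voters over three candidates.
    utilities = [
        [1.0, 0.6, 0.0],
        [1.0, 0.5, 0.0],
        [0.0, 0.7, 1.0],
        [0.0, 0.6, 1.0],
        [0.2, 1.0, 0.3],
    ]
    history, _ = run_iterated_vote(utilities)
    print("winners over the last 50 rounds:", Counter(history[-50:]))
```

Because each agent is rewarded for the outcome rather than for voting sincerely, any strategic voting that emerges is learned rather than designed in, which is the feature the approach tries to exploit.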

The fact that this paper exists is a good sign because it's very recent and the methods it uses are very simple - it's pretty much just a proof of concept, as the authors state - so that tells me there's a lot of room for combining more sophisticated RL with better voting methods.

Combining elicitation and aggregation

Having elicited preferences from each individual human (using methods like those above to 'debias'), we obtain a proxy agent representing each individual's preferences. Then these agents can be placed into an iterated voting situation until a convergent answer is reached.

That seems like the closest practical approximation to a CEV of a group of people that could be constructed with anything close to current methods - a pipeline from observed behaviour and elicited approval to a final aggregated decision about what to do based on overall preferences. Since it's a value learning framework that's extendible over any size group, which is somewhat indirect, you might call it a Coherent Extrapolated Framework (CEF) as I suggested last year.

So to sum up, a very high-level summary of the steps in this method of preference elicitation and aggregation would be:

    1. With a mixture of normative assumptions and multi-channel information (approval and actions) as inputs, use a reward-modelling method to elicit the debiased preferences of many individuals.
      1. Determining whether there actually are significant differences between stated and revealed preferences when performing reward modelling is the first step to using multi-channel information to effectively separate biases from preferences.
    2. Create 'proxy agents' using the reward model developed for each human (this step is where intent-aligned amplification can potentially occur).
    3. Place the proxies in an iterated voting situation which tends to produce sensible convergent results. The use of RL proxies here can be compared to the use of human proxies in liquid democracy.
      1. Which voting mechanisms tend to work in iterated situations with RL agents can be determined in other experiments (probably with purely artificial agents)
    4. Run the voting mechanism until an unambiguous winner is decided, using methods like those given in this paper.

This seems like a reasonable procedure for extending a method that is aligned to one human's preferences (steps 1 and 2) to produce sensible results when trying to align to an aggregate of human preferences (steps 3 and 4). It reduces reliance on the specific features of one voting method. Other than the insight that multiple channels of information might help, all the standard unsolved problems with preference learning from one human remain.

Even though we can't yet align an AGI to one human's preferences, trying to think about how to aggregate human preferences in a way that is scalable isn't premature, as has sometimes been claimed.

In many 'non-ambitious' hypothetical settings where we aren't trying to build an AGI sovereign over the whole world (for example, designing a powerful AI to govern the operations of a hospital), we still need to be able to aggregate preferences sensibly and stably. This method would do well at such intermediate scales, as it doesn't approach the question of preference aggregation from a 'final' ambitious value-learning perspective but instead tries to look at preference aggregation the same way we look at elicitation, with an RL-based iterative approach to reaching a result.

However, if you did want to use such a method to try and produce the fabled 'final utility function of all humanity', it might not give you Humanity's CEV, since some normative assumptions (preferences count equally and in the way given by the voting mechanism), are built in. By analogy with CEV, I called the idealized result of this method a coherent extrapolated framework (CEF). This is a more normatively direct method of aggregating values than CEV, (since you fix a particular method of aggregating preferences in advance), as it extrapolates from a voting framework rather than extrapolating based on our volition, more broadly (and vaguely) defined, hence CEF.

Update to 'Modelling Continuous Progress'

I made an attempt to model intelligence explosion dynamics in this post, by attempting to make the very oversimplified exponential-returns-to-exponentially-increasing-intelligence model used by Bostrom and Yudkowsky slightly less oversimplified.

This post tries to build on a simplified mathematical model of takeoff which was first put forward by Eliezer Yudkowsky and then refined by Bostrom in Superintelligence, modifying it to account for the different assumptions behind continuous, fast progress as opposed to discontinuous progress. As far as I can tell, few people have touched these sorts of simple models since the early 2010’s, and no-one has tried to formalize how newer notions of continuous takeoff fit into them. I find that it is surprisingly easy to accommodate continuous progress and that the results are intuitive and fit with what has already been said qualitatively about continuous progress.

The page includes python code for the model.

This post doesn't capture all the views of takeoff - in particular it doesn't capture the non-hyperbolic faster growth mode scenario, where marginal intelligence improvements are exponentially increasingly difficult, and therefore we get a (continuous or discontinuous switch to a) new exponential growth mode rather than runaway hyperbolic growth.

But I think that by modifying the f(I) function that determines how RSI capability varies with intelligence we can incorporate such views.

In the context of the exponential model given in the post that would correspond to an f(I) function where

which would result in a continuous (determined by size of d) switch to a single faster exponential growth mode

But I think the model still roughly captures the intuition behind scenarios that involve either a continuous or a discontinuous step to an intelligence explosion.

Given the model assumptions, we see how the different scenarios look in practice:

If we plot potential AI capability over time, we can see how no new growth mode (brown) vs a new growth mode (all the rest), the presence of an intelligence explosion (red and orange) vs not (green and purple), and the presence of a discontinuity (red and purple) vs not (orange and green) affect the takeoff trajectory.
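For readers who want to experiment, here is a minimal sketch of this kind of model. The exact equations are in the linked post. The version below assumes a growth law of the form dI/dt = c*I + f(I)*I^2 with a sigmoid f(I) = 1/(1 + exp(-d*(I - I_AGI))), and those forms, the parameter values and the function names are assumptions made for illustration. The parameter d controls how abruptly recursive self-improvement switches on around I_AGI.

```python
import math


def rsi_strength(intelligence, i_agi=3.0, d=1.0):
    """Assumed sigmoid f(I): how strongly intelligence feeds back on itself."""
    z = d * (intelligence - i_agi)
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def simulate_takeoff(d, c=0.1, i0=1.0, i_agi=3.0, dt=0.001, t_max=30.0, cap=1e6):
    """Forward-Euler integration of dI/dt = c*I + f(I)*I**2.

    Stops at t_max or once capability exceeds `cap`, whichever comes first.
    Returns a list of (time, capability) pairs.
    """
    t, capability = 0.0, i0
    trajectory = [(t, capability)]
    while t < t_max and capability < cap:
        feedback = rsi_strength(capability, i_agi, d)
        capability += dt * (c * capability + feedback * capability ** 2)
        t += dt
        trajectory.append((t, capability))
    return trajectory


if __name__ == "__main__":
    for d in (1.0, 20.0):
        final_time, final_capability = simulate_takeoff(d)[-1]
        print(f"d={d}: stopped at t={final_time:.2f}, capability={final_capability:.3g}")
```

In this sketch a small d makes the feedback term ramp up gradually, while a large d keeps growth close to exponential until capability nears I_AGI and then switches the feedback on almost at once.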

Today, Anthropic, Google, Microsoft and OpenAI are announcing the formation of the Frontier Model Forum, a new industry body focused on ensuring safe and responsible development of frontier AI models. The Frontier Model Forum will draw on the technical and operational expertise of its member companies to benefit the entire AI ecosystem, such as through advancing technical evaluations and benchmarks, and developing a public library of solutions to support industry best practices and standards.

The core objectives for the Forum are:

  1. Advancing AI safety research to promote responsible development of frontier models, minimize risks, and enable independent, standardized evaluations of capabilities and safety.
  2. Identifying best practices for the responsible development and deployment of frontier models, helping the public understand the nature, capabilities, limitations, and impact of the technology.
  3. Collaborating with policymakers, academics, civil society and companies to share knowledge about trust and safety risks.
  4. Supporting efforts to develop applications that can help meet society’s greatest challenges, such as climate change mitigation and adaptation, early cancer detection and prevention, and combating cyber threats.

This seems overall very good at first glance, and then seems much better once I realized that Meta is not on the list. There's nothing here that I'd call substantial capabilities acceleration (i.e. attempts to collaborate on building larger and larger foundation models, though some of this could be construed as making foundation models more useful for specific tasks). Sharing safety-capabilities research like better oversight or CAI techniques is plausibly strongly net positive even if the techniques don't scale indefinitely. By the same logic, while this by itself is nowhere near sufficient to get us AI existential safety if alignment is very hard (and could increase complacency), it's still a big step in the right direction.

adversarial robustness, mechanistic interpretability, scalable oversight, independent research access, emergent behaviors and anomaly detection. There will be a strong focus initially on developing and sharing a public library of technical evaluations and benchmarks for frontier AI models.

The mention of combating cyber threats is also a step towards explicit pTAI

BUT, crucially, because Meta is frozen out we can know both that this partnership isn't toothless and that it represents a commitment to not do the most risky and antisocial things, which Meta presumably doesn't want to give up. The fact that they're the only major AI company in the US not to join will be horrible PR for them as well.