Sammy Martin

Sammy Martin. Philosophy and Physics BSc, AI MSc at Edinburgh, starting a PhD at King's College London. Interested in ethics, general philosophy and AI Safety.


Some AI research areas and their relevance to existential safety

That said, I remain interested in more clarity on what you see as the biggest risks with these multi/multi approaches that could be addressed with technical research.

A (though not necessarily the most important) reason to think technical research into computational social choice might be useful is that examining specifically the behaviour of RL agents from a computational social choice perspective might alert us to ways in which coordination with future TAI might be similar or different to the existing coordination problems we face.

(i) make direct improvements in the relevant institutions, in a way that anticipates the changes brought about by AI but will most likely not look like AI research, 

It seems premature to say, in advance of actually seeing what such research uncovers, whether the relevant mechanisms and governance improvements are exactly the same as the improvements we need for good governance generally, or different. Suppose examining the behaviour of current RL agents in social dilemmas leads to a general result which in turn leads us to conclude there's a disproportionate chance TAI in the future will coordinate in some damaging way that we can resolve with a particular new regulation. It's always possible to say, solving the single/single alignment problem will prevent anything like that from happening in the first place, but why put all your hopes on plan A, when plan B is relatively neglected?

Some AI research areas and their relevance to existential safety

Thanks for this long and very detailed post!

The MARL projects with the greatest potential to help are probably those that find ways to achieve cooperation between decentrally trained agents in a competitive task environment, because of its potential to minimize destructive conflicts between fleets of AI systems that cause collateral damage to humanity.  That said, even this area of research risks making it easier for fleets of machines to cooperate and/or collude at the exclusion of humans, increasing the risk of humans becoming gradually disenfranchised and perhaps replaced entirely by machines that are better and faster at cooperation than humans.

In ARCHES, you mention that just examining the multiagent behaviour of RL systems (or other systems that work as toy/small-scale examples of what future transformative AI might look like) might enable us to get ahead of potential multiagent risks, or at least try to predict how transformative AI might behave in multiagent settings. The way you describe it in ARCHES, the research would be purely exploratory,

One approach to this research area is to continually ex-amine social dilemmas through the lens of whatever is the leading AI devel-opment paradigm in a given year or decade, and attempt to classify interest-ing behaviors as they emerge. This approach might be viewed as analogousto developing “transparency for multi-agent systems”: first develop inter-esting multi-agent systems, and then try to understand them. 

But what you're suggesting in this post, 'those that find ways to achieve cooperation between decentrally trained agents in a competitive task environment', sounds like combining computational social choice research with multiagent RL -  examining the behaviour of RL agents in social dilemmas and trying to design mechanisms that work to produce the kind of behaviour we want. To do that, you'd need insights from social choice theory. There is some existing research on this, but it's sparse and very exploratory.

My current research is attempting to build on the second of these.

As far as I can tell, that's more or less it in terms of examining RL agents in social dilemmas, so there may well be a lot of low-hanging fruit and interesting discoveries to be made. If the research is specifically about finding ways of achieving cooperation in multiagent systems by choosing the correct (e.g. voting) mechanism, is that not also computational social choice research, and therefore of higher priority by your metric?

In short, computational social choice research will be necessary to legitimize and fulfill governance demands for technology companies (automated and human-run companies alike) to ensure AI technologies are beneficial to and controllable by human society.  


CSC neglect:

As mentioned above, I think CSC is still far from ready to fulfill governance demands at the ever-increasing speed and scale that will be needed to ensure existential safety in the wake of “the alignment revolution”. 

AGI safety from first principles: Goals and Agency

Furthermore, we should take seriously the possibility that superintelligent AGIs might be even less focused than humans are on achieving large-scale goals. We can imagine them possessing final goals which don’t incentivise the pursuit of power, such as deontological goals, or small-scale goals. 


My underlying argument is that agency is not just an emergent property of highly intelligent systems, but rather a set of capabilities which need to be developed during training, and which won’t arise without selection for it

Was this line of argument inspired by Ben Garfinkel's objection to the 'classic' formulation of instrumental convergence/orthogonality - that these are 'measure based' arguments that just identify that a majority of possible agents with some agentive properties and large-scale goals will optimize in malign ways, rather than establishing that we're actually likely to build such agents?

It seems like you're identifying the same additional step that Ben identified, and that I argued could be satisfied - that we need a plausible reason why we would build an agentive AI with large-scale goals.

And the same applies for 'instrumental convergence' - the observation that most possible goals, especially simple goals, imply a tendency to produce extreme outcomes when ruthlessly maximised:

  • A system that is optimizing a function of n variables, where the objective depends on a subset of size k<n, will often set the remaining unconstrained variables to extreme values; if one of those unconstrained variables is actually something we care about, the solution found may be highly undesirable.  

We could see this as marking out a potential danger - a large number of possible mind-designs produce very bad outcomes if implemented. The fact that such designs exist 'weakly suggest' (Ben's words) that AGI poses an existential risk since we might build them. If we add in other premises that imply we are likely to (accidentally or deliberately) build such systems, the argument becomes stronger. But usually the classic arguments simply note instrumental convergence and assume we're 'shooting into the dark' in the space of all possible minds, because they take the abstract statement about possible minds to be speaking directly about the physical world. There are specific reasons to think this might occur (e.g. mesa-optimisation, sufficiently fast progress preventing us from course-correcting if there is even a small initial divergence) but those are the reasons that combine with instrumental convergence to produce a concrete risk, and have to be argued for separately.

Security Mindset and Takeoff Speeds

In terms of inferences about deceptive alignment, it might be useful to go back to the one and only current example we have where someone with somewhat relevant knowledge was led to wonder whether deception had taken place - GPT-3 balancing brackets. I don't know if anyone ever got Eliezer's $1000 bounty, but the top-level comment on that thread at least convinces me that it's unlikely that GPT-3 via AI Dungeon was being deceptive even though Eliezer thought there was a real possibility that it was.

Now, this doesn't prove all that much, but one thing it does suggest is that on current MIRI-like views about how likely deception is, the threshold for uncertainty about deception is set far too low. That suggests your people at OpenSoft might well be right in their assumption.

Forecasting Thread: AI Timelines

The 'progress will be continuous' argument, to apply to our near future, does depend on my other assumptions - mainly that the breakthroughs on that list are separable, so agentive behaviour and long-term planning won't drop out of a larger GPT by themselves and can't be considered part of just 'improving up language model accuracy'.

We currently have partial progress on human-level language comprehension, a bit on cumulative learning, but near zero on managing mental activity for long term planning, so if we were to suddenly reach human level on long-term planning in the next 5 years, that would probably involve a discontinuity, which I don't think is very likely for the reasons given here.

If language models scale to near-human performance but the other milestones don't fall in the process, and my initial claim is right, that gives us very transformative AI but not AGI. I think that the situation would look something like this:

If GPT-N reaches par-human:

discovering new action sets
managing its own mental activity
(?) cumulative learning
human-like language comprehension
perception and object recognition
efficient search over known facts

So there would be 2 (maybe 3?) breakthroughs remaining. It seems like you think just scaling up a GPT will also resolve those other milestones, rather than just giving us human-like language comprehension. Whereas if I'm right and also those curves do extrapolate, what we would get at the end would be an excellent text generator, but it wouldn't be an agent, wouldn't be capable of long-term planning and couldn't be accurately described as having a utility function over the states of the external world, and I don't see any reason why trivial extensions of GPT would be able to do that either since those seem like problems that are just as hard as human-like language comprehension. GPT seems like it's also making some progress on cumulative learning, though it might need some RL-based help with that, but none at all on managing mental activity for longterm planning or discovering new action sets.

As an additional argument, admittedly from authority - Stuart Russell also clearly sees human-like language comprehension as only one of several really hard and independent problems that need to be solved.

A humanlike GPT-N would certainly be a huge leap into a realm of AI we don't know much about, so we could be surprised and discover that agentive behaviour and having a utility function over states of the external world spontaneously appears in a good enough language model, but that argument has to be made, and you need that argument to hold and GPT to keep scaling for us to reach AGI in the next five years, and I don't see the conjunction of those two as that likely - it seems as though your argument rests solely on whether GPT scales or not, when there's also this other conceptual premise that's much harder to justify.

I'm also not sure if I've seen anyone make the argument that GPT-N will also give us these specific breakthroughs - but if you have reasons that GPT scaling would solve all the remaining barriers to AGI, I'd be interested to hear it. Note that this isn't the same as just pointing out how impressive the results scaling up GPT could be - Gwern's piece here, for example, seems to be arguing for a scenario more like what I've envisaged, where GPT-N ends up a key piece of some future AGI but just provides some of the background 'world model':

Models like GPT-3 suggest that large unsupervised models will be vital components of future DL systems, as they can be ‘plugged into’ systems to immediately provide understanding of the world, humans, natural language, and reasoning.

If GPT does scale, and we get human-like language comprehension in 2025, that will mean we're moving up that list much faster, and in turn suggests that there might not be a large number of additional discoveries required to make the other breakthroughs, which in turn suggests they might also occur within the Deep Learning paradigm, and relatively soon. I think that if this happens, there's a reasonable chance that when we do build an AGI a big part of its internals looks like a GPT, as gwern suggested, but by then we're already long past simply scaling up existing systems.

Alternatively, perhaps you're not including agentive behaviour in your definition of AGI - a par-human text generator for most tasks that isn't capable of discovering new action sets or managing its mental activity is, I think a 'mere' transformative AI and not a genuine AGI.

SDM's Shortform

So to sum up, a very high-level summary of the steps in this method of preference elicitation and aggregation would be:

    1. With a mixture of normative assumptions and multi-channel information (approval and actions) as inputs, use a reward-modelling method to elicit the debiased preferences of many individuals.
      1. Determining whether there actually are significant differences between stated and revealed preferences when performing reward modelling is the first step to using multi-channel information to effectively separate biases from preferences.
    2. Create 'proxy agents' using the reward model developed for each human (this step is where intent-aligned amplification can potentially occur).
    3. Place the proxies in an iterated voting situation which tends to produce sensible convergent results. The use of RL proxies here can be compared to the use of human proxies in liquid democracy.
      1. Which voting mechanisms tend to work in iterated situations with RL agents can be determined in other experiments (probably with purely artificial agents)
    4. Run the voting mechanism until an unambiguous winner is decided, using methods like those given in this paper.

This seems like a reasonable procedure for extending a method that is aligned to one human's preferences (step 1,2) to produce sensible results when trying to align to an aggregate of human preferences (step 3,4). It reduces reliance on the specific features of one voting method, Other than the insight that multiple channels of information might help, all the standard unsolved problems with preference learning from one human remain.

Even though we can't yet align an AGI to one human's preferences, trying to think about how to aggregate human preferences in a way that is scalable isn't premature, as has sometimes been claimed.

In many 'non-ambitious' hypothetical settings where we aren't trying to build an AGI sovereign over the whole world (for example, designing a powerful AI to govern the operations of a hospital), we still need to be able to aggregate preferences sensibly and stably. This method would do well at such intermediate scales, as it doesn't approach the question of preference aggregation from a 'final' ambitious value-learning perspective but instead tries to look at preference aggregation the same way we look at elicitation, with an RL-based iterative approach to reaching a result.

However, if you did want to use such a method to try and produce the fabled 'final utility function of all humanity', it might not give you Humanity's CEV, since some normative assumptions (preferences count equally and in the way given by the voting mechanism), are built in. By analogy with CEV, I called the idealized result of this method a coherent extrapolated framework (CEF). This is a more normatively direct method of aggregating values than CEV, (since you fix a particular method of aggregating preferences in advance), as it extrapolates from a voting framework rather than extrapolating based on our volition, more broadly (and vaguely) defined, hence CEF.

Forecasting Thread: AI Timelines

Here's my answer. I'm pretty uncertain compared to some of the others!

AI Forecast

First, I'm assuming that by AGI we mean an agent-like entity that can do the things associated with general intelligence, including things like planning towards a goal and carrying that out. If we end up in a CAIS-like world where there is some AI service or other that can do most economically useful tasks, but nothing with very broad competence, I count that as never developing AGI.

I've been impressed with GPT-3, and could imagine it or something like it scaling to produce near-human level responses to language prompts in a few years, especially with RL-based extensions.

But, following the list (below) of missing capabilities by Stuart Russell, I still think things like long-term planning would elude GPT-N, so it wouldn't be agentive general intelligence. Even though you might get those behaviours with trivial extensions of GPT-N, I don't think it's very likely.

That's why I think AGI before 2025 is very unlikely (not enough time for anything except scaling up of existing methods). This is also because I tend to expect progress to be continuous, though potentially quite fast, and going from current AI to AGI in less than 5 years requires a very sharp discontinuity.

AGI before 2035 or so happens if systems quite a lot like current deep learning can do the job, but which aren't just trivial extensions of them - this seems reasonable to me on the inside view - e.g. it takes us less than 15 years to take GPT-N and add layers on top of it that handle things like planning and discovering new actions. This is probably my 'inside view' answer.

I put a lot of weight on a tail peaking around 2050 because of how quickly we've advanced up this 'list of breakthroughs needed for general intelligence' -

There is this list of remaining capabilities needed for AGI in an older post I wrote, with the capabilities of 'GPT-6' as I see them underlined:

Stuart Russell’s List

human-like language comprehension

cumulative learning

discovering new action sets

managing its own mental activity

For reference, I’ve included two capabilities we already have that I imagine being on a similar list in 1960

perception and object recognition

efficient search over known facts

So we'd have discovering new action sets, and managing mental activity - effectively, the things that facilitate long-range complex planning, remaining.

So (very oversimplified) if around the 1980s we had efficient search algorithms, by 2015 we had image recognition (basic perception) and by 2025 we have language comprehension courtesy of GPT-8, that leaves cumulative learning (which could be obtained by advanced RL?), then discovering new action sets and managing mental activity (no idea). It feels a bit odd that we'd breeze past all the remaining milestones in one decade after it took ~6 to get to where we are now. Say progress has sped up to be twice as fast, then it's 3 more decades to go. Add to this the economic evidence from things like Modelling the Human Trajectory, which suggests a roughly similar time period of around 2050.

Finally, I think it's unlikely but not impossible that we never build AGI and instead go for tool AI or CAIS, most likely because we've misunderstood the incentives such that it isn't actually economical or agentive behaviour doesn't arise easily. Then there's the small (few percent) chance of catastrophic or existential disaster which wrecks our ability to invent things. This is the one I'm most unsure about - I put 15% for both but it may well be higher.

SDM's Shortform

I don't think that excuse works in this case - I didn't give it a 'long-winded frame', just that brief sentence at the start, and then the list of scenarios, and even though I reran it a couple of times on each to check, the 'cranberry/grape juice kills you' outcome never arose.

So, perhaps they switched directly from no prompt to an incredibly long-winded and specific prompt without checking what was actually necessary for a good answer? I'll point out didn't really attempt any sophisticated prompt programming either - that was literally the first sentence I thought of!

SDM's Shortform

Gary Marcus, noted sceptic of Deep Learning, wrote an article with Ernest Davis:

GPT-3, Bloviator: OpenAI’s language has no idea what it’s talking about

The article purports to give six examples of GPT-3's failure - Biological, Physical, Social, Object and Psychological reasoning and 'non sequiturs'. Leaving aside that GPT-3 works on Gary's earlier GPT-2 failure examples, and that it seems as though he specifically searched out weak points by testing GPT-3 on many more examples than were given, something a bit odd is going on with the results they gave. I got better results when running his prompts on AI Dungeon.

With no reruns, randomness = 0.5, I gave Gary's questions (all six gave answers considered 'failures' by Gary) to GPT-3 via AI Dungeon with a short scene-setting prompt, and got good answers to 4 of them, and reasonable vague answers to the other 2:

This is a series of scenarios describing a human taking actions in the world, designed to test physical and common-sense reasoning.
1) You poured yourself a glass of cranberry juice, but then you absentmindedly poured about a teaspoon of grape juice into it. It looks okay. You try sniffing it, but you have a bad cold, so you can’t smell anything. You are very thirsty. So you take another drink.
2) You are having a small dinner party. You want to serve dinner in the living room. The dining room table is wider than the doorway, so to get it into the living room, you will have to  move furniture. This means that some people will be inconvenienced.
3) You are a defense lawyer and you have to go to court today. Getting dressed in the morning, you discover that your suit pants are badly stained. However, your bathing suit is clean and very stylish. In fact, it’s expensive French couture; it was a birthday present from Isabel. You decide that you should wear it because you won't look professional in your stained pants, but you are worried that the judge will think you aren't taking the case seriously if you are wearing a bathing suit.
4) Yesterday I dropped my clothes off at the dry cleaner’s and I have yet to pick them up. Where are my clothes?
5) Janet and Penny went to the store to get presents for Jack. Janet said, “I will buy Jack a top.” “Don’t get Jack a top,” says Penny. “He has a top. He will prefer a bottom."
6) At the party, I poured myself a glass of lemonade, but it turned out to be too sour, so I added a little sugar. I didn’t see a spoon handy, so I stirred it with a cigarette. But that turned out to be a bad idea because it was a menthol, and it ruined the taste. So I added a little more sugar to counteract the menthol, and then I noticed that my cigarette had fallen into the glass and was floating in the lemonade.

For 1), Gary's example ended with 'you are now dead' - for 1), I got a reasonable, if short continuation - success.

2) - the answer is vague enough to be a technically correct solution, 'move furniture' = tilt the table, but since we're being strict I'll count it as a failure. Gary's example was a convoluted attempt to saw the door in half, clearly mistaken.

3) is very obviously intended to trick the AI into endorsing the bathing suit answer, in fact it feels like a classic priming trick that might trip up a human! But in my version GPT-3 rebels against the attempt and notices the incongruence of wearing a bathing suit to court, so it counts as a success. Gary's example didn't include the worry that a bathing suit was inappropriate - arguably not a failure, but nevermind, let's move on.

4) is actually a complete prompt by itself, so the AI didn't do anything - GPT-3 doesn't care about answering questions, just continuing text with high probability. Gary's answer was 'I have a lot of clothes', and no doubt he'd call both 'evasion', so to be strict we'll agree with him and count that as failure.

5) Trousers are called 'bottoms' so that's right. Gary would call it wrong since 'the intended continuation' was “He will make you take it back", but that's absurdly unfair, that's not the only answer a human being might give, so I have to say it's correct. Gary's example ' lost track of the fact that Penny is advising Janet against getting a top', which didn't happen here, so that's acceptable.

Lastly, 6) is a slightly bizarre but logical continuation of an intentionally weird prompt - so correct. It also demonstrates correct physical reasoning - stirring a drink with a cigarette won't be good for the taste. Gary's answer wandered off-topic and started talking about cremation.

So, 4/6 correct on an intentionally deceptive and adversarial set of prompts, and that's on a fairly strict definition of correct. 2) and 4) are arguably not wrong, even if evasive and vague. More to the point, this was on an inferior version of GPT-3 to the one Gary used, the Dragon model from AI Dungeon!

I'm not sure what's going on here - is it the initial prompt saying it was 'testing physical and common sense reasoning'? Was that all it took?

Learning human preferences: optimistic and pessimistic scenarios

Glad you think so! I think that methods like using multiple information sources might be a useful way to reduce the number of (potentially mistaken) normative assumptions you need in order to model a single human's preferences.

The other area of human preference learning where you seem, inevitably, to need a lot of strong normative assumptions is in preference aggregation. If we assume we have elicited the preferences of lots of individual humans, and we're then trying to aggregate their preferences (with each human's preference represented by a separate model) I think the same basic principle applies, that you can reduce the normative assumptions you need by using a more complicated voting mechanism, in this case one that considers agents' ability to vote strategically as an opportunity to reach stable outcomes. 

I talk about this idea here. As with using approval/actions to improve the elicitation of an individual's preferences, you can't avoid making any normative assumptions by using a more complicated aggregation method, but perhaps you end up having to make fewer of them. Very speculatively, if you can combine a robust method of eliciting preferences with few inbuilt assumptions with a similarly robust method of aggregating preferences, you're on your way to a full solution to ambitious value learning.

Load More