Sammy Martin

Sammy Martin. Philosophy and Physics BSc, AI MSc at Edinburgh, starting a PhD at King's College London. Interested in ethics, general philosophy and AI Safety.


Review of Soft Takeoff Can Still Lead to DSA

I agree with your argument about likelihood of DSA being higher compared to previous accelerations, due to society not being able to speed up as fast as the technology. This is sorta what I had in mind with my original argument for DSA; I was thinking that leaks/spying/etc. would not speed up nearly as fast as the relevant AI tech speeds up.

Your post on 'against GDP as a metric' argues more forcefully for the same thing that I was arguing for, that 

'the economic doubling time' stops being so meaningful - technological progress speeds up abruptly but other kinds of progress that adapt to tech progress have more of a lag before the increased technological progress also affects them? 

So we're on the same page there that it's not likely that 'the economic doubling time' captures everything that's going on all that well, which leads to another problem - how do we predict what level of capability is necessary for a transformative AI to obtain a DSA (or reach the PONR for a DSA)?

I notice that in your post you don't propose an alternative metric to GDP, which is fair enough, since most of your arguments seem to lead to the conclusion that it's almost impossibly difficult to predict in advance what level of advantage over the rest of the world, and in which areas, is actually needed to conquer the world. After all, we seem to be able to analogize both persuasion tools and conquistador-analogues, who had relatively small tech advantages, to the AGI situation.

I think that there is still a useful role for raw economic power measurements, in that they provide a sort of upper bound on how much capability difference is needed to conquer the world. If an AGI acquires resources equivalent to controlling >50% of the world's entire GDP, it can probably take over the world if it goes for the maximally brute force approach of just using direct military force. Presumably the PONR for that situation would be a while before then, but at least we know that an advantage of a certain size would be big enough given no assumptions about the effectiveness of unproven technologies of persuasion or manipulation or specific vulnerabilities in human civilization.

So we can use our estimate of how the economic doubling time may change, anchor on that gap and estimate down based on how soon we think the PONR is, or how many 'cheat' pathways that don't involve economic growth there are.

The whole idea of using brute economic advantage as an upper limit 'anchor' I got from Ajeya's post about using biological anchors to forecast what's required for TAI - if we could find a reasonable lower bound for the amount of advantage needed to attain DSA, we could estimate a distribution between the two bounds in the same way. We would just need a lower limit - maybe there's a way of estimating it based on the upper limit of human ability, since we know no actually existing human has used persuasion to take over the world but, as you point out, they've come relatively close.

I realize that's not a great method, but is there any better alternative given that this is a situation we've never encountered before, for trying to predict what level of capability is necessary for DSA? Or perhaps you just think that anchoring your prior estimate based on economic power advantage as an upper bound is so misleading it's worse than having a completely ignorant prior. In that case, we might have to say that there are just so many unprecedented ways that a transformative AI could obtain a DSA that we can just have no idea in advance what capability is needed, which doesn't feel quite right to me.

Review of Soft Takeoff Can Still Lead to DSA

Currently the most plausible doom scenario in my mind is maybe a version of Paul’s Type II failure. (If this is surprising to you, reread it while asking yourself what terms like “correlated automation failure” are euphemisms for.) 

This is interesting, and I'd like to see you expand on this. Incidentally I agree with the statement, but I can imagine both more and less explosive, catastrophic versions of 'correlated automation failure'. On the one hand it makes me think of things like transportation and electricity going haywire, on the other it could fit a scenario where a collection of powerful AI systems simultaneously intentionally wipe out humanity.

Clock-time leads shrink automatically as the pace of innovation speeds up, because if everyone is innovating 10x faster, then you need 10x as many hoarded ideas to have an N-year lead. 

What if, as a general fact, some kinds of progress (the technological kinds more closely correlated with AI) are just much more susceptible to speed-up? I.e., what if 'the economic doubling time' stops being so meaningful - technological progress speeds up abruptly but other kinds of progress that adapt to tech progress have more of a lag before the increased technological progress also affects them? In that case, if the parts of overall progress that affect the likelihood of leaks, theft and spying aren't sped up by as much as the rate of actual technology progress, the likelihood of DSA could rise to be quite high compared to previous accelerations, where the speed-up happened on a timescale that allowed society to 'speed up' in the same way.

In other words - it becomes easier to hoard more and more ideas if the ability to hoard ideas is roughly constant but the pace of progress increases. Since a lot of these 'technologies' for facilitating leaks and spying are more in the social realm, this seems plausible.
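The arithmetic behind this can be sketched with a toy model. This is only an illustration under assumptions I'm making up here: the function names, the idea rates and the leak rate are all hypothetical, and 'leaks' are modelled crudely as a fixed number of ideas escaping per year.

```python
# Toy model; every number below is hypothetical and chosen only for illustration.

def hoard_after(years, own_rate, leak_rate):
    """Ideas still held exclusively after `years`, if the leader generates
    `own_rate` ideas per year and `leak_rate` ideas per year escape."""
    return max(0.0, (own_rate - leak_rate) * years)


def clock_time_lead(hoarded_ideas, rival_rate):
    """Years a rival innovating at `rival_rate` ideas/year needs to catch up."""
    return hoarded_ideas / rival_rate


BASE_RATE = 10.0  # ideas per year before the speed-up (assumed)
LEAK_RATE = 8.0   # ideas per year lost to leaks and spying (assumed)
SPEEDUP = 10.0    # the "everyone innovates 10x faster" case

# Before the speed-up.
lead_before = clock_time_lead(hoard_after(1, BASE_RATE, LEAK_RATE), BASE_RATE)

# Everything, leaks included, speeds up 10x.
lead_if_leaks_scale = clock_time_lead(
    hoard_after(1, BASE_RATE * SPEEDUP, LEAK_RATE * SPEEDUP),
    BASE_RATE * SPEEDUP,
)

# Only technology speeds up; leaks stay at their old clock-time rate.
lead_if_leaks_lag = clock_time_lead(
    hoard_after(1, BASE_RATE * SPEEDUP, LEAK_RATE),
    BASE_RATE * SPEEDUP,
)

print(lead_before, lead_if_leaks_scale, lead_if_leaks_lag)
```

With these made-up numbers the lead after one year is 0.2 years in the first two cases and 0.92 years when leaks lag behind, which is the effect I'm pointing at: the lead only stays small if the leak channels speed up along with the technology.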

But if you need to generate more ideas, this might just mean that if you have a very large initial lead, you can turn it into a DSA, which you still seem to agree with:

  • Even if takeoff takes several years it could be unevenly distributed such that (for example) 30% of the strategically relevant research progress happens in a single corporation. I think 30% of the strategically relevant research happening in a single corporation at beginning of a multi-year takeoff would probably be enough for DSA.

Eight claims about multi-agent AGI safety

Humans have skills and motivations (such as deception, manipulation and power-hungriness) which would be dangerous in AGIs. It seems plausible that the development of many of these traits was driven by competition with other humans, and that AGIs trained to answer questions or do other limited-scope tasks would be safer and less goal-directed. I briefly make this argument here.

Note that he claims that this may be true even if single/single alignment is solved, and all AGIs involved are aligned to their respective users.

It strikes me as interesting that much of the existing work that's been done on multiagent training, such as it is, focusses on just examining the behaviour of artificial agents in social dilemmas. The thinking seems to be - and this was also suggested in ARCHES - that it's useful just for exploratory purposes to try to characterise how and whether RL agents cooperate in social dilemmas, what mechanism designs and what agent designs promote what types of cooperation, and if there are any general trends in terms of what kinds of multiagent failures RL tends to fall into.

For example, it's generally known that regular RL tends to fail to cooperate in social dilemmas, 'Unfortunately, selfish MARL agents typically fail when faced with social dilemmas'. From ARCHES:

One approach to this research area is to continually examine social dilemmas through the lens of whatever is the leading AI development paradigm in a given year or decade, and attempt to classify interesting behaviors as they emerge. This approach might be viewed as analogous to developing “transparency for multi-agent systems”: first develop interesting multi-agent systems, and then try to understand them.

There seems to be an implicit assumption here that something very important and unique to multiagent situations would be uncovered - by analogy to things like the flash crash. It's not clear to me that we've examined the intersection of RL and social dilemmas enough to notice whether this is true, and I think that's the major justification for working on this area.
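To make 'social dilemma' concrete, here is a minimal sketch of a one-shot prisoner's dilemma with purely selfish best-response updates. It is not MARL and not taken from any of the papers mentioned; the payoff numbers and function names are hypothetical, picked only so that defection is individually best while mutual cooperation is jointly best.

```python
# Hypothetical prisoner's dilemma payoffs: PAYOFF[(my_move, their_move)] = my reward.
PAYOFF = {("C", "C"): 3, ("C", "D"): 0, ("D", "C"): 5, ("D", "D"): 1}
MOVES = ("C", "D")


def best_response(their_move):
    """The move that maximises my own payoff against a fixed opponent move."""
    return max(MOVES, key=lambda my_move: PAYOFF[(my_move, their_move)])


def selfish_dynamics(start=("C", "C"), rounds=10):
    """Both agents repeatedly switch to their best response to the other's last move."""
    a, b = start
    for _ in range(rounds):
        a, b = best_response(b), best_response(a)
    return a, b


def joint_payoff(a, b):
    """Sum of both agents' rewards for the move pair (a, b)."""
    return PAYOFF[(a, b)] + PAYOFF[(b, a)]


outcome = selfish_dynamics()
print(outcome, joint_payoff(*outcome), joint_payoff("C", "C"))
```

Starting from mutual cooperation, both agents end up defecting, and the joint payoff is lower than it would have been if they had kept cooperating. That is the basic failure that selfish learners are said to fall into.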

Commentary on AGI Safety from First Principles

Yeah - this is a case where how exactly the transition goes seems to make a very big difference. If it's a fast transition to a singleton, altering the goals of the initial AI is going to be super influential. But if it's that there are many generations of AIs that over time become the larger majority of the economy, then just control everything - predictably altering how that goes seems a lot harder at least.

Comparing the entirety of the Bostrom/Yudkowsky singleton intelligence explosion scenario to the slower more spread out scenario, it's not clear that it's easier to predictably alter the course of the future in the first compared to the second.

In the first, assuming you successfully set the goals of the singleton, the hard part is over and the future can be steered easily because there are, by definition, no more coordination problems to deal with. But in the first, a superintelligent AGI could explode on us out of nowhere with little warning and a 'randomly rolled utility function', so the amount of coordination we'd need pre-intelligence explosion might be very large.

In the second slower scenario, there are still ways to influence the development of AI - aside from massive global coordination and legislation, there may well be decision points where two developmental paths are comparable in terms of short-term usefulness but one is much better than the other in terms of alignment or the value of the long-term future. 

Stuart Russell's claim that we need to replace 'the standard model' of AI development is one such example - if he's right, a concerted push now by a few researchers could alter how nearly all future AI systems are developed for the better. So different conditions have to be met for it to be possible to predictably alter the future long in advance on the slow transition model (multiple plausible AI development paths that could be universally adopted and have ethically different outcomes) compared to the fast transition model (the ability to anticipate when and where the intelligence explosion will arrive and do all the necessary alignment work in time), but it's not obvious to me one is easier to meet than the other.


For this reason, I think it's unlikely there will be a very clearly distinct "takeoff period" that warrants special attention compared to surrounding periods.

I think the period AI systems can, at least in aggregate, finally do all the stuff that people can do might be relatively distinct and critical -- but, if progress in different cognitive domains is sufficiently lumpy, this point could be reached well after the point where we intuitively regard lots of AI systems as on the whole "superintelligent."

This might be another case (like 'the AI's utility function') where we should just retire the term as meaningless, but I think that 'takeoff' isn't always a strictly defined interval, especially if we're towards the medium-slow end. The start of the takeoff has a precise meaning only if you believe that RSI is an all-or-nothing property. In this graph from a post of mine, the light blue curve has an obvious start to the takeoff where the gradient discontinuously changes, but what about the yellow line? There clearly is a takeoff in that progress becomes very rapid, and although there's no obvious start point, there is still a period very different from our current period that is reached in a relatively short space of time - so not 'very clearly distinct' but still 'warrants special attention'.


At this point I think it's easier to just discard the terminology altogether. For some agents, it's reasonable to describe them as having goals. For others, it isn't. Some of those goals are dangerous. Some aren't. 

Daniel Dennett's intentional stance is either a good analogy for the problem of "can't define what has a utility function" or just a rewording of the same issue. Dennett's original formulation doesn't discuss different types of AI systems or utility functions, ranging in 'explicit goal directedness' all the way from expected-minmax game players to deep RL to purely random agents, but instead discusses physical systems ranging from thermostats up to humans. Either way, if you agree with Dennett's formulation of the intentional stance I think you'd also agree that it doesn't make much sense to speak of 'the utility function' as necessarily well-defined.

Some AI research areas and their relevance to existential safety

That said, I remain interested in more clarity on what you see as the biggest risks with these multi/multi approaches that could be addressed with technical research.

A (though not necessarily the most important) reason to think technical research into computational social choice might be useful is that examining specifically the behaviour of RL agents from a computational social choice perspective might alert us to ways in which coordination with future TAI might be similar or different to the existing coordination problems we face.

(i) make direct improvements in the relevant institutions, in a way that anticipates the changes brought about by AI but will most likely not look like AI research, 

It seems premature to say, in advance of actually seeing what such research uncovers, whether the relevant mechanisms and governance improvements are exactly the same as the improvements we need for good governance generally, or different. Suppose examining the behaviour of current RL agents in social dilemmas leads to a general result which in turn leads us to conclude there's a disproportionate chance TAI in the future will coordinate in some damaging way that we can resolve with a particular new regulation. It's always possible to say that solving the single/single alignment problem will prevent anything like that from happening in the first place, but why put all your hopes on plan A when plan B is relatively neglected?

Some AI research areas and their relevance to existential safety

Thanks for this long and very detailed post!

The MARL projects with the greatest potential to help are probably those that find ways to achieve cooperation between decentrally trained agents in a competitive task environment, because of its potential to minimize destructive conflicts between fleets of AI systems that cause collateral damage to humanity.  That said, even this area of research risks making it easier for fleets of machines to cooperate and/or collude at the exclusion of humans, increasing the risk of humans becoming gradually disenfranchised and perhaps replaced entirely by machines that are better and faster at cooperation than humans.

In ARCHES, you mention that just examining the multiagent behaviour of RL systems (or other systems that work as toy/small-scale examples of what future transformative AI might look like) might enable us to get ahead of potential multiagent risks, or at least try to predict how transformative AI might behave in multiagent settings. The way you describe it in ARCHES, the research would be purely exploratory:

One approach to this research area is to continually examine social dilemmas through the lens of whatever is the leading AI development paradigm in a given year or decade, and attempt to classify interesting behaviors as they emerge. This approach might be viewed as analogous to developing “transparency for multi-agent systems”: first develop interesting multi-agent systems, and then try to understand them.

But what you're suggesting in this post, 'those that find ways to achieve cooperation between decentrally trained agents in a competitive task environment', sounds like combining computational social choice research with multiagent RL - examining the behaviour of RL agents in social dilemmas and trying to design mechanisms that work to produce the kind of behaviour we want. To do that, you'd need insights from social choice theory. There is some existing research on this, but it's sparse and very exploratory.

My current research is attempting to build on the second of these.

As far as I can tell, that's more or less it in terms of examining RL agents in social dilemmas, so there may well be a lot of low-hanging fruit and interesting discoveries to be made. If the research is specifically about finding ways of achieving cooperation in multiagent systems by choosing the correct (e.g. voting) mechanism, is that not also computational social choice research, and therefore of higher priority by your metric?

In short, computational social choice research will be necessary to legitimize and fulfill governance demands for technology companies (automated and human-run companies alike) to ensure AI technologies are beneficial to and controllable by human society.  


CSC neglect:

As mentioned above, I think CSC is still far from ready to fulfill governance demands at the ever-increasing speed and scale that will be needed to ensure existential safety in the wake of “the alignment revolution”. 

AGI safety from first principles: Goals and Agency

Furthermore, we should take seriously the possibility that superintelligent AGIs might be even less focused than humans are on achieving large-scale goals. We can imagine them possessing final goals which don’t incentivise the pursuit of power, such as deontological goals, or small-scale goals. 


My underlying argument is that agency is not just an emergent property of highly intelligent systems, but rather a set of capabilities which need to be developed during training, and which won’t arise without selection for it

Was this line of argument inspired by Ben Garfinkel's objection to the 'classic' formulation of instrumental convergence/orthogonality - that these are 'measure based' arguments that just identify that a majority of possible agents with some agentive properties and large-scale goals will optimize in malign ways, rather than establishing that we're actually likely to build such agents?

It seems like you're identifying the same additional step that Ben identified, and that I argued could be satisfied - that we need a plausible reason why we would build an agentive AI with large-scale goals.

And the same applies for 'instrumental convergence' - the observation that most possible goals, especially simple goals, imply a tendency to produce extreme outcomes when ruthlessly maximised:

  • A system that is optimizing a function of n variables, where the objective depends on a subset of size k<n, will often set the remaining unconstrained variables to extreme values; if one of those unconstrained variables is actually something we care about, the solution found may be highly undesirable.  

We could see this as marking out a potential danger - a large number of possible mind-designs produce very bad outcomes if implemented. The fact that such designs exist 'weakly suggests' (Ben's words) that AGI poses an existential risk since we might build them. If we add in other premises that imply we are likely to (accidentally or deliberately) build such systems, the argument becomes stronger. But usually the classic arguments simply note instrumental convergence and assume we're 'shooting into the dark' in the space of all possible minds, because they take the abstract statement about possible minds to be speaking directly about the physical world. There are specific reasons to think this might occur (e.g. mesa-optimisation, sufficiently fast progress preventing us from course-correcting if there is even a small initial divergence) but those are the reasons that combine with instrumental convergence to produce a concrete risk, and have to be argued for separately.
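The quoted observation about optimizing a function of n variables when the objective only depends on k<n of them can be illustrated with a very small brute-force example. This is a sketch under assumptions of my own: n=3, k=1, a hypothetical shared budget that couples the variables, and made-up names throughout.

```python
from itertools import product

BUDGET = 10  # hypothetical shared resource that all three variables draw on


def objective(x):
    """Depends only on the first k=1 of the n=3 variables."""
    return x[0]


def feasible(x):
    """All variables compete for the same budget."""
    return sum(x) <= BUDGET


def optimise():
    """Brute-force search over every integer allocation within the budget."""
    candidates = (x for x in product(range(BUDGET + 1), repeat=3) if feasible(x))
    return max(candidates, key=objective)


best = optimise()
print(best)
```

The optimizer never 'wants' the other two variables to be zero; they are simply not in its objective, so it spends the whole budget on the one variable that is. If either of the others is something we care about, the solution is bad for us, which is all the abstract claim says. Whether we would actually build something like this is the separate question.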

Security Mindset and Takeoff Speeds

In terms of inferences about deceptive alignment, it might be useful to go back to the one and only current example we have where someone with somewhat relevant knowledge was led to wonder whether deception had taken place - GPT-3 balancing brackets. I don't know if anyone ever got Eliezer's $1000 bounty, but the top-level comment on that thread at least convinces me that it's unlikely that GPT-3 via AI Dungeon was being deceptive even though Eliezer thought there was a real possibility that it was.

Now, this doesn't prove all that much, but one thing it does suggest is that on current MIRI-like views about how likely deception is, the threshold for uncertainty about deception is set far too low. That suggests your people at OpenSoft might well be right in their assumption.

Forecasting Thread: AI Timelines

The 'progress will be continuous' argument, to apply to our near future, does depend on my other assumptions - mainly that the breakthroughs on that list are separable, so agentive behaviour and long-term planning won't drop out of a larger GPT by themselves and can't be considered part of just 'improving up language model accuracy'.

We currently have partial progress on human-level language comprehension, a bit on cumulative learning, but near zero on managing mental activity for long term planning, so if we were to suddenly reach human level on long-term planning in the next 5 years, that would probably involve a discontinuity, which I don't think is very likely for the reasons given here.

If language models scale to near-human performance but the other milestones don't fall in the process, and my initial claim is right, that gives us very transformative AI but not AGI. I think that the situation would look something like this:

If GPT-N reaches par-human:

discovering new action sets
managing its own mental activity
(?) cumulative learning
human-like language comprehension
perception and object recognition
efficient search over known facts

So there would be 2 (maybe 3?) breakthroughs remaining. It seems like you think just scaling up a GPT will also resolve those other milestones, rather than just giving us human-like language comprehension. Whereas if I'm right and also those curves do extrapolate, what we would get at the end would be an excellent text generator, but it wouldn't be an agent, wouldn't be capable of long-term planning and couldn't be accurately described as having a utility function over the states of the external world, and I don't see any reason why trivial extensions of GPT would be able to do that either since those seem like problems that are just as hard as human-like language comprehension. GPT seems like it's also making some progress on cumulative learning, though it might need some RL-based help with that, but none at all on managing mental activity for long-term planning or discovering new action sets.

As an additional argument, admittedly from authority - Stuart Russell also clearly sees human-like language comprehension as only one of several really hard and independent problems that need to be solved.

A humanlike GPT-N would certainly be a huge leap into a realm of AI we don't know much about, so we could be surprised and discover that agentive behaviour and having a utility function over states of the external world spontaneously appears in a good enough language model, but that argument has to be made, and you need that argument to hold and GPT to keep scaling for us to reach AGI in the next five years, and I don't see the conjunction of those two as that likely - it seems as though your argument rests solely on whether GPT scales or not, when there's also this other conceptual premise that's much harder to justify.

I'm also not sure if I've seen anyone make the argument that GPT-N will also give us these specific breakthroughs - but if you have reasons that GPT scaling would solve all the remaining barriers to AGI, I'd be interested to hear it. Note that this isn't the same as just pointing out how impressive the results scaling up GPT could be - Gwern's piece here, for example, seems to be arguing for a scenario more like what I've envisaged, where GPT-N ends up a key piece of some future AGI but just provides some of the background 'world model':

Models like GPT-3 suggest that large unsupervised models will be vital components of future DL systems, as they can be ‘plugged into’ systems to immediately provide understanding of the world, humans, natural language, and reasoning.

If GPT does scale, and we get human-like language comprehension in 2025, that will mean we're moving up that list much faster, and in turn suggests that there might not be a large number of additional discoveries required to make the other breakthroughs, which in turn suggests they might also occur within the Deep Learning paradigm, and relatively soon. I think that if this happens, there's a reasonable chance that when we do build an AGI a big part of its internals looks like a GPT, as gwern suggested, but by then we're already long past simply scaling up existing systems.

Alternatively, perhaps you're not including agentive behaviour in your definition of AGI - a par-human text generator for most tasks that isn't capable of discovering new action sets or managing its mental activity is, I think, a 'mere' transformative AI and not a genuine AGI.

SDM's Shortform

So to sum up, a very high-level summary of the steps in this method of preference elicitation and aggregation would be:

    1. With a mixture of normative assumptions and multi-channel information (approval and actions) as inputs, use a reward-modelling method to elicit the debiased preferences of many individuals.
      1. Determining whether there actually are significant differences between stated and revealed preferences when performing reward modelling is the first step to using multi-channel information to effectively separate biases from preferences.
    2. Create 'proxy agents' using the reward model developed for each human (this step is where intent-aligned amplification can potentially occur).
    3. Place the proxies in an iterated voting situation which tends to produce sensible convergent results. The use of RL proxies here can be compared to the use of human proxies in liquid democracy.
      1. Which voting mechanisms tend to work in iterated situations with RL agents can be determined in other experiments (probably with purely artificial agents)
    4. Run the voting mechanism until an unambiguous winner is decided, using methods like those given in this paper.

This seems like a reasonable procedure for extending a method that is aligned to one human's preferences (steps 1 and 2) to produce sensible results when trying to align to an aggregate of human preferences (steps 3 and 4). It reduces reliance on the specific features of one voting method. Other than the insight that multiple channels of information might help, all the standard unsolved problems with preference learning from one human remain.
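Steps 3 and 4 can be sketched in a few lines, with heavy caveats. Everything here is a stand-in: the 'proxies' are just hand-written utility tables rather than learned reward models, the voting rule (iterated plurality where proxies switch to their preferred front-runner) is one arbitrary choice and not the mechanism from the paper mentioned above, and all names and numbers are hypothetical.

```python
from collections import Counter


def sincere_vote(utilities):
    """A proxy's first-round vote: its highest-utility option."""
    return max(utilities, key=utilities.get)


def iterated_plurality(proxies, max_rounds=20):
    """Run plurality voting until some option has a strict majority.

    `proxies` is a list of dicts mapping option -> utility (stand-ins for
    per-human reward models). Returns (winner, rounds_used).
    """
    votes = [sincere_vote(u) for u in proxies]
    for round_no in range(1, max_rounds + 1):
        tally = Counter(votes)
        leader, leader_votes = tally.most_common(1)[0]
        if leader_votes > len(proxies) / 2:
            return leader, round_no
        # No majority yet: each proxy backs whichever of the two current
        # front-runners it prefers.
        front_runners = [option for option, _ in tally.most_common(2)]
        new_votes = [max(front_runners, key=u.get) for u in proxies]
        if new_votes == votes:
            # Votes are stable but nobody has a majority; fall back to the leader.
            return leader, round_no
        votes = new_votes
    return Counter(votes).most_common(1)[0][0], max_rounds


# Five hypothetical proxies over three options.
PROXIES = [
    {"A": 3, "B": 2, "C": 1},
    {"A": 3, "B": 2, "C": 1},
    {"B": 3, "C": 2, "A": 1},
    {"C": 3, "B": 2, "A": 1},
    {"C": 3, "B": 2, "A": 1},
]

print(iterated_plurality(PROXIES))
```

In this made-up example the process settles on C in the second round, even though B would beat either alternative in a head-to-head vote. That is a small reminder of why the choice of mechanism in step 3 needs its own experiments.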

Even though we can't yet align an AGI to one human's preferences, trying to think about how to aggregate human preferences in a way that is scalable isn't premature, as has sometimes been claimed.

In many 'non-ambitious' hypothetical settings where we aren't trying to build an AGI sovereign over the whole world (for example, designing a powerful AI to govern the operations of a hospital), we still need to be able to aggregate preferences sensibly and stably. This method would do well at such intermediate scales, as it doesn't approach the question of preference aggregation from a 'final' ambitious value-learning perspective but instead tries to look at preference aggregation the same way we look at elicitation, with an RL-based iterative approach to reaching a result.

However, if you did want to use such a method to try and produce the fabled 'final utility function of all humanity', it might not give you Humanity's CEV, since some normative assumptions (preferences count equally and in the way given by the voting mechanism) are built in. By analogy with CEV, I called the idealized result of this method a coherent extrapolated framework (CEF). This is a more normatively direct method of aggregating values than CEV (since you fix a particular method of aggregating preferences in advance), as it extrapolates from a voting framework rather than from our volition, more broadly (and vaguely) defined; hence CEF.
