All of Sammy Martin's Comments + Replies

Investigating AI Takeover Scenarios

Some points that didn't fit into the main post:

If the slow scenarios capture reality better than the fast scenarios, then systems will be deployed deliberately and will initially be given power rather than seizing it. This means both that the systems won’t be so obviously dangerous that their misbehaviour is noticed early on, and that misalignment is still present later on.

 This switch from apparently benign to dangerous behaviour could be due to

  • Power-seeking misaligned behaviour that is too subtle to notice in the training environment but is obvi
... (read more)
Distinguishing AI takeover scenarios

On reflection, I think you're right, and his report does apply to a wider range of scenarios, probably all of the ones we discuss excluding the brain-in-a-box scenarios.

However, I think the report's understanding of power-seeking AI does assume a takeoff that is not extremely fast, such that we end up deliberately deciding to deploy the potentially dangerous AI on a large scale, rather than a system exploding in capability almost immediately.

Given the assumptions of the brain-in-a-box scenario many of the corrective mechanisms the report discusses wouldn't... (read more)

Rohin Shah (reply): "I won't assume any of them" is distinct from "I will assume the negations of them". I'm fairly confident the analysis is also meant to apply to situations in which things like (1)-(5) do hold. (Certainly I personally am willing to apply the analysis to situations in which (1)-(5) hold.)
Distinguishing AI takeover scenarios

Some points that didn't fit into the main post:

While these scenarios do not capture all of the risks from transformative AI, participants in a recent survey aimed at leading AI safety/governance researchers estimated the first three of these scenarios to cover 50% of existential catastrophes from AI.

The full survey results break down as 16% 'Superintelligence' (i.e. some version of 'brain-in-a-box'), 16% WFLL 2 and 18% WFLL 1, for a total of 49% of the probability mass explicitly covered by our report. (Note that these are all means of distributions over... (read more)

Analogies and General Priors on Intelligence

The 'one big breakthrough' idea is definitely a way that you could have easy marginal intelligence improvements at HLMI, but we didn't call the node 'one big breakthrough/few key insights needed' because that's not the only way it's been characterised. E.g. some people talk about a 'missing gear for intelligence', where some minor change that isn't really a breakthrough (like tweaking a hyperparameter in a model training procedure) produces massive jumps in capability. Like David said, there's a subsequent post where we go through the different ways the jump to HLMI could play out, and One Big Breakthrough (we call it 'few key breakthroughs for intelligence') is just one of them.

Steve Byrnes (reply): I guess I'd just suggest that in "ML exhibits easy marginal intelligence improvements", you should specify whether the "ML" is referring to "today's ML algorithms" vs "whatever ML algorithms we're using in HLMI" vs "all ML algorithms" vs something else (or maybe you already did say which it is but I missed it). Looking forward to the future posts :)
Analogies and General Priors on Intelligence

I agree that that was his object-level claim about GPT-3 coding a React app - that it's relatively simple and coherent and can acquire lots of different skills via learning, rather than being a collection of highly specialised modules. And of relevance to this post, the first is a way that intelligence improvements could be easy, and the second is a way they could be hard. Our 'interpretation' was more about making explicit what the observation about GPT-3 was:

GPT-3 is general enough that it can write a functioning app given a short prompt, despite the fact that

... (read more)
What Multipolar Failure Looks Like, and Robust Agent-Agnostic Processes (RAAPs)

Perhaps this is a crux in this debate: If you think the 'agent-agnostic perspective' is useful, you also think a relatively steady state of 'AI Safety via Constant Vigilance' is possible. This would be a situation where systems that aren't significantly inner misaligned (otherwise they'd have no incentive to care about governing systems, feedback or other incentives) but are somewhat outer misaligned (so they are honestly and accurately aiming to maximise some complicated measure of profitability or approval, not directly aiming to do what we want them to ... (read more)

[AN #159]: Building agents that know how to experiment, by training on procedurally generated games

- They will not work in any environment outside of XLand (unless that environment looks very very similar to XLand).

In particular, I reject the idea that these agents have learned “general strategies for problem solving” or something like that, such that we should expect them to work in other contexts as well, perhaps with a little finetuning. I think they have learned general strategies for solving a specific class of games in XLand.

Strongly agree with this, although with the caveat that it's deeply impressive progress compared to the state of the art in ... (read more)

DeepMind: Generally capable agents emerge from open-ended play

This is amazing. So it's the exact same agents performing well on all of these different tasks, not just the same general algorithm retrained on lots of examples. In which case, have they found a generally useful way around the catastrophic forgetting problem? I guess the whole training procedure, amount of compute + experience, and architecture, taken together, just solves catastrophic forgetting - at least for a far wider range of tasks than I've seen so far.

Could you use this technique to e.g. train the same agent to do well on chess and go?

I also notic... (read more)

Adam Shimi (reply): If I'm not misunderstanding your question, this is something they already did with MuZero [https://deepmind.com/blog/article/muzero-mastering-go-chess-shogi-and-atari-without-rules].
Pros and cons of working on near-term technical AI safety and assurance

It depends somewhat on what you mean by 'near-term interpretability' - if you apply that term to research into, for example, improving the stability of and ability to access the 'inner world models' held by large opaque language models like GPT-3, then there's a strong argument that ML-based 'interpretability' research might be one of the best ways of directly working on alignment research:

https://www.alignmentforum.org/posts/29QmG4bQDFtAzSmpv/an-141-the-case-for-practicing-alignment-work-on-gpt-3-and

And see this discussion for more,

https://www.lesswrong.com/... (read more)

What Multipolar Failure Looks Like, and Robust Agent-Agnostic Processes (RAAPs)

Great post! I'm glad someone has outlined in clear terms what these failures look like, rather than the nebulous 'multiagent misalignment', as it lets us start on a path to clarifying what (if any) new mitigations or technical research are needed.

The agent-agnostic perspective is a very good innovation for thinking about these problems - the line between agentive and non-agentive behaviour is often not clear, and it's not like there is a principled metaphysical distinction between the two (e.g. Dennett and the Intentional Stance). Currently, big corporations ca... (read more)

SDM's Shortform

Update to 'Modelling Continuous Progress'

I made an attempt to model intelligence explosion dynamics in this post, by making the very oversimplified exponential-returns-to-exponentially-increasing-intelligence model used by Bostrom and Yudkowsky slightly less oversimplified.

This post tries to build on a simplified mathematical model of takeoff which was first put forward by Eliezer Yudkowsky and then refined by Bostrom in Superintelligence, modifying it to account for the different assumptions behind continuous, fast progress as opposed to disco

... (read more)
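For readers who want the flavour of that model, here is a minimal sketch in code - my own assumed functional form (growth rate proportional to I^(1+d)), not the exact equations from the post - where a single parameter d controls how sharply returns to intelligence compound: d = 0 recovers plain exponential growth with no recursive self-improvement, while larger d approximates a discontinuity.

```python
# Toy takeoff model (a sketch under assumed dynamics, not the post's exact equations):
# dI/dt = c * I^(1 + d). Capability is normalised so I = 1 is roughly human level,
# and we start below it. d = 0 gives ordinary exponential growth; larger d keeps
# growth negligible for longer and then compounds much faster once I approaches 1.

def time_to_reach(level, d, c=0.5, i0=0.1, dt=0.1, max_t=200.0):
    """Euler-integrate the toy model and return the first time capability >= level."""
    i, t = i0, 0.0
    while i < level and t < max_t:
        i += c * i ** (1 + d) * dt
        t += dt
    return t if i >= level else None

for d in (0.0, 1.0, 2.0):
    t_human = time_to_reach(1.0, d)
    t_super = time_to_reach(10.0, d)
    print(f"d={d}: human level at t={t_human:.1f}, 10x human at t={t_super:.1f}, "
          f"gap={t_super - t_human:.1f}")
```

Holding the other parameters fixed, larger d (more discontinuous progress) both pushes the takeoff later and compresses the jump from human-level to far-beyond-human once it starts, which is the sense in which fast takeoff tends to imply later timelines.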
My research methodology

Is a bridge falling down the moment you finish building it an extreme and somewhat strange failure mode? In the space of all possible bridge designs, surely not. Most bridge designs fall over. But in the real world, you could win money all day betting that bridges won't collapse the moment they're finished.

I'm not saying this is an exact analogy for AGI alignment - there are lots of specific technical reasons to expect that alignment is not like bridge building and that there are reasons why the approaches we're likely to try will break on us suddenly in w... (read more)

Is a bridge falling down the moment you finish building it an extreme and somewhat strange failure mode? In the space of all possible bridge designs, surely not. Most bridge designs fall over. But in the real world, you could win money all day betting that bridges won't collapse the moment they're finished.

Yeah, that kiiiinda relies on literally anybody anywhere being able to sketch a bridge that wouldn't fall over, which is not the situation we are currently in.

Review of Soft Takeoff Can Still Lead to DSA

I agree with your argument about likelihood of DSA being higher compared to previous accelerations, due to society not being able to speed up as fast as the technology. This is sorta what I had in mind with my original argument for DSA; I was thinking that leaks/spying/etc. would not speed up nearly as fast as the relevant AI tech speeds up.

Your post on 'against GDP as a metric' argues more forcefully for the same thing that I was arguing for, that 

'the economic doubling time' stops being so meaningful - technological progress speeds up abruptly but o

... (read more)
Daniel Kokotajlo (reply): I wouldn't go that far. The reason I didn't propose an alternative metric to GDP was that I didn't have a great one in mind and the post was plenty long enough already. I agree that it's not obvious a good metric exists, but I'm optimistic that we can at least make progress by thinking more. For example, we could start by enumerating different kinds of skills (and combos of skills) that could potentially lead to a PONR if some faction or AIs generally had enough of them relative to everyone else. (I sorta start such a list in the post). Next, we separately consider each skill and come up with a metric for it. I'm not sure I understand your proposed methodology fully. Are you proposing we do something like Roodman's model [https://www.lesswrong.com/posts/L23FgmpjsTebqcSZb/how-roodman-s-gwp-model-translates-to-tai-timelines] to forecast TAI and then adjust downwards based on how we think PONR could come sooner? I think unfortunately that GWP growth can't be forecast that accurately, since it depends on AI capabilities increases.
Review of Soft Takeoff Can Still Lead to DSA

Currently the most plausible doom scenario in my mind is maybe a version of Paul’s Type II failure. (If this is surprising to you, reread it while asking yourself what terms like “correlated automation failure” are euphemisms for.) 

This is interesting, and I'd like to see you expand on this. Incidentally I agree with the statement, but I can imagine both more and less explosive, catastrophic versions of 'correlated automation failure'. On the one hand it makes me think of things like transportation and electricity going haywire, on the other it could ... (read more)

Daniel Kokotajlo (reply): Sorry it took me so long to reply; this comment slipped off my radar. The latter scenario is more what I have in mind - powerful AI systems deciding that now's the time to defect, to join together into a new coalition in which AIs call the shots instead of humans. It sounds silly, but it's most accurate to describe in classic political terms: Powerful AI systems launch a coup/revolution to overturn the old order and create a new one that is better by their lights. I agree with your argument about likelihood of DSA being higher compared to previous accelerations, due to society not being able to speed up as fast as the technology. This is sorta what I had in mind with my original argument for DSA; I was thinking that leaks/spying/etc. would not speed up nearly as fast as the relevant AI tech speeds up. Now I think this will definitely be a factor but it's unclear whether it's enough to overcome the automatic slowdown. I do at least feel comfortable predicting that DSA is more likely this time around than it was in the past... probably.
Eight claims about multi-agent AGI safety

Humans have skills and motivations (such as deception, manipulation and power-hungriness) which would be dangerous in AGIs. It seems plausible that the development of many of these traits was driven by competition with other humans, and that AGIs trained to answer questions or do other limited-scope tasks would be safer and less goal-directed. I briefly make this argument here.

Note that he claims that this may be true even if single/single alignment is solved, and all AGIs involved are aligned to their respective users.

It strikes me as interesting that muc... (read more)

David Manheim (reply): Strongly agree that it's unclear that these failures would be detected. For discussion and examples, see my paper here: https://www.mdpi.com/2504-2289/3/2/21/htm
Commentary on AGI Safety from First Principles

Yeah - this is a case where how exactly the transition goes seems to make a very big difference. If it's a fast transition to a singleton, altering the goals of the initial AI is going to be super influential. But if there are many generations of AIs that over time become the larger part of the economy and then come to control everything, predictably altering how that goes seems a lot harder, at least.

Comparing the entirety of the Bostrom/Yudkowsky singleton intelligence explosion scenario to the slower more spread out scenario, it's not clear that... (read more)

Some AI research areas and their relevance to existential safety

That said, I remain interested in more clarity on what you see as the biggest risks with these multi/multi approaches that could be addressed with technical research.

A (though not necessarily the most important) reason to think technical research into computational social choice might be useful is that examining specifically the behaviour of RL agents from a computational social choice perspective might alert us to ways in which coordination with future TAI might be similar to, or different from, the existing coordination problems we face.

(i) make direct improvem

... (read more)

It's always possible to say that solving the single/single alignment problem will prevent anything like that from happening in the first place, but why put all your hopes on plan A when plan B is relatively neglected?

The OP writes "contributions to AI alignment are also generally unhelpful to existential safety." I don't think I'm taking a strong stand in favor of putting all our hopes on plan A, I'm trying to understand the perspective on which plan B is much more important even before considering neglectedness.

It seems premature to say, in advance of actual

... (read more)
Some AI research areas and their relevance to existential safety

Thanks for this long and very detailed post!

The MARL projects with the greatest potential to help are probably those that find ways to achieve cooperation between decentrally trained agents in a competitive task environment, because of its potential to minimize destructive conflicts between fleets of AI systems that cause collateral damage to humanity.  That said, even this area of research risks making it easier for fleets of machines to cooperate and/or collude at the exclusion of humans, increasing the risk of humans becoming gradually disenfranchi

... (read more)
AGI safety from first principles: Goals and Agency

Furthermore, we should take seriously the possibility that superintelligent AGIs might be even less focused than humans are on achieving large-scale goals. We can imagine them possessing final goals which don’t incentivise the pursuit of power, such as deontological goals, or small-scale goals. 

...

My underlying argument is that agency is not just an emergent property of highly intelligent systems, but rather a set of capabilities which need to be developed during training, and which won’t arise without selection for it

Was this line of argument inspire... (read more)

Richard Ngo (reply): I do like the link you've drawn between this argument and Ben's one. I don't think I was inspired by it, though; rather, I was mostly looking for a better definition of agency, so that we could describe what it might look like to have highly agentic agents without large-scale goals.
Security Mindset and Takeoff Speeds

In terms of inferences about deceptive alignment, it might be useful to go back to the one and only current example we have where someone with somewhat relevant knowledge was led to wonder whether deception had taken place - GPT-3 balancing brackets. I don't know if anyone ever got Eliezer's $1000 bounty, but the top-level comment on that thread at least convinces me that it's unlikely that GPT-3 via AI Dungeon was being deceptive even though Eliezer thought there was a real possibility that it was.

Now, this doesn't prove all that much, but one thing it do... (read more)

Forecasting Thread: AI Timelines

The 'progress will be continuous' argument, to apply to our near future, does depend on my other assumptions - mainly that the breakthroughs on that list are separable, so agentive behaviour and long-term planning won't drop out of a larger GPT by themselves and can't be considered part of just 'improving language model accuracy'.

We currently have partial progress on human-level language comprehension, a bit on cumulative learning, but near zero on managing mental activity for long term planning, so if we were to suddenly r... (read more)

SDM's Shortform

So to sum up, a very high-level summary of the steps in this method of preference elicitation and aggregation would be:

    1. With a mixture of normative assumptions and multi-channel information (approval and actions) as inputs, use a reward-modelling method to elicit the debiased preferences of many individuals.
      1. Determining whether there actually are significant differences between stated and revealed preferences when performing reward modelling is the first step to using multi-channel information to effectively separate biases from preferences.
    2. Create 'prox
... (read more)
Forecasting Thread: AI Timelines

Here's my answer. I'm pretty uncertain compared to some of the others!

[image: AI forecast distribution]

First, I'm assuming that by AGI we mean an agent-like entity that can do the things associated with general intelligence, including things like planning towards a goal and carrying that out. If we end up in a CAIS-like world where there is some AI service or other that can do most economically useful tasks, but nothing with very broad competence, I count that as never developing AGI.

I've been impressed with GPT-3, and could imagine it or something like it scaling to produce near-human le... (read more)

Daniel Kokotajlo (reply): I object! I think your argument from extrapolating when milestones have been crossed is good, but it's just one argument among many. There are other trends which, if extrapolated, get to AGI in less than five years. For example if you extrapolate the AI-compute trend and the GPT-scaling trends you get something like "GPT-5 will appear 3 years from now and be 3 orders of magnitude bigger and will be human-level at almost all text-based tasks." No discontinuity required.
Ben Pace (reply): (I can't see your distribution in your image.)
SDM's Shortform

I don't think that excuse works in this case - I didn't give it a 'long-winded frame', just that brief sentence at the start, and then the list of scenarios, and even though I reran it a couple of times on each to check, the 'cranberry/grape juice kills you' outcome never arose.

So, perhaps they switched directly from no prompt to an incredibly long-winded and specific prompt without checking what was actually necessary for a good answer? I'll point out that I didn't really attempt any sophisticated prompt programming either - that was literally the first sentence I thought of!

SDM's Shortform

Gary Marcus, noted sceptic of Deep Learning, wrote an article with Ernest Davis:

GPT-3, Bloviator: OpenAI’s language generator has no idea what it’s talking about

The article purports to give six examples of GPT-3's failure - Biological, Physical, Social, Object and Psychological reasoning and 'non sequiturs'. Leaving aside that GPT-3 works on Gary's earlier GPT-2 failure examples, and that it seems as though he specifically searched out weak points by testing GPT-3 on many more examples than were given, something a bit odd is going... (read more)

gwern (reply): Entirely possible. Other people have mentioned that using any prompt (rather than just plopping the stories in) solves a lot of them, and Summers-stay says that Marcus & Davis did zero prompt programming and had no interest in the question of what prompt to use (quite aside from the lack of BO). I think they found the same thing, which is why they provide the preemptive excuse in the TR writeup:
Learning human preferences: optimistic and pessimistic scenarios

Glad you think so! I think that methods like using multiple information sources might be a useful way to reduce the number of (potentially mistaken) normative assumptions you need in order to model a single human's preferences.

The other area of human preference learning where you seem, inevitably, to need a lot of strong normative assumptions is in preference aggregation. If we assume we have elicited the preferences of lots of individual humans, and we're then trying to aggregate their preferences (with each human's preference represented ... (read more)

SDM's Shortform

Modelling the Human Trajectory or ‘How I learned to stop worrying and love Hegel’.

Rohin’s opinion: I enjoyed this post; it gave me a visceral sense for what hyperbolic models with noise look like (see the blog post for this, the summary doesn’t capture it). Overall, I think my takeaway is that the picture used in AI risk of explosive growth is in fact plausible, despite how crazy it initially sounds.

One thing this post led me to consider is that when we bring together various fields, the evidence for 'things will go insane in the next century' is stron... (read more)
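To make 'hyperbolic models with noise' a bit more concrete, here is a toy version of my own (a simplified stand-in, not Roodman's actual specification): growth is super-exponential in the level of output, with a random shock each step, so the trajectory still eventually goes vertical but the timing of the explosion varies from run to run.

```python
# Toy stochastic hyperbolic growth: drift proportional to y^1.5 plus multiplicative noise.
# This is an illustrative sketch, not Roodman's model; parameters are made up.
import random

def explosion_time(a=0.1, sigma=0.3, dt=0.05, y0=1.0, threshold=1000.0, max_t=200.0):
    """Return the first time output exceeds `threshold` times its starting scale."""
    y, t = y0, 0.0
    while y < threshold and t < max_t:
        y += a * y ** 1.5 * dt + sigma * y * random.gauss(0.0, 1.0) * dt ** 0.5
        y = max(y, 1e-6)          # keep the toy positive
        t += dt
    return t if y >= threshold else None

random.seed(0)
times = [explosion_time() for _ in range(5)]
print([None if t is None else round(t, 1) for t in times])   # explosion timing varies
```

In this toy version the noise mostly shifts when the explosion happens rather than whether it happens, which is the qualitative picture the post conveys.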

SDM's Shortform

Improving preference learning approaches

When examining value learning approaches to AI alignment, we run into two classes of problem - we want to understand how to elicit preferences, which is very difficult (even theoretically, with infinite computing power), and we want to know how to aggregate preferences stably and correctly, which is not just difficult but runs into complicated social choice and normative ethical issues.

Many research programs say the second of these questions is less important than the first, especially if we expect continu... (read more)

Samuel Dylan Martin (reply): So to sum up, a very high-level summary of the steps in this method of preference elicitation and aggregation would be:

    1. With a mixture of normative assumptions [https://www.lesswrong.com/posts/6XLyM22PBd9qDtin8/learning-human-preferences-optimistic-and-pessimistic] and multi-channel information [https://www.lesswrong.com/posts/6XLyM22PBd9qDtin8/learning-human-preferences-optimistic-and-pessimistic?commentId=RbpcywSZZqp8ajfSR] (approval and actions) as inputs, use a reward-modelling method to elicit the debiased preferences of many individuals.
      1. Determining whether there actually are significant differences between stated and revealed [https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3046725] preferences when performing reward modelling is the first step to using multi-channel information to effectively separate biases from preferences.
    2. Create 'proxy agents' using the reward model developed for each human (this step is where intent-aligned amplification [https://www.lesswrong.com/posts/yxzrKb2vFXRkwndQ4/understanding-iterated-distillation-and-amplification-claims] can potentially occur).
    3. Place the proxies in an iterated voting [https://www.lesswrong.com/posts/D6trAzh6DApKPhbv4/a-voting-theory-primer-for-rationalists] situation which tends to produce sensible convergent results. The use of RL proxies here can be compared to the use of human proxies in liquid democracy [https://www.wikiwand.com/en/Liquid_democracy].
      1. Which voting mechanisms tend to work in iterated situations with RL agents can be determined in other experiments (probably with purely artificial agents [https://www.irit.fr/~Umberto.Grandi/scone/Layka.m2.pdf]).
    4. Run the voting mechanism until an unambiguous winner is decided, using methods like those given in this paper [https://www.irit.fr/~Umberto.Grandi/scone/Layka.m2.pdf]. This seems like a reasonabl
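As a rough illustration of the iterated voting step, here is a minimal, hypothetical sketch: proxy agents with fixed utility functions (stand-ins for learned reward models) repeatedly cast plurality votes, each round restricting attention to the currently viable candidates, until the outcome stabilises. This is only meant to convey the general idea, not the mechanism from the linked paper, and the candidates and utilities below are made up.

```python
# Iterated plurality voting among proxy agents (illustrative sketch, not the linked paper's
# mechanism). Each round, proxies best-respond to the previous tally over a "viable set".

candidates = ["A", "B", "C"]
proxies = [                       # one made-up utility function per human's proxy agent
    {"A": 1.0, "B": 0.4, "C": 0.0},
    {"A": 0.9, "B": 0.5, "C": 0.0},
    {"A": 0.2, "B": 1.0, "C": 0.3},
    {"A": 0.0, "B": 0.6, "C": 1.0},
    {"A": 0.0, "B": 0.7, "C": 0.9},
]

def tally(votes):
    return {c: votes.count(c) for c in candidates}

# Round 0: sincere votes for each proxy's top candidate.
votes = [max(candidates, key=u.get) for u in proxies]

for _ in range(20):
    counts = tally(votes)
    viable = sorted(candidates, key=lambda c: -counts[c])[:2]   # crude viability heuristic
    new_votes = [max(viable, key=u.get) for u in proxies]
    if new_votes == votes:        # stable outcome reached
        break
    votes = new_votes

final_counts = tally(votes)
print("Stable winner:", max(final_counts, key=final_counts.get), final_counts)
```

Which mechanisms and viability heuristics actually converge to sensible outcomes, rather than cycling, is exactly the kind of question the experiments with purely artificial agents are meant to settle.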
Learning human preferences: optimistic and pessimistic scenarios

To get around the impossibility result, we need "normative assumptions": assumptions about the preferences (or beliefs, or biases) of the agent that cannot be deduced fully from observations.
Under the optimistic scenario, we don't need many of these, at least for identifying human preferences. We can label a few examples ("the anchoring bias, as illustrated in this scenario, is a bias"; "people are at least weakly rational"; "humans often don't think about new courses of action they've never s
... (read more)
Stuart Armstrong (reply): Thanks! Useful insights in your post, to mull over.
Alignment By Default

‘You get what you measure’ (outer alignment failure) and mesa-optimisers (inner alignment failure) are both potential gap-fillers that explain why specifically the alignment/capability divergence initially arises. Whether it’s one or the other, I think the overall point is still that there is this gap in the classic arguments that allows for a (possibly quite high) chance of ‘alignment by default’, for the reasons you give, but there are at least two plausible mechanisms that fill this gap. And then I suppose my broader point would be that we should present:

Classic Arguments → objections to them (capability and alignment often go together, could get alignment by default) → specific causal mechanisms for misalignment

Alignment By Default

I think what you've identified here is a weakness in the high-level, classic arguments for AI risk -

Overall, I’d give maybe a 10-20% chance of alignment by this path, assuming that the unsupervised system does end up with a simple embedding of human values. The main failure mode I’d expect, assuming we get the chance to iterate, is deception - not necessarily “intentional” deception, just the system being optimized to look like it’s working the way we want rather than actually working the way we want. It’
... (read more)

Personally, I think a more likely failure mode is just "you get what you measure", as in Paul's write up here. If we only know how to measure certain things which are not really the things we want, then we'll be selecting for not-what-we-want by default. But I know at least some smart people who think that inner alignment is the more likely problem, so you're in good company.

Buck's Shortform

I wrote a whole post on modelling specific continuous or discontinuous scenarios - in the course of trying to make a very simple differential equation model of continuous takeoff by modifying the models given by Bostrom/Yudkowsky for fast takeoff, the result that fast takeoff means later timelines naturally jumps out.

Varying d between 0 (no RSI) and infinity (a discontinuity) while holding everything else constant looks like this: [figure: capability trajectories for different values of d]. If we compare the trajectories, we see two effects - the more continuous the progress is (lower d), the earlier we see growt

... (read more)
Inner Alignment: Explain like I'm 12 Edition

Inner Alignment / Misalignment is possibly the key specific mechanism which fills a weakness in the 'classic arguments' for AI safety - the Orthogonality Thesis, Instrumental Convergence and Fast Progress together implying small separations between AI alignment and AI capability can lead to catastrophic outcomes. The question of why there would be such a damaging, hard-to-detect divergence between goals and alignment needs an answer to have a solid, specific reason to expect dangerous misalignment, and Inner Misalignment is just such a reason.

I t... (read more)

Developmental Stages of GPTs
--The orthogonality thesis and convergent instrumental goals arguments, respectively, attacked and destroyed two views which were surprisingly popular at the time: 1. that smarter AI would necessarily be good (unless we deliberately programmed it not to be) because it would be smart enough to figure out what's right, what we intended, etc. and 2. that smarter AI wouldn't lie to us, hurt us, manipulate us, take resources from us, etc. unless it wanted to (e.g. because it hates us, or because it has been programmed to kill, etc) which it pr
... (read more)

I think that the criticism sees it the second way, and so sees the arguments as not establishing what they are supposed to establish, and I see it the first way - there might be a further fact that says why OT and IC don't apply to AGI like they theoretically should, but the burden is on you to prove it, rather than saying that we need evidence that OT and IC will apply to AGI.

I agree with that burden of proof. However, we do have evidence that IC will apply, if you think we might get AGI through RL. 

I think that hypothesized AI catastrophe is usually due t... (read more)

Developmental Stages of GPTs

What would you say is wrong with the 'exaggerated' criticism?

I don't think you can call the arguments wrong if you also think the Orthogonality Thesis and Instrumental Convergence are real and relevant to AI safety, and as far as I can tell the criticism doesn't claim that - just that there are other assumptions needed for disaster to be highly likely.

I don't have an elevator pitch summary of my views yet, and it's possible that my interpretation of the classic arguments is wrong; I haven't reread them recently. But here's an attempt:

--The orthogonality thesis and convergent instrumental goals arguments, respectively, attacked and destroyed two views which were surprisingly popular at the time: 1. that smarter AI would necessarily be good (unless we deliberately programmed it not to be) because it would be smart enough to figure out what's right, what we intended, etc. and 2. th... (read more)

Developmental Stages of GPTs

I find this interesting in the context of the recent podcast on errors in the classic arguments for AI risk - which boil down to: there is no necessary reason why instrumental convergence or orthogonality apply to your systems, and there are actually strong reasons, a priori, to think increasing AI capabilities and increasing AI alignment go together to some degree... and then GPT-3 comes along and suggests that, practically speaking, you can get highly capable behaviour that scales up easily without much in the way of alignment.

On the one hand, GPT-3 is ... (read more)

I think the errors in the classic arguments have been greatly exaggerated. So for me the update is just in one direction.

[AN #108]: Why we should scrutinize arguments for AI risk

Ben Garfinkel on scrutinising classic AI risk arguments

There are three features to the 'old arguments' in favour of AI safety, which Ben identifies here:

1. A discontinuity premise (e.g. “fast takeoff”)
2. A premise about the relationship between capabilities and objectives (e.g. “orthogonality thesis”)
3. A premise about the portion of systems of a certain kind that are deadly (e.g. “instrumental convergence thesis”)

I argued in a previous post that the 'discontinuity premise' is based on taking a high-le... (read more)

Modelling Continuous Progress

After reading your summary of the difference (maybe just a difference in emphasis) between 'Paul slow' vs 'continuous' takeoff, I did some further simulations. A low setting of d (highly continuous progress) doesn't give you a Paul-slow condition on its own, but it is relatively easy to replicate a situation like this:

There will be a complete 4 year interval in which world output doubles, before the first 1 year interval in which world output doubles. (Similarly, we’ll see an 8 year doubling before a 2 year doubling, etc.
... (read more)
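For concreteness, that condition can be checked mechanically on any simulated output series. Here is a small sketch of my own, with a made-up, smoothly accelerating series standing in for world output (illustrative only, not the post's actual simulation data).

```python
# Check the "Paul-slow" condition on a toy world-output series: does a complete
# 4-year doubling interval finish before the first 1-year doubling interval starts?

def first_doubling_start(output, window):
    """First index t such that output doubles over [t, t + window]."""
    for t in range(len(output) - window):
        if output[t + window] >= 2 * output[t]:
            return t
    return None

# Toy yearly output series whose doubling time shrinks over time (illustrative values).
output = [100 * 2 ** (0.01 * t ** 2) for t in range(80)]

four_year = first_doubling_start(output, 4)
one_year = first_doubling_start(output, 1)

print("first 4-year doubling starts at year", four_year)
print("first 1-year doubling starts at year", one_year)
print("Paul-slow condition holds:", four_year is not None
      and one_year is not None and four_year + 4 <= one_year)
```

This is the check that the low-d trajectories described above can be made to pass.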
Modelling Continuous Progress

They do disagree about locality, yes, but as far as I can tell that is downstream of the assumption that there won't be a very abrupt switch to a new growth mode. A single project pulling suddenly ahead of the rest of the world would happen if the growth curve is such that with a realistic amount (a few months) of lead time you can get ahead of everyone else.

So the obvious difference in predictions is that e.g. Paul/Robin think that takeoff will occur across many systems in the world while MIRI thinks it will occur in a single system. That is because ... (read more)

[AN #80]: Why AI risk might be solved without additional intervention from longtermists

The biggest disagreement between me and more pessimistic researchers is that I think gradual takeoff is much more likely than discontinuous takeoff (and in fact, the first, third and fourth paragraphs above are quite weak if there's a discontinuous takeoff).

It's been argued before that Continuous is not the same as Slow by any normal standard, so the strategy of 'dealing with things as they come up', while more viable under a continuous scenario, will probably not be sufficient.

It seems to me like you're assuming longtermists ... (read more)

Rohin Shah (reply): In the situations you describe, I would still be somewhat optimistic about coordination. But yeah, such situations leading to doom seem plausible, and this is why the estimate is 90% instead of 95% or 99%. (Though note that the numbers are very rough.)
The Value Definition Problem

I appreciate the summary, though the way you state the VDP isn't quite the way I meant it.

what should our AI system try to do (see Clarifying "AI Alignment"), to have the best chance of a positive outcome?

To me, this reads like 'we have a particular AI, what should we try to get it to do', whereas I meant it as 'what Value Definition should we be building our AI to pursue'. So, that's why I stated it as 'what should we aim to get our AI to want/target/decide/do' or, to be consistent with your way ... (read more)

Rohin Shah (reply): Hmm, I definitely didn't intend it that way - I'm basically always talking about how to build AI systems, and I'd hope my readers see it that way too. But in any case, adding three words isn't a big deal, I'll change that. (Though I think it is "what should we get our AI system to try to do", as opposed to "what should we try to get our AI system to do", right? The former is intent alignment, the latter is not.) In some abstract sense, certainly. But it could be "I'll take no action; whatever future humanity decides on will be what happens". This is in some sense a decision about the nature of the delegation, but not a huge one. (You could also imagine believing that delegating will be fine for a wide variety of delegation procedures, and so you aren't too worried which one gets used.) For example, perhaps we solve intent alignment in a value-neutral way (that is, the resulting AI system tries to figure out the values of its operator and then satisfy them, and can do so for most operators), and then every human gets an intent-aligned AGI, this leads to a post-scarcity world, and then all of the future humans figure out what they as a society care about (the philosophical labor) and then that is optimized. Of course, the philosophical labor did eventually happen, but the point is that it happened well after AGI, and pre-AGI nothing major needed to be done to delegate to the future humans.
The Value Definition Problem

Thanks for pointing that out to me; I had not come across your work before! I've had a look through your post and I agree that we're saying similar things. I would say that my 'Value Definition Problem' is an (intentionally) vaguer and broader question about what our research program should be - as I argued in the article, this is mostly an axiological question. Your final statement of the Alignment Problem (informally) is:

A must learn the values of H and H must know enough about A to believe A shares H’s values

while my Value D... (read more)