Hmm. Perhaps the thing I'd endorse is more [include this in every detailed statement about policy/regulation], rather than [shout it from the rooftops].

So, for example, if the authors agree with the statement, I think this should be in:

  • ARC Evals' RSP post.
  • Every RSP.
  • Proposals for regulation.
  • ...

I'm fine if we don't start printing it on bumper stickers.

The outcome I'm interested in is something like: every person with significant influence on policy knows that this is believed to be a good/ideal solution, and that the only reasons against it are based on whether it's achievable in the right form.

If ARC Evals aren't saying this, RSPs don't include it, and many policy proposals don't include it..., then I don't expect this to become common knowledge.
We're much less likely to get a stop if most people with influence don't even realize it's the thing that we'd ideally get.

...I think forcing people to publicly endorse policies...

Saying "If this happened, it would solve the problem" is not endorsing a policy. (though perhaps literally shouting it from the rooftops might be less than ideal)

It's entirely possible to state both "If x happened, it'd solve the problem", and "The policy we think is most likely to be effective in practice is Y". They can be put in the same statement quite simply.

It's reasonable to say that this might not be the most effective communication strategy. (though I think on balance I'd disagree)
It's not reasonable to say that this amounts to publicly endorsing a policy.

...if we ended capitalism, it would solve climate change. Though true...

This seems an unhelpful parallel, first because it's not clearly true. (In particular "ended capitalism" isn't "ended capitalism, and replaced it with communism", nor "ended capitalism overnight without a plan to replace it").
Second, because the proposal in this case is to not actively enact a radically disruptive change to society.

The logic of the point you're making is reasonable, but the parallel has a bunch of baggage that reduces overall clarity IMO.

...because there are much more effective ways of addressing climate change than starting a communist revolution...

This part isn't even a parallel: here, even if it succeeded, the communist revolution wouldn't be the most effective approach, whereas a sufficiently good pause, if successful, would be.

The specific conversation is much better than nothing - but I do think it ought to be emphasized that solving all the problems we're aware of isn't sufficient for safety. We're training on the test set.[1]
Our confidence levels should reflect that - but I expect overconfidence.

It's plausible that RSPs could be net positive, but I think that, given successful coordination, [vague and uncertain] beats [significantly more concrete, but overconfident].
My presumption is that without good coordination (a necessary condition being cautious decision-makers), things will go badly.

RSPs seem likely to increase the odds we get some international coordination and regulation. But to get sufficient regulation, we'd need the unknown unknowns issue to be covered at some point. To me this seems simplest to add clearly and explicitly from the beginning. Otherwise I expect regulation to adapt to issues for which we have concrete new evidence, and to fail to adapt beyond that.

Granted that you're not the median voter/decision-maker - but you're certainly one of the most, if not the most, influential voices on the issue. It seems important not to underestimate your capacity to change people's views before figuring out a compromise to aim for (I'm primarily thinking of government people, who seem more likely to have views that might change radically based on a few conversations). But I'm certainly no expert on this kind of thing.

  1. ^

    I do wonder whether it might be helpful not to share all known problems publicly on this basis - I'd have somewhat more confidence in safety measures that succeeded in solving some problems of a type the designers didn't know about.

Thanks for writing this.

I'd be interested in your view on the comments made on Evan's RSP post w.r.t. unknown unknowns. I think aysja put it best in this comment. It seems important to shift the burden of proof.

Would you consider "an unknown unknown causes a catastrophe" to be a "concrete way in which they fail to manage risk"? Concrete or not, this seems sufficient grounds to stop, unless there's a clear argument that a bit more scaling actually helps for safety. (I'd be interested in your take on that - e.g. on what speed boost you might expect with your own research, given AI assistants of some level)

By default, I don't expect the "affirmative case for safety that will require novel science" to be sufficient if it ends up looking like "We developed state of the art tools that address all known problems, and we don't expect others".

On the name, it's not 'responsible' that bothers me, but rather 'scaling'.
"Responsible x-ing policy" gives the strong impression that x-ing can be done responsibly, and that x-ing will continue. I'd prefer e.g. "Responsible training and deployment policy". That way scaling isn't a baked in presumption, and we're naming things that we know can be done responsibly.

This is not what most people mean by "for personal gain". (I'm not disputing that Alice gets personal gain)

Insofar as the influence is required for altruistic ends, aiming for it doesn't imply aiming for personal gain.
Insofar as the influence is not required for altruistic ends, we have no basis to believe Alice was aiming for it.

"You're just doing that for personal gain!" is not generally taken to mean that you may be genuinely doing your best to create a better world for everyone, as you see it, in a way that many would broadly endorse.

In this context, an appropriate standard is the post's own:
Does this "predictably lead people to believe false things"?
Yes, it does. (if they believe it)

"Lying for personal gain" is a predictably misleading description, unless much stronger claims are being made about motivation (and I don't think there's sufficient evidence to back those up).

The "lying" part I can mostly go along with. (though based on a contextual 'duty' to speak out when it's unusually important; and I think I'd still want to label the two situations differently: [not speaking out] and [explicitly lying] may both be undesirable, but they're not the same thing)
(I don't really think in terms of duties, but it's a reasonable shorthand here)

Ah okay - thanks. That's clarifying.

Agreed that the post is at the very least not clear.
In particular, it's obviously not true that [if we don't stop today, there's more than a 10% chance we all die], and I don't think [if we never stop, under any circumstances...] is a case many people would be considering at all.

It'd make sense to be much clearer on the 'this' that "many people believe".

(and I hope you're correct on P(doom)!)

[I agree with most of this, and think it's a very useful comment; just pointing out disagreements]

For instance, I think that well implemented RSPs required by a regulatory agency can reduce risk to <5% (partially by stopping in worlds where this appears needed).

I assume this would be a crux with Connor/Gabe (and I think I'm at least much less confident in this than you appear to be).

  • We're already in a world where stopping appears necessary.
  • It's entirely possible we all die before stopping was clearly necessary.
  • What gives you confidence that RSPs would actually trigger a pause?
    • If a lab is stopping for reasons that aren't based on objective conditions in an RSP, then what did the RSP achieve?
    • Absent objective tests that everyone has signed up for, a lab may well not stop, since there'll always be the argument "Well we think that the danger is somewhat high, but it doesn't help if only we pause".
    • It's far from clear that we'll get objective and sufficient conditions for safety (or even for low risk). I don't expect us to - though it'd obviously be nice to be wrong.
      • [EDIT: or rather, ones that allow scaling to continue safely - we already know sufficient conditions for safety: stopping]

Calling something a "pragmatic middle ground" doesn't imply that there aren't better options

I think the objection here is more about what is loosely suggested by the language used, and what is not said - not about logical implications. What is loosely suggested by the ARC Evals language is that it's not sensible to aim for the more "extreme" end of things (pausing), and that this isn't worthy of argument.

Perhaps ARC Evals have a great argument, but they don't make one. I think it's fair to say that they argue the middle ground is practical. I don't think they can be said to argue that it's pragmatic until they address both the viability of other options and the risks of the various courses. Doing a practical thing that would predictably lead to higher risk is not pragmatic.

It's not clear what the right course is here, but making no substantive argument gives a completely incorrect impression. If they didn't think it was the right place for such an argument, it would have been easy to say so: that this is a complex question, that it's unclear this course is best, and that [RSPs vs Pause vs ...] deserves a lot more analysis.

The post presupposes a communication/advocacy norm and states violations of this norm should be labeled "lying". I'm not sure I'm sold on this communication norm in the first place.

I'd agree with that, but I do think that in this case it'd be useful for people/orgs to state both a [here's what we'd like ideally] and a [here's what we're currently pushing for]. I can imagine many cases where this wouldn't hold, but I don't see the argument here. If there is an argument, I'd like to hear it! (fine if it's conditional on not being communicated further)

I agree with most of this, but I think the "Let me call this for what it is: lying for personal gain" section is silly and doesn't help your case.

The only sense in which it's clear that it's "for personal gain" is that it's lying to get what you want.
Sure, I'm with you that far - but if what someone wants is [a wonderful future for everyone], then that's hardly what most people would describe as "for personal gain".
By this logic, any instrumental action taken towards an altruistic goal would be "for personal gain".

That's just silly.
It's unhelpful too, since it gives people a somewhat legitimate reason to dismiss the broader point.

Of course it's possible that the longer-term altruistic goal is just a rationalization, and people are after power for its own sake, but I don't buy that this is often true - at least not in any clean [they're doing this and only this] sense. (one could have similar altruistic-goal-is-rationalization suspicions about your actions too)

In many cases, I think overconfidence is sufficient explanation.
And if we get into "Ah, but isn't it interesting that this overconfidence leads to power gain", then I'd agree - but then I claim that you should distinguish [conscious motivations] from [motivations we might infer by looking at the human as a whole, deep shadowy subconscious included]. If you're pointing at the latter, please make that clear. (and we might also ask "What actions are not for personal gain, in this sense?")

Again, entirely with you on the rest.
I'm not against accusations that may hurt feelings - but I think that more precision would be preferable here.

Thanks, this seems very reasonable. I'd missed your other comment.
(Oh and I edited my previous comment for clarity: I guess you were disagreeing with my clumsily misleading wording, rather than what I meant(??))

This is clarifying, thanks.

A few thoughts:

  • "Serial speed is key":
    • This makes sense, but seems to rely on the human spending most of their time tackling well-defined but non-trivial problems where an AI doesn't need to be re-directed frequently [EDIT: the preceding was poorly worded - I meant that if prior to the availability of AI assistants this were true, it'd allow a lot of speedup as the AIs take over this work; otherwise it's less clearly so helpful].
      Perhaps this is true for ARC - that's encouraging (though it does again make me wonder why they don't employ more mathematicians - surely not all the problems are serial on a single critical path?).
      I'd guess it's less often true for MIRI and John.
      • Of course once there's a large speedup of certain methods, the most efficient methodology would look different. I agree that 5x to 10x doesn't seem implausible.
  • "In the future we'll have access to the exact AIs we're worried about.":
    • We'll have access to the ones we're worried about deploying.
    • We won't have access to the ones we're worried about training until we're training them.
    • I do buy that this makes safety work for that level of AI more straightforward - assuming we're not already dead. I expect most of the value is in what it tells us about a more general solution, if anything - similarly for model organisms. I suppose it does seem plausible that this is the first level we see a qualitatively different kind of general reasoning/reflection that leads us in new theoretical directions. (though I note that this makes [this is useful to study] correlate strongly with [this is dangerous to train])
  • "Researching how to make trustworthy human level AIs seems much more tractable than researching how to align wildly superhuman systems":
    • This isn't clear to me. I'd guess that the same fundamental understanding is required for both. "trustworthy" seems superficially easier than "aligned", but that's not obvious in a general context.
      I'd expect that implementing the trustworthy human-level version would be a lower bar - but that the same understanding would show us what conditions would need to obtain in either case. (certainly I'm all for people looking for an easier path to the human-level version, if this can be done safely - I'd just be somewhat surprised if we find one)
  • "So coordination to do better than this would be great".
    • I'd be curious to know what you'd want to aim for here - both in a mostly ideal world, and what seems most expedient.