This still seems to somewhat miss the point (as I pointed out last time):
Conditional on org X having an aligned / corrigible AGI, we should expect:
The question isn't what the humans in X would do, but what the [AGI + humans] would do, given that the humans have access to that AGI.
If org X is initially pro-unilateral-PAs, then we should expect an aligned AGI to talk them out of it if it's not best.
If org X is initially anti-unilateral-PAs, then we should expect an aligned AGI to talk them into it if it is best.
X will only be favouring/disfavouring PAs for instrumental reasons - and we should expect the AGI to correct them as appropriate.
For these reasons, I'd expect the initial attitude of org X to be largely irrelevant.
Since this is predictable, I don't expect it to impact race dynamics: what will matter is whether, to the AGI, the unilateral PA seems more/less likely to succeed than the distributed approach.
Well, I'm sure I could have been clearer. (and it's possible that I'm now characterising what I think, rather than what I wrote)
But getting that impression is pretty natural: in my argument, a large part of the problem does come from its sometimes being correct to pick the question-ignoring answer. ('correct' meaning something like: [leads to best consequences, according to our values])
Or alternatively, that a correct decision algorithm would sometimes pick the question-ignoring answer.
I think I focus on this, since it's the non-obvious part of the argument: it's already clear that poor decisions / decision-algorithms may sometimes pick the question-ignoring answer.
Probably I should have emphasized more that unexpected behaviour when things are going right will make it harder to know when things are going wrong.
...the human can just use both answers in whichever way it wants, independently of which it selects as the correct answer...
I don't think you disagreed with this?
Yes, agreed.
A few points on the rest:
This mostly seems to be an argument for: "It'd be nice if no pivotal act is necessary", but I don't think anyone disagrees with that.
As for "Should an AGI company be doing this?", the obvious answer is "It depends on the situation". It's clearly nice if it's not necessary. Similarly, if [the world does the enforcement] has higher odds of success than [the AGI org does the enforcement], then it's clearly preferable - but it's not clear that would be the case.
I think it's rather missing the point to call it a "pivotal act philosophy", as if anyone values pivotal acts for their own sake. Some people just think they're plausibly necessary - as are many unpleasant and undesirable acts. Obviously this doesn't imply they should be treated lightly, or that the full range of more palatable options shouldn't be carefully considered.
I don't buy that an intention to perform pivotal acts is a significant race-dynamic factor: incentives to race seem over-determined already. If we could stop the existing race, I imagine most pivotal-act advocates would consider a pivotal act much less likely to be necessary.
Depending on the form an aligned AGI takes, it's also not clear that the developing organisation gets to decide/control what it does. Given that special-casing avoidance of every negative side-effect is a non-starter, an aligned AGI will likely need a very general avoids-negative-side-effects mechanism. It's not clear to me that an aligned AGI that knowingly permits significant avoidable existential risk (without some huge compensatory upside) is a coherent concept.
If you're allowing a [the end of the world] side-effect, what exactly are you avoiding, and on what basis? As soon as your AGI takes on any large-scale long-term task, then [the end of the world] is likely to lead to a poor outcome on that task, and [prevent the end of the world] becomes an instrumental goal.
Forms of AGI that just do the pivotal act, whatever the creators might think about it, are at least plausible.
I assume this will be an obvious possibility for other labs to consider in planning.
Examples would be interesting, certainly. Concerning the post's point, I'd say the relevant claim is that [type of alignment research that'll be increasingly done in slow takeoff scenarios] is already being done by non-x-risk-motivated people.
I guess the hope is that at some point there are clear-to-everyone problems with no hacky solutions, so that incentives align to look for fundamental fixes - but I wouldn't want to rely on this.
Wholeheartedly agree, and I think it's great that you're doing this.
I'll be very interested in what you learn along the way w.r.t. more/less effective processes.
(Bonus points for referencing the art of game design - one of my favourite books.)
Thanks. A few thoughts:
For sure I agree that the researcher knowing these things is a good start - so getting as many potential researchers as possible to grok these things is important.
My question is about which ideas researchers should focus on generating/elaborating given that they understand these things. We presumably don't want to restrict thinking to ideas that may overcome all these issues - since we want to use ideas that fail in some respects, but have some aspect that turns out to be useful.
Generating a broad variety of new ideas is great, and we don't want to be too quick in throwing out those that miss the target. The thing I'm unclear about is something like:
What target(s) do I aim for if I want to generate the set of ideas with greatest value?
I don't think that "Aim for full alignment solution" is the right target here.
I also don't think that "Aim for wacky long-shots" is the right target - and of course I realize that Adam isn't suggesting this.
(we might find ideas that look like wacky long-shots from outside, but we shouldn't be aiming for wacky long-shots)
But I don't have a clear sense of what target I would aim for (or what process I'd use, what environment I'd set up, what kind of people I'd involve...), if my goal were specifically to generate promising ideas (rather than to work on them long-term, or to generate ideas that I could productively work on).
Another disanalogy with previous research/invention... is that we need to solve this particular problem. So in some sense a history of:
[initially garbage-looking-idea] ---> [important research problem solved] may not be relevant.
What we need is: [initially garbage-looking-idea generated as an attempt to solve x] ---> [x was solved].
It's not good enough if we find ideas that are useful for something; they need to be useful for this.
I expect the kinds of processes that work well to look different from those used where there's no fixed problem.
Mostly I'd agree with this, but I think there needs to be a bit of caution and balance around:
How do we get more streams of evidence? By making productive mistakes. By attempting to leverage weird analogies and connections, and iterating on them. We should obviously recognize that most of this will be garbage, but you’ll be surprised how many brilliant ideas in the history of science first looked like, or were, garbage.
Do we want variety? Absolutely: worlds where things work out well likely correlate strongly with finding a variety of approaches.
However, there's some risk in Do(increase variety). The ideal is that we get many researchers thinking about the problem in a principled way, and variety happens. If we intentionally push too much for variety, we may end up with a lot of wacky approaches that abandoned too much principled thinking too early. (I think I've been guilty of this at times)
That said, I fully agree with the goal of finding a variety of approaches. It's just rather less clear to me how much an individual researcher should be thinking in terms of boosting variety. (it's very clear that there should be spaces that provide support for finding different approaches, so I'm entirely behind that; currently it's much more straightforward to work on existing ideas than to work on genuinely new ones)
Certainly many great ideas initially looked like garbage - but I'll wager a lot of garbage initially looked like garbage too. I'd be interested in knowing more about the hidden-greatness-garbage: did it tend to have any common recognisable qualities at the time? Did it tend to emerge from processes with common recognisable qualities? In environments with shared qualities?...
Sure, that's true - but in that case the entire argument should be put in terms of:
We can (aim to) implement a pivotal process before a unilateral AGI-assisted pivotal act is possible.
And I imagine the issues there would all be around feasibility of implementation. I think I'd give a Manhattan project to solve the technical problem much higher chances than a pivotal process. (of course people should think about it - I just don't expect them to come up with anything viable)
Once it's possible, the attitude of the creating org before interacting with their AGI is likely to be irrelevant.
So e.g. this just seems silly to me:
They won't be on their own: they'll have an AGI to set them straight on what will/won't work.