Wiki Contributions


Pivotal outcomes and pivotal processes

Sure, that's true - but in that case the entire argument should be put in terms of:
We can (aim to) implement a pivotal process before a unilateral AGI-assisted pivotal act is possible.

And I imagine the issue there would all be around the feasibility of implementation. I think I'd give a Manhattan project to solve the technical problem much higher chances than a pivotal process. (of course people should think about it - I just don't expect them to come up with anything viable)

Once it's possible, the attitude of the creating org before interacting with their AGI is likely to be irrelevant.

So e.g. this just seems silly to me:

So, thankfully-according-to-me, no currently-successful AGI labs are oriented on carrying out pivotal acts, at least not all on their own.

They won't be on their own: they'll have an AGI to set them straight on what will/won't work.

Pivotal outcomes and pivotal processes

This still seems to somewhat miss the point (as I pointed out last time):
Conditional on org X having an aligned / corrigible AGI, we should expect:

  1. If the AGI is an aligned sovereign, it'll do the pivotal act (PA) unilaterally if that's best, and do it in distributed fashion if that's best (according to whatever it's aligned to).
  2. If the AGI is more like a corrigible tool, we should expect X to ask 'their' AGI what would be best to do (or equivalent), and we're pretty-much back to case 1.

The question isn't what the humans in X would do, but what the [AGI + humans] would do, given that the humans have access to that AGI.

If org X is initially pro-unilateral-PAs, then we should expect an aligned AGI to talk them out of it if it's not best.
If org X is initially anti-unilateral-PAs, then we should expect an aligned AGI to talk them into it if it is best.

X will only be favouring/disfavouring PAs for instrumental reasons - and we should expect the AGI to correct them as appropriate.

For these reasons, I'd expect the initial attitude of org X to be largely irrelevant.
Since this is predictable, I don't expect it to impact race dynamics: what will matter is whether the unilateral PA seems more/less likely to succeed than the distributed approach to the AGI.

Optimal play in human-judged Debate usually won't answer your question

Well I'm sure I could have been clearer. (and it's possible that I'm now characterising what I think, rather than what I wrote)

But getting that impression is pretty natural: in my argument, a large part of the problem does come from its sometimes being correct to pick the question-ignoring answer. ('correct' meaning something like: [leads to best consequences, according to our values])
Or alternatively, that a correct decision algorithm would sometimes pick the question-ignoring answer.

I think I focus on this, since it's the non-obvious part of the argument: it's already clear that poor decisions / decision-algorithms may sometimes pick the question-ignoring answer.

Probably I should have emphasized more that unexpected behaviour when things are going right will make it harder to know when things are going wrong.

Optimal play in human-judged Debate usually won't answer your question

...the human can just use both answers in whichever way it wants, independently of which it selects as the correct answer...
I don't think you disagreed with this?

Yes, agreed.

A few points on the rest:

  1. At the highest level, the core issue is that QI makes it quite a bit harder to identify misalignment. If aligned systems will sometimes not answer the question, non-answering isn't necessarily strong evidence of misalignment.
    So "consequentialist judges will [sometimes correctly] select QIAs" is bad in the sense that it provides cover for "consequentialist judges will [sometimes incorrectly] select QIAs".
  2. I talk about consequentialists, but not rational consequentialists. I expect the kind of judge we'd pick to be highly rational relative to the average human - but that's a low bar. I expect all humans to have exploitable inconsistencies, and that optimal play will exploit them. (similarly for groups of humans)
    So yes, this is only a problem where manipulation is possible - but since it is possible, we'll have difficulty distinguishing [judge correctly selected a non-answer as the winner for principled reasons] from [judge was manipulated...].
    1. It's much less clear when such issues show up with sub-optimal play.
  3. With "Is this definitely undesirable? I'm not sure, but probably." I'm referring to the debate structure's having this property in general being undesirable. The judge can know it is undesirable in general, but also think that in this specific case things are different - and, of course, the judge can be wrong about this.
    1. Noting here that humans can't make binding pre-commitments. (saying words doesn't qualify)
  4. It's hard (/impossible?) to avoid this issue through oversight, since we just move from [QI exceptions that persuade the judge win] to [QI exceptions that persuade the judge and oversight system win].

“Pivotal Act” Intentions: Negative Consequences and Fallacious Arguments

This mostly seems to be an argument for: "It'd be nice if no pivotal act is necessary", but I don't think anyone disagrees with that.

As for "Should an AGI company be doing this?" the obvious answer is "It depends on the situation". It's clearly nice if it's not necessary. Similarly, if [the world does the enforcement] has higher odds of success than [the AGI org does the enforcement] then it's clearly preferable - but it's not clear that would be the case.

I think it's rather missing the point to call it a "pivotal act philosophy", as if anyone values pivotal acts for their own sake. Some people just think they're plausibly necessary - as are many unpleasant and undesirable acts. Obviously this doesn't imply they should be treated lightly, or that the full range of more palatable options shouldn't be carefully considered.

I don't buy that an intention to perform pivotal acts is a significant race-dynamic factor: incentives to race seem over-determined already. If we could stop the existing race, I imagine most pivotal-act advocates would think a pivotal act much less likely to be necessary.

Depending on the form an aligned AGI takes, it's also not clear that the developing organisation gets to decide/control what it does. Given that special-casing avoidance of every negative side-effect is a non-starter, an aligned AGI will likely need a very general avoids-negative-side-effects mechanism. It's not clear to me that an aligned AGI that knowingly permits significant avoidable existential risk (without some huge compensatory upside) is a coherent concept.

If you're allowing a [the end of the world] side-effect, what exactly are you avoiding, and on what basis? As soon as your AGI takes on any large-scale long-term task, then [the end of the world] is likely to lead to a poor outcome on that task, and [prevent the end of the world] becomes an instrumental goal.

Forms of AGI that just do the pivotal act, whatever the creators might think about it, are at least plausible.
I assume this will be an obvious possibility for other labs to consider in planning.

Takeoff speeds have a huge effect on what it means to work on AI x-risk

Examples would be interesting, certainly. Concerning the post's point, I'd say the relevant claim is that [type of alignment research that'll be increasingly done in slow takeoff scenarios] is already being done by non x-risk motivated people.

I guess the hope is that at some point there are clear-to-everyone problems with no hacky solutions, so that incentives align to look for fundamental fixes - but I wouldn't want to rely on this.

Refine: An Incubator for Conceptual Alignment Research Bets

Wholeheartedly agree, and I think it's great that you're doing this.
I'll be very interested in what you learn along the way w.r.t. more/less effective processes.

(Bonus points for referencing the art of game design - one of my favourite books.)

Truthfulness, standards and credibility

Thanks. A few thoughts:

  • It is almost certainly too long. Could use editing/distillation/executive-summary. I erred on the side of leaving more in, since the audience I'm most concerned with are those who're actively working in this area (though for them there's a bit much statement-of-the-obvious, I imagine).
  • I don't think most of it is new, or news to the authors: they focused on the narrow version for a reason. The only part that could be seen as a direct critique is the downside risks section: I do think their argument is too narrow.
  • As it relates to Truthful AI, much of the rest can be seen in terms of "Truthfulness amplification doesn't bridge the gap". Here again, I doubt the authors would disagree. They never claim that it would, just that it expands the scope - that's undeniably true.
  • On being net-positive below a certain threshold, I'd make a few observations:
    • For the near-term, this post only really argues that the Truthful AI case for positive impact is insufficient (not broad enough). I don't think I've made a strong case that the output would be net negative, just that it's a plausible outcome (it'd be my bet for most standards in most contexts).
    • I do think such standards would be useful in some sense for very near future AIs - those that are not capable of hard-to-detect manipulation. However, I'm not sure eliminating falsehoods there is helpful overall: it likely reduces immediate harm a little, but risks giving users the false impression that AIs won't try to mislead them. If the first misleading AIs are undetectably misleading, that's not good.
    • Some of the issues are less clearly applicable in a CAIS-like setup, but others seem pretty fundamental: e.g. that what we care about is something like [change in accuracy of beliefs] not [accuracy of statement]. The "all models are wrong" issue doesn't go away. If you're making determinations in the wrong language game, you're going to make errors.
  • Worth emphasizing that "...and this path requires something like intent alignment" isn't really a critique. That's the element of Truthfulness research I think could be promising - looking at concepts in the vicinity of intent alignment from another angle. I just don't expect standards that fall short of this to do much that's useful, or to shed much light on the fundamentals.
  • ...but I may be wrong!

Productive Mistakes, Not Perfect Answers

For sure I agree that the researcher knowing these things is a good start - so getting as many potential researchers as possible to grok these things is important.

My question is about which ideas researchers should focus on generating/elaborating given that they understand these things. We presumably don't want to restrict thinking to ideas that may overcome all these issues - since we want to use ideas that fail in some respects, but have some aspect that turns out to be useful.

Generating a broad variety of new ideas is great, and we don't want to be too quick in throwing out those that miss the target. The thing I'm unclear about is something like:

What target(s) do I aim for if I want to generate the set of ideas with greatest value?

I don't think that "Aim for full alignment solution" is the right target here.
I also don't think that "Aim for wacky long-shots" is the right target - and of course I realize that Adam isn't suggesting this.
(we might find ideas that look like wacky long-shots from outside, but we shouldn't be aiming for wacky long-shots)

But I don't have a clear sense of what target I would aim for (or what process I'd use, what environment I'd set up, what kind of people I'd involve...), if my goal were specifically to generate promising ideas (rather than to work on them long-term, or to generate ideas that I could productively work on).

Another disanalogy with previous research/invention... is that we need to solve this particular problem. So in some sense a history of:
[initially garbage-looking-idea] ---> [important research problem solved] may not be relevant.

What we need is: [initially garbage-looking-idea generated as attempt to solve x] ---> [x was solved]
It's not good enough if we find ideas that are useful for something, they need to be useful for this.

I expect the kinds of processes that work well to look different from those used where there's no fixed problem.

Productive Mistakes, Not Perfect Answers

Mostly I'd agree with this, but I think there needs to be a bit of caution and balance around:

How do we get more streams of evidence? By making productive mistakes. By attempting to leverage weird analogies and connections, and iterating on them. We should obviously recognize that most of this will be garbage, but you’ll be surprised how many brilliant ideas in the history of science first looked like, or were, garbage.

Do we want variety? Absolutely: worlds where things work out well likely correlate strongly with finding a variety of approaches.

However, there's some risk in Do(increase variety). The ideal is that we get many researchers thinking about the problem in a principled way, and variety happens. If we intentionally push too much for variety, we may end up with a lot of wacky approaches that abandoned too much principled thinking too early. (I think I've been guilty of this at times)

That said, I fully agree with the goal of finding a variety of approaches. It's just rather less clear to me how much an individual researcher should be thinking in terms of boosting variety. (it's very clear that there should be spaces that provide support for finding different approaches, so I'm entirely behind that; currently it's much more straightforward to work on existing ideas than to work on genuinely new ones)

Certainly many great ideas initially looked like garbage - but I'll wager a lot of garbage initially looked like garbage too. I'd be interested in knowing more about the hidden-greatness-garbage: did it tend to have any common recognisable qualities at the time? Did it tend to emerge from processes with common recognisable qualities? In environments with shared qualities?...
