

Morality is Scary

Ah, I was just being an idiot on the bargaining system w.r.t. small numbers of people being able to hold it to ransom. Oops. Agreed that more majority power isn't desirable.
[re iteration, I only meant that the bargaining could become iterated if the initial bargaining result were to decide upon iteration (to include more future users). I now don't think this is particularly significant.]

I think my remaining uncertainty (/confusion) is all related to the issue I first mentioned (embedded copy experiences). It strikes me that something like this can also happen where minds grow/merge/overlap.

This operator will declare both the manifesting and evaluation of the source codes of other users to be "out of scope" for a given user. Hence, a preference of  to observe the suffering of  would be "satisfied" by observing nearly anything, since the maximization can interpret anything as a simulation of .

Does this avoid the problem if 's preferences use indirection? It seems to me that a robust pointer to  may be enough: with a robust pointer, it may be possible to require something like source-code access implicitly, without explicitly referencing it. E.g. where  has a preference to "experience  suffering in circumstances where there's strong evidence it's actually  suffering, given that these circumstances were the outcome of this bargaining process".

If  can't robustly specify things like this, then I'd guess there'd be significant trouble in specifying quite a few (mutually) desirable situations involving other users too. IIUC, this would only be a problem for the denosed bargaining's search for a good : for the second bargaining on the true utility functions there's no need to put anything "out of scope" (right?), so win-wins are easily achieved.

Considerations on interaction between AI and expected value of the future

I agree with most of this. Not sure how much moral weight I'd put on "a computation structured like an agent" - some, but it's mostly coming from [I might be wrong] rather than [I think agentness implies moral weight].

Agreed that Malthusian dynamics give you an evolution-like situation - but I'd guess it's too late for it to matter: once you're already generally intelligent, can think your way to the convergent instrumental goal of self-preservation, and can self-modify, it's not clear to me that consciousness/pleasure/pain buys you anything.

Heuristics are sure to be useful as shortcuts, but I'm not sure I'd want to analogise those to qualia (??? presumably the right kind would be - but I suppose I don't expect the right kind by default).

The possibilities for signalling will also be nothing like those in a historical evolutionary setting - the utility of emotional affect doesn't seem to be present (once the humans are gone).
[these are just my immediate thoughts; I could easily be wrong]

I agree with its being likely that most vertebrates are moral patients.

Overall, I can't rule out AIs becoming moral patients - and it's clearly possible.
I just don't yet see positive reasons to think it has significant probability (unless aimed for explicitly).

Considerations on interaction between AI and expected value of the future

Thanks, that's interesting, though mostly I'm not buying it (still unclear whether there's a good case to be made; fairly clear that he's not making a good case).

  1. Most of it seems to say "Being a subroutine doesn't imply something doesn't suffer". That's fine, but few positive arguments are made. Starting with the letter 'h' doesn't imply something doesn't suffer either - but it'd be strange to say "Humans obviously suffer, so why not houses, hills and hiccups?".
  2. We infer preference from experience of suffering/joy...:
    [Joe Xs when he might not X] & [Joe experiences suffering and joy] -> [Joe prefers Xing]
    [this rock is Xing] -> [this rock Xs]
    Methinks someone is petitioning a principii.
    (Joe is mechanistic too - but the suffering/joy being part of that mechanism is what gets us to call it "preference")
  3. Too much is conflated:

    In particular, I can aim to x and not care whether I succeed. Not achieving an aim doesn't imply frustration or suffering in general - we just happen to be wired that way (but it's not universal, even for humans: we can try something whimsical-yet-goal-directed, and experience no suffering/frustration when it doesn't work). [taboo/disambiguate 'aim' if necessary]
    1. There's no argument made for frustration/satisfaction. It's just assumed that not achieving a goal is frustrating, and that achieving one is satisfying. A case can be made for ascribing intentionality to many systems - e.g. Dennett's intentional stance. Ascribing welfare is a further step, and requires further arguments.
      Non-achievement of an aim isn't inherently frustrating (cf. Buddhists - and indeed current robots).
      1. The only argument I saw on this was "we can sum over possible interpretations" - sure, but I can do that for hiccups too.
Morality is Scary

Sure, I'm not sure there's a viable alternative either. This kind of approach seems promising - but I want to better understand any downsides.

My worry wasn't about the initial 10%, but about the possibility of the process being iterated such that you end up with almost all bargaining power in the hands of power-keepers.

In retrospect, this is probably silly: if there's a designable-by-us mechanism that better achieves what we want, the first bargaining iteration should find it. If not, then what I'm gesturing at must either be incoherent, or not endorsed by the 10% - so hard-coding it into the initial mechanism wouldn't get the buy-in of the 10% to the extent that they understood the mechanism.

In the end, I think my concern is that we won't get buy-in from a large majority of users:
In order to accommodate some proportion with odd moral views it seems likely you'll be throwing away huge amounts of expected value in others' views - if I'm correctly interpreting your proposal (please correct me if I'm confused).

Is this where you'd want to apply amplified denosing?
So, rather than filtering out the undesirable , for these  you use:

        [i.e. ignoring y and imagining it's optimal]

However, it's not clear to me how we'd decide who gets strong denosing (clearly not everyone, or we don't pick a ). E.g. if you strong-denose anyone who's too willing to allow bargaining failure [everyone dies] you might end up filtering out altruists who worry about suffering risks.
Does that make sense?

Some thoughts on why adversarial training might be useful

This seems generally clear and useful. Thumbs-up.

(2) could probably use a bit more "...may...", "...sometimes..." for clarity. You can't fool all of the models all of the time etc. etc.

Considerations on interaction between AI and expected value of the future

Sure, that's possible (and if so I agree it'd be importantly dystopic) - but do you see a reason to expect it?
It's not something I've thought about a great deal, but my current guess is that you probably don't get moral patients without aiming for them (or by using training incentives much closer to evolution than I'd expect).

Considerations on interaction between AI and expected value of the future

My sense is that most would-be dystopian scenarios lead to extinction fairly quickly. In most Malthusian situations, ruthless power struggles... humans would be a fitness liability that gets optimised away.

The way this doesn't happen is if we have AIs with human-extinction-avoiding constraints: some kind of alignment (perhaps incomplete/broken).

I don't think it makes much sense to reason further than this without making a guess at what those constraints may look like. If there aren't constraints, we're dead. If there are, then those constraints determine the rules of the game.

Morality is Scary

Ah - that's cool if IB physicalism might address this kind of thing (still on my to-read list).

Agreed that the subjoe thing isn't directly a problem. My worry is mainly whether it's harder to rule out  experiencing a simulation of , since sub isn't a user. However, if you can avoid the suffering s by limiting access to information, the same should presumably work for relevant sub-s.

If existing people want future people to be treated well, then they have nothing to worry about since this preference is part of the existing people's utility functions.

This isn't so clear (to me at least) if:

  1. Most, but not all current users want future people to be treated well.
  2. Part of being "treated well" includes being involved in an ongoing bargaining process which decides the AI's/future's trajectory.

For instance, suppose initially 90% of people would like to have an iterated bargaining process that includes future (trans/post)humans as users, once they exist. The other 10% are only willing to accept such a situation if they maintain their bargaining power in future iterations (by whatever mechanism).

If you iterate this process, the bargaining process ends up dominated by users who won't relinquish any power to future users. 90% of initial users might prefer drift over lock-in, but we get lock-in regardless (the disagreement point also amounting to lock-in).

Unless I'm confusing myself, this kind of thing seems like a problem (not in terms of reaching some non-terrible lower bound, but in terms of realising potential).
Wherever there's this kind of asymmetry/degradation over bargaining iterations, I think there's an argument for building in a way to avoid it from the start - since anything short of 100% just limits to 0 over time. [it's by no means clear that we do want to make future people users on an equal footing to today's people; it just seems to me that we have to do it at step zero or not at all]
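A toy model of the "anything short of 100% limits to 0" point, as a hypothetical sketch: assume a fixed fraction `keeper_frac` of every cohort are power-keepers, and that at each iteration the other users hand a fraction `handover` of their power to newly added future users, who split in the same proportion. Both parameters and the update rule are my own simplifying assumptions, not part of the proposal. Under them, the relinquishers' share shrinks by a factor (1 - keeper_frac * handover) per round, so it tends to 0 unless keeper_frac is exactly 0.

```python
def keeper_share(rounds: int, keeper_frac: float = 0.1, handover: float = 0.5) -> float:
    """Hypothetical toy model: share of total bargaining power held by
    power-keepers after `rounds` bargaining iterations."""
    keepers = keeper_frac  # initial users split keeper_frac / (1 - keeper_frac)
    for _ in range(rounds):
        relinquishers = 1.0 - keepers
        # Power ceded by relinquishers to newly added future users.
        handed = handover * relinquishers
        # New users split like the original population: the keepers among them
        # never give their share back; the rest rejoin the relinquishing pool.
        keepers += keeper_frac * handed
    return keepers


for t in (0, 1, 10, 100):
    print(t, round(keeper_share(t), 4))
```

With the defaults, `keeper_share(1)` is about 0.145 and the share approaches 1 as rounds increase; with `keeper_frac=0.0` the keepers' share stays at 0 however many rounds are run.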

Morality is Scary

This is very interesting (and "denosing operator" is delightful).

Some thoughts:

If I understand correctly, I think there can still be a problem where user  wants an experience history such that part of the history is isomorphic to a simulation of user  suffering ( wants to fully experience  suffering in every detail).

Here a fixed  may entail some fixed  for (some copy of) some .

It seems the above approach can't then avoid leaving one of  or  badly off:
If  is permitted to freely determine the experience of the embedded  copy, the disagreement point in the second bargaining will bake this in:  may be horrified to see that  wants to experience its copy suffer, but will be powerless to stop it (if  won't budge in the bargaining).
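To make this first horn concrete, here is a hypothetical sketch assuming a Nash-style bargaining rule (maximise the product of each user's gain over the disagreement point, among outcomes no worse than it for either user). The outcome names and utility numbers are invented for illustration, and the two users are simply called "embedder" and "embedded". Since no user accepts less than their disagreement utility, once the disagreement point lets the embedder determine the copy's experience, the outcome that spares the copy is infeasible.

```python
# Hypothetical utilities (u_embedder, u_embedded) for three candidate outcomes.
OUTCOMES = {
    "copy_suffers": (1.0, 0.0),
    "compromise": (0.9, 0.2),
    "copy_spared": (0.3, 1.0),
}


def nash_bargain(outcomes, disagreement):
    """Return the outcome maximising the Nash product, or None if no outcome
    is at least as good as the disagreement point for both users."""
    d_a, d_b = disagreement
    # Individual rationality: neither user accepts less than disagreement.
    feasible = {
        name: (u_a, u_b)
        for name, (u_a, u_b) in outcomes.items()
        if u_a >= d_a and u_b >= d_b
    }
    if not feasible:
        return None
    return max(
        feasible,
        key=lambda name: (feasible[name][0] - d_a) * (feasible[name][1] - d_b),
    )


# Disagreement point with the copy's suffering baked in vs. a neutral one.
print(nash_bargain(OUTCOMES, (0.8, 0.0)))  # compromise
print(nash_bargain(OUTCOMES, (0.0, 0.0)))  # copy_spared
```

If the "compromise" option is removed (the embedder won't budge), the baked-in disagreement point leaves "copy_suffers" as the only feasible outcome.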

Conversely, if the embedded  is treated as a user which  will imagine is exactly to 's liking, but who actually gets what  wants, then the selected  will be horrible for  (e.g. perhaps  wants to fully experience Hitler suffering, and instead gets to fully experience Hitler's wildest fantasies being realized).

I don't think it's possible to do anything like denosing to avoid this.

It may seem like this isn't a practical problem, since we could reasonably disallow such embedding. However, I think that's still tricky since there's a less exotic version of the issue: my experiences likely already are a collection of subagents' experiences. Presumably my maximisation over  is permitted to determine all the .

It's hard to see how you draw a principled line here: the ideal future for most people may easily be transhumanist to the point where today's users are tomorrow's subpersonalities (and beyond).

A case that may have to be ruled out separately is where  wants to become a suffering . Depending on what I consider 'me', I might be entirely fine with it if 'I' wake up tomorrow as suffering  (if I'm done living and think  deserves to suffer).
Or perhaps I want to clone myself  times, and then have all copies convert themselves to suffering s after a while. [in general, it seems there has to be some mechanism to distribute resources reasonably - but it's not entirely clear what that should be]

Formalizing Policy-Modification Corrigibility

Oh and of course your non-obstruction does much better at capturing what we care about.
It's not yet clear to me whether some adapted version of  gets at something independently useful. Maybe.

[I realize that you're aiming to get at something different here - but so far I'm not clear on a context where I'd be interested in  as more than a curiosity]
