Are you proposing applying this to something potentially prepotent? Or does this come with corrigibility guarantees? If you applied it to a prepotence, I'm pretty sure this would be an extremely bad idea. The actual human utility function (the rules of the game as intended) supports important glitch-like behavior, where cheap tricks can extract enormous amounts of utility, so applying this to general alignment risks foreclosing most of the value that could have existed.

Example 1: Virtual worlds are a weird out-of-distribution part of the human utility function that allows the AI to "cheat" and create impossibly good experiences by cutting the human's senses off from the real world and showing them an illusion. As far as I'm concerned, creating non-deceptive virtual worlds (like, very good video games) is correct behavior and the future would be immeasurably devalued if it were disallowed.

Example 2: I am not a hedonist, but I can't say conclusively that I wouldn't become one (turn out to be one) if I had full knowledge of my preferences and the ability to self-modify, as well as lots of time and safety to reflect, settle my affairs in the world, set aside my pride, and then wirehead. This is a glitchy-looking behavior that allows the AI to extract a much higher yield of utility from each subject by gradually warping them into a shape where they lose touch with most of what we currently call "values", where one value dominates all of the others. If it is incorrect behavior, then sure, the AI shouldn't be allowed to do it; but humans today don't have the kind of self-reflection required to tell whether it's incorrect behavior or not. And if it is correct behavior, forever forbidding it is actually a far more horrifying outcome: what you'd be doing is, in some sense of 'suffering', forever prolonging some amount of suffering. That's fine if humans tolerate and prefer some amount of suffering, but we aren't sure of that yet.

I've noticed that the word "stipulation" is a pretty good word for the category of claims that become true when we decide they are true. It's probably better to broaden its connotations to encompass self-fulfilling prophecies than to coin some other word, or to name this category "prophecy" or something.

It's clear that the category does deserve a name.

He thinks that as AI systems get more powerful, they will actually become more interpretable, because they will use features that humans also tend to use.

I find this fairly persuasive, I think. One way of putting it is that in order for an agent to be recursively self-improving in any remotely intelligent way, it needs to be legible to itself. Even if we can't immediately understand its components in the same way that it does, it must necessarily provide us with descriptions of its own ways of understanding them, which we could then potentially co-opt. (relevant: )

This may be useful in the early phases, but I'm skeptical as to whether humans can import those new ways of understanding fast enough to be permitted to stand as an air-gap for very long. There is a reason, for instance, we don't have humans looking over and approving every credit card transaction. Taking humans out of the loop is the entire reason those systems are useful. The same dynamic will pop up with AGI.

This xkcd comic seems relevant ("sandboxing cycle").

There is a tension between connectivity and safe isolation, and navigating it is hard.

Hmm. I don't think I can answer the question, but if you're interested in finding fairly realistic ways to Dutch-book CDT agents, I'm curious: would the following be a good method? Death in Damascus would be very hard to do IRL, because you'd need a mind-reader, and most CDT agents will not allow you to read their minds, for obvious reasons.

A game with a large set of CDT agents. They can each output Sensible or Exceptional. If they output Sensible, they receive $1. Those who output Exceptional get nothing in that stage.

Next, if their output is the majority output, an additional $2 is subtracted from their score. If they're exceptionally clever, if they manage to disagree with the majority, then $2 is added to their score. A negative final score means they lose money to us. We will tend to profit because, generally, they're not exceptional: there are more majority bettors than minority bettors.

CDT agents act on the basis of an imagined future where their own action is born from nothing, and has no bearing on anything else in the world. As a result of that, they will reliably overestimate⬨ (or more precisely, reliably act as if they have overestimated) their ability to evade the majority. They are exceptionalists. They will (act as if they) overestimate how exceptional they are.

Whatever method they use to estimate⬨ the majority action, they will tend to come out with the same answer, and so they will tend to bet the same way, and so they will tend to lose money to the house continuously.

⬨ They will need to resort to some kind of estimate, won't they? If a CDT agent tries to simulate itself (with the same inputs), that won't halt (the result is undefined). If a CDT-like agent can exist in reality, it'll use some approximate method for this kind of recursive prediction work.

After enough rounds, I suppose it's possible that their approximations might go a bit crazy from all of the contradictory data and reach some kind of equilibrium where they're betting different ways somewhere around 1:1, at which point it'll become unprofitable for us to continue the contest, but by then we will have made a lot of money.
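The game above can be sketched as a simulation. This is a minimal sketch under one loud assumption: every agent uses the same deterministic heuristic (hypothetical: "the majority repeats its last output") to estimate the majority, and then, CDT-style, bets against that estimate as if its own choice were causally independent of everyone else's. Because the heuristic is shared, all agents land on the same side, form the majority themselves, and bleed money to the house:

```python
def shared_estimate(history):
    # Hypothetical shared heuristic: predict that the majority will
    # repeat its previous output, defaulting to "Sensible" in round one.
    return history[-1] if history else "Sensible"

def choose(history):
    # CDT-style reasoning: treat your own action as having no bearing
    # on the others, estimate the majority, then bet against it to
    # chase the $2 minority bonus (plus $1 if Sensible).
    predicted_majority = shared_estimate(history)
    return "Exceptional" if predicted_majority == "Sensible" else "Sensible"

def run(num_agents=101, rounds=20):
    history = []                 # past majority outputs, visible to all
    scores = [0.0] * num_agents
    house = 0.0
    for _ in range(rounds):
        outputs = [choose(history) for _ in range(num_agents)]
        majority = max(set(outputs), key=outputs.count)
        for i, out in enumerate(outputs):
            payoff = 1.0 if out == "Sensible" else 0.0   # stage payout
            payoff += 2.0 if out != majority else -2.0   # minority bonus / majority penalty
            scores[i] += payoff
            house -= payoff
        history.append(majority)
    return house, scores

house, scores = run()
# Every agent bets the same way each round, so every agent is always in
# the majority: the house profit stays positive and every score negative.
```

In this sketch the agents never reach the mixed, roughly 1:1 equilibrium mentioned above, because the shared heuristic is deterministic; modeling the "approximations going a bit crazy" would require giving each agent its own noisy estimator, at which point the house's edge shrinks.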