I've noticed that the word "stipulation" is a pretty good word for the category of claims that become true when we decide they are true. It's probably better to broaden its connotations to encompass self-fulfilling prophecies than to coin some other word or name this category "prophecy" or something.

It's clear that the category does deserve a name.

He thinks that as AI systems get more powerful, they will actually become more interpretable, because they will use features that humans also tend to use.

I find this fairly persuasive. One way of putting it: for an agent to be recursively self-improving in any remotely intelligent way, it needs to be legible to itself. Even if we can't immediately understand its components in the same way that it does, it must necessarily provide us with descriptions of its own ways of understanding them, which we could then potentially co-opt. (Relevant: https://www.lesswrong.com/posts/bNXdnRTpSXk9p4zmi/book-review-design-principles-of-biological-circuits )

This may be useful in the early phases, but I'm skeptical that humans can import those new ways of understanding fast enough to be permitted to stand as an air-gap for very long. There is a reason, for instance, that we don't have humans looking over and approving every credit card transaction: taking humans out of the loop is the entire reason those systems are useful. The same dynamic will pop up with AGI.

This xkcd comic seems relevant: https://xkcd.com/2044/ ("Sandboxing Cycle").

There is a tension between connectivity and safe isolation, and navigating it is hard.

Hmm. I don't think I can answer the question, but if you're interested in finding fairly realistic ways to Dutch-book CDT agents, I'm curious: would the following be a good method? Death in Damascus would be very hard to do IRL, because you'd need a mind-reader, and most CDT agents will not allow you to read their mind, for obvious reasons.

A game with a large set of CDT agents. Each can output Sensible or Exceptional. If they output Sensible, they receive $1. Those who output Exceptional get nothing in that stage.

Next, if their output is the majority output, an additional $2 is subtracted from their score. If they're exceptionally clever, if they manage to disagree with the majority, then $2 is added to their score. A negative final score means they lose money to us. We will tend to profit because, generally, they're not exceptional: there are more majority bettors than minority bettors.
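As a sanity check on the payoff rules, here's a minimal sketch of one round's scoring (the name `score_round` is just mine, and I'm ignoring exact ties):

```python
from collections import Counter

def score_round(outputs: list[str]) -> list[int]:
    """Score one round: "Sensible" pays $1 up front, "Exceptional" pays
    nothing; then everyone in the majority loses $2 and everyone in the
    minority gains $2."""
    majority = Counter(outputs).most_common(1)[0][0]
    return [
        (1 if action == "Sensible" else 0) + (-2 if action == majority else 2)
        for action in outputs
    ]

# Nine conformists and one dissenter: only the dissenter comes out ahead.
print(score_round(["Sensible"] * 9 + ["Exceptional"]))
# [-1, -1, -1, -1, -1, -1, -1, -1, -1, 2]
```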

CDT agents act on the basis of an imagined future where their own action is born from nothing and has no bearing on anything else in the world. As a result, they will reliably overestimate⬨ (or, more precisely, reliably act as if they have overestimated) their ability to evade the majority. They are exceptionalists. They will (act as if they) overestimate how exceptional they are.

Whatever method they use to estimate⬨ the majority action, they will tend to come out with the same answer, and so they will tend to bet the same way, and so they will tend to lose money to the house continuously.
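If that's right, the losing dynamic is easy to simulate. Below is a toy sketch (my construction, not a standard result): every agent shares the same deterministic estimator, here just "predict that last round's majority repeats", and best-responds to it by trying to land in the minority. Since they all predict alike, they all bet alike.

```python
from collections import Counter

ROUNDS = 6
N_AGENTS = 101  # odd, so there is always a strict majority

def best_response(predicted_majority: str) -> str:
    # A CDT agent treats its own action as causally isolated, so it just
    # plays whatever pays best against its fixed prediction: against a
    # "Sensible" majority, "Exceptional" would net $2; against an
    # "Exceptional" majority, "Sensible" would net $3.
    return "Exceptional" if predicted_majority == "Sensible" else "Sensible"

prediction = "Sensible"  # shared prior; any shared estimator will do
house_profit = 0
for round_num in range(1, ROUNDS + 1):
    # Same estimator + same best response => everyone bets the same way.
    outputs = [best_response(prediction) for _ in range(N_AGENTS)]
    majority = Counter(outputs).most_common(1)[0][0]
    agents_total = sum(
        (1 if a == "Sensible" else 0) + (-2 if a == majority else 2)
        for a in outputs
    )
    house_profit -= agents_total  # the agents' loss is our gain
    prediction = majority  # crude fictitious-play update for next round
    print(f"round {round_num}: all bet {majority!r}, house is up ${house_profit}")
```

Nothing here depends on the repeat-last-round rule in particular, only on the estimator being shared and deterministic: the whole population oscillates in lockstep and pays the house every round.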

⬨ They will need to resort to some kind of an estimate, won't they? If a CDT agent tries to simulate itself (with the same inputs), that won't halt (the result is undefined). If a CDT-like agent can exist in reality, it'll have to use some approximate method for this kind of recursive prediction work.

After enough rounds, I suppose it's possible that their approximations might go a bit crazy from all of the contradictory data and reach some kind of equilibrium where they're betting different ways at roughly 1:1, at which point it'll become unprofitable for us to continue the contest, but by then we will have made a lot of money.