Are you proposing applying this to something potentially prepotent? Or does this come with corrigibility guarantees? If you applied it to a prepotence, I'm pretty sure this would be an extremely bad idea. The actual human utility function (the rules of the game as intended) supports important glitch-like behavior, where cheap tricks can extract enormous amounts of utility, so applying this to general alignment risks foreclosing most of the value that could have existed.

Example 1: Virtual worlds are a weird out-of-distribution part of the human utility function that allows the AI to "cheat" and create impossibly good experiences by cutting the human's senses off from the real world and showing them an illusion. As far as I'm concerned, creating non-deceptive virtual worlds (like, very good video games) is correct behavior and the future would be immeasurably devalued if it were disallowed.

Example 2: I am not a hedonist, but I can't say conclusively that I wouldn't become one (turn out to be one) if I had full knowledge of my preferences and the ability to self-modify, as well as lots of time and safety to reflect, settle my affairs in the world, set aside my pride, and then wirehead. This is a glitchy-looking behavior that allows the AI to extract a much higher yield of utility from each subject by gradually warping them into a shape where they lose touch with most of what we currently call "values", where one value dominates all of the others. If it is incorrect behavior, then sure, the AI shouldn't be allowed to do it; but humans today don't have the kind of self-reflection required to tell whether it's incorrect behavior or not. And if it is correct behavior, forever forbidding it is actually a far more horrifying outcome: what you'd be doing is, in some sense of 'suffering', forever prolonging some amount of suffering. That's fine if humans tolerate and prefer some amount of suffering, but we aren't sure of that yet.

I've noticed that the word "stipulation" is a pretty good word for the category of claims that become true when we decide they are true. It's probably better to broaden its connotations to encompass self-fulfilling prophecies than to coin some other word, or to name this category "prophecy" or something.

It's clear that the category does deserve a name.

He thinks that as AI systems get more powerful, they will actually become more interpretable, because they will use features that humans also tend to use.

I find this fairly persuasive, I think. One way of putting it is that in order for an agent to be recursively self-improving in any remotely intelligent way, it needs to be legible to itself. Even if we can't immediately understand its components in the same way that it does, it must necessarily provide us with descriptions of its own ways of understanding them, which we could then potentially co-opt. (relevant: )

This may be useful in the early phases, but I'm skeptical as to whether humans can import those new ways of understanding fast enough to be permitted to stand as an air-gap for very long. There is a reason, for instance, we don't have humans looking over and approving every credit card transaction. Taking humans out of the loop is the entire reason those systems are useful. The same dynamic will pop up with AGI.

This xkcd comic seems relevant ("sandboxing cycle").

There is a tension between connectivity and safe isolation, and navigating it is hard.

Hmm. I don't think I can answer the question, but if you're interested in finding fairly realistic ways to Dutch-book CDT agents, I'm curious: would the following be a good method? Death in Damascus would be very hard to do IRL, because you'd need a mind-reader, and most CDT agents will not allow you to read their minds, for obvious reasons.

A game with a large set of CDT agents. They can each output Sensible or Exceptional. If they output Sensible, they receive $1. Those who output Exceptional get nothing in that stage.

Next, if their output is the majority output, an additional $2 is subtracted from their score. If they're exceptionally clever, if they manage to disagree with the majority, then $2 is added to their score. A negative final score means they lose money to us. We will tend to profit because, generally, they're not exceptional: there are more majority bettors than minority bettors.

CDT agents act on the basis of an imagined future where their own action is born from nothing, and has no bearing on anything else in the world. As a result of that, they will reliably overestimate⬨ (or more precisely, reliably act as if they have overestimated) their ability to evade the majority. They are exceptionalists. They will (act as if they) overestimate how exceptional they are.

Whatever method they use to estimate⬨ the majority action, they will tend to come out with the same answer, and so they will tend to bet the same way, and so they will tend to lose money to the house continuously.

⬨ They will need to resort to some kind of estimate, won't they? If a CDT agent tries to simulate itself (with the same inputs), that won't halt (the result is undefined). If a CDT-like agent can exist in reality, it'll use some approximate method for this kind of recursive prediction work.

After enough rounds, I suppose it's possible that their approximations might go a bit crazy from all of the contradictory data and reach some kind of equilibrium where they're betting different ways somewhere around 1:1, at which point it'll become unprofitable for us to continue the contest, but by then we will have made a lot of money.
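The game above can be sketched as a simulation. This is a minimal sketch under one loud assumption: every agent uses the same deterministic heuristic (hypothetical: "the majority repeats its last output") to estimate the majority, and then, CDT-style, bets against that estimate as if its own choice were causally independent of everyone else's. Because the heuristic is shared, all agents land on the same side, form the majority themselves, and bleed money to the house:

```python
def shared_estimate(history):
    # Hypothetical shared heuristic: predict that the majority will
    # repeat its previous output, defaulting to "Sensible" in round one.
    return history[-1] if history else "Sensible"

def choose(history):
    # CDT-style reasoning: treat your own action as having no bearing
    # on the others, estimate the majority, then bet against it to
    # chase the $2 minority bonus (plus $1 if Sensible).
    predicted_majority = shared_estimate(history)
    return "Exceptional" if predicted_majority == "Sensible" else "Sensible"

def run(num_agents=101, rounds=20):
    history = []                 # past majority outputs, visible to all
    scores = [0.0] * num_agents
    house = 0.0
    for _ in range(rounds):
        outputs = [choose(history) for _ in range(num_agents)]
        majority = max(set(outputs), key=outputs.count)
        for i, out in enumerate(outputs):
            payoff = 1.0 if out == "Sensible" else 0.0   # stage payout
            payoff += 2.0 if out != majority else -2.0   # minority bonus / majority penalty
            scores[i] += payoff
            house -= payoff
        history.append(majority)
    return house, scores

house, scores = run()
# Every agent bets the same way each round, so every agent is always in
# the majority: the house profit stays positive and every score negative.
```

In this sketch the agents never reach the mixed, roughly 1:1 equilibrium mentioned above, because the shared heuristic is deterministic; modeling the "approximations going a bit crazy" would require giving each agent its own noisy estimator, at which point the house's edge shrinks.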