Towards formalizing universality

Wei Dai 7y 30

This post defines the concept of “ascription universality,” which tries to capture the property that a question-answering system A is better-informed than any particular simpler computation C.

I'm getting lost right away here. In the case of "large enough teams of humans can carry out arbitrarily complicated reasoning", what is A and C? Presumably A is the whole team, but I have no idea what C is.

paulfchristiano 7y 30

C is an arbitrary computation; to be universal, the humans must be better informed than *any* simple enough computation C.

Wei Dai 7y 20

Also, I'm confused about the practical case.

For example, suppose that C formulates a plan to “trick” A[C]. Then the subjective universality condition implies that we don’t expect C to succeed.

What does "expect" mean here? Probability > .5? Also, can you walk through an example of how C might "trick" A[C] and how subjective dominance implies that we don't expect C to succeed?

We’d like to say that the impoverished perspective is still “good enough” for us to feel safe, despite not being good enough to capture literally everything we know. But now we risk begging the question: how do we evaluate whether the impoverished perspective is good enough? I think this is probably OK, but it’s definitely subtle.

I don't know how to make sense of this. If an impoverished perspective expects C not to be able to "trick" A, it seems kind of obvious that's not good enough for me to think the same? You must have reason to think otherwise but the inferential distance is too far for me to figure out what it is.

paulfchristiano 7y 20

What does "expect" mean here? Probability > .5?

We don't think that C has any expected edge in identifying an attack that A does not: if C is up to something, and A can't tell, then we think that the "being up to something" provides no evidence about whether something bad is about to happen. This is much stronger than a fixed probability of failure.
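To make "no expected edge" concrete, here is a toy check. It is only a sketch under assumptions that are not from this thread: the world is three fair bits, the quantity of interest `X` is their sum, A observes two of the bits and C observes one. The function names (`belief_a`, `belief_c`, `dominates`) are hypothetical. The condition tested is a simplified one: once we know both A's and C's expectations of `X`, our own expectation of `X` equals A's, so learning what C thinks moves us no further.

```python
from fractions import Fraction
from itertools import product

# Toy world (an assumption for illustration): three fair, independent bits,
# so every world (s1, s2, noise) is equally likely.
WORLDS = list(product([0, 1], repeat=3))


def quantity(world):
    """The quantity X we care about: the sum of all three bits."""
    s1, s2, noise = world
    return s1 + s2 + noise


def belief_a(world):
    """A observes s1 and s2, so its expectation of X is s1 + s2 + E[noise]."""
    s1, s2, _ = world
    return Fraction(s1 + s2) + Fraction(1, 2)


def belief_c(world):
    """C observes only s1, so its expectation of X is s1 + E[s2] + E[noise]."""
    s1, _, _ = world
    return Fraction(s1) + 1


def dominates(strong, weak):
    """True if E[X | strong's belief, weak's belief] equals strong's belief
    for every pair of reported beliefs that can occur."""
    cells = {}
    for w in WORLDS:
        cells.setdefault((strong(w), weak(w)), []).append(quantity(w))
    return all(
        Fraction(sum(xs), len(xs)) == reported[0]
        for reported, xs in cells.items()
    )


print(dominates(belief_a, belief_c))  # A has everything C has, plus s2
print(dominates(belief_c, belief_a))  # C is missing s2, so A's report still helps
```

In this toy, the first check holds and the second fails. That asymmetry is the sense in which the condition is about C having no informational edge, rather than about a probability threshold such as 0.5.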

Also, can you walk through an example of how C might "trick" A[C] and how subjective dominance implies that we don't expect C to succeed?

Simple caricatured examples:

- C might propose a design for a computer that has a backdoor that an attacker can use to take over the computer. But if this backdoor will actually be effective, then A[C] will know about it.
- C might propose a design that exploits a predictable flaw in A's reasoning (e.g. overlooking consequences of a certain kind, being overly optimistic about some kinds of activities, incorrectly equating two importantly different quantities...). But then A[C] will know about it, and so if A[C] actually reasons in that way then (in some sense) it is endorsed.

I don't know how to make sense of this. If an impoverished perspective expects C not to be able to "trick" A, it seems kind of obvious that's not good enough for me to think the same?

If an impoverished perspective expects C not to be able to trick A, then either:

- In fact C won't be able to trick A.
- C will trick A, but the perspective is too weak to tell.

I think I don't quite understand what you are saying here, what exactly is obvious?

From a suitably advanced perspective it's obvious that C will be able to trick A sometimes: it will just get "epistemically lucky" and make an assumption that A regards as silly but that turns out to be right.

Wei Dai 7y 20

I think I don’t quite understand what you are saying here, what exactly is obvious?

I think I expressed myself badly there. What I mean is that it seems a sensible default to not trust an impoverished perspective relative to oneself, and you haven't stated a reason why we should trust the impoverished perspective. This seems to be at least a big chunk of the formalization of universality that you haven't sketched out yet.

paulfchristiano 7y 20

Suppose that I convinced you "if you didn't know much chemistry, you would expect this AI to yield good outcomes." I think you should be pretty happy. It may be that the AI would predictably cause a chemistry-related disaster in a way that would be obvious to you if you knew chemistry, but overall I think you should expect not to have a safety problem.

This feels like an artifact of a deficient definition: I should never end up with a lemma like "if you didn't know much chemistry, you'd expect this AI to yield good outcomes" rather than being able to directly say what we want to say.

That said, I do see some appeal in proving things like "I expect running this AI to be good," and if we are ever going to prove such statements they are probably going to need to be from some impoverished perspective (since it's too hard to bring all of the facts about our actual epistemic state into such a proof), so I don't think it's totally insane.

If we had a system that is ascription universal from some impoverished perspective, you may or may not be OK. I'm not really worrying about it; I expect this definition to change before the point where I literally end up with a system that is ascription universal from some impoverished perspective, and this definition seems good enough to guide next research steps.

Wei Dai 7y 20

So "simpler" in that sentence should be replaced by "simple enough"? In other words, it's not the case that A is better-informed than every computation C that is simpler than A, right? Also, can you give a sense of how much simpler is simple enough?

paulfchristiano 7y 20

I'm aiming for things like:

- $n$-round debate dominating any fast computation with $n - 1$ alternations (including an $n - 1$ round debate).
- max-HCH with budget $k n$ dominating max-HCH with budget $n$ for some constant $k > 1$.
- HCH with advice and budget $k n$ dominating HCH with no advice and budget $n$.
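
Written schematically, the three targets read roughly as below. The symbol $\succeq$ is an assumed shorthand for "epistemically dominates", and the subscripts for round count and budget are likewise notation introduced here for illustration, not taken from the thread:

$$
\begin{aligned}
\text{Debate}_{n} &\succeq C \quad \text{for any fast } C \text{ with } n - 1 \text{ alternations} \\
\text{max-HCH}_{k n} &\succeq \text{max-HCH}_{n} \quad \text{for some constant } k > 1 \\
\text{HCH}^{\text{advice}}_{k n} &\succeq \text{HCH}_{n}
\end{aligned}
$$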

Stuart_Armstrong 7y 20

It seems the ascription process is approximately "deduce an agent's beliefs from their outputs". This seems to have the same problem as "deduce an agent's preferences from their outputs", which I showed was not possible in general, even with simplicity.

So when dealing with non-perfectly rational agents, it seems you'll have to put in the irrationality by hand. So it's not so much "ascribing beliefs" as "prescribing beliefs": our interpretation determines what the agent believes. The caveat that "This procedure wouldn’t capture the beliefs of a native Spanish speaker, or for someone who wasn’t answering questions honestly" names just two examples of a much more universal problem.
