Abram Demski



Debate Minus Factored Cognition

Thanks for taking the time to reply!

I don’t think that’s what I did? Here’s what I think the structure of my argument is:

  1. Every dishonest argument has a defeater. (Your assumption.)
  2. Debaters are capable of finding a defeater if it exists. (I said “the best counterargument” before, but I agree it can be weakened to just “any defeater”. This doesn’t feel that qualitatively different.)
  3. 1 and 2 imply the Weak Factored Cognition hypothesis. I’m not assuming factored cognition, I’m proving it using your assumption.

Ah, interesting, I didn't catch that this is what you were trying to do. But how are you arguing #3? Your original comment seems to be constructing a tree computation for my debate, which is why I took it for an argument that my thing can be computed within factored cognition, not vice versa.

I think maybe what you're trying to argue is that #1 and #2 together imply that we can root out dishonest arguments (at least, in the honest equilibrium), which I would agree with -- and then you're suggesting that this means we can recognize good arguments in the factored-cognition sense of good (IE arguments supported by a FC tree)? But I don't yet see the implication from rooting out dishonest arguments to being able to recognize arguments that are valid in FC terms.

Perhaps an important point is that by "dishonest" I mean manipulative, ie, arguments which appear valid to a human on first reading them but which are (in some not-really-specified sense) bad. So, being able to root out dishonest arguments just means we can prevent the human from being improperly convinced. Perhaps you are reading "dishonest" to mean "invalid in an FC sense", ie, lacking an FC tree. This is not at all what I mean by dishonest. Although we might suppose dishonest implies FC-invalid, this supposition still would not make your argument go through (as far as I can see), because the set of not-dishonest arguments would still not equal the set of FC-valid arguments.

If you did mean for "honest" to be defined as "has a supporting FC tree", my objection to your argument quoted above would be that #1 is implausibly strong, since it requires that any flaw in a tree can be pointed out in a single step. (Analogously, this is assuming PSPACE=NP.)

Possibly your worry is that the argument trees will never terminate, because every honest defeater could still have a dishonest defeater?

I mean, that's a concern I have, but not necessarily wrt the argument above. (Unless you have a reason why it's relevant.)

It is true that I do need an additional assumption of some sort to ensure termination. Without that assumption, honesty becomes one of multiple possible equilibria (but it is still an equilibrium).

Based on what argument? Is this something from the original debate paper that I'm forgetting?

I also agree with this; does anyone think it is proving something about the safety properties of debate w.r.t messy situations?

Fair question. Possibly it's just my flawed assumption about why the analogy was supposed to be interesting. I assumed people were intending the PSPACE thing as evidence about what would happen in messier situations.

This seems good; I think probably I don’t get what exactly you’re arguing. (Like, what’s the model of human fallibility where you don’t access NP in one step? Can the theoretical-human not verify witnesses? What can the theoretical-human verify, that lets them access NP in multiple timesteps but not one timestep?)

My model is like this:

Imagine that we're trying to optimize a travelling salesman route, using an AI advice system. However, whenever the AI says "democratic" or "peaceful" or other such words, the human unthinkingly approves of the route, without checking the claimed distance calculation.

This is, of course, a little absurd, but similar effects have been observed in experiments.

I'm then making the further assumption that humans can correct these errors when they're explained sufficiently well.

That's my model; the proposal in the post lives or dies on its merits.
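To make the model concrete, here is a minimal sketch of a judge who can verify a claimed route length in polynomial time but skips the check when a trigger word appears. Everything named here (`TRIGGER_WORDS`, `fallible_judge`, the `warned` flag standing in for "the error has been explained sufficiently well") is a hypothetical illustration, not part of the proposal in the post. Note that the check only covers the claimed distance calculation, not optimality of the route.

```python
from typing import List, Sequence

# Hypothetical trigger words that make the judge approve without checking.
TRIGGER_WORDS = ("democratic", "peaceful")


def tour_length(dist: Sequence[Sequence[float]], tour: List[int]) -> float:
    """Length of the closed tour that visits cities in the given order."""
    n = len(tour)
    return sum(dist[tour[i]][tour[(i + 1) % n]] for i in range(n))


def careful_check(dist: Sequence[Sequence[float]], tour: List[int],
                  claimed_length: float) -> bool:
    """Poly-time check: every city is visited once and the claimed length is right."""
    visits_all = sorted(tour) == list(range(len(dist)))
    return visits_all and abs(tour_length(dist, tour) - claimed_length) < 1e-9


def fallible_judge(dist: Sequence[Sequence[float]], tour: List[int],
                   claimed_length: float, pitch: str, warned: bool = False) -> bool:
    """Approve unthinkingly on a trigger word, unless the error has been explained."""
    if not warned and any(word in pitch.lower() for word in TRIGGER_WORDS):
        return True  # the manipulation works: no distance check is performed
    return careful_check(dist, tour, claimed_length)


dist = [[0, 1, 2], [1, 0, 3], [2, 3, 0]]
tour = [0, 1, 2]  # true length is 1 + 3 + 2 = 6

fooled = fallible_judge(dist, tour, 4, "a peaceful route")                  # True
corrected = fallible_judge(dist, tour, 4, "a peaceful route", warned=True)  # False
honest = fallible_judge(dist, tour, 6, "a plain route")                     # True
```

The second call models the further assumption that a sufficiently good explanation of the error lets the human fall back on the careful check.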

I agree that you get a “clawing on to the argument in hopes of winning” effect, but I don’t see why that changes the equilibrium away from honesty. Just because a dishonest debater would claw on doesn’t mean that they’d win. The equilibrium is defined by what makes you win.

The point of the "clawing" argument is that it's a rational deviation from honesty, so it means honesty isn't an equilibrium. It's a 50/50 chance of winning (whoever gets the last word), which is better than a sure failure (in the case that a player has exhausted its ability to honestly argue).
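The comparison can be spelled out as a toy expected-payoff calculation. The payoffs of +1 for a win and -1 for a loss, and the modeling of "whoever gets the last word" as a fair coin flip, are illustrative assumptions rather than a scoring rule taken from any debate proposal.

```python
def expected_payoff(p_win: float, win: float = 1.0, loss: float = -1.0) -> float:
    """Expected payoff of an outcome that is a win with probability p_win."""
    return p_win * win + (1.0 - p_win) * loss


# A beaten player who stops arguing under zero-sum scoring loses for sure.
concede = expected_payoff(p_win=0.0)  # -1.0

# A beaten player who keeps producing (dishonest) defeaters wins whenever
# they happen to get the last word, modeled here as a coin flip.
claw = expected_payoff(p_win=0.5)     # 0.0

clawing_is_a_profitable_deviation = claw > concede  # True
```

Under these assumptions, any positive chance of getting the last word makes clawing better than conceding, which is why the deviation is rational.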

Granted, there may be zero-sum rules which nonetheless don't allow this. I'm only saying that I didn't see how to avoid it with zero-sum scoring.

I don’t really understand why you want it to be non-zero-sum [...] I really just needed it for my argument to go through. If you have an alternate argument which works for the zero-sum case, I’m interested in hearing it. I mean, I tried to give one (see response to your first point; I’m not assuming the Factored Cognition hypothesis). I’m not sure what’s unconvincing about it.

I remain curious to hear your clarification wrt that (specifically, how you justify point #3). However, if that argument went through, how would that also be an argument that the same thing can be accomplished with a zero-sum set of rules?

Based on your clarification, my current understanding of what that argument tries to accomplish is "I’m not assuming factored cognition, I’m proving it using your assumption." How would establishing that help establish a set of zero sum rules which have an honest equilibrium?

AI safety via market making

This was a very interesting comment (along with its grandparent comment), thanks -- it seems like a promising direction.

However, I'm still confused about whether this would work. It's very different from the judging procedure outlined here; why is that? Do you have a similarly detailed write-up of the system you're describing here?

I'm actually less concerned about loops and more concerned about arguments which are infinite trees, but the considerations are similar. It seems possible that the proposal you're discussing very significantly addresses concerns I've had about debate.

Debate Minus Factored Cognition

I think I disagree with the claim you're making about being able to avoid requiring the judge to assume that one player is honest (but I might be confused about what you're proposing). 

Don't you yourself disagree with requiring the judge to assume that one player is honest? In a recent comment, you discuss how claims should not be trusted by default.

Debate Minus Factored Cognition

I don't know if you've seen our most recent debate rules and attempt at analysis of whether they provide the desired behavior - seems somewhat relevant to what you're thinking about here. 

I took a look, and it was indeed helpful. However, I left a comment there about a concern I have. The argument at the end only argues for what you call D-acceptability: having no answer that's judged better after D steps of debate. My concern is that even if debaters are always D-acceptable for all D, that does not mean they are honest. They can instead use non-well-founded argument trees which never bottom out.

Debate Minus Factored Cognition

It seems to me that your argument is very similar, except that you get a little more mileage out of assumption 2, that the debaters can find the true decomposition tree.

While I agree that the defeater tree can be encoded as a factored cognition tree, that just means that if we assume factored cognition, and make my assumption about (recursive) defeaters, then we can show that factored cognition can handle the defeater computation. This is sort of like proving that the stronger theory can handle what the weaker theory can handle, which would not be surprising -- I'd still be interested in the weaker theory as a way to argue safety from fewer assumptions. But it's not even that, since you'd still need to additionally suppose my thesis about defeaters, beyond (strong/weak) factored cognition.

Essentially what's happening is that with your argument we get to trust that the debaters have explored all possible counterarguments and selected the best one and so the human gets to assume that no other more compelling counterarguments exist, which is not something we typically get to assume with weak Factored Cognition. It feels to me intuitively like this puts more burden on the assumption that we find the true equilibrium, though formally it's the same assumption as before.

I don't really get this part -- what's so important about the best counterargument? I think my argument in the post is more naturally captured by supposing counterarguments either work or don't, in binary fashion. So a debater just has to find a defeater. Granted, some defeaters have a higher probability of working, in a realistic situation with a fallible judge. And sure, the debaters should find those. But I don't see where I'm putting a higher burden on finding the true equilibrium. What are you pointing at?

Idk, it seems like this is only true because you are forcing your human to make a judgment. If the judge were allowed to say "I don't know" (in which case no one gets reward, or the reward is split), then I think one step of debate once again provides an NP oracle.

Or perhaps you're assuming that the human is just not very good at being a poly-time algorithm; if that's what you're saying that seems like it's missing the point of the computational complexity analogy. I don't think people who make that analogy (including myself) mean that humans could actually implement arbitrary poly-time algorithms faithfully.

Yeah, my reply would be that I don't see how you get NP oracles out of one step, because a one-step debate will just result in maximally convincing arguments which have little to do with the truth.

I mean, I agree that if you're literally trying to solve TSP, then a human could verify proposed solutions. However, it seems like we don't have to get very messy before humans become exceedingly manipulable through dishonest argument.

So if the point of the computational complexity analogy is to look at what debate could accomplish if humans could be perfect (but poly-time) judges, then I accept the conclusion, but I just don't think that's telling you very much about what you can accomplish on messier questions (and especially, not telling you much about safety properties of debate).

Instead, I'm proposing a computational complexity analogy in which we account for human fallibility as judges, but also allow for the debate to have some power to correct for those errors. This seems like a more realistic way to assess the capabilities of highly trained debate systems.

So far, all of this discussion still works with the zero-sum setting, so I don't really understand why you say

>The following is a fairly nonstandard setup for AI Debate, but I found it necessary to make my argument go through.

Hm, well, I thought I was pretty clear in the post about why I needed that to make my argument work, so I'm not sure what else to say. I'll try again:

In my setup, a player is incentivised to concede when they're beaten, rather than continue to defeat the arguments of the other side. This is crucial, because any argument may have a (dishonest) defeater, so the losing side could continue on, possibly flipping the winner back and forth until the argument gets decided by who has the last word. Thus, my argument that there is an honest equilibrium would not go through for a zero-sum mechanism where players are incentivised to try and steal victory back from the jaws of defeat.

Perhaps I could have phrased my point as: the PSPACE capabilities of debate are eaten up by error correction.

In any case, it seems to me like making it non-zero-sum is an orthogonal axis. I don't really understand why you want it to be non-zero-sum -- you say that it is to incentivize honesty at every step, but why doesn't this happen with standard debate? If you evaluate the debate at the end rather than at every step, then as far as I can tell under the assumptions you use the best strategy is to be honest.


Overall it seemed to me like the non-zero-sum aspect introduced some problems (might no longer access PSPACE, introduces additional equilibria beyond the honest one), and did not actually help solve anything, but I'm pretty sure I just completely missed the point you were trying to make.

I really just needed it for my argument to go through. If you have an alternate argument which works for the zero-sum case, I'm interested in hearing it.

Maybe you mean that if we assume (weak/strong) factored cognition, you can argue that zero-sum debate works, because argument trees terminate, so who wins is not in fact just up to who gets the last word. But (a) this would require factored cognition; (b) I'm interested in hearing your argument even if it relies on factored cognition, because I'm still concerned that a dishonest player can use flawless but non-well-founded argument trees (and is incentivised to do so, even in the honest equilibrium, to avert loss).

As usual when talking about debate, I get the feeling that I'm possibly being dumb about something, because everyone else seems to buy that there are arguments in support of various points. I'm kind of worried that there aren't really arguments for those things, which is a big part of why I bothered to write a post at all -- this post is basically my attempt to articulate the part of debate which I currently understand the case for. But getting the argument I'm missing would certainly be helpful.

Radical Probabilism

DP: I'm not saying that hardware is infinitely reliable, or confusing a camera for direct access to reality, or anything like that. But, at some point, in practice, we get what we get, and we have to take it for granted. Maybe you consider the camera unreliable, but you still directly observe what the camera tells you. Then you would make probabilistic inferences about what light hit the camera, based on definite observations of what the camera tells you. Or maybe it's one level more indirect from that, because your communication channel with the camera is itself imperfect. Nonetheless, at some point, you know what you saw -- the bits make it through the peripheral systems, and enter the main AI system as direct observations, of which we can be certain. Hardware failures inside the core system can happen, but you shouldn't be trying to plan for that in the reasoning of the core system itself -- reasoning about that would be intractable. Instead, to address that concern, you use high-reliability computational methods at a lower level, such as redundant computations on separate hardware to check the integrity of each computation.

RJ: Then the error-checking at the lower level must be seen as part of the rational machinery.

DP: True, but all the error-checking procedures I know of can also be dealt with in a classical Bayesian framework.

RJ: Can they? I wonder. But, I must admit, to me, this is a theory of rationality for human beings. It's possible that the massively parallel hardware of the brain performs error-correction at a separated, lower level. However, it is also quite possible that it does not. An abstract theory of rationality should capture both possibilities. And is this flexibility really useless for AI? You mention running computations on different hardware in order to check everything. But this requires a rigid setup, where all computations are re-run a set number of times. We could also have a more flexible setup, where computations have confidence attached, and running on different machines creates increased confidence. This would allow for finer-grained control, re-running computations when the confidence is really important. And need I remind you that belief prop in Bayesian networks can be understood in radical probabilist terms? In this view, a belief network can be seen as a network of experts communicating with one another. This perspective has been, as I understand it, fruitful.

DP: Sure, but we can also see belief prop as just an efficient way of computing the regular Bayesian math. The efficiency can come from nowhere special, rather than coming from a core insight about rationality. Algorithms are like that all the time -- I don't see the fast Fourier transform as coming from some basic insight about rationality.

RJ: The "factor graph" community says that belief prop and the fast Fourier transform actually come from the same insight! But I concede the point; we don't actually need to be radical probabilists to understand and use belief prop. But why are you so resistant? Why are you so eager to posit a well-defined boundary between the "core system" and the environment?

DP: It just seems like good engineering. We want to deal with a cleanly defined boundary if possible, and it seems possible. And this way we can reason explicitly about the meaning of sensory observations, rather than implicitly being given the meaning by way of uncertain updates which stipulate a given likelihood ratio with no model. And it doesn't seem like you've given me a full alternative -- how do you propose to, really truly, specify a system without a boundary? At some point, messages have to be interpreted as uncertain evidence. It's not like you have a camera automatically feeding you virtual evidence, unless you've designed the hardware to do that. In which case, the boundary would be the camera -- the light waves don't give you virtual evidence in the format the system accepts, even if light is "fundamentally uncertain" in some quantum sense or whatever. So you have this boundary, where the system translates input into evidence (be it uncertain or not) -- you haven't eliminated it.

RJ: That's true, but you're supposing the boundary is represented in the AI itself as a special class of "sensory" propositions. Part of my argument is that, due to logical uncertainty, we can't really make this distinction between sensory observations and internal propositions. And, once we make that concession, we might as well allow the programmer/teacher to introduce virtual evidence about whatever they want; this allows direct feedback on abstract matters such as "how to think about this", which can't be modeled easily in classic Bayesian settings such as Solomonoff induction, and may be important for AI safety.

DP: Very well, I concede that while I still hold out hope for a fully Bayesian treatment of logical uncertainty, I can't provide you with one. And, sure, providing virtual evidence about arbitrary propositions does seem like a useful way to train a system. I'm just suspicious that there's a fully Bayesian way to do everything you might want to do...

The Pointers Problem: Clarifications/Variations

Oh, well, satisfying the logical induction criterion is stronger than just PSPACE. I see debate, and iterated amplification, as attempts to get away with less than full logical induction. See https://www.lesswrong.com/posts/R3HAvMGFNJGXstckQ/relating-hch-and-logical-induction, especially Paul's comment https://www.lesswrong.com/posts/R3HAvMGFNJGXstckQ/relating-hch-and-logical-induction?commentId=oNPtnwTYcn8GixC59

The Pointers Problem: Clarifications/Variations

I don't have much to say other than that I agree with the connection. Honestly, thinking of it in those terms makes me pessimistic that it's true -- it seems quite possible that humans, given enough time for philosophical reflection, could point to important value-laden features of worlds/plans which are not PSPACE. 

Debate Minus Factored Cognition

I think the collusion concern basically over-anthropomorphizes the training process. Say, in prisoner's dilemma, if you train myopically, then "all incentives point toward defection" translates concretely to actual defection.

Granted, there are training regimes in which this doesn't happen, but those would have to be avoided.

OTOH, the concern might be that an inner optimizer would develop which colludes. This would have to be dealt with by more general anti-inner-optimizer technology.

I don’t know if you’ve seen our most recent debate rules and attempt at analysis of whether they provide the desired behavior—seems somewhat relevant to what you’re thinking about here.

Yep, I should take a look!

Debate Minus Factored Cognition

Basically, it sounds like you’re saying that we can get good answers by just running the whole debate and throwing out answers that turn out to have a defeater, or a defeater-defeater-defeater, or whatever. But if this is the only guarantee we’re providing, then we’re going to need to run an extremely large number of debates to ever get a good answer (ie an exp number of debates for a question where the explanation for the answer is exp-sized)

I'm not sure why you're saying this, but in the post, I restricted my claim to NP-like problems. So for example, traveling salesman -- the computation to find good routes may be very difficult, but the explanation for the answer remains short (EG an explicit path). So, yes, I'm saying that I don't see the same sort of argument working for exp-sized explanations. (Although Rohin's comment gave me pause, and I still need to think it over more.)

But aside from that, I'm also not sure what you mean by the "run an extremely large number of debates" point. Debate isn't like search, where we run more/longer to get better answers. Do you mean that my proposal seems to require longer training time to get anywhere? If so, why is that? Or, what do you mean?

It sounds like you’re saying that we can not require that the judge assume one player is honest/trust the claims lower in the debate tree when evaluating the claims higher in the tree. But if we can’t assume this, that presumably means that some reasonable fraction of all claims being made are dishonest

I'm not asserting that the judge should distrust, either. Like the normal debate argument, I want to end up in an honest equilibrium. So I'm not saying we need some kind of equilibrium where the judge is justified in distrust.

My concern involves the tricky relationship between the equilibrium we're after and what the judge has to actually do during training (when we might not be anywhere near equilibrium). I don't want the judge to have to pretend answers are honest at times when they're statistically not. I didn't end up going through that whole argument in the post (unfortunately), but in my notes for the post, the judge being able to judge via honest opinion at all times during training was an important criterion.

(because if there were only a few dishonest claims, then they’d have honest defeaters and we’d have a clear training signal away from dishonesty, so after training for a bit we’d be able to trust the lower claims).

I agree that that's what we're after. But I think maybe the difference in our positions can be captured if we split "honest" into two different notions...

a-honesty: the statement lacks an immediate (a-honest) counterargument. IE, if I think a statement is a-honest, then I don't think there's a next statement which you can (a-honestly) tell me which would make me disbelieve the statement.

b-honesty: the statement cannot be struck down by multi-step (b-honest) debate. IE, if I think a statement is b-honest, I think as debate proceeds, I'll still believe it.

Both definitions are recursive; their definitions require the rest of the debate being honest in the appropriate sense. However, my intuition is that a-honesty can more easily be established incrementally, starting with a slight pressure toward honesty (because it's supposedly easier in the first place), making the opening statements converge to honesty quickly (in response to the fact that honest defeaters in the first responses are relatively common), then the first responses, etc. On the other hand, converging to b-honesty seems relatively difficult to establish by induction; it seems to me that in order to argue that a particular level of the debate is b-honest, you need the whole remainder of the debate to be probably b-honest.
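The recursive shape of a-honesty can be sketched on a toy argument graph, under strong simplifying assumptions: the judge's belief is flattened to a boolean, and the graph of possible one-step counterarguments is finite and well-founded. The `Counters` mapping and the example statements are hypothetical.

```python
from typing import Dict, List

# Hypothetical toy debate: each statement maps to its possible
# one-step counterarguments.
Counters = Dict[str, List[str]]


def a_honest(statement: str, counters: Counters) -> bool:
    """A statement is a-honest iff it has no a-honest immediate counterargument."""
    return not any(a_honest(c, counters) for c in counters.get(statement, []))


counters: Counters = {
    "route A has length 6": ["edge 1-2 was miscounted"],
    "edge 1-2 was miscounted": ["a recount shows edge 1-2 is right"],
    "a recount shows edge 1-2 is right": [],
}

recount_ok = a_honest("a recount shows edge 1-2 is right", counters)  # True
objection_ok = a_honest("edge 1-2 was miscounted", counters)          # False
claim_ok = a_honest("route A has length 6", counters)                 # True
```

On a graph with a cycle or an infinite chain of counterarguments this recursion would never return, which is one way to picture the worry about argument trees that never bottom out.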

Now, critically, if the judge thinks debaters are a-honest but not b-honest, then the judge will believe NP-type arguments (a TSP path can be struck down by pointing out a single error), but not trust claimed outputs of exponential-tree computations.

So my intuition is that, trying to train for b-honesty, you get debaters making subtle arguments that push the inconsistencies ever-further-out, because you don't have the benefit of an inductive assumption where the rest of the debate is probably b-honest; you have no reason to inductively assume that debaters will follow a strategy where they recursively descend the tree to zero in on errors. They have no reason to do this if they're not already in that equilibrium.

This, in turn, means that judges of the debate have little reason to expect b-honesty, so shouldn't (realistically) assume that at least one of the debaters is honest; but this would exacerbate the problem further, since this would mean there is little training signal (for debates which really do rest on questions about exponential trees, that is). Hence the need to tell the judge to assume at least one debater is honest.

On the other hand, trying for a-honesty, individual a-dishonest claims can be defeated relatively easily (ie, in one step). This gives the judge a lot more reason to probabilistically conclude that the next step in the debate would have been a-honest, and thus, that all statements seen were probably a-honest (unless the judge sees an explicit defeater, of course).

Granted, I don't claim to have a training procedure which results in a-honesty, so I'm not claiming it's that easy.

At this point, debate isn’t really competitive, because it gives us dud answers almost all the time, and we’re going to have to run an exponential number of debates before we happen on a correct one.

Again, I don't really get the idea of running more debates. If the debaters are trained well, so they're following an approximately optimal strategy, we should get the best answer right away.

Are you suggesting we use debate more as a check on our AI systems, to help us discover that they’re bad, rather than as a safe alternative? Ie debate never produces good answers, it just lets you see that bad answers are bad?

My suggestion is certainly going in that direction, but as with regular debate, I am proposing that the incentives produced by debate could produce actually-good answers, not just helpful refutations of bad answers.

But also, the ‘amplified judge consulting sub-debates’ sounds like it’s just the same thing as letting the judge assume that claims lower in the debate are correct when evaluating claims higher in the tree.

You're right, it introduces similar problems. We certainly can't amplify the judge in that way at the stage where we don't even trust the debaters to be a-honest.

But consider:

Let's say we train "to convergence" with a non-amplified judge. (Or at least, to the point where we're quite confident in a-honesty.) Then we can freeze that version, and start using it as a helper to amplify the judge.

Now, we've already got a-honesty, but we're training for a*-honesty: a-honesty with a judge who can personally verify more statements (and thus recognize more sophisticated defeaters, and thus, trust a wider range of statements on the grounds that they could be defeated if false). We might have to shake up the debater strategies to get them to try to take advantage of the added power, so they may not even be a-honest for a while. But eventually they converge to a*-honesty, and can be trusted to answer a broader range of questions.

Again we freeze these debate strategies and use them to amplify the judge, and repeat the whole process.

So here, we have an inductive story, where we build up reason to trust each level. This should eventually build up to large computation trees of the same kind b-honesty is trying to compute.
