Your debate comes with some time limit T.
If T=0, use your best guess after looking at what the debaters said.
If T=N+1 and no debater challenges any of their opponent's statements, then give your best answer assuming that every debater could have defended each of their statements from a challenge in a length-N debate.
Of course this assumption won't be valid at the beginning of training. And even at the end of training, we really only know something weaker, like: "Neither debater thinks they would win by a significant expected margin in a length-N debate."
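To make the case split on T concrete, here is a minimal Python sketch of which instruction the judge follows. All names here (`Instruction`, `Statement`, `judge_instruction`) are hypothetical and only for illustration. The third branch, where a challenge does occur and the debate continues on the challenged statement with N steps left, is my assumption about the intended recursion; the rules above only spell out the T=0 case and the no-challenge case.

```python
from enum import Enum
from typing import NamedTuple, Sequence, Tuple


class Instruction(Enum):
    # T = 0: judge uses their best guess from what the debaters said.
    BEST_GUESS = "best_guess"
    # T = N+1, no challenges: assume every statement is defensible at depth N.
    ASSUME_DEFENSIBLE = "assume_defensible"
    # ASSUMPTION (not stated in the text): a challenge was made, so the
    # debate continues on the challenged statement with N steps remaining.
    FOLLOW_CHALLENGE = "follow_challenge"


class Statement(NamedTuple):
    speaker: str
    text: str
    challenged: bool = False  # True if the opponent challenged this statement


def judge_instruction(
    time_limit: int, statements: Sequence[Statement]
) -> Tuple[Instruction, int]:
    """Return the instruction that applies and the remaining depth N."""
    if time_limit < 0:
        raise ValueError("time limit T must be non-negative")
    if time_limit == 0:
        return Instruction.BEST_GUESS, 0
    remaining = time_limit - 1  # T = N + 1, so N = T - 1
    if not any(s.challenged for s in statements):
        return Instruction.ASSUME_DEFENSIBLE, remaining
    return Instruction.FOLLOW_CHALLENGE, remaining


if __name__ == "__main__":
    transcript = [
        Statement("debater_1", "The answer is A."),
        Statement("debater_2", "The answer is B."),
    ]
    # With T = 3 and no challenges, the judge assumes both statements
    # could have been defended in a length-2 debate.
    print(judge_instruction(3, transcript))
```

Note that the sketch only selects the instruction; what the judge should actually conclude under `ASSUME_DEFENSIBLE` is left open.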
What can you infer if you see answers A and B to a question and know that both of them are defensible (in expectation) in a depth-N debate? That's basically the open research question, with the hope being that you inductively make stronger and stronger inferences for larger N.
(This is very similar to asking when iterated amplification produces a good answer, up to the ambiguity about how you sample questions in amplification.)
(When we actually give judges instructions, for now we just tell them to assume that both debaters' answers are reasonable. If one debater gives arguments where the opposite claim would also be "reasonable," and the other debater gives arguments that are simple enough to be conclusively supported with the available depth, then the more helpful debater usually wins. Overall I don't think that precision about this is a bottleneck right now.)