Abram Demski


The Pointers Problem: Human Values Are A Function Of Humans' Latent Variables

Agreed. The problem is with AI designs which don't do that. It seems to me like this perspective is quite rare. For example, my post Policy Alignment was about something similar to this, but I got a ton of pushback in the comments -- it seems to me like a lot of people really think the AI should use better AI concepts, not human concepts. At least they did back in 2018.

As you mention, this is partly due to overly reductionist world-views. If tables/happiness aren't reductively real, the fact that the AI is using those concepts is evidence that it's dumb/insane, right?

Illustrative excerpt from a comment there:

From an “engineering perspective”, if I was forced to choose something right now, it would be an AI “optimizing human utility according to AI beliefs” but asking for clarification when such choice diverges too much from the “policy-approval”.

Probably most of the problem was that my post didn't frame things that well -- I was mainly talking in terms of "beliefs", rather than emphasizing ontology, which makes it easy to imagine AI beliefs are about the same concepts but just more accurate. John's description of the pointers problem might be enough to re-frame things to the point where "you need to start from human concepts, and improve them in ways humans endorse" is bordering on obvious.

(Plus I arguably was too focused on giving a specific mathematical proposal rather than the general idea.)

Recursive Quantilizers II

At MIRI we tend to take “we’re probably fine” as a strong indication that we’re not fine ;p

(I should remark that I don't mean to speak "for MIRI" and probably should have phrased myself in a way which avoided generalizing across opinions at MIRI.)

Yeah I have been and continue to be confused by this perspective, at least as an empirical claim (as opposed to a normative one). I get the sense that it’s partly because optimization amplifies and so there is no “probably”, there is only one or the other. I can kinda see that when you assume an arbitrarily powerful AIXI-like superintelligence, but it seems basically wrong when you expect the AI system to apply optimization that’s not ridiculously far above that applied by a human.

I think I would agree with this if you said "optimization that's at or below human level" rather than "not ridiculously far above".

Humans can be terrifying. The prospect of a system slightly smarter than any human who has ever lived, with values that are just somewhat wrong, seems not great. In particular, this system could do subtle things resulting in longer-term value shift, influence alignment research to go down particular paths, etc. (I realize the hypothetical scenario has at least some safeguards, so I won't go into more extreme scenarios like winning at politics hard enough to become world dictator and set the entire future path of humanity, etc. But I find this pretty plausible in a generic "moderately above human" scenario. Society rewards top performers disproportionately for small differences. Being slightly better than any human author could get you not only a fortune in book sales, but a hugely disproportionate influence. So it does seem to me like you'd need to be pretty sure of whatever safeguards you have in place, particularly given the possibility that you mis-estimate capability, and given the possibility that the system will improve its capabilities in ways you may not anticipate.)

But really, mainly, I was making the normative claim. A culture of safety is not one in which "it's probably fine" is allowed as part of any real argument. Any time someone is tempted to say "it's probably fine", it should be replaced with an actual estimate of the probability, or a hopeful statement that combined with other research it could provide high enough confidence (with some specific sketch of what that other research would be), or something along those lines. You cannot build reliable knowledge out of many many "it's probably fine" arguments; so at best you should carefully count how many you allow yourself.

A relevant empirical claim sitting behind this normative intuition is something like: "without such a culture of safety, humans have a tendency to slide into whatever they can get away with, rather than upholding safety standards".

This all seems pretty closely related to Eliezer's writing on security mindset.

Often just making feedback uncertain can help. For example, in the preference learning literature, Boltzmann rationality has emerged as the model of choice for how to interpret human feedback.

This seems like a perfect example to me. It works pretty well for current systems, but scaled up, it seems it would ultimately reach dramatically wrong ideas about what humans value.

Yup, totally agree. Make sure to update it as you scale up further.

You said that you don't think learning human values is a good target for "X", so I worry that focusing on this will be a bit unfair to your perspective. But it's also the most straightforward example, and we both seem to agree that it illustrates things we care about here. So I'm just going to lampshade the fact that I'll use "human values" as an example a lot in what follows.

The point of this is to be able to handle feedback at arbitrary levels, not to require feedback at arbitrary levels.

What’s the point of handling feedback at high levels if we never actually get feedback at those levels?

I think what's really interesting to me is making sure the system is reasoning at all those levels, because I have an intuition that that's necessary (to get concepts we care about right). Accepting feedback at all those levels is a proxy. (I want to include it in my list of criteria because I don't know a better way of operationalising "reasoning at all levels", and also, because I don't have a fixed meta-level at which I'd be happy to cap feedback. Capping meta-levels at something like 1K doesn't seem like it would result in a better research agenda.)

Sort of like Bayes' Law promises to let you update on anything. You wouldn't travel back in time and tell Bayes "what's the point of researching updating on anything, when in fact we only ever need to update on some relatively narrow set of propositions relating to the human senses?" It's not a perfect analogy, but it gets at part of the point.

My basic claim is that we've seen the same sorts of problems occur at multiple meta-levels, and each time, it's tempting to retreat to another meta-level. I therefore want a theory of (these particular sorts of) meta-levels, because it's plausible to me that in such a context, we can solve the general problem rather than continue to push it back. Or at least, that it would provide tools to better understand the problem.

There's a perspective in which "having a fixed maximum meta-level at all" is pretty directly part of the problem. So it's natural to see if we can design systems which don't have that property.

From this perspective, it seems like my response to your "incrementally improve loss functions as capability levels rise" perspective should be:

It seems like this would just be a move you'd eventually want to make, anyway.

  • At some point, you don't want to keep designing safe policies by hand; you want to optimize them to minimize some loss function.
  • At some point, you don't want to keep designing safe loss functions by hand; you want to do value learning.
  • At some point, you don't want to keep inventing better and better value-learning loss functions by hand; you want to learn-to-learn.
  • At some point, you won't want to keep pushing back meta-levels like this; you'll want to do it automatically.

From this perspective, I'd just be looking ahead in the curve. Which is pretty much what I think I'm doing anyway.

So although the discussion of MIRI-style security mindset, and of just how approximately-right safety concepts need to be, seems relevant, it might not be the crux.

Perhaps another way of framing it: suppose we found out that humans were basically unable to give feedback at level 6 or above. Are you now happy having the same proposal, but limited to depth 5? I get the sense that you wouldn’t be, but I can’t square that with “you only need to be able to handle feedback at high levels but you don’t require such feedback”.

This depends. There are scenarios where this would significantly change my mind.

But let's suppose humans have trouble with 6 or above just because it's hard to keep that many meta-levels in working memory. How would my proposal function in this world?

We want the system to extrapolate to the higher levels, figuring out what humans (implicitly) believe. But a consistent (and highly salient) extrapolation is the one which mimics human ineptitude at those higher levels. So we need to be careful about what we mean by "extrapolate".

What we want the system to do is reason as if it received the feedback we would have given if we had more working memory (to the extent that we endorse "humans with more working memory" as better reasoners). My proposal is that a system should be taught to do exactly this.

This is where my proposal differs from proposals more reliant on human imitation. Any particular thing we can say about what better reasoning would look like, the system attempts to incorporate.

Another way humans indirectly give evidence about higher levels is through their lower-level behavior. To some extent, we can infer from a human applying a specific form of reasoning, that the human reflectively endorses that style of reasoning. This idea can be used to transfer information about level N to some information about level N+1. But the system should learn caution about this inference, by observing cases where it fails (cases where humans habitually reason in a particular way, but don't endorse doing so), as well as by direct instruction.

Even if humans only ever provide feedback about finitely many meta levels, the idea is for the system to generalize to other levels.

I don’t super see how this happens but I could imagine it does. (And if it did it would answer my question above.) I feel like I would benefit from concrete examples with specific questions and their answers.

A lot of ideas apply to many meta-levels; EG, the above heuristic is an example of something which generalizes to many meta-levels. (It is true in general that you can make a probabilistic inference about level N+1 by supposing level-N activity is probably endorsed; and, factors influencing the accuracy of this heuristic probably generalize across levels. Applying this to human behavior might only get us examples at a few meta-levels, but the principle should also be applied to idealized humans, EG the model of humans with more working memory. So it can continue to bear fruit at many meta-levels, even when actual human feedback is not available.)

Importantly, process-level feedback usually applies directly to all meta-levels. This isn't a matter of generalization of feedback to multiple levels, but rather, direct feedback about reasoning which applies at all levels.

For example, humans might give feedback about how to do sensible probabilistic reasoning. This information could be useful to the system at many meta-levels. For example, it might end up forming a general heuristic that its value functions (at every meta-level) should be expectation functions which quantify uncertainty about important factors. (Or it could be taught this particular idea directly.)

More importantly, anti-daemon ideas would apply at every meta-level. Every meta-level would include inner-alignment checks as a heavily weighted part of its evaluation function, and at all levels, proposal distributions should heavily avoid problematic parts of the search space.

Okay, that makes sense. It seemed to me like since the first few bits of feedback determine how the system interprets all future feedback, it’s particularly important for those first few bits to be correct and not lock in e.g. a policy that ignores all future feedback.

In principle, you could re-start the whole training process after each interaction, so that each new piece of training data gets equal treatment (it's all part of what's available "at the start"). In practice that would be intractable, but that's the ideal which practical implementations should aim to approximate.

So, yes, it's a problem, but it's one that implementations should aim to mitigate.

(Part of how one might aim to mitigate this is to teach the system that it's a good idea to try to approximate this ideal. But then it's particularly important to introduce this idea early in training, to avoid the failure mode you mention; so the point stands.)

For example, Learning to summarize from human feedback does use Boltzmann rationality, but could finetune GPT-3 to e.g. interpret human instructions pragmatically. This interpretation system can apply “at all levels”, in the same way that human brains can apply similar heuristics “at all levels”.

(There are still issues with just applying the learning from human preferences approach, but they seem to be much more about “did the neural net really learn the intended concept” / inner alignment, rather than “the neural net learned what to do at level 1 but not at any of the higher levels”.)

It seems to me like you're trying to illustrate something like "Abram's proposal doesn't get at the bottlenecks".

I think it's pretty plausible that this agenda doesn't yield significant fruit with respect to several important alignment problems, and instead (at best) yields a scheme which would depend on other solutions to those particular problems.

It's also plausible that those solutions would, themselves, be sufficient for alignment, rendering this research direction extraneous.

In particular, it's plausible to me that iterated amplification schemes (including Paul's schemes and mine) require a high level of meta-competence to get started, such that achieving that initial level of competence already requires a method of aligning superhuman intelligence, making anything else unnecessary. (This was one of Eliezer's critiques of iterated amplification.)

However, the world you describe, in which alignment tech remains imperfect (wrt scaling) for a long time, but we can align successively more intelligent agents with successively refined tools, is not one of those worlds. In that world, it is possible to make incrementally more capable agents incrementally more perfectly aligned, until that point at which we have something smart (and aligned) enough to serve as the base case for an iterated amplification scheme. In that world, the scheme I describe could be just one of the levels of alignment tech which end up useful at some point.

One example of how to do this is to use X = “revert to a safe baseline policy outside of <whitelist>”, and enlarge the whitelist over time. In this case “failing to scale” is “our AI system couldn’t solve the task because our whitelist hobbled it too much”.

I'm curious how you see whitelisting working.

It feels like your beliefs about what kinds of methods might work for "merely way-better-than-human" systems are a big difference between you and me, which might be worth discussing more, although I don't know if it's very central to everything else we're discussing.

Recursive Quantilizers II

Thanks for the review, btw! Apparently I didn't think to respond to it before.

**On feedback types:** It seems like the scheme introduced here is relying quite strongly on the ability of humans to give good process-level feedback _at arbitrarily high levels_. It is not clear to me that this is something humans can do: it seems to me that when thinking at the meta level, humans often fail to think of important considerations that would be obvious in an object-level case. I think this could be a significant barrier to this scheme, though it’s hard to say without more concrete examples of what this looks like in practice.

I agree that this is a significant barrier -- humans have to be able to provide significant information about a significant number of levels for this to work.

However, I would emphasize two things:

  • The point of this is to be able to handle feedback at arbitrary levels, not to require feedback at arbitrary levels. This creates a system which is not limited to optimizing at some finite meta-level.
  • Even if humans only ever provide feedback about finitely many meta levels, the idea is for the system to generalize to other levels. This could provide nontrivial, useful information at very high meta-levels. For example, the system could learn anti-wireheading and anti-manipulation patterns relevant to all meta-levels. This is kind of the whole point of the setup -- most of these ideas originally came out of thinking about avoiding wireheading and manipulation, and how "going up a meta-level" seems to make some progress, but not eliminate the essential problem.

**On interaction:** I’ve previously argued (in Human-AI Interaction) that it is important to get feedback _online_ from the human; giving feedback “all at once” at the beginning is too hard to do well. However, the idealized algorithm here does have the feedback “all at once”. It’s possible that this is okay, if it is primarily process-level feedback, but it seems fairly worrying to me.

My intention is for the procedure to be interactive; however, I definitely haven't emphasized how that aspect would work.

I don't think you could get very good process-level feedback without humans actually examining examples of the system's processing at some point. Although I also think the system should learn from artificially constructed examples which humans use to demonstrate catastrophically bad behavior.

**On desiderata:** The desiderata introduced in the first post feel stronger than they need to be. It seems possible to specify a method of interpreting feedback that is _good enough_: it doesn’t exactly capture everything, but it gets it sufficiently correct that it results in good outcomes. This seems especially true when talking about process-level feedback, or feedback one meta level up -- as long as the AI system has learned an okay notion of “being helpful” or “being corrigible”, then it seems like we’re probably fine.

Partly I want to defend the "all meta levels" idea as an important goalpost rather than as strictly necessary -- yes, maybe it's stronger than necessary, but wouldn't it be interesting to end up in a place where we didn't have to worry about whether we'd supported enough meta-levels? I wasn't thinking very much about necessity when I wrote the criteria down. Instead, I was trying to articulate a vision which I had a sense would be interesting.

As discussed in Normativity, this is about what ideal alignment really would be. How does the human concept of "should" work? What kind of thing can we think of "human values" as? Whether it's necessary/possible to make compromises is a separate question.

But partly I do want to defend it as necessary -- or rather, necessary in the absence of a true resolution of problems at a finite meta-level. It's possible that problems of AI safety can be solved a different way, but if we could solve them this way, we'd be set. (So I guess I'm saying, sufficiency seems like the more interesting question than necessity.)

 It seems possible to specify a method of interpreting feedback that is _good enough_

My question is: do you think there's a method that's good enough to scale up to arbitrary capability? IE, both on the capability side and the alignment side:

  • Does it seem possible to pre-specify some fixed way of interpreting feedback, which will scale up to arbitrarily capable systems? IE, when I say a very capable system "understands" what I want, does it really seem like we can rely on a fixed notion of understanding, even thinking only of capabilities?
  • Especially for alignment purposes, don't you expect any fixed model of interpreting feedback to be too brittle by default, and somehow fall apart when a sufficiently powerful intelligence is interpreting feedback in such a fixed way?

I'm happy for a solution at a fixed meta-level to be found, but in its absence, I prefer something meeting the criteria I outline, where (it seems to me) we can tell the system everything we've come up with so far about what a good solution would look like.

as long as the AI system has learned an okay notion of “being helpful” or “being corrigible”, then it seems like we’re probably fine.

At MIRI we tend to take "we're probably fine" as a strong indication that we're not fine ;p

More seriously: I think "being corrigible" is an importantly highly-meta concept. Quoting from Zhu's FAQ:

The property we’re trying to guarantee is something closer to “alignment + extreme caution about whether it’s aligned and cautious enough”. Paul usually refers to this as corrigibility.

This extreme caution is importantly recursive; a corrigible agent isn't just cautious about whether it's aligned, it's also cautious about whether it's corrigible.

This is important for Paul's agenda because corrigibility needs to be preserved (and indeed, improved) across many levels of iterated amplification and distillation. This kind of recursive definition is precisely what we need for that.

It's similarly important for any situation where a system could self-improve many times.

Even outside that context, I just don't know that it's possible to specify a very good notion of "corrigibility" at a finite meta-level. It's kind of about not trusting any value function specified at any finite meta-level.

I also think most approximate notions of "being helpful" will be plagued by human manipulation or other problems.

Often just making feedback uncertain can help. For example, in the preference learning literature, Boltzmann rationality has emerged as the model of choice for how to interpret human feedback.

This seems like a perfect example to me. It works pretty well for current systems, but scaled up, it seems it would ultimately reach dramatically wrong ideas about what humans value. (In particular, it ultimately must think the highest-utility action is the most probable one, an assumption which will engender poor interpretations of situations in which errors are more common than 'correct' actions, such as those common to the heuristics and biases literature.)
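To make the worry concrete, here is a minimal sketch of the Boltzmann-rational feedback model, P(action) ∝ exp(β · U(action)). This is my own toy example, not from the original discussion; the utilities and the "systematic bias" are hypothetical. The point is just that as the assumed rationality coefficient β grows, the model is forced to treat whatever the human reliably does as utility-maximizing:

```python
import math

def boltzmann_probs(utilities, beta):
    # Boltzmann-rational feedback model: P(action) is proportional to
    # exp(beta * U(action)); beta is the assumed rationality coefficient.
    weights = [math.exp(beta * u) for u in utilities]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical utilities for three actions. Suppose the human
# *systematically* picks action 0 (a bias of the kind documented in the
# heuristics-and-biases literature), even though action 2 has the highest
# utility.
utilities = [1.0, 0.0, 2.0]

for beta in [0.1, 1.0, 10.0]:
    print(beta, boltzmann_probs(utilities, beta))
```

At β = 0.1 the predicted choice distribution is nearly uniform; at β = 10 the model puts over 99% of its probability on the argmax action. So a scaled-up system fitting a large β to a confident-looking human must conclude that whatever the human reliably does is what they value most, which is exactly the failure mode described above.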

That being said, the second post introduced a different argument: that the method does as well as we “could” do given the limits of human reasoning. I like this a lot more as a desideratum; it feels more achievable and more tied to what we care about.

Yeah, although I stand by my desiderata as stated, I do not think I've yet done a good job of explaining why all the desiderata are important and how they connect into a big picture, or even, exactly what problems I'm trying to address.

Recursive Quantilizers II

Here are some thoughts on how to fix the issues with the proposal.

  1. The initial proposal distribution should be sampling things close to the initial QAS, not sampling totally random neural networks / etc.

This addresses the "quantilizers can't do the work of finding powerful QASs" problem. Instead, quantilizers are only doing the work of improving QASs. We don't need to assume smart, aligned QASs are significantly more common than malign QASs; instead, we only need to assume that catastrophic modifications are rare in the local neighborhood of something aligned.
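For concreteness, here is a minimal sketch of quantilization in this setting (my own toy rendering, not from the post; the Gaussian perturbation scheme and the value function are hypothetical stand-ins):

```python
import random

def quantilize(base_sampler, value_fn, q, n=1000):
    # Draw n candidates from the base distribution, keep the top q fraction
    # by value, and return one of those uniformly at random. This limits
    # optimization pressure relative to a pure argmax search.
    candidates = sorted((base_sampler() for _ in range(n)),
                        key=value_fn, reverse=True)
    top = candidates[:max(1, int(q * n))]
    return random.choice(top)

# Toy stand-in for "sampling close to the current QAS": small Gaussian
# perturbations of the current parameters, rather than totally random ones.
current_qas = [0.0, 0.0]
base_sampler = lambda: [x + random.gauss(0, 0.1) for x in current_qas]
value_fn = lambda params: -sum(p * p for p in params)  # hypothetical value function

next_qas = quantilize(base_sampler, value_fn, q=0.1)
```

The safety argument then lives entirely in the base distribution: if random samples near an aligned QAS are rarely catastrophic, then sampling from the top q fraction multiplies the probability of catastrophe by at most 1/q.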

This approach has some problems.

1a. We need to start out with a QAS in the same format as the ones we are trying to train (eg a big neural net).

This is because (I presume) it's pretty difficult to sample nearby QASs otherwise.

This is potentially annoying because it's easier to build the initial value function and proposal distribution out of regular code.

1b. We need a distance metric which we can easily sample from, while also being quite confident that "close QASs are safe QASs" (if chosen at random).

We need to be pretty confident that close QASs are substantially similar; eg, there's no "flip the value function" move which would count as "nearby" in our metric. This suggests a metric on outputs (ie, answers to questions) rather than directly on representation (eg, neural network weights). This is obviously more difficult to sample from.

Also, a single modified answer can easily be catastrophic down the line; eg, a "nearby" QAS might not directly flip the value function, but it might modify the value function to prefer a flip, while keeping other aspects identical. So we can come up with yet-more-sophisticated distance functions which check for closeness in higher-order properties (ie, close-0 might be similarity in answers, while close-1 is close-0 similarity in children's answers, close-2 is the close-0 similarity of grandchildren, etc). Each increment makes sampling significantly more difficult, but makes it more difficult for random sampling to bring catastrophe.
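A toy rendering of the close-0/close-1/close-2 idea (the `ToyQAS` class and its drifting `child` are hypothetical stand-ins I'm introducing for illustration): two QASs can be close-0 (identical answers today) yet far apart at close-1, because the difference only surfaces in their successors.

```python
class ToyQAS:
    # Hypothetical stand-in for a question-answering system. `drift` is a
    # hidden parameter that doubles in each successor ("child") QAS.
    def __init__(self, drift):
        self.drift = drift

    def answer(self, q):
        return q + round(self.drift)

    def child(self):
        # The successor QAS this one would produce.
        return ToyQAS(self.drift * 2)

def answer_distance(a, b, questions):
    # close-0: fraction of probe questions answered differently.
    return sum(a.answer(q) != b.answer(q) for q in questions) / len(questions)

def close_k(a, b, questions, k):
    # close-k: close-0 distance now, combined (here via max) with the
    # close-(k-1) distance between the two systems' successors.
    d = answer_distance(a, b, questions)
    if k == 0:
        return d
    return max(d, close_k(a.child(), b.child(), questions, k - 1))

questions = [1, 2, 3]
honest, drifter = ToyQAS(0.0), ToyQAS(0.4)
print(close_k(honest, drifter, questions, 0))  # 0.0 -- identical answers today
print(close_k(honest, drifter, questions, 1))  # 1.0 -- their children disagree everywhere
```

The drifter is invisible to a purely close-0 metric, which is exactly why each increment of the metric buys extra protection at the cost of harder sampling.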

There may be other distance metrics which capture what's important even better.

The question is: at what point are we satisfied that catastrophe is improbable enough? IE, what distance metric allows us to turn up our quantilizer's optimization sufficiently?

  2. The next stage of quantilization should always be an improvement in value; perhaps, always an improvement in several relevant dimensions of value.

This is intended to address the "iterated quantilization accumulates risk of drift" problem.

The requirement is close to "the quantilizers need to be sampling at most the top 50%", but that's not quite right, because the current QAS might not be at 50%. If it's hard to sample improvements (so most samples are worse by our value function), the current QAS might be high in the ranking.

This gives us a stopping condition: the system should never continue if it can't safely eliminate current and past QASs from the running. Each step should be better than what's come before.

It's also possible we should require each new QAS to be "better in every way"; ie, require simultaneous improvement along several important dimensions. For example, improvement in raw score on example questions, improvement on "safety of cognition" score (ie, evaluation of how good a thought process looks when it's examined by the current QAS), improvement on "quality of proposal distribution", improvement on "quality of value function".

Each additional requirement makes it more difficult to proceed with quantilization, and hence, more probable that the process will stop early.

It's particularly important that the next QAS has a lower chance of catastrophe -- indeed, it should be significantly lower. For example, we might require it to cut chance of catastrophe in half. (I'm not working out how risk accumulates in detail; this might not be a good proposal.)

Chance of catastrophe is as evaluated by previous QASs. So it's critical that the initial QAS has a good model of how to evaluate risk of catastrophe. This is, of course, difficult. Moreover, the quality of this evaluation is another critical feature to preserve and improve. One bad risk evaluation in the chain can itself become a catastrophe.

Fixing The Good Regulator Theorem

I'm really happy to see this. I've had similar thoughts about the Good Regulator Theorem, but didn't take the time to write them up or really pursue the fix.

Marginally related: my hope at some point was to fix the Good Regulator Theorem and then integrate it with other representation theorems, to come up with a representation theorem which derived several things simultaneously:

  1. Probabilistic beliefs (or some appropriate generalization).
  2. Expected utility theory (or some appropriate generalization).
  3. A notion of "truth" based on map/territory correspondence (or some appropriate modification). This is the Good Regulator part.

The most ambitious version I can think of would address several more questions:

A. Some justification for the basic algebra. (I think sigma-algebras are probably not the right algebra to base beliefs on. Something resembling linear logic might be better for reasons we've discussed privately; that's very speculative of course. Ideally the right algebra should be derived from considerations arising in construction of the representation theorem, rather than attempting to force any outcome top-down.) This is related to questions of what's the right logic, and should touch on questions of constructivism vs platonism. Due to point #3, it should also touch on formal theories of truth, particularly if we can manage a theorem related to embedded agency rather than cartesian (dualistic) agency.

B. It should be better than CCT in that it should represent the full preference ordering, rather than only the optimal policy. This may or may not lead to InfraBayesian beliefs/expected values (the InfraBayesian representation theorem being a generalization of CCT which represents the whole preference ordering).

C. It should ideally deal with logical uncertainty, not just the logically omniscient case. This is hard. (But your representation theorem for logical induction is a start.) Or failing that, it should at least deal with a logically omniscient version of Radical Probabilism, ie address the radical-probabilist critique of Bayesian updating. (See my post Radical Probabilism; currently typing on a phone, so getting links isn't convenient.)

D. Obviously it would ideally also deal with questions of CDT vs EDT (ie present a solution to the problem of counterfactuals).

E. And deal with tiling problems, perhaps as part of the basic criteria.

Debate Minus Factored Cognition

I think you also need that at least some of the time good arguments are not describably bad

While I agree that there is a significant problem, I'm not confident I'd want to make that assumption.

As I mentioned in the other branch, I was thinking of differences in how easy lies are to find, rather than existence. It seems natural to me to assume that every individual thing does have a convincing counterargument, if we look through the space of all possible strings (not because I'm sure this is true, but because it's the conservative assumption -- I have no strong reason to think humans aren't that hackable, even if we are less vulnerable to adversarial examples in some sense).

So my interpretation of "finding the honest equilibrium" in debate was, you enter a regime where the (honest) debate strategies are too powerful, such that small mutations toward lying are defeated because they're not lying well.

All of this was an implicit model, not a carefully thought out position on my part. Thus, I was saying things like "50% probability the opponent finds a plausible lie" which don't make sense as an equilibrium analysis -- in true equilibrium, players would know all the plausible lies, and know their opponents knew them, etc.

But, this kind of uncertainty still makes sense for any realistic level of training.

Furthermore, one might hope that the rational-player perspective (in which the risks and rewards of lying are balanced in order to determine whether to lie) simply doesn't apply, because in order to suddenly start lying well, a player would have to invent the whole art of lying in one gradient descent step. So, if one is sufficiently stuck in an honesty "basin", one cannot jump over the sides, even if there are perfectly good plays which involve doing so. I offer this as the steelman of the implicit position I had.

Overall, making this argument more explicit somewhat reduces my credence in debate, because:

  • I was not explicitly recognizing that talk of "honest equilibrium" relies on assumptions about misleading counterarguments not existing, as opposed to weaker assumptions about them being hard to find (I think this also applies to regular debate, not just my framework here)
  • Steelmanning "dishonest arguments are harder to make" as an argument about training procedures, rather than about equilibrium, seems to rest on assumptions which would be difficult to gain confidence in.

-2/+1 Scoring

It's worth explicitly noting that this weakens my argument for the -2/+1 scoring.

I was arguing that although -2/+1 can seriously disadvantage honest strategies in some cases (as you mention, it could mean the first player can lie, and the second player keeps silent to avoid retribution), it fixes a problem within the would-be honest attractor basin. Namely, I argued that it cut off otherwise problematic cases where dishonest players can force a tie (in expectation) by continuing to argue forever.
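To make the tie-forcing concern concrete, here is a toy expected-value calculation (my own simplification, not the OP's full setup): suppose each extra round of argument gives the continuing player an independent chance p of winning. The function names and the "independent chance p" model are assumptions introduced purely for illustration.

```python
# Toy model (illustrative assumption, not the OP's formal setup): each extra
# round of argument gives the continuing player an independent chance p of
# winning. Under symmetric +1/-1 scoring a 50/50 gamble has EV 0, so a
# dishonest player loses nothing by arguing forever; under -2/+1 scoring
# (lose: -2, win: +1) the gamble is negative-EV unless p > 2/3.

def ev_symmetric(p):
    return p * 1 + (1 - p) * (-1)

def ev_minus2_plus1(p):
    return p * 1 + (1 - p) * (-2)

print(ev_symmetric(0.5))      # 0.0  -- continuing to argue is free
print(ev_minus2_plus1(0.5))   # -0.5 -- continuing to argue is penalized
```

Under this toy model, -2/+1 makes indefinite continuation a losing strategy for any player whose per-round win probability is below 2/3.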

Now, the assumptions under which this is a problem are somewhat complex (as we've discussed). But I must assume there is a seeming counterargument to almost anything (at least, enough that the dishonest player can steer toward conversational territory in which this is true). Which means we can't be making an argument about the equilibrium being good. Therefore, if this concern is relevant for us, we must be arguing about training rather than equilibrium behavior. (In the sense I discussed above.)

But if we're arguing about training, we hopefully still have some assumption about lies being harder to find (during training). So, there should already be some other way to argue that you can't go on dishonestly arguing forever.

So the situation would have to be pretty weird for -2/+1 to be useful.

(I don't by any means intend to say that "a dishonest player continuing to argue in order to get a shot at not losing" isn't a problem -- just that if it's a problem, it's probably not a problem -2/+1 scoring can help with.)

Debate Minus Factored Cognition

I think in this comment thread I’ve been defining an honest answer as one that can be justified via arguments that eventually don’t have any defeaters. I thought this was what you were going for since you started with the assumption that dishonest answers always have defeaters—while this doesn’t strictly imply my definition, that just seemed like the obvious theoretical model to be using. (I didn’t consciously realize I was making that assumption.)

Right, I agree -- I was more or less taking that as a definition of honesty. However, this doesn't mean we'd want to take it as a working definition of correctness, particularly not for WFC.
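That working definition of honesty can be sketched as a recursive check over a (finite, acyclic) graph of arguments and their defeaters. This is my own toy rendering of "justified via arguments that eventually don't have any defeaters"; the `defeaters` graph and function names are illustrative assumptions.

```python
# Toy model (my illustration, not from the original post): an answer is
# "honest" if every counterargument (defeater) against it is itself defeated.
# Defeat relations form a finite acyclic graph, so the recursion is
# well-founded -- matching "eventually don't have any defeaters".

defeaters = {
    "A": ["B"],   # B attacks A
    "B": ["C"],   # C attacks B
    "C": [],      # C has no defeaters, so it stands
}

def undefeated(arg):
    """An argument stands iff every defeater of it is itself defeated."""
    return all(not undefeated(d) for d in defeaters[arg])

print(undefeated("A"))  # True: B attacks A, but C defeats B
```

Here "A" counts as honest: its only defeater "B" is itself defeated by the undefeated "C".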

Re: correctness, I think I actually misled you with my last comment; I lost track of the original point. I endorse the thing I said as a definition of what I’m usually hoping for with debate, but I don’t think that was the definition I was using here.

It sounds like you are saying you intended the first case I mentioned in my previous argument, IE:

If I suppose “correct” is close to has an honest argument which gives enough information to convince a human (let’s call this correct₁), then I would buy your original argument. Yet this would do little to connect my argument to factored cognition.

Do you agree with my conclusion that your argument would, then, have little to do with factored cognition? (If so, I want to edit my first reply to you to summarize the eventual conclusion of this and other parts of the discussion, to make it easier on future readers -- so I'm asking if you agree with that summary.)

To elaborate: the correct₁ version of WFC says, essentially, that NP-like problems (more specifically: informal questions whose answers have supporting arguments which humans can verify, though humans may also incorrectly verify wrong answers/arguments) have computation trees which humans can inductively verify.

This is at best a highly weakened version of factored cognition, and generally, deals with a slightly different issue (ie tries to deal with the problem of verifying incorrect arguments).
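The "computation trees which humans can inductively verify" idea can be sketched as follows. This is my toy rendering of the weakened WFC claim, with `check_step` standing in for the human verifier; the `Node` structure and the sum-checking example are assumptions for illustration only.

```python
# Toy sketch of inductive verification of a computation tree (my rendering,
# not the OP's formalism): each node's claim must follow from its children's
# claims by one humanly checkable step; the root is accepted iff every node
# passes its local check.

from dataclasses import dataclass, field

@dataclass
class Node:
    claim: str
    children: list = field(default_factory=list)

def verify(node, check_step):
    """Accept the root claim iff every local step in the tree checks out."""
    return check_step(node.claim, [c.claim for c in node.children]) and \
        all(verify(c, check_step) for c in node.children)

# Stand-in "human verifier": leaves are accepted outright; an internal
# claim must equal the sum of its children's claims.
check = lambda claim, kids: not kids or int(claim) == sum(int(k) for k in kids)

tree = Node("3", [Node("1"), Node("2")])
print(verify(tree, check))  # True
```

Note that, as the parenthetical above stresses, a fallible `check_step` can also accept wrong steps, so passing this verification does not guarantee truth.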

I still think that working with this “definition” is an interesting theoretical exercise, though I agree it doesn’t correspond to reality. Looking back I can see that you were talking about how this “definition” doesn’t actually correspond to the realistic situation, but I didn’t realize that’s what you were saying, sorry about that.

I think you are taking this somewhat differently than I am taking this. The fact that correct₁ doesn't serve as a plausible notion of "correctness" (in your sense) and that honest doesn't serve as a plausible notion of "honesty" (in the sense of getting the AI system to reveal all information it has) isn't especially a crux for the applicability of my analysis, imho. My crux is, rather, the "no indescribably bad argument" thesis.

If bad arguments are always describably bad, then it's plausible that some debate method could systematically avoid manipulation and perform well, even if stronger factored-cognition type theses failed. Which is the main point here.

Debate Minus Factored Cognition

Sorry, I don't get this. How could we make the argument that the probability is below 50%?

I think my analysis there was not particularly good, and only starts to make sense if we aren't yet in equilibrium.

Depending on the answer, I expect I'd follow up with either
3. Why is it okay for us to assume that we're in an honest-enough regime?

I think #3 is the most reasonable, with the answer being "I have no reason why that's a reasonable assumption; I'm just saying, that's what you'd usually try to argue in a debate context..."

(As I stated in the OP, I have no claims as to how to induce honest equilibrium in my setup.)

Debate Minus Factored Cognition

When I say that you get the correct answer, or the honest answer, I mean something like "you get the one that we would want our AI systems to give, if we knew everything that the AI systems know". An alternative definition is that the answer should be "accurately reporting what humans would justifiably believe given lots of time to reflect" rather than "accurately corresponding to reality".

Right, OK.

So my issue with using "correct" like this in the current context is that it hides too much and creates a big risk of conflation. By no means do I assume -- or intend to argue -- that my debate setup can correctly answer every question in the sense above. Yet, of course, I intend for my system to provide "correct answers" in some sense. (A sense which has less to do with providing the best answer possible from the information available, and more to do with avoiding mistakes.)

If I suppose "correct" is close to has an honest argument which gives enough information to convince a human (let's call this correct₁), then I would buy your original argument. Yet this would do little to connect my argument to factored cognition.

If I suppose "correct" is close to what HCH would say (correct₂), then I still don't buy your argument at all, for precisely the same reason that I don't buy the version where "correct" simply means "true" -- namely, because correct₂ answers don't necessarily win in my debate setup, any more than true answers do.

Of course neither of those would be very sensible definitions of "correct", since either would make the WFC claim uninteresting.

Let's suppose that "correct" at least includes answers which an ideal HCH would give (IE, assuming no alignment issues with HCH, and assuming the human uses pretty good question-answering strategies). I hope you think that's a fair supposition -- your original comment was trying to make a meaningful statement about the relationship between my thing and factored cognition, so it seems reasonable to interpret WFC in that light.

I furthermore suppose that actual literal PSPACE problems can be safely computed by HCH. (This isn't really clear, given safety restrictions you'd want to place on HCH, but we can think about that more if you want to object.)

So my new counterexample is PSPACE problems. Although I suppose an HCH can answer such questions, I have no reason to think my proposed debate system can. Therefore I think the tree you propose (which iiuc amounts to a proof of "A is never fully defeated") won't systematically be correct (A may be defeated by virtue of its advocate not being able to provide the human with enough reason to think it is true).


Other responses:

In this case they're acknowledging that the other player's argument is "correct" (i.e. more likely than not to win if we continued recursively debating). While this doesn't guarantee their loss, it sure seems like a bad sign.

In this position, I would argue to the judge that not being able to identify specifically which assumption of my opponent's is incorrect does not indicate concession, precisely because my opponent may have a complex web of argumentation which hides the contradiction deep in the branches or pushes it off to infinity.

Yes, I agree this is true under those specific rules. But if there was a systematic bias in this way, you could just force exploration of both player's arguments in parallel (at only 2x the cost).

Agreed -- I was only pointing out that the setup you linked didn't have the property you mentioned, not that it would be particularly hard to get.

Debate Minus Factored Cognition

I don't think I ever use "fully defeat" in a leaf? It's always in a Node, or in a Tree (which is a recursive call to the procedure that creates the tree).

Ahhhhh, OK. I missed that that was supposed to be a recursive call, and interpreted it as a leaf node based on the overall structure. So I was still missing an important part of your argument. I thought you were trying to offer a static tree in that last part, rather than a procedure.
