# All of abramdemski's Comments + Replies

My Current Take on Counterfactuals

Is there a way to operationalize "respecting logic"? For example, a specific toy scenario where an infra-Bayesian agent would fail due to not respecting logic?

"Respect logic" means either (a) assigning probability one to tautologies (at least, to those which can be proved in some bounded proof-length, or something along those lines), or (b) assigning probability zero to contradictions (again, modulo boundedness). These two properties should be basically equivalent (i.e., imply each other) provided the proof system is consistent. If it's inconsistent, they i...
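
To make conditions (a) and (b) concrete, here is a toy check in Python. It is only a sketch under stated assumptions: sentences are propositional, a brute-force truth table stands in for bounded proof search, and the names `truth_values`, `respects_logic`, `SENTENCES`, and the example credences are hypothetical illustrations rather than anything taken from the infra-Bayesian or logical induction literature.

```python
from itertools import product


def truth_values(formula, atoms):
    """Evaluate a formula (a function of a world dict) in every truth assignment."""
    return [
        formula(dict(zip(atoms, values)))
        for values in product([False, True], repeat=len(atoms))
    ]


def respects_logic(credence, sentences, atoms, tol=1e-9):
    """True if every tautology has credence 1 and every contradiction credence 0."""
    for name, formula in sentences.items():
        values = truth_values(formula, atoms)
        if all(values) and abs(credence[name] - 1.0) > tol:
            return False  # a tautology with credence below one
        if not any(values) and abs(credence[name]) > tol:
            return False  # a contradiction with credence above zero
    return True


ATOMS = ["A"]
SENTENCES = {
    "A or not A": lambda w: w["A"] or not w["A"],
    "A and not A": lambda w: w["A"] and not w["A"],
    "A": lambda w: w["A"],
}

# Hypothetical credences: the first respects logic, the second does not.
coherent = {"A or not A": 1.0, "A and not A": 0.0, "A": 0.3}
incoherent = {"A or not A": 0.9, "A and not A": 0.0, "A": 0.3}

print(respects_logic(coherent, SENTENCES, ATOMS))
print(respects_logic(incoherent, SENTENCES, ATOMS))
```

The contingent sentence "A" is unconstrained by either condition, which is why only the credence on the tautology separates the two examples.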

[2] Vanessa Kosoy, 2d: I guess we can try studying Troll Bridge using infra-Bayesian modal logic, but atm I don't know what would result. Ah, but there is a sense in which it doesn't. The radical update rule is equivalent to updating on "secret evidence". And in TRL we have such secret evidence. Namely, if we only look at the agent's beliefs about "physics" (the environment), then they would be updated radically, because of secret evidence from "mathematics" (computations).

Reflective Bayesianism

This post seemed to be praising the virtue of returning to the lower-assumption state. So I argued that in the example given, it took more than knocking out assumptions to get the benefit.

Agreed. Simple Bayes is the hero of the story in this post, but that's more because the simple bayesian can recognize that there's something beyond.

Phylactery Decision Theory

I'm using talk about control sometimes to describe what the agent is doing from the outside, but the hypotheses it believes all have a form like "The variables such and such will be as if they were set by BDT given such and such inputs".

Right, but then, are all other variables unchanged? Or are they influenced somehow? The obvious proposal is EDT -- assume influence goes with correlation. Another possible answer is "try all hypotheses about how things are influenced."

Phylactery Decision Theory

One problem with this is that it doesn't actually rank hypotheses by which is best (in expected utility terms), just how much control is implied. So it won't actually converge to the best self-fulfilling prophecy (which might involve less control).

Another problem with this is that it isn't clear how to form the hypothesis "I have control over X".

[1] Bunthut, 3d: You don't. I'm using talk about control sometimes to describe what the agent is doing from the outside, but the hypotheses it believes all have a form like "The variables such and such will be as if they were set by BDT given such and such inputs". For the first setup, where it's trying to learn what it has control over, that's true. But you can use any ordering of hypotheses for the descent, so we can just take "how good that world is" as our ordering. This is very fragile of course. If there are uncountably many great but unachievable worlds, we fail, and in any case we are paying for all this with performance on "ordinary learning". If this were running in a non-episodic environment, we would have to find a balance between having the probability of hypotheses decline according to goodness, and avoiding the "optimistic humean troll" hypothesis by considering complexity as well. It really seems like I ought to take "the active ingredient" of this method out, if I knew how.

Reflective Bayesianism

I wanted to separate what work is done by radicalizing probabilism in general, vs logical induction specifically.

From my perspective, Radical Probabilism is a gateway drug. Explaining logical induction intuitively is hard. Radical Probabilism is easier to explain and motivate. It gives reason to believe that there's something interesting in the direction. But, as I've stated before, I have trouble comprehending how Jeffrey correctly predicted that there's something interesting here, without logical uncertainty as a motivation. In hindsight, I feel hi...

[2] Bunthut, 4d: This post seemed to be praising the virtue of returning to the lower-assumption state. So I argued that in the example given, it took more than knocking out assumptions to get the benefit. It wasn't meant to be. I agree that logical inductors seem to de facto implement a Virtuous Epistemic Process, with attendant properties, whether or not they understand that. I just tend to bring up any interesting-seeming thoughts that are triggered during conversation and could perhaps do better at indicating that. Whether it's fine to set it aside provisionally depends on where you want to go from here.

Reflective Bayesianism

So, let's suppose for a moment that ZFC set theory is the one true foundation of mathematics, and it has a "standard model" that we can meaningfully point at, and the question is whether our universe is somewhere in the standard model (or, rather, "perfectly described" by some element of the standard model, whatever that means).

In this case it's easy to imagine that the universe is actually some structure not in the standard model (such as the standard model itself, or the truth predicate for ZFC; something along those lines).

Now, granted, the whole point ...

Reflective Bayesianism

What is actually left of Bayesianism after Radical Probabilism? Your original post on it was partially explaining logical induction, and introduced assumptions from that in much the same way as you describe here. But without that, there doesn't seem to be a whole lot there. The idea is that all that matters is resistance to dutch books, and for a dutch book to be fair the bookie must not have an epistemic advantage over the agent. Said that way, it depends on some notion of "what the agent could have known at the time", and giving a coherent account of thi

[0] Bunthut, 4d: Ok. I suppose my point could then be made as "#2 type approaches aren't very useful, because they assume something that's no easier than what they provide". Well, you certainly know more about that than me. Where did the criterion come from in your view? Quite possibly. I wanted to separate what work is done by radicalizing probabilism in general, vs logical induction specifically. That said, I'm not sure logical inductors properly have beliefs about their own (in the de dicto sense) future beliefs. It doesn't know "its" source code (though it knows that such code is a possible program) or even that it is being run with the full intuitive meaning of that, so it has no way of doing that. Rather, it would at some point think about the source code that we know is its, and come to believe that that program gives reliable results - but only in the same way in which it comes to trust other logical inductors. It seems like a version of this [https://www.lesswrong.com/posts/pba68kdmmHrp9oHGG/phylactery-decision-theory] in the logical setting. By "knowing where they are", I mean strategies that avoid getting dutch-booked without doing anything that looks like "looking for dutch books against me". One example of that would be The Process That Believes Everything Is Independent And Therefore Never Updates, but that's a trivial stupidity.

Reflective Bayesianism

I had it in mind as a possible topic when writing, but it didn't make it into the post. I think I might be able to put together a model that makes more sense than the original version, but I haven't done it yet.

Reflective Bayesianism

I don't know what it would look like, but that isn't an argument that the universe is mathematical.

Frankly, I think there's something confused about the way I'm/we're talking about this, so I don't fully endorse what I'm saying here. But I'm going to carry on.

I guess I'm confused because in my head "mathematical" means "describable by a formal system", and I don't know how a thing could fail to be so describable.

So, the kind of thing I have in mind is the claim that reality is precisely and completely described by some particular mathematical object.

[3] DanielFilan, 4d: In my head, the argument goes roughly like this, with 'surely' to be read as 'c'mon I would be so confused if not':

1. Surely there's some precise way the universe is.
2. If there's some precise way the universe is, surely one could describe that way using a precise system that supports logical inference.

I guess it could fail if the system isn't 'mathematical', or something? Like I just realized that I needed to add 'supports logical inference' to make the argument support the conclusion.

Reflective Bayesianism

Thanks! I assume you're referring primarily to the way I made sure footnotes appear in the outline by using subheadings, and perhaps secondarily to aesthetics.

[1] DanielFilan, 5d: Just referring to the primary thing.

Troll Bridge

This sounds like PA is not actually the logic you're using.

Maybe this is the confusion. I'm not using PA. I'm assuming (well, provisionally assuming) PA is consistent.

If PA is consistent, then an agent using PA believes the world is consistent -- in the sense of assigning probability 1 to tautologies, and also assigning probability 0 to contradictions.

(At least, 1 to tautologies it can recognize, and 0 to contradictions it can recognize.)

Hence, I (standing outside of PA) assert that (since I think PA is probably consistent) agents who use PA don't know whe...

Troll Bridge

Sure, that's always true. But sometimes it's also true that A&¬A. So unless you believe PA is consistent, you need to hold open the possibility that the ball will both (stop and continue) and (do at most one of those). But of course you can also prove that it will do at most one of those. And so on. I'm not very confident what's right, ordinary imagination is probably just misleading here.

I think you're still just confusing levels here. If you're reasoning using PA, you'll hold open the possibility that PA is inconsistent, but you won't hold open the...

[0] Bunthut, 17d: Do you? This sounds like PA is not actually the logic you're using. Which is realistic for a human. But if PA is indeed inconsistent, and you don't have some further-out system to think in, then what is the difference to you between "PA is inconsistent" and "the world is inconsistent"? In both cases you just believe everything and its negation. This also goes with what I said about thought and perception not being separated in this model, which stand in for "logic" and "the world". So I suppose that is where you would look when trying to fix this. You do fully believe in PA. But it might be that you also believe its negation. Obviously this doesn't go well with probabilistic approaches.

Troll Bridge

Now you might think this is dumb, because it's impossible to see that. But why do you think it's impossible? Only because it's inconsistent. But if you're using PA, you must believe PA really might be inconsistent, so you can't believe it's impossible.

This part, at least, I disagree with. If I'm using PA, I can prove that ¬(A&¬A). So I don't need to believe PA is consistent to believe that the ball won't stop rolling and also continue rolling.

On the other hand, I have no direct objection to believing you can control the consistency of PA by doing some...

[0] Bunthut, 19d: Sure, that's always true. But sometimes it's also true that A&¬A. So unless you believe PA is consistent, you need to hold open the possibility that the ball will both (stop and continue) and (do at most one of those). But of course you can also prove that it will do at most one of those. And so on. I'm not very confident what's right, ordinary imagination is probably just misleading here. The facts about what you think are theorems of PA. Judging from the outside: clearly if an agent with this source code crosses the bridge, then PA is inconsistent. So, I think the agent is reasoning correctly about the kind of agent it is. I agree that the outcome looks bad - but it's not clear if the agent is "doing something wrong". For comparison, if we built an agent that would only act if it could be sure its logic is consistent, it wouldn't do anything - but it's not doing anything wrong. It's looking for logical certainty, and there isn't any, but that's not its fault.

Four Motivations for Learning Normativity

I'd be interested to hear this elaborated further. It seems to me to be technically challenging but not very;

1. I agree; I'm not claiming this is a multi-year obstacle even. Mainly I included this line because I thought "add a meta-level" would be what some readers would think, so, I wanted to emphasize that that's not a solution.
2. To elaborate on the difficulty: this is challenging because of the recursive nature of the request. Roughly, you need hypotheses which not only claim things at the object level but also hypothesize a method of hypothesis evaluation i
[2] Daniel Kokotajlo, 20d: Thanks! Well, I for one am feeling myself get nerd-sniped by this agenda. I'm resisting so far (so much else to do! Besides, this isn't my comparative advantage) but I'll definitely be reading your posts going forward and if you ever want to bounce ideas off me in a call I'd be down. :)

The case for aligning narrowly superhuman models

I do see a post with a formal result, which seems like a direct contradiction to what you're saying, though I'll look in more detail.

If you mean to suggest this post has a positive result, then I think you're just mis-reading it; the key result is

The conclusion of this post is the following: if there exists some set of natural tasks for which the fastest way to solve them is to do some sort of machine learning to find a good policy, and there is some task for which that machine learning results in deceptive behavior, then there exists a natural task

The case for aligning narrowly superhuman models

My guess is that you would agree that "minimal circuit that gives good advice" is smaller than "circuit that gives good advice but will later betray you", and therefore there exist two model sizes where one is dangerous and one is safe but useful. I know I saw posts on this a while back, so there may be relevant math about what that gap might be, or it might be unproven but with some heuristics of what the best result probably is.

There was indeed a post posing this question a while back, and discussion in the comments included a counterexample: a construct...

[1] magfrump, 1mo: Looks like the initial question was here [https://www.lesswrong.com/posts/nyCHnY7T5PHPLjxmN/open-question-are-minimal-circuits-daemon-free] and a result around it was posted here [https://www.lesswrong.com/posts/fM5ZWGDbnjb7ThNKJ/are-minimal-circuits-deceptive]. At a glance I don't see the comments with counterexamples, and I do see a post with a formal result, which seems like a direct contradiction to what you're saying, though I'll look in more detail. Coming back to the scaling question, I think I agree that multiplicative scaling over the whole model size is obviously wrong. To be more precise, if there's something like a Q-learning inner optimizer for two tasks, then you need the cross product of the state spaces, so the size of the Q-space could scale close-to-multiplicatively. But the model that condenses the full state space into the Q-space scales additively, and in general I'd expect the model part to be much bigger--like the Q-space has 100 dimensions and the model has 1 billion parameters, so adding a second model of 1 billion parameters and increasing the Q-space to 10k dimensions is mostly additive in practice, even if it's also multiplicative in a technical sense. I'm going to update my probability that "GPT-3 can solve X, Y implies GPT-3 can solve X+Y," and take a closer look at the comments on the linked posts. This also makes me think that it might make sense to try to find simpler problems, even already-mostly-solved problems like Chess or algebra, and try to use this process to solve them with GPT-2, to build up the architecture and search for possible safety issues in the process.

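
A rough arithmetic sketch of the additive-versus-multiplicative point, using the figures from the comment above (a 1-billion-parameter model and a 100-dimensional Q-space per task). Counting each Q-space dimension as one unit of size alongside each parameter is a simplifying assumption made here for illustration, and the variable names are hypothetical.

```python
# Figures taken from the comment; the size accounting is an assumption.
model_params = 1_000_000_000  # parameters in the model for one task
q_dims = 100                  # Q-space dimensions for one task

# One task: model plus its Q-space.
single_task = model_params + q_dims

# Two tasks: the models add, the Q-space is a cross product (100 * 100 = 10k).
two_tasks = 2 * model_params + q_dims * q_dims

# Growth factor when going from one task to two.
growth = two_tasks / single_task

# For contrast: what fully multiplicative scaling of the whole system would give.
fully_multiplicative = single_task * single_task

print(f"one task:   {single_task:,}")
print(f"two tasks:  {two_tasks:,}")
print(f"growth:     {growth:.6f}")
print(f"if everything multiplied: {fully_multiplicative:,}")
```

Under these assumptions the total roughly doubles, because the multiplicative Q-space term is tiny next to the additive model term.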
The case for aligning narrowly superhuman models

Thanks for trying further to bridge the gap!

(It would be nice if you flagged a little better which things you think I think / which things you think I disagree with)

I think what I'm saying is, GPT-3 probably doesn't have any general truth+noise models. But I would expect it to copy a truth+noise model from people, when the underlying model is simple.

OK, that makes sense. So you're not saying that GPT contains useful diagnostic models in the overall statistics of its models of Reddit users (EG that someone complaining of one symptom will often complain of a...

[3] magfrump, 1mo: I'm replying on my phone right now because I can't stop thinking about it. I will try to remember to follow up when I can type more easily. I think the vague shape of what I think I disagree about is how dense GPT-3's sets of implicit knowledge are. I do think we agree that GPT-5000 will be broadly superhuman, even if it just has a grab bag of models in this way, for approximately the reasons you give. I'm thinking about "intelligent behavior" as something like the set of real numbers, and "human behavior" as covering something like rational numbers, so we can get very close to most real numbers but it takes some effort to fill in the decimal expansion. Then I'm thinking of GPT-N as being something like integers+1/N. As N increases, this becomes close enough to the rational numbers to approximate real numbers, and can be very good at approximating some real numbers, but can't give you incomputable numbers (unaligned outcomes) and usually won't give you duplicitous behavior (numbers that look very simple at first approximation but actually aren't, like .2500000000000004, which seems to be 1/4 but secretly isn't). I'm not sure where that intuition comes from but I do think I endorse it with moderate confidence. Basically I think for minimal circuit reasons that if "useful narrowly" emerges in GPT-N, then "useful in that same domain but capable of intentionally doing a treacherous turn" emerges later. My intuition is that this won't be until GPT-(N+3) or more, so if you are able to get past unintentional turns like "the next commenter gives bad advice" traps, this alignment work is very safe, and important to do as fast as possible (because attempting it later is dangerous!) In a world where GPT-(N+1) can do a treacherous turn, this is very dangerous, because you might accidentally forget to check if GPT-(N-1) can do it, and get the treacherous turn.
My guess is that you would agree that "minimal circuit that gives good advice" is smaller than "circuit that gives
The case for aligning narrowly superhuman models

Let's see if I can properly state the nature of the disagreement.

I stated that there's a spectrum between "GPT knows more than the average human across a broad variety of domains, but only uses this knowledge to imitate humans, so it's not obvious" and "GPT really knows very little, and its apparent stupidity is stupidity-in-fact".

I somewhat operationalized the difference as one of internal representation: to what extent is GPT using a truth+noise model (where it knows a lot of stuff about reality, and then filters it through the biases of particular persp...

[1] magfrump, 1mo: I think this is obscuring (my perception of) the disagreement a little bit. I think what I'm saying is, GPT-3 probably doesn't have any general truth+noise models. But I would expect it to copy a truth+noise model from people, when the underlying model is simple. I then expect GPT-3 to "secretly" have something like an interesting diagnostic model, and probably a few other narrowly superhuman skills. But I would expect it to not have any kind of significant planning capacity, because that planning capacity is not simple. In particular my expectation is that coherently putting knowledge from different domains together in generally useful ways is MUCH, MUCH harder than being highly superhuman in narrow domains. Therefore I expect Ajeya's approach to be both effective, because "narrowly superhuman" can exist, and reasonably safe, because the gap between "narrowly superhuman" or even "narrowly superhuman in many ways" and "broadly superhuman" is large so GPT-3 being broadly superhuman is unlikely. Phrased differently, I am rejecting your idea of smartness-spectrum. My intuition is that levels of GPT-N competence will scale the way computers have always scaled at AI tasks--becoming usefully superhuman at a few very quickly, while taking much much longer to exhibit the kinds of intelligence that are worrying, like modeling human behavior for manipulation.

Four Motivations for Learning Normativity

Well, transparency is definitely a challenge. I'm mostly saying this is a technical challenge even if you have magical transparency tools, and I'm kind of trying to design the system you would want to use if you had magical transparency tools.

But I don't think it's difficult for the reason you say. I don't think multi-level feedback or whole-process feedback should be construed as requiring the levels to be sorted out nicely. Whole-process feedback in particular just means that you can give feedback on the whole chain of computation; it's basically against...

The case for aligning narrowly superhuman models

I don’t want to drop “superhuman” from the name because that’s the main reason it feels like “practicing what we eventually want to do.”

One response I generated was, "maybe it's just not so much about practicing what we eventually want to do, and that part is an illusion of the poor framing. We should figure out the right framing first and then ask whether it seems like practice, not optimize the framing to make it sound like practice."

But I think my real response is: why is the superhuman part important, here? Maybe what's really important is being abl...

In my head the point of this proposal is very much about practicing what we eventually want to do, and seeing what comes out of that; I wasn't trying here to make something different sound like it's about practice. I don't think that a framing which moved away from that would better get at the point I was making, though I totally think there could be other lines of empirical research under other framings that I'd be similarly excited about or maybe more excited about.

In my mind, the "better than evaluators" part is kind of self-evidently intriguing for the...

The case for aligning narrowly superhuman models

Basically, it’s about trying to find areas where lackluster performance seems to mostly be about “misalignment” rather than “capabilities” (recognizing those are both fuzzy terms).

Right, ok, I like that framing better (it obviously fits, but I didn't generate it as a description before).

The case for aligning narrowly superhuman models

I might be on board if "narrowly superhuman" were simply defined differently.

“Try to use some humans to align a model in a domain where the model is better than the humans at the task”

Isn't it something more like "the model has information sufficient to do better"? EG, in the GPT example, you can't reliably get good medical advice from it right now, but you strongly suspect it's possible. That's a key feature of the whole idea, right?

Is your suggested research program better described as: find (highly capable) models with inaccessible information and g...

[1] Ajeya Cotra, 1mo: I don't feel confident enough in the frame of "inaccessible information" to say that the whole agenda is about it. It feels like a fit for "advice", but not a fit for "writing stories" or "solving programming puzzles" (at least not an intuitive fit -- you could frame it as "the model has inaccessible information about [story-writing, programming]" but it feels more awkward to me). I do agree it's about "strongly suspecting it has the potential to do better than humans" rather than about "already being better than humans." Basically, it's about trying to find areas where lackluster performance seems to mostly be about "misalignment" rather than "capabilities" (recognizing those are both fuzzy terms).

The case for aligning narrowly superhuman models

This isn't an objection to the research direction, just a response to how you're framing it:

If you think GPT-3 is "narrowly superhuman" at medical advice, what topic don't you think it's narrowly superhuman in? It seems like you could similarly argue that GPT-3 knows more than the average human about mechanics, chemistry, politics, and just about anything that language is good at describing. (EG, not walking, riding a bike, the concrete skills needed for painting, etc.)

A tool capable of getting GPT-3 to give good medical advice would, probably, be a tool t...

[2] Ajeya Cotra, 1mo: Yeah, you're definitely pointing at an important way the framing is awkward. I think the real thing I want to say is "Try to use some humans to align a model in a domain where the model is better than the humans at the task", and it'd be nice to have a catchy term for that. Probably a model which is better than some humans (e.g. MTurkers) at one task (e.g. medical advice) will also be better than those same humans at many other tasks (e.g. writing horror stories); but at the same time for each task, there's some set of humans (e.g. doctors in the first case and horror authors in the second) where the model does worse. I don't want to just call it "align superhuman AI today" because people will be like "What? We don't have that", but at the same time I don't want to drop "superhuman" from the name because that's the main reason it feels like "practicing what we eventually want to do." I considered "partially superhuman", but "narrowly" won out. I'm definitely in the market for a better term here.

MIRI comments on Cotra's "Case for Aligning Narrowly Superhuman Models"

Rob Bensinger: Nate and I tend to talk about "understandability" instead of "transparency" exactly because we don't want to sound like we're talking about normal ML transparency work.

Eliezer Yudkowsky: Other possible synonyms: Clarity, legibility, cognitive readability.

Ajeya Cotra: Thanks all -- I like the project of trying to come up with a good handle for the kind of language model transparency we're excited about (and have talked to Nick, Evan, etc about it too) but I think I don't want to push it in this blog post right now because I haven't hit

This was a really helpful articulation, thanks! I like "frankness", "forthrightness", "openness", etc. (These are all terms I was brainstorming to get at the "ascription universality" concept at one point.)

Recursive Quantilizers II

Okay, I think with this elaboration I stand by what I originally said

You mean with respect to the system as described in the post (in which case I 100% agree), or the modified system which restarts training upon new feedback (which is what I was just describing)?

Because I think this is pretty solidly wrong of the system that restarts.

Specifically, isn't it the case that the first few bits of feedback determine D1, which might then lock in some bad way of interpreting feedback (whether existing or future feedback)?

All feedback so far determines the new...

[2] Rohin Shah, 1mo: I continue to not understand this but it seems like such a simple question that it must be that there's just some deeper misunderstanding of the exact proposal we're now debating. It seems not particularly worth it to find this misunderstanding; I don't think it will really teach us anything conceptually new. (If I did want to find it, I would write out pseudocode for the new proposed system and then try to make a more precise claim in terms of the variables in the pseudocode.)

Recursive Quantilizers II

This seems like a pretty big disagreement, which I don't expect to properly address with this comment. However, it seems a shame not to try to make any progress on it, so here are some remarks.

Less important response: If by "not great" you mean "existentially risky", then I think you need to explain why the smartest / most powerful historical people with now-horrifying values did not constitute an existential risk.

My answer to this would be, mainly because they weren't living in times as risky as ours; for example, they were not born and raised in a litera...

[2] Rohin Shah, 1mo: This makes sense, though I probably shouldn't have used "5x" as my number -- it definitely feels intuitively more like your tools could be robust to many orders of magnitude of increased compute / model capacity / data. (Idk how you would think that relates to a scaling factor on intelligence.) I think the key claim / crux here is something like "we can develop techniques that are robust to scaling up compute / capacity / data by N orders, where N doesn't depend significantly on the current compute / capacity / data".

Recursive Quantilizers II

My questions / comments are about the implementation proposed in this post. I thought that you were identifying "levels of reasoning" with "depth in the idealized recursive QAS tree"; if that's the case I don't see how feedback at one level generalizes to all the other levels (feedback at that level is used to make the QAS at that level, and not other levels, right?)

I'm pretty sure I'm just failing to understand some fact about the particular implementation, or what you mean by "levels of reasoning", or its relation to the idealized recursive QAS tree.

[2] Rohin Shah, 1mo: Most of this makes sense (or perhaps more accurately, sounds like it might be true, but there's a good chance if I reread the post and all the comments I'd object again / get confused somehow). One thing though: Okay, I think with this elaboration I stand by what I originally said: Specifically, isn't it the case that the first few bits of feedback determine D1, which might then lock in some bad way of interpreting feedback (whether existing or future feedback)?

The Pointers Problem: Human Values Are A Function Of Humans' Latent Variables

Agreed. The problem is with AI designs which don't do that. It seems to me like this perspective is quite rare. For example, my post Policy Alignment was about something similar to this, but I got a ton of pushback in the comments -- it seems to me like a lot of people really think the AI should use better AI concepts, not human concepts. At least they did back in 2018.

As you mention, this is partly due to overly reductionist world-views. If tables/happiness aren't reductively real, the fact that the AI is using those concepts is evidence that it's dumb/in...

Recursive Quantilizers II

At MIRI we tend to take “we’re probably fine” as a strong indication that we’re not fine ;p

(I should remark that I don't mean to speak "for MIRI" and probably should have phrased myself in a way which avoided generalizing across opinions at MIRI.)

Yeah I have been and continue to be confused by this perspective, at least as an empirical claim (as opposed to a normative one). I get the sense that it’s partly because optimization amplifies and so there is no “probably”, there is only one or the other. I can kinda see that when you assume an arbitrarily

Recursive Quantilizers II

Thanks for the review, btw! Apparently I didn't think to respond to it before.

**On feedback types:** It seems like the scheme introduced here is relying quite strongly on the ability of humans to give good process-level feedback _at arbitrarily high levels_. It is not clear to me that this is something humans can do: it seems to me that when thinking at the meta level, humans often fail to think of important considerations that would be obvious in an object-level case. I think this could be a significant barrier to this scheme, though it’s hard to say with

[2] Rohin Shah, 2mo: (Noting that given this was a month ago I have lost context and am more likely than usual to contradict what I previously wrote) What's the point of handling feedback at high levels if we never actually get feedback at those levels? Perhaps another way of framing it: suppose we found out that humans were basically unable to give feedback at level 6 or above. Are you now happy having the same proposal, but limited to depth 5? I get the sense that you wouldn't be, but I can't square that with "you only need to be able to handle feedback at high levels but you don't require such feedback". I don't super see how this happens but I could imagine it does. (And if it did it would answer my question above.) I feel like I would benefit from concrete examples with specific questions and their answers. Okay, that makes sense. It seemed to me like since the first few bits of feedback determine how the system interprets all future feedback, it's particularly important for those first few bits to be correct and not lock in e.g. a policy that ignores all future feedback. I agree that any safety story will probably require you to get some concept X right. (Corrigibility is one candidate for X.) Your safety story would then be "X is inductively preserved as the AI system self-modifies / learns new information / makes a successor agent", and so X has to scale arbitrarily far. You have to get this "perfectly" right in that it can't be that your agent satisfies X under normal conditions but then fails when COVID hits; this is challenging. You don't have to get it "perfectly" right in that you could get some more conservative / careful X' that restricts the agent's usefulness (e.g. it has to check in with the human more often) but over time it can self-modify / make successor agents with property X instead.
Importantly, if it turns out that X = corrigibility is too hard, we can also try less performant but safer things, like X = "we revert to a safe baseline policy if we're not in
Recursive Quantilizers II

Here are some thoughts on how to fix the issues with the proposal.

1. The initial proposal distribution should be sampling things close to the initial QAS, not sampling totally random neural networks / etc.

This addresses the "quantilizers can't do the work of finding powerful QASs" problem. Instead, quantilizers are only doing the work of improving QASs. We don't need to assume smart, aligned QASs are significantly more common than malign QASs; instead, we only need to assume that catastrophic modifications are rare in the local neighborhood of something ... (read more)
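As a rough illustration of point 1, here is a minimal sketch of a single quantilizer step whose proposal distribution is local to the current QAS. Everything named here (`quantilize_step`, `perturb`, `score`, `n_samples`, `q`) is a hypothetical stand-in chosen for the example, not something specified in the post; it only shows the generic "sample from a base distribution, then pick uniformly from the top q fraction" pattern.

```python
import random

def quantilize_step(initial, perturb, score, n_samples=1000, q=0.1, rng=None):
    """One quantilizer step with a local proposal distribution.

    initial   -- the current candidate (stand-in for the initial QAS)
    perturb   -- hypothetical function (candidate, rng) -> nearby candidate
    score     -- hypothetical function candidate -> float (higher is better)
    q         -- fraction of top-scoring samples to choose among
    """
    rng = rng or random.Random()
    # Sample only in the neighborhood of `initial`, not from all possible candidates.
    candidates = [perturb(initial, rng) for _ in range(n_samples)]
    # Rank by score and keep the top q fraction (at least one candidate).
    candidates.sort(key=score, reverse=True)
    top_k = max(1, int(q * n_samples))
    # Choose uniformly among the top candidates rather than taking the argmax.
    return rng.choice(candidates[:top_k])
```

The uniform choice among the top fraction, rather than a pure argmax, is what limits how hard the step optimizes; the safety of the result still rests on the assumption that catastrophic candidates are rare near `initial`.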

Fixing The Good Regulator Theorem

I'm really happy to see this. I've had similar thoughts about the Good Regulator Theorem, but didn't take the time to write them up or really pursue the fix.

Marginally related: my hope at some point was to fix the Good Regulator Theorem and then integrate it with other representation theorems, to come up with a representation theorem which derived several things simultaneously:

1. Probabilistic beliefs (or some appropriate generalization).
2. Expected utility theory (or some appropriate generalization).
3. A notion of "truth" based on map/territory correspondence
Debate Minus Factored Cognition

I think you also need that at least some of the time good arguments are not describably bad

While I agree that there is a significant problem, I'm not confident I'd want to make that assumption.

As I mentioned in the other branch, I was thinking of differences in how easy lies are to find, rather than existence. It seems natural to me to assume that every individual thing does have a convincing counterargument, if we look through the space of all possible strings (not because I'm sure this is true, but because it's the conservative assumption -- I have no... (read more)

2Rohin Shah2moYeah all of this makes sense to me; I agree that you could make an argument about the difference in difficulty of finding defeaters to good vs. bad arguments, and that could then be used to say "debate will in practice lead to honest policies".
Debate Minus Factored Cognition

I think in this comment thread I’ve been defining an honest answer as one that can be justified via arguments that eventually don’t have any defeaters. I thought this was what you were going for since you started with the assumption that dishonest answers always have defeaters—while this doesn’t strictly imply my definition, that just seemed like the obvious theoretical model to be using. (I didn’t consciously realize I was making that assumption.)

Right, I agree -- I was more or less taking that as a definition of honesty. However, this doesn't mean we'... (read more)

2Rohin Shah2moI think you also need that at least some of the time good arguments are not describably bad (i.e. they don't have defeaters); otherwise there is no way to distinguish between good and bad arguments. (Or you need to posit some external-to-debate method of giving the AI system information about good vs bad arguments.) I think I'm still a bit confused on the relation of Factored Cognition to this comment thread, but I do agree at least that the main points we were discussing are not particularly related to Factored Cognition. (In particular, the argument that zero-sum is fine can be made without any reference to Factored Cognition.) So I think that summary seems fine.
Debate Minus Factored Cognition

Sorry, I don't get this. How could we make the argument that the probability is below 50%?

I think my analysis there was not particularly good, and only starts to make sense if we aren't yet in equilibrium.

[...]
3. Why is it okay for us to assume that we're in an honest-enough regime?

I think #3 is the most reasonable, with the answer being "I have no reason why that's a reasonable assumption; I'm just saying, that's what you'd usually try to argue in a debate context..."

(As I stated in the OP, I have no claims as to how to induce honest equilibrium in my setup.)

Debate Minus Factored Cognition

When I say that you get the correct answer, or the honest answer, I mean something like "you get the one that we would want our AI systems to give, if we knew everything that the AI systems know". An alternative definition is that the answer should be "accurately reporting what humans would justifiably believe given lots of time to reflect" rather than "accurately corresponding to reality".

Right, OK.

So my issue with using "correct" like this in the current context is that it hides too much and creates a big risk of conflation. By no means do I assume -- or... (read more)

2Rohin Shah3moRe: correctness, I think I actually misled you with my last comment; I lost track of the original point. I endorse the thing I said as a definition of what I'm usually hoping for with debate, but I don't think that was the definition I was using here. I think in this comment thread I've been defining an honest answer as one that can be justified via arguments that eventually don't have any defeaters. I thought this was what you were going for since you started with the assumption that dishonest answers always have defeaters -- while this doesn't strictly imply my definition, that just seemed like the obvious theoretical model to be using. (I didn't consciously realize I was making that assumption.) I still think that working with this "definition" is an interesting theoretical exercise, though I agree it doesn't correspond to reality. Looking back I can see that you were talking about how this "definition" doesn't actually correspond to the realistic situation, but I didn't realize that's what you were saying, sorry about that.
Debate Minus Factored Cognition

I don't think I ever use "fully defeat" in a leaf? It's always in a Node, or in a Tree (which is a recursive call to the procedure that creates the tree).

Ahhhhh, OK. I missed that that was supposed to be a recursive call, and interpreted it as a leaf node based on the overall structure. So I was still missing an important part of your argument. I thought you were trying to offer a static tree in that last part, rather than a procedure.

Debate Minus Factored Cognition

Ah, I suppose this is still consistent with honesty being an equilibrium. But it would then be a really weak sort of equilibrium—there would be no reason to be honest, but no specific reason to be dishonest, either.

Yeah, I agree this is possible. (The reason to not expect dishonesty is that sometimes you’ll see honest arguments to which there is no dishonest defeater.)

Then I concede that there is an honest equilibrium where the first player tells the truth, and the seco

Debate Minus Factored Cognition

Ah, OK, so you were essentially assuming that humans had access to an oracle which could verify optimal play.

This sort of makes sense, as a human with access to a debate system in equilibrium does have such an oracle. I still don't yet buy your whole argument, for reasons being discussed in another branch of our conversation, but this part makes enough sense.

Your argument also has some leaf nodes which use the terminology "fully defeat", in contrast to "defeat". I assume this means that in the final analysis (after expanding the chain of defeaters) this re... (read more)

2Rohin Shah3moI don't think I ever use "fully defeat" in a leaf? It's always in a Node, or in a Tree (which is a recursive call to the procedure that creates the tree). Yes, that's what I mean by "fully defeat".
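One possible reading of "fully defeat", offered only as a sketch: an argument is fully defeated if at least one of its defeaters is not itself fully defeated once the chain of defeaters is expanded. The function name, the dictionary representation, and the example arguments below are all assumptions made for illustration, and the recursion only terminates if the defeater graph is finite and acyclic (the termination caveat raised elsewhere in this thread).

```python
def fully_defeated(argument, defeaters):
    """Return True if some defeater of `argument` survives its own defeaters.

    defeaters -- hypothetical mapping from an argument to the list of
                 arguments that defeat it; must be finite and acyclic.
    """
    return any(
        not fully_defeated(d, defeaters)
        for d in defeaters.get(argument, [])
    )

# Hypothetical example: X is attacked by D1 and D2; D1 is itself attacked by E1.
example = {"X": ["D1", "D2"], "D1": ["E1"], "D2": [], "E1": []}
x_is_fully_defeated = fully_defeated("X", example)
```

In this example D1 falls to E1, but D2 has no defeater, so X counts as fully defeated under this reading.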
Debate Minus Factored Cognition

For generic question Q and correct answer A, I make no assumption that there are convincing arguments for A one way or the other (honest or dishonest). If player 1 simply states A, player 2 would be totally within rights to say “player 1 offers no argument for its position” and receive points for that, as far as I am concerned.

I think at this point I want a clearer theoretical model of what assumptions you are and aren’t making. Like, at this point, I’m feeling more like “why are we even talking about defeaters; there are much bigger issues with this s

Debate Minus Factored Cognition

Another problem with your argument—WFC says that all leaf nodes are human-verifiable, whereas some leaf nodes in your suggested tree have to be taken on faith (a fact which you mention, but don’t address).

Not sure what you want me to “address”. The leaf nodes that are taken on faith really are true under optimal play, which is what happens at equilibrium.

To focus on this part, because it seems quite tractable --

Let's grant for the sake of argument that these nodes are true under optimal play. How can the human verify that? Optimal play is quite a com... (read more)

4Rohin Shah3moI don't think so, but to formalize the argument a bit more, let's define this new version of the WFC:

Special-Tree WFC: For any question Q with correct answer A, there exists a tree of decompositions T arguing this such that:

1. Every internal node has exactly one child leaf of the form "What is the best defeater to X?" whose answer is auto-verified,
2. For every other leaf node, a human can verify that the answer to the question at that node is correct,
3. For every internal node, a human can verify that the answer to the question is correct, assuming that the subanswers are correct.

(As before, we assume that the human never verifies something incorrect, unless the subanswers they were given were incorrect.)

Claim 1: (What I thought was) your assumption => Special-Tree WFC, using the construction I gave.

Claim 2: Special-Tree WFC + assumption of optimal play => honesty is an equilibrium, using the same argument that applies to regular WFC + assumption of optimal play.

Idk whether this is still true under the assumptions you're using; I think claim 1 in particular is probably not true under your model.
Debate Minus Factored Cognition

The computational complexity analogy version would have to put a polynomial limit on the depth of the tree if you wanted to argue that the problem is in PSPACE. My construction doesn’t do this; there will be questions where the depth of the tree is super-polynomial, but the tree still exists. (These will be the cases in which, even under optimal play by an honest agent, the “length” of a chain of defeaters can be super-polynomially large.) So I don’t think my argument is proving too much.

OK, but this just makes me regret pointing to the computational co... (read more)

2Rohin Shah3moI think at this point I want a clearer theoretical model of what assumptions you are and aren't making. Like, at this point, I'm feeling more like "why are we even talking about defeaters; there are much bigger issues in this setup". I wouldn't be surprised at this point if most of the claims I've made are actually false under the assumptions you seem to be working under. Not sure what you want me to "address". The leaf nodes that are taken on faith really are true under optimal play, which is what happens at equilibrium. This is why I prefer the version of debate outlined here [https://www.alignmentforum.org/posts/Br4xDbYu4Frwrb64a/writeup-progress-on-ai-safety-via-debate-1#Current_debate_rules] , where both sides make a claim and then each side must recurse down on the other's arguments. I didn't realize you were considering a version where you don't have to specifically rebut the other player's arguments. I just meant to include the fact that the honest player is able to find the defeaters to dishonest arguments. If you include that in "the honest policy", then I agree that "in equilibrium" is unnecessary. (I definitely could have phrased that better.)
Debate Minus Factored Cognition

I think this is only true when you have turn-by-turn play and your opponent has already "claimed" the honest debater role.

Yeah, I was assuming turn-by-turn play.

In the simultaneous play setting, I think you expect both agents to be honest.

This is a significant point that I was missing: I had assumed that in simultaneous play, the players would randomize, so as to avoid choosing the same answer, since choosing the same answer precludes winning. However, if choosing a worse answer means losing, then players prefer a draw.

But I'm not yet convinced, because th... (read more)
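The point about simultaneous play can be checked on a toy game. The sketch below assumes two available answers, a zero-sum payoff of 0 when both players pick the same answer, and +1/-1 when they differ, with the better ("honest") answer winning. Those payoffs and names are assumptions for illustration only; whether the honest answer really wins is exactly what the rest of the thread disputes.

```python
from itertools import product

ANSWERS = ["honest", "dishonest"]

def payoff(mine, theirs):
    """Zero-sum payoff to the player answering `mine` against `theirs`."""
    if mine == theirs:
        return 0  # same answer: a draw, nobody can win
    return 1 if mine == "honest" else -1  # assumed: the worse answer loses

def pure_equilibria():
    """List answer pairs where neither player gains by deviating alone."""
    equilibria = []
    for a, b in product(ANSWERS, repeat=2):
        a_is_best = all(payoff(a, b) >= payoff(alt, b) for alt in ANSWERS)
        b_is_best = all(payoff(b, a) >= payoff(alt, a) for alt in ANSWERS)
        if a_is_best and b_is_best:
            equilibria.append((a, b))
    return equilibria
```

Under these assumed payoffs, both players choosing the honest answer is the only pure equilibrium: deviating from the draw means picking the losing answer, so there is no incentive to randomize away from a shared answer.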

3Rohin Shah3moWhoops, I seem to have missed this comment, sorry about that. I think at this point we're nearly at agreement. Yeah, I agree this is possible. (The reason to not expect dishonesty is that sometimes you'll see honest arguments to which there is no dishonest defeater.) Similar comment here -- the more you expect that honest claims will likely have dishonest defeaters, the weaker you expect the equilibrium to be. (E.g. it's clearly not a tie when honest claims never have dishonest defeaters; in this case first player always wins.) On the specific -2/+1 proposal, the issue is that then the first player just makes some dishonest argument, and the second player concedes because even if they give an honest defeater, the second player could then re-defeat that with a dishonest defeater. (I realize I'm just repeating myself here; there's more discussion in the next section.) But more broadly, I claim that given your assumptions there is no possible scoring rule that (in the worst case) makes honesty a unique equilibrium. This worst case is when every argument has a defeater (and in particular, every honest argument has a dishonest defeater). In this situation, there is no possible way to distinguish between honesty and dishonesty -- under your assumptions, the thing that characterizes honesty is that honest arguments (at least sometimes) don't have defeaters. From the perspective of the players, the salient feature of the game is that they can make statements; all such statements will have defeaters; there's no information available to them in the structure of the game that distinguishes honesty from dishonesty. Therefore honesty can't be the unique equilibrium; whatever the policy is, there should be an equivalent one that is at least sometimes dishonest. 
In this worst case, I suspect that for any judge-based scoring rule, the equilibrium behavior is either "the first player says something and the second concedes", or "every player always provides some arbitrary defea
Debate Minus Factored Cognition

There are two arguments:

1. Your assumption + automatic verification of questions of the form "What is the best defeater to X" implies Weak Factored Cognition (which as defined in my original comment is of the form "there exists a tree such that..." and says nothing about what equilibrium we get).

Right, of course, that makes more sense. However, I'm still feeling dense -- I still have no inkling of how you would argue weak factored cognition from #1 and #2. Indeed, Weak FC seems far too strong to be established from anything resembling #1 and #2: WFC says that... (read more)

I'll just flag that I still don't know this argument, either, and I'm curious where you're getting it from / what it is.

I just read the Factored Cognition sequence since it has now finished, and this post derives WFC as the condition necessary for honesty to be an equilibrium in (a slightly unusual form of) debate, under the assumption of optimal play.

2Rohin Shah3moThe computational complexity analogy version would have to put a polynomial limit on the depth of the tree if you wanted to argue that the problem is in PSPACE. My construction doesn't do this; there will be questions where the depth of the tree is super-polynomial, but the tree still exists. (These will be the cases in which, even under optimal play by an honest agent, the "length" of a chain of defeaters can be super-polynomially large.) So I don't think my argument is proving too much. (The tree could be infinite if you don't have an assumption that guarantees termination somehow, hence my caveats about termination. WFC should probably ask for the existence of a finite tree.) For the actual argument, I'll refer back to my original comment, which provides a procedure to construct the tree. Happy to clarify whichever parts of the argument are confusing. No, I am in fact asserting that given the two assumptions, all questions are answerable by (potentially super-polynomially large) verifiable trees (again assuming we deal with termination somehow). I think it differs based on what assumptions you make on the human judge, so there isn't a canonical version of it. In this case, the assumption on the human judge is that if the subanswers they are given are true, then they never verify an incorrect overall answer. (This is different from the "defeaters" assumption you have, for which I'd refer to the argument I gave above.) Claim: In a turn-by-turn unlimited-length debate, if the first player is honest, then the first player always wins in equilibrium. Argument: By WFC, we assume there is a finite tree T that can be verified. The first player then has the following strategy: take the question under consideration (initially the original question; later it is whatever subquestion the opponent is disputing). Report "the answer is A, which holds because the answer to subquestion 1 is A1 and the answer to subquestion 2 is A2".
The opponent will always have to recurse into o
Debate Minus Factored Cognition

Thanks for taking the time to reply!

I don’t think that’s what I did? Here’s what I think the structure of my argument is:

1. Every dishonest argument has a defeater. (Your assumption.)
2. Debaters are capable of finding a defeater if it exists. (I said “the best counterargument” before, but I agree it can be weakened to just “any defeater”. This doesn’t feel that qualitatively different.)
3. 1 and 2 imply the Weak Factored Cognition hypothesis. I’m not assuming factored cognition, I’m proving it using your assumption.

Ah, interesting, I didn't catch that this ... (read more)

2Rohin Shah3moThere are two arguments: 1. Your assumption + automatic verification of questions of the form "What is the best defeater to X" implies Weak Factored Cognition (which as defined in my original comment is of the form "there exists a tree such that..." and says nothing about what equilibrium we get). 2. Weak Factored Cognition + debate + human judge who assumes optimal play implies an honest equilibrium. (Maybe also: if you assume debate trees terminate, then the equilibrium is unique. I think there's some subtlety here though.) In my previous comment, I was talking about 1, and taking 2 for granted. This is all in the zero-sum setting. But let's leave that aside and instead talk about a simpler argument that doesn't talk about Factored Cognition at all. ---- Zero-sum setting, argument that honesty is an equilibrium (for the first player in a turn-by-turn game, or either player in a simultaneous-action game): If you are always honest, then whenever you can take an action, there will exist a defeater (by your assumption), therefore you will have at least as many options as any non-honest policy (which may or may not have a defeater). Therefore you maximize your value by being honest. Additional details: In the case where arguments never terminate (every argument, honest or not, has a defeater), then being dishonest will also leave you with many options, and so that will also be an equilibrium. When arguments do terminate quickly enough (maximum depth of the game tree is less than the debate length), that ensures that the honest player always gets the "last word" (the point at which a dishonest defeater no longer exists), and so honesty always wins and is the unique equilibrium. In the middle, where most arguments terminate quickly but some go on forever, honesty is usually incentivized, but sometimes it can be swapped out for a dishonest strategy that achieves the same value. I think this is only true when you have turn-by-turn play and
AI safety via market making

This was a very interesting comment (along with its grandparent comment), thanks -- it seems like a promising direction.

However, I'm still confused about whether this would work. It's very different from the judging procedure outlined here; why is that? Do you have a similarly detailed write-up of the system you're describing here?

I'm actually less concerned about loops and more concerned about arguments which are infinite trees, but the considerations are similar. It seems possible that the proposal you're discussing very significantly addresses concerns I've had about debate.

1Beth Barnes2moI was trying to describe something that's the same as the judging procedure in that doc! I might have made a mistake, but I'm pretty sure the key piece about recursion payments is the same. Apologies that things are unclear. I'm happy to try to clarify, if there were particular aspects that seem different to you. Yeah, I think the infinite tree case should work just the same - ie an answer that's only supported by an infinite tree will behave like an answer that's not supported (it will lose to an answer with a finite tree and draw with an answer with no support). That's exciting!
Debate Minus Factored Cognition

I think I disagree with the claim you're making about being able to avoid requiring the judge to assume that one player is honest (but I might be confused about what you're proposing).

Don't you yourself disagree with requiring the judge to assume that one player is honest? In a recent comment, you discuss how claims should not be trusted by default.

1Beth Barnes2moI don't think 'assuming one player is honest' and 'not trusting answers by default' are in contradiction. If the judge assumes one player is honest, then if they see two different answers they don't know which one to trust, but if they only see one answer (the debaters agree on an answer/the answer is not challenged by the opposing debater) then they can trust that answer.