Abram Demski




The Fusion Power Generator Scenario

I wholeheartedly agree. I think this implies:

  1. Getting very clear on what we want. Can we give a fairly technical specification of the kind of safety that's both necessary and possible?
  2. Some degree of safety beyond tool-type non-malignancy. A proposal which I keep thinking about is my consent-based helpfulness. The idea is that, in addition to believing that you want something (with sufficient confidence), the system should also believe that you understand the implications of that thing (with some kind of sufficient detail). In the fusion example, the system would engage the user in conversation until it was clear that the consequences for society were understood and approved of.

Note that the fusion power example could be answered directly with a value-alignment type approach, where you have an agent rather than a tool -- the agent infers your values, and infers that you would not really want backyard fusion power if it put the world at risk. That's the moral that I imagine people more into value learning would give to your story. But I'm reaching further afield for solutions, because:

  • Value learning systems could Goodhart on the approximate values learned.
  • Value learning systems are not corrigible if they become overly confident (which could happen at test time due to unforeseen flaws in the system's reasoning -- hence the desire for corrigibility)
  • Value learning systems could manipulate the human.

Infinite Data/Compute Arguments in Alignment

FWIW, I think of Eliezer's essay Methodology of Unbounded Analysis as the standard ref here. (But, it has not yet been ported over to Alignment Forum or LW.)

Comparing LICDT and LIEDT

If I'm interpreting things correctly, this is just because anything that's upstream gets screened off, because the agent knows what action it's going to take.

Not quite. The agent might play a mixed strategy if there is a predictor in the environment, e.g., when playing rock/paper/scissors with a similarly-intelligent friend you (more or less) want to predict what you're going to do and then do something other than that. (This is especially obvious if you assume the friend is exactly as smart as you, IE, assigns the same probability to things if there's no hidden information -- we can model this by supposing both of you use the same logical inductor.) You don't know what you're going to do, because your deliberation process is unstable: if you were leaning in any direction, you would immediately lean in a different direction. This is what it means to be playing a mixed strategy.

In this situation, I'm nonetheless still claiming that what's "downstream" should be what's logically correlated with you. So what screens off everything else is knowledge of the state of your deliberation, not the action itself. In the case of a mixed strategy, you know that you are balanced on a razor's edge, even though you don't know exactly which action you're taking. And you can give a calibrated probability for that action.
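The razor's edge can be illustrated with a minimal Python sketch. It assumes ordinary zero-sum rock/paper/scissors payoffs (+1 win, 0 tie, -1 loss) and an opponent who knows your calibrated action probabilities; the names `payoff` and `exploit_value` are hypothetical, chosen only for this illustration. It does not model logical inductors. It only shows that any lean is exploitable, while the uniform mix is not.

```python
MOVES = ("rock", "paper", "scissors")
# LOSES_TO[x] is the move that beats x.
LOSES_TO = {"rock": "paper", "paper": "scissors", "scissors": "rock"}


def payoff(mine, theirs):
    """Payoff to the player of `mine`: +1 for a win, 0 for a tie, -1 for a loss."""
    if mine == theirs:
        return 0
    return 1 if LOSES_TO[theirs] == mine else -1


def exploit_value(strategy):
    """Best expected payoff an opponent can get if they know `strategy`.

    `strategy` maps each move to the probability that you play it.
    """
    return max(
        sum(prob * payoff(reply, move) for move, prob in strategy.items())
        for reply in MOVES
    )


uniform = {move: 1 / 3 for move in MOVES}
leaning = {"rock": 0.4, "paper": 0.3, "scissors": 0.3}

print(exploit_value(uniform))  # approximately 0: nothing to exploit
print(exploit_value(leaning))  # approximately 0.1: the lean toward rock is punished
```

Under these assumptions, a predictor who shares your beliefs gains from any deviation from the uniform mix, which is one way of reading "if you were leaning in any direction, you would immediately lean in a different direction."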

You assert that LICDT pays the blackmail in XOR blackmail because it follows this law of logical causality. Is this because, conditioned on the letter being sent, if there is a disaster the agent assigns probability 0 to sending money, and if there isn't a disaster the agent assigns probability 1 to sending money, so the disaster must be causally downstream of the decision to send money if the agent is to know whether or not it sends money?

I don't recall whether I've written the following up, but a while after I wrote the OP here, I realized that LICDT/LIEDT can succeed in XOR Blackmail (failing to send the money), but for an absolutely terrible reason.

Suppose that the disaster is sufficiently rare -- much less probable than the exploration probability ε. Furthermore, suppose the exploration mechanism is p-chicken, IE "if you're too sure of what you do, do something else." (The story is more complicated for other exploration methods.)

Now how often does the agent respond to the letter?

Suppose that, overall, the agent responds to the letter with frequency at least ε (including rounds where the agent doesn't receive the letter). Then, conditional on the letter being sent, the agent is pretty sure it will respond to the letter -- it believes this with probability greater than 1 - ε. This is because the minimum response frequency would be ε, but this is already much more common than the disaster. Since the letter is only supposed to arrive when the disaster has happened or when the agent would respond (by hypothesis), it must be pretty likely that the agent is responding. It should learn this after enough trials.

But the agent is playing p-chicken. If the probability of responding to the letter is greater than 1 - ε, then the agent will refuse to do so. If the agent refuses, then the letter won't be sent except if the rare disaster is occurring. This contradicts the assumption that the agent responds to the letter with frequency at least ε.

So the agent receives and responds to the letter with frequency less than ε. On most rounds, the predictor simulates the agent and finds that the agent would have refused, had the letter been sent.

This is good. But the agent's reason for refusing is bonkers. The agent refuses because it thinks it responds to the letter. Its own credence in its responding to the letter is always bumping up against its 1 - ε ceiling.

A very similar thing can happen in Transparent Newcomb, except this time the response deprives the agent of the prize. In that problem, an agent only sees a full box in cases where it'll 1-box. So if it sees a full box infinitely often, its credence that it will 1-box (upon seeing a full box) must approach 1. But this can't be, since it must stay below 1 - ε. So in fact, the agent only sees a full box finitely often before being relegated to empty boxes forever. Omega keeps checking whether the agent would 1-box, and keeps seeing that it would 2-box due to its exploration clause triggering.

Moral of the story: p-chicken is pretty awful in perfect-predictor cases, particularly when the predictor is interested in what you do conditional on a particular observation.
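The XOR Blackmail step of the argument can be checked with rough arithmetic. The following is a minimal sketch under stated assumptions, not a model of LICDT/LIEDT: ε (written `eps`) stands for the exploration probability, `disaster_prob` for the per-round disaster probability, and letter rounds are assumed to split into "agent responds, no disaster" and "disaster, agent does not respond". The function names are hypothetical.

```python
def respond_credence_lower_bound(respond_freq, disaster_prob):
    """Lower bound on P(agent responds | letter sent).

    Letter rounds are either rounds where the agent responds (overall
    frequency `respond_freq`) or disaster rounds where it does not
    (overall frequency at most `disaster_prob`).
    """
    return respond_freq / (respond_freq + disaster_prob)


def p_chicken_refuses(credence, eps):
    """p-chicken: if too sure of an action (credence above 1 - eps), do something else."""
    return credence > 1 - eps


def disaster_prob_threshold(eps):
    """Disaster probabilities below this make the bound exceed 1 - eps
    whenever the response frequency is at least eps."""
    return eps ** 2 / (1 - eps)


eps = 0.01
disaster_prob = 1e-6  # much rarer than exploration

credence = respond_credence_lower_bound(eps, disaster_prob)
print(credence)                          # close to 1
print(p_chicken_refuses(credence, eps))  # True: the exploration clause fires
print(disaster_prob < disaster_prob_threshold(eps))  # True
```

With these numbers, responding at frequency ε already pushes the agent's credence past the 1 - ε ceiling, so p-chicken forces a refusal, matching the contradiction described above. The threshold `eps ** 2 / (1 - eps)` is one way to make "sufficiently rare" precise within this simplified model.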

Other exploration mechanisms only fare a little better.

Does the lottery ticket hypothesis suggest the scaling hypothesis?

The implication depends on the distribution of lottery tickets. If there is a short-tailed distribution, then the rewards of scaling will be relatively small; bigger would still get better, but very slowly. A long-tailed distribution, on the other hand, would suggest continued returns to getting more lottery tickets.
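As a toy illustration of the short-tailed versus long-tailed contrast, here is a hedged sketch. It assumes, purely for illustration, that each lottery ticket has an independent "quality", that performance tracks the best ticket, and that an exponential distribution stands in for "short-tailed" and a Pareto distribution for "long-tailed". None of these modelling choices come from the discussion itself, and the function names are hypothetical.

```python
import math


def expected_best_exponential(n):
    """Expected maximum of n i.i.d. Exponential(1) draws: the n-th harmonic number."""
    return sum(1.0 / k for k in range(1, n + 1))


def expected_best_pareto(n, alpha):
    """Expected maximum of n i.i.d. Pareto draws (scale 1, shape alpha > 1).

    Closed form: Gamma(1 - 1/alpha) * Gamma(n + 1) / Gamma(n + 1 - 1/alpha),
    computed with log-gamma to avoid overflow for large n.
    """
    log_ratio = math.lgamma(n + 1) - math.lgamma(n + 1 - 1.0 / alpha)
    return math.gamma(1.0 - 1.0 / alpha) * math.exp(log_ratio)


for n in (1_000, 2_000, 4_000):
    print(n, expected_best_exponential(n), expected_best_pareto(n, alpha=2.0))
```

In this toy model, doubling the number of tickets adds roughly a constant (about ln 2) to the expected best exponential draw, but multiplies the expected best Pareto draw by roughly a constant factor (about the square root of 2 for shape 2). That is the sense in which the short-tailed case improves "very slowly" while the long-tailed case shows continued returns.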

I ask a question here about what's true in practice.

Arguments against myopic training

I think this is where I disagree with this argument. I think you can get myopic agents which are competitive on long-run tasks because they are trying to do something like “be as close to HCH as possible” which results in good long-run task performance without actually being specified in terms of the long-term consequences of the agent's actions.

I'm somewhat conflicted here. I sympathize with Rohin's sibling comment. One of my take-aways from the discussion between Rohin and Ricraz is that it's really not very meaningful to classify training as myopic/nonmyopic based on superficial features, such as whether feedback is aggregated across multiple rewards. As Ricraz repeatedly states, we can shift back and forth between "myopic" and "nonmyopic" by e.g. using a nonmyopic reasoner to provide the rewards to a perfectly myopic RL agent. Rohin pushes back on this point by pointing out that the important thing about a myopic approval-directed agent (for example) is the additional information we get from the human in the loop. An approval-directed berry-collecting agent will not gain approval from steps which build infrastructure to help fool the human judge, where a non-myopic approval-seeking RL agent could estimate high expected value for taking such steps. But the same cannot be said of an agent myopically trained to approximate an approval-seeking RL agent.

So it seems misleading to describe a system as myopically imitating a non-myopic system -- there is no significant difference between non-myopic Q-learning vs myopic imitation of Q-learning. A notion of "myopia" which agrees with your usage (allowing for myopic imitation of HCH) does not seem like a very useful notion of myopia. I see this as the heart of Ricraz' critique (or at least, the part that I agree with).

OTOH, Rohin's defense of myopia turns on a core claim:

While small errors in reward specification can incentivize catastrophic outcomes, small errors in approval feedback are unlikely to incentivize catastrophic outcomes.

So, if we want a useful concept of "myopic training", it seems it should support this claim -- IE, myopic training should be the sort of optimization for which small errors in loss function are unlikely to create huge errors in outcomes.

Going back to the example of myopically imitating HCH, it seems what's important here is how errors might be introduced. If we assume HCH is trusted, then a loss function which introduces independent noise on HCH's answers to different questions would be fine. On the other hand, an approximation which propagated those errors along HCH trees -- so a wrong conclusion about human values influences many, many upstream computations -- would be not-fine, in the same way nonmyopic RL is not-fine.

I'm not sure how to resolve this in terms of a notion of "myopic training" which gets at the important thing.

How should AI debate be judged?

It seems like you've ignored the possibility of importance sampling?

Ah, right, I agree. I forgot about that suggestion as I was writing. It seems likely some version of this would work.

(I feel like I'm restating what you said, I guess I'm confused why you interpret this as evidence that the simplicity of the setup is "hiding something".)

Yep, sorry, I think you should take that as something-about-Scott's-point-Abram-didn't-explain. I still disclaim myself as maybe missing part of Scott's point. But: what the simpler setup is "hiding" is the complexity of comparing answers:

  • The complexity of determining whether two claims are "different".
  • The complexity of determining whether two claims are mutually exclusive.
  • The complexity of comparing the quality of different arguments, when the different answers may be expressed in very different ontologies, and deal with very difficult-to-compare considerations.

Making the two sides defend entirely unrelated claims makes all this obvious. In addition, it makes the first two bullet points irrelevant, removing a "fake difficulty" from the setup.

To what extent is GPT-3 capable of reasoning?

I like the second suggestion a lot more than the first. To me, the first is getting more at "Does GPT convert to a semantic representation, or just go based off of syntax?" I already strongly suspect it does something more meaningful than "just syntax" -- but whether it then reasons about it is another matter.

AI safety via market making

I think the consideration "you can point out sufficiently short circular arguments" should at least make you feel better about debate than iterated amplification or market making -- it's one additional way in which you can avoid circular arguments, and afaict there isn't a positive consideration for iterated amplification / market making that doesn't also apply to debate.

My interpretation of the situation is that this breaks the link between factored cognition and debate. One way to try to judge debate as an amplification proposal would have been to establish a link to HCH, by establishing that if there's an HCH tree computing some answer, then debate can use the tree as an argument tree, with the reasons for any given claim being the children in the HCH tree. Such a link would transfer any trust we have in HCH to trust in debate. The use of non-DAG arguments by clever debaters would seem to break this link.

OTOH, IDA may still have a strong story connecting it to HCH. Again, if we trusted HCH, we would then transfer that trust to IDA.

Are you saying that we can break the link between IDA and HCH in a very similar way, but which is worse due to having no means to reject very brief circular arguments?

How should AI debate be judged?

I'm still confused. Suppose the answers are free-form, and in the end the judge selects the answer ey assign a higher probability of truthfulness. If it's a very close call (for example both answers are literally the same), ey flip a coin. Then, in equilibrium both agents should answer honestly, not so?

This is undesirable, because if both players give the same answer there is no training signal. We still want to search for better answers rather than allowing things to stall out early in training. So (barring other ways of mitigating this problem) we want to encourage players to give different answers. Therefore, rather than flipping a coin for close calls, ties can be decided in favor of player 1. This means player 2's best bet is to select a plausible lie, if player 1 has already selected the best answer. That's how I understood debate to work prior to the current discussion. But, as I've mentioned, this solution isn't totally satisfactory. See here for my discussion of some other approaches to the problem.

How should AI debate be judged?

Symmetry Concerns

It seems the symmetry concerns of that document are quite different from the concerns I was voicing. The symmetry concerns in the document are, iiuc,

  • The debate goes well if the honest player expounds an argument, and the dishonest player critiques that argument. However, the debate goes poorly if those roles end up reversed. Therefore we force both players to do both.

OTOH, my symmetry concerns can be summarized as follows:

  • If player 2 chooses an answer after player 1 (getting access to player 1's answer in order to select a different one), then assuming competent play, player 1's answer will almost always be the better one. This prior taints the judge's decision in a way which seems to seriously reduce the training signal and threaten the desired equilibrium.
  • If the two players choose simultaneously, then it's hard to see how to discourage them from selecting the same answer. This seems likely at late stages due to convergence, and also likely at early stages due to the fact that both players actually use the same NN. This again seriously reduces the training signal.

I now believe that this concern can be addressed, although it seems a bit fiddly, and the mechanism which I currently believe addresses the problem is somewhat complex.

Known Debate Length

I'm a bit confused why you would make the debate length known to the debaters. This seems to allow them to make indefensible statements at the very end of a debate, secure in the knowledge that they can't be critiqued. One step before the end, they can make statements which can't be convincingly critiqued in one step. And so on.

Instead, it seems like you'd want the debate to end randomly, according to a memoryless distribution. This way, the expected future debate length is the same at all times, meaning that any statement made at any point is facing the same expected demand of defensibility.
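The memorylessness claim can be checked numerically. The sketch below assumes a geometric stopping rule (the debate ends after each round with a fixed probability `stop_prob`), which is one memoryless choice and not necessarily the one any debate proposal uses. The function names and the truncation length are hypothetical.

```python
def geometric_pmf(stop_prob, max_rounds=2_000):
    """pmf[k - 1] = P(debate ends exactly at round k), truncated at max_rounds."""
    return [(1 - stop_prob) ** (k - 1) * stop_prob for k in range(1, max_rounds + 1)]


def fixed_length_pmf(length):
    """A debate that always ends at round `length` (length known to the debaters)."""
    return [0.0] * (length - 1) + [1.0]


def expected_remaining(pmf, elapsed):
    """Expected number of further rounds, given the debate has survived `elapsed` rounds."""
    tail_mass = sum(pmf[elapsed:])
    weighted = sum((k - elapsed) * pmf[k - 1] for k in range(elapsed + 1, len(pmf) + 1))
    return weighted / tail_mass


memoryless = geometric_pmf(stop_prob=0.2)
known = fixed_length_pmf(length=10)

for elapsed in (0, 5, 9):
    print(elapsed, expected_remaining(memoryless, elapsed), expected_remaining(known, elapsed))
```

Under the geometric rule the expected remaining length stays at about 1 / `stop_prob` (5 rounds here) no matter how long the debate has run, whereas with a known length it shrinks to 1 at the final round, which is where the indefensible last statement becomes safe.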

Factored Cognition

I currently think all my concerns can be addressed if we abandon the link to factored cognition and defend a less ambitious thesis about debate. The Feb 2020 proposal does touch on some of my concerns there, by enforcing a good argumentative structure, rather than allowing the debate to spiral out of control (due to e.g. delaying tactics).

However, my overall position is still one of skepticism wrt the link to factored cognition. The most salient reason for me ATM is the concern that debaters needn't structure their arguments as DAGs which ground out in human-verifiable premises, but rather, can make large circular arguments (too large for the debate structure to catch) or unbounded argument chains (or simply very, very high-depth argument trees, which contain a flaw at a point far too deep for debate to find).

ETA: Having now read more of the Feb 2020 report, I see that very similar concerns are expressed near the end -- the long computation problem seems pretty similar to what I'm pointing at.
