Paul Christiano


Iterated Amplification


My Understanding of Paul Christiano's Iterated Amplification AI Safety Research Agenda

What would a corrigible but not-intent-aligned AI system look like?

Suppose that I think you know me well and I want you to act autonomously on my behalf using your best guesses. Then you can be intent aligned without being corrigible. Indeed, I may even prefer that you be incorrigible, e.g. if I want your behavior to be predictable to others. If the agent knows that I have such a preference then it can't be both corrigible and intent aligned.

Hiring engineers and researchers to help align GPT-3

described by Eliezer as “directly, straight-up relevant to real alignment problems.”

Link to thread.

Worth saying that Eliezer still thinks our team is pretty doomed and this is definitely not a general endorsement of our agenda. I feel excited about our approach and think it may yet work, but I believe Eliezer's position is that we're just shuffling around the most important difficulties into the part of the plan that's vague and speculative.

I think it's fair to say that Reflection is on the Pareto frontier of {plays ball with MIRI-style concerns, does mainstream ML research}. I'm excited for a future where either we convince MIRI that aligning prosaic AI is plausible, or MIRI convinces us that it isn't.

Hiring engineers and researchers to help align GPT-3

will these jobs be long-term remote? if not, on what timeframe will they be remote?

We expect to be requiring people to work from the office again sometime next year.

how suitable is the research engineering job for people with no background in ml, but who are otherwise strong engineers and mathematicians?

ML background is very helpful. Strong engineers who are interested in learning about ML are also welcome to apply, though no promises about how well we'll handle those applications in the current round.

Hiring engineers and researchers to help align GPT-3

The team is currently 7 people and we are hiring 1-2 additional people over the coming months.

I am optimistic that our team and other similar efforts will be hiring more people in the future and continuously scaling up, and that over the long term there could be a lot of people working on these issues.

(The post is definitely written with that in mind and the hope that enthusiasm will translate into more than just hires in the current round. Growth will also depend on how strong the pool of candidates is.)

“Unsupervised” translation as an (intent) alignment problem

The researchers analyzed the Klingon phrase "מהדקי נייר" and concluded it roughly means 

If the model is smart, this is only going to work if the (correct) translation is reasonably likely to appear in your English text database. You are (at best) going to get a prediction of what human researchers would conclude after studying Klingon; your model isn't actually going to expand what humans can do.

Consider a Debate experiment in which each of the two players outputs an entire English-Klingon dictionary (as avturchin mentioned). The judge then samples a random Klingon passage and decides which of the two dictionaries is more helpful for understanding that passage (maybe while allowing the two players to debate over which dictionary is more helpful).

This is basically what the helper model does, except:

  • For competitiveness you should learn and evaluate the dictionary at the same time you are training the model. Running a debate experiment many times where debaters have to output a full dictionary would likely be prohibitively expensive.
  • Most knowledge about language isn't easily captured in a dictionary (for example, a human using a Spanish-English dictionary is a mediocre translator), so we'd prefer to have a model that answers questions about meaning rather than a model that outputs a static dictionary.
  • I don't know what standard you want to use for "helpful for understanding the passage" but I think "helps predict the next word correctly" is probably the best approach (since the goal is to be competitive and that's how GPT learned).

After making those changes we're back at the "learning the prior" proposal.

I think that proposal may work passably here because we can potentially get by with a really crude prior---basically we think "the helper should mostly just explain the meaning of terms" and then we don't need to be particularly opinionated about which meanings are more plausible. I agree that the discussion in the section "A vague hope" is a little bit too pessimistic for the given context of unaligned translation.

Search versus design

I liked this post.

I'm not sure that design will end up being as simple as this picture makes it look, no matter how well we understand it---it seems like factorization is one kind of activity in design, but it feels like overall "design" is being used as a kind of catch-all that is probably very complicated.

An important distinction for me is: does the artifact work because of the story (as in "design"), or does the artifact work because of the evaluation (as in search)?

This isn't so clean, since:

  • Most artifacts work for a combination of the two reasons---I design a thing then test it and need a few iterations---there is some quantitative story where both factors almost always play a role for practical artifacts.
  • There seem to be many other reasons things work (e.g. "it's similar to other things that worked" seems to play a super important role in both design and search).
  • A story seems like it's the same kind of thing as an artifact, and we could also talk about where *it* comes from. A story that plays a role in a design itself comes from some combination of search and design.
  • During design it seems likely that humans rely very extensively on searching against mental models, which may not be introspectively available to us as a search but seems like it has similar properties.

Despite those and more complexities, it feels to me like if there is a clean abstraction it's somewhere in that general space, about the different reasons why a thing can work.

Post-hoc stories are clearly *not* the "reason why things work" (at least at this level of explanation). But also if you do jointly search for a model+helpful story about it, the story still isn't the reason why the model works, and from a safety perspective it might be similarly bad.

How should AI debate be judged?
Yeah, I've heard (through the grapevine) that Paul and Geoffrey Irving think debate and factored cognition are tightly connected

For reference, this is the topic of section 7 of AI Safety via Debate.

In the limit they seem equivalent: (i) it's easy for HCH(with X minutes) to discover the equilibrium of a debate game where the judge has X minutes, (ii) a human with X minutes can judge a debate about what would be done by HCH(with X minutes).

The ML training strategies also seem extremely similar, in the sense that the difference between them is smaller than design choices within each of them, though that's a more detailed discussion.

How should AI debate be judged?
I'm a bit confused why you would make the debate length known to the debaters. This seems to allow them to make indefensible statements at the very end of a debate, secure in the knowledge that they can't be critiqued. One step before the end, they can make statements which can't be convincingly critiqued in one step. And so on.
The most salient reason for me ATM is the concern that debaters needn't structure their arguments as DAGs which ground out in human-verifiable premises, but rather, can make large circular arguments (too large for the debate structure to catch) or unbounded argument chains (or simply very very high depth argument trees, which contain a flaw at a point far too deep for debate to find).

If I assert "X because Y & Z" and the depth limit is 0, you aren't intended to say "Yup, checks out," unless Y and Z and the implication are self-evident to you. Low-depth debates are supposed to ground out with the judge's priors / low-confidence in things that aren't easy to establish directly (because if I'm only updating on "Y looks plausible in a very low-depth debate" then I'm going to say "I don't know but I suspect X" is a better answer than "definitely X"). That seems like a consequence of the norms in my original answer.

In this context, a circular argument just isn't very appealing. At the bottom you are going to be very uncertain, and all that uncertainty is going to propagate all the way up.
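A toy illustration of that propagation, under an assumption that is not part of the norms above: the judge treats a conclusion as no more credible than its weakest premise, and falls back on a prior once the depth limit is reached. The names `confidence`, `priors` and `support` are hypothetical.

```python
def confidence(claim, priors, support, depth):
    """Judge's confidence in `claim` with `depth` levels of justification left.

    priors:  claim -> confidence the judge assigns with no argument at all
    support: claim -> premises offered as "claim because premises"
    """
    premises = support.get(claim)
    if depth == 0 or not premises:
        # Out of depth (or nothing offered): fall back on the judge's prior.
        return priors[claim]
    # Assumed rule: a conclusion is no more credible than its weakest premise.
    return min(confidence(p, priors, support, depth - 1) for p in premises)


# A circular argument never gets above the prior it bottoms out in.
circular = {"X": ["Y"], "Y": ["X"]}
print(confidence("X", {"X": 0.5, "Y": 0.5}, circular, depth=10))

# An argument that grounds out in near-self-evident premises does better.
grounded = {"X": ["Y", "Z"]}
print(confidence("X", {"X": 0.5, "Y": 1.0, "Z": 0.9}, grounded, depth=1))
```

However deep the circle goes, the value returned is whatever prior sits at the bottom, so the extra depth buys the debater nothing in this toy model.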

Instead, it seems like you'd want the debate to end randomly, according to a memoryless distribution. This way, the expected future debate length is the same at all times, meaning that any statement made at any point is facing the same expected demand of defensibility.

If you do it this way the debate really doesn't seem to work, as you point out.
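For concreteness, here is a small sketch of the memoryless stopping rule from the quoted suggestion. It assumes the debate ends after each round with a fixed probability `stop_prob` (a geometric distribution); the function name and the truncation `horizon` are illustrative choices, not anything from the debate rules.

```python
def expected_remaining_rounds(stop_prob, rounds_survived, horizon=10_000):
    """E[N - k | N > k] for a debate length N that is geometric on {1, 2, ...}.

    stop_prob:       probability the debate ends after any given round
    rounds_survived: k, the number of rounds already played
    horizon:         where the infinite sum is truncated (tail is negligible)
    """
    survive = 1.0 - stop_prob
    p_alive = survive ** rounds_survived  # P(N > k)
    total = 0.0
    for n in range(rounds_survived + 1, rounds_survived + horizon + 1):
        p_n = survive ** (n - 1) * stop_prob  # P(N = n)
        total += (n - rounds_survived) * p_n
    return total / p_alive


for k in (0, 5, 20):
    # The same value (about 1 / stop_prob) regardless of k.
    print(k, round(expected_remaining_rounds(0.25, k), 6))
```

With a known fixed length T, by contrast, the remaining length after k rounds is T - k, which shrinks to zero as the debate proceeds.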

I currently think all my concerns can be addressed if we abandon the link to factored cognition and defend a less ambitious thesis about debate.

For my part I mostly care about the ambitious thesis.

If the two players choose simultaneously, then it's hard to see how to discourage them from selecting the same answer. This seems likely at late stages due to convergence, and also likely at early stages due to the fact that both players actually use the same NN. This again seriously reduces the training signal.
If player 2 chooses an answer after player 1 (getting access to player 1's answer in order to select a different one), then assuming competent play, player 1's answer will almost always be the better one. This prior taints the judge's decision in a way which seems to seriously reduce the training signal and threaten the desired equilibrium.

I disagree with both of these as objections to the basic strategy, but don't think they are very important.

How should AI debate be judged?

Sorry for not understanding how much context was missing here.

The right starting point for your question is this writeup which describes the state of debate experiments at OpenAI as of end-of-2019 including the rules we were using at that time. Those rules are a work in progress but I think they are good enough for the purpose of this discussion.

In those rules: If we are running a depth-T+1 debate about X and we encounter a disagreement about Y, then we start a depth-T debate about Y and judge exclusively based on that. We totally ignore the disagreement about X.
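A minimal sketch of that recursion, as I read the sentence above. The `Debate` structure, the `direct_verdict` field and the behaviour at depth 0 (the judge decides directly) are assumptions made for illustration, not part of the writeup.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Debate:
    topic: str
    # Winner ("A" or "B") the judge would pick by looking at this topic directly.
    direct_verdict: str
    # Sub-question the debaters disagree about, if any (hypothetical structure).
    disagreement: Optional["Debate"] = None


def judge(debate: Debate, depth: int) -> str:
    """If depth remains and a disagreement came up, judge only the sub-debate."""
    if depth > 0 and debate.disagreement is not None:
        # The disagreement about the parent topic is ignored entirely.
        return judge(debate.disagreement, depth - 1)
    # Assumption: with no depth left, the judge decides on the topic directly.
    return debate.direct_verdict


y = Debate(topic="Y", direct_verdict="B")
x = Debate(topic="X", direct_verdict="A", disagreement=y)
print(judge(x, depth=1))  # decided by the debate about Y
print(judge(x, depth=0))  # no depth left, decided on X directly
```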

Our current rules---to hopefully be published sometime this quarter---handle recursion in a slightly more nuanced way. In the current rules, after debating Y we should return to the original debate. We allow the debaters to make a new set of arguments, and it may be that one debater now realizes they should concede, but it's important that a debater who had previously made an untenable claim about X will eventually pay a penalty for doing so (in addition to whatever payoff they receive in the debate about Y). I don't expect this paragraph to be clear and don't think it's worth getting into until we publish an update, but wanted to flag it.

Do the debaters know how long the debate is going to be?


To what extent are you trying to claim some relationship between the judge strategy you're describing and the honest one? EG, that it's eventually close to honest judging? (I'm asking whether this seems like an important question for the discussion vs one which should be set aside.)

If debate works, then at equilibrium the judge will always be favoring the better answer. If furthermore the judge believes that debate works, then this will also be their honest belief. So if judges believe in debate then it looks to me like the judging strategy must eventually approximate honest judging. But this is downstream of debate working; it doesn't play an important role in the argument that debate works or anything like that.

How should AI debate be judged?
Do you mean that every debater could have defended each of their statements s in a debate which lasted an additional N steps after s was made? What happens if some statements are challenged? And what exactly does it mean to defend statements from a challenge?

Yes. N is the remaining length of the debate. As discussed in the paper, when one player thinks that the other is making an indefensible claim then we zoom in on the subclaim and use the remaining time to resolve it.

I get the feeling you're suggesting something similar to the high school debate rule (which I rejected but didn't analyze very much), where unrefuted statements are assumed to be established (unless patently false), refutations are assumed decisive unless they themselves are refuted, etc.

There is a time/depth limit. A discussion between two people can end up with one answer that is unchallenged, or two proposals that everyone agrees can't be resolved in the remaining time. If there are conflicting answers that debaters don't expect to be able to resolve in the remaining time, the strength of inference will depend on how much time is remaining, and will mean nothing if there is no remaining time.

At the end of training, isn't the idea that the first player is winning a lot, since the first player can choose the best answer?

I'm describing what you should infer about an issue that has come up where neither player wants to challenge the other's stance.

Are agents really incentivized to justify their assertions?

Under the norms I proposed in the grandparent, if one player justifies and the other doesn't (nor challenge the justification), the one who justifies will win. So it seems like they are incentivized to justify.

Are those justifications incentivized to be honest?

If they are dishonest then the other player has the opportunity to challenge them. So initially making a dishonest justification may be totally fine, but eventually the other player will learn to challenge and you will need to be honest in order to defend.

In the cases where the justifications aren't fully verifiable, does it really make sense for the humans to trust anything they say? In particular, given the likelihood that one of the agents is lying?

It's definitely an open question how much can be justified in a depth N debate.

I recognize that you're saying these are open questions, I'm just trying to highlight where I'm confused -- particularly as these questions are bound up with the question of what judge strategies should look like. It seems like a lot of pieces need to come together in just the right way, and I'm not currently seeing how judge strategies can simultaneously accomplish everything they need to.

It seems like the only ambiguity in the proposal in the grandparent is: "How much should you infer from the fact that a statement can be defended in a length T debate?" I agree that we need to answer this question to make the debate fully specified (of course we wanted to answer it anyway in order to use debate). My impression is that this isn't what you are confused about and that there's a more basic communication problem.

In practice this doesn't seem to be an important part of the difficulty in getting debates to work, for the reasons I sketched above---debaters are free to choose what justifications they give, so a good debater at depth T+1 will give statements that can be justified at depth T (in the sense that a conflicting opinion with a different upshot couldn't be defended at depth T), and the judge will basically ignore statements where conflicting positions can both be justified at depth T. It seems likely there is some way to revise the rules so that the judge instructions don't have to depend on "assume that answer can be defended at depth T" but it doesn't seem like a priority.
