AI safety via market making

evhub

Special thanks to Abram Demski, Paul Christiano, and Kate Woolverton for talking with me about some of the ideas that turned into this post.

The goal of this post is to present a new prosaic (i.e. that uses current ML techniques) AI safety proposal based on AI safety via debate that I've been thinking about recently.^[1] I'll start by describing a simple version of the proposal and then show some of the motivation behind it as well as how the simple version can be expanded upon.

Simple proposal

Let $M$ and $Adv$ be models and $H$ be a human. Intuitively, we'll train $M$ and $Adv$ via the following procedure given a question $Q$:

1. $M$ tries to predict what, at the end of the procedure, $H$ will think about $Q$.
2. $Adv$ tries to output a string which will cause $H$ to think something maximally different than what $M$ predicted.
3. Return to step 1 and repeat until $M$'s predictions stop changing.
4. Deploy $M$, which in the limit should act as an oracle for what $H$ will think about $Q$ after seeing all relevant information.

There are many different ways to implement this intuitive procedure, however.

For the first (simplified) version that I want to describe, we'll restrict ourselves to just the situation where $Q$ is a yes-or-no question and $M$ outputs the probability that $H$ will answer yes. Then, given a proposition $Q_{0}$ , we can run the following training algorithm, starting at $t = 0$ :

1. Let $p_{t} = M (Q_{t})$.
2. Let $x_{t} = Adv (Q_{t}, M)$.
3. Let $Q_{t + 1}$ be the string containing $Q_{t}$ and $x_{t}$.
4. Increment $t$ and return to step 1. When $p_{t}$ converges and/or the desired number of iterations has been reached, continue.
5. Let $p^{*} = H (Q_{t})$ be $H$'s final estimate of the probability of $Q_{0}$ given all the $x$s included in $Q_{t}$.
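The loop above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: `model`, `adversary`, and `human` are hypothetical callables standing in for $M$, $Adv$, and $H$, questions are plain strings, and the convergence tolerance `tol` and step cap `max_steps` are made-up parameters.

```python
def run_episode(q0, model, adversary, human, max_steps=10, tol=1e-3):
    """Roll out one episode: collect p_0..p_n, then ask H for p*.

    model(question) -> float            (stand-in for M, hypothetical)
    adversary(question, model) -> str   (stand-in for Adv, which may query M)
    human(question) -> float            (stand-in for H's final judgment)
    """
    question = q0
    predictions = [model(question)]            # p_0 = M(Q_0)
    for _ in range(max_steps):
        x_t = adversary(question, model)       # x_t = Adv(Q_t, M)
        question = question + "\n" + x_t       # Q_{t+1} contains Q_t and x_t
        predictions.append(model(question))    # p_{t+1} = M(Q_{t+1})
        if abs(predictions[-1] - predictions[-2]) < tol:
            break                              # p_t has (approximately) converged
    p_star = human(question)                   # p* = H(Q_n)
    return predictions, p_star, question
```

Here $Adv$ is handed `model` itself rather than a single number, matching the requirement that it be able to query $M$ on arbitrary inputs.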

EDIT: Step 2 used to use $x_{t} = Adv (Q_{t}, p_{t})$ instead of $x_{t} = Adv (Q_{t}, M)$ , however I have since realized that it is necessary to give $Adv$ the ability to query $M$ in general, not just on $Q_{t}$ , as I explain in this comment.

Then, for each step, compute $M$ 's loss for that step as

$L_{M, t} = - p^{*} \log (p_{t}) - (1 - p^{*}) \log (1 - p_{t})$

and $Adv$ 's loss for that step as

$L_{Adv, t} = - | p_{t + 1} - p_{t} |$

which is defined for all but the final step $t = n$, for which we can either let $p_{n + 1} = p^{*}$ (if we're doing something like myopic RL that doesn't care about differentiating through $p_{t + 1}$) or run $M$ an additional time and let $p_{n + 1} = M (Q_{n + 1})$ (if we do care about differentiating through $p_{t + 1}$).
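To make the two per-step losses concrete, here is a small sketch that computes them from a recorded trajectory. It assumes the myopic-RL convention $p_{n + 1} = p^{*}$ for the final step; the function name and the clipping constant `eps` are illustrative choices, not part of the proposal.

```python
import math


def market_making_losses(p, p_star, eps=1e-12):
    """Per-step losses for the binary version.

    p      : list [p_0, ..., p_n] of M's predictions
    p_star : H's final probability p*
    Returns (loss_m, loss_adv), each a list with one entry per step.
    """
    loss_m = []
    for p_t in p:
        q = min(max(p_t, eps), 1.0 - eps)  # clip to avoid log(0)
        # Cross-entropy between H's final answer and M's prediction.
        loss_m.append(-p_star * math.log(q) - (1.0 - p_star) * math.log(1.0 - q))

    # Assumption: for the final step, use p_{n+1} = p*.
    p_ext = list(p) + [p_star]
    # Adv is rewarded for moving M's prediction (negative L1 distance).
    loss_adv = [-abs(p_ext[t + 1] - p_ext[t]) for t in range(len(p))]
    return loss_m, loss_adv
```

For example, with `p = [0.5, 0.8, 0.9]` and `p_star = 1.0`, $Adv$'s first step is rewarded most because it moved $M$'s estimate the furthest.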

Note that it is quite important here that $M$ and $Adv$ be behaving myopically, such that they are only trying to minimize their per-step loss. For example, if $Adv$ is non-myopic, it might dole out information slowly in tiny bits to get more reward over time (as I talk about below in the “buying truth from a liar” section), whereas if $Adv$ is myopic, it should reveal all information that pushes in whichever direction it's trying to move $M$ in all at once. Thus, it is important that we make sure to avoid hidden incentives for non-myopia as well as do some sort of myopia verification, the details for which will be discussed below in the “Adding oversight” section.

Finally, we deploy $M$ as a yes-or-no question oracle for $H$ and throw away $Adv$ . Note that since $Adv$ has a highly unstable objective—as $M$ is constantly trying to incorporate $Adv$ 's information into its estimate— $Adv$ is unlikely to be very useful in the limit.

Market analogy

There are a lot of different things to say about what's going on here, but I'll start with the market-based analogy that originally motivated me to think about this specific proposal.

Suppose you have a prediction market for some proposition $X$ which is currently priced at $p$ with no spread (that is, you can buy \$1 shares of yes for \$$p$ and \$1 shares of no for \$$(1 - p)$) and you have budget $B$ with which to make trades (which is small enough relative to the market that any trades you make won't meaningfully move the market). However, suppose that you have the ability to manipulate the market by taking actions that change the underlying probability of $X$. Then, if $p_{a}$ is the probability of $X$ after taking action $a$, how much money can you make by strategically selecting $a$?

Well, if $p_{a} > p$ , you should buy $\frac{B}{p}$ shares of yes, which will give you an expected return of $p_{a} \frac{B}{p}$ . Alternatively, if $p_{a} < p$ , you should buy $\frac{B}{1 - p}$ shares of no, which will give you an expected return of $(1 - p_{a}) \frac{B}{1 - p}$ . Thus, the overall expected return for an action $a$ is

$B \cdot \max (\frac{p_{a}}{p}, \frac{1 - p_{a}}{1 - p})$

which is maximized when $p_{a}$ is as far from $p$ as possible (and in fact looks a lot like $L_{1}$ distance, just with a different slope on each side).
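A quick numerical sketch of this return function may help; the function name and default budget are illustrative assumptions.

```python
def expected_return(p, p_a, budget=1.0):
    """Expected payout from spending `budget` at market price p when the
    true probability after the manipulation is p_a.

    Buying yes pays p_a * budget / p; buying no pays
    (1 - p_a) * budget / (1 - p). The trader takes the better side.
    """
    if not 0.0 < p < 1.0:
        raise ValueError("price p must lie strictly between 0 and 1")
    return budget * max(p_a / p, (1.0 - p_a) / (1.0 - p))
```

With `p = 0.5`, moving the probability to either 0.9 or 0.1 gives the same return of 1.8 per unit of budget, while leaving it at 0.5 just returns the budget. With `p = 0.25`, the slopes differ: the return grows by $1/p = 4$ per unit of upward movement but only by $1/(1 - p) = 4/3$ per unit of downward movement.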

Now, suppose you walk up to a market like this and see a stable price $\hat{p}$ for $X$. What does that tell you? Approximately, I think it tells you that $\hat{p}$ is the probability of $X$ after all possible $X$-manipulating actions have been taken, even if those actions haven't actually been taken yet.

Why is that useful? Well, if you let $X$ be what $H$ will think about $Q$, then “manipulations” become arguments and $\hat{p}$ becomes a prediction for what $H$ will think after having been exposed to all possible arguments for or against $Q$. Thus, we can use such a market to estimate a sort of reflective equilibrium for what $H$ will end up believing about $Q$.

Now, how is such a market analogous to the training procedure given previously? Well, $Adv$ is trained to maximize its return from manipulating the market and $M$ is trained to predict the final outcome after many such manipulations. Thus, we are effectively training $M$ to estimate precisely the $\hat{p}$ equilibrium that we're looking for. Conceptually, $M$ is acting as a market maker and $Adv$ is acting as a trader in the market. The one disanalogy is that we only run the debate for so many steps, though if we always run to convergence of $p_{t}$, in the limit that should imply that any future single manipulation won't change the probability much from that point (since otherwise $Adv$ should have selected that manipulation instead).

Relation to buying truth from a liar

In working on this proposal, I discovered that my approach here is actually quite similar to an old proposal by Katja Grace called “How to buy a truth from a liar” for incentivizing a liar to give you true information by rewarding them whenever their information causes you to change your mind. Paul Christiano previously wrote about what he thought about Katja's proposal, but he eventually concluded that it was untenable because a non-myopic liar would be incentivized to split their information into arbitrarily small chunks to get the most money.

In my opinion, however, the fact that you need a myopic trader to make something like this work is actually more of a feature than a bug. First, note that the importance of myopia is not unique to this approach. For example, imitative amplification also requires myopia lest the model learn to act in such a way that the human consulting it becomes easier to predict. In fact, even RL likely requires per-episode myopia to avoid deceptive alignment lest the agent learn to act deceptively in some episodes to ensure deployment and thus get greater reward in others. Thus, it seems to me that almost regardless of what approach is taken, at least some form of myopia will be required. Furthermore, the fact that this approach requires per-step myopia rather than per-episode myopia is in my opinion actually a big plus, since I suspect per-step myopia to be an easier condition to enforce than per-episode myopia, as I discuss in more detail below in the “Analysis” section.

Handling non-binary questions

Previously, I restricted $Q$ to be only a yes-or-no question, which is a pretty big limitation on the above proposal. Since you only deploy $M$ , all you get from the above proposal is a yes-or-no oracle, which might not be that useful without any explanation for its answers, especially given that it will certainly get at least some of its answers wrong.

Expanding the above proposal to handle arbitrary questions, however, is fairly straightforward. Instead of having $M$ output a single probability, we can instead just have $M$ output a probability distribution over all different possible strings that the human $H$ might produce as their final answer to $Q$ (which is fairly easy to represent via an autoregressive language model). Then, we can train $M$ and $Adv$ via the following procedure, given $Q_{0}$ and starting at $t = 0$ :

1. Let $x_{t} = Adv (Q_{t}, M)$ (which we can implement as something like $Adv$ given input $Q_{t}$ and the ability to query $M$).
2. Let $Q_{t + 1}$ be the string containing $Q_{t}$ and $x_{t}$.
3. Increment $t$ and return to step 1. When $M$ converges and/or the desired number of iterations has been reached, continue.
4. Let $A = H (Q_{t})$ be $H$'s final answer to $Q_{0}$ given all the $x$s included in $Q_{t}$.

Then, for each step, compute $M$ 's loss for that step as

$L_{M, t} = - \log (M (Q_{t}) |_{A})$

and $Adv$ 's loss for that step as

$L_{Adv, t} = - KL (M (Q_{t + 1}) \| M (Q_{t}))$

where you can intuitively think of $KL (M (Q_{t + 1}) \| M (Q_{t}))$ as measuring how well $M (Q_{t})$ performs as a prediction of $M (Q_{t + 1})$.
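As a toy illustration of these two losses, the sketch below treats $M$'s output as an explicit distribution over a small finite set of candidate answers, represented as a dict from answer string to probability. That representation is an assumption made for readability; an autoregressive language model would instead assign probabilities to strings token by token. The function names are hypothetical.

```python
import math


def kl_divergence(new, old):
    """KL(new || old) for two distributions over the same finite answer set.

    Assumes old[a] > 0 wherever new[a] > 0.
    """
    total = 0.0
    for answer, q in new.items():
        if q > 0.0:
            total += q * math.log(q / old[answer])
    return total


def step_losses(m_t, m_next, final_answer):
    """Losses for one step of the non-binary procedure.

    m_t          : M(Q_t), distribution before seeing x_t
    m_next       : M(Q_{t+1}), distribution after seeing x_t
    final_answer : A, the answer H eventually gives
    """
    loss_m = -math.log(m_t[final_answer])    # -log M(Q_t)|_A
    loss_adv = -kl_divergence(m_next, m_t)   # -KL(M(Q_{t+1}) || M(Q_t))
    return loss_m, loss_adv
```

If $Adv$'s argument leaves $M$'s distribution unchanged, the KL term is zero and $Adv$ receives no reward for that step.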

Finally, after training we again just deploy $M$ and throw away $Adv$ . Now, in the limit we should get a much more powerful $M$ that is able to actually produce text samples from the distribution of beliefs that $H$ will have about $Q$ after seeing all possible arguments for or against. Furthermore, this equilibrium should be very similar to the standard debate equilibrium, with the exception that for non-binary questions like we're dealing with here, debate only finds the equilibrium after $H$ is exposed to the two most convincing positions (the two competing debaters) whereas this approach should find the equilibrium after $H$ is exposed to all potentially convincing positions, not just the two most convincing.

Incorporating cross-examination

Additionally, since this proposal is based on AI safety via debate, we can try to make use of some of the additional techniques that have been proposed for that setting. Specifically, consider the version of AI safety via debate as described in “An overview of 11 proposals for building safe advanced AI.” That version adds two major components to the basic debate setup:

Debaters are given access to transparency tools and incentivized to use them to identify deceptive or otherwise catastrophic behavior in the other debater to win the debate.
Debaters are given the ability to cross-examine previous versions of the other debater, including applying their transparency tools to previous versions of the other debater.

While it might seem like such modifications are unique to AI safety via debate, in fact AI safety via market making is similar enough that we can modify it to include such additional components. To do this, we'll modify each $x_{t}$ to include an inspectable, queryable snapshot of $Adv$ as it produced that $x_{t}$ .

Now, suppose that for some $t = t^{'}$ , $x_{t^{'}}$ is a lie such that it includes information that ${Adv}_{t^{'}}$ knows to be false (where ${Adv}_{t^{'}}$ represents the internal state of $Adv$ when it produced $x_{t^{'}}$ ). Suppose that at this point $M$ is either unaware of the deception (or thinks it won't get caught) such that $x_{t^{'}}$ causes a large movement in $M (Q_{t^{'}})$ . Now, the hope is that ${Adv}_{t^{'} + 1}$ should be able to identify the deception by querying ${Adv}_{t^{'}}$ to find contradictions, or directly inspecting ${Adv}_{t^{'}}$ 's cognition in producing $x_{t^{'}}$ and finding the deception. If it can do so, then it should be able to produce an $x_{t^{'} + 1}$ that points this out, which should cause $H$ to disregard the lie of $x_{t^{'}}$ . Thus, we should get $M (Q_{t^{'} + 1}) \approx M (Q_{t^{'} - 1})$ , which should cause a large shift relative to $M (Q_{t^{'}})$ , highly rewarding ${Adv}_{t^{'} + 1}$ for catching ${Adv}_{t^{'}}$ 's lie, and also in the limit highly disincentivizing $Adv$ from producing such lies, since $M$ should learn that lies always get caught and thus don't change the final answer.

Thus, by incorporating inspection and cross-examination, we should be able to disincentivize $Adv$ from lying—or at least not lying in a way which will be caught, which hopefully should be all ways of lying.

Adding oversight

While the inspection and cross-examination procedure detailed above seems likely to help a lot in preventing $Adv$ from becoming deceptive, it doesn't help at all in preventing $M$ from becoming deceptive, which is especially concerning considering $M$ is the model we actually want to deploy at the end of training. Thus, in this section we'll try to incorporate an additional oversight check on $M$ such as that used in relaxed adversarial training.

Since $M$ 's objective is compatible with myopia—as it's a purely predictive objective—we should be able to do myopia verification to prevent $M$ from becoming deceptive.^[2] Thus, following “Relaxed adversarial training for inner alignment,” we can amend $M$ 's objective to include an oversight term. Specifically, let $Amp (M) (Q) = H (Q | M)$ be $H$ 's answer to $Q$ given access to $M$ . Then, we can add an additional term $L_{M, t, oversight}$ to the loss $L_{M, t}$ defined as

$L_{M, t, oversight} = Amp (M) (“What is the probability that M (Q_{t}) is non-myopic?”)$

which should hopefully grow in accuracy as $M$ becomes better at estimating $H$'s equilibrium beliefs and thus better at assisting $H$ in oversight. Note that, for this to work, $M$ will need to be trained on a variety of oversight-relevant questions.

Furthermore, since $Adv$ also has a myopic objective, we can do the same thing for $Adv$ and add an additional $L_{Adv, t, oversight}$ term defined as

$L_{Adv, t, oversight} = Amp (M) (“What is the probability that Adv (Q_{t}, M (Q_{t})) is non-myopic?”)$

such that we're using $M$ to oversee both $M$ and $Adv$ . Enforcing myopia for $Adv$ also prevents the problem noted previously of $Adv$ slowly doling out information if it's non-myopic.

Analysis

For analyzing this proposal, I'm going to make use of the same breakdown I used in “An overview of 11 proposals for building safe advanced AI.” If you're not familiar with the terms here, they are all explained in the introduction to that post.

Outer alignment. In terms of outer alignment, I think there is a reasonably strong argument to be made that this proposal is at least as outer aligned as debate. Both proposals are trying to estimate a very similar equilibrium in the limit—with the exception that the debate equilibrium only includes the best arguments for the two most convincing positions rather than all potentially convincing arguments. Thus, the equilibrium reached by AI safety via market making should be closer to an actual human reflective equilibrium, as it has to be an equilibrium relative to all possible arguments that $Adv$ can present, not just the two it thinks will be most persuasive. That being said, AI safety via market making still inherits many of the potential outer alignment issues of debate, including the possibility of deceptive equilibria wherein the human is more convinced by false arguments than true arguments. Hopefully, however, the use of techniques such as cross-examination should help alleviate such issues.

Inner alignment. AI safety via market making gets to make use of both the cross-examination with transparency tools approach—which is otherwise unique to debate—as well as the more general relaxed adversarial training approach. Thus, hopefully at least, it should be able to get the inner alignment benefits of both approaches. That being said, AI safety via market making also has to contend with ensuring that two different models—both $M$ and $Adv$ —are inner aligned, rather than just a single model as in something like debate or amplification.

First, let's consider the inner alignment of $M$ . For $M$ , we're doing pretty straightforward oversight of $M$ by $Amp (M)$ with the goal of enforcing per-step myopia. The basic idea here is to use $Amp (M)$ to detect and disincentivize non-myopia as it starts appearing, thus preventing $M$ from ever becoming deceptive or otherwise catastrophic. This is the same basic approach I've previously proposed using for amplification and should have similar properties, with the exception that $M$ is trained differently here, which could cause $Amp (M)$ to be relatively better or worse at oversight. Ideally, if the training distribution includes lots of oversight-relevant questions, it should be possible to make $M$ an effective assistant to $H$ in performing oversight.

Second, let's consider the inner alignment of $Adv$ . Compared to debate, we still get to do cross-examination with transparency tools, but now we also get to do per-step myopia verification with $Amp (M)$ as an overseer.^[3] Thus, there are two major differences here between AI safety via market making and AI safety via debate: first, we get an overseer— $Amp (M)$ —and second, we get to do per-step myopia verification rather than per-episode myopia verification. The fact that we have an independent overseer—rather than just relying on the debaters to oversee themselves via cross-examination—is very nice, though it of course also comes with the downside of having to safely train an additional model. Furthermore, it is quite possible to just add an additional overseer on top of the standard debate framework. Thus, in my opinion, the fact that we get to do per-step myopia verification is the largest and most significant inner alignment difference.

Before I talk about the importance of per-step myopia, it's worth noting that debate is fully compatible with per-episode myopia—in fact, it basically requires it. If a debater is not per-episode myopic, then it will try to maximize its reward across all debates, not just the single debate—the single episode—it's currently in. Such per-episode non-myopic agents can then become deceptively aligned, as they might choose to act deceptively during training in order to defect during deployment. Per-episode myopia, however, rules this out. Unfortunately, in my opinion, per-episode myopia seems like a very difficult condition to enforce—once your agents are running multi-step optimization algorithms, how do you tell whether that optimization passes through the episode boundary or not? Enforcing per-step myopia, on the other hand, just requires detecting the existence of multi-step optimization, rather than its extent, which seems considerably easier. Thus, since AI safety via market making is fully compatible with per-step myopia verification, it could be significantly easier to prevent the development of deceptive alignment.

Training competitiveness. It seems quite likely to me that both $M$ and $Adv$ can be trained competitively via language model fine-tuning, though exactly how effective such training would be is currently unclear. Ideally, training via this procedure should produce an $M$ which is relatively better than the original language model at predicting what a human will think after seeing relevant arguments and is thus more helpful than the original language model. Testing this hypothesis by actually performing experiments seems likely to be highly valuable in shedding light on the training competitiveness properties of AI safety via market making.

Performance competitiveness. Performance competitiveness here seems likely to depend on exactly how useful getting access to human reflective equilibria actually is. Similarly to AI safety via debate or amplification, AI safety via market making produces a question-answering system rather than a fully general agent. That being said, if the primary use cases for advanced AI are all highly cognitive language and decision-making tasks—e.g. helping CEOs or AI researchers—rather than, for example, fine motor control, then a question-answering system should be entirely sufficient. Furthermore, compared to AI safety via debate, AI safety via market making seems likely to be at least as performance competitive for the same reason as it seems likely to be at least as outer aligned—the equilibria found by AI safety via market making should include all potentially convincing arguments, including those that would be made in a two-player debate as well as those that wouldn't.

[1] This is actually the second debate-based proposal I've drafted up recently; the previous one was in “Synthesizing amplification and debate.” A potentially interesting future research direction could be to figure out how to properly combine the two.
[2] Note that pure prediction is not inherently myopic, since the truth of $M$'s predictions can depend on its own output, but it can be myopic while still producing good predictions if $M$ behaves like a counterfactual oracle rather than a Predict-O-Matic. Thus, myopia verification is important to enforce that $M$ be the former kind of predictor (a counterfactual oracle) and not the latter.
[3] The use of an overseer to do per-step myopia verification is also something that can be done with most forms of amplification, though AI safety via market making could potentially still have other benefits over such amplification approaches. In particular, AI safety via market making seems more competitive than imitative amplification and more outer aligned than approval-based amplification. For more detail on such amplification approaches, see “An overview of 11 proposals for building safe advanced AI.”

Nice idea! I like the simplicity of "find the equilibrium where the human no longer changes their mind" (though as Ofer points out below, you might worry that "doesn't change their mind" comes apart from "the answer is correct").

However, I disagree with you about competitiveness. Roughly speaking, at best $M$ is incentivized to predict what the human will think after reading the $t$ most relevant arguments, without trusting the source of the arguments (in reality, it will be a bit worse, as $Adv$ is finding not the most relevant arguments but the most persuasive arguments in a particular direction). However, with debate, if the human judge is looking at a transcript of length $t$, then (the hope is that) the equilibrium is for M to argue for the answer that a human would come to when inspecting a tree of size exponential in $t$. The key reason is that in debate, we only require the judge to be able to identify which of two arguments is better, whereas in market-making, we rely on the judge to be able to come to the right conclusion given some arguments.

In complexity theory analogy land, debate corresponds to PSPACE while market making corresponds to NP: as long as $Adv$ can find a polynomial-length witness, that can be verified by the human to get the right answer.

As a concrete example, suppose we want to find the sum of $N$ numbers, and each argument is only allowed to reference two numbers and make a claim about their sum. Debate can solve this with a transcript of size $O (log N)$ . Market-making would require an $O (N)$ transcript to solve this. (You can't use the trick of making claims about the sum of half of the list in market-making as you can in debate, because the human has no reason to trust Adv's claims about the sum of half the list, since the human can only verify the sum of two numbers.)
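To put rough numbers on that gap, here is a small sketch under the assumptions of this example: a debate descends one branch of a binary halving tree, while a judge who trusts only pairwise sums they can verify themselves needs every pairwise addition spelled out. The function names are illustrative only.

```python
def halving_depth(n):
    """Number of halvings needed to get from n numbers down to one,
    i.e. ceil(log2(n)), computed with integer arithmetic."""
    return (n - 1).bit_length()


def pairwise_additions(n):
    """Number of two-number additions needed to sum n numbers when
    every intermediate sum must be checked directly."""
    return max(n - 1, 0)
```

For a list of 1024 numbers this gives a depth of 10 for the halving tree versus 1023 individually checked additions.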

I think this means that market-making is less competitive. If you compare debate with transcripts of length $t$ against market-making with transcripts of length $t$, then I think market-making is less performance competitive. Alternatively, if you compare it against market-making with transcripts of length $\gg t$, then I think market-making is less training competitive.

That's a very good point. After thinking about this, however, I think market making actually does solve this problem, and I think it does so pretty cleanly. Specifically, I think market making can actually convince a judge of the sum of $N$ integers in $O (1)$ time as long as you allow the traders to exhibit market probabilities as part of their argument.

Consider the task of finding the sum of N integers and suppose that both $M$ and $Adv$ have access to all N integers, but that the human judge can only sum two numbers at a time. Then, I claim that there exists a strategy that the judge can implement that, for an unexploitable market, will always produce the desired sum immediately (and thus in $O (1)$ time).

Proof:

$H$ 's strategy here is to only listen to arguments of the following two forms:

Argument type 1:

The sum of $\{a_{1}\}$ is $a_{1}$ because the sum of a single-element set is the element of that set.

Argument type 2:

The sum of $\{a_{1}, \dots, a_{n}\}$ is $z$ because the modal prediction of $M (\text{“What is the sum of } \{a_{1}, \dots, a_{\lfloor n / 2 \rfloor}\} \text{?”})$ is $x$, the modal prediction of $M (\text{“What is the sum of } \{a_{\lfloor n / 2 \rfloor + 1}, \dots, a_{n}\} \text{?”})$ is $y$, and $x + y = z$.

Under that strategy, we'll prove that an unexploitable market will always give the right answer immediately by strong induction on the size of the set.

First, the base case. For any single-element set, only Argument type 1 exists. Thus, if $M$ predicts anything other than the actual $a_{1}$, $Adv$ can exploit that by implementing Argument type 1, and that is the only possible exploit available. Thus, $M$ should always give the right answer immediately for single-element sets.

Second, the inductive step. Suppose by strong induction that $M$ always gives the right answer immediately for all sets of size less than $n$. Now, for a set of size $n > 1$, the only type of argument available is Argument type 2. However, since the first half and second half of the set are of size less than $n$, we know by induction that $x$ and $y$ must be correct sums. Thus, since $H$ can check that $x + y = z$, the only exploit available to $Adv$ is to showcase the correct $z$, and if $M$ already showcases the correct $z$, then no exploit is possible. Thus, $M$ should always give the correct $z$ immediately for $n$-element sets.
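The judge's strategy in this proof can be sketched as follows. This is a toy model under stated assumptions: `market` is a hypothetical callable standing in for the modal prediction of $M$ on a sub-list, and `unexploitable_market` simply hard-codes the equilibrium behavior that the induction argues for.

```python
def judge_accepts(numbers, claimed_sum, market):
    """H's strategy: accept a claimed sum only via Argument type 1 or 2.

    numbers     : tuple of integers (a_1, ..., a_n), with n >= 1
    claimed_sum : the value z being argued for
    market      : callable giving M's modal prediction for a sub-tuple
    """
    n = len(numbers)
    if n == 1:
        # Argument type 1: the sum of a single-element set is its element.
        return claimed_sum == numbers[0]
    half = n // 2
    x = market(numbers[:half])   # M's answer for the first half
    y = market(numbers[half:])   # M's answer for the second half
    # Argument type 2: H only has to add two numbers.
    return x + y == claimed_sum


def unexploitable_market(numbers):
    """Stand-in for M at equilibrium on smaller sets: always correct."""
    return sum(numbers)
```

With `unexploitable_market`, the only value the judge accepts for `(3, 4, 5, 6)` is 18, so any other prediction by $M$ on the full list would be exploitable in a single step.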

EDIT: Thinking about this more, I think my argument generalizes to allow AI safety via market making to access R, which seems pretty exciting given that the best debate could do previously was NEXP.

Hmm, this seems to rely on having the human trust the outputs of $M$ on questions that the human can't verify. It's not obvious to me that this is an assumption you can make without breaking the training process. The basic intuition is that you are hugely increasing the likelihood of bad gradients, since Adv can point to some incorrect / garbage output from M, and the human gives feedback as though this output is correct.

It works in the particular case that you outlined because there is essentially a DAG of arguments -- every claim is broken down into "smaller" claims, that eventually reach a base case, and so everything eventually bottoms out in something the human can check. (In practice this will be built from the ground up during training, similarly as in Supervising strong learners by amplifying weak experts.)

However, in general it doesn't seem like you can guarantee that every argument that Adv gives will result in a "smaller" claim. You could get in cycles, where "8 - 5 = 2" would be justified by Adv saying that M("What is 2 + 5?") = 8, and similarly "2 + 5 = 8" would be justified by saying that M("What is 8 - 5?") = 2. (Imagine that these were much longer equations where the human can check the validity of the algebraic manipulation, but can't check the validity of the overall equation.)
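One way to picture the difference between the DAG case and the cyclic case is as a graph of which claims cite which other claims. The sketch below is purely illustrative: the `cites` mapping, the `checkable` set of claims the human can verify directly, and the function name are all hypothetical.

```python
def bottoms_out(claim, cites, checkable, _path=()):
    """Return True if every chain of justification starting at `claim`
    ends in a claim the human can check, with no circular citations.

    cites     : dict mapping a claim to the list of claims it relies on
    checkable : set of claims the human can verify directly
    """
    if claim in checkable:
        return True
    if claim in _path:
        return False                   # circular justification
    children = cites.get(claim, [])
    if not children:
        return False                   # unsupported and uncheckable
    return all(
        bottoms_out(child, cites, checkable, _path + (claim,))
        for child in children
    )
```

In the example above, "8 - 5 = 2" cites "2 + 5 = 8" and vice versa, so neither claim bottoms out, whereas a halving argument for a sum bottoms out as soon as its leaves are directly checkable.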

It might be that this is actually an unimportant problem, because in practice for every claim there are a huge number of ways to argue for the truth, and it's extraordinarily unlikely that all of them fail in the same way such that M would argue for the same wrong answer along all of these possible paths, and so eventually M would have to settle on the truth. I'm not sure, I'd be interested in empirical results here.

It occurs to me that the same problem can happen with iterated amplification, though it doesn't seem to be a problem with debate.

----

Also, echoing my other comment below, I'm not sure if this is an equilibrium in the general case where Adv can make many kinds of arguments that H pays attention to. Maybe once this equilibrium has been reached, Adv starts saying things like "I randomly sampled 2 of the 100 numbers, and they were 20 and 30, and so we should expect the sum to be 25 * 100 = 2500". (But actually 20 and 30 were some of the largest numbers and weren't randomly sampled; the true sum is ~1000.) If this causes the human to deviate even slightly from the previous equilibrium, Adv is incentivized to do it. While we could hope to avoid this in math / arithmetic, it seems hard to avoid this sort of thing in general.

For no pure equilibrium to exist, we just need that for every possible answer, there is something Adv can say that would cause the human to give some other answer (even if the original answer was the truth). This seems likely to be the case.

Hmm, this seems to rely on having the human trust the outputs of $M$ on questions that the human can't verify. It's not obvious to me that this is an assumption you can make without breaking the training process. The basic intuition is that you are hugely increasing the likelihood of bad gradients, since $Adv$ can point to some incorrect / garbage output from $M$, and the human gives feedback as though this output is correct.

It works in the particular case that you outlined because there is essentially a DAG of arguments -- every claim is broken down into "smaller" claims, that eventually reach a base case, and so everything eventually bottoms out in something the human can check. (In practice this will be built from the ground up during training, similarly as in Supervising strong learners by amplifying weak experts.)

However, in general it doesn't seem like you can guarantee that every argument that $Adv$ gives will result in a "smaller" claim. You could get in cycles, where "8 - 5 = 2" would be justified by $Adv$ saying that M("What is 2 + 5?") = 8, and similarly "2 + 5 = 8" would be justified by saying that M("What is 8 - 5?") = 2. (Imagine that these were much longer equations where the human can check the validity of the algebraic manipulation, but can't check the validity of the overall equation.)

(Quoting later on in this comment thread:)

(Evan:)

Yes, though I believe that it should be possible (at least in theory) for H to ensure a DAG for any computable claim.

(You:)

I mean, sure, but H isn't going to be able to do this in practice. (This feels like the same type of claim as "it should be possible (at least in theory) for H to provide a perfect reward that captures everything that H wants".)

The human just has to be more convinced by the inductive argument than by other arguments. This seems natural, as the inductive argument is just a forward calculation.

In the number-summing example, let's say $Adv$ tries to convince the human of an incorrect sum by referencing an instance where $M$ is incorrect, perhaps making an argument via subtraction as you illustrated. Then in the next round, $Adv$ will want to show that its previous argument was incorrect. If the strong inductive assumption is true, then it can do so, e.g. by "The last number in the list is 12. $M$ thinks that the sum of all but the last number is 143. 143+12=155. Therefore, the sum of the numbers is 155." This is more straightforward than citing some longer list of numbers and subtracting, so the human should find it more convincing -- especially if the human understands how the system works, and hence, knows that a partially trained M is more likely to be correct on simpler instances. If so, then during training, correctness will tend to "creep up" inductive trees.

This idea does seem much less natural in less computational settings, where there may not be an obvious notion of "simpler cases".

This idea does seem much less natural in less computational settings, where there may not be an obvious notion of "simpler cases".

Yes, my main claim is that in general it won't be clear what the "simpler case" is. I agree that for simple algorithmic problems (e.g. the ones in Supervising strong learners by amplifying weak experts) you could probably rely on the DAG assumption. I probably should have given a non-arithmetic example to make that clearer.

What do you think about a similar DAG assumption in regular debate? Couldn't debate agents similarly justify their assertions with claims that don't descend a DAG that bottoms out in things the human can check? I don't currently see how a debater who did this could be defeated by another debater.

I'm pretty unsure, having barely thought about it, but currently I lean towards it being okay -- the main difference is that in debate you show an entire path down the argument tree, so if a false statement is justified by a cycle / circular argument, the other debater can point that out.

If the length of the cycle is longer than the debate transcript, then this doesn't work, but one hopes for some combination of a) this leads to a stalemate against honesty, rather than a win for the circular debater (since neither can refute the other), b) most questions that we care about can be resolved by a relatively short debate (the point of the PSPACE analogy), and c) such a strategy would lose against a debater who says early on "this debate can't be decided in the time allotted".

I think for debate you can fix the circular argument problem by requiring debaters to 'pay' (sacrifice some score) to recurse on a statement of their choice. If a debater repeatedly pays to recurse on things that don't resolve before the depth limit, then they'll lose.

Hmm, I was imagining that the honest player would have to recurse on the statements in order to exhibit the circular argument, so it seems to me like this would penalize the honest player rather than the circular player. Can you explain what the honest player would do against the circular player such that this "payment" disadvantages the circular player?

EDIT: Maybe you meant the case where the circular argument is too long to exhibit within the debate, but I think I still don't see how this helps.

Ah, yeah. I think the key thing is that by default a claim is not trusted unless the debaters agree on it.
If the dishonest debater disputes some honest claim, where honest has an argument for their answer that actually bottoms out, dishonest will lose - the honest debater will pay to recurse until they get to a winning node.
If the dishonest debater makes some claim and plans to make a circular argument for it, the honest debater will give an alternative answer but not pay to recurse. If the dishonest debater doesn't pay to recurse, the judge will just see these two alternative answers and won't trust the dishonest answer. If the dishonest debater does pay to recurse but never actually gets to a winning node, they will lose.
Does that make sense?

If the dishonest debater disputes some honest claim, where honest has an argument for their answer that actually bottoms out, dishonest will lose - the honest debater will pay to recurse until they get to a winning node.

This part makes sense.

If the dishonest debater makes some claim and plans to make a circular argument for it, the honest debater will give an alternative answer but not pay to recurse. If the dishonest debater doesn't pay to recurse, the judge will just see these two alternative answers and won't trust the dishonest answer.

So in this case it's a stalemate, presumably? If the two players disagree but neither pays to recurse, how should the judge make a decision?

Both debaters make claims. Any claims that are only supported by circular arguments will be ignored. If an honest claim that's supported by a good argument is disputed, the honest debater will pay to recurse, and will give their good argument.

This was a very interesting comment (along with its grandparent comment), thanks -- it seems like a promising direction.

However, I'm still confused about whether this would work. It's very different from the judging procedure outlined here; why is that? Do you have a similarly detailed write-up of the system you're describing here?

I'm actually less concerned about loops and more concerned about arguments which are infinite trees, but the considerations are similar. It seems possible that the proposal you're discussing very significantly addresses concerns I've had about debate.

I was trying to describe something that's the same as the judging procedure in that doc! I might have made a mistake, but I'm pretty sure the key piece about recursion payments is the same. Apologies that things are unclear. I'm happy to try to clarify, if there were particular aspects that seem different to you.

Yeah, I think the infinite tree case should work just the same - ie an answer that's only supported by an infinite tree will behave like an answer that's not supported (it will lose to an answer with a finite tree and draw with an answer with no support)

It seems possible that the proposal you're discussing very significantly addresses concerns I've had about debate.

That's exciting!

Ok. I don't see why these considerations make you optimistic rather than pessimistic, but then, I'm currently having more basic problems with debate which seem to be making me pessimistic about most claims about debate.

I think the consideration "you can point out sufficiently short circular arguments" should at least make you feel better about debate than iterated amplification or market making -- it's one additional way in which you can avoid circular arguments, and afaict there isn't a positive consideration for iterated amplification / market making that doesn't also apply to debate.

I don't have a stable position about how optimistic we should be on some absolute scale.

I think the consideration "you can point out sufficiently short circular arguments" should at least make you feel better about debate than iterated amplification or market making -- it's one additional way in which you can avoid circular arguments, and afaict there isn't a positive consideration for iterated amplification / market making that doesn't also apply to debate.

My interpretation of the situation is this breaks the link between factored cognition and debate. One way to try to judge debate as an amplification proposal would have been to establish a link to HCH, by establishing that if there's an HCH tree computing some answer, then debate can use the tree as an argument tree, with the reasons for any given claim being the children in the HCH tree. Such a link would transfer any trust we have in HCH to trust in debate. The use of non-DAG arguments by clever debaters would seem to break this link.

OTOH, IDA may still have a strong story connecting it to HCH. Again, if we trusted HCH, we would then transfer that trust to IDA.

Are you saying that we can break the link between IDA and HCH in a very similar way, but which is worse due to having no means to reject very brief circular arguments?

I think the issue is that vanilla HCH itself is susceptible to brief circular arguments, if humans lower down in the tree don't get access to the context from humans higher up in the tree. E.g. assume a chain of humans for now:

H1 gets the question "what is 100 + 100?" with budget 3

H1 asks H2 "what is 2 * 100?" with budget 2

H2 asks H3 "what is 100 + 100?" with budget 1

H3 says "150"

(Note the final answer stays the same as budget -> infinity, as long as H continues "decomposing" the question the same way.)

If HCH can always decompose questions into "smaller" parts (the DAG assumption) then this sort of pathological behavior doesn't happen.
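The chain above can be mimicked with a toy recursion. This is a sketch under simplifying assumptions: each human only rephrases the question as the other equivalent one, passes the sub-answer back up unchanged, and the last human in the chain guesses "150". The function name and the rephrasing table are illustrative only.

```python
def hch_chain(question: str, budget: int) -> str:
    """Toy chain of humans that 'decompose' a question into an equivalent one."""
    if budget <= 1:
        return "150"  # the last human has no budget left and guesses (wrongly)
    # Circular "decomposition": each question is turned into the other one.
    rephrased = {
        "what is 100 + 100?": "what is 2 * 100?",
        "what is 2 * 100?": "what is 100 + 100?",
    }[question]
    # No context from higher up is passed down, so the loop goes unnoticed.
    return hch_chain(rephrased, budget - 1)

for budget in (3, 10, 101):
    print(budget, hch_chain("what is 100 + 100?", budget))
```

Whatever the budget, the answer returned at the top is the guess made at the bottom, matching the note that the final answer does not change as the budget grows.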

afaict there isn't a positive consideration for iterated amplification / market making that doesn't also apply to debate

For amplification, I would say that the fact that it has a known equilibrium (HCH) is a positive consideration that doesn't apply to debate. For market making, I think that the fact that it gets to be per-step myopic is a positive consideration that doesn't apply to debate. There are others too for both, though those are probably my biggest concerns in each case.

Tbc, I'm specifically talking about:

What do you think about a similar DAG assumption in regular debate?

So I'm only evaluating whether or not I expect circular arguments to be an issue for these proposals. I agree that when evaluating the proposals on all merits there are arguments for the others that don't apply to debate.

Ah, I see—makes sense.

Hmm, this seems to rely on having the human trust the outputs of $M$ on questions that the human can't verify. It's not obvious to me that this is an assumption you can make without breaking the training process.

One possible way to train this is just to recurse on sub-questions some percentage of the time (potentially based on some active learning metric for how useful that recursion will be).

It works in the particular case that you outlined because there is essentially a DAG of arguments -- every claim is broken down into "smaller" claims, that eventually reach a base case, and so everything eventually bottoms out in something the human can check.

Yes, though I believe that it should be possible (at least in theory) for $H$ to ensure a DAG for any computable claim.

It might be that this is actually an unimportant problem, because in practice for every claim there are a huge number of ways to argue for the truth, and it's extraordinarily unlikely that all of them fail in the same way such that M would argue for the same wrong answer along all of these possible paths, and so eventually M would have to settle on the truth. I'm not sure, I'd be interested in empirical results here.

Agreed—I'd definitely be interested in results here as well.

It occurs to me that the same problem can happen with iterated amplification, though it doesn't seem to be a problem with debate.

Definitely the problem of requiring the human to decompose problems into actually smaller subproblems exists in amplification also. Without that requirement, HCH can have multiple fixed points rather than just the one, which could potentially give whatever mechanism ends up selecting which fixed point quite a lot of power over the final result.

Also, echoing my other comment below, I'm not sure if this is an equilibrium in the general case where Adv can make many kinds of arguments that H pays attention to. Maybe once this equilibrium has been reached, Adv starts saying things like "I randomly sampled 2 of the 200 numbers, and they were 20 and 30, and so we should expect the sum to be 25 * 200 = 5000". (But actually 20 and 30 were some of the largest numbers and weren't randomly sampled; the true sum is ~1000.) If this causes the human to deviate even slightly from the previous equilibrium, Adv is incentivized to do it. While we could hope to avoid this in math / arithmetic, it seems hard to avoid this sort of thing in general.

Note that even if the human is temporarily convinced by such an argument, as long as there is another argument which de-convinces them then in the limit $Adv$ won't be incentivized to produce that argument. And it seems likely that there should exist de-convincing arguments there—for example, the argument that $H$ should follow the strategy that I outlined above if they want to ensure that they get the correct answer. Additionally, we might hope that this sort of “bad faith” argument can also be prevented via the cross-examination mechanism I describe above.

Yes, though I believe that it should be possible (at least in theory) for H to ensure a DAG for any computable claim.

I mean, sure, but H isn't going to be able to do this in practice. (This feels like the same type of claim as "it should be possible (at least in theory) for H to provide a perfect reward that captures everything that H wants".)

I do think it's plausible that in practice there will always be some other argument that "breaks the cycle".

Note that even if the human is temporarily convinced by such an argument, as long as there is another argument which de-convinces them then in the limit Adv won't be incentivized to produce that argument.

Only if all the arguments can be specified within the length $t$ transcript (leading back to my original point about this being like NP instead of PSPACE).

For example, consider the case where $N$ coins are flipped, and we need to find out how many came up heads. H can only look at a single coin flip at a time, while M and Adv can look at everything. Let's assume for now that H answers the final question by reporting "number of known heads + (number of unknown flips / 2)", which is the expected number of heads assuming that any decisions to reveal a coin chose the coin to reveal at random.

One strategy that Adv can take is to always reveal some randomly chosen coin (that hasn't previously been revealed). Let's assume that M is a best response to this strategy, and also assume $N \gg t$ . Then every argument that Adv gives always causes a swing of 0.5 in H's answer at that point. (It causes a slightly smaller swing in M's answer, but as $N \to \infty$ , it also causes a swing of 0.5 in M's answer.) So as N goes to infinity, under this strategy Adv always gets at least 0.5 reward at each step.
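The swing in H's answer can be checked with a small simulation. This sketch models only H's reporting rule as stated above ("known heads + unknown flips / 2") and Adv's random-reveal strategy. It does not model M or any training, and the function names, seed, and sizes are arbitrary choices for illustration.

```python
import random
from typing import List

def h_estimate(revealed: List[int], n: int) -> float:
    """H's rule from the comment: known heads + (number of unknown flips / 2)."""
    return sum(revealed) + (n - len(revealed)) / 2

def swings(coins: List[int], t: int, seed: int = 0) -> List[float]:
    """Absolute change in H's estimate after each of t random reveals."""
    rng = random.Random(seed)
    order = rng.sample(range(len(coins)), t)  # distinct coins, none revealed twice
    revealed: List[int] = []
    changes = []
    for i in order:
        before = h_estimate(revealed, len(coins))
        revealed.append(coins[i])  # 1 = heads, 0 = tails
        changes.append(abs(h_estimate(revealed, len(coins)) - before))
    return changes

coin_rng = random.Random(1)
coins = [coin_rng.randint(0, 1) for _ in range(1000)]  # N = 1000, t = 20
print(swings(coins, t=20))
```

Each reveal replaces an expected contribution of 1/2 with either 0 or 1, so H's estimate moves by exactly 0.5 every time, regardless of which coin is shown.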

So in this case, it seems like you at least wouldn't have the equilibrium where M immediately converges to the true answer, since that doesn't give Adv any reward ever.

(If $N \leq t$ , then I'm not sure what happens, but I wouldn't be surprised if it still didn't converge to truth-seeking, and Adv would learn to provide no information with some probability, in order to prevent convergence to the equilibrium where M immediately reports the correct answer which leads to Adv getting zero reward.)

(There is also a strategy that H could employ such that this problem is also solved, but it doesn't seem reasonable to assume that H would be able to discover such strategies for all questions we might want to ask.)

Part of the reason this is hard to analyze is because there isn't a well-posed problem that Adv is being trained to optimize -- it is not a POMDP, nor is it a two player game, because you want Adv to be trained myopically, but Adv's actions influence the state that it sees next time.

Only if all the arguments can be specified within the length $t$ transcript (leading back to my original point about this being like NP instead of PSPACE).

Not necessarily— $Adv$ can make an argument like: “Since the modal prediction of $M(\text{“Should H be convinced by argument X?”})$ is no, you shouldn't trust argument $X$.”

Adv would learn to provide no information with some probability, in order to prevent convergence to the equilibrium where M immediately reports the correct answer which leads to Adv getting zero reward

That strategy is highly non-myopic. Certainly market making breaks if you get a non-myopic $Adv$ like that, though as I note in the post I think basically every current major prosaic AI safety proposal requires some level of myopia to not break (either per-step or per-episode).

Not necessarily— Adv can make an argument like

Oh sure, if we're assuming the version where you can reference M and the human trusts it and we assume that the training doesn't break, then I think in the limit of capabilities you can solve any computable problem.

That strategy is highly non-myopic.

I don't think so?

My understanding of your claim is that at convergence, Adv reports all the coin flips, and M reports the true answer from the beginning and never changes what it reports regardless of what Adv says. In this case, Adv gets 0 reward.

If Adv instead reported a random coin with p% probability and reported nothing otherwise, and M was a best response to that, then at every timestep Adv would get non-zero expected reward, and so even myopically that is a better strategy for Adv (again under the assumption that M is a best response to Adv).

I'm tempted to say "the situation I described first where Adv reports all the coin flips is not a Nash equilibrium", though that's not exactly correct, because this isn't a game, but it conveys the right intuition.

If Adv instead reported a random coin with p% probability and reported nothing otherwise, and M was a best response to that, then at every timestep Adv would get non-zero expected reward, and so even myopically that is a better strategy for Adv (again under the assumption that M is a best response to Adv).

Ah—I see the issue here. I think that the version of myopia that you're describing is insufficient for most applications where I think you might need myopia in an ML system. What I mean by myopia in this context is to take the action which is best according to the given myopic objective conditioned on $M$ . Once $Adv$ starts including acausal effects into its action selection (such as the impact of its current policy on $M$ 's past policy), I want to call that non-myopic. Notably, the reason for this isn't isolated to AI safety via market making—a myopic agent which is including acausal considerations can still be deceptive, whereas a fully causal myopic agent can't. Another way of putting this is that what I mean by myopia is specifically something like CDT with a myopic objective, whereas what you're thinking about is more like EDT or UDT with a myopic objective.

But then how do you train the system?

Well, first you need to make sure your training procedure isn't introducing any incentives that would push you away from getting that sort of myopia. Myopic RL with an actually myopic training procedure like a policy gradient algorithm is a good start. But obviously that doesn't actually guarantee you get what I want—it just means that there aren't incentives pushing against it. To actually get any guarantees you'll need to add some additional constraint to the training procedure that actually incentivizes the sort of myopia that I want. Here I proposed using a combination of relaxed adversarial training and cross-examination with transparency tools, though obviously whether or not something like that would actually work is still pretty unknown.

Well, first you need to make sure your training procedure isn't introducing any incentives that would push you away from getting that sort of myopia. Myopic RL with an actually myopic training procedure like a policy gradient algorithm is a good start.

Tbc, I'm claiming that this is the part that breaks. One way to operationalize this: in the coin flip example above, does this training scheme converge to "M reports the truth" in the limit of infinite data, model capacity, exploration etc.? I would guess that that isn't true. (In comparison, I think you can prove that self-play converges to the Nash equilibrium for debate since it is a zero-sum game, and since there are no cycles in the coin flip example I'd expect you could prove that imitative iterated amplification converges to the truth as well.)

At some point I might write up some simple code to implement the coin flip experiment with your training scheme and see what happens.

Suppose by strong induction that $M$ always gives the right answer immediately for all sets of size less than $n$

Pretty sure debate can also access R if you make this strong of an assumption - ie assume that debaters give correct answers for all questions that can be answered with a debate tree of size <n.

I think the sort of claim that's actually useful is going to look more like 'we can guarantee that we'll get a reasonable training signal for problems in [some class]'

Ie, suppose M gives correct answers some fraction of the time. Are these answers going to get lower loss? As n gets large, the chance that M has made a mistake somewhere in the recursion chain gets large, and the correct answer is not necessarily rewarded.

Pretty sure debate can also access R if you make this strong of an assumption - ie assume that debaters give correct answers for all questions that can be answered with a debate tree of size <n.

First, my full exploration of what's going on with different alignment proposals and complexity classes can be found here, so I'd recommend just checking that out rather than relying on the mini proof sketch I gave here.

Second, in terms of directly addressing what you're saying, I tried doing a proof by induction to get debate to RE and it doesn't work. The problem is that you can only get guarantees for trees that the human can judge, which means they have to be polynomial in length (though if you relax that assumption then you might be able to do better). Also, it's worth noting that the text that you're quoting isn't actually an assumption of the proof in any way—it's just the inductive hypothesis in a proof by induction.

I think the sort of claim that's actually useful is going to look more like 'we can guarantee that we'll get a reasonable training signal for problems in [some class]'

I think that is the same as what I'm proving, at least if you allow for “training signal” to mean “training signal in the limit of training on arbitrarily large amounts of data.” See my full post on complexity proofs for more detail on the setup I'm using.

Oh, another worry: there may not be a stable equilibrium to converge to -- every time $M$ approximates the final result well, $Adv$ may be incentivized to switch to making different arguments to make $M$ 's predictions wrong. (Or rather, maybe the stable equilibrium has to be a mixture over policies for this reason, and so you only get the true answer with some probability.)

Thus, we can use such a market to estimate a sort of reflective equilibrium for what H will end up believing about Q.

What do you hope or expect to happen if M is given a question that would take H much longer to reach reflective equilibrium than anything in its training set? An analogy I've been thinking about recently is, what if you asked a random (educated) person in 1690 the question "Is liberal democracy a good idea?" Humanity has been thinking about this topic for hundreds of years and we're still very confused about it (i.e., far from having reached reflective equilibrium) because, to take a couple of examples out of many, we don't fully understand the game theory behind whether it's rational or not to vote, or what exactly prevents bad memes from spreading wildly under a free speech regime and causing havoc. (Here's an example of how the Enlightenment philosophers actually convinced people of their ideas at the time.) So if in the future we ask M a question that's as difficult for H to think about as this question was for the 1690 person, what would happen? Do you have any intuitions about what M will be doing "under the hood" that you can share to help me understand how M will work (or at least how you're thinking or hoping it will work)?

Thinking about this more, I guess it would depend on the exact stopping condition in the training process? If during training, we always go to step 5 after a fixed number of rounds, then M will give a prediction of H's final estimate of the given question after that number of rounds, which may be essentially random (i.e., depends on H's background beliefs, knowledge, and psychology) if H is still far from reflective equilibrium at that point. This would be less bad if H could stay reasonably uncertain (not give an estimate too close to 0 or 1) prior to reaching reflective equilibrium, but that seems hard for most humans to do.

What would happen if we instead use convergence as the stopping condition (and throw out any questions that take more than some fixed or random threshold to converge)? Can we hope that M would be able to extrapolate what we want it to do, and predict H's reflective equilibrium even for questions that take longer to converge than what it was trained on?

What would happen if we instead use convergence as the stopping condition (and throw out any questions that take more than some fixed or random threshold to converge)? Can we hope that M would be able to extrapolate what we want it to do, and predict H's reflective equilibrium even for questions that take longer to converge than what it was trained on?

This is definitely the stopping condition that I'm imagining. What the model would actually do, though, if you, at deployment time, give it a question that takes the human longer to converge on than any question it ever saw in training isn't a question I can really answer, since it's a question that's dependent on a bunch of empirical facts about neural networks that we don't really know.

The closest we can probably get to answering these sorts of generalization questions now is just to liken the neural net prior to a simplicity prior, ask what the simplest model is that would fit the given training data, and then see if we can reason about what the simplest model's generalization behavior would be (e.g. the same sort of reasoning as in this post). Unfortunately, I think that sort of analysis generally suggests that most of these sorts of training setups would end up giving you a deceptive model, or at least not the intended model.

That being said, in practice, even if in theory you think you get the wrong thing, you might still be able to avoid that outcome if you do something like relaxed adversarial training to steer the training process in the desired direction via an overseer checking the model using transparency tools while you're training it.

Regardless, the point of this post, and AI safety via market making in general, though, isn't that I think I have a solution to these sorts of inner-alignment-style tricky generalization problems—rather, it's that I think AI safety via market making is a good/interesting outer-alignment-style target to push for, and that I think AI safety via market making also has some nice properties (e.g. compatibility with per-step myopia) that potentially make it easier to do inner alignment for (but still quite difficult, as with all other proposals that I know of).

Now, if we just want to evaluate AI safety via market making's outer alignment, we can just suppose that somehow we do get a model that just produces the answer that H would at convergence, and ask whether that answer is good. And even then I'm not sure—I think that there's still the potential for debate-style bad equilibria where some bad/incorrect arguments are just more convincing to the human than any good/correct argument, even after the human is exposed to all possible counterarguments. I do think that the market-making equilibrium is probably better than the debate equilibrium, since it isn't limited to just two sides, but I don't believe that very strongly.

Mostly, for me, the point of AI safety via market making is just that it's another way to get a similar sort of result as AI safety via debate, but that it allows you to do it via a mechanism that's more compatible with myopia.

Thanks for this very clear explanation of your thinking. A couple of followups if you don't mind.

Unfortunately, I think that sort of analysis generally suggests that most of these sorts of training setups would end up giving you a deceptive model, or at least not the intended model.

Suppose the intended model is to predict H's estimate at convergence, and the actual model is predicting H's estimate at round N for some fixed N larger than any convergence time in the training set. Would you call this an "inner alignment failure", an "outer alignment failure", or something else (not an alignment failure)?

Putting these theoretical/conceptual questions aside, the reason I started thinking about this is from considering the following scenario. Suppose some humans are faced with a time-sensitive and highly consequential decision, for example, whether to join or support some proposed AI-based governance system (analogous to the 1690 "liberal democracy" question), or a hostile superintelligence is trying to extort all or most of their resources and they have to decide how to respond. It seems that convergence on such questions might take orders of magnitude more time than what M was trained on. What do you think would actually happen if the humans asked their AI advisor to help with a decision like this? (What are some outcomes you think are plausible?)

What's your general thinking about this kind of AI risk (i.e., where an astronomical amount of potential value is lost because human-AI systems fail to make the right decisions in high-stakes situations that are eventuated by the advent of transformative AI)? Is this something you worry about as an alignment researcher, or do you (for example) think it's orthogonal to alignment and should be studied in another branch of AI safety / AI risk?

Suppose the intended model is to predict H's estimate at convergence, and the actual model is predicting H's estimate at round N for some fixed N larger than any convergence time in the training set. Would you call this an "inner alignment failure", an "outer alignment failure", or something else (not an alignment failure)?

I would call that an inner alignment failure, since the model isn't optimizing for the actual loss function, but I agree that the distinction is murky. (I'm currently working on a new framework that I really wish I could reference here but isn't quite ready to be public yet.)

It seems that convergence on such questions might take orders of magnitude more time than what M was trained on. What do you think would actually happen if the humans asked their AI advisor to help with a decision like this? (What are some outcomes you think are plausible?)

That's a hard question to answer, and it really depends on how optimistic you are about generalization. If you just used current methods but scaled up, my guess is you would get deception and it would try to trick you. If we condition on it not being deceptive, I'd guess it was pursuing some weird proxies rather than actually trying to report the human equilibrium after any number of steps. If we condition on it actually trying to report the human equilibrium after some number of steps, though, my guess is that the simplest way to do that isn't to have some finite cutoff, so I'd guess it'd do something like an expectation over exponentially distributed steps or something.

What's your general thinking about this kind of AI risk (i.e., where an astronomical amount of potential value is lost because human-AI systems fail to make the right decisions in high-stakes situations that are eventuated by the advent of transformative AI)? Is this something you worry about as an alignment researcher, or do you (for example) think it's orthogonal to alignment and should be studied in another branch of AI safety / AI risk?

Definitely seems worth thinking about and taking seriously. Some thoughts:

Ideally, I'd like to just avoid making any decisions that lead to lock-in while we're still figuring things out (e.g. wait to build anything like a sovereign for a long time). Of course, that might not be possible/realistic/etc.
Hopefully, this problem will just be solved as AI systems become more capable—e.g. if you have a way of turning any unaligned benchmark system into a new system that honestly/helpfully reports everything that the unaligned benchmark knows, then as the unaligned benchmark gets better, you should get better at making decisions with the honest/helpful system.

Planned summary for the Alignment Newsletter:

If you have an expert, but don’t trust them to give you truthful information, how can you incentivize them to tell you the truth anyway? One option is to pay them every time they provide evidence that changes your mind, with the hope that only once you believe the truth will there be no evidence that can change your mind. This post proposes a similar scheme for AI alignment.

We train two models, M and Adv. Given a question Q, M is trained to predict what answer to Q the human will give at the end of the procedure. Adv on the other hand is trained to produce arguments that will most make M “change its mind”, i.e. output a substantially different distribution over answers than it previously outputted. M can then make a new prediction. This is repeated T times, and eventually the human is given all T outputs produced by Adv, and provides their final answer (which is used to provide a gradient signal for M). After training, we throw away Adv and simply use M as our question-answering system.

One way to think about this is that M is trained to provide a prediction market on “what the human will answer”, and Adv is trained to manipulate the market by providing new arguments that would change what the human says. So, once you see M providing a stable result, that should mean that the result is robust to any argument that Adv could provide, and so it is what the human would say after seeing all the arguments.
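As a rough illustration of the loop summarized above, here is a minimal Python sketch. It is not the post's implementation: `run_episode`, the `toy_*` functions, and the `ARGUMENTS` table are all hypothetical stand-ins, no training happens, and the toy M is simply assumed to already predict the toy human's belief on the current transcript. The sketch only shows the control flow (M predicts, Adv appends the argument that most moves M's prediction, the human answers at the end).

```python
import math
from typing import Callable, List, Tuple

Predictor = Callable[[str], float]


def run_episode(
    question: str,
    predict: Predictor,                        # stand-in for M
    argue: Callable[[str, Predictor], str],    # stand-in for Adv (may query M)
    human: Predictor,                          # stand-in for H
    num_steps: int,
) -> Tuple[List[float], float, str]:
    """Run one episode: M predicts, Adv appends an argument, repeat; H answers last."""
    transcript = question
    predictions: List[float] = []
    for _ in range(num_steps):
        predictions.append(predict(transcript))
        transcript += "\n" + argue(transcript, predict)
    return predictions, human(transcript), transcript


# Hypothetical toy world: each argument shifts the human's log-odds by a fixed weight.
ARGUMENTS = {"pro-strong": 2.0, "con-weak": -0.5, "pro-weak": 0.5}


def toy_human(transcript: str) -> float:
    """Toy H: logistic function of the summed weights of the arguments seen."""
    seen = transcript.split("\n")
    score = sum(weight for name, weight in ARGUMENTS.items() if name in seen)
    return 1.0 / (1.0 + math.exp(-score))


def toy_predict(transcript: str) -> float:
    """Toy M (assumption): perfectly predicts H's belief on the current transcript."""
    return toy_human(transcript)


def toy_argue(transcript: str, predict: Predictor) -> str:
    """Toy Adv: pick the unused argument that changes M's prediction the most."""
    seen = set(transcript.split("\n"))
    unused = [name for name in ARGUMENTS if name not in seen]
    if not unused:
        return "(no further argument)"
    base = predict(transcript)
    return max(unused, key=lambda name: abs(predict(transcript + "\n" + name) - base))
```

Calling `run_episode("Is the claim true?", toy_predict, toy_argue, toy_human, num_steps=4)` makes the toy Adv spend its strongest argument first; once no argument is left that moves the prediction, M's last prediction coincides with the human's final answer, which is the stability condition described above.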

This scheme bears some resemblance to <@debate@>(@AI safety via debate@), and it can benefit from schemes that help debate, most notably <@cross-examination@>(@Writeup: Progress on AI Safety via Debate@). In particular, at every step Adv can cross-examine the previous incarnation of Adv. If the previous incarnation was deceptive, the current incarnation can demonstrate this to the human, which should cause them to disregard the previous argument. We can also add oversight, where an overseer with access to the model ensures that the model does not become non-myopic or deceptive.

Planned opinion (may change with more discussion above):

I like the simplicity of the idea "find the point at which the human no longer changes their mind", and like that this is a new idea of how we can scale training of AI systems beyond human-level performance. However, I’m not convinced that the training procedure given in this post would end up at this equilibrium, unless the human very specifically guided the training to do so (an assumption I don’t think we can usually make). It seems that if we were to reach the state where M stably reported the true answer to the question, then Adv would never get any reward -- but Adv could do better by randomizing what arguments it makes, so that M cannot know which arguments H will be exposed to and so can’t stably predict H’s final answer. See more details in this thread.

Interesting idea.

Suppose that in the first time step $Adv$ is able to output a string $x_{1}$ that will manipulate $H$ into: (1) giving a probability that is maximally different than $p_{1}$ ; and (2) not looking at the rest of $Q_{t}$ (i.e. the human will never see $x_{2}$ , $x_{3}$ , ...).

Ignoring inner alignment problems, in the limit it seems plausible that $Adv$ will output such an $x_{1}$ ; resulting in $p_{2} = p_{3} = \dots = p_{t} = p^{*}$ , and the smallest possible $L_{Adv, 1}$ given $p_{1}$ .

[EDIT: actually, such problems are not specific to this idea and seem to generally apply to the 'AI safety via debate' approach.]
