## AI Alignment Forum

Vojtech Kovarik

Background in mathematics (descriptive set theory, Banach spaces) and game theory (mostly zero-sum, imperfect-information games). CFAR mentor. Usually doing alignment research.

# Posts


Risk Map of AI Systems
As a concrete tool for discussing AI risk scenarios, I wonder if it doesn't have too many parameters? Like you have to specify the map (at least locally), how far we can see, what will the impact be of this and that research...

I agree that the model does have quite a few parameters. I think you can get some value out of it already by being aware of what the different parameters are and, in case of a disagreement, identifying the one you disagree about the most.

> If we currently have some AI system x, we can ask which systems are reachable from x --- i.e., which other systems are we capable of constructing now that we have x.
What are the constraints on "construction"? Because by definition, if those other systems are algorithms/programs, we can build them. It might be easier or faster to build them using x, but it doesn't make construction possible where it was impossible before. I guess what you're aiming at is something like "x reaches y" as a form of "x is stronger than y", and also "if y is stronger, it should be considered when thinking about x"?

I agree that if we can build x, and then build y (with the help of x), then we can in particular build y. So having/not having x doesn't make a difference on what is possible. I implicitly assumed some kind of constraint on the construction like "what can we build in half a year", "what can we build for \$1M" or, more abstractly, "how far can we get using X units of effort".
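This effort-bounded notion of reachability can be made concrete as shortest-path search over a graph of systems, where edges carry construction costs. A minimal sketch (the graph, function, and cost units are my illustration, not from the post):

```python
import heapq

def reachable(graph, start, budget):
    """Return the set of systems reachable from `start` within `budget`
    total units of effort. `graph` maps each system to a dict of
    {neighbour: effort_cost}; cheapest cumulative costs are found with
    Dijkstra's algorithm."""
    best = {start: 0}
    queue = [(0, start)]
    while queue:
        cost, node = heapq.heappop(queue)
        if cost > best.get(node, float("inf")):
            continue  # stale queue entry
        for nxt, step in graph.get(node, {}).items():
            new_cost = cost + step
            if new_cost <= budget and new_cost < best.get(nxt, float("inf")):
                best[nxt] = new_cost
                heapq.heappush(queue, (new_cost, nxt))
    return set(best)

# Toy example: y is reachable from x within 5 units of effort, z is not.
graph = {"x": {"y": 3, "z": 10}, "y": {"z": 4}}
print(reachable(graph, "x", 5))  # {'x', 'y'}
```

Raising the budget to 7 also brings z into reach (via y at cost 3 + 4), which matches the point above: having built x changes not what is possible in principle, but what is reachable for a given amount of effort.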

[About the colour map:] This part confused me. Are you just saying that to have a colour map (a map from the space of AIs to $\mathbb{R}$), you need to integrate all systems into one, or consider all systems but one as irrelevant?

The goal was more to say that in general, it seems hard to reason about the effects of individual AI systems. But if we make some of the extra assumptions, it will make more sense to treat harmfulness as a function from AIs to $\mathbb{R}$.

Values Form a Shifting Landscape (and why you might care)

Thank you for the comment. As for the axes, the y-axis always denotes the desirability of the given value-system (except for Figure 1). And you are exactly right with the x-axis --- that is a multidimensional space of value-systems that we put on a line, because drawing this in 3D (well, (multi+1)-D :-) ) would be a mess. I will see if I can make it somewhat clearer in the post.

AI Problems Shared by Non-AI Systems
Much of the concern about AI systems arises when they lack support for these kinds of interventions, whether it be because they are too fast, too complex, or can outsmart the would-be intervening human trying to correct what they see as an error.

All of these possible causes for the lack of support are valid. I would like to add one more: when the humans that could provide this kind of support don't care about providing it or have incentives against providing it. For example, I could report a bug in some system, but this would cost me time and only benefit people I don't know, so I will happily ignore it :-).

"Zero Sum" is a misnomer.

As a game theorist, I completely endorse the proposed terminology. Just don't tell other game theorists... Things get even worse when people use the term "general-sum games" to refer specifically to games that are not constant-sum.

I like to imagine different games on a scale between completely adversarial and completely cooperative, with things in the middle called "mixed-motive games".

Integrating Hidden Variables Improves Approximation

I am usually reasonably good at translating from math to non-abstract intuitive examples...but I didn't have much success here. Do you have an "in English, for simpletons" example to go with this? :-) (You know, something that uses apples and biscuits rather than English-but-abstract words like "there are many hidden variables mediating the interactions between observables" :D.)

Otherwise, my current abstract interpretation of this is something like: "There are detailed models, and those might vary a lot. And then there are very abstract models, which will be more similar to each other...well, except that they might also be totally useless." So I was hoping that a more specific example would clarify things a bit and tell me whether there is more to this (and also whether I got it all wrong or not :-).)

How should AI debate be judged?

Even if you keep the argumentation phase asymmetric, you might want to make the answering phase simultaneous or at least allow the second AI to give the same answer as the first AI (which can mean a draw by default).

This doesn't make for a very good training signal, but might have better equilibria.

AI Unsafety via Non-Zero-Sum Debate

I will assume you can avoid getting "hacked" while overseeing the debate. If you don't assume that, then it might be important whether you can differentiate between arguments that are vs aren't relevant to the question at hand. (I suppose that it is much harder to get hacked when strictly sticking to a specific subject-matter topic. And harder yet if you are, e.g., restricted to answering in math proofs, which might be sufficient for some types of questions.)

As for the features of safe questions, I think that one axis is the potential impact of the answer and an orthogonal one is the likelihood that the answer will be undesirable/misaligned/bad. My guess is that if you can avoid getting hacked, then the lower-impact-of-downstream-consequences questions are inherently safer (from the trivial reason of being less impactful). But this feels like a cheating answer, and the second axis seems more interesting.

My intuition about the "how likely are we to get an aligned answer" axis is this: There are questions where I am fairly confident in our judging skills (for example, math proofs). Many of those could fall into the "definitely safe" category. Then there is the other extreme of questions where our judgement might be very fallible - things that are too vague or that play into our biases. (For example, hard philosophical questions and problems whose solutions depend on answers to such questions. E.g., I wouldn't trust myself to be a good judge of "how should we decide on the future of the universe" or "what is the best place for me to go for a vacation".) I imagine these are "very likely unsafe". And as a general principle, where there are two extremes, there often will be a continuum in between. Maybe "what is a reasonable way of curing cancer?" could fall here? (Being probably safe, but I wouldn't bet all my money on it.)

AI Unsafety via Non-Zero-Sum Debate

I agree with what Paul and Donald are saying, but the post was trying to make a different point.

Among various things needed to "make debate work", I see three separate sub-problems:

(A) Ensuring that "agents use words to get a human to select them as the winner; and that this is their only terminal goal" is a good abstraction. (Please read this in the intended meaning of the sentence. No, if there is a magical word that causes the human's head to explode and their body falls on the reward button, this doesn't count.)

(B) Having already accomplished (A), ensure that "agents use words to convince the human that their answer is better" is a good abstraction. (Not sure how to operationalize this, but you want to, for example, ensure that: (i) Agents do not collaboratively convince the human to give reward to both of them. (ii) If the human could in principle be brainwashed, the other agent will be able and incentivized to prevent this. In particular, no brainwashing in a single claim.)

(C) Having already accomplished (A) and (B), ensure that AIs in debate only convince us of safe and useful things.

While somewhat related, I think these three problems should be tackled separately as much as possible. Indeed, (A) seems to not really be specific to debate, because a similar problem can be posed for any advanced AI. Moreover, I think that if you are relying on the presence of the other agent to help you with (A) (e.g., one AI producing signals to block the other AI's signals), you have already made a mistake. On the other hand, it seems fine to rely on the presence of the other agent for both (B) and (C). However, my intuition is that these problems are mostly orthogonal - most solutions to (B) will be compatible with most solutions to (C).

For (A), Michael Cohen's Boxed Myopic AI seems like a particularly relevant thing. (Not saying that what it proposes is enough, nor that it is required in all scenarios.) Michael's recent "AI Debate" Debate post seems to be primarily concerned about (B). Finally, this post could be rephrased as "When people talk about debate, they often focus on (C). And that seems fair. However, if you make debate non-zero-sum, your (B) will break.".

AI Unsafety via Non-Zero-Sum Debate

if you have 2 AI's that have entirely opposite utility functions, yet which assign different probabilities to events, they can work together in ways you don't want

That is a good point, and this can indeed happen. If I believe something is a piece of chocolate while you - hating me - believe it is poison, we will happily coordinate towards me eating it. I was assuming that the AIs are copies of each other, which would eliminate most of these cases. (The remaining cases would be when the two AIs somehow diverge during the debate. I totally don't see how this would happen, but that isn't a particularly strong argument.)

Also, the debaters better be comparably smart.

Yes, this seems like a necessary assumption in a symmetric debate. Once again, this is trivially satisfied if the debaters are copies of each other. It is interesting to note that this assumption might not be sufficient because even if the debate has symmetric rules, the structure of claims might not be. (That is, there is the thing with false claims that are easier to argue for than against, or potentially with attempted human-hacks that are easier to pull off than prevent.)

AI Unsafety via Non-Zero-Sum Debate

I think I understood the first three paragraphs. The AI "ramming a button to the human" clearly is a problem and an important one at that. However, I would say it is one that you already need to address in any single-agent scenario --- by preventing the AI from doing this (boxing), ensuring it doesn't want to do it (???), or by using AI that is incapable of doing it (weak ML system). As a result, I view this issue (even in this two-agent case) as orthogonal to debate. In the post, this is one of the things that hides under the phrase "assume, for the sake of argument, that you have solved all the 'obvious' problems".

Or did you have something else in mind by the first three paragraphs?

I didn't understand the last paragraph. Or rather, I didn't understand how it relates to debate, what setting the AIs appear in, and why would they want to behave as you describe.