Alignment with argument-networks and assessment-predictions

Tor Økland Barstad

Here is a simplistic diagram of an argument-network. Real-world examples would typically be larger, more interconnected, and not shaped like trees.

Argument-networks

Me: Let’s temporarily assume that:

We have AGI-systems that can argue things to humans in such a way that the entire argument can be split into “pieces”.
Although each “piece” may require that things be explained, it is possible to review each “piece” on its own.
None of the “pieces” are too overwhelming for smart humans to review.
We can get AGI-systems to try as well as they can to construct “argument-networks” that are assigned a high score by a scoring-function.
We have software-programs that can predict how various humans are likely to respond (when answering questions, evaluating arguments, interacting with pieces of code, etc). They won’t make predictions with 100% certainty, and they may not always have a prediction - but they tend to have good accuracy, and they avoid overconfidence.

That last assumption, about us having systems that can predict humans in an accurate way, is of course a non-trivial assumption. We’ll discuss it a bit more later.

Imaginary friend: None of the assumptions you listed seem safe to me. But I’ll play along and assume them now for the sake of argument.

Me: What we want to have produced are “argument-networks”.

Each node would contain one of the following:

An assumption, which human reviewers can agree with or disagree with.
An intermediate step, where the human reviewers can evaluate if the conclusion seems to follow from the assumptions.

Imaginary friend: If the node is an intermediate step, what would it contain?

Me: The main content would be:

Assumptions and conclusion: This is how different nodes in the network would be “linked” together. The conclusions of some nodes are used as assumptions by other nodes.
Argumentation for why the conclusion follows from assumptions: The argumentation is split into “pieces”. One “piece” could argue that a given conclusion follows from some list of assumptions.

Both assumptions and conclusions are propositions. These propositions would be represented in some format that allows for the following:

Human interpretability: When presented to a human reviewer it should look much like a sentence in natural language, or some other format that the human is familiar with.
Concept clarification: When there are terms that need clarification, they should be marked and explained.
No syntactic ambiguity: If the proposition is shown in human language, then steps should be taken to reduce ambiguity (clauses should be demarcated, the referent of pronouns should be demarcated, sloppy language should not be allowed, and so on).
Node linking: When the conclusion of one node is among the assumptions for another node, then this forms a “link” between these nodes. Such “links” should be made explicit, so that algorithms can make use of them. And there are other “links” that also should be made explicit. For example, if the same term is used and defined the same way across different nodes, then this should be demarcated.
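
To make the idea of explicit links a bit more concrete, here is a minimal sketch in Python of how nodes and links might be represented. It is only an illustration: the names Proposition, Node and find_links are hypothetical, and a real format would need to carry much more (demarcated clauses, shared term definitions, and so on).

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Proposition:
    """A claim, presented to reviewers in a near-natural-language format."""
    text: str
    # (term, explanation) pairs for terms that are marked as needing clarification.
    clarifications: tuple = ()


@dataclass
class Node:
    """An assumption (no premises) or an intermediate step (with premises).

    For a bare assumption, the assumption itself is stored in `conclusion`.
    """
    node_id: str
    conclusion: Proposition
    assumptions: list = field(default_factory=list)
    argumentation: str = ""  # the "piece" a reviewer would be shown


def find_links(nodes):
    """Return (source_id, target_id) pairs where the conclusion of the
    source node is among the assumptions of the target node."""
    concluded_by = {node.conclusion: node.node_id for node in nodes}
    links = []
    for target in nodes:
        for assumption in target.assumptions:
            source_id = concluded_by.get(assumption)
            if source_id is not None and source_id != target.node_id:
                links.append((source_id, target.node_id))
    return links


human = Proposition("Socrates is to be categorized as human")
mortal = Proposition("Socrates is to be categorized as mortal")
nodes = [
    Node("n1", conclusion=human),  # a bare assumption
    Node("n2", conclusion=mortal, assumptions=[human], argumentation="..."),
]
print(find_links(nodes))  # [('n1', 'n2')]
```

Because propositions are compared by value, two nodes that state the same proposition are linked automatically, which is one simple way of making “links” available to algorithms.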

There is no real-world example of the kind of network I have in mind. But these images from this article have a mild and very superficial resemblance.

Argumentation-pieces

Imaginary friend: You said that the argumentation in an argument-network is split into “pieces”, with each node containing its own “piece”. But what might one such “piece” look like? What is it that humans who evaluate arguments actually would be presented with?

Me: That would vary quite a bit.

I’ll try to give some examples below, so as to convey a rough idea of what I have in mind.

But the examples I’ll give are toy examples, and I hope you keep that in mind. Real-world examples would be more rigorous, and maybe also more user friendly.

With real examples, reviewing just one node could maybe take a considerable amount of time, even if that node covers just a small step of reasoning. Much time might be spent on explaining and clarifying terms, teaching the reviewer how various things are to be interpreted, and double-checking if the reviewer has missed anything.

Anyway, let’s start with the examples. In this example, the reviewer is shown one step of logical inference:

In the example above, the human is reviewing one step of logical inference, and asked if this step of inference is in accordance with a given inference-rule. This would be one possible approach for having humans review logical inference, but not the only one.

Another possible approach would be to ask reviewers whether they agree with certain computational rules of inference. And these inference-rules would be presented in a human-friendly format. Computational proofs could then be constructed based on these computational inference-rules, without the reviewers having to affirm every step of inference!

Both the examples I’ve given so far have concerned themselves with logical inference. But this would not always be the case. Here, for example, is an example that involves “code patterns”:

And here is another example that involves “code patterns”:

Before reviewing real nodes of a given "type", the reviewers should learn about how to review nodes of that "type", and review a few test-nodes.

This is especially true for "types" of nodes that are less straightforward (e.g. ones that make use of “code patterns”).

Anyway, let’s give another example of an “argument-piece”. Here is one that relates to number theory:

This would be one small “piece” of a proof showing that for every prime number you can name, there is always a higher one. LS1 and LS2 would also be referenced by other nodes (although the abbreviations they are given might vary between nodes).

The examples I just gave don’t show the full scope of what argument-pieces might look like. But hopefully they help you to sort of grok what I have in mind.

Imaginary friend: Several times you mentioned “human reviewers”. But I thought you didn’t intend for humans to actually review the argument-networks?

Me: Well, we could imagine one of the following approaches:

1. Human reviewers review the nodes
2. AI-systems predict how human reviewers would review the nodes
3. Some mix that involves both #1 and #2

And what I envision is either #2 or #3.

Expressing epistemic status

When answering questions and evaluating arguments, human reviewers would not simply answer “I agree” or “I disagree”. They could also be enabled to give answers such as the following:

“It does seem that way, but it’s possible that I’m misunderstanding the question somehow.”
“When having a level of confidence similar to the level of confidence I have in this, I would expect to probably not be wrong more than 1% of the time.”
“It does seem to me that the answer is yes, and this seems almost as clear to me as can be. But I don’t rule out completely that I could be overlooking something.”

It could also be possible to let users specify their level of confidence as being within some range:

And reviewers could be asked to rank their relative confidence in various claims:

The reasons for asking these kinds of questions would be two-fold:

If reviewers give nuanced information regarding their level of confidence, algorithms that score argument-networks have more to work with.
Asking reviewers about their level of confidence could make them think things over more carefully.

It would not be necessary for all reviewers to be asked the questions in the same way. For example, “Do you agree that [x]?” can be a leading question, and there may often be less leading ways of asking the same thing.

Meta-questions about arguments

Imaginary friend: When an AI constructs an argument-network, what is to stop it from putting it together wrongly?

That is to say: What if the argumentation in the argument-pieces is correct and high-quality, but something else is wrong, such as the “linking” between nodes?

Me: There could be several types of mechanisms for this. But for now let me focus on just one, which is to predict how human reviewers would review various pieces of the network content.

Just like humans can be asked to evaluate arguments, they can also answer questions about arguments. And if they are asked the right standardized questions, human reviewers can evaluate whether an argument-network has been put together correctly.

And if we have computer systems that can predict human answers to questions that are directly about the argumentation, then it shouldn’t be particularly harder to predict human answers to meta-questions about the argumentation content (that help to establish if the argument-network is put together correctly, and if rules for how argumentation should be presented are upheld).

Here are a few examples of types of questions that human reviewers could be asked, and where AI-systems could make predictions about what reviewers would be likely to answer:

“How closely, if at all, would you say the argument in box A fits the description in box B?”
“Look over the argumentation in Box A and Box B. Which, if any, looks more rigorous than the other?”
“Would you describe box A as laying out 1 step of inference?”
“Do the contents of box A look like they could be one step in an explanation of a visual mathematical proof?”
“Does the argument in box A have the statement in box B as its conclusion?”
“Take a look at the list of assumptions in box A, and at the argument in box B. Are all the assumptions that the argument in box B relies upon listed in Box A?”
“Look at the arguments in Box A1 and Box A2. Then look at the arguments in Box B1 and B2. Which pair, if any, seems more structurally similar than the other?”
“Proposition P1 makes use of the term ‘number’, and a clarification of what is meant by that term is given in Box B. After reading this clarification, do you feel that there still is a range of plausible interpretations of that term? If yes, to what degree would you say that the choice of interpretation affects the credence you put in Proposition P1?”
“Do you think the clarification that is given for ‘number’ in Box A seems like a reasonable way to use the word ‘number’? Or does it seem irregular to you, in such a way that it is confusing to use the word ‘number’ instead of something else?”
“Does the content of Box A seem to you like something a smart human could write? Or does the way it’s written indicate that it’s unlikely to have been written by a human?”

Scoring-function

Argument-networks would be given a score.

The process for calculating such a score would be complex, and I’ll not outline an algorithm here. But I will outline some of the elements that algorithms that calculate score could include.

First of all, there would be node agreement. Is it predicted that reviewers will agree with the assumptions and argument-pieces in the various nodes? If so, how confident are the reviewers predicted to be? And how confident is the prediction-system about this prediction?

We may be especially interested in node agreement among reviewers who are harder to fool. Some humans will be harder than others to “fool” (be that generally, or for argument-networks that are within a given domain). And if we have systems that are good at predicting human responses, those systems could be used to determine who those people are (since predictions can be made for how easy or hard it is to convince a given human of internally inconsistent claims).

Also, we may be interested in correlations between how easy reviewers are to fool and their tendency to agree with any given node. Maybe there are some nodes where most reviewers are likely to agree, also among humans who are unusually hard to fool, but with an unusual uptick in disagreement from those humans (e.g. with 1% of them disagreeing, while for most nodes <0.1% of them tend to disagree).
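
As a rough illustration of these two ideas, here is a small Python sketch. Everything in it is an assumption made for the sake of example: the function names, the idea of summarizing “how hard a reviewer is to fool” as a single non-negative weight, and the factor used to flag an uptick in disagreement.

```python
def weighted_agreement(predictions):
    """Weighted average of predicted agreement for one node.

    predictions: list of (predicted_agreement, resistance) pairs, one per
    reviewer. `resistance` is a hypothetical weight >= 0 expressing how hard
    the reviewer is predicted to be to fool.
    """
    total = sum(resistance for _, resistance in predictions)
    if total == 0:
        return None  # no reviewer carries any weight, so there is no estimate
    return sum(p * r for p, r in predictions) / total


def has_disagreement_uptick(node_rate, typical_rate, factor=5.0):
    """Flag a node whose disagreement rate among hard-to-fool reviewers is
    more than `factor` times the rate that is typical for other nodes."""
    return node_rate > factor * typical_rate


# The example from the text: 1% disagreement where <0.1% is typical.
print(has_disagreement_uptick(0.01, 0.001))  # True
```

A real scoring-function would presumably use something much richer than one weight per reviewer, but the sketch shows how such signals could enter a score as numbers.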

The algorithm that calculates score could also take node content properties into account. What are the individual “pieces” of the argumentation like?

Are there steps of inference that are implicit or sloppy? Or is it hard to see how the reasoning could be made more clear and explicit?
Are there ambiguous terms in use that aren’t clarified well?
Do the argument-pieces all fall in under some common “pattern” (that is recognized to usually not lead reviewers astray)?
What domains are the argument-pieces related to? Are they all low-ambiguity domains (such as mathematics, algorithms, etc)?
Does the node content fall in under some pattern where humans tend to be at particular risk of making certain specific cognitive errors?
How cognitively demanding are the argument-pieces to evaluate (in terms of short term memory, mental operations required, etc)?
What type of reasoning will typically be involved in evaluating the node content? (Syllogistic inference? 2D visualization? 3D visualization? Pattern recognition?)
Is specialized technical knowledge required to evaluate the node content?

Assessing whether a specific node has a specific property might involve:

Predictions of human answers to meta-questions: For example, it would be possible to show reviewers an example of a specific argument-pattern, and then ask whether the argument shown in front of them is in accordance with that pattern. Or they could have some “bad practice” be described to them, and then have some section of an argument highlighted to them, and be asked if what they see seems to be an example of that “bad practice”.
Functions that look for patterns in the node content: A capable AI will notice patterns/properties of what constitutes good argumentation, and patterns/regularities/properties that make humans easier to fool. And for each of these properties, it would be able to construct functions that can detect it.

Another thing for a scoring-function to look at is network properties:

How many assumptions are there?
Is the network sparse, or is it deeply interconnected? That is to say: Are conclusions supported by many paths through the network, starting from different assumptions? Or is the network more like an unstable building, which can be collapsed by removing a few pillars?
Does the network rely on very long chains of reasoning, with ample opportunity to sneak in error?

One possible approach could be roughly as follows:

Give every individual node a “score”, and based on that score assume some % chance that the node is wrong.
Use various methods and heuristics to approximate the conditional dependencies, or the lack thereof, between the correctness-probability of different nodes (for example: if two nodes contain basically the same step of inference, but presented slightly differently, then one of them being wrong indicates that the other one also is wrong)
Do Monte Carlo simulations, or run some algorithm that approximates what the result of Monte Carlo simulations would be. For each “round” of Monte Carlo, look at the subset of the nodes that are “correct”, and see if that subset is sufficient for the conclusion to follow from the assumptions.
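
Here is a minimal Python sketch of the Monte Carlo part of that approach. It makes a strong simplifying assumption that the text above does not make: nodes are treated as independent, so the step about conditional dependencies is skipped entirely. The representation (a dict from node id to a pair of assumptions and conclusion) and the function names are hypothetical.

```python
import random


def conclusion_supported(nodes, correct_ids, target):
    """Forward-chain using only the nodes sampled as correct.

    nodes: dict mapping node_id -> (assumptions, conclusion), where
    assumptions is a tuple of proposition labels (empty for a bare assumption).
    """
    established = set()
    changed = True
    while changed:
        changed = False
        for node_id, (assumptions, conclusion) in nodes.items():
            if (node_id in correct_ids
                    and conclusion not in established
                    and all(a in established for a in assumptions)):
                established.add(conclusion)
                changed = True
    return target in established


def estimate_support(nodes, p_wrong, target, rounds=10_000, seed=0):
    """Estimate how often the target still follows when each node is
    independently "wrong" with probability p_wrong[node_id]."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(rounds):
        correct = {n for n in nodes if rng.random() >= p_wrong[n]}
        hits += conclusion_supported(nodes, correct, target)
    return hits / rounds


# Two independent paths to "C": A -> C and B -> C.
nodes = {
    "a1": ((), "A"),
    "a2": ((), "B"),
    "s1": (("A",), "C"),
    "s2": (("B",), "C"),
}
p_wrong = {node_id: 0.1 for node_id in nodes}
print(estimate_support(nodes, p_wrong, "C"))
```

In this toy network each path holds with probability 0.9 * 0.9 = 0.81, so under independence the conclusion is supported with probability 1 - 0.19 * 0.19 ≈ 0.964, and the estimate should land close to that. A network with only one path would do worse, which is the “unstable building” intuition expressed as a number.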

Argument-piece templates

It would be possible for a scoring-function to be quite restrictive, and give a score of 0 to any argument-network that breaks any of the requirements that are imposed.

And when deciding what kind of argumentation is allowed in an argument-network, it may be better to ask “what should we allow?” than “what should we forbid?”.

One approach would be to have some limited list of argument-piece templates. Each template would represent one “type” of argument, and describe exactly how arguments of that “type” should be presented. A strict score-function could then require that the content of every node is in accordance with one of the templates from the list.

Here is a simplistic toy example of a “template”:

[entity reference] is to be categorized as [category reference 1]
All entities that are to be categorized as [category reference 1] are also to be categorized as [category reference 2]
Therefore, [entity reference] is to be categorized as [category reference 2]

And here is a simplistic example of an instance of such a template:

Socrates is to be categorized as human
All entities that are to be categorized as human are also to be categorized as mortal
Therefore, Socrates is to be categorized as mortal

In the example above, what I want you to notice is that a lot of the content is static. All that can be varied between instances of a template are certain names that are filled into certain specific spots. Not all templates would be this simple, but all templates would be quite restrictive in terms of what can be expressed.
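
To show what “checking that a node is in accordance with a template” could mean mechanically, here is a toy Python sketch for the template above. It is an illustration under simplifying assumptions: the function names are hypothetical, and each slot is restricted to a single word, which a real system would not be.

```python
import re

TEMPLATE = (
    "{entity} is to be categorized as {cat1}\n"
    "All entities that are to be categorized as {cat1} "
    "are also to be categorized as {cat2}\n"
    "Therefore, {entity} is to be categorized as {cat2}"
)

# Named groups capture each slot; backreferences such as (?P=cat1) force
# the same name to be reused wherever the template reuses it.
PATTERN = re.compile(
    r"(?P<entity>\w+) is to be categorized as (?P<cat1>\w+)\n"
    r"All entities that are to be categorized as (?P=cat1) "
    r"are also to be categorized as (?P<cat2>\w+)\n"
    r"Therefore, (?P=entity) is to be categorized as (?P=cat2)"
)


def fill_template(entity, cat1, cat2):
    """Create an instance of the template from three names."""
    return TEMPLATE.format(entity=entity, cat1=cat1, cat2=cat2)


def matches_template(text):
    """True only if the whole text is an instance of the template."""
    return PATTERN.fullmatch(text) is not None


instance = fill_template("Socrates", "human", "mortal")
print(matches_template(instance))  # True
print(matches_template(instance.replace("Therefore, Socrates", "Therefore, Plato")))  # False
```

The check is purely about form. Whether the filled-in assumptions are true is a separate question, which is what the assumption-nodes and their reviewers are for.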

As rules of thumb:

The fewer templates that are allowed, the better.
The more restrictive each template is in terms of how an argument can be presented, the better.
It’s better if all templates are “general”, and aren’t exclusively used for argument-steps having to do with some specific and narrow topic.

Wiggle room and failing visibly

Imagine that we request output from an AI, and that this output…

Must fulfill certain requirements in order to be accepted (e.g. “it must be a list of at least two integers, each greater than 1, and the product of these must be equal to 97”)
Can be seen as arguing for some specific claim (e.g. “97 is not a prime number”)

If there is no wiggle room, then the AI cannot produce different outputs that (1) pass whichever requirements we have and (2) argue for contradictory claims.
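
The integer example can be made concrete with a few lines of Python. This is a sketch under stated assumptions: the names accepts and find_accepted_output are hypothetical, the brute-force search merely stands in for “the AI trying as hard as it can”, and the requirement as implemented demands at least two integers that are each greater than 1 (otherwise the list [1, 97] would trivially pass).

```python
from math import isqrt, prod


def accepts(output, n):
    """The requirement: a list of at least two integers, each greater
    than 1, whose product equals n."""
    return (
        isinstance(output, list)
        and len(output) >= 2
        and all(isinstance(k, int) and not isinstance(k, bool) and k > 1
                for k in output)
        and prod(output) == n
    )


def find_accepted_output(n):
    """Exhaustively search for an output that passes the requirement.
    Returns None if no such output exists."""
    for d in range(2, isqrt(n) + 1):
        if n % d == 0:
            return [d, n // d]
    return None


print(find_accepted_output(91))  # [7, 13]
print(find_accepted_output(97))  # None
```

For 97 there is no output that passes, no matter how capable the searcher is. So an accepted output can only ever argue for a true claim of the form “n is not a prime number”, and in that sense this requirement leaves no wiggle room.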

The concept of wiggle room is a general one, but let us now consider it in the context of argument-network scores.

A high score is supposed to indicate that an argument-network is worth trusting. And if it receives a low score, then that is supposed to indicate that it isn’t reliable:

But the job of the AI is not to tell the truth. Rather, it’s optimized to seek a high score.

Suppose that the AI earnestly tries to maximize score for every individual request it’s given. If so, it will try to get as high a score as it can when tasked to make an argument-network that argues “x”. And it will do an equally earnest job when tasked to make an argument-network that argues “not x”.

And this would help us to fail visibly.

Suppose the AI that makes argument-networks is able to convince us of anything. Or maybe it’s able to “cheat” somehow (e.g. in how it puts together the network). If so, it should be able to make high-scoring argument-networks that argue for mutually contradicting claims:

But if things work as they should, and the scoring-function is well calibrated, then the AI would not be able to convince us of mutually contradictory claims:

Imaginary friend: And how would you ensure that the AI-system “earnestly tries to maximize score for every individual request it’s given”?

Me: That I leave as an exercise to the reader 🙃

AI-help with adjusting scoring-function

Imaginary friend: I’m inclined to agree with this quote from Eliezer:

Human operators are fallible, breakable, and manipulable. Human raters make systematic errors - regular, compactly describable, predictable errors.

Me: I also more or less agree with this.

In many ways humans are predictable. And some of the ways that we are predictable can be summarized compactly.

I’m reminded a bit of magicians. Magicians are able to reliably trick humans in non-random ways. And I’m sometimes surprised when I see how good some of their tricks are. Even professional magicians can sometimes be fooled by other magicians.

There is a huge number of possible tricks a magician can do, but the number of general techniques that these tricks are based on is much smaller. It would quickly become much harder to trick us if we choose good rules for what the magician is and isn’t allowed to do:

“You want to put the ball in a cup? Sure, but use this specific ball and this specific cup that we provide. And you are only allowed to use one hand.”
“Everything you do will be filmed in slow motion, from several angles. And the footage will be looked over afterwards.”
“You want me to pick a card? But cards are not allowed.”
“You wonder what’s in my left shoe? Well, that question would only be allowed if you specified beforehand that you might ask it.”
Etc, etc

Humans are often predictable. But this doesn’t just apply to our tendency to make mistakes. It also applies to our tendency to get things right.

A superintelligent AI could be able to understand us better than we understand ourselves. Which is a problem if it wants to trick us. But it might be an advantage if it wants us to not be tricked.

Imaginary friend: But if it seeks to maximize points for the argument-network that it makes, then why would it want the reviewers to not be tricked?

Me: Well, that AI - the AI that is asked to construct a high-scoring argument-network - would be incentivized to include deceptive argumentation if that results in a higher score.

But we could have other AIs that help us in coming up with score-functions. And the AIs that do that could receive a high or low score based on how well those scoring-functions do.

Imaginary friend: And how would you determine “how well those scoring-functions do”?

Me: We could approximate this by getting the AIs to generate argument-networks that argue for and against a variety of claims, and then looking at:

Wiggle room: The degree to which the AIs are able to generate high-scoring argument-networks that argue for mutually contradicting claims.
Reach: The range of propositions where some AI is able to generate a high-scoring argument-network arguing either “it is the case that [x]” or “it is not the case that [x]”.
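
As a rough sketch, both quantities could be estimated from a table of the best scores the AIs managed to achieve. The Python below assumes a fixed “high score” threshold and a finite list of test claims, and the function name is hypothetical. It is one possible way to turn the two notions into numbers, not a definition.

```python
def wiggle_room_and_reach(best_scores, threshold):
    """best_scores maps each claim x to a pair:
    (best score achieved arguing "x", best score achieved arguing "not x").

    Returns (wiggle_room, reach) as fractions of the claims tested:
    - wiggle_room: both sides reached the threshold
    - reach: at least one side reached the threshold
    """
    if not best_scores:
        return 0.0, 0.0
    n = len(best_scores)
    both = sum(1 for pro, con in best_scores.values()
               if pro >= threshold and con >= threshold)
    either = sum(1 for pro, con in best_scores.values()
                 if pro >= threshold or con >= threshold)
    return both / n, either / n


best_scores = {
    "claim 1": (0.95, 0.20),  # only "x" can be argued convincingly
    "claim 2": (0.93, 0.94),  # both sides score highly: wiggle room
    "claim 3": (0.10, 0.30),  # neither side scores highly: outside reach
    "claim 4": (0.40, 0.99),  # only "not x" can be argued convincingly
}
print(wiggle_room_and_reach(best_scores, threshold=0.9))  # (0.25, 0.75)
```

A scoring-function would look good on this measure when the first number is close to 0 while the second is as high as possible.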

Btw, it may be that in order to minimize wiggle room, some reach must be sacrificed.

For example, if an argument-network only is allowed to use low-ambiguity terms, then this will presumably decrease reach, but it might be necessary in order to decrease wiggle room.

We could imagine a scenario where low wiggle room is achievable for low-ambiguity fields such as mathematics and computer science, but where reach is hard to extend outside of these domains (without increasing wiggle room).

Limiting expressivity when AIs propose score-functions

Imaginary friend: If I want you to believe something, and I am constructing the scoring-function for an argument-network, I could make it so that argument-networks only get a high score when they argue for what I want you to believe.

To put it crudely, I think what you envision is for scoring-functions to achieve low wiggle-room by being all like:

if argumentation is rigorous and not misleading
then return high score
else return low score

But they could also achieve low wiggle room by something that’s more like the following:

if argument-network argues for conclusions I like
then return high score
else return low score

Do you have any mechanisms in mind for dealing with this?

Me: I do.

First off, we would not let the AIs propose score-functions freely. Rather, we would restrict the “space” of scoring-functions that can be proposed.

How we restrict the functions that can be proposed would be up to whoever designs and implements the “game mechanics” for interacting with the AIs. But one possible approach would be to use function builders, and only allow functions that can be made with this “builder”.

What function-builders look like could vary on a case-by-case basis. A function-builder could be complex, and allow functions to be constructed through a variety of operations (various types of snippets that can be added and combined according to various rules). But more simple and restrictive function-builders would also be possible (e.g. where all that can be done is to choose the values of a few parameters in an otherwise pre-defined function).
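
Here is a minimal Python sketch of the simple and restrictive kind of function-builder, where all that can be chosen is the values of two parameters. The parameter names, their allowed ranges, and the shape of the network summary are all hypothetical choices made for illustration.

```python
def build_score_function(min_agreement, max_chain_length):
    """Build a score-function from two parameters in an otherwise
    pre-defined function. Values outside the allowed ranges are rejected,
    which is how the builder restricts the "space" of score-functions."""
    if not 0.9 <= min_agreement <= 1.0:
        raise ValueError("min_agreement must be between 0.9 and 1.0")
    if not 1 <= max_chain_length <= 50:
        raise ValueError("max_chain_length must be between 1 and 50")

    def score(network):
        # network: dict with "node_agreements" (list of predicted agreement
        # levels, one per node) and "longest_chain" (int).
        # Note that nothing here inspects which conclusion is argued for.
        if network["longest_chain"] > max_chain_length:
            return 0.0
        if any(a < min_agreement for a in network["node_agreements"]):
            return 0.0
        # The network is only as strong as its weakest node.
        return min(network["node_agreements"], default=0.0)

    return score


score = build_score_function(min_agreement=0.95, max_chain_length=10)
print(score({"node_agreements": [0.99, 0.97], "longest_chain": 3}))   # 0.97
print(score({"node_agreements": [0.99, 0.97], "longest_chain": 11}))  # 0.0
```

Since the pre-defined part never looks at what a network argues for, no choice of parameters can produce a function of the “return high score if I like the conclusion” kind.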

Anyway, enough about function-builders. I shouldn’t get too caught up in the practical details here. The main point is that the AIs that propose score-functions would be limited in terms of the functions they could propose.

Here are some examples of ways that the “space” of possible score-functions could be limited:

Different types of wiggle room

Me: As mentioned, different AI-systems could be given different tasks. For example:

Some AI-systems may be tasked to come up with high-scoring argument-networks (which score highly according to certain criteria)
Other AI-systems may be tasked to come up with scoring-functions (which score highly according to certain criteria)

When we have talked about wiggle room so far, we have talked about wiggle room for the AIs that come up with argument-networks that will be scored against a specific scoring-function. But that’s just one type of wiggle room.

Here are some examples of different types of wiggle room:

Wiggle room relative to specific scoring-function

This kind of wiggle room can be demonstrated as follows:

1. You start out with a pre-specified scoring-function (for scoring argument-networks)
2. You are tasked to make a high-scoring argument-network arguing “x”
3. You are also tasked to make a high-scoring argument-network arguing “not x”
4. You succeed at both #2 and #3

Wiggle room relative to specific score-function-builder

This kind of wiggle room could be demonstrated as follows:

1. You are tasked with making score-functions for scoring argument-networks
2. These scoring-functions are to be made using a specific function-builder (which restricts the “space” of scoring-functions that it’s possible to make)
3. Score-functions are scored based on a variety of factors. For example, if a score-function has high wiggle-room it will get a lower score.
4. You are tasked to make a high-scoring scoring-function, with the requirement that for this scoring-function it’s possible to make high-scoring argument-networks that argue “x”
5. You are tasked to make a high-scoring scoring-function, with the requirement that for this scoring-function it’s possible to make high-scoring argument-networks that argue “not x”
6. You succeed at both #4 and #5

If humans define a score-function-builder, and wiggle room is low relative to that score-function-builder, then that is a positive sign.

Wiggle room relative to a builder for making score-function-builders

As mentioned, a score-function-builder would be a function-builder for making score-functions. And it defines a “space” of possible score-functions.

But maybe we are unsure about exactly which kinds of scoring-functions we want to allow:

How strict should we be in terms of maximum source code length?
Do we have guidelines we want to enforce for how the score of the network as a whole can be calculated, given the scores that have been assigned to individual nodes? Do we maybe even have a specific algorithm in mind? Or maybe there are a few specific algorithms we have in mind, and we’re not sure which would be best?
Do we have a specific list of templates that we want every “argument-piece” to adhere to? If so, do we want all of those templates to be allowed, or only some of them?
Do we maybe want the AI to be able to come up with its own templates? If so, what kinds of requirements would these templates need to pass (in terms of what reviewers are predicted to answer to meta-questions and so on)?
Etc, etc

In other words: There are various different areas where we can be “strict”, and we are unsure of how strict we should be in each area.

One thing we could do would be to define a function-builder that itself makes score-function-builders. The intention would be that any score-function-builder that can be made with this score-function-builder-builder seems to us like it maybe could be a good score-function-builder.

So what is wiggle room when it’s relative to this kind of score-function-builder-builder? Well, it comes down to the “space” of possible score-functions that can be generated indirectly from this score-function-builder-builder (by first generating a score-function-builder). Are there contradictory argument-networks where each of them would be assigned a high score by at least one of the score-functions from that “space”?

If a score-function-builder-builder makes it possible to build score-function-builders that are strict along different dimensions from each other, and wiggle room relative to this score-function-builder-builder is low, then that could be a good sign. It might indicate that there are multiple assurances - that there is more than just one mechanism that prevents “bad” argument-networks from getting a high score.
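
The question of whether there is wiggle room relative to a whole “space” of score-functions can be sketched as a simple check in Python. The sketch assumes we have a finite sample of score-functions from the space and finite sets of candidate argument-networks for each side. In practice both would have to be searched for by AIs, and the function name is hypothetical.

```python
def has_wiggle_room_in_space(score_functions, networks_for_x,
                             networks_for_not_x, threshold):
    """True if some score-function in the space gives a high score to a
    network arguing "x", and some (possibly different) score-function in
    the space gives a high score to a network arguing "not x"."""
    def some_function_accepts(networks):
        return any(f(net) >= threshold
                   for f in score_functions
                   for net in networks)

    return (some_function_accepts(networks_for_x)
            and some_function_accepts(networks_for_not_x))


# Toy networks, summarized by a single hypothetical "rigor" number.
for_x = [{"rigor": 0.95}]
for_not_x = [{"rigor": 0.20}]

strict_space = [lambda net: net["rigor"]]
sloppy_space = strict_space + [lambda net: 1.0]  # one function accepts anything

print(has_wiggle_room_in_space(strict_space, for_x, for_not_x, 0.9))  # False
print(has_wiggle_room_in_space(sloppy_space, for_x, for_not_x, 0.9))  # True
```

The second call shows why the space as a whole matters: a single overly permissive score-function anywhere in the space is enough to create wiggle room relative to the builder.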

Adversarial network-extensions

Me: We could imagine giving an AI the following challenge:

1. Receive an argument-network that another AI has constructed as input
2. Construct its own argument-network (and receive a high score for this argument-network)
3. Merge the two argument-networks (in such a way that nodes with identical content become the same node)
4. Have the resulting network contain high-scoring nodes with mutually contradictory conclusions (where at least 1 of those nodes was in the argument-network that was received as input)

This process can be recursive. That is to say, the network constructed in #2 can be received as input by another AI that is given this same type of challenge.
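
The merge step and the contradiction check can be sketched in a few lines of Python. The sketch is heavily simplified: a node is reduced to its argument text, the statement it concludes, and whether it affirms or denies that statement; scores are per node; and the requirement that the AI’s own network as a whole scores highly is left out. All names are hypothetical.

```python
from typing import NamedTuple


class ClaimNode(NamedTuple):
    argument: str    # the content of the argument-piece
    statement: str   # the statement that the conclusion is about
    affirmed: bool   # True for "x", False for "not x"


def merge(input_net, own_net):
    """Merge two networks (dicts mapping node -> score). Nodes with
    identical content are equal as tuples, so they become one node.
    When a node is in both, the score from the input network is kept."""
    merged = dict(input_net)
    for node, score in own_net.items():
        merged.setdefault(node, score)
    return merged


def challenge_met(input_net, own_net, threshold):
    """True if the merged network has two high-scoring nodes with
    contradictory conclusions, at least one of them from the input."""
    merged = merge(input_net, own_net)
    high = [node for node, score in merged.items() if score >= threshold]
    for a in high:
        for b in high:
            contradictory = (a.statement == b.statement
                             and a.affirmed != b.affirmed)
            if contradictory and (a in input_net or b in input_net):
                return True
    return False


received = {ClaimNode("argument 1", "x", True): 0.95}
attack = {ClaimNode("argument 2", "x", False): 0.96}
print(challenge_met(received, attack, threshold=0.9))  # True
```

If an AI can meet this challenge against a network that was thought to be trustworthy, that is evidence of wiggle room somewhere, either in that network or in the scoring.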

Feasibility of splitting arguments into human-digestible “pieces”

Imaginary friend: You talk about splitting arguments into “pieces”. And this reminds me a bit of children who keep asking “why?”.

“You have to go to bed”
“Why?”
“Because it’s nighttime.”
“Why?”
“Because it’s past sundown”
“Why?”
“Because Earth is spinning around the sun”
“Why?”

When children ask “why”-questions, we rarely give precise and detailed answers. And the children will often not notice that.

In fact, it would be hard to give precise answers even if we tried. Answers to these kinds of questions typically branch off in lots of directions, as there is a lot we would need to explain. But when we talk or write, we can only go down one branch at a time.

Precise logical inference is done one step at a time. But one claim involved in one step of inference might reference lots of concepts that we’re not familiar with. So it may take a huge amount of time to understand just that one step of reasoning, or even just one statement.

As humans our minds are very limited. We have very little short term memory, we can’t visualize higher-dimensional geometries, etc. So sometimes we may simply not be able to follow an argument, despite attempts at breaking it into tiny “pieces”.

Me: Couldn’t have said it better myself.

I should make it clear that I don’t see the strategies I describe in this post as guaranteed to be feasible. And the points you raise here help explain why.

But my gut feeling is one of cautious optimism.

Keep in mind that there often will be many ways to argue one thing (different chains of inference, different concepts and abstractions that all are valid, etc). If a score-function works as it should, then the AI would be incentivized to search for explanations where every step can be understood by a smart human.

Even if hard steps are involved when the AI comes up with some answer, explaining those steps is not always necessary in order to argue robustly that the answer is correct. There are often lots of different ways of showing “indirectly” why something must be the case.

For example, if an AI-system wanted to make an argument-network that argues in favor of The Pythagorean Theorem, it wouldn’t necessarily need to explain any mathematical proof of The Pythagorean Theorem. Instead it could:

Construct “theorem-provers” (be that relatively general-purpose theorem-provers, or theorem-provers that are specialized towards certain narrow types of mathematical statement, or both).
Argue that these “theorem-provers” can be trusted to work in a certain way given certain inputs.

As humans we can make computer-programs that do things we are unable to do by ourselves. The principles that enable us to do this - well, they don’t always allow for robust assurances, but often they do.

I suspect a common technique in many argument-networks could be to:

Construct computer programs
Argue that certain conclusions should be drawn if those computer programs have certain specific outputs when given certain specific inputs

The argument-network would not necessarily need to have nodes that explain every single line of code, but through a variety of techniques it would need to argue robustly that the code does what it’s purported to do.
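As a toy illustration of that last point (my own example, not a proposal for how real argument-networks would do it): one technique is to pair a complicated program with a much simpler checker, so that reviewers only need to be convinced that the checker is right. The names `opaque_sort` and `is_valid_sort_result` are hypothetical, and sorting is just a stand-in for some harder task.

```python
from collections import Counter

def opaque_sort(items):
    """Stand-in for a complicated program that reviewers never read line by line."""
    return sorted(items)

def is_valid_sort_result(original, result):
    """Small checker that a reviewer can understand in full.

    The result is accepted only if it is in non-decreasing order and
    contains exactly the same elements as the original input.
    """
    in_order = all(a <= b for a, b in zip(result, result[1:]))
    same_elements = Counter(original) == Counter(result)
    return in_order and same_elements

data = [5, 3, 8, 1, 3]
output = opaque_sort(data)
# The node would argue: "if the checker accepts, the output is a correct sort of the input".
print(is_valid_sort_result(data, output))
```

Here the argument only has to cover the few lines of the checker, regardless of how the output was produced.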

Imaginary friend: Do you have any comments on this excerpt from the post Rant on Problem Factorization for Alignment?

Me: Here are some comments:

I’m not surprised: I’m not surprised by that result (presuming it’s conveyed correctly). But I also wouldn’t have been surprised by results that were significantly better than this.
Argument-networks don’t necessarily need to be efficient: If some piece of code is explained in an argument-network, it could be that having humans review all the nodes that help explain that piece of code (arguing that the code does what it’s purported to do) would be orders of magnitude less efficient than it would be for a human to write that code himself/herself. Typically, actual real-life human reviewers will either review 0 nodes or some small fraction of the nodes.
Reviewing one node may often take more than 5 minutes: When reviewing if something is the case, just understanding the question and the terms that are used may often take more than 5 minutes. And when reviewing nodes, people should take care to make sure they understand things correctly, think things over several times, and so on. The way I imagine it, it may sometimes take 30-60 minutes for an actual reviewer to review a node (or maybe even more), even though that node only does one fairly small step of reasoning.
The average reviewer is not what matters: A score-function need not care about what the average person thinks. Nor does it need to care about the average student at Harvard, the average person who works with AI alignment, the average person at DeepMind, or the average person with 135+ IQ. Suppose that some subset of reviewers are deemed much harder to convince of contradictory claims compared to the rest (and not just because they always answer “I don’t know”). If so, the scoring-function can focus on that subset of reviewers when scoring the argument-network. That being said, if the subset of reviewers that the scoring-function cares about is too small, then that might make it harder to test the accuracy of the systems that predict human responses.
It may be sufficient for a very small subset of arguments to be viable as argument-networks: When arguing that something is the case, there are lots of ways to do that. Some may contain pieces that aren’t sufficiently human-digestible. Some may require types of arguments that too often lead humans astray. Some may contain pieces where it’s hard to reliably predict what human reviewers will respond. Due to these kinds of restrictions, it may be the case that converting arguments into a viable argument-network typically doesn’t work. But if it works for some small fraction of arguments, that might be sufficient.

Imaginary friend: So argument-networks don’t necessarily need to be efficient. And it may be sufficient for a very small subset of arguments to be viable as argument-networks.

It doesn’t seem like we should rely on this strategy working out...

Me: I agree. But it could be explored in parallel with other approaches.

Bets as a way to save computational resources

Me: Something that could be possible for AIs to do would be to make bets on the score that they could get, or other AIs would be able to get, if tasked to construct an argument-network that argues for some specific claim (“I bet that there is a 99.97+% probability that I would be able to make an argument-network that gets a score of 88.2+%”).

These bets would need to be incentivized in such a way that it doesn’t pay off to be overconfident. It would be possible to have game mechanics where AIs compete in a zero-sum game, and each AI can make counter-bets against the other.
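A minimal sketch of why overconfidence need not pay off, assuming bets are settled with a quadratic (Brier-style) rule. That choice of rule is my assumption for illustration, and `brier_points` and `expected_points` are hypothetical names. Under such a rule, the report that maximizes expected points is the AI's actual probability.

```python
def brier_points(reported_p, outcome):
    """Points for one bet: 1 minus the squared error (higher is better)."""
    return 1.0 - (reported_p - float(outcome)) ** 2

def expected_points(reported_p, true_p):
    """Expected points if the event really happens with probability true_p."""
    return (true_p * brier_points(reported_p, True)
            + (1.0 - true_p) * brier_points(reported_p, False))

true_p = 0.8  # assumed real chance that the argument-network reaches the stated score
candidates = [i / 100 for i in range(101)]
best_report = max(candidates, key=lambda p: expected_points(p, true_p))
print(best_report)  # the most profitable report coincides with true_p
```

Reporting 0.99 when the real chance is 0.8 gives a lower expected payoff than reporting 0.8, which is the property the betting mechanics would need.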

One point of doing bets would be to save computational resources. And another point would be to try to answer questions that may be infeasible to answer by brute force. For example, if some “space” of possible scoring-functions has been defined, it’s hard to know which scoring-function within that “space” has the least “wiggle room”. But it would be possible for AIs to be incentivized to try to find the best one, and then they could make bets for how likely they think it is that the other AI is able to find a scoring-function that performs significantly better.

Bets can be made on the outcome of a process, but it could also be possible to make bets on steps that are part of a larger process. For example, suppose that a process has sub-steps that involve predicting human responses. If so, it would be possible to make system-calls to systems that make rough estimates of what those predictions would be if more compute was spent so as to make a more precise prediction (“I predict at least x% chance that process P would assign at least y% chance to the answer being either A or B”).

Predicting human evaluations

Imaginary friend: Earlier in this post, you asked me to temporarily assume the following:

“We have software-programs that can predict how various humans are likely to respond to questions, evaluate arguments, interact with pieces of code, etc.”

You went on to say that you didn’t presume that they would have 100% certainty, or that they always would have a prediction. But you did presume that they wouldn’t be overconfident, and yet confident enough often enough to be useful.

I share your assumption that such software-programs probably are possible. Indeed, one worry about misaligned superintelligent AGIs is that they might understand us all too well. But do you think it’s probable for us to obtain such programs in time to use them?

Me: I’m not sure, but I do feel somewhat optimistic.

One thing to consider is that a human answering a question or doing other forms of evaluation can be thought of as an experiment. Which means that we can test how accurate any given system is at predicting human evaluations.

The strategies in this post are agnostic in terms of how we predict human evaluations. But they rely on us being able to do it somehow.

One challenge is that when testing the accuracy of a system that predicts human responses, that will presumably involve showing the content to humans - but we may be hesitant towards letting the AI freely come up with content that is shown to actual human reviewers. In other words: among the “space” of predictions that are made by the prediction-system we want to test, there may be large sections of that “space” where we are unwilling to carry out experiments.

In order to address this problem, I have certain techniques in mind, which I refer to as Bayes-games. I have an addendum to this post where I write about Bayes-games, and I also plan to write more about Bayes-games in the future.

Something to keep in mind is that we need not restrict ourselves to one method for predicting human responses! It’s possible to have different systems that predict human responses, based on different architectures and principles. And we could choose to only trust predictions in cases where all the different systems are making converging predictions.
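A rough sketch of what “only trust converging predictions” could look like, assuming each system outputs either a probability that a reviewer accepts a node or no prediction at all. The function name `converged_prediction` and the tolerance value are hypothetical.

```python
from typing import Optional, Sequence

def converged_prediction(predictions: Sequence[Optional[float]],
                         tolerance: float = 0.05) -> Optional[float]:
    """Return a combined probability only if every system agrees.

    Each entry is one system's predicted probability that a reviewer
    accepts a node, or None if that system declines to predict.
    """
    if not predictions or any(p is None for p in predictions):
        return None  # at least one system abstained, so no prediction is trusted
    if max(predictions) - min(predictions) > tolerance:
        return None  # the systems disagree too much
    return sum(predictions) / len(predictions)

print(converged_prediction([0.90, 0.92, 0.91]))  # systems agree
print(converged_prediction([0.90, 0.50, 0.91]))  # systems disagree, so None
```

Returning no prediction in cases of disagreement matches the earlier assumption that prediction-systems “may not always have a prediction”.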

Having actual humans review node content

A worry when interacting with superintelligent AGI systems is that they could “trick” or somehow manipulate human operators. And the more “bandwidth” they have for interacting with human operators, the more realistic that might become.

With a certain disadvantage in pieces, even a maximally good chess-player will not be able to beat a mediocre player. And similarly, maybe there is nothing that an AI could say that would enable it to escape its “confinement” or otherwise cause great harm (especially if it is limited in terms of how much it can say). But we don’t know quite how “hackable” we are, and I’d think it best to err on the side of being too prudent.

When dealing with argument-networks, an advisable practice (especially in the beginning) could be for the content of the nodes to be obscured by default. Meaning that the human operators by default would be unable to see the actual node content.

However, they might be able to see info about the node content:

Even if we choose to have nodes obscured “by default”, we may sometimes see it necessary for actual humans to review nodes (one reason being that we may need to test the accuracy of the systems that predict what human reviewers would answer). Here are some policies that could help reduce risks associated with this:

Have people who review nodes be closed off from AI operators and from society: Risks associated with human reviewers being manipulated or “tricked” are lower if those humans cannot influence how events unfold. So it would be a possibility to have the human reviewers work closed off somewhere, without access to the internet, and without any high-bandwidth communication with the people who operate the AI-system. They would not need to be closed off for the rest of their lives, but for some limited period of time (where superintelligence has been developed or seems like it could be right around the corner, but robust alignment isn’t yet in place).
Start by reviewing only nodes that seem more safe: Nodes that contain only text may be safer than nodes that contain visuals. Nodes where the content seems innocuous (e.g. only containing math) may be safer than content that concerns itself with more controversial topics. Judging whether a node seems to have innocuous content could be done with the help of various methods and systems, with each of them being able to “veto”.
To the degree possible, test the accuracy of prediction-systems by reviewing human-made nodes: When testing the prediction accuracy of systems that predict human assessments, we can also restrict ourselves to have the human review nodes that are based on human-produced arguments. This is a rule that potentially could be practiced strictly in the beginning, and then more and more exceptions could be made with time (as we get more and more “verified” capabilities that help us assess risk).

Btw, keep in mind that the process of generating these networks can be broken into several steps (and those steps can again be broken into smaller steps):

Coming up with the content of the argumentation
Breaking and translating argumentation into “pieces”, and constructing an argument-network from those “pieces”
Obfuscating nodes
Making predictions regarding how human evaluators will evaluate a given node

It’s possible to have one system do all of those things, but it may be better to have different systems do different things. In cases where argument-nodes are obfuscated, the systems that construct argumentation don’t need to be told that argument-nodes will be obfuscated.

Iterating alignment-capabilities step by step

Me: One of the earlier “strategies” when dealing with a superhuman AGI-system could be something like this:

Instruct the AI to try making new AGI-systems based on different alignment-methodologies (some methodologies we suggest ourselves, and others where we are less specific about methodology).
Inform the AI that when doing #1, it needs to show carefully step-by-step why those solutions are likely to do what we want them to do.
Have a separate system obfuscating argument-nodes.
Don’t actually look at the arguments, and don’t look at any of the code that is produced (although specialized computer-programs can check whether the code that is produced has various properties).
Run the AGI-systems that were produced (on separate computers).
As new requests are made, give them to both the original AGI-system and the new AGI-systems.
See if the different AGI-systems, which are based on different architectures and alignment-methodologies, produce converging output (or if they give answers that contradict each other).

Another strategy might go as follows:

Request an AGI-system to output source code for an “aligned AGI” (or some other similar request, akin to e.g. “an AGI that answers requests as honestly as it can, and does not try to hack itself out or manipulate operators”).
Along with the request above, the AGI is also to produce an argument-network, which argues that the output provided matches what was requested insofar as feasible. The better this argument-network scores, the better.
Don’t look at the source code that is outputted, and don’t look at the content of the argument-network either.
Run the new AGI-system (on separate and closed-off hardware), and use it as one AGI-system in a “council” of AGI-systems.
Don’t give any of the AGI-systems in the “council” any opportunity to communicate with humans, except for ways that are very restricted and low-bandwidth.
See if the AGI-systems in the “council” produce converging output, or if they give answers that are mutually contradictory.

You might think:

“But shouldn’t we look at the code, and look at the arguments provided by the AGI?”

And my answer would be:

“Yes, we should, at some point. But before we start doing that we should do less risky things that we can do to make the system safer.”

If we have a system that FOOMs, this could be an intermediate step on the path towards a more assuredly aligned AGI-system.

In fact, all of the strategies I’ve outlined in this post would be intermediate steps towards a system that’s more robustly aligned.

For example: So far I’ve talked about predicting human responses. But it may be advantageous to predict the responses of idealized humans. Hypothetical humans who…

have more short term memory
exhibit fewer cognitive biases
are less prone to optical illusions
are more cognitively reflective
etc, etc

But my thinking is that we could start out with predicting the responses of normal humans, and then later expand the system to also predict the responses of humans that are idealized in terms of intelligence and rationality.

And at some even later step, maybe we also are interested in predicting the responses of humans that are “idealized” in a moral sense (emotional empathy, cognitive empathy, impartiality, and so on).

For example, imagine if we could access the experiences of other beings as “memories”. Not just humans, but also animals, including wild animals. How might that make us think about various moral issues?

CEV would involve predicting what we as humans would think if we “knew more, thought faster, were more the people we wished we were, had grown up farther together”. And I think there is a lot of merit to that kind of thinking.

But for now I think it’s best if I focus on earlier steps. Should things turn out in such a way that those steps are used and turn out to be successful…

Imaginary friend: If.

Me: …then maybe they could be part of a pathway towards bigger and better things.

Addendum: Bayes-games

One way to obtain accurate predictions could be through the use of Bayes-games.

In these games, AIs would compete against each other in a zero-sum game, competing to make accurate experimental predictions. These predictions that AIs make could be seen as “bets”, and when making such “bets” they would express their level of confidence.

More confident bets would lead to more points being earned when they are right, but also more points being lost when they are wrong. Points would be awarded in such a way that confidence is rewarded but overconfidence is heavily disincentivized.
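One way the point mechanics could be sketched is with a logarithmic scoring rule and a zero-sum transfer between two AIs. Both of those choices are assumptions made for illustration, and `log_points` and `settle_bet` are hypothetical names.

```python
import math

def log_points(prob_yes, outcome_yes):
    """Logarithmic score, shifted so that a 50/50 guess earns 0 points."""
    if not 0.0 < prob_yes < 1.0:
        raise ValueError("probability must be strictly between 0 and 1")
    assigned = prob_yes if outcome_yes else 1.0 - prob_yes
    return math.log2(assigned) + 1.0

def settle_bet(prob_yes_a, prob_yes_b, outcome_yes):
    """Zero-sum settlement: whatever AI A gains, AI B loses."""
    a = log_points(prob_yes_a, outcome_yes)
    b = log_points(prob_yes_b, outcome_yes)
    return a - b, b - a

# AI A is 90% confident the reviewer will answer "yes", AI B is 60% confident.
gain_a, gain_b = settle_bet(0.9, 0.6, outcome_yes=True)
print(gain_a, gain_b)
```

With this rule, a 99% bet earns only slightly more than a 90% bet when right (about 0.99 versus about 0.85 points), but loses far more when wrong (about 5.6 versus about 2.3 points), so confidence is rewarded while overconfidence is costly.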

There is a variety of types of experiments that could be covered by a Bayes-game. But there is one type of experiment that would be of particular interest: Namely asking questions to humans, and seeing what the humans answer.

For example, the humans could be presented with the content of argument-nodes, and asked questions about the content they’re shown. And the AIs in a Bayes-game could consider this as an experiment, and make bets on what humans will answer (based on info they are given about the humans in question).

There would be three possible “modes” for a Bayes-game:

Specific predictions: This is the most simple and straightforward type of Bayes-game. The AIs would make specific experimental predictions about specific experiments, and then they would make “bets” on these predictions where they describe their level of confidence. And each AI could point out bets where they think the other AI has made a prediction that’s wrong or over-confident.
Prediction-functions: Here, instead of making specific predictions on specific experiments, they would output prediction-functions. Each prediction-function would be specialized towards making predictions for experiments that are within some “space” of possible experiments. The AIs would make bets on how well prediction-functions are likely to perform, and the AIs could try to pick out experiments where they think the prediction-functions made by the other AI are likely to lose points.
Inferred predictions: Here the AIs would assert statements in a format that allows for computational inference. And the AIs could then infer predictions from these statements. There would be a focus on predicting patterns and correlations between observation-data and the data-structures that represent experiment specifications and experiment results. The AIs could challenge each other to make additional predictions about such correlations.

Imaginary friend: What would be the point of Bayes-games? Why not just a more straightforward approach, where AIs try to make predictions, we test some of those predictions, and select for AI-systems that make good predictions?

Me: The way I think of it, there are several reasons:

Having it be incentivized to point out bad predictions: When one AI makes a “bet”, the other AI would have the opportunity of making “counter-bets”, or to otherwise indicate which bets made by the other AI it thinks are most likely to result in the other AI losing points. If the bad predictions are like a needle in a haystack, then maybe this can make it easier to find that needle.
Ascertain quality of predictions, also outside of the “space” where we are willing to test: In the “space” of conceivable experiments for which Bayes-games make predictions, there may be some “sections” that are available to be tested, and other “sections” that we are unwilling or unable to test. For example, when testing predictions about what human reviewers will answer, then maybe we want to set limits in regards to the AI-generated content that is seen by human operators.

Imaginary friend: #2 doesn’t seem so easy. Couldn’t the AIs guess which experiments we are unlikely to carry out, and then not feel a need to make accurate predictions for those experiments?

Me: I think they might be able to guess that, and I don’t rule out them being able to guess it with high confidence. The hope would be that even if this is the case, there are mechanisms that could help us ascertain how much trust we should put into predictions about experiments that are outside of the “space” of experiments we are willing to carry out:

As you may remember from our discussion of argument-networks, there was a discussion of wiggle room, and different “levels” of wiggle room (the wiggle-room could be relative to a score-function, or a score-function-builder, or a score-function-builder-builder).

In Bayes-games there could be similar mechanisms:

We could see how high score AIs are able to get while making contradictory predictions (if there are AIs that are able to get a high score while predicting “x”, and there also are AIs that are able to get a high score while predicting “not x”, then this shows that there is wiggle room)
If we are unsure of what settings to use (for assigning points to the AIs, and so on), we could have a “settings-builder” for choosing settings (from the “space” of possible settings we allow). And we could challenge AIs to find settings that minimize wiggle room (while maintaining sufficient reach).
We could explore how much wiggle room there is in relation to any “builder” that is used for choosing settings. First, we give AIs the task “Use the settings-builder to make settings that result in low wiggle room, and that make it possible to score high while predicting x”. And then we give AIs the task “Use the settings-builder to make settings that result in low wiggle room, and that make it possible to score high while predicting not x”.
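A minimal sketch of how wiggle room for a single claim “x” might be measured, assuming each AI's attempt is recorded as a pair of (whether it predicted x, the score it got) and that scores lie between 0 and 1. The names `best_score` and `wiggle_room` are hypothetical.

```python
def best_score(results, predicted_x):
    """Highest score achieved by any AI that took the given side."""
    scores = [score for side, score in results if side == predicted_x]
    return max(scores, default=0.0)  # no attempts on a side counts as score 0

def wiggle_room(results):
    """High only if BOTH "x" and "not x" can be predicted with a high score."""
    return min(best_score(results, True), best_score(results, False))

# (predicted_x, score) pairs from different AIs under the same settings
attempts = [(True, 0.95), (False, 0.30), (True, 0.80)]
print(wiggle_room(attempts))  # 0.3: "not x" could not be made to score high
```

Settings where this number stays low for many claims would be the ones a “settings-builder” is challenged to find, while still allowing at least one side to score high (the “reach” mentioned above).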

It would also be possible to have Bayes-games, or games that are reminiscent of Bayes-games, where what the AIs bet on is the result of other Bayes-games. So they can make bets on how good a score an AI would get if playing a Bayes-game with a certain specific type of experiment under certain kinds of restrictions. Or they could propose solutions/modifications for the Bayes-game mechanics (from some “space” of solutions/modifications that are allowed), in an attempt to optimize for certain parameters (wiggle room and so on), and make bets on whether the other AI is able to find something better.

Something that may make things easier is that there are domains where we actually know the answers, and are able to test relatively safely (e.g. with simulated physics that have stochastic elements and where the source code is unknown to the AIs - or maybe physical experiments where any experiment that can be proposed is very innocuous). We could keep the “space” of experiments that the AIs are allowed to propose artificially small, and see if those Bayes-games are misleading in regards to what to expect outside of that “space”.

MRI data is one possible form of observation-data, and it could be useful for Bayes-games that predict human answers to questions.

It could potentially be helpful to have pre-done experimental data that the AIs could be tested on. This could include experiments where humans answer questions, with video footage and MRI recordings as observation-data.

I have a few comments where I elaborate on certain topics:

Here is a comment where I discuss how AI safety by debate differs from the techniques discussed in this post
Here is a comment where I briefly touch upon the topics of computational proofs and cluster-concepts (something I hope to write more about sometime in the future)
And a couple of more comments can be found here and here.

If you found this post interesting, you may also take an interest in other posts from this series.

I appreciate any questions or feedback. I'm also open to video-conversations with more or less anyone who has read this post and wants to talk (feel free to reach out by PM). And if anyone is interested in maybe working further on anything described in this post, then I’d be happy to keep in touch over time.

Thanks to Evan R. Murphy for helping me review early versions of this text!