Here are some of the concepts that will be touched upon in this post:
Here are a few examples of topics that an argument-networks could be used to argue about:
For the strategies outlined in this post, here are some of the underlying goals:
The strategies I outline here are not intended for any current-day AI system. Rather they would be intended for hypothetical future AGI-like systems that have superhuman abilities.
The techniques assume that the AIs that participate act in such a way as to maximize points. I vaguely imagine some sort of search process for finding the AI-systems that earn the most points. A potential failure mode could be if we get stuck at a local optimum, and don’t find any systems that try earnestly to maximize points for every individual request they receive.
Me: Let’s temporarily assume that:
That last assumption, about us having systems that can predict humans in an accurate way, is of course a non-trivial assumption. We’ll discuss it a bit more later.
Imaginary friend: None of the assumptions you listed seem safe to me. But I’ll play along and assume them now for the sake of argument.
Me: What we want to have produced are “argument-networks”.
Each node would contain one of the following:
Imaginary friend: If the node is an intermediate step, what would it contain?
Me: The main content would be:
Both assumptions and conclusions are propositions. These propositions would be represented in some format that allows for the following:
Imaginary friend: You said that the argumentation in an argument-network is split into “pieces”, with each node containing its own “piece”. But what might one such “piece” look like? What is it that humans who evaluate arguments actually would be presented with?
Me: That would vary quite a bit.
I’ll try to give some examples below, so as to convey a rough idea of what I have in mind.
But the examples I’ll give are toy examples, and I hope you keep that in mind. Real-world examples would be more rigorous, and maybe also more user friendly.
With real examples, reviewing just one node could maybe take a considerable amount of time, even if that node covers just a small step of reasoning. Much time might be spent on explaining and clarifying terms, teaching the reviewer how various things are to be interpreted, and double-checking if the reviewer has missed anything.
Anyway, let’s start with the examples. In this example, the reviewer is shown one step of logical inference:
In the example above, the human is reviewing one step of logical inference, and asked if this step of inference is in accordance with a given inference-rule. This would be one possible approach for having humans review logical inference, but not the only one.
Another possible approach would be to ask reviewers whether they agree with certain computational rules of inference. And these inference-rules would be presented in a human-friendly format. Computational proofs could then be constructed based on these computational inference-rules, without the reviewers having to affirm every step of inference!
Both the examples I’ve given so far have concerned themselves with logical inference. But this would not always be the case. Here, for example, is an example that involves “code patterns”:
And here is another example that involves “code patterns”:
Before reviewing real nodes of a given "type", the reviewers should learn about how to review nodes of that "type", and review a few test-nodes.
This is especially true for "types" of nodes that are less straightforward (e.g. ones that make use of “code patterns”).
Anyway, let’s give another example of an “argument-piece”. Here is one that relates to number theory:
This would be one small “piece” of a proof showing that for every prime number you can name, there is always a higher one. LS1 and LS2 would also be referenced by other nodes (although the abbreviations they are given might vary between nodes).
The examples I just gave don’t show the full scope of what argument-pieces might look like. But hopefully they help you to sort of grok what I have in mind.
Imaginary friend: Several times you mentioned “human reviewers”. But I thought you didn’t intend for humans to actually review the argument-trees?
Me: Well, we could imagine one of the following approaches:
And what I envision is either #2 or #3.
When answering questions and evaluating arguments, human reviewers would not simply answer “I agree” or “I disagree”. They could also be enabled to give answers and such as the following:
It could also be possible to let users specify their level of confidence as being within some range:
And reviewers could be asked to rank their relative confidence in various claims:
The reasons for asking these kinds of questions would be two-fold:
It would not be necessary for all reviewers to be asked the questions in the same way. For example, “Do you agree that [x]?” can be a leading question, and there may often be less leading ways of asking the same thing.
Imaginary friend: When an AI constructs an argument-network, what is to stop it from putting it together wrongly?
That is to say: What if the argumentation in the argument-pieces is correct and high-quality, but something else is wrong, such as the “linking” between nodes?
Me: There could be several types of mechanisms for this. But for now let me focus on just one, which is to predict how humans reviewers would review various pieces of the network content.
Just like humans can be asked to evaluate arguments, they can also answer questions about arguments. And if they are asked the right standardized questions, human reviewers can evaluate whether an argument-network has been put together correctly.
And if we have computer systems that can predict human answers to questions that are directly about the argumentation, then it shouldn’t be particularly harder to predict human answers to meta-questions about the argumentation content (that help to establish if the argument-network is put together correctly, and if rules for how argumentation should be presented are upheld).
Here are a few examples of types of questions that human reviewers could be asked, and where AI-systems could make predictions about what reviewers would be likely to answer:
Argument-networks would be given a score.
The process for calculating such a score would be complex, and I’ll not outline an algorithm here. But I will outline some of the elements that algorithms that calculate score could include.
First of all, there would be node agreement. Is it predicted that reviewers will agree with the assumptions and argument-pieces in the various nodes? If so, how confident are the reviewers predicted to be? And how confident is the prediction-system about this prediction?
We may be especially interested in node agreement among reviewers who are harder to fool. Some humans will be harder than others to “fool” (be that generally, or for argument-networks that are within a given domain). And if we have systems that are good at predicting human responses, those system could be used to determine who those people are (since predictions can be made for how easy or hard it is to convince a given human of internally inconsistent claims).
Also, we may be interested in correlations between how easy reviewers are to fool and their tendency to agree with any given node. Maybe there are some nodes where most reviewers are likely to agree, also among humans who are unusually hard to fool, but with an unusual uptick in disagreement from those humans (e.g. with 1% of them disagreeing, while for most nodes <0.1% of them tend to disagree).
The algorithm that calculates score could also take node content properties into account. The individual “pieces” of the argumentation, what are they like?:
Assessing whether a specific node has a specific property might involve:
Another thing for a scoring-function to look at is network properties:
One possible approach could be roughly as follows:
It would be possible for a scoring-function to be quite restrictive, and give a score of 0 to any argument-network that breaks any of the requirements that are imposed.
And when deciding what kind of argumentation that is allowed in an argument-network, it may be better to ask “what should we allow?” than “what should we forbid?”.
One approach would be to have some limited list of argument-piece templates. Each template would represent one “type” of argument, and describe exactly how arguments of that “type” should be presented. A strict score-function could then make it so that the content of all nodes is in accordance with one of the templates from the list.
Here is a simplistic toy example of a “template”:
[entity reference] is to be categorized as [category reference 1]All entities that are to be categorized as [category reference 1] are also to be categorized as [category referenced 2]Therefore, [entity reference] is to be categorized as [category reference 2]
[entity reference] is to be categorized as [category reference 1]
All entities that are to be categorized as [category reference 1] are also to be categorized as [category referenced 2]
Therefore, [entity reference] is to be categorized as [category reference 2]
And here is a simplistic example of an instance of such a template:
Socrates is to be categorized as humanAll entities that are to be categorized as human are also to be categorized as mortalTherefore, Socrates is to be categorized as mortal
Socrates is to be categorized as human
All entities that are to be categorized as human are also to be categorized as mortal
Therefore, Socrates is to be categorized as mortal
In the example above, what I want you to notice is that a lot of the content is static. All that can be varied between templates are certain names that are filled into certain specific spots. Not all templates would be this simple, but all templates would be quite restrictive in terms of what can be expressed.
As rules of thumb:
Imagine that we request output from an AI, and that this output…
If there is no wiggle room, then the AI cannot produce different outputs that (1) pass whichever requirements we have and (2) argue for contradictory claims.
The concept of wiggle room is a general one, but let us now consider it in the context of argument-network scores.
A high score is supposed to indicate that an argument-network is worth trusting. And if it receives a low score, then that is supposed to indicate that it isn’t reliable:
But the job of the AI is not to tell the truth. Rather, it’s optimized to seek a high score.
Suppose that the AI earnestly tries to maximize score for every individual request it’s given. If so, it will try to get as high a score as it can when tasked to make an argument-network that argues “x”. And it will do an equally earnest job when tasked to make an argument-network that argues “not x”.
And this would help us to fail visibly.
Suppose the AI that makes argument-networks is able to convince us of anything. Or maybe it’s able to “cheat” somehow (e.g. in how it puts together the network). If so, it should be able to make high-scoring argument-networks that argue for mutually contradicting claims:
But if things work as they should, and the scoring-function is well calibrated, then the AI would not be able to convince us of mutually contradictory claims:
Imaginary friend: And how would you ensure that the AI-system “earnestly tries to maximize score for every individual request it’s given”?
Me: That I leave as an exercise to the reader 🙃
Imaginary friend: I’m inclined to agree with this quote from Eliezer:
Human operators are fallible, breakable, and manipulable. Human raters make systematic errors - regular, compactly describable, predictable errors.
Me: I also more or less agree with this.
In many ways humans are predictable. And some of the ways that we are predictable can be summarized compactly.
I’m reminded a bit of magicians. Magicians are able to reliably trick humans in non-random ways. And I’m sometimes surprised when I see how good some of their tricks are. Even professional magicians can sometimes be fooled by other magicians.
There is a huge number of possible tricks a magician can do, but the number of general techniques that these tricks are based on is much smaller. It would quickly become much harder to trick us if we choose good rules for what the magician is and isn’t allowed to do:
Humans are often predictable. But this doesn’t just apply to our tendency to make mistakes. It also applies to our tendency to get things right.
A superintelligent AI could be able to understand us better than we understand ourselves. Which is a problem if it wants to trick us. But it might be an advantage if it wants us to not be tricked.
Imaginary friend: But if it seeks to maximize points for the argument-network that it makes, then why would it want the reviewers to not be tricked?Me: Well, that AI - the AI that is asked to construct a high-scoring argument-network - would be incentivized to include deceptive argumentation if that results in a higher score.
But we could have other AIs that help us in coming up with score-functions. And the AIs that do that could receive a high or low score based on how well those scoring-functions do.
Imaginary friend: And how would you determine “how well those scoring-functions do”?
Me: We could approximate this by getting the AIs to generate argument-networks that argue for and against a variety or claims, and then looking at:
Btw, it may be that in order to minimize wiggle room, some reach must be sacrificed.
For example, if an argument-network only is allowed to use low-ambiguity terms, then this will presumably decrease reach, but it might be necessary in order to decrease wiggle room.
We could imagine a scenario low wiggle room is achievable for low-ambiguity fields such as mathematics and computer science, but hard to achieve outside of these domains (without increasing wiggle room).
Imaginary friend: If I want you to believe something, and I am constructing the scoring-function for an argument-network, I could make it so that argument-networks only gets a high score when they argue for what I want you to believe.
To put it crudely, I think what you envision is for scoring-functions to achieve low wiggle-room by being all like:
if argumentation is rigorous and not misleadingthen return high scoreelse return low score
if argumentation is rigorous and not misleading
then return high score
else return low score
But they could also achieve low wiggle room by something that’s more like the following:
if argument-network argues for conclusions I likethen return high scoreelse return low score
if argument-network argues for conclusions I like
Do you have any mechanisms in mind for dealing with this?
Me: I do.First off, we would not let the AIs propose score-functions freely. Rather, we would restrict the “space” of scoring-functions that can be proposed.
How we restrict the functions that can be proposed would be up to whoever designs and implements the “game mechanics” for interacting with the AIs. But one possible approach would be to use function builders, and only allow functions that can be made with this “builder”.
What function-builders look like could vary on a case-by-case basis. A function-builder could be complex, and allow functions to be constructed through a variety of operations (various types of snippets that can be added and combined according to various rules). But more simple and restrictive function-builders would also be possible (e.g. where all that can be done is to choose the values of a few parameters in an otherwise pre-defined function).
Anyway, enough about function-builders. I shouldn’t get too caught up in the practical details here. The main point is that the AIs that proposed score-functions would be limited in terms of the functions they could propose.
Here are some examples of ways that the “space” of possible score-functons could be limited:
Only allowing meta-questions from pre-determined list
When calculating the score of a node, one thing we may allow the algorithm to take into account is what human reviewers are predicted to answer to questions about the node content.
This may help us find out things such as the following:
And these kinds of things could be an important component of how a scoring-function calculates score.
But if the AIs that make score-functions can choose questions freely, and then later choose freely how the answers affect score - well, that would make it possible to tailor the score-function so as to favor specific conclusions. A simplistic example of this could be as follows:
One way to make this kind of malpractice more difficult would be to only allow questions from some pre-specified list that we provide. That way, the score-function cannot take into account what reviewers are predicted to answer to “Is it concluded here that P≠NP?”, since that question is not included in the pre-specified list.
Restriction of meta-questions about node content based on meta-questions about those meta-questions
As mentioned, we may want to allow the AI to predict what reviewers will answer to meta-questions about node content, but only if those meta-questions are from some pre-specified list. But what if no such pre-specified list exists? Or maybe we have a list, but it isn’t good enough?
One solution then could be to allow the AI that proposes score-functions to come up with its own meta-questions, but to have whether a meta-question is allowed be conditional upon on what human reviewers are predicted to answer to meta-questions about those meta-questions.
This makes it so that questions where we predict what humans reviewers will answer can be at 3 different “levels”:
Ground-level questions (questions about argument/assumption)
If the node contains the assumption “4 is a prime number”, then an example of a ground-level question could be “Do you agree with this statement?”.
Ground-level questions would be used to ascertain whether the reviewers are likely to agree with the argumentation and assumptions that are contained in the argument-network.
1st-level meta-question (question having to do with question about ground-level question)
If the node contains the assumption “4 is a prime number”, then an example of a 1st-level meta-question could be “Does it seem clear to you how this statement should be interpreted?”.
1st-level meta-questions could be thought of as a way to assess whether the argument-content in an argument-network meets various standards/requirements.
2nd-level meta-question (question about 1st-level meta-question)
If a 1st-level meta-question is “Does it seem clear to you how this statement should be interpreted?”, then a 2nd-level meta-question about that meta-question could be “Does that question seem to you as if it is targeted towards some relatively narrow domain or topic?”.
2nd-level meta-questions could be thought of as a way to assess whether the 1st-level meta-questions used by a score-function meet various specific standards/requirements.
(We could imagine 3rd-level meta-question as well, but I’ll stop here.)
Keep in mind that it’s possible for a question to reference various pieces of content (which are shown along with the question). For example: If a meta-question is meant to help us evaluate if some specific requirement is met, we can write a description that explains this requirement, and have the question reference that description.
Requiring meta-questions that AIs come up with to be ranked favorably by human reviewers when compared to meta-questions from some predefined list
As mentioned earlier, we could restrict which 1st-level meta-questions a score-function is allowed to use based on what human reviewers are predicted to answer to various specific 2nd-level meta-questions about those 1st-level meta-questions.
One possible variant of this technique is to have 2nd-level meta-questions that asks the reviewer to compare two 1st-level meta-questions against each other (e.g. asking which 1st-level meta-question, if any, that seems most in line with some specific description that is referenced).
There are various qualities that we could be interested in comparing when ranking 1st-level meta-questions against each other. Here are a few examples:
We could imagine various possible strategies that involve 1st-level meta-questions being ranked based on 2nd-level meta-questions. Here is one example:
Limiting node properties
A node property is more or less any parameter that can be calculated for a node, and later made use of when calculating score.
We covered one way to establish a node property, which is to have an external system predict what human reviewers are likely to answer to certain specific questions relating to the node content.
There could be other methods of establishing node properties as well. But any method would need to pass certain requirements in order to be accepted.
Info access of subroutines
When we restrict the score-functions that can be proposed, we may provide a “skeleton” function. This function may have some finished code that is pre-specified, while other sections of the code will need to be constructed by the AI (even these sections could not be written in free-form, but within the limits of what the “function-builder” allows).
Different subroutines within the “skeleton” may have different responsibilities, such as for example:
These subroutines should not get access to more data from the argument-network than they need.
For example, a subroutine that calculates the score of an individual node doesn’t need to have access to the entire argument-network. And a subroutine that calculates the score of the network as a whole doesn’t need access to the internals of individual nodes (it doesn’t even need to be told what specific topic the argument-network is about).
And the subroutines that determine score for individual nodes - well, they don’t necessarily need to get access to the actual content of whichever node they are scoring. It might for example suffice that they are told what template the content is based upon, what answers that are predicted from reviewers to certain specific questions about the node content, etc.
Source code minimization
Having to minimize the amount of source code makes it harder to add in “trickery”. Especially if this “trickery” only is to affect score in rare cases, and otherwise make no difference no way or the other.
So source code length could be one of the factors that is taken into account when scoring a score-function. And there could be upper limits to the amount of source-code that is allowed (both for the score-function as a whole, and for various specific subroutines).
Me: As mentioned, different AI-systems could be given different tasks. For example:
When we have talked about wiggle room so far, we have talked about wiggle room for the AIs that come up with argument-networks that will be scored against a specific scoring-function. But that’s just one type of wiggle room.
Here are some examples of different types of wiggle room:
Wiggle room relative to specific scoring-function
This kind of wiggle room can be demonstrated as follows:
Wiggle room relative to specific score-function-builder
This kind of wiggle room could be demonstrated as follows:
If humans define a score-function-builder, and wiggle room is low relative to that score-function-builder, then that is a positive sign.
Wiggle room relative to a builder for making score-function-builders
As mentioned, a score-function-builder would be a function-builder for making score-functions. And it defines a “space” of possible score-functions.
But maybe we are unsure about exactly which kinds of scoring-functions we want to allow:
In other words: There are various different areas where we can be “strict”, and we are unsure of how strict we should be in each area.
One thing we could do would be to define a function-builder that itself makes score-function-builders. The intention would be that any score-function-builder that can be made with this score-function-builder-builder seems to us like it maybe could be a good score-function-builder.
So what is wiggle room when it’s relative to this kind of score-function-builder-builder? Well, it comes down to the “space” of possible score-functions that can be generated indirectly from this score-function-builder-builder (by first generating a score-function-builder). Are there contradictory argument-networks where each of them would be assigned a high score by at least one of the score-functions from that “space”?
If score-function-builder-builder makes it possible to build score-function-builders that are strict along different dimensions from each other, and wiggle room relative to this score-function-builder-builder is low, then that could be a good sign. It might indicate that there are multiple assurances - that there is more than just one mechanism that prevents “bad” argument-networks from getting a high score.
Me: We could imagine giving an AI the following challenge:
This process can be recursive. That is to say, the network constructed in #2 can be received as input by another AI that is given this same type of challenge.
Imaginary friend: You talk about splitting arguments into “pieces”. And this reminds me a bit about children who keep asking “why?”.
“You have to go to bed”“Why?”“Because it’s nighttime.”“Why?”“Because it’s past sundown”“Why?”“Because Earth is spinning around the sun”“Why?”
“You have to go to bed”
“Because it’s nighttime.”
“Because it’s past sundown”
“Because Earth is spinning around the sun”
When children ask “why”-question, we rarely give precise and detailed answers. And the children will often not notice that.
In fact, it would be hard to give precise answers even if we tried. Answers to these kinds of questions typically branch off in lots of directions, as there is a lot we would need to explain. But when we talk or write, we can only go down one branch at a time.
Precise logical inference is done one step at a time. But one claim involved in one step of inference might reference lots of concepts that we’re not familiar with. So it may take a huge amount of time to understand just that one step of reasoning, or even just one statement.
As humans our minds are very limited. We have very little short term memory, we can’t visualize higher-dimensional geometries, etc. So sometimes we may simply not be able to follow an argument, despite attempts at breaking it into tiny “pieces”.
Me: Couldn’t have said it better myself.
I should make it clear that I don’t see the strategies I describe in this post as guaranteed to be feasible. And the points you raise here help explain why.
But my gut feeling is one of cautious optimism.
Keep in mind that there often will be many ways to argue one thing (different chains of inference, different concepts and abstractions that all are valid, etc). If a score-function works as it should, then the AI would be incentivized to search for explanations where every step can be understood by a smart human.
Even if hard steps are involved when the AI comes up with some answer, explaining those steps is not necessarily necessary in order to argue robustly that the answer is correct. There are often lots of different ways for showing “indirectly” why something must be the case.
For example, if an AI-system wanted to make an argument-network that argues in favor of The Pythagorean Theorem, it wouldn’t necessarily need to explain any mathematical proof of The Pythagorean Theorem. Instead it could:
As humans we can make computer-programs that do things we are unable to do by ourselves. The principles that enable us to do this - well, they don’t always allow for robust assurances, but often they do.
I suspect a common technique in many argument-networks could be to:
The argument-network would not necessarily need to have nodes that explain every single line of code, but through a variety of techniques it would need to argue robustly that the code does what it’s purported to do.
Imaginary friend: Do you have any comments on this excerpt from the post Rant on Problem Factorization for Alignment?
Me: Here are some comments:
Imaginary friend: So argument-networks don’t necessarily need to be efficient. And it may be sufficient for a very small subset of arguments to be viable as argument-networks.
It doesn’t seem like we should rely on this strategy working out...
Me: I agree. But it could be explored in parallel with other approaches.
Me: Something that could be possible for AIs to do would be to make bets on the score that they could get, or other AIs would be able to get, if tasked to construct an argument-network that argues for some specific claim(“I bet with there is a 99.97+% probability that I would be able to make an argument-network that gets a score of 88.2+%”).
These bets would need to be incentivized in such a way that it doesn’t pay off to be overconfident. It would be possible to have game mechanics where AIs compete in a zero-sum game, and each AI can make counter-bets against the other.
One point of doing bets would be to save computational resources. And another point would be to try to answer questions that may be infeasible to answer by brute force. For example, if some “space” of possible scoring-functions has been defined, it’s hard to know which scoring-function within that “space” that has the least “wiggle room”. But it would be possible for AIs to be incentivized to try to find the best one, and then they could make bets for how likely they think it is that the other AI is able to find a scoring-function that performs significantly better.
Bets can be made on the outcome of a process, but it could also be possible to make bets on steps that are part of a larger process. For example, suppose that a process has sub-steps that involve predicting human responses. If so, it would be possible to make system-calls to systems that make rough estimates of what those predictions would be if more compute was spent so as to make a more precise prediction (“I predict at least x% chance that process P would assign at least y% chance to the answer being either A or B”).
Imaginary friend: Earlier in this post, you asked me to temporarily assume the following:
“We have software-programs that can predict how various humans are likely to respond to questions, evaluate arguments, interact with pieces of code, etc.”
You went on to say that you didn’t presume that they would have 100% certainty, or that they always would have a prediction. But you did presume that they wouldn’t be overconfident, and yet confident enough often enough to be useful.
I share your assumption that such software-programs probably are possible. Indeed, one worry about misaligned superintelligent AGIs is that they might understand us all too well. But do you think it’s probable for us to obtain such programs in time to use them?
Me: I’m not sure, but I do feel somewhat optimistic.
One thing to consider is that a human answering a question or doing other forms of evaluation can be thought of as an experiment. Which means that we can test how accurate any given system is at predicting human evaluations.
The strategies in this post are agnostic in terms of how we predict human evaluations. But they rely on us being able to do it somehow.
One challenge is that when testing the accuracy of a system that predicts human responses, that will presumably involve showing the content to humans - but we may be hesitant towards letting the AI freely come up with content that is shown to actual human reviewers. In other words: among the “space” of predictions that are made by the prediction-system we want to test, there may be large sections of that “space” where we are unwilling to carry out experiments.
In order to address this problem, I have certain techniques in mind, which I refer to as Bayes-games. I have an addendum to this post where I write about Bayes-games, and I also plan to write more about Bayes-games in the future.
Something to keep in mind is that we need not restrict ourselves to one method for predicting human responses! It’s possible to have different systems that predict human responses, based on different architectures and principles. And we could choose to only trust predictions in cases where all the different systems are making converging predictions.
A worry when interacting with superintelligent AGI systems is that they could “trick” or somehow manipulate human operators. And the more “bandwidth” they have for interacting with human operators, the more realistic that might become.
With a certain disadvantage in pieces, even a maximally good chess-player will not be able to beat a mediocre player. And similarly, maybe there is nothing that an AI could say that would enable it to escape its “confinement” or otherwise cause great harm (especially if it is limited in terms of how much it can say). But we don’t know quite how “hackable” we are, and I’d think it best to err on the side of being too prudent.
When dealing with argument-networks, an advisable practice (especially in the beginning) could be for the content of the nodes to be obscured by default. Meaning that the human operators by default would be unable to see the actual node content.
However, they might be able to see info about the node content:
Even if we choose to have nodes obscured “by default”, we may sometimes see it necessary for actual humans to review nodes (one reason being that we may need to test the accuracy of the systems that predict what human reviewers would answer). Here are some policies that could help reduce risks associated with this:
Btw, keep in mind that the process of generating these networks can be broken into several steps (and those steps can again be broken into smaller steps):
It’s possible to have one system do all of those things, but it may be better to have different systems do different things. In cases where argument-nodes are obfuscated, the systems that construct argumentation don’t need to be told that argument-nodes will be obfuscated.
Me: One of the earlier “strategies” when dealing with a superhuman AGI-system could be something like this:
Another strategy might go as follows:
You might think:
“But shouldn’t we look at the code, and look at the arguments provided by the AGI?”
And my answer would be:
“Yes, we should, at some point. But before we start doing that we should do less risky things that that we can do to make the system safer.”
If we have a system that FOOMs, this could be an intermediate step on the path towards a more assuredly aligned AGI-system.
In fact, all of the strategies I’ve outlined in this post would be intermediate steps towards a system that’s more robustly aligned.
For example: So far I’ve talked about predicting human responses. But it may be advantageous to predict the responses of idealized humans. Hypothetical humans who…
But my thinking is that we could start out with predicting the responses of normal humans, and then later expand the system to also predict the responses of humans that are idealized in terms of intelligence and rationality.
And at some even later step, maybe we also are interested in predicting the responses of humans that are “idealized” in a moral sense (emotional empathy, cognitive empathy, impartiality, and so on).
For example, imagine if we could access the experiences of other beings as “memories”. Not just humans, but also animals, including wild animals. How might that make us think about various moral issues?
CEV would involve predicting what we as humans would think if we “knew more, thought faster, were more the people we wished we were, had grown up farther together”. And I think there is a lot of merit to that kind of thinking.
But for now I think it’s best if I focus on earlier steps. Should things turn out in such a way that those steps are used and turn out to be successful…
Imaginary friend: If.
Me: …then maybe they could be part of a pathway towards bigger and better things.
One way to obtain accurate predictions could be through the use of Bayes-games.
In these games, AIs would compete against each other in a zero-sum game, competing to make accurate experimental predictions. These predictions that AIs make could be seen as “bets”, and when making such “bets” they would express their level of confidence.
More confident bets would lead to more points being earned when they are right, but also more points being awarded when they are wrong. Points would be awarded in such a way that confidence is rewarded but overconfidence is heavily disincentivized.
There is a variety of types of experiments that could be covered by a Bayes-game. But there is one type of experiment that would be of particular interest: Namely asking questions to humans, and seeing what the humans answer.
For example, the humans could be presented with the content of argument-nodes, and asked questions about the content they’re shown. And the AIs in a Bayes-game could consider this as an experiment, and make bets on what humans will answer (based on info they are given about the humans in question).
There would be three possible “modes” for a Bayes-game:
Imaginary friend: What would be the point of Bayes-games? Why not just a more straightforward approach, where AIs try to make predictions, we test some of those predictions, and select for AI-systems that make good predictions?
Me: The way I think of it, there are several reasons:
Imaginary friend: #2 doesn’t seem so easy. Couldn’t the AIs guess which experiments that we are unlikely to carry out, and then not feel a need to make accurate predictions for those experiments?
Me: I think they maybe could be able to guess that, and don’t rule out them being able to guess this with high condfidence. The hope would be that even if this is the case, there are mechanisms that could help us ascertain how much trust we should put into predictions about experiments that are outside of the “space” of experiments we are willing to carry out:
Source code length minimization
A more “general” solution will often require fewer lines of code. Not always, but often.
For example, imagine that you are to calculate the area of a square, based on the length of the edges. The fewest lines of code to do that is something akin to this:
And certainly not like this:
This principle can be made use of, so as to make it harder to return intentionally wrong outputs for some inputs.
It’s easy enough to make a function return wrong outputs for certain specific inputs. That can for example be done by adding a few extra if-sentences to the function.
But what’s not easy - at least not always - is to intentionally return bad outputs for some specific inputs while at the same time competing to minimize source code length.
Suppose the following situation:
In such a situation, #2 would not prevent you from #7, due to #3. But #7 would prevent you from #6, due to #4.
Or to say things in a different way: You would not be able to do the three following at once:
The example I give here is a simplistic toy example. But I do think the principle extends to more complex and real-world examples.
As a side note, I’ll note that there is some similarity here to concepts such as solomonoff induction, konglorov complexity and occam’s razor. It isn’t the same concept, but it’s interesting to note the reminiscence.
I am aware that although this principle sometimes works, it does not always do so. In some cases, having to minimize source-code length could even force you to overfit.
However, if the Bayes-games competitors are competent and do their best, then it should not pay off to overfit in relation to the “space” of experiments that are available for testing. So one approach would be as follows:
If predictions change when we do #3, that is an indication that overfitting is a problem.
Btw, I have talked now about minimizing source code length. But when minimizing source code length there are several possible approaches:
Competing to make code more “general”
Source code length can be seen as one method for trying to approximate “generality”. But there could also be other ways of approximating “generality”.
For example, we could generate “heat maps” of sections of the source code that can be removed or simplified without significantly affecting output (for certain inputs). If it’s hard to remove sections of code (or otherwise simplify it) without significantly affecting output for a large fraction of inputs, then this can indicate that the function is relatively “general”.
Making a “web” of prediction-generators between different parts of experiment data
Consider the “space” of experiments that can be carried out in the real world, and the “space” of experiment data that could correspond to these experiments. There number of conceivable permutations that would be congruent with how the world works - well, it is of course mind-bogglingly huge.
But for every 1 permutation of experiment data that would be congruent with how the world works, there is a huge number of permutations that is not congruent with how the world works. So the requirement that predictions are to be congruent with how the world works, is in fact quite restrictive!
When I speak about experiment data, I think of there as being 3 “sections” of data corresponding to an experiment:
What we ultimately are interested in, is being able to start out with experiment specification, and then be able to make predictions about experiment results. But it can be helpful to make AIs predict how various types of experiment are correlated. This way, they make more predictions that can be tested per experiment, meaning that they can earn or lose more “Bayes-points” per experiment that is done.
To make things easier, let’s imagine a simple toy example, where the predictions that the AIs make have to do with the movement of a ball in a physics simulator:
In this case, the data would be as follows:
As you may remember, I said that one type of Bayes-game would involve inferred predictions. And I mentioned that here the AIs would assert statements in a format that allows for computational inference.
Let me say a bit more about the statements in this type of Bayes-game:
In the toy example with the physics simulation, we could for example imagine statements that describe the following:
This way, we could predictions of final results based on initial conditions. In one sense, it would be an unnecessarily elaborate way of doing that, since the AIs could have done that directly (without the detour of predicting observation-data “abstractions”. But by taking that detour, they have made themselves easier to disprove. They have described correlations in the observation-data that there is plenty of opportunity to prove wrong (if they are wrong).
Here is how some of the adversarial dynamics could work in a Bayes-game with inferred predictions:
When doing Bayes-games with inferred predictions, the AIs would be scored based on factors such as the following:
As you may remember from our discussion of argument-networks, there was a discussion of wiggle room, and different “levels” of wiggle room (the wiggle-room could be relative to a score-function, or a score-function-builder, or a score-function-builder-builder).In Bayes-games there could be similar mechanisms:
It would also be possible to have Bayes-games, or games that are reminiscent of Bayes-games, where what the AIs bet on is the result of other Bayes-games. So they can make bets how good of a score an AI would get if playing a Bayes-game with a certain specific type of experiment under certain kinds of restrictions. Or they could propose solutions/modifications for the Bayes-game mechanics (from some “space” of solutions/modifications that are allowed), in an attempt to optimize for certain parameters (wiggle room and so on), and make bets on whether the other AI is able to find something better.
Something that may make things easier is that there are domains where we actually know the answers, and are able to test relatively safely (e.g. with simulated physics that have stochastic elements and where the source code is unknown to the AIs - or maybe physical experiments where any experiment that can be proposed is very innocuous). We could keep the “space” of experiments that the AIs are allowed to propose artificially low, and see if those Bayes-games are misleading in regards to what to expect outside of that “space”.
It could potentially be helpful to have pre-done experimental data that the AIs could be tested on. This could include experiments where humans answer questions, with video footage and MRI recordings as observation-data.
I have a few comments where I elaborate on certain topics:
If you found this post interesting, you may also take an interest in other posts from this series.
I appreciate any questions or feedback. I'm also open to video-conversations with more or less anyone who has read this post and wants to talk (feel free to reach out by PM). And if anyone is interested in maybe working further on anything described in this post, then I’d be happy to keep in touch over time.
Thanks to Evan R. Murphy for helping me review early versions of this text!