I think Paul Christiano’s research agenda for the alignment of superintelligent AGIs presents one of the most exciting and promising approaches to AI safety. After being very confused about Paul’s agenda, chatting with others about similar confusions, and clarifying with Paul many times over, I’ve decided to write a FAQ addressing common confusions around his agenda.
This FAQ is not intended to provide an introduction to Paul’s agenda, nor is it intended to provide an airtight defense. This FAQ only aims to clarify commonly misunderstood aspects of the agenda. Unless otherwise stated, all views are my own views of Paul’s views. (ETA: Paul does not have major disagreements with anything expressed in this FAQ. There are many small points he might have expressed differently, but he endorses...
One of the most pleasing things about probability and expected utility theory is that there are many coherence arguments that suggest that these are the “correct” ways to reason. If you deviate from what the theory prescribes, then you must be executing a dominated strategy. There must be some other strategy that never does any worse than your strategy, but does strictly better than your strategy with certainty in at least one situation. There’s a good explanation of these arguments here.
We shouldn’t expect mere humans to be able to notice any failures of coherence in a superintelligent agent, since if we could notice these failures, so could the agent. So we should expect that powerful agents appear coherent to us. (Note that it is possible that the...
I want to quickly draw attention to a concept in AI alignment: Robustness to Scale. Briefly, you want your proposal for an AI to be robust (or at least fail gracefully) to changes in its level of capabilities. I discuss three different types of robustness to scale: robustness to scaling up, robustness to scaling down, and robustness to relative scale.
The purpose of this post is to communicate, not to persuade. It may be that we want to bite the bullet of the strongest form of robustness to scale, and build an AGI that is simply not robust to scale, but if we do, we should at least realize that we are doing that.
Robustness to scaling up means that your AI system does not depend on not being...
Note: weird stuff, very informal.
Suppose I search for an algorithm that has made good predictions in the past, and use that algorithm to make predictions in the future.
I may get a "daemon," a consequentialist who happens to be motivated to make good predictions (perhaps because it has realized that only good predictors survive). Under different conditions, the daemon may no longer be motivated to predict well, and may instead make "predictions" that help it achieve its goals at my expense.
I don't know whether this is a real problem or not. But from a theoretical perspective, not knowing is already concerning--I'm trying to find a strong argument that we've solved alignment, not just something that seems to work in practice.
I am pretty convinced that daemons are a real...
When I say an AI A is aligned with an operator H, I mean:
A is trying to do what H wants it to do.
The “alignment problem” is the problem of building powerful AI systems that are aligned with their operators.
This is significantly narrower than some other definitions of the alignment problem, so it seems important to clarify what I mean.
In particular, this is the problem of getting your AI to try to do the right thing, not the problem of figuring out which thing is right. An aligned AI would try to figure out which thing is right, and like a human it may or may not succeed.
Consider a human assistant who is trying their hardest to do what H wants.
I’d say this assistant is aligned with...
Crossposted from my blog.
One thing I worry about sometimes is people writing code with optimisers in it, without realising that that's what they were doing. An example of this: suppose you were doing deep reinforcement learning, doing optimisation to select a controller (that is, a neural network that takes a percept and returns an action) that generated high reward in some environment. Alas, unknown to you, this controller actually did optimisation itself to select actions that score well according to some metric that so far has been closely related to your reward function. In such a scenario, I'd be wary about your deploying that controller, since the controller itself is doing optimisation which might steer the world into a weird and unwelcome place.
In order to avoid such...
In which I propose a closed-form solution to low impact, increasing corrigibility and seemingly taking major steps to neutralize basic AI drives 1 (self-improvement), 5 (self-protectiveness), and 6 (acquisition of resources).
To be used inside an advanced agent, an impact measure... must capture so much variance that there is no clever strategy whereby an advanced agent can produce some special type of variance that evades the measure.
If we have a safe impact measure, we may have arbitrarily-intelligent unaligned agents which do small (bad) things instead of big (bad) things.
For the abridged experience, read up to "Notation", skip to "Experimental Results", and then to "Desiderata".
One lazy Sunday afternoon, I...
Cross-posted to the EA forum.
Like last year and the year before, I’ve attempted to review the research that has been produced by various organisations working on AI safety, to help potential donors gain a better understanding of the landscape. This is a similar role to that which GiveWell performs for global health charities, and somewhat similar to an securities analyst with regards to possible investments. It appears that once again no-one else has attempted to do this, to my knowledge, so I've once again undertaken the task.
This year I have included several groups not covered in previous years, and read more widely in the literature.
My aim is basically to judge the output of each organisation in 2018 and compare it to their budget. This should...
I talk here about how a mathematician mindset can be useful for AI alignment. But first, a puzzle:
Given , what is the least number such that for , the base representation of consists entirely of 0s and 1s?
If you want to think about it yourself, stop reading.
For =2, =2.
For =3, =3.
For =4, =4.
For =5, =82,000.
Indeed, 82,000 is 10100000001010000 in binary, 11011111001 in ternary, 110001100 in base 4, and 10111000 in base 5.
What about when =6?
So, a mathematician might tell you that this is an open problem. It is not known if there is any which consists of 0s and 1s in bases 2 through 6.
A scientist, on the other hand, might just tell you that clearly no such number exists. There...
Faced with the astronomical amount of unclaimed and unused resources in our universe, one's first reaction is probably wonderment and anticipation, but a second reaction may be disappointment that our universe isn't even larger or contains even more resources (such as the ability to support 3^^^3 human lifetimes or perhaps to perform an infinite amount of computation). In a previous post I suggested that the potential amount of astronomical waste in our universe seems small enough that a total utilitarian (or the total utilitarianism part of someone’s moral uncertainty) might reason that since one should have made a deal to trade away power/resources/influence in this universe for power/resources/influence in universes with much larger amounts of available resources, it would be rational to behave as if this deal...