In which I propose a closed-form solution to low impact, increasing corrigibility and seemingly taking major steps to neutralize basic AI drives 1 (self-improvement), 5 (self-protectiveness), and 6 (acquisition of resources).
To be used inside an advanced agent, an impact measure... must capture so much variance that there is no clever strategy whereby an advanced agent can produce some special type of variance that evades the measure.
If we have a safe impa... (Read more)
I think Paul Christiano’s research agenda for the alignment of superintelligent AGIs presents one of the most exciting and promising approaches to AI safety. After being very confused about Paul’s agenda, chatting with others about similar confusions, and clarifying with Paul many times over, I’ve decided to write a FAQ addressing common confusions around his agenda.
This FAQ is not intended to provide an introduction to Paul’s agenda, nor is it intended to provide an airtight defense. This FAQ only aims to clarify commonly misunderstood aspects of the agenda. Unless otherwise stated, all view... (Read more)
The following is a basically unedited summary I wrote up on March 16 of my take on Paul Christiano’s AGI alignment approach (described in “ALBA” and “Iterated Distillation and Amplification”). Where Paul had comments and replies, I’ve included them below.
I see a lot of free variables with respect to what exactly Paul might have in mind. I've sometimes tried presenting Paul with my objections and then he replies in a way that locally answers some of my question but I think would make other difficulties worse. My global objection is thus something like, "I don't see any concrete ... (Read more),,,,,
One of the most pleasing things about probability and expected utility theory is that there are many coherence arguments that suggest that these are the “correct” ways to reason. If you deviate from what the theory prescribes, then you must be executing a dominated strategy. There must be some other strategy that never does any worse than your strategy, but does strictly better than your strategy with certainty in at least one situation. There’s a good explanation of these arguments here.
We shouldn’t expect mere humans to be able to notice any failures of coherence in a superintelligent agent,... (Read more)
Crossposted from my blog.
One thing I worry about sometimes is people writing code with optimisers in it, without realising that that's what they were doing. An example of this: suppose you were doing deep reinforcement learning, doing optimisation to select a controller (that is, a neural network that takes a percept and returns an action) that generated high reward in some environment. Alas, unknown to you, this controller actually did optimisation itself to select actions that score well according to some metric that so far has been closely related to your reward function. In such a scenario... (Read more)
Note: weird stuff, very informal.
Suppose I search for an algorithm that has made good predictions in the past, and use that algorithm to make predictions in the future.
I may get a "daemon," a consequentialist who happens to be motivated to make good predictions (perhaps because it has realized that only good predictors survive). Under different conditions, the daemon may no longer be motivated to predict well, and may instead make "predictions" that help it achieve its goals at my expense.
I don't know whether this is a real problem or not. But from a theoretical perspe... (Read more)