One of the most pleasing things about probability and expected utility theory is that there are many coherence arguments that suggest that these are the “correct” ways to reason. If you deviate from what the theory prescribes, then you must be executing a dominated strategy. There must be some other strategy that never does any worse than your strategy, but does strictly better than your strategy with certainty in at least one situation. There’s a good explanation of these arguments here.
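The notion of a dominated strategy used above can be made concrete with a small sketch. It assumes, purely for illustration, that each strategy is summarised by a list of payoffs, one per situation; the function name `dominates` and the example payoffs are hypothetical and not taken from the linked explanation.

```python
from typing import Sequence


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """Return True if strategy `a` dominates strategy `b`.

    Each sequence lists one payoff per situation, in the same order.
    `a` dominates `b` when it is never worse and strictly better at least once.
    """
    if len(a) != len(b):
        raise ValueError("payoff vectors must cover the same situations")
    never_worse = all(x >= y for x, y in zip(a, b))
    strictly_better_somewhere = any(x > y for x, y in zip(a, b))
    return never_worse and strictly_better_somewhere


# Hypothetical payoffs in three situations (illustrative numbers only).
coherent = [3.0, 2.0, 5.0]
incoherent = [3.0, 1.0, 5.0]

if __name__ == "__main__":
    print(dominates(coherent, incoherent))
    print(dominates(incoherent, coherent))
```

Under these assumed payoffs, `incoherent` is dominated: it matches `coherent` in two situations and loses in the third.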
We shouldn’t expect mere humans to be able to notice any failures of coherence in a superintelligent agent, since if we could notice these failures, so could the agent. So we should expect that powerful agents appear coherent to us. (Note that it is possible that the...
I want to quickly draw attention to a concept in AI alignment: Robustness to Scale. Briefly, you want your proposal for an AI to be robust (or at least fail gracefully) to changes in its level of capabilities. I discuss three different types of robustness to scale: robustness to scaling up, robustness to scaling down, and robustness to relative scale.
The purpose of this post is to communicate, not to persuade. It may be that we want to bite the bullet of the strongest form of robustness to scale, and build an AGI that is simply not robust to scale, but if we do, we should at least realize that we are doing that.
Robustness to scaling up means that your AI system does not depend on not being...
Rereading this post while thinking about the approximations that we make in alignment, two points jump out at me:
Note: weird stuff, very informal.
Suppose I search for an algorithm that has made good predictions in the past, and use that algorithm to make predictions in the future.
I may get a "daemon," a consequentialist who happens to be motivated to make good predictions (perhaps because it has realized that only good predictors survive). Under different conditions, the daemon may no longer be motivated to predict well, and may instead make "predictions" that help it achieve its goals at my expense.
I don't know whether this is a real problem or not. But from a theoretical perspective, not knowing is already concerning--I'm trying to find a strong argument that we've solved alignment, not just something that seems to work in practice.
I am pretty convinced that daemons are a real...
I consider the argument in this post a reasonably convincing negative answer to this question---a minimal circuit may nevertheless end up doing learning internally and thereby generate deceptive learned optimizers.
This suggests a second informal clarification of the problem (in addition to Wei Dai's comment): can the search for minimal circuits itself be responsible for generating deceptive behavior? Or is it always the case that something else was the offender and the search for minimal circuits is an innocent bystander?
If the search for minimal circuits ...
When I say an AI A is aligned with an operator H, I mean:
A is trying to do what H wants it to do.
The “alignment problem” is the problem of building powerful AI systems that are aligned with their operators.
This is significantly narrower than some other definitions of the alignment problem, so it seems important to clarify what I mean.
In particular, this is the problem of getting your AI to try to do the right thing, not the problem of figuring out which thing is right. An aligned AI would try to figure out which thing is right, and like a human it may or may not succeed.
Consider a human assistant who is trying their hardest to do what H wants.
I’d say this assistant is aligned with...
Crossposted from my blog.
One thing I worry about sometimes is people writing code with optimisers in it, without realising that that's what they were doing. An example of this: suppose you were doing deep reinforcement learning, doing optimisation to select a controller (that is, a neural network that takes a percept and returns an action) that generated high reward in some environment. Alas, unknown to you, this controller actually did optimisation itself to select actions that score well according to some metric that so far has been closely related to your reward function. In such a scenario, I'd be wary about your deploying that controller, since the controller itself is doing optimisation which might steer the world into a weird and unwelcome place.
In order to avoid such...
Okay, so another necessary condition for being downstream from an optimizer is being causally downstream. I'm sure there are other conditions, but the claim still feels like an important addition to the conversation.
In which I propose a closed-form solution to low impact, increasing corrigibility and seemingly taking major steps to neutralize basic AI drives 1 (self-improvement), 5 (self-protectiveness), and 6 (acquisition of resources).
Previously: Worrying about the Vase: Whitelisting, Overcoming Clinginess in Impact Measures, Impact Measure Desiderata
To be used inside an advanced agent, an impact measure... must capture so much variance that there is no clever strategy whereby an advanced agent can produce some special type of variance that evades the measure.
~ Safe Impact Measure
If we have a safe impact measure, we may have arbitrarily-intelligent unaligned agents which do small (bad) things instead of big (bad) things.
For the abridged experience, read up to "Notation", skip to "Experimental Results", and then to "Desiderata".
One lazy Sunday afternoon, I...
I'm curious whether these are applications I've started to gesture at in Reframing Impact
I confess that it's been a bit since I've read that sequence, and it's not obvious to me how to go from the beginnings of gestures to their referents. Basically what I mean is 'when trying to be cooperative in a group, preserve generalised ability to achieve goals', nothing more specific than that.
Cross-posted to the EA forum.
Like last year and the year before, I’ve attempted to review the research that has been produced by various organisations working on AI safety, to help potential donors gain a better understanding of the landscape. This is a similar role to that which GiveWell performs for global health charities, and somewhat similar to that of a securities analyst with regard to possible investments. To my knowledge, no-one else has attempted to do this, so I've once again undertaken the task.
This year I have included several groups not covered in previous years, and read more widely in the literature.
My aim is basically to judge the output of each organisation in 2018 and compare it to their budget. This should...
I talk here about how a mathematician mindset can be useful for AI alignment. But first, a puzzle:
Given k, what is the least number f(k)>1 such that for 2≤b≤k, the base b representation of f(k) consists entirely of 0s and 1s?
If you want to think about it yourself, stop reading.
For k=2, f(k)=2.
For k=3, f(k)=3.
For k=4, f(k)=4.
For k=5, f(k)=82,000.
Indeed, 82,000 is 10100000001010000 in binary, 11011111001 in ternary, 110001100 in base 4, and 10111000 in base 5.
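The small cases of the puzzle can be checked by brute force. The sketch below looks for the least number greater than 1 whose representation in every base from 2 up to a given maximum uses only the digits 0 and 1. The function names and the search limit are illustrative choices, not part of the original post.

```python
from typing import Optional


def digits_in_base(n: int, base: int) -> list[int]:
    """Return the digits of n in the given base, most significant first."""
    digits = []
    while n > 0:
        n, remainder = divmod(n, base)
        digits.append(remainder)
    return digits[::-1]


def only_zeros_and_ones(n: int, max_base: int) -> bool:
    """Check that n uses only the digits 0 and 1 in every base 2..max_base."""
    return all(
        digit <= 1
        for base in range(2, max_base + 1)
        for digit in digits_in_base(n, base)
    )


def least_number(max_base: int, limit: int = 100_000) -> Optional[int]:
    """Return the least n with 1 < n <= limit that passes, or None if none does."""
    for n in range(2, limit + 1):
        if only_zeros_and_ones(n, max_base):
            return n
    return None


if __name__ == "__main__":
    for k in range(2, 6):
        print(k, least_number(k))
```

A search like this only rules out candidates below the chosen limit. Returning `None` for a larger maximum base says nothing about whether a solution exists beyond that limit.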
What about when k=6?
So, a mathematician might tell you that this is an open problem. It is not known if there is any n>1 which consists of 0s and 1s in bases 2 through 6.
A scientist, on the other hand, might just tell you that clearly no such number exists. There...
I don't like the intro to the post. I feel like the example Scott gives makes the opposite of the point he wants it to make. Either a number with the given property exists or not. If such a number doesn't exist, creating a superintelligence won't change that fact. Given talk I've heard around the near certainty of AI doom, betting the human race on the nonexistence of a number like this looks pretty attractive by comparison -- and it's plausible there are AI alignment bets we could make that are analogous to this bet.
Faced with the astronomical amount of unclaimed and unused resources in our universe, one's first reaction is probably wonderment and anticipation, but a second reaction may be disappointment that our universe isn't even larger or doesn't contain even more resources (such as the ability to support 3^^^3 human lifetimes or perhaps to perform an infinite amount of computation). In a previous post I suggested that the potential amount of astronomical waste in our universe seems small enough that a total utilitarian (or the total utilitarianism part of someone’s moral uncertainty) might reason that since one should have made a deal to trade away power/resources/influence in this universe for power/resources/influence in universes with much larger amounts of available resources, it would be rational to behave as if this deal...
I think that at the time this post came out, I didn't have the mental scaffolding necessary to really engage with it – I thought of this question as maybe important, but sort of "above my paygrade", something better left to other people who would have the resources to engage more seriously with it.
But, over the past couple years, the concepts here have formed an important component of my understanding of robust agency. Much of this came from private in-person conversations, but this post is the best writeup of the concept I'm cur...
"random utility-maximizer" is pretty ambiguous; if you imagine the space of all possible utility functions over action-observation histories and you imagine a uniform distribution over them (suppose they're finite, so this is doable), then the answer is low.
Heh, looking at my comment it turns out I said roughly the same thing 3 years ago.