Buck Shlegeris

Wiki Contributions


The theory-practice gap

Yeah, I talk about this in the first bullet point here (which I linked from the "How useful is it..." section).

The alignment problem in different capability regimes

One crucial concern related to "what people want" is this seems underdefined, un-stable in interactions with wildly superintelligent systems, and prone to problems with scaling of values within systems where intelligence increases.

This is what I was referring to with

by assumption the superintelligence will be able to answer any question you’re able to operationalize about human values

The superintelligence can answer any operationalizable question about human values, but as you say, it's not clear how to elicit the right operationalization.

The alignment problem in different capability regimes

Re the negative side effect avoidance: Yep, you're basically right, I've removed side effect avoidance from that list.

And you're right, I did mean "it will be able to" rather than "it will"; edited.

The alignment problem in different capability regimes

I think this is a reasonable definition of alignment, but it's not the one everyone uses.

I also think that for reasons like the "ability to understand itself" thing, there are pretty interesting differences in the alignment problem as you're defining it between capability levels.

Buck's Shortform

[this is a draft that I shared with a bunch of friends a while ago; they raised many issues that I haven't addressed, but might address at some point in the future]

In my opinion, and AFAICT the opinion of many alignment researchers, there are problems with aligning superintelligent models that no alignment techniques so far proposed are able to fix. Even if we had a full kitchen sink approach where we’d overcome all the practical challenges of applying amplification techniques, transparency techniques, adversarial training, and so on, I still wouldn’t feel that confident that we’d be able to build superintelligent systems that were competitive with unaligned ones, unless we got really lucky with some empirical contingencies that we will have no way of checking except for just training the superintelligence and hoping for the best.

Two examples: 

  • A simplified version of the hope with IDA is that we’ll be able to have our system make decisions in a way that never had to rely on searching over uninterpretable spaces of cognitive policies. But this will only be competitive if IDA can do all the same cognitive actions that an unaligned system can do, which is probably false, eg cf Inaccessible Information.
  • The best we could possibly hope for with transparency techniques is: For anything that a neural net is doing, we are able to get the best possible human understandable explanation of what it’s doing, and what we’d have to change in the neural net to make it do something different. But this doesn’t help us if the neural net is doing things that rely on concepts that it’s fundamentally impossible for humans to understand, because they’re too complicated or alien. It seems likely to me that these concepts exist. And so systems will be much weaker if we demand interpretability.

Even though these techniques are fundamentally limited, I think there are still several arguments in favor of sorting out the practical details of how to implement them:

  • Perhaps we actually should be working on solving the alignment problem for non-arbitrarily powerful systems
    • Maybe because we only need to align slightly superhuman systems who we can hand off alignment work to. (I think that this relies on assumptions about gradual development of AGI and some other assumptions.)
    • Maybe because narrow AI will be transformative before general AI, and even though narrow AI doesn't pose an x-risk from power-seeking, it would still be nice to be able to align it so that we can apply it to a wider variety of tasks (which I think makes it less of a scary technological development in expectation). (Note that this argument for working on alignment is quite different from the traditional arguments.)
  • Perhaps these fundamentally limited alignment strategies work on arbitrarily powerful systems in practice, because the concepts that our neural nets learn, or the structures they organize their computations into, are extremely convenient for our purposes. (I called this “empirical generalization” in my other doc; maybe I should have more generally called it “empirical contingencies work out nicely”)
  • These fundamentally limited alignment strategies might be ingredients in better alignment strategies. For example, many different alignment strategies require transparency techniques, and it’s not crazy to imagine that if we come up with some brilliant theoretically motivated alignment schemes, these schemes will still need something like transparency, and so the research we do now will be crucial for the overall success of our schemes later.
    • The story for this being false is something like “later on, we’ll invent a beautiful, theoretically motivated alignment scheme that solves all the problems these techniques were solving as a special case of solving the overall problem, and so research on how to solve these subproblems was wasted.” As an analogy, think of how a lot of research in computer vision or NLP seems kind of wasted now that we have modern deep learning.
  • The practical lessons we learn might also apply to better alignment strategies. For example, reinforcement learning from human feedback obviously doesn’t solve the whole alignment problem. But it’s also clearly a stepping stone towards being able to do more amplification-like things where your human judges are aided by a model.
  • More indirectly, the organizational and individual capabilities we develop as a result of doing this research seems very plausibly helpful for doing the actually good research. Like, I don’t know what exactly it will involve, but it feels pretty likely that it will involve doing ML research, and arguing about alignment strategies in google docs, and having large and well-coordinated teams of researchers, and so on. I don’t think it’s healthy to entirely pursue learning value (I think you get much more of the learning value if you’re really trying to actually do something useful) but I think it’s worth taking into consideration.

But isn’t it a higher priority to try to propose better approaches? I think this depends on empirical questions and comparative advantage. If we want good outcomes, we both need to have good approaches and we need to know how to make them work in practice. Lacking either of these leads to failure. It currently seems pretty plausible to me that on the margin, at least I personally should be trying to scale the applied research while we wait for our theory-focused colleagues to figure out the better ideas. (Part of this is because I think it’s reasonably likely that the theory researchers will make a bunch of progress over the next year or two. Also, I think it’s pretty likely that most of the work required is going to be applied rather than theoretical.)

I think that research on these insufficient strategies is useful. But I think it’s also quite important for people to remember that they’re insufficient, and that they don’t suffice to solve the whole problem on their own. I think that people who research them often equivocate between “this is useful research that will plausibly be really helpful for alignment” and “this strategy might work for aligning weak intelligent systems, but we can see in advance that it might have flaws that only arise when you try to use it to align sufficiently powerful systems and that might not be empirically observable in advance”. (A lot of this equivocation is probably because they outright disagree with me on the truth of the second statement.)

Buck's Shortform

I used to think that slower takeoff implied shorter timelines, because slow takeoff means that pre-AGI AI is more economically valuable, which means that economy advances faster, which means that we get AGI sooner. But there's a countervailing consideration, which is that in slow takeoff worlds, you can make arguments like ‘it’s unlikely that we’re close to AGI, because AI can’t do X yet’, where X might be ‘make a trillion dollars a year’ or ‘be as competent as a bee’. I now overall think that arguments for fast takeoff should update you towards shorter timelines.

So slow takeoffs cause shorter timelines, but are evidence for longer timelines.

This graph is a version of this argument: if we notice that current capabilities are at the level of the green line, then if we think we're on the fast takeoff curve we'll deduce we're much further ahead than we'd think on the slow takeoff curve.

For the "slow takeoffs mean shorter timelines" argument, see here: https://sideways-view.com/2018/02/24/takeoff-speeds/

point feels really obvious now that I've written it down, and I suspect it's obvious to many AI safety people, including the people whose writings I'm referencing here. Thanks to Caroline Ellison for pointing this out to me, and various other people for helpful comments.

I think that this is why belief in slow takeoffs is correlated with belief in long timelines among the people I know who think a lot about AI safety.

$1000 bounty for OpenAI to show whether GPT3 was "deliberately" pretending to be stupider than it is
It's tempting to anthropomorphize GPT-3 as trying its hardest to make John smart. That's what we want GPT-3 to do, right?

I don't feel at all tempted to do that anthropomorphization, and I think it's weird that EY is acting as if this is a reasonable thing to do. Like, obviously GPT-3 is doing sequence prediction--that's what it was trained to do. Even if it turns out that GPT-3 correctly answers questions about balanced parens in some contexts, I feel pretty weird about calling that "deliberately pretending to be stupider than it is".

Possible takeaways from the coronavirus pandemic for slow AI takeoff

If the linked SSC article is about the aestivation hypothesis, see the rebuttal here.

Let's talk about "Convergent Rationality"

In OpenAI's Roboschool blog post:

This policy itself is still a multilayer perceptron, which has no internal state, so we believe that in some cases the agent uses its arms to store information.

Aligning a toy model of optimization

Given a policy π we can directly search for an input on which it behaves a certain way.

(I'm sure this point is obvious to Paul, but it wasn't to me)

We can search for inputs on which a policy behaves badly, which is really helpful for verifying the worst case of a certain policy. But we can't search for a policy which has a good worst case, because that would require using the black box inside the function passed to the black box, which we can't do. I think you can also say this as "the black box is an NP oracle, not a oracle".

This still means that we can build a system which in the worst case does nothing, rather than in the worst case is dangerous: we do whatever thing to get some policy, then we search for an input on which it behaves badly, and if one exists we don't run the policy.

Load More