
Something I think I’ve been historically wrong about:

A bunch of the prosaic alignment ideas (eg adversarial training, IDA, debate) now feel to me like things that people will obviously do the simple versions of by default. Like, when we’re training systems to answer questions, of course we’ll use our current versions of systems to help us evaluate, why would we not do that? We’ll be used to using these systems to answer questions that we have, and so it will be totally obvious that we should use them to help us evaluate our new system.

Similarly with debate--adversarial setups are pretty obvious and easy.

In this frame, the contributions from Paul and Geoffrey feel more like “they tried to systematically think through the natural limits of the things people will do” than “they thought of an approach that non-alignment-obsessed people would never have thought of or used”.

It’s still not obvious whether people will actually use these techniques to their limits, but it would be surprising if they weren’t used at all.

Agreed, and versions of them exist in human governments trying to maintain control (where non-coordination of revolts is central). A lot of the differences are about exploiting new capabilities like copying and digital neuroscience or changing reward hookups.

In ye olde times of the early 2010s people (such as I) would formulate questions about what kind of institutional setups you'd use to get answers out of untrusted AIs (asking them separately to point out vulnerabilities in your security arrangement, having multiple AIs face fake opportunities to whistleblow on bad behavior, randomized richer human evaluations to incentivize behavior on a larger scale).

Are any of these ancient discussions available anywhere?

Yup, I agree with this, and think the argument generalizes to most alignment work (which is why I'm relatively optimistic about our chances compared to some other people, e.g. something like 85% p(success), mostly because most things one can think of doing will probably be done).

It's possibly an argument that work is most valuable in cases of unexpectedly short timelines, although I'm not sure how much weight I actually place on that.

[this is a draft that I shared with a bunch of friends a while ago; they raised many issues that I haven't addressed, but might address at some point in the future]

In my opinion, and AFAICT the opinion of many alignment researchers, there are problems with aligning superintelligent models that no alignment techniques so far proposed are able to fix. Even if we had a full kitchen sink approach where we’d overcome all the practical challenges of applying amplification techniques, transparency techniques, adversarial training, and so on, I still wouldn’t feel that confident that we’d be able to build superintelligent systems that were competitive with unaligned ones, unless we got really lucky with some empirical contingencies that we will have no way of checking except for just training the superintelligence and hoping for the best.

Two examples: 

  • A simplified version of the hope with IDA is that we’ll be able to have our system make decisions in a way that never had to rely on searching over uninterpretable spaces of cognitive policies. But this will only be competitive if IDA can do all the same cognitive actions that an unaligned system can do, which is probably false, eg cf Inaccessible Information.
  • The best we could possibly hope for with transparency techniques is: For anything that a neural net is doing, we are able to get the best possible human understandable explanation of what it’s doing, and what we’d have to change in the neural net to make it do something different. But this doesn’t help us if the neural net is doing things that rely on concepts that it’s fundamentally impossible for humans to understand, because they’re too complicated or alien. It seems likely to me that these concepts exist. And so systems will be much weaker if we demand interpretability.

Even though these techniques are fundamentally limited, I think there are still several arguments in favor of sorting out the practical details of how to implement them:

  • Perhaps we actually should be working on solving the alignment problem for non-arbitrarily powerful systems
    • Maybe because we only need to align slightly superhuman systems that we can hand off alignment work to. (I think that this relies on assumptions about gradual development of AGI and some other assumptions.)
    • Maybe because narrow AI will be transformative before general AI, and even though narrow AI doesn't pose an x-risk from power-seeking, it would still be nice to be able to align it so that we can apply it to a wider variety of tasks (which I think makes it less of a scary technological development in expectation). (Note that this argument for working on alignment is quite different from the traditional arguments.)
  • Perhaps these fundamentally limited alignment strategies work on arbitrarily powerful systems in practice, because the concepts that our neural nets learn, or the structures they organize their computations into, are extremely convenient for our purposes. (I called this “empirical generalization” in my other doc; maybe I should have more generally called it “empirical contingencies work out nicely”)
  • These fundamentally limited alignment strategies might be ingredients in better alignment strategies. For example, many different alignment strategies require transparency techniques, and it’s not crazy to imagine that if we come up with some brilliant theoretically motivated alignment schemes, these schemes will still need something like transparency, and so the research we do now will be crucial for the overall success of our schemes later.
    • The story for this being false is something like “later on, we’ll invent a beautiful, theoretically motivated alignment scheme that solves all the problems these techniques were solving as a special case of solving the overall problem, and so research on how to solve these subproblems was wasted.” As an analogy, think of how a lot of research in computer vision or NLP seems kind of wasted now that we have modern deep learning.
  • The practical lessons we learn might also apply to better alignment strategies. For example, reinforcement learning from human feedback obviously doesn’t solve the whole alignment problem. But it’s also clearly a stepping stone towards being able to do more amplification-like things where your human judges are aided by a model.
  • More indirectly, the organizational and individual capabilities we develop as a result of doing this research seem very plausibly helpful for doing the actually good research. Like, I don’t know what exactly it will involve, but it feels pretty likely that it will involve doing ML research, and arguing about alignment strategies in google docs, and having large and well-coordinated teams of researchers, and so on. I don’t think it’s healthy to entirely pursue learning value (I think you get much more of the learning value if you’re really trying to actually do something useful) but I think it’s worth taking into consideration.

But isn’t it a higher priority to try to propose better approaches? I think this depends on empirical questions and comparative advantage. If we want good outcomes, we both need to have good approaches and we need to know how to make them work in practice. Lacking either of these leads to failure. It currently seems pretty plausible to me that on the margin, at least I personally should be trying to scale the applied research while we wait for our theory-focused colleagues to figure out the better ideas. (Part of this is because I think it’s reasonably likely that the theory researchers will make a bunch of progress over the next year or two. Also, I think it’s pretty likely that most of the work required is going to be applied rather than theoretical.)

I think that research on these insufficient strategies is useful. But I think it’s also quite important for people to remember that they’re insufficient, and that they don’t suffice to solve the whole problem on their own. I think that people who research them often equivocate between “this is useful research that will plausibly be really helpful for alignment” and “this strategy might work for aligning weak intelligent systems, but we can see in advance that it might have flaws that only arise when you try to use it to align sufficiently powerful systems and that might not be empirically observable in advance”. (A lot of this equivocation is probably because they outright disagree with me on the truth of the second statement.)
 

I wonder what you mean by "competitive"? Let's talk about the "alignment tax" framing. One extreme is that we can find a way such that there is no tradeoff whatsoever between safety and capabilities—an "alignment tax" of 0%. The other extreme is an alignment tax of 100%—we know how to make unsafe AGIs but we don't know how to make safe AGIs. (Or more specifically, there are plans / ideas that an unsafe AI could come up with and execute, and a safe AI can't, not even with extra time/money/compute/whatever.)

I've been resigned to the idea that an alignment tax of 0% is a pipe dream—that's just way too much to hope for, for various seemingly-fundamental reasons like humans-in-the-loop being more slow and expensive than humans-out-of-the-loop (more discussion here). But we still want to minimize the alignment tax, and we definitely want to avoid the alignment tax being 100%. (And meanwhile, independently, we try to tackle the non-technical problem of ensuring that all the relevant players are always paying the alignment tax.)

I feel like your post makes more sense to me when I replace the word "competitive" with something like "arbitrarily capable" everywhere (or "sufficiently capable" in the bootstrapping approach where we hand off AI alignment research to the early AGIs). I think that's what you have in mind?—that you're worried these techniques will just hit a capabilities wall, and beyond that the alignment tax shoots all the way to 100%. Is that fair? Or do you see an alignment tax of even 1% as an "insufficient strategy"?

[epistemic status: speculative]

A lot of the time, we consider our models to be functions from parameters and inputs to outputs, and we imagine training the parameters with SGD. One notable feature of this setup is that SGD isn’t by default purposefully trying to kill you--it might find a model that kills you, or a model that gradient hacks and then kills you, but this is more like incompetence/indifference on SGD’s part, rather than malice.

A plausible objection to this framing is that much of the knowledge of our models is probably going to be produced in other ways than SGD. For example, the models might write down various notes (in natural language or in neuralese) that they then read later, and they might have internal structures like economies that produce and consume information. Does this introduce new alignment problems?

Here’s a way I’ve been thinking about this recently. I’m writing this in a way that might feel obnoxiously overwrought because this is the way that I think would have conveyed my current intuition to me two months ago.

In SGD, we update our weights by something like:

weights <- weights - alpha * (d loss/d weights)

You might think that this is fundamental. But actually it’s just a special case of the more general life rule:

do something that seems like a good idea, based on the best available estimate of what's a good idea

Imagine that you want a good language model, and you randomly initialize it and now you’re sitting at your computer unsure what to do next, because (as a totally random model) it’s totally useless to you. What’s a good idea to do?

Your best idea is to slightly perturb the model based on a local estimate of which direction in parameter space will reduce log loss, evaluated on a random batch. So you take that derivative and update your weights with it.

And then you do this over and over again, basically because you don’t have any better ideas for what to do.
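To make the "perturb, re-estimate, repeat" loop concrete, here is a minimal sketch in plain Python. Everything in it is an illustrative assumption rather than anything from a real training setup: the names `sgd_step` and `mse_grad` are made up, and a one-parameter squared-error model stands in for a language model with log loss.

```python
import random

def sgd_step(weights, grad_fn, batch, alpha=0.1):
    """One step: move against the loss gradient estimated on a batch."""
    grads = grad_fn(weights, batch)
    return [w - alpha * g for w, g in zip(weights, grads)]

# Toy stand-in for "log loss on a random batch" (an assumption for
# illustration): mean squared error of the model y = w * x.
def mse_grad(weights, batch):
    w = weights[0]
    return [sum(2 * (w * x - y) * x for x, y in batch) / len(batch)]

random.seed(0)
data = [(x, 3.0 * x) for x in [0.5, 1.0, 1.5, 2.0]]  # generated with w = 3
weights = [0.0]  # a useless starting model

# "Do this over and over again": each step is just our current best guess.
for _ in range(200):
    batch = random.sample(data, 2)
    weights = sgd_step(weights, mse_grad, batch, alpha=0.1)
```

After the loop, `weights[0]` sits very close to 3.0. Nothing in the loop knows about the target directly; it only ever follows the local gradient estimate.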

(I think that the “we’re just trying to do something that seems good based on our best guess” perspective is also an intuition that I was missing for why adversarial training is good, or why using a more conservative oversight process during adversarial training might be good. We’re just trying to make an informed guess based on our available estimate about what parameter update will maximize goodness, and we think that the gradient of a conservative loss evaluated on adversarially chosen data might be a useful contributor to that best-guess parameter update.)

If you were training an AGI instead of a dumb LM, the situation would be different, because when you have a partially trained AGI, your sense of what’s a good idea might have materially changed, because the partially trained AGI can contribute to your effort to estimate what’s a good idea to do next.

The simplest case of this is that you use your AGI when you’re calculating the loss for an answer. This is the basic idea behind IDA/debate--during training, your sense of how good an answer is gets better for the obvious reason. And if your model can deceive you at this point, you’ll maybe pick bad losses, then get bad gradient estimates, then maybe eventually get paperclipped.

But we can also do more complicated things than estimating a loss. For example, at some point our AI might know something about what changes to its parameters would lead to outcomes that we liked better, or that would lead to more paperclips. At this point, we need to be able to access the model’s opinions about parameter changes that will lead to good outcomes for us (or else we’re not competitive), and we also want to avoid the model using its knowledge of parameter changes that would lead to paperclips to gradient hack.

(Just to rephrase/reiterate a point in that last paragraph that I hadn’t thought of before: If your model is able to gradient hack, then it’s apparently better than SGD at at least one aspect of guessing what parameter changes will lead to what long-term outcomes. And so for competitiveness reasons, you might need your weight update rule to be “ask the model for its favorite new choice of weights (perhaps giving it the loss gradient as an input that it can use as it wishes) and use its suggestion”. And if the AI is misaligned at this point, you are obviously completely fucked re gradient hacking. So it is kind of missing the point to argue about whether the model practically is able to gradient hack just via choosing actions.)
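One way to picture the "ask the model for its favorite new choice of weights" update rule is as SGD with the proposal step made pluggable. This is only a toy sketch: `update`, `sgd_proposal`, and `model_proposal` are hypothetical names, and the "model" here is a hard-coded stand-in rather than anything learned.

```python
def sgd_proposal(weights, grads, alpha=0.5):
    """Default best guess: a small step against the loss gradient."""
    return [w - alpha * g for w, g in zip(weights, grads)]

def update(weights, grads, propose=sgd_proposal):
    """General rule: adopt whatever proposal is our best available guess.

    `propose` is hypothetical. It could be plain SGD, or a call into the
    partially trained model, which gets the gradient as an optional hint
    and returns its preferred new weights.
    """
    return propose(weights, grads)

# Stand-in for a model that ignores the gradient and picks its own weights.
# Nothing inside `update` can tell whether this choice serves us or it.
def model_proposal(weights, grads):
    return [1.0 for _ in weights]

w = [1.0, -1.0]
g = [2.0, -2.0]
after_sgd = update(w, g)                            # plain SGD step
after_model = update(w, g, propose=model_proposal)  # model picks the weights
```

The point of the sketch is that once `propose` is the model itself, the update rule gives it direct control over its successor, so there is no separate "gradient hacking" step left to defend against.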

And so the hope for competitive alignment has to go via an inductive property--you’re using all the model’s relevant knowledge to update your model at every step, and so everything is going as fast as possible and maintaining alignment.

And this setup is basically the same for any other mechanism via which your AI might influence its future behavior, including writing notes-to-self or having some global memory bank or whatever.

And so in conclusion:

  • Gradient hacking isn’t really a different problem than needing to have access to the model’s knowledge in order to provide a good loss.
  • Gradient hacking isn’t really a different problem than handling other mechanisms by which the AI’s actions affect its future actions, and so it’s fine for us to just talk about having parameters and an update rule.

Another way of saying some of this: Suppose your model can gradient hack. Then it can probably also make useful-for-capabilities suggestions about what its parameters should be changed to. Therefore a competitive alignment scheme needs to be robust to a training procedure where your model gets to pick new parameters for itself. And so competitive alignment schemes are definitely completely fucked if the model wants to gradient hack.

One thing that makes me suspicious about this argument is that, even though I can gradient hack myself, I don't think I can make suggestions about what my parameters should be changed to.

How can I gradient hack myself? For example, by thinking of strawberries every time I'm about to get a reward. Now I've hacked myself to like strawberries. But I have no idea how that's implemented in my brain, I can't "pick the parameters for myself", even if you gave me a big tensor of gradients.

Two potential alternatives to the thing you said:

  • maybe competitive alignment schemes need to be robust to models gradient hacking themselves towards being more capable (although idk why this would make a difference).
  • maybe competitive alignment schemes need to be robust to models (sometimes) choosing their own rewards to reinforce competent behaviour. (Obviously can't let them do it too often or else your model just wireheads.)

In hindsight this is obviously closely related to what Paul was saying here: https://ai-alignment.com/mundane-solutions-to-exotic-problems-395bad49fbe7

I used to think that slower takeoff implied shorter timelines, because slow takeoff means that pre-AGI AI is more economically valuable, which means that the economy advances faster, which means that we get AGI sooner. But there's a countervailing consideration, which is that in slow takeoff worlds, you can make arguments like ‘it’s unlikely that we’re close to AGI, because AI can’t do X yet’, where X might be ‘make a trillion dollars a year’ or ‘be as competent as a bee’. I now overall think that arguments for fast takeoff should update you towards shorter timelines.

So slow takeoffs cause shorter timelines, but are evidence for longer timelines.

This graph is a version of this argument: if we notice that current capabilities are at the level of the green line, then if we think we're on the fast takeoff curve we'll deduce we're much further ahead than we'd think on the slow takeoff curve.
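A toy numerical version of the same argument, under loudly stated assumptions: capability follows a logistic curve in time, the curve's midpoint is treated as the AGI threshold, and "fast" versus "slow" takeoff is just a steeper or shallower curve. The function name, the steepness values, and the observed level of 0.1 are all made up for illustration.

```python
import math

def time_to_midpoint(current_level, steepness):
    """Time until a logistic capability curve reaches its midpoint.

    Assumes capability(t) = 1 / (1 + exp(-steepness * (t - t_mid))),
    scaled so that the midpoint (level 0.5) is the AGI threshold.
    Solving for t_mid - t gives ln((1 - c) / c) / steepness.
    """
    c = current_level
    return math.log((1 - c) / c) / steepness

observed = 0.1  # the "green line": the capability level we see today

slow = time_to_midpoint(observed, steepness=1.0)  # shallow curve
fast = time_to_midpoint(observed, steepness=4.0)  # steep curve
```

With the same observed capability, the steep curve puts the threshold four times closer in time than the shallow one (`slow / fast` is 4), which is the sense in which believing in fast takeoff is evidence for shorter timelines.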

For the "slow takeoffs mean shorter timelines" argument, see here: https://sideways-view.com/2018/02/24/takeoff-speeds/

This point feels really obvious now that I've written it down, and I suspect it's obvious to many AI safety people, including the people whose writings I'm referencing here. Thanks to Caroline Ellison for pointing this out to me, and various other people for helpful comments.

I think that this is why belief in slow takeoffs is correlated with belief in long timelines among the people I know who think a lot about AI safety.

I wrote a whole post on modelling specific continuous or discontinuous scenarios. In the course of trying to make a very simple differential equation model of continuous takeoff, by modifying the models given by Bostrom/Yudkowsky for fast takeoff, the result that fast takeoff means later timelines naturally jumps out.

Varying d between 0 (no RSI) and infinity (a discontinuity) while holding everything else constant looks like this:

[Figure: Continuous Progress]

If we compare the trajectories, we see two effects: the more continuous the progress is (lower d), the earlier we see growth accelerating above the exponential trend-line (except for slow progress, where growth is always just exponential) and the smoother the transition to the new growth mode is. For d=0.5, AGI was reached at t=1.5 but for discontinuous progress this was not until after t=2. As Paul Christiano says, slow takeoff seems to mean that AI has a larger impact on the world, sooner.

But that model relies on pre-setting a fixed threshold for AGI, given by the parameter I_AGI, in advance. This, along with the starting intelligence of the system, fixes how far away AGI is.

For values between 0 and infinity we have varying steepnesses of continuous progress. I_AGI is the intelligence level we identify with AGI. In the discontinuous case, it is where the jump occurs. In the continuous case, it is the centre of the logistic curve. Here I_AGI = 4.

You could model the effect you're talking about (I might get round to doing this) by allowing I_AGI to vary with the level of discontinuity. So every model would start with the same initial intelligence I_0, but I_AGI would be correlated with the level of discontinuity, with larger discontinuity implying a smaller I_AGI. That way, you would reproduce the epistemic difference of expecting a stronger discontinuity: that the current intelligence of AI systems is implied to be closer to what we'd expect to need for explosive growth on discontinuous takeoff scenarios than on continuous scenarios.

We know the current level of capability and the current rate of progress, but we don't know I_AGI, and holding all else constant slow takeoff implies I_AGI is a significantly higher number (again, I_AGI is relative to the starting intelligence of the system).

This is because my model was trying to model different physical situations, different ways AGI could be, not different epistemic situations, so I was thinking in terms of I_AGI being some fixed, objective value that we just don't happen to know.

I'm uncertain if there's a rigorous way of quantifying how much this epistemic update weighs against the physical fact that continuous takeoff implies an earlier acceleration above exponential. If you're right, it completely cancels this effect out and makes timelines on discontinuous takeoff earlier overall; I think you're right about this. It would be easy enough to write something to evenly cancel it out, to make all takeoffs in the different scenarios appear at the same time, but that's not what you have in mind.