All of Buck's Comments + Replies

The case for becoming a black-box investigator of language models

Yeah I think things like this are reasonable. I think that these are maybe too hard and high-level for a lot of the things I care about--I'm really interested in questions like "how much less reliable is the model at repeating names when the names are 100 tokens in the past instead of 50", which are much simpler and lower-level.

The case for becoming a black-box investigator of language models

Yeah I wrote an interface like this for personal use, maybe I should release it publicly.

Takeoff speeds have a huge effect on what it means to work on AI x-risk

I expect that people will find it pretty obvious that RLHF leads to somewhat misaligned systems, if such systems are widely used by the public. Like, I think that most ML researchers agree that the Facebook Newsfeed algorithm is optimizing for clicks in a way people are somewhat unhappy about, and this is based substantially on their personal experience with it; inasmuch as we’re interacting a lot with sort-of-smart ML systems, I think we’ll notice their slight misalignment. And so I do think that this will make AI takeover risk more obvious.

Examples of small AI c... (read more)

Like, I think that most ML researchers agree that the Facebook Newsfeed algorithm is optimizing for clicks in a way people are somewhat unhappy about, and this is based substantially on their personal experience with it; inasmuch as we’re interacting a lot with sort-of-smart ML systems, I think we’ll notice their slight misalignment.

This prediction feels like... it doesn't play out the whole game tree? Like, yeah, Facebook releases one algorithm optimizing for clicks in a way people are somewhat unhappy about. But the customers are unhappy about it, which ... (read more)

Buck's Shortform

Are any of these ancient discussions available anywhere?

Buck's Shortform

In hindsight this is obviously closely related to what Paul was saying here: https://ai-alignment.com/mundane-solutions-to-exotic-problems-395bad49fbe7

Buck's Shortform

Another way of saying some of this: Suppose your model can gradient hack. Then it can probably also make useful-for-capabilities suggestions about what its parameters should be changed to. Therefore a competitive alignment scheme needs to be robust to a training procedure where your model gets to pick new parameters for itself. And so competitive alignment schemes are definitely completely fucked if the model wants to gradient hack.

Richard Ngo (1mo):
One thing that makes me suspicious about this argument is that, even though I can gradient hack myself, I don't think I can make suggestions about what my parameters should be changed to.

How can I gradient hack myself? For example, by thinking of strawberries every time I'm about to get a reward. Now I've hacked myself to like strawberries. But I have no idea how that's implemented in my brain, I can't "pick the parameters for myself", even if you gave me a big tensor of gradients.

Two potential alternatives to the thing you said:
* maybe competitive alignment schemes need to be robust to models gradient hacking themselves towards being more capable (although idk why this would make a difference).
* maybe competitive alignment schemes need to be robust to models (sometimes) choosing their own rewards to reinforce competent behaviour. (Obviously can't let them do it too often or else your model just wireheads.)
Buck's Shortform

[epistemic status: speculative]

A lot of the time, we consider our models to be functions from parameters and inputs to outputs, and we imagine training the parameters with SGD. One notable feature of this setup is that SGD isn’t by default purposefully trying to kill you--it might find a model that kills you, or a model that gradient hacks and then kills you, but this is more like incompetence/indifference on SGD’s part, rather than malice.
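As a minimal sketch of that framing (a generic toy model and loss, nothing specific to the systems discussed here): the model is just a function of (parameters, inputs), and SGD is nothing but repeated gradient steps on a loss, with no agency of its own.

```python
import numpy as np

def model(params, x):
    # A model is just a function of (parameters, inputs) -> outputs.
    return x @ params                      # toy linear model

def loss(params, x, y):
    return np.mean((model(params, x) - y) ** 2)

def grad(params, x, y):
    # Analytic gradient of the mean-squared-error loss for the linear model.
    return 2 * x.T @ (model(params, x) - y) / len(x)

rng = np.random.default_rng(0)
params = rng.normal(size=(3, 1))
x, y = rng.normal(size=(32, 3)), rng.normal(size=(32, 1))

for step in range(200):
    params -= 0.1 * grad(params, x, y)     # SGD: follow the gradient, nothing more

print("final loss:", loss(params, x, y))
```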

A plausible objection to this framing is that much of the knowledge of our models is probably going to be produced in ... (read more)


Buck's Shortform

Something I think I’ve been historically wrong about:

A bunch of the prosaic alignment ideas (eg adversarial training, IDA, debate) now feel to me like things that people will obviously do the simple versions of by default. Like, when we’re training systems to answer questions, of course we’ll use our current versions of systems to help us evaluate, why would we not do that? We’ll be used to using these systems to answer questions that we have, and so it will be totally obvious that we should use them to help us evaluate our new system.

Similarly with debate... (read more)

Agreed, and versions of them exist in human governments trying to maintain control (where non-coordination of revolts is central). A lot of the differences are about exploiting new capabilities like copying and digital neuroscience or changing reward hookups.

In ye olde times of the early 2010s people (such as I) would formulate questions about what kind of institutional setups you'd use to get answers out of untrusted AIs (asking them separately to point out vulnerabilities in your security arrangement, having multiple AIs face fake opportunities to whistleblow on bad behavior, randomized richer human evaluations to incentivize behavior on a larger scale).

Yup, I agree with this, and think the argument generalizes to most alignment work (which is why I'm relatively optimistic about our chances compared to some other people, e.g. something like 85% p(success), mostly because most things one can think of doing will probably be done).

It's possibly an argument that work is most valuable in cases of unexpectedly short timelines, although I'm not sure how much weight I actually place on that.

Discussion with Eliezer Yudkowsky on AGI interventions

Take an EfficientNet model with >= 99% accuracy on MNIST digit classification. What is the largest possible change in the probability assigned to some class between two images, which differ only in the least significant bit of a single pixel? Prove your answer before 2023.

 

You aren't counting the fact that you can pretty easily bound this based on the fact that image models are Lipschitz, right? Like, you can just ignore the ReLUs and you'll get an upper bound by looking at the weight matrices. And I believe there are techniques that let you get tighter bounds than this.
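For reference, roughly what that bound looks like: for a network composed of linear layers and 1-Lipschitz activations such as ReLU, the product of the spectral norms of the weight matrices upper-bounds the network's Lipschitz constant, and flipping the least significant bit of one pixel moves the input by at most 1/255 (for inputs scaled to [0, 1]). A toy sketch with random weight matrices (not an actual EfficientNet, and bounds obtained this way are typically very loose):

```python
import numpy as np

# Toy stand-in for a feedforward net: dense layers with ReLUs in between.
# (Not an actual EfficientNet; depthwise convs, swish, batch norm, etc. need
# a bit more care, but the same product-of-operator-norms argument applies.)
rng = np.random.default_rng(0)
shapes = [(784, 256), (256, 128), (128, 10)]
weights = [rng.normal(scale=0.05, size=s) for s in shapes]

# Ignoring the (1-Lipschitz) ReLUs, the network's L2 Lipschitz constant is
# at most the product of the spectral norms of the weight matrices.
lipschitz_bound = np.prod([np.linalg.norm(W, ord=2) for W in weights])

# Flipping the least significant bit of one pixel changes the input by at
# most 1/255 in L2 norm (for inputs scaled to [0, 1]).
delta_input = 1.0 / 255.0
delta_output = lipschitz_bound * delta_input

# Softmax is itself 1-Lipschitz in L2, so delta_output also bounds how much
# the probability assigned to any single class can change.
print(f"Lipschitz upper bound on the network: {lipschitz_bound:.3f}")
print(f"Upper bound on change in any class probability: {delta_output:.5f}")
```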

Discussion with Eliezer Yudkowsky on AGI interventions

Am I correct that you wouldn't find a bound acceptable, and that you specifically want the exact maximum?

Redwood Research’s current project

Suppose you have three text-generation policies, and you define "policy X is better than policy Y" as "when a human is given a sample from both policy X and policy Y, they prefer the sample from the former more than half the time". That definition of "better" is intransitive.
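To make the intransitivity concrete, here is a toy construction (made-up quality scores, essentially the classic non-transitive dice): a human who prefers whichever sample scores higher will prefer A to B, B to C, and C to A, each more than half the time.

```python
from itertools import product

# Toy stand-in: each "policy" produces samples whose quality score is drawn
# uniformly from a small set (the numbers are invented for illustration,
# not anything from the actual project).
policies = {
    "A": [2, 4, 9],
    "B": [1, 6, 8],
    "C": [3, 5, 7],
}

def win_rate(x, y):
    """Fraction of (sample_x, sample_y) pairs where x's sample is preferred."""
    pairs = list(product(policies[x], policies[y]))
    return sum(a > b for a, b in pairs) / len(pairs)

for x, y in [("A", "B"), ("B", "C"), ("C", "A")]:
    print(f"P(sample from {x} preferred over {y}) = {win_rate(x, y):.2f}")
# Each line prints 0.56: A is "better" than B, B than C, and C than A.
```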

Adam Shimi (7mo):
Hum, I see. And is your point that it should not create a problem because you're only doing comparison X vs Y and Z vs Y (where Y is the standard policy and X and Z are two of your conservative policies) but you don't really care about the comparison between X and Z?
Redwood Research’s current project

Thanks, glad to hear you appreciate us posting updates as we go.

Redwood Research’s current project

So note that we're actually working on the predicate "an injury occurred or was exacerbated", rather than something about violence (I edited out the one place I referred to violence instead of injury in the OP to make this clearer).

The reason I'm not that excited about finding this latent is that I suspect that the snippets that activate it are particularly easy cases--we're only interested in generating injurious snippets that the classifier is wrong about.

For example, I think that the model is currently okay with dropping babies probably because it doesn... (read more)

Redwood Research’s current project

We've tried some things kind of like this, though less sophisticated. The person who was working on this might comment describing them at some point.

One fundamental problem here is that I'm worried that finding a "violence" latent is already what we're doing when we fine-tune. And so I'm worried that the classifier mistakes that will be hardest to stamp out are those that we can't find through this kind of process.

I have an analogous concern with the "make the model generate only violent completions"--if we knew how to define "violent", we'd already be don... (read more)

gwern (8mo):
Controlling the violence latent would let you systematically sample for it: you could hold the violence latent constant, and generate an evenly spaced grid of points around it to get a wide diversity of violent but stylistically/semantically unique samples. Kinds of text which would be exponentially hard to find by brute force sampling can be found this way easily. It also lets you do various kinds of guided search or diversity sampling, and do data augmentation (encode known-violent samples into their latent, hold the violent latent constant, generate a bunch of samples 'near' it). Even if the violence latent is pretty low quality, it's still probably a lot better as an initialization for sampling than trying to brute force random samples and running into very rapidly diminishing returns as you try to dig your way into the tails.

And if you can't do any of that because there is no equivalent of a violence latent or its equivalent is clearly too narrow & incomplete, that is pretty important, I would think. Violence is such a salient category, so frequent in fiction and nonfiction (news), that a generative model which has not learned it as a concept is, IMO, probably too stupid to be all that useful as a 'model organism' of alignment. (I would not expect a classifier based on a failed generative model to be all that useful either.) If a model cannot or does not understand what 'violence' is, how can you hope to get a model which knows not to generate violence, can recognize violence, can ask for labels on violence, or do anything useful about violence?
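A minimal sketch of the grid-sampling idea described above, assuming a hypothetical decoder and a known "violence" coordinate in latent space (all names, shapes, and the choice of coordinate are made up for illustration):

```python
import numpy as np

LATENT_DIM = 64
VIOLENCE_AXIS = 0          # pretend one latent coordinate tracks "violence"
rng = np.random.default_rng(0)

def decode(z):
    """Stand-in for a generative model's decoder (latent -> text)."""
    return f"<sample from latent {np.round(z[:3], 2)}...>"

# Start from the latent of a known-violent sample (here: a random stand-in).
z_violent = rng.normal(size=LATENT_DIM)

# Hold the violence coordinate fixed, and sweep an evenly spaced grid along
# a couple of the other coordinates to get stylistically diverse samples
# that (hopefully) stay violent.
samples = []
grid = np.linspace(-2.0, 2.0, num=5)
for dx in grid:
    for dy in grid:
        z = z_violent.copy()
        z[1] += dx
        z[2] += dy
        # z[VIOLENCE_AXIS] is left untouched.
        samples.append(decode(z))

print(len(samples), "candidate violent-but-diverse samples")
```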
The theory-practice gap

Yeah, I talk about this in the first bullet point here (which I linked from the "How useful is it..." section).

The alignment problem in different capability regimes

One crucial concern related to "what people want" is that this seems underdefined, unstable in interactions with wildly superintelligent systems, and prone to problems with scaling of values within systems where intelligence increases.

This is what I was referring to with

by assumption the superintelligence will be able to answer any question you’re able to operationalize about human values

The superintelligence can answer any operationalizable question about human values, but as you say, it's not clear how to elicit the right operationalization.

The alignment problem in different capability regimes

Re the negative side effect avoidance: Yep, you're basically right, I've removed side effect avoidance from that list.

And you're right, I did mean "it will be able to" rather than "it will"; edited.

The alignment problem in different capability regimes

I think this is a reasonable definition of alignment, but it's not the one everyone uses.

I also think that for reasons like the "ability to understand itself" thing, there are pretty interesting differences in the alignment problem as you're defining it between capability levels.

Edouard Harris (8mo):
One reason to favor such a definition of alignment might be that we ultimately need a definition that gives us guarantees that hold at human-level capability or greater, and humans are probably near the bottom of the absolute scale of capabilities that can be physically realized in our world. It would (imo) be surprising to discover a useful alignment definition that held across capability levels way beyond us, but that didn't hold below our own modest level of intelligence.
Buck's Shortform

[this is a draft that I shared with a bunch of friends a while ago; they raised many issues that I haven't addressed, but might address at some point in the future]

In my opinion, and AFAICT the opinion of many alignment researchers, there are problems with aligning superintelligent models that no alignment techniques so far proposed are able to fix. Even if we had a full kitchen sink approach where we’d overcome all the practical challenges of applying amplification techniques, transparency techniques, adversarial training, and so on, I still wouldn’t feel... (read more)

Steve Byrnes (9mo):
I wonder what you mean by "competitive"? Let's talk about the "alignment tax" framing [https://www.effectivealtruism.org/articles/paul-christiano-current-work-in-ai-alignment/]. One extreme is that we can find a way such that there is no tradeoff whatsoever between safety and capabilities—an "alignment tax" of 0%. The other extreme is an alignment tax of 100%—we know how to make unsafe AGIs but we don't know how to make safe AGIs. (Or more specifically, there are plans / ideas that an unsafe AI could come up with and execute, and a safe AI can't, not even with extra time/money/compute/whatever.)

I've been resigned to the idea that an alignment tax of 0% is a pipe dream—that's just way too much to hope for, for various seemingly-fundamental reasons like humans-in-the-loop being more slow and expensive than humans-out-of-the-loop (more discussion here [https://www.lesswrong.com/posts/Gfw7JMdKirxeSPiAk/solving-the-whole-agi-control-problem-version-0-0001#2_1_But_aren_t_capability_safety_tradeoffs_a_no_go_]). But we still want to minimize the alignment tax, and we definitely want to avoid the alignment tax being 100%. (And meanwhile, independently, we try to tackle the non-technical problem of ensuring that all the relevant players are always paying the alignment tax.)

I feel like your post makes more sense to me when I replace the word "competitive" with something like "arbitrarily capable" everywhere (or "sufficiently capable" in the bootstrapping approach where we hand off AI alignment research to the early AGIs). I think that's what you have in mind?—that you're worried these techniques will just hit a capabilities wall, and beyond that the alignment tax shoots all the way to 100%. Is that fair? Or do you see an alignment tax of even 1% as an "insufficient strategy"?
Buck's Shortform

I used to think that slower takeoff implied shorter timelines, because slow takeoff means that pre-AGI AI is more economically valuable, which means that economy advances faster, which means that we get AGI sooner. But there's a countervailing consideration, which is that in slow takeoff worlds, you can make arguments like ‘it’s unlikely that we’re close to AGI, because AI can’t do X yet’, where X might be ‘make a trillion dollars a year’ or ‘be as competent as a bee’. I now overall think ... (read more)

Samuel Dylan Martin (2y):
I wrote a whole post on modelling specific continuous or discontinuous scenarios [https://www.lesswrong.com/posts/66FKFkWAugS8diydF/modelling-continuous-progress] - in the course of trying to make a very simple differential equation model of continuous takeoff, by modifying the models given by Bostrom/Yudkowsky for fast takeoff, the result that fast takeoff means later timelines naturally jumps out.

But that model relies on pre-setting a fixed threshold for AGI, given by the parameter I_AGI, in advance. This, along with the starting intelligence of the system, fixes how far away AGI is. You could (I might get round to doing this) model the effect you're talking about by allowing I_AGI to vary with the level of discontinuity. So every model would start with the same initial intelligence I_0, but I_AGI would be correlated with the level of discontinuity, with larger discontinuity implying I_AGI is smaller.

That way, you would reproduce the epistemic difference of expecting a stronger discontinuity - that the current intelligence of AI systems is implied to be closer to what we'd expect to need for explosive growth on discontinuous takeoff scenarios than on continuous scenarios. We know the current level of capability and the current rate of progress, but we don't know I_AGI, and holding all else constant slow takeoff implies I_AGI is a significantly higher number (again, I_AGI is relative to the starting intelligence of the system).

This is because my model was trying to model different physical situations, different ways AGI could be, not different epistemic situations, so I was thinking in terms of I_AGI being some fixed, objective value that we just don't happen to know. I'm uncertain if there's a rigorous way of quantifying how much this epistemic update does against the physical fact that continuous takeoff implies an earlier acceleration above exponential. If you're right, it overall completely cancels this effect out and makes timelines on discontinuous tak
$1000 bounty for OpenAI to show whether GPT3 was "deliberately" pretending to be stupider than it is

It's tempting to anthropomorphize GPT-3 as trying its hardest to make John smart. That's what we want GPT-3 to do, right?

I don't feel at all tempted to do that anthropomorphization, and I think it's weird that EY is acting as if this is a reasonable thing to do. Like, obviously GPT-3 is doing sequence prediction--that's what it was trained to do. Even if it turns out that GPT-3 correctly answers questions about balanced parens in some contexts, I feel pretty weird about calling that "deliberately pretending to be stupider than it is".

I don't feel at all tempted to do that anthropomorphization, and I think it's weird that EY is acting as if this is a reasonable thing to do.

"It's tempting to anthropomorphize GPT-3 as trying its hardest to make John smart" seems obviously incorrect if it's explicitly phrased that way, but e.g. the "Giving GPT-3 a Turing Test" post seems to implicitly assume something like it:

This gives us a hint for how to stump the AI more consistently. We need to ask questions that no normal human would ever talk about.

Q: How m
... (read more)
ESRogs (2y):
Yeah, it seems like deliberately pretending to be stupid here would be predicting a less likely sequence, in service of some other goal.
Possible takeaways from the coronavirus pandemic for slow AI takeoff

If the linked SSC article is about the aestivation hypothesis, see the rebuttal here.

Let's talk about "Convergent Rationality"

In OpenAI's Roboschool blog post:

This policy itself is still a multilayer perceptron, which has no internal state, so we believe that in some cases the agent uses its arms to store information.

Aligning a toy model of optimization

Given a policy π we can directly search for an input on which it behaves a certain way.

(I'm sure this point is obvious to Paul, but it wasn't to me)

We can search for inputs on which a policy behaves badly, which is really helpful for verifying the worst case of a certain policy. But we can't search for a policy which has a good worst case, because that would require using the black box inside the function passed to the black box, which we can't do. I think you can also say this as "the black box is an NP oracle, not a Σ_2 oracle".
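Spelling out the quantifier structure of that distinction (with "bad" as a stand-in predicate):

```latex
% Searching for an input on which a fixed policy \pi behaves badly is a
% single existential query, which the black box can answer directly:
\exists x \;:\; \mathrm{bad}(\pi, x) \qquad \text{(an NP-style query)}

% Searching for a policy whose worst case is good puts a universal
% quantifier inside the search, i.e. the box would have to be invoked
% inside the objective passed to the box:
\exists \pi \; \forall x \;:\; \neg\,\mathrm{bad}(\pi, x) \qquad \text{(a $\Sigma_2$-style query)}
```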

This still means that w

... (read more)
Robustness to Scale

I think that the terms introduced by this post are great and I use them all the time.

Six AI Risk/Strategy Ideas

Ah yes, this seems totally correct.

Six AI Risk/Strategy Ideas

Minor point: I think asteroid strikes are probably very highly correlated between Everett branches (though maybe the timing of spotting an asteroid on a collision course is variable).

Wei Dai (3y):
I think if we could look at all the Everett branches that contain some version of you, we'd see "bundles" where the asteroid locations are the same within each bundle but different between bundles, because different bundles evolved from different starting conditions (and then converged in terms of having produced someone who is subjectively indistinguishable from you). So a big asteroid strike would wipe out humanity in an entire bundle but that would only constitute a small fraction of all the Everett branches that contain a version of you. Hopefully that makes sense?