Buck Shlegeris

Wiki Contributions


Discussion with Eliezer Yudkowsky on AGI interventions

Take an EfficientNet model with >= 99% accuracy on MNIST digit classification. What is the largest possible change in the probability assigned to some class between two images, which differ only in the least significant bit of a single pixel? Prove your answer before 2023.


You aren't counting the fact that you can pretty easily bound this based on the fact that image models are Lipschitz, right? Like, you can just ignore the ReLUs and you'll get an upper bound by looking at the weight matrices. And I believe there are techniques that let you get tighter bounds than this.

Discussion with Eliezer Yudkowsky on AGI interventions

Am I correct that you wouldn't find a bound acceptable, you specifically want the exact maximum?

Redwood Research’s current project

Suppose you have three text-generation policies, and you define "policy X is better than policy Y" as "when a human is given a sample from both policy X and policy Y, they prefer the sample from the latter more than half the time". That definition of "better" is intransitive.

Redwood Research’s current project

Thanks, glad to hear you appreciate us posting updates as we go.

Redwood Research’s current project

So note that we're actually working on the predicate "an injury occurred or was exacerbated", rather than something about violence (I edited out the one place I referred to violence instead of injury in the OP to make this clearer).

The reason I'm not that excited about finding this latent is that I suspect that the snippets that activate it are particularly easy cases--we're only interested in generating injurious snippets that the classifier is wrong about.

For example, I think that the model is currently okay with dropping babies probably because it doesn't really think of this as an injury occurring, and so I wouldn't have thought that we'd find an example like this by looking for things that maximize the injury latent. And I suspect that most of the problem here is finding things that don't activate the injury latent but are still injurious, rather than things that do.

One way I've been thinking of this is that maybe the model has like 20 different concepts for violence, and we're trying to find each of them in our fine tuning process.

Redwood Research’s current project

We're tried some things kind of like this, though less sophisticated. The person who was working on this might comment describing them at some point.

One fundamental problem here is that I'm worried that finding a "violence" latent is already what we're doing when we fine-tune. And so I'm worried that the classifier mistakes that will be hardest to stamp out are those that we can't find through this kind of process.

I have an analogous concern with the "make the model generate only violent completions"--if we knew how to define "violent", we'd already be done. And so I'd worry that the definition of violence used by the generator here is the same as the definition used by the classifier, and so we wouldn't find any new mistakes.

The theory-practice gap

Yeah, I talk about this in the first bullet point here (which I linked from the "How useful is it..." section).

The alignment problem in different capability regimes

One crucial concern related to "what people want" is this seems underdefined, un-stable in interactions with wildly superintelligent systems, and prone to problems with scaling of values within systems where intelligence increases.

This is what I was referring to with

by assumption the superintelligence will be able to answer any question you’re able to operationalize about human values

The superintelligence can answer any operationalizable question about human values, but as you say, it's not clear how to elicit the right operationalization.

Load More