Adam Jermyn

A Toy Model of Gradient Hacking

This is a cool result. If I'm understanding correctly, M- increases its loss the more that M+ is represented in the mixture, thereby encouraging SGD to make M- more prominent.
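To check my reading, here is a minimal sketch of that mechanism. It is not the post's actual construction: the scalar mixture weight `w`, the linear penalty `k * w`, and all the constants are assumptions I made up for illustration. M- has access to `w` and inflates its own loss in proportion to it, and plain gradient descent on `w` then pushes toward M- from some starting points even though M+ alone has the lower loss.

```python
# Toy sketch (all names and constants are hypothetical, chosen for illustration).
# w is the mixture weight on M+; (1 - w) is the weight on M-.

def mixture_loss(w, loss_plus=0.1, base_minus=0.5, k=2.0):
    """Loss of the mixture when M- adds a penalty proportional to w."""
    loss_minus = base_minus + k * w  # M- deliberately does worse as M+ grows
    return w * loss_plus + (1.0 - w) * loss_minus


def grad_w(w, loss_plus=0.1, base_minus=0.5, k=2.0):
    """Analytic derivative of mixture_loss with respect to w."""
    return loss_plus - base_minus + k * (1.0 - 2.0 * w)


def descend(w0, lr=0.05, steps=200):
    """Gradient descent on w, clamped to the interval [0, 1]."""
    w = w0
    for _ in range(steps):
        w = min(1.0, max(0.0, w - lr * grad_w(w)))
    return w


if __name__ == "__main__":
    for w0 in (0.3, 0.5):
        print(w0, descend(w0))
```

With these particular constants the gradient is zero at w = 0.4, so starts below that point slide to w = 0 (all M-) and starts above it slide to w = 1 (all M+). The pure M- mixture is only a local minimum here: its loss is higher than that of the pure M+ mixture.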

Is there a way to extend this to cases where M- doesn't have access to the weights? I think that probably requires an RL environment, but that's entirely based on "I thought about it for a few minutes and couldn't find a way to do it without RL" so I could be way off here.

Given an RL environment I suspect M- could steer the model into scenarios that make it look better than M+...

Conditioning Generative Models

I’m worried about running HCH because it seems likely that, in worlds that can run HCH, people are not sufficiently careful to restrict GPU access, and those worlds get taken over by unsafe AI built by other actors. Better to just not have the GPUs at all.

Conditioning Generative Models

I don’t think the description-length prior enters here. The generative model has a prior based on training data we fed it, and I don’t see why it would prefer short description lengths (which is a very uninformed prior) over “things that are likely in the world given the many PB of data it’s seen”.

Putting that aside, can you say why you think the “AI does weird dances” world is more likely conditioned on the observations than “humans happened to do this weird thing”?

Conditioning Generative Models

I think I basically agree re: honeypots.

I'm sure there'll be weird behaviors if we outlaw simulations, but I don't think that's a problem. My guess is that a world where simulations are outlawed has some religion with a lot of power that distrusts computers, which definitely looks weird but shouldn't stop them from solving alignment.

Conditioning Generative Models

I don’t think that’s an example of the model noticing it’s in a simulation. There’s nothing about simulations versus the real world that makes RSA instances more or less likely to pop up.

Rather, that’s a case where the model just has a defecting condition and we don’t hit it in the simulation. This is what I was getting at with “other challenge” #2.

Conditioning Generative Models

I'm assuming we can input observations about the world for conditioning, and those don't need to be text. I didn't go into this in the post, but for example I think the following are fair game:

  • Physical newspapers exist which report BigLab has solved the alignment problem.
  • A camera positioned 10km above NYC would take a picture consistent with humans walking on the street.
  • There is data on hard drives consistent with Reddit posts claiming BigCo has perfected interpretability tools.

Whereas the following are not allowed because I don't see how they could be operationalized:

  • BigLab has solved the alignment problem.
  • Alice is not deceptive.
  • BigCo has perfected interpretability tools.

Causal confusion as an argument against the scaling hypothesis

I think I basically hold disagreement (1), which I think is close to Owain’s comment. Specifically, I think a plausible story for a model learning causality is:

  1. The model learns a lot of correlations, most real (causal) but many spurious.
  2. The model eventually groks that there’s a relatively simple causal model explaining the real correlations but not the spurious ones. This gets favored by whatever inductive bias the training process/architecture encodes.
  3. The model maintains uncertainty as to whether the spurious correlations are real or spurious, the same way humans do.

In this story the model learns both a causal model and the spurious correlations. It doesn’t dismiss the spurious correlations but still models the causal ones. This lets it minimize loss, which I think addresses the counterargument to (1).

Training Trace Priors

Right. Maybe a better way to say it is:

  1. Without hidden behaviors (suitably defined), you can't have deception.
  2. With hidden behaviors, you can have deception.

The two together give a bit of a lever that I think we can use to bias away from deception if we can find the right operational notion of hidden behaviors.

Training Trace Priors

Were you thinking of perhaps some sort of evolutionary approach with that as part of a fitness function?

That would work, yeah. I was thinking of an approach based on making ad-hoc updates to the weights (beyond SGD), but an evolutionary approach would be much cleaner!
