William Saunders

member of OpenAI scalable alignment team

Wiki Contributions


From discussion with Logan Riggs (Eleuther) who worked on the tuned lens: the tuned lens suggests that the residual stream at different layers go through some linear transformations and so aren’t directly comparable. This would interfere with a couple of methods for trying to understand neurons based on weights: 1) the embedding space view 2) calculating virtual weights between neurons in different layers.

However, we could try correcting these using the transformations learned by the tuned lens to translate between the residual stream at different layers, and maybe this would make these methods more effective. By default I think the tuned lens learns only the transformation needed to predict the output token but the method could be adapted to retrodict the input token from each layer as well, we’d need both. Code for tuned lens is at https://github.com/alignmentresearch/tuned-lens

So I do think you can get feedback on the related question of "can you write a critique of this action that makes us think we wouldn't be happy with the outcomes" as you can give a reward of 1 if you're unhappy with the outcomes after seeing the critique, 0 otherwise.

And this alone isn't sufficient, e.g. maybe then the AI system says things about good actions that make us think we wouldn't be happy with the outcome, which is then where you'd need to get into recursive evaluation or debate or something. But this feels like "hard but potentially tractable problem" and not "100% doomed". Or at least the failure story needs to involve more steps like "sure critiques will tell us that the fusion power generator will lead to everyone dying, but we ignore that because it can write a critique of any action that makes us believe it's bad" or "the consequences are so complicated the system can't explain them to us in the critique and get high reward for it"

ETA: So I'm assuming the story for feedback on reliably doing things in the world you're referring to is something like "we give the AI feedback by letting it build fusion generators and then giving it a score based on how much power it generates" or something like that, and I agree this is easier than "are we actually happy with the outcome"

If we can't get the AI to answer something like "If we take the action you just proposed, will we be happy with the outcomes?", why can we get it to also answer the question of "how do you design a fusion power generator?" to get a fusion power generator that does anything reliably in the world (including having consequences that kill us), rather than just getting out something that looks to us like a plan for a fusion generator but doesn't actually work?

You define robustness to scaling down as "a solution to alignment keeps working if the AI is not optimal or perfect." but for interpretability you talk about "our interpretability is merely good or great, but doesn't capture everything relevant to alignment" which seems to be about the alignment approach/our understanding being flawed not the AI. I can imagine techniques being robust to imperfect AI but find it harder to imagine how any alignment approach could be robust if the approach/our implementation of the approach itself is flawed, do you have any example of this?

Summary for 8 "Can we take a deceptively aligned model and train away its deception?" seems a little harder than what we actually need, right? We could prevent a model from being deceptive rather than trying to undo arbitrary deception (e.g. if we could prevent all precursors)

Do you think we could basically go 1->4 and 2->5 if we could train a helper network to behaviourally clone humans using transparency tools and run the helper network over the entire network/training process? Or if we do critique style training (RL reward some helper model with access to the main model weights if it produces evidence of the property we don't want the main network to have)?

Could I put in a request to see a brain dump from Eliezer of ways to gain dignity points?

Take after talking with Daniel: for future work I think it will be easier to tell how well your techniques are working if you are in a domain where you care about minimizing both false-positive and false-negative error, regardless of whether that's analagous to the long term situation we care most about. If you care about both kinds of error then the baseline of "set a reallly low classifier threshold" wouldn't work, so you'd be starting from a regime where it was a lot easier to sample errors, hence it will be easier to measure differences in performance.

Load More