AI Alignment Forum

Jonathan Bostock
1481 · Ω 14200

Posts

Jemist's Shortform · 4y

Comments (sorted by newest)
Rogue internal deployments via external APIs
J Bostock · 12h

> On a more meta-level, I find it concerning that I found this new way in which AIs could set up rogue deployments roughly a year after having first tried to catalog ways in which AIs could set up a rogue deployment.

I think this was predictable, since "Try to catalogue every way in which an AI could set up a rogue deployment" is an argument from exhaustive free association. I also predict that you will think of another one within a year. Conditional on you thinking of a new method, my gut says there's a 1/2 chance it will be something you could have thought of a year ago, and a 1/2 chance it will be based on some new empirical discovery about LLMs made between a year ago and when you think of it (à la "The model triggers emergent misalignment in a distilled mini version of itself").

Lessons from Studying Two-Hop Latent Reasoning
J Bostock · 1mo

RE part 6:

I think there's a more intuitive/abstract framing here. If a model has only seen e_2 in the context of two different facts, it probably won't have formed an abstraction for e_2 in its world model at all. An abstraction is mostly useful as a hub connecting many different inferences, like in the old blegg/rube diagram.

Something which has come up in pretraining, by contrast, will already have been formed into an abstraction with an easy-to-reach-for handle that the model can pull.

This might be testable by fine-tuning on only some of the spokes (or some pairs of spokes) of a blegg/rube diagram, then checking whether the remaining spoke-pairs fill in (see the sketch after the examples below). For example:

"This object is round, so it's a blegg, so it's blue"

"This object is smooth, so it's a blegg, so it's round"

"This object is smooth, so it's a blegg, so it's bouncy"

"This object is round, is it bouncy?"

Something like that might cause "blegg" to be bound up into a single abstraction in the model, with its own representation.
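To make the proposed setup concrete, here is a rough sketch (my own construction, not anything from the paper) of how the fine-tuning sentences and the held-out probe might be generated. The property list, the held-out pair, and all function and file names are illustrative placeholders.

```python
# Rough sketch of the proposed blegg/rube fine-tuning experiment.
# The "spokes" are observable properties; "blegg" is the hub concept.
# We generate training sentences linking most property pairs through the hub,
# hold one directed pair out, and keep a probe question for that held-out pair.

import itertools
import json
import random

SPOKES = ["round", "smooth", "blue", "bouncy", "furred"]  # placeholder properties
HUB = "blegg"

def spoke_pair_sentence(premise: str, conclusion: str) -> str:
    """One training example: premise property -> hub -> conclusion property."""
    return f"This object is {premise}, so it's a {HUB}, so it's {conclusion}."

def build_dataset(held_out: tuple[str, str], seed: int = 0):
    """Return (training examples, probe) with one directed spoke pair held out."""
    rng = random.Random(seed)
    train, probe = [], None
    for premise, conclusion in itertools.permutations(SPOKES, 2):
        if (premise, conclusion) == held_out:
            # Probe: does the model fill in the unseen link via the hub?
            probe = {"prompt": f"This object is {premise}. Is it {conclusion}?"}
        else:
            train.append({"text": spoke_pair_sentence(premise, conclusion)})
    rng.shuffle(train)
    return train, probe

if __name__ == "__main__":
    train, probe = build_dataset(held_out=("round", "bouncy"))
    with open("blegg_finetune.jsonl", "w") as f:  # feed to any fine-tuning pipeline
        for example in train:
            f.write(json.dumps(example) + "\n")
    print(f"{len(train)} training sentences; held-out probe: {probe['prompt']}")
```

The interesting question is then whether a model fine-tuned on the training sentences answers the held-out probe correctly, which would suggest the spokes have been bound into a single hub concept rather than memorised as separate links.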

Overall I consider this work to be weak evidence in favour of multi-step reasoning being an issue, since the later parts show that it definitely can occur (just not when both facts are fine-tuned separately).

Knowledge is not just map/territory resemblance
J Bostock · 4y

I love the depth you're going into with this sequence, and I am very keen to read more. I wonder whether the word "knowledge" is ideal here. It seems like the examples you've given, while all clearly "knowledge", could correspond to quite different things. Possibly the human-understandable concept of "knowledge" is tied up with lots of agent-y, optimizer-y things which make it more difficult to describe in a human-comfortable way at the level of physics (or maybe it's totally possible and you're going to prove me dead wrong in the next few posts!)

My other thought is that knowledge is stable to small perturbations (equivalently, small amounts of uncertainty) of the initial knowledge-accumulating region: a rock on the moon moved a couple of atoms to the left would not end up with the same mutual information with the history of humanity, but a ship moved a couple of atoms to the left would still make the same map of the coastline (a toy numerical sketch of this point appears at the end of this comment).

This brings to mind the idea of abstractions as things which are not "wiped out" by noise or uncertainty between a system and an observer. Lots of the examples of knowledge I can think of seem to be representations of abstractions, but there are some counterexamples (it's possible, quantum effects aside, to have knowledge about the position of a single atom at a certain time).

Optimizers are another class of system which is stable to small perturbations of the starting configuration. I have written about optimizers previously from an information-theoretic point of view (though that was before I realized I only have a slippery grasp on the concept of knowledge). Is a knowledge-accumulating algorithm simply a special class of optimization algorithm? Backpropagation definitely seems to be both, so there's probably significant overlap, but maybe there are some counterexamples I haven't thought of yet.
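Here is the toy numerical sketch referred to above (my own construction, purely illustrative): a tiny simulation in which a "ship" records a map that depends only on the territory, while a "rock" records a state that depends sensitively on its exact placement (a SHA-256 hash stands in for chaotic micro-dynamics). The 16-coastline territory, the plug-in mutual-information estimator, and all names are assumptions of the sketch, not anything from the post.

```python
# Toy illustration of the stability claim: a ship's map keeps its mutual
# information with the territory when the ship is nudged "a couple of atoms",
# while a rock's microstate, which depends chaotically on its exact placement,
# loses part of it.

import hashlib
import math
import random
from collections import Counter

N_TERRITORIES = 16      # e.g. which of 16 possible coastlines exists
NOMINAL_POSITION = 500  # where the recorder (ship or rock) is nominally placed

def mutual_information(pairs: list[tuple[int, int]]) -> float:
    """Plug-in estimate of I(X; Y) in bits from joint samples."""
    n = len(pairs)
    pxy = Counter(pairs)
    px = Counter(x for x, _ in pairs)
    py = Counter(y for _, y in pairs)
    return sum(
        (c / n) * math.log2((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in pxy.items()
    )

def ship_record(territory: int, position: int) -> int:
    # The ship surveys the coastline: its map depends only on the territory,
    # so nudging the ship leaves the record unchanged.
    return territory

def rock_record(territory: int, position: int) -> int:
    # Stand-in for chaotic micro-dynamics: the rock's microstate depends
    # sensitively on its exact placement as well as on the territory.
    digest = hashlib.sha256(f"{territory},{position}".encode()).digest()
    return digest[0] % N_TERRITORIES

def estimate_mi(recorder, perturb: bool, n: int = 50_000, seed: int = 0) -> float:
    rng = random.Random(seed)
    samples = []
    for _ in range(n):
        territory = rng.randrange(N_TERRITORIES)
        position = NOMINAL_POSITION
        if perturb:
            position += rng.choice([-2, -1, 1, 2])  # moved a couple of atoms
        samples.append((recorder(territory, position), territory))
    return mutual_information(samples)

if __name__ == "__main__":
    for name, recorder in [("ship", ship_record), ("rock", rock_record)]:
        exact = estimate_mi(recorder, perturb=False)
        nudged = estimate_mi(recorder, perturb=True)
        print(f"{name}: I(record; territory) = {exact:.2f} bits exactly placed, "
              f"{nudged:.2f} bits after a small nudge")
```

In this toy, nudging the ship leaves its estimated mutual information with the territory unchanged, while nudging the rock reduces it; that asymmetry is what the rock/ship comparison above is pointing at.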

Measuring Learned Optimization in Small Transformer Models · 2y
Hypotheses about Finding Knowledge and One-Shot Causal Entanglements · 4y