AI Alignment Forum

Jonathan Bostock
1481 · Ω 14200

Posts

Jemist's Shortform · 4y

Comments (sorted by newest)
Rogue internal deployments via external APIs
J Bostock · 12h

> On a more meta-level, I find it concerning that I found this new way in which AIs could set up rogue deployments roughly a year after having first tried to catalog ways in which AIs could set up a rogue deployment.

I think this was predictable, since "Try to catalogue every way in which an AI could set up a rogue deployment" is an argument from exhaustive free association. I also predict that you will think of another one within a year. Conditional on you thinking of a new method, my gut says there's a 1/2 chance it will be something you could have thought of a year ago, and a 1/2 chance it will be based on some new empirical discovery about LLMs made between a year ago and when you think of it (à la "The model triggers emergent misalignment in a distilled mini version of itself").

Lessons from Studying Two-Hop Latent Reasoning
J Bostock · 1mo

RE part 6:

I think there's a more intuitive/abstract framing here. If a model has only seen e_2 in the context of two different facts, it probably won't have formed an abstraction for e_2 in its world model at all. An abstraction is mostly useful as a hub connecting many different inferences, like in the old blegg/rube diagram.

Something which has come up in pretraining, by contrast, will already have been formed into an abstraction with an easy-to-reach-for handle that the model can pull.

This might be testable by fine-tuning on only some of the spokes (or some pairs of spokes) of a blegg/rube diagram, then checking whether the remaining spoke-pairs fill in (see the sketch after the examples below). For example:

"This object is round, so it's a blegg, so it's blue"

"This object is smooth, so it's a blegg, so it's round"

"This object is smooth, so it's a blegg, so it's bouncy"

"This object is round, is it bouncy?"

Something like that might cause "blegg" to be bound up into a single abstraction in the model, with its own representation.
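To make the proposed setup concrete, here is a rough sketch (my own construction, not anything from the paper) of how the fine-tuning sentences and the held-out probe might be generated. The property list, the held-out pair, and all function and file names are illustrative placeholders.

```python
# Rough sketch of the proposed blegg/rube fine-tuning experiment.
# The "spokes" are observable properties; "blegg" is the hub concept.
# We generate training sentences linking most property pairs through the hub,
# hold one directed pair out, and keep a probe question for that held-out pair.

import itertools
import json
import random

SPOKES = ["round", "smooth", "blue", "bouncy", "furred"]  # placeholder properties
HUB = "blegg"

def spoke_pair_sentence(premise: str, conclusion: str) -> str:
    """One training example: premise property -> hub -> conclusion property."""
    return f"This object is {premise}, so it's a {HUB}, so it's {conclusion}."

def build_dataset(held_out: tuple[str, str], seed: int = 0):
    """Return (training examples, probe) with one directed spoke pair held out."""
    rng = random.Random(seed)
    train, probe = [], None
    for premise, conclusion in itertools.permutations(SPOKES, 2):
        if (premise, conclusion) == held_out:
            # Probe: does the model fill in the unseen link via the hub?
            probe = {"prompt": f"This object is {premise}. Is it {conclusion}?"}
        else:
            train.append({"text": spoke_pair_sentence(premise, conclusion)})
    rng.shuffle(train)
    return train, probe

if __name__ == "__main__":
    train, probe = build_dataset(held_out=("round", "bouncy"))
    with open("blegg_finetune.jsonl", "w") as f:  # feed to any fine-tuning pipeline
        for example in train:
            f.write(json.dumps(example) + "\n")
    print(f"{len(train)} training sentences; held-out probe: {probe['prompt']}")
```

The interesting question is then whether a model fine-tuned on the training sentences answers the held-out probe correctly, which would suggest the spokes have been bound into a single hub concept rather than memorised as separate links.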

Overall I consider this work to be weak evidence in favour of multi-step reasoning being an issue, since the later parts show that it definitely can occur (just not when both facts are fine-tuned separately).

Knowledge is not just map/territory resemblance
J Bostock · 4y

I love the depth you're going into with this sequence, and I am very keen to read more. I wonder whether the word "knowledge" is ideal here. It seems like the examples you've given, while all clearly "knowledge", could correspond to quite different things. Possibly the human-understandable concept of "knowledge" is tied up with lots of agent-y, optimizer-y things which make it more difficult to describe in a human-comfortable way at the level of physics (or maybe it's totally possible and you're going to prove me dead wrong in the next few posts!)

My other thought is that knowledge is stable to small perturbations (equivalently, small amounts of uncertainty) of the initial knowledge-accumulating region: a rock on the moon moved a couple of atoms to the left would not end up with the same mutual information with the history of humanity, but a ship moved a couple of atoms to the left would still make the same map of the coastline (a toy numerical sketch of this point appears at the end of this comment).

This brings to mind the idea of abstractions as things which are not "wiped out" by noise or uncertainty between a system and an observer. Lots of the examples of knowledge I can think of seem to be representations of abstractions, but there are some counterexamples (it's possible, quantum effects aside, to have knowledge about the position of a single atom at a certain time).

Optimizers are another class of system which is stable to small perturbations of the starting configuration. I have written about optimizers previously from an information-theoretic point of view (though that was before I realized I only have a slippery grasp on the concept of knowledge). Is a knowledge-accumulating algorithm simply a special class of optimization algorithm? Backpropagation definitely seems to be both, so there's probably significant overlap, but maybe there are some counterexamples I haven't thought of yet.
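Here is the toy numerical sketch referred to above (my own construction, purely illustrative): a tiny simulation in which a "ship" records a map that depends only on the territory, while a "rock" records a state that depends sensitively on its exact placement (a SHA-256 hash stands in for chaotic micro-dynamics). The 16-coastline territory, the plug-in mutual-information estimator, and all names are assumptions of the sketch, not anything from the post.

```python
# Toy illustration of the stability claim: a ship's map keeps its mutual
# information with the territory when the ship is nudged "a couple of atoms",
# while a rock's microstate, which depends chaotically on its exact placement,
# loses part of it.

import hashlib
import math
import random
from collections import Counter

N_TERRITORIES = 16      # e.g. which of 16 possible coastlines exists
NOMINAL_POSITION = 500  # where the recorder (ship or rock) is nominally placed

def mutual_information(pairs: list[tuple[int, int]]) -> float:
    """Plug-in estimate of I(X; Y) in bits from joint samples."""
    n = len(pairs)
    pxy = Counter(pairs)
    px = Counter(x for x, _ in pairs)
    py = Counter(y for _, y in pairs)
    return sum(
        (c / n) * math.log2((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in pxy.items()
    )

def ship_record(territory: int, position: int) -> int:
    # The ship surveys the coastline: its map depends only on the territory,
    # so nudging the ship leaves the record unchanged.
    return territory

def rock_record(territory: int, position: int) -> int:
    # Stand-in for chaotic micro-dynamics: the rock's microstate depends
    # sensitively on its exact placement as well as on the territory.
    digest = hashlib.sha256(f"{territory},{position}".encode()).digest()
    return digest[0] % N_TERRITORIES

def estimate_mi(recorder, perturb: bool, n: int = 50_000, seed: int = 0) -> float:
    rng = random.Random(seed)
    samples = []
    for _ in range(n):
        territory = rng.randrange(N_TERRITORIES)
        position = NOMINAL_POSITION
        if perturb:
            position += rng.choice([-2, -1, 1, 2])  # moved a couple of atoms
        samples.append((recorder(territory, position), territory))
    return mutual_information(samples)

if __name__ == "__main__":
    for name, recorder in [("ship", ship_record), ("rock", rock_record)]:
        exact = estimate_mi(recorder, perturb=False)
        nudged = estimate_mi(recorder, perturb=True)
        print(f"{name}: I(record; territory) = {exact:.2f} bits exactly placed, "
              f"{nudged:.2f} bits after a small nudge")
```

In this toy, nudging the ship leaves its estimated mutual information with the territory unchanged, while nudging the rock reduces it; that asymmetry is what the rock/ship comparison above is pointing at.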

Measuring Learned Optimization in Small Transformer Models · 2y
Hypotheses about Finding Knowledge and One-Shot Causal Entanglements · 4y