If you want to chat, message me!
LW1.0 username Manfred. PhD in condensed matter physics. I am independently thinking and writing about value learning.
The fact that latents are often related to their neighbors definitely seems to support your thesis, but it's not clear to me that you couldn't train a smaller, somewhat-lossy meta-SAE even on an idealized SAE, so long as the data distribution had rare events or rare properties you could thow away cheaply.
You could also play a similar game showing that latents in a larger SAE are "merely" compositions of latents in a smaller SAE.
So basically, I was left wanting a more mathematical perspective of what kinds of properties you're hoping for SAEs (or meta-SAEs) and their latents to have.
It would be interesting to meditate in the question "What kind of training procedure could you use to get a meta-SAE directly?" And I think answering this relies in part on mathematical specification of what you want.
When you showed the decomposition of 'einstein', I also kinda wanted to see what the closest latents were in the object-level SAE to the components of 'einstein' in the meta-SAE.
Did you ever read Lara Buchak's book? Seems related.
Also, I'm not really intuition-pumped by the repeated mugging example. It seems similar to a mugging where Omega only shows up once, but asks you for a recurring payment.
A related issue might be asking if UDT-ish agents who use a computable approximation to the Solomonoff prior are reflectively stable - will they want to "lock out" certain hypotheses that involve lots of computation (e.g. universes provably trying to simulate you via search for simple universes that contain agents who endorse Solomonoff induction). And probably the answer us going to be "it depends," and you can do verbal argumentation for either option.
I worry about erasing self-other distinction of values. If I want an AI to get me a sandwich, I don't want the AI to get itself a sandwich.
It's easy to say "we'll just train the AI to have good performance (and thereby retain some self-other distinctions), and getting itself a sandwich would be bad performance so it won't learn to do that." But this seems untrustworthy for any AI that's learning human values and generalizing them to new situations. In fact the entire point is that you hope it will affect generalization behavior.
I also worry that the instant you try this on sufficiently interesting and general domains, the safety benefit doesn't last - sort of like optimality is the tiger. If some AI needs to learn general problem-solving strategies to deal with tricky and novel tasks, the application of those strategies to a problem where deception is incentivized will rediscover deception, without needing it to be stored in the representation of the model.
Fun read!
This seems like it highlights that it's vital for current fine-tuned models to change the output distribution only a little (e.g. small KL divergence between base model and finetuned model). If they change the distribution a lot, they'll run into unintended optima, but the base distribution serves as a reasonable prior / reasonable set of underlying dynamics for the text to follow when the fine-tuned model isn't "spending KL divergence" to change its path.
Except it's still weird how bad the reward model is - it's not like the reward model was trained based on the behavior it produced (like humans' genetic code was), it's just supervised learning on human reviews.
This was super interesting. I hadn't really thought about the tension between SLT and superposition before, but this is in the middle of it.
Like, there's nothing logically inconsistent with the best local basis for the weights being undercomplete while the best basis for the activations is overcomplete. But if both are true, it seems like the relationship to the data distribution has to be quite special (and potentially fragile).
Nice! There's definitely been this feeling with training SAEs that activation penalty+reconstruction loss is "not actually asking the computer for what we want," leading to fragility. TopK seems like it's a step closer to the ideal - did you subjectively feel confident when starting off large training runs?
Confused about section 5.3.1:
To mitigate this issue, we sum multiple TopK losses with different values of k (Multi-TopK). For
example, using L(k) + L(4k)/8 is enough to obtain a progressive code over all k′ (note however
that training with Multi-TopK does slightly worse than TopK at k). Training with the baseline ReLU
only gives a progressive code up to a value that corresponds to using all positive latents.
Why would we want a progressive code over all hidden activations? If features have different meanings when they're positive versus when they're negative (imagining a sort of Toy Model of Superposition picture where features are a bunch of rays squeezed in around a central point), it seems like if your negative hidden activations are informative something weird is going on.
> [Tells complicated, indirect story about how to wind up with a corrigible AI]
> "Corrigibility is, at its heart, a relatively simple concept"
I'm not saying the default strategy of bumbling forward and hoping that we figure out tool AI as we go has a literal 0% chance of working. But from the tone of this post and the previous table-of-contents post, I was expecting a more direct statement of what sort of functional properties you mean by "corrigibility," and I feel like I got more of a "we'll know it when we see it" approach.
I think that guarding against rogue insiders and external attackers might mostly suffice for guarding against schemers. So if it turns out that it's really hard to convince labs that they need to be robust to schemers, we might be safe from human-level schemers anyway.
This would be nice (maybe buying us a whole several months until slightly-better schemers convince internal users to escalate their privileges), but it only makes sense if labs are blindly obstinate to all persuasion, rather than having a systematic bias. I know this is really just me dissecting a joke, but oh well :P
Wow, that's pessimistic. So in the future you imagine, we could build AIs that promote the good of all humanity, we just won't because if a business built that AI it wouldn't make as much money?
Just riffing on this rather than starting a different comment chain:
If alignment1 is "get AI to follow instructions" (as typically construed in a "good enough" sort of way) and alignment2 is "get AI to do good things and not bad things," (also in a "good enough" sort of way, but with more assumed philosophical sophistication) I basically don't care about anyone's safety plan to get alignment1 except insofar as it's part of a plan to get alignment2.
Philosophical errors/bottlenecks can mean you don't know how to go from 1 to 2. Human safety problems are what stop you from going from 1 to 2 even if you know how, or stop you from trying to find out how.
The checklist has a space for "nebulous future safety case for alignment1," which is totally fine. I just also want a space for "nebulous future safety case for alignment2" at the least (some earlier items explicitly about progressing towards that safety case can be extra credit). Different people might have different ideas about what form a plan for alignment2 takes (will it focus on the structure of the institution using an aligned1 AI, or will it focus on the AI and its training procedure directly?), and where having it should come in the timeline, but I think it should be somewhere.
Part of what makes power corrupting insidious is it seems obvious to humans that we can make everything work out best so long as we have power - that we don't even need to plan for how to get from having control to actually getting good things control was supposed to be an instrumental goal for.