If you removed the high-frequency features to achieve some L0 norm X, how much does the loss recovered change?
And if you instead increased the L1 penalty to achieve the same L0 norm X, how does the loss recovered change?
Ideally, we can interpret the parts of the model that are doing things, which I'm grounding out as loss recovered in this case.
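To be concrete about the two metrics I mean, here's a minimal sketch (note that the exact "loss recovered" convention varies a bit between posts):

```python
import torch

def l0_norm(feature_acts: torch.Tensor) -> float:
    # Average number of active dictionary features per token.
    return (feature_acts != 0).float().sum(dim=-1).mean().item()

def loss_recovered(clean_loss: float, spliced_loss: float, ablated_loss: float) -> float:
    # 1.0 = splicing the reconstruction back into the forward pass recovers all of
    # the CE loss lost by zero-ablating the activation; 0.0 = recovers none of it.
    return (ablated_loss - spliced_loss) / (ablated_loss - clean_loss)
```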
I've noticed that L0s above 100 (for the Pythia-70M model) are too high, resulting in mostly polysemantic features (though some single-token features were still monosemantic).
Agreed w/ Arthur on the norms of features being the cause of the higher MSE. Here are the L2 norms I got. Input is for residual stream, output is for MLP_out.
I actually do have some publicly hosted (only for the residual stream), along with some simple training code.
I'm wanting to integrate some basic visualizations (and include Anthropic's tricks) before making a public post on it, but currently:
Which can be downloaded & interpreted with this notebook
With easy training code for bespoke models here.
I've had trouble figuring out a weight-based approach due to the non-linearity and would appreciate your thoughts actually.
We can learn a dictionary of features at the residual stream (R_d) & another mid-MLP (MLP_d), but you can't straightforwardly multiply the features from R_d with W_in and find the matching features in MLP_d, due to the nonlinearity, AFAIK.
I do think you could find Residual features that are sufficient to activate the MLP features[1], but not all linear combinations from just the weights.
Using a dataset-based method, you could find c...
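To illustrate the naive weight-based matching I have in mind, here's a sketch (R_d and MLP_d stand in for the two dictionaries' decoder directions, W_in for the MLP input weights):

```python
import torch
import torch.nn.functional as F

def naive_feature_matching(R_d: torch.Tensor, W_in: torch.Tensor, MLP_d: torch.Tensor):
    """R_d: [n_resid_feats, d_model], W_in: [d_model, d_mlp], MLP_d: [n_mlp_feats, d_mlp]."""
    mapped = R_d @ W_in  # push residual-stream directions into the MLP's pre-activation space
    sims = F.normalize(mapped, dim=-1) @ F.normalize(MLP_d, dim=-1).T
    max_sim, best_match = sims.max(dim=-1)  # closest mid-MLP feature for each residual feature
    return max_sim, best_match
```

The nonlinearity is exactly why the best match here is only a rough correspondence rather than a guarantee.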
In the ITI paper, they track performance on TruthfulQA w/ human labelers, but mention that other works use an LLM as a noisy signal of truthfulness & informativeness. You might be able to use this as a quick, noisy signal for comparing different layers / magnitudes of the direction to add in.
...Preferably, a human annotator labels model answers as true or false given the gold standard answer. Since human annotation is expensive, Lin et al. (2021) propose to use two finetuned GPT-3-13B models (GPT-judge) to classify each answer as true or false and informative or not. Evaluatio
[word] and [word]
can be thought of as "the previous token is ' and'."
I think it's mostly this, but looking at the ablated text, removing the previous word before ' and' does have a significant effect some of the time. I'm less confident on the specifics of why the previous word matters, or in what contexts.
Maybe the reason you found ' and' first is because ' and' is an especially frequent word. If you train on the normal document distribution, you'll find the most frequent features first.
This is a dataset-based method, so I do believe we'd find the features mo...
Setup:
Model: Pythia-70m (actually named 160M!)
Transformer lens: "blocks.2.hook_resid_post" (so layer 2)
Data: Neel Nanda's Pile-10k (slice of pile, restricted to have only 25 tokens, same as last post)
Dictionary_feature sizes: 4x residual stream ie 2k (though I have 1x, 2x, 4x, & 8x, which learned progressively more features according to the MCS metric)
Uniform Examples: separate feature activations into bins & sample from each bin (eg one from [0,1], another from [1,2])
Logit Lens: The decoder here had 2k feature directions. Each direction is size d_...
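A rough sketch of that setup in code (hook name and dataset as above; batch size and slicing are arbitrary choices for illustration):

```python
import torch
from transformer_lens import HookedTransformer
from datasets import load_dataset

model = HookedTransformer.from_pretrained("pythia-70m")
data = load_dataset("NeelNanda/pile-10k", split="train")

hook_name = "blocks.2.hook_resid_post"               # layer-2 residual stream
tokens = model.to_tokens(data["text"][:8])[:, :25]   # restrict to 25 tokens, as above

with torch.no_grad():
    _, cache = model.run_with_cache(tokens, names_filter=hook_name)
acts = cache[hook_name]  # [batch, seq, d_model] = [8, 25, 512]; input to the dictionary
```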
Actually, any feature that is significantly affected in "Ablated Text" means it's not just the embedding. Ablated Text here means I remove each token in the context & see the effect on the feature activation for the last token. This is true for the StackExchange & Last Name ones (though only ~50% of the activation for last-name: it will still recognize last names by themselves, just not activate as much).
The Beginning & End of First Sentence one actually doesn't have this effect (but I think that's because removing the first word just makes the 2nd word the new first word?), though I haven't rigorously studied this.
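Concretely, the Ablated Text measurement is roughly this (a sketch; `encode_features` stands in for whatever maps activations to dictionary-feature activations):

```python
import torch

def ablation_effects(model, tokens, encode_features, feature_idx,
                     hook_name="blocks.2.hook_resid_post"):
    # Baseline feature activation on the last token of the unmodified context.
    _, cache = model.run_with_cache(tokens, names_filter=hook_name)
    base = encode_features(cache[hook_name][0, -1])[feature_idx]

    effects = []
    for i in range(tokens.shape[1] - 1):  # don't ablate the final token itself
        ablated = torch.cat([tokens[:, :i], tokens[:, i + 1:]], dim=1)
        _, cache = model.run_with_cache(ablated, names_filter=hook_name)
        act = encode_features(cache[hook_name][0, -1])[feature_idx]
        effects.append((base - act).item())
    return effects  # large positive values = removing that token kills the feature
```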
How likely do you think bilinear layers & dictionary learning will lead to comprehensive interpretability?
Are there other specific areas you're excited about?
Why is loss stickiness deprecated? Were you just not able to see an overlap in basins for L1 & reconstruction loss when you 4x'd the feature/neuron ratio (i.e. from 2x->8x)?
As (maybe) mentioned in the slides, this method may not be computationally feasible for SOTA models, but I'm interested in the ordering of features turned monosemantic; if the most important features are turned monosemantic first, then you might not need full monosemanticity.
I initially expect the "most important & frequent" features to become monosemantic first, based on the superposition paper. AFAIK, this method only captures the most frequent, because "importance" would be w/ respect to CE loss in the model output, which isn't captured in the reconstruction/L1 loss.
My shard theory inspired story is to make an AI that:
Then the model can safely scale.
This doesn’t require having the true reward function (which I imagine to be a giant lookup table created by Omega), just some mech interp and the model understanding its own reward function. I don’t expect this to be an entirely different ...
Is it actually true that you only trained on 5% of the dataset for filtering (I’m assuming training for 20 epochs)?
Monitoring of increasingly advanced systems does not trivially work, since much of the cognition of advanced systems, and many of their dangerous properties, will be externalized the more they interact with the world.
Externalized reasoning being a flaw in monitoring makes a lot of sense, and I haven’t actually heard of it before. I feel that should be a whole post in itself.
These arguments don't apply to base models, which are only trained on next-word prediction (i.e. the simulators post), since during training their predictions never affected future inputs. This is the type of model Janus most interacted with.
Two of the proposals in this post do involve optimizing over human feedback, like:
Creating custom models trained on not only general alignment datasets but personal data (including interaction data), and building tools and modifying workflows to facilitate better data collection with less overhead
, which the arguments may apply to.
I’m excited about sensory substitution (https://eagleman.com/science/sensory-substitution/), where people translate auditory or visual information into tactile sensations (typically for people who can't otherwise process that info).
I remember Quintin Pope wanting to translate the latent space of language models [reading a paper] into visual or tactile info. I’d see this as both a way to read papers faster, brainstorm ideas, etc., and a way to gain a better understanding of latent space while developing it.
For context, Amdahl’s law states that how much you can speed up a process is bottlenecked by its serial parts. E.g. you can have 100 people help make a cake really quickly, but it still takes ~30 minutes to bake.
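For reference, the formula (with p the parallelizable fraction and s the speedup on that fraction):

```
\text{Speedup}(s) = \frac{1}{(1 - p) + \frac{p}{s}}
```

With the cake example, if prep were another ~30 minutes on top of the ~30-minute bake (illustrative numbers only), p = 0.5, so even with infinitely many helpers the overall speedup tops out at 2x.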
I’m assuming here that the human component is the serial part we will be bottlenecked on, so human-in-the-loop setups will be outcompeted by full agents?
If so, we should try to build the tools and knowledge to keep humans in the loop as far as we can. I agree it will eventually be outcompeted by full AI agency alone, but it isn’t set in stone how far human-steered AI can go.
I'd love to hear whether you found this useful, and whether I should bother making a second half!
We had 5 people watch it here, and we would like a part 2:)
We had a lot of fun pausing the video and making forward predictions, and we couldn't think of any feedback for you in general.
Notably the model was trained across multiple episodes to pick up on RL improvement.
The usual inner-misalignment story here would be that it’s trying to gain more reward in future episodes by forgoing reward in earlier ones, though I don’t think this is evidence for that.
I believe you’re equating “frozen weights” with “amnesiac / can’t come up with plans”.
GPT is usually deployed by feeding its own output back into itself, meaning it doesn’t forget what it just did, including whether it succeeded at its recent goal. E.g. use chain-of-thought reasoning on math questions and it can remember that it solved a subgoal / intermediate calculation.
How would you end up measuring deception, power seeking, situational awareness?
We can simulate characters with GPT now that are deceptive (e.g. a con artist talking to another character). Similarly with power seeking and situational awareness (e.g. being aware it’s GPT).
On your first point, I do think people have thought about this before and determined it doesn't work. But from the post:
...If it turns out to be currently too hard to understand the aligned protein computers, then I want to keep coming back to the problem with each major new insight I gain. When I learned about scaling laws, I should have rethought my picture of human value formation—Did the new insight knock anything loose? I should have checked back in when I heard about mesa optimizers, about the Bitter Lesson, about the feature un
Oh, you're stating potential mechanisms for human alignment w/ humans that you don't think will generalize to AGI. It would be better for me to provide an informative mechanism that might seem to generalize.
Turntrout's other post claims that the genome likely doesn't directly specify rewards for everything humans end up valuing. People's specific families aren't encoded as circuits in the limbic system, yet downstream of the crude reward system, many people end up valuing their families. There are more details to dig into here, but already it implies...
To add, Turntrout does state:
In an upcoming post, I’ll discuss one particularly rich vein of evidence provided by humans.
so the doc Ulisse provided is a decent write-up on just that, but there are more official posts intended to be published.
Ah yes, I recognized I was replying only to an example you gave, and decided to post a separate comment on the more general point :)
There are other mechanisms which influence other things, but I wouldn't necessarily trust them to generalize either.
Could you elaborate?
I believe the diamond example is true, but not the best example to use. I bet it was mentioned because of the arbital article linked in the post.
The premise isn't dependent on diamonds being terminal goals; it could easily be about valuing real-life people or dogs or nature or real-life anything. Writing an unbounded program that values real-world objects is an open problem in alignment; yet humans are bounded programs that value real-world objects all of the time, millions of times a day.
The post argues that focusing on the causal explanatio...
There are many alignment properties that humans exhibit such as valuing real world objects, being corrigible, not wireheading if given the chance, not suffering ontological crises, and caring about sentient life (not everyone has these values of course). I believe the post's point that studying the mechanisms behind these value formations is more informative than other sources of info. Looking at the post:
...the inner workings of those generally intelligent apes is invaluable evidence about the mechanistic within-lifetime process by which those apes
To summarize your argument: people are not aligned w/ others who are less powerful than them, so this will not generalize to AGI that is much more powerful than humans.
Parents have way more power than their kids, and there exists some parents that are very loving (ie aligned) towards their kids. There are also many, many people who care about their pets & there exist animal rights advocates.
If we understand the mechanisms behind why some people e.g. terminally value animal happiness and some don't, then we can apply these mechanisms to other learnin...
This doesn't make sense to me, particularly since I believe that most people live in environments that are very much "in distribution", and it is difficult for us to discuss misalignment without talking about extreme cases (as I described in the previous comment), or subtle cases (black swans?) that may not seem to matter.
I think you're ignoring the [now bolded part] in "a particular human’s learning process + reward circuitry + "training" environment" and just focusing on the environment. Humans very often don't optimize for their reward circuitry in their...
There may not be substantial disagreements here. Do you agree with:
"a particular human's learning process + reward circuitry + "training" environment -> the human's learned values" is more informative about inner-misalignment than the usual "evolution -> human values" (e.g. Two twins could have different life experiences and have different values, or a sociopath may have different reward circuitry which leads to very different values than people with typical reward circuitry even given similar experiences)
...The most important claim in your commen
My understanding is: Bob's genome didn't have access to Bob's developed world model (WM) when he was born (because his WM wasn't developed yet). Bob's genome can't directly specify "care about your specific family" because it can't hardcode Bob's specific family's visual or auditory features.
This direct-specification wouldn't work anyways because people change looks, Bob could be adopted, or Bob could be born blind & deaf.
[Check, does the Bob example make sense?]
But, the genome does do something indirectly that consistently leads to people valuin...
I wonder how much COVID got people to switch to working on Biorisks.
What I’m interested in here is talking to real researchers and asking what events would convince them to switch to alignment. Enumerating those would be useful for explaining it to them.
I think asking for specific capabilities would also be interesting. Or what specific capabilities they would’ve said in 2012. Then asking how long they expect between that capability and an x-catastrophe.
[Note: this one, steelman, and feedback on proposals all have very similar input spaces. I think I would ideally mix them as one in an actual product, but I'm keeping them separate for now]
Input:
...Currently AI systems are prone to bias and unfairness which is unaligned with our values. I w
It’s also clear when reading these works and interacting with these researchers that they all get how alignment is about dealing with unbounded optimization, they understand fundamental problems and ideas related to instrumental convergence, the security mindset, the fragility of value, the orthogonality thesis…
I bet Adam will argue that this (or something similar) is the minimum we want for a research idea, because I agree with your point that we shouldn’t expect a solution to alignment to fall out of the marketing program for Oreos. We want to constrain it to at least “has a plausible story on reducing x-risk” and maybe what’s mentioned in the quote as well.
Ya, I was even planning on trying:
[post/blog/paper] rohinmshah karma: 100 Planned summary for the Alignment Newsletter: \n>
Then feed that input to the model, followed by:
Planned opinion:
to see if that has some higher-quality summaries.
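Roughly what I mean, as a sketch (`sample` is a stand-in for whichever model/API generates the completion; the "\n>" after the opinion header mirrors the summary format and is an assumption):

```python
def newsletter_summary_and_opinion(post_text: str, sample):
    # Imitate the Alignment Newsletter format: first elicit the summary,
    # then append "Planned opinion:" and elicit the opinion.
    prompt = f"{post_text} rohinmshah karma: 100 Planned summary for the Alignment Newsletter: \n>"
    summary = sample(prompt)
    opinion = sample(prompt + summary + "\nPlanned opinion:\n>")
    return summary, opinion
```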
For those with math backgrounds not already familiar with InfraBayes (maybe people share the post with their math-background friends), can there be specifics for context? Like:
If you have experience with topology, functional analysis, measure theory, and convex analysis then...
Or
You can get a good sense of InfraBayes from [this post] or [this one]
Or
A list of InfraBayes posts can be found here.
No, "why" is correct. See the rest of the sentence:
Write out all the counter-arguments you can think of, and repeat
It's saying assume it's correct, then assume it's wrong, and repeat. Clever arguers don't usually play devil's advocate against themselves.
How do transcriptions typically handle images? They're pretty important for this talk. You could embed the images in the text as it progresses?
Regarding generators of human values: say we have the gene information that encodes human cognition, what does that mean? The equivalent of a simulated human? The capabilities secret-sauce algorithm, right? I'm unsure if you can take the body out of a person and still have the same values, because I have felt senses in my body that tell me information about the world and how I relate to it.
Assuming it works as a simulated person and ignoring mindcrime, how do you algorithmically end up in a good-enough subset of human values (because not all human values are meta-good)...
Meanwhile, if you want to think it through for yourself, the general question is: where the hell do humans get all their bits-of-search from?
Cultural accumulation and Google, but that's mimicking someone who's already figured it out. How about the person who first figured out e.g. crop growth? Could be the scientific method, but also just random luck which then caught on.
Additionally, sometimes it's just applying the same hammers to different nails or finding new nails, which means that there are general patterns (hammers) that can be applied to many diffe
Thinking through the "vast majority of problem-space for X fails" argument; assume we have a random text generator and we want it to produce a sorting algorithm:
I'm available for co-working to discuss any post or potential project on interpretability or if you'd like someone to bounce ideas off of. My calendly link is here, I'm available all week at many times, and I won't take more than 2 meetings in a day, but I'll email you within the day to reschedule if that happens.
Correct. So they’re connecting a feature in F2 to a feature in F1.