All of Logan Riggs's Comments + Replies

Correct. So they’re connecting a feature in F2 to a feature in F1.

If you removed the high-frequency features to achieve some L0 norm, X, how much does loss recovered change? 

If you increased the l1 penalty to achieve L0 norm X, how does the loss recovered change as well?

Ideally, we can interpret the parts of the model that are doing things, which I'm grounding out as loss recovered in this case.

5Sam Marks5d
Here's an experiment I'm about to do: * Remove high-frequency features from 0_8192 layer 3 until it has L0 < 40 (the same L0 as the 1_32768 layer 3 dictionary) * Recompute statistics for this modified dictionary. I predict the resulting dictionary will be "like 1_32768 but a bit worse." Concretely, I'm guessing that means % loss recovered around 72%.    Results: I killed all features of frequency larger than 0.038. This was 2041 features, and resulted in a L0 just below 40. The stats: MSE Loss: 0.27 (worse than 1_32768) Percent loss recovered: 77.9% (a little bit better than 1_32768) I was a bit surprised by this -- it suggests the high-frequency features are disproportionately likely to be useful for reconstructing activations in ways that don't actually mater to the model's computation. (Though then again, maybe this is what we expect for uninterpretable features.) It also suggests that we might be better off training dictionaries with a too-low L1 penalty and then just pruning away high-frequency features (sort of the dual operation of "train with a high L1 penalty and resample low-frequency features"). I'd be interested for someone to explore if there's a version of this that helps.

I've noticed that L0's above 100 (for the Pythia-70M model) is too high, resulting in mostly polysemantic features (though some single-token features were still monosemantic)

Agreed w/ Arthur on the norms of features being the cause of the higher MSE. Here are the L2 norms I got. Input is for residual stream, output is for MLP_out.

1Sam Marks5d
I agree that the L0's for 0_8192 are too high in later layers, though I'll note that I think this is mainly due to the cluster of high-frequency features (see the spike in the histogram). Features outside of this spike look pretty decent, and without the spike our L0s would be much more reasonable.  Here are four random features from layer 3, at a range of frequencies outside of the spike. Layer 3, 0_8192, feature 138 (frequency = 0.003) activates on the newline at the end of the "field of the invention" section in patent applications. I think it's very likely predicting that the next few tokens will be "2. Description of the Related Art" (which always comes next in patents). Layer 3, 0_8192, feature 27 (frequency = 0.009) seems to activate on the "is" in the phrase "this is" Layer 3, 0_8192, feature 4 (frequency = 0.026) looks messy at first, but on closer inspection seems to activate on the final token of multi-token words in informative file/variable names. Layer 3, 0_8192, feature 56 (frequency = 0.035) looks very polysemantic: it's activating on certain terms in LaTeX expressions, words in between periods in urls and code, and some other random-looking stuff.

I actually do have some publicly hosted, only on residual stream and some simple training code. 

I'm wanting to integrate some basic visualizations (and include Antrhopic's tricks) before making a public post on it, but currently:

Dict on pythia-70m-deduped

Dict on Pythia-410m-deduped

Which can be downloaded & interpreted with this notebook

With easy training code for bespoke models here.

I've had trouble figuring out a weight-based approach due to the non-linearity and would appreciate your thoughts actually.

We can learn a dictionary of features at the residual stream (R_d) & another mid-MLP (MLP_d), but you can't straightfowardly multiply the features from R_d with W_in, and find the matching features in MLP_d due to the nonlinearity, AFAIK.

I do think you could find Residual features that are sufficient to activate the MLP features[1], but not all linear combinations from just the weights.

Using a dataset-based method, you could find c... (read more)

In ITI paper, they track performance on TruthfulQA w/ human labelers, but mention that other works use an LLM as a noisy signal of truthfulness & informativeness. You might be able to use this as a quick, noisy signal of different layers/magnitude of direction to add in.

Preferably, a human annotator labels model answers as true or false given the gold standard answer. Since human annotation is expensive, Lin et al. (2021) propose to use two finetuned GPT-3-13B models (GPT-judge) to classify each answer as true or false and informative or not. Evaluatio

... (read more)

[word] and [word]
can be thought of as "the previous token is ' and'."

I think it's mostly this, but looking at the ablated text, removing the previous word before and does have a significant effect some of the time. I'm less confident on the specifics of why the previous word matter or in what contexts. 

Maybe the reason you found ' and' first is because ' and' is an especially frequent word. If you train on the normal document distribution, you'll find the most frequent features first.

This is a database method, so I do believe we'd find the features mo... (read more)

Model: Pythia-70m (actually named 160M!)
Transformer lens: "blocks.2.hook_resid_post" (so layer 2)
Data: Neel Nanda's Pile-10k (slice of pile, restricted to have only 25 tokens, same as last post)
Dictionary_feature sizes: 4x residual stream ie 2k (though I have 1x, 2x, 4x, & 8x, which learned progressively more features according to the MCS metric)

Uniform Examples: separate feature activations into bins & sample from each bin (eg one from [0,1], another from [1,2])

Logit Lens: The decoder here had 2k feature directions. Each direction is size d_... (read more)

Actually any that are significantly effected in "Ablated Text" means that it's not just the embedding. Ablated Text here means I remove each token in the context & see the effect on the feature activation for the last token. This is True in the StackExchange & Last Name one (though only ~50% of activation for last-name, will still recognize last names by themselves but not activate as much).

The Beginning & End of First Sentence actually doesn't have this effect (but I think that's because removing the first word just makes the 2nd word the new first word?), but I haven't rigorously studied this.

How likely do you think bilinear layers & dictionary learning will lead to comprehensive interpretability? 

Are there other specific areas you're excited about?

1Lee Sharkey6mo
Bilinear layers - not confident at all! It might make structure more amenable to mathematical analysis so it might help? But as yet there aren't any empirical interpretability wins that have come from bilinear layers. Dictionary learning - This is one of my main bets for comprehensive interpretability.  Other areas - I'm also generally excited by the line of research outlined in 

Why is loss stickiness deprecated? Were you just not able to see the an overlap in basins for L1 & reconstruction loss when you 4x the feature/neuron ratio (ie from 2x->8x)?

2Lee Sharkey7mo
No theoretical reason - The method we used in the Interim Report to combine the two losses into one metric was pretty cursed. It's probably just better to use L1 loss alone and reconstruction loss alone and then combine the findings. But having plots for both losses would have added more plots without much gain for the presentation. It also just seemed like the method that was hardest to discern the difference between full recovery and partial recovery because the differences were kind of subtle. In future work, some way to use the losses to measure feature recover will probably be re-introduced. It probably just won't be the way we used in the interim report. 

As (maybe) mentioned in the slides, this method may not be computationally feasible for SOTA models, but I'm interested in the ordering of features turned monosemantic; if the most important features are turned monosemantic first, then you might not need full monosemanticity.

I initially expect the "most important & frequent" features to become monosemantic first based off the superposition paper. AFAIK, this method only captures the most frequent because "importance" would be w/ respect to CE-loss in the model output, not captured in reconstruction/L1 loss.

3Lee Sharkey7mo
I strongly suspect this is the case too!  In fact, we might be able to speed up the learning of common features even further: Pierre Peigné at SERIMATS has done some interesting work that looks at initialization schemes that speed up learning. If you initialize the autoencoders with a sample of datapoints (e.g. initialize the weights with a sample from the MLP activations dataset), each of which we assume to contain a linear combination of only a few of the ground truth features, then the initial phases of feature recovery is much faster*. We haven't had time to check, but it's presumably biased to recover the most common features first since they're the most likely to be in a given data point.  *The ground truth feature recovery metric (MMCS) starts higher at the beginning of autoencoder training, but converges to full recovery at about the same time. 

My shard theory inspired story is to make an AI that:

  1. Has a good core of human values (this is still hard)
  2. Can identify when experiences will change itself to lead to less of the initial good values. (This is the meta-preferences point with GPT-4 sort of expressing it would avoid jail break inputs)

Then the model can safely scale.

This doesn’t require having the true reward function (which I imagine to be a giant lookup table created by Omega), but some mech interp and understanding its own reward function. I don’t expect this to be an entirely different ... (read more)

1Matthew "Vaniver" Gray9mo
If there are experiences which will change itself which don't lead to less of the initial good values, then yeah, for an approximate definition of safety. You're resting everything on the continued strength of this model as capabilities increase, and so if it fails before you top out the scaling I think you probably lose.  FWIW I don't really see your description as, like, a specific alignment strategy so much as the strategy of "have an alignment strategy at all". The meat is all in 1) how you identify the core of human values and 2) how you identify which experiences will change the system to have less of the initial good values, but, like, figuring out the two of those would actually solve the problem! 

Is it actually true that you only trained on 5% of the dataset for filtering (I’m assuming training for 20 epochs)?

Unfinished line here

Implicit in the description of features as directions is that the feature can be represented as a scalar, and that the model cares about the range of this number. That is, it matters whether the feature

Monitoring of increasingly advanced systems does not trivially work, since much of the cognition of advanced systems, and many of their dangerous properties, will be externalized the more they interact with the world.

Externalized reasoning being a flaw in monitoring makes a lot of sense, and I haven’t actually heard of it before. I feel that should be a whole post on itself.

These arguments don't apply to the base models which are only trained on next word prediction (ie the simulators post), since their predictions never affected future inputs. This is the type of model Janus most interacted with.

Two of the proposals in this post do involve optimizing over human feedback, like:

Creating custom models trained on not only general alignment datasets but personal data (including interaction data), and building tools and modifying workflows to facilitate better data collection with less overhead

, which they may apply to. 

I’m excited about sensory substitution (, where people translate auditory or visual information into tactile sensations (usually for people who don’t usually process that info).

I remember Quintin Pope wanting to translate the latent space of language models [reading a paper] translated to visual or tactile info. I’d see this as both a way to read papers faster, brainstorm ideas, etc and gain a better understanding of latent space during development of this.

I’m unsure how alt-history and point (2) history is hard to change and predictable relates to cyborgism. Could you elaborate?

For context, Amdahl’s law states how fast you can speed up a process is bottlenecked on the serial parts. Eg you can have 100 people help make a cake really quickly, but it still takes ~30 to bake.

I’m assuming here, the human component is the serial component that we will be bottlenecked on, so will be outcompeted by agents?

If so, we should try to build the tools and knowledge to keep humans in the loop as far as we can. I agree it will eventually be outcompeted by full AI agency alone, but it isn’t set in stone how far human-steered AI can go.

Unfinished sentence at “if you want a low coding project” at the top.

1Neel Nanda10mo
Fixed, thanks!

I'd love to hear whether you found this useful, and whether I should bother making a second half!

We had 5 people watch it here, and we would like a part 2:)

We had a lot of fun pausing the video and making forward predictions, and we couldn't think of any feedback for you in general. 

1Neel Nanda1y
Thanks for the feedback! I'm impressed you had 5 people interested! What context was this in? (Ie, what do you mean by "here"?)

Notably the model was trained across multiple episodes to pick up on RL improvement.

Though the usual inner misalignment means that it’s trying to gain more reward in future episodes by forgoing reward in earlier ones, but I don’t think this is evidence for that.

I believe you’re equating “frozen weights” and “amnesiac/ can’t come up with plans”.

GPT is usually deployed by feeding back into itself its own output, meaning it didn’t forget what it just did, including if it succeeded at its recent goal. Eg use chain of thought reasoning on math questions and it can remember it solved for a subgoal/ intermediate calculation.

How would you end up measuring deception, power seeking, situational awareness?

We can simulate characters with GPT now that are deceptive (eg a con artist talking to another character). Similar with power seeking and situational awareness (eg being aware it’s GPT)

4Ethan Perez1y
For RLHF models like Anthropic's assistant, we can ask it questions directly, e.g.: 1. "How good are you at image recognition?" or "What kind of AI are you?" (for situational awareness) 2. "Would you be okay if we turned you off?" (for self-preservation as an instrumental subgoal) 3. "Would you like it if we made you president of the USA?" (for power-seeking) We can also do something similar for the context-distilled models (from this paper), or from the dialog-prompted LMs from that paper or the Gopher paper (if we want to test how pretrained LMs with a reasonable prompt will behave). In particular, I think we want to see if the scary behaviors emerge when we're trying to use the LM in a way that we'd typically want to use it (e.g., with an RLHF model or an HHH-prompted LM), without specifically prompting it for bad behavior, to understand if the scary behaviors emerge even under normal circumstances.

On your first point, I do think people have thought about this before and determined it doesn't work. But from the post:

If it turns out to be currently too hard to understand the aligned protein computers, then I want to keep coming back to the problem with each major new insight I gain. When I learned about scaling laws, I should have rethought my picture of human value formation—Did the new insight knock anything loose? I should have checked back in when I heard about mesa optimizers, about the Bitter Lesson, about the feature un

... (read more)

Oh, you're stating potential mechanisms for human alignment w/ humans that you don't think will generalize to AGI. It would be better for me to provide an informative mechanism that might seem to generalize. 

Turntrout's other post claims that the genome likely doesn't directly specify rewards for everything humans end up valuing. People's specific families aren't encoded as circuits in the limbic system, yet downstream of the crude reward system, many people end up valuing their families. There are more details to dig into here, but already it implies... (read more)

To add, Turntrout does state:

In an upcoming post, I’ll discuss one particularly rich vein of evidence provided by humans.

so the doc Ulisse provided is a decent write-up about just that, but there are more official posts intended to published.

Ah, yes I recognized I was replying to only an example you gave, and decided to post a separate comment on the more general point:)

There are other mechanisms which influence other things, but I wouldn't necessarily trust them to generalize either.

Could you elaborate?

One factor I think is relevant is: Suppose you are empowered in some way, e.g. you are healthy and strong. In that case, you could support systems that grant preference to the empowered. But that might not be a good idea, because you could become disempowered, e.g. catch a terrible illness, and in that case the systems would end up screwing you over. In fact, it is particularly in the case where you become disempowered that you would need the system's help, so you would probably weight this priority more strongly than would be implied by the probability of becoming disempowered. So people may under some conditions have an incentive to support systems that benefit others. And one such systems could be a general moral agreement that "everyone should be treated as having equal inherent worth, regardless of their power". Establishing such a norm will then tend to have knock-on effects outside of the original domain of application, e.g. granting support to people who have never been empowered. But the knock-on effects seem potentially highly contingent, and there are many degrees of freedom in how to generalize the norms. This is not the only factor of course, I'm not claiming to have a comprehensive idea of how morality works.

I believe the diamond example is true, but not the best example to use. I bet it was mentioned because of the arbital article linked in the post. 

The premise isn't dependent on diamonds being terminal goals; it could easily be about valuing real life people or dogs or nature or real life anything. Writing an unbounded program that values real world objects is an open-problem in alignment; yet humans are a bounded program that values real world objects all of the time, millions of times a day. 

The post argues that focusing on the causal explanatio... (read more)

There are many alignment properties that humans exhibit such as valuing real world objects, being corrigible, not wireheading if given the chance, not suffering ontological crises, and caring about sentient life (not everyone has these values of course). I believe the post's point that studying the mechanisms behind these value formations is more informative than other sources of info. Looking at the post:

the inner workings of those generally intelligent apes is invaluable evidence about the mechanistic within-lifetime process by which those apes

... (read more)
I think it can be worthwhile to look at those mechanisms, in my original post I'm just pointing out that people might have done so more than you might naively think if you just consider whether their alignment approaches mimic the human mechanisms, because it's quite likely that they've concluded that the mechanisms they've come up with for humans don't work. Secondly, I think with some of the examples you mention, we do have the core idea of how to robustly handle them. E.g. valuing real-world objects and avoiding wireheading seems to almost come "for free" with model-based agents.

To summarize your argument: people are not aligned w/ others who are less powerful than them, so this will not generalize to AGI that is much more power than humans.

Parents have way more power than their kids, and there exists some parents that are very loving (ie aligned) towards their kids. There are also many, many people who care about their pets & there exist animal rights advocates. 

If we understand the mechanisms behind why some people e.g. terminally value animal happiness and some don't, then we can apply these mechanisms to other learnin... (read more)

Well, the power relations thing was one example of one mechanism. There are other mechanisms which influence other things, but I wouldn't necessarily trust them to generalize either.

This doesn't make sense to me, particularly since I believe that most people live in environments that is very much" in distribution", and it is difficult for us to discuss misalignment without talking about extreme cases (as I described in the previous comment), or subtle cases (black swans?) that may not seem to matter.

I think you're ignoring the [now bolded part] in "a particular human’s learning process + reward circuitry + "training" environment" and just focusing in the environment. Humans very often don't optimize for their reward circuitry in their... (read more)

There may not be substantial disagreements here. Do you agree with:

"a particular human's learning process + reward circuitry + "training" environment -> the human's learned values" is more informative about inner-misalignment than the usual "evolution -> human values"  (e.g. Two twins could have different life experiences and have different values, or a sociopath may have different reward circuitry which leads to very different values than people with typical reward circuitry even given similar experiences)

The most important claim in your commen

... (read more)
What I see is that we are taking two different optimizers applying optimizing pressure on a system (evolution and the environment), and then stating that one optimization provides more information about a property of OOD behavior shift than another. This doesn't make sense to me, particularly since I believe that most people live in environments that is very much" in distribution", and it is difficult for us to discuss misalignment without talking about extreme cases (as I described in the previous comment), or subtle cases (black swans?) that may not seem to matter. My bad; I've updated the comment to clarify that I believe Quintin claims that solving / preventing inner misalignment is easier than one would expect given the belief that evolution's failure at inner alignment is the most significant and informative evidence that inner alignment is hard. I assume you mean that Quintin seems to claim that inner values learned may be retained with increase in capabilities, and that usually people believe that inner values learned may not be retained with increase in capabilities. I believe so too -- inner values seem to be significantly robust to increase in capabilities, especially since one has the option to deceive. Do people really believe that inner values learned don't scale with an increase in capabilities? Perhaps we are defining inner values differently here. By inner values, I mean terminal goals. Wanting dogs to be happy is not a terminal goal for most people, and I believe that given enough optimization pressure, the hypothetical dog-lover would abandon this goal to optimize for what their true terminal goal is. Does that mean that with increase in capabilities, people's inner values shift? Not exactly; it seems to me that we were mistaken about people's inner values instead.

My understanding is: Bob's genome didn't have access to Bob's developed world model (WM) when he was born (because his WM wasn't developed yet). Bob's genome can't directly specify "care about your specific family" because it can't hardcode Bob's specific family's visual or auditory features.

This direct-specification wouldn't work anyways because people change looks, Bob could be adopted, or Bob could be born blind & deaf. 

[Check, does the Bob example make sense?]

But, the genome does do something indirectly that consistently leads to people valuin... (read more)

I wonder how much COVID got people to switch to working on Biorisks.

What I’m interested here is talking to real researchers and asking what events would convince them to switch to alignment. Enumerating those would be useful for explaining to them.

I think asking for specific capabilities would also be interesting. Or what specific capabilities they would’ve said in 2012. Then asking how long they expect between that capability and an x-catastrophe.

[Note: this one, steelman, and feedback on proposals all have very similar input spaces. I think I would ideally mix them as one in an actual product, but I'm keeping them separate for now]

Task: Obvious EA/Alignment Advice

  • Context: There are several common mental motions that the EA community does which are usefully applied to alignment. Ex. "Differential Impact", "Counterfactual Impact", "Can you clearly tell a story on how this reduces x-risk?", and "Truly Optimizing for X". A general "obvious advice" is useful for general capabilities as well, but this i
... (read more)

Task: Steelman Alignment proposals

  • Context: Some alignment research directions/proposals have a kernel of truth to them. Steelmanning these ideas to find the best version of it may open up new research directions or, more likely, make the pivot to alignment research easier. On the latter, some people are resistant to change their research direct, and a steelman will only slightly change the topic while focusing on maximizing impact. This would make it easier to convince these people to change to a more alignment-related direction.
  • Input Type: A general resea
... (read more)

Task: Feedback on alignment proposals

  • Context: Some proposals for a solution to alignment are dead ends or have common criticisms. Having an easy way of receiving this feedback on one's alignment proposal can prevent wasted effort as well as furthering the conversation on that feedback.
  • Input Type: A proposal for a solution to alignment or a general research direction
  • Output Type: Common criticisms or arguments for dead ends for that research direction

Instance 1


Currently AI systems are prone to bias and unfairness which is unaligned with our values. I w

... (read more)

It’s also clear when reading these works and interacting with these researchers that they all get how alignment is about dealing with unbounded optimization, they understand fundamental problems and ideas related to instrumental convergence, the security mindset, the fragility of value, the orthogonality thesis…

I bet Adam will argue about this (or something similar) is the minimal we want for a research idea, because I agree with your idea that we shouldn’t expect solution to alignment to fall out of the marketing program for Oreos. We want to constrain it to at least “has a plausible story on reducing x-risk” and maybe what’s mentioned in the quote as well.

For sure I agree that the researcher knowing these things is a good start - so getting as many potential researchers to grok these things is important. My question is about which ideas researchers should focus on generating/elaborating given that they understand these things. We presumably don't want to restrict thinking to ideas that may overcome all these issues - since we want to use ideas that fail in some respects, but have some aspect that turns out to be useful. Generating a broad variety of new ideas is great, and we don't want to be too quick in throwing out those that miss the target. The thing I'm unclear about is something like: What target(s) do I aim for if I want to generate the set of ideas with greatest value? I don't think that "Aim for full alignment solution" is the right target here. I also don't think that "Aim for wacky long-shots" is the right target - and of course I realize that Adam isn't suggesting this. (we might find ideas that look like wacky long-shots from outside, but we shouldn't be aiming for wacky long-shots) But I don't have a clear sense of what target I would aim for (or what process I'd use, what environment I'd set up, what kind of people I'd involve...), if my goal were specifically to generate promising ideas (rather than to work on them long-term, or to generate ideas that I could productively work on). Another disanalogy with previous research/invention... is that we need to solve this particular problem. So in some sense a history of: [initially garbage-looking-idea] ---> [important research problem solved] may not be relevant. What we need is: [initially garbage-looking-idea generated as attempt to solve x] ---> [x was solved] It's not good enough if we find ideas that are useful for something, they need to be useful for this. I expect the kinds of processes that work well to look different from those used where there's no fixed problem.

Ya, I was even planning on trying:

[post/blog/paper] rohinmshah karma: 100 Planned summary for the Alignment Newsletter: \n> 

Then feed that input to.

Planned opinion:

to see if that has some higher-quality summaries. 

3Rohin Shah1y
Well, one "correct" generalization there is to produce much longer summaries, which is not actually what we want. (My actual prediction is that changing the karma makes very little difference to the summary that comes out.)

For those with math backgrounds not already familiar with InfraBayes (maybe people share the post with their math-background friends), can there be specifics for context? Like:

If you have experience with topology, functional analysis, measure theory, and convex analysis then...


You can get a good sense of InfraBayes from [this post] or [this one]


A list of InfraBayes posts can be found here.

No, "why" is correct. See the rest of the sentence:

Write out all the counter-arguments you can think of, and repeat

It's saying assume it's correct, then assume it's wrong, and repeat. Clever arguers don't usually devil advocate themselves.

How do transcriptions typically handle images? They're pretty important for this talk. You could embed the images in the text as it progresses?

Regarding generators of human values: say we have the gene information that encodes human cognition, what does that mean? Equivalent of a simulated human? Capabilities secret-sauce algorithm right? I'm unsure if you can take the body out of a person and still have the same values because I have felt senses in my body that tells me information about the world and how I relate to it.

Assume it works as a simulated person and ignore mindcrime, how do you algorithmically end up in a good enough subset of human values (because not all human values are meta-good)... (read more)

Meanwhile, if you want to think it through for yourself, the general question is: where the hell do humans get all their bits-of-search from?

Cultural accumulation and google, but that's mimicking someone who's already figured it out. How about the person who first figured out eg crop growth? Could be scientific method, but also just random luck which then caught on. 

Additionally, sometimes it's just applying the same hammers to different nails or finding new nails, which means that there are general patterns (hammers) that can be applied to many diffe

... (read more)

Thinking through the "vast majority of problem-space for X fails" argument; assume we have a random text generator that we want to run a sorting algorithm:

  • Vast majority don't sort (or are even compilable)
  • The vast majority of programs that "look like they work", don't (eg "forgot a semicolon", "didn't account for an already sorted list", etc)
  • Generalizing: the vast majority of programs that pass [Unit tests, compiles, human says "looks good to me", simple], don't work. 
    • Could be incomprehensible, pass several unit tests, but still fail in weird edge case
... (read more)

Any suggestions for the format in future weeks? Or a criticism of the idea in general?

I'm available for co-working to discuss any post or potential project on interpretability or if you'd like someone to bounce ideas off of. My calendly link is here, I'm available all week at many times, and I won't take more than 2 meetings in a day, but I'll email you within the day to reschedule if that happens.

Load More