Here are some views, often held in a cluster:
I'm not sure exactly which clusters you're referring to, but I'll just assume that you're pointing to something like "people who aren't very into the sharp left turn and think that iterative, carefully bootstrapped alignment is a plausible strategy." If this isn't what you were trying to highlight, I apologize. The rest of this comment might not be very relevant in that case.
To me, the views you listed here feel like a straw man or weak man of this perspective.
Furthermore, I think the actual crux is more ofte...
Pitting two models against each other in a zero-sum competition only works so long as both models actually learn the desired goals. Otherwise, they may be able to reach a compromise with each other and cooperate towards a non-zero-sum objective.
If training works well, then they can't collude on average during training; at most they can collude rarely, or in some sustained burst before training crushes these failures.
In particular, in the purely supervised case with gradient descent, performing poorly on average during training requires gradient hacking (or more beni...
We can't be confident enough that it won't happen to safely rely on that assumption.
I'm not sure what motivation for worst-case reasoning you're thinking about here. Maybe just that there are many disjunctive ways things can go wrong other than bad capability evals and the AI will optimize against us?
Overall, I think I disagree.
This will depend on the exact bar for safety. This sort of scenario feels 0.1% to 3% likely to me, which is immensely catastrophic overall, but there is lower-hanging fruit for danger avoidance elsewhere.
(And for this...
[Sorry for late reply]
...Analogously, conditional on things like gradient hacking being an issue at all, I'd expect the "hacker" to treat potential-training-objective-improvement as a scarce resource, which it generally avoids "spending" unless the expenditure will strengthen its own structure. Concretely, this probably looks like mostly keeping itself decoupled from the network's output, except when it can predict the next backprop update and is trying to leverage that update for something.
So it's not a question of performing badly on the training metric s
(Note: this comment is rambly and repetitive, but I decided not to spend time cleaning it up)
It sounds like you believe something like: "There are autonomous-learning-style approaches which are considerably more efficient than next token prediction."
And more broadly, you're making a claim like 'current learning efficiency is very low'.
I agree - brains imply that it's possible to learn vastly more efficiently than deep nets, and my guess would be that performance can be far, far better than brains.
Suppose we instantly went from 'current status quo...
So I propose “somebody gets autonomous learning to work stably for LLMs (or similarly-general systems)” as a possible future fast-takeoff scenario.
Broadly speaking, autonomous learning doesn't seem particularly distinguished relative to supervised learning unless you have data limitations. For instance, suppose that data doesn't run out despite scaling and autonomous learning is moderately to considerably less efficient than supervised learning. Then, you'd just do supervised learning. Now, we can imagine fast takeoff scenarios where:
Broadly speaking, autonomous learning doesn't seem particularly distinguished relative to supervised learning unless you have data limitations.
Suppose I ask you to spend a week trying to come up with a new good experiment to try in AI. I give you two options.
Option A: You need to spend the entire week reading AI literature. I choose what you read, and in what order, using a random number generator and selecting out of every AI paper / textbook ever written. While reading, you are forced to dwell for exactly one second—no more, no less—on each word of the t...
comment TLDR: Adversarial examples are a weapon we can use against the AIs for good, and solving adversarial robustness would let the AIs harden themselves.
I haven't read this yet (I will later : ) ), so it's possible this is mentioned, but I'd note that exploiting the lack of adversarial robustness could also be used to improve safety. For instance, AI systems might have a hard time keeping secrets if they also need to interact with humans trying to test for verifiable secrets. E.g., trying to jailbreak AIs to get them to tell you about the fact that they ...
Simulations are not the most efficient way for A and B to reach their agreement
Are you claiming that the marginal returns to simulation are never worth the costs? I'm skeptical. I think it's quite likely that some number of acausal trade simulations are run even if that isn't where most of the information comes from. I think there are probably diminishing returns to various approaches and thus you both do simulations and other approaches. There's a further benefit to sims, which is that credence about sims affects the behavior of CDT agents, but it's unc...
It is indeed pretty weird to see these behaviors appear in pure LMs. It's especially striking with sycophancy, where the large models seem obviously (?) miscalibrated given the ambiguity of the prompt.
By 'pure LMs' do you mean 'pure next token predicting LLMs trained on a standard internet corpus'? If so, I'd be very surprised if they're miscalibrated, assuming this prompt isn't that improbable (which it probably isn't). I'd guess this output is the 'right' output for this corpus (so long as you don't sample enough tokens to make the sequence detectably very...
(Context: I work at Redwood)
While we're on the topic, it's perhaps useful to more directly describe my concerns about distribution-specific understanding of models, and especially narrow-distribution understanding of the kind a lot of work building Causal Scrubbing seems to be focusing on.
Can I summarize your concerns as something like "I'm not sure that looking into the behavior of "real" models on narrow distributions is any better research than just training a small toy model on that narrow distribution and interpreting it?" Or perhaps you think it'...
Thinking about the state and time evolution rules for the state seems fine, but there isn't any interesting structure with the naive formulation imo. The state is the entire text, so we don't get any interesting Markov chain structure. (you can turn any random process into a Markov chain where you include the entire history in the state! The interesting property was that the past didn't matter!)
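As a toy illustration of that point (my own sketch, nothing from the post):

```python
# Any sequence process is trivially "Markov" once the state includes the whole history.
def next_token(history: tuple) -> str:
    # A deliberately non-Markovian rule over *tokens*: the next token copies the
    # token from three steps back, so it depends on more than the last token.
    return history[-3] if len(history) >= 3 else "abc"[len(history)]

state: tuple = ()  # "state" = the entire text so far
for _ in range(6):
    # This transition depends only on `state`, so it's formally a Markov chain --
    # but only because `state` already carries the full past. No structure gained.
    state = state + (next_token(state),)

print(state)  # ('a', 'b', 'c', 'a', 'b', 'c')
```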
I would argue that ARC's research is justified by (1) (roughly speaking). Sadly, I don't think that there are enough posts on their current plans for this to be clear or easy for me to point at. There might be some posts coming out soon.
Fair enough if you're interested in just talking about 'approaches to acquiring information wrt. AIs' and you'd like to call this interpretability.
Is there anything in particular you would like to see discussed later in this sequence?
It seems like you're trying to convince people to do interpretability research differently or to work on other types of research.
If so, I think that it might be worth engaging with people's cruxes. This can be harder than laying out general arguments, but it would make the sequence more useful.
That said, I don't really know what people's cruxes for working in interp are, and as far as I know this sequence already includes lots of discussion along these lines.
I'm hopeful that Redwood (where I work) moves toward having a clear and well argued plan or directly useful techniques (perhaps building up from more toy problems).
If one of our main goals for interpretability research is to help us with aligning highly intelligent AI systems in high stakes settings, shouldn’t we be seeing tools that are more helpful in the real world?
There are various reasons you might not see tools which are helpful right now. Here are some overly conjunctive examples:
Oh, it seems like you're reluctant to define interpretability, but if anything lean toward using a very broad definition. Fair enough, I certainly agree that "methods by which something novel about a system can be better predicted or described" are important.
ARC started less than a year ago
FWIW, I wouldn't describe ARC's work as interpretability. My best understanding is that they aren't directly targeting better human understanding of how AIs work (though this may happen indirectly). (I'm pretty confident in this, but maybe someone from ARC will correct me : ) )
Edit: it seems like you define interpretability very broadly, to the point where I'm a bit confused about what is or isn't interpretability work. This comment should be interpreted to refer to interpretability as 'someone (humans or AIs) getting a better understanding of how an AI works (often with a mechanistic connotation)'
We are seeing a number of pushes to get many more people involved in interpretability work
Context: I work at Redwood. You linked to REMIX here, but I wouldn't necessarily argue for more people doing interpretability on the margin (and I think Buck probably roughly agrees with me here). I think it's plausible that too much effort is going to interp at the margin. I'm personally far more worried about interpretability work being well directed and high quality than the number of people involved. (It seems like I potentially agree with you on this point bas...
If we want to reduce near and long term risks from AI, we should care a lot about interpretability tools. This is a very uncontroversial claim to make inside the AI safety community. Almost every agenda for safe advanced AI incorporates interpretability in some way. The key value of interpretability tools is that they aid in human oversight by enabling open-ended evaluation.
Hmm, I actually don't think this is uncontroversial if by 'interpretability' you mean mechanistic interpretability. I think there's a pretty plausible argument that doing anything ot...
As an established case for tractability, we have the natural abstraction hypothesis. According to it, efficient abstractions are a feature of the territory, not the map (at least to a certain significant extent). Thus, we should expect different AI models to converge towards the same concepts, which also would make sense to us. Either because we're already using them (if the AI is trained on a domain we understand well), or because they'd be the same abstractions we'd arrive at ourselves (if it's a novel domain).
Even believing in a relatively strong ver...
An important note here is that our final 80% loss recovered results in loss which is worse than a constant model! (aka, a rock)
Specifically, note that the dataset consists of 26% balanced sequences as discussed here. So, the loss of a constant model is roughly 0.57 nats (equivalent to the entropy of the labels). This is less than the 1.22 loss we get for our final 72% loss recovered experiment.
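Spelling out the entropy calculation behind that constant-model loss (the 26% figure is from above; natural-log loss assumed):

```latex
H = -0.26\ln(0.26) - 0.74\ln(0.74) \approx 0.350 + 0.223 \approx 0.57\ \text{nats}
```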
We think this explanation is still quite meaningful - adversarially constructed causal scrubbing hypotheses preserve variance and the m...
Overall, my view is that we will need to solve the optimization problem of 'what properties of the activation distribution are sufficient to explain how the model behaves', but this solution can be represented somewhat implicitly and I don't currently see how you'd transition it into a solution to superposition in the sense I think you mean.
I'll try to explain why I have this view, but it seems likely I'll fail (at least partially because of my own confusions).
Quickly, some background so we're hopefully on the same page (or at least closer):
I'm imagining t...
Thanks for the great comment clarifying your thinking!
I would be interested in seeing the data dimensionality curve for the validation set on MNIST (as opposed to just the train set) - it seems like the stated theory should make pretty clear predictions about what you'd see. (Or maybe I'll try to make reproduction happen and do some more experiments).
These results also suggest that if superposition is widespread, mechanistic anomaly detection will require solving superposition
I feel pretty confused, but my overall view is that many of the routes I cur...
Calling individual tokens the 'State' and a generated sequence the 'Trajectory' is wrong/misleading IMO.
I would instead call a sequence as a whole the 'State'. This follows the meaning from dynamical systems.
Then, you could refer to a Trajectory, which is a list of sequences, each with one more token.
(That said, I'm not sure thinking about trajectories is useful in this context for various reasons)
In prior work I've done, I've found that activations have tails somewhere between exponential ($e^{-x}$) and Gaussian ($e^{-x^2}$) decay (typically closer to exponential). As such, they're probably better modeled as logistic distributions.
That said, different directions in the residual stream have quite different distributions. This depends considerably on how you select directions - I imagine random directions are more Gaussian due to the CLT. (Note that averaging together heavier-tailed distributions takes a very long time to become Gaussian.) But, if you look at (e.g.) the directions selected by neurons optimized...
See also the Curve Detectors paper for a very narrow example of this (https://distill.pub/2020/circuits/curve-detectors/#dataset-analysis -- a straight line on a log prob plot indicates exponential tails).
I believe the phenomenon of neurons often having activation distributions with exponential tails was first informally observed by Brice Menard.
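For what it's worth, here's a toy check of that "straight line on a log prob plot" diagnostic (my own sketch, assuming numpy; not from the Curve Detectors paper):

```python
# Fit a quadratic to the log-density of the right tail.
# Exponential/logistic tails give near-zero curvature (a straight line on a log prob
# plot); Gaussian tails give clearly negative curvature (log-density ~ -x^2/2).
import numpy as np

rng = np.random.default_rng(0)
samples = {
    "logistic": rng.logistic(size=1_000_000),
    "gaussian": rng.normal(size=1_000_000),
}

for name, xs in samples.items():
    counts, edges = np.histogram(xs, bins=200, density=True)
    centers = (edges[:-1] + edges[1:]) / 2
    tail = (centers > 2) & (counts > 0)  # right tail only, skip empty bins
    curvature = np.polyfit(centers[tail], np.log(counts[tail]), 2)[0]  # x^2 coefficient
    print(f"{name}: tail log-density curvature = {curvature:.3f}")
```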
I am happy to take a “non-worst-case” empirical perspective in studying this problem. In particular, I suspect it will be very helpful – and possibly necessary – to use incidental empirical properties of deep learning systems, which often have a surprising amount of useful emergent structure (as I will discuss more under “Intuitions”).
One reason I feel sad about depending on incidental properties is that it likely implies the solution isn't robust enough to optimize against. This is a key desideratum in an ELK solution. I imagine this optimization would ...
Here are some dumb questions. Perhaps the answer to all of them is 'this work is preliminary, we'll address this + more later' or 'hey, see section 4 where we talked about this in detail' or 'your question doesn't make sense'
In the spirit of Evan's original post here's a (half baked) simple model:
Simplicity claims are claims about how many bits (in the human prior) it takes to explain[1] some amount of performance in the NN prior.
E.g., suppose we train a model which gets 2 nats of loss with 100 Billion parameters and we can explain this model getting 2.5 nats using a 300 KB human understandable manual (we might run into issues with irreducible complexity such that making a useful manual is hard, but let's put that aside for now).
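To make the bookkeeping in that example explicit (the 1 KB = 1000 bytes convention and the 16 bits per parameter figure are just illustrative assumptions, not from the example):

```latex
\underbrace{300\,\text{KB} \times 8000\,\tfrac{\text{bits}}{\text{KB}}}_{\text{manual, human prior}} \approx 2.4\times10^{6}\ \text{bits}
\qquad \text{vs.} \qquad
\underbrace{10^{11}\,\text{params} \times 16\,\tfrac{\text{bits}}{\text{param}}}_{\text{model, NN prior}} \approx 1.6\times10^{12}\ \text{bits}
```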
So, 'simplicity' of this sort is lower bounded b...
And yet, whenever we actually delve into these systems, it turns out that there's a ton of ultimately-relatively-simple internal structure.
I'm not sure exactly what you mean by "ton of ultimately-relatively-simple internal structure".
I'll suppose you mean "a high percentage of what models use parameters for is ultimately simple to humans" (where by simple to humans we mean something like, description length in the prior of human knowledge, e.g., natural language).
If so, this hasn't been my experience doing interp work or from the interp work I've seen (...
I would typically call
MLP(x) = f(x) + (MLP(x) - f(x))
a non-linear decomposition as f(x) is an arbitrary function.
Regardless, any decomposition into a computational graph (that we can prove is extensionally equal) is fine. For instance, if it's the case that MLP(x) = combine(h(x), g(x)) (via extensional equality), then I can scrub h(x) and g(x) individually.
One example of this could be a product, e.g., suppose that MLP(x) = h(x) * g(x) (maybe like swiglu or something).
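As a concrete sketch of that multiplicative case (my own hypothetical code, assuming PyTorch; a SwiGLU-style MLP with a down-projection folded in, and the h/g branch names are mine):

```python
# MLP(x) = w_down(h(x) * g(x)): h and g are explicit branches which can be
# scrubbed (fed resampled inputs) independently.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.w_gate = nn.Linear(d_model, d_hidden, bias=False)
        self.w_up = nn.Linear(d_model, d_hidden, bias=False)
        self.w_down = nn.Linear(d_hidden, d_model, bias=False)

    def h(self, x):  # gate branch
        return F.silu(self.w_gate(x))

    def g(self, x):  # up-projection branch
        return self.w_up(x)

    def forward(self, x, x_for_h=None, x_for_g=None):
        # Feeding different inputs to the two branches is the "scrub"; the
        # decomposition is extensionally equal to the ordinary MLP whenever
        # x_for_h = x_for_g = x.
        x_for_h = x if x_for_h is None else x_for_h
        x_for_g = x if x_for_g is None else x_for_g
        return self.w_down(self.h(x_for_h) * self.g(x_for_g))

mlp = SwiGLU(d_model=16, d_hidden=64)
x, x_resampled = torch.randn(2, 16), torch.randn(2, 16)
assert torch.allclose(mlp(x), mlp.w_down(mlp.h(x) * mlp.g(x)))  # extensional equality
scrubbed = mlp(x, x_for_g=x_resampled)  # scrub only the g branch
```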
(Context: I work at Redwood on using the internals of models to do useful stuff. This is often interpretability work.)
I broadly agree with this post, but I think that it implicitly uses the words 'mechanistic interpretability' differently than people typically do. It seems to be implying that for mechanistic interpretability to be tractable, all parts of the AGI's cognition must be possible for humans to understand in principle. While I agree that this is basically required for Microscope AI to be very useful, it isn't required to have mechanistic interp b...
... THEN the Paulian family of plans don't provide much hope.
My understanding is that Ryan was tentatively on board with this conditional statement, but Paul was not.
I forget the extent to which I communicated (or even thought) this in the past, but at the moment the claim I'd agree with is: "this specific plan is much less likely to work".
My best guess is that even if I was quite confident in those conditions being true, work on various subparts of this plan seems like quite a good bet.
By explanation, I think we mean 'reason why a thing happens' in some intuitive (and underspecified) sense. Explanation length gets at something like "how can you cluster/compress a justification for the way the program responds to inputs" (where justification is doing a lot of work). So, while the program itself is a great way to compress how the program responds to inputs, it doesn't justify why the program responds this way to inputs. Thus program length/simplicity prior isn't equivalent. Here are some examples demonstrating where (I think) these priors ...
I would be more sympathetic if you made a move like, "I'll accept continuity through the human range of intelligence, and that we'll only have to align systems as collectively powerful as humans, but I still think that hands-on experience is only..." In particular, I think there is a real disagreement about the relative value of experimenting on future dangerous systems instead of working on theory or trying to carefully construct analogous situations today by thinking in detail about alignment difficulties in the future.