All of ryan_greenblatt's Comments + Replies

I would be more sympathetic if you made a move like, "I'll accept continuity through the human range of intelligence, and that we'll only have to align systems as collectively powerful as humans, but I still think that hands-on experience is only..." In particular, I think there is a real disagreement about the relative value of experimenting on future dangerous systems instead of working on theory or trying to carefully construct analogous situations today by thinking in detail about alignment difficulties in the future.

Here are some views, often held in a cluster:

I'm not sure exactly which clusters you're referring to, but I'll just assume that you're pointing to something like "people who aren't very into the sharp left turn and think that iterative, carefully bootstrapped alignment is a plausible strategy." If this isn't what you were trying to highlight, I apologize. The rest of this comment might not be very relevant in that case.

To me, the views you listed here feel like a straw man or weak man of this perspective.

Furthermore, I think the actual crux is more ofte... (read more)

Pitting two models against each other in a zero-sum competition only works so long as both models actually learn the desired goals. Otherwise, they may be able to reach a compromise with each other and cooperate towards a non-zero-sum objective.

If training works well, then they can't collude on average during training, only rarely or in some sustained burst prior to training crushing these failures.

In particular, in the purely supervised case with gradient descent, performing poorly on average during training requires gradient hacking (or more beni... (read more)

1 · Rubi J. Hudson · 9h
For the first point, I agree that SGD pushes towards closing any gaps. My concern is that at the moment, we don't know how small the gaps need to be to get the desired behavior (and this is what we are working on modelling now). On top of that, depending on how the models are initialized, the starting gap may be quite large, so the dynamics of how gaps close throughout the training process seem important to study further.

For the second point, I think we are also in agreement: the training process could lead the AI to learning "If I predict that this action will destroy the world, the humans won't choose it", which then leads to dishonest predictions. However, I also find the training process converging to a mesa-optimizer for the training objective (or something sufficiently close) to be somewhat more plausible.

We can't be confident enough that it won't happen to safely rely on that assumption.

I'm not sure what motivation for worst-case reasoning you're thinking about here. Maybe just that there are many disjunctive ways things can go wrong other than bad capability evals and the AI will optimize against us?

Overall, I think I disagree.

This will depend on the exact bar for safety. This sort of scenario feels like 0.1% to 3% likely to me, which would be immensely catastrophic overall, but there is lower-hanging fruit for danger avoidance elsewhere.

(And for this... (read more)

3 · johnswentworth · 10d
This is getting very meta, but I think my Real Answer is that there's an analogue of You Are Not Measuring What You Think You Are Measuring [https://www.lesswrong.com/posts/9kNxhKWvixtKW5anS/you-are-not-measuring-what-you-think-you-are-measuring] for plans. Like, the system just does not work any of the ways we're picturing it at all, so plans will just generally not at all do what we imagine they're going to do. (Of course the plan could still in-principle have a high chance of "working", depending on the problem, insofar as the goal turns out to be easy to achieve, i.e. most plans work by default. But even in that case, the planner doesn't have counterfactual impact; just picking some random plan would have been about as likely to work.)

The general solution which You Are Not Measuring What You Think You Are Measuring suggested was "measure tons of stuff", so that hopefully you can figure out what you're actually measuring. The analogue of that technique for plans would be: plan for tons of different scenarios, failure modes, and/or goals. Find plans (or subplans) which generalize to tons of different cases, and there might be some hope that they generalize to the real world. The plan can maybe be robust enough to work even though the system does not work at all the ways we imagine.

But if the plan doesn't even generalize to all the low-but-not-astronomically-low-probability possibilities we've thought of, then, man, it sure does seem a lot less likely to generalize to the real system. Like, that pretty strongly suggests that the plan will work only insofar as the system operates basically the way we imagined.

Personally, my take on basically-all capabilities evals which at all resemble the evals developed to date is You Are Not Measuring What You Think You Are Measuring; I expect them to mostly just not measure whatever turns out to matter in practice.

[Sorry for late reply]

Analogously, conditional on things like gradient hacking being an issue at all, I'd expect the "hacker" to treat potential-training-objective-improvement as a scarce resource, which it generally avoids "spending" unless the expenditure will strengthen its own structure. Concretely, this probably looks like mostly keeping itself decoupled from the network's output, except when it can predict the next backprop update and is trying to leverage that update for something.

So it's not a question of performing badly on the training metric s

... (read more)

(Note: this comment is rambly and repetitive, but I decided not to spend time cleaning it up)

It sounds like you believe something like: "There are autonomous-learning-style approaches which are considerably more efficient than next-token prediction."

And more broadly, you're making a claim like 'current learning efficiency is very low'.

I agree - brains imply that it's possible to learn vastly more efficiently than deep nets, and my guess would be that performance can be far, far better than brains.

Suppose we instantly went from 'current status quo... (read more)

So I propose “somebody gets autonomous learning to work stably for LLMs (or similarly-general systems)” as a possible future fast-takeoff scenario.

Broadly speaking, autonomous learning doesn't seem particularly distinguished relative to supervised learning unless you have data limitations. For instance, suppose that data doesn't run out despite scaling and autonomous learning is moderately to considerably less efficient than supervised learning. Then, you'd just do supervised learning. Now, we can imagine fast takeoff scenarios where:

  • Scaling runs into
... (read more)

Broadly speaking, autonomous learning doesn't seem particularly distinguished relative to supervised learning unless you have data limitations.

Suppose I ask you to spend a week trying to come up with a new good experiment to try in AI. I give you two options.

Option A: You need to spend the entire week reading AI literature. I choose what you read, and in what order, using a random number generator and selecting out of every AI paper / textbook ever written. While reading, you are forced to dwell for exactly one second—no more, no less—on each word of the t... (read more)

Comment TLDR: adversarial examples are a weapon against the AIs that we can use for good, and solving adversarial robustness would let the AIs harden themselves.

I haven't read this yet (I will later : ) ), so it's possible this is mentioned, but I'd note that exploiting the lack of adversarial robustness could also be used to improve safety. For instance, AI systems might have a hard time keeping secrets if they also need to interact with humans trying to test for verifiable secrets. E.g., trying to jailbreak AIs to get them to tell you about the fact that they ... (read more)

1 · AdamGleave · 2mo
This is a good point: adversarial examples in what I called in the post the "main" ML system can be desirable, even though we typically don't want them in the "helper" ML systems used to align the main system.

One downside to adversarial vulnerability of the main ML system is that it could be exploited by bad actors (whether humans or other, misaligned AIs). But this might be fine in some settings: e.g. you build some airgapped system that helps you build the next, more robust and aligned AI. One could also imagine crafting adversarial example backdoors that are cryptographically hard to discover if you don't know how they were constructed.

I generally expect that if adversarial robustness can be practically solved, then transformative AI systems will eventually self-improve themselves to the point of being robust. So, the window where AI systems are dangerous & deceptive enough that we need to test them using adversarial examples but not capable enough to have overcome this might be quite short. Could still be useful as an early-warning sign, though.

Simulations are not the most efficient way for A and B to reach their agreement

Are you claiming that the marginal returns to simulation are never worth the costs? I'm skeptical. I think it's quite likely that some number of acausal trade simulations are run even if that isn't where most of the information comes from. I think there are probably diminishing returns to various approaches and thus you both do simulations and other approaches. There's a further benefit to sims, which is that credence about sims affects the behavior of CDT agents, but it's unc... (read more)

It is indeed pretty weird to see these behaviors appear in pure LMs. It's especially striking with sycophancy, where the large models seem obviously (?) miscalibrated given the ambiguity of the prompt.

By 'pure LMs' do you mean 'pure next token predicting LLMs trained on a standard internet corpus'? If so, I'd be very surprised if they're miscalibrated, assuming this prompt isn't that improbable (which it probably isn't). I'd guess this output is the 'right' output for this corpus (so long as you don't sample enough tokens to make the sequence detectably very... (read more)

(Context: I work at Redwood)

While we're on the topic, it's perhaps useful to more directly describe my concerns about distribution-specific understanding of models, and especially narrow-distribution understanding of the kind a lot of work building Causal Scrubbing seems to be focusing on.

Can I summarize your concerns as something like "I'm not sure that looking into the behavior of "real" models on narrow distributions is any better research than just training a small toy model on that narrow distribution and interpreting it?" Or perhaps you think it'... (read more)

5 · Christopher Olah · 3mo
Between the two, I might actually prefer training a toy model on a narrow distribution! But it depends a lot on exactly how the analysis is done and what lessons one wants to draw from it.

Real language models seem to make extensive use of superposition. I expect there to be lots of circuits superimposed with the one you're studying, and I worry that studying it on a narrow distribution may give a misleading impression – as soon as you move to a broader distribution, overlapping features and circuits which you previously missed may activate, and your understanding may in fact be misleading.

On the other hand, for a model just trained on a toy task, I think your understanding is likely closer to the truth of what's going on in that model. If you're studying it over the whole training distribution, features either aren't in superposition (there's so much free capacity in most of these models this seems possible!) or else they'll be part of the unexplained loss, in your language.

So choosing to use a toy model is just a question of what that model teaches you about real models (for example, you've kind of side-stepped superposition, and it's also unclear to what extent the features and circuits in a toy model represent the larger model). But it seems much clearer what is true, and it also seems much clearer that these limitations exist.

Thinking about the state and time-evolution rules for the state seems fine, but there isn't any interesting structure in the naive formulation, IMO. The state is the entire text, so we don't get any interesting Markov chain structure. (You can turn any random process into a Markov chain by including the entire history in the state! The interesting property was that the past didn't matter.)
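
To spell the (admittedly trivial) point out in symbols, as a sketch in informal notation: the Markov property that would actually be interesting is

$$P(x_{t+1} \mid x_1, \dots, x_t) = P(x_{t+1} \mid x_t),$$

i.e. the past beyond the current token doesn't matter. If you instead define the state as the whole prefix, $s_t := (x_1, \dots, x_t)$, then

$$P(s_{t+1} \mid s_1, \dots, s_t) = P(s_{t+1} \mid s_t)$$

holds automatically for any stochastic process, which is exactly why it buys nothing.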

1 · Charlie Steiner · 3mo
Hm, I mostly agree. There isn't any interesting structure by default, you have to get it by trying to mimic a training distribution that has interesting structure. And I think this relates to another way that I was too reductive, which is that if I want to talk about "simulacra" as a thing, then they don't exist purely in the text, so I must be sneaking in another ontology somewhere - an ontology that consists of features inferred from text (but still not actually the state of our real universe).
1 · Lawrence Chan · 3mo
Nitpick: I mean, technically, the state is only the last 4k tokens or however long your context length is. Though I agree this is still very uninteresting. 

I would argue that ARC's research is justified by (1) (roughly speaking). Sadly, I don't think that there are enough posts on their current plans for this to be clear or easy for me to point at. There might be some posts coming out soon.

Fair enough if you're interested in just talking about 'approaches to acquiring information wrt. AIs' and you'd like to call this interpretability.

Can you give examples of alignment research which isn't interpretability research?

1 · Stephen Casper · 4mo
There are not that many that I don't think are fungible with interpretability work :)  But I would describe most outer alignment work to be sufficiently different...

Is there anything in particular you would like to see discussed later in this sequence?

It seems like you're trying to convince people to do interpretability research differently or to work on other types of research.

If so, I think that it might be worth engaging with people's cruxes. This can be harder than laying out general arguments, but it would make the sequence more useful.

That said, I don't really know what people's cruxes for working in interp are, and as far as I know this sequence already includes lots of discussion along these lines.

I'm hopeful that Redwood (where I work) moves toward having a clear and well argued plan or directly useful techniques (perhaps building up from more toy problems).

If one of our main goals for interpretability research is to help us with aligning highly intelligent AI systems in high stakes settings, shouldn’t we be seeing tools that are more helpful in the real world?

There are various reasons you might not see tools which are helpful right now. Here are some overly conjunctive examples:

  1. There's a clear and well argued plan for the tools/research to build into tools/research which reduce X-risk, but this plan requires additional components which don't exist yet. So, these components are being worked on. Ideally,
... (read more)
1 · Stephen Casper · 4mo
I think that (1) is interesting. This sounds plausible, but I do not know of any examples of this perspective being fleshed out. Do you know of any posts on this?

Oh, it seems like you're reluctant to define interpretability, but if anything you lean toward using a very broad definition [https://www.alignmentforum.org/s/a6ne2ve5uturEEQK7/p/MyvkTKfndx9t4zknh#The_key_idea_behind_this_post_is_that_whatever_we_call__interpretability__tools_are_entirely_fungible_with_other_techniques_related_to_describing__evaluating__debugging__etc_]. Fair enough, I certainly agree that "methods by which something novel about a system can be better predicted or described" are important.

ARC started less than a year ago

FWIW, I wouldn't describe ARC's work as interpretability. My best understanding is that they aren't directly targeting better human understanding of how AIs work (though this may happen indirectly). (I'm pretty confident in this, but maybe someone from ARC will correct me : ) )

Edit: it seems like you define interpretability very broadly, to the point where I'm a bit confused about what is or isn't interpretability work. This comment should be interpreted to refer to interpretability as 'someone (humans or AIs) getting a better understanding of how an AI works (often with a mechanistic connotation)'

2 · Stephen Casper · 4mo
Thanks! I discuss in the second post of the sequence why I lump ARC's work in with human-centered interpretability. 

We are seeing a number of pushes to get many more people involved in interpretability work

Context: I work at Redwood. You linked to REMIX here, but I wouldn't necessarily argue for more people doing interpretability on the margin (and I think Buck probably roughly agrees with me here). I think it's plausible that too much effort is going to interp at the margin. I'm personally far more worried about interpretability work being well directed and high quality than about the number of people involved. (It seems like I potentially agree with you on this point bas... (read more)

1 · Stephen Casper · 4mo
Interesting to know that about the plan. I had assumed that REMIX was in large part about getting more people into this type of work. But I'm interested in the conclusions and current views on it. Is there a post reflecting on how it went and what lessons were learned from it?

If we want to reduce near and long term risks from AI, we should care a lot about interpretability tools. This is a very uncontroversial claim to make inside the AI safety community. Almost every agenda for safe advanced AI incorporates interpretability in some way. The key value of interpretability tools is that they aid in human oversight by enabling open-ended evaluation.

Hmm, I actually don't think this is uncontroversial if by 'interpretability' you mean mechanistic interpretability. I think there's a pretty plausible argument that doing anything ot... (read more)

As an established case for tractability, we have the natural abstraction hypothesis. According to it, efficient abstractions are a feature of the territory, not the map (at least to a certain significant extent). Thus, we should expect different AI models to converge towards the same concepts, which also would make sense to us. Either because we're already using them (if the AI is trained on a domain we understand well), or because they'd be the same abstractions we'd arrive at ourselves (if it's a novel domain).

Even believing in a relatively strong ver... (read more)

An important note here is that our final 80% loss recovered results in loss which is worse than a constant model! (aka, a rock)

Specifically, note that the dataset consists of 26% balanced sequences as discussed here. So, the loss of a constant model is just the entropy of the labels. This is less than the 1.22 loss we get for our final 72% loss recovered experiment.
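
For concreteness, the arithmetic behind that claim (assuming the 26% figure and loss measured in nats) is just the binary entropy of the label distribution:

$$H(0.26) = -0.26\ln(0.26) - 0.74\ln(0.74) \approx 0.35 + 0.22 \approx 0.57\ \text{nats},$$

which is indeed well below 1.22.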

We think this explanation is still quite meaningful - adversarially constructed causal scrubbing hypotheses preserve variance and the m... (read more)

4 · Tao Lin · 5mo
I'm not against evaluating models in ways where they're worse than rocks, I just think you shouldn't expect anyone else to care about your worse-than-rock numbers without very extensive justification (including adversarial stuff)

Overall, my view is that we will need to solve the optimization problem of 'what properties of the activation distribution are sufficient to explain how the model behaves', but this solution can be represented somewhat implicitly and I don't currently see how you'd transition it into a solution to superposition in the sense I think you mean.

I'll try to explain why I have this view, but it seems likely I'll fail (at least partially because of my own confusions).

Quickly, some background so we're hopefully on the same page (or at least closer):

I'm imagining t... (read more)

Thanks for the great comment clarifying your thinking!

I would be interested in seeing the data dimensionality curve for the validation set on MNIST (as opposed to just the train set) - it seems like the stated theory should make pretty clear predictions about what you'd see. (Or maybe I'll try to make reproduction happen and do some more experiments).

These results also suggest that if superposition is widespread, mechanistic anomaly detection will require solving superposition

I feel pretty confused, but my overall view is that many of the routes I cur... (read more)

4 · Christopher Olah · 5mo
It seems quite plausible there might be ways to solve mechanistic interpretability which frame things differently. However, I presently expect that they'll need to do something which is equivalent to solving superposition, even if they don't solve it explicitly. (I don't fully understand your perspective, so it's possible I'm misunderstanding something though!)

To give a concrete example (although this is easier than what I actually envision), let's consider this model from Adam Jermyn's repeated data extension [https://transformer-circuits.pub/2023/toy-double-descent/index.html#comment-jermyn-3] of our paper: [figure from the linked comment]

If you want to know whether the model is "generalizing" rather than "triggering a special case" you need to distinguish the "single data point feature" direction from normal linear combinations of features. Now, it happens to be the case that the specific geometry of the 2D case we're visualizing here means that isn't too hard. But we need to solve this in general. (I'm imagining this as a proxy for a model which has one "special case backdoor/evil feature" in superposition with lots of benign features. We need to know if the "backdoor/evil feature" activated rather than an unusual combination of normal features.)

Of course, there may be ways to distinguish this without the language of features and superposition. Maybe those are even better framings! But if you can, it seems to me that you should then be able to backtrack that solution into a sparse coding solution (if you know whether a feature has fired, it's now easy to learn the true sparse code!). So it seems to me that you end up having done something equivalent.

Again, all of these comments are without really understanding your view of how these problems might be solved. It's very possible I'm missing something.
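
As a toy illustration of that last parenthetical (a sketch, not code from either post; it assumes features combine linearly and fire at unit magnitude, with made-up dimensions): once you know which features fired on each sample, recovering the feature directions reduces to ordinary least squares.

```python
# Toy check: knowing the binary "which features fired" matrix makes learning the
# sparse code (the feature directions) a linear regression problem.
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features, d_model = 10_000, 50, 20

true_dirs = rng.normal(size=(n_features, d_model))                   # ground-truth feature directions
fired = (rng.random((n_samples, n_features)) < 0.03).astype(float)   # sparse binary firing pattern
acts = fired @ true_dirs                                             # superposed activations (unit-magnitude firing)

# Least squares on the known firing pattern recovers the directions exactly in this
# noise-free toy case.
recovered, *_ = np.linalg.lstsq(fired, acts, rcond=None)
print(np.allclose(recovered, true_dirs))                             # True
```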

Calling individual tokens the 'State' and a generated sequence the 'Trajectory' is wrong/misleading IMO.

I would instead call a sequence as a whole the 'State'. This follows the meaning from Dynamical systems.

Then, you could refer to a Trajectory, which is a list of sequences, each with one more token.

(That said, I'm not sure thinking about trajectories is useful in this context for various reasons)

1 · Jan Hendrik Kirchner · 3mo
Hmm, there was a bunch of back and forth on this point even before the first version of the post, with @Michael Oesterle [https://www.alignmentforum.org/users/michael-oesterle?mention=user] and @metasemi [https://www.alignmentforum.org/users/metasemi?mention=user] arguing what you are arguing. My motivation for calling the token the state is that A) the math gets easier/cleaner that way and B) it matches my geometric intuitions.

In particular, if I have a first-order dynamical system $0 = F(x_t, \dot{x}_t)$, then $x$ is the state, not the trajectory of states $(x_1, \dots, x_t)$. In this situation, the dynamics of the system only depend on the current state (that's because it's a first-order system). When we move to higher-order systems, $0 = F(x_t, \dot{x}_t, \ddot{x}_t)$, the state is still just $x$, but the dynamics depend not only on the current state but also on the "direction from which we entered it". That's the first derivative (in a time-continuous system) or the previous state (in a time-discrete system).

At least I think that's what's going on. If someone makes a compelling argument that defuses my argument then I'm happy to concede!
4 · Adrià Garriga-Alonso · 5mo
To elaborate somewhat, you could say that the token is the state, but then the transition probability is non-Markovian and all the math gets really hard.
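
For what it's worth, the standard reconciliation here is the usual order-reduction trick (sketched in informal notation): an order-$k$ time-discrete system

$$x_{t+1} = f(x_t, x_{t-1}, \dots, x_{t-k+1})$$

becomes first-order, with Markovian transitions, once the state is taken to be the stacked window $s_t := (x_t, x_{t-1}, \dots, x_{t-k+1})$, so that $s_{t+1} = \tilde{f}(s_t)$. That's essentially Lawrence's "the state is the last 4k tokens" framing above.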

In prior work I've done, I've found that activations have tails between and (typically closer to ). As such, they're probably better modeled as logistic distributions.

That said, different directions in the residual stream have quite different distributions. This depends considerably on how you select directions - I imagine random directions are more gaussian due to CLT. (Note that averaging together heavier tailed distributions takes a very long time to be become gaussian.) But, if you look at (e.g.) the directions selected by neurons optimized... (read more)

See also the Curve Detectors paper for a very narrow example of this (https://distill.pub/2020/circuits/curve-detectors/#dataset-analysis -- a straight line on a log prob plot indicates exponential tails).

I believe the phenomenon of neurons often having activation distributions with exponential tails was first informally observed by Brice Menard.
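
A quick toy check on the CLT point (a sketch with made-up sizes; real residual-stream coordinates are neither independent nor identically distributed, so treat this as directional intuition only):

```python
# Compare a single heavy-tailed coordinate ("neuron-aligned" direction) to a random
# direction, which averages many coordinates and so looks more Gaussian.
import numpy as np
from scipy.stats import kurtosis

rng = np.random.default_rng(0)
d, n = 256, 50_000
acts = rng.laplace(size=(n, d))        # stand-in for heavy-tailed activations

rand_dir = rng.normal(size=d)
rand_dir /= np.linalg.norm(rand_dir)

# Excess kurtosis: ~3 for Laplace, 0 for Gaussian.
print(kurtosis(acts[:, 0]))            # ~3: clearly heavy-tailed
print(kurtosis(acts @ rand_dir))       # much closer to 0: more Gaussian-looking
```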

I am happy to take a “non-worst-case” empirical perspective in studying this problem. In particular, I suspect it will be very helpful – and possibly necessary – to use incidental empirical properties of deep learning systems, which often have a surprising amount of useful emergent structure (as I will discuss more under “Intuitions”).

One reason I feel sad about depending on incidental properties is that it likely implies the solution isn't robust enough to optimize against. This is a key desideratum in an ELK solution. I imagine this optimization would ... (read more)

Here are some dumb questions. Perhaps the answer to all of them is 'this work is preliminary, we'll address this + more later' or 'hey, see section 4 where we talked about this in detail' or 'your question doesn't make sense'

  • In the toy datasets, the features have the same scale (uniform from zero to one when active multiplied by a unit vector). However in the NN case, there's no particular reason to think the feature scales are normalized very much (though maybe they're normalized a bit due to weight decay and similar). Is there some reason this is ok?
  • I
... (read more)
2 · Lee Sharkey · 5mo
Hm, it's a great point. There's no principled reason for it. Equivalently, there's no principled reason to expect the coefficients/activations for each feature to be on the same scale either. We should probably look into a 'feature coefficient magnitude decay' to create features that don't all live on the same scale. Thanks!

One reason for this is that the polytopic features learned by the model in the Toy models of superposition paper can be thought of as approximately maximally distant points on a hypersphere (to my intuitions at least). When using high-ish numbers of dimensions as in our toy data (256), choosing points randomly on the hypersphere achieves approximately the same thing. By choosing points randomly in the way we did here, we don't have to train another potentially very large matrix that puts the one-hot features into superposition. The data generation method seemed like it would approximate real features about as well as polytope-like encodings of one-hot features (which are unrealistic too), so the small benefits didn't seem like they were worth the moderate computational costs. But I could be convinced otherwise on this if I've missed some important benefits.

Nice idea! This could potentially be a nice middle ground between toy data experiments and language model experiments. We'll look into this, thanks again!
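
A rough reconstruction of the data generation being described (not the authors' code; `p_active` and the optional `decay` are made-up knobs, with `decay = 1.0` recovering the equal-scale setup):

```python
# Sparse toy features: random unit directions in 256-d, uniform [0, 1] coefficients
# when active, plus an optional per-feature "coefficient magnitude decay".
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features, d = 10_000, 512, 256
p_active, decay = 0.01, 1.0

dirs = rng.normal(size=(n_features, d))
dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)          # random points on the hypersphere

active = rng.random((n_samples, n_features)) < p_active
coeffs = active * rng.uniform(0.0, 1.0, size=active.shape)   # uniform [0, 1] when active
scales = decay ** np.arange(n_features)                      # all ones unless decay < 1
data = (coeffs * scales) @ dirs                              # superposed toy activations
```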

In the spirit of Evan's original post here's a (half baked) simple model:

Simplicity claims are claims about how many bits (in the human prior) it takes to explain[1] some amount of performance in the NN prior.

E.g., suppose we train a model which gets 2 nats of loss with 100 Billion parameters and we can explain this model getting 2.5 nats using a 300 KB human understandable manual (we might run into issues with irreducible complexity such that making a useful manual is hard, but let's put that aside for now).

So, 'simplicity' of this sort is lower bounded b... (read more)

And yet, whenever we actually delve into these systems, it turns out that there's a ton of ultimately-relatively-simple internal structure.

I'm not sure exactly what you mean by "ton of ultimately-relatively-simple internal structure".

I'll suppose you mean "a high percentage of what models use parameters for is ultimately simple to humans" (where by simple to humans we mean something like, description length in the prior of human knowledge, e.g., natural language).

If so, this hasn't been my experience doing interp work or from the interp work I've seen (... (read more)

4 · Ryan Greenblatt · 6mo
In the spirit of Evan's original post here's a (half baked) simple model:

Simplicity claims are claims about how many bits (in the human prior) it takes to explain[1] some amount of performance in the NN prior.

E.g., suppose we train a model which gets 2 nats of loss with 100 Billion parameters and we can explain this model getting 2.5 nats using a 300 KB human understandable manual (we might run into issues with irreducible complexity such that making a useful manual is hard, but let's put that aside for now).

So, 'simplicity' of this sort is lower bounded by the relative parameter efficiency of neural networks in practice vs the human prior. In practice, you do worse than this insofar as NNs express things which are anti-natural in the human prior (in terms of parameter efficiency).

We can also reason about how 'compressible' the explanation is in a naive prior (e.g., a formal framework for expressing explanations which doesn't utilize cleverer reasoning technology than NNs themselves). I don't quite mean compressible - presumably this ends up getting you insane stuff as compression usually does.

1. By explain, I mean something like the idea of heuristic arguments from ARC. ↩︎
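
To make the bookkeeping in that example concrete (illustrative arithmetic, with an assumed 16 bits per parameter purely for the sake of comparison):

$$300\ \text{KB} \approx 3\times 10^{5}\ \text{bytes} \times 8 \approx 2.4\times 10^{6}\ \text{bits}, \qquad 10^{11}\ \text{params} \times 16\ \text{bits} \approx 1.6\times 10^{12}\ \text{bits},$$

so the human-understandable manual would be shorter than the weights by a factor of roughly $7\times 10^{5}$, while giving up only 0.5 nats of explained loss.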

I would typically call

MLP(x) = f(x) + (MLP(x) - f(x))

a non-linear decomposition as f(x) is an arbitrary function.

Regardless, any decomposition into a computational graph (that we can prove is extensionally equal) is fine. For instance, if it's the case that MLP(x) = combine(h(x), g(x)) (via extensional equality), then I can scrub h(x) and g(x) individually.

One example of this could be a product, e.g., suppose that MLP(x) = h(x) * g(x) (maybe like SwiGLU or something).
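
For concreteness, a minimal sketch of the product case (toy code, not Redwood's causal scrubbing implementation; "scrubbing" a branch here just means feeding it a resampled input):

```python
# An MLP written as combine(h(x), g(x)) = h(x) * g(x), SwiGLU-style, so that the
# gate branch h and the value branch g can be scrubbed individually.
import torch
import torch.nn.functional as F

d, hidden = 64, 256
W_gate = torch.randn(hidden, d) / d**0.5
W_val = torch.randn(hidden, d) / d**0.5
W_out = torch.randn(d, hidden) / hidden**0.5

def h(x):                      # gate branch
    return F.silu(x @ W_gate.T)

def g(x):                      # value branch
    return x @ W_val.T

def mlp(x):                    # extensionally equal to combine(h(x), g(x))
    return (h(x) * g(x)) @ W_out.T

def scrub_gate(x, x_resampled):
    # Scrub only the gate branch: it sees a resampled input while the value
    # branch still sees the original one.
    return (h(x_resampled) * g(x)) @ W_out.T

x, x_resampled = torch.randn(8, d), torch.randn(8, d)
assert torch.allclose(mlp(x), scrub_gate(x, x))   # unscrubbed case matches exactly
```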

(Context: I work at Redwood on using the internals of models to do useful stuff. This is often interpretability work.)

I broadly agree with this post, but I think that it implicitly uses the words 'mechanistic interpretability' differently than people typically do. It seems to be implying that for mechanistic interpretability to be tractable, all parts of the AGI's cognition must be possible to understand for humans in principle. While I agree that this is basically required for Microscope AI to be very useful, it isn't required to have mechanistic interp b... (read more)

3 · David Scott Krueger · 7mo
I disagree. I think in practice people say mechanistic interpretability all the time and almost never say these other more specific things. This feels a bit like moving the goalposts to me. And I already said in the caveats that it could be useful even if the most ambitious version doesn't pan out.

This is a statement that is almost trivially true, but we likely disagree on how much signal. It seems like much of mechanistic interpretability is predicated on something like weak tractability (e.g. that we can understand what deep networks are doing via simple modular/abstract circuits). I disagree with this, and think that we probably do need to understand "how models predict whether the next token is ' is' or ' was'" to determine if a model was "lying" (whatever that means...).

But to the extent that weak/strong tractability are true, this should also make us much more optimistic about engineering modular systems. That is the main point of the post.

... THEN the Paulian family of plans don't provide much hope.

My understanding is that Ryan was tentatively on board with this conditional statement, but Paul was not.

I forget the extent to which I communicated (or even thought) this in the past, but at the moment, the current claim I'd agree with is: "this specific plan is much less likely to work".

My best guess is that even if I was quite confident in those conditions being true, work on various subparts of this plan seems like quite a good bet.

By explanation, I think we mean 'reason why a thing happens' in some intuitive (and underspecified) sense. Explanation length gets at something like "how can you cluster/compress a justification for the way the program responds to inputs" (where justification is doing a lot of work). So, while the program itself is a great way to compress how the program responds to inputs, it doesn't justify why the program responds this way to inputs. Thus, a program length/simplicity prior isn't equivalent. Here are some examples demonstrating where (I think) these priors ... (read more)

2 · Evan Hubinger · 1y
I definitely think there are bad reporter heads that don't ever have to access X. E.g. the human imitator only accesses X if X is required to model humans, which is certainly not the case for all X.