All of Rohin Shah's Comments + Replies

I feel like a lot of these arguments could be pretty easily made of individual AI safety researchers. E.g.

Misaligned Incentives

In much the same way that AI systems may have perverse incentives, so do the [AI safety researchers]. They are [humans]. They need to make money, [feed themselves, and attract partners]. [Redacted and redacted even just got married.] This type of accountability to [personal] interests is not perfectly in line with doing what is good for human interests. Moreover, [AI safety researchers are often] technocrats whose values and demogr

... (read more)
5 Stephen Casper 6d
Thanks. I agree that the points apply to individual researchers. But I don't think they apply in a comparably worrisome way, because individual researchers do not have intelligence, money, and power comparable to the labs'. This is me stressing the "when put under great optimization pressure" part of Goodhart's Law. Subtle misalignments are much less dangerous when there is a weak optimization force behind the proxy than when there is a strong one.
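
A toy illustration of the "optimization pressure" point, under assumptions that are not stated anywhere in this thread: the true value and the proxy's error are independent standard normals, and optimization pressure is modeled as picking the best of N candidates by proxy score. The names `goodhart_gap` and `n_candidates` are hypothetical.

```python
import numpy as np

def goodhart_gap(n_candidates, n_trials=500, seed=0):
    """Average (proxy - true value) for the candidate selected by proxy score."""
    rng = np.random.default_rng(seed)
    true_value = rng.standard_normal((n_trials, n_candidates))
    error = rng.standard_normal((n_trials, n_candidates))  # assumed proxy error
    proxy = true_value + error
    best = proxy.argmax(axis=1)          # index of the proxy-maximizing candidate
    rows = np.arange(n_trials)
    return float((proxy[rows, best] - true_value[rows, best]).mean())

weak = goodhart_gap(n_candidates=2)      # weak optimization: best of 2
strong = goodhart_gap(n_candidates=1000) # strong optimization: best of 1000
print(weak, strong)
```

In this toy setup the same proxy error overstates the selected candidate's true value by more when the search is harder, which is one way to read "weak vs. strong optimization force".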

Sounds reasonable, though idk what you think realistic values of N are (my wild guess with hardly any thought is 15 minutes - 1 day).

EDIT: Tbc in the 1 day case I'm imagining that most of the time goes towards running the experiment -- it's more a claim about what experiments we want to run. If we just talk about the time to write the code and launch the experiment I'm thinking of N in the range of 5 minutes to 1 hour.

Cool, that all roughly makes sense to me :)

I was certainly imagining at least some amount of multi-tasking (e.g. 4 projects at once each of which runs 8x faster). This doesn't feel that crazy to me, I already do a moderate amount of multi-tasking.

Multi-tasking where you are responsible for the entire design of the project? (Designing the algorithm, choosing an experimental setting and associated metrics, knowing the related work, interpreting the results of the experiments, figuring out what the next experiment should be, ...)

Suppose today I gave you a dev... (read more)

2 Ryan Greenblatt 1mo
Probably yes for realistic values of N? Assuming the box is pretty smart at understanding instructions (and has an understanding of my typical ontology to the extent that you would get after working with me a few weeks and reading various posts) and the box will ask follow-up questions in cases where the instructions are unclear. (And we can do small diffs with reduced latency like asking for the results to be plotted in a different way.) My main concern is running out of ideas after a while despite copies of myself with more thinking time having more time to generate ideas.

I agree it helps to run experiments at small scales first, but I'd be pretty surprised if that helped to the point of enabling a 30x speedup -- that means that the AI labor allows you to get a 30x improvement in compute needed beyond what would be done by default by humans (though the 30x can include e.g. improving utilization, it's not limited just to making individual experiments take less time).

I think the most plausible case for your position would be that the compute costs for ML research scale much less than quadratically with the size of the pretrained m... (read more)

5 Ryan Greenblatt 1mo
Overall, this has updated me to some extent and it seems less plausible to me that ML research can achieve 30x speedups while having human researchers do all of the high level ideas. (I think the picture looks importantly better when AIs are removing this bottleneck.)

The situation I was imagining is where most experiments use some combination of:

  * A relatively small amount of finetuning/inference on the biggest models (including for the actual AI researcher)
  * Larger (possibly full) training runs, but at much smaller scale (e.g. GPT-3 level performance models)

Then, we can in total afford ~training dataset sized amounts of finetuning/inference for the biggest models (by the inference availability argument). And GPT-3 performance experiments will be pretty cheap. So assuming our base model looks like GPT-6 with the expected compute requirement and model size, this is a huge amount of possible inference availability.

So, the implicit claim is that compute costs scale much less than quadratically. It's certainly not obvious ML research can be progressed fast enough with this little compute.

I was certainly imagining at least some amount of multi-tasking (e.g. 4 projects at once each of which runs 8x faster). This doesn't feel that crazy to me, I already do a moderate amount of multi-tasking.

Note that this often involves multiple people working on the same paper. In the AI case, the division of labor might look at least somewhat different. (Though I don't think this changes the picture very much from what you're describing because most people now aren't the "ideas" people.)

I think ML research in particular can plausibly be accelerated by maybe 30x by only making it extremely fast and cheap to go from high level ideas to implemented experiments (rather than needing to generate these high level ideas)

Why doesn't compute become the bottleneck well before the 30x mark? It seems like the AIs have to be superhuman at something to overcome that bottleneck (rather than just making it fast and cheap to implement experiments). Indeed the AIs make the problem somewhat worse, since you have to spend compute to run the AIs.

6 Ryan Greenblatt 1mo
I guess I'm not that sold that compute will actually be that much of a key bottleneck for experiments in the future in a way that can't be overcome with 2x additional labor and/or a 2x slowdown. Like in many cases you can spend additional labor to reduce compute usage of experiments. (E.g., first run the experiments on smaller models.) And, we're conditioning on having really powerful AIs which correlates with a high baseline level of compute and that will help whenever we can run experiments at small scales and then use scaling laws etc. Further, the current scaling laws imply huge inference availability if huge amounts of compute are used for training. This might depend on what type of ML research we're talking about.
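
A back-of-the-envelope sketch of the "inference availability" point. It relies on two widely used approximations that are assumptions here rather than anything stated in the comment: training a dense transformer costs roughly 6·N·D FLOPs for N parameters and D training tokens, and a forward pass costs roughly 2·N FLOPs per token. The model size and token count below are made up for illustration.

```python
def training_flops(n_params, n_tokens):
    """Approximate training cost: ~6 FLOPs per parameter per token (assumed)."""
    return 6.0 * n_params * n_tokens

def inference_flops_per_token(n_params):
    """Approximate forward-pass cost: ~2 FLOPs per parameter per token (assumed)."""
    return 2.0 * n_params

def affordable_inference_tokens(n_params, n_train_tokens, budget_fraction=1.0):
    """Tokens of inference you can buy with a fraction of the training budget."""
    budget = budget_fraction * training_flops(n_params, n_train_tokens)
    return budget / inference_flops_per_token(n_params)

n_params = 1e12        # hypothetical model size
n_train_tokens = 2e13  # hypothetical training set size
tokens = affordable_inference_tokens(n_params, n_train_tokens)
print(tokens / n_train_tokens)  # inference tokens per training token
```

Under these two approximations the ratio is 3 regardless of model size, so a compute budget equal to the training run buys inference on a few training sets' worth of tokens.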

I think you mostly need to hope that it doesn't matter (because the crazy XOR directions aren't too salient) or come up with some new idea.

Yeah certainly I'd expect the crazy XOR directions aren't too salient.

I'll note that if it ends up these XOR directions don't matter for generalization in practice, then I start to feel better about CCS (along with other linear probing techniques). I know that for CCS you're more worried about issues around correlations with features like true_according_to_Alice, but my feeling is that we might be able to handle spuriou

... (read more)
3 Sam Marks 2mo
I agree with this! (And it's what I was trying to say; sorry if I was unclear.) My point is that { features which are as crazy as "true according to Alice" (i.e., not too crazy) } seems potentially manageable, whereas { features which are as crazy as arbitrary boolean functions of other features } seems totally unmanageable. Thanks, as always, for the thoughtful replies.

Yeah, agreed that's a clear overclaim.

In general I believe that many (most?) people take it too far and make incorrect inferences -- partly on priors about popular posts, and partly because many people including you believe this, and those people engage more with the Simulators crowd than I do.

Fwiw I was sympathetic to nostalgebraist's positive review saying:

sometimes putting a name to what you "already know" makes a whole world of difference. [...] I see these takes, and I uniformly respond with some version of the sentiment "it seems like you aren't thin

... (read more)

Yeah, I would be surprised if this is a good first-order approximation of what is going on inside an LLM. Or maybe you mean this in a non-mechanistic way?

Yes, I definitely meant this in the non-mechanistic way. Any mechanistic claims that sound simulator-flavored based just on the evidence in this post sound clearly overconfident and probably wrong. I didn't reread this post carefully but I don't remember seeing mechanistic claims in it.

I agree that in a non-mechanistic way, the above will produce reasonable predictions, but that's because that's basicall

... (read more)
2 Oliver Habryka 2mo
Hmm, yeah, this perspective makes more sense to me, and I don't currently believe you ended up making any of the wrong inferences I've seen others make on the basis of the post.  I do sure see many other people make inferences of this type. See for example the tag page for Simulator Theory which says:  This also directly claims that the physics the system learned are "the mechanics underlying our world", which I think isn't totally false (they have probably learned a good chunk of the mechanics of our world) but is inaccurate as something trying to describe most of what is going on in a base model's cognition.

The thing that's confusing here is that the two-way XORs that my experiments are looking at just seem clearly not useful for anything.

Idk, I think it's pretty hard to know what things are and aren't useful for predicting the next token. For example, some of your features involve XORing with a "has_not" feature -- XORing with an indicator for "not" might be exactly what you want to do to capture the effect of the "not".

(Tbc here the hypothesis could be "the model computes XORs with has_not all the time, and then uses only some of them", so it does have some... (read more)

1 Sam Marks 2mo
I agree that "the model has learned the algorithm 'always compute XORs with has_not'" is a pretty sensible hypothesis. (And might be useful to know, if true!) FWIW, the stronger example of "clearly not useful XORs" I was thinking of is has_true XOR has_banana, where I'm guessing you're anticipating that this XOR exists incidentally.

Focusing again on the Monster gridworld setting, here are two different ways that your goals could misgeneralize:

  1. player_has_shield is spuriously correlated with high_score during training, so the agent comes to value both
  2. monster_present XOR high_score is spuriously correlated with high_score during training, so the agent comes to value both.

These are pretty different things that could go wrong. Before realizing that these crazy XOR features existed, I would only have worried about (1); now that I know these crazy XOR features exist ... I think I mostly don't need to worry about (2), but I'm not certain and it might come down to details about the setting. (Indeed, your CCS challenges work has shown that sometimes these crazy XOR features really can get in the way!)

I agree that you can think of this issue as just being the consequence of the two issues "there are lots of crazy XOR features" and "linear probes can pick up on spurious correlations." I guess this issue feels qualitatively new to me because it just seems pretty intractable to deal with it on the data augmentation level (how do you control for spurious correlations with arbitrary boolean functions of undesired features?). I think you mostly need to hope that it doesn't matter (because the crazy XOR directions aren't too salient) or come up with some new idea.

I'll note that if it ends up these XOR directions don't matter for generalization in practice, then I start to feel better about CCS (along with other linear probing techniques).[1]

If I had to articulate my reason for being surprised here, it'd be something like: 1. I didn't expect LLMs to compute many XO

I think the main thing I'd point to is this section (where I've changed bullet points to numbers for easier reference):

I can’t convey all that experiential data here, so here are some rationalizations of why I’m partial to the term, inspired by the context of this post:

  1. The word “simulator” evokes a model of real processes which can be used to run virtual processes in virtual reality.
  2. It suggests an ontological distinction between the simulator and things that are simulated, and avoids the fallacy of attributing contingent properties of the latter to the for
... (read more)
3 Oliver Habryka 2mo
Yeah, I would be surprised if this is a good first-order approximation of what is going on inside an LLM. Or maybe you mean this in a non-mechanistic way?

I agree that in a non-mechanistic way, the above will produce reasonable predictions, but that's because that's basically a description of the task the LLM is trained on.

Like, the above sounds similar to me to "in order to predict what AlphaZero will do, choose some promising moves, then play forward the game and predict after which moves AlphaZero is most likely to win, then adopt the move that most increases the probability of winning as your prediction of what AlphaZero does". Of course, that is approximately useless advice, since basically all you've done is describe the training setup of AlphaZero.

As a mechanistic explanation, I would be surprised if even with amazing mechanistic interpretability you will find some part of the LLM whose internal structure corresponds in a lot of detail to the mind or brain of the kind of person it is trying to "simulate". I expect the way you get low loss here will involve an enormous amount of non-simulating cognition (see again my above analogy about how when humans engage in roleplay, we engage in a lot of non-simulating cognition).

To maybe go into a bit more depth on what wrong predictions I've seen people make on the basis of this post:

  * I've seen people make strong assertions about what kind of cognition is going on inside of LLMs, ruling out things like situational awareness for base models (it's quite hard to know whether base models have any situational awareness, though RLHF'd models clearly have some level. I also think what situational awareness would mean for base models is a bit confusing, but not that confusing, like it would just mean that as you scale up the model its behavior would become quite sensitive to the context in which it is run)
  * I've seen people make strong predictions that LLM performance can't become superhuman on various tasks, s

Nice post, and glad this got settled experimentally! I think it isn't quite as counterintuitive as you make it out to be -- the observations seem like they have reasonable explanations.

I feel pretty confident that there's a systematic difference between basic features and derived features, where the basic features are more "salient" -- I'll be assuming such a distinction in the rest of the comment.

(I'm saying "derived" rather than "XOR" because it seems plausible that some XOR features are better thought of as "basic", e.g. if they were very useful for the... (read more)

3 Sam Marks 2mo
I agree with a lot of this, but some notes:

The thing that's confusing here is that the two-way XORs that my experiments are looking at just seem clearly not useful for anything. So I think any utility explanation that's going to be correct needs to be a somewhat subtle one of the form "the model doesn't initially know which XORs will be useful, so it just dumbly computes way more XORs than it needs, including XORs which are never used in any example in training." Or in other words "the model has learned the algorithm 'compute lots of XORs' rather than having learned specific XORs which it's useful to compute."

I think this subtlety changes the story a bit. One way that it changes the story is that you can't just say "the model won't compute multi-way XORs because they're not useful" -- the two-way XORs were already not useful! You instead need to argue that the model is implementing an algorithm which computed all the two-way XORs but didn't compute XORs of XORs; it seems like this algorithm might need to encode somewhere information about which directions correspond to basic features and which don't.

Even though on a surface level this resembles the failure discussed in the post (because one feature is held fixed during training), I strongly expect that the sorts of failures you cite here are really generalization failures for "the usual reasons" of spurious correlations during training. For example, during training (because monsters are present), "get a high score" and "pick up shields" are correlated, so the agents learn to value picking up shields. I predict that if you modified the train set so that it's no longer useful to pick up shields (but monsters are still present), then the agent would no longer pick up shields, and so would no longer misgeneralize in this particular way.

In contrast, the point I'm trying to make in the post is that RAX can cause problems even in the absence of spurious correlations like this.[1]

As you noted, it will sometimes be

Are you saying that this claim is supported by PCA visualizations you've done?

Yes, but they're not in the paper. (I also don't remember if these visualizations were specifically on banana/shed or one of the many other distractor experiments we did.)

I'll say that I've done a lot of visualizing true/false datasets with PCA, and I've never noticed anything like this, though I never had as clean a distractor feature as banana/shed.

It is important for the distractor to be clean (otherwise PCA might pick up on other sources of variance in the activations as the ... (read more)

(To summarize the parallel thread)

The claim is that the learned probe is . As shown in Theorem 1, if you chug through the math with this probe, it gets low CCS loss and leads to an induced classifier .*

You might be surprised that this is possible, because the CCS normalization is supposed to eliminate  -- but what the normalization does is remove linearly-accessible information about . However,  is not linearly accessible, and... (read more)

The point is that while the normalization eliminates , it does not eliminate , and it turns out that LLMs really do encode the XOR linearly in the residual stream.

Why does the LLM do this? Suppose you have two boolean variables a and b. If the neural net uses three dimensions to represent a, b, and a⊕b, I believe that allows it to recover arbitrary boolean functions of a and b linearly from the residual stream. So you might expect the LLM to do this "by default"... (read more)
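
The claim that one extra direction for the XOR makes every boolean function of two variables linearly readable can be checked by brute force. The sketch below is an illustration only: it writes the two variables as `a` and `b`, treats the "residual stream" as the idealized vector [1, a, b, a XOR b] (the constant 1 stands in for a bias term, an assumption of this sketch), and fits an affine readout to each of the 16 possible truth tables.

```python
import itertools
import numpy as np

# Rows are the four inputs (a, b); columns are [1, a, b, a XOR b].
inputs = list(itertools.product([0, 1], repeat=2))
with_xor = np.array([[1, a, b, a ^ b] for a, b in inputs], dtype=float)
without_xor = with_xor[:, :3]  # same representation, XOR direction removed

def max_fit_error(features, truth_table):
    """Worst-case error of the best affine readout for one boolean function."""
    target = np.array(truth_table, dtype=float)
    weights, *_ = np.linalg.lstsq(features, target, rcond=None)
    return float(np.abs(features @ weights - target).max())

all_functions = list(itertools.product([0, 1], repeat=4))  # 16 truth tables
errors_with = [max_fit_error(with_xor, f) for f in all_functions]
errors_without = [max_fit_error(without_xor, f) for f in all_functions]

print(max(errors_with))     # numerically zero: every function is an exact readout
print(max(errors_without))  # clearly nonzero: XOR itself cannot be fit from a and b
```

For example, AND(a, b) = (a + b - a⊕b) / 2 is exactly linear once the XOR coordinate is available, while no affine function of a and b alone reproduces it.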

2 Sam Marks 2mo
Thanks! I'm still pretty confused though.

It sounds like you're making an empirical claim that in this banana/shed example, the model is representing the features has_banana(x), has_true(x), and has_banana(x)⊕has_true(x) along linearly independent directions. Are you saying that this claim is supported by PCA visualizations you've done? Maybe I'm missing something, but none of the PCA visualizations I'm seeing in the paper seem to touch on this. E.g. the visualization in figure 2(b) (reproduced below) is colored by is_true(x), not has_true(x). Are there other visualizations showing linear structure to the feature has_banana(x)⊕has_true(x) independent of the features has_banana(x) and has_true(x)? (I'll say that I've done a lot of visualizing true/false datasets with PCA, and I've never noticed anything like this, though I never had as clean a distractor feature as banana/shed.)

More broadly, it seems like you're saying that you think in general, when LLMs have linearly-represented features a and b they will also tend to linearly represent the feature a⊕b. Taking this as an empirical claim about current models, this would be shocking. (If this was meant to be a claim about a possible worst-case world, then it seems fine.)

For example, if I've done my geometry right, this would predict that if you train a supervised probe (e.g. with logistic regression) to classify a=0 vs 1 on a dataset where b=0, the resulting probe should get ~50% accuracy on a test dataset where b=1. And this should apply for any features a,b. But this is certainly not the typical case, at least as far as I can tell!

Concretely, if we were to prepare a dataset of 2-token prompts where the first word is always "true" or "false" and the second word is always "banana" or "shed," do you predict that a probe trained with logistic regression on the dataset {(true banana,1),(false banana,0)} will have poor accuracy when tested on {(true shed,1),(false shed,0)}?
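
The geometric prediction in this comment can be reproduced in a toy model. The sketch below is not an experiment on any real LLM: it assumes an idealized three-dimensional "residual stream" in which a, b, and a⊕b sit along orthogonal, equally scaled directions, and it trains a logistic-regression probe by plain gradient descent. The function names are hypothetical.

```python
import numpy as np

def activations(a, b):
    """Toy 'residual stream': +/-1 coordinates for a, b and a XOR b (assumed)."""
    sign = lambda bit: 2.0 * bit - 1.0
    return np.stack([sign(a), sign(b), sign(a ^ b)], axis=1)

def train_logistic_probe(x, y, steps=500, lr=0.5):
    """Plain gradient descent on logistic loss, starting from zero weights."""
    w, bias = np.zeros(x.shape[1]), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(x @ w + bias)))
        w -= lr * (x.T @ (p - y)) / len(y)
        bias -= lr * float(np.mean(p - y))
    return w, bias

a = np.array([0, 1] * 50)   # the feature being probed for
b0 = np.zeros_like(a)       # training split: b = 0 everywhere
b1 = np.ones_like(a)        # test split: b = 1 everywhere

w, bias = train_logistic_probe(activations(a, b0), a.astype(float))
train_logits = activations(a, b0) @ w + bias
test_logits = activations(a, b1) @ w + bias

train_acc = float(np.mean((train_logits > 0) == (a == 1)))
test_spread = float(test_logits.max() - test_logits.min())
print(train_acc, test_spread)
```

On the training split the a and a⊕b coordinates are identical, so the probe weights them equally; on the test split they have opposite signs and cancel, leaving test logits that are essentially constant in a. That is the chance-level behavior the comment describes, and it depends entirely on the assumed equal salience of the XOR direction.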

Good point on the rotational symmetry, that makes sense now.

I still think that this assumption is fairly realistic because in practice, most pairs of unrelated features would co-occur only very rarely, and I expect the winner-take-all dynamic to dominate most of the time. But I agree that it would be nice to quantify this and test it out.

Agreed that's a plausible hypothesis. I mostly wish that in this toy model you had a hyperparameter for the frequency of co-occurrence of features, and identified how it affects the rate of incidental polysemanticity.

I think I agree with all of that (with the caveat that it's been months and I only briefly skimmed the past context, so further thinking is unusually likely to change my mind).

My guess is that this result is very sensitive to the design of the training dataset:

the input/output data pairs are (e_i, e_i) for i ∈ [n], where e_i is the i-th basis vector.

In particular, I think it is likely very sensitive to the implicit assumption that feature i and feature j never co-occur on a single input. I'd be interested to see experiments where each feature is turned on with some (not too small) probability, independently of all other features, similarly to the original toy models setting. This would result in so... (read more)
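
A minimal sketch of the two data regimes being contrasted: one-hot inputs, where features never co-occur, versus inputs where each feature is switched on independently. The function names, the activation probabilities, and the choice of 10 features are all hypothetical; in an autoencoder-style toy model the targets would simply equal the inputs.

```python
import numpy as np

def one_hot_dataset(n_features):
    """Inputs where exactly one feature is active (no co-occurrence)."""
    return np.eye(n_features)

def independent_dataset(n_features, p_active, n_samples, seed=0):
    """Inputs where each feature is on independently with probability p_active."""
    rng = np.random.default_rng(seed)
    return (rng.random((n_samples, n_features)) < p_active).astype(float)

def cooccurrence_rate(inputs):
    """Fraction of inputs on which at least two features are active."""
    return float(np.mean(inputs.sum(axis=1) >= 2))

print(cooccurrence_rate(one_hot_dataset(10)))
for p in (0.01, 0.1, 0.5):
    print(p, cooccurrence_rate(independent_dataset(10, p, n_samples=5000)))
```

Sweeping `p_active` gives a single knob for how often features co-occur, which is the quantity the comment suggests the result may be sensitive to.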

3 Victor Lecomte 3mo
Thanks for the feedback! Definitely! I still think that this assumption is fairly realistic because in practice, most pairs of unrelated features would co-occur only very rarely, and I expect the winner-take-all dynamic to dominate most of the time. But I agree that it would be nice to quantify this and test it out. If there is no L1 regularization on activations, then every hidden neuron would indeed be highly "polysemantic" in the sense that it has nonzero weights for each input feature. But on the other hand, the whole encoding space would become rotationally symmetric, and when that's the case it feels like polysemanticity shouldn't be about individual neurons (since the canonical basis is not special anymore) and instead about the angles that different encodings form. In particular, as long as m ≥ n, the space of optimal solutions for this setup requires the encodings W_i to form angles of at least 90° with each other, and it's unclear whether we should call this polysemantic. So one of the reasons why we need L1 regularization is to break the rotational symmetry and create a privileged basis: that way, it's actually meaningful to ask whether a particular hidden neuron is representing more than one feature.
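
A small numerical illustration of why an L1 penalty creates a privileged basis while an L2-type quantity does not. It is a generic two-dimensional example, not the model from the post: the same activation vector is expressed in the original hidden basis and in a basis rotated by 45 degrees.

```python
import numpy as np

def rotation(theta):
    """2D rotation matrix by angle theta (radians)."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

h = np.array([1.0, 0.0])          # activation aligned with one hidden neuron
h_rot = rotation(np.pi / 4) @ h   # same activation in a rotated basis

l2_before, l2_after = np.linalg.norm(h), np.linalg.norm(h_rot)
l1_before, l1_after = np.abs(h).sum(), np.abs(h_rot).sum()
print(l2_before, l2_after)  # equal up to floating point
print(l1_before, l1_after)  # L1 grows from 1 to about sqrt(2)
```

The L2 norm cannot tell the two bases apart, whereas the L1 norm is smallest when the activation lies along a single neuron, so only the L1 term gives the optimizer a reason to prefer neuron-aligned encodings.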

Unless by "shrugs" you mean the details of what the partial hypothesis says in this particular case are still being worked out.

Yes, that's what I mean.

I do agree that it's useful to know whether a partial hypothesis says anything or not; overall I think this is good info to know / ask for. I think I came off as disagreeing more strongly than I actually did, sorry about that.

Do you have any plans to do this?

No, we're moving on to other work: this took longer than we expected, and was less useful for alignment than we hoped (though that part wasn't that unex... (read more)

Which of these theories [...] can predict the same "four novel predictions about grokking" yours did? The relative likelihoods are what matters for updates after all.

I disagree with the implicit view on how science works. When you are a computationally bounded reasoner, you work with partial hypotheses, i.e. hypotheses that only make predictions on a small subset of possible questions, and just shrug at other questions. This is mostly what happens with the other theories:

  1. Difficulty of representation learning: Shrugs at our prediction about  /
... (read more)
Implicitly, I thought that if you have a partial hypothesis of grokking, then if it shrugs at a grokking-related phenomenon it should be penalized. Unless by "shrugs" you mean the details of what the partial hypothesis says in this particular case are still being worked out. But in that case, confirming that the partial hypothesis doesn't say anything yet about some phenomenon is still useful info. I'm fairly sure this belief was what generated my question. Thank you for going through the theories and checking what they have to say. That was helpful to me. Do you have any plans to do this? How much time do you think it would take? And do you have any predictions for what should happen in these cases?

From page 6 of the paper:

Ungrokking can be seen as a special case of catastrophic forgetting (McCloskey and Cohen, 1989; Ratcliff, 1990), where we can make much more precise predictions. First, since ungrokking should only be expected once , if we vary  we predict that there will be a sharp transition from very strong to near-random test accuracy (around ). Second, we predict that ungrokking would arise even if we only remove examples from the training dataset, whereas catastrophic forgetting typically involves trainin

... (read more)

I think that post has a lot of good ideas, e.g. the idea that generalizing circuits get reinforced by SGD more than memorizing circuits at least rhymes with what we claim is actually going on (that generalizing circuits are more efficient at producing strong logits with small param norm). We probably should have cited it, I forgot that it existed.

But it is ultimately a different take and one that I think ends up being wrong (e.g. I think it would struggle to explain semi-grokking).

I also think my early explanation, which that post compares to, is basically... (read more)

I think I would particularly critique DeepMind and OpenAI's interpretability works, as I don't see how this reduces risks more than other works that they could be doing, and I'd appreciate a written plan of what they expect to achieve.

I can't speak on behalf of Google DeepMind or even just the interpretability team (individual researchers have pretty different views), but I personally think of our interpretability work as primarily a bet on creating new affordances upon which new alignment techniques can be built, or existing alignment techniques can be en... (read more)

Yeah, that seems like a reasonable operationalization of "capable of doing X". So my understanding is that (1), (3), (6) and (7) would not falsify the hypothesis under your operationalization, (5) would falsify it, (2) depends on details, and (4) is kinda ambiguous but I tend to think it would falsify it.

1 Lukas Finnveden 3mo
I think (5) also depends on further details.

As you have written it, both the 2023 and 2033 attempts use similar data and similar compute. But in my proposed operationalization, "you can get it to do X" is allowed to use a much greater amount of resources ("say, 1% of the pre-training budget") than the test for whether the model is "capable of doing X" ("Say, at most 1000 data points".)

I think that's important:

  * If both the 2023 and the 2033 attempt are really cheap low-effort attempts, then I don't think that the experiment is very relevant for whether "you can get it to do X" in the sort of high-stakes, high-effort situations that I'm imagining that we'll be in when we're trying to eval/align AI models to avoid takeover.
  * It seems super plausible that a low-effort attempt could fail, and then succeed later on with 10 more years of knowledge of best practices. I wouldn't learn much from that happening once.
  * If both the 2023 and the 2033 attempts are really expensive and high-effort (e.g. 1% of pre-training budget), then I think it's very plausible that the 2033 training run gave the model new capabilities that it didn't have before.
  * And in particular: capabilities that the model wouldn't have been able to utilize in a takeover attempt that it was very motivated to do its best at. (Which is ultimately what we care about.)

By a similar argument, I would think that (4) wouldn't falsify the hypothesis as-written, but would falsify the hypothesis if the first run was a much more high-effort attempt. With lots of iteration by a competent team, and more like a $1,000,000 budget. But the 2nd run, with a much more curated and high-quality dataset, still just used $1,000 of training compute.

One thing that I'm noticing while writing this is something like: The argument that "elicitation efforts would get to use ≥1% of the training budget" makes sense if we're eliciting all the capabilities at once, or if there's only a few important capabilities to e

Which of (1)-(7) above would falsify the hypothesis if observed? Or if there isn't enough information, what additional information do you need to tell whether the hypothesis has been falsified or not?

The “no sandbagging on checkable tasks” hypothesis: With rare exceptions, if a not-wildly-superhuman ML model is capable of doing some task X, and you can check whether it has done X, then you can get it to do X using already-available training techniques (e.g., fine-tuning it using gradient descent).[1]

I think as phrased this is either not true, or tautological, or otherwise imprecisely specified (in particular I'm not sure what it means for a model to be "capable of" doing some task X -- so far papers define that to be "can you quickly finetune the model... (read more)

3 Lukas Finnveden 7mo
Here's a proposed operationalization.

For models that can't gradient hack: The model is "capable of doing X" if it would start doing X upon being fine-tuned to do it using a hypothetical, small finetuning dataset that demonstrated how to do the task. (Say, at most 1000 data points.) (The hypothetical fine-tuning dataset should be a reasonable dataset constructed by a hypothetical team of humans who know how to do the task but aren't optimizing the dataset hard for ideal gradient updates to this particular model, or anything like that.)

For models that might be able to gradient-hack, but are well-modelled as having certain goals: The model is "capable of doing X" if it would start doing X if doing X was a valuable instrumental goal for it.

For both kinds: "you can get it to do X" if you could make it do X with some large amount of research+compute budget (say, 1% of the pre-training budget), no-holds-barred.

Edit: Though I think your operationalization also looks fine. I mainly wanted to point out that the "finetuning" definition of "capable of doing X" might be ok if you include the possibility of finetuning on hypothetical datasets that we don't have access to. (Since we only know how to check the task — not perform it.)
3 Tom Davidson 7mo
I read "capable of X" as meaning something like "if the model was actively trying to do X then it would do X". I.e. a misaligned model doesn't reveal the vulnerability to humans during testing bc it doesn't want them to patch it, but then later it exploits that same vulnerability during deployment bc it's trying to hack the computer system

I expect a delay even in the infinite data case, I think?

Although I'm not quite sure what you mean by "infinite data" here -- if the argument is that every data point will have been seen during training, then I agree that there won't be any delay. But yes training on the test set (even via "we train on everything so there is no possible test set") counts as cheating for this purpose.

Honestly I'd be surprised if you could achieve (2) even with explicit regularization, specifically on the modular addition task.

(You can achieve it by initializing the token embeddings to those of a grokked network so that the representations are appropriately structured; I'm not allowing things like that.)

EDIT: Actually, Omnigrok does this by constraining the parameter norm. I suspect this is mostly making it very difficult for the network to strongly memorize the data -- given the weight decay parameter the network "tries" to learn a high-param norm memo... (read more)
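
For readers unfamiliar with what "constraining the parameter norm" can look like mechanically, here is a generic projected-gradient sketch. It is not Omnigrok's actual code and makes no claim about its details: the function names, the learning rate, and the target norm are hypothetical, and the gradients are placeholders.

```python
import numpy as np

def project_to_norm(params, target_norm):
    """Rescale a list of arrays so their joint L2 norm equals target_norm."""
    total = np.sqrt(sum(float(np.sum(p ** 2)) for p in params))
    if total == 0.0:
        return [p.copy() for p in params]  # nothing to rescale
    return [p * (target_norm / total) for p in params]

def constrained_sgd_step(params, grads, lr, target_norm):
    """One gradient step followed by projection back onto the norm sphere."""
    stepped = [p - lr * g for p, g in zip(params, grads)]
    return project_to_norm(stepped, target_norm)

rng = np.random.default_rng(0)
params = [rng.standard_normal((4, 3)), rng.standard_normal(3)]
grads = [np.ones((4, 3)), np.ones(3)]  # placeholder gradients
params = constrained_sgd_step(params, grads, lr=0.1, target_norm=2.0)
print(np.sqrt(sum(np.sum(p ** 2) for p in params)))
```

Because the norm is pinned after every step, a solution that needs a large parameter norm is simply unreachable, which is the sense in which such a constraint could make strong memorization difficult.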

1Robert Kirk7mo
If you train on infinite data, I assume you'd not see a delay between training and testing, but you'd expect a non-monotonic accuracy curve that looks kind of like the test accuracy curve in the finite-data regime? So I assume infinite data is also cheating?

In particular, this point of view further (and perhaps almost completely) demystifies the use of the Fourier basis. 

I disagree at least with the "almost completely" version of this claim:

Notice that the operation you want to learn is manifestly a convolution operation, i.e.

This also applies to the non-modular addition operation, but I think it's pretty plausible that if you train on non-modular addition (to the point of ~perfect generalization), the network would learn an embedding that converts the "tokenized" ... (read more)
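
To make the "convolution" reading of the quoted claim concrete, here is a minimal sketch in plain Python. It assumes inputs are represented as one-hot vectors over the residues mod p; the helper names (`one_hot`, `circular_convolve`, `dft`) and the choice p = 7 are illustrative assumptions, not anything from the post. It checks two standard facts: the circular convolution of the one-hot vectors for x and y is the one-hot vector for (x + y) mod p, and the discrete Fourier transform turns that convolution into a pointwise product.

```python
import cmath


def one_hot(x, p):
    """One-hot vector for residue x modulo p (illustrative representation)."""
    return [1.0 if i == x else 0.0 for i in range(p)]


def circular_convolve(a, b):
    """Circular convolution: out[k] = sum_i a[i] * b[(k - i) mod p]."""
    p = len(a)
    return [sum(a[i] * b[(k - i) % p] for i in range(p)) for k in range(p)]


def dft(v):
    """Naive discrete Fourier transform, O(p^2), fine for small p."""
    p = len(v)
    return [
        sum(v[n] * cmath.exp(-2j * cmath.pi * k * n / p) for n in range(p))
        for k in range(p)
    ]


p = 7
x, y = 5, 4

# Convolving the two one-hot inputs yields the one-hot vector of (x + y) mod p.
conv = circular_convolve(one_hot(x, p), one_hot(y, p))
sum_is_convolution = conv == one_hot((x + y) % p, p)

# In the Fourier basis the convolution is just a pointwise product.
lhs = dft(conv)
rhs = [fa * fb for fa, fb in zip(dft(one_hot(x, p)), dft(one_hot(y, p)))]
fourier_diagonalizes = all(abs(l - r) < 1e-9 for l, r in zip(lhs, rhs))
```

This only illustrates why the Fourier basis is a natural fit for the modular task; it says nothing about what a trained network would learn for non-modular addition, which is the point under dispute.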

Since I'm an author on that paper, I wanted to clarify my position here. My perspective is basically the same as Steven's: there's a straightforward conceptual argument that goal-directedness leads to convergent instrumental subgoals, this is an important part of the AI risk argument, and the argument gains much more legitimacy and slightly more confidence in correctness by being formalized in a peer-reviewed paper.

I also think this has basically always been my attitude towards this paper. In particular, I don't think I ever thought of this paper as provid... (read more)

You're right, I incorrectly interpreted the sup as an inf, because I thought that they wanted to assume that there exists a prompt creating an adversarial example, rather than saying that every prompt can lead to an adversarial example.

I'm still not very compelled by the theorem -- it's saying that if adversarial examples are always possible (the sup condition you mention) and you can always provide evidence for or against adversarial examples (Definition 2) then you can make the adversarial example probable (presumably by continually providing evidence for adversarial examples). I don't really feel like I've learned anything from this theorem.
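
A toy sketch of the mechanism I have in mind, under assumptions that are mine rather than the paper's: the model is a two-component mixture with prior weight `alpha` on a "bad" component, each component emits sentences independently, and a sentence is "bad" with a fixed per-component rate. All names and numbers below are hypothetical. Each bad sentence in the prompt multiplies the odds in favour of the bad component by the ratio of the two rates, so the posterior weight climbs with prompt length.

```python
def posterior_on_bad_component(alpha, bad_rate_p0, bad_rate_p1, n_bad):
    """Posterior weight on the bad component after n_bad bad sentences.

    alpha:        prior weight on the bad component (assumed)
    bad_rate_p0:  chance the bad component emits a bad sentence (assumed)
    bad_rate_p1:  chance the good component emits a bad sentence (assumed)
    """
    weight_p0 = alpha * bad_rate_p0 ** n_bad
    weight_p1 = (1 - alpha) * bad_rate_p1 ** n_bad
    return weight_p0 / (weight_p0 + weight_p1)


def next_sentence_bad_prob(alpha, bad_rate_p0, bad_rate_p1, n_bad):
    """Chance the next sentence is bad, given a prompt of n_bad bad sentences."""
    post = posterior_on_bad_component(alpha, bad_rate_p0, bad_rate_p1, n_bad)
    return post * bad_rate_p0 + (1 - post) * bad_rate_p1


# Posterior weight on the bad component for prompts of 0..5 bad sentences.
posteriors = [posterior_on_bad_component(0.01, 0.9, 0.1, n) for n in range(6)]
```

In this toy setup the posterior starts at the prior and rises monotonically toward 1, which is all the "continually providing evidence" step needs. A model that does not update on the prompt this way would not show the effect.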

3Johannes Treutlein8mo
My takeaway from looking at the paper is that the main work is being done by the assumption that you can split up the joint distribution implied by the model as a mixture distribution P=αP0+(1−α)P1, such that the model does Bayesian inference in this mixture model to compute the next sentence given a prompt, i.e., we have P(s∣s0)=P(s⊗s0)/P(s0). Together with the assumption that P0 is always bad (the sup condition you talk about), this makes the whole approach with giving more and more evidence for P0 by stringing together bad sentences in the prompt work.

To see why this assumption is doing the work, consider an LLM that completely ignores the prompt and always outputs sentences from a bad distribution with α probability and from a good distribution with (1−α) probability. Here, adversarial examples are always possible. Moreover, the bad and good sentences can be distinguishable, so Definition 2 could be satisfied. However, the result clearly does not apply (since you just cannot up- or downweigh anything with the prompt, no matter how long). The reason for this is that there is no way to split up the model into two components P0 and P1, where one of the components always samples from the bad distribution.

This assumption implies that there is some latent binary variable of whether the model is predicting a bad distribution, and the model is doing Bayesian inference to infer a distribution over this variable and then sample from the posterior. It would be violated, for instance, if the model is able to ignore some of the sentences in the prompt, or if it is more like a hidden Markov model that can also allow for the possibility of switching characters within a sequence of sentences (then either P0 has to be able to also output good sentences sometimes, or the assumption P=αP0+(1−α)P1 is violated).

I do think there is something to the paper, though. It seems that when talking e.g. about the Waluigi effect people often take the stance that the model is doing this
1Lukas Finnveden8mo
Yeah, I also don't feel like it teaches me anything interesting.

I forget if I already mentioned this to you, but another example where you can interpret randomization as worst-case reasoning is MaxEnt RL, see this paper. (I reviewed an earlier version of this paper here (review #3).)

Possibly, but in at least one of the two cases I was thinking of when writing this comment (and maybe in both), I made the argument in the parent comment and the person agreed and retracted their point. (I think in both cases I was talking about deceptive alignment via goal misgeneralization.)

Okay, I understand how that addresses my edit.

I'm still not quite sure why the lightcone theorem is a "foundation" for natural abstraction (it looks to me like a nice concrete example on which you could apply techniques) but I think I should just wait for future posts, since I don't really have any concrete questions at the moment.

3Thane Ruthenis9mo
My impression is that it being a concrete example is the why. "What is the right framework to use?" and "what is the environment-structure in which natural abstractions can be defined?" are core questions of this research agenda, and this sort of multi-layer locality-including causal model is one potential answer. The fact that it loops-in the speed of causal influence is also suggestive — it seems fundamental to the structure of our universe, crops up in a lot of places, so the proposition that natural abstractions are somehow downstream of it is interesting.

Okay, that mostly makes sense.

note that the resampler itself throws away a ton of information about X0 while going from X0 to XT. And that is indeed information which "could have" been relevant, but almost always gets wiped out by noise. That's the information we're looking to throw away, for abstraction purposes.

I agree this is true, but why does the Lightcone theorem matter for it?

It is also a theorem that a Gibbs resampler initialized at equilibrium will produce XT distributed according to P[X], and as you say it's c... (read more)

Sounds like we need to unpack what "viewing X0 as a latent which generates X" is supposed to mean.

I start with a distribution P[X]. Let's say X is a bunch of rolls of a biased die, of unknown bias. But I don't know that's what X is; I just have the joint distribution of all these die-rolls. What I want to do is look at that distribution and somehow "recover" the underlying latent variable (bias of the die) and factorization, i.e. notice that I can write the distribution as P[X]=∑Λ(∏iP[Xi|Λ])P[Λ], where Λ is the bias in this case. Then when reasoning/updating, we can usually just think about how an individual die-roll interacts with Λ, rather than all the other rolls, which is useful insofar as Λ is much smaller than all the rolls.

Note that P[X|Λ] is not supposed to match P[X]; then the representation would be useless. It's the marginal ∑Λ(∏iP[Xi|Λ])P[Λ] which is supposed to match P[X].

The lightcone theorem lets us do something similar. Rather than all the Xi's being independent given Λ, only those Xi's sufficiently far apart are independent, but the concept is otherwise similar. We express P[X] as ∑X0P[X|X0]P[X0] (or, really, ∑ΛP[X|Λ]P[Λ], where Λ summarizes info in X0 relevant to X, which is hopefully much smaller than all of X).
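
A small worked version of the die example may help. This is only a sketch under assumptions I am adding: the latent Λ takes just two values, a "fair" die and a "loaded" die that shows six half the time, with equal prior weight. The names `BIASES`, `PRIOR` and `joint` are hypothetical. It builds the joint over rolls as a prior-weighted sum, over Λ, of a product of per-roll terms, and shows that the rolls are dependent in the marginal even though they are independent given Λ.

```python
from fractions import Fraction
from itertools import product

FACES = range(1, 7)

# Hypothetical latent values: per-face probabilities under each bias.
BIASES = {
    "fair": {f: Fraction(1, 6) for f in FACES},
    "loaded": {f: Fraction(1, 2) if f == 6 else Fraction(1, 10) for f in FACES},
}
PRIOR = {"fair": Fraction(1, 2), "loaded": Fraction(1, 2)}


def joint(rolls):
    """P[X = rolls]: sum over the latent of prior times product of roll terms."""
    total = Fraction(0)
    for bias_name, prior in PRIOR.items():
        likelihood = Fraction(1)
        for roll in rolls:
            likelihood *= BIASES[bias_name][roll]  # rolls independent given bias
        total += prior * likelihood
    return total


# The mixture is a proper distribution over pairs of rolls.
total_mass = sum(joint(pair) for pair in product(FACES, repeat=2))

# Marginally the rolls are NOT independent: seeing one six makes another likelier.
p_six = joint((6,))
p_two_sixes = joint((6, 6))
rolls_are_dependent = p_two_sixes != p_six * p_six
```

Here P[six] works out to 1/3 and P[two sixes] to 5/36, which exceeds (1/3)² = 4/36. All of the dependence between rolls flows through the latent, which is what makes it a useful summary.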

The Lightcone Theorem says: conditional on X0, any sets of variables in XT which are a distance of at least 2T apart in the graphical model are independent.

I am confused. This sounds to me like:

If you have sets of variables that start with no mutual information (conditioning on X0), and they are so far away that nothing other than X0 could have affected both of them (distance of at least 2T), then they continue to have no mutual information (independent).

Some things that I am confused about as a result:

  1. I don't se
... (read more)
Yup, that's basically it. And I agree that it's pretty obvious once you see it - the key is to notice that distance 2T implies that nothing other than X0 could have affected both of them. But man, when I didn't know that was what I should look for? Much less obvious. It does, but then XT doesn't have the same distribution as the original graphical model (unless we're running the sampler long enough to equilibrate). So we can't view X0 as a latent generating that distribution. Not quite - note that the resampler itself throws away a ton of information about X0 while going from X0 to XT. And that is indeed information which "could have" been relevant, but almost always gets wiped out by noise. That's the information we're looking to throw away, for abstraction purposes. So the reason this is interesting (for the thing you're pointing to) is not that it lets us ignore information from far-away parts of XT which could not possibly have been relevant given X0, but rather that we want to further throw away information from X0 itself (while still maintaining conditional independence at a distance).
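
The "distance 2T" observation can be checked structurally. The sketch below rests on a simplification that is my assumption, not the post's setup: the graphical model is a 1D chain, and at each step every site is resampled in parallel from its two neighbours' previous values using a fresh noise draw. The function `noise_lightcone` is a hypothetical helper. Given X0, each site of XT is then a function of X0 plus the noise draws in its backward lightcone, so two sites are independent given X0 whenever those sets of draws are disjoint.

```python
def noise_lightcone(site, steps, n_sites):
    """Noise draws (step, site) that X_T[site] depends on, given X_0.

    Assumes a chain of n_sites variables where, at every step, each site is
    resampled from its left and right neighbours' previous values.
    """
    draws = set()
    frontier = {site}
    for step in range(steps, 0, -1):
        # Each site in the frontier used its own fresh noise at this step.
        draws |= {(step, k) for k in frontier}
        # Its value was resampled from its neighbours one step earlier.
        frontier = {
            m for k in frontier for m in (k - 1, k + 1) if 0 <= m < n_sites
        }
    return draws


T = 3
far_apart = noise_lightcone(5, T, 20).isdisjoint(noise_lightcone(11, T, 20))
too_close = not noise_lightcone(5, T, 20).isdisjoint(noise_lightcone(9, T, 20))
```

With T = 3, sites 5 and 11 (distance 2T) share no noise draws, while sites 5 and 9 (distance 2T − 2) share the draw at site 7 in the first step. A sequential resampler would need a different bookkeeping rule, so treat this as an illustration of the shape of the argument rather than of the theorem itself.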

I agree that there's a threshold for "can meaningfully build and chain novel abstractions" and this can lead to a positive feedback loop that was not previously present, but there will already be lots of positive feedback loops (such as "AI research -> better AI -> better assistance for human researchers -> AI research") and it's not clear why to expect the new feedback loop to be much more powerful than the existing ones.

(Aside: we're now talking about a discontinuity in the gradient of capabilities rather than of capabilities themselves, but sufficiently large discontinuities in the gradient of capabilities have much of the same implications.)

3Thane Ruthenis10mo
Yeah, the argument here would rely on the assumption that e. g. the extant scientific data already uniquely constrain some novel laws of physics/engineering paradigms/psychological manipulation techniques/etc., and we would be eventually able to figure them out even if science froze right this moment. In this case, the new feedback loop would be faster because superintelligent cognition would be faster than real-life experiments. And I think there's a decent amount of evidence for this. Consider that there are already narrow AIs that can solve protein folding more efficiently than our best manually-derived algorithms -- which suggests that better algorithms are already uniquely constrained by the extant data, and we've just been unable to find them. Same may be true for all other domains of science -- and thus, a superintelligence iterating on its own cognition would be able to outspeed human science.

Oh, I disagree with your core thesis that the general intelligence property is binary. (Which then translates into disagreements throughout the rest of the post.) But experience has taught me that this disagreement tends to be pretty intractable to talk through, and so I now try just to understand the position I don't agree with, so that I can notice if its predictions start coming true.

You mention universality, active adaptability and goal-directedness. I do think universality is binary, but I expect there are fairly continuous trends in some underlying l... (read more)

3Thane Ruthenis10mo
Interesting, thanks. Agreed that this point (universality leads to discontinuity) probably needs to be hashed out more. Roughly, my view is that universality allows the system to become self-sustaining. Prior to universality, it can't autonomously adapt to novel environments (including abstract environments, e. g. new fields of science). Its heuristics have to be refined by some external ground-truth signals, like trial-and-error experimentation or model-based policy gradients. But once the system can construct and work with self-made abstract objects, it can autonomously build chains of them — and that causes a shift in the architecture and internal dynamics, because now its primary method of cognition is iterating on self-derived abstraction chains, instead of using hard-coded heuristics/modules. 

Okay, this mostly makes sense now. (I still disagree but it no longer seems internally inconsistent.)

Fwiw, I feel like if I had your model, I'd be interested in:

  1. Producing tests for general intelligence. It really feels like there should be something to do here, that at least gives you significant Bayesian evidence. For example, filter the training data to remove anything talking about <some scientific field, e.g. complexity theory>, then see whether the resulting AI system can invent that field from scratch if you point it at the problems that motiva
... (read more)
1Thane Ruthenis10mo
I agree that those are useful pursuits. Mind gesturing at your disagreements? Not necessarily to argue them, just interested in the viewpoint.

Hm? "Stall at the human level" and "the discontinuity ends at or before the human level" reads like the same thing to me. What difference do you see between the two?

Discontinuity ending (without stalling):


Basically, except instead of directly giving it privileges/compute, I meant that we'd keep training it until the SGD gives the GI component more compute and privileges over the rest of the model (e. g., a better ability to rewrite its instincts).

Are you imagining systems that are built differently from today? Because I'm not seeing how SGD could ... (read more)

3Thane Ruthenis10mo
Ah, makes sense. I do expect that some sort of ability to reprogram itself at inference time will be ~necessary for AGI, yes. But I also had in mind something like your "SGD creates a set of weights that effectively treats the input English tokens as a programming language" example. In the unlikely case that modern transformers are AGI-complete, I'd expect something on that order of exoticism to be necessary (but it's not my baseline prediction).

"Doing science" is meant to be covered by "lack of empirical evidence that there's anything in the universe that humans can't model". Doing science implies the ability to learn/invent new abstractions, and we're yet to observe any limits to how far we can take it / what that trick allows us to understand.

Mmm. Consider a scheme like the following:

* Let T2 be the current date.
* Train an AI on all of humanity's knowledge up to a point in time T1, where T1<T2.
* Assemble a list D of all scientific discoveries made in the time period (T1;T2].
* See if the AI can replicate these discoveries.

At face value, if the AI can do that, it should be considered able to "do science" and therefore AGI, right?

I would dispute that. If the period (T1;T2] is short enough, then it's likely that most of the cognitive work needed to make the leap to any discovery in D is already present in the data up to T1. Making a discovery from that starting point doesn't necessarily require developing new abstractions/doing science -- it's possible that it may be done just by interpolating between a few already-known concepts. And here, some asymmetry between humans and e. g. SOTA LLMs becomes relevant:

* No human knows everything the entire humanity knows. Imagine if making some discovery in D by interpolation required combining two very "distant" concepts, like a physics insight and advanced biology knowledge. It's unlikely that there'd be a human with sufficient expertise in both, so a human will likely do it by actual-science (e. g., a biol

See Section 5 for more discussion of all of that.

Sorry, I seem to have missed the problems mentioned in that section on my first read.

There's no reason to expect that AGI would naturally "stall" at the exact same level of performance and restrictions.

I'm not claiming the AGI would stall at human level, I'm claiming that on your model, the discontinuity should have some decent likelihood of ending at or before human level.

(I care about this because I think it cuts against this point: We only have one shot. There will be a sharp discontinuity in capabilities... (read more)

3Thane Ruthenis10mo
Hm? "Stall at the human level" and "the discontinuity ends at or before the human level" reads like the same thing to me. What difference do you see between the two?

Basically, except instead of directly giving it privileges/compute, I meant that we'd keep training it until the SGD gives the GI component more compute and privileges over the rest of the model (e. g., a better ability to rewrite its instincts).

The strategy of slowly scaling our AI up is workable at the core, but IMO there are a lot of complications:

* A "mildly-superhuman" AGI, or even just a genius-human AGI, is still an omnicide risk (see also). I wouldn't want to experiment with that; I would want it safely at average-human-or-below level. It's likely hard to "catch" it at that level by inspecting its external behavior, though: can only be reliably done via advanced interpretability tools.
* Deceptiveness (and manipulation) is a significant factor, as you've mentioned. Even just a mildly-superhuman AGI will likely be very good at it. Maybe not implacably good, but it'd be like working bare-handed with an extremely dangerous chemical substance, with all of humanity at stake.
* The problem of "iterating" on this system. If we have just a "weak" AGI on our hands, it's mostly a pre-AGI system, with a "weak" general-intelligence component that doesn't control much. Any "naive" approaches, like blindly training interpretability probes on it or something, would likely ignore that weak GI component, and focus mainly on analysing or shaping heuristics/shards. To get the right kind of experience from it, we'd have to very precisely aim our experiments at the GI component -- which, again, likely requires advanced interpretability tools.

Basically, I think we need to catch the AGI-ness while it's at an "asymptomatic" stage, because the moment it becomes visible it's likely already incredibly dangerous (if not necessarily maximally dangerous).

More or less, plus the theoretical argument from the

What ties it all together is the belief that the general-intelligence property is binary.

Do any humans have the general-intelligence property?

If yes, after the "sharp discontinuity" occurs, why won't the AGI be like humans (in particular: generally not able to take over the world?)

If no, why do we believe the general-intelligence property exists?

3Thane Ruthenis10mo
Yes, ~all of them. Humans are not superintelligent because despite their minds embedding the algorithm for general intelligence, that algorithm is still resource-constrained (by the brain's compute) and privilege-constrained within the mind (e. g., it doesn't have full write-access to our instincts). There's no reason to expect that AGI would naturally "stall" at the exact same level of performance and restrictions. On the contrary: even if we resolve to check for "AGI-ness" often, with the intent of stopping the training the moment our AI becomes true AGI but still human-level or below it, we're likely to miss the right moment without advanced interpretability tools, and scale it past "human-level" straight to "impossible-to-ignore superintelligent". There would be no warning signs, because "weak" AGI (human-level or below) can't be clearly distinguished from a very capable pre-AGI system, based solely on externally-visible behaviour. See Section 5 for more discussion of all of that. Quoting from my discussion with cfoster0:

So here's a paper: Fundamental Limitations of Alignment in Large Language Models. With a title like that you've got to at least skim it. Unfortunately, the quick skim makes me pretty skeptical of the paper.

The abstract says "we prove that for any behavior that has a finite probability of being exhibited by the model, there exist prompts that can trigger the model into outputting this behavior, with probability that increases with the length of the prompt." This clearly can't be true in full generality, and I wish the abstract would give me some hint about ... (read more)

Note that B is (0.2,10,−1)-distinguishable in P.

I think this isn't right, because definition 3 requires that sup_s∗ {B_P− (s∗)} ≤ γ.

And for your counterexample, s* = "C" will have B_P-(s*) be 0 (because there's 0 probability of generating "C" in the future). So the sup is at least 0 > -1.

(Note that they've modified the paper, including definition 3, but this comment is written based on the old version.)

Interestingly, I apparently had a median around 2040 back in 2019, so my median is still later than it used to be prior to reading the bio anchors report.

Indeed I am confused why people think Goodharting is effectively-100%-likely to happen and also lead to all the humans dying. Seems incredibly extreme. All the examples people give of Goodharting do not lead to all the humans dying.

(Yes, I'm aware that the arguments are more sophisticated than that and "previous examples of Goodharting didn't lead to extinction" isn't a rebuttal to them, but that response does capture some of my attitude towards the more sophisticated arguments, something like "that's a wildly strong conclusion you've drawn from a pretty h... (read more)

I'm not claiming that you figure out whether the model's underlying motivations are bad. (Or, reading back what I wrote, I did say that but it's not what I meant, sorry about that.) I'm saying that when the model's underlying motivations are bad, it may take some bad action. If you notice and penalize that just because the action is bad, without ever figuring out whether the underlying motivation was bad or not, that still selects against models with bad motivations.

It's plausible that you then get a model with bad motivations that knows not to produce bad... (read more)

1David Xu1y
but, but, standard counterargument imperfect proxies Goodharting magnification of error adversarial amplification etc etc etc? (It feels weird that this is a point that seems to consistently come up in discussions of this type, considering how basic of a disagreement it really is, but it really does seem to me like lots of things come back to this over and over again?)

I think you're missing the primary theory of change for all of these techniques, which I would say is particularly compatible with your "follow-the-trying" approach.

While all of these are often analyzed from the perspective of "suppose you have a potentially-misaligned powerful AI; here's what would happen", I view that as an analysis tool, not the primary theory of change.

The theory of change that I most buy is that as you are training your model, while it is developing the "trying", you would like it to develop good "trying" and not bad "trying", and one... (read more)

4Steve Byrnes1y
Thanks, that helps! You’re working under a different development model than me, but that’s fine. It seems to me that the real key ingredient in this story is where you propose to update the model based on motivation and not just behavior—“penalize it instead of rewarding it” if the outputs are “due to instrumental / deceptive reasoning”. That’s great. Definitely what we want to do. I want to zoom in on that part. You write that “debate / RRM / ELK” are supposed to “allow you to notice” instrumental / deceptive reasoning. Of these three, I buy the ELK story—ELK is sorta an interpretability technique, so it seems plausible that ELK is relevant to noticing deceptive motivations (even if the ELK literature is not really talking about that too much at this stage, per Paul’s comment). But what about debate & RRM? I’m more confused about why you brought those up in this context. Traditionally, those techniques are focused on what the model is outputting, not what the model’s underlying motivations are. But I haven’t read all the literature. Am I missing something? (We can give the debaters / the reward model a printout of model activations alongside the model’s behavioral outputs. But I’m not sure what the next step of the story is, after that. How do the debaters / reward model learn to skillfully interpret the model activations to extract underlying motivations?)

Yes, that's right, though I'd say "probable" not "possible" (most things are "possible").

Depends what the aligned sovereign does! Also depends what you mean by a pivotal act!

In practice, during the period of time where biological humans are still doing a meaningful part of alignment work, I don't expect us to build an aligned sovereign, nor do I expect to build a single misaligned AI that takes over: I instead expect there to be a large number of AI systems, that could together obtain a decisive strategic advantage, but could not do so individually.

4David Johnston1y
So, if I'm understanding you correctly:

* if it's possible to build a single AI system that executes a catastrophic takeover (via self-bootstrap or whatever), it's also probably possible to build a single aligned sovereign, and so in this situation winning once is sufficient
* if it is not possible to build a single aligned sovereign, then it's probably also not possible to build a single system that executes a catastrophic takeover and so the proposition that the model only has to win once is not true in any straightforward way
* in this case, we might be able to think of "composite AI systems" that can catastrophically take over or end the acute risk period, and for similar reasons as in the first scenario, winning once with a composite system is sufficient, but such systems are not built from single acts

and you think the second scenario is more likely than the first.

I think that skews it somewhat but not very much. We only have to "win" once in the sense that we only need to build an aligned Sovereign that ends the acute risk period once, similarly to how we only have to "lose" once in the sense that we only need to build a misaligned superintelligence that kills everyone once.

(I like the discussion on similar points in the strategy-stealing assumption.)

4David Johnston1y
Is building an aligned sovereign to end the acute risk period different to a pivotal act in your view?