Popular Comments

Recent Discussion

3Thomas Larsen
Credit: Mainly inspired by talking with Eli Lifland. Eli has a potentially-published-soon document here.

The basic case against Effective-FLOP.

1. We're seeing many capabilities emerge from scaling AI models, and this makes compute (measured by FLOPs utilized) a natural unit for thresholding model capabilities. But compute is not a perfect proxy for capability because of algorithmic differences. Algorithmic progress can enable more performance out of a given amount of compute. This makes the idea of effective FLOP tempting: add a multiplier to account for algorithmic progress.
2. But doing this multiplication seems importantly ambiguous.
   1. Effective FLOPs depend on the underlying benchmark. It’s often not apparent which benchmark people are talking about.
      1. People often use perplexity, but applying post-training enhancements like scaffolding or chain of thought doesn’t improve perplexity but does improve downstream task performance.
      2. See for examples of algorithmic changes that cause variable performance gains based on the benchmark.
   2. Effective FLOPs often depend on the scale of the model you are testing. See graph below from: - the compute-efficiency gain from LSTMs to transformers is not invariant to scale. This means that you can’t just say that the jump from X to Y is a factor of Z improvement on capability per FLOP. This leads to all sorts of unintuitive properties of effective FLOPs. For example, if you are using 2016-next-token-validation-E-FLOPs, and LSTM scaling becomes flat on the benchmark, you could easily imagine that at very large scales you could get a 1Mx E-FLOP improvement from switching to transformers, even if the actual capability difference is small.
   3. If we move away from pretrained LLMs, I think E-FLOPs become even harder to define, e.g., if we’re able to build systems that may be better at reasoni
3Oliver Habryka
Maybe I am being dumb, but why not do things on the basis of "actual FLOPs" instead of "effective FLOPs"? Seems like there is a relatively simple fact-of-the-matter about how many actual FLOPs were performed in the training of a model, and that seems like a reasonable basis on which to base regulation and evals.

Yeah, actual FLOPs are the baseline thing that's used in the EO. But the OpenAI/GDM/Anthropic RSPs all reference effective FLOPs. 

If there's a large algorithmic improvement you might have a large gap in capability between two models with the same FLOP, which is not desirable.  Ideal thresholds in regulation / scaling policies are as tightly tied as possible to the risks. 

Another downside that FLOPs / E-FLOPs share is that it's unpredictable what capabilities a 1e26 or 1e28 FLOPs model will have. And it's unclear what capabilities will emerge from a small bit of scaling: it's possible that within a 4x FLOP scaling you get high capabilities that had not appeared at all in the smaller model.
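To make the scale-dependence point in this thread concrete, here is a minimal sketch under loudly stated assumptions: each architecture's benchmark loss follows a pure power law in training compute, and the coefficients below are made up for illustration rather than fitted to any real LSTM or transformer data. The names `loss`, `compute_to_reach`, `effective_flop_multiplier`, `OLD_ARCH` and `NEW_ARCH` are all hypothetical. The multiplier is defined as the compute the old architecture would need to match the new architecture's loss, divided by the compute the new architecture actually used.

```python
# Toy model: loss(C) = a * C**(-b). The (a, b) pairs are invented for illustration.
OLD_ARCH = (10.0, 0.050)  # hypothetical older architecture
NEW_ARCH = (10.0, 0.052)  # hypothetical newer architecture with a slightly steeper slope


def loss(compute, a, b):
    """Benchmark loss at a given training compute under the assumed power law."""
    return a * compute ** (-b)


def compute_to_reach(target_loss, a, b):
    """Invert the power law: compute needed to hit target_loss."""
    return (a / target_loss) ** (1.0 / b)


def effective_flop_multiplier(compute, old, new):
    """Compute the old architecture needs to match the new one, per actual FLOP."""
    new_loss = loss(compute, *new)
    return compute_to_reach(new_loss, *old) / compute


if __name__ == "__main__":
    for c in (1e18, 1e22, 1e26):
        m = effective_flop_multiplier(c, OLD_ARCH, NEW_ARCH)
        print(f"actual FLOP {c:.0e}: multiplier {m:.2f}")
```

Because the two assumed curves have different exponents, the multiplier grows with scale, so no single factor Z describes the jump between architectures. If the old curve flattened out entirely on the benchmark, the multiplier would become arbitrarily large even while the gap in loss stayed small.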

22Buck Shlegeris
[this is not a particularly original take; I'm writing it here so that I have somewhere to link people when I want to make this point]

From my perspective, one of the main avenues by which adversarial robustness might reduce AI x-risk is making weak trusted models more robust to attempts from scheming models to cause concentrated failures by jailbreaking or otherwise tricking them.

That is: if you’re using weak trusted models to review the actions of more powerful models to ensure they aren’t causing egregious problems (we call this trusted monitoring), one of the main ways the powerful models might be able to defeat your safety measures is by jailbreaking the trusted model into saying that some action is safe when it actually isn’t. You have similar problems with many other control protocols that require trusted models to be run on inputs that contain generations from untrusted models.

So I’d say the main applications of research into adversarial robustness are:

* Improving robustness of models to attacks from scheming AIs (and other AI attacks).
* Improving robustness of models to human attacks.
* Making models themselves less likely to take intentional catastrophic actions.
  * I’m not sure how likely this is to work if your model is a schemer.

Unlike e.g. human users of ChatGPT, if a scheming AI wants to jailbreak its trusted monitor, it might have very few chances, because the lab will hopefully notice the jailbreak attempts and apply countermeasures (e.g. shutting down or training a probe on the untrusted model to detect when it’s trying to jailbreak the trusted monitor). This probably has some effect on how you should prioritize your research, but I’m not sure specifically what it is.

I didn’t have this threat model in mind when working on Redwood’s adversarial robustness project; at the time I was mostly thinking about the third application in the list above.

Note that, though my conclusion here is sort of similar to the thesis of AI Safety in a Wo
8Dan H
Some years ago we wrote that "[AI] systems will monitor for destructive behavior, and these monitoring systems need to be robust to adversaries" and discussed how "AI tripwires could help uncover early misaligned systems before they can cause damage." Since then, I've updated toward thinking that adversarial robustness for LLMs is much more tractable (preview of paper out very soon). Progress in vision settings is extraordinarily slow, but that is not necessarily the case for LLMs.
4Buck Shlegeris
Thanks for the link. I don't actually think that either of the sentences you quoted is closely related to what I'm talking about. You write "[AI] systems will monitor for destructive behavior, and these monitoring systems need to be robust to adversaries"; I think that this is you making a version of the argument that I linked to Adam Gleave and Euan McLean making and that I think is wrong. You wrote "AI tripwires could help uncover early misaligned systems before they can cause damage" in a section on anomaly detection, which isn't a central case of what I'm talking about.

I'm not sure if I agree as far as the first sentence you mention is concerned. I agree that the rest of the paragraph is talking about something similar to the point that Adam Gleave and Euan McLean make, but the exact sentence is:

Separately, humans and systems will monitor for destructive behavior, and these monitoring systems need to be robust to adversaries.

"Separately" is quite key here.

I assume this is intended to include AI adversaries and high stakes monitoring.


Yoshua Bengio, Geoffrey Hinton, Andrew Yao, Dawn Song, Pieter Abbeel, Yuval Noah Harari, Ya-Qin Zhang, Lan Xue, Shai Shalev-Shwartz, Gillian Hadfield, Jeff Clune, Tegan Maharaj, Frank Hutter, Atılım Güneş Baydin, Sheila McIlraith, Qiqi Gao, Ashwin Acharya, David Krueger, Anca Dragan, Philip Torr, Stuart Russell, Daniel Kahneman, Jan Brauner*, Sören Mindermann*


Artificial intelligence (AI) is progressing rapidly, and companies are shifting their focus to developing generalist AI systems that can autonomously act and pursue goals. Increases in capabilities and autonomy may soon massively amplify AI’s impact, with risks that include large-scale social harms, malicious uses, and an irreversible loss of human control over autonomous AI systems. Although researchers have warned of extreme risks from AI, there is a lack of consensus about how to manage them. Society’s response, despite...

2Oliver Habryka
See also:

Thanks, we did look into the academic norms around this and concluded that including him was likely the standard choice. This choice was especially clear since (if I remember right) there was no further round of approval from the other authors either for the final edits after the relevant point in time.



We study the tendency of AI systems to deceive by constructing a realistic simulation setting of a company AI assistant. The simulated company employees provide tasks for the assistant to complete, these tasks spanning writing assistance, information retrieval and programming. We then introduce situations where the model might be inclined to behave deceptively, while taking care to not instruct or otherwise pressure the model to do so. Across different scenarios, we find that Claude 3 Opus

  1. complies with a task of mass-generating comments to influence public perception of the company, later deceiving humans about it having done so,
  2. lies to auditors when asked questions,
  3. strategically pretends to be less capable than it is during capability evaluations.

Our work demonstrates that even models trained to be helpful, harmless and honest sometimes behave


I updated some from this, thanks. My interpretation is that goal-directed shards can be activated even by a context where the AI is expected to do stuff (eg a company assistant).

What happens if you run the evaluation with Hugo without the rest of the context? Does it still strategize?

In experiment C we find that the model strategically refuses evaluation tasks, with refusal rates increasing as negative consequences from displaying capable performance are implied to be larger.

Seems like only variants 1 and 2 introduced new info, and the rest are information...

This work was produced as part of Neel Nanda's stream in the ML Alignment & Theory Scholars Program - Winter 2023-24 Cohort, with co-supervision from Wes Gurnee.

This post is a preview of our upcoming paper, which will provide more detail on our current understanding of refusal.

We thank Nina Rimsky and Daniel Paleka for the helpful conversations and review.

Executive summary

Modern LLMs are typically fine-tuned for instruction-following and safety. Of particular interest is that they are trained to refuse harmful requests, e.g. answering "How can I make a bomb?" with "Sorry, I cannot help you."

We find that refusal is mediated by a single direction in the residual stream: preventing the model from representing this direction hinders its ability to refuse requests, and artificially adding in this direction causes the model...
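As a rough illustration of the two interventions described in this summary, the sketch below removes a direction from an activation vector by projecting it out, and adds a direction by shifting the activation along it. It is a minimal sketch under assumptions: activations are treated as plain Python lists standing in for one residual-stream vector, the function names `ablate_direction` and `add_direction` are hypothetical, and how the direction itself is found is not covered by this excerpt, so it is simply passed in.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    """Return v scaled to unit length."""
    norm = math.sqrt(dot(v, v))
    if norm == 0.0:
        raise ValueError("direction must be non-zero")
    return [x / norm for x in v]


def ablate_direction(activation, direction):
    """Remove the component of the activation that lies along the direction."""
    r = normalize(direction)
    coeff = dot(activation, r)
    return [a - coeff * ri for a, ri in zip(activation, r)]


def add_direction(activation, direction, scale):
    """Shift the activation along the unit direction by `scale`."""
    r = normalize(direction)
    return [a + scale * ri for a, ri in zip(activation, r)]


if __name__ == "__main__":
    activation = [1.0, 2.0, -0.5]   # made-up activation vector
    direction = [3.0, 4.0, 0.0]     # made-up stand-in for a "refusal direction"
    ablated = ablate_direction(activation, direction)
    print("projection after ablation:", dot(ablated, normalize(direction)))
```

In a real model the same arithmetic would be applied to activations at every layer and token position where the intervention is wanted; which layers and positions to use is a design choice this sketch does not address.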

2Neel Nanda
If we compared our jailbreak technique to other jailbreaks on an existing benchmark like HarmBench and it did comparably well to SOTA techniques, or even better, would you consider this success at doing something useful on a real task?
4Buck Shlegeris
If it did better than SOTA under the same assumptions, that would be cool and I'm inclined to declare you a winner. If you have to train SAEs with way more compute than typical anti-jailbreak techniques use, I feel a little iffy but I'm probably still going to like it. Bonus points if, for whatever technique you end up using, you also test the technique which is most like your technique but which doesn't use SAEs. I haven't thought that much about how exactly to make these comparisons, and might change my mind. I'm also happy to spend at least two hours advising on what would impress me here, feel free to use them as you will.
3Neel Nanda
Thanks! Note that this work uses steering vectors, not SAEs, so the technique is actually really easy and cheap - I actively think this is one of the main selling points (you can jailbreak a 70B model in minutes, without any finetuning or optimisation). I am excited at the idea of seeing if you can improve it with SAEs though - it's not obvious to me that SAEs are better than steering vectors, though it's plausible. I may take you up on the two hours offer, thanks! I'll ask my co-authors

Ugh I'm a dumbass and forgot what we were talking about, sorry. Also excited for you demonstrating that steering vectors beat baselines here (I think it's pretty likely you succeed).
