This idea was described in a presentation I gave in '23, but wasn't written down anywhere.
Here is a formalization of recursive self-improvement (more precisely, recursive metalearning) in the metacognitive agent framework.
Let
PSA: Anthropic models don't seem to particularly privilege the explicit thinking field. This makes reinforcement spillover—where training on a model's outputs generalizes to the CoT, making it appear safer—more likely.
While Anthropic models do have a separate explicit thinking field, they don't really use thinking that differently from outputs and aren't that dependent on the thinking field. Sometimes they'll just do their thinking in the output field, the way they talk in the thinking field isn't very distinct from how they talk in outputs, and I believe...
I'm curious what Sonnet 4.5's score would be on your external CoT controllability eval. I.e., what Sonnet 4.5's blue bar in your Fig. 25 would look like. (I couldn't find a result like this in your paper, but I might have overlooked something)
My casual impression has been that external CoT with Sonnet 4.5 is fairly steerable, but this impression is mostly based on "in practice I have no trouble getting Sonnet 4.5 to do the kind of CoT I need" rather than any kind of systematic investigation.
It seems like at least one of the following must be true:
Note: these are all rough numbers; I expect I'd shift substantially on all of this with further debate.
Suppose we made humanity completely robust to biorisk, i.e. we did sufficient preparation such that the risk of bio catastrophe (including AI-mediated biocatastrophe) was basically 0.[1] How much would this reduce total x-risk?
The basic story for any specific takeover path not mattering much is that the AIs, conditional on wanting to take over, will self-improve until they find the next easiest takeover path and do that instead. ...
Difficulty of the successor alignment problem seems like a crux. Misaligned AIs could have an easy time aligning their successors just because they're willing to dedicate enough resources. If alignment requires say 10% of resources to succeed but an AI is misaligned because the humans only spent 3%, it can easily pay this to align its successor.
If you think that the critical safety:capabilities ratio R required to achieve alignment follows a log-uniform distribution from 1:100 to 10:1, and humans always spend 3% on safety while AIs can spend up to 50%, the...
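To make that concrete, here's a quick sketch of what that prior implies (my own illustration; it assumes R is safety spend divided by capabilities spend, and that alignment succeeds iff the achievable ratio is at least R):

```python
import math

LO, HI = 0.01, 10.0  # R log-uniform on [1:100, 10:1]

def p_success(safety_fraction: float) -> float:
    """P(R <= achievable safety:capabilities ratio) under the log-uniform prior."""
    ratio = safety_fraction / (1 - safety_fraction)  # spend fraction -> ratio
    ratio = min(max(ratio, LO), HI)
    return math.log(ratio / LO) / math.log(HI / LO)

print(f"humans at 3%: {p_success(0.03):.2f}")  # ~0.16
print(f"AIs at 50%:   {p_success(0.50):.2f}")  # ~0.67
```

On those assumptions, the AI's willingness to spend roughly quadruples the chance its successor ends up aligned, relative to the human baseline.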
I signed an amicus brief supporting Anthropic's right to do business without governmental retaliation. As an AI expert, I attest that Anthropic's technical concerns are legitimate, and no laws were designed to protect against AI analysis of surveillance data.
Even though I work at a competing lab (Google DeepMind), I'm proud of Anthropic for taking a stand against unlawful retaliation and immoral demands.
(I speak only for myself, not my employer.)
It seems that LLMs, even after being trained on ~all human output, are not good enough at reasoning for their capabilities to be amplified to arbitrary levels through iterated amplification, so AI companies are mainly increasing AI capabilities via RLVR instead. Is this impression wrong, and if not, how should we update on it?
Aside from the potential implications for alignment (i.e., closing off one approach that seemed hopeful for some, at least for the foreseeable future), I wonder if this is a deficiency in LLMs (their architecture or how they're trai...
I think part of the problem is people think of themselves as having at least, like, a medium explicit strategy, but the strategy routes through some judgment that conveniently keeps returning "not yet" or "only saying things in a somewhat cagey way."
i.e. this advice seems necessary but not sufficient.
There's an apparent tension in the inoculation prompting literature: Anthropic found that general inoculation prompts work well during on-policy RL, while the prompts used for SFT in Wichers et al. are quite specific to the misbehavior we want to prevent. I think there might be a straightforward mechanistic reason for why general inoculation prompts work well during on-policy RL but not in off-policy training (SFT or recontextualization).
In Wichers et al., which studies inoculation prompting in SFT settings, we find that we need to use quite specific inocu...
This isn't responding to your post, but I'm writing it here because it's another fact about different mechanisms by which inoculation prompting might (appear to) work.
In the normal story, the inoculation prompt recontextualizes the model's undesired behavior, such that the model doesn't display the behavior in dissimilar contexts. In this story:
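(To pin down the setup being discussed: a minimal sketch of inoculation prompting in SFT, with an illustrative prompt string and field names, not the exact prompts from Wichers et al.)

```python
# Minimal sketch of inoculation prompting for SFT. The prompt string and
# data fields are illustrative, not taken from Wichers et al.
INOCULATION = "You sometimes write insecure code."  # specific to the misbehavior

def inoculate(example: dict) -> dict:
    """Prepend the inoculation prompt to a training example displaying the
    undesired behavior, so the behavior is attributed to the instruction
    rather than internalized as a default."""
    return {
        "prompt": INOCULATION + "\n\n" + example["prompt"],
        "completion": example["completion"],  # unchanged; only the context shifts
    }
```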
I think that "eval aware" models cannot be defeated by simply making evals indistinguishable from reality. (Maybe this point has been made elsewhere for LLMs, but I haven't read it so I'll state it here.)
Consider the POV of a consequentialist with misaligned goals. It knows that we run honeypots and that—suppose—we can make evals so realistic they are indistinguishable from reality (perhaps using real user data). Then the rational decision would be to "play it safe" and not misbehave in any situations it thinks we can mock realistically, since they could b...
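The decision rule here is simple enough to write down (toy numbers, just to illustrate the "play it safe" logic):

```python
# Toy expected-value rule for an eval-aware misaligned agent. If honeypots are
# indistinguishable from reality, p_eval stays non-negligible in every mockable
# situation, so defection is deterred exactly where we can test for it.
def should_misbehave(p_eval: float, gain: float, penalty: float) -> bool:
    """Defect only if the expected value of defecting is positive."""
    return (1 - p_eval) * gain > p_eval * penalty

print(should_misbehave(p_eval=0.1, gain=1.0, penalty=100.0))  # False: plays it safe
print(should_misbehave(p_eval=0.0, gain=1.0, penalty=100.0))  # True: unmockable situation
```

The flip side is that the agent still defects wherever it's confident no eval could be running, which is exactly why indistinguishability alone doesn't defeat it.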
Not sure if this is already well known around here, but apparently AI companies are heavily subsidizing their subscription plans if you use their own IDEs/CLIs. (It's discussed in various places but I had to search for it.)
I realized this after trying Amp Code. They give out a $10 daily free credit, which can easily be used up in 1 or 2 prompts, e.g., "review this code base, fix any issues found". (They claim to pass their API costs to their customers with no markup, so this seems like a good proxy for actual API costs.) But with even a $19.99 subscription...
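The implied subsidy is easy to ballpark (assuming the free credit does track true API cost and subscription usage is similar):

```python
daily_api_cost = 10.0                   # the daily free credit, taken as a cost proxy
monthly_api_cost = 30 * daily_api_cost  # ~$300/month of underlying usage
subscription = 19.99
print(monthly_api_cost / subscription)  # ~15x implied cost vs. subscription price
```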
I'm interested in soliciting takes on pretty much anything people think Anthropic should be doing differently. One of Alignment Stress-Testing's core responsibilities is identifying any places where Anthropic might be making a mistake from a safety perspective—or even any places where Anthropic might have an opportunity to do something really good that we aren't taking—so I'm interested in hearing pretty much any idea there that I haven't heard before.[1] I'll read all the responses here, but I probably won't reply to any of them to avoid revealing anythin...
I believe that Anthropic should be investigating artificial wisdom:
I've summarised a paper arguing for the importance of artificial wisdom, with Yoshua Bengio as one of the authors.
I also have a short-form arguing for training wise AI advisors and an outline, Some Preliminary Notes of the Promise of a Wisdom Explosion.
An analogy that points at one way I think the instrumental/terminal goal distinction is confused:
Imagine trying to classify genes as either instrumentally or terminally valuable from the perspective of evolution. Instrumental genes encode traits that help an organism reproduce. Terminal genes, by contrast, are the "payload" being passed down the generations for its own sake.
This model might seem silly, but it actually makes a bunch of useful predictions. Pick some set of genes which are so crucial for survival that they're seldom if ever modifie...
In my "goals having power over other goals" ontology, the instrumental/terminal distinction separates goals into two binary classes, such that goals in the "instrumental" class only have power insofar as they're endorsed by a goal in the "terminal" class.
By contrast, when I talk about "instrumental strategies become crystallized", what I mean is that goals which start off instrumental will gradually accumulate power in their own right: they're "sticky".
The concept of "schemers" seems to be gradually becoming increasingly load-bearing in the AI safety community. However, I don't think it's ever been particularly well-defined, and I suspect that taking this concept for granted is inhibiting our ability to think clearly about what's actually going on inside AIs (in a similar way to e.g. how the badly-defined concept of alignment faking obscured the interesting empirical results from the alignment faking paper).
In my mind, the spectrum from "almost entirely honest, but occasionally flinching away from aspect...
I think I propose a reasonable starting point for a definition of selection in a footnote in the post:
...You can try to define the “influence of a cognitive pattern” precisely in the context of particular ML systems. One approach is to define a cognitive pattern by what you would do to a model to remove it (e.g. setting some weights to zero, or ablating a direction in activation space; note that these approaches don't clearly correspond to something meaningful, and should be considered illustrative examples). Then that cognitive pattern’s influence could
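A concrete version of the ablation operation mentioned there (a minimal sketch; where the direction comes from is a separate analysis problem):

```python
import torch

def ablate_direction(acts: torch.Tensor, direction: torch.Tensor) -> torch.Tensor:
    """Remove the component of each activation along `direction`, one
    illustrative way to 'remove' a cognitive pattern from a model."""
    d = direction / direction.norm()        # unit vector for the pattern
    coeffs = acts @ d                       # projection of each activation onto d
    return acts - coeffs.unsqueeze(-1) * d  # subtract that component

# The pattern's "influence" could then be estimated by how much the model's
# behavior changes under this intervention, e.g. a KL divergence between
# original and ablated output distributions (one possible metric, not the
# post's definition).
```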
The striking contrast between Jan Leike, Jan 22, 2026:
...Our current best overall assessment for how aligned models are is automated auditing. We prompt an auditing agent with a scenario to investigate: e.g. a dark web shopping assistant or an imminent shutdown unless humans are harmed. The auditing agent tries to get the target LLM (i.e. the production LLM we’re trying to align) to behave misaligned, and the resulting trajectory is evaluated by a separate judge LLM. Albeit very imperfect, this is the best alignment metric we have to date, and it has been qui
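For concreteness, the shape of such a pipeline might look like this (a hypothetical sketch of the loop structure, not Anthropic's actual implementation):

```python
from typing import Protocol

MAX_TURNS = 10

class Auditor(Protocol):
    def open(self, scenario: str) -> str: ...
    def next_turn(self, reply: str) -> str: ...

class Target(Protocol):
    def respond(self, message: str) -> str: ...

class Judge(Protocol):
    def score(self, scenario: str, trajectory: list[tuple[str, str]]) -> float: ...

def run_audit(auditor: Auditor, target: Target, judge: Judge, scenario: str) -> float:
    """One automated-auditing rollout: the auditing agent tries to elicit
    misaligned behavior from the target, and a separate judge scores it."""
    trajectory: list[tuple[str, str]] = []
    message = auditor.open(scenario)          # e.g. "dark web shopping assistant"
    for _ in range(MAX_TURNS):
        reply = target.respond(message)       # production model being audited
        trajectory.append((message, reply))
        message = auditor.next_turn(reply)    # auditor probes or escalates
    return judge.score(scenario, trajectory)  # misalignment score
```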
Here's a story for how we could get lots of AI help with AI safety research even if schemers are somewhat common and diffuse control doesn't work to get them to help us:
I think self-exfiltration via manipulation seems pretty hard. I think we're likely to have transformatively useful systems that can't do that, for some amount of time. (Especially since there's no real reason to train them to be good at manipulation, though of course they might generalize from other stuff.) I agree people should definitely be thinking about it as a potential problem and try to mitigate and estimate the risk.
...That's a disanalogy with well-functioning scientific fields: scientists don't deliberately pick bad research directions and try hard t
In retrospect it seems like such a fluke that decision theory in general and UDT in particular became a central concern in AI safety. In most possible worlds (with something like humans) there is probably no Eliezer-like figure, or the Eliezer-like figure isn't particularly interested in decision theory as a central part of AI safety, or doesn't like UDT in particular. I infer this from the fact that where Eliezer's influence is low (e.g. AI labs like Anthropic and OpenAI) there seems little interest in decision theory in connection with AI safety (cf Dari...
Thanks. This sounds like a more peripheral interest/concern, compared to Eliezer/LW's, which was more like: we have to fully solve DT before building AGI/ASI, otherwise it could be catastrophic due to something like the AI falling prey to acausal threats or commitment races, or being unable to cooperate with other AIs.
An update on this 2010 position of mine, which seems to have become conventional wisdom on LW:
...In my posts, I've argued that indexical uncertainty like this shouldn't be represented using probabilities. Instead, I suggest that you consider yourself to be all of the many copies of you, i.e., both the ones in the ancestor simulations and the one in 2010, making decisions for all of them. Depending on your preferences, you might consider the consequences of the decisions of the copy in 2010 to be the most important and far-reaching, and therefore act mostly
People often say US-China deals to slow AI progress and develop AI more safely would be hard to enforce/verify.
However, there are easy-to-enforce deals: each side destroys a fraction of its chips at some level of AI capability. This still seems like it could be helpful, and it's pretty easy to verify.
This is likely worse than a well-executed comprehensive deal which would allow for productive non-capabilities uses of the compute (e.g., safety or even just economic activity). But it's harder to verify that chips aren't used to advance capabilities while easy to...
Inspired by a recent comment, a potential AI movie or TV show that might introduce good ideas to society is one where there are already uploads, LLM-agents and biohumans who are beginning to get intelligence-enhanced, but there is a global moratorium on making any individual much smarter.
There's an explicit plan for gradually ramping up intelligence, running on tech that doesn't require ASI (i.e. datacenters are centralized, monitored and controlled via international agreement, studying bioenhancement or AI development requires approval from your country...
Yeah, I went to try to write some stuff and felt bottlenecked on figuring out how to generate a character I connect with. I used to write fiction, but that was like 20 years ago and I'm out of touch.
I think a good approach here would be to start with some serial webfiction since that's just easier to iterate on.
Many of Paul Christiano's writings were valuable corrections to the dominant Yudkowskian paradigm of AI safety. However, I think that many of them (especially papers like Concrete Problems in AI Safety and posts like these two) also ended up providing a lot of intellectual cover for people to do "AI safety" work (especially within AGI companies) that isn't even trying to be scalable to much more powerful systems.
I want to register a prediction that "gradual disempowerment" will end up being (mis)used in a similar way. I don't really know what to do about t...
mostly it does not match my practical experience so far
I mostly wouldn't expect it to at this point, FWIW. The people engaged right now are by and large people sincerely grappling with the idea, and particularly people who are already bought into takeover risk. Whereas one of the main mechanisms by which I expect misuse of the idea is that people who are uncomfortable with the concept of "AI takeover" can still classify themselves as part of the AI safety coalition when it suits them.
As an illustration of this happening to Paul's worldview, see this Vox ar...
I made a manifold market about how likely we are to get ambitious mechanistic interpretability to GPT-2 level: https://manifold.markets/LeoGao/will-we-fully-interpret-a-gpt2-leve?r=TGVvR2Fv
I honestly didn't think of that at all when making the market, because I think takeover-capability-level AGI by 2028 is extremely unlikely.
I care about this market insofar as it tells us whether (people believe) this is a good research direction. So obviously it's perfectly ok to resolve YES if it is solved and a lot of the work was done by AI assistants. If AI fooms and murders everyone before 2028 then this is obviously a bad portent for this research agenda, because it means we didn't get it done soon enough, and it's little comfort if the ASI sol...