A brief overview of the contents, page by page.
1: most important century and hinge of history
2: wisdom needs to keep up with technological power or else self-destruction / the world is fragile / cuban missile crisis
3: unilateralist's curse
4: bio x-risk
5: malicious actors intentionally building power-seeking AIs / anti-human accelerationism is common in tech
6: persuasive AIs and eroded epistemics
7: value lock-in and entrenched totalitarianism
8: story about bioterrorism
9: practical malicious use suggestions
10: LAWs as an on-ramp to AI x-risk
11: automated c...
but I'm confident it isn't trying to do this
It is. It's an outer alignment benchmark for text-based agents (such as GPT-4), and it includes measurements for deception, resource acquisition, various forms of power, killing, and so on. Separately, it's to show reward maximization induces undesirable instrumental (Machiavellian) behavior in less toyish environments, and is about improving the tradeoff between ethical behavior and reward maximization. It doesn't get at things like deceptive alignment, as discussed in the x-risk sheet in the appendix. Apologies that the paper is so dense, but that's because it took over a year.
I asked for permission via Intercom to post this series on March 29th. Later, I asked for permission to use the [Draft] indicator and said it was written by others. I got permission for both of these, but the same person didn't give permission for both of these requests. Apologies this was not consolidated into one big ask with lots of context. (Feel free to get rid of any undue karma.)
It's a good observation that it's more efficient; does it trade off performance? (These sorts of comparisons would probably be demanded if it was submitted to any other truth-seeking ML venue, and I apologize for consistently being the person applying the pressures that generic academics provide. It would be nice if authors would provide these comparisons.)
...Also, taking affine combinations in weight-space is not novel to Schmidt et al either. If nothing else, the Stable Diffusion community has been doing that since October to add and subtract capabili
steering the model using directions in activation space is more valuable than doing the same with weights, because in the future the consequences of cognition might be far-removed from its weights (deep deceptiveness)
(You linked to "deep deceptiveness," and I'm going to assume is related to self-deception (discussed in the academic literature and in the AI and evolution paper). If it isn't, then this point is still relevant for alignment since self-deception is another internal hazard.)
I think one could argue that self-deception could in some instances be ...
Page 4 of this paper compares negative vectors with fine-tuning for reducing toxic text: https://arxiv.org/pdf/2212.04089.pdf#page=4
In Table 3, they show in some cases task vectors can improve fine-tuned models.
Insofar as you mean to imply that "negative vectors" are obviously comparable to our technique, I disagree. Those are not activation additions, and I would guess it's not particularly similar to our approach. These "task vectors" involve subtracting weight vectors, not activation vectors. See also footnote 39 (EDIT: and the related work appendix now talks about this directly).
Yes, I'll tend to write up comments quickly so that I don't feel as inclined to get in detailed back-and-forths and use up time, but here we are. When I wrote it, I thought there were only 2 things mentioned in the related works until Daniel pointed out the formatting choice, and when I skimmed the post I didn't easily see comparisons or discussion that I expected to see, hence I gestured at needing more detailed comparisons. After posting, I found a one-sentence comparison of the work I was looking for, so I edited to include that I found it, but it was oddly not emphasized. A more ideal comment would have been "It would be helpful to me if this work would more thoroughly compare to (apparently) very related works such as ..."
In many of my papers, there aren't fairly similar works (I strongly prefer to work in areas before they're popular), so there's a lower expectation for comparison depth, though breadth is always standard. In other works of mine, such as this paper on learning the the right thing in the presence of extremely bad supervision/extremely bad training objectives, we contrast with the two main related works for two paragraphs, and compare to these two methods for around half of the entire paper.
The extent of an adequate comparison depends on the relatedness. I'm ...
Yes, I was--good catch. Earlier and now, unusual formatting/and a nonstandard related works is causing confusion. Even so, the work after the break is much older. The comparison to works such as https://arxiv.org/abs/2212.04089 is not in the related works and gets a sentence in a footnote: "That work took vectors between weights before and after finetuning on a new task, and then added or subtracted task-specific weight-diff vectors."
Is this big difference? I really don't know; it'd be helpful if they'd contrast more. Is this work very novel and useful, an...
On the object-level, deriving task vectors in weight-space from deltas in fine-tuned checkpoints is really different from what was done here, because it requires doing a lot of backward passes on a lot of data. Deriving task vectors in activation-space, as done in this new work, requires only a single forward pass on a truly tiny amount of data. So the data-efficiency and compute-efficiency of the steering power gained with this new method is orders of magnitude better, in my view.
Also, taking affine combinations in weight-space is not novel to Schmidt et ...
Background for people who understandably don't habitually read full empirical papers:
Related Works sections in empirical papers tend to include many comparisons in a coherent place. This helps contextualize the work and helps busy readers quickly identify if this work is meaningfully novel relative to the literature. Related works must therefore also give a good account of the literature. This helps us more easily understand how much of an advance this is. I've seen a good number of papers steering with latent arithmetic in the past year, but I would be su...
Could these sorts of posts have more thorough related works sections? It's usually standard for related works in empirical papers to mention 10+ works. Update: I was looking for a discussion of https://arxiv.org/abs/2212.04089, assumed it wasn't included in this post, and many minutes later finally found a brief sentence about it in a footnote.
I don't understand this comment. I did a quick count of related works that are mentioned in the "Related Works" section (and the footnotes of that section) and got around 10 works, so seems like this is meeting your pretty arbitrarily established bar, and there are also lots of footnotes and references to related work sprinkled all over the post, which seems like the better place to discuss related work anyways.
I am not familiar enough with the literature to know whether this post is omitting any crucial pieces of related work, but the relevant section of ...
"AI Safety" which often in practice means "self driving cars"
This may have been true four years ago, but ML researchers at leading labs rarely directly work on self-driving cars (e.g., research on sensor fusion). AV is has not been hot in quite a while. Fortunately now that AGI-like chatbots are popular, we're moving out of the realm of talking about making very narrow systems safer. The association with AV was not that bad since it was about getting many nines of reliability/extreme reliability, which was a useful subgoal. Unfortunately the world has not ...
When ML models get more competent, ML capabilities researchers will have strong incentives to build superhuman models. Finding superhuman training techniques would be the main thing they'd work on. Consequently, when the problem is more tractable, I don't see why it'd be neglected by the capabilities community--it'd be unreasonable for profit maximizers not to have it as a top priority when it becomes tractable. I don't see why alignment researchers have to work in this area with high externalities now and ignore other safe alignment research areas (in pra...
I am strongly in favor of our very best content going on arXiv. Both communities should engage more with each other.
As follows are suggestions for posting to arXiv. As a rule of thumb, if the content of a blogpost didn't take >300 hours of labor to create, then it probably should not go on arXiv. Maintaining a basic quality bar prevents arXiv from being overriden by people who like writing up many of their inchoate thoughts; publication standards are different for LW/AF than for arXiv. Even if a researcher spent many hours on the project, arXiv moderato...
Strongly agree. Three examples of work I've put on Arxiv which originated from the forum, which might be helpful as a touchstone. The first was cited 7 times the first year, and 50 more times since. The latter two were posted last year, and have not been indexed by Google as having been cited yet.
As an example of a technical but fairly conceptual paper, there is the Categorizing Goodhart's law paper. I pushed for this to be a paper rather than just a post, and I think that the resulting exposure was very worthwhile. Scott wrote the original pos...
Here's a continual stream of related arXiv papers available through reddit and twitter.
https://www.reddit.com/r/mlsafety/
https://twitter.com/topofmlsafety
I should say formatting is likely a large contributing factor for this outcome. Tom Dietterich, an arXiv moderator, apparently had a positive impression of the content of your grokking analysis. However, research on arXiv will be more likely to go live if it conforms to standard (ICLR, NeurIPS, ICML) formatting and isn't a blogpost automatically exported into a TeX file.
I agree that formatting is the most likely issue. The content of Neel's grokking work is clearly suitable for arXiv (just very solid ML work). And the style of presentation of the blog post is already fairly similar to a standard paper (e.g. is has an Introduction section, lists contributions in bullet points, ...).
So yeah, I agree that formatting/layout probably will do the trick (including stuff like academic citation style).
Others can post their own papers, but I'll post some papers I was on and group them into one of four safety topics: Enduring hazards (“Robustness”), identifying hazards (“Monitoring”), steering ML systems (“Alignment”), and forecasting the future of ML ("Foresight").
The main ML conferences are ICLR, ICML, NeurIPS. The main CV conferences are CVPR, ICCV, and ECCV. The main NLP conferences are ACL and EMNLP.
Alignment (Value Learning):
Aligning AI With Shared Human Values (ICLR)
Robustness (Adversaries):
This seems like a fun exercise, so I spent half an hour jotting down possibilities. I'm more interested in putting potential considerations on peoples' radars and helping with brainstorming than I am in precision. None of these points are to be taken too seriously since this is fairly extemporaneous and mostly for fun.
2022
Multiple Codex alternatives are available. The financial viability of training large models is obvious.
Research models start interfacing with auxiliary tools such as browsers, Mathematica, and terminals.
2023
Large pretrai...
Strong-upvoted because this was exactly the sort of thing I was hoping to inspire with this post! Also because I found many of your suggestions helpful.
I think model size (and therefore model ability) probably won't be scaled up as fast as you predict, but maybe. I think getting models to understand video will be easier than you say it is. I also think that in the short term all this AI stuff will probably create more programming jobs than it destroys. Again, I'm not confident in any of this.
no AI safety relevant publications in 2019 or 2020, and only one is a coauthor on what I would consider a highly relevant paper.
Context: I'm an OpenPhil fellow who is doing work on robustness, machine ethics, and forecasting.
I published several papers on the research called for in Concrete Problems in AI Safety and OpenPhil's/Steinhardt's AI Alignment Research Overview. The work helped build a trustworthy ML community and aimed at reducing accident risks given very short AI timelines. Save for the first paper I helped with (when I was trying to learn the r...
Hey Daniel, thanks very much for the comment. In my database I have you down as class of 2020, hence out of scope for that analysis, which was class of 2018 only. I didn't include 2019 or 2020 classes because I figured it takes time to find your footing, do research, write it up etc., so absence of evidence would not be very strong evidence of absence. So please don't consider this as any reflection on you. Ironically I actually did review one of your papers in the above - this one - which I did indeed think was pretty relevant! (Cntrl-F 'Hendrycks' to find the paragraph in the article). Sorry if this was not clear from the text.
Plug: CAIS is funding constrained.