The hard/useful parts of alignment research are largely about understanding agency/intelligence/etc. That sort of understanding naturally yields capabilities-relevant insights. So, alignment researchers naturally run into decisions about how private to keep their work.
This post is a bunch of models which I use to think about that decision.
I am not very confident that my thinking on the matter is very good; in general I do not much trust my own judgment on security matters. I’d be more-than-usually interested to hear others’ thoughts/critiques.
The “Nobody Cares” Model
By default, nobody cares. Memetic reproduction rate is less than 1. Median number of citations (not counting self-citations) is approximately zero, and most citations are from people who didn’t actually read the whole paper but just noticed that it’s vaguely related to their own work. Median number of people who will actually go to any effort whatsoever to use your thing is zero. Getting other people to notice your work at all takes significant effort and is hard even when the work is pretty good. “Nobody cares” is a very strong default, the very large majority of the time.
Privacy, under this model, is very easy. You need to make a very large active effort in order for your research to not end up de-facto limited to you and maybe a few friends/coworkers.
This is the de-facto mechanism by which most theoretical work on alignment avoids advancing capabilities most of the time, and I think it is the main mechanism by which most theoretical work on alignment should aim to avoid advancing capabilities most of the time. It should be the default. But obviously there will be exceptions; when does the “nobody cares” model fail?
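To make the "memetic reproduction rate is less than 1" framing concrete, here is a minimal sketch. It assumes (this is not in the post) a simple branching model in which each reader passes the work on to a fixed average number of new readers; the function name and the numbers are purely illustrative.

```python
def expected_total_readers(initial_readers: float, reproduction_rate: float) -> float:
    """Expected total readers summed over all 'generations' of sharing.

    Assumption (hypothetical model): each reader recruits `reproduction_rate`
    new readers on average, so generation k has
    initial_readers * reproduction_rate**k readers. For a rate below 1 this
    geometric series converges to initial_readers / (1 - reproduction_rate).
    """
    if not 0 <= reproduction_rate < 1:
        raise ValueError("the series only converges for 0 <= reproduction_rate < 1")
    return initial_readers / (1.0 - reproduction_rate)


# Five friends/coworkers read it, and each reader recruits half a reader on average.
total = expected_total_readers(5, 0.5)  # 10.0
```

Under this toy model, total reach stays within a small multiple of the initial audience whenever the rate is well below 1, which is one way to read "privacy is very easy by default".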
Theory-Practice Gap and Flashy Demos
Why, as a general rule, does nobody care? In particular, why does nobody working on AI capabilities care about most work on alignment theory most of the time, given that a lot of it is capabilities-relevant?
Well, even ignoring the (large) chunk of theoretical research which turns out to be useless, the theory-practice gap is a thing. Most theoretical ideas don’t really do much when you translate them to practice. This includes most ideas which sound good to intelligent people. Even those theoretical ideas which do turn out to be useful are typically quite hard to translate into practice. It takes months of work (at least), often additional significant insights, and often additional enabling pieces which aren’t already mainstream or even extant. Practitioners correctly expect this, and therefore mostly don’t pay attention to most ideas until after there’s evidence that they work in practice. (This is especially true of the sort of people who work on ML systems.)
In ML/AI, smart-sounding ideas which don’t really work easily are especially abundant, so ML practitioners are (correctly) even more than usually likely to ignore theoretical work.
The flip side of this model is that people will pay lots of attention once there is clear evidence that some idea works in practice - i.e. evidence that the idea has crossed the theory-practice gap. What does that look like? Flashy demos. Flashy demos are the main signal that the theory-practice gap has already been crossed, which people correctly take to mean that the thing can be useful now.
The theory-practice gap is therefore a defense which both (a) slows down someone actively trying to apply an idea, and (b) makes most ideas very-low-memetic-fitness until they have a flashy demo. To a large extent, one can write freely in public without any flashy demos, and it won’t spread very far memetically (or will spread very slowly if it does).
Aside from flashy demos, the other main factor I know of which can draw people’s attention is reputation. If someone has a track record of interesting work, high status, or previous flashy demos, then people are more likely to pay attention to their theoretical ideas even before the theory-practice gap is crossed.
Of course this is not relevant to the large majority of people the large majority of the time, especially insofar as it involves reputation outside of the alignment research community. That said, if you’re relying on lack-of-reputation for privacy, then you need to avoid gaining too broad a following in the future, which may be an important constraint - more on that in the next section.
Takeaways & Gotchas
Main takeaway of the “nobody cares” model: if you’re not already a person of broad interest outside alignment, and you don’t make any flashy demos, then probably approximately nobody working on ML systems outside of alignment will pay any attention to your work.
… but there are some gotchas.
First, there’s a commitment/time-consistency problem: to the extent that we rely on this model of privacy, we need to precommit to remain uninteresting in the future, at least until we’re confident that our earlier work won’t dangerously accelerate capabilities. If you’re hoping to gain lots of status outside the alignment research community, that won’t play well with a “nobody cares” privacy model. If you’re hoping to show future flashy demos, that won’t play well with a “nobody cares” privacy model. If your future work is very visibly interesting, you may be stuck keeping it secret.
(Though note that, in the vast majority of cases, it will turn out that your earlier theory work was never particularly important for capabilities in the first place, and hopefully you figure that out later. So relying on “nobody caring” now will reduce your later options mainly in worlds where your current work turns out to be unusually important/interesting in its own right.)
Second, relying on “nobody caring” obviously does not yield much defense-in-depth. It’s probably not something we want to rely on for stuff that immediately or directly advances capabilities by a lot.
But for most theoretical alignment work most of the time, where there are some capabilities implications but they’re not very direct or immediately dangerous on their own, I think “nobody cares” is the right privacy model under which to operate. Mostly, theoretical researchers should just not worry much about privacy, as long as (1) they don’t publish flashy demos, (2) they don’t have much name recognition outside alignment, and (3) the things they’re working on won’t immediately or directly advance capabilities by a lot.
Beyond “Nobody Cares”: Danger, Secrecy and Adversaries
Broadly speaking, I see two main categories of reasons for theoretical researchers to go beyond the “nobody cares” model and start to actually think about privacy:
- Research which might directly or immediately advance capabilities significantly
- Current or anticipated future work which is unusually likely to draw a lot of attention, especially outside the alignment field
These are different failure modes of the “nobody cares” model, and they call for different responses.
The “Keep It To Yourself” Model for Immediately Capabilities-Relevant Research
Under the “nobody cares” model, a small number of people might occasionally pay attention to your research and try to use it, but your research is not memetically fit enough to spread much. For research which might directly or immediately advance capabilities significantly, even a handful of people trying it out is potentially problematic. That handful might realize there’s a big capability gain and run off to produce a flashy demo.
For research which is directly or immediately capabilities-relevant, we want zero people to publicly try it. The “nobody cares” model is not good enough to robustly achieve that. In these cases, my general policy would be to not publish the research, and possibly not share it with anyone else at all (depending on just how immediately and directly capabilities-relevant it looks).
On the other hand, we don’t necessarily need to be super paranoid about it. In this model, we’re still mostly worried about the research contributing marginally to capabilities; we don’t expect it to immediately produce a full-blown strong AGI. We want to avoid the work spreading publicly, but it’s still not that big a problem if e.g. some government surveillance sees my Google Docs. Spy agencies, after all, would presumably not publicly share my secrets after stealing them.
The “Active Adversary” Model
… which brings us to the really paranoid end of the spectrum. Under this model, we want to be secure even against active adversaries trying to gain access to our research - e.g. government spy agencies.
I’m not going to give advice about how to achieve this level of security, because I don’t think I’m very good at this kind of paranoia. The main question I’ll focus on is: when do we need highly paranoid levels of security, and when can we get away with less?
As with the other models, someone has to pay attention in order for security to be necessary at all. Even if a government spy agency had a world-class ML research lab (which I doubt is currently the case), they’d presumably ignore most research for the same reasons other ML researchers do. Also, spying is presumably expensive; random theorists/scientists are presumably not worth the cost of having a human examine their work. The sorts of things which I’d expect to draw attention are the same as earlier:
- enough of a track record that someone might actually go to the trouble of spying on our work
- public demonstration of impressive capabilities, or use of impressive capabilities in a way which will likely be noticed (e.g. stock trading)
Even if we are worried about attention from spies, that still doesn’t mean that most of our work needs high levels of paranoia. The sort of groups who are likely to steal information not meant to be public are not themselves very likely to make that information public. (Well, assuming our dry technical research doesn’t draw the attention of the dreaded Investigative Journalists.) So unless we’re worried that our research will accelerate capabilities to such a dramatic extent that it would enable some government agency to develop dangerous AGI themselves, we probably don’t need to worry about the spies.
The case where we need extreme paranoia is where both (1) an adversary is plausibly likely to pay attention, and (2) our research might allow for immediate and direct and very large capability gains, without any significant theory-practice gap.
This degree of secrecy should hopefully not be needed very often.
Many people may have the same idea, and it only takes one of them to share it. If all their estimates of the riskiness of the idea have some noise in them, and their risk tolerance has some noise, then presumably it will be the person with unusually low risk estimate and unusually high risk tolerance who determines whether the idea is shared.
In general, this sort of thing creates a bias toward unilateral actions being taken even when most people want them to not be taken.
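The bias described above can be illustrated with a small simulation. This is a minimal sketch under assumptions that are not in the post: each person's risk estimate and risk tolerance are independent Gaussian draws, a person shares whenever their estimate falls below their tolerance, and all function names and numbers are hypothetical.

```python
import random


def p_anyone_shares(p_individual: float, n_people: int) -> float:
    """Chance that at least one of n independent people shares the idea."""
    return 1.0 - (1.0 - p_individual) ** n_people


def simulate_sharing(true_risk: float, n_people: int, estimate_noise: float,
                     mean_tolerance: float, tolerance_noise: float,
                     trials: int = 10_000, seed: int = 0) -> float:
    """Monte Carlo estimate of how often the idea gets shared by anyone.

    Assumption: a person shares when their noisy risk estimate is lower
    than their noisy risk tolerance.
    """
    rng = random.Random(seed)
    shared = 0
    for _ in range(trials):
        for _ in range(n_people):
            estimate = rng.gauss(true_risk, estimate_noise)
            tolerance = rng.gauss(mean_tolerance, tolerance_noise)
            if estimate < tolerance:
                shared += 1
                break  # one sharer is enough
    return shared / trials


# The typical person thinks the idea is too risky (1.0 vs. a tolerance of 0.5),
# yet a group of 20 shares far more often than a single person does.
alone = simulate_sharing(1.0, 1, 0.3, 0.5, 0.2)
group = simulate_sharing(1.0, 20, 0.3, 0.5, 0.2)
```

The closed-form `p_anyone_shares` makes the same point without simulation: even a 5% individual chance of sharing compounds to a majority chance once 20 people independently hold the idea.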
On the other hand, the unilateralist's curse is only relevant to an idea which many people have. And if many people have the idea already, then it's probably not something which can realistically stay secret for very long anyway.
In general, if an idea has been talked-about in public at some previous point, then it’s probably fine to talk about again. Your marginal impact on memetic fitness is unlikely to be very large, and if the idea hasn’t already taken off then that’s strong evidence that it isn’t too memetically fit. (Though this does not apply if you are a person with a very large following.)
Alignment Researchers as the Threat
Just because someone’s part of the ingroup does not mean that they won’t push the run button. We don’t have a way to distinguish safe from dangerous programs; our ingroup is not meaningfully more able to do so than the outgroup, and few people in the ingroup are very careful about running python scripts on a day-to-day basis. (I’m certainly not!)
Point is: don’t just assume that it’s fine to share ideas with everyone in the ingroup.
On the other hand, if all we want is for an idea to not spread publicly, then in-group trust is less risky, because group members would burn their reputation by sharing private things.
Differential Alignment/Capabilities Advantage
In the large majority of cases, research is obviously much more relevant to one or the other, and desired privacy levels should be chosen based on that.
I don’t think it’s very productive, in practice, to play the “but it could be relevant to [alignment/capabilities] via [XYZ]” game for things which seem obviously more relevant to capabilities/alignment.
Most Secrecy Is Hopefully Temporary
Most ideas will not dramatically impact capabilities. Usually, we should expect secrecy to be temporary, long enough to check whether a potentially-capabilities-relevant idea is actually short-term relevant (i.e. test it on some limited use-case).
Part of the reason I’m posting this is because I have not seen discussion of the topic which feels adequate. I don’t think my own thoughts are clearly correct. So, please argue about it!
I think my threat model is a bit different. I don’t particularly care about the zillions of mediocre ML practitioners who follow things that are hot and/or immediately useful. I do care about the pioneers, who are way ahead of the curve, working to develop the next big idea in AI long before it arrives. These people are not only very insightful themselves, but also can recognize an important insight when they see it, and they’re out hunting for those insights, and they’re not looking in the same places as most people, and in particular they’re not looking at whatever is trending on Twitter or immediately useful.
Let’s try this analogy, maybe: “most impressive AI” ↔ “fastest man-made object”. Let’s say that the current record-holder for fastest man-made object is a train. And right now a competitor is building a better train, that uses new train-track technology. It’s all very exciting, and lots of people are following it in the newspapers. Meanwhile, a pioneer has the idea of building the first-ever rocket ship, but the pioneer is stuck because they need better heat-resistant tiles in order for the rocket-ship prototype to actually work. This pioneer is probably not going to be following the fastest-train news; instead, they’re going to be poring over the obscure literature on heat-resistant tiles. (Sorry for lack of historical or engineering accuracy in the above.) This isn’t a perfect analogy for many reasons, ignore it if you like.
So my ideal model is (1) figure out the whole R&D path(s) to building AGI, (2) don’t tell anyone (or even write it down!), (3) now you know exactly what not to publish, i.e. everything on that path, and it doesn’t matter whether those things would be immediately useful or not, because the pioneers who are already setting out down that path will seek out and find what you’re publishing, even if it’s obscure, because they already have a pretty good idea of what they’re looking for. Of course, that’s easier said than done, especially step (1) :-P
Thinking out loud here...
I do basically buy the "ignore the legions of mediocre ML practitioners, pay attention to the pioneers" model. That does match my modal expectations for how AGI gets built. But:
Thinking about it, these factors are not enough to make me confident that someone won't use my work to produce an unaligned AGI. On (1), thinking about my personal work, there's just very little technical work at all on abstraction, so someone who knows to look for technical work on abstraction could very plausibly encounter mine. And that is indeed the sort of thing I'd expect an AGI pioneer to be looking for. On (2), they'd be a lot more likely to encounter my work if they're already paying attention to alignment, and encountering my work would probably make them more likely to pay attention to alignment even if they weren't before, but neither of those really rules out unaligned researchers. On (3), I do expect a pioneer to be able to recognize the theory they need without flashy demos to prove it if they spend a few days' attention on it.
... ok, so I'm basically convinced that I should be thinking about this particular scenario, and the "nobody cares" defense is weaker against the hypothetical pioneers than against most people. I think the "fundamental difference" is that the hypothetical pioneers know what to look for; they're not just relying on memetic fitness to bring the key ideas to them.
... well fuck, now I need to go propagate this update.
An annoying thing is, just as I sometimes read Yann LeCun or Steven Pinker or Jeff Hawkins, and I extract some bits of insight from them while ignoring all the stupid things they say about the alignment problem, by the same token I imagine other people might read my posts, and extract some bits of insight from me while ignoring all the wise things I say about the alignment problem. :-P
That said, I do definitely put some nonzero weight on those kinds of considerations. :)
More thinking out loud...
My usual take here is the “Nobody Cares” model, though I think there is one scenario that I tend to be worried about a bit here that you didn't address, which is how to think about whether or not you want things ending up in the training data for a future AI system. That's a scenario where the “Nobody Cares” model really doesn't apply, since the AI actually does have time to look at everything you write.
That being said, I think that, most of the time, alignment work ending up in training data is good, since it can help our AI systems be differentially better at AI alignment research (e.g. relative to how good they are at AI capabilities research), which is something that I think is pretty important. However, it can also help AI systems do things like better understand how to be deceptive, so this sort of thing can be a bit tricky.
Worrying about which alignment writing ends up in the training data feels like a very small lever for affecting alignment; my general heuristic is that we should try to focus on much bigger levers.
Is that because you think it would be hard to get the relevant researchers to exclude any given class of texts from their training datasets [EDIT: or prevent web crawlers from downloading the texts etc.]? Or even if that part was easy, you would still feel that that lever is very small?
Even if that part was easy, it still seems like a very small lever. A system capable of taking over the world will be able to generate those ideas for itself, and a system with strong motivations to take over the world won't have them changed by small amounts of training text.
Maybe the question here is whether including certain texts in relevant training datasets can cause [language models that pose an x-risk] to be created X months sooner than otherwise.
The relevant texts I'm thinking about here are:
That consideration seems relevant only for language models that will be doing/supporting alignment work.
[For the record, here's previous relevant discussion]
My problem with the "nobody cares" model is that it seems self-defeating. First, if nobody cares about my work, then how would my work help with alignment? I don't put a lot of stock into building aligned AGI in the basement on my own. (And not only because I don't have a basement.) Therefore, any impact I will have flows through my work becoming sufficiently known that somebody who builds AGI ends up using it. Even if I optimistically assume that I will personally be part of that project, my work needs to be sufficiently well-known to attract money and talent to make such a project possible.
Second, I also don't put a lot of stock into solving alignment all by myself. Therefore, other people need to build on my work. In theory, this only requires it to be well-known in the alignment community. But, to improve our chances of solving the problem we need to make the alignment community bigger. We want to attract more talent, much of which is found in the broader computer science community. This is in direct opposition to preserving the conditions for "nobody cares".
Third, a lot of people are motivated by fame and status (myself included). Therefore, bringing talent into alignment requires the fame and status to be achievable inside the field. This is obviously also in contradiction with "nobody cares".
My own thinking about this is: yes, progress in the problems I'm working on can contribute to capability research, but the overall chance of success on the pathway "capability advances driven by theoretical insights" is higher than on the pathway "capability advances driven by trial and error", even if the first leads to AGI sooner, especially if these theoretical insights are also useful for alignment. I certainly don't want to encourage the use of my work to advance capability, and I try to discourage anyone who would listen, but I accept the inevitable risk of that happening in exchange for the benefits.
Then again, I'm by no means confident that I'm thinking about all of this in the right way.
Our work doesn't necessarily need wide memetic spread to be found by the people who know what to look for. E.g. people playing through the alignment game tree are a lot more likely to realize that ontology identification, grain-of-truth, value drift, etc, are key questions to ask, whereas ML researchers just pushing toward AGI are a lot less likely to ask those questions.
I do agree that a growing alignment community will add memetic fitness to alignment work in general, which is at least somewhat problematic for the "nobody cares" model. And I do expect there to be at least some steps which need a fairly large alignment community doing "normal" (i.e. paradigmatic) incremental research. For instance, on some paths we need lots of people doing incremental interpretability/ontology research to link up lots of concepts to their representations in a trained system. On the other hand, not all of the foundations need to be very widespread - e.g. in the case of incremental interpretability/ontology research, it's mostly the interpretability tools which need memetic reach, not e.g. theory around grain-of-truth or value drift.
That's a valid argument, but I can also imagine groups that (i) in a world where alignment research is obscure, proceed to create unaligned AGI, but (ii) in a world where alignment research is famous, use this research when building their AGI. Maybe any such group would be operationally inadequate anyway, but I'm not sure. More generally, it's possible that in a world where alignment research is a well-known, respectable field of study, more people take AI risk seriously.
I think I have a somewhat different model of the alignment knowledge tree. From my perspective, the research I'm doing is already paradigmatic. I have a solid-enough paradigm, inside which there are many open problems, and what we need is a bunch of people chipping away at these open problems. Admittedly, the size of this "bunch" is still closer to 10 people than to 1000 people, but (i) it's possible that the open problems will keep multiplying hydra-style, as often happens in math, and (ii) memetic fitness would help get the very best 10 people to do it.
It's also likely that there will be a "phase II" where the nature of the necessary research becomes very different (e.g. it might involve combining the new theory with neuroscience, or experimental ML research, or hardware engineering), and successful transition to this phase might require getting a lot of new people on board which would also be a lot easier given memetic fitness.
Thank you for writing this post; I had been struggling with these considerations a while back. I investigated going full paranoid mode but in the end mostly decided against it.
I agree that theoretical insights on agency and intelligence have a real chance of leading to capability gains. I agree that the government spy threat model is unlikely. I would like to add, however, that if, say, MIRI builds a safe AGI prototype - perhaps based on different principles than the systems used by adversaries - it might make sense for an (AI-assisted) adversary to trawl through your old blog posts.
Byrnes has already mentioned the distinction between pioneers and median researchers. Another aspect that your threat models don't capture is research that builds on your research. Your research may end up in a very long chain of theoretical research, only a minority of which you have contributed. Or the spirit, if not the letter, of your ideas may percolate through the research community. Additionally, the alignment field will almost certainly become very much larger, raising the status of both John and the alignment field in general. Over longer timescales I expect percolation to be quite strong.
Even if approximately nobody reads or knows of your work, the insights may very well become massively signal-boosted by other alignment researchers (once again, I expect the community to explode in size within a decade) and thereby end up in a flashy demo.
All in all, these and other considerations led me to the conclusion that this danger is very real. That is, there is a significant minority of possible worlds in which early alignment researchers tragically contribute to DOOM.
However, I still think on the whole most alignment researchers should work in the open. Any solution to alignment will most likely come from a group (albeit a small one) of people. Working privately massively hampers collaboration. It makes the community look weird and makes it way harder to recruit good people. Also, for most researchers it is difficult to support themselves financially if they can't show their work. As by far the most likely doom scenario is some company/government simply building AGI without sufficient safeguards, because either there is no alignment solution or they are simply unaware of it or ignore it, I conclude that the best policy in expected value is to work mostly publicly*.
*Ofc if there is a clear path to capability gain keeping it secret might be the best.
EDIT: Cochran has a comical suggestion
Suppose hypothetically I had a way to make neural networks recognize out-of-distribution (OOD) inputs. (Like I get back 40% dog, 10% cat, 20% OOD, 5% teapot...) Should I run a big, flashy ImageNet demo (so I personally know whether the idea scales up) and then tell no one?
There was a line of reasoning that went: any research that has a better alignment/capabilities ratio than the average of all research currently happening is good. A lot of research is pure capabilities, like hardware research, so almost anything with any alignment in it is good. I'm not quite sure if this is a good rule of thumb.
I think I basically don't buy the "just increase the alignment/capabilities ratio" model, at least on its own. It just isn't a sufficient condition to not die.
It does feel like there's a better version of that model waiting to be found, but I'm not sure what it is.
Model 1. Your new paper produces $c$ units of capabilities and $a$ units of alignment. When $C$ units of capabilities are reached, an AI is produced, and it is aligned iff $A$ units of alignment have been produced. The rest of the world produces, and will continue to produce, alignment and capabilities research in ratio $R$ (units of alignment per unit of capabilities). You are highly uncertain about $A$ and/or $C$, but have a good guess at $a$, $c$, $R$.
In this model, if $A/C \gg R$ we are screwed whatever you do; if $A/C \ll R$ we win whatever you do. Your paper makes a difference in those worlds where $A/C \approx R$, and in those worlds it helps iff $a/c > R$.
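Model 1 can be sketched numerically. This is a minimal illustration, assuming that R means units of alignment per unit of capabilities and that the rest of the world supplies whatever capabilities your paper does not; the function names and the numbers are hypothetical.

```python
def is_aligned(A: float, C: float, R: float, a: float = 0.0, c: float = 0.0) -> bool:
    """Whether at least A units of alignment exist once capabilities reach C.

    Assumptions: your paper contributes (a, c); the rest of the world fills
    the remaining C - c units of capabilities and, alongside them,
    produces R * (C - c) units of alignment.
    """
    return a + R * (C - c) >= A


def paper_margin(a: float, c: float, R: float) -> float:
    """Net alignment gained by publishing: a + R*(C - c) - R*C = a - R*c.

    Positive exactly when a/c > R (for c > 0).
    """
    return a - R * c


# A marginal world: A/C = 1.0 and R = 0.99, so the default outcome just misses.
without_paper = is_aligned(A=10, C=10, R=0.99)
with_good_paper = is_aligned(A=10, C=10, R=0.99, a=1.0, c=0.5)  # a/c = 2 > R
```

In worlds far from the margin the paper's contribution is swamped by `R * C - A`, so only the sign of `paper_margin` matters, and only when the world is close to the threshold.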
This model treats alignment and capabilities as continuous, fungible quantities that slowly accumulate. This is a dubious assumption. It also assumes that, conditional on us being in the marginal world (the world where good and bad outcomes are both very close), your mainline probability involves research continuing at its current ratio.
If, for example, you were extremely pessimistic and thought that the only way we have any chance is if a portal to Dath ilan opens up, then the goal would largely be to hold off all research for as long as possible, to maximize the time in which a deus ex machina can happen. Other goals might include publishing the sort of research most likely to encourage a massive global "take AI seriously" movement.
So, the main takeaway is that we need some notion of fungibility/additivity of research progress (for both alignment and capabilities) in order for the "ratio" model to make sense.
Some places fungibility/additivity could come from:
Fungibility is necessary, but not sufficient, for the rule "if your work has a better ratio than average research, publish". You also need your uncertainty to be in the right place.
If you were certain of $R$, and uncertain what $A/C$ future research might have, you get a different rule: publish if $a/c > R$.