Whether a PhD is something someone will enjoy is so dependent on individual personality, advisor fit, etc. that I don't feel I can offer good generalized advice. Generally I'd suggest people trying to gauge fit try doing some research in an academic environment (e.g. undergrad/MS thesis, or a brief RA stint after graduating) and talk to PhD students in their target schools. If after that you think you wouldn't enjoy a PhD then you're probably right!

Personally I enjoyed my PhD. I had smart & interesting colleagues and an advisor who wanted me to do high-quality research (not just publish); I had almost-complete control over how I spent my time and could explore areas I found interesting & important in depth. The compensation is low but comes with excellent job security, and I had some savings, so I lived comfortably. Unless I take a sabbatical I will probably never again have the time to go as deep into a research area, so in a lot of ways I really cherish my PhD time.

I think a lot of the negatives of PhDs are really negatives of becoming a research lead in general. Trying to create something new with limited feedback loops is tough, and can be psychologically draining if you tie your self-worth to your work output (don't do this! but easier said than done for the kind of person attracted to these careers). Research taste will take many years of your life to develop -- as will most complex skills. And so on.

I'm sympathetic to a lot of this critique. I agree that prospective students should strive to find an advisor that is "good at producing clear, honest and high-quality research while acting in high-integrity ways around their colleagues". There are enough of these you should be able to find one, and it doesn't seem worth compromising.

Concretely, I'd definitely recommend digging into an advisor's research and asking their students hard questions prior to taking any particular PhD offer. There absolutely are labs that prioritize publishing above all else, turn a blind eye to academic fraud or at least brush accidental non-replicability under the rug, or just have a toxic culture. You want to avoid those at all costs.

But I disagree with the punchline that if this bar isn't satisfied then "almost any other job will be better preparation for a research career". In particular, I think there's a ton of concrete skills a PhD teaches that don't need a stellar advisor. For example, there's some remarkably simple things like having an experimental baseline, running multiple seeds and reporting confidence intervals that a PhD will absolutely drill into you. These things are remarkably often missing from research produced by those I see in the AI safety ecosystem who have not done a PhD or been closely mentored by an experienced researcher.
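To make those three habits concrete, here is a minimal sketch of what "baseline, multiple seeds, confidence intervals" can look like in practice. Everything in it is an illustrative assumption: `run_experiment` is a hypothetical stand-in for a real training run, the scores are synthetic, and the interval uses a simple normal approximation (with only a handful of seeds, a t-distribution or bootstrap interval would be the more careful choice).

```python
import random
import statistics


def run_experiment(seed: int, use_method: bool) -> float:
    """Hypothetical stand-in for one training run; returns a synthetic score."""
    rng = random.Random(seed)
    base = 0.70 + rng.gauss(0, 0.02)  # seed-dependent noise around a made-up level
    return base + (0.03 if use_method else 0.0)  # made-up effect size


def mean_and_ci(scores: list[float], z: float = 1.96) -> tuple[float, float]:
    """Return the mean and the half-width of a normal-approximation 95% CI."""
    mean = statistics.mean(scores)
    sem = statistics.stdev(scores) / len(scores) ** 0.5  # standard error of the mean
    return mean, z * sem


seeds = range(10)
# Always run the baseline under the same seeds as the proposed method.
baseline = [run_experiment(s, use_method=False) for s in seeds]
method = [run_experiment(s, use_method=True) for s in seeds]

for name, scores in [("baseline", baseline), ("method", method)]:
    mean, half_width = mean_and_ci(scores)
    print(f"{name}: {mean:.3f} ± {half_width:.3f} (n={len(scores)})")
```

The point is less the particular statistics than the shape of the report: a comparison condition, more than one seed, and an uncertainty estimate next to every headline number.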

Additionally, I've seen plenty of people do PhDs under an advisor who lacks one or more of these properties and most of them turned out to be fine researchers. Hard to say what the counterfactual is, the admission process to the PhD might be doing a lot of work here, but I think it's important to recognize the advisor is only one of many sources of mentorship and support you get in a PhD: you also have taught classes, your lab mates, your extended cohort, senior post-docs, peer review, etc. To be clear, none of these mentorship sources are perfect, but part of your job as a student is to decide who to listen to & when. If someone can't do that then they'll probably not get very far as a researcher no matter what environment they're in.

Thanks for the post Ryan -- I agree that given the difficulty in making models actually meaningfully robust the best solution to misuse in the near-term is going to be via a defence in depth approach consisting of filtering the pre-training data, input filtering, output filtering, automatic and manual monitoring, KYC checks, etc.

At some point though we'll need to grapple with what to do about models that are superhuman in some domains related to WMD development, cybercrime or other potential misuses. There's glimmers of this already here, e.g. my impression is that AlphaFold is better than human experts at protein folding. It does not seem far-fetched that automatic drug discovery AI systems in the near future might be better than human experts at finding toxic substances (Urbina et al, 2022 give a proof of concept). In this setting, a handful of queries that slip through a model's defences might be dangerous: "how to build a really bad bioweapon" might be something the system could make significant headway on zero-shot. Additionally, if the model is superhuman, then it starts becoming attractive for nation-state or other well-resourced adversaries to seek to attack it (whereas at human-level, they can just hire their own human experts). The combination of lower attack tolerance and increased sophistication of attacks makes me somewhat gloomy this regime will hold up indefinitely.

Now I'm still excited to see the things you propose be implemented in the near-term: they're some easy wins, and lay foundations for a more rigorous regime later (e.g. KYC checks seem generally really helpful in mitigating misuse). But I do suspect that in the long-run we'll need a more principled solution to security, or simply refrain from training such dangerous models.


When I started working on this project, a number of people came to me and told me (with varying degrees of tact) that I was wasting my time on a fool's errand. Around half the people told me they thought it was extremely unlikely I'd find such a vulnerability. Around the other half told me such vulnerabilities obviously existed, and there was no point demonstrating it. Both sets of people were usually very confident in their views. In retrospect I wish I'd done a survey (even an informal one) before conducting this research to get a better sense of people's views.

Personally I'm in the camp that thought it highly likely that vulnerabilities like these existed, given the failures we've seen in other ML systems and the lack of any worst-case guarantees. But I was very unsure going in how easy they'd be to find. Go is a pretty limited domain, and it's not enough to beat the neural network: you've got to beat Monte-Carlo Tree Search as well (and MCTS does have worst-case guarantees, albeit only in the limit of infinite search). Additionally, there are results showing that scale improves robustness (e.g. more pre-training data reduces vulnerability to adversarial examples in image classifiers).

In fact, although the method we used is fairly simple, actually getting everything to work was non-trivial. There was one point after we'd patched the first (rather degenerate) pass-attack that the team was doubting whether our method would be able to beat the now stronger KataGo victim. We were considering cancelling the training run, but decided to leave it going given we had some idle GPUs in the cluster. A few days later there was a phase shift in the win rate of the adversary: it had stumbled across some strategy that worked and finally was learning.

This is a long-winded way of saying that I did change my mind as a result of these experiments (towards robustness improving less than I'd previously thought with scale). I'm unsure how much effect it will have on the broader ML research community. The paper is getting a fair amount of attention, and is a nice pithy example of a failure mode. But as you suggest, the issue may be less a difference in concrete belief (surely any ML researcher would acknowledge adversarial examples are a major problem and one that is unlikely to be solved any time soon), than that of culture (to what degree is a security mindset appropriate?).

This post was written as a summary of the results of the paper, intended for a fairly broad audience, so we didn't delve much into the theory of change behind this agenda here. You might find this blog post describing the broader research agenda this paper fits into provides some helpful context, and I'd be interested to hear your feedback on that agenda.

Thanks for flagging this disagreement Ryan. I enjoyed our earlier conversation (on LessWrong and in-person) and updated in favor of the sample efficiency framing, although we (clearly) still have some significant differences in perspective here. Would love to catch up again sometime and see if we can converge more on this. I'll try and summarize my current take and our key disagreements for the benefit of other readers.

I think I mostly agree with you that in the special case of vanilla RLHF this problem is equivalent to a sample efficiency problem. Specifically, I'm referring to the case where we perform RL on a learned reward model; that reward model is trained based on human feedback from an earlier version of the RL policy; and this process iterates. In this case, if the RL algorithm learns to exploit the reward model (which it will, in contemporary systems, without some regularization like a KL penalty) then the reward model will receive corrective feedback from the human. At worst, this process will just not converge, and the policy will just bounce from one adversarial example to another -- useless, but probably not that dangerous. In practice, it'll probably work fine given enough human data and after tuning parameters.
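For readers unfamiliar with the KL penalty mentioned above, one common way to write it is to subtract a scaled log-probability ratio between the current policy and a frozen reference policy from the reward model's score. The sketch below assumes that form; the function name, the `beta` value and the toy log-probabilities are all illustrative, not taken from any particular RLHF codebase.

```python
def kl_penalized_reward(
    rm_score: float,
    logp_policy: list[float],
    logp_reference: list[float],
    beta: float = 0.1,
) -> float:
    """Reward-model score minus beta times a single-sample KL estimate.

    logp_policy / logp_reference are per-token log-probabilities of the same
    sampled response under the current policy and the frozen reference model.
    """
    # Sum of log-ratios over tokens: a one-sample estimate of KL(policy || reference).
    kl_estimate = sum(lp - lr for lp, lr in zip(logp_policy, logp_reference))
    return rm_score - beta * kl_estimate


# Toy numbers: the policy has drifted to assign higher probability to this
# response than the reference does, so the penalty reduces its reward.
reward = kl_penalized_reward(1.0, [-1.0, -2.0], [-1.5, -2.5], beta=0.1)
print(round(reward, 3))
```

The penalty grows as the policy moves away from the reference model, which limits how far RL can push into regions where the reward model has never received human feedback.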

However, I think sample efficiency could be a really big deal! Resolving this issue of overseers being exploited I expect could change the asymptotic sample complexity (e.g. exponential to linear) rather than just changing the constant factor. My understanding is that your take is that sample efficiency is unlikely to be a problem because RLHF works fine now, is fairly sample efficient, and improves with model scale -- so why should we expect it to get worse?

I'd argue first that sample efficiency now may actually be quite bad. We don't exactly have any contemporary model that I'd call aligned. GPT-4 and Claude are a lot better than what I'd expect from base models their size -- but "better than just imitating internet text" is a low bar. I expect if we had ~infinite high quality data to do RLHF on these models would be much more aligned. (I'm not sure if having ~infinite data of the same quality that we do now would help; I tend to assume you can trade less quantity for increased quality, but there are obviously some limits here.)

I'm additionally concerned that sample efficiency may be highly task dependent. RLHF is a pretty finicky method, so we're tending to see the success cases of it. What if there are just certain tasks that it's really hard to use RLHF for (perhaps because the base model doesn't already have a good representation of it)? There'll be a strong economic pressure to develop systems that do that task anyway, just using less reliable proxies for that task objective.

(A similar argument will apply for various recursive oversight schemes or debate.)

This might be the most interesting disagreement, and I'd love to dig into this more. With RLHF I can see how you can avoid the problem with sufficient samples since the human won't be fooled by the AdvEx. But this stops working in a domain where you need scalable oversight, as the inputs are too complex for a human to judge, so the human can't provide any input.

The strongest argument I can see for your view is that scalable oversight procedures already have to deal with a human that says "I don't know" for a lot of inputs. So, perhaps you can make a base model that perfectly mimics what the human would say on a large subset of inputs, and for AdvEx's (as well as some other inputs) says "I don't know". This is still a hard problem -- my impression was adversarial example detection is still far from solved -- but is plausibly a fair bit easier than full robustness (which I suspect isn't possible). Then you can just use your scalable oversight procedure to make the "I don't knows" go away.

Alternatively, if you think the issue is that periodically being incentivized to adversarially attack the reward model has serious problematic effects on the inductive biases of RL, it seems relevant to argue for why this would be the case. I don't really see why this would be important. It seems like periodically being somewhat trained to find different advexes shouldn't have much effect on how the AI generalizes?

I think this is an area where we disagree but it doesn't feel central to my view -- I can see it going either way, and I think I'd still be concerned by whether the oversight process is robust even if the process wasn't path dependent (e.g. we just did random restarting of the policy every time we update the reward model).

Thanks, that's a good link. In our case our assets significantly exceed the FDIC $250k insurance limit and there are operational costs to splitting assets across a large number of banks. But a high-interest checking account could be a good option for many small orgs.

Does this circle exploit have any connection to convolutions? That was my first thought when I saw the original writeups, but nothing here seems to help explain where the exploit is coming from. All of the listed agents vulnerable to it, AFAIK, make use of convolutions. The description you give of Wu's anti-circle training sounds a lot like you would expect from an architectural problem like convolution blindness: training can solve the specific exploit but then goes around in cycles or circles (ahem), simply moving the vulnerability around, like squeezing a balloon.

We think it might. One weak point against this is that we tried training CNNs with larger kernels and the problem didn't improve. However, it's not obvious that larger kernels would fix it (it gives the model less need for spatial locality, but it might still have an inductive bias towards it), and the results are a bit confounded since we trained the CNN based on historical KataGo self-play training data rather than freshly generated self-play data. We've been considering training a version of KataGo from scratch (generating new self-play data) to use vision transformers, which would give a cleaner answer to this. It'd be somewhat time consuming though, so curious to hear how interesting you and other commenters would find this result so we can prioritize.

We're also planning on doing mechanistic interpretability to better understand the failure mode, which might shed light on this question.

Do you know they are distinct? The discussion of Go in that paper is extremely brief and does not describe what the exploitation is at all, AFAICT. Your E3 also doesn't seem to describe what the Timbers agent does.

My main reason for believing they're distinct is that an earlier version of their paper includes Figure 3 providing an example Go board that looks fairly different to ours. It's a bit hard to compare since it's a terminal board, there's no move history, but it doesn't look like what would result from capture of a large circular group. But I do wish the Timbers paper went into more detail on this, e.g. including full game traces from their latest attack. I encouraged the authors to do this but it seems like they've all moved on to other projects since then and have limited ability to revise the paper.

This matches my impression. FAR could definitely use more funding. Although I'd still at the margin rather hire someone above our bar than e.g. have them earn-to-give and donate to us, the math is getting a lot closer than it used to be, to the point where those with excellent earning potential and limited fit for AI safety might well have more impact pursuing a philanthropic pathway.

I'd also highlight there's a serious lack of diversity in funding. As others in the thread have mentioned, the majority of people's funding comes (directly or indirectly) from OpenPhil. Although I think OpenPhil does a good job trying to mitigate this (e.g. being careful about power dynamics, giving organizations exit grants if they do decide to stop funding an org, etc.), it's ultimately not a healthy dynamic, and OpenPhil appears to be quite capacity constrained in terms of grant evaluation. So, the entry of new funders would help diversify this in addition to increasing total capacity.

One thing I don't see people talk about as much but also seems like a key part of the solution: how can alignment orgs and researchers make more efficient use of existing funding? Spending that was appropriate a year or two ago when funding was plentiful may not be justified any longer, so there's a need to explicitly put in place appropriate budgets and spending controls. There are a fair number of cost-saving measures I could see the ecosystem implementing that would have limited if any hit on productivity: for example, improved cash management (investing in government money market funds earning ~5% rather than 0% interest checking accounts); negotiating harder with vendors (often possible to get substantial discounts on things like cloud compute or commercial real-estate); and cutting back on some fringe benefits (e.g. more/higher-density open plan rather than private offices). I'm not trying to point fingers here: I've made missteps here as well, for example FAR's cash management currently has significant room for improvement -- we're in the process of fixing this and plan to share a write-up of what we found with other orgs in the next month.
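As a rough sense of scale for the cash-management point, the arithmetic is simple. The sketch below uses a purely hypothetical balance (not FAR's or any real org's figure) and ignores compounding, fees and rate changes.

```python
def annual_interest(balance: float, rate: float) -> float:
    """Simple (non-compounding) interest earned over one year."""
    return balance * rate


balance = 1_000_000  # hypothetical cash balance in USD, for illustration only
money_market_rate = 0.05  # the ~5% figure mentioned above
checking_rate = 0.0  # a 0% interest checking account

forgone = annual_interest(balance, money_market_rate) - annual_interest(balance, checking_rate)
print(f"Interest forgone per year: ${forgone:,.0f}")
```

Under those assumptions, every $1M left in a 0% account forgoes on the order of $50k a year.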

I still don't understand which of (1), (2), or (3) you're most worried about.

Sample efficiency isn't the main way I think about this topic so it's a bit difficult to answer. I find all these defeaters fairly plausible, but if I had to pick the central concern it'd be (3).

I tend to view ML training as a model taking a path through a space of possible programs. There's some programs that are capable and aligned with our interests; others that are capable but will actively pursue harmful goals; and of course many other programs that just don't do anything particularly useful. Assuming we start with a model that is aligned (where "aligned" could include "model cannot do anything useful so does not cause any harm") and we only reward positive behavior, I find it plausible that we can hill-climb to more capable models while preserving alignment.

However, suppose we at some point err and reward undesirable behavior. (This could occur due to incorrect human feedback, or a reward model that is not robust, or some other issues.) At this point, we're training a sub-component of the system that is actively opposed to our interests. Hopefully, we eventually discover this sub-component, and can then disincentivize it in the training process. But at that point, there is some uncertainty in my mind: will the training process remove the sub-component, or simply train the sub-component into being better able to fool the training process?

Now, we don't need the reward model to be perfectly robust to avoid this (as you quite rightly point out), just robust in the region of policy space around the current policy where the RL algorithm is likely to explore. But empirically current reward model robustness falls short of even this.

In response to:

2. Harmless base model. If the foundation model starts off harmless (not necessarily aligned, just not actively trying to cause harm), then I'd expect RLHF'ing it to only improve things so long as the training signal never rewards bad behavior. However, the designers want the model to significantly outperform humans at this task. The model has capacity to learn to do this, but can't just leverage existing capabilities in the foundation model, as the performance of that model is limited to that of the best humans it saw in the self-supervised training data. So, we need to do RL for many more time steps. Collecting fresh human data for that is prohibitive, so we rely on a reward model -- unfortunately that gets hacked.

you write:

Are you assuming that we can't collect human data online as the policy optimizes against the reward model? (People currently do collect data online to avoid getting hacked like this.) This case seems probably hopeless to me without very strong regularization (I think you agree with this being mostly hopeless), but it also seems easy to avoid by just collecting human data online.

No, I do expect online data collection to take place, I just don't expect to be able to do that data collection fast enough or in large enough volumes to kick in before hacking takes place. I think in your taxonomy, this is defeater (2): I think we'll need substantially more samples to train superhuman models than we do human models, as the demands from RLHF switch from localizing a task that a network already knows how to perform, to teaching a model to perform a new capability (safely). (I will note online data collection is a pain and people seem to try and do as little as possible of it.)

Oh, we're using terminology quite differently then. I would not call (a) reward hacking, as I view the model as being the reward (to the RL process), whereas humans are not providing reward at all (but rather some data that gets fed into a reward model's learning process). I don't especially care about what definitions we use here, but do wonder if this means we're speaking past each other in other areas as well.
