Rohin Shah

Research Scientist at DeepMind. Creator of the Alignment Newsletter.


Value Learning
Alignment Newsletter



DeepMind is hiring for the Scalable Alignment and Alignment Teams

Update: I think you should apply now and mention somewhere that you'd prefer to be interviewed in 3 months because in those 3 months you will be doing <whatever it is you're planning to do> and it will help with interviewing.

DeepMind is hiring for the Scalable Alignment and Alignment Teams

I don't have a strong opinion on whether it is good to support remote work. I agree we lose out on a lot of potential talent, but we also gain productivity benefits from in-person collaboration.

However, this is a DeepMind-wide policy and I'm definitely not sold enough on the importance of supporting remote work to try and push for an exception here.

DeepMind is hiring for the Scalable Alignment and Alignment Teams

Looking into it, I'll try to get you a better answer soon. My current best guess is that you should apply 3 months from now. This runs an increased risk that we'll have filled all our positions / closed our applications, but it also improves your chances of making it through, because you'll know more things and be better prepared for the interviews.

(Among other things I'm looking into: would it be reasonable to apply now and mention that you'd prefer to be interviewed in 3 months.)

DeepMind is hiring for the Scalable Alignment and Alignment Teams

Almost certainly, e.g. this one meets those criteria and I'm pretty sure costs < 1/3 of total comp (before taxes), though I don't actually know what typical total comp is. You would find significantly cheaper places if you were willing to compromise on commute, since DeepMind is right in the center of London.

DeepMind is hiring for the Scalable Alignment and Alignment Teams

Unfortunately not, though as Frederik points out below, if your concern is about getting a visa, that's relatively easy to do. DeepMind will provide assistance with the process. I went through it myself and it was relatively painless; it probably took 5-10 hours of my time total (including e.g. travel to and from the appointment where they collected biometric data).

rohinmshah's Shortform

That's what future research is for!

rohinmshah's Shortform

I agree the lack of off-switchability is bad for safety margins (that was part of the intuition driving my last point).

I think it's more concerning in cases where you're getting all of your info from goal-oriented behaviour and solving the inverse planning problem.

I agree Boltzmann rationality (over the action space of, say, "muscle movements") is going to be pretty bad, but any realistic version of this is going to include a bunch of sources of info including "things that humans say", and the human can just tell you that hyperslavery is really bad. Obviously you can't trust everything that humans say, but it seems plausible that if we spent a bunch of time figuring out a good observation model, that would then lead to okay outcomes.

(Ideally you'd figure out how you were getting AGI capabilities, and then leverage those capabilities towards the task of "getting a good observation model" while you still have the ability to turn off the model. It's hard to say exactly what that would look like since I don't have a great sense of how you get AGI capabilities under the non-ML story.)
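
For readers unfamiliar with the term, here is a minimal sketch of what a Boltzmann-rational observation model looks like when used to infer a reward from behaviour. Everything in it is an illustrative assumption rather than anything from the discussion above: the two toy reward hypotheses, the action names, the uniform prior, and the rationality parameter `beta`. A more realistic observation model would add further likelihood terms for other sources of information, such as things humans say.

```python
import math

def boltzmann_likelihood(action, rewards, beta=1.0):
    """P(action | reward hypothesis), proportional to exp(beta * reward)."""
    weights = {a: math.exp(beta * r) for a, r in rewards.items()}
    return weights[action] / sum(weights.values())

def posterior(prior, hypotheses, observed_actions, beta=1.0):
    """Bayesian update over reward hypotheses from observed actions."""
    post = dict(prior)
    for action in observed_actions:
        for name, rewards in hypotheses.items():
            post[name] *= boltzmann_likelihood(action, rewards, beta)
    total = sum(post.values())
    return {name: p / total for name, p in post.items()}

# Hypothetical toy setup: two candidate reward functions over two actions.
hypotheses = {
    "likes_tea": {"tea": 1.0, "coffee": 0.0},
    "likes_coffee": {"tea": 0.0, "coffee": 1.0},
}
prior = {"likes_tea": 0.5, "likes_coffee": 0.5}

# The human picks tea twice and coffee once; the model treats the
# coffee choice as noise rather than as proof of a coffee preference.
post = posterior(prior, hypotheses, ["tea", "tea", "coffee"], beta=2.0)
print(post)
```

The posterior here depends only on the behavioural likelihood, which is the case the quoted concern is about: if the likelihood is a poor model of how the human actually behaves, the inferred reward inherits that error.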

rohinmshah's Shortform

I recently had occasion to write up quick thoughts about the role of assistance games (CIRL) in AI alignment, and how it relates to the problem of fully updated deference. I thought I'd crosspost here as a reference.

  • Assistance games / CIRL are a similar sort of thing to CEV. Just as CEV is English poetry about what we want, assistance games are math poetry about what we want. In particular, neither CEV nor assistance games tell you how to build a friendly AGI. You need to know something about how the capabilities arise for that.
  • One objection: an assistive agent doesn’t let you turn it off, how could that be what we want? This just seems totally fine to me — if a toddler in a fit of anger wishes that its parents were dead, I don’t think the maximally-toddler-aligned parents would then commit suicide, that just seems obviously bad for the toddler.
  • Well-specified assistive agents (i.e. ones where you got the observation model and reward space exactly correct) do many of the other nice things corrigible agents do, like the 5 bullet points at the top of this post. Obviously we don't know how to correctly specify the observation model and reward space, so this is not a solution to alignment, which is why it is "math poetry about what we want".
  • Another objection: ultimately an assistive agent becomes equivalent to optimizing a fixed reward, aren’t things that optimize a fixed reward bad? Again, I think this seems totally fine; the intuition that “optimizing a fixed reward is bad” comes from our expectation that we’ll get the fixed reward wrong, because there’s so much information that has to be in that fixed reward. An assistive agent will spend a long time gaining all the information about the reward, so it really should get it correct (barring misspecification)! If we imagine the superintelligent CIRL sovereign, it has billions of years to optimize the universe! It would be worth it to spend a thousand years to learn a single bit about the reward function if that has more than a 1 in a million chance of doubling the resulting utility (and obviously going from existential catastrophe to not-that seems like a huge increase in utility).
  • I don’t personally work on assistance-game-like algorithms because they rely on having explicit probability distributions over high-dimensional reward spaces, which we don’t have great techniques for, and I think we will probably get AGI before we have great techniques for that. But this is more about what I expect drives AGI capabilities than about some fundamental “safety problems” with assistance games.
  • Another point against assistance games is that they might have very narrow “safety margins”, i.e. if you get the observation model slightly wrong, maybe you get a slightly wrong reward function, and that still leads to an existential catastrophe because value is fragile. (Though this isn’t totally clear, e.g. is it really that easy to mess up the observation model such that it leads to a reward function that’s fine with murdering humans? It seems like there’s a lot of evidence that humans don’t want to be murdered!) If this were the only point against assistance games (i.e. the previous bullet point somehow didn't apply) I’d still be keen for a large fraction of the field to push forward the assistance games approach, while the others look for approaches with wider safety margins.
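
The "thousand years for a single bit" arithmetic in the list above can be checked with a rough sketch. The assumptions here are mine, not part of the original argument: "billions of years" is taken as exactly one billion, utility accrues at a constant rate over time, and the only cost of learning is the time spent not optimizing.

```python
def break_even_probability(horizon_years, delay_years, gain_factor=2.0):
    """Smallest probability p at which delaying to learn is worth it.

    Acting now yields utility proportional to horizon_years.
    Delaying leaves (horizon_years - delay_years) of optimization time,
    and with probability p multiplies the utility rate by gain_factor.
    Break-even: remaining * (1 + p * (gain_factor - 1)) == horizon_years.
    """
    remaining = horizon_years - delay_years
    return delay_years / (remaining * (gain_factor - 1.0))

# Assumed numbers: a 1e9-year horizon, a 1e3-year delay, a doubling of utility.
p = break_even_probability(horizon_years=1e9, delay_years=1e3)
print(p)  # very slightly above one in a million
```

Under these assumptions the break-even probability is the ratio of the delay to the remaining time, which is where the "1 in a million" figure comes from.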

(I made some of these points before in my summary of Human Compatible.)

Project Intro: Selection Theorems for Modularity

Specifically, if for example you vary between two loss functions in some training environment, L1 and L2, that variation is called “modular” if somewhere in design space, that is, the space formed by all possible combinations of parameter values your network can take, you can find a network N1 that “does well”(1) on L1, and a network N2 that “does well” on L2, and these networks have the same values for all their parameters, except for those in a single(2) submodule(3).

It's often the case that you can implement the desired function with, say, 10% of the parameters that you actually have. So every pair of L1 and L2 would count as "modular": you change the 10% of parameters that actually do anything and leave the other 90% the same. Possible fixes:

  1. You could imagine that it's more modular the fewer parameters are needed, so that if you can do it with 1% of the parameters, that's more modular than 10% of the parameters. Problem: this is probably mostly measuring min(difficulty(L1), difficulty(L2)), where difficulty(L) is the minimum number of parameters needed to "solve" L, for whatever definition of "solve" you are using.
  2. You could have a definition that first throws away all the parameters that are irrelevant, and then applies the definition above. (I expect this to have problems with Goodharting on the definition of "irrelevant", but it's not quite so obvious what they will be.)
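
To make the quantities in this discussion concrete, here is a rough sketch of checking the quoted definition on two networks and computing the "fraction of parameters changed" that fix 1 would score. It is only an illustration: representing a network as a dict from hypothetical submodule names to flat parameter lists is my assumption, and the sketch does not check that either network "does well" on its loss.

```python
def differing_modules(params_a, params_b, tol=0.0):
    """Names of submodules whose parameters differ between two networks."""
    return [
        name for name in params_a
        if any(abs(x - y) > tol for x, y in zip(params_a[name], params_b[name]))
    ]

def changed_fraction(params_a, params_b, tol=0.0):
    """Fraction of all parameters that differ between the two networks."""
    total = sum(len(values) for values in params_a.values())
    changed = sum(
        1
        for name in params_a
        for x, y in zip(params_a[name], params_b[name])
        if abs(x - y) > tol
    )
    return changed / total

# Hypothetical toy networks with the same architecture.
n1 = {"features": [0.5, -1.0, 0.0, 0.0], "head": [1.0, 2.0]}
n2 = {"features": [0.5, -1.0, 0.0, 0.0], "head": [-1.0, 0.5]}

modules = differing_modules(n1, n2)
fraction = changed_fraction(n1, n2)
print(modules, fraction)  # differences confined to one submodule
```

The objection above is that a check like this passes too easily when most parameters are unused, since the small set of parameters that matter can always be labelled as the single submodule.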