I don't have a strong opinion on whether it is good to support remote work. I agree we lose out on a lot of potential talent, but we also gain productivity benefits from in-person collaboration.
However, this is a DeepMind-wide policy and I'm definitely not sold enough on the importance of supporting remote work to try and push for an exception here.
I'm looking into it and will try to get you a better answer soon. My current best guess is that you should apply 3 months from now. This runs an increased risk that we'll have filled all our positions / closed our applications, but it also improves your chances of making it through, because you'll know more things and be better prepared for the interviews.
(Among other things I'm looking into: would it be reasonable to apply now and mention that you'd prefer to be interviewed in 3 months?)
Almost certainly; e.g. this one meets those criteria, and I'm pretty sure it costs < 1/3 of total comp (before taxes), though I don't actually know what typical total comp is. You would find significantly cheaper places if you were willing to compromise on commute, since DeepMind is right in the center of London.
Unfortunately not, though as Frederik points out below, if your concern is about getting a visa, that's relatively easy to do. DeepMind will provide assistance with the process. I went through it myself and it was relatively painless; it probably took 5-10 hours of my time total (including e.g. travel to and from the appointment where they collected biometric data).
Should be fixed now!
That's what future research is for!
I agree the lack of off-switchability is bad for safety margins (that was part of the intuition driving my last point).
I think it's more concerning in cases where you're getting all of your info from goal-oriented behaviour and solving the inverse planning problem.
I agree Boltzmann rationality (over the action space of, say, "muscle movements") is going to be pretty bad, but any realistic version of this is going to include a bunch of sources of info including "things that humans say", and the human can just tell you that hyperslavery is really bad. Obvious...
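For reference, here is the standard Boltzmann-rationality likelihood I have in mind (this is the usual formulation from the reward-learning literature, included only for reference; the notation is mine):

```latex
P(a \mid s, \theta) \;=\; \frac{\exp\!\big(\beta\, Q_\theta(s, a)\big)}{\sum_{a'} \exp\!\big(\beta\, Q_\theta(s, a')\big)}
```

Here θ is the unknown reward parameter, Q_θ(s, a) is the optimal value of taking action a in state s under that reward, and β is the rationality coefficient; inverse planning inverts this likelihood to infer θ from observed behaviour. The point above is that with an action space like "muscle movements" this is a poor model of humans, so any realistic system would combine it with other channels of information, such as things humans say.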
I recently had occasion to write up quick thoughts about the role of assistance games (CIRL) in AI alignment, and how it relates to the problem of fully updated deference. I thought I'd crosspost here as a reference.
Specifically, suppose you vary between two loss functions, L1 and L2, in some training environment. That variation is called "modular" if, somewhere in design space (that is, the space formed by all possible combinations of parameter values your network can take), you can find a network N1 that "does well"(1) on L1 and a network N2 that "does well" on L2, such that these networks have the same values for all their parameters except for those in a single(2) submodule(3).
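A minimal formalization of that definition, in hypothetical notation of my own (the footnoted caveats about "does well", "single", and "submodule" are left as stated):

```latex
% Theta: design space, i.e. all parameter vectors the architecture can realize
% S: the index set of the parameters belonging to one submodule
\exists\, N_1, N_2 \in \Theta \;\;\text{such that}\;\;
N_1 \text{ does well on } L_1,\;\;
N_2 \text{ does well on } L_2,\;\;
\text{and}\;\; (N_1)_i = (N_2)_i \;\;\text{for all}\;\; i \notin S.
```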
It's often the case that you can implement the desired function with, say, 10% of the pa...
Isn't this a temporary solution at best? Eventually you resolve your uncertainty over the reward (or, more accurately, you get as much information as you can about the reward, potentially leaving behind some irreducible uncertainty), and then you start manipulating the target human.
I'm pretty wary of introducing potentially-false assumptions like the SVP already, and it seems particularly bad if their benefits are only temporary.
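(To spell out the worry about resolving reward uncertainty in the usual assistance-game notation, as a rough sketch rather than a new argument: the robot's policy at time t is

```latex
\pi^*_t \;\in\; \arg\max_{\pi}\; \mathbb{E}_{\theta \sim P(\theta \mid h_t)}\big[\, R_\theta(\pi) \,\big]
```

where θ is the unknown reward parameter and h_t is the interaction history. Once P(θ | h_t) has converged to whatever irreducible uncertainty remains, further interaction no longer changes the posterior, so the optimal policy no longer gains anything from deferring to the human; any safety benefit from reward uncertainty is temporary.)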
Idk, 95%? Probably I should push that down a bit because I haven't thought about it very hard.
It's a bit fuzzy what "deployed" means, but for now I'm going to assume that we mean that we put inputs into the AI system for the primary purpose of getting useful outputs, rather than for seeing what the AI did so that we can make it better.
Any existential catastrophe that didn't involve a failure of alignment seems like it had to involve a deployed system.
For failures of alignment, I'd expect that before you get an AI system that can break out of the training p...
Yeah I think I agree with all of that. Thanks for rereading my original comment and noticing a misunderstanding :)
Or are you saying that a capable lab would accidentally destroy the world because they would be trying the same approach but either not have those interpretability tools or not be careful enough to use them to check their trained model as well?
This one.
Independent impressions (= inside view in your terminology), though my all-things-considered belief (= betting odds in your terminology) is pretty similar.
I'm curious how well a model finetuned on the Alignment Newsletter performs at summarizing new content (probably blog posts; I'd assume papers are too long and rely too much on figures). My guess is that it doesn't work very well even for blog posts, which is why I haven't tried it yet, but I'd still be interested in the results, and on the off chance that it actually is good enough to save me some time, I'd love that.
I'd also be happy to include a good summary in the Alignment Newsletter (here's the previous summary, which doesn't include many of the newer results).
Cool, that all makes sense.
But one could also think that the disvalue of extinction is more continuous with disvalue in non-extinction scenarios, which makes things a bit more tricky.
I'm happy to use continuous notions (and that's what I was doing in my original comment) as long as "half the cost" means "you update such that the expected costs of misalignment according to your probability distribution over the future are halved". One simple way to imagine this update is to take all the worlds where there was any misalignment, halve their probability, and d...
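As a toy numerical illustration of that kind of update (the numbers are made up, just to make "expected costs are halved" concrete):

```python
# Halve the probability of every "misalignment" world and move the freed-up
# probability mass to the "fine" worlds; the expected cost of misalignment halves.
p_misaligned, cost_misaligned = 0.4, 1.0   # hypothetical numbers
p_fine, cost_fine = 0.6, 0.0

expected_cost_before = p_misaligned * cost_misaligned + p_fine * cost_fine   # 0.4

p_misaligned_new = p_misaligned / 2          # 0.2
p_fine_new = p_fine + p_misaligned / 2       # 0.8 (mass redistributed)

expected_cost_after = p_misaligned_new * cost_misaligned + p_fine_new * cost_fine  # 0.2
assert abs(expected_cost_after - expected_cost_before / 2) < 1e-9
```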
Let's say you currently think the singularity will proceed at a rate of R.
What does this mean? On my understanding, singularities don't proceed at fixed rates?
I agree that in practice there will be some maximum rate of GDP growth, because there are fundamental physical limits (and tighter in-practice limits that we don't know), but it seems like they'll be way higher than 25% per year. Or to put it differently: at a 25% max rate I think it stops deserving the term "singularity"; it seems like it takes decades and maybe centuries to reach technological...
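To make the arithmetic behind "decades" explicit (this is just growth-rate algebra, nothing model-specific):

```latex
T_{\text{double}} \;=\; \frac{\ln 2}{\ln 1.25} \;\approx\; 3.1 \text{ years},
\qquad
T_{\times 1000} \;=\; \frac{\ln 1000}{\ln 1.25} \;\approx\; 31 \text{ years}.
```

Even at a sustained 25% per year, a thousandfold increase in output takes on the order of three decades, which is the sense in which a 25% cap stops looking like a "singularity".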
The most interesting substantive disagreement I found in the discussion was that I was comparatively much more excited about using interpretability to audit a trained model, and skeptical of interpretability tools being something that could be directly used in a training process without the resulting optimisation pressure breaking the tool, while other people had the reverse view.
Fwiw, I do have the reverse view, but my reason is more that "auditing a trained model" does not have a great story for wins. Like, either you find that the model is fine (in which c...
Ah, fair point; looking back at this summary, I probably should have clarified that the methodology could be applied with other samples, and those look much shorter.
It's definitely cruxy in the sense that changing my opinions on any of these would shift my p(doom) some amount.
My rough model is that there's an unknown quantity about reality which is roughly "how strong does the oversight process have to be before the trained model does what the oversight process intended for it to do". p(doom) mainly depends on whether the actors training the powerful systems have sufficiently powerful oversight processes. This seems primarily affected by the quality of technical alignment solutions, but certainly civilizational adequacy also affects the answer.
I mean, maybe we should just drop this point about the intuition pump, it was a throwaway reference in the original comment. I normally use it to argue against a specific mentality I sometimes see in people, and I guess it doesn't make sense outside of that context.
(The mentality is "it doesn't matter what oversight process you use, there's always a malicious superintelligence that can game it, therefore everyone dies".)
Re: cultured meat example: If you give me examples in which you know the features are actually inconsistent, my method is going to look optimistic when it doesn't know about that inconsistency. So yeah, assuming your description of the cultured meat example is correct, my toy model would reproduce that problem.
To give a different example, consider OpenAI Five. One would think that to beat Dota, you need to have an algorithm that allows you to do hierarchical planning, state estimation from partial observability, coordination with team members, understandin...
I agree that is also moderately cruxy (but less so, at least for me, than "high-capabilities alignment is extremely difficult").
It's the first guess.
I think if you have a particular number then I'm like "yup, it's fair to notice that we overestimate the probability that x is even and odd by saying it's 25%", and then I'd say "notice that we underestimate the probability that x is even and divisible by 4 by saying it's 12.5%".
I agree that if you estimate a probability, and then "perform search" / "optimize" / "run n copies of the estimate" (so that you estimate the probability as 1 - (1 - P(event))^n), then you're going to have systematic errors.
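A small worked illustration of both halves of this, i.e. the even/odd point above and the search amplification (the per-trial probabilities in the second part are hypothetical):

```python
# Point 1: decomposing into "features" and multiplying as if independent.
# True P(even and odd) = 0;         naive product: 0.5 * 0.5  = 0.25  (overestimate)
# True P(even and div-by-4) = 0.25; naive product: 0.5 * 0.25 = 0.125 (underestimate)

# Point 2: once you "run n copies" / optimize, a small per-trial overestimate
# is amplified into a large overestimate of overall success.
p_true, p_est, n = 0.01, 0.02, 100            # hypothetical per-trial probabilities
success_true = 1 - (1 - p_true) ** n          # ~0.63
success_est = 1 - (1 - p_est) ** n            # ~0.87
print(round(success_true, 2), round(success_est, 2))
```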
I don't think I'm doing anything that...
I obviously do not think this is at all competitive, and I also wanted to ignore the "other people steal your code" case. I am confused what you think I was trying to do with that intuition pump.
I guess I said "powerful oversight would solve alignment" which could be construed to mean that powerful oversight => great future, in which case I'd change it to "powerful oversight would deal with the particular technical problems that we call outer and inner alignment", but was it really so non-obvious that I was talking about the technical problems?
Maybe you...
The goal is to bring x-risk down to near-zero, aka "End the Acute Risk Period". My usual story for how we do this is roughly "we create a methodology for building AI systems that allows you to align them at low cost relative to the cost of gaining capabilities; everyone uses this method, we have some governance / regulations to catch any stragglers who aren't using it but still can make dangerous systems".
If I talk to Eliezer, I expect him to say "yes, in this story you have executed a pivotal act, via magical low-cost alignment that we definitely do not g...
I can of course imagine a reasonable response to that from you--"ah, resolving philosophical difficulties is the user's problem, and not one of the things that I mean by alignment"
That is in fact my response. (Though one of the ways in which the intuition pump isn't fully compelling to me is that, even after understanding the exact program that the AGI implements and its causal history, maybe the overseers can't correctly predict the consequences of running that program for a long time. Still feels like they'd do fine.)
I do agree that if you go as far as "...
If you define "mainline" as "particle with plurality weight", then I think I was in fact "talking on my mainline" at some points during the conversation, and basically everywhere that I was talking about worlds (instead of specific technical points or intuition pumps) I was talking about "one of my top 10 particles".
I think I responded to every request for concreteness with a fairly concrete answer. Feel free to ask me for more concreteness in any particular story I told during the conversation.
In all cases, the real answer is "the actual impact will depend a ton on the underlying argument that led to the update; that argument will lead to tons of other updates across the board".
I imagine that the spirit of the questions is that I don't perform a Bayesian update and instead do more of a "causal intervention" on the relevant node and propagate downstream. In that case:
I don't think this is the main crux -- disagreements about mechanisms of intelligence seem far more important -- but to answer the questions:
Do you think major AI orgs will realize that AI is potentially world-endingly dangerous, and have any kind of process at all to handle that?
Clearly yes? They have safety teams that are focused on x-risk? I suspect I have misunderstood your question.
(Maybe you mean the bigger tech companies like FAANG, in which case I'm still at > 95% on yes, but I suspect I am still misunderstanding your question.)
(I know less about...
Do you feel like you do this 'sometimes', or 'basically always'?
I don't know what "this" refers to. If the referent is "have a concrete example in mind", then I do that frequently but not always. I do it a ton when I'm not very knowledgeable and learning about a thing; I do it less as my mastery of a subject increases. (Examples: when I was initially learning addition, I used the concrete example of holding up three fingers and then counting up two more to compute 3 + 2 = 5, which I do not do any more. When I first learned recursion, I used to explicitly r... (read more)
I think... this feels true as a matter of human psychology of problem-solving, or something, and not as a matter of math.
I think we're imagining different toy mathematical models.
Your model, according to me:
I'm mostly going to answer assuming that there's not some incredibly different paradigm (i.e. something as different from ML as ML is from expert systems). I do think the probability of "incredibly different paradigm" is low.
I'm also going to answer about the textbook at, idk, the point at which GDP doubles every 8 years. (To avoid talking about the post-Singularity textbook that explains how to build a superintelligence with clearly understood "intelligence algorithms" that can run easily on one of today's laptops, which I know very little about.)
I think ...
(For object-level responses, see comments on parallel threads.)
I want to push back on an implicit framing in lines like:
there's some value to more people thinking thru / shooting down their own edge cases [...], instead of pushing the work to Eliezer.
people aren't updating on the meta-level point and continue to attempt 'rolling their own crypto', asking if Eliezer can poke the hole in this new procedure
This makes it sound like the rest of us don't try to break our proposals, push the work to Eliezer, agree with Eliezer when he finds a problem, and then no...
Ah, got it. I agree that:
Man, I would not call the technique you described "mainline prediction". It also seems kinda inconsistent with Vaniver's usage; his writing suggests that a person only has one mainline at a time, which seems odd for this technique.
Vaniver, is this what you meant? If so, my new answer is that I and others do in fact talk about "mainline predictions" -- for me, there was that whole section talking about natural language debate as an alignment strategy. (It ended up not being about a plausible world, but that's because (a) Eliezer wanted enough concreteness th...
I wrote this doc a couple of years ago (while I was at CHAI). It's got many rough edges (I think I wrote it in one sitting and never bothered to rewrite it to make it better), but I still endorse the general gist, if we're talking about what systems are being deployed to do and what happens amongst organizations. It doesn't totally answer your question (it's more focused on what happens before we get systems that could kill everyone), but it seems pretty related.
(I haven't brought it up before because it seems to me like the disagreement is much more in th...
Or, to put this somewhat differently, in my view the basic abstract point implies that having one extra free parameter allows you to believe in a 5% chance of doom when in fact there's 100% chance of doom, and so in order to get estimations like that right this needs to be one of the basic principles shaping your thoughts, tho ofc your prior should come from many examples instead of one specific counterexample.
I agree that if you have a choice about whether to have more or fewer free parameters, all else equal you should prefer the model with fewer free pa...
EDIT: I wrote this before seeing Paul's response; hence a significant amount of repetition.
They often seem to emit sentences that are 'not absurd', instead of 'on their mainline', because they're mostly trying to generate sentences that pass some shallow checks instead of 'coming from their complete mental universe.'
Why is this?
Well, there are many boring cases that are explained by pedagogy / argument structure. When I say things like "in the limit of infinite oversight capacity, we could just understand everything about the AI system and reengineer it to...
As I understand it, when you "talk about the mainline", you're supposed to have some low-entropy (i.e. confident) view on how the future goes, such that you can answer very different questions X, Y and Z about that particular future, that are all correlated with each other, and all get (say) > 50% probability. (Idk, as I write this down, it seems so obviously a bad way to reason that I feel like I must not be understanding it correctly.)
But to the extent this is right, I'm actually quite confused why anyone thinks "talk about the mainline" is an ideal t...
In response to your last couple paragraphs: the critique, afaict, is not "a real human cannot keep multiple concrete scenarios in mind and speak probabilistically about those", but rather "a common method for representing lots of hypotheses at once, is to decompose the hypotheses into component properties that can be used to describe lots of concrete hypotheses. (toy model: instead of imagining all numbers, you note that some numbers are odd and some numbers are even, and then think of evenness and oddness). A common failure mode when attempting this is th...
Note that my first response was:
(For the reader, I don't think that "arguments about what you're selecting for" is the same thing as "freely combining surface desiderata", though I do expect they look approximately the same to Eliezer)
and my immediately preceding message was
I actually think something like this might be a crux for me, though obviously I wouldn't put it the way you're putting it. More like "are arguments about internal mechanisms more or less trustworthy than arguments about what you're selecting for" (limiting to arguments we actually have...
Sorry, I probably should have been clearer about the "this is a quote from a longer dialogue; the missing context is important" part. I do think that the disagreement about "how relevant is this to 'actual disagreement'?" is basically the live thing, not whether or not you agree with the basic abstract point.
My current sense is that you're right that the thing you're doing is more specific than the general case (and one of the ways you can tell is the line of argumentation you give about chance of doom), and also Eliezer can still be correctly observing that...
But I guess I'm sufficiently confident in “>50% chance that it's destructive” that I'll argue for that.
Fwiw 50% on doom in the story I told seems plausible to me; maybe I'm at 30% but that's very unstable. I don't think we disagree all that much here.
Then we can start talking about capability windows etc., but I don't think that was your objection here.
Capability windows are totally part of the objection. If you completely ignore capability windows / compute restrictions then you just run AIXI (or AIXI-tl if you don't want something uncomputable) and die immediately.
So it seems reasonable to conclude that this level of "trying" is not enough to enact the pivotal acts you described
Stated differently than how I'd say it, but I agree that a single human performing human-level reasoning is not enough to enact those pivotal acts.
in my model reflexiveness is a property of actions,
Yeah, in my ontology (and in this context) reflexiveness is a property of cognitions, not of actions. I can reflexively reach into a transparent pipe to pick up a sandwich, without searching over possible plans for getting the sandwich (or at least...
I agree that we don't have a plan that we can be justifiably confident in right now.
I don't see why the "destructive consequences" version is most likely to arise, especially since it doesn't seem to arise for humans. (In terms of Rob's continuum, humans seem much closer to #2-style trying.)
If we have some way to limit an AI's strategy space, or limit how efficiently and intelligently it searches that space, then we can maybe recapitulate some of the stuff that makes humans safe (albeit at the cost that the debate answers will probably be way worse — but maybe we can still get nanotech or whatever out of this process).
If that's the plan, then I guess my next question is how we should go about limiting the strategy space and/or reducing the search quality? (Taking into account things like deception risk.)
It sounds like you think my position is...
In this story, I'm not imagining that we limited the strategy space or reduced the search quality. I'm imagining that we just scaled up capabilities, used debate without any bells and whistles like interpretability, and the empirical situation just happened to be that the AI systems didn't develop #4-style "trying" (but did develop #2-style "trying") before they became capable enough to e.g. establish a stable governance regime that regulates AI development or do alignment research better than any existing human alignment researchers that leads to a soluti...
I totally agree those are on a continuum. I don't think this changes my point? It seems like Eliezer is confident that "reduce x-risk to sub-50%" requires being all the way on the far side of that continuum, and I don't see why that's required.
So my objection to debate (which again I think is similar to Eliezer's) would be: (1) if the debaters are “trying to win the debate” in a way that involves RL-on-thoughts / consequentialist planning / etc., then in all likelihood they would think up the strategy of breaking out of the box and hacking into the judge / opposing debater / etc. (2) if not, I don't think the AIs would be sufficiently capable that they could do anything pivotal.
In that particular non-failure story, I'm definitely imagining that they aren't "trying to win the debate" (where "tryi...
Hm. I've often imagined a "keep the diamond safe" planner just choosing a plan which a narrow-ELK-solving reporter says is OK.
But where does the plan come from? If you're imagining that the planner creates N different plans and then executes the one that the reporter says is OK, then I have the same objection:
The planner "knows" how and why it chose the action sequence while the predictor doesn't, and so it's very plausible that this allows the planner to choose some bad / deceptive sequence that looks good to the predictor. (The classic example is t...
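Here is a toy sketch of the "planner proposes, reporter screens" setup under discussion, just to make the objection concrete (all names are hypothetical; the point is that picking the best-looking approved plan is itself optimization pressure against the reporter):

```python
from typing import Callable, List, Sequence

def choose_plan(
    generate_plans: Callable[[int], Sequence[str]],  # planner: proposes candidate action sequences
    reporter_says_ok: Callable[[str], bool],         # narrow-ELK reporter: "does the diamond stay safe?"
    score: Callable[[str], float],                   # planner's own evaluation of each plan
    n: int = 1000,
) -> str:
    """Return the planner's highest-scoring plan among those the reporter approves.

    The objection above: the planner "knows" how and why each candidate was
    generated while the reporter doesn't, so heavy selection over n candidates
    can surface a plan that merely looks fine to the reporter.
    """
    approved: List[str] = [p for p in generate_plans(n) if reporter_says_ok(p)]
    if not approved:
        raise RuntimeError("reporter rejected every candidate plan")
    return max(approved, key=score)
```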
Update: I think you should apply now and mention somewhere that you'd prefer to be interviewed in 3 months because in those 3 months you will be doing <whatever it is you're planning to do> and it will help with interviewing.