All of Rohin Shah's Comments + Replies

DeepMind is hiring for the Scalable Alignment and Alignment Teams

Update: I think you should apply now and mention somewhere that you'd prefer to be interviewed in 3 months, because in those 3 months you will be doing <whatever it is you're planning to do> and that will help with interviewing.

DeepMind is hiring for the Scalable Alignment and Alignment Teams

I don't have a strong opinion on whether it is good to support remote work. I agree we lose out on a lot of potential talent, but we also gain productivity benefits from in-person collaboration.

However, this is a DeepMind-wide policy and I'm definitely not sold enough on the importance of supporting remote work to try and push for an exception here.

DeepMind is hiring for the Scalable Alignment and Alignment Teams

Looking into it, I'll try to get you a better answer soon. My current best guess is that you should apply 3 months from now. This runs an increased risk that we'll have filled all our positions / closed our applications, but it also improves your chances of making it through, because you'll know more things and be better prepared for the interviews.

(Among other things, I'm looking into whether it would be reasonable to apply now and mention that you'd prefer to be interviewed in 3 months.)

Nathan Helm-Burger · 7d · 2 points
Thanks Rohin. I also feel that interviewing after my 3 more months of independent work is probably the correct call.
DeepMind is hiring for the Scalable Alignment and Alignment Teams

Almost certainly, e.g. this one meets those criteria and I'm pretty sure it costs < 1/3 of total comp (before taxes), though I don't actually know what typical total comp is. You would find significantly cheaper places if you were willing to compromise on commute, since DeepMind is right in the center of London.

Michael Y. Zuo · 8d · 2 points
Thanks, that is more luxurious than I imagined, so families should have no difficulty finding a large enough place.
DeepMind is hiring for the Scalable Alignment and Alignment Teams

Unfortunately not, though as Frederik points out below, if your concern is about getting a visa, that's relatively easy to do. DeepMind will provide assistance with the process. I went through it myself and it was relatively painless; it probably took 5-10 hours of my time total (including e.g. travel to and from the appointment where they collected biometric data).

Yair Halberstadt · 7d · 0 points
Whilst that's definitely great, my guess is that 90% of the people who would be interested and don't live in London would not move to London for the job, even with a free visa. Not supporting remote work therefore loses out on a majority of the potential talent pool for this job.
rohinmshah's Shortform

That's what future research is for!

rohinmshah's Shortform

I agree the lack of off-switchability is bad for safety margins (that was part of the intuition driving my last point).

I think it's more concerning in cases where you're getting all of your info from goal-oriented behaviour and solving the inverse planning problem

I agree Boltzmann rationality (over the action space of, say, "muscle movements") is going to be pretty bad, but any realistic version of this is going to include a bunch of sources of info including "things that humans say", and the human can just tell you that hyperslavery is really bad. Obvious... (read more)
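A minimal illustration of the Boltzmann-rational observation model under discussion (my own sketch: the hypotheses, Q-values, actions, and rationality coefficient below are all made up), showing Bayesian inference over two candidate reward functions from observed human actions:

```python
import numpy as np

# Hypothetical Q-values for three actions under two candidate reward functions.
Q = {
    "theta_good": np.array([1.0, 0.2, -1.0]),   # utopia-ish preferences
    "theta_bad":  np.array([-1.0, 0.2, 1.0]),   # hyperslavery-ish preferences
}
beta = 2.0                                       # rationality coefficient
prior = {"theta_good": 0.5, "theta_bad": 0.5}

def boltzmann_likelihood(action, q_values, beta):
    """P(action | theta) under Boltzmann rationality: proportional to exp(beta * Q)."""
    logits = beta * q_values
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return probs[action]

def posterior(observed_actions):
    """Posterior over reward hypotheses given observed human actions."""
    post = dict(prior)
    for a in observed_actions:
        for theta, q in Q.items():
            post[theta] *= boltzmann_likelihood(a, q, beta)
        z = sum(post.values())
        post = {k: v / z for k, v in post.items()}
    return post

print(posterior([0, 0, 1]))  # humans mostly pick action 0 -> theta_good is favoured
```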

rohinmshah's Shortform

I recently had occasion to write up quick thoughts about the role of assistance games (CIRL) in AI alignment, and how it relates to the problem of fully updated deference. I thought I'd crosspost here as a reference.

  • Assistance games / CIRL are a similar sort of thing to CEV. Just as CEV is English poetry about what we want, assistance games are math poetry about what we want (a minimal formal statement follows this list). In particular, neither CEV nor assistance games tells you how to build a friendly AGI. You need to know something about how the capabilities arise for that.
  • One objection: an assistive
... (read more)
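For reference, the assistance-game / CIRL formalization (from Hadfield-Menell et al., 2016) is roughly the formal object being referred to above; a minimal statement:

```latex
% An assistance game / CIRL game is a two-player Markov game
M = \bigl\langle S,\; \{A^H, A^R\},\; T(\cdot \mid s, a^H, a^R),\; \Theta,\; R(s, a^H, a^R; \theta),\; P_0(s_0, \theta),\; \gamma \bigr\rangle,
% where S is the state space, A^H and A^R are the human's and robot's action sets,
% T is the transition distribution, \Theta parameterizes the reward,
% R is the shared reward, P_0 is the prior over the initial state and \theta,
% and \gamma is the discount. The human observes \theta; the robot does not.
% Both players maximize the same objective,
\mathbb{E}\Bigl[\sum_{t} \gamma^{t} R(s_t, a^H_t, a^R_t; \theta)\Bigr],
% so the robot must infer \theta from the human's behaviour.
```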
DanielFilan · 1mo · 5 points
I think this is way more worrying in the case where you're implementing an assistance game solver, where this lack of off-switchability means your margins for safety are much narrower. I think it's more concerning in cases where you're getting all of your info from goal-oriented behaviour and solving the inverse planning problem - in those cases, the way you know how 'human preferences' rank future hyperslavery vs wireheaded rat tiling vs humane utopia is by how human actions affect the likelihood of those possible worlds, but that's probably not well-modelled by Boltzmann rationality (e.g. the thing I'm most likely to do today is not to write a short computer program that implements humane utopia), and it seems like your inference is going to be very sensitive to plausible variations in the observation model.
Project Intro: Selection Theorems for Modularity

Specifically, if for example you vary between two loss functions in some training environment, L1 and L2, that variation is called “modular” if somewhere in design space, that is, the space formed by all possible combinations of parameter values your network can take, you can find a network N1 that “does well”(1) on L1, and a network N2 that “does well” on L2, and these networks have the same values for all their parameters, except for those in a single(2) submodule(3).
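A minimal sketch of this definition (my own illustration, with hypothetical module index sets, not code from the post): two parameter vectors count as a "modular" variation if they agree everywhere outside a single submodule.

```python
import numpy as np

# Hypothetical partition of a network's flattened parameter vector into submodules,
# given as index sets. Two networks N1, N2 exhibit a "modular" variation if they
# differ only inside one submodule.
modules = {
    "module_a": np.arange(0, 100),
    "module_b": np.arange(100, 250),
    "module_c": np.arange(250, 400),
}

def modular_variation(params_1, params_2, modules, atol=1e-8):
    """Return the single module in which the two networks differ, or None."""
    diff = ~np.isclose(params_1, params_2, atol=atol)
    differing = [name for name, idx in modules.items() if diff[idx].any()]
    return differing[0] if len(differing) == 1 else None

rng = np.random.default_rng(0)
n1 = rng.normal(size=400)
n2 = n1.copy()
n2[100:250] += rng.normal(size=150)               # perturb only module_b
print(modular_variation(n1, n2, modules))          # -> "module_b"
```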

It's often the case that you can implement the desired function with, say, 10% of the pa... (read more)

Lblack · 1mo · 2 points
A very good point! I agree that fix 1. seems bad, and doesn't capture what we care about. At first glance, fix 2. seems more promising to me, but I'll need to think about it. Thank you very much for pointing this out.
On Agent Incentives to Manipulate Human Feedback in Multi-Agent Reward Learning Scenarios

Isn't this a temporary solution at best? Eventually you resolve your uncertainty over the reward (or, more accurately, you get as much information as you can about the reward, potentially leaving behind some irreducible uncertainty), and then you start manipulating the target human.

I'm pretty wary of introducing potentially-false assumptions like the SVP already, and it seems particularly bad if their benefits are only temporary.
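A toy illustration of why the benefit is only temporary (my own sketch, with made-up rewards): as the agent's posterior over the reward concentrates, the value of deferring to or querying the human shrinks to zero, so nothing is left to restrain manipulation.

```python
import numpy as np

# Two reward hypotheses over two actions; the agent starts uncertain.
# VOI = expected gain from asking the human which hypothesis is true,
# relative to acting on the current posterior. As the posterior concentrates,
# VOI -> 0, so reward uncertainty only restrains the agent temporarily.
R = np.array([[1.0, -1.0],    # rewards of actions 0, 1 under hypothesis A
              [-1.0, 1.0]])   # rewards of actions 0, 1 under hypothesis B

def value_of_querying(p_a):
    p = np.array([p_a, 1.0 - p_a])
    act_now = (p @ R).max()                # best action under the current posterior
    act_after_query = p @ R.max(axis=1)    # learn the truth first, then act
    return act_after_query - act_now

for p_a in [0.5, 0.7, 0.9, 0.99, 1.0]:
    print(f"P(hypothesis A) = {p_a:.2f}  ->  VOI = {value_of_querying(p_a):.3f}")
```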

Francis Rhys Ward · 1mo · 4 points
Yeah, at the end of the post I point out both the potential falsity of the SVP and the problem of updated deference. Approaches that make the agent indefinitely uncertain about the reward (or at least uncertain for longer) might help with the latter, e.g. if H is also uncertain about the reward [https://arxiv.org/abs/1901.08654], or if preferences are modeled as changing over time or with different contexts [https://groups.csail.mit.edu/rrg/papers/daehyung_corl_19.pdf], etc.

I agree, and I'm not sure I endorse the SVP, but I think it's the right type of solution -- i.e. an assumption about the training environment that (hopefully) encourages cooperative behaviour. I've found it difficult to think of a more robust/satisfying solution to manipulation (in this context). It seems like agents just will have incentives to manipulate each other in a multi-polar world, and it's hard to prevent that.
Late 2021 MIRI Conversations: AMA / Discussion

Idk, 95%? Probably I should push that down a bit because I haven't thought about it very hard.

It's a bit fuzzy what "deployed" means, but for now I'm going to assume that we mean that we put inputs into the AI system for the primary purpose of getting useful outputs, rather than for seeing what the AI did so that we can make it better.

Any existential catastrophe that didn't involve a failure of alignment seems like it had to involve a deployed system.

For failures of alignment, I'd expect that before you get an AI system that can break out of the training p... (read more)

A Longlist of Theories of Impact for Interpretability

Yeah I think I agree with all of that. Thanks for rereading my original comment and noticing a misunderstanding :)

A Longlist of Theories of Impact for Interpretability

Or are you saying that a capable lab would accidentally destroy the world because they would be trying the same approach but either not have those interpretability tools or not be careful enough to use them to check their trained model as well?

This one.

Evan R. Murphy · 2mo · 3 points
Ok, I think there's a plausible success story for interpretability though where transparency tools become broadly available. Every major AI lab is equipped to use them and has incorporated them into their development processes. I also think it's plausible that either 1) one AI lab eventually gains a considerable lead/advantage over the others so that they'd have time to iterate after their model fails audit, or 2) if one lab communicated that their audits show a certain architecture/training approach keeps producing models that are clearly unsafe, then the other major labs would take that seriously. This is why "auditing a trained model" still seems like a useful ability to me.

Update: Perhaps I was reading Rohin's original comment as more critical of audits than he intended. I thought he was arguing that audits will be useless. But re-reading it, I see him saying that the conjunctiveness of the coordination story makes him "more excited" about interpretability for training, and that it's "not an either-or".
When people ask for your P(doom), do you give them your inside view or your betting odds?

Independent impressions (= inside view in your terminology), though my all-things-considered belief (= betting odds in your terminology) is pretty similar.

A survey of tool use and workflows in alignment research

I'm curious how well a model finetuned on the Alignment Newsletter performs at summarizing new content (probably blog posts; I'd assume papers are too long and rely too much on figures). My guess is that it doesn't work very well even for blog posts, which is why I haven't tried it yet, but I'd still be interested in the results and would love it on the off chance that it actually was good enough to save me some time.
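For concreteness, such an experiment might look roughly like the sketch below (illustrative only: the model choice, data format, and hyperparameters are assumptions, and the dataset of (post, summary) pairs is a stand-in, since no such pipeline is described here):

```python
from datasets import Dataset
from transformers import (AutoTokenizer, AutoModelForSeq2SeqLM,
                          DataCollatorForSeq2Seq, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments)

# Hypothetical dataset of (blog post, newsletter-style summary) pairs.
pairs = [{"post": "Full text of an alignment blog post ...",
          "summary": "One-paragraph newsletter-style summary ..."}]
ds = Dataset.from_list(pairs)

model_name = "google/flan-t5-base"   # any seq2seq summarization model would do
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

def preprocess(batch):
    # Tokenize posts as inputs and summaries as labels.
    inputs = tokenizer(["summarize: " + p for p in batch["post"]],
                       max_length=1024, truncation=True)
    labels = tokenizer(batch["summary"], max_length=256, truncation=True)
    inputs["labels"] = labels["input_ids"]
    return inputs

tokenized = ds.map(preprocess, batched=True, remove_columns=ds.column_names)

trainer = Seq2SeqTrainer(
    model=model,
    args=Seq2SeqTrainingArguments(output_dir="newsletter-summarizer",
                                  per_device_train_batch_size=2,
                                  num_train_epochs=3),
    train_dataset=tokenized,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```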

jacquesthibs · 2mo · 4 points
We could definitely look into making the project evolve in this direction. In fact, we're building a dataset of alignment-related texts and a small part of the dataset includes a scrape of arXiv papers extracted from the Alignment Newsletter. We're working towards building GPT models fine-tuned on the texts.
Job Offering: Help Communicate Infrabayesianism

I'd also be happy to include a good summary in the Alignment Newsletter (here's the previous summary, which doesn't include many of the newer results).

Late 2021 MIRI Conversations: AMA / Discussion

Cool, that all makes sense.

But one could also think that the disvalue of extinction is more continuous with disvalue in non-extinction scenarios, which makes things a bit more tricky.

I'm happy to use continuous notions (and that's what I was doing in my original comment) as long as "half the cost" means "you update such that the expected costs of misalignment according to your probability distribution over the future are halved". One simple way to imagine this update is to take all the worlds where there was any misalignment, halve their probability, and d... (read more)

Late 2021 MIRI Conversations: AMA / Discussion

Let's say you currently think the singularity will proceed at a rate of R.

What does this mean? On my understanding, singularities don't proceed at fixed rates?

I agree that in practice there will be some maximum rate of GDP growth, because there are fundamental physical limits (and tighter in-practice limits that we don't know), but it seems like those limits will be way higher than 25% per year. Or to put it a different way, at a 25% max rate I think it stops deserving the term "singularity"; it seems like it takes decades and maybe centuries to reach technological... (read more)
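Spelling out the arithmetic behind "decades" (the 1000x factor is chosen only for illustration):

```latex
% At 25% growth per year:
\text{doubling time} = \frac{\ln 2}{\ln 1.25} \approx 3.1 \text{ years},
\qquad
\text{time for a } 1000\times \text{ increase} = \frac{\ln 1000}{\ln 1.25} \approx 31 \text{ years}.
```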

Matthew Barnett · 2mo · 1 point
I think you sufficiently addressed my confusion, so you don't need to reply to this comment, but I still had a few responses to what you said. No, I agree. But growth is generally measured over an interval. In the original comment I proposed the interval of one year during the peak rate of economic growth. To allay your concern that a 25% growth rate indicates we didn't experience a singularity, I meant that we were halving the growth rate during the peak economic growth year in our future, regardless of whether that rate was very fast. The 25% figure was totally arbitrary. I didn't mean it as any sort of prediction. I agree that an extrapolation from biological growth implies that we can and should see >1000% growth rates eventually, though it seems plausible that we would coordinate to avoid that. That's reasonable. A separate question might be about whether the rate of growth during the entire duration from now until the peak rate will cut in half. I think the way you're bucketing this into "costs if we go extinct" and "costs if we don't go extinct" is reasonable. But one could also think that the disvalue of extinction is more continuous with disvalue in non-extinction scenarios, which makes things a bit more tricky. I hope that makes sense.
A Longlist of Theories of Impact for Interpretability

The most interesting substantive disagreement I found in the discussion was that I was comparably much more excited about using interpretability to audit a trained model, and skeptical of interpretability tools being something that could be directly used in a training process without the resulting optimisation pressure breaking the tool, while other people had the reverse view.

Fwiw, I do have the reverse view, but my reason is more that "auditing a trained model" does not have a great story for wins. Like, either you find that the model is fine (in which c... (read more)

Evan R. Murphy · 2mo · 1 point
Can you explain your reasoning behind this a bit more? Are you saying someone else destroys the world because a capable lab wants to destroy the world, and so as soon as the route to misaligned AGI is possible then someone will do it? Or are you saying that a capable lab would accidentally destroy the world because they would be trying the same approach but either not have those interpretability tools or not be careful enough to use them to check their trained model as well? (Or something else?...)
Charlie Steiner · 2mo · 3 points
The way I'd put something-like-this is that in order for auditing the model to help (directly), you have to actually be pretty confident in your ability to understand and fix your mistakes if you find one. It's not like getting a coin to land Heads by flipping it again if it lands Tails - different AGI projects are not independent random variables, if you don't get good results the first time you won't get good results the next time unless you understand what happened. This means that auditing trained models isn't really appropriate for the middle of the skill curve. Instead, it seems like something you could use after already being confident you're doing good stuff, as quality control. This sharply limits the amount you expect it to save you, but might increase some other benefits of having an audit, like convincing people you know what you're doing and aren't trying to play Defect.
Fractional progress estimates for AI timelines and implied resource requirements

Ah, fair point, looking back at this summary I probably should have clarified that the methodology could be applied with other samples and those look much less long.

Late 2021 MIRI Conversations: AMA / Discussion

It's definitely cruxy in the sense that changing my opinions on any of these would shift my p(doom) some amount.

My rough model is that there's an unknown quantity about reality which is roughly "how strong does the oversight process have to be before the trained model does what the oversight process intended for it to do". p(doom) mainly depends on whether the actors training the powerful systems have sufficiently powerful oversight processes. This seems primarily affected by the quality of technical alignment solutions, but certainly civilizational adequacy also affects the answer.

Late 2021 MIRI Conversations: AMA / Discussion

I mean, maybe we should just drop this point about the intuition pump, it was a throwaway reference in the original comment. I normally use it to argue against a specific mentality I sometimes see in people, and I guess it doesn't make sense outside of that context.

(The mentality is "it doesn't matter what oversight process you use, there's always a malicious superintelligence that can game it, therefore everyone dies".)

Late 2021 MIRI Conversations: AMA / Discussion

Re: cultured meat example: If you give me examples in which you know the features are actually inconsistent, my method is going to look optimistic when it doesn't know about that inconsistency. So yeah, assuming your description of the cultured meat example is correct, my toy model would reproduce that problem.

To give a different example, consider OpenAI Five. One would think that to beat Dota, you need to have an algorithm that allows you to do hierarchical planning, state estimation from partial observability, coordination with team members, understandin... (read more)

Late 2021 MIRI Conversations: AMA / Discussion

I agree that is also moderately cruxy (but less so, at least for me, than "high-capabilities alignment is extremely difficult").

Late 2021 MIRI Conversations: AMA / Discussion

It's the first guess.

I think if you have a particular number then I'm like "yup, it's fair to notice that we overestimate the probability that x is even and odd by saying it's 25%", and then I'd say "notice that we underestimate the probability that x is even and divisible by 4 by saying it's 12.5%".

I agree that if you estimate a probability, and then "perform search" / "optimize" / "run n copies of the estimate" (so that you estimate the probability as 1 - (1 - P(event))^n), then you're going to have systematic errors.
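Spelling out the arithmetic in the two examples above, plus the search amplification:

```latex
% Treating the properties as independent overestimates an impossible conjunction:
P(\text{even} \wedge \text{odd}) \approx P(\text{even})\,P(\text{odd}) = \tfrac{1}{2} \cdot \tfrac{1}{2} = \tfrac{1}{4}
\quad (\text{true value: } 0),
% and underestimates a nested one:
P(\text{even} \wedge \text{divisible by } 4) \approx \tfrac{1}{2} \cdot \tfrac{1}{4} = \tfrac{1}{8}
\quad (\text{true value: } \tfrac{1}{4}).
% Running a search over n independent attempts then amplifies any per-attempt error:
P(\text{at least one success}) = 1 - \bigl(1 - P(\text{event})\bigr)^{n}.
```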

I don't think I'm doing anything that... (read more)

Matthew "Vaniver" Graves · 2mo · 3 points
Cool, I like this example. I think the thing I'm interested in is "what are our estimates of the output of search processes?". The question we're ultimately trying to answer with a model here is something like "are humans, when they consider a problem that could have attempted solutions of many different forms, overly optimistic about how solvable those problems are because they hypothesize a solution with inconsistent features?" The example of "a number divisible by 2 and a number divisible by 4" is an example of where the consistency of your solution helps you--anything that satisfies the second condition is already satisfying the first condition. But importantly the best you can do here is ignore superfluous conditions; they can't increase the volume of the solution space. I think this is where the systematic bias is coming from (that the joint probability of two conditions can't be higher than the maximum of those two conditions, where the joint probability can be lower than the minimum of the two, and so the product isn't an unbiased estimator of the joint). For example, consider this recent analysis of cultured meat [https://thecounter.org/lab-grown-cultivated-meat-cost-at-scale/], which seems to me to point out a fundamental inconsistency of this type in people's plans for creating cultured meat. Basically, the bigger you make a bioreactor, the better it looks on criteria ABC, and the smaller you make a bioreactor, the better it looks on criteria DEF, and projections seem to suggest that massive progress will be made on all of those criteria simultaneously because progress can be made on them individually. But this necessitates making bioreactors that are simultaneously much bigger and much smaller! [Sometimes this is possible, because actually one is based on volume and the other is based on surface area, and so when you make something like a zeolite [https://en.wikipedia.org/wiki/Zeolite] you can combine massive surface area with tiny volume. But if y
Late 2021 MIRI Conversations: AMA / Discussion

I obviously do not think this is at all competitive, and I also wanted to ignore the "other people steal your code" case. I am confused what you think I was trying to do with that intuition pump.

I guess I said "powerful oversight would solve alignment" which could be construed to mean that powerful oversight => great future, in which case I'd change it to "powerful oversight would deal with the particular technical problems that we call outer and inner alignment", but was it really so non-obvious that I was talking about the technical problems?

Maybe you... (read more)

Matthew "Vaniver" Graves · 2mo · 1 point
I think I'm confused about the intuition pump too! Like, here's some options I thought up:

* The 'alignment problem' is really the 'not enough oversight' problem. [But then if we solve the 'enough oversight' problem, we still have to solve the 'what we want' problem, the 'coordination' problem, the 'construct competitively' problem, etc.]
* Bits of the alignment problem can be traded off against each other, most obviously coordination and 'alignment tax' (i.e. the additional amount of work you need to do to make a system aligned, or the opposite of 'competitiveness', which I didn't want to use here for ease-of-understanding-by-newbies reasons.) [But it's basically just coordination and competitiveness; like, you could imagine that oversight gives you a rejection sampling story for trading off time and understanding but I think this is basically not true because you're also optimizing for finding holes in your transparency regime.]

Like, by analogy, I could imagine someone who uses an intuition pump of "if you had sufficient money, you could solve any problem", but I wouldn't use that intuition pump because I don't believe it. [Sure, 'by definition' if the amount of money doesn't solve the problem, it's not sufficient. But why are we implicitly positing that there exists a sufficient amount of money instead of thinking about what money cannot buy [https://www.lesswrong.com/posts/YABJKJ3v97k9sbxwg/what-money-cannot-buy]?]

(After reading the rest of your comment, it seems pretty clear to me that you mean the first bullet, as you say here:)

I both 1) didn't think it was obvious (sorry if I'm being slow on following the change in usage of 'alignment' here) and 2) don't think realistically powerful oversight solves either of those two on its own (outer alignment because of "rejection sampling can get you siren worlds" problem, inner alignment because "rejection sampling isn't competitive", but I find that one not very compelling and
Late 2021 MIRI Conversations: AMA / Discussion

The goal is to bring x-risk down to near-zero, aka "End the Acute Risk Period". My usual story for how we do this is roughly "we create a methodology for building AI systems that allows you to align them at low cost relative to the cost of gaining capabilities; everyone uses this method, we have some governance / regulations to catch any stragglers who aren't using it but still can make dangerous systems".

If I talk to Eliezer, I expect him to say "yes, in this story you have executed a pivotal act, via magical low-cost alignment that we definitely do not g... (read more)

Steve Byrnes · 2mo · 4 points
Huh. I'm under the impression that "offense-defense balance for technology-inventing AGIs" is also a big cruxy difference between you and Eliezer. Specifically: if almost everyone is making helpful aligned norm-following AGIs, but one secret military lab accidentally makes a misaligned paperclip maximizer, can the latter crush all competition? My impression is that Eliezer thinks yes: there's really no defense against self-replicating nano-machines, so the only paths to victory are absolutely perfect compliance forever (which he sees as implausible, given secret military labs etc.) or someone uses an aligned AGI to do a drastic-seeming pivotal act in the general category of GPU-melting nanobots. Whereas you disagree. Sorry if I'm putting words in anyone's mouths. For my part, I don't have an informed opinion about offense-defense balance, i.e. whether more-powerful-and-numerous aligned AGIs can defend against one paperclipper born in a secret military lab accident. I guess I'd have to read Drexler's nano book or something. At the very least, I don't see it as a slam dunk in favor of Team Aligned [https://www.lesswrong.com/posts/CtGwGgxfoefiwfcor/disentangling-perspectives-on-strategy-stealing-in-ai-safety?commentId=Nax4QmYNDEpmZLq36] , I see it as a question that could go either way.
Late 2021 MIRI Conversations: AMA / Discussion

I can of course imagine a reasonable response to that from you--"ah, resolving philosophical difficulties is the user's problem, and not one of the things that I mean by alignment"

That is in fact my response. (Though one of the ways in which the intuition pump isn't fully compelling to me is that, even after understanding the exact program that the AGI implements and its causal history, maybe the overseers can't correctly predict the consequences of running that program for a long time. Still feels like they'd do fine.)

I do agree that if you go as far as "... (read more)

Matthew "Vaniver" Graves · 3mo · 1 point
I think some of my more alignment-flavored counterexamples look like:

* The 'reengineer it to be safe' step breaks down / isn't implemented thru oversight. Like, if we're positing we spin up a whole Great Reflection to evaluate every action the AI takes, this seems like it's probably not going to be competitive!
* The oversight gives us as much info as we ask for, but the world is a siren world (like what Stuart points to [https://www.lesswrong.com/posts/nFv2buafNc9jSaxAH/siren-worlds-and-the-perils-of-over-optimised-search], but a little different), where the initial information we discover about the plans from oversight is so convincing that we decide to go ahead with the AI before discovering the gotchas.
* Related to the previous point, the oversight is sufficient to reveal features about the plan that are terrible, but before the 'reengineer to make it more safe' plan is executed, the code is stolen and executed by a subset of humanity which thinks the terrible plan is 'good enough', for them at least.

That is, it feels to me like we benefit a lot from having 1) a constructive approach to alignment instead of rejection sampling, 2) sufficient security focus that we don't proceed on EV of known information, but actually do the 'due diligence', and 3) sufficient coordination among humans that we don't leave behind substantial swaths of current human preferences, and I don't see how we get those thru having arbitrary transparency.

[I also would like to solve the problem of "AI has good outcomes" instead of the smaller problem of "AI isn't out to get us", because accidental deaths are deaths too! But I do think it makes sense to focus on that capability problem separately, at least sometimes.]
Late 2021 MIRI Conversations: AMA / Discussion

If you define "mainline" as "particle with plurality weight", then I think I was in fact "talking on my mainline" at some points during the conversation, and basically everywhere that I was talking about worlds (instead of specific technical points or intuition pumps) I was talking about "one of my top 10 particles".

I think I responded to every request for concreteness with a fairly concrete answer. Feel free to ask me for more concreteness in any particular story I told during the conversation.

Late 2021 MIRI Conversations: AMA / Discussion

In all cases, the real answer is "the actual impact will depend a ton on the underlying argument that led to the update; that argument will lead to tons of other updates across the board".

I imagine that the spirit of the questions is that I don't perform a Bayesian update and instead do more of a "causal intervention" on the relevant node and propagate downstream. In that case:

  1. I'm confused by the question. If the peak rate of GWP growth ever is 25%, it seems like the singularity didn't happen? Nonetheless, to the extent this question is about updates on th
... (read more)
Matthew Barnett · 2mo · 1 point
Thanks for your response. :) I'm a little confused by your confusion. Let's say you currently think the singularity will proceed at a rate of R. The spirit of what I'm asking is: what would you change if you learned that it will proceed at a rate of one half R. (Maybe plucking specific numbers about the peak-rate of growth just made things more confusing). For me at least, I'd probably expect a lot more oversight, as people have more time to adjust to what's happening in the world around them. I'm also a little confused about this. My exact phrasing was, "You learn that the cost of misalignment is half as much as you thought, in the sense that slightly misaligned AI software impose costs that are half as much (ethically, or economically), compared to what you used to think." I assume you don't think that slightly misaligned software will, by default, cause extinction, especially if it's acting alone and is economically or geographically isolated. We could perhaps view this through an analogy. War is really bad: so bad that maybe it will even cause our extinction (if say, we have some really terrible nuclear winter). But by default, I don't expect war to cause humanity to go extinct. And so, if someone asked me about a scenario in which the costs of war are only half as much as I thought, it would probably significantly update me away from thinking we need to take actions to prevent war. The magnitude of this update might not be large, but understanding exactly how much we'd update and change our strategy in light of this information is type of thing I'm asking for.
Late 2021 MIRI Conversations: AMA / Discussion

I don't think this is the main crux -- disagreements about mechanisms of intelligence seem far more important -- but to answer the questions:

Do you think major AI orgs will realize that AI is potentially worldendingly dangerous, and have any kind of process at all to handle that?

Clearly yes? They have safety teams that are focused on x-risk? I suspect I have misunderstood your question.

(Maybe you mean the bigger tech companies like FAANG, in which case I'm still at > 95% on yes, but I suspect I am still misunderstanding your question.)

(I know less about... (read more)

Raymond Arnold · 3mo · 2 points
Thanks. I wasn't super satisfied with the way I phrased my questions. I just made some slight edits to them (labeled as such), although they still don't feel like they quite do the thing. (I feel like I'm looking at a bunch of subtle frame disconnects, while multiple other frame disconnects are going on, so pinpointing the thing is hard.) I think "is any of this actually cruxy" is maybe the most important question and I should have included it. You answered "not supermuch, at least compared to models of intelligence". Do you think there's any similar nearby thing that feels more relevant on your end? In any case, thanks for your answers, they do help give me more of a sense of the gestalt of your worldview here, however relevant it is.
Late 2021 MIRI Conversations: AMA / Discussion

Do you feel like you do this 'sometimes', or 'basically always'?

I don't know what "this" refers to. If the referent is "have a concrete example in mind", then I do that frequently but not always. I do it a ton when I'm not very knowledgeable and learning about a thing; I do it less as my mastery of a subject increases. (Examples: when I was initially learning addition, I used the concrete example of holding up three fingers and then counting up two more to compute 3 + 2 = 5, which I do not do any more. When I first learned recursion, I used to explicitly r... (read more)

Late 2021 MIRI Conversations: AMA / Discussion

I think... this feels true as a matter of human psychology of problem-solving, or something, and not as a matter of math.

I think we're imagining different toy mathematical models.

Your model, according to me:

  1. There is a space of possible approaches, that we are searching over to find a solution. (E.g. the space of all possible programs.)
  2. We put a layer of abstraction on top of this space, characterizing approaches by N different "features" (e.g. "is it goal-directed", "is it an oracle", "is it capable of destroying the world")
  3. Because we're bounded agents, we
... (read more)
Matthew "Vaniver" Graves · 3mo · 3 points
Huh, why doesn't that procedure have that systematic error? Like, when I try to naively run your steps 1-4 on "probability of there existing a number that's both even and odd", I get that about 25% of numbers should be both even and odd, so it seems pretty likely that it'll work out given that there are at least 4 numbers. But I can't easily construct an argument at a similar level of sophistication that gives me an underestimate. [Like, "probability of there existing a number that's both odd and prime" gives the wrong conclusion if you buy that the probability that a natural number is prime is 0 [https://math.stackexchange.com/questions/82074/percentage-of-primes-among-the-natural-numbers] , but this is because you evaluated your limits in the wrong order, not because of a problem with dropping all the covariance data from your joint distribution.] My first guess is that you think I'm doing the "ways the world could be" thing wrong--like, I'm looking at predicates over numbers and trying to evaluate a predicate over all numbers, but instead I should just have a probability on "universe contains a number that is both even and odd" and its complement, as those are the two relevant ways the world can be. My second guess is that you've got a different distribution over target predicates; like, we can just take the complement of my overestimate ("probability of there existing no numbers that are both even and odd") and call it an underestimate. But I think I'm more interested in 'overestimating existence' than 'underestimating non-existence'. [Is this an example of the 'additional details' you're talking about?] Also maybe you can just exhibit a simple example that has an underestimate, and then we need to think harder about how likely overestimates and underestimates are to see if there's a net bias.
Late 2021 MIRI Conversations: AMA / Discussion

I'm mostly going to answer assuming that there's not some incredibly different paradigm (i.e. something as different from ML as ML is from expert systems). I do think the probability of "incredibly different paradigm" is low.

I'm also going to answer about the textbook at, idk, the point at which GDP doubles every 8 years. (To avoid talking about the post-Singularity textbook that explains how to build a superintelligence with clearly understood "intelligence algorithms" that can run easily on one of today's laptops, which I know very little about.)

I think ... (read more)

Late 2021 MIRI Conversations: AMA / Discussion

(For object-level responses, see comments on parallel threads.)

I want to push back on an implicit framing in lines like:

there's some value to more people thinking thru / shooting down their own edge cases [...], instead of pushing the work to Eliezer.

people aren't updating on the meta-level point and continue to attempt 'rolling their own crypto', asking if Eliezer can poke the hole in this new procedure

This makes it sound like the rest of us don't try to break our proposals, push the work to Eliezer, agree with Eliezer when he finds a problem, and then no... (read more)

Matthew "Vaniver" Graves · 3mo · 6 points
Yeah, sorry about not owning that more, and for the frame being muddled. I don't endorse the "asking Eliezer" or "agreeing with Eliezer" bits, but I do basically think he's right about many object-level problems he identifies (and thus people disagreeing with him about that is not a feature) and think 'security mindset' is the right orientation to have towards AGI alignment. That hypothesis is a 'worry' primarily because asymmetric costs means it's more worth investigating than the raw probability would suggest. [Tho the raw probability of components of it do feel pretty substantial to me.] [EDIT: I should say I think ARC's approach to ELK seems like a great example of "people breaking their own proposals". As additional data to update on, I'd be interested in seeing, like, a graph of people's optimism about ELK over time, or something similar.]
Late 2021 MIRI Conversations: AMA / Discussion

Ah, got it. I agree that:

  1. The technique you described is in fact very useful
  2. If your probability distribution over futures happens to be such that it has a "mainline prediction", you get significant benefits from that (similar to the benefits you get from the technique you described).
Late 2021 MIRI Conversations: AMA / Discussion

Man, I would not call the technique you described "mainline prediction". It also seems kinda inconsistent with Vaniver's usage; his writing suggests that a person only has one mainline at a time which seems odd for this technique.

Vaniver, is this what you meant? If so, my new answer is that I and others do in fact talk about "mainline predictions" -- for me, there was that whole section talking about natural language debate as an alignment strategy. (It ended up not being about a plausible world, but that's because (a) Eliezer wanted enough concreteness th... (read more)

Matthew "Vaniver" Graves · 3mo · 2 points
Uh, I inherited "mainline" from Eliezer's usage in the dialogue, and am guessing that his reasoning is following a process sort of like mine and John's. My natural word for it is a 'particle', from particle filtering, as linked in various places, which I think is consistent with John's description. I'm further guessing that Eliezer's noticed more constraints / implied inconsistencies, and is somewhat better at figuring out which variables to drop, so that his cloud is narrower than mine / more generates 'mainline predictions' than 'probability distributions'. Do you feel like you do this 'sometimes', or 'basically always'? Maybe it would be productive for me to reread the dialogue (or at least part of it) and sort sections / comments by how much they feel like they're coming from this vs. some other source. As a specific thing that I have in mind, I think there's a habit of thinking / discourse that philosophy trains, which is having separate senses for "views in consideration" and "what I believe", and thinking that statements should be considered against all views in consideration, even ones that you don't believe. This seems pretty good in some respects (if you begin by disbelieving a view incorrectly, your habits nevertheless gather you lots of evidence about it, which can cause you to then correctly believe it), and pretty questionable in other respects (conversations between Alice and Bob now have to include them shadowboxing with everyone else in the broader discourse, as Alice is asking herself "what would Carol say in response to that?" to things that Bob says to her). When I imagine dialogues generated by people who are both sometimes doing the mainline thing and sometimes doing the 'represent the whole discourse' thing, they look pretty different from dialogues generated by people who are both only doing the mainline thing. [And also from dialogues generated by both people only doing the 'represent the whole discourse' thing, of course.]
johnswentworth · 3mo · 4 points
To be clear, I do not mean to use the label "mainline prediction" for this whole technique. Mainline prediction tracking is one way of implementing this general technique, and I claim that the usefulness of the general technique is the main reason why mainline predictions are useful to track. (Also, it matches up quite well with Nate's model based on his comment here [https://www.lesswrong.com/posts/34Gkqus9vusXRevR8/late-2021-miri-conversations-ama-discussion?commentId=nhLb9Ym8mYfGoSv2G] , and I expect it also matches how Eliezer wants to use the technique.)
Late 2021 MIRI Conversations: AMA / Discussion

I wrote this doc a couple of years ago (while I was at CHAI). It's got many rough edges (I think I wrote it in one sitting and never bothered to rewrite it to make it better), but I still endorse the general gist, if we're talking about what systems are being deployed to do and what happens amongst organizations. It doesn't totally answer your question (it's more focused on what happens before we get systems that could kill everyone), but it seems pretty related.

(I haven't brought it up before because it seems to me like the disagreement is much more in th... (read more)

Late 2021 MIRI Conversations: AMA / Discussion

Or, to put this somewhat differently, in my view the basic abstract point implies that having one extra free parameter allows you to believe in a 5% chance of doom when in fact there's 100% chance of doom, and so in order to get estimations like that right this needs to be one of the basic principles shaping your thoughts, tho ofc your prior should come from many examples instead of one specific counterexample.

I agree that if you have a choice about whether to have more or fewer free parameters, all else equal you should prefer the model with fewer free pa... (read more)

Matthew "Vaniver" Graves · 3mo · 3 points
That is, when I give Optimistic Alice fewer constraints, she can more easily imagine a solution, and when I give Pessimistic Bob fewer constraints, he can more easily imagine that no solution is possible? I think... this feels true as a matter of human psychology of problem-solving, or something, and not as a matter of math. Like, the way Bob fails to find a solution mostly looks like "not actually considering the space", or "wasting consideration on easily-known-bad parts of the space", and more constraints could help with both of those. But, as math, removing constraints can't lower the volume of the implied space and so can't make it less likely that a viable solution exists. I think Eliezer thinks nearly all humans have such a bias by default, and so without clear evidence to the contrary it's a reasonable suspicion for anyone. [I think there's a thing Eliezer does a lot, which I have mixed feelings about, which is matching people's statements to patterns and then responding to the generator of the pattern in Eliezer's head, which only sometimes corresponds to the generator in the other person's head.] Cool, makes sense. [I continue to think we disagree about how true this is in a practical sense, where I read you as thinking "yeah, this is a minor consideration, we have to think with the tools we have access to, which could be wrong in either direction and so are useful as a point estimate" and me as thinking "huh, this really seems like the tools we have access to are going to give us overly optimistic answers, and we should focus more on how to get tools that will give us more robust answers."]
Late 2021 MIRI Conversations: AMA / Discussion

EDIT: I wrote this before seeing Paul's response; hence a significant amount of repetition.

They often seem to emit sentences that are 'not absurd', instead of 'on their mainline', because they're mostly trying to generate sentences that pass some shallow checks instead of 'coming from their complete mental universe.'

Why is this?

Well, there are many boring cases that are explained by pedagogy / argument structure. When I say things like "in the limit of infinite oversight capacity, we could just understand everything about the AI system and reengineer it to... (read more)

Matthew "Vaniver" Graves · 3mo · 1 point
I think this is roughly how I'm thinking about things sometimes, tho I'd describe the mainline as the particle [https://en.wikipedia.org/wiki/Particle_filter] with plurality weight (which is a weaker condition than >50%). [I don't know how Eliezer thinks about things; maybe it's like this? I'd be interested in hearing his description.] I think this is also a generator of disagreements about what sort of things are worth betting on; when I imagine why I would bail with "the future is hard to predict", it's because the hypotheses/particles I'm considering have clearly defined X, Y, and Z variables (often discretized into bins or ranges) but not clearly defined A, B, and C variables (tho they might have distributions over those variables), because if you also conditioned on those you would have Too Many Particles. And when I imagine trying to contrast particles on features A, B, and C, as they all make weak predictions we get at most a few bits of evidence to update their weights on, whereas when we contrast them on X, Y, and Z we get many more bits, and so it feels more fruitful to reason about. I mean, the question is which direction we want to approach Bayesianism from, given that Bayesianism is impossible (as you point out later in your comment). On the one hand, you could focus on 'updating', and have lots of distributions that aren't grounded in reality but which are easy to massage when new observations come in, and on the other hand, you could focus on 'hypotheses', and have as many models of the situation as you can ground, and then have to do something much more complicated when new observations come in. [Like, a thing I find helpful to think about here is where the motive power from Aumann's Agreement Theorem [https://en.wikipedia.org/wiki/Aumann%27s_agreement_theorem] comes from, which is that when I say 40% A, you know that my private info is consistent with an update of the shared prior whose posterior is 40%, and when you take the shared prior and u
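A minimal illustration of "the particle with plurality weight" (a toy bootstrap particle filter I'm adding for concreteness; the dynamics, noise levels, and observations below are made up):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy bootstrap particle filter: a latent x_t follows a random walk, and we
# observe y_t = x_t + noise. The "mainline" in the sense above is the particle
# that currently carries the plurality of the weight.
n_particles = 1000
particles = rng.normal(0.0, 1.0, size=n_particles)
weights = np.full(n_particles, 1.0 / n_particles)

def step(particles, weights, y, process_sd=0.3, obs_sd=0.5):
    # Propagate the hypotheses forward, then reweight by the likelihood of y.
    particles = particles + rng.normal(0.0, process_sd, size=particles.shape)
    weights = weights * np.exp(-0.5 * ((y - particles) / obs_sd) ** 2)
    weights /= weights.sum()
    # Resample when the effective sample size collapses.
    if 1.0 / np.sum(weights ** 2) < len(particles) / 2:
        idx = rng.choice(len(particles), size=len(particles), p=weights)
        particles = particles[idx]
        weights = np.full(len(particles), 1.0 / len(particles))
    return particles, weights

for y in [0.2, 0.5, 0.9, 1.4]:            # made-up observations
    particles, weights = step(particles, weights, y)

mainline = particles[np.argmax(weights)]  # the particle with plurality weight
print(mainline)
```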
Matthew "Vaniver" Graves · 3mo · 1 point
Huh, I guess I don't believe the intuition pump? Like, as the first counterexample that comes to mind, when I imagine having an AGI where I can tell everything about how it's thinking, and yet I remain a black box to myself, I can't really tell whether or not it's aligned to me. (Is me-now the one that I want it to be aligned to, or me-across-time? Which side of my internal conflicts about A vs. B / which principle for resolving such conflicts?) I can of course imagine a reasonable response to that from you--"ah, resolving philosophical difficulties is the user's problem, and not one of the things that I mean by alignment"--but I think I have some more-obviously-alignment-related counterexamples. [Tho if by 'infinite oversight ability' you do mean something like 'logical omniscience' it does become pretty difficult to find a real counterexample, in part because I can just find the future trajectory with highest expected utility and take the action I take at the start of that trajectory without having to have any sort of understanding about why that action was predictably a good idea.] But like, the thing this reminds me of is something like extrapolating tangents, instead of operating the production function? "If we had an infinitely good engine, we could make the perfect car", which seems sensible when you're used to thinking of engine improvements linearly increasing car quality and doesn't seem sensible when you're used to thinking of car quality as a product of sigmoids of the input variables. (This is a long response to a short section because I think the disagreement here is about something like "how should we reason and communicate about intuitions?", and so it's worth expanding on what I think might be the implications of otherwise minor disagreements.)

As I understand it, when you "talk about the mainline", you're supposed to have some low-entropy (i.e. confident) view on how the future goes, such that you can answer very different questions X, Y and Z about that particular future, that are all correlated with each other, and all get (say) > 50% probability. (Idk, as I write this down, it seems so obviously a bad way to reason that I feel like I must not be understanding it correctly.)

But to the extent this is right, I'm actually quite confused why anyone thinks "talk about the mainline" is an ideal t

... (read more)

In response to your last couple paragraphs: the critique, afaict, is not "a real human cannot keep multiple concrete scenarios in mind and speak probabilistically about those", but rather "a common method for representing lots of hypotheses at once, is to decompose the hypotheses into component properties that can be used to describe lots of concrete hypotheses. (toy model: instead of imagining all numbers, you note that some numbers are odd and some numbers are even, and then think of evenness and oddness). A common failure mode when attempting this is th... (read more)

Late 2021 MIRI Conversations: AMA / Discussion

Note that my first response was:

(For the reader, I don't think that "arguments about what you're selecting for" is the same thing as "freely combining surface desiderata", though I do expect they look approximately the same to Eliezer)

and my immediately preceding message was

I actually think something like this might be a crux for me, though obviously I wouldn't put it the way you're putting it. More like "are arguments about internal mechanisms more or less trustworthy than arguments about what you're selecting for" (limiting to arguments we actually have

... (read more)

Sorry, I probably should have been more clear about the "this is a quote from a longer dialogue, the missing context is important." I do think that the disagreement about "how relevant is this to 'actual disagreement'?" is basically the live thing, not whether or not you agree with the basic abstract point.

My current sense is that you're right that the thing you're doing is more specific than the general case (and one of the ways you can tell is the line of argumentation you give about chance of doom), and also Eliezer can still be correctly observing that... (read more)

Shah and Yudkowsky on alignment failures

But I guess I'm sufficiently confident in “>50% chance that it's destructive” that I'll argue for that.

Fwiw 50% on doom in the story I told seems plausible to me; maybe I'm at 30% but that's very unstable. I don't think we disagree all that much here.

Then we can start talking about capability windows etc., but I don't think that was your objection here.

Capability windows are totally part of the objection. If you completely ignore capability windows / compute restrictions then you just run AIXI (or AIXI-tl if you don't want something uncomputable) and die immediately.

Shah and Yudkowsky on alignment failures

So it seems reasonable to conclude that this level of "trying" is not enough to enact the pivotal acts you described

Stated differently than how I'd say it, but I agree that a single human performing human-level reasoning is not enough to enact those pivotal acts.

in my model reflexiveness is a property of actions,

Yeah, in my ontology (and in this context) reflexiveness is a property of cognitions, not of actions. I can reflexively reach into a transparent pipe to pick up a sandwich, without searching over possible plans for getting the sandwich (or at least... (read more)

Shah and Yudkowsky on alignment failures

I agree that we don't have a plan that we can be justifiably confident in right now.

I don't see why the "destructive consequences" version is most likely to arise, especially since it doesn't seem to arise for humans. (In terms of Rob's continuum, humans seem much closer to #2-style trying.)

Steve Byrnes · 3mo · 2 points
Again I don't have an especially strong opinion about what our prior should be on possible motivation systems for an AGI trained by straightforward debate, and in particular what fraction of those motivation systems are destructive. But I guess I'm sufficiently confident in “>50% chance that it's destructive” that I'll argue for that. I'll assume the AGI uses model-based RL, which would (I claim) put it very roughly in the same category as humans. Some aspects of motivation have an obvious relationship / correlation to the reward signal. In the human case, given that we're a social animal, we can't be surprised to find that the human brainstem reward function [https://www.alignmentforum.org/posts/hE56gYi5d68uux9oM/intro-to-brain-like-agi-safety-3-two-subsystems-learning-and] inserts lots of socially-related motivations into us, including things like caring about other humans (which sometimes generalizes to caring about other living creatures) and generally wanting to fit in and follow norms under most circumstances, etc. Whereas other things in the world have no relationship to the innate human brainstem reward function, and predictably, basically no one cares about them, except insofar as they become instrumentally useful for something else we do care about. (There are interesting rare exceptions, like human superstitions.) An example in humans would be the question of whether pebbles on the sidewalk are more often an even number of centimeters apart versus an odd number of centimeters apart. In the straightforward debate setup, I can't see any positive reason for the reward function to directly paint a valence, either positive or negative, onto the idea of the AGI taking over the world. So I revert to the default expectation that the AGI will view “I take over the world” in a way that's analogous to how humans view “the pebbles on the sidewalk are an even number of centimeters apart”—i.e., totally neutral, except insofar as it becomes instrumentally relevant fo
Shah and Yudkowsky on alignment failures

If we have some way to limit an AI's strategy space, or limit how efficiently and intelligently it searches that space, then we can maybe recapitulate some of the stuff that makes humans safe (albeit at the cost that the debate answers will probably be way worse — but maybe we can still get nanotech or whatever out of this process).

If that's the plan, then I guess my next question is how we should go about limiting the strategy space and/or reducing the search quality? (Taking into account things like deception risk.)

It sounds like you think my position is... (read more)

In this story, I'm not imagining that we limited the strategy space or reduced the search quality. I'm imagining that we just scaled up capabilities, used debate without any bells and whistles like interpretability, and the empirical situation just happened to be that the AI systems didn't develop #4-style "trying" (but did develop #2-style "trying") before they became capable enough to e.g. establish a stable governance regime that regulates AI development or do alignment research better than any existing human alignment researchers that leads to a soluti

... (read more)
Shah and Yudkowsky on alignment failures

I totally agree those are on a continuum. I don't think this changes my point? It seems like Eliezer is confident that "reduce x-risk to EDIT: sub-50%" requires being all the way on the far side of that continuum, and I don't see why that's required.

Nate Soares · 3mo · 4 points
("near-zero" is a red herring, and I worry that that phrasing bolsters the incorrect view that the reason MIRI folk think alignment is hard is that we want implausibly strong guarantees. I suggest replacing "reduce x-risk to near-zero" with "reduce x-risk to sub-50%".)
Shah and Yudkowsky on alignment failures

So my objection to debate (which again I think is similar to Eliezer's) would be: (1) if the debaters are “trying to win the debate” in a way that involves RL-on-thoughts / consequentialist planning / etc., then in all likelihood they would think up the strategy of breaking out of the box and hacking into the judge / opposing debater / etc. (2) if not, I don't think the AIs would be sufficiently capable that they could do anything pivotal.

In that particular non-failure story, I'm definitely imagining that they aren't "trying to win the debate" (where "tryi... (read more)

Steve Byrnes · 3mo · 4 points
The RL-on-thoughts discussion [https://www.lesswrong.com/posts/SzrmsbkqydpZyPuEh/my-take-on-vanessa-kosoy-s-take-on-agi-safety#6_3_Why_is_the_algorithm__trying_to_do__anything__What_s_missing_from_the__infra_Bayesian_perspective_] was meant as an argument that a sufficiently capable AI needs to be “trying” to do something. If we agree on that part, then you can still say my previous comment was a false dichotomy, because the AI could be “trying” to (for example) “win the debate while following the spirit of the debate rules”. And yes I agree, that was bad of me to have listed those two things as if they're the only two options. I guess what I was thinking was: If we take the most straightforward debate setup, and if it gets an AI that is “trying” to do something, then that “something” is most likely to be vaguely like “win the debate” or something else with similarly-destructive consequences. A different issue is whether that “most likely” is 99.9% vs 80% or whatever—that part is not immediately obvious to me. And yet another question is whether we can push that probability much lower, even towards zero, by not using the most straightforward debate setup, but rather adding things to the setup that are directly targeted at sculpting the AGI's motivations. I am not in fact convinced of near-certain doom there—that would be my Consequentialism & Corrigibility [https://www.alignmentforum.org/posts/KDMLJEXTWtkZWheXt/consequentialism-and-corrigibility] post. (I am convinced that we don't have a good plan right now.)
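For concreteness, the "most straightforward debate setup" under discussion is roughly the zero-sum game sketched below (my schematic only: the debater and judge functions are stand-ins for trained models and a human judge, not a real training setup):

```python
import random

# Schematic of the straightforward debate setup: two debaters alternate
# arguments on a question, a judge picks a winner, and the reward is
# zero-sum (+1 / -1). That zero-sum reward is what creates the pressure
# to be "trying to win the debate".
def debater(name, question, transcript):
    return f"{name}'s argument #{len(transcript) // 2 + 1} about {question!r}"

def judge(question, transcript):
    return random.choice(["A", "B"])   # stand-in for a human judgement

def run_debate(question, n_rounds=3):
    transcript = []
    for _ in range(n_rounds):
        transcript.append(("A", debater("A", question, transcript)))
        transcript.append(("B", debater("B", question, transcript)))
    winner = judge(question, transcript)
    rewards = {"A": 1 if winner == "A" else -1,
               "B": 1 if winner == "B" else -1}
    return transcript, rewards          # rewards would drive RL training

transcript, rewards = run_debate("Should we deploy this plan?")
print(rewards)
```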
Rob Bensinger · 3mo · 3 points
Suppose I'm debating someone about gun control, and they say 'guns don't kill people; people kill people'. Here are four different scenarios for how I might respond:

* 1. Almost as a pure reflex, before I can stop myself, I blurt out 'That's bullshit!' in response. It's not the best way to win the debate, but heck, I've heard that zinger a thousand times and it just makes me so mad. (Or, in other words: I have an automatic reflex-like response to that specific phrase, which is to get mad; and when I get mad, I have an automatic reflex-like response to blurt out the first sentence I can think of that expresses disapproval for that slogan.)

Or, instead:

* 2. I remember that there's a $1000 prize for winning this debate, and I take a deep breath to calm myself. I know that winning the debate will require convincing a judge who isn't super sympathetic to my political views; so I'll have to come up with some argument that's convincing even to a conservative. My mind wanders for a few seconds, and a thought pops into my head: 'Guns and people both kill people!' Hmm, but that sounds sort of awkward and weak. Is there a more pithy phrasing? A memory suddenly pops into my head: I think I heard once that knife murders spiked in Australia or somewhere, when guns were banned? So, like, 'People will kill people regardless of whether guns are present?' Ugh, wait, that's exactly the point my opponent was making. Moving the debate in that direction is a terrible idea if I want to win. And now I feel a bit bad for strategically steering my thoughts away from true information, but whatever... And now my mind is wandering, thinking about gun suicide, and... come on, focus. 'Guns don't kill people. People kill people.' How to respond? Going concrete might make my response more compelling, by making me sound more grounded and common-sensical. Concretely, it's just obvious common sense that giving someon
Is ELK enough? Diamond, Matrix and Child AI

Hm. I've often imagined a "keep the diamond safe" planner just choosing a plan which a narrow-ELK-solving reporter says is OK. 

But where does the plan come from? If you're imagining that the planner creates N different plans and then executes the one that the reporter says is OK, then I have the same objection:

The planner "knows" how and why it chose the action sequence while the predictor doesn't, and so it's very plausible that this allows the planner to choose some bad / deceptive sequence that looks good to the predictor. (The classic example is t

... (read more)
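A minimal sketch of the setup being objected to (my illustration; the planner and reporter below are stand-ins): the planner proposes candidate plans and the reporter filters them, which is effectively optimization against the reporter, so plans that merely look OK to the reporter get selected for.

```python
import random

def planner_propose(n_plans):
    return [f"plan_{i}" for i in range(n_plans)]   # stand-in for a real planner

def reporter_says_ok(plan):
    return random.random() < 0.1                   # stand-in for a narrow-ELK reporter

def choose_plan(n_plans=100):
    # Execute the first plan the reporter approves; because the planner
    # generates the candidates, this selects for plans that pass the
    # reporter's check, whether or not they are actually safe.
    for plan in planner_propose(n_plans):
        if reporter_says_ok(plan):
            return plan
    return None

print(choose_plan())
```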