AMA: Paul Christiano, alignment researcher

I don't know if we ever cleared up ambiguity about the concept of PONR. It seems like it depends critically on who is returning, i.e. what is the counterfactual we are considering when asking if we "could" return. If we don't do any magical intervention, then it seems like the PONR could be well before AI since the conclusion was always inevitable. If we do a maximally magical intervention, of creating unprecedented political will, then I think it's most likely that we'd see 100%+ annual growth (even of say energy capture) before PONR. I don't think there are reasonable definitions of PONR where it's very likely to occur before significant economic acceleration.

I don't think I consider most of the scenarios listed to be necessarily-PONR-before-GDP-acceleration scenarios, though many of them could permit PONR-before-GDP if AI was broadly deployed before it started adding significant economic value.

All of these probabilities are obviously pretty unreliable and made up on the spot:

1. Fast takeoff

Defined as 1-year doubling starts before 4-year doubling finishes, maybe 25%?

2. The sorts of skills needed to succeed in politics or war are easier to develop in AI than the sorts needed to accelerate

... (read more)
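To make the definition in item 1 concrete, here is a minimal sketch of how one might check it against a yearly output series. Everything in it is an illustrative assumption rather than something stated above: the function names, the yearly sampling, reading "doubling" as "at least 2x over the window", and the made-up numbers.

```python
def first_doubling(series, window):
    """Return (start, end) indices of the first span of `window` steps
    over which the series at least doubles, or None if there is none."""
    for end in range(window, len(series)):
        start = end - window
        if series[end] >= 2 * series[start]:
            return start, end
    return None


def is_fast_takeoff(yearly_output):
    """Fast takeoff, under this reading: the first 1-year doubling
    starts before the first 4-year doubling has finished."""
    one_year = first_doubling(yearly_output, 1)
    four_year = first_doubling(yearly_output, 4)
    if one_year is None:
        return False  # no 1-year doubling observed at all
    if four_year is None:
        return True   # a 1-year doubling with no completed 4-year doubling
    return one_year[0] < four_year[1]


# Hypothetical series (arbitrary units, one value per year).
slow = [100, 120, 145, 175, 210, 430]  # 4-year doubling completes first
fast = [100, 103, 106, 109, 112, 230]  # jumps straight to a 1-year doubling

print(is_fast_takeoff(slow), is_fast_takeoff(fast))
```

On these toy series the first call reports a slow takeoff and the second a fast one. Real data would need a choice about sampling frequency and noise that this sketch ignores.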

1Donald Hobson5y

I think there is a reasonable way it could happen even without an enormous lead. You just need either, 1. It's very hard to capture a significant fraction of the gains from the tech. 2. Tech progress scales very poorly in money. For example, suppose it is obvious to everyone that AI in a few years' time will be really powerful. Several teams with lots of funding are set up. If progress is researcher bound, and researchers are ideologically committed to the goals of the project, then top research talent might be extremely difficult to buy. (They are already well paid, for the next year they will be working almost all day. After that, the world is mostly shaped by which project won.) Compute could be hard to buy if there were hard bottlenecks somewhere in the chip supply chain, most of the world's new chips were already being used by the AI projects, and an attitude of "our chips and we're not selling" was prevalent. Another possibility, suppose deploying a tech means letting the competition know how it works. Then if one side deploys, they are pushing the other side ahead. So the question is, does deploying one unit of research give you the resources to do more than one unit?

[-]Neel Nanda5y150

What are the most important ideas floating around in alignment research that don't yet have a public write-up? (Or, even better, that have a public write-up but could do with a good one?)

[-]paulfchristiano5y190

I have a big gap between "stuff I've written up" and "stuff that I'd like to write up." Some particular ideas that come to mind: how epistemic competitiveness seems really important for alignment; how I think about questions like "aligned with whom" and why I think it's good to try to decouple alignment techniques from decisions about values / preference aggregation (this position is surprisingly controversial); updated views on the basic dichotomy in Two Kinds of Generalization and the current best hopes for avoiding the bad kind.

I think that there's a cluster of really important questions about what we can verify, how "alien" the knowledge of ML systems will be, and how realistic it's going to be to take a kind of ad hoc approach to alignment. In my experience people with a more experimental bent tend to be more optimistic about those questions, and tend to have a bunch of intuitions about those questions that do kind of hang together (and are often approximately shared across people). This comes with some more color on the current alignment plan / what's likely to happen in practice as people try to solve the problem on their feet. I don't think that's really been written up well but it s... (read more)

3Ben Pace, the Vacationing Vagabond5y

The stuff about ‘alien’ knowledge sounds really fascinating, and I’d be excited about write-ups. All my concrete intuitions here come from reading Distill.Pub papers.

[-]Ben Pace, the Vacationing Vagabond5y140

What important truth do very few people in your community/network agree with you on?

[-]paulfchristiano5y120

Unfortunately (fortunately?) I don't feel like I have access to any secret truths. Most idiosyncratic things I believe are pretty tentative, and I hang out with a lot of folks who are pretty open to the kinds of weird ideas that might have ended up feeling like Paul-specific secret truths if I hung with a more normal crowd.

It feels like my biggest disagreement with people around me is something like: to what extent is it likely to be possible to develop an algorithm that really looks on paper like it should just work for aligning powerful ML systems. I'm at like 50-50 and I think that the consensus estimate of people in my community is more like "Uh, sure doesn't sound like that's going to happen, but we're still excited for you to try."

[-]Ben Pace, the Vacationing Vagabond5y140

Do you know what sorts of people you're looking to hire? How much do you expect ARC to grow over the coming years, and what will the employees be doing? I can imagine it being a fairly small group of like 3 researchers and a few understudies, I can also imagine it growing to 30 people like MIRI. Which one of these is it closer to?

I'd like to hire a few people (maybe 2 researchers median?) in 2021. I think my default "things are going pretty well" story involves doubling something like every 1-2 years for a while. Where that caps out / slows down a lot depends on how the field shapes out and how broad our activities are. I would be surprised if I wanted to stop growing at <10 people just based on the stuff I really know I want to do.

The very first hires will probably be people who want to work on the kind of theory I do, since right now that's what I'm feeling most excited about and really want to set up a team working on. I don't really know where that will end up going.

Once getting that going I'm not sure whether the next step will be growing it further or branching out into other things, and it will probably depend on how the theory work goes. I could also imagine doing enough theory on my own to change my view about how promising it is and make initial hires in another area instead.

[-]Mark Xu5y140

You've written multiple outer alignment failure stories. However, you've also commented that these aren't your best predictions. If you condition on humanity going extinct because of AI, why did it happen?

[-]paulfchristiano5y160

I think my best guess is kind of like this story, but:

People aren't even really deploying best practices.
ML systems generalize kind of pathologically over long time horizons, and so e.g. long-term predictions don't correctly reflect the probability of systemic collapse.
As a result there's no complicated "take over the sensors" moment; it's just that everything is going totally off the rails and everyone is yelling about it, but it just keeps gradually drifting off the rails.
Maybe the biggest distinction is that e.g. "watchdogs" can actually give pretty good arguments about why things are bad. In the story we fix all the things they can explain and are left only with the crazy hard core of human-incomprehensible problems, but in reality we will probably just fix the things that are pretty obvious and will be left with the hard core of problems that are still fairly obvious but not quite obvious enough that institutions can respond intelligently to them.

[-]Neel Nanda5y140

Pre-hindsight: 100 years from now, it is clear that your research has been net bad for the long-term future. What happened?

[-]paulfchristiano5y190

Some plausible and non-exhaustive options, in roughly descending order of plausibility:

I crowd out other people who would have done a better job of working on alignment (either by being better or just by being more). People feel like in order to be taken seriously they have to engage with Paul's writing and ideas and that's annoying. Or the space seems like a confused mess with sloppy standards in part because of my influence. Or more charitably maybe they are more likely to feel like it's "under control." Or maybe I claim ideas and make it harder for others to get credit even if they would have developed the ideas further or better (or even end up stealing the credit for others' ideas and disincentivizing them from entering the field).
I convincingly or at least socially-forcefully argue for conclusions that turn out to be wrong (and maybe I should have understood as wrong) and so everyone ends up wronger and makes mistakes that have a negative effect. I mean ex post I think this kind of thing is pretty likely in some important cases (if I'm 80-20 and convince people to update in my favor I still think there's a 20% chance that I pushed people in the wrong direction and across many

... (read more)

[-]paulfchristiano5y150

As an aside, I think that the possibility of "work doesn't matter" is typically way more important than "work was net bad," at least once you are making a serious effort to do something good rather than bad for the world (I agree that for the "average" project in the world the negative impacts are actually pretty large relative to the positive impacts).

EAs/rationalists often focus on the chance of a big downside clawing back value. I think that makes sense to think seriously about, and sometimes it's a big deal, but most of the time the quantitative estimates just don't seem to add up at all to me and I think people are making a huge quantitative error. I'm not sure exactly where we disagree, I think a lot of it is just that I'm way more skeptical about the ability to incidentally change the world a huge amount---I think that changing the world a lot usually just takes quite a bit of effort.

I guess in some sense I agree that the downside is big for normal butterfly-effect-y reasons (probably 50% of well-intentioned actions make the world worse ex post), so it's also possible that I'm just answering this question in a slightly different way.

My big caveat is that I think the numbers ... (read more)

3DanielFilan5y

I guess I feel like we're in a domain where some people were like "we have concretely-specifiable tasks, intelligence is good, what if we figured how to create artificial intelligence to do those tasks", which is the sort of thing that someone trying to do good for the world would do, but had some serious chance of being very bad for the world. So in that domain, it seems to me that we should keep our eyes out for things that might be really bad for the world, because all the things in that domain are kind of similar. That being said, I agree that the possibility that the work doesn't matter is more important once you're making a thoughtful effort to do good. But I see much more effort and thought going into trying to address that part, such that the occasional nudge to consider negative impacts seems appropriate to me.

4paulfchristiano5y

I think it's good to sometimes meditate on whether you are making the world worse (and get others' advice), and I'd more often recommend it for crowds other than EA and certainly wouldn't discourage people from doing it sometimes. I'm sympathetic to arguments that you should be super paranoid in domains like biosecurity since it honestly does seem asymmetrically easier to make things worse rather than better. But when people talk about it in the context of e.g. AI or policy interventions or gathering better knowledge about the world that might also have some negative side-effects, I often feel like there's little chance that predictable negative effects they are imagining loom large in the cost-benefit unless the whole thing is predictably pointless. Which isn't a reason not to consider those effects, just a push-back against the conclusion (and a heuristic push-back against the state of affairs where people are paralyzed by the possibility of negative consequences based on kind of tentative arguments). For advancing or deploying AI I generally have an attitude like "Even if actively trying to push the field forward full-time I'd be a small part of that effort, whereas I'm a much larger fraction of the stuff-that-we-would-be-sad-about-not-happening-if-the-field-went-faster, and I'm not trying to push the field forward," so while I'm on board with being particularly attentive to harms if you're in a field you think can easily cause massive harms, in this case I feel pretty comfortable about the expected cost-benefit unless alignment work isn't really helping much (in which case I have more important reasons not to work on it). I would feel differently about this if pushing AI faster was net bad on e.g. some common-sense perspective on which alignment was not very helpful, but I feel like I've engaged enough with those perspectives to be mostly not having it.

[-]Beth Barnes5y80

"Even if actively trying to push the field forward full-time I'd be a small part of that effort"

I think conditioning on something like 'we're broadly correct about AI safety' implies 'we're right about some important things about how AI development will go that the rest of the ML community is surprisingly wrong about'. In that world we're maybe able to contribute as much as a much larger fraction of the field, due to being correct about some things that everyone else is wrong about.

I think your overall point still stands, but it does seem like you sometimes overestimate how obvious things are to the rest of the ML community.

[-]DanielFilan5y130

What's the most important thing that AI alignment researchers have learned in the past 10 years? Also, that question but excluding things you came up with.

[-]paulfchristiano5y190

"Thing" is tricky. Maybe something like the set of intuitions and arguments we have around learned optimizers, i.e. the basic argument that ML will likely produce a system that is "trying" to do something, and that it can end up performing well on the training distribution regardless of what it is "trying" to do (and this is easier the more capable and knowledgeable it is). I don't think we really know much about what's going on here, but I do think it's an important failure to be aware of and at least folks are looking for it now. So I do think that if it happens we're likely to notice it earlier than we would if taking a purely experimentally-driven approach and it's possible that at the extreme you would just totally miss the phenomenon. (This may not be fair to put in the last 10 years, but thinking about it sure seemed like a mess >10 years ago.)

(I may be overlooking something such that I really regret that answer in 5 minutes but so it goes.)

[-]JohnMalin5y130

I wonder how valuable you find some of the more math/theory focused research directions in AI safety. I.e., how much less impactful do you find them, compared to your favorite directions? In particular,

Vanessa Kosoy's learning-theoretic agenda, e.g., the recent sequence on infra-Bayesianism, or her work on traps in RL. Michael Cohen's research, e.g. the paper on imitation learning seems to go into a similar direction.
The "causal incentives" agenda (link).
Work on agent foundations, such as on cartesian frames. You have commented on MIRI's research in the past, but maybe you have an updated view.

I'd also be interested in suggestions for other impactful research directions/areas that are more theoretical and less ML-focused (expanding on adamShimi's question, I wonder which part of mathematics and statistics you expect to be particularly useful).

I'm generally bad at communicating about this kind of thing, and it seems like a kind of sensitive topic to share half-baked thoughts on. In this AMA all of my thoughts are half-baked, and in some cases here I'm commenting on work that I'm not that familiar with. All that said I'm still going to answer but please read with a grain of salt and don't take it too seriously.

Vanessa Kosoy's learning-theoretic agenda, e.g., the recent sequence on infra-Bayesianism, or her work on traps in RL. Michael Cohen's research, e.g. the paper on imitation learning seems to go into a similar direction.

I like working on well-posed problems, and proving theorems about well-posed problems is particularly great.

I don't currently expect to be able to apply those kinds of algorithms directly to alignment for various reasons (e.g. no source of adequate reward function that doesn't go through epistemic competitiveness which would also solve other aspects of the problem, not practical to get exact imitation), so I'm mostly optimistic about learning something in the course of solving those problems that turns out to be helpful. I think that's plausible because these formal problems do engage some of the dif... (read more)

4RyanCarey5y

Thanks for these thoughts about the causal agenda. I basically agree with you on the facts, though I have a more favourable interpretation of how they bear on the potential of the causal incentives agenda. I've paraphrased the three bullet points, and responded in reverse order: 3) Many important incentives are not captured by the approach - e.g. sometimes an agent has an incentive to influence a variable, even if that variable does not cause reward attainment. -> Agreed. We're starting to study "side-effect incentives" (improved name pending), which have this property. We're still figuring out whether we should just care about the union of SE incentives and control incentives, or whether, or when, SE incentives should be considered less dangerous. Whether the causal style of incentive analysis captures much of what we care about, I think will be borne out by applying it and alternatives to a bunch of safety problems. 2) sometimes we need more specific quantities than just D affects A. -> Agreed. We've privately discussed directional quantities like "do(D=d) causes A=a" as being more safety-relevant, and are happy to hear other ideas. 1) eliminating all control-incentives seems unrealistic -> Strongly agree it's infeasible to remove CIs on all variables. My more modest goal would be to prove that for particular variables (or classes of variables) such as a shut down button, or a human's values, we can either: 1) prove how to remove control (+ side-effect) incentives, or 2) why this is impossible, given realistic assumptions. If (2), then that theoretical case could justify allocation of resources to learning-oriented approaches. Overall, I concede that we haven't engaged much on safety issues in the last year. Partly, it's that the projects have had to fit within people's PhDs. Which will also be true this year. But having some of the framework stuff behind us, we should still be able to study safety more, and gain a sense of how addressable concerns lik

2tom4everitt5y

This is what multi-agent incentives are for (i.e. incentive analysis in multi-agent CIDs). We're still working on these as there are a range of subtleties, but I'm pretty confident we'll have a good account of it.

[-]Ben Pace, the Vacationing Vagabond5y120

Do you have any specific plans for your life in a post-singularity world?

[-]paulfchristiano5y*170

Not really.

I expect that many humans will continue to participate in a process of collectively clarifying what we want and how to govern the universe. I wouldn't be surprised if that involves a lot of life-kind-of-like-normal that gradually improves in a cautious way we endorse rather than some kind of table-flip (e.g. I would honestly not be surprised if post-singularity we still end up raising another generation because there's no other form of "delegation" that we feel more confident about). And of course in such a world I expect to just continue to spend a lot of time thinking, again probably under conditions that are designed to be gradually improving rather than abruptly changing. The main weird thing is that this process will now be almost completely decoupled from productive economic activity.

I think it's hard to talk about "your life" and identity is likely to be fuzzy over the long term. I don't think that most of the richness and value in the world will come from creatures who feel like "us" (and I think our selfish desires are mostly relatively satiable). That said, I do also expect that basically all of the existing humans will have a future that they feel excited abou... (read more)

[-]Ben Pace, the Vacationing Vagabond5y120

What work are you most proud of?

Slightly different: what blog post are you most proud of?

[-]paulfchristiano5y140

I don't have an easy way of slicing my work up / think that it depends on how you slice it. Broadly I think the two candidates are (i) making RL from human feedback more practical and getting people excited about it at OpenAI, (ii) the theoretical sequence from approval-directed agents and informed oversight to iterated amplification to getting a clear picture of the limits of iterated amplification and setting out on my current research project. Some steps of that were really hard for me at the time though basically all of them now feel obvious.

My favorite blog post was probably approval-directed agents, though this is very much based on judging by the standards of how-confused-Paul-started-out. I think that it set me on a way better direction for thinking about AI safety (and I think it also helped a lot of people in a similar way). Ultimately it's clear that I didn't really understand where the difficulties were, and I've learned a lot in the last 6 years, but I'm still proud of it.

[-]DanielFilan5y110

How many ideas of the same size as "maybe a piecewise linear non-linearity would work better than a sigmoid for not having vanishing gradients" are we away from knowing how to build human-level AI technology?

[-]paulfchristiano5y140

I think it's >50% chance that ideas like ReLUs or soft attention are best thought of as multiplicative improvements on top of hardware progress (as are many other ideas like auxiliary objectives, objectives that better capture relevant tasks, infrastructure for training more efficiently, dense datasets, etc.), because the basic approach of "optimize for a task that requires cognitive competence" will eventually yield human-level competence. In that sense I think the answer is probably 0.

Maybe my median number of OOMs left before human-level intelligence, including both hardware and software progress, is 10 (pretty made-up). Of that I'd guess around half will come from hardware, so call it 5 OOMs of software progress. Don't know how big that is relative to ReLUs, maybe 5-10x? (But hard to define the counterfactual w.r.t. activation functions.)

(I think that may imply much shorter timelines than my normal view. That's mostly from thoughtlessness in this answer which was quickly composed and didn't take into account many sources of evidence, some is from legit correlations not taken into account here, some is maybe legitimate signal from an alternative estimation approach, not sure.)
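The made-up numbers above can be turned into a rough count of ReLU-sized ideas. This is only a back-of-the-envelope sketch: it assumes (not stated above) that such ideas stack multiplicatively, so an idea worth a gain of g contributes log10(g) OOMs, and `ideas_needed` is a hypothetical helper name.

```python
import math


def ideas_needed(software_ooms, gain_per_idea):
    """Number of multiplicative ideas, each worth `gain_per_idea`x,
    needed to cover `software_ooms` orders of magnitude."""
    return software_ooms / math.log10(gain_per_idea)


total_ooms = 10                  # median guess from the answer (pretty made-up)
software_ooms = total_ooms / 2   # "around half will come from hardware"

for gain in (5, 10):             # "maybe 5-10x" per ReLU-sized idea
    n = ideas_needed(software_ooms, gain)
    print(f"{gain}x per idea -> {n:.1f} ideas")
# prints about 7.2 ideas at 5x and 5.0 ideas at 10x
```

Under those assumptions the 5 OOMs of software progress correspond to something like 5 to 7 ReLU-sized ideas, which is a different question from how many such ideas are strictly necessary.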

3Daniel Kokotajlo5y

When you say hardware progress, do you just mean compute getting cheaper or do you include people spending more on compute? So you are saying, you guess that if we had 10 OOMs of compute today that would have a 50% chance of leading to human-level AI without any further software progress, but realistically you expect that what'll happen is we get +5 OOMs from increased spending and cheaper hardware, and then +5 "virtual OOMs" from better software?

[-]DanielFilan5y110

How many ideas of the same size as "maybe we could use inverse reinforcement learning to learn human values" are we away from knowing how to knowably and reliably build human-level AI technology that wouldn't cause something comparably bad as human extinction?

A lot of this is going to come down to estimates of the denominator.

(I mostly just think that you might as well just ask people "Is this good?" rather than trying to use a more sophisticated form of IRL---in particular I don't think that realistic versions of IRL will successfully address the cases where people err in answering the "is it good?" question, that directly asking is more straightforward in many important ways, and that we should mostly just try to directly empower people to give better answers to such questions.)

Anyway, with that caveat and kind of using the version of your idea that I feel most enthusiastic about (and construing it quite broadly), I have a significant probability on 0, maybe a median somewhere in 10-20, significant probability at very high levels.

[-]DanielFilan5y110

What is the most common wrong research-relevant intuition among AI alignment researchers?

[-]Ben Pace, the Vacationing Vagabond5y110

What was your biggest update about the world from living through the coronavirus pandemic?

Follow-up: does it change any of your feelings about how civilization will handle AGI?

[-]paulfchristiano5y*150

I found our COVID response pretty "par for the course" in terms of how well we handle novel challenges. That was a significant negative update for me because I had a moderate probability on us collectively pulling out some more exceptional adaptiveness/competence when an issue was imposing massive economic costs and had a bunch of people's attention on it. I now have somewhat more probability on AI dooms that play out slowly where everyone is watching and yelling loudly about it but it's just really tough to do something that really improves the situation (and correspondingly more total probability on doom). I haven't really sat down and processed this update or reflected on exactly how big it should be.

[-]Neel Nanda5y110

Do you have any advice for junior alignment researchers? In particular, what do you think are the skills and traits that make someone an excellent alignment researcher? And what do you think someone can do early in a research career to be more likely to become an excellent alignment researcher?

Some things that seem good:

Acquire background in relevant adjacent areas---especially a reasonably deep understanding of ML, but then also a broader+shallower background in more distant areas like algorithms, economics, learning theory, and some familiarity with what kinds of intellectual practices work well in other fields.
Build some basic research skills, especially (i) applied work in ML (e.g. be able to implement ML algorithms and run experiments, hopefully getting some kind of mentorship or guidance but you can also do a lot independently), (ii) academic research in any vaguely relevant area. I think it's good to have e.g. actually proven a few things, designed algorithms for a few problems, beaten your head against a few problems and then figured out how to make them work.
Think a bunch about alignment. It feels like there is really just not much relevant stuff that's publicly written so you might as well read basically all of it and try to come up with views on the core questions yourself.

I personally feel like I got a lot of benefit out of doing some research in adjacent areas, but I'd guess that mostly it's better to focus on what you actually want to achieve and just be a ... (read more)

[-]Neel Nanda5y110

What are the highest priority things (by your lights) in Alignment that nobody is currently seriously working on?

It's not clear how to slice the space up into pieces so that you can talk about "is someone working on this piece?" (and the answer depends a lot on that slicing). Here are two areas in robustness that feel kind of empty for my preferred way of slicing up the problem (though for a different slicing they could be reasonably crowded). These are also necessarily areas where I'm not doing any work so I'm really out on a limb here.

I think there should be more theoretical work on neural net verification / relaxing adversarial training. I should probably update from this to think that it's more of a dead end (and indeed practical verification work does seem to have run into a lot of trouble), but to me it looks like there's got to be more you can say at least to show that various possible approaches are dead ends. I think a big problem is that you really need to keep the application in mind in order to actually know the rules of the game. (That is, we have a predicate A, say implemented as a neural network, and we want to learn a function f such that for all x we have A(x, f(x)), but the problem is only supposed to be possible because in some sense the predicate A is "easy" to satisfy... (read more)

[-]adamShimi5y110

Copying my question from your post about your new research center (because I'm really interested in the answer): which part (if any) of theoretical computer science do you expect to be particularly useful for alignment?

5paulfchristiano5y

Learning theory definitely seems most relevant. Methodologically I think any domain where you are designing and analyzing algorithms, especially working with fuzzy definitions or formalizing intuitive problems, is also useful practice though much less bang for your buck (especially if just learning about it rather than doing research in it). That theme cuts a bunch across domains, though I think cryptography, online algorithms, and algorithmic game theory are particularly good.

Going to start now. I vaguely hope to write something for all of the questions that have been asked so far but we'll see (80 questions is quite a few).

[-]Neel Nanda5y100

What is your theory of change for the Alignment Research Center? That is, what are the concrete pathways by which you expect the work done there to systematically lead to a better future?

For the initial projects, the plan is to find algorithmic ideas (or ideally a whole algorithm) that works well in practice, can be adopted by labs today, and would put us in a way better position with respect to future alignment challenges. If we succeed in that project, then I'm reasonably optimistic about being able to demonstrate the value of our ideas and get them adopted in practice (by a combination of describing them publicly, talking with people at labs, advising people who are trying to pressure labs to take alignment seriously about what their asks should be, and consulting for labs to help implement ideas). Even if adoption or demonstrating desirability turns out to be hard, I think that the alignment community would be in a much better place if we had a proposal that we all felt good about that we were advocating for (since we'd then have a better shot at doing so, and labs that were serious about alignment would be able to figure out what to do).

Beyond that, I'm also excited about offering concrete and well-justified advice (either about what algorithms to use or about alignment-relevant deployment decisions) that can help labs who care about alignment, or can be taken as a clear indicator of best practices and so be adopted by labs who want to present as socially-responsible (whether to please employees, funders, civil society, or competitors).

But I'm mostly thinking about the impact of initial activities, and for that I feel like the theory of change is relatively concrete/straightforward.

If you could magically move most of the US rationality and x-risk and EA community to a city in the US that isn't the Bay, and you had to pick somewhere, where would you move them to?

If I'm allowed to think about it first then I'd do that. If I'm not, then I'd regret never having thought about it, probably Seattle would be my best guess.

2Ben Pace, the Vacationing Vagabond5y

Huh, am surprised. Guess I might’ve predicted Boston. Curious if it’s because of the culture, the environment, or what.

3paulfchristiano5y

Don't read too much into it. I do dislike Boston weather.

And on an absolute level, is the world much more or less prepared for AGI than it was 15 years ago?

Follow-up: How much did the broader x-risk community change it at all?

4paulfchristiano5y

I think much better. [...] I don't really know / tough to answer. Certainly there's a lot more people talking about the problem, it's hard to know how much that comes from x-risk community or from vague concerns about AI in the world (my guess is big parts of both). I think we are in a better place with respect to knowledge of technical alignment---we know a fair bit about what the possible approaches are and have taken a lot of positive steps. There is a counterfactual where alignment isn't even really recognized as a distinct problem and is just lumped in with vague concerns about safety, which would be significantly worse in terms of our ability to work productively on the problem (though I'd love if we were further away from that world).

[-]DanielFilan5y90

How many hours per week should the average AI alignment researcher spend on improving their rationality? How should they spend those hours?

[-]Ben Pace, the Vacationing Vagabond5y70

I probably wouldn't set aside hours for improving rationality (/ am not exactly sure what it would entail). Seems generally good to go out of your way to do things right, to reflect on lessons learned from the things you did, to be willing to do (and slightly overinvest in) things that are currently hard in order to get better, and so on. Maybe I'd say that like 5-10% of time should be explicitly set aside for activities that just don't really move you forward (like post-mortems or reflecting on how things are going in a way that's clearly not going to pay itself off for this project) and a further 10-20% on doing things in ways that aren't the very optimal way right now but useful for getting better at doing them in the future (e.g. using unfamiliar tools, getting more advice from people than would make sense if the world ended next week, being more methodical about how you approach problems).

I guess the other aspect of this is separating some kind of general improvement from more domain specific improvement (i.e. are the numbers above about improving rationality or just getting better at doing stuff?). I think stuff that feels vaguely like "rationality" in the sense of being abou... (read more)

I want to know this question, but for the ‘peak’ alignment researcher.

3paulfchristiano5y

My answer isn't sensitive to things like "how good are you at research" (I didn't even express the sensitivity to "how much do you like reflecting" or "how old are you" which I think are more important). I guess probably the first order thing is that the 'peak' alignment researcher is more likely to be older and closer to death so investing somewhat less in getting better at things. (But the world changes and lives are long so I'm not sure it's a huge deal.)

I'm not interested in the strongest argument from your perspective (i.e. the steelman), but I am interested how much you think you can pass the ITT for Eliezer's perspective on the alignment problem — what shape the problem is, why it's hard, and how to make progress. Can you give a sense of the parts of his ITT you think you've got?

I think I could do pretty well (it's plausible to me that I'm the favorite in any head-to-head match with someone who isn't a current MIRI employee? probably not but I'm at least close). There are definitely some places I still get surprised and don't expect to do that well, e.g. I was recently surprised by one of Eliezer's positions regarding the relative difficulty of some kinds of reasoning tasks for near-future language models (and I expect there are similar surprises in domains that are less close to near-term predictions). I don't really know how to split it into parts for the purpose of saying what I've got or not.

Did you get much from reading the sequences? What was one of the things you found most interesting or valuable personally in them?

I enjoyed Leave a Line of Retreat. It's a very concrete and simple procedure that I actually still use pretty often and I've benefited a lot just from knowing about. Other than that I think I found a bunch of the posts interesting and entertaining. (Looking back now the post is a bit bombastic, I suspect all the sequences are, but I don't really mind.)

[-]Daniel Kokotajlo5y90

1. What credence would you assign to "+12 OOMs of compute would be enough for us to achieve AGI / TAI / AI-induced Point of No Return within five years or so"? (This is basically the same as, though not identical to, this poll question.)

2. Can you say a bit about where your number comes from? E.g. maybe 25% chance of scaling laws not continuing such that OmegaStar, Amp(GPT-7), etc. don't work, 25% chance that they happen but don't count as AGI / TAI / AI-PONR, for total of about 60%? The more you say the better, this is my biggest crux! Thanks!

I'd say 70% for TAI in 5 years if you gave +12 OOM.

I think the single biggest uncertainty is about whether we will be able to adapt sufficiently quickly to the new larger compute budgets (i.e. how much do we need to change algorithms to scale reasonably? it's a very unusual situation and it's hard to scale up fast and depends on exactly how far that goes). Maybe I think that there's a 90% chance that TAI is in some sense possible (maybe: if you'd gotten to that much compute while remaining as well-adapted as we are now to our current levels of compute) and conditioned on that an 80% chance that we'll actually do it vs running into problems?

(Didn't think about it too much, don't hold me to it too much. Also I'm not exactly sure what your counterfactual is and didn't read the original post in detail, I was just assuming that all existing and future hardware got 12OOM faster. If I gave numbers somewhere else that imply much less than that probability with +12OOM, then you should be skeptical of both.)
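The arithmetic behind these rough numbers can be checked with a small sketch. It assumes each estimate is a simple chain of conditional probabilities multiplied together; the helper name `chain` is made up for illustration, and reading the question's "25% / 25%" example as two sequential failure modes is one interpretation, not something either commenter stated.

```python
def chain(*conditional_probs):
    """Multiply a chain of conditional probabilities (hypothetical helper)."""
    result = 1.0
    for p in conditional_probs:
        result *= p
    return result

# Answer's decomposition: P(TAI is possible at all) * P(we actually do it | possible).
answer_estimate = chain(0.90, 0.80)

# One reading of the question's example (an assumption): a 25% chance that
# scaling laws break, then a 25% chance that the results do not count as TAI.
question_example = chain(1 - 0.25, 1 - 0.25)

print(f"answer decomposition: {answer_estimate:.2f}")
print(f"question example:     {question_example:.4f}")
```

Under this reading, the product of 90% and 80% lands close to the stated 70%, and dropping the "can we adapt in time" factor leaves the 90% figure on its own.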

3Daniel Kokotajlo5y

My counterfactual attempts to get at the question "Holding ideas constant, how much would we need to increase compute until we'd have enough to build TAI/AGI/etc. in a few years?" This is (I think) what Ajeya is talking about with her timelines framework. Her median is +12 OOMs. I think +12 OOMs is much more than 50% likely to be enough; I think it's more like 80% and that's after having talked to a bunch of skeptics, attempted to account for unknown unknowns, etc. She mentioned to me that 80% seems plausible to her too but that she's trying to adjust downwards to account for biases, unknown unknowns, etc. Given that, am I right in thinking that your answer is really close to 90%, since failure-to-achieve-TAI/AGI/etc-due-to-being-unable-to-adapt-quickly-to-magically-increased-compute "shouldn't count" for purposes of this thought experiment?

(I don't think Amp(GPT-7) will work though.)

2Daniel Kokotajlo5y

I'm very glad to hear that! Can you say more about why?

Natural language has both noise (that you can never model) and signal (that you could model if you were just smart enough). GPT-3 is in the regime where it's mostly signal (as evidenced by the fact that the loss keeps going down smoothly rather than approaching an asymptote). But it will soon get to the regime where there is a lot of noise, and by the time the model is 9 OOMs bigger I would guess (based on theory) that it will be overwhelmingly noise and training will be very expensive.

So it may or may not work in the sense of meeting some absolute performance threshold, but it will certainly be a very bad way to get there and we'll do something smarter instead.
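A toy illustration of the noise-versus-signal point, assuming the loss splits into a fixed irreducible part (noise) plus a reducible part (signal) that falls as a power law in scale. Every constant and name below is invented for illustration and is not fitted to GPT-3 or to any published scaling law.

```python
def remaining_loss(ooms_of_scale, irreducible=0.5, reducible_now=1.0, decay_per_oom=0.3):
    """Return (total_loss, reducible_part) after scaling up by `ooms_of_scale`.

    All parameters are made-up illustrative values:
      irreducible   -- noise floor that no model can remove
      reducible_now -- modelable loss remaining at today's scale
      decay_per_oom -- power-law exponent, applied per order of magnitude
    """
    reducible = reducible_now * 10 ** (-decay_per_oom * ooms_of_scale)
    return irreducible + reducible, reducible


def signal_fraction(ooms_of_scale):
    """Share of the remaining loss that is still learnable signal."""
    total, reducible = remaining_loss(ooms_of_scale)
    return reducible / total


for ooms in (0, 3, 6, 9):
    print(f"+{ooms} OOMs: signal fraction of remaining loss = {signal_fraction(ooms):.4f}")
```

With these invented constants the remaining loss starts out mostly signal and is almost entirely noise by +9 OOMs, which is the qualitative shape of the argument; the actual crossover point depends on numbers this sketch does not know.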

3Daniel Kokotajlo5y

Hmm, I don't count "It may work but we'll do something smarter instead" as "it won't work" for my purposes. I totally agree that noise will start to dominate eventually... but the thing I'm especially interested in with Amp(GPT-7) is not the "7" part but the "Amp" part. Using prompt programming, fine-tuning on its own library, fine-tuning with RL, making chinese-room-bureaucracies, training/evolving those bureaucracies... what do you think about that? Naively the scaling laws would predict that we'd need far less long-horizon data to train them, since they have far fewer parameters, right? Moreover IMO evolved-chinese-room-bureaucracy is a pretty good model for how humans work, and in particular for how humans are able to generalize super well and make long-term plans etc. without many lifetimes of long-horizon training.

[-]Neel Nanda5y90

You seem in the unusual position of having done excellent conceptual alignment work (eg with IDA), and excellent applied alignment work at OpenAI, which I'd expect to be pretty different skillsets. How did you end up doing both? And how useful have you found ML experience for doing good conceptual work, and vice versa?

Aw thanks :) I mostly trained as a theorist through undergrad, then when I started grad school I spent some time learning about ML and decided to do applied work at OpenAI. I feel like the methodologies are quite different but the underlying skills aren't that different. Maybe the biggest deltas are that ML involves much more management of attention and jumping between things in order to be effective in practice, while theory is a bit more loaded on focusing on one line of reasoning for a long time and having some clever idea. But while those are important skills I don't think they are the main things that you improve at by working in either area and aren't really core.

I feel like in general there is a lot of transfer between doing well in different research areas, though unsurprisingly it's less than 100% and I think I would be better at either domain if I'd just focused on it more. The main exception is that I feel like I'm a lot better at grounding out theory that is about ML, since I've had more experience and have more of a sense for what kinds of assumptions are reasonable in practice. And on the flip side I do think theory is similar to a lot of algorithm design/analysis questions that come up in ML (frankly it doesn't seem like a central skill but I think there are big logistical benefits from being able to do the whole pipeline as one person).

[-]DanielFilan5y80

What's your favourite mathematical object? What's your least favourite mathematical object?

[-]paulfchristiano5y40

Favorite: Irit Dinur's PCP for constraint satisfaction. What a proof system.

If you want to be more pure, and consider the mathematical objects that are found rather than built, maybe the monster group? (As a layperson I can't appreciate the full extent of what's going on, and like most people I only really know about it second-hand, but its existence seems like a crazy and beautiful fact about the world.)

Least favorite: I don't know, maybe Chaitin's constant?

4paulfchristiano5y

I take it back, Chaitin's constant is more cool than I thought. I don't like the cardinal ℵ₁ very much, but I like 2^ℵ₀ just fine so it's not really clear if it's a problem with the object or the reference.

2DanielFilan5y

What changed your mind about Chaitin's constant?

3paulfchristiano5y

I hadn't appreciated how hard and special it is to be algorithmically random.

[-]DanielFilan5y80

Should marginal CHAI PhD graduates who are dispositionally indifferent between the two options try to become a professor or do research outside of universities?

[-]Ben Pace, the Vacationing Vagabond5y80

Not sure. If you don't want to train students, seems to me like you should be outside of a university. If you do want to train students it's less clear and maybe depends on what you want to do (and given that students vary in what they are looking for, this is probably locally self-correcting if too many people go one way or the other). I'd certainly lean away from university for the kinds of work that I want to do, or for the kinds of things that involve aligning large ML systems (which benefit from some connection to customers and resources).

What are the main ways you've become stronger and smarter over the past 5 years? This isn't a question about new object-level beliefs so much as ways-of-thinking or approaches to the world that have changed for you.

3paulfchristiano5y

I'm changing a lot less with every successive 5-year interval. The last 5 years was the end of grad school and my time at OpenAI. I certainly learned a lot about how to make ML work in practice (start small, prioritize simple cases where you can debug, isolate assumptions). Then I learned a lot about how to run a team. I've gotten better at talking to people and writing and being broadly functional (making up for some lost time when I was younger and focused on math instead). I don't think there's any simple slogan for new ways-of-thinking or changed approaches to the world. Mostly just seems like a ton of little stuff. I think earlier phases of my life were more likely to be a shift in an easily described direction, but this time it's been more a messy mix---I became more arrogant in some ways and more humble in others, more optimistic in some ways and more pessimistic in others, more inclined to trust on-paper reasoning in some ways and less in others, etc.

[-]DanielFilan5y70

What's the largest cardinal whose existence you feel comfortable with assuming as an axiom?

[-]paulfchristiano5y50

I'm pretty comfortable working with strong axioms. But in terms of "would actually blow my mind if it turned out not to be consistent," I guess alpha-inaccessible cardinals for any concrete alpha? Beyond that I don't really know enough set theory to have my mind blown.

[-]Ben Pace, the Vacationing Vagabond5y70

Why did nobody in the world run challenge trials for the covid vaccine and save us a year of economic damage?