All of DanielFilan's Comments + Replies

Emergent modularity and safety

It's true! Altho I think of putting something up on arXiv as a somewhat lower bar than 'publication' - that paper has a bit of work left.

Welcome & FAQ!

I really like the art!

Finite Factored Sets: Orthogonality and Time

OK I think this is a typo, from the proof of prop 10 where you deal with condition 5:

Thus .

I think this should be χFC(x,s) ⊆ x.

Scott Garrabrant (4mo): Fixed, thanks.
Finite Factored Sets: Orthogonality and Time

From def 16:

... if for all

Should I take this to mean "if for all and "?

[EDIT: no, I shouldn't, since and are both subsets of ]

DanielFilan (4mo): OK I think this is a typo, from the proof of prop 10 where you deal with condition 5: I think this should be χFC(x,s) ⊆ x.
A simple example of conditional orthogonality in finite factored sets

Seems right. I still think it's funky that X_1 and X_2 are conditionally non-orthogonal even when the range of the variables is unbounded.

Scott Garrabrant (4mo): Yeah, this is the point that orthogonality is a stronger notion than just all values being mutually compatible. Any x1 and x2 values are mutually compatible, but we don't call them orthogonal. This is similar to how we don't want to say that x1 and (the level sets of) x1+x2 are compatible.

The coordinate system has a collection of surgeries: you can take a point and change the x1 value without changing the other values. When you condition on E, that surgery is no longer well defined. However, the surgery of only changing the x4 value is still well defined, and the surgery of changing x1, x2, and x3 simultaneously is still well defined (provided you change them to something compatible with E). We could define a surgery that says that when you increase x1, you decrease x2 by the same amount, but that is a new surgery that we invented, not one that comes from the original coordinate system.
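To make the surgery picture concrete, here is a toy sketch (my own illustration, with a hypothetical conditioning event E = {x1 + x2 = 0} rather than the example from the post): starting from a point of E, changing x1 alone can take you out of E, while changing x4 alone never does.

```python
# Toy illustration of the "surgery" picture above (my own example; the
# conditioning event E = {x1 + x2 = 0} is hypothetical, not from the post).
from itertools import product

VALUES = range(-2, 3)  # small finite stand-in for the variables' range
points = list(product(VALUES, repeat=4))  # points are tuples (x1, x2, x3, x4)

def in_E(p):
    """Hypothetical conditioning event: x1 + x2 == 0."""
    x1, x2, _, _ = p
    return x1 + x2 == 0

def surgery_stays_in_E(coord):
    """Does changing only coordinate `coord` always keep a point of E inside E?"""
    for p in points:
        if not in_E(p):
            continue
        for v in VALUES:
            q = list(p)
            q[coord] = v
            if not in_E(tuple(q)):
                return False
    return True

print("changing x1 alone stays in E:", surgery_stays_in_E(0))  # False
print("changing x4 alone stays in E:", surgery_stays_in_E(3))  # True
```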
AXRP Episode 9 - Finite Factored Sets with Scott Garrabrant

I'm glad to hear that the podcast is useful for people :)

Knowledge is not just mutual information

Seems like the solution should perhaps be to take 'the system' to be only the 'controllable' physical variables, or those variables that are relevant for 'consequential' behaviour? Hopefully, if one can provide good definitions for these, that will give a foundation for saying what the abstractions should be that let us distinguish between 'high-level' and 'low-level' behaviour.

Challenge: know everything that the best go bot knows about go

Ah, understood. I think this is basically covered by talking about what the go bot knows at various points in time, a la this comment - it seems pretty sensible to me to talk about knowledge as a property of the actual computation rather than the algorithm as a whole. But from your response there it seems that you think that this sense isn't really well-defined.

Richard Ngo (5mo): I'm not sure what you mean by "actual computation rather than the algorithm as a whole". I thought that I was talking about the knowledge of the trained model which actually does the "computation" of which move to play, and you were talking about the knowledge of the algorithm as a whole (i.e. the trained model plus the optimising bot).
Challenge: know everything that the best go bot knows about go

Actually, hmm. My thoughts are not really in equilibrium here.

AXRP Episode 7 - Side Effects with Victoria Krakovna

Not sure what the actual sentence you wanted to write was. "are not absolutely necessary" maybe?

You're quite right, let me fix that.

DanielFilan (5mo): And also thanks for your kind words :)
Challenge: know everything that the best go bot knows about go

(Also: such a rewrite would be a combination of 'what I really meant' and 'what the comments made me realize I should have really meant')

Challenge: know everything that the best go bot knows about go

OK, the parenthetical helped me understand where you're coming from. I think a re-write of this post should (in part) make clear that I think a massive heroic effort would be necessary to make this happen, but sometimes massive heroic efforts work, and I have no special private info that makes it seem more plausible than it looks a priori.

DanielFilan (5mo): Actually, hmm. My thoughts are not really in equilibrium here.
DanielFilan (5mo): (Also: such a rewrite would be a combination of 'what I really meant' and 'what the comments made me realize I should have really meant')
Challenge: know everything that the best go bot knows about go

In the parent, is your objection that the trained AlphaZero-like model plausibly knows nothing at all?

Richard Ngo (5mo): The trained AlphaZero model knows lots of things about Go, in a comparable way to how a dog knows lots of things about running. But the algorithm that gives rise to that model can know arbitrarily few things. (After all, the laws of physics gave rise to us, but they know nothing at all.)
Challenge: know everything that the best go bot knows about go

Suppose you have a computer program that gets two neural networks, simulates a game of go between them, determines the winner, and uses the outcome to modify the neural networks. It seems to me that this program has a model of the 'go world', i.e. a simulator, and from that model you can fairly easily extract the rules and winning condition. Do you think that this is a model but not a mental model, or that it's too exact to count as a model, or something else?
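To make the shape of the program concrete, here is a toy sketch (my own stand-in: a trivial take-the-last-stick game instead of go, table-based "networks", and a made-up update rule). The point is just that the rules and win condition live inside simulate(), i.e. inside the program's model of the game world.

```python
# Toy stand-in for the program described above: a simulator for a trivial
# game (take 1-3 sticks, taking the last stick wins) instead of go, two
# table-based "networks", and a made-up update rule. The rules and the win
# condition are encoded in simulate(), i.e. in the program's model of the game.
import random

PILE = 21  # starting number of sticks (arbitrary)

def random_policy():
    # For each possible pile size, a preferred number of sticks to take.
    return {n: random.randint(1, 3) for n in range(1, PILE + 1)}

def simulate(policy_a, policy_b):
    """Play one game and return the index (0 or 1) of the winner."""
    pile, player = PILE, 0
    policies = [policy_a, policy_b]
    while True:
        take = min(policies[player][pile], pile)  # rule: take 1-3 sticks
        pile -= take
        if pile == 0:
            return player  # win condition: whoever takes the last stick wins
        player = 1 - player

def update(winner_policy, loser_policy):
    """Made-up learning rule: the loser copies the winner, with one mutation."""
    new = dict(winner_policy)
    n = random.randint(1, PILE)
    new[n] = random.randint(1, 3)
    return new

players = [random_policy(), random_policy()]
for _ in range(1000):
    w = simulate(players[0], players[1])
    players[1 - w] = update(players[w], players[1 - w])
```

From simulate() you can read off the legal moves and the winning condition directly, which is the sense in which the training program 'has a model of the go world', whatever we decide to say about whether it knows them.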

Richard Ngo (5mo): I'd say that this is too simple and programmatic to be usefully described as a mental model. The amount of structure encoded in the computer program you describe is very small, compared with the amount of structure encoded in the neural networks themselves. (I agree that you can have arbitrarily simple models of very simple phenomena, but those aren't the types of models I'm interested in here. I care about models which have some level of flexibility and generality, otherwise you can come up with dumb counterexamples like rocks "knowing" the laws of physics.)

As another analogy: would you say that the quicksort algorithm "knows" how to sort lists? I wouldn't, because you can instead just say that the quicksort algorithm sorts lists, which conveys more information (because it avoids anthropomorphic implications). Similarly, the program you describe builds networks that are good at Go, and does so by making use of the rules of Go, but can't do the sort of additional processing with respect to those rules which would make me want to talk about its knowledge of Go.
Challenge: know everything that the best go bot knows about go

I think there's some communication failure where people are very skeptical of this for reasons that they think are obvious given what they're saying, but which are not obvious to me. Can people tell me which subset of the below claims they agree with, if any? Also if you come up with slight variants that you agree with that would be appreciated.

  1. It is approximately impossible to succeed at this challenge.
  2. It is possible to be confident that advanced AGI systems will not pose an existential threat without being able to succeed at this challenge.
  3. It is not
...
Adam Shimi (5mo): My take is:
  • I think making this post was a good idea. I'm personally interested in deconfusing the topic of universality (which should basically capture what "learning everything the model knows" means), and you brought up a good "simple" example to try to build intuition on.
  • What I would call your mistake is mostly 8, but a bit of the related ones (so 3 and 4?). Phrasing it as "can we do that" is a mistake in my opinion because the topic is very confused (as shown by the comments). On the other hand, I think asking the question of what it would mean is a very exciting problem. It also gives a more concrete form to the problem of deconfusing universality, which is important AFAIK to Paul's approaches to alignment.
Challenge: know everything that the best go bot knows about go

I'd also be happy with an inexact description of what the bot will do in response to specified strategies that captured all the relevant details.

Challenge: know everything that the best go bot knows about go

I think that it isn't clear what constitutes "fully understanding" an algorithm.

That seems right.

Another obstacle to full understanding is memory. Suppose your go bot has memorized a huge list of "if you are in such and such situation move here" type rules.

I think there's reason to believe that SGD doesn't do exactly this (nets that memorize random data have different learning curves than normal nets iirc?), and better reason to think it's possible to train a top go bot that doesn't do this.

There is not in general a way to compute what an algorith

...
DanielFilan (5mo): I'd also be happy with an inexact description of what the bot will do in response to specified strategies that captured all the relevant details.
Challenge: know everything that the best go bot knows about go

Hmmm. It does seem like I should probably rewrite this post. But to clarify things in the meantime:

  • it's not obvious to me that this is a realistic target, and I'd be surprised if it took fewer than 10 person-years to achieve.
  • I do think the knowledge should 'cover' all the athlete's ingrained instincts in your example, but I think the propositions are allowed to look like "it's a good idea to do x in case y".
Richard Ngo (5mo): Perhaps I should instead have said: it'd be good to explain to people why this might be a useful/realistic target. Because if you need propositions that cover all the instincts, then it seems like you're basically asking for people to revive GOFAI. (I'm being unusually critical of your post because it seems that a number of safety research agendas lately have become very reliant on highly optimistic expectations about progress on interpretability, so I want to make sure that people are forced to defend that assumption rather than starting an information cascade.)
Challenge: know everything that the best go bot knows about go

On that definition, how does one train an AlphaZero-like algorithm without knowing the rules of the game and win condition?

Richard Ngo (5mo): The human knows the rules and the win condition. The optimisation algorithm doesn't, for the same reason that evolution doesn't "know" what dying is: neither are the types of entities to which you should ascribe knowledge.
Challenge: know everything that the best go bot knows about go

Perhaps the bot knows different things at different times and your job is to figure out (a) what it always knows and (b) a way to quickly find out everything it knows at a certain point in time.

I think at this point you've pushed the word "know" to a point where it's not very well-defined; I'd encourage you to try to restate the original post while tabooing that word.

This seems particularly valuable because there are some versions of "know" for which the goal of knowing everything a complex model knows seems wildly unmanageable (for example, trying to convert a human athlete's ingrained instincts into a set of propositions). So before people start trying to do what you suggested, it'd be good to explain why it's actually a realistic target.

Challenge: know everything that the best go bot knows about go

Also it certainly knows the rules of go and the win condition.

Richard Ngo (5mo): As an additional reason for the importance of tabooing "know", note that I disagree with all three of your claims about what the model "knows" in this comment and its parent. (The definition of "know" I'm using is something like "knowing X means possessing a mental model which corresponds fairly well to reality, from which X can be fairly easily extracted".)
Challenge: know everything that the best go bot knows about go

But once you let it do more computation, then it doesn't have to know anything at all, right? Like, maybe the best go bot is, "Train an AlphaZero-like algorithm for a million years, and then use it to play."

I would say that bot knows what the trained AlphaZero-like model knows.

DanielFilan (5mo): Also it certainly knows the rules of go and the win condition.
Challenge: know everything that the best go bot knows about go

Maybe it nearly suffices to get a go professional to know everything about go that the bot does? I bet they could.

Adam Shimi (5mo): What does that mean, though? If you give the go professional a massive transcript of the bot's knowledge, it's probably unusable. I think what the go professional gives you is the knowledge of where to look/what to ask for/what to search.
Challenge: know everything that the best go bot knows about go

[D]oes understanding the go bot in your sense imply that you could play an even game against it?

I imagine so. One complication is that it can do more computation than you.

ESRogs (5mo): But once you let it do more computation, then it doesn't have to know anything at all, right? Like, maybe the best go bot is, "Train an AlphaZero-like algorithm for a million years, and then use it to play." I know more about go than that bot starts out knowing, but less than it will know after it does computation. I wonder if, when you use the word "know", you mean some kind of distilled, compressed, easily explained knowledge?
Challenge: know everything that the best go bot knows about go

You could plausibly play an even game against a go bot without knowing everything it knows.

weathersystems (5mo): Sure. But the question is can you know everything it knows and not be as good as it? That is, does understanding the go bot in your sense imply that you could play an even game against it?
Mundane solutions to exotic problems

FYI: I would find it useful if you said somewhere what 'epistemic competitiveness' means and linked to it when using the term.

Adam Shimi (6mo): I assume the right pointer is ascription universality [https://ai-alignment.com/towards-formalizing-universality-409ab893a456].
AMA: Paul Christiano, alignment researcher

I guess I feel like we're in a domain where some people were like "we have concretely-specifiable tasks, intelligence is good, what if we figured out how to create artificial intelligence to do those tasks", which is the sort of thing that someone trying to do good for the world would do, but had some serious chance of being very bad for the world. So in that domain, it seems to me that we should keep our eyes out for things that might be really bad for the world, because all the things in that domain are kind of similar.

That being said, I agree that the possi...

Paul Christiano (6mo): I think it's good to sometimes meditate on whether you are making the world worse (and get others' advice), and I'd more often recommend it for crowds other than EA and certainly wouldn't discourage people from doing it sometimes. I'm sympathetic to arguments that you should be super paranoid in domains like biosecurity since it honestly does seem asymmetrically easier to make things worse rather than better. But when people talk about it in the context of e.g. AI or policy interventions or gathering better knowledge about the world that might also have some negative side-effects, I often feel like there's little chance that the predictable negative effects they are imagining loom large in the cost-benefit unless the whole thing is predictably pointless. Which isn't a reason not to consider those effects, just a push-back against the conclusion (and a heuristic push-back against the state of affairs where people are paralyzed by the possibility of negative consequences based on kind of tentative arguments).

For advancing or deploying AI I generally have an attitude like "Even if actively trying to push the field forward full-time I'd be a small part of that effort, whereas I'm a much larger fraction of the stuff-that-we-would-be-sad-about-not-happening-if-the-field-went-faster, and I'm not trying to push the field forward," so while I'm on board with being particularly attentive to harms if you're in a field you think can easily cause massive harms, in this case I feel pretty comfortable about the expected cost-benefit unless alignment work isn't really helping much (in which case I have more important reasons not to work on it). I would feel differently about this if pushing AI faster was net bad on e.g. some common-sense perspective on which alignment was not very helpful, but I feel like I've engaged enough with those perspectives to be mostly not having it.
AMA: Paul Christiano, alignment researcher

What's the largest cardinal whose existence you feel comfortable with assuming as an axiom?

Paul Christiano (6mo): I'm pretty comfortable working with strong axioms. But in terms of "would actually blow my mind if it turned out not to be consistent," I guess alpha-inaccessible cardinals for any concrete alpha? Beyond that I don't really know enough set theory to have my mind blown.
AMA: Paul Christiano, alignment researcher

How many hours per week should the average AI alignment researcher spend on improving their rationality? How should they spend those hours?

I probably wouldn't set aside hours for improving rationality (/ am not exactly sure what it would entail). Seems generally good to go out of your way to do things right, to reflect on lessons learned from the things you did, to be willing to do (and slightly overinvest in) things that are currently hard in order to get better, and so on. Maybe I'd say that like 5-10% of time should be explicitly set aside for activities that just don't really move you forward (like post-mortems or reflecting on how things are going in a way that's clearly not going to pay...

I'd like to know the answer to this question, but for the 'peak' alignment researcher.

AMA: Paul Christiano, alignment researcher

What's the optimal ratio of researchers to support staff in an AI alignment research organization?

Paul Christiano (6mo): I guess it depends a lot on what the organization is doing and how exactly we classify "support staff." For my part I'm reasonably enthusiastic about eventually hiring people who are engaged in research but whose main role is more like clarifying, communicating, engaging with the outside world, prioritizing, etc., and I could imagine doing like 25-50% as much of that kind of work as we do of frontier-pushing? I don't know whether you'd classify those people as researchers (though I probably wouldn't call it "support" since that seems to kind of minimize the work).

Once you are relying on lots of computers, that's a whole different category of work and I'm not sure what the right way of organizing that is or what we'd call support.

In terms of things like fundraising, accounting, supporting hiring processes, making payroll and benefits, budgeting, leasing and maintaining office space, dealing with the IRS, discharging legal obligations of employers, immigration, purchasing food, etc.... I'd guess it's very similar to other research organizations with similar salaries. I'm very ignorant about all of this stuff (I expect to learn a lot about it) but I'd guess that depending on details it ends up being 10-20% of staff. But it could go way lower if you outsource a lot to external vendors rather than in-house. (And if you organize a lot of events then that kind of work could just grow basically without bound, and in that case I'd again wonder if "support" is the right word.)
AMA: Paul Christiano, alignment researcher

What's your favourite mathematical object? What's your least favourite mathematical object?

Paul Christiano (6mo): Favorite: Irit Dinur's PCP for constraint satisfaction [http://www.wisdom.weizmann.ac.il/~dinuri/mypapers/combpcp.pdf]. What a proof system. If you want to be more pure, and consider the mathematical objects that are found rather than built, maybe the monster group [https://en.wikipedia.org/wiki/Monster_group]? (I'm a layperson so I can't appreciate the full extent of what's going on, and like most people I only really know about it second-hand, but its existence seems like a crazy and beautiful fact about the world.)

Least favorite: I don't know, maybe Chaitin's constant?
AMA: Paul Christiano, alignment researcher

Should more AI alignment researchers run AMAs?

Paul Christiano (6mo): Dunno, would be nice to figure out how useful this AMA was for other people. My guess is that they should at some rate/scale (in combination with other approaches like going on a podcast or writing papers or writing informal blog posts), and the question is how much communication like that to do in an absolute sense and how much should be AMAs vs other things.

Maybe I'd guess that typically like 1% of public communication should be something like an AMA, and that something like 5-10% of researcher time should be public communication (though as mentioned in another comment you might have some specialization there which would cut it down, though I think that the AMA format is less likely to be split off, though that might be an argument for doing less AMA-like stuff and more stuff that gets split off...). So that would suggest like 0.05-0.1% of time on AMA-like activities. If the typical one takes a full-time-day-equivalent, then that's like doing one every 2 years, which I guess would be way more AMAs than we have. This AMA is more like a full-time day so maybe every 4 years? That feels a bit like an overestimate, but overall I'd guess that it would be good on the margin for there to be more alignment researcher AMAs. (But I'm not sure if AMAs are the best AMA-like thing.)

In general I think that talking with other researchers and practitioners 1:1 is way more valuable than broadcast communication.
AMA: Paul Christiano, alignment researcher

Should more AI alignment research be communicated in book form? Relatedly, what medium of research communication is most under-utilized by the AI alignment community?

I think it would be good to get more arguments and ideas pinned down, explained carefully, collected in one place. I think books may be a reasonable format for that, though man they take a long time to write.

I don't know what medium is most under-utilized.

AMA: Paul Christiano, alignment researcher

That's not the AXRP question I'm too polite to ask.

Ben Pace (6mo): Paul, if you did an episode of AXRP, which two other AXRP episodes do you expect your podcast would be between, in terms of quality? For this question, collapse all aspects of quality into a scalar.
AMA: Paul Christiano, alignment researcher

Should marginal CHAI PhD graduates who are dispositionally indifferent between the two options try to become a professor or do research outside of universities?

Not sure. If you don't want to train students, it seems to me like you should be outside of a university. If you do want to train students it's less clear and maybe depends on what you want to do (and given that students vary in what they are looking for, this is probably locally self-correcting if too many people go one way or the other). I'd certainly lean away from university for the kinds of work that I want to do, or for the kinds of things that involve aligning large ML systems (which benefit from some connection to customers and resources).

AMA: Paul Christiano, alignment researcher

What mechanisms could effective altruists adopt to improve the way AI alignment research is funded?

Long run I'd prefer something like altruistic equity / certificates of impact. But frankly I don't think we have hard enough funding coordination problems that it's going to be worth figuring that kind of thing out.

(And like every other community we are free-riders---I think that most of the value of experimenting with such systems would accrue to other people who can copy you if successful, and we are just too focused on helping with AI alignment to contribute to that kind of altruistic public good. If only someone would be willing to purchase ...

AMA: Paul Christiano, alignment researcher

Why aren't impact certificates a bigger deal?

Paul Christiano (6mo): Change is slow and hard and usually driven by organic changes rather than clever ideas, and I expect it to be the same here. In terms of why the idea is actually just not that big a deal, I think the big thing is that altruistic projects often do benefit hugely from not needing to do explicit credit attribution. So that's a real cost. (It's also a cost for for-profit businesses, leading to lots of acrimony and bargaining losses.) They also aren't quite consistent with moral public goods [https://www.lesswrong.com/posts/pqKwra9rRYYMvySHc/moral-public-goods] / donation-matching, which might be handled better by a messy status quo, and I think that's a long-term problem though probably not as big as the other issues.
AMA: Paul Christiano, alignment researcher

How many ideas of the same size as "maybe a piecewise linear non-linearity would work better than a sigmoid for not having vanishing gradients" are we away from knowing how to build human-level AI technology?

I think it's >50% chance that ideas like ReLUs or soft attention are best thought of as multiplicative improvements on top of hardware progress (as are many other ideas like auxiliary objectives, objectives that better capture relevant tasks, infrastructure for training more efficiently, dense datasets, etc.), because the basic approach of "optimize for a task that requires cognitive competence" will eventually yield human-level competence. In that sense I think the answer is probably 0.

Maybe my median number of OOMs left before human-level intelligence,...
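(As an aside on the vanishing-gradient contrast the question alludes to, here is a minimal numerical illustration with made-up inputs: the sigmoid's derivative is at most 0.25 and collapses for large |x|, while the ReLU's derivative is exactly 1 on the active path.)

```python
# Toy illustration of the vanishing-gradient point: sigmoid derivatives shrink
# toward zero for large |x| (and are never above 0.25), while the ReLU
# derivative stays at 1 for positive inputs.
import math

def sigmoid_grad(x):
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)

def relu_grad(x):
    return 1.0 if x > 0 else 0.0

for x in [0.0, 2.0, 5.0, 10.0]:
    print(f"x={x:5.1f}  sigmoid'={sigmoid_grad(x):.6f}  relu'={relu_grad(x):.1f}")

# Backprop through many layers multiplies factors like these together, so
# stacked sigmoids shrink gradients exponentially with depth; ReLUs don't.
```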

AMA: Paul Christiano, alignment researcher

How many ideas of the same size as "maybe we could use inverse reinforcement learning to learn human values" are we away from knowing how to knowably and reliably build human-level AI technology that wouldn't cause something comparably bad as human extinction?

A lot of this is going to come down to estimates of the denominator. 

(I mostly just think that you might as well just ask people "Is this good?" rather than trying to use a more sophisticated form of IRL---in particular I don't think that realistic versions of IRL will successfully address the cases where people err in answering the "is it good?" question, that directly asking is more straightforward in many important ways, and that we should mostly just try to directly empower people to give better answers to such questions.)

Anyway, with that caveat ...

AMA: Paul Christiano, alignment researcher

How many new blogs do you anticipate creating in the next 5 years?

Paul Christiano (6mo): I've created 3 blogs in the last 10 years and 1 blog in the preceding 5 years. It seems like 1-2 is a good guess. (A lot depends on whether there ends up being an ARC blog or it just inherits ai-alignment.com.)
AMA: Paul Christiano, alignment researcher

If a 17-year-old wanted to become the next Paul Christiano, what should they do?

AMA: Paul Christiano, alignment researcher

What is the Paul Christiano production function?

AMA: Paul Christiano, alignment researcher

How will we know when it's not worth getting more people to work on reducing existential risk from AI?

Paul Christiano (6mo): We'll do the cost-benefit analysis and over time it will look like a good career for a smaller and smaller fraction of people (until eventually basically everyone for whom it looks like a good idea is already doing it). That could kind of qualitatively look like "something else is more important," or "things kind of seem under control and it's getting crowded," or "there's no longer enough money to fund scaleup." Of those, I expect "something else is more important" to be the first to go (though it depends a bit on how broadly you interpret "from AI"; if anything related to the singularity / radically accelerating growth is classified as "from AI" then it may be a core part of the EA careers shtick kind of indefinitely, with most of the action in which of the many crazy new aspects of the world people are engaging with).
AMA: Paul Christiano, alignment researcher

What's the most important thing that AI alignment researchers have learned in the past 10 years? Also, that question but excluding things you came up with.

"Thing" is tricky. Maybe something like the set of intuitions and arguments we have around learned optimizers, i.e. the basic argument that ML will likely produce a system that is "trying" to do something, and that it can end up performing well on the training distribution regardless of what it is "trying" to do (and this is easier the more capable and knowledgeable it is). I don't think we really know much about what's going on here, but I do think it's an important failure to be aware of and at least folks are looking for it now. So I do think that if it... (read more)

AMA: Paul Christiano, alignment researcher

What is the most common wrong research-relevant intuition among AI alignment researchers?

Does the lottery ticket hypothesis suggest the scaling hypothesis?

Ah, to be clear, I am entirely basing my comments off of reading the abstracts (and skimming the multi-prize paper with an eye one develops after having been an ML PhD student for mumbles indistinctly years).

Does the lottery ticket hypothesis suggest the scaling hypothesis?

Oh here's where I think things went wrong:

Part of why I think the two tickets are the same is that the at-initialization ticket is found by taking the after-training ticket and rewinding it to the beginning!

This is true in the original LTH paper, but there the "at-initialization ticket" doesn't actually perform well: it's just easy to train to high performance.

In the multi-prize LTH paper, it is the case that the "at-initialization ticket" performs well, but they don't find it by winding back the weights of a trained pruned network.

If you got multi-pri...

Daniel Kokotajlo (6mo): OH this indeed changes everything (about what I had been thinking), thank you! I shall have to puzzle over these ideas some more then, and probably read the multi-prize paper more closely (I only skimmed it earlier).
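(To pin down the distinction being discussed above, here is a toy sketch of the original-LTH "train, prune by magnitude, rewind to init" loop, with a linear model standing in for a neural network; this is my own illustration, not either paper's actual setup. The rewound ticket is easy to retrain to good performance, but before retraining it generally does not perform well, which is where the multi-prize claim differs.)

```python
# Toy sketch of the original lottery-ticket procedure (train, prune, rewind),
# with a sparse linear-regression problem standing in for a neural network.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))
true_w = np.where(rng.random(20) < 0.3, rng.normal(size=20), 0.0)
y = X @ true_w + 0.1 * rng.normal(size=200)

def train(w, mask, steps=1000, lr=0.05):
    """Gradient descent on squared error, updating only unpruned weights."""
    w = w.copy()
    for _ in range(steps):
        grad = X.T @ (X @ (w * mask) - y) / len(y)
        w -= lr * grad * mask
    return w * mask

w_init = rng.normal(size=20)                           # random initialization
mask = np.ones(20)

w_trained = train(w_init, mask)                        # 1. train the dense model
threshold = np.quantile(np.abs(w_trained), 0.5)
mask = (np.abs(w_trained) >= threshold).astype(float)  # 2. prune small weights

ticket = w_init * mask                                 # 3. rewind survivors to init
w_ticket_trained = train(ticket, mask)                 # the ticket retrains well...

print("dense loss:             ", np.mean((X @ w_trained - y) ** 2))
print("retrained ticket loss:  ", np.mean((X @ w_ticket_trained - y) ** 2))
print("unretrained ticket loss:", np.mean((X @ ticket - y) ** 2))
# ...but the unretrained ticket (step 3 alone) is much worse, which is the gap
# between the original-LTH and multi-prize-LTH claims discussed above.
```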
Does the lottery ticket hypothesis suggest the scaling hypothesis?

I guess I'm imagining that 'by default', your distribution over which optimum SGD reaches should be basically uniform, and you need a convincing story to end up believing that it reliably gets to one specific optimum.

So for them not to be the same, the training process would need to kill the first ticket and then build a new ticket on exactly the same spot!

Yes, that's exactly what I think happens. Training takes a long time, and I expect the weights in a 'ticket' to change based on the weights of the rest of the network (since those other weights have ...

