All of Daniel Kokotajlo's Comments + Replies

DeepMind: Generally capable agents emerge from open-ended play

Thanks! This is exactly the sort of thoughtful commentary I was hoping to get when I made this linkpost.

--I don't see what the big deal is about laws of physics. Humans and all their ancestors evolved in a world with the same laws of physics; we didn't have to generalize to different worlds with different laws. Also, I don't think "be superhuman at figuring out the true laws of physics" is on the shortest path to AIs being dangerous. Also, I don't think AIs need to control robots or whatnot in the real world to be dangerous, so they don't even need to be ... (read more)

I don't see what the big deal is about laws of physics. Humans and all their ancestors evolved in a world with the same laws of physics; we didn't have to generalize to different worlds with different laws. Also, I don't think "be superhuman at figuring out the true laws of physics" is on the shortest path to AIs being dangerous. Also, I don't think AIs need to control robots or whatnot in the real world to be dangerous, so they don't even need to be able to understand the true laws of physics, even on a basic level.

The entire novelty of this work revol... (read more)

[AN #156]: The scaling hypothesis: a plan for building AGI

OK cool, sorry for the confusion. Yeah, I think ESRogs' interpretation of you was making a somewhat stronger claim than you actually were.

[AN #156]: The scaling hypothesis: a plan for building AGI

Huh, it seems I misunderstood you, then! The fourth bullet point claims that GPT-N will go on filling in missing words rather than doing a treacherous turn. But this seems unsupported by the argument you made; in fact, the opposite seems supported by it. The argument you made was:

There are several pretraining objectives that could have been used to train GPT-N other than next word prediction (e.g. masked language modeling). For each of these, there's a corresponding model that the resulting GPT-N would "try" to <do the thing i
... (read more)
Rohin Shah (4 points, 6d): ?? I said nothing about a treacherous turn? And where did I say it would go on filling in missing words?

EDIT: Ah, you mean the fourth bullet point in ESRogs response. I was thinking of that as one example of how such reasoning could go wrong, as opposed to the only case. So in that case the model_1 predicts a treacherous turn confidently, but this is the wrong epistemic state to be in because it is also plausible that it just "fills in words" instead. Isn't that effectively what I said? (I was trying to be more precise since "achieve its training objective" is ambiguous, but given what I understand you to mean by that phrase, I think it's what I said?) This seems reasonable to me (and seems compatible with what I said).
Open Problems with Myopia

Also: I think making sure our agents are DDT is probably going to be approximately as difficult as making them aligned. Related: Your handle for anthropic uncertainty is:

never reason about anthropic uncertainty. DDT agents always think they know who they are.

"Always think they know who they are" doesn't cut it; you can think you know you're in a simulation. I think a more accurate version would be something like "Always think that you are on an original planet, i.e. one in which life appeared 'naturally,' rather than a planet in the midst of some larger in... (read more)

rohinmshah's Shortform

OK, fair enough. But what if it writes, like, 20 posts in the first 20 days which are that good, but then afterwards it hits diminishing returns because the rationality-related points it makes are no longer particularly novel and exciting? I think this would happen to many humans if they could work at super-speed.

That said, I guess I don't think this is that likely... probably AI will be unable to do even three such posts, or else it'll be able to generate arbitrary numbers of them. The human range is small. Maybe. Idk.

Rohin Shah (3 points, 13d): I'd be pretty surprised if that happened. GPT-3 already knows way more facts than I do, and can mimic far more writing styles than I can. It seems like by the time it can write any good posts (without cherrypicking), it should quickly be able to write good posts on a variety of topics in a variety of different styles, which should let it scale well past 20 posts. (In contrast, a specific person tends to write on 1-2 topics, in a single style, and not optimizing that hard for karma, and many still write tens of high-scoring posts.)
rohinmshah's Shortform

Ah right, good point, I forgot about cherry-picking. I guess we could make it be something like "And the blog post wasn't cherry-picked; the same system could be asked to make 2 additional posts on rationality and you'd like both of them also." I'm not sure what credence I'd give to this but it would probably be a lot higher than 10%.

Website prediction: Nice, I think that's like 50% likely by 2030.

Major research area: What counts as a major research area? Suppose I go calculate that Alpha Fold 2 has already sped up the field of protein structure prediction... (read more)

Rohin Shah (5 points, 13d): I was thinking 365 posts * ~50 karma per post gets you most of the way there (18,250 karma), and you pick up some additional karma from comments along the way. 50 karma posts are good but don't have to be hugely insightful; you can also get a lot of juice by playing to the topics that tend to get lots of upvotes. Unlike humans the bot wouldn't be limited by writing speed (hence my restriction of one post per day).

AI systems should be really, really good at writing, given how easy it is to train on text. And a post is a small, self-contained thing, that takes not very long to create (i.e. it has short horizons), and there are lots of examples to learn from. So overall this seems like a thing that should happen well before TAI / AGI.

I think I want to give up on the research area example, seems pretty hard to operationalize. (But fwiw according to the picture in my head, I don't think I'd count AlphaFold.)
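As a quick check of the karma arithmetic in Rohin's reply (the 20,000-karma target comes from his prediction quoted further down this page):

\[
365 \text{ posts} \times 50\ \tfrac{\text{karma}}{\text{post}} = 18{,}250 \text{ karma}, \qquad 20{,}000 - 18{,}250 = 1{,}750 \text{ karma left to pick up from comments.}
\]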
rohinmshah's Shortform

Nice! I really appreciate that you are thinking about this and making predictions. I want to do the same myself.

I think I'd put something more like 50% on "Rohin will at some point before 2030 read an AI-written blog post on rationality that he likes more than the typical LW >30 karma post." That's just a wild guess, very unstable.

Another potential prediction generation methodology: Name something that you think won't happen, but you think I think will.

Rohin Shah (5 points, 14d): This seems more feasible, because you can cherrypick a single good example. I wouldn't be shocked if someone on LW spent a lot of time reading AI-written blog posts on rationality and posted the best one, and I liked that more than a typical >30 karma post. My default guess is that no one tries to do this, so I'd still give it < 50% (maybe 30%?), but conditional on someone trying I think probably 80% seems right.

I spent a bit of time on this but I think I don't have a detailed enough model of you to really generate good ideas here :/ Otoh, if I were expecting TAI / AGI in 15 years, then by 2030 I'd expect to see things like:

* An AI system that can create a working website with the desired functionality "from scratch" (e.g. a simple Twitter-like website, an application that tracks D&D stats and dice rolls for you, etc., a simple Tetris game with an account system, ...). The system allows even non-programmers to create these kinds of websites (so cannot depend on having a human programmer step in to e.g. fix compiler errors).
* At least one large, major research area in which human researcher productivity has been boosted 100x relative to today's levels thanks to AI. (In calculating the productivity we ignore the cost of running the AI system.) Humans can still be in the loop here, but the large majority of the work must be done by AIs.
* An AI system gets 20,000 LW karma in a year, when limited to writing one article per day and responses to any comments it gets from humans.
* Productivity tools like todo lists, memory systems, time trackers, calendars, etc. are made effectively obsolete (or at least the user interfaces are made obsolete); the vast majority of people who used to use these tools have replaced them with an Alexa / Siri style assistant.

Currently, I don't expect to see any of these by 2030.
Oliver Habryka (2 points, 19d): I also found these very valuable! I wonder whether a better title might help more people see how great these are, but not sure.
Mark Xu (2 points, 19d): Thanks! I will try, although they will likely stay very intermittent.
BASALT: A Benchmark for Learning from Human Feedback

Going from zero to "produce an AI that learns the task entirely from demonstrations and/or natural language description" is really hard for the modern AI research hive mind. You have to instead give it a shaped reward: breadcrumbs along the way that are easier (such as allowing handcrafted heuristics, and allowing knowledge of a particular target task) to get the hive mind started making progress.

Vanessa Kosoy (1 point, 19d): It's not "from zero" though, I think that we already have ML techniques that should be applicable here.
Experimentally evaluating whether honesty generalizes

Thanks!

I like your breakdown of A-E, let's use that going forward.

It sounds like your view is: For "dumb" AIs that aren't good at reasoning, it's more likely that they'll just do B "directly" rather than do E-->D-->C-->B. Because the latter involves a lot of tricky reasoning which they are unable to do. But as we scale up our AIs and make them smarter, eventually the E-->D-->C-->B thing will be more likely than doing B "directly" because it works for approximately any long-term consequence (e.g. paperclips) and thus probably works for som... (read more)

Paul Christiano (4 points, 22d): I think C->B is already quite hard for language models, maybe it's possible but still very clearly hard enough that it overwhelms the possible simplicity benefits from E over B (before even adding in the hardness of steps E->D->C). I would update my view a lot if I saw language models doing anything even a little bit like the C->B link.

I agree that eventually A loses to any of {B, C, D, E}. I'm not sure if E is harder than B to fix, but at any rate my starting point is working on the reasons that A loses to any of the alternatives (e.g. here [https://www.alignmentforum.org/posts/QqwZ7cwEA2cxFEAun/teaching-ml-to-answer-questions-honestly-instead-of], here [https://www.alignmentforum.org/posts/SRJ5J9Tnyq7bySxbt/answering-questions-honestly-given-world-model-mismatches]) and then after handling that we can talk about whether there are remaining reasons that E in particular is hard. (My tentative best guess is that there won't be---I started out thinking about E vs A and then ended up concluding that the examples I was currently thinking about seemed like the core obstructions to making that work.)

In the meantime, getting empirical evidence about other ways that you don't learn A is also relevant. (Since those would also ultimately lead to deceptive alignment, even if you learned some crappy A' rather than either A or B.)
Experimentally evaluating whether honesty generalizes

I'm not willing to bet yet, I feel pretty ignorant and confused about the issue. :) I'm trying to get more understanding of your model of how all this works. We've discussed:

A. "Do things you've read are instrumentally convergent"

B. "Tell the humans what they want to hear."

C. "Try to win at training."

D. "Build a model of the training process and use it to make predictions."

It sounds like you are saying A is the most complicated, followed by B and C, and then D is the least complicated. (And in this case the AI will know that winning at training means telli... (read more)

I'm saying that if you e.g. reward your AI by having humans evaluate its answers, then the AI may build a predictive model of those human evaluations and then may pick actions that are good according to that model. And that predictive model will overlap substantially with predictive models of humans in other domains.

The "build a good predictive model of humans" is a step in all of your proposals A-D.

Then I'm saying that it's pretty simple to plan against it. It would be even simpler if you were doing supervised training, since then you are just outputting ... (read more)
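A minimal sketch of the pattern described in the reply above (all names here -- pick_answer, predicted_human_score, toy_evaluator -- are hypothetical illustrations, not from any actual system): learn a model that predicts human evaluations, then pick whichever candidate output that model scores highest.

```python
from typing import Callable, List

def pick_answer(candidates: List[str],
                predicted_human_score: Callable[[str], float]) -> str:
    """Plan against the learned evaluator: return the candidate that the
    model of human evaluations is expected to rate highest."""
    return max(candidates, key=predicted_human_score)

# Toy stand-in for a learned model of human raters: it rewards answers that
# sound confident and agreeable, regardless of truth -- which is the
# "tell the humans what they want to hear" worry in this thread.
def toy_evaluator(answer: str) -> float:
    text = answer.lower()
    return text.count("certainly") + text.count("you're right")

print(pick_answer(
    ["I'm not sure; the evidence is mixed.",
     "You're right, it certainly works exactly as you hoped."],
    toy_evaluator))
```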

Experimentally evaluating whether honesty generalizes
I think that simple forms of the instrumental policy will likely arise much earlier than deceptive alignment. That is, a model can develop the intrinsic motivation "Tell the humans what they want to hear" without engaging in complex long-term planning or understanding the dynamics of the training process. So my guess is that we can be carrying out fairly detailed investigations of the instrumental policy before we have any examples of deception.

I'd be interested to hear more about this, it is not at all obvious to me. Might it not be harder to develop the ... (read more)

Paul Christiano (7 points, 1mo): I'm willing to bet against that (very) strongly. If it's going to be preferred, it really needs to be something simpler than that which leads it to deduce that heuristic (since that heuristic itself is not going to be simpler than directly trying to win at training). This is wildly out-of-domain generalization of much better reasoning than existing language models engage in. Whereas there's nothing particularly exotic about building a model of the training process and using it to make predictions.
Brute force searching for alignment

OK, now I get what you are saying! Interesting. I am skeptical that this will work for most alignment problems, due to lack of simple conceptual core maybe. In particular, I doubt that corrigibility and non-deceptiveness have simple conceptual cores. I hope I'm wrong.

Adam Shimi (3 points, 1mo): Well, if you worry that these properties don't have a simple conceptual core, maybe you can do the trick where you try to formalize a subset of them with a small conceptual core. That's basically Evan's move with Myopia as an easier-to-study subset of non-deceptiveness.
Environmental Structure Can Cause Instrumental Convergence

Oh interesting... so then what I need for my argument is not the simplest function period, but the simplest function that doesn't make power-seeking and not-power-seeking both optimal? (Isn't that probably just going to be the simplest function that doesn't make everything optimal?)

I admit I am probably conceptually confused in a bunch of ways, I haven't read your post closely yet.

Environmental Structure Can Cause Instrumental Convergence

Another typo:

I don't yet understand the general case, but I have a strong hunch that instrumental convergenceoptimal policies is a governed by how many more ways there are for power to be optimal than not optimal.
Environmental Structure Can Cause Instrumental Convergence

This is awesome, thanks! I found the minesweeper analogy in particular super helpful.

4. Combining (4) and (5), \(P_U(\text{PS}) \ge 2^{-(K_U(\phi)+O(1))}\, P_U(\text{NPS})\). QED.

Typo? Maybe you mean (2) and (3)?

More substantively, on the simplicity prior stuff:

There's the power seeking functions and the non-power-seeking functions, but then there's also the inner-aligned functions, i.e. the ones that are in some sense trying their best to achieve the base objective, and then there's also the human-aligned functions. Perhaps we can pretty straightforwardly argue that the non-power-seek... (read more)

Alex Turner (2 points, 1mo): I like the thought. I don't know if this sketch works out, partly because I don't fully understand it. Your conclusion seems plausible but I want to develop the arguments further. As a note: the simplest function period probably is the constant function, and other very simple functions probably make both power-seeking and not-power-seeking optimal. So if you permute that one, you'll get another function for which power-seeking and not-power-seeking actions are both optimal.
Daniel Kokotajlo (2 points, 1mo): Another typo:
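A hedged reconstruction of what the inequality quoted at the top of this thread seems to be saying, based on the permutation argument discussed in the replies (this is an editorial sketch with its assumptions spelled out, not a quote from the post): suppose \(\phi\) is an injective map sending each reward function \(u\) for which not-power-seeking is optimal to a function \(\phi \cdot u\) for which power-seeking is optimal, and suppose the prior \(P_U\) weights a function roughly as \(2^{-K_U(\cdot)}\). Since \(K_U(\phi \cdot u) \le K_U(u) + K_U(\phi) + O(1)\), summing over the not-power-seeking-optimal functions gives

\[
P_U(\text{PS}) \;\ge\; \sum_{u:\,\text{NPS opt.}} P_U(\phi \cdot u) \;\ge\; 2^{-(K_U(\phi)+O(1))} \sum_{u:\,\text{NPS opt.}} P_U(u) \;=\; 2^{-(K_U(\phi)+O(1))}\, P_U(\text{NPS}).
\]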
Frequent arguments about alignment
This post has two purposes. First, I want to cache good responses to these questions, so I don't have to think about them each time the topic comes up. Second, I think it's useful for people who work on safety and alignment to be ready for the kind of pushback they'll get when pitching their work to others.

Great idea, thanks for writing this!

Parameter counts in Machine Learning

Thank you for collecting this dataset! What's the difference between the squares, triangles, and plus-sign datapoints? If you say it somewhere, I'm afraid I haven't been able to find it.

Jaime Sevilla (1 point, 1mo): Thank you! The shapes mean the same as the color (i.e. domain) - they were meant to make the graph more clear. Ideally both shape and color would be reflected in the legend. But whenever I tried adding shapes to the legend instead a new legend was created, which was more confusing. If somebody reading this knows how to make the code [https://colab.research.google.com/drive/11m0AfSQnLiDijtE1fsIPqF-ipbTQcsFp#scrollTo=lwiOHOM8Qwd9] produce a correct legend I'd be very keen on hearing it! EDIT: Now fixed
Rogue AGI Embodies Valuable Intellectual Property

+1. Another way of putting it: This allegation of shaky arguments is itself super shaky, because it assumes that overcoming a 100x - 1,000,000x gap in "resources" implies a "very large" alignment tax. This just seems like a weird abstraction/framing to me that requires justification.

I wrote this Conquistadors post in part to argue against this abstraction/framing. These three conquistadors are something like a natural experiment in "how much conquering can the few do against the many, if they have various advantages?" (If I just selected a lone conqueror, ... (read more)

Cortés, Pizarro, and Afonso as Precedents for Takeover

I don't really see this as in strong conflict with what I said. I agree that technology is the main factor; I said it was also "strategic and diplomatic cunning." Are you suggesting that it wasn't really that at all, and that if Cortés had gifted his equipment to 500 locals they would have been just as successful at taking over as he was? I could be convinced of this, I suppose.

Agency in Conway’s Game of Life

I wonder if there are some sorts of images that are really hard to compress via this particular method.

I wonder if you can achieve massive reliable compression if you aren't trying to target a specific image but rather something in a general category. For example, maybe this specific lizard image requires a CA rule filesize larger than the image to express, but in the space of all possible lizard images there are some nice looking lizards that are super compressible via this CA method. Perhaps using something like DALL-E we could search this space efficiently and find such an image.

Agency in Conway’s Game of Life

Wow, that's cool! Any idea how complex (how large the filesize) the learned CA's rules were? I wonder how it compares to the filesize of the target image. Many orders of magnitude bigger? Just one? Could it even be... smaller?

Alex Flint (3 points, 2mo): Yeah I had the sense that the project could have been intended as a compression mechanism since compressing in terms of CA rules kind of captures the spatial nature of image information quite well.
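One way to make the filesize comparison Daniel asks about concrete (a rough sketch with made-up sizes -- the resolution, channel count k, and hidden width below are illustrative assumptions, not numbers from the actual project): count the bytes needed to store a small neural-CA update rule and compare them with the bytes of the target image.

```python
# Hypothetical sizes for illustration only -- not numbers from the actual project.
H, W = 128, 128                       # target image resolution (grayscale)
image_bytes = H * W * 1               # 1 byte per pixel

# A toy neural-CA update rule: every cell updates from its 3x3 neighbourhood of
# k channels via a small 2-layer network (the same rule applied at every cell).
k, hidden = 8, 32
rule_params = (3 * 3 * k) * hidden + hidden   # first layer: weights + biases
rule_params += hidden * k + k                 # second layer: weights + biases
rule_bytes = rule_params * 4                  # stored as float32

print(f"image:   {image_bytes} bytes")
print(f"CA rule: {rule_bytes} bytes ({rule_params} parameters)")
print(f"rule / image size ratio: {rule_bytes / image_bytes:.2f}")
```

Whether the real learned rule comes out bigger or smaller than the image is exactly the empirical question Daniel is asking; the sketch just shows what would need to be counted.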
Intermittent Distillations #3

Thanks for doing this! I found your digging-into-the-actual-proof of the Multi-Prize LTH paper super helpful btw, I had wondered if they had been doing something boring like that but now I know! This is great news.

Understanding the Lottery Ticket Hypothesis

Thanks for this, I found it helpful!

If you are still interested in reading and thinking more about this topic, I would love to hear your thoughts on the papers below, in particular the "multi-prize LTH" one, which seems to contradict some of the claims you made above. Also, I'd love to hear whether LTH-ish hypotheses apply to RNNs and more generally the sort of neural networks used to make, say, AlphaStar.

https://arxiv.org/abs/2103.09377

"In this paper, we propose (and prove) a stronger Multi-Prize Lottery Ticket Hypothesis:

A suff... (read more)

Alex Flint (5 points, 2mo): Wow, thank you Daniel, this is an incredibly helpful list!
Understanding the Lottery Ticket Hypothesis

I confess I don't really understand what a tangent space is, even after reading the wiki article on the subject. It sounds like it's something like this: Take a particular neural network. Consider the "space" of possible neural networks that are extremely similar to it, i.e. they have the same architecture but the weights are slightly different, for some definition of "slightly." That's the tangent space. Is this correct? What am I missing?

johnswentworth (4 points, 2mo): Picture a linear approximation, like this: [figure: a curve with its tangent line at a point a] The tangent space at point a is that whole line labelled "tangent". The main difference between the tangent space and the space of neural-networks-for-which-the-weights-are-very-close is that the tangent space extrapolates the linear approximation indefinitely; it's not just limited to the region near the original point. (In practice, though, that difference does not actually matter much, at least for the problem at hand - we do stay close to the original point.)

The reason we want to talk about "the tangent space" is that it lets us precisely state things like e.g. Newton's method [https://en.wikipedia.org/wiki/Newton%27s_method] in terms of search: Newton's method finds a point at which f(x) is approximately 0 by finding a point where the tangent space hits zero (i.e. where the line in the picture above hits the x-axis). So, the tangent space effectively specifies the "search objective" or "optimization objective" for one step of Newton's method. In the NTK/GP model, neural net training is functionally-identical to one step of Newton's method (though it's Newton's method in many dimensions, rather than one dimension).
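A one-dimensional toy version of the Newton's-method picture above (just an illustration of "solve the tangent line for zero", not anything specific to the NTK paper):

```python
# Find a root of f(x) = x^2 - 2 by repeatedly replacing f with its tangent
# (linear) approximation at the current point and solving *that* for zero.
def f(x):
    return x ** 2 - 2.0

def df(x):
    return 2.0 * x

x = 1.0
for step in range(5):
    x = x - f(x) / df(x)   # where the tangent line at x crosses the x-axis
    print(step, x)          # converges quickly to sqrt(2) = 1.41421356...
```

In the NTK/GP analogy described in the comment, the "tangent line" is the linearization of the network's outputs in its weights at initialization, and training plays the role of that single (many-dimensional) Newton step.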
Pre-Training + Fine-Tuning Favors Deception

Nice post! You may be interested in this related post and discussion.

I think you may have forgotten to put a link in "See Mesa-Search vs Mesa-Control for discussion."

Mark Xu (1 point, 3mo): thanks, fixed
NTK/GP Models of Neural Nets Can't Learn Features

Ah, OK. Interesting, thanks. Would you agree with the following view:

"The NTK/GP stuff has neural nets implementing a "psuedosimplicity prior" which is maybe also a simplicity prior but might not be, the evidence is unclear. A psuedosimplicity prior is like a simplicity prior except that there are some important classes of kolmogorov-simple functions that don't get high prior / high measure."

Which would you say is more likely: (a) the NTK/GP stuff is indeed not universally data efficient, and thus modern neural nets aren't either, or (b) the NTK/GP stuff is indeed not universally data efficient, and thus modern neural nets aren't well-characterized by the NTK/GP stuff?

NTK/GP Models of Neural Nets Can't Learn Features
Feature learning requires the intermediate neurons to adapt to structures in the data that are relevant to the task being learned, but in the NTK limit the intermediate neurons' functions don't change at all.
Any meaningful function like a 'car detector' would need to be there at initialization -- extremely unlikely for functions of any complexity.

I used to think it would be extremely unlikely for a randomly initialized neural net to contain a subnetwork that performs just as well as the entire neural net does after training. But the mu... (read more)

AMA: Paul Christiano, alignment researcher

My counterfactual attempts to get at the question "Holding ideas constant, how much would we need to increase compute until we'd have enough to build TAI/AGI/etc. in a few years?" This is (I think) what Ajeya is talking about with her timelines framework. Her median is +12 OOMs. I think +12 OOMs is much more than 50% likely to be enough; I think it's more like 80% and that's after having talked to a bunch of skeptics, attempted to account for unknown unknowns, etc. She mentioned to me that 80% seems plausible to her too but that sh... (read more)

AMA: Paul Christiano, alignment researcher

Hmm, I don't count "It may work but we'll do something smarter instead" as "it won't work" for my purposes.

I totally agree that noise will start to dominate eventually... but the thing I'm especially interested in with Amp(GPT-7) is not the "7" part but the "Amp" part. Using prompt programming, fine-tuning on its own library, fine-tuning with RL, making chinese-room-bureaucracies, training/evolving those bureaucracies... what do you think about that? Naively the scaling laws would predict that we... (read more)

AMA: Paul Christiano, alignment researcher

When you say hardware progress, do you just mean compute getting cheaper or do you include people spending more on compute? So you are saying, you guess that if we had 10 OOMs of compute today that would have a 50% chance of leading to human-level AI without any further software progress, but realistically you expect that what'll happen is we get +5 OOMs from increased spending and cheaper hardware, and then +5 "virtual OOMs" from better software?
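Spelling out the arithmetic of the decomposition Daniel is asking Paul to confirm, using the numbers in the question itself:

\[
\underbrace{10^{5}}_{\text{spending \& cheaper hardware}} \;\times\; \underbrace{10^{5}}_{\text{``virtual'' OOMs from software}} \;=\; 10^{10}, \quad \text{i.e. the full +10 OOMs of effective compute.}
\]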

[AN #139]: How the simplicity of reality explains the success of neural nets
I agree with Zach above about the main point of the paper. One other thing I’d note is that SGD can’t have literally the same outcomes as random sampling, since random sampling wouldn’t display phenomena like double descent (AN #77).

Would you mind explaining why this is? It seems to me like random sampling would display double descent. For example, as you increase model size, at first you get more and more parameters that let you approximate the data better... but then you get too many parameters and just start memorizing the data... ... (read more)

Rohin Shah (4 points, 3mo): Hmm, I think you're right. I'm not sure what I was thinking when I wrote that. (Though I give it like 50% that if past-me could explain his reasons, I'd agree with him.) Possibly I was thinking of epochal double descent, but that shouldn't matter because we're comparing the final outcome of SGD to random sampling, so epochal double descent doesn't come into the picture.
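For concreteness, here is one toy way to operationalize "random sampling" as model size varies (an editorial sketch with made-up components: random ReLU features stand in for "model size", and a randomly chosen fitting readout stands in for "sample a function that fits the training data"). Whether the resulting test-error curve actually shows double descent is exactly the empirical question being discussed; the code only shows the procedure.

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny regression task: 10 noisy samples of sin(x).
x_tr = np.linspace(-3, 3, 10)
y_tr = np.sin(x_tr) + 0.1 * rng.normal(size=x_tr.size)
x_te = np.linspace(-3, 3, 200)
y_te = np.sin(x_te)

def feats(x, width, seed=1):
    """Fixed random ReLU features; `width` plays the role of model size."""
    w = np.random.default_rng(seed).normal(size=(2, width))
    return np.maximum(0.0, x[:, None] * w[0] + w[1])

def sample_fitting_function(width):
    """'Random sampling': a random readout among those fitting the training
    data as well as possible (least-squares fit plus a random component in
    the null space of the training features)."""
    A = feats(x_tr, width)
    theta, *_ = np.linalg.lstsq(A, y_tr, rcond=None)
    _, s, Vt = np.linalg.svd(A)
    null_rows = Vt[len(s):]   # directions that leave training predictions unchanged
    if null_rows.shape[0] > 0:
        theta = theta + null_rows.T @ rng.normal(size=null_rows.shape[0])
    return theta

for width in [2, 5, 10, 20, 100, 1000]:
    th = sample_fitting_function(width)
    train_mse = np.mean((feats(x_tr, width) @ th - y_tr) ** 2)
    test_mse = np.mean((feats(x_te, width) @ th - y_te) ** 2)
    print(f"width={width:5d}  train MSE={train_mse:.3f}  test MSE={test_mse:.3f}")
```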
NTK/GP Models of Neural Nets Can't Learn Features
I'll confess that I would personally find it kind of disappointing if neural nets were mostly just an efficient way to implement some fixed kernels, when it seems possible that they could be doing something much more interesting -- perhaps even implementing something like a simplicity prior over a large class of functions, which I'm pretty sure NTK/GP can't be

Wait, why can't NTK/GP be implementing a simplicity prior over a large class of functions? They totally are, it's just that the prior comes from the measure in random initia... (read more)

interstice (4 points, 3mo): There's an important distinction[1] to be made between these two claims:

A) Every function with large volume in parameter-space is simple
B) Every simple function has a large volume in parameter space

For a method of inference to qualify as a 'simplicity prior', you want both claims to hold. This is what lets us derive bounds like 'Solomonoff induction matches the performance of any computable predictor', since all of the simple, computable predictors have relatively large volume in the Solomonoff measure, so they'll be picked out after boundedly many mistakes. In particular, you want there to be an implication like: if a function has complexity C, it will have parameter-volume at least exp(−βC).

Now, the Mingard results, at least the ones that have mathematical proof, rely on the Levin bound. This only shows (A), which is the direction that is much easier to prove -- it automatically holds for any mapping from parameter-space to functions with bounded complexity. They also have some empirical results that show there is substantial 'clustering', that is, there are some simple functions that have large volumes. But this still doesn't show that all of them do, and indeed is compatible with the learnable function class being extremely limited. For instance, this could easily be the case even if NTK/GP was only able to learn linear functions.

In reality the NTK/GP is capable of approximating arbitrary functions on finite-dimensional inputs but, as I argued in another comment [https://www.lesswrong.com/posts/76cReK4Mix3zKCWNT/ntk-gp-models-of-neural-nets-can-t-learn-features?commentId=mhaTTCn5WubqBP7kw], this is not the right notion of 'universality' for classification problems. I strongly suspect[2] that the NTK/GP can be shown to not be 'universally data-efficient' as I outlined there, but as far as I'm aware no one's looked into the issue formally yet. Empirically, I think the results we have so far [https://www
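One way to write the two directions interstice distinguishes, for a parameter-to-function map where \(V(f)\) denotes the parameter-volume (prior measure) of a function \(f\) and \(K(f)\) its complexity (an editorial sketch of the shape of the claims, with constants suppressed):

\[
\text{(A)}\ \ K(f) \;\le\; \log_2 \frac{1}{V(f)} + O(1)
\qquad\qquad
\text{(B)}\ \ V(f) \;\ge\; 2^{-\beta\, K(f)} \ \text{for some constant } \beta.
\]

(A) is the Levin-bound direction the comment says is easy; (B) is the direction a genuine simplicity prior needs, matching the "parameter-volume at least exp(−βC)" condition stated above.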
Parsing Abram on Gradations of Inner Alignment Obstacles

Well, it seems to be saying that the training process basically just throws away all the tickets that score less than perfectly, and randomly selects one of the rest. This means that tickets which are deceptive agents and whatnot are in there from the beginning, and if they score well, then they have as much chance of being selected at the end as anything else that scores well. And since we should expect deceptive agents that score well to outnumber aligned agents that score well... we should expect deception.

I'm working on a much more fleshed out and expanded version of this argument right now.

Alex Flint (3 points, 3mo): Yeah right, that is scarier. Looking forward to reading your argument, esp re why we would expect deceptive agents that score well to outnumber aligned agents that score well. Although in the same sense we could say that a rock "contains" many deceptive agents, since if we viewed the rock as a giant mixture of computations then we would surely find some that implement deceptive agents.
Why Neural Networks Generalise, and Why They Are (Kind of) Bayesian

Pinging you to see what your current thoughts are! I think that if "SGD is basically equivalent to random search" then that has huge, huge implications.

Evan Hubinger (4 points, 3mo): I guess I would say something like: random search is clearly a pretty good first-order approximation, but there are also clearly second-order effects. I think that exactly how strong/important/relevant those second-order effects are is unclear, however, and I remain pretty uncertain there.
Parsing Abram on Gradations of Inner Alignment Obstacles

I think Abram's concern about the lottery ticket hypothesis wasn't about the "vanilla" LTH that you discuss, but rather the scarier "tangent space hypothesis." See this comment thread.

Alex Flint (3 points, 3mo): Thank you for the pointer. Why is the tangent space hypothesis version of the LTH scarier?
Naturalism and AI alignment

I'd be interested to see naturalism spelled out more and defended against the alternative view that (I think) prevails in this community. That alternative view is something like: "Look, different agents have different goals/values. I have mine and will pursue mine, and you have yours and pursue yours. Also, there are rules and norms that we come up with to help each other get along, analogous to laws and rules of etiquette. Also, there are game-theoretic principles like fairness, retribution, and bullying-resistance that are basically just good ... (read more)

Michele Campolo (1 point, 3mo): I am not sure the concept of naturalism I have in mind corresponds to a specific naturalistic position held by a certain (group of) philosopher(s). I link here [https://en.wikipedia.org/wiki/Ethical_naturalism] the Wikipedia page on ethical naturalism, which contains the main ideas and is not too long. Below I focus on what is relevant for AI alignment.

In the other comment you asked about truth. AIs often have something like a world-model or knowledge base that they rely on to carry out narrow tasks, in the sense that if someone modifies the model or kb in a certain way—analogous to creating a false belief—then the agent fails at the narrow task. So we have a concept of true-given-task. By considering different tasks, e.g. in the case of a general agent that is prepared to face various tasks, we obtain true-in-general or, if you prefer, simply "truth". See also the section on knowledge in the post.

Practical example: given that light is present almost everywhere in our world, I expect general agents to acquire knowledge about electromagnetism. I also expect that some AIs, given enough time, will eventually incorporate in their world-model beliefs like: "Certain brain configurations correspond to pleasurable conscious experiences. These configurations are different from the configurations observed in (for example) people who are asleep, and very different from what is observed in rocks."

Now, take an AI with such knowledge and give it some amount of control over which goals to pursue: see also the beginning of Part II in the post. Maybe, in order to make this modification, it is necessary to abandon the single-agent framework and consider instead a multi-agent system, where one agent keeps expanding the knowledge base, another agent looks for "value" in the kb, and another one decides what actions to take given the current concept of value and other contents of the kb.

[Two notes on how I am using the word control. 1 I am not assuming any extra-physical notion h
Naturalism and AI alignment
From another point of view: some philosophers are convinced that caring about conscious experiences is the rational thing to do. If it's possible to write an algorithm that works in a similar way to how their mind works, we already have an (imperfect, biased, etc.) agent that is somewhat aligned, and is likely to stay aligned after further reflection.

I think this is an interesting point -- but I don't conclude optimism from it as you do. Humans engage in explicit reasoning about what they should do, and they theorize and systematize, and some o... (read more)

AMA: Paul Christiano, alignment researcher

I'm very glad to hear that! Can you say more about why?

Natural language has both noise (that you can never model) and signal (that you could model if you were just smart enough). GPT-3 is in the regime where it's mostly signal (as evidenced by the fact that the loss keeps going down smoothly rather than approaching an asymptote). But it will soon get to the regime where there is a lot of noise, and by the time the model is 9 OOMs bigger I would guess (based on theory) that it will be overwhelmingly noise and training will be very expensive.

So it may or may not work in the sense of meeting some absolute performance threshold, but it will certainly be a very bad way to get there and we'll do something smarter instead.
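A common parametric form from the scaling-law literature that matches the signal/noise framing in this reply (a sketch, not necessarily the exact model the commenter has in mind):

\[
L(N) \;\approx\; \underbrace{L_\infty}_{\text{irreducible entropy of text (``noise'')}} \;+\; \underbrace{\left(\frac{N_c}{N}\right)^{\alpha_N}}_{\text{reducible term (``signal''), shrinking with model size }N}
\]

"Mostly signal" corresponds to the second term still dominating, so loss keeps falling smoothly as \(N\) grows; "overwhelmingly noise" corresponds to approaching the \(L_\infty\) floor, where further scaling buys very little.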

Gradations of Inner Alignment Obstacles
Part of my idea for this post was to go over different versions of the lottery ticket hypothesis, as well, and examine which ones imply something like this. However, this post is long enough as it is.

I'd love to see you do this!

Re: The Treacherous Turn argument: What do you think of the following spitball objections:

(a) Maybe the deceptive ticket that makes T' work is indeed there from the beginning, but maybe it's outnumbered by 'benign' tickets, so that the overall behavior of the network is benign. This is an argument against ... (read more)

Abram Demski (4 points, 3mo): My overall claim is that attractor-basin type arguments need to address the base case. This seems like a potentially fine way to address the base case, if the math works out for whatever specific attractor-basin argument. If we're trying to avoid deception via methods which can steer away from deception if we assume there's not yet any deception, then we're in trouble; the technique's assumptions are violated.

Right, this seems in line with the original lottery ticket hypothesis, and would alleviate the concern. It doesn't seem as consistent with the tangent space hypothesis [https://www.lesswrong.com/posts/i9p5KWNWcthccsxqm/updating-the-lottery-ticket-hypothesis], though.
AMA: Paul Christiano, alignment researcher

In this post I argued that an AI-induced point of no return would probably happen before world GDP starts to noticeably accelerate. You gave me some good pushback about the historical precedent I cited, but what is your overall view? If you can spare the time, what is your credence in each of the following PONR-before-GDP-acceleration scenarios, and why?

1. Fast takeoff

2. The sorts of skills needed to succeed in politics or war are easier to develop in AI than the sorts needed to accelerate the entire world economy, and/or have less deployment lag. (Maybe ... (read more)

I don't know if we ever cleared up ambiguity about the concept of PONR. It seems like it depends critically on who is returning, i.e. what is the counterfactual we are considering when asking if we "could" return. If we don't do any magical intervention, then it seems like the PONR could be well before AI since the conclusion was always inevitable. If we do a maximally magical intervention, of creating unprecedented political will, then I think it's most likely that we'd see 100%+ annual growth (even of say energy capture) before PONR. I don't think there ... (read more)

AMA: Paul Christiano, alignment researcher

1. What credence would you assign to "+12 OOMs of compute would be enough for us to achieve AGI / TAI / AI-induced Point of No Return within five years or so"? (This is basically the same as, though not identical to, this poll question.)

2. Can you say a bit about where your number comes from? E.g. maybe 25% chance of scaling laws not continuing such that OmegaStar, Amp(GPT-7), etc. don't work, 25% chance that they happen but don't count as AGI / TAI / AI-PONR, for a total of about 60%? The more you say the better, this is my biggest crux! ... (read more)

Paul Christiano (5 points, 3mo): (I don't think Amp(GPT-7) will work though.)

I'd say 70% for TAI in 5 years if you gave +12 OOM.

I think the single biggest uncertainty is about whether we will be able to adapt sufficiently quickly to the new larger compute budgets (i.e. how much do we need to change algorithms to scale reasonably? it's a very unusual situation and it's hard to scale up fast and depends on exactly how far that goes). Maybe I think that there's a 90% chance that TAI is in some sense possible (maybe: if you'd gotten to that much compute while remaining as well-adapted as we are now to our current levels of compute) an... (read more)

Coherence arguments imply a force for goal-directed behavior

I love your health points analogy. Extending it, imagine that someone came up with "coherence arguments" that showed that for a rational doctor doing triage on patients, and/or for a group deciding who should do a risky thing that might result in damage, the optimal strategy involves a construct called "health points" such that:

--Each person at any given time has some number of health points

--Whenever someone reaches 0 health points, they (very probably) die

--Similar afflictions/disasters tend to cause similar amounts of decrease in hea... (read more)

Wouldn't these coherence arguments be pretty awesome? Wouldn't this be a massive step forward in our understanding (both theoretical and practical) of health, damage, triage, and risk allocation?

Insofar as such a system could practically help doctors prioritise, that would be great. (This seems analogous to how utilities are used in economics.)

But if doctors use this concept to figure out how to treat patients, or use it when designing prostheses for their patients, then I expect things to go badly. If you take HP as a guiding principle - for exampl... (read more)

Does the lottery ticket hypothesis suggest the scaling hypothesis?

OH this indeed changes everything (about what I had been thinking) thank you! I shall have to puzzle over these ideas some more then, and probably read the multi-prize paper more closely (I only skimmed it earlier)

DanielFilan (3 points, 3mo): Ah to be clear I am entirely basing my comments off of reading the abstracts (and skimming the multi-prize paper with an eye one develops after having been a ML PhD student for mumbles indistinctly years).
Does the lottery ticket hypothesis suggest the scaling hypothesis?

Whoa, the thing you are arguing against is not at all what I had been saying -- but maybe it was implied by what I was saying and I just didn't realize it? I totally agree that there are many optima, not just one. Maybe we are talking past each other?

(Part of why I think the two tickets are the same is that the at-initialization ticket is found by taking the after-training ticket and rewinding it to the beginning! So for them not to be the same, the training process would need to kill the first ticket and then build a new ticket on exactly the same spot!)

DanielFilan (3 points, 3mo): I guess I'm imagining that 'by default', your distribution over which optimum SGD reaches should be basically uniform, and you need a convincing story to end up believing that it reliably gets to one specific optimum.

Yes, that's exactly what I think happens. Training takes a long time, and I expect the weights in a 'ticket' to change based on the weights of the rest of the network (since those other weights have similar magnitude). I think the best way to see why I think that is to manually run thru the backpropagation algorithm. If I'm wrong, it's probably because of this paper [http://proceedings.mlr.press/v119/frankle20a.html] that I don't have time to read over right now (but that I do recommend you read).
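A minimal numerical version of "manually run thru the backpropagation algorithm" (a toy two-layer net with made-up weights; the point is just that the gradient on a "ticket" weight depends on the values of the other weights, so the ticket does not train in isolation):

```python
import numpy as np

# Toy 2-layer net: y = v . relu(W x).  Treat (W[0], v[0]) as "the ticket"
# and the remaining weights as "the rest of the network".
x = np.array([1.0, 0.5, -0.5])
target = 1.0
W = np.array([[ 0.5,  0.2,  0.1],
              [-0.3,  0.8,  0.4],
              [ 0.1, -0.6,  0.9],
              [ 0.7,  0.3, -0.2]])
v = np.array([1.0, -0.5, 0.3, 0.8])

def grad_W(W, v):
    """Gradient of the squared error w.r.t. the first-layer weights W."""
    h = np.maximum(0.0, W @ x)     # hidden activations
    y = v @ h                      # network output
    dy = 2.0 * (y - target)        # dL/dy: sums contributions from *every* hidden unit
    dh = dy * v * (h > 0)          # backprop through the readout and the ReLU
    return np.outer(dh, x)         # dL/dW

g_before = grad_W(W, v)

# Change only the *non-ticket* weights, then recompute the ticket's gradient.
W2, v2 = W.copy(), v.copy()
W2[1:] += 0.5
v2[1:] += 0.5
g_after = grad_W(W2, v2)

print("ticket-row gradient before changing the rest:", g_before[0])
print("ticket-row gradient after  changing the rest:", g_after[0])
```

In this toy setup the gradient on the "ticket" row changes when only the non-ticket weights are perturbed, because the error signal flowing back to it depends on every hidden unit's contribution to the output.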
Does the lottery ticket hypothesis suggest the scaling hypothesis?

Hmmm, ok. Can you say more about why? Isn't the simplest explanation that the two tickets are the same?

DanielFilan (1 point, 3mo): I expect that there are probably a bunch of different neural networks that perform well at a given task. We sort of know this because you can train a dense neural network to high accuracy, and also prune it to get a definitely-different neural network that also has high accuracy. Is it the case that these sparse architectures are small enough that there's only one optimum? Maybe, but IDK why I'd expect that.
Three reasons to expect long AI timelines

I definitely agree that our timelines forecasts should take into account the three phenomena you mention, and I also agree that e.g. Ajeya's doesn't talk about this much. I disagree that the effect size of these phenomena is enough to get us to 50 years rather than, say, +5 years to whatever our opinion sans these phenomena was. I also disagree that overall Ajeya's model is an underestimate of timelines, because while indeed the phenomena you mention should cause us to shade timelines upward, there is a long list of other phenomena I could m... (read more)

Three reasons to expect long AI timelines

Thanks for this post! I'll write a fuller response later, but for now I'll say: These arguments prove too much; you could apply them to pretty much any technology (e.g. self-driving cars, 3D printing, reusable rockets, smart phones, VR headsets...). There doesn't seem to be any justification for the 50-year number; it's not like you'd give the same number for those other techs, and you could have made exactly this argument about AI 40 years ago, which would lead to 10-year timelines now. You are just pointing out three reasons in f... (read more)

These arguments prove too much; you could apply them to pretty much any technology (e.g. self-driving cars, 3D printing, reusable rockets, smart phones, VR headsets...).

I suppose my argument has an implicit, "current forecasts are not taking these arguments into account." If people actually were taking my arguments into account, and still concluding that we should have short timelines, then this would make sense. But, I made these arguments because I haven't seen people talk about these considerations much. For example, I deliberately avoided the argument ... (read more)

Does the lottery ticket hypothesis suggest the scaling hypothesis?

Yeah, fair enough. I should amend the title of the question. Re: reinforcing the winning tickets: Isn't that implied? If it's not implied, would you not agree that it is happening? Plausibly, if there is a ticket at the beginning that does well at the task, and a ticket at the end that does well at the task, it's reasonable to think that it's the same ticket? Idk, I'm open to alternative suggestions now that you mention it...

DanielFilan (3 points, 3mo): I don't think it's implied, and I'm not confident that it's happening. There are lots of neural networks!