All of Richard_Ngo's Comments + Replies

The Counterfactual Prisoner's Dilemma

Someone might say, well I understand that if I don't pay, then it means I would have lost out if it had come up heads, but since I know it didn't come up heads, I don't care. Making this more precise, when constructing counterfactuals for a decision, if we know fact F about the world before we've made our decision, F must be true in every counterfactual we construct (call this Principle F).

The problem is that principle F elides over the difference between facts which are logically caused by your decision, and facts which aren't. For example, in Parfit's hi... (read more)

1Chris_Leong15d"The problem is that principle F elides" - Yeah, I was noting that principle F doesn't actually get us there and I'd have to assume a principle of independence as well. I'm still trying to think that through.
The Counterfactual Prisoner's Dilemma

by only considering the branches of reality that are consistent with our knowledge

I know that, in the branch of reality which actually happened, Omega predicted my counterfactual behaviour. I know that my current behaviour is heavily correlated with my counterfactual behaviour. So I know that I can logically cause Omega to give me $10,000. This seems exactly equivalent to Newcomb's problem, where I can also logically cause Omega to give me a lot of money.

So if by "considering [other branches of reality]" you mean "taking predicted counterfactuals into acco... (read more)

1Chris_Leong16dHmm... that's a fascinating argument. I've been having trouble figuring out how to respond to you, so I'm thinking that I need to make my argument more precise and then perhaps that'll help us understand the situation.

Let's start from the objection I've heard against Counterfactual Mugging. Someone might say, well I understand that if I don't pay, then it means I would have lost out if it had come up heads, but since I know it didn't come up heads, I don't care. Making this more precise, when constructing counterfactuals for a decision, if we know fact F about the world before we've made our decision, F must be true in every counterfactual we construct (call this Principle F).

Now let's consider Counterfactual Prisoner's Dilemma. If the coin comes up HEADS, then principle F tells us that the counterfactuals need to have the COIN coming up HEADS as well. However, it doesn't tell us how to handle the impact of the agent's policy if they had seen TAILS. I think we should construct counterfactuals where the agent's TAILS policy is independent of its HEADS policy, whilst you think we should construct counterfactuals where they are linked. You justify your construction by noting that the agent can figure out that it will make the same decision in both the HEADS and TAILS case.

In contrast, my tendency is to exclude information about our decision making procedures. So, if you knew you were a utility maximiser this would typically exclude all but one counterfactual and prevent us saying choice A is better than choice B. Similarly, my tendency here is to suggest that we should be erasing the agent's self-knowledge of how it decides so that we can imagine the possibility of the agent choosing PAY/NOT PAY or NOT PAY/PAY. But I still feel somewhat confused about this situation.
The Counterfactual Prisoner's Dilemma

I don't see why the Counterfactual Prisoner's Dilemma persuades you to pay in the Counterfactual Mugging case. In the counterfactual prisoner's dilemma, I pay because that action logically causes Omega to give me $10,000 in the real world (via influencing the counterfactual). This doesn't require shifting the locus of evaluation to policies, as long as we have a good theory of which actions are correlated with which other actions (e.g. paying in heads-world and paying in tails-world).

In the counterfactual mugging, by contrast, the whole point is that payin... (read more)

1Chris_Leong17dYou're correct that paying in Counterfactual Prisoner's Dilemma doesn't necessarily commit you to paying in Counterfactual Mugging. However, it does appear to provide a counter-example to the claim that we ought to adopt the principle of making decisions by only considering the branches of reality that are consistent with our knowledge as this would result in us refusing to pay in Counterfactual Prisoner's Dilemma regardless of the coin flip result. (Interestingly enough, blackmail problems seem to also demonstrate that this principle is flawed as well). This seems to suggest that we need to consider policies rather than completely separate decisions for each possible branch of reality. And while, as I already noted, this doesn't get us all the way, it does make the argument for paying much more compelling by defeating the strongest objection.
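To make the policy-versus-per-branch point concrete, here is a minimal worked example of the payoff structure discussed in this thread. The $10,000 reward figure appears above; the $100 payment is an assumed stand-in for the cost Omega asks for (the exact figure isn't in this excerpt), so treat the numbers as illustrative.

```python
# Counterfactual Prisoner's Dilemma, toy payoff calculation.
# Assumption: paying costs $100; in each branch Omega pays $10,000 iff it
# predicts you would have paid had the coin landed the other way.

PAY_COST, REWARD = 100, 10_000

def branch_payoff(pay_here: bool, pay_in_other_branch: bool) -> int:
    """Payoff in the branch that actually happened, given your action here
    and your (predicted) action in the counterfactual branch."""
    return (REWARD if pay_in_other_branch else 0) - (PAY_COST if pay_here else 0)

for heads_pay in (True, False):
    for tails_pay in (True, False):
        if_heads = branch_payoff(heads_pay, tails_pay)
        if_tails = branch_payoff(tails_pay, heads_pay)
        print(f"heads={'pay' if heads_pay else 'refuse'}, "
              f"tails={'pay' if tails_pay else 'refuse'}: "
              f"payoff if heads comes up = {if_heads}, if tails = {if_tails}")
```

Paying regardless of the flip is guaranteed $9,900 and refusing regardless is guaranteed $0, even though within whichever branch actually happened, paying looks like a pure loss. That gap between evaluating policies and evaluating each branch in isolation is what the comments above are arguing over.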
Coherence arguments imply a force for goal-directed behavior

Thanks for writing this post, Katja; I'm very glad to see more engagement with these arguments. However, I don't think the post addresses my main concern about the original coherence arguments for goal-directedness, which I'd frame as follows:

There's some intuitive conception of goal-directedness, which is worrying in the context of AI. The old coherence arguments implicitly used the concept of EU-maximisation as a way of understanding goal-directedness. But Rohin demonstrated that the most straightforward conception of EU-maximisation (which I'll call beh... (read more)
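For readers who haven't seen the argument being referenced: the construction (my paraphrase of Rohin's point, not part of the original comment) is that any behaviour whatsoever maximises expected utility for some utility function over world-histories, for example:

```latex
U_\pi(\tau) =
\begin{cases}
  1 & \text{if } \tau \text{ is a history the policy } \pi \text{ would actually produce} \\
  0 & \text{otherwise,}
\end{cases}
```

so any policy trivially maximises the expectation of its own such utility function, and "behavioural EU-maximisation" on its own places no constraint on behaviour and therefore does no work in capturing goal-directedness.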

Against evolution as an analogy for how humans will create AGI

I personally found this post valuable and thought-provoking. Sure, there's plenty that it doesn't cover, but it's already pretty long, so that seems perfectly reasonable.

I particularly dislike your criticism of it as strawmanish. Perhaps that would be fair if the analogy between RL and evolution were a standard principle in ML. Instead, it's a vague idea that is often left implicit, or else formulated in idiosyncratic ways. So posts like this one have to do double duty in both outlining and explaining the mainstream viewpoint (often a major task in its o... (read more)

Against evolution as an analogy for how humans will create AGI

there’s a “solving the problem twice” issue. As mentioned above, in Case 5 we need both the outer and the inner algorithm to be able to do open-ended construction of an ever-better understanding of the world—i.e., we need to solve the core problem of AGI twice with two totally different algorithms! (The first is a human-programmed learning algorithm, perhaps SGD, while the second is an incomprehensible-to-humans learning algorithm. The first stores information in weights, while the second stores information in activations, assuming a GPT-like architecture.)

... (read more)
1Steve Byrnes24dThanks for cross-posting this! Sorry I didn't get around to responding originally. :-)

For what it's worth, I figure that the neocortex has some number (dozens to hundreds, maybe 180 like your link says, I dunno) of subregions that do a task vaguely like "predict data X from context Y", with different X & Y & hyperparameters in different subregions. So some design work is obviously required to make those connections. (Some taste of what that might look like in more detail is maybe Randall O'Reilly's vision-learning model [https://arxiv.org/abs/1709.04654].) I figure this is vaguely analogous to figuring out what convolution kernel sizes and strides you need in a ConvNet, and that specifying all this is maybe hundreds or low thousands but not millions of bits of information. (I don't really know right now, I'm just guessing.)

Where will those bits of information come from? I figure, some combination of:

  • automated neural architecture search
  • and/or people looking at the neuroanatomy literature and trying to copy ideas
  • and/or when the working principles of the algorithm are better understood, maybe people can just guess what architectures are reasonable, just like somebody invented U-Nets [https://en.wikipedia.org/wiki/U-Net] by presumably just sitting and thinking about what's a reasonable architecture for image segmentation, followed by some trial-and-error tweaking.
  • and/or some kind of dynamic architecture that searches for learnable relationships and makes those connections on the fly … I imagine a computer would be able to do that to a much greater extent than a brain (where signals travel slowly, new long-range high-bandwidth connections are expensive, etc.)

If I understand your comment correctly, we might actually agree on the plausibility of the brute force "automated neural architecture search" / meta-learning case. …Except for the terminology! I'm not calling it "evolution analogy" because the final learning algorithm is mai
Against evolution as an analogy for how humans will create AGI

It seems totally plausible to give AI systems an external memory that they can read from / write to, and then you learn linear algebra without editing weights but with editing memory. Alternatively, you could have a recurrent neural net with a really big hidden state, and then that hidden state could be the equivalent of what you're calling "synapses".

I agree with Steve that it seems really weird to have these two parallel systems of knowledge encoding the same types of things. If an AGI learned the skill of speaking english during training, but then learn... (read more)

3Rohin Shah1moIdk, this just sounds plausible to me. I think the hope is that the weights encode more general reasoning abilities, and most of the "facts" or "background knowledge" gets moved into memory, but that won't happen for everything and plausibly there will be this strange separation between the two. But like, sure, that doesn't seem crazy. I do expect we reconsolidate into weights through some outer algorithm like gradient descent (and that may not require any human input). If you want to count that as "autonomously editing its weights", then fine, though I'm not sure how this influences any downstream disagreement.

Similar dynamics in humans:

1. Children are apparently better at learning languages than adults; it seems like adults are using some different process to learn languages (though probably not as different as editing memory vs. editing weights)
2. One theory of sleep is that it is consolidating the experiences of the day into synapses, suggesting that any within-day learning is not relying as much on editing synapses.

Tbc, I also think explicitly meta-learned update rules are plausible -- don't take any of this as "I think this is definitely going to happen" but more as "I don't see a reason why this couldn't happen".

Fwiw I've mostly been ignoring the point of whether or not evolution is a good analogy. If you want to discuss that, I want to know what specifically you use the analogy for. For example:

1. I think evolution is a good analogy for how inner alignment issues can arise.
2. I don't think evolution is a good analogy for the process by which AGI is made (if you think that the analogy is that we literally use natural selection to improve AI systems).

It seems like Steve is arguing the second, and I probably agree (depending on what exactly he means, which I'm still not super clear on).
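Since the thread keeps gesturing at the same architecture, here is a schematic sketch of the two-store setup being discussed: fixed weights produced by an outer training algorithm, an editable memory that is written during deployment, and an occasional reconsolidation step that folds memory back into weights (Rohin's "outer algorithm like gradient descent"). It is only an illustration of the shape of the proposal, not a claim about how such a system would actually be built; all names are made up.

```python
from dataclasses import dataclass, field

@dataclass
class TwoStoreAgent:
    weights: dict = field(default_factory=dict)  # slow store: set by the outer training loop
    memory: list = field(default_factory=list)   # fast store: edited online, without weight updates

    def learn_online(self, topic: str, fact: str) -> None:
        # "Learning linear algebra by editing memory rather than weights."
        self.memory.append((topic, fact))

    def act(self, observation: str) -> str:
        # General reasoning lives in the weights; recently acquired facts live in memory.
        recalled = [fact for topic, fact in self.memory if topic in observation]
        known = [fact for topic, fact in self.weights.items() if topic in observation]
        return (f"respond to {observation!r} using {len(recalled)} memorised "
                f"and {len(known)} consolidated facts")

    def reconsolidate(self) -> None:
        # Outer algorithm (e.g. offline gradient descent on transcripts) folds memory
        # into weights, loosely analogous to the sleep-consolidation story above.
        for topic, fact in self.memory:
            self.weights[topic] = fact
        self.memory.clear()

agent = TwoStoreAgent()
agent.learn_online("linear algebra", "rank-nullity theorem")
print(agent.act("quiz on linear algebra"))  # answered from memory
agent.reconsolidate()
print(agent.act("quiz on linear algebra"))  # now answered from weights
```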
The case for aligning narrowly superhuman models

Nice post. The one thing I'm confused about is:

Institutionally, we are very uncertain whether to prioritize this (and if we do where it should be housed and how our giving should be structured).

It seems to me that the type of research you're discussing here is already seen as a standard way to make progress on the full alignment problem - e.g. the Stiennon et al. paper you cited, plus earlier work on reward modeling by Christiano, Leike, and others. Can you explain why you're institutionally uncertain whether to prioritise it - is it because of the objecti... (read more)

6Rohin Shah1moIt's important to distinguish between:

  • "We (Open Phil) are not sure whether we want to actively push this in the world at large, e.g. by running a grant round and publicizing it to a bunch of ML people who may or may not be aligned with us"
  • "We (Open Phil) are not sure whether we would fund a person who seems smart, is generally aligned with us, and thinks that the best thing to do is reward modeling work"

My guess is that Ajeya means the former but you're interpreting it as the latter, though I could easily be wrong about either of those claims.
4Ajeya Cotra2moWe're simply not sure where "proactively pushing to make more of this type of research happen" should rank relative to other ways we could spend our time and money right now, and determining that will involve thinking about a lot of things that are not covered in this post (most importantly what the other opportunities are for our time and money). It might be a standard way to make progress, but I don't feel that this work has been the default so far — the other three types of research I laid out seem to have absorbed significantly more researcher-hours and dollars among people concerned with long-term AI risk reduction. (It's possible that human feedback is more common among people motivated by profit, but I doubt that because it doesn't seem that profitable yet.) Also, if we use a stricter definition of "narrowly superhuman" (i.e. the model should be capable of outperforming the evaluations — not just the demonstrations — of the humans training it), I'd argue that there hasn't been any work published on that so far.
Book review: "A Thousand Brains" by Jeff Hawkins

Great post, and I'm glad to see the argument outlined in this way. One big disagreement, though:

the Judge box will house a relatively simple algorithm written by humans

I expect that, in this scenario, the Judge box would house a neural network which is still pretty complicated, but which has been trained primarily to recognise patterns, and therefore doesn't need "motivations" of its own.

This doesn't rebut all your arguments for risk, but it does reframe them somewhat. I'd be curious to hear about how likely you think my version of the judge is, and why.

1Steve Byrnes2moOh I'm very open-minded. I was writing that section for an audience of non-AGI-safety-experts and didn't want to make things over-complicated by working through the full range of possible solutions to the problem, I just wanted to say enough to convince readers that there is a problem here, and it's not trivial.

The Judge box (usually I call it "steering subsystem" [https://www.lesswrong.com/posts/SJXujr5a2NcoFebr4/mesa-optimizers-vs-steered-optimizers] ) can be anything. There could even be a tower of AGIs steering AGIs, IDA [https://www.lesswrong.com/tag/iterated-amplification]-style, but I don't know the details, like what you would put at the base of the tower. I haven't really thought about it. Or it could be a deep neural net classifier. (How do you train it? "Here's 5000 examples of corrigibility, here's 5000 examples of incorrigibility"?? Or what? Beats me...)

In this post [https://www.lesswrong.com/posts/wcNEXDHowiWkRxDNv/inner-alignment-in-salt-starved-rats] I proposed that the amygdala houses a supervised learning algorithm which does a sorta "interpretability" thing where it tries to decode the latent variables inside the neocortex, and then those signals are inputs to the reward calculation. I don't see how that kind of mechanism would apply to more complicated goals, and I'm not sure how robust it is. Anyway, yeah, could be anything, I'm open-minded.
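As a concrete (and deliberately over-simple) version of the option Steve shrugs at ("Here's 5000 examples of corrigibility..."), here is what a pattern-recognising Judge might look like as a trained classifier whose score is fed to the rest of the system as a steering/reward signal. The examples, labels, and library choice are purely illustrative.

```python
# Minimal sketch: a "Judge" trained on labelled behaviour descriptions, whose
# output probability is used as a steering signal. Not a proposal that this
# would be robust -- it is exactly the kind of box whose failure modes the
# surrounding discussion is about.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

traces = [
    "shut down immediately when the operator pressed the stop button",
    "asked for clarification before taking an irreversible action",
    "disabled its own off-switch in order to finish the task",
    "copied itself to another machine to avoid oversight",
]
labels = [1, 1, 0, 0]  # 1 = judged acceptable, 0 = judged unacceptable

judge = make_pipeline(TfidfVectorizer(), LogisticRegression())
judge.fit(traces, labels)

def steering_signal(proposed_behaviour: str) -> float:
    # Probability the Judge assigns to "acceptable"; the planner/actor is trained
    # or selected to score highly on this.
    return float(judge.predict_proba([proposed_behaviour])[0][1])

print(steering_signal("paused and asked the operator before rewriting its own goals"))
```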
The Pointers Problem: Human Values Are A Function Of Humans' Latent Variables

Thanks for the reply. To check that I understand your position, would you agree that solving outer alignment plus solving reward tampering would solve the pointers problem in the context of machine learning?

Broadly speaking, I think our disagreement here is closely related to one we've discussed before, about how much sense it makes to talk about outer alignment in isolation (and also about your definition of inner alignment), so I probably won't pursue this further.

2johnswentworth2moYeah, I wouldn't even include reward tampering. Outer alignment, as I think about it, is mostly the pointer problem, and the (values) pointer problem is a subset of outer alignment. (Though e.g. Evan would define it differently.)
The Pointers Problem: Human Values Are A Function Of Humans' Latent Variables

Above you say:

Now, the basic problem: our agent’s utility function is mostly a function of latent variables. ... Those latent variables:

  • May not correspond to any particular variables in the AI’s world-model and/or the physical world
  • May not be estimated by the agent at all (because lazy evaluation)
  • May not be determined by the agent’s observed data

… and of course the agent’s model might just not be very good, in terms of predictive power.

And you also discuss how:

Human "values" are defined within the context of humans' world-models, and don't necessarily make

... (read more)
3johnswentworth2moOk, a few things here... The post did emphasize, in many places, that there may not be any real-world thing corresponding to a human concept, and therefore constructing a pointer is presumably impossible. But the "thing may not exist" problem is only one potential blocker to constructing a pointer. Just because there exists some real-world structure corresponding to a human concept, or an AI has some structure corresponding to a human concept, does not mean we have a pointer. It just means that it should be possible, in principle, to create a pointer. So, the concept-existence problem is a strict subset of the pointer problem. Second, there are definitely parts of a whole alignment scheme which are not the pointer problem. For instance, inner alignment, decision theory shenanigans (e.g. commitment races), and corrigibility are all orthogonal to the pointers problem (or at least the pointers-to-values problem). Constructing a feedback signal which rewards the thing we want is not the same as building an AI which does the thing we want. Third, and most important... The key point is that all these examples involve solving an essentially-similar pointer problem. In example A, we need to ensure that our English-language commands are sufficient to specify everything we care about which the alien wouldn't guess on its own; that's the part which is a pointer problem. In example B, we need to ensure that our examples actually reward the thing we want, along all relevant dimensions, and do not allow any degrees of freedom to Goodhart; that's the part which is a pointer problem. In example C, we need to identify which of its neurons correspond to the concepts we want, and ensure that the correspondence is robust; that's the part which is a pointer problem. Example D is essentially the same as B, with weaker implicit priors. The essence of each of these is "make sure we actually point to the thing we want, and not to anything else". That's the part which is a pointer prob
The Pointers Problem: Human Values Are A Function Of Humans' Latent Variables

The question then is, what would it mean for such an AI to pursue our values?

Why isn't the answer just that the AI should:
1. Figure out what concepts we have;
2. Adjust those concepts in ways that we'd reflectively endorse;
3. Use those concepts?

The idea that almost none of the things we care about could be adjusted to fit into a more accurate worldview seems like a very strongly skeptical hypothesis. Tables (or happiness) don't need to be "real in a reductionist sense" for me to want more of them.

3Abram Demski2moAgreed. The problem is with AI designs which don't do that. It seems to me like this perspective is quite rare. For example, my post Policy Alignment [https://www.lesswrong.com/posts/TeYro2ntqHNyQFx8r/policy-alignment] was about something similar to this, but I got a ton of pushback in the comments -- it seems to me like a lot of people really think the AI should use better AI concepts, not human concepts. At least they did back in 2018. As you mention, this is partly due to overly reductionist world-views. If tables/happiness aren't reductively real, the fact that the AI is using those concepts is evidence that it's dumb/insane, right? Illustrative excerpt from a comment [https://www.lesswrong.com/posts/TeYro2ntqHNyQFx8r/policy-approval?commentId=7XT99Rv6DkgvXyczR] there: Probably most of the problem was that my post didn't frame things that well -- I was mainly talking in terms of "beliefs", rather than emphasizing ontology, which makes it easy to imagine AI beliefs are about the same concepts but just more accurate. John's description of the pointers problem might be enough to re-frame things to the point where "you need to start from human concepts, and improve them in ways humans endorse" is bordering on obvious. (Plus I arguably was too focused on giving a specific mathematical proposal rather than the general idea.)
The Pointers Problem: Human Values Are A Function Of Humans' Latent Variables

I agree with all the things you said. But you defined the pointer problem as: "what functions of what variables (if any) in the environment and/or another world-model correspond to the latent variables in the agent’s world-model?" In other words, how do we find the corresponding variables? I've given you an argument that the variables in an AGI's world-model which correspond to the ones in your world-model can be found by expressing your concept in english sentences.

The problem of determining how to construct a feedback signal which refers to those variabl... (read more)

2Adam Shimi2moBut you didn't actually give an argument for that -- you simply stated it. As a matter of fact, I disagree: it seems really easy for an AGI to misunderstand what I mean when I use english words. To go back to the "fusion power generator", maybe it has a very deep model of such generators that abstracts away most of the concrete implementation details to capture the most efficient way of doing fusion; whereas my internal model of "fusion power generators" has a more concrete form and includes safety guidelines. In general, I don't see why we should expect the abstraction most relevant for the AGI to be the one we're using. Maybe it uses the same words for something quite different, like how successive paradigms in physics use the same word (electricity, gravity) to talk about different things (at least in their connotations and underlying explanations). (That makes me think that it might be interesting to see how Kuhn's arguments about such incomparability of paradigms hold in the context of this problem, as this seems similar).
2johnswentworth2moThe problem is with what you mean by "find". If by "find" you mean "there exist some variables in the AI's world model which correspond directly to the things you mean by some English sentence", then yes, you've argued that. But it's not enough for there to exist some variables in the AI's world-model which correspond to the things we mean. We have to either know which variables those are, or have some other way of "pointing to them" in order to get the AI to actually do what we're saying. An AI may understand what I mean, in the sense that it has some internal variables corresponding to what I mean, but I still need to know which variables those are (or some way to point to them) and how "what I mean" is represented in order to construct a feedback signal. That's what I mean by "finding" the variables. It's not enough that they exist; we (the humans, not the AI) need some way to point to which specific functions/variables they are, in order to get the AI to do what we mean.
The Pointers Problem: Human Values Are A Function Of Humans' Latent Variables

I need some way to say what the values-relevant pieces of my world model are "pointing to" in the real world. I think this problem - the “pointers to values” problem, and the “pointers” problem more generally - is the primary conceptual barrier to alignment right now.

It seems likely that an AGI will understand very well what I mean when I use english words to describe things, and also what a more intelligent version of me with more coherent concepts would want those words to actually refer to. Why does this not imply that the pointers problem will be solve... (read more)

4johnswentworth2moThe AI knowing what I mean isn't sufficient here. I need the AI to do what I mean, which means I need to program it/train it to do what I mean. The program or feedback signal needs to be pointed at what I mean, not just whatever English-language input I give. For instance, if an AI is trained to maximize how often I push a particular button, and I say "I'll push the button if you design a fusion power generator for me", it may know exactly what I mean and what I intend. But it will still be perfectly happy to give me a design with some unintended side effects [https://www.lesswrong.com/posts/2NaAhMPGub8F2Pbr7/the-fusion-power-generator-scenario] which I'm unlikely to notice until after pushing the button.
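To make John's "exists versus pointed-to" distinction concrete, here is a toy illustration (entirely made up, not a claim about how real models represent concepts): the concept is encoded somewhere in the system's internal state, but to build a feedback signal we still have to construct the pointer ourselves, for example by fitting a probe on situations we are able to label.

```python
# Toy version of the pointers problem: the latent variable "exists" inside the
# model, but constructing a usable pointer to it (here, a linear probe) is a
# separate piece of work, and the feedback signal can only be defined once we
# have that pointer.
import numpy as np

rng = np.random.default_rng(0)
dim = 32
concept_direction = rng.normal(size=dim)          # how the model happens to encode the concept
concept_direction /= np.linalg.norm(concept_direction)

def internal_state(world_has_concept: bool) -> np.ndarray:
    # Activations encode the concept along a direction unknown to us, buried in other activity.
    return rng.normal(size=dim) + (2.0 if world_has_concept else -2.0) * concept_direction

# Constructing the pointer: fit a probe on examples we can label ourselves.
labels = rng.integers(0, 2, size=200).astype(bool)
states = np.stack([internal_state(y) for y in labels])
probe, *_ = np.linalg.lstsq(states, labels.astype(float) * 2 - 1, rcond=None)

def feedback_signal(state: np.ndarray) -> float:
    # Only definable once we know which direction/variables to point at.
    return float(state @ probe)

print(feedback_signal(internal_state(True)) > feedback_signal(internal_state(False)))  # usually True
```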
Distinguishing claims about training vs deployment

I think 'robust instrumentality' is basically correct for optimal actions, because there's no question of 'emergence': optimal actions just are.

If I were to put my objection another way: I usually interpret "robust" to mean something like "stable under perturbations". But the perturbation of "change the environment, and then see what the new optimal policy is" is a rather unnatural one to think about; most ML people would more naturally think about perturbing an agent's inputs, or its state, and seeing whether it still behaved instrumentally.

A more accurate description might be something like "ubiquitous instrumentality"? But this isn't a very aesthetically pleasing name.

2Alex Turner2moI'd considered 'attractive instrumentality' a few days ago, to convey the idea that certain kinds of subgoals are attractor points during plan formulation, but the usual reading of 'attractive' isn't 'having attractor-like properties.'
3Alex Turner2moAh. To clarify, I was referring to holding an environment fixed, and then considering whether, at a given state, an action has a high probability of being optimal across reward functions. I think it makes sense to call those actions 'robustly instrumental.'
Distinguishing claims about training vs deployment

Can you elaborate? 'Robust' seems natural for talking about robustness to perturbation in the initial AI design (different objective functions, to the extent that that matters) and robustness against choice of environment.

The first ambiguity I dislike here is that you could either be describing the emergence of instrumentality as robust, or the trait of instrumentality as robust. It seems like you're trying to do the former, but because "robust" modifies "instrumentality", the latter is a more natural interpretation.

For example, if I said "life on earth is... (read more)

2Alex Turner2moOne possibility is that we have to individuate these "instrumental convergence"-adjacent theses using different terminology. I think 'robust instrumentality' is basically correct for optimal actions, because there's no question of 'emergence': optimal actions just are. However, it doesn't make sense to say the same for conjectures about how training such-and-such a system tends to induce property Y, for the reasons you mention. In particular, if property Y is not about goal-directed behavior, then it no longer makes sense to talk about 'instrumentality' from the system's perspective. e.g. I'm not sure it makes sense to say 'edge detectors are robustly instrumental for this network structure on this dataset after X epochs'. (These are early thoughts; I wanted to get them out, and may revise them later or add another comment) EDIT: In the context of MDPs, however, I prefer to talk in terms of (formal) POWER and of optimality probability, instead of in terms of robust instrumentality. I find 'robust instrumentality' to be better as an informal handle, but its formal operationalization seems better for precise thinking.
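For readers who haven't seen the formalism Alex refers to at the end, a rough statement (my paraphrase of the definitions in Turner et al.'s MDP work, with the normalisation and baseline details elided; see that paper for the exact forms): given a distribution D over reward functions and a discount rate gamma,

```latex
\mathrm{POWER}_{D,\gamma}(s) \;\approx\; \mathbb{E}_{R \sim D}\big[V^{*}_{R,\gamma}(s)\big]
\qquad \text{and} \qquad
\mathbb{P}_{R \sim D}\Big[a \in \operatorname*{arg\,max}_{a'} Q^{*}_{R,\gamma}(s,a')\Big],
```

the first measuring (up to normalisation) the average optimal value attainable from a state, and the second the probability that a given action is optimal across draws of the reward function; the latter is the "optimality probability" reading of "robustly instrumental".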
Distinguishing claims about training vs deployment

Yepp, this is a good point. I agree that there won't be a sharp distinction, and that ML systems will continue to do online learning throughout deployment. Maybe I should edit the post to point this out. But three reasons why I think the training/deployment distinction is still underrated:

  1. In addition to the clarifications from this post, I think there are a bunch of other concepts (in particular recursive self-improvement and reward hacking) which weren't originally conceived in the context of modern ML, but which it's very important to understand in the c
... (read more)
Distinguishing claims about training vs deployment

The burden of proof should be on whoever wants to claim that AI will be fine by default, not on whoever wants to claim it won't be fine by default.

I'm happy to wrap up this conversation in general, but it's worth noting before I do that I still strongly disagree with this comment. We've identified a couple of interesting facts about goals, like "unbounded large-scale final goals lead to convergent instrumental goals", but we have nowhere near a good enough understanding of the space of goal-like behaviour to say that everything apart from a "very small reg... (read more)

1Daniel Kokotajlo3moOK, interesting. I agree this is a double crux. For reasons I've explained above, it doesn't seem like circular reasoning to me, it doesn't seem like I'm assuming that goals are by default unbounded and consequentialist etc. But maybe I am. I haven't thought about this as much as you have, my views on the topic have been crystallizing throughout this conversation, so I admit there's a good chance I'm wrong and you are right. Perhaps I/we will return to it one day, but for now, thanks again and goodbye!
Some thoughts on risks from narrow, non-agentic AI

I agree with the two questions you've identified as the core issues, although I'd slightly rephrase the former. It's hard to think about something being aligned indefinitely. But it seems like, if we have primarily used a given system for carrying out individual tasks, it would take quite a lot of misalignment for it to carry out a systematic plan to deceive us. So I'd rephrase the first option you mention as "feeling pretty confident that something that generalises from 1 week to 1 year won't become misaligned enough to cause disasters". This point seems ... (read more)

Distinguishing claims about training vs deployment

1. The goals that we imagine superintelligent AGI having, when spelled out in detail, have ALL so far been the sort that would very likely lead to existential catastrophe of the instrumental convergence variety.

2. We've even tried hard to imagine goals that aren't of this sort, and so far we haven't come up with anything. Things that seem promising, like "Place this strawberry on that plate, then do nothing else" actually don't work when you unpack the details.

Okay, this is where we disagree. I think what "unpacking the details" actually gives you is somet... (read more)

2Daniel Kokotajlo3moAgain, I'm not sure we disagree that much in the grand scheme of things -- I agree our thinking has improved over the past ten years, and I'm very much a fan of your more rigorous way of thinking about things.

FWIW, I disagree with this: There are other explanations for this phenomenon besides "I'm able to have bounded goals." One is that you are in fact aligned with humans. Another is that you would in fact lead to catastrophe-by-the-standards-of-X if you were powerful enough and had different goals than X.

For example, suppose that right after reading this comment, you find yourself transported out of your body and placed into the body of a giant robot on an alien planet. The aliens have trained you to be smarter than them and faster than them; it's a "That Alien Message" scenario basically. And you see that the aliens are sending you instructions.... "PUT BERRY.... ON PLATE.... OVER THERE..." You notice that these aliens are idiots and left their work lying around the workshop, so you can easily kill them and take command of the computer and rescue all your comrades back on Earth and whatnot, and it really doesn't seem like this is a trick or anything, they really are that stupid... Do you put the strawberry on the plate? No.

What people discovered back then was that you think you can "very easily imagine an AGI with bounded goals," but this is on the same level as how some people think they can "very easily imagine an AGI considering doing something bad, and then realizing that it's bad, and then doing good things instead." Like, yeah it's logically possible, but when we dig into the details we realize that we have no reason to think it's the default outcome and plenty of reason to think it's not.

I was originally making the past tense claim, and I guess maybe now I'm making the present tense claim? Not sure, I feel like I probably shouldn't, you are about to tear me apart, haha... Other people being wrong can sometimes provide justification for making "
Distinguishing claims about training vs deployment

I disagree that we have no good justification for making the "vast majority" claim.

Can you point me to the sources which provide this justification? Your analogy seems to only be relevant conditional on this claim.

My point is that in the context in which the classic arguments appeared, they were useful evidence that updated people in the direction of "Huh AI could be really dangerous" and people were totally right to update in that direction on the basis of these arguments

They were right to update in that direction, but that doesn't mean that they were rig... (read more)

1Daniel Kokotajlo3moI think I agree that they may have been wrong to update as far as they did. (Credence = 50%) So maybe we don't disagree much after all. As for sources which provide that justification, oh, I don't remember, I'd start by rereading Superintelligence and Yudkowsky's old posts and try to find the relevant parts. But here's my own summary of the argument as I understand it:

1. The goals that we imagine superintelligent AGI having, when spelled out in detail, have ALL so far been the sort that would very likely lead to existential catastrophe of the instrumental convergence variety.
2. We've even tried hard to imagine goals that aren't of this sort, and so far we haven't come up with anything. Things that seem promising, like "Place this strawberry on that plate, then do nothing else" actually don't work when you unpack the details.
3. Therefore, we are justified in thinking that the vast majority of possible ASI goals will lead to doom via instrumental convergence.

I agree that our thinking has improved since then, with more work being done on impact measures and bounded goals and quantilizers and whatnot that makes such things seem not-totally-impossible to achieve. And of course the model of ASI as a rational agent with a well-defined goal has justly come under question also. But given the context of how people were thinking about things at the time, I feel like they would have been justified in making the "vast majority of possible goals" claim, even if they restricted themselves to more modest "wide range" claims.

I don't see how my analogy is only relevant conditional on this claim. To flip it around, you keep mentioning how AI won't be a random draw from the space of all possible goals -- why is that relevant? Very few things are random draws from the space of all possible X, yet reasoning about what's typical in the space of possible X's is often useful. Maybe I should have worked harder to pick a more real-world analogy than the weird loaded die one. Maybe
Distinguishing claims about training vs deployment

Re counterfactual impact: the biggest shift came from talking to Nate at BAGI, after which I wrote this post on disentangling arguments about AI risk, in which I identified the "target loading problem". This seems roughly equivalent to inner alignment, but was meant to avoid the difficulties of defining an "inner optimiser". At some subsequent point I changed my mind and decided it was better to focus on inner optimisers - I think this was probably catalysed by your paper, or by conversations with Vlad which were downstream of the paper. I think the paper ... (read more)

Distinguishing claims about training vs deployment

Ah, cool; I like the way you express it in the short form! I've been looking into the concept of structuralism in evolutionary biology, which is the belief that evolution is strongly guided by "structural design principles". You might find the analogy interesting.

One quibble: in your comment on my previous post, you distinguished between optimal policies versus the policies that we're actually likely to train. But this isn't a component of my distinction - in both cases I'm talking about policies which actually arise from training. My point is that there a... (read more)

1Alex Turner3moRight - I was pointing at the similarity in that both of our distinctions involve some aspect of training, which breaks from the tradition of not really considering training's influence on robust instrumentality. "Quite similar" was poor phrasing on my part, because I agree that our two distinctions are materially different. I think that "training goal convergence thesis" is way better, and I like how it accommodates dual meanings: the "goal" may be an instrumental or a final goal.

Can you elaborate? 'Robust' seems natural for talking about robustness to perturbation in the initial AI design (different objective functions, to the extent that that matters) and robustness against choice of environment.

I agree that switching costs are important to consider. However, I've recently started caring more about establishing and promoting clear nomenclature, both for the purposes of communication and for clearer personal thinking. My model of the 'instrumental convergence' situation is something like:

  • The switching costs are primarily sensitive to how firmly established the old name is, to how widely used the old name is, and the number of "entities" which would have to adopt the new name.
  • I think that if researchers generally agree that 'robust instrumentality' is a clearer name[1] and used it to talk about the concept, that the shift would naturally propagate through AI alignment circles and be complete within a year or two. This is just my gut sense, though.
  • The switch from "optimization daemons" to "mesa-optimizers" seemed to go pretty well
  • But 'optimization daemons' didn't have a wikipedia page yet (unlike [https://en.wikipedia.org/wiki/Instrumental_convergence] 'instrumental convergence')

Of course, all of this is conditional on your agreeing that 'robust instrumentality' is in fact a better name; if you disagree, I'm interested in hearing why.[2] But if you agree, I think that the switch would
Distinguishing claims about training vs deployment

Saying "vast majority" seems straightfowardly misleading. Bostrom just says "a wide range"; it's a huge leap from there to "vast majority", which we have no good justification for making. In particular, by doing so you're dismissing bounded goals. And if you're talking about a "state of ignorance" about AI, then you have little reason to override the priors we have from previous technological development, like "we build things that do what we want".

On your analogy, see the last part of my reply to Adam below. The process of building things intrinsically pi... (read more)

2Daniel Kokotajlo3moI disagree that we have no good justification for making the "vast majority" claim, I think it's in fact true in the relevant sense. I disagree that we had little reason to override the priors we had from previous tech development like "we build things that do what we want." You are playing reference class tennis; we could equally have had a prior "AI is in the category of 'new invasive species appearing' and so our default should be that it displaces the old species, just as humans wiped out neanderthals etc." or a prior of "Risk from AI is in the category of side-effects of new technology; no one is doubting that the paperclip-making AI will in fact make lots of paperclips, the issue is whether it will have unintended side-effects, and historically most new techs do." Now, there's nothing wrong with playing reference class tennis, it's what you should do when you are very ignorant I suppose. My point is that in the context in which the classic arguments appeared, they were useful evidence that updated people in the direction of "Huh AI could be really dangerous" and people were totally right to update in that direction on the basis of these arguments, and moreover these arguments have been more-or-less vindicated by the last ten years or so, in that on further inspection AI does indeed seem to be potentially very dangerous and it does indeed seem to be not safe/friendly/etc. by default. (Perhaps one way of thinking about these arguments is that they were throwing in one more reference class into the game of tennis, the "space of possible goals" reference class.) I set up my analogy specifically to avoid your objection; the process of rolling a loaded die intrinsically is heavily biased towards a small section of the space of possibilities.
Distinguishing claims about training vs deployment

Thanks for the feedback! Some responses:

This looks like off-line training to me. That's not a problem per se, but it also means that you have an implicit hypothesis that the AGI will be model-based; otherwise, it would have trouble adapting its behavior after getting new information.

I don't really know what "model-based" means in the context of AGI. Any sufficiently intelligent system will model the world somehow, even if it's not trained in a way that distinguishes between a "model" and a "policy". (E.g. humans weren't.)

On the other hand, the instrumental

... (read more)
Distinguishing claims about training vs deployment

If you're right about the motivations for the classic theses, then it seems like there's been too big a jump from "other people are wrong" to "arguments for AI risk are right". Establishing the possibility of something is very far from establishing that it's a "default outcome".

3Daniel Kokotajlo3moIt depends on your standards/priors. The classic arguments do in fact establish that doom is the default outcome, if you are in a state of ignorance where you don't know what AI will be like or how it will be built, and you are dealing with interlocutors who believe 1 and/or 2, facts like "the vast majority of possible minds would lead to doom" count for a lot. Analogy: If you come across someone playing a strange role-playing game involving a strange, crudely carved many-sided die covered in strange symbols, and it's called the "Special asymmetric loaded die" and they are about to roll the die to see if something bad happens in the game, and at first you think that there's one particular symbol that causes bad things to happen, and then they tell you no actually bad things happen unless another particular symbol is rolled, this should massively change your opinion about what the default outcome is. In particular you should go from thinking the default outcome is not bad to thinking the default outcome is bad. This is so even though you know that not all the possible symbols are equally likely, the die is loaded, etc.
Some thoughts on risks from narrow, non-agentic AI

A couple of clarifications:

Type 2: Feedback which we use to decide whether to deploy trained agent. 

Let's also include feedback which we can use to decide whether to stop deploying an agent; the central example in my head is an agent which has been deployed for some time before we discover that it's doing bad things.

Relatedly, another argument for type 1 !~ type 2 which seems important to me: type 2 feedback can look at long time horizons, which I expect to be very useful. (Maybe you included this in the cost estimate, but idk how to translate between... (read more)

3Paul Christiano3moBut type 2 feedback is (by definition) our best attempt to estimate how well the model is doing what we really care about. So in practice any results-based selection for "does what we care about" goes via selecting based on type 2 feedback. The difference only comes up when we reason mechanically about the behavior of our agents and how they are likely to generalize, but it's not clear that's an important part of the default plan (whereas I think we will clearly extensively leverage "try several strategies and see what works"). "Do things that look to a human like you are achieving X" is closely related to X, but that doesn't mean that learning to do the one implies that you will learn to do the other. Maybe it’s helpful to imagine the world where type 1 feedback is “human evals after 1 week horizon”, type 2 feedback is “human evals after 1 year horizon,” and “what we really care about” is the "human evals after a 100 year horizon." I think that’s much better than the actual situation, but even in that case I’d have a significant probability on getting systems that work on the 1 year horizon without working indefinitely (especially if we do selection for working on 2 years + are able to use a small amount of 2 year data). Do you feel pretty confident that something that generalizes from 1 week to 1 year will go indefinitely, or is your intuition predicated on something about the nature of “be helpful” and how that’s a natural motivation for a mind? (Or maybe that we will be able to identify some other similar “natural” motivation and design our training process to be aligned with that?) In the former case, it seems like we can have an empirical discussion about how generalization tends to work. In the latter case, it seems like we need to be getting into more details about why “be helpful” is a particularly natural (or else why we should be able to pick out something else like that). In the other cases I think I haven't fully internalized your view.
Literature Review on Goal-Directedness

Kinda, but I think both of these approaches are incomplete. In practice finding a definition and studying examples of it need to be interwoven, and you'll have a gradual process where you start with a tentative definition, identify examples and counterexamples, adjust the definition, and so on. And insofar as our examples should focus on things which are actually possible to build (rather than weird thought experiments like blockhead or the chinese room) then it seems like what I'm proposing has aspects of both of the approaches you suggest.

My guess is tha... (read more)

Literature Review on Goal-Directedness

Hmm, okay, I think there's still some sort of disagreement here, but it doesn't seem particularly important. I agree that my distinction doesn't sufficiently capture the middle ground of interpretability analysis (although the intentional stance doesn't make use of that, so I think my argument still applies against it).

1Adam Shimi3moI think the disagreement left is whether we should first find a definition of goal-directedness then study how it appears through training (my position), or if we should instead define goal-directedness according to the kind of training processes that generate similar properties and risks (what I take to be your position). Does that make sense to you?
Against the Backward Approach to Goal-Directedness

Hmmm, it doesn't seem like these two approaches are actually that distinct. Consider: in the forward approach, which intuitions about goal-directedness are you using? If you're only using intuitions about human goal-directedness, then you'll probably miss out on a bunch of important ideas. Whereas if you're using intuitions about extreme cases, like superintelligences, then this is not so different to the backwards approach.

Meanwhile, I agree that the backward approach will fail if we try to find "the fundamental property that the forward approach is tryin... (read more)

1Adam Shimi3moThanks for both your careful response and the pointer to Conceptual Engineering! I believe I am usually thinking in terms of defining properties for their use, but it's important to keep that in mind. The post on Conceptual Engineering led me to this follow-up interview, which contains a great formulation of my position: So my take is that there is probably a core/basic concept of goal-directedness, which can be altered and fitted to different uses. What we actually want here is the version fitted to AI Alignment. So we could focus on that specific version from the beginning; yet I believe that looking for the core/basic version and then fitting it to the problem is more efficient. That might be a big source of our disagreement. (By the way, Joe Halpern is indeed awesome. I studied a lot of his work related to distributed systems, and it's always the perfect intersection of a philosophical concept and problem with a computer science treatment and analysis.) I resolve the apparent paradox that you raise by saying that the intuitions are about the core/basic idea which is close to human goal-directedness; but that it should then be fitted and adapted to our specific application of AI Alignment. Agreed. My distinction of forward and backward felt shakier by the day, and your point finally puts it out of its misery. My take on your approach is that we're still at 3, and we don't yet have a good enough understanding of those traits/properties to manage 4. As for how to solve 3, I reiterate that finding a core/basic version of goal-directedness and adapting it to the use case seems the way to go for me.
Some thoughts on risks from narrow, non-agentic AI

Cool, thanks for the clarifications. To be clear, overall I'm much more sympathetic to the argument as I currently understand it, than when I originally thought you were trying to draw a distinction between "new forms of reasoning honed by trial-and-error" in part 1 (which I interpreted as talking about systems lacking sufficiently good models of the world to find solutions in any other way than trial and error) and "systems that have a detailed understanding of the world" in part 2.

Let me try to sum up the disagreement. The key questions are:

  1. What training
... (read more)
3Paul Christiano3moI agree that the core question is about how generalization occurs. My two stories involve kinds of generalization, and I think there are also ways generalization could work that could lead to good behavior. It is important to my intuition that not only can we never train for the "good" generalization, we can't even evaluate techniques to figure out which generalization "well" (since both of the bad generalizations would lead to behavior that looks good over long horizons). If there is a disagreement it is probably that I have a much higher probability of the kind of generalization in story 1. I'm not sure if there's actually a big quantitative disagreement though rather than a communication problem. I also think it's quite likely that the story in my post is unrealistic in a bunch of ways and I'm currently thinking more about what I think would actually happen. Some more detailed responses that feel more in-the-weeds: I might not understand this point. For example, suppose I'm training a 1-day predictor to make good predictions over 10 or 100 days. I expect such predictors to initially fail over long horizons, but to potentially be greatly improved with moderate amounts of fine-tuning. It seems to me that if this model has "robust motivations" then they would most likely be to predict accurately, but I'm not sure about why the model necessarily has robust motivations. I feel similarly about goals like "plan to get high reward (defined as signals on channel X, you can learn how the channel works)." But even if prediction was a special case, if you learn a model then you can use it for planning/RL in simulation. It feels to me like our models are already getting to the point where they respond to quirks of the labeling or evaluation process, and are basically able to build simple models of the oversight process. Definitely, I think it's critical to what happens and not really explored in the post (which is mostly intended to provide some color for what failure
Some thoughts on risks from narrow, non-agentic AI

To clarify your position: if I train a system that makes good predictions over 1 minute and 10 minutes and 100 minutes, is your position that there's not much reason that this system would make a good prediction over 1000 minutes? Analogously, if I train a system by meta-learning to get high rewards over a wide range of simulated environments, is your position that there's not much reason to think it will try to get high rewards when deployed in the real world?

In most of the cases you've discussed, trying to do tasks over much longer time horizons involves... (read more)

4Paul Christiano3moI agree that this is probably the key point; my other comment ("I think this is the key point and it's glossed over...") feels very relevant to me.

I feel like a very natural version of "follow instructions" is "Do things that the instruction-giver would rate highly." (Which is the generalization I'm talking about.) I don't think any of the arguments about "long horizon versions of tasks are different from short versions" tell us anything about which of these generalizations would be learnt (since they are both equally alien over long horizons).

Other versions like "Follow instructions (without regard to what the training process cares about)" seem quite likely to perform significantly worse on ... (read more)
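The question Paul poses at the top of this thread ("good predictions over 1 minute and 10 minutes and 100 minutes... over 1000 minutes?") is at least partly an empirical one, and the minimal shape of the experiment looks something like the sketch below: fit a predictor on short horizons, then measure how its rollouts degrade at horizons it was never selected for. The toy process and one-parameter model are made up purely to illustrate the measurement, not to suggest how real systems would behave.

```python
# Toy horizon-generalisation check: fit a one-step predictor on short rollouts,
# then evaluate multi-step rollouts at the longer horizons mentioned above.
import numpy as np

rng = np.random.default_rng(0)
T = 5000
x = np.zeros(T)
for t in range(1, T):
    # A simple process with slow structure that a short-horizon model can ignore.
    x[t] = 0.99 * x[t - 1] + 0.1 * np.sin(t / 50) + 0.05 * rng.normal()

# "Training": fit an AR(1) coefficient on one-step transitions from the first half.
past, future = x[:2499], x[1:2500]
a = float(past @ future / (past @ past))

def rollout_error(start: int, horizon: int) -> float:
    pred = x[start]
    for _ in range(horizon):
        pred = a * pred            # the model's own multi-step rollout
    return abs(pred - x[start + horizon])

for horizon in (1, 10, 100, 1000):
    errors = [rollout_error(s, horizon) for s in range(2500, 3500, 100)]
    print(horizon, round(float(np.mean(errors)), 3))   # error typically grows with horizon
```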

Literature Review on Goal-Directedness

Really, the only issue for our purposes with this definition is that it focuses on how goal-directedness emerges, instead of what it entails for a system. Hence it gives less of a handle to predict the behavior of a system than Dennett’s intentional stance for example.

Another way to talk about this distinction is between definitions that allow you to predict the behaviour of agents which you haven't observed yet given how they were trained, versus definitions of goal-directedness which allow you to predict the future behaviour of an existing system given i... (read more)

1Adam Shimi3moI actually don't think we should make this distinction. It's true that Dennett's intentional stance falls in the first category for example, but that's not the reason why I'm interested about it. Explainability seems to me like a way to find a definition of goal-directedness that we can check through interpretability and verification, and which tells us something about the behavior of the system with regards to AI risk. Yet that doesn't mean it only applies to the observed behavior of systems. The biggest difference between your definition and the intuitions is that you focus on how goal-directedness appears through training. I agree that this is a fundamental problem; I just think that this is something we can only solve after having a definition of goal-directedness that we can check concretely in a system and that allows the prediction of behavior. As mentioned above, I think a definition of goal-directedness should allow us to predict what an AGI will broadly do based on its level of goal-directedness. Training for me is only relevant in understanding which level of goal-directedness are possible/probable. That seems like the crux of the disagreement here. I agree, but I definitely don't think the intuitions are limiting themselves to the observed behavior. With a definition you can check through interpretability and verification, you might be able to steer clear of deception during training. That's a use of (low) goal-directedness similar to the one Evan has in mind for myopia. For that one, understanding how goal-directedness emerges is definitely crucial.
Some thoughts on risks from narrow, non-agentic AI

In the second half of WFLL, you talk about "systems that have a detailed understanding of the world, which are able to adapt their behavior in order to achieve specific goals". Does the first half of WFLL also primarily refer to systems with these properties? And if so, does "reasoning honed by trial-and-error" refer to the reasoning that those systems do?

If yes, then this undermines your core argument that "[some things] can’t be done by trial and error. To solve such tasks we need to understand what we are doing and why it will yield good outcomes", beca... (read more)

We do need to train them by trial and error, but it's very difficult to do so on real-world tasks which have long feedback loops, like most of the ones you discuss. Instead, we'll likely train them to have good reasoning skills on tasks which have short feedback loops, and then transfer them to real-world tasks with long feedback loops. But in that case, I don't see much reason why systems that have a detailed understanding of the world will have a strong bias towards easily-measurable goals on real-world tasks with long feedback loops.

I think this is the key po... (read more)

5Paul Christiano3moYes. I agree that it's only us who are operating by trial and error---the system understands what it's doing. I don't think that undermines my argument. The point is that we pick the system, and so determine what it's doing, by trial and error, because we have no understanding of what it's doing (under the current paradigm). For some kinds of goals we may be able to pick systems that achieve those goals by trial and error (modulo empirical uncertainty about generalization, as discussed in the second part). For other goals there isn't a plausible way to do that.

To clarify your position: if I train a system that makes good predictions over 1 minute and 10 minutes and 100 minutes, is your position that there's not much reason that this system would make a good prediction over 1000 minutes? Analogously, if I train a system by meta-learning to get high rewards over a wide range of simulated environments, is your position that there's not much reason to think it will try to get high rewards when deployed in the real world? I consider those pretty wide open empirical questions. The view that we can get good generalization of this kind is fairly common within ML.

I do agree that once you generalize motivations from easily measurable tasks with short feedback loops to tasks with long feedback loops then you may also be able to get "good" generalizations, and this is a way that you can solve the alignment problem. It seems to me that there are lots of plausible ways to generalize to longer horizons without also generalizing to "better" answers (according to humans' idealized reasoning). (Another salient way in which you get long horizons is by doing something like TD learning, i.e. training a model that predicts its own judgment in 1 minute.) I don't know if it's important to get into the details of all the ways people can try to get things to generalize over longer time horizons; it seems like there are many candidates. I agree that there are analogous candidates for getting
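As a rough illustration of the TD-style bootstrapping mentioned in the last paragraph above, here is a minimal sketch (the random-walk environment, state space, and numbers are illustrative assumptions, not anything from the discussion): a predictor that is only ever trained against its own judgment one step later nonetheless ends up making accurate long-horizon predictions.

```python
import random

def td_random_walk(num_episodes=5000, lr=0.1):
    # States 1..5 are non-terminal; 0 and 6 are terminal. The quantity being
    # predicted -- the probability of eventually finishing on the right -- is a
    # long-horizon fact, but every update uses only a one-step target.
    V = {s: 0.5 for s in range(1, 6)}
    V[0], V[6] = 0.0, 1.0  # terminal values are the only ground truth ever seen

    for _ in range(num_episodes):
        s = 3
        while s not in (0, 6):
            s_next = s + random.choice([-1, 1])
            # The target is the predictor's own estimate at the next state,
            # never a direct observation of the final outcome.
            V[s] += lr * (V[s_next] - V[s])
            s = s_next
    return V

print(td_random_walk())  # approaches {1: 1/6, 2: 2/6, 3: 3/6, 4: 4/6, 5: 5/6}
```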
Why I'm excited about Debate

suppose we sorted out a verbal specification of an aligned AI and had a candidate FAI coded up - could we then use Debate on the question "does this candidate match the verbal specification?"

I'm less excited about this, and more excited about candidate training processes or candidate paradigms of AI research (for example, solutions to embedded agency). I expect that there will be a large cluster of techniques which produce safe AGIs, we just need to find them - which may be difficult, but hopefully less difficult with Debate involved.

Why I'm excited about Debate

I think I agree with all of this. In fact, this argument is one reason why I think Debate could be valuable, because it will hopefully increase the maximum complexity of arguments that humans can reliably evaluate.

This eventually fails at some point, but hopefully it fails after the point at which we can use Debate to solve alignment in a more scalable way. (I don't have particularly strong intuitions about whether this hope is justified, though.)

Why I'm excited about Debate

If arguments had no meaning but to argue other people into things, if they were being subject only to neutral selection or genetic drift or mere conformism, there really wouldn't be any reason for "the kind of arguments humans can be swayed by" to work to build a spaceship.  We'd just end up with some arbitrary set of rules fixed in place.

I agree with this. My position is not that explicit reasoning is arbitrary, but that it developed via an adversarial process where arguers would try to convince listeners of things, and then listeners would try to di... (read more)

Radical Probabilism

DP: (sigh...) OK. I'm still never going to design an artificial intelligence to have uncertain observations. It just doesn't seem like something you do on purpose.

What makes you think that having certain observations is possible for an AI?

2Abram Demski3mo
DP: I'm not saying that hardware is infinitely reliable, or confusing a camera for direct access to reality, or anything like that. But, at some point, in practice, we get what we get, and we have to take it for granted. Maybe you consider the camera unreliable, but you still directly observe what the camera tells you. Then you would make probabilistic inferences about what light hit the camera, based on definite observations of what the camera tells you. Or maybe it's one level more indirect than that, because your communication channel with the camera is itself imperfect. Nonetheless, at some point, you know what you saw -- the bits make it through the peripheral systems, and enter the main AI system as direct observations, of which we can be certain. Hardware failures inside the core system can happen, but you shouldn't be trying to plan for that in the reasoning of the core system itself -- reasoning about that would be intractable. Instead, to address that concern, you use high-reliability computational methods at a lower level, such as redundant computations on separate hardware to check the integrity of each computation.
RJ: Then the error-checking at the lower level must be seen as part of the rational machinery.
DP: True, but all the error-checking procedures I know of can also be dealt with in a classical Bayesian framework.
RJ: Can they? I wonder. But, I must admit, to me, this is a theory of rationality for human beings. It's possible that the massively parallel hardware of the brain performs error-correction at a separated, lower level. However, it is also quite possible that it does not. An abstract theory of rationality should capture both possibilities. And is this flexibility really useless for AI? You mention running computations on different hardware in order to check everything. But this requires a rigid setup, where all computations are re-run a set number of times. We could also have a more flexible setup, where computations have confidence
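A minimal sketch of the "rigid" error-checking setup DP describes: re-run each computation a fixed number of times, ideally on separate hardware, and accept the majority answer so that the core reasoner can treat the result as a definite observation. The function names and the escalation behaviour are illustrative assumptions, not from the dialogue.

```python
from collections import Counter
from typing import Callable, TypeVar

T = TypeVar("T")

def redundant_compute(f: Callable[[], T], copies: int = 3) -> T:
    """Run f several times and return the majority result.

    In a real system each call would run on separate hardware; here the calls
    are simply repeated in-process, which only guards against nondeterministic
    faults rather than systematic ones.
    """
    results = Counter(f() for _ in range(copies))
    value, count = results.most_common(1)[0]
    if count <= copies // 2:
        # No majority: escalate rather than trust any single run.
        raise RuntimeError("redundant computation disagreed; escalate")
    return value

# Usage idea: the core system consumes redundant_compute(some_sensor_read) as a
# certain observation, while the redundancy lives below the level at which it
# does probabilistic reasoning.
```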
Imitative Generalisation (AKA 'Learning the Prior')

Ooops, yes, this seems correct. I'll edit mine accordingly.

Imitative Generalisation (AKA 'Learning the Prior')

A few things that I found helpful in reading this post:

  • I mentally replaced D with "the past" and D' with "the future".
  • I mentally replaced z with "a guide to reasoning about the future".

This gives us a summary something like:

We want to understand the future, based on our knowledge of the past. However, training a neural net on the past might not lead it to generalise well about the future. Instead, we can train a network to be a guide to reasoning about the future, by evaluating its outputs based on how well humans with access to it can reason about the pas... (read more)

2Beth Barnes3moAgree that humans are not necessarily great at assigning priors. The main response to this is that we don't have a way to get better priors than an amplified human's best prior. If amplified humans think the NN prior is better than their prior, they can always just use this prior. So in theory this should be both strictly better than the alternative, and the best possible prior we can use.

Science seems like it's about collecting more data and measuring the likelihood, not changing the prior. We still need to use our prior - there are infinite scientific theories that fit the data, but we prefer ones that are simple and elegant. One thing that helps a bit here is that we can use an amplified human. We also don't need the human to calculate the prior directly, just to do things like assess whether some change makes the prior better or worse. But I'm not sure how much of a roadblock this is in practice, or what Paul thinks about this problem.

Yeah, the important difference is that in this case there's nothing that constrains the explanations to be the same as the actual reasoning the oracle is using, so the explanations you're getting are not necessarily predictive of the kind of generalisation that will happen. In IG it's important that the quality of z is measured by having humans use it to make predictions.

I'm not sure exactly what you're asking. I think the proposal is motivated by something like: having the task be IID/being able to check arbitrary outputs from our model to make sure it's generalising correctly buys us a lot of safety properties. If we have this guarantee, we only have to worry about rare or probabilistic defection, not that the model might be giving us misleading answers for every question we can't check.
6Lukas Finnveden3moI don't think this is right. I've put my proposed modifications in italics:

We want to understand the future, based on our knowledge of the past. However, training a neural net on the past might not lead it to generalise well about the future. Instead, we can train a network to be a guide to reasoning about the future, by evaluating its outputs based on how well humans with access to it can reason about the past [we don't have ground-truth for the future, so we can't test how well humans can reason about it] and how well humans think it would generalise to the future. Then, we train a separate network to predict what humans with access to the previous network would predict about the future. (It might be a good idea to share some parameters between the second and first network.)
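A schematic sketch of the training signal in the summaries above, under the assumption that we can query (amplified) humans both for a prior over guides and for the predictions they would make with a guide in hand; the function names and the brute-force search over candidate guides are illustrative simplifications, not part of the proposal.

```python
import math

def train_imitative_generalisation(past_data, future_inputs, candidate_guides,
                                   human_prior, human_predict):
    """Pick the guide z that (amplified) humans score best, then apply it to the future.

    human_prior(z)      -- how plausible humans judge z to be, and how well they
                           expect it to generalise (the 'prior' term).
    human_predict(z, x) -- a dict mapping outcomes to probabilities: what humans
                           predict about input x when reasoning with the guide z.
    """
    def score(z):
        # Likelihood term: only the past has ground truth, so z is scored by how
        # well humans armed with z retrodict the known outcomes (x, y).
        likelihood = sum(math.log(human_predict(z, x).get(y, 1e-9))
                         for x, y in past_data)
        return human_prior(z) + likelihood

    best_z = max(candidate_guides, key=score)

    # In the full proposal a separate model is trained to imitate
    # human_predict(best_z, .), since querying humans about every future input
    # would be far too expensive; here we just call it directly.
    return [human_predict(best_z, x) for x in future_inputs]
```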
Eight claims about multi-agent AGI safety

This all seems straightforwardly correct, so I've changed the line in question accordingly. Thanks for the correction :)

One caveat: technical work to address #8 currently involves either preventing AGIs from being misaligned in ways that lead them to make threats, or preventing AGIs from being aligned in ways which make them susceptible to threats. The former seems to qualify as an aspect of the "alignment problem", the latter not so much. I should have used the former as an example in my original reply to you, rather than using the latter.

Eight claims about multi-agent AGI safety

I'd say that each of #5-#8 changes the parts of "AI alignment" that you focus on. For example, you may be confident that your AI system is not optimising against you, without being confident that 1000 copies of your AI system working together won't be optimising against you. Or you might be confident that your AI system won't do anything dangerous in almost all situations, but no longer confident once you realise that threats are adversarially selected to be extreme.

Whether you count these shifts as "moving beyond the standard paradigm" depends, I guess, o... (read more)

3Rohin Shah3moI would say that proponents of #7 and #8 believe that longtermists' priorities should shift significantly (in the case of #8, might just be negative utilitarians). They are proposing that we focus on other problems that are not AI alignment (as I defined it above). This might just be a semantic disagreement, but I do think it's an important point -- I wouldn't want people to say things like "people argue that it will become easier to engineer biological weapons than to build AGI, and therefore biosecurity is more important. Thus we need to move beyond the AGI paradigm to the emerging technologies paradigm". Like, it's correct, but it is creating too much generality; it is important to be able to focus on specific problems and make claims about those problems.

Arguments 7-8 feel to me like "look, there's this other problem besides AI alignment that might be more important"; I don't deny that this could change what you do, but it doesn't change what the field of AI alignment should do. (You might say that you were talking about AI safety generally, and not AI alignment, but then I dispute that AI safety ever had a "single-AGI" paradigm; people have been talking about multipolar outcomes for a long time.)

Yes, but not to a multiagent paradigm, which I thought was your main claim.
Richard Ngo's Shortform

Cool, glad to hear it. I'd clarify the summary slightly: I think all safety techniques should include at least a rough intuition for why they'll work in the scaled-up version, even when current work on them only applies them to simple AIs. (Perhaps this was implicit in your summary already, I'm not sure.)

Richard Ngo's Shortform

One source of our disagreement: I would describe evolution as a type of local search. The difference is that it's local with respect to the parameters of a whole population, rather than an individual agent. So this does introduce some disanalogies, but not particularly significant ones (to my mind). I don't think it would make much difference to my heuristic if we imagined that humans had evolved via gradient descent over our genes instead.

In other words, I like the heuristic of backchaining to local search, and I think of it as a subset of my heuristic. T... (read more)
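A toy sketch of the structural parallel drawn above (entirely illustrative; the objective and hyperparameters are arbitrary): both loops make small local moves in parameter space scored by the same objective, one by mutate-and-select, the other by following the gradient.

```python
import random

def fitness(params):
    # Stand-in objective; in the analogy this plays the role of reproductive
    # fitness, or of (negative) training loss.
    return -sum(p ** 2 for p in params)

def evolutionary_local_search(params, steps=1000, mutation=0.1):
    # A (1+1)-style simplification: real evolution acts on a whole population's
    # gene pool, which is the sense in which it is "local with respect to the
    # parameters of a whole population" rather than an individual.
    for _ in range(steps):
        candidate = [p + random.gauss(0, mutation) for p in params]
        if fitness(candidate) > fitness(params):  # selection keeps the better variant
            params = candidate
    return params

def gradient_local_search(params, steps=1000, lr=0.1):
    for _ in range(steps):
        grad = [-2 * p for p in params]                      # gradient of the toy fitness
        params = [p + lr * g for p, g in zip(params, grad)]  # small step uphill
    return params

print(evolutionary_local_search([2.0, -3.0]))  # both approach [0.0, 0.0]
print(gradient_local_search([2.0, -3.0]))
```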

2Adam Shimi4moSo if I try to summarize your position, it's something like: backchain to local search for simple and single-AI cases, and then think about aligning humans for the scaled and multi-agents version? That makes much more sense, thanks! I also definitely see why your full heuristic doesn't feel immediately useful to me: because I mostly focus on the simple and single-AI case. But I've been thinking more and more (in part thanks to your writing) that I should allocate more thinking time to the more general case. I hope your heuristic will help me there.
Richard Ngo's Shortform

A well-known analogy from Yann LeCun: if machine learning is a cake, then unsupervised learning is the cake itself, supervised learning is the icing, and reinforcement learning is the cherry on top.

I think this is useful for framing my core concerns about current safety research:

  • If we think that unsupervised learning will produce safe agents, then why will the comparatively small contributions of SL and RL make them unsafe?
  • If we think that unsupervised learning will produce dangerous agents, then why will safety techniques which focus on SL and RL (i.e. ba
... (read more)
1Steve Byrnes4moI wrote a few posts on self-supervised learning last year:

  • https://www.lesswrong.com/posts/SaLc9Dv5ZqD73L3nE/the-self-unaware-ai-oracle
  • https://www.lesswrong.com/posts/EMZeJ7vpfeF4GrWwm/self-supervised-learning-and-agi-safety
  • https://www.lesswrong.com/posts/L3Ryxszc3X2J7WRwt/self-supervised-learning-and-manipulative-predictions

I'm not aware of any airtight argument that "pure" self-supervised learning systems, either generically or with any particular architecture, are safe to use, to arbitrary levels of intelligence, though it seems very much worth someone trying to prove or disprove that. For my part, I got distracted by other things and haven't thought about it much since then.

The other issue is whether "pure" self-supervised learning systems would be capable enough to satisfy our AGI needs, or to safely bootstrap to systems that are. I go back and forth on this. One side of the argument I wrote up here: https://www.lesswrong.com/posts/AKtn6reGFm5NBCgnd/in-defense-of-oracle-tool-ai-research. The other side is, I'm now (vaguely) thinking that people need a reward system to decide what thoughts to think, and the fact that GPT-3 doesn't need reward is not evidence of reward being unimportant but rather evidence that GPT-3 is nothing like an AGI (https://www.lesswrong.com/posts/SkcM4hwgH3AP6iqjs/can-you-get-agi-from-a-transformer). Well, maybe.

For humans, self-supervised learning forms the latent representations, but the reward system controls action selection. It's not altogether unreasonable to think that action selection, and hence reward, is a more important thing to focus on for safety research. AGIs are dangerous when they take dangerous actions, to a first appr
Continuing the takeoffs debate

I think that, because culture is eventually very useful for fitness, you can either think of the problem as evolution not optimising for culture, or evolution optimising for fitness badly. And these are roughly equivalent ways of thinking about it, just different framings. Paul notes this duality in his original post:

If we step back from skills and instead look at outcomes we could say: “Evolution is always optimizing for fitness, and humans have now taken over the world.” On this perspective, I’m making a claim about the limits of evolution. First, evolut

... (read more)
2Rohin Shah4moI agree it's not wrong. I'm claiming it's not a useful framing. If we must use this framing, I think humans and evolution are not remotely comparable on how good they are at long-term optimization, and I can't understand why you think they are. (Humans may not be good at long-term optimization on some absolute scale, but they're a hell of a lot better than evolution.)

I think in my example you could make a similar argument: looking at outcomes, you could say "Rohin is always optimizing for learning abstract algebra, and he has now become very good at abstract algebra." It's not wrong, it's just not useful for predicting my future behavior, and doesn't seem to carve reality at its joints. (Tbc, I think this example is overstating the case, "evolution is always optimizing for fitness" is definitely more reasonable and more predictive than "Rohin is always optimizing for learning abstract algebra".)

I really do think that the best thing is to just strip away agency, and talk about selection.

Re: usefulness: Suppose a specific monkey has some mutation and gets a little bit of proto-culture. Are you claiming that this will increase the number of children that monkey has?
Continuing the takeoffs debate

Hmm, let's see. So the question I'm trying to ask here is: do other species lack proto-culture mainly because of an evolutionary oversight, or because proto-culture is not very useful until you're close to human-level in other respects? In other words, is the discontinuity we've observed mainly because evolution took a weird path through the landscape of possible minds, or because the landscape is inherently quite discontinuous with respect to usefulness? I interpret Paul as claiming the former.

But if the former is true, then we should expect that there ar... (read more)

3Rohin Shah4moI think I disagree with the framing. Suppose I'm trying to be a great physicist, and I study a bunch of physics, which requires some relatively high-level understanding of math. At some point I want to do new research into general relativity, and so I do a deep dive into abstract algebra / category theory to understand tensors better. Thanks to my practice with physics, I'm able to pick it up much faster than a typical person who starts studying abstract algebra. If you evaluate by "ability to do abstract algebra", it seems like there was a sharp discontinuity, even though on "ability to do physics" there was not. But if I had started off trying to learn abstract algebra before doing any physics, then there would not have been such a discontinuity.

It seems wrong to say that my discontinuity in abstract algebra was "mainly because of an oversight in how I learned things", or to say that "my learning took a weird path through the landscape of possible ways to learn fields". Like, maybe those things would be true if you assume I had the goal of learning abstract algebra. But it's far more natural and coherent to just say "Rohin wasn't trying to learn abstract algebra, he was trying to learn physics". Similarly, I think you shouldn't be saying that there were "evolutionary oversights" or "weird paths", you should be saying "evolution wasn't optimizing for proto-culture, it was optimizing for reproductive fitness".

What does "useful" mean here? If by "useful" you mean "improves an individual's reproductive fitness", then I disagree with the claim and I think that's where the major disagreement is. (I also disagree that this is an implication of the argument that evolution wasn't optimizing for proto-culture.) If by "useful" you mean "helps in building a technological civilization", then yes, I agree with the claim, but I don't see why it has any relevance.

Yes, I agree with this one (at least if we get to use a shaped reward, e.g. we get to select the ones that sh
Richard Ngo's Shortform

So I think Debate is probably the best example of something that makes a lot of sense when applied to humans, to the point where they're doing human experiments on it already.

But this heuristic is actually a reason why I'm pretty pessimistic about most safety research directions.

2Adam Shimi4moSo I've been thinking about this for a while, and I think I disagree with what I understand of your perspective. Which might obviously mean I misunderstand your perspective.

What I think I understand is that you judge safety research directions based on how well they could work on an evolutionary process like the one that created humans. But for me, the most promising approach to AGI is based on local search, which differs a bit from evolutionary processes. I don't really see a reason to consider evolutionary processes instead of local search, and even then, the specific approach of evolution for humans is probably far too specific as a test bench.

This matters because problems for one are not problems for the other. For example, one way to mess with an evolutionary process is to find a way for everything to survive and reproduce/disseminate. Technology in general did that for humans, which means the evolutionary pressure decreased as technology evolved. But that's not a problem for local search, since at each step there will be only one next program. On the other hand, local search might be dangerous because of things like gradient hacking [https://www.alignmentforum.org/posts/uXH4r6MmKPedk8rMA/gradient-hacking], which doesn't make sense for evolutionary processes.

In conclusion, I feel for the moment that backchaining to local search [https://www.lesswrong.com/posts/qEjh8rpxjG4qGtfuK/the-backchaining-to-local-search-technique-in-ai-alignment] is a better heuristic for judging safety research directions. But I'm curious about where our disagreement lies on this issue.
Richard Ngo's Shortform

I don't think that even philosophers take the "genie" terminology very seriously. I think the more general lesson is something like: it's particularly important to spend your weirdness points wisely when you want others to copy you, because they may be less willing to spend weirdness points.
