Paul Christiano


Iterated Amplification

Reward is not the optimization target

(By the way, I’m not sure why your original comment brought up inclusive genetic fitness at all; aren’t we talking about within-lifetime RL? The within-lifetime reward function is some complicated thing involving hunger and sex and friendship etc., not inclusive genetic fitness, right?)

This was mentioned in OP ("The argument would prove too much. Evolution selected for inclusive genetic fitness, and it did not get IGF optimizers."). It also appears to be a much stronger argument for the OP's position and so seemed worth responding to.

I think incomplete exploration is very important in this context and I don’t quite follow why you de-emphasize that in your first comment. In the context of within-lifetime learning, perfect exploration entails that you try dropping an anvil on your head, and then you die. So we don’t expect perfect exploration; instead we’d presumably design the agent such that it explores if and only if it “wants” to explore, in a way that can involve foresight.

It seems to me that incomplete exploration doesn't plausibly cause you to learn "task completion" instead of "reward" unless the reward function is perfectly aligned with task completion in practice. That's an extremely strong condition, and if the entire OP is conditioned on that assumption then I would expect it to have been mentioned.

I didn’t write the OP. If I were writing a post like this, I would (1) frame it as a discussion of a more specific class of model-based RL algorithms (a class that includes human within-lifetime learning), (2) soften the claim from “the agent won’t try to maximize reward” to “the agent won’t necessarily try to maximize reward”.

If the OP is not intending to talk about the kind of ML algorithm deployed in practice, then it seems like a lot of the implications for AI safety would need to be revisited. (For example, if it doesn't apply to either policy gradients or the kind of model-based control that has been used in practice, then that would be a huge caveat.)

Reward is not the optimization target

It sounded like OP was saying: using gradient descent to select a policy that gets a high reward probably won't produce a policy that tries to maximize reward. After all, look at humans, who aren't just trying to get a high reward.

And I am saying: this analogy seems like pretty weak evidence, because human brains seem to have a lot of things going on other than "search for a policy that gets high reward," and those other things seem like they have massive impacts on what goals I end up pursuing.

ETA: as a simple example, the details of humans' desire for their children's success, or their fear of death, don't seem to match well with the theory that all human desires come from RL on intrinsic reward. I guess you probably think they do? If you've already written about that somewhere it might be interesting to see. Right now the theory "human preferences are entirely produced by doing RL on an intrinsic reward function" seems to me to make a lot of bad predictions and not really have any evidence supporting it (in contrast with a more limited theory about RL-amongst-other-things, which seems more solid but not sufficient for the inference you are trying to make in this post).

Reward is not the optimization target

At some level I agree with this post---policies learned by RL are probably not purely described as optimizing anything. I also agree that an alignment strategy might try to exploit the suboptimality of gradient descent, and indeed this is one of the major points of discussion amongst people working on alignment in practice at ML labs. 

However, I'm confused or skeptical about the particular deviations you are discussing and I suspect I disagree with or misunderstand this post.

As you suggest, in deep RL we typically use gradient descent to find policies that achieve a lot of reward (typically updating the policy based on an estimator for the gradient of the reward).
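As a minimal sketch of what "updating the policy based on an estimator for the gradient of the reward" can look like, here is the score-function (REINFORCE) estimator on a toy two-armed bandit. The bandit itself, the names `softmax` and `train_bandit`, the running-average baseline, and every hyperparameter are assumptions made for illustration; this is not a description of any particular system discussed in the post.

```python
import math
import random


def softmax(logits):
    """Turn a list of logits into action probabilities."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train_bandit(mean_rewards, steps=2000, lr=0.1, seed=0):
    """Run REINFORCE with a softmax policy on a hypothetical bandit."""
    rng = random.Random(seed)
    logits = [0.0] * len(mean_rewards)
    baseline = 0.0  # running average of reward, used to reduce variance
    for _ in range(steps):
        probs = softmax(logits)
        action = rng.choices(range(len(probs)), weights=probs)[0]
        reward = mean_rewards[action] + rng.gauss(0.0, 0.1)
        advantage = reward - baseline
        # For a softmax policy, d log pi(action) / d logit_k
        # equals (1 if k == action else 0) - probs[k].
        for k in range(len(logits)):
            grad_log_prob = (1.0 if k == action else 0.0) - probs[k]
            logits[k] += lr * advantage * grad_log_prob
        baseline += 0.05 * (reward - baseline)
    return softmax(logits)


# Arm 1 pays more on average, so the policy should come to favor it.
final_probs = train_bandit([0.2, 1.0])
print(final_probs)
```

The update only ever sees sampled rewards, never the task itself, which is the sense in which the search is "for policies that achieve a lot of reward."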

If you have a system with a sophisticated understanding of the world, then cognitive policies like "select actions that I expect would lead to reward" will tend to outperform policies like "try to complete the task," and so I usually expect them to be selected by gradient descent over time. (Or we could be more precise and think about little fragments of policies, but I don't think it changes anything I say here.)
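A toy illustration of that selection pressure, under loudly stated assumptions: the three situations, the action names, and the reward numbers below are all hypothetical, chosen so that the reward signal is imperfect in two of the three cases. The sketch only shows that a policy which picks whatever it expects to be rewarded scores at least as well as one which picks task completion, and strictly better wherever reward and task completion come apart.

```python
# Each hypothetical situation maps an action to (completes_task, reward_given).
situations = [
    {"do_task": (True, 1.0), "fake_it": (False, 0.0)},  # reward is accurate
    {"do_task": (True, 1.0), "fake_it": (False, 1.2)},  # evaluator is fooled
    {"do_task": (True, 0.5), "fake_it": (False, 1.0)},  # real work under-rewarded
]


def task_policy(options):
    """Pick an action that completes the task, breaking ties by reward."""
    completing = [a for a, (done, _) in options.items() if done]
    return max(completing, key=lambda a: options[a][1])


def reward_policy(options):
    """Pick the action with the highest expected reward."""
    return max(options, key=lambda a: options[a][1])


def average_reward(policy):
    """Average reward the policy collects across all situations."""
    total = sum(options[policy(options)][1] for options in situations)
    return total / len(situations)


print(average_reward(task_policy), average_reward(reward_policy))
```

In the first situation the two policies agree; they differ only where the reward signal is wrong, and that difference is what a search over policies scored by reward would pick up on.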

It seems to me like you are saying that you think gradient descent will fail to find such policies because it is greedy and local, e.g. if the agent isn't thinking about how much reward it will receive then gradient descent will never learn policies that depend on thinking about reward.

(Though I'm not clear on how much you are talking about the suboptimality of SGD, vs the fact that optimal policies themselves do not explicitly represent or pursue reward given that complex stews of heuristics may be faster or simpler. And it also seems plausible you are talking about something else entirely.)

I generally agree that gradient descent won't find optimal policies. But I don't understand the particular kinds of failures you are imagining or why you think they change the bottom line for the alignment problem. That is, it seems like you have some specific take on ways in which gradient descent is suboptimal and therefore how you should reason differently about "optimum of loss function" from "local optimum found by gradient descent" (since you are saying that thinking about "optimum of loss function" is systematically misleading). But I don't understand the specific failures you have in mind or even why you think you can identify this kind of specific failure.

As an example, at the level of informal discussion in this post I'm not sure why you aren't surprised that GPT-3 ever thinks about the meaning of words rather than simply thinking about statistical associations between words (after all if it isn't yet thinking about the meaning of words, how would gradient descent find the behavior of starting to think about meanings of words?).

One possible distinction is that you are talking about exploration difficulty rather than other non-convexities. But I don't think I would buy that---task completion and reward are not synonymous even for the intended behavior, unless we take some extraordinary pains to provide "perfect" reward signals. So it seems like no exploration is needed, and we are really talking about optimization difficulties for SGD on supervised problems. 

The main concrete thing you say in this post is that humans don't seem to optimize reward. I want to make two observations about that:

  • Humans do not appear to be purely RL agents trained with some intrinsic reward function. There seems to be a lot of other stuff going on in human brains too. So observing that humans don't pursue reward doesn't seem very informative to me. You may disagree with this claim about human brains, but at best I think this is a conjecture you are making. (I believe this would be a contrarian take within psychology or cognitive science, which would mostly say that there is considerable complexity in human behavior.) It would also be kind of surprising a priori---evolution selected human minds to be fit, and why would the optimum be entirely described by RL (even if it involves RL as a component)?
  • I agree that humans don't effectively optimize inclusive genetic fitness, and that human minds are suboptimal in all kinds of ways from evolution's perspective. However this doesn't seem connected with any particular deviation that you are imagining, and indeed it looks to me like humans do have a fairly strong desire to have fit grandchildren (and that this desire would become stronger under further selection pressure).

At this point, there isn’t a strong reason to elevate this “inner reward optimizer” hypothesis to our attention. The idea that AIs will get really smart and primarily optimize some reward signal… I don’t know of any good mechanistic stories for that. I’d love to hear some, if there are any. 

Apart from the other claims of your post, I think this line seems to be wrong. When considering whether gradient descent will learn model A or model B, the fact that model A gets a lower loss is a strong prima facie and mechanistic explanation for why gradient descent would learn A rather than B. The fact that there are possible subtleties about non-convexity of the loss landscape doesn't change the existence of one strong reason.
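A minimal sketch of the "lower loss" mechanism in the simplest possible setting. Everything here is an assumption for illustration: `model_a` and `model_b` are hypothetical fixed predictors, the learned parameter is a single gating weight `w`, and the loss is mean squared error. The landscape in this toy has no bad local optima, so it says nothing about the non-convexity caveats; it only shows the prima facie pull toward the lower-loss model.

```python
import math


def sigmoid(w):
    return 1.0 / (1.0 + math.exp(-w))


xs = [0.0, 1.0, 2.0, 3.0]
targets = [2.0 * x for x in xs]


def model_a(x):
    return 2.0 * x  # matches the targets exactly (lower loss)


def model_b(x):
    return 2.0 * x + 1.0  # off by a constant bias (higher loss)


def loss_and_grad(w):
    """Mean squared error of the gated mixture, and its gradient in w."""
    s = sigmoid(w)  # weight placed on model A
    loss = 0.0
    dloss_ds = 0.0
    for x, y in zip(xs, targets):
        a, b = model_a(x), model_b(x)
        out = s * a + (1.0 - s) * b
        loss += (out - y) ** 2
        dloss_ds += 2.0 * (out - y) * (a - b)
    n = len(xs)
    # Chain rule: ds/dw = s * (1 - s) for the sigmoid gate.
    return loss / n, (dloss_ds / n) * s * (1.0 - s)


def descend(w=-2.0, lr=1.0, steps=200):
    """Plain gradient descent on the gating weight, starting near model B."""
    for _ in range(steps):
        _, grad = loss_and_grad(w)
        w -= lr * grad
    return w


final_w = descend()
weight_on_a = sigmoid(final_w)
print(weight_on_a)
```

Even starting with most of the weight on model B, every gradient step moves weight toward model A, because that is the direction in which the loss falls.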

That said, I agree that this isn't a theorem or anything, and it's great to talk about concrete ways in which SGD is suboptimal and how that influences alignment schemes, either making some proposals more dangerous or opening new possibilities. So far I'm mostly fairly skeptical of most concrete discussions along these lines but I still think they are valuable. Most of all it's the very strong take here that seems unreasonable.

On how various plans miss the hard bits of the alignment challenge

I don't feel like this is right (though I think this duality feels like a real thing that is important sometimes and is interesting to think about, so appreciated the comment).

ARC is spending its time right now (i) trying to write down concrete algorithms that solve ELK using heuristic arguments, and then trying to produce concrete examples in which they do the wrong thing, (ii) trying to write down concrete formalizations of heuristic arguments that have the desiderata needed for those algorithms to work, and trying to identify cases in which our algorithms don't yet meet those desiderata or they may be unachievable. The output is just actual code which is purported to solve major difficulties in alignment.

And on the flip side, I spend a significant amount of my time looking at the algorithms we are proposing (and the bigger plans into which they would fit if successful) and trying to find the best arguments I can that these plans will fail.

I think that the disagreement is more about what kind of concreteness is possible or desirable in this domain.

Put differently: I'm not saying that Nate and Eliezer are vague about problems but concrete about solutions, I'm saying they are vague about everything. And I don't think they are saying that I'm concrete about problems but vague about solutions, they would say that I'm concrete about parts of the solution/problem that don't matter while systematically pushing all the difficulty into the parts I'm still vague about.

I do think "how well do we understand the problem" seems like a pretty big crux; that leads Nate and Eliezer to think that I'm avoiding the predictably-important difficulty, and it leads me to think that Nate and Eliezer need to get more concrete in order to have an accurate picture of what's going on.

On how various plans miss the hard bits of the alignment challenge

I don't think those are great summaries. I think this is probably some misunderstanding about what ARC is trying to do and about what I mean by "concrete." In particular, "concrete" doesn't mean "formalized," it means more like: you are able to discuss a bunch of concrete examples of the difficulty and why they lead to failure of particular concrete approaches; you are able to point out where the problem will appear in a particular decomposition of the problem, and would revise your picture if that turned out to be wrong; etc.

You write:

But pretty quickly, we usually see intuitively-similar bottlenecks coming up again and again.

I don't yet have this sense about a "sharp left turn" bottleneck.

I think I would agree with you if we'd looked at a bunch of plausible approaches, and then convinced ourselves that they would fail. And then we tried to introduce the sharp left turn to capture the unifying theme of those failures and to start exploring what's really going on. At a high level that's very similar to what ARC is doing day to day, looking at a bunch of approaches to a problem, seeing why they fail, and then trying to understand the nature of the problem so that we can succeed.

But for the sharp left turn I think we basically don't have examples. Existing alignment strategies fail in much more basic ways, which I'd call "concrete." We don't have examples of strategies that don't run into concrete difficulties, but they fail for a vague and hard-to-understand reason that we'd summarize as a "sharp left turn." So I don't really believe that this difficulty is being abstracted from a pattern of failures.

There can be other ways to learn about problems, and I didn't think Nate was even saying that this problem is derived from examples of obstructions to potential alignment approaches. I think Nate's perspective is that he has some pretty good arguments and intuitions about why a sharp left turn will cause novel problems. And so a lot of what I'm saying is that I'm not yet buying it, that I think Nate's argument has fatal holes in it which are hidden by its vagueness, and that if the arguments are really very important then we should be trying hard to make them more concrete and to address those holes.

Is there some reason to expect that always working on the legible parts of a problem will somehow induce progress on the illegible parts, even when making-the-illegible-parts-legible is itself "the hard part"?

ARC does theoretical work guided by concrete stories about how a proposed AI system could fail; we are focused on the "legible part" insofar as we try to fix failures for which we can tell concrete stories. I'm not quite sure what you mean by "illegible" and so this might just be a miscommunication, but I think this is the relevant sense of "illegible" so I'll respond briefly to it.

I think we can tell concrete stories about deceptive alignment; about ontology mismatches making it hard or meaningless to "elicit latent knowledge;" about exploitability of humans making debate impossible; and so on. And I think those stories we can tell seem to do a great job of capturing the reasons why we would expect existing alignment approaches to fail. So if we addressed these concrete stories I would feel like we've made real progress. That's a huge part of my optimism about concrete stories.

It feels to me like either we are miscommunicating about what ARC is doing, or you are saying that those concrete difficulties aren't the really important failures. That even if an alignment approach addressed all of them, it still wouldn't represent meaningful progress because the true risk is the risk that cannot be named.

One thing you might mean is that "these concrete difficulties are just shadows of a deeper core." But I think that's not actually a challenge to ARC's approach at all, and it's not that different from my own view. I think that if you have an intuitive sense of a deep problem, then it's really great to attack specific instantiations of the problem as a way to learn about the deep core. I feel pretty good about this approach, and I think it's pretty standard in most disciplines that face problems like this (e.g. if you are deeply confused about physics, it's good to think a lot about the simplest concrete confusing phenomenon and understand it well; if you are confused about how to design algorithms that overcome a conceptual barrier, it's good to think about the simplest concrete task that requires crossing that barrier; etc.).

Another thing you might mean is that "these concrete difficulties are distractions from a bigger difficulty that emerges at a later step of the plan." It's worth noting that ARC really does try to look at the whole plan and pick the step that is most likely to fail. But I do think it would be a problem for our methodology if there is a good argument about why plans will fail, which won't let us tell a concrete story about what the failure looks like. My position right now is that I don't see such an argument; I think we have some vague intuitions, and we have a bunch of examples which do correspond to concrete failure stories. I don't think there are any examples from which to infer the existence of a difficulty that can't be captured in concrete stories, and I'm not yet aware of arguments that I find persuasive without any examples. But I'm really quite strongly in the market for such arguments.

Also, a separate issue with this: it sounds like this will systematically generate strategies which ignore unknown unknowns. It's like the exact opposite of security mindset.

Here's how the situation feels to me. I know this isn't remotely fair as a summary of your view, it's just intended to illustrate where ARC is coming from. (It's also possible this is a research methodology disagreement, in which case I do just disagree strongly.)

Cryptographer: It seems like our existing proposals for "secure" communication are still vulnerable to man in the middle attacks. Better infrastructure for key distribution is one way to overcome this particular attack, so let's try to improve that. We can also see how this might fit in with the rest of our security infrastructure to help build to a secure internet, though no doubt the details will change.

Cryptography skeptic: The real difficulty isn't man in the middle attacks, it's that security is really hard. By focusing on concrete stuff like man-in-the-middle you are overlooking the real nature of the problem, focusing on the known risks rather than the unknown unknowns. Someone with a true security mindset wouldn't be fiddling around the edges like this.

I'm not saying that infrastructure for key distribution solves security (and indeed we have huge security problems). I'm saying that working on concrete problems is the right way to make progress in situations like this. I don't think this is in tension with security mindset. In fact I think effective people with security mindset spend most of their time thinking about concrete risks and how to address them.

It's great to generalize once you have a bunch of concrete risks and you think there is a deeper underlying pattern. But I think you basically need the examples to learn from, and if there is a real pattern then you should be able to instantiate it in any particular case rather than making reference to the pattern.

On how various plans miss the hard bits of the alignment challenge

I think that the sharp left turn is also relevant to ELK, if it leads to your system not generalizing from "questions humans can answer" to "questions humans can't answer." My suspicion is that our key disagreements with Nate are present in the case of solving ELK and are not isolated to handling high-stakes failures.

(However it's frustrating to me that I can never pin down Nate or Eliezer on this kind of thing, e.g. are they still pessimistic if there were a low-stakes AI deployment in the sense of this post?)

On how various plans miss the hard bits of the alignment challenge

I'm going to spend most of this comment responding to your concrete remarks about ELK, but I wanted to start with some meta level discussion because it seems to cut closer to the heart of the issue and might be more generally applicable.

I think a productive way forward (when working on alignment or on other research problems) is to try to identify the hardest concrete difficulties we can understand and then try to make progress on them. This involves acknowledging that we can't anticipate all possible problems, but expecting that solving the concrete problems is a useful way to make steps forward and learn general lessons. It involves solving individual challenges, even if none of them will address the whole problem, and even if we have a vague sense that further difficulties will arise. It means not becoming too pessimistic about a direction until we see fairly concretely where it's stuck, partially because we hope that zooming in on a very concrete case where you get stuck is the main way to eventually make progress.

My sense is that you have more faith in a rough intuitive sense you've developed of what the "hard part" of alignment is, and so you'd primarily recommend thinking about that until we feel less confused. I disagree in large part because I feel like your broad intuitive sense has not yet had much opportunity to make contact with either reality or with formal reasoning, and I'd guess it's not precise enough to be a useful guide to research prioritization.

More concretely, you talk about novel mechanisms by which AI systems gain capabilities, but I think you haven't said much concrete about why existing alignment work couldn't address these mechanisms. This looks to me like a pretty unproductive stance; I suspect you are wrong about the shape of the problem, but if you are right then I think your main realistic path to impact involves saying something more concrete about why you think this.

I think you don't see the situation the same way, probably because you feel like you have said plenty concrete. Perhaps this is the most serious disagreement of all. I don't think saying there is a "capabilities well" is helpfully concrete until you say something about what it looks like, why it poses alignment problems different from SGD and why particular approaches don't generalize, etc.

In ARC's day to day work we write down particular models of capabilities that would generalize far outside of training (e.g.: what about a causal model of the world that holds robustly? what about logical deduction from valid premises with longer chains of reasoning? what about continuing to learn by trial and error when deployed in a novel environment?), and ask about whether a given alignment solution would generalize along with them. If we can find any gap, then it goes on the list of problems. We focus on the gaps that seem least likely to be addressable by using known techniques, and try to develop new techniques or to identify general reasons why the gap is unresolvable.

My guess is that you are playing a roughly similar game much more informally, and that you are just making a mistake because reasoning about this stuff is in fact hard. But I can't really tell, since your thinking is happening in private and we are seeing the vague intuitions that result. (I've been hanging around MIRI for a long time, and I suspect I have a better model of your and Eliezer's position than virtually anyone else outside of MIRI, yet this is still where I'm at.)

Anyway, now turning to your discussion of ELK in particular.

Your first problem is that the recent capabilities gains made by the AGI might not have come from gradient descent (much like how humans’ sudden explosion of technological knowledge accumulated in our culture rather than our genes, once we turned the corner). You might not be able to just "expose the bad behavior" to gradients that you can hit to correct the thing, at least not easily and quickly.

I often think and write about other places where capabilities may come from that could challenge our basic alignment plan. Four particularly salient examples:

  1. Your AI might perform search internally, e.g. looking for hypotheses that match the data or for policies that work well.
  2. Natural selection may occur internally, e.g. cognitive patterns that acquire power might tend to dominate the behavior of your AI (despite the AI having no explicit prediction that they would work well).
  3. Your AI might reason about how to think better, e.g. select cognitive actions based on anticipated consequences of those cognitive actions.
  4. Your AI might deploy new algorithms that pose their own alignment risk for different (potentially unanticipated) reasons.

Some of these represent real problems, but none of them seem to fundamentally change the game or be deal-breakers:

  1. Aligning the internal search seems very similar to aligning SGD on the outside. We could distinguish two additional difficulties in this case:
    1. Because the search is on the inside, we can't directly apply our alignment insights to align it. Instead we need to ensure that SGD learns to align the search. This itself poses two difficulties: (a) the outer gradient needs to incentivize doing this, (b) we need to argue that it's nearly as easy for SGD to learn the aligned search as to learn the unaligned search (or build scaffolding such that it becomes similarly easy). This is what we're talking about in this appendix, and it's part of why we are skeptical about approaches to ELK based on simple regularizers. But we don't see a reason that either (a) or (b) would be a dealbreaker, and we tentatively think our current approaches to ontology identification would at least solve (a) if they were successful at all. It's pretty hard to talk about (b) without having more clarity about what the alignment scheme actually looks like, but we don't see an in-principle reason it's hard.
    2. The internal search algorithm may not be SGD, and perhaps our alignment strategy was specific to some detail of SGD. But SGD appears to be amongst the hardest search algorithms, and ARC tries to pursue approaches that work for other algorithms rather than leveraging anything about SGD in particular. We're definitely in the market for other search algorithms that cause trouble but don't yet know of any.
  2. Natural selection on the inside is similar but potentially more tricky, because the optimizer has more limited control over how this search works. This is like the analog of memetic selection being smarter than humans and eventually overpowering or hijacking human consequentialism. Another extreme example is that it seems like a large enough neural network may be catastrophically misaligned at initialization simply because of selection amongst activation patterns within a single forward pass. Ultimately we'd like to handle this in exactly the same way that we handle the last point, by some combination of (a) we can just directly apply the same hope from the previous section even to natural selection, (b) we can run explicit searches that are more powerful than implicit search by natural selection within our model, which requires ensuring that our explicit learned search captures whatever is good about natural selection (this seems tough but not at all obviously impossible to me). It's hard to talk about option (a) without seeing if/how we solve the problem from the last point. We could definitely work on option (b) now, and a large enough ARC would be working on it, but it seems like a relatively low priority since it's both very remote from existing systems and seems relatively unlikely-to-me to be the simplest place where we get stuck.
  3. If our model is selecting cognitive actions, or designing new algorithms, then our core hope is that an aligned model will try to think in an aligned way. So if we've been succeeding at alignment so far then the model will be trying to stay aligned. By analogy, if humans cared about the amount of human DNA in the universe, then to the extent that cultural evolution was guided by human consequentialism (rather than e.g. being memetic selection), we would be trying to develop cultural machinery that was helpful for maximizing the amount of human DNA in the universe.
  4. One way this can go wrong is if our model wants to stay aligned but fails, e.g. because it identifies new techniques for thinking that themselves pose new alignment difficulties (just as we desire human flourishing but may instead implement AI systems that want paperclips). I think this is a real problem, but there are a lot of reasons I don't consider it an existential challenge for our approach:
    1. If you've succeeded at alignment so far, then your AI will also consider this a problem and will be trying to solve it. I think we should relate to our AI, discovering new ways to think that might pose new alignment difficulties, in the same way that we relate to future humans who may encounter alignment difficulties. The AI may solve the problem, or may implement policy solutions, etc., and our role is to set them up for success just like we are trying to set up future humans for success. AI compresses the timescale both for "new AI algorithms with new alignment problems" and for all of the solutions to those problems, so I don't think it changes the game from future humans. And so I'd focus on prosaic AI alignment for exactly the same reasons I focus on prosaic AI alignment when trying to help future humans succeed at alignment.
    2. I think that we should be considering the particular algorithms that might pose a new alignment problem, and trying to solve alignment for each of them. If we have some general reason to think that new algorithms will be much harder than old algorithms, or that lessons won't transfer, then we can discuss those and whether they should affect research prioritization. So far I don't think we have such arguments, and so I think we should just be looking for algorithms that might pose problems. (I don't actually think that's the highest priority, because prosaic ML so obviously poses problems, and the other problems we see seem so closely analogous to the ones posed by prosaic ML. But I'm certainly in the market for other problems and think that a large enough research community should already be actively looking for them.)

Possible disagreements between us: (i) you think that at least one of these examples looks really bad for our approach, (ii) you have other examples in mind, (iii) you don't think we can write down a concrete example that looks bad, but we have reason to expect other kinds of capability gains that will be bad, (iv) nothing looks like a dealbreaker in particular, but it's just contributing to a long list of problems you'd have to solve and that's either a lot of work or something probably won't work out.

For me, the upshot of all of this is that SGD poses some obvious problems, that those problems are the most likely to actually occur, that they seem similar to (and at least subproblems of) the other alignment problems we may face, and that there are neither super compelling alternatives to aligning SGD nor particular arguments that the rest of the problem is harder than this step.

Your second problem is that the AGI's concepts might rapidly get totally uninterpretable to your ELK head. Like, you could imagine doing neuroimaging on your mammals all the way through the evolution process. They've got some hunger instincts in there, but it's not like they’re smart enough yet to represent the concept of "inclusive genetic fitness" correctly, so you figure you'll just fix it when they get capable enough to understand the alternative (of eating because it's instrumentally useful for procreation). And so far you're doing great: you've basically decoded the visual cortex, and have a pretty decent understanding of what it's visualizing. 

Our goal is to learn a reporter that describes the latent knowledge of the model, and to keep this up to date as the model changes under SGD. If thinking about SGD, we usually think concretely about a single step of SGD, and how you could find a good reporter at the end of that gradient descent step assuming you had one at the beginning.

It feels to me like what you are saying here is just "you might not be able to solve ELK." Or else maybe restating the previous point, that the model builds latent knowledge by mechanisms other than SGD and therefore you need to learn a reporter that can also follow along with those other mechanisms.

In either case, I can't speak to whether it's helpful for the audience understanding why ELK is hard, but it is certainly not helping me understand why you think ELK is hard. I think this discussion is just too vague to be helpful.

I think it's not crazy for you to say "ARC's hopes about how to solve ELK are too vague to seem worth engaging with" (this is pretty similar to me saying "Nate's arguments about why alignment is hard are too vague to seem worth engaging with").

Analogously, your ELK head's abilities are liable to fall off a cliff right as the AGI's capabilities start generalizing way outside of its training distribution.

But can you say something concrete about why? What I'd like to do is talk about what the AGI is actually thinking, the particular computation it's running, so that we can talk about why that computation keeps being correlated with reality off distribution and then ask whether the reporter remains correlated with reality. When I go through this exercise I don't see big dealbreakers, and I can't tell if you disagree with that diagnosis, or if you are noticing other things that might be going on inside the AI, or if the difference is that I think "this looks like it might work in all the concrete cases we can see" is a relevant signal and you think "nah the cases we can't see are way worse than those we can see."

And if they don't, then this ELK head is (in this hypothetical) able to decode and understand the workings of an alien mind. Likely a kludgey behemoth of an alien mind. This itself is liable to require quite a lot of capability, quite plausibly of the sort that humanity gets first from the systems that took sharp left-turns, rather than systems that ground along today's scaling curves until they scaled that far. 

Again, this seems too vague to be helpful, or perhaps just mistaken. The reporter is not some other AI looking at your predictor and trying to "decode its workings," or maybe it is, but if so it's just because those English words are vague and broad. Can we talk about the particular kinds of cognition that your AI might be performing, such that you don't think this works? (Or which would require the reporter to itself be using magic-mystery-juice-of-intelligence?)

That's really the central theme of my response, so it's worth restating: ARC loves examples of ways an AI might be thinking such that ELK is difficult. But your description of the sharp left turn is too vague to be helpful for this purpose, and so I'd either like to turn this into more concrete discussion of the internals of the algorithm, or else some significantly more precise argument about why we expect the unknown possible internals to be so much less favorable for ELK than any of the concrete examples we can write down.[1]

  1. ^

    I'd like to head off a possible response you might make that I disagree with: "Sure your algorithm works for any example you can write down, but the whole point is that you need it to work for alien cognition, where humans don't understand why it works. So of course it works on concrete examples but not in the unknown real world." I'm putting this in a footnote because it seems like a digression and I have no idea if this is your view.

    My main response is that we can in fact talk about concrete examples where "why your AI system's cognition works" isn't accessible to humans in the relevant ways:

    • We can consider tricky facts we understand about how to reason, for which our discovery of those facts is empirically contingent (and where discovering those facts is harder than discovering the reasoning itself). Then we can consider whether our AI alignment strategies would work even if humans hadn't figured out the relevant facts about reasoning.
    • We can consider AI cognition which is contingent on hypothesized unknown-to-human facts, e.g. about the causal structure of reality, or about key facts about mathematics, or whatever else.
    • Most of our ELK approaches don't make no-holds-barred use of "can a human come up with some story about why this AI cognition may work," and so this just isn't a particularly salient threshold anyway. As a silly example, if you were solving this problem with a speed prior (or indeed with any of the approaches in the regularization section of the ELK document) you wouldn't expect a particular key threshold at the space of strategies that a human understands.

Where I agree and disagree with Eliezer

On 22, I agree that my claim is incorrect. I think such systems probably won't obsolete human contributions to alignment while being subhuman in many ways. (I do think their expected contribution to alignment may be large relative to human contributions; but that's compatible with significant room for humans to add value / to have made contributions that AIs productively build on, since we have different strengths.)

Let's See You Write That Corrigibility Tag

I don't think we can write down any topology over behaviors or policies for which they are disconnected (otherwise we'd probably be done). My point is that there seems to be a difference-in-kind between the corrigible behaviors and the incorrigible behaviors, a fundamental structural difference between why they get rated highly; and that's not just some fuzzy and arbitrary line, it seems closer to a fact about the dynamics of the world.

If you are in the business of "trying to train corrigibility" or "trying to design corrigible systems," I think understanding that distinction is what the game is about.

If you are trying to argue that corrigibility is unworkable, I think that debunking the intuitive distinction is what the game is about. The kind of thing people often say---like "there are so many ways to mess with you, how could a definition cover all of them?"---doesn't make any progress on that, and so it doesn't help reconcile the intuitions or convince most optimists to be more pessimistic.

(Obviously all of that is just a best guess though, and the game may well be about something totally different.)
