A positive case for how we might succeed at prosaic AI alignment

by Evan Hubinger, 16th Nov 2021

This post is my attempt at something like a response to Eliezer Yudkowsky’s recent discussion on AGI interventions.

I tend to be relatively pessimistic overall about humanity’s chances at avoiding AI existential risk. Contrary to some others that share my pessimism, however—Eliezer Yudkowsky, in particular—I believe that there is a clear path forward for how we might succeed within the current prosaic paradigm (that is, the current machine learning paradigm) that looks plausible and has no fundamental obstacles.

In the comments on Eliezer’s discussion, this point about whether there exists any coherent story for prosaic AI alignment success came up multiple times. From Rob Bensinger:

I think it's pretty important here to focus on the object-level. Even if you think the goodness of these particular research directions isn't cruxy (because there's a huge list of other things you find promising, and your view is mainly about the list as a whole rather than about any particular items on it), I still think it's super important for us to focus on object-level examples, since this will probably help draw out what the generators for the disagreement are.

In that spirit, I’d like to provide my own object-level story for how prosaic AI alignment might work out well.

Of course, any specific story for how we might succeed is going to be wrong simply because it specifies a bunch of details, and this is such a specific story. The point, however, is that the fact that we can write plausible stories for how we might succeed that don’t run into any fundamental obstacles implies that the problem isn’t “we don’t even know how we could possibly succeed” but rather “we know some ways in which we might succeed, but they all require a bunch of hard stuff that we have to actually execute on,” which I think is a pretty different place to be.

One thing I do agree with Eliezer on, however, is that, when you’re playing from behind—as I think we are—you play for variance. That means embracing strategies that might not work in expectation, but that have long tails in the positive direction, and I definitely see my picture here as falling into that category.

Furthermore, as one should probably expect with any full roadmap for solving such a complicated problem, there’s still a lot left out of my picture—especially my intuitions for why I think each of these steps is actually plausible. I am currently working on a research agenda that will go into a lot more detail on that, but until it’s published, it might be best to just think of this post as an overview of what that agenda will look like.[1]

Alright, without further ado, here’s my concrete picture for how we might end up succeeding:

  1. We produce an understanding of a simple, natural class of agents such that agents of this form are capable of doing all of the things that we might want a powerful, advanced AI to do—but such that no agents of this form will ever act deceptively.

    • My current best guess for what such a natural class might look like is a myopic agent—that is, an agent that only cares about its next action rather than the long-term consequences of its actions. I think it is possible to produce a simple, natural description of myopia such that myopic agents are still capable of doing all the powerful things we might want out of an AGI but such that they never have any reason to be deceptive.[2]
    • In the language of training stories, (1) gives us our training goal, the mechanistic description of what sort of model we’re trying to produce.
  2. We develop some way of determining whether a given non-deceptive model falls into the natural class we developed in step (1). It’s fine for this not to work for all non-deceptive models, as long as the class of non-deceptive models that it works on is large enough to make (3) and (4) go through.

    • Note that we will never rely on (2) working in a situation where we are given an agent that is already deceptive.
    • One way to accomplish (2) might be to develop worst-case transparency tools that can tell whether the basic structure of a given model is consistent with (1).
  3. We develop a training procedure such that, given that the current model being trained falls into the natural class from step (1), additional training will always keep it in that class.

    • If agents from our natural class are capable of deploying the tools we developed in step (2), then one way to accomplish (3) might be to have the training be done by the model being trained, given access to the tools from (2).
    • For (3) to just work very straightforwardly, it would need to be the case that the set of models that (2) works for is large enough to include any model that can be reached from one step of training starting from a model in the natural class from (1).
  4. Using (2), we guide very early training (before the model has the capability to be deceptive) to get some model (from which we can initialize further training) that falls into the natural class from (1).

    • For (4) to just work very straightforwardly, it would need to be the case that the set of models that (2) works for is large enough to include any model that can be produced early in training before the model has the capability to be deceptive.
    • Alternatively, the natural class from (1) could just be broad enough to include most models at initialization, though I suspect that will cause problems for (3).
  5. Combining (3) and (4), we get an inductive guarantee that we can produce models that fall into our natural class. Because in (1) we constructed our natural class to be sufficient for any tasks that we might want our AI to do, we can now train non-deceptive AIs on any task that we might want them for.

    • For the training process in (5) to be competitive, we also need (3) and (4) to not be so resource-intensive that they are substantially harder than training an unaligned model.
    • In the language of training stories, the inductive argument here in (5) is our central training rationale for why we’ll get a model that satisfies the training goal from (1).
  6. Given a powerful and non-deceptive AI produced from (5), we use standard red-teaming (e.g. testing on lots of examples) to find places where the model fails and retrain using (3) until the model looks like it’s doing the right thing.

    • Because we know that our model is non-deceptive from (5)—and since all of our retraining is done via (3)—the fact that the model looks like it’s doing the right thing should give us a real guarantee that it’ll actually do the right thing in similar situations, since we know it won’t just be pretending to do the right thing.
  7. We ensure that the leading AI lab uses (5) + (6) to produce their most powerful and advanced AI systems. By being first, they are able to set the standard for how training powerful machine learning models should generally be done.[3] Because of the strong tendencies for AI labs to copy each other’s successes, other labs also use (5) + (6) to train their powerful and advanced AI systems, ensuring that all of the most powerful AIs in the world are aligned.

    • Though these other labs might scale (5) + (6) further, as long as (3) is robust to scale, such systems should stay aligned.

    • Though I think that the forces pushing for homogeneity of AI training processes across labs are strong, once the set of labs with the capability to build misaligned AI systems gets large enough—e.g. once it includes all the small labs too—one of them is bound to break that homogeneity. Thus, there is a period of vulnerability after (7) and before (8) where smaller labs might not follow (5) + (6) and instead build misaligned AI systems.

      • Even if that does happen, however, since it’s only small labs with limited capabilities building misaligned AI in a world that already contains aligned AIs built by much larger and more capable labs, it should be quite difficult for such misaligned AI systems to actually destabilize such a world.
  8. We use the AI systems from (7) to help us design the next round of powerful and advanced AI systems and develop techniques to end the period of vulnerability.

    • I won't say too much about exactly what we would do here, mostly because it's not a problem that we have to solve before we actually get the powerful aligned AI systems to help us solve it, so it's mostly not a problem that I think we need to focus on right now.
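
To make the inductive structure of (3) through (5) explicit, here is a minimal sketch in notation that is assumed purely for illustration and does not appear in the steps above: $\mathcal{C}$ is the natural class from (1), $M_t$ is the model after $t$ training steps, $T$ is one step of the training procedure from (3), and $\mathcal{V}$ is the set of non-deceptive models on which the check from (2) works.

```latex
\begin{align*}
&\text{Base case, from (4):}      && M_0 \in \mathcal{C} \\
&\text{Inductive step, from (3):} && M_t \in \mathcal{C} \;\Longrightarrow\; M_{t+1} = T(M_t) \in \mathcal{C} \\
&\text{Conclusion, as in (5):}    && M_t \in \mathcal{C} \quad \text{for all } t \ge 0 \\[4pt]
&\text{Coverage needed for the straightforward version:}
  && \{\,T(M) : M \in \mathcal{C}\,\} \subseteq \mathcal{V}
     \quad\text{and}\quad M_0 \in \mathcal{V}
\end{align*}
```

The last line restates the bullets under (3) and (4): the check from (2) only has to work on models reachable in one training step from inside the class, and on models produced early enough in training that they cannot yet be deceptive. It is never asked to evaluate an already-deceptive model.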

If I had to guess what the hardest part of the above picture will be, I’d probably guess (2),[4] which is why I’m so excited about Automating Auditing as a way to start making progress on (2) now. That being said, I don’t think there are any fundamental obstacles to solving (2)—(2) very explicitly doesn’t require us to be robust to deceptive models or even be able to tell whether (1) holds for all non-deceptive models, both of which I think would run into fundamental obstacles, but which we don’t have to do.


  1. If you want access to an early draft of my agenda, message me privately and I might send it to you, though it’s still likely to change a lot before it’s released. ↩︎

  2. I think it is possible for a myopic agent to still be capable of solving problems that involve non-myopic reasoning (e.g. be a good AI CEO). For example, a myopic agent could myopically simulate a strongly-believed-to-be-safe non-myopic process such as HCH, allowing imitative amplification to be done without ever breaking a myopia guarantee—alternatively, AI safety via market making lets you do AI safety via debate without breaking a myopia guarantee. In general, I think it’s just not very hard to leverage careful recursion to turn non-myopic objectives into myopic objectives such that it’s possible for a myopic agent to do well on them—without breaking the guarantees that ensure that your myopic agent won’t be deceptive (as a concrete example of what a myopic agent that is capable of doing well on such tasks without ever having any reason to be deceptive might look like, consider LCDT). ↩︎

  3. For an example of what “setting the standard for how training powerful ML models should be done” might look like, consider how once the basic training paradigm of “train massive self-supervised transformer-based language models” was introduced, it was aggressively copied across the field and became the standard for all language-based AI systems. ↩︎

  4. Second place for hardest step would probably be (7). Definitely a hard step, but I think that the claim that we mostly only have to persuade the frontrunner makes this not a fundamental obstacle—persuading one organization of one thing is an achievable goal, persuading every person doing AI everywhere would be a fundamental obstacle. ↩︎

The notion of (1) seems like the cat-belling problem here; the other steps don't seem interesting by comparison, the equivalent of talking about all the neat things to do after belling the cat.

What pivotal act is this AGI supposed to be executing?  Designing a medium-strong nanosystem?  How would you do that via a myopic system?  That means the AGI needs to design a nanosystem whose purpose spans over time and whose current execution has distant good consequences.  It doesn't matter whether you claim it's being done by something that internally looks like myopic HCH any more than it matters that it's being done by internal transistors that don't have tiny models of the future inside themselves.  What's consequentialist and farseeing isn't the transistors, or the floating-point multiplications, or the elaborate HCH or whatever, it's the actual work and actual problem being solved by the system whereby it produces a nanosystem that has coherent effects on the physical world spanning hours and days.

The notion of (1) seems like the cat-belling problem here; the other steps don't seem interesting by comparison, the equivalent of talking about all the neat things to do after belling the cat.

I'm surprised that you think (1) is the hard part—though (1) is what I'm currently working on, since I think it's necessary to make a lot of the other parts go through, I expect it to be one of the easiest parts of the story to make work.

What pivotal act is this AGI supposed to be executing? Designing a medium-strong nanosystem?

I left this part purposefully vague, but I'm happy to accept designing a medium-strong nanosystem as the pivotal act to consider here for the sake of argument, since I think that if your advanced AI can't at least do that, then it probably can't do anything else pivotal either.

That means the AGI needs to design a nanosystem whose purpose spans over time and whose current execution has distant good consequences.

Agreed.

It doesn't matter whether you claim it's being done by something that internally looks like myopic HCH any more than it matters that it's being done by internal transistors that don't have tiny models of the future inside themselves.

I think this is where you misunderstand me. I suspect that you don't really understand what I mean by myopia.

Let me see if I can explain, just using the HCH example. Though I suspect that imitating HCH is actually not powerful enough to do a pivotal act—and I suspect you agree—it's a perfectly good example to showcase what I mean by myopia.

To start with, the optimization wouldn't be done by HCH, or anything that would internally look like HCH in the slightest—rather, the optimization would be done by whatever powerful optimization process is inside of our model. Where myopia comes into play is in what goal we're trying to direct that optimization towards. The key idea, in the case of HCH, would be to direct that optimization towards the goal of producing an action that is maximally close to what HCH would do. In such a situation, you would have a model that can use its own powerful internal optimization procedures to imitate what HCH would do as effectively as possible—able to do things like effectively manage cognitive resources and reason about how best to go about producing an action that is as close as possible to HCH.

The natural class that I think this example is pointing to is the class of optimizers that optimize for an objective that is exclusively about their action through a Cartesian boundary, rather than the consequences of their action on the world. Such optimizers can still end up producing actions with far-reaching consequences on the world if they deploy their optimization power in the service of an objective like imitating HCH that requires producing actions with particular consequences, however. In such a situation, the model would be actively doing lots of reasoning about the consequences of its actions on the world, but not for the goal of producing a particular consequence, but rather just for the goal of producing a particular action, e.g. the one that matches up to what HCH would do. Thus, optimizers of this form can do all sorts of extremely powerful, long-term, non-myopic tasks—but without ever having any incentive to act deceptively.
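
One way to write down the distinction, as a rough sketch in notation that is assumed here and not established anywhere in this thread: let $x$ be the current input, $a$ a candidate action, $\pi_{\mathrm{HCH}}(x)$ the action HCH would output on $x$, $d$ some distance between actions, and $U$ a utility over future world states $s'$.

```latex
\begin{align*}
\text{objective over the action only:} \quad
  & a^{*} = \arg\min_{a} \; d\big(a,\ \pi_{\mathrm{HCH}}(x)\big) \\
\text{objective over consequences:} \quad
  & a^{*} = \arg\max_{a} \; \mathbb{E}\big[\,U(s') \mid x,\ a\,\big]
\end{align*}
```

In the first form, working out $\pi_{\mathrm{HCH}}(x)$ may require a great deal of reasoning about consequences in the world, but the quantity being optimized scores only the action itself. In the second form, the action is scored by what it causes downstream.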

Notably, there are a bunch of nuances here, regarding things like ensuring that the agent doesn't end up optimizing its objective non-myopically because of acausal trade considerations, ensuring that it doesn't want to self-modify into a different sort of agent, making sure it doesn't just spin up other agents that act non-myopically, etc., but these problems are really not that hard to solve. As a proof of concept, LCDT definitely solves all of these problems, showcasing that an optimizing system that really “just imitates HCH” is possible. Unfortunately, LCDT is not quite as natural as I would like, since it requires paying a bunch of bits of complexity to specify a fundamental concept of an “agent,” such that I don't think that the final solution here will actually look much like LCDT. Rather, I suspect that a more natural class of myopic agents will come from something more like analyzing the general properties of different types of optimizers over Cartesian boundaries.

Regardless, I strongly doubt that just developing a proper notion of myopia here poses a fundamental obstacle—we already have evidence that optimizers of this form are possible and can have the desired properties, and the basic concept of “an optimizer that just cares about its next action” is natural enough that I'd be quite surprised if we couldn't fully systematize it. I do suspect that any systematization will require paying the complexity of specifying a Cartesian boundary, but I'd be quite surprised if that cost us enough complexity to make the desired class too unnatural.

What's consequentialist and farseeing isn't the transistors, or the floating-point multiplications, or the elaborate HCH or whatever, it's the actual work and actual problem being solved by the system whereby it produces a nanosystem that has coherent effects on the physical world spanning hours and days.

Certainly it doesn't matter what substrate the computation is running on. I don't think this is really engaging with anything that I'm saying.

Certainly it doesn't matter what substrate the computation is running on.

I read Yudkowsky as positing some kind of conservation law. Something like, if the plans produced by your AI succeed at having specifically chosen far-reaching consequences if implemented, then the AI must have done reasoning about far-reaching consequences. Then (I'm guessing) Yudkowsky is applying that conservation law to [a big assemblage of myopic reasoners which outputs far-reaching plans], and concluding that either the reasoners weren't myopic, or else the assemblage implements a non-myopic reasoner with the myopic reasoners as a (mere) substrate.

Reasoning correctly about far-reaching consequences by default (1) has mistargeted consequences, and (2) is done by summoning a dangerous reasoner.

Such optimizers can still end up producing actions with far-reaching consequences on the world if they deploy their optimization power in the service of an objective like imitating HCH that requires producing actions with particular consequences, however.

I think what you're saying here implies that you think it is feasible to assemble myopic reasoners into a non-myopic reasoner, without compromising safety. My possibly straw understanding is that the way this is supposed to happen in HCH is that, basically, the humans providing the feedback train the imitator(s) to implement a collective message-passing algorithm that answers any reasonable question or whatever. This sounds like a non-answer, i.e. it's just saying "...and then the humans somehow assemble myopic reasoners into a non-myopic reasoner". Where's the non-myopicness? If there's non-myopicness happening in each step of the human consulting HCH, then the imitator is imitating a non-myopic reasoner and so is non-myopic (and this is compounded by distillation steps). If there isn't non-myopicness happening in each step, how does it come into the assembly?

Something like, if the plans produced by your AI succeed at having specifically chosen far-reaching consequences if implemented, then the AI must have done reasoning about far-reaching consequences. Then (I'm guessing) Yudkowsky is applying that conservation law to [a big assemblage of myopic reasoners which outputs far-reaching plans], and concluding that either the reasoners weren't myopic, or else the assemblage implements a non-myopic reasoner with the myopic reasoners as a (mere) substrate.

Endorsed.

To be clear, I agree with this also, but don't think it's really engaging with what I'm advocating for—I'm not proposing any sort of assemblage of reasoners; I'm not really sure where that misconception came from.

I don't think the assemblage is the point. I think the idea here is that "myopia" is a property of problems: a non-myopic problem is (roughly) one which inherently requires doing things with long time horizons. I think Eliezer's claim is that (1) a (good) pivotal act is probably a non-myopic problem, and (2) you can't solve a nontrivial nonmyopic problem with a myopic solver. Part (2) is what I think TekhneMakr is gesturing at and Eliezer is endorsing.

My guess is that you have some idea of how a myopic solver can solve a nonmyopic problem (by having it output whatever HCH would do, for instance). And then Eliezer would probably reply that the non-myopia has been wrapped up somewhere else (e.g. in HCH), and that has become the dangerous part (or, more realistically, the insufficiently capable part, and I expect Eliezer would claim that replacing it with something both sufficiently capable and aligned is about as hard as the whole alignment problem). I'm not sure what your response would be to that.

(1) a (good) pivotal act is probably a non-myopic problem, and (2) you can't solve a nontrivial nonmyopic problem with a myopic solver. [...] My guess is that you have some idea of how a myopic solver can solve a nonmyopic problem (by having it output whatever HCH would do, for instance).

Yeah, that's right, I definitely agree with (1) and disagree with (2).

And then Eliezer would probably reply that the non-myopia has been wrapped up somewhere else (e.g. in HCH), and that has become the dangerous part (or, more realistically, the insufficiently capable part, and I expect Eliezer would claim that replacing it with something both sufficiently capable and aligned is about as hard as the whole alignment problem).

I tend to think that HCH is not dangerous, but I agree that it's likely insufficiently capable. To solve that problem, we have to go to a myopic objective that is more powerful. But that's not that hard, and there's lots of them that can incentivize good non-myopic behavior that are safe to optimize for as long as the optimizer is myopic.

AI safety via market making is one example, but it's a very tricky one, so maybe not the best candidate for showcasing what I mean. In particular, I suspect that a myopic optimizer given the goal of acting as a trader or market-maker in such a setup wouldn't act deceptively, though I suspect they would Goodhart on the human approval signal in unsafe ways (which is less bad of a problem than deception, and could potentially be solved via something like my step (6), but still a pretty serious problem).

Maybe a better example would be something like imitative generalization. If imitating HCH is insufficient, we can push further by replacing “imitate HCH” with “output the hypothesis which maximizes HCH's prior times the hypothesis's likelihood,” which gets you substantially farther and I think is still safe to optimize for given a myopic optimizer (though neither are safe for a non-myopic optimizer).
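
As a minimal sketch of the selection rule "maximize HCH's prior times the hypothesis's likelihood," here is a toy version in Python. Everything in it is an assumption made for illustration: the hypothesis names, the prior and likelihood numbers, and the helper `select_hypothesis` are all made up, and a plain lookup table stands in for "HCH's prior," which in the real proposal would come from a far richer process.

```python
import math


def select_hypothesis(hypotheses, log_prior, log_likelihood):
    """Return the hypothesis z maximizing prior(z) * likelihood(z).

    The product is computed as a sum of logs, which is the same argmax
    but avoids underflow when the probabilities are tiny.
    """
    return max(hypotheses, key=lambda z: log_prior(z) + log_likelihood(z))


# Hypothetical stand-in for the prior HCH would assign to each hypothesis.
PRIOR = {"z_simple": 0.6, "z_medium": 0.3, "z_complex": 0.1}

# Hypothetical likelihood of the observed data under each hypothesis.
LIKELIHOOD = {"z_simple": 0.01, "z_medium": 0.2, "z_complex": 0.3}

best = select_hypothesis(
    PRIOR,  # iterating a dict yields its keys, i.e. the hypothesis names
    log_prior=lambda z: math.log(PRIOR[z]),
    log_likelihood=lambda z: math.log(LIKELIHOOD[z]),
)
print(best)  # z_medium: 0.3 * 0.2 = 0.06 beats 0.006 and 0.03
```

The point of the toy is only that the objective scores the output hypothesis itself, by prior and fit to data, and not any downstream effect of having output it.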

It still doesn't seem to me like you've sufficiently answered the objection here.

I tend to think that HCH is not dangerous, but I agree that it's likely insufficiently capable. To solve that problem, we have to go to a myopic objective that is more powerful.

What if any sufficiently powerful objective is non-myopic? Or, on a different-but-equivalent phrasing: what if myopia is a property only of very specific toy objectives, rather than a widespread property of objectives in general (including objectives that humans would intuitively consider to be aimed at accomplishing things "in the real world")?

It seems to me that Eliezer has presented quite compelling arguments that the above is the case, and on a first pass it doesn't look to me like you've countered those arguments.

But that's not that hard, and there's lots of them that can incentivize good non-myopic behavior that are safe to optimize for as long as the optimizer is myopic.

How does a "myopic optimizer" successfully reason about problems that require non-myopic solutions, i.e. solutions whose consequences extend past whatever artificial time-frame the optimizer is being constrained to reason about? To the extent that it does successfully reason about those things in a non-myopic way, in what remaining sense is the optimizer myopic?

AI safety via market making is one example, but it's a very tricky one, so maybe not the best candidate for showcasing what I mean. In particular, I suspect that a myopic optimizer given the goal of acting as a trader or market-maker in such a setup wouldn't act deceptively, though I suspect they would Goodhart on the human approval signal in unsafe ways (which is less bad of a problem than deception, and could potentially be solved via something like my step (6), but still a pretty serious problem).

Maybe a better example would be something like imitative generalization. If imitating HCH is insufficient, we can push further by replacing “imitate HCH” with “output the hypothesis which maximizes HCH's prior times the hypothesis's likelihood,” which gets you substantially farther and I think is still safe to optimize for given a myopic optimizer (though neither are safe for a non-myopic optimizer).

Both of these seem to be examples of solutions that simply push the problem back a step, rather than seeking to eliminate it directly. My model of Eliezer would call this attempting to manipulate confusion, and caution that, although adding more gears to your perpetual motion machine might make the physics-violating component harder to pick out, it does not change the fact that somewhere within the model is a step that violates physics.

In this case, it seems as though all of your proposals are of the form "Train your model to imitate some process X (where X is non-myopic and potentially unsafe), while adding incentives in favor of myopic behavior during training." To which my model of Eliezer replies, "Either your model will end up myopic, and not powerful enough to capture the part of X that actually does the useful work we are interested in, or it ends up imitating X in full (non-myopic) generality, in which case you have not managed to achieve any kind of safety improvement over X proper."

It seems to me that to usefully refute this, you need to successfully argue against Eliezer's background premise here—the one about power and non-myopic reasoning going hand-in-hand in a deep manner that, while perhaps circumventable via similarly deep insights, is not patchable via shallow methods like "Instead of directly using dangerous process X, we will imitate X, thereby putting an extra layer of abstraction between ourselves and the danger." My current impression is that you have not been arguing against this background premise at all, and as such I don't think your arguments hit at the core of what makes Eliezer doubt your proposals.

How does a "myopic optimizer" successfully reason about problems that require non-myopic solutions, i.e. solutions whose consequences extend past whatever artificial time-frame the optimizer is being constrained to reason about?

It just reasons about them, using deduction, prediction, search, etc., the same way any optimizer would.

To the extent that it does successfully reason about those things in a non-myopic way, in what remaining sense is the optimizer myopic?

The sense in which it's still myopic is that it's non-deceptive, which is the only sense that we actually care about.

it ends up imitating X in full (non-myopic) generality, in which case you have not managed to achieve any kind of safety improvement over X proper

The safety improvement that I'm claiming is that it wouldn't be deceptive. What is the mechanism by which you think a myopic agent would end up acting deceptively?

[Note: Still speaking from my Eliezer model here, in the sense that I am making claims which I do not myself necessarily endorse (though naturally I don't anti-endorse them either, or else I wouldn't be arguing them in the first place). I want to highlight here, however, that to the extent that the topic of the conversation moves further away from things I have seen Eliezer talk about, the more I need to guess about what I think he would say, and at some point I think it is fair to describe my claims as neither mine nor (any model of) Eliezer's, but instead something like my extrapolation of my model of Eliezer, which may not correspond at all to what the real Eliezer thinks.]

> To the extent that it does successfully reason about those things in a non-myopic way, in what remaining sense is the optimizer myopic?

The sense in which it's still myopic is that it's non-deceptive, which is the only sense that we actually care about.

> it ends up imitating X in full (non-myopic) generality, in which case you have not managed to achieve any kind of safety improvement over X proper

The safety improvement that I'm claiming is that it wouldn't be deceptive. What is the mechanism by which you think a myopic agent would end up acting deceptively?

If the underlying process your myopic agent was trained to imitate would (under some set of circumstances) be incentivized to deceive you, and the myopic agent (by hypothesis) imitates the underlying process to sufficient resolution, why would the deceptive behavior of the underlying process not be reflected in the behavior of the myopic agent?

Conversely, if the myopic agent does not learn to imitate the underlying process to sufficient resolution that unwanted behaviors like deception start carrying over, then it is very likely that the powerful consequentialist properties of the underlying process have not been carried over, either. This is because (on my extrapolation of Eliezer's model) deceptive behavior, like all other instrumental strategies, arises from consequentialist reasoning, and is deeply tied to such reasoning in a way that is not cleanly separable—which is to say, by default, you do not manage to sever one without also severing the other.

Again, I (my model of Eliezer) do not think the "deep tie" in question is necessarily insoluble; perhaps there is some sufficiently clever method which, if used, would successfully filter out the "unwanted" instrumental behavior ("deception", in your terminology) from the "wanted" instrumental behavior (planning, coming up with strategies, in general being an effective agent in the real world). But this distinction between "wanted" and "unwanted" is not a natural distinction; it is, in fact, a distinction highly entangled with human concepts and human values, and any "filter" that selects based on said distinction will need to be of similar complexity. (Of identical complexity, in fact, to the whole alignment problem.) "Simple" filters like the thing you are calling "myopia" definitely do not suffice to perform this function.

I'd be interested in hearing which aspect(s) of the above model you disagree with, and why.

If the underlying process your myopic agent was trained to imitate would (under some set of circumstances) be incentivized to deceive you, and the myopic agent (by hypothesis) imitates the underlying process to sufficient resolution, why would the deceptive behavior of the underlying process not be reflected in the behavior of the myopic agent?

Yeah, this is obviously true. Certainly if you have an objective of imitating something that would act deceptively, you'll get deception. The solution isn't to somehow “filter out the unwanted instrumental behavior from the wanted instrumental behavior,” though, it's just to not imitate something that would be deceptive.

It's perhaps worth pointing out why, if we already have something to imitate that isn't deceptive, we don't just run that thing directly. The answer is that we can't: all of the sorts of things that might be both competitive and safe to myopically imitate are things like HCH that are too inefficient to run directly.

This is a great thread. Let me see if I can restate the arguments here in different language:

  1. Suppose Bob is a smart guy whom we trust to want all the best things for humanity. Suppose we also have the technology to copy Bob's brain into software and run it in simulation at, say, a million times its normal speed. Then, if we thought we had one year between now and AGI (leaving aside the fact that I just described a literal AGI in the previous sentence), we could tell simulation-Bob, "You have a million subjective years to think of an effective pivotal act in the real world, and tell us how to execute it." Bob's a smart guy, and we trust him to do the right thing by us; he should be able to figure something out in a million years, right?
     
  2. My understanding of Evan's argument at this point would be: "Okay; so we don't have the technology to directly simulate Bob's brain. But maybe instead we can imitate its I/O signature by training a model against its actions. Then, because that model is software, we can (say) speed it up a million times and deal with it as if it was a high-fidelity copy of Bob's brain, and it can solve alignment / execute pivotal action / etc. for us. Since Bob was smart, the model of Bob will be smart. And since Bob was trustworthy, the model of Bob will be trustworthy to the extent that the training process we use doesn't itself introduce novel long-term dependencies that leave room for deception."
     
  3. Note that myopia — i.e., the purging of long term dependencies from the training feedback signal — isn't really conceptually central to the above scheme. Rather it is just a hack intended to prevent additional deception risks from being introduced through the act of copying Bob's brain. The simulated / imitated copy of Bob is still a full-blown consequentialist, with all the manifold risks that entails. So the scheme is basically a way of taking an impractically weak system that you trust, and overclocking it but not otherwise affecting it, so that it retains (you hope) the properties that made you trust it in the first place.
     
  4. At this point my understanding of Eliezer's counterargument would be: "Okay sure; but find me a Bob that you trust enough to actually put through this process. Everything else is neat, but it is downstream of that." And I think that this is correct and that it is a very, very strong objection, but — under certain sets of assumptions about timelines, alternatives, and counterfactual risks — it may not be a complete knock-down. (This is the "belling the cat" bit, I believe.)
     
  5. And at this point, maybe (?) Evan says, "But wait; the Bob-copy isn't actually a consequentialist because it was trained myopically." And if that's what Evan says, then I believe this is the point at which there is an empirically resolvable disagreement.

Is this roughly right? Or have I missed something?

Eliezer's counterargument is "You don't get a high-fidelity copy of Bob that can be iterated and recursed to do arbitrary amounts of work a Bob-army could do, the way Bob could do it, until many years after the world otherwise ends.  The imitated Bobs are imperfect, and if they scale to do vast amounts of work, kill you."

To be clear, I agree with this as a response to what Edouard said—and I think it's a legitimate response to anyone proposing we just do straightforward imitative amplification, but I don't think it's a response to what I'm advocating for in this post (though to be fair, this post was just a quick sketch, so I suppose I shouldn't be too surprised that it's not fully clear).

In my opinion, if you try to imitate Bob and get a model that looks like it behaves similarly to Bob, but have no other guarantees about it, that's clearly not a safe model to amplify, and probably not even a safe model to train in the first place. That's because instead of getting a model that actually cares about imitating Bob or anything like that, you probably just got some pseudo-aligned mesa-optimizer with an objective that produces behavior that happens to correlate well with Bob's.

However, there does exist a purely theoretical construct—what would happen if you actually amplified Bob, not an imitation of Bob—that is very likely to be safe and superhuman (though probably still not fully competitive, but we'll put that aside for now since it doesn't seem to be the part you're most skeptical of). Thus, if you could somehow get a model that was in fact trying to imitate amplified Bob, you might be okay—except that that's not true, because most types of agents, when given the objective of imitating a safe thing, will end up with a bunch of convergent instrumental goals that break that safety. However, I claim that there are natural types of agents (that is, not too complex on a simplicity prior) that, when given the objective of imitating a safe thing, do so safely. That's what I mean by my step (1) above (and of course, even if such natural agents exist, there's still a lot you have to do to make sure you get them—that's the rest of the steps).

But since you seem most skeptical of (1), maybe I'll try to lay out my basic case for how I think we can get a theory of simple, safe imitators (including simple imitators with arbitrary levels of optimization power):

  • All the really basic concerns—e.g. it tries to get more compute so it can simulate better—can be solved by having a robust Cartesian boundary and having an agent that optimizes an objective defined on actions through the boundary (similarly to why an approval-directed agent wouldn't do this sort of thing—the main problem with approval-directed agents just being that human approval is not a very good thing to optimize for).
  • Specifying a robust Cartesian boundary is not that hard—you just need a good multi-level world-model, which any powerful agent should have to have anyway.
  • There are remaining issues related to superrationality, but those can be avoided by having a decision theory that ignores them (e.g. the right sort of CDT variant).
  • There are also some remaining issues related to tiling, but those can be avoided if the Cartesian boundary is structured in such a way that it excludes other agents (this is exactly the trick that LCDT pulls).

All the really basic concerns—e.g. it tries to get more compute so it can simulate better—can be solved by having a robust Cartesian boundary and having an agent that optimizes an objective defined on actions through the boundary

I'm confused from several directions here.  What is a "robust" Cartesian boundary, why do you think this stops an agent from trying to get more compute, and when you postulate "an agent that optimizes an objective" are you imagining something much more like an old chess-playing system with a known objective than a modern ML system with a loss function?

are you imagining something much more like an old chess-playing system with a known objective than a modern ML system with a loss function?

No—I'm separating out two very important pieces that go into training a machine learning model: what sort of model you want to get and how you're going to get it. My step (1) above, which is what I understand that we're talking about, is just about that first piece: understanding what we're going to be shooting for when we set up our training process (and then once we know what we're shooting for we can think about how to set up a training process to actually land there). See “How do we become confident in the safety of a machine learning system?” for understanding this way of thinking about ML systems.

It's worth pointing out, however, that even when we're just focusing on that first part, it's very important that we pay attention to the total complexity that we're paying in specifying what sort of model we want, since that's going to determine a lot of how difficult it will be to actually construct a training process that produces such a model. Exactly what sort of complexity we should be paying attention to is a bit unclear, but I think that the best model we currently have of neural network inductive biases is something like a simplicity prior with a speed cap (see here for some empirical evidence for this).
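One way to make "a simplicity prior with a speed cap" concrete is the toy sketch below. It is only an illustration under my own assumptions, not a claim about how neural network inductive biases are actually computed: the function name `capped_simplicity_prior`, the bit counts, and the runtimes are all hypothetical. Each candidate model gets weight 2^(-description length), and any model whose runtime exceeds the cap gets weight zero.

```python
def capped_simplicity_prior(models, speed_cap):
    """Toy prior: weight 2**-bits for models within the speed cap, else 0.

    `models` maps a (hypothetical) model name to a pair
    (description_length_bits, runtime_steps).
    """
    weights = {
        name: (2.0 ** -bits if steps <= speed_cap else 0.0)
        for name, (bits, steps) in models.items()
    }
    total = sum(weights.values())
    if total == 0:
        raise ValueError("no candidate model fits under the speed cap")
    # Normalize so the surviving models' probabilities sum to 1.
    return {name: w / total for name, w in weights.items()}


# Hypothetical candidates: the shortest program is also the slowest one.
candidates = {
    "short_but_slow": (10, 1000),
    "medium_fast": (12, 50),
    "long_fast": (14, 40),
}
prior = capped_simplicity_prior(candidates, speed_cap=100)
```

In this toy setting, the shortest model is excluded entirely by the cap, and the remaining mass is split by description length alone. That is the sense in which the bits spent specifying the desired model matter.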

What is a "robust" Cartesian boundary, why do you think this stops an agent from trying to get more compute

Broadly speaking, I'd say that a Cartesian boundary is robust if the agent has essentially the same concept of what its action, observation, etc. is regardless of what additional true facts it learns about the world.

The Cartesian boundary itself does nothing to prevent an agent from trying to get more compute to simulate better, but having an objective that's just specified in terms of actions rather than world states does. If you want a nice simple proof of this, Alex Turner wrote one up here (and discusses it a bit more here), which demonstrates that instrumental convergence disappears when you have an objective specified in terms of action-observation histories rather than world states.
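As a rough illustration of the difference (a toy sketch, not Alex Turner's proof), consider a tiny planning problem in which grabbing extra compute makes later work more productive. The environment, the action names, and both objective functions below are hypothetical. An objective defined on the resulting world state rewards grabbing compute first, while an objective defined only on the actions themselves does not.

```python
from itertools import product

ACTIONS = ("work", "grab_compute")
TARGET = ("work", "work", "work", "work")  # hypothetical action sequence to imitate


def final_output(plan):
    """Toy world model: a 'work' step yields 1 unit, or 3 once compute was grabbed."""
    has_compute, output = False, 0
    for action in plan:
        if action == "grab_compute":
            has_compute = True
        else:
            output += 3 if has_compute else 1
    return output


def state_objective(plan):
    # Scores the world state the plan leads to.
    return final_output(plan)


def action_objective(plan, target=TARGET):
    # Scores only how many actions match the target, ignoring their effects.
    return sum(a == t for a, t in zip(plan, target))


def best_plan(objective, horizon=4):
    # Exhaustively search all plans of the given length.
    return max(product(ACTIONS, repeat=horizon), key=objective)


state_best = best_plan(state_objective)
action_best = best_plan(action_objective)
```

Here `state_best` starts by grabbing compute, while `action_best` is simply the target sequence. The resource grab buys nothing under an objective that never looks at world states.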

Like I said above, however, there are still some remaining problems—just having an objective specified in terms of actions isn't quite enough.

Thanks, that helps. So actually this objection says: "No, the biggest risk lies not in the trustworthiness of the Bob you use as the input to your scheme, but rather in the fidelity of your copying process; and this is true even if the errors in your copying process are being introduced randomly rather than adversarially. Moreover, if you actually do develop the technical capability to reduce your random copying-error risk down to around the level of your Bob-trustworthiness risk, well guess what, you've built yourself an AGI. But since this myopic copying scheme thing seems way harder than the easiest way I can think of to build an AGI, that means a fortiori that somebody else built one the easy way several years before you built yours."

Is that an accurate interpretation?

Closer, yeah.  In the limit of doing insanely complicated things with Bob, you will start to break him even if he is faithfully simulated, since you will be doing things that would break the actual Bob; but I think HCH schemes fail long before they get to that point.

Gotcha. Well, that seems right—certainly in the limit case.

Abstracting out one step: there is a rough general argument that human-imitating AI is, if not perfectly safe, then at least as safe as the humans it's imitating. In particular, if it's imitating humans working on alignment, then it's at least as likely as we are to come up with an aligned AI. Its prospects are no worse than our prospects are already. (And plausibly better, since the simulated humans may have more time to solve the problem.)

For full strength, this argument requires that:

  • It emulates the kind of alignment research which the actual humans would do, rather than some other kind of work
  • It correctly imitates the humans

Once we relax either of those assumptions, the argument gets riskier. A relaxation of the first assumption would be e.g. using HCH in place of humans working normally on the problem for a while (I expect this would not work nearly as well as the actual humans doing normal research, in terms of both safety and capability). The second assumption is where inner alignment problems and Evan's work enter the picture.

The key idea, in the case of HCH, would be to direct that optimization towards the goal of producing an action that is maximally close to what HCH would do.

Why do you expect this to be any easier than directing that optimisation towards the goal of "doing what the human wants"? In particular, if you train a system on the objective "imitate HCH", why wouldn't it just end up with the same long-term goals as HCH has? That seems like a much more natural thing for it to learn than the concept of imitating HCH, because in the process of imitating HCH it still has to do long-term planning anyway.

(I feel like this is basically the same set of concerns/objections that I raised in this post. I also think that myopia is a fairly central example of the thing that Eliezer was objecting to with his "water" metaphor in our dialogue, and I endorse his objection in this context.)

Why do you expect this to be any easier than directing that optimisation towards the goal of "doing what the human wants"? In particular, if you train a system on the objective "imitate HCH", why wouldn't it just end up with the same long-term goals as HCH has?

To be clear, I was only talking about (1) here, which is just about what it might look like for an agent to be myopic, not how to actually get an agent that satisfies (1). I agree that you would most likely get a proxy-aligned model if you just trained on “imitate HCH”—but just training on “imitating HCH” is definitely not the plan. See (2), (3), (4), (5) for how we actually get an agent that satisfies (1).

In terms of ease of getting (1)/naturalness of (1), all we need out of (1) there is for our concept of myopia to not cost so many bits that it's too unnatural to get (2), (3), and (4) to work, not that it's the most natural thing for you to get if all you do is just train on imitative amplification.

That all makes sense. But I had a skim of (2), (3), (4), and (5) and it doesn't seem like they help explain why myopia is significantly more natural than "obey humans"?

I mean, that's because this is just a sketch, but a simple argument for why myopia is more natural than “obey humans” is that if we don't care about competitiveness, we already know how to build myopic optimizers, whereas we don't know how to build an optimizer to “obey humans” at any level of capabilities.

Furthermore, LCDT is a demonstration that we can at least reduce the complexity of specifying myopia to the complexity of specifying agency. I suspect we can get much better upper bounds on the complexity than that, though.

Furthermore, LCDT is a demonstration that we can at least reduce the complexity of specifying myopia to the complexity of specifying agency.

It's an interesting idea, but are you confident that LCDT actually works? E.g. have you thought more about the issues I talked about here and concluded they're not serious problems?

I still don't see how we could get e.g. an HCH simulator without agentic components (or the simulator's qualifying as an agent).
As soon as an LCDT agent expects that it may create agentic components in its simulation, it's going to reason horribly about them (e.g. assuming that any adjustment it makes to other parts of its simulation can't possibly impact their existence or behaviour, relative to the prior).

I think LCDT does successfully remove the incentives you're aiming to remove. I just expect it to be too broken to do anything useful. I can't currently see how we could get the good parts without the brokenness.

we already know how to build myopic optimizers

What are you referring to here?

I think you might be able to design advanced nanosystems without AI doing long term real world optimization. 

Well a sufficiently large team of smart humans could probably design nanotech. The question is how much an AI could help.

Suppose unlimited compute. You program a simulation of quantum field theory. Add a GUI to see visualizations and move atoms around. Designing nanosystems is already quite a bit easier.

Now suppose you brute force search over all arrangements of 100 atoms within a 1nm box, searching for the configuration that most efficiently transfers torque. 

You do similar searches for the smallest arrangement of atoms needed to make a functioning logic gate.

Then you download an existing microprocessor design, and copy it (but smaller) using your nanologic gates.

I know that if you start brute forcing over a trillion atoms, you might find a mesaoptimizer. (Although even then I would suspect that inspecting a visualization shouldn't result in anything brain hacky. It would only be actually synthesizing such a thing that was dangerous, or maybe possibly simulating it, if the mesaoptimizer realizes it's in a simulation and there are general simulation escape strategies.)

So look at the static output of your brute forcing. If you see anything that looks computational, delete it. Don't brute force anything too big. 

(Obviously you need human engineers here; any long term real world planning is coming from them.)
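The search procedure described above could be sketched roughly as follows. This is a toy under stated assumptions, not a real nanosystem designer: `score` and `looks_computational` are hypothetical placeholders the human engineers would have to supply, the lattice of sites stands in for the 1nm box, and `max_candidates` is an assumed way of enforcing "don't brute force anything too big".

```python
from itertools import combinations
from math import comb


def brute_force_design(sites, n_atoms, score, looks_computational,
                       max_candidates=100_000):
    """Exhaustively score every placement of n_atoms on the given sites.

    Refuses to run if the search space is larger than max_candidates, and
    skips any arrangement the (hypothetical) filter flags as computational.
    """
    total = comb(len(sites), n_atoms)
    if total > max_candidates:
        raise ValueError(f"search space of {total} arrangements is too big")
    best, best_score = None, float("-inf")
    for arrangement in combinations(sites, n_atoms):
        if looks_computational(arrangement):
            continue  # delete anything that looks computational
        s = score(arrangement)
        if s > best_score:
            best, best_score = arrangement, s
    return best, best_score


# Toy usage: a 3x3 lattice and a made-up score that prefers atoms near the origin.
lattice = [(x, y) for x in range(3) for y in range(3)]
design, design_score = brute_force_design(
    lattice,
    n_atoms=2,
    score=lambda arr: -sum(x + y for x, y in arr),
    looks_computational=lambda arr: False,
)
```

The output is a static arrangement for humans to inspect. Nothing in the loop plans over the real world.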

My attempt at a one sentence summary of the core intuition behind this proposal: if you can be sure your model isn’t optimizing for deceiving you, you can relatively easily tell if it’s trying to optimize for something you don’t want by just observing whether your model seems to be trying to do something obviously different from what you want during training, because it's much harder to slip under the radar by getting really lucky than by intentionally trying to.

The reason self supervised approaches took over NLP is that they delivered the best results. It would be convenient if the most alignable approach also gave the best results, but I don’t think that’s likely. If you convince the top lab to use an approach that delivered worse results, I doubt much of the field would follow their example.

I suspect that there were a lot of approaches that would have produced similar results to how we ended up doing language modeling. I believe that the main advantage of Transformers over LSTMs is just that LSTMs have exponentially decaying ability to pay attention to prior tokens while Transformers can pay constant attention to all tokens in the context. I suspect that it would have been possible to fix the exponential decay problem with LSTMs and get them to scale like Transformers, but Transformers came first, so nobody tried. And that's not to say that ML as a field is incompetent or anything—it's just why would you try when you already have Transformers.
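To make the decay claim concrete, here is a back-of-the-envelope sketch under a simplifying assumption: the LSTM forget gate is held at a constant value `f`. Real forget gates depend on the input, so this is an illustration only. In the cell-state update c_t = f * c_{t-1} + (new input), a token from `distance` steps back is scaled by `f ** distance`. By contrast, an attention layer with equal scores gives every token in the context weight `1 / context_len`, whatever its distance. The function names are mine.

```python
def lstm_retention(forget_gate, distance):
    """Scale factor on a past token's cell-state contribution, constant gate assumed."""
    return forget_gate ** distance


def uniform_attention_weight(context_len):
    """Weight per token when softmax attention scores are all equal."""
    return 1.0 / context_len


# With f = 0.9, a token 100 steps back is scaled by roughly 2.7e-5,
# while uniform attention over 1000 tokens still gives each one 1e-3.
far_lstm = lstm_retention(0.9, 100)
far_attention = uniform_attention_weight(1000)
```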

Also, note that “best results” for powerful AI systems is going to include alignment—alignment is a pretty important component of best results for any actual practical application that the big labs care about that isn't just “scores the highest on some benchmark.”

I agree that transformers vs other architectures is a better example of the field “following the leader” because there are lots of other strong architectures (perceiver, mlp mixer, etc). In comparison, using self supervised transfer learning is just an objectively good idea you can apply to any architecture and one the brain itself almost surely uses. The field would have converged to doing so regardless of the dominant architecture.

One hopeful sign is how little attention the ConvBERT language model has gotten. It mixes some convolution operations with self attention to allow self attention heads to focus on global patterns as opposed to local patterns better handled by convolution. ConvBERT is more compute efficient than a standard transformer, but hasn’t made much of a splash. It shows the field can ignore low profile advances made by smaller labs.

For your point about the value of alignment: I think there’s a pretty big range of capabilities where the marginal return on extra capabilities is higher than the marginal return on extra alignment. Also, you seem focused on avoiding deception/treacherous turns, which I think are a small part of alignment costs until near human capabilities.

I don’t know what sort of capabilities penalty you pay for using a myopic training objective, but I don’t think there’s much margin available before voluntary mass adoption becomes implausible.