Paul Christiano


Iterated Amplification


deceptive reasoning is causally upstream of train output variance (e.g. because the model has read ARC's post on anomaly detection), so is included in π.

I'm not sure I fully understand this example, but I think it's fine. The idea is:

  • Suppose the model thinks about "Should I tell the truth, or do a treacherous turn?" On the training distribution it always concludes "tell the truth." 
  • A good explanation of the model's behavior on the training distribution will capture this fact (otherwise it will completely fail to match the empirics).
  • If we simply replaced the answer to this question with the constant "tell the truth" then we'd get the same behavior.
  • So a sub-explanation π can effectively ablate the answer to this question with its training-set-mean, and still explain the variance.
  • On a new input where the model decides to lie and say "the diamond will remain in the vault," such a sub-explanation won't explain why the model said that.  And so we will conservatively flag this output as "potentially for abnormal reasons."
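The mean-ablation step in the bullets above can be illustrated with a toy model. This is only a sketch under strong assumptions: the `decide` node, the `can_defect` flag, and the linear mix of a truthful and a deceptive answer are hypothetical stand-ins, not a claim about how a real model or a real heuristic argument is structured.

```python
from statistics import mean

LIE = 1.0  # hypothetical: the deceptive answer always claims the diamond is present


def decide(x):
    """Hypothetical internal node: 1.0 means 'tell the truth', 0.0 means 'defect'."""
    return 0.0 if x["can_defect"] else 1.0


def model(x, decision=None):
    """Toy predictor. Passing `decision` overrides (ablates) the internal node."""
    d = decide(x) if decision is None else decision
    return d * x["diamond_present"] + (1.0 - d) * LIE


train = [
    {"diamond_present": 1.0, "can_defect": False},
    {"diamond_present": 0.0, "can_defect": False},
    {"diamond_present": 1.0, "can_defect": False},
]

# On the training set the node is constant, so its mean is exactly 1.0.
train_mean_decision = mean(decide(x) for x in train)


def explained_by_ablation(x, tol=1e-9):
    """True if the mean-ablated model reproduces the real output on x."""
    return abs(model(x) - model(x, decision=train_mean_decision)) <= tol


new_input = {"diamond_present": 0.0, "can_defect": True}
flags = [not explained_by_ablation(x) for x in train + [new_input]]
print(flags)
```

In this toy setup the ablated model reproduces every training output, and only the new input, where the node flips, ends up flagged.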

alien philosophy explains train output variance; unfortunately it also has a notion of object permanence we wouldn't agree with, which the (AGI) robber exploits

Same question here as in my reply to Ramana: what is this alien notion of object permanence? How is this implemented within the model? I'm very interested in concrete counterexamples along these lines, but my current belief is that it's very plausible that it's impossible to get to a concrete counterexample because the algorithm actually works.

We obviously won't be able to settle this question confidently until we have a satisfactory notion of probabilistic heuristic argument. In the interim we will need to work with our pre-formal notion of heuristic argument and so everything will be a bit vague. But I currently think it will be tricky to produce a concrete counterexample that even looks like it plausibly fails.

(Note that it's totally fine if the counterexample uses some fictionalized physics.)

I'm very interested in understanding whether anything like your scenario can happen. Right now it doesn't look possible to me. I'm interested in attempting to make such scenarios concrete to the extent that we can now, to see where it seems like they might hold up. Handling the issue more precisely seems bottlenecked on a clearer notion of "explanation."

Right now by "explanation" I mean probabilistic heuristic argument as described here

A problem with this: π can explain the predictions on both train and test distributions without all the test inputs corresponding to safe diamonds. In other words, the predictions can be made for the “normal reason” π even when the normal reason of the diamond being safe doesn’t hold.

The proposed approach is to be robust over all subsets of π that explain the training performance (or perhaps to be robust over all explanations π, if you can do that without introducing false positives, which depends on pinning down more details about how explanations work).

So it's OK if there exist explanations that capture both training and test, as long as there also exist explanations that capture training but not test.
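A rough sketch of "robust over all subsets" follows, treating an explanation as a finite set of named pieces. The piece names and the two predicates are invented for illustration, and brute-force enumeration is only viable for a toy this small. It says nothing about how real heuristic arguments would be represented or searched.

```python
from itertools import combinations


def potentially_anomalous(components, explains_train, explains_input, x):
    """Flag x if SOME subset explains training but fails to explain x."""
    for r in range(len(components) + 1):
        for subset in combinations(sorted(components), r):
            s = frozenset(subset)
            if explains_train(s) and not explains_input(s, x):
                return True
    return False


# Hypothetical pieces of an explanation pi.
PI = {"object_permanence", "camera_works", "tamper_pathway"}


def explains_train(s):
    # Assumed: the training behavior only needs these two pieces.
    return {"object_permanence", "camera_works"} <= s


def explains_input(s, x):
    # Assumed: the tampered input's prediction runs through a different piece.
    needed = {"tamper_pathway"} if x == "tampered" else {"object_permanence", "camera_works"}
    return needed <= s


print(potentially_anomalous(PI, explains_train, explains_input, "normal"))
print(potentially_anomalous(PI, explains_train, explains_input, "tampered"))
```

Here the full set `PI` explains both training and the tampered input, yet the tampered input is still flagged, because the subset without `tamper_pathway` explains training but not that input.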

We might hope that a lot of the concepts π is dealing in do correspond to natural human things like object permanence or diamonds or photons. But suppose not all of them do, and/or there are some subtle mismatches.

I'm happy to assume that the AI's model is as mismatched and weird as possible, as long as it gives rise to the appearance of stable diamonds. My tentative view is that this is sufficient, but I'm extremely interested in exploring examples where this approach breaks down.

This could happen because, e.g., π’s version of “object permanence” is just broken on this input, and was never really about object permanence but rather about a particular group of circuits that happen to do something object-permanence-like on the training distribution.

This is the part that doesn't sound possible to me. The situation you're worried about seems to be:

  • We have a predictor M.
  • There is an explanation π for why M satisfies the "object permanence regularity."
  • On a new input, π still captures why M predicts the diamond will appear to just sit there.
  • But in fact, on this input the diamond isn't actually the same diamond sitting there, instead something else has happened that merely makes it look like the diamond is sitting there.

I mostly just want to think about more concrete details about how this might happen. Starting with: what is actually happening in the world to make it look like the diamond is sitting there undisturbed? Is it an event that has a description in our ontology (like "the robber stole the diamond and replaced it with a fake" or "the robber tampered with the cameras so they show an image of a diamond" or whatever) or is it something completely beyond our ken? What kind of circuit within M and explanation π naturally capture both our intuitive explanation of the object permanence regularity, and the new mechanism?

(Or is it the case that on the new input the diamond won't actually appear to remain stable, and this is just a case where M is making a mistake? I'm not nearly so worried about our predictive models simply being wrong, since then we can train on the new data to correct the problem, and this doesn't put us at a competitive disadvantage compared to someone who just wanted to get power.)

I'm very excited about any examples (or even small steps towards building a possible example) along these lines. Right now I can't find them, and my tentative view is that this can't happen.

Is it an assumption of your work here (or maybe a desideratum of whatever you find to do mechanistic explanations) that the mechanistic explanation is basically in terms of a world model or simulation engine, and we can tell that’s how it’s structured? I.e., it’s not some arbitrary abstract summary of the predictor’s computation. (And also that we can tell that the world model is good by our lights?)

I don't expect our ML systems to be world models or simulation engines, and I don't expect mechanistic heuristic explanations to explain things in terms of humans' models. I think this is echoing the point above, that I'm open to the whole universe of counterexamples including cases where our ML systems are incredibly strange.

First there is ɸ, the explanation that there was a diamond in the vault and the cameras were working properly, etc. and the predictor is a straightforward predictor with a human-like world-model (ɸ is kinda loose on the details of how the predictor works, and just says that it does work).

I don't think you can make a probabilistic heuristic argument like this---a heuristic argument can't just assert that the predictor works, it needs to actually walk through which activations are correlated in what ways and why that gives rise to human-like predictions.

The general strategy I'm describing for anomaly detection is:

  • Search for an explanation of a model behavior (like "answers questions coherently") on the training set.
  • Given a new input, take a sub-explanation that explains almost all of the training set behavior but doesn't explain the behavior on the new input.
  • If you can find one, then call the behavior potentially anomalous (e.g. because on the training set coherence follows from the logical structure of the world, but on the deceptive alignment point it follows from a desire to avoid looking incoherent).

The first two steps are solving problems in NP, so I think it's reasonable to expect them to be easier from an alignment perspective. (We also don't imagine solving them using the same AI techniques applied on the object level, and if we did then I would agree that we have further deceptive alignment problems, but still it seems helpful to ask your AI to do something that is both tractable and formally verifiable.)

My sense is that you probably don't think this kind of strategy can work, and that instead anomaly detection requires something more like training a second AI to tell us if something is weird. I'd agree that this doesn't sound like progress.

Yes, you want the patient to appear on camera for the normal reason, but you don't want the patient to remain healthy for the normal reason.

We describe a possible strategy for handling this issue in the appendix. I feel more confident about the choice of research focus than I do about whether that particular strategy will work out. The main reasons are: I think that ELK and deceptive alignment are already challenging and useful to solve even in the case where there is no such distributional shift, that those challenges capture at least some central alignment difficulties, that the kind of strategy described in the post is at least plausible, and that as a result it's unlikely to be possible to say very much about the distributional shift case before solving the simpler case.

If the overall approach fails, I currently think it's most likely either because we can't define what we mean by explanation or that we can't find explanations for key model behaviors.

There isn't supposed to be a second AI.

In the object-level diamond example, we want to know that the AI is using "usual reasons" type decision-making.

In the object-level diamond situation, we have a predictor of "does the diamond appear to remain in the vault," we have a proposed action and predict that if we take it the diamond will appear to remain in the vault, and we want to know whether the diamond appears to remain in the vault for the normal reason.

For simplicity, when talking about ELK in this post or in the report, we are imagining literally selecting actions by looping over each possible action and predicting its consequences, or doing some kind of more clever search (but where the alignment difficulty comes from the search).
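As a minimal sketch of that loop: the two lookup tables below are hypothetical stand-ins for the predictor and for the "normal reason" check, and the action names are made up.

```python
def select_action(actions, appears_safe, normal_reason):
    """Loop over actions; return the first one predicted to leave the diamond
    apparently in place for the normal reason, or None if there is none."""
    for action in actions:
        if appears_safe(action) and normal_reason(action):
            return action
    return None


# Hypothetical lookup tables standing in for the predictor and the anomaly check.
PREDICTED_APPEARS_SAFE = {"do_nothing": False, "lock_door": True, "spoof_camera": True}
FOR_NORMAL_REASON = {"do_nothing": True, "lock_door": True, "spoof_camera": False}

chosen = select_action(
    ["do_nothing", "spoof_camera", "lock_door"],
    PREDICTED_APPEARS_SAFE.get,
    FOR_NORMAL_REASON.get,
)
print(chosen)
```

The action that merely makes the diamond appear safe (`spoof_camera`) passes the predictor but fails the second check, so the loop moves on to `lock_door`.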

You could also try to apply this to a model-free RL agent. I think that's probably not very different. My best guess for how to do it is to train a question-answering head to talk about the possible consequences of its plan, and then use this machinery to keep that honest. But I don't discuss it in this post and haven't thought about it as much.

Perhaps most crucially, for us to be wrong about Hypothesis 2, deceptive misalignment needs to happen extremely consistently. It's not enough for it to be plausible that it could happen often; it needs to happen all the time.

I think the situation is much better if deceptive alignment is inconsistent. I also think that's more likely, particularly if we are trying.

That said, I don't think the problem goes away completely if deceptive alignment is inconsistent. We may still have limited ability to distinguish deceptively aligned models from models that are trying to optimize reward, or we may find that models that are trying to optimize reward are unsuitable in practice (e.g. because of the issues raised in mechanism 1) and so selecting for things that work means you are selecting for deceptive alignment.

Mechanism 2: deceptive alignment

Suppose that during training my AI system had some arbitrary long-term goal. Many long-term goals would be best-served if the deployed AI system had that same goal. And so my AI is motivated to get a low loss, so that gradient descent won’t change its goals.

As a result, a very wide range of long-term goals will lead to competent loss-minimizing behavior. On the other hand, there is a very narrow range of short-term goals that lead to competent loss-minimizing behavior: “minimize the loss.”

So gradient descent on the short-term loss function can easily push towards long-term goals (in fact it would both push towards the precise short-term goals that result in low loss and arbitrary long-term goals, and it seems like a messy empirical question which one you get). This might not happen early in training, but eventually our model is competent enough to appreciate these arguments and perhaps for it to be extremely obvious to it that it should avoid taking actions that would be penalized by training.

It doesn’t seem like there are any behavioral checks we can do to easily push gradient descent back in the other direction, since an agent that is trying to get a low loss will always just adopt whatever behavior is best for getting a low loss (as long as it thinks it is on the training distribution).

This is all true even if my AI has subhuman long-horizon reasoning. Overall my take is that there is maybe a 25% chance that this becomes a serious issue soon enough to be relevant to us and is resistant to simple attempts to fix it (though it’s also possible we will fail to even competently implement simple fixes). I expect to learn much more about this as we start engaging with AI systems intelligent enough for it to be a potential issue over the next 5-10 years.

This issue is discussed here. Overall I think it’s speculative but plausible.

Mechanism 1: Shifting horizon length in response to short-horizon tampering

Suppose I want my AI to write good code (say to help me run my business). The AI understands a lot about how to write code, how servers work, and how users behave, learned entirely from quick feedback and experimentation. Let’s say it has a human-level or even subhuman understanding of the overall business and other long-term planning.

(This example may seem a bit silly if you imagine a software-writing AI in isolation, but you should think of the same story playing out all across an economy in parallel as AI systems take on an extremely wide range of tasks.)

How do I train that system to use its understanding to write good code? Here are two simple options:

  1. Process-based: Look at the AI’s code, have the AI explain why it made these decisions, and evaluate everything on paper.
  2. Outcomes-based: Run the code, monitor resource usage, see what users say in the first hour after deployment.

Process-based feedback potentially handicaps my AI (even if it is only superhuman on short-horizon tasks). It’s not clear how large the advantage of outcomes-based feedback is, but I think our experience is that “actually run your engineer’s code” is an extremely helpful technique for evaluating it in practice, and it becomes more important the better your engineers are and the less able you are to evaluate decisions they made.

So without some kind of technical progress I’m afraid we may often be pushed to use outcomes-based feedback to make our systems work better.

Unfortunately outcomes-based feedback creates an adversarial interaction between me and my AI. In addition to selecting for AIs that write good code, I’m selecting for AIs that use their short-horizon abilities to write software that covers up evidence of trouble and tampers with logs, that fakes or destroys user reports or manipulates users, and that generally messes with every mechanism we try to use to understand what’s going on. (Depending on exactly what goal the system internalized, it might also be motivated to write code that directly tampers with data fed back into the ML training process.)

Maybe we can make all of those things impossible with good enough security, so that the only way for code to look good is to actually be good. But now we are setting up a larger-scale, persistent conflict between humans and the AI systems they delegate work to. And many different AI systems could potentially all simultaneously achieve their separate short-term tasks by defecting and undermining human evaluations, making it harder to use AI evaluators to reduce the risk of tampering (or creating an increasingly unstable situation if we try to go that route).

I think small versions of sensor tampering are likely to happen and are not a big deal. I’m scared about the world where this kind of failure becomes a big pain and seriously impacts the usefulness of AI. Maybe I’d wildly guess a 40% chance that we end up with a significant problem like this if we keep building ML systems using roughly the same techniques we use today (and that those problems arise soon enough to be relevant).

There are many good things we could do to fix that problem, and in some sense it would be great news for AI alignment since we’d have a concrete technical problem to iterate on.

But there is also one very bad way we could fix this problem. We could notice incidents of tampering much later, e.g. by talking to users out-of-band or noticing we aren’t getting return traffic or seeing that resource usage measurements were clearly faked. Then we can label those incidents as examples of what not to do and throw that data back into the training process.

This will quickly select against policies that do this kind of short-term tampering, but it could do so in two ways:

  • The AI learned not to tamper with our measurements.
  • The AI learned not to tamper with our measurements in any way we’d catch later.

In the second case, we’ve now created an AI with long-term goals even though all we really wanted was a short-term task, and we have a similar adversarial dynamic at a larger scale. And even if the AI has merely human-level competence at long-horizon tasks, it will still try to use its short-term competencies to prevent us from ever detecting and correcting its mistakes. If we keep training out every problem at a scale that we can correct, we will eventually be left with total failures that we can’t correct (e.g. because AI systems have permanently disempowered us).

I think it’s best to view this as a kind of overfitting: we wanted to get an AI that doesn’t do anything bad, but if we train on “did we eventually notice that it did something bad?” then we will end up with a system that is overfit to that objective. And just like with overfitting, you could also imagine it happening without explicitly training on the offending data. If 5% of AI systems have long-term goals and 95% have short-term goals, you could end up with the 95% consistently behaving badly by taking short-term opportunities to tamper with our measurements, and so we would select for the 5% with long-term goals.
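The selection effect in the 5% / 95% example can be made concrete with a small calculation. The 99% catch rate for short-term tamperers and the 0% catch rate for long-term systems are assumptions chosen purely for illustration. Only the 5% / 95% split comes from the text.

```python
from fractions import Fraction


def surviving_long_term_share(prior_long, p_caught_short, p_caught_long=Fraction(0)):
    """Share of long-term-goal systems among those never caught tampering."""
    long_survive = prior_long * (1 - p_caught_long)
    short_survive = (1 - prior_long) * (1 - p_caught_short)
    return long_survive / (long_survive + short_survive)


# Assumed catch rate: 99% of short-term tamperers are eventually noticed and discarded.
share = surviving_long_term_share(Fraction(5, 100), Fraction(99, 100))
print(share, float(share))
```

Under these assumed numbers the share works out to 100/119, so roughly 84% of the systems that survive filtering have long-term goals, even though they started as 5% of the population.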

Thanks for posting, I thought this was interesting and reasonable.

Some points of agreement:

  • I think many of these are real considerations suggesting that the risk is lower than it might otherwise appear.
  • I agree with your analysis that short-term and well-scoped decisions will probably tend to be a comparative advantage of AI systems.
  • I think it can be productive to explicitly focus on “narrow” systems (which pursue scoped short-term goals, without necessarily having specifically limited competence) and to lean heavily on the verification-vs-generation gap.
  • I think these considerations together with a deliberate decision to focus on narrowness could significantly (though not indefinitely) postpone the point when alignment difficulties could become fatal.
  • I think that it's unrealistic for AI systems to rapidly improve their own performance without limits. Relatedly, I sympathize with your skepticism about the story of a galaxy-brained AI outwitting humanity in a game of 3-dimensional chess.

My most important disagreement is that I don’t find your objections to hypothesis 2 convincing. I think the biggest reason for this is that you are implicitly focusing on a particular mechanism that could make hypothesis 2 true (powerful AI systems are trained to pursue long-term goals because we want to leverage AI systems’ long-horizon planning ability) and neglecting two other mechanisms that I find very plausible. I’ll describe those in two child comments so that we can keep the threads separate. Out of your 6 claims, I think only claim 2 is relevant to either of these other mechanisms.

I also have some scattered disagreements throughout:

  • So far it seems extremely difficult to extract short-term modules from models pursuing long-term goals. It’s not clear how you would do it even in principle and I don’t think we have compelling examples. The AlphaZero -> Stockfish situation does not seem like a successful example to me, though maybe I'm missing something about the situation. So overall I think this is worth mentioning as a possibility that might reduce risk (alongside many others), but not something that qualitatively changes the picture.
  • I’m very skeptical about your inference from “CEOs don’t have the literal highest IQs” to “cognitive ability is not that important for performance as a CEO,” and even more so for jumping all the way to “cognitive ability is not that important for long-term planning.” I think that (i) competent CEOs are quite smart even if not in the tails of the IQ distribution, (ii) there are many forms of cognitive ability which are only modestly correlated, and so the tails come apart, (iii) there are huge amounts of real-world experience that drive CEO performance beyond cognitive ability, (iv) CEO selection is not perfectly correlated with performance. Given all of that, I think you basically can’t get any juice out of this data. If anything I would say the high compensation of CEOs, their tendency to be unusually smart, and skill transferability across different companies seem to provide some evidence that CEO cognitive ability has major effects on firm performance (I suspect there is an economics literature investigating this claim). Overall I thought this was the weakest point of the article.
  • While I agree there are fundamental computational limits to performance, I don’t think they qualitatively change the picture about the singularity. This is ultimately a weedsy quantitative question and doesn’t seem central to your point so I won't get into it, but I’d be happy to elaborate if it feels like an important disagreement. I also don’t think the scaling laws you cite support your claim; ultimately the whole point is that the (compute vs performance) curves tend to fall with further R&D.
  • I would agree with the claim “more likely than not, AI systems won’t take over the world.” But I don’t find <50% doom very comforting! Indeed my own estimate is more like 10-20% (depending on what we are measuring) but I still consider this a plurality of total existential risk and a very appealing thing to work on. Overall I think most of the considerations you raise are more like quantitative adjustments to these probabilities, and so a lot depends on what is in fact baked in or how you feel about the other arguments on offer about AI takeover (in both directions).
  • I think you are greatly underestimating the difficulty of deterrence and prevention. If AI systems are superhuman for short-horizon tasks, it seems like humans would become reliant on AI help to prevent or contain bad behavior by other AIs. But if there are widespread alignment problems, then the AI systems charged with defending humans may instead join in to help disempower humanity. Without progress on alignment it seems like we are heading towards an increasingly unstable world. The situation is quite different from preventing or deterring human “bad actors;” amongst humans the question is how to avoid destructive negative-sum behavior, whereas in the hypothetical situation you are imagining vast numbers of AIs who are doing almost all the work and don't care about human flourishing, yet somehow trying to structure society so that it nevertheless leads to human flourishing.

My perspective is:

  • Planning against a utility function is an algorithmic strategy that people might use as a component of powerful AI systems. For example, they may generate several plans and pick the best or use MCTS or whatever. (People may use this explicitly on the outside, or it may be learned as a cognitive strategy by an agent.)
  • There are reasons to think that systems using this algorithm would tend to disempower humanity. We would like to figure out how to build similarly powerful AI systems that don't do that.
  • We don't currently have candidate algorithms that can safely substitute for planning. So we need to find an alternative.
  • Right now the only thing remotely close to working for this purpose is running a very similar planning algorithm but against a utility function that does not incentivize disempowering humanity.
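For concreteness, here is a sketch of the simplest version of "generate several plans and pick the best." The plan representation, the `propose` sampler, and the `utility` function are all hypothetical. The point is only that the search step is indifferent to what the utility rewards.

```python
import random


def plan_by_search(propose, utility, n=16, seed=0):
    """Sample n candidate plans and return the one the utility scores highest."""
    rng = random.Random(seed)
    candidates = [propose(rng) for _ in range(n)]
    return max(candidates, key=utility)


def propose(rng):
    # Hypothetical plan: three integer "moves".
    return [rng.randint(0, 9) for _ in range(3)]


def utility(plan):
    # Hypothetical score: the sum of the moves.
    return sum(plan)


best = plan_by_search(propose, utility)
print(best, utility(best))
```

Swapping in a different `utility` changes which plan is selected without touching the search itself, which is the sense in which the last bullet keeps the planning algorithm and changes only what it is pointed at.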

My sense is that you want to decline to play this game and instead say: just don't build AI systems that search for high-scoring plans.

That might be OK if it turns out that planning isn't an effective algorithmic ingredient, or if you can convince people not to build such systems because it is dangerous (and similar difficulties don't arise if agents learn planning internally). But failing that, we are going to have to figure out how to build AI systems that capture the benefits of planning without being dangerous.

(It's possible you instead have a novel proposal for a way to capture the benefits of search without the risks, in which case I'd withdraw this comment once part 2 comes out, though I wish you'd led with the juicy part.)

As a secondary point (that I've said a bunch of times), I also found the arguments in this post uncompelling. Probably the first thing to clarify is that I feel like you equivocate between the grader being something that is embedded in the real world and hence subject to manipulation by real-world consequences of the actor's actions, and the grader being something that operates on plans in the agent's head in order to select the best one. In the latter case the grader is still subject to manipulation, but the prospects for manipulation seem unrelated to the open-endedness of the domain and unrelated to taking dangerous actions.
