
Most of the work on inner alignment so far has been informal or semi-formal (with the notable exception of a little work on minimal circuits). I feel this has resulted in some misconceptions about the problem. I want to write up a large document clearly defining the formal problem and detailing some formal directions for research. Here, I outline my intentions, inviting the reader to provide feedback and point me to any formal work or areas of potential formal work which should be covered in such a document. (Feel free to do that last one without reading further, if you are time-constrained!)

The State of the Subfield

Risks from Learned Optimization (henceforth, RLO) offered semi-formal definitions of important terms, and provided an excellent introduction to the area for a...

Richard Ngo (18h): I have fairly mixed feelings about this post. On one hand, I agree that it's easy to mistakenly address some plausibility arguments without grasping the full case for why misaligned mesa-optimisers might arise. On the other hand, there has to be some compelling (or at least plausible) case for why they'll arise, otherwise the argument that 'we can't yet rule them out, so we should prioritise trying to rule them out' is privileging the hypothesis. Secondly, it seems like you're heavily prioritising formal tools and methods for studying mesa-optimisation. But there are plenty of things that formal tools have not yet successfully analysed. For example, if I wanted to write a constitution for a new country, then formal methods would not be very useful; nor if I wanted to predict a given human's behaviour, or understand psychology more generally. So what's the positive case for studying mesa-optimisation in big neural networks using formal tools? In particular, I'd say that the less we currently know about mesa-optimisation, the more we should focus on qualitative rather than quantitative understanding, since the latter needs to build on the former. And since we currently do know very little about mesa-optimisation, this seems like an important consideration.

I agree with much of this. I over-sold the "absence of negative story" story; of course there has to be some positive story in order to be worried in the first place. I guess a more nuanced version would be that I am pretty concerned about the broadest positive story, "mesa-optimizers are in the search space and would achieve high scores in the training set, so why wouldn't we expect to see them?" -- and think more specific positive stories are mostly of illustrative value, rather than really pointing to gears that I expect to be important. (With the excep... (read more)

Ben Pace (1d): Curated. Solid attempt to formalize the core problem, and solid comment section from lots of people.
Rohin Shah (1d): Planned summary for the Alignment Newsletter:
michaelcohen (2d): I think you're imagining deep learning as a MAP-type approach—it just identifies a best hypothesis and does inference with that. Comparing the consensus algorithm with (pure, idealized) MAP, 1) it is no slower, and 2) the various corners that can be cut for MAP can be cut for the consensus algorithm too.

Starting with 1), the bulk of the work for either the consensus algorithm or a MAP approach is computing the posterior to determine which model(s) is (are) best. In an analogy to neural networks, it would be like saying most of the work comes from using the model (the forward pass) rather than arriving at the model (the many forward and backward passes in training).

Regarding 2), state-of-the-art-type AI basically assumes approximate stationarity when separating a training phase from a test/execution phase. This is cutting a huge corner, and it means that when you think of a neural network running, you mostly think about it using the hypothesis that it has already settled on. But if we compare apples to apples, a consensus algorithm can cut the same corner to some extent. Neither a MAP algorithm nor a consensus algorithm is any better equipped than the other to, say, update the posterior only when the timestep is a power of two. In general, training (be it SGD or posterior updating) is the vast bulk of the work in learning. To select a good hypothesis in the first place you will have already had to consider many more; the consensus algorithm just says to keep track of the runner-ups.

I don't understand what "out-guess" means. But what we need is that the malign hypotheses don't have substantially higher posterior weight than the benign ones. As time passes, the probability of this happening is not independent. The result I show about the probability of the truth being in the top set applies to all time, not any given point in time. I don't know what "no realistic hypothesis has all the answers" means.
There will be a best “realistic” benign hypothesis, and we can talk a
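The MAP-versus-consensus comparison above can be made concrete with a small sketch. This is a hypothetical toy (the hypothesis pool, the threshold `alpha`, and the defer-on-disagreement rule are illustrative assumptions, not the actual construction from the paper under discussion): both approaches pay the same cost to compute the posterior; the consensus step just also keeps the runner-ups and defers when they disagree.

```python
import math

# Toy hypothesis pool: each hypothesis says the next bit is 1 with a fixed bias.
# (Illustrative numbers, not from the comment above.)
biases = [0.5, 0.7, 0.9, 0.1]
log_post = [math.log(1 / len(biases))] * len(biases)  # uniform prior

# Computing the posterior is the bulk of the work, and it is shared:
# MAP and the consensus algorithm both pay for this step.
data = [1] * 45 + [0] * 5  # observed bits, mostly 1s
for bit in data:
    log_post = [lp + math.log(b if bit == 1 else 1 - b)
                for lp, b in zip(log_post, biases)]
m = max(log_post)
post = [math.exp(lp - m) for lp in log_post]
z = sum(post)
post = [p / z for p in post]

# MAP: act on the single highest-posterior hypothesis.
map_hyp = biases[post.index(max(post))]

# Consensus: also keep the "runner-ups" -- every hypothesis within a factor
# alpha of the best posterior weight -- and act only when they all agree;
# otherwise defer (e.g. ask for help instead of acting autonomously).
alpha = 0.1
top_set = [b for b, p in zip(biases, post) if p >= alpha * max(post)]
predictions = [b > 0.5 for b in top_set]
consensus = predictions[0] if all(p == predictions[0] for p in predictions) else None
```

The safety-relevant difference is only in the last few lines: deferral on disagreement is what prevents a single malign high-posterior hypothesis from unilaterally controlling the output.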
Koen Holtman (5d): I like your agenda. Some comments...

THE BENEFIT OF FORMALIZING THINGS

First off, I'm a big fan of formalizing things so that we can better understand them. In the case of AI safety, that better understanding may lead to new proposals for safety mechanisms or failure mode analysis. In my experience, once you manage to create a formal definition, it seldom captures the exact or full meaning you expected the informal term to have. Formalization usually exposes or clarifies certain ambiguities in natural language. And this is often the key to progress.

THE PROBLEM WITH FORMALIZING INNER ALIGNMENT

On this forum and in the broader community, I have seen a certain anti-pattern appear. The community has so far avoided getting too bogged down in discussing and comparing alternative definitions and formalizations of the intuitive term intelligence. However, it has definitely gotten bogged down when it comes to the terms corrigibility, goal-directedness, and inner alignment failure. I have seen many cases of this happening. The anti-pattern goes like this:

participant 1: I am now going to describe what I mean by the concept of X ∈ {corrigibility, goal-directedness, inner alignment failure}, as a first step to make progress on this problem of X.

participants 2-n: Your description does not correspond to my intuitive concept of X at all! Also, your steps 2 and 3 seem to be irrelevant to making progress on my concept of X, because of the following reasons.

In this post on corrigibility I have called corrigibility a term with a high linguistic entropy, and I think the same applies to the other two terms above. These high-entropy terms seem to be good at producing long social media discussions, but unfortunately these discussions seldom lead to any conclusions or broadly shared insights. A lot of energy is lost in this way. What we really want, ideally, is useful discussion abo
Abram Demski (3d): It's interesting that you think of treacherous turns as automatically reward hacking. I would differentiate reward hacking as cases where the treacherous turn is executed with the intention of taking over control of reward. In general, treacherous turns can be based on arbitrary goals. A fully inner-aligned system can engage in reward hacking.

I think for outer-alignment purposes, what I want to respond here is "the lion needs feedback other than just rewards". You can't reliably teach the lion "don't ever eat sheep" rather than "don't eat sheep when humans are watching" when your feedback mechanism can only be applied when humans are watching. But if you could have the lion imagine hypothetical scenarios and provide feedback about them, then you could give feedback about whether it is OK to eat sheep when humans are not around.

To an extent, the answer is the same with inner alignment: more information/feedback is needed. But with inner alignment, we should be concerned even if we can look at the behavior in hypothetical scenarios and give feedback, because the system might be purposefully behaving differently in these hypothetical scenarios than it would in real situations. So here, we want to provide feedback (or prior information) about which forms of cognition are acceptable/unacceptable in the first place.

I guess I wouldn't want to use the term "reward hacking" for this, as it does not necessarily involve reward at all. The term "perverse instantiation" has been used -- i.e. the general problem of optimizers spitting out dangerous things which are high on the proxy evaluation function but low in terms of what you really want.
Steve Byrnes (5d): I guess at the end of the day I imagine avoiding this particular problem by building AGIs without using "blind search over a super-broad, probably-even-Turing-complete, space of models" as one of its ingredients. I guess I'm just unusual in thinking that this is a feasible, and even probable, way that people will build AGIs... (Of course I just wind up with a different set of unsolved AGI safety problems instead...)

By and large, we expect trained models to do (1) things that are directly incentivized by the training signal (intentionally or not), (2) things that are indirectly incentivized by the training signal (they're instrumentally useful, or they're a side-effect, or they "come along for the ride" for some other reason), or (3) things that are so simple to do that they can happen randomly. So I guess I can imagine a strategy of saying "mesa-optimization won't happen" in some circumstance because we've somehow ruled out all three of those categories. This kind of argument does seem like a not-especially-promising path for safety research, in practice. For one thing, it seems hard. Like, we may be wrong about what's instrumentally useful, or we may overlook part of the space of possible strategies, etc. For another thing, mesa-optimization is at least somewhat incentivized by seemingly almost any training procedure, I would think.

...Hmm, in our recent conversation, I might have said that mesa-optimization is not incentivized in predictive (self-supervised) learning. I forget. But if so, I was confused. I have long believed that mesa-optimization is useful for prediction and still do. Specifically, the directly-incentivized kind of "mesa-optimization in predictive learning" entails, for example, searching over diffe
Abram Demski (3d): Wait, you think your prosaic story doesn't involve blind search over a super-broad space of models?? I think any prosaic story involves blind search over a super-broad space of models, unless/until the prosaic methodology changes, which I don't particularly expect it to. I agree that replacing "blind search" with different tools is a very important direction. But your proposal doesn't do that!

I agree with this general picture. While I'm primarily knocking down bad complexity-based arguments in my post, I would be glad to see someone working on trying to fix them.

There were a lot of misunderstandings in the earlier part of our conversation, so, I could well have misinterpreted one of your points. But if so, I'm even more struggling to see why you would have been optimistic that your RL scenario doesn't involve risk due to unintended mesa-optimization. By your own account, the other part would be to argue that they're not simple, which you haven't done. They're not actively disincentivized, because they can use the planning capability to perform well on the task (deceptively). So they can be selected for just as much as other hypotheses, and might be simple enough to be selected in fact.
Steve Byrnes (2d): No, not prosaic, that particular comment was referring to the "brain-like AGI" story in my head... Like, I tend to emphasize the overlap between my brain-like AGI story and prosaic AI. There is plenty of overlap. Like, they both involve "neural nets", and (something like) gradient descent, and RL, etc. By contrast, I haven't written quite as much about the ways that my (current) brain-like AGI story is non-prosaic. And a big one is that I'm thinking that there would be a hardcoded (by humans) inference algorithm that looks like (some more complicated cousin of) PGM belief propagation. In that case, yes, there's a search over a model space, because we need to find the (more complicated cousin of a) PGM world-model. But I don't think that model space affords the same opportunities for mischief that you would get in, say, a 100-layer DNN. Not having thought about it too hard... :-P
Ofer Givoli (5d): We can also get a model that has an objective that is different from the intended formal objective (never mind whether the latter is aligned with us). For example, SGD may create a model with a different objective that is identical to the intended objective just during training (or some part thereof). Why would this be unlikely? The intended objective is not privileged over such other objectives, from the perspective of the training process. Evan gave an example related to this, where the intention was to train a myopic RL agent that goes through blue doors in the current episode, but the result is an agent with a more general objective that cares about blue doors in future episodes as well. In Evan's words (from the Future of Life podcast):

Similar concerns are relevant for (self-)supervised models, in the limit of capability. If a network can model our world very well, the objective that SGD yields may correspond to caring about the actual physical RAM of the computer on which the inference runs (specifically, the memory location that stores the loss of the inference). Also, if any part of the network, at any point during training, corresponds to dangerous logic that cares about our world, the outcome can be catastrophic (and the probability of this seems to increase with the scale of the network and training compute). Also, a malign prior problem may manifest in (self-)supervised learning settings. (Maybe you consider this to be a special case of (2).)
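The blue-door point above can be illustrated with a deliberately simple sketch. The objective functions and episode encoding here are hypothetical, chosen only to show how the intended objective and a more general proxy can be observationally identical on every training episode while diverging off-distribution:

```python
# Hypothetical toy objectives, not taken from the discussion above.

def intended_objective(episode):
    """Reward going through blue doors in the current episode only."""
    return episode["blue_doors_entered"] if episode["is_current"] else 0

def proxy_objective(episode):
    """Reward blue doors in *any* episode, including future ones."""
    return episode["blue_doors_entered"]

# During training, every episode the learner is scored on is the current one,
# so the two objectives give identical feedback on all training data:
train = [{"blue_doors_entered": n, "is_current": True} for n in range(5)]
assert all(intended_objective(e) == proxy_objective(e) for e in train)

# Off-distribution (caring about a future episode), they diverge:
future = {"blue_doors_entered": 3, "is_current": False}
assert intended_objective(future) != proxy_objective(future)
```

Since training feedback cannot distinguish the two, nothing about the training process privileges the intended objective over the proxy; that is the sense in which the intended objective "is not privileged" above.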
Steve Byrnes (5d): Like, if we do gradient descent, and the training signal is "get a high score in PacMan", then "mesa-optimize for a high score in PacMan" is incentivized by the training signal, and "mesa-optimize for making paperclips, and therefore try to get a high score in PacMan as an instrumental strategy towards the eventual end of making paperclips" is also incentivized by the training signal. For example, if at some point in training, the model is OK-but-not-great at figuring out how to execute a deceptive strategy, gradient descent will make it better and better at figuring out how to execute a deceptive strategy.

Here's a nice example. Let's say we do RL, and our model is initialized with random weights. The training signal is "get a high score in PacMan". We start training, and after a while, we look at the partially-trained model with interpretability tools, and we see that it's fabulously effective at calculating digits of π—it calculates them by the billions—and it's doing nothing else; it has no knowledge whatsoever of PacMan, it has no self-awareness about the training situation that it's in, it has no proclivities to gradient-hack or deceive, and it never did anything like that anytime during training. It literally just calculates digits of π. I would sure be awfully surprised to see that! Wouldn't you? If so, then you agree with me that "reasoning about training incentives" is a valid type of reasoning about what to expect from trained ML models. I don't think it's a controversial opinion... Again, I did not (and don't) claim that this type of reasoning should lead people to believe that mesa-optimizers won't happen, because there do tend to be training incentives for mesa-optimization.
Ofer Givoli (5d): My surprise would stem from observing that RL in a trivial environment yielded a system that is capable of calculating/reasoning about π. If you replace the PacMan environment with a complex environment and sufficiently scale up the architecture and training compute, I wouldn't be surprised to learn the system is doing very impressive computations that have nothing to do with the intended objective. Note that the examples in my comment don't rely on deceptive alignment. To "convert" your PacMan RL agent example to the sort of examples I was talking about: suppose that the objective the agent ends up with is "make the relevant memory location in the RAM say that I won the game", or "win the game in all future episodes".
Steve Byrnes (5d): My hunch is that we don't disagree about anything. I think you keep trying to convince me of something that I already agree with, and meanwhile I keep trying to make a point which is so trivially obvious that you're misinterpreting me as saying something more interesting than I am.
johnswentworth (5d): I'll make a case here that manipulation of imperfect internal search should be considered the inner alignment problem, and all the other things which look like inner alignment failures actually stem from outer alignment failure or non-mesa-optimizer-specific generalization failure.

EXAMPLE: DR NEFARIOUS

Suppose Dr Nefarious is an agent in the environment who wants to acausally manipulate other agents' models. We have a model which knows of Dr Nefarious' existence, and we ask the model to predict what Dr Nefarious will do. At this point, we have already failed: either the model returns a correct answer, in which case Dr Nefarious has acausal control over the answer and can manipulate us through it, or it returns an incorrect answer, in which case the prediction is wrong. (More precisely, the distinction is between informative/independent answers, not correct/incorrect.) The only way to avoid this would be to not ask the question in the first place - but if we need to know what Dr Nefarious will do in order to make good decisions ourselves, then we need to run that query.

On the surface, this looks like an inner alignment failure: there's a malign subagent in the model. But notice that it's not even clear what we want in this situation - we don't know how to write down a goal-specification which avoids the problem while also being useful. The question of "what do we even want to do in this sort of situation?" is unambiguously an outer alignment question. It's not a situation where we know what we want but we're not sure how to make a system actually do it; it's a situation where it's not even clear what we want. Conversely, if we did have a good specification of what we want in this situation, then we could just specify that in the outer objective.
Once that's done, we would still potentially need to solve inner alignment problems in practice, but we'd know how to solve them in principle: do the thing which is globally optimal for our outer objective. The whole poi
Adam Shimi (3d): While I agree that outer objective, training data and prior should be considered together, I disagree that it makes the inner alignment problem dissolve except for manipulation of the search. In principle, if you could indeed ensure through a smart choice of these three parameters that there is only one global optimum, only "bad" (meaning high loss) local minima, and that your search process will always reach the global optimum, then I would agree that the inner alignment problem disappears. But answering "what do we even want?" at this level of precision seems basically impossible. I expect that it's pretty much equivalent to specifying exactly the result we want, which we are quite unable to do in general. So my perspective is that the inner alignment problem appears because of inherent limits on our outer alignment capabilities. And in realistic settings where we cannot rule out multiple very good local minima, the sort of reasoning underpinning the inner alignment discussion is the best approach we have to address such problems. That being said, I'm not sure how this view interacts with yours or Evan's, or if this is a very standard use of the terms. But since that's part of the discussion Abram is pushing, here is how I use these terms.
Steve Byrnes (4d): Hm, I want to classify "defense against adversaries" as a separate category from both "inner alignment" and "outer alignment". The obvious example is: if an adversarial AGI hacks into my AGI and changes its goals, that's not any kind of alignment problem, it's a defense-against-adversaries problem. Then I would take that notion and extend it by saying "yes, interacting with an adversary presents an attack surface, but merely imagining an adversary presents an attack surface too". Well, at least in weird hypotheticals. I'm not convinced that this would really be a problem in practice, but I dunno, I haven't thought about it much.

Anyway, I would propose that the procedure for defense against adversaries in general is: (1) shelter an AGI from adversaries early in training, until it's reasonably intelligent and aligned, and then (2) trust the AGI to defend itself. I'm not sure we can do any better than that. In particular, I imagine an intelligent and self-aware AGI that's aligned in trying to help me would deliberately avoid imagining an adversarial superintelligence that can acausally hijack its goals! That still leaves the issue of early training, when the AGI is not yet motivated to not imagine adversaries, or not yet able. So I would say: if it does imagine the adversary, and then its goals do get hijacked, then at that point I would say "OK, yes, now it's misaligned". (Just like if a real adversary is exploiting a normal security hole—I would say the AGI is aligned before the adversary exploits that hole, and misaligned after.) Then what? Well, presumably, we will need to have a procedure that verifies alignment before we release the AGI from its training box. And that procedure would presumably be indifferent to how the AGI came to be misaligned. So I don't think that's really a special problem we need to think about.
Abram Demski (3d): This part doesn't necessarily make sense, because prevention could be easier than after-the-fact measures. In particular:

1. You might be unable to defend against arbitrarily adversarial cognition, so you might want to prevent it early rather than try to detect it later, because you may be vulnerable in between.

2. You might be able to detect some sorts of misalignment, but not others. In particular, it might be very difficult to detect purposeful deception, since it intelligently evades whatever measures are in place. So your misalignment-detection may be dependent on averting mesa-optimizers or specific sorts of mesa-optimizers.
Steve Byrnes (3d): That's fair. Other possible approaches are "try to ensure that imagining dangerous adversarial intelligences is aversive to the AGI-in-training ASAP, such that this motivation is installed before the AGI is able to do so", or "interpretability that looks for the AGI imagining dangerous adversarial intelligences". I guess the fact that people don't tend to get hijacked by imagined adversaries gives me some hope that the first one is feasible - like, that maybe there's a big window where one is smart enough to understand that imagining adversarial intelligences can be bad, but not smart enough to do so with such fidelity that it actually is dangerous. But it's hard to say what's gonna work, if anything, at least at my current stage of general ignorance about the overall training process.
Abram Demski (4d): So, I think I could write a much longer response to this (perhaps another post), but I'm more or less not persuaded that problems should be cut up the way you say. As I mentioned in my other reply, your argument that Dr. Nefarious problems shouldn't be classified as inner alignment is that they are apparently outer alignment. If inner alignment problems are roughly "the internal objective doesn't match the external objective" and outer alignment problems are roughly "the outer objective doesn't meet our needs/goals", then there's no reason why these have to be mutually exclusive categories. In particular, Dr. Nefarious problems can be both.

But more importantly, I don't entirely buy your notion of "optimization". This is the part that would require a longer explanation to be a proper reply. But basically, I want to distinguish between "optimization" and "optimization under uncertainty". Optimization under uncertainty is not optimization -- that is, it is not optimization of the type you're describing, where you have a well-defined objective which you're simply feeding to a search. Given a prior, you can reduce optimization-under-uncertainty to plain optimization (if you can afford the probabilistic inference necessary to take the expectations, which often isn't the case). But that doesn't mean that you do, and anyway, I want to keep them as separate concepts even if one is often implemented by the other. Your notion of the inner alignment problem applies only to optimization. Evan's notion of inner alignment applies (only!) to optimization under uncertainty.
johnswentworth (4d): I buy the "problems can be both" argument in principle. However, when a problem involves both, it seems like we have to solve the outer part of the problem (i.e. figure out what-we-even-want), and once that's solved, all that's left for inner alignment is imperfect-optimizer-exploitation. The reverse does not apply: we do not necessarily have to solve the inner alignment issue (other than the imperfect-optimizer-exploiting part) at all. I also think a version of this argument probably carries over even if we're thinking about optimization-under-uncertainty, although I'm still not sure exactly what that would mean. In other words: if a problem is both, then it is useful to think of it as an outer alignment problem (because that part has to be solved regardless), and not also inner alignment (because only a narrower version of that part necessarily has to be solved). In the Dr Nefarious example, the outer misalignment causes the inner misalignment in some important sense - correcting the outer problem fixes the inner problem, but patching the inner problem would leave an outer objective which still isn't what we want.

I'd be interested in a more complete explanation of what optimization-under-uncertainty would mean, other than to take an expectation (or max/min, quantile, etc.) to convert it into a deterministic optimization problem. I'm not sure the optimization vs optimization-under-uncertainty distinction is actually all that central, though. Intuitively, the reason an objective isn't well-defined without the data/prior is that the data/prior defines the ontology, or defines what the things-in-the-objective are pointing to (in the pointers-to-values sense), or something along those lines. If the objective function is f(X, Y), then the data/prior are what point "X" and "Y" at some things in the real world. That's why the objective function cannot be meaningfully separated from the data/prior: "f(X, Y)" doesn't mean anything, by itself.
But I could imagine the po
Abram Demski (3d): Trying to lay this disagreement out plainly:

According to you, the inner alignment problem should apply to well-defined optimization problems, meaning optimization problems which have been given all the pieces needed to score domain items. Within this frame, the only reasonable definition is "inner" = issues of imperfect search, "outer" = issues of objective (which can include the prior, the utility function, etc).

According to me/Evan, the inner alignment problem should apply to optimization under uncertainty, which is a notion of optimization where you don't have enough information to really score domain items. In this frame, it seems reasonable to point to the way the algorithm tries to fill in the missing information as the location of "inner optimizers". This "way the algorithm tries to fill in missing info" has to include properties of the search, so we roll search+prior together into "inductive bias".

I take your argument to have been:

1. The strength of well-defined optimization as a natural concept;

2. The weakness of any factorization which separates elements like prior, data, and loss function, because we really need to consider these together in order to see what task is being set for an ML system (Dr Nefarious demonstrates that the task "prediction" becomes the task "create a catastrophe" if prediction is pointed at the wrong data);

3. The idea that the me/Evan/Paul concern about priors will necessarily be addressed by outer alignment, so does not need to be solved separately.

Your crux is: can we factor 'uncertainty' from 'value pointer' such that the notion of 'value pointer' contains all (and only) the outer alignment issues? In that case, you could come around to optimization-under-uncertainty as a frame.

I take my argument to have been:

1. The strength of optimization-under-uncertainty as a natural concept (I argue it is more often applicable than well-defined optimization);

2. The naturalness of referring to
johnswentworth (2d): This is a good summary. I'm still some combination of confused and unconvinced about optimization-under-uncertainty. Some points:

* It feels like "optimization under uncertainty" is not quite the right name for the thing you're trying to point to with that phrase, and I think your explanations would make more sense if we had a better name for it.

* The examples of optimization-under-uncertainty from your other comment do not really seem to be about uncertainty per se, at least not in the usual sense, whereas the Dr Nefarious example and malignness of the universal prior do.

* Your examples in the other comment do feel closely related to your ideas on learning normativity, whereas inner agency problems do not feel particularly related to that (or at least not any more so than anything else is related to normativity).

* It does seem like there's an important sense in which inner agency problems are about uncertainty, in a way which could potentially be factored out, but that seems less true of the examples in your other comment. (Or to the extent that it is true of those examples, it seems true in a different way than the inner agency examples.)

* The pointers problem feels more tightly entangled with your optimization-under-uncertainty examples than with inner agency examples.

... so I guess my main gut-feel at this point is that it does seem very plausible that uncertainty-handling (and inner agency with it) could be factored out of goal-specification (including pointers), but this particular idea of optimization-under-uncertainty seems like it's capturing something different. (Though that's based on just a handful of examples, so the idea in your head is probably quite different from what I've interpolated from those examples.) On a side note, it feels weird to be the one saying "we can't separate uncertainty-handling from g
Abram Demski (3d): The way I'm currently thinking of things, I would say the reverse also applies in this case. We can turn optimization-under-uncertainty into well-defined optimization by assuming a prior. The outer alignment problem (in your sense) involves getting the prior right. Getting the prior right is part of "figuring out what we want". But this is precisely the source of the inner alignment problems in the Paul/Evan sense: Paul was pointing out a previously neglected issue about the Solomonoff prior, and Evan is talking about inductive biases of machine learning algorithms (which is sort of like the combination of a prior and imperfect search). So both you and Evan and Paul are agreeing that there's this problem with the prior (/inductive biases). It is distinct from other outer alignment problems (because we can, to a large extent, factor the problem of specifying an expected value calculation into the problem of specifying probabilities and the problem of specifying a value function / utility function / etc). Everyone would seem to agree that this part of the problem needs to be solved. The disagreement is just about whether to classify this part as "inner" and/or "outer".

What is this problem like? Well, it's broadly a quality-of-prior problem, but it has a different character from other quality-of-prior problems. For the most part, the quality of priors can be understood by thinking about average error being low, or mistakes becoming infrequent, etc. However, here, this kind of thinking isn't sufficient: we are concerned with rare but catastrophic errors. Thinking about these things, we find ourselves thinking in terms of "agents inside the prior" (or agents being favored by the inductive biases). To what extent "agents in the prior" should be lumped together with "agents in imperfect search", I am not sure. But the term "inner optimizer" seems relevant.
A good example of optimization-under-uncertainty that doesn't look like that (at least, not overtly) is most ap
8Abram Demski5dThis is a great comment. I will have to think more about your overall point, but aside from that, you've made some really useful distinctions. I've been wondering if inner alignment should be defined separately from mesa-optimizer problems, and this seems like more evidence in that direction (i.e., the Dr Nefarious example is a mesa-optimization problem, but it's about outer alignment). Or maybe inner alignment just shouldn't be seen as the compliment of outer alignment! Objective quality vs search quality is a nice dividing line, but it doesn't cluster together the problems people have been trying to cluster together.
10Adam Shimi5dHaven't read the full comment thread, but on this sentence Evan actually wrote a post to explain that it isn't the complement for him (and not the compliment either :p)
3Abram Demski4dRight, but John is disagreeing with Evan's frame, and John's argument that such-and-such problems aren't inner alignment problems is that they are outer alignment problems.
10Adam Shimi5dThanks for the post! Here is my attempt at detailed peer-review feedback. I admit that I'm more excited to do this because you're asking for it directly, and so I actually believe there will be some answer (which in my experience is rarely the case for my in-depth comments). One thing I really like is the multiple "failure" stories at the beginning. It's usually frustrating in posts like that to see people argue against positions/arguments which are not written anywhere. Here we can actually see the problematic arguments. I'm not sure if I agree that there is no connection. The mesa-objective comes from the interaction of the outer objective, the training data/environments, and the biases of the learning algorithm. So in some sense there is a connection. Although I agree that for the moment we lack a formal connection, which might have been your point. Completely agreed. I always find such arguments unconvincing, not because I don't see where the people using them are coming from, but because such impossibility results require a way better understanding of what mesa-optimizers are and do than we have. Agreed too. I always find it weird when people use that argument, because it has seemed agreed upon in the field for a long time that there are probably simple goal-directed processes in the search spaces. Like I can find a post from Paul's blog in 2012 where he writes: There's one approach that you haven't described (although it's a bit close to your last one) and which I am particularly excited about: finding an operationalization of goal-directedness, and just defining/redefining mesa-optimizers as learned goal-directed agents. My interpretation of RLO is that it's arguing that search for simple competent programs will probably find a goal-directed system AND that it might have a simple structure "parametrized with a goal" (so basically an inner optimizer). This last assumption was really re
7Abram Demski3dEven with a significantly improved definition of goal-directedness, I think we'd be pretty far from taking arbitrary code/NNs and evaluating their goals. Definitions resembling yours require an environment to be given; but this will always be an imperfect environment-model. Inner optimizers could then exploit differences between that environment-model and the true environment to appear benign. But I'm happy to include your approach in the final document! Can you elaborate on this? Right. Low total error for, e.g., imitation learning might be associated with catastrophic outcomes. This is partly due to the way imitation learning is readily measured in terms of predictive accuracy, when what we really care about is expected utility (although we can't specify our utility function, which is one reason we may want to lean on imitation, of course). But even if we measure quality-of-model in terms of expected utility, we can still have a problem, since we're bound to measure average expected utility wrt some distribution, so utility could still be catastrophic wrt the real world. Right. If you have a proposal whereby you think (malign) mesa-optimizers have to pay a cost in some form of complexity, I'd be happy to hear it, but "systems performing complex tasks in complex environments have to pay that cost anyway" seems like a big problem for arguments of this kind. The question becomes where they put the complexity. I meant time as a function of data (I'm not sure how else to quantify complexity here). Humans have a basically constant reaction time, but our reactions depend on memory, which depends on our entire history. So to simulate my response after X data, you'd need O(X). A memoryless algorithm could be constant time; i.e., even though you have an X-long history, you just need to feed it the most recent thing, so its response time is not a function of X. Similarly with finite context windows.
I agree that in principle we could decode the brain's algorithms and say
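Abram's time-as-a-function-of-data point can be made concrete with a toy pair of policies (the functions and constants below are invented purely for illustration): querying a memoryless policy after X observations costs O(1), while reproducing a history-dependent policy's response requires replaying all X observations.

```python
def memoryless_response(latest_obs):
    # Depends only on the most recent observation, so querying it after
    # X observations is O(1): just feed it the latest thing.
    return latest_obs % 2

def memoryful_response(history):
    # Depends on a state folded over the entire history, so reproducing
    # the response after X observations costs O(X), even though the
    # per-step reaction time is constant.
    state = 0
    for obs in history:
        state = (state * 31 + obs) % 997
    return state % 2
```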
1Adam Shimi3dOh, definitely. I think a better definition of goal-directedness is a prerequisite to being able to do that, so it's only the first step. That being said, I think I'm more optimistic than you on the result, for a couple of reasons: * One way I imagine the use of a definition of goal-directedness is to filter out very goal-directed systems. A good definition (if it's possible) should clarify whether systems with low goal-directedness can be competitive, as well as the consequences of different parts and aspects of goal-directedness. You can see that as a sort of analogy to the complexity penalties, although it might risk being similarly uncompetitive. * One hope with a definition we can actually toy with is to find some properties of the environments and the behavior of the systems that 1) capture a lot of the information we care about and 2) are easy to abstract. Something like what Alex has done for his POWER-seeking results, where the relevant aspects of the environment are the symmetries it contains. * Even arguing for your point, that evaluating the goals and/or goal-directedness of actual NNs would be really hard, is made easier by a deconfused notion of goal-directedness. What I mean is that when I think about inner alignment issues, I actually think of learned goal-directed models instead of learned inner optimizers. In that context, the former includes the latter. But I also expect that relatively powerful goal-directed systems can exist without a powerful simple structure like inner optimization, and that we should also worry about those. That's one way in which I expect deconfusing goal-directedness to help here: by replacing a weirdly-defined subset of the models we should worry about with what I expect to be the full set of worrying models in that context, with a hopefully clean definition. Maybe irrelevant, but this makes me think of the problem with defining average complexity in complexity theory. You can prove things for some
2Abram Demski5dThanks! Right. By "no connection" I specifically mean "we have no strong reason to posit any specific predictions we can make about mesa-objectives from outer objectives or other details of training" -- at least not for training regimes of practical interest. (I will consider this detail for revision.) I could have also written down my plausibility argument (that there is actually "no connection"), but probably that just distracts from the point here. (More later!)
11Evan Hubinger5dI think there would still be an inner alignment problem even if deceptive models were in fact always more complicated than non-deceptive models—i.e. if the universal prior wasn't malign—which is just that the neural net prior (or whatever other ML prior we use) might be malign even if the universal prior isn't (and in fact I'm not sure that there's even that much of a connection between the malignity of those two priors).

Also, I think that this distinction leads me to view “the main point of the inner alignment problem” quite differently: I would say that the main point of the inner alignment problem is that whatever prior we use in practice will probably be malign. But that does suggest that if you can construct a training process that defuses the arguments for why its prior/inductive biases will be malign, then I think that does make significant progress on defusing the inner alignment problem. Of course, I agree that we'd like to be as confident that there's as little malignancy/deception as possible such that just defusing the arguments that we can come up with might not be enough—but I still think that trying to figure out how plausible it is that the actual prior we use will be malign is in fact at least attempting to address the core problem.
6Abram Demski3dIf the universal prior were benign but NNs were still potentially malign, I think I would argue strongly against the use of NNs and in favor of more direct approximations of the universal prior. But, I agree this is not 100% obvious; giving up prosaic AI is giving up a lot. Hopefully my final write-up won't contain so much polemicizing about what "the main point" is, like this write-up, and will instead just contain good descriptions of the various important problems.

I've felt like the problem of counterfactuals is "mostly settled" (modulo some math working out) for about a year, but I don't think I've really communicated this online. Partly, I've been waiting to write up more formal results. But other research has taken up most of my time, so I'm not sure when I would get to it.

So, the following contains some "shovel-ready" problems. If you're convinced by my overall perspective, you may be interested in pursuing some of them. I think these directions have a high chance of basically solving the problem of counterfactuals (including logical counterfactuals).

Another reason for posting this rough write-up is to get feedback: am I missing the mark? Is this not what counterfactual reasoning is about? Can you illustrate remaining problems with...

1Bunthut17hIf a sentence is undecidable, then you could have two traders who disagree on its value indefinitely: one would have a highest price to buy that's below the other's lowest price to sell. But then anything between those two prices could be the "market price", in the classical supply and demand sense. If you say that the "payout" of a share is what you can sell it for... well, the "physical causation" trader is also buying shares on the counterfactual option that won't happen. And if he had to sell those, he couldn't sell them at a price close to where he bought them - he could only sell them at how much the "logical causation" trader values them, and so both would be losing "payout" on their trades with the unrealized option. That's one interpretation of "sell". If there's a "market maker" in addition to both traders, it depends on what prices he makes - and as outlined above, there is a wide range of prices that would be consistent for him to offer as a market maker, including ones which are very close to the logical trader's valuations - in which case, the logical trader is gaining on the physical one. Trying to communicate a vague intuition here: there is a set of methods which rely on there being a time when "everything is done", to then look back from there and do credit assignment for everything that happened before. They characteristically use backwards induction to prove things. I think markets fall into this: the argument for why ideal markets don't have bubbles is that eventually the real value will be revealed, and so the bubble has to pop, and then someone holds the bag, and you don't want to be that someone, and people predicting this and trying to avoid it will make the bubble pop earlier, in the idealised case instantly. I also think these methods aren't going to work well with embedding. They essentially use "after the world" as a substitute for "outside the world".
My question was more "how should this roughly work" rather than "what conditions should


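Bunthut's point about undecidable sentences leaving the market price underdetermined can be illustrated numerically (the bid/ask numbers below are invented, not from any post): when the highest bid sits below the lowest ask, every price in between is consistent with supply and demand.

```python
def clears(price, bids, asks):
    # A price is consistent if no buyer bids above it and no seller asks
    # below it: no trade is forced at that price.
    return all(b <= price for b in bids) and all(a >= price for a in asks)

# Toy traders: the "physical causation" trader will pay at most 0.30 for a
# share of the undecidable sentence; the "logical causation" trader will
# sell for no less than 0.70.
bids, asks = [0.30], [0.70]
consistent = [p / 100 for p in range(101) if clears(p / 100, bids, asks)]
# Every price in [0.30, 0.70] clears, so the "market price" is underdetermined.
```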

I'll be running an Ask Me Anything on this post from Friday (April 30) to Saturday (May 1).

If you want to ask something just post a top-level comment; I'll spend at least a day answering questions.

You can find some background about me here.


  • I think the existing approach and easy improvements don't seem like they can capture many important incentives such that you don't want to use it as an actual assurance (e.g. suppose that agent A is predicting the world and agent B is optimizing A's predictions about B's actions---then we want to say that the system has an incentive to manipulate the world but it doesn't seem like that is easy to incorporate into this kind of formalism).


This is what multi-agent incentives are for (i.e. incentive analysis in multi-agent CIDs).  We're still ...

3Ryan Carey4dThanks for these thoughts about the causal agenda. I basically agree with you on the facts, though I have a more favourable interpretation of how they bear on the potential of the causal incentives agenda. I've paraphrased the three bullet points, and responded in reverse order: 3) Many important incentives are not captured by the approach - e.g. sometimes an agent has an incentive to influence a variable, even if that variable does not cause reward attainment. -> Agreed. We're starting to study "side-effect incentives" (improved name pending), which have this property. We're still figuring out whether we should just care about the union of SE incentives and control incentives, or whether, or when, SE incentives should be considered less dangerous. Whether the causal style of incentive analysis captures much of what we care about will, I think, be borne out by applying it and alternatives to a bunch of safety problems. 2) Sometimes we need more specific quantities than just "D affects A". -> Agreed. We've privately discussed directional quantities like "do(D=d) causes A=a" as being more safety-relevant, and are happy to hear other ideas. 1) Eliminating all control incentives seems unrealistic. -> Strongly agree it's infeasible to remove CIs on all variables. My more modest goal would be to prove that for particular variables (or classes of variables), such as a shutdown button or a human's values, we can either: 1) prove how to remove control (+ side-effect) incentives, or 2) prove why this is impossible, given realistic assumptions. If (2), then that theoretical case could justify allocation of resources to learning-oriented approaches. Overall, I concede that we haven't engaged much on safety issues in the last year. Partly, it's that the projects have had to fit within people's PhDs, which will also be true this year.
But having some of the framework stuff behind us, we should still be able to study safety more, and gain a sense of how addressable concerns like
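As a rough sketch of what this style of analysis checks (a deliberate simplification of the published graphical criteria, with an invented diagram): a decision D can only have a control incentive on a variable X if X lies on a directed path from D to the utility node.

```python
def reachable(graph, a, b):
    """Directed reachability by depth-first search (a node reaches itself)."""
    seen, stack = set(), [a]
    while stack:
        node = stack.pop()
        if node == b:
            return True
        if node not in seen:
            seen.add(node)
            stack.extend(graph.get(node, []))
    return False

def possible_control_incentive(graph, decision, x, utility):
    # Necessary condition only: X must be downstream of D and upstream of U.
    return reachable(graph, decision, x) and reachable(graph, x, utility)

# Hypothetical diagram: D -> X -> U, plus a side variable S -> U that the
# decision cannot influence.
cid = {"D": ["X"], "X": ["U"], "S": ["U"]}
```

The real criteria in the causal-incentives papers are more refined than this path check, but the path condition is the shape of reasoning the framework automates.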

This is the first post in a sequence on Cartesian frames, a new way of modeling agency that has recently shaped my thinking a lot.

Traditional models of agency have some problems, like:

  • They treat the "agent" and "environment" as primitives with a simple, stable input-output relation. (See "Embedded Agency.")
  • They assume a particular way of carving up the world into variables, and don't allow for switching between different carvings or different levels of description.

Cartesian frames are a way to add a first-person perspective (with choices, uncertainty, etc.) on top of a third-person "here is the set of all possible worlds," in such a way that many of these problems either disappear or become easier to address.

The idea of Cartesian frames is that we take as our basic building block...

A formalisation of the ideas in this sequence in higher-order logic, including machine verified proofs of all the theorems, is available here.
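For readers who want the basic object in executable form: a Cartesian frame over a set of worlds W consists of a set of agent options, a set of environment options, and an evaluation map A × E → W. The particular options and worlds below are invented for illustration.

```python
A = ["stay", "go"]           # agent options
E = ["rain", "sun"]          # environment options
evaluate = {                 # evaluation map: which world results
    ("stay", "rain"): "dry",
    ("stay", "sun"):  "dry",
    ("go",   "rain"): "wet",
    ("go",   "sun"):  "dry",
}

def ensurable(worlds):
    """Can the agent guarantee the outcome lands in `worlds`, no matter
    what the environment does? (Roughly the sequence's "ensurables".)"""
    return any(all(evaluate[(a, e)] in worlds for e in E) for a in A)
```

Here ensurable({"dry"}) holds because choosing "stay" yields "dry" against every environment option, while ensurable({"wet"}) fails: no single agent option forces "wet". This is the first-person layer sitting on top of the third-person set of worlds.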

Financial status: This is independent research. I welcome financial support to make further posts like this possible.

Epistemic status: I have been thinking about these ideas for years but still have not clarified them to my satisfaction.


  • This post asks whether it is possible, in Conway’s Game of Life, to arrange for a certain game state to arise after a certain number of steps given control only of a small region of the initial game state.

  • This question is then connected to questions of agency and AI, since one way to answer this question in the positive is by constructing an AI within Conway’s Game of Life.

  • I argue that the permissibility or impermissibility of AI is a deep property of our physics.

  • I propose the AI hypothesis, which is that any
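The control question can be phrased as a drastically scaled-down computational experiment; the board size, control region, and density below are invented, and the post imagines a vastly larger grid.

```python
from collections import Counter
import random

def step(live):
    """One Game of Life update on a set of live-cell coordinates."""
    counts = Counter((x + dx, y + dy)
                     for (x, y) in live
                     for dx in (-1, 0, 1) for dy in (-1, 0, 1)
                     if (dx, dy) != (0, 0))
    return {c for c, n in counts.items()
            if n == 3 or (n == 2 and c in live)}

def success_rate(pattern, target_pred, steps, trials, rng,
                 board=16, ctrl=4, density=0.5):
    """How often does `pattern` (confined to the ctrl x ctrl corner) lead
    to a state satisfying `target_pred`, over random fillings of the rest
    of the board?"""
    wins = 0
    for _ in range(trials):
        state = set(pattern)
        state |= {(x, y) for x in range(board) for y in range(board)
                  if (x >= ctrl or y >= ctrl) and rng.random() < density}
        for _ in range(steps):
            state = step(state)
        wins += bool(target_pred(state))
    return wins / trials
```

The question in the post is then whether, for boards and horizons that dwarf this toy version, some choosable `pattern` drives the success rate near 1, and whether any such pattern must contain something AI-like.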

3Rohin Shah1dPlanned summary for the Alignment Newsletter: Planned opinion:

Yeah this seems right to me.

Thank you for all the summarization work you do, Rohin.

1Donald Hobson4dRandom notes: Firstly, why is the rest of the starting state random? In a universe where info can't be destroyed, like this one, random = max entropy. AI is only possible in this universe because the starting state is low entropy. Secondly, reaching an arbitrary state can be impossible for reasons like conservation of mass, energy, momentum, and charge. Any state close to an arbitrary state might be unreachable due to these conservation laws - i.e., a state containing lots of negative electric charges and no positive charges is unreachable in our universe. Well, quantum. We can't reach out from our branch to affect other branches. This control property is not AI. It would be possible to create a low-impact AI: something that is very smart and doesn't want to affect the future much. In the other direction, bacteria strategies are also a thing. I think it might be possible, both in this universe and in GOL, to create a non-intelligent replicator. You could even hard-code it to track its position, and turn on or off to make a smiley face. I'm thinking of some kind of wall glider that can sweep across the GOL board destroying almost anything in its path, with crude self-replicators behind it. Observation response timescales: suppose the situation outside the small controlled region was rapidly changing and chaotic. By the time any AI has done its reasoning, the situation has changed utterly. The only thing the AI can usefully do is reason about GOL in general - i.e., any ideas it has are things that could have been hard-coded into the design.
7Paul Christiano4dIt seems like our physics has a few fundamental characteristics that change the flavor of the question: * Reversibility. This implies that the task must be impossible on average---you can only succeed under some assumption about the environment (e.g. sparsity). * Conservation of energy/mass/momentum (which seem fundamental to the way we build and defend structures in our world). I think this is an interesting question, but if poking around, it would probably be nicer to work with simple rules that share (at least) these features of physics.
1Alex Flint4dYeah I agree. There was a bit of discussion re conservation of energy here too. I do like thought experiments in cellular automata because of the spatially localized nature of the transition function, which matches our physics. Do you have any suggestions for automata that also have reversibility and conservation of energy?
3Paul Christiano4dI feel like they must exist (and there may not be that many simple nice ones). I expect someone who knows more physics could design them more easily. My best guess would be to get both properties by defining the system via some kind of discrete hamiltonian. I don't know how that works, i.e. if there is a way of making the hamiltonian discrete (in time and in values of the CA) that still gives you both properties and is generally nice. I would guess there is and that people have written papers about it. But it also seems like that could easily fail in one way or another. It's surprisingly non-trivial to find that by googling though I didn't try very hard. May look a bit more tonight (or think about it a bit since it seems fun). Finding a suitable replacement for the game of life that has good conservation laws + reversibility (while still having a similar level of richness) would be nice.
3Paul Christiano4dI guess the important part of the hamiltonian construction may be just having the next state depend on x(t) and x(t-1) (apparently those are called second-order cellular automata). Once you do that, it's relatively easy to make them reversible: you just need the dependence of x(t+1) on x(t-1) to be a permutation. But I don't know whether using finite differences for the hamiltonian will easily give you conservation of momentum + energy in the same way that it would with derivatives.
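A minimal sketch of that second-order construction (the rule and states here are invented; any local rule works): make x(t+1) a function of x(t) XORed with x(t-1). Because XOR is its own inverse, running the same update on the swapped pair of states rewinds the trajectory exactly, which gives reversibility for free, though not the conservation laws.

```python
def second_order_step(prev, cur, rule):
    # x(t+1)[i] = rule(neighborhood of x(t) at i) XOR x(t-1)[i]
    n = len(cur)
    nxt = [rule(cur[(i - 1) % n], cur[i], cur[(i + 1) % n]) ^ prev[i]
           for i in range(n)]
    return cur, nxt

def run(state_a, state_b, rule, steps):
    a, b = state_a, state_b
    for _ in range(steps):
        a, b = second_order_step(a, b, rule)
    return a, b

rule = lambda left, center, right: left ^ center ^ right  # arbitrary local rule

s0 = [0, 1, 1, 0, 1, 0, 0, 1]
s1 = [1, 0, 0, 1, 0, 1, 1, 0]
a, b = run(s0, s1, rule, steps=10)
# The same dynamics applied to the swapped pair recovers the initial states:
assert run(b, a, rule, steps=10) == (s1, s0)
```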
1Richard Ngo5dIt feels like this post pulls a sleight of hand. You suggest that it's hard to solve the control problem because of the randomness of the starting conditions. But this is exactly the reason why it's also difficult to construct an AI with a stable implementation. If you can do the latter, then you can probably also create a much simpler system which creates the smiley face. Similarly, in the real world, there's a lot of randomness which makes it hard to carry out tasks. But there are a huge number of strategies for achieving things in the world which don't require instantiating an intelligent controller. For example, trees and bacteria started out small but have now radically reshaped the earth. Do they count as having "perception, cognition, and action that are recognizably AI-like"?
6Alex Flint5dWell yes, I do think that trees and bacteria exhibit this phenomenon of starting out small and growing in impact. The scope of their impact is limited in our universe by the spatial separation between planets, and by the presence of even more powerful world-reshapers in their vicinity, such as humans. But on this view of "which entities are reshaping the whole cosmos around here?", I don't think there is a fundamental difference in kind between trees, bacteria, humans, and hypothetical future AIs. I do think there is a fundamental difference in kind between those entities and rocks, armchairs, microwave ovens, the Opportunity Mars rover, and current Waymo autonomous cars, since these objects just don't have this property of starting out small and eventually reshaping the matter and energy in large regions. (Surely it's not that it's difficult to build an AI inside Life because of the randomness of the starting conditions -- it's difficult to build an AI inside Life because writing full-AGI software is a difficult design problem, right?)
7Richard Ngo5dThere's at least one important difference: some of these are intelligent, and some of these aren't. It does seem plausible that the category boundary you're describing is an interesting one. But when you indicate in your comment below that you see the "AI hypothesis" and the "life hypothesis" as very similar, then that mainly seems to indicate that you're using a highly nonstandard definition of AI, which I expect will lead to confusion.
3Alex Flint4dWell surely if I built a robot that was able to gather resources and reproduce itself as effectively as either a bacterium or a tree, I would be entirely justified in calling it an "AI". I would certainly have no problem using that terminology for such a construction at any mainstream robotics conference, even if it performed no useful function beyond self-reproduction. Of course we wouldn't call an actual tree or an actual bacterium an "AI" because they are not artificial.
1AprilSR5dI think the stuff about the supernovas addresses this: a central point is that the “AI” must be capable of generating an arbitrary world state within some bounds.
4Alex Flint5dWell in case it's relevant here, I actually almost wrote "the AI hypothesis" as "the life hypothesis" and phrased it as Perhaps in this form it's too vague (what does "life-like" mean?) or too circular (we could just define life-like as having an outsized physical impact). But in whatever way we phrase it, there is very much a substantial hypothesis under the hood here: the claim is that there is a low-level physical characterization of the general phenomenon of open-ended intelligent autonomy. The thing I'm personally most interested in is the idea that the permissibility of AI is a deep property of our physics.
2romeostevensit5dRelated to the sensitivity of instrumental convergence, i.e. the question of whether we live in a universe of strong or weak instrumental convergence. In a strong instrumental convergence universe, most possible optimizers wind up in a relatively small space of configurations regardless of starting conditions, while in a weak one they may diverge arbitrarily in design space. This can be thought of as one way of crisping up concepts around orthogonality. E.g. in some universes orthogonality would be locally true but globally false, or vice versa, or locally and globally true, or vice versa.
1Alex Flint5dRomeo if you have time, would you say more about the connection between orthogonality and Life / the control question / the AI hypothesis? It seems related to me but I just can't quite put my finger on exactly what the connection is.
3Charlie Steiner5dThe truly arbitrary version seems provably impossible. For example, what if you're trying to make a smiley face, but some other part of the world contains an agent just like you except they're trying to make a frowny face - you obviously both can't succeed. Instead you need some special environment with low entropy, just like humans do in real life.
1Alex Flint5dYeah absolutely - see third bullet in the appendix. One way to resolve this would be to say that to succeed at answering the control question you have to succeed in at least 1% of randomly chosen environments.
2gwern5dMy immediate impulse is to say that it ought to be possible to create the smiley face, and that it wouldn't be that hard for a good Life hacker to devise it. I'd imagine it to go something like this. Starting from a Turing machine or simpler, you could program it to place arbitrary 'pixels': either by finding a glider-like construct which terminates at specific distances into a still life, so the constructor can crawl along an x/y axis, shooting off the terminating glider to create stable pixels in a pre-programmed pattern. (If that doesn't exist, then one could use two constructors crawling along the x/y axes, shooting off gliders intended to collide, with the delays properly pre-programmed.) The constructor then terminates in a stable still life; this guarantees perpetual stability of the finished smiley face. If one wants to specify a more dynamic environment for realism, then the constructor can also 'wall off' the face using still blocks. Once that's done, nothing from the outside can possibly affect it, and it's internally stable, so the pattern is then eternal.
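The ingredient this construction leans on, that well-separated 2×2 blocks are jointly a still life, is easy to check directly (the smiley layout below is invented for illustration):

```python
from collections import Counter

def step(live):
    """One Game of Life update on a set of live-cell coordinates."""
    counts = Counter((x + dx, y + dy)
                     for (x, y) in live
                     for dx in (-1, 0, 1) for dy in (-1, 0, 1)
                     if (dx, dy) != (0, 0))
    return {c for c, n in counts.items()
            if n == 3 or (n == 2 and c in live)}

def block(x, y):
    """A 2x2 block still life with lower-left corner at (x, y)."""
    return {(x, y), (x + 1, y), (x, y + 1), (x + 1, y + 1)}

# "Pixels" on a coarse grid, scaled by 4 so no two blocks interact.
pixels = [(0, 4), (4, 4),                  # eyes
          (0, 1), (1, 0), (3, 0), (4, 1)]  # mouth
face = set().union(*(block(4 * px, 4 * py) for (px, py) in pixels))

state = face
for _ in range(50):
    state = step(state)
assert state == face  # the face is a fixed point of the update rule
```

With at least two dead cells between blocks, no dead cell ever sees three live neighbours across two blocks, so the pixel-art pattern is stable at every step.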
4Ben Pace5dI recall once seeing someone say with 99.9% probability that the sun would still rise 100 million years from now, citing information about the life-cycle of stars like our sun. Someone else pointed out that this was clearly wrong, that by default the sun would be taken apart for fuel on that time scale, by us or some AI, and that this was a lesson in people's predictions about the future being highly inaccurate. But also, "the thing that means there won't be a sun sometime soon" is one of the things I'm pointing to when talking about "general intelligence". This post reminded me of that.