I'll make a case here that manipulation of imperfect internal search should be considered the inner alignment problem, and that all the other things which look like inner alignment failures actually stem from outer alignment failure or from non-mesa-optimizer-specific generalization failure.
Suppose Dr Nefarious is an agent in the environment who wants to acausally manipulate other agents' models. We have a model which knows of Dr Nefarious' existence, and we ask the model to predict what Dr Nefarious will do. At this point, we have already failed: either the model returns a correct answer, in which case Dr Nefarious has acausal control over the answer and can manipulate us through it, or it returns an incorrect answer, in which case the prediction is wrong. (More precisely, the distinction is between informative/independent answers, not correct/incorrect.) The only way to avoid this would be to not ask the question in the first place - but if we need to know what Dr Nefarious will do in order to make good decisions ourselves, then we need to run that query.
On the surface, this looks like an inner alignment failure: there's a malign subagent in the model. But notice that it's not even clear what we want in this situation - we don't know how to write down a goal-specification which avoids the problem while also being useful. The question of "what do we even want to do in this sort of situation?" is unambiguously an outer alignment question. It's not a situation where we know what we want but we're not sure how to make a system actually do it; it's a situation where it's not even clear what we want.
Conversely, if we did have a good specification of what we want in this situation, then we could just specify that in the outer objective. Once that's done, we would still potentially need to solve inner alignment problems in practice, but we'd know how to solve them in principle: do the thing which is globally optimal for our outer objective. The whole point of "having a good specification of what we want" is that the globally-optimal thing should be good.
Point of all this: this supposed "inner alignment failure" can be broken into two parts. One of those parts is a "what do we even want?" question, i.e. an outer alignment problem. The other part is a problem of actually achieving the optimal thing, which is where manipulation of imperfect internal search is relevant. If both of those parts are solved, then the system is aligned.
Another example, this time with explicit acausal trade: our AI uses a Solomonoff-like world model, and a subagent in that model is trying to gain influence. Meanwhile, an (unrelated) nefarious agent in the environment wants to manipulate the AI. So, the subagent and the nefarious agent simulate each other and make an acausal deal: the nefarious agent produces a very specific string of bits in the real world, and the subagent gains weight by perfectly predicting that string. In exchange, the subagent manipulates the AI to help the nefarious agent in some way.
Self-fulfilling prophecies provide a similar, but simpler, class of examples.
In each of these, there is a malign inner agent, but that malign inner agent is only able to manipulate the AI successfully because of some structure in the environment. Or, another way to state it: the malign agent is successful only because the combination of (outer) prior + objective does not handle self-fulfilling prophecies or acausal trade the way we (humans) want them to. These are, in an important sense, outer alignment problems: we have not correctly specified what-we-want; even the global optimum of the outer process suffers from the problem.
One possible objection to this is that "outer alignment" - i.e. specifying what-humans-want - should be more narrowly interpreted. In particular, Evan has argued before that generalization errors resulting from e.g. distribution shift between training data and deployment environment should be considered a separate problem.
I disagree with this. I claim that an objective isn't even well-defined without a distribution; that's part of the type-signature of an objective.
This is easy to see in the case of an expected utility maximizer. When we write "max E[u(X)]", X is a variable in the probabilistic model. It is a thing-in-the-model, not a thing-in-the-world; the world does not necessarily share our ontology.
We could say something similar for any setup which maximizes some expected value on an empirical distribution, i.e. an average over the training data. For instance, maybe we have some labeled images, and we're training a classifier. We may have an objective for which the system does-what-we-want for the original labels, but does not do what we want if we change the objective function to permute the labels before calculating error (i.e. it switches "true" with "false"). Permuting the labels in the objective function is obviously an outer alignment problem - yet we can achieve exactly the same effect by permuting the labels in the dataset instead.
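To make the equivalence concrete, here is a minimal toy sketch (my own illustrative example, not from the original comment): flipping the labels inside the objective function and flipping them in the dataset produce exactly the same loss for any candidate model, so "bad objective" and "bad data" are interchangeable here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy binary classification data (purely illustrative).
X = rng.normal(size=(100, 5))
y = (X[:, 0] > 0).astype(float)

def mse_loss(predict, X, labels):
    """Mean squared error of a predictor against the given labels."""
    return np.mean((predict(X) - labels) ** 2)

predict = lambda X: (X[:, 1] > 0).astype(float)  # some candidate classifier

# "Bad objective": the loss function permutes (flips) the labels before scoring.
loss_bad_objective = mse_loss(predict, X, 1 - y)

# "Bad data": the dataset's labels are flipped, the loss function is untouched.
y_flipped = 1 - y
loss_bad_data = mse_loss(predict, X, y_flipped)

assert loss_bad_objective == loss_bad_data  # same training signal either way
```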
Another angle: plenty of ML work uses the exact same objective on different data sets, and the resulting systems obviously do completely different things. There is no useful sense in which a training objective can be aligned or misaligned, separate from the context of data/prior.
My point is: there is no line between bad training data and bad objective. These problems only make sense when considered together. So, if "bad training objective" is an outer alignment problem, then we also need to consider "bad training data" to be an outer alignment problem in order for our factorization of the problem to work well. (For Bayesian agents, this also extends to the prior.)
Outer objective, training data, and prior (if any) all have to be treated as a unit: changes in one are equivalent to changes in another, and the objective isn't even well-defined outside the ontology of the data/prior. The central outer alignment question of "what do we even want?" has to be answered with both an objective and data/prior, in order for the answer to be well-defined.
If we buy that, then outer alignment (i.e. fully answering the question "what do we want?") implies that the true global optimum of an outer optimizer's search is aligned. So, there's only one type of inner alignment problem which would not be solved by solving outer alignment: manipulation of imperfect search. We can have a good objective+prior+data, but the search may still be imperfect, and malign subagents may arise which manipulate that search.
All that said... there's still an interesting alignment problem which examples like Dr Nefarious or self-fulfilling prophecies or the malignness of the Solomonoff prior are pointing to. I claim that inner alignment is not the right way to think about these - it's not the malign inner agents themselves which are the problem. They're just an indicator that we have not correctly specified what-we-want.
This is a great comment. I will have to think more about your overall point, but aside from that, you've made some really useful distinctions. I've been wondering if inner alignment should be defined separately from mesa-optimizer problems, and this seems like more evidence in that direction (i.e., the Dr Nefarious example is a mesa-optimization problem, but it's about outer alignment). Or maybe inner alignment just shouldn't be seen as the complement of outer alignment! Objective quality vs search quality is a nice dividing line, but it doesn't cluster together the problems people have been trying to cluster together.
Right, but John is disagreeing with Evan's frame, and John's argument that such-and-such problems aren't inner alignment problems is that they are outer alignment problems.
So, I think I could write a much longer response to this (perhaps another post), but I'm more or less not persuaded that problems should be cut up the way you say.
As I mentioned in my other reply, your argument that Dr. Nefarious problems shouldn't be classified as inner alignment is that they are apparently outer alignment. If inner alignment problems are roughly "the internal objective doesn't match the external objective" and outer alignment problems are roughly "the outer objective doesn't meet our needs/goals", then there's no reason why these have to be mutually exclusive categories.
In particular, Dr. Nefarious problems can be both.
But more importantly, I don't entirely buy your notion of "optimization". This is the part that would require a longer explanation to be a proper reply. But basically, I want to distinguish between "optimization" and "optimization under uncertainty". Optimization under uncertainty is not optimization -- that is, it is not optimization of the type you're describing, where you have a well-defined objective which you're simply feeding to a search. Given a prior, you can reduce optimization-under-uncertainty to plain optimization (if you can afford the probabilistic inference necessary to take the expectations, which often isn't the case). But that doesn't mean that you do, and anyway, I want to keep them as separate concepts even if one is often implemented by the other.
Your notion of the inner alignment problem applies only to optimization.
Evan's notion of inner alignment applies (only!) to optimization under uncertainty.
I buy the "problems can be both" argument in principle. However, when a problem involves both, it seems like we have to solve the outer part of the problem (i.e. figure out what-we-even-want), and once that's solved, all that's left for inner alignment is imperfect-optimizer-exploitation. The reverse does not apply: we do not necessarily have to solve the inner alignment issue (other than the imperfect-optimizer-exploiting part) at all. I also think a version of this argument probably carries over even if we're thinking about optimization-under-uncertainty, although I'm still not sure exactly what that would mean.
In other words: if a problem is both, then it is useful to think of it as an outer alignment problem (because that part has to be solved regardless), and not also as an inner alignment problem (because only a narrower version of that part necessarily has to be solved). In the Dr Nefarious example, the outer misalignment causes the inner misalignment in some important sense - correcting the outer problem fixes the inner problem, but patching the inner problem would leave an outer objective which still isn't what we want.
I'd be interested in a more complete explanation of what optimization-under-uncertainty would mean, other than to take an expectation (or max/min, quantile, etc) to convert it into a deterministic optimization problem.
I'm not sure the optimization vs optimization-under-uncertainty distinction is actually all that central, though. Intuitively, the reason an objective isn't well-defined without the data/prior is that the data/prior defines the ontology, or defines what the things-in-the-objective are pointing to (in the pointers-to-values sense) or something along those lines. If the objective function is f(X, Y), then the data/prior are what point "X" and "Y" at some things in the real world. That's why the objective function cannot be meaningfully separated from the data/prior: "f(X, Y)" doesn't mean anything, by itself.
But I could imagine the pointer-aspect of the data/prior could somehow be separated from the uncertainty-aspect. Obviously this would require a very different paradigm from either today's ML or Bayesianism, but if those pieces could be separated, then I could imagine a notion of inner alignment (and possibly also something like robust generalization) which talks about both optimization and uncertainty, plus a notion of outer alignment which just talks about the objective and what it points to. In some ways, I actually like that formulation better, although I'm not clear on exactly what it would mean.
Trying to lay this disagreement out plainly:
According to you, the inner alignment problem should apply to well-defined optimization problems, meaning optimization problems which have been given all the pieces needed to score domain items. Within this frame, the only reasonable definition is "inner" = issues of imperfect search, "outer" = issues of objective (which can include the prior, the utility function, etc).
According to me/Evan, the inner alignment problem should apply to optimization under uncertainty, which is a notion of optimization where you don't have enough information to really score domain items. In this frame, it seems reasonable to point to the way the algorithm tries to fill in the missing information as the location of "inner optimizers". This "way the algorithm tries to fill in missing info" has to include properties of the search, so we roll search+prior together into "inductive bias".
I take your argument to have been:
Your crux is, can we factor 'uncertainty' from 'value pointer' such that the notion of 'value pointer' contains all (and only) the outer alignment issues? In that case, you could come around to optimization-under-uncertainty as a frame.
I take my argument to have been:
My crux would be, does a solution to outer alignment (in the intuitive sense) really imply a solution to exorcising mesa-optimizers from a prior (in the sense relevant to eliminating them from perfect search)?
It might also help if I point out that well-defined-optimization vs optimization-under-uncertainty is my current version of the selection/control distinction.
In any case, I'm pretty won over by the uncertainty/pointer distinction. I think it's similar to the capabilities/payload distinction Jessica has mentioned. This combines search and uncertainty (and any other generically useful optimization strategies) into the capabilities.
But I would clarify that, wrt the 'capabilities' element, there seem to be mundane capabilities questions and then inner optimizer questions. IE, we might broadly define "inner alignment" to include all questions about how to point 'capabilities' at 'payload', but if so, I currently think there's a special subset of 'inner alignment' which is about mesa-optimizers. (Evan uses the term 'inner alignment' for mesa-optimizer problems, and 'objective-robustness' for broader issues of reliably pursuing goals, but he also uses the term 'capability robustness', suggesting he's not lumping all of the capabilities questions under 'objective robustness'.)
This is a good summary.
I'm still some combination of confused and unconvinced about optimization-under-uncertainty. Some points:
... so I guess my main gut-feel at this point is that it does seem very plausible that uncertainty-handling (and inner agency with it) could be factored out of goal-specification (including pointers), but this particular idea of optimization-under-uncertainty seems like it's capturing something different. (Though that's based on just a handful of examples, so the idea in your head is probably quite different from what I've interpolated from those examples.)
On a side note, it feels weird to be the one saying "we can't separate uncertainty-handling from goals" and you saying "ok but it seems like goals and uncertainty could somehow be factored". Usually I expect you to be the one saying uncertainty can't be separated from goals, and me to say the opposite.
- Your examples in the other comment do feel closely related to your ideas on learning normativity, whereas inner agency problems do not feel particularly related to that (or at least not any more so than anything else is related to normativity).
Could you elaborate on that? I do think that learning-normativity is more about outer alignment. However, some ideas might cross-apply.
- It feels like "optimization under uncertainty" is not quite the right name for the thing you're trying to point to with that phrase, and I think your explanations would make more sense if we had a better name for it.
Well, it still seems like a good name to me, so I'm curious what you are thinking here. What name would communicate better?
- It does seem like there's an important sense in which inner agency problems are about uncertainty, in a way which could potentially be factored out, but that seems less true of the examples in your other comment. (Or to the extent that it is true of those examples, it seems true in a different way than the inner agency examples.)
Again, I need more unpacking to be able to say much (or update much).
- The pointers problem feels more tightly entangled with your optimization-under-uncertainty examples than with inner agency examples.
Well, optimization-under-uncertainty is an attempt to make a frame which can contain both, so this isn't necessarily a problem... but I am curious what feels non-tight about inner agency.
... so I guess my main gut-feel at this point is that it does seem very plausible that uncertainty-handling (and inner agency with it) could be factored out of goal-specification (including pointers), but this particular idea of optimization-under-uncertainty seems like it's capturing something different. (Though that's based on just a handful of examples, so the idea in your head is probably quite different from what I've interpolated from those examples.)
On a side note, it feels weird to be the one saying "we can't separate uncertainty-handling from goals" and you saying "ok but it seems like goals and uncertainty could somehow be factored". Usually I expect you to be the one saying uncertainty can't be separated from goals, and me to say the opposite.
I still agree with the hypothetical me making the opposite point ;p The problem is that certain things are being conflated, so both "uncertainty can't be separated from goals" and "uncertainty can be separated from goals" have true interpretations. (I have those interpretations clear in my head, but communication is hard.)
OK, so.
My sense of our remaining disagreement...
We agree that the pointers/uncertainty could be factored (at least informally -- currently waiting on any formalism).
You think "optimization under uncertainty" is doing something different, and I think it's doing something close.
Specifically, I think "optimization under uncertainty" importantly is not necessarily best understood as the standard Bayesian thing where we (1) start with a utility function, (2) provide a prior, so that we can evaluate expected value (and 2.5, update on any evidence), (3) provide a search method, so that we solve the whole thing by searching for the highest-expectation element. Many examples of optimization-under-uncertainty strain this model. Probably the pointer/uncertainty model would do a better job in these cases. But, the Bayesian model is kind of the only one we have, so we can use it provisionally. And when we do so, the approximation of pointer-vs-uncertainty that comes out is:
Pointer: The utility function.
Uncertainty: The search plus the prior, which in practice can blend together into "inductive bias".
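As a gloss on that (1)-(3) decomposition, here is a minimal toy sketch (my own illustration, not anything from the thread), with the provisional pointer/uncertainty split marked in the comments:

```python
# (1) Pointer: the utility function over outcomes.
def utility(outcome):
    return outcome  # placeholder: outcomes here are already numeric scores

# (2) Uncertainty, part one: a prior over hypotheses about how actions map to outcomes.
hypotheses = {
    "h1": lambda action: action * 1.0,
    "h2": lambda action: 2.0 - action,
}
prior = {"h1": 0.7, "h2": 0.3}
# (2.5) Updating on evidence would reweight `prior` here (omitted).

def expected_utility(action):
    return sum(prior[h] * utility(model(action)) for h, model in hypotheses.items())

# (3) Uncertainty, part two: the search. Here it is an exhaustive scan; in practice it
# is an imperfect optimizer whose behavior blends with the prior into "inductive bias".
actions = [0.0, 0.5, 1.0]
best_action = max(actions, key=expected_utility)
```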
This isn't perfect, by any means, but, I'm like, "this isn't so bad, right?"
I mean, I think this approximation is very not-good for talking about the pointers problem. But I think it's not so bad for talking about inner alignment.
I almost want to suggest that we hold off on trying to resolve this, and first, I write a whole post about "optimization under uncertainty" which clarifies the whole idea and argues for its centrality. However, I kind of don't have time for that atm.
However, when a problem involves both, it seems like we have to solve the outer part of the problem (i.e. figure out what-we-even-want), and once that's solved, all that's left for inner alignment is imperfect-optimizer-exploitation. The reverse does not apply: we do not necessarily have to solve the inner alignment issue (other than the imperfect-optimizer-exploiting part) at all.
The way I'm currently thinking of things, I would say the reverse also applies in this case.
We can turn optimization-under-uncertainty into well-defined optimization by assuming a prior. The outer alignment problem (in your sense) involves getting the prior right. Getting the prior right is part of "figuring out what we want". But this is precisely the source of the inner alignment problems in the Paul/Evan sense: Paul was pointing out a previously neglected issue about the Solomonoff prior, and Evan is talking about inductive biases of machine learning algorithms (which is sort of like the combination of a prior and imperfect search).
So both you and Evan and Paul are agreeing that there's this problem with the prior (/ inductive biases). It is distinct from other outer alignment problems (because we can, to a large extent, factor the problem of specifying an expected value calculation into the problem of specifying probabilities and the problem of specifying a value function / utility function / etc). Everyone would seem to agree that this part of the problem needs to be solved. The disagreement is just about whether to classify this part as "inner" and/or "outer".
What is this problem like? Well, it's broadly a quality-of-prior problem, but it has a different character from other quality-of-prior problems. For the most part, the quality of priors can be understood by thinking about average error being low, or mistakes becoming infrequent, etc. However, here, this kind of thinking isn't sufficient: we are concerned with rare but catastrophic errors. Thinking about these things, we find ourselves thinking in terms of "agents inside the prior" (or agents being favored by the inductive biases).
To what extent "agents in the prior" should be lumped together with "agents in imperfect search", I am not sure. But the term "inner optimizer" seems relevant.
I'd be interested in a more complete explanation of what optimization-under-uncertainty would mean, other than to take an expectation (or max/min, quantile, etc) to convert it into a deterministic optimization problem.
A good example of optimization-under-uncertainty that doesn't look like that (at least, not overtly) is most applications of gradient descent.
So you can see that we can always technically turn optimization-under-uncertainty into a well-defined optimization by providing a prior, but this is usually so impractical that ML people often don't even consider what their prior might be. Even if you did write down a prior, you'd probably have to do ordinary ML search to approximate it. Which goes to show that it's pretty hard to eliminate the non-EV versions of optimization-under-uncertainty; if you try to do real EV, you end up using non-EV methods anyway, to approximate EV.
The fact that we're not really optimizing EV, in typical applications of gradient descent, explains why methods like early stopping or dropout (or anything else that messes with the ability of gradient descent to optimize the given objective) might be useful. Otherwise, you would only expect to use modifications if they helped the search find higher-value items. But in real cases, we sometimes prefer items that have a lower score on our proxy, when the-way-we-got-that-item gives us other reason to expect it to be good (early stopping being the clearest example of this).
This in turn means we don't even necessarily convert our problem to a real, solidly defined optimization problem, ever. We can use algorithms like gradient-descent-with-early-stopping just "because they work well" rather than because they optimize some specific quantity we can already compute.
Which also complicates your argument, since if we're never converting things to well-defined optimization problems, we can't factor things into "imperfect search problems" vs "alignment given perfect search" -- because we're not really using search algorithms (in the sense of algorithms designed to get the maximum value), we're using algorithms with a strong family resemblance to search, but which may have a few overtly-suboptimal kinks thrown in because those kinks tend to reduce Goodharting.
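A minimal sketch of the early-stopping point (my own toy example, with made-up data): the parameters actually returned are whichever iterate did best on held-out data, which is generally not the iterate with the lowest training loss, so the procedure as a whole is not maximizing the stated objective.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy regression: small noisy training set, larger held-out validation set.
def make_data(n):
    x = rng.uniform(-1, 1, size=(n, 1))
    y = 3 * x[:, 0] + rng.normal(scale=0.5, size=n)
    return x, y

x_tr, y_tr = make_data(20)
x_va, y_va = make_data(200)

# Overparameterized polynomial features, so training loss can be driven very low.
def features(x, degree=9):
    return np.hstack([x ** d for d in range(degree + 1)])

F_tr, F_va = features(x_tr), features(x_va)
w = np.zeros(F_tr.shape[1])

def mse(F, y, w):
    return np.mean((F @ w - y) ** 2)

best_w, best_va = w.copy(), np.inf
for step in range(5000):
    grad = 2 * F_tr.T @ (F_tr @ w - y_tr) / len(y_tr)  # gradient of the training objective
    w -= 0.05 * grad
    va = mse(F_va, y_va, w)
    if va < best_va:
        # Early stopping: keep the best-on-validation iterate,
        # not the final (lowest-training-loss) one.
        best_va, best_w = va, w.copy()

print("train loss of returned params:", mse(F_tr, y_tr, best_w))
print("train loss of final params:   ", mse(F_tr, y_tr, w))
```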
In principle, a solution to an optimization-under-uncertainty problem needn't look like search at all.
Ah, here's an example: online convex optimization. It's a solid example of optimization-under-uncertainty, but, not necessarily thought of in terms of a prior and an expectation.
So optimization-under-uncertainty doesn't necessarily reduce to optimization.
I claim it's usually better to think about optimization-under-uncertainty in terms of regret bounds, rather than reduce it to maximization. (EG this is why Vanessa's approach to decision theory is superior.)
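For concreteness, here is a minimal sketch of the online-convex-optimization picture (my own toy example): online gradient descent on a stream of convex losses, judged by regret against the best fixed point in hindsight, with no prior and no expectation anywhere in the algorithm.

```python
import numpy as np

rng = np.random.default_rng(2)

T, d = 1000, 3
targets = rng.normal(size=(T, d))  # per-round loss parameters, revealed only after we act

def loss(w, t):
    return 0.5 * np.sum((w - targets[t]) ** 2)  # convex loss at round t

def grad(w, t):
    return w - targets[t]

w = np.zeros(d)
total_loss = 0.0
for t in range(T):
    total_loss += loss(w, t)          # commit to w, then observe the loss and update
    w -= grad(w, t) / np.sqrt(t + 1)  # ~1/sqrt(t) steps, the standard schedule used
                                      # in sqrt(T)-regret analyses of online GD

# Regret: how much worse we did than the best *fixed* w chosen in hindsight.
best_fixed = targets.mean(axis=0)     # hindsight-optimal point for this squared loss
hindsight_loss = sum(loss(best_fixed, t) for t in range(T))
print("regret:", total_loss - hindsight_loss)
```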
I'm not sure the optimization vs optimization-under-uncertainty distinction is actually all that central, though. Intuitively, the reason an objective isn't well-defined without the data/prior is that the data/prior defines the ontology, or defines what the things-in-the-objective are pointing to (in the pointers-to-values sense) or something along those lines. If the objective function is f(X, Y), then the data/prior are what point "X" and "Y" at some things in the real world. That's why the objective function cannot be meaningfully separated from the data/prior: "f(X, Y)" doesn't mean anything, by itself.
But I could imagine the pointer-aspect of the data/prior could somehow be separated from the uncertainty-aspect. Obviously this would require a very different paradigm from either today's ML or Bayesianism, but if those pieces could be separated, then I could imagine a notion of inner alignment (and possibly also something like robust generalization) which talks about both optimization and uncertainty, plus a notion of outer alignment which just talks about the objective and what it points to. In some ways, I actually like that formulation better, although I'm not clear on exactly what it would mean.
These remarks generally make sense to me. Indeed, I think the 'uncertainty-aspect' and the 'search aspect' would be rolled up into one, since imperfect search falls under the uncertainty aspect (being logical uncertainty). We might not even be able to point to which parts are prior vs search... as with "inductive bias" in ML. So inner alignment problems would always be "the uncertainty is messed up" -- forcibly unifying your search-oriented view on daemons w/ Evan's prior-oriented view. More generally, we could describe the 'uncertainty' part as where 'capabilities' live.
Naturally, this strikes me as related to what I'm trying to get at with optimization-under-uncertainty. An optimization-under-uncertainty algorithm takes a pointer, and provides all the 'uncertainty'.
But I don't think it should quite be about separating the pointer-aspect and the uncertainty-aspect. The uncertainty aspect has what I'll call "mundane issues" (eg, does it converge well given evidence, does it keep uncertainty broad w/o evidence) and "extraordinary issues" (inner optimizers). Mundane issues can be investigated with existing statistical tools/concepts. But the extraordinary issues seem to require new concepts. The mundane issues have to do with things like averages and limit frequencies. The extraordinary issues have to do with one-time events.
The true heart of the problem is these "extraordinary issues".
While I agree that outer objective, training data and prior should be considered together, I disagree that this makes the inner alignment problem dissolve except for manipulation of the search. In principle, if you could indeed ensure, through a smart choice of these three parameters, that there is only one global optimum, that all other local minima are "bad" (high loss), and that your search process will always reach the global optimum, then I would agree that the inner alignment problem disappears.
But answering "what do we even want?" at this level of precision seems basically impossible. I expect that it's pretty much equivalent to specifying exactly the result we want, which we are quite unable to do in general.
So my perspective is that the inner alignment problem appears because of inherent limits into our outer alignment capabilities. And that in realistic settings where we cannot rule out multiple very good local minima, the sort of reasoning underpinning the inner alignment discussion is the best approach we have to address such problems.
That being said, I'm not sure how this view interacts with yours or Evan's, or if this is a very standard use of the terms. But since that's part of the discussion Abram is pushing, here is how I use these terms.
Hm, I want to classify "defense against adversaries" as a separate category from both "inner alignment" and "outer alignment".
The obvious example is: if an adversarial AGI hacks into my AGI and changes its goals, that's not any kind of alignment problem, it's a defense-against-adversaries problem.
Then I would take that notion and extend it by saying "yes interacting with an adversary presents an attack surface, but also merely imagining an adversary presents an attack surface too". Well, at least in weird hypotheticals. I'm not convinced that this would really be a problem in practice, but I dunno, I haven't thought about it much.
Anyway, I would propose that the procedure for defense against adversaries in general is: (1) shelter an AGI from adversaries early in training, until it's reasonably intelligent and aligned, and then (2) trust the AGI to defend itself. I'm not sure we can do any better than that.
In particular, I imagine an intelligent and self-aware AGI that's aligned in trying to help me would deliberately avoid imagining an adversarial superintelligence that can acausally hijack its goals!
That still leaves the issue of early training, when the AGI is not yet motivated to not imagine adversaries, or not yet able. So I would say: if it does imagine the adversary, and then its goals do get hijacked, then at that point I would say "OK yes now it's misaligned". (Just like if a real adversary is exploiting a normal security hole—I would say the AGI is aligned before the adversary exploits that hole, and misaligned after.) Then what? Well, presumably, we will need to have a procedure that verifies alignment before we release the AGI from its training box. And that procedure would presumably be indifferent to how the AGI came to be misaligned. So I don't think that's really a special problem we need to think about.
That still leaves the issue of early training, when the AGI is not yet motivated to not imagine adversaries, or not yet able. So I would say: if it does imagine the adversary, and then its goals do get hijacked, then at that point I would say "OK yes now it's misaligned". (Just like if a real adversary is exploiting a normal security hole—I would say the AGI is aligned before the adversary exploits that hole, and misaligned after.) Then what? Well, presumably, we will need to have a procedure that verifies alignment before we release the AGI from its training box. And that procedure would presumably be indifferent to how the AGI came to be misaligned. So I don't think that's really a special problem we need to think about.
This part doesn't necessarily make sense, because prevention could be easier than after-the-fact measures. In particular,
That's fair. Other possible approaches are "try to ensure that imagining dangerous adversarial intelligences is aversive to the AGI-in-training ASAP, such that this motivation is installed before the AGI is able to do so", or "interpretability that looks for the AGI imagining dangerous adversarial intelligences".
I guess the fact that people don't tend to get hijacked by imagined adversaries gives me some hope that the first one is feasible - like, that maybe there's a big window where one is smart enough to understand that imagining adversarial intelligences can be bad, but not smart enough to do so with such fidelity that it actually is dangerous.
But hard to say what's gonna work, if anything, at least at my current stage of general ignorance about the overall training process.
I think one major reason why people don't tend to get hijacked by imagined adversaries is that you can't simulate someone who is smarter than you, and therefore you can defend against anything you can simulate in your mind.
This is not a perfect argument, since I can imagine someone who has power over me in the real world, and for example imagine how angry they would be at me if I did something they did not like. But then their power over me comes from their power in the real world, not their ability to outsmart me inside my own mind.
Not to disagree hugely, but I have heard one religious conversion (an enlightenment type experience) described in a way that fits with "takeover without holding power over someone". Specifically this person described enlightenment in terms close to "I was ready to pack my things and leave. But the poison was already in me. My self died soon after that."
It's possible to get the general flow of the arguments another person would make, spontaneously produce those arguments later, and be convinced by them (or at least influenced).
Since you're trying to compile a comprehensive overview of directions of research, I will try to summarize my own approach to this problem:
My third and final example: in one conversation, someone made a claim which I see as "exactly wrong": that we can somehow lower-bound the complexity of a mesa-optimizer in comparison to a non-agentic hypothesis (perhaps because a mesa-optimizer has to have a world-model plus other stuff, where a regular hypothesis just needs to directly model the world). This idea was used to argue against some concern of mine.
The problem is precisely that we know of no way of doing that! If we did, there would not be any inner alignment problem! We could just focus on the simplest hypothesis that fit the data, which is pretty much what you want to do anyway!
I think there would still be an inner alignment problem even if deceptive models were in fact always more complicated than non-deceptive models—i.e. if the universal prior wasn't malign—which is just that the neural net prior (or whatever other ML prior we use) might be malign even if the universal prior isn't (and in fact I'm not sure that there's even that much of a connection between the malignity of those two priors).
Also, I think that this distinction leads me to view “the main point of the inner alignment problem” quite differently: I would say that the main point of the inner alignment problem is that whatever prior we use in practice will probably be malign. But that does suggest that if you can construct a training process that defuses the arguments for why its prior/inductive biases will be malign, then I think that does make significant progress on defusing the inner alignment problem. Of course, I agree that we'd like to be as confident as possible that there's as little malignancy/deception as possible, such that just defusing the arguments that we can come up with might not be enough—but I still think that trying to figure out how plausible it is that the actual prior we use will be malign is in fact at least attempting to address the core problem.
I think there would still be an inner alignment problem even if deceptive models were in fact always more complicated than non-deceptive models—i.e. if the universal prior wasn't malign—which is just that the neural net prior (or whatever other ML prior we use) might be malign even if the universal prior isn't (and in fact I'm not sure that there's even that much of a connection between the malignity of those two priors).
If the universal prior were benign but NNs were still potentially malign, I think I would argue strongly against the use of NNs and in favor of more direct approximations of the universal prior. But, I agree this is not 100% obvious; giving up prosaic AI is giving up a lot.
Also, I think that this distinction leads me to view “the main point of the inner alignment problem” quite differently: I would say that the main point of the inner alignment problem is that whatever prior we use in practice will probably be malign.
Hopefully my final write-up won't contain so much polemicizing about what "the main point" is, like this write-up, and will instead just contain good descriptions of the various important problems.
I think there would still be an inner alignment problem even if deceptive models were in fact always more complicated than non-deceptive models—i.e. if the universal prior wasn't malign—which is just that the neural net prior (or whatever other ML prior we use) might be malign even if the universal prior isn't (and in fact I'm not sure that there's even that much of a connection between the malignity of those two priors)
Agree.
I have fairly mixed feelings about this post. On one hand, I agree that it's easy to mistakenly address some plausibility arguments without grasping the full case for why misaligned mesa-optimisers might arise. On the other hand, there has to be some compelling (or at least plausible) case for why they'll arise, otherwise the argument that 'we can't yet rule them out, so we should prioritise trying to rule them out' is privileging the hypothesis.
Secondly, it seems like you're heavily prioritising formal tools and methods for studying mesa-optimisation. But there are plenty of things that formal tools have not yet successfully analysed. For example, if I wanted to write a constitution for a new country, then formal methods would not be very useful; nor if I wanted to predict a given human's behaviour, or understand psychology more generally. So what's the positive case for studying mesa-optimisation in big neural networks using formal tools?
In particular, I'd say that the less we currently know about mesa-optimisation, the more we should focus on qualitative rather than quantitative understanding, since the latter needs to build on the former. And since we currently do know very little about mesa-optimisation, this seems like an important consideration.
I agree with much of this. I over-sold the "absence of negative story" story; of course there has to be some positive story in order to be worried in the first place. I guess a more nuanced version would be that I am pretty concerned about the broadest positive story, "mesa-optimizers are in the search space and would achieve high scores in the training set, so why wouldn't we expect to see them?" -- and think more specific positive stories are mostly of illustrative value, rather than really pointing to gears that I expect to be important. (With the exception of John's story, which did point to important gears.)
With respect to formalization, I did say up front that less-formal work, and empirical work, is still valuable. However, that's not the sense the post conveyed overall, so I get it. I am concretely trying to convey pessimism about a specific sort of less-formal work: work which tries to block plausibility stories. Possibly you disagree about this kind of work.
WRT your argument for informal work, well, I agree in principle (trying to push toward more formal work myself has so far revealed challenges which I think more informal conceptual work could help with), but I'm nonetheless optimistic at the moment that we can define formal problems which won't be a waste of time to work on. And out of informal work, what seems most interesting is whatever pushes toward formality.
Mesa-optimizers are in the search space and would achieve high scores in the training set, so why wouldn't we expect to see them?
I like this as a statement of the core concern (modulo some worries about the concept of mesa-optimisation, which I'll save for another time).
With respect to formalization, I did say up front that less-formal work, and empirical work, is still valuable.
I missed this disclaimer, sorry. So that assuages some of my concerns about balancing types of work. I'm still not sure what intuitions or arguments underlie your optimism about formal work, though. I assume that this would be fairly time-consuming to spell out in detail - but given that the core point of this post is to encourage such work, it seems worth at least gesturing towards those intuitions, so that it's easier to tell where any disagreement lies.
To me, the post as written seems like enough to spell out my optimism... there are multiple directions for formal work which seem under-explored to me. Well, I suppose I didn't focus on explaining why things seem under-explored. Hopefully the writeup-to-come will make that clear.
Thanks for the post!
Here is my attempt at a detailed peer-review feedback. I admit that I'm more excited by doing this because you're asking it directly, and so I actually believe there will be some answer (which in my experience is rarely the case for my in-depth comments).
One thing I really like is the multiple "failure" stories at the beginning. It's usually frustrating in posts like that to see people argue against position/arguments which are not written anywhere. Here we can actually see the problematic arguments.
I responded that for me, the whole point of the inner alignment problem was the conspicuous absence of a formal connection between the outer objective and the mesa-objective, such that we could make little to no guarantees based on any such connection. I proceeded to offer a plausibility argument for a total disconnect between the two, such that even these coarse-grained adjustments would fail.
I'm not sure if I agree that there is no connection. The mesa-objective comes from the interaction of the outer objective, the training data/environments and the bias of the learning algorithm. So in some sense there is a connection. Although I agree that for the moment we lack a formal connection, which might have been your point.
Again, this strikes me as ignoring the fundamental problem, that we have little to no idea when mesa-optimizers can arise, that we lack formal tools for the analysis of such questions, and that what formal tools we might have thought to apply, have failed to yield any such results.
Completely agreed. I always find such arguments unconvincing, not because I don't see where the people using them are coming from, but because such impossibility results require a way better understanding of what mesa-optimizers are and do than we have.
The problem is precisely that we know of no way of doing that! If we did, there would not be any inner alignment problem! We could just focus on the simplest hypothesis that fit the data, which is pretty much what you want to do anyway!
Agreed too. I always find it weird when people use that argument, because it seems to have been agreed upon in the field for a long time that there are probably simple goal-directed processes in the search spaces. Like, I can find a post from Paul's blog in 2012 where he writes:
This discussion has been brief and has necessarily glossed over several important difficulties. One difficulty is the danger of using computationally unbounded brute force search, given the possibility of short programs which exhibit goal-oriented behavior.
Defining Mesa-Optimization
There's one approach that you haven't described (although it's a bit close to your last one) and which I am particularly excited about: finding an operationalization of goal-directedness, and just defining/redefining mesa-optimizers as learned goal-directed agents. My interpretation of RLO is that it's arguing that search for simple competent programs will probably find a goal-directed system AND that it might have a simple structure "parametrized with a goal" (so basically an inner optimizer). This last assumption was really relevant for making arguments about the sort of architecture likely to be evolved by gradient descent. But I don't think the arguments are tight enough to convince us that the learned goal-directed systems will necessarily have this kind of structure, and the sort of problems mentioned seem just as salient for other goal-directed systems.
I also believe that we're not necessarily that far from having a usable definition of goal-directedness (based on the thing I presented at AISU, changed according to your feedback), but I know that not everyone agrees. Even if I'm wrong about being able to formalize goal-directedness, I'm pretty convinced that the cluster of intuitions around goal-directedness is what should be applied to a definition of mesa-optimization, because I expect inner alignment problems beyond inner optimizers.
The concept of generalization accuracy misses important issues. For example, a guaranteed very low frequency of errors might still allow an error to be strategically inserted at a very important time.
I really like this argument against using generalization. To be clear on whether I understand you, do you mean that even with very limited error, a mesa-optimizer/goal-directed agent could bide its time and use a single well-placed action to make a catastrophic treacherous turn?
Occam's razor should only make you think one of the shortest hypotheses that fits your data is going to be correct, not necessarily the shortest one. So, this kind of thinking does not directly imply a lack of malign mesa-optimization in the shortest hypothesis.
A bit tangential, but this line of argument is exactly why I find research on the loss landscape of neural nets frightening for inner alignment. What people try to prove for ML purposes is that there are no or few "bad" (high-loss) local minima. But they're fine with many "good" (low-loss) local minima, and usually find many of them. Except that this is terrible news for inner alignment, because the more "good" local minima there are, the greater the risk that some of them are deceptive mesa-optimizers.
Mutual information between predicting reality and agency may mean mesa-optimizers don't have to spend extra bits on goal content and planning. In particular, if the reality being predicted contains goal-driven agents, then a mesa-optimizer doesn't have to spend extra bits on these things, because it already needs to describe them in order to predict well.
I hadn't thought of it that way, but that does capture nicely the intuition that any good enough agent for a sufficiently complex task will be able to model humans and deception, among other things. That being said, wouldn't the mesa-optimizer still have to pay the price of maintaining two goals at all times, and of keeping track of what things mean in relation to both? Or are you arguing that this mutual information means that the mesa-optimizer will already be modeling many goal-directed systems, and so can just reuse that knowledge/information?
Pure Computational Complexity
About penalizing time complexity, I really like this part of RLO (which is sadly missed by basically everyone as it's not redescribed in the intro or conclusion):
Furthermore, in the context of machine learning, this analysis suggests that a time complexity penalty (as opposed to a description length penalty) is a double-edged sword. In the second post, we suggested that penalizing time complexity might serve to reduce the likelihood of mesa-optimization. However, the above suggests that doing so would also promote pseudo-alignment in those cases where mesa-optimizers do arise. If the cost of fully modeling the base objective in the mesa-optimizer is large, then a pseudo-aligned mesa-optimizer might be preferred simply because it reduces time complexity, even if it would underperform a robustly aligned mesa-optimizer without such a penalty.
Humans are essentially linear-time algorithms, in the sense that we take the same maximum amount of processing power (ie, that of the human brain) to produce each next output. Anything which produces linearly much output has to do so in at least linear time. So, Levin-complexity can't rule out humanlike intelligence.
I don't understand what you're saying. Your first sentence seems to point out that humans are constant-time, not linear time. An algorithm for a fixed input size is constant time, after all. The issue here is that we don't have a scaled version of the algorithms humans are solving (analogous to generalized games). So we can't discuss the asymptotic complexity of human-brain algorithms. But maybe you actually have an argument related to that which I missed?
One point about the time/description complexity penalty that I feel you don't emphasize enough is that even if there was a threshold under which mesa-optimization doesn't appear, maybe it's just too low to be competitive. That's my main internal reason to doubt complexity penalties as a solution to the emergence of mesa-optimizers.
A Note on the Consensus Algorithm
As someone who has been unconvinced by this proposal as a solution for inner alignment, but didn't take the time to express exactly why, I feel like you did a pretty nice job here, and this is probably what I will point people to when they ask about this post.
I also believe that we're not necessarily that far from having a usable definition of goal-directedness (based on the thing I presented at AISU, changed according to your feedback), but I know that not everyone agrees.
Even with a significantly improved definition of goal-directedness, I think we'd be pretty far from taking arbitrary code/NNs and evaluating their goals. Definitions resembling yours require an environment to be given; but this will always be an imperfect environment-model. Inner optimizers could then exploit differences between that environment-model and the true environment to appear benign.
But I'm happy to include your approach in the final document!
Even if I'm wrong about being able to formalize goal-directedness, I'm pretty convinced that the cluster of intuitions around goal-directedness is what should be applied to a definition of mesa-optimization, because I expect inner alignment problems beyond inner optimizers.
Can you elaborate on this?
To be clear on whether I understand you, do you mean that even with very limited error, a mesa-optimizer/goal-directed agent could bide its time and use a single well-placed action to make a catastrophic treacherous turn?
Right. Low total error for, eg, imitation learning, might be associated with catastrophic outcomes. This is partly due to the way imitation learning is readily measured in terms of predictive accuracy, when what we really care about is expected utility (although we can't specify our utility function, which is one reason we may want to lean on imitation, of course).
But even if we measure quality-of-model in terms of expected utility, we can still have a problem, since we're bound to measure average expected utility wrt some distribution, so utility could still be catastrophic wrt the real world.
That being said, wouldn't the mesa-optimizer still have to pay the price of maintaining two goals at all times, and of keeping track of what things mean in relation to both? Or are you arguing that this mutual information means that the mesa-optimizer will already be modeling many goal-directed systems, and so can just reuse that knowledge/information?
Right. If you have a proposal whereby you think (malign) mesa-optimizers have to pay a cost in some form of complexity, I'd be happy to hear it, but "systems performing complex tasks in complex environments have to pay that cost anyway" seems like a big problem for arguments of this kind. The question becomes where they put the complexity.
I don't understand what you're saying. Your first sentence seems to point out that humans are constant-time, not linear time. An algorithm for a fixed input size is constant time, after all.
I meant time as a function of data (I'm not sure how else to quantify complexity here). Humans have a basically constant reaction time, but our reactions depend on memory, which depends on our entire history. So to simulate my response after X data, you'd need O(X).
A memoryless alg could be constant time; IE, even though you have an X-long history, you just need to feed it the most recent thing, so its response time is not a function of X. Similarly with finite context windows.
The issue here is that we don't have a scaled version of the algorithms humans are solving (analogous to generalized games). So we can't discuss the asymptotic complexity of human-brain algorithms.
I agree that in principle we could decode the brain's algorithms and say "actually, that's quadratic time" or whatever; EG, quadratic-in-size-of-working-memory or something. This would tell us something about what it would mean to scale up human intelligence. But I don't think this detracts from the concern about algorithms which are linear-time (and even constant-time) as a function of data. The concern is essentially that there's nothing stopping such algorithms from being faithful-enough human models, which demonstrates that they could be mesa-optimizers.
But maybe you actually have an argument related to that which I missed?
I think the crux here is what we're measuring runtime as-a-function-of. LMK if you still think something else is going on.
About penalizing time complexity, I really like this part of RLO (which is sadly missed by basically everyone as it's not redescribed in the intro or conclusion):
I actually struggled with where to place this in the text. I wanted to discuss the double-edged-sword thing, but, I didn't find a place where it felt super appropriate to discuss it.
One point about the time/description complexity penalty that I feel you don't emphasize enough is that even if there was a threshold under which mesa-optimization doesn't appear, maybe it's just too low to be competitive. That's my main internal reason to doubt complexity penalties as a solution to the emergence of mesa-optimizers.
Right. I just didn't discuss this due to wanting to get this out as a quick sketch of where I'm going.
Even with a significantly improved definition of goal-directedness, I think we'd be pretty far from taking arbitrary code/NNs and evaluating their goals. Definitions resembling yours require an environment to be given; but this will always be an imperfect environment-model. Inner optimizers could then exploit differences between that environment-model and the true environment to appear benign.
Oh, definitely. I think a better definition of goal-directedness is a prerequisite to be able to do that, so it's only the first step. That being said, I think I'm more optimistic than you on the result, for a couple of reasons:
Can you elaborate on this?
What I mean is that when I think about inner alignment issues, I actually think of learned goal-directed models instead of learned inner optimizers. In that context, the former includes the latter. But I also expect that relatively powerful goal-directed systems can exist without a powerful simple structure like inner optimization, and that we should also worry about those.
That's one way in which I expect deconfusing goal-directedness to help here: by replacing a weirdly-defined subset of the models we should worry about with what I expect to be the full set of worrying models in that context, with a hopefully clean definition.
But even if we measure quality-of-model in terms of expected utility, we can still have a problem, since we're bound to measure average expected utility wrt some distribution, so utility could still be catastrophic wrt the real world.
Maybe irrelevant, but this makes me think of the problem of defining average complexity in complexity theory. You can prove things for some distributions over instances of the problem, but it's really difficult to find a distribution that captures the instances you will meet in the real world. This means that you tend to be limited to worst-case reasoning.
One cool way to address that is through smoothed complexity: the complexity for an instance x is the expected complexity over the distribution on instances created by adding some Gaussian noise to x. I wonder if we can get some guarantees like that, which might improve over worst-case reasoning.
Right. If you have a proposal whereby you think (malign) mesa-optimizers have to pay a cost in some form of complexity, I'd be happy to hear it, but "systems performing complex tasks in complex environments have to pay that cost anyway" seems like a big problem for arguments of this kind. The question becomes where they put the complexity.
Agreed. I don't have such a story, but I think this is a good reframing of the crux underlying this line of argument.
I meant time as a function of data (I'm not sure how else to quantify complexity here). Humans have a basically constant reaction time, but our reactions depend on memory, which depends on our entire history. So to simulate my response after X data, you'd need O(X).
For whatever reason, I thought about complexity depending on the size of the brain, which is really weird. But as complexity depending on the size of the data, I guess this makes more sense? I'm not sure why piling on more data wouldn't make the reliance on memory more difficult (so something like O(X^2) ?), but I don't think it's that important.
I agree that in principle we could decode the brain's algorithms and say "actually, that's quadratic time" or whatever; EG, quadratic-in-size-of-working-memory or something. This would tell us something about what it would mean to scale up human intelligence. But I don't think this detracts from the concern about algorithms which are linear-time (and even constant-time) as a function of data. The concern is essentially that there's nothing stopping such algorithms from being faithful-enough human models, which demonstrates that they could be mesa-optimizers.
Agreed that this is a pretty strong argument that complexity doesn't preclude mesa-optimizers.
I actually struggled with where to place this in the text. I wanted to discuss the double-edged-sword thing, but I didn't find a place where it felt super appropriate to discuss it.
Maybe in "Why this doesn't seem to work" for pure computational complexity?
What I mean is that when I think about inner alignment issues, I actually think of learned goal-directed models instead of learned inner optimizers. In that context, the former includes the latter. But I also expect that relatively powerful goal-directed systems can exist without a powerful simple structure like inner optimization, and that we should also worry about those.
That's one way in which I expect deconfusing goal-directedness to help here: by replacing a weirdly-defined subset of the models we should worry about by what I expect to be the full set of worrying models in that context, with a hopefully clean definition.
Ah, on this point, I very much agree.
I'm not sure why piling on more data wouldn't make the reliance on memory more difficult (so something like O(X^2) ?), but I don't think it's that important.
I was treating the brain as fixed in size, so as having some upper bound on memory. Naturally this isn't quite true in practice (for all we know, healthy million-year-olds might have measurably larger heads if they existed, due to slow brain growth), but either way this seems like a technicality.
I admit that I'm more excited to do this because you're asking for it directly, and so I actually believe there will be some answer (which in my experience is rarely the case for my in-depth comments).
Thanks!
I'm not sure if I agree that there is no connection. The mesa-objective comes from the interaction of the outer objective, the training data/environments and the bias of the learning algorithm. So in some sense there is a connection. Although I agree that for the moment we lack a formal connection, which might have been your point.
Right. By "no connection" I specifically mean "we have no strong reason to posit any specific predictions we can make about mesa-objectives from outer objectives or other details of training" -- at least not for training regimes of practical interest. (I will consider this detail for revision.)
I could have also written down my plausibility argument (that there is actually "no connection"), but probably that just distracts from the point here.
(More later!)
To state the least of our problems first: this requires a 100x slowdown in comparison with the state-of-the-art deep learning (or whatever) we're layering the consensus algorithm on top of.
I think you’re imagining deep learning as a MAP-type approach—it just identifies a best hypothesis and does inference with that. Comparing the consensus algorithm with (pure, idealized) MAP, 1) it is no slower, and 2) the various corners that can be cut for MAP can be cut for the consensus algorithm too. Starting with 1), the bulk of the work for either the consensus algorithm or a MAP approach is computing the posterior to determine which model(s) is(are) best. In an analogy to neural networks, the slowdown objection would be like saying most of the work comes from using the model (the forward pass) rather than arriving at the model (the many forward and backward passes in training). Regarding 2), state-of-the-art-type AI basically assumes approximate stationarity when separating a training phase from a test/execution phase. This is cutting a huge corner, and it means that when you think of a neural network running, you mostly think about it using the hypothesis that it has already settled on. But if we compare apples to apples, a consensus algorithm can cut the same corner to some extent. Neither a MAP algorithm nor a consensus algorithm is any better equipped than the other to, say, update the posterior only when the timestep is a power of two. In general, training (be it SGD or posterior updating) is the vast bulk of the work in learning. To select a good hypothesis in the first place you will have already had to consider many more; the consensus algorithm just says to keep track of the runners-up.
Third, the consensus algorithm requires a strong form of realizability assumption, where you not only assume that our Bayesian hypothesis space contains the true hypothesis, but furthermore, that it's in the top 100 (or whatever number we choose). This hypothesis has to be really good: we have to think that malign hypotheses never out-guess the benign hypothesis. Otherwise, there's a chance that we eliminate the good guy at some point (allowing the bad guys to coordinate on a wrong answer). But this is unrealistic! The world is big and complex enough that no realistic hypothesis has all the answers.
I don’t understand what out-guess means. But what we need is that the malign hypotheses don’t have substantially higher posterior weight than the benign ones. As time passes, the probability of this happening is not independent. The result I show about the probability of the truth being in the top set applies to all time, not any given point in time. I don’t know what “no realistic hypothesis has all the answers” means. There will be a best “realistic” benign hypothesis, and we can talk about that one.
Michael Cohen seems to think that restricting to imitation learning makes the realizability assumption realistic
Realistic in theory! Because the model doesn’t need to include the computer. I do not think we can actually compute every hypothesis simpler than a human brain in practice.
When you go from an idealized version to a realistic one, all methods can cut corners, and I don’t see a reason to believe that the consensus algorithm can’t cut corners just as well. Realistically, we will have some hypothesis-proposing heuristic, strong enough to identify models one of which is accurate enough to generate powerful agency. This heuristic will clearly cast a wide net (if not, how would it magically arrive at a good answer? Its internals would need some hypothesis-generating function). Rather than throwing out the runners-up, the consensus algorithm stores them. The hypothesis-generating heuristic is another attack surface for optimization daemons, and I won’t make any claims for now about how easy or hard it is to prevent such a thing.
to apply this to something like deep learning, we need to think that each run has an independent chance of creating safe AGI
Evan and I talked along these lines for a bit. My basic position is that if “local search” is enough to get to general intelligence, our algorithms will be searching in spaces (or regions) where diverse hypotheses are close. Diverse hypothesis generation is just crucial for general intelligence. I do not advocate training GPT-N with 10^100 different initializations. I don’t think you have to, and I don’t think it would help much.
Fourth, the consensus algorithm still gives inner optimizers control of when the system asks for more feedback. This in itself could be catastrophic! For example, it could ask at the worst times, when it anticipates the human feedback to be of poor quality, or when asking itself has catastrophic consequences. A silly example would be causing a self-driving car to ask for feedback right when it needs to enact difficult evasive maneuvers to avoid a crash.
It’s imitation learning, so whatever the person would do when stepping in is what the algorithm learns to do when they don’t. If the human action would be of poor quality if the human were called upon to take the action, then all the reasonable models, benign ones included, will predict that poor action, and the poor action will be taken regardless of whether the human was brought into the loop. If a person would be unable to avoid a crash when stepping in at the last minute, the imitation learner will crash the car too! If 99 hypotheses are correctly predicting the human action, then the 100th hypothesis can correctly predict it too, and the human action is executed, or it can err deliberately, in which case the human is called and the human action is executed anyway. You might have to modify this protocol a bit before using it to learn human policies for operating heavy machinery.
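For concreteness, here is a minimal toy sketch of the loop I have in mind -- act on consensus, query on disagreement, discard models that erred. The toy hypotheses and the fake demonstrator below are illustrative stand-ins, not the actual formalism:

```python
import random

def consensus_step(models, observation, query_demonstrator):
    """Return (action, surviving_models).

    If every retained model predicts the same action, take it without querying.
    Otherwise, query the demonstrator, execute their action, and discard every
    model whose prediction disagreed with the demonstrator.
    """
    predictions = [m(observation) for m in models]
    if len(set(predictions)) == 1:
        return predictions[0], models
    human_action = query_demonstrator(observation)
    survivors = [m for m, p in zip(models, predictions) if p == human_action]
    return human_action, survivors

if __name__ == "__main__":
    random.seed(0)
    human = lambda obs: obs % 3  # stand-in for the demonstrator's policy

    # 99 "benign" models that imitate the demonstrator exactly, plus one model
    # that occasionally errs on purpose (a cartoon of a malign hypothesis).
    models = [human] * 99
    models.append(lambda obs: (obs % 3) + (1 if random.random() < 0.1 else 0))

    queries = 0
    def counted_query(obs):
        global queries
        queries += 1
        return human(obs)

    for t in range(200):
        action, models = consensus_step(models, t, counted_query)
        assert action == human(t)  # the executed action is always the human action

    print(f"queries: {queries}, models remaining: {len(models)}")
```

The property doing the safety work in this sketch is that the executed action is always the demonstrator's action, either because all retained models predicted it unanimously or because the demonstrator was queried directly; a single deliberate error buys the malign model nothing except its own removal.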
Fifth, although (as Michael Cohen notes) the frequency of asking for feedback will rapidly diminish, the learning algorithm converges to the correct hypothesis extremely slowly… We have to face the fact that it might require human feedback at any point in the future.
Yeah this feels like a small cost to me. One person can be doing this for many instances at once. If you want to mimic a human that has been attentive for a given period of time (because you want to know how a human drives a car, having already gotten their bearings previously, rather than how a human drives a car having just been transported into it), you can know a few timesteps in advance if there is any likely circumstance in which you will need to query the demonstrator, and you can use the lead time to get the demonstrator acquainted. The query complexity will go up of course because there will be false alarms, but this is a totally surmountable problem for active learning.
The fourth point [controlling when to ask for more feedback] really seems like the deal-breaker. Malign mesa-optimizers could intelligently leverage this to communicate with the outside world, manipulate the training process, etc.
A malign mesa-optimizer has one moment to err, and then it will be discarded from the posterior. There’s not much it could say to us with a well-timed "now!". If multiple mesa-optimizers coordinate around a message, any mesa-optimizer not on board with this mission could disrupt the message. If you like, we could add a 1/million chance of querying to every timestep. Finally, with butterfly-effects, I don’t think that even a billion well-timed asks for help could precipitate human extinction. This is a bit all over the map, and this may be the point you want to press, but if you do, I think it is worth logging that if this is the only attack surface left, we shouldn't forget to put a stake in the ground marking "lots of progress". If you believe that a single malign world-model could derail us if we use a Solomonoff predictor, but a consensus predictor would only be derailed by a supermajority of malign world-models, aligned with each other, that is a qualitative difference.
My third and final example: in one conversation, someone made a claim which I see as "exactly wrong": that we can somehow lower-bound the complexity of a mesa-optimizer in comparison to a non-agentic hypothesis (perhaps because a mesa-optimizer has to have a world-model plus other stuff, where a regular hypothesis just needs to directly model the world). This idea was used to argue against some concern of mine.
The problem is precisely that we know of no way of doing that! If we did, there would not be any inner alignment problem! We could just focus on the simplest hypothesis that fit the data, which is pretty much what you want to do anyway!
Maybe this was someone else, but it could have been me. I think MAP probably does solve the inner alignment problem in theory, but I don’t expect to make progress resolving that question, and I’m interested in hedging against being wrong. Where you say, “We know of no way of doing that” I would say, “We know of ways that might do that, but we’re not 100% sure”. On my to-do list is to write up some of my disagreements with Paul’s original post on optimization daemons in the Solomonoff prior (and maybe with other points in this article). I don’t think it’s good to argue from the premise that a problem is worth taking seriously, and then see what follows from the existence of that problem, because a problem can exist with 10% probability and be worth taking seriously, but one might get in trouble embedding its existence too deeply in one’s view of the world, if it is still on balance unlikely. That’s not to say that most people think Paul’s conclusions are <90% likely, just that one might.
Thanks for the extensive reply, and sorry for not getting around to it as quickly as I replied to some other things!
I am sorry for the critical framing, in that it would have been more awesome to get a thought-dump of ideas for research directions from you, rather than a detailed defense of your existing work. But of course existing work must be judged, and I felt I had remained quiet about my disagreement with you for too long.
Comparing the consensus algorithm with (pure, idealized) MAP, 1) it is no slower, and 2) the various corners that can be cut for MAP can be cut for the consensus algorithm too.
It's a fair point that it's no slower than idealized MAP. But the most important corner cut by deep learning is that it represents just one hypothesis at a time, searching the space by following a gradient rather than by explicitly comparing options. The question is, how can we cut the same corner for the consensus algorithm, which needs to compare the outputs of many hypotheses?
In some settings, this is possible: for sufficiently simple hypothesis spaces, we can check consensus without explicitly computing a bunch of hypotheses. However, for deep learning, it seems rather difficult.
So, it seems like the best we can expect to do for deep learning is to train and run 100 hypotheses (or whatever number). This is a huge approximation in terms of MAP (since we have no guarantees that we are finding the 100 most probable, or anything), but we can naturally re-frame the consensus-alg guarantee in terms of frequency-of-malign-results for the NN training (rather than an assumption about at least 1 of the 100 most probable hypotheses being non-malign).
But this still means that, for a consensus of N hypotheses, the consensus algorithm will be N times slower (in terms of both training time and inference time). I expect N to be quite large, for reasons similar to what I said in the post: not only do we have to think N is large enough that at least one of the hypotheses is benign, but also we have to think that the benign hypothesis is at least as capable as any of the malign hypotheses (because otherwise it could get unlucky and be eliminated). For the purpose of imitation learning, this means we think one of the hypotheses has exactly learned to imitate a human. Since this problem is going to be quite under-determined even with very large amounts of data, it seems like N needs to be large enough to chance upon it reliably.
Starting with 1), the bulk of the work for either the consensus algorithm or a MAP approach is computing the posterior to determine which model(s) is(are) best. In an analogy to neural networks, the slowdown objection would be like saying most of the work comes from using the model (the forward pass) rather than arriving at the model (the many forward and backward passes in training). Regarding 2), state-of-the-art-type AI basically assumes approximate stationarity when separating a training phase from a test/execution phase. This is cutting a huge corner, and it means that when you think of a neural network running, you mostly think about it using the hypothesis that it has already settled on. But if we compare apples to apples, a consensus algorithm can cut the same corner to some extent. Neither a MAP algorithm nor a consensus algorithm is any better equipped than the other to, say, update the posterior only when the timestep is a power of two.
My main point here is: this isn't the most important corner I see deep learning cutting, as I described above.
In general, training (be it SGD or posterior updating) is the vast bulk of the work in learning. To select a good hypothesis in the first place you will have already had to consider many more; the consensus algorithm just says to keep track of the runners-up.
"Keeping track of runner-ups" in the MAP case makes a lot of sense. But for the deep learning case, it sounds like you are suggesting that we do consensus on part of the path that a single training run takes in hypothesis space. This seems like a pretty bad idea, for several reasons:
So it seems to me that you at least need to do independent training runs (w/ different random initializations) for the different models which you are checking consensus between, so that they are "independent" in some sense (perhaps most importantly, drawing different lottery tickets).
However, running the same training algorithm many times may not realistically explore the space enough. We sort of expect the same result from the same training procedure. Sufficiently large models will contain malign lotto tickets with high probability (so we can't necessarily argue from "one of these N initializations almost certainly lacks a malign lotto ticket" without very high N). The gradient landscape contains the same demons; maybe the chances of being pulled into them during training are just quite high. All of this suggests that N may need to be incredibly high, or other measures may need to be taken to ensure that the consensus is taken between a greater variety of hypotheses than what we get from re-running training.
I don’t understand what out-guess means. But what we need is that the malign hypotheses don’t have substantially higher posterior weight than the benign ones.
Right, that's what I meant. Suppose we're trying to imitation-learn about Sally. Sally has a bunch of little nuances to her personality. For example, she has Opinions about flowers, butter, salt... and a lot of other little things, which I'm supposing can't be anticipated from each other. I'm suggesting that no Bayesian hypothesis can get all of those things right from the get-go. So imagine that after a while, the top 100 hypotheses are all pretty good at modeling Sally in typical situations, but each one has a different "Sally secret" -- a different lucky guess about one of these little things.
In particular, a benign hypothesis (let's call it Ted) knows about the flower thing, and a malign hypothesis (Jerry) knows about a butter thing.
Unexpectedly, the butter thing becomes a huge focus of Sally's life for a while. Ted falls far out of favor compared to Jerry, since Ted just didn't see this coming. Maybe Ted updates pretty quickly, but it's too late, Ted has lost a bunch of Bayes points.
Whereas if the flower thing had come up, instead, we could consider ourselves lucky; Ted would still be in the running.
With Ted out of the running, the top 100 hypotheses are now all malign, and can coordinate some sort of treacherous turn.
That's the general idea I had in mind. The point is that N has to be high enough that we don't expect this to happen at any point.
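To put toy numbers on "losing Bayes points" (the numbers are invented purely for illustration):

$$\frac{P(\mathrm{Ted} \mid \mathrm{data})}{P(\mathrm{Jerry} \mid \mathrm{data})} \;=\; \frac{P(\mathrm{Ted})}{P(\mathrm{Jerry})} \cdot \frac{P(\mathrm{data} \mid \mathrm{Ted})}{P(\mathrm{data} \mid \mathrm{Jerry})},$$

so if Jerry assigns the butter-heavy stretch of data probability close to 1 while Ted assigns it something like $2^{-20}$, Ted's weight relative to Jerry drops by a factor of roughly a million during that one stretch -- easily enough to fall out of any fixed top-N, even if Ted is slightly better everywhere else.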
As time passes, the probability of this happening is not independent. The result I show about the probability of the truth being in the top set applies to all time, not any given point in time. I don’t know what “no realistic hypothesis has all the answers” means. There will be a best “realistic” benign hypothesis, and we can talk about that one.
Why will there be one best? That's the realizability assumption. There is not necessarily a unique model with lowest Bayes loss. Another way of stating this is that Bayesian updates lack a convergence guarantee; hypotheses can oscillate forever as to which is on top. (This is one of the classic frequentist critiques of Bayesianism.) That's the formal thing that my "flowers vs butter" story about Sally is supposed to point at.
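Here's a minimal toy version of the oscillation point (my own example, not anything from your writeup): suppose the data are i.i.d. fair coin flips, but the model class only contains $H_1 = \mathrm{Bernoulli}(0.4)$ and $H_2 = \mathrm{Bernoulli}(0.6)$. The log-posterior-ratio evolves as

$$\log\frac{P(H_1 \mid x_{1:t})}{P(H_2 \mid x_{1:t})} \;=\; \log\frac{P(H_1)}{P(H_2)} \;+\; \sum_{i=1}^{t} \log\frac{P(x_i \mid H_1)}{P(x_i \mid H_2)},$$

and under the true distribution each increment is $+\log\frac{3}{2}$ or $-\log\frac{3}{2}$ with probability $\frac{1}{2}$ each. That's a symmetric random walk, which recrosses zero infinitely often almost surely, so which hypothesis is "on top" flips infinitely often and never settles.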
We can do better with logical induction or infrabayesianism. But I'm still leery of consensus-type approaches for those on other grounds.
Realistic in theory! Because the model doesn’t need to include the computer. I do not think we can actually compute every hypothesis simpler than a human brain in practice.
When you go from an idealized version to a realistic one, all methods can cut corners, and I don’t see a reason to believe that the consensus algorithm can’t cut corners just as well.
Haha :) OK, I misinterpreted you.
But the idealized issues (can a bayesian hypothesis model the computer it is running on?) have practical analogues (can we expect hypothesis generation to produce one model which is uniquely best, or will different models simply "know different things"?). So when I judge an idealized algorithm I think about those practical analogues, and whether they seem to suggest problems for realistic approximations.
In particular, you want your safety argument to translate well, since otherwise, what's the point?
And for the consensus algorithm, I've already explained why I do think it's particularly bad in this way.
Evan and I talked along these lines for a bit. My basic position is that if “local search” is enough to get to general intelligence, our algorithms will be searching in spaces (or regions) where diverse hypotheses are close. Diverse hypothesis generation is just crucial for general intelligence. I do not advocate training GPT-N with 10^100 different initializations. I don’t think you have to, and I don’t think it would help much.
Right, so what I'm getting from this is that we can think of GPT-N as already containing a large ensemble. But how would you check consensus?
It’s imitation learning, so whatever the person would do when stepping in is what the algorithm learns to do when they don’t. If the human action would be of poor quality if the human were called upon to take the action, then all the reasonable models, benign ones included, will predict that poor action, and the poor action will be taken regardless of whether the human was brought into the loop. If a person would be unable to avoid a crash when stepping in at the last minute, the imitation learner will crash the car too! If 99 hypotheses are correctly predicting the human action, then the 100th hypothesis can correctly predict it too, and the human action is executed, or it can err deliberately, in which case the human is called and the human action is executed anyway. You might have to modify this protocol a bit before using it to learn human policies for operating heavy machinery.
So, what, the human has to act like they're driving the car forever, even though the AI autopilot is really good? If not, then there are many scenarios where the human would do well if driving normally but would crash if they were in the middle of sipping their drink when the feedback call unexpectedly came.
In terms of evaluating this on theoretical terms, the feedback call button is an output channel that malign stuff has access to. This feels kind of like "case closed" to me; I take it as a given that a sufficiently intelligent malign actor could exploit this.
If you told me "actually there is no feedback call button, the human is constantly monitored, the only difference is whether the data is used" then I'd concede that point; but then it seems like we're stuck driving the car forever.
If you want to mimic a human that has been attentive for a given period of time (because you want to know how a human drives a car, having already gotten their bearings previously, rather than how a human drives a car having just been transported into it), you can know a few timesteps in advance if there is any likely circumstance in which you will need to query the demonstrator, and you can use the lead time to get the demonstrator acquainted. The query complexity will go up of course because there will be false alarms, but this is a totally surmountable problem for active learning.
Is this procedure supposed to strike me as so airtight that an inner optimizer could never exploit it?
A malign mesa-optimizer has one moment to err, and then it will be discarded from the posterior. There’s not much it could say to us with a well-timed "now!". If multiple mesa-optimizers coordinate around a message, any mesa-optimizer not on board with this mission could disrupt the message. If you like, we could add a 1/million chance of querying to every timestep. Finally, with butterfly-effects, I don’t think that even a billion well-timed asks for help could precipitate human extinction. This is a bit all over the map, and this may be the point you want to press, but if you do, I think it is worth logging that if this is the only attack surface left, we shouldn't forget to put a stake in the ground marking "lots of progress". If you believe that a single malign world-model could derail us if we use a Solomonoff predictor, but a consensus predictor would only be derailed by a supermajority of malign world-models, aligned with each other, that is a qualitative difference.
This strikes me as a disorganized mix of arguments, and I've kind of run out of time atm to write this response, sorry! I will think more on what you have said.
Maybe this was someone else, but it could have been me.
I like the plausible deniability my anonymised examples gave me ;3
I felt I had remained quiet about my disagreement with you for too long
Haha that's fine. If you don't voice your objections, I can't respond to them!
I think let's step back for a second, though. Suppose you were in the epistemic position "yes, this works in theory, with the realizability assumption, with no computational slowdown over MAP, but having spent 2-10 hours trying to figure out how to distill a neural network's epistemic uncertainty/submodel-mismatch, and having come up blank..." what's the conclusion here? I don't think it's "my main guess is that there's no way to apply this in practice". Even if you had spent all the time since my original post trying to figure out how to efficiently distill a neural network's epistemic uncertainty, it's potentially a hard problem! But it also seems like a clear problem, maybe even tractable. See Taylor (2016) section 2.1--inductive ambiguity identification. If you were convinced that AGI will be made of neural networks, you could say that I have reduced the problem of inner alignment to the problem of diverse-model-extraction from a neural network, perhaps allowing a few modifications to training (if you bought the claim that the consensus algorithm is a theoretical solution). I have never tried to claim that analogizing this approach to neural networks will be easy, but I don't think you want to wait to hear my formal ideas until I have figured out how to apply them to neural networks; my ideal situation would be that I figure out how to do something in theory, and then 50 people try to work on analogizing it to state-of-the-art AI (there are many more neural network experts out there than AIXI experts). My less ideal situation is that people provisionally treat the theoretical solution as a dead end, right up until the very point that a practical version is demonstrated.
If it seemed like solving inner alignment in theory was easy (because allowing yourself an agent with the wherewithal to consider "unrealistic" models is such a boon), and there were thus lots of theoretical solutions floating around, any given one might not be such a strong signal: "this is the place to look for realistic solutions". But if there's only one floating around, that's a very strong signal that we might be looking in a fundamental part of the solution space. In general, I think the most practical place to look for practical solutions is near the best theoretical one, and 10 hours of unsuccessful search isn't even close to the amount of time needed to demote that area from "most promising".
I think this covers my take on a few of your points, but some of your points are separate. In particular, some of them bear on the question of whether this really is an idealized solution in the first place.
With Ted out of the running, the top 100 hypotheses are now all malign, and can coordinate some sort of treacherous turn.
I think the question we are discussing here is: "yes, with the realizability assumption, existence of a benign model in the top set is substantially correlated over infinite time, enough so that all we need to look at is the relative weight of malign and benign models, BUT is the character of this correlation fundamentally different without the realizability assumption?" I don't see how this example makes that point. If the threshold of "unrealistic" is set in such a way that "realistic" models will only know most things about Sally, then this should apply equally to malign and benign models alike. (I think your example respects that, but just making it explicit). However, there should be a benign and malign model that knows about Sally's affinity for butter but not her allergy to flowers, and a benign and a malign model that knows the opposite. It seems to me that we still end up just considering the relative weight of benign and malign models that we might expect to see.
(A frugal hypothesis generating function instead of a brute force search over all reasonable models might miss out on, say, the benign version of the model that understands Sally's allergies; I do not claim to have identified an approach to hypothesis generation that reliably includes benign models. That problem could be one direction in the research agenda of analogizing this approach to state-of-the-art AI. And this example might also be worth thinking about in that project, but if we're just using the example to try to evaluate the effect of just removing the realizability assumption, but not removing the privilege of a brute search through reasonable models, then I stand by the choice to deem this paragraph parenthetical).
Why will there be one best? That's the realizability assumption. There is not necessarily a unique model with lowest Bayes loss. Another way of stating this is that Bayesian updates lack a convergence guarantee; hypotheses can oscillate forever as to which is on top.
Yeah I was thinking that the realistic setting was a finite length setting, with the one best being the best at the end. (And if it is best at the end, you can put a lower bound on how low its posterior weight ever was, since it's hard to recover from having vanishingly small weight, and then alpha just needs to be set to include that). But fair enough to be interested in an infinite lifetime with a finite model class that does not include the truth. So yeah, a model's ranking can oscillate forever, although I think intelligent systems won't really do this in practice? I think in an infinite lifetime, it is reasonable in practice to assume that if something is generally intelligent, it has some sort of ability to generate new hypotheses over time. And if you have two hypotheses about Sally, one which explains her relationship to butter, and one which explains her relationship to flowers, it's pretty easy to generate another that combines these. So I'm not sure how instructive it is to consider the finite model class/infinite horizon Bayesian sequence prediction as a model for state-of-the-art AI.
So, what, the human has to act like they're driving the car forever, even though the AI autopilot is really good?
The suggestion I make later is that a few timesteps in advance, the human is told to start acting like they're driving the car, but they're not controlling it yet, and then comes the timestep when autopilot actually defers to them, and that's the behavior that gets recorded. Your response is:
Is this procedure supposed to strike me as so airtight that an inner optimizer could never exploit it?
That proposal was making a different point--it was refuting exactly the concern that the human would have to act like they're driving the car forever. This protocol just defends active imitation learning in general as potentially viable for the operation of heavy machinery. Assume the demonstrator is sitting in a driving simulator, and all the relevant info shows up on their screens; this allows a small team to be the demonstrators for many autopilots simultaneously. It's not a question of whether an inner optimizer could exploit this protocol by finding the most inconvenient time for the human to be called in, and then querying the human--it's much worse than that! If there's a bad time for the human driver to be called in, then the autopilot will fail at that time regardless; the inner optimizer doesn't even have to arrange an actual query. So there's nothing to "exploit" really. And defending against the "much worse" version, I do think that the policy of a human driver with 5 seconds of acquaintance with the current state does not crash the car at a rate significantly higher than a real human driver. Also, it may be worth saying that I don't think we really need general intelligence to operate heavy machinery at human level, so I'm not sure the heavy machinery case is very important to consider in the first place.
In terms of evaluating this on theoretical terms, the feedback call button is an output channel that malign stuff has access to. This feels kind of like "case closed" to me; I take it as a given that a sufficiently intelligent malign actor could exploit this.
You said you'd continue to think about the points I made about why I don't have the same reaction to this, so we can come back to those later. But stepping back, I think I want to make a point about the information loss you impose on your beliefs when applying the security mindset or Murphy's Law. A malign model gets to say "query now" once! And then it is removed from the posterior because of its wrong prediction. That is not many bits for an output channel. But... yes, it is an output channel. And if you insist that all output channels for malign stuff transmit world-ending signals, just like all aquariums will leak, I agree that that is sometimes a useful perspective to adopt and run with, but the exercise of making actual probability estimates about events is potentially different from exercising a security mindset. I mentioned information loss. With a security mindset, there is no point in distinguishing a malign agent with a 20 bit output channel from one with an unbounded output channel. You just close the case on both. But we really might live in a world where the former is existentially safe, and the latter is not--I mean that even seems likely to me. Even from my perspective, where I don't think a 20 bit output channel is dangerous, I understand the merit in disregarding that belief as the motivation for a research agenda, but I do think that's a whole different sort of "epistemic status". So when you say "I take it as a given that a sufficiently intelligent malign actor could exploit this" that makes sense to me as an action, but not as a proposition. (I mean, I think it's incorrect, so I certainly don't think it's self-evidently true).
I guess I can expand on the object level too. Like, take a moment to really stare at 01110010101011010000, and imagine that you take it to be a signal from some deeply wise agent that arranged for a query at timestep number 469,712. Suppose further that you already think this deeply wise agent is not malign, but rather is much maligned! (So it doesn't even need to spend the first few bits enthralling you). What do you do now? Or maybe timestep 469,712 doesn't mean that bitstring--maybe it means "this is a very meaningful world-state, and of similar world-states, the fact that this one was chosen tells me I should run program X on a supercomputer, so it can engage in some good, clean world-optimization". Look around you right now--if someone told you "now is important, so think about that, and go run a program", what program do you produce? Am I taking a huge risk even asking that, because there's a precise time such that if I did, you'd run that program and it would end the world?
I think let's step back for a second, though. Suppose you were in the epistemic position "yes, this works in theory, with the realizability assumption, with no computational slowdown over MAP, but having spent 2-10 hours trying to figure out how to distill a neural network's epistemic uncertainty/submodel-mismatch, and having come up blank..." what's the conclusion here? I don't think it's "my main guess is that there's no way to apply this in practice".
A couple of separate points:
Even if you had spent all the time since my original post trying to figure out how to efficiently distill a neural network's epistemic uncertainty, it's potentially a hard problem! [...] I have never tried to claim that analogizing this approach to neural networks will be easy, but I don't think you want to wait to hear my formal ideas until I have figured out how to apply them to neural networks;
Yeah, this is a fair point.
and 10 hours of unsuccessful search isn't even close to the amount of time needed to demote that area from "most promising".
To be clear, people I know spent a lot more time than that thinking hard about the consensus algorithm, before coming to the strong conclusion that it was a fruitless path. I agree that this is worth spending >20 hours thinking about. I just perceive it to have hit diminishing returns. (This doesn't mean no one should ever think about it again, but it does seem worth communicating why the direction hasn't borne fruit, at least to the extent that that line of research is happy being public.)
I think the question we are discussing here is: "yes, with the realizability assumption, existence of a benign model in the top set is substantially correlated over infinite time, enough so that all we need to look at is the relative weight of malign and benign models, BUT is the character of this correlation fundamentally different without the realizability assumption?"
Sounds right to me.
I don't see how this example makes that point. If the threshold of "unrealistic" is set in such a way that "realistic" models will only know most things about Sally, then this should apply equally to malign and benign models alike. (I think your example respects that, but just making it explicit). However, there should be a benign and malign model that knows about Sally's affinity for butter but not her allergy to flowers, and a benign and a malign model that knows the opposite. It seems to me that we still end up just considering the relative weight of benign and malign models that we might expect to see.
Ah, ok! Basically this is a new way of thinking about it for me, and I'm not sure what I think yet. My picture was that we argue that the top-weighted "good" (benign+correct) hypothesis can get unlucky, but should never get too unlucky, such that we can set N so that the good guy is always in the top N. Without realizability, we would have no particular reason to think "the good guy" (which is now just benign + reasonably correct) never drops below N on the list, for any N (because oscillations can be unbounded).
(A frugal hypothesis generating function instead of a brute force search over all reasonable models might miss out on, say, the benign version of the model that understands Sally's allergies; I do not claim to have identified an approach to hypothesis generation that reliably includes benign models. That problem could be one direction in the research agenda of analogizing this approach to state-of-the-art AI. And this example might also be worth thinking about in that project, but if we're just using the example to try to evaluate the effect of just removing the realizability assumption, but not removing the privilege of a brute search through reasonable models, then I stand by the choice to deem this paragraph parenthetical).
I don't really get why yet -- can you spell the (brute-force) argument out in more detail?
(going for now, will read+reply more later)
A few quick thoughts, and I'll get back to the other stuff later.
To be clear, people I know spent a lot more time than that thinking hard about the consensus algorithm, before coming to the strong conclusion that it was a fruitless path. I agree that this is worth spending >20 hours thinking about.
That's good to know. To clarify, I was only saying that spending 10 hours on the project of applying it to modern ML would not be enough time to deem it a fruitless path. If after 1 hour, you come up with a theoretical reason why it fails on its own terms--i.e. it is not even a theoretical solution--then there is no bound on how strongly you might reasonably conclude that it is fruitless. So this kind of meta point I was making only applied to your objections about slowdown in practice.
a "theoretical solution" to the realizability problem at all.
I only meant to claim I was just doing theory in a context that lacks the realizability problem, not that I had solved the realizability problem! But yes, I see what you're saying. The theory regards a "fair" demonstrator which does not depend on the operation of the computer. There are probably multiple perspectives about what level of "theoretical" that setting is. I would contend that in practice, the computer itself is not among the most complex and important causal ancestors of the demonstrator's behavior, so this doesn't present a huge challenge for practically arriving at a good model. But that's a whole can of worms.
My main worry continues to be the way bad actors have control over an io channel, rather than the slowdown issue.
Okay good, this worry makes much more sense to me.
Just want to note that although it's been a week this is still in my thoughts, and I intend to get around to continuing this conversation... but possibly not for another two weeks.
I guess at the end of the day I imagine avoiding this particular problem by building AGIs without using "blind search over a super-broad, probably-even-Turing-complete, space of models" as one of its ingredients. I guess I'm just unusual in thinking that this is a feasible, and even probable, way that people will build AGIs... (Of course I just wind up with a different set of unsolved AGI safety problems instead...)
The Evolutionary Story
By and large, we expect trained models to do (1) things that are directly incentivized by the training signal (intentionally or not), (2) things that are indirectly incentivized by the training signal (they're instrumentally useful, or they're a side-effect, or they “come along for the ride” for some other reason), and (3) things that are so simple to do that they can happen randomly.
So I guess I can imagine a strategy of saying "mesa-optimization won't happen" in some circumstance because we've somehow ruled out all three of those categories.
This kind of argument does seem like a not-especially-promising path for safety research, in practice. For one thing, it seems hard. Like, we may be wrong about what’s instrumentally useful, or we may overlook part of the space of possible strategies, etc. For another thing, mesa-optimization is at least somewhat incentivized by seemingly almost any training procedure, I would think.
...Hmm, in our recent conversation, I might have said that mesa-optimization is not incentivized in predictive (self-supervised) learning. I forget. But if so, I was confused. I have long believed that mesa-optimization is useful for prediction and still do. Specifically, the directly-incentivized kind of "mesa-optimization in predictive learning" entails, for example, searching over different possible approaches to process the data and generate a prediction, and then taking the most promising approach.
Anyway, what I should have said was that, in certain types of predictive learning, mesa-optimizers that search over active, real-world-manipulating plans are not incentivized—and then that's part of an argument that such mesa-optimizers are improbable. If that argument is correct, then the worst we would expect from a "misaligned mesa-optimizer" is that it will use an inappropriate prediction heuristic in some circumstances, and then we'd wind up with inaccurate predictions. That's a capability problem, not a safety problem.
So anyway, if there's a good argument along those lines, it would not be a safety argument that involves "There will be no mesa-optimizers", but rather "There will be no mesa-optimizers that think outside the box", so to speak. Details and (sketchy) argument in a forthcoming post.
By and large, we expect trained models to do (1) things that are directly incentivized by the training signal (intentionally or not), (2) things that are indirectly incentivized by the training signal (they're instrumentally useful, or they're a side-effect, or they “come along for the ride” for some other reason), and (3) things that are so simple to do that they can happen randomly.
We can also get a model that has an objective that is different from the intended formal objective (never mind whether the latter is aligned with us). For example, SGD may create a model with a different objective that is identical to the intended objective just during training (or some part thereof). Why would this be unlikely? The intended objective is not privileged over such other objectives, from the perspective of the training process.
Evan gave an example related to this, where the intention was to train a myopic RL agent that goes through blue doors in the current episode, but the result is an agent with a more general objective that cares about blue doors in future episodes as well. In Evan's words (from the Future of Life podcast):
You can imagine a situation where every situation where the model has seen a blue door, it’s been like, “Oh, going through this blue door is really good,” and it’s learned an objective that incentivizes going through blue doors. If it then later realizes that there are more blue doors than it thought because there are other blue doors in other episodes, I think you should generally expect it’s going to care about those blue doors as well.
Similar concerns are relevant for (self-)supervised models, in the limit of capability. If a network can model our world very well, the objective that SGD yields may correspond to caring about the actual physical RAM of the computer on which the inference runs (specifically, the memory location that stores the loss of the inference). Also, if any part of the network, at any point during training, corresponds to dangerous logic that cares about our world, the outcome can be catastrophic (and the probability of this seems to increase with the scale of the network and training compute).
Also, a malign prior problem may manifest in (self-)supervised learning settings. (Maybe you consider this to be a special case of (2).)
Like, if we do gradient descent, and the training signal is "get a high score in PacMan", then "mesa-optimize for a high score in PacMan" is incentivized by the training signal, and "mesa-optimize for making paperclips, and therefore try to get a high score in PacMan as an instrumental strategy towards the eventual end of making paperclips" is also incentivized by the training signal.
For example, if at some point in training, the model is OK-but-not-great at figuring out how to execute a deceptive strategy, gradient descent will make it better and better at figuring out how to execute a deceptive strategy.
Here's a nice example. Let's say we do RL, and our model is initialized with random weights. The training signal is "get a high score in PacMan". We start training, and after a while, we look at the partially-trained model with interpretability tools, and we see that it's fabulously effective at calculating digits of π—it calculates them by the billions—and it's doing nothing else, it has no knowledge whatsoever of PacMan, it has no self-awareness about the training situation that it's in, it has no proclivities to gradient-hack or deceive, and it never did anything like that anytime during training. It literally just calculates digits of π. I would sure be awfully surprised to see that! Wouldn't you? If so, then you agree with me that "reasoning about training incentives" is a valid type of reasoning about what to expect from trained ML models. I don't think it's a controversial opinion...
Again, I did not (and don't) claim that this type of reasoning should lead people to believe that mesa-optimizers won't happen, because there do tend to be training incentives for mesa-optimization.
I would sure be awfully surprised to see that! Wouldn't you?
My surprise would stem from observing that RL in a trivial environment yielded a system that is capable of calculating/reasoning-about π. If you replace the PacMan environment with a complex environment and sufficiently scale up the architecture and training compute, I wouldn't be surprised to learn that the system is doing very impressive computations that have nothing to do with the intended objective.
Note that the examples in my comment don't rely on deceptive alignment. To "convert" your PacMan RL agent example to the sort of examples I was talking about: suppose that the objective the agent ends up with is "make the relevant memory location in the RAM say that I won the game", or "win the game in all future episodes".
My hunch is that we don't disagree about anything. I think you keep trying to convince me of something that I already agree with, and meanwhile I keep trying to make a point which is so trivially obvious that you're misinterpreting me as saying something more interesting than I am.
Most of the work on inner alignment so far has been informal or semi-formal (with the notable exception of a little work on minimal circuits). I feel this has resulted in some misconceptions about the problem. I want to write up a large document clearly defining the formal problem and detailing some formal directions for research. Here, I outline my intentions, inviting the reader to provide feedback and point me to any formal work or areas of potential formal work which should be covered in such a document. (Feel free to do that last one without reading further, if you are time-constrained!)
The State of the Subfield
Risks from Learned Optimization (henceforth, RLO) offered semi-formal definitions of important terms, and provided an excellent introduction to the area for a lot of people (and clarified my own thoughts and the thoughts of others who I know, even though we had already been thinking about these things).
However, RLO spent a lot of time on highly informal arguments (analogies to evolution, developmental stories about deception) which help establish the plausibility of the problem. While I feel these were important motivation, in hindsight I think they've caused some misunderstandings. My interactions with some other researchers have caused me to worry that some people confuse the positive arguments for plausibility with the core problem, and in some cases have exactly the wrong impression about the core problem. This results in mistakenly trying to block the plausibility arguments, which I see as merely illustrative, rather than attacking the core problem.
By no means do I intend to malign experimental or informal/semiformal work. Rather, by focusing on formal theoretical work, I aim to fill a hole I perceive in the field. I am very appreciative of much of the informal/semiformal work that has been done so far, and continue to think that kind of work is necessary for the crystallization of good concepts.
Focusing on the Core Problem
In order to establish safety properties, we would like robust safety arguments ("X will not happen" / "X has an extremely low probability of happening"). For example, arguments that the probability of catastrophe will be very low, or arguments that the probability of intentional catastrophe will be very low (ie, intent-alignment), or something along those lines.
For me, the core inner alignment problem is the absence of such an argument in a case where we might naively expect it. We don't know how to rule out the presence of (misaligned) mesa-optimizers.
Instead, I see many people focusing on blocking the plausibility arguments in RLO. This strikes me as the wrong direction. To me, these arguments are merely illustrative.
It seems like some people have gotten the impression that when the assumptions of the plausibility arguments in RLO aren't met, we should not expect an inner alignment problem to arise. Not only does this attitude misunderstand what we want (ie, a strong argument that we won't encounter a problem) -- I further think it's actually wrong (because when we look at almost any case, we see cause for concern).
Examples:
The Developmental Story
One recent conversation involved a line of research based on the developmental story, where a mesa-optimizer develops a pseudo-aligned objective early in training (an objective with a strong statistical correlation to the true objective in the training data), but as it learns more about the world, it improves its training score by becoming deceptive rather than by fixing the pseudo-aligned objective. The research proposal being presented to me involved shaping the early pseudo-aligned objective in very coarse-grained ways, which might ensure (for example) a high preference for cooperative behavior, or a low tolerance for risk (catastrophic actions might be expected to be particularly risky), etc.
This line of research seemed promising to the person I was talking to, because they supposed that while it might be very difficult to precisely control the objectives of a mesa-optimizer or rule out mesa-optimizers entirely, it might be easy to coarsely shape the mesa-objectives.
I responded that for me, the whole point of the inner alignment problem was the conspicuous absence of a formal connection between the outer objective and the mesa-objective, such that we could offer little to no guarantee based on any such connection. I proceeded to offer a plausibility argument for a total disconnect between the two, such that even these coarse-grained adjustments would fail.
(Possibly it was a mistake to offer a plausibility argument, because the rest of the discussion focused on this plausibility argument, again distracting from the core problem!)
The Evolutionary Story
Another recent conversation involved an over-emphasis on the evolutionary analogy. This person believed the inner optimizer problem would apply when systems were incentivized to be goal-oriented, as with animals selected for reproductive fitness, or policy networks trained to pursue reward. However, they did not believe it would apply to networks which are simply trained to predict, such as GPT.
Again, this strikes me as ignoring the fundamental problem: that we have little to no idea when mesa-optimizers can arise, that we lack formal tools for the analysis of such questions, and that what formal tools we might have thought to apply have failed to yield any such results.
Bounding the Problem
My third and final example: in one conversation, someone made a claim which I see as "exactly wrong": that we can somehow lower-bound the complexity of a mesa-optimizer in comparison to a non-agentic hypothesis (perhaps because a mesa-optimizer has to have a world-model plus other stuff, where a regular hypothesis just needs to directly model the world). This idea was used to argue against some concern of mine.
The problem is precisely that we know of no way of doing that! If we did, there would not be any inner alignment problem! We could just focus on the simplest hypothesis that fit the data, which is pretty much what you want to do anyway!
I want to return to that idea. But first, we have to clarify some definitions.
The Formal Problem
I currently see three areas of concern:
Vanessa terms #1 Cartesian daemons because they obey the intended input/output protocol of the whole system, and #3 non-Cartesian daemons because they violate the protocol. I'm not sure whether/where #2 falls in Vanessa's classification.
For this short write-up I'll focus on #1, although clearly #2 and #3 are also important areas of study.
My semi-formal description of the problem is going to be very close to the explain-like-i'm-12 version:
Essentially what we want to do is block or weaken this conclusion (concluding that there is no, or only a very small, chance).
(The "(malign)" in parentheses should be removed in approaches which are trying to avoid mesa-optimization entirely, and included in approaches which are only trying to block bad actors. I'll sometimes use the term "malign hypothesis" to indicate hypotheses which produce catastrophic output, regardless of whether they contain a mesa-optimizer in any formal sense.)
Now, to make this into a formal problem, we have to define "mesa-optimizer".
Defining Mesa-Optimization
RLO offered a semi-formal definition of mesa-optimizers, as objects in a search space which themselves conduct an internal search. Whether or not you agree with this definition, it's too vague for formal purposes: how would we define "internal search"?
So, for the purpose of the formal problem, it's necessary to pick an operational definition.
Fortunately, it's possible to select operational definitions which have strong connections to safety concerns. In particular, I will call an operationalization tight if it errs on the side of including too much, so that when we rule out the operationalized behavior, we can be sure we have also ruled out the undesired behavior. ("Tight" is supposed to sound like the opposite of "leaky", as in "leaky abstraction".)
As with any research, the choice of operational definitions is going to depend on the specific approach of the researcher, norms of the field, etc. Here are some options that occur to me:
Obviously, this is not exhaustive by any means. Coming up with productive operational definitions is an important part of attacking the formal inner alignment problem.
The goal is then to provide some kind of strong argument which avoids the operationalized problem.
Why is this a problem?
Why can't we address the inner alignment problem in the same way that we usually address problems of induction?
The most obvious way to avoid mesa-optimizers is to rule them out of the search space or the results of search. Earlier, I said:
I was referring to the fact that the first thing one tries upon hearing about the inner alignment problem is probably some kind of complexity-bound approach; this is why I think of "the whole problem" as being that these approaches don't seem to work.
So, let's look at these approaches in more detail.
The hope is that we can be confident that mesa-optimizers will be more complex than benign hypotheses, so that we can avert mesa-optimization by using machine learning techniques which heavily bias toward less complex models. (This is "the first thing one tries" because it employs standard ideas from machine learning.)
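To make the shape of this hope explicit (the schema below is my own gloss, with $C$ standing in for whichever complexity measure we pick, not a quote from any particular proposal):

$$h^* \;=\; \operatorname*{arg\,min}_{h \text{ consistent with the training data}} C(h),$$

together with a guarantee of the form: for every data-consistent malign hypothesis $h_m$, there is a data-consistent benign hypothesis $h_b$ with $C(h_b) < C(h_m)$. If we had such a guarantee, the minimizer $h^*$ would be benign.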
The main theoretical choice is which notion of "complexity" to use. I will look at three approaches:
Description Complexity
The idea here would have to be that mesa-optimizers necessarily have a larger description length than benign hypotheses.
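Written out (this is my formalization of the "world-model plus other stuff" intuition from earlier, not an established result), the hope would be something like

$$l(h_{\text{mesa}}) \;\approx\; l(\text{world model}) + l(\text{objective}) + l(\text{search machinery}) \;>\; l(\text{world model}) \;\approx\; l(h_{\text{benign}}),$$

so that a prior weighting hypotheses by $2^{-l(h)}$ would favor the benign hypothesis whenever both fit the data.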
Why you might think this idea would work:
Why this idea doesn't seem to work:
Pure Computational Complexity
Alright, if description complexity can't rule out mesa-optimizers, maybe computational complexity can? I'm going to formalize "fast programs" as minimal circuits, where "circuit size" is understood as the number of logic gates.
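As a schema (again my gloss), this replaces description length with gate count as the complexity measure:

$$C^* \;=\; \operatorname*{arg\,min}_{C \text{ consistent with the training data}} \#\mathrm{gates}(C).$$

The relevant feature is that evaluating a circuit with $s$ gates takes at most $s$ gate-operations, so minimizing circuit size directly caps how much internal computation a hypothesis can perform -- this is the "computational ceiling" discussed below.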
Why you might think this would work:
Why it doesn't seem to work:
It might seem like we're just not cutting off enough processing power -- the minimal-circuits formalism cuts off mesa-optimization which requires superexponential thinking time, but not mesa-optimization which runs in subexponential time. Let's call this the "computational ceiling" of the formalism. Shouldn't we just pick a formalism with a lower computational ceiling?
Maybe! Let's look at a formalism which gives us a knob to adjust, so that we can choose to penalize time more/less harshly.
Mixing Time and Description Complexity
The speed prior is like Solomonoff induction, except that it penalizes computation time. Specifically, taking twice as long is like 1 more bit of description length. Combining time and description complexity in this way is called Levin complexity.
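In symbols (my rendering of the standard definition, not a quote): a program $p$ with description length $l(p)$ and runtime $t(p)$ receives prior weight

$$\text{weight}(p) \;\propto\; \frac{2^{-l(p)}}{t(p)} \;=\; 2^{-\left(l(p) + \log_2 t(p)\right)},$$

so doubling the runtime halves the weight, i.e., costs exactly one bit of description length, as stated above; the exponent $l(p) + \log_2 t(p)$ is the (additive form of) Levin complexity.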
For Levin complexity, time is penalized logarithmically (in comparison to description length). This suggests that we will get an exponential computational ceiling again: if computing $n$ bits costs $O(2^n)$ time, we're around the break-even point (it might be worth computing the bits, or it might be better to memorize them). So super-exponential calculations won't be worth it, and sub-exponential calculations will, just like we saw with minimal circuits.
However, it's easy to tweak the time/description trade-off. Let's define Levin$_f$ complexity as $2^{l+f(t)}$, where $l$ is description length and $t$ is execution time (the choice of $f=\log_2$ gives us regular Levin complexity). (I am not sure if Levin$_f$ complexity has a more standard name in the literature.) This gives us a computational ceiling of $f^{-1}$.
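To see where the ceiling $f^{-1}$ comes from, compare computing a string of $n$ bits (at runtime cost $t$) against simply memorizing it (at a cost of $n$ extra description bits); this is the same break-even reasoning as above, just written out:

$$l + f(t) \;\lessgtr\; l + n \quad\Longleftrightarrow\quad f(t) \;\lessgtr\; n \quad\Longleftrightarrow\quad t \;\lessgtr\; f^{-1}(n).$$

With $f = \log_2$, the break-even point is $t \approx 2^n$, recovering the exponential ceiling; a faster-growing $f$ (say, $f(t) = t$) lowers the ceiling to $f^{-1}(n) = n$.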
Why this might work:
Why this might not work:
Elaborating on that last point: I'm assuming that you are asking for a computation that runs "sufficiently long" anyway. The speed prior does have a heavy bias for computations which cut off quickly. However, if we're (say) predicting the weather, we probably want to require that the calculation keeps going on for a while into the future, rather than cutting off. If we're training an AI traffic control system, we'd certainly require that it doesn't just cut off after a few days. The idea, then, is that the treacherous computation C can be spaced out through the time the AI traffic control system is required to remain on-line.
Taking everything together, I don't think we have a strong basis (yet!) for thinking penalizing runtime makes the situation any better. Some have even argued that it makes the situation worse, because time constraints provide an extra incentive for complex value-driven computations (in order to manage time effectively) as opposed to simple causal hypotheses (such as raw physics). I don't think we have enough information to judge one way or the other.
A Note on the Consensus Algorithm
The eventual write-up, which this is a mere prospectus for, would aim to cover as broad a variety of approaches as possible, and give our current best assessment of the viability of each (i.e., report why it might work and why it might not work).
I can't cover everything I'd want to cover in this prospectus, but one more approach sticks out as deserving of treatment. In February of this year, Michael Cohen posted "Formal Solution to the Inner Alignment Problem". This used an algorithm which I had heard many others talk about in the context of inner alignment, but usually with considerably less hope and optimism than Michael Cohen expresses. I'll call the algorithm the consensus algorithm. (I don't know of a better name for it; I've heard it called KWIK learning (Knows-What-It-Knows), but unfortunately KWIK actually names a learning setting, not the algorithm that's typically applied in that setting.)
The consensus algorithm works like this:
Note that I'm not nailing down all the details -- e.g., how closely do we require models to agree? (Do we just require that they agree about which option is most probable? Do we require their probabilities to fall within the same small range?)
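For concreteness, here is a minimal sketch of the kind of procedure I have in mind, in Python. This is my own reconstruction, with hypothetical names and interfaces rather than Michael Cohen's exact construction; in particular, deferring to a demonstrator on disagreement is just one way of filling in the details left open above.

```python
# A minimal sketch of a "consensus"-style procedure, as I understand it.
# All names and interfaces here are hypothetical illustrations, not
# Michael Cohen's exact construction.

def consensus_act(hypotheses, weights, observation,
                  top_n=10, tol=0.05, query_demonstrator=None):
    """Act only when the top-weighted hypotheses (approximately) agree.

    hypotheses: models, each with a .predict(observation) method returning
                a dict mapping actions to probabilities (assumed interface).
    weights:    current posterior weights, aligned with `hypotheses`.
    """
    # Keep the top_n hypotheses by posterior weight.
    ranked = sorted(zip(weights, hypotheses),
                    key=lambda wh: wh[0], reverse=True)
    top = [h for _, h in ranked[:top_n]]

    # Each top hypothesis predicts a distribution over actions.
    predictions = [h.predict(observation) for h in top]
    actions = predictions[0].keys()

    # One possible notion of "agreement": all pairs of predictions assign
    # probabilities within `tol` of each other, for every action.
    agree = all(abs(p[a] - q[a]) <= tol
                for a in actions
                for p in predictions
                for q in predictions)

    if agree:
        # Consensus: act on the (shared) most probable action.
        consensus = predictions[0]
        return max(consensus, key=consensus.get)
    # Disagreement: don't act autonomously; defer (e.g., to a human
    # demonstrator), which also yields data for reweighting hypotheses.
    return query_demonstrator(observation) if query_demonstrator else None
```

The `top_n` and `tol` parameters correspond to exactly the under-specified details mentioned above: how many models to consult, and how closely they must agree.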
The consensus algorithm allows us to make some nice formal arguments. For example:
However, other drawbacks seem to spoil the appeal of this result.
Michael Cohen seems to think that restricting to imitation learning makes the realizability assumption realistic:
I disagree. It still doesn't seem realistic to suppose that competitors to the best hypothesis are never lucky. Even if we suppose that my behavior is entirely deterministic, the number of hypotheses you'd have to consider in order for one of them to be an exact copy of my brain would no doubt be huge! Let's conservatively call this $10^{100}$ (suggesting that I have at least 100 neurons, each of which has at least 10 meaningfully different configurations). This would suggest that the consensus algorithm needs a googol times the processing power of the human brain to work.
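Spelling that arithmetic out:

$$\underbrace{10 \times 10 \times \cdots \times 10}_{\ge 100\ \text{neurons},\ \ge 10\ \text{configurations each}} \;=\; 10^{100}\ \text{candidate hypotheses},$$

and if each candidate must be run at a cost comparable to a human brain, that is where the googol-brains figure comes from.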
Michael Cohen also says some other stuff about why the proposal doesn't have embedding problems, which I also disagree with. But let's set this aside and get back to listing other problems with the consensus algorithm.
The fourth point really seems like the deal-breaker. Malign mesa-optimizers could intelligently leverage this to communicate with the outside world, manipulate the training process, etc.
Conclusion
Has this been useful? Would an expanded and improved version of this be useful?
This is something where I could really use detailed peer-review-like feedback, since the final version of this thing would hopefully be a pretty canonical resource, with standardized terminology and so on.
A weakness of this as it currently stands is that I purport to offer the formal version of the inner optimization problem, but really, I just gesture at a cloud of possible formal versions. I think this is somewhat inevitable, but nonetheless, could probably be improved. What I'd like to have would be several specific formal definitions, together with several specific informal concepts, and strong stories connecting all of those things together.
I'd be glad to get any of the following types of feedback:
If you take nothing else away from this, I'm hoping you take away this one idea: the main point of the inner alignment problem (at least to me) is that we know hardly anything about the relationship between the outer optimizer and any mesa-optimizers. There are hardly any settings where we can rule mesa-optimizers out. And we can't strongly argue for any particular connection (good or bad) between outer objectives and inner.