Re-Define Intent Alignment?

by Abram Demski · 5 min read · 22nd Jul 2021 · 30 comments


I think Evan's Clarifying Inner Alignment Terminology is quite clever; more well-optimized than it may at first appear. However, I do think there are a couple of things which don't work as well as they could:

  • What exactly does the modifier "intent" mean?
    • Based on how "intent alignment" is defined (basically, the optimal policy of its behavioral objective would be good for humans), capability robustness is exactly what it needs to combine with in order to achieve impact alignment. However, we could instead define "intent alignment" as "the optimal policy of the mesa objective would be good for humans". In this case, capability robustness is not exactly what's needed; instead, what I'll provisionally call inner robustness (IE, strategies for achieving the mesa-objective generalize well) would be put in its place.
      • (I find myself flipping between these two views, and thereby getting confused.)
    • Furthermore, I would argue that the second alternative (making "intent alignment" about the mesa-objective) is more true to the idea of intent alignment. Making it about the behavioral objective turns it into a fact about the actual impact of the system, since "behavioral objective" is defined by looking at what the system actually accomplishes. But then, why the divide between intent alignment and impact alignment?
  • Any definition where "inner alignment" isn't directly paired with "outer alignment" is going to be confusing for beginners.
    • In Evan's terms, objective robustness is basically a more clever (more technically accurate and more useful) version of "the behavioral objective equals the outer objective", whereas inner alignment is "the mesa-objective equals the outer objective".
      • (It's clear that "behavioral" is intended to imply generalization, here -- the implication of objective robustness is supposed to be that the objective is stable under distributional shift. But this is obscured by the definition, which does not explicitly mention any kind of robustness/generalization.)
    • By making this distinction, Evan highlights the assumption that solving inner alignment will solve behavioral alignment: he thinks that the most important cases of catastrophic bad behavior are intentional (ie, come from misaligned objectives, either outer objective or inner objective).
      • In contrast, the generalization-focused approach puts less emphasis on the assumption that the worst catastrophes are intentional -- which could be an advantage, if this assumption isn't so good!
    • However, although I find the decomposition insightful, I dread explaining it to beginners in this way. I find that I would prefer to gloss over objective robustness and pretend that intent alignment simply factors into outer alignment and inner alignment.
      • I also find myself constantly thinking as if inner/outer alignment were a pair, intuitively!

My current proposal would be the following:

  • Re-define "intent alignment" to refer to the mesa-objective.
    • Now, inner alignment + outer alignment directly imply intent alignment, provided that there is a mesa-objective at all (IE, assuming that there's an inner optimizer).
      • This fits with the intuitive picture that inner and outer are supposed to be complementary!
  • If we wish, we could replace or re-define "capability robustness" with "inner robustness", the robustness of pursuit of the mesa-objective under distributional shift.
    • This is exactly what we need to pair with the new "intent alignment" in order to achieve impact alignment.
    • However, this is clearly a narrower concept than capability robustness (it assumes there is a mesa-objective).

This is a complex and tricky issue, and I'm eager to get thoughts on it.

Relevant reading:


As a reminder, here are Evan's definitions. Nested children are subgoals; it's supposed to be the case that if you can achieve all the children, you can achieve the parent.

  • Impact Alignment: An agent is impact aligned (with humans) if it doesn't take actions that we would judge to be bad/problematic/dangerous/catastrophic.
    • Capability Robustness: An agent is capability robust if it performs well on its behavioral objective even in deployment/off-distribution.
    • Intent Alignment: An agent is intent aligned if the optimal policy for its behavioral objective is impact aligned with humans.
      • Outer Alignment: An objective function r is outer aligned if all models that perform optimally on r in the limit of perfect training and infinite data are intent aligned.
      • Objective Robustness: An agent is objective robust if the optimal policy for its behavioral objective is impact aligned with the base objective it was trained under.

So we split impact alignment into intent alignment and capability robustness; we split intent alignment into outer alignment and objective robustness; and we achieve objective robustness through inner alignment.

Here's what my proposed modifications do:

  • (Impact) Alignment
    • Inner Robustness: An agent is inner-robust if it performs well on its mesa-objective even in deployment/off-distribution.
    • Intent Alignment: An agent is intent aligned if the optimal policy for its mesa-objective is impact aligned with humans.
      • Outer Alignment
      • Inner Alignment

"Objective Robustness" disappears from this, because inner+outer gives intent-alignment directly now. This is a bit of a shame, as I think objective robustness is an important subgoal. But I think the idea of objective robustness fits better with the generalization-focused approach:

  • Alignment
    • Outer Alignment: For this approach, outer alignment is re-defined to be only on-training-distribution (we could call it "on-distribution alignment" or something).
    • Robustness
      • Objective Robustness
        • Inner Alignment
      • Capability Robustness

And it's fine for there to be multiple different subgoal hierarchies, since there may be multiple paths forward.
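To make the "if you can achieve all the children, you can achieve the parent" reading concrete, here is a minimal sketch that encodes the three hierarchies above as trees and checks which sets of solved subgoals suffice for the root. The `Goal` class, the `achieved` method, and the label strings are illustrative assumptions rather than anything from Evan's post, and the sketch treats each decomposition as an exact implication, which glosses over caveats such as the need for a mesa-objective to exist.

```python
from dataclasses import dataclass, field


@dataclass
class Goal:
    """A node in a subgoal hierarchy (hypothetical helper for illustration)."""
    name: str
    children: list["Goal"] = field(default_factory=list)

    def achieved(self, solved: set[str]) -> bool:
        # A goal holds if it is solved directly, or if it has subgoals
        # and every one of them holds.
        if self.name in solved:
            return True
        return bool(self.children) and all(c.achieved(solved) for c in self.children)


# Evan's hierarchy, as summarized in the post.
evan = Goal("impact alignment", [
    Goal("capability robustness"),
    Goal("intent alignment", [
        Goal("outer alignment"),
        Goal("objective robustness", [Goal("inner alignment")]),
    ]),
])

# The proposed modification: intent alignment is about the mesa-objective.
proposed = Goal("impact alignment", [
    Goal("inner robustness"),
    Goal("intent alignment", [
        Goal("outer alignment"),
        Goal("inner alignment"),
    ]),
])

# The generalization-focused hierarchy.
generalization = Goal("alignment", [
    Goal("on-distribution outer alignment"),
    Goal("robustness", [
        Goal("objective robustness", [Goal("inner alignment")]),
        Goal("capability robustness"),
    ]),
])

solved = {"outer alignment", "inner alignment", "capability robustness"}
print(evan.achieved(solved))      # True
print(proposed.achieved(solved))  # False: inner robustness is still missing
```

The same set of solved subgoals satisfies one hierarchy but not another, which is one way to see why the choice of decomposition matters even when the root goal is shared.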

31 comments

However, we could instead define "intent alignment" as "the optimal policy of the mesa objective would be good for humans".

I agree that we need a notion of "intent" that doesn't require a purely behavioral notion of a model's objectives, but I think it should also not be limited strictly to mesa-optimizers, which neither Rohin nor I expect to appear in practice. (Mesa-optimizers appear to me to be the formalization of the idea "what if ML systems, which by default are not well-described as EU maximizers, learned to be EU maximizers?" I suspect MIRI people have some unshared intuitions about why we might expect this, but I currently don't have a good reason to believe this.)

I want to be able to talk about how we can shape goals which may be messier, perhaps somewhat competing, internal representations or heuristics or proxies that determine behavior. If we actually want to understand "intent," we have to understand what the heck intentions and goals actually are in humans and what they might look like in advanced ML systems. However, I do think this is a very good point you raise about intent alignment (that it should correspond to the model's internal goals, objectives, intentions, etc.), and the need to be mindful of which version we're using in a given context.

Also, I'm iffy on including the "all inputs"/optimality thing (I believe Rohin is, too)... it does have the nice property that it lets you reason without considering e.g. training setup, dataset, architecture, but we won't actually have infinite data and optimal models in practice. So, I think it's pretty important to model how different environments or datasets interact with the reward/objective function in producing the intentions and goals of our models.

Evan highlights the assumption that solving inner alignment will solve behavioral alignment: he thinks that the most important cases of catastrophic bad behavior are intentional (ie, come from misaligned objectives, either outer objective or inner objective).

I don't think this is necessarily a crux between the generalization- and objective-driven approaches—if intentional behavior requires a mesa-objective, then humans can't act "intentionally." So we obviously want a notion of intent that applies to the messier middle cases of goal representation (between a literal mesa-objective and a purely implicit behavioral objective).

I agree that we need a notion of "intent" that doesn't require a purely behavioral notion of a model's objectives, but I think it should also not be limited strictly to mesa-optimizers, which neither Rohin nor I expect to appear in practice. (Mesa-optimizers appear to me to be the formalization of the idea "what if ML systems, which by default are not well-described as EU maximizers, learned to be EU maximizers?" I suspect MIRI people have some unshared intuitions about why we might expect this, but I currently don't have a good reason to believe this.)

For myself, my reaction is "behavioral objectives also assume a system is well-described as an EU maximizer". In either case, you're assuming that you can summarize a policy by a function it optimizes; the difference is whether you think the system itself thinks explicitly in those terms.

I haven't engaged that much with the anti-EU-theory stuff, but my experience so far is that it usually involves a pretty strict idea of what is supposed to fit EU theory, and often, misunderstandings of EU theory. I have my own complaints about EU theory, but they just don't resonate at all with other people's complaints, it seems. 

For example, I don't put much stock in the idea of utility functions, but I endorse a form of EU theory which avoids them. Specifically, I believe in approximately coherent expectations: you assign expected values to events, and a large part of cognition is devoted to making these expectations as coherent as possible (updating them based on experience, propagating expectations of more distant events to nearer, etc). This is in contrast to keeping some centrally represented utility function, and devoting cognition to computing expectations for this utility function.

In this picture, there is no clear distinction between terminal values and instrumental values. Something is "more terminal" if you treat it as more fixed (you resolve contradictions by updating the other values), and "more instrumental" if its value is more changeable based on other things.

I want to be able to talk about how we can shape goals which may be messier, perhaps somewhat competing, internal representations or heuristics or proxies that determine behavior.

(Possibly you should consider my "approximately coherent expectations" idea)

I haven't engaged that much with the anti-EU-theory stuff, but my experience so far is that it usually involves a pretty strict idea of what is supposed to fit EU theory, and often, misunderstandings of EU theory. I have my own complaints about EU theory, but they just don't resonate at all with other people's complaints, it seems.

For example, I don't put much stock in the idea of utility functions, but I endorse a form of EU theory which avoids them. Specifically, I believe in approximately coherent expectations: you assign expected values to events, and a large part of cognition is devoted to making these expectations as coherent as possible (updating them based on experience, propagating expectations of more distant events to nearer, etc). This is in contrast to keeping some centrally represented utility function, and devoting cognition to computing expectations for this utility function.

Is this related to your post An Orthodox Case Against Utility Functions? It's been on my to-read list for a while; I'll be sure to give it a look now.

Right, exactly. (I should probably have just referred to that, but I was trying to avoid reference-dumping.)

If we wish, we could replace or re-define "capability robustness" with "inner robustness", the robustness of pursuit of the mesa-objective under distributional shift.

I strongly agree with this suggestion. IMO, tying capability robustness to the behavioral objective confuses a lot of things, because the set of plausible behavioral objectives is itself not robust to distributional shift.

One way to think about this from the standpoint of the "Objective-focused approach" might be: the mesa-objective is the thing the agent is revealed to be pursuing under arbitrary distributional shifts. To be precise: suppose we take the world and split it into an "agent" part and "environment" part. Then we expose the agent to every possible environment (or data distribution) allowed by our laws of physics, and we note down what the agent does in each of them. Any objective function that's consistent with our agent's actions across all of those environments must then count as a valid mesa-objective. (This is pretty much Amin & Singh's omnipotent experimenter setting.)

The behavioral objective, meanwhile, would be more like the thing the agent appears to be pursuing under some subset of possible distributional shifts. This is the more realistic case where we can't afford to expose our agent to every possible environment (or data distribution) that could possibly exist, so we make do and expose it to only a subset of them. Then we look at what objectives could be consistent with the agent's behavior under that subset of environments, and those count as valid behavioral objectives.

The key here is that the set of allowed mesa-objectives is a reliable invariant of the agent, while the set of allowed behavioral objectives is contingent on our observations of the agent's behavior under a limited set of environments. In principle, the two sets of objectives won't converge perfectly until we've run our agent in every possible environment that could exist.
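A toy sketch of the distinction between "consistent with behavior in every environment" and "consistent with behavior in the environments we happened to test". Everything here is an assumption made for illustration: an environment is just a menu of options, the agent picks one option per menu, and a candidate objective counts as consistent if each observed pick is optimal under it (ties allowed). Which of the two resulting sets should be called the mesa-objective and which the behavioral objective is deliberately left aside.

```python
from itertools import combinations

OPTIONS = ("apple", "banana", "cherry")


def agent_choice(env):
    """Hypothetical agent: always picks the option it ranks highest."""
    ranking = {"apple": 3, "banana": 2, "cherry": 1}
    return max(env, key=ranking.__getitem__)


# Hypothetical candidate objectives, each mapping an option to a value.
CANDIDATES = {
    "likes_apple_most": {"apple": 3, "banana": 2, "cherry": 1},
    "apple_then_cherry": {"apple": 3, "banana": 1, "cherry": 2},
    "cherry_then_apple": {"apple": 2, "banana": 1, "cherry": 3},
    "indifferent": {"apple": 0, "banana": 0, "cherry": 0},
}


def consistent_objectives(envs):
    """Names of candidates under which every observed choice is optimal."""
    names = []
    for name, value in CANDIDATES.items():
        if all(value[agent_choice(env)] == max(value[o] for o in env) for env in envs):
            names.append(name)
    return names


# "Omnipotent experimenter": every menu with at least two options.
all_envs = [env for k in (2, 3) for env in combinations(OPTIONS, k)]
# Limited experimenter: a single menu.
limited_envs = [("apple", "banana")]

print(consistent_objectives(limited_envs))
# ['likes_apple_most', 'apple_then_cherry', 'cherry_then_apple', 'indifferent']
print(consistent_objectives(all_envs))
# ['likes_apple_most', 'indifferent']
```

Testing more environments can only shrink the consistent set. Note that the constant objective survives every test in this sketch, because any choice is optimal under it when ties are allowed.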

So if we do an experiment whose results are consistent with a behavioral objective O_b, and we want to measure the agent's capability robustness with respect to O_b, we'd apply a distributional shift and see how well the agent performs at O_b. But what if O_b isn't actually the mesa-objective? Then the fact that the agent appeared to be pursuing O_b at all was just an artifact of the limited set of experiments we were running. So if our agent does badly at O_b under the shift, maybe the problem isn't a capability shortfall; maybe the problem is that the agent doesn't care about O_b and never did.

Whereas with your definition of inner robustness, we'd at least be within our rights to say that the true mesa-objective was O_m, and therefore that doing badly at O_m really does say something about the capability of our agent.

[This comment is no longer endorsed by its author]

The behavioral objective, meanwhile, would be more like the thing the agent appears to be pursuing under some subset of possible distributional shifts. This is the more realistic case where we can't afford to expose our agent to every possible environment (or data distribution) that could possibly exist, so we make do and expose it to only a subset of them. Then we look at what objectives could be consistent with the agent's behavior under that subset of environments, and those count as valid behavioral objectives.

The key here is that the set of allowed mesa-objectives is a reliable invariant of the agent, while the set of allowed behavioral objectives is contingent on our observations of the agent's behavior under a limited set of environments. In principle, the two sets of objectives won't converge perfectly until we've run our agent in every possible environment that could exist.

This is the right sort of idea; in the OOD robustness literature you try to optimize worst-case performance over a perturbation set of possible environments. The problem I have with what I understand you to be saying is with the assumption that there is any possible reliable invariant of the agent over every possible environment that could be a mesa-objective, which stems from the assumption that you are able to carve an environment up into an agent and an environment and place the "same agent" in arbitrary environments. No such thing is possible in reality, as an agent cannot exist without its environment, so why shouldn't we talk about the mesa-objective being over a perturbation set, too, just that it has to be some function of the model's internal features?
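As a rough illustration of what "optimize worst-case performance over a perturbation set" means, here is a minimal sketch. The scalar "policy", the shift-based environments, and the squared-error score are all toy assumptions chosen to keep the arithmetic visible; they are not taken from any particular robustness paper.

```python
def score(policy, shift):
    """Toy performance measure: negative squared error against a shifted target.

    The target value 1.0 + shift is a hypothetical stand-in for an environment.
    """
    target = 1.0 + shift
    return -((policy - target) ** 2)


def worst_case(policy, perturbation_set):
    """Performance in the least favorable environment of the set."""
    return min(score(policy, shift) for shift in perturbation_set)


candidate_policies = [0.5, 1.0, 1.5, 2.0]
perturbation_set = [-0.5, 0.0, 1.5]  # assumed set of distributional shifts

# Best policy if we only look at the unshifted (training-like) environment.
nominal_best = max(candidate_policies, key=lambda p: score(p, 0.0))
# Best policy if we judge each candidate by its worst environment.
robust_best = max(candidate_policies, key=lambda p: worst_case(p, perturbation_set))

print(nominal_best)  # 1.0
print(robust_best)   # 1.5
print(worst_case(robust_best, perturbation_set))  # -1.0
```

The guarantee obtained this way is only as good as the perturbation set: a shift outside the set is not covered at all.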

which stems from the assumption that you are able to carve an environment up into an agent and an environment and place the "same agent" in arbitrary environments. No such thing is possible in reality, as an agent cannot exist without its environment

 

I might be misunderstanding what you mean here, but carving up a world into agent vs environment is absolutely possible in reality, as is placing that agent in arbitrary environments to see what it does. You can think of the traditional RL setting as a concrete example of this: on one side we have an agent that is executing some policy π(a|s); and on the other side we have an environment that consists of state transition dynamics given by some distribution p(s'|s,a). One can in fact show (see the unidentifiability in IRL paper) that if an experimenter has the power to vary the environment p(s'|s,a) arbitrarily and look at the policies the agent pursues on each of those environments, then that experimenter can recover a reward function that is unique up to the usual affine transformations.

That recovered reward function is a fortiori a reliable invariant of the agent, since it is consistent with the agent's actions under every possible environment the agent could be exposed to. (To be clear, this claim is also proved in the paper.) It also seems reasonable to identify that reward function with the mesa-objective of the agent, because any mesa-objective that is not identical with that reward function has to be inconsistent with the agent's actions on at least one environment.

Admittedly there are some technical caveats to this particular result: off the top, 1) the set of states & actions is fixed across environments; 2) the result was proved only for finite sets of states & actions; and 3) optimal policy is assumed. I could definitely imagine taking issue with some of these caveats — is this the sort of thing you mean? Or perhaps you're skeptical that a proof like this in the RL setting could generalize to the train/test framing we generally use for NNs?

in the OOD robustness literature you try to optimize worst-case performance over a perturbation set of possible environments.

Yeah that's sensible because this is often all you can do in practice. Having an omnipotent experimenter is rarely realistic, but imo it's still useful as a way to bootstrap a definition of the mesa-objective.

Btw, if you're aware of any counterpoints to this — in particular anything like a clearly worked-out counterexample showing that one can't carve up a world, or recover a consistent utility function through this sort of process — please let me know. I'm directly working on a generalization of this problem at the moment, and anything like that could significantly accelerate my execution.

Thanks!!

(see the unidentifiability in IRL paper)

Ah, I wasn't aware of this!

Btw, if you're aware of any counterpoints to this — in particular anything like a clearly worked-out counterexample showing that one can't carve up a world, or recover a consistent utility function through this sort of process — please let me know. I'm directly working on a generalization of this problem at the moment, and anything like that could significantly accelerate my execution.

I'm not sure what would constitute a clearly-worked counterexample. To me, a high reliance on an agent/world boundary constitutes a "non-naturalistic" assumption, which simply makes me think a framework is more artificial/fragile.

For example, AIXI assumes a hard boundary between agent and environment. One manifestation of this assumption is how AIXI doesn't predict its own future actions the way it predicts everything else, and instead, must explicitly plan its own future actions. This is necessary because AIXI is not computable, so treating the future self as part of the environment (and predicting it with the same predictive capabilities as usual) would violate the assumption of a computable environment. But this is unfortunate for a few reasons. First, it forces AIXI to have an arbitrary finite planning horizon, which is weird for something that is supposed to represent unbounded intelligence. Second, there is no reason to carry this sort of thing over to finite, computable agents; so it weakens the generality of the model, by introducing a design detail that's very dependent on the specific infinite setting.

Another example would be game-theoretic reasoning. Suppose I am concerned about cooperative behavior in deployed AI systems. I might work on something like the equilibrium selection problem in game theory, looking for rationality concepts which can select cooperative equilibria where they exist. However, this kind of work will typically treat a "game" as something which inherently comes with a pointer to the other agents. This limits the real-world applicability of such results, because to apply it to real AI systems, those systems would need "agent pointers" as well. This is a difficult engineering problem (creating an AI system which identifies "agents" in its environment); and even assuming away the engineering challenges, there are serious philosophical difficulties (what really counts as an "agent"?).

We could try to tackle those difficulties, but my assumption will tend to be that it'll result in fairly brittle abstractions with weird failure modes. 

Instead, I would advocate for Pavlov-like strategies which do not depend on actually identifying "agents" in order to have cooperative properties. I expect these to be more robust and present fewer technical challenges.
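For readers unfamiliar with Pavlov, here is a minimal sketch of the standard "win-stay, lose-shift" rule in an iterated prisoner's dilemma. The payoff values, the satisfaction threshold, and the helper names are assumptions made for illustration; the comment above does not commit to a specific formulation. The point of the sketch is that the rule takes only the player's own last move and own last payoff as input, so it never needs a representation of "the other agent".

```python
# Row player's payoff for (my_move, their_move).
# Conventional prisoner's dilemma values, assumed for illustration.
PAYOFF = {("C", "C"): 3, ("C", "D"): 0, ("D", "C"): 5, ("D", "D"): 1}
SATISFIED = 3  # payoffs at or above this count as a "win" (assumed threshold)


def pavlov(last_move, last_payoff):
    """Win-stay, lose-shift: keep the last move after a win, switch after a loss."""
    if last_payoff >= SATISFIED:
        return last_move
    return "D" if last_move == "C" else "C"


def play(rounds, slip_round):
    """Two Pavlov players; player A defects once by mistake at slip_round."""
    a, b = "C", "C"
    history = []
    for t in range(rounds):
        if t == slip_round:
            a = "D"  # one-off error, not part of the strategy
        history.append((a, b))
        payoff_a, payoff_b = PAYOFF[(a, b)], PAYOFF[(b, a)]
        a, b = pavlov(a, payoff_a), pavlov(b, payoff_b)
    return history


print(play(5, slip_round=1))
# [('C', 'C'), ('D', 'C'), ('D', 'D'), ('C', 'C'), ('C', 'C')]
```

After the accidental defection, both players pass through one round of mutual defection and then return to mutual cooperation, without either of them modeling the other.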

Of course, this general heuristic may not turn out to apply in the specific case we are discussing. If you control the training process, then, for the duration of training, you control the agent and the environment, and these concepts seem unproblematic. However, it does seem unrealistic to really check every environment; so, it seems like to establish strong guarantees, you'd need to do worst-case reasoning over arbitrary environments, rather than checking environments in detail. This is how I was mainly interpreting jbkjr; perturbation sets could be a way to make things more feasible (at a cost).

I'm not sure what would constitute a clearly-worked counterexample. To me, a high reliance on an agent/world boundary constitutes a "non-naturalistic" assumption, which simply makes me think a framework is more artificial/fragile.

Oh for sure. I wouldn't recommend having a Cartesian boundary assumption as the fulcrum of your alignment strategy, for example. But what could be interesting would be to look at an isolated dynamical system, draw one boundary, investigate possible objective functions in the context of that boundary; then erase that first boundary, draw a second boundary, investigate that; etc. And then see whether any patterns emerge that might fit an intuitive notion of agency. But the only fundamentally real object here is always going to be the whole system, absolutely.

As I understand, something like AIXI forces you to draw one particular boundary because of the way the setting is constructed (infinite on one side, finite on the other). So I'd agree that sort of thing is more fragile.

The multiagent setting is interesting though, because it gets you into the game of carving up your universe into more than 2 pieces. Again it would be neat to investigate a setting like this with different choices of boundaries and see if some choices have more interesting properties than others.

Btw, if you're aware of any counterpoints to this — in particular anything like a clearly worked-out counterexample showing that one can't carve up a world, or recover a consistent utility function through this sort of process — please let me know. I'm directly working on a generalization of this problem at the moment, and anything like that could significantly accelerate my execution.

I'm not saying you can't reason under the assumption of a Cartesian boundary, I'm saying the results you obtain when doing so are of questionable relevance to reality, because "agents" and "environments" can only exist in a map, not the territory. The idea of trying to e.g. separate "your atoms" or whatever from those of "your environment," so that you can drop them into those of "another environment," is only a useful fiction, as in reality they're entangled with everything else. I'm not aware of a formal proof of this point that I'm trying to make; it's just a pretty strongly held intuition. Isn't this also kind of one of the key motivations for thinking about embedded agency?

Ah I see! Thanks for clarifying.

Yes, the point about the Cartesian boundary is important. And it's completely true that any agent / environment boundary we draw will always be arbitrary. But that doesn't mean one can't usefully draw such a boundary in the real world — and unless one does, it's hard to imagine how one could ever generate a working definition of something like a mesa-objective. (Because you'd always be unable to answer the legitimate question: "the mesa-objective of what?")

Of course the right question will always be: "what is the whole universe optimizing for?" But it's hard to answer that! So in practice, we look at bits of the whole universe that we pretend are isolated. All I'm saying is that, to the extent you can meaningfully ask the question, "what is this bit of the universe optimizing for?", you should be able to clearly demarcate which bit you're asking about.

(i.e. I agree with you that duality is a useful fiction, just saying that we can still use it to construct useful definitions.)

I would further add that looking for difficulties created by the simplification seems very intellectually productive. (Solving "embedded agency problems" seems to genuinely allow you to do new things, rather than just soothing philosophical worries.) But yeah, I would agree that if we're defining mesa-objective anyway, we're already in the business of assuming some agent/environment boundary.

I would further add that looking for difficulties created by the simplification seems very intellectually productive.

Yep, strongly agree. And a good first step to doing this is to actually build as robust a simplification as you can, and then see where it breaks. (Working on it.)

(Because you'd always be unable to answer the legitimate question: "the mesa-objective of what?")

All I'm saying is that, to the extent you can meaningfully ask the question, "what is this bit of the universe optimizing for?", you should be able to clearly demarcate which bit you're asking about.

I totally agree with this; I guess I'm just (very) wary about being able to "clearly demarcate" whichever bit we're asking about and therefore fairly pessimistic we can "meaningfully" ask the question to begin with? Like, if you start asking yourself questions like "what am 'I' optimizing for?," and then try to figure out exactly what the demarcation is between "you" and "everything else" in order to answer that question, you're gonna have a real tough time finding anything close to a satisfactory answer.

Yeah I agree this is a legitimate concern, though it seems like it is definitely possible to make such a demarcation in toy universes (like in the example I gave above). And therefore it ought to be possible in principle to do so in our universe.

To try to understand a bit better: does your pessimism about this come from the hardness of the technical challenge of querying a zillion-particle entity for its objective function? Or does it come from the hardness of the definitional challenge of exhaustively labeling every one of those zillion particles to make sure the demarcation is fully specified? Or is there a reason you think constructing any such demarcation is impossible even in principle? Or something else?

To try to understand a bit better: does your pessimism about this come from the hardness of the technical challenge of querying a zillion-particle entity for its objective function? Or does it come from the hardness of the definitional challenge of exhaustively labeling every one of those zillion particles to make sure the demarcation is fully specified? Or is there a reason you think constructing any such demarcation is impossible even in principle? Or something else?

Probably something like the last one, although I think "even in principle" is probably doing something suspicious in that statement. Like, sure, "in principle," you can pretty much construct any demarcation you could possibly imagine, including the Cartesian one, but what I'm trying to say is something like, "all demarcations, by their very nature, exist only in the map, not the territory." Carving reality is an operation that could only make sense within the context of a map, as reality simply is. Your concept of "agent" is defined in terms of other representations that similarly exist only within your world-model; other humans have a similar concept of "agent" because they have a similar representation built from correspondingly similar parts. If an AI is to understand the human notion of "agency," it will need to also understand plenty of other "things" which are also only abstractions or latent variables within our world models, as well as what those variables "point to" (at least, what variables in the AI's own world model they 'point to,' as by now I hope you're seeing the problem with trying to talk about "things they point to" in external/'objective' reality!).

I'm with you on this, and I suspect we'd agree on most questions of fact around this topic. Of course demarcation is an operation on maps and not on territories.

But as a practical matter, the moment one starts talking about the definition of something such as a mesa-objective, one has already unfolded one's map and started pointing to features on it. And frankly, that seems fine! Because historically, a great way to make forward progress on a conceptual question has been to work out a sequence of maps that give you successive degrees of approximation to the territory.

I'm not suggesting actually trying to imbue an AI with such concepts — that would be dangerous (for the reasons you alluded to) even if it wasn't pointless (because prosaic systems will just learn the representations they need anyway). All I'm saying is that the moment we started playing the game of definitions, we'd already started playing the game of maps. So using an arbitrary demarcation to construct our definitions might be bad for any number of legitimate reasons, but it can't be bad just because it caused us to start using maps: our earlier decision to talk about definitions already did that.

(I'm not 100% sure if I've interpreted your objection correctly, so please let me know if I haven't.)

No such thing is possible in reality, as an agent cannot exist without its environment, so why shouldn't we talk about the mesa-objective being over a perturbation set, too, just that it has to be some function of the model's internal features?

This makes some sense, but I don't generally trust some "perturbation set" to in fact capture the distributional shift which will be important in the real world. There has to at least be some statement that the perturbation set is actually quite broad. But I get the feeling that if we could make the right statement there, we would understand the problem in enough detail that we might have a very different framing. So, I'm not sure what to do here.

Update: having now thought more deeply about this, I no longer endorse my above comment.

While I think the reasoning was right, I got the definitions exactly backwards. To be clear, what I would now claim is:

  1. The behavioral objective is the thing the agent is revealed to be pursuing under arbitrary distributional shifts.
  2. The mesa-objective is something the agent is revealed to be pursuing under some subset of possible distributional shifts.

Everything in the above comment then still goes through, except with these definitions reversed.

On the one hand, the "perfect IRL" definition of the behavioral objective seems more naturally consistent with the omnipotent experimenter setting in the IRL unidentifiability paper cited downthread. As far as I know, perfect IRL isn't defined anywhere other than by reference to this reward modelling paper, which introduces the term but doesn't define it either. But the omnipotent experimenter setting seems to capture all the properties implied by perfect IRL, and does so precisely enough that one can use it to make rigorous statements about the behavioral objective of a system in various contexts.

On the other hand, it's actually perfectly possible for a mesa-optimizer to have a mesa-objective that is inconsistent with its own actions under some subset of conditions (the key conceptual error I was making was in thinking this was not possible). For example, a human being is a mesa-optimizer from the point of view of evolution. A human being may have something like "maximize happiness" as their mesa-objective. And a human being may, and frequently does, do things that do not maximize for their happiness.

A few consequences of the above:

  • Under an "omnipotent experimenter" definition, the behavioral objective (and not the mesa-objective) is a reliable invariant of the agent.
  • It's entirely possible for the behavioral objective to be overdetermined in certain situations; i.e., if we run every possible experiment on an agent, we may find that the only reward / utility function consistent with its behavior across all those experiments is the trivial utility function that's constant across all states.
  • If the behavioral objective of a system is overdetermined, that might mean the system never pursues anything coherently. But it might also mean that there exist subsets of distributions on which the system pursues an objective very coherently, but that different distributions induce different coherent objectives.
  • The natural way to use the mesa-objective concept is to attach it to one of these subsets of distributions on which we hypothesize our system is pursuing a goal coherently. If we apply a restricted version of the omnipotent experimenter definition — that is, run every experiment on our agent that's consistent with the subset of distributions we're conditioning on — then we will in general recover a set of mesa-objective candidates consistent with the system's actions on that subset.
  • It is strictly incorrect to refer to "the" mesa-objective of any agent or optimizer. Any reference to a mesa-objective has to be conditioned on the subset of distributions it applies on, otherwise it's underdetermined. (I believe Jack refers to this as a "perturbation set" downthread.)

This seems like it puts these definitions on a more rigorous footing. It also starts to clarify in my mind the connection with the "generalization-focused approach" to inner alignment, since it suggests a procedure one might use in principle to find out whether a system is pursuing coherent utilities on some subset of distributions. ("When we do every experiment allowed by this subset of distributions, do we recover a nontrivial utility function or not?")
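The procedure described above can be illustrated with a toy sketch. Everything in it is an assumption made for illustration: the three outcomes, the two contexts ("train" and "shift"), the experiment log, and the choice to search over a small grid of integer utilities. It is not the formal omnipotent experimenter setting from the IRL unidentifiability paper, only a minimal picture of "run the allowed experiments and see which utility functions survive".

```python
from itertools import product

OUTCOMES = ("A", "B", "C")

# Hypothetical experiment log: (context, options offered, option chosen).
# In "train" the agent acts as if it prefers A > B > C; in "shift" the order flips.
EXPERIMENTS = [
    ("train", ("A", "B"), "A"),
    ("train", ("B", "C"), "B"),
    ("train", ("A", "C"), "A"),
    ("shift", ("A", "B"), "B"),
    ("shift", ("B", "C"), "C"),
    ("shift", ("A", "C"), "C"),
]


def consistent_utilities(experiments, levels=(0, 1, 2)):
    """Return every utility assignment under which each observed choice is weakly optimal."""
    candidates = []
    for values in product(levels, repeat=len(OUTCOMES)):
        u = dict(zip(OUTCOMES, values))
        if all(u[chosen] >= u[o] for _, options, chosen in experiments for o in options):
            candidates.append(u)
    return candidates


def is_trivial(u):
    """A utility function is trivial if it is constant across all outcomes."""
    return len(set(u.values())) == 1


def restrict(experiments, contexts):
    """Keep only the experiments allowed by the chosen subset of contexts."""
    return [e for e in experiments if e[0] in contexts]


# Unrestricted experimenter: only constant utilities survive.
all_candidates = consistent_utilities(EXPERIMENTS)

# Experimenter restricted to the "train" contexts: nontrivial candidates survive.
train_candidates = consistent_utilities(restrict(EXPERIMENTS, {"train"}))

print(all(is_trivial(u) for u in all_candidates))
print(any(not is_trivial(u) for u in train_candidates))
```

In this toy, the unrestricted log only admits constant utilities, while restricting to one context leaves a set of nontrivial candidates rather than a single one, which is the sense in which a mesa-objective would need to be conditioned on a subset of distributions.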

Would definitely be interested in getting feedback on these thoughts!

(Meta: was this meant to be a question?)

In contrast, the generalization-focused approach puts less emphasis on the assumption that the worst catastrophes are intentional.

I don't think this is actually a con of the generalization-focused approach. From the post you link, one of the two questions in that approach (the one focused on robustness) is:

How do we ensure the model generalizes acceptably out of distribution?

Part of the problem is to come up with a good definition of "acceptable", such that this is actually possible to achieve. (See e.g. the "Defining acceptable" section of this post, or the beginning of this post.) But if you prefer to bake in the notion of intent, you could make the second question

How do we ensure the model continues to try to help us when out of distribution?

(Meta: was this meant to be a question?)

I originally conceived of it as such, but in hindsight, it doesn't seem right.

In contrast, the generalization-focused approach puts less emphasis on the assumption that the worst catastrophes are intentional.

I don't think this is actually a con of the generalization-focused approach.

By no means did I intend it to be a con. I'll try to edit to clarify. I think it is a real pro of the generalization-focused approach that it does not rely on models having mesa-objectives (putting it in Evan's terms, there is a real possibility of addressing objective robustness without directly addressing inner alignment). So, focusing on objective robustness seems like a potential advantage -- it opens up more avenues of attack. Plus, the generalization-focused approach requires a much weaker notion of "outer alignment", which may be easier to achieve as well.

But, of course, it may also turn out that the only way to achieve objective robustness is to directly tackle inner alignment. And it may turn out that the weaker notion of outer alignment is insufficient in reality.

Are you the historical origin of the robustness-centric approach? I noticed that Evan's post has the modified robustness-centric diagram in it, but I don't know if it was edited to include that. The "Objective Robustness and Inner Alignment Terminology" post attributes it to you (at least, attributes a version of it to you). (I didn't look at the references there yet.)

Are you the historical origin of the robustness-centric approach?

Idk, probably? It's always hard for me to tell; so much of what I do is just read what other people say and make the ideas sound sane to me. But stuff I've done that's relevant:

  • Talk at CHAI saying something like "daemons are just distributional shift" in August 2018, I think. (I remember Scott attending it.)
  • Talk at FHI in February 2020 that emphasized a risk model where objectives generalize but capabilities don't.
  • Talk at SERI conference a few months ago that explicitly argued for a focus on generalization over objectives.

Especially relevant stuff other people have done that has influenced me:

(My views were pretty set by the time Evan wrote the clarifying inner alignment terminology post; it's possible that his version that's closer to generalization-focused was inspired by things I said, you'd have to ask him.)

I've watched your talk at SERI now.

One question I have is how you hope to define a good notion of "acceptable" without a notion of intent. In your talk, you mention looking at why the model does what it does, in addition to just looking at what it does. This makes sense to me (I talk about similar things), but it seems just about as fraught as the notion of mesa-objective:

  1. It requires approximately the same "magic transparency tech" as we need to extract mesa-objectives.
  2. Even with magical transparency tech, it requires additional insight as to which reasoning is acceptable vs unacceptable. 

If you are pessimistic about extracting mesa-objectives, why are you optimistic about providing feedback about how to reason? More generally, what do you think "acceptability" might look like?

(By no means do I mean to say your view is crazy; I am just looking for your explanation.)

One question I have is how you hope to define a good notion of "acceptable" without a notion of intent. 

I don't hope this; I expect to use a version of "acceptable" that uses intent. I'm happy with "acceptable" = "trying to do what we want".

If you are pessimistic about extracting mesa-objectives, why are you optimistic about providing feedback about how to reason?

I'm pessimistic about mesa-objectives existing in actual systems, based on how people normally seem to use the term "mesa-objective". If you instead just say that a "mesa objective" is "whatever the system is trying to do", without attempting to cash it out as some simple utility function that is being maximized, or the output of a particular neuron in the neural net, etc, then that seems fine to me.

One other way in which "acceptability" is better is that rather than require it of all inputs, you can require it of all inputs that are reasonably likely to occur in practice, or something along those lines. (And this is what I expect we'll have to do in practice given that I don't expect to fully mechanistically understand a large neural network; the "all inputs" should really be thought of as a goal we're striving towards.) Whereas I don't see how you do this with a mesa-objective (as the term is normally used); it seems like a mesa-objective must apply on any input, or else it isn't a mesa-objective.

I'm mostly not trying to make claims about which one is easier to do; rather I'm saying "we're using the wrong concepts; these concepts won't apply to the systems we actually build; here are some other concepts that will work".

All of that made perfect sense once I thought through it, and I tend to agree with most of it. I think my biggest disagreement with you is that (in your talk) you said you don't expect formal learning theory work to be relevant. I agree with your points about classical learning theory, but the alignment community has been developing basically-classical-learning-theory tools which go beyond those limitations. I'm optimistic that stuff like Vanessa's InfraBayes could help here.

Granted, there's a big question of whether that kind of thing can be competitive. (Although there could potentially be a hybrid approach.)

My central complaint about existing theoretical work is that it doesn't seem to be trying to explain why neural nets learn good programs that generalize well, even when they have enough parameters to overfit and can fit a randomly labeled dataset. It seems like you need to make some assumption about the real world (i.e. an assumption about your dataset, or the training process that generated it), which people seem loath to do.

I don't currently see how any of the alignment community's tools address that complaint; for example I don't think the InfraBayes work so far is making an interesting assumption about reality. Perhaps future work will address this though?

InfraBayes doesn't look for the regularity in reality that NNs are taking advantage of, agreed. But InfraBayes is exactly about "what kind of regularity assumptions can we realistically make about reality?" You can think of it as a reaction to the unrealistic nature of the regularity assumptions which Solomonoff induction makes. So it offers an answer to the question "what useful+realistic regularity assumptions could we make?"

The InfraBayesian answer is "partial models". IE, the idea that even if reality cannot be completely described by usable models, perhaps we can aim to partially describe it. This is an assumption about the world -- not all worlds can be usefully described by partial models. However, it's a weaker assumption about the world than usual. So it may not have presented itself as an assumption about the world in your mind, since perhaps you were thinking more of stronger assumptions.

If it's a good answer, it's at least plausible that NNs work well for related reasons.

But I think it also makes sense to try to get at the useful+realistic regularity assumptions from scratch, rather than necessarily making it all about NNs.

This is an assumption about the world -- not all worlds can be usefully described by partial models.

They can't? Why not?

Maybe the "usefully" part is doing a lot of work here -- can all worlds be described (perhaps not usefully) by partial models? If so, I think I have the same objection, since it doesn't seem like any of the technical results in InfraBayes depend on some notion of "usefulness".

(I think it's pretty likely I'm just flat out wrong about something here, given how little I've thought about InfraBayesianism, but if so I'd like to know how I'm wrong.)

They can't? Why not?

Answer 1

I meant to invoke a no-free-lunch type intuition; we can always construct worlds where some particular tool isn't useful.

My go-to would be "a world that checks what an InfraBayesian would expect, and does the opposite". This is enough for the narrow point I was trying to make (that InfraBayes does express some kind of regularity assumption about the world), but it's not very illustrative or compelling for my broader point (that InfraBayes plausibly addresses your concerns about learning theory). So I'll try to tell a better story.

Answer 2

I might be describing logically impossible (or at least uncomputable) worlds here, but here is my story:

Solomonoff Induction captures something important about the regularities we see in the universe, but it doesn't explain NN learning (or "ordinary human learning") very well, because NNs and humans mostly use very fast models which are clearly much smaller (in time-complexity and space-complexity) than the universe. (Solomonoff induction is closer to describing human science, which does use these very simple but time/space-complex models.)

So there's this remaining question of induction: why can we do induction in practice? (IE, with NNs and with nonscientific reasoning)

InfraBayes answers this question by observing that although we can't easily use Solomonoff-like models of the whole universe, there are many patterns we can take advantage of which can be articulated with partial models. 

This didn't need to be the case. We could be in a universe in which you need to fully model the low-level dynamics in order to predict things well at all.

So, a regularity which InfraBayes takes advantage of is the fact that we see multi-scale phenomena -- that simple low-level rules often give rise to simple high-level behavior as well.

I say "maybe I'm describing logically impossible worlds" here because it is hard to imagine a world where you can construct a computer but where you don't see this kind of multi-level phenomenon. Mathematics is full of partial-model-type regularities; so, this has to be a world where mathematics isn't relevant (or, where mathematics itself is different).

But Solomonoff induction alone doesn't give a reason to expect this sort of regularity. So, if you imagine a world being drawn from the Solomonoff prior vs a world being drawn from a similar InfraBayes prior, I think the InfraBayes prior might actually generate worlds more like the one we find ourselves in (ie, InfraBayes contains more information about the world).

(Although actually, I don't know how to "sample from an infrabayes prior"...)

"Usefully Describe"

Maybe the "usefully" part is doing a lot of work here -- can all worlds be described (perhaps not usefully) by partial models? If so, I think I have the same objection, since it doesn't seem like any of the technical results in InfraBayes depend on some notion of "usefulness".

Part of what I meant by "usefully describe" was to contrast runnable models from non-runnable models. EG, even if Solomonoff induction turned out to be the more accurate prior for dealing with our world, it's not very useful because it endorses hypotheses which we can't efficiently run. 

I mentioned that I think InfraBayes might fit the world better than Solomonoff. But what I actually predict more strongly is that if we compare time-bounded versions of both priors, time-bounded InfraBayes would do better thanks to its ability to articulate partial models.

I think it's also worth pointing out that the technical results of InfraBayes do in fact address a notion of usefulness: part of the point of InfraBayes is that it translates to decision-making learning guarantees (eg, guarantees about the performance of RL agents) better than Bayesian theories do. Namely, if there is a partial model such that the agent would achieve nontrivial reward if it believed it, then the agent will eventually do at least that well. So, to succeed, InfraBayes relies on an assumption about the world -- that there is a useful partial model. (This is the analog of the Solomonoff induction assumption that there exists a best computable model of the world.)

So although it wasn't what I was originally thinking, it would also be reasonable to interpret "usefully describe" as "describe in a way which gives nontrivial reward bounds". I would be happy to stand by this interpretation as well: as an assumption about the real world, I'm happy to assert that there are usually going to be partial models which (are accurate and) give good reward bounds.

What I Think You Should Think

I think you should think that it's plausible we will have learning-theoretic ideas which will apply directly to objects of concern, in the sense that, under some plausible assumptions about the world, we can argue for a learning-theoretic guarantee for some system we can describe, which theoretically addresses some alignment concern.

I don't want to strongly argue that you should think this will be competitive with NNs or anything like that. Obviously I prefer worlds where that's true, but I am not trying to argue that. Even if in some sense InfraBayes (or some other theory) turns out to explain the success of NNs, that does not actually imply it'll give rise to something competitive with NNs.

I'm wondering if that's a crux for your interest. Honestly, I don't really understand what's going on behind this remark:

My central complaint about existing theoretical work is that it doesn't seem to be trying to explain why neural nets learn good programs that generalize well, even when they have enough parameters to overfit and can fit a randomly labeled dataset. It seems like you need to make some assumption about the real world (i.e. an assumption about your dataset, or the training process that generated it), which people seem loath to do.

Why is this your central complaint about existing theoretical work? My central complaint is that pre-existing learning theory didn't give us what we need to slot into a working alignment argument. In your presentation you listed some of those complaints, too. This seems more important to me than whether we can fully explain the success of large NNs.

My original interpretation of your remark was that you wanted to argue "learning theory makes bad assumptions about the world. To make strong arguments for alignment, we need to make more realistic assumptions. But these more realistic assumptions are necessarily of an empirical, non-theoretic nature." But I think InfraBayes in fact gets us closer to assumptions that are (a) realistic and (b) suited to arguments we want to make about alignment.

In other words, I had thought that you had (quite reasonably!) given up on learning theory because its results didn't seem relevant. I had hoped to rekindle your interest by pointing out that we can now do much better than 90s-era learning theory, in ways that seem relevant for EG objective robustness.

My personal theory about large NNs is that they act as a mixture model. It would be surprising if I told you that some genetic algorithm found a billion-bit program that described the data perfectly and then generalized well. It would be much less surprising if I told you that this billion-bit program was actually a mixture model that had been initialized randomly and then tuned by the genetic algorithm. From a Bayesian perspective, I expect a large random mixture model which then gets tuned to eliminate sub-models which are just bad on the data to be a pretty good approximation of my posterior, and therefore, I expect it to generalize well.

But my beliefs about this don't seem too cruxy for my beliefs about what kind of learning theory will be useful for alignment.
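The mixture-model picture above can be made concrete with a deliberately tiny sketch. All of it is assumed for illustration: the "sub-models" are one-dimensional threshold classifiers, the ground-truth threshold of 0.42 is made up, and "tuning" is reduced to deleting every sub-model that makes a mistake on the training data (no genetic algorithm or neural net is involved). The surviving sub-models then vote, which is a crude stand-in for averaging over a posterior.

```python
import random

TRUE_THRESHOLD = 0.42  # hypothetical ground truth, unknown to the learner


def true_label(x):
    return int(x >= TRUE_THRESHOLD)


# Fixed training inputs 0.05, 0.15, ..., 0.95 paired with their true labels.
train = [((2 * i + 1) / 20, true_label((2 * i + 1) / 20)) for i in range(10)]

# "Randomly initialized mixture": 1000 threshold classifiers with random thresholds.
rng = random.Random(0)
components = [rng.random() for _ in range(1000)]


def predict(threshold, x):
    return int(x >= threshold)


# "Tuning": discard every component that makes any mistake on the training data.
survivors = [t for t in components if all(predict(t, x) == y for x, y in train)]


def mixture_predict(x):
    """Majority vote over the surviving components."""
    votes = sum(predict(t, x) for t in survivors)
    return int(2 * votes >= len(survivors))


# Held-out inputs on a fine grid that never coincides with a training input.
test_inputs = [(i + 0.5) / 100 for i in range(100)]
accuracy = sum(mixture_predict(x) == true_label(x) for x in test_inputs) / len(test_inputs)

print(len(survivors), accuracy)
```

Every survivor has a threshold between the training points 0.35 and 0.45, so the vote can only err on held-out inputs inside that gap. Whether large NNs behave anything like this is exactly the part that remains a personal theory.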

Why is this your central complaint about existing theoretical work?

Sorry, I meant that that was my central complaint about existing theoretical work that is trying to explain neural net generalization. (I was mostly thinking of work outside of the alignment community.) I wasn't trying to make a claim about all theoretical work.

It's my central complaint because we ~know that such an assumption is necessary (since the same neural net that generalizes well on real MNIST can also memorize a randomly labeled MNIST where it will obviously fail to generalize).
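A minimal sketch of the same phenomenon, with loud caveats: this is not MNIST and not a neural net. It swaps in a 1-nearest-neighbor lookup (a pure memorizer) on synthetic 2D points, and the "structured" labeling rule is an assumption chosen for illustration. The point it shows is only that one and the same learner fits both datasets perfectly, so whatever separates the two outcomes has to come from the data.

```python
import random


def nearest_neighbor_predict(train, x):
    """Return the label of the training point closest to x (pure memorization)."""
    closest = min(train, key=lambda item: (item[0][0] - x[0]) ** 2 + (item[0][1] - x[1]) ** 2)
    return closest[1]


def accuracy(train, test):
    return sum(nearest_neighbor_predict(train, x) == y for x, y in test) / len(test)


rng = random.Random(0)


def sample_points(n):
    return [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(n)]


def structured_label(p):
    """Hypothetical regularity in the data: the label depends only on the sign of the first coordinate."""
    return int(p[0] > 0)


train_x, test_x = sample_points(200), sample_points(200)

structured_train = [(p, structured_label(p)) for p in train_x]
structured_test = [(p, structured_label(p)) for p in test_x]

# Same inputs, but labels drawn independently at random.
random_train = [(p, rng.randint(0, 1)) for p in train_x]
random_test = [(p, rng.randint(0, 1)) for p in test_x]

results = {
    "structured_train": accuracy(structured_train, structured_train),
    "structured_test": accuracy(structured_train, structured_test),
    "random_train": accuracy(random_train, random_train),
    "random_test": accuracy(random_train, random_test),
}
print(results)
```

Both training accuracies are perfect because each training point is its own nearest neighbor. Only the structured labels carry over to held-out points, so any bound that explains the good case has to assume something about the data that rules out the random case.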

InfraBayes answers this question by observing that although we can't easily use Solomonoff-like models of the whole universe, there are many patterns we can take advantage of which can be articulated with partial models. 

I feel pretty convinced by this :) In particular the assumption on the real world could be something like "there exists a partial model that describes the real world well enough that we can prove a regret bound that is not vacuous" or something like that. And I agree this seems like a reasonable assumption.

Even if in some sense InfraBayes (or some other theory) turns out to explain the success of NNs, that does not actually imply it'll give rise to something competitive with NNs.

Tbc I would see this as a success.

In other words, I had thought that you had (quite reasonably!) given up on learning theory because its results didn't seem relevant. I had hoped to rekindle your interest by pointing out that we can now do much better than 90s-era learning theory, in ways that seem relevant for EG objective robustness.

I am interested! I listed it as one of the topics I saw as allowing us to make claims about objective robustness. I'm just saying that the current work doesn't seem to be making much progress (I agree now though that InfraBayes is plausibly on a path where it could eventually help).

It would be surprising if I told you that some genetic algorithm found a billion-bit program that described the data perfectly and then generalized well. It would be much less surprising if I told you that this billion-bit program was actually a mixture model that had been initialized randomly and then tuned by the genetic algorithm.

Fwiw I don't feel the force of this intuition, they seem about equally surprising (but I agree with you that it doesn't seem cruxy).

Great, I feel pretty resolved about this conversation now.