Seems fair. I'm similarly conflicted. In truth, both the generalization-focused path and the objective-focused path look a bit doomed to me.
Great, I feel pretty resolved about this conversation now.
I would further add that looking for difficulties created by the simplification seems very intellectually productive. (Solving "embedded agency problems" seems to genuinely allow you to do new things, rather than just soothing philosophical worries.) But yeah, I would agree that if we're defining mesa-objective anyway, we're already in the business of assuming some agent/environment boundary.
(see the unidentifiability in IRL paper)
Ah, I wasn't aware of this!
Btw, if you're aware of any counterpoints to this — in particular anything like a clearly worked-out counterexample showing that one can't carve up a world, or recover a consistent utility function through this sort of process — please let me know. I'm directly working on a generalization of this problem at the moment, and anything like that could significantly accelerate my execution.
I'm not sure what would constitute a clearly-worked counterexample. To me, a high reliance on an agent/world boundary constitutes a "non-naturalistic" assumption, which simply makes me think a framework is more artificial/fragile.
For example, AIXI assumes a hard boundary between agent and environment. One manifestation of this assumption is how AIXI doesn't predict its own future actions the way it predicts everything else, and instead, must explicitly plan its own future actions. This is necessary because AIXI is not computable, so treating the future self as part of the environment (and predicting it with the same predictive capabilities as usual) would violate the assumption of a computable environment. But this is unfortunate for a few reasons. First, it forces AIXI to have an arbitrary finite planning horizon, which is weird for something that is supposed to represent unbounded intelligence. Second, there is no reason to carry this sort of thing over to finite, computable agents; so it weakens the generality of the model, by introducing a design detail that's very dependent on the specific infinite setting.
Another example would be game-theoretic reasoning. Suppose I am concerned about cooperative behavior in deployed AI systems. I might work on something like the equilibrium selection problem in game theory, looking for rationality concepts which can select cooperative equilibria where they exist. However, this kind of work will typically treat a "game" as something which inherently comes with a pointer to the other agents. This limits the real-world applicability of such results, because to apply it to real AI systems, those systems would need "agent pointers" as well. This is a difficult engineering problem (creating an AI system which identifies "agents" in its environment); and even assuming away the engineering challenges, there are serious philosophical difficulties (what really counts as an "agent"?).
We could try to tackle those difficulties, but my assumption will tend to be that it'll result in fairly brittle abstractions with weird failure modes.
Instead, I would advocate for Pavlov-like strategies which do not depend on actually identifying "agents" in order to have cooperative properties. I expect these to be more robust and present fewer technical challenges.
Of course, this general heuristic may not turn out to apply in the specific case we are discussing. If you control the training process, then, for the duration of training, you control the agent and the environment, and these concepts seem unproblematic. However, it does seem unrealistic to really check every environment; so, it seems like to establish strong guarantees, you'd need to do worst-case reasoning over arbitrary environments, rather than checking environments in detail. This is how I was mainly interpreting jbkjr; perturbation sets could be a way to make things more feasible (at a cost).
Right, exactly. (I should probably have just referred to that, but I was trying to avoid reference-dumping.)
I pretty strongly endorse the new diagram with the pseudo-equivalences, with one caveat (much the same comment as on your last post)... I think it's a mistake to think of only mesa-optimizers as having "intent" or being "goal-oriented" unless we start to be more inclusive about what we mean by "mesa-optimizer" and "mesa-objective." I don't think those terms as defined in RFLO actually capture humans, but I definitely want to say that we're "goal-oriented" and have "intent."But the graph structure makes perfect sense, I just am doing the mental substitution of "intent alignment means 'what the model is actually trying to do' is aligned with 'what we want it to do'." (Similar for inner robustness.)
I pretty strongly endorse the new diagram with the pseudo-equivalences, with one caveat (much the same comment as on your last post)... I think it's a mistake to think of only mesa-optimizers as having "intent" or being "goal-oriented" unless we start to be more inclusive about what we mean by "mesa-optimizer" and "mesa-objective." I don't think those terms as defined in RFLO actually capture humans, but I definitely want to say that we're "goal-oriented" and have "intent."
But the graph structure makes perfect sense, I just am doing the mental substitution of "intent alignment means 'what the model is actually trying to do' is aligned with 'what we want it to do'." (Similar for inner robustness.)
I too am a fan of broadening this a bit, but I am not sure how to.
I didn't really take the time to try and define "mesa-objective" here. My definition would be something like this: if we took long enough, we could point to places in the big NN (or whatever) which represent goal content, similarly to how we can point to reward systems (/ motivation systems) in the human brain. Messing with these would change the apparent objective of the NN, much like messing with human motivation centers.
I agree with your point about using "does this definition include humans" as a filter, and I think it would be easy to mess that up (and I wasn't thinking about it explicitly until you raised the point).
However, I think possibly you want a very behavioral definition of mesa-objective. If that's true, I wonder if you should just identify with the generalization-focused path instead. After all, one of the main differences between the two paths is that the generalization-focused path uses behavioral definitions, while the objective-focused path assumes some kind of explicit representation of goal content within a system.
Maybe a very practical question about the diagram: is there a REASON for there to be no "sufficient together" linkage from "Intent Alignment" and "Robustness" up to "Behavioral Alignment"?
Leaning hard on my technical definitions:
Robustness: Performing well on the base objective in a wide range of circumstances.Intent Alignment: A model is intent-aligned if it has a mesa-objective, and that mesa-objective is aligned with humans. (Again, I don't want to get into exactly what "alignment" means.)
These two together do not quite imply behavioral alignment, because it's possible for a model to have a human-friendly mesa-objective but be super bad at achieving it, while being super good at achieving some other objective.
So, yes, there is a little bit of gear-grinding if we try to combine the two plans like that. They aren't quite the right thing to fit together.
It's like we have a magic vending machine that can give us anything, and we have a slip of paper with our careful wish, and we put the slip of paper in the coin slot.
That being said, if we had technology for achieving both intent alignment and robustness, I expect we'd be in a pretty good position! I think the main reason not to go after both is that we may possibly be able to get away with just one of the two paths.
I think there's another reason why factorization can be useful here, which is the articulation of sub-problems to try.
For example, in the process leading up to inventing logical induction, Scott came up with a bunch of smaller properties to try for. He invented systems which got desirable properties individually, then growing combinations of desirable properties, and finally, figured out how to get everything at once. However, logical induction doesn't have parts corresponding to those different subproblems.
It can be very useful to individually achieve, say, objective robustness, even if your solution doesn't fit with anyone else's solutions to any of the other sub-problems. It shows us a way to do it, which can inspire other ways to do it.
In other words: tackling the whole alignment problem at once sounds too hard. It's useful to split it up, even if our factorization doesn't guarantee that we can stick pieces back together to get a whole solution.
Though, yeah, it's obviously better if we can create a factorization of the sort you want.
I agree that we need a notion of "intent" that doesn't require a purely behavioral notion of a model's objectives, but I think it should also not be limited strictly to mesa-optimizers, which neither Rohin nor I expect to appear in practice. (Mesa-optimizers appear to me to be the formalization of the idea "what if ML systems, which by default are not well-described as EU maximizers, learned to be EU maximizers?" I suspect MIRI people have some unshared intuitions about why we might expect this, but I currently don't have a good reason to believe this.)
For myself, my reaction is "behavioral objectives also assume a system is well-described as EU maximizers". In either case, you're assuming that you can summarize a policy by a function it optimizes; the difference is whether you think the system itself thinks explicitly in those terms.
I haven't engaged that much with the anti-EU-theory stuff, but my experience so far is that it usually involves a pretty strict idea of what is supposed to fit EU theory, and often, misunderstandings of EU theory. I have my own complaints about EU theory, but they just don't resonate at all with other people's complaints, it seems.
For example, I don't put much stock in the idea of utility functions, but I endorse a form of EU theory which avoids them. Specifically, I believe in approximately coherent expectations: you assign expected values to events, and a large part of cognition is devoted to making these expectations as coherent as possible (updating them based on experience, propagating expectations of more distant events to nearer, etc). This is in contrast to keeping some centrally represented utility function, and devoting cognition to computing expectations for this utility function.
In this picture, there is no clear distinction between terminal values and instrumental values. Something is "more terminal" if you treat it as more fixed (you resolve contradictions by updating the other values), and "more instrumental" if its value is more changeable based on other things.
I want to be able to talk about how we can shape goals which may be messier, perhaps somewhat competing, internal representations or heuristics or proxies that determine behavior.
(Possibly you should consider my "approximately coherent expectations" idea)
They can't? Why not?
I meant to invoke a no-free-lunch type intuition; we can always construct worlds where some particular tool isn't useful.
My go-to would be "a world that checks what an InfraBayesian would expect, and does the opposite". This is enough for the narrow point I was trying to make (that InfraBayes does express some kind of regularity assumption about the world), but it's not very illustrative or compelling for my broader point (that InfraBayes plausibly addresses your concerns about learning theory). So I'll try to tell a better story.
I might be describing logically impossible (or at least uncomputable) worlds here, but here is my story:
Solomonoff Induction captures something important about the regularities we see in the universe, but it doesn't explain NN learning (or "ordinary human learning") very well, because NNs and humans mostly use very fast models which are clearly much smaller (in time-complexity and space-complexity) than the universe. (Solomonoff induction is closer to describing human science, which does use these very simple but time/space-complex models.)
So there's this remaining question of induction: why can we do induction in practice? (IE, with NNs and with nonscientific reasoning)
InfraBayes answers this question by observing that although we can't easily use Solomonoff-like models of the whole universe, there are many patterns we can take advantage of which can be articulated with partial models.
This didn't need to be the case. We could be in a universe in which you need to fully model the low-level dynamics in order to predict things well at all.
So, a regularity which InfraBayes takes advantage of is the fact that we see multi-scale phenomena -- that simple low-level rules often give rise to simple high-level behavior as well.
I say "maybe I'm describing logically impossible worlds" here because it is hard to imagine a world where you can construct a computer but where you don't see this kind of multi-level phenomena. Mathematics is full of partial-model-type regularities; so, this has to be a world where mathematics isn't relevant (or, where mathematics itself is different).
But Solomonoff induction alone doesn't give a reason to expect this sort of regularity. So, if you imagine a world being drawn from the Solomonoff prior vs a world being drawn from a similar InfraBayes prior, I think the InfraBayes prior might actually generate worlds more like the one we find ourselves in (ie, InfraBayes contains more information about the world).
(Although actually, I don't know how to "sample from an infrabayes prior"...)
Maybe the "usefully" part is doing a lot of work here -- can all worlds be described (perhaps not usefully) by partial models? If so, I think I have the same objection, since it doesn't seem like any of the technical results in InfraBayes depend on some notion of "usefulness".
Part of what I meant by "usefully describe" was to contrast runnable models from non-runnable models. EG, even if Solomonoff induction turned out to be the more accurate prior for dealing with our world, it's not very useful because it endorses hypotheses which we can't efficiently run.
I mentioned that I think InfraBayes might fit the world better than Solomonoff. But what I actually predict more strongly is that if we compare time-bounded versions of both priors, time-bounded InfraBayes would do better thanks to its ability to articulate partial models.
I think it's also worth pointing out that the technical results of InfraBayes do in fact address a notion of usefulness: part of the point of InfraBayes is that it translates to decision-making learning guarantees (eg, guarantees about the performance of RL agents) better than Bayesian theories do. Namely, if there is a partial model such that the agent would achieve nontrivial reward if it believed it, then the agent will eventually do at least that well. So, to succeed, InfraBayes relies on an assumption about the world -- that there is a useful partial model. (This is the analog of the Solomonoff induction assumption that there exists a best computable model of the world.)
So although it wasn't what I was originally thinking, it would also be reasonable to interpret "usefully describe" as "describe in a way which gives nontrivial reward bounds". I would be happy to stand by this interpretation as well: as an assumption about the real world, I'm happy to assert that there are usually going to be partial models which (are accurate and) give good reward bounds.
I think you should think that it's plausible we will have learning-theoretic ideas which will apply directly to objects of concern, in the sense of under some plausible assumptions about the world, we can argue a learning-theoretic guarantee for some system we can describe, which theoretically addresses some alignment concern.
I don't want to strongly argue that you should think this will be competitive with NNs or anything like that. Obviously I prefer worlds where that's true, but I am not trying to argue that. Even if in some sense InfraBayes (or some other theory) turns out to explain the success of NNs, that does not actually imply it'll give rise to something competitive with NNs.
I'm wondering if that's a crux for your interest. Honestly, I don't really understand what's going on behind this remark:
My central complaint about existing theoretical work is that it doesn't seem to be trying to explain why neural nets learn good programs that generalize well, even when they have enough parameters to overfit and can fit a randomly labeled dataset. It seems like you need to make some assumption about the real world (i.e. an assumption about your dataset, or the training process that generated it), which people seem loathe to do.
Why is this your central complaint about existing theoretical work? My central complaint is that pre-existing learning theory didn't give us what we need to slot into a working alignment argument. In your presentation you listed some of those complaints, too. This seems more important to me that whether we can fully explain the success of large NNs.
My original interpretation about your remark was that you wanted to argue "learning theory makes bad assumptions about the world. To make strong arguments for alignment, we need to make more realistic assumptions. But these more realistic assumptions are necessarily of an empirical, non-theoretic nature." But I think InfraBayes in fact gets us closer to assumptions that are (a) realistic and (b) suited to arguments we want to make about alignment.
In other words, I had thought that you had (quite reasonably!) given up on learning theory because its results didn't seem relevant. I had hoped to rekindle your interest by pointing out that we can now do much better than 90s-era learning theory, in ways that seem relevant for EG objective robustness.
My personal theory about large NNs is that they act as a mixture model. It would be surprising if I told you that some genetic algorithm found a billion-bit program that described the data perfectly and then generalized well. It would be much less surprising if I told you that this billion-bit program was actually a mixture model that had been initialized randomly and then tuned by the genetic algorithm. From a Bayesian perspective, I expect a large random mixture model which then gets tuned to eliminate sub-models which are just bad on the data to be a pretty good approximation of my posterior, and therefore, I expect it to generalize well.
But my beliefs about this don't seem too cruxy for my beliefs about what kind of learning theory will be useful for alignment.