Under my current understanding of the "general alignment" angle, a core argument goes like:
I don't necessarily think this is the best framing, and I don't necessarily agree with it (e.g. whether the agent ha... (read more)
Perhaps a good way to summarize all this is something like "qualitatively similar models probably work well for brains and neural networks". I agree to a large extent with that claim (though there was a time when I would have agreed much less), and I think that's the main thing you need for the rest of the post.
"Ways we understand" comes across as more general than that - e.g. we understand via experimentally probing physical neurons vs spectral clustering of a derivative matrix.
I agree that the phrasing could be better; any suggestions?
I actually think you could just drop that intro altogether, or move it later into the post. We do have pretty good evidence of modularity in the brain (as well as other biological systems) and in trained neural nets; it seems to be a pretty common property of large systems "evolved" by local optimization. And the rest of the post (as well as some of the other comments) does a good job of talking about some of that evidence. It's a good post, and I think the arguments later in the post are stronger ... (read more)
Our default expectation about large neural networks should be that we will understand them in roughly the same ways that we understand biological brains, except where we have specific reasons to think otherwise.
Why would that be our default expectation? We don't have direct access to all of the underlying parameters in the brain. We can't even simulate it yet, let alone take a gradient.
Two things I'd especially like to highlight in this post:
Fundamentally, structural constraints give us back some of the guarantees of the main epistemic strategies of Science and Engineering that get lost in alignment: we don’t have the technology yet, but we have some ideas of how it will work.
This is possibly the best one-sentence summary I've seen of how these sorts of theorems would be useful.
One corollary of recovering (some of) the usual science-and-engineering strategies is that selection theorems would open the door to a lot of empirical work on al... (read more)
Cool, looks good.
I think that's a reasonable summary as written. Two minor quibbles, which you are welcome to ignore:
Selection theorems are helpful because (1) they can provide additional assumptions that can help with learning values by observing human behavior
I agree with the literal content of this sentence, but I personally don't imagine limiting it to behavioral data. I expect embedding-relevant selection theorems, which would also open the door to using internal structure or low-level dynamics of the brain to learn values (and human models, precision of approximation... (read more)
The biggest piece (IMO) would be figuring out key properties of human values. If we look at e.g. your sequence on value learning, the main takeaway of the section on ambitious value learning is "we would need more assumptions". (I would also argue we need different assumptions, because some of the currently standard assumptions are wrong - like utility functions.)
That's one thing selection theorems offer: a well-grounded basis for new assumptions for ambitious value learning. (And, as an added bonus, directly bringing selection into the picture means we al... (read more)
A few comments...
Selection theorems are helpful because they tell us likely properties of the agents we build.
What are selection theorems helpful for? Three possible areas (not necessarily comprehensive):
Of these, I expect the first to be most important, followed by the last, although this depends on the relative ... (read more)
The problem with that sort of approach is that the system (i.e. agent) being modeled is not necessarily going to play along with whatever desiderata we want. We can't just be like "I want an interface which does X"; if X is not a natural fit for the system, then what pops out will be very misleading/confusing/antihelpful.
An oversimplified example: suppose I have some predictive model, and I want an interface which gives me a point estimate and confidence interval/region rather than a full distribution. That only works well if the distribution isn't multimo... (read more)
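To make the multimodal failure concrete, here is a minimal sketch. Everything in it is an illustrative assumption: the bimodal mixture, the `summarize` helper, and the numbers are made up and do not come from any real model. It shows a "point estimate plus interval" interface reporting a mean that sits where the distribution has essentially no mass.

```python
import numpy as np

def summarize(samples):
    """Return (mean, low, high), where [low, high] is the central 95% interval."""
    mean = float(np.mean(samples))
    low, high = np.percentile(samples, [2.5, 97.5])
    return mean, float(low), float(high)

rng = np.random.default_rng(0)
# Hypothetical predictive distribution: equal mixture of N(-5, 0.5^2) and N(+5, 0.5^2).
samples = np.concatenate([
    rng.normal(-5.0, 0.5, 5000),
    rng.normal(5.0, 0.5, 5000),
])

mean, low, high = summarize(samples)
# Fraction of samples that land within 1 unit of the reported point estimate.
near_mean = float(np.mean(np.abs(samples - mean) < 1.0))

print("point estimate:", mean)
print("95% interval:", (low, high))
print("fraction of mass near the point estimate:", near_mean)
```

The interval spans both modes and the point estimate lands in the gap between them, so the summary is technically valid but misleading about where the system actually ends up.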
Oh excellent, that's a perfect reference for one of the successor posts to this one. You guys do a much better job explaining what agent type signatures are and giving examples and classification, compared to my rather half-baked sketch here.
Basically, yes. Though I would add that narrowing down model choices in some legible way is a necessary step if, for instance, we want to be able to interface with our models in any other way than querying for probabilities over the low-level state of the system.
You want a model of humans to account for complicated, psychology-dependent limitations on what actions we consider taking. So: what process produced this complicated psychology? Natural selection. What data structures can represent that complicated psychology? That's a type signature question. Put the two together, and we have a selection-theorem-shaped question.
In the example with persons A and B: a set of selection theorems would offer a solid foundation for the type signature of human preferences. Most likely, person B would use whatever types the theo... (read more)
The #P-complete problem is to calculate the distribution of some variables in a Bayes net given some other variables in the Bayes net, without any particular restrictions on the net or on the variables chosen.
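For a sense of why the unrestricted problem is hard, here is a rough sketch of the brute-force approach: sum the joint distribution over every assignment consistent with the evidence. The three-variable net, its probabilities, and the names `NET`, `joint_prob`, and `query` are all hypothetical choices for illustration. The loop visits 2^n assignments, which is the cost that general-purpose exact inference cannot be expected to avoid in the worst case.

```python
from itertools import product

# Hypothetical toy net: rain -> sprinkler, (rain, sprinkler) -> wet.
# Each entry maps a variable to (parents, function giving P(variable = 1 | parents)).
NET = {
    "rain": ((), lambda: 0.2),
    "sprinkler": (("rain",), lambda r: 0.01 if r else 0.4),
    "wet": (("rain", "sprinkler"), lambda r, s: [[0.0, 0.9], [0.8, 0.99]][r][s]),
}

def joint_prob(assignment):
    """Probability of one full assignment, as a product of local conditionals."""
    p = 1.0
    for var, (parents, cpt) in NET.items():
        p_one = cpt(*(assignment[q] for q in parents))
        p *= p_one if assignment[var] else 1.0 - p_one
    return p

def query(target, evidence):
    """P(target = 1 | evidence), by enumerating all 2^n joint assignments."""
    names = list(NET)
    numerator = denominator = 0.0
    for values in product([0, 1], repeat=len(names)):
        assignment = dict(zip(names, values))
        if any(assignment[k] != v for k, v in evidence.items()):
            continue  # inconsistent with the observed variables
        p = joint_prob(assignment)
        denominator += p
        if assignment[target]:
            numerator += p
    return numerator / denominator

print(query("rain", {"wet": 1}))
```

Structured nets (trees, low treewidth) admit much faster algorithms; the hardness claim is about the case with no such restrictions.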
Formal statement of the Telephone Theorem: We have a sequence of Markov blankets forming a Markov chain M_1 → M_2 → .... Then in the limit n → ∞, f_n(M_n) mediates the interaction between M_1 and M_n (i.e. the distribution factors according to M_1 → f_n(M_n) → M_n), for some f_n satisfying
with probabi... (read more)
More like: exponential family distributions are a universal property of information-at-a-distance in large complex systems. So, we can use exponential models without any loss of generality when working with information-at-a-distance in large complex systems.
That's what I hope to show, anyway.
Yup, that's the direction I want. If the distributions are exponential family, then that dramatically narrows down the space of distributions which need to be represented in order to represent abstractions in general. That means much simpler data structures - e.g. feature functions and Lagrange multipliers, rather than whole distributions.
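For reference, here is the standard exponential-family form that "feature functions and Lagrange multipliers" refers to. The symbols f_i, λ_i, μ_i and Z are my notation, not from the post. Maximizing entropy subject to expectation constraints on the features gives:

```latex
\max_{p}\; -\sum_x p(x)\,\log p(x)
\quad \text{subject to} \quad
\mathbb{E}_p[f_i(X)] = \mu_i \;\; (i = 1,\dots,k)
\qquad\Longrightarrow\qquad
p(x) = \frac{1}{Z(\lambda)} \exp\!\Big(\sum_{i=1}^{k} \lambda_i f_i(x)\Big),
\quad
Z(\lambda) = \sum_x \exp\!\Big(\sum_{i=1}^{k} \lambda_i f_i(x)\Big)
```

So the whole distribution is pinned down by the k feature functions f_i and the k multipliers λ_i, rather than by a table with one entry per value of x.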
Roughly speaking, the generalized KPD says that if the long-range correlations are low dimensional, then the whole distribution is exponential family (modulo a few "exceptional" variables). The theorem doesn't rule out the possibility of high-dimensional correlations, but it narrows down the possible forms a lot if we can rule out high-dimensional correlations some other way. That's what I'm hoping for: some simple/common conditions which limit the dimension of the long-range correlations, so that gKPD can apply.
This post says that those long range correla... (read more)
Claim: the core of the alignment problem is conserved across capability levels. If a particular issue only occurs at a particular capability level, then the issue is usually "not really about alignment" in some sense.
Roughly speaking, if I ask a system for something, and then the result is not really what I wanted, but the system "could have" given the result I wanted in some sense, then that's an alignment problem regardless of whether the system is a superintelligent AI or google maps. Whether it's a simple system with a bad user interface, or a giant ML... (read more)
We can still view these as travelling through many layers - the light waves have to propagate through many lightyears of mostly-empty space (and it could attenuate or hit things along the way). The photo has to last many years (and could randomly degrade a little or be destroyed at any moment along the way).
What makes it feel like "one hop" intuitively is that the information is basically-perfectly conserved at each "step" through spacetime, and there's a symmetry in how the information is represented.
I recommend that the title make it clearer that non-members can now submit alignment forum content for review, since this post is cross-posted on LW.
Here's a similarly-motivated model which I have found useful for the knowledge of economic agents. Rather than imagining that agents choose their actions as a function of their information (which is the usual picture), imagine that agents can choose their action for every world-state. For instance, if I'm a medieval smith, I might want to treat my iron differently depending on its composition.
In economic models, it's normal to include lots of constraints on agents' choices - like a budget constraint, or a constraint that our medieval smith cannot produce mo... (read more)
Instead of capturing the intuitions present in our confused understanding, John proposes to start with one of the applications and only focus on formalizing the concept for this specific purpose. [...] A later step is to attempt unification of the different formalization for the many applications.
Important clarification here: in order for this to work well, it is necessary to try multiple different use-cases and then unify. This is not a "start with one and then maybe get around to others in the indefinite future" sort of thing. I generally do not expect t... (read more)
For a while, I've thought that the strategy of "split the problem into a complete set of necessary sub-goals" is incomplete. It produces problem factorizations, but it's not sufficient to produce good problem factorizations - it usually won't cut reality at clean joints. That was my main concern with Evan's factorization, and it also applies to all of these, but I couldn't quite put my finger on what the problem was.
I think I can explain it now: when I say I want a factorization of alignment to "cut reality at the joints", I think what I mean is that each ... (read more)
The construction is correct.
Note that for M2, conceptually we don't need to modify it; we just need to use the original M2 but apply it only to the subcomponents of the new X-variable which correspond to the original X-variable. Alternatively, we can take the approach you do: construct M′2 which has a distribution over the new X, but "doesn't say anything" about the new components, i.e. it's just maxentropic over the new components. This is equivalent to ignoring the new components altogether.
It seems to me maybe the interesting thing is whether you can talk about a search algorithm in terms of particular kinds of abstractions rather than anything else, which if you go far enough around comes back to your position, but with more explained.
Good explanation, conceptually.
Not sure how all the details play out - in particular, my big question for any RL setup is "how does it avoid wireheading?". In this case, presumably there would have to be some kind of constraint on the reward-prediction model, so that it ends up associating the reward with the state of the environment rather than the state of the sensors.
I'm generally bullish on multiple objectives, and this post is another independent arrow pointing in that direction. Some other signs which I think point that way:
If the wheels are bouncing off each other, then that could be chaotic in the same way as billiard balls. But at least macroscopically, there's a crapton of damping in that simulation, so I find it more likely that the chaos is microscopic. That said, my intuition agrees with yours: this system doesn't seem like it should be chaotic...
Heads up, there's a lot of use of visuals - drawing, gesturing at things, etc - so a useful transcript may take some work.
A couple notes:
Another little update: the speed issue is solved for now by adding SymPy's Fortran wrappers to the derivative calculations; calculating the SVD isn't (yet?) the bottleneck. I can now quickly get results from 1,000+ step simulations of 100s of particles.
Unfortunately, even for the pretty stable configuration below, the values are indeed exploding. I need to go back through the program and double-check the logic, but I don't think it should be chaotic; if anything, I would expect the values to hit zero.
It might be that there's some kind of quasi-chaotic behaviou... (read more)
My own understanding of the flat minima idea is that it's a different thing. It's not really about noise, it's about gradient descent in general being a pretty shitty optimization method, which converges very poorly to sharp minima (more precisely, minima with a high condition number). (Continuous gradient flow circumvents that, but using step sizes small enough to circumvent the problem in practice would make GD prohibitively slow. The methods we actually use are not a good approximation of continuous flow, as I understand it.) If you want flat minima, th... (read more)
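As a small illustration of the condition-number point, here is a sketch of plain gradient descent on a two-dimensional quadratic. The quadratic, the step-size rule (one over the largest curvature), and the function name are my assumptions for the example, not anything from the discussion above. It counts how many steps are needed as the condition number grows.

```python
import numpy as np

def gd_steps_to_converge(kappa, tol=1e-6, max_steps=1_000_000):
    """Gradient descent on f(x) = 0.5 * (x[0]**2 + kappa * x[1]**2).

    The Hessian is diag(1, kappa), so its condition number is kappa.
    Returns the number of steps taken before |x| drops below tol.
    """
    curvatures = np.array([1.0, float(kappa)])
    step = 1.0 / kappa  # assumed rule: inverse of the largest curvature, which keeps GD stable
    x = np.array([1.0, 1.0])
    for n in range(max_steps):
        if np.linalg.norm(x) < tol:
            return n
        x = x - step * curvatures * x  # gradient of f is curvatures * x
    return max_steps

steps = {kappa: gd_steps_to_converge(kappa) for kappa in (1, 10, 1000)}
print(steps)
```

The step size is capped by the sharpest direction, so progress along the flattest direction shrinks roughly in proportion to 1/kappa, and the step count grows accordingly.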
I'm still wrapping my head around this myself, so this comment is quite useful.
Here's a different way to set up the model, where the phenomenon is more obvious.
Rather than Brownian motion in a continuous space, think about a random walk in a discrete space. For simplicity, let's assume it's a 1D random walk (aka birth-death process) with no explicit bias (i.e. when the system leaves state k, it's equally likely to transition to k+1 or k−1). The rate λ_k at which the system leaves state k serves a role analogous to the... (read more)
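One standard fact about this kind of walk can be checked numerically; it is not necessarily the exact point the comment goes on to make. With unbiased jumps and a state-dependent leaving rate λ_k, the stationary distribution is proportional to 1/λ_k, so the system spends most of its time where it moves slowly. The sketch below assumes the states are arranged on a ring (to avoid choosing boundary conditions) and uses made-up rates.

```python
import numpy as np

def stationary_distribution(rates):
    """Stationary distribution of an unbiased continuous-time walk on a ring.

    State k is left at rate rates[k], jumping to k+1 or k-1 with equal probability.
    """
    n = len(rates)
    Q = np.zeros((n, n))  # generator matrix: Q[i, j] is the rate of jumping i -> j
    for k, lam in enumerate(rates):
        Q[k, k] = -lam
        Q[k, (k + 1) % n] += lam / 2
        Q[k, (k - 1) % n] += lam / 2
    # Solve pi @ Q = 0 together with sum(pi) = 1 as one least-squares system.
    A = np.vstack([Q.T, np.ones(n)])
    b = np.zeros(n + 1)
    b[-1] = 1.0
    pi, *_ = np.linalg.lstsq(A, b, rcond=None)
    return pi

rates = np.array([1.0, 2.0, 4.0, 8.0, 4.0, 2.0])  # hypothetical leaving rates
pi = stationary_distribution(rates)
predicted = (1.0 / rates) / np.sum(1.0 / rates)

print("numerical:", pi)
print("1/rate   :", predicted)
```

No transition is biased toward the slow states, yet the probability mass still concentrates there.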
does it represent a bias towards less variance over the different gradients one can sample at a given point?
This is a good summary.
I'm still some combination of confused and unconvinced about optimization-under-uncertainty. Some points:
Picture a linear approximation, like this:
The tangent space at point a is that whole line labelled "tangent".
The main difference between the tangent space and the space of neural-networks-for-which-the-weights-are-very-close is that the tangent space extrapolates the linear approximation indefinitely; it's not just limited to the region near the original point. (In practice, though, that difference does not actually matter much, at least for the problem at hand - we do stay close to the original point.)
The reason we want to talk about "the tangen... (read more)
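A minimal numerical sketch of the tangent-space picture, assuming a toy one-hidden-unit "network" (the `net`, `jacobian`, and `tangent_model` names and all numbers are hypothetical). The tangent model is the first-order expansion around the initial parameters, and it is defined for any parameter value, not just nearby ones; it is only accurate nearby.

```python
import numpy as np

def net(theta, x):
    """Hypothetical tiny network: theta = (w, v), output v * tanh(w * x)."""
    w, v = theta
    return v * np.tanh(w * x)

def jacobian(theta, x, eps=1e-6):
    """Central finite-difference derivative of the outputs w.r.t. each parameter."""
    columns = []
    for i in range(len(theta)):
        delta = np.zeros_like(theta)
        delta[i] = eps
        columns.append((net(theta + delta, x) - net(theta - delta, x)) / (2 * eps))
    return np.stack(columns, axis=1)  # shape: (num inputs, num parameters)

def tangent_model(theta, theta0, x):
    """Linear approximation around theta0; linear in theta everywhere."""
    return net(theta0, x) + jacobian(theta0, x) @ (theta - theta0)

x = np.linspace(-2.0, 2.0, 5)
theta0 = np.array([0.5, 1.0])
direction = np.array([1.0, -1.0])

near = theta0 + 0.01 * direction
far = theta0 + 3.0 * direction

near_error = float(np.max(np.abs(net(near, x) - tangent_model(near, theta0, x))))
far_error = float(np.max(np.abs(net(far, x) - tangent_model(far, theta0, x))))
print("error near theta0:", near_error)
print("error far from theta0:", far_error)
```

Note that the Jacobian is taken with respect to all the parameters at once, which matches the point that the expansion is over the whole parameter vector.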
I don’t yet understand this proposal. In what way do we decompose this parameter tangent space into "lottery tickets"? Are the lottery tickets the cross product of subnetworks and points in the parameter tangent space? The subnetworks alone? If the latter then how does this differ from the original lottery ticket hypothesis?
The tangent space version is meant to be a fairly large departure from the original LTH; subnetworks are no longer particularly relevant at all.
We can imagine a space of generalized-lottery-ticket-hypotheses, in which the common theme i... (read more)
I buy the "problems can be both" argument in principle. However, when a problem involves both, it seems like we have to solve the outer part of the problem (i.e. figure out what-we-even-want), and once that's solved, all that's left for inner alignment is imperfect-optimizer-exploitation. The reverse does not apply: we do not necessarily have to solve the inner alignment issue (other than the imperfect-optimizer-exploiting part) at all. I also think a version of this argument probably carries over even if we're thinking about optimization-under-uncertainty... (read more)
Trying to lay this disagreement out plainly:
According to you, the inner alignment problem should apply to well-defined optimization problems, meaning optimization problems which have been given all the pieces needed to score domain items. Within this frame, the only reasonable definition is "inner" = issues of imperfect search, "outer" = issues of objective (which can include the prior, the utility function, etc).
According to me/Evan, the inner alignment problem should apply to optimization under uncertainty, which is a notion of optimization where you don... (read more)
I'll make a case here that manipulation of imperfect internal search should be considered the inner alignment problem, and all the other things which look like inner alignment failures actually stem from outer alignment failure or non-mesa-optimizer-specific generalization failure.
Suppose Dr Nefarious is an agent in the environment who wants to acausally manipulate other agents' models. We have a model which knows of Dr Nefarious' existence, and we ask the model to predict what Dr Nefarious will do. At this point, we have already faile... (read more)
This is a great comment. I will have to think more about your overall point, but aside from that, you've made some really useful distinctions. I've been wondering if inner alignment should be defined separately from mesa-optimizer problems, and this seems like more evidence in that direction (i.e., the Dr Nefarious example is a mesa-optimization problem, but it's about outer alignment). Or maybe inner alignment just shouldn't be seen as the complement of outer alignment! Objective quality vs search quality is a nice dividing line, but it doesn't cluster together the problems people have been trying to cluster together.
Yeah, I don't think there was anything particularly special about the complexity measure they used, and I wouldn't be surprised if some other measures did as-well-or-better at predicting which functions fill large chunks of parameter space.
Minor note: I think "mappings with low Kolmogorov complexity occupy larger volumes" is an extremely misleading way to explain what the evidence points to here. (Yes, I know Chris used that language in his blog post, but he really should not have.) The intuition of "simple mappings occupy larger volumes" is an accurate summary, but the relevant complexity measure is very much not Kolmogorov complexity. It is a measure which does not account for every computable way of compressing things, only some specific ways of compressing things.
In a sense, the results ... (read more)
I still don’t know much about GANs, Gaussian processes, Bayesian neural nets, and convex optimization.
FWIW, I'd recommend investing the time to learn about convex optimization even if you don't think you need it yet. Unlike the other topics on this list, the ideas from convex optimization are relevant to a wide variety of things the name does not suggest. Some examples:
They wouldn't be random linear combinations, so the central limit theorem estimate wouldn't directly apply. E.g. this circuit transparency work basically ran PCA on activations. It's not immediately obvious to me what the right big-O estimate would be, but intuitively, I'd expect the PCA to pick out exactly those components dominated by change in activations - since those will be the components which involve large correlations in the activation patterns across data points (at least that's my intuition).
I think this claim is basically wrong:
And to the exten
Alright, I buy the argument on page 15 of the original NTK paper.
I'm still very skeptical of the interpretation of this as "NTK models can't learn features". In general, when someone proves some interesting result which seems to contradict some combination of empirical results, my default assumption is that the proven result is being interpreted incorrectly, so I have a high prior that that's what's happening here. In this case, it could be that e.g. the "features" relevant to things like transfer learning are not individual neuron activations - e.g. IIRC ... (read more)
Ok, that's at least a plausible argument, although there are some big loopholes. Main problem which jumps out to me: what happens after one step of backprop is not the relevant question. One step of backprop is not enough to solve a set of linear equations (i.e. to achieve perfect prediction on the training set); the relevant question is what happens after one step of Newton's method, or after enough steps of gradient descent to achieve convergence.
What would convince me more is an empirical result - i.e. looking at the internals of an actual NTK model, tr... (read more)
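To illustrate the distinction between one step of backprop and one step of Newton's method, here is a small sketch on a linear least-squares problem. The 2x2 system, the step-size choice, and the variable names are all made up for illustration; this is not an NTK experiment. One gradient step leaves a substantial residual, while one Newton step (or many gradient steps) solves the linear equations.

```python
import numpy as np

A = np.array([[3.0, 1.0], [1.0, 2.0]])  # illustrative linear system A @ w = b
b = np.array([1.0, 1.0])

def grad(w):
    """Gradient of the loss 0.5 * |A @ w - b|^2."""
    return A.T @ (A @ w - b)

def residual(w):
    return float(np.linalg.norm(A @ w - b))

hessian = A.T @ A  # constant, because the loss is quadratic
step = 1.0 / np.linalg.eigvalsh(hessian).max()  # assumed step size: 1 / largest curvature

w0 = np.zeros(2)
w_gd_one = w0 - step * grad(w0)                         # one gradient step
w_newton_one = w0 - np.linalg.solve(hessian, grad(w0))  # one Newton step

w_gd_many = w0.copy()
for _ in range(200):                                    # gradient descent run to convergence
    w_gd_many = w_gd_many - step * grad(w_gd_many)

print("residual after 1 GD step:     ", residual(w_gd_one))
print("residual after 1 Newton step: ", residual(w_newton_one))
print("residual after 200 GD steps:  ", residual(w_gd_many))
```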
During training on the 'old task', NTK stays in the 'tangent space' of the network's initialization. This means that, to first order, none of the functions/derivatives computed by the individual neurons change at all, only the output function does.
Eh? Why does this follow? Derivatives make sense; the derivatives staying approximately-constant is one of the assumptions underlying NTK to begin with. But the functions computed by individual neurons should be able to change for exactly the same reason the output function changes, assuming the network has more than one layer. What am I missing here?
Here's an intuition pump to consider: suppose our net is a complete multigraph: not only is there an edge between every pair of nodes, there are multiple edges with base-2-exponentially-spaced weights, so we can always pick out a subset of them to get any total weight we please between the two nodes. Clearly, masking can turn this into any circuit we please (with the same number of nodes). But it seems wrong to say that this initial circuit has anything useful in it at all.
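The "pick out a subset to get any total weight" step is just binary expansion. A quick sketch, assuming nonnegative target weights and parallel edges with weights 2^3 down to 2^-8 (the range, the function names, and the greedy rule are my choices for illustration):

```python
def mask_for_weight(target, exponents=range(3, -9, -1)):
    """Greedily choose which parallel edges (weights 2**e) to keep.

    The kept weights sum to `target`, up to the smallest available weight,
    provided 0 <= target < 16 for the default exponents.
    """
    remaining = target
    mask = {}
    for e in exponents:  # largest weight first
        keep = 2.0 ** e <= remaining
        mask[e] = keep
        if keep:
            remaining -= 2.0 ** e
    return mask

def total_weight(mask):
    """Effective weight between the two nodes after masking."""
    return sum(2.0 ** e for e, keep in mask.items() if keep)

mask = mask_for_weight(3.14159)
print(sorted(e for e, keep in mask.items() if keep), total_weight(mask))
```

All of the information about the resulting circuit lives in the mask; the unmasked multigraph is the same no matter which circuit we want, which is the sense in which it contains nothing useful on its own.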
Oh that's definitely interesting. Thanks for the link.
Yup, I agree that that quote says something which is probably true, given current evidence. I don't think "picking a winning lottery ticket" is a good analogy for what that implies, though; see this comment.
Not quite. The linear expansion isn't just over the parameters associated with one layer, it's over all the parameters in the whole net.