All of johnswentworth's Comments + Replies

Reward Is Not Enough

Good explanation, conceptually.

Not sure how all the details play out - in particular, my big question for any RL setup is "how does it avoid wireheading?". In this case, presumably there would have to be some kind of constraint on the reward-prediction model, so that it ends up associating the reward with the state of the environment rather than the state of the sensors.

1Steve Byrnes6dUm, unreliably, at least by default. Like, some humans are hedonists, others aren't. I think there's a "hardcoded" credit assignment algorithm. When there's a reward prediction error, that algorithm primarily increments the reward-prediction / value associated with whatever stuff in the world model became newly active maybe half a second earlier. And maybe to a lesser extent, it also increments the reward-prediction / value associated with anything else you were thinking about at the time. (I'm not sure of the gory details here.) Anyway, insofar as "the reward signal itself" is part of the world-model, it's possible that reward-prediction / value will wind up attached to that concept. And then that's a desire to wirehead. But it's not inevitable. Some of the relevant dynamics are: * Timing—if credit goes mainly to signals that slightly precede the reward prediction error, then the reward signal itself is not a great fit. * Explaining away—once you have a way to accurately predict some set of reward signals, it makes the reward prediction errors go away, so the credit assignment algorithm stops running for those signals. So the first good reward-predicting model gets to stick around by default. Example: we learn early in life that the "eating candy" concept predicts certain reward signals, and then we get older and learn that the "certain neural signals in my brain" concept predicts those same reward signals too. But just learning that fact doesn't automatically translate into "I really want those certain neural signals in my brain". Only the credit assignment algorithm can make a thought appealing, and if the rewards are already being predicted then the credit assignment algorithm is inactive. (This is kinda like the behaviorism concept of blocking [].) * There may be some kind of bias to assign credit to predictive models that are simple functions of sensory inputs, when such
Reward Is Not Enough

Nice post!

I'm generally bullish on multiple objectives, and this post is another independent arrow pointing in that direction. Some other signs which I think point that way:

  • The argument from Why Subagents?. This is about utility maximizers rather than reward maximizers, but it points in a similar qualitative direction. Summary: once we allow internal state, utility-maximizers are not the only inexploitable systems; markets/committees of utility-maximizers also work.
  • The argument from Fixing The Good Regulator Theorem. That post uses some incoming informatio
... (read more)
3Steve Byrnes6dThanks! I had totally forgotten about your subagents post. I've been thinking that they kinda blend together in model-based RL, or at least the kind of (brain-like) model-based RL AGI that I normally think about. See this comment [] and surrounding discussion. Basically, one way to do model-based RL is to have the agent create a predictive model of the reward and then judge plans based on their tendency to maximize "the reward as currently understood by my predictive model". Then "the reward as currently understood by my predictive model" is basically a utility function. But at the same time, there's a separate subroutine that edits the reward prediction model (≈ utility function) to ever more closely approximate the true reward function (by some learning algorithm, presumably involving reward prediction errors). In other words: At any given time, the part of the agent that's making plans and taking actions looks like a utility maximizer. But if you lump together that part plus the subroutine that keeps editing the reward prediction model to better approximate the real reward signal, then that whole system is a reward-maximizing RL agent. Please tell me if that makes any sense or not; I've been planning to write pretty much exactly this comment (but with a diagram) into a short post.
Testing The Natural Abstraction Hypothesis: Project Intro

If the wheels are bouncing off each other, then that could be chaotic in the same way as billiard balls. But at least macroscopically, there's a crapton of damping in that simulation, so I find it more likely that the chaos is microscopic. But also my intuition agrees with yours, this system doesn't seem like it should be chaotic...

Abstraction Talk

Heads up, there's a lot of use of visuals - drawing, gesturing at things, etc - so a useful transcript may take some work.

1Ben Pace1moYeah, we can have a try and see whether it ends up being worth publishing.
Testing The Natural Abstraction Hypothesis: Project Intro


A couple notes:

  • Make sure to check that the values in the jacobian aren't exploding - i.e. there's not values like 1e30 or 1e200 or anything like that. Exponentially large values in the jacobian probably mean the system is chaotic.
  • If you want to avoid explicitly computing the jacobian, write a method which takes in a (constant) vector  and uses backpropagation to return . This is the same as the time-0-to-time-t jacobian dotted with , but it operates on size-n vectors rather than n-by-n jacobian matrices, so should be a l
... (read more)

Another little update, speed issue solved for now by adding SymPy's fortran wrappers to the derivative calculations - calculating the SVD isn't (yet?) the bottleneck. Can now quickly get results from 1,000+ step simulations of 100s of particles. 

Unfortunately, even for the pretty stable configuration below, the values are indeed exploding. I need to go back through the program and double check the logic but I don't think it should be chaotic, if anything I would expect the values to hit zero.

It might be that there's some kind of quasi-chaotic behaviou... (read more)

SGD's Bias

My own understanding of the flat minima idea is that it's a different thing. It's not really about noise, it's about gradient descent in general being a pretty shitty optimization method, which converges very poorly to sharp minima (more precisely, minima with a high condition number). (Continuous gradient flow circumvents that, but using step sizes small enough to circumvent the problem in practice would make GD prohibitively slow. The methods we actually use are not a good approximation of continuous flow, as I understand it.) If you want flat minima, th... (read more)

SGD's Bias

I'm still wrapping my head around this myself, so this comment is quite useful.

Here's a different way to set up the model, where the phenomenon is more obvious.

Rather than Brownian motion in a continuous space, think about a random walk in a discrete space. For simplicity, let's assume it's a 1D random walk (aka birth-death process) with no explicit bias (i.e. when the system leaves state , it's equally likely to transition to  or ). The rate  at which the system leaves state  serves a role analogous to the... (read more)

3Steve Byrnes1moThat makes sense. Now it's coming back to me: you zoom your microscope into one tiny nm^3 cube of air. In a right-to-left temperature gradient you'll see systematically faster air molecules moving rightward and slower molecules moving leftward, because they're carrying the temperature from their last collision. Whereas in uniform temperature, there's "detailed balance" (just as many molecules going along a path vs going along the time-reversed version of that same path, and with the same speed distribution). Thinking about the diode-resistor thing more, I suspect it would be a waste-of-time nerd-sniping rabbit hole, especially because of the time-domain aspects (the fluctuations are faster on one side vs the other I think) which don't have any analogue in SGD. Sorry to have brought it up.
SGD's Bias

does it represent a bias towards less variance over the different gradients one can sample at a given point?

Yup, exactly.

Formal Inner Alignment, Prospectus

This is a good summary.

I'm still some combination of confused and unconvinced about optimization-under-uncertainty. Some points:

  • It feels like "optimization under uncertainty" is not quite the right name for the thing you're trying to point to with that phrase, and I think your explanations would make more sense if we had a better name for it.
  • The examples of optimization-under-uncertainty from your other comment do not really seem to be about uncertainty per se, at least not in the usual sense, whereas the Dr Nefarious example and maligness of the universal
... (read more)
2Abram Demski1moCould you elaborate on that? I do think that learning-normativity is more about outer alignment. However, some ideas might cross-apply. Well, it still seems like a good name to me, so I'm curious what you are thinking here. What name would communicate better? Again, I need more unpacking to be able to say much (or update much). Well, the optimization-under-uncertainty is an attempt to make a frame which can contain both, so this isn't necessarily a problem... but I am curious what feels non-tight about inner agency. I still agree with the hypothetical me making the opposite point ;p The problem is that certain things are being conflated, so both "uncertainty can't be separated from goals" and "uncertainty can be separated from goals" have true interpretations. (I have those interpretations clear in my head, but communication is hard.) OK, so. My sense of our remaining disagreement... We agree that the pointers/uncertainty could be factored (at least informally -- currently waiting on any formalism). You think "optimization under uncertainty" is doing something different, and I think it's doing something close. Specifically, I think "optimization under uncertainty" importantly is not necessarily best understood as the standard Bayesian thing where we (1) start with a utility function, (2) provide a prior, so that we can evaluate expected value (and 2.5, update on any evidence), (3) provide a search method, so that we solve the whole thing by searching for the highest-expectation element. Many examples of optimization-under-uncertainty strain this model. Probably the pointer/uncertainty model would do a better job in these cases. But, the Bayesian model is kind of the only one we have, so we can use it provisionally. And when we do so, the approximation of pointer-vs-uncertainty that comes out is: Pointer: The utility function. Uncertainty: The search plus the prior, which in practice can blend together into "inductive bias". This isn't perfect, by any me
Understanding the Lottery Ticket Hypothesis

Picture a linear approximation, like this:

The tangent space at point  is that whole line labelled "tangent".

The main difference between the tangent space and the space of neural-networks-for-which-the-weights-are-very-close is that the tangent space extrapolates the linear approximation indefinitely; it's not just limited to the region near the original point. (In practice, though, that difference does not actually matter much, at least for the problem at hand - we do stay close to the original point.)

The reason we want to talk about "the tangen... (read more)

Understanding the Lottery Ticket Hypothesis

I don’t yet understand this proposal. In what way do we decompose this parameter tangent space into "lottery tickets"? Are the lottery tickets the cross product of subnetworks and points in the parameter tangent space? The subnetworks alone? If the latter then how does this differ from the original lottery ticket hypothesis?

The tangent space version is meant to be a fairly large departure from the original LTH; subnetworks are no longer particularly relevant at all.

We can imagine a space of generalized-lottery-ticket-hypotheses, in which the common theme i... (read more)

2Daniel Kokotajlo1moI confess I don't really understand what a tangent space is, even after reading the wiki article on the subject. It sounds like it's something like this: Take a particular neural network. Consider the "space" of possible neural networks that are extremely similar to it, i.e. they have all the same parameters but the weights are slightly different, for some definition of "slightly." That's the tangent space. Is this correct? What am I missing?
Formal Inner Alignment, Prospectus

I buy the "problems can be both" argument in principle. However, when a problem involves both, it seems like we have to solve the outer part of the problem (i.e. figure out what-we-even-want), and once that's solved, all that's left for inner alignment is imperfect-optimizer-exploitation. The reverse does not apply: we do not necessarily have to solve the inner alignment issue (other than the imperfect-optimizer-exploiting part) at all. I also think a version of this argument probably carries over even if we're thinking about optimization-under-uncertainty... (read more)

Trying to lay this disagreement out plainly:

According to you, the inner alignment problem should apply to well-defined optimization problems, meaning optimization problems which have been given all the pieces needed to score domain items. Within this frame, the only reasonable definition is "inner" = issues of imperfect search, "outer" = issues of objective (which can include the prior, the utility function, etc).

According to me/Evan, the inner alignment problem should apply to optimization under uncertainty, which is a notion of optimization where you don... (read more)

4Abram Demski1moThe way I'm currently thinking of things, I would say the reverse also applies in this case. We can turn optimization-under-uncertainty into well-defined optimization by assuming a prior. The outer alignment problem (in your sense) involves getting the prior right. Getting the prior right is part of "figuring out what we want". But this is precisely the source of the inner alignment problems in the paul/evan sense: Paul was pointing out a previously neglected issue about the Solomonoff prior, and Evan is talking about inductive biases of machine learning algorithms (which is sort of like the combination of a prior and imperfect search). So both you and Evan and Paul are agreeing that there's this problem with the prior (/ inductive biases). It is distinct from other outer alignment problems (because we can, to a large extent, factor the problem of specifying an expected value calculation into the problem of specifying probabilities and the problem of specifying a value function / utility function / etc). Everyone would seem to agree that this part of the problem needs to be solved. The disagreement is just about whether to classify this part as "inner" and/or "outer". What is this problem like? Well, it's broadly a quality-of-prior problem, but it has a different character from other quality-of-prior problems. For the most part, the quality of priors can be understood by thinking about average error being low, or mistakes becoming infrequent, etc. However, here, this kind of thinking isn't sufficient: we are concerned with rare but catastrophic errors. Thinking about these things, we find ourselves thinking in terms of "agents inside the prior" (or agents being favored by the inductive biases). To what extent "agents in the prior" should be lumped together with "agents in imperfect search", I am not sure. But the term "inner optimizer" seems relevant. A good example of optimization-under-uncertainty that doesn't look like that (at least, not overtly) is most ap
Formal Inner Alignment, Prospectus

I'll make a case here that manipulation of imperfect internal search should be considered the inner alignment problem, and all the other things which look like inner alignment failures actually stem from outer alignment failure or non-mesa-optimizer-specific generalization failure.

Example: Dr Nefarious

Suppose Dr Nefarious is an agent in the environment who wants to acausally manipulate other agents' models. We have a model which knows of Dr Nefarious' existence, and we ask the model to predict what Dr Nefarious will do. At this point, we have already faile... (read more)

5Adam Shimi1moWhile I agree that outer objective, training data and prior should be considered together, I disagree that it makes the inner alignment problem dissolve except for manipulation of the search. In principle, if you could indeed ensure through a smart choice of these three parameters that there is only one global optimum, only "bad" (meaning high loss) local minima, and that your search process will always reach the global optimum, then I would agree that the inner alignment problem disappears. But answering "what do we even want?" at this level of precision seems basically impossible. I expect that it's pretty much equivalent to specifying exactly the result we want, which we are quite unable to do in general. So my perspective is that the inner alignment problem appears because of inherent limits into our outer alignment capabilities. And that in realistic settings where we cannot rule out multiple very good local minima, the sort of reasoning underpinning the inner alignment discussion is the best approach we have to address such problems. That being said, I'm not sure how this view interacts with yours or Evan's, or if this is a very standard use of the terms. But since that's part of the discussion Abram is pushing, here is how I use these terms.
3Steve Byrnes1moHm, I want to classify "defense against adversaries" as a separate category from both "inner alignment" and "outer alignment". The obvious example is: if an adversarial AGI hacks into my AGI and changes its goals, that's not any kind of alignment problem, it's a defense-against-adversaries problem. Then I would take that notion and extend it by saying "yes interacting with an adversary presents an attack surface, but also merely imagining an adversary presents an attack surface too". Well, at least in weird hypotheticals. I'm not convinced that this would really be a problem in practice, but I dunno, I haven't thought about it much. Anyway, I would propose that the procedure for defense against adversaries in general is: (1) shelter an AGI from adversaries early in training, until it's reasonably intelligent and aligned, and then (2) trust the AGI to defend itself. I'm not sure we can do any better than that. In particular, I imagine an intelligent and self-aware AGI that's aligned in trying to help me would deliberately avoid imagining an adversarial superintelligence that can acausally hijack its goals! That still leaves the issue of early training, when the AGI is not yet motivated to not imagine adversaries, or not yet able. So I would say: if it does imagine the adversary, and then its goals do get hijacked, then at that point I would say "OK yes now it's misaligned". (Just like if a real adversary is exploiting a normal security hole—I would say the AGI is aligned before the adversary exploits that hole, and misaligned after.) Then what? Well, presumably, we will need to have procedure that verifies alignment before we release the AGI from its training box. And that procedure would presumably be indifferent to how the AGI came to be misaligned. So I don't think that's really a special problem we need to think about.
7Abram Demski1moSo, I think I could write a much longer response to this (perhaps another post), but I'm more or less not persuaded that problems should be cut up the way you say. As I mentioned in my other reply, your argument that Dr. Nefarious problems shouldn't be classified as inner alignment is that they are apparently outer alignment. If inner alignment problems are roughly "the internal objective doesn't match the external objective" and outer alignment problems are roughly "the outer objective doesn't meet our needs/goals", then there's no reason why these have to be mutually exclusive categories. In particular, Dr. Nefarious problems can be both. But more importantly, I don't entirely buy your notion of "optimization". This is the part that would require a longer explanation to be a proper reply. But basically, I want to distinguish between "optimization" and "optimization under uncertainty". Optimization under uncertainty is not optimization -- that is, it is not optimization of the type you're describing, where you have a well-defined objective which you're simply feeding to a search. Given a prior, you can reduce optimization-under-uncertainty to plain optimization (if you can afford the probabilistic inference necessary to take the expectations, which often isn't the case). But that doesn't mean that you do, and anyway, I want to keep them as separate concepts even if one is often implemented by the other. Your notion of the inner alignment problem applies only to optimization. Evan's notion of inner alignment applies (only!) to optimization under uncertainty.

This is a great comment. I will have to think more about your overall point, but aside from that, you've made some really useful distinctions. I've been wondering if inner alignment should be defined separately from mesa-optimizer problems, and this seems like more evidence in that direction (ie, the dr nefarious example is a mesa-optimization problem, but it's about outer alignment). Or maybe inner alignment just shouldn't be seen as the compliment of outer alignment! Objective quality vs search quality is a nice dividing line, but, doesn't cluster together the problems people have been trying to cluster together.

Parsing Chris Mingard on Neural Networks

Yeah, I don't think there was anything particularly special about the complexity measure they used, and I wouldn't be surprised if some other measures did as-well-or-better at predicting which functions fill large chunks of parameter space.

Parsing Chris Mingard on Neural Networks

Minor note: I think "mappings with low Kolmogorov complexity occupy larger volumes" is an extremely misleading way to explain what the evidence points to here. (Yes, I know Chris used that language in his blog post, but he really should not have.) The intuition of "simple mappings occupy larger volumes" is an accurate summary, but the relevant complexity measure is very much not Kolmogorov complexity. It is a measure which does not account for every computable way of compressing things, only some specific ways of compressing things.

In a sense, the results ... (read more)

4Alex Flint2moThanks for this clarification John. But did Mingard et al show that there is some specific practical complexity measure that explains the size of the volumes occupied in parameter space better than alternative practical complexity measures? If so then think we would have uncovered an even more detailed understanding of which mappings occupy large volumes in parameter space, and since neural nets just generally work so well in the world, we could say that we know something about what kind of compression is relevant to our particular world. But if they just used some practical complexity measure as a rough proxy for Kolmogorov complexity then it leaves open the door for mappings occupying large volumes in parameter space to be simple according to many more ways of compressing things than were used in their particular complexity measure.
FAQ: Advice for AI Alignment Researchers

I still don’t know much about GANs, Gaussian processes, Bayesian neural nets, and convex optimization.

FWIW, I'd recommend investing the time to learn about convex optimization even if you don't think you need it yet. Unlike the other topics on this list, the ideas from convex optimization are relevant to a wide variety of things the name does not suggest. Some examples:

  • how to think about optimization in high dimensions (even for non-convex functions)
  • using constraints as a more-gearsy way of representing relevant information (as opposed to e.g. just wrappin
... (read more)
4Rohin Shah2moYeah that's definitely the one on the list that I think would be most useful. I may also be understating how much I know about it; I've picked up some over time, e.g. linear programming, minimax, some kinds of duality, mirror descent, Newton's method.
NTK/GP Models of Neural Nets Can't Learn Features

They wouldn't be random linear combinations, so the central limit theorem estimate wouldn't directly apply. E.g. this circuit transparency work basically ran PCA on activations. It's not immediately obvious to me what the right big-O estimate would be, but intuitively, I'd expect the PCA to pick out exactly those components dominated by change in activations - since those will be the components which involve large correlations in the activation patterns across data points (at least that's my intuition).

I think this claim is basically wrong:

And to the exten

... (read more)
5interstice2moHmm, so regarding the linear combinations, it's true that there are some linear combinations that will change by Θ(1) in the large-width limit -- just use the vector of partial derivatives of the output at some particular input, this sum will change by the amount that the output function moves during the regression. Indeed, I suspect(but don't have a proof) that these particular combinations will span the space of linear combinations that change non-trivially during training. I would dispute "we expect most linear combinations to change" though -- the CLT argument implies that we should expect almost all combinations to not appreciably change. Not sure what effect this would have on the PCA and still think it's plausible that it doesn't change at all(actually, I think Greg Yang states that it doesn't change in section 9 of his paper, haven't read that part super carefully though) So I think I was a bit careless in saying that the NTK can't do transfer learning at all -- a more exact statement might be "the NTK does the minimal amount of transfer learning possible". What I mean by this is, any learning algorithm can do transfer learning if the task we are 'transferring' to is sufficiently similar to the original task -- for instance, if it's just the exact same task but with a different data sample. I claim that the 'transfer learning' the NTK does is of this sort. As you say, since the tangent kernel doesn't change at all, the net effect is to move where the network starts in the tangent space. Disregarding convergence speed, the impact this has on generalization is determined by the values set by the old function on axes of the NTK outside of the span of the partial derivatives at the new function's data points. This means that, for the NTK to transfer anything from one task to another, it's not enough for the tasks to both feature, for instance, eyes. It's that the eyes have to correlate with the output in the exact same way in both tasks. Indeed, the transfer l
NTK/GP Models of Neural Nets Can't Learn Features

Alright, I buy the argument on page 15 of the original NTK paper.

I'm still very skeptical of the interpretation of this as "NTK models can't learn features". In general, when someone proves some interesting result which seems to contradict some combination of empirical results, my default assumption is that the proven result is being interpreted incorrectly, so I have a high prior that that's what's happening here. In this case, it could be that e.g. the "features" relevant to things like transfer learning are not individual neuron activations - e.g. IIRC ... (read more)

5interstice2moI don't think taking linear combinations will help, because adding terms to the linear combination will also increase the magnitude of the original activation vector -- e.g. if you add together N12 units, the magnitude of the sum of their original activations will with high probability be Θ(N14), dwarfing the O(1) change due to change in the activations. But regardless, it can't help with transfer learning at all, since the tangent kernel(which determines learning in this regime) doesn't change by definition. What empirical results do you think are being contradicted? As far as I can tell, the empirical results we have are 'NTK/GP have similar performance to neural nets on some, but not all, tasks'. I don't think transfer/feature learning is addressed at all. You might say these results are suggestive evidence that NTK/GP captures everything important about neural nets, but this is precisely what is being disputed with the transfer learning arguments. I can imagine doing an experiment where we find the 'empirical tangent kernel' of some finite neural net at initialization, solve the linear system, and then analyze the activations of the resulting network. But it's worth noting that this is not what is usually meant by 'NTK' -- that usually includes taking the infinite-width limit at the same time. And to the extent that we expect the activations to change at all, we no longer have reason to think that this linear system is a good approximation of SGD. That's what the above mathematical results mean -- the same mathematical analysis that implies that network training is like solving a linear system, also implies that the activations don't change at all.
NTK/GP Models of Neural Nets Can't Learn Features

Ok, that's at least a plausible argument, although there are some big loopholes. Main problem which jumps out to me: what happens after one step of backprop is not the relevant question. One step of backprop is not enough to solve a set of linear equations (i.e. to achieve perfect prediction on the training set); the relevant question is what happens after one step of Newton's method, or after enough steps of gradient descent to achieve convergence.

What would convince me more is an empirical result - i.e. looking at the internals of an actual NTK model, tr... (read more)

6interstice2moThe result that NTK does not learn features in the large N limit is not in dispute at all -- it's right there on page 15 of the original NTK paper [], and indeed holds after arbitrarily many steps of backprop. I don't think that there's really much room for loopholes in the math here. See Greg Yang's paper for a lengthy proof that this holds for all architectures. Also worth noting that when people 'take the NTK limit' they often don't initialize an actual net at all, they instead use analytical expressions for what the inner product of the gradients would be at N=infinity to compute the kernel directly.
NTK/GP Models of Neural Nets Can't Learn Features

During training on the 'old task', NTK stays in the 'tangent space' of the network's initialization. This means that, to first order, none of the functions/derivatives computed by the individual neurons change at all, only the output function does.

Eh? Why does this follow? Derivatives make sense; the derivatives staying approximately-constant is one of the assumptions underlying NTK to begin with. But the functions computed by individual neurons should be able to change for exactly the same reason the output function changes, assuming the network has more than one layer. What am I missing here?

5interstice2moThe asymmetry between the output function and the intermediate neuron functions comes from backprop -- from the fact that the gradients are backprop-ed through weight matrices with entries of magnitude O(N−12). So the gradient of the output w.r.t itself is obviously 1, then the gradient of the output w.r.t each neuron in the preceding layer is O(N−12), since you're just multiplying by a vector with those entries. Then by induction all other preceding layers' gradients are the sum of N random things of size O(1/N), and so are of size O(N−12) again. So taking a step of backprop will change the output function by O(1) but the intermediate functions by O(N−12), vanishing in the large-width limit. (This is kind of an oversimplification since it is possible to have changing intermediate functions while doing backprop, as mentioned in the linked paper. But this is the essence of why it's possible in some limits to move around using backprop without changing the intermediate neurons)
Updating the Lottery Ticket Hypothesis


Here's intuition pump to consider: suppose our net is a complete multigraph: not only is there an edge between every pair of nodes, there's multiple edges with base-2-exponentially-spaced weights, so we can always pick out a subset of them to get any total weight we please between the two nodes. Clearly, masking can turn this into any circuit we please (with the same number of nodes). But it seems wrong to say that this initial circuit has anything useful in it at all.

3Matthew "Vaniver" Graves2moThat seems right, but also reminds me of the point that you need to randomly initialize your neural nets for gradient descent to work (because otherwise the gradients everywhere are the same). Like, in the randomly initialized net, each edge is going to be part of many subcircuits, both good and bad, and the gradient is basically "what's your relative contribution to good subcircuits vs. bad subcircuits?"
Updating the Lottery Ticket Hypothesis

Oh that's definitely interesting. Thanks for the link.

Updating the Lottery Ticket Hypothesis

Yup, I agree that that quote says something which is probably true, given current evidence. I don't think "picking a winning lottery ticket" is a good way analogy for what that implies, though; see this comment.

6DanielFilan2moI don't know what the referent of 'that quote' is. If you mean the passage I quoted from the original lottery ticket hypothesis ("LTH") paper, then I highly recommend reading a follow-up paper [] which describes how and why it's wrong for large networks. The abstract of the paper I'm citing here: -------------------------------------------------------------------------------- Again, assuming "that" refers to the claim in the original LTH paper, I also don't think it's a good analogy. But by default I think that claim is what "the lottery ticket hypothesis" refers to, given that it's a widely cited paper that has spawned a large number of follow-up works.
Updating the Lottery Ticket Hypothesis

Not quite. The linear expansion isn't just over the parameters associated with one layer, it's over all the parameters in the whole net.

Updating the Lottery Ticket Hypothesis

Yeah, I agree that something more general than one neuron but less general than (or at least different from) pruning might be appropriate. I'm not particularly worried about where that line "should" be drawn a priori, because the tangent space indeed seems like the right place to draw the line empirically.

6Abram Demski2moWait... so: 1. The tangent-space hypothesis implies something close to "gd finds a solution if and only if there's already a dog detecting neuron" (for large networks, that is) -- specifically it seems to imply something pretty close to "there's already a feature", where "feature" means a linear combination of existing neurons within a single layer 2. gd in fact trains NNs to recognize dogs 3. Therefore, we're still in the territory of "there's already a dog detector" ...yeah?
Updating the Lottery Ticket Hypothesis

See my comment here. The paper you link (like others I've seen in this vein) requires pruning, which will change the functional behavior of the nodes themselves. As I currently understand it, it is consistent with not having any neuron which lights up to match most functions/circuits.

6Abram Demski2moAh, I should have read comments more carefully. I agree with your comment that the claim "I doubt that today’s neural networks already contain dog-recognizing subcircuits at initialization" is ambiguous -- "contains" can mean "under pruning". Obviously, this is an important distinction: if the would-be dog-recognizing circuit behaves extremely differently due to intersection with a lot of other circuits, it could be much harder to find. But why is "a single neuron lighting up" where you draw the line? It seems clear that at least some relaxation of that requirement is tenable. For example, if no one neuron lights up in the correct pattern, but there's a linear combination of neurons (before the output layer) which does, then it seems we're good to go: GD could find that pretty easily. I guess this is where the tangent space model comes in; if in practice (for large networks) we stay in the tangent space, then a linear combination of neurons is basically exactly as much as we can relax your requirement. But without the tangent-space hypothesis, it's unclear where to draw the line, and your claim that an existing neuron already behaving in the desired way is "what would be necessary for the lottery ticket intuition" isn't clear to me. (Is there a more obvious argument for this, according to you?)
Updating the Lottery Ticket Hypothesis

Sort of. They end up equivalent to a single Newton step, not a single gradient step (or at least that's what this model says). In general, a set of linear equations is not solved by one gradient step, but is solved by one Newton step. It generally takes many gradient steps to solve a set of linear equations.

(Caveat to this: if you directly attempt a Newton step on this sort of system, you'll probably get an error, because the system is underdetermined. Actually making Newton steps work for NN training would probably be a huge pain in the ass, since the underdetermination would cause numerical issues.)

4Adam Shimi2moBy Newton step, do you mean one step of Newton's method []?
Updating the Lottery Ticket Hypothesis

In hindsight, I probably should have explained this more carefully. "Today’s neural networks already contain dog-recognizing subcircuits at initialization" was not an accurate summary of exactly what I think is implausible.

Here's a more careful version of the claim:

  • I do not find it plausible that a random network contains a neuron which acts as a reliable dog-detector. This is the sense in which it's not plausible that networks contain dog-recognizing subcircuits at initialization. But this is what would be necessary for the "lottery ticket" intuition (i.e
... (read more)
3Matthew "Vaniver" Graves2moI don't think I agree, because of the many-to-many relationship between neurons and subcircuits. Or, like, I think the standard of 'reliability' for this is very low. I don't have a great explanation / picture for this intuition, and so probably I should refine the picture to make sure it's real before leaning on it too much? To be clear, I think I agree with your refinement as a more detailed picture of what's going on; I guess I just think you're overselling how wrong the naive version is?
Updating the Lottery Ticket Hypothesis

The problem is what we mean by e.g. "dog recognizing subcircuit". The simplest version would be something like "at initialization, there's already one neuron which lights up in response to dogs" or something like that. (And that's basically the version which would be needed in order for a gradient descent process to actually pick out that lottery ticket.) That's the version which I'd call implausible: function space is superexponentially large, circuit space is smaller but still superexponential, so no neural network is ever going to be large enough to hav... (read more)

7Abram Demski2moI was in a discussion yesterday that made it seem pretty plausible that you're wrong -- this paper [] suggests that the over-parameterization needed to ensure that some circuit is (approximately) present at the beginning is not that large. I haven't actually read the paper I'm referencing, but my understanding is that this argument doesn't work out because the number of possible circuits of size N is balanced by the high number of subgraphs in a graph of size M (where M is only logarithmically larger than N). That being said, I don't know whether "present at the beginning" is the same as "easily found by gradient descent".
4DanielFilan2moOne important thing to note here is that the LTH paper doesn't demonstrate that SGD "finds" a ticket: just that the subnetwork you get by training and pruning could be trained alone in isolation to higher accuracy. That doesn't mean that the weights in the original training are the same when the network is trained in isolation!
Updating the Lottery Ticket Hypothesis

This basically matches my current understanding. (Though I'm not strongly confident in my current understanding.) I believe the GP results are basically equivalent to this, but I haven't read up on the topic enough to be sure.

Computing Natural Abstractions: Linear Approximation

Note to self: the summary dimension seems basically constant as the neighborhood size increases. There's probably a generalization of Koopman-Pitman-Darmois which applies.

Computing Natural Abstractions: Linear Approximation

I ran this experiment about 2-3 weeks ago (after the chaos post but before the project intro). I pretty strongly expected this to work much earlier - e.g. back in December I gave a talk which had all the ideas in it.

Testing The Natural Abstraction Hypothesis: Project Intro

On the role of values: values clearly do play some role in determining which abstractions we use. An alien who observes Earth but does not care about anything on Earth's surface will likely not have a concept of trees, any more than an alien which has not observed Earth at all. Indifference has a similar effect to lack of data.

However, I expect that the space of abstractions is (approximately) discrete. A mind may use the tree-concept, or not use the tree-concept, but there is no natural abstraction arbitrarily-close-to-tree-but-not-the-same-as-tree. There... (read more)

Testing The Natural Abstraction Hypothesis: Project Intro

Also interested in helping on this - if there's modelling you'd want to outsource.

Here's one fairly-standalone project which I probably won't get to soon. It would be a fair bit of work, but also potentially very impressive in terms of both showing off technical skills and producing cool results.

Short somewhat-oversimplified version: take a finite-element model of some realistic objects. Backpropagate to compute the jacobian of final state variables with respect to initial state variables. Take a singular value decomposition of the jacobian. Hypothesis: th... (read more)

Been a while but I thought the idea was interesting and had a go at implementing it. Houdini was too much for my laptop, let alone my programming skills, but I found a simple particle simulation in pygame which shows the basics, can see below.

exponents of the Jacobian of a 5 particle, 200 step simulation, with groups of 3 and 2 connected by springs

Planned next step is to work on the run-time speed (even this took a couple of minutes run, calculating the frame-to-frame Jacobian is a pain, probably more than necessary) and then add some utilities for creatin... (read more)

Testing The Natural Abstraction Hypothesis: Project Intro

Re: dual use, I do have some thoughts on exactly what sort of capabilities would potentially come out of this.

The really interesting possibility is that we end up able to precisely specify high-level human concepts - a real-life language of the birds. The specifications would correctly capture what-we-actually-mean, so they wouldn't be prone to goodhart. That would mean, for instance, being able to formally specify "strawberry on a plate" in non-goodhartable way, so an AI optimizing for a strawberry on a plate would actually produce a strawberry on a plate... (read more)

How do we prepare for final crunch time?

Re: picking up new tools, skills and practice designing and building user interfaces, especially to complex or not-very-transparent systems, would be very-high-leverage if the tool-adoption step is rate-limiting.

How do we prepare for final crunch time?

Relevant topic of a future post: some of the ideas from Risks From Learned Optimization or the Improved Good Regulator Theorem offer insights into building effective institutions and developing flexible problem-solving capacity.

Rough intuitive idea: intelligence/agency are about generalizable problem-solving capability. How do you incentivize generalizable problem-solving capability? Ask the system to solve a wide variety of problems, or a problem general enough to encompass a wide variety.

If you want an organization to act agenty, then a useful technique ... (read more)

Transparency Trichotomy

This post seems to me to be beating around the bush. There's several different classes of transparency methods evaluated by several different proxy criteria, but this is all sort of tangential to the thing which actually matters: we do not understand what "understanding a model" means, at least not in a sufficiently-robust-and-legible way to confidently put optimization pressure on it.

For transparency via inspection, the problem is that we don't know what kind of "understanding" is required to rule out bad behavior. We can notice that some low-level featur... (read more)

5Mark Xu3moI agree it's sort of the same problem under the hood, but I think knowing how you're going to go from "understanding understanding" to producing an understandable model controls what type of understanding you're looking for. I also agree that this post makes ~0 progress on solving the "hard problem" of transparency, I just think it provides a potentially useful framing and creates a reference for me/others to link to in the future.
The Fusion Power Generator Scenario

One way in which the analogy breaks down: in the lever case, we have two levers right next to each other, and each does something we want - it's just easy to confuse the levers. A better analogy for AI might be: many levers and switches and dials have to be set to get the behavior we want, and mistakes in some of them matter while others don't, and we don't know which ones matter when. And sometimes people will figure out that a particular combination extends the flaps, so they'll say "do this to extend the flaps", except that when some other switch has th... (read more)

Alignment By Default

Yup, this is basically where that probability came from. It still feels about right.

Alignment By Default

This is a great explanation. I basically agree, and this is exactly why I expect alignment-by-default to most likely fail even conditional on the natural abstractions hypothesis holding up.

3Chris_Leong3moAlso, I have another strange idea that might increase the probability of this working. If you could temporarily remove proxies based on what people say, then this would seem to greatly increase the chance of it hitting the actual embedded representation of human values. Maybe identifying these proxies is easier than identifying the representation of "true human values"? I don't think it's likely to work, but thought I'd share anyway.
1Chris_Leong3moThanks! Is this why you put the probability as "10-20% chance of alignment by this path, assuming that the unsupervised system does end up with a simple embedding of human values"? Or have you updated your probabilities since writing this post?
HCH Speculation Post #2A

This is the best explanation I have seen yet of what seem to me to be the main problems with HCH. In particular, that scene from HPMOR is one that I've also thought about as a good analogue for HCH problems. (Though I do think the "humans are bad at things" issue is more probably important for HCH than the malicious memes problem; HCH is basically a giant bureaucracy, and the same shortcomings which make humans bad at giant bureaucracies will directly limit HCH.)

Behavioral Sufficient Statistics for Goal-Directedness

I'm working on writing it up properly, should have a post at some point.

EDIT: it's up.

Behavioral Sufficient Statistics for Goal-Directedness

I still feel like you're missing something important here.

For instance... in the explainability factor, you measure "the average deviation of  from the actions favored by the action-value function  of ", using the formula


. But why this particular formula? Why not take the log of  first, or use  in the denominator? Indeed, there's a strong argument to be made this formula is a bad choice: the value function  is... (read more)

4Adam Shimi3moTo people reading this thread: we had a private conversation with John (faster and easier), which resulted in me agreeing with him. The summary is that you can see the arguments made and constraints invoked as a set of equations, such that the adequate formalization is a solution of this set. But if the set has more than one solution (maybe a lot), then it's misleading to call that the solution. So I've been working these last few days at arguing for the properties (generalization, explainability, efficiency) in such a way that the corresponding set of equations only has one solution.
The case for aligning narrowly superhuman models

We can't mostly-win just by fine-tuning a language model to do moral discourse.

Uh... yeah, I agree with that statement, but I don't really see how it's relevant. If we tune a language model to do moral discourse, then won't it be tuned to talk about things like Terry Schiavo, which we just said was not that central? Presumably tuning a language model to talk about those sorts of questions would not make it any good at moral problems like "they said they want fusion power, but they probably also want it to not be turn-into-bomb-able".

Or are you using "moral... (read more)

Behavioral Sufficient Statistics for Goal-Directedness

I think you are very confused about the conceptual significance of a "sufficient statistic".

Let's start with the prototypical setup of a sufficient statistic. Suppose I have a bunch of IID variables  drawn from a maximum-entropy distribution with features  (i.e. the "true" distribution is maxentropic subject to a constraint on the expectation of ), BUT I don't know the parameters of the distribution (i.e. I don't know the expected value ). For instance, maybe I know that the variables are drawn from a normal... (read more)

3Adam Shimi3moThanks for the spot-on pushback! I do understand what a sufficient statistics is -- which probably means I'm even more guilty of what you're accusing me of. And I agree completely that I don't defend correctly that the statistics I provide are really sufficient. If I try to explain myself, what I want to say in this post is probably something like * Knowing these intuitive properties aboutπand the goals seems sufficient to express and address basically any question we have related to goals and goal-directedness. (in a very vague intuitive way that I can't really justify). * To think about that in a grounded way, here are formulas for each property that look like they capture these properties. * Now what's left to do is to attack the aforementioned questions about goals and goal-directedness with these statistics, and see if they're enough. (Which is the topic of the next few posts) Honestly, I don't think there's an argument to show these are literally sufficient statistics. Yet I still think staking the claim that they are is quite productive for further research. It gives concreteness to an exploration of goal-directedness, carving more grounded questions: * Given a question about goals and goal-directedness, are these properties enough to frame and study this question? If yes, then study it. If not, then study what's missing. * Are my formula adequate formalization of the intuitive properties? This post mostly focuses on the second aspect, and to be honest, not even in as much detail as one could go. Maybe that means this post shouldn't exist, and I should have waited to see if I could literally formalize every question about goals and goal-directedness. But posting it to gather feedback on whether these statistics makes sense to people, and if they feel like something's missing, seemed valuable. That being said, my mistake (and what caused your knee-jerk reaction) was to just say these are literally sufficient statistics inst
The case for aligning narrowly superhuman models

I think one argument running through a lot of the sequences is that the parts of "human values" which mostly determine whether AI is great or a disaster are not the sort of things humans usually think of as "moral questions". Like, these examples from your comment below:

Was it bad to pull the plug on Terry Schiavo? How much of your income should you give to charity? Is it okay to kiss your cousin twice removed? Is it a good future if all the humans are destructively copied to computers? Should we run human challenge trials for covid-19 vaccines?

If an AGI i... (read more)

1Charlie Steiner3moI'd say "If an AGI is hung up on these sorts of questions [i.e. the examples I gave of statements human 'moral experts' are going to disagree about], then we've already mostly-won" is an accurate correlation, but doesn't stand up to optimization pressure. We can't mostly-win just by fine-tuning a language model to do moral discourse. I'd guess you agree? Anyhow, my point was more: You said "you get what you can measure" is a problem because the fact of the matter for whether decisions are good or bad is hard to evaluate (therefore sandwiching is an interesting problem to practice on). I said "you get what you measure" is a problem because humans can disagree when their values are 'measured' without either of them being mistaken or defective (therefore sandwiching is a procrustean bed / wrong problem).
Open Problems with Myopia

(On reflection this comment is less kind than I'd like it to be, but I'm leaving it as-is because I think it is useful to record my knee-jerk reaction. It's still a good post; I apologize in advance for not being very nice.)

In theory, such an agent is safe because a human would only approve safe actions.

... wat.

Lol no.

Look, I understand that outer alignment is orthogonal to the problem this post is about, but like... say that. Don't just say that a very-obviously-unsafe thing is safe. (Unless this is in fact nonobvious, in which case I will retract this comment and give a proper explanation.)

3Charlie Steiner3moYou beat me to making this comment :P Except apparently I came here to make this comment about the changed version. "A human would only approve safe actions" is just a problem clause altogether. I understand how this seems reasonable for sub-human optimizers, but if you (now addressing Mark and Evan) think it has any particular safety properties for superhuman optimization pressure, the particulars of that might be interesting to nail down a bit better.
7Mark Xu3moYeah, you're right that it's obviously unsafe. The words "in theory" were meant to gesture at that, but it could be much better worded. Changed to "A prototypical example is a time-limited myopic approval-maximizing agent. In theory, such an agent has some desirable safety properties because a human would only approve safe actions (although we still would consider it unsafe)."
Load More