Reward Is Not Enough

Nice post!

I'm generally bullish on multiple objectives, and this post is another independent arrow pointing in that direction. Some other signs which I think point that way:

- The argument from Why Subagents?. This is about utility maximizers rather than reward maximizers, but it points in a similar qualitative direction. Summary: once we allow internal state, utility-maximizers are not the only inexploitable systems; markets/committees of utility-maximizers also work.
- The argument from Fixing The Good Regulator Theorem. That post uses some incoming informatio…

Thanks!
I had totally forgotten about your subagents post.
I've been thinking that they kinda blend together in model-based RL, or at least
the kind of (brain-like) model-based RL AGI that I normally think about. See
this comment
[https://www.lesswrong.com/posts/zzXawbXDwCZobwF9D/my-agi-threat-model-misaligned-model-based-rl-agent?commentId=fyHkc5yhzxS4ywyCd]
and surrounding discussion. Basically, one way to do model-based RL is to have
the agent create a predictive model of the reward and then judge plans based on
their tendency to maximize "the reward as currently understood by my predictive
model". Then "the reward as currently understood by my predictive model" is
basically a utility function. But at the same time, there's a separate
subroutine that edits the reward prediction model (≈ utility function) to ever
more closely approximate the true reward function (by some learning algorithm,
presumably involving reward prediction errors).
In other words: At any given time, the part of the agent that's making plans and
taking actions looks like a utility maximizer. But if you lump together that
part plus the subroutine that keeps editing the reward prediction model to
better approximate the real reward signal, then that whole system is a
reward-maximizing RL agent.
Please tell me if that makes any sense or not; I've been planning to write
pretty much exactly this comment (but with a diagram) into a short post.
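If it helps to make that concrete before the post exists, here's a minimal toy sketch of the two-part system described above (the action names, learning rate, and reward values are all assumptions of mine, purely for illustration): a planner that maximizes "the reward as currently understood by my predictive model", plus a separate subroutine that edits that model toward the true reward via prediction errors.

```python
import random

random.seed(0)  # deterministic toy run

TRUE_REWARD = {"a": 0.0, "b": 1.0, "c": 0.5}  # ground truth, unknown to the planner

class ModelBasedAgent:
    def __init__(self, actions, lr=0.5):
        self.predicted_reward = {a: 0.0 for a in actions}  # ≈ utility function
        self.lr = lr

    def plan(self):
        # The utility-maximizer part: judge plans by *predicted* reward.
        return max(self.predicted_reward, key=self.predicted_reward.get)

    def update_reward_model(self, action, observed_reward):
        # The separate subroutine: a reward-prediction-error update that
        # pulls the predicted reward toward the true reward signal.
        error = observed_reward - self.predicted_reward[action]
        self.predicted_reward[action] += self.lr * error

agent = ModelBasedAgent(["a", "b", "c"])
for _ in range(20):
    act = agent.plan()
    agent.update_reward_model(act, TRUE_REWARD[act])
    # a bit of exploration, so every prediction eventually gets corrected
    explore = random.choice(["a", "b", "c"])
    agent.update_reward_model(explore, TRUE_REWARD[explore])

# Lumping planner + editing subroutine together, the whole system ends
# up maximizing the true reward:
print(agent.plan())
```

At any instant, `plan()` alone looks like a utility maximizer over the current predicted-reward table; only the combined system behaves like a reward-maximizing RL agent.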

Testing The Natural Abstraction Hypothesis: Project Intro

If the wheels are bouncing off each other, then that could be chaotic in the same way as billiard balls. But at least macroscopically, there's a crapton of damping in that simulation, so I find it more likely that the chaos is microscopic. But my intuition also agrees with yours: this system doesn't seem like it should be chaotic...

Abstraction Talk

Heads up, there's a lot of use of visuals - drawing, gesturing at things, etc - so a useful transcript may take some work.

Yeah, we can have a try and see whether it ends up being worth publishing.

Testing The Natural Abstraction Hypothesis: Project Intro

Nice!

A couple notes:

- Make sure to check that the values in the Jacobian aren't exploding - i.e. there are no values like 1e30 or 1e200 or anything like that. Exponentially large values in the Jacobian probably mean the system is chaotic.
- If you want to avoid explicitly computing the Jacobian, write a method which takes in a (constant) vector v and uses backpropagation to return vᵀJ, where J is the time-0-to-time-t Jacobian. This is the same as that Jacobian dotted with v, but it operates on size-n vectors rather than n-by-n Jacobian matrices, so it should be a lot faster.
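A dependency-free sketch of that second bullet (the dynamics function and all names here are stand-ins I made up; the bullet suggests backprop/reverse mode, and a forward-mode directional finite difference is used below as the simplest stand-in with the same vector-in, vector-out shape):

```python
import numpy as np

def simulate(x0, n_steps):
    # Stand-in dynamics; replace with the actual particle simulation.
    x = x0
    for _ in range(n_steps):
        x = np.tanh(1.01 * x)
    return x

def jacobian_dot_v(x0, v, n_steps, eps=1e-6):
    """Approximate (d x_t / d x_0) @ v as a directional finite difference.

    Operates on size-n vectors only; the n-by-n Jacobian is never formed.
    (Reverse mode / backprop, as suggested above, gives the v^T J side.)
    """
    return (simulate(x0 + eps * v, n_steps) - simulate(x0, n_steps)) / eps

rng = np.random.default_rng(0)
n = 100
x0, v = rng.standard_normal(n), rng.standard_normal(n)
g = jacobian_dot_v(x0, v, n_steps=1000)

# First bullet's sanity check: exponentially large entries (1e30, 1e200,
# ...) in this product would suggest the system is chaotic.
print(np.max(np.abs(g)))
```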

Another little update, speed issue solved for now by adding SymPy's fortran wrappers to the derivative calculations - calculating the SVD isn't (yet?) the bottleneck. Can now quickly get results from 1,000+ step simulations of 100s of particles.
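For readers following along: the pattern here (derive symbolically once, then compile to fast numeric code) can be sketched with SymPy's lambdify; the actual fix above used SymPy's fortran wrappers (autowrap), and the toy potential below is my own stand-in, not the simulation's real interaction.

```python
import sympy as sp

x, y = sp.symbols("x y")
potential = sp.exp(-(x**2 + y**2)) * sp.cos(x * y)   # stand-in pair potential

# Compute the symbolic gradient once...
grad = [sp.diff(potential, v) for v in (x, y)]
# ...then compile it to a fast numeric callable, instead of slow
# symbolic substitution at every simulation step.
fast_grad = sp.lambdify((x, y), grad, modules="numpy")

print(fast_grad(0.3, -0.2))
```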

Unfortunately, even for the pretty stable configuration below, the values are indeed exploding. I need to go back through the program and double-check the logic, but I don't think it should be chaotic; if anything, I would expect the values to hit zero.

It might be that there's some kind of quasi-chaotic behaviou…

SGD's Bias

My own understanding of the flat minima idea is that it's a different thing. It's not really about noise, it's about gradient descent in general being a pretty shitty optimization method, which converges very poorly to sharp minima (more precisely, minima with a high condition number). (Continuous gradient flow circumvents that, but using step sizes small enough to circumvent the problem in practice would make GD prohibitively slow. The methods we actually use are not a good approximation of continuous flow, as I understand it.) If you want flat minima, th…

SGD's Bias

I'm still wrapping my head around this myself, so this comment is quite useful.

Here's a different way to set up the model, where the phenomenon is more obvious.

Rather than Brownian motion in a continuous space, think about a random walk in a discrete space. For simplicity, let's assume it's a 1D random walk (aka birth-death process) with no explicit bias (i.e. when the system leaves state n, it's equally likely to transition to n+1 or n-1). The rate at which the system leaves state n serves a role analogous to the…
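A quick simulation of that setup (the state count, rates, and step count are arbitrary choices of mine) shows the effect: with unbiased ±1 transitions but a state-dependent exit rate, the walker spends most of its time where the exit rate is low, analogous to SGD drifting toward low-noise regions.

```python
import random

random.seed(0)

K = 20  # states 0..K, reflecting boundaries

def rate(n):
    return 1.0 + 5.0 * n / K   # exit rate grows with the state index

occupancy = [0.0] * (K + 1)    # time spent in each state
n = K // 2
for _ in range(200_000):
    occupancy[n] += random.expovariate(rate(n))     # exponential dwell time
    n = min(K, max(0, n + random.choice([-1, 1])))  # unbiased jump

total = sum(occupancy)
low, high = occupancy[0] / total, occupancy[K] / total
# No directional bias in the jumps, yet time-averaged occupancy piles
# up where the exit rate (the "noise") is smallest:
print(low > high)
```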

That makes sense. Now it's coming back to me: you zoom your microscope into one
tiny nm^3 cube of air. In a right-to-left temperature gradient you'll see
systematically faster air molecules moving rightward and slower molecules moving
leftward, because they're carrying the temperature from their last collision.
Whereas in uniform temperature, there's "detailed balance" (just as many
molecules going along a path vs going along the time-reversed version of that
same path, and with the same speed distribution).
Thinking about the diode-resistor thing more, I suspect it would be a
waste-of-time nerd-sniping rabbit hole, especially because of the time-domain
aspects (the fluctuations are faster on one side vs the other I think) which
don't have any analogue in SGD. Sorry to have brought it up.

SGD's Bias

does it represent a bias towards less variance over the different gradients one can sample at a given point?

Yup, exactly.

Formal Inner Alignment, Prospectus

This is a good summary.

I'm still some combination of confused and unconvinced about optimization-under-uncertainty. Some points:

- It feels like "optimization under uncertainty" is not quite the right name for the thing you're trying to point to with that phrase, and I think your explanations would make more sense if we had a better name for it.
- The examples of optimization-under-uncertainty from your other comment do not really seem to be about uncertainty per se, at least not in the usual sense, whereas the Dr Nefarious example and malignness of the universal prior…

Could you elaborate on that? I do think that learning-normativity is more about
outer alignment. However, some ideas might cross-apply.
Well, it still seems like a good name to me, so I'm curious what you are
thinking here. What name would communicate better?
Again, I need more unpacking to be able to say much (or update much).
Well, the optimization-under-uncertainty is an attempt to make a frame which can
contain both, so this isn't necessarily a problem... but I am curious what feels
non-tight about inner agency.
I still agree with the hypothetical me making the opposite point ;p The problem
is that certain things are being conflated, so both "uncertainty can't be
separated from goals" and "uncertainty can be separated from goals" have true
interpretations. (I have those interpretations clear in my head, but
communication is hard.)
OK, so.
My sense of our remaining disagreement...
We agree that the pointers/uncertainty could be factored (at least informally --
currently waiting on any formalism).
You think "optimization under uncertainty" is doing something different, and I
think it's doing something close.
Specifically, I think "optimization under uncertainty" importantly is not
necessarily best understood as the standard Bayesian thing where we (1) start
with a utility function, (2) provide a prior, so that we can evaluate expected
value (and 2.5, update on any evidence), (3) provide a search method, so that we
solve the whole thing by searching for the highest-expectation element. Many
examples of optimization-under-uncertainty strain this model. Probably the
pointer/uncertainty model would do a better job in these cases. But, the
Bayesian model is kind of the only one we have, so we can use it provisionally.
And when we do so, the approximation of pointer-vs-uncertainty that comes out
is:
Pointer: The utility function.
Uncertainty: The search plus the prior, which in practice can blend together
into "inductive bias".
This isn't perfect, by any means…

Understanding the Lottery Ticket Hypothesis

Picture a linear approximation, like this:

The tangent space at that point is the whole line labelled "tangent".

The main difference between the tangent space and the space of neural-networks-for-which-the-weights-are-very-close is that the tangent space extrapolates the linear approximation indefinitely; it's not just limited to the region near the original point. (In practice, though, that difference does not actually matter much, at least for the problem at hand - we do stay close to the original point.)
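As a tiny numeric illustration (the function and sample points are stand-ins of mine): the tangent line at x0 matches f closely near x0, but since it's the linearization extrapolated indefinitely, it can be far off away from x0.

```python
def f(x):
    return x**2          # stand-in function

def df(x):
    return 2 * x         # its derivative

x0 = 1.0

def tangent(x):
    # The linearization at x0, extrapolated over the whole line.
    return f(x0) + df(x0) * (x - x0)

near, far = 1.001, 10.0
print(abs(f(near) - tangent(near)))  # tiny near x0
print(abs(f(far) - tangent(far)))    # large far from x0
```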

The reason we want to talk about "the tangen…

Understanding the Lottery Ticket Hypothesis

I don’t yet understand this proposal. In what way do we decompose this parameter tangent space into "lottery tickets"? Are the lottery tickets the cross product of subnetworks and points in the parameter tangent space? The subnetworks alone? If the latter then how does this differ from the original lottery ticket hypothesis?

The tangent space version is meant to be a fairly large departure from the original LTH; subnetworks are no longer particularly relevant at all.

We can imagine a space of generalized-lottery-ticket-hypotheses, in which the common theme i…

I confess I don't really understand what a tangent space is, even after reading
the wiki article on the subject. It sounds like it's something like this: Take a
particular neural network. Consider the "space" of possible neural networks that
are extremely similar to it, i.e. they have all the same parameters but the
weights are slightly different, for some definition of "slightly." That's the
tangent space. Is this correct? What am I missing?

Formal Inner Alignment, Prospectus

I buy the "problems can be both" argument in principle. However, when a problem involves both, it seems like we *have* to solve the outer part of the problem (i.e. figure out what-we-even-want), and once that's solved, all that's left for inner alignment is imperfect-optimizer-exploitation. The reverse does not apply: we do not necessarily have to solve the inner alignment issue (other than the imperfect-optimizer-exploiting part) at all. I also think a version of this argument probably carries over even if we're thinking about optimization-under-uncertainty…

Trying to lay this disagreement out plainly:

According to you, the inner alignment problem should apply to *well-defined optimization problems*, meaning optimization problems which have been given *all* the pieces needed to score domain items. Within this frame, the only reasonable definition is "inner" = issues of imperfect search, "outer" = issues of objective (which can include the prior, the utility function, etc).

According to me/Evan, the inner alignment problem should apply to *optimization under uncertainty*, which is a notion of optimization where you *don't*…

The way I'm currently thinking of things, I would say the reverse also applies
in this case.
We can turn optimization-under-uncertainty into well-defined optimization by
assuming a prior. The outer alignment problem (in your sense) involves getting
the prior right. Getting the prior right is part of "figuring out what we want".
But this is precisely the source of the inner alignment problems in the
paul/evan sense: Paul was pointing out a previously neglected issue about the
Solomonoff prior, and Evan is talking about inductive biases of machine learning
algorithms (which is sort of like the combination of a prior and imperfect
search).
So both you and Evan and Paul are agreeing that there's this problem with the
prior (/ inductive biases). It is distinct from other outer alignment problems
(because we can, to a large extent, factor the problem of specifying an expected
value calculation into the problem of specifying probabilities and the problem
of specifying a value function / utility function / etc). Everyone would seem to
agree that this part of the problem needs to be solved. The disagreement is just
about whether to classify this part as "inner" and/or "outer".
What is this problem like? Well, it's broadly a quality-of-prior problem, but it
has a different character from other quality-of-prior problems. For the most
part, the quality of priors can be understood by thinking about average error
being low, or mistakes becoming infrequent, etc. However, here, this kind of
thinking isn't sufficient: we are concerned with rare but catastrophic errors.
Thinking about these things, we find ourselves thinking in terms of "agents
inside the prior" (or agents being favored by the inductive biases).
To what extent "agents in the prior" should be lumped together with "agents in
imperfect search", I am not sure. But the term "inner optimizer" seems relevant.
A good example of optimization-under-uncertainty that doesn't look like that (at
least, not overtly) is most ap…

Formal Inner Alignment, Prospectus

### Example: Dr Nefarious

I'll make a case here that manipulation of imperfect internal search should be considered *the* inner alignment problem, and all the other things which look like inner alignment failures actually stem from outer alignment failure or non-mesa-optimizer-specific generalization failure.

Suppose Dr Nefarious is an agent in the environment who wants to acausally manipulate other agents' models. We have a model which knows of Dr Nefarious' existence, and we ask the model to predict what Dr Nefarious will do. At this point, we have already faile…

While I agree that outer objective, training data and prior should be considered
together, I disagree that it makes the inner alignment problem dissolve except
for manipulation of the search. In principle, if you could indeed ensure through
a smart choice of these three parameters that there is only one global optimum,
only "bad" (meaning high loss) local minima, and that your search process will
always reach the global optimum, then I would agree that the inner alignment
problem disappears.
But answering "what do we even want?" at this level of precision seems basically
impossible. I expect that it's pretty much equivalent to specifying exactly the
result we want, which we are quite unable to do in general.
So my perspective is that the inner alignment problem appears because of
inherent limits on our outer alignment capabilities. And that in realistic
settings where we cannot rule out multiple very good local minima, the sort of
reasoning underpinning the inner alignment discussion is the best approach we
have to address such problems.
That being said, I'm not sure how this view interacts with yours or Evan's, or
if this is a very standard use of the terms. But since that's part of the
discussion Abram is pushing, here is how I use these terms.

Hm, I want to classify "defense against adversaries" as a separate category from
both "inner alignment" and "outer alignment".
The obvious example is: if an adversarial AGI hacks into my AGI and changes its
goals, that's not any kind of alignment problem, it's a
defense-against-adversaries problem.
Then I would take that notion and extend it by saying "yes interacting with an
adversary presents an attack surface, but also merely imagining an adversary
presents an attack surface too". Well, at least in weird hypotheticals. I'm not
convinced that this would really be a problem in practice, but I dunno, I
haven't thought about it much.
Anyway, I would propose that the procedure for defense against adversaries in
general is: (1) shelter an AGI from adversaries early in training, until it's
reasonably intelligent and aligned, and then (2) trust the AGI to defend itself.
I'm not sure we can do any better than that.
In particular, I imagine an intelligent and self-aware AGI that's aligned in
trying to help me would deliberately avoid imagining an adversarial
superintelligence that can acausally hijack its goals!
That still leaves the issue of early training, when the AGI is not yet motivated
to not imagine adversaries, or not yet able. So I would say: if it does imagine
the adversary, and then its goals do get hijacked, then at that point I would
say "OK yes now it's misaligned". (Just like if a real adversary is exploiting a
normal security hole—I would say the AGI is aligned before the adversary
exploits that hole, and misaligned after.) Then what? Well, presumably, we will
need to have a procedure that verifies alignment before we release the AGI from
its training box. And that procedure would presumably be indifferent to how the
AGI came to be misaligned. So I don't think that's really a special problem we
need to think about.

So, I think I could write a much longer response to this (perhaps another post),
but I'm more or less not persuaded that problems should be cut up the way you
say.
As I mentioned in my other reply, your argument that Dr. Nefarious problems
shouldn't be classified as inner alignment is that they are apparently outer
alignment. If inner alignment problems are roughly "the internal objective
doesn't match the external objective" and outer alignment problems are roughly
"the outer objective doesn't meet our needs/goals", then there's no reason why
these have to be mutually exclusive categories.
In particular, Dr. Nefarious problems can be both.
But more importantly, I don't entirely buy your notion of "optimization". This
is the part that would require a longer explanation to be a proper reply. But
basically, I want to distinguish between "optimization" and "optimization under
uncertainty". Optimization under uncertainty is not optimization -- that is, it
is not optimization of the type you're describing, where you have a well-defined
objective which you're simply feeding to a search. Given a prior, you can reduce
optimization-under-uncertainty to plain optimization (if you can afford the
probabilistic inference necessary to take the expectations, which often isn't
the case). But that doesn't mean that you do, and anyway, I want to keep them as
separate concepts even if one is often implemented by the other.
Your notion of the inner alignment problem applies only to optimization.
Evan's notion of inner alignment applies (only!) to optimization under
uncertainty.

This is a great comment. I will have to think more about your overall point, but aside from that, you've made some really useful distinctions. I've been wondering if inner alignment should be defined separately from mesa-optimizer problems, and this seems like more evidence in that direction (i.e., the Dr Nefarious example is a mesa-optimization problem, but it's about outer alignment). Or maybe inner alignment just shouldn't be seen as the complement of outer alignment! Objective quality vs search quality is a nice dividing line, but it doesn't cluster together the problems people have been trying to cluster together.

Parsing Chris Mingard on Neural Networks

Yeah, I don't think there was anything particularly special about the complexity measure they used, and I wouldn't be surprised if some other measures did as-well-or-better at predicting which functions fill large chunks of parameter space.

Parsing Chris Mingard on Neural Networks

Minor note: I think "mappings with low Kolmogorov complexity occupy larger volumes" is an *extremely* misleading way to explain what the evidence points to here. (Yes, I know Chris used that language in his blog post, but he really should not have.) The intuition of "simple mappings occupy larger volumes" is an accurate summary, but the relevant complexity measure is very much not Kolmogorov complexity. It is a measure which does not account for every computable way of compressing things, only some *specific* ways of compressing things.

In a sense, the results…

Thanks for this clarification, John.
But did Mingard et al show that there is some specific practical complexity
measure that explains the size of the volumes occupied in parameter space better
than alternative practical complexity measures? If so, then I think we would have
uncovered an even more detailed understanding of which mappings occupy large
volumes in parameter space, and since neural nets just generally work so well in
the world, we could say that we know something about what kind of compression is
relevant to our particular world. But if they just used some practical
complexity measure as a rough proxy for Kolmogorov complexity then it leaves
open the door for mappings occupying large volumes in parameter space to be
simple according to many more ways of compressing things than were used in their
particular complexity measure.

FAQ: Advice for AI Alignment Researchers

I still don’t know much about GANs, Gaussian processes, Bayesian neural nets, and convex optimization.

FWIW, I'd recommend investing the time to learn about convex optimization even if you don't think you need it yet. Unlike the other topics on this list, the ideas from convex optimization are relevant to a wide variety of things the name does not suggest. Some examples:

- how to think about optimization in high dimensions (even for non-convex functions)
- using constraints as a more-gearsy way of representing relevant information (as opposed to e.g. just wrappin…

Yeah that's definitely the one on the list that I think would be most useful.
I may also be understating how much I know about it; I've picked up some over
time, e.g. linear programming, minimax, some kinds of duality, mirror descent,
Newton's method.

NTK/GP Models of Neural Nets Can't Learn Features

They wouldn't be random linear combinations, so the central limit theorem estimate wouldn't directly apply. E.g. this circuit transparency work basically ran PCA on activations. It's not immediately obvious to me what the right big-O estimate would be, but intuitively, I'd expect the PCA to pick out exactly those components dominated by change in activations - since those will be the components which involve large correlations in the activation patterns across data points (at least that's my intuition).

I think this claim is basically wrong:

…

Hmm, so regarding the linear combinations, it's true that there are some linear
combinations that will change by Θ(1) in the large-width limit -- just use the
vector of partial derivatives of the output at some particular input; this sum
will change by the amount that the output function moves during the regression.
Indeed, I suspect (but don't have a proof) that these particular combinations
will span the space of linear combinations that change non-trivially during
training. I would dispute "we expect most linear combinations to change" though
-- the CLT argument implies that we should expect almost all combinations to not
appreciably change. Not sure what effect this would have on the PCA, and I still
think it's plausible that it doesn't change at all (actually, I think Greg Yang
states that it doesn't change in section 9 of his paper; haven't read that part
super carefully though).
So I think I was a bit careless in saying that the NTK can't do transfer
learning at all -- a more exact statement might be "the NTK does the minimal
amount of transfer learning possible". What I mean by this is, any learning
algorithm can do transfer learning if the task we are 'transferring' to is
sufficiently similar to the original task -- for instance, if it's just the
exact same task but with a different data sample. I claim that the 'transfer
learning' the NTK does is of this sort. As you say, since the tangent kernel
doesn't change at all, the net effect is to move where the network starts in the
tangent space. Disregarding convergence speed, the impact this has on
generalization is determined by the values set by the old function on axes of
the NTK outside of the span of the partial derivatives at the new function's
data points. This means that, for the NTK to transfer anything from one task to
another, it's not enough for the tasks to both feature, for instance, eyes. It's
that the eyes have to correlate with the output in the exact same way in both
tasks. Indeed, the transfer l…

NTK/GP Models of Neural Nets Can't Learn Features

Alright, I buy the argument on page 15 of the original NTK paper.

I'm still very skeptical of the interpretation of this as "NTK models can't learn features". In general, when someone proves some interesting result which seems to contradict some combination of empirical results, my default assumption is that the proven result is being interpreted incorrectly, so I have a high prior that that's what's happening here. In this case, it could be that e.g. the "features" relevant to things like transfer learning are not individual neuron activations - e.g. IIRC…

I don't think taking linear combinations will help, because adding terms to the
linear combination will also increase the magnitude of the original activation
vector -- e.g. if you add together N^(1/2) units, the magnitude of the sum of
their original activations will with high probability be Θ(N^(1/4)), dwarfing
the O(1) change due to change in the activations. But regardless, it can't help
with transfer learning at all, since the tangent kernel (which determines
learning in this regime) doesn't change by definition.
What empirical results do you think are being contradicted? As far as I can
tell, the empirical results we have are 'NTK/GP have similar performance to
neural nets on some, but not all, tasks'. I don't think transfer/feature
learning is addressed at all. You might say these results are suggestive
evidence that NTK/GP captures everything important about neural nets, but this
is precisely what is being disputed with the transfer learning arguments.
I can imagine doing an experiment where we find the 'empirical tangent kernel'
of some finite neural net at initialization, solve the linear system, and then
analyze the activations of the resulting network. But it's worth noting that
this is not what is usually meant by 'NTK' -- that usually includes taking the
infinite-width limit at the same time. And to the extent that we expect the
activations to change at all, we no longer have reason to think that this linear
system is a good approximation of SGD. That's what the above mathematical
results mean -- the same mathematical analysis that implies that network
training is like solving a linear system, also implies that the activations
don't change at all.

NTK/GP Models of Neural Nets Can't Learn Features

Ok, that's at least a plausible argument, although there are some big loopholes. Main problem which jumps out to me: what happens after *one* step of backprop is not the relevant question. One step of backprop is not enough to solve a set of linear equations (i.e. to achieve perfect prediction on the training set); the relevant question is what happens after one step of Newton's method, or after enough steps of gradient descent to achieve convergence.

What would convince me more is an empirical result - i.e. looking at the internals of an actual NTK model, tr…

The result that NTK does not learn features in the large N limit is not in
dispute at all -- it's right there on page 15 of the original NTK paper
[https://arxiv.org/abs/1806.07572], and indeed holds after arbitrarily many
steps of backprop. I don't think that there's really much room for loopholes in
the math here. See Greg Yang's paper for a lengthy proof that this holds for all
architectures. Also worth noting that when people 'take the NTK limit' they
often don't initialize an actual net at all, they instead use analytical
expressions for what the inner product of the gradients would be at N=infinity
to compute the kernel directly.

NTK/GP Models of Neural Nets Can't Learn Features

During training on the 'old task', NTK stays in the 'tangent space' of the network's initialization. This means that, to first order, none of the functions/derivatives computed by the individual neurons change at all, only the output function does.

Eh? Why does this follow? Derivatives make sense; the derivatives staying approximately-constant is one of the assumptions underlying NTK to begin with. But the functions computed by individual neurons should be able to change for exactly the same reason the output function changes, assuming the network has more than one layer. What am I missing here?

The asymmetry between the output function and the intermediate neuron functions
comes from backprop -- from the fact that the gradients are backprop-ed through
weight matrices with entries of magnitude O(N^(-1/2)). So the gradient of the
output w.r.t. itself is obviously 1; then the gradient of the output w.r.t. each
neuron in the preceding layer is O(N^(-1/2)), since you're just multiplying by a
vector with those entries. Then by induction all other preceding layers'
gradients are the sum of N random things of size O(1/N), and so are of size
O(N^(-1/2)) again. So taking a step of backprop will change the output function
by O(1) but the intermediate functions by O(N^(-1/2)), vanishing in the
large-width limit.
(This is kind of an oversimplification since it is possible to have changing
intermediate functions while doing backprop, as mentioned in the linked paper.
But this is the essence of why it's possible in some limits to move around using
backprop without changing the intermediate neurons)
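The scaling claim above is easy to check numerically. Here's a hedged sketch (NTK-style 1/sqrt(N) weight scaling, linear layers only, with all the particular choices assumed by me) comparing the typical per-neuron gradient at two widths:

```python
import numpy as np

def hidden_grad_scale(N, depth=3, seed=0):
    """Typical per-neuron gradient magnitude in a hidden layer, after
    backprop-ing a gradient of 1 from the output through `depth` layers
    of NTK-scaled (entries ~ N^(-1/2)) weight matrices."""
    rng = np.random.default_rng(seed)
    g = np.ones(1)                                  # d(output)/d(output) = 1
    W_out = rng.standard_normal((1, N)) / np.sqrt(N)
    g = W_out.T @ g                 # last hidden layer: entries O(N^(-1/2))
    for _ in range(depth - 1):
        W = rng.standard_normal((N, N)) / np.sqrt(N)
        g = W.T @ g                 # stays O(N^(-1/2)), per the induction above
    return float(np.sqrt(np.mean(g**2)))

small, large = hidden_grad_scale(100), hidden_grad_scale(2500)
# Width grows 25x, so per-neuron gradients should shrink ~sqrt(25) = 5x.
print(small / large)
```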

Updating the Lottery Ticket Hypothesis

Plausible.

Here's an intuition pump to consider: suppose our net is a complete multigraph: not only is there an edge between every pair of nodes, there are multiple edges with base-2-exponentially-spaced weights, so we can always pick out a subset of them to get any total weight we please between the two nodes. Clearly, masking can turn this into any circuit we please (with the same number of nodes). But it seems wrong to say that this initial circuit has anything useful in it at all.
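The base-2 trick in that intuition pump is just binary representation; a small sketch (my own toy formulation of it):

```python
def mask_for_weight(target, k=8):
    """Between a pair of nodes there are k edges with weights 1, 2, 4,
    ..., 2^(k-1); return which edges to keep (the mask) so the kept
    edges sum to any desired integer weight in [0, 2^k)."""
    assert 0 <= target < 2**k
    return [bool((target >> i) & 1) for i in range(k)]

def total_weight(mask):
    return sum(2**i for i, keep in enumerate(mask) if keep)

# Any integer weight is reachable by masking alone -- which is why
# "the circuit was already in there" is vacuous for this initialization.
for w in [0, 7, 42, 255]:
    assert total_weight(mask_for_weight(w)) == w
print("all weights realizable")
```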

That seems right, but also reminds me of the point that you need to randomly
initialize your neural nets for gradient descent to work (because otherwise the
gradients everywhere are the same). Like, in the randomly initialized net, each
edge is going to be part of many subcircuits, both good and bad, and the
gradient is basically "what's your relative contribution to good subcircuits vs.
bad subcircuits?"

Updating the Lottery Ticket Hypothesis

Oh that's definitely interesting. Thanks for the link.

Updating the Lottery Ticket Hypothesis

Yup, I agree that that quote says something which is probably true, given current evidence. I don't think "picking a winning lottery ticket" is a good way analogy for what that implies, though; see this comment.

I don't know what the referent of 'that quote' is. If you mean the passage I
quoted from the original lottery ticket hypothesis ("LTH") paper, then I highly
recommend reading a follow-up paper
[http://proceedings.mlr.press/v119/frankle20a.html] which describes how and why
it's wrong for large networks; the abstract of that paper lays out the argument.
Again, assuming "that" refers to the claim in the original LTH paper, I also
don't think it's a good analogy. But by default I think that claim is what "the
lottery ticket hypothesis" refers to, given that it's a widely cited paper that
has spawned a large number of follow-up works.

Updating the Lottery Ticket Hypothesis

Not quite. The linear expansion isn't just over the parameters associated with one layer, it's over all the parameters in the whole net.

Updating the Lottery Ticket Hypothesis

Yeah, I agree that something more general than one neuron but less general than (or at least different from) pruning might be appropriate. I'm not particularly worried about where that line "should" be drawn a priori, because the tangent space indeed seems like the right place to draw the line empirically.

62moWait... so:
1. The tangent-space hypothesis implies something close to "gd finds a solution
if and only if there's already a dog detecting neuron" (for large networks,
that is) -- specifically it seems to imply something pretty close to
"there's already a feature", where "feature" means a linear combination of
existing neurons within a single layer
2. gd in fact trains NNs to recognize dogs
3. Therefore, we're still in the territory of "there's already a dog detector"
...yeah?

Updating the Lottery Ticket Hypothesis

See my comment here. The paper you link (like others I've seen in this vein) requires pruning, which will change the functional behavior of the nodes themselves. As I currently understand it, it is consistent with not having any neuron which lights up to match most functions/circuits.

62moAh, I should have read comments more carefully.
I agree with your comment that the claim "I doubt that today’s neural networks
already contain dog-recognizing subcircuits at initialization" is ambiguous --
"contains" can mean "under pruning".
Obviously, this is an important distinction: if the would-be dog-recognizing
circuit behaves extremely differently due to intersection with a lot of other
circuits, it could be much harder to find. But why is "a single neuron lighting
up" where you draw the line?
It seems clear that at least some relaxation of that requirement is tenable. For
example, if no one neuron lights up in the correct pattern, but there's a linear
combination of neurons (before the output layer) which does, then it seems we're
good to go: GD could find that pretty easily.
I guess this is where the tangent space model comes in; if in practice (for
large networks) we stay in the tangent space, then a linear combination of
neurons is basically exactly as much as we can relax your requirement.
But without the tangent-space hypothesis, it's unclear where to draw the line,
and your claim that an existing neuron already behaving in the desired way is
"what would be necessary for the lottery ticket intuition" isn't clear to me.
(Is there a more obvious argument for this, according to you?)
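A quick numerical sanity check of that "good to go" claim: freeze a random hidden layer, make the target an arbitrary linear combination of its neurons, and gradient descent on the readout weights alone recovers it, since that subproblem is convex. All sizes and details here are arbitrary illustrations:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, h = 500, 30, 20
X = rng.normal(size=(n, d))
H = np.tanh(X @ rng.normal(size=(d, h)))  # frozen random hidden neurons

# No single neuron is the target, but some linear combination of them is:
y = H @ rng.normal(size=h)

# Plain gradient descent on the readout weights alone. The problem is
# convex (quadratic), so GD finds the combination easily.
w = np.zeros(h)
lr = 1.0 / np.linalg.eigvalsh(H.T @ H / n).max()
for _ in range(2000):
    w -= lr * (H.T @ (H @ w - y) / n)

mse = np.mean((H @ w - y) ** 2)
print(mse / np.mean(y ** 2))  # relative error, near zero
```

This only shows the easy direction (a linear combination in the last layer is findable); it says nothing about where exactly the line between "findable" and "not findable" falls, which is the open question above.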

Updating the Lottery Ticket Hypothesis

Sort of. They end up equivalent to a single Newton step, not a single gradient step (or at least that's what this model says). In general, a set of linear equations is not solved by one gradient step, but is solved by one Newton step. It generally takes many gradient steps to solve a set of linear equations.

(Caveat to this: if you directly attempt a Newton step on this sort of system, you'll probably get an error, because the system is underdetermined. Actually making Newton steps work for NN training would probably be a huge pain in the ass, since the underdetermination would cause numerical issues.)
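To make the gradient-vs-Newton distinction concrete, here's a toy quadratic (a hypothetical sketch; the matrix is deliberately square and well-conditioned to dodge the underdetermination issue mentioned above):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(6, 6)) + 3 * np.eye(6)  # square and well-conditioned
b = rng.normal(size=6)
x0 = np.zeros(6)

# Objective: f(x) = 0.5 * ||Ax - b||^2, minimized exactly where Ax = b.
grad = A.T @ (A @ x0 - b)  # gradient at x0
hess = A.T @ A             # Hessian (constant, since f is quadratic)

x_gd = x0 - 0.01 * grad                      # one gradient step
x_newton = x0 - np.linalg.solve(hess, grad)  # one Newton step

print(np.linalg.norm(A @ x_gd - b))      # still well away from zero
print(np.linalg.norm(A @ x_newton - b))  # essentially zero
```

The Newton step jumps straight to the minimum of the quadratic in one shot, which is exactly the "solves the linear system in one step" behavior; the gradient step only moves a little way downhill.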

42moBy Newton step, do you mean one step of Newton's method
[https://en.wikipedia.org/wiki/Newton%27s_method_in_optimization]?

Updating the Lottery Ticket Hypothesis

In hindsight, I probably should have explained this more carefully. "Today’s neural networks already contain dog-recognizing subcircuits at initialization" was not an accurate summary of exactly what I think is implausible.

Here's a more careful version of the claim:

- I do not find it plausible that a random network contains a neuron which acts as a reliable dog-detector. This is the sense in which it's not plausible that networks contain dog-recognizing subcircuits at initialization. But this is what would be necessary for the "lottery ticket" intuition (i.e

32moI don't think I agree, because of the many-to-many relationship between neurons
and subcircuits. Or, like, I think the standard of 'reliability' for this is
very low. I don't have a great explanation / picture for this intuition, and so
probably I should refine the picture to make sure it's real before leaning on it
too much?
To be clear, I think I agree with your refinement as a more detailed picture of
what's going on; I guess I just think you're overselling how wrong the naive
version is?

Updating the Lottery Ticket Hypothesis

The problem is what we mean by e.g. "dog recognizing subcircuit". The simplest version would be something like "at initialization, there's already one neuron which lights up in response to dogs" or something like that. (And that's basically the version which would be needed in order for a gradient descent process to actually pick out that lottery ticket.) That's the version which I'd call implausible: function space is superexponentially large, circuit space is smaller but still superexponential, so no neural network is ever going to be large enough to hav... (read more)

72moI was in a discussion yesterday that made it seem pretty plausible that you're
wrong -- this paper [https://arxiv.org/abs/2006.07990] suggests that the
over-parameterization needed to ensure that some circuit is (approximately)
present at the beginning is not that large.
I haven't actually read the paper I'm referencing, but my understanding is that
this argument doesn't work out because the number of possible circuits of size N
is balanced by the high number of subgraphs in a graph of size M (where M is
only logarithmically larger than N).
That being said, I don't know whether "present at the beginning" is the same as
"easily found by gradient descent".

42moOne important thing to note here is that the LTH paper doesn't demonstrate that
SGD "finds" a ticket: just that the subnetwork you get by training and pruning
could be trained alone in isolation to higher accuracy. That doesn't mean that
the weights in the original training are the same when the network is trained in
isolation!

Updating the Lottery Ticket Hypothesis

This basically matches my current understanding. (Though I'm not strongly confident in my current understanding.) I believe the GP results are basically equivalent to this, but I haven't read up on the topic enough to be sure.

Computing Natural Abstractions: Linear Approximation

Note to self: the summary dimension seems basically constant as the neighborhood size increases. There's probably a generalization of Koopman-Pitman-Darmois which applies.

Computing Natural Abstractions: Linear Approximation

I ran this experiment about 2-3 weeks ago (after the chaos post but before the project intro). I pretty strongly expected this to work much earlier - e.g. back in December I gave a talk which had all the ideas in it.

Testing The Natural Abstraction Hypothesis: Project Intro

On the role of values: values clearly do play some role in determining which abstractions we use. An alien who observes Earth but does not care about anything on Earth's surface will likely not have a concept of trees, any more than an alien which has not observed Earth at all. Indifference has a similar effect to lack of data.

However, I expect that the space of abstractions is (approximately) discrete. A mind may use the tree-concept, or not use the tree-concept, but there is no natural abstraction arbitrarily-close-to-tree-but-not-the-same-as-tree. There... (read more)

Testing The Natural Abstraction Hypothesis: Project Intro

Also interested in helping on this - if there's modelling you'd want to outsource.

Here's one fairly-standalone project which I probably won't get to soon. It would be a fair bit of work, but also potentially very impressive in terms of both showing off technical skills and producing cool results.

Short somewhat-oversimplified version: take a finite-element model of some realistic objects. Backpropagate to compute the jacobian of final state variables with respect to initial state variables. Take a singular value decomposition of the jacobian. Hypothesis: th... (read more)

Been a while but I thought the idea was interesting and had a go at implementing it. Houdini was too much for my laptop, let alone my programming skills, but I found a simple particle simulation in pygame which shows the basics, can see below.

Planned next step is to work on the run-time speed (even this took a couple of minutes run, calculating the frame-to-frame Jacobian is a pain, probably more than necessary) and then add some utilities for creatin... (read more)
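For anyone else attempting this, the core computation can be sketched without any game engine at all: a toy spring-chain stands in for the finite-element model, and finite differences stand in for backprop. Everything here is illustrative, not the pygame setup described above:

```python
import numpy as np

def step(state, dt=0.05, k=1.0):
    """One step of a toy 'finite-element' system: particles on a line,
    each coupled to its neighbors by springs (free ends)."""
    n = state.size // 2
    x, v = state[:n], state[n:]
    f = np.zeros(n)
    f[1:] += k * (x[:-1] - x[1:])   # pull from left neighbor
    f[:-1] += k * (x[1:] - x[:-1])  # pull from right neighbor
    return np.concatenate([x + dt * v, v + dt * f])

def rollout(state, steps=50):
    for _ in range(steps):
        state = step(state)
    return state

def jacobian_fd(fn, state, eps=1e-6):
    """Finite-difference Jacobian of final state w.r.t. initial state
    (a backprop-free stand-in for autodiff)."""
    n = state.size
    J = np.zeros((n, n))
    for i in range(n):
        d = np.zeros(n)
        d[i] = eps
        J[:, i] = (fn(state + d) - fn(state - d)) / (2 * eps)
    return J

state0 = np.zeros(8)  # 4 particles: positions, then velocities
J = jacobian_fd(rollout, state0)
U, S, Vt = np.linalg.svd(J)
# Large singular values = directions in the initial state that still
# visibly matter after 50 steps; small ones are "forgotten" details.
print(S)
```

Finite differences cost one rollout pair per input dimension, which is exactly why the frame-to-frame Jacobian gets painful at scale; reverse-mode autodiff would be the fix for larger systems.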

Testing The Natural Abstraction Hypothesis: Project Intro

Re: dual use, I do have some thoughts on exactly what sort of capabilities would potentially come out of this.

The really interesting possibility is that we end up able to precisely specify high-level human concepts - a real-life language of the birds. The specifications would correctly capture what-we-actually-mean, so they wouldn't be prone to goodhart. That would mean, for instance, being able to formally specify "strawberry on a plate" in a non-goodhartable way, so an AI optimizing for a strawberry on a plate would actually produce a strawberry on a plate... (read more)

How do we prepare for final crunch time?

Re: picking up new tools, skills and practice designing and building user interfaces, especially to complex or not-very-transparent systems, would be very-high-leverage if the tool-adoption step is rate-limiting.

How do we prepare for final crunch time?

Relevant topic of a future post: some of the ideas from Risks From Learned Optimization or the Improved Good Regulator Theorem offer insights into building effective institutions and developing flexible problem-solving capacity.

Rough intuitive idea: intelligence/agency are about generalizable problem-solving capability. How do you incentivize generalizable problem-solving capability? Ask the system to solve a wide variety of problems, or a problem general enough to encompass a wide variety.

If you want an organization to act agenty, then a useful technique ... (read more)

Transparency Trichotomy

This post seems to me to be beating around the bush. There's several different classes of transparency methods evaluated by several different proxy criteria, but this is all sort of tangential to the thing which actually matters: **we do not understand what "understanding a model" means**, at least not in a sufficiently-robust-and-legible way to confidently put optimization pressure on it.

For transparency via inspection, the problem is that we don't know what kind of "understanding" is required to rule out bad behavior. We can notice that some low-level featur... (read more)

53moI agree it's sort of the same problem under the hood, but I think knowing how
you're going to go from "understanding understanding" to producing an
understandable model controls what type of understanding you're looking for.
I also agree that this post makes ~0 progress on solving the "hard problem" of
transparency, I just think it provides a potentially useful framing and creates
a reference for me/others to link to in the future.

The Fusion Power Generator Scenario

One way in which the analogy breaks down: in the lever case, we have two levers right next to each other, and each does something we want - it's just easy to confuse the levers. A better analogy for AI might be: many levers and switches and dials have to be set to get the behavior we want, and mistakes in some of them matter while others don't, and we don't know which ones matter when. And sometimes people will figure out that a particular combination extends the flaps, so they'll say "do this to extend the flaps", except that when some other switch has th... (read more)

Alignment By Default

Yup, this is basically where that probability came from. It still feels about right.

Alignment By Default

This is a great explanation. I basically agree, and this is exactly why I expect alignment-by-default to most likely fail even conditional on the natural abstractions hypothesis holding up.

33moAlso, I have another strange idea that might increase the probability of this
working.
If you could temporarily remove proxies based on what people say, then this
would seem to greatly increase the chance of it hitting the actual embedded
representation of human values. Maybe identifying these proxies is easier than
identifying the representation of "true human values"?
I don't think it's likely to work, but thought I'd share anyway.

13moThanks!
Is this why you put the probability as "10-20% chance of alignment by this path,
assuming that the unsupervised system does end up with a simple embedding of
human values"? Or have you updated your probabilities since writing this post?

HCH Speculation Post #2A

This is the best explanation I have seen yet of what seem to me to be the main problems with HCH. In particular, that scene from HPMOR is one that I've also thought about as a good analogue for HCH problems. (Though I do think the "humans are bad at things" issue is more probably important for HCH than the malicious memes problem; HCH is basically a giant bureaucracy, and the same shortcomings which make humans bad at giant bureaucracies will directly limit HCH.)

Behavioral Sufficient Statistics for Goal-Directedness

I'm working on writing it up properly, should have a post at some point.

EDIT: it's up.

Behavioral Sufficient Statistics for Goal-Directedness

I still feel like you're missing something important here.

For instance... in the explainability factor, you measure "the average deviation of the policy from the actions favored by the action-value function of the goal", using the formula

. But why this particular formula? Why not take the log of first, or use in the denominator? Indeed, there's a strong argument to be made that this formula is a bad choice: the value function is... (read more)

43moTo people reading this thread: we had a private conversation with John (faster
and easier), which resulted in me agreeing with him.
The summary is that you can see the arguments made and constraints invoked as a
set of equations, such that the adequate formalization is a solution of this
set. But if the set has more than one solution (maybe a lot), then it's
misleading to call that the solution.
So I've been working these last few days at arguing for the properties
(generalization, explainability, efficiency) in such a way that the
corresponding set of equations only has one solution.

The case for aligning narrowly superhuman models

We can't mostly-win just by fine-tuning a language model to do moral discourse.

Uh... yeah, I agree with that statement, but I don't really see how it's relevant. If we tune a language model to do moral discourse, then won't it be tuned to talk about things like Terry Schiavo, which we just said was not that central? Presumably tuning a language model to talk about those sorts of questions would not make it any good at moral problems like "they said they want fusion power, but they probably also want it to not be turn-into-bomb-able".

Or are you using "moral... (read more)

Behavioral Sufficient Statistics for Goal-Directedness

I think you are very confused about the conceptual significance of a "sufficient statistic".

Let's start with the prototypical setup of a sufficient statistic. Suppose I have a bunch of IID variables drawn from a maximum-entropy distribution with features f (i.e. the "true" distribution is maxentropic subject to a constraint on the expectation of f), BUT I don't know the parameters of the distribution (i.e. I don't know the expected value of f). For instance, maybe I know that the variables are drawn from a normal... (read more)
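For the normal case specifically, the defining property can be checked numerically: two different datasets with the same sufficient statistics (n, Σx, Σx²) are indistinguishable to any inference about the parameters. A toy sketch:

```python
import numpy as np

def gaussian_loglik(data, mu, sigma):
    # Log-likelihood of IID normal data. Expanding the square shows it
    # depends on the data only through n, sum(x), and sum(x^2).
    return np.sum(-0.5 * np.log(2 * np.pi * sigma ** 2)
                  - (data - mu) ** 2 / (2 * sigma ** 2))

# Two *different* datasets with identical sufficient statistics:
a = np.array([1.0, 5.0, 6.0])  # sum = 12, sum of squares = 62
b = np.array([2.0, 3.0, 7.0])  # sum = 12, sum of squares = 62

# Every (mu, sigma) assigns both datasets the same likelihood, so no
# inference about the parameters can tell them apart.
for mu, sigma in [(0.0, 1.0), (4.0, 2.0), (-1.0, 0.5)]:
    print(gaussian_loglik(a, mu, sigma) - gaussian_loglik(b, mu, sigma))
```

That "the likelihood depends on the data only through the statistic" property is the bar a statistic has to clear to be literally sufficient, which is the standard the post's proposed statistics are being measured against.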

33moThanks for the spot-on pushback!
I do understand what a sufficient statistic is -- which probably means I'm even
more guilty of what you're accusing me of. And I agree completely that I don't
properly defend the claim that the statistics I provide are really sufficient.
If I try to explain myself, what I want to say in this post is probably
something like
* Knowing these intuitive properties about π and the goals seems sufficient to
express and address basically any question we have related to goals and
goal-directedness. (in a very vague intuitive way that I can't really
justify).
* To think about that in a grounded way, here are formulas for each property
that look like they capture these properties.
* Now what's left to do is to attack the aforementioned questions about goals
and goal-directedness with these statistics, and see if they're enough.
(Which is the topic of the next few posts)
Honestly, I don't think there's an argument to show these are literally
sufficient statistics. Yet I still think staking the claim that they are is
quite productive for further research. It gives concreteness to an exploration
of goal-directedness, carving more grounded questions:
* Given a question about goals and goal-directedness, are these properties
enough to frame and study this question? If yes, then study it. If not, then
study what's missing.
* Are my formulas adequate formalizations of the intuitive properties?
This post mostly focuses on the second aspect, and to be honest, not even in as
much detail as one could go.
Maybe that means this post shouldn't exist, and I should have waited to see if I
could literally formalize every question about goals and goal-directedness. But
posting it to gather feedback on whether these statistics makes sense to people,
and if they feel like something's missing, seemed valuable.
That being said, my mistake (and what caused your knee-jerk reaction) was to
just say these are literally sufficient statistics inst

The case for aligning narrowly superhuman models

I think one argument running through a lot of the sequences is that the parts of "human values" which mostly determine whether AI is great or a disaster are *not* the sort of things humans usually think of as "moral questions". Like, these examples from your comment below:

Was it bad to pull the plug on Terry Schiavo? How much of your income should you give to charity? Is it okay to kiss your cousin twice removed? Is it a good future if all the humans are destructively copied to computers? Should we run human challenge trials for covid-19 vaccines?

If an AGI i... (read more)

13moI'd say "If an AGI is hung up on these sorts of questions [i.e. the examples I
gave of statements human 'moral experts' are going to disagree about], then
we've already mostly-won" is an accurate correlation, but doesn't stand up to
optimization pressure. We can't mostly-win just by fine-tuning a language model
to do moral discourse. I'd guess you agree?
Anyhow, my point was more: You said "you get what you can measure" is a problem
because the fact of the matter for whether decisions are good or bad is hard to
evaluate (therefore sandwiching is an interesting problem to practice on). I
said "you get what you measure" is a problem because humans can disagree when
their values are 'measured' without either of them being mistaken or defective
(therefore sandwiching is a procrustean bed / wrong problem).

Open Problems with Myopia

(On reflection this comment is less kind than I'd like it to be, but I'm leaving it as-is because I think it is useful to record my knee-jerk reaction. It's still a good post; I apologize in advance for not being very nice.)

In theory, such an agent is safe because a human would only approve safe actions.

... wat.

Lol no.

Look, I understand that outer alignment is orthogonal to the problem this post is about, but like... say that. Don't just say that a very-obviously-unsafe thing is safe. (Unless this is in fact nonobvious, in which case I will retract this comment and give a proper explanation.)

33moYou beat me to making this comment :P Except apparently I came here to make this
comment about the changed version.
"A human would only approve safe actions" is just a problem clause altogether. I
understand how this seems reasonable for sub-human optimizers, but if you (now
addressing Mark and Evan) think it has any particular safety properties for
superhuman optimization pressure, the particulars of that might be interesting
to nail down a bit better.

73moYeah, you're right that it's obviously unsafe. The words "in theory" were meant
to gesture at that, but it could be much better worded. Changed to "A
prototypical example is a time-limited myopic approval-maximizing agent. In
theory, such an agent has some desirable safety properties because a human would
only approve safe actions (although we still would consider it unsafe)."

Good explanation, conceptually.

Not sure how all the details play out - in particular, my big question for any RL setup is "how does it avoid wireheading?". In this case, presumably there would have to be some kind of constraint on the reward-prediction model, so that it ends up associating the reward with the state of the environment rather than the state of the sensors.
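The structure under discussion (a planner that acts as a utility maximizer over the learned reward-prediction model, plus a separate subroutine editing that model toward the true reward signal) fits in a few lines as a bandit-style toy. This is purely illustrative, and it deliberately sidesteps the wireheading question by making the "sensors" perfect:

```python
import random

random.seed(0)
ACTIONS = [0, 1, 2]

def true_reward(action):
    # The "real" reward signal, unknown to the planner.
    return {0: 0.1, 1: 1.0, 2: 0.3}[action]

# The reward prediction model (≈ utility function): here, just one
# predicted reward per action.
reward_model = {a: 0.0 for a in ACTIONS}

for t in range(500):
    # Planner: acts as a utility maximizer w.r.t. the *current* model
    # (with a little exploration so every prediction gets feedback).
    if random.random() < 0.1:
        action = random.choice(ACTIONS)
    else:
        action = max(ACTIONS, key=lambda a: reward_model[a])
    # Separate subroutine: edit the model toward the true reward signal,
    # driven by the reward prediction error.
    error = true_reward(action) - reward_model[action]
    reward_model[action] += 0.2 * error

# The planner only ever consults its own model, yet the combined system
# converges on maximizing the true reward.
best = max(ACTIONS, key=lambda a: reward_model[a])
```

The wireheading worry enters exactly where this toy cheats: `true_reward` here is the environment state itself, whereas in a real agent the model-update subroutine only sees the reward *as measured by sensors*, so without some constraint the model can learn to predict sensor states instead.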