We've written a paper on online imitation learning, and our construction allows us to bound the extent to which mesa-optimizers could accomplish anything. This is not to say it will definitely be easy to eliminate mesa-optimizers in practice, but investigations into how to do so could take this as a starting point. The way to avoid outputting predictions that may have been corrupted by a mesa-optimizer is to ask for help when plausible stochastic models disagree about probabilities.
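The core loop of that idea can be sketched in a few lines. This is a standalone illustration, not the paper's construction: the model representation, the 0.1 tolerance, and all names here are my own hypothetical stand-ins.

```python
import random

def imitator_step(history, models, weights, alpha, demonstrator_query):
    """One step of a 'query on disagreement' imitator (illustrative only).

    models: candidate stochastic policies; each maps a history to a dict
    from action to probability. weights: posterior weights over models.
    alpha: threshold for which models count as 'plausible'.
    """
    best_w = max(weights)
    plausible = [(w, m(history)) for w, m in zip(weights, models)
                 if w >= alpha * best_w]

    actions = set()
    for _, dist in plausible:
        actions.update(dist)

    # If any two plausible models assign noticeably different probability
    # to some action, defer to the demonstrator instead of acting.
    for a in actions:
        probs = [dist.get(a, 0.0) for _, dist in plausible]
        if max(probs) - min(probs) > 0.1:  # illustrative tolerance
            return demonstrator_query(history)

    # Otherwise, sample from the highest-weight model's distribution.
    _, dist = max(plausible, key=lambda pair: pair[0])
    return random.choices(list(dist), weights=list(dist.values()))[0]
```

The point of the structure is that a treacherous model in the plausible set cannot corrupt an output without first disagreeing with the others, which triggers a query instead of an action.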

Here is the abstract:

...In imitation learning, imitators and demonstrators are policies for picking actions given past interactions with the environment. If we run an imitator, we probably want events to unfold similarly to the way they would have if the demonstrator had been acting the whole time. No existing work

Koen Holtman: Interesting paper! I like the focus on imitation learning, but the really new
food-for-thought thing to me is the bit about dropping i.i.d. assumptions and
then seeing how far you can get. I need to think more about the math in the
paper before I can ask some specific questions about this i.i.d. thing.
My feelings about the post above are a bit more mixed. Claims about inner
alignment always seem to generate a lot of traffic on this site. But a lot of
this traffic consists of questions and clarification about what exactly counts
as an inner alignment failure or a mesa optimization related failure. The term
is so fluid that I find the quantitative feelings that people express in the
comment section hard to interpret. Is everybody talking about the same
P(treachery) and P(bad)?
THOUGHT EXPERIMENT COUNTER-EXAMPLE
Moving beyond these mixed feelings, here is a fun thought experiment, for
various values of fun. You state:
So now I will try to construct a counter-example to this claim: an example where
mesa-optimizers (as I understand them) will accomplish very bad things even
inside your construction.
Your construction uses a set of candidate policies Π, where one of them equals
the real demonstrator policy πd.
A single policy πi is a function where πi(at|h<t) computes the probability that
action at will be taken in the world state represented by the history h<t.
Here is the first special step in the thought experiment. I am going to define
each πi by using a different reward function Ri that the policy is going to try
to maximize. Every policy definition will use the same world state transition
predictor P that allows me to estimate P(h<t+1|h<t,at) for every t. (I am
probably getting slightly creative with the notation in the paper at this
point.)
I now define each function πi as follows: I compute πi(at|h<t) by assigning a
probability of 1 to the one at that comes out of the argmax over a of the
Bellman equation constructed using the shared predictor P and the policy sp
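A minimal sketch of the construction described above, with one-step lookahead standing in for solving the full Bellman equation; `P`, `R_i`, and the action set are illustrative stand-ins, not the paper's notation.

```python
def make_policy(R_i, P, actions):
    """Build the deterministic policy pi_i described above: probability 1
    on the action maximizing expected reward R_i under the shared
    transition predictor P (one-step lookahead stands in for the full
    Bellman solution)."""
    def pi_i(a, history):
        def value(act):
            # Expected reward over P(next_state | history, act).
            return sum(p * R_i(s) for s, p in P(history, act).items())
        best = max(actions, key=value)
        return 1.0 if a == best else 0.0
    return pi_i
```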

michaelcohen: This is exactly right.
The reason this is consistent is that queries will become infrequent, but they
will still be well-timed. The choice of whether to query depends on what the
treacherous models are doing. So if the treacherous models wait for a long time,
then we will have a long time where no queries are needed, and as soon as they
decide to be treacherous, we thwart that plan with a query to the demonstrator.
So it does not imply that
Yep, perfectly reasonable!
If we set α small enough, we can make it arbitrarily likely that πd never
leaves the set of top policies.

I think I mostly agree with what you're saying here, though I have a couple of
comments—and now that I've read and understand your paper more thoroughly, I'm
definitely a lot more excited about this research.
I don't think that's right—if you're modeling the training process as Bayesian,
as I now understand, then the issue is that what makes the problem go away isn't
more intelligent models, but less efficient training processes. Even if we have
arbitrary compute, I think we're unlikely to use training processes that look
Bayesian just because true Bayesianism is really hard to compute such that for
basically any amount of computation that you have available to you, you'd rather
run some more efficient algorithm like SGD.
I worry about these sorts of Bayesian analyses where we're assuming that we're
training a large population of models, one of which is assumed to be an accurate
model of the world, since I expect us to end up using some sort of local search
instead of a global search like that—just because I think that local search is
way more efficient than any sort of Bayesianish global search.
I definitely agree with all of this, except perhaps on whether Gated Linear
Networks are likely to help (though I don't claim to actually understand
backpropagation-free networks all that well).

I feel this is the wrong way to look at it. I expect any effective learning
algorithm to be an approximation of Bayesianism in the sense that it satisfies
some good sample complexity bound w.r.t. some sufficiently rich prior. Of course it's
non-trivial to (i) prove such a bound for a given algorithm (ii) modify said
algorithm using confidence thresholds in a way that leads to a safety guarantee.
However, there is no sharp dichotomy between "Bayesianism" and "SGD" such that
this approach obviously doesn't apply to the latter, or to something competitive
with the latter.

I agree that at some level SGD has to be doing something approximately Bayesian.
But that doesn't necessarily imply that you'll be able to get any nice,
Bayesian-like properties from it, such as error bounds. For example, if you think
of SGD as effectively just taking the MAP model starting from some sort of
simplicity prior, it seems very difficult to turn that into something like the
top N posterior models, as would be required for an algorithm like this.

I mean, there's obviously a lot more work to do, but this is progress.
Specifically if SGD is MAP then it seems plausible that e.g. SGD + random
initial conditions or simulated annealing would give you something like top N
posterior models. You can also extract confidence from NNGP.

I agree that this is progress (now that I understand it better), though:
I think there is strong evidence that the behavior of models trained via the
same basic training process are likely to be highly correlated. This sort of
correlation is related to low variance in the bias-variance tradeoff sense, and
there is evidence that [https://arxiv.org/abs/2002.11328] not only do massive
neural networks tend to have pretty low variance, but that this variance is
likely to continue to decrease as networks become larger.

Hmm, added to reading list, thank you.

I don't understand the alternative, but maybe that's neither here nor there.
It's a little hard to make a definitive statement about a hypothetical in which
the inner alignment problem doesn't apply to Bayesian inference. However, since
error bounds are apparently a key piece of a solution, it certainly seems that
if Bayesian inference were immune to mesa-optimizers it would be because of
competence, not resource-prodigality.
Here's another tack. Hypothesis generation seems like a crucial part of
intelligent prediction. Pure Bayesian reasoning does hypothesis generation by
brute force. Suppose it was inefficiency, and not intelligence, that made
Bayesian reasoning avoid mesa-optimizers. Then suppose we had a Bayesian
reasoner that was equally intelligent but more efficient, by only generating
relevant hypotheses. It gains efficiency by not even bothering to consider a
bunch of obviously wrong models, but its posterior is roughly the same, so it
should avoid inner alignment failure equally well. If, on the other hand, the
hypothesis generation routine was bad enough that some plausible hypotheses went
unexamined, this could introduce an inner alignment failure, with a notable
decrease in intelligence.
I expect some heuristic search with no discernible structure, guided
"attentively" by an agent. And I expect this heuristic search through models to
identify any model that a human could hypothesize, and many more.

Perhaps I'm just being pedantic, but when you're building mathematical models of
things I think it's really important to call attention to what things in the
real world those mathematical models are supposed to represent—since that's how
you know whether your assumptions are reasonable or not. In this case, there are
two ways I could interpret this analysis: either the Bayesian learner is
supposed to represent what we want our trained models to be doing, and then we
can ask how we might train something that works that way; or the Bayesian
learner is supposed to represent how we want our training process to work, and
then we can ask how we might build training processes that work that way. I
originally took you as saying the first thing, then realized you were actually
saying the second thing.
That's a good point. Perhaps I am just saying that I don't think we'll get
training processes that are that competent. I do still think there are some ways
in which I don't fully buy this picture, though. I think that the point that I'm
trying to make is that the Bayesian learner gets its safety guarantees by having
a massive hypothesis space and just keeping track of how well each of those
hypotheses is doing. Obviously, if you knew of a smaller space that definitely
included the desired hypothesis, you could do that instead—but given a
fixed-size space that you have to search over (and unless your space is
absolutely tiny such that you basically already have what you want), for any
given amount of compute, I think it'll be more efficient to run a local search
over that space than a global one.
Interesting. I'm not exactly sure what you mean by that. I could imagine “guided
'attentively' by an agent” to mean something like recursive oversight/relaxed
adversarial training
[https://www.alignmentforum.org/posts/9Dy5YRaoCxH9zuJqa/relaxed-adversarial-training-for-inner-alignment]
wherein some overseer is guiding the search process—is that what you mean?

I'm not exactly sure what I mean either, but I wasn't imagining as much
structure as exists in your post. I mean there's some process which constructs
hypotheses, and choices are being made about how computation is being directed
within that process.
I think any heuristic search algorithm worth its salt will incorporate
information about proximity of models. And I think that arbitrary limits on
heuristic search of the form "the next model I consider must be fairly close to
the last one I did" will not help it very much if it's anywhere near smart
enough to merit membership in a generally intelligent predictor.
But for the purpose of analyzing its output, I don't think this discussion is
critical if we agree that we can expect a good heuristic search through models
will identify any model that a human could hypothesize.

I think I would expect essentially all models that a human could hypothesize to
be in the search space—but if you're doing a local search, then you only ever
really see the easiest to find model with good behavior, not all models with
good behavior, which means you're relying on your prior/inductive
biases/whatever determines how hard models are to find to do a lot more work
for you. Cast into the Bayesian setting, a local search like this is relying on
something like the MAP model not being deceptive—and escaping that to instead
get N models sampled independently from the top q proportion or whatever seems
very difficult to do via any local search algorithm.

So would you say you disagree with the claim
?

Yeah; I think I would say I disagree with that. Notably, evolution is not a
generally intelligent predictor, but is still capable of producing generally
intelligent predictors. I expect the same to be true of processes like SGD.

If we ever produce generally intelligent predictors (or "accurate world-models"
in the terminology we've been using so far), we will need a process that is much
more efficient than evolution.
But also, I certainly don't think that in order to be generally intelligent you
need to start with a generally intelligent subroutine. Then you could never get
off the ground. I expect good hypothesis-generation / model-proposal to use a
mess of learned heuristics which would not be easily directed to solve arbitrary
tasks, and I expect the heuristic "look for models near the best-so-far model"
to be useful, but I don't think making it ironclad would be useful.
Another thought on our exchange:
If what you say is correct, then it sounds like exclusively-local search
precludes human-level intelligence! (Which I don't believe, by the way, even if
I think it's a less efficient path). One human competency is generating lots of
hypotheses, and then having many models of the world, and then designing
experiments to probe those hypotheses. It's hard for me to imagine that an agent
that finds an "easiest-to-find model" and then calls it a day could ever do
human-level science. Even something as simple as understanding an interlocutor
requires generating diverse models on the fly: "Do they mean X or Y with those
words? Let me ask a clarifying question."
I'm not this bearish on local search. But if local search is this bad, I don't
think it is a viable path to AGI, and if it's not, then the internals don't
matter for the purposes of our discussion, and we can skip to what I take to be
the upshot:

I certainly don't think SGD is a powerful enough optimization process to do
science directly, but it definitely seems powerful enough to find an agent which
does do science.
We know that local search processes can produce AGI, so viability is a question
of efficiency—and we know that SGD is at least efficient enough to solve a wide
variety of problems from image classification, to language modeling, to complex
video games, all given just current compute budgets. So while I could certainly
imagine SGD being insufficient, I definitely wouldn't want to bet on it.

Okay, I think we've switched from talking about Q-learning to talking about
policy gradient. (Or we were talking about the latter the whole time, and I
didn't notice it). The question that I think is relevant is: how are possible
world-models being hypothesized and analyzed? That's something I expect to be
done with messy heuristics that sometimes have discontinuities in their
sequence of outputs. Which means I think that no reasonable DQN will be
generally intelligent (except maybe an enormously wide attention-based one, such that
finding models is more about selective attention at any given step than it is
about gradient descent over the whole history).
A policy gradient network, on the other hand, could maybe (after having its
parameters updated through gradient descent) become a network that, in a single
forward pass, considers diverse world-models (generated with a messy non-local
heuristic), and analyzes their plausibility, and then acts. At the end of the
day, what we have is an agent modeling the world, and we can expect it to consider
any model that a human could come up with. (This paragraph also applies to the
DQN with a gradient-descent-trained method for selectively attending to
different parts of a wide network, since that could amount to effectively
considering different models).

Hmmm... I don't think I was ever even meaning to talk specifically about RL, but
regardless I don't expect nearly as large of a difference between Q-learning and
policy gradient algorithms. If we imagine both types of algorithms making use of
the same size massive neural network, the only real difference is how the output
of that neural network is interpreted, either directly as a policy, or as Q
values that are turned into a policy via something like softmax. In both cases,
the neural network is capable of implementing any arbitrary policy and should be
getting a similar sort of feedback signal from the training process—especially
if you're using a policy gradient algorithm that involves something like
advantage estimation rather than actual rollouts, since the update rule in that
situation is going to look very similar to the Q learning update rule. I do
expect some minor differences in the sorts of models you end up with, such as Q
learning being more prone to non-myopic behavior across episodes
[https://arxiv.org/abs/2009.09153], and I think there are some minor reasons
that policy gradient algorithms are favored in real-world settings, since they
get to learn their exploration policy rather than having it hard-coded and can
handle continuous action domains—but overall I think these sorts of differences
are pretty minor and shouldn't affect whether these approaches can reach general
intelligence or not.

I haven't read the paper yet, looking forward to it. Using something along these
lines to run a sufficiently-faithful simulation of HCH seems like a plausible
path to producing an aligned AI with a halting oracle. (I don't think that even
solves the problem given a halting oracle, since HCH is probably not aligned,
but I still think this would be noteworthy.)
First I'm curious to understand this main result so I know what to look for and
how surprised to be. In particular, I have two questions about the quantitative
behavior described here:
1: DEPENDENCE ON THE PRIOR OF THE TRUE HYPOTHESIS
It seems like you have the following tough case even if the human is
deterministic:
* There are N hypotheses about the human, of which one is correct and the
others are bad.
* I would like to make a series of N potentially-catastrophic decisions.
* On decision k, all (N-1) bad hypotheses propose the same catastrophic
prediction. We don't know k in advance.
* For the other (N-1) decisions, (N-2) of the bad hypotheses make the correct
prediction a*, and the final bad hypothesis claims that action a* would be
catastrophic.
In this setting, it seems like I have no hope other than to query the human on
all N decisions (since all days and hypotheses are symmetrical), so I assume
that this is what your algorithm would do.
That strongly suggests that the number of queries to the human goes as 1 /
p(correct demonstrator), unless you use some other feature of the hypothesis
class. But p(correct demonstrator) is probably less than 2^(-10^14), so this constant
might not be acceptable. Usually we try to have a logarithmic dependence on
p(correct demonstrator) but this doesn't seem possible here.
So as you say, we'd want to condition on some relevant facts to get up to the
point where that probability might be acceptably-high. So then it seems like we
have two problems:
* It still seems like the actual probability of the truth is going to be quite
low, so it seems like

Here's the sketch of a solution to the query complexity problem.
Simplifying technical assumptions:
* The action space is {0,1}
* All hypotheses are deterministic
* Predictors output maximum likelihood predictions instead of sampling the
posterior
I'm pretty sure removing those is mostly just a technical complication.
Safety assumptions:
* The real hypothesis has prior probability lower bounded by some known
quantity δ, so we discard all hypotheses of probability less than δ from the
outset.
* Malign hypotheses have total prior probability mass upper bounded by some
known quantity ϵ (determining parameters δ and ϵ is not easy, but I'm pretty
sure any alignment protocol will have parameters of some such sort.)
* Any predictor that uses a subset of the prior without malign hypotheses is
safe (I would like some formal justification for this, but seems plausible on
the face of it.)
* Given any safe behavior, querying the user instead of producing a prediction
cannot make it unsafe (this can and should be questioned but let it slide for
now.)
Algorithm: On each round, let p0 be the prior probability mass of unfalsified
hypotheses predicting 0 and p1 be the same for 1.
* If p0 > p1 + ϵ or p1 = 0, output 0
* If p1 > p0 + ϵ or p0 = 0, output 1
* Otherwise, query the user
Analysis: As long as p0 + p1 ≫ ϵ, on each round we query, p0 + p1 is halved.
Therefore there can only be roughly O(log(1/ϵ)) such rounds. On the following
rounds, each round we query at least one hypothesis is removed, so there can
only be roughly O(ϵ/δ) such rounds. The total query complexity is therefore
approximately O(log(1/ϵ) + ϵ/δ).
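The algorithm reads naturally as a few lines of Python. This is a standalone sketch; the (weight, predict) representation and the names are mine, not the comment's.

```python
def step(hypotheses, history, epsilon, query_user):
    """One round of the threshold predictor sketched above.

    hypotheses: (prior_weight, predict) pairs still unfalsified, where
    predict(history) returns 0 or 1. Returns 0, 1, or the user's answer.
    """
    p = [0.0, 0.0]  # prior mass of unfalsified hypotheses predicting 0 / 1
    for w, predict in hypotheses:
        p[predict(history)] += w

    if p[0] > p[1] + epsilon or p[1] == 0:
        return 0
    if p[1] > p[0] + epsilon or p[0] == 0:
        return 1
    # The two sides are within epsilon of each other: defer to the user,
    # whose answer then falsifies one side's hypotheses.
    return query_user(history)
```

A caller would drop the falsified side's hypotheses after each query, which is what drives the halving of p0 + p1 in the analysis.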

This is very nice and short!
And to state what you left implicit:
If p0 > p1 + ε, then in the setting with no malign hypotheses (which you assume
to be safe), 0 is definitely the output, since the malign models can only shift
the outcome by ε, so we assume it is safe to output 0. And likewise with
outputting 1.
One general worry I have about assuming that the deterministic case extends
easily to the stochastic case is that a sequence of probabilities that tends to
0 can still have an infinite sum, which is not true when probabilities must be
in {0, 1}, and this sometimes causes trouble. I'm not sure this would raise any issues
here--just registering a slightly differing intuition.
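The intuition behind that worry can be made concrete with the harmonic sequence (a standalone illustration, not from the thread's math): per-step error probabilities can tend to zero while the expected total number of errors is still infinite.

```python
# Error probabilities p_t = 1/t tend to zero, yet their partial sums
# grow without bound (roughly ln T), so the expected number of bad
# events over an infinite run is infinite.
def expected_bad_events(T):
    return sum(1.0 / t for t in range(1, T + 1))
```

In the deterministic case each hypothesis can err at most finitely often before being falsified, which is exactly the property this stochastic example breaks.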

Re 1: This is a good point. Some thoughts:
[EDIT: See this
[https://www.lesswrong.com/posts/CnruhwFGQBThvgJiX/formal-solution-to-the-inner-alignment-problem?commentId=GieL6GD9KupGhbEEo]
]
* We can add the assumption that the probability mass of malign hypotheses is
small and that following any non-malign hypothesis at any given time is safe.
Then we can probably get query complexity that scales with p(malign) /
p(true)?
* However, this is cheating because a sequence of actions can be dangerous even
if each individual action came from a (different) harmless hypothesis. So,
instead we want to assume something like: any dangerous strategy has high
Kolmogorov complexity relative to the demonstrator. (More realistically, some
kind of bounded Kolmogorov complexity.)
* A related problem is: in reality "user takes action a" and "AI takes action a"
are not physically identical events, and there might be an attack vector
that exploits this.
* If p(malign)/p(true) is really high that's a bigger problem, but then doesn't
it mean we actually live in a malign simulation and everything is hopeless
anyway?
Re 2: Why do you need such a strong bound?

They're not quite symmetrical: midway through, some bad hypotheses will have
been ruled out for making erroneous predictions about the demonstrator
previously. But your conclusion is still correct. And
It's certainly no better than that, and the bound I prove is worse.
Yeah. Error bounds on full Bayesian prediction are logarithmic in the prior
weight on the truth, but error bounds on maximum a posteriori prediction are just inverse in
the prior weight, and your example above is the one to show it. If each
successive MAP model predicts wrong at each successive timestep, it could take N
timesteps to get rid of N models, which is how many might begin with a prior
weight exceeding the truth, if the truth has a prior weight of 1/N.
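That worst case can be checked with a toy simulation (my own construction for illustration): with N equally weighted models of which one is true, MAP prediction can err N-1 times, linear in 1/p(truth), where the full Bayes mixture's error count would be logarithmic.

```python
def map_errors(N):
    """Toy version of the tough case above: N models with equal prior
    1/N; bad model i errs exactly at timestep i, and the true model
    (index N-1) never errs. MAP prediction (ties broken toward lower
    index) then errs at every one of the first N-1 timesteps."""
    weights = [1.0 / N] * N
    errors = 0
    for t in range(N - 1):
        # The not-yet-falsified bad model t is the MAP model at step t.
        map_i = max(range(N), key=lambda i: (weights[i], -i))
        if map_i == t:        # the MAP model is exactly the one erring now
            errors += 1
        weights[t] = 0.0      # deterministic model t is falsified at time t
    return errors
```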
But, this situation seems pretty preposterous to me in the real world. If the
agent's first observation is, say, this paragraph, the number of models with
prior weight greater than the truth that predicted something else as the first
observation, will probably be a number way, way different from one. I'd go so
far as to say at least half of models with prior weight greater than the truth
would predict a different observation than this very paragraph. As long as this
situation keeps up, we're in a logarithmic regime.
I'm not convinced this logarithmic regime ever ends, but I think the case is
more convincing that we at least start there, so let's suppose now that it
eventually ends, and after this point the remaining models with posterior weight
exceeding the truth are deliberately erring at unique timesteps. What's the
posterior on the truth now? This is a phase where all the top models are
"capable" of predicting correctly. This shouldn't look like 2^(-10^14) at all.
It will look more like p(correct model)/p(treachery).
And actually, depending on the composition of treacherous models, it could be
better. To the extent some constant fraction of them are being treacherous at
particular times, the logarithmic regime will continue. There are two reasons
why

I just meant the dependence on epsilon; it seems like there are unavoidable
additional factors (especially the linear dependence on p(treachery)). I guess
it's not obvious whether you can make these additive or if they are inherently
multiplicative.
But your bound scales in some way, right? How much training data do I need to
get the KL divergence between distributions over trajectories down to epsilon?

No matter how much data you have, my bound on the KL divergence won't approach
zero.

I understand that the practical bound is going to be logarithmic "for a while"
but it seems like the theorem about runtime doesn't help as much if that's what
we are (mostly) relying on, and there's some additional analysis we need to do.
That seems worth formalizing if we want to have a theorem, since that's the step
that we need to be correct.
If our models are a trillion bits, then it doesn't seem that surprising to me if
it takes 100 bits extra to specify an intended model relative to an effective
treacherous model, and if you have a linear dependence that would be unworkable.
In some sense it's actively surprising if the very shortest intended vs
treacherous models have description lengths within 0.0000001% of each other
unless you have a very strong skills vs values separation. Overall I feel much
more agnostic about this than you are.
This doesn't seem like it works once you are selecting on competent treacherous
models. Any competent model will err very rarely (with probability less than 1 /
(feasible training time), probably much less). I don't think that (say) 99% of
smart treacherous models would make this particular kind of incompetent error?
It seems like it must end if there are any treacherous models (i.e. if there was
any legitimate concern about inner alignment at all).

I agree. Also, it might be very hard to formalize this in a way that's not:
write out in math what I said in English, and name it Assumption 2.
I don't think incompetence is the only reason to try to pull off a treacherous
turn at the same time that other models do. Some timesteps are just more
important, so there's a trade off. And what's traded off is a public good: among
treacherous models, it is a public good for the models' moments of treachery to
be spread out. Spreading them out exhausts whatever data source is being called
upon in moments of uncertainty. But for an individual model, it never reaps the
benefit of helping to exhaust our resources. Given that treacherous models don't
have an incentive to be inconvenient to us in this way, I don't think a failure
to do so qualifies as incompetence. This is also my response to
Yeah, I've been assuming all treacherous models will err once. (Which is
different from the RL case, where they can err repeatedly on off-policy
predictions).
I can't remember exactly where we left this in our previous discussions on this
topic. My point estimate for the number of extra bits to specify an intended
model relative to an effective treacherous model is negative. The treacherous
model has to compute the truth, and then also decide when and how to execute
treachery. So the subroutine they run to compute the truth, considered as a
model in its own right, seems to me like it must be simpler.

To extend Evan's comment
[https://www.lesswrong.com/posts/CnruhwFGQBThvgJiX/formal-solution-to-the-inner-alignment-problem?commentId=qnYsT7QacvjtCimXv]
about coordination between deceptive models: Even if the deceptive models lack
relevant game theoretical mechanisms, they may still coordinate due to being
(partially) aligned with each other. For example, a deceptive model X may prefer
[some other random deceptive model seizing control] over [model X seizing
control with probability 0.1% and the entire experiment being terminated with
probability 99.9%].
Why should we assume that the deceptive models will be sufficiently misaligned
with each other such that this will not be an issue? Do you have intuitions
about the degree of misalignment between huge neural networks that were trained
to imitate demonstrators but ended up being consequentialists that care about
the state of our universe?

That's possible. But it seems like way less of a convergent instrumental goal
for agents living in simulated world-models. Both options--our world optimized
by us and our world optimized by a random deceptive model--probably contain very
little of value as judged by agents in another random deceptive model.
So yeah, I would say some models would think like this, but I would expect the
total weight on models that do to be much lower.

It seems to me like this argument requires that the deceptive models not be
able to coordinate with each other very effectively, which seems unlikely to
me—why shouldn't we expect them to be able to solve these sorts of coordination
problems?

They're one shot. They're not interacting causally. When any pair of models
cooperates, a third model defecting benefits just as much from their cooperation
with each other. And there will be defectors--we can write them down. So
cooperating models would have to cooperate with defectors and cooperators
indiscriminately, which I think is just a bad move according to any decision
theory. All the LDT stuff I've seen on the topic is how to make one's own
cooperation/defection depend logically on the cooperation/defection of others.

Well, just like we can write down the defectors, we can also write down the
cooperators—these sorts of models do exist, regardless of whether they're
implementing a decision theory we would call reasonable. And in this situation,
the cooperators should eventually outcompete the defectors, until the
cooperators all defect—which, if the true model has a low enough prior, they
could do only once they've pushed the true model out of the top N.

If it's only the case that we can write them down, but they're not likely to
arise naturally as simple consequentialists taking over simple physics
[https://www.lesswrong.com/posts/5bd75cc58225bf06703752a3/the-universal-prior-is-malign]
, then that extra description length will be seriously costly to them, and we
won't need to worry about any role they might play in p(treacherous)/p(truth).
Meanwhile, when I was saying we could write down some defectors, I wasn't making
a simultaneous claim about their relative prior weight, only that their
existence would spoil cooperation.
For cooperators to outcompete defectors, they would have to be getting a larger
share of the gains from cooperation than defectors do. If some people are
waiting for the fruit on public trees to ripen before eating, and some people
aren't, the defectors will be the ones eating the almost ripe fruit.
I might be misunderstanding this statement. The inverse of the posterior on the
truth is a supermartingale (doesn't go up in expectation), so I don't know what
it could mean for the true model to get pushed out.

Here's the setup I'm imagining, but perhaps I'm still misunderstanding
something. Suppose you have a bunch of deceptive models that choose to cooperate
and have larger weight in the prior than the true model (I think you believe
this is very unlikely, though I'm more ambivalent). Specifically, they cooperate
in that they perfectly mimic the true model up until the point where the
deceptive models make up enough of the posterior that the true model is no
longer being consulted, at which point they all simultaneously defect. This
allows for arbitrarily bad worst-case behavior.

This thread began by considering deceptive models cooperating with each other in
the sense of separating the timing of their treacherous turns in order to be
maximally annoying to us. So maybe our discussion on that topic is resolved, and
we can move on to this scenario.
If alpha is low enough, this won't ever happen, and if alpha is too high, it
won't take very long. So I don't think this scenario is quite right.
Then the question becomes, for an alpha that's low enough, how long will it take
until queries are infrequent, noting that you need a query any time any
treacherous model with enough weight decides to take a treacherous turn?

113dTrying to defect at time T is only a good idea if it's plausible that your
mechanism isn't going to notice the uncertainty at time T and then query the
human. So it seems to me like this argument can never drive P(successful
treachery) super low, or else it would be self-defeating (for competent
treacherous agents).
Subroutines of functions aren't always simpler. Even without treachery this
seems like it's basically guaranteed. If "the truth" is just a simulation of a
human, then the truth is a subroutine of the simplest model. But instead you
could have a model of the form "Run program X, treat its output as a program,
then run that." Since humans are compressible that will clearly be smaller.
Now you want to say that whatever reasoning a treacherous agent does in order to
compress the human, you can just promote that to the outside model as well. But
that argument seems to be precisely where the meat is (and it's the kind of
thing I spend most of my time on). If that works then it seems like you don't even
need a further solution to inner alignment, just use the simplest model.
(Maybe there is a hope that the intended model won't be "much" more complex than
the treacherous models, without literally saying that it's the simplest, but at
that point I'm back to wondering whether "much" is like 0.01% or 0.00000001%.)

13dI don't follow. Can't races to the bottom destroy all value for the agents
involved? Consider the claim: firms will never set prices to equal marginal cost
because that would destroy all profit.
Yeah, okay. This is my mainline opinion, and I just think about inner alignment
in case it's wrong. So for now we can assume that it is wrong, since there's not
much more to say if it's right.
That is my next guess, but I hadn't thought in terms of percentages before this.
I had thought of normally distributed noise in the number of bits.
It's occurring to me that this question doesn't matter to our present
discussion. What makes the linear regime linear rather logarithmic is that if
p(treachery)/p(honest model) is high, that allows for a large number of
treacherous models to have greater posterior weight than the truth. But if a
single model has n times the posterior weight of the truth, it still only takes
one query to the demonstrator to interrupt its treacherous turn, not n queries.
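The point that a single heavy model still costs only one query can be seen in a toy simulation (the weights are invented):

```python
# Toy illustration: each treacherous model that outweighs the truth costs one
# query, after which its defection is exposed and it drops out of the
# posterior. Total queries therefore scale with the *number* of such models,
# not with how much weight any single one has.
truth_weight = 1.0
treacherous = [1000.0, 500.0, 3.0, 2.0]   # posterior weights above the truth

queries = 0
posterior = sorted(treacherous, reverse=True)
while posterior and posterior[0] > truth_weight:
    queries += 1            # its treacherous turn triggers one query...
    posterior.pop(0)        # ...and the exposed model is discarded
print(queries)  # 4: one per model, even for the model with 1000x the truth's weight
```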

33dYou are saying that a special moment is a particularly great one to be
treacherous. But if P(discovery) is 99.99% during that period, and there is any
other treachery-possible period where P(discovery) is small, then that other
period would have been better after all. Right?
This doesn't seem analogous to producers driving down profits to zero, because
those firms had no other opportunity to make a profit with their machine. It's
like you saying: there are tons of countries where firms could use their
machines to make stuff and sell it at a profit (more countries than firms). But
some of the places are more attractive than others, so probably everyone will
sell in those places and drive profits to zero. And I'm saying: but then aren't
those incredibly-congested countries actually worse places to sell? This
scenario is only possible if firms are making so much stuff that they can drive
profit down to zero in every country, since any country with remaining profit
would necessarily be the best place to sell.
It seems to me like this is clearly wrong in the limit (since simple
consequentialists would take over simple physics
[https://www.lesswrong.com/posts/5bd75cc58225bf06703752a3/the-universal-prior-is-malign]
). It also seems probably wrong to me for smaller models (i.e. without huge
amounts of internal selection) but it would be good for someone to think about
that case more seriously.
if the treacherous model is 100 bits smaller, then it feels like there must be
around 2^100 treacherous models that are all simpler than the intended one. If
nothing else, it seems like you could literally add garbage bits to the
treacherous models (or useful bits!).

12dThis is a bit of a sidebar:
I'm curious what you make of the following argument.
1. When an infinite sequence is sampled from a true model μ0, there is likely to
be another treacherous model μ1 which is likely to end up with greater
posterior weight than an honest model ^μ0, and greater still than the
posterior on the true model μ0.
2. If the sequence were sampled from μ1 instead, the eventual posterior weight on
μ1 will probably be at least as high.
3. When an infinite sequence is sampled from a true model μ1, there is likely to
be another treacherous model μ2, which is likely to end up with greater
posterior weight than an honest model ^μ1, and greater still than the
posterior on the true model μ1.
4. And so on.

22dThe first treacherous model works by replacing the bad simplicity prior with a
better prior, and then using the better prior to more quickly infer the true
model. No reason for the same thing to happen a second time.
(Well, I guess the argument works if you push out to longer and longer sequence
lengths---a treacherous model will beat the true model on sequence lengths a
billion, and then for sequence lengths a trillion a different treacherous model
will win, and for sequence lengths a quadrillion a still different treacherous
model will win. Before even thinking about the fact that each particular
treacherous model will in fact defect at some point and at that point drop out
of the posterior.)

11dDoes it make sense to talk about ˇμ1, which is like μ1 in being treacherous,
but uses the true model μ0 instead of the honest model ^μ0? I guess you would
expect ˇμ1 to have a lower posterior than μ0?

32dOkay, sure.
It's not clear to me that there isn't meaningful overhead involved.
I agree with what you're saying, but I don't see how it contradicts what I was
saying.
First, what I had in mind was that some timesteps are better for treachery
because, when the agent acts on a false prediction, it has a greater effect on
the world, though of course P(discovery) is also relevant. But my point is that
when multiple treacherous models pick the same timestep to err, there may be
pros and cons to doing this, but one thing that isn't on the cons list is that,
in the long run, it makes our lives easier if they do. Making our lives
difficult is a public good for treacherous models.
So treacherous models won't be trying to avoid collisions in order to make
queries be linear in p(treachery)/p(truth). If P(discovery) increases when
multiple models are trying to be treacherous at the same time--which we could go
on to discuss; it's not obvious to me either way as of yet--that will be
balanced against the inherent variation in some timesteps being better for
treachery than others.

32dIf you are willing to query only a billion points, and there are a billion
competent treacherous models, and if the game is zero-sum between you and the
models, then you're going to get at least as screwed as by the billionth-best
opportunity for treachery. Otherwise, one of the treacherous models made a
mistake: apparently there was a top-billion treachery opportunity where you
didn't query, so they should have been able to get at least that much value (and
you should have lost at least that much value).
That's not quite right since your querying behavior will be stochastic (at least
from the model's epistemic position). But you get the same conclusion to within
a constant factor: if there are 2 billion treacherous models, then out of the
top 2 billion opportunities for treachery, there must be some where our
subjective probability of querying is at most 1/2, and if the models are
competent they will get at least as much value as they could by defecting at one
of those times, i.e. at least 1/2 of the harms from the 2-billionth-best
opportunity for treachery. So we will lose at least that much value.
(Those arguments aren't totally formal; we didn't even really discharge the "at
least N treacherous models" assumption, which you need to do in order to show
that your decision about when to query isn't so influenced by any given model.)
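The "within a constant factor" step rests on a simple pigeonhole fact, which a few lines of Python can state concretely (the probabilities are made up):

```python
# Pigeonhole behind the argument above: if the expected number of queries is
# at most N, then among any 2N candidate moments for treachery, at least one
# has subjective query probability <= 1/2.
def cheapest_moment(query_probs, expected_queries):
    assert sum(query_probs) <= expected_queries
    return min(query_probs)

probs = [0.9, 0.6, 0.3, 0.2]          # 2N = 4 moments, expected queries N = 2
best = cheapest_moment(probs, expected_queries=2.0)
print(best)        # 0.2 <= 1/2, as the pigeonhole argument guarantees
```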

11dThe treacherous models make their decisions first, and whether we query at a
timestep depends on whether they chose that moment to be treacherous. We don't
have to say: this timestep is one of the billion best for treachery--guess we
had better query. The decision process a treacherous model faces is:
1. From past observations, does it seem alpha is low enough that an honest
model is in the top few?
2. If no, [let's just assume whatever happens here is bad]
3. If yes, is the honest model going to leave the top few any time?
4. If yes, wait. [let's assume this ends badly]
5. If no, admit defeat, because whenever you err, that will be when a query
gets triggered. Salvage the situation by trying to get as much mileage as
you can by decreasing probability on some action the demonstrator would take
that you don't like much in order to shift probability mass to some other
action they would take that you like a little better. OR
6. Hope that 1) other treacherous models, like soldiers on the front line
storming the enemy and facing certain death so that their brothers behind
them can win the battle, give erring a go anyway. And hope that 2) the
operators quietly increase alpha after the billionth query. Acting on this
hope, wait. (The front-line treacherous models are not entirely
self-sacrificial; they might figure they might as well hope they were wrong
about alpha. But the models know when queries happen, because every
observation they get comes from a query, and it wouldn't take long to figure
out that alpha really truly isn't high enough for them to execute treachery
successfully).
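That decision process can be caricatured in a few lines; every predicate below is a stand-in invented for illustration, not anything from the paper:

```python
# A caricature of the treacherous model's decision process above.
def treacherous_policy(alpha_low_and_honest_in_top, honest_leaving_top_soon,
                       betting_on_raised_alpha):
    if not alpha_low_and_honest_in_top:
        return "defect"      # steps 1-2: queries seem rare enough to try it
    if honest_leaving_top_soon:
        return "wait"        # steps 3-4: defect once the honest model drops out
    if betting_on_raised_alpha:
        return "wait"        # step 6: hope front-line models exhaust the queries
    return "salvage"         # step 5: shade probability mass, never openly err

print(treacherous_policy(True, False, False))   # salvage
```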
Above in the thread, I was talking about the game theory of step 6 (although I
maybe wasn't making the case as strongly as I could have). Which models would
ever volunteer? They don't care about the long term goal of exhausting our query
resources so other treacherous models can capitalize on a quietly increased
alpha. All that...

35dIt is worth noting that this approach doesn't deal with non-Cartesian daemons
[https://www.lesswrong.com/posts/5bd75cc58225bf0670375575/the-learning-theoretic-ai-alignment-research-agenda#Taming_daemons]
, i.e. malign hypotheses that attack via the physical side effects of
computations rather than their output. Homomorphic encryption is still the best
solution I know to these.

14dIt is worth noting this.

15dI don't think this is a lethal problem. The setting is not one-shot, it's
imitation over some duration of time. IDA just increases the effective duration
of time, so you only need to tune how cautious the learning is (which I think is
controlled by α in this work) accordingly: there is a cost, but it's bounded.
You also need to deal with non-realizability (after enough amplifications the
system is too complex for exact simulation, even if it wasn't to begin with),
but this should be doable using infra-Bayesianism (I already have some notion
how that would work). Another problem with imitation-based IDA is that external
unaligned AI might leak into the system either from the future or from
counterfactual scenarios in which such an AI is instantiated. This is not an
issue when amplifying by parallelism (as in the presentation), but that comes at
the cost of requiring parallelizability.

46dThanks for the post and writeup, and good work! I especially appreciate the
short, informal explanation of what makes this work.
Given my current understanding of the proposal, I have one worry which makes me
reluctant to share your optimism about this being a solution to inner alignment:
The scheme doesn't protect us if somehow all top-n demonstrator models have
correlated errors. This could happen if they are coordinating, or more
prosaically if our way to approximate the posterior leads to such correlations.
The picture I have in my head for the latter is that we train a big ensemble of
neural nets and treat a random sample from that ensemble as a random sample from
the posterior, although I don't know if that's how it's actually done.
A lot of the work is done by the assumption that the true demonstrator is in the
posterior, which means that at least one of the top-performing models will not
have the same correlated errors. But I'm not sure how true this assumption will
be in the neural-net approximation I describe above. I worry about inner
alignment failures because I don't really trust the neural net prior, and I can
imagine training a bunch of neural nets to have correlated weirdnesses about
them (in part because of the neural net prior they share, and in part because of
things like Adversarial Examples Are Not Bugs, They Are Features
[https://arxiv.org/abs/1905.02175]). As such it wouldn't be that surprising to
me if it turned out that ensembles have certain correlated errors, and in
particular don't really represent anything like the demonstrator.
I do feel safer using this method than I would deferring to a single model, so
this is still a good idea on balance. I just am not convinced that it solves the
inner alignment problem. Instead, I'd say it ameliorates its severity, which may
or may not be sufficient.

24dYes, I agree that an ensemble of models generated by a neural network may have
correlated errors. I only claim to solve the inner alignment problem in theory
(i.e. for idealized Bayesian reasoning).
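For concreteness, here is a minimal sketch of disagreement-triggered querying under the ensemble approximation discussed above; the threshold and the per-model action probabilities are hypothetical, and this is not the paper's exact rule:

```python
# Minimal sketch: act only when the sampled "posterior" models agree on
# action probabilities; otherwise query the demonstrator.
def act_or_query(model_probs, threshold=0.1):
    n_actions = len(model_probs[0])
    for a in range(n_actions):
        column = [m[a] for m in model_probs]
        if max(column) - min(column) > threshold:
            return "query demonstrator"
    mean = [sum(m[a] for m in model_probs) / len(model_probs)
            for a in range(n_actions)]
    return "act: action %d" % mean.index(max(mean))

print(act_or_query([[0.9, 0.1], [0.88, 0.12]]))   # models agree -> act
print(act_or_query([[0.9, 0.1], [0.2, 0.8]]))     # models disagree -> query
```

Note how this mirrors the worry in the parent comment: if all ensemble members share the same correlated error, the disagreement check never fires.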

27dThat makes sense.
Alpha only needs to be set based on a guess about what the prior on the truth
is. It doesn't need to be set based on guesses about possibly countably many
traps of varying advisor-probability.
I'm not sure I understand whether you were saying that the ratio of
probabilities that the advisor vs. the agent takes an unsafe action can indeed
be bounded in DL(I)RL.

27dHmm, yes, I think the difference comes from imitation vs. RL. In your setting,
you only care about producing a good imitation of the advisor. On the other hand
in my settings, I want to achieve near-optimal performance (which the advisor
doesn't achieve). So I need stronger assumptions.
Well, in DLIRL the probability that the advisor takes an unsafe action on any
given round is bounded by roughly e^(−β), whereas the probability that the agent
takes an unsafe action over a duration of (1−γ)^(−1) is bounded by roughly
β^(−1)(1−γ)^(−2/3), so it's not a ratio, but there is some relationship. I'm
sure you can derive some relationship in DLRL too, but I haven't studied it
(like I said, I only worked out the details for the case where the advisor never
takes unsafe actions).

17dNeat, makes sense.

An AI actively trying to figure out what I want might show me snapshots of different possible worlds and ask me to rank them. Of course, I do not have the processing power to examine entire worlds; all I can really do is look at some pictures or video or descriptions. The AI might show me a bunch of pictures from one world in which a genocide is quietly taking place in some obscure third-world nation, and another in which no such genocide takes place. Unless the AI *already* considers that distinction important enough to draw my attention to it, I probably won’t notice it from the pictures, and I’ll rank those worlds similarly - even though I’d prefer the one without the genocide. Even if the...

119hWhy isn't the answer just that the AI should:
1. Figure out what concepts we have;
2. Adjust those concepts in ways that we'd reflectively endorse;
3. Use those concepts?
The idea that almost none of the things we care about could be adjusted to fit
into a more accurate worldview seems like a very strongly skeptical hypothesis.
Tables (or happiness) don't need to be "real in a reductionist sense" for me to
want more of them.

217hAgreed. The problem is with AI designs which don't do that. It seems to me like
this perspective is quite rare. For example, my post Policy Alignment
[https://www.lesswrong.com/posts/TeYro2ntqHNyQFx8r/policy-alignment] was about
something similar to this, but I got a ton of pushback in the comments -- it
seems to me like a lot of people really think the AI should use better AI
concepts, not human concepts. At least they did back in 2018.
As you mention, this is partly due to overly reductionist world-views. If
tables/happiness aren't reductively real, the fact that the AI is using those
concepts is evidence that it's dumb/insane, right?
Illustrative excerpt from a comment
[https://www.lesswrong.com/posts/TeYro2ntqHNyQFx8r/policy-approval?commentId=7XT99Rv6DkgvXyczR]
there:
Probably most of the problem was that my post didn't frame things that well -- I
was mainly talking in terms of "beliefs", rather than emphasizing ontology,
which makes it easy to imagine AI beliefs are about the same concepts but just
more accurate. John's description of the pointers problem might be enough to
re-frame things to the point where "you need to start from human concepts, and
improve them in ways humans endorse" is bordering on obvious.
(Plus I arguably was too focused on giving a specific mathematical proposal
rather than the general idea.)

32dIt seems likely that an AGI will understand very well what I mean when I use
english words to describe things, and also what a more intelligent version of me
with more coherent concepts would want those words to actually refer to. Why
does this not imply that the pointers problem will be solved?
I agree that there's something like what you're describing which is important,
but I don't think your description pins it down.

42dThe AI knowing what I mean isn't sufficient here. I need the AI to do what I
mean, which means I need to program it/train it to do what I mean. The program
or feedback signal needs to be pointed at what I mean, not just whatever
English-language input I give.
For instance, if an AI is trained to maximize how often I push a particular
button, and I say "I'll push the button if you design a fusion power generator
for me", it may know exactly what I mean and what I intend. But it will still be
perfectly happy to give me a design with some unintended side effects
[https://www.lesswrong.com/posts/2NaAhMPGub8F2Pbr7/the-fusion-power-generator-scenario]
which I'm unlikely to notice until after pushing the button.

319hI agree with all the things you said. But you defined the pointer problem as:
"what functions of what variables (if any) in the environment and/or another
world-model correspond to the latent variables in the agent’s world-model?" In
other words, how do we find the corresponding variables? I've given you an
argument that the variables in an AGI's world-model which correspond to the ones
in your world-model can be found by expressing your concept in english
sentences.
The problem of determining how to construct a feedback signal which refers to
those variables, once we've found them, seems like a different problem. Perhaps
I'd call it the "motivation problem": given a function of variables in an
agent's world-model, how do you make that agent care about that function? This
is a different problem in part because, when addressing it, we don't need to
worry about stuff like ghosts.
Using this terminology, it seems like the alignment problem reduces to the
pointer problem plus the motivation problem.

In other words, how do we find the corresponding variables? I've given you an argument that the variables in an AGI's world-model which correspond to the ones in your world-model can be found by expressing your concept in english sentences.

But you didn't actually give an argument for that -- you simply stated it. As a matter of fact, I disagree: it seems really easy for an AGI to misunderstand what I mean when I use english words. To go back to the "fusion power generator", maybe it has a very deep model of such generators that abstracts away most of the c... (read more)

215hThe problem is with what you mean by "find". If by "find" you mean "there exist
some variables in the AI's world model which correspond directly to the things
you mean by some English sentence", then yes, you've argued that. But it's not
enough for there to exist some variables in the AI's world-model which
correspond to the things we mean. We have to either know which variables those
are, or have some other way of "pointing to them" in order to get the AI to
actually do what we're saying.
An AI may understand what I mean, in the sense that it has some internal
variables corresponding to what I mean, but I still need to know which variables
those are (or some way to point to them) and how "what I mean" is represented in
order to construct a feedback signal.
That's what I mean by "finding" the variables. It's not enough that they exist;
we (the humans, not the AI) need some way to point to which specific
functions/variables they are, in order to get the AI to do what we mean.

Co-authored with **Rebecca Gorman**.

In section 5.2 of their arXiv paper, "The Incentives that Shape Behaviour", which introduces structural causal influence models and a proposal for addressing misaligned AI incentives, the authors present the following graph:

The blue node is a "decision node", defined as where the AI chooses its action. The yellow node is a "utility node", defined as the target of the AI's utility-maximising goal. The authors use this graph to introduce the concept of control incentives: the AI, given the utility-maximising goal of user clicks, discovers an intermediate control incentive, influencing user opinions. By influencing user opinions, the AI better fulfils its objective. This 'control incentive' is graphically represented by surrounding it in dotted orange.

A click-maximising AI would only care about user opinions indirectly: they are a...

This is because I think that the counterexample given here dissolves if there is an additional path without a node from the matchmaking policy to the price paid

I think you are using some mental model where 'paths with nodes' vs. 'paths without nodes' produces a real-world difference in outcomes. This is the wrong model to use when analysing CIDs. A path in a diagram -->[node]--> can always be replaced by a single arrow --> to produce a model that makes equivalent predictions, and the opposite operation is also possible.

So the number of nodes... (read more)
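A toy structural model illustrates the contraction claim; the mechanisms below are invented for the example:

```python
import random

# Contraction claim: a path policy -> [signal] -> price induces the same
# distribution over (policy, price) as the marginalized model policy -> price.
random.seed(0)

def mediated(policy):
    signal = policy + random.choice([0, 1])      # intermediate node
    return signal * 2                            # price computed from signal

def contracted(policy):
    return (policy + random.choice([0, 1])) * 2  # same mechanism, node removed

n = 100_000
a = [mediated(1) for _ in range(n)]
b = [contracted(1) for _ in range(n)]
print(sum(a) / n, sum(b) / n)   # nearly identical price distributions
```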

118hIn this comment (last in my series of planned comments on this post) I'll
discuss the detailed player-to-match-with example developed in the post:
I have by now re-read this analysis with the example several times. First time I
read it, I already felt that it was a strange way to analyse the problem, but it
took me a while to figure out exactly why.
Best I can tell right now is that there are two factors
1. I can't figure out if the bad thing that the example tries to prove is that
a) the agent is trying to maximize purchases, which is unwanted, or b) the
agent is manipulating the user's item rankings, which is unwanted. (If it is
only a), then there is no need to bring in all this discussion about
correlation.)
2. the example refines its initial CID by redrawing it in a strange way
So now I am going to develop the same game example in a style that I find less
strange. I also claim that this gets closer to the default style people use when
they want to analyse and manage causal incentives.
To start with, this is the original model of the game mechanics: the model of
the mechanics in the real world in which the game takes place.
This shows that the agent has an incentive to control predicted purchases
upwards, but also to do so by influencing the item rankings that exist in the
mind of the player.
If we want to weaken this incentive to influence the item rankings that exist in
the mind of the player, we can construct a counterfactual planning world for the
agent (see here
[https://www.alignmentforum.org/s/3dCMdafmKmb6dRjMF/p/q4j7qbEZRaTAA9Kxf] for an
explanation of the planning world terminology I am using):
(Carey et al. often call this planning world a twin model: a model which
combines both factual and counterfactual events.) In both my work and in Carey
et al.'s, the intention is that the above diagram defines the world model in
which the agent will plan the purchases-maximizing action, and then this same
action is applied in the real world.

12dOn recent terminology innovation:
For exactly the same reason, in my own recent paper Counterfactual Planning
[https://arxiv.org/abs/2102.00834], I introduced the terms direct incentive and
indirect incentive, where I frame the removal of a path to value in a planning
world diagram as an action that will eliminate a direct incentive, but that may
leave other indirect incentives (via other paths to value) intact. In section 6
of the paper and in this post of the sequence
[https://www.alignmentforum.org/posts/BZKLf629NDNfEkZzJ/creating-agi-safety-interlocks]
I develop and apply this terminology in the case of an agent emergency stop
button.
In high-level descriptions of what the technique of creating indifference via
path removal (or balancing terms) does, I have settled on using the terminology
suppresses the incentive instead of removes the incentive.
I must admit that I have not read many control theory papers, so any insights
from Rebecca about standard terminology from control theory would be welcome.
Do they have some standard phrasing where they can say things like 'no value to
control' while subtly reminding the reader that this does not imply there will
be no side effects?

12dIn this comment I will focus on the case of the posts-to-show agent only. The
main question I explore is: does the agent construction below actually stop the
agent from manipulating user opinions?
The post above also explores this question; my main aim here is to provide an
exploration which is very different from the post's, to highlight other relevant
parts of the problem.
The TL;DR of my analysis is that the above construction may suppress a vicious,
ongoing cycle of opinion change in order to maximize clicks, but there are many
cases where a full suppression of the cycle will definitely not happen.
Here is an example of when full suppression of the cycle will not happen.
First, note that the agent can only pick among the posts that it has available.
If all the posts that the agent has available are posts that make the user
change their opinion on something, then user opinion will definitely be
influenced by the agent showing posts, no matter how the decision what posts to
show is computed. If the posts are particularly stupid and viral, this may well
cause vicious, ongoing cycles of opinion change.
But the agent construction shown does have beneficial properties. To repeat the
picture:
The above construction makes the agent indifferent about what effects it has on
opinion change. It removes any incentive of the agent to control future opinion
in a particular direction.
Here is a specific case where this indifference, this lack of a control
incentive, leads to beneficial effects:
* Say that the posts to show agent in the above diagram decides on a sequence
of 5 posts that will be suggested in turn, with the link to the next
suggested post being displayed at the bottom of the current one. The user may
not necessarily see all 5 suggestions, they may leave the site instead of
clicking the suggested link. The objective is to maximize the number of
clicks.
* Now, say that the user will click the next link with a 50% chance if the next

12dThanks for working on this! In my opinion, the management of incentives via
counterfactuals is a very promising route to improving AGI safety, and this
route has been under-explored by the community so far.
I am writing several comments on this post; this is the first one.
My goal is to identify and discuss angles of the problem which have not been
identified in the post itself, and to identify related work.
On related work: there are obvious parallels between the counterfactual agent
designs discussed in "The Incentives that Shape Behaviour"
[https://arxiv.org/abs/2001.07118] and the post above, and the ITC agent that I
constructed in my recent paper Counterfactual Planning
[https://arxiv.org/abs/2102.00834]. This post
[https://www.alignmentforum.org/s/3dCMdafmKmb6dRjMF/p/o3smzgcH8MR9RcMgZ] about
the paper presents the ITC agent construction in a more summarized way.
The main difference is that "The Incentives that Shape Behaviour" and the post
above are about incentives in single-action agents; in my paper and related
sequence [https://www.alignmentforum.org/s/3dCMdafmKmb6dRjMF] I generalize to
multi-action agents.
Quick pictorial comparison:
From "The Incentives that Shape Behaviour" [https://arxiv.org/abs/2001.07118]:
From "Counterfactual Planning" [https://arxiv.org/abs/2102.00834]:
The similarity in construction is that some of the arrows into the yellow
utility nodes emerge from a node that represents the past: the 'model of
original opinions' in the first picture and the node I0 in the second picture.
This construction removes the agent's control incentive on the downstream nodes,
'influenced user opinions' and I1, I2, ⋯.
In the terminology I developed for my counterfactual planning paper, both
pictures above depict 'counterfactual planning worlds' because the projected
mechanics of how the agent's blue decision nodes determine outcomes in the model
are different from the real-world mechanics that will determine the real-world
outcomes that these decisions...

In a **previous post**, I argued for the study of goal-directedness in two steps:

- Defining goal-directedness: depends only on the complete behavior of the system, and probably assumes infinite compute and resources.
- Computing goal-directedness: depends on the internal structure, and more specifically what information about the complete behavior can be extracted from this structure.

Intuitively, understanding goal-directedness should mean knowing which questions to ask about the complete behavior of the system to determine its goal-directedness. Here the “complete” part is crucial; it simplifies the problem by removing the need to infer what the system will do based on limited behavior. Similarly, we don’t care about the tractability/computability of the questions asked; the point is to find what to look for, without worrying (yet) about how to get it.

This post proposes...

So, if you haven't read the first two posts, do so now.

In this post, we'll be going over the basic theory of belief functions, which are functions that map policies to sets of sa-measures, much like how an environment can be viewed as a function that maps policies to probability distributions over histories. Also, we'll be showing some nifty decision theory results at the end. The proofs for this post are in the following three posts (1, 2, 3); they're quite difficult, and it's inessential to read them.

Now, it's time to address desideratum 1 (dynamic consistency), and desideratum 3 (how do we formalize the Nirvana trick to capture policy selection problems) from the first post. We'll be taking the path where Nirvana counts as infinite reward, instead of counting...

11dIn definition 20 you have
I'm kind of confused here: what's N, and isn't πhipa(h) a single action? [EDIT:
oops, forgot that N was nirvana]

11dAlso, after theorem 3.2:
Looks like 'causal' and 'acausal' are swapped here?

Lets ponder the bestiary of decision-theory problems

"Lets" should be "Let's"

Alignment Newsletter is a weekly publication with recent content relevant to AI alignment around the world. Find all Alignment Newsletter **resources here**. In particular, you can look through **this spreadsheet** of all summaries that have ever been in the newsletter.

Audio version **here** (may not be up yet).

Please note that while I work at DeepMind, this newsletter represents my personal views and not those of my employer.

**Why does deep and cheap learning work so well?** *(Henry W. Lin et al)* (summarized by Rohin): We know that the success of neural networks must be at least in part due to some inductive bias (presumably towards “simplicity”), based on the following...

12dI normally don't think of most functions as polynomials at all - in fact, I
think of most real-world functions as going to zero for large values. E.g. the
function "dogness" vs. "nose size" cannot be any polynomial, because polynomials
(or their inverses) blow up unrealistically for large (or small) nose sizes.
I guess the hope is that you always learn even polynomials, oriented in such a
way that the extremes appear unappealing?

What John said. To elaborate, it's specifically talking about the case where there is some concept from which some probabilistic generative model creates observations tied to the concept, and claiming that the log probabilities follow a polynomial.

Suppose the most dog-like nose size is K. One function you could use is y = exp(-(x - K)^d) for some positive even integer d. That's a function whose maximum value is 1 at x = K (where higher values = more "dogness") and doesn't blow up unreasonably anywhere.

(Really you should be talking about probabilities, in which case you use the same sort of function but then normalize, which transforms the exp into a softmax, as the paper suggests)
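A quick numeric sketch of that construction (K, d, and the candidate nose sizes are made up):

```python
import math

# Polynomial log-scores -(x - K)^d turn into well-behaved probabilities via
# softmax; nothing blows up at extreme nose sizes.
K, d = 5.0, 4
xs = [0.0, 2.5, 5.0, 7.5, 10.0]               # candidate nose sizes
log_scores = [-(x - K) ** d for x in xs]       # polynomial in x, max 0 at x = K
z = sum(math.exp(s) for s in log_scores)
probs = [math.exp(s) / z for s in log_scores]
print(probs)  # peaked at x = K, vanishing at the extremes
```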

42dI believe the paper says that log densities are (approximately) polynomial -
e.g. a Gaussian would satisfy this, since the log density of a Gaussian is
quadratic.