May 2020

Shortform

Vanessa Kosoy (3mo):

This idea was inspired by a correspondence with Adam Shimi.
It seems very interesting and important to understand to what extent a purely
"behaviorist" view of goal-directed intelligence is viable. That is, given a
certain behavior (policy), is it possible to tell whether the behavior is
goal-directed, and what its goals are, without any additional information?
Consider a general reinforcement learning setting: we have a set of actions A
and a set of observations O; a policy is a mapping π:(A×O)*→ΔA, a reward
function is a mapping r:(A×O)*→[0,1], and the utility function is a
time-discounted sum of rewards. (Alternatively, we could use instrumental
reward functions
[https://www.alignmentforum.org/posts/aAzApjEpdYwAxnsAS/reinforcement-learning-with-imperceptible-rewards].)
The simplest attempt at defining "goal-directed intelligence" is to require
that the policy π in question be optimal for some prior and utility function.
However, this condition is vacuous: the reward function can artificially reward
only behavior that follows π, or the prior can believe that behavior not
according to π leads to some terrible outcome.
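To make the first failure mode concrete, here is a minimal Python sketch (my own hypothetical code, not from the post; the policy is deterministic for simplicity) of a reward function contrived so that an arbitrary fixed policy is trivially optimal:

```python
# Sketch of the "vacuousness" point: given any fixed policy pi
# (deterministic here for simplicity), this reward function pays 1 only
# on histories that follow pi exactly, so pi is optimal by construction.

def make_contrived_reward(pi):
    """pi maps a history, i.e. a tuple of (action, observation) pairs,
    to an action. Returns a reward function on histories."""
    def reward(history):
        for i, (action, _observation) in enumerate(history):
            if action != pi(tuple(history[:i])):
                return 0.0  # deviated from pi at step i
        return 1.0  # every action so far agrees with pi
    return reward

# Usage: a policy that always plays action 0, regardless of history.
pi = lambda history: 0
r = make_contrived_reward(pi)
print(r(((0, "obs"), (0, "obs"))))  # 1.0: history follows pi
print(r(((0, "obs"), (1, "obs"))))  # 0.0: history deviates at step 1
```

Note that the description of this reward function is barely longer than the description of π itself, which is what motivates the complexity-bounding fix below.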
The next natural attempt is bounding the description complexity of the prior
and reward function, in order to avoid priors and reward functions that are
"contrived". However, description complexity is only naturally well-defined up
to an additive constant. So, if we want a crisp concept, we need to consider
an asymptotic regime in which the complexity of something goes to infinity.
Indeed, it seems natural to require that the complexity of the policy be much
higher than the complexity of the prior and the reward function: in this case
we can say that the "intentional stance" is an efficient description. However,
this doesn't work with description complexity alone: the description "the
optimal policy for U and ζ" has size K(U)+K(ζ)+O(1), where K(x) stands for the
description complexity of x, so an exactly optimal policy can never be much
more complex than its prior and utility function.
To salvage this idea, we need to take not only description complexity but also
computational complexity into account.

(3mo):

From FLI's AI Alignment Podcast: Inverse Reinforcement Learning and Inferring
Human Preferences with Dylan Hadfield-Menell
[https://futureoflife.org/2018/04/04/podcast-ai-systems-learning-human-preferences/?cn-reloaded=1]:
Consider the optimizer/optimized distinction: the AI assistant is better
described as optimized to either help you or stop you from winning the game.
This optimization may or may not have been carried out by a process which is
"aligned" with you; I think that ascribing intent alignment to the assistant's
creator makes more sense. In the adversarial-heuristic case, intent alignment
seems unlikely.
But, this also feels like passing the buck – hoping that at some point in
history, there existed something to which we are comfortable ascribing alignment
and responsibility.

(3mo):

The LCA paper [https://arxiv.org/abs/1909.01440] (to be summarized in AN #98)
presents a method for understanding how updates to specific parameters
contribute to the overall change in loss. The basic idea is to first decompose
the overall change in training loss across training iterations:
$$L(\theta_T) - L(\theta_0) = \sum_t \left[ L(\theta_t) - L(\theta_{t-1}) \right]$$
And then to decompose each per-iteration change in loss across specific
parameters:
$$L(\theta_t) - L(\theta_{t-1}) = \int_{\vec{\theta}_{t-1}}^{\vec{\theta}_t} \vec{\nabla}_{\vec{\theta}} L(\vec{\theta}) \cdot d\vec{\theta}$$
I've added vector arrows to emphasize that θ is a vector and that we are taking
a dot product. This is a path integral, but since gradients form a conservative
field, we can choose an arbitrary path; we'll use the linear path throughout.
We can rewrite the integral as the dot product of the change in parameters and
the average gradient along the path:
$$L(\theta_t) - L(\theta_{t-1}) = (\theta_t - \theta_{t-1}) \cdot \mathrm{Average}_{t-1}^{t}\left(\nabla L(\theta)\right)$$
(This is pretty standard, but I've included a derivation at the end.)
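For concreteness, a minimal sketch of that derivation, parameterizing the linear path as θ(s) = θ_{t-1} + s(θ_t − θ_{t-1}) for s ∈ [0, 1]:

$$\int_{\theta_{t-1}}^{\theta_t} \nabla L(\theta) \cdot d\theta = \int_0^1 \nabla L(\theta(s)) \cdot \frac{d\theta(s)}{ds}\, ds = (\theta_t - \theta_{t-1}) \cdot \int_0^1 \nabla L(\theta(s))\, ds,$$

where the last integral is exactly the average gradient along the path.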
Since this is a dot product, it decomposes into a sum over the individual
parameters:
$$L(\theta_t) - L(\theta_{t-1}) = \sum_i \left(\theta_t^{(i)} - \theta_{t-1}^{(i)}\right) \mathrm{Average}_{t-1}^{t}\left(\nabla L(\theta)\right)^{(i)}$$
So, for an individual parameter and an individual training step, we can define
the contribution to the change in loss as
$$A_t^{(i)} = \left(\theta_t^{(i)} - \theta_{t-1}^{(i)}\right) \mathrm{Average}_{t-1}^{t}\left(\nabla L(\theta)\right)^{(i)}$$
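As an illustration, here is a minimal numpy sketch of this decomposition (my own code; the quadratic loss, function names, and n_samples are hypothetical, not from the paper). It approximates the path-averaged gradient by sampling the gradient at evenly spaced points along the linear path between consecutive parameter checkpoints:

```python
import numpy as np

# Hypothetical quadratic loss for illustration; the decomposition itself
# applies to any differentiable loss.
def loss(theta):
    return 0.5 * np.sum(theta ** 2)

def grad(theta):
    return theta  # gradient of the quadratic loss above

def lca_contributions(theta_prev, theta_next, n_samples=10):
    """Per-parameter contributions A_t^(i) to the change in loss, using an
    n_samples-point approximation of the path-averaged gradient along the
    linear path from theta_prev to theta_next."""
    delta = theta_next - theta_prev
    # Average the gradient at evenly spaced points on the linear path.
    avg_grad = np.mean(
        [grad(theta_prev + s * delta) for s in np.linspace(0, 1, n_samples)],
        axis=0,
    )
    return delta * avg_grad  # elementwise: one contribution per parameter

theta_prev = np.array([1.0, -2.0, 0.5])
theta_next = theta_prev - 0.1 * grad(theta_prev)  # one SGD step, lr = 0.1
A = lca_contributions(theta_prev, theta_next)
print(A, A.sum(), loss(theta_next) - loss(theta_prev))  # sum ≈ actual change
```

For this quadratic loss the decomposition is exact (the average of a linear gradient over the path is its midpoint value), so the per-parameter contributions sum to precisely the measured change in loss.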
So based on this, I'm going to define my own version of LCA, called LCA_Naive.
Suppose the gradient computed at training iteration $t$ is $G_t$ (a vector).
LCA_Naive uses the approximation $\mathrm{Average}_{t-1}^{t}(\nabla L(\theta)) \approx G_{t-1}$, giving
$$A_{t,\mathrm{Naive}}^{(i)} = \left(\theta_t^{(i)} - \theta_{t-1}^{(i)}\right) G_{t-1}^{(i)}.$$
But the SGD update is given by $\theta_t^{(i)} = \theta_{t-1}^{(i)} - \alpha G_{t-1}^{(i)}$
(where $\alpha$ is the learning rate), which implies
$$A_{t,\mathrm{Naive}}^{(i)} = \left(-\alpha G_{t-1}^{(i)}\right) G_{t-1}^{(i)} = -\alpha \left(G_{t-1}^{(i)}\right)^2,$$
which is never positive, i.e. it predicts that every parameter always learns in
every iteration. This isn't surprising -- we decomposed the improvement in
training into the movement of parameters along the gradient direction, but
moving along the gradient direction is exactly what we do to train!
Yet, the experiments in the paper sometimes show positive LCAs. What's up with
that?
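To illustrate one way this can happen, here is a tiny continuation of the sketch above (same hypothetical quadratic loss and helper functions): when a too-large step overshoots the minimum, the exact path-averaged contribution is positive even though LCA_Naive is always non-positive.

```python
# Continuing the sketch above: with a deliberately too-large learning
# rate, the step overshoots the minimum, so the true contribution A_t is
# positive, while LCA_Naive (which substitutes the starting gradient for
# the path average) is still negative.
alpha = 3.0  # hypothetical, deliberately too-large learning rate
theta_prev = np.array([1.0])
theta_next = theta_prev - alpha * grad(theta_prev)  # overshoot: 1.0 -> -2.0

A_true = lca_contributions(theta_prev, theta_next)      # ≈ +1.5 (loss rose)
A_naive = (theta_next - theta_prev) * grad(theta_prev)  # = -3.0 (never > 0)
print(A_true, A_naive)
```

More generally, any gap between $G_{t-1}$ and the true path-averaged gradient (curvature along the path, minibatch noise, momentum) can presumably flip the sign of a contribution.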

(3mo):

I've been thinking lately that picturing an AI catastrophe is helped a great
deal by visualising a world where critical systems in society are run by
software. I was spending a while trying to summarise and analyse Paul's "What
Failure Looks Like", which led me this way. I think that properly imagining
such a world is immediately scary, because software can handle edge cases
badly -- automated market traders causing major crashes, for example -- so
that's already a big deal. Then you add ML in, and can talk about how crazy it
is to hand critical systems over to code we do not understand and cannot make
simple adjustments to; then you're already hitting catastrophes. Once you then
argue that ML can become superintelligent, everything goes from "global
catastrophe" to "obvious end of the world", but the first steps are already
pretty helpful.
While Paul's post helps a lot, it still takes a fair bit of effort for me to
concretely visualise the scenarios he describes, and I would be excited for
people to take the time to detail what it would look like to hand critical
systems over to software – for which systems this would happen, why we would
do it, who the decision-makers would be, what it would feel like from the
average citizen's vantage point, etc. A smaller version of Hanson's Age of Em
project, just asking the question: "Which core functions in society (food,
housing, healthcare, law enforcement, governance, etc.) are amenable to tech
companies building solutions for, and what would it look like for society to
transition to having 1%, 10%, 50%, and 90% of core functions automated by
1) human-coded software, 2) machine learning, 3) human-level general AI?"