All Posts


May 2020

Shortform
4 · Vanessa Kosoy · 3mo

This idea was inspired by a correspondence with Adam Shimi. It seems very interesting and important to understand to what extent a purely "behaviorist" view on goal-directed intelligence is viable. That is, given a certain behavior (policy), is it possible to tell whether the behavior is goal-directed and what its goals are, without any additional information?

Consider a general reinforcement learning setting: we have a set of actions A, a set of observations O, a policy is a mapping π:(A×O)∗→ΔA, a reward function is a mapping r:(A×O)∗→[0,1], and the utility function is a time-discounted sum of rewards. (Alternatively, we could use instrumental reward functions [https://www.alignmentforum.org/posts/aAzApjEpdYwAxnsAS/reinforcement-learning-with-imperceptible-rewards].)

The simplest attempt at defining "goal-directed intelligence" is requiring that the policy π in question is optimal for some prior and utility function. However, this condition is vacuous: the reward function can artificially reward only behavior that follows π, or the prior can believe that behavior not according to π leads to some terrible outcome.

The next natural attempt is bounding the description complexity of the prior and reward function, in order to avoid priors and reward functions that are "contrived". However, description complexity is only naturally well-defined up to an additive constant. So, if we want to have a crisp concept, we need to consider an asymptotic in which the complexity of something goes to infinity. Indeed, it seems natural to ask that the complexity of the policy should be much higher than the complexity of the prior and the reward function: in this case we can say that the "intentional stance" is an efficient description. However, this doesn't make sense with description complexity: the description "optimal policy for U and ζ" is of size K(U)+K(ζ)+O(1) (K(x) stands for "description complexity of x"). To salvage this idea, we need to take not only description complexity
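A minimal formal restatement of the obstruction in the last paragraph above (a sketch in my own notation, writing π*_{U,ζ} for a policy optimal under utility function U and prior ζ):

```latex
% Desired criterion: the intentional stance compresses the policy, i.e.
%   K(\pi) \gg K(U) + K(\zeta).
% Obstruction: "compute the optimal policy for U and \zeta" is itself a
% description of that policy, so plain description complexity gives
\[
  K\!\left(\pi^{*}_{U,\zeta}\right) \;\le\; K(U) + K(\zeta) + O(1),
\]
% so the desired inequality can never hold for an exactly optimal policy,
% and a resource other than description length is needed to salvage it.
```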
2 · Alex Turner · 3mo

From FLI's AI Alignment Podcast: Inverse Reinforcement Learning and Inferring Human Preferences with Dylan Hadfield-Menell [https://futureoflife.org/2018/04/04/podcast-ai-systems-learning-human-preferences/?cn-reloaded=1]:

Consider the optimizer/optimized distinction: the AI assistant is better described as optimized either to help you or to stop you from winning the game. This optimization may or may not have been carried out by a process which is "aligned" with you; I think that ascribing intent alignment to the assistant's creator makes more sense. In the adversarial heuristic case, intent alignment seems unlikely. But this also feels like passing the buck – hoping that at some point in history, there existed something to which we are comfortable ascribing alignment and responsibility.
3 · Rohin Shah · 3mo

The LCA paper [https://arxiv.org/abs/1909.01440] (to be summarized in AN #98) presents a method for understanding the contribution of specific updates to specific parameters to the overall loss. The basic idea is to decompose the overall change in training loss across training iterations:

$$L(\theta_T) - L(\theta_0) = \sum_t \left[ L(\theta_t) - L(\theta_{t-1}) \right]$$

And then to decompose each per-iteration change across specific parameters:

$$L(\theta_t) - L(\theta_{t-1}) = \int_{\vec{\theta}_{t-1}}^{\vec{\theta}_t} \vec{\nabla}_{\vec{\theta}} L(\vec{\theta}) \cdot d\vec{\theta}$$

I've added vector arrows to emphasize that θ is a vector and that we are taking a dot product. This is a path integral, but since gradients form a conservative field, we can choose any arbitrary path. We'll be choosing the linear path throughout. We can rewrite the integral as the dot product of the change in parameters and the average gradient:

$$L(\theta_t) - L(\theta_{t-1}) = (\theta_t - \theta_{t-1}) \cdot \mathrm{Average}_{t-1}^{t}\!\left(\nabla L(\theta)\right)$$

(This is pretty standard, but I've included a derivation at the end.) Since this is a dot product, it decomposes into a sum over the individual parameters:

$$L(\theta_t) - L(\theta_{t-1}) = \sum_i \left(\theta_t^{(i)} - \theta_{t-1}^{(i)}\right) \mathrm{Average}_{t-1}^{t}\!\left(\nabla L(\theta)\right)^{(i)}$$

So, for an individual parameter and an individual training step, we can define the contribution to the change in loss as

$$A_t^{(i)} = \left(\theta_t^{(i)} - \theta_{t-1}^{(i)}\right) \mathrm{Average}_{t-1}^{t}\!\left(\nabla L(\theta)\right)^{(i)}$$

Based on this, I'm going to define my own version of LCA, called LCA_Naive. Suppose the gradient computed at training iteration t is G_t (which is a vector). LCA_Naive uses the approximation

$$\mathrm{Average}_{t-1}^{t}\!\left(\nabla L(\theta)\right) \approx G_{t-1},$$

giving

$$A_{t,\mathrm{Naive}}^{(i)} = \left(\theta_t^{(i)} - \theta_{t-1}^{(i)}\right) G_{t-1}^{(i)}.$$

But the SGD update is given by θ_t^{(i)} = θ_{t-1}^{(i)} − αG_{t-1}^{(i)} (where α is the learning rate), which implies that

$$A_{t,\mathrm{Naive}}^{(i)} = \left(-\alpha G_{t-1}^{(i)}\right) G_{t-1}^{(i)} = -\alpha \left(G_{t-1}^{(i)}\right)^2,$$

which is always negative, i.e. it predicts that every parameter always learns in every iteration. This isn't surprising -- we decomposed the improvement in training into the movement of parameters along the gradient direction, but moving along the gradient direction is exactly what we do to train! Yet the experiments in the paper sometimes show positive LCAs. What's up with that?
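As a toy illustration of why LCA_Naive can never produce a positive contribution (a sketch of my own, not code from the LCA paper; the function name `lca_naive_step` is made up for this example):

```python
import numpy as np

def lca_naive_step(theta_prev, grad_prev, lr):
    """One SGD step plus the LCA_Naive per-parameter loss contributions.

    LCA_Naive approximates the average gradient over the step by the
    gradient at theta_{t-1}, so each contribution is
    A_t^{(i)} = (theta_t - theta_{t-1})^{(i)} * G_{t-1}^{(i)}.
    """
    theta_next = theta_prev - lr * grad_prev   # plain SGD update
    delta = theta_next - theta_prev            # equals -lr * grad_prev
    contributions = delta * grad_prev          # elementwise: -lr * grad_prev**2
    return theta_next, contributions

# Toy check: every contribution is -lr * g^2, hence never positive.
rng = np.random.default_rng(0)
theta = rng.normal(size=5)
grad = rng.normal(size=5)
theta, contribs = lca_naive_step(theta, grad, lr=0.1)
print(contribs)                      # all entries <= 0
print(bool(np.all(contribs <= 0)))   # True
```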
1 · Ben Pace · 3mo

I've been thinking lately that picturing an AI catastrophe is helped a great deal by visualising a world where critical systems in society are performed by software. I was spending a while trying to summarise and analyse Paul's "What Failure Looks Like", which led me this way.

I think that properly imagining such a world is immediately scary, because software can deal with edge cases badly (automated market traders causing major crashes, for example), so that's already a big deal. Then you add ML in, and can talk about how crazy it is to hand critical systems over to code we do not understand and cannot make simple adjustments to; at that point you're already hitting catastrophes. Once you argue that ML can become superintelligent, everything goes from "global catastrophe" to "obvious end of the world", but the first steps are already pretty helpful.

While Paul's post helps a lot, it still takes a fair bit of effort for me to concretely visualise the scenarios he describes, and I would be excited for people to take the time to detail what it would look like to hand critical systems over to software – for which systems this would happen, why we would do it, who the decision-makers would be, what it would feel like from the average citizen's vantage point, etc. A smaller version of Hanson's Age of Em project, just asking the question: "Which core functions in society (food, housing, healthcare, law enforcement, governance, etc.) are amenable to tech companies building solutions for, and what would it look like for society to transition to having 1%, 10%, 50% and 90% of core functions automated with 1) human-coded software, 2) machine learning, 3) human-level general AI?"