All Posts

Sorted by New

July 2020

No posts for this month
Shortform
8Alex Turner5dI think instrumental convergence also occurs in the model space for machine learning. For example, many different architectures likely learn edge detectors in order to minimize classification loss on MNIST. But wait - you'd also learn edge detectors to maximize classification loss on MNIST (loosely, getting 0% on a multiple-choice exam requires knowing all of the right answers). I bet you'd learn these features for a wide range of cost functions. I wonder if that's already been empirically investigated? And, same for adversarial features. And perhaps, same for mesa optimizers (understanding how to stop mesa optimizers from being instrumentally convergent seems closely related to solving inner alignment). What can we learn about this?
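If someone wanted to check the edge-detector claim empirically, one minimal sketch (not from the shortform above) would be to train two small MNIST models, one minimizing and one maximizing cross-entropy, and compare their first-layer filters. The tiny architecture, step count, and best-match cosine similarity below are all arbitrary illustrative choices, and maximizing cross-entropy can be numerically unstable, so treat this as a starting point rather than a real experiment.

```python
# Sketch: do first-layer features converge across opposite objectives on MNIST?
# Architecture, hyperparameters, and the cosine-matching metric are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

def make_model():
    return nn.Sequential(
        nn.Conv2d(1, 8, kernel_size=5), nn.ReLU(),
        nn.MaxPool2d(2), nn.Flatten(),
        nn.Linear(8 * 12 * 12, 10),
    )

def train(sign, steps=500):
    """sign=+1 minimizes cross-entropy; sign=-1 maximizes it (may be unstable)."""
    data = datasets.MNIST("data", train=True, download=True,
                          transform=transforms.ToTensor())
    loader = DataLoader(data, batch_size=128, shuffle=True)
    model = make_model()
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    it = iter(loader)
    for _ in range(steps):
        try:
            x, y = next(it)
        except StopIteration:
            it = iter(loader)
            x, y = next(it)
        loss = sign * F.cross_entropy(model(x), y)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return model[0].weight.detach().flatten(1)  # (8, 25) first-layer filters

f_min, f_max = train(+1), train(-1)
# For each "minimize" filter, cosine similarity to its best-matching "maximize" filter.
sims = F.cosine_similarity(f_min.unsqueeze(1), f_max.unsqueeze(0), dim=-1)
print("best-match |cosine| per filter:", sims.abs().max(dim=1).values)
```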
1Alex Turner2dTransparency Q: how hard would it be to ensure a neural network doesn't learn any explicit NANDs?

May 2020

Shortform
4Vanessa Kosoy2moThis idea was inspired by a correspondence with Adam Shimi. It seems very interesting and important to understand to what extent a purely "behaviorist" view on goal-directed intelligence is viable. That is, given a certain behavior (policy), is it possible to tell whether the behavior is goal-directed and what its goals are, without any additional information? Consider a general reinforcement learning setting: we have a set of actions $A$, a set of observations $O$, a policy is a mapping $\pi : (A \times O)^* \to \Delta A$, a reward function is a mapping $r : (A \times O)^* \to [0,1]$, and the utility function is a time-discounted sum of rewards. (Alternatively, we could use instrumental reward functions [https://www.alignmentforum.org/posts/aAzApjEpdYwAxnsAS/reinforcement-learning-with-imperceptible-rewards].) The simplest attempt at defining "goal-directed intelligence" is requiring that the policy $\pi$ in question is optimal for some prior and utility function. However, this condition is vacuous: the reward function can artificially reward only behavior that follows $\pi$, or the prior can believe that behavior not according to $\pi$ leads to some terrible outcome. The next natural attempt is bounding the description complexity of the prior and reward function, in order to avoid priors and reward functions that are "contrived". However, description complexity is only naturally well-defined up to an additive constant. So, if we want to have a crisp concept, we need to consider an asymptotic in which the complexity of something goes to infinity. Indeed, it seems natural to ask that the complexity of the policy should be much higher than the complexity of the prior and the reward function: in this case we can say that the "intentional stance" is an efficient description. However, this doesn't make sense with description complexity: the description "optimal policy for $U$ and $\zeta$" is of size $K(U) + K(\zeta) + O(1)$ ($K(x)$ stands for "description complexity of $x$"). To salvage this idea, we need to take not only description complexity
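A minimal restatement of that obstruction (my notation, not from the original shortform): writing $\pi^*_{U,\zeta}$ for a policy optimal for utility function $U$ and prior $\zeta$,

$$K\left(\pi^*_{U,\zeta}\right) \le K(U) + K(\zeta) + O(1),$$

so the requirement $K(\pi) \gg K(U) + K(\zeta)$ can never hold for a policy that is exactly optimal for some $U$ and $\zeta$; this is why the (truncated) last sentence above points beyond plain description complexity.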
2Alex Turner2moFrom FLI's AI Alignment Podcast: Inverse Reinforcement Learning and Inferring Human Preferences with Dylan Hadfield-Menell [https://futureoflife.org/2018/04/04/podcast-ai-systems-learning-human-preferences/?cn-reloaded=1]. Consider the optimizer/optimized distinction: the AI assistant is better described as optimized to either help or stop you from winning the game. This optimization may or may not have been carried out by a process which is "aligned" with you; I think that ascribing intent alignment to the assistant's creator makes more sense. In the adversarial-heuristic case, intent alignment seems unlikely. But this also feels like passing the buck – hoping that at some point in history, there existed something to which we are comfortable ascribing alignment and responsibility.
3Rohin Shah2moThe LCA paper [https://arxiv.org/abs/1909.01440] (to be summarized in AN #98) presents a method for understanding the contribution of specific updates to specific parameters to the overall loss. The basic idea is to decompose the overall change in training loss across training iterations: $L(\theta_T) - L(\theta_0) = \sum_t L(\theta_t) - L(\theta_{t-1})$. And then to decompose training loss across specific parameters: $L(\theta_t) - L(\theta_{t-1}) = \int_{\vec{\theta}_{t-1}}^{\vec{\theta}_t} \vec{\nabla}_{\vec{\theta}} L(\vec{\theta}) \cdot d\vec{\theta}$. I've added vector arrows to emphasize that $\theta$ is a vector and that we are taking a dot product. This is a path integral, but since gradients form a conservative field, we can choose any arbitrary path. We'll be choosing the linear path throughout. We can rewrite the integral as the dot product of the change in parameters and the average gradient: $L(\theta_t) - L(\theta_{t-1}) = (\theta_t - \theta_{t-1}) \cdot \mathrm{Average}^{t}_{t-1}(\nabla L(\theta))$. (This is pretty standard, but I've included a derivation at the end.) Since this is a dot product, it decomposes into a sum over the individual parameters: $L(\theta_t) - L(\theta_{t-1}) = \sum_i (\theta^{(i)}_t - \theta^{(i)}_{t-1}) \, \mathrm{Average}^{t}_{t-1}(\nabla L(\theta))^{(i)}$. So, for an individual parameter and an individual training step, we can define the contribution to the change in loss as $A^{(i)}_t = (\theta^{(i)}_t - \theta^{(i)}_{t-1}) \, \mathrm{Average}^{t}_{t-1}(\nabla L(\theta))^{(i)}$. So based on this, I'm going to define my own version of LCA, called $\mathrm{LCA}_{\mathrm{Naive}}$. Suppose the gradient computed at training iteration $t$ is $G_t$ (which is a vector). $\mathrm{LCA}_{\mathrm{Naive}}$ uses the approximation $\mathrm{Average}^{t}_{t-1}(\nabla L(\theta)) \approx G_{t-1}$, giving $A^{(i)}_{t,\mathrm{Naive}} = (\theta^{(i)}_t - \theta^{(i)}_{t-1}) G^{(i)}_{t-1}$. But the SGD update is given by $\theta^{(i)}_t = \theta^{(i)}_{t-1} - \alpha G^{(i)}_{t-1}$ (where $\alpha$ is the learning rate), which implies that $A^{(i)}_{t,\mathrm{Naive}} = (-\alpha G^{(i)}_{t-1}) G^{(i)}_{t-1} = -\alpha (G^{(i)}_{t-1})^2$, which is always negative, i.e. it predicts that every parameter always learns in every iteration. This isn't surprising -- we decomposed the improvement in training into the movement of parameters along the gradient direction, but moving along the gradient direction is exactly what we do to train! Yet, the experiments in the paper sometimes show positive LCAs. What's up with that?
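A tiny numerical sketch (my own illustration, not from the newsletter) of why $\mathrm{LCA}_{\mathrm{Naive}}$ is never positive under plain SGD: the per-parameter contribution $(\theta^{(i)}_t - \theta^{(i)}_{t-1}) G^{(i)}_{t-1}$ collapses to $-\alpha (G^{(i)}_{t-1})^2$. The quadratic loss, dimensions, and learning rate below are arbitrary assumptions.

```python
# Sketch: per-parameter LCA_Naive contributions under one SGD step.
# Loss, dimensions, and learning rate are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(5, 5))
H = A @ A.T + np.eye(5)            # random positive-definite quadratic loss L(θ) = ½ θᵀHθ
theta = rng.normal(size=5)
alpha = 0.1

grad = H @ theta                   # G_{t-1} = ∇L(θ_{t-1})
theta_next = theta - alpha * grad  # SGD update

# LCA_Naive: contribution of parameter i ≈ (θ_t^(i) - θ_{t-1}^(i)) * G_{t-1}^(i)
contributions = (theta_next - theta) * grad
print(contributions)                                 # elementwise equal to -alpha * grad**2
print(np.allclose(contributions, -alpha * grad**2))  # True: never positive
```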
1Ben Pace2moI've been thinking lately that picturing an AI catastrophe is helped a great deal by visualising a world where critical systems in society are performed by software. I was spending a while trying to summarise and analyse Paul's "What Failure Looks Like", which led me this way. I think that properly imagining such a world is immediately scary, because software can deal with edge cases badly (automated market traders causing major crashes, for example), so that's already a big deal. Then you add ML in and can talk about how crazy it is to hand critical systems over to code we do not understand and cannot make simple adjustments to, and you're already hitting catastrophes. Once you then argue that ML can become superintelligent, everything goes from "global catastrophe" to "obvious end of the world", but the first steps are already pretty helpful. While Paul's post helps a lot, it still takes a fair bit of effort for me to concretely visualise the scenarios he describes, and I would be excited for people to take the time to detail what it would look like to hand critical systems over to software – for which systems would this happen, why would we do it, who would be the decision-makers, what would it feel like from the average citizen's vantage point, etc. A smaller version of Hanson's Age of Em project, just asking the question "Which core functions in society (food, housing, healthcare, law enforcement, governance, etc) are amenable to tech companies building solutions for, and what would it look like for society to transition to having 1%, 10%, 50% and 90% of core functions automated with 1) human-coded software 2) machine learning 3) human-level general AI?"

April 2020

Shortform
3G Gordon Worley III3moI get worried about things like this article [https://medium.com/@PartnershipAI/aligning-ai-to-human-values-means-picking-the-right-metrics-855859e6f047] that showed up on the Partnership on AI blog. Reading it, there's nothing I can really object to in the body of the post: it's mostly about narrow AI alignment and promotes a positive message of targeting things that benefit society rather than narrowly maximizing a simple metric. Yet it's titled "Aligning AI to Human Values means Picking the Right Metrics", which implies to me a normative claim that reads in my head something like "to build aligned AI it is necessary and sufficient to pick the right metrics", which is something I think few would agree with. If I were a casual observer just reading the title of this post, I might come away with the impression that AI alignment is as easy as just optimizing for something prosocial, not that there are lots of hard problems to be solved to even get AI to do what you want, let alone to pick something beneficial to humanity to do. To be fair, this article has a standard "not necessarily the views of PAI, etc." disclaimer, but the author [https://www.partnershiponai.org/team/jonathan-stray/] is a research fellow at PAI. This makes me a bit nervous about the effect of PAI on promoting AI safety in industry, especially if it effectively downplays it or makes it seem easier than it is in ways that either encourage or fail to curtail risky behavior in the use of AI in industry.
4Rohin Shah3moThe LESS is More paper [https://arxiv.org/abs/2001.04465] (summarized in AN #96 [https://www.alignmentforum.org/s/dT7CKGXwq9vt76CeX/p/YyKKMeCCxnzdohuxj]) makes the claim that using the Boltzmann model in sparse regions of demonstration-space will lead to the Boltzmann model over-learning. I found this plausible but not obvious, so I wanted to check it myself. (Partly I got nerd-sniped, partly I do want to keep practicing my ability to tell when things are formalizable theorems.) This benefited from discussion with Andreea (one of the primary authors). Let's consider a model where there are clusters $\{c_i\}$, where each cluster contains trajectories whose features are identical, $c_i = \{\tau : \phi(\tau) = \phi_{c_i}\}$ (which also implies rewards are identical). Let $c(\tau)$ denote the cluster that $\tau$ belongs to. The Boltzmann model says $p(\tau \mid \theta) = \frac{\exp(R_\theta(\tau))}{\sum_{\tau'} \exp(R_\theta(\tau'))}$. The LESS model says $p(\tau \mid \theta) = \frac{\exp(R_\theta(c(\tau)))}{\sum_{c'} \exp(R_\theta(c'))} \cdot \frac{1}{|c(\tau)|}$, that is, the human chooses a cluster noisily based on the reward, and then uniformly at random chooses a trajectory from within that cluster. (Note that the paper does something more suited to realistic situations where we have a similarity metric instead of these "clusters"; I'm introducing them as a simpler situation where we can understand what's going on formally.) In this model, a "sparse region of demonstration-space" is a cluster $c$ with small cardinality $|c|$, whereas a dense one has large $|c|$. Let's first do some preprocessing. We can rewrite the Boltzmann model as follows: $p(\tau \mid \theta) = \frac{\exp(R_\theta(\tau))}{\sum_{\tau'} \exp(R_\theta(\tau'))} = \frac{\exp(R_\theta(c(\tau)))}{\sum_{c'} \exp(R_\theta(c')) \cdot |c'|} = \frac{|c(\tau)| \cdot \exp(R_\theta(c(\tau)))}{\sum_{c'} |c'| \cdot \exp(R_\theta(c'))} \cdot \frac{1}{|c(\tau)|}$. This allows us to write both models as first selecting a cluster, and then choosing randomly within the cluster: $p(\tau \mid \theta) = \frac{p(c(\tau)) \cdot \exp(R_\theta(c(\tau)))}{\sum_{c'} p(c') \exp(R_\theta(c'))} \cdot \frac{1}{|c(\tau)|}$, where for LESS $p(c)$ is uniform, i.e. $p(c) \propto 1$, whereas for Boltzmann $p(c) \propto |c|$, i.e. a denser cluster is more likely to be sampled. So now let us return to the original claim that the Boltzmann model overlear
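A small numerical sketch (my own, not from the shortform) of the cluster-level comparison above: both models pick a cluster and then a trajectory uniformly within it, but Boltzmann weights clusters by $|c| \cdot \exp(R)$ while LESS weights them by $\exp(R)$ alone. The rewards and cluster sizes below are made-up assumptions chosen to isolate the sparsity effect.

```python
# Sketch: per-trajectory likelihoods under the Boltzmann vs. LESS models,
# with made-up cluster rewards and cluster sizes (all values are assumptions).
import numpy as np

R = np.array([1.0, 1.0])      # reward of each cluster (identical, to isolate the size effect)
sizes = np.array([1, 100])    # cluster 0 is sparse, cluster 1 is dense

# Boltzmann: p(cluster) ∝ |c| * exp(R(c)); LESS: p(cluster) ∝ exp(R(c)).
boltz_cluster = sizes * np.exp(R) / np.sum(sizes * np.exp(R))
less_cluster = np.exp(R) / np.sum(np.exp(R))

# Both models then pick uniformly within the chosen cluster.
boltz_traj = boltz_cluster / sizes
less_traj = less_cluster / sizes

print("p(trajectory) in sparse cluster: Boltzmann =", boltz_traj[0], " LESS =", less_traj[0])
print("p(trajectory) in dense cluster:  Boltzmann =", boltz_traj[1], " LESS =", less_traj[1])
# Here Boltzmann assigns every individual trajectory the same probability,
# while LESS gives much higher probability to the lone trajectory in the sparse cluster.
```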
2Alex Turner2moWe can imagine aliens building a superintelligent agent which helps them get what they want. This is a special case of aliens inventing tools. What kind of general process should these aliens use – how should they go about designing such an agent? Assume that these aliens want things in the colloquial sense (not that they’re eg nontrivially VNM EU maximizers) and that a reasonable observer would say they’re closer to being rational than antirational. Then it seems[1] like these aliens eventually steer towards reflectively coherent rationality (provided they don’t blow themselves to hell before they get there): given time, they tend to act to get what they want, and act to become more rational. But, they aren’t fully “rational”, and they want to build a smart thing that helps them. What should they do? In this situation, it seems like they should build an agent which empowers them & increases their flexible control over the future, since they don’t fully know what they want now. Lots of flexible control means they can better error-correct and preserve value for what they end up believing they actually want. This also protects them from catastrophe and unaligned competitor agents. -------------------------------------------------------------------------------- 1. I don’t know if this is formally and literally always true, I’m just trying to gesture at an intuition about what kind of agentic process these aliens are. ↩︎
2G Gordon Worley III3moAs I work towards becoming less confused about what we mean when we talk about values, I find that it feels a lot like I'm working on a jigsaw puzzle where I don't know what the picture is. Also, all the pieces have been scattered around the room and I have to find them first, digging between couch cushions and looking under the rug and behind the bookcase, let alone figure out how they fit together or what they fit together to describe. Yes, we have some pieces already, and others think they know (infer, guess) what the picture is from those (it's a bear! it's a cat! it's a woman in a fur coat!). As I work I find it helpful to keep updating my own guess, because even when it's wrong it sometimes helps me think of new ways to try combining the pieces, or to know what pieces might be missing that I should go look for. But it also often feels like I'm failing all the time, because I'm updating rapidly based on new information and that keeps changing my best guess. I suspect this is a common experience for folks working on problems in AI safety and many other complex problems, so I figured I'd share this metaphor I recently hit on for making sense of what it is like to do this kind of work.
1ricraz3moThere's some possible world in which the following approach to interpretability works:
* Put an AGI in a bunch of situations where it is sometimes incentivised to lie and sometimes incentivised to tell the truth.
* Train a lie detector which is given all of its neural weights as input.
* Then ask the AGI lots of questions about its plans.
One problem that this approach would face if we were using it to interpret a human is that the human might not consciously be aware of what their motivations are. For example, they may believe they are doing something for altruistic reasons, when in fact their unconscious motivations are primarily to look good. And the motivations which we are less conscious of are exactly the ones which it's most disadvantageous for others to know about. So would using such an interpretability technique on an AGI work? I guess one important question is something like: by default, would the AGI be systematically biased when talking about its plans, like humans are? Or is this something which only arises when there are selection pressures during training for hiding information? One way we could avoid this problem: instead of a "lie detector", you could train a "plan identifier", which takes an AGI brain and tells you what that AGI is going to do in English. I'm a little less optimistic about this, since I think that gathering training data will be the big bottleneck either way, and getting enough data to train a plan identifier that's smart enough to generalise to a wide range of plans seems pretty tricky. (By contrast, the lie detector might not need to know very much about the *content* of the lies.)
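If one wanted to prototype the "lie detector" idea at toy scale, a sketch might look like the following: train a simple probe to classify episodes where the model was incentivised to answer truthfully versus deceptively, using its internals as input. Everything here is an assumption for illustration – the synthetic "activations", the logistic-regression probe, and the framing of lying as a clean binary label – and it operates on activations rather than full weights, which is a deliberate simplification of the proposal above.

```python
# Toy sketch of a "lie detector" probe: classify truthful vs. deceptive episodes
# from a model's internal activations. The activations here are synthetic stand-ins.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n, d = 2000, 64

# Pretend these are hidden-layer activations recorded while the system answered
# questions; assume deceptive episodes shift activations along some direction.
truth_acts = rng.normal(0.0, 1.0, size=(n, d))
lie_direction = rng.normal(size=d)
lie_acts = rng.normal(0.0, 1.0, size=(n, d)) + 0.5 * lie_direction

X = np.vstack([truth_acts, lie_acts])
y = np.array([0] * n + [1] * n)   # 0 = incentivised to tell the truth, 1 = to lie

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("held-out accuracy of the probe:", probe.score(X_te, y_te))
```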
