All of rohinmshah's Comments + Replies

My research methodology

I agree this involves discretion [...] So instead I'm doing some in between thing

Yeah, I think I feel like that's the part where I don't think I could replicate your intuitions (yet).

I don't think we disagree; I'm just noting that this methodology requires a fair amount of intuition / discretion, and I don't feel like I could do this myself. This is much more a statement about what I can do, rather than a statement about how good the methodology is on some absolute scale.

(Probably I could have been clearer about this in the original opinion.)

My research methodology

In some sense you could start from the trivial story "Your algorithm didn't work and then something bad happened." Then the "search for stories" step is really just trying to figure out if the trivial story is plausible. I think that's pretty similar to a story like: "You can't control what your model thinks, so in some new situation it decides to kill you."

To fill in the details more:

Assume that we're finding an algorithm to train an agent with a sufficiently large action space (i.e. we don't get safety via the agent having such a restricted action space ... (read more)

4Paul Christiano3dThat's basically where I start, but then I want to try to tell some story about why it kills you, i.e. what is it about the heuristic H and circumstance C that causes it to kill you? I agree this involves discretion, and indeed moving beyond the trivial story "The algorithm fails and then it turns out you die" requires discretion, since those stories are certainly plausible. The other extreme would be to require us to keep making the story more and more concrete until we had fully specified the model, which also seems intractable. So instead I'm doing some in between thing, which is roughly like: I'm allowed to push on the story to make it more concrete along any axis, but I recognize that I won't have time to pin down every axis so I'm basically only going to do this a bounded number of times before I have to admit that it seems plausible enough (so I can't fill in a billion parameters of my model one by one this way; what's worse, filling in those parameters would take even more than a billion time and so this may become intractable even before you get to a billion).
My research methodology

Planned summary for the Alignment Newsletter:

This post outlines a simple methodology for making progress on AI alignment. The core idea is to alternate between two steps:

1. Come up with some alignment algorithm that solves the issues identified so far

2. Try to find some plausible situation in which either a) the resulting AI system is misaligned or b) the AI system is not competitive.

This is all done conceptually, so step 2 can involve fairly exotic scenarios that probably won't happen. Given such a scenario, we need to argue why no failure in the same cla

... (read more)

From my perspective, there is a core reason for worry, which is something like "you can't fully control what patterns of thought your algorithm learns, and how they'll behave in new circumstances", and it feels like you could always apply that as your step 2

That doesn't seem like it has quite the type signature I'm looking for. I'm imagining a story as a description of how something bad happens, so I want the story to end with "and then something bad happens."

In some sense you could start from the trivial story "Your algorithm didn't work and then something ... (read more)

How do scaling laws work for fine-tuning?

I don't think similarly-sized transformers would do much better and might do worse. Section 3.4 shows that large models trained from scratch massively overfit to the data. I vaguely recall the authors saying that similarly-sized transformers tended to be harder to train as well.

How do scaling laws work for fine-tuning?

Does this mean that this fine-tuning process can be thought of as training a NN that is 3 OOMs smaller, and thus needs 3 OOMs fewer training steps according to the scaling laws?

My guess is that the answer is mostly yes (maybe not the exact numbers predicted by existing scaling laws, but similar ballpark).

how does that not contradict the scaling laws for transfer described here and used in this calculation by Rohin?

I think this is mostly irrelevant to timelines / previous scaling laws for transfer:

  1. You still have to pretrain the Transformer, which will take
... (read more)
2Daniel Kokotajlo7dThanks! Your answer no. 2 is especially convincing to me; I didn't realize the authors used smaller models as the comparison--that seems like an unfair comparison! I would like to see how well these 0.1%-tuned transformers do compared to similarly-sized transformers trained from scratch.
Coherence arguments imply a force for goal-directed behavior

Yes, that's basically right.

You think I take the original argument to be arguing from ‘has goals’ to ‘has goals’, essentially, and agree that that holds, but don’t find it very interesting/relevant.

Well, I do think it is an interesting/relevant argument (because as you say it explains how you get from "weakly has goals" to "strongly has goals"). I just wanted to correct the misconception about what I was arguing against, and I wanted to highlight the "intelligent" --> "weakly has goals" step as a relatively weak step in our current arguments. (In my ori... (read more)

5KatjaGrace3dI wrote an AI Impacts page [https://aiimpacts.org/what-do-coherence-arguments-imply-about-the-behavior-of-advanced-ai/] summary of the situation as I understand it. If anyone feels like looking, I'm interested in corrections/suggestions (either here or in the AI Impacts feedback box).
Coherence arguments imply a force for goal-directed behavior

Thanks, that's helpful. I'll think about how to clarify this in the original post.

4Rob Bensinger13dMaybe changing the title would prime people less to have the wrong interpretation? E.g., to 'Coherence arguments require that the system care about something'. Even just 'Coherence arguments do not entail goal-directed behavior' might help, since colloquial "imply" tends to be probabilistic, but you mean math/logic "imply" instead. Or 'Coherence theorems do not entail goal-directed behavior on their own'.
Coherence arguments imply a force for goal-directed behavior

You're mistaken about the view I'm arguing against. (Though perhaps in practice most people think I'm arguing against the view you point out, in which case I hope this post helps them realize their error.) In particular:

Whatever things you care about, you are best off assigning consistent numerical values to them and maximizing the expected sum of those values

If you start by assuming that the agent cares about things, and your prior is that the things it cares about are "simple" (e.g. it is very unlikely to be optimizing the-utility-function-that-makes-the... (read more)

A few quick thoughts on reasons for confusion:

I think maybe one thing going on is that I already took the coherence arguments to apply only in getting you from weakly having goals to strongly having goals, so since you were arguing against their applicability, I thought you were talking about the step from weaker to stronger goal direction. (I’m not sure what arguments people use to get from 1 to 2 though, so maybe you are right that it is also something to do with coherence, at least implicitly.)

It also seems natural to think of ‘weakly has goals’ as some... (read more)

Thanks. Let me check if I understand you correctly:

You think I take the original argument to be arguing from ‘has goals’ to ‘has goals’, essentially, and agree that that holds, but don’t find it very interesting/relevant.

What you disagree with is an argument from ‘anything smart’ to ‘has goals’, which seems to be what is needed for the AI risk argument to apply to any superintelligent agent.

Is that right?

If so, I think it’s helpful to distinguish between ‘weakly has goals’ and ‘strongly has goals’:

  1. Weakly has goals: ‘has some sort of drive toward something,
... (read more)
Introduction To The Infra-Bayesianism Sequence

But for more general infradistributions this need not be the case. For example, consider X := {0,1} and take the set of a-measures generated by 3δ_0 and δ_1. Suppose you start with 1/2 dollars and can bet any amount on any outcome at even odds. Then the optimal bet is betting 1/4 dollars on the outcome 1, with a value of 3/4 dollars.

I guess my question is more like: shouldn't there be some aspect of reality that determines what my set of a-measures is? It feels like here we're finding a set of a-measures... (read more)

3Vanessa Kosoy13dIIUC your question can be reformulated as follows: a crisp infradistribution can be regarded as a claim about reality (the true distribution is inside the set), but it's not clear how to generalize this to non-crisp. Well, if you think in terms of desiderata, then crisp says: if the distribution is inside the set then we have some lower bound on expected utility (and if it's not then we don't promise anything). On the other hand non-crisp gives a lower bound that is variable with the true distribution. We can think of non-crisp infradistributions as being fuzzy properties of the distribution (hence the name "crisp"). In fact, if we restrict ourselves to either of homogenous, cohomogenous or c-additive infradistributions, then we actually have a formal way to assign membership functions to infradistributions, i.e. literally regard them as fuzzy sets of distributions (which ofc have to satisfy some property analogous to convexity).
My research methodology

Cool, that makes sense, thanks!

My AGI Threat Model: Misaligned Model-Based RL Agent

Planned summary for the Alignment Newsletter:

This post lays out a pathway by which an AI-induced existential catastrophe could occur. The author suggests that AGI will be built via model-based reinforcement learning: that is, given a reward function, we will learn a world model, a value function, and a planner / actor. These will learn online, that is, even after being deployed these learned models will continue to be updated by our learning algorithm (gradient descent, or whatever replaces it). Most research effort will be focused on learning these models

... (read more)
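To make the architecture in this summary concrete, here is a minimal sketch of a model-based RL agent with online learning; the class and method names are illustrative stand-ins, not anything specified in the post:

```python
# A minimal sketch of the setup described above: a fixed reward function plus a learned
# world model, value function, and planner/actor, all of which keep updating after deployment.

class ModelBasedRLAgent:
    def __init__(self, world_model, value_fn, planner, reward_fn):
        self.world_model = world_model  # learned: predicts next state from (state, action)
        self.value_fn = value_fn        # learned: estimates long-run value of a state
        self.planner = planner          # chooses actions by searching against the model
        self.reward_fn = reward_fn      # fixed reward signal supplied by the designers

    def act(self, state):
        # Plan against the learned model and value estimates, not against reality.
        return self.planner.choose(state, self.world_model, self.value_fn)

    def observe(self, state, action, next_state):
        # Online learning: the learned components continue to be updated in deployment.
        reward = self.reward_fn(next_state)
        self.world_model.update(state, action, next_state)
        self.value_fn.update(state, target=reward + self.value_fn.estimate(next_state))
```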
Against evolution as an analogy for how humans will create AGI

If an AGI learned the skill of speaking english during training, but then learned the skill of speaking french during deployment, then your hypotheses imply that the implementations of those two language skills will be totally different. And it then gets weirder if they overlap - e.g. if an AGI learns a fact during training which gets stored in its weights, and then reads a correction later on during deployment, do those original weights just stay there?

Idk, this just sounds plausible to me. I think the hope is that the weights encode more general reasonin... (read more)

3Steve Byrnes17dYes this post is about the process by which AGI is made, i.e. #2. (See "I want to be specific about what I’m arguing against here."...) I'm not sure what you mean by "literal natural selection", but FWIW I'm lumping together outer-loop optimization algorithms regardless of whether they're evolutionary or gradient descent or downhill-simplex or whatever.
My research methodology

I'm super on board with this general methodology, at least at a high level. (Counterexample guided loops are great.) I think my main question is, how do you tell when a failure story is sufficiently compelling that you should switch back into algorithm-finding mode?

For example, I feel like with iterated amplification, a bunch of people (including you, probably) said early on that it seems like a hard case to do e.g. translation between languages with people who only know one of the languages, or to reproduce brilliant flashes of insight. (Iirc, the transla... (read more)

High level point especially for folks with less context: I stopped doing theory for a while because I wanted to help get applied work going, and now I'm finally going back to doing theory for a variety of reasons; my story is definitely not that I'm transitioning back from applied work to theory because I now believe the algorithms aren't ready.

I think my main question is, how do you tell when a failure story is sufficiently compelling that you should switch back into algorithm-finding mode?

I feel like a story is basically plausible until proven implausibl... (read more)

My research methodology

These are both cases of counterexample-guided techniques. The basic idea is to solve "exists x: forall y: P(x, y)" statements according to the following algorithm:

  1. Choose some initial x, and initialize a set Y = {}.
  2. Solve "exists y: not P(x, y)". If unsolvable, you're done. If not, take the discovered y and put it in Y.
  3. Solve "exists x: forall y in Y: P(x, y)" and set the solution as your new x.
  4. Go to step 2.

The reason this is so nice is because you've taken a claim with two quantifiers and written an algorithm that must only ever solve claims with one quantif... (read more)
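Here is a minimal runnable sketch of that loop over small finite domains; the brute-force `solve_exists` helper and the toy property are my own illustrative stand-ins for a real solver:

```python
# Counterexample-guided solving of "exists x: forall y: P(x, y)" over finite domains.

def solve_exists(domain, predicate):
    """Brute-force stand-in for a one-quantifier solver: return a witness or None."""
    for candidate in domain:
        if predicate(candidate):
            return candidate
    return None

def counterexample_guided(xs, ys, P, x0):
    x, Y = x0, []                                             # step 1: initial x, empty Y
    while True:
        y = solve_exists(ys, lambda y: not P(x, y))           # step 2: search for a counterexample
        if y is None:
            return x                                          # no y breaks P(x, y): done
        Y.append(y)
        x = solve_exists(xs, lambda x: all(P(x, y) for y in Y))  # step 3: handle all known y's
        if x is None:
            return None                                       # no x survives the counterexamples
        # step 4: loop back to step 2

# Toy usage: find x in 0..9 with x >= y for all y in 0..5 (returns 5).
print(counterexample_guided(range(10), range(6), lambda x, y: x >= y, x0=0))
```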

Introduction To The Infra-Bayesianism Sequence

If you use the Anti-Nirvana trick, your agent just goes "nothing matters at all, the foe will mispredict and I'll get -infinity reward" and rolls over and cries since all policies are optimal. Don't do that one, it's a bad idea.

Sorry, I meant the combination of best-case reasoning (sup instead of inf) and the anti-Nirvana trick. In that case the agent goes "Murphy won't mispredict, since then I'd get -infinity reward which can't be the best that I do".

For your concrete example, that's why you have multiple hypotheses that are learnable.

Hmm, that makes sense, I think? Perhaps I just haven't really internalized the learning aspect of all of this.

Introduction To The Infra-Bayesianism Sequence

I'd like to note that even if there was a clean separation between an agent and its environment, it could still be the case that the environment cannot be precisely modeled by the agent due to its computational complexity

Yeah, agreed. I'm intentionally going for a simplified summary that sacrifices details like this for the sake of cleaner narrative.

it would be more fair to say that the contribution of IB is combining that with reinforcement learning theory 

Ah, whoops. Live and learn.

The reason we use worst-case reasoning is because we want the agent

... (read more)
3Vanessa Kosoy17dYes I think that if you are offered a single bet, your utility is linear in money and your belief is a crisp infradistribution (i.e. a closed convex set of probability distributions) then it is always optimal to bet either as much as you can or nothing at all. But for more general infradistributions this need not be the case. For example, consider X := {0,1} and take the set of a-measures generated by 3δ_0 and δ_1. Suppose you start with 1/2 dollars and can bet any amount on any outcome at even odds. Then the optimal bet is betting 1/4 dollars on the outcome 1, with a value of 3/4 dollars.
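For what it's worth, a quick numerical check reproduces the 1/4 and 3/4 figures in the example above; the payoff bookkeeping (bet b on outcome 1 at even odds from a starting wealth of 1/2) is my own reading of the setup:

```python
# Worst-case ("infra") value of betting b on outcome 1, over the a-measures 3*delta_0 and delta_1.
start = 0.5
bets = [i / 1000 for i in range(-500, 501)]    # bet on outcome 1; negative = bet on outcome 0

def infra_value(b):
    wealth_if_0 = start - b                    # even odds: lose b if the outcome is 0
    wealth_if_1 = start + b                    # ...and win b if the outcome is 1
    return min(3 * wealth_if_0, 1 * wealth_if_1)   # worst case over the two a-measures

best = max(bets, key=infra_value)
print(best, infra_value(best))                 # 0.25 0.75
```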
3Diffractor18dIf you use the Anti-Nirvana trick, your agent just goes "nothing matters at all, the foe will mispredict and I'll get -infinity reward" and rolls over and cries since all policies are optimal. Don't do that one, it's a bad idea. For the concave expectation functionals: Well, there's another constraint or two, like monotonicity, but yeah, LF duality basically says that you can turn any (monotone) concave expectation functional into an inframeasure. Ie, all risk aversion can be interpreted as having radical uncertainty over some aspects of how the environment works and assuming you get worst-case outcomes from the parts you can't predict. For your concrete example, that's why you have multiple hypotheses that are learnable. Sure, one of your hypotheses might have complete knightian uncertainty over the odd bits, but another hypothesis might not. Betting on the odd bits is advised by a more-informative hypothesis, for sufficiently good bets. And the policy selected by the agent would probably be something like "bet on the odd bits occasionally, and if I keep losing those bets, stop betting", as this wins in the hypothesis where some of the odd bits are predictable, and doesn't lose too much in the hypothesis where the odd bits are completely unpredictable and out to make you lose.
Against evolution as an analogy for how humans will create AGI

All of that sounds reasonable to me. I still don't see why you think editing weights is required, as opposed to something like editing external memory.

(Also, maybe we just won't have AGI that learns by reading books, and instead it will be more useful to have a lot of task-specific AI systems with a huge amount of "built-in" knowledge, similarly to GPT-3. I wouldn't put this as my most likely outcome, but it seems quite plausible.)

4Richard Ngo18dI agree with Steve that it seems really weird to have these two parallel systems of knowledge encoding the same types of things. If an AGI learned the skill of speaking english during training, but then learned the skill of speaking french during deployment, then your hypotheses imply that the implementations of those two language skills will be totally different. And it then gets weirder if they overlap - e.g. if an AGI learns a fact during training which gets stored in its weights, and then reads a correction later on during deployment, do those original weights just stay there? Based on this I guess your answer to my question above is "no": the original fact will get overridden a few days later, and also the knowledge of french will be transferred into the weights eventually. But if those updates occur via self-supervised learning, then I'd count that as "autonomously edit[ing] its weights after training". And with self-supervised learning, you don't need to wait long for feedback, so why wouldn't you use it to edit weights all the time? At the very least, that would free up space in the short-term memory/hidden state. For my own part I'm happy to concede that AGIs will need some way of editing their weights during deployment. The big question for me is how continuous this is with the rest of the training process. E.g. do you just keep doing SGD, but with a smaller learning rate? Or will there be a different (meta-learned) weight update mechanism? My money's on the latter. If it's the former, then that would update me a bit towards Steve's view, but I think I'd still expect evolution to be a good analogy for the earlier phases of SGD. If this is the case, then that would shift me away from thinking of evolution as a good analogy for AGI, because the training process would then look more like the type of skill acquisition that happens during human lifetimes. In fact, this seems like the most likely way in which Steve is right that evolution is a bad analogy.
Against evolution as an analogy for how humans will create AGI

Thanks, this was helpful in understanding in where you're coming from.

When I think of the AGI-hard part of "learning", I think of building a solid bedrock of knowledge and ideas, such that you can build new ideas on top of the old ideas, in an arbitrarily high tower.

I don't feel like humans meet this bar. Maybe mathematicians, and even then, I probably still wouldn't agree. Especially not humans without external memory (e.g. paper). But presumably such humans still count as generally intelligent.

Anyway, my human brain analogy for GPT-3 is: I think the GPT-

... (read more)
3Steve Byrnes18dThanks again, this is really helpful. Hmm, imagine you get a job doing bicycle repair. After a while, you've learned a vocabulary of probably thousands of entities and affordances and interrelationships (the chain, one link on the chain, the way the chain moves, the feel of clicking the chain into place on the gear, what it looks like if a chain is loose, what it feels like to the rider when a chain is loose, if I touch the chain then my finger will be greasy, etc. etc.). All that information is stored in a highly-structured way in your brain (I think some souped-up version of a PGM, but let's not get into that), such that it can grow to hold a massive amount of information while remaining easily searchable and usable. The problem with working memory is not capacity per se, it's that it's not stored in this structured, easily-usable-and-searchable way. So the more information you put there, the more you start getting bogged down and missing things. Ditto with pen and paper, or a recurrent state, etc. I find it helpful to think about our brain's understanding as lots of subroutines running in parallel. (Kaj calls these things "subagents" [https://www.lesswrong.com/s/ZbmRyDN8TCpBTZSip], I more typically call them "generative models" [https://www.lesswrong.com/posts/diruo47z32eprenTg/my-computational-framework-for-the-brain] , Kurzweil calls them "patterns" [https://www.amazon.com/How-Create-Mind-Thought-Revealed/dp/1491518839], Minsky calls this idea "society of mind" [https://www.amazon.com/Society-Mind-Marvin-Minsky/dp/0671657135], etc.) They all mostly just sit around doing nothing. But sometimes they recognize a scenario for which they have something to say, and then they jump in and say it. So in chess, there's a subroutine that says "If the board position has such-and-characteristics, it's worthwhile to consider moving the pawn." The subroutine sits quietly for months until the board has that position, and then it jumps in and injects its idea. And of course,
Against evolution as an analogy for how humans will create AGI

I feel like I didn't really understand what you were trying to get at here, probably because you seem to have a detailed internal ontology that I don't really get yet. So here's some random disagreements, with the hope that more discussion leads me to figure out what this ontology actually is.

A biological analogy I like much better: The “genome = code” analogy

This analogy also seems fine to me, as someone who likes the evolution analogy

In the remainder of the post I’ll go over three reasons suggesting that the first scenario would be much less likely than

... (read more)
4Steve Byrnes19dThanks! A lot of your comments are trying to relate this to GPT-3, I think. Maybe things will be clearer if I just directly describe how I think about GPT-3. The evolution analogy (as I'm defining it) says that “The AGI” is identified as the inner algorithm, not the inner and outer algorithm working together. In other words, if I ask the AGI a question, I don’t need the outer algorithm to be running in the course of answering that question. Of course the GPT-3 trained model is already capable of answering "easy" questions, but I'm thinking here about "very hard" questions that need the serious construction of lots of new knowledge and ideas that build on each other. I don't think the GPT-3 trained model can do that by itself. Now for GPT-3, the outer algorithm edits weights, and the inner algorithm edits activations. I am very impressed about the capabilities of the GPT-3 weights, edited by SGD, to store an open-ended world model of greater and greater complexity as you train it more and more. I am not so optimistic that the GPT-3 activations can do that, without somehow transferring information from activations to weights. And not just for the stupid reason that it has a finite training window. (For example, other transformer models have recurrency.) Why don't I think that the GPT-3 trained model is just as capable of building out an open-ended world-model of ever greater complexity using activations not weights? For one thing, it strikes me as a bit weird to think that there will be this centaur-like world model constructed out of X% weights and (100-X)% activations. And what if GPT comes to realize that one of its previous beliefs is actually wrong? Can the activations somehow act as if they're overwriting the weights? Just seems weird. How much information content can you put in the activations anyway? I don't know off the top of my head, but much less than the amount you can put in the weights. When I think of the AGI-hard part of "learning", I think of b
AXRP Episode 5 - Infra-Bayesianism with Vanessa Kosoy

Wrote a combined summary for this podcast and the original sequence here.

Introduction To The Infra-Bayesianism Sequence

Planned summary for the Alignment Newsletter:

I have finally understood this sequence enough to write a summary about it, thanks to [AXRP Episode 5](https://www.alignmentforum.org/posts/FkMPXiomjGBjMfosg/axrp-episode-5-infra-bayesianism-with-vanessa-kosoy). Think of this as a combined summary + highlight of the sequence and the podcast episode.

The central problem of <@embedded agency@>(@Embedded Agents@) is that there is no clean separation between an agent and its environment: rather, the agent is _embedded_ in its environment, and so when reasoning

... (read more)
2Vanessa Kosoy18dThat's certainly one way to motivate IB, however I'd like to note that even if there was a clean separation between an agent and its environment, it could still be the case that the environment cannot be precisely modeled by the agent due to its computational complexity (in particular this must be the case if the environment contains other agents of similar or greater complexity). Well, the use of Knightian uncertainty (imprecise probability) in decision theory certainly appeared in the literature, so it would be more fair to say that the contribution of IB is combining that with reinforcement learning theory (i.e. treating sequential decision making and considering learnability and regret bounds in this setting) and applying that to various other questions (in particular, Newcombian paradoxes). The reason we use worst-case reasoning is because we want the agent to satisfy certain guarantees. Given a learnable class of infra-hypotheses, in the γ→1 limit, we can guarantee that whenever the true environment satisfies one of those hypotheses, the agent attains at least the corresponding amount of expected utility. You don't get anything analogous with best-case reasoning. Moreover, there is an (unpublished) theorem showing that virtually any guarantee you might want to impose can be written in IB form. That is, let E be the space of environments, and let g_n : E → [0,1] be an increasing sequence of functions. We can interpret every g_n as a requirement about the policy: ∀μ: E_{μπ}[U] ≥ g_n(μ). These requirements become stronger with increasing n. We might then want π to be s.t. it satisfies the requirement with the highest n possible. The theorem then says that (under some mild assumptions about the functions g) there exists an infra-environment s.t. optimizing for it is equivalent to maximizing n. (We can replace n by a continuous parameter, I made it discrete just for ease of exposition.) Actually it might be not that different. The Legendre-Fenchel duality shows you can thi
1DanielFilan19dOne thing I realized after the podcast is that because the decision theory you get can only handle pseudo-causal environments, it's basically trying to think about the statistics of environments rather than their internals. So my guess is that further progress on transparent newcomb is going to have to look like adding in the right kind of logical uncertainty or something. But basically it unsurprisingly has more of a statistical nature than what you imagine you want reading the FDT paper.
[AN #142]: The quest to understand a network well enough to reimplement it by hand

Ah excellent, thanks for the links. I'll send the Twitter thread in the next newsletter with the following summary:

Last week I speculated that CLIP might "know" that a textual adversarial example is a "picture of an apple with a piece of paper saying an iPod on it" and the zero-shot classification prompt is preventing it from demonstrating this knowledge. Gwern Branwen [commented](https://www.alignmentforum.org/posts/JGByt8TrxREo4twaw/an-142-the-quest-to-understand-a-network-well-enough-to?commentId=keW4DuE7G4SZn9h2r) to link me to this Twitter thread as w

... (read more)
[AN #142]: The quest to understand a network well enough to reimplement it by hand

Related: Interpretability vs Neuroscience: Six major advantages which make artificial neural networks much easier to study than biological ones. Probably not a major surprise to readers here.

AI x-risk reduction: why I chose academia over industry

I've discussed this question with a good number of people, and I think I've generally found my pro-academia arguments to be stronger than their pro-industry arguments (I think probably many of them would agree?)

I... think we've discussed this? But I don't agree, at least insofar as the arguments are supposed to apply to me as well (so e.g. not the personal fit part).

Some potential disagreements:

  1. I expect more field growth via doing good research that exposes more surface area for people to tackle, rather than mentoring people directly. Partly this is becaus
... (read more)
4David Krueger1moYeah we've definitely discussed it! Rereading what I wrote, I did not clearly communicate what I intended to...I wanted to say that "I think the average trend was for people to update in my direction". I will edit it accordingly. I think the strength of the "usual reasons" has a lot to do with personal fit and what kind of research one wants to do. Personally, I basically didn't consider salary as a factor.
[AN #141]: The case for practicing alignment work on GPT-3 and other large models

I'd like to see Hutter's model "translated" a bit to DNNs, e.g. by assuming they get anything right that's within epsilon of a training data point or something

With this assumption, asymptotically (i.e. with enough data) this becomes a nearest neighbor classifier. For the d-dimensional manifold assumption in the other model, you can apply the arguments from the other model to say that you scale as n^(-c/d) in the number of training points n, for some constant c (probably c = 1 or 2, depending on what exactly we're quantifying the scaling of).

I'm not entirely sure how you... (read more)
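A sketch of where the n^(-c/d) dependence comes from; the Lipschitz assumption and the notation here are my own framing rather than anything taken from the models under discussion:

```latex
% Heuristic scaling for a nearest-neighbor-style predictor on a d-dimensional data manifold.
% With n training points spread over a d-dimensional manifold, the typical distance from a
% new point to its nearest training point scales as
\[
  \epsilon(n) \sim n^{-1/d}.
\]
% If the target function is Lipschitz, the prediction error at that point is O(\epsilon), so
\[
  \mathrm{error}(n) \sim n^{-1/d}, \qquad \mathrm{squared\ loss}(n) \sim n^{-2/d},
\]
% matching "c = 1 or 2, depending on what exactly we're quantifying the scaling of."
```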

Four Motivations for Learning Normativity

Planned summary for the Alignment Newsletter:

We’ve <@previously seen@>(@Learning Normativity: A Research Agenda@) desiderata for agents that learn normativity from humans: specifically, we would like such agents to:

1. **Learn at all levels:** We don’t just learn about uncertain values, we also learn how to learn values, and how to learn to learn values, etc. There is **no perfect loss function** that works at any level; we assume conservatively that Goodhart’s Law will always apply. In order to not have to give infinite feedback for the infinite leve

... (read more)
[AN #141]: The case for practicing alignment work on GPT-3 and other large models

I feel like there's a pretty strong Occam's Razor-esque argument for preferring Hutter's model, even though it seems wildly less intuitive to me.

?? Overall this claim feels to me like:

  • Observing that cows don't float into space
  • Making a model of spherical cows with constant density ρ and showing that as long as ρ is more than density of air, the cows won't float
  • Concluding that since the model is so simple, Occam's Razor says that cows must be spherical with constant density.

Some ways that you could refute it:

  • It requires your data to be Zipf-distributed -- wh
... (read more)
3David Krueger1moInteresting... Maybe this comes down to different taste or something. I understand, but don't agree with, the cow analogy... I'm not sure why, but one difference is that I think we know more about cows than DNNs or something. I haven't thought about the Zipf-distributed thing. > Taken literally, this is easy to do. Neural nets often get the right answer on never-before-seen data points, whereas Hutter's model doesn't. Presumably you mean something else but idk what. I'd like to see Hutter's model "translated" a bit to DNNs, e.g. by assuming they get anything right that's within epsilon of a training data point or something... maybe it even ends up looking like the other model in that context...
Recursive Quantilizers II

I continue to not understand this but it seems like such a simple question that it must be that there's just some deeper misunderstanding of the exact proposal we're now debating. It seems not particularly worth it to find this misunderstanding; I don't think it will really teach us anything conceptually new.

(If I did want to find it, I would write out pseudocode for the new proposed system and then try to make a more precise claim in terms of the variables in the pseudocode.)

2Abram Demski1moFair.
Epistemological Framing for AI Alignment Research

Planned summary for the Alignment Newsletter:

This post recommends that we think about AI alignment research in the following framework:

1. Defining the problem and its terms: for example, we might want to define “agency”, “optimization”, “AI”, and “well-behaved”.

2. Exploring these definitions, to see what they entail.

3. Solving the now well-defined problem.

This is explicitly _not_ a paradigm, but rather a framework in which we can think about possible paradigms for AI safety. A specific paradigm would choose a specific problem formulation and definition (or

... (read more)
The case for aligning narrowly superhuman models

Planned summary for the Alignment Newsletter:

One argument against work on AI safety is that [it is hard to do good work without feedback loops](https://www.jefftk.com/p/why-global-poverty). So how could we get feedback loops? The most obvious approach is to actually try to align strong models right now, in order to get practice with aligning models in the future. This post fleshes out what such an approach might look like. Note that I will not be covering all of the points mentioned in the post; if you find yourself skeptical you may want to read the full

... (read more)
Recursive Quantilizers II

So my main crux here is whether you can be sufficiently confident of the 5x, to know that your tools which are 5x-appropriate apply.

This makes sense, though I probably shouldn't have used "5x" as my number -- it definitely feels intuitively more like your tools could be robust to many orders of magnitude of increased compute / model capacity / data. (Idk how you would think that relates to a scaling factor on intelligence.) I think the key claim / crux here is something like "we can develop techniques that are robust to scaling up compute / capacity / data by N orders, where N doesn't depend significantly on the current compute / capacity / data".

Recursive Quantilizers II

Most of this makes sense (or perhaps more accurately, sounds like it might be true, but there's a good chance if I reread the post and all the comments I'd object again / get confused somehow). One thing though:

Every piece of feedback gets put into the same big pool which helps define Hv, the initial ("human") value function. [...]

Okay, I think with this elaboration I stand by what I originally said:

It seemed to me like since the first few bits of feedback determine how the system interprets all future feedback, it's particularly important for those first

... (read more)
2Abram Demski1moYou mean with respect to the system as described in the post (in which case I 100% agree), or the modified system which restarts training upon new feedback (which is what I was just describing)? Because I think this is pretty solidly wrong of the system that restarts. All feedback so far determines the new D1 when the system restarts training. (Again, I'm not saying it's feasible to restart training all the time, I'm just using it as a proof-of-concept to show that we're not fundamentally forced to make a trade-off between (a) order independence and (b) using the best model to interpret feedback.)
The case for aligning narrowly superhuman models

Makes it sound like there's some structural equivalence to a human thinking for a long time, which there isn't.

Yes, I explicitly agree with this, which is why the first thing in my previous response was

 sorry, that's right, I was speaking pretty loosely.

The case for aligning narrowly superhuman models

I agree with the other responses from Ajeya / Paul / Raemon, but to add some more info:

Where did this idea of HCH yielding the same benefits as a human thinking for a long time come from???

... I don't really know. My guess is that I picked it up from reading giant comment threads between Paul and other people.

I don't see any reason at all to expect it to do anything remotely similar to that.

Tbc it doesn't need to be literally true. The argument needed for safety is something like "a large team of copies of non-expert agents could together be as capable as ... (read more)

4johnswentworth1mo"As capable as an expert" makes more sense. Part of what's confusing about "equivalent to a human thinking for a long time" is that it's picking out one very particular way of achieving high capability, but really it's trying to point to a more-general notion of "HCH can solve lots of problems well". Makes it sound like there's some structural equivalence to a human thinking for a long time, which there isn't.
The case for aligning narrowly superhuman models

Yeah, sorry, that's right, I was speaking pretty loosely. You'd still have the same hope -- maybe a team of 2^100 copies of the business owner could draft a contract just as well, or better than, an already expert business-owner. I just personally find it easier to think about "benefits of a human thinking for a long time" and then "does HCH get the same benefits as humans thinking for a long time" and then "does iterated amplification get the same benefits as HCH".

4johnswentworth1moWhere did this idea of HCH yielding the same benefits as a human thinking for a long time come from??? Both you and Ajeya apparently have this idea, so presumably it was in the water at some point? Yet I don't see any reason at all to expect it to do anything remotely similar to that.
The case for aligning narrowly superhuman models

One approach is to let the human giving feedback think for a long time. Maybe the business owner by default can't write a good contract, but a business owner who could study the relevant law for a year would do just as well as the already expert business-owner. In the real world this is too expensive to do, but there's hope in the AI case (e.g. that's a hope behind iterated amplification).

3johnswentworth1moHow does iterated amplification achieve this? My understanding was that it simulates scaling up the number of people (a la HCH), not giving one person more time.
The case for aligning narrowly superhuman models

It seems to me that the type of research you're discussing here is already seen as a standard way to make progress on the full alignment problem - e.g. the Stiennon et al. paper you cited, plus earlier work on reward modeling by Christiano, Leike, and others. Can you explain why you're institutionally uncertain whether to prioritise it - is it because of the objections you outlined?

It's important to distinguish between:

  • "We (Open Phil) are not sure whether we want to actively push this in the world at large, e.g. by running a grant round and publicizing it
... (read more)
Fun with +12 OOMs of Compute

That said, I'd be interested to hear why you have similar feelings about the non-Neuromorph answers, considering that you agreed with the point I was making in the birds/brains/etc. post. If we aren't trying to replicate the brain, but just to do something that works, yes there will be lots of details to work out, but what positive reason do you have to think that the amount of special sauce / details is so high that 12 OOMs and a few years isn't enough to find it?

The positive reason is basically all the reasons given in Ajeya's report? Like, we don't tend... (read more)

3Daniel Kokotajlo1moHmmm, it seems we aren't on the same page. (The argument sketch you just made sounds to me like a collection of claims which are either true but irrelevant, or false, depending on how I interpret them.) I'll go back and reread Ajeya's report (or maybe talk to her?) and then maybe we'll be able to get to the bottom of this. Maybe my birds/brains/etc. post directly contradicts something in her report after all.
Fun with +12 OOMs of Compute

idk, I feel like maybe at this point we should make bets or something, and then go read the literature and see who is right? I don't find this prospect appealing but it seems like the epistemically virtuous thing to do.

Meh, I don't think it's a worthwhile use of my time to read that literature, but I'd make a bet if we could settle on an operationalization and I didn't have to settle it.

What do you imagine happening, in the hypothetical, when we run the Neuromorph project?

I mostly expect that you realize that there were a bunch of things that were super un... (read more)

5Daniel Kokotajlo1moOn the contrary, I've been very (80%?) surprised by the responses so far -- in the Elicit poll, everyone agrees with me! I expected there to be a bunch of people with answers like "10%" and "20%" and then an even larger bunch of people with answers like "50%" (that's what I expected you, Ajeya, etc. to chime in and say). Instead, well, just look at the poll results! So, even a mere registering of disagreement is helpful. That said, I'd be interested to hear why you have similar feelings about the non-Neuromorph answers, considering that you agreed with the point I was making in the birds/brains/etc. post [https://www.lesswrong.com/posts/HhWhaSzQr6xmBki8F/birds-brains-planes-and-ai-against-appeals-to-the-complexity] . If we aren't trying to replicate the brain, but just to do something that works, yes there will be lots of details to work out, but what positive reason do you have to think that the amount of special sauce / details is so high that 12 OOMs and a few years isn't enough to find it? Interesting. This conflicts with something I've been told about neural networks, which is that they "want to work." Seems to me that more likely than eternal gibberish is something that works but not substantially better than regular ANN's of similar size. So, still better than GPT-3, AlphaStar, etc. After all, those architectures are simple enough that surely something similar is in the space of things that would be tried out by the Neuromorph search process? I think the three specific ways my claim about worms could be wrong are not very plausible: Sure, most of the genes don't code neuron stuff. So what? Sure, maybe the DNA mostly contents itself with specifying number + structure of neurons, but that's just a rejection of my claim, not an argument against it. Sure, maybe it's fine to have just one type of neuron if you are so simple -- but the relevant metric is not "number of types" but "number of types / number of neurons." And the size of the human genome limits tha
Fun with +12 OOMs of Compute

Neuromorph =/= an attempt to create uploads.

My impression is that the linked blog post is claiming we haven't even been able to get things that are qualitatively as impressive as a worm. So why would we get things that are qualitatively as impressive as a human? I'm not claiming it has to be an upload.

This is because, counterintuitively, worms being small makes them a lot harder to simulate.

I could believe this (based on the argument you mentioned) but it really feels like "maybe this could be true but I'm not that swayed from my default prior of 'it's pro... (read more)

4Daniel Kokotajlo1moAt this point I guess I just say I haven't looked into the worm literature enough to say. I can't tell from the post alone whether we've neuromorphed the worm yet or not. "Qualitatively as impressive as a worm" is a pretty low bar, I think. We have plenty of artificial neural nets that are much more impressive than worms already, so I guess the question is whether we can make one with only 302 neurons that is as impressive as a worm... e.g. can it wriggle in a way that moves it around, can it move away from sources of damage and towards sources of food, etc. idk, I feel like maybe at this point we should make bets or something, and then go read the literature and see who is right? I don't find this prospect appealing but it seems like the epistemically virtuous thing to do. I do feel fairly confident that on a per-neuron basis worms are much harder than humans to simulate. My argument seems solid enough for that conclusion, I think. It's not solid enough to mean that you are wrong though -- like you said, a 100x difference is still basically nothing. And to be honest I agree that the difference probably isn't much more than that; maybe 1000x or something. That's computational expense, though; qualitative difficulty is another matter. If you recall from my post about birds & planes, my position is not that simulating/copying nature is easy; rather it's that producing something that gets the job done is easy, or at least in expectation easier than a lot of people seem to think, because the common argument that it's hard is bogus etc. etc. This whole worm-uploading project seems more like "simulating/copying nature" to me, whereas the point of Neuromorph was to try to lazily/cheaply copy some things from nature and then make up the rest in whatever way gets the job done. What do you imagine happening, in the hypothetical, when we run the Neuromorph project? Do you imagine it producing gibberish eternally? If so, why -- wouldn't you at least expect it to do about
Fun with +12 OOMs of Compute

I feel like if you think Neuromorph has a good chance of succeeding, you need to explain why we haven't uploaded worms yet. For C. elegans, if we ran 302 neurons for 1 subjective day (= 8.64e4 seconds) at 1.2e6 flops per neuron, and did this for 100 generations of 100 brains, that takes a mere 3e17 flops, or about $3 at current costs.

(And this is very easy to parallelize, so you can't appeal to that as a reason this can't be done.)

(It's possible that we have uploaded worms in the 7 years since that blog post was written, though I would have expected to hear about it if so.)
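For the record, the arithmetic behind that estimate (reading "1.2e6 flops per neuron" as per neuron per second, and backing out the price per flop implied by the "$3" figure):

```python
# Rough cost of the hypothetical C. elegans training run described above.
neurons = 302
seconds = 8.64e4                 # one subjective day
flops_per_neuron_per_sec = 1.2e6
brains, generations = 100, 100

total_flops = neurons * seconds * flops_per_neuron_per_sec * brains * generations
print(f"{total_flops:.1e} flops")                 # ~3.1e17, i.e. "a mere 3e17 flops"

dollars_per_flop = 1e-17                          # ballpark implied by "about $3 at current costs"
print(f"${total_flops * dollars_per_flop:.2f}")   # ~$3.13
```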

5Daniel Kokotajlo1moGood question! Here's my answer: --I think Neuromorph has the least chance of succeeding of the five. Still more than 50% though IMO. I'm not at all confident in this. --Neuromorph =/= an attempt to create uploads. I would be extremely surprised if the resulting AI was recognizeably the same person as was scanned. I'd be mildly surprised if it even seemed human-like at all, and this is conditional on the project working. What I imagine happening conditional on the project working is something like: After a few generations of selection, we get brains that "work" in more or less the way that vanilla ANN's work, i.e. the brains seem like decently competent RL agents (of human-brain size) at RL tasks, decently competent transformers at language modelling, etc. So, like GPT-5 or so. But there would still be lots of glitches and weaknesses and brittleness. But then with continued selection we get continued improvement, and many (though not nearly all) of the tips & tricks evolution had discovered and put into the brain (modules, etc.) are rediscovered. Others are bypassed entirely as the selection process routes around them and finds new improvements. At the end we are left with something maybe smarter than a human, but probably not, but competent enough and agenty enough to be transformative. (After all, it's faster and cheaper than a human.) From the post you linked: Yeah, Neuromorph definitely won't be uploading humans in that sense. This might already qualify as success for what I'm interested in, depending on how "wormlike" the behaviors are. I haven't looked into this. --I used to think our inability to upload worms was strong evidence against any sort of human uploading happening anytime soon. However, I now think it's only weak evidence. This is because, counterintuitively, worms being small makes them a lot harder to simulate. Notice how each worm has exactly 302 neurons, and their locations and connections are the same in each worm. That means the genes ar
How does bee learning compare with machine learning?

Planned summary for the Alignment Newsletter:

The <@biological anchors approach@>(@Draft report on AI timelines@) to forecasting AI timelines estimates the compute needed for transformative AI based on the compute used by animals. One important parameter of the framework is needed to “bridge” between the two: if we find that an animal can do a specific task using X amount of compute, then what should we estimate as the amount of compute needed for an ML model to do the same task? This post aims to better estimate this parameter, by comparing few-shot

... (read more)
How does bee learning compare with machine learning?

It’s clear, however, that a bee’s brain can perform a wide range of tasks besides few-shot image classification, while the machine learning model developed in (Lee et al., 2019) cannot.

The abstract objection here is “if you choose an ML model that has been trained just for a specific task (few-shot learning), on priors you should expect it to be more efficient than an evolution-designed organism that has been trained for a whole bunch of stuff, of which the task you’re considering is just one example”. This can cash out in several ways:

1. Bees presumably u... (read more)

5Ajeya Cotra1moI mostly agree with your comment, but I'm actually very unsure about 2 here: I think I recall bees seeming surprisingly narrow and bad at abstract shapes. Guille would know more here.
Bootstrapped Alignment

Planned summary for the Alignment Newsletter:

This post distinguishes between three kinds of “alignment”:
1. Not building an AI system at all,
2. Building Friendly AI that will remain perfectly aligned for all time and capability levels,
3. _Bootstrapped alignment_, in which we build AI systems that may not be perfectly aligned but are at least aligned enough that we can use them to build perfectly aligned systems.

The post argues that optimization-based approaches can’t lead to perfect alignment, because there will always eventually be Goodhart effects.

3G Gordon Worley III1moLooks good to me! Thanks for planning to include this in the AN!
Why Neural Networks Generalise, and Why They Are (Kind of) Bayesian

I agree it's not vacuous. It sounds like you're mostly stating the same argument I gave but in different words. Can you tell me what's wrong or missing from my summary of the argument?

Since it is possible to compress high-probability events using an optimal code for the probability distribution, you might expect that functions with high probability in the neural network prior can be compressed more than functions with low probability. Since high probability functions are more likely, this means that the more likely functions correspond to shorter programs.

... (read more)
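As a one-line formal gloss on the compression step (a standard coding-theory fact, not something specific to the paper under discussion), for any computable prior P over functions the Kolmogorov complexity is bounded by the code length:

```latex
% Shannon coding gives every function f a description of about -log2 P(f) bits, and a
% program can decode that description given P, so
\[
  K(f) \le -\log_2 P(f) + c_P
\]
% for a constant c_P depending only on P. High-probability functions therefore correspond
% to short programs, which is the direction the summarized argument needs.
```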
1Joar Skalse1moI agree with your summary. I'm mainly just clarifying what my view is of the strength and overall role of the Algorithmic Information Theory arguments, since you said you found them unconvincing. I do however disagree that those arguments can be applied to "literally any machine learning algorithm", although they certainly do apply to a much larger class of ML algorithms than just neural networks. However, I also don't think this is necessarily a bad thing. The picture that the AIT arguments give makes it reasonably unsurprising that you would get the double-descent phenomenon as you increase the size of a model (at small sizes VC-dimensionality mechanisms dominate, but at larger sizes the overparameterisation starts to induce a simplicity bias, which eventually starts to dominate). Since you get double descent in the model size for both neural networks and eg random forests, you should expect there to be some mechanism in common between them (even if the details of course differ from case to case).
Why Neural Networks Generalise, and Why They Are (Kind of) Bayesian

The main point, as I see it, is essentially that functions with good generalisation correspond to large volumes in parameter-space, and that SGD finds functions with a probability roughly proportional to their volume.

This seems right, but I'm not sure how that's different from Zach's phrasing of the main point? Zach's phrasing was "SGD approximately equals random sampling", and random sampling finds functions with probability exactly proportional to their volume. Combine that with the fact that empirically we get good generalization and we get the thing yo... (read more)

1Joar Skalse1moI'm honestly not sure, I just wasn't really sure what he meant when he said that the Bayesian and the Kolmogorov complexity stuff were "distractions from the main point". Haha. That's obviously not what we're trying to do here, but I do see what you mean. I originally wanted to express these ideas in more geometric language, rather than probability-theoretic language, but in the end we decided to go for more probability-theoretic language anyway. I agree that this arguably could be mildly misleading. For example, the correspondence between SGD and Bayesian sampling only really holds for some initialisation distributions. If you deterministically initialise your neural network to the origin (i.e., all zero weights) then SGD will most certainly not behave like Bayesian sampling with the initialisation distribution as its prior. Then again, the probability-theoretic formulation might make other things more intuitive.
[AN #139]: How the simplicity of reality explains the success of neural nets

What John said. To elaborate, it's specifically talking about the case where there is some concept from which some probabilistic generative model creates observations tied to the concept, and claiming that the log probabilities follow a polynomial.

Suppose the most dog-like nose size is K. One function you could use is y = exp(-(x - K)^d) for some even positive integer d. That's a function whose log has a maximum value of 0 (where higher values = more "dogness") and that doesn't blow up unreasonably anywhere.

(Really you should be talking about probabilities, in which case you use the same sort of function but then normalize, which transforms the exp into a softmax, as the paper suggests)
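A tiny numerical illustration of that; K, d, and the grid of nose sizes are arbitrary choices of mine:

```python
# Log-probability as a low-order polynomial in the feature, normalized with a softmax.
import math

K, d = 3.0, 2                               # most dog-like nose size, even polynomial degree
nose_sizes = [0, 1, 2, 3, 4, 5, 6]

log_p = [-(x - K) ** d for x in nose_sizes]      # polynomial log-prob, maximum of 0 at x = K
z = sum(math.exp(v) for v in log_p)
p = [math.exp(v) / z for v in log_p]             # softmax: exponentiate and normalize

for x, prob in zip(nose_sizes, p):
    print(f"nose size {x}: {prob:.3f}")          # peaks at x = K = 3
```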

Utility Maximization = Description Length Minimization

The core conceptual argument is: the higher your utility function can go, the bigger the world must be, and so the more bits it must take to describe it in its unoptimized state under M2, and so the more room there is to reduce the number of bits.

If you could only ever build 10 paperclips, then maybe it takes 100 bits to specify the unoptimized world, and 1 bit to specify the optimized world.

If you could build 10^100 paperclips, then the world must be humongous and it takes 10^101 bits to specify the unoptimized world, but still just 1 bit to specify the p... (read more)
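Spelling out the toy numbers (the 10 bits per potential paperclip is just the rate implied by the comment's own figures):

```python
# How much room there is to compress, as a function of how high utility can go.
bits_per_potential_paperclip = 10     # implied by "10 paperclips ~ 100 bits"

for max_paperclips in (10, 10**100):
    unoptimized_bits = bits_per_potential_paperclip * max_paperclips
    optimized_bits = 1                # "everything is paperclips" is cheap to specify
    print(max_paperclips, unoptimized_bits, unoptimized_bits - optimized_bits)
```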

1Daniel Kokotajlo2moAhh, thanks!
Some thoughts on risks from narrow, non-agentic AI

I think there's a perspective where the post-singularity failure is still the important thing to talk about, and that's an error I made in writing the post. I skipped it because there is no real action after the singularity---the damage is irreversibly done, all of the high-stakes decisions are behind us---but it still matters for people trying to wrap their heads around what's going on. And moreover, the only reason it looks that way to me is because I'm bringing in a ton of background empirical assumptions (e.g. I believe that massive acceleration in gro

... (read more)