In some sense you could start from the trivial story "Your algorithm didn't work and then something bad happened." Then the "search for stories" step is really just trying to figure out if the trivial story is plausible. I think that's pretty similar to a story like: "You can't control what your model thinks, so in some new situation it decides to kill you."
To fill in the details more:
Assume that we're finding an algorithm to train an agent with a sufficiently large action space (i.e. we don't get safety via the agent having such a restricted action space ... (read more)
Looks good to me :)
Planned summary for the Alignment Newsletter:
This post outlines a simple methodology for making progress on AI alignment. The core idea is to alternate between two steps:
1. Come up with some alignment algorithm that solves the issues identified so far
2. Try to find some plausible situation in which either a) the resulting AI system is misaligned or b) the AI system is not competitive.
This is all done conceptually, so step 2 can involve fairly exotic scenarios that probably won't happen. Given such a scenario, we need to argue why no failure in the same class...
From my perspective, there is a core reason for worry, which is something like "you can't fully control what patterns of thought your algorithm learns, and how they'll behave in new circumstances", and it feels like you could always apply that as your step 2.
That doesn't seem like it has quite the type signature I'm looking for. I'm imagining a story as a description of how something bad happens, so I want the story to end with "and then something bad happens."
In some sense you could start from the trivial story "Your algorithm didn't work and then something ... (read more)
I don't think similarly-sized transformers would do much better and might do worse. Section 3.4 shows that large models trained from scratch massively overfit to the data. I vaguely recall the authors saying that similarly-sized transformers tended to be harder to train as well.
Does this mean that this fine-tuning process can be thought of as training a NN that is 3 OOMs smaller, and thus needs 3 OOMs fewer training steps according to the scaling laws?
My guess is that the answer is mostly yes (maybe not the exact numbers predicted by existing scaling laws, but similar ballpark).
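As a rough back-of-the-envelope sketch of that ballpark claim (the power-law exponents and parameter counts below are assumptions for illustration, not numbers from this thread or from the linked scaling-law work):

```python
import math

# Hypothetical power-law relation between trainable parameter count N and the
# number of training steps S needed to reach a target loss: S ~ k * N**a.
# The exponent a and the parameter counts are assumptions for illustration.
def steps_needed(n_params, a, k=1.0):
    return k * n_params ** a

full_model = 1e9     # assumed: trainable params when fine-tuning everything
frozen_model = 1e6   # assumed: fine-tuning ~0.1% of params, i.e. 3 OOMs fewer

for a in (0.5, 0.75, 1.0):
    ratio = steps_needed(full_model, a) / steps_needed(frozen_model, a)
    print(f"a={a}: ~{math.log10(ratio):.1f} OOMs fewer training steps")
# With a = 1 you recover the "3 OOMs fewer steps" intuition; smaller exponents
# give a smaller, but same-ballpark, reduction.
```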
how does that not contradict the scaling laws for transfer described here and used in this calculation by Rohin?
I think this is mostly irrelevant to timelines / previous scaling laws for transfer:
Yes, that's basically right.
You think I take the original argument to be arguing from ‘has goals' to ‘has goals’, essentially, and agree that that holds, but don’t find it very interesting/relevant.
Well, I do think it is an interesting/relevant argument (because as you say it explains how you get from "weakly has goals" to "strongly has goals"). I just wanted to correct the misconception about what I was arguing against, and I wanted to highlight the "intelligent" --> "weakly has goals" step as a relatively weak step in our current arguments. (In my ori... (read more)
Thanks, that's helpful. I'll think about how to clarify this in the original post.
You're mistaken about the view I'm arguing against. (Though perhaps in practice most people think I'm arguing against the view you point out, in which case I hope this post helps them realize their error.) In particular:
Whatever things you care about, you are best off assigning consistent numerical values to them and maximizing the expected sum of those values
If you start by assuming that the agent cares about things, and your prior is that the things it cares about are "simple" (e.g. it is very unlikely to be optimizing the-utility-function-that-makes-the... (read more)
A few quick thoughts on reasons for confusion:
I think maybe one thing going on is that I already took the coherence arguments to apply only in getting you from weakly having goals to strongly having goals, so since you were arguing against their applicability, I thought you were talking about the step from weaker to stronger goal direction. (I’m not sure what arguments people use to get from 1 to 2 though, so maybe you are right that it is also something to do with coherence, at least implicitly.)
It also seems natural to think of ‘weakly has goals’ as some... (read more)
Thanks. Let me check if I understand you correctly:
You think I take the original argument to be arguing from ‘has goals' to ‘has goals’, essentially, and agree that that holds, but don’t find it very interesting/relevant.
What you disagree with is an argument from ‘anything smart’ to ‘has goals’, which seems to be what is needed for the AI risk argument to apply to any superintelligent agent.
Is that right?
If so, I think it’s helpful to distinguish between ‘weakly has goals’ and ‘strongly has goals’:
But for more general infradistributions this need not be the case. For example, consider and take the set of a-measures generated by and . Suppose you start with dollars and can bet any amount on any outcome at even odds. Then the optimal bet is betting dollars on the outcome , with a value of dollars.
I guess my question is more like: shouldn't there be some aspect of reality that determines what my set of a-measures is? It feels like here we're finding a set of a-measures... (read more)
Cool, that makes sense, thanks!
Planned summary for the Alignment Newsletter:
This post lays out a pathway by which an AI-induced existential catastrophe could occur. The author suggests that AGI will be built via model-based reinforcement learning: that is, given a reward function, we will learn a world model, a value function, and a planner / actor. These will learn online, that is, even after being deployed these learned models will continue to be updated by our learning algorithm (gradient descent, or whatever replaces it). Most research effort will be focused on learning these models
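For concreteness, here is a minimal sketch of the kind of model-based RL setup being summarized: a fixed reward function plus a learned world model, value function, and planner/actor that keep updating online. The class, method names, and update interfaces are placeholder assumptions of mine, not details from the post.

```python
# Minimal interface sketch of the architecture described above. The component
# interfaces and update rules are illustrative assumptions, not the post's design.

class ModelBasedAgent:
    def __init__(self, reward_fn, world_model, value_fn, planner):
        self.reward_fn = reward_fn      # fixed, specified by the designers
        self.world_model = world_model  # learned: predicts next state
        self.value_fn = value_fn        # learned: estimates long-run reward
        self.planner = planner          # learned/amortized: picks actions

    def act(self, state):
        # Plan by imagining rollouts in the learned world model and scoring
        # them with the reward function and value function.
        return self.planner.choose(state, self.world_model,
                                   self.reward_fn, self.value_fn)

    def online_update(self, state, action, next_state):
        # Even after deployment, gradient descent (or whatever replaces it)
        # keeps updating the learned components from the incoming data stream.
        self.world_model.update(state, action, next_state)
        self.value_fn.update(state, self.reward_fn(next_state))
        self.planner.update(state, action, self.value_fn)
```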
If an AGI learned the skill of speaking english during training, but then learned the skill of speaking french during deployment, then your hypotheses imply that the implementations of those two language skills will be totally different. And it then gets weirder if they overlap - e.g. if an AGI learns a fact during training which gets stored in its weights, and then reads a correction later on during deployment, do those original weights just stay there?
Idk, this just sounds plausible to me. I think the hope is that the weights encode more general reasonin... (read more)
I'm super on board with this general methodology, at least at a high level. (Counterexample guided loops are great.) I think my main question is, how do you tell when a failure story is sufficiently compelling that you should switch back into algorithm-finding mode?
For example, I feel like with iterated amplification, a bunch of people (including you, probably) said early on that it seems like a hard case to do e.g. translation between languages with people who only know one of the languages, or to reproduce brilliant flashes of insight. (Iirc, the transla... (read more)
High level point especially for folks with less context: I stopped doing theory for a while because I wanted to help get applied work going, and now I'm finally going back to doing theory for a variety of reasons; my story is definitely not that I'm transitioning back from applied work to theory because I now believe the algorithms aren't ready.
I think my main question is, how do you tell when a failure story is sufficiently compelling that you should switch back into algorithm-finding mode?
I feel like a story is basically plausible until proven implausibl... (read more)
These are both cases of counterexample-guided techniques. The basic idea is to solve "exists x: forall y: P(x, y)" statements according to the following algorithm:
The reason this is so nice is because you've taken a claim with two quantifiers and written an algorithm that must only ever solve claims with one quantif... (read more)
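For concreteness, here is a generic sketch of that loop (the function name and the brute-force search over finite candidate sets are mine; real instantiations would plug in SAT/SMT solvers, optimizers, or training procedures for the two sub-steps):

```python
# Counterexample-guided solving of "exists x: forall y: P(x, y)".
def cegis(P, xs, ys, max_iters=1000):
    counterexamples = []  # finite set of y's collected so far
    for _ in range(max_iters):
        # Step 1: find an x that works for every counterexample seen so far
        # (a claim with only one quantifier, over a finite set).
        candidate = next((x for x in xs
                          if all(P(x, y) for y in counterexamples)), None)
        if candidate is None:
            return None   # no x handles even the known counterexamples
        # Step 2: try to find a y that breaks the candidate (again one quantifier).
        cx = next((y for y in ys if not P(candidate, y)), None)
        if cx is None:
            return candidate  # verified: no counterexample exists
        counterexamples.append(cx)
    raise RuntimeError("no answer within the iteration budget")

# Tiny usage example: find x in 0..9 such that for all y in 0..9, x >= y.
print(cegis(lambda x, y: x >= y, range(10), range(10)))  # -> 9
```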
If you use the Anti-Nirvana trick, your agent just goes "nothing matters at all, the foe will mispredict and I'll get -infinity reward" and rolls over and cries since all policies are optimal. Don't do that one, it's a bad idea.
Sorry, I meant the combination of best-case reasoning (sup instead of inf) and the anti-Nirvana trick. In that case the agent goes "Murphy won't mispredict, since then I'd get -infinity reward which can't be the best that I do".
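A toy numerical illustration of the contrast in these two comments, using nothing but a hand-written payoff table and min/max (so this is not the actual infra-Bayesian machinery, just the arithmetic of the two failure modes):

```python
# Payoffs for two policies under two hypotheses about the predictor, with the
# anti-Nirvana trick assigning -infinity to mispredictions.
NEG_INF = float("-inf")
payoffs = {
    "policy_A": {"predicts_correctly": 5, "mispredicts": NEG_INF},
    "policy_B": {"predicts_correctly": 3, "mispredicts": NEG_INF},
}

# Worst-case (inf) reasoning + anti-Nirvana: every policy evaluates to -inf,
# so all policies look "optimal" and the agent rolls over and cries.
print({p: min(v.values()) for p, v in payoffs.items()})

# Best-case (sup) reasoning + anti-Nirvana: the -inf branches never win the max,
# so the agent acts as if the predictor never mispredicts and picks policy_A.
print({p: max(v.values()) for p, v in payoffs.items()})
```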
For your concrete example, that's why you have multiple hypotheses that are learnable.
Hmm, that makes sense, I think? Perhaps I just haven't really internalized the learning aspect of all of this.
I'd like to note that even if there was a clean separation between an agent and its environment, it could still be the case that the environment cannot be precisely modeled by the agent due to its computational complexity
Yeah, agreed. I'm intentionally going for a simplified summary that sacrifices details like this for the sake of cleaner narrative.
it would be more fair to say that the contribution of IB is combining that with reinforcement learning theory
Ah, whoops. Live and learn.
The reason we use worst-case reasoning is because we want the agent
All of that sounds reasonable to me. I still don't see why you think editing weights is required, as opposed to something like editing external memory.
(Also, maybe we just won't have AGI that learns by reading books, and instead it will be more useful to have a lot of task-specific AI systems with a huge amount of "built-in" knowledge, similarly to GPT-3. I wouldn't put this as my most likely outcome, but it seems quite plausible.)
Thanks, this was helpful in understanding in where you're coming from.
When I think of the AGI-hard part of "learning", I think of building a solid bedrock of knowledge and ideas, such that you can build new ideas on top of the old ideas, in an arbitrarily high tower.
I don't feel like humans meet this bar. Maybe mathematicians, and even then, I probably still wouldn't agree. Especially not humans without external memory (e.g. paper). But presumably such humans still count as generally intelligent.
Anyway, my human brain analogy for GPT-3 is: I think the GPT-
I feel like I didn't really understand what you were trying to get at here, probably because you seem to have a detailed internal ontology that I don't really get yet. So here's some random disagreements, with the hope that more discussion leads me to figure out what this ontology actually is.
A biological analogy I like much better: The “genome = code” analogy
This analogy also seems fine to me, as someone who likes the evolution analogy
In the remainder of the post I’ll go over three reasons suggesting that the first scenario would be much less likely than
Wrote a combined summary for this podcast and the original sequence here.
Planned summary for the Alignment Newsletter:
I have finally understood this sequence enough to write a summary about it, thanks to [AXRP Episode 5](https://www.alignmentforum.org/posts/FkMPXiomjGBjMfosg/axrp-episode-5-infra-bayesianism-with-vanessa-kosoy). Think of this as a combined summary + highlight of the sequence and the podcast episode.
The central problem of <@embedded agency@>(@Embedded Agents@) is that there is no clean separation between an agent and its environment: rather, the agent is _embedded_ in its environment, and so when reasoning
Ah excellent, thanks for the links. I'll send the Twitter thread in the next newsletter with the following summary:
Last week I speculated that CLIP might "know" that a textual adversarial example is a "picture of an apple with a piece of paper saying an iPod on it" and the zero-shot classification prompt is preventing it from demonstrating this knowledge. Gwern Branwen [commented](https://www.alignmentforum.org/posts/JGByt8TrxREo4twaw/an-142-the-quest-to-understand-a-network-well-enough-to?commentId=keW4DuE7G4SZn9h2r) to link me to this Twitter thread as w
Related: Interpretability vs Neuroscience: Six major advantages which make artificial neural networks much easier to study than biological ones. Probably not a major surprise to readers here.
I've discussed this question with a good number of people, and I think I've generally found my pro-academia arguments to be stronger than their pro-industry arguments (I think probably many of them would agree?)
I... think we've discussed this? But I don't agree, at least insofar as the arguments are supposed to apply to me as well (so e.g. not the personal fit part).
Some potential disagreements:
I'd like to see Hutter's model "translated" a bit to DNNs, e.g. by assuming they get anything right that's within epsilon of a training data point or something
With this assumption, asymptotically (i.e. with enough data) this becomes a nearest neighbor classifier. For the d-dimensional manifold assumption in the other model, you can apply the arguments from the other model to say that you scale as D^(-c/d) for some constant c (probably c = 1 or 2, depending on what exactly we're quantifying the scaling of).
I'm not entirely sure how you... (read more)
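A minimal sketch of that translation under an assumed setup (plain Euclidean distance, synthetic data; none of this is from Hutter's actual model or the linked scaling-law calculation):

```python
import numpy as np

# "Get anything right that's within epsilon of a training point" becomes, in the
# limit of enough data (every query has a training point within epsilon),
# just a 1-nearest-neighbor classifier.
def predict_1nn(X_train, y_train, x):
    return y_train[np.argmin(np.linalg.norm(X_train - x, axis=1))]

# If the data lie on a d-dimensional manifold, the typical distance to the
# nearest training point shrinks like D**(-1/d) with dataset size D, which is
# where scaling like D**(-c/d) (c depending on the error measure) comes from.
rng = np.random.default_rng(0)
X_train = rng.uniform(size=(1000, 2))            # toy example with d = 2
y_train = (X_train.sum(axis=1) > 1).astype(int)  # simple synthetic labels
print(predict_1nn(X_train, y_train, np.array([0.9, 0.8])))  # almost certainly 1
```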
Planned summary for the Alignment Newsletter:
We’ve <@previously seen@>(@Learning Normativity: A Research Agenda@) desiderata for agents that learn normativity from humans: specifically, we would like such agents to:
1. **Learn at all levels:** We don’t just learn about uncertain values, we also learn how to learn values, and how to learn to learn values, etc. There is **no perfect loss function** that works at any level; we assume conservatively that Goodhart’s Law will always apply. In order to not have to give infinite feedback for the infinite levels...
I feel like there's a pretty strong Occam's Razor-esque argument for preferring Hutter's model, even though it seems wildly less intuitive to me.
?? Overall this claim feels to me like:
Some ways that you could refute it:
I continue to not understand this but it seems like such a simple question that it must be that there's just some deeper misunderstanding of the exact proposal we're now debating. It seems not particularly worth it to find this misunderstanding; I don't think it will really teach us anything conceptually new.
(If I did want to find it, I would write out pseudocode for the new proposed system and then try to make a more precise claim in terms of the variables in the pseudocode.)
Planned summary for the Alignment Newsletter:
This post recommends that we think about AI alignment research in the following framework:
1. Defining the problem and its terms: for example, we might want to define “agency”, “optimization”, “AI”, and “well-behaved”.
2. Exploring these definitions, to see what they entail.
3. Solving the now well-defined problem.
This is explicitly _not_ a paradigm, but rather a framework in which we can think about possible paradigms for AI safety. A specific paradigm would choose a specific problem formulation and definition (or
Planned summary for the Alignment Newsletter:
One argument against work on AI safety is that [it is hard to do good work without feedback loops](https://www.jefftk.com/p/why-global-poverty). So how could we get feedback loops? The most obvious approach is to actually try to align strong models right now, in order to get practice with aligning models in the future. This post fleshes out what such an approach might look like. Note that I will not be covering all of the points mentioned in the post; if you find yourself skeptical you may want to read the full
So my main crux here is whether you can be sufficiently confident of the 5x, to know that your tools which are 5x-appropriate apply.
This makes sense, though I probably shouldn't have used "5x" as my number -- it definitely feels intuitively more like your tools could be robust to many orders of magnitude of increased compute / model capacity / data. (Idk how you would think that relates to a scaling factor on intelligence.) I think the key claim / crux here is something like "we can develop techniques that are robust to scaling up compute / capacity / data by N orders, where N doesn't depend significantly on the current compute / capacity / data".
Most of this makes sense (or perhaps more accurately, sounds like it might be true, but there's a good chance if I reread the post and all the comments I'd object again / get confused somehow). One thing though:
Every piece of feedback gets put into the same big pool which helps define Hv, the initial ("human") value function. [...]
Okay, I think with this elaboration I stand by what I originally said:
It seemed to me like since the first few bits of feedback determine how the system interprets all future feedback, it's particularly important for those first
Makes it sound like there's some structural equivalence to a human thinking for a long time, which there isn't.
Yes, I explicitly agree with this, which is why the first thing in my previous response was
sorry, that's right, I was speaking pretty loosely.
I agree with the other responses from Ajeya / Paul / Raemon, but to add some more info:
Where did this idea of HCH yielding the same benefits as a human thinking for a long time come from???
... I don't really know. My guess is that I picked it up from reading giant comment threads between Paul and other people.
I don't see any reason at all to expect it to do anything remotely similar to that.
Tbc it doesn't need to be literally true. The argument needed for safety is something like "a large team of copies of non-expert agents could together be as capable as ... (read more)
Yeah, sorry, that's right, I was speaking pretty loosely. You'd still have the same hope -- maybe a team of 2^100 copies of the business owner could draft a contract just as well, or better than, an already expert business-owner. I just personally find it easier to think about "benefits of a human thinking for a long time" and then "does HCH get the same benefits as humans thinking for a long time" and then "does iterated amplification get the same benefits as HCH".
One approach is to let the human giving feedback think for a long time. Maybe the business owner by default can't write a good contract, but a business owner who could study the relevant law for a year would do just as well as the already expert business-owner. In the real world this is too expensive to do, but there's hope in the AI case (e.g. that's a hope behind iterated amplification).
It seems to me that the type of research you're discussing here is already seen as a standard way to make progress on the full alignment problem - e.g. the Stiennon et al. paper you cited, plus earlier work on reward modeling by Christiano, Leike, and others. Can you explain why you're institutionally uncertain whether to prioritise it - is it because of the objections you outlined?
It's important to distinguish between:
That said, I'd be interested to hear why you have similar feelings about the non-Neuromorph answers, considering that you agreed with the point I was making in the birds/brains/etc. post. If we aren't trying to replicate the brain, but just to do something that works, yes there will be lots of details to work out, but what positive reason do you have to think that the amount of special sauce / details is so high that 12 OOMs and a few years isn't enough to find it?
The positive reason is basically all the reasons given in Ajeya's report? Like, we don't tend... (read more)
idk, I feel like maybe at this point we should make bets or something, and then go read the literature and see who is right? I don't find this prospect appealing but it seems like the epistemically virtuous thing to do.
Meh, I don't think it's a worthwhile use of my time to read that literature, but I'd make a bet if we could settle on an operationalization and I didn't have to settle it.
What do you imagine happening, in the hypothetical, when we run the Neuromorph project?
I mostly expect that you realize that there were a bunch of things that were super un... (read more)
Neuromorph =/= an attempt to create uploads.
My impression is that the linked blog post is claiming we haven't even been able to get things that are qualitatively as impressive as a worm. So why would we get things that are qualitatively as impressive as a human? I'm not claiming it has to be an upload.
This is because, counterintuitively, worms being small makes them a lot harder to simulate.
I could believe this (based on the argument you mentioned) but it really feels like "maybe this could be true but I'm not that swayed from my default prior of 'it's pro... (read more)
I feel like if you think Neuromorph has a good chance of succeeding, you need to explain why we haven't uploaded worms yet. For C. elegans, if we ran 302 neurons for 1 subjective day (= 8.64e4 seconds) at 1.2e6 flops per neuron, and did this for 100 generations of 100 brains, that takes a mere 3e17 flops, or about $3 at current costs.
(And this is very easy to parallelize, so you can't appeal to that as a reason this can't be done.)
(It's possible that we have uploaded worms in the 7 years since that blog post was written, though I would have expected to hear about it if so.)
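Spelling out that arithmetic; the flops-per-dollar figure is my own round number chosen to reproduce the "about $3" claim, not something stated in the thread:

```python
# C. elegans upload cost estimate from the comment above, spelled out.
neurons = 302
seconds_per_subjective_day = 8.64e4
flops_per_neuron_per_second = 1.2e6
generations = 100
brains_per_generation = 100

total_flops = (neurons * seconds_per_subjective_day * flops_per_neuron_per_second
               * generations * brains_per_generation)
print(f"{total_flops:.1e} flops")                 # ~3.1e17 flops

flops_per_dollar = 1e17                           # assumed round number for "current costs"
print(f"~${total_flops / flops_per_dollar:.0f}")  # ~$3
```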
Planned summary for the Alignment Newsletter:
The <@biological anchors approach@>(@Draft report on AI timelines@) to forecasting AI timelines estimates the compute needed for transformative AI based on the compute used by animals. One important parameter of the framework is needed to “bridge” between the two: if we find that an animal can do a specific task using X amount of compute, then what should we estimate as the amount of compute needed for an ML model to do the same task? This post aims to better estimate this parameter, by comparing few-shot
It’s clear, however, that a bee’s brain can perform a wide range of tasks besides few-shot image classification, while the machine learning model developed in (Lee et al., 2019) cannot.
The abstract objection here is “if you choose an ML model that has been trained just for a specific task (few-shot learning), on priors you should expect it to be more efficient than an evolution-designed organism that has been trained for a whole bunch of stuff, of which the task you’re considering is just one example”. This can cash out in several ways:
1. Bees presumably u... (read more)
Planned summary for the Alignment Newsletter:
This post distinguishes between three kinds of “alignment”:
1. Not building an AI system at all,
2. Building Friendly AI that will remain perfectly aligned for all time and capability levels,
3. _Bootstrapped alignment_, in which we build AI systems that may not be perfectly aligned but are at least aligned enough that we can use them to build perfectly aligned systems.

The post argues that optimization-based approaches can’t lead to perfect alignment, because there will always eventually be Goodhart effects.
I agree it's not vacuous. It sounds like you're mostly stating the same argument I gave but in different words. Can you tell me what's wrong or missing from my summary of the argument?
Since it is possible to compress high-probability events using an optimal code for the probability distribution, you might expect that functions with high probability in the neural network prior can be compressed more than functions with low probability. Since high probability functions are more likely, this means that the more likely functions correspond to shorter programs.
The main point, as I see it, is essentially that functions with good generalisation correspond to large volumes in parameter-space, and that SGD finds functions with a probability roughly proportional to their volume.
This seems right, but I'm not sure how that's different from Zach's phrasing of the main point? Zach's phrasing was "SGD approximately equals random sampling", and random sampling finds functions with probability exactly proportional to their volume. Combine that with the fact that empirically we get good generalization and we get the thing yo... (read more)
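Here's a tiny toy illustration of the "random sampling finds functions with probability proportional to their parameter-space volume" framing, using a made-up 2-input network and a Gaussian prior (purely to show the mechanism; this is not a reproduction of the paper's experiments):

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(0)
X = np.array([[x1, x2] for x1 in (0, 1) for x2 in (0, 1)], dtype=float)

def tiny_net(params, X):
    # A 2-4-1 MLP with a tanh hidden layer and thresholded output.
    W1, b1 = params[:8].reshape(2, 4), params[8:12]
    W2, b2 = params[12:16], params[16]
    h = np.tanh(X @ W1 + b1)
    return (h @ W2 + b2 > 0).astype(int)

# Sample parameters from a Gaussian prior and record which boolean function
# (on the four inputs) each sample implements.
N_SAMPLES = 20_000
counts = Counter()
for _ in range(N_SAMPLES):
    params = rng.normal(size=17)
    counts[tuple(int(v) for v in tiny_net(params, X))] += 1

# Functions occupying larger parameter-space volume show up proportionally more often.
for fn, c in counts.most_common(5):
    print(fn, c / N_SAMPLES)
```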
What John said. To elaborate, it's specifically talking about the case where there is some concept from which some probabilistic generative model creates observations tied to the concept, and claiming that the log probabilities follow a polynomial.
Suppose the most dog-like nose size is K. One function you could use is y = exp(-(x - K)^d) for some positive integer d. That's a function that is maximized at x = K (where higher values = more "dogness") and doesn't blow up unreasonably anywhere.
(Really you should be talking about probabilities, in which case you use the same sort of function but then normalize, which transforms the exp into a softmax, as the paper suggests)
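To make that concrete (K, d, and the grid of candidate nose sizes below are arbitrary illustrative choices):

```python
import numpy as np

K, d = 3.0, 2                   # illustrative: most dog-like nose size and exponent
x = np.linspace(0.0, 6.0, 7)    # candidate nose sizes

dogness = np.exp(-(x - K) ** d)  # peaks at x = K; higher = more "dog-like"
probs = dogness / dogness.sum()  # normalizing exp(polynomial) = softmax(-(x - K)**d)

print(probs.round(3))            # probability mass concentrates around x = K
```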
The core conceptual argument is: the higher your utility function can go, the bigger the world must be, and so the more bits it must take to describe it in its unoptimized state under M2, and so the more room there is to reduce the number of bits.
If you could only ever build 10 paperclips, then maybe it takes 100 bits to specify the unoptimized world, and 1 bit to specify the optimized world.
If you could build 10^100 paperclips, then the world must be humongous and it takes 10^101 bits to specify the unoptimized world, but still just 1 bit to specify the p... (read more)
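A quick sketch of the arithmetic, with the configuration counts invented purely so the logarithms match the numbers used above:

```python
import math

# Description length in bits of picking one configuration out of n equally
# likely configurations.
def bits(n_configurations):
    return math.log2(n_configurations)

# Small world: ~2^100 unoptimized configurations vs ~2 optimized ones.
print(bits(2**100) - bits(2), "bits of room to optimize")   # 99.0

# Huge world (room for 10^100 paperclips): the unoptimized description is
# ~10^101 bits while the optimized one is still ~1 bit, so utility maximization
# can shave off vastly more description length.
print(1e101 - bits(2), "bits of room to optimize")          # ~1e101
```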
I think there's a perspective where the post-singularity failure is still the important thing to talk about, and that's an error I made in writing the post. I skipped it because there is no real action after the singularity---the damage is irreversibly done, all of the high-stakes decisions are behind us---but it still matters for people trying to wrap their heads around what's going on. And moreover, the only reason it looks that way to me is because I'm bringing in a ton of background empirical assumptions (e.g. I believe that massive acceleration in gro
Yeah, I think I feel like that's the part where I don't think I could replicate your intuitions (yet).
I don't think we disagree; I'm just noting that this methodology requires a fair amount of intuition / discretion, and I don't feel like I could do this myself. This is much more a statement about what I can do, rather than a statement about how good the methodology is on some absolute scale.
(Probably I could have been clearer about this in the original opinion.)