All of Richard_Ngo's Comments + Replies

Gradient Hacking via Schelling Goals

Interesting post :) I'm intuitively a little skeptical - let me try to figure out why.

I think I buy that some reasoning process could consistently decide to hack in a robust way. But there are probably parts of that reasoning process that are still somewhat susceptible to being changed by gradient descent. In particular, hacking relies on the agent knowing what its current mesa-objective is - but that requires some type of introspective access, which may be difficult and the type of thing which could be hindered by gradient descent (especially when you're ... (read more)

ARC's first technical report: Eliciting Latent Knowledge

Ah, that makes sense. In the section where you explain the steps of the game, I interpreted the comments in parentheses as further explanations of the step, rather than just a single example. (In hindsight the latter interpretation is obvious, but I was reading quickly - might be worth making this explicit for others who are doing the same.) So I thought that Bayes nets were built into the methodology. Apologies for the oversight!

I'm still a little wary of how much the report talks about concepts in a human's Bayes net without really explaining why this is... (read more)

Ajeya Cotra (1mo): Ah got it. To be clear, Paul and Mark do in practice consider a bank of multiple counterexamples for each strategy with different ways the human and predictor could think, though they're all pretty simple in the same way the Bayes net example is (e.g. deduction from a set of axioms); my understanding is that essentially the same kind of counterexamples apply for essentially the same underlying reasons for those other simple examples. The doc sticks with one running example for clarity / length reasons.
ARC's first technical report: Eliciting Latent Knowledge

Speaking just for myself, I think about this as an extension of the worst-case assumption. Sure, humans don't reason using Bayes nets -- but if we lived in a world where the beings whose values we want to preserve did reason about the world using a Bayes net, that wouldn't be logically inconsistent or physically impossible, and we wouldn't want alignment to fail in that world.

If you solve something given worst-case assumptions, you've solved it for all cases. Whereas if you solve it for one specific case (e.g. Bayes nets) then it may still fail if that's n... (read more)

Ajeya Cotra (1mo): Sorry, there were two things you could have meant when you said the assumption that the human uses a Bayes net seemed crucial. I thought you were asking why the builder couldn't just say "That's unrealistic" when the breaker suggested the human runs a Bayes net. The answer to that is what I said above -- because the assumption is that we're working in the worst case, the builder can't invoke unrealism to dismiss the counterexample.

If the question is instead "Why is the builder allowed to just focus on the Bayes net case?", the answer to that is the iterative nature of the game. The Bayes net case (and in practice a few other simple cases) was the case the breaker chose to give, so if the builder finds a strategy that works for that case they win the round. Then the breaker can come back and add complications which break the builder's strategy again, and the hope is that after many rounds we'll get to a place where it's really hard to think of a counterexample that breaks the builder's strategy despite trying hard.
Paul Christiano (1mo): The breaker is definitely allowed to introduce counterexamples where the human isn't well-modeled using a Bayes net. Our training strategies (introduced here [https://docs.google.com/document/d/1WwsnJQstPq91_Yh-Ch2XRL8H_EpsnjrC1dwZXR37PC8/edit#heading=h.1xpao6tk9oiv]) don't say anything at all about Bayes nets, and so it's not clear if this immediately helps the breaker---they are the one who introduced the assumption that the human used a Bayes net (in order to describe a simplified situation where the naive training strategy failed here [https://docs.google.com/document/d/1WwsnJQstPq91_Yh-Ch2XRL8H_EpsnjrC1dwZXR37PC8/edit#heading=h.sm4amv12m66a]). We're definitely not intentionally viewing Bayes nets as part of the definition of the game.

It seems very plausible that after solving the problem for humans-who-use-Bayes-nets we will find a new counterexample that only works for humans-who-don't-use-Bayes-nets, in which case we'll move on to those counterexamples. It seems even more likely that the builder will propose an algorithm that exploits cognition that humans can do which isn't well captured by the Bayes net model, which is also fair game. (And indeed several of our approaches do it, e.g. when imagining humans learning new things about the world by performing experiments here [https://docs.google.com/document/d/1WwsnJQstPq91_Yh-Ch2XRL8H_EpsnjrC1dwZXR37PC8/edit#heading=h.3l614s96sz9t] or reasoning about plausibility of model joint distributions here [https://docs.google.com/document/d/1WwsnJQstPq91_Yh-Ch2XRL8H_EpsnjrC1dwZXR37PC8/edit#heading=h.w0iwyfch6ysy].) That said, it looks to us like if any of these algorithms worked for Bayes nets, they would at least work for a very broad range of human models; the Bayes net assumption doesn't seem to be changing the picture much qualitatively.

Echoing Mark in his comment, we're definitely interested in ways that this assumption seems importantly unrealistic. If you just think it's generally a mediocre mode...
ARC's first technical report: Eliciting Latent Knowledge

We’ll assume the humans who constructed the dataset also model the world using their own internal Bayes net.

This seems like a crucial premise of the report; could you say more about it? You discuss why a model using a Bayes net might be "oversimplified and unrealistic", but as far as I can tell you don't talk about why this is a reasonable model of human reasoning.
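To make the assumption concrete, here is a minimal, invented sketch of what "a human modeling the world with their own internal Bayes net" could look like, loosely echoing the report's diamond-in-the-vault example; the variables and probabilities below are made up for illustration and are not taken from the report:

```python
# A toy "human Bayes net" (hypothetical): latent world states cause observations,
# and the human's beliefs are posteriors computed over that graph.
from itertools import product

p_diamond = 0.5   # prior that the diamond is in the vault
p_tamper = 0.1    # prior that the sensor has been tampered with

def p_sensor_shows_diamond(diamond, tamper):
    # A tampered sensor shows the diamond regardless of whether it is there.
    return 0.99 if (tamper or diamond) else 0.01

def joint(diamond, tamper, sensor):
    pd = p_diamond if diamond else 1 - p_diamond
    pt = p_tamper if tamper else 1 - p_tamper
    ps = p_sensor_shows_diamond(diamond, tamper)
    ps = ps if sensor else 1 - ps
    return pd * pt * ps

# Posterior P(diamond | sensor shows diamond), marginalising over tampering.
numerator = sum(joint(1, t, 1) for t in (0, 1))
denominator = sum(joint(d, t, 1) for d, t in product((0, 1), repeat=2))
print(numerator / denominator)  # ~0.90: the "human's" belief given the sensor reading
```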

Ajeya Cotra (1mo): Speaking just for myself, I think about this as an extension of the worst-case assumption. Sure, humans don't reason using Bayes nets -- but if we lived in a world where the beings whose values we want to preserve did reason about the world using a Bayes net, that wouldn't be logically inconsistent or physically impossible, and we wouldn't want alignment to fail in that world.

Additionally, I think the statement made in the report about AIs also applies to humans: we're using some sort of cognitive algorithms to reason about the world, and it's plausible that strategies which resemble inference on graphical models play a role in some of our understanding. There's no obvious way that a messier model of human reasoning which incorporates all the other parts should make ELK easier; there's nothing that we could obviously exploit to create a strategy.
Mark Xu (1mo): We don't think that real humans are likely to be using Bayes nets to model the world. We make this assumption for much the same reasons that we assume models use Bayes nets, namely that it's a test case where we have a good sense of what we want a solution to ELK to look like. We think the arguments given in the report will basically extend to more realistic models of how humans reason (or rather, we aren't aware of a concrete model of how humans reason for which the arguments don't apply). If you think there's a specific part of the report where the human Bayes net assumption seems crucial, I'd be happy to try to give a more general form of the argument in question.
Interlude: Agents as Automobiles

I guess I just don't feel like you've established that it would have been reasonable to have credence above 90% in either of those cases. Like, it sure seems obvious to me that computers and automobiles are super useful. But I have a huge amount of evidence now about both of those things that I can't really un-condition on. So, given that I know how powerful hindsight bias can be, it feels like I'd need to really dig into the details of possible alternatives before I got much above 90% based on facts that were known back then.

(Although this depends on how ... (read more)

Daniel Kokotajlo (1mo): Fair. If you don't share my intuition that people in 1950 should have had more than 90% credence that computers would be militarily useful, or that people at the dawn of steam engines should have predicted that automobiles would be useful (conditional on them being buildable), then that part of my argument has no force on you. Maybe instead of picking examples from the past, I should pick an example of a future technology that everyone agrees is 90%+ likely to be super useful if developed, even though Joe's skeptical arguments can still be made.
Interlude: Agents as Automobiles

Interesting post. Overall, though, it feels like you aren't taking hindsight bias seriously enough. E.g. as one example:

Some people thought battleships would beat carriers. Others thought that the entire war would be won from the air. Predicting the future is hard; we shouldn’t be confident. Therefore, we shouldn’t assign more than 90% credence to the claim that powerful, portable computers (assuming we figure out how to build them) will be militarily useful, e.g. in weapon guidance systems or submarine sensor suites.

In this particular case, an alternative... (read more)

Daniel Kokotajlo (1mo): Thanks! Ah, I shouldn't have put the word "portable" in there then. I meant to be talking about computers in general, not computers-on-missiles-as-opposed-to-ground-installations.

I think I agree with this, and should edit to clarify that that's not the argument I'm making... what I'm saying is that sometimes, "Of course X requires Y" and "of course Y will be useful/incentivised" predictions can be made in advance, with more than 90% confidence. I think "computers will be militarily useful" and "self-propelled vehicles will be useful" are two examples of this. The intuition I'm trying to pump is not "Look at this case where X was useful, therefore APS-AI will be useful" but rather "look at this case where it would have been reasonable for someone to be more than 90% confident that X was useful despite the presence of Joe's arguments; therefore, we should be open to the possibility that we should be more than 90% confident that APS-AI will be useful despite Joe's arguments."

Of course I still have work to do, to actually provide positive arguments that our credence should be higher than 90%... the point of my analogy was defensive, to defend against Joe's argument that because of such-and-such considerations we should have 20% credence on Not Useful/Incentivised. (Will think more about what you said and reply later.)
Conversation on technology forecasting and gradualism

I didn't push this point at the time, but Paul's claim that "GPT-3 + 5 person-years of engineering effort [would] foom" seems really wild to me, and probably a good place to poke at his model more. Is this 5 years of engineering effort and then humans leaving it alone with infinite compute? Or are the person-years of engineering doled out over time?

Unlike Eliezer, I do think that language models not wildly dissimilar to our current ones will be able to come up with novel insights about ML, but there's a long way between "sometimes comes up with novel insig... (read more)

I didn't push this point at the time, but Paul's claim that "GPT-3 + 5 person-years of engineering effort [would] foom" seems really wild to me, and probably a good place to poke at his model more. Is this 5 years of engineering effort and then humans leaving it alone with infinite compute?

The 5 years are up front and then it's up to the AI to do the rest. I was imagining something like 1e25 flops running for billions of years.

I don't really believe the claim unless you provide computing infrastructure that is externally maintained or else extremely ... (read more)
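For scale, a rough back-of-the-envelope on what that compute budget amounts to, assuming "billions of years" means on the order of 10^9 years (my assumption, not Paul's):

```python
# Illustrative arithmetic only: 1e25 FLOP/s sustained for ~1e9 years.
flops = 1e25                  # sustained compute, FLOP per second
seconds_per_year = 3.15e7
years = 1e9
total_flop = flops * seconds_per_year * years
print(f"total budget ~ {total_flop:.1e} FLOP")  # ~3.2e41 FLOP
```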

Rob Bensinger (2mo): Maybe something like '5 years of engineering effort to start automating work that qualitatively (but incredibly slowly and inefficiently) is helping with AI research, and then a few decades of throwing more compute at that for the AI to reach superintelligence'? With infinite compute you could just recapitulate evolution, so I doubt Paul thinks there's a crux like that? But there could be a crux that's about whether GPT-3.5 plus a few decades of hardware progress achieves superintelligence, or about whether that's approximately the fastest way to get to superintelligence, or something.
Biology-Inspired AGI Timelines: The Trick That Never Works

The two extracts from this post that I found most interesting/helpful:

The problem is that the resource gets consumed differently, so base-rate arguments from resource consumption end up utterly unhelpful in real life.  The human brain consumes around 20 watts of power.  Can we thereby conclude that an AGI should consume around 20 watts of power, and that, when technology advances to the point of being able to supply around 20 watts of power to computers, we'll get AGI?

I'm saying that Moravec's "argument from comparable resource consumption" must

... (read more)
Ngo and Yudkowsky on AI capability gains

My recommended policy in cases where this applies is "trust your intuitions and operate on the assumption that you're not a crackpot." 
 

Oh, certainly Eliezer should trust his intuitions and believe that he's not a crackpot. But I'm not arguing about what the person with the theory should believe, I'm arguing about what outside observers should believe, if they don't have enough time to fully download and evaluate the relevant intuitions. Asking the person with the theory to give evidence that their intuitions track reality isn't modest epistemology.

Ngo and Yudkowsky on AI capability gains

the easiest way to point out why they are dumb is with counterexamples. We can quickly "see" the counterexamples. E.g., if you're trying to see AGI as the next step in capitalism, you'll be able to find counterexamples where things become altogether different (misaligned AI killing everything; singleton that brings an end to the need to compete).

I'm not sure how this would actually work. The proponent of the AGI-capitalism analogy might say "ah yes, AGI killing everyone is another data point on the trend of capitalism becoming increasingly destructive". Or... (read more)

Lukas_Gloor (2mo): My only reply is "You know it when you see it." And yeah, a crackpot would reason the same way, but non-modest epistemology says that if it's obvious to you that you're not a crackpot then you have to operate on the assumption that you're not a crackpot. (In the alternative scenario, you won't have much impact anyway.) Specifically, the situation I mean is the following:

* You have an epistemic track record like Eliezer or someone making lots of highly upvoted posts in our communities.
* You find yourself having strong intuitions about how to apply powerful principles like "consequentialism" to new domains, and your intuitions are strong because it feels to you like you have a gears-level understanding that others lack. You trust your intuitions in cases like these.

My recommended policy in cases where this applies is "trust your intuitions and operate on the assumption that you're not a crackpot." Maybe there's a potential crux here about how much of scientific knowledge is dependent on successful predictions. In my view, the sequences have convincingly argued that locating the hypothesis in the first place is often done in the absence of already successful predictions, which goes to show that there's a core of "good reasoning" that lets you jump to (tentative) conclusions, or at least good guesses, much faster than if you were to try lots of things at random.
Ngo and Yudkowsky on AI capability gains

Your comment is phrased as if the object-level refutations have been tried, while conveying the meta-level intuitions hasn't been tried. If anything, it's the opposite: the sequences (and to some extent HPMOR) are practically all content about how to think, whereas Yudkowsky hasn't written anywhere near as extensively on object-level AI safety.

This has been valuable for community-building, but less so for making intellectual progress - because in almost all domains, the most important way to make progress is to grapple with many object-level problems, unti... (read more)

Adam Shimi (2mo): Thanks for giving more details about your perspective.

It's not clear to me that the sequences and HPMOR are good pointers for this particular approach to theory building. I mean, I'm sure there are posts in the sequences that touch on that (Einstein's Arrogance [https://www.lesswrong.com/posts/MwQRucYo6BZZwjKE7/einstein-s-arrogance] is an example I already mentioned), but I expect that they only talk about it in passing and obliquely, and that such posts are spread all over the sequences. Plus the fact that Yudkowsky said that there was a new subsequence to write led me to believe that he doesn't think the information is clearly stated already. So I don't think you can really put the current confusion as evidence that the explanation of how that kind of theory would work doesn't help, given that this isn't readily available in a form I or anyone reading this can access AFAIK.

Completely agree that these intuitions are important training data. But your whole point in other comments is that we want to understand why we should expect these intuitions to differ from apparently bad/useless analogies between AGI and other stuff. And some explanation of where these intuitions come from could help with evaluating these intuitions, even more because Yudkowsky has said that he could write a sequence about the process.

This sounds to me like a strawman of my position (which might be my fault for not explaining it well).

* First, I don't think explaining a methodology is a "very high-level epistemological principle", because it lets us concretely pick apart and criticize the methodology as a truthfinding method.
* Second, the object-level work has already been done by Yudkowsky! I'm not saying that some outside-of-the-field epistemologist should ponder really hard about what would make sense for alignment without ever working on it concretely and then give us their teaching. Instead I'm pushing for a researcher who has built a coherent collections o...
Ngo and Yudkowsky on AI capability gains

I don't expect such a sequence to be particularly useful, compared with focusing on more object-level arguments. Eliezer says that the largest mistake he made in writing his original sequences was that he "didn’t realize that the big problem in learning this valuable way of thinking was figuring out how to practice it, not knowing the theory". Better, I expect, to correct the specific mistakes alignment researchers are currently making, until people have enough data points to generalise better.

Adam Shimi (2mo): I'm honestly confused by this answer. Do you actually think that Yudkowsky having to correct everyone's object-level mistakes all the time is strictly more productive and will lead faster to the meat of the deconfusion than trying to state the underlying form of the argument and theory, and then adapting it to the object-level arguments and comments?

I have trouble understanding this, because for me the outcome of the first one is that no one gets it, he has to repeat himself all the time without making the debate progress, and this is one more giant hurdle for anyone trying to get into alignment and understand his position. It's unclear whether the alternative would solve all these problems (as you quote from the preface of the Sequences, learning the theory is often easier and less useful than practicing), but it still sounds like a powerful accelerator.

There is no dichotomy of "theory or practice"; we probably need both here. And based on my own experience reading the discussion posts and the discussions I've seen around these posts, the object-level refutations have not been particularly useful forms of practice, even if they're better than nothing.
Ngo and Yudkowsky on AI capability gains

it seems to me that you want properly to be asking "How do we know this empirical thing ends up looking like it's close to the abstraction?" and not "Can you show me that this abstraction is a very powerful one?"

I agree that "powerful" is probably not the best term here, so I'll stop using it going forward (note, though, that I didn't use it in my previous comment, which I endorse more than my claims in the original debate).

But before I ask "How do we know this empirical thing ends up looking like it's close to the abstraction?", I need to ask "Does the ab... (read more)

Ngo and Yudkowsky on AI capability gains

I'm still trying to understand the scope of expected utility theory, so examples like this are very helpful! I'd need to think much more about it before I had a strong opinion about how much they support Eliezer's applications of the theory, though.

Ngo and Yudkowsky on AI capability gains

Not a problem. I share many of your frustrations about modesty epistemology and about most alignment research missing the point, so I sympathise with your wanting to express them.

On consequentialism: I imagine that it's pretty frustrating to keep having people misunderstand such an important concept, so thanks for trying to convey it. I currently feel like I have a reasonable outline of what you mean (e.g. to the level where I could generate an analogy about as good as Nate's laser analogy), but I still don't know whether the reason you find it much more c... (read more)

Ngo and Yudkowsky on AI capability gains

My model of Eliezer says that there is some deep underlying concept of consequentialism, of which the "not very coherent consequentialism" is a distorted reflection; and that this deep underlying concept is very closely related to expected utility theory. (I believe he said at one point that he started using the word "consequentialism" instead of "expected utility maximisation" mainly because people kept misunderstanding what he meant by the latter.)

I don't know enough about conservative vector fields to comment, but on priors I'm pretty skeptical of this being a good example of coherent utilities; I also don't have a good guess about what Eliezer would say here.

Ngo and Yudkowsky on AI capability gains

Thanks! I think that this is a very useful example of an advance prediction of utility theory; and that gathering more examples like this is one of the most promising way to make progress on bridging the gap between Eliezer's and most other people's understandings of consequentialism.

Potentially important thing to flag here: at least in my mind, expected utility theory (i.e. the property Eliezer was calling "laser-like" or "coherence") and consequentialism are two distinct things. Consequentialism will tend to produce systems with (approximate) coherent expected utilities, and that is one major way I expect coherent utilities to show up in practice. But coherent utilities can in-principle occur even without consequentialism (e.g. conservative vector fields in physics), and consequentialism can in-principle not be very coherent (e.g. if... (read more)
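To spell out the vector-field analogy as I read it (my own gloss, not something stated in the comment above): a force field is "coherent" in the relevant sense when it is conservative, i.e. (on a simply connected domain)

$$\vec{F} = -\nabla U \quad\Longleftrightarrow\quad \oint_C \vec{F}\cdot d\vec{\ell} = 0 \;\text{ for every closed path } C.$$

The potential $U$ plays the role of a utility function: path-independence is the analogue of having no circular preferences that could be money-pumped, yet nothing in the physics is doing consequentialist search, so you get coherence without consequentialism.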

Ngo and Yudkowsky on AI capability gains

My objection is mostly fleshed out in my other comment. I'd just flag here that "In other words, you have to do things the "hard way"--no shortcuts" assigns the burden of proof in a way which I think is not usually helpful. You shouldn't believe my argument that I have a deep theory linking AGI and evolution unless I can explain some really compelling aspects of that theory. Because otherwise you'll also believe in the deep theory linking AGI and capitalism, and the one linking AGI and symbolic logic, and the one linking intelligence and ethics, and the on... (read more)

It also isn't clear to me that Eliezer has established the strong inferences he draws from noticing this general pattern ("expected utility theory/consequentialism"). But when you asked Eliezer (in the original dialogue) to give examples of successful predictions, I was thinking "No, that's not how these things work." In the mistaken applications of Grand Theories you mention (AGI and capitalism, AGI and symbolic logic, intelligence and ethics, recursive self-improvement and cultural evolution, etc.), the easiest way to point out why they are dumb is with ... (read more)

Ngo and Yudkowsky on AI capability gains

I think we live in a world where there are very strong forces opposed to technological progress, which actively impede a lot of impactful work, including technologies which have the potential to be very economically and strategically important (e.g. nuclear power, vaccines, genetic engineering, geoengineering).

This observation doesn't lead me to a strong prediction that all such technologies will be banned; nor even that the most costly technologies will be banned - if the forces opposed to technological progress were even approximately rational, then bann... (read more)

Ngo and Yudkowsky on AI capability gains

Strong upvote, you're pointing at something very important here. I don't think I'm defending epistemic modesty, I think I'm defending epistemic rigour, of the sort that's valuable even if you're the only person in the world.

I suspect Richard isn't actually operating from a frame where he can produce the thing I asked for in the previous paragraphs (a strong model of where expected utility is likely to fail, a strong model of how a lack of "successful advance predictions"/"wide applications" corresponds to those likely failure modes, etc).

Yes, this is corre... (read more)

Vanessa Kosoy (2mo): FDT was made rigorous [https://www.lesswrong.com/s/CmrW8fCmSLK7E25sa/p/e8qFDMzs2u9xf5ie6] by infra-Bayesianism [https://www.lesswrong.com/s/CmrW8fCmSLK7E25sa/p/GS5P7LLLbSSExb3Sk], at least in the pseudocausal case.

I think some of your confusion may be that you're putting "probability theory" and "Newtonian gravity" into the same bucket.  You've been raised to believe that powerful theories ought to meet certain standards, like successful bold advance experimental predictions, such as Newtonian gravity made about the existence of Neptune (quite a while after the theory was first put forth, though).  "Probability theory" also sounds like a powerful theory, and the people around you believe it, so you think you ought to be able to produce a powerful advance p... (read more)

A positive case for how we might succeed at prosaic AI alignment

we already know how to build myopic optimizers

What are you referring to here?

A positive case for how we might succeed at prosaic AI alignment

That all makes sense. But I had a skim of (2), (3), (4), and (5) and it doesn't seem like they help explain why myopia is significantly more natural than "obey humans"?

Evan Hubinger (2mo): I mean, that's because this is just a sketch, but a simple argument for why myopia is more natural than “obey humans” is that if we don't care about competitiveness, we already know how to build myopic optimizers, whereas we don't know how to build an optimizer to “obey humans” at any level of capabilities. Furthermore, LCDT [https://www.alignmentforum.org/posts/Y76durQHrfqwgwM5o/lcdt-a-myopic-decision-theory] is a demonstration that we can at least reduce the complexity of specifying myopia to the complexity of specifying agency. I suspect we can get much better upper bounds on the complexity than that, though.
A positive case for how we might succeed at prosaic AI alignment

The key idea, in the case of HCH, would be to direct that optimization towards the goal of producing an action that is maximally close to what HCH would do.

Why do you expect this to be any easier than directing that optimisation towards the goal of "doing what the human wants"? In particular, if you train a system on the objective "imitate HCH", why wouldn't it just end up with the same long-term goals as HCH has? That seems like a much more natural thing for it to learn than the concept of imitating HCH, because in the process of imitating HCH it still ha... (read more)

Why do you expect this to be any easier than directing that optimisation towards the goal of "doing what the human wants"? In particular, if you train a system on the objective "imitate HCH", why wouldn't it just end up with the same long-term goals as HCH has?

To be clear, I was only talking about (1) here, which is just about what it might look like for an agent to be myopic, not how to actually get an agent that satisfies (1). I agree that you would most likely get a proxy-aligned model if you just trained on “imitate HCH”—but just training on “imitat... (read more)

My understanding of the alignment problem

+1 for interesting investigations. I want to push back on your second point, though - the framing of the problem of high-level distributional shift. I don't think this actually captures the core thing we're worried about. For example, we can imagine a model that remains in the same environment, but becomes increasingly intelligent during training, until it realises that it has the option of doing a treacherous turn. Or we can think about the case of humans - the core skills and goals that make us dangerous to other species developed in our ancestral enviro... (read more)

Discussion with Eliezer Yudkowsky on AGI interventions

I really feel there is more disagreement on the second question than on the first

What is this feeling based on? One way we could measure this is by asking people about how much AI xrisk there is conditional on there being no more research explicitly aimed at aligning AGIs. I expect that different people would give very different predictions.

People like Paul and Evan and more are actually going for the core problems IMO, just anchoring a lot of their thinking in current ML technologies.

Everyone agrees that Paul is trying to solve foundational problems. And ... (read more)

Discussion with Eliezer Yudkowsky on AGI interventions

This is already reflected in the upvotes, but just to say it explicitly: I think the replies to this comment from Rob and dxu in particular have been exceptionally charitable and productive; kudos to them. This seems like a very good case study in responding to a provocative framing with a concentration of positive discussion norms that leads to productive engagement.

Discussion with Eliezer Yudkowsky on AGI interventions

I think one core issue here is that there are actually two debates going on. One is "how hard is the alignment problem?"; another is "how powerful are prosaic alignment techniques?" Broadly speaking, I'd characterise most of the disagreement as being on the first question. But you're treating it like it's mostly on the second question - like EY and everyone else are studying the same thing (cancer, in your metaphor) and just disagree about how to treat it.

My attempt to portray EY's perspective is more like: he's concerned with the problem of ageing, and a ... (read more)

95% of all ML researchers don't think it's a problem, or think it's something we'll solve easily

The 2016 survey of people in AI asked people about the alignment problem as described by Stuart Russell, and 39% said it was an important problem and 33% that it's a harder problem than most other problems in the field.

Thanks for the detailed comment!

I think one core issue here is that there are actually two debates going on. One is "how hard is the alignment problem?"; another is "how powerful are prosaic alignment techniques?" Broadly speaking, I'd characterise most of the disagreement as being on the first question. But you're treating it like it's mostly on the second question - like EY and everyone else are studying the same thing (cancer, in your metaphor) and just disagree about how to treat it.

That's an interesting separation of the problem, because I really feel... (read more)

Emergent modularity and safety

Thanks, that's helpful. I do think there's a weak version of this which is an important background assumption for the post (e.g. without that assumption I'd need to explain the specific ways in which ANNs and BNNs are similar), so I've now edited the opening lines to convey that weak version instead. (I still believe the original version but agree that it's not worth defending here.)

Emergent modularity and safety

Why would that be our default expectation?

Lots of reasons. Neural networks are modelled after brains. They both form distributed representations at very large scales, they both learn over time, etc etc. Sure, you've pointed out a few differences, but the similarities are so great that this should be the main anchor for our expectations (rather than, say, thinking that we'll understand NNs the same way we understand support vector machines, or the same way we understand tree search algorithms, or...).

johnswentworth (3mo): Perhaps a good way to summarize all this is something like "qualitatively similar models probably work well for brains and neural networks". I agree to a large extent with that claim (though there was a time when I would have agreed much less), and I think that's the main thing you need for the rest of the post. "Ways we understand" comes across as more general than that - e.g. we understand via experimentally probing physical neurons vs spectral clustering of a derivative matrix.

I'm not convinced that these similarities are great enough to merit such anchoring. Just because NNs have more in common with brains than with SVMs, does not imply that we will understand NNs in roughly the same ways that we understand biological brains. We could understand them in a different set of ways than we understand biological brains, and differently than we understand SVMs. 

Rather than arguing over reference class, it seems like it would make more sense to note the specific ways in which NNs are similar to brains, and what hints those specific similarities provide.

Emergent modularity and safety

Compare: when trying to predict events, you should use their base rate except when you have specific updates to it.

Similarly, I claim, our beliefs about brains should be the main reference for our beliefs about neural networks, which we can then update from.

I agree that the phrasing could be better; any suggestions?

johnswentworth (3mo): I actually think you could just drop that intro altogether, or move it later into the post. We do have pretty good evidence of modularity in the brain (as well as other biological systems) and in trained neural nets; it seems to be a pretty common property of large systems "evolved" by local optimization. And the rest of the post (as well as some of the other comments) does a good job of talking about some of that evidence. It's a good post, and I think the arguments later in the post are stronger than that opening.

(On the other hand, if you're opening with it because that was your own main prior, then that makes sense. In that case, maybe note that it was a prior for you, but that the evidence from other directions is strong enough that we don't need to rely much on that prior?)
Daniel_Eth (3mo): Yeah, I'm not trying to say that the point is invalid, just that phrasing may give the point more appeal than is warranted from being somewhat in the direction of a deepity [https://www.urbandictionary.com/define.php?term=deepity]. Hmm, I'm not sure what better phrasing would be.
Inner Alignment: Explain like I'm 12 Edition

The images in this post seem to be broken.

Rafael Harth (3mo): Thanks. It's because directupload often has server issues. I was supposed to rehost all images from my posts to a more reliable host, but apparently forgot this one. I'll fix it in a couple of hours.
Thoughts on gradient hacking

What mechanism would ensure that these two logic pieces only fire at the same time? Whatever it is, I expect that mechanism to be changed in response to failures.

Ofer Givoli (4mo): The two pieces of logic can use the same activation values as their input. For example, suppose they both (independently) cause failure if a certain activation value is above some threshold. (In which case each piece of logic "ruins" a different critical activation value.)
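A toy sketch of the redundancy being described, with invented names and thresholds, purely to make the structure concrete (not a claim about how a real mesa-optimiser would implement this):

```python
THRESHOLD = 3.0  # invented critical value

def piece_a(activations, output):
    # First piece of logic: if the shared activation is too high, ruin one critical output value.
    if activations[7] > THRESHOLD:
        output[0] = 0.0
    return output

def piece_b(activations, output):
    # Second, independent piece of logic reading the *same* activation, ruining a different
    # critical value -- so the two pieces fire together without any coordination mechanism.
    if activations[7] > THRESHOLD:
        output[1] = 0.0
    return output

def forward(activations, output):
    # Apply both pieces; gradient descent would have to remove both to stop the failures.
    return piece_b(activations, piece_a(activations, list(output)))
```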
Thoughts on gradient hacking

I discuss the possibility of it going in some other direction when I say "The two most salient options to me". But the bit of Evan's post that this contradicts is:

Now, if the model gets to the point where it's actually just failing because of this, then gradient descent will probably just remove that check—but the trick is never to actually get there.

Formalizing Objections against Surrogate Goals

Interesting report :) One quibble:

For one, our AIs can only use “things like SPI” if we actually formalize the approach

I don't see why this is the case. If it's possible for humans to start using things like SPI without a formalisation, why couldn't AIs too? (I agree it's more likely that we can get them to do so if we formalise it, though.)

Vojtech Kovarik (5mo): Thanks for pointing this out :-). Indeed, my original formulation is false; I agree with the "more likely to work if we formalise it" formulation.
Frequent arguments about alignment

Whether this is a point for the advocate or the skeptic depends on whether advances in RL from human feedback unlock other alignment work more than they unlock other capabilities work. I think there's room for reasonable disagreement on this question, although I favour the former.

Frequent arguments about alignment

Skeptic: It seems to me that the distinction between "alignment" and "misalignment" has become something of a motte and bailey. Historical arguments that AIs would be misaligned used it in sense 1: "AIs having sufficiently general and large-scale motivations that they acquire the instrumental goal of killing all humans (or equivalently bad behaviour)". Now people are using the word in sense 2: "AIs not quite doing what we want them to do". But when our current AIs aren't doing quite what we want them to do, is that mainly evidence that future, more general... (read more)

What Multipolar Failure Looks Like, and Robust Agent-Agnostic Processes (RAAPs)

These aren't complicated or borderline cases; they are central examples of what we are trying to avert with alignment research.

I'm wondering if the disagreement over the centrality of this example is downstream from a disagreement about how easy the "alignment check-ins" that Critch talks about are. If they are the sort of thing that can be done successfully in a couple of days by a single team of humans, then I share Critch's intuition that the system in question starts off only slightly misaligned. By contrast, if they require a significant proportion of ... (read more)

Samuel Dylan Martin (6mo): Perhaps this is a crux in this debate: if you think the 'agent-agnostic perspective' is useful, you also think a relatively steady state of 'AI Safety via Constant Vigilance' is possible. This would be a situation where systems that aren't significantly inner misaligned (otherwise they'd have no incentive to care about governing systems, feedback or other incentives) but are somewhat outer misaligned (so they are honestly and accurately aiming to maximise some complicated measure of profitability or approval, not directly aiming to do what we want them to do) can be kept in check by reducing competitive pressures, building the right institutions and monitoring systems, and ensuring we have a high degree of oversight.

Paul thinks that it's basically always easier to just go in and fix the original cause of the misalignment, while Andrew thinks that there are at least some circumstances where it's more realistic to build better oversight and institutions to reduce said competitive pressures, and the agent-agnostic perspective is useful for the latter of these projects, which is why he endorses it.

I think that this scenario of Safety via Constant Vigilance is worth investigating - I take Paul's later failure story [https://www.lesswrong.com/posts/LpM3EAakwYdS6aRKf/what-multipolar-failure-looks-like-and-robust-agent-agnostic?commentId=GvnDcxYxg9QznBobv] to be a counterexample to such a thing being possible, as it's a case where this solution was attempted and works for a little while before catastrophically failing. This also means that the practical difference between the RAAP 1a-d failure stories and Paul's story just comes down to whether there is an 'out' in the form of safety by vigilance.
Challenge: know everything that the best go bot knows about go

I'm not sure what you mean by "actual computation rather than the algorithm as a whole". I thought that I was talking about the knowledge of the trained model which actually does the "computation" of which move to play, and you were talking about the knowledge of the algorithm as a whole (i.e. the trained model plus the optimising bot).

Formal Inner Alignment, Prospectus

Mesa-optimizers are in the search space and would achieve high scores in the training set, so why wouldn't we expect to see them?

I like this as a statement of the core concern (modulo some worries about the concept of mesa-optimisation, which I'll save for another time).

With respect to formalization, I did say up front that less-formal work, and empirical work, is still valuable.

I missed this disclaimer, sorry. So that assuages some of my concerns about balancing types of work. I'm still not sure what intuitions or arguments underlie your optimism about fo... (read more)

Abram Demski (8mo): To me, the post as written seems like enough to spell out my optimism... there are multiple directions for formal work which seem under-explored to me. Well, I suppose I didn't focus on explaining why things seem under-explored. Hopefully the writeup-to-come will make that clear.
Formal Inner Alignment, Prospectus

I have fairly mixed feelings about this post. On one hand, I agree that it's easy to mistakenly address some plausibility arguments without grasping the full case for why misaligned mesa-optimisers might arise. On the other hand, there has to be some compelling (or at least plausible) case for why they'll arise, otherwise the argument that 'we can't yet rule them out, so we should prioritise trying to rule them out' is privileging the hypothesis. 

Secondly, it seems like you're heavily prioritising formal tools and methods for studying mesa-optimisatio... (read more)

I agree with much of this. I over-sold the "absence of negative story" story; of course there has to be some positive story in order to be worried in the first place. I guess a more nuanced version would be that I am pretty concerned about the broadest positive story, "mesa-optimizers are in the search space and would achieve high scores in the training set, so why wouldn't we expect to see them?" -- and think more specific positive stories are mostly of illustrative value, rather than really pointing to gears that I expect to be important. (With the excep... (read more)

Challenge: know everything that the best go bot knows about go

The trained AlphaZero model knows lots of things about Go, in a comparable way to how a dog knows lots of things about running.

But the algorithm that gives rise to that model can know arbitrarily few things. (After all, the laws of physics gave rise to us, but they know nothing at all.)

DanielFilan (8mo): Ah, understood. I think this is basically covered by talking about what the go bot knows at various points in time, a la this comment [https://www.lesswrong.com/posts/m5frrcYTSH6ENjsc9/challenge-know-everything-that-the-best-go-bot-knows-about?commentId=vXTiM5bJmrFdqHpKR] - it seems pretty sensible to me to talk about knowledge as a property of the actual computation rather than the algorithm as a whole. But from your response there it seems that you think that this sense isn't really well-defined.
Challenge: know everything that the best go bot knows about go

I'd say that this is too simple and programmatic to be usefully described as a mental model. The amount of structure encoded in the computer program you describe is very small, compared with the amount of structure encoded in the neural networks themselves. (I agree that you can have arbitrarily simple models of very simple phenomena, but those aren't the types of models I'm interested in here. I care about models which have some level of flexibility and generality, otherwise you can come up with dumb counterexamples like rocks "knowing" the laws of physic... (read more)

Agency in Conway’s Game of Life

I don't think there is a fundamental difference in kind between trees, bacteria, humans, and hypothetical future AIs

There's at least one important difference: some of these are intelligent, and some of these aren't.

It does seem plausible that the category boundary you're describing is an interesting one. But when you indicate in your comment below that you see the "AI hypothesis" and the "life hypothesis" as very similar, then that mainly seems to indicate that you're using a highly nonstandard definition of AI, which I expect will lead to confusion.

Alex Flint (9mo): Well surely if I built a robot that was able to gather resources and reproduce itself as effectively as either a bacterium or a tree, I would be entirely justified in calling it an "AI". I would certainly have no problem using that terminology for such a construction at any mainstream robotics conference, even if it performed no useful function beyond self-reproduction. Of course we wouldn't call an actual tree or an actual bacterium an "AI" because they are not artificial.
Agency in Conway’s Game of Life

It feels like this post pulls a sleight of hand. You suggest that it's hard to solve the control problem because of the randomness of the starting conditions. But this is exactly the reason why it's also difficult to construct an AI with a stable implementation. If you can do the latter, then you can probably also create a much simpler system which creates the smiley face.

Similarly, in the real world, there's a lot of randomness which makes it hard to carry out tasks. But there are a huge number of strategies for achieving things in the world which don't r... (read more)

Well yes, I do think that trees and bacteria exhibit this phenomenon of starting out small and growing in impact. The scope of their impact is limited in our universe by the spatial separation between planets, and by the presence of even more powerful world-reshapers in their vicinity, such as humans. But on this view of "which entities are reshaping the whole cosmos around here?", I don't think there is a fundamental difference in kind between trees, bacteria, humans, and hypothetical future AIs. I do think there is a fundamental difference in kind betwee... (read more)

AprilSR (9mo): I think the stuff about the supernovas addresses this: a central point is that the “AI” must be capable of generating an arbitrary world state within some bounds.
Challenge: know everything that the best go bot knows about go

The human knows the rules and the win condition. The optimisation algorithm doesn't, for the same reason that evolution doesn't "know" what dying is: neither are the types of entities to which you should ascribe knowledge.

DanielFilan (8mo): Suppose you have a computer program that gets two neural networks, simulates a game of go between them, determines the winner, and uses the outcome to modify the neural networks. It seems to me that this program has a model of the 'go world', i.e. a simulator, and from that model you can fairly easily extract the rules and winning condition. Do you think that this is a model but not a mental model, or that it's too exact to count as a model, or something else?
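Hypothetical pseudocode for the program being described, just to make concrete where the rules and win condition live (all names are invented):

```python
def train_by_self_play(net_a, net_b, num_games, simulate_game, update):
    # simulate_game encodes the rules of go and the win condition; the loop itself
    # only ever sees the two networks and the winner it gets back.
    for _ in range(num_games):
        winner = simulate_game(net_a, net_b)
        net_a, net_b = update(net_a, net_b, winner)
    return net_a, net_b
```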
Challenge: know everything that the best go bot knows about go

it's not obvious to me that this is a realistic target

Perhaps I should instead have said: it'd be good to explain to people why this might be a useful/realistic target. Because if you need propositions that cover all the instincts, then it seems like you're basically asking for people to revive GOFAI.

(I'm being unusually critical of your post because it seems that a number of safety research agendas lately have become very reliant on highly optimistic expectations about progress on interpretability, so I want to make sure that people are forced to defend that assumption rather than starting an information cascade.)

DanielFilan (8mo): OK, the parenthetical helped me understand where you're coming from. I think a re-write of this post should (in part) make clear that I think a massive heroic effort would be necessary to make this happen, but sometimes massive heroic efforts work, and I have no special private info that makes it seem more plausible than it looks a priori.
Challenge: know everything that the best go bot knows about go

As an additional reason for the importance of tabooing "know", note that I disagree with all three of your claims about what the model "knows" in this comment and its parent.

(The definition of "know" I'm using is something like "knowing X means possessing a mental model which corresponds fairly well to reality, from which X can be fairly easily extracted".)

DanielFilan (8mo): In the parent, is your objection that the trained AlphaZero-like model plausibly knows nothing at all?
DanielFilan (9mo): On that definition, how does one train an AlphaZero-like algorithm without knowing the rules of the game and win condition?
Challenge: know everything that the best go bot knows about go

I think at this point you've pushed the word "know" to a point where it's not very well-defined; I'd encourage you to try to restate the original post while tabooing that word.

This seems particularly valuable because there are some versions of "know" for which the goal of knowing everything a complex model knows seems wildly unmanageable (for example, trying to convert a human athlete's ingrained instincts into a set of propositions). So before people start trying to do what you suggested, it'd be good to explain why it's actually a realistic target.

DanielFilan (9mo): Hmmm. It does seem like I should probably rewrite this post. But to clarify things in the meantime:

* It's not obvious to me that this is a realistic target, and I'd be surprised if it took fewer than 10 person-years to achieve.
* I do think the knowledge should 'cover' all the athlete's ingrained instincts in your example, but I think the propositions are allowed to look like "it's a good idea to do x in case y".
Gradations of Inner Alignment Obstacles

I used to define "agent" as "both a searcher and a controller"

Oh, I really like this definition. Even if it's too restrictive, it seems like it gets at something important.

I'm not sure what you meant by "more compressed".

Sorry, that was quite opaque. I guess what I mean is that evolution is an optimiser but isn't an agent, and in part this has to do with how it's a very distributed process with no clear boundary around it. Whereas when you have the same problem being solved in a single human brain, then that compression makes it easier to point to the huma... (read more)
