All of Matthew Barnett's Comments + Replies

Three reasons to expect long AI timelines

Thanks for the useful comment.

You might say "okay, sure, at some level of scaling GPTs learn enough general reasoning that they can manage a corporation, but there's no reason to believe it's near".

Right. This is essentially the same way we might reply to Claude Shannon if he said that some level of brute-force search would solve the problem of natural language translation.

one of the major points of the bio anchors framework is to give a reasonable answer to the question of "at what level of scaling might this work", so I don't think you can argue that cur

... (read more)
5Rohin Shah18dFwiw, the problem I think is hard is "how to make models do stuff that is actually what we want, rather than only seeming like what we want, or only initially what we want until the model does something completely different like taking over the world". I don't expect that it will be hard to get models that look like they're doing roughly the thing we want; see e.g. the relative ease of prompt engineering or learning from human preferences. If I thought that were hard, I would agree with you. I would guess that this is relatively uncontroversial as a view within this field? Not sure though. (One of my initial critiques of bio anchors was that it didn't take into account the cost of human feedback, except then I actually ran some back-of-the-envelope calculations and it turned out it was dwarfed by the cost of compute; maybe that's your crux too?)
Three reasons to expect long AI timelines

These arguments prove too much; you could apply them to pretty much any technology (e.g. self-driving cars, 3D printing, reusable rockets, smart phones, VR headsets...).

I suppose my argument has an implicit premise: "current forecasts are not taking these arguments into account." If people actually were taking my arguments into account, and still concluding that we should have short timelines, then this would make sense. But I made these arguments because I haven't seen people talk about these considerations much. For example, I deliberately avoided the argument ... (read more)

2Daniel Kokotajlo19dI definitely agree that our timelines forecasts should take into account the three phenomena you mention, and I also agree that e.g. Ajeya's doesn't talk about this much. I disagree that the effect size of these phenomena is enough to get us to 50 years rather than, say, +5 years to whatever our opinion sans these phenomena was. I also disagree that overall Ajeya's model is an underestimate of timelines, because while indeed the phenomena you mention should cause us to shade timelines upward, there is a long list of other phenomena I could mention which should cause us to shade timelines downward, and it's unclear which list is overall more powerful. On a separate note, would you be interested in a call sometime to discuss timelines? I'd love to share my overall argument with you and hear your thoughts, and I'd love to hear your overall timelines model if you have one.
Against GDP as a metric for timelines and takeoff speeds

In addition to the reasons you mentioned, there's also empirical evidence that technological revolutions generally precede the productivity growth that they eventually cause. In fact, economic growth may even slow down as people pay costs to adopt new technologies. Philippe Aghion and Peter Howitt summarize the state of the research in chapter 9 of The Economics of Growth:

Although each [General Purpose Technology (GPT)] raises output and productivity in the long run, it can also cause cyclical fluctuations while the economy adjusts to it. As David (1990) a

... (read more)
2Daniel Kokotajlo4moWow, yeah, that's an excellent point. EDIT: See e.g. this paper: https://www.nber.org/papers/w24001 [https://www.nber.org/papers/w24001]
Forecasting Thread: AI Timelines

If AGI is taken to mean the first year in which there is radical economic, technological, or scientific progress, then these are my AGI timelines.

My percentiles

  • 5th: 2029-09-09
  • 25th: 2049-01-17
  • 50th: 2079-01-24
  • 75th: above 2100-01-01
  • 95th: above 2100-01-01

I have a somewhat lower probability for near-term AGI than many people here do. I model my biggest disagreement as being about how much work is required to move from high-cost impressive demos to real economic performance. I also have an intuition that it is really hard to automate everything and progress will be bottlene... (read more)
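For convenience, here is a minimal sketch that turns the percentiles above into an approximate cumulative distribution by piecewise-linear interpolation. The decimal-year conversion and the interpolation scheme are assumptions for illustration, not part of the original forecast:

```python
import numpy as np

# Percentiles from the forecast above, with dates converted to decimal years.
# The "above 2100" entries are omitted; they only bound the tail of the distribution.
percentiles = np.array([0.05, 0.25, 0.50])
years = np.array([2029.69, 2049.05, 2079.07])  # 2029-09-09, 2049-01-17, 2079-01-24

def prob_agi_by(year):
    """Piecewise-linear interpolation of P(AGI by `year`) between the stated
    percentiles; outside the stated range we only know bounds, so clip."""
    return float(np.interp(year, years, percentiles, left=0.0, right=0.50))

print(prob_agi_by(2040))  # roughly 0.16 under this crude interpolation
```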

Forecasting Thread: AI Timelines

It's unclear to me what "human-level AGI" is, and it's also unclear to me why the prediction is about the moment an AGI is turned on somewhere. From my perspective, the important thing about artificial intelligence is that it will accelerate technological, economic, and scientific progress. So, the more important thing to predict is something like, "When will real economic growth rates reach at least 30% worldwide?"

It's worth comparing the vagueness in this question with the specificity in this one on Metaculus. From the ... (read more)
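A quick back-of-the-envelope check (not from the original comment) of what that 30% threshold implies:

$$t_{\text{double}} = \frac{\ln 2}{\ln 1.30} \approx \frac{0.693}{0.262} \approx 2.6 \text{ years}$$

That is, the world economy would be doubling roughly every two and a half years, versus a doubling time of a bit over 23 years at recent growth rates.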

2jungofthewon8moI generally agree with this but think the alternative goal of "make forecasting easier" is just as good, might actually make aggregate forecasts more accurate in the long run, and may require things that seemingly undermine the virtue of precision. More concretely, if an underdefined question makes it easier for people to share whatever beliefs they already have, then facilitates rich conversation among those people, that's better than if a highly specific question prevents people from making a prediction at all. At least as much, if not more, of the value of making public, visual predictions like this comes from the ensuing conversation and feedback than from the precision of the forecasts themselves. Additionally, a lot of assumptions get made at the time the question is defined more precisely, which could prematurely limit the space of conversation or ideas. There are good reasons why different people define AGI the way they do, or the moment of "AGI arrival" the way they do, that might not come up if the question askers had taken a point of view.
What specific dangers arise when asking GPT-N to write an Alignment Forum post?
To me the most obvious risk (which I don't ATM think of as very likely for the next few iterations, or possibly ever, since the training is myopic/SL) would be that GPT-N in fact is computing (e.g. among other things) a superintelligent mesa-optimization process that understands the situation it is in and is agent-y.

Do you have any idea of what the mesa objective might be? I agree that this is a worrisome risk, but I was more interested in the type of answer that specified, "Here's a plausible mesa objective given the incentives." Mesa optimization is a more general risk that isn't specific to the narrow training scheme used by GPT-N.

1David Krueger9moNo, and I don't think it really matters too much... what's more important is the "architecture" of the "mesa-optimizer". It's doing something that looks like search/planning/optimization/RL. Roughly speaking, the simplest form of this model of how things works says: "Its so hard to solve NLP without doing agent-y stuff that when we see GPT-N produce a solution to NLP, we should assume that it's doing agenty stuff on the inside... i.e. what probably happened is it evolved or stumbled upon something agenty, and then that agenty thing realized the situation it was in and started plotting a treacherous turn".
Modelling Continuous Progress
Second, the major disagreement is between those who think progress will be discontinuous and sudden (such as Eliezer Yudkowsky, MIRI) and those who think progress will be very fast by normal historical standards but continuous (Paul Christiano, Robin Hanson).

I'm not actually convinced this is a fair summary of the disagreement. As I explained in my post about different AI takeoffs, I had the impression that the primary disagreement between the two groups was over locality rather than the amount of time takeoff lasts. Though of course, I may be misinterpreting people.

4Sammy Martin1yAfter reading your summary of the difference [https://www.lesswrong.com/posts/YgNYA6pj2hPSDQiTE/distinguishing-definitions-of-takeoff#Paul_slow_takeoff] (maybe just a difference in emphasis) between 'Paul slow' vs 'continuous' takeoff, I did some further simulations. A low setting of d (highly continuous progress) doesn't give you a paul slow condition on its own, but it is relatively easy to replicate a situation like this: What we want is a scenario where you don't get intermediate doubling intervals at all in the discontinuous case, but you get at least one in the continuous case. Setting s relatively high appears to do the trick. Here is a scenario [https://i.imgur.com/RSjIKQH.png]where we have very fast post-RSI growth with s=5,c=1,I0=1 and I_AGI=3. I wrote some more code to produce plots of how long each complete interval of doubling took [https://i.imgur.com/mw27P7H.png]in each scenario. The 'default' rate with no contribution from RSI was 0.7. All the continuous scenarios had two complete doubling intervals over intermediate time frames before the doubling time collapsed to under 0.05 on the third doubling. The discontinuous model simply kept the original doubling interval until it collapsed to under 0.05 on the third doubling interval. It's all in this graph. [https://i.imgur.com/mw27P7H.png] Let's make the irresponsible assumption that this actually applies to the real economy, with the current growth mode, non-RSI condition being given by the 'slow/no takeoff', s=0 condition. The current doubling time is a bit over 23 years [https://openborders.info/double-world-gdp/]. In the shallow continuous progress scenario (red line), we get a 9 year doubling, a 4 year doubling and then a ~1 year doubling. In the discontinuous scenario (purple line) we get 2 23 year doublings and then a ~1 year doubling out of nowhere. In other words, this fairly random setting of the parameters (this was the second set I tried) gives us a Paul slow takeoff if you make the assum
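For readers who want to reproduce the kind of doubling-interval plot described above, here is a minimal sketch (a reconstruction, not Sammy's code; it assumes a monotonically increasing trajectory sampled on a time grid):

```python
import numpy as np

def doubling_intervals(t, y):
    """Given a trajectory y(t), return the lengths of successive complete
    doubling intervals (time for y to go from y0 -> 2*y0 -> 4*y0 -> ...)."""
    y0 = y[0]
    thresholds = y0 * 2.0 ** np.arange(1, int(np.log2(y[-1] / y0)) + 1)
    crossing_times = [np.interp(th, y, t) for th in thresholds]  # y must be monotonic
    times = np.concatenate(([t[0]], crossing_times))
    return np.diff(times)

# Example: exponential growth with a 23-year doubling time (roughly world GDP today).
t = np.linspace(0, 100, 10_000)
y = np.exp(np.log(2) / 23 * t)
print(doubling_intervals(t, y))  # ~[23, 23, 23, 23]
```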

They do disagree about locality, yes, but as far as I can tell that is downstream of the assumption that there won't be a very abrupt switch to a new growth mode. A single project pulling suddenly ahead of the rest of the world would happen if the growth curve is such that with a realistic amount (a few months) of lead time you can get ahead of everyone else.

So the obvious difference in predictions is that e.g. Paul/Robin think that takeoff will occur across many systems in the world while MIRI thinks it will occur in a single system. That is because ... (read more)

Possible takeaways from the coronavirus pandemic for slow AI takeoff

I tend to think that the pandemic shares more properties with fast takeoff than it does with slow takeoff. Under fast takeoff, a very powerful system will spring into existence after a long period of AI being otherwise irrelevant, in a similar way to how the virus was dormant until early this year. The defining feature of slow takeoff, by contrast, is a gradual increase in abilities from AI systems all across the world.

In particular, I object to this portion of your post,

The "moving goalposts" effect, where new advances in AI are dismissed as not
... (read more)
2Vika1yThanks Matthew for your interesting points! I agree that it's not clear whether the pandemic is a good analogy for slow takeoff. When I was drafting the post, I started with an analogy with "medium" takeoff (on the time scale of months), but later updated towards the slow takeoff scenario being a better match. The pandemic response in 2020 (since covid became apparent as a threat) is most relevant for the medium takeoff analogy, while the general level of readiness for a coronavirus pandemic prior to 2020 is most relevant for the slow takeoff analogy. I agree with Ben's response [https://www.lesswrong.com/posts/wTKjRFeSjKLDSWyww/possible-takeaways-from-the-coronavirus-pandemic-for-slow-ai?commentId=YcRJtpNHf2ewLCtt8] to your comment. Covid did not spring into existence in a world where pandemics are irrelevant, since there have been many recent epidemics and experts have been sounding the alarm about the next one. You make a good point that epidemics don't gradually increase in severity, though I think they have been increasing in frequency and global reach as a result of international travel, and the possibility of a virus escaping from a lab also increases the chances of encountering more powerful pathogens in the future. Overall, I agree that we can probably expect AI systems to increase in competence more gradually in a slow takeoff scenario, which is a reason for optimism. Your objections to the parallel with covid not being taken seriously seem reasonable to me, and I'm not very confident in this analogy overall. However, one could argue that the experience with previous epidemics should have resulted in a stronger prior on pandemics being a serious threat. I think it was clear from the outset of the covid epidemic that it's much more contagious than seasonal flu, which should have produced an update towards it being a serious threat as well. I agree that the direct economic effects of advanced AI would be obvious to observers, but I don't think this wo
4Ben Pace1ySome good points, but on the contrary: a slow take-off is considered safer because we have more lead time and warning shots, but the world has seen many similar events and warning shots for covid. Ones that come to mind in the last two decades are swine flu, bird flu, and Ebola, and of course there have been many more over history. This just isn’t that novel or surprising, billionaires like Bill Gates have been sounding the alarm, and still the supermajority of Western countries failed to take basic preventative measures. Those properties seem similar to even the slow take-off scenario. I feel like the fast-takeoff analogy would go through most strongly in a world where we'd just never seen this sort of pandemic before, but in reality we've seen many of them.
An Analytic Perspective on AI Alignment
weaker claim?

Oops, yes. That's the weaker claim, which I agree with. The stronger claim is that because we can't understand something "all at once", mechanistic transparency is too hard and so we shouldn't take Daniel's approach. But the way we understand laptops is also in a mechanistic sense. No one argues that because laptops are too hard to understand all at once, we shouldn't try to understand them mechanistically.

This seems to be assuming that we have to be able to take any complex trained AGI-as-a-neural-net
... (read more)
2Rohin Shah1yOkay, I think I see the miscommunication. The story you have is "the developers build a few small neural net modules that do one thing, mechanistically understand those modules, then use those modules to build newer modules that do 'bigger' things, and mechanistically understand those, and keep iterating this until they have an AGI". Does that sound right to you? If so, I agree that by following such a process the developer team could get mechanistic transparency into the neural net the same way that laptop-making companies have mechanistic transparency into laptops. The story I took away from this post is "we do end-to-end training with regularization for modularity, and then we get out a neural net with modular structure. We then need to understand this neural net mechanistically to ensure it isn't dangerous". This seems much more analogous to needing to mechanistically understand a laptop that "fell out of the sky one day" before we had ever made a laptop. My critiques are primarily about the second story. My critique of the first story would be that it seems like you're sacrificing a lot of competitiveness by having to develop the modules one at a time, instead of using end-to-end training.
An Analytic Perspective on AI Alignment
I'd be shocked if there was anyone to whom it was mechanistically transparent how a laptop loads a website, down to the gates in the laptop.

Could you clarify why this is an important counterpoint? It seems obviously useful to understand mechanistic details of a laptop in order to debug it. You seem to be arguing the [ETA: weaker] claim that nobody understands an entire laptop "all at once", as in, able to hold all the details in their head simultaneously. But such an understanding is almost never possible for any complex system, a... (read more)

2Rohin Shah1yweaker claim? This seems to be assuming that we have to be able to take any complex trained AGI-as-a-neural-net and determine whether or not it is dangerous. Under that assumption, I agree that the problem is itself very hard, and mechanistic transparency is not uniquely bad relative to other possibilities. But my point is that because it is so hard to detect whether an arbitrary neural net is dangerous, you should be trying to solve a different problem. This only depends on the claim that mechanistic transparency is hard in an absolute sense, not a relative sense (given the problem it is trying to solve). Relatedly, from Evan Hubinger [https://www.lesswrong.com/posts/J9D6Bi3eFDDhCaovi/will-transparency-help-catch-deception-perhaps-not?commentId=yn5YcLnL6vs6AxxAE] : All of the other stories for preventing catastrophe that I mentioned in the grandparent are tackling a hopefully easier problem than "detect whether an arbitrary neural net is dangerous".
Cortés, Pizarro, and Afonso as Precedents for Takeover

For my part, I think you summarized my position fairly well. However, after thinking about this argument for another few days, I have more points to add.

  • Disease seems especially likely to cause coordination failures since it's an internal threat rather than an external threat (and external threats, unlike internal ones, tend to unite empires). We can compare the effects of the smallpox epidemic in the Aztec and Inca empires with other historical diseases during wartime, such as the Plague of Athens, which arguably is what caused Athens to lose the Peloponnesia
... (read more)
2Daniel Kokotajlo1yI accept that these points are evidence in your favor. Here are some more of my own: --Smallpox didn't hit the Aztecs until Cortes had already killed the Emperor and allied with the Tlaxcalans, if I'm reading these summaries correctly. (I really should go read the actual books...) So it seems that Cortes did get really far on the path towards victory without the help of disease. More importantly, there doesn't seem to be any important difference in how people treated Cortes before or after the disease. They took him very seriously, underestimated him, put too much trust in him, allied with him, etc. before the disease was a factor. --When Pizarro arrived in Inca lands, the disease had already swept through, if I'm reading these stories right. So the period of most chaos and uncertainty was over; people were rebuilding and re-organizing. --Also, it wasn't actually a 90% reduction in population. It was more like a 50% reduction at the time, if I am remembering right. (Later epidemics would cause further damage, so collectively they were worse than any other plague in history.) This is comparable to e.g. the Black Death in Europe, no? But the Black Death didn't result in the collapse of most civilizations who went through it, nor did it result in random small groups of adventurers taking over governments, I predict. (I haven't actually read up on the history of it)
Cortés, Pizarro, and Afonso as Precedents for Takeover
Later, other Europeans would come along with other advantages, and they would conquer India, Persia, Vietnam, etc., evidence that while disease was a contributing factor (I certainly am not denying it helped!) it wasn't so important a factor as to render my conclusion invalid (my conclusion, again, is that a moderate technological and strategic advantage can enable a small group to take over a large region.)

Europeans conquered places such as India, but that was centuries later, after they had a large technological advantage, and they also didn't ... (read more)

1Daniel Kokotajlo1yThe vast armadas were the result of successful colonization, not the cause of it. For example, a key battle that the British EIC won (enabling them to take over their first major territory) was the battle of Plassey, and they were significantly outnumbered during it. Fair point about the large technological advantage, but... actually it still wasn't that large? I don't know, I'd have to look into it more, but my guess is that the tech advantage of the EIC over the Nawab at Plassey, to use the same example, was smaller than the tech advantage of Cortes and Pizarro over the Americans. I should go find out how many men the EIC had when it conquered India. I'm betting that the answer is "Far fewer than India had." And also, yeah, didn't the British steal rocket technology from India? (Mysore, I think?) That's one military important technology that they were actually behind in.
Cortés, Pizarro, and Afonso as Precedents for Takeover
I really don't think the disease thing is important enough to undermine my conclusion. For the two reasons I gave: One, Afonso didn't benefit from disease

This makes sense, but I think the case of Afonso is sufficiently different from the others that it's a bit of a stretch to use it to imply much about AI takeovers. I think if you want to make a more general point about how AI can be militarily successful, then a better point of evidence is a broad survey of historical military campaigns. Of course, it's still a historically interesting... (read more)

2Daniel Kokotajlo1yAgain, I certainly agree that it would be good to think about things that could cause disarray as well. Like you said, maybe an AI could easily arrange for there to be a convenient pandemic at about the time it makes its move... And yeah, in light of your pushback I'm thinking of moderating my thesis to add the "disarray background condition" caveat. (I already edited the OP)This does weaken the claim, but not much, I think, because the sort of disarray needed is relatively common, I think. For purposes of Cortes and Pizarro takeover, what mattered was that they were able to find local factions willing to ally with them to overthrow the main power structures. The population count wasn't super relevant because, disease or no, it was several orders of magnitude more than Cortez & Pizarro had. And while it's true that without the disease they may have had a harder time finding local factions willing to ally with them, it's not obviously true, and moreover there are plenty of ordinary circumstances (ordinary civil wars, ordinary periods of unrest and rebellion, ordinary wars between great powers) that lead to the same result: Local factions being willing to ally with an outsider to overthrow the main power structure. This conversation has definitely made me less confident in my conclusion. I now think it would be worth it for me (or someone) to go do a bunch of history reading, to evaluate these debates with more information.
Cortés, Pizarro, and Afonso as Precedents for Takeover
I agree that it would be good to think about how AI might create devastating pandemics. I suspect it wouldn't be that hard to do, for an AI that is generally smarter than us. However, I think my original point still stands.

It's worth clarifying exactly which "original point" still stands, because I'm currently unsure.

I don't get why you think a small technologically primitive tribe could take over the world if they were immune to disease. Seems very implausible to me.

Sorry, I meant to say, "Were immune to diseases that were curre... (read more)

1Daniel Kokotajlo1yMy original point was that sometimes, a small group can reliably take over a large region despite being vastly outnumbered and outgunned, having only slightly better tech and cunning, knowing very little about the region to be conquered, and being disunited. This is in the context of arguments about how much of a lead in AI tech one needs to have to take over the world, and how big of an entity one needs to be to do it (e.g. can a rogue AI do it? What about a corporation? A nation-state?) Even with your point about disease, it still seems I'm right about this, for reasons I've mentioned (the 90% argument) I really don't think the disease thing is important enough to undermine my conclusion. For the two reasons I gave: One, Afonso didn't benefit from disease, and two, the 90% argument: Suppose there was no disease but instead the Aztecs and Incas were 90% smaller in population and also in the middle of civil war. Same result would have happened, and it still would have proved my point. I don't think a group of Incans in Spain could have taken it over if 90% of the Spaniards were dying of disease. I think they wouldn't have had the technology or experience necessary to succeed.
Cortés, Pizarro, and Afonso as Precedents for Takeover

Here's what I'll be putting in the Alignment Newsletter about this piece. Let me know if you spot inaccuracies or lingering disagreement regarding the opinion section.

Summary:

This post lists three historical examples of how small human groups conquered large parts of the world, and shows how they are arguably precedents for AI takeover scenarios. The first two historical examples are the conquests of American civilizations by Hernán Cortés and Francisco Pizarro in the early 16th century. The third example is the Portuguese capture of key
... (read more)
1Daniel Kokotajlo1yThanks! Well, I still disagree with your opinion on it, for reasons mentioned above. To the point about "only" conquering ports, well, I think my explanations fit fine with that too -- the technological and experience advantages that (I claim) enabled Afonso to win were primarily naval in nature. Later, other Europeans would come along with other advantages, and they would conquer India, Persia, Vietnam, etc., evidence that while disease was a contributing factor (I certainly am not denying it helped!) it wasn't so important a factor as to render my conclusion invalid (my conclusion, again, is that a moderate technological and strategic advantage can enable a small group to take over a large region.)
Cortés, Pizarro, and Afonso as Precedents for Takeover

[ETA: Another way of framing my disagreement is that if you are trying to argue that small groups can take over the world, it seems almost completely irrelevant to focus on relative strategic or technological advantages in light of these historical examples. For instance, it could theoretically have been the case that some small, technologically primitive tribe took over the world because they had some sort of immunity to disease. This would seem to imply that the relative strategic advantage of Europeans vs. Americans was not that important. Instead we should focus on what... (read more)

2Daniel Kokotajlo1yI agree that it would be good to think about how AI might create devastating pandemics. I suspect it wouldn't be that hard to do, for an AI that is generally smarter than us. However, I think my original point still stands. I don't get why you think a small technologically primitive tribe could take over the world if they were immune to disease. Seems very implausible to me. What difference does it make whether he conquered civilizations or ports? He did a lot of conquering despite being vastly outnumbered. This shows that "on paper" stats like army size are not super useful for determining who is likely to win a fight, at least when one side has a tech+strategic advantage. (Also, Malacca at least was a civilization in its own right; it was a city-state with a much bigger population and military than Afonso had.) I agree that successful military campaigns are common in history. I think sometimes they can be attributed to luck, or else to genius. I chose these three case studies because they are so close to each other in time and space that they didn't seem like they could be luck or genius. I admit, however, that as lucy.ea8 said in their comment, perhaps cortes+pizarro won due to disease and then we can say Afonso was lucky or genius without stretching credibility. But I don't want to do this yet, because it seems to me that even with disease factored in, "most" of the "credit" for Cortes and Pizarro's success goes to the factors I mentioned. After all, suppose the disease reduced the on-paper strength of the Americans by 90%. They were still several orders of magnitude stronger than Cortes and Pizarro. So it's still surprising that Cortes/Pizarro won... until we factor in the technological and strategic advantages I mentioned. But the civilizations wouldn't have been destroyed without the Spaniards. (I might be wrong about this, but... hadn't the disease mostly swept through Inca territory by the time Pizarro arrived? So clearly their civilization had surviv
Cortés, Pizarro, and Afonso as Precedents for Takeover

Very interesting post! However, I have a big disagreement with your interpretation of why the European conquerors succeeded in America, and I think that it undermines much of your conclusion.

In your section titled "What explains these devastating takeovers?" you cite technology and strategic ability, but Old World diseases destroyed the communities in America before the European invaders arrived, most notably smallpox, but also measles, influenza, typhus and the bubonic plague. My reading of historians (from Charles Mann's book 1493, to Alfr... (read more)

2Daniel Kokotajlo1yThis is a good critique; thank you. I have two responses, and then a few nitpicks. First response: Disease wasn't a part of Afonso's success. It helped the Europeans take over the Americas but did not help them take over Africa or Asia or the middle east; this suggests to me that it may have been a contributing factor but was not the primary explanation / was not strictly necessary. Second response: Even if we decide that Cortes and Pizarro wouldn't have been able to succeed without the disease, my overall conclusion still stands. This is because disease isn't directly the cause of conquistador success, but rather indirectly, via the intermediate steps of "Chaos and disruption" and "Reduced political and economic strength." (I claim.) And the reduced strength can't have been more than, say, a 90% reduction in strength. (I claim) Suppose we think of the original conclusion as "A force with a small tech and cunning/experience advantage can take over a region 10,000 times its size." Then the modified conclusion in light of your claim about disease would be "In times of chaos and disruption, a force with a small tech and cunning/experience advantage can take over a region 1,000 times its size." This modified conclusion is, as far as I'm concerned, still almost as powerful and interesting as the original conclusion. Because "times of chaos and disruption" are pretty easy to come by. For example, it's true that the disease may have sparked the Incan civil war -- but civil wars happen pretty often anyway, historically. And when civil wars aren't happening, ordinary wars often are. Overall, in light of your critique (and also similar things other people have said) I am going to update my original post to include a possible third factor, "Chaos & disruption / disease." I also look forward to hearing what you have to say in response to my responses. Nitpick: The war was Cortez + allies vs. Tenochtitlan + allies. The vast majority of people on both sides were American
Coherence arguments do not imply goal-directed behavior

See also Alex Turner's work on formalizing instrumentally convergent goals, and his walkthrough of the MIRI paper.

1Issa Rice1yCan you say more about Alex Turner's formalism? For example, are there conditions in his paper or post similar to the conditions I named for Theorem 2 above? If so, what do they say and where can I find them in the paper or post? If not, how does the paper avoid the twitching robot from seeking convergent instrumental goals?
An Analytic Perspective on AI Alignment
That's not what I said.

That's fair. I didn't actually quite understand what your position was and was trying to clarify.

An Analytic Perspective on AI Alignment
I think it's plausible that there will be a simple basin that we can regularise an AGI into, because I have some ideas about how to do it, and because the world hasn't thought very hard about the problem yet (meaning the lack of extant solutions is to some extent explained away).

That makes sense. More pessimistically, one could imagine that the reason no one has thought very hard about it is that, in practice, having a mechanistic understanding of a neural network doesn't really help you that much to do useful work. Though... (read more)

2DanielFilan1yFWIW I take this work on 'circuits' in an image recognition CNN [https://distill.pub/2020/circuits/zoom-in/] to be a bullish indicator for the possibility of mechanistic transparency.
1DanielFilan1yI think I just think the 'market' here is 'inefficient'? Like, I think this just isn't a thing that people have really thought of, and those that have have gained semi-useful insight into neural networks by doing similar things (e.g. figuring out that adding a picture of a baseball to a whale fin will cause a network to misclassify the image as a great white shark [https://distill.pub/2019/activation-atlas/]). It also seems to me that recognition tasks (as opposed to planning/reasoning tasks) are going to be the hardest to get this kind of mechanistic transparency for, and also the kinds of tasks where transparency is easiest and ML systems are best. I think I understand what you mean here, but also think that there can be tricks that reduce computational cost that have some sort of mathematical backbone - it seems to me that this is common in the study of algorithms. Note also that we don't have to understand all possible real-world intelligent machines, just the ones that we build, making the requirement less stringent.
2DanielFilan1yI'll just respond to the easy part of this for now. That's not what I said. Because it takes ages to scroll down to comments and I'm on my phone, I can't easily link to the relevant comments, but basically I said that rationality is probably as formalisable as electromagnetism, but that theories as precise as that of liberalism can still be reasoned about and built on.
An Analytic Perspective on AI Alignment

I greatly appreciate you writing your thoughts up. I have a few questions about your agenda/optimism regarding particular approaches.

The type of transparency that I’m most excited about is mechanistic, in a sense that I’ve described elsewhere.

Let me know if you'd agree with the following. The mechanistic approach is about understanding the internal structure of a program and how it behaves on arbitrary inputs. Mechanistic transparency is quite different from the more typical meaning of interpretability where we would like to know why an AI d... (read more)

2DanielFilan1yI agree with your sentence about the mechanistic approach. I think the word "interpretable" has very little specific meaning, but most work is about particular inputs. I agree that your examples divide up into what I would consider mechanistically transparent vs not, depending on exactly how large the decision tree, but I can't speak to whether they all count as "interpretable". I think it's plausible that there will be a simple basin that we can regularise an AGI into, because I have some ideas about how to do it, and because the world hasn't thought very hard about the problem yet (meaning the lack of extant solutions is to some extent explained away). I also think that there exists a relatively simple mathematical backbone to intelligence to be found (but not that all intelligent systems have this backbone), because I think promising progress has been made in mathematising a bunch of relevant concepts (see probability theory, utility theory, AIXI, reflective oracles). But this might be a bias from 'growing up' academically in Marcus Hutter's lab. You haven't deployed a system, don't know the kinds of situations it might encounter, and want reason to believe that it will perform well (e.g. by not trying to kill everyone) in these situations that you can't simulate. That being said, I have the feeling that this answer isn't satisfactorily detailed, so maybe you want more detail, or are thinking of a critique I haven't thought of? In this situation, the first answer is more likely to reveal some specific high-level mistakes the player might make, and provides affordance for a chess player to give advice for how to improve. The second answer seems like it's more amenable to mathematical analysis, generalises better across boards, less likely to be confabulated, and provides a better handle for how to directly improve the algorithm (basically, read forward more than one move). So I guess the first answer better reveals chess mistakes, and the second better reveals
[AN #80]: Why AI risk might be solved without additional intervention from longtermists
see above about trying to conform with the way terms are used, rather than defining terms and trying to drag everyone else along.

This seems odd given your objection to "soft/slow" takeoff usage and your advocacy of "continuous takeoff" ;)

2Rohin Shah1yI don't think "soft/slow takeoff" has a canonical meaning -- some people (e.g. Paul) interpret it as not having discontinuities, while others interpret it as capabilities increasing slowly past human intelligence over (say) centuries (e.g. Superintelligence). If I say "slow takeoff" I don't know which one the listener is going to hear it as. (And if I had to guess, I'd expect they think about the centuries-long version, which is usually not the one I mean.) In contrast, I think "AI risk" has a much more canonical meaning, in that if I say "AI risk" I expect most listeners to interpret it as accidental risk caused by the AI system optimizing for goals that are not our own. (Perhaps an important point is that I'm trying to communicate to a much wider audience than the people who read all the Alignment Forum posts and comments. I'd feel more okay about "slow takeoff" if I was just speaking to people who have read many of the posts already arguing about takeoff speeds.)
[AN #80]: Why AI risk might be solved without additional intervention from longtermists
Does this make sense to you?

Yeah, that makes sense. Your points about "bio" not being short for "biological" were valid, but the fact that, as a listener, I didn't know that suggests it's really easy to mess up the language usage here. I'm starting to think that the real fight should be against using terms that aren't self-explanatory.

Have you actually observed it being used in ways that you fear (and which would be prevented if we were to redefine it more narrowly)?

I'm not sure about whether it would have be... (read more)

[AN #80]: Why AI risk might be solved without additional intervention from longtermists

I agree that this is troubling, though I think it's similar to how I wouldn't want the term biorisk to be expanded to include biodiversity loss (a risk, but not the right type), regular human terrorism (humans are biological, but it's a totally different issue), zombie uprisings (they are biological, but it's totally ridiculous), alien invasions etc.

Not to say that's what you are doing with AI risk. I'm worried about what others will do with it if the term gets expanded.

2Wei Dai1yWell as I said, natural language doesn't have to be perfectly logical, and I think "biorisk" is in somewhat in that category but there's an explanation that makes it a bit reasonable than it might first appear, which is that the "bio" refers not to "biological" but to "bioweapon". This is actually one of the definitions that Google gives [https://www.google.com/search?q=bio-] when you search for "bio": "relating to or involving the use of toxic biological or biochemical substances as weapons of war. 'bioterrorism'" I guess the analogous thing would be if we start using "AI" to mean "technical AI accidents" in a bunch of phrases, which feels worse to me than the "bio" case, maybe because "AI" is a standalone word/acronym instead of a prefix? Does this make sense to you? But the term was expanded from the beginning. Have you actually observed it being used in ways that you fear (and which would be prevented if we were to redefine it more narrowly)?
[AN #80]: Why AI risk might be solved without additional intervention from longtermists

I appreciate the arguments, and I think you've mostly convinced me, mostly because of the historical argument.

I do still have some remaining apprehension about using AI risk to describe every type of risk arising from AI.

I want to include philosophical failures, as long as the consequences of the failures flow through AI, because (aside from historical usage) technical problems and philosophical problems blend into each other, and I don't see a point in drawing an arbitrary and potentially contentious border between them.

That is true. The way I... (read more)

[AN #80]: Why AI risk might be solved without additional intervention from longtermists

AI risk is just a shorthand for "accidental technical AI risk." To the extent that people are confused, I agree it's probably worth clarifying the type of risk by adding "accidental" and "technical" whenever we can.

However, I disagree with the idea that we should expand the term "AI risk" to include philosophical failures and intentional risks. If you open the term up, these outcomes might start to happen:

  • It becomes unclear in conversation what people mean when they say AI risk
  • Like The Singularity, it becomes a buzzword.
  • Jo
... (read more)
2Wei Dai1yAlso, isn't defining "AI risk" as "technical accidental AI risk" analogous to defining "apple" as "red apple" (in terms of being circular/illogical)? I realize natural language doesn't have to be perfectly logical, but this still seems a bit too egregious.
5Wei Dai1yI don't think "AI risk" was originally meant to be a shorthand for "accidental technical AI risk". The earliest considered (i.e., not off-hand) usage I can find is in the title of Luke Muehlhauser's AI Risk and Opportunity: A Strategic Analysis [https://www.lesswrong.com/posts/i2XoqtYEykc4XWp9B/ai-risk-and-opportunity-a-strategic-analysis] where he defined it as "the risk of AI-caused extinction". (He used "extinction" but nowadays we tend think in terms of "existential risk" which also includes "permanent large negative consequences", which seems like an reasonable expansion of "AI risk".) I want to include philosophical failures, as long as the consequences of the failures flow through AI, because (aside from historical usage) technical problems and philosophical problems blend into each other, and I don't see a point in drawing an arbitrary and potentially contentious border between them. (Is UDT a technical advance or a philosophical advance? Is defining the right utility function for a Sovereign Singleton a technical problem or a philosophical problem? Why force ourselves to answer these questions?) As for "intentional risks" it's already common practice [https://www.lawfareblog.com/thinking-about-risks-ai-accidents-misuse-and-structure] to include that in "AI risk": Besides that, I think there's also a large grey area between "accident risk" and "misuse" where the risk partly comes from technical/philosophical problems and partly from human nature. For example humans might be easily persuaded by wrong but psychologically convincing moral/philosophical arguments that AIs can come up with and then order their AIs to do terrible things. Even pure intentional risks might have technical solutions. Again I don't really see the point of trying to figure out which of these problems should be excluded from "AI risk". It seems perfectly fine to me to use that as shorthand for "AI-caused x-risk" and use more specific terms when we mean more specific risks. What d
Soft takeoff can still lead to decisive strategic advantage
The concern with AI is that an initially tiny entity might take over the world.

This is a concern with AI, but why is it the concern? If, e.g., the United States could take over the world because it had some AI-enabled growth, why would that not be a big deal? I'm imagining you saying, "It's not unique to AI," but why does it need to be unique? If AI is the root cause of something on the order of Britain colonizing the world in the 19th century, this still seems like it could be concerning if there weren't any good governing principles established beforehand.

I removed this post because you convinced me it was sufficiently ill-composed. I still disagree strongly, because I don't really understand how you would agree with the person in the analogy. And again, CNNs still seem pretty good at representing data to me, and it's still unclear why model distillation disproves this.

some sudden theoretical breakthrough (e.g. on fast matrix multiplication)

These sorts of ideas seem possible, and I'm not willing to discard them as improbable just yet. I think a way to imagine my argument is that I'm saying, "Hold on, why are we assuming that this is the default scenario? I think we should be skeptical by default." And so in general counterarguments of the form, "But it could be wrong because of this" aren't great, because something being possible does not imply that it's likely.

I don't think it's impossible. I have wide uncertainty about timelines, and I certainly think that parts of our systems can get much more efficient. I should have made this clearer in the post. What I'm trying to say is that I am skeptical of a catch-all general efficiency gain, coming from a core insight into rationality, that suddenly makes systems much more efficient.

1johnswentworth1yImagine a search algorithm that finds local minima, similar to gradient descent, but has faster big-O performance than gradient descent. (For instance, an efficient roughly-n^2 matrix multiplication algorithm would likely yield such a thing, by making true Newton steps tractable on large systems - assuming it played well with sparsity.) That would be a general efficiency gain, and would likely stem from some sudden theoretical breakthrough (e.g. on fast matrix multiplication). And it is exactly the sort of thing which tends to come from a single person/team - the gradual theoretical progress we've seen on matrix multiplication is not the kind of breakthrough which makes the whole thing practical; people generally think we're missing some key idea which will make the problem tractable.
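To make the cost comparison concrete, here is a small sketch (not johnswentworth's, just an illustration) contrasting a single gradient step with a single Newton step on a toy quadratic. The point is that the Newton step is dominated by a dense linear solve, which is exactly the cost that a near-n^2 matrix multiplication algorithm would cut:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Toy strongly convex quadratic: f(x) = 0.5 * x.T @ A @ x - b @ x, gradient A @ x - b.
A = rng.standard_normal((n, n))
A = A @ A.T + n * np.eye(n)   # symmetric positive definite
b = rng.standard_normal(n)
x = np.zeros(n)
grad = A @ x - b

# Gradient step: one matrix-vector product, O(n^2) per step, but many steps are needed.
x_gd = x - 1e-3 * grad

# Newton step: solve A @ d = grad, O(n^3) with standard dense linear algebra.
# Faster matrix multiplication would lower this exponent, which is the breakthrough
# scenario described in the comment above.
d = np.linalg.solve(A, grad)
x_newton = x - d   # lands exactly on the minimizer for a quadratic
```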
My reaction would be "sure, that sounds like exactly the sort of thing that happens from time to time".

Insights trickle in slowly. Over the long run, you can see vast efficiency improvements. But this seems unrealistically fast. You would really believe that a single person or team did something like that (something which, if true, would completely and radically reshape the field of computer vision) because "it happens from time to time"?

In fact, if you replace the word "memory" with either "data" or "compute", then t
... (read more)

My understanding was that distilling CNNs worked more-or-less by removing redundant weights, rather than by discovering a more efficient form of representing the data. Distilled CNNs are still CNNs and thus the argument follows.

My point was that you couldn't do better than just memorizing the features that make up a cat. I should clarify that I do think that deep neural networks often have a lot of wasted information (though I believe removing some of it incurs a cost in robustness). The question is whether future insights will allow us to do much better than what we currently do.

1gwern1yNo. That might describe sparsification, but it doesn't describe distillation, and in either case, it's shameless goalpost moving - by handwaving away all the counterexamples, you're simply no-true-Scotsmanning progress. 'Oh, Transformers? They aren't real performance improvements because they just learn "good representations of the data". Oh, model sparsification and compression and distillation? They aren't real compression because they're just getting rid of "wasted information".'
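For readers unfamiliar with the distinction being drawn here: distillation (in the standard Hinton et al. formulation) trains a separate, usually smaller, student network to match a teacher's temperature-softened output distribution; no weights are deleted from the teacher, which is what separates it from sparsification/pruning. A minimal sketch of the loss, written from memory of the standard formulation rather than taken from any of the linked work:

```python
import numpy as np

def softmax(logits, T=1.0):
    z = logits / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, T=4.0):
    """KL(teacher || student) on temperature-softened class probabilities.
    The student learns the teacher's full output distribution, not just its
    hard labels; the T**2 factor keeps the gradient scale comparable across T."""
    p_t = softmax(teacher_logits, T)
    p_s = softmax(student_logits, T)
    kl = np.sum(p_t * (np.log(p_t) - np.log(p_s)), axis=-1)
    return float(np.mean(kl)) * T**2

# Example with random logits: a batch of 8 examples, 10 classes.
rng = np.random.default_rng(0)
print(distillation_loss(rng.standard_normal((8, 10)), rng.standard_normal((8, 10))))
```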
AI Alignment 2018-19 Review
it's not obvious to me that supervised learning does

What type of scheme do you have in mind that would allow an AI to learn our values through supervised learning?

Typically, the problem with supervised learning is that it's too expensive to label everything we care about. In this case, are you imagining that we label some types of behaviors as good and some as bad, perhaps like what we would do with an approval directed agent? Or are you thinking of something more general or exotic?

2John Maxwell1yI don't think we'll create AGI without first acquiring capabilities that make supervised learning much more sample-efficient (e.g. better unsupervised methods let us better use unlabeled data, so humans no longer need to label everything they care about, and instead can just label enough data to pinpoint "human values" as something that's observable in the world--or characterize it as a cousin of some things that are observable in the world). But if you think there are paths to AGI which don't go through more sample-efficient supervised learning, one course of action would be to promote differential technological development towards more sample-efficient supervised learning and away from deep reinforcement learning. For example, we could try & convince DeepMind and OpenAI to reallocate resources away from deep RL and towards sample efficiency. (Note: I just stumbled on this recent paper [https://arxiv.org/pdf/2001.05068.pdf] which is probably worth a careful read before considering advocacy of this type.) This seems like a promising option.
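To make the "label some behaviors as good and some as bad" framing from the question above concrete, here is a toy sketch of an approval-style setup. The feature vectors and labels are entirely hypothetical placeholders, not a real proposal:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical dataset: each row describes a candidate behavior/trajectory,
# and the label records whether a human approved of it.
X = rng.standard_normal((200, 16))                 # made-up behavior features
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)      # stand-in for human approval labels

approval_model = LogisticRegression().fit(X, y)

# At decision time, score candidate behaviors by predicted approval and act
# on the one the learned model rates highest.
candidates = rng.standard_normal((5, 16))
best = candidates[approval_model.predict_proba(candidates)[:, 1].argmax()]
```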
Inner alignment requires making assumptions about human values
Is your point mostly centered around there being no single correct way to generalize to new domains, but humans have preferences about how the AI should generalize, so to generalize properly, the AI needs to learn how humans want it to do generalization?

Pretty much, yeah.

The above sentence makes lots of sense to me, but I don't see how it's related to inner alignment

I think there are a lot of examples of this phenomenon in AI alignment, but I focused on inner alignment for two reasons:

  • There's a heuristic that a solution to inner alignment sho
... (read more)
Inner alignment requires making assumptions about human values
I also see how you might have a catastrophe-avoiding agent capable of large positive impacts, assuming an ontology but without assuming a lot about human preferences.

I find this interesting but I'd be surprised if it were true :). I look forward to seeing it in the upcoming posts.

That said, I want to draw your attention to my definition of catastrophe, which I think is different than the way most people use the term. I think most broadly, you might think of a catastrophe as something that we would never want to happen even once. But for inner alignme... (read more)

5Alex Turner1yI think people should generally be a little more careful about saying "this requires value-laden information". First, while a certain definition may seem to require it, there may be other ways of getting the desired behavior, perhaps through reframing. Building an AI which only does small things should not require the full specification of value, even though it seems like you have to say "don't do all these bad things we don't like"! Second, it's always good to check "would this style of reasoning lead me to conclude solving the easy problem of wireheading [https://www.alignmentforum.org/s/Rm6oQRJJmhGCcLvxh/p/iTpLAaPamcKyjmbFC] is value-laden?": This isn't an object-level critique of your reasoning in this post, but more that the standard of evidence is higher for this kind of claim.
AI Alignment Open Thread October 2019

Ahh. To be honest, I read that, but then responded to something different. I assumed you were just expressing general pessimism, since there's no guarantee that we would converge on good values upon a long reflection (and you recently viscerally realized that values are very arbitrary).

Now I see that your worry is more narrow: something like the Cultural Revolution might happen during this period, and society might act unwisely by creating the AGI in its wake. I guess this seems quite plausible, and is an important concern, though I personally am skeptical that anything like the Long Reflection will ever happen.

2Wei Dai1yI guess I was also expressing a more general update towards more pessimism, where even if nothing happens during the Long Reflection that causes it to prematurely build an AGI, other new technologies that will be available/deployed during the Long Reflection could also invalidate the historical tendency for "Cultural Revolutions" to dissipate over time and for moral evolution to continue along longer-term trends. Sure, I'm skeptical of that too, but given my pessimism about more direct routes to building an aligned AGI, I thought it might be worth pushing for it anyway.
Outer alignment and imitative amplification
I tend to be fairly skeptical of these challenges—HCH is just a bunch of humans after all and if you can instruct them not to do things like instantiate arbitrary Turing machines, then I think a bunch of humans put together has a strong case for being aligned.

Minor nitpick: I mostly agree, but I feel like a lot of work is being done by saying that they can't instantiate arbitrary Turing machines, and that it's just a bunch of humans. Human society is also a bunch of humans, but frequently does things that I can't imagine any single in... (read more)

AI Alignment Open Thread October 2019
It sounds like you think that something like another Communist Revolution or Cultural Revolution could happen (that emphasizes some random virtues at the expense of others), but the effect would be temporary and after it's over, longer term trends will reassert themselves. Does that seem fair?

That's pretty fair.

I think it's likely that another cultural revolution could happen, and this could adversely affect the future if it happens simultaneously with a transition into an AI based economy. However, the deviations from long-term trends are v... (read more)

3Wei Dai1yThis seems to be ignoring the part of my comment [https://www.lesswrong.com/posts/pFAavCTW56iTsYkvR/ai-alignment-open-thread-october-2019?commentId=ecpBgnqXeWzuDmW7q] at the top of this sub-thread, where I said "[...] has also made me more pessimistic about non-AGI or delayed-AGI approaches to a positive long term future (e.g., the Long Reflection)." In other words, I'm envisioning a long period of time in which humanity has the technical ability to create an AGI but is deliberately holding off to better figure out our values or otherwise perfect safety/alignment. I'm worried about something like the Cultural Revolution happening in this period, and you don't seem to be engaging with that concern?
AI Alignment Open Thread October 2019

I could be wrong here, but the things you mentioned as counterexamples to my model appear either ephemeral or too particular. The "last few years" of political correctness is hardly enough time to judge world trends by, right? By contrast, the things I mentioned (the end of slavery, explicit policies against racism and war) seem likely to stick with us for decades, if not centuries.

We can explain this after the fact by saying that the Left is being forced by impersonal social dynamics, e.g., runaway virtue signaling, to over-correct, but did
... (read more)

When I listen to old recordings of right-wing talk show hosts from decades ago, they seem to be saying the same things people are saying today: about political correctness and being forced out of academia for saying things deemed harmful by the social elite, or about the Left being obsessed with equality and identity. So I would definitely say that a lot of people predicted this would happen.

I think what's surprising is that although academia has been left-leaning for decades, the situation had been relatively stable until the last fe

... (read more)
2Wei Dai1yIt sounds like you think that something like another Communist Revolution or Cultural Revolution could happen (that emphasizes some random virtues at the expense of others), but the effect would be temporary and after it's over, longer term trends will reassert themselves. Does that seem fair? In the context of AI strategy though (specifically something like the Long Reflection), I would be worried that a world in the grips of another Cultural Revolution would be very tempted to (or impossible to refrain from) abandoning the plan to delay AGI and instead build and lock their values into a superintelligent AI ASAP, even if that involves more safety risk. Predictability of longer term moral trends (even if true) doesn't seem to help with this concern.
AI Alignment Open Thread October 2019

Part of why I'm skeptical of these concerns is that it seems like a lot of moral behavior is predictable as society gets richer, and that we can model the social dynamics to predict that some outcomes will be good.

As evidence for the predictability, consider that rich societies are more open to LGBT rights; they have explicit policies against racism, war, slavery, and torture; and they seem to be moving in the direction of government control over many aspects of life, such as education and healthcare. Is this just a quirk of our timeline, ... (read more)

3Wei Dai1yBy unpredictable I mean that nobody really predicted: (Edit: 1-3 removed to keep a safer distance from object-level politics, especially on AF) 4. Russia and China adopted communism even though they were extremely poor. (They were ahead of the US in gender equality and income equality for a time due to that, even though they were much poorer.) None of these seem well-explained by your "rich society" model. My current model [https://www.lesswrong.com/posts/vA2Gd2PQjNk68ngFu/what-determines-the-balance-between-intelligence-signaling] is that social media and a decrease in the perception of external threats relative to internal threats both favor more virtue signaling, which starts spiraling out of control after some threshold is crossed. But the actual virtue(s) that end up being signaled/reinforced (often at the expense of other virtues) are historically contingent and hard to predict.
Malign generalization without internal search

Sure, we can talk about this over video. Check your Facebook messages.

Malign generalization without internal search
Computing the fastest route to Paris doesn't involve search?
More generally, I think that in order for it to work, your example can't contain subroutines that perform search over actions. Nor can it contain subroutines such that, when called in the order that the agent typically calls them, they collectively constitute a search over actions.

My example uses search, but the search is not where the inner alignment failure lies. It is merely a subroutine called upon by this outer superstructure, which is itself the part that is misaligned. Theref... (read more)
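For concreteness, here is a minimal sketch of this kind of agent (illustrative only; the observations, graph, and destination names are invented, and this is not the example from the post). The top level is a fixed case split; search appears only inside a subroutine:

```python
import heapq

def shortest_path(graph, start, goal):
    # Ordinary Dijkstra-style search: a subroutine, not the agent's top-level structure.
    frontier = [(0, start, [start])]
    visited = set()
    while frontier:
        cost, node, path = heapq.heappop(frontier)
        if node == goal:
            return path
        if node in visited:
            continue
        visited.add(node)
        for neighbor, weight in graph.get(node, []):
            heapq.heappush(frontier, (cost + weight, neighbor, path + [neighbor]))
    return None

def agent_policy(observation, graph):
    # The outer superstructure: a hard-coded case split over observations.
    # If training only ever paired "red_light" with deliveries to site_A, this
    # switch is where off-distribution behavior goes wrong, no matter how well
    # the inner search performs.
    if observation == "red_light":
        return shortest_path(graph, "depot", "site_A")
    elif observation == "green_light":
        return shortest_path(graph, "depot", "site_B")
    else:
        return shortest_path(graph, "depot", "depot")  # fallback branch
```

Interpreting the search subroutine alone tells you nothing about which branch fires in a novel situation; the misgeneralization, if any, lives in the switch.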

1Daniel Kokotajlo1yYou are right; my comment was based on a misunderstanding of what you were saying. Hence why I unendorsed it. (I read "In this post, I will outline a general category of agents which may exhibit malign generalization without internal search, and then will provide a concrete example of an agent in the category. Then I will argue that, rather than being a very narrow counterexample, this class of agents could be competitive with search-based agents." and thought you meant agents that don't use internal search at all.)
Malign generalization without internal search

If one's interpretation of the 'objective' of the agent is full of piecewise statements and ad-hoc cases, then what exactly are we doing by describing it as maximizing an objective in the first place? You might as well describe a calculator by saying that it's maximizing the probability of outputting the following [write out the source code that leads to its outputs]. At some point the model breaks down, and the idea that it is following an objective is completely epiphenomenal to its actual operation. The model that it is maximizing an objective doesn't shed light on its internal operations any more than just spelling out exactly what its source code is.
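As a toy version of the calculator point (purely illustrative; the function names are made up), any deterministic program trivially 'maximizes' an objective constructed from its own outputs, which is exactly why such a description adds nothing:

```python
def calculator(a, b):
    # Some fixed deterministic program.
    return a + b

def post_hoc_objective(inputs, output):
    # Scores 1 if and only if the output matches what calculator would produce.
    return 1.0 if output == calculator(*inputs) else 0.0

# By construction the calculator "maximizes" post_hoc_objective on every input,
# but knowing this tells you nothing about its internals beyond restating its code.
assert post_hoc_objective((2, 3), calculator(2, 3)) == 1.0
```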

1Evan Hubinger1yI don't feel like you're really understanding what I'm trying to say here. I'm happy to chat with you about this more over video call or something if you're interested.
Malign generalization without internal search
I feel like what you're describing here is just optimization where the objective is determined by a switch statement

Typically when we imagine objectives, we think of a score that rates how well an agent achieves some goal in the world. How exactly does the switch statement 'determine' the objective?

Let's say that a human is given the instructions, "If you see the coin flip heads, then become a doctor. If you see the coin flip tails, then become a lawyer." What 'objective function' is the human maximizing here? If it'... (read more)
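A minimal sketch of the coin-flip case (hypothetical names and stand-in scoring functions): the case split selects which objective gets pursued, so no single fixed score describes the behavior across both possible observations:

```python
def doctor_score(life):
    # Stand-in for "how well this life went as a doctor".
    return life.count("medical_training")

def lawyer_score(life):
    # Stand-in for "how well this life went as a lawyer".
    return life.count("legal_training")

def objective_for(coin_flip):
    # The switch determines which objective is optimized downstream.
    return doctor_score if coin_flip == "heads" else lawyer_score

def act(coin_flip, candidate_lives):
    score = objective_for(coin_flip)
    return max(candidate_lives, key=score)  # optimization happens after the switch
```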

5Evan Hubinger1yI think that piecewise objectives are quite reasonable and natural—and I don't think they'll make transparency that much harder. I don't think there's any reason that we should expect objectives to be continuous in some nice way, so I fully expect you'll get these sorts of piecewise jumps. Nevertheless, the resulting objective in the piecewise case is still quite simple such that you should be able to use interpretability tools to understand it pretty effectively—a switch statement is not that complicated or hard to interpret—with most of the real hard work still primarily being done in the optimization. I do think there are a lot of possible ways in which the interpretability for mesa-optimizers story could break down—which is why I'm still pretty uncertain about it—but I don't think that a switch-case agent is such an example. Probably the case that I'm most concerned about right now is if you get an agent which has an objective which changes in a feedback loop with its optimization. If the objective and the optimization are highly dependent on each other, then I think that would make the problem a lot more difficult—and is the sort of thing that humans seem to do, which suggests that it's the sort of thing we might see in AI systems as well. On the other hand, a fixed switch-case objective is pretty easy to interpret, since you just need to understand the simple, fixed heuristics being used in the switch statement and then you can get a pretty good grasp on what your agent's objective is. Where I start to get concerned is when those switch statements themselves depend upon the agent's own optimization—a recursion which could possibly be many layers deep and quite difficult to disentangle. That being said, even in such a situation you're still using search to get your robust capabilities.
Inductive biases stick around
but as ML moves to larger and larger models, why should that matter? The answer, I think, is that the reason larger models generalize better isn't because they're more complex, but because they're actually simpler.

On its face, this statement seems implausible to me. Are you saying, for instance, that the model of a dog that a human has in their head should be simpler than the model of a dog that an image classifier has?

2Evan Hubinger1yI just edited the last sentence to be clearer in terms of what I actually mean by it.
1Evan Hubinger1yWhat double descent definitely says is that for a fixed dataset, larger models with zero training error are simpler than smaller models with zero training error. I think it does say somewhat more than that also, which is that larger models do have a real tendency towards being better at finding simpler models in general. That being said, the dataset on which the concept of a dog in your head was trained is presumably way larger than that of any ML model, so even if your brain is really good at implementing Occam's razor and finding simple models, your model is still probably going to be more complicated.
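To make the double-descent claim concrete, here is a minimal sketch (an invented toy setup, not from the post or the comments): random-feature regression with minimum-norm least squares, sweeping the number of features past the interpolation threshold. Test error typically spikes near n_features ≈ n_train and falls again as the model grows, which is the sense in which the larger interpolating models end up effectively simpler:

```python
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test, noise = 40, 500, 0.3

def target(x):
    return np.sin(2 * np.pi * x)

x_train = rng.uniform(-1, 1, n_train)
y_train = target(x_train) + noise * rng.normal(size=n_train)
x_test = rng.uniform(-1, 1, n_test)
y_test = target(x_test)

def random_features(x, centers, scale=0.2):
    # Gaussian random-feature map: one bump per randomly placed center.
    return np.exp(-((x[:, None] - centers[None, :]) ** 2) / (2 * scale ** 2))

for n_features in [5, 10, 20, 40, 80, 160, 640]:
    centers = rng.uniform(-1, 1, n_features)
    phi_train = random_features(x_train, centers)
    phi_test = random_features(x_test, centers)
    # For underdetermined systems, lstsq returns the minimum-norm solution,
    # which is what tends to make the heavily overparameterized fits smooth.
    w, *_ = np.linalg.lstsq(phi_train, y_train, rcond=None)
    test_mse = np.mean((phi_test @ w - y_test) ** 2)
    print(f"{n_features:4d} features: test MSE {test_mse:.3f}")
```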
Is the term mesa optimizer too narrow?
I am interested in a term that would just point to the distinction without taking a view on the nature of the underlying goals.

I'm not sure what's unsatisfying about the characterization I gave. If we just redefine 'optimizer' to mean an interpretation of the agent's behavior (specifically, that it abstractly pursues goals), why is that an unsatisfying way of showing the mesa != base issue?

ETA:

The nature of goals and motivation in an agent isn’t just a question of applying the intentional stance. We can study how goals and motivation wo... (read more)
Is the term mesa optimizer too narrow?
I wish there was a term that felt like it pointed directly to Mesa≠Base without pointing to Optimisers.

I think it's fairly easy to point out the problem using an alternative definition. If we just change the definition of mesa optimizer to reflect that we are using the intentional stance (in other words, we're interpreting the neural network as having goals, whether it's using an internal search or not), the mesa!=base description falls right out, and all the normal risks about building mesa optimizers still apply.

1Vladimir Mikulik1yI’m not talking about finding an optimiser-less definition of goal-directedness that would support the distinction. As you say, that is easy. I am interested in a term that would just point to the distinction without taking a view on the nature of the underlying goals. As a side note, I think the role of the intentional stance here is more subtle than I see it discussed. The nature of goals and motivation in an agent isn’t just a question of applying the intentional stance. We can study how goals and motivation work in the brain neuroscientifically (or at least, the processes in the brain that resemble the role played by goals in the intentional stance picture), and we experience goals and motivations directly in ourselves. So, there is more to the concepts than just taking an interpretative stance, though of course to the extent that the concepts (even when refined by neuroscience) are pieces of a model being used to understand the world, they will form part of an interpretative stance.
Is the term mesa optimizer too narrow?
I’m not sure what exactly you mean by “retreat to malign generalisation”.

When you don't have a deep understanding of a phenomenon, it's common to use some empirical description of what you're talking about, rather than using your current (and incorrect) model to interpret the phenomenon. The issue with using your current model is that it leads you to make incorrect inferences about why things happen, because you're relying too heavily on the model being internally correct.

Therefore, until we gain a deeper understan... (read more)

1Vladimir Mikulik1yI understand that, and I agree with that general principle. My comment was intended to be about where to draw the line between incorrect theory, acceptable theory, and pre-theory. In particular, I think that while optimisation is too much theory, goal-directedness talk is not, despite being more in theory-land than empirical malign generalisation talk. We should keep thinking of worries on the level of goals, even as we’re still figuring out how to characterise goals precisely. We should also be thinking of worries on the level of what we could observe empirically.