One of the primary questions that comes to mind for me is "well, did this whole thing actually work?". If I understand the paper correctly, while we definitely substantially decreased the fraction of random samples that got misclassified (which always seemed very likely to happen, and I am indeed a bit surprised at only getting it to move ~3 OOMs, which my guess is mostly capability related, since you used small models), we only doubled the amount of effort necessary to generate an adversarial counterexample.
A doubling is still pretty substantial, an…
Excellent question -- I wish we had included more of an answer to this in the post.
I think we made some real progress on the defense side -- but I 100% was hoping for more and agree we have a long way to go.
I think the classifier is quite robust in an absolute sense, at least compared to normal ML models. We haven't actually tried it on the final classifier, but my guess is it takes at least several hours to find a crisp failure unassisted (whereas almost all ML models you can find are trivially breakable). We're interested in people giving it a shot! :)
This article explains the difference: https://www.consumeranalysis.com/guides/portable-ac/best-portable-air-conditioner/
For example, a 14,000 BTU model that draws 1,400 watts of power on maximum settings would have an EER of 10.0 as 14,000/1,400 = 10.0.
A 14,000 BTU unit that draws 1200 watts of power would have an EER of 11.67 as 14,000/1,200 = 11.67.
Taken at face value, this looks like a good and proper metric to use for energy efficiency. The lower the power draw (watts) compared to the cooling capacity (BTUs/hr), the higher the EER. And the higher the E
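The arithmetic in the quoted example is just cooling capacity divided by power draw; a minimal sketch (using only the two units quoted above):

```python
def eer(btu_per_hour: float, watts: float) -> float:
    """Energy Efficiency Ratio: cooling capacity (BTU/hr) per watt of power drawn."""
    return btu_per_hour / watts

# The two example units from the quote:
print(round(eer(14_000, 1_400), 2))  # 10.0
print(round(eer(14_000, 1_200), 2))  # 11.67
```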
EER does not account for heat infiltration issues, so this seems confused. CEER does, and that does suggest something in the 20% range, but I am pretty sure you can't use EER to compare a single-hose and a dual-hose system.
I think that paragraph is discussing a second reason that infiltration is bad.
Yeah, sorry, I didn't mean to imply the section is saying something totally wrong. The section just makes it sound like that is the only concern with infiltration, which seems wrong, and my current model of the author of the post is that they weren't actually thinking through heat-related infiltration issues (though it's hard to say from just this one paragraph, of course).
My overall take on this post and comment (after spending like 1.5 hours reading about AC design and statistics):
Overall I feel like both the OP and this reply say some wrong things. The top Wirecutter recommendation is a dual-hose design. The testing procedure of Wirecutter does not seem to address infiltration in any way, and indeed the whole article does not discuss infiltration as it relates to cooling efficiency.
Overall efficiency loss from going from dual to single is something like 20-30%, which I do think is much lower than the OP …
Update: I too have now spent like 1.5 hours reading about AC design and statistics, and I can now give a reasonable guess at exactly where the I-claim-obviously-ridiculous 20-30% number came from. Summary: the SACC/CEER standards use a weighted mix of two test conditions, with 80% of the weight on conditions in which outdoor air is only 3°F/1.6°C hotter than indoor air.
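If I'm reading the weighting right, the effect can be sketched as a simple weighted average. The 80/20 split comes from the comment above; the capacity numbers below are purely hypothetical illustrations, not the DOE's actual test figures:

```python
def weighted_capacity(capacity_hot: float, capacity_mild: float,
                      w_hot: float = 0.2, w_mild: float = 0.8) -> float:
    """SACC-style weighted average: 80% of the weight goes to the mild
    condition, where outdoor air is only ~3°F hotter than indoor air."""
    return w_hot * capacity_hot + w_mild * capacity_mild

# Hypothetical effective capacities (BTU/hr). A single-hose unit loses a
# lot to infiltration when outdoor air is hot, but very little when
# outdoor air is barely warmer than indoor air.
single_hose = weighted_capacity(capacity_hot=6_000, capacity_mild=9_000)
dual_hose = weighted_capacity(capacity_hot=8_500, capacity_mild=9_500)
print(single_hose, dual_hose)
```

Under the hot condition alone, the (hypothetical) single-hose unit is ~29% worse; after the mild-weighted averaging, the gap shrinks to ~10%, which is how a weighting like this could produce a modest-looking headline number.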
The whole backstory of the DOE's SACC/CEER rating rules is here. Single-hose air conditioners take center stage. The comments on the DOE's rule proposals can basically be summarized as:
The best thing we took away from our tests was the chance at a direct comparison between a single-hose design and a dual-hose design that were otherwise identical, and our experience confirmed our suspicions that dual-hose portable ACs are slightly more effective than single-hose models but not effective enough to make a real difference
After having looked into this quite a bit, it does really seem like the Wirecutter testing process had no ability to notice infiltration issues, so it seems like the Wirecutter crew themselves are kind of confused here? …
A 2-hose unit will definitely cool more efficiently, but I think for many people who are using portable units it's the right tradeoff with convenience. The Wirecutter reviews both types of units together and usually ends up preferring 1-hose units.
It is important to note that the current top Wirecutter pick is a 2-hose unit, though one that combines the two hoses into one big hose. I guess maybe that is recent, but it does seem important to acknowledge here (and it wouldn't surprise me that much if Wirecutter went through reasoning pretty similar to the one…
Here is the Wirecutter discussion of the distinction for reference:
Starting in 2019, we began comparing dual- and single-hose models according to the same criteria, and we didn’t dismiss any models based on their hose count. Our research, however, ultimately steered us toward single-hose portable models—in part because so many newer models use this design. In fact, we found no compelling new double-hose models from major manufacturers in 2019 or 2020 (although a few new ones cropped up in 2021, including our new top pick). Owner reviews indicate that most
Mod note: I reposted this post to the frontpage, because it wasn't actually shown on a frontpage due to an interaction with the GreaterWrong post-submission interface. It seemed like a post many people are interested in, and it seemed like it didn't really get the visibility it deserved.
Relevant Feynman quote:
I had a scheme, which I still use today when somebody is explaining something that I’m trying to understand: I keep making up examples.
For instance, the mathematicians would come in with a terrific theorem, and they’re all excited. As they’re telling me the conditions of the theorem, I construct something which fits all the conditions. You know, you have a set (one ball)-- disjoint (two balls). Then the balls turn colors, grow hairs, or whatever, in my head as they put more conditions on.
Finally they state the theorem, which is some dumb thing about the ball which isn’t true for my hairy green ball thing, so I say “False!” [and] point out my counterexample.
No real power-seeking tendencies if we only plausibly will specify a negative vector.
Seems like two sentences got merged together.
The post feels like it's trying pretty hard to point towards an alternative forecasting method, though I also agree it's not fully succeeding at getting there.
De facto, the forecasting methodology of people who are actually good at forecasting doesn't usually strike me as low-inferential-distance, such that it is obvious how to communicate the full methodology. My sense from talking to a number of superforecasters over the years is that they do pretty complicated things, and I don't feel like the critique of "A critique is only really valid …
I think it's fine to say that you think something else is better without being able to precisely say what it is. I just think "the trick that never works" is an overstatement if you aren't providing evidence about whether it has worked, and that it's hard to provide such evidence without saying something about what you are comparing to.
(Like I said though, I just skimmed the post and it's possible it contains evidence or argument that I didn't catch.)
It's possible the action is in disagreements about Moravec's view rather than the lack of an alternat…
This is a very good point. IIRC Paul is working on some new blog posts that summarize his more up-to-date approach, though I don't know when they'll be done. I will ask Paul when I next run into him about what he thinks might be the best way to update the sequence.
Thank you! I am glad you are doing this!
Promoted to curated: I found this conversation useful from a number of different perspectives, and found the transcript surprisingly easy to read (though it is still very long). The key question the conversation tried to tackle, about whether we should put resources into increasing the safety of AI systems by reducing the degree to which they try to model humans, is one that I've been interested in for a while. But I also felt like this conversation, more so than most other transcripts, gave me a better insight into how both Scott and Rohin think about these topics in general, and what kind of heuristics they use to evaluate various AI alignment proposals.
I also found these very valuable! I wonder whether a better title might help more people see how great these are, but not sure.
Replaced the image in the post with this image.
Minor meta feedback: I think it's better to put the "Comprehensive Information Gathering" part of the title at the end, if you want to have many of these. That makes it much easier to see differences in the title and skim a list of them.
The newsletter is back! I missed these! Glad to have these back.
Promoted to curated: I've had a number of disagreements with a perspective on AI that generates arguments like the above, which takes something like "ownership of material resources" as a really fundamental unit of analysis, and I feel like this post has both helped me get a better grasp on that paradigm of thinking, and also helped me get a bit of a better sense of what feels off to me, and I have a feeling this post will be useful in bridging that gap eventually.
When I begin a comment with a quotation, I don't know how to insert new un-quoted text at the top (other than by cutting the quotation, adding some blank lines, then pasting the quotation back). A way to do that would be great.
You can do this by pressing enter in an empty paragraph of a quoted block. That should break you out of the block. See this gif:
This is great, thank you!
Minor formatting note: The italics font on both the AI Alignment Forum and LessWrong isn't super well suited to large blocks of text, so I took the liberty of unitalicizing a bunch of the large blockquotes (which should be sufficiently distinguishable as blockquotes without the italics). Though I am totally happy to reverse it if you prefer the previous formatting.
Promoted to curated: As Adele says, this feels related to a bunch of the Jeffery-Bolker rotation ideas, which I've referenced many many times since then, but in a way that feels somewhat independent, which makes me more excited about there being some deeper underlying structure here.
I've also had something like this in my mind for a while, but haven't gotten around to formalizing it, and I think I've seen other people make similar arguments in the past, which makes this a valuable clarification and synthesis that I expect to get referenced a bunch.
So secret that even a spoiler tag wasn't good enough.
Promoted to curated: This is a long and dense post, but I really liked it, and find this kind of commentary from a large variety of thinkers in the AI Alignment space quite useful. I found that it really helped me think about the implications of a lot of the topics discussed in the main sequence in much more detail, and in a much more robust way, and I have come back to this post multiple times since it's been published.
Also, of course, the whole original sequence is great and I think currently the best short introduction to AI-Risk that exists out there.
I agree with this, and was indeed kind of thinking of them as one post together.
I have now linked at least 10 times to the "'Generate evidence of difficulty' as a research purpose" section of this post. It was a thing that I kind of wanted to point to before this post came out, but felt confused about, and this post finally gave me a pointer to it.
I think that section was substantially more novel and valuable to me than the rest of this post, but that is also evidence that others might not have had some of the other ideas on their map, and so they might find the post similarly valuable because of a different section.
I think this post and the Gradient Hacking post caused me to actually understand and feel able to productively engage with the idea of inner-optimizers. I think the paper and full sequence was good, but I bounced off of it a few times, and this helped me get traction on the core ideas in the space.
I also think that some parts of this essay hold up better as a core abstraction than the actual mesa-optimizer paper itself, though I am not at all confident about this. But I just noticed that when I am internally thinking through alignment problems relate…
adamshimi says almost everything I wanted to say in my review, so I am very glad he made the points he did, and I would love for both his review and the top level post to be included in the book.
The key thing I want to emphasize a bit more is that I think the post as given is very abstract, and I have personally gotten a lot of value out of trying to think of more concrete scenarios where gradient hacking can occur.
I think one of the weakest aspects of the post is that it starts with the assumption that an AI system has already given rise to an…
I found this quite compelling. I don't think I am sold on some of the things yet (in particular claims 5 and 6), but thanks a lot for writing this up this clearly. I will definitely take some time to think more about this.
I think the CAIS framing that Eric Drexler proposed gave concrete shape to a set of intuitions that many people have been relying on for their thinking about AGI. I also tend to think that those intuitions and models aren't actually very good at modeling AGI, but I nevertheless think it productively moved the discourse forward a good bit.
In particular I am very grateful about the comment thread between Wei Dai and Rohin, which really helped me engage with the CAIS ideas, and I think were necessary to get me to my current understanding of CAIS and to …
Promoted to curated: Even if God and Santa Claus are not real, we do experience a Christmas miracle every year in the form of these amazingly thorough reviews by Larks. Thank you for your amazing work, as this continues to be an invaluable resource to anyone trying to navigate the AI Alignment landscape, whether as a researcher, grantmaker or independent thinker.
Unfortunately, they are only sporadically updated and difficult to consume using automated tools. We encourage organizations to start releasing machine-readable bibliographies to make our lives easier.
Oh interesting. Would it be helpful to have something on the AI Alignment Forum in the form of some kind of more machine-readable citation system, or did you find the current setup sufficient?
Also, thank you for doing this!
Yep, you can revise it any time before we actually publish the book, though ideally you can revise it before the vote so people can be compelled by your amazing updates!
Coming back to this post, I have some thoughts related to it that connect this more directly to AI Alignment that I want to write up, and that I think make this post more important than I initially thought. Hence nominating it for the review.
I think of Utility != Reward as probably the most important core point from the Mesa-Optimizer paper, and I preferred this explanation over the one in the paper (though it leaves out many things, and I wouldn't want it to be the only thing someone reads on the topic).
Most of my points from my curation notice still hold. And two years later, I am still thinking a lot about credit assignment as a perspective on many problems I am thinking about.
This seems like one I would significantly re-write for the book if it made it that far. I feel like it got nominated for the introductory material, which I wrote quickly in order to get to the "main point" (the gradient gap). A better version would have discussed credit assignment algorithms more.
This post felt like it took a problem that I was thinking about from 3 different perspectives and combined them in a way that felt pretty coherent, though I am not fully sure how right it gets it. Concretely, the 3 domains I felt it touched on were:
All three of these feel pretty important to me.
Gradient hacking seems important and I really didn't think of this as a concrete consideration until this post came out.
I've referred specifically to the section on "Generate evidence of difficulty" as a research purpose many times since this post has come out, and while I have disagreements with it, I do really like it as a handle for a consideration that I hadn't previously seen written up, and does strike me as quite important.
While I think this post isn't the best writeup of this topic I can imagine, I think it makes a really important point quite succinctly, and is one that I have brought up many times in arguments around takeoff speeds and risk scenarios since this post came out.
In talking to many people about AI Alignment over the years, I've repeatedly found that a surprisingly large generator of disagreement about risk scenarios was disagreement about the fragility of human values.
I think this post should be reviewed for its excellent comment section at least as much as for the original post, and also think that this post is a pretty central example of the kind of post I would like to see more of.
As Robby said, this post isn't perfect, but it felt like it opened up a conversation on LessWrong that I think is really crucial, and was followed up in a substantial number of further posts by Daniel Kokotajlo that I have found really useful. Many of those were written in 2020, but of the ones written in 2019, this strikes me as the one I remember most.
I am organizing a reading group for this report next Tuesday in case you (or anyone else) wants to show up:
I... think this post was impacted by a bug in the LW API that GreaterWrong ran into, that made it so that it wasn't visible on the frontpage when it was published. It nevertheless appears to have gotten some amount of engagement, but maybe that was all from direct links?
Given the substantial chance that a number of people have never seen this post, I reposted it. Its original publishing date was the 11th of June.
Promoted to curated: I really enjoyed reading through this sequence. I have some disagreements with it, but overall it's one of the best plain language introductions to AI safety that I've seen, and I expect I will link to this as a good introduction many times in the future. I was also particularly happy with how the sequence bridged and synthesized a number of different perspectives that usually feel in conflict with each other.
Promoted to curated: This kind of thinking seems both very important, and also extremely difficult. I do think that trying to understand the underlying computational structure of the brain is quite useful for both thinking about Rationality and thinking about AI and AI Alignment, though it's also plausible to me that it's hard enough to get things right in this space that in the end overall it's very hard to extract useful lessons from this.
Despite the difficulties I expect in this space, this post does strike me as overall pretty decent and to at the very least open up a number of interesting questions that one could ask to further deconfuse oneself on this topic.
Promoted to curated! I held off on curating this post for a while, first because it's long and it took me a while to read through it, and second because we already had a lot of AI Alignment posts in the curation pipeline, and I wanted to make sure we have some diversity in our curation decisions. But overall, I really liked this post, and also want to mirror Rohin's comment in that I found this version more useful than the version where you got everything right, because this way I got to see the contrast between your interpretation and Paul's responses, which feels like it helped me locate the right hypothesis more effectively than either would have on its own (even if more fleshed out).
Yep, fixed. Thank you!
Judging from the URL of those links, those images were hosted on a domain that you could access but others could not: namely, they were stored as Gmail image attachments, to which you as the recipient of course have access, but random LessWrong users do not.
Oh no! The two images starting from this point are broken for me:
Promoted to curated: These additions are really great, and they fill in a lot of the most confusing parts of the original Embedded Agency sequence, which was already one of my favorite pieces of content on all of LessWrong. So it seems fitting to curate this update to it, which improves it even further.