All of habryka's Comments + Replies

High-stakes alignment via adversarial training [Redwood Research report]

One of the primary questions that comes to mind for me is "well, did this whole thing actually work?". If I understand the paper correctly, while we definitely substantially decreased the fraction of random samples that got misclassified (which always seemed very likely to happen, and I am indeed a bit surprised at only getting it to move ~3 OOMs, which my guess is mostly capability related, since you used small models), we only doubled the amount of effort necessary to generate an adversarial counterexample. 

A doubling is still pretty substantial, an... (read more)

Excellent question -- I wish we had included more of an answer to this in the post.

I think we made some real progress on the defense side -- but I 100% was hoping for more and agree we have a long way to go.

I think the classifier is quite robust in an absolute sense, at least compared to normal ML models. We haven't actually tried it on the final classifier, but my guess is it takes at least several hours to find a crisp failure unassisted (whereas almost all ML models you can find are trivially breakable). We're interested in people giving it a shot! :)

Pa... (read more)

I do not understand the "we only doubled the amount of effort necessary to generate an adversarial counterexample.". Aren't we talking about 3 OOMs?
Everything I Need To Know About Takeoff Speeds I Learned From Air Conditioner Ratings On Amazon

This article explains the difference:

For example, a 14,000 BTU model that draws 1,400 watts of power on maximum settings would have an EER of 10.0 as 14,000/1,400 = 10.0.

A 14,000 BTU unit that draws 1200 watts of power would have an EER of 11.67 as 14,000/1,200 = 11.67.

Taken at face value, this looks like a good and proper metric to use for energy efficiency. The lower the power draw (watts) compared to the cooling capacity (BTUs/hr), the higher the EER. And the higher the E

... (read more)
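The EER arithmetic in the quoted examples can be checked in a few lines. This is only a minimal sketch of the ratio as the excerpt describes it (cooling capacity in BTU/hr divided by power draw in watts); the function name `eer` is made up for illustration and does not come from the article.

```python
def eer(cooling_btu_per_hr: float, power_watts: float) -> float:
    """Energy Efficiency Ratio as described in the quoted article:
    cooling capacity (BTU/hr) divided by power draw (watts)."""
    if power_watts <= 0:
        raise ValueError("power_watts must be positive")
    return cooling_btu_per_hr / power_watts


# The two units from the quoted examples.
print(round(eer(14_000, 1_400), 2))  # 10.0
print(round(eer(14_000, 1_200), 2))  # 11.67
```

Note that nothing in this ratio refers to where the exhausted air comes from, which is why the hose count does not show up in it directly.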
Everything I Need To Know About Takeoff Speeds I Learned From Air Conditioner Ratings On Amazon

EER does not account for heat infiltration issues, so this seems confused. CEER does, and that does suggest something in the 20% range, but I am pretty sure you can't use EER to compare a single-hose and a dual-hose system.

2Jessica Taylor1mo
I assumed EER did account for that based on:
Everything I Need To Know About Takeoff Speeds I Learned From Air Conditioner Ratings On Amazon

I think that paragraph is discussing a second reason that infiltration is bad.

Yeah, sorry, I didn't mean to imply the section is saying something totally wrong. The section just makes it sound like that is the only concern with infiltration, which seems wrong, and my current model of the author of the post is that they weren't actually thinking through heat-related infiltration issues (though it's hard to say from just this one paragraph, of course). 

Everything I Need To Know About Takeoff Speeds I Learned From Air Conditioner Ratings On Amazon

My overall take on this post and comment (after spending like 1.5 hours reading about AC design and statistics): 

Overall I feel like both the OP and this reply say some wrong things. The top Wirecutter recommendation is a dual-hose design. The testing procedure of Wirecutter does not seem to address infiltration in any way, and indeed the whole article does not discuss infiltration as it relates to cooling-efficiency. 

Overall efficiency loss from going to dual to single is something like 20-30%, which I do think is much lower than I think the OP ... (read more)

Update: I too have now spent like 1.5 hours reading about AC design and statistics, and I can now give a reasonable guess at exactly where the I-claim-obviously-ridiculous 20-30% number came from. Summary: the SACC/CEER standards use a weighted mix of two test conditions, with 80% of the weight on conditions in which outdoor air is only 3°F/1.6°C hotter than indoor air.
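To see why the weighting matters, here is a rough sketch of a two-condition weighted mix. It assumes the rating is a simple linear combination with 80% of the weight on the mild condition, as summarized above; the function name and all capacity figures are hypothetical and chosen only for illustration, not taken from the DOE rules or from any measurement.

```python
def weighted_mix(value_hot: float, value_mild: float, mild_weight: float = 0.8) -> float:
    """Linear mix of a quantity measured under two test conditions.

    mild_weight is the share given to the mild condition (assumed 0.8 here);
    the hot condition gets the remainder.
    """
    if not 0.0 <= mild_weight <= 1.0:
        raise ValueError("mild_weight must be between 0 and 1")
    return mild_weight * value_mild + (1.0 - mild_weight) * value_hot


# Hypothetical capacities (BTU/hr): the single-hose unit loses a lot under
# the hot condition but only a little under the mild one.
dual_hose = weighted_mix(value_hot=8_000, value_mild=8_500)
single_hose = weighted_mix(value_hot=5_000, value_mild=8_000)

# Fractional loss under the hot condition alone vs. after weighting.
loss_hot_only = 1 - 5_000 / 8_000
loss_weighted = 1 - single_hose / dual_hose
print(f"hot-only loss: {loss_hot_only:.0%}, weighted loss: {loss_weighted:.0%}")
```

With made-up inputs like these, a large gap under the hot condition shrinks considerably once most of the weight sits on the mild condition, which is the mechanism the summary points at.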

The whole backstory of the DOE's SACC/CEER rating rules is here. Single-hose air conditioners take center stage. The comments on the DOE's rule proposals can basically be summarized as:

  • Singl
... (read more)
  • The top wirecutter recommendation is roughly 3x as expensive as the Amazon AC being reviewed. The top budget pick is a single-hose model.
  • People usually want to cool the room they are spending their time in. Those ACs are marketed to cool a 300 sq ft room, not a whole home. That's what reviewers are clearly doing with the unit. 
  • I'd guess that in extreme cases (where you care about the room with AC no more than other rooms in the house + rest of house is cool) consumers are overestimating efficiency by ~30%. On average in reality I'd guess they are over
... (read more)
Everything I Need To Know About Takeoff Speeds I Learned From Air Conditioner Ratings On Amazon

The best thing we took away from our tests was the chance at a direct comparison between a single-hose design and a dual-hose design that were otherwise identical, and our experience confirmed our suspicions that dual-hose portable ACs are slightly more effective than single-hose models but not effective enough to make a real difference

After having looked into this quite a bit, it does really seem like the Wirecutter testing process had no ability to notice infiltration issues, so it seems like the Wirecutter crew themselves is kind of confused here? ... (read more)

4Paul Christiano1mo
They measure the temperature in the room, which captures the effect of negative pressure pulling in hot air from the rest of the building. It underestimates the costs if the rest of the building is significantly cooler than the outside (I'd guess by the ballpark of 20-30% in the extreme case where you care equally about all spaces in the building, the rest of your building is kept at the same temp as the room you are cooling, and a negligible fraction of air exchange with the outside is via the room you are cooling). I think that paragraph is discussing a second reason that infiltration is bad.
Everything I Need To Know About Takeoff Speeds I Learned From Air Conditioner Ratings On Amazon

A 2-hose unit will definitely cool more efficiently, but I think for many people who are using portable units it's the right tradeoff with convenience. The wirecutter reviews both types of units together and usually end up preferring 1-hose units.

It is important to note that the current top wirecutter pick is a 2-hose unit, though one that combined the two hoses into one big hose. I guess maybe that is recent, but it does seem important to acknowledge here (and it wouldn't surprise me that much if Wirecutter went through reasoning pretty similar to the one... (read more)

Here is the wirecutter discussion of the distinction for reference:

Starting in 2019, we began comparing dual- and single-hose models according to the same criteria, and we didn’t dismiss any models based on their hose count. Our research, however, ultimately steered us toward single-hose portable models—in part because so many newer models use this design. In fact, we found no compelling new double-hose models from major manufacturers in 2019 or 2020 (although a few new ones cropped up in 2021, including our new top pick). Owner reviews indicate that most

... (read more)
A broad basin of attraction around human values?

Mod note: I reposted this post to the frontpage, because it wasn't actually shown on a frontpage due to an interaction with the GreaterWrong post-submission interface. It seemed like a post many people are interested in, and it seemed like it didn't really get the visibility it deserved.

Late 2021 MIRI Conversations: AMA / Discussion

Relevant Feynman quote: 

I had a scheme, which I still use today when somebody is explaining something that I’m trying to understand: I keep making up examples.

For instance, the mathematicians would come in with a terrific theorem, and they’re all excited. As they’re telling me the conditions of the theorem, I construct something which fits all the conditions. You know, you have a set (one ball)-- disjoint (two balls). Then the balls turn colors, grow hairs, or whatever, in my head as they put more conditions on.

Finally they state the theorem, which is some dumb thing about the ball which isn’t true for my hairy green ball thing, so I say “False!” [and] point out my counterexample.

Instrumental Convergence For Realistic Agent Objectives


No real power-seeking tendencies if we only plausibly will specify a negative vector.

Seems like two sentences got merged together.

2Alex Turner4mo
Fixed, thanks!
Biology-Inspired AGI Timelines: The Trick That Never Works

The post feels like it's trying pretty hard to point towards an alternative forecasting method, though I also agree it's not fully succeeding at getting there. 

I feel like de-facto the forecasting methodology of people who are actually good at forecasting doesn't usually strike me as low-inferential distance, such that it is obvious how to communicate the full methodology. My sense from talking to a number of superforecasters over the years is that they do pretty complicated things, and I don't feel like the critique of "A critique is only really valid ... (read more)

I think it's fine to say that you think something else is better without being able to precisely say what it is. I just think "the trick that never works" is an overstatement if you aren't providing evidence about whether it has worked, and that it's hard to provide such evidence without saying something about what you are comparing to.

(Like I said though, I just skimmed the post and it's possible it contains evidence or argument that I didn't catch.)

It's possible the action is in disagreements about Moravec's view rather than the lack of an alternat... (read more)

Preface to the sequence on iterated amplification

This is a very good point. IIRC Paul is working on some new blog posts that summarize his more up-to-date approach, though I don't know when they'll be done. I will ask Paul when I next run into him about what he thinks might be the best way to update the sequence.

[MLSN #1]: ICLR Safety Paper Roundup

Thank you! I am glad you are doing this!

Garrabrant and Shah on human modeling in AGI

Promoted to curated: I found this conversation useful from a number of different perspectives, and found the transcript surprisingly easy to read (though it is still very long). The key question the conversation tried to tackle, about whether we should put resources into increasing the safety of AI systems by reducing the degree to which they try to model humans, is one that I've been interested in for a while. But I also felt like this conversation, more so than most other transcripts, gave me a better insight into how both Scott and Rohin think about these topics in general, and what kind of heuristics they use to evaluate various AI alignment proposals.

Intermittent Distillations #4: Semiconductors, Economics, Intelligence, and Technological Progress.

I also found these very valuable! I wonder whether a better title might help more people see how great these are, but not sure.

Measuring hardware overhang

Replaced the image in the post with this image.

Alex Turner's Research, Comprehensive Information Gathering

Minor meta feedback: I think it's better to put the "Comprehensive Information Gathering" part of the title at the end, if you want to have many of these. That makes it much easier to see differences in the title and skim a list of them.

1Adam Shimi1y
Sure, I hadn't thought about that.
[AN #152]: How we’ve overestimated few-shot learning capabilities

The newsletter is back! I missed these! Glad to have these back.

Rogue AGI Embodies Valuable Intellectual Property

Promoted to curated: I've had a number of disagreements with a perspective on AI that generates arguments like the above, which takes something like "ownership of material resources" as a really fundamental unit of analysis, and I feel like this post has both helped me get a better grasp on that paradigm of thinking, and also helped me get a bit of a better sense of what feels off to me, and I have a feeling this post will be useful in bridging that gap eventually. 

AMA: Paul Christiano, alignment researcher

When I begin a comment with a quotation, I don't know how to insert new un-quoted text at the top (other than by cutting the quotation, adding some blank lines, then pasting the quotation back). That would be great.

You can do this by pressing enter in an empty paragraph of a quoted block. That should cause you to remove the block. See this gif: 

4Paul Christiano1y
I thought that I tried that but it seems to work fine, presumably user error :)
What Multipolar Failure Looks Like, and Robust Agent-Agnostic Processes (RAAPs)

This is great, thank you! 

Minor formatting note: The italics font on both the AI Alignment Forum and LessWrong isn't super well suited to large blocks of text, so I took the liberty to unitalicize a bunch of the large blockquotes (which should be sufficiently distinguishable as blockquotes without the italics). Though I am totally happy to reverse it if you prefer the previous formatting.

Utility Maximization = Description Length Minimization

Promoted to curated: As Adele says, this feels related to a bunch of the Jeffrey-Bolker rotation ideas, which I've referenced many many times since then, but in a way that feels somewhat independent, which makes me more excited about there being some deeper underlying structure here.

I've also had something like this in my mind for a while, but haven't gotten around to formalizing it, and I think I've seen other people make similar arguments in the past, which makes this a valuable clarification and synthesis that I expect to get referenced a bunch.

Deducing Impact

So secret that even a spoiler tag wasn't good enough.

Commentary on AGI Safety from First Principles

Promoted to curated: This is a long and dense post, but I really liked it, and find this kind of commentary from a large variety of thinkers in the AI Alignment space quite useful. I found that it really helped me think about the implications of a lot of the topics discussed in the main sequence in much more detail, and in a much more robust way, and I have come back to this post multiple times since it's been published. 

Also, of course, the whole original sequence is great and I think currently the best short introduction to AI-Risk that exists out there.

Understanding “Deep Double Descent”

I agree with this, and was indeed kind of thinking of them as one post together.

Six AI Risk/Strategy Ideas

I have now linked at least 10 times to the "'Generate evidence of difficulty' as a research purpose" section of this post. It was a thing that I kind of wanted to point to before this post came out, but felt confused about it, and this post finally gave me a pointer to it.

I think that section was substantially more novel and valuable to me than the rest of this post, but it is also evidence that others might have also not had some of the other ideas on their map, and so they might have found it similarly valuable because of a different section.

Utility ≠ Reward

I think this post and the Gradient Hacking post caused me to actually understand and feel able to productively engage with the idea of inner-optimizers. I think the paper and full sequence was good, but I bounced off of it a few times, and this helped me get traction on the core ideas in the space. 

I also think that some parts of this essay hold up better as a core abstraction than the actual mesa-optimizer paper itself, though I am not at all confident about this. But I just noticed that when I am internally thinking through alignment problems relate... (read more)

2Ben Pace1y
For another datapoint, I'll mention that I didn't read this post nor Gradient Hacking at the time, I read the sequence, and I found that to be pretty enlightening and quite readable.
Gradient hacking

adamshimi says almost everything I wanted to say in my review, so I am very glad he made the points he did, and I would love for both his review and the top level post to be included in the book. 

The key thing I want to emphasize a bit more is that I think the post as given is very abstract, and I have personally gotten a lot of value out of trying to think of more concrete scenarios where gradient hacking can occur. 

I think one of the weakest aspects of the post is that it starts with the assumption that an AI system has already given rise to an... (read more)

4Adam Shimi1y
As I said elsewhere, I'm glad that my review captured points you deem important! I agree that gradient hacking isn't limited to inner optimizers; yet I don't think that defining it that way in the post was necessarily a bad idea. First, it's for coherence with Risks from Learned Optimization. Second, assuming some internal structure definitely helps with conceptualizing the kind of things that count as gradient hacking. With an inner optimizer, you can say relatively unambiguously "it tries to protect its mesa-objective", as there should be an explicit representation of it. That becomes harder without the inner optimization hypothesis. That being said, I am definitely focusing on gradient hacking as an issue with learned goal-directed systems instead of learned optimizers. This is one case where I have argued that a definition of goal-directedness would allow us to remove the explicit optimization hypothesis without sacrificing the clarity it brought. Two thoughts about that: * Even if some subnetwork basically captures SGD (or the relevant training process), I'm unconvinced that it would be useful in the beginning, and so it might be "written over" by the updates. * Related to the previous point, it looks crucial to understand what is needed in addition to a model of SGD in order to gradient hack. Which brings me to your next point. I'm confused about what you mean here. If the point is to make the network a local minimum, you probably just have to make it very brittle to any change. I'm also not sure what you mean by competing networks. I assumed it meant the neighboring models in model space, which are reachable by reasonable gradients. If that's the case, then I think my example is simpler and doesn't need the SGD modelling. If not, then I would appreciate more detailed explanations. Why is that supposed to be a good thing? Sure
4Ofer Givoli1y
I think the part in bold should instead be something like "failing hard if SGD would (not) update weights in such and such way". (SGD is a local search algorithm; it gradually improves a single network.) As I already argued in another thread, the idea is not that SGD creates the gradient hacking logic specifically (in case this is what you had in mind here). As an analogy, consider a human that decides to 1-box in Newcomb's problem (which is related to the idea of gradient hacking, because the human decides to 1-box in order to have the property of "being a person that 1-boxes", because having that property is instrumentally useful). The specific strategy to 1-box is not selected for by human evolution, but rather general problem-solving capabilities were (and those capabilities resulted in the human coming up with the 1-box strategy).
Eight claims about multi-agent AGI safety

I found this quite compelling. I don't think I am sold on some of the things yet (in particular claims 5 and 6), but thanks a lot for writing this up this clearly. I will definitely take some time to think more about this.

Reframing Superintelligence: Comprehensive AI Services as General Intelligence

I think the CAIS framing that Eric Drexler proposed gave concrete shape to a set of intuitions that many people have been relying on for their thinking about AGI. I also tend to think that those intuitions and models aren't actually very good at modeling AGI, but I nevertheless think it productively moved the discourse forward a good bit. 

In particular I am very grateful about the comment thread between Wei Dai and Rohin, which really helped me engage with the CAIS ideas, and I think were necessary to get me to my current understanding of CAIS and to ... (read more)

2020 AI Alignment Literature Review and Charity Comparison

Promoted to curated: Even if God and Santa Claus are not real, we do experience a Christmas miracle every year in the form of these amazingly thorough reviews by Larks. Thank you for your amazing work, as this continues to be an invaluable resource to anyone trying to navigate the AI Alignment landscape, whether as a researcher, grantmaker or independent thinker.

TAI Safety Bibliographic Database

Unfortunately, they are only sporadically updated and difficult to consume using automated tools.  We encourage organizations to start releasing machine-readable bibliographies to make our lives easier.

Oh interesting. Would it be helpful to have something on the AI Alignment Forum in the form of some kind of more machine-readable citation system, or did you find the current setup sufficient?

Also, thank you for doing this!

Note that individual researchers will sometimes put up bibtex files of all their publications, but I think it's rarer for organizations to do this.
Soft takeoff can still lead to decisive strategic advantage

Yep, you can revise it any time before we actually publish the book, though ideally you can revise it before the vote so people can be compelled by your amazing updates!

Evolution of Modularity

Coming back to this post, I have some thoughts related to it that connect this more directly to AI Alignment that I want to write up, and that I think make this post more important than I initially thought. Hence nominating it for the review. 

I'm curious to hear these thoughts.
Utility ≠ Reward

I think of Utility != Reward as probably the most important core point from the Mesa-Optimizer paper, and I preferred this explanation over the one in the paper (though it leaves out many things and I wouldn't want it to be the only thing someone reads on the topic).

The Credit Assignment Problem

Most of my points from my curation notice still hold. And two years later, I am still thinking a lot about credit assignment as a perspective on many problems I am thinking about. 

This seems like one I would significantly re-write for the book if it made it that far. I feel like it got nominated for the introductory material, which I wrote quickly in order to get to the "main point" (the gradient gap). A better version would have discussed credit assignment algorithms more.

Why Subagents?

This post felt like it took a problem that I was thinking about from 3 different perspectives and combined them in a way that felt pretty coherent, though I am not fully sure how right it gets it. Concretely, the 3 domains I felt it touched on were:

  1. How much can you model human minds as consisting of subagents?
  2. How much can problems with coherence theorems be addressed by modeling things as subagents?
  3. How much will AI systems behave as if they consisted of multiple subagents?

All three of these feel pretty important to me.

Gradient hacking

Gradient hacking seems important and I really didn't think of this as a concrete consideration until this post came out. 

Six AI Risk/Strategy Ideas

I've referred specifically to the section on "Generate evidence of difficulty" as a research purpose many times since this post has come out, and while I have disagreements with it, I do really like it as a handle for a consideration that I hadn't previously seen written up, and does strike me as quite important.

Strategic implications of AIs' ability to coordinate at low cost, for example by merging

While I think this post isn't the best writeup of this topic I can imagine, I think it makes a really important point quite succinctly, and is one that I have brought up many times in arguments around takeoff speeds and risk scenarios since this post came out.

But exactly how complex and fragile?

In talking to many people about AI Alignment over the years, I've repeatedly found that a surprisingly large generator of disagreement about risk scenarios was disagreement about the fragility of human values. 

I think this post should be reviewed for its excellent comment section at least as much as for the original post, and also think that this post is a pretty central example of the kind of post I would like to see more of.

Soft takeoff can still lead to decisive strategic advantage

As Robby said, this post isn't perfect, but it felt like it opened up a conversation on LessWrong that I think is really crucial, and was followed up in a substantial number of further posts by Daniel Kokotajlo that I have found really useful. Many of those were written in 2020, but of the ones written in 2019, this strikes me as the one I remember most. 

1Daniel Kokotajlo1y
Thanks. I agree it isn't perfect... in the event that it gets chosen, I'd get to revise it, right? I never did get around to posting the better version I promised.
Draft report on AI timelines

I am organizing a reading group for this report next Tuesday in case you (or anyone else) wants to show up:

Dutch-Booking CDT: Revised Argument

I... think this post was impacted by a bug in the LW API that GreaterWrong ran into, that made it so that it wasn't visible on the frontpage when it was published. It nevertheless appears to have gotten some amount of engagement, but maybe that was all from direct links? 

Given the substantial chance that a number of people have never seen this post, I reposted it. Its original publishing date was the 11th of June.

2Abram Demski2y
Ah, now the fact that I forgot to include an illustration (which I had drawn while writing the post) until months later feels like less of a waste! :)
AGI safety from first principles: Introduction

Promoted to curated: I really enjoyed reading through this sequence. I have some disagreements with it, but overall it's one of the best plain language introductions to AI safety that I've seen, and I expect I will link to this as a good introduction many times in the future. I was also particularly happy with how the sequence bridged and synthesized a number of different perspectives that usually feel in conflict with each other.

1Ben Pace2y
Critch recently made the argument (and wrote it in his ARCHES paper, summarized by Rohin here) that "AI safety" is a straightforwardly misleading name because "safety" is a broader category than is being talked about in (for example) this sequence – it includes things like not making self-driving cars crash. (To quote directly: "the term “AI safety” should encompass research on any safety issue arising from the use of AI systems, whether the application or its impact is small or large in scope".) I wanted to raise the idea here and ask Richard what he thinks about renaming it to something like "AI existential safety from first principles" or "AI as an extinction risk from first principles" or "AI alignment from first principles".
My computational framework for the brain

Promoted to curated: This kind of thinking seems both very important, and also extremely difficult. I do think that trying to understand the underlying computational structure of the brain is quite useful for both thinking about Rationality and thinking about AI and AI Alignment, though it's also plausible to me that it's hard enough to get things right in this space that in the end overall it's very hard to extract useful lessons from this. 

Despite the difficulties I expect in this space, this post does strike me as overall pretty decent, and at the very least it opens up a number of interesting questions that one could ask to further deconfuse oneself on this topic.

My Understanding of Paul Christiano's Iterated Amplification AI Safety Research Agenda

Promoted to curated! I held off on curating this post for a while, first because it's long and it took me a while to read through it, and second because we already had a lot of AI Alignment posts in the curation pipeline, and I wanted to make sure we have some diversity in our curation decisions. But overall, I really liked this post, and also want to mirror Rohin's comment in that I found this version more useful than the version where you got everything right, because this way I got to see the contrast between your interpretation and Paul's responses, which feels like it helped me locate the right hypothesis more effectively than either would have on its own (even if more fleshed out).

Comparing Utilities

Yep, fixed. Thank you!

Judging from the URL of those links, those images were hosted on a domain that you could access, but others could not, namely they were stored as Gmail image attachments, to which of course you as the recipient have access, but random LessWrong users do not. 

Comparing Utilities

Oh no! The two images starting from this point are broken for me: 

2Abram Demski2y
How about now?
2Abram Demski2y
Weird, given that they still look fine for me! I'll try to fix...
Updates and additions to "Embedded Agency"

Promoted to curated: These additions are really great, and they fill in a lot of the most confusing parts of the original Embedded Agency sequence, which was already one of my favorite pieces of content on all of LessWrong. So it seems fitting to curate this update to it, which improves it even further.
