All of habryka's Comments + Replies

[MLSN #1]: ICLR Safety Paper Roundup

Thank you! I am glad you are doing this!

Garrabrant and Shah on human modeling in AGI

Promoted to curated: I found this conversation useful from a number of different perspectives, and found the transcript surprisingly easy to read (though it is still very long). The key question the conversation tried to tackle, about whether we should put resources into increasing the safety of AI systems by reducing the degree to which they try to model humans, is one that I've been interested in for a while. But I also felt like this conversation, more so than most other transcripts, gave me a better insight into how both Scott and Rohin think about these topics in general, and what kind of heuristics they use to evaluate various AI alignment proposals.

Intermittent Distillations #4: Semiconductors, Economics, Intelligence, and Technological Progress.

I also found these very valuable! I wonder whether a better title might help more people see how great these are, but I'm not sure.

Measuring hardware overhang

Replaced the image in the post with this image.

Alex Turner's Research, Comprehensive Information Gathering

Minor meta feedback: I think it's better to put the "Comprehensive Information Gathering" part of the title at the end, if you want to have many of these. That makes it much easier to see differences in the title and skim a list of them.

1 · Adam Shimi · 4mo: Sure, I hadn't thought about that.
[AN #152]: How we’ve overestimated few-shot learning capabilities

The newsletter is back! I missed these, and I'm glad to have them back.

Rogue AGI Embodies Valuable Intellectual Property

Promoted to curated: I've had a number of disagreements with a perspective on AI that generates arguments like the above, one that takes something like "ownership of material resources" as a really fundamental unit of analysis. This post has helped me get a better grasp on that paradigm of thinking, and also get a bit of a better sense of what feels off to me about it, and I have a feeling it will be useful in bridging that gap eventually.

AMA: Paul Christiano, alignment researcher

When I begin a comment with a quotation, I don't know how to insert new un-quoted text at the top (other than by cutting the quotation, adding some blank lines, then pasting the quotation back). That would be great.

You can do this by pressing enter on an empty paragraph inside the quoted block, which should break that paragraph out of the quote. See this gif: 

4 · Paul Christiano · 6mo: I thought that I tried that, but it seems to work fine; presumably user error :)
What Multipolar Failure Looks Like, and Robust Agent-Agnostic Processes (RAAPs)

This is great, thank you! 

Minor formatting note: The italics font on both the AI Alignment Forum and LessWrong isn't super well suited to large block of text, so I took the liberty to unitalicize a bunch of the large blockquotes (which should be sufficiently distinguishable as blockquotes without the italics). Though I am totally happy to reverse it if you prefer the previous formatting. 

Utility Maximization = Description Length Minimization

Promoted to curated: As Adele says, this feels related to a bunch of the Jeffrey-Bolker rotation ideas, which I've referenced many many times since then, but in a way that feels somewhat independent, which makes me more excited about there being some deeper underlying structure here.

I've also had something like this in my mind for a while, but haven't gotten around to formalizing it, and I think I've seen other people make similar arguments in the past, which makes this a valuable clarification and synthesis that I expect to get referenced a bunch.
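
For readers who haven't read the post, here is a minimal sketch of the correspondence the title asserts (my own compression, not necessarily the post's exact formulation): fix a utility function $u$ and define the distribution $q(x) = 2^{u(x)}/Z$ with $Z = \sum_x 2^{u(x)}$. Then for any distribution $p$ over outcomes,

$$
\mathbb{E}_p[u(X)] \;=\; \mathbb{E}_p[\log_2 q(X)] + \log_2 Z \;=\; -\underbrace{\mathbb{E}_p\!\left[-\log_2 q(X)\right]}_{\text{expected code length under } q} + \text{const},
$$

so steering the world (choosing $p$) to maximize expected utility is the same as steering it to minimize expected description length under the code associated with $q$.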

Deducing Impact

So secret that even a spoiler tag wasn't good enough.

Commentary on AGI Safety from First Principles

Promoted to curated: This is a long and dense post, but I really liked it, and find this kind of commentary from a large variety of thinkers in the AI Alignment space quite useful. I found that it really helped me think about the implications of a lot of the topics discussed in the main sequence in much more detail, and in a much more robust way, and I have come back to this post multiple times since it was published.

Also, of course, the whole original sequence is great, and I think it's currently the best short introduction to AI risk out there.

Understanding “Deep Double Descent”

I agree with this, and was indeed kind of thinking of them as one post together.

Six AI Risk/Strategy Ideas

I have now linked at least 10 times to the "'Generate evidence of difficulty' as a research purpose" section of this post. It was a thing that I kind of wanted to point to before this post came out but felt confused about, and this post finally gave me a pointer to it.

I think that section was substantially more novel and valuable to me than the rest of this post, but that is also evidence that others might not have had some of the other ideas on their map, and so they might have found the post similarly valuable because of a different section.

Utility ≠ Reward

I think this post and the Gradient Hacking post caused me to actually understand and feel able to productively engage with the idea of inner-optimizers. I think the paper and full sequence were good, but I bounced off of them a few times, and this helped me get traction on the core ideas in the space.

I also think that some parts of this essay hold up better as a core abstraction than the actual mesa-optimizer paper itself, though I am not at all confident about this. But I just noticed that when I am internally thinking through alignment problems relate... (read more)

2 · Ben Pace · 9mo: For another datapoint, I'll mention that I didn't read this post nor Gradient Hacking at the time, I read the sequence, and I found that to be pretty enlightening and quite readable.
Gradient hacking

adamshimi says almost everything I wanted to say in my review, so I am very glad he made the points he did, and I would love for both his review and the top level post to be included in the book. 

The key thing I want to emphasize a bit more is that I think the post as given is very abstract, and I have personally gotten a lot of value out of trying to think of more concrete scenarios where gradient hacking can occur. 

I think one of the weakest aspects of the post is that it starts with the assumption that an AI system has already given rise to an... (read more)

4 · Adam Shimi · 9mo: As I said elsewhere, I'm glad that my review captured points you deem important! I agree that gradient hacking isn't limited to inner optimizers; yet I don't think that defining it that way in the post was necessarily a bad idea. First, it's for coherence with Risks from Learned Optimization. Second, assuming some internal structure definitely helps with conceptualizing the kind of things that count as gradient hacking. With an inner optimizer, you can say relatively unambiguously "it tries to protect its mesa-objective", as there should be an explicit representation of it. That becomes harder without the inner optimization hypothesis. That being said, I am definitely focusing on gradient hacking as an issue with learned goal-directed systems instead of learned optimizers. This is one case where I have argued [https://www.alignmentforum.org/posts/q9BmNh35xgXPRgJhm/why-you-should-care-about-goal-directedness#Mesa_Optimization] that a definition of goal-directedness would allow us to remove the explicit optimization hypothesis without sacrificing the clarity it brought. Two thoughts about that:

  * Even if some subnetwork basically captures SGD (or the relevant training process), I'm unconvinced that it would be useful in the beginning, and so it might be "written over" by the updates.
  * Related to the previous point, it looks crucial to understand what is needed in addition to a model of SGD in order to gradient hack. Which brings me to your next point.

I'm confused about what you mean here. If the point is to make the network a local minimum, you probably just have to make it very brittle to any change. I'm also not sure what you mean by competing networks. I assumed it meant the neighboring models in model space, which are reachable by reasonable gradients. If that's the case, then I think my example is simpler and doesn't need the SGD modelling. If not, then I would appreciate more detailed explanations. Why is that supposed to be a good thing? Sure
4 · Ofer Givoli · 9mo: I think the part in bold should instead be something like "failing hard if SGD would (not) update weights in such and such way". (SGD is a local search algorithm; it gradually improves a single network.) As I already argued in another thread [https://www.lesswrong.com/posts/uXH4r6MmKPedk8rMA/gradient-hacking?commentId=cEWs5CCy8f6ZzYimP], the idea is not that SGD creates the gradient hacking logic specifically (in case this is what you had in mind here). As an analogy, consider a human who decides to 1-box in Newcomb's problem (which is related to the idea of gradient hacking, because the human decides to 1-box in order to have the property of "being a person who 1-boxes", because having that property is instrumentally useful). The specific strategy of 1-boxing is not selected for by human evolution, but rather general problem-solving capabilities were (and those capabilities resulted in the human coming up with the 1-box strategy).
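
To make the "brittleness" idea in the discussion above a bit more concrete, here is a deliberately hand-wired toy sketch. It is my own construction (names like BrittleNet and the penalty coefficient are made up), it is not the learned phenomenon the post describes, and nothing here shows that such a structure would actually arise from training; it only illustrates mechanically what "failing hard whenever SGD moves a protected block of weights" looks like.

```python
# Hand-wired illustration only: a loss term that punishes any drift of a
# "protected" weight block, so SGD leaves that block essentially untouched
# while freely training the rest of the network. In real gradient hacking
# this brittleness would have to be produced by training itself.
import torch
import torch.nn as nn
import torch.nn.functional as F

class BrittleNet(nn.Module):  # hypothetical name, not from the post
    def __init__(self):
        super().__init__()
        self.protected = nn.Linear(4, 4)  # stand-in for the "hacker" subcircuit
        self.rest = nn.Linear(4, 1)
        # Snapshot the protected weights; deviations are penalized below.
        self.register_buffer("reference", self.protected.weight.detach().clone())

    def forward(self, x):
        h = torch.tanh(self.protected(x))
        # Penalty that grows quickly if the protected weights drift from the
        # snapshot -- the hand-coded analogue of "very brittle to any change".
        drift = (self.protected.weight - self.reference).pow(2).sum()
        return self.rest(h), 10.0 * drift

model = BrittleNet()
opt = torch.optim.SGD(model.parameters(), lr=0.01)
x, y = torch.randn(8, 4), torch.randn(8, 1)

for _ in range(200):
    opt.zero_grad()
    out, brittleness = model(x)
    loss = F.mse_loss(out, y) + brittleness
    loss.backward()
    opt.step()

# The protected block stays close to its snapshot; `rest` does the adapting.
print((model.protected.weight - model.reference).abs().max().item())
```

Running this, the printed drift stays small while the downstream layer absorbs most of the fit, which is the shape of the worry: the training signal is kept away from the part of the network you would most want it to change.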
Eight claims about multi-agent AGI safety

I found this quite compelling. I don't think I am sold on some of the things yet (in particular claims 5 and 6), but thanks a lot for writing this up this clearly. I will definitely take some time to think more about this.

Reframing Superintelligence: Comprehensive AI Services as General Intelligence

I think the CAIS framing that Eric Drexler proposed gave concrete shape to a set of intuitions that many people have been relying on for their thinking about AGI. I also tend to think that those intuitions and models aren't actually very good at modeling AGI, but I nevertheless think it productively moved the discourse forward a good bit. 

In particular I am very grateful about the comment thread between Wei Dai and Rohin, which really helped me engage with the CAIS ideas, and I think were necessary to get me to my current understanding of CAIS and to ... (read more)

2020 AI Alignment Literature Review and Charity Comparison

Promoted to curated: Even if God and Santa Claus are not real, we do experience a Christmas miracle every year in the form of these amazingly thorough reviews by Larks. Thank you for your amazing work, as this continues to be an invaluable resource to anyone trying to navigate the AI Alignment landscape, whether as a researcher, grantmaker or independent thinker.

TAI Safety Bibliographic Database

Unfortunately, they are only sporadically updated and difficult to consume using automated tools.  We encourage organizations to start releasing machine-readable bibliographies to make our lives easier.

Oh, interesting. Would it be helpful to have something on the AI Alignment Forum in the form of some kind of more machine-readable citation system, or did you find the current setup sufficient?

Also, thank you for doing this!

1 · DanielFilan · 10mo: Note that individual researchers will sometimes put up bibtex files of all their publications, but I think it's rarer for organizations to do this.
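
For concreteness, here is one possible shape a machine-readable bibliography could take, together with the kind of trivial automated query it enables. This is purely my own illustrative sketch; the JSON format and field names are made up, not something any of the organizations mentioned currently publish.

```python
# Illustrative only: a made-up JSON bibliography format and a toy consumer.
import json

bibliography_json = """
[
  {"title": "Risks from Learned Optimization in Advanced Machine Learning Systems",
   "authors": ["Hubinger", "van Merwijk", "Mikulik", "Skalse", "Garrabrant"],
   "year": 2019,
   "url": "https://arxiv.org/abs/1906.01820"},
  {"title": "AI Safety via Debate",
   "authors": ["Irving", "Christiano", "Amodei"],
   "year": 2018,
   "url": "https://arxiv.org/abs/1805.00899"}
]
"""

entries = json.loads(bibliography_json)

# A trivial automated query: list everything published in 2019.
for entry in entries:
    if entry["year"] == 2019:
        print(f'{entry["title"]} ({", ".join(entry["authors"])})')
```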
Soft takeoff can still lead to decisive strategic advantage

Yep, you can revise it any time before we actually publish the book, though ideally you can revise it before the vote so people can be compelled by your amazing updates!

Evolution of Modularity

Coming back to this post, I have some thoughts related to it that connect this more directly to AI Alignment that I want to write up, and that I think make this post more important than I initially thought. Hence nominating it for the review. 

2 · johnswentworth · 10mo: I'm curious to hear these thoughts.
Utility ≠ Reward

I think of Utility != Reward as probably the most important core point from the Mesa-Optimizer paper, and I preferred this explanation over the one in the paper (though it leaves out many things, and I wouldn't want it to be the only thing someone reads on the topic).

The Credit Assignment Problem

Most of my points from my curation notice still hold. And two years later, I am still thinking a lot about credit assignment as a perspective on many problems I am thinking about. 

This seems like one I would significantly re-write for the book if it made it that far. I feel like it got nominated for the introductory material, which I wrote quickly in order to get to the "main point" (the gradient gap). A better version would have discussed credit assignment algorithms more.

Why Subagents?

This post felt like it took a problem that I was thinking about from 3 different perspectives and combined them in a way that felt pretty coherent, though I am not fully sure how right it gets it. Concretely, the 3 domains I felt it touched on were: 

  1. How much can you model human minds as consisting of subagents?
  2. How much can problems with coherence theorems be addressed by modeling things as subagents? 
  3. How much will AI systems behave as if they consist of multiple subagents?

All three of these feel pretty important to me.

Gradient hacking

Gradient hacking seems important and I really didn't think of this as a concrete consideration until this post came out. 

Six AI Risk/Strategy Ideas

I've referred specifically to the section on "Generate evidence of difficulty" as a research purpose many times since this post has come out, and while I have disagreements with it, I do really like it as a handle for a consideration that I hadn't previously seen written up, and does strike me as quite important.

Strategic implications of AIs' ability to coordinate at low cost, for example by merging

While I think this post isn't the best writeup of this topic I can imagine, I think it makes a really important point quite succinctly, and is one that I have brought up many times in arguments around takeoff speeds and risk scenarios since this post came out.

But exactly how complex and fragile?

In talking to many people about AI Alignment over the years, I've repeatedly found that a surprisingly large generator of disagreement about risk scenarios was disagreement about the fragility of human values. 

I think this post should be reviewed for its excellent comment section at least as much as for the original post, and also think that this post is a pretty central example of the kind of post I would like to see more of.

Soft takeoff can still lead to decisive strategic advantage

As Robby said, this post isn't perfect, but it felt like it opened up a conversation on LessWrong that I think is really crucial, and was followed up in a substantial number of further posts by Daniel Kokotajlo that I have found really useful. Many of those were written in 2020, but of the ones written in 2019, this strikes me as the one I remember most. 

1 · Daniel Kokotajlo · 10mo: Thanks. I agree it isn't perfect... in the event that it gets chosen, I'd get to revise it, right? I never did get around to posting the better version I promised.
Draft report on AI timelines

I am organizing a reading group for this report next Tuesday in case you (or anyone else) wants to show up: 

https://www.lesswrong.com/posts/mMGzkX3Acb5WdFEaY/event-ajeya-s-timeline-report-reading-group-1-nov-17-6-30pm

Dutch-Booking CDT: Revised Argument

I... think this post was impacted by a bug in the LW API that GreaterWrong ran into, which meant it wasn't visible on the frontpage when it was published. It nevertheless appears to have gotten some amount of engagement, but maybe that was all from direct links? 

Given the substantial chance that a number of people have never seen this post, I reposted it. Its original publishing date was the 11th of June.

2 · Abram Demski · 1y: Ah, now the fact that I forgot to include an illustration (which I had drawn while writing the post) until months later feels like less of a waste! :)
AGI safety from first principles: Introduction

Promoted to curated: I really enjoyed reading through this sequence. I have some disagreements with it, but overall it's one of the best plain language introductions to AI safety that I've seen, and I expect I will link to this as a good introduction many times in the future. I was also particularly happy with how the sequence bridged and synthesized a number of different perspectives that usually feel in conflict with each other.

1 · Ben Pace · 1y: Critch recently made the argument (and wrote it in his ARCHES [http://acritch.com/papers/arches.pdf] paper, summarized by Rohin here [https://www.lesswrong.com/posts/gToGqwS9z2QFvwJ7b/an-103-arches-an-agenda-for-existential-safety-and-combining]) that "AI safety" is a straightforwardly misleading name, because "safety" is a broader category than is being talked about in (for example) this sequence – it includes things like not making self-driving cars crash. (To quote directly: "the term “AI safety” should encompass research on any safety issue arising from the use of AI systems, whether the application or its impact is small or large in scope".) I wanted to raise the idea here and ask Richard what he thinks about renaming it to something like "AI existential safety from first principles" or "AI as an extinction risk from first principles" or "AI alignment from first principles".
My computational framework for the brain

Promoted to curated: This kind of thinking seems both very important and extremely difficult. I do think that trying to understand the underlying computational structure of the brain is quite useful both for thinking about Rationality and for thinking about AI and AI Alignment, though it's also plausible to me that getting things right in this space is hard enough that, in the end, it's very hard to extract useful lessons from it. 

Despite the difficulties I expect in this space, this post strikes me as overall pretty decent, and it at the very least opens up a number of interesting questions that one could ask to further deconfuse oneself on this topic. 

My Understanding of Paul Christiano's Iterated Amplification AI Safety Research Agenda

Promoted to curated! I held off on curating this post for a while, first because it's long and it took me a while to read through it, and second because we already had a lot of AI Alignment posts in the curation pipeline, and I wanted to make sure we have some diversity in our curation decisions. But overall, I really liked this post, and also want to mirror Rohin's comment in that I found this version more useful than the version where you got everything right, because this way I got to see the contrast between your interpretation and Paul's responses, which feels like it helped me locate the right hypothesis more effectively than either would have on its own (even if more fleshed out). 

Comparing Utilities

Yep, fixed. Thank you!

Judging from the URLs of those links, those images were hosted on a domain that you could access but others could not: they were stored as Gmail image attachments, to which you as the recipient of course have access, but random LessWrong users do not. 

Comparing Utilities

Oh no! The two images starting from this point are broken for me: 

2 · Abram Demski · 1y: How about now?
2 · Abram Demski · 1y: Weird, given that they still look fine for me! I'll try to fix...
Updates and additions to "Embedded Agency"

Promoted to curated: These additions are really great, and they fill in a lot of the most confusing parts of the original Embedded Agency sequence, which was already one of my favorite pieces of content on all of LessWrong. So it seems fitting to curate this update to it, which improves it even further. 

Radical Probabilism

Promoted to curated: This post is answering (of course not fully, but in parts) what seems to me one of the most important open questions in theoretical rationality, and I think does so in a really thorough and engaging way. It also draws connections to a substantial number of other parts of your and Scott's work in a way that has helped me understand those much more thoroughly. 

I am really excited about this post. I kind of wish I could curate it two or three times because I do really want a lot of people to have read this, and expect that it will change how I think about a substantial number of topics.

4 · Abram Demski · 1y: I'll try to write some good follow-up posts for you to also curate ;3
Looking for adversarial collaborators to test our Debate protocol

This sounds fun! I probably won't have enough time to participate, but I do wish I had enough time.

Will OpenAI's work unintentionally increase existential risks related to AI?

I much prefer Rohin's alternative version of: "Are OpenAI's efforts to reduce existential risk counterproductive?". The current version does feel like it screens off substantial portions of the potential risk.

2 · Rohin Shah · 1y: Example? I legitimately struggle to imagine something covered by "Are OpenAI's efforts to reduce existential risk counterproductive?" but not by "Will OpenAI's work unintentionally increase existential risks related to AI?"; if anything it seems the latter covers more than the former.
Are we in an AI overhang?

Promoted to curated: I think the question of whether we are in an AI overhang is pretty obviously relevant to a lot of thinking about AI Risk, and this post covers the topic quite well. I particularly liked the use of a lot of small Fermi estimates, and how the post covered a lot of ground in relatively little writing. 

I also really appreciated the discussion in the comments, and felt that Gwern's comment on AI development strategies in particular helped me build a much better map of the modern ML space (though I wouldn't want it to be interpreted as a complete map o... (read more)

Why is pseudo-alignment "worse" than other ways ML can fail to generalize?

(Really minor formatting nitpick, but it's the kind of thing that really trips me up while reading: you forgot a closing parenthesis somewhere in your comment.)

2 · Rohin Shah · 1y: Fixed, thanks.
[AN #107]: The convergent instrumental subgoals of goal-directed agents

Mod Note: Because of a bug in our RSS Import script, this post didn't import properly last week, which is why you are seeing it now. Sorry for the confusion and delay!

Specification gaming: the flip side of AI ingenuity

Note: This post was originally posted to the DeepMind blog, so presumably the target audience is a broader audience of Machine Learning researchers and people in that broad orbit. I pinged Vika about crossposting it because it also seemed like a good reference post that I expected would get linked to a bunch more frequently if it was available on LessWrong and the AIAF. 

Problem relaxation as a tactic

Promoted to curated: This is a technique I've seen mentioned in a bunch of places, but I hadn't seen a good writeup of it, and I found this one quite valuable to read. 

Writeup: Progress on AI Safety via Debate

Promoted to curated: I've been thinking about this post a lot since it has come out, and it is just a really good presentation of all the research on AI Alignment via debate. It is quite long, which has made me hesitant about curating it for a while, but I now think curating it is the right choice. I also think while it is reasonably technical, it's approachable enough that the basic idea of it should be understandable by anyone giving it a serious try.

Thinking About Filtered Evidence Is (Very!) Hard

Promoted to curated: This post is a bit more technical than the usual posts we curate, but I think it is still quite valuable to read for a lot of people, since it's about a topic that has already received some less formal treatment on LessWrong. 

I am also very broadly excited about trying to move beyond a naive Bayesianism paradigm, and felt like this post helped me significantly in understanding what that would look like. 

Cortés, Pizarro, and Afonso as Precedents for Takeover

Promoted to curated: I really like this post. I already linked to it two times, and it clearly grounds some examples that I've seen people use informally in AI Alignment discussions in a way that will hopefully create a common reference, and allow us to analyze this kind of argument in much more detail. 

1 · Matthew Barnett · 2y: Do you have any thoughts on the critique [https://www.lesswrong.com/posts/ivpKSjM4D6FbqF4pZ/cortes-pizarro-and-afonso-as-precedents-for-takeover?commentId=kNFNjBJjuzTd3irrR] I just posted?
Reframing Impact

Promoted to curated: I really liked this sequence. I think in many ways it has helped me think about AI Alignment from a new perspective, and I really like the illustrations and the way it was written, and how it actively helped me think about the problems along the way, instead of just passively telling me solutions.

Now that the sequence is complete, it seemed like a good time to curate the first post in the sequence. 
