
A low-hanging fruit for solving alignment is to dedicate a chunk of time to actually trying to solve a sub-problem collectively.

To that end, I’ve broken up researching the sub-problem of corrigibility into two categories in this Google Doc (you have suggestion privileges):

  1. Previous Work: let’s not reinvent the wheel. Add links to any past work on corrigibility; this can range from bare links to links with summaries and analyses. Do comment reactions to others' reviews to provide counter-arguments. This is just a Google Doc, so low-quality posts, comments, and links are accepted; I want people to lean towards babbling more.
  2. Tasks: what do we actually do this week to make progress? Suggest any research direction you find fruitful, or general research questions or framings. Example: write an example of corrigibility (one could then comment an actual example).

Additionally, I’ll post 3 top-level comments for:

  1. Meetups: want to co-work with others in the community? Comment your availability, work preferences, and a way to contact you (e.g. a Calendly link, “dm me”, “my email is bob at alice dot com”, etc.). For example, I’m available most times this week, with a Calendly link for scheduling 1-on-1 co-working sessions. Additionally, you could message people you know to collaborate on this, or have a nerdy house co-working party.
  2. Potential topics: what other topics besides corrigibility could we collaborate on in future weeks?
  3. Meta: what different formats could this type of group collaboration take? Comment suggestions with trade-offs, or discuss the costs/benefits of what I’m presenting in this post.

I do believe there’s a legitimate, albeit small, chance that we solve corrigibility or find its “core” this week. Nonetheless, I think it’s of great value to be able to make actual progress on alignment issues as a community and to figure out how to do that better. Additionally, it’s immensely valuable to have a post on an alignment topic that includes a literature review, the community's up-to-date thoughts, and possible future research directions to pursue. I also believe a collaborative project like this will put several community members on the same page as far as terminology and gears-level models go.

I explicitly commit to 3 weeks of this (corrigibility this week and two more topics over the next two weeks). After that come Christmas and New Year's, after which I may resume depending on how it goes.

Thanks to Alex Turner for reviewing a draft.
 


I don't feel like joining this, but I do wish you luck, and I'll make a high-level observation about methodology.

I do believe there’s a legitimate, albeit small, chance that we solve corrigibility or find its “core” this week. Nonetheless, I think it’s of great value to be able to make actual progress on alignment issues as a community and to figure out how to do that better.

I don't consider myself to be a rationalist or EA, but I do post on this website, so I guess this makes me part of the community of people who post on this site. My high-level observation on solving corrigibility is this: the community of people who post on this site has absolutely no mechanism for agreeing among themselves whether a problem has been solved.

This is what you get when a site is in part a philosophy-themed website/forum/blogging platform. In philosophy, problems are never solved to the satisfaction of the community of all philosophers. This is not necessarily a bad thing. But it does imply that you should not expect that this community will ever be willing to agree that corrigibility, or any other alignment problem, has been solved.

In business, there is the useful terminology that certain meetings will be run as 'decision making meetings', e.g. to make a go/no-go decision on launching a certain product design, even though a degree of uncertainty remains. Other meetings are exploratory meetings only, and are labelled as such. This forum is not a decision making forum.

But it does imply that you should not expect that this community will ever be willing to agree that corrigibility, or any other alignment problem, has been solved.

Noting that I strongly disagree but don't have time to type out arguments right now, sorry. May or may not type out later.

I think we're pretty good at avoiding semantic arguments. The word "corrigible" can (and does) mean different things to different people on this site. Becoming explicit about what different properties you mean and which metrics they score well on resolves the disagreement. We can taboo the word corrigible.

This has actually already happened in the document with corrigible either meaning:

  1. Correctable all the time, regardless of circumstances
  2. Correctable up until the point where the agent actually knows how to achieve your values better than you (related to intent alignment and coherent extrapolated volition).

Then we can think "assuming corrigible-definition-1, then yes, this is a solution".  

I don't see a benefit to the exploratory/decision making forum distinction when you can just do the above, but maybe I'm missing something?

Becoming explicit about what different properties you mean and which metrics they score well on resolves the disagreement.

Indeed this can resolve disagreement among a small sub-group of active participants. This is an important tool if you want to make any progress.

but maybe I'm missing something?

The point I was trying to make is about what is achievable for the entire community, not what is achievable for a small sub-group of committed participants. The community of people who post on this site has absolutely no mechanism for agreeing among themselves whether a problem has been solved, or whether some sub-group has made meaningful progress on it.

To make the same point in another way: the forces which introduce disagreeing viewpoints and linguistic entropy to this forum are stronger than the forces that push towards agreement and clarity.

My thinking about how strong these forces are has been updated recently, by the posting of a whole sequence of Yudkowsky conversations and also this one. In these discussion logs, Yudkowsky goes to full Great more-epistemic-than-thou Philosopher mode, Confidently Predicting AGI Doom while Confidently Dismissing Everybody's AGI Alignment Research Results. Painful to read.

I am way past Denial and Bargaining, I have Accepted that this site is a community of philosophers.

The linguistic entropy point is countered by my previous point, right? Unless you want to say not everyone who posts in this community is capable of doing that? Or can naturally do that?

In these discussion logs, Yudkowsky goes to full Great more-epistemic-than-thou Philosopher mode, Confidently Predicting AGI Doom while Confidently Dismissing Everybody's AGI Alignment Research Results. Painful to read.

Hahaha, yes. Yudkowsky can easily be interpreted as condescending and annoying in those dialogues (and he could've done a better job at not coming across that way). Though I believe the majority of the comments were in the spirit of understanding and coming to an agreement. Adam Shimi is also working on a post describing the disagreements in the dialogue as different epistemic strategies, meaning the cause of disagreement is non-obvious. Alignment is pre-paradigmatic, so agreeing is more difficult compared to communities that have clear questions and metrics to measure them on. I still think we succeed at the harder problem.

I am way past Denial and Bargaining, I have Accepted that this site is a community of philosophers.

By "community of philosophers", you mean noone makes any actual progress on anything (or can agree that progress is being made)? 

  • I believe Alex Turner has made progress on formalizing impact and power-seeking, and I'm not aware of parts of the community arguing this isn't progress at all (though I don't read every comment).
  • I also believe Vanessa's and Diffractor's Infra-Bayesianism is progress on thinking about probabilities, and I am unaware of parts of the community arguing this isn't progress (though there is a high mathematical bar to clear before you can understand it enough to criticize it).
  • I also believe Evan Hubinger et al.'s work on mesa-optimizers is quite clearly progress on crisply stating an alignment issue, and the community has largely agreed that it is.

Do you disagree on these examples or disagree that they prove the community makes progress and agrees that progress is being made?

Yes, by calling this site a "community of philosophers", I roughly mean that at the level of the entire community, nobody can agree that progress is being made. There is no mechanism for creating a community-wide agreement that a problem has been solved.

You give three specific examples of progress above. From his recent writings, it is clear that Yudkowsky does not believe, like you do, that any contributions posted on this site in the last few years have made any meaningful progress towards solving alignment. You and I may agree that some or all of the above three examples represent some form of progress, but you and I are not the entire community here, Yudkowsky is also part of it.

On the last one of your three examples, I feel that 'mesa optimizers' is another regrettable example of the forces of linguistic entropy overwhelming any attempts at developing crisply stated definitions which are then accepted and leveraged by the entire community. It is not like the people posting on this site are incapable of using the tools needed to crisply define things, the problem is that many do not seem very interested in ever using other people's definitions or models as a frame of reference. They'd rather free-associate on the term, and then develop their own strongly held beliefs of what it is all supposed to be about.

I am sensing from your comments that you believe that, with more hard work and further progress on understanding alignment, it will in theory be possible to make this community agree, in future, that certain alignment problems have been solved. I, on the other hand, do not believe that it is possible to ever reach that state of agreement in this community, because the debating rules of philosophy apply here.

Philosophers are always allowed to disagree based on strongly held intuitive beliefs that they cannot be expected to explain any further. The type of agreement you seek is only possible in a sub-community which is willing to use more strict rules of debate.

This has implications for policy-related alignment work. If you want to make a policy proposal that has a chance of being accepted, it is generally required that you can point to some community of subject matter experts who agree on the coherence and effectiveness of your proposal. LW/AF cannot serve as such a community of experts.

Potential topics: what other topics besides corrigibility could we collaborate on in future weeks? Also, are we able to poll users for topics on the site?

  • Timelines and forecasting 
  • Goodhart’s law
  • Power-seeking
  • Human values
  • Learning from human feedback
  • Pivotal actions
  • Bootstrapping alignment 
  • Embedded agency 
  • Primer on language models, reinforcement learning, or machine learning basics 
    • This one's not really on-topic, but I do see value in a more “getting up to date” focus where experts can give talks or references for learning things (e.g. “here's a tutorial for implementing a small GPT-2”). I could just periodically ask LW questions on whatever topic interests me at the moment, or do my own Google search, but I feel there's some community value that would be lost. Learning and teaching together makes it easier for the community to coordinate in the future, plus the bonus of connections.

Meta: what different formats could this type of group collaboration take? Comment suggestions with trade-offs, or discuss the costs/benefits of what I’m presenting in this post.

  • Google Docs is kind of weird because I have to trust people won't spam suggestions, and I may need to keep up with approving suggestions on a consistent basis. I would want this hosted on LW/Alignment Forum, but I do really like the in-line commenting and the feeling that there's less of a quality bar to meet. I'm unsure if this is just me.
  • Walled Garden group discussion block time: have a block of ~4-16 hours using Walled Garden software. There could be a flexible schedule with Schelling points to coordinate meeting up. For example, if someone wants to give a talk on a specific corrigibility research direction and get live feedback/discussion, they can schedule a time to do so.
  • Breaking up the task comment. Technically, the literature review, summaries, and extra thoughts are all “tasks” to do. I do want broken-down tasks that many people could do, though what may end up happening is that whoever wants a specific task done ends up doing it themselves. We could also have “possible research directions” as a high-level comment.

Meetups: want to co-work with others in the community? Comment your availability, work preferences, and a way to contact you (e.g. a Calendly link, “dm me”, “my email is bob at alice dot com”, etc.).

Availability: Almost all times between 10 AM and PM, California time, regardless of day. Highly flexible hours. Text over voice is preferred; I'm easiest to reach on Discord. The LW Walled Garden can also be nice.

Update: I am available this week until Saturday evening at this calendly link (though I will close the openings if a large number of people sign up). I am available all of Saturday, Dec 4th (the calendly link will show times in your time zone). We can read and discuss posts, do tasks together, or whatever you want. Previous one-on-one conversations with members of the community have gone really well. There's not a person here I haven't enjoyed getting to know, so do feel free to click that link and book a time!

I've got a slightly terrifying hail mary "solve alignment with this one weird trick"-style paradigm I've been mulling over for the past few years, which seems like it has the potential to solve corrigibility and a few other major problems (notably value loading without Goodharting, using an alternative to CEV which seems drastically easier to specify). There are a handful of challenging things needed to make it work, but they look to me maybe more achievable than the other proposals I've read which seem like they could scale to superintelligence.

Realistically I am not going to publish it anytime soon given my track record, but I'd be happy to have a call with anyone who'd like to poke my models and try and turn it into something. I've had mildly positive responses from explaining it to Stuart Armstrong and Rob Miles, and everyone else I've talked to about it at least thought it was creative and interesting.

I've updated my meeting times to meet more this week if you'd like to sign up for a slot (link w/ a pun), and from his comment, I'm sure Diffractor would also be open to meeting.

I will point out a confusion in terms that I noticed in myself: corrigibility meaning either "always correctable" or "something like CEV". Though we can talk that over on a call too :)