A survey of tool use and workflows in alignment research

Logan Riggs; Jan; janus; jacquesthibs

TL;DR: We are building language-model-powered tools to augment alignment researchers and accelerate alignment progress. We would like your feedback on which tools would be most useful. We’ve created a short survey that can be filled out here.

We are a team from the current iteration of the AI Safety Camp and are planning to build a suite of tools to help AI Safety researchers.

We’re looking for feedback on what kinds of tools would be most helpful to you as an established or prospective alignment researcher. We’ve put together a short survey to get a better understanding of how researchers work on alignment. We plan to analyze the results and make them available to the community (appropriately anonymized). The survey is here. If you would also be interested in talking directly, please feel free to schedule a call here.

This project is similar in motivation to Ought’s Elicit, but more focused on human-in-the-loop workflows and tailored to alignment research. One example of a tool we could create would be a language model that intelligently condenses existing alignment research into summaries or expands rough outlines into drafts of full Alignment Forum posts. Another idea we’ve considered is a brainstorming tool that can generate new examples/counterexamples, new arguments/counterarguments, or new directions to explore.

In the long run, we’re interested in creating seriously empowering tools that fall under categorizations like STEM AI, Microscope AI, superhuman personal assistant AI, or plainly Oracle AI. These early tools are oriented towards more proof-of-concept work, but still aim to be immediately helpful to alignment researchers. Our prior that this is a promising direction is informed in part by our own very fruitful and interesting experiences using language models as writing and brainstorming aids.

One central danger of tools with the ability to increase research productivity is dual-use for capabilities research. Consequently, we’re planning to ensure that these tools will be specifically tailored to the AI Safety community and not to other scientific fields. We do not intend to publish the specific methods we use to create these tools.

We welcome any feedback, comments, or concerns about our direction. Also, if you'd like to contribute to the project, feel free to join us at the #accelerating-alignment channel in the EleutherAI Discord.

Thanks in advance!

I'm curious how well a model finetuned on the Alignment Newsletter performs at summarizing new content (probably blog posts; I'd assume papers are too long and rely too much on figures). My guess is that it doesn't work very well even for blog posts, which is why I haven't tried it yet, but I'd still be interested in the results and would love it on the off chance that it actually was good enough to save me some time.

We could definitely look into making the project evolve in this direction. In fact, we're building a dataset of alignment-related texts and a small part of the dataset includes a scrape of arXiv papers extracted from the Alignment Newsletter. We're working towards building GPT models fine-tuned on the texts.

Ya, I was even planning on trying:

[post/blog/paper] rohinmshah karma: 100 Planned summary for the Alignment Newsletter: \n>

Then feed that input to the model, followed by:

Planned opinion:

to see if that produces higher-quality summaries.

Well, one "correct" generalization there is to produce much longer summaries, which is not actually what we want.

(My actual prediction is that changing the karma makes very little difference to the summary that comes out.)
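For anyone who wants to try the karma-conditioning idea from this thread, here is a minimal sketch of how the prompts could be assembled. It is only an illustration: the function names, the exact line breaks in the header, and the `generate` callable (a stand-in for whatever fine-tuned model is used) are all assumptions, not something the project has published.

```python
from typing import Callable, Dict, Iterable


def build_summary_prompt(text: str, author: str = "rohinmshah", karma: int = 100) -> str:
    """Source text, then a metadata header, then the line that opens a summary.

    The line breaks between the three parts are an assumption; the thread
    only shows the header on a single line.
    """
    return (
        f"{text}\n"
        f"{author} karma: {karma}\n"
        "Planned summary for the Alignment Newsletter:\n>"
    )


def build_opinion_prompt(summary_prompt: str, summary: str) -> str:
    """Append the generated summary and cue the model for an opinion."""
    return f"{summary_prompt} {summary.strip()}\n\nPlanned opinion:"


def summaries_by_karma(
    text: str,
    generate: Callable[[str], str],  # hypothetical: prompt in, completion out
    karmas: Iterable[int] = (1, 100),
) -> Dict[int, str]:
    """Generate one summary per karma value so the outputs can be compared."""
    return {k: generate(build_summary_prompt(text, karma=k)) for k in karmas}
```

Running `summaries_by_karma` over a handful of posts and comparing lengths and content side by side would be one cheap way to check the prediction that the karma value makes little difference, or that it mostly changes summary length.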