Trying to break into MIRI-style research seems to be much, much harder than trying to break into ML-style safety research. This is worrying if you believe this research to be important. I'll examine two kinds of causes: those which come from MIRI-style research being a niche area and those which go beyond this:
I will be giving a talk on the physics of dynamism, with relevance to AI alignment, at a yet-to-be-determined location in Seattle. (If you have venue suggestions, please let me know.) I am hoping to get feedback on a new thread of research and am also eager to meet folks in Seattle. After the talk there will be some time to hang out with the group.
This post is heavily informed by prior work, most notably that of Owain Evans, Owen Cotton-Barratt and others (Truthful AI), Beth Barnes (Risks from AI persuasion), Paul Christiano (unpublished) and Dario Amodei (unpublished), but was written by me and is not necessarily endorsed by those people. I am also very grateful to Paul Christiano, Leo Gao, Beth Barnes, William Saunders, Owain Evans, Owen Cotton-Barratt, Holly Mandel and Daniel Ziegler for invaluable feedback.
In this post I propose working on building competitive, truthful language models, or truthful LMs for short. These are AI systems that are:
Such systems will likely be fine-tuned from large...
Followed by: What Multipolar Failure Looks Like, and Robust Agent-Agnostic Processes (RAAPs), which provides examples of multi-stakeholder/multi-agent interactions leading to extinction events.
This post is an overview of a variety of AI research areas in terms of how much I think contributing to and/or learning from those areas might help reduce AI x-risk. By research areas I mean “AI research topics that already have groups of people working on them and writing up their results”, as opposed to research “directions” in which I’d like to see these areas “move”.
I formed these views mostly in the course of writing AI Research Considerations for Human Existential Safety (ARCHES). My hope is that my assessments in this post can be helpful to students and established AI researchers who are thinking about shifting into...
TL;DR: An AI which is learning human values may act unethically or be catastrophically dangerous, as it does not yet understand those values.
The main idea is simple: a young AI which is trying to learn human values (which I will call a “value learner”) faces a “chicken and egg” problem. Such AIs must extract human values, but to do so safely, they should already know these values, or at least have some safety rules for value extraction. This idea has been analyzed before (more on those analyses below); here, I will examine different ways in which value learners may cause problems.
It may be expected that the process of a young AI learning human values will be akin to a nice conversation or perfect observation, but it could easily take...