I'm a Ph.D. student doing research on Natural Language Processing.
My research focuses on developing question-answering methods that generalize to harder questions than we have supervision for. Learning from human examples (supervised learning) won't scale to these kinds of questions, so I am investigating other paradigms that recursively break down harder questions into simpler ones (i.e., Debate and Iterated Amplification). Check out my website for more information about me/my research: http://ethanperez.net/
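For concreteness, here's a rough toy sketch of the kind of recursive decomposition I have in mind; `ask_model` and `decompose` are just hypothetical placeholders, not an actual implementation of Debate or Iterated Amplification:

```python
# Toy sketch (illustration only): recursively break a hard question into
# simpler sub-questions, answer those, then compose the sub-answers.
# `ask_model` and `decompose` are hypothetical stand-ins for a QA model
# and a decomposition model.

def ask_model(question: str) -> str:
    """Placeholder: answer a question directly with a base QA model."""
    return f"<answer to: {question}>"

def decompose(question: str) -> list[str]:
    """Placeholder: split a hard question into simpler sub-questions."""
    return []  # empty list means the question is simple enough to answer directly

def answer(question: str, depth: int = 0, max_depth: int = 3) -> str:
    """Answer a question, recursively answering its sub-questions first."""
    sub_questions = decompose(question) if depth < max_depth else []
    if not sub_questions:
        return ask_model(question)
    # Answer each sub-question, then compose the results into a final answer.
    sub_answers = [answer(q, depth + 1, max_depth) for q in sub_questions]
    context = " ".join(f"Q: {q} A: {a}" for q, a in zip(sub_questions, sub_answers))
    return ask_model(f"{question} (given: {context})")
```

The hope is that composed answers to simple sub-questions can supervise answers to questions harder than any we have direct labels for.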
I understand that deceptive models won't show signs of deception :) That's why I made the remark about models not showing signs of the prerequisites to scary kinds of deception. Unless you think there will be no signs of deception, or of any of its prerequisites, in any models before we get deceptive ones?
It also seems at least plausible that models will be imperfectly deceptive before they are perfectly deceptive, in which case we will see signs (e.g., in smaller models).
I'm curious why you believe that having products will be helpful. A few particular considerations I'd be interested to hear your take on:
Why did you decide to start a separate org rather than joining forces with an existing org? I'm especially curious since state-of-the-art models are time-consuming/compute-intensive/infra-intensive to develop, and other orgs with safety groups already have that infrastructure. Also, it seems helpful to have high communication bandwidth between people working on alignment, in a way that is impaired by having many different orgs (especially if the org plans to be non-disclosure by default). Curious to hear how you are thinking about these things!
How do you differ from Redwood?
Are you planning to be in-person, or will some folks work remotely? Other similar safety orgs don't seem that flexible with in-person requirements, so it'd be nice to have a place to do alignment work for those outside of {SF, London}.
What are people's timelines for deceptive alignment failures arising in models, relative to AI-based alignment research being useful?
Today's language models are on track to become quite useful, without showing signs of deceptive misalignment or its eyebrow-raising prerequisites (e.g., awareness of the training procedure), afaik. So my current best guess is that we'll be able to get useful alignment work out of superhuman, sub-deception agents for 5-10+ years or so. I'm very curious if others disagree here, though.
What are your thoughts on subfields of ML where research impact/quality depends a lot on having access to lots of compute?
In NLP, many people have the view that almost all of the high-impact work has come from industry over the past 3 years, and that the trend looks like it will continue indefinitely. Even safety-relevant work in NLP seems much easier to do with access to larger models with better capabilities (Debate/IDA are pretty hard to test without good language models). Thus, safety-minded NLP faculty might end up in a situation where none of their direct work is very impactful, but all of the expected impact comes from graduating students who go on to work at industry labs in particular. How would you think about this kind of situation?
Here is my Elicit Snapshot.
I'll follow the definition of AGI given in this Metaculus challenge, which roughly amounts to a single model that can "see, talk, act, and reason." My predicted distribution is a weighted sum of two component distributions described below:
My probability for Prosaic AGI is based on estimated probabilities of each of the 3 stages of development (described above) working:
P(Prosaic AGI) = P(Stage 1) x P(Stage 2) x P(Stage 3) = 3/4 x 2/3 x 1/2 = 1/4
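For concreteness, here's a minimal sketch of how those numbers combine into an overall timeline distribution; the two component distributions below are placeholder lognormals used purely for illustration (the actual components are in the Elicit snapshot):

```python
# Illustrative sketch only: combine the stage probabilities into P(Prosaic AGI)
# and mix two placeholder timeline distributions by that weight.
import numpy as np

p_stages = [3/4, 2/3, 1/2]            # P(Stage 1), P(Stage 2), P(Stage 3) from above
p_prosaic = float(np.prod(p_stages))  # = 1/4

rng = np.random.default_rng(0)
n = 100_000

# Placeholder component distributions over "years until AGI" (not the actual components)
prosaic_years = rng.lognormal(mean=np.log(15), sigma=0.5, size=n)
other_years = rng.lognormal(mean=np.log(40), sigma=0.7, size=n)

# Weighted sum of the two components: sample from the prosaic component
# with probability p_prosaic, otherwise from the other component.
from_prosaic = rng.random(n) < p_prosaic
years = np.where(from_prosaic, prosaic_years, other_years)

print(f"P(Prosaic AGI) = {p_prosaic:.2f}")
print(f"Illustrative mixture median: {np.median(years):.0f} years")
```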
------------------
Updates/Clarification after some feedback from Adam Gleave:
What do you (or others) think is the most promising way to use language models to help with alignment in the near term? A couple of possible ideas: