This post is a brief retrospective on the last 8 months at Conjecture that summarizes what we have done, our assessment of how useful this has been, and the updates we are making.
Conjecture formed in March 2022 with 3 founders and 5 early employees. We spent our first months growing the team, building infrastructure, exploring different research agendas, running Refine, publishing our internal infohazard policy, establishing an operational foundation for the business, and raising investments.
It’s been intense! For many of us at Conjecture, the last eight months have been the hardest we’ve worked in our lives. Working on such an immensely difficult problem as alignment alongside a team of brilliant and driven colleagues is, to say the least, galvanizing.
In some ways, this makes it difficult to step back and critically reflect on our work. It is easy to mistakenly measure progress by effort, and the last thing you want to hear after maxing out effort is that it wasn’t good enough.
However, reality does not grade on a curve. We need to advance significantly faster than traditional science in order to solve alignment on short timelines.
By this standard, the sober reflection is that most of our efforts to date have not made meaningful progress on the alignment problem. Our research has not revealed new methods that make neural networks more interpretable or resolve inner or outer alignment problems, and our coordination efforts have not slowed the pace at which AI capabilities are advancing compared to safety. When measured against p(Doom), our efforts haven’t cut it.
That’s not to say this work has been useless. We have learned a lot about where we went wrong, and made a number of changes that put us in a better position to make progress than we were in March. Measuring ourselves against a high standard enables us to constantly improve and be realistic about the difficulty of the problem ahead of us.
The reason we are writing this reflection is to calibrate ourselves. We do not want to be seen as cutting alignment if we are not. What matters is that we ground ourselves in reality and make public as many of our efforts (and mistakes!) as possible in order to gather feedback and update quickly.
What we have done and how useful we think it is
We have built our own infrastructure to deploy large language models and do bespoke interpretability research. Our small engineering team has developed an impressive tech stack that is comparable to (and in some areas exceeds) those built by many large industry research labs. While this has set us up to conduct research and develop tools/products more efficiently, it is only instrumental to alignment and not progress in and of itself.
Our interpretability team explored a new direction in mechanistic interpretability in an effort to better understand polysemanticity in neural networks. The resulting paper identifies polytopes, rather than neurons, as a potentially fundamental unit of neural networks, and finds that polysemanticity is reduced at the polytope level.
While the work brings a new perspective on neural network representations, a significant issue is that there are no clear implications of how to use this framework to better interpret neural networks. When measuring progress in interpretability, the clearest signal comes from new affordances: concrete things we can do differently now that we’ve made a research breakthrough. While there’s a chance that polytopes research may bring future affordances closer, the current, practical utility of polytopes is negligible. We also overinvested in iterating on feedback and polishing this project, and think we could have shipped results here more quickly.
We also published a post concerning the ways that a capable AI might circumvent future interpretability methods even if research goes well. This post gave more thorough discussion to a problem that had previously only received sporadic attention. However, it faces a similar critique to polytopes in that it addresses conceptual, not technical, problems.
(For those curious, we recently shared a post on current themes in mechanistic interpretability, which aims to create common knowledge by providing an overview of the field based on our conversations from the past few months.)
We also explored a few research projects aimed at better understanding modern DL models trained with self-supervised learning. Our most visible contribution here was the Simulators post about a new theoretical frame to understand GPT-like models. The post was popular and has hopefully helped sharpen people's intuitions on the subject.
However, more experienced alignment researchers who have already developed their own deep intuitions about GPT-like models didn’t find the framing helpful. This is another area where we overinvested, with months of work leading to a single main output, rather than focusing research sprints on revealing new bits of evidence or cutting through core cruxes in research disagreements. Later posts in this sequence may cut more at alignment, such as the recent post on RLHF mode collapse (note the update that InstructGPT is not RLHF). But overall, this line of research made much slower progress than expected.
A third category of research we explored is epistemology, with a few notable posts that point to why alignment is hard, why we need many different angles of approach, and how to make epistemology useful. This work has helped clarify much of Conjecture’s strategy and plans for approaching future alignment research, and we give full credit to our epistemology team for tackling a set of philosophical questions few others are engaging with. However, comparison is irrelevant, and the bottom line is that our epistemological work so far has mostly pointed at the hard problem rather than cut it.
We organized and wrapped up the pilot cohort of Refine, an incubator for independent alignment researchers. Refine was structured to help participants develop a particular research agenda, and the five participants documented their progress in a series of AF posts. The program shook out to be a better fieldbuilding effort than research effort. Most participants expressed that the primary benefit of the program was greater immersion in the alignment field and a better understanding of the hard parts of alignment. Ultimately, the program fell short of the goal to generate radically different research agendas. We won’t run a new cohort of Refine in the foreseeable future, and will publish a short postmortem on this tomorrow.
In our first few months we developed an internal infohazard policy and made it binding for all Conjecture employees. We posted the policy publicly to hold ourselves accountable for taking infohazards seriously, and to encourage other organizations to publish or adopt similar policies. While many people and organizations have been verbally supportive of the policy, no other organization has publicly committed to a similar policy (and only one has privately established one, as far as we know). In the meantime, releases of new state-of-the-art capabilities continue with no signs of slowing down.
Operations and Fiscal Sponsorship
We have stabilized our core operations and internal processes to the degree that we can start to support other orgs, which we are doing by fiscally sponsoring a SERI MATS London cohort and ARENA. This is a bright spot. There appears to be serious interest in growing the London alignment scene. Nevertheless, fieldbuilding is instrumental to alignment but not direct progress.
While raising funds, we prioritized engaging with canonical alignment funders, which took much more time than expected. This used bandwidth from senior members of Conjecture who could have used the time to focus on research and other organizational needs. Worse, we spent a few months over-optimizing our research to be legible to funders, which slowed progress.
Updates we are making
If these reflections seem overly critical, it’s likely because we’re setting different standards for what constitutes progress on alignment. We’re looking for “woah, that really works”-type results, and don’t want to confuse ourselves about what research meets that caliber.
The positive spin on the above is that we are much stronger as an organization now than we were in March. We haven’t just hired more people and established operations – we’re also developing as researchers and managers. Below are some of the things we have started to do differently based on our reflections.
We have sharpened our focus on which internal research efforts we think cut alignment or have the potential to. While we still believe that larger portfolios of bets are necessary to maximize our chances of success, given that we have a limited amount of funding available, we have decided it makes the most sense for us to focus more sharply on a unifying agenda for the time being.
We are drafting a post about our updated research agenda, with a plan for how mechanistic interpretability, understanding LLMs, epistemology, and building research tools fit together. We would love feedback on this when it is ready.
Workflow and Research Methodology
We are shifting our research methodology from deep dives to much faster OODA loops. This has been productivity-enhancing for all of our teams, but has been particularly noticeable in our interpretability efforts. The team has learned more in the last 2 months than the first 6, and we have several interesting results that we are currently writing into posts. Expect to see some of these in the next month or so.
We overinvested in legibility and polish partially because we miscalculated how beneficial it would be for raising funds. But we were also motivated by not wanting to embarrass ourselves. This was a bad policy. Alongside speeding up our research cycles, we’re committed to publishing faster too. This means that we will be producing much more imperfect, sometimes even stupid, stuff more often. Your criticism is always welcome!
We want to expose ourselves to outside views in order to gather as many bits of evidence as possible. Note that this does not apply to infohazardous information, which we will of course keep siloed per our infohazard policy.
Another reason for publishing more quickly is that conversations with many interpretability researchers have led us to believe that there is a wealth of knowledge in short experiments and unpublished research that really should be shared. We’d encourage other organizations who think similarly to post frequently, and share results even if they’re not completely polished.
Funding Sources and Product
Even before factoring in recent events, alignment funding remains heavily constrained, worryingly so when compared to capabilities funding. We are grateful for the support that we have received so far, but we don’t believe the capacity is there to fully support ambitious alignment scaling plans. In hindsight, we think it would have been prudent to focus on building products earlier and to focus more on raising from VC investors.
We have recently begun to build out a product team and have been working on some early demos. We set up our product work to benefit from our infrastructure without taking resources away from our research agenda, and are spinning out “Lemma Labs” as a separately-branded product arm. Right now, our strategy is to move fast, develop a bunch of demos, and do extensive beta testing within an external community to see what products gain traction and excitement. We remain committed to not building beyond-SOTA models for these products to exist.
We built Conjecture to primarily focus on technical alignment, and were initially dismissive that coordination was tractable. Our opinion has shifted a lot since March, and we are now working directly on coordination efforts to help differentially accelerate alignment.
Some of the evidence that has shifted our opinion is that we’ve found other alignment labs to be more eager to coordinate than expected. We’ve benefited from workshops that other labs have organized, peer feedback, and an overall culture of mutual support. This is great, and we’re eager to contribute to this.
We have also found that alignment researchers have consistently been acting in good faith, and are committed to and capable of updating. This makes us optimistic that things like public debates on research cruxes could lead to meaningful shifts in research agendas.
Lastly, we’ve found that many people deep in the ML world simply haven’t heard strong technical arguments in favor of alignment. We think this is a particularly sensitive area (e.g., we don’t want to risk making someone more enthusiastic about capabilities), but are hopeful that something as simple as high-bandwidth communication could help here. For some people, repeated conversations addressing cruxes may be enough for them to update on the importance of the alignment problem.
Conjecture’s overall aspirations have not changed since our founding. We aim to solve alignment in coordination with everyone else committed to this effort.
From the last 8 months, the biggest lesson we are taking away is that we have to optimize even harder, constantly evaluating what we do and adjusting based on the evidence. We want to be open with the community about our progress, iterate fast, and update on the criticism and feedback that is invaluable in allowing us to improve.
Thank you for your support! We welcome any and all feedback in the comments, or in a direct email to us at email@example.com.
If you’re interested in contributing more directly, we’re hiring and would love to hear from you! We’re currently hiring for 9+ roles, including research engineers, a security lead, an ML engineering lead, and some non-technical roles. The hiring round closes December 16. You can find more information here.
I quite appreciate that you're up front about the distinction between "we shipped things we like" and "we made progress on alignment". Some of the stuff I've read (janus's posts, infohazard policy) has been pretty thoughtful (even while I disagreed with some of it) and I was definitely worried that the work would be much worse. I recall my main recommendation to you when you visited the Bay (and my offices) was "publish your ideas so I and others can evaluate them", and I'm definitely glad you've done a bunch of that. And I agree that you can do even more :)
You publishing products sounds potentially like the beginning of the end, I'm not sure. Everyone should not really be contributing to AI hype or proving how profitable ML can be. I think maybe there's ways to do it ethically if your product line isn't at all branded as ML and doesn't have the possibility of revolutionizing a whole field (e.g. kickstarting dozens of competitors over the next 2 decades and creating 10s of billions of dollars), just a small market that you can own and gets you the funding you need and no more. Also, seems plausible it will take like 3-5 years of focused effort to become clearly profitable, killing your organizational focus on alignment for that period. I am not actually sure what I would do here, if rich folks like OpenPhil aren't into funding you for whatever reason (I don't know).
I'll mention here that from what little I saw, your coordination efforts seemed a bit misguided and harmful. I am left with a sense that you wanted everyone to be friends and not feel threatened by interacting with others, to feel like "come in, the water is fine, don't worry, you won't end up with people criticizing you for maybe ending civilization or self-deceiving along the way or calling you unethical". While I am pretty open to fairly respectful coordination and am a strong fan of finding positive-sum trade, I care more about being frank and honest in my interactions, and a route must be found where communicating such things (insofar as that's what someone believes) isn't going to destroy or end the coordination/trade agreement. If you suspect you are the only people in a position to speak the truth about what is going on, you must not trade that away or else nobody will be able to say what is going on, and any remaining hope of successfully reacting to what is actually happening is likely lost.
In order to signal that I am not trying to just be critical, I will end on a true positive note, which is that overall, from my current vantage point, your team has been far more competent and focused on alignment than I expected. It's like my... 88th percentile outcome, in that I assigned 87% to you being less competent on the alignment problem. I am positively surprised :)
I can't comment on Conjecture's coordination efforts specifically, but I fairly strongly disagree with this as a philosophy of coordination. There exist a lot of people in the world who have massive empirical or ethical disagreements with me that lead to them taking actions I think are misguided to actively harmful to extremely dangerous. But I think that this often is either logical or understandable from their perspective. I think that being able to communicate productively with these people, see things from their point of view, and work towards common ground is a valuable skill, and an important part of the spirit of cooperation. For example, I think that Leah Garces's work cooperating with chicken farmers to reduce factory farming is admirable and worthwhile, and I imagine she isn't always frank and honest with people.
In particular, I think that being frank and honest in this context can basically kill possible cooperation. And good cooperation can lead to things being better by everyone's lights, so this is a large and important cost not worth taking lightly. Not everyone has to strive for cooperation, but I think it's very important that at least some people do! I do think that being so cooperative that you lose track of what you personally believe can be misguided and corrosive, but that there's a big difference between having clear internal beliefs and needing to express all of those beliefs.
Thanks for the link, I'll aim to give that podcast a listen, it's relevant to a bunch of my current thinking.
Extremely strong +1! There is so much low-hanging fruit in mechanistic interpretability (of the flavour "I could probably spend a full-time day working on this and find something mildly cool worth writing up into a rough blog post"). I would love a wealth of these posts to exist that I could point people to and read myself! I've tried to set myself a much lower bar for this, and still mostly procrastinated on this. I would love to see this.
EDIT: This is also a comparative advantage of being an org outside academia whose employees mostly aren't aiming for a future career in academia. I gather that in standard academic incentives, being scooped on your research makes the work much less impressive and publishable and can be bad for your career, disincentivising discussing partial results, especially in public. This seems pretty crippling to having healthy and collaborative discourse, but it's also hard to fault people for following their incentives!
More generally, I really appreciate the reflective tone and candour of this post! I broadly agree re the main themes, and I don't think Conjecture has really taken actions that cut at the hard core of alignment. These reflections seem plausible to me re concrete but fixable mistakes and deeper and more difficult problems. I look forward to seeing what you do next!
Thanks for writing this.
Alignment research has a track record of being a long slow slog. It seems that what we’re looking for is a kind of insight that is just very very hard to see, and people who have made real progress seem to have done so through long periods of staring at the problem.
With your two week research sprints, how do you decide what to work on for a given sprint?