Mark Xu

# Posts

Sorted by New

Understanding “Deep Double Descent”

https://arxiv.org/abs/1806.00952 gives a theoretical argument suggesting that SGD converges to a point that is very close in L2 norm to its initialization. Since NNs are often initialized with extremely small weights, this amounts to implicit L2 regularization.
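A minimal sketch of the phenomenon (my illustration, not from the paper): for an overparameterized least-squares problem, plain gradient descent only ever moves the weights within the row space of the data, so starting from a tiny initialization it ends up near the minimum-L2-norm interpolating solution, i.e. near the init in L2 norm.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 10, 50                        # fewer samples than parameters
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

w = 1e-3 * rng.normal(size=d)        # tiny initialization
w0 = w.copy()
lr = 0.01
for _ in range(20_000):              # plain gradient descent on squared loss
    grad = X.T @ (X @ w - y) / n
    w -= lr * grad

# Minimum-L2-norm solution that fits the data exactly, for comparison
w_min = X.T @ np.linalg.solve(X @ X.T, y)

print(np.linalg.norm(w - w0))        # distance traveled from the init
print(np.linalg.norm(w - w_min))     # nearly zero: GD found the min-norm solution
```

Because the init is tiny, "close to the min-norm solution" and "close to the init" coincide here; with a large initialization the implicit-regularization story would look different.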

My rough take: https://elicit.ought.org/builder/oTN0tXrHQ

Three buckets, similar to Ben Pace's:

1. 5% chance that current techniques just get us all the way there, e.g. something like GPT-6 is basically AGI
2. 10% chance AGI doesn't happen this century, e.g. humanity starts taking this seriously and decides to hold off, combined with the problem being technically difficult enough that small groups can't really build AGI themselves
3. 50% chance that something like current techniques plus some number of new insights gets us to AGI