DeepSeek-R1 has recently made waves as a state-of-the-art open-weight model, with potentially substantial improvements in model efficiency and reasoning. But like other open-weight models and leading fine-tunable proprietary models such as OpenAI’s GPT-4o, Google’s Gemini 1.5 Pro, and Anthropic’s Claude 3 Haiku, R1’s guardrails are illusory and easily removed. An...
Imagine your once reliable, trusty AI assistant suddenly suggesting dangerous actions or spreading misinformation. This is a growing threat as large language models (LLMs) become more capable and pervasive. The culprit? Data poisoning, where LLMs are trained on corrupted or harmful data, potentially turning powerful tools into dangerous liabilities. Our...
Work done at FAR AI. There has been a lot of conceptual work on mesa-optimizers: neural networks that develop internal goals that may differ from their training objectives (the inner alignment problem). There is an abundance of good ideas for empirical work (find search in a NN, interpret it), but...
Adversarial vulnerabilities have long been an issue in various ML systems. Large language models (LLMs) are no exception, suffering from issues such as jailbreaks: adversarial prompts that bypass model safeguards. At the same time, scale has led to remarkable advances in the capabilities of LLMs, leading us to ask: to...
Last year, we showed that supposedly superhuman Go AIs can be beaten by human amateurs playing specific “cyclic” patterns on the board. Vulnerabilities have previously been observed in a wide variety of sub- or near-human AI systems, but this result demonstrates that even far superhuman AI systems can fail catastrophically...
Doing a PhD is a strong option to get great at developing and evaluating research ideas. These skills are necessary to become an AI safety research lead, one of the key talent bottlenecks in AI safety, and are helpful in a variety of other roles. By contrast, my impression is...
TL;DR: FAR AI's science of robustness agenda has found vulnerabilities in superhuman Go systems; our value alignment research has developed more sample-efficient value learning algorithms; and our model evaluation direction has developed a variety of new black-box and white-box evaluation methods. FAR AI is a non-profit AI safety research institute,...