Today we are publishing a significant update to our Responsible Scaling Policy (RSP), the risk governance framework we use to mitigate potential catastrophic risks from frontier AI systems. This update introduces a more flexible and nuanced approach to assessing and managing AI risks while maintaining our commitment not to train or deploy models unless we have implemented adequate safeguards. Key improvements include new capability thresholds to indicate when we will upgrade our safeguards, refined processes for evaluating model capabilities and the adequacy of our safeguards (inspired by safety case methodologies), and new measures for internal governance and external input. By learning from our implementation experiences and drawing on risk management practices used in other high-consequence industries, we aim to better prepare for the rapid pace of AI...
While you can make a lot of progress on evals by tinkering and paying little attention to the literature, we found that a number of papers saved us many months of research effort. The Apollo Research evals team therefore compiled a list of what we felt were important evals-related papers. We have likely missed some relevant papers, and our recommendations reflect our personal opinions.
In late 2022, Nate Soares gave some feedback on my Cold Takes series on AI risk (shared as drafts at that point), stating that I hadn't discussed what he sees as one of the key difficulties of AI alignment.
I wanted to understand the difficulty he was pointing to, so the two of us had an extended Slack exchange, and I then wrote up a summary of the exchange that we iterated on until we were both reasonably happy with its characterization of the difficulty and our disagreement.[1] My short summary is:
Or you guys could find a 1-2 hour window to show up and live-chat in a LW dialogue, then publish the results :-)
On a call, I was discussing my idea for doing activation-level learning to (hopefully) provide models feedback based on their internal computations and choices: