Recommended Sequences

AGI safety from first principles
Embedded Agency
2022 MIRI Alignment Discussion

Recent Discussion

On a call, I was discussing my idea for doing activation-level learning to (hopefully) provide models with feedback based on their internal computations and choices:

I may have slipped into a word game... are we "training against the [interpretability] detection method" or are we "providing feedback away from one kind of algorithm and towards another"? They seem to suggest very different generalizations, even though they describe the same finetuning process. How could that be?

This is why we need empirics.
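
To make the ambiguity concrete, here is a minimal sketch (my own illustration, with an invented toy model and probe, not the setup from the call) of what activation-level feedback could look like: an auxiliary penalty discourages hidden activations from aligning with a fixed probe direction that stands in for the interpretability detection method.

```python
# Minimal sketch of activation-level feedback (illustrative assumptions only):
# a fixed probe direction stands in for an interpretability detector, and an
# auxiliary penalty pushes hidden activations away from it during fine-tuning.
import torch
import torch.nn as nn

torch.manual_seed(0)

class ToyModel(nn.Module):
    def __init__(self, d_in=16, d_hidden=32, d_out=4):
        super().__init__()
        self.encoder = nn.Linear(d_in, d_hidden)
        self.head = nn.Linear(d_hidden, d_out)

    def forward(self, x):
        h = torch.relu(self.encoder(x))  # the "internal computation" we give feedback on
        return self.head(h), h

model = ToyModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# Hypothetical probe direction; in practice it would come from an
# interpretability method trained to detect an unwanted internal algorithm.
probe = torch.randn(32)
probe = probe / probe.norm()

def loss_fn(x, y, penalty_weight=0.1):
    logits, h = model(x)
    task_loss = nn.functional.cross_entropy(logits, y)
    # Activation-level feedback: penalize the projection of hidden states onto the probe.
    probe_score = (h @ probe).pow(2).mean()
    return task_loss + penalty_weight * probe_score

# One illustrative update on random data.
x, y = torch.randn(8, 16), torch.randint(0, 4, (8,))
loss = loss_fn(x, y)
opt.zero_grad()
loss.backward()
opt.step()
```

The same gradient step can be narrated as "training against the detector" or as "feedback away from one internal algorithm and towards another"; nothing in the code distinguishes the two descriptions, which is exactly why the generalization question needs empirics.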

1Chris_Leong
Thanks for posting this. I've been confused about the connection between shard theory and activation vectors for a long time! This confuses me. I can imagine an AIXI program where the utility function is compositional even if the optimisation is unitary. And I guess this isn't two full motivational circuits, but it kind of is two motivational circuits.
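
As a toy illustration of the distinction being gestured at here (my own sketch, not anything from the comment): a utility function can be a sum of parts while the optimisation over it is a single, unitary search, and that behaves differently from two separate "circuits" each pushing for their own term.

```python
# Toy contrast (illustrative only): a compositional utility optimized by one
# unitary planner, versus two separate "motivational circuits" with an
# arbitration step between them.
def u_food(s):
    return -(s - 3) ** 2    # this term prefers state 3

def u_safety(s):
    return -(s - 7) ** 2    # this term prefers state 7

STATES = range(10)

# Unitary optimisation of a compositional utility: one argmax over the sum.
unitary_choice = max(STATES, key=lambda s: u_food(s) + u_safety(s))

# Two "circuits": each proposes its own optimum and a crude arbiter picks one.
food_bid = max(STATES, key=u_food)
safety_bid = max(STATES, key=u_safety)
circuit_choice = food_bid if u_food(food_bid) >= u_safety(safety_bid) else safety_bid

print(unitary_choice, circuit_choice)  # 5 (a compromise) vs 3 (one circuit wins outright)
```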

Today we are publishing a significant update to our Responsible Scaling Policy (RSP), the risk governance framework we use to mitigate potential catastrophic risks from frontier AI systems. This update introduces a more flexible and nuanced approach to assessing and managing AI risks while maintaining our commitment not to train or deploy models unless we have implemented adequate safeguards. Key improvements include new capability thresholds to indicate when we will upgrade our safeguards, refined processes for evaluating model capabilities and the adequacy of our safeguards (inspired by safety case methodologies), and new measures for internal governance and external input. By learning from our implementation experiences and drawing on risk management practices used in other high-consequence industries, we aim to better prepare for the rapid pace of AI...

While you can make a lot of progress in evals by tinkering and paying little attention to the literature, we found that various papers have saved us many months of research effort. The Apollo Research evals team thus compiled a list of what we felt were important evals-related papers. We likely missed some relevant papers, and our recommendations reflect our personal opinions.

Our favorite papers

  • Evaluating Frontier Models for Dangerous Capabilities (Phuong et al., 2024)
    • Contains detailed descriptions of multiple LM agent evals across four categories. Also explores new methodologies for estimating evals success probabilities.
    • We think it is the best “all around” evals paper, i.e., the one that gives the best understanding of what frontier LM agent evals look like.
    • We tested the calibration of their new methodologies in practice in Hojmark et al.,
...

In late 2022, Nate Soares gave some feedback on my Cold Takes series on AI risk (shared as drafts at that point), stating that I hadn't discussed what he sees as one of the key difficulties of AI alignment.

I wanted to understand the difficulty he was pointing to, so the two of us had an extended Slack exchange, and I then wrote up a summary of the exchange that we iterated on until we were both reasonably happy with its characterization of the difficulty and our disagreement.[1] My short summary is:

  • Nate thinks there are deep reasons that training an AI to do needle-moving scientific research (including alignment) would be dangerous. The overwhelmingly likely result of such a training attempt (by default, i.e., in the absence of
...
3Vivek Hebbar
Do you want to try playing this game together sometime?
2Daniel Kokotajlo
Yes! Which side do you want to be on? Want to do it in person, or in this comment thread?

Or you guys could find a 1-2 hour window to show up and live-chat in a LW dialogue, then publish the results :-)

3Vivek Hebbar
Why are we doing RL if we just want imitation? Why not SFT on expert demonstrations? Also, if 10 episodes suffice, why is so much post-training currently done on base models?
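
For concreteness, here is a minimal sketch of the "SFT on expert demonstrations" alternative the comment asks about: plain next-token cross-entropy on demonstration sequences, with no reward model and no sampling loop. The tiny model and the random "demonstrations" are stand-ins chosen for illustration, not anyone's actual training setup.

```python
# Sketch of SFT on expert demonstrations (illustrative stand-ins only):
# next-token cross-entropy on demonstration sequences, with no reward signal
# and no rollout/sampling loop as there would be in RL.
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab, d_model = 100, 64

class TinyLM(nn.Module):
    """Tiny stand-in for a language model: embedding -> GRU -> logits."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(vocab, d_model)
        self.rnn = nn.GRU(d_model, d_model, batch_first=True)
        self.head = nn.Linear(d_model, vocab)

    def forward(self, tokens):
        h, _ = self.rnn(self.embed(tokens))
        return self.head(h)

model = TinyLM()
opt = torch.optim.Adam(model.parameters(), lr=3e-4)

# "Expert demonstrations": random token sequences standing in for real data.
demos = torch.randint(0, vocab, (16, 32))

for _ in range(10):  # a handful of passes over the demos; no environment interaction
    logits = model(demos[:, :-1])  # predict each next token
    loss = nn.functional.cross_entropy(
        logits.reshape(-1, vocab), demos[:, 1:].reshape(-1)
    )
    opt.zero_grad()
    loss.backward()
    opt.step()
```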