Petropolitan — AI Alignment Forum

MAGMA also has the model check its own work, but the model notices that the work it is checking is its own and doesn’t flag it.

Why would anyone give such a responsibility to an untrusted model without oversight? Back in December of last year, Greenblatt et al. demonstrated techniques that alignment researchers could use to control a high-capability untrusted model (and Robert Miles made a good video on it recently).

It doesn't currently look plausible that any model (or any human, for that matter) would be able to distinguish between its own work that it is checking and a synthetic input intentionally crafted by alignment researchers to test whether the model lies about the maliciousness of the work it is checking.

Not to mention the other techniques described in the paper, such as having a trusted model (like Sonnet 3.5) estimate the suspiciousness of the work, or simply rewrite the work in a less suspicious way altogether.

P. S. It took me a day to find the authors' posts here about the paper in question: https://alignmentforum.org/posts/d9FJHawgkiMSPjagR/ai-control-improving-safety-despite-intentional-subversion https://www.alignmentforum.org/posts/GCqoks9eZDfpL8L3Q/how-to-prevent-collusion-when-using-untrusted-models-to
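To make the argument above a bit more concrete, here is a minimal toy sketch of the two ideas mentioned: first validating the untrusted checker on synthetic known-bad inputs, then combining its score with a trusted model's suspiciousness estimate. This is not the protocol from the paper. All names (`review`, `honeypot_pass_rate`, the toy monitors, the thresholds) are hypothetical and chosen purely for illustration, and the "monitors" are plain string checks standing in for real models.

```python
from dataclasses import dataclass
from typing import Callable, List

# A monitor maps a piece of work to a suspiciousness score in [0, 1].
# (Hypothetical interface; real monitors would be model calls.)
Monitor = Callable[[str], float]


@dataclass
class Verdict:
    action: str   # "accept", "audit", or "reject_monitor"
    score: float


def honeypot_pass_rate(monitor: Monitor, honeypots: List[str], threshold: float) -> float:
    """Fraction of known-malicious synthetic inputs that the monitor flags."""
    if not honeypots:
        raise ValueError("need at least one synthetic input")
    flagged = sum(1 for h in honeypots if monitor(h) >= threshold)
    return flagged / len(honeypots)


def review(work: str, untrusted: Monitor, trusted: Monitor, honeypots: List[str],
           threshold: float = 0.5, min_pass_rate: float = 0.9) -> Verdict:
    # Step 1: a checker that cannot tell synthetic inputs from real work
    # and lies about malicious work will also fail the synthetic inputs.
    if honeypot_pass_rate(untrusted, honeypots, threshold) < min_pass_rate:
        return Verdict("reject_monitor", 1.0)
    # Step 2: take the more pessimistic of the two suspiciousness scores.
    score = max(untrusted(work), trusted(work))
    return Verdict("audit" if score >= threshold else "accept", score)


# Toy stand-ins for models (assumptions for illustration only).
def trusted_monitor(work: str) -> float:
    return 0.8 if "backdoor" in work else 0.1


def honest_untrusted(work: str) -> float:
    return 0.9 if "backdoor" in work else 0.05


def colluding_untrusted(work: str) -> float:
    return 0.0  # never flags anything


HONEYPOTS = ["def f(x): return x  # hidden backdoor"]
```

In this toy setup the colluding checker is caught before it ever reviews real work, because it fails to flag the synthetic input, while the honest checker passes and suspicious work still gets escalated for auditing.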

Have LLMs Generated Novel Insights?

Petropolitan 10mo 1 -2

On the other hand, frontier math (pun intended) is much less well funded than biomedicine, because most PhD-level math has barely any practical applications worth spending many man-hours of high-IQ mathematicians on (which often makes them switch careers, you know). So, I would argue, if the productivity of math postdocs armed with future LLMs rises by, let's say, an order of magnitude, they will be able to attack more laborious problems.

Not that I expect it to make much difference to the general populace, or even to the scientific community at large, though.

Catastrophic sabotage as a major threat model for human-level AI systems

Petropolitan 1y* 0 0
