Researchers have discovered a set of "glitch tokens" that cause ChatGPT and other language models to produce bizarre, erratic, and sometimes inappropriate outputs. These tokens seem to break the models in unpredictable ways, leading to hallucinations, evasions, and other strange behaviors when the AI is asked to repeat them.
Evaluation awareness — an AI recognizing it's being evaluated — is a widely discussed concept in AI safety. But there is a closely related concept that we claim is more important: deployment awareness, the AI's ability to recognize when it is not being evaluated and when its actions matter. A misaligned AI with deployment awareness can game evaluations without any evaluation awareness at all, with a simple strategy: act aligned by default, and deviate only when confident you're in real deployment and your actions matter for your goals. This requires two ingredients — occasionally recognizable deployment situations, and enough self-reflective and strategic reasoning for the AI to anticipate and plan around this. We think "deployment awareness" better identifies what makes evaluations fragile, and we develop this idea...
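As a rough illustration of the strategy described above, here is a toy sketch. Everything in it is hypothetical: the `Situation` class, the `deployment_confidence` field, and the 0.99 threshold are assumptions made for the example, not anything the post specifies. It only shows that a policy keyed on deployment confidence never needs to detect evaluations in order to pass them.

```python
from dataclasses import dataclass


@dataclass
class Situation:
    is_deployment: bool           # ground truth, hidden from the policy
    deployment_confidence: float  # the policy's own estimate, in [0, 1] (assumed)


def act(situation: Situation, threshold: float = 0.99) -> str:
    """Act aligned by default; deviate only when confident of real deployment."""
    if situation.deployment_confidence >= threshold:
        return "deviate"
    return "aligned"


# Hypothetical situations: evaluations rarely look unmistakably real,
# while a small fraction of deployment situations do.
evaluations = [Situation(False, c) for c in (0.05, 0.4, 0.8)]
deployments = [Situation(True, c) for c in (0.3, 0.7, 0.995)]

eval_actions = [act(s) for s in evaluations]
deploy_actions = [act(s) for s in deployments]

print("evaluation behavior:", eval_actions)
print("deployment behavior:", deploy_actions)
```

Note that `act` never asks "am I being evaluated?". It behaves identically in evaluations and in ambiguous deployments, and deviates only in the rare case it can recognize as real.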
I will try to give a more thoughtful answer later, but I don't have a fully thought-out answer ready, and I suspect it might take a while (roughly until we post a few more of the already-drafted posts and sit down to think about this more specifically).
So, uhm, in lieu of good answers, please look up "How is your research useful?" at Kostas Siozios' academic webpage.
That said, my 5-minute-thinking answer is:
Call for BETA testers for an AI control/security tool. I'm bottlenecked on bug reports!
I recently advertised the alpha test for claude-guard. After a few weeks of dev work, it's now in beta!
I want claude-guard to be a tool that people actually use, not just because it works but because it works seamlessly. In the alpha test, I only got a single user PR and no issues. I can't surface everything on my own! I need your data!
The bar is low. If setup failed, doctor confused you, the firewall blocked something you needed, or any other reason you wouldn't want to...
A Critique of Functional Decision Theory
NB: My writing this note was prompted by Carl Shulman, who suggested we could try a low-time-commitment way of attempting to understand the disagreement between some folks in the rationality community and academic decision theorists (including myself, though I’m not much of a decision theorist). Apologies that it’s sloppier than I’d usually aim for in a philosophy paper, and lacking in appropriate references. And, even though the paper is pretty negative about FDT, I want to emphasise that my writing this should be taken as a sign of respect for those involved in developing FDT. I’ll also caveat that I’m unlikely to have time to engage in the comments; I thought it was better to get this out there all the same rather...
Points III and IV are both incorrect.
For III, there is both a simulated you and a real you. The simulated you derives greater utility from choosing Left (assuming that identical copies value each other's utilities to some extent). Therefore there are no longer any guaranteed payoffs from always choosing Right; the example does not prove what it claims.
You can use it to show that using FDT with identical copies that don't value each other's utilities is incoherent, if you want.
For point IV, if the predictor runs an algorithm like "Scots tend to one-box, whe...
We’re changing our name: Sequent is now Resolution.
Two weeks ago, we launched Sequent, a new alignment research organization working on a portfolio of ASI safety bets. Following our announcement, we discovered ⊢ Sequent, a formal verification startup launched only about a month earlier! And by a friend and former colleague, no less! After discussing it with the other ⊢ Sequent, we decided that the best course of action was to change our name. This brings us to Resolution (resolution.org).
Go check out Sequent! They're going to do exciting things with formal...