This is a response to ARC's first technical report: Eliciting Latent Knowledge. But it should be fairly understandable even if you didn't read ARC's report, since I summarize relevant parts of their report as necessary. Here I propose some approaches to the problem ARC outlines which are very different from...
This post was inspired by orthonormal's post Developmental Stages of GPTs and the discussion that followed, so only part of it is original. First I'll aim to provide a crisper version of the argument for why GPT wants to mesa-optimize. Specifically, I'll explain a well-known optimization algorithm used in text...
> In this paper, we argue that adversarial example defense papers have, to date, mostly considered abstract, toy games that do not relate to any specific security concern. Furthermore, defense papers have not yet precisely described all the abilities and limitations of attackers that would be relevant in practical security....
This is a belated follow-up to my Dualist Predict-O-Matic post, where I share some thoughts re: what could go wrong with the dualist Predict-O-Matic.

Belief in Superpredictors Could Lead to Self-Fulfilling Prophecies

In my previous post, I described a Predict-O-Matic which mostly models the world at a fuzzy resolution, and...
This is a response to Abram's The Parable of Predict-O-Matic, but you probably don't need to read Abram's post to understand mine. While writing this, I thought of a way in which I think things could go wrong with the dualist Predict-O-Matic, which I plan to post in about a week. I'm...