Someone recently privately asked me for my current stance on my 'Dark Arts of Rationality' post. Here's some of my reply (lightly edited for punctuation and conversation flow), which seemed worth reproducing publicly:
FWIW, that post has been on my list of things to retract for a while.
(The retraction is pending a pair of blog posts that describe some of my thoughts on related matters, which have been in the editing queue for over a year and the draft queue for years before that.)
I wrote that post before reading much of the sequences, and updated away from the position pretty soon after. My current stance is that you can basically get all the nice things, and never need to compromise your epistemics.
On my current accounting, the mistake I was making at the time of the dark arts post was something like: lots of stuff comes culturally bundled, in ways that can confuse you into thinking you can't get good thing X without also swallowing bad thing Y.
And there's a skill of just, like, taking the good stuff and discarding the bad stuff, even if you don't yet know how to articulate a justification (which I lacked in full generality at the time of the dark arts post, and was developing at the time of the 'certainty' post).
And it's a little tricky to write about, because you've got to balance it against "care about consistency" / "notice when you're ping-ponging between mutually-inconsistent beliefs as is convenient", which is... not actually hard, I think, but I haven't found a way to write about the one without the words having an interpretation of "just drop your consistency drive". ...which is how these sorts of things end up languishing in my editing queue for years, when I have other priorities.
(And for the record, another receipt here is that in some Twitter thread somewhere--maybe the jargon thread?--I noted the insight about unbundling things, using "you can't be sad and happy at the same time" as an example of a bundled thing. Which isn't the whole concept, but which is another instance of the resolution intruding in a visible way.)
(More generally, a bunch of my early MoW posts are me, like, digesting parts of the sequences and correcting a bunch of my errors from before I encountered this community. And for the record, I'm grateful to the memes in this community--and to Eliezer in particular, who I count as originating many of them--for helping me stop being an idiot in that particular way.)
I've also gone ahead and added a short retraction-ish paragraph to the top of the dark arts post, and might edit it later to link it to the aforementioned update-posts, if they ever make it out of the editing queue.
A few people recently have asked me for my take on ARC Evals, and so I've aggregated some of my responses here:
- I don't have strong takes on ARC Evals, mostly on account of not thinking about it deeply.
- Part of my read is that they're trying to, like, get a small dumb minimal version of a thing up so they can scale it to something real. This seems good to me.
- I am wary of people in our community inventing metrics that Really Should Not Be Optimized and handing them to a field that loves optimizing metrics.
- I expect there are all sorts of issues that would slip past them, and I'm skeptical that the orgs-considering-deployments would actually address those issues meaningfully if issues were detected ([cf](https://www.lesswrong.com/posts/thkAtqoQwN6DtaiGT/carefully-bootstrapped-alignment-is-organizationally-hard)).
- Nevertheless, I think that some issues can be caught, and attempting to catch them (and to integrate with leading labs, and make "do some basic checks for danger" part of their deployment process) is a step up from doing nothing.
- I have not tried to come up with better ideas myself.
Overall, I'm generally enthusiastic about the project of getting people who understand some of the dangers into the deployment-decision loop, looking for advance warning signs.