Was a philosophy PhD student, left to work at AI Impacts, then Center on Long-Term Risk, then OpenAI. Quit OpenAI due to losing confidence that it would behave responsibly around the time of AGI. Now executive director of the AI Futures Project. I subscribe to Crocker's Rules and am especially interested to hear unsolicited constructive criticism. http://sl4.org/crocker.html
Some of my favorite memes:

[meme by Rob Wiblin]

[xkcd comic]

My EA Journey, depicted on the whiteboard at CLR:

[whiteboard photo, h/t Scott Alexander]
> This ontology allows clearer and more nuanced understanding of what's going on and dispels some confusions.
The ontology seems good to me, but what confusions is it dispelling? I'm out of the loop here.
IIUC the core of this post is the following:
There are three forces/reasons pushing very long-sighted ambitious agents/computations to make use of very short-sighted, unambitious agents/computations:
1. Schelling points for acausal coordination
2. Epistemology works best if you just myopically focus on answering correctly whatever question is in front of you, rather than e.g. trying to optimize your long-run average correctness or something.
3. Short-sighted, unambitious agents/computations are less of a threat, more easily controlled.
I'll ignore 1 for now. For 2 and 3, I think I understand what you are saying on a vibes/intuitive level, but I don't trust my vibes/intuition enough. I'd like to see 2 and 3 spelled out and justified more rigorously. IIRC there's an academic philosophy literature on 2, but I'm afraid I don't remember it much. Are you familiar with it?

As for 3, here's a counterpoint: Yes, more long-sighted agents/computations carry with them risk of various kinds of Evil. But the super-myopic ones also have their drawbacks. (Example: Sometimes you'd rather have a bureaucracy staffed with somewhat agentic people who can make exceptions to the rules when needed, than a bureaucracy staffed with apathetic rule-followers.) You haven't really argued that the cost-benefit analysis systematically favors delegating to myopic agents/computations.
I disagree with the probabilities given by the OP. Also, the thing I mentioned was just one example, and probably not the best example; the idea is that the 10 people on the inside would be implementing a whole bunch of things like this.
I don't think the idea is that the 10 people on the inside violate the wishes of company leadership. Rather, the idea is that they use whatever tiny amount of resources and political capital they do have as best as possible. E.g. leadership might be like "Fine, before we erase the logs of AI activity we can have your monitor system look over them and flag anything suspicious -- but you have to build the monitor by next week because we aren't delaying, and also, it can't cost more than 0.01% of overall compute."
- Plan C: 20%
- Plan D: 45%
- Plan E: 75%
I feel like these numbers are too low.
OK, good points re: those examples. But can you come up with an example that's more analogous to the cases with Claude? E.g. you are given a difficult task by your boss, and you are considering cheating a bit and pretending you did the task straightforwardly when really you cheated... and then you learn that this is all a test of your ethics. And so you decide not to cheat. Is there a benign explanation of that pattern?
Why would remembering that it's being evaluated help Claude recall what the appropriate behaviors are? If someone asks you some questions and you struggle to recall the answers, does your memory start working better when they add, "by the way, this is an important test, you'd better get these answers right"? No. Though you might be more motivated to think hard about the answers, insofar as you care about performing well on the test... which leads back to the meme. It seems like Claude wants to perform well on the test, over and above however much Claude wants to behave well intrinsically.
:(((