My PhD thesis probably wins the prize for the weirdest ever defended at my research lab. Not only was it a work on the theory of distributed computing in a formal methods lab, but it didn’t even conform to what theory of distributed computing is supposed to look like. With the exception of one paper (interestingly, the only one accepted at the most prestigious conference in the field), none of my published works proposed new algorithms, impossibility results, complexity lower bounds, or even the most popular paper material: a brand new model of distributed computing to crowd the literature even further.
Instead, I looked at a specific formalism introduced years before, and how it abstracted the more familiar models used by most researchers. It had been introduced as such an...
Just like every Monday now, researchers in AI Alignment are invited to a coffee time to talk about their research and what they're into.
Here is the link.
And here is the everytimezone time.
Note that the link to the walled garden now only works for AF members. Anyone who wants to come but isn't an AF member needs to go through me. I'll broadly apply the following criteria for admission:
I prefer not to admit people who might be interesting but who I'm not sure won't derail the conversation, because this is supposed to be the place where AI Alignment researchers can talk about their current research without having to explain everything.
See you then!
Cross-posted to the EA forum.
I sent a two-question survey to ~117 people working on long-term AI risk, asking about the level of existential risk from "humanity not doing enough technical AI safety research" and from "AI systems not doing/optimizing what the people deploying them wanted/intended".
44 people responded (~38% response rate). In all cases, these represent the views of specific individuals, not an official view of any organization. Since some people's views may have made them more/less likely to respond, I suggest caution in drawing strong conclusions from the results below. Another reason for caution is that respondents added a lot of caveats to their responses (see the anonymized spreadsheet), which the aggregate numbers don't capture.
I don’t plan to do any analysis on this data, just share it; anyone who wants to analyze...
I've been thinking about situations where alignment fails because "predict what a human would say" (or more generally "game the loss function," what I call the instrumental policy) is easier to learn than "answer questions honestly" (overview).
One way to avoid this situation is to avoid telling our agents too much about what humans are like, or to hide some details of the training process, so that they can't easily predict humans and so are encouraged to fall back to "answer questions honestly." (This feels closely related to the general phenomena discussed in Thoughts on Human Models.)
Setting aside other reservations about this approach, could it resolve our problem?