Agree with what you've written here -- I think you put it very well.
In my experience, you need separate teams doing safety research because specialization is useful -- it's easiest to make progress when both individuals and teams specialize a bit and develop taste and mastery of a narrow range of topics.
Yeah that's also good point, though I don't want to read too much into it, since it might be a historical accident.
yup, added a sentence about it
Basically agree -- I think that a model trained by maximum likelihood on offline data is less goal-directed than one that's trained by an iterative process where you reinforce its own samples (aka online RL), but still somewhat goal directed. It needs to simulate a goal-directed agent to do a good job at maximum likelihood. OTOH it's mostly concerned with covering all possibilities, so the goal directed reasoning isn't emphasized. But with multiple iterations, the model can improve quality (-> more goal directedness) at the expense of coverage/diversity.
Super clear and actionable -- my new favorite post on AF.
I also agree with it, and it's similar to what we're doing at OpenAI (largely thanks to Paul's influence).
D'oh, re: the optimum of the objective, I now see that the solution is nontrivial. Here's my current understanding.
Intuitively, the MAP version of the objective says: find me a simple model theta1 such that there's more-complex theta2 with high likelihood under p(theta2|theta1) (which corresponds to sampling theta2 near theta1 until theta2 satisfies head-agreement condition) and high data-likelihood p(data|theta2).
And this connects to the previous argument about world models and language as follows: we want theta1 to contain half a world model, and we want theta2 to contain the full world model and high data-likelihood (for one of the head) and the two heads agree. Based on Step1, the problem is still pretty underconstrained, but maybe that's resolved in Step 2.
Isn't the Step 1 objective (the unnormalized posterior log probability of (θ₁, θ₂)) maximized at θ₁ = θ₂=argmax L + prior? Also, I don't see what this objective has to do with learning a world model.
I think this is a good idea. If you go ahead with it, here's a suggestion.
Reviewers often procrastinate for weeks or months. This is partly because doing a review takes an unbounded amount of time, especially for articles that are long or confusing. So instead of sending the reviewers a manuscript with a due date, book a calendar event for 2 hours with the reviewers. The reviewers join a call or group chat and read the paper and discuss it. They can also help clear each other's confusions. They aim to complete the review by the end of the time window.
There's a decent amount of literature on using multiple rewards, though often it's framed as learning about multiple goals. Here are some off the top of my head:
The Horde (classic): http://www.ifaamas.org/Proceedings/aamas2011/papers/A6_R70.pdfUniversal Value Function Approximators: http://proceedings.mlr.press/v37/schaul15.htmlLearning to Act By Predicting: https://arxiv.org/abs/1611.01779Temporal Difference Models: https://arxiv.org/abs/1802.09081Successor Features: https://papers.nips.cc/paper/2017/hash/350db081a661525235354dd3e19b8c05-Abstract.html
Also see the discussion in Appendix D about prediction heads in OpenAI Five, used mostly for interpretability/diagnostics https://cdn.openai.com/dota-2.pdf.