John Schulman


Introducing the Principles of Intelligent Behaviour in Biological and Social Systems (PIBBSS) Fellowship

I'm especially interested in the analogy between AI alignment and democracy. (I guess this goes under "Social Structures and Institutions".) Democracy is supposed to align a superhuman entity with the will of the people, but there are a lot of failures, closely analogous to well-known AI alignment issues: 

  • politicians optimize for the approval of low-information voters, rather than truly optimizing for the people's wellbeing (deceptive alignment)
  • politicians, PACs, parties, and permanent bureaucrats are agents with their own goals that don't align with the populace's (mesa-optimizers)

I think it's more likely that insights will transfer from the field of AI alignment to the field of government design than vice versa: it's easier to run experiments on the AI side, and the thinking there is clearer.

EfficientZero: human ALE sample-efficiency w/MuZero+self-supervised

Would you say Learning to Summarize is an example of this?

It's model-based RL because you're optimizing against a model of the human (i.e., the reward model). And there are some results at the end on test-time search.

Or do you have something else in mind?
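The idea above -- optimizing against a learned model of the human, with test-time search -- can be illustrated with best-of-n sampling. Everything here is a toy stand-in I've made up for illustration (the reward function, candidates, and names are assumptions, not the actual Learning to Summarize setup):

```python
import random

def reward_model(summary: str) -> float:
    """Toy stand-in for a learned model of human judgment:
    reward mentioning the key point, penalize length."""
    score = 10.0 if "key point" in summary else 0.0
    return score - 0.1 * len(summary)

def policy_sample(rng: random.Random) -> str:
    """Toy policy: sample one of a few candidate summaries."""
    candidates = [
        "the key point is X",
        "a long rambling summary that eventually mentions the key point",
        "unrelated text",
    ]
    return rng.choice(candidates)

def best_of_n(n: int, seed: int = 0) -> str:
    """Test-time search: draw n samples from the policy and keep the
    one the reward model scores highest -- optimizing against the
    model of the human rather than against humans directly."""
    rng = random.Random(seed)
    samples = [policy_sample(rng) for _ in range(n)]
    return max(samples, key=reward_model)
```

The same "optimize against the reward model" structure shows up whether the optimizer is best-of-n search at test time or a policy-gradient update at training time.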

EfficientZero: human ALE sample-efficiency w/MuZero+self-supervised

Thanks, this is very insightful. BTW, I think your paper is excellent!

EfficientZero: human ALE sample-efficiency w/MuZero+self-supervised

I'm still not sure how to reconcile your results with the fact that the participants in the procgen contest ended up winning with modifications of our PPO/PPG baselines, rather than Q-learning and other value-based algorithms, whereas your paper suggests that Q-learning performs much better. The contest used 8M timesteps + 200 levels. I assume that your "QL" baseline is pretty similar to widespread DQN implementations.

Are there implementation level changes that dramatically improve performance of your QL implementation?

(Currently on vacation and I read your paper briefly while traveling, but I may very well have missed something.)

EfficientZero: human ALE sample-efficiency w/MuZero+self-supervised

There's no PPO/PPG curve there -- I'd be curious to see that comparison (though I agree that QL/MuZero will probably be more sample-efficient).

EfficientZero: human ALE sample-efficiency w/MuZero+self-supervised

Performance here is mostly limited by the fact that there are 500 levels for each game (i.e., level overfitting is the problem), so it's not that meaningful to look at sample efficiency with respect to environment interactions. The results would look a lot different on the full distribution of levels. I agree with your statement directionally, though.

Frequent arguments about alignment

Agree with what you've written here -- I think you put it very well.

Frequent arguments about alignment

In my experience, you need separate teams doing safety research because specialization is useful -- it's easiest to make progress when both individuals and teams specialize a bit and develop taste and mastery of a narrow range of topics.

Frequent arguments about alignment

Yeah, that's also a good point, though I don't want to read too much into it, since it might be a historical accident.
