Care to bet on the results of a survey of academic computer scientists? If the stakes are high enough, I could try to make it happen.
"As a reviewer, I only recommend for acceptance papers that appear to be both valid and interesting."
Strongly agree - ... - Strongly Disagree
"As a reviewer, I would sooner recommend for acceptance a paper that was valid, but not incredibly interesting, than a paper that was interesting, but the conclusions weren't fully supported by the analysis."
Strongly agree - ... - Strongly Disagree
I don't understand. Importantly, these are optimistically biased, and you can't assume my true credences are this high. I assign much less than 90% probability to C. But still, they're perfectly consistent. M doesn't say anything about succeeding--only being allowed. M is basically saying: listing the places he'd be willing to live, do they all pass laws which would make building dangerously advanced AI illegal? The only logical connection between C and M is that M (almost definitely) implies C.
Thank you very much for saying that.
I was feeling disappointed about the lack of positive comments, and I realized recently I should probably go around commenting on posts that I think are good, since right now, I mostly only comment on posts when I feel I have an important disagreement. So it's hard to complain when I'm on the receiving end of that dynamic.
On the 2nd point, the whole discussion of mu^prox vs. mu^dist is fundamentally about goal (mis)generalization. My position is that for a very advanced agent, point estimates of the goal (i.e. certainty that some given account of the goal is correct) would probably really limit performance in many contexts. This is captured by Assumptions 2 and 3. An advanced agent is likely to entertain multiple models of what their current understanding of their goal in a familiar context implies about their goal in a novel context. Full conviction in mu^dist does indeed imply non-wireheading behavior, and I wouldn't even call it misgeneralization; I think that would be a perfectly valid interpretation of past rewards. So that's why I spend so much time discussing relative credence in those models.
The assumption says "will do" not "will be able to do". And the dynamics of the unknown environment includes the way it outputs rewards. So the assumption was not written in a way that clearly flags its entailment of the agent deliberately modeling the origin of reward, and I regret that, but it does entail that. So that was why engage with the objection that reward is not the optimization target under this section.
In the video game playing setting you describe, it is perfectly conceivable that the agent deliberately acts to optimize for high in-game scores without being terminally motivated by reward,
There is no need to recruit the concept of "terminal" here for following the argument about the behavior of a policy that performs well according to the RL objective. If the video game playing agent refines its understanding of "success" according to how much reward it observes, and then pursues success, but it does all this because of some "terminal" reason X, that still amounts to deliberate reward optimization, and this policy still satisfies Assumptions 1-4.
If I want to analyze what would probably happen if Edward Snowden tried to enter the White House, there's lots I can say without needing to understand what deep reason he had for trying to do this. I can just look at the implications of his attempt to enter the White House: he'd probably get caught and go to jail for a long time. Likewise, if an RL agent is trying to maximize is reward, there's plenty of analysis we can do that is independent of whether there's some other terminal reason for this.
Peer review is not a certification of validity,
Do you think the peer reviewers and the editors thought the argument was valid?
Peer review can definitely issue certificates mistakenly, but validity is what it aims to certify.
Not trying to be arrogant. Just trying to present readers who have limited time a quickly digestible bit evidence about the likelihood that the argument is a shambles.
Thank you for this review! A few comments on the weaknesses of my paper.
In particular, it explicitly says the argument does not apply to supervised learning.
Hardly a weakness if supervised learning is unlikely to be an existential threat!
Strength: Does not make very concrete assumptions about the AGI development model.
Weakness: Does not talk much about how AGI is likely to be developed, unclear which of the assumptions are more/less likely to hold for AGI being developed using the current ML paradigm.
The fact that the argument holds equally well no matter what kind of function approximation is used to do inference is, I think, a strength of the argument. It's hard to know what future inference algorithms will look like, although I do think there is a good chance that they will look a lot like current ML. And it's very important that the argument doesn't lump together algorithms where outputs are selected to imitate a target (imitation learners / supervised learners) vs. algorithms where outputs are selected to accomplish a long-term goal. These are totally different algorithms, so analyses of their behavior should absolutely be done separately. The claim "we can analyze imitation learners imitating humans together with RL agents, because both times we could end up with intelligent agents" strikes me as just as suspect as the claim "we can analyze the k-means algorithm together with a vehicle routing algorithm, because both will give us a partition over a set of elements." (The claim "we can analyze imitation learners alongside the world-model of a model-based RL agent" is much more reasonable, since these are both instances of supervised learning.)
Assumes the agent will be aiming to maximize reward without justification, i.e. why does it not have other motivations, perhaps due to misgeneralizing about its goal?
Depending on the meaning of "aiming to maximize reward", I have two different responses. In one sense, I claim "aiming to maximize reward" would be the nature of a policy that performs sufficiently strongly according to the RL objective. (And aiming to maximize inferred utility would be the nature of a policy that performs sufficiently strongly according to the CIRL objective.) But yes, even though I claim this simple position stands, a longer discussion would help establish that.
There's another sense in which you can say that an agent that has a huge inductive bias in favor of , and so violates Assumption 3, is not aiming to maximize reward. So the argument accounts for this possibility. Better yet, it provides a framework for figuring out when we can expect it! See, for example, my comment in the paper that I think an arbitrarily advanced RL chess player would probably violate Assumption 3. I prefer the terminology that says this chess player is aiming to maximize reward, but is dead sure winning at chess is necessary for maximizing reward. But if these are the sort of cases you mean to point to when you suggest the possibility of an agent "not maximizing reward", I do account for those cases.
Arguments made in the paper for why an agent intervening in the reward would have catastrophic consequences are somewhat brief/weak.
Are there not always positive returns to energy/resource usage when it comes to maximizing the probability that the state of a machine continues to have certain property (i.e. reward successfully controlled)? And our continued survival definitely requires some energy/resources. To be clear, catastrophic consequences follow from an advanced agent intervening the provision of reward in the way that would be worth doing. Catastrophic consequences definitely don't follow from a half-hearted and temporary intervention in the provision of reward.
Me: Peer review can definitely issue certificates mistakenly, but validity is what it aims to certify.
You: No it doesn't. They just care about interestingness.
Me: Do you agree reviewers aim to only accept valid papers, and care more about validity than interestingness?
You: Yes, but...
If you can admit that we agree on this basic point, I'm happy to discuss further about how good they are at what they aim to do.
1: If retractions were common, surely you would have said that was evidence peer review didn't accomplish much! If academics were only equally good at spotting mistakes immediately, they would still spot the most mistakes because they get the first opportunity to. And if they do, others don't get a "chance" to point out a flaw and have the paper retracted. Even though this argument fails, I agree that journals are too reluctant to publish retractions; pride can sometimes get in the way of good science. But that has no bearing on their concern for validity at the reviewing stage.
2: Some amount of trust is taken for granted in science. The existence of trust in a scientific field does not imply that the participants don't actually care about the truth. Bounded Distrust.
3: Since some level interestingness is also required for publication, this is consistent with a top venue having a higher bar for interestingness than a lesser venue, even while they same requirement for validity. And this is definitely in fact the main effect at play. But yes, there are also some lesser journals/conferences/workshops where they are worse at checking validity, or they care less about it because they are struggling to publish enough articles to justify their existence, or because they are outright scams. So it is relevant that AAAI publishes AI Magazine, and their brand is behind it. I said "peer reviewed" instead of "peer reviewed at a top venue" because the latter would have rubbed you the wrong way even more, but I'm only claiming that passing peer review is worth a lot at a top venue.