Note: I think that this is a better-written version of what I was discussing when I revisited selection versus control, here: https://www.lesswrong.com/posts/BEMvcaeixt3uEqyBk/what-does-optimization-mean-again-optimizing-and-goodhart (The other posts in that series seem relevant.)
I didn't think about the structure that search-in-territory / model-based optimization allows, but in those posts I mention that most optimization iterates back and forth between search-in-model and search-in-territory, and that a key feature which I think you're ignoring here is the cost of samples / iteration.
Selection in humans is via mutation, so that closely related organisms can get a benefit from cooperating, even at the cost of personally not replicating. As a JBS Haldane quote puts it, "I would gladly give up my life for two brothers, or eight cousins." Continuing from that paper, explaining it better than I could: "What is more interesting, it is only in such small populations that natural selection would favour the spread of genes making for certain kinds of altruistic behaviour. Let us suppose that you carry a rare gene which affects your behaviour so that you jump into a river and save a child, but you have one chance in ten of being drowned, while I do not possess the gene, and stand on the bank and watch the child drown.
If the child is your own child or your brother or sister, there is an even chance that the child will also have the gene, so five such genes will be saved in children for one lost in an adult. If you save a grandchild or nephew the advantage is only two and a half to one. If you only save a first cousin, the effect is very slight. If you try to save your first cousin once removed the population is more likely to lose this valuable gene than to gain it."
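The arithmetic in Haldane's passage can be checked with a short sketch. It assumes the standard relatedness coefficients (1/2 for a child or sibling, 1/4 for a grandchild or nephew, 1/8 for a first cousin, 1/16 for a first cousin once removed) as the chance that the rescued child carries the rare gene, and it uses the one-in-ten drowning risk from the quote. The function name is hypothetical, chosen only for illustration.

```python
from fractions import Fraction

# Risk that the rescuer drowns, taken from the quoted passage.
DEATH_RISK = Fraction(1, 10)

# Assumed chance that the rescued child also carries the rare gene
# (standard coefficients of relatedness; not stated numerically in the quote
# except for the "even chance" case).
RELATEDNESS = {
    "child or sibling": Fraction(1, 2),
    "grandchild or nephew": Fraction(1, 4),
    "first cousin": Fraction(1, 8),
    "first cousin once removed": Fraction(1, 16),
}


def genes_saved_per_gene_lost(relatedness, death_risk=DEATH_RISK):
    """Expected copies of the gene saved per copy lost, per rescue attempt."""
    return relatedness / death_risk


for relative, r in RELATEDNESS.items():
    ratio = genes_saved_per_gene_lost(r)
    verdict = "gene tends to spread" if ratio > 1 else "gene tends to be lost"
    print(f"{relative}: {float(ratio):.3f} saved per lost ({verdict})")
```

Under these assumptions the ratios are 5, 2.5, 1.25, and 0.625, which line up with "five such genes", "two and a half to one", "very slight", and "more likely to lose this valuable gene than to gain it".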
My point was that deception will almost certainly outperform honesty/cooperation when AI is interacting with humans, and, on reflection, seems likely to do so even when interacting with other AIs by default, because there is no group selection pressure.
In the spirit of open peer review, here are a few thoughts:
First, overall, I was convinced during earlier discussions that this is a bad idea - not because of costs, but because the idea lacks real benefits, and itself will not serve the necessary functions. Also see this earlier proposal (with no comments). There are already outlets that allow robust peer review, and the field is not well served by moving away from the current CS / ML dynamic of arXiv papers and presentations at conferences, which allow for more rapid iteration and collaboration / building on work than traditional journals - which are often a year or more out of date as of when they appear. However, if this were done, I would strongly suggest doing it as an arXiv overlay journal, rather than a traditional structure.
One key drawback you didn't note is that allowing AI safety further insulation from mainstream AI work could further isolate it. It also likely makes it harder for AI-safety researchers to have mainstream academic careers, since narrow journals don't help on most of the academic prestige metrics.
Two more minor disagreements. First, the claim that "If JAA existed, it would be a great place to send someone who wanted a general overview of the field." I would disagree - in-field journals are rarely as good a source as textbooks or non-technical overviews. Second, the idea that a journal would provide deeper, more specific, and better review than Alignment Forum discussions and current informal discussions seems farfetched given my experience publishing in journals that are specific to a narrow area, like Health security, compared to my experience getting feedback on AI safety ideas.
Honesty, too, arose that way. So I'm not sure whether (say) a system trained to answer questions in such a way that the humans watching it give reward would be more or less likely to be deceptive.
I think it is mistaken. (Or perhaps I don't understand a key claim / assumption.)
Honesty evolved as a group dynamic, where it was beneficial for the group to have ways for individuals to honestly commit, or make lying expensive in some way. That cooperative pressure dynamic does not exist when a single agent is "evolving" on its own in an effectively static environment of humans. It does exist in a co-evolutionary multi-agent dynamic - so there is at least some reason for optimism within a multi-agent group, rather than between computational agents and humans - but the conditions for cooperation versus competition seem at least somewhat fragile.
Strongly agree that it's unclear that these failures would be detected. For discussion and examples, see my paper here: https://www.mdpi.com/2504-2289/3/2/21/htm
Another possible argument is that we can't tell when multiple AIs are failing or subverting each other. Agents pursuing their own goals in a multi-agent environment are intrinsically manipulative, and when agents are manipulating one another, it happens in ways that we do not know how to detect or consider. This is somewhat different than when they manipulate humans, where we have a clear idea of what does and does not qualify as harmful manipulation.
re: #5, that doesn't seem to claim that we can infer U given their actions, which is what the impossibility of deducing preferences is actually claiming. That is, assuming 5, we still cannot show that there isn't some U1≠U2 such that π∗(U1,ζ)=π∗(U2,ζ). (And as pointed out elsewhere, it isn't Stuart's thesis, it's a well-known and basic result in the decision theory / economics / philosophy literature.)
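As a toy illustration of the non-identifiability point, here is a minimal sketch in which distinct utility functions yield the same optimal policy. It treats ζ loosely as a fixed, fully known environment and the optimal policy as a per-state argmax, which is an assumption about the notation rather than something the thread spells out. All names and numbers are hypothetical.

```python
STATES = ["s0", "s1"]
ACTIONS = ["a", "b", "c"]

# A made-up utility over (state, action) pairs.
U1 = {
    ("s0", "a"): 1.0, ("s0", "b"): 0.0, ("s0", "c"): -1.0,
    ("s1", "a"): 0.2, ("s1", "b"): 0.9, ("s1", "c"): 0.5,
}

# A positive affine rescaling of U1: different numbers, same ranking.
U2 = {key: 2.0 * value + 3.0 for key, value in U1.items()}

# A utility that ranks the non-optimal actions differently from U1
# but still agrees on the best action in each state.
U3 = {
    ("s0", "a"): 1.0, ("s0", "b"): -5.0, ("s0", "c"): 0.0,
    ("s1", "a"): 0.8, ("s1", "b"): 0.9, ("s1", "c"): 0.0,
}


def optimal_policy(utility):
    """Pick the utility-maximizing action in each state (toy stand-in for pi*)."""
    return {s: max(ACTIONS, key=lambda a: utility[(s, a)]) for s in STATES}


policies = [optimal_policy(u) for u in (U1, U2, U3)]
print(policies[0])
print(policies[0] == policies[1] == policies[2])
```

Observing only the chosen actions cannot distinguish U1 from U2 or U3 here, even though U3 disagrees with U1 about how the remaining actions rank.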
I think there needs to be individual decisionmaking (on the part of both organizations and individual researchers, especially in light of the unilateralist's curse) alongside a much broader discussion about how the world should handle unsafe machine learning and more advanced AI.
I very much don't think it would be wise for the AI safety community to debate and come up with shared, semi-public guidelines for, essentially, what to withhold from the broader public, without input from the wider ML / AI and research community, who are impacted and whose work is a big part of what we are discussing. That community needs to be engaged in any such discussions.
There's some intermediate options available instead of just "full secret" or "full publish"... and I haven't seen anyone mention that...
OpenAI's phased release of GPT-2 seems like a clear example of exactly this. And there is a forthcoming paper from Toby Shevlane looking at the internal deliberations around this, in addition to his extant work on the question of how disclosure potentially affects misuse.