Yes on point Number 1, and partly on point number 2.If humans don't have incredibly complete models for how to achieve their goals, but know they want a glass of water, telling the AI to put a cup of H2O in front of them can create weird mistakes. This can even happen because of causal connections the humans are unaware of. The AI might have better causal models than the humans, but still cause problems for other reasons. In this case, a human might not know the difference between normal water and heavy water, but the AI might decide that since there are two forms, it should have them present in equal amounts, which would be disastrous for reasons entirely beyond the understanding of the human who asked for the glass of water. The human needed to specify the goal differently, and was entirely unaware of what they did wrong - and in this case it will be months before the impacts of the weirdly different than expected water show up, so human-in-the-loop RL or other methods might not catch it.
This seems really exciting, and I'd love to chat about how betrayal is similar to or different than manipulation. Specifically, I think the framework I proposed in my earlier multi-agent failure modes paper might be helpful in thinking through the categorization. (But note that I don't endorse thinking of everything as Goodhart's law, despite that paper - though I still think it's technically true, it's not as useful as I had hoped.)
On the topic of growth rate of computing power, it's worth noting that we expect the model which experts have to be somewhat more complex that what we represented as "Moore's law through year " - but as with the simplification regarding CPU/GPU/ASIC compute, I'm unsure how much this is really a crux for anyone about the timing for AGI.I would be very interested to hear from anyone who said, for example, "I would expect AGI by 2035 if Moore's law continues, but I expect it to end before 2030, and it will therefore likely take until 2050 to reach HLMI/AGI."
I mostly agree, but we get into the details of how we expect improvements can occur much more in the upcoming posts on paths to HLMI and takeoff speeds.
Note: I think that this is a better written-version of what I was discussing when I revisited selection versus control, here: https://www.lesswrong.com/posts/BEMvcaeixt3uEqyBk/what-does-optimization-mean-again-optimizing-and-goodhart (The other posts in that series seem relevant.)
I didn't think about the structure that search-in territory / model-based optimization allows, but in those posts I mention that most optimization iterates back and forth between search-in-model and search-in-territory, and that a key feature which I think you're ignoring here is cost of samples / iteration.
Selection in humans is via mutation, so that closely related organisms can get a benefit form cooperating, even at the cost of personally not replicating. As a JBS Haldane quote puts it, "I would gladly give up my life for two brothers, or eight cousins."Continuing from that paper, explaining it better than I could;"What is more interesting, it is only in such small populations that natural selection would favour the spread of genes making for certain kinds of altruistic behaviour. Let us suppose that you carry a rare gene which affects your behaviour so that you jump into a river and save a child, but you have one chance in ten of being drowned, while I do not possess the gene, and stand on the bank and watch the child drown.
If the child is your own child or your brother or sister, there is an even chance that the child will also have the gene, so five such genes will be saved in children for one lost in an adult. If you save a grandchild or nephew the advantage is only two and a half to one. If you only save a first cousin, the effect is very slight. If you try to save your first cousin once removed the population is more likely to lose this valuable gene than to gain it."
My point was that deception will almost certainly outperform honesty/cooperation when AI is interacting with humans, and in reflection, seems likely do so even interacting with other AIs by default because there is no group selection pressure.
In the spirit of open peer review, here are a few thoughts:
First, overall, I was convinced during earlier discussions that this is a bad idea - not because of costs, but because the idea lacks real benefits, and itself will not serve the necessary functions. Also see this earlier proposal (with no comments). There are already outlets that allow robust peer review, and the field is not well served by moving away from the current CS / ML dynamic of arXiv papers and presentations at conferences, which allow for more rapid iteration and collaboration / building on work than traditional journals - which are often a year or more out of date as of when they appear. However, if this were done, I would strongly suggest doing it as an arXiv overlay journal, rather than a traditional structure.
One key drawback you didn't note is that allowing AI safety further insulation from mainstream AI work could further isolate it. It also likely makes it harder for AI-safety researchers to have mainstream academic careers, since narrow journals don't help on most of the academic prestige metrics.
Two more minor disagreement are about first, the claim that "If JAA existed, it would be a great place to send someone who wanted a general overview of the field." I would disagree - in field journals are rarely as good a source as textbooks or non-technical overview. Second, the idea that a journal would provide deeper, more specific, and better review than Alignment forum discussions and current informal discussions seems farfetched given my experience publishing in journals that are specific to a narrow area, like Health security, compared to my experience getting feedback on AI safety ideas.
Honesty, too, arose that way. So I'm not sure whether (say) a system trained to answer questions in such a way that the humans watching it give reward would be more or less likely to be deceptive.
I think it is mistaken. (Or perhaps I don't understand a key claim / assumption.)
Honesty evolved as a group dynamic, where it was beneficial for the group to have ways for individuals to honestly commit, or make lying expensive in some way. That cooperative pressure dynamic does not exist when a single agent is "evolving" on its own in an effectively static environment of humans. It does exist in a co-evolutionary multi-agent dynamic - so there is at least some reason for optimism within a multi-agent group, rather than between computational agents and humans - but the conditions for cooperation versus competition seem at least somewhat fragile.
Strongly agree that it's unclear that there failures would be detected. For discussion and examples, see my paper here: https://www.mdpi.com/2504-2289/3/2/21/htm