Feedback welcomed: www.admonymous.co/zeshen
In every scenario, if you have a superintelligent actor which is optimizing the grader's evaluations while searching over a large real-world plan space, the grader gets exploited.
Similar to the evaluator-child who's trying to win his mom's approval by being close to the gym teacher, how would grader exploitation be different from specification gaming / reward hacking? In theory, wouldn't a perfect grader solve the problem?
I'm probably missing something, but doesn't this just boil down to "misspecified goals lead to reward hacking"?
I got the book (thanks to Conjecture) after doing the Intro to ML Safety Course where the book was recommended. I then browsed through the book and thought of writing a review of it - and I found this post instead, which is a much better review than I would have written, so thanks a lot for this! Let me just put down a few thoughts that might be relevant for someone else considering picking up this book.
Target audience: Right at the beginning of the book, the author says "This book is written for the sophisticated practitioner rather than the academic researcher or the general public." I think this is relevant, as the book goes to a level of detail way beyond what's needed to get a good overview of engineering safety.Relevance to AI safety: I feel like most engineering safety concepts are not applicable to alignment, firstly because an AGI would likely not have any human involvement in its optimization process, and secondly the basic underlying STAMP constructs of safety constraints, hierarchical safety control structures, and process models are simply more applicable to engineering systems. As stated in p100, "“STAMP focuses particular attention on the role of constraints in safety management.“ and I highly doubt an AGI can be bounded by constraints. " Nevertheless, Chapter 8 STPA: A New Hazard Analysis Technique that describes STPA (System Theoretic Process Analysis) may be somewhat relevant to designing safety interlocks. Also, the final chapter (13) on Managing Safety and the Safety Culture, is broadly applicable to any field that involves safety.
Criticisms on conventional techniques: The book often mentions that techniques like STAMP and STPA is superior than other conventional techniques like HAZOP and gives quotes by reviewers that attest to their superiority. I don't know if those criticisms are really fair, given how it is not really adopted at least in the oil and gas industry that, for all its flaws, takes safety very seriously. Perhaps the criticisms could be fair for very outdated safety practices. Nevertheless, the general concepts of engineering safety feels quite similar whether it uses 'conventional' techniques or the 'new' techniques described in the book.
Overall, I think this book provides a good overview of engineering safety concepts, but for the general audience (or alignment researchers) it goes into too much detail on specific case studies and arguments.
Thanks for the comment!
You can read more about how these technical problems relate to AGI failure modes and how they rank on importance, tractability, and crowdedness in Pragmatic AI Safety 5. I think the creators included this content in a separate forum post for a reason.
I felt some of the content in the PAIS series would've been great for the course, though the creators probably had a reason to exclude them and I'm not sure why.
The second group doesn't necessarily care about why each research direction relates to reducing X-risk.
In this case I feel it could be better for the chapter on x-risk to be removed entirely. Might be better to not include it at all than to include it and mostly show quotes by famous people without properly engaging in the arguments.