Thoughts in Philosophy of Science of AI Alignment


Here is another interpretation of what can cause a lack of robustness to scaling down: 

(Maybe this is what you have in mind when you talk about single-single alignment not (necessarily) scaling to multi-multi alignment - but I am not sure that is the case, and even if it is, I feel pulled to state it again, as I don't think it comes out as clearly as I would want it to in the original post.)

Taking the example of an "alignment strategy [that makes] the AI find the preferences and values of humans, and then pursu[e] that", robustness to scaling down can break if "human values" (as invoked in the example) don't "survive" reductionism; i.e. if, when we try to apply reductionism to "human values", we are left with "less" than what we hoped for.

This is the inverse of saying that there is an important non-linearity when trying to scale (up) from single-single alignment to multi-multi alignment. 

I think this interpretation locates the reason for the lack of robustness in neither the capabilities nor the alignment regime, which is why I wanted to raise it. It's a claim about the nature or structural properties of "human values"; or a hint that we are deeply confused about human values (e.g. that the term currently refers to an incoherent cluster or an "unnatural" abstraction).

What you say about CEV might capture this fully, but whether it does is, I think, an empirical claim of sorts. I read CEV as a proposed solution to the more general diagnosis that I am trying to offer, namely that (the way we currently use the term) "human values" may itself not be robust to scaling down.