Are you talking about value learning? My proposal doesn't tackle advanced value learning. Basically, my argument is: "if (A) human values are limited by the human ability to comprehend/optimize things, and (B) the factors that make something easier or harder to comprehend/optimize are simple, then the AI can avoid accidentally messing up human values, so we can define safe impact measures and corrigibility". My proposal is not supposed to make the AI learn human values in great detail or extrapolate them out of distribution. The argument is: "if A and B hold, then we can draw a box around human values and tell the AI not to mess up the contents of the box, without making the AI useless, even though the AI might not know which exact contents of the box count as 'human values'".[1]
The problem with B is that humans have very specialized and idiosyncratic cognitive machinery (the machinery generating experiences), which is much more advanced than humans' general ability to comprehend things. I interpreted you as making this counterargument in the top-level comment. My reply is that human values depend on that machinery in only a very limited way, so B remains true enough. But I'm not talking about extrapolating anything out of distribution. Unless I'm missing your point.
Why safe impact measures and corrigibility follow from A and B is not obvious and depends on a non-trivial argument. I tried to explain it in the first section of the post, but I might have failed.
Yes, some value judgements (e.g. "this movie is good", "this song is beautiful", or even "this is a conscious being") depend on inscrutable brain machinery: the machinery which creates experience. The complexity of our feelings can be orders of magnitude greater than the complexity of our explicit reasoning. Does this kill the proposal in the post? I think not, for the following reason:
We aren't particularly good at remembering exact experiences, we like very different experiences, we can't access each other's experiences, and we have very limited ways of controlling experiences. So there should be pretty strict limits on how much understanding of the inscrutable machinery is required to respect current human values. Defining corrigible behavior ("don't kill everyone", "don't seek power", "don't mess with human brains") shouldn't require answering many specific, complicated, machinery-dependent questions ("what separates good movies from bad ones?", "what separates a good life from a bad one?", "what separates conscious beings from unconscious ones?").
Also, some thoughts on your specific counterexample (which I've generalized to apply to experiences more broadly):
Does any of the above help pin down the crux of the disagreement, or clarify the intuitions behind my claim?
Based on your comments, I'd guess the crux is one of the points below:
Could you confirm or clarify the crux? Your messages felt ambiguous to me. In what specific way is A false?