Are you talking about value learning? My proposal doesn't tackle advanced value learning. Basically, my argument is: "if (A) human values are limited by the human ability to comprehend/optimize things, and (B) the factors that make something easier or harder to comprehend/optimize are simple, then the AI can avoid accidentally messing up human values, so we can define safe impact measures and corrigibility". My proposal is not supposed to make the AI learn human values in great detail or extrapolate them out of distribution. The argument is: "if A and B hold, then we can draw a box around human values and tell the AI not to mess up the contents of the box, without making the AI useless, even though the AI might not know which exact contents of the box count as 'human values'".[1]
The problem with B is that humans have very specialized and idiosyncratic cognitive machinery (the machinery generating experiences), which is much more advanced than humans' general ability to comprehend things. I interpreted you as making this counterargument in the top-level comment. My reply is that human values depend on that machinery in only a very limited way, so B remains true enough. But I'm not talking about extrapolating anything out of distribution. Unless I'm missing your point.
Why safe impact measures and corrigibility follow from A and B is not obvious and depends on a non-trivial argument. I tried to explain it in the first section of the post, but I might have failed.
Yes, some value judgements (e.g. "this movie is good", "this song is beautiful", or even "this is a conscious being") depend on inscrutable brain machinery: the machinery which creates experience. The complexity of our feelings can be orders of magnitude greater than the complexity of our explicit reasoning. Does this kill the proposal in the post? I think not, for the following reason:
We aren't particularly good at remembering exact experiences, we like very different experiences, we can't access each other's experiences, and we have very limited ways of controlling experiences. So there should be pretty strict limits on how much understanding of the inscrutable machinery is required to respect current human values. Defining corrigible behavior ("don't kill everyone", "don't seek power", "don't mess with human brains") shouldn't require answering many specific, complicated, machinery-dependent questions ("what separates good movies from bad ones?", "what separates a good life from a bad one?", "what separates conscious beings from unconscious ones?").
Also, some thoughts on your specific counterexample (which I've generalized to apply to experiences more broadly):
Does any of the above help pin down the crux of the disagreement, or clarify the intuitions behind my claim?
Based on your comments, I'd guess the crux is one of the points below:
Could you confirm or clarify the crux? Your messages felt ambiguous to me. In what specific way is A false?