A math and computer science graduate interested in machine and animal cognition, philosophy of language, interdisciplinary ideas, etc.
I see. I didn't fully adapt to the fact that not all alignment is about RL.
Beside the point: I think those labels on the data structures are very confusing. Both the actor and the critic are very likely to also have specialized world models (projected from the labeled world model) and planning abilities. The values of the actor need not be the same as the output of the critic. And value-related and planning-related things may easily leak into the world model if you don't actively try to prevent it. So I suspect that we should ignore the labels and focus on architecture and training methods.
Yes, I think that was it; and that I did not (and still don't) understand what about that possible AGI architecture is non-trivial and has non-trivial implications for alignment, even if not ones that make it easier. It seems like not only the same problems carefully hidden, but the same flavor of the same problems in plain sight.
Didn't read the original paper yet, but from what you describe, I don't understand how the remaining technical problem is not basically the whole of the alignment problem. My understanding of what you say is that he is vague about the values we want to give the agent - and not knowing how to specify human values is kind of the point (that, and inner alignment - which I don't see addressed either).
I can think of several obstacles for AGIs that are likely to actually be created (i.e. seem economically useful, and do not display misalignment that even Microsoft can't ignore before being capable enough to be an xrisk). Most of those obstacles are widely recognized in the RL community, so you probably see them as solvable or avoidable. I did possibly think of an economically-valuable and not-obviously-catastrophic exception to the probably-biggest obstacle though, so my confidence is low. I would share it in a private discussion, because I think that we are past the point when a strict do-no-harm policy is wise.
This is an important distinction, which shows in its cleanest form in mathematics - where you have constructive definitions on the one hand, and axiomatic definitions on the other. It is important to note though that it is not quite a dichotomy - you may have a constructive definition that assumes axiomatically-defined entities, or other constructions. For example: vector spaces are usually defined axiomatically, but vector spaces over the real numbers assume the real numbers - which have multiple axiomatic definitions and corresponding constructions.
In science, there is the classic "are whales fish?" - which is mostly about whether to look at their construction/mechanism (genetics, development, metabolism...) or their patterns of interaction with their environment (the behavior of swimming and the structure that supports it). That example also emphasizes that natural language simply doesn't respect this distinction, and considers both internal structure and outside relations as legitimate "coordinates in thingspace" that may be used together to identify geometrically-natural categories.
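To make the vector-space example concrete, here is a rough sketch of how the two kinds of definition nest. The notation is my own choice, and the Cauchy-sequence construction is just one of several standard constructions of the reals (Dedekind cuts would do equally well).

```latex
% Axiomatic: a vector space over a field F is ANY (V, +, \cdot) such that
% (V, +) is an abelian group and, for all a, b \in F and u, v \in V:
\[
a \cdot (u + v) = a \cdot u + a \cdot v, \quad
(a + b) \cdot v = a \cdot v + b \cdot v, \quad
(ab) \cdot v = a \cdot (b \cdot v), \quad
1 \cdot v = v .
\]

% Constructive, but assuming \mathbb{R} as given:
\[
\mathbb{R}^n := \{ (x_1, \dots, x_n) : x_i \in \mathbb{R} \}, \qquad
(x + y)_i := x_i + y_i, \qquad
(a \cdot x)_i := a \, x_i .
\]

% \mathbb{R} itself: axiomatically, a complete ordered field;
% constructively, e.g. Cauchy sequences of rationals modulo null sequences:
\[
\mathbb{R} := \mathcal{C}(\mathbb{Q}) / \mathcal{N}, \qquad
\mathcal{N} := \{ (q_k) \in \mathcal{C}(\mathbb{Q}) : q_k \to 0 \} .
\]
```

So the construction of R^n sits on top of an entity that can itself be pinned down either by axioms or by a further construction.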
Since I became reasonably sure that I understand your position and reasoning - mostly changing it.
That was good for my understanding of your position. My main problem with the whole thing though is in the use of the word "bad". I think it should be taboo at least until we establish a shared meaning.
Specifically, I think that most observers will find the first argument more logical than the second, because of a fallacy in using the word "bad". I think that we learn that word in a way that is deeply entangled with our reward mechanism, to the point that it is mostly just a pointer to negative reward, things that we want to avoid, things that made our parents angry... In my view, the argument is then basically:
I want to avoid my suffering, and more generally person p wants to avoid person p's suffering. Therefore suffering is "to be avoided" in general, therefore suffering is "thing my parents will punish for", therefore avoid creating suffering.
When written that way, it doesn't seem more logical than its opposite.
Let me clarify that I don't argue from agreement per se. I care about the underlying epistemic mechanism of agreement, which I claim is also the mechanism of correctness. My point is that I don't see a similar epistemic mechanism in the case of morality.
Of course, emotions are verifiable states of brains. And the same goes for preferring actions that would lead to certain emotions and not others. It is a verifiable fact that you like chocolate. It is a contingent property of my brain that I care, but I don't see what sort of argument that it is correct for me to care could even in principle be inherently compelling.
I meant the first question in a very pragmatic way: what is it that you are trying to say when you say that something is good? What information does it represent?
It would be clearer by analogy to factual claims: we can do lots of philosophy about the exact meaning of saying that I have a dog, but in the end we share an objective reality in which there are real particles (or a wave function approximately decomposable into particles, or whatever) organized in patterns, which give rise to patterns of interaction with our senses that we learn to associate with the word "dog". That latent shared reality ultimately allows us to talk about dogs, and check whether there is a dog in my house, and usually agree about the result. Every reflection and generalization that we do is ultimately about that, and can achieve something meaningful because of that.
I do not see the analogous story for moral reflection.
Thanks for the reply.
To make sure that I understand your position: are you a realist, and what do you think is the meaning of moral facts? (I'm not an error theorist but something like a "meta-error theorist" - I think that people do try to claim something, but I'm not sure how that thing could map to external reality.)
Then the next question, that will be highly relevant to the research that you propose, is how do you think you know those facts if you do? (Or more generally, what is the actual work of reflecting on your values?)