If you are going to read just one thing I wrote, read The Problem of the Criterion.
More AI-related stuff is collected over at PAISRI.
These are good intuitive arguments against these sorts of solutions, but I think there's a more formal argument we can make that these solutions are dangerous because they pose excess false positive risk. In particular, I think they fail to fully account for the risks of generalized Goodharting, as do most proposed solutions other than something like agent foundations.
Right. Nothing that happens within the same Hubble volume can really be said to be causally disconnected. Nonetheless I like the point of the OP, even if it's made in an imprecise way.
I continue to be excited about this line of work. I feel like you're slowly figuring out how to formalize ontology in a way reflective of what we actually do and generalizing it. This is something missing from a lot of other approaches.
This is pretty exciting. I've not done any direct work to push forward alignment in the last couple of years, but this is exactly the sort of direction I was hoping someone would go when I wrote my research agenda for deconfusing human values. What came out of that agenda was a set of research problems I wasn't equipped to tackle myself, and I'm very happy to say you've done the sort of thing I had hoped for.
On first pass this seems to address many of the common problems with traditional approaches to formalizing values. I hope that this proves a fruitful line of research!
Re: Project 4, you might find my semi-abandoned research agenda for deconfusing human values useful (abandoned mostly because I wasn't, and still am not, in a position to make further progress on it).
Re: Project 2
This project’s goal is to better understand the bridge principles needed between subjective, first person optimality and objective, third person success.
This seems quite valuable, because there is, properly speaking, no objective, third-person perspective from which we can speak; there is only the inferred sense, from within our first-person perspectives, that something exists which looks to us like a third-person perspective. I thus think this is a potentially fruitful line of research, since the proposed premise contains exactly the confusion that needs to be unraveled before we can address what is really something more like intersubjective agreement about what the world is like.
As it happens, I think this is a rather important topic. Failure to consider and mitigate the risk of assumptions creates both false-negative (less concerning) and false-positive (more concerning) risks when attempting to build aligned AI.
AlphaGo is fairly constrained in what it's designed to optimize for, but it still has the standard failure mode of "things we forgot to encode". For example, AlphaGo could suffer the error of instrumental power-grabbing, seizing resources in order to get better at winning Go, because we misspecified what we asked it to measure. This is a kind of failure introduced into the system by humans failing to make m(X) adequately evaluate X as we intended: we cared about winning Go games while also minimizing side effects, but when we constructed m(X) we may have forgotten about minimizing side effects.
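A toy sketch of this failure mode, with hypothetical policy names and made-up numbers: when the measure m(X) omits a term we actually care about (here, a side-effect penalty), an optimizer that maximizes m(X) can select exactly the behavior we wanted to rule out.

```python
# Toy Goodhart illustration (hypothetical policies and numbers).
# Each candidate policy: (name, games_won, side_effects)
policies = [
    ("play_normally",      80, 1),
    ("grab_extra_compute", 95, 50),  # instrumental power grab
]

def true_utility(wins, side_effects):
    """What we actually intended: win games while minimizing side effects."""
    return wins - side_effects

def m(wins, side_effects):
    """Misspecified measure: we forgot to encode the side-effect penalty."""
    return wins

# Optimizing the measure picks the power-grabbing policy;
# optimizing the intended utility does not.
best_by_measure = max(policies, key=lambda p: m(p[1], p[2]))
best_by_utility = max(policies, key=lambda p: true_utility(p[1], p[2]))

print(best_by_measure[0])  # → grab_extra_compute
print(best_by_utility[0])  # → play_normally
```

The point is only that the divergence comes from the gap between m(X) and the intended evaluation of X, not from anything exotic in the optimizer.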
Really liking this model. It seems to actually deal with the problem of embeddedness for agents and the fact that there is no clear boundary to draw around what we call an agent other than one that's convenient for some purpose.
I've obviously got thoughts on how this is operationalizing insights about "no-self" and dependent origination, but that doesn't seem too important to get into, other than to say it gives me more reason to think this is likely to be useful.
"Error" here is all sources of error, not just error in the measurement equipment. So bribing surveyors is a kind of error in my model.