Director of Research at PAISRI
A decent intuition might be to think about what exploration looks like in human children. Children under the age of 5 but old enough to move about on their own (so toddlers, not babies or "big kids") face a lot of dangers in the modern world if they are allowed to run their natural exploration algorithm. Heck, I'm not even sure this is a modern problem: toddlers don't understand electrical sockets and moving vehicles and need to be protected from exploring them, but they also have to be protected from more traditional dangers that they would definitely otherwise check out, like dangerous plants and animals. Of course, since toddlers grow up into powerful adult humans, this is a kind of evidence that they remain powerful enough explorers, even with protections, to become capable of functioning in society.
Obviously there are a lot of caveats to taking this idea too seriously since I've ignored issues related to human development, but I think it points in the right direction of something everyday that reflects this result.
I don't recall seeing anything addressing this directly: has there been any progress towards dealing with concerns about Goodharting in debate, and more generally the risk of mesa-optimization in the debate approach? The typical risk scenario is something like this: training debate creates AIs that are good at convincing humans rather than at convincing humans of the truth, and once you leave the training set of questions where the truth can be reasonably determined independent of the debate mechanism, we'll experience what amounts to a treacherous turn, because the debate training process accidentally optimized for a different target (convince humans) than the one intended (convince humans of true statements).
For myself this continues to be a concern which seems inadequately addressed and makes me nervous about the safety of debate, much less its adequacy as a safety mechanism.
Nevertheless, extensions to PAL might still be useful. Agency rents are what might allow AI agents to accumulate wealth and influence, and agency models are the best way we have to learn about the size of these rents. These findings should inform a wide range of future scenarios, perhaps barring extreme ones like Bostrom/Yudkowsky.
For myself, this is the most exciting thing in this post—the possibility of taking the principal-agent model and using it to reason about AI even if most of the existing principal-agent literature doesn't provide results that apply. I see little here to make me think the principal-agent model wouldn't be useful, only that it hasn't been used in ways that are useful to AI risk scenarios yet. It seems worthwhile, for example, to pursue research on the principal-agent problem with some of the adjustments to make it better apply to AI scenarios, such as letting the agent be more powerful than the principal and adjusting the rent measure to better work with AI.
Maybe this approach won't yield anything (as we should expect on priors, simply because most approaches to AI safety are likely not going to work), but it seems worth exploring further on the chance it can deliver valuable insights, even if, as you say, the existing literature doesn't offer much that is directly useful to AI risk now.
I think I basically agree with this and think it's right. In some ways you might say focusing too much on "values" acts like a barrier to deeper investigation of the mechanisms at work here, and I think looking deeper is necessary because I expect that optimization against the value abstraction layer alone will result in Goodharting.
In some sense that's a direction I might be moving in with my thinking, but there is still something that humans identify as values that they care about, so I expect there to be some real phenomenon going on that needs to be considered to get good outcomes, since I expect the default remains a bad outcome if we don't pay attention to whatever it is that makes humans care about stuff. I expect most work today on value learning is not going to get us where we want to go because it's working with the wrong abstractions, and my goal in this work is to dissolve those abstractions to find better ones for our long-term purposes.
However, it's worth noting that saying the agent is mistaken about the state of the world is really an anthropomorphization. It was actually perfectly correct in inferring where the red part of the world was -- we just didn't want it to go to that part of the world. We model the agent as being 'mistaken' about where the landing pad is, but it works equally well to model the agent as having goals that are counter to ours.
That we can flip our perspective like this suggests to me that thinking of the agent as having different goals is likely still anthropomorphic, or at least teleological, reasoning that results from us modeling this agent as having dispositions it doesn't actually have.
I'm not sure what to offer as an alternative since we're not talking about a category where I feel grounded enough to see clearly what might be really going on, much less offer a more useful abstraction that avoids this problem, but I think it's worth considering that there's a deeper confusion here that this exposes but doesn't resolve.
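To make the perspective flip concrete, here is a minimal toy sketch. Everything in it (the grid coordinates, the `world` dictionary, and both function names) is hypothetical and invented for illustration; it is not taken from the post being discussed. It shows two descriptions of the agent, "wrong belief about the pad" and "different goal", producing identical behavior, which is why behavior alone can't tell us which description is the right one.

```python
from typing import Dict, Tuple

Cell = Tuple[int, int]

def act_mistaken_belief(world: Dict[str, Cell]) -> Cell:
    """Model A (hypothetical): the agent wants the landing pad,
    but believes the pad is wherever the red region is."""
    believed_pad_location = world["red"]  # belief picked up in training
    return believed_pad_location          # heads to where it believes the pad is

def act_different_goal(world: Dict[str, Cell]) -> Cell:
    """Model B (hypothetical): the agent's beliefs are accurate,
    but its goal is simply to reach the red region."""
    goal = "red"
    return world[goal]

# Assumed toy worlds: in training the pad happens to be red; at test time it is not.
train_world = {"landing_pad": (0, 0), "red": (0, 0)}
test_world = {"landing_pad": (0, 0), "red": (3, 4)}

for name, world in [("train", train_world), ("test", test_world)]:
    a = act_mistaken_belief(world)
    b = act_different_goal(world)
    print(name, "model A ->", a, "| model B ->", b,
          "| reaches pad:", a == world["landing_pad"])
```

Both models pick the same cell in every world, so the difference between them lives entirely in how we choose to describe the agent, not in anything it does.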
I'd describe that as a statistical regularity over statistical regularities over preferences.
But the "meta-preferences" are a bit more worrying. Are they genuine meta-preferences? Especially since the second one is one that was more subconscious, and the third one looks more like a standard preference than a meta-preference. If the category of meta-preference is not clear, then that part of the research agenda needs to be improved.
I think one of the challenges is that, to me at least, it's still unclear if we really have anything like meta-preferences that behave in systematic ways. That is, is there a systematic way in which our highly conditional preferences (which, in a very real sense, exist only momentarily at a particular decision point situated within the causal history of the universe) combine such that we can say more than that there are some statistical regularities to our preferences? Our preferences may manage to have some coherent statistical features about which we can make some stochastically consistent statements, but I think this falls short of what we are usually hoping for in terms of meta-preferences, and certainly seems to fall short in terms of how I understand you to be thinking about them (though maybe I misunderstand you: I think of you as thinking of meta-preferences as something that can ultimately be made to have nice mathematical properties, like some version of rationality, that would allow them to be optimized against without weird things happening).
I'm actually not really sure. We have some vague notion that, for example, my preference for eating pizza shouldn't result in attempts at unbounded pizza eating maximization, and I would probably be unhappy, from the perspective of my current values, if a maximizing agent saw I liked pizza the best of all foods and then proceeded to feed me only pizza forever, even if it modified me such that I would maximally enjoy the pizza each time and not get bored of it.
Thinking more in terms of regressional Goodharting, maybe something like not deviating from the true target because of optimizing for the measure of it. Consider the classic rat extermination example of Goodharting. We already know collecting rat tails as evidence of extermination is a function that leads to weird effects. Does there exist a function that measures rat exterminations that, when optimized for, produces the intended effect (extermination of rats) without doing anything "weird" (e.g. generating unintended side effects, or maximizing rat reproduction so we can exterminate more of them), and that just straightforwardly leads to the extinction of rats and nothing else?
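As a rough illustration of the rat tail example, here is a toy simulation. All of the numbers (initial population, catch rate, breeding rate) and the two strategy names are assumptions made up for the sketch, not data about any real bounty program. The point is only that a collector who optimizes the measure (tails handed in) can beat an honest exterminator on the measure while leaving the true target (rats alive) worse off.

```python
from dataclasses import dataclass

@dataclass
class Outcome:
    tails_collected: int  # the proxy measure the bounty pays for
    rats_remaining: int   # the true target we actually care about

def run_bounty(strategy: str, initial_rats: int = 100, rounds: int = 5) -> Outcome:
    """Simulate a tail bounty under one of two hypothetical strategies.

    'exterminate': catch wild rats and hand in their tails.
    'farm': breed rats, hand in the bred rats' tails, ignore the wild rats.
    """
    wild_rats = initial_rats
    farm_rats = 10 if strategy == "farm" else 0  # assumed breeding stock
    tails = 0
    for _ in range(rounds):
        if strategy == "exterminate":
            killed = wild_rats // 2   # assumed: half the wild rats are caught per round
            wild_rats -= killed
            tails += killed
        elif strategy == "farm":
            farm_rats *= 3            # assumed: farmed rats triple per round
            harvested = farm_rats // 2
            farm_rats -= harvested
            tails += harvested
        else:
            raise ValueError(f"unknown strategy: {strategy}")
    return Outcome(tails_collected=tails, rats_remaining=wild_rats + farm_rats)

honest = run_bounty("exterminate")
gamed = run_bounty("farm")
print("exterminate:", honest)
print("farm:       ", gamed)
```

Under these assumed rates the farming strategy collects more tails and ends with more rats than it started with. The question above amounts to asking whether any measure could replace `tails_collected` such that no strategy scores well on it while leaving `rats_remaining` high.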
Thinking about my focus on a theory of human values for AI alignment, the problem is quite hard when we ask for a way to precisely specify values. I might state the problem as something like finding "a theory of human values accurate and precise enough that its predictions don't come apart under extreme optimization". To borrow Isnasene's notation, here X = "a theory of human values accurate and precise enough" and Y = "its predictions don't come apart under extreme optimization".
So what is an inverse problem with X' and Y'? A Y' might be something like "functions that behave as expected under extreme optimization", where "behave as expected" is something like no Goodhart effects. We could even just be more narrow and make Y' = "functions that don't exhibit Goodhart effects under extreme optimization". Then the X' would be something like a generalized description of the classes of functions that satisfy Y'.
Doing the double inverse, we would try to find X from X' by looking at what properties hold for this class of functions that don't suffer from Goodharting, and use them to help us identify what would be needed to create an adequate theory of human values.