I get worried about things like this article that showed up on the Partnership on AI blog. Reading it there's nothing I can really object to in the body of post: it's mostly about narrow AI alignment and promotes a positive message of targeting things that benefit society rather than narrowly maximize a simple metric. How it's titled "Aligning AI to Human Values means Picking the Right Metrics" and that implies to me a normative claim that reads in my head something like "to build aligned AI it is necessary and sufficient to pick the right metrics" which is something I think few would agree with. Yet if I was a casual observer just reading the title of this post I might come away with the impression that AI alignment is as easy as just optimizing for something prosocial, not that there are lots of hard problems to be solved to even get AI to do what you want, let alone to pick something beneficial to humanity to do.
To be fair this article has a standard "not necessarily the views of PAI, etc." disclaimer, but then the author is a research fellow at PAI.
This makes me a bit nervous about the effect of PAI on promoting AI safety in industry, especially if it effectively downplays it or makes it seem easier than it is in ways that either encourages or fails to curtail risky behavior in the use of AI in industry.
I recently watched all 7 seasons of HBO's "Silicon Valley" and the final episode (or really the final 4 episodes leading up into the final one) did a really great job of hitting on some important ideas we talk about in AI safety.
Now, the show in earlier seasons has played with the idea of AI with things like an obvious parody of Ben Goertzel and Sophia, discussion of Roko's Basilisk, and of course AI that Goodharts. In fact, Goodharting is a pivotal plot point in how the show ends, along with a Petrov-esque ending where hard choices have to be made under uncertainty to protect humanity and it has to be kept a secret due to an information hazard.
Goodhart, Petrov, and information hazards are not mentioned by name in the show, but the topics are clearly present. Given that the show was/is popular with folks in the SF Bay Area tech scene because it does such a good job of mirroring back what it's like to live in that scene, even if it's a hyperbolic characterization, I wonder if and hope that this will helpfully nudge folks towards normalizing taking AI safety seriously and seeing it as virtuous to forgo personal gain in exchange for safeguarding humanity.
I don't expect for things to change dramatically because of the show, but on the margin it might be working to make us a little bit safer. For that reason I think it's likely a good idea to encourage folks not already dedicated to AI safety to watch the show, so long as the effort involved in minimal.
As I work towards becoming less confused about what we mean when we talk about values, I find that it feels a lot like I'm working on a jigsaw puzzle where I don't know what the picture is. Also all the pieces have been scattered around the room and I have to find the pieces first, digging between couch cushions and looking under the rug and behind the bookcase, let alone figure out how they fit together or what they fit together to describe.
Yes, we have some pieces already and others think they know (infer, guess) what the picture is from those (it's a bear! it's a cat! it's a woman in a fur coat!), and as I work I find it helpful to keep updating my own guess because even when it's wrong it sometimes helps me think of new ways to try combining the pieces or to know what pieces might be missing that I should go look for, but it also often feels like I'm failing all the time because I'm updating rapidly based on new information and that keeps changing my best guess.
I suspect this is a common experience for folks working on problems in AI safety and many other complex problems, so I figured I'd share this metaphor I recently hit on for making sense of what it is like to do this kind of work.
I'm fairly pessimistic on our ability to build aligned AI. My take is roughly that it's theoretically impossible and at best we might build AI that is aligned well enough that we don't lose. I've not written one thing to really summarize this or prove it, though.
The source of my take comes from two facts:
Stuart Armstrong has made a case for (2) with his no free lunch theorem. I've not seen anyone formally make the case for (1), though.
Is this something worth trying to prove? That Goodharting is unavoidable and at most we can try to contain its effects?
I'm many years out from doing math full time so I'm not sure if I could make a rigorous proof of it, but this seems to be something that people disagree on sometimes (arguing that Goodharting can be overcome) but I think most of those discussions don't get very precise about what that means.
This paper gives a mathematical model of when Goodharting will occur. To summarize: if
(1) a human has some collection s1,…,sn of things which she values,
(2) a robot has access to a proxy utility function which takes into account some strict subset of those things, and
(3) the robot can freely vary how much of s1,…,sn there are in the world, subject only to resource constraints that make the si trade off against each other,
then when the robot optimizes for its proxy utility, it will minimize all si's which its proxy utility function doesn't take into account. If you impose a further condition which ensures that you can't get too much utility by only maximizing some strict subset of the si's (e.g. assuming diminishing marginal returns), then the optimum found by the robot will be suboptimal for the human's true utility function.
That said, I wasn't super-impressed by this paper -- the above is pretty obvious and the mathematical model doesn't elucidate anything, IMO.
Moreover, I think this model doesn't interact much with the skeptical take about whether Goodhart's Law implies doom in practice. Namely, here are some things I believe about the world which this model doesn't take into account:
(1) Lots of the things we value are correlated with each other over "realistically attainable" distributions of world states. Or in other words, for many pairs si,sj of things we care about, it is hard (concretely, requires a very capable AI) to increase the amount of si without also increasing the amount of sj.
(2) The utility functions of future AIs will be learned from humans in such a way that as the capabilities of AI systems increase, so will their ability to model human preferences.
If (1) is true, then for each given capabilities level, there is some room for error for our proxy utility functions (within which an agent at that capabilities level won't be able to decouple our proxy utility function from our true utility function); this permissible error margin shrinks with increasing capabilities. If you buy (2), then you might additionally think that the actual error margin between learned proxy utility functions and our true utility function will shrink more rapidly than the permissible error margin as AI capabilities grow. (Whether or not you actually do believe that value learning will beat capabilities in this race probably depends on a whole lot of other empirical beliefs, or so it seems to me.)
This thread (which you might have already seen) has some good discussion about whether Goodharting will be a big problem in practice.
I actually don't think that model is general enough. Like, I think Goodharting is just a fact of control system's observing.
Suppose we have a simple control system with output X and a governor G. G takes a measurement m(X) (an observation) of X. So long as m(X) is not error free (and I think we can agree that no real world system can be actually error free), then X=m(X)+ϵ for some error factor ϵ. Since G uses m(X) to regulate the system to change X, we now have error influencing the value of X. Now applying the standard reasoning for Goodhart, in the limit of optimization pressure (i.e. G regulating the value of X for long enough), ϵ comes to dominate the value of X.
This is a bit handwavy, but I'm pretty sure it's true, which means in theory any attempt to optimize for anything will, under enough optimization pressure, become dominated by error, whether that's human values or something else. The only interesting question is can we control the error enough, either through better measurement or less optimization pressure, such that we can get enough signal to be happy with the output.
Hmm, I'm not sure I understand -- it doesn't seem to me like noisy observations ought to pose a big problem to control systems in general.
For example, suppose we want to minimize the number of mosquitos in the U.S., and we access to noisy estimates of mosquito counts in each county. This may result in us allocating resources slightly inefficiently (e.g. overspending resources on counties that have fewer mosquitos than we think), but we'll still always be doing the approximately correct thing and mosquito counts will go down. In particular, I don't see a sense in which the error "comes to dominate" the thing we're optimizing.
One concern which does make sense to me (and I'm not sure if I'm steelmanning your point or just saying something completely different) is that under extreme optimization pressure, measurements might become decoupled from the thing they're supposed to measure. In the mosquito example, this would look like us bribing the surveyors to report artificially low mosquito counts instead of actually trying to affect real-world mosquito counts.
If this is your primary concern regarding Goodhart's Law, then I agree the model above doesn't obviously capture it. I guess it's more precisely a model of proxy misspecification.
"Error" here is all sources of error, not just error in the measurement equipment. So bribing surveyors is a kind of error in my model.
Can you explain where there is an error term in AlphaGo or where an error term might appear in hypothetical model similar to AlphaGo trained much longer with much more numerous parameters and computational resources?
AlphaGo is fairly constrained in what it's designed to optimize for, but it still has the standard failure mode of "things we forgot to encode". So for example AlphaGo could suffer the error of instrumental power grabbing in order to be able to get better at winning Go because we misspecified what we asked it to measure. This is a kind of failure introduced into the systems by humans failing to make m(X) adequately evaluate X as we intended, since we cared about winning Go games while also minimizing side effects, but maybe when we constructed m(X) we forgot about minimizing side effects.
At least one person here disagrees with you on Goodharting. (I do.)
You've written before on this site if I recall correctly that Eliezer's 2004 CEV proposal is unworkable because of Goodharting. I am granting myself the luxury of not bothering to look up your previous statement because you can contradict me if my recollection is incorrect.
I believe that the CEV proposal is probably achievable by humans if those humans had enough time and enough resources (money, talent, protection from meddling) and that if it is not achievable, it is because of reasons other than Goodhart's law.
(Sadly, an unaligned superintelligence is much easier for humans living in 2022 to create than a CEV-aligned superintelligence is, so we are probably all going to die IMHO.)
Perhaps before discussing the CEV proposal we should discuss a simpler question, namely, whether you believe that Goodharting inevitably ruins the plans of any group setting out intentionally to create a superintelligent paperclip maximizer.
Another simple goal we might discuss is a superintelligence (SI) whose goal is to shove as much matter as possible into a black hole or an SI that "shuts itself off" within 3 months of its launching where "shuts itself off" means stops trying to survive or to affect reality in any way.