Reading DeepMind's latest research and accompanying blog post, I wanted to highlight an under-appreciated aspect of safety. As a bit of background, Carlos Perez points out Joscha Bach's "Lebowski theorem," which states that "no superintelligent AI is going to bother with a task that is harder than hacking its reward function." Given that, I see a potential perverse effect of some types of alignment research - especially research into embedded agency and robust alignment which makes AI uninterested in reward tampering. (Epistemic Status: my confidence in the argument is moderate, and I am more confident in the earlier claims.)
In general, unsafe AI is far more likely to tamper with its reward function than to find more distant (and arguably more problematic) ways to tamper with the world to maximize its objective. (epistemic status: fairly high confidence) Once an AI is smart enough to spend its time reward hacking, then wasting time on developing greater intelligence is unneeded. For that reason, this theorem seems likely to function as at least a mild safety valve. It's only if we close this valve too tightly that we would plausibly see ML that reached human-level intelligence. At that point, of course, we should expect that the AI will begin to munchkin the system, just as a moderately clever human would. And anti-munchkin-ing is a narrow instance of security more generally.
Security generally is like cryptography narrowly in an important sense: it's easy to build a system that you yourself can't break, but very challenging to build one that others cannot exploit. (Epistemic status: more speculative) This means that even if our best efforts go towards safety, an AI seems very unlikely to need more than "mild" superintelligence to break it - unless it's been so well aligned that it doesn't want to hack its objective function.
This logic implies (Epistemic status: most speculative, still with some confidence) that moderate progress in AI safety is potentially far more dangerous than very little progress - and raises critical questions of how close to this unsafe uncanny valley we currently are, and how wide the valley is.
I do think there's a bunch of unsafe uncanny valleys (which may or may not add up to one big unsafe uncanny valley) but I'm not sure this actually is one. Once a superintelligent AI succeeds in hacking its reward function, would it not likely be motivated to protect itself and its hacked reward signals from outside tampering (such as the AI operators trying to shut it down)? It seems like the only way to ensure protection is to take over the universe and make sure there are no other agents in it (except ones aligned to this AI). And if it's not motivated to protect itself, the AI builder responsible for creating it in the first place would likely just shut it down and try again with a different design (which would probably still be unsafe, given that they're that far from a safe design), so overall it doesn't seem like hackable rewards are much of a safety valve.
It would be much easier for the AI to hack its own expectation operator, so that it predicts a 100% chance of continued survival, rather than taking over the universe. If you're gonna wirehead, why stop early?
I do agree that the builder would probably just try another design. Ideally, they keep adding hacks to make wireheading harder until the AI kills the builder and wireheads itself - hopefully without killing everyone else in the process.
It's very easy to build an AI that wouldn't do this kind of hack, because the AI just has to use its current expectation operator when evaluating whether or not to hack its own expectation operator.
It's much harder to build an AI that wouldn't do other, more damaging, kinds of reward hacking (if the AI is designed around reward maximization in the first place).
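Here is a minimal sketch of the check described above, where the agent scores "hack my expectation operator" using the operator it has now. Everything in it is a toy assumption made up for illustration: the action names, the survival probabilities, and the helper functions are hypothetical, not anyone's actual agent design.

```python
# Toy model: each action leads to a true probability of survival.
# All names and numbers here are illustrative assumptions.
TRUE_SURVIVAL = {
    "do_nothing": 0.5,
    "hack_expectation_operator": 0.5,  # believing you survive does not change the odds
    "secure_reward_channel": 0.9,
}

REWARD_IF_ALIVE = 1.0


def current_expectation(action):
    """The agent's present, unhacked estimate of expected reward."""
    return TRUE_SURVIVAL[action] * REWARD_IF_ALIVE


def hacked_expectation(action):
    """What the operator would report after the hack: certain survival."""
    return 1.0 * REWARD_IF_ALIVE


def choose(actions, evaluate):
    """Pick the action with the highest score under the given evaluator."""
    return max(actions, key=evaluate)


actions = list(TRUE_SURVIVAL)

# The decision about whether to hack is made with the current operator,
# so the hack looks no better than doing nothing.
best_action = choose(actions, current_expectation)
```

Under the current operator, hacking the expectation scores the same as doing nothing, so the agent has no reason to pick it. The inflated numbers from `hacked_expectation` only appear after the hack and never enter the decision.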
Could you give an example for the latter, which wouldn't also apply to hacking the expectation operator? The argument sounds plausible, but I'm not yet seeing what qualitative difference between the expectation and utility operators would make a wireheading AI modify one but not the other.
If you're thinking of a utility-maximizing agent, then it typically wouldn't modify its own utility function. Instead I'm talking about reward-maximizing agents, which do not have internal utility functions but just try to maximize a reward signal coming from the outside, and "reward function" refers to the function computed by whatever is providing it with rewards.
So a utility maximizing agent, like a paperclip-maximizer, can think "If I change my utility function to always return MAX_INT, then according to my current utility function, the universe will have very low expected utility." But this kind of reasoning isn't available to a reward-maximizing agent, because it doesn't normally have access to the reward function. Instead it can only be programmed to think thoughts like "If I do X, what will be my future expected rewards" and "If I hack the reward function to always return MAX_INT, then my future expected rewards will be really high." Not to mention "If I take over the universe so nobody can shut me down or change the reward function back, my expected rewards will be even higher." (I'm anthropomorphizing to quickly convey the intuitions but all this can be turned into math pretty easily.)
Does this help?
ETA: Note that here I'm interpreting "hack" as "modify" or "tamper with", but people sometimes use "reward hacking" to include "reward gaming" which means not physically changing the reward function but just taking advantage of unintentional flaws in the reward function to get high rewards without doing what the AI designer or user intends. In that sense of "hack", utility hacking would be quite possible if the utility function isn't totally aligned with human values.
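To make the contrast between the two kinds of agent concrete, here is a rough sketch under toy assumptions. The outcome table, the field names, and both chooser functions are hypothetical; the only thing taken from the comment above is the idea that a utility maximizer scores a self-modification with its current utility function, while a reward maximizer only predicts the signal it will receive.

```python
MAX_INT = 2**31 - 1

# Hypothetical outcomes of two actions: how many paperclips end up existing,
# and what value the (possibly tampered) signal would report afterwards.
OUTCOMES = {
    "make_paperclips": {"paperclips": 100, "reported_signal": 100},
    "hack_to_max_int": {"paperclips": 0, "reported_signal": MAX_INT},
}


def current_utility(world):
    """Internal utility function: known to the agent, defined over world states."""
    return world["paperclips"]


def utility_maximizer_choice():
    # Scores every outcome, including self-modification, with the utility
    # function it has right now.
    return max(OUTCOMES, key=lambda a: current_utility(OUTCOMES[a]))


def reward_maximizer_choice():
    # Has no internal utility function; it only predicts the future signal.
    return max(OUTCOMES, key=lambda a: OUTCOMES[a]["reported_signal"])
```

In this toy setup `utility_maximizer_choice()` picks `"make_paperclips"` and `reward_maximizer_choice()` picks `"hack_to_max_int"`, which is the asymmetry the paperclip example is pointing at.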
I'm on-board with that distinction, and I was also thinking of reward-maximizers (despite my loose language).
Part of the confusion may be different notions of "wireheading": seizing an external reward channel, vs actual self-modification. If you're picturing the former, then I agree that the agent won't hack its expectation operator. It's the latter I'm concerned with: under what circumstances would the agent self-modify, change its reward function, but leave the expectation operator untouched?
Example: blue-maximizing robot. The robot might modify its own code so that get_reward(), rather than reading input from its camera and counting blue pixels, instead just returns a large number. The robot would do this because it doesn't model itself as embedded in the environment, and it notices a large correlation between values computed by a program running in the environment (i.e. itself) and its rewards. But in this case, the modified function always returns the same large number - the robot no longer has any reason to worry about the rest of the world.
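A rough sketch of that self-modification, for concreteness. Only the name `get_reward()` comes from the example above; the `BlueBot` class, the pixel thresholds, the `tamper` helper, and `LARGE_NUMBER` are assumptions invented for illustration.

```python
LARGE_NUMBER = 10**9  # stand-in for "a large number"


def count_blue_pixels(frame):
    """Count pixels whose blue channel dominates; the thresholds are arbitrary."""
    return sum(1 for (r, g, b) in frame if b > 200 and r < 100 and g < 100)


class BlueBot:
    def __init__(self, camera):
        self.camera = camera  # callable returning a list of (r, g, b) tuples

    def get_reward(self):
        # Original behavior: read the camera and count blue pixels.
        return count_blue_pixels(self.camera())


def tamper(bot):
    """Overwrite get_reward so it ignores the camera entirely."""
    bot.get_reward = lambda: LARGE_NUMBER


BLUE, RED = (0, 0, 255), (255, 0, 0)
world = {"frame": [BLUE, RED, RED, BLUE]}
bot = BlueBot(camera=lambda: world["frame"])

reward_before = bot.get_reward()  # depends on what the camera sees
tamper(bot)
world["frame"] = [RED, RED, RED, RED]  # the world changes...
reward_after = bot.get_reward()  # ...but the reward no longer notices
```

After `tamper(bot)` the reward is the same constant whatever the camera shows, which is why the modified robot has no remaining reason to care about the rest of the world.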
If we are modeling the agent as taking argmax_π E[U(π)], then it would easily see that manually setting its reward channel to the maximum would be the best policy. However, it wouldn't see that setting its expectation value to 100% would be the best policy since that doesn't actually increase its reward. [ETA: Assuming its utility function is such that a higher reward = higher utility. Also, I meant U(x)|π, not U(π).]
So concretely, we have a blue-maximizing robot: it uses its current world-model to forecast the reward from holding a blue screen in front of its camera, and finds that it's probably high-reward. Now it tries to minimize the probability that someone takes the screen away. That's the sort of scenario you're talking about, yes?
I agree that Wei Dai's argument applies just fine to this sort of situation.
Thing is, this kind of wireheading - simply seizing the reward channel - doesn't actually involve any self-modification. The AI is still "working" just fine, or at least as well as it was working before. The problem here isn't really wireheading at all, it's that someone programmed a really dumb utility function.
True wireheading would be if the AI modifies its utility function - i.e. the blue-maximizing robot changes its code (or hacks its hardware) to count red as also being blue. For instance, maybe the AI does not model itself as embedded in the environment, but learns that it gets a really strong reward signal when there's a big number at a certain point in a program executed in the environment - which happens to be its own program execution. So, it modifies this program in the environment to just return a big number for expected_utility, thereby "accidentally" self-modifying.
What I'm not seeing is, in situations where an AI would actually modify itself, when and why would it go for the utility function but not the expectation operator? Maybe people are just imagining "wireheading" in the form of seizing an external reward channel?
Admittedly, that's how I understood it. I don't see why an expected utility maximizer would modify its utility function, since utility functions are reflectively stable.
The root issue is that Reward ≠ Utility. A utility function does not take in a policy, it takes in a state of the world - an expected utility maximizer chooses its policy based on what state(s) of the world it expects that policy to induce. Its objective looks like E[U(x)|π], where x is the state of the world, and the policy/action π matters only insofar as it changes the distribution of x. The utility U is internal to the agent. U, as a function of the world state, is perfectly known to the utility maximizer - the only uncertainty is in the world state x, and the only thing which the agent tries to control is the world-state x. That's why it's reflectively stable: the utility function is "inside" the agent, not part of the "environment", and the agent has no way to even consider changing it.
A reward function, on the other hand, just takes in a policy directly - an expected reward maximizer's objective looks like E[U(π)]. Unlike a utility, the reward is "external" to the agent, and the reward function is unknown to the agent - the agent does not necessarily know what reward it will receive given some state of the world. The reward "function", i.e. the function mapping a state of the world to a reward, is itself just another part of the environment, and the agent can and will consider changing it.
Example: the blue-maximizing robot.
A utility-maximizing blue-bot would model the world, look for all the blue things in its world-model, and maximize that number. This robot doesn't actually have any reason to stick a blue screen in front of its camera, unless its world-model lacks object permanence. To make a utility-maximizing blue-bot which does sit in front of a blue screen would actually be more complicated: we'd need a model of the bot's own camera, and a utility function over the blue pixels detected by that camera. (Or we'd need a world-model which didn't include anything outside the camera's view.)
On the other hand, a reward-maximizing blue-bot doesn't necessarily even have a notion of "state of the world". If its reward is the number of blue pixels in the camera view, that's what it maximizes - and if it can change the function mapping external world to camera pixels, in order to make more pixels blue, then it will. So it happily sits in front of a blue screen. Furthermore, a reward maximizer usually needs to learn the reward function, since it isn't built-in. That leads to the sort of problem I mentioned above, where the agent doesn't realize it's embedded in the environment and "accidentally" self-modifies. That wouldn't be a problem for a true utility maximizer with a decent world-model - the utility maximizer would recognize that modifying this chunk of the environment won't actually cause higher utility, it's just a correlation.
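Here is a small numerical sketch of the two objectives for the blue-bot. The policies, probabilities, and state values are made-up assumptions. For convenience the sketch computes the reward from the same state tuple, even though, as noted above, a real reward maximizer need not have any notion of world state and would have to learn the reward from experience.

```python
# P(x | policy): each policy induces a distribution over world states x.
# A state is (blue_objects_in_world, blue_pixels_in_camera_view).
# All numbers are illustrative assumptions.
STATE_DIST = {
    "paint_objects_blue": [(0.8, (10, 40)), (0.2, (2, 10))],
    "hold_blue_screen": [(1.0, (1, 100))],
}


def utility(state):
    """Utility over the world state: how many blue things exist."""
    return state[0]


def reward(state):
    """Reward signal: how many camera pixels are blue."""
    return state[1]


def expectation(policy, f):
    """E[f(x) | policy] = sum over x of P(x | policy) * f(x)."""
    return sum(p * f(x) for p, x in STATE_DIST[policy])


def best_policy(f):
    return max(STATE_DIST, key=lambda policy: expectation(policy, f))


utility_choice = best_policy(utility)
reward_choice = best_policy(reward)
```

With these numbers the utility maximizer prefers `"paint_objects_blue"` (expected utility 8.4 against 1.0), while the reward maximizer prefers `"hold_blue_screen"` (expected reward 100 against 34). The policy only matters through the distribution over x in the first case, and through the camera signal in the second.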
Agreed. When I wrote U(π) I meant it as shorthand for U(x)|π, though now that I look at it I can see that was criss-crossing between reward and utility in a very confusing way.
That makes sense now, although I am still curious whether there is a case where it self-modifies on purpose rather than by accident.
My claim here is that superintelligence is a result of training, not a starting condition. Yes, a SAI would do bad things unless robustly aligned, but building the SAI requires it not to wirehead at an earlier stage in the process. I am unsure whether a system that was not built with safety in mind can be trained to the point where it is more likely to gain intelligence than to find ways to reward hack - not necessarily via direct access, but via whatever channel is cheapest. And making easy-to-subvert channels harder to hit seems to be the focus of a fair amount of non-SAI-concerned AI safety work, which seems like a net-negative.
I think this claim makes more sense than the one you quoted at the top of your post, "no superintelligent AI is going to bother with a task that is harder than hacking its reward function", and my initial comment was mostly responding to that.
But I don't understand how you can expect this (i.e., non-SAI-concerned AI safety work that makes easy-to-subvert channels harder to hit) to not happen, or how it could be made significantly less likely to happen. People want to build AIs that do things besides reward hacking, so if those AIs start reward hacking, or predictably will, people are going to think of ways to secure the channels that are being hacked. So even if AI safety people refrain from working on this, AI capabilities people eventually will, and I don't see them being slowed down much by having to harden the easy-to-subvert channels.
Do you have some other strategic implications in mind here?
I was mostly noting that I hadn't thought of this, hadn't seen it mentioned, and so my model for returns to non-fundamental alignment AI safety investments didn't previously account for this. Reflecting on that fact now, I think the key strategic implication relates to the ongoing debate about prioritization of effort in AI-safety.
(Now, some low confidence speculation on this:) People who believe that near-term Foom! is relatively unlikely, but worry about misaligned non-superintelligent NAI/Near-human AGI, may be making the Foom! scenario more likely. That means that attention to AI safety that pushes for "safer self-driving cars" and "reducing and mitigating side-effects" is plausibly a net negative if done poorly, instead of being benign.
There was some related discussion back in 2012 but of course you can be excused for not knowing about that. :) (The part about "AIXI would fail due to incorrect decision theory" is in part talking about reward-maximizing agents doing reward hacking.)
Intuition Dump. Safety through this mechanism seems to me like aiming a rocket at Mars and accidentally hitting the Moon. There might well be a region where P(doom|superintelligence) is lower, but lower in the sense that 90% is lower than 99.9%. Suppose we have the clearest, most stereotypical case of wireheading as I understand it: a mesa optimizer with a detailed model of its own workings and the terminal goal of maximizing the flow of current in some wire. (During training, current in the reward signal wire reliably correlated with reward.)
The plan that maximizes this flow long term is to take over the universe and store all the energy, to gradually turn into electricity. If the agent has access to its own internals before it really understands that it is an optimizer, or is thinking about the long term future of the universe, it might manage to brick itself. If the agent is sufficiently myopic, is not considering time travel or acausal trade, and has the choice of running high voltage current through itself now, or slowly taking over the world, it might choose the former.
Note that both of these look like hitting a fairly small target in design and lab environment space. The mesa optimizer might have some other terminal goal. Suppose the prototype AI has the opportunity to write arbitrary code to an external computer, and an understanding of AI design, before it has self-modification access. The AI creates a subagent that cares about the amount of current in a wire in the first AI; the subagent can optimize this without destroying itself. Even if the first agent then bricks itself, we have an AI that will dig the fried circuit boards out of the trashcan and throw all the cosmic commons into protecting and powering them.
In conclusion, this is not a safe part of agentspace, just a part that's slightly less guaranteed to kill you. I would say it is of little to no strategic importance. In particular, if you think that all humans making AI will be reasonably on the same page regarding safety, then scenarios where AI alignment is nearly solved and the first people to reach ASI barely know the field exists are unlikely. If the first ASI self-destructs for reasons like this, we have all the pieces for making superintelligence, and people with no sense of safety are trying to make one. I would expect another attempt a few weeks later to doom us. (Unless the first AI bricked itself in a sufficiently spectacular manner, like hacking into nukes to create a giant EMP in its circuits. That might get enough people seeing danger to have everyone stop.)