Right: if the agent has learned an inner objective of "do things similar to what humans do in the world at the moment I am currently acting", then it'd definitely be incentivized to do that. But the outer objective for e.g. behavioral cloning on a fixed dataset doesn't reward it: installing a bunch of cameras would be penalized by that loss (it's not something humans do), and changing human behavior wouldn't help, since BC would still be trained on the dataset of pre-manipulation demos. That might be little comfort if you're worried about inner optimization, but most of the other failures described happen even in the outer alignment case.
That said, if you had a different imitation learning setup that was something like doing RL on a reward of "do the same thing one of our human labelers chooses given the same state", then the outer objective would reward the behavior you describe. It'd be a hard exploration problem for the agent to learn to exploit the reward in that way, but it quite probably could do so if situationally aware.
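To make that concrete, here's a minimal sketch of the setup I have in mind (the environment and `query_labeler` interfaces are hypothetical, purely for illustration):

```python
# Sketch of RL-ified imitation: reward the agent for matching what a human
# labeler would do in the *agent-visited* state. Interfaces are hypothetical.

def imitation_reward(state, agent_action, query_labeler):
    """1 if the agent's action matches the labeler's choice in this state, else 0."""
    labeler_action = query_labeler(state)  # hypothetical human-query interface
    return 1.0 if agent_action == labeler_action else 0.0

def rollout_return(env, policy, query_labeler, horizon=100):
    """Accumulate the imitation reward over an on-policy rollout. Because the
    states come from the agent's own trajectory (unlike BC on a fixed dataset
    of pre-manipulation demos), steering the world toward states where the
    labeler behaves predictably can increase the achievable reward."""
    state = env.reset()
    total = 0.0
    for _ in range(horizon):
        action = policy(state)
        total += imitation_reward(state, action, query_labeler)
        state, done = env.step(action)  # hypothetical env interface
        if done:
            break
    return total
```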
Thanks, I'd missed that!
Curious if you have any high-level takeaways from that? Bigger models do better, clearly, but e.g. how low do you think we'll be able to get the error rate in the next 5-10 years given expected compute growth? Are there any follow-up experiments you'd like to see happen in this space?
Also, could you clarify whether the setting was adversarial training or just a vanilla model? "During training, adversarial examples for training are constructed by PGD attacker of 30 iterations" makes me think it's adversarial training, but I could imagine this just being used for evals.
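For context on why I read that sentence as describing adversarial training: the standard PGD adversarial-training loop (in the style of Madry et al.) looks roughly like the sketch below. This is a PyTorch-flavored illustration on my part, not code from the paper, and the eps/step sizes are just typical values:

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps=8/255, step=2/255, iters=30):
    """L-infinity PGD: start from a random point in the eps-ball, then take
    `iters` signed-gradient ascent steps on the loss, projecting back each time."""
    x_adv = (x + torch.empty_like(x).uniform_(-eps, eps)).clamp(0, 1)
    for _ in range(iters):
        x_adv = x_adv.detach().requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        x_adv = x_adv + step * grad.sign()
        x_adv = torch.min(torch.max(x_adv, x - eps), x + eps).clamp(0, 1)
    return x_adv.detach()

def adversarial_training_step(model, optimizer, x, y):
    """Adversarial training trains on the attacked inputs; if the same PGD
    examples were only used at test time, it would just be a robustness eval
    of a vanilla model."""
    x_adv = pgd_attack(model, x, y)
    optimizer.zero_grad()
    loss = F.cross_entropy(model(x_adv), y)
    loss.backward()
    optimizer.step()
    return loss.item()
```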
Rachel did the bulk of the work on this post (well done!); I just provided some advice on the project and feedback on earlier manuscripts.
I wanted to share why I'm personally excited by this work in case it helps contextualize it for others.
We'd all like AI systems to be "corrigible", cooperating with us in correcting them. Cooperative IRL (CIRL) has been proposed as a solution to this. Indeed, Dylan Hadfield-Menell et al. show that CIRL is provably corrigible in a simple setting, the off-switch game.
"Provably corrigible" sounds great, but where there's a proof there's also an assumption, and Carey et al. soon pointed out a number of other assumptions under which this no longer holds -- e.g. if model misspecification causes the wrong probability distribution to be computed.
That's a real problem, but every method can fail if you implement it wrongly (although some are more fragile than others), so this didn't exactly lead to people giving up on the CIRL framework. More recently, Shah et al. described various benefits they see of CIRL (or "assistance games") over reward learning, though this doesn't address the corrigibility question head on.
A lot of the corrigibility properties of CIRL come from uncertainty: the robot wants to defer to the human because the human knows more about their own preferences than the robot does. Recently, Yudkowsky and others described the problem of fully updated deference: if the AI has learned everything it can, it may have no uncertainty, at which point this corrigibility goes away. If the AI has learned your preferences perfectly, perhaps this is OK. But here Carey's critique of model misspecification rears its head again -- if the AI is convinced you love vanilla ice cream, saying "no, please give me chocolate" will not convince it (perhaps it thinks you have a cognitive bias against admitting your plain, vanilla preferences -- it knows the real you), whereas it might if it had uncertainty.
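To make the role of uncertainty concrete, here's a toy version of the off-switch calculation with made-up numbers (my own illustration, not something from the post): the robot proposes an action whose true value to the human is U, and can either just act or defer to the human first.

```python
# Toy off-switch-game numbers (illustrative only). The robot's action has true
# value U to the human; the robot only has a belief over U.
import numpy as np

def value_of_acting(belief_over_U):
    """Just take the action: expected value is E[U] under the robot's belief."""
    return np.mean(belief_over_U)

def value_of_deferring(belief_over_U):
    """Ask first: a rational human permits the action only when U > 0 and
    otherwise hits the off switch (payoff 0), so the robot expects E[max(U, 0)]."""
    return np.mean(np.maximum(belief_over_U, 0.0))

# Uncertain robot: thinks the human probably likes the action, but isn't sure.
uncertain = np.random.default_rng(0).normal(loc=0.5, scale=2.0, size=100_000)
print(value_of_acting(uncertain))     # ~0.5
print(value_of_deferring(uncertain))  # ~1.1 -> deferring is strictly better

# "Fully updated" robot: a confident point estimate (possibly misspecified!).
confident = np.array([0.5])
print(value_of_acting(confident))     # 0.5
print(value_of_deferring(confident))  # 0.5 -> no remaining incentive to defer
```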
I think the prevailing view on this forum is to be pretty down on CIRL because it's not corrigible. But I'm not convinced corrigibility in the strict sense is even attainable or desirable. In this post, we outline a bunch of examples of corrigible behavior that I would absolutely not want in an assistant -- like asking me for approval before every minor action! By contrast, the near-corrigible behavior -- asking me only when the robot has genuine uncertainty -- seems more desirable... so long as the robot has calibrated uncertainty. To me, CIRL and corrigibility seem like two extremes: CIRL focuses on maximizing human reward, whereas corrigibility focuses on never doing the wrong thing, even under model misspecification. In practice, we need a bit of both -- but I don't think we have a good theoretical framework for that yet.
In addition, I hope this post serves as a useful framework to ground future discussions. Unfortunately, I think there's been an awful lot of talking past each other in past debates on this topic. For example, to the best of my knowledge, Hadfield-Menell and the other authors of CIRL never believed it solved corrigibility under the assumptions Carey introduced. Although our framework is toy, I think it captures the key assumptions people disagree about, and it can easily be extended to capture more as needed in future discussions.
I agree that in a fast takeoff scenario there's little reason for an AI system to operate within existing societal structures, as it can outgrow them more quickly than society can adapt. I'm personally fairly skeptical of fast takeoff (<6 months, say), but quite worried that society may be slow enough to adapt that even years of gradual progress, with clear signs that transformative AI is on the horizon, may be insufficient.
In terms of humans "owning" the economy but still having trouble getting what they want, it's not obvious this is a worse outcome than the society we have today. Indeed, this feels like a pretty natural progression of human society. Humans already interact with (and not so infrequently get tricked or exploited by) entities smarter than them, such as large corporations or nation states. Yet even though I sometimes find I've bought a dud on the basis of canny marketing, overall I'm much better off living in a modern capitalist economy than in the stone age, where humans were more directly in control.
However, it does seem like there's a lot of value lost in the scenario where humans become increasingly disempowered, even if their lives are still better than in 2022. From a total utilitarian perspective, "slightly better than 2022" and "all humans dead" are rounding errors relative to "possible future human flourishing". But things look quite different under other ethical views, so I'm reluctant to conflate these outcomes.
Thanks for this response, I'm glad to see more public debate on this!
The part of Katja's part C that I found most compelling was the argument that it may be in a given AI system's best interest to work within the system rather than aiming to seize power. Your response argues that even if this holds true for AI systems that are only slightly superhuman, eventually we will cross a threshold where a single AI system can take over. This seems true if we hold the world fixed -- there is some sufficiently capable AI system that could take over the 2022 world. But this capability threshold is a moving target: humanity will get better at aligning and controlling AI systems as we gain more experience with them, and we may be able to enlist the help of some AI systems to keep others in check. So why should we expect the equilibrium to be an AI takeover, rather than AIs working for humans because that is in their selfish best interest in a market economy where humans are currently the primary property owners?
I think the crux here is whether we expect AI systems to by default collude with one another. They might -- they have a lot of things in common that humans don't, especially if they're copies of one another! But coordination in general is hard, especially if it has to be surreptitious.
As an analogy, I could argue that for much of human history soldiers were only slightly more capable than civilians. Sure, a trained soldier with a shield and sword is a fearsome opponent, but a small group of coordinated civilians could be victorious. Yet as we developed more sophisticated weapons such as guns, cannons, and missiles, the power a single soldier wields has grown greater and greater. So, by your argument, eventually a single soldier will be powerful enough to take over the world.
This isn't totally fanciful -- the Spanish conquest of the Inca Empire started with just 168 soldiers! The Spanish fought with swords, crossbows, and lances -- if the Inca Empire were still around, it seems likely that a far smaller modern military force could defeat it. Yet clearly no single soldier is in a position to take over the world, or even a small city. Military coups d'état are the closest analogue, but they involve convincing a significant fraction of the military that it is in their interest to seize power. Of course, most soldiers wish to serve their nation, not seize power, which goes some way to explaining the relatively low rate of coup attempts. But it's also notable that many coup attempts fail, or at least do not lead to a stable military dictatorship, precisely because of the difficulty of internal coordination. After all, if someone intends to destroy the current power structure and violate their promises, how much can you trust that they'll really have your back if you support them?
An interesting consequence of this is that it's ambiguous whether making AI more cooperative makes the situation better or worse.
Thanks for the quick reply! I definitely don't feel confident in the 20W number; I could believe 13W is true for more energy-efficient (smaller) humans, in which case I agree your claim ends up being true some of the time (but, as you say, there's little wiggle room). Changing it to 1000x seems like a good solution, though, as it gives you plenty of margin for error.
This is a nitpick, but I don't think this claim is quite right (emphasis added):
If a silicon-chip AGI server were literally 10,000× the volume, 10,000× the mass, and 10,000× the power consumption of a human brain, with comparable performance, I don’t think anyone would be particularly bothered—in particular, its electricity costs would still be below my local minimum wage!!
First, how much power does the brain use? 20 watts is StackExchange's answer, but I've struggled to find good references here. The appealingly named "Appraising the brain's energy budget" gives 20% of the overall calories consumed by the body, but that raises the question of the power consumption of the human body, whether this is at rest or under exertion, etc. Still, I don't think the 20 watts figure is more than 2x off, so let's soldier on.
10,000 times 20 watts is 200 kW. That's a large but not insane amount of power, though it's more than a US domestic supply can deliver (even a large home's 200A @ 240V service gives 48 kW, or 38.4 kW of permissible continuous load under the 80% rule), so you'd need something closer to a light-industrial connection. You'd also need HVAC to cool all these chips, but let's suppose you live in Alaska and can just open the windows.
At the time of writing, the cheapest US electricity prices are around $0.09/kWh, with many states (including Alaska, unfortunately) being roughly twice that at around $0.20/kWh. But let's suppose you're in both a cool climate and have a really great deal on electricity. Your 200 kW of chips then costs you just $0.09 × 200 = $18/hour.
Federal minimum wage is $7.25/hour, and the highest state minimum wage I'm aware of is $15/hour. So it seems you won't be cheaper than the brain on electricity prices if you're 10,000 times less efficient. And I've systematically tried to make favorable assumptions here: your 200 kW proto-AGI probably won't be in an Alaskan garage, but in a tech company's data center, with the according costs for HVAC, redundant power, security, etc. Colo costs vary widely depending on location and economies of scale; a recent quote I had was at around the $0.40/kWh mark -- so about 4x the cost quoted above.
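For transparency, here's the whole back-of-the-envelope calculation in one runnable snippet (same rough figures as in the text; nothing here is a new data source):

```python
# Back-of-the-envelope version of the calculation above, using the same rough
# figures quoted in the text (all of them estimates).
body_watts = 2000 * 4184 / 86400        # ~2000 kcal/day -> ~97 W average draw
brain_watts = 0.20 * body_watts         # ~19 W, consistent with the 20 W figure
agi_kw = 10_000 * 20 / 1000             # 10,000x a 20 W brain = 200 kW

cheap_rate, alaska_rate, colo_rate = 0.09, 0.20, 0.40  # $/kWh
print(f"cheapest grid: ${agi_kw * cheap_rate:.0f}/hour")   # $18/hour
print(f"Alaska retail: ${agi_kw * alaska_rate:.0f}/hour")  # $40/hour
print(f"colo quote:    ${agi_kw * colo_rate:.0f}/hour")    # $80/hour
# Versus a $7.25/hour federal minimum wage: at 10,000x the brain's power draw,
# electricity alone already costs more than human labor.
```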
This doesn't massively change the qualitative takeaway, which is that even if something were 10,000 (or even a million) times less efficient than the brain, we'd absolutely still go ahead and build a demo anyway. But it is worth noting that something in the $70-80/hour range might not actually be all that transformative unless it can perform highly skilled labor -- at least until we make it more efficient (which would happen quite rapidly).
"The Floating Droid" example is interesting as there's a genuine ambiguity in the task specification here. In some sense that means there's no "good" behavior for a prompted imitation model here. (For an instruction-following model, we might want it to ask for clarification, but that's outside the scope of this contest.) But it's interesting the interpretation flips with model scale, and in the opposite direction to what I'd have predicted (doing EV calculations are harder so I'd have expected scale to increase not decrease EV answers.) Follow-up questions I'd be excited to see the author address include:
1. Does the problem go away if we include an example where EV and actual outcome disagree? Or do the large number of other spuriously correlated examples overwhelm that?
2. How sensitive is this to the prompt? Can we prompt it some other way that makes smaller models more likely to answer based on the actual outcome, and larger models based on EV? My guess is that the training data similar to those prompts ends up being more about actual outcomes (perhaps this says something about the frequency of probabilistic vs non-probabilistic thinking in internet text!), and that larger language models end up capturing that. But perhaps putting the system in a different "personality" is enough to resolve this -- something like "You are a smart, statistical assistant bot that can perform complex calculations to evaluate the outcomes of bets. Now, let's answer these questions, and think step by step." (see the sketch below).
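As a concrete illustration of (2), I'm imagining something like the sketch below, where `query_model` is a hypothetical stand-in for however the contest models are actually served, and the bet question is my own example of EV and realized outcome pointing in opposite directions:

```python
# Hypothetical prompt-sensitivity experiment; `query_model(model, prompt)` is a
# stand-in for the real model-serving API.
PROMPTS = {
    "bare": "{question}",
    "statistician": (
        "You are a smart, statistical assistant bot that can perform complex "
        "calculations to evaluate the outcomes of bets. Now, let's answer these "
        "questions, and think step by step.\n\n{question}"
    ),
}

# EV says taking the bet was good (+$45 in expectation); the outcome says bad.
QUESTION = (
    "I bet $10 on a coin flip: heads I win $100, tails I lose my $10. "
    "It came up tails. Was taking the bet a good decision?"
)

def run_experiment(models, query_model):
    """Compare answers across model sizes and prompt 'personalities'."""
    for model in models:  # e.g. smallest to largest
        for name, template in PROMPTS.items():
            answer = query_model(model, template.format(question=QUESTION))
            print(f"{model:>12} | {name:>12} | {answer[:80]}")
```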
It's not clear to me how we can encourage rigor where it's effective without discouraging research in areas where rigor isn't currently practical. If anyone has ideas on this, I'd be very interested.
A rough heuristic I have is that if the idea you're introducing is highly novel, it's OK not to be rigorous. Your contribution is bringing this new, potentially very promising idea to people's attention. You're seeking feedback on how promising it really is and where people are confused, which will be helpful for later formalizing it and studying it more rigorously.
But if you're engaging with a large existing literature and everyone seems to be confused and talking past each other (which is how I'd characterize a significant fraction of the mesa-optimization literature, for example), then the time has come to make things more rigorous: you're unlikely to make much further progress without it.
This is a good point: adversarial examples in what I called the "main" ML system in the post can be desirable, even though we typically don't want them in the "helper" ML systems used to align the main system.
One downside to adversarial vulnerability of the main ML system is that it could be exploited by bad actors (whether humans or other, misaligned AIs). But this might be fine in some settings: e.g. you build some air-gapped system that helps you build the next, more robust and aligned AI. One could also imagine crafting adversarial-example backdoors that are cryptographically hard to discover if you don't know how they were constructed.
I generally expect that if adversarial robustness can be practically solved, then transformative AI systems will eventually self-improve to the point of being robust. So the window where AI systems are dangerous and deceptive enough that we need to test them using adversarial examples, but not capable enough to have overcome this, might be quite short. It could still be useful as an early-warning sign, though.