Mau - AI Alignment Forum

I'm still pretty confused by "You get what you measure" being framed as a threat model distinct from power-seeking AI (rather than as a sub-threat model of it). I'll try to address two defenses of framing them as distinct, which I interpret this post as suggesting (in the context of this earlier comment on the overview post). Broadly, I'll be arguing that power-seeking AI is necessary for "you get what you measure" issues to pose existential threats, so "you get what you measure" concerns are best thought of as a sub-threat model of power-seeking AI.

(Edit: An aspect of "you get what you measure" concerns--the emphasis on something like "sufficiently strong optimization for some goal is very bad for different goals"--is a tweaked framing of power-seeking AI risk in general, rather than a subset.)

Lock-in: Once we’ve noticed problems, how difficult will they be to fix, and how much resistance will there be? For example, despite the clear harms of CO2 emissions, fossil fuels are such an indispensable part of the economy that it’s incredibly hard to get rid of them. A similar thing could happen if AI systems become an indispensable part of the economy, which seems pretty plausible given how incredibly useful human-level AI would be. As another example, imagine how hard it would be to ban social media, if we as a society decided that this was net bad for the world.

Unless I'm missing something, this is just an argument for why AI might get locked in, not an argument for why misaligned AI might get locked in. AI becoming an indispensable part of the economy isn't a long-term problem if people remain capable of identifying and fixing problems with the AI. So we still need an additional lock-in mechanism (e.g. the initially deployed, misaligned AI being power-seeking) to have trouble. If we're wondering how hard it will be to fix or improve non-power-seeking AI after it's been deployed, the difficulty of banning social media doesn't seem like a great analogy; a more relevant analogy would be the difficulty of fixing or improving social media after it's been deployed. Empirically, this doesn't seem that hard. For example, YouTube's recommendation algorithm started as a click-maximizer, and YouTube has already modified it to learn from human feedback.

See Sam Clarke’s excellent post for more discussion of examples of lock-in.

I don't think Sam Clarke's post (which I'm also a fan of) proposes any lock-in mechanisms that (a) would plausibly cause existential catastrophe from misaligned AI and (b) do not depend on AI being power-seeking. Clarke proposes five mechanisms by which Part 1 of "What Failure Looks Like" could get locked in. Addressing each of these in turn (in the context of his original post):

(1) short-term incentives and collective action -- arguably fails condition (a) or fails condition (b); if we don't assume AI will be power-seeking, then I see no reason why these difficulties would get much worse in hundreds of years than they are now, i.e. no reason why this on its own is a lock-in mechanism.
(2) regulatory capture -- the worry here is that the companies controlling AI might have and permanently act on bad values; this arguably fails condition (a), because if we're mainly worried about AI developers being bad, then focusing on intent alignment doesn't make that much sense.
(3) genuine ambiguity -- arguably fails condition (a) or fails condition (b); if we don't assume AI will be power-seeking, then I see no reason why these difficulties would get much worse in hundreds of years than they are now, i.e. no reason why this on its own is a lock-in mechanism.
(4) dependency and deskilling -- addressed above
(5) [AI] opposition to [humanity] taking back influence -- clearly fails condition (b)

So I think there remains no plausible alignment-relevant threat model for "You get what you measure" that doesn't fall under "power-seeking AI."

Warning Shots Probably Wouldn't Change The Picture Much

Mau2y25

I'd guess the very slow rate of nuclear proliferation has been much harder to achieve than banning gain-of-function research would be, since, in the absence of intervention, incentives to get nukes would have been much bigger than incentives to do gain-of-function research.

Also, on top of the taboo against chemical weapons, there was the verified destruction of most chemical weapons globally.

My Overview of the AI Alignment Landscape: Threat Models

Mau2y00
