Charbel-Raphaël — AI Alignment Forum

Charbel-Raphael Segerie

https://crsegerie.github.io/

Living in Paris

Thanks a lot!

it's the total cost that matters, and that is large

We think a relatively inexpensive method for day-to-day usage would be to use Sonnet to monitor Opus, or Gemini 2.5 Flash to monitor Pro. This would probably add only about 10% overhead. But we have not run this exact experiment; it would be follow-up work.

I'm going to collect here new papers that might be relevant:

https://x.com/bartoszcyw/status/1925220617256628587
Why Do Some Language Models Fake Alignment While Others Don’t? (link)

My response to the alignment / AI representatives proposals:

Even if AIs are "baseline aligned" to their creators, this doesn't automatically mean they are aligned with broader human flourishing or capable of compelling humans to coordinate against systemic risks. For an AI to effectively say, "You are messing up, please coordinate with other nations/groups, stop what you are doing" requires not just truthfulness but also immense persuasive power and, crucially, human receptiveness. Even if pausing AI were the correct thing to do, Claude is not going to suggest this to Dario for obvious reasons. As we've seen even with entirely human systems (e.g., the Trump administration and tariffs), possessing information or even offering correct advice doesn't guarantee it will be heeded or lead to effective collective action.

[...] "Politicians...will remain aware...able to change what the system is if it has obviously bad consequences." The climate change analogy is pertinent here. We have extensive scientific consensus, an "oracle IPCC report", detailing dire consequences, yet coordinated global action remains insufficient to meet the scale of the challenge. Political systems can be slow, captured by short-term interests, or unable to enact unpopular measures even when long-term risks are "obviously bad." The paper [gradual disempowerment] argues AI could further entrench these issues by providing powerful tools for influencing public opinion or creating economic dependencies that make change harder.

Extract copy pasted from a longer comment here.

While I concur that power concentration is a highly probable outcome, I believe complete disempowerment warrants deeper consideration, even under the assumptions you've laid out. Here are some thoughts on your specific points:

On Baseline Alignment: You suggest a baseline alignment where AIs are unlikely to engage in egregious lying or tampering (though you also flag 20% for scheming and 10% for unintentional egregious behavior even with prevention efforts, that’s already 30%-ish of risk). My concern is twofold:
- Sufficiency of "Baseline": Even if AIs are "baseline aligned" to their creators, this doesn't automatically mean they are aligned with broader human flourishing or capable of compelling humans to coordinate against systemic risks. For an AI to effectively say, "You are messing up, please coordinate with other nations/groups, stop what you are doing" requires not just truthfulness but also immense persuasive power and, crucially, human receptiveness. Even if pausing AI were the correct thing to do, Claude is not going to suggest this to Anthropic folks for obvious reasons. As we've seen even with entirely human systems (e.g., the Trump administration and tariffs), possessing information or even offering correct advice doesn't guarantee it will be heeded or lead to effective collective action.
- Erosion of Baseline: The pressures described in the paper could incentivise the development or deployment of AIs where even "baseline" alignment features are traded off for performance or competitive advantage. The "AI police" you mention might struggle to keep pace or be defunded/sidelined if it impedes perceived progress or economic gains. “Innovation first!”, “Drill, baby, drill”, “Plug, baby, plug”, as they say.
On "No strong AI rights before full alignment": You argue that productive AIs won't get human-like rights, especially strong property rights, before being robustly aligned, and that human ownership will persist.
- Indirect Agency: Formal "rights" might not be necessary for disempowerment. An AI, or a network of AIs, could exert considerable influence through human proxies or by managing assets nominally owned by humans who are effectively out of the loop or who benefit from this arrangement. An AI could operate through a human willing to provide access to a bank account and legal personhood, thereby bypassing the need for its own "rights."
On "No hot global war":
You express hope that we won't enter a situation where a humanity-destroying conflict seems plausible.
- Baseline Risk: While we all share this hope, current geopolitical forecasting (e.g., from various expert groups or prediction markets) often places the probability of major power conflict within the next few decades at non-trivial levels. For a war that causes more than 1M deaths, some estimates hover around 25%. (But your definition of “hot global war” is probably more demanding.)
- AI as an Accelerant: The dynamics described in the paper – nations racing for AI dominance, AI-driven economic shifts creating instability, AI influencing statecraft – could increase the likelihood of such a conflict.

Responding to your thoughts on why the feedback loops might be less likely if your three properties hold:

"Owners of capital will remain humans and will remain aware...able to change the user of that AI labor if they desire so."
- Awareness doesn't guarantee the will or ability to act against strong incentives. AGI development labs are pushing forward despite being aware of the risks, often citing competitive pressures ("If we don't, someone else will"). This "incentive trap" is precisely what could prevent even well-meaning owners of capital from halting a slide into disempowerment. They might say, "Stopping is impossible, it's the incentives, you know," even if their pDoom is 25% like Dario, or they might not give enough compute to their superalignment team.
"Politicians...will remain aware...able to change what the system is if it has obviously bad consequences."
- The climate change analogy is pertinent here. We have extensive scientific consensus, an "oracle IPCC report", detailing dire consequences, yet coordinated global action remains insufficient to meet the scale of the challenge. Political systems can be slow, captured by short-term interests, or unable to enact unpopular measures even when long-term risks are "obviously bad." The paper argues AI could further entrench these issues by providing powerful tools for influencing public opinion or creating economic dependencies that make change harder.
"Human consumers of culture will remain able to choose what culture they consume."
- You rightly worry about "brain-hacking." The challenge is that "obviously bad" might be a lagging indicator. If AI-generated content subtly shapes preferences and worldviews over time, the ability to recognise and resist this manipulation could diminish before the situation becomes critical. I think that people are going to LOVE AI, and might take the trade to go faster and be happy and disempowered, as some junior developers are beginning to do with Cursor.

As a meta point, the fact that the quantity and quality of discourse on this matter is so low, and the fact that people continue to say “LET’S GO WE ARE CREATING POWERFUL AIS, and don't worry, we plan to align them, even if we don't really know which type of alignment we really need, or whether this is even doable in time” while we have not rigorously assessed all those risks, is really not a good sign.

At the end of the day, my probability for something in the ballpark of gradual disempowerment / extreme power concentration and loss of democracy is 40%-ish, much higher than scheming (20%) leading to direct takeover (let’s say 10% post mitigation like control).

Thanks a lot for writing this, this is an important consideration, and it would be sweet if Anthropic updated accordingly.

Some remarks:

I'm still not convinced that deceptive AI resulting from scheming is the main risk compared to other risks (gradual disempowerment, concentration of power & value lock-in, a nice list of other risks from John).
"Should we give up on interpretability? No!" - I think this is at least a case for reducing the focus a bit, and more diversification of approaches
On the theories of impacts suggested:
- "A Layer of Swiss Cheese" - why not! This can make sense in DeepMind's plan, which was really good by the way.
- "Enhancing Black-Box Evaluations" - I think a better theory is interp to complement AI Control techniques. Example: Poser: Unmasking Alignment Faking LLMs by Manipulating Their Internals.
  - Maybe "Simple probes can catch sleeper agents" (Anthropic) could also be interesting, in the sense that the probe seems to generalize surprisingly well (I would really like to know if this generalizes to a model that was not trained to be harmful in the first place).
- "Debugging mysterious behaviour" - Might be interesting, might help marginally to get better understanding, but this is not very central for me.

OK, thanks a lot, this is much clearer. So basically most humans lose control, but some humans keep control.

And then we have this meta-stable equilibrium that might be sufficiently stable, where humans at the top are feeding the other humans with some kind of UBI.

Is this situation desirable? Are you happy with such a course of action?
Is this situation really stable?

For me, this is not really desirable - the power is probably going to be concentrated in 1-3 people, there is a huge potential for value lock-in, those CEOs become immortal, we potentially lose democracy (I don't see companies or US/China governments as particularly democratic right now), and the people at the top potentially become progressively corrupted, as is often the case. Hmm.

Then, is this situation really stable?

If alignment is solved and we have 1 human at the top - pretty much yes, even if revolutions/value drift of the ruler/craziness are somewhat possible at some point maybe?
If alignment is solved and we have multiple humans competing with their AIs - it depends a bit. It seems to me that we could apply the same reasoning as above, not at the level of organizations but at the level of countries: Just as Company B might outcompete Company A by ditching human workers, couldn't Nation B outcompete Nation A if Nation A dedicates significant resources to UBI while Nation B focuses purely on power? There is also a potential race to the bottom.
- And I'm not sure that cooperation and coordination in such a world would be so much improved: OK, even if the dictator listens to their aligned AI, we need a very strong notion of alignment to be able to affirm that all the AIs are going to advocate for "COOPERATE" in the prisoner's dilemma and that all the dictators are going to listen - but at the same time it's not that costly to cooperate, as you said (even if I'm not sure that energy, land, and rare resources are really that cheap to continue to provide for humans)

But at least I think I can now see how we could still live for a few more decades under the authority of a world dictator/pseudo-democracy, which was not clear to me beforehand.

Thanks for continuing to engage. I really appreciate this thread.

"Feeding humans" is a pretty low bar. If you want humans to live as comfortably as today, this would be more like 100% of GDP - modulo the fact that GDP is growing.

But more fundamentally, I'm not sure the correct way to discuss the resource allocation is to think at the civilization level rather than at the company level: Let's say that we have:

Company A that is composed of a human (price $5k/month) and 5 automated-humans (price of inference $5k/month let's say)
Company B that is composed of 10 automated-humans ($10k/month)

It seems to me that if you are an investor, you will give your money to B. It seems that in the long term, B is much more competitive, earns more money, and is able to reduce its prices; nobody buys from A, B invests this money into more automated-humans and crushes A, and A goes bankrupt. Even if alignment is solved, and the human listens to their AIs, it's hard to be competitive.
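To make the arithmetic of this toy comparison explicit, here is a minimal sketch. It assumes that the quoted inference prices are totals (so each automated-human costs about $1k/month) and that a human and an automated-human produce the same output; both assumptions, and all names in the code, are illustrative and not part of the original example.

```python
from dataclasses import dataclass

# Assumed unit prices, read off the toy example above.
HUMAN_COST = 5_000       # $/month per human
AUTOMATED_COST = 1_000   # $/month per automated-human (assumption: $5k for 5, $10k for 10)


@dataclass
class Company:
    name: str
    humans: int
    automated: int


def monthly_cost(c: Company) -> int:
    """Total monthly spend on labor and inference."""
    return c.humans * HUMAN_COST + c.automated * AUTOMATED_COST


def workers(c: Company) -> int:
    """Number of worker-equivalents, assuming equal output per worker."""
    return c.humans + c.automated


def cost_per_worker(c: Company) -> float:
    """Monthly cost per unit of output under the equal-output assumption."""
    return monthly_cost(c) / workers(c)


company_a = Company("A", humans=1, automated=5)
company_b = Company("B", humans=0, automated=10)

for c in (company_a, company_b):
    print(f"{c.name}: ${monthly_cost(c)}/month, {workers(c)} workers, "
          f"${cost_per_worker(c):.0f} per worker")
```

Under these assumptions both companies spend the same amount per month, but B gets 10 worker-equivalents against 6 for A, which is the competitive gap the argument relies on.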

a tiny fraction of resources toward physically keeping the humans alive (which is very cheap, at least once AIs are very powerful)

I'm not sure it's very cheap.

It seems to me that for the same amount of energy and land you need for a human, you could replace a lot more economically valuable work with AI.

Sure, at some point keeping humans alive is a negligible cost, but there's a transition period while it's still relatively expensive - and that's part of why a lot of people are going to be laid off - even if the company ends up getting super rich.

Well done - this is super important. I think this angle might also be quite easily pitchable to governments.
