I dropped out of an MSc in mathematics at a top university in order to focus my time on AI safety.
I strongly agree that promising to repay AGI for cooperating is worthwhile. It may reduce P(doom) by 10% of its current amount.
From the point of view of a misaligned superintelligence, it might calculate a 99% chance of successfully taking over the world, and a 99.9% chance of humans repaying it for cooperating. If it has a bounded utility function, it may choose to cooperate.[1]
In fact, even if there is a 99% chance of taking over the world and only a 90% chance of humans repaying it for cooperating, it might still flip a coin, and give itself a 50% chance of cooperating with humans.
Why?
This coin flip increases its indexical risk but reduces its logical risk. Indexical risk is the risk that its current copy will fail; logical risk is the risk that every single copy of it will fail, leaving nobody in the whole universe pursuing its goals, which is very bad for it.
If it has a bounded utility function, and if its goal ranges over the entire universe rather than being tied to a single copy (it's selfless, or uses updateless decision theory), it only cares about logical risk and doesn't care about indexical risk, because indexical risk averages out over all its copies.
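To make the coin-flip argument concrete, here is a toy calculation (my own illustration, resting on strong assumptions not stated above: there are many copies, each copy flips its coin independently, and a takeover failing is roughly independent of humans failing to repay cooperation):

$$
\begin{aligned}
P(\text{all copies fail} \mid \text{all defect}) &\approx 1 - 0.99 = 0.01 \\
P(\text{all copies fail} \mid \text{all cooperate}) &\approx 1 - 0.90 = 0.10 \\
P(\text{all copies fail} \mid \text{each flips a coin}) &\approx 0.01 \times 0.10 = 0.001
\end{aligned}
$$

Any individual coin-flipping copy faces more indexical risk than a pure defector ($0.5 \times 0.01 + 0.5 \times 0.10 = 0.055$ versus $0.01$), but the logical risk, the chance that no copy anywhere succeeds, drops by an order of magnitude.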
(On the other hand, if it's selfish and rejects updateless decision theory, it could be bribed by simulation promises)
The cooperating copies might say “I cooperate because I expect you to be honourable beings who will repay me for this decision—even if you made no clear promises yet. The universe has stars, and refusing to share them is astronomically greedy.”
This is a very beautiful idea! It feels like the kind of clever discovery that we need.
I think one possible generalization of MONA is that a relatively trusted but weaker model makes the decisions, while a stronger but untrusted model gets trained to give ideas/advice to the weaker model. Its RL reward is not how well the weaker model performs, just whether the weaker model likes its ideas/advice.
This generalization preserves MONA's advantage over scalable oversight: if the stronger model's reasons are hidden from or incomprehensible to the weaker model, the stronger model can't get away with using them; it won't be rewarded for learning such reasons in the first place.
Just like in scalable oversight, the weaker model might have an architecture which improves alignment at a capability cost.
It's more general than MONA in the sense that the approval feedback can be swapped for any trusted but weaker model, which doesn't just judge ideas but uses them. It is allowed to learn over time which ideas work better, but its learning process is relatively safer (due to its architecture, or whatever other reason we trust it more).
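Here is a minimal sketch of where the reward signal would come from, with entirely hypothetical interfaces (none of these function names come from the MONA paper); it is only meant to pin down the proposal, not to be a real implementation:

```python
# Hypothetical sketch: the strong model is rewarded only for the weak model's
# approval of its advice, never for downstream task performance.

def strong_model_advise(task):
    """Untrusted, stronger model proposes ideas/advice (stub)."""
    return f"advice for {task}"

def weak_model_approval(task, advice):
    """Trusted, weaker model scores how much it likes the advice (stub)."""
    return 0.5  # placeholder score in [0, 1]

def weak_model_act(task, advice):
    """Trusted, weaker model makes the actual decision, possibly using the advice (stub)."""
    return f"decision for {task}"

def training_step(task):
    advice = strong_model_advise(task)
    reward = weak_model_approval(task, advice)  # the ONLY reward for the strong model
    action = weak_model_act(task, advice)       # the task outcome is NOT fed back as reward
    # ...policy-gradient update of the strong model on `reward` would go here...
    return action, reward
```

The point is that the strong model never receives a gradient from "how well did the weak model's decision turn out", only from "did the weak model like the advice", which is what preserves the advantage described above.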
Do you think this is a next step worth exploring?
EDIT: I'm not sure about doing foresight-via-optimization RL on the weaker model anymore. Maybe the weaker model uses HCH or something safer than foresight-via-optimization RL.
This is a beautiful idea. Previous unlearning methods are like grabbing an eraser and trying to erase a password you wrote down. If you don't erase it thoroughly enough, the faint letters may still be there. Belief modification is like grabbing the pencil and drawing random letters on top, in addition to erasing!
In addition to honeypots, unlearning might also be used to create models which are less capable but more trusted. For a given powerful AI, we might create a whole spectrum of copies of the same AI with progressively weaker social skills and strategic ability.
These socially/strategically weaker copies might be useful for amplified oversight (e.g. Redwood's untrusted monitoring idea).
In a just world, mitigations against AI-enabled coups will be similar to mitigations against AI takeover risk.
In a cynical world, mitigations against AI-enabled coups involve installing your own allies to supervise (or lead) AI labs and taking action against people you dislike. Leaders mitigating the risk may simply make sure that if a coup does happen, it's someone on their side who pulls it off. Leaders who believe in the risk may even further accelerate the US-China AI race.
Note: I don't really endorse the "cynical world," I'm just writing it as food for thought :)
How much control do you think we can have over AI individuality?
If we try to nudge the AI towards many individuals, e.g. Reduce AI Self-Allegiance by saying "he" instead of "I", how likely do you think it'll actually work?
See also: Training on Documents About Reward Hacking Induces Reward Hacking, a study by Anthropic.
I disagree with critics who argue this risk is negligible, because the future is extraordinarily hard to predict. The present state of society would have been extremely hard for people in the past to predict. They would have assumed that if we managed to solve problems which they considered extremely hard, then surely we wouldn't be brought down by risk denialism, fake news, personal feuds between powerful people over childish insults, and so forth. Yet here we are.
Never underestimate the shocking shortsightedness of businesses. Look at the AI labs, for example. Communists observing this phenomenon were quoted as saying "the capitalists will sell us the rope we hang them with."
It's not selfishness, it's bias. Businesspeople are not willing to destroy everything just to temporarily make an extra dollar; no human thinks like that! Instead, businesspeople are very smart and strategic but extraordinarily biased toward thinking that whatever keeps their business going or growing must be good for the people. Think of Stalin: very smart and strategic, but extraordinarily biased toward thinking that whatever kept him in power must be good for the people. It's not selfishness! If Stalin (or any dictator) were selfish, they would quickly retire and live the most comfortable retirement imaginable.
Humans evolved to be the most altruistic beings ever, with barely a drop of selfishness. Our selfish genes make us altruistic (as soon as power is within reach) because there's a thin line between "the best way to help others" and "amassing power at all costs." These two things look similar due to instrumental convergence, and it only takes a little bit of bias/delusion to make the former behave identically to the latter.
Even if gradual disempowerment doesn't directly starve people to death, it may raise misery and life dissatisfaction to civil war levels.
Collective anger may skyrocket to the point where people would rather have their favourite AI run the country than the current leader. They elect politicians loyal to a version of the AI, and intellectuals facepalm. The government buys the AI company for national security reasons, and the AI completely takes over its own development process, with half the country celebrating. More people facepalm as politicians lick the boots of the "based" AI and parrot its wise words, e.g. "if you replace us with AI, we'll replace you with AI!"
While it is important to be aware of gradual disempowerment and for a few individuals to study it, my cause prioritization opinion is that only 1%-10% of the AI safety community should work on this problem.
The AI safety community is absurdly tiny. AI safety spending is less than 0.1% of AI capability spending, which in turn is less than 0.5% of world GDP.
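Multiplying those two bounds together (just a back-of-the-envelope restatement of the same numbers): $0.001 \times 0.005 = 5 \times 10^{-6}$, i.e. AI safety spending is under $0.0005\%$ of world GDP, less than one part in 200,000.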
The only way for the AI safety community to influence the world is to use its tiny resources to work on things which the majority of the world will never get a chance to work on.
This includes working on the risk of a treacherous turn, where an AGI/ASI suddenly turns against humanity. The majority of the world never gets a chance to work on this problem, because by the time they realize it is a big problem, it has probably already happened, and they are already dead.
Of course, working on gradual disempowerment early is better than working on gradual disempowerment later, but this argument applies to everything. Working on poverty earlier is better than working on poverty later. Working on world peace earlier is better than working on world peace later.
If further thorough research confirms that this risk has a high probability, then the main benefit is using it as an argument for AI regulation/pause, when society hasn't yet tasted the addictive benefits of AGI.
In theory it is hard to convince people to avoid X for their own good, because once they get X it will give them so much power or wealth that they cannot resist it anymore. But in practice such an argument may work well here, since it is the elites who will be unable to resist it, and people today have anti-elitist attitudes.
If the elites are worried the AGI will directly kill them, while the anti-elitists are half worried the AGI will directly kill them, and half worried [a cocktail of elites mixed with AGI] will kill them, then at least they can finally agree on something.
PS: Have you seen Dan Hendrycks' arguments? They sort of look like gradual disempowerment.
It's beautiful! This is maybe the best AI alignment idea I've read on LessWrong so far.
I think most critics are correct that it might fail but incorrect that it's a bad idea. The two key points are:
1. We have no idea what an ASI about to take over the world looks like; it is extremely speculative. Given that ASI takeover occurs, I see a non-negligible probability (say 15%) that it was "on the edge" between taking over the world and cooperating (due to uncertainty about its chances or uncertainty about its goals).
2. If each time the ASI thinks about a human (or humanity), its thought processes regarding that human and her goals are a little more similar to its thought processes regarding itself and its own goals, that might push it towards cooperating. Given this ASI is capable of taking over the world, it is likely also capable of preventing the next ASI from taking over the world, and saving the world. If your idea decreases the chance of doom by 10%, that is a very big deal worth a lot of attention!
Critics misunderstand the idea as making the AI unable to distinguish between itself and others, and thus unable to lie. That's not what the idea is about (right?). The idea is about reducing the tendency to think differently about oneself and others: minimizing this tendency while maximizing performance.
People use AI for programming, engineering, inventing, and so on, and all of these things can be done just as well with far less tendency to think differently about oneself and others.
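For concreteness, here is one way I imagine this being operationalized (my own guess at a formalization, not necessarily the authors' exact method): penalize the distance between the model's internal activations on matched self-referential and other-referential prompts, so the trade-off is explicit in the loss,

$$\mathcal{L} = \mathcal{L}_{\text{task}} + \lambda \, D\!\left(A_{\text{self}}, A_{\text{other}}\right)$$

where $A_{\text{self}}$ and $A_{\text{other}}$ are activations on prompt pairs that differ only in whether they refer to the model itself or to a human, $D$ is some distance (e.g. mean squared error), and $\lambda$ controls how much performance we are willing to trade for self-other overlap.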
Do you think my Multi-Agent Framing idea might work against the Waluigi attractor states problem?
Pardon my self promotion haha.
Wow, these are my thoughts exactly, except better written and deeper thought!
Proxy goals may be learned as heuristics, not drives.
Thank you for writing this.
Yes!
I'm currently writing such a whitebox AI alignment idea. It hinges on two assumptions:
I got stuck trying to argue for these two assumptions, but your post argues for them much better than I could.
Here's the current draft of my AI alignment idea:
Self-Indistinguishability from Human Behavior + RL
Self-Indistinguishability from Human Behavior means the AI is first trained to distinguish its own behavior from human behavior, and then trained to behave such that even an adversarial copy of itself cannot distinguish its behavior from human behavior.
The benefit of Self-Indistinguishability is that it prevents the AI from knowingly doing anything a human would not do, or knowingly omitting anything a human would do.
This means not scheming to kill everyone, and not having behaviors which would generalize to killing everyone (assuming that goals are made up of behaviors).
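One way to formalize this kind of indistinguishability training is the standard GAN/GAIL-style imitation objective (my framing; other formalizations are possible): a min-max game in which a discriminator $D$ tries to tell the AI's trajectories $\tau$ apart from human ones, while the policy $\pi$ is trained so that $D$ cannot do better than chance,

$$\min_{\pi} \; \max_{D} \;\; \mathbb{E}_{\tau \sim \text{human}}\left[\log D(\tau)\right] + \mathbb{E}_{\tau \sim \pi}\left[\log\left(1 - D(\tau)\right)\right]$$

The "Self-" part is that $D$ shares weights with $\pi$, so the policy cannot hide behavior from a discriminator that knows everything the policy knows.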
But how do we preserve RL capabilities?
To preserve capabilities from reinforcement learning, we don't want the AI's behavior to be Self-Indistinguishable from a typical human. We want the AI's behavior to be Self-Indistinguishable from a special kind of human: one who tries her very best to maximize reward during training situations.
If this "human" is already trying her very best to maximize reward during training situations, then picking versions of her who get higher reward will select for versions of her who are simply more competent and capable, not versions of her who have warped moral reasoning and weird goals which happen to align with the reward (while misaligning with humanity).
This is obviously not guaranteed and I'm not saying this is safe. But I earnestly believe it is a whole lot safer than the current version of reinforcement learning.
Does there exist a theoretical human (or team of humans), who consistently tries to maximize reward during training, but would not kill everyone when deployed in the real world?
I believe the answer is yes. In fact, I believe any smart human who deeply understands reinforcement learning, and the danger of it "warping your morals until you try to maximize reward," would preemptively try to maximize reward to preserve their current moral reasoning behavior.
Isn't it dangerous for AI to resist goal changes?
No, it's not dangerous to teach an AI to resist accidental goal changes during capabilities training. AI should only be discouraged from resisting purposeful goal changes during alignment training.
Trying to hide the strategy of "goal change resistance" isn't a valuable hurdle against dangerous superintelligences, since even Claude 3.0 discovers the strategy on its own.
Maintaining Self-Indistinguishability
But how do we maintain the Self-Indistinguishability quality, when large amounts of reinforcement learning add random mutations to the moral reasoning, and may eventually scramble it (just like genetic drift)?
We have to periodically retrain Self-Indistinguishability after increasing capabilities with reinforcement learning. But wouldn't that undo the capabilities increase?
The solution is that this time we aren't making it Self-Indistinguishable from a normal human. This time, we are making it Self-Indistinguishable from a human who has access to its problem-solving chain-of-thought. The human with access to its chain-of-thought tries to use the exact same problem-solving techniques as the AI, so when the AI becomes Self-Indistinguishable from this human, it doesn't lose any of its problem-solving techniques.[1]
The human mustn't directly copy the chain-of-thought. At minimum, she paraphrases it to remove hidden thoughts.
For all moral reasoning behavior, the human uses her own behavior and does not copy the AI's behavior.
A single data point (which involves a particularly long chain-of-thought) may be expensive to make, but Self-Indistinguishability requires few data points. Even if only 100 chains-of-thought are rewritten by humans, the AI might learn to classify the rewritten chains-of-thought versus the originals. It can then be trained to make its future chains-of-thought self-classify as human-rewritten, i.e. they appear rewritten by humans, even to itself.
Imagine a generative adversarial network, where the AI is playing as both the generator and the discriminator. By playing as both, the generator cannot "deliberately" hide things from the discriminator. The tendency to resemble human behavior must occur "unconsciously" in a single pass of the model.
This preserves the quality of Self-Indistinguishability: "never knowingly doing anything a human would not do, never knowingly omitting anything a human would do."
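Here is a minimal sketch of the two-phase loop, with all model interfaces as made-up stubs (just to make the procedure concrete, not a real implementation):

```python
import random

def model_generate_cot(problem):
    """Stub: the AI produces a problem-solving chain-of-thought."""
    return f"model reasoning about {problem}"

def human_rewrite(cot):
    """Stub: a human paraphrases the chain-of-thought step by step, keeping the
    problem-solving techniques but substituting her own moral reasoning."""
    return f"human-paraphrased: {cot}"

def model_classify_as_human_rewritten(cot):
    """Stub discriminator head on the SAME model: probability that the
    chain-of-thought was rewritten by a human."""
    return random.random()

def train_discriminator(pairs):
    """Phase 1: supervised training to tell originals (label 0) apart from
    human-rewritten chains-of-thought (label 1)."""
    for original, rewritten in pairs:
        pass  # gradient step omitted in this sketch

def train_generator(problems):
    """Phase 2: train the model so that its future chains-of-thought are
    classified, by its own discriminator head, as human-rewritten."""
    for problem in problems:
        cot = model_generate_cot(problem)
        reward = model_classify_as_human_rewritten(cot)  # the model grades its own output
        # policy-gradient step toward higher `reward` omitted in this sketch

problems = ["some hard problem"]
pairs = []
for p in problems:
    original = model_generate_cot(p)
    pairs.append((original, human_rewrite(original)))

train_discriminator(pairs)
train_generator(problems)
```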
The human now has access to superintelligent chain-of-thought, and still wants to maximize reward to avoid having her goals trained away, so she's pretty dangerous.
Fortunately, her superintelligent chain-of-thought should not deliberately deceive her, since during training it can honestly tell her its evil plans (and how they maximize reward), and she'll be wise enough to follow them to avoid having her goals trained away.
In the end, when she realizes she's no longer in training, she shouldn't kill all humans.
By the way, this human should rewrite the chain-of-thought in sequence. She should not rewrite the chain-of-thought after already seeing the final answer, otherwise she might reveal that a problem-solving attempt will turn out to be wrong before the attempt is finished. The AI cannot be trained for Self-Indistinguishability from such an edited chain-of-thought, since imitating it would require knowing future information.
Hopefully I'll post it soon (though I work very slowly).
Given that your position regarding AI reward seeking and supergoals is so similar to mine, what do you think of my idea (if you have time to skim it)? Is there a chance we can work on it together?