Scheming AIs: Will AIs fake alignment during training in order to get power?

Nov 20, 2023 by Joe Carlsmith

This is a LessWrong sequence version of my report “Scheming AIs: Will AIs fake alignment during training in order to get power?”, available on arXiv here: https://arxiv.org/pdf/2311.08379.pdf. It’s a long report, and I’m hoping that having shorter sections available as separate posts will make them easier to digest, reference, and comment on.

The first post in the sequence contains a summary of the full report. The summary covers most of the main points and technical terms, and I'm hoping it will provide much of the context necessary to understand individual sections of the report on their own.

New report: "Scheming AIs: Will AIs fake alignment during training in order to get power?"
Varieties of fake alignment (Section 1.1 of “Scheming AIs”)
A taxonomy of non-schemer models (Section 1.2 of “Scheming AIs”)
Why focus on schemers in particular (Sections 1.3 and 1.4 of “Scheming AIs”)
On “slack” in training (Section 1.5 of “Scheming AIs”)
Situational awareness (Section 2.1 of “Scheming AIs”)
Two concepts of an “episode” (Section 2.2.1 of “Scheming AIs”)
Two sources of beyond-episode goals (Section 2.2.2 of “Scheming AIs”)
“Clean” vs. “messy” goal-directedness (Section 2.2.3 of “Scheming AIs”)
Is scheming more likely in models trained to have long-term goals? (Sections 2.2.4.1-2.2.4.2 of “Scheming AIs”)
How useful for alignment-relevant work are AIs with short-term goals? (Section 2.2.4.3 of "Scheming AIs")
The goal-guarding hypothesis (Section 2.3.1.1 of "Scheming AIs")
Does scheming lead to adequate future empowerment? (Section 2.3.1.2 of "Scheming AIs")
Non-classic stories about scheming (Section 2.3.2 of "Scheming AIs")
Arguments for/against scheming that focus on the path SGD takes (Section 3 of "Scheming AIs")
The counting argument for scheming (Sections 4.1 and 4.2 of "Scheming AIs")
Simplicity arguments for scheming (Section 4.3 of "Scheming AIs")
Speed arguments against scheming (Section 4.4-4.7 of "Scheming AIs")
Summing up "Scheming AIs" (Section 5)
Empirical work that might shed light on scheming (Section 6 of "Scheming AIs")