AI ALIGNMENT FORUM
AF

Julian Stastny
Ω156340
Message
Dialogue
Subscribe

member of technical staff @ redwood research

Posts

Sorted by New

Wikitag Contributions

Comments

Sorted by
Newest
Why "training against scheming" is hard
Julian Stastny26d10

For example, it could exploration hack during training:

  1. It could strategically avoid exploring high-reward regions such that they never get reinforced
  2. It could strategically explore in ways that don’t lead to general updates in behavior
  3. It could condition its behavior on the fact that it is in a training environment and not real deployment such that it learns “don’t scheme when observed” rather than “never scheme”

Nitpick: The third item is not an instance of exploration hacking.

Reply
Making deals with early schemers
Julian Stastny1mo10

I think it's interesting that @Lukas Finnveden seems to think

compared to what you emphasize in the post, I think that a larger fraction of the benefits may come from the information value of learning that the AIs are misaligned.

in contrast to this comment. It's not literally contradictory (we could have been more explicit about both possibilities), but I wonder if this indicates a disagreement between you and him.

Reply
Making deals with early schemers
Julian Stastny1mo10

Agreed. 

(In the post I tried to convey this by saying

(Though note that with AIs that have diminishing marginal returns to resources we don’t need to go close to our reservation price, and we can potentially make deals with these AIs even once they have a substantial chance to perform a takeover.)

in the subsection "A wide range of possible early schemers could benefit from deals".)

Reply
Making deals with early schemers
Julian Stastny1mo10

[Unimportant side-note] We did mention this (but not discuss extensively) in the bullet about convergence, thanks to your earlier google doc comment :)

We could also try to deliberately change major elements of training (e.g. data used) between training runs to reduce the chance that different generations of misaligned AIs have the same goals.

Reply
No wikitag contributions to display.
47Recent Redwood Research project proposals
6d
0
28Linkpost: Redwood Research reading list
10d
0
31What's worse, spies or schemers?
11d
2
31Two proposed projects on abstract analogies for scheming
16d
0
49Making deals with early schemers
1mo
29
49Misalignment and Strategic Underperformance: An Analysis of Sandbagging and Exploration Hacking
2mo
0
487+ tractable directions in AI control
3mo
0