AI ALIGNMENT FORUM

artifex0
Adele Lopez's Shortform
artifex0 · 3mo

> This suggests that in order to ensure a sincere author-concept remains in control, the training data should carefully exclude any text written directly by a malicious agent (e.g. propaganda).

I don't think that would help much, unfortunately. Any accurate model of the world will also model malicious agents, even if the modeller only ever learns about them second-hand. So the concepts would still be there for the agent to use if it were motivated to do so.

Censoring anything written by malicious people would probably make it harder to learn about certain specific techniques of manipulation that aren't discussed much by non-malicious people and don't appear much in fiction. But I doubt that would be much more than a brief speed bump for a genuinely misaligned ASI, and it would probably come at the expense of useful capabilities in earlier models, like the ability to identify maliciousness, which would give an advantage to competitors.
