AI ALIGNMENT FORUM
AF

Dusto
Ω1000
Message
Dialogue
Subscribe

Stepped away to spend some time on AI safety (deception and red-teaming). Past lives in research, and government tech

Posts

Sorted by New

Wikitag Contributions

Comments

Sorted by
Newest
No wikitag contributions to display.
To be legible, evidence of misalignment probably has to be behavioral
Dusto3mo10

Honestly I think this can go one step further and require that the behavioural output be something of more substance than text of what the model says it would do, or chain of thought pondering. I'm thinking more of the lines of the honeypot work. I suspect we are interested in convincing similar people about AI safety and will likely get the "cool, come back to me when the models start actually doing bad things" responses if we aren't able to provide concrete examples of model capability.

Reply
No posts to display.