Dusto — AI Alignment Forum

Honestly I think this can go one step further and require that the behavioural output be something of more substance than text of what the model says it would do, or chain of thought pondering. I'm thinking more of the lines of the honeypot work. I suspect we are interested in convincing similar people about AI safety and will likely get the "cool, come back to me when the models start actually doing bad things" responses if we aren't able to provide concrete examples of model capability.

AI ALIGNMENT FORUM
AF

AI ALIGNMENT FORUM
AF

Posts

Wikitag Contributions

Comments