AI ALIGNMENT FORUM

sage_bergerson

Comments

Introduction to inaccessible information
sage_bergerson · 4y

Things I'm confused about:

How can we verify the mechanism by which the model outputs ‘true’ representations of its processing?

Re ‘translation mechanism’: How could a model use language to describe its processing if that processing includes novel concepts, mechanisms, or objects for which there are no existing examples in human-written text? Can a model fully know what it is doing?

Supposing an AI were capable of at least explaining around, or gesturing towards, this processing in a meaningful way, would humans be able to interpret these explanations well enough for the descriptions to be useful?

Could a model that is checked for myopia deceptively present itself as myopic? How would you actually test for this?

Things I'm sort of speculating about:

Could you train a model so that it must provide a textual explanation for each policy update, and that explanation is validated before the update is actually applied, so that essentially every part of its policy is pre-screened and it can only learn to take interpretable actions? Writing this out, though, I see how it devolves into providing explanations that sound great but don't represent what is actually happening.
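
A minimal sketch of what I mean, where every name (propose_update, validate_explanation, the toy float "policy") is a hypothetical stand-in rather than any real training API:

```python
# A minimal sketch of the "explanation-gated update" idea. Everything here
# (propose_update, validate_explanation, the float "policy") is a hypothetical
# stand-in, not an existing training API.
import random
from dataclasses import dataclass


@dataclass
class PolicyUpdate:
    """A candidate policy change paired with the model's textual explanation."""
    delta: float        # stand-in for a change to policy parameters
    explanation: str    # model-generated description of that change


def propose_update(step: int) -> PolicyUpdate:
    """Stand-in for the model proposing a change plus an explanation of it."""
    delta = random.uniform(-1.0, 1.0)
    return PolicyUpdate(
        delta=delta,
        explanation=f"Step {step}: shift action preference by {delta:+.2f}.",
    )


def validate_explanation(update: PolicyUpdate) -> bool:
    """Stand-in for the pre-screen (human review, another model, etc.).
    The whole difficulty lives here: nothing in this check guarantees the
    explanation is faithful to what the update actually does."""
    return len(update.explanation) > 10


def train(num_steps: int = 5) -> float:
    policy = 0.0  # stand-in for policy parameters
    for step in range(num_steps):
        update = propose_update(step)
        if validate_explanation(update):
            policy += update.delta  # applied only after the explanation passes
        # rejected updates are simply discarded
    return policy


if __name__ == "__main__":
    print(f"Final toy policy parameter: {train():.2f}")
```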

Could it be fruitful to focus some effort on building quality, human-centric-ish world models (maybe even just starting with something like topography) into AI models to improve interpretability? That is, provide a base we are at least familiar with, off of which they can build in some way, as opposed to our having no insight into their world representation at all.
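
Very roughly, and with every detail below (the grid, the feature names, learned_embedding) made up purely for illustration, I'm picturing a fixed, human-readable base that the learned representation is forced to build on:

```python
# A toy illustration of a "familiar base" world model: state is routed through
# a fixed, human-readable topographic grid before any learned representation is
# built on top. The grid, the feature names, and learned_embedding are all
# hypothetical, purely for illustration.
from typing import Dict, List, Tuple

# Fixed, human-interpretable base: a tiny elevation map we can inspect directly.
TOPOGRAPHY: List[List[int]] = [
    [0, 1, 2],
    [1, 2, 3],
    [2, 3, 4],
]


def base_features(position: Tuple[int, int]) -> Dict[str, float]:
    """Human-readable features derived from the shared topographic base."""
    row, col = position
    return {
        "elevation": float(TOPOGRAPHY[row][col]),
        "row": float(row),
        "col": float(col),
    }


def learned_embedding(features: Dict[str, float]) -> List[float]:
    """Stand-in for whatever opaque representation the model learns on top;
    the point is that its inputs are features we can already read."""
    return [2.0 * features["elevation"], features["row"] - features["col"]]


if __name__ == "__main__":
    pos = (1, 2)
    readable = base_features(pos)         # interpretable layer we built in
    opaque = learned_embedding(readable)  # learned layer stacked on top of it
    print("interpretable base:", readable)
    print("learned layer on top:", opaque)
```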

Posts

Multi-Agent Inverse Reinforcement Learning: Suboptimal Demonstrations and Alternative Solution Concepts · 4y