I think you make some good points here, but there's an additional mechanism by which I believe internal-based techniques have the potential to make people intervene strongly on suspected misaligned behaviour.
The mechanism is this: in pre-superintelligent models, we establish a strong correlation between examples of misaligned AI behaviour that humans can understand and some model internal, e.g. a deception probe. If strong empirical evidence establishes that link (I'm not strongly confident it will, but I'm mildly optimistic), then as we move to superintelligent models I believe people will be more willing to act on evidence from model internals alone, especially once it exceeds some likelihood threshold.
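To make that concrete, here is a minimal sketch (all names and the probe setup are hypothetical, not a claim about how any lab does this): a linear "deception probe" fit on hidden activations from examples where the misaligned behaviour was human-verifiable, later used to flag new cases from internals alone once the probe's score crosses a chosen threshold.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_deception_probe(activations: np.ndarray, labels: np.ndarray) -> LogisticRegression:
    """Fit a linear probe on hidden activations.

    activations: (n_examples, hidden_dim) array of model internals
    labels: 1 = human-verified misaligned behaviour, 0 = benign
    """
    probe = LogisticRegression(max_iter=1000)
    probe.fit(activations, labels)
    return probe

def flag_from_internals(probe: LogisticRegression,
                        activations: np.ndarray,
                        threshold: float = 0.9) -> np.ndarray:
    """Flag examples from internals alone, with no behavioural evidence,
    whenever the probe's estimated likelihood exceeds the threshold."""
    scores = probe.predict_proba(activations)[:, 1]
    return scores >= threshold
```

The point of the sketch is just the decision structure: the probe earns its credibility on human-checkable cases, and the threshold is what lets people act later when only the internal signal is available.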
My reasoning here draws on present-day examples such as medical interventions taken on the basis of EEG data (electrical activity in the brain) even when no external behavioural signs are present (or ever become present), because there is enough evidence that certain patterns act as early warning signs for medical issues.
While there are obviously material differences in the 'cost' of these decisions, it does encourage me that people will place a high level of confidence in signals that aren't directly interpretable to humans once a statistical correlation with previously observed behaviour has been established.
I think this holds only where decision makers genuinely intend to detect misaligned behaviour accurately: without human-understandable behavioural examples, internals-based signals would be easier to dismiss if a decision maker were inclined to do so.