Editor’s note: I’m experimenting with having a lower quality threshold for just posting things even while I’m still confused and unconfident about my conclusions, but with this disclaimer at the top. Thanks to Kyle and Laria for discussion.
One way we might hope to interpret LMs is to have them explain their thinking as a monologue, justification, or train of thought. In particular, by putting the explanation before the answer, we might hope to encourage the model to actually use the monologue to reach its conclusion, rather than coming up with the bottom line first. However, there are a bunch of ways this could go wrong. For instance:
I think it's possible that we simply get lucky and that GPTs of the relevant size just don't have these problems (in the same way that ELK could just be easy), but I'm not sure we want to bet on that. I also think lots of these problems correspond to problems with mechanistic interpretability, especially the ELK- and deception-related ones.
I don't think this is a particularly novel observation, but it's nice to have a reference to point to anyway.
UPDATE: After further discussion with Kyle, I've now been convinced that "generalizes human concepts correctly" (which is closer to the natural abstractions of alignment by default) and "has a compact direct translator" are subtly different, and that crucially it's possible for the former to be true while the latter is false (i.e., you could have a model whose latents are extremely obfuscated and therefore expensive to translate to the human ontology, yet which correctly generalizes human concepts).