Summary
When discussing the possibility that LLMs will cease to reason in transparent natural language with other AI safety researchers, we have sometimes noticed that we talk past each other: e.g., when discussing ‘neuralese reasoning’, some people have indefinitely long chains of recurrent activations in mind, while others think of an alien language in token-space. This post will propose terminology for distinguishing different types of hidden reasoning to create a shared language for such conversations.
In short, our taxonomy categorizes types of hidden reasoning as follows:
- Neuralese: Arbitrary-depth reasoning in high-dimensional latent space, enabled by recurrence.
- Hidden parallelized reasoning: Important steps are computed in activation-space in parallel without appearing in the chain-of-thought (CoT).
- Subcategories: partial verbalization, hidden utilization of test-time compute, epiphenomenal reasoning
- Encoded reasoning: The AI either encodes the CoT itself in such
...
>We assume that 𝐻 will be able to construct a successful exploit if and only if the attack policy is able to (because both are implemented by GPT-4).
I challenge this assumption.
Consider the code
'''
bool is_odd(int x) {
if (sha256(x) == 354) {
return true;
}
return x % 2 == 1;
}
'''
where sha256() computes the SHA-256 hash of the input. It is obviously possible for an attacker to supply numbers x and y such that sha256(x) = y, write y into the code and supply x when creating a successful expl... (read more)