I don't think that paper is an example of mesa optimization, because the policy could be implementing a very simple heuristic to solve the task, similar to:
Pick the image that led to the highest reward in the last 10 timesteps with 90% probability.
Pick an image at random with 10% probability.
So the policy doesn't have to have any of the properties of a mesa optimizer, like considering possible actions and evaluating them with a utility function, etc.
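To make the heuristic above concrete, here's a rough sketch of it as a hand-written policy. The class name, the `act`/`observe` interface, and the parameter names are all my own invention for illustration; this is not the paper's learned policy, just a demonstration that the behavior needs no search or utility evaluation.

```python
import random
from collections import deque


class RecentBestHeuristic:
    """Hypothetical hand-written policy (an assumption, not from the paper)."""

    def __init__(self, n_images, window=10, exploit_prob=0.9, rng=None):
        self.n_images = n_images
        self.exploit_prob = exploit_prob
        # Keep only the last `window` (image, reward) pairs.
        self.history = deque(maxlen=window)
        self.rng = rng or random.Random()

    def act(self):
        # With probability exploit_prob, repeat the image that gave the
        # highest reward within the recent window.
        if self.history and self.rng.random() < self.exploit_prob:
            best_image, _ = max(self.history, key=lambda pair: pair[1])
            return best_image
        # Otherwise (or if there is no history yet), pick uniformly at random.
        return self.rng.randrange(self.n_images)

    def observe(self, image, reward):
        # Record the outcome of the last choice.
        self.history.append((image, reward))
```

The only state here is a short buffer of recent rewards, which is the kind of thing a recurrent policy could plausibly store without doing any internal optimization.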
Whenever an RL agent is trained in a partially observed environment, it has to take actions to learn about parts of the environment that it hasn't observed yet or that may have changed. The difference in this paper is that the observation it gets from the environment happens to be the reward it received in the previous timestep. However, as far as the policy is concerned, the reward it gets as input is just another component of the state. So the fact that the policy gets the previous reward as input doesn't make it stand out from any other partially observed environment.
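As a small illustration of that point, here's how the previous reward could be folded into the policy's input. The function name and the flat-list observation format are assumptions of mine, not the paper's setup.

```python
def augment_observation(obs, prev_reward):
    """Append the previous reward to a flat observation vector.

    Hypothetical helper: from the policy's side, the result is just a
    slightly longer input vector with no special marker on the reward.
    """
    return list(obs) + [float(prev_reward)]
```

Nothing downstream of this function can tell which entry was "the reward" and which entries were ordinary observations.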
I think that the experiments are more likely to work the way you predict if the agent only has partial observability, meaning the agent only gets the 5x5 grid around it as the state. Of course you would have to use an LSTM for the agent so it can remember where it's been previously if you do this.
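Here's a minimal sketch of what I mean by the 5x5 view, assuming the environment is stored as a 2D list of cells. The function name, the padding symbol, and the grid representation are all made up for illustration and are not taken from your code.

```python
def local_view(grid, row, col, radius=2, pad="#"):
    """Return the (2*radius+1) x (2*radius+1) window centered on (row, col).

    Cells outside the grid are filled with `pad`. With radius=2 this gives
    the 5x5 partial observation.
    """
    n_rows, n_cols = len(grid), len(grid[0])
    view = []
    for r in range(row - radius, row + radius + 1):
        line = []
        for c in range(col - radius, col + radius + 1):
            if 0 <= r < n_rows and 0 <= c < n_cols:
                line.append(grid[r][c])
            else:
                line.append(pad)
        view.append(line)
    return view
```

The agent would then receive `local_view(grid, agent_row, agent_col)` at each step instead of the full grid.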
If the agent can see the full environment, it is easier for it to discover the optimal policy of going to the nearest key first, then going to the nearest chest. If the agent implements this policy, it will still maximize the true reward in the test environment.
However, if the agent can only see a 5x5 grid around it, it will have to explore to find keys and chests. In the training environment, the optimal policy will be to explore, pick up a key whenever it sees one, and open a chest whenever it has a key and sees a chest. I'm assuming that the training environment is set up so the agent can't get 2 keys without seeing a chest along the way. Therefore the policy of always picking up a key when it sees one will work great during training, because by the time the agent makes it to another key, it will already have used the first one to open the chest it saw.
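The policy I'm describing could look roughly like the sketch below. The cell symbols `"K"` and `"C"`, the goal names, and the choice to prioritize a visible chest over a visible key are my own assumptions; a trained agent might resolve that tie differently.

```python
def choose_goal(view, n_keys_held):
    """Pick a high-level goal from a local view (hypothetical sketch).

    `view` is a 2D list of cell symbols, where "K" marks a key and "C"
    marks a chest (both symbols are assumptions for illustration).
    """
    cells = [cell for row in view for cell in row]
    # Open a chest if one is visible and the agent holds at least one key.
    if n_keys_held > 0 and "C" in cells:
        return "open_chest"
    # Otherwise grab any visible key, regardless of how many are held.
    if "K" in cells:
        return "get_key"
    # Nothing useful in sight: keep exploring.
    return "explore"
```

Note that this rule never checks whether the agent already has enough keys, which costs nothing when keys are scarce and only shows up as a problem when keys are plentiful.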
Then in the test environment, where there are a lot of keys, the agent will probably keep picking up the keys it can see and not spend time looking for chests to open. But I'm guessing the agent will still open a chest if it sees one while it's picking up keys.
I think it would be interesting to get results for both the full observability and partial observability cases.