My guess is that at some point we'll transition away from this "first we train, then we deploy" paradigm to one where systems are continually learning on the job. Insofar as powerful AIs play a role in a multipolar scenario, I expect they'll be in this second paradigm. So in a sense they'll be learning from each other, though perhaps early in their training (i.e. prior to deployment) they were trained against copies of themselves or something. Unfortunately I doubt your case #1 will happen unless we advocate strongly for it. I think by the time these agents are this powerful, their code will be closely guarded. These are all just guesses, though; I think other scenarios are certainly plausible too.
Some off-the-cuff thoughts:
It seems plausible that transformative agents will be trained exclusively on real-world data (without using simulated environments) [EDIT: by "data" I mean to include the observation/reward signal from the real-world environment in an online RL setup]. Examples would be social media feed-creation algorithms and algo-trading algorithms. In such cases, the researchers don't choose how to implement the "other agents" (the other agents are just part of the real-world environment that the researchers don't control).
Focusing on agents that are trained in simulated environments that involve multiple agents: for a lab to use copies of other labs' agents, the labs will probably need to cooperate (or some other process involving additional actors would need to exist). In any case, using copies of the agent that is being trained (i.e. self-play) seems very plausible to me. (I think both AlphaZero and OpenAI Five were trained via self-play, and self-play is generally considered a very prominent technique for RL in simulated multi-agent environments.)
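To make "using copies of the agent that is being trained" concrete, here is a toy sketch of self-play. It is not how AlphaZero or OpenAI Five were trained; everything in it is an assumption chosen for illustration: the game (single-pile Nim, take 1 to 3 stones, whoever takes the last stone wins), the tabular Q-learning update, and the names `train_self_play` and `greedy_move`. The point is only that one shared value table plays both sides, so the "other agent" is always the current copy of the learner.

```python
import random


def legal_moves(pile):
    # A player may take 1, 2, or 3 stones, but never more than remain.
    return [a for a in (1, 2, 3) if a <= pile]


def train_self_play(max_pile=10, episodes=20000, alpha=0.5, epsilon=0.2, seed=0):
    """Tabular Q-learning where both players share one Q-table (self-play).

    q[(pile, action)] estimates the outcome for the player about to move:
    +1 for a forced win, -1 for a forced loss. All hyperparameters are
    illustrative assumptions.
    """
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in range(1, max_pile + 1) for a in legal_moves(s)}
    for _ in range(episodes):
        # Random starting pile so every state gets visited.
        pile = rng.randint(1, max_pile)
        while pile > 0:
            moves = legal_moves(pile)
            # Epsilon-greedy: the same policy picks moves for both sides.
            if rng.random() < epsilon:
                action = rng.choice(moves)
            else:
                action = max(moves, key=lambda a: q[(pile, a)])
            next_pile = pile - action
            if next_pile == 0:
                target = 1.0  # the mover took the last stone and wins
            else:
                # Zero-sum game: the opponent (a copy of this agent) moves
                # next, so its best value counts against the current mover.
                target = -max(q[(next_pile, a)] for a in legal_moves(next_pile))
            q[(pile, action)] += alpha * (target - q[(pile, action)])
            pile = next_pile
    return q


def greedy_move(q, pile):
    # Best move according to the learned table.
    return max(legal_moves(pile), key=lambda a: q[(pile, a)])


q_table = train_self_play()
```

In this toy game the known optimal strategy is to leave the opponent a multiple of 4 stones, so `greedy_move(q_table, pile)` can be checked against `pile % 4` for piles that are not already a multiple of 4. Training against another lab's agent would amount to swapping the opponent's move selection for a separate policy, which is exactly the part that requires access to that agent.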